pandas text search and matching

pandas is also strong at searching and matching text: regular expressions handle complex patterns, and you can choose what data is returned for each match.

Searching with findall

Use a regular expression to find every occurrence of the given text in each element:

import pandas as pd
import numpy as np
s = pd.Series(['Lion', 'Monkey', 'Rabbit'])
s.str.findall('Monkey')
'''
0          []
1    [Monkey]
2          []
dtype: object
'''
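# Case-sensitive by default, so this finds nothing
s.str.findall('MONKEY')
# Ignore case
import re
s.str.findall('MONKEY', flags=re.IGNORECASE)
# Contains 'on'
s.str.findall('on')
# Ends with 'on'
s.str.findall('on$')
# Multiple matches in one string are collected into a single list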
s.str.findall('b')
'''
0        []
1        []
2    [b, b]
dtype: object
'''
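Because findall returns one list per element, the number of matches can be read off from the list lengths. The following is a minimal sketch of that idea, compared against .str.count, which counts pattern occurrences directly; the variable names are illustrative only.

```python
import pandas as pd

s = pd.Series(['Lion', 'Monkey', 'Rabbit'])

# Length of each list returned by findall = number of matches in that string
n_found = s.str.findall('b').str.len()

# str.count gives the same numbers without building the lists
n_count = s.str.count('b')

print(n_found.tolist())  # [0, 0, 2]
print(n_count.tolist())  # [0, 0, 2]
```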

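str.find returns the position of the first occurrence of a substring (counting from 0), or -1 when there is no match. Unlike findall, it takes a literal substring rather than a regular expression: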

s.str.find('Monkey')
'''
0   -1
1    0
2   -1
dtype: int64
'''
s.str.find('on')
'''
0    2
1    1
2   -1
dtype: int64
'''

There is also .str.rfind, which searches from the right and returns the highest matching position.
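As a small sketch of the difference, the example below reuses the same three animal names and looks for the letter b from both directions. It also passes a regex-looking string to find to show that it is treated as literal text.

```python
import pandas as pd

s = pd.Series(['Lion', 'Monkey', 'Rabbit'])

# find reports the first (leftmost) position, rfind the last (rightmost)
first_b = s.str.find('b')
last_b = s.str.rfind('b')

# find does not interpret regular expressions: 'on$' is searched literally
literal = s.str.find('on$')

print(first_b.tolist())  # [-1, -1, 2]
print(last_b.tolist())   # [-1, -1, 3]
print(literal.tolist())  # [-1, -1, -1]
```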

Checking containment with contains

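Tests whether each string contains a pattern, which is often used to filter data. The pattern is treated as a regular expression by default; pass regex=False to turn that off. The na parameter sets the value returned for missing entries, for example na=False.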

s1 = pd.Series(['Mouse', 'dog', 'house and parrot', '23', np.nan])
s1.str.contains('og', regex=False)
'''
0    False
1     True
2    False
3    False
4      NaN
dtype: object
'''
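The NaN in that result cannot be used directly as a boolean mask. A minimal sketch of the usual workaround is to pass na=False so that missing entries count as "does not contain"; the regex pattern 'house|dog' is just an illustration.

```python
import numpy as np
import pandas as pd

s1 = pd.Series(['Mouse', 'dog', 'house and parrot', '23', np.nan])

# na=False turns the missing entry into False, giving a clean boolean mask
mask = s1.str.contains('og', regex=False, na=False)

# With the default regex=True, '|' means "either pattern"
either = s1.str.contains('house|dog', na=False)

print(mask.tolist())      # [False, True, False, False, False]
print(s1[mask].tolist())  # ['dog']
print(either.tolist())    # [False, True, True, False, False]
```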

It can be used to filter rows:

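# df is assumed to be a DataFrame with a string column called name
# Names containing the letter A
df.loc[df.name.str.contains('A')]
# Containing A or C
df.loc[df.name.str.contains('A|C')]
# Ignore case
import re
df.loc[df.name.str.contains('A|C', flags=re.IGNORECASE)]
# Containing a digit (raw string so \d is not read as a string escape)
df.loc[df.name.str.contains(r'\d')]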
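To make these filters concrete, here is a self-contained sketch. The DataFrame and its contents are made up for illustration; only the name column is taken from the snippets above, and df['name'] is used as an equivalent spelling of df.name.

```python
import re
import pandas as pd

# Hypothetical data, invented for this example
df = pd.DataFrame({
    'name': ['Alice', 'bob', 'Carl', 'dan7'],
    'score': [90, 85, 70, 60],
})

has_a = df.loc[df['name'].str.contains('A')]
a_or_c = df.loc[df['name'].str.contains('A|C')]
any_case = df.loc[df['name'].str.contains('A|C', flags=re.IGNORECASE)]
has_digit = df.loc[df['name'].str.contains(r'\d')]

print(has_a['name'].tolist())      # ['Alice']
print(a_or_c['name'].tolist())     # ['Alice', 'Carl']
print(any_case['name'].tolist())   # ['Alice', 'Carl', 'dan7']
print(has_digit['name'].tolist())  # ['dan7']
```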

In addition, .str.startswith and .str.endswith test whether each string begins or ends with the given text:

s = pd.Series(['bat', 'Bear', 'cat', np.nan])
s.str.startswith('b')
# Handling missing values
s.str.startswith('b', na=False)
s.str.endswith('t')
s.str.endswith('t', na=False)
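One point worth checking is that startswith and endswith compare literal text and do not accept regular expressions. The sketch below contrasts a literal prefix test with an anchored regex passed to contains; the pattern '^[bc]' is an illustrative choice.

```python
import numpy as np
import pandas as pd

s = pd.Series(['bat', 'Bear', 'cat', np.nan])

# Literal comparison: looks for strings that start with the text 'b|c'
literal = s.str.startswith('b|c', na=False)

# Regex alternative: '^' anchors the character class at the start
pattern = s.str.contains(r'^[bc]', na=False)

print(literal.tolist())  # [False, False, False, False]
print(pattern.tolist())  # [True, False, True, False]
```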

Matching with match

Determines whether each string matches a regular expression, with the match anchored at the start of the string.

pd.Series(['1', '2', '3a', '3b', '03c'],
          dtype="string").str.match(r'[0-9][a-z]')
'''
0    False
1    False
2     True
3     True
4    False
dtype: boolean
'''

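With contains, the last value would be True, because contains looks for the pattern anywhere in the string rather than only at the start.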
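The sketch below puts the three anchoring behaviours side by side: match (start of string), contains (anywhere), and fullmatch (entire string). The sample values are chosen for illustration so that each method gives a different answer.

```python
import pandas as pd

s = pd.Series(['3a', '3ab', '03c'], dtype="string")
pat = r'[0-9][a-z]'

at_start = s.str.match(pat)      # pattern must begin at position 0
anywhere = s.str.contains(pat)   # pattern may appear at any position
whole = s.str.fullmatch(pat)     # pattern must cover the whole string

print(at_start.tolist())  # [True, True, False]
print(anywhere.tolist())  # [True, True, True]
print(whole.tolist())     # [True, False, False]
```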

Extracting with extract

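.str.extract uses a regular expression to pull data out of text into separate columns. In the example below the pattern has two groups: the first matches the letter a or b, and the second matches a digit, giving two columns. Because c3 does not match, its row is missing in both columns.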

(pd.Series(['a1', 'b2', 'c3'],
          dtype="string")
 .str
 .extract(r'([ab])(\d)', expand=True)
)
'''
      0     1
0     a     1
1     b     2
2  <NA>  <NA>
'''

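If expand is True, a DataFrame is always returned, whether the pattern has one group or several. If it is False, a pattern with a single group returns a Series or Index, while a pattern with several groups still returns a DataFrame.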
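A quick way to see the effect of expand is to run the same single-group pattern both ways and inspect the result types. This is a minimal sketch; the pattern '[ab](\d)' is a one-group variant chosen for illustration.

```python
import pandas as pd

s = pd.Series(['a1', 'b2', 'c3'], dtype="string")

# One capture group, expand=True: a DataFrame with a single column
as_frame = s.str.extract(r'[ab](\d)', expand=True)

# One capture group, expand=False: a Series
as_series = s.str.extract(r'[ab](\d)', expand=False)

print(type(as_frame).__name__, as_frame.shape)  # DataFrame (3, 1)
print(type(as_series).__name__)                 # Series
print(as_series.isna().tolist())                # [False, False, True]
```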

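# Redefine s with the default object dtype; the first group is optional here
s = pd.Series(['a1', 'b2', 'c3'])
s.str.extract(r'([ab])?(\d)')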
'''
     0  1
0    a  1
1    b  2
2  NaN  3
'''
# Named groups become the column names
s.str.extract(r'(?P<letter>[ab])(?P<digit>\d)')
'''
  letter digit
0      a     1
1      b     2
2    NaN   NaN
'''

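.str.extractall returns every match in each string rather than only the first, and the result is indexed by a MultiIndex whose inner level, match, numbers the matches: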

s = pd.Series(["a1a2", "b1b7", "c1"],
              index=["A", "B", "C"],
              dtype="string")
two_groups = '(?P<letter>[a-z])(?P<digit>[0-9])'
s.str.extract(two_groups, expand=True)  # first match only
s.str.extractall(two_groups)
'''
        letter digit
  match
A 0          a     1
  1          a     2
B 0          b     1
  1          b     7
C 0          c     1
'''
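The MultiIndex makes it straightforward to summarize the matches per original row. The following sketch, under the same sample data, counts the matches for each row and selects only the first match of each; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series(["a1a2", "b1b7", "c1"],
              index=["A", "B", "C"],
              dtype="string")
two_groups = r'(?P<letter>[a-z])(?P<digit>[0-9])'
all_matches = s.str.extractall(two_groups)

# Number of matches found in each original row (outer index level)
counts = all_matches.groupby(level=0).size()

# Keep only match number 0 for each row; the 'match' level is dropped
first = all_matches.xs(0, level='match')

print(counts.to_dict())          # {'A': 2, 'B': 2, 'C': 1}
print(first['letter'].tolist())  # ['a', 'b', 'c']
```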

Extracting dummy variables

Dummy (indicator) variables can be extracted from a string column, for example when the values are separated by "|":

s = pd.Series(['a', 'a|b', np.nan, 'a|c'],
              dtype="string")
s.str.get_dummies(sep='|')
'''
   a  b  c
0  1  0  0
1  1  1  0
2  0  0  0
3  1  0  1
'''
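In practice the indicator columns are often attached back to the table they came from. The sketch below assumes a hypothetical DataFrame with an id column and a tags column (both invented for this example) and joins the dummies on the shared index.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; 'id' and 'tags' are made-up column names
df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'tags': ['a', 'a|b', np.nan, 'a|c'],
})

# One indicator column per distinct tag; the missing row becomes all zeros
dummies = df['tags'].str.get_dummies(sep='|')

# Attach the indicator columns next to the original data
result = df.join(dummies)

print(result.columns.tolist())  # ['id', 'tags', 'a', 'b', 'c']
print(dummies.sum().tolist())   # [3, 1, 1]
```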

The same operation also works on an Index:

idx = pd.Index(['a', 'a|b', np.nan, 'a|c'])
idx.str.get_dummies(sep='|')
'''
MultiIndex([(1, 0, 0),
            (1, 1, 0),
            (0, 0, 0),
            (1, 0, 1)],
           names=['a', 'b', 'c'])
'''