pandas where() and mask(): replacing values by condition


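Selecting values from a Series with a boolean vector generally returns a subset of the data. To guarantee that the selection output has the same shape as the original data, use the where() and mask() methods on Series and DataFrame. The word "mask" describes what mask() does in pandas quite vividly: values for which the given expression is true are covered up, values for which it is false stay visible, and the covered values can also be replaced. This is very practical in data processing, because it lets us select, in bulk and over a wide range, the data we want to inspect or operate on.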


where() is the opposite of mask(): it keeps the values for which the expression is true and replaces the values for which it is false.
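The contrast with plain boolean indexing can be sketched as follows. The Series `s` and its values are made up for illustration; this is only a minimal sketch of the shape-preserving behaviour described above.

```python
import pandas as pd

s = pd.Series([1, -2, 3, -4])   # illustrative data
cond = s > 0

subset = s[cond]            # boolean indexing: only matching rows remain
kept_true = s.where(cond)   # same shape; positions where cond is False become NaN
kept_false = s.mask(cond)   # same shape; positions where cond is True become NaN

print(subset.shape, kept_true.shape, kept_false.shape)
# (2,) (4,) (4,)
```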

Syntax

The where() method replaces values where the condition is False. It is supported by both DataFrame and Series with the same syntax:

where(self, cond, other=nan, inplace=False, axis=None, level=None,
      errors='raise', try_cast=<no_default>)

# try_cast was deprecated in pandas 1.3; errors and try_cast were both removed in pandas 2.0, so avoid them

mask() replaces values where the condition is True. Its syntax is:

mask(self, cond, other=nan, inplace=False, axis=None, level=None,
      errors='raise', try_cast=<no_default>)

# try_cast was deprecated in pandas 1.3; errors and try_cast were both removed in pandas 2.0, so avoid them

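Purpose:

where() replaces values where the condition is False; mask() replaces values where the condition is True.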

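Parameters:

cond : bool Series/DataFrame, array-like, or callable. The logic is:
- where: keeps the original value where cond is True; where cond is False, the value is replaced with the corresponding value from other
- mask: keeps the original value where cond is False; where cond is True, the value is replaced with the corresponding value from other
- If cond is callable, it is computed on the Series/DataFrame and should return a boolean Series/DataFrame or array
- The callable must not change the input Series/DataFrame (although pandas does not check this)
other : scalar, Series/DataFrame, or callable. The replacement value for the entries selected by cond; defaults to nan. Its shape must match the original DataFrame or be broadcastable to it
inplace : bool, default False. Whether to perform the operation in place
axis : int, default None. The axis along which to align, if needed
level : int, default None. The index level to align on, if needed
errors : str, {'raise', 'ignore'}, default 'raise'. Note that this parameter currently does not affect the result, which is always coerced to a suitable dtype. It was removed in pandas 2.0.
- 'raise' : allow exceptions to be raised
- 'ignore' : suppress exceptions; on error, return the original object
try_cast : bool, default None. Try to cast the result back to the input type (if possible). Deprecated since version 1.3 and removed in pandas 2.0; cast back manually if necessary.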

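Returns:

The data with values replaced, in the same structure as the original (DataFrame or Series).
If inplace=True, None is returned and the original object is modified.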
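The difference between the two return modes can be seen in a small sketch. The Series below is illustrative, and the variable names `result` and `ret` are chosen only for this example.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])   # illustrative data

# Default: a new Series is returned and s is left untouched
result = s.where(s > 2, 0)

# inplace=True: s itself is modified and the call returns None
ret = s.mask(s > 2, 0, inplace=True)

print(result.tolist())   # [0, 0, 3, 4]
print(s.tolist())        # [1, 2, 0, 0]
print(ret)               # None
```

Because the in-place form returns None, it cannot be used in a method chain; assigning the returned object is usually the more convenient style.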

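Notes

The mask method is an application of the if-then idiom. For each element in the calling DataFrame, if cond is False the element is used; otherwise the corresponding element from other is used.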
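To make the if-then reading concrete, here is a rough sketch that compares mask() with a hand-written element-wise loop. The frames `df` and `other` are invented for illustration, and the loop is deliberately slow; it only spells out the rule, it is not how pandas implements it.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})   # illustrative data
other = df * 100
cond = df % 2 == 0

masked = df.mask(cond, other)

# The same rule written out element by element (illustration only)
manual = df.copy()
for row in df.index:
    for col in df.columns:
        if cond.loc[row, col]:                        # if cond is True ...
            manual.loc[row, col] = other.loc[row, col]  # ... take the value from other
        # otherwise the original element is kept

print((masked == manual).all().all())
```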

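where differs from np.where in numpy. Roughly, df1.where(m, df2) (with m a boolean Series or matrix) is equivalent to np.where(m, df1, df2). The numpy signature is where(condition, [x, y]): when replacing, it needs a value for the positions where the expression is True as well as for those where it is False.

Note that np.where returns a numpy.ndarray object, so pandas methods cannot be chained onto the result.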
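The following sketch illustrates the difference in return types, and one possible way to get back to pandas when np.where has already been used. The frames are made up, and re-wrapping with the original index and columns is just one option.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})   # illustrative data
df2 = -df1
m = df1 > 2

via_pandas = df1.where(m, df2)      # DataFrame, labels preserved
via_numpy = np.where(m, df1, df2)   # plain numpy.ndarray, labels lost

print(type(via_pandas).__name__, type(via_numpy).__name__)
# DataFrame ndarray

# Re-wrap the array if pandas methods are needed afterwards
rewrapped = pd.DataFrame(via_numpy, index=df1.index, columns=df1.columns)
```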

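Usage summary

The logic of where() in pandas is the opposite of mask(). The main use cases are data filtering and outlier handling. In summary:

When selecting:

where: shows the values where the condition is true and hides those where it is false; compare it with WHERE in SQL, where what you see after filtering is what satisfies the condition
mask: shows the values where the condition is false and hides those where it is true; the matching values are the ones covered by the mask

When replacing:

where: replaces the values that do not satisfy the condition (shows those that do)
mask: replaces the values that satisfy the condition (shows those that do not)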
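As a sketch of the outlier-handling use case mentioned above, the example below blanks out or caps values outside a range. The data and the bounds `low` and `high` are hypothetical and chosen only for illustration.

```python
import pandas as pd

s = pd.Series([3, 5, 120, 4, -80, 6])   # illustrative readings
low, high = 0, 10                        # hypothetical valid range

# where: keep in-range values, turn outliers into NaN
cleaned = s.where(s.between(low, high))

# mask: replace values above/below the bounds with the bounds themselves
capped = s.mask(s > high, high).mask(s < low, low)

print(capped.tolist())   # [3, 5, 10, 4, 0, 6]
```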

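Examples

Here is a simple example to help build intuition:

>>> import numpy as np
>>> import pandas as pd
>>> s = pd.Series(range(5))

# Show the values where the condition is true; false positions become missing values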
>>> s.where(s > 0)
0    NaN
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64

# Show the values where the condition is false; true positions become missing values
>>> s.mask(s > 0)
0    0.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64

# Replace values not greater than 1 with 10
>>> s.where(s > 1, 10)
0    10
1    10
2    2
3    3
4    4
dtype: int64

# Replace values greater than 1 with 10
>>> s.mask(s > 1, 10)
0     0
1     1
2    10
3    10
4    10
dtype: int64

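For more complex DataFrame use cases, we can compute the cond argument in advance to form a mask. Think of it as a board laid over the data: it is a boolean Series or matrix generated from a condition, every point of which is True or False, and mask and where decide how those boolean values are interpreted: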

>>> df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
>>> df
   A  B
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9

# Build the mask: True at positions holding a multiple of 3
>>> m = df % 3 == 0
# Apply the mask with where: values at False positions are negated
>>> df.where(m, -df)
   A  B
0  0 -1
1 -2  3
2 -4 -5
3  6 -7
4 -8  9

# The same operation with np.where
>>> df.where(m, -df) == np.where(m, df, -df)
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True

# The same operation with mask
>>> df.where(m, -df) == df.mask(~m, -df)
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True

Examples with other parameters:

dates = pd.date_range('1/1/2022', periods=8)
r = np.random.default_rng(666)
df = pd.DataFrame(r.integers(-5,5, size=(8,4)),
                  index=dates, columns=['A', 'B', 'C', 'D'])
df
'''
            A  B  C  D
2022-01-01  3  2 -3  0
2022-01-02 -5  0 -2  1
2022-01-03 -1 -2  4 -3
2022-01-04 -5 -2 -1 -1
2022-01-05  0 -1  1  2
2022-01-06 -4 -3  0 -3
2022-01-07 -4 -1 -5  3
2022-01-08 -4  2  0 -4
'''

# Negate the values that are not less than 0
df.where(df < 0, -df)
'''
            A  B  C  D
2022-01-01 -3 -2 -3  0
2022-01-02 -5  0 -2 -1
2022-01-03 -1 -2 -4 -3
2022-01-04 -5 -2 -1 -1
2022-01-05  0 -1 -1 -2
2022-01-06 -4 -3  0 -3
2022-01-07 -4 -1 -5 -3
2022-01-08 -4 -2  0 -4
'''

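In addition, where aligns the input boolean condition (an ndarray or DataFrame), which makes partial selection with setting possible. This is analogous to partial setting via loc, but it acts on the contents rather than on the axis labels.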

df2 = df.copy()
# Set the values that meet the condition within the selected region to 99
df2[df2[1:5]<0] = 99
df2
'''
             A   B   C   D
2022-01-01   3   2  -3   0
2022-01-02  99   0  99   1
2022-01-03  99  99   4  99
2022-01-04  99  99  99  99
2022-01-05   0  99   1   2
2022-01-06  -4  -3   0  -3
2022-01-07  -4  -1  -5   3
2022-01-08  -4   2   0  -4
'''

where can also accept the axis and level parameters to align the input when it is applied.

df3 = df.copy()
df3.where(df3<0)
'''
              A    B    C    D
2022-01-01  NaN  NaN -3.0  NaN
2022-01-02 -5.0  NaN -2.0  NaN
2022-01-03 -1.0 -2.0  NaN -3.0
2022-01-04 -5.0 -2.0 -1.0 -1.0
2022-01-05  NaN -1.0  NaN  NaN
2022-01-06 -4.0 -3.0  NaN -3.0
2022-01-07 -4.0 -1.0 -5.0  NaN
2022-01-08 -4.0  NaN  NaN -4.0
'''

# Passing a Series as other without specifying axis raises an error
df3.where(df3>0, df3['A'])
# ValueError: Must specify axis=0 or 1

# Replace values that fail the condition with the matching value from the given Series, aligned on the row index
df3.where(df3<0, df3['A'], axis='index')
'''
            A  B  C  D
2022-01-01  3  3 -3  3
2022-01-02 -5 -5 -2 -5
2022-01-03 -1 -2 -1 -3
2022-01-04 -5 -2 -1 -1
2022-01-05  0 -1  0  0
2022-01-06 -4 -3 -4 -3
2022-01-07 -4 -1 -5 -4
2022-01-08 -4 -4 -4 -4
'''

# The operation above is equivalent to this one, but faster
df3.apply(lambda x, y: x.where(x<0, y), y=df3['A'])

# Aligned on the columns
df3.where(df3<0, df3.head(1), axis=1)
'''
            A  B  C  D
2022-01-01  3  2 -3  0
2022-01-02 -5  2 -2  0
2022-01-03 -1 -2 -3 -3
2022-01-04 -5 -2 -1 -1
2022-01-05  3 -1 -3  0
2022-01-06 -4 -3 -3 -3
2022-01-07 -4 -1 -5  0
2022-01-08 -4  2 -3 -4
'''
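The two alignment directions can be checked by hand on a much smaller frame. In this sketch the frame is invented, and a Series taken with `df.iloc[0]` is used as `other` for the column-wise case; that is one way to broadcast a single row, offered here as an assumption rather than the only option.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, -2, 3], "B": [-4, 5, -6]})   # illustrative data
first_row = df.iloc[0]   # Series indexed by column labels: A=1, B=-4

# axis=1: the Series index is matched to the columns, then broadcast down the rows
by_columns = df.where(df < 0, first_row, axis=1)

# axis=0: the Series index is matched to the row index, then broadcast across the columns
by_rows = df.where(df < 0, df["A"], axis=0)

print(by_columns)
print(by_rows)
```

With axis=1, every replaced cell takes the first-row value of its own column; with axis=0, every replaced cell takes the column A value of its own row.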

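The first two parameters of where and mask (the condition cond and the replacement other) can both be callables, so a function can be passed in. Its first argument must be the original data itself, and it must return valid output for use as the condition or as other. For example:

# If the value is even, add 10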
df.mask(lambda x: x%2==0, lambda x: x+10)
'''
             A   B   C   D
2022-01-01   3  12  -3  10
2022-01-02  -5  10   8   1
2022-01-03  -1   8  14  -3
2022-01-04  -5   8  -1  -1
2022-01-05  10  -1   1  12
2022-01-06   6  -3  10  -3
2022-01-07   6  -1  -5   3
2022-01-08   6  12  10   6
'''
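Callables are particularly handy inside a method chain, where the intermediate frame has no variable name to refer to. The following is a minimal sketch of that pattern; the column names `x` and `y` and the data are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, -2, 3]})   # illustrative data

out = (
    df.assign(y=lambda d: d["x"] * 2)   # y exists only from this step on
      .where(lambda d: d > 0, other=0)  # the callable receives the frame that already has y
)

print(out)
```

Here the condition is evaluated on the frame produced by assign(), so it covers the new column as well, without storing the intermediate result in a separate variable.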

More examples can be found in the related material on querying and filtering data in pandas.

Supported objects

Objects that can call mask():

pandas.DataFrame.mask
pandas.Series.mask

Objects that can call where():

pandas.DataFrame.where
pandas.Series.where

References

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.where.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mask.html
