Data replacement in pandas

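This page covers data replacement in pandas, including replacing numbers, text, and missing values. Replacement is common in data cleaning, enum conversion, and data correction. Series.replace() and DataFrame.replace() provide an efficient and flexible way to do it. For more detailed usage of replace(), see the dedicated page on replace().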

Replacing specific values

The following replaces 0 with 5 in a Series:

import pandas as pd

ser = pd.Series([0., 1., 2., 3., 4.])
ser.replace(0, 5)
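As a quick check of the call above, the sketch below prints both the result and the original Series. The name `result` is introduced here for illustration only; the point is that replace() returns a new object and leaves `ser` untouched unless inplace=True is passed.

```python
import pandas as pd

ser = pd.Series([0., 1., 2., 3., 4.])

# replace() returns a new Series; the original is not modified
result = ser.replace(0, 5)

print(result.tolist())  # [5.0, 1.0, 2.0, 3.0, 4.0]
print(ser.tolist())     # [0.0, 1.0, 2.0, 3.0, 4.0]
```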

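Batch replacement is also possible:

# Replace element by element: 0 -> 4, 1 -> 3, and so on
ser.replace([0, 1, 2, 3, 4], [4, 3, 2, 1, 0])
# Use a dict to map old values to new ones
ser.replace({0: 10, 1: 100})
# Replace 0 in column a and 5 in column b with 100
# (df is assumed to be a DataFrame with columns 'a' and 'b')
df.replace({'a': 0, 'b': 5}, 100)
# Per-column replacement rules
df.replace({'a': {0: 100, 4: 400}})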
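The DataFrame calls above do not define `df`, so here is a minimal sketch with made-up sample data (the frame contents and the names `out1` and `out2` are assumptions for illustration). It shows the difference between a flat dict, which pairs each column with a value to look for, and a nested dict, which carries its own replacement values per column.

```python
import pandas as pd

# Hypothetical sample data with the two columns used in the text
df = pd.DataFrame({'a': [0, 1, 2, 3, 4], 'b': [5, 6, 7, 8, 9]})

# Flat dict: 0 in column a and 5 in column b both become 100
out1 = df.replace({'a': 0, 'b': 5}, 100)
print(out1['a'].tolist())  # [100, 1, 2, 3, 4]
print(out1['b'].tolist())  # [100, 6, 7, 8, 9]

# Nested dict: only column a is touched, 0 -> 100 and 4 -> 400
out2 = df.replace({'a': {0: 100, 4: 400}})
print(out2['a'].tolist())  # [100, 1, 2, 3, 400]
print(out2['b'].tolist())  # [5, 6, 7, 8, 9]
```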

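Replacing with a fill method

Besides giving explicit replacement values, replace() could also fill from neighboring values. Note that the method keyword was deprecated in pandas 2.1, so these calls emit a warning or fail on newer versions:

# Replace 1, 2 and 3 with the value that precedes them
ser.replace([1, 2, 3], method='pad') # 'ffill' is a synonym
# Replace 1, 2 and 3 with the value that follows them
ser.replace([1, 2, 3], method='bfill')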

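If a value listed for replacement does not occur in the data, nothing happens and no error is raised. The replacements above also work on string data.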
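Since the method keyword of replace() is deprecated as of pandas 2.1, the pad and bfill replacements can be approximated by masking the target values and then filling. This is a sketch of one possible substitute, not the library's own recommendation; the names `targets`, `masked`, `padded`, and `backfilled` are made up here. One caveat: NaN values already present in the data would be filled as well.

```python
import pandas as pd

ser = pd.Series([0., 1., 2., 3., 4.])
targets = [1, 2, 3]

# Turn every target value into NaN, leaving the rest untouched
masked = ser.mask(ser.isin(targets))

# Forward fill: targets take the last preceding non-target value
padded = masked.ffill()
print(padded.tolist())      # [0.0, 0.0, 0.0, 0.0, 4.0]

# Backward fill: targets take the next following non-target value
backfilled = masked.bfill()
print(backfilled.tolist())  # [0.0, 4.0, 4.0, 4.0, 4.0]
```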

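String replacement

For more complex string patterns, use regular expressions, which are disabled by default (regex=False):

# Replace the value 'bat' with 'new' (the whole cell must match)
df.replace(to_replace='bat', value='new')
# Use a regex to replace three-character values starting with 'ba' with 'new'
df.replace(to_replace=r'^ba.$', value='new', regex=True)
# When columns need different rules, pass per-column dicts in this form
df.replace({'A': r'^ba.$'}, {'A': 'new'}, regex=True)
# Replace matches of several patterns with the same value
df.replace(regex=[r'^ba.$', 'foo'], value='new')
# Pass several patterns together with their replacements
df.replace(regex={r'^ba.$': 'new', 'foo': 'xyz'})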
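To make the difference between plain and regex matching concrete, here is a small sketch on made-up sample data (the frame and the names `plain`, `by_regex`, and `mapped` are assumptions for illustration). Without regex=True a string must equal the whole cell value, so 'ba' alone matches nothing.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({'A': ['bat', 'foo', 'bait'],
                   'B': ['abc', 'bar', 'xyz']})

# Plain matching compares whole cell values, so 'ba' matches nothing
plain = df.replace(to_replace='ba', value='new')
print(plain.equals(df))      # True

# Regex: exactly three characters starting with 'ba' ('bait' is too long)
by_regex = df.replace(to_replace=r'^ba.$', value='new', regex=True)
print(by_regex['A'].tolist())  # ['new', 'foo', 'bait']
print(by_regex['B'].tolist())  # ['abc', 'new', 'xyz']

# Several patterns, each with its own replacement
mapped = df.replace(regex={r'^ba.$': 'new', 'foo': 'xyz'})
print(mapped['A'].tolist())    # ['new', 'xyz', 'bait']
```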

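Replacing missing-value placeholders

Replacement also helps with missing data: invalid placeholder values can first be replaced with NaN and then handled with the usual missing-value tools:

import numpy as np

d = {'a': list(range(4)),
     'b': list('ab..'),
     'c': ['a', 'b', np.nan, 'd']
    }
df = pd.DataFrame(d)
# Replace '.' with NaN
df.replace('.', np.nan)
# With a regex: replace a dot plus any surrounding whitespace with NaN
df.replace(r'\s*\.\s*', np.nan, regex=True)
# Pairwise replacement: 'a' becomes 'b', '.' becomes NaN
df.replace(['a', '.'], ['b', np.nan])
# Dot becomes 'dot'; 'a' becomes 'astuff' (\1 refers to the first capture group)
df.replace([r'\.', r'(a)'], ['dot', r'\1stuff'], regex=True)
# Only in column b, replace '.' with NaN; more columns can be listed the same way
df.replace({'b': '.'}, {'b': np.nan})
# The same with a regex
df.replace({'b': r'\s*\.\s*'}, {'b': np.nan}, regex=True)
# In column b, replace 'b' with an empty string
df.replace({'b': {'b': r''}}, regex=True)
# In column b, replace a dot plus surrounding whitespace with NaN
df.replace(regex={'b': {r'\s*\.\s*': np.nan}})
# In column b, replace the dot (and surrounding whitespace) with '.ty'
df.replace({'b': r'\s*(\.)\s*'},
           {'b': r'\1ty'},
           regex=True)
# Several regex patterns, all replaced with NaN
df.replace([r'\s*\.\s*', r'a|b'], np.nan, regex=True)
# The same using keyword arguments
df.replace(regex=[r'\s*\.\s*', r'a|b'], value=np.nan)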
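The "replace first, then handle missing values" workflow described in this section might look like the sketch below. It reuses the sample data from the text; the names `cleaned` and `filled` and the fill value 'missing' are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': list(range(4)),
                   'b': list('ab..'),
                   'c': ['a', 'b', np.nan, 'd']})

# Step 1: turn the '.' placeholder into a real missing value
cleaned = df.replace('.', np.nan)

# Step 2: the usual missing-value tools now see those cells
print(cleaned.isna().sum().to_dict())  # {'a': 0, 'b': 2, 'c': 1}

# Step 3: handle them, here by filling column b with a label
filled = cleaned.fillna({'b': 'missing'})
print(filled['b'].tolist())            # ['a', 'b', 'missing', 'missing']
```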

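Replacing with None:

s = pd.Series([10, 'a', 'a', 'b', 'a'])
# Replace 'a' with None
s.replace({'a': None})
# Since pandas 1.4 an explicit None is used as the replacement value.
# Earlier versions ignored it and padded from the previous value
# (method='pad'): the first two became 10 and the last became b
s.replace('a', None)

# If replacing NaN with None has no effect
df.replace(np.nan, None)
# the following can be used instead
df.where(df.notnull(), None)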
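A small sketch of the None replacements above. The sample frame and the names `as_none`, `frame`, and `with_none` are made up for illustration. Casting to object before calling where() is an assumption based on a common recipe: numeric columns tend to store None as NaN again, so the cast is what gives None a chance to survive.

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 'a', 'a', 'b', 'a'])

# Dict form: every 'a' becomes a missing value
as_none = s.replace({'a': None})
print(as_none.isna().tolist())  # [False, True, True, False, True]

# Hypothetical float column containing NaN
frame = pd.DataFrame({'x': [1.0, np.nan, 3.0]})

# Cast to object first so the column is able to hold None
with_none = frame.astype(object).where(frame.notna(), None)
print(with_none['x'].tolist())
```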

Replacing numbers

# Build sample data
df = pd.DataFrame(np.random.randn(10, 2))
df[np.random.rand(df.shape[0]) > 0.5] = 1.5
# Replace 1.5 with NaN
df.replace(1.5, np.nan)
# Replace 1.5 with NaN, and any value equal to the top-left cell with 'a'
df.replace([1.5, df.iloc[0, 0]], [np.nan, 'a'])
# Apply the replacement in place
df.replace(1.5, np.nan, inplace=True)

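Conditional replacement with case_when()

In pandas, case_when works much like CASE WHEN in SQL: it replaces values according to a list of (condition, replacement) pairs, which is convenient for deriving a new column from conditions on the data and makes the data easier to read and process. Series.case_when() was added in pandas 2.2. For example: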

pd_df["difficulty"] = "Unknown"
pd_df["difficulty"] = pd_df["difficulty"].case_when([
    (pd_df.eval("0 < Time < 30"), "Easy"), 
    (pd_df.eval("30 <= Time <= 60"), "Medium"), 
    (pd_df.eval("Time > 60"), "Hard")
])

For details, see the page on case_when().
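On pandas versions older than 2.2, where case_when() is not available, the same labeling can be sketched with numpy.select. This is a swapped-in technique rather than case_when itself, and the Time values below are made up for illustration. As with CASE WHEN, the first matching condition wins and anything unmatched gets the default.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the Time column comes from the text
pd_df = pd.DataFrame({"Time": [10, 30, 45, 90, -5]})

# Conditions are checked in order; the first True one wins
conditions = [
    (pd_df["Time"] > 0) & (pd_df["Time"] < 30),  # 0 < Time < 30
    pd_df["Time"].between(30, 60),               # 30 <= Time <= 60
    pd_df["Time"] > 60,                          # Time > 60
]
choices = ["Easy", "Medium", "Hard"]

# Rows matching no condition fall back to the default label
pd_df["difficulty"] = np.select(conditions, choices, default="Unknown")
print(pd_df["difficulty"].tolist())
# ['Easy', 'Medium', 'Medium', 'Hard', 'Unknown']
```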

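Clipping with df.clip()

Extreme values that are too large or too small can be trimmed with df.clip(lower, upper): values greater than upper are set to upper, and values smaller than lower are set to lower, just like numpy.clip.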

df = pd.DataFrame({'a': [-1, 2, 5], 'b': [6, 1, -3]})
df
'''
   a  b
0 -1  6
1  2  1
2  5 -3
'''

# Clip to a minimum of 0 and a maximum of 3
df.clip(0, 3)
'''
   a  b
0  0  3
1  2  1
2  3  0
'''

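# Clip with element-specific lower and upper thresholds
# With axis=0, each row may not fall below the value of c at that position, nor exceed that value + 1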
c = pd.Series([-1, 1, 3])
df.clip(c, c+1, axis=0)
'''
   a  b
0 -1  0
1  2  1
2  4  3
'''
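Either bound can also be left out, which clips on one side only. A short sketch with the same sample data (the names `floor_only` and `cap_only` are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'a': [-1, 2, 5], 'b': [6, 1, -3]})

# Only a lower bound: negative values become 0, large values stay
floor_only = df.clip(lower=0)
print(floor_only['a'].tolist())  # [0, 2, 5]
print(floor_only['b'].tolist())  # [6, 1, 0]

# Only an upper bound: values above 3 become 3, small values stay
cap_only = df.clip(upper=3)
print(cap_only['a'].tolist())    # [-1, 2, 3]
print(cap_only['b'].tolist())    # [3, 1, -3]
```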