pandas replace(): Replacing Data


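Replacement is a common operation in data cleaning. pandas supports it on DataFrame, Series, and Timestamp objects, and the str accessor provides its own replace method as well. This article covers how replace() is used on each of these objects.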

Supported objects

The following pandas objects support the replace() method:

pandas.DataFrame.replace
pandas.Series.replace
pandas.Series.str.replace
pandas.Timestamp.replace
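As a quick orientation, the sketch below calls replace() once on each of the four objects listed above. The sample data and variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data, used only for this illustration.
df = pd.DataFrame({"A": [1, 2], "B": ["x", "y"]})

# DataFrame.replace: value-based replacement across the whole frame.
df_out = df.replace(1, 10)

# Series.replace: the same idea on a single column.
s_out = df["A"].replace(2, 20)

# Series.str.replace: substring replacement inside each string.
str_out = df["B"].str.replace("x", "z", regex=False)

# Timestamp.replace: swap out individual date/time fields.
ts_out = pd.Timestamp("2020-03-14").replace(year=2023)

print(df_out, s_out, str_out, ts_out, sep="\n")
```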

DataFrame and Series syntax

The replace() method on DataFrame and Series objects takes these parameters:

replace(
    to_replace=None,
    value=<no_default>,
    inplace: 'bool' = False,
    limit=None,
    regex: 'bool' = False,
    method: 'str | lib.NoDefault' = <no_default>,
)

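In short, it replaces the values given in to_replace with value. This differs from assignment through loc or iloc, which requires you to select a location before updating it with a value.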
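The difference can be seen in a small sketch: replace() searches by value everywhere, while loc needs an explicit selection first. The frame and variable names here are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 1, 2], "B": [0, 5, 0]})

# replace() finds the value wherever it occurs; no location is needed.
by_value = df.replace(0, -1)

# loc assignment updates only the cells that were explicitly selected.
by_location = df.copy()
by_location.loc[by_location["A"] == 0, "A"] = -1

print(by_value)
print(by_location)
```

Here by_value changes every 0 in both columns, whereas by_location changes only the selected cell in column A.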

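Parameters

Every parameter is optional. Calling replace() with no arguments (to_replace=None) returns the data unchanged.

The parameters are:

to_replace : str, regex, list, dict, Series, int, float, or None
How to find the values to be replaced. Many forms are supported.
- numeric, str or regex:
  - numeric: values equal to the number are replaced
  - str: strings that match exactly are replaced
  - regex: values matching the regular expression are replaced
- list of str, regex, or numeric:
  - If to_replace and value are both lists, they must have the same length
  - If regex=True, the strings in the list are interpreted as regular expressions
  - str, regex, and numeric elements are matched by the rules above
- dict:
  - A dict maps existing values to their replacements. For example,
    {'a': 'b', 'y': 'z'} replaces a with b and y with z. In this form value should be None, i.e. not passed
  - For a DataFrame, a dict can name the values to look for in each column: the keys are column names and the dict values are what to find (a list for several values). For example, {'a': [1, 2], 'b': 'z'} looks for 1 and 2 in column a and z in column b, and replaces them with whatever value specifies. In this form value must not be None: pass a single value to use everywhere, or a dict with the same structure to give a one-to-one mapping per column
  - For a DataFrame, nested dicts also work. For example, {'a': {'b': np.nan}} replaces b with NaN in column a. In this form value should be None, i.e. not passed. Regular expressions can be used in the nested dict too, but not for the column names (the top-level keys).
- None:
  - The regex argument must then be a string, a compiled regular expression, or a list, dict, ndarray, or Series of such elements. If value is also None, regex must be a nested dict or Series
value : scalar, dict, list, str, regex, default None.
- The value that replaces whatever matches to_replace
- For a DataFrame, a dict can specify the value to use for each column (columns not in the dict are left untouched).
- Regular expressions, strings, and lists or dicts of such objects are also allowed
inplace : bool, default False. If True, the operation is performed in place and None is returned
limit : int, default None. Maximum size of the gap to forward or backward fill
regex : bool, or the same types as to_replace, default False
- If True, the strings in to_replace and value are interpreted as regular expressions
- If True, to_replace must be a string
- Alternatively, this can be a regular expression or a list, dict, or array of regular expressions, in which case to_replace must be None
method : {'pad', 'ffill', 'bfill', None}. The fill method to use when to_replace is a scalar, list, or tuple and value is None; matched values are then treated as missing. Both method and limit are deprecated as of pandas 2.1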
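A minimal sketch of the per-column forms described above, using a made-up two-column frame. It shows a scalar to_replace with a dict value (columns missing from the dict are left alone) and a dict to_replace paired with a dict value of the same structure.

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 1], "B": [0, 2]})

# Scalar to_replace, dict value: 0 becomes 100 in column A only.
# Column B is not in the dict, so its 0 stays as it is.
only_a = df.replace(0, {"A": 100})

# Dict to_replace, dict value: each column gets its own pair.
# In A, 0 -> 100; in B, 2 -> 200.
per_column = df.replace({"A": 0, "B": 2}, {"A": 100, "B": 200})

print(only_a)
print(per_column)
```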

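Returns the DataFrame or Series after replacement.

Invalid arguments can raise the following errors:

AssertionError
- If regex is not a bool and to_replace is not None
TypeError
- If to_replace is not a scalar, array-like, dict, or None
- If to_replace is a dict and value is not a list, dict, ndarray, or Series
- If to_replace is None and regex is not compilable into a regular expression or is a list, dict, ndarray, or Series
- When replacing multiple bool or datetime64 objects and the arguments to to_replace do not match the type of the value being replaced
ValueError
- If to_replace and value are both lists or ndarrays but their lengths differ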
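The length rule for list arguments is easy to trip over. The sketch below, with an illustrative Series, shows one valid call and one that raises ValueError because the two lists differ in length.

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# Equal-length lists pair up element by element: 1 -> 10, 2 -> 20.
paired = s.replace([1, 2], [10, 20])

# Lists of different lengths are rejected with a ValueError.
try:
    s.replace([1, 2], [10, 20, 30])
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(paired.tolist())
print(error_message)
```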

Series and DataFrame examples

Some examples on Series and DataFrame:

import numpy as np
import pandas as pd

# Replace a scalar
s = pd.Series([1, 2, 3, 4, 5])
s.replace(1, 5) # replace 1 with 5
'''
0    5
1    2
2    3
3    4
4    5
dtype: int64
'''

df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
                   'B': [5, 6, 7, 8, 9],
                   'C': ['a', 'b', 'c', 'd', 'e']})
df.replace(0, 5)
'''
    A  B  C
0  5  5  a
1  1  6  b
2  2  7  c
3  3  8  d
4  4  9  e
'''


# List-like replacement
df.replace([0, 1, 2, 3], 4)
'''
    A  B  C
0  4  5  a
1  4  6  b
2  4  7  c
3  4  8  d
4  4  9  e
'''

df.replace([0, 1, 2, 3], [4, 3, 2, 1])
'''
    A  B  C
0  4  5  a
1  3  6  b
2  2  7  c
3  1  8  d
4  4  9  e
'''

# Note: the method keyword is deprecated as of pandas 2.1
s.replace([1, 2], method='bfill')
'''
0    3
1    3
2    3
3    4
4    5
dtype: int64
'''

# Dict-like replacement
df.replace({0: 10, 1: 100})
'''
        A  B  C
0   10  5  a
1  100  6  b
2    2  7  c
3    3  8  d
4    4  9  e
'''

df.replace({'A': 0, 'B': 5}, 100)
'''
        A    B  C
0  100  100  a
1    1    6  b
2    2    7  c
3    3    8  d
4    4    9  e
'''

df.replace({'A': {0: 100, 4: 400}})
'''
        A  B  C
0  100  5  a
1    1  6  b
2    2  7  c
3    3  8  d
4  400  9  e
'''

Regular expression replacement:

df = pd.DataFrame({'A': ['bat', 'foo', 'bait'],
                   'B': ['abc', 'bar', 'xyz']})
df.replace(to_replace=r'^ba.$', value='new', regex=True)
'''
        A    B
0   new  abc
1   foo  new
2  bait  xyz
'''

df.replace({'A': r'^ba.$'}, {'A': 'new'}, regex=True)
'''
        A    B
0   new  abc
1   foo  bar
2  bait  xyz
'''

df.replace(regex=r'^ba.$', value='new')
'''
        A    B
0   new  abc
1   foo  new
2  bait  xyz
'''

df.replace(regex={r'^ba.$': 'new', 'foo': 'xyz'})
'''
        A    B
0   new  abc
1   xyz  new
2  bait  xyz
'''

df.replace(regex=[r'^ba.$', 'foo'], value='new')
'''
        A    B
0   new  abc
1   new  new
2  bait  xyz
'''
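One point worth making explicit is how a plain string differs from a regex in to_replace: a plain string must equal the whole cell, while regex=True substitutes matching substrings and allows backreferences in value. A small sketch with made-up strings:

```python
import pandas as pd

s = pd.Series(["bat", "batman", "cat"])

# Plain string: only cells equal to "bat" are replaced.
exact = s.replace("bat", "new")

# regex=True: every match inside a cell is substituted.
partial = s.replace("bat", "new", regex=True)

# Backreferences in value reuse captured groups from the pattern.
swapped = s.replace(r"^(.)at$", r"\1ot", regex=True)

print(exact.tolist())    # ['new', 'batman', 'cat']
print(partial.tolist())  # ['new', 'newman', 'cat']
print(swapped.tolist())  # ['bot', 'batman', 'cot']
```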

Compare s.replace({'a': None}) with s.replace('a', None) to understand how the to_replace parameter behaves:

# Define a Series
s = pd.Series([10, 'a', 'a', 'b', 'a'])
s
'''
0    10
1     a
2     a
3     b
4     a
dtype: object
'''

# Note 1
s.replace({'a': None})
'''
0      10
1    None
2    None
3       b
4    None
dtype: object
'''

# Note 2
s.replace('a')
'''
0    10
1    10
2    10
3     b
4     b
dtype: object
'''

# Note 3
s.replace('a', None)
'''
0      10
1    None
2    None
3       b
4    None
dtype: object
'''

Notes:

1) When to_replace is a dict, the dict's values play the role of the value parameter, so the following calls are equivalent:

s.replace({'a': None})
s.replace(to_replace={'a': None}, value=None, method=None)

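2) If value is not passed explicitly and to_replace is a scalar, list, or tuple, replace() fills using the method parameter (pad by default). That is why, in this example, the a values in rows 1 and 2 are replaced by 10 and the one in row 4 by b

3) If None is passed explicitly as value, the result is the same as in note 1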
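The fill behaviour from note 2 relies on the method keyword, which is deprecated as of pandas 2.1. As a sketch of one possible alternative (not the only one), the same forward fill can be written with mask() followed by ffill(), while the dict form and the explicit value=None form keep working as described in notes 1 and 3.

```python
import pandas as pd

s = pd.Series([10, "a", "a", "b", "a"])

# Notes 1 and 3: both forms turn every "a" into a missing value.
by_dict = s.replace({"a": None})
by_value = s.replace("a", value=None)

# Note 2 without the method keyword: blank out the matches,
# then forward-fill them from the previous valid entry.
forward_filled = s.mask(s == "a").ffill()

print(by_dict.isna().tolist())
print(forward_filled.tolist())  # [10, 10, 10, 'b', 'b']
```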

For more use cases, see: pandas data replacement.

Series.str.replace

The string accessor of a Series supports a replace() method with this syntax:

pd.Series.str.replace(
    self,
    pat: 'str | re.Pattern',
    repl: 'str | Callable',
    n: 'int' = -1,
    case: 'bool | None' = None,
    flags: 'int' = 0,
    regex: 'bool | None' = None,
)

Replaces each occurrence of the pattern or regular expression in the Series/Index.

Equivalent to str.replace() or re.sub(), depending on the value of regex.

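Parameters

The parameters are:

pat : str or compiled regex. The string can be a character sequence or a regular expression.
repl : str or callable. The replacement string or a callable. The callable is passed
the regex match object and must return the replacement string to use. See re.sub().
n : int, default -1 (all). Number of replacements to make, counted from the start
case : bool, default None. Determines whether the replacement is case sensitive:
- If True, case sensitive (the default when pat is a string)
- Set to False for case insensitive
- Cannot be set if pat is a compiled regular expression
flags : int, default 0 (no flags). Flags from the re module, e.g. re.IGNORECASE. Cannot be set if pat is a compiled regular expression
regex : bool. Determines whether the passed pattern is a regular expression. The default was True in pandas 1.x and changed to False in pandas 2.0, so passing it explicitly is the safest choice:
- If True, the passed pattern is assumed to be a regular expression
- If False, the pattern is treated as a literal string
- Cannot be set to False if pat is a compiled regular expression or repl is a callable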
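The n and case parameters are not covered by the examples further down, so here is a short sketch with made-up strings. regex is passed explicitly in both calls because its default depends on the pandas version.

```python
import pandas as pd

s = pd.Series(["aaa", "Abc"])

# n=2 limits the work to the first two matches in each string.
first_two = s.str.replace("a", "b", n=2, regex=False)

# case=False makes the match case insensitive, so "A" is replaced too.
ignore_case = s.str.replace("a", "b", case=False, regex=True)

print(first_two.tolist())    # ['bba', 'Abc']
print(ignore_case.tolist())  # ['bbb', 'bbc']
```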

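Returns a copy of the Series or Index with every match of pat replaced by repl.

Arguments that break these rules raise ValueError.

NaN values in the Series are left unchanged.

Notes

When pat is a compiled regular expression, all flags should be included in the compiled pattern. Using case, flags, or regex=False with a compiled regular expression raises an error.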
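The restriction on compiled patterns can be checked directly. In this sketch (sample data and names are illustrative), the flag lives inside the compiled pattern, and each of the three disallowed keyword combinations is tried in turn to collect the resulting ValueError messages.

```python
import re

import pandas as pd

s = pd.Series(["foo", "FUZ"])
compiled = re.compile(r"fuz", flags=re.IGNORECASE)

# Allowed: the flag is part of the compiled pattern itself.
ok = s.str.replace(compiled, "bar", regex=True)

# Not allowed: case, flags, or regex=False alongside a compiled pattern.
errors = []
bad_keyword_sets = (
    {"case": False, "regex": True},
    {"flags": re.IGNORECASE, "regex": True},
    {"regex": False},
)
for kwargs in bad_keyword_sets:
    try:
        s.str.replace(compiled, "bar", **kwargs)
    except ValueError as exc:
        errors.append(str(exc))

print(ok.tolist())  # ['foo', 'bar']
print(errors)
```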

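Examples

When pat is a string and regex is True (the default before pandas 2.0), the given pat is compiled as a regular expression. When repl is a string, matching patterns are replaced as with re.sub().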

pd.Series(['foo', 'fuz', np.nan]).str.replace('f.', 'ba', regex=True)
'''
0    bao
1    baz
2    NaN
dtype: object
'''

When pat is a string and regex is False, every pat is replaced with repl as with str.replace():

pd.Series(['f.o', 'fuz', np.nan]).str.replace('f.', 'ba', regex=False)
'''
0    bao
1    fuz
2    NaN
dtype: object
'''

When repl is a callable, it is called on every match via re.sub(). The callable should expect one positional argument (a regex match object) and return a string.

pd.Series(['foo', 'fuz', np.nan]).str.replace('f', repr, regex=True)
'''
0    <re.Match object; span=(0, 1), match='f'>oo
1    <re.Match object; span=(0, 1), match='f'>uz
2                                            NaN
dtype: object
'''

Reverse every lowercase alphabetic word:

repl = lambda m: m.group(0)[::-1]
ser = pd.Series(['foo 123', 'bar baz', np.nan])
ser.str.replace(r'[a-z]+', repl, regex=True)
'''
0    oof 123
1    rab zab
2        NaN
dtype: object
'''

Using regex groups (extract the second group and swap its case):

pat = r"(?P<one>\w+) (?P<two>\w+) (?P<three>\w+)"
repl = lambda m: m.group('two').swapcase()
ser = pd.Series(['One Two Three', 'Foo Bar Baz'])
ser.str.replace(pat, repl, regex=True)
'''
0    tWO
1    bAR
dtype: object
'''
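A callable is not always needed to work with groups. When repl is a plain string, backreferences such as \1 can reorder captured groups, as in this sketch that reuses the sample strings above with an unnamed-group pattern chosen for illustration.

```python
import pandas as pd

ser = pd.Series(["One Two Three", "Foo Bar Baz"])

# \1, \2, \3 refer to the first, second, and third captured word.
reordered = ser.str.replace(r"(\w+) (\w+) (\w+)", r"\3 \2 \1", regex=True)

print(reordered.tolist())  # ['Three Two One', 'Baz Bar Foo']
```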

Using a compiled regex with flags:

import re
regex_pat = re.compile(r'FUZ', flags=re.IGNORECASE)
pd.Series(['foo', 'fuz', np.nan]).str.replace(regex_pat, 'bar', regex=True)
'''
0    foo
1    bar
2    NaN
dtype: object
'''

Timestamp

The replace() method for pandas timestamps is as follows:

pd.Timestamp.replace(
    self,
    year=None,
    month=None,
    day=None,
    hour=None,
    minute=None,
    second=None,
    microsecond=None,
    nanosecond=None,
    tzinfo=<class 'object'>,
    fold=None,
)

It implements datetime.replace and also handles nanoseconds.

Parameters

The parameters are:

year : int, optional
month : int, optional
day : int, optional
hour : int, optional
minute : int, optional
second : int, optional
microsecond : int, optional
nanosecond : int, optional
tzinfo : tz-convertible, optional. The time zone
fold : int, optional

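It returns a Timestamp with the given fields replaced, so individual components of the time can be swapped out, e.g. changing the year to 2023.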
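A common use of field replacement is truncating a timestamp. The sketch below changes the year to 2023 and, separately, resets everything below the month to get the start of the month; the variable names are illustrative.

```python
import pandas as pd

ts = pd.Timestamp("2020-03-14T15:32:52.192548651", tz="UTC")

# Only the year changes; all other fields, nanoseconds included, are kept.
in_2023 = ts.replace(year=2023)

# Reset every field below the month to reach the start of the month.
month_start = ts.replace(
    day=1, hour=0, minute=0, second=0, microsecond=0, nanosecond=0
)

print(in_2023)
print(month_start)
```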

Examples

Some examples of replacing parts of a timestamp:

# Create a timestamp object:
ts = pd.Timestamp('2020-03-14T15:32:52.192548651', tz='UTC')
ts
# Timestamp('2020-03-14 15:32:52.192548651+0000', tz='UTC')

# Replace the year and the hour:
ts.replace(year=1999, hour=10)
# Timestamp('1999-03-14 10:32:52.192548651+0000', tz='UTC')

# Replace the time zone (this is not a conversion):
import pytz
ts.replace(tzinfo=pytz.timezone('US/Pacific'))
# Timestamp('2020-03-14 15:32:52.192548651-0700', tz='US/Pacific')

# The analogous operation on `pd.NaT`, kept for compatibility:
pd.NaT.replace(tzinfo=pytz.timezone('US/Pacific'))
# NaT
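To make the "replace, not convert" point concrete, the following sketch compares replace(tzinfo=...) with tz_convert(). It uses a fixed UTC-7 offset from the standard library instead of pytz, purely to keep the example free of extra dependencies.

```python
from datetime import timedelta, timezone

import pandas as pd

ts = pd.Timestamp("2020-03-14 15:32:52", tz="UTC")
fixed_offset = timezone(timedelta(hours=-7))  # assumed offset for the demo

# replace() keeps the wall-clock fields and swaps the zone label,
# so the result is a different instant in time.
relabelled = ts.replace(tzinfo=fixed_offset)

# tz_convert() keeps the instant and recomputes the wall-clock fields.
converted = ts.tz_convert(fixed_offset)

print(relabelled.hour, relabelled == ts)  # 15 False
print(converted.hour, converted == ts)    # 8 True
```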

References

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html
https://pandas.pydata.org/docs/reference/api/pandas.Series.str.replace.html
https://pandas.pydata.org/docs/reference/api/pandas.Timestamp.replace.html