pandas text data types


pandas has two dtypes for text: object and StringDtype. StringDtype is the newer of the two, and the official documentation recommends using it.

object

By default, pandas 1.x and 2.x infer text data as the object dtype.

import pandas as pd

pd.Series(['a', 'b', 'c'])
'''
0    a
1    b
2    c
dtype: object
'''
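One practical consequence of object being a catch-all dtype is that a column can hold a mix of strings and non-strings without any complaint. The sketch below is illustrative only; the variable names are made up for this example.

```python
import pandas as pd

# An object Series accepts any Python object, not only strings.
mixed = pd.Series(['a', 1, 2.5], dtype=object)

# Inspect the Python type of each element.
element_types = mixed.map(type).tolist()

# String methods skip the non-string elements and return missing values for them.
upper = mixed.str.upper()
print(element_types)
print(upper)
```

This is one reason a dedicated text dtype is easier to reason about: the dtype itself tells you the column contains text.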

string

The string dtype has to be requested explicitly:

pd.Series(['a', 'b', 'c'], dtype="string")
pd.Series(['a', 'b', 'c'], dtype=pd.StringDtype())
'''
0    a
1    b
2    c
dtype: string
'''
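A quick way to confirm what you got is to check the dtype object and look at how missing values are stored. This is a minimal sketch with illustrative variable names; it assumes the "string" alias as used above.

```python
import pandas as pd

s = pd.Series(['a', None, 'b'], dtype="string")

# The dtype is an instance of pd.StringDtype.
is_string_dtype = isinstance(s.dtype, pd.StringDtype)

# None is stored as the pandas missing-value marker pd.NA.
missing_mask = s.isna().tolist()
missing_is_na = s[1] is pd.NA

print(s.dtype, is_string_dtype, missing_mask, missing_is_na)
```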

PyArrow support

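pyarrow is the Python library for Apache Arrow. Apache Arrow is a language-agnostic software framework for developing data analytics applications that process columnar data. It defines a standardized column-oriented memory format that can represent both flat and hierarchical data, so that analytic operations run efficiently on modern CPU and GPU hardware. This reduces or removes factors that limit the feasibility of working with large datasets, such as the cost, volatility, or physical constraints of dynamic random-access memory (DRAM).

pyarrow handles text data efficiently. Install it with pip before use. The following shows how to make pyarrow the storage backend for pandas text data.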

# storage defaults to "python"; request pyarrow explicitly
pd.StringDtype(storage="pyarrow")
# set it globally
pd.options.mode.string_storage = 'pyarrow'
# specify it in the dtype string
pd.Series(['abc', None, 'def'], dtype="string[pyarrow]")
# set it temporarily through pandas options
with pd.option_context("mode.string_storage", "pyarrow"):
    s = pd.Series(['abc', None, 'def'], dtype="string")

# check the default (after the global setting above)
pd.StringDtype()
# string[pyarrow]

With the pyarrow backend, string operations work essentially the same way, so you can use them as usual.
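To see that the usual .str methods behave the same, here is a small sketch. It assumes that constructing pd.StringDtype(storage="pyarrow") raises ImportError when a suitable pyarrow is not installed, and falls back to the python storage in that case so the example runs either way.

```python
import pandas as pd

# Prefer the pyarrow storage; fall back to python storage if pyarrow is unavailable.
try:
    dtype = pd.StringDtype(storage="pyarrow")
except ImportError:
    dtype = pd.StringDtype(storage="python")

s = pd.Series(['abc', None, 'def'], dtype=dtype)

# The same .str methods are used regardless of the storage backend.
upper = s.str.upper()
lengths = s.str.len()

print(dtype.storage)
print(upper)
print(lengths)
```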

Conversion

Other types can be converted to either of these two dtypes:

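s = pd.Series(['a', 'b', 'c'])
s.astype("object") # convert to object
s.astype("string") # convert to string

# automatic dtype conversion, which supports the string dtype
df = pd.DataFrame({'name': ['a', 'b'], 'n': [1, 2]})  # example frame
df.convert_dtypes().dtypes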
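The source of a conversion does not have to be text. The sketch below converts an integer Series to string and inspects what convert_dtypes picks for each column; the column names and values are made up for illustration.

```python
import pandas as pd

# Integers converted to text: each value becomes its string form.
numbers = pd.Series([1, 2, 3])
as_text = numbers.astype("string")

# convert_dtypes chooses nullable extension dtypes column by column.
df = pd.DataFrame({"name": ["x", "y"], "n": [1, 2]})
converted = df.convert_dtypes()

print(as_text.tolist())
print(converted.dtypes)
```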

Differences in behavior

StringDtype behaves differently from object in a few ways, and these differences are the reasons it is recommended:

# numeric results are Int64
pd.Series(["a", None, "b"]).str.count("a") # dtype: float64
pd.Series(["a", None, "b"], dtype="string").str.count("a") # dtype: Int64

# logical results are boolean
pd.Series(["a", None, "b"]).str.isdigit() # dtype: object
pd.Series(["a", None, "b"], dtype="string").str.isdigit() # dtype: boolean

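Some string methods, such as Series.str.decode(), are not available on StringDtype, because a StringArray holds only strings, not bytes. In comparison operations, an arrays.StringArray and a Series backed by a StringArray return an object with BooleanDtype rather than a bool dtype, and missing values propagate into the result.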
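The comparison behavior is easy to check side by side. This is a minimal sketch with illustrative variable names; the object Series is created with an explicit dtype=object so the contrast does not depend on dtype inference.

```python
import pandas as pd

values = ["a", None, "b"]

# object dtype: comparison gives plain bool, and the missing value compares as False.
cmp_object = pd.Series(values, dtype=object) == "a"

# string dtype: comparison gives the nullable boolean dtype, and the missing value stays missing.
cmp_string = pd.Series(values, dtype="string") == "a"

print(cmp_object)
print(cmp_string)
```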

All other methods work the same way for string and object.


Last updated: 2021-08-04 14:47:23. Tags: pandas, text data types