Working with pandas categorical data

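Categorical data has two properties: its categories, which list the possible values, and an ordered flag, which says whether the order of those values matters. They are exposed as s.cat.categories and s.cat.ordered. If you do not specify the categories and the ordering yourself, they are inferred from the arguments you pass.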

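Order

New categorical data is not ordered automatically. You must pass ordered=True explicitly to mark a categorical as ordered.

Inspect the categories and the ordered flag of categorical data:

import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")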
s.cat.categories
# Index(['a', 'b', 'c'], dtype='object')
s.cat.ordered
# False

You can also pass the categories in a specific order:

s = pd.Series(pd.Categorical(["a", "b", "c", "a"],
                             categories=["c", "b", "a"]))
s.cat.categories
# Index(['c', 'b', 'a'], dtype='object')
s.cat.ordered
# False
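
Both examples above report ordered as False. As a minimal sketch of what passing ordered=True changes, the snippet below builds an ordered categorical and uses the category order for min(), max() and comparisons. The size labels are hypothetical sample data, not taken from the tutorial.

```python
import pandas as pd

# Hypothetical sample data: the category list defines the order small < medium < large
sizes = pd.Series(
    pd.Categorical(
        ["small", "large", "medium", "small"],
        categories=["small", "medium", "large"],
        ordered=True,
    )
)

print(sizes.cat.ordered)          # True
print(sizes.min(), sizes.max())   # small large

# Comparisons against a category follow the category order, not alphabetical order
is_bigger = sizes > "small"
print(is_bigger.tolist())         # [False, True, True, False]
```

Without ordered=True, min(), max() and the > comparison would raise a TypeError instead of returning a result.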

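The result of unique() is not always the same as Series.cat.categories, because Series.unique() makes two guarantees: it returns the values in order of appearance, and it includes only values that are actually present.

from pandas.api.types import CategoricalDtype

s = pd.Series(list('babc')).astype(CategoricalDtype(list('abcd')))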
s
'''
0    b
1    a
2    b
3    c
dtype: category
Categories (4, object): [a, b, c, d]
'''

# categories
s.cat.categories
# Index(['a', 'b', 'c', 'd'], dtype='object')

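# unique values, in order of appearance
# (the output below is from an older pandas; since pandas 1.3 the result keeps
# the original dtype, so its Categories line still lists all four categories)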
s.unique()
'''
[b, a, c]
Categories (3, object): [b, a, c]
'''

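Descriptive statistics

Calling describe() on categorical data produces output similar to that of a string-typed Series or DataFrame.

import numpy as np

# `cat` was not defined in the original; this is an assumed definition
# that is consistent with the output shown below
cat = pd.Categorical(["a", "c", "c", np.nan], categories=["b", "a", "c"])
df = pd.DataFrame({"cat": cat, "s": ["a", "c", "c", np.nan]})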
df.describe()
'''
       cat  s
count    3  3
unique   2  2
top      c  c
freq     2  2
'''

df["cat"].describe()
'''
count     3
unique    2
top       c
freq      2
Name: cat, dtype: object
'''

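Renaming categories

Categories are renamed with the rename_categories() method. Older pandas versions also allowed assigning new values to the Series.cat.categories attribute, but that setter was removed in pandas 2.0, so the examples here use the method:

s = pd.Series(["a", "b", "c", "a"], dtype="category")
s
'''
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a, b, c]
'''
s = s.cat.rename_categories(["Group %s" % g for g in s.cat.categories])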
s
'''
0    Group a
1    Group b
2    Group c
3    Group a
dtype: category
Categories (3, object): [Group a, Group b, Group c]
'''
s = s.cat.rename_categories([1, 2, 3])
s
'''
0    1
1    2
2    3
3    1
dtype: category
Categories (3, int64): [1, 2, 3]
'''
# Rename using a dict
s = s.cat.rename_categories({1: 'x', 2: 'y', 3: 'z'})
s
'''
0    x
1    y
2    z
3    x
dtype: category
Categories (3, object): [x, y, z]
'''
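
Renaming only relabels the categories; the data itself is not recoded. As a small sketch of that, the snippet below passes a callable to rename_categories() (the method accepts a list, a dict, or a callable) and checks that the underlying integer codes are unchanged. The upper-casing function is just an illustrative choice.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")
codes_before = s.cat.codes.tolist()

# A callable is applied to each existing category label
renamed = s.cat.rename_categories(lambda label: label.upper())

print(renamed.cat.categories.tolist())   # ['A', 'B', 'C']
print(renamed.tolist())                  # ['A', 'B', 'C', 'A']

# The integer codes behind the values are the same before and after
codes_after = renamed.cat.codes.tolist()
print(codes_before == codes_after)       # True
```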

Note that the new categories must be unique, otherwise a ValueError is raised:

try:
    s.cat.rename_categories([1, 1, 1])
except ValueError as e:
    print("ValueError:", str(e))
# ValueError: Categorical categories must be unique

A NaN category also raises a ValueError:

try:
    s.cat.rename_categories([1, 2, np.nan])
except ValueError as e:
    print("ValueError:", str(e))
# ValueError: Categorical categories cannot be null

Appending new categories

Use the add_categories() method to append categories:

s = s.cat.add_categories([4])
s.cat.categories
# Index(['x', 'y', 'z', 4], dtype='object')
s
'''
0    x
1    y
2    z
3    x
dtype: category
Categories (4, object): [x, y, z, 4]
'''
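
A newly appended category exists in the dtype but is not used by any value yet. The sketch below illustrates this with value_counts(), which reports every category of a categorical Series, and unique(), which reports only values that occur. The label "w" is a hypothetical example.

```python
import pandas as pd

s = pd.Series(["x", "y", "z", "x"], dtype="category")
s = s.cat.add_categories(["w"])   # "w" is a hypothetical new label

# value_counts() lists every category, including the unused one
counts = s.value_counts()
print(counts["x"], counts["w"])   # 2 0

# unique() only contains values that actually occur in the data
present = s.unique().tolist()
print(present)                    # ['x', 'y', 'z']
```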

Removing categories

Use the remove_categories() method to remove categories. Values that belonged to a removed category are replaced with np.nan.

s = s.cat.remove_categories([4])
s
'''
0    x
1    y
2    z
3    x
dtype: category
Categories (3, object): [x, y, z]
'''
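
The example above removes a category that no value uses, so no NaN appears. As a short sketch of the replacement behaviour, the snippet below removes a category that is in use:

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# "b" is used by the value at position 1, so that value becomes NaN
trimmed = s.cat.remove_categories(["b"])

print(trimmed.cat.categories.tolist())   # ['a', 'c']
print(trimmed.isna().tolist())           # [False, True, False, False]
```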

Remove unused categories:

s = pd.Series(pd.Categorical(["a", "b", "a"],
                             categories=["a", "b", "c", "d"]))
s
'''
0    a
1    b
2    a
dtype: category
Categories (4, object): [a, b, c, d]
'''
s.cat.remove_unused_categories()
'''
0    a
1    b
2    a
dtype: category
Categories (2, object): [a, b]
'''

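Setting categories

If you want to remove and add categories in a single step (which has a speed advantage), or simply set the categories to a predefined list, use set_categories().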

s = pd.Series(["one", "two", "four", "-"], dtype="category")
s
'''
0     one
1     two
2    four
3       -
dtype: category
Categories (4, object): [-, four, one, two]
'''
s = s.cat.set_categories(["one", "two", "three", "four"])
s
'''
0     one
1     two
2    four
3     NaN
dtype: category
Categories (4, object): [one, two, three, four]
'''
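
To make the "single step" concrete, here is a sketch that reaches the same result two ways: once with set_categories(), and once by chaining remove_categories(), add_categories() and reorder_categories(). It reuses the data from the example above and is only meant to show the equivalence, not to measure speed.

```python
import pandas as pd

s = pd.Series(["one", "two", "four", "-"], dtype="category")
target = ["one", "two", "three", "four"]

# One step: categories outside `target` are dropped and their values become NaN
one_step = s.cat.set_categories(target)

# Several steps: drop "-", append "three", then put the categories in target order
multi_step = (
    s.cat.remove_categories(["-"])
    .cat.add_categories(["three"])
    .cat.reorder_categories(target)
)

print(one_step.cat.categories.tolist())   # ['one', 'two', 'three', 'four']
print(one_step.equals(multi_step))        # True
```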

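Be aware that Categorical.set_categories() cannot know whether a category was omitted on purpose, or because it was misspelled, or (under Python 3) because of a type difference such as NumPy S1 dtype versus Python strings. This can lead to surprising behavior!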
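
A sketch of the misspelling case: the typo "tree" below is a deliberate, hypothetical mistake. set_categories() accepts it without any error, and every "three" value silently turns into NaN. The set difference at the end is one possible guard, not something pandas does for you.

```python
import pandas as pd

s = pd.Series(["one", "two", "three", "two"], dtype="category")
new_categories = ["one", "two", "tree"]   # "tree" is a typo for "three"

# No error is raised; the value "three" no longer matches any category
result = s.cat.set_categories(new_categories)
print(int(result.isna().sum()))   # 1

# Possible guard: check which observed values would be lost before applying
lost = set(s.dropna().unique()) - set(new_categories)
print(lost)                       # {'three'}
```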

Last updated: 2023-05-12 06:56:31. Tags: pandas, categorical data