Working with Categorical Data in pandas


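Categorical data in pandas can be accessed with the usual indexers: .loc, .iloc, .at and .iat. There are only two differences: the type returned when getting values, and the restriction that only values already present in the categories can be assigned.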

Getting data

If a slicing operation returns a DataFrame or a column as a Series, the category dtype is preserved.

import pandas as pd

idx = pd.Index(["h", "i", "j", "k", "l", "m", "n"])
cats = pd.Series(["a", "b", "b", "b", "c", "c", "c"],
                 dtype="category", index=idx)

values = [1, 2, 2, 2, 3, 4, 5]
df = pd.DataFrame({"cats": cats, "values": values}, index=idx)
df.iloc[2:4, :]
''' 
  cats  values
j    b       2
k    b       2
'''

df.iloc[2:4, :].dtypes
'''
cats      category
values       int64
dtype: object
'''

df.loc["h":"j", "cats"]
'''
h    a
i    b
j    b
Name: cats, dtype: category
Categories (3, object): [a, b, c]
'''

df[df["cats"] == "b"]
'''
  cats  values
i    b       2
j    b       2
k    b       2
'''

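The category dtype is not preserved when you take a single row: the row spans columns of different types, so the resulting Series has dtype object: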

# get the complete "h" row as a Series
df.loc["h", :]
'''
cats      a
values    1
Name: h, dtype: object
'''

Getting a single item from categorical data returns the value itself, not a categorical of length 1.

df.iat[0, 0]
# 'a'

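# Assigning to .cat.categories directly was removed in pandas 2.0; use rename_categories
df["cats"] = df["cats"].cat.rename_categories(["x", "y", "z"])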
df.at["h", "cats"]  # returns a string
# 'x'

To get a single-value Series of category dtype, pass in a list containing that single label:

df.loc[["h"], "cats"]
'''
h    x
Name: cats, dtype: category
Categories (3, object): [x, y, z]
'''
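As a quick check of the rules in this section, the sketch below compares what each kind of selection returns. It builds its own small stand-in frame (the data is hypothetical, not the tutorial's df) so that it can be run on its own.

```python
import pandas as pd

# Small stand-in frame with one categorical column (hypothetical data).
idx = pd.Index(["h", "i", "j"])
df = pd.DataFrame(
    {"cats": pd.Categorical(["a", "b", "b"]), "values": [1, 2, 2]},
    index=idx,
)

column_slice = df.loc["h":"i", "cats"]  # slice of one column: stays categorical
row = df.loc["h", :]                     # one row across mixed columns: object
scalar = df.at["h", "cats"]              # single value: a plain Python string
one_element = df.loc[["h"], "cats"]      # list selector: one-element categorical Series

print(isinstance(column_slice.dtype, pd.CategoricalDtype))
print(row.dtype)
print(type(scalar).__name__)
print(isinstance(one_element.dtype, pd.CategoricalDtype), len(one_element))
```

Checking the dtype with isinstance against pd.CategoricalDtype is one reasonable way to confirm that a selection kept its categorical type.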

String and datetime accessors

The .dt and .str accessors work if s.cat.categories are of an appropriate type:

str_s = pd.Series(list('aabb'))
str_cat = str_s.astype('category')
str_cat
'''
0    a
1    a
2    b
3    b
dtype: category
Categories (2, object): [a, b]
'''
str_cat.str.contains("a")
'''
0     True
1     True
2    False
3    False
dtype: bool
'''
date_s = pd.Series(pd.date_range('1/1/2015', periods=5))
date_cat = date_s.astype('category')
date_cat
'''
0   2015-01-01
1   2015-01-02
2   2015-01-03
3   2015-01-04
4   2015-01-05
dtype: category
Categories (5, datetime64[ns]): [2015-01-01, 2015-01-02, 2015-01-03, 2015-01-04, 2015-01-05]
'''
date_cat.dt.day
'''
0    1
1    2
2    3
3    4
4    5
dtype: int64
'''

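The returned Series (or DataFrame) has the same type as if you had used .str.<method> / .dt.<method> on a Series of the underlying type, not on one of type category.

In other words, the values returned by accessor methods and properties on a Series are equal to the values returned by the same methods and properties on that Series after it has been converted to type category: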

ret_s = str_s.str.contains("a")
ret_cat = str_cat.str.contains("a")
ret_s.dtype == ret_cat.dtype
# True
ret_s == ret_cat
'''
0    True
1    True
2    True
3    True
dtype: bool
'''

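The work is done on the categories, and a new Series is then constructed from the result. This has performance implications if you have a Series of strings in which many elements are repeated (that is, the number of unique elements is much smaller than the length of the Series). In that case it can be faster to convert the original Series to type category and use .str.<method> or .dt.<property> on the converted Series.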
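The idea of doing the work once per category and then expanding the result can be sketched by hand. The snippet below is only an illustration of the principle, not pandas' internal implementation; the data is hypothetical and it assumes there are no missing values (a missing value would have code -1).

```python
import pandas as pd

# Hypothetical data: 4,000 rows but only three distinct strings.
s = pd.Series(["apple", "banana", "apple", "cherry"] * 1000)
cat = s.astype("category")

# Do the string work once per category (3 values) ...
upper_categories = cat.cat.categories.str.upper().to_numpy()
# ... then expand back to full length using the integer codes.
rebuilt = pd.Series(upper_categories[cat.cat.codes.to_numpy()], index=s.index)

same_as_plain = rebuilt.tolist() == s.str.upper().tolist()
same_as_accessor = cat.str.upper().tolist() == s.str.upper().tolist()
print(len(cat.cat.categories), same_as_plain, same_as_accessor)
```

Whether the categorical route is actually faster depends on the data and the pandas version, so it is worth timing on your own workload.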

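Setting data

Setting values in a categorical column (or Series) works as long as the value is included in the categories: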

idx = pd.Index(["h", "i", "j", "k", "l", "m", "n"])
cats = pd.Categorical(["a", "a", "a", "a", "a", "a", "a"],
                      categories=["a", "b"])
values = [1, 1, 1, 1, 1, 1, 1]
df = pd.DataFrame({"cats": cats, "values": values}, index=idx)
df.iloc[2:4, :] = [["b", 2], ["b", 2]]
df
''' 
  cats  values
h    a       1
i    a       1
j    b       2
k    b       2
l    a       1
m    a       1
n    a       1
'''
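try:
    df.iloc[2:4, :] = [["c", 3], ["c", 3]]
except (TypeError, ValueError) as e:
    # Recent pandas releases raise TypeError here; older ones raised ValueError.
    print(type(e).__name__ + ":", str(e))
# prints the exception name followed by a message such as:
# Cannot setitem on a Categorical with a new category, set the categories first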
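As the error message suggests, the category has to exist before the value can be assigned. A minimal sketch of that fix, using .cat.add_categories on a small hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame(
    {"cats": pd.Categorical(["a", "a", "a"], categories=["a", "b"]),
     "values": [1, 1, 1]},
    index=["h", "i", "j"],
)

# Register the new category first; assigning "c" is then allowed.
df["cats"] = df["cats"].cat.add_categories(["c"])
df.loc["j", "cats"] = "c"

print(list(df["cats"].cat.categories))
print(df["cats"].tolist())
```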

Setting values by assigning categorical data also checks that the categories match:

df.loc["j":"k", "cats"] = pd.Categorical(["a", "a"], categories=["a", "b"])
df
'''
  cats  values
h    a       1
i    a       1
j    a       2
k    a       2
l    a       1
m    a       1
n    a       1
'''

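try:
    df.loc["j":"k", "cats"] = pd.Categorical(["b", "b"],
                                             categories=["a", "b", "c"])
except (TypeError, ValueError) as e:
    # Recent pandas releases raise TypeError here; older ones raised ValueError.
    print(type(e).__name__ + ":", str(e))

# prints the exception name followed by:
# Cannot set a Categorical with another, without identical categories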
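One way around the mismatch, sketched here on hypothetical data, is to re-express the incoming Categorical in the target column's categories with set_categories before assigning. Values that are not in the target categories would turn into NaN, so the sketch checks for that explicitly.

```python
import pandas as pd

df = pd.DataFrame(
    {"cats": pd.Categorical(["a", "a", "a", "a"], categories=["a", "b"])},
    index=["i", "j", "k", "l"],
)

incoming = pd.Categorical(["b", "b"], categories=["a", "b", "c"])

# Re-express the incoming values in the target column's categories.
# Values outside those categories would become NaN, so check for that.
aligned = incoming.set_categories(df["cats"].cat.categories)
assert not aligned.isna().any()

df.loc["j":"k", "cats"] = aligned
print(df["cats"].tolist())
```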

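Assigning a Categorical to part of a column of another type uses the underlying values, and the column does not become categorical: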

df = pd.DataFrame({"a": [1, 1, 1, 1, 1], "b": ["a", "a", "a", "a", "a"]})
df.loc[1:2, "a"] = pd.Categorical(["b", "b"], categories=["a", "b"])
df.loc[2:3, "b"] = pd.Categorical(["b", "b"], categories=["a", "b"])
df
'''
   a  b
0  1  a
1  b  a
2  b  b
3  1  b
4  1  a
'''
df.dtypes
'''
a    object
b    object
dtype: object
'''

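Merging / concatenation

By default, combining Series or DataFrames that contain the same categories results in category dtype; otherwise the result depends on the dtype of the underlying categories. Merges that result in non-categorical dtypes are likely to use more memory. Use .astype or union_categoricals to ensure a category result.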

from pandas.api.types import union_categoricals

# same categories
s1 = pd.Series(['a', 'b'], dtype='category')
s2 = pd.Series(['a', 'b', 'a'], dtype='category')
pd.concat([s1, s2])
'''
0    a
1    b
0    a
1    b
2    a
dtype: category
Categories (2, object): [a, b]
'''
# different categories
s3 = pd.Series(['b', 'c'], dtype='category')
pd.concat([s1, s3])
''' 
0    a
1    b
0    b
1    c
dtype: object
'''
# Output dtype is inferred based on categories values
int_cats = pd.Series([1, 2], dtype="category")
float_cats = pd.Series([3.0, 4.0], dtype="category")
pd.concat([int_cats, float_cats])
'''
0    1.0
1    2.0
0    3.0
1    4.0
dtype: float64
'''
pd.concat([s1, s3]).astype('category')
''' 
0    a
1    b
0    b
1    c
dtype: category
Categories (3, object): [a, b, c]
'''
union_categoricals([s1.array, s3.array])
''' 
[a, b, b, c]
Categories (3, object): [a, b, c]
'''
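Note that union_categoricals returns a bare Categorical, so the original index labels are lost. If both the index and the category dtype should survive the concatenation, one possible approach (a sketch, not the only option) is to give every Series the same category set first and then call pd.concat:

```python
import pandas as pd
from pandas.api.types import union_categoricals

s1 = pd.Series(["a", "b"], dtype="category")
s3 = pd.Series(["b", "c"], dtype="category")

# Build the union of categories, then give every Series that same category set.
shared = union_categoricals([s1, s3]).categories
combined = pd.concat([s.cat.set_categories(shared) for s in (s1, s3)])

print(combined.dtype)
print(combined.index.tolist())
```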

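Unioning

If you want to combine categoricals that do not necessarily have the same categories, the union_categoricals() function combines a list-like of categoricals. The new categories are the union of the categories being combined.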

from pandas.api.types import union_categoricals
a = pd.Categorical(["b", "c"])
b = pd.Categorical(["a", "b"])
union_categoricals([a, b])
'''
[b, c, a, b]
Categories (3, object): [b, c, a]
'''

By default, the resulting categories are ordered as they appear in the data. If you want the categories to be lexically sorted instead, pass sort_categories=True.

union_categoricals([a, b], sort_categories=True)
'''
[b, c, a, b]
Categories (3, object): [a, b, c]
'''
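The values are the same in both results, but the integer codes behind them differ because the codes are positions in the category list. A short sketch that makes this visible:

```python
import pandas as pd
from pandas.api.types import union_categoricals

a = pd.Categorical(["b", "c"])
b = pd.Categorical(["a", "b"])

by_appearance = union_categoricals([a, b])
by_sorting = union_categoricals([a, b], sort_categories=True)

# Same values, different category order, hence different integer codes.
print(list(by_appearance.categories), by_appearance.codes.tolist())
print(list(by_sorting.categories), by_sorting.codes.tolist())
```

This matters mainly if other code relies on the codes of the inputs: after a union, the same value may be stored under a different code.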

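union_categoricals also works in the "easy" case of combining two categoricals that have the same categories and the same order information (the case in which you could also simply append one to the other).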

a = pd.Categorical(["a", "b"], ordered=True)
b = pd.Categorical(["a", "b", "a"], ordered=True)
union_categoricals([a, b])
'''
[a, b, a, b, a]
Categories (2, object): [a < b]
'''

The following raises TypeError because the categories are ordered and not identical.

a = pd.Categorical(["a", "b"], ordered=True)
b = pd.Categorical(["a", "b", "c"], ordered=True)
union_categoricals([a, b])
# TypeError: to union ordered Categoricals, all categories must be the same

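Ordered categoricals with different categories or orderings can be combined by passing the ignore_order=True argument. The result is an unordered categorical.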

a = pd.Categorical(["a", "b", "c"], ordered=True)
b = pd.Categorical(["c", "b", "a"], ordered=True)
union_categoricals([a, b], ignore_order=True)
'''
[a, b, c, c, b, a]
Categories (3, object): [a, b, c]
'''
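The sketch below puts the two behaviours side by side for ordered categoricals whose categories really differ: the plain call fails, while ignore_order=True succeeds and drops the ordering.

```python
import pandas as pd
from pandas.api.types import union_categoricals

a = pd.Categorical(["a", "b"], ordered=True)
b = pd.Categorical(["a", "b", "c"], ordered=True)

try:
    union_categoricals([a, b])
    raised = False
except TypeError:
    raised = True

# With ignore_order=True the union succeeds, but the result is unordered.
relaxed = union_categoricals([a, b], ignore_order=True)
print(raised, relaxed.ordered, list(relaxed.categories))
```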

union_categoricals() also works with a CategoricalIndex or with Series containing categorical data, but note that the result is always a plain Categorical:

a = pd.Series(["b", "c"], dtype='category')
b = pd.Series(["a", "b"], dtype='category')
union_categoricals([a, b])
'''
[b, c, a, b]
Categories (3, object): [b, c, a]
'''
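If a Series is needed afterwards, the resulting Categorical can be wrapped again. A minimal sketch; note that the new Series gets a fresh default index, because the inputs' index labels are not carried through the union:

```python
import pandas as pd
from pandas.api.types import union_categoricals

a = pd.Series(["b", "c"], dtype="category")
b = pd.Series(["a", "b"], dtype="category")

merged = union_categoricals([a, b])  # a plain Categorical, no index
as_series = pd.Series(merged)        # wrapped again, with a default integer index

print(type(merged).__name__, as_series.dtype)
print(as_series.index.tolist())
```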


Last updated: 2020-06-25 13:31:35. Tags: pandas, categorical data