Operations on pandas categorical data

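This article covers how categorical data takes part in comparisons, along with related operations such as deduplication, grouping, and pivoting.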

Comparisons

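Categorical data can be compared with other objects in three cases:

Equality comparisons (== and !=) with a list-like object (list, Series, array, and so on) of the same length as the categorical data.
All comparisons (==, !=, >, >=, <, and <=) with another categorical Series, when ordered == True and the categories are the same.
All comparisons of categorical data with a scalar.

All other comparisons raise a TypeError, in particular "non-equality" comparisons between two categoricals with different categories, or between a categorical and any list-like object.

Any "non-equality" comparison of categorical data with a Series, np.array, list, or categorical that has different categories or ordering raises a TypeError, because a custom category ordering could be interpreted in two ways: one that takes the ordering into account and one that does not.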

import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

cat = pd.Series([1, 2, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))
cat_base = pd.Series([2, 2, 2]).astype(CategoricalDtype([3, 2, 1], ordered=True))
cat_base2 = pd.Series([2, 2, 2]).astype(CategoricalDtype(ordered=True))
cat
'''
0    1
1    2
2    3
dtype: category
Categories (3, int64): [3 < 2 < 1]
'''
cat_base
'''
0    2
1    2
2    2
dtype: category
Categories (3, int64): [3 < 2 < 1]
'''
cat_base2
'''
0    2
1    2
2    2
dtype: category
Categories (1, int64): [2]
'''

Comparing with a categorical that has the same categories and ordering, or with a scalar:

cat > cat_base
'''
0     True
1    False
2    False
dtype: bool
'''

cat > 2
'''
0     True
1    False
2    False
dtype: bool
'''

Equality comparisons work with any list-like object of the same length, and with scalars:

cat == cat_base
'''
0    False
1     True
2    False
dtype: bool
'''
cat == np.array([1, 2, 3])
'''
0    True
1    True
2    True
dtype: bool
'''
cat == 2
'''
0    False
1     True
2    False
dtype: bool
'''

Categoricals with different categories cannot be compared:

try:
    cat > cat_base2
except TypeError as e:
    print("TypeError:", str(e))
# TypeError: Categoricals can only be compared...
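
One way to make such a comparison possible is to give both sides the same categories and ordering first. The following is a minimal sketch, assuming the values of the second series all appear in the first series' categories; the names dtype and aligned are introduced here for illustration only.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

dtype = CategoricalDtype([3, 2, 1], ordered=True)
cat = pd.Series([1, 2, 3]).astype(dtype)
cat_base2 = pd.Series([2, 2, 2]).astype(CategoricalDtype(ordered=True))

# Re-cast the second series to the shared dtype so that categories
# and ordering match; values missing from the new categories become NaN.
aligned = cat_base2.astype(dtype)

result = cat > aligned
print(result.tolist())  # [True, False, False]
```

After the cast, both series carry the categories [3 < 2 < 1], so the ordered comparison is well defined.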

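To do a "non-equality" comparison between a categorical Series and a list-like object that is not categorical, you need to be explicit and convert the categorical data back to its original values: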

base = np.array([1, 2, 3])
try:
    cat > base
except TypeError as e:
    print("TypeError:", str(e))
# TypeError: Cannot compare a Categorical for op __gt__ with type <class 'numpy.ndarray'>.
# If you want to compare values, use 'np.asarray(cat) <op> other'.

np.asarray(cat) > base
# array([False, False, False])
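
The two possible interpretations of a custom ordering can be seen side by side in a small sketch. It rebuilds the same cat series so that it runs on its own; by_category_order and by_raw_value are illustrative names, not pandas API.

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

cat = pd.Series([1, 2, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))

# Interpretation 1: follow the category order 3 < 2 < 1,
# so only the value 1 counts as "greater than 2".
by_category_order = (cat > 2).tolist()

# Interpretation 2: ignore the categories and compare the raw integers,
# so only the value 3 is greater than 2.
by_raw_value = (np.asarray(cat) > 2).tolist()

print(by_category_order)  # [True, False, False]
print(by_raw_value)       # [False, False, True]
```

Because the two readings give different answers, pandas asks for an explicit conversion instead of guessing.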

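When you compare two unordered categoricals that have the same categories, the order of the categories is not taken into account: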

c1 = pd.Categorical(['a', 'b'], categories=['a', 'b'], ordered=False)
c2 = pd.Categorical(['a', 'b'], categories=['b', 'a'], ordered=False)
c1 == c2
# array([ True,  True])
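
A short sketch of why this works: the comparison is made on the values, not on the internal integer codes, which differ between the two objects because their categories are listed in a different order.

```python
import pandas as pd

c1 = pd.Categorical(['a', 'b'], categories=['a', 'b'], ordered=False)
c2 = pd.Categorical(['a', 'b'], categories=['b', 'a'], ordered=False)

# The internal codes are positions in each object's own category list.
codes1 = c1.codes.tolist()
codes2 = c2.codes.tolist()

# Equality is still evaluated value by value.
equal = (c1 == c2).tolist()

print(codes1, codes2)  # [0, 1] [1, 0]
print(equal)           # [True, True]
```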

Operations

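Apart from Series.min(), Series.max(), and Series.mode(), other operations are also possible with categorical data. Series methods such as Series.value_counts() use all categories, even if some of them do not appear in the data: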

s = pd.Series(pd.Categorical(["a", "b", "c", "c"],
                             categories=["c", "a", "b", "d"]))
s.value_counts()
'''
c    2
b    1
a    1
d    0
dtype: int64
'''
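
If the zero-count entries are not wanted, the unused categories can be dropped before counting. A minimal sketch using Series.cat.remove_unused_categories(); the variable names counts and trimmed are illustrative.

```python
import pandas as pd

s = pd.Series(pd.Categorical(["a", "b", "c", "c"],
                             categories=["c", "a", "b", "d"]))

# All four categories are counted, including the unused "d".
counts = s.value_counts()

# Dropping unused categories first leaves only the three observed ones.
trimmed = s.cat.remove_unused_categories().value_counts()

print(len(counts), len(trimmed))  # 4 3
```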

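groupby also shows "unused" categories (this relies on observed=False, which was the default in older pandas versions):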

cats = pd.Categorical(["a", "b", "b", "b", "c", "c", "c"],
                      categories=["a", "b", "c", "d"])
df = pd.DataFrame({"cats": cats, "values": [1, 2, 2, 2, 3, 4, 5]})
# observed=False keeps unused categories; newer pandas no longer defaults to it
df.groupby("cats", observed=False).mean()
'''
      values
cats
a        1.0
b        2.0
c        4.0
d        NaN
'''

cats2 = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])
df2 = pd.DataFrame({"cats": cats2,
                    "B": ["c", "d", "c", "d"],
                    "values": [1, 2, 3, 4]})

df2.groupby(["cats", "B"], observed=False).mean()
'''
        values
cats B
a    c     1.0
     d     2.0
b    c     3.0
     d     4.0
c    c     NaN
     d     NaN
'''
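
Whether unused categories appear in the result is controlled by the observed parameter of groupby. The sketch below contrasts the two settings on the first example; passing the parameter explicitly avoids depending on the default, which differs between pandas versions. The names all_cats and seen_only are illustrative.

```python
import pandas as pd

cats = pd.Categorical(["a", "b", "b", "b", "c", "c", "c"],
                      categories=["a", "b", "c", "d"])
df = pd.DataFrame({"cats": cats, "values": [1, 2, 2, 2, 3, 4, 5]})

# observed=False: one row per category, NaN for the unused "d".
all_cats = df.groupby("cats", observed=False).mean()

# observed=True: only categories that actually occur in the data.
seen_only = df.groupby("cats", observed=True).mean()

print(list(all_cats.index))   # ['a', 'b', 'c', 'd']
print(list(seen_only.index))  # ['a', 'b', 'c']
```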

Pivot tables:

raw_cat = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])
df = pd.DataFrame({"A": raw_cat,
                   "B": ["c", "d", "c", "d"],
                   "values": [1, 2, 3, 4]})
pd.pivot_table(df, values='values', index=['A', 'B'])
'''
     values
A B
a c       1
  d       2
b c       3
  d       4
'''

Last updated: 2020-06-25 12:55:24. Tags: pandas, categorical data