pandas value_counts(): counting unique values

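The value_counts() method of pandas objects is most often used to count how many times each value occurs in a Series. It returns a new Series whose index holds the distinct values of the original and whose values are their counts.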

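value_counts in statistics

In statistical terms, value_counts is a frequency count: for a set of data, it computes how many times each distinct value occurs. Frequency counts are widely used to describe and analyze the characteristics and distribution of data.

For example, take a dataset containing the following values: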

[1, 2, 3, 2, 1, 1, 4, 3, 2, 1, 3, 3, 4, 2, 1]

The value counts for this dataset are:

1 occurs 5 times
2 occurs 4 times
3 occurs 4 times
4 occurs 2 times

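The value_counts() method in pandas implements this count and is available on both Series and DataFrame objects. On a Series it returns a new Series holding the number of occurrences of each distinct value. On a DataFrame it counts distinct rows, that is, each unique combination of column values; to count the values of a single column, select that column first (for example df['col'].value_counts()) or name it in the subset parameter.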
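As a minimal sketch, the tally from the statistics example can be reproduced by putting the list into a Series and calling value_counts(). The variable names data and counts are chosen here only for illustration.

```python
import pandas as pd

# The dataset from the statistics example
data = [1, 2, 3, 2, 1, 1, 4, 3, 2, 1, 3, 3, 4, 2, 1]

# Index = distinct values, values = number of occurrences
counts = pd.Series(data).value_counts()

print(counts.loc[1], counts.loc[2], counts.loc[3], counts.loc[4])  # 5 4 4 2
```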

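In data analysis and exploration, value_counts() is a quick way to see how the data is distributed and to get a first impression of its characteristics. It can also help surface suspicious entries, such as rare or misspelled categories, which can then be reviewed and cleaned to improve data quality.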
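One way this might look in practice: the sketch below uses a made-up column of city names and flags values that occur only once as candidates for review. Both the data and the cutoff of 1 are illustrative assumptions, not part of the original tutorial.

```python
import pandas as pd

# Hypothetical data: "Bejing" is a misspelling hiding among valid entries
cities = pd.Series(["Beijing", "Shanghai", "Beijing", "Bejing",
                    "Shanghai", "Beijing", "Shanghai"])

counts = cities.value_counts()

# Values seen only once are worth a manual look; the cutoff is arbitrary
rare = counts[counts <= 1]
print(rare.index.tolist())  # ['Bejing']
```

A rare value is not necessarily an error, so a count like this only points to entries that deserve a closer look.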

Series

Because a Series is a one-dimensional array, it is the object on which value_counts() is most commonly used.

Syntax

Its signature is:

Series.value_counts(
        self,
        normalize: bool = False,
        sort: bool = True,
        ascending: bool = False,
        bins=None,
        dropna: bool = True,
    ) -> Series

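Parameters:

normalize : bool, default False. If True, the returned object contains the relative frequencies (proportions) of the unique values instead of counts.
sort : bool, default True. Whether to sort by frequency.
ascending : bool, default False. Whether to sort in ascending order.
bins : int, optional. Instead of counting individual values, group them into half-open bins and count per bin. This is a convenience for pd.cut and only works with numeric data.
dropna : bool, default True. Whether to exclude NaN from the counts.

It returns a Series containing the counts of unique values. By default the result is sorted in descending order, so the first element is the most frequent value, and NA values are excluded. In pandas 2.0 and later the returned Series is also named count (proportion when normalize=True), so the printed footer there includes a Name field.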

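Examples

Here are some examples. The first calls value_counts() on an Index, which accepts the same parameters as a Series:

import numpy as np
import pandas as pd

index = pd.Index([3, 1, 2, 3, 4, np.nan])
index.value_counts()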
'''
3.0    2
1.0    1
2.0    1
4.0    1
dtype: int64
'''

With normalize=True, each count is divided by the sum of all counts, so the result holds relative frequencies (the share of each value, summing to 1).

s = pd.Series([3, 1, 2, 3, 4, np.nan])
s.value_counts(normalize=True)
'''
3.0    0.4
1.0    0.2
2.0    0.2
4.0    0.2
dtype: float64
'''
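Note that these proportions are computed over the five non-missing values only. A small sketch, using the same values, of how dropna=False changes the denominator:

```python
import numpy as np
import pandas as pd

s = pd.Series([3, 1, 2, 3, 4, np.nan])

# NaN excluded: 3.0 accounts for 2 of 5 values
without_nan = s.value_counts(normalize=True)

# NaN included: 3.0 accounts for 2 of 6 values
with_nan = s.value_counts(normalize=True, dropna=False)

print(without_nan.loc[3.0])      # 0.4
print(round(with_nan.max(), 4))  # 0.3333
```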

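The bins parameter turns a continuous variable into a categorical one: the values are split into the given number of half-open intervals, and the number of values falling into each interval is counted.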

s.value_counts(bins=3)
'''
(0.996, 2.0]    2
(2.0, 3.0]      2
(3.0, 4.0]      1
dtype: int64
'''
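Since bins is described as a convenience for pd.cut, the per-bin counts can also be obtained by hand. The sketch below assumes that value_counts(bins=3) corresponds to pd.cut with include_lowest=True, which is consistent with the interval edges shown in the output; treat it as an illustration rather than a statement about pandas internals.

```python
import numpy as np
import pandas as pd

s = pd.Series([3, 1, 2, 3, 4, np.nan])

# Shortcut: bin and count in one call
binned_counts = s.value_counts(bins=3)

# Manual route: assign each value to one of 3 equal-width intervals, then count
intervals = pd.cut(s, bins=3, include_lowest=True)
manual_counts = intervals.value_counts()

print(sorted(binned_counts.tolist()))  # [1, 2, 2]
print(sorted(manual_counts.tolist()))  # [1, 2, 2]
```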

The dropna parameter: with dropna=False, NaN also appears in the index with its own count.

s.value_counts(dropna=False)
'''
3.0    2
1.0    1
2.0    1
4.0    1
NaN    1
dtype: int64
'''
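The sort and ascending parameters are not exercised in the Series examples. A minimal sketch, reusing the same values:

```python
import numpy as np
import pandas as pd

s = pd.Series([3, 1, 2, 3, 4, np.nan])

# Least frequent values first; the most frequent value (3.0) ends up last
asc = s.value_counts(ascending=True)
print(asc.index[-1], asc.iloc[-1])  # 3.0 2

# sort=False skips the frequency sort; the counts themselves are unchanged
unsorted = s.value_counts(sort=False)
print(unsorted.loc[3.0])  # 2
```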

DataFrame

DataFrame.value_counts() follows the same logic as the Series method, except that it returns a Series with a MultiIndex.

Syntax

Its signature on a DataFrame is:

DataFrame.value_counts(
        self,
        subset: Sequence[Hashable] | None = None,
        normalize: bool = False,
        sort: bool = True,
        ascending: bool = False,
        dropna: bool = True,
    ) -> Series

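It returns a Series containing the counts of unique rows in the DataFrame.

Compared with the Series method it has one additional parameter, subset (and no bins parameter): an optional list-like naming the columns to use when counting unique combinations.

The returned Series has a MultiIndex with one level per input column. By default, rows that contain any NA value are omitted from the result, and the result is sorted in descending order so that the first element is the most frequently occurring row.

Examples

Some DataFrame examples: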

df = pd.DataFrame({'num_legs': [2, 4, 4, 6],
                   'num_wings': [2, 0, 0, 0]},
                  index=['falcon', 'dog', 'cat', 'ant'])
df
'''
        num_legs  num_wings
falcon         2          2
dog            4          0
cat            4          0
ant            6          0
'''

df.value_counts()
'''
num_legs  num_wings
4         0            2
2         2            1
6         0            1
dtype: int64
'''

df.value_counts(sort=False)
'''
num_legs  num_wings
2         2            1
4         0            2
6         0            1
dtype: int64
'''

df.value_counts(ascending=True)
'''
num_legs  num_wings
2         2            1
6         0            1
4         0            2
dtype: int64
'''

df.value_counts(normalize=True)
'''
num_legs  num_wings
4         0            0.50
2         2            0.25
6         0            0.25
dtype: float64
'''

With dropna=False, rows that contain NA values are counted as well.

df = pd.DataFrame({'first_name': ['John', 'Anne', 'John', 'Beth'],
                   'middle_name': ['Smith', pd.NA, pd.NA, 'Louise']})
df
'''
  first_name middle_name
0       John       Smith
1       Anne        <NA>
2       John        <NA>
3       Beth      Louise
'''

df.value_counts()
'''
first_name  middle_name
Beth        Louise         1
John        Smith          1
dtype: int64
'''

df.value_counts(dropna=False)
'''
first_name  middle_name
Anne        NaN            1
Beth        Louise         1
John        Smith          1
            NaN            1
dtype: int64
'''
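The subset parameter is described in this section but not used in the examples. A minimal sketch on the same names DataFrame, restricting the count to one column:

```python
import pandas as pd

df = pd.DataFrame({'first_name': ['John', 'Anne', 'John', 'Beth'],
                   'middle_name': ['Smith', pd.NA, pd.NA, 'Louise']})

# Count only the first_name column; middle_name (and its NA values) is ignored
by_first = df.value_counts(subset=['first_name'])

print(by_first.iloc[0])  # 2, the count for John
print(by_first.sum())    # 4, no row is dropped because first_name has no NA
```

Compare this with df.value_counts() on the full frame, where the two rows with a missing middle_name are dropped by default.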

Index

Index.value_counts() has the same syntax and usage as the Series method. Its signature is:

Index.value_counts(
        self,
        normalize: bool = False,
        sort: bool = True,
        ascending: bool = False,
        bins=None,
        dropna: bool = True,
    ) -> Series

An example:

index = pd.Index([3, 1, 2, 3, 4, np.nan])
index.value_counts()
'''
3.0    2
1.0    1
2.0    1
4.0    1
dtype: int64
'''

With normalize set to True (shown here on a Series holding the same values):

s = pd.Series([3, 1, 2, 3, 4, np.nan])
s.value_counts(normalize=True)
'''
3.0    0.4
1.0    0.2
2.0    0.2
4.0    0.2
dtype: float64
'''

The bins parameter also works as it does for a Series.

s.value_counts(bins=3)
'''
(0.996, 2.0]    2
(2.0, 3.0]      2
(3.0, 4.0]      1
dtype: int64
'''

The dropna parameter also works as it does for a Series.

s.value_counts(dropna=False)
'''
3.0    2
1.0    1
2.0    1
4.0    1
NaN    1
dtype: int64
'''
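The last three examples call the method on a Series. As a sketch of the same parameters applied directly to the Index object:

```python
import numpy as np
import pandas as pd

index = pd.Index([3, 1, 2, 3, 4, np.nan])

# Relative frequencies computed directly on the Index
proportions = index.value_counts(normalize=True)
print(proportions.iloc[0])  # 0.4, the share of 3.0

# Keep NaN as its own entry
with_nan = index.value_counts(dropna=False)
print(with_nan.sum())  # 6
```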

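The grouped object DataFrameGroupBy

DataFrame groupby objects also support value_counts(), which returns a Series or DataFrame containing the counts of unique rows. This feature was added in pandas 1.4.

Syntax

The signature of value_counts() on a DataFrame groupby object is: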

DataFrameGroupBy.value_counts(
        self,
        subset: Sequence[Hashable] | None = None,
        normalize: bool = False,
        sort: bool = True,
        ascending: bool = False,
        dropna: bool = True,
    ) -> DataFrame | Series

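Parameters:

subset : list-like, optional. The columns to use when counting unique combinations.
normalize : bool, default False. Whether to return proportions rather than frequencies.
sort : bool, default True. Whether to sort by frequency.
ascending : bool, default False. Whether to sort in ascending order.
dropna : bool, default True. Whether to exclude rows that contain NA values from the counts.

Notes:

If the groupby as_index parameter is True, the returned Series has a MultiIndex with one level per input column.
If the groupby as_index parameter is False, the returned DataFrame has an additional column holding the result. That column is labeled "count" or "proportion", depending on the normalize parameter.
By default, rows that contain any NA value are omitted from the result.
By default, the result is sorted in descending order so that the first element of each group is the most frequently occurring row.

Examples

Some groupby examples: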

df = pd.DataFrame({
   'gender': ['male', 'male', 'female', 'male', 'female', 'male'],
   'education': ['low', 'medium', 'high', 'low', 'high', 'low'],
   'country': ['US', 'FR', 'US', 'FR', 'FR', 'FR']
})

df
'''
   gender education country
0    male       low      US
1    male    medium      FR
2  female      high      US
3    male       low      FR
4  female      high      FR
5    male       low      FR
'''

df.groupby('gender').value_counts()
'''
gender  education  country
female  high       FR         1
                   US         1
male    low        FR         2
                   US         1
        medium     FR         1
dtype: int64
'''

df.groupby('gender').value_counts(ascending=True)
'''
gender  education  country
female  high       FR         1
                   US         1
male    low        US         1
        medium     FR         1
        low        FR         2
dtype: int64
'''

df.groupby('gender').value_counts(normalize=True)
'''
gender  education  country
female  high       FR         0.50
                   US         0.50
male    low        FR         0.50
                   US         0.25
        medium     FR         0.25
dtype: float64
'''

df.groupby('gender', as_index=False).value_counts()
'''
   gender education country  count
0  female      high      FR      1
1  female      high      US      1
2    male       low      FR      2
3    male       low      US      1
4    male    medium      FR      1
'''

df.groupby('gender', as_index=False).value_counts(normalize=True)
'''
   gender education country  proportion
0  female      high      FR        0.50
1  female      high      US        0.50
2    male       low      FR        0.50
3    male       low      US        0.25
4    male    medium      FR        0.25
'''
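As a cross-check of what the grouped count computes, the sketch below compares it with grouping on all three columns and taking the group sizes. This is only an illustration of the equivalence on this small dataset (which has no missing values), and it assumes pandas 1.4 or later.

```python
import pandas as pd

df = pd.DataFrame({
    'gender': ['male', 'male', 'female', 'male', 'female', 'male'],
    'education': ['low', 'medium', 'high', 'low', 'high', 'low'],
    'country': ['US', 'FR', 'US', 'FR', 'FR', 'FR'],
})

# Counts of unique (education, country) rows within each gender
vc = df.groupby('gender').value_counts().sort_index()

# Same counts obtained by grouping on all three columns
sizes = df.groupby(['gender', 'education', 'country']).size().sort_index()

print(vc.tolist())     # [1, 1, 2, 1, 1]
print(sizes.tolist())  # [1, 1, 2, 1, 1]
```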

SeriesGroupBy

SeriesGroupBy.value_counts() works much like DataFrameGroupBy.value_counts().
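A minimal sketch: selecting a single column from a groupby yields a SeriesGroupBy, whose value_counts() counts that column's values within each group. The data reuses two columns from the DataFrameGroupBy example.

```python
import pandas as pd

df = pd.DataFrame({
    'gender': ['male', 'male', 'female', 'male', 'female', 'male'],
    'education': ['low', 'medium', 'high', 'low', 'high', 'low'],
})

# Select one column from the groupby to get a SeriesGroupBy, then count
edu_counts = df.groupby('gender')['education'].value_counts()

print(edu_counts.loc[('male', 'low')])     # 3
print(edu_counts.loc[('female', 'high')])  # 2
```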

Supported objects

As a method, value_counts() is available on the following objects:

pandas.Series
pandas.DataFrame
pandas.Index
pandas.core.groupby.SeriesGroupBy
pandas.core.groupby.DataFrameGroupBy

References

https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html
https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.value_counts.html