pandas resample(): resampling time series data

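The pandas resample() method resamples time series data: it regroups the data at a specified time frequency so that it can be aggregated (downsampling) or expanded (upsampling). It is available on DataFrame and Series objects and can also be used on GroupBy objects.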

Data types

Mainly used with pandas.DataFrame and pandas.Series objects.
Also available on DataFrameGroupBy and SeriesGroupBy objects.

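Syntax

Resample time series data.

A convenience method for frequency conversion and resampling of time series. The object must have a datetime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex), or the caller must pass the label of a datetime-like series/index to the on/level keyword parameter.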

DataFrame.resample(
    rule, 
    axis=_NoDefault.no_default, 
    closed=None, 
    label=None, 
    convention=_NoDefault.no_default, 
    kind=_NoDefault.no_default, 
    on=None, 
    level=None, 
    origin='start_day', 
    offset=None, 
    group_keys=False
)

DataFrameGroupBy.resample(rule, *args, include_groups=True, **kwargs)
SeriesGroupBy.resample(rule, *args, include_groups=True, **kwargs)

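Parameters

rule:
- Type: str or DateOffset
- Description: the target frequency, for example 'D' for daily or 'ME' for month-end resampling. Other common aliases include 'h' (hour), 'min' (minute), and 's' (second). Since pandas 2.2 the older aliases 'M', 'H', 'T', and 'S' are deprecated in favor of 'ME', 'h', 'min', and 's'.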
axis:
- Type: int or str
- Description: the axis to resample along; defaults to 0 (rows).
- Deprecated since version 2.0.0: use frame.T.resample(…) instead.
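closed:
- Type: {'right', 'left'}
- Description: which side of each bin interval is closed. The default is 'left' for all frequency offsets except 'ME', 'YE', 'QE', 'BME', 'BQE', and 'W', which default to 'right'.
label:
- Type: {'right', 'left'}
- Description: which bin edge is used to label the bin (that is, whether the timestamp represents the start or the end of the interval). The default follows the same rule as closed: 'left' for all frequency offsets except 'ME', 'YE', 'QE', 'BME', 'BQE', and 'W', which default to 'right'.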
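convention:
- Type: {'start', 'end'}
- Description: for a PeriodIndex only, controls whether the start or the end of rule is used; defaults to 'start'.
- Deprecated since version 2.2.0: convert the PeriodIndex to a DatetimeIndex before resampling instead.
kind:
- Type: {'timestamp', 'period'}
- Description: pass 'timestamp' to convert the resulting index to a DatetimeIndex, or 'period' to convert it to a PeriodIndex. By default the input representation is retained.
- Deprecated since version 2.2.0: convert the index to the desired type explicitly instead.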
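on:
- Type: str
- Description: for a DataFrame, the column to use for resampling instead of the index. The column must be datetime-like.
level:
- Type: int or str
- Description: for a MultiIndex, the level (name or number) to use for resampling. The level must be datetime-like.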
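origin:
- Type: one of {'epoch', 'start', 'start_day', 'end', 'end_day'}, a Timestamp, or a timestamp string; defaults to 'start_day'
- Description: the timestamp on which the bins are anchored. 'epoch' is 1970-01-01, 'start' is the first value of the time series, 'start_day' is midnight of the first day, 'end' is the last value of the time series, and 'end_day' is the ceiling midnight of the last day.
offset:
- Type: Timedelta or str
- Description: an offset timedelta added to the origin; defaults to None.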
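group_keys:
- Type: bool
- Description: whether to include the group keys in the result index when calling .apply() on the resampled object; defaults to False.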
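The defaults of closed and label depend on the frequency, which is easy to overlook. The following is a minimal sketch (the variable names and sample data are made up for illustration) that checks two cases: a fixed-width frequency such as '3min' behaves like closed='left', label='left', while the weekly frequency 'W' behaves like closed='right', label='right'.

```python
import pandas as pd

# Nine one-minute observations. For a fixed-width frequency such as '3min'
# the defaults are closed='left' and label='left'.
minutes = pd.Series(range(9),
                    index=pd.date_range("2000-01-01", periods=9, freq="min"))
default_3min = minutes.resample("3min").sum()
explicit_3min = minutes.resample("3min", closed="left", label="left").sum()

# Fourteen daily observations starting on Monday 2024-01-01. For the weekly
# frequency 'W' (weeks ending on Sunday) the defaults are closed='right'
# and label='right'.
days = pd.Series(1, index=pd.date_range("2024-01-01", periods=14, freq="D"))
default_week = days.resample("W").sum()
explicit_week = days.resample("W", closed="right", label="right").sum()

print(default_3min.equals(explicit_3min))  # True
print(default_week.equals(explicit_week))  # True
print(default_week)  # two bins labeled 2024-01-07 and 2024-01-14, 7 days each
```

When the result must not depend on these frequency-specific defaults, passing closed and label explicitly is the safer choice.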

Return value

Returns a Resampler object, which behaves much like a GroupBy object and supports aggregations such as mean(), sum(), and count().
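Because the Resampler is lazy, like a GroupBy object, nothing is computed until an aggregation is requested. The sketch below, using illustrative sample data, keeps the Resampler in a variable and then asks for several aggregations at once with agg().

```python
import pandas as pd

index = pd.date_range("2000-01-01", periods=9, freq="min")
series = pd.Series(range(9), index=index)

# Calling resample() only defines the 3-minute bins; no result is computed yet.
resampler = series.resample("3min")

# Request several aggregations in one call; each becomes a column.
stats = resampler.agg(["sum", "mean", "max"])
print(stats)
```

Each row of stats corresponds to one 3-minute bin, labeled by its left edge.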

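Use cases

Downsampling: converting high-frequency time series data to a lower frequency, for example minute-level data to daily data.
Upsampling: converting low-frequency time series data to a higher frequency, for example daily data to minute-level data; this usually requires filling the missing values that appear.
Rolling-window calculations: applying rolling-window computations to the resampled result.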
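The three use cases above can be sketched on one small series. The data and variable names here are illustrative only.

```python
import pandas as pd

index = pd.date_range("2024-01-01", periods=6, freq="min")
s = pd.Series([1, 2, 3, 4, 5, 6], index=index)

# Downsampling: 1-minute data to 2-minute bins, summing each bin.
down = s.resample("2min").sum()

# Upsampling: 1-minute data to 30-second steps, forward-filling the gaps.
up = s.resample("30s").ffill()

# Rolling window on top of the resampled result: mean of 2 consecutive bins.
rolled = down.rolling(window=2).mean()

print(down.tolist())    # [3, 7, 11]
print(len(up))          # 11
print(rolled.tolist())  # [nan, 5.0, 9.0]
```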

Examples

First, create a Series with 9 one-minute timestamps.

import numpy as np
import pandas as pd

index = pd.date_range('1/1/2000', periods=9, freq='min')
series = pd.Series(range(9), index=index)
series
'''
2000-01-01 00:00:00    0
2000-01-01 00:01:00    1
2000-01-01 00:02:00    2
2000-01-01 00:03:00    3
2000-01-01 00:04:00    4
2000-01-01 00:05:00    5
2000-01-01 00:06:00    6
2000-01-01 00:07:00    7
2000-01-01 00:08:00    8
Freq: min, dtype: int64
'''

Downsample the series into 3-minute bins and sum the values of the timestamps falling into each bin.

series.resample('3min').sum()
'''
2000-01-01 00:00:00     3
2000-01-01 00:03:00    12
2000-01-01 00:06:00    21
Freq: 3min, dtype: int64
'''

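As above, downsample the series into 3-minute bins, but label each bin with its right edge instead of its left. Note that the value at the timestamp used as the label is not included in the bin it labels. For example, in the original series the timestamp 2000-01-01 00:03:00 holds the value 3, but the sum in the resampled bin labeled 2000-01-01 00:03:00 does not include 3 (if it did, the sum would be 6, not 3).

series.resample('3min', label='right').sum()
'''
2000-01-01 00:03:00     3
2000-01-01 00:06:00    12
2000-01-01 00:09:00    21
Freq: 3min, dtype: int64
'''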

To include this value, close the right side of the bin interval, as follows.

series.resample('3min', label='right', closed='right').sum()
'''
2000-01-01 00:00:00     0
2000-01-01 00:03:00     6
2000-01-01 00:06:00    15
2000-01-01 00:09:00    15
Freq: 3min, dtype: int64
'''

Upsample the series into 30-second bins.

series.resample('30s').asfreq()[0:5]   # select the first 5 rows
'''
2000-01-01 00:00:00   0.0
2000-01-01 00:00:30   NaN
2000-01-01 00:01:00   1.0
2000-01-01 00:01:30   NaN
2000-01-01 00:02:00   2.0
Freq: 30s, dtype: float64
'''

Upsample the series into 30-second bins and fill the NaN values using the ffill method.

series.resample('30s').ffill()[0:5]
'''
2000-01-01 00:00:00    0
2000-01-01 00:00:30    0
2000-01-01 00:01:00    1
2000-01-01 00:01:30    1
2000-01-01 00:02:00    2
Freq: 30s, dtype: int64
'''

Upsample the series into 30-second bins and fill the NaN values using the bfill method.

series.resample('30s').bfill()[0:5]
'''
2000-01-01 00:00:00    0
2000-01-01 00:00:30    1
2000-01-01 00:01:00    1
2000-01-01 00:01:30    2
2000-01-01 00:02:00    2
Freq: 30s, dtype: int64
'''
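Besides ffill and bfill, the gaps created by upsampling can be filled by interpolation. The following is a minimal sketch using Resampler.interpolate() with its default linear method on the same nine-minute series; whether interpolation is appropriate depends on the data, so treat it as one option rather than a recommendation.

```python
import pandas as pd

index = pd.date_range("2000-01-01", periods=9, freq="min")
series = pd.Series(range(9), index=index)

# Upsample to 30-second steps and fill the new points by linear interpolation.
filled = series.resample("30s").interpolate()

print(filled[0:5].tolist())  # [0.0, 0.5, 1.0, 1.5, 2.0]
```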

Pass a custom function via apply.

def custom_resampler(arraylike):
    return np.sum(arraylike) + 5

series.resample('3min').apply(custom_resampler)
'''
2000-01-01 00:00:00     8
2000-01-01 00:03:00    17
2000-01-01 00:06:00    26
Freq: 3min, dtype: int64
'''

For DataFrame objects, the keyword on can be used to specify a column to resample on instead of the index.

d = {'price': [10, 11, 9, 13, 14, 18, 17, 19],
     'volume': [50, 60, 40, 100, 50, 100, 40, 50]}
df = pd.DataFrame(d)
df['week_starting'] = pd.date_range('01/01/2018',
                                    periods=8,
                                    freq='W')
df
'''
   price  volume week_starting
0     10      50    2018-01-07
1     11      60    2018-01-14
2      9      40    2018-01-21
3     13     100    2018-01-28
4     14      50    2018-02-04
5     18     100    2018-02-11
6     17      40    2018-02-18
7     19      50    2018-02-25
'''

df.resample('ME', on='week_starting').mean()
'''
               price  volume
week_starting
2018-01-31     10.75    62.5
2018-02-28     17.00    60.0
'''
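When resampling on a column, different columns often need different aggregations. The sketch below assumes a hypothetical frame with a datetime column named when and aggregates price by mean and volume by sum in 3-day bins; the column names and data are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "when": pd.date_range("2018-01-01", periods=6, freq="D"),
    "price": [10, 11, 9, 13, 14, 18],
    "volume": [50, 60, 40, 100, 50, 100],
})

# Resample on the 'when' column and aggregate each column differently.
out = df.resample("3D", on="when").agg({"price": "mean", "volume": "sum"})
print(out)
```

The column passed to on becomes the index of the result and is not aggregated itself.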

For a DataFrame with a MultiIndex, the keyword level can be used to specify the level on which the resampling takes place.

days = pd.date_range('1/1/2000', periods=4, freq='D')
d2 = {'price': [10, 11, 9, 13, 14, 18, 17, 19],
      'volume': [50, 60, 40, 100, 50, 100, 40, 50]}
df2 = pd.DataFrame(
    d2,
    index=pd.MultiIndex.from_product(
        [days, ['morning', 'afternoon']]
    )
)
df2
'''
                      price  volume
2000-01-01 morning       10      50
           afternoon     11      60
2000-01-02 morning        9      40
           afternoon     13     100
2000-01-03 morning       14      50
           afternoon     18     100
2000-01-04 morning       17      40
           afternoon     19      50
'''

df2.resample('D', level=0).sum()
'''
            price  volume
2000-01-01     21     110
2000-01-02     22     140
2000-01-03     32     150
2000-01-04     36      90
'''
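The level can also be given by name when the MultiIndex levels are named. A small sketch, with the level names day and session chosen purely for illustration:

```python
import pandas as pd

days = pd.date_range("2000-01-01", periods=2, freq="D")
idx = pd.MultiIndex.from_product(
    [days, ["morning", "afternoon"]], names=["day", "session"]
)
df = pd.DataFrame({"price": [10, 11, 9, 13]}, index=idx)

# Referring to the level by name gives the same result as its position.
by_name = df.resample("D", level="day").sum()
by_position = df.resample("D", level=0).sum()

print(by_name["price"].tolist())    # [21, 22]
print(by_name.equals(by_position))  # True
```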

To adjust the start of the bins based on a fixed timestamp:

start, end = '2000-10-01 23:30:00', '2000-10-02 00:30:00'
rng = pd.date_range(start, end, freq='7min')
ts = pd.Series(np.arange(len(rng)) * 3, index=rng)
ts
'''
2000-10-01 23:30:00     0
2000-10-01 23:37:00     3
2000-10-01 23:44:00     6
2000-10-01 23:51:00     9
2000-10-01 23:58:00    12
2000-10-02 00:05:00    15
2000-10-02 00:12:00    18
2000-10-02 00:19:00    21
2000-10-02 00:26:00    24
Freq: 7min, dtype: int64
'''

ts.resample('17min').sum()
'''
2000-10-01 23:14:00     0
2000-10-01 23:31:00     9
2000-10-01 23:48:00    21
2000-10-02 00:05:00    54
2000-10-02 00:22:00    24
Freq: 17min, dtype: int64
'''

ts.resample('17min', origin='epoch').sum()
'''
2000-10-01 23:18:00     0
2000-10-01 23:35:00    18
2000-10-01 23:52:00    27
2000-10-02 00:09:00    39
2000-10-02 00:26:00    24
Freq: 17min, dtype: int64
'''

ts.resample('17min', origin='2000-01-01').sum()
'''
2000-10-01 23:24:00     3
2000-10-01 23:41:00    15
2000-10-01 23:58:00    45
2000-10-02 00:15:00    45
Freq: 17min, dtype: int64
'''

To adjust the start of the bins with an offset Timedelta, the following two calls are equivalent:

ts.resample('17min', origin='start').sum()
'''
2000-10-01 23:30:00     9
2000-10-01 23:47:00    21
2000-10-02 00:04:00    54
2000-10-02 00:21:00    24
Freq: 17min, dtype: int64
'''

ts.resample('17min', offset='23h30min').sum()
'''
2000-10-01 23:30:00     9
2000-10-01 23:47:00    21
2000-10-02 00:04:00    54
2000-10-02 00:21:00    24
Freq: 17min, dtype: int64
'''
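The equivalence can be checked in code. With the default origin='start_day', the bins are anchored at midnight of the first day (2000-10-01 00:00:00), so adding an offset of 23h30min moves the anchor to 23:30:00, which is exactly the first timestamp that origin='start' uses. A short sketch:

```python
import numpy as np
import pandas as pd

start, end = "2000-10-01 23:30:00", "2000-10-02 00:30:00"
rng = pd.date_range(start, end, freq="7min")
ts = pd.Series(np.arange(len(rng)) * 3, index=rng)

# Anchor the bins at the first timestamp of the series ...
by_origin = ts.resample("17min", origin="start").sum()
# ... or at midnight of the first day plus 23h30min.
by_offset = ts.resample("17min", offset="23h30min").sum()

print(by_origin.equals(by_offset))  # True
print(by_origin.tolist())           # [9, 21, 54, 24]
```

The two calls only coincide here because the series starts at 23:30:00; for other data the offset has to be chosen accordingly.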

To use the largest timestamp as the end of the bins:

ts.resample('17min', origin='end').sum()
'''
2000-10-01 23:35:00     0
2000-10-01 23:52:00    18
2000-10-02 00:09:00    27
2000-10-02 00:26:00    63
Freq: 17min, dtype: int64
'''

In contrast to start_day, end_day takes the ceiling midnight of the largest timestamp as the end of the bins and drops the bins that contain no data:

ts.resample('17min', origin='end_day').sum()
'''
2000-10-01 23:38:00     3
2000-10-01 23:55:00    15
2000-10-02 00:12:00    45
2000-10-02 00:29:00    45
Freq: 17min, dtype: int64
'''

Examples with GroupBy objects:

idx = pd.date_range('1/1/2000', periods=4, freq='min')
df = pd.DataFrame(data=4 * [range(2)],
                  index=idx,
                  columns=['a', 'b'])
df.iloc[2, 0] = 5
df
'''
                    a  b
2000-01-01 00:00:00  0  1
2000-01-01 00:01:00  0  1
2000-01-01 00:02:00  5  1
2000-01-01 00:03:00  0  1
'''

Downsample the DataFrame into 3-minute bins and sum the values of the timestamps falling into each bin.

df.groupby('a').resample('3min', include_groups=False).sum()
'''
                         b
a
0   2000-01-01 00:00:00  2
    2000-01-01 00:03:00  1
5   2000-01-01 00:00:00  1
'''

Upsample the DataFrame into 30-second bins.

df.groupby('a').resample('30s', include_groups=False).sum()
'''
                         b
a
0   2000-01-01 00:00:00  1
    2000-01-01 00:00:30  0
    2000-01-01 00:01:00  1
    2000-01-01 00:01:30  0
    2000-01-01 00:02:00  0
    2000-01-01 00:02:30  0
    2000-01-01 00:03:00  1
5   2000-01-01 00:02:00  1
'''

Resample by month. Values are assigned to the month of the period.

df.groupby('a').resample('ME', include_groups=False).sum()
'''
                b
a
0   2000-01-31  3
5   2000-01-31  1
'''

As above, downsample into 3-minute bins, but close the right side of the bin interval.

(
    df.groupby('a')
    .resample('3min', closed='right', include_groups=False)
    .sum()
)
'''
                         b
a
0   1999-12-31 23:57:00  1
    2000-01-01 00:00:00  2
5   2000-01-01 00:00:00  1
'''

Downsample into 3-minute bins with the right side of the bin interval closed, and label each bin with its right edge instead of its left.

(
    df.groupby('a')
    .resample('3min', closed='right', label='right', include_groups=False)
    .sum()
)
'''
                         b
a
0   2000-01-01 00:00:00  1
    2000-01-01 00:03:00  2
5   2000-01-01 00:03:00  1
'''
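An alternative to include_groups=False is to select the column of interest before resampling, which yields a Series indexed by group key and bin. The following sketch rebuilds the same four-row frame and is meant as one possible variant, not the only way to write it.

```python
import pandas as pd

idx = pd.date_range("2000-01-01", periods=4, freq="min")
df = pd.DataFrame({"a": [0, 0, 5, 0], "b": [1, 1, 1, 1]}, index=idx)

# Group by 'a', keep only column 'b', then sum it in 3-minute bins.
out = df.groupby("a")["b"].resample("3min").sum()

print(out.tolist())                            # [2, 1, 1]
print(out.index.get_level_values("a").tolist())  # [0, 0, 5]
```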

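Case study

The following example uses resample() to resample daily data by month.

import pandas as pd
import numpy as np

# Create a sample DataFrame with one row per day
date_rng = pd.date_range(start='2024-01-01', end='2024-03-31', freq='D')
df = pd.DataFrame(date_rng, columns=['date'])
df['data'] = np.random.randint(0, 100, size=(len(date_rng)))

# Set the 'date' column as the index
df.set_index('date', inplace=True)

print("Original DataFrame:")
print(df.head())

# Resample by month end and sum each month's data
# ('ME' replaces the 'M' alias, which is deprecated since pandas 2.2)
df_resampled = df.resample('ME').sum()

print("\nDataFrame resampled by month:")
print(df_resampled)

Output (the values differ from run to run because the data is random):

Original DataFrame:
            data
date            
2024-01-01    24
2024-01-02    43
2024-01-03    95
2024-01-04    58
2024-01-05    26

DataFrame resampled by month:
            data
date            
2024-01-31  1420
2024-02-29  1267
2024-03-31  1546

Explanation:

Creating the time series data: we first create a DataFrame indexed by date that holds random integers.
Resampling by month: we call resample('ME').sum() to group the daily data by month and compute each month's total.

This method is very useful when working with time series data, particularly when you need to aggregate the data or analyze it over different time periods.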
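Since the case study uses random data, its totals cannot be checked by hand. The sketch below is a deterministic variant under a simplifying assumption: every day carries the value 1, so each monthly sum is just the number of days in that month. It uses the month-start alias 'MS', which labels each bin with the first day of the month rather than the last.

```python
import pandas as pd

days = pd.date_range("2024-01-01", "2024-03-31", freq="D")
df = pd.DataFrame({"data": 1}, index=days)

# Sum per calendar month; bins are labeled with the month's first day.
monthly = df.resample("MS").sum()

print(monthly["data"].tolist())  # [31, 29, 31] (2024 is a leap year)
```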

References

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html
https://pandas.pydata.org/docs/reference/api/pandas.Series.resample.html
https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.resample.html
https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.SeriesGroupBy.resample.html


Last updated: 2024-08-15 08:32:20. Tags: pandas, python, time series resampling