pandas: Upsampling with Linear Interpolation


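When processing data, we often need to resample it, either because too little data was collected or because its granularity is too coarse. In this example we implement an operation similar to super-resolution: inserting new, interpolated rows between the existing ones.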

Data and requirements

Here is our source data:

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [10, 20, 50, 40, 80,],
                   'B': [2, 8, 10, 6, 4, ],
                  })

df
'''
    A   B
0  10   2
1  20   8
2  50  10
3  40   6
4  80   4
'''

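The data has 5 rows. We want to expand it so that each step between two adjacent rows is split into three. For example, two new rows are added between rows 0 and 1. The final data has 13 rows: the 5 original rows plus 2 new rows in each of the 4 gaps.

The newly added rows are filled in by linear interpolation across the whole table.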
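As a rough check of what the requirement means, the expected values for column A can be computed directly with NumPy. This is only a sketch for verification, not the pandas solution developed in this article; the variable names are chosen here for illustration.

```python
import numpy as np

a = np.array([10, 20, 50, 40, 80], dtype=float)  # column A from the source data
n = 3  # each gap between neighbouring rows is split into n steps

total_rows = (len(a) - 1) * n + 1  # 4 gaps * 3 steps + 1 = 13
# Positions of the output rows, measured in units of the original row number
positions = np.arange(total_rows) / n
# np.interp evaluates the piecewise-linear function through the original points
expected_a = np.interp(positions, np.arange(len(a)), a)

print(total_rows)  # 13
print(expected_a.round(2))
```

The pandas result for column A should match `expected_a`.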

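Approach

First, expand the original data to three times its length by repeating the index and passing it to loc[], then drop the surplus rows at the tail.

Next, blank out the rows created by the expansion (set them to NaN). We covered this operation in the case study on not copying other columns after using explode().

Finally, call the DataFrame's interpolate() method, which fills the gaps linearly by default.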
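To make the last step concrete, here is a minimal sketch of interpolate() on a single Series with two missing values between 10 and 20. The Series is made up for illustration; `method='linear'` is the default and treats the values as equally spaced.

```python
import numpy as np
import pandas as pd

s = pd.Series([10, np.nan, np.nan, 20])
filled = s.interpolate()  # same as s.interpolate(method='linear')
print(filled.tolist())
```

The two gaps are filled with values one third and two thirds of the way from 10 to 20.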

Code

Repeat the index three times:

df.index.repeat(3)
# Int64Index([0, 0, 0, 1, 1, 1, 2, 2, 2,
# 3, 3, 3, 4, 4, 4],  dtype='int64')

Pass the resulting index to loc[] to get the expanded data:

df.loc[df.index.repeat(3)]
'''
    A   B
0  10   2
0  10   2
0  10   2
1  20   8
1  20   8
1  20   8
2  50  10
2  50  10
2  50  10
3  40   6
3  40   6
3  40   6
4  80   4
4  80   4
4  80   4
'''

Drop the surplus rows at the tail:

(
    df.loc[df.index.repeat(3)]
    .iloc[:-3+1]  # drop the last 3 rows (3 could be a variable), then keep 1 of them; handy for wrapping in a function later
)
'''
    A   B
0  10   2
0  10   2
0  10   2
1  20   8
1  20   8
1  20   8
2  50  10
2  50  10
2  50  10
3  40   6
3  40   6
3  40   6
4  80   4
'''
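If the repeat count becomes a variable `n`, the slice `.iloc[:-n+1]` has one trap: for `n = 1` it turns into `.iloc[:0]`, which returns an empty frame. A possible way around this, sketched below with a hypothetical helper name, is to count the rows to keep from the top instead.

```python
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 50, 40, 80], 'B': [2, 8, 10, 6, 4]})

def repeat_and_trim(df: pd.DataFrame, n: int) -> pd.DataFrame:
    """Hypothetical helper: repeat every row n times, then drop the n-1 surplus tail rows."""
    keep = (len(df) - 1) * n + 1  # number of rows to keep, counted from the top
    return df.loc[df.index.repeat(n)].iloc[:keep]

print(len(repeat_and_trim(df, 3)))  # 13
print(len(repeat_and_trim(df, 1)))  # 5
```

Like the original code, this assumes the index labels of `df` are unique, because loc[] is label-based.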

Next, following the method from our earlier case study, set the rows created by the expansion to empty:

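def func(d: pd.DataFrame):
    # Work on a float copy: the group passed in by apply() is not mutated,
    # and NaN can be stored without an implicit dtype change
    d = d.astype(float)
    d.iloc[1:, :] = np.nan  # keep the first row of each group, blank the rest
    return d

(
    df.loc[df.index.repeat(3)]
    .iloc[:-3+1]
    .groupby(level=0, group_keys=False)  # keep the original single-level index
    .apply(func)
)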
'''
      A     B
0  10.0   2.0
0   NaN   NaN
0   NaN   NaN
1  20.0   8.0
1   NaN   NaN
1   NaN   NaN
2  50.0  10.0
2   NaN   NaN
2   NaN   NaN
3  40.0   6.0
3   NaN   NaN
3   NaN   NaN
4  80.0   4.0
'''
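The same blanking step can also be done without groupby. The sketch below assumes the original index labels are unique, so that every repeated label marks a copy; `Index.duplicated()` returns False for the first occurrence of each label and True for the repeats.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 50, 40, 80], 'B': [2, 8, 10, 6, 4]})

# Convert to float first so NaN can be stored in the columns
expanded = df.loc[df.index.repeat(3)].iloc[:-3 + 1].astype(float)

is_copy = expanded.index.duplicated()  # True for the 2nd and 3rd row of each label
expanded.loc[is_copy, :] = np.nan      # blank every copied row

print(expanded)
```

This avoids calling a Python function once per group, which may matter for long tables.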

Finally, fill in the data with interpolate(). The complete code is as follows:

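def func(d: pd.DataFrame):
    # Work on a float copy: the group passed in by apply() is not mutated,
    # and NaN can be stored without an implicit dtype change
    d = d.astype(float)
    d.iloc[1:, :] = np.nan  # keep the first row of each group, blank the rest
    return d

(
    df.loc[df.index.repeat(3)]
    .iloc[:-3+1]
    .groupby(level=0, group_keys=False)  # keep the original single-level index
    .apply(func)
    .interpolate()  # linear by default
)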
'''
           A          B
0  10.000000   2.000000
0  13.333333   4.000000
0  16.666667   6.000000
1  20.000000   8.000000
1  30.000000   8.666667
1  40.000000   9.333333
2  50.000000  10.000000
2  46.666667   8.666667
2  43.333333   7.333333
3  40.000000   6.000000
3  53.333333   5.333333
3  66.666667   4.666667
4  80.000000   4.000000
'''
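Since the number 3 is meant to become a variable, the whole pipeline can be wrapped in a function. The following is one possible sketch, not the author's own wrapper: the name `upsample_linear` is hypothetical, it selects and blanks rows by position rather than by label (so it does not depend on unique index labels), and it assumes all columns are numeric.

```python
import numpy as np
import pandas as pd

def upsample_linear(df: pd.DataFrame, n: int = 3) -> pd.DataFrame:
    """Hypothetical wrapper: split every gap between rows into n linear steps."""
    keep = (len(df) - 1) * n + 1                  # rows in the final result
    positions = np.arange(len(df)).repeat(n)[:keep]
    out = df.iloc[positions].astype(float)        # repeated rows, as floats
    # Every row whose position is not a multiple of n is a copy: blank it
    copy_positions = np.flatnonzero(np.arange(len(out)) % n != 0)
    out.iloc[copy_positions, :] = np.nan
    return out.interpolate()                      # linear by default

df = pd.DataFrame({'A': [10, 20, 50, 40, 80], 'B': [2, 8, 10, 6, 4]})
result = upsample_linear(df, 3)
print(result)
```

With `n=1` the function returns the original rows unchanged (as floats), and larger `n` values give finer steps.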

Other methods

We can also use groupby to append empty rows (an all-NaN DataFrame) to each group, and then interpolate.

none_df = pd.DataFrame([[None]*len(df.columns)],
                       columns=df.columns,
                       dtype=float,
                      )
none_df
'''
    A   B
0 NaN NaN
'''

(
    df.groupby(level=0, group_keys=False)
    .apply(lambda x: pd.concat([x, *[none_df]*2]))
    .interpolate()
    .iloc[:-2]
)
'''
           A          B
0  10.000000   2.000000
0  13.333333   4.000000
0  16.666667   6.000000
1  20.000000   8.000000
0  30.000000   8.666667
0  40.000000   9.333333
2  50.000000  10.000000
0  46.666667   8.666667
0  43.333333   7.333333
3  40.000000   6.000000
0  53.333333   5.333333
0  66.666667   4.666667
4  80.000000   4.000000
'''
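Another route, sketched here as an alternative and not taken from the original article, is to spread the original rows over a wider integer index and let reindex() create the empty rows. It assumes the row order is what matters and replaces the original index with positions 0 to 12.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 50, 40, 80], 'B': [2, 8, 10, 6, 4]})
n = 3

spaced = df.copy()
spaced.index = np.arange(len(df)) * n          # original rows land on 0, 3, 6, 9, 12
full_index = np.arange((len(df) - 1) * n + 1)  # 0, 1, ..., 12
# reindex() adds NaN rows for the missing labels; interpolate() fills them linearly
result = spaced.reindex(full_index).interpolate()

print(result)
```

The values match the tables above, and the result has a unique index, which is usually easier to work with than repeated labels.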

This completes the requirement.

(End)
