Creating Random Sample Arrays with NumPy

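In data analysis, building a model that objectively reflects real-world patterns sometimes requires generating random data. Learning and research also call for plenty of random data: when learning Pandas, for example, you can generate random data first and then process and analyze it.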

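How random number generation works

NumPy's random number routines produce pseudo-random numbers by combining a BitGenerator, which creates sequences of random bits, with a Generator, which uses those sequences to sample from different statistical distributions:

BitGenerators: objects that generate random numbers. These are typically unsigned integer words filled with sequences of either 32 or 64 random bits.
Generators: objects that transform the sequences of random bits from a BitGenerator into sequences of numbers that follow a specific probability distribution (such as uniform, normal, or binomial) within a specified interval.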
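
To make this division of labor concrete, here is a minimal sketch. The seed 12345 and the variable names are arbitrary choices for illustration: the BitGenerator emits raw 64-bit words, and a Generator wrapped around it turns that bit stream into samples from a distribution.

```python
import numpy as np

# The BitGenerator only produces raw random bits (here: uint64 words).
bit_gen = np.random.PCG64(12345)
raw_words = bit_gen.random_raw(3)
print(raw_words.dtype, raw_words.shape)  # uint64 (3,)

# The Generator consumes bits from the same BitGenerator and maps them
# onto probability distributions.
rng = np.random.Generator(bit_gen)
normal_samples = rng.normal(loc=0.0, scale=1.0, size=3)
uniform_samples = rng.uniform(low=-1.0, high=1.0, size=3)

# The Generator keeps a reference to the BitGenerator that feeds it.
print(rng.bit_generator is bit_gen)  # True
```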

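BitGenerators: pseudo-random bit generators

BitGenerator([seed]) is the base class for generic BitGenerators, which provide a stream of random bits based on different algorithms. The following algorithms are supported: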

np.random.SFC64
np.random.MT19937
np.random.PCG64
np.random.Philox

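These algorithms are:

PCG-64 - The default. A fast generator that supports many parallel streams and can be advanced by an arbitrary amount. PCG-64 has a period of 2^{128}. See the NumPy documentation and the PCG author's page for more details about this class of PRNG.
MT19937 - The standard Python BitGenerator. It adds an MT19937.jumped function that returns a new generator whose state is as if 2^{128} draws had been made.
Philox - A counter-based generator capable of being advanced an arbitrary number of steps or of generating independent streams. See the Random123 page for more details about this class of bit generators.
SFC64 - A fast generator based on random invertible mappings. It is usually the fastest of the four generators. See the SFC author's page for more details.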
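
The following sketch shows how the four bit generators can be swapped in behind a Generator, and how jumped() can provide a second stream that does not overlap with the first in practice. The seed value and variable names are illustrative assumptions, and jumped() is demonstrated on PCG64 only.

```python
import numpy as np

seed = 2024  # arbitrary seed chosen for this illustration

# Every BitGenerator class can back a Generator in the same way.
for bit_gen_cls in (np.random.PCG64, np.random.MT19937,
                    np.random.Philox, np.random.SFC64):
    rng = np.random.Generator(bit_gen_cls(seed))
    print(bit_gen_cls.__name__, rng.random())

# jumped() returns a new BitGenerator whose state is far ahead of the
# original, so the two streams below do not overlap in practice.
base = np.random.PCG64(seed)
stream_a = np.random.Generator(base).random(5)
stream_b = np.random.Generator(base.jumped()).random(5)
```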

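Generator

Generator provides access to a wide range of distributions and serves as a replacement for RandomState. The main difference between the two is that Generator relies on an additional BitGenerator to manage state and generate the random bits, which are then transformed into random values from useful distributions. The default BitGenerator used by Generator is PCG64. The BitGenerator can be changed by passing an instantiated BitGenerator to Generator.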

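import numpy as np

# Option 0: default_rng with a seed (uses PCG64)
r = np.random.default_rng(12345)
r.random([3, 4])

# Option 1: pass an instantiated BitGenerator to Generator
bg = np.random.SFC64(12345)
rng = np.random.Generator(bg)
rng.integers(1, 3, [2, 4])

# Option 2: import the classes directly
from numpy.random import Generator, Philox
rg = Generator(Philox(12345))
rg.integers(1, 3, [2, 4])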

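numpy.random.default_rng() constructs a new Generator with the default BitGenerator (PCG64). It accepts an optional seed (seed: {None, int, array_like[ints], SeedSequence, BitGenerator, Generator}, optional), which keeps the results reproducible from run to run. You can also call numpy.random.Generator(bit_generator) directly, but default_rng is the recommended way.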
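
A minimal sketch of what seeding buys you. The seed 12345 and the use of three child generators are arbitrary choices for illustration; SeedSequence.spawn() is shown as one way to derive seeds for several independent generators.

```python
import numpy as np

# Two generators created with the same seed produce identical streams.
rng_a = np.random.default_rng(12345)
rng_b = np.random.default_rng(12345)
same = np.array_equal(rng_a.random(4), rng_b.random(4))
print(same)  # True

# Without a seed, fresh entropy is drawn from the operating system,
# so results differ from run to run.
rng_unseeded = np.random.default_rng()

# A SeedSequence is also accepted; spawn() derives child seeds that can
# feed independent generators (for example, one per worker).
seed_seq = np.random.SeedSequence(12345)
child_rngs = [np.random.default_rng(s) for s in seed_seq.spawn(3)]
draws = [child.integers(0, 100, size=3) for child in child_rngs]
```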

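Changes

Since NumPy 1.17.0, a Generator can be initialized with a number of different BitGenerators, and it exposes many different probability distributions. The legacy RandomState random number routines are still available, but they are limited to a single BitGenerator. For a complete list of improvements and differences from the legacy RandomState, see "What's New or Different" in the NumPy documentation.

For convenience and backward compatibility, the methods of a single RandomState instance are imported into the numpy.random namespace; see "Legacy Random Generation" in the NumPy documentation.

The new and legacy styles compare as follows: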

import numpy as np

# New style
r = np.random.default_rng()
r.random([2, 3])
r.integers(1, 10, size=(3, 4))

# Legacy style
np.random.standard_normal(10)
np.random.randint(1, 10, size=(3, 4))

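The legacy methods are kept only for compatibility; according to the NumPy developers, they will not be improved or receive new features.

Simple random data

Commonly used methods for simple random data: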

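Method	Description
integers(low[, high, size, dtype, endpoint])	Random integers from low (inclusive) to high (exclusive); endpoint=True makes high inclusive
random([size, dtype, out])	Random floats in the half-open interval [0.0, 1.0)
choice(a[, size, replace, p, axis, shuffle])	Generates a random sample from a given array
bytes(length)	Returns random bytes

Some examples follow.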

Random integers: integers

# Syntax
Generator.integers(low, high=None, size=None,
                   dtype=np.int64, endpoint=False)

rng = np.random.default_rng()
rng.integers(2, size=10)
# array([1, 0, 0, 0, 1, 1, 0, 0, 1, 0])  # random
rng.integers(1, size=10)
# array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

# Generate a 2 x 4 array of integers from 0 to 4, inclusive
rng.integers(5, size=(2, 4))
'''
array([[4, 0, 2, 1],
       [3, 2, 2, 0]])  # random
'''
# Generate a 1 x 3 array with 3 different upper bounds (high may be an array-like of ints)
rng.integers(1, [3, 5, 10])
# array([2, 2, 9])  # random

# Likewise, generate a 1 x 3 array with 3 different lower bounds
rng.integers([1, 5, 7], 10)
# array([9, 8, 7])  # random

# Generate a 2 x 4 array of dtype uint8 using broadcasting
rng.integers([1, 3, 5, 7], [[10], [20]], dtype=np.uint8)
'''
array([[ 8,  6,  9,  7],
       [ 1, 16,  9, 12]], dtype=uint8)  # random
'''
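
The endpoint parameter from the syntax line is not exercised in the examples above. A minimal sketch of its effect, with an arbitrarily chosen seed and a die roll as the illustration:

```python
import numpy as np

rng = np.random.default_rng(7)  # seed chosen arbitrarily

# Default: high is exclusive, so this draws from {1, 2, 3, 4, 5}.
half_open = rng.integers(1, 6, size=1000)

# endpoint=True makes high inclusive: a six-sided die, {1, ..., 6}.
dice = rng.integers(1, 6, size=1000, endpoint=True)

print(half_open.min(), half_open.max())
print(dice.min(), dice.max())
```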

Random floats: random

# Syntax
Generator.random(size=None, dtype=np.float64, out=None)

rng.random()
# 0.09323893259291272  # random
type(rng.random())
# <class 'float'>
rng.random((5,))
# array([0.70.., 0.789.., 0.270.., 0.92.., 0.907..])  # random

# A 3 x 2 array of random numbers in the interval [-5, 0):
5 * rng.random((3, 2)) - 5
'''
array([[-2.79428462, -2.70227044],  # random
       [-4.30778077, -2.1288518 ],
       [-0.53373541, -1.3214533 ]])
'''
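
The last expression above is an instance of the general rescaling (b - a) * random() + a, which maps [0, 1) onto [a, b). A small sketch, with the bounds a and b and the seed chosen here for illustration, next to the equivalent uniform() call:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
a, b = -5.0, 0.0

# Rescale samples from [0, 1) to [a, b).
scaled = (b - a) * rng.random((3, 2)) + a

# uniform() draws from [low, high) directly.
direct = rng.uniform(a, b, size=(3, 2))

# random() can also produce single-precision floats.
floats32 = rng.random(4, dtype=np.float32)
```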

Random selection: choice

# Syntax
Generator.choice(a, size=None, replace=True,
                 p=None, axis=0, shuffle=True)
# Generate a uniform random sample of size 3 from np.arange(5)
# Equivalent to rng.integers(0, 5, 3)
rng.choice(5, 3)
# array([0, 3, 4])  # random

# Generate a non-uniform random sample of size 3 from np.arange(5):
rng.choice(5, 3, p=[0.1, 0, 0.3, 0.6, 0])
# array([3, 3, 0])  # random
rng.choice(5, 3, replace=False)  # uniform sample without replacement
# array([3, 1, 0])  # random
# The above is equivalent to rng.permutation(np.arange(5))[:3]

# Non-uniform sample without replacement
rng.choice(5, 3, replace=False, p=[0.1, 0, 0.3, 0.6, 0])
# array([2, 3, 0])  # random

# Any of the examples above also works with an arbitrary array-like, not just integers
aa_milne_arr = ['pooh', 'rabbit', 'piglet', 'Christopher']
rng.choice(aa_milne_arr, 5, p=[0.5, 0.1, 0.1, 0.3])
'''
array(['pooh', 'pooh', 'pooh', 'Christopher', 'piglet'],  # random
      dtype='<U11')
'''
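
To see what the p argument does over many draws, the sketch below compares empirical frequencies with the requested probabilities, and also shows choice applied to a 2-D array, where whole rows are sampled along axis 0. The seed, sample size, and the array named table are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed
p = [0.1, 0, 0.3, 0.6, 0]

# Draw many samples and compare empirical frequencies with p.
draws = rng.choice(5, size=10_000, p=p)
freq = np.bincount(draws, minlength=5) / draws.size
print(freq)  # close to p; entries with p = 0 never occur

# For a 2-D array, choice samples along axis 0 by default, i.e. whole rows.
table = np.arange(12).reshape(4, 3)
rows = rng.choice(table, size=2, replace=False)
print(rows.shape)  # (2, 3)
```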

Random bytes: bytes

# Syntax
Generator.bytes(length)
# Length 5
rng.bytes(5)
# b'C\xad&\xaf^'  # random
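
A brief sketch of working with the returned value, which is an ordinary Python bytes object. The length 8 and the variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng()
token = rng.bytes(8)

print(type(token), len(token))  # <class 'bytes'> 8
print(token.hex())              # 16 hexadecimal characters, random

# Interpret the bytes as unsigned 8-bit integers.
as_uint8 = np.frombuffer(token, dtype=np.uint8)
```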

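Permutations

Method	Description
shuffle(x[, axis])	Modifies an array or sequence in place by shuffling its contents
permutation(x[, axis])	Randomly permutes a sequence, or returns a permuted range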

Shuffling: shuffle

Generator.shuffle(x, axis=0)  # syntax
rng = np.random.default_rng()
arr = np.arange(10)
rng.shuffle(arr)  # shuffles in place
arr
# array([1, 7, 5, 2, 9, 4, 3, 6, 0, 8])  # random
arr = np.arange(9).reshape((3, 3))
rng.shuffle(arr)  # shuffles in place along axis 0 (reorders the rows)
arr
'''
array([[3, 4, 5],  # random
       [6, 7, 8],
       [0, 1, 2]])
'''
arr = np.arange(9).reshape((3, 3))
rng.shuffle(arr, axis=1)  # shuffles along axis 1 (reorders the columns)
arr
'''
array([[2, 0, 1],  # random
       [5, 3, 4],
       [8, 6, 7]])
'''
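
Because shuffle modifies its argument, a common pitfall is assigning its return value. A minimal sketch of the in-place behavior, with an arbitrary seed and illustrative variable names:

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
arr = np.arange(10)

# shuffle() works in place and returns None.
result = rng.shuffle(arr)
print(result)  # None

# The contents are the same values in a new order.
print(np.array_equal(np.sort(arr), np.arange(10)))  # True

# To keep the original, shuffle a copy instead.
original = np.arange(10)
shuffled_copy = original.copy()
rng.shuffle(shuffled_copy)
```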

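Random permutation: permutation

Generator.permutation(x, axis=0) serves two purposes: if x is an integer, it returns a randomly ordered np.arange(x); if x is an array, it makes a copy and randomly permutes the elements.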

rng = np.random.default_rng()
rng.permutation(10)  # returns a shuffled range
# array([1, 7, 4, 3, 0, 9, 2, 5, 8, 6])  # random

rng.permutation([1, 4, 9, 12, 15])  # returns a shuffled copy
# array([15,  1,  9,  4, 12])  # random

arr = np.arange(9).reshape((3, 3))
rng.permutation(arr)  # by default permutes along axis 0 (reorders the rows)
'''
array([[6, 7, 8],  # random
       [0, 1, 2],
       [3, 4, 5]])
'''

rng.permutation("abc")  # a string cannot be passed in
'''
Traceback (most recent call last):
    ...
numpy.AxisError: axis 0 is out of bounds for array of dimension 0
'''

arr = np.arange(9).reshape((3, 3))
rng.permutation(arr, axis=1)  # permutes along axis 1 (reorders the columns)
'''
array([[0, 2, 1],  # random
       [3, 5, 4],
       [6, 8, 7]])
'''
# Permute both rows and columns: feed the first result into the second call
a = rng.permutation(arr, axis=0)
a = rng.permutation(a, axis=1)
a
'''
array([[2, 1, 0],  # random
       [5, 4, 3],
       [8, 7, 6]])
'''
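
Both shuffle and permutation move whole slices along the chosen axis, so every row receives the same column order. If each row should be shuffled independently, Generator.permuted can be used instead; the sketch below assumes NumPy 1.20 or later, where that method is available, and uses an arbitrary seed.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed
arr = np.arange(12).reshape(3, 4)

# permutation(axis=1): one column order is applied to every row.
same_order = rng.permutation(arr, axis=1)

# permuted(axis=1): each row gets its own independent order.
independent = rng.permuted(arr, axis=1)

# In both cases every row still holds its original values.
print(np.array_equal(np.sort(same_order, axis=1), arr))   # True
print(np.array_equal(np.sort(independent, axis=1), arr))  # True
```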

Distributions

The following distributions are supported for drawing random samples:

Method	Description
beta(a, b[, size])	Beta distribution
binomial(n, p[, size])	Binomial distribution
chisquare(df[, size])	Chi-square distribution
dirichlet(alpha[, size])	Dirichlet distribution
exponential([scale, size])	Exponential distribution
f(dfnum, dfden[, size])	F distribution
gamma(shape[, scale, size])	Gamma distribution
geometric(p[, size])	Geometric distribution
gumbel([loc, scale, size])	Gumbel distribution
hypergeometric(ngood, nbad, nsample[, size])	Hypergeometric distribution
laplace([loc, scale, size])	Laplace (double exponential) distribution with the specified location (or mean) and scale (decay)
logistic([loc, scale, size])	Logistic distribution
lognormal([mean, sigma, size])	Log-normal distribution
logseries(p[, size])	Logarithmic series distribution
multinomial(n, pvals[, size])	Multinomial distribution
multivariate_hypergeometric(colors, nsample)	Multivariate hypergeometric distribution
multivariate_normal(mean, cov[, size, …])	Multivariate normal distribution
negative_binomial(n, p[, size])	Negative binomial distribution
noncentral_chisquare(df, nonc[, size])	Noncentral chi-square distribution
noncentral_f(dfnum, dfden, nonc[, size])	Noncentral F distribution
normal([loc, scale, size])	Normal (Gaussian) distribution
pareto(a[, size])	Pareto II (Lomax) distribution with the specified shape
poisson([lam, size])	Poisson distribution
power(a[, size])	Samples in [0, 1] from a power distribution with positive exponent a - 1
rayleigh([scale, size])	Rayleigh distribution
standard_cauchy([size])	Standard Cauchy distribution with mode 0
standard_exponential([size, dtype, method, out])	Standard exponential distribution
standard_gamma(shape[, size, dtype, out])	Standard Gamma distribution
standard_normal([size, dtype, out])	Standard normal distribution (mean = 0, standard deviation = 1)
standard_t(df[, size])	Standard Student's t distribution with df degrees of freedom
triangular(left, mode, right[, size])	Triangular distribution over the interval [left, right]
uniform([low, high, size])	Uniform distribution
vonmises(mu, kappa[, size])	von Mises distribution
wald(mean, scale[, size])	Wald (inverse Gaussian) distribution
weibull(a[, size])	Weibull distribution
zipf(a[, size])	Zipf distribution

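An example with a commonly used distribution:

# Normal (Gaussian) distribution
import numpy as np
import pandas as pd

mu, sigma = 0, 0.1  # mean (default 0) and standard deviation (default 1)
r = np.random.default_rng()
r.normal(size=[4, 3])  # standard normal samples with the given shape
s = r.normal(mu, sigma, 1000)
# Plot a histogram (requires matplotlib)
pd.Series(s).plot.hist()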
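
A quick way to sanity-check samples without plotting is to compare their sample statistics with the theoretical values. The sketch below does this for a few distributions from the table; the seed, sample size, and parameter values are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2023)  # arbitrary seed
n = 10_000

normal = rng.normal(loc=0.0, scale=0.1, size=n)     # theoretical mean 0, std 0.1
binomial = rng.binomial(n=10, p=0.5, size=n)        # theoretical mean n*p = 5
poisson = rng.poisson(lam=3.0, size=n)              # theoretical mean lam = 3
uniform = rng.uniform(low=0.0, high=1.0, size=n)    # theoretical mean 0.5

for name, sample in [("normal", normal), ("binomial", binomial),
                     ("poisson", poisson), ("uniform", uniform)]:
    print(f"{name:9s} mean={sample.mean():.3f} std={sample.std():.3f}")
```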

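Legacy methods

As mentioned above, NumPy also has an older way of generating random data. The new approach is still being adopted, and the legacy methods remain widely used in learning materials and in production code, which is why NumPy continues to support them. They are mentioned here for that reason, but migrating to the new methods as soon as possible is strongly recommended.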
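
As a rough migration aid, the sketch below places a few legacy calls next to their Generator counterparts. The pairing shown (randint with integers, rand with random, randn with standard_normal) and the seed are an illustration, not a complete mapping.

```python
import numpy as np

# Legacy style: module-level functions backed by one global RandomState.
np.random.seed(0)
old_ints = np.random.randint(1, 10, size=(3, 4))
old_floats = np.random.rand(2, 3)        # shape passed as separate arguments
old_normals = np.random.randn(5)

# New style: an explicit Generator object.
rng = np.random.default_rng(0)
new_ints = rng.integers(1, 10, size=(3, 4))
new_floats = rng.random((2, 3))          # shape passed as a tuple
new_normals = rng.standard_normal(5)

# The same seed is not expected to give the same numbers in both styles,
# because the underlying bit generators differ (MT19937 vs. PCG64).
```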

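Pseudo-random and true random

Pseudo-random numbers are generated by an algorithm, whereas true random numbers come from genuinely random processes. There are many examples of true randomness, such as the heights of people in a population or the measured dimensions of manufactured parts. Random numbers generated by a computer are generally pseudo-random.

So why can a sequence of numbers produced by an algorithm still be called random? First, people cannot judge the randomness of a data set on their own, without the help of computational methods. Randomness is usually assessed with statistical tests that check whether a sequence could have been generated by some known distribution. In that sense, whether the data comes from an algorithm or from nature, it is regarded as random as long as it passes the tests.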
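
The idea of judging randomness through a test can be sketched with a hand-computed chi-square goodness-of-fit statistic for uniformity. The bin count, sample size, seed, and the 5% critical value of 16.92 for 9 degrees of freedom are choices made for this illustration, not part of the original text.

```python
import numpy as np

rng = np.random.default_rng(99)  # arbitrary seed
n_samples, n_bins = 100_000, 10

samples = rng.random(n_samples)
observed, _ = np.histogram(samples, bins=n_bins, range=(0.0, 1.0))
expected = n_samples / n_bins

# Chi-square statistic: sum of (observed - expected)^2 / expected.
chi_square = float(((observed - expected) ** 2 / expected).sum())

# 5% critical value of the chi-square distribution with 9 degrees of freedom.
critical_value = 16.92
print(f"chi-square = {chi_square:.2f}")
print("consistent with uniform" if chi_square < critical_value
      else "uniformity rejected at the 5% level")
```

A single test like this can only fail to reject uniformity; serious PRNG evaluation relies on whole batteries of such tests.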

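We do not need true random numbers unless security is involved (for example, encryption keys) or randomness is the very basis of the application (for example, a digital roulette wheel).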
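
For the security-related cases mentioned above, one option in Python is the standard library's secrets module, which draws unpredictable values from the operating system rather than from a seeded algorithm. The sketch below contrasts the two; the 37-pocket wheel and the 16-byte key length are hypothetical values chosen for illustration.

```python
import secrets

import numpy as np

# Reproducible pseudo-random numbers: fine for simulation and analysis.
rng = np.random.default_rng(12345)
simulated_spin = int(rng.integers(0, 37))  # 37 pockets, hypothetical wheel

# Unpredictable values from the operating system: for keys and tokens.
key = secrets.token_bytes(16)      # 16 random bytes
spin = secrets.randbelow(37)       # integer in [0, 37)
print(len(key), 0 <= spin < 37)    # 16 True
```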

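Note also that random numbers produced by an algorithm on a computer are pseudo-random. Random does not mean that a different number appears every time; it means that the outcome cannot be logically predicted. Some computers include a hardware true random number generator, which can produce true random numbers, possibly by exploiting the thermal noise of a chip; that is not discussed here.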

References

https://numpy.org/doc/stable/reference/random/index.html
https://numpy.org/doc/stable/reference/random/bit_generators/index.html
https://numpy.org/doc/stable/reference/random/generator.html