NumPy Data Types

Overview

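A data type object (an instance of the numpy.dtype class) describes how the bytes in the fixed-size block of memory corresponding to an array item should be interpreted. Beyond the commonly used numeric types, NumPy offers a rich set of data types to choose from.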

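Type characteristics

NumPy arrays are homogeneous: every element is described by the same dtype object, and a dtype object can be built from different combinations of the fundamental numeric types. It describes the following aspects of the data:

Type of the data (integer, float, Python object, etc.)
Size of the data (for example, how many bytes an integer occupies)
Byte order of the data (little-endian or big-endian)
If the data type is a structured data type, an aggregate of other data types (for example, describing an array item made up of an integer and a float):
- what the names of the structure's "fields" are, by which they can be accessed,
- what the data type of each field is, and
- which part of the memory block each field takes.
If the data type is a sub-array, what its shape and data type are.

To describe the type of scalar data, NumPy has several built-in scalar types for integers, floating-point numbers, and so on at various precisions. For example, an item extracted from an array by indexing is a Python object whose type is the scalar type associated with the array's data type.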
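
The relationship between an array's dtype and the scalar type of its items can be checked directly. The snippet below is a minimal sketch; the names arr and item are chosen here only for illustration.

```python
import numpy as np

# A small int16 array; indexing a single element yields a NumPy scalar.
arr = np.array([1, 2, 3], dtype=np.int16)
item = arr[0]

print(type(item))                    # the scalar type tied to the array's dtype
print(item.dtype)                    # the scalar carries the same dtype
print(arr.dtype.type is type(item))  # dtype.type is that scalar type
```

The scalar is an instance of np.generic, the base class of all NumPy scalar types, so it is not a plain Python int.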

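Data type objects

A NumPy data type object (dtype) describes how the data in an array is laid out, and it also lets us describe complex data types. The example below is a simple data type containing a 32-bit big-endian integer: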

import numpy as np

dt = np.dtype('>i4')
dt.byteorder
# '>'
dt.itemsize
# 4
dt.name
# 'int32'
# the corresponding array scalar type is int32
dt.type is np.int32
# True
np.float64 == np.dtype(np.float64) == np.dtype('float64')
# True
np.float64 == np.dtype(np.float64).type
# True

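A structured data type containing a 16-character string (in the field "name") and a sub-array of two 64-bit floating-point numbers (in the field "grades"). np.str_ is used here because the older alias np.unicode_ was removed in NumPy 2.0:

dt = np.dtype([('name', np.str_, 16), ('grades', np.float64, (2,))])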
dt['name']
# dtype('<U16')
dt['grades']
# dtype(('<f8', (2,)))

Items of an array with this data type are wrapped in an array scalar type that also has two fields:

x = np.array([('Sarah', (8.0, 7.0)), ('John', (6.0, 7.0))], dtype=dt)
x[1]
# ('John', [6., 7.])
x[1]['grades']
# array([6.,  7.])
type(x[1])
# <class 'numpy.void'>
type(x[1]['grades'])
# <class 'numpy.ndarray'>

Data types

The numeric types, such as np.bool_, np.int32, and np.float32, each correspond to a dtype object and to a unique character code.

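NumPy type	C type	Description
np.bool_	bool	Boolean (True or False), stored as a byte
np.byte	signed char	Platform-defined
np.ubyte	unsigned char	Platform-defined
np.short	short	Platform-defined
np.ushort	unsigned short	Platform-defined
np.intc	int	Platform-defined
np.uintc	unsigned int	Platform-defined
np.int_	long	Platform-defined (since NumPy 2.0, np.int_ is an alias of np.intp)
np.uint	unsigned long	Platform-defined (since NumPy 2.0, np.uint is an alias of np.uintp)
np.longlong	long long	Platform-defined
np.ulonglong	unsigned long long	Platform-defined
np.half / np.float16		Half-precision float: sign bit, 5-bit exponent, 10-bit mantissa
np.single	float	Platform-defined single-precision float: typically sign bit, 8-bit exponent, 23-bit mantissa
np.double	double	Platform-defined double-precision float: typically sign bit, 11-bit exponent, 52-bit mantissa
np.longdouble	long double	Platform-defined extended-precision float
np.csingle	float complex	Complex number, represented by two single-precision floats (real and imaginary parts)
np.cdouble	double complex	Complex number, represented by two double-precision floats (real and imaginary parts)
np.clongdouble	long double complex	Complex number, represented by two extended-precision floats (real and imaginary parts)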

Because many of these have platform-dependent definitions, a set of fixed-size aliases is provided:

NumPy type	C type	Description
np.int8	int8_t	Byte (-128 to 127)
np.int16	int16_t	Integer (-32768 to 32767)
np.int32	int32_t	Integer (-2147483648 to 2147483647)
np.int64	int64_t	Integer (-9223372036854775808 to 9223372036854775807)
np.uint8	uint8_t	Unsigned integer (0 to 255)
np.uint16	uint16_t	Unsigned integer (0 to 65535)
np.uint32	uint32_t	Unsigned integer (0 to 4294967295)
np.uint64	uint64_t	Unsigned integer (0 to 18446744073709551615)
np.intp	intptr_t	Integer used for indexing, typically the same as ssize_t
np.uintp	uintptr_t	Integer large enough to hold a pointer
np.float32	float
np.float64 / np.float_	double	Matches the precision of the built-in Python float (the np.float_ alias was removed in NumPy 2.0)
np.complex64	float complex	Complex number, represented by two 32-bit floats (real and imaginary parts)
np.complex128 / np.complex_	double complex	Matches the precision of the built-in Python complex (the np.complex_ alias was removed in NumPy 2.0)
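
The ranges and sizes listed for the fixed-size aliases can be confirmed at runtime. The sketch below uses np.iinfo and np.finfo, which are not mentioned in the text above but are NumPy's standard helpers for inspecting integer and floating-point limits.

```python
import numpy as np

# Print the size in bytes and the value range of a few fixed-size integer types.
for t in (np.int8, np.int16, np.uint8, np.uint16):
    info = np.iinfo(t)
    print(t.__name__, np.dtype(t).itemsize, info.min, info.max)

int8_range = (np.iinfo(np.int8).min, np.iinfo(np.int8).max)
float32_bits = np.finfo(np.float32).bits  # total number of bits in a float32
print(int8_range, float32_bits)
```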

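The byte order of a data type object is represented by a single character, available as numpy.dtype.byteorder. All built-in data type objects have a byte order of '=' or '|'.

'=' native
'<' little-endian
'>' big-endian
'|' not applicable

The byte order is chosen by prefixing the type string with < or >. < means little-endian (the least significant byte is stored at the lowest address, so the low-order bytes come first). > means big-endian (the most significant byte is stored at the lowest address, so the high-order bytes come first).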
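
The byte-order characters can be observed on a few dtypes. This is a small sketch; it uses sys.byteorder (from the standard library, not part of the text above) only to report which order is native on the machine running it.

```python
import sys
import numpy as np

native = np.dtype('i4')          # no prefix: native byte order
swapped = native.newbyteorder()  # the opposite of the native order
single_byte = np.dtype('u1')     # one byte wide, so byte order does not apply

print(sys.byteorder)             # 'little' or 'big', depending on the machine
print(native.byteorder)          # '='
print(swapped.byteorder)         # '>' on a little-endian machine, '<' on a big-endian one
print(single_byte.byteorder)     # '|'
```

Note that a dtype created with an explicit prefix that happens to match the machine's own order also reports '=', because NumPy normalizes it to native.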

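In an array-protocol type string, the first character identifies the kind of data:

Character	Description
?	Boolean
b	(Signed) byte
B	Unsigned byte
i	(Signed) integer
u	Unsigned integer
f	Floating-point
c	Complex floating-point
m	timedelta
M	datetime
O	(Python) object
S, a	(Byte-)string (not recommended)
U	Unicode string
V	Raw data (void)

Examples that use these characters to build complex data types follow in the next section. np.sctypeDict is a dictionary that maps type names and codes to NumPy scalar types; np.sctypeDict.keys() lists every supported string.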
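
To see how these characters behave in practice, the sketch below builds a dtype from one type string per kind and prints its kind, item size, and name. The list type_strings is an illustrative selection, not an exhaustive one.

```python
import numpy as np

# One example type string for each kind of data.
type_strings = ['?', 'i4', 'u2', 'f8', 'c16', 'm8[s]', 'M8[D]', 'O', 'S5', 'U5', 'V8']

for s in type_strings:
    dt = np.dtype(s)
    print(f"{s!r:10} kind={dt.kind!r} itemsize={dt.itemsize} name={dt.name}")

kinds = [np.dtype(s).kind for s in type_strings]

# np.sctypeDict maps names to scalar types.
print(np.sctypeDict['float64'] is np.float64)
```

The boolean type string '?' reports the kind 'b', whereas the type string 'b' on its own denotes a signed byte whose kind is 'i'.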

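Constructing data types

Whenever a NumPy function or method needs a data type, you can supply either a dtype object or anything that can be converted to one. The conversion is done by the numpy.dtype constructor (dtype(obj[, align, copy])):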

# Using an array scalar type
np.dtype(np.int16) # dtype('int16')
# Structured type with one field named "f1" containing int16:
np.dtype([('f1', np.int16)]) # dtype([('f1', '<i2')])

# Structured type with one field named "f1",
# itself containing a structured type with one field:
np.dtype([('f1', [('f1', np.int16)])])
# dtype([('f1', [('f1', '<i2')])])

# Structured type with two fields: the first holds an unsigned
# 64-bit int, the second a 32-bit int:
np.dtype([('f1', np.uint64), ('f2', np.int32)])
# dtype([('f1', '<u8'), ('f2', '<i4')])

# Using array-protocol type strings:
np.dtype([('a','f8'),('b','S10')])
# dtype([('a', '<f8'), ('b', 'S10')])

# Using comma-separated field formats. The shape is (2,3)
np.dtype("i4, (2,3)f8")
# dtype([('f0', '<i4'), ('f1', '<f8', (2, 3))])

# Using tuples. int64 is a fixed-size type, 3 is the field's shape;
# void is a flexible type, here of size 10:
np.dtype([('hello',(np.int64,3)),('world',np.void,10)])
# dtype([('hello', '<i8', (3,)), ('world', 'V10')])

# Subdivide int16 into 2 int8's, called x and y;
# 0 and 1 are the offsets in bytes:
np.dtype((np.int16, {'x':(np.int8,0), 'y':(np.int8,1)}))
# dtype((numpy.int16, [('x', 'i1'), ('y', 'i1')]))

# Using dictionaries. Two fields named "gender" and "age":
np.dtype({'names':['gender','age'], 'formats':['S1',np.uint8]})
# dtype([('gender', 'S1'), ('age', 'u1')])

# Offsets in bytes, here 0 and 25:
np.dtype({'surname':('S25',0),'age':(np.uint8,25)})
# dtype([('surname', 'S25'), ('age', 'u1')])
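
A constructed dtype becomes useful once an array is created with it. The sketch below reuses the "gender" and "age" dictionary form from the examples above; the sample records are made up for illustration.

```python
import numpy as np

# Structured dtype with a 1-byte string field and an unsigned 8-bit integer field.
person = np.dtype({'names': ['gender', 'age'], 'formats': ['S1', np.uint8]})

# Each tuple supplies one record; the values here are invented sample data.
people = np.array([(b'F', 30), (b'M', 41)], dtype=person)

print(people['age'])        # all values of the "age" field
print(people[0]['gender'])  # the "gender" field of the first record
print(person.itemsize)      # 1 byte for gender + 1 byte for age

mean_age = people['age'].mean()
print(mean_age)
```

Selecting a field by name returns an ordinary array of that field's dtype, so the usual array methods apply to it.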

More construction examples:

np.dtype(np.int32)      # 32-bit integer
np.dtype(np.complex128) # 128-bit complex floating-point number
np.dtype(float)   # Python-compatible floating-point number
np.dtype(int)     # Python-compatible integer
np.dtype(object)  # Python object
np.dtype('b')  # byte, native byte order
np.dtype('>H') # big-endian unsigned short
np.dtype('<f') # little-endian single-precision float
np.dtype('d')  # double-precision floating-point number
np.dtype('i4')   # 32-bit signed integer
np.dtype('f8')   # 64-bit floating-point number
np.dtype('c16')  # 128-bit complex floating-point number
np.dtype('S25')  # 25-length zero-terminated bytes
np.dtype('U25')  # 25-character string
np.dtype('uint32')   # 32-bit unsigned integer
np.dtype('float64')  # 64-bit floating-point number
np.dtype((np.void, 10))  # 10-byte wide data block
np.dtype(('U', 10))   # 10-character unicode string
np.dtype((np.int32, (2,2)))          # 2 x 2 integer sub-array
np.dtype(('i4, (2,3)f8, f4', (2,3))) # 2 x 3 structured sub-array
# fields big (big-endian 32-bit integer) and
# little (little-endian 32-bit integer)
np.dtype([('big', '>i4'), ('little', '<i4')])
# fields R, G, B, A, each being an unsigned 8-bit integer
np.dtype([('R','u1'), ('G','u1'), ('B','u1'), ('A','u1')])
# fields r, g, b, a, each being an 8-bit unsigned integer
np.dtype({'names': ['r','g','b','a'],
          'formats': [np.uint8, np.uint8, np.uint8, np.uint8]})
# fields r and b (with the given titles), both being 8-bit
# unsigned integers, the first at byte position 0 from the
# start of the field and the second at position 2
np.dtype({'names': ['r','b'], 'formats': ['u1', 'u1'],
          'offsets': [0, 2],
          'titles': ['Red pixel', 'Blue pixel']})
# field col1 (10-character string at byte position 0),
# col2 (32-bit float at byte position 10),
# and col3 (integers at byte position 14)
np.dtype({'col1': ('U10', 0), 'col2': (np.float32, 10),
          'col3': (int, 14)})
# 32-bit integer, whose first two bytes are interpreted
# as an integer via field real, and the
# following two bytes via field imag.
np.dtype((np.int32,{'real':(np.int16, 0),'imag':(np.int16, 2)}))
# 32-bit integer, which is interpreted as consisting of
# a sub-array of shape (4,) containing 8-bit integers:
np.dtype((np.int32, (np.int8, 4)))
# 32-bit integer, containing fields r, g, b, a that interpret
# the 4 bytes in the integer as four unsigned integers:
np.dtype(('i4', [('r','u1'),('g','u1'),('b','u1'),('a','u1')]))
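
One practical use of such field layouts is reinterpreting packed integers as named bytes. The sketch below is an illustration under stated assumptions: the array packed and its two values are invented, and the explicit '<u4' byte order is chosen so the field order is the same on any machine.

```python
import numpy as np

# Four unsigned 8-bit fields, 4 bytes in total.
rgba = np.dtype([('r', 'u1'), ('g', 'u1'), ('b', 'u1'), ('a', 'u1')])

# Two 32-bit values stored little-endian, so the lowest byte comes first.
packed = np.array([0x04030201, 0xFF102030], dtype='<u4')

# view() reinterprets the same 4 bytes per item without copying.
pixels = packed.view(rgba)

print(pixels['r'])  # lowest byte of each value
print(pixels['a'])  # highest byte of each value
```

Because view() shares memory with the original array, writing to a field such as pixels['r'] also changes the corresponding packed value.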

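Attributes of data type objects

A data type is described by the following attributes:

Attribute	Description
dtype.type	The type object used to instantiate a scalar of this data type
dtype.kind	A character code (one of "biufcmMOSUV") identifying the general kind of data
dtype.char	A unique character code for each of the 21 different built-in types
dtype.num	A unique number for each of the 21 different built-in types
dtype.str	The array-protocol typestring of this data type object

The size of the data is described by:

Attribute	Description
dtype.name	A bit-width name for this data type
dtype.itemsize	The element size of this data type object

The byte order of the data:

Attribute	Description
dtype.byteorder	A character indicating the byte order of this data type object

Information about the sub-data-types in a structured data type:

Attribute	Description
dtype.fields	Dictionary of named fields defined for this data type, or None
dtype.names	Ordered sequence (a tuple) of field names, or None if there are no fields

For data types that describe sub-arrays:

Attribute	Description
dtype.subdtype	The tuple (item_dtype, shape) if this data type describes a sub-array, otherwise None
dtype.shape	Shape tuple of the sub-array if this data type describes a sub-array, otherwise ()

Attributes providing additional information:

Attribute	Description
dtype.hasobject	Boolean indicating whether this data type contains any reference-counted objects in any fields or sub-data-types
dtype.flags	Bit flags describing how this data type is to be interpreted
dtype.isbuiltin	Integer indicating how this data type relates to the built-in data types
dtype.isnative	Boolean indicating whether the byte order of this data type is native to the platform
dtype.descr	`__array_interface__` description of the data type
dtype.alignment	The required alignment (in bytes) of this data type according to the compiler
dtype.base	The data type of the base element of the sub-arrays, regardless of their dimension or shape

The method dtype.newbyteorder([new_order]) returns a new data type with a different byte order.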
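
Several of these attributes can be read off a single structured dtype. The sketch below rebuilds the "name" and "grades" layout used earlier, written with type strings; the variable names are illustrative.

```python
import numpy as np

# 16-character Unicode field (4 bytes per character) plus a sub-array of two float64.
dt = np.dtype([('name', 'U16'), ('grades', 'f8', (2,))])

print(dt.names)                # field names in order
print(dt.itemsize)             # 64 bytes for name + 16 bytes for grades
print(dt.fields['grades'][1])  # byte offset of the "grades" field

grades = dt['grades']
print(grades.shape, grades.base, grades.subdtype)

f8 = np.dtype('f8')
print(f8.kind, f8.char, f8.name)

# newbyteorder returns a new dtype; the original is left unchanged.
little = np.dtype('>i4').newbyteorder('<')
print(little == np.dtype('<i4'))
```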

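Type promotion with np.result_type

numpy.result_type(*arrays_and_dtypes) returns the type that results from applying the NumPy type promotion rules to its arguments.

Type promotion in NumPy works much like the rules in languages such as C++, with some subtle differences. When scalars and arrays are used together, the array's type takes precedence and the actual value of the scalar is taken into account. This value-based behavior describes NumPy 1.x; NumPy 2.0 adopted NEP 50, under which Python scalars are treated as weakly typed and their values are no longer inspected.

For example, consider computing 3*a, where a is an array of 32-bit floats; intuitively the output should be a 32-bit float. If 3 were a 32-bit integer, the NumPy rules say it cannot be converted losslessly to a 32-bit float, so a 64-bit float would be the result type. By examining the value of the constant "3", we see that it fits in an 8-bit integer, which can be cast losslessly to a 32-bit float.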

np.result_type(3, np.arange(7, dtype='i1'))
# dtype('int8')
np.result_type('i4', 'c8')
# dtype('complex128')
np.result_type(3.0, -2)
# dtype('float64')
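
The 3*a case described above can be checked directly, along with the two dtype-only promotions it relies on. This is a minimal sketch; the array a is created here only for the demonstration.

```python
import numpy as np

a = np.ones(4, dtype=np.float32)

# Multiplying by the Python integer 3 keeps the array's float32 type.
scaled = 3 * a
print(scaled.dtype)

# With explicit dtypes, int32 cannot be held losslessly by float32,
# so the promoted type is float64; int8 can, so float32 is kept.
print(np.result_type(np.int32, np.float32))
print(np.result_type(np.int8, np.float32))
```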

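# Map NumPy's default 64-bit result types to 32-bit ones
defaults = {np.dtype('int64'): np.int32,
            np.dtype('float64'): np.float32}
before = 1.
# np.result_type(1.) is float64, so the lookup selects np.float32
np.array(before, dtype=defaults.get(np.result_type(before), None))
# array(1., dtype=float32)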

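Big-endian and little-endian

Big-endian versus little-endian byte order:

Big-endian means the low-order bytes of a value are stored at the higher memory addresses and the high-order bytes at the lower addresses
Little-endian means the low-order bytes of a value are stored at the lower memory addresses and the high-order bytes at the higher addresses

Where endianness comes from: computer memory is addressed in bytes, with each address unit corresponding to one byte of 8 bits. In C, however, besides the 8-bit char there are also the 16-bit short and the 32-bit long (the exact sizes depend on the compiler).

In addition, on processors wider than 8 bits, such as 16-bit or 32-bit processors, the registers are wider than one byte, so there is inevitably a question of how to arrange multiple bytes. This is what gave rise to big-endian and little-endian storage.

For example, take a 16-bit short x stored at memory address 0x0010 with the value 0x1122. Then 0x11 is the high-order byte and 0x22 is the low-order byte.

In big-endian mode, 0x11 goes to the lower address, 0x0010, and 0x22 goes to the higher address, 0x0011. In little-endian mode, 0x11 goes to the higher address, 0x0011, and 0x22 goes to the lower address, 0x0010.

The commonly used x86 architecture is little-endian, while KEIL C51 is big-endian. Many ARM and DSP processors are little-endian. Some ARM processors also let the hardware select between big-endian and little-endian mode.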
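
The 0x1122 example can be reproduced with NumPy by storing the value under each byte order and looking at the raw bytes. This is a small sketch; the first byte printed corresponds to the lower address (0x0010 in the example above).

```python
import numpy as np

x = 0x1122

# Store the same 16-bit value with an explicit byte order and dump the raw bytes.
big = np.array([x], dtype='>u2').tobytes()
little = np.array([x], dtype='<u2').tobytes()

print(big.hex())     # high-order byte 0x11 first
print(little.hex())  # low-order byte 0x22 first

# Reading the bytes back with the matching dtype recovers the original value.
roundtrip = np.frombuffer(big, dtype='>u2')[0]
print(hex(roundtrip))
```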

References

https://numpy.org/doc/stable/reference/arrays.dtypes.html
https://numpy.org/doc/stable/user/basics.types.html

Last updated: 2021-12-23 13:49:28. Tags: numpy, data types, dtype