
Data Loading API

Overview

This document summarizes the supported data formats and iterator APIs for loading data, which include:

mxnet.io - Data iterators for common data formats and utility functions
mxnet.recordio - Read and write the RecordIO data format
mxnet.image - Iterators for image data with augmentation

First, let's look at how to create an iterator for a new data format. The following iterator can be used to train a symbolic model whose input data sample is named data and whose label is named softmax_label. The iterator also provides information such as the batch size, shapes, and names.

>>> nd_iter = mx.io.NDArrayIter(data={'data':mx.nd.ones((100,10))},
...                             label={'softmax_label':mx.nd.ones((100,))},
...                             batch_size=25)
>>> print(nd_iter.provide_data)
[DataDesc[data,(25, 10L),<type 'numpy.float32'>,NCHW]]
>>> print(nd_iter.provide_label)
[DataDesc[softmax_label,(25,),<type 'numpy.float32'>,NCHW]]

Here is a complete example of how to use a data iterator when training a model:

>>> data = mx.sym.Variable('data')
>>> label = mx.sym.Variable('softmax_label')
>>> fullc = mx.sym.FullyConnected(data=data, num_hidden=1)
>>> loss = mx.sym.SoftmaxOutput(data=fullc, label=label)
>>> mod = mx.mod.Module(loss, data_names=['data'], label_names=['softmax_label'])
>>> mod.bind(data_shapes=nd_iter.provide_data, label_shapes=nd_iter.provide_label)
>>> mod.fit(nd_iter, num_epoch=2)
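
Once trained, the same iterator can be reset and reused for prediction; a minimal sketch using the standard Module API, with the variable names from the example above:

>>> nd_iter.reset()                 # rewind the iterator before reusing it
>>> outputs = mod.predict(nd_iter)  # forward pass over all batches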

A detailed tutorial is available in Iterators - Loading data.

Data Iterators

io.NDArrayIter - Returns an iterator over mx.nd.NDArray, numpy.ndarray, h5py.Dataset, mx.nd.sparse.CSRNDArray, or scipy.sparse.csr_matrix data
io.CSVIter - Returns an iterator over CSV files (see the sketch after this list)
io.LibSVMIter - Returns an iterator over LibSVM files; the data it yields uses CSR storage
io.ImageRecordIter - Iterates over image RecordIO files
io.ImageRecordInt8Iter - Iterates over image RecordIO files, returning int8 data
io.ImageRecordUInt8Iter - Iterates over image RecordIO files, returning uint8 data
io.MNISTIter - Iterator for the MNIST dataset
recordio.MXRecordIO - Reads and writes RecordIO data sequentially
recordio.MXIndexedRecordIO - Reads and writes RecordIO data with random access
image.ImageIter - Image iterator supporting a large number of augmentation operations
image.ImageDetIter - Image iterator for detection, supporting a large number of augmentation operations
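
For instance, a minimal sketch of reading a small CSV file with io.CSVIter; the file name data.csv and the 3-column data_shape are assumptions for illustration:

import mxnet as mx

# data.csv is assumed to hold rows of 3 comma-separated floats.
csv_iter = mx.io.CSVIter(data_csv='data.csv', data_shape=(3,), batch_size=4)
for batch in csv_iter:
    print(batch.data[0].shape)  # (4, 3) for each full batch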

Function and class reference

Data structures and other iterators

io.DataDesc - DataDesc stores the name, shape, type, and layout information of data or labels (see the sketch after this list)
io.DataBatch - A batch of data
io.DataIter - The base class for MXNet data iterators
io.ResizeIter - Resizes a data iterator to a given number of batches
io.PrefetchingIter - Performs pre-fetching for other data iterators
io.MXDataIter - A Python wrapper around a C++ data iterator
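
As a quick illustration, DataDesc and DataBatch can also be constructed directly; a minimal sketch where the names and shapes are arbitrary:

import mxnet as mx
from mxnet.io import DataDesc, DataBatch

# Describe a batch of 32 RGB 28x28 images in NCHW layout.
desc = DataDesc(name='data', shape=(32, 3, 28, 28), layout='NCHW')
print(desc.name, desc.shape, desc.layout)

# A DataBatch simply wraps lists of NDArrays for data and labels.
batch = DataBatch(data=[mx.nd.zeros(desc.shape)], label=[mx.nd.zeros((32,))])
print(len(batch.data), batch.data[0].shape)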

Functions for reading and writing RecordIO files

recordio.pack - Packs a string into MXImageRecord
recordio.unpack - Unpacks an MXImageRecord into a string
recordio.unpack_img - Unpacks an MXImageRecord into an image
recordio.pack_img - Packs an image into MXImageRecord (see the sketch after this list)
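
A minimal sketch of writing one image to a RecordIO file with these functions and reading it back; the file name tmp.rec and the dummy image are assumptions, and pack_img/unpack_img require OpenCV support in MXNet:

import numpy as np
import mxnet as mx

img = np.zeros((32, 32, 3), dtype=np.uint8)                   # dummy image
header = mx.recordio.IRHeader(flag=0, label=1.0, id=0, id2=0)

# Pack header + image into one record and append it sequentially.
writer = mx.recordio.MXRecordIO('tmp.rec', 'w')
writer.write(mx.recordio.pack_img(header, img, quality=95, img_fmt='.jpg'))
writer.close()

# Read the record back and unpack it into (header, image).
reader = mx.recordio.MXRecordIO('tmp.rec', 'r')
hdr, im = mx.recordio.unpack_img(reader.read())
reader.close()
print(hdr.label, im.shape)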

How to create a new iterator

Writing a new data iterator in Python is straightforward. Most MXNet training/inference programs accept an iterable object with provide_data and provide_label properties. This tutorial shows how to write an iterator from scratch.

The following example shows how to combine multiple data iterators into a single one. It can be used for multi-modal training such as image captioning, where the images are read with ImageRecordIter and the text with CSVIter.

import mxnet as mx
from mxnet.io import DataBatch

class MultiIter:
    """Combines several iterators; each batch is the concatenation of their batches."""
    def __init__(self, iter_list):
        self.iters = iter_list
    def next(self):
        # Draw one batch from every child iterator and merge their data/labels.
        batches = [i.next() for i in self.iters]
        return DataBatch(data=[d for b in batches for d in b.data],
                         label=[l for b in batches for l in b.label])
    def reset(self):
        for i in self.iters:
            i.reset()
    @property
    def provide_data(self):
        # Concatenated DataDesc lists from all child iterators.
        return [d for i in self.iters for d in i.provide_data]
    @property
    def provide_label(self):
        return [l for i in self.iters for l in i.provide_label]

iter = MultiIter([mx.io.ImageRecordIter('image.rec'), mx.io.CSVIter('txt.csv')])
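
Note that in practice both iterators are constructed with keyword arguments; a fuller sketch, where the file names, data shapes, and batch size are placeholders:

train_iter = MultiIter([
    mx.io.ImageRecordIter(path_imgrec='image.rec', data_shape=(3, 224, 224),
                          batch_size=32),
    mx.io.CSVIter(data_csv='txt.csv', data_shape=(30,), batch_size=32),
])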

Parsing input and performing preprocessing such as augmentation can be expensive. If performance is critical, you can implement a data iterator in C++; see src/io for reference.

How to change the batch layout

By default, the backend engine treats the first dimension of each data and label variable from the iterator as the batch size (i.e. the NCHW or NT layout). To override the batch axis, provide_data (and provide_label if there are labels) should include layout information. This is useful for RNNs, where the TNC layout is usually more efficient. For example:

@property
def provide_data(self):
    return [DataDesc(name='seq_var', shape=(seq_length, batch_size), layout='TN')]

The backend engine will recognize the index of N in the layout as the axis for batch size.

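For example, DataDesc.get_batch_axis returns the axis the engine will treat as the batch axis for a given layout; a small sketch:

>>> from mxnet.io import DataDesc
>>> DataDesc.get_batch_axis('NCHW')   # N is the first axis
0
>>> DataDesc.get_batch_axis('TNC')    # N is the second axis
1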

API Reference

mxnet.io - Data Iterators

class mxnet.io.NDArrayIter(data, label=None, batch_size=1, shuffle=False, last_batch_handle='pad', data_name='data', label_name='softmax_label')

Source

Returns an iterator over mx.nd.NDArray, numpy.ndarray, h5py.Dataset, mx.nd.sparse.CSRNDArray, or scipy.sparse.csr_matrix data.

Examples

>>> data = np.arange(40).reshape((10,2,2))
>>> labels = np.ones([10, 1])
>>> dataiter = mx.io.NDArrayIter(data, labels, 3, True, last_batch_handle='discard')
>>> for batch in dataiter:
...     print(batch.data[0].asnumpy())
...     batch.data[0].shape
...
[[[ 36.  37.]
  [ 38.  39.]]
 [[ 16.  17.]
  [ 18.  19.]]
 [[ 12.  13.]
  [ 14.  15.]]]
(3L, 2L, 2L)
[[[ 32.  33.]
  [ 34.  35.]]
 [[  4.   5.]
  [  6.   7.]]
 [[ 24.  25.]
  [ 26.  27.]]]
(3L, 2L, 2L)
[[[  8.   9.]
  [ 10.  11.]]
 [[ 20.  21.]
  [ 22.  23.]]
 [[ 28.  29.]
  [ 30.  31.]]]
(3L, 2L, 2L)
>>> dataiter.provide_data # Returns a list of `DataDesc`
[DataDesc[data,(3, 2L, 2L),<type 'numpy.float32'>,NCHW]]
>>> dataiter.provide_label # Returns a list of `DataDesc`
[DataDesc[softmax_label,(3, 1L),<type 'numpy.float32'>,NCHW]]

In the above example, the data is shuffled because the shuffle parameter is set to True, and the remaining examples are discarded because the last_batch_handle parameter is set to 'discard'.

Usage of last_batch_handle parameter:

>>> dataiter = mx.io.NDArrayIter(data, labels, 3, True, last_batch_handle='pad')
>>> batchidx = 0
>>> for batch in dataiter:
...     batchidx += 1
...
>>> batchidx  # Padding added after the examples read are over. So, 10/3+1 batches are created.
4
>>> dataiter = mx.io.NDArrayIter(data, labels, 3, True, last_batch_handle='discard')
>>> batchidx = 0
>>> for batch in dataiter:
...     batchidx += 1
...
>>> batchidx # Remaining examples are discarded. So, 10/3 batches are created.
3
>>> dataiter = mx.io.NDArrayIter(data, labels, 3, False, last_batch_handle='roll_over')
>>> batchidx = 0
>>> for batch in dataiter:
...     batchidx += 1
...
>>> batchidx # Remaining examples are rolled over to the next iteration.
3
>>> dataiter.reset()
>>> dataiter.next().data[0].asnumpy()
[[[ 36.  37.]
  [ 38.  39.]]
 [[ 0.  1.]
  [ 2.  3.]]
 [[ 4.  5.]
  [ 6.  7.]]]
(3L, 2L, 2L)

NDArrayIter also supports multiple input and labels.

>>> data = {'data1':np.zeros(shape=(10,2,2)), 'data2':np.zeros(shape=(20,2,2))}
>>> label = {'label1':np.zeros(shape=(10,1)), 'label2':np.zeros(shape=(20,1))}
>>> dataiter = mx.io.NDArrayIter(data, label, 3, True, last_batch_handle='discard')

NDArrayIter also supports mx.nd.sparse.CSRNDArray with last_batch_handle set to discard.

>>> csr_data = mx.nd.array(np.arange(40).reshape((10,4))).tostype('csr')
>>> labels = np.ones([10, 1])
>>> dataiter = mx.io.NDArrayIter(csr_data, labels, 3, last_batch_handle='discard')
>>> [batch.data[0] for batch in dataiter]
[
<CSRNDArray 3x4 @cpu(0)>,
<CSRNDArray 3x4 @cpu(0)>,
<CSRNDArray 3x4 @cpu(0)>]
Parameters:

data (array or list of array or dict of string to array) - The input data.
label (array or list of array or dict of string to array, optional) - The input label.
batch_size (int) - The batch size.
shuffle (bool, optional) - Whether to shuffle the data. Only supported if no h5py.Dataset inputs are used.
last_batch_handle (str, optional) - How to handle the last batch. This parameter can be 'pad', 'discard' or 'roll_over'. If 'pad', the last batch will be padded with data starting from the beginning. If 'discard', the last batch will be discarded. If 'roll_over', the remaining elements will be rolled over to the next iteration; note that this option is intended for training and can cause problems if used for prediction.
data_name (str, optional) - The data name.
label_name (str, optional) - The label name.