
Data Loading API

Overview

This document summarizes the supported data formats and iterator APIs for loading data, which include:

mxnet.io - Data iterators for common data formats and utility functions
mxnet.recordio - Read and write the RecordIO data format
mxnet.image - Iterators for image data with augmentation

First, let's look at how to create an iterator for a new data format. The following iterator can be used to train a symbolic model whose input data sample is named data and whose label is named softmax_label. The iterator also provides information such as the batch size, shapes, and names.

>>> nd_iter = mx.io.NDArrayIter(data={'data':mx.nd.ones((100,10))},
...                             label={'softmax_label':mx.nd.ones((100,))},
...                             batch_size=25)
>>> print(nd_iter.provide_data)
[DataDesc[data,(25, 10L),<type 'numpy.float32'>,NCHW]]
>>> print(nd_iter.provide_label)
[DataDesc[softmax_label,(25,),<type 'numpy.float32'>,NCHW]]

Here is a complete example of how to use a data iterator when training a model:

>>> data = mx.sym.Variable('data')
>>> label = mx.sym.Variable('softmax_label')
>>> fullc = mx.sym.FullyConnected(data=data, num_hidden=1)
>>> loss = mx.sym.SoftmaxOutput(data=fullc, label=label)
>>> mod = mx.mod.Module(loss, data_names=['data'], label_names=['softmax_label'])
>>> mod.bind(data_shapes=nd_iter.provide_data, label_shapes=nd_iter.provide_label)
>>> mod.fit(nd_iter, num_epoch=2)
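
Once trained, the same iterator can be reset and reused for prediction; a minimal sketch using the standard Module API, with the variable names from the example above:

>>> nd_iter.reset()                 # rewind the iterator before reusing it
>>> outputs = mod.predict(nd_iter)  # forward pass over all batches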

A detailed tutorial is available in Iterators - Loading data.

Data Iterators

io.NDArrayIter - Returns an iterator over mx.nd.NDArray, numpy.ndarray, h5py.Dataset, mx.nd.sparse.CSRNDArray, or scipy.sparse.csr_matrix data
io.CSVIter - Returns an iterator over CSV files (see the sketch after this list)
io.LibSVMIter - Returns an iterator over LibSVM files; the data it yields uses CSR storage
io.ImageRecordIter - Iterates over image RecordIO files
io.ImageRecordInt8Iter - Iterates over image RecordIO files, returning int8 data
io.ImageRecordUInt8Iter - Iterates over image RecordIO files, returning uint8 data
io.MNISTIter - Iterator for the MNIST dataset
recordio.MXRecordIO - Reads and writes RecordIO data sequentially
recordio.MXIndexedRecordIO - Reads and writes RecordIO data with random access
image.ImageIter - Image iterator supporting a large number of augmentation operations
image.ImageDetIter - Image iterator for detection, supporting a large number of augmentation operations
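
For instance, a minimal sketch of reading a small CSV file with io.CSVIter; the file name data.csv and the 3-column data_shape are assumptions for illustration:

import mxnet as mx

# data.csv is assumed to hold rows of 3 comma-separated floats.
csv_iter = mx.io.CSVIter(data_csv='data.csv', data_shape=(3,), batch_size=4)
for batch in csv_iter:
    print(batch.data[0].shape)  # (4, 3) for each full batch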

Function and class reference

Data structures and other iterators

io.DataDesc - DataDesc stores the name, shape, type, and layout information of data or labels (see the sketch after this list)
io.DataBatch - A batch of data
io.DataIter - The base class for MXNet data iterators
io.ResizeIter - Resizes a data iterator to a given number of batches
io.PrefetchingIter - Performs pre-fetching for other data iterators
io.MXDataIter - A Python wrapper around a C++ data iterator
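
As a quick illustration, DataDesc and DataBatch can also be constructed directly; a minimal sketch where the names and shapes are arbitrary:

import mxnet as mx
from mxnet.io import DataDesc, DataBatch

# Describe a batch of 32 RGB 28x28 images in NCHW layout.
desc = DataDesc(name='data', shape=(32, 3, 28, 28), layout='NCHW')
print(desc.name, desc.shape, desc.layout)

# A DataBatch simply wraps lists of NDArrays for data and labels.
batch = DataBatch(data=[mx.nd.zeros(desc.shape)], label=[mx.nd.zeros((32,))])
print(len(batch.data), batch.data[0].shape)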

Functions for reading and writing RecordIO files

recordio.pack - Packs a string into MXImageRecord
recordio.unpack - Unpacks an MXImageRecord into a string
recordio.unpack_img - Unpacks an MXImageRecord into an image
recordio.pack_img - Packs an image into MXImageRecord (see the sketch after this list)
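
A minimal sketch of writing one image to a RecordIO file with these functions and reading it back; the file name tmp.rec and the dummy image are assumptions, and pack_img/unpack_img require OpenCV support in MXNet:

import numpy as np
import mxnet as mx

img = np.zeros((32, 32, 3), dtype=np.uint8)                   # dummy image
header = mx.recordio.IRHeader(flag=0, label=1.0, id=0, id2=0)

# Pack header + image into one record and append it sequentially.
writer = mx.recordio.MXRecordIO('tmp.rec', 'w')
writer.write(mx.recordio.pack_img(header, img, quality=95, img_fmt='.jpg'))
writer.close()

# Read the record back and unpack it into (header, image).
reader = mx.recordio.MXRecordIO('tmp.rec', 'r')
hdr, im = mx.recordio.unpack_img(reader.read())
reader.close()
print(hdr.label, im.shape)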

How to create a new iterator

Writing a new data iterator in Python is straightforward. Most MXNet training/inference programs accept an iterable object with provide_data and provide_label properties. This tutorial shows how to write an iterator from scratch.

The following example shows how to combine multiple data iterators into a single one. It can be used for multi-modal training such as image captioning, where the images are read with ImageRecordIter and the text with CSVIter.

import mxnet as mx
from mxnet.io import DataBatch

class MultiIter:
    """Combines several iterators; each batch is the concatenation of their batches."""
    def __init__(self, iter_list):
        self.iters = iter_list
    def next(self):
        # Draw one batch from every child iterator and merge their data/labels.
        batches = [i.next() for i in self.iters]
        return DataBatch(data=[d for b in batches for d in b.data],
                         label=[l for b in batches for l in b.label])
    def reset(self):
        for i in self.iters:
            i.reset()
    @property
    def provide_data(self):
        # Concatenated DataDesc lists from all child iterators.
        return [d for i in self.iters for d in i.provide_data]
    @property
    def provide_label(self):
        return [l for i in self.iters for l in i.provide_label]

iter = MultiIter([mx.io.ImageRecordIter('image.rec'), mx.io.CSVIter('txt.csv')])
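
Note that in practice both iterators are constructed with keyword arguments; a fuller sketch, where the file names, data shapes, and batch size are placeholders:

train_iter = MultiIter([
    mx.io.ImageRecordIter(path_imgrec='image.rec', data_shape=(3, 224, 224),
                          batch_size=32),
    mx.io.CSVIter(data_csv='txt.csv', data_shape=(30,), batch_size=32),
])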

Parsing input and performing preprocessing such as augmentation can be expensive. If performance is critical, you can implement a data iterator in C++; see src/io for reference.

How to change the batch layout

By default, the backend engine treats the first dimension of each data and label variable from the iterator as the batch size (i.e. the NCHW or NT layout). To override the batch axis, provide_data (and provide_label if there are labels) should include layout information. This is useful for RNNs, where the TNC layout is usually more efficient. For example:

@property
def provide_data(self):
    return [DataDesc(name='seq_var', shape=(seq_length, batch_size), layout='TN')]

The backend engine will recognize the index of N in the layout as the axis for batch size.

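For example, DataDesc.get_batch_axis returns the axis the engine will treat as the batch axis for a given layout; a small sketch:

>>> from mxnet.io import DataDesc
>>> DataDesc.get_batch_axis('NCHW')   # N is the first axis
0
>>> DataDesc.get_batch_axis('TNC')    # N is the second axis
1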

API Reference

mxnet.io - Data Iterators

class mxnet.io.NDArrayIter(data, label=None, batch_size=1, shuffle=False, last_batch_handle='pad', data_name='data', label_name='softmax_label')

Source

Returns an iterator over mx.nd.NDArray, numpy.ndarray, h5py.Dataset, mx.nd.sparse.CSRNDArray, or scipy.sparse.csr_matrix data.

Examples

>>> data = np.arange(40).reshape((10,2,2))
>>> labels = np.ones([10, 1])
>>> dataiter = mx.io.NDArrayIter(data, labels, 3, True, last_batch_handle='discard')
>>> for batch in dataiter:
...     print(batch.data[0].asnumpy())
...     batch.data[0].shape
...
[[[ 36.  37.]
  [ 38.  39.]]
 [[ 16.  17.]
  [ 18.  19.]]
 [[ 12.  13.]
  [ 14.  15.]]]
(3L, 2L, 2L)
[[[ 32.  33.]
  [ 34.  35.]]
 [[  4.   5.]
  [  6.   7.]]
 [[ 24.  25.]
  [ 26.  27.]]]
(3L, 2L, 2L)
[[[  8.   9.]
  [ 10.  11.]]
 [[ 20.  21.]
  [ 22.  23.]]
 [[ 28.  29.]
  [ 30.  31.]]]
(3L, 2L, 2L)
>>> dataiter.provide_data # Returns a list of `DataDesc`
[DataDesc[data,(3, 2L, 2L),<type 'numpy.float32'>,NCHW]]
>>> dataiter.provide_label # Returns a list of `DataDesc`
[DataDesc[softmax_label,(3, 1L),<type 'numpy.float32'>,NCHW]]

In the above example, the data is shuffled because the shuffle parameter is set to True, and the remaining examples are discarded because the last_batch_handle parameter is set to 'discard'.

Usage of last_batch_handle parameter:

>>> dataiter = mx.io.NDArrayIter(data, labels, 3, True, last_batch_handle='pad')
>>> batchidx = 0
>>> for batch in dataiter:
...     batchidx += 1
...
>>> batchidx  # Padding added after the examples read are over. So, 10/3+1 batches are created.
4
>>> dataiter = mx.io.NDArrayIter(data, labels, 3, True, last_batch_handle='discard')
>>> batchidx = 0
>>> for batch in dataiter:
...     batchidx += 1
...
>>> batchidx # Remaining examples are discarded. So, 10/3 batches are created.
3
>>> dataiter = mx.io.NDArrayIter(data, labels, 3, False, last_batch_handle='roll_over')
>>> batchidx = 0
>>> for batch in dataiter:
...     batchidx += 1
...
>>> batchidx # Remaining examples are rolled over to the next iteration.
3
>>> dataiter.reset()
>>> dataiter.next().data[0].asnumpy()
[[[ 36.  37.]
  [ 38.  39.]]
 [[ 0.  1.]
  [ 2.  3.]]
 [[ 4.  5.]
  [ 6.  7.]]]
(3L, 2L, 2L)

NDArrayIter also supports multiple input and labels.

>>> data = {'data1':np.zeros(shape=(10,2,2)), 'data2':np.zeros(shape=(20,2,2))}
>>> label = {'label1':np.zeros(shape=(10,1)), 'label2':np.zeros(shape=(20,1))}
>>> dataiter = mx.io.NDArrayIter(data, label, 3, True, last_batch_handle='discard')

NDArrayIter also supports mx.nd.sparse.CSRNDArray with last_batch_handle set to discard.

>>> csr_data = mx.nd.array(np.arange(40).reshape((10,4))).tostype('csr')
>>> labels = np.ones([10, 1])
>>> dataiter = mx.io.NDArrayIter(csr_data, labels, 3, last_batch_handle='discard')
>>> [batch.data[0] for batch in dataiter]
[
<CSRNDArray 3x4 @cpu(0)>,
<CSRNDArray 3x4 @cpu(0)>,
<CSRNDArray 3x4 @cpu(0)>]
Parameters:

data (array or list of array or dict of string to array) - The input data.
label (array or list of array or dict of string to array, optional) - The input label.
batch_size (int) - The batch size.
shuffle (bool, optional) - Whether to shuffle the data. Only supported if no h5py.Dataset inputs are used.
last_batch_handle (str, optional) - How to handle the last batch. This parameter can be 'pad', 'discard' or 'roll_over'. If 'pad', the last batch will be padded with data starting from the beginning. If 'discard', the last batch will be discarded. If 'roll_over', the remaining elements will be rolled over to the next iteration; note that this option is intended for training and can cause problems if used for prediction.
data_name (str, optional) - The data name.
label_name (str, optional) - The label name.