Deep Learning

大数据机器学习实验室 (Big Data Machine Learning Lab) · published 2020-08-01 09:40:51

Paper translation: Deep Learning

Paper title: Deep Learning

Source: www.nature.com/articles/nature14539

Deep Learning

Yann LeCun∗ Yoshua Bengio∗ Geoffrey Hinton

Abstract

Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.

Main text

Machine-learning technology powers many aspects of modern society: from web searches to content filtering on social networks to recommendations on e-commerce websites, and it is increasingly present in consumer products such as cameras and smartphones. Machine-learning systems are used to identify objects in images, transcribe speech into text, match news items, posts or products with users’ interests, and select relevant results of search. Increasingly, these applications make use of a class of techniques called deep learning.

Conventional machine-learning techniques were limited in their ability to process natural data in their raw form. For decades, constructing a pattern-recognition or machine-learning system required careful engineering and considerable domain expertise to design a feature extractor that transformed the raw data (such as the pixel values of an image) into a suitable internal representation or feature vector from which the learning subsystem, often a classifier, could detect or classify patterns in the input.

Representation learning is a set of methods that allows a machine to be fed with raw data and to automatically discover the representations needed for detection or classification. Deep-learning methods are representation-learning methods with multiple levels of representation, obtained by composing simple but non-linear modules that each transform the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level. With the composition of enough such transformations, very complex functions can be learned. For classification tasks, higher layers of representation amplify aspects of the input that are important for discrimination and suppress irrelevant variations. An image, for example, comes in the form of an array of pixel values, and the learned features in the first layer of representation typically represent the presence or absence of edges at particular orientations and locations in the image. The second layer typically detects motifs by spotting particular arrangements of edges, regardless of small variations in the edge positions. The third layer may assemble motifs into larger combinations that correspond to parts of familiar objects, and subsequent layers would detect objects as combinations of these parts. The key aspect of deep learning is that these layers of features are not designed by human engineers: they are learned from data using a general-purpose learning procedure.

Deep learning is making major advances in solving problems that have resisted the best attempts of the artificial intelligence community for many years. It has turned out to be very good at discovering intricate structures in high-dimensional data and is therefore applicable to many domains of science, business and government. In addition to beating records in image recognition1–4 and speech recognition5–7, it has beaten other machine-learning techniques at predicting the activity of potential drug molecules8, analysing particle accelerator data9,10, reconstructing brain circuits11, and predicting the effects of mutations in non-coding DNA on gene expression and disease12,13. Perhaps more surprisingly, deep learning has produced extremely promising results for various tasks in natural language understanding14, particularly topic classification, sentiment analysis, question answering15 and language translation16,17.

We think that deep learning will have many more successes in the near future because it requires very little engineering by hand, so it can easily take advantage of increases in the amount of available computation and data. New learning algorithms and architectures that are currently being developed for deep neural networks will only accelerate this progress.

Supervised learning

The most common form of machine learning, deep or not, is supervised learning. Imagine that we want to build a system that can classify images as containing, say, a house, a car, a person or a pet. We first collect a large data set of images of houses, cars, people and pets, each labelled with its category. During training, the machine is shown an image and produces an output in the form of a vector of scores, one for each category. We want the desired category to have the highest score of all categories, but this is unlikely to happen before training. We compute an objective function that measures the error (or distance) between the output scores and the desired pattern of scores. The machine then modifies its internal adjustable parameters to reduce this error. These adjustable parameters, often called weights, are real numbers that can be seen as ‘knobs’ that define the input–output function of the machine. In a typical deep-learning system, there may be hundreds of millions of these adjustable weights, and hundreds of millions of labelled examples with which to train the machine.

To properly adjust the weight vector, the learning algorithm computes a gradient vector that, for each weight, indicates by what amount the error would increase or decrease if the weight were increased by a tiny amount. The weight vector is then adjusted in the opposite direction to the gradient vector.

The objective function, averaged over all the training examples, can be seen as a kind of hilly landscape in the high-dimensional space of weight values. The negative gradient vector indicates the direction of steepest descent in this landscape, taking it closer to a minimum, where the output error is low on average.
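
To make the "hilly landscape" picture concrete, the following is a minimal sketch of plain gradient descent on a toy two-weight objective. The quadratic objective, its minimum at (1, -2), the learning rate and the step count are all assumptions chosen for illustration; none of them come from the paper.

```python
import numpy as np

def objective(w):
    # Toy "landscape" (an assumption for illustration): a quadratic bowl
    # whose minimum sits at w = (1, -2).
    return (w[0] - 1.0) ** 2 + 2.0 * (w[1] + 2.0) ** 2

def gradient(w):
    # Analytic gradient of the toy objective above.
    return np.array([2.0 * (w[0] - 1.0), 4.0 * (w[1] + 2.0)])

def gradient_descent(w0, learning_rate=0.1, steps=200):
    # Repeatedly step in the direction of the negative gradient,
    # i.e. the direction of steepest descent.
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
    return w

w_final = gradient_descent([5.0, 5.0])
print(w_final, objective(w_final))
```

Starting from (5, 5), each step moves the weights a little way downhill, so the objective shrinks towards its minimum value of zero.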

In practice, most practitioners use a procedure called stochastic gradient descent (SGD). This consists of showing the input vector for a few examples, computing the outputs and the errors, computing the average gradient for those examples, and adjusting the weights accordingly. The process is repeated for many small sets of examples from the training set until the average of the objective function stops decreasing. It is called stochastic because each small set of examples gives a noisy estimate of the average gradient over all examples. This simple procedure usually finds a good set of weights surprisingly quickly when compared with far more elaborate optimization techniques18. After training, the performance of the system is measured on a different set of examples called a test set. This serves to test the generalization ability of the machine — its ability to produce sensible answers on new inputs that it has never seen during training.
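
The procedure described above can be sketched for the simplest possible model, a linear predictor trained with squared error. This is only an illustration under stated assumptions: the synthetic data, the function name `sgd_linear_regression`, the batch size and the learning rate are hypothetical choices, not anything specified in the paper.

```python
import numpy as np

def sgd_linear_regression(X, y, learning_rate=0.05, batch_size=8, epochs=100, seed=0):
    """Minibatch SGD for a linear model with 0.5 * mean squared error."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(epochs):
        order = rng.permutation(n_samples)           # visit examples in random order
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]    # a small set of examples
            residual = X[idx] @ w - y[idx]           # outputs minus targets
            grad = X[idx].T @ residual / len(idx)    # noisy estimate of the average gradient
            w -= learning_rate * grad                # adjust the weights accordingly
    return w

# Synthetic training and test sets (assumed data, for illustration only).
rng = np.random.default_rng(1)
true_w = np.array([2.0, -1.0, 0.5])
X_train = rng.normal(size=(200, 3))
y_train = X_train @ true_w + 0.01 * rng.normal(size=200)
X_test = rng.normal(size=(50, 3))
y_test = X_test @ true_w

w_hat = sgd_linear_regression(X_train, y_train)
test_mse = float(np.mean((X_test @ w_hat - y_test) ** 2))
print(w_hat, test_mse)
```

The error on `X_test`, which is never used for weight updates, plays the role of the test-set measurement of generalization mentioned in the text.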

Figure 1 | Multilayer neural networks and backpropagation. a, A multi-layer neural network (shown by the connected dots) can distort the input space to make the classes of data (examples of which are on the red and blue lines) linearly separable. Note how a regular grid (shown on the left) in input space is also transformed (shown in the middle panel) by hidden units. This is an illustrative example with only two input units, two hidden units and one output unit, but the networks used for object recognition or natural language processing contain tens or hundreds of thousands of units. Reproduced with permission from C. Olah (http://colah.github.io/). b, The chain rule of derivatives tells us how two small effects (that of a small change of $x$ on $y$, and that of $y$ on $z$) are composed. A small change $\Delta x$ in $x$ gets transformed first into a small change $\Delta y$ in $y$ by getting multiplied by $\partial y/\partial x$ (that is, the definition of partial derivative). Similarly, the change $\Delta y$ creates a change $\Delta z$ in $z$. Substituting one equation into the other gives the chain rule of derivatives: how $\Delta x$ gets turned into $\Delta z$ through multiplication by the product of $\partial y/\partial x$ and $\partial z/\partial y$. It also works when $x$, $y$ and $z$ are vectors (and the derivatives are Jacobian matrices). c, The equations used for computing the forward pass in a neural net with two hidden layers and one output layer, each constituting a module through which one can backpropagate gradients. At each layer, we first compute the total input $z$ to each unit, which is a weighted sum of the outputs of the units in the layer below. Then a non-linear function $f(\cdot)$ is applied to $z$ to get the output of the unit. For simplicity, we have omitted bias terms. The non-linear functions used in neural networks include the rectified linear unit (ReLU) $f(z) = \max(0, z)$, commonly used in recent years, as well as the more conventional sigmoids, such as the hyperbolic tangent, $f(z) = (\exp(z) - \exp(-z))/(\exp(z) + \exp(-z))$, and the logistic function, $f(z) = 1/(1 + \exp(-z))$.
d, The equations used for computing the backward pass. At each hidden layer we compute the error derivative with respect to the output of each unit, which is a weighted sum of the error derivatives with respect to the total inputs to the units in the layer above. We then convert the error derivative with respect to the output into the error derivative with respect to the input by multiplying it by the gradient of $f(z)$. At the output layer, the error derivative with respect to the output of a unit is computed by differentiating the cost function. This gives $y_l - t_l$ if the cost function for unit $l$ is $0.5(y_l - t_l)^2$, where $t_l$ is the target value. Once $\partial E/\partial z_k$ is known, the error derivative for the weight $w_{jk}$ on the connection from unit $j$ in the layer below is just $y_j\,\partial E/\partial z_k$.
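
The forward and backward passes described in panels c and d can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the layer sizes, the random weights, the choice of tanh as $f$ at every layer (including the output) and the helper names are all assumptions. The final lines compare one backpropagated derivative with a finite-difference estimate as a sanity check.

```python
import numpy as np

def f(z):
    return np.tanh(z)                 # non-linearity (assumed: tanh everywhere)

def f_prime(z):
    return 1.0 - np.tanh(z) ** 2      # gradient of f(z)

def forward(x, weights):
    """Return total inputs z and outputs y of every layer (bias terms omitted)."""
    ys, zs = [x], []
    for W in weights:
        z = W @ ys[-1]                # weighted sum of outputs of the layer below
        zs.append(z)
        ys.append(f(z))
    return zs, ys

def backward(zs, ys, weights, target):
    """Return dE/dW per layer for E = 0.5 * sum((y_out - target) ** 2)."""
    grads = [None] * len(weights)
    dE_dy = ys[-1] - target                          # y_l - t_l at the output layer
    for layer in reversed(range(len(weights))):
        dE_dz = dE_dy * f_prime(zs[layer])           # output derivative -> input derivative
        grads[layer] = np.outer(dE_dz, ys[layer])    # dE/dw_jk = y_j * dE/dz_k
        dE_dy = weights[layer].T @ dE_dz             # weighted sum sent to the layer below
    return grads

def cost(x, weights, target):
    _, ys = forward(x, weights)
    return 0.5 * np.sum((ys[-1] - target) ** 2)

rng = np.random.default_rng(0)
sizes = [2, 3, 3, 1]   # input, two hidden layers, output (arbitrary sizes)
weights = [rng.normal(scale=0.5, size=(sizes[i + 1], sizes[i])) for i in range(3)]
x, target = np.array([0.3, -0.7]), np.array([0.5])

zs, ys = forward(x, weights)
grads = backward(zs, ys, weights, target)

# Finite-difference estimate for a single weight in the first layer.
eps = 1e-6
perturbed = [W.copy() for W in weights]
perturbed[0][1, 0] += eps
numeric = (cost(x, perturbed, target) - cost(x, weights, target)) / eps
print(grads[0][1, 0], numeric)
```

Each weight matrix here has shape (units in this layer, units in the layer below), so `grads[layer]` has the same shape as `weights[layer]`.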

Figure 2 | Inside a convolutional network. The outputs (not the filters) of each layer of a typical convolutional network architecture applied to the image of a Samoyed dog (bottom left; and RGB (red, green, blue) inputs, bottom right). Each rectangular image is a feature map corresponding to the output for one of the learned features, detected at each of the image positions. Information flows bottom up, with lower-level features acting as oriented edge detectors, and a score is computed for each image class in output. ReLU, rectified linear unit.

Many of the current practical applications of machine learning use linear classifiers on top of hand-engineered features. A two-class linear classifier computes a weighted sum of the feature vector components. If the weighted sum is above a threshold, the input is classified as belonging to a particular category. Since the 1960s we have known that linear classifiers can only carve their input space into very simple regions, namely half-spaces separated by a hyperplane19. But problems such as image and speech recognition require the input–output function to be insensitive to irrelevant variations of the input, such as variations in position, orientation or illumination of an object, or variations in the pitch or accent of speech, while being very sensitive to particular minute variations (for example, the difference between a white wolf and a breed of wolf-like white dog called a Samoyed). At the pixel level, images of two Samoyeds in different poses and in different environments may be very different from each other, whereas two images of a Samoyed and a wolf in the same position and on similar backgrounds may be very similar to each other. A linear classifier, or any other ‘shallow’ classifier operating on raw pixels could not possibly distinguish the latter two, while putting the former two in the same category. This is why shallow classifiers require a good feature extractor that solves the selectivity–invariance dilemma — one that produces representations that are selective to the aspects of the image that are important for discrimination, but that are invariant to irrelevant aspects such as the pose of the animal. To make classifiers more powerful, one can use generic non-linear features, as with kernel methods20, but generic features such as those arising with the Gaussian kernel do not allow the learner to generalize well far from the training examples21. 
The conventional option is to hand design good feature extractors, which requires a considerable amount of engineering skill and domain expertise. But this can all be avoided if good features can be learned automatically using a general-purpose learning procedure. This is the key advantage of deep learning.
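
As a small illustration of the half-space limitation discussed above, the sketch below implements a two-class linear classifier (weighted sum compared with a threshold) and tries it on the XOR problem, a standard example of classes that no single hyperplane separates. XOR is not taken from the paper; it is an assumed stand-in for the wolf/Samoyed situation, and the coarse grid of weights and thresholds is a hypothetical search that illustrates the point without proving it.

```python
import itertools
import numpy as np

def linear_classify(features, weights, threshold):
    """Return True where the weighted sum of the features exceeds the threshold."""
    return features @ weights > threshold

# XOR: four points in two classes that cannot be split by one hyperplane.
xor_inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
xor_labels = np.array([False, True, True, False])

# Try every combination of two weights and a threshold on a coarse grid.
grid = np.linspace(-2.0, 2.0, 21)
best_accuracy = 0.0
for w1, w2, threshold in itertools.product(grid, repeat=3):
    predictions = linear_classify(xor_inputs, np.array([w1, w2]), threshold)
    best_accuracy = max(best_accuracy, float(np.mean(predictions == xor_labels)))

print(best_accuracy)
```

No setting on the grid classifies more than three of the four points correctly, which is why such a classifier depends on a feature extractor (hand-designed or learned) to reshape the input space first.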

A deep-learning architecture is a multilayer stack of simple modules, all (or most) of which are subject to learning, and many of which compute non-linear input–output mappings. Each module in the stack transforms its input to increase both the selectivity and the invariance of the representation. With multiple non-linear layers, say a depth of 5 to 20, a system can implement extremely intricate functions of its inputs that are simultaneously sensitive to minute details — distinguishing Samoyeds from white wolves — and insensitive to large irrelevant variations such as the background, pose, lighting and surrounding objects.

Backpropagation to train multilayer architectures

From the earliest days of pattern recognition22,23, the aim of researchers has been to replace hand-engineered features with trainable multilayer networks, but despite its simplicity, the solution was not widely understood until the mid 1980s. As it turns out, multilayer architectures can be trained by simple stochastic gradient descent. As long as the modules are relatively smooth functions of their inputs and of their internal weights, one can compute gradients using the backpropagation procedure. The idea that this could be done, and that it worked, was discovered independently by several different groups during the 1970s and 1980s24–27.

The backpropagation procedure to compute the gradient of an objective function with respect to the weights of a multilayer stack of modules is nothing more than a practical application of the chain rule for derivatives. The key insight is that the derivative (or gradient) of the objective with respect to the input of a module can be computed by working backwards from the gradient with respect to the output of that module (or the input of the subsequent module) (Fig. 1). The backpropagation equation can be applied repeatedly to propagate gradients through all modules, starting from the output at the top (where the network produces its prediction) all the way to the bottom (where the external input is fed). Once these gradients have been computed, it is straightforward to compute the gradients with respect to the weights of each module.
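
The chain-rule step this paragraph relies on can be written out explicitly. The notation is an assumption made here for illustration: $E$ is the objective, $x$ and $y$ are a module's input and output (column vectors), and $W$ is the weight matrix of a purely linear module, used only as the simplest example of the final step.

```latex
\begin{align*}
\Delta y &\approx \frac{\partial y}{\partial x}\,\Delta x, \qquad
\Delta z \approx \frac{\partial z}{\partial y}\,\Delta y
\;\Longrightarrow\;
\Delta z \approx \frac{\partial z}{\partial y}\,\frac{\partial y}{\partial x}\,\Delta x \\
\frac{\partial E}{\partial x} &= \left(\frac{\partial y}{\partial x}\right)^{\!\top}
\frac{\partial E}{\partial y}
\qquad \text{(input gradient from output gradient)} \\
y = W x \;&\Longrightarrow\;
\frac{\partial E}{\partial W} = \frac{\partial E}{\partial y}\, x^{\top}
\end{align*}
```

Applying the second line module by module, from the top of the stack to the bottom, gives every $\partial E/\partial y$; the third line then turns each of them into a weight gradient.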

Many applications of deep learning use feedforward neural network architectures (Fig. 1), which learn to map a fixed-size input (for example, an image) to a fixed-size output (for example, a probability for each of several categories). To go from one layer to the next, a set of units compute a weighted sum of their inputs from the previous layer and pass the result through a non-linear function. At present, the most popular non-linear function is the rectified linear unit (ReLU), which is simply the half-wave rectifier f(z) = max(z, 0). In past decades, neural nets used smoother non-linearities, such as tanh(z) or 1/(1 + exp(−z)), but the ReLU typically learns much faster in networks with many layers, allowing training of a deep supervised network without unsupervised pre-training28. Units that are not in the input or output layer are conventionally called hidden units. The hidden layers can be seen as distorting the input in a non-linear way so that categories become linearly separable by the last layer (Fig. 1).
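
The non-linearities named above, and a single layer-to-layer step, can be sketched as follows. The function names and the example weights are assumptions for illustration only.

```python
import numpy as np

def relu(z):
    # Half-wave rectifier: f(z) = max(z, 0).
    return np.maximum(z, 0.0)

def logistic(z):
    # Logistic sigmoid: f(z) = 1 / (1 + exp(-z)).
    return 1.0 / (1.0 + np.exp(-z))

def tanh_from_exp(z):
    # Hyperbolic tangent written out with exponentials.
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

def layer_forward(inputs, weights, nonlinearity=relu):
    # Weighted sum of the previous layer's outputs, passed through a non-linearity.
    return nonlinearity(weights @ inputs)

previous_layer = np.array([1.0, -2.0, 0.5])
example_weights = np.array([[0.2, 0.4, -1.0],
                            [-0.5, 0.1, 0.3]])   # hypothetical 2 x 3 weight matrix
hidden = layer_forward(previous_layer, example_weights)
print(hidden)
```

For these example values the first unit's weighted sum is negative, so the ReLU clips it to zero, while the second unit's weighted sum is also negative and is clipped as well; swapping in `logistic` or `np.tanh` would instead give small non-zero outputs.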

In the late 1990s, neural nets and backpropagation were largely forsaken by the machine-learning community and ignored by the computer-vision and speech-recognition communities. It was widely thought that learning useful, multistage, feature extractors with little prior knowledge was infeasible. In particular, it was commonly thought that simple gradient descent would get trapped in poor local minima: weight configurations for which no small change would reduce the average error.

In practice, poor local minima are rarely a problem with large networks. Regardless of the initial conditions, the system nearly always reaches solutions of very similar quality. Recent theoretical and empirical results strongly suggest that local minima are not a serious issue in general. Instead, the landscape is packed with a combinatorially large number of saddle points where the gradient is zero, and the surface curves up in most dimensions and curves down in the remainder29,30. The analysis seems to show that saddle points with only a few downward curving directions are present in very large numbers, but almost all of them have very similar values of the objective function. Hence, it does not much matter which of these saddle points the algorithm gets stuck at.
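
A saddle point of the kind described here can be checked numerically on a toy function. The three-weight objective below is an assumption chosen for illustration: it curves up in two directions and down in one, so its gradient vanishes at the origin even though the origin is not a minimum.

```python
import numpy as np

def objective(w):
    # Toy surface (assumed): curves up along w[0] and w[1], down along w[2].
    return w[0] ** 2 + w[1] ** 2 - w[2] ** 2

def gradient(w):
    return np.array([2.0 * w[0], 2.0 * w[1], -2.0 * w[2]])

# The Hessian of this objective is constant and diagonal.
hessian = np.diag([2.0, 2.0, -2.0])

critical_point = np.zeros(3)
eigenvalues = np.linalg.eigvalsh(hessian)
n_up = int(np.sum(eigenvalues > 0))      # directions in which the surface curves up
n_down = int(np.sum(eigenvalues < 0))    # directions in which it curves down

print(gradient(critical_point), n_up, n_down)
```

The mixed signs of the Hessian eigenvalues are what distinguish a saddle point from a local minimum, where every eigenvalue would be non-negative.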

Interest in deep feedforward networks was revived around 2006 (refs 31–34) by a group of researchers brought together by the Canadian Institute for Advanced Research (CIFAR). The researchers introduced unsupervised learning procedures that could create layers of feature detectors without requiring labelled data. The objective in learning each layer of feature detectors was to be able to reconstruct or model the activities of feature detectors (or raw inputs) in the layer below. By ‘pre-training’ several layers of progressively more complex feature detectors using this reconstruction objective, the weights of a deep network could be initialized to sensible values. A final layer of output units could then be added to the top of the network and the whole deep system could be fine-tuned using standard backpropagation33–35. This worked remarkably well for recognizing handwritten digits or for detecting pedestrians, especially when the amount of labelled data was very limited36.

The first major application of this pre-training approach was in speech recognition, and it was made possible by the advent of fast graphics processing units (GPUs) that were convenient to program37 and allowed researchers to train networks 10 or 20 times faster. In 2009, the approach was used to map short temporal windows of coefficients extracted from a sound wave to a set of probabilities for the various fragments of speech that might be represented by the frame in the centre of the window. It achieved record-breaking results on a standard speech recognition benchmark that used a small vocabulary38 and was quickly developed to give record-breaking results on a large vocabulary task39. By 2012, versions of the deep net from 2009 were being developed by many of the major speech groups6 and were already being deployed in Android phones. For smaller data sets, unsupervised pre-training helps to prevent overfitting40, leading to significantly better generalization when the number of labelled examples is small, or in a transfer setting where we have lots of examples for some ‘source’ tasks but very few for some ‘target’ tasks. Once deep learning had been rehabilitated, it turned out that the pre-training stage was only needed for small data sets.

There was, however, one particular type of deep, feedforward network that was much easier to train and generalized much better than networks with full connectivity between adjacent layers. This was the convolutional neural network (ConvNet)41,42. It achieved many practical successes during the period when neural networks were out of favour and it has recently been widely adopted by the computer-vision community.

Convolutional neural networks

ConvNets are designed to process data that come in the form of multiple arrays, for example a colour image composed of three 2D arrays containing pixel intensities in the three colour channels. Many data modalities are in the form of multiple arrays: 1D for signals and sequences, including language; 2D for images or audio spectrograms; and 3D for video or volumetric images. There are four key ideas behind ConvNets that take advantage of the properties of natural signals: local connections, shared weights, pooling and the use of many layers.

The architecture of a typical ConvNet (Fig. 2) is structured as a series of stages. The first few stages are composed of two types of layers: convolutional layers and pooling layers. Units in a convolutional layer are organized in feature maps, within which each unit is connected to local patches in the feature maps of the previous layer through a set of weights called a filter bank. The result of this local weighted sum is then passed through a non-linearity such as a ReLU. All units in a feature map share the same filter bank. Different feature maps in a layer use different filter banks. The reason for this architecture is twofold. First, in array data such as images, local groups of values are often highly correlated, forming distinctive local motifs that are easily detected. Second, the local statistics of images and other signals are invariant to location. In other words, if a motif can appear in one part of the image, it could appear anywhere, hence the idea of units at different locations sharing the same weights and detecting the same pattern in different parts of the array. Mathematically, the filtering operation performed by a feature map is a discrete convolution, hence the name.
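
The shared-weight filtering described above can be sketched as a single-channel feature map computed with explicit loops. This is a minimal illustration under stated assumptions: one input channel, one filter, no padding, stride 1, and hypothetical names throughout. The kernel is flipped so that the operation is a true discrete convolution; many deep-learning libraries skip the flip and compute a cross-correlation, which makes no practical difference when the filter is learned.

```python
import numpy as np

def feature_map(image, kernel):
    """Valid 2D discrete convolution of one channel with one filter, then ReLU."""
    k_h, k_w = kernel.shape
    flipped = kernel[::-1, ::-1]                 # flip for true convolution
    out_h = image.shape[0] - k_h + 1
    out_w = image.shape[1] - k_w + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + k_h, j:j + k_w]  # local patch of the previous layer
            out[i, j] = np.sum(patch * flipped)  # same weights at every location
    return np.maximum(out, 0.0)                  # ReLU non-linearity

# A 4 x 4 image with a vertical edge between columns 1 and 2.
image = np.array([[0.0, 0.0, 1.0, 1.0]] * 4)
edge_filter = np.array([[1.0, -1.0]])            # hypothetical 1 x 2 edge detector

response = feature_map(image, edge_filter)
print(response)
```

Because the same filter is applied at every position, the edge is detected wherever it occurs: the response is non-zero only in the column where the intensity changes.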

Although the role of the convolutional layer is to detect local conjunctions of features from the previous layer, the role of the pooling layer is to merge semantically similar features into one. Because the relative positions of the features forming a motif can vary somewhat, reliably detecting the motif can be done by coarse-graining the position of each feature. A typical pooling unit computes the maximum of a local patch of units in one feature map (or in a few feature maps). Neighbouring pooling units take input from patches that are shifted by more than one row or column, thereby reducing the dimension of the representation and creating an invariance to small shifts and distortions. Two or three stages of convolution, non-linearity and pooling are stacked, followed by more convolutional and fully-connected layers. Backpropagating gradients through a ConvNet is as simple as through a regular deep network, allowing all the weights in all the filter banks to be trained.
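
The pooling step can be sketched in the same style. The 2 x 2 window and stride of 2 are assumed values chosen for illustration; the two test arrays show the invariance to a small shift that the paragraph describes.

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Maximum over local patches; neighbouring patches are shifted by `stride`."""
    height, width = feature_map.shape
    out_h = (height - size) // stride + 1
    out_w = (width - size) // stride + 1
    pooled = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = feature_map[i * stride:i * stride + size,
                                j * stride:j * stride + size]
            pooled[i, j] = patch.max()
    return pooled

# The same feature detected at two slightly different positions.
activations = np.zeros((4, 4))
activations[0, 0] = 1.0
shifted = np.zeros((4, 4))
shifted[1, 1] = 1.0

print(max_pool(activations))
print(max_pool(shifted))
```

Both inputs pool to the same 2 x 2 output, and the representation shrinks from 16 values to 4, which is the dimension reduction mentioned in the text.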

Deep neural networks exploit the property that many natural signals are compositional hierarchies, in which higher-level features are obtained by composing lower-level ones. In images, local combinations of edges form motifs, motifs assemble into parts, and parts form objects. Similar hierarchies exist in speech and text from sounds to phones, phonemes, syllables, words and sentences. The pooling allows representations to vary very little when elements in the previous layer vary in position and appearance.

The convolutional and pooling layers in ConvNets are directly inspired by the classic notions of simple cells and complex cells in visual neuroscience43, and the overall architecture is reminiscent of the LGN–V1–V2–V4–IT hierarchy in the visual cortex ventral pathway44. When ConvNet models and monkeys are shown the same picture, the activations of high-level units in the ConvNet explain half of the variance of random sets of 160 neurons in the monkey’s inferotemporal cortex45. ConvNets have their roots in the neocognitron46, the architecture of which was somewhat similar, but did not have an end-to-end supervised-learning algorithm such as backpropagation. A primitive 1D ConvNet called a time-delay neural net was used for the recognition of phonemes and simple words47,48.

There have been numerous applications of convolutional networks going back to the early 1990s, starting with time-delay neural networks for speech recognition47 and document reading42. The document reading system used a ConvNet trained jointly with a probabilistic model that implemented language constraints. By the late 1990s this system was reading over 10% of all the cheques in the United States. A number of ConvNet-based optical character recognition and handwriting recognition systems were later deployed by Microsoft49. ConvNets were also experimented with in the early 1990s for object detection in natural images, including faces and hands50,51, and for face recognition52.

Figure 3 | From image to text. Captions generated by a recurrent neural network (RNN) taking, as extra input, the representation extracted by a deep convolution neural network (CNN) from a test image, with the RNN trained to ‘translate’ high-level representations of images into captions (top). Reproduced with permission from ref. 102. When the RNN is given the ability to focus its attention on a different location in the input image (middle and bottom; the lighter patches were given more attention) as it generates each word (bold), we found86 that it exploits this to achieve better ‘translation’ of images into captions.

Image understanding with deep convolutional networks

深度卷积网络的图像理解

Since the early 2000s, ConvNets have been applied with great success to the detection, segmentation and recognition of objects and regions in images. These were all tasks in which labelled data was relatively abundant, such as traffic sign recognition53, the segmentation of biological images54 particularly for connectomics55, and the detection of faces, text, pedestrians and human bodies in natural images36,50,51,56–58. A major recent practical success of ConvNets is face recognition59.

自 21 世纪初以来，ConvNet 在图像中对象和区域的检测、分割和识别方面取得了巨大成功。这些都是标注数据相对丰富的任务，如交通标志识别53、生物图像分割54（特别是用于连接组学55），以及自然图像中人脸、文字、行人和人体的检测36，50，51，56-58。ConvNet 最近取得的一项重大实际成功是人脸识别59。

Importantly, images can be labelled at the pixel level, which will have applications in technology, including autonomous mobile robots and self-driving cars60,61. Companies such as Mobileye and NVIDIA are using such ConvNet-based methods in their upcoming vision systems for cars. Other applications gaining importance involve natural language understanding14 and speech recognition7.

重要的是，图像可以在像素级别上进行标注，这将在自主移动机器人和自动驾驶汽车等技术中得到应用60，61。Mobileye 和 NVIDIA 等公司正在其即将推出的汽车视觉系统中使用此类基于 ConvNet 的方法。其他日益重要的应用涉及自然语言理解14和语音识别7。

Despite these successes, ConvNets were largely forsaken by the mainstream computer-vision and machine-learning communities until the ImageNet competition in 2012. When deep convolutional networks were applied to a data set of about a million images from the web that contained 1,000 different classes, they achieved spectacular results, almost halving the error rates of the best competing approaches1. This success came from the efficient use of GPUs, ReLUs, a new regularization technique called dropout62, and techniques to generate more training examples by deforming the existing ones. This success has brought about a revolution in computer vision; ConvNets are now the dominant approach for almost all recognition and detection tasks4,58,59,63–65 and approach human performance on some tasks. A recent stunning demonstration combines ConvNets and recurrent net modules for the generation of image captions (Fig. 3).

尽管取得了这些成功，但在 2012 年 ImageNet 竞赛之前，ConvNet 基本上被主流计算机视觉和机器学习社区所忽视。当深度卷积网络应用于一个来自网络、包含 1,000 个不同类别、约 100 万张图像的数据集时，它们取得了惊人的结果，几乎将最佳竞争方法的错误率减半1。这一成功来自对 GPU 和 ReLU 的高效使用、一种称为 dropout 的新正则化技术62，以及通过对现有样本进行变形来生成更多训练样本的技术。这一成功带来了计算机视觉的革命；ConvNet 现在是几乎所有识别和检测任务的主导方法4，58，59，63-65，并在某些任务上接近人类水平。最近一个令人惊叹的演示结合了 ConvNet 和循环网络模块，用于生成图像字幕（图 3）。

Recent ConvNet architectures have 10 to 20 layers of ReLUs, hundreds of millions of weights, and billions of connections between units. Whereas training such large networks could have taken weeks only two years ago, progress in hardware, software and algorithm parallelization have reduced training times to a few hours.

最近的 ConvNet 架构有 10 到 20 层 ReLU、数亿个权重以及单元之间数十亿个连接。仅在两年前，训练这样的大型网络可能还需要数周时间，而硬件、软件和算法并行化方面的进展已将训练时间缩短到几个小时。

The performance of ConvNet-based vision systems has caused most major technology companies, including Google, Facebook, Microsoft, IBM, Yahoo!, Twitter and Adobe, as well as a quickly growing number of start-ups to initiate research and development projects and to deploy ConvNet-based image understanding products and services.

基于 ConvNet 的视觉系统的出色性能，促使大多数主要科技公司（包括 Google、Facebook、Microsoft、IBM、Yahoo!、Twitter 和 Adobe）以及数量迅速增长的初创公司启动研发项目，并部署基于 ConvNet 的图像理解产品和服务。

ConvNets are easily amenable to efficient hardware implementations in chips or field-programmable gate arrays66,67. A number of companies such as NVIDIA, Mobileye, Intel, Qualcomm and Samsung are developing ConvNet chips to enable real-time vision applications in smartphones, cameras, robots and self-driving cars.

ConvNet 易于适应芯片或现场可编程门阵列中的高效硬件实现 66，67。NVIDIA、Mobileye、英特尔、高通和三星等多家公司正在开发 ConvNet 芯片，以实现智能手机、相机、机器人和自动驾驶汽车的实时视觉应用。

Distributed representations and language processing

分布式表示和语言处理

Deep-learning theory shows that deep nets have two different exponential advantages over classic learning algorithms that do not use distributed representations21. Both of these advantages arise from the power of composition and depend on the underlying data-generating distribution having an appropriate componential structure40. First, learning distributed representations enable generalization to new combinations of the values of learned features beyond those seen during training (for example, 2^n combinations are possible with n binary features)68,69. Second, composing layers of representation in a deep net brings the potential for another exponential advantage70 (exponential in the depth).

深度学习理论表明，与不使用分布式表示的经典学习算法相比，深度网络具有两种不同的指数级优势21。这两种优势都源于组合的力量，并依赖于底层的数据生成分布具有合适的组件结构40。首先，学习分布式表示能够泛化到训练期间未见过的、已学特征取值的新组合（例如，n 个二值特征可以有 2^n 种组合）68，69。其次，在深度网络中组合多个表示层带来了另一种指数级优势的潜力70（随深度呈指数增长）。

The hidden layers of a multilayer neural network learn to represent the network’s inputs in a way that makes it easy to predict the target outputs. This is nicely demonstrated by training a multilayer neural network to predict the next word in a sequence from a local context of earlier words71. Each word in the context is presented to the network as a one-of-N vector, that is, one component has a value of 1 and the rest are 0. In the first layer, each word creates a different pattern of activations, or word vectors (Fig. 4). In a language model, the other layers of the network learn to convert the input word vectors into an output word vector for the predicted next word, which can be used to predict the probability for any word in the vocabulary to appear as the next word. The network learns word vectors that contain many active components each of which can be interpreted as a separate feature of the word, as was first demonstrated27 in the context of learning distributed representations for symbols. These semantic features were not explicitly present in the input. They were discovered by the learning procedure as a good way of factorizing the structured relationships between the input and output symbols into multiple ‘micro-rules’. Learning word vectors turned out to also work very well when the word sequences come from a large corpus of real text and the individual micro-rules are unreliable71. When trained to predict the next word in a news story, for example, the learned word vectors for Tuesday and Wednesday are very similar, as are the word vectors for Sweden and Norway. Such representations are called distributed representations because their elements (the features) are not mutually exclusive and their many configurations correspond to the variations seen in the observed data. These word vectors are composed of learned features that were not determined ahead of time by experts, but automatically discovered by the neural network. 
Vector representations of words learned from text are now very widely used in natural language applications14,17,72–76.

多层神经网络的隐藏层学习以一种便于预测目标输出的方式来表示网络的输入。训练一个多层神经网络，根据前面若干单词构成的局部上下文来预测序列中的下一个单词71，就很好地证明了这一点。上下文中的每个单词都以 N 中取一（one-of-N）向量的形式提供给网络，即一个分量的值为 1，其余分量为 0。在第一层中，每个单词产生不同的激活模式，即词向量（图 4）。在语言模型中，网络的其他层学习将输入词向量转换为所预测的下一个单词的输出词向量，该向量可用于预测词汇表中任意单词作为下一个单词出现的概率。网络学到的词向量包含许多活跃分量，每个分量都可以解释为该单词的一个独立特征，这一点最早是在学习符号的分布式表示时得到证明的27。这些语义特征并没有显式地出现在输入中。它们是学习过程发现的，是把输入符号和输出符号之间的结构化关系分解为多个"微规则"的一种好方法。事实证明，当单词序列来自大型真实文本语料库、且单个微规则并不可靠时，学习词向量的效果也非常好71。例如，在训练预测新闻报道中的下一个单词时，学到的 Tuesday 和 Wednesday 的词向量非常相似，Sweden 和 Norway 的词向量也是如此。这种表示被称为分布式表示，因为它们的元素（特征）并不互斥，并且它们的众多组态对应于观测数据中出现的各种变化。这些词向量由学到的特征组成，这些特征不是由专家事先确定的，而是由神经网络自动发现的。从文本中学到的词向量表示如今已非常广泛地用于自然语言应用14，17，72-76。
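The one-of-N input and the first-layer word vectors described above can be illustrated with a small sketch. The vocabulary, the matrix `E` and all of its values are made up for illustration; in a real language model these weights would be learned by backpropagation rather than written by hand.

```python
import math

# Hypothetical five-word vocabulary (illustration only).
vocab = ["monday", "tuesday", "wednesday", "sweden", "norway"]

def one_of_n(word):
    """One-of-N vector: one component is 1, the rest are 0."""
    return [1.0 if w == word else 0.0 for w in vocab]

# Hypothetical first-layer weights, one row per word (values are invented,
# not learned): rows for similar words were chosen to be similar.
E = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.8, 0.1, 0.2],
    [0.0, 0.9, 0.7],
    [0.1, 0.9, 0.6],
]

def word_vector(word):
    """Multiply the one-of-N vector by E; this selects the word's row."""
    x = one_of_n(word)
    return [sum(x[i] * E[i][j] for i in range(len(vocab)))
            for j in range(len(E[0]))]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)

sim_days = cosine(word_vector("tuesday"), word_vector("wednesday"))
sim_mixed = cosine(word_vector("tuesday"), word_vector("sweden"))
```

With these invented weights, `sim_days` is larger than `sim_mixed`, mirroring the observation that the learned vectors for Tuesday and Wednesday end up close to each other.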

The issue of representation lies at the heart of the debate between the logic-inspired and the neural-network-inspired paradigms for cognition. In the logic-inspired paradigm, an instance of a symbol is something for which the only property is that it is either identical or non-identical to other symbol instances. It has no internal structure that is relevant to its use; and to reason with symbols, they must be bound to the variables in judiciously chosen rules of inference. By contrast, neural networks just use big activity vectors, big weight matrices and scalar non-linearities to perform the type of fast ‘intuitive’ inference that underpins effortless commonsense reasoning.

表示问题是逻辑启发范式与神经网络启发范式之间关于认知的争论的核心。在逻辑启发范式中，符号实例的唯一属性是它与其他符号实例相同或不同。它没有任何与其使用相关的内部结构；要用符号进行推理，就必须将它们绑定到精心选择的推理规则中的变量上。相比之下，神经网络只使用大型活动向量、大型权重矩阵和标量非线性，来执行那种快速的"直觉"推理，这种推理是轻松的常识推理的基础。

Before the introduction of neural language models71, the standard approach to statistical modelling of language did not exploit distributed representations: it was based on counting frequencies of occurrences of short symbol sequences of length up to N (called N-grams). The number of possible N-grams is on the order of V^N, where V is the vocabulary size, so taking into account a context of more than a handful of words would require very large training corpora. N-grams treat each word as an atomic unit, so they cannot generalize across semantically related sequences of words, whereas neural language models can because they associate each word with a vector of real valued features, and semantically related words end up close to each other in that vector space (Fig. 4).

在神经语言模型出现之前71，语言统计建模的标准方法并没有利用分布式表示：它基于对长度不超过 N 的短符号序列（称为 N-gram）出现频率的计数。可能的 N-gram 数量约为 V^N 量级，其中 V 是词汇表大小，因此要考虑超过几个单词的上下文，就需要非常大的训练语料库。N-gram 将每个单词视为一个原子单元，因此无法在语义相关的单词序列之间泛化，而神经语言模型可以，因为它们把每个单词与一个实值特征向量关联起来，语义相关的单词在该向量空间中最终彼此靠近（图 4）。
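The counting approach and the V^N growth can be made concrete with a short sketch. The toy corpus and the helper names `ngram_counts` and `possible_ngrams` are assumptions introduced here for illustration only.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous sequence of n tokens in the corpus."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def possible_ngrams(vocab_size, n):
    """Number of distinct N-grams over a vocabulary of size V: V ** N."""
    return vocab_size ** n

# Toy corpus, invented for illustration.
tokens = "the cat sat on the mat the cat ran".split()
bigrams = ngram_counts(tokens, 2)

seen = bigrams[("the", "cat")]    # this pair occurs in the corpus
unseen = bigrams[("the", "dog")]  # never observed, so its count is zero
table_size = possible_ngrams(10_000, 3)
```

Because each word is an atomic symbol, the count for ("the", "dog") stays at zero no matter how often ("the", "cat") appears, and a trigram table over a 10,000-word vocabulary already has 10^12 possible entries.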

Recurrent neural networks

循环神经网络

When backpropagation was first introduced, its most exciting use was for training recurrent neural networks (RNNs). For tasks that involve sequential inputs, such as speech and language, it is often better to use RNNs (Fig. 5). RNNs process an input sequence one element at a time, maintaining in their hidden units a ‘state vector’ that implicitly contains information about the history of all the past elements of the sequence. When we consider the outputs of the hidden units at different discrete time steps as if they were the outputs of different neurons in a deep multilayer network (Fig. 5, right), it becomes clear how we can apply backpropagation to train RNNs.

反向传播最初被提出时，其最令人兴奋的用途是训练循环神经网络（RNN）。对于涉及序列输入（如语音和语言）的任务，通常最好使用 RNN（图 5）。RNN 每次处理输入序列中的一个元素，并在其隐藏单元中维护一个"状态向量"，该向量隐式地包含了序列中所有过去元素的历史信息。当我们把隐藏单元在不同离散时间步的输出看作深层多层网络中不同神经元的输出时（图 5，右），如何应用反向传播来训练 RNN 就变得很清楚了。
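A minimal sketch of the forward computation just described, processing one element at a time while carrying a state vector. The roles assigned to the matrices (U for input-to-hidden, W for hidden-to-hidden, V for hidden-to-output), the tanh non-linearity and all numeric values are assumptions made for illustration.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_forward(xs, U, W, V, s0):
    """Run the recurrence s_t = tanh(U x_t + W s_{t-1}), o_t = V s_t.

    The same U, W and V are reused at every time step.
    """
    s = s0
    states, outputs = [], []
    for x in xs:
        pre = [a + b for a, b in zip(matvec(U, x), matvec(W, s))]
        s = [math.tanh(z) for z in pre]
        states.append(s)
        outputs.append(matvec(V, s))
    return states, outputs

# Invented weights: 1 input unit, 2 hidden units, 1 output unit.
U = [[0.5], [-0.3]]
W = [[0.1, 0.2], [0.0, 0.3]]
V = [[1.0, -1.0]]
s0 = [0.0, 0.0]

_, out_with_pulse = rnn_forward([[1.0], [0.0], [0.0]], U, W, V, s0)
_, out_silent = rnn_forward([[0.0], [0.0], [0.0]], U, W, V, s0)
```

The two input sequences differ only in their first element, yet the final outputs differ, which shows the state vector carrying information about earlier elements forward in time.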

RNNs are very powerful dynamic systems, but training them has proved to be problematic because the backpropagated gradients either grow or shrink at each time step, so over many time steps they typically explode or vanish77,78.

RNN 是非常强大的动态系统，但事实证明训练它们很成问题，因为反向传播的梯度在每个时间步都会增大或缩小，因此经过许多时间步之后，它们通常会爆炸或消失77，78。
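The effect can be seen in a deliberately simplified setting: a scalar linear recurrence s_t = w * s_{t-1}, where the gradient of the final state with respect to the initial state is w multiplied by itself once per time step. This toy model is an assumption made for illustration and ignores non-linearities and matrix-valued weights.

```python
def backpropagated_factor(w, steps):
    """Gradient of s_T with respect to s_0 for s_t = w * s_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        grad *= w  # each time step scales the gradient by w
    return grad

vanishing = backpropagated_factor(0.9, 100)  # shrinks a little at every step
exploding = backpropagated_factor(1.1, 100)  # grows a little at every step
```

A per-step factor only slightly below or above one is enough to make the gradient negligible or very large after 100 steps.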

Thanks to advances in their architecture79,80 and ways of training them81,82, RNNs have been found to be very good at predicting the next character in the text83 or the next word in a sequence75, but they can also be used for more complex tasks. For example, after reading an English sentence one word at a time, an English ‘encoder’ network can be trained so that the final state vector of its hidden units is a good representation of the thought expressed by the sentence. This thought vector can then be used as the initial hidden state of (or as extra input to) a jointly trained French ‘decoder’ network, which outputs a probability distribution for the first word of the French translation. If a particular first word is chosen from this distribution and provided as input to the decoder network it will then output a probability distribution for the second word of the translation and so on until a full stop is chosen17,72,76. Overall, this process generates sequences of French words according to a probability distribution that depends on the English sentence. This rather naive way of performing machine translation has quickly become competitive with the state-of-the-art, and this raises serious doubts about whether understanding a sentence requires anything like the internal symbolic expressions that are manipulated by using inference rules. It is more compatible with the view that everyday reasoning involves many simultaneous analogies that each contribute plausibility to a conclusion84,85.

得益于架构79，80和训练方法81，82方面的进展，人们发现 RNN 非常擅长预测文本中的下一个字符83或序列中的下一个单词75，但它们也可用于更复杂的任务。例如，在逐词读完一个英语句子之后，可以训练一个英语"编码器"网络，使其隐藏单元的最终状态向量能够很好地表示该句子所表达的思想。然后，这个思想向量可以用作联合训练的法语"解码器"网络的初始隐藏状态（或作为其额外输入），该网络输出法语译文第一个单词的概率分布。如果从该分布中选出某个特定的第一个单词并将其作为输入提供给解码器网络，它就会输出译文第二个单词的概率分布，依此类推，直到选出句号为止17，72，76。总体而言，这一过程按照取决于英语句子的概率分布生成法语单词序列。这种相当朴素的机器翻译方式很快就具备了与最先进技术相竞争的水平，这让人严重怀疑：理解一个句子是否真的需要类似于用推理规则操纵的内部符号表达式那样的东西。它更符合这样一种观点：日常推理涉及许多同时进行的类比，每个类比都为结论增添一分合理性84，85。

图 4 | 可视化学到的词向量。左侧是为语言建模而学到的词表示的示意图，使用 t-SNE 算法103 非线性投影到二维以便可视化。右侧是英语到法语的编码器–解码器循环神经网络学到的短语的二维表示75。可以观察到，语义相似的单词或单词序列被映射到相近的表示。单词的分布式表示是通过反向传播获得的：联合学习每个单词的表示，以及一个预测目标量的函数，目标量可以是序列中的下一个单词（用于语言建模），也可以是整个翻译后的单词序列（用于机器翻译）18，75。

Figure 5 | A recurrent neural network and the unfolding in time of the computation involved in its forward computation. The artificial neurons (for example, hidden units grouped under node s with values s_t at time t) get inputs from other neurons at previous time steps (this is represented with the black square, representing a delay of one time step, on the left). In this way, a recurrent neural network can map an input sequence with elements x_t into an output sequence with elements o_t, with each o_t depending on all the previous x_tʹ (for tʹ ≤ t). The same parameters (matrices U, V, W) are used at each time step. Many other architectures are possible, including a variant in which the network can generate a sequence of outputs (for example, words), each of which is used as inputs for the next time step. The backpropagation algorithm (Fig. 1) can be directly applied to the computational graph of the unfolded network on the right, to compute the derivative of a total error (for example, the log-probability of generating the right sequence of outputs) with respect to all the states s_t and all the parameters.

图 5 | 循环神经网络及其前向计算在时间上的展开。人工神经元（例如，归于节点 s 下的隐藏单元，在时刻 t 的取值为 s_t）从前面时间步的其他神经元获取输入（左侧用黑色方块表示，代表一个时间步的延迟）。这样，循环神经网络就可以把元素为 x_t 的输入序列映射为元素为 o_t 的输出序列，其中每个 o_t 都依赖于之前所有的 x_tʹ（tʹ≤t）。每个时间步都使用相同的参数（矩阵 U、V、W）。还可以有许多其他架构，包括一种变体，其中网络可以生成一个输出序列（例如单词），每个输出都用作下一个时间步的输入。反向传播算法（图 1）可以直接应用于右侧展开网络的计算图，以计算总误差（例如，生成正确输出序列的对数概率）对所有状态 s_t 和所有参数的导数。

Instead of translating the meaning of a French sentence into an English sentence, one can learn to ‘translate’ the meaning of an image into an English sentence (Fig. 3). The encoder here is a deep ConvNet that converts the pixels into an activity vector in its last hidden layer. The decoder is an RNN similar to the ones used for machine translation and neural language modelling. There has been a surge of interest in such systems recently (see examples mentioned in ref. 86).

除了将法语句子的含义翻译成英语句子之外，还可以学习将图像的含义"翻译"成英语句子（图 3）。这里的编码器是一个深度 ConvNet，它在最后一个隐藏层中将像素转换为活动向量。解码器是一个 RNN，与用于机器翻译和神经语言建模的 RNN 类似。最近人们对这类系统的兴趣激增（参见参考文献 86 中提到的例子）。

RNNs, once unfolded in time (Fig. 5), can be seen as very deep feedforward networks in which all the layers share the same weights. Although their main purpose is to learn long-term dependencies, theoretical and empirical evidence shows that it is difficult to learn to store information for very long78.

RNN 一旦在时间上展开（图 5），就可以看作所有层共享相同权重的非常深的前馈网络。虽然它们的主要目的是学习长期依赖关系，但理论和经验证据都表明，很难学会长时间存储信息78。

To correct for that, one idea is to augment the network with an explicit memory. The first proposal of this kind is the long short-term memory (LSTM) networks that use special hidden units, the natural behaviour of which is to remember inputs for a long time79. A special unit called the memory cell acts like an accumulator or a gated leaky neuron: it has a connection to itself at the next time step that has a weight of one, so it copies its own real-valued state and accumulates the external signal, but this self-connection is multiplicatively gated by another unit that learns to decide when to clear the content of the memory.

为了解决这个问题，一种想法是用显式记忆来增强网络。这类方案中最早的是长短期记忆（LSTM）网络，它使用特殊的隐藏单元，其天然行为就是长时间记住输入79。一种称为记忆单元（memory cell）的特殊单元的作用类似于累加器或门控泄漏神经元：它在下一个时间步与自身有一个权重为 1 的连接，因此它会复制自身的实值状态并累积外部信号，但这个自连接受到另一个单元的乘法门控，该单元学习决定何时清除记忆内容。
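The accumulator behaviour of the memory cell can be sketched in a few lines. This is a simplification under stated assumptions: the gate values are supplied by hand rather than computed by a learned unit, and the input and output gates of a full LSTM are omitted. The name `memory_cell` is introduced here for illustration.

```python
def memory_cell(signals, keep_gates, c0=0.0):
    """Gated accumulator: c_t = g_t * c_{t-1} + x_t.

    The self-connection has weight one, multiplied by a gate g_t in [0, 1].
    A gate of 1 copies the previous state; a gate of 0 clears the memory.
    """
    c = c0
    trace = []
    for x, g in zip(signals, keep_gates):
        c = g * c + x  # keep (or clear) the old state, then add the new signal
        trace.append(c)
    return trace

always_keep = memory_cell([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
clear_once = memory_cell([1.0, 2.0, 3.0], [1.0, 0.0, 1.0])
```

With the gate held at 1 the cell simply accumulates its inputs; setting the gate to 0 at the second step discards what was stored before that step.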

LSTM networks have subsequently proved to be more effective than conventional RNNs, especially when they have several layers for each time step87, enabling an entire speech recognition system that goes all the way from acoustics to the sequence of characters in the transcription. LSTM networks or related forms of gated units are also currently used for the encoder and decoder networks that perform so well at machine translation17,72,76.

LSTM 网络随后被证明比传统的 RNN 更有效，特别是当它们在每个时间步都有多个层时87，从而可以构建从声学信号一直到转录文本中字符序列的完整语音识别系统。LSTM 网络或相关形式的门控单元目前也用于在机器翻译中表现出色的编码器和解码器网络17，72，76。

Over the past year, several authors have made different proposals to augment RNNs with a memory module. Proposals include the Neural Turing Machine in which the network is augmented by a ‘tape-like’ memory that the RNN can choose to read from or write to88, and memory networks, in which a regular network is augmented by a kind of associative memory89. Memory networks have yielded excellent performance on standard question-answering benchmarks. The memory is used to remember the story about which the network is later asked to answer questions.

在过去的一年里，一些作者提出了用记忆模块来增强 RNN 的不同方案。这些方案包括神经图灵机（Neural Turing Machine），其中网络由一种"类磁带"的记忆来增强，RNN 可以选择对其进行读取或写入88；以及记忆网络（memory network），其中常规网络由一种联想记忆来增强89。记忆网络在标准问答基准上取得了出色的性能。记忆用于记住故事内容，网络稍后需要回答与该故事相关的问题。

Beyond simple memorization, neural Turing machines and memory networks are being used for tasks that would normally require reasoning and symbol manipulation. Neural Turing machines can be taught ‘algorithms’. Among other things, they can learn to output a sorted list of symbols when their input consists of an unsorted sequence in which each symbol is accompanied by a real value that indicates its priority in the list88. Memory networks can be trained to keep track of the state of the world in a setting similar to a text adventure game and after reading a story, they can answer questions that require complex inference90. In one test example, the network is shown a 15-sentence version of The Lord of the Rings and correctly answers questions such as “where is Frodo now?”89.

除了简单的记忆之外，神经图灵机和记忆网络还被用于通常需要推理和符号操作的任务。神经图灵机可以学会"算法"。例如，当输入是一个未排序的序列、其中每个符号都附带一个表示其在列表中优先级的实数值时，它们可以学会输出排好序的符号列表88。记忆网络可以经过训练，在类似于文字冒险游戏的环境中跟踪世界的状态，并在读完一个故事后回答需要复杂推理的问题90。在一个测试示例中，向网络展示一个 15 句话版本的《指环王》，它能正确回答诸如"佛罗多现在在哪里？"之类的问题89。

The future of deep learning

深度学习的未来

Unsupervised learning91–98 had a catalytic effect in reviving interest in deep learning, but has since been overshadowed by the successes of purely supervised learning. Although we have not focused on it in this Review, we expect unsupervised learning to become far more important in the longer term. Human and animal learning is largely unsupervised: we discover the structure of the world by observing it, not by being told the name of every object.

无监督学习91-98对重新激发人们对深度学习的兴趣起到了催化作用，但此后被纯监督学习的成功所掩盖。尽管我们在本综述中没有重点讨论它，但我们预计从长远来看无监督学习会变得重要得多。人类和动物的学习在很大程度上是无监督的：我们通过观察世界来发现它的结构，而不是靠别人告诉我们每个物体的名称。

Human vision is an active process that sequentially samples the optic array in an intelligent, task-specific way using a small, high-resolution fovea with a large, low-resolution surround. We expect much of the future progress in vision to come from systems that are trained end-to-end and combine ConvNets with RNNs that use reinforcement learning to decide where to look. Systems combining deep learning and reinforcement learning are in their infancy, but they already outperform passive vision systems99 at classification tasks and produce impressive results in learning to play many different video games100.

人类视觉是一个主动过程，它使用一个小而高分辨率的中央凹（fovea）以及其周围大而低分辨率的区域，以智能的、针对特定任务的方式对光学阵列依次采样。我们预计视觉领域未来的许多进展将来自端到端训练的系统，这些系统将 ConvNet 与使用强化学习来决定看向何处的 RNN 相结合。将深度学习和强化学习相结合的系统还处于起步阶段，但它们在分类任务上已经优于被动视觉系统99，并在学习玩许多不同的视频游戏方面取得了令人印象深刻的结果100。

Natural language understanding is another area in which deep learning is poised to make a large impact over the next few years. We expect systems that use RNNs to understand sentences or whole documents will become much better when they learn strategies for selectively attending to one part at a time76,86.

自然语言理解是另一个领域，深度学习准备在未来几年产生重大影响。我们期望使用 RNN 来理解句子或整个文档的系统在学习有选择地一次处理一部分的策略时会变得更好76，86。

Ultimately, major progress in artificial intelligence will come about through systems that combine representation learning with complex reasoning. Although deep learning and simple reasoning have been used for speech and handwriting recognition for a long time, new paradigms are needed to replace rule-based manipulation of symbolic expressions by operations on large vectors101.

最终，人工智能的重大进展将来自将表示学习与复杂推理相结合的系统。尽管深度学习和简单推理用于语音和手写识别已有很长时间，但仍需要新的范式，用对大型向量的运算来取代基于规则的符号表达式操作101。
