# How to classify MNIST digits with different neural network architectures

## Getting started with neural networks and Keras

Tyler Elliot Bettilyon, Aug 8, 2018

Photo by Greg Rakozy on Unsplash

I took a Deep Learning course through The Bradfield School of Computer Science in June. This series is a journal about what I learned in class, and what I've learned since.

This is the third article in the series. You can find the first article in the series here, and the second article in the series here.

**Please note:** All of the code samples below can be found and run in this Jupyter Notebook kindly hosted by Google Colaboratory. I encourage you to copy the code, make changes, and experiment with the networks yourself as you read this article.

### Neural networks

Although neural networks have gained enormous popularity over the last few years, for many data scientists and statisticians the whole family of models has (at least) one major flaw: the results are **hard to interpret**. One of the reasons that people treat neural networks as a black box is that the structure of any given neural network is hard to think about.

Neural networks frequently have anywhere from hundreds of thousands to millions of weights that are individually tuned during training to minimize error. With so many variables interacting in complex ways, it is difficult to describe exactly why one particular neural network outperforms some other neural network. This complexity also makes it hard to design top-tier neural network architectures.

Some machine learning terminology appears here, in case you haven't seen it before:

- The name **x** refers to input data, while the name **y** refers to the labels. **ŷ** (pronounced y-hat) refers to the predictions made by a model.
- **Training data** is the data our model learns from. **Test data** is kept secret from the model until after it has been trained. Test data is used to evaluate our model.
- A **loss function** is a function to quantify how accurate a model's predictions were.
- An **optimization algorithm** controls exactly how the weights of the computational graph are adjusted during training.

For a refresher about splitting up test and training data, or if this is new information, consider reading this article.

### MNIST handwritten digits dataset

In this article, we're going to work through a series of simple neural network architectures and compare their performance on the MNIST handwritten digits dataset. The goal for all the networks we examine is the same: take an input image (28x28 pixels) of a handwritten single digit (0--9) and classify the image as the appropriate digit.

State of the art neural network approaches have achieved near-perfect performance, classifying 99.8% of digits correctly from a left-out test set of digits. This impressive performance has real world benefits as well. The US Postal Service processes 493.4 million pieces of mail per day, and 1% of that workload is 4.9 million pieces of mail. Accurate automation can prevent postal workers from individually handling and examining millions of parcels each day. Of course, automatically reading complete addresses isn't as simple as processing individual digits, but let's learn to crawl before we try to jog.

It's always a good idea to familiarize yourself with a dataset before diving into any machine learning task. Here are some examples of the images in the dataset:

*A random selection of MNIST digits. In the Jupyter Notebook you can view more random selections from the dataset.*

The MNIST dataset is a classic problem for getting started with neural networks. I've heard a few people joke that it's the deep learning version of "hello world"--- a lot of simple networks do a surprisingly good job with the dataset, even though some of the digits are pretty tricky:

*This image is from the wonderful book Neural Networks and Deep Learning, available online for free.*

### Preparing the data

The first and most important step in any machine learning task is to prepare the data. For many scientists and industry practitioners, the process of gathering, cleaning, labeling, and storing the data into a usable digital format represents the lion's share of the work. Additionally, any errors introduced during this step will cause the learning algorithm to learn incorrect patterns. As they say: garbage in, garbage out.

Thanks to the Keras library and the hard work of the National Institute of Standards and Technology (the NIST of MNIST) the hardest parts have been done for us. The data's been collected and is already well-formatted for processing. Therefore, it is with deep gratitude for NIST and the Keras maintainers that our Python code for getting the data is simple:
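
A minimal sketch of that loading step (assuming a recent `tensorflow.keras` install; the notebook may import Keras directly, but the call is the same):

```python
from tensorflow.keras.datasets import mnist

# load_data() downloads MNIST on first use, then returns two (images, labels)
# tuples: 60,000 training examples and 10,000 test examples.
(x_train, y_train), (x_test, y_test) = mnist.load_data()

print(x_train.shape)  # (60000, 28, 28)
print(x_test.shape)   # (10000, 28, 28)
```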

*Relevant XKCD --- Python really is wonderful.*

Once we have the dataset, we have to format it appropriately for our neural network. **This article is focused only on fully connected neural networks**, which means our input data must be a vector. Instead of several 28x28 images, we'll have several vectors that are all length 784 (28*28=784). This flattening process is not ideal --- we obfuscate information about which pixels are next to each other.

Our networks will overcome this loss of information, but it is worth mentioning convolutional neural networks (CNNs). These are specifically designed for image processing/computer vision, and maintain these spatial relationships. In a future article, we will revisit MNIST with CNNs and compare our results.

Keras, again, provides a simple utility to help us flatten the 28x28 pixels into a vector:
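
In the notebook this is a one-liner; a numpy `reshape` does the same job, sketched here on stand-in data rather than the real images:

```python
import numpy as np

# Stand-in for the MNIST training images: 60,000 grayscale 28x28 arrays.
images = np.zeros((60000, 28, 28), dtype=np.float32)

# Flatten each 28x28 image into a single length-784 vector (28*28 = 784).
flattened = images.reshape(len(images), 28 * 28)

print(flattened.shape)  # (60000, 784)
```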

We have to do one last thing to this dataset before we're ready to experiment with some neural networks. The labels for this dataset are numerical values from 0 to 9 --- but it's important that our algorithm treats these as items in a set, rather than ordinal values. In our dataset the value "0" **isn't smaller** than the value "9"; they are just two different values from our set of possible classifications.

If our algorithm predicts "8" when it should predict "0" it is wrong to say that the model was "off by 8" --- it simply predicted the wrong category. Similarly, predicting "7" when we should have predicted "8" is not better than predicting "0" when we should have predicted "8" --- both are just wrong.

To address this issue, when we're making predictions about categorical data (as opposed to values from a continuous range), the best practice is to use a "one-hot encoded" vector. This means that we create a vector as long as the number of categories we have, and force the model to set exactly one of the positions in the vector to 1 and the rest to 0 (the single 1 is the "hot" value within the vector).

Thankfully, Keras makes this remarkably easy to do as well:
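
The utility in question is `to_categorical`; a small sketch of how it's used (assuming `tensorflow.keras`):

```python
from tensorflow.keras.utils import to_categorical

labels = [5, 0, 4]

# Each label becomes a length-10 vector with a single 1 ("hot" value)
# in the position matching the label.
one_hot = to_categorical(labels, num_classes=10)

print(one_hot[0])  # [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
```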

Finally, it is worth mentioning that there are a lot of other things we **could** do at this point to normalize/improve the input images themselves. Preprocessing is common (because it's a good idea) but we're going to ignore it for now. Our focus is on examining neural network architectures.

### Neural network architectures

For fully connected neural networks, there are three essential questions that define the network's architecture:

- How many layers are there?
- How many nodes are there in each of those layers?
- What transfer/activation function is used at each of those layers?

This article explores the first two of these questions, while the third will be explored in a later article. The behavior of the transfer/activation function is closely related to gradient descent and backpropagation, so discussing the available options will make more sense after the next article in this series.

All of the network architectures in this article use the sigmoid transfer function for all of the hidden layers.
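
For reference, the sigmoid function squashes any real input into the range (0, 1); here is a one-line numpy sketch of it:

```python
import numpy as np

def sigmoid(z):
    # Maps any real number into (0, 1); large inputs saturate near 0 or 1.
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))  # 0.5
```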

There are other factors that can contribute to the performance of a neural network. These include which loss function is used, which optimization algorithm is used, how many training epochs to run before stopping, and the batch size within each epoch. Changes to batch size and epochs are discussed here. But, to help us compare "apples to apples", I have kept the loss function and optimization algorithm fixed:

- I've selected a common loss function called categorical cross entropy.
- I've selected one of the simplest optimization algorithms: Stochastic Gradient Descent (SGD).

Whew, now that all of that is out of the way, let's build our very first network:

#### Building the network

All the networks in this article will have the same input layer and output layer. We defined the input layer earlier as a vector with 784 entries --- this is the data from the flattened 28x28 image. The output layer was also implicitly defined earlier when we created a one-hot encoded vector from the labels --- the ten labels correspond to the ten nodes in this layer.

Our output layer also uses a special activation function called **softmax**. This normalizes the values from the ten output nodes such that:

- all the values are between 0 and 1, and
- the sum of all ten values is 1.

This allows us to treat those ten output values as probabilities, and the largest one is selected as the prediction for the one-hot vector. In machine learning, the softmax function is almost always used when our model's output is a one-hot encoded vector.
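
As a concrete illustration (a plain numpy sketch, not the Keras internals), softmax exponentiates each score and divides by the sum of the exponentials:

```python
import numpy as np

def softmax(z):
    # Shift by the max first for numerical stability; the result is identical.
    exps = np.exp(z - np.max(z))
    return exps / exps.sum()

# Raw scores from ten hypothetical output nodes.
scores = np.array([2.0, 1.0, 0.1, 0.0, -1.0, 0.5, 0.3, 0.2, 0.1, 0.0])
probs = softmax(scores)

print(int(np.argmax(probs)))  # 0, the node with the highest raw score
```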

Finally, this model has a single hidden layer with 32 nodes using the sigmoid activation function. The resulting architecture has 25,450 tunable parameters. From the input layer to the hidden layer there are 784*32 = 25,088 **weights**. The hidden layer has 32 nodes so there are 32 **biases**. This brings us to 25,088 + 32 = 25,120 parameters.

From the hidden layer to the output layer there are 32*10 = 320 weights.

Each of the ten nodes adds a single bias bringing us to 25,088 + 320 + 10 = 25,450 total parameters.
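
Putting those pieces together, the model itself might be defined roughly like this (a sketch assuming `tensorflow.keras`; the notebook's imports and layer names may differ):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# 784 inputs -> one 32-node sigmoid hidden layer -> 10-node softmax output.
model = Sequential([
    Dense(32, activation='sigmoid', input_shape=(784,)),
    Dense(10, activation='softmax'),
])

# The fixed hyperparameters discussed above: SGD and categorical cross entropy.
model.compile(optimizer='sgd', loss='categorical_crossentropy',
              metrics=['accuracy'])

print(model.count_params())  # 25450
```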

Keras has a handy method to help you calculate the number of parameters in a model. Calling the `.summary()` method, we get:

```
Layer (type) Output Shape Param #
=================================================================
dense_203 (Dense) (None, 32) 25120
_________________________________________________________________
dense_204 (Dense) (None, 10) 330
=================================================================
Total params: 25,450
Trainable params: 25,450
Non-trainable params: 0
```

We can use Keras to train and evaluate this model as well:

[CODE BLOCK --- train_and_evalulate_first_model.py]
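
That script isn't reproduced in this excerpt, but training and evaluating with Keras looks roughly like the following sketch. It runs on random stand-in data so it is self-contained; the notebook uses the real flattened, one-hot-encoded MNIST arrays and evaluates on the held-out test set:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Random stand-in for the flattened, one-hot-encoded MNIST training data.
x_train = np.random.rand(512, 784).astype('float32')
y_train = np.eye(10, dtype='float32')[np.random.randint(0, 10, 512)]

model = Sequential([
    Dense(32, activation='sigmoid', input_shape=(784,)),
    Dense(10, activation='softmax'),
])
model.compile(optimizer='sgd', loss='categorical_crossentropy',
              metrics=['accuracy'])

# The fixed hyperparameters from above: batch size 128, five epochs.
history = model.fit(x_train, y_train, batch_size=128, epochs=5, verbose=0)
loss, accuracy = model.evaluate(x_train, y_train, verbose=0)
```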

*Training and validation accuracy over time. Final test accuracy: 0.87.*

Performance varies a little bit from run to run (give it a try in the Jupyter notebook), but accuracy is consistently between 87--90%. This is an incredible result. We have obfuscated spatial relationships within the data by flattening the images. We have done zero feature extraction to help the model understand the data. Yet, in under one minute of training on consumer grade hardware, we're already doing nearly nine times better than guessing randomly.

### Network depth and layer width

While there are some rules of thumb, the only way to determine the best architecture for any particular task is **empirically**. Sometimes the "sensible defaults" will work well, and other times they won't work at all. The only way to find out for sure if your neural network works on your data is to test it, and measure your performance.

Neural network architecture is the subject of quite a lot of open research. Finding a new architecture that outperforms existing architectures on a particular task is often an achievement worthy of publication. It's common for practitioners to select an architecture based on a recent publication, and either copy it wholesale for a new task or make minor tweaks to gain incremental improvement.

Still, there is a lot to learn from reinventing some simple wheels from scratch. Let's examine a few alternatives to this small network and examine the impact of those changes.

#### Network depth

The depth of a multi-layer perceptron (also known as a fully connected neural network) is determined by its number of hidden layers. The network above has one hidden layer. This network is so shallow that it's technically inaccurate to call it "deep learning".

Let's experiment with networks of different depths to see how the depth of a network impacts its performance. I have written a couple of short functions to help reduce boilerplate throughout this tutorial:

The `evaluate` function prints a summary of the model, trains the model, graphs the training and validation accuracy, and prints a summary of its performance on the test data. By default it does all this using the fixed hyperparameters we've discussed, specifically:

- stochastic gradient descent (SGD)
- five training epochs
- training batch size of 128
- the categorical cross entropy loss function.

The `create_dense` function lets us pass in an array of sizes for the hidden layers. It creates a multi-layer perceptron that always has appropriate input and output layers for our MNIST task. Specifically, the models will have:

- an input vector of length 784
- an output vector of length ten that uses a one-hot encoding and the softmax activation function
- a number of layers with the widths specified by the input array all using the sigmoid activation function.
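
The helper itself isn't shown in this excerpt, so here is a plausible sketch of `create_dense` matching that description (assuming `tensorflow.keras`; the notebook's version may differ in detail):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

def create_dense(layer_sizes):
    """Build an MNIST multi-layer perceptron with the given hidden-layer widths."""
    model = Sequential()
    # The first hidden layer also fixes the 784-entry input vector.
    model.add(Dense(layer_sizes[0], activation='sigmoid', input_shape=(784,)))
    # Any remaining hidden layers, all using the sigmoid activation.
    for width in layer_sizes[1:]:
        model.add(Dense(width, activation='sigmoid'))
    # Ten-node softmax output for the one-hot encoded labels.
    model.add(Dense(10, activation='softmax'))
    model.compile(optimizer='sgd', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

print(create_dense([32]).count_params())  # 25450, matching the model above
```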

This code uses these functions to create and evaluate several neural nets of increasing depth, each with 32 nodes per hidden layer. (In Python, [32] * 2 => [32, 32], [32] * 3 => [32, 32, 32], and so on.)
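
The layer-size lists above can be generated with that list-multiplication trick, which is plain Python:

```python
# Hidden-layer size lists for networks of increasing depth, via list multiplication.
architectures = [[32] * depth for depth in range(1, 5)]

print(architectures)  # [[32], [32, 32], [32, 32, 32], [32, 32, 32, 32]]
```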

Running this code produces some interesting charts, via the evaluate function defined above:

*One hidden layer, final test accuracy: 0.888*

*2 hidden layers, final test accuracy: 0.767*

*3 hidden layers, final test accuracy: 0.438*

*4 hidden layers, final test accuracy: 0.114*

#### Overfitting

Adding more layers appears to have **decreased** the accuracy of the model. That might not be intuitive --- aren't we giving the model greater flexibility and therefore increasing its ability to make predictions? Unfortunately the trade-off isn't that simple.

One thing we should look for is **overfitting**. Neural networks are flexible enough that they can adjust their parameters to fit the training data so precisely that they no longer generalize to data from outside the training set (for example, the test data). This is kind of like memorizing the answers to a specific math test without learning how to actually do the math.

Overfitting is a problem with many machine learning tasks. Neural networks are especially prone to overfitting because of the very large number of tunable parameters. One sign that you might be overfitting is that the training accuracy is significantly better than the test accuracy. But only one of our results --- the network with four hidden layers --- has that feature. That model's accuracy is barely better than guessing at random, even during training. Something more subtle is going on here.

In some ways a neural network is like a game of telephone --- each layer only gets information from the layer right before it. The more layers we add, the more the original message is changed, which is sometimes a strength and sometimes a weakness.

If the series of layers allow the build-up of useful information, then stacking layers can cause higher levels of meaning to build up. One layer finds edges in an image, another finds edges that make circles, another finds edges that make lines, another finds combinations of circles and lines, and so on.

On the other hand, if the layers are destructively removing context and useful information then, like in the game of telephone, the signal deteriorates as it passes through the layers until all the valuable information is lost.

Imagine you had a hidden layer with only one node --- this would force the network to reduce all the interesting interactions so far into a single value, then propagate that single value through the subsequent layers of the network. Information is lost, and such a network would perform terribly.

Another useful way to think about this is in terms of image resolution --- originally we had a "resolution" of 784 pixels but we forced the neural network to downsample very quickly to a "resolution" of 32 values. These values are no longer pixels, but combinations of the pixels in the previous layer.

Compressing the resolution once was (evidently) not so bad. But, like repeatedly saving a JPEG, repeated chains of "low resolution" data transfer from one layer to the next can result in lower quality output.

Finally, because of the way backpropagation and optimization algorithms work with neural networks, deeper networks require more training time. It may be that our model's 32 node-per-layer architecture just needs to train longer.

If we let the three-layer network from above train for 40 epochs instead of five, we get these results:

*3 hidden layers, 40 training epochs instead of 5. Final test accuracy: .886*

The only way to really know which of these factors is at play in your own models is to design tests and experiment. Always keep in mind that many of these factors can impact your model at the same time and to different degrees.

#### Layer width

Another knob we can turn is the number of nodes in each hidden layer. This is called the **width** of the layer. As with adding more layers, making each layer wider increases the total number of tunable parameters. Making wider layers tends to scale the number of parameters faster than adding more layers. Every time we add a single node to layer **i**, we have to give that new node an edge to every node in layer **i+1**.
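
To make that scaling concrete: for a single-hidden-layer MNIST network of width w, there are (784*w + w) weights and biases into the hidden layer, plus (w*10 + 10) out of it. A short worked check:

```python
def param_count(width):
    # Weights and biases into the hidden layer, then into the ten-node output.
    return (784 * width + width) + (width * 10 + 10)

for width in [32, 64, 128, 256, 512, 2048]:
    print(width, param_count(width))

# Width 32 reproduces the 25,450 figure from earlier; width 2048 is
# over 1.6 million parameters.
```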

Using the same `evaluate` and `create_dense` functions as above, let's compare a few neural networks with a single hidden layer using different layer widths.

Once again, running this code produces some interesting charts:

*One hidden layer, 32 nodes. Final test accuracy: .886*

*One hidden layer, 64 nodes. Final test accuracy: .904*

*One hidden layer, 128 nodes. Final test accuracy: .916*

*One hidden layer, 256 nodes. Final test accuracy: .926*

*One hidden layer, 512 nodes. Final test accuracy: .934*

*One hidden layer, 2048 nodes. Final test accuracy: .950. This model has a hint of potential overfitting --- notice where the lines cross at the very end of our training period.*

This time the performance changes are more intuitive --- more nodes in the hidden layer consistently mapped to better performance on the test data. Our accuracy improved from ~87% with 32 nodes to ~95% with 2048 nodes. Not only that, but the accuracy during our final round of training very nearly predicted the accuracy on the test data --- a sign that we are probably not overfitting.

The cost for this improvement was additional training time. As the number of tunable parameters ballooned from a meager 25,000 with 32 nodes to over 1.6 million with 2,048 nodes, so did training time. This caused our training epochs to go from taking about one second each to about 10 seconds each (on my Macbook Pro --- your mileage may vary).

Still, these models are fast to train relative to many state-of-the-art industrial models. The version of AlphaGo that defeated Lee Sedol trained for **4--6 weeks**. OpenAI wrote a blog post that also helps contextualize the extraordinary computational resources that go into training state-of-the-art models. From the article:

" ... the amount of compute used in the largest AI training runs has been increasing exponentially with a 3.5 month-doubling time ... "

When we have good data and a good model, there is a strong correlation between training time and model performance. This is why many state-of-the-art models train on an order of weeks or months once the authors have confidence in the model's ability. It seems that patience is still a virtue.

#### Combining width and depth

With the intuition that more nodes tends to yield better performance, let's revisit the question of adding layers. Recall that stacking layers can either build up meaningful information or destroy information through downsampling.

Let's see what happens when we stack **bigger** layers. Repeated layers of 32 seemed to degrade the overall performance of our networks --- will that still be true as we stack larger layers?

I have increased the number of epochs as the depth of the network increases, for the reasons discussed above. The combination of deeper networks, more nodes per hidden layer, and increased training epochs results in code that takes longer to run. Fortunately, you can check out the Jupyter Notebook where the results have already been computed.

You can see all the graphs in the Jupyter Notebook, but I want to highlight a few points of interest.

With this particular training regimen, the single layer 512-nodes-per-layer network ended up with the highest test accuracy, at 94.7%. In general, the trend we saw before --- where deeper networks perform worse --- persists. However, the differences are pretty small, in the order of 1--2 percentage points. Also, the graphs suggest the discrepancies might be overcome with more training for the deeper networks.

For all numbers of nodes-per-layer, the graphs for one, two, and three hidden layers look pretty standard. More training improves the network, and the rate of improvement slows down as accuracy rises.

*One 32 node layer*

*Two 128 node layers*

*Three 512 node layers*

But when we get to four and five layers things start looking bad for the 32-node models:

*Four 32 node layers.*

*Five 32 node layers.*

The other two five-layer networks have interesting results as well:

*Five 128 node layers.*

*Five 512 node layers.*

Both of these seem to overcome some initial poor performance and look as though they might be able to continue improving with more training. This may be a limitation of not having enough data to train the network. As our models become more complex, and as information about error has to propagate through more layers, our models can fail to learn --- they don't have enough information to learn from.

Decreasing the batch size by processing fewer data-points before giving the model a correction can help. So can increasing the number of epochs, at the cost of increased training time. Unfortunately, it can be difficult to tell if you have a junk architecture or just need more data and more training time without testing your own patience.

For example, this is what happened when I reduced the batch size to 16 (from 128) and trained the 32-node-per-layer with five hidden layers for 50 epochs (which took about 30 minutes on my hardware):

*Five 32-node hidden layers, batch size 16, 50 epochs. Final test accuracy: .827*

So, it doesn't look like 32 nodes per layer is downsampling or destroying information to an extent that this network cannot learn anything. That said, many of the other networks we've built perform better with significantly less training. Our precious training time would probably be better spent on other network architectures.

### Further steps

While I hope this article was helpful and enlightening, it's not exhaustive. Even experts are sometimes confused and unsure about which architectures will work and which will not. The best way to bolster your intuition is to practice building and experimenting with neural network architectures for yourself.

In the next article, we'll explore gradient descent, a cornerstone algorithm for training neural networks.

If you want some homework --- and I know you do --- Keras has a number of fantastic datasets built in, just like the MNIST dataset. To ensure you've learned something from this article, consider recreating experiments like the ones I've done above using the Fashion MNIST dataset, which is a bit more challenging than the regular MNIST dataset.
