How to classify MNIST digits with different neural network architectures
Getting started with neural networks and Keras
Tyler Elliot Bettilyon, Aug 8, 2018
I took a Deep Learning course through The Bradfield School of Computer Science in June. This series is a journal about what I learned in class, and what I've learned since.
Please note: All of the code samples below can be found and run in this Jupyter Notebook, kindly hosted by Google Colaboratory. I encourage you to copy the code, make changes, and experiment with the networks yourself as you read this article.
Although neural networks have gained enormous popularity over the last few years, for many data scientists and statisticians the whole family of models has (at least) one major flaw: the results are hard to interpret. One of the reasons people treat neural networks as a black box is that the structure of any given neural network is hard to think about.
Neural networks frequently have anywhere from hundreds of thousands to millions of weights that are individually tuned during training to minimize error. With so many variables interacting in complex ways, it is difficult to describe exactly why one particular neural network outperforms some other neural network. This complexity also makes it hard to design top-tier neural network architectures.
Some machine learning terminology appears here, in case you haven't seen it before:
- The name x refers to input data, while the name y refers to the labels. ŷ (pronounced y-hat) refers to the predictions made by a model.
- Training data is the data our model learns from.
- Test data is kept secret from the model until after it has been trained. Test data is used to evaluate our model.
- A loss function is a function to quantify how accurate a model's predictions were.
- An optimization algorithm controls exactly how the weights of the computational graph are adjusted during training.
For a refresher on splitting data into test and training sets, or if this is new information, consider reading this article.
MNIST handwritten digits dataset
In this article, we're going to work through a series of simple neural network architectures and compare their performance on the MNIST handwritten digits dataset. The goal for all the networks we examine is the same: take an input image (28x28 pixels) of a handwritten single digit (0--9) and classify the image as the appropriate digit.
State of the art neural network approaches have achieved near-perfect performance, classifying 99.8% of digits correctly from a left-out test set of digits. This impressive performance has real world benefits as well. The US Postal Service processes 493.4 million pieces of mail per day, and 1% of that workload is 4.9 million pieces of mail. Accurate automation can prevent postal workers from individually handling and examining millions of parcels each day. Of course, automatically reading complete addresses isn't as simple as processing individual digits, but let's learn to crawl before we try to jog.
It's always a good idea to familiarize yourself with a dataset before diving into any machine learning task. Here are some examples of the images in the dataset:
A random selection of MNIST digits. In the Jupyter Notebook you can view more random selections from the dataset.
The MNIST dataset is a classic problem for getting started with neural networks. I've heard a few people joke that it's the deep learning version of "hello world"--- a lot of simple networks do a surprisingly good job with the dataset, even though some of the digits are pretty tricky:
This image is from the wonderful book Neural Networks and Deep Learning, available online for free.
Preparing the data
The first and most important step in any machine learning task is to prepare the data. For many scientists and industry practitioners, the process of gathering, cleaning, labeling, and storing the data into a usable digital format represents the lion's share of the work. Additionally, any errors introduced during this step will cause the learning algorithm to learn incorrect patterns. As they say: garbage in, garbage out.
Thanks to the Keras library and the hard work of the National Institute of Standards and Technology (the NIST of MNIST) the hardest parts have been done for us. The data's been collected and is already well-formatted for processing. Therefore, it is with deep gratitude for NIST and the Keras maintainers that our Python code for getting the data is simple:
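The code block referenced here simply loads the dataset. A minimal sketch of what that looks like with Keras (the variable names below are my own choices, not necessarily those used in the notebook):

```python
from keras.datasets import mnist

# Download (if necessary) and load the training and test splits of MNIST
(x_train, y_train), (x_test, y_test) = mnist.load_data()

print(x_train.shape)  # (60000, 28, 28) -- 60,000 training images, 28x28 pixels each
print(x_test.shape)   # (10000, 28, 28) -- 10,000 test images
```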
Relevant XKCD --- *Python really is wonderful.*
Once we have the dataset, we have to format it appropriately for our neural network. This article is focused only on fully connected neural networks, which means our input data must be a vector. Instead of several 28x28 images, we'll have several vectors that are all of length 784 (28*28 = 784). This flattening process is not ideal --- we obfuscate information about which pixels are next to each other.
Our networks will overcome this loss of information, but it is worth mentioning convolutional neural networks (CNNs). These are specifically designed for image processing and computer vision, and maintain these spatial relationships. In a future article, we will revisit MNIST with CNNs and compare our results.
Keras, again, provides a simple utility to help us flatten the 28x28 pixels into a vector:
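A minimal sketch of that flattening step, assuming the arrays loaded above (note that, per the preprocessing discussion below, we leave the raw 0--255 pixel values untouched):

```python
image_vector_size = 28 * 28  # 784

# Reshape each 28x28 image into a flat vector of 784 pixel values
x_train = x_train.reshape(x_train.shape[0], image_vector_size)
x_test = x_test.reshape(x_test.shape[0], image_vector_size)
```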
We have to do one last thing to this dataset before we're ready to experiment with some neural networks. The labels for this dataset are numerical values from 0 to 9 --- but it's important that our algorithm treats these as items in a set, rather than ordinal values. In our dataset the value "0" isn't smaller than the value "9"; they are just two different values from our set of possible classifications.
If our algorithm predicts "8" when it should predict "0", it is wrong to say that the model was "off by 8" --- it simply predicted the wrong category. Similarly, predicting "7" when we should have predicted "8" is no better than predicting "0" when we should have predicted "8" --- both are simply wrong.
To address this issue, when we're making predictions about categorical data (as opposed to values from a continuous range), the best practice is to use a "one-hot encoded" vector. This means that we create a vector as long as the number of categories we have, and force the model to set exactly one of the positions in the vector to 1 and the rest to 0 (the single 1 is the "hot" value within the vector).
Thankfully, Keras makes this remarkably easy to do as well:
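A sketch of that one-hot encoding step using Keras' to_categorical utility, assuming the label arrays from earlier:

```python
from keras.utils import to_categorical

num_classes = 10

# Turn each integer label (e.g. 3) into a one-hot vector (e.g. [0,0,0,1,0,0,0,0,0,0])
y_train = to_categorical(y_train, num_classes)
y_test = to_categorical(y_test, num_classes)
```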
Finally, it is worth mentioning that there are a lot of other things we could do at this point to normalize or improve the input images themselves. Preprocessing is common (because it's a good idea), but we're going to ignore it for now. Our focus is on examining neural network architectures.
Neural network architectures
For fully connected neural networks, there are three essential questions that define the network's architecture:
- How many layers are there?
- How many nodes are there in each of those layers?
- What transfer/activation function is used at each of those layers?
This article explores the first two of these questions, while the third will be explored in a later article. The behavior of the transfer/activation function is closely related to gradient descent and backpropagation, so discussing the available options will make more sense after the next article in this series.
All of the network architectures in this article use the sigmoid transfer function for all of the hidden layers.
There are other factors that can contribute to the performance of a neural network. These include which loss function is used, which optimization algorithm is used, how many training epochs to run before stopping, and the batch size within each epoch. Changes to batch size and epochs are discussed here. But, to help us compare "apples to apples", I have kept the loss function and optimization algorithm fixed:
- I've selected a common loss function called categorical cross entropy.
- I've selected one of the simplest optimization algorithms: Stochastic Gradient Descent (SGD).
Whew, now that all of that is out of the way, let's build our very first network:
Building the network
All the networks in this article will have the same input layer and output layer. We defined the input layer earlier as a vector with 784 entries --- this is the data from the flattened 28x28 image. The output layer was also implicitly defined earlier when we created a one-hot encoded vector from the labels --- the ten labels correspond to the ten nodes in this layer.
Our output layer also uses a special activation function called softmax. This normalizes the values from the ten output nodes such that:
- all the values are between 0 and 1, and
- the sum of all ten values is 1.
This allows us to treat those ten output values as probabilities, and the largest one is selected as the prediction for the one-hot vector. In machine learning, the softmax function is almost always used when our model's output is a one-hot encoded vector.
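Putting the pieces together, a minimal sketch of this first model --- the flattened 784-entry input, a single 32-node sigmoid hidden layer (described next), and the 10-node softmax output, compiled with the SGD optimizer and categorical cross entropy loss chosen above --- might look like this. This is my reconstruction, not necessarily the notebook's exact code:

```python
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
# Hidden layer: 32 nodes with sigmoid activation, fed by the flattened 784-pixel vector
model.add(Dense(32, activation='sigmoid', input_shape=(784,)))
# Output layer: ten nodes (one per digit class) with softmax activation
model.add(Dense(10, activation='softmax'))

# The fixed choices discussed above: SGD optimizer, categorical cross entropy loss
model.compile(optimizer='sgd',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
```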
Finally, this model has a single hidden layer with 32 nodes using the sigmoid activation function. The resulting architecture has 25,450 tunable parameters. From the input layer to the hidden layer there are 784*32 = 25,088 weights. The hidden layer has 32 nodes, so there are 32 biases, bringing us to 25,088 + 32 = 25,120 parameters.
From the hidden layer to the output layer there are 32*10 = 320 weights. Each of the ten output nodes adds a single bias, bringing us to 25,120 + 320 + 10 = 25,450 total parameters.
Keras has a handy method to help you calculate the number of parameters in a model. Calling the .summary() method, we get:
Layer (type)                 Output Shape              Param #
=================================================================
dense_203 (Dense)            (None, 32)                25120
_________________________________________________________________
dense_204 (Dense)            (None, 10)                330
=================================================================
Total params: 25,450
Trainable params: 25,450
Non-trainable params: 0
We can use Keras to train and evaluate this model as well:
[CODE BLOCK --- train_and_evalulate_first_model.py]
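In case you don't have the notebook handy, here is a sketch of what that training and evaluation code plausibly does, using the model and arrays defined above; the validation split and plotting details are my assumptions:

```python
import matplotlib.pyplot as plt

history = model.fit(x_train, y_train,
                    batch_size=128,        # the fixed batch size discussed above
                    epochs=5,              # the fixed number of training epochs
                    validation_split=0.1,  # hold out 10% of training data for validation
                    verbose=False)

# Plot training and validation accuracy over the five epochs
acc_key = 'acc' if 'acc' in history.history else 'accuracy'  # key name depends on Keras version
plt.plot(history.history[acc_key], label='training accuracy')
plt.plot(history.history['val_' + acc_key], label='validation accuracy')
plt.legend()
plt.show()

loss, accuracy = model.evaluate(x_test, y_test, verbose=False)
print('Test accuracy:', accuracy)
```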
Training and validation accuracy over time. Final test accuracy: 0.87.
Performance varies a little bit from run to run (give it a try in the Jupyter Notebook), but accuracy is consistently between 87--90%. This is an incredible result. We have obfuscated spatial relationships within the data by flattening the images. We have done zero feature extraction to help the model understand the data. Yet, in under one minute of training on consumer-grade hardware, we're already doing nearly nine times better than guessing randomly.
Network depth and layer width
While there are some rules of thumb, the only way to determine the best architecture for any particular task is empirically. Sometimes the "sensible defaults" will work well, and other times they won't work at all. The only way to find out for sure whether your neural network works on your data is to test it and measure your performance.
Neural network architecture is the subject of quite a lot of open research. Finding a new architecture that outperforms existing architectures on a particular task is often an achievement worthy of publication. It's common for practitioners to select an architecture based on a recent publication, and either copy it wholesale for a new task or make minor tweaks to gain incremental improvement.
Still, there is a lot to learn from reinventing some simple wheels from scratch. Let's examine a few alternatives to this small network and examine the impact of those changes.
The depth of a multi-layer perceptron (also known as a fully connected neural network) is determined by its number of hidden layers. The network above has one hidden layer. This network is so shallow that it's technically inaccurate to call it "deep learning".
Let's experiment with networks of different depths to see how the depth of a network impacts its performance. I have written a couple of short functions to help reduce boilerplate throughout this tutorial (a sketch of both follows the lists below):
The evaluate function prints a summary of the model, trains the model, graphs the training and validation accuracy, and prints a summary of its performance on the test data. By default it does all this using the fixed hyperparameters we've discussed, specifically:
- stochastic gradient descent (SGD)
- five training epochs
- training batch size of 128
- the categorical cross entropy loss function.
The create_dense function lets us pass in an array of sizes for the hidden layers. It creates a multi-layer perceptron that always has appropriate input and output layers for our MNIST task. Specifically, the models will have:
- an input vector of length 784
- an output vector of length ten that uses a one-hot encoding and the softmax activation function
- a number of hidden layers with the widths specified by the input array, all using the sigmoid activation function.
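Here is a sketch of what these two helpers might look like. The plotting details and default arguments are my own guesses, and the evaluate function relies on the x_train/y_train/x_test/y_test arrays prepared earlier, but the behavior matches the description above:

```python
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Dense

def create_dense(layer_sizes):
    """Build a fully connected network with the given hidden-layer widths."""
    model = Sequential()
    # First hidden layer also declares the 784-entry input vector
    model.add(Dense(layer_sizes[0], activation='sigmoid', input_shape=(784,)))
    # Any remaining hidden layers, all sigmoid
    for size in layer_sizes[1:]:
        model.add(Dense(size, activation='sigmoid'))
    # Ten-node softmax output for the one-hot encoded labels
    model.add(Dense(10, activation='softmax'))
    return model

def evaluate(model, batch_size=128, epochs=5):
    """Compile, train, graph, and report test accuracy for a model."""
    model.summary()
    model.compile(optimizer='sgd',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    # Uses the globally defined training data prepared earlier in the article
    history = model.fit(x_train, y_train,
                        batch_size=batch_size,
                        epochs=epochs,
                        validation_split=0.1,
                        verbose=False)

    acc_key = 'acc' if 'acc' in history.history else 'accuracy'  # depends on Keras version
    plt.plot(history.history[acc_key], label='training accuracy')
    plt.plot(history.history['val_' + acc_key], label='validation accuracy')
    plt.legend()
    plt.show()

    loss, accuracy = model.evaluate(x_test, y_test, verbose=False)
    print('Test accuracy:', accuracy)
```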
The following code uses these functions to create and evaluate several neural nets of increasing depth, each with 32 nodes per hidden layer. In Python terms, a depth of 2 => [32, 32], a depth of 3 => [32, 32, 32], and so on.
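A sketch of that loop, using the helpers above:

```python
# Networks of increasing depth, each hidden layer 32 nodes wide
for layers in range(1, 5):
    model = create_dense([32] * layers)  # e.g. 3 => [32, 32, 32]
    evaluate(model)
```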
Running this code produces some interesting charts, via the evaluate function defined above:
One hidden layer, final test accuracy: 0.888
Two hidden layers, final test accuracy: 0.767
Three hidden layers, final test accuracy: 0.438
Four hidden layers, final test accuracy: 0.114
Adding more layers appears to have decreased the accuracy of the model. That might not be intuitive --- aren't we giving the model greater flexibility, and therefore increasing its ability to make predictions? Unfortunately, the trade-off isn't that simple.
One thing we should look for is overfitting. Neural networks are flexible enough that they can adjust their parameters to fit the training data so precisely that they no longer generalize to data from outside the training set (for example, the test data). This is kind of like memorizing the answers to a specific math test without learning how to actually do the math.
Overfitting is a problem with many machine learning tasks. Neural networks are especially prone to overfitting because of the very large number of tunable parameters. One sign that you might be overfitting is that the training accuracy is significantly better than the test accuracy. But only one of our results --- the network with four hidden layers --- has that feature. That model's accuracy is barely better than guessing at random, even during training. Something more subtle is going on here.
In some ways a neural network is like a game of telephone --- each layer only gets information from the layer right before it. The more layers we add, the more the original message is changed, which is sometimes a strength and sometimes a weakness.
If the series of layers allow the build-up of useful information, then stacking layers can cause higher levels of meaning to build up. One layer finds edges in an image, another finds edges that make circles, another finds edges that make lines, another finds combinations of circles and lines, and so on.
On the other hand, if the layers are destructively removing context and useful information then, like in the game of telephone, the signal deteriorates as it passes through the layers until all the valuable information is lost.
Imagine you had a hidden layer with only one node --- this would force the network to reduce all the interesting interactions so far into a single value, then propagate that single value through the subsequent layers of the network. Information is lost, and such a network would perform terribly.
Another useful way to think about this is in terms of image resolution --- originally we had a "resolution" of 784 pixels but we forced the neural network to downsample very quickly to a "resolution" of 32 values. These values are no longer pixels, but combinations of the pixels in the previous layer.
Compressing the resolution once was (evidently) not so bad. But, like repeatedly saving a JPEG, repeated chains of "low resolution" data transfer from one layer to the next can result in lower quality output.
Finally, because of the way backpropagation and optimization algorithms work with neural networks, deeper networks require more training time. It may be that our model's 32 node-per-layer architecture just needs to train longer.
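Testing that is straightforward with the helpers above (assuming the epochs keyword argument sketched earlier):

```python
# Same three-layer, 32-nodes-per-layer network, but trained for 40 epochs
model = create_dense([32] * 3)
evaluate(model, epochs=40)
```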
If we let the three-layer network from above train for 40 epochs instead of five, we get these results:
3 hidden layers, 40 training epochs instead of 5. Final test accuracy: .886
The only way to really know which of these factors is at play in your own models is to design tests and experiment. Always keep in mind that many of these factors can impact your model at the same time and to different degrees.
Another knob we can turn is the number of nodes in each hidden layer. This is called the width of the layer. As with adding more layers, making each layer wider increases the total number of tunable parameters. Making wider layers tends to scale the number of parameters faster than adding more layers: every time we add a single node to layer i, we have to give that new node an edge to every node in layer i+1.
Using the same create_dense function as above, let's compare a few neural networks with a single hidden layer using different layer widths.
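A sketch of that comparison loop (the widths are taken from the results reported below):

```python
# Single hidden layer of increasing width
for width in [32, 64, 128, 256, 512, 2048]:
    model = create_dense([width])
    evaluate(model)
```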
Once again, running this code produces some interesting charts:
One hidden layer, 32 nodes. Final test accuracy: .886
One hidden layer, 64 nodes. Final test accuracy: .904
One hidden layer, 128 nodes. Final test accuracy: .916
One hidden layer, 256 nodes. Final test accuracy: .926
One hidden layer, 512 nodes. Final test accuracy: .934
One hidden layer, 2048 nodes. Final test accuracy: .950. This model has a hint of potential overfitting --- notice where the lines cross at the very end of our training period.
This time the performance changes are more intuitive --- more nodes in the hidden layer consistently mapped to better performance on the test data. Our accuracy improved from ~87% with 32 nodes to ~95% with 2048 nodes. Not only that, but the accuracy during our final round of training very nearly predicted the accuracy on the test data --- a sign that we are probably not overfitting.
The cost for this improvement was additional training time. As the number of tunable parameters ballooned from a meager 25,000 with 32 nodes to over 1.6 million with 2,048 nodes, so did training time. This caused our training epochs to go from taking about one second each to about 10 seconds each (on my Macbook Pro --- your mileage may vary).
Still, these models are fast to train relative to many state-of-the-art industrial models. The version of AlphaGo that defeated Lee Sedol trained for 4--6 weeks. OpenAI wrote a blog post that also helps contextualize the extraordinary computational resources that go into training state-of-the-art models. From the article:
" ... the amount of compute used in the largest AI training runs has been increasing exponentially with a 3.5 month-doubling time ... "
When we have good data and a good model, there is a strong correlation between training time and model performance. This is why many state-of-the-art models train on an order of weeks or months once the authors have confidence in the model's ability. It seems that patience is still a virtue.
Combining width and depth
With the intuition that more nodes tends to yield better performance, let's revisit the question of adding layers. Recall that stacking layers can either build up meaningful information or destroy information through downsampling.
Let's see what happens when we stack bigger layers. Repeated layers of 32 nodes seemed to degrade the overall performance of our networks --- will that still be true as we stack larger layers?
I have increased the number of epochs as the depth of the network increases, for the reasons discussed above. The combination of deeper networks, more nodes per hidden layer, and more training epochs results in code that takes longer to run. Fortunately, you can check out the Jupyter Notebook, where the results have already been computed.
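A sketch of how that grid of experiments might be run. The exact epoch schedule is my assumption --- the article only says that epochs increase with depth:

```python
# Combine width and depth: several widths, each at several depths
for width in [32, 128, 512]:
    for layers in range(1, 6):
        model = create_dense([width] * layers)
        # Give deeper networks more training time (assumed schedule)
        evaluate(model, epochs=10 * layers)
```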
You can see all the graphs in the Jupyter Notebook, but I want to highlight a few points of interest.
With this particular training regimen, the single-layer, 512-nodes-per-layer network ended up with the highest test accuracy, at 94.7%. In general, the trend we saw before --- where deeper networks perform worse --- persists. However, the differences are pretty small, on the order of 1--2 percentage points. Also, the graphs suggest the discrepancies might be overcome with more training for the deeper networks.
For all numbers of nodes-per-layer, the graphs for one, two, and three hidden layers look pretty standard. More training improves the network, and the rate of improvement slows down as accuracy rises.
One 32 node layer
Two 128 node layers
Three 512 node layers
But when we get to four and five layers things start looking bad for the 32-node models:
Four 32 node layers.
Five 32 node layers.
The other two five-layer networks have interesting results as well:
Five 128 node layers.
Five 512 node layers.
Both of these seem to overcome some initial poor performance and look as though they might be able to continue improving with more training. This may be a limitation of not having enough data to train the network. As our models become more complex, and as information about error has to propagate through more layers, our models can fail to learn --- they don't have enough information to learn from.
Decreasing the batch size by processing fewer data-points before giving the model a correction can help. So can increasing the number of epochs, at the cost of increased training time. Unfortunately, it can be difficult to tell if you have a junk architecture or just need more data and more training time without testing your own patience.
For example, this is what happened when I reduced the batch size to 16 (from 128) and trained the 32-node-per-layer network with five hidden layers for 50 epochs (which took about 30 minutes on my hardware):
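With the helpers above, that run is a one-liner (again assuming the batch_size and epochs keyword arguments):

```python
# Five 32-node hidden layers, smaller batches, much longer training
model = create_dense([32] * 5)
evaluate(model, batch_size=16, epochs=50)
```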
Five 32-node hidden layers, batch size 16, 50 epochs. Final test accuracy: .827
So, it doesn't look like 32 nodes per layer is downsampling or destroying information to an extent that this network cannot learn anything. That said, many of the other networks we've built perform better with significantly less training. Our precious training time would probably be better spent on other network architectures.
While I hope this article was helpful and enlightening, it's not exhaustive. Even experts are sometimes confused and unsure about which architectures will work and which will not. The best way to bolster your intuition is to practice building and experimenting with neural network architectures for yourself.
In the next article, we'll explore gradient descent, a cornerstone algorithm for training neural networks.
If you want some homework --- and I know you do --- Keras has a number of fantastic datasets built in, just like the MNIST dataset. To make sure you've learned something from this article, consider recreating experiments like the ones above using the Fashion MNIST dataset, which is a bit more challenging than the regular MNIST dataset.