How to classify MNIST digits with different neural network architectures
Getting started with neural networks and Keras
I took a Deep Learning course through The Bradfield School of Computer Science in June. This series is a journal about what I learned in class, and what I've learned since.
Please note: All of the code samples below can be found and run in this Jupyter Notebook kindly hosted by Google Colaboratory. I encourage you to copy the code, make changes, and experiment with the networks yourself as you read this article.
Although neural networks have gained enormous popularity over the last few years, for many data scientists and statisticians the whole family of models has (at least) one major flaw: the results are hard to interpret. One of the reasons that people treat neural networks as a black box is that the structure of any given neural network is hard to think about.
Neural networks frequently have anywhere from hundreds of thousands to millions of weights that are individually tuned during training to minimize error. With so many variables interacting in complex ways, it is difficult to describe exactly why one particular neural network outperforms some other neural network. This complexity also makes it hard to design top-tier neural network architectures.
Some machine learning terminology appears here, in case you haven't seen it before:
- The name x refers to input data, while the name y refers to the labels. ŷ (pronounced y-hat) refers to the predictions made by a model.
- Training data is the data our model learns from.
- Test data is kept secret from the model until after it has been trained. Test data is used to evaluate our model.
- A loss function quantifies how far a model's predictions are from the correct labels; training tries to minimize it.
- An optimization algorithm controls exactly how the weights of the computational graph are adjusted during training.
For a refresher about splitting up test and training data, or if this is new information, consider reading this article.
MNIST handwritten digits dataset
In this article, we're going to work through a series of simple neural network architectures and compare their performance on the MNIST handwritten digits dataset. The goal for all the networks we examine is the same: take an input image (28x28 pixels) of a handwritten single digit (0--9) and classify the image as the appropriate digit.
State of the art neural network approaches have achieved near-perfect performance, classifying 99.8% of digits correctly from a left-out test set of digits. This impressive performance has real world benefits as well. The US Postal Service processes 493.4 million pieces of mail per day, and 1% of that workload is 4.9 million pieces of mail. Accurate automation can prevent postal workers from individually handling and examining millions of parcels each day. Of course, automatically reading complete addresses isn't as simple as processing individual digits, but let's learn to crawl before we try to jog.
It's always a good idea to familiarize yourself with a dataset before diving into any machine learning task. Here are some examples of the images in the dataset:
A random selection of MNIST digits. In the Jupyter Notebook you can view more random selections from the dataset.
The MNIST dataset is a classic problem for getting started with neural networks. I've heard a few people joke that it's the deep learning version of "hello world" --- a lot of simple networks do a surprisingly good job with the dataset, even though some of the digits are pretty tricky:
This image is from the wonderful book Neural Networks and Deep Learning, available online for free.
Preparing the data
The first and most important step in any machine learning task is to prepare the data. For many scientists and industry practitioners, the process of gathering, cleaning, labeling, and storing the data into a usable digital format represents the lion's share of the work. Additionally, any errors introduced during this step will cause the learning algorithm to learn incorrect patterns. As they say: garbage in, garbage out.
Thanks to the Keras library and the hard work of the National Institute of Standards and Technology (the NIST of MNIST) the hardest parts have been done for us. The data's been collected and is already well-formatted for processing. Therefore, it is with deep gratitude for NIST and the Keras maintainers that our Python code for getting the data is simple:
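A minimal sketch of that load step, assuming Keras is installed and the machine can download the dataset the first time it runs. `mnist.load_data()` is the Keras loader; the wrapper name `load_mnist` is hypothetical and only there to keep the example tidy.

```python
from keras.datasets import mnist


def load_mnist():
    """Return the MNIST train and test splits as NumPy arrays."""
    # The first call downloads the data; later calls use the local cache.
    (x_train, y_train), (x_test, y_test) = mnist.load_data()
    return (x_train, y_train), (x_test, y_test)


# Usage (needs network access on the first run):
# (x_train, y_train), (x_test, y_test) = load_mnist()
# x_train has shape (60000, 28, 28) and x_test has shape (10000, 28, 28).
```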
Relevant XKCD --- Python really is wonderful.
Once we have the dataset, we have to format it appropriately for our neural network. This article is focused only on fully connected neural networks, which means our input data must be a vector. Instead of several 28x28 images, we'll have several vectors that are all length 784 (28 * 28 = 784). This flattening process is not ideal --- we obfuscate information about which pixels are next to each other.
Our networks will overcome this loss of information, but it is worth mentioning convolutional neural networks (CNNs). These are specifically designed for image processing/computer vision, and maintain these spatial relationships. In a future article, we will revisit MNIST with CNNs and compare our results.
Keras, again, provides a simple utility to help us flatten the 28x28 pixels into a vector:
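The image arrays are NumPy arrays, so one way to flatten them is a plain `reshape`. The sketch below runs on stand-in data rather than the real dataset, and `flatten_images` is a hypothetical helper name, not necessarily what the notebook uses.

```python
import numpy as np


def flatten_images(images):
    """Reshape a stack of (n, 28, 28) images into (n, 784) row vectors."""
    n_images = images.shape[0]
    # -1 lets NumPy work out the remaining dimension (28 * 28 = 784).
    return images.reshape(n_images, -1)


# Stand-in for real MNIST data: three blank 28x28 "images".
fake_images = np.zeros((3, 28, 28), dtype=np.uint8)
fake_images[0, 1, 2] = 7  # mark one pixel so we can see where it lands

flat = flatten_images(fake_images)
print(flat.shape)  # (3, 784)
```

Rows are laid end to end, so the pixel at row 1, column 2 ends up at position 1 * 28 + 2 = 30 in the flattened vector.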
We have to do one last thing to this dataset before we're ready to experiment with some neural networks. The labels for this dataset are numerical values from 0 to 9 --- but it's important that our algorithm treats these as items in a set, rather than ordinal values. In our dataset the value "0" isn't smaller than the value "9", they are just two different values from our set of possible classifications.
If our algorithm predicts "8" when it should predict "0" it is wrong to say that the model was "off by 8" --- it simply predicted the wrong category. Similarly, predicting "7" when we should have predicted "8" is not better than predicting "0" when we should have predicted "8" --- both are just wrong.
To address this issue, when we're making predictions about categorical data (as opposed to values from a continuous range), the best practice is to use a "one-hot encoded" vector. This means that we create a vector as long as the number of categories we have, and force the model to set exactly one of the positions in the vector to 1 and the rest to 0 (the single 1 is the "hot" value within the vector).
Thankfully, Keras makes this remarkably easy to do as well:
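Keras ships a `to_categorical` helper in `keras.utils` for this. To show what the encoding actually produces, here is a NumPy-only sketch; `one_hot` is a hypothetical name used just for this illustration.

```python
import numpy as np


def one_hot(labels, num_classes=10):
    """Turn integer labels into one-hot rows, one row per label."""
    labels = np.asarray(labels)
    # Row k of the identity matrix is the one-hot vector for class k.
    return np.eye(num_classes, dtype=np.float32)[labels]


encoded = one_hot([0, 8, 9])
print(encoded[1])  # [0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
```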
Finally, it is worth mentioning that there are a lot of other things we could do at this point to normalize/improve the input images themselves. Preprocessing is common (because it's a good idea) but we're going to ignore it for now. Our focus is on examining neural network architectures.
Neural network architectures
For fully connected neural networks, there are three essential questions that define the network's architecture:
- How many layers are there?
- How many nodes are there in each of those layers?
- What transfer/activation function is used at each of those layers?
This article explores the first two of these questions, while the third will be explored in a later article. The behavior of the transfer/activation function is closely related to gradient descent and backpropagation, so discussing the available options will make more sense after the next article in this series.
All of the network architectures in this article use the sigmoid transfer function for all of the hidden layers.
There are other factors that can contribute to the performance of a neural network. These include which loss function is used, which optimization algorithm is used, how many training epochs to run before stopping, and the batch size within each epoch. Changes to batch size and epochs are discussed here. But, to help us compare "apples to apples", I have kept the loss function and optimization algorithm fixed:
- I've selected a common loss function called categorical cross entropy.
- I've selected one of the simplest optimization algorithms: Stochastic Gradient Descent (SGD).
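To make the loss function less abstract, here is a small NumPy sketch of categorical cross entropy on made-up predictions over three classes. It is an illustration of the formula, not the Keras implementation, and the example probabilities are invented.

```python
import numpy as np


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean cross entropy between one-hot labels and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid taking log(0)
    per_example = -np.sum(y_true * np.log(y_pred), axis=1)
    return float(np.mean(per_example))


y_true = np.array([[0.0, 1.0, 0.0]])         # the correct class is index 1
confident = np.array([[0.05, 0.90, 0.05]])   # mostly right
unsure = np.array([[0.40, 0.30, 0.30]])      # spreads its bets

print(categorical_cross_entropy(y_true, confident))  # about 0.105
print(categorical_cross_entropy(y_true, unsure))     # about 1.204
```

The loss only looks at the probability assigned to the correct class, which fits the earlier point that every wrong category is equally wrong.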
Whew, now that all of that is out of the way, let's build our very first network:
Building the network
All the networks in this article will have the same input layer and output layer. We defined the input layer earlier as a vector with 784 entries --- this is the data from the flattened 28x28 image. The output layer was also implicitly defined earlier when we created a one-hot encoded vector from the labels --- the ten labels correspond to the ten nodes in this layer.
Our output layer also uses a special activation function called softmax. This normalizes the values from the ten output nodes such that:
- all the values are between 0 and 1, and
- the sum of all ten values is 1.
This allows us to treat those ten output values as probabilities, and the largest one is selected as the prediction for the one-hot vector. In machine learning, the softmax function is almost always used when our model's output is a one-hot encoded vector.
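A short NumPy sketch of softmax may help here. The ten raw scores are invented for the example; in a real network they would come from the output layer before the activation is applied.

```python
import numpy as np


def softmax(scores):
    """Map raw scores to values between 0 and 1 that sum to 1."""
    shifted = scores - np.max(scores)  # subtracting the max avoids overflow
    exps = np.exp(shifted)
    return exps / np.sum(exps)


# Invented raw outputs for the ten digit classes 0 through 9.
raw_scores = np.array([1.0, 3.0, 0.5, 2.0, 0.0, -1.0, 0.2, 0.1, 1.5, 0.3])
probabilities = softmax(raw_scores)

# The predicted digit is the position with the largest probability.
predicted_digit = int(np.argmax(probabilities))
print(predicted_digit)  # 1
```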
Finally, this model has a single hidden layer with 32 nodes using the sigmoid activation function. The resulting architecture has 25,450 tunable parameters. From the input layer to the hidden layer there are 784*32 = 25,088 weights. The hidden layer has 32 nodes so there are 32 biases. This brings us to 25,088 + 32 = 25,120 parameters.
From the hidden layer to the output layer there are 32*10 = 320 weights.
Each of the ten output nodes adds a single bias, bringing us to 25,120 + 320 + 10 = 25,450 total parameters.
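The same arithmetic can be checked with a few lines of plain Python. `count_dense_parameters` is a hypothetical helper written for this illustration; it simply adds up weights and biases between each pair of adjacent layers.

```python
def count_dense_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the number of nodes in each layer, input layer first.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        weights = n_in * n_out  # one weight per edge between the two layers
        biases = n_out          # one bias per node in the receiving layer
        total += weights + biases
    return total


print(count_dense_parameters([784, 32]))      # 25120
print(count_dense_parameters([784, 32, 10]))  # 25450
```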
Keras has a handy method to help you calculate the number of parameters in a model. Calling the .summary() method, we get:
Layer (type)                 Output Shape              Param #
=================================================================
dense_203 (Dense)            (None, 32)                25120
_________________________________________________________________
dense_204 (Dense)            (None, 10)                330
=================================================================
Total params: 25,450
Trainable params: 25,450
Non-trainable params: 0
We can use Keras to train and evaluate this model as well:
[CODE BLOCK --- train_and_evalulate_first_model.py]
Training and validation accuracy over time. Final test accuracy: 0.87.
Performance varies a little bit from run to run (give it a try in the Jupyter notebook), but accuracy is consistently between 87--90%. This is an incredible result. We have obfuscated spatial relationships within the data by flattening the images. We have done zero feature extraction to help the model understand the data. Yet, in under one minute of training on consumer grade hardware, we're already doing nearly nine times better than guessing randomly.
Network depth and layer width
While there are some rules of thumb, the only way to determine the best architecture for any particular task is empirically. Sometimes the "sensible defaults" will work well, and other times they won't work at all. The only way to find out for sure if your neural network works on your data is to test it, and measure your performance.
Neural network architecture is the subject of quite a lot of open research. Finding a new architecture that outperforms existing architectures on a particular task is often an achievement worthy of publication. It's common for practitioners to select an architecture based on a recent publication, and either copy it wholesale for a new task or make minor tweaks to gain incremental improvement.
Still, there is a lot to learn from reinventing some simple wheels from scratch. Let's examine a few alternatives to this small network and examine the impact of those changes.
The depth of a multi-layer perceptron (also known as a fully connected neural network) is determined by its number of hidden layers. The network above has one hidden layer. This network is so shallow that it's technically inaccurate to call it "deep learning".
Let's experiment with networks of different depths to see how the number of hidden layers impacts performance. I have written a couple of short functions to help reduce boilerplate throughout this tutorial:
The evaluate function prints a summary of the model, trains the model, graphs the training and validation accuracy, and prints a summary of its performance on the test data. By default it does all this using the fixed hyperparameters we've discussed, specifically:
- stochastic gradient descent (SGD)
- five training epochs
- training batch size of 128
- the categorical cross entropy loss function.
The create_dense function lets us pass in an array of sizes for the hidden layers. It creates a multi-layer perceptron that always has appropriate input and output layers for our MNIST task. Specifically, the models will have:
- an input vector of length 784
- an output vector of length ten that uses a one-hot encoding and the softmax activation function
- a number of hidden layers with the widths specified by the input array, all using the sigmoid activation function.
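Here is a sketch of what two helpers like these could look like, assuming a recent Keras install. It is not the notebook's exact code: passing the data in as arguments, the 10% validation split, and the return values are assumptions made for this example, and the accuracy plot is left out to keep it short.

```python
import keras
from keras import layers


def create_dense(layer_sizes, image_size=784, num_classes=10):
    """Build a sigmoid MLP with one hidden layer per entry in layer_sizes."""
    model = keras.Sequential()
    model.add(keras.Input(shape=(image_size,)))
    for size in layer_sizes:
        model.add(layers.Dense(size, activation="sigmoid"))
    # Softmax output: one probability per digit class.
    model.add(layers.Dense(num_classes, activation="softmax"))
    return model


def evaluate(model, x_train, y_train, x_test, y_test, batch_size=128, epochs=5):
    """Train with SGD and categorical cross entropy, then score on test data."""
    model.summary()
    model.compile(
        optimizer="sgd",
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    # validation_split=0.1 is an assumption: hold out 10% of the training data.
    history = model.fit(
        x_train, y_train,
        batch_size=batch_size, epochs=epochs,
        validation_split=0.1, verbose=0,
    )
    loss, accuracy = model.evaluate(x_test, y_test, verbose=0)
    print(f"Test loss: {loss:.3f}, test accuracy: {accuracy:.3f}")
    return history, accuracy
```

With flattened images and one-hot labels in hand, a call such as `evaluate(create_dense([32]), x_train, y_train, x_test, y_test)` would train the single-hidden-layer network described earlier.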
This code uses these functions to create and evaluate several neural nets of increasing depth, each with 32 nodes per hidden layer:
In Python: [32] * 2 => [32, 32] and [32] * 3 => [32, 32, 32], and so on...
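As a quick illustration of that list trick, the snippet below builds the hidden-layer sizes for networks with one to four hidden layers. The variable names are illustrative; each list is what would be handed to a model-building helper.

```python
hidden_width = 32

# One list of hidden-layer sizes per network depth, from 1 to 4 hidden layers.
architectures = [[hidden_width] * depth for depth in range(1, 5)]

for sizes in architectures:
    print(len(sizes), "hidden layer(s):", sizes)
```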
Running this code produces some interesting charts, via the evaluate function defined above:
One hidden layer, final test accuracy: 0.888
2 hidden layers, final test accuracy: 0.767
3 hidden layers, final test accuracy: 0.438
4 hidden layers, final test accuracy: 0.114
Adding more layers appears to have decreased the accuracy of the model. That might not be intuitive --- aren't we giving the model greater flexibility and therefore increasing its ability to make predictions? Unfortunately the trade-off isn't that simple.
One thing we should look for is overfitting . Neural networks are flexible enough that they can adjust their parameters to fit the training data so precisely that they no longer generalize to data from outside the training set (for example, the test data). This is kind of like memorizing the answers to a specific math test without learning how to actually do the math.
Overfitting is a problem with many machine learning tasks. Neural networks are especially prone to overfitting because of the very large number of tunable parameters. One sign that you might be overfitting is that the training accuracy is significantly better than the test accuracy. But only one of our results --- the network with four hidden layers --- has that feature. That model's accuracy is barely better than guessing at random, even during training. Something more subtle is going on here.
In some ways a neural network is like a game of telephone --- each layer only gets information from the layer right before it. The more layers we add, the more the original message is changed, which is sometimes a strength and sometimes a weakness.
If the series of layers allows the build-up of useful information, then stacking layers can cause higher levels of meaning to build up. One layer finds edges in an image, another finds edges that make circles, another finds edges that make lines, another finds combinations of circles and lines, and so on.
On the other hand, if the layers are destructively removing context and useful information then, like in the game of telephone, the signal deteriorates as it passes through the layers until all the valuable information is lost.
Imagine you had a hidden layer with only one node --- this would force the network to reduce all the interesting interactions so far into a single value, then propagate that single value through the subsequent layers of the network. Information is lost, and such a network would perform terribly.
Another useful way to think about this is in terms of image resolution --- originally we had a "resolution" of 784 pixels but we forced the neural network to downsample very quickly to a "resolution" of 32 values. These values are no longer pixels, but combinations of the pixels in the previous layer.
Compressing the resolution once was (evidently) not so bad. But, like repeatedly saving a JPEG, repeated chains of "low resolution" data transfer from one layer to the next can result in lower quality output.
Finally, because of the way backpropagation and optimization algorithms work with neural networks, deeper networks require more training time. It may be that our model's 32-node-per-layer architecture just needs to train longer.
If we let the three-layer network from above train for 40 epochs instead of five, we get these results:
3 hidden layers, 40 training epochs instead of 5. Final test accuracy: .886
The only way to really know which of these factors is at play in your own models is to design tests and experiment. Always keep in mind that many of these factors can impact your model at the same time and to different degrees.
Another knob we can turn is the number of nodes in each hidden layer. This is called the width of the layer. As with adding more layers, making each layer wider increases the total number of tunable parameters. Making wider layers tends to scale the number of parameters faster than adding more layers. Every time we add a single node to layer i, we have to give that new node an edge to every node in layer i+1 (and an edge from every node in layer i-1).
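A rough way to see the difference is to count parameters for a few shapes of network. The helper below is hypothetical and assumes the 784-entry input and ten-node output used throughout this article.

```python
def parameter_count(hidden_sizes, input_size=784, output_size=10):
    """Weights plus biases for an MNIST-shaped fully connected network."""
    sizes = [input_size, *hidden_sizes, output_size]
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


print(parameter_count([32]))      # 25450   baseline
print(parameter_count([32, 32]))  # 26506   one extra 32-node layer
print(parameter_count([64]))      # 50890   same depth, twice as wide
print(parameter_count([2048]))    # 1628170
```

Adding a second 32-node layer adds about a thousand parameters, while doubling the width of the single hidden layer roughly doubles the total, because most of the weights sit between the 784 inputs and the first hidden layer.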
Using the same evaluate and create_dense functions as above, let's compare a few neural networks with a single hidden layer using different layer widths.
Once again, running this code produces some interesting charts:
One hidden layer, 32 nodes. Final test accuracy: .886
One hidden layer, 64 nodes. Final test accuracy: .904
One hidden layer, 128 nodes. Final test accuracy: .916
One hidden layer, 256 nodes. Final test accuracy: .926
One hidden layer, 512 nodes. Final test accuracy: .934
One hidden layer, 2048 nodes. Final test accuracy: .950. This model has a hint of potential overfitting --- notice where the lines cross at the very end of our training period.
This time the performance changes are more intuitive --- more nodes in the hidden layer consistently mapped to better performance on the test data. Our accuracy improved from ~89% with 32 nodes to ~95% with 2048 nodes. Not only that, but the accuracy during our final round of training very nearly predicted the accuracy on the test data --- a sign that we are probably not overfitting.
The cost for this improvement was additional training time. As the number of tunable parameters ballooned from a meager 25,000 with 32 nodes to over 1.6 million with 2,048 nodes, our training epochs went from taking about one second each to about 10 seconds each (on my MacBook Pro --- your mileage may vary).
Still, these models are fast to train relative to many state-of-the-art industrial models. The version of AlphaGo that defeated Lee Sedol trained for 4--6 weeks. OpenAI wrote a blogpost that also helps contextualize the extraordinary computational resources that go into training state-of-the-art models. From the article:
" ... the amount of compute used in the largest AI training runs has been increasing exponentially with a 3.5 month-doubling time ... "
When we have good data and a good model, there is a strong correlation between training time and model performance. This is why many state-of-the-art models train for weeks or months once the authors have confidence in the model's ability. It seems that patience is still a virtue.
Combining width and depth
With the intuition that more nodes tend to yield better performance, let's revisit the question of adding layers. Recall that stacking layers can either build up meaningful information or destroy information through downsampling.
Let's see what happens when we stack bigger layers. Repeated layers of 32 seemed to degrade the overall performance of our networks --- will that still be true as we stack larger layers?
I have increased the number of epochs as the depth of the network increases for the reasons discussed above. The combination of deeper networks, with more nodes per hidden layer, and the increased training epochs results in code that takes longer to run. Fortunately, you can check out the Jupyter Notebook where the results have already been computed.
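The experiment grid can be sketched in plain Python. The widths and depths match the results discussed in this section, but the rule for scaling epochs with depth is an assumption made for this example (`EPOCHS_PER_LAYER` is a hypothetical constant); check the notebook for the values actually used.

```python
from itertools import product

widths = [32, 128, 512]    # nodes per hidden layer
depths = [1, 2, 3, 4, 5]   # number of hidden layers
EPOCHS_PER_LAYER = 10      # assumption: not taken from the article

experiments = []
for width, depth in product(widths, depths):
    experiments.append({
        "hidden_layers": [width] * depth,
        # Deeper networks get proportionally more training epochs.
        "epochs": EPOCHS_PER_LAYER * depth,
    })

print(len(experiments))  # 15
```

Each entry describes one network to build and train, so the full run covers fifteen models.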
You can see all the graphs in the Jupyter Notebook, but I want to highlight a few points of interest.
With this particular training regimen, the single-layer 512-nodes-per-layer network ended up with the highest test accuracy, at 94.7%. In general, the trend we saw before --- where deeper networks perform worse --- persists. However, the differences are pretty small, on the order of 1--2 percentage points. Also, the graphs suggest the discrepancies might be overcome with more training for the deeper networks.
For all numbers of nodes-per-layer, the graphs for one, two, and three hidden layers look pretty standard. More training improves the network, and the rate of improvement slows down as accuracy rises.
One 32 node layer
Two 128 node layers
Three 512 node layers
But when we get to four and five layers things start looking bad for the 32-node models:
Four 32 node layers.
Five 32 node layers.
The other two five-layer networks have interesting results as well:
Five 128 node layers.
Five 512 node layers.
Both of these seem to overcome some initial poor performance and look as though they might be able to continue improving with more training. This may be a limitation of not having enough data to train the network. As our models become more complex, and as information about error has to propagate through more layers, our models can fail to learn --- they don't have enough information to learn from.
Decreasing the batch size by processing fewer data-points before giving the model a correction can help. So can increasing the number of epochs, at the cost of increased training time. Unfortunately, it can be difficult to tell if you have a junk architecture or just need more data and more training time without testing your own patience.
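One way to see why a smaller batch size can help is to count how many weight updates the model receives in each epoch. The sketch below assumes all 60,000 MNIST training images are used; holding some out for validation would lower the counts slightly.

```python
import math


def updates_per_epoch(num_samples, batch_size):
    """Number of weight updates in one pass over the training data."""
    return math.ceil(num_samples / batch_size)


MNIST_TRAIN_SIZE = 60_000  # size of the full MNIST training split

print(updates_per_epoch(MNIST_TRAIN_SIZE, 128))  # 469
print(updates_per_epoch(MNIST_TRAIN_SIZE, 16))   # 3750
```

Dropping the batch size from 128 to 16 gives the optimizer eight times as many corrections per epoch, which is also part of why each epoch takes longer.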
For example, this is what happened when I reduced the batch size to 16 (from 128) and trained the 32-node-per-layer network with five hidden layers for 50 epochs (which took about 30 minutes on my hardware):
Five 32-node hidden layers, batch size 16, 50 epochs. Final test accuracy: .827
So, it doesn't look like 32 nodes per layer is downsampling or destroying information to an extent that this network cannot learn anything. That said, many of the other networks we've built perform better with significantly less training. Our precious training time would probably be better spent on other network architectures.
While I hope this article was helpful and enlightening, it's not exhaustive. Even experts are sometimes confused and unsure about which architectures will work and which will not. The best way to bolster your intuition is to practice building and experimenting with neural network architectures for yourself.
In the next article, we'll explore gradient descent, a cornerstone algorithm for training neural networks.
If you want some homework --- and I know you do --- Keras has a number of fantastic datasets built in just like the MNIST dataset. To ensure you've learned something from this article, consider recreating experiments like the ones I've done above using the Fashion MNIST dataset, which is a bit more challenging than the regular MNIST dataset.