# How to classify MNIST digits with different neural network architectures

## Getting started with neural networks and Keras

Tyler Elliot Bettilyon
Aug 8, 2018

Photo by Greg Rakozy on Unsplash

You can find the second article in the series here.

### Neural networks

One reason people treat neural networks as black boxes is that the structure of any given network is difficult to reason about. A few terms used throughout this article:

• The name x refers to input data, while the name y refers to the labels. ŷ (pronounced y-hat) refers to the predictions made by a model.
• Training data is the data our model learns from.
• Test data is kept secret from the model until after it has been trained. Test data is used to evaluate our model.
• A loss function quantifies how accurate a model's predictions are.
• An optimization algorithm controls exactly how the weights of the computational graph are adjusted during training.

### MNIST handwritten digits dataset

A random selection of MNIST digits. In the Jupyter Notebook you can view more random selections from the dataset.

The MNIST dataset is a classic problem for getting started with neural networks. I've heard people half-jokingly call it the "hello world" of deep learning: plenty of simple networks achieve excellent results on this dataset, even though some of the digits are quite tricky:
This image is from the wonderful book Neural Networks and Deep Learning.

### Preparing the data

*Python is pretty great.*

We're building fully connected networks, which means our input data must be vectors. Instead of a collection of 28x28 images, we'll have a collection of vectors of length 784 (28 * 28 = 784). This flattening process isn't ideal: we lose the information about which pixels are adjacent to each other.

There is a family of networks built to preserve exactly this structure: convolutional neural networks (CNNs). They were designed for image processing and computer vision, and they maintain these spatial relationships. In a later article we'll revisit MNIST with CNNs and compare our results.

Once again, Keras provides a simple utility to help us flatten the 28x28 pixels into vectors:
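A minimal sketch of that flattening step, assuming the data comes in as NumPy arrays the way `keras.datasets.mnist` provides it (the zero array below is just a stand-in for the real images):

```python
import numpy as np

# Stand-in for the MNIST training images: 60,000 grayscale 28x28 images.
# (In practice these come from keras.datasets.mnist.load_data().)
x_train = np.zeros((60000, 28, 28), dtype=np.uint8)

# Flatten each 28x28 image into a length-784 vector and rescale the
# 0-255 pixel values into the 0-1 range.
x_train = x_train.reshape(60000, 784).astype("float32") / 255

print(x_train.shape)  # (60000, 784)
```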

### Neural network architectures

1. How many layers are there?
2. How many nodes are there in each of those layers?
3. What transfer/activation function is used at each of those layers?

• I've selected a common loss function called categorical cross entropy.
• I've selected one of the simplest optimization algorithms: Stochastic Gradient Descent (SGD).

#### Building the network

The output layer uses the softmax activation function, which normalizes the values of the ten output nodes so that:

• all the values are between 0 and 1, and
• the sum of all ten values is 1.
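Those two properties are straightforward to verify with a quick NumPy sketch (the raw output values below are made up for illustration):

```python
import numpy as np

def softmax(z):
    # Subtracting the max is a standard numerical-stability trick;
    # it doesn't change the result.
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Ten arbitrary raw output values, one per digit class.
logits = np.array([2.0, 1.0, 0.1, -1.0, 0.5, 0.0, 3.0, -2.0, 1.5, 0.2])
probs = softmax(logits)

# Every value lies between 0 and 1, and the ten values sum to 1.
print(probs.min() > 0, probs.max() < 1)
print(np.isclose(probs.sum(), 1.0))
```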

Each of the 784 input nodes has an edge to each of the 32 nodes in the hidden layer: 784 * 32 = 25,088 weights. The hidden layer has 32 nodes, so it has 32 biases. That brings us to 25,088 + 32 = 25,120 parameters.
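A sketch of how this network might be assembled in Keras (the widths, activations, loss, and optimizer follow the article; the original notebook's code may differ in details):

```python
from tensorflow import keras
from tensorflow.keras import layers

# A 784 -> 32 -> 10 fully connected network, matching the parameter
# counts above: 784*32 + 32 = 25,120 and 32*10 + 10 = 330.
model = keras.Sequential([
    keras.Input(shape=(784,)),
    layers.Dense(32, activation="sigmoid"),
    layers.Dense(10, activation="softmax"),
])

model.compile(
    loss="categorical_crossentropy",  # the loss function chosen earlier
    optimizer="sgd",                  # stochastic gradient descent
    metrics=["accuracy"],
)
```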

Keras has a handy method to help you calculate the number of parameters in a model. Calling the `.summary()` method, we get:

```
Layer (type)                 Output Shape              Param #
=================================================================
dense_203 (Dense)            (None, 32)                25120
_________________________________________________________________
dense_204 (Dense)            (None, 10)                330
=================================================================
Total params: 25,450
Trainable params: 25,450
Non-trainable params: 0
```

[CODE BLOCK: train_and_evalulate_first_model.py]
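The missing code block presumably trains and evaluates the compiled model. A self-contained sketch with synthetic stand-in data (an assumption: the real notebook uses the actual MNIST arrays rather than random ones):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Synthetic stand-ins for the flattened MNIST data (shapes match the
# article's setup; the real notebook loads keras.datasets.mnist).
rng = np.random.default_rng(0)
x_train = rng.random((1000, 784)).astype("float32")
y_train = keras.utils.to_categorical(rng.integers(0, 10, 1000), 10)
x_test = rng.random((200, 784)).astype("float32")
y_test = keras.utils.to_categorical(rng.integers(0, 10, 200), 10)

model = keras.Sequential([
    keras.Input(shape=(784,)),
    layers.Dense(32, activation="sigmoid"),
    layers.Dense(10, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="sgd",
              metrics=["accuracy"])

model.fit(x_train, y_train, epochs=5, batch_size=128, verbose=0)
loss, accuracy = model.evaluate(x_test, y_test, verbose=0)
print(f"Final test accuracy: {accuracy:.3f}")
```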
Training and validation accuracy over time. Final test accuracy: 0.87.

Results vary a bit from run to run because of randomness in training, but accuracy consistently lands between 87% and 90%. This is an incredible result. We obscured the spatial relationships in the data by flattening the images. We did zero feature extraction to help the model make sense of the data. Yet in under a minute of training on consumer-grade hardware, we're already nearly 9x better than random guessing.

### Network depth and layer width

#### Network depth

The `evaluate` function used for the rest of these experiments trains each model with:

• five training epochs
• training batch size of 128
• the categorical cross entropy loss function.

The `create_dense` function builds a fully connected network with:

• an input vector of length 784
• an output vector of length ten that uses a one-hot encoding and the softmax activation function
• a number of layers with the widths specified by the input array all using the sigmoid activation function.
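Reconstructed from the bullet points above, the two helpers might look roughly like this (a sketch; argument names and details are assumptions, not the notebook's exact code):

```python
from tensorflow import keras
from tensorflow.keras import layers

def create_dense(layer_widths):
    """Build a fully connected network: 784 inputs, sigmoid hidden
    layers with the given widths, and a 10-way softmax output."""
    model = keras.Sequential()
    model.add(keras.Input(shape=(784,)))
    for width in layer_widths:
        model.add(layers.Dense(width, activation="sigmoid"))
    model.add(layers.Dense(10, activation="softmax"))
    return model

def evaluate(model, x_train, y_train, x_test, y_test,
             batch_size=128, epochs=5):
    """Train with categorical cross entropy and SGD, then report
    accuracy on the held-out test data."""
    model.compile(loss="categorical_crossentropy", optimizer="sgd",
                  metrics=["accuracy"])
    model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs,
              verbose=0)
    loss, accuracy = model.evaluate(x_test, y_test, verbose=0)
    return accuracy
```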

One hidden layer, final test accuracy: 0.888

Two hidden layers, final test accuracy: 0.767

Three hidden layers, final test accuracy: 0.438

Four hidden layers, final test accuracy: 0.114

#### Overfitting

The decline in test accuracy as we add layers is a symptom of overfitting. Neural networks are flexible enough to adjust their parameters to fit the training data so precisely that they no longer generalize to data outside the training set (for example, the test data). It's a bit like memorizing the answers to a specific math test without learning how to actually do math.

Each layer receives information only from the layer before it. The more layers we add, the more the original signal is transformed before it reaches the output, which is sometimes a strength and sometimes a weakness.

3 hidden layers, 40 training epochs instead of 5. Final test accuracy: .886

#### Layer width

Widening layers increases the parameter count quickly: when we add a new node to a layer, we have to give that node an edge to every node in layer i+1.
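For a single hidden layer on MNIST this growth is easy to quantify; going from 32 to 2048 nodes multiplies the parameter count by roughly 64:

```python
# Parameters for one hidden layer of width w on MNIST:
# 784 input weights + 1 bias per hidden node, plus 10 output weights
# per hidden node + 10 output biases.
def param_count(w):
    return (784 * w + w) + (w * 10 + 10)

for w in [32, 64, 128, 256, 512, 2048]:
    print(w, param_count(w))
# width 32 gives 25,450 parameters; width 2048 gives 1,628,170.
```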

Using the same `evaluate` and `create_dense` functions, here's how networks with one wider hidden layer perform:

One hidden layer, 32 nodes. Final test accuracy: .886

One hidden layer, 64 nodes. Final test accuracy: .904

One hidden layer, 128 nodes. Final test accuracy: .916

One hidden layer, 256 nodes. Final test accuracy: .926

One hidden layer, 512 nodes. Final test accuracy: .934

One hidden layer, 2048 nodes. Final test accuracy: .950. This model has a hint of potential overfitting: notice where the lines cross at the very end of our training period.

Training today's state-of-the-art models takes extraordinary computational resources. OpenAI wrote a blog post that helps put this in context:

"...the amount of compute used in the largest AI training runs has been increasing exponentially with a 3.5 month-doubling time..."

#### Combining width and depth

One 32 node layer

Two 128 node layers

Three 512 node layers

Four 32 node layers.

Five 32 node layers.

Five 128 node layers.

Five 512 node layers.

Five 32-node hidden layers, batch size 16, 50 epochs. Final test accuracy: .827

### Further steps

In the next article, we'll explore gradient descent, the algorithm at the heart of training neural networks.

You could also try these architectures on the Fashion-MNIST dataset, which is more challenging than the regular MNIST dataset.
