We are going to study Batch Norm, Weight Norm, Layer Norm, Instance Norm, Group Norm, Batch-Instance Norm, Switchable Norm{#591f}

Let's start with the question,{#d0f6}

Normalization has always been an active area of research in deep learning. Normalization techniques can decrease your model's training time by a huge factor. Let me state some of the benefits of using Normalization.{#5fbf}

- It normalizes each feature so that every feature keeps its contribution, since some features have larger numerical values than others. This way the network is not biased toward higher-valued features.{#5ea7}
- It reduces **Internal Covariate Shift**, the change in the distribution of network activations caused by changes in network parameters during training. To improve training, we seek to reduce the internal covariate shift.{#af62}
- In this paper, the authors claim that Batch Norm makes the loss surface smoother (i.e. it bounds the magnitude of the gradients much more tightly).{#ceb0}
- It makes optimization faster, because normalization doesn't allow weights to explode all over the place and restricts them to a certain range.{#b5c7}
- An unintended benefit of normalization is that it helps regularize the network (only slightly, not significantly).{#02db}

From above, we can conclude that getting Normalization right can be a crucial factor in getting your model to train effectively, but this isn't as easy as it sounds. Let me support this by certain questions.{#35dc}

- How do normalization layers behave in distributed training?{#90a4}
- Which normalization technique should you use for your task, such as CNNs, RNNs, or style transfer?{#c737}
- What happens when you change the batch size during training?{#81be}
- Which normalization technique gives the best trade-off between computation and accuracy for your network?{#6c84}

To answer these questions, let's dive into the details of each normalization technique one by one.{#b506}

Batch normalization is a method that normalizes activations in a network across a mini-batch of fixed size. For each feature, batch normalization computes the mean and variance of that feature over the mini-batch. It then subtracts the mean and divides the feature by its mini-batch standard deviation.{#3107}

This image is taken from https://arxiv.org/pdf/1502.03167.pdf

But wait, what if increasing the magnitude of the weights made the network perform better?{#b01d}

To solve this issue, we can add γ and β as learnable scale and shift parameters, respectively. This all can be summarized as:{#436e}

ϵ is the stability constant in the equation.
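
As a rough illustration of the transform described above, y = γ · (x − μ) / √(σ² + ϵ) + β, here is a minimal NumPy sketch of batch normalization in training mode for a 2D input of shape (batch, features). The function name `batch_norm` and the default `eps` value are assumptions made for this example, and the running statistics used at test time are left out.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch norm for x of shape (batch, features), training mode only."""
    mu = x.mean(axis=0)                    # per-feature mean over the mini-batch
    var = x.var(axis=0)                    # per-feature (biased) variance over the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize each feature
    return gamma * x_hat + beta            # learnable scale and shift

# Two features on very different numerical scales.
x = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])
y = batch_norm(x, gamma=np.ones(2), beta=np.zeros(2))
```

With γ = 1 and β = 0, both columns of `y` end up with roughly zero mean and unit standard deviation, even though the raw features differ by two orders of magnitude.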

## Problems associated with Batch Normalization {#e257}

- **Variable Batch Size** → If the batch size is 1, the variance is 0, which doesn't allow batch norm to work. Furthermore, with a small mini-batch size the statistics become too noisy and training might be affected. There would also be a problem in **distributed training**: if you are computing on different machines, you have to use the same batch size on each, because otherwise γ and β will be different for different systems.{#2096}
- **Recurrent Neural Network** → In an RNN, the recurrent activations of each time-step will have a different story to tell (i.e. different statistics). This means that we have to fit a separate batch norm layer for each time-step. This makes the model more complicated and space-consuming, because it forces us to store the statistics for each time-step during training.{#4782}
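
To make the batch-size problem concrete, the small NumPy sketch below (an illustration written for this article, not code from any paper) normalizes a mini-batch that contains a single example. The value of `eps` is an assumption. The variance over the batch axis is exactly 0, so every normalized activation collapses to 0 regardless of the input.

```python
import numpy as np

eps = 1e-5                             # assumed stability constant
x = np.array([[3.0, -7.0, 42.0]])      # mini-batch with a single example, 3 features

mu = x.mean(axis=0)                    # equals the example itself
var = x.var(axis=0)                    # zero for every feature
x_hat = (x - mu) / np.sqrt(var + eps)  # numerator is 0, so the output is all zeros
```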

Batch norm alternatives (or better norms) are discussed below in detail, but if you are only interested in a very short description (or a quick revision just by looking at an image), look at this:{#1d8e}

This image is taken from https://arxiv.org/pdf/1803.08494.pdf

Wait, why don't we **normalize the weights of a layer** instead of normalizing the activations directly? Well, Weight Normalization does exactly that.{#168f}

Weight normalization reparameterizes the weights (**ω**) as:{#92cf}

It separates the magnitude of the weight vector from its direction, which has a similar effect to dividing by the variance in batch normalization. The only difference is that the scaling comes from the norm of the weights rather than from the variation of the activations.{#b12c}
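
In the Weight Normalization paper the reparameterization is written as w = (g / ‖v‖) · v, where the scalar g carries the magnitude and the vector v carries the direction. The NumPy sketch below illustrates this for a single neuron; the names `g`, `v` and `weight_norm` are chosen for this example only.

```python
import numpy as np

def weight_norm(v, g):
    """Return w = g * v / ||v||: direction from v, magnitude from g."""
    return g * v / np.linalg.norm(v)

v = np.array([3.0, 4.0])   # direction parameter (its own length is 5)
g = 2.0                    # magnitude parameter
w = weight_norm(v, g)      # the norm of w equals g, whatever the length of v
```

Rescaling `v` leaves `w` unchanged, which is what decoupling the length of the weight vector from its direction means in practice.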

As for the mean, the authors of this paper cleverly combine **mean-only batch normalization** and **weight normalization** to get the desired output even with small mini-batches. This means that they subtract out the mean of the mini-batch but do not divide by the standard deviation; the rescaling is done by weight normalization instead.{#3b01}
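
A minimal sketch of this combination for one fully connected layer, assuming NumPy and hypothetical names (`v`, `g`, `b`, `mean_only_bn_layer`): the weights are normalized as in weight normalization, and the pre-activations are only centered with the mini-batch mean, never divided by a batch statistic.

```python
import numpy as np

def mean_only_bn_layer(x, v, g, b):
    """x: (batch, in_features); v: (in_features, out_features); g, b: (out_features,)."""
    w = g * v / np.linalg.norm(v, axis=0)  # weight normalization, one norm per output unit
    t = x @ w                              # pre-activations
    t = t - t.mean(axis=0)                 # mean-only batch norm: subtract the mini-batch mean
    return t + b                           # learned bias; no division by a batch statistic

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
v = rng.normal(size=(3, 2))
out = mean_only_bn_layer(x, v, g=np.ones(2), b=np.zeros(2))
```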

Note: The mean is less noisy than the variance (which makes the mean a good choice over the variance above) due to the law of large numbers.{#ea19}

The paper shows that weight normalization combined with mean-only batch normalization achieves the best results on CIFAR-10.{#b097}

Layer normalization normalizes the input across the features, instead of normalizing each input feature across the batch dimension as batch normalization does.{#f031}

A mini-batch consists of multiple examples with the same number of features. Mini-batches are matrices (or tensors) where one axis corresponds to the batch and the other axis (or axes) correspond to the feature dimensions.{#5d92}

i represents batch and j represents features. xᵢ,ⱼ is the i,j-th element of the input data.
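
With i indexing the batch and j indexing the features, the only change from batch norm is the axis over which the statistics are taken. The NumPy sketch below is one way to write this for a 2D input; the function name `layer_norm` and the default `eps` are assumptions for the example.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer norm for x of shape (batch, features)."""
    mu = x.mean(axis=1, keepdims=True)     # one mean per example i, taken over features j
    var = x.var(axis=1, keepdims=True)     # one variance per example i
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 20.0, 30.0, 40.0]])
y = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))

# The result for an example does not depend on the rest of the batch.
y_single = layer_norm(x[:1], gamma=np.ones(4), beta=np.zeros(4))
```

Because each example is normalized on its own, the output is the same for any batch size, including a batch of one.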

The authors of the paper claim that layer normalization performs better than batch norm in the case of **RNN**s.{#acbb}

Layer normalization and instance normalization are very similar to each other, but the difference between them is that instance normalization normalizes across each channel in each training example, instead of normalizing across all input features in a training example. Unlike batch normalization, the instance normalization layer is applied at test time as well (since it does not depend on the mini-batch).{#b494}

This image is taken from https://arxiv.org/pdf/1607.08022.pdf

Here, **x** ∈ ℝ^(T×C×W×H) is an input tensor containing a batch of T images. Let **x**ₜᵢⱼₖ denote its tijk-th element, where k and j span the spatial dimensions (**H**eight and **W**idth of the image), i is the feature channel (color channel if the input is an RGB image), and t is the index of the image in the batch.{#ec58}

This technique was originally devised for **style transfer**. The problem instance normalization tries to address is that the network should be agnostic to the **contrast** of the original image.{#66d3}
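
The definition above can be sketched in NumPy as follows, assuming the (T, C, W, H) layout used in the text and leaving out any learnable scale and shift. The function name `instance_norm` and the `eps` default are assumptions for this example. The last line probes the contrast claim by rescaling and shifting the pixel values of the whole input.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Instance norm for x of shape (T, C, W, H); no learnable parameters here."""
    mu = x.mean(axis=(2, 3), keepdims=True)   # one mean per image t and channel i
    var = x.var(axis=(2, 3), keepdims=True)   # one variance per image t and channel i
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(2, 3, 4, 4))
y = instance_norm(x)

# Changing contrast (scale) and brightness (shift) leaves the output almost unchanged.
y_contrast = instance_norm(2.0 * x + 1.0)
```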

As the name suggests, Group Normalization normalizes over groups of channels for each training example. We can say that Group Norm is in between Instance Norm and Layer Norm.{#2835}

**∵ When we put all the channels into a single group, group normalization becomes Layer normalization. And, when we put each channel into different groups it becomes Instance normalization.** {#f54e}

Sᵢ is defined below.

Here, **x** is the feature computed by a layer, and i is an index. In the case of 2D images, **i** = (i_N, i_C, i_H, i_W) is a 4D vector indexing the features in (N, C, H, W) order, where N is the batch axis, **C** is the channel axis, and **H** and **W** are the spatial height and width axes. **G** is the number of groups, which is a pre-defined hyper-parameter. C/G is the number of channels per group. ⌊.⌋ is the floor operation, and "**⌊k_C/(C/G)⌋ = ⌊i_C/(C/G)⌋**" means that the indexes **i** and **k** are in the same group of channels, assuming each group of channels is stored in sequential order along the C axis. GN computes µ and σ along the (H, W) axes and along a group of C/G channels.{#a539}
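
A short NumPy sketch of this computation, assuming an (N, C, H, W) input with C divisible by G and omitting the learnable scale and shift. The function name `group_norm` and the `eps` default are assumptions for this example. The reshape puts consecutive channels into the same group, matching the ⌊i_C/(C/G)⌋ rule above.

```python
import numpy as np

def group_norm(x, G, eps=1e-5):
    """Group norm for x of shape (N, C, H, W) with C divisible by G."""
    N, C, H, W = x.shape
    xg = x.reshape(N, G, C // G, H, W)            # consecutive channels share a group
    mu = xg.mean(axis=(2, 3, 4), keepdims=True)   # per example and per group
    var = xg.var(axis=(2, 3, 4), keepdims=True)
    xg = (xg - mu) / np.sqrt(var + eps)
    return xg.reshape(N, C, H, W)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 6, 4, 4))

y = group_norm(x, G=3)            # 3 groups of 2 channels each
as_layer = group_norm(x, G=1)     # one group: statistics over (C, H, W), as in layer norm
as_instance = group_norm(x, G=6)  # one channel per group: statistics over (H, W), as in instance norm
```

The two extreme settings of `G` reproduce the layer norm and instance norm statistics, which is the sense in which Group Norm sits between them.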

The problem with instance normalization is that it completely erases style information. Though this has its own merits (such as in style transfer), it can be problematic in conditions where contrast matters (for example, in weather classification the brightness of the sky matters). Batch-instance normalization attempts to deal with this by learning how much style information should be used for each channel (**C**).{#750c}

Batch-Instance Normalization is just an interpolation between batch norm and instance norm.{#0820}

The value of ρ is between 0 and 1.
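
As a sketch of this interpolation, assume the output takes the form y = (ρ · x̂_B + (1 − ρ) · x̂_I) · γ + β, where x̂_B and x̂_I are the batch-normalized and instance-normalized responses and ρ has one entry per channel. The NumPy code below follows that assumption; the function name, the `eps` default and the clipping of ρ inside the function are choices made for this example.

```python
import numpy as np

def batch_instance_norm(x, rho, gamma, beta, eps=1e-5):
    """x: (N, C, H, W); rho, gamma, beta: (C,) with rho in [0, 1]."""
    # Batch-normalized response: statistics per channel over (N, H, W).
    mu_b = x.mean(axis=(0, 2, 3), keepdims=True)
    var_b = x.var(axis=(0, 2, 3), keepdims=True)
    x_b = (x - mu_b) / np.sqrt(var_b + eps)
    # Instance-normalized response: statistics per example and channel over (H, W).
    mu_i = x.mean(axis=(2, 3), keepdims=True)
    var_i = x.var(axis=(2, 3), keepdims=True)
    x_i = (x - mu_i) / np.sqrt(var_i + eps)
    # Per-channel interpolation, followed by scale and shift.
    rho = np.clip(rho, 0.0, 1.0).reshape(1, -1, 1, 1)
    out = rho * x_b + (1.0 - rho) * x_i
    return gamma.reshape(1, -1, 1, 1) * out + beta.reshape(1, -1, 1, 1)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3, 4, 4))
ones, zeros = np.ones(3), np.zeros(3)
pure_batch = batch_instance_norm(x, rho=ones, gamma=ones, beta=zeros)      # rho = 1: batch norm
pure_instance = batch_instance_norm(x, rho=zeros, gamma=ones, beta=zeros)  # rho = 0: instance norm
```

In a real network ρ would be a trainable parameter updated by the optimizer; here it is passed in by hand to show the two extremes.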

The interesting aspect of batch-instance normalization is that the balancing parameter ρ is learned through gradient descent.{#0eb0}

**From batch-instance normalization, we can conclude that models could learn to adaptively use different normalization methods using gradient descent.** {#37fa}

Given the above, a question may arise.{#a707}

## Can we switch the normalization technique whenever needed? {#9c1b}

The answer would be **Yes.** The following technique does exactly that.{#9f4c}

This paper proposed switchable normalization, a method that uses a weighted average of the different mean and variance statistics from batch normalization, instance normalization, and layer normalization.{#72c4}
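
The weighted average can be sketched as below, assuming an (N, C, H, W) input and assuming the weights come from a softmax over learnable logits, with one set of weights for the means and another for the variances. This is a simplified illustration with hypothetical names (`switchable_norm`, `mean_logits`, `var_logits`), not the authors' implementation; it recomputes each statistic directly rather than reusing intermediate results.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def switchable_norm(x, mean_logits, var_logits, gamma, beta, eps=1e-5):
    """x: (N, C, H, W); mean_logits, var_logits: (3,) ordered (IN, LN, BN); gamma, beta: (C,)."""
    stat_axes = [(2, 3), (1, 2, 3), (0, 2, 3)]   # instance, layer, batch statistics
    means = [x.mean(axis=a, keepdims=True) for a in stat_axes]
    variances = [x.var(axis=a, keepdims=True) for a in stat_axes]
    w_mean = softmax(mean_logits)                # importance weights for the means, sum to 1
    w_var = softmax(var_logits)                  # separate importance weights for the variances
    mu = sum(w * m for w, m in zip(w_mean, means))       # broadcasts to (N, C, 1, 1)
    var = sum(w * v for w, v in zip(w_var, variances))
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3, 4, 4))
logits = np.zeros(3)   # equal weights: a plain average of the three sets of statistics
y = switchable_norm(x, logits, logits, gamma=np.ones(3), beta=np.zeros(3))
```

Pushing one logit far above the others makes the layer behave like the corresponding single normalizer, so gradient descent on the logits can move each layer toward whichever statistics suit it.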

The authors showed that switchable normalization could potentially outperform batch normalization on tasks such as image classification and object detection.{#ec51}

The paper showed that instance normalization was used more often in the earlier layers, batch normalization was preferred in the middle layers, and layer normalization was used more often in the last layers. Smaller batch sizes lead to a preference for layer normalization and instance normalization.{#de24}

- https://arxiv.org/pdf/1502.03167.pdf{#0e87}
- https://arxiv.org/pdf/1607.06450.pdf{#c93c}
- https://arxiv.org/pdf/1602.07868.pdf{#56d4}
- https://arxiv.org/pdf/1607.08022.pdf{#68b6}
- https://arxiv.org/pdf/1803.08494.pdf{#ca34}
- https://arxiv.org/pdf/1805.07925.pdf{#a044}
- https://arxiv.org/pdf/1811.07727v1.pdf{#b77b}

WeChat official account: 银河系1号

Contact email: public@space-explore.com

(Do not reproduce without permission)