Aakash Bindal

We are going to study Batch Norm, Weight Norm, Layer Norm, Instance Norm, Group Norm, Batch-Instance Norm, and Switchable Norm.{#591f}

Let's start with the question,{#d0f6}

Normalization has always been an active area of research in deep learning. Normalization techniques can decrease your model's training time by a huge factor. Let me state some of the benefits of using Normalization.{#5fbf}

  1. It normalizes each feature so that every feature keeps its contribution, since some features have higher numerical values than others. This way the network is not biased toward features with higher values.{#5ea7}
  2. It reduces Internal Covariate Shift, which is the change in the distribution of network activations due to the change in network parameters during training. To improve training, we seek to reduce the internal covariate shift.{#af62}
  3. In this paper, the authors claim that Batch Norm makes the loss surface smoother (i.e. it bounds the magnitude of the gradients much more tightly).{#ceb0}
  4. It makes optimization faster, because normalization doesn't allow weights to explode all over the place and restricts them to a certain range.{#b5c7}
  5. An unintended benefit of Normalization is that it helps regularize the network (only slightly, not significantly).{#02db}

From the above, we can conclude that getting Normalization right can be a crucial factor in getting your model to train effectively, but this isn't as easy as it sounds. Let me support this with a few questions.{#35dc}

  1. How do Normalization layers behave in distributed training?{#90a4}
  2. Which Normalization technique should you use for your task (CNN, RNN, style transfer, etc.)?{#c737}
  3. What happens when you change the batch size during training?{#81be}
  4. Which norm technique gives the best trade-off between computation and accuracy for your network?{#6c84}

To answer these questions, let's dive into the details of each normalization technique one by one.{#b506}

Batch normalization is a method that normalizes activations in a network across a mini-batch of fixed size. For each feature, batch normalization computes the mean and variance of that feature in the mini-batch. It then subtracts the mean and divides the feature by its mini-batch standard deviation.{#3107}
This image is taken from https://arxiv.org/pdf/1502.03167.pdf

But wait, what if increasing the magnitude of the weights made the network perform better?{#b01d}

To solve this issue, we can add γ and β as learnable scale and shift parameters, respectively. All of this can be summarized as:{#436e}
ϵ is the stability constant in the equation.
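The normalization, scale, and shift described above can be sketched in NumPy roughly as follows. This is a minimal training-time sketch, not the paper's implementation: the function name `batch_norm`, the `(batch, features)` layout, and the default `eps` value are assumptions made for illustration, and the running statistics used at inference time are left out.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift.

    x has shape (batch, features); gamma and beta have shape (features,).
    """
    mean = x.mean(axis=0)                    # per-feature mini-batch mean
    var = x.var(axis=0)                      # per-feature mini-batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)  # eps keeps the division stable
    return gamma * x_hat + beta              # learnable scale (gamma) and shift (beta)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(8, 4))   # 8 examples, 4 features
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
```

With γ = 1 and β = 0, every column of `y` has a mean of roughly 0 and a standard deviation of roughly 1. Other values of γ and β let the network undo the normalization if that turns out to work better.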

Problems associated with Batch Normalization:{#e257}

  1. Variable Batch Size → If the batch size is 1, then the variance is 0, which doesn't allow batch norm to work. Furthermore, if the mini-batch size is small, the statistics become too noisy and training may suffer. There is also a problem in distributed training: if you are computing on different machines, you have to use the same batch size on each, because otherwise the mini-batch statistics, and therefore the learned γ and β, will differ across systems.{#2096}
  2. Recurrent Neural Network → In an RNN, the recurrent activations of each time-step will have a different story to tell (i.e. different statistics). This means that we have to fit a separate batch norm layer for each time-step. This makes the model more complicated and space-consuming, because it forces us to store the statistics for each time-step during training.{#4782}
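The batch-size problem from the first point can be seen in a small NumPy experiment. This is only an illustrative sketch: the helper names `normalize_over_batch` and `spread_of_batch_means` and the synthetic data are made up for this example.

```python
import numpy as np

def normalize_over_batch(x, eps=1e-5):
    """Return the batch-normalized input and the per-feature batch variance."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps), var

# A "mini-batch" with a single example: the variance of every feature is 0,
# and the normalized output is all zeros, so the input information is lost.
single = np.array([[0.5, -2.0, 7.0]])
x_hat_single, var_single = normalize_over_batch(single)

# Small batches give noisy statistics: the batch mean fluctuates far more
# from batch to batch when the batch size is 2 than when it is 256.
rng = np.random.default_rng(0)
data = rng.normal(size=(4096, 1))

def spread_of_batch_means(batch_size):
    batch_means = data.reshape(-1, batch_size, 1).mean(axis=1)
    return batch_means.std()

small = spread_of_batch_means(2)
large = spread_of_batch_means(256)
```

Here `small` comes out much larger than `large`, which is the noise the first point refers to.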

Batch norm alternatives (or better norms) are discussed below in detail, but if you are only interested in a very short description (or want to revise just by looking at an image), look at this:{#1d8e}
This image is taken from https://arxiv.org/pdf/1803.08494.pdf

Wait, why don't we normalize the weights of a layer instead of normalizing the activations directly? Well, Weight Normalization does exactly that.{#168f}

Weight normalization reparameterizes the weights (ω) as:{#92cf}

It separates the magnitude of the weight vector from its direction. This has a similar effect to dividing by the standard deviation in batch normalization, except that the rescaling is applied to the weights rather than to the activations.{#b12c}
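As a rough sketch of this reparameterization, the weight normalization paper writes the weight vector as w = (g / ‖v‖) · v, where the scalar g carries the magnitude and the vector v carries the direction. The symbols `w`, `v`, and `g` follow the paper rather than the text above, and the function name `weight_norm` is made up for this example.

```python
import numpy as np

def weight_norm(v, g):
    """Return w = g * v / ||v||: the direction comes from v, the length from g."""
    return g * v / np.linalg.norm(v)

v = np.array([3.0, 4.0])        # unnormalized direction, ||v|| = 5
w = weight_norm(v, g=2.0)       # same direction as v, length 2

# Rescaling v does not change w, because only its direction is used.
w_rescaled = weight_norm(10.0 * v, g=2.0)
```

During training, gradient descent would update `g` and `v` instead of `w` directly, so the length and the direction of the weights are learned separately.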

As for the mean, the authors of this paper cleverly combine mean-only batch normalization and weight normalization to get the desired output even with small mini-batches. That is, they subtract the mean of the mini-batch but do not divide by the standard deviation. Weight normalization takes the place of that division.{#3b01}
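A minimal sketch of this combination for one fully connected layer might look like the following. The function name `wn_mean_only_bn_layer`, the shapes, and the per-output-unit scale `g` and bias `b` are assumptions made for illustration, and the nonlinearity that would normally follow is omitted.

```python
import numpy as np

def wn_mean_only_bn_layer(x, v, g, b):
    """Linear layer with weight normalization and mean-only batch norm.

    x: (batch, in_features); v: (in_features, out_features);
    g and b: (out_features,).
    """
    w = g * v / np.linalg.norm(v, axis=0)  # weight norm: one scale per output unit
    t = x @ w                              # pre-activations
    return t - t.mean(axis=0) + b          # subtract the batch mean, no division by variance

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
v = rng.normal(size=(3, 2))
out = wn_mean_only_bn_layer(x, v, g=np.ones(2), b=np.zeros(2))
```

Only the mini-batch mean enters the computation, so the noisier variance estimate is never needed.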

Note: The mean is less noisy than the variance (which is what makes the mean a good choice over the variance above) due to the law of large numbers.{#ea19}

The paper shows that weight normalization combined with mean-only batch normalization achieves the best results on CIFAR-10.{#b097}

Layer normalization normalizes the input across the features, instead of normalizing the input features across the batch dimension as in batch normalization.{#f031}

A mini-batch consists of multiple examples with the same number of features. Mini-batches are matrices (or tensors) where one axis corresponds to the batch and the other axis (or axes) correspond to the feature dimensions.{#5d92}
i represents the batch and j represents the features. xᵢ,ⱼ is the i,j-th element of the input data.
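In terms of the indices i (batch) and j (features), layer normalization can be sketched as below. This is a simplified illustration: the function name `layer_norm` and the default `eps` are assumptions, and the learnable gain and bias that usually follow are left out.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each example over its features (axis j), independently of the batch."""
    mean = x.mean(axis=1, keepdims=True)   # one mean per example i
    var = x.var(axis=1, keepdims=True)     # one variance per example i
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))   # rows are examples (i), columns are features (j)
y = layer_norm(x)

# The result for an example does not depend on the rest of the batch,
# so a batch of size 1 gives the same output for that example.
y_single = layer_norm(x[:1])
```

Because the statistics are computed per example, the same computation works for any batch size.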

The authors of the paper claim that layer normalization performs better than batch norm in the case of RNNs.{#acbb}

Layer normalization and instance normalization are very similar to each other. The difference between them is that instance normalization normalizes across each channel in each training example, instead of normalizing across all input features in a training example. Unlike batch normalization, the instance normalization layer is applied in the same way at test time (since it does not depend on the mini-batch).{#b494}
This image is taken from https://arxiv.org/pdf/1607.08022.pdf

Here, let x ∈ ℝ^(T×C×W×H) be an input tensor containing a batch of T images. Let xₜᵢⱼₖ denote its tijk-th element, where k and j span the spatial dimensions (Height and Width of the image), i is the feature channel (the color channel if the input is an RGB image), and t is the index of the image in the batch.{#ec58}
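Using the (T, C, W, H) layout defined above, instance normalization can be sketched as follows. The function name `instance_norm` and the default `eps` are assumptions for illustration, and no learnable scale or shift is included.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """x: (T, C, W, H). Normalize each channel of each image over its spatial axes."""
    mean = x.mean(axis=(2, 3), keepdims=True)   # one mean per (image t, channel i)
    var = x.var(axis=(2, 3), keepdims=True)     # one variance per (image t, channel i)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3, 8, 8))
y = instance_norm(x)

# Rescaling and shifting the pixel intensities (a crude contrast/brightness
# change) leaves the normalized output almost unchanged.
y_contrast = instance_norm(3.0 * x + 1.0)
```

The statistics are taken per image and per channel, so nothing in the computation depends on the other images in the batch.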

This technique was originally devised for style transfer. The problem instance normalization tries to address is that the network should be agnostic to the contrast of the original image.{#66d3}

As the name suggests, Group Normalization normalizes over groups of channels for each training example. We can say that Group Norm is in between Instance Norm and Layer Norm.{#2835}

This is because when we put all the channels into a single group, group normalization becomes layer normalization, and when we put each channel into its own group, it becomes instance normalization.{#f54e}
Sᵢ is defined below.

Here, x is the feature computed by a layer, and i is an index. In the case of 2D images, i = (i_N, i_C, i_H, i_W) is a 4D vector indexing the features in (N, C, H, W) order, where N is the batch axis, C is the channel axis, and H and W are the spatial height and width axes. G is the number of groups, which is a pre-defined hyper-parameter. C/G is the number of channels per group. ⌊.⌋ is the floor operation, and "⌊k_C/(C/G)⌋ = ⌊i_C/(C/G)⌋" means that the indexes i and k are in the same group of channels, assuming each group of channels is stored in sequential order along the C axis. GN computes µ and σ along the (H, W) axes and along a group of C/G channels.{#a539}
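The grouping described above can be sketched with a reshape, assuming (as the text does) that the channels of a group are stored consecutively along the C axis. The function name `group_norm` and the default `eps` are assumptions for illustration, and the learnable per-channel scale and shift are omitted.

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    """x: (N, C, H, W). Normalize over (H, W) and each group of C // num_groups channels."""
    n, c, h, w = x.shape
    assert c % num_groups == 0, "C must be divisible by G"
    # Channel k_C lands in group floor(k_C / (C / G)) after this reshape.
    grouped = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = grouped.mean(axis=(2, 3, 4), keepdims=True)
    var = grouped.var(axis=(2, 3, 4), keepdims=True)
    return ((grouped - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 6, 4, 4))
y = group_norm(x, num_groups=3)              # C/G = 2 channels per group

layer_like = group_norm(x, num_groups=1)     # one group: same statistics as layer norm
instance_like = group_norm(x, num_groups=6)  # one channel per group: same as instance norm
```

Setting G = 1 or G = C reproduces the two extremes, which is why Group Norm sits between Layer Norm and Instance Norm.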

The problem with instance normalization is that it completely erases style information. Although this has its own merits (such as in style transfer), it can be problematic in conditions where contrast matters (for example, in weather classification the brightness of the sky matters). Batch-instance normalization attempts to deal with this by learning how much style information should be used for each channel (C).{#750c}

Batch-Instance Normalization is just an interpolation between batch norm and instance norm.{#0820}
The value of ρ is between 0 and 1.

The interesting aspect of batch-instance normalization is that the balancing parameter ρ is learned through gradient descent.{#0eb0}

From batch-instance normalization, we can conclude that models could learn to adaptively use different normalization methods using gradient descent. {#37fa}

Given the above, a question may arise.{#a707}

Can we switch the normalization technique whenever needed?{#9c1b}

The answer is yes. The following technique does exactly that.{#9f4c}

This paper proposed switchable normalization, a method that uses a weighted average of different mean and variance statistics from batch normalization, instance normalization, and layer normalization.{#72c4}
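The weighted average of statistics can be sketched as follows. This is a simplified illustration, not the paper's implementation: the function names, the (N, C, H, W) layout, and the use of a softmax over learnable logits to produce the weights are assumptions made here, and the learnable scale and shift are omitted.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def switchable_norm(x, mean_logits, var_logits, eps=1e-5):
    """x: (N, C, H, W). Logits have shape (3,) in the order (instance, layer, batch)."""
    axes = [(2, 3), (1, 2, 3), (0, 2, 3)]   # axes reduced by IN, LN, and BN
    w_mean = softmax(mean_logits)           # importance weights for the means
    w_var = softmax(var_logits)             # importance weights for the variances
    mean = sum(w * x.mean(axis=a, keepdims=True) for w, a in zip(w_mean, axes))
    var = sum(w * x.var(axis=a, keepdims=True) for w, a in zip(w_var, axes))
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3, 5, 5))

# Equal logits: instance, layer, and batch statistics are averaged equally.
y = switchable_norm(x, mean_logits=np.zeros(3), var_logits=np.zeros(3))

# Logits that strongly favor the last entry reduce the layer to batch norm.
pick_bn = np.array([-1e3, -1e3, 0.0])
y_bn = switchable_norm(x, mean_logits=pick_bn, var_logits=pick_bn)
```

Since the logits are ordinary parameters, gradient descent could in principle move each layer toward whichever normalizer suits it.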

The authors showed that switchable normalization could potentially outperform batch normalization on tasks such as image classification and object detection.{#ec51}

The paper showed that instance normalization was used more often in the earlier layers, batch normalization was preferred in the middle layers, and layer normalization was used more often in the last layers. Smaller batch sizes lead to a preference for layer normalization and instance normalization.{#de24}

  1. https://arxiv.org/pdf/1502.03167.pdf{#0e87}
  2. https://arxiv.org/pdf/1607.06450.pdf{#c93c}
  3. https://arxiv.org/pdf/1602.07868.pdf{#56d4}
  4. https://arxiv.org/pdf/1607.08022.pdf{#68b6}
  5. https://arxiv.org/pdf/1803.08494.pdf{#ca34}
  6. https://arxiv.org/pdf/1805.07925.pdf{#a044}
  7. https://arxiv.org/pdf/1811.07727v1.pdf{#b77b}