
Aakash Bindal


We are going to study Batch Norm, Weight Norm, Layer Norm, Instance Norm, Group Norm, Batch-Instance Norm, and Switchable Norm.


Let's start with the question: why do we need normalization at all?


Normalization has always been an active area of research in deep learning. Normalization techniques can reduce a model's training time by a huge factor. Let me state some of the benefits of using normalization:

  1. It normalizes each feature so that every feature maintains its contribution, as some features have higher numerical values than others. This way our network is not biased towards higher-valued features.
  2. It reduces Internal Covariate Shift, which is the change in the distribution of network activations due to the change in network parameters during training. To improve training, we seek to reduce this internal covariate shift.
  3. In this paper, the authors claim that Batch Norm makes the loss surface smoother (i.e. it bounds the magnitude of the gradients much more tightly).
  4. It makes optimization faster, because normalization doesn't allow the weights to explode all over the place and restricts them to a certain range.
  5. An unintended benefit of normalization is that it helps the network with regularization (only slightly, not significantly).

From the above, we can conclude that getting normalization right can be a crucial factor in getting your model to train effectively, but this isn't as easy as it sounds. Let me back this up with a few questions:

  1. How do normalization layers behave in distributed training?
  2. Which normalization technique should you use for your task (CNN, RNN, style transfer, etc.)?
  3. What happens when you change the batch size of the dataset during training?
  4. Which normalization technique gives the best trade-off between computation and accuracy for your network?

To answer these questions, let's dive into the details of each normalization technique one by one.

Batch normalization is a method that normalizes the activations in a network across a mini-batch of definite size. For each feature, batch normalization computes the mean and variance of that feature over the mini-batch. It then subtracts the mean and divides the feature by the mini-batch standard deviation.
Image source: https://arxiv.org/pdf/1502.03167.pdf


But wait, what if increasing the magnitude of the weights makes the network perform better?


To deal with this, we can add γ and β as learnable scale and shift parameters respectively. All of this can be summarized as:

y = γ·x̂ + β, where x̂ = (x − μ)/√(σ² + ε)

Here μ and σ² are the mini-batch mean and variance, and ε is the stability constant in the equation.
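
To make this concrete, here is a minimal NumPy sketch of the computation described above (the function name and the default value of ε are my own choices, not from the paper):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch norm over a mini-batch x of shape (N, D): normalize each
    feature using the batch statistics, then apply the learnable
    scale (gamma) and shift (beta)."""
    mean = x.mean(axis=0)                  # per-feature mean over the batch
    var = x.var(axis=0)                    # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```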

Problems associated with Batch Normalization:

  1. Variable batch size → If the batch size is 1, then the variance is 0, and batch norm cannot work. Furthermore, if the mini-batch is small, the statistics become too noisy and training may suffer. There is also a problem in distributed training: if you are computing on different machines, you have to use the same batch size, because otherwise γ and β will be different for different systems.
  2. Recurrent Neural Networks → In an RNN, the recurrent activations of each time-step have a different story to tell (i.e. different statistics). This means that we have to fit a separate batch norm layer for each time-step. This makes the model more complicated and space-consuming, because it forces us to store the statistics of every time-step during training.

Alternatives to batch norm (or rather, better norms) are discussed in detail below, but if you are only interested in a very short description (or in a quick revision just by looking at the image), check out:
Image source: https://arxiv.org/pdf/1803.08494.pdf


Wait, why don't we normalize the weights of a layer instead of normalizing the activations directly? Well, Weight Normalization does exactly that.


Weight normalization reparameterizes the weights (ω) of a layer as ω = (g/‖v‖)·v, where v is a parameter vector carrying the direction and g is a scalar carrying the magnitude.


It separates the magnitude of the weight vector from its direction; this has a similar effect to dividing by the variance in batch normalization, the only difference being that the variation comes from the weights rather than from the activations.
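
A minimal NumPy sketch of this reparameterization for a fully connected layer, assuming one magnitude g per output unit (the shapes and names are my own choices):

```python
import numpy as np

def weight_norm(v, g):
    """Weight normalization: w = g * v / ||v||.
    v has shape (out_features, in_features) and carries the direction;
    g has shape (out_features, 1) and carries the magnitude."""
    return g * v / np.linalg.norm(v, axis=1, keepdims=True)
```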


As for the mean, the authors of this paper cleverly combine mean-only batch normalization with weight normalization to get the desired output even with small mini-batches. This means that they subtract out the mean of the mini-batch but do not divide by the variance; instead of dividing by the variance, they use weight normalization.


Note: the mean is less noisy than the variance (which makes the mean a good choice over the variance in the method above) due to the law of large numbers.


The paper shows that weight normalization combined with mean-only batch normalization achieves the best results on CIFAR-10.

Layer normalization normalizes the input across the features, instead of normalizing the input features across the batch dimension as in batch normalization.


A mini-batch consists of multiple examples with the same number of features. A mini-batch is a matrix (or tensor) in which one axis corresponds to the batch and the other axis (or axes) correspond to the feature dimensions.
i indexes the batch and j indexes the features; xᵢ,ⱼ is the (i, j)-th element of the input data.
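
For comparison with the batch norm sketch above, here is a minimal NumPy layer norm sketch; the statistics are now computed per example over its features, so the result does not depend on the other examples in the batch:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer norm over x of shape (N, D): normalize each example
    across its D features, then scale and shift."""
    mean = x.mean(axis=1, keepdims=True)   # per-example mean over features
    var = x.var(axis=1, keepdims=True)     # per-example variance over features
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```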


The authors of the paper claim that layer normalization works better than batch norm for RNNs.


Layer normalization and instance normalization are very similar to each other, but the difference between them is that instance normalization normalizes over each channel of each training example, instead of normalizing over the input features of a training example. Unlike batch normalization, the instance normalization layer is also applied at test time (because it does not depend on the mini-batch).
Image source: https://arxiv.org/pdf/1607.08022.pdf


Here, x ∈ ℝ^(T×C×W×H) is an input tensor containing a batch of T images. Let xₜᵢⱼₖ denote its tijk-th element, where k and j span the spatial dimensions (the height H and width W of the image), i is the feature channel (the color channel if the input is an RGB image), and t is the index of the image in the batch.
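
A minimal NumPy sketch of instance normalization under this (T, C, W, H) layout; the learnable scale and shift are omitted for brevity:

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Instance norm over x of shape (T, C, W, H): each (image, channel)
    pair is normalized over its own spatial dimensions W and H."""
    mean = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)
```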


This technique was originally devised for style transfer; the problem instance normalization tries to solve is that the network should be agnostic to the contrast of the original image.


As its name suggests, Group Normalization normalizes over groups of channels for each training example. We can say that Group Norm sits in between Instance Norm and Layer Norm.

∵ When we put all the channels into a single group, group normalization becomes layer normalization; and when we put each channel into a different group, it becomes instance normalization.
Sᵢ is defined as Sᵢ = {k | k_N = i_N, ⌊k_C/(C/G)⌋ = ⌊i_C/(C/G)⌋}.




Here, x is the feature computed by a layer, and i is an index. In the case of 2D images, i = (i_N, i_C, i_H, i_W) is a 4D vector indexing the features in (N, C, H, W) order, where N is the batch axis, C is the channel axis, and H and W are the spatial height and width axes. G is the number of groups, which is a predefined hyper-parameter, and C/G is the number of channels per group. ⌊·⌋ is the floor operation, and "⌊k_C/(C/G)⌋ = ⌊i_C/(C/G)⌋" means that the indices i and k are in the same group of channels, assuming each group of channels is stored sequentially along the C axis. GN computes μ and σ along the (H, W) axes and along a group of C/G channels.
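
A minimal NumPy sketch of this grouping, assuming C is divisible by G; setting G = 1 recovers layer normalization and G = C recovers instance normalization, as noted above:

```python
import numpy as np

def group_norm(x, G, eps=1e-5):
    """Group norm over x of shape (N, C, H, W): split the C channels
    into G groups and normalize each group over (C/G, H, W) per example."""
    N, C, H, W = x.shape
    x = x.reshape(N, G, C // G, H, W)
    mean = x.mean(axis=(2, 3, 4), keepdims=True)
    var = x.var(axis=(2, 3, 4), keepdims=True)
    x = (x - mean) / np.sqrt(var + eps)
    return x.reshape(N, C, H, W)
```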


The problem with instance normalization is that it completely erases style information. While this has its advantages (e.g. in style transfer), it can become a problem in settings where contrast matters (such as weather classification, where the brightness of the sky is important). Batch-instance normalization tries to deal with this by learning how much style information should be used for each channel (C).


Batch-instance normalization is just an interpolation between batch norm and instance norm, controlled by a balancing parameter ρ.
ρ takes a value between 0 and 1.


The interesting aspect of batch-instance normalization is that the balancing parameter ρ is learned through gradient descent.
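
A minimal NumPy sketch of this interpolation, assuming per-channel ρ, γ and β of shape (1, C, 1, 1) (the names are my own):

```python
import numpy as np

def batch_instance_norm(x, rho, gamma, beta, eps=1e-5):
    """Batch-instance norm over x of shape (N, C, H, W): interpolate
    between the batch-normalized and instance-normalized features with rho."""
    # Batch norm statistics: per channel, over the batch and spatial axes.
    x_bn = (x - x.mean(axis=(0, 2, 3), keepdims=True)) \
           / np.sqrt(x.var(axis=(0, 2, 3), keepdims=True) + eps)
    # Instance norm statistics: per example and channel, over the spatial axes.
    x_in = (x - x.mean(axis=(2, 3), keepdims=True)) \
           / np.sqrt(x.var(axis=(2, 3), keepdims=True) + eps)
    x_hat = rho * x_bn + (1 - rho) * x_in
    return gamma * x_hat + beta
```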

From batch-instance normalization, we can conclude that models could learn to adaptively use different normalization methods using gradient descent.


From the above, a question may arise:

Can we switch the normalization technique whenever needed?


The answer is yes, and the following technique does exactly that.


This paper proposes Switchable Normalization, which uses a weighted average of the different mean and variance statistics of batch normalization, instance normalization, and layer normalization.
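
A rough NumPy sketch of the idea, assuming softmax-normalized importance weights over the three sets of statistics (the variable names are my own, not the paper's):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def switchable_norm(x, w_mean, w_var, gamma, beta, eps=1e-5):
    """Switchable norm over x of shape (N, C, H, W): blend the IN, LN and BN
    means/variances with learned importance weights (3-vectors w_mean, w_var),
    then scale and shift."""
    mu_in = x.mean(axis=(2, 3), keepdims=True)      # instance norm statistics
    var_in = x.var(axis=(2, 3), keepdims=True)
    mu_ln = x.mean(axis=(1, 2, 3), keepdims=True)   # layer norm statistics
    var_ln = x.var(axis=(1, 2, 3), keepdims=True)
    mu_bn = x.mean(axis=(0, 2, 3), keepdims=True)   # batch norm statistics
    var_bn = x.var(axis=(0, 2, 3), keepdims=True)
    wm, wv = softmax(w_mean), softmax(w_var)
    mu = wm[0] * mu_in + wm[1] * mu_ln + wm[2] * mu_bn
    var = wv[0] * var_in + wv[1] * var_ln + wv[2] * var_bn
    return gamma * (x - mu) / np.sqrt(var + eps) + beta
```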


The authors show that switchable normalization can outperform batch normalization on tasks such as image classification and object detection.


The paper shows that instance normalization tends to be used more in the earlier layers, batch normalization is preferred in the middle, and layer normalization is used more towards the end. Smaller batch sizes lead to a preference for layer normalization and instance normalization.

  1. https://arxiv.org/pdf/1502.03167.pdf
  2. https://arxiv.org/pdf/1607.06450.pdf
  3. https://arxiv.org/pdf/1602.07868.pdf
  4. https://arxiv.org/pdf/1607.08022.pdf
  5. https://arxiv.org/pdf/1803.08494.pdf
  6. https://arxiv.org/pdf/1805.07925.pdf
  7. https://arxiv.org/pdf/1811.07727v1.pdf
