Logo

Chapter 7: Regularization for Deep Learning

As aforementioned, regularization is the study of how to train models that generalize well to unseen inputs. For deep learning, most regularization strategies are based on regularizing estimators, typically by sacrificing bias (increasing bias) to reduce variance. In other words, preventing overfitting.

7.1 Parameter Norm Penalties

The simplest form of regularization is a parameter norm penalty Ω(θ)\Omega(\boldsymbol{\theta}), added to the objective function JJ to produce the regularized objective function J~\tilde{J}.

J~(θ;X,y)=J(θ;X,y)+αΩ(θ)\tilde{J}(\boldsymbol{\theta};\boldsymbol{X},\boldsymbol{y})=J(\boldsymbol{\theta};\boldsymbol{X},\boldsymbol{y})+\alpha\Omega(\boldsymbol{\theta})

Note that α[0,)\alpha \in[0,\infty) is a hyperparameter that weights the contribution of the penalty term Ω\Omega. Larger α\alpha induces grater regularization. This section, we'll discuss the choice of the parameter norm Ω\Omega and its effects.

info

Typically, the parameter norm penalty is only chosen to penalize the weights at each layer, and not the biases. This is because regularizing the biases can cause significant underfitting, and leaving them unregularized does not induce much variance since they affect only a single variable (while weights specify interaction between variables). Thus, Ω(θ)\Omega(\boldsymbol{\theta}) commonly equates to Ω(w)\Omega(\boldsymbol{w}).

tip

Sometimes, it may be desirable to use different α\alpha for each layer. However, this may make hyperparameter selection more expensive/difficult, so it's reasonable to use the same α\alpha for all layers.

L2L^{2} Regularization

We begin with the simplest parameter norm penalty: L2L^{2} or weight decay. It's also known as ridge regression or Tikhonov regularization, and involves adding a regularization term Ω(θ)=12w22\Omega(\boldsymbol{\theta})=\frac{1}{2}\lVert \boldsymbol{w} \rVert^{2}_{2}.

J~(w;X,y)=α2ww+J(w;X,y)\tilde{J}(\boldsymbol{w};\boldsymbol{X},\boldsymbol{y})=\frac{\alpha}{2}\boldsymbol{w}^{\top}\boldsymbol{w}+J(\boldsymbol{w};\boldsymbol{X},\boldsymbol{y})

The new cost function gradient is then

wJ~(w;X,y)=αw+wJ(w;X,y)\nabla_{w} \tilde{J}(\boldsymbol{w};\boldsymbol{X},\boldsymbol{y})=\alpha\boldsymbol{w}+\nabla_{\boldsymbol{w}}J(\boldsymbol{w};\boldsymbol{X},\boldsymbol{y})

And the gradient step is

wwϵ(αw+wJ(w;X,y))(1ϵα)wϵwJ(w;X,y)\begin{align*} \boldsymbol{w} &\leftarrow \boldsymbol{w}-\epsilon(\alpha \boldsymbol{w}+\nabla_{\boldsymbol{w}}J(\boldsymbol{w};\boldsymbol{X},\boldsymbol{y})) \\ &\leftarrow (1-\epsilon\alpha)\boldsymbol{w}-\epsilon \nabla_{\boldsymbol{w}}J(\boldsymbol{w};\boldsymbol{X},\boldsymbol{y}) \end{align*}

In other words, the weight decay term induces the gradient descent to shrink the previous weight vector by a constant factor on each step. So, what happens across the entirety of training?

We first simply analysis by making a quadratic approximation (Taylor series) of the (original) objective function in the neighborhood of the cost-minimizing weights:

J^(θ)=J(w)+12(ww)H(ww)\hat{J}(\boldsymbol{\theta})=J(\boldsymbol{w}^{*})+\frac{1}{2}(\boldsymbol{w}-\boldsymbol{w}^{*})^{\top}\boldsymbol{H}(\boldsymbol{w}-\boldsymbol{w}^{*})

where H\boldsymbol{H} is the Hessian matrix of JJ with respect to w\boldsymbol{w} evaluated at w=argminwJ(w)\boldsymbol{w}^{*}=\arg\min_{\boldsymbol{w}}J(\boldsymbol{w}), the cost-minimizing weights.

Where's the first-order term in the Taylor series?

w\boldsymbol{w}^{*} is the minimizing value, so the gradient is 00. Additionally, this implies H\boldsymbol{H} is positive semidefinite.

Naturally, the minimum of J^\hat{J} occurs when its gradient

wJ^(w)=H(ww)\nabla_{\boldsymbol{w}}\hat{J}(\boldsymbol{w})=\boldsymbol{H}(\boldsymbol{w}-\boldsymbol{w}^{*})

is 0\mathbf{0}.

Now, consider adding the weight decay back. Now, the gradient becomes

αw~+H(ww)\alpha \tilde{\boldsymbol{w}}+\boldsymbol{H}(\boldsymbol{w}-\boldsymbol{w}^{*})

where w~\tilde{\boldsymbol{w}} represents the minimum of the regularized version of the approximation J^\hat{J}. Equating to 00 and solving for w~\tilde{\boldsymbol{w}}, we derive

w~=(H+αI)1Hw\tilde{\boldsymbol{w}} = (\boldsymbol{H}+\alpha \boldsymbol{I})^{-1}\boldsymbol{H}\boldsymbol{w}^{*}

Note that limα0w~=w\lim_{ \alpha \to 0 }\tilde{\boldsymbol{w}}=\boldsymbol{w}^{*}. But what about when α\alpha grows?

H\boldsymbol{H} is real and symmetric, therefore we can eigendecompose it. This allows deriving

w~=Q(Λ+αI)1ΛQw\tilde{\boldsymbol{w}}=\boldsymbol{Q}(\boldsymbol{\Lambda}+\alpha \boldsymbol{I})^{-1}\boldsymbol{\Lambda}\boldsymbol{Q}^{\top}\boldsymbol{w}^{*}

In words, this means that weight decay effectively rescales w\boldsymbol{w}^{*} along the axes defined by the eigenvectors of H\boldsymbol{H}. Specifically, the component of w\boldsymbol{w}^{*} aligned with eigenvector viv_{i} of H\boldsymbol{H} is rescaled by a factor of λiλi+α\frac{\lambda_{i}}{\lambda_{i}+\alpha}. See section 2.7 if you're confused.

Thus, along directions where λiα\lambda_{i}\gg\alpha, the regularization has minimal effect. In contrast, if λiα\lambda_{i}\ll\alpha, these components will be shrunk significantly. A visualization is displayed below.

l2-regularization.png

In other words, only directions that contribute substantially to reducing the cost function are preserved.

One can interpret its effects, for instance, on linear regression. Applying L2L^{2} regularization changes the solution to

w=(XX+αI)1Xy\boldsymbol{w}=(\boldsymbol{X}^{\top}\boldsymbol{X}+\alpha \boldsymbol{I})^{-1}\boldsymbol{X}^{\top}\boldsymbol{y}

Note that the covariance matrix is 1mXX\frac{1}{m}\boldsymbol{X}^{\top}\boldsymbol{X}. Thus, L2L^{2} regularization, adding α\alpha to each element of the diagonal, appears to induce additional variance in the input (recall that Cov(X,X)=Var(X)\mathrm{Cov}(X, X)=\mathrm{Var}(X)) and thus shrink the weights for features whose covariance with the output target is comparatively small.

L1L^{1} Regularization

Formally,

Ω(θ)=w1=iwi\Omega(\boldsymbol{\theta})=\lVert \boldsymbol{w} \rVert _{1}=\sum_{i}\lvert w_{i} \rvert

and

J~(w;X,y)=αw1+J(w;X,y)\tilde{J}(\boldsymbol{w};\boldsymbol{X},\boldsymbol{y})=\alpha \lVert \boldsymbol{w} \rVert _{1}+J(\boldsymbol{w};\boldsymbol{X},\boldsymbol{y})

with gradient

wJ~(w;X,y)=αsign(w)+wJ(w;X,y)\nabla_{\boldsymbol{w}}\tilde{J}(\boldsymbol{w};\boldsymbol{X},\boldsymbol{y})=\alpha \text{sign}(\boldsymbol{w})+\nabla_{\boldsymbol{w}}J(\boldsymbol{w};\boldsymbol{X},\boldsymbol{y})

Notably, the contribution of the regularization term no longer scales according to wiw_{i}; instead, it is either α-\alpha or +α+\alpha. This actually means that there are no clean algebraic solutions to quadratic approximations of J(X,y;w)J(\boldsymbol{X},\boldsymbol{y};\boldsymbol{w}), like there were for L2L^{2} regularization.

We can still interpret the quadratic approximation, however. Again,

wJ^(w)=H(ww)\nabla_{\boldsymbol{w}}\hat{J}(\boldsymbol{w})=\boldsymbol{H}(\boldsymbol{w}-\boldsymbol{w}^{*})

Given our limitations, we will assume the Hessian is diagonal with all positive elements. This holds if the data has been preprocessed to remove correlation between input features (e.g. preprocess with PCA). Thus, the approximation of our regularized objective function becomes

J~(w;X,y)=J(w;X,y)+i[12Hi,i(wiwi)2+αwi]\tilde{J}(\boldsymbol{w};\boldsymbol{X},\boldsymbol{y})=J(\boldsymbol{w}^{*};\boldsymbol{X},\boldsymbol{y})+\sum_{i}\left[ \frac{1}{2}H_{i,i}(\boldsymbol{w}_{i}-\boldsymbol{w}_{i}^{*})^{2}+\alpha \lvert w_{i} \rvert \right]

The analytical solution is

wi=sign(wi)max{wiαHi,i,0}w_{i}=\text{sign}(w_{i}^{*})\max \left\{ \lvert w_{i}^{*} \rvert -\frac{\alpha}{H_{i,i}},0 \right\}

WLOG, let wi>0w_{i}^{*}>0 for all ii. Consider if

The behavior is similar with for wi<0w_{i}^{*}<0.

The notable aspect of L1L^{1} regularization, as hinted above, is that it produces a sparse solution, as it zeroes some parameters. (L2L^{2} regularization does not introduce sparsity; rescaling never zeroes a component). This is critical for tasks like feature selection, i.e. selecting a subset of the available features to be used as input. This is well known as LASSO (Least Absolute Shrinkage and Selection Operator) in the context of linear models.

tip

Remember MAP Bayesian inference from section 5.6? L1L^{1} regularization may be represented by MAP with a prior of an isotropic Laplace distribution over wRn\boldsymbol{w}\in \mathbb{R}^{n}.

7.2 Norm Penalties as Constrained Optimization

We previously alluded to the relation between norm penalties and the generalized Lagrange function used for constrained optimization. We now discuss this in more detail.

For instance, if we wanted to constraint Ω(θ)<kR\Omega(\boldsymbol{\theta})<k\in \mathbb{R}, we construct a generalized Lagrangian

L(θ,α;X,y)=J(θ;X,y)+α(Ω(θ)k)\mathcal{L}(\boldsymbol{\theta},\alpha;\boldsymbol{X},\boldsymbol{y})=J(\boldsymbol{\theta};\boldsymbol{X},\boldsymbol{y})+\alpha(\Omega(\boldsymbol{\theta})-k)

With solution

θ=argminθmaxα,α0 L(θ,α)\boldsymbol{\theta}^{*}=\underset{ \boldsymbol{\theta} }{ \arg\min } \underset{\alpha,\alpha\geq 0}{\max}\ \mathcal{L}(\boldsymbol{\theta},\alpha)

Though, as mentioned in section 4.4, solving the problem requires learning/solving for both θ\boldsymbol{\theta} and α\alpha. Theoretically, it's possible to solve for the value of kk that corresponds to an α\alpha; however, it varies depending on the objective function JJ itself. Instead, we use α\alpha as a hyperparameter, and manually control it to roughly adjust the constraint region. Again, larger α\alpha strengthens the constraints, while smaller α\alpha weakens them.

Occasionally, however, explicit constraints are desired, i.e. we want to limit Ω(θ)\Omega(\boldsymbol{\theta}) by a precisely chosen value of kk. Rather than solving for the corresponding α\alpha, this can be done by using SGD and then projecting θ\boldsymbol{\theta} back to the nearest constraint-satisfying point. This explicit constraints and reprojection method is, at times, useful:

7.3 Regularization and Under-Constrained Problems

Regularization may even be necessary for a machine learning problem to be properly defined.

For instance, many linear models depend on inverting XX\boldsymbol{X}^{\top}\boldsymbol{X}, which may not be invertible. Thus, many use XX+αI\boldsymbol{X}^{\top}\boldsymbol{X}+\alpha \boldsymbol{I} instead, which is always invertible.

For underdetermined problems, learning algorithms may never converge; regularization can ensure convergence eventually occurs.

Moore-Penrose Pseudoinverse

Recall that it may be defined as

X+=limα0+(XX+αI)1X\boldsymbol{X}^{+}=\lim_{ \alpha \to 0^{+} } (\boldsymbol{X}^{\top}\boldsymbol{X}+\alpha \boldsymbol{I})^{-1}\boldsymbol{X}^{\top}

In fact, you might not recognize this as performing linear regression with weight decay! Thus, the pseudoinverse stabilizes underdetermined problems with regularization.

7.4 Dataset Augmentation

In practice, data is limited. One solution for this is the creation of fake data. Depending on the task, this may be straightforward!

Image Recognition

Small changes like translation, saturation, rotation, etc. of existing examples can produce new examples.

Injecting noise into the input of a neural network is also data augmentation. Many tasks should be possible to solve even with small perturbations to the input. Noise injection also works for hidden units; dropout, which we discuss soon, can be interpreted as constructing new inputs by multiplying by noise.

7.5 Noise Robustness

As aforementioned, it's desirable for a model to be resistant to noise in the inputs. Beyond just injecting noise into the inputs, it's possible to achieve this by injecting noise into the weights. Adding small perturbations during training actually encourages parameters to tend to regions of parameter space that are flat, i.e. insensitive to small variations of weights. Refer to the book for the math behind this.

It's also possible to inject noise into the output targets. This is desirable because datasets frequently have some mistakes in the outputs. Label smoothing is one technique that regularizes a model with a softmax of kk different values by replacing hard 0,10,1 classification targets with targets of ϵk1\frac{\epsilon}{k-1} and 1ϵ1-\epsilon, where we are assuming each category label yy is correct with probability 1ϵ1-\epsilon. This also helps encourage convergence for maximum likelihood learning.

7.6 Semi-Supervised Learning

In semi-supervised learning, we use both unlabeled examples from P(x)P(\mathbf{x}) and labeled examples from P(x,y)P(\mathbf{x},\mathbf{y}) to estimate P(yx)P(\mathbf{y}\mid \mathbf{x}). The motivation is that the unsupervised part of the learning can hint towards how to group examples in representation space. In essence, you can think of the unsupervised part as performing the clustering, and the supervised part actually labeling the clusters.

It's not necessary to separate the unsupervised and supervised components, however. It's possible to construct models in which a generative/unsupervised model, either P(x)P(\mathbf{x}) or P(x,y)P(\mathbf{x},\mathbf{y}) shares parameters with a discriminative/supervised model of P(yx)P(\mathbf{y}\mid \mathbf{x}). One may then minimize some function of the supervised criterion logP(yx)-\log P(\mathbf{y}\mid \mathbf{x}) and the unsupervised/generative criterion. In essence, the generative criterion expresses a prior about the supervised problem's solution; that the structures are connected in a way that may be captured via shared parametrization.

7.7 Multi-Task Learning

Multi-task learning involves using examples from several different tasks to improve generalization. A very common form of multi-task learning involves the different tasks sharing the same input x\mathbf{x} and some portion of the hidden layers h(shared)\boldsymbol{h}^{(\text{shared})} intended to capture some common factors—generic parameters. The other hidden layers (and output layer) are the task-specific parameters. Below shows an example architecture.

multitask-learning.png

7.8 Early Stopping

When training large models with sufficient representational capacity to overfit, training error will typically continuously decrease, but validation error will begin increasing after enough iterations. Therefore, it's frequently (really, almost always) desirable to stop the gradient descent algorithm at some point, when the training error has become sufficiently small, to prevent overfitting.

The most common method for this is recording the validation error every time we see an improvement, and returning to the parameter values at the point in which the validation error is minimized. This is known as early stopping, and can be interpreted as an efficient hyperparameter selection algorithm where we treat the number of training steps as a hyperparameter.

Early stopping is popular for a couple reasons.

There are a couple considerations with early stopping, though. Early stopping requires a validation set; a common strategy is to perform two sets of training—one without the validation set that uses early stopping, and then one with all training data (i.e. validation set is now training data). There are two options for the second training procedure.

Previously, we discussed how early stopping acts as a regularizer. Now, we formalize this notion, and show that early stopping essentially equates to L2L^{2} regularization (for a simple linear model with quadratic error function trained by gradient descent).

As before, we make a quadratic approximation of the cost function JJ in the neighborhood of the optimal weights w\boldsymbol{w}^{*}.

J^(θ)=J(w)+12(ww)H(ww)\hat{J}(\boldsymbol{\theta})=J(\boldsymbol{w}^{*})+\frac{1}{2}(\boldsymbol{w}-\boldsymbol{w}^{*})^{\top}\boldsymbol{H}(\boldsymbol{w}-\boldsymbol{w}^{*})

And compute the gradient

wJ^(w)=H(ww)\nabla_{\boldsymbol{w}}\hat{J}(\boldsymbol{w})=\boldsymbol{H}(\boldsymbol{w}-\boldsymbol{w}^{*})

Now, we consider how the parameters change during gradient descent. WLOG, assume w(0)=0\boldsymbol{w}^{(0)}=\mathbf{0}.

w(τ)=w(τ1)ϵwJ^(wτ1)=w(τ1)ϵH(w(τ1)w)w(τ)w=(IϵH)(w(τ1)w)\begin{align*} \boldsymbol{w}^{(\tau)} &= \boldsymbol{w}^{(\tau-1)}-\epsilon \nabla_{\boldsymbol{w}}\hat{J}(\boldsymbol{w}^{\tau-1}) \\ &= \boldsymbol{w}^{(\tau-1)}-\epsilon \boldsymbol{H}(\boldsymbol{w}^{(\tau-1)}-\boldsymbol{w}^{*}) \\ \boldsymbol{w}^{(\tau)}-\boldsymbol{w}^{*} &= (\boldsymbol{I}-\epsilon \boldsymbol{H})(\boldsymbol{w}^{(\tau-1)}-\boldsymbol{w}^{*}) \end{align*}

Rewriting with H\boldsymbol{H}'s eigendecomposition and simplifying returns (plus some small assumptions)

Qw(τ)=[I(IϵΛ)τ]Qw\boldsymbol{Q}^{\top}\boldsymbol{w}^{(\tau)}=[\boldsymbol{I}-(\boldsymbol{I}-\epsilon \boldsymbol{\Lambda})^{\tau}]\boldsymbol{Q}^{\top}\boldsymbol{w}^{*}

While doing some rearrangement in the L2L^{2} equation from [[7. Regularization for Deep Learning#L2L {2} Regularization|section 7.1]] gives

Qw~=[I(Λ+αI)1α]Qw\boldsymbol{Q}^{\top}\tilde{\boldsymbol{w}}=[\boldsymbol{I}-(\boldsymbol{\Lambda}+\alpha \boldsymbol{I})^{-1}\alpha]\boldsymbol{Q}^{\top}\boldsymbol{w}^{*}

Therefore, if hyperparameters ϵ,α,τ\epsilon,\alpha,\tau are chosen such that

(IϵΛ)τ=(Λ+αI)1α(\boldsymbol{I}-\epsilon\boldsymbol{\Lambda})^{\tau}=(\boldsymbol{\Lambda}+\alpha \boldsymbol{I})^{-1}\alpha

L2L^{2} regularization and early stopping are equivalent! In fact, one can conclude that τ1ϵα\tau \approx\frac{1}{\epsilon\alpha}, provided the eigenvalues are sufficiently small.

Intuitively, this is because training learns the directions of high curvature first, and early stopping halts training before the model learns directions of low curvature. Of course, the equivalence only holds for a quadratic approximation of the objective function; however, the primary idea of this relationship generally holds.

7.9 Parameter Tying and Sharing

Frequently, we may want to express a prior that two models should be similar to each other. For instance, two models performing the same classification task with slightly different input distributions.

For this purpose, we can add a parameter norm penalty that ties the models' parameters to each other

Ω(w(A),w(B))=w(A)w(B)22\Omega(\boldsymbol{w}^{(A)},\boldsymbol{w}^{(B)})=\lVert \boldsymbol{w}^{(A)}-\boldsymbol{w}^{(B)} \rVert ^{2}_{2}

for models AA and BB. However, a more popular approach is actually to use constraints that force sets of parameters to be equal—this is parameter sharing. The primary advantage of this approach is that it reduces memory footprint—only a subset of the parameters, i.e. the unique shared set, need to be stored in memory.

Convolutional Neural Networks

Parameter sharing has been used to great effect in CNNs. With many image-related tasks requiring invariance to translation, it's desirable to share the same parameters within the model, between different locations in the image, essentially. We discuss this in detail in Chapter 9.

7.10 Sparse Representations

Recall the L1L^{1} regularization penalty, and how it encourages a sparse parameterization, i.e. encourages many weights to be 00. Sparseness can also be encouraged in the representation of the data, i.e. in the hidden layers (e.g. a hidden layer with only a few active units). This can be implemented with a norm penalty on the representation:

J~(θ;X,y)=J(θ;X,y)+αΩ(h)\tilde{J}(\boldsymbol{\theta};\boldsymbol{X},\boldsymbol{y})=J(\boldsymbol{\theta};\boldsymbol{X},\boldsymbol{y})+\alpha\Omega(\boldsymbol{h})

where h\boldsymbol{h} represents a hidden layer(s). For sparsity, an L1L^{1} penalty is typically useful; though other sparsity-encouraging penalties exist too.

Other approaches place a hard constraint on activation values instead. Orthogonal matching pursuit, for instance, seeks a representation h\boldsymbol{h} such that

argminh,h0<kxWh2\underset{ \boldsymbol{h},\lVert \boldsymbol{h} \rVert _{0}<k }{ \arg\min }\lVert \boldsymbol{x}-\boldsymbol{W}\boldsymbol{h} \rVert ^{2}

where h0\lVert \boldsymbol{h} \rVert_{0} denotes the number of nonzero entries of h\boldsymbol{h}, i.e. it limits the number of active units in the layer to <k<k. The problem is tractable when W\boldsymbol{W} is constrained to be orthogonal, hence the nomenclature, and is often termed "OMP-kk."

7.11 Bagging and Other Ensemble Methods

Bagging, or bootstrap aggregating, separately trains different models and then asks models to vote on test examples. It's an instance of model averaging, a subset of ensemble methods which leverage multiple different models for output.

The motivation for model averaging is that, if the models make errors independently, the errors will be reduced. In fact, for a set of kk regression models, expected squared error decreases linearly in kk, given fully uncorrelated errors between models.

Note that, typically, in order to construct kk separate models, the dataset is preprocessed to create kk different datasets by sampling with replacement from the original dataset. The kk datasets are all the same size, but with high probability are missing examples from the original dataset. Moreover, with the inherent stochasticity in model training (random initialization, random minibatches, etc.), we may expect sufficient distinctions between models to observe benefits from ensemble learning.

7.12 Dropout

Dropout is a powerful, popular method of regularization that provides, effectively, an efficient approximation of bagging.

Dropout essentially models an exponentially large set of models by randomly removing, or dropping, non-output units from an underlying base networks. In most networks, this can be done by multiplying a unit's output value by zero. This process is repeated on the base network for every minibatch during training (model is reset back to base model between iterations), and each unit has an independent probability pp, a hyperparameter, of being dropped. Thus, every training example sees and trains a different model, effectively; except these models share many (but not all) parameters due to being selected from the same underlying model.

Formally, suppose a binary mask vector μ\boldsymbol{\mu}, randomly selected, specifies the units to include (from the set of input/hidden units) and J(θ,μ)J(\boldsymbol{\theta},\boldsymbol{\mu}) denotes the cost function. Dropout training minimizes EμJ(θ,μ)\mathbb{E}_{\boldsymbol{\mu}}J(\boldsymbol{\theta},\boldsymbol{\mu}). This expectation contains exponentially many terms; however, sampling values of μ\boldsymbol{\mu} provides an unbiased estimate of the gradient.

The critical idea of dropout, when compared to bagging, is its parameter sharing, which enables representing exponentially many models with a tractable memory size. Moreover, in dropout, only a small fraction of the possible sub-networks are trained—each for a single step. Yet, the parameter sharing allows for reasonably approximate gradient descent.

At test time, a bagged ensemble accumulates votes from all member models; this is known as inference.

1ki=1kp(i)(yx)\frac{1}{k}\sum_{i=1}^{k}p^{(i)}(y\mid \boldsymbol{x})

In dropout, each sub-model defined by mask vector μ\boldsymbol{\mu} defines a probability distribution p(yx,μ)p(y\mid \boldsymbol{x},\boldsymbol{\mu}). We can still use the arithmetic mean:

μp(μ)p(yx,μ)\sum_{\boldsymbol{\mu}}p(\boldsymbol{\mu})p(y\mid \boldsymbol{x},\boldsymbol{\mu})

where p(μ)p(\boldsymbol{\mu}) is the probability distribution from which μ\boldsymbol{\mu} is sampled from during training. However, this is intractable to evaluate—there are exponentially many μ\boldsymbol{\mu}.

We could approximate dropout inference by averaging the output of a few randomly sampled masks. But, there's still a better approach, which enables a good approximation of the ensemble prediction using only one forward propagation.

By replacing the arithmetic mean with the geometric mean, i.e.

p~ensemble(yx)=μp(yx,μ)2d\tilde{p}_{\text{ensemble}}(y\mid \boldsymbol{x})= \sqrt[2^{d}]{ \prod_{\boldsymbol{\mu}}p(y\mid \boldsymbol{x},\boldsymbol{\mu}) }

where dd is the number of units that may be dropped (i.e. dimension of μ\boldsymbol{\mu}). Note that, for simplicity, we must guarantee that no sub-model assigns probability 00 to an event—otherwise, we have a high risk of the ensemble probability becoming 00.

Also, note that the above distribution is not normalized (hence the ~ above pp), so the ensemble evaluation is really

pensemble(yx)=p~(yx)yp~ensemble(yx)p_{\text{ensemble}(y\mid \boldsymbol{x})}= \frac{\tilde{p}(y\mid \boldsymbol{x})}{\sum_{y'}\tilde{p}_{\text{ensemble}(y'\mid \boldsymbol{x})}}

Nevertheless, the key insight here is that we can approximate pensemblep_{\text{ensemble}} by evaluating p(yx)p(y\mid \boldsymbol{x}) using just the model with all units, with a slight modification—all weights exiting unit ii are multiplied by the probability of including unit ii. This is known as the weight scaling inference rule.

tip

For networks without nonlinear units, the application of this rule perfectly models pensemblep_{\text{ensemble}}. Otherwise, however, it is an approximation.

So, why is dropout so popular?

It does have a couple drawbacks, though.

Interestingly, stochasticity is unnecessary for dropout—it's just a means of approximating the bagging ensemble method. A variant known as fast dropout reduces stochasticity significantly to achieve faster convergence.

An additional insight into dropout's benefits is that, not only is it training an ensemble of models, but it is also forcing units to essentially "adapt" when used in a variety of different networks. This regularizes hidden units to be a feature that performs well/contributes in many contexts, thus decreasing generalization error.

Dropout is also a form of noise injection, in the sense that it randomly erases some hidden units or features from the model, forcing the model to redundantly encode some information. (This process is also a multiplicative injection of noise, as alluded to previously). Thus, dropout improves noise robustness.

7.13 Adversarial Training

An adversarial example is an example extremely close to a real example, according to a human observer, that produces very different predictions from the model. This is, naturally, undesirable; however, training models to avoid misclassifying adversarial examples is beyond this chapter's scope (though this technique does somewhat help). Instead, we consider adversarial training: a regularization technique that trains models on adversarially perturbed examples.

One important observation is that excessive linearity is a common cause of adversarial examples. Adversarial training serves to discourage highly sensitive, locally linear behavior, and essentially expresses a local constancy prior.

Adversarial examples can also provide a method for semi-supervised learning. Given an unlabeled input x\boldsymbol{x}, we compute the model's predicted label y^\hat{y}. Then, we search for an adversarial example x\boldsymbol{x}' that is adversarial to the model's prediction y^\hat{y}. Since the model's label, not the true label, was used, this is known as a virtual adversarial example. By using this in adversarial training, this encourages robustness to small changes to inputs—formally, robustness to small translations along the manifold of the unlabeled data, given that different classes usually lie on separate, disconnected manifolds.

7.14 Tangent Distance, Tangent Prop, and Manifold Tangent Classifier

Recall the manifold hypothesis from Chapter 5, which suggests that, for most real-world data, there exists a low dimensional embedding within the greater, high dimensional space.

The tangent distance algorithm was an earlier invention to make use of this hypothesis; it's a non-parametric nearest neighbor algorithm with one difference: the distance metric is not Euclidean distance, but instead one derived from knowledge of the low dimensional manifolds. In essence, d(x1,x2)d(\boldsymbol{x}_{1},\boldsymbol{x}_{2}) becomes d(M1,M2)d(M_{1},M_{2}), i.e. the shortest distance between points in manifolds M1M_{1} and M2M_{2}, where x1M1\boldsymbol{x}_{1}\in M_{1} and x2M2\mathbf{x}_{2}\in M_{2}.

Naively computing d(M1,M2)d(M_{1},M_{2}) is intractable; however, a cheap alternative is to approximate MiM_{i} with the tangent plane at xi\boldsymbol{x}_{i} and measure the distance between the two tangents. However, this requires manually specifying/calculating the tangent vectors for each x\boldsymbol{x}.

In a similar vein, the tangent prop algorithm trains a neural network classifier with an extra penalty intended to make the output f(x)f(\boldsymbol{x}) locally invariant to known factors of variation, i.e. movement along the example's manifold. This is achieved by encouraging xf(x)\nabla_{\boldsymbol{x}}f(\boldsymbol{x}) to be orthogonal to the known manifold tangent vectors v(i)\boldsymbol{v}^{(i)} at x\boldsymbol{x}, or encouraging the directional derivative of ff at x\boldsymbol{x} in the directions v(i)\boldsymbol{v}^{(i)} to be small. The corresponding regularization penalty would be

Ω(f)=i((xf(x))v(i))2\Omega(f)=\sum_{i}((\nabla_{\boldsymbol{x}}f(\boldsymbol{x}))^{\top}\boldsymbol{v}^{(i)})^{2}

As with tangent distance, though, the tangent vectors must be derived a priori.

Tangent propagation also has several flaws when compared to similar methods like data augmentation.

Double backprop+

Double backprop is a technique that regularizes the Jacobian to be small. Both it and adversarial training encourage model invariance to certain directions of transformation, like tangent prop. Notably, just how tangent prop is the infinitesimal version of data augmentation, double backprop is the analogue for adversarial training.

Finally, in 2011, the manifold tangent classifier was created, which eliminates the most glaring issue of tangent propagation—knowing the tangent vectors before training. It makes use of autoencoders, a model structure discussed in Chapter 14, to approximately learn manifold tangent vectors.