Deep Learning Basics

I plan on putting together some personal applied deep learning notes, pulling together material from several sources.

This first post covers DL basics.

Automatic Differentiation

Autodiff is a general framework for computing derivatives, not just of mathematical functions, but of general programs with control flow. Standard techniques for evaluating derivatives, such as symbolic differentiation and finite differences, have disadvantages.

Autodiff is a symbolic/numerical hybrid. The core idea is that any algorithm can be written as a composition of simple functions with known derivatives. This composition is called the trace and can be represented as a DAG. Consider the example from BPRS2015 where $f(x_1, x_2) = \log(x_1) + x_1x_2 - \sin(x_2)$. To evaluate derivatives, we step through this trace, computing each intermediate value $v_i$ together with its derivative.

Any control flow is simply eliminated: branches are taken, loops are unrolled, etc. This leaves us with a linear execution trace. The above formulation is known as forward mode autodiff. It is efficient for functions $g: \mathbb{R}^N \rightarrow \mathbb{R}^M$ where $N \ll M$: it requires just $N$ passes to compute the Jacobian. It also provides a matrix-free way of evaluating Jacobian-vector products. Specifically:

$$ J_{f} r = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \dots & \frac{\partial y_1}{\partial x_n}\\ \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial x_1} & \dots & \frac{\partial y_m}{\partial x_n} \end{bmatrix} \begin{bmatrix} r_1\\ \vdots\\ r_n \end{bmatrix} $$

can be computed in a single forward pass by initializing $\dot{x} = r$. For functions where $N \gg M$, like neural nets, we use reverse mode autodiff. Here the derivatives are propagated back from a given output. Again, consider our example and let $y = f(x_1, x_2)$. If we define $\bar{v}_i = \frac{\partial y}{\partial v_{i}}$, then the total derivative with respect to $v_0$ is:

$$ \bar{v}_0 = \bar{v}_2 \frac{\partial v_2}{\partial v_0} + \bar{v}_3 \frac{\partial v_3}{\partial v_0} $$

We can compute such values in a two-step procedure. First, run a forward pass to compute the intermediate values $v_i$ and track the dependencies in the DAG. Second, propagate the derivatives backward from outputs to inputs.

This process requires $M$ passes to compute the Jacobian of $g$. Reverse mode provides a matrix-free way of computing transposed Jacobian-vector products. Specifically:

$$ J^{T}_f r = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \dots & \frac{\partial y_m}{\partial x_1}\\ \vdots & \ddots & \vdots \\ \frac{\partial y_1}{\partial x_n} & \dots & \frac{\partial y_m}{\partial x_n} \end{bmatrix} \begin{bmatrix} r_1\\ \vdots\\ r_m \end{bmatrix} $$

can be computed in a single forward + backward pass by initializing $\bar{y} = r$. Note that when $M = 1$, we can get the gradient $\nabla_{x} f(x)$ in a single pass.

The Python package autograd wraps NumPy ops to support autodiff. Reverse mode autodiff is also what is going on when you call x.backward() in PyTorch.
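
As a quick sketch of reverse mode in practice, the snippet below differentiates the BPRS2015 example in PyTorch; the evaluation point is chosen arbitrarily.

```python
import torch

# Reverse mode on y = log(x1) + x1*x2 - sin(x2).
x1 = torch.tensor(2.0, requires_grad=True)
x2 = torch.tensor(5.0, requires_grad=True)
y = torch.log(x1) + x1 * x2 - torch.sin(x2)

y.backward()       # a single backward pass fills in both gradients
print(x1.grad)     # dy/dx1 = 1/x1 + x2 = 5.5
print(x2.grad)     # dy/dx2 = x1 - cos(x2) ~= 1.716
```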

Backpropagation

As an example of how NNs use reverse mode autodiff, consider a one-hidden-layer MLP with $L_2$ weight regularization. For simplicity, assume there are no bias terms, so all the trainable parameters are $W = (W^{(1)}, W^{(2)})$ where $W^{(1)} \in \mathbb{R}^{h \times d}$ and $W^{(2)} \in \mathbb{R}^{q \times h}$.

First, in the forward pass we take an input example $x \in \mathbb{R}^d$ and compute the following:

$$ o = W^{(2)}h = W^{(2)}\phi(z) = W^{(2)}\phi(W^{(1)}x) $$

The objective function for that example is computed as:

$$ J(x, y) = L(o, y) + s(W) $$

where $L(o, y)$ might be cross-entropy loss and $s(W) = \frac{\lambda}{2}\left(\|W^{(1)}\|_{F}^2 + \|W^{(2)}\|_{F}^2\right)$. The following image from ZLLS2021 shows the DAG associated with this NN:

Second, in the backward pass, we start with the following:

$$ \frac{\partial J}{\partial L} = 1 \quad \quad \frac{\partial J}{\partial s} = 1 $$

Then, we compute the gradient of $J$ with respect to the outputs $o$:

$$ \frac{\partial J}{\partial o} = \text{prod}\left(\frac{\partial J}{\partial L}, \frac{\partial L}{\partial o}\right) = \frac{\partial L}{\partial o} \in \mathbb{R}^q $$

Next, we compute the gradient of the regularization term with respect to $W$:

$$ \frac{\partial s}{\partial W^{(1)}} = \lambda W^{(1)} \quad \quad \frac{\partial s}{\partial W^{(2)}} = \lambda W^{(2)} $$

Then, we can compute the gradient for $W^{(2)}$:

$$ \frac{\partial J}{\partial W^{(2)}} = \text{prod}\left(\frac{\partial J}{\partial o}, \frac{\partial o}{\partial W^{(2)}}\right) + \text{prod}\left(\frac{\partial J}{\partial s}, \frac{\partial s}{\partial W^{(2)}}\right) = \frac{\partial J}{\partial o}h^{T} + \lambda W^{(2)} \in \mathbb{R}^{q \times h} $$

For the gradient of $W^{(1)}$, we need to continue:

$$ \frac{\partial J}{\partial h} = \text{prod}\left(\frac{\partial J}{\partial o}, \frac{\partial o}{\partial h}\right) = (W^{(2)})^{T}\frac{\partial J}{\partial o} $$

$$ \frac{\partial J}{\partial z} = \text{prod}\left(\frac{\partial J}{\partial h}, \frac{\partial h}{\partial z}\right) = \frac{\partial J}{\partial h} \circledcirc \phi^{'}(z) $$

where $\circledcirc$ is the element-wise multiplication operator. Finally, we get:

$$ \frac{\partial J}{\partial W^{(1)}} = \text{prod}\left(\frac{\partial J}{\partial z}, \frac{\partial z}{\partial W^{(1)}}\right) + \text{prod}\left(\frac{\partial J}{\partial s}, \frac{\partial s}{\partial W^{(1)}}\right) = \frac{\partial J}{\partial z}x^{T} + \lambda W^{(1)} \in \mathbb{R}^{h \times d} $$

The above steps are computationally efficient compared to forward mode autodiff in two ways: intermediate gradients such as $\frac{\partial J}{\partial o}$ and $\frac{\partial J}{\partial z}$ are computed once and reused, and a single backward pass yields the gradients with respect to all the parameters rather than one pass per input dimension.

The downside of reverse mode autodiff is memory complexity. It requires storing all intermediate quantities from the forward pass in order to compute gradients. Memory requirements scale proportionally with the number of layers and the batch size, so training deep networks with large batches can lead to out-of-memory errors. That is, with reverse mode we pay $O(V)$ in memory cost, where $V$ is the number of intermediate values in the trace.
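
To make the steps above concrete, here is a minimal NumPy sketch of the forward and backward passes, assuming $\phi$ is ReLU and treating $\frac{\partial J}{\partial o}$ as given by the loss; all names and dimensions are arbitrary:

```python
import numpy as np

# One-hidden-layer MLP without biases, phi = ReLU.
d, h, q, lam = 4, 3, 2, 0.1
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(h, d)), rng.normal(size=(q, h))
x = rng.normal(size=(d, 1))

# Forward pass: keep the intermediate values z and h_act for the backward pass.
z = W1 @ x                      # (h, 1)
h_act = np.maximum(z, 0)        # (h, 1)
o = W2 @ h_act                  # (q, 1)

dJ_do = rng.normal(size=(q, 1))          # stand-in for dL/do from the loss
dJ_dW2 = dJ_do @ h_act.T + lam * W2      # (q, h)
dJ_dh = W2.T @ dJ_do                     # (h, 1)
dJ_dz = dJ_dh * (z > 0)                  # elementwise phi'(z) for ReLU
dJ_dW1 = dJ_dz @ x.T + lam * W1          # (h, d)
```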

Softmax Operation

For classification, we need to transform real-valued logits $o$ into a discrete probability distribution $\hat{y}$. The softmax operation does this transformation in a way that is differentiable and ensures $\sum_{j} \hat{y}_j = 1$ and $\hat{y}_j > 0$. Specifically, we have:

$$ \hat{y} = \text{softmax}(o) \quad \text{where} \quad \hat{y}_j = \frac{\exp(o_j)}{\sum_{k} \exp(o_k)} $$
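
A direct NumPy translation of this definition (a numerically stable variant is discussed below):

```python
import numpy as np

def softmax(o):
    # Naive softmax over the last axis; see the max-subtraction trick below
    # for a numerically stable version.
    exp_o = np.exp(o)
    return exp_o / exp_o.sum(axis=-1, keepdims=True)

print(softmax(np.array([1.0, 2.0, 3.0])))  # entries are positive and sum to 1
```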

Cross-Entropy Loss

Say we have $q$ classes. Minimizing cross-entropy loss is equivalent to minimizing the negative log-likelihood of our data where we assume that $P(y^{(i)} \mid x^{(i)}) = \text{Multinomial}(n = 1, p = \hat{y}^{(i)})$. We have:

$$ -\log P(Y \mid X) = -\sum_{i = 1}^n \log P(y^{(i)} \mid x^{(i)}) = \sum_{i=1}^n l(y^{(i)}, \hat{y}^{(i)}) = -\sum_{i=1}^n \sum_{j = 1}^q y^{(i)}_j \log \hat{y}^{(i)}_j $$

Note that $\log \hat{y}^{(i)}_j = 0$ if $\hat{y}^{(i)}_j = 1$; that is, we predict class $j$ with perfect certainty. Otherwise, $\log \hat{y}^{(i)}_j < 0$. We have:

$$ \begin{aligned} l(y, \hat{y}) &= -\sum_{j = 1}^q y_j \log \frac{\exp(o_j)}{\sum_{k} \exp(o_k)}\\ &= \log \sum_{k = 1}^q \exp(o_k) - \sum_{j = 1}^q y_j o_j \end{aligned} $$

The derivative with respect to a logit $o_j$ becomes:

$$ \frac{\partial l(y, \hat{y})}{\partial o_j} = \frac{\exp(o_j)}{\sum_{k = 1}^q \exp(o_k)} - y_j = \text{softmax}(o)_j - y_j $$

This result is similar to OLS, where the gradients of squared loss are of the form $\hat{y} - y$. The gradients of any exponential family model are of this form, making them trivial to compute. All of this math holds even when our label $y$ is not a one-hot vector but a distribution of the form $y = (0.1, 0, 0.4, 0.5)$. In that case, we can think of cross-entropy in information-theoretic terms. We are minimizing:

$$ H(P_{y}, P_{\hat{y}}) = H(P_{y}) + D_{\text{KL}}(P_{y} \mid\mid P_{\hat{y}}) $$

with $D_{\text{KL}}(P_{y} \mid\mid P_{\hat{y}}) = 0$ if $P_{y} = P_{\hat{y}}$ and $> 0$ otherwise.
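
As a sanity check, here is a small NumPy sketch comparing the analytic gradient $\text{softmax}(o) - y$ against finite differences of the loss $\log \sum_k \exp(o_k) - y^T o$:

```python
import numpy as np

# Finite-difference check that d l / d o_j = softmax(o)_j - y_j.
rng = np.random.default_rng(0)
o, y = rng.normal(size=5), np.eye(5)[2]           # logits and a one-hot label

def loss(o):
    return np.logaddexp.reduce(o) - np.dot(y, o)  # log-sum-exp minus y^T o

eps = 1e-6
num_grad = np.array([(loss(o + eps * np.eye(5)[j]) - loss(o - eps * np.eye(5)[j]))
                     / (2 * eps) for j in range(5)])
analytic = np.exp(o) / np.exp(o).sum() - y
print(np.allclose(num_grad, analytic, atol=1e-6))  # True
```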

Handling Numerical Issues with Softmax

If some of the logits $o_k$ are very large, then $\exp(o_k)$ might be larger than the largest number that can be represented by certain data types. A simple solution is to subtract $\max(o_k)$ from all the logits. This does not change the softmax outputs:

$$ \begin{aligned} \hat{y}_j &= \frac{\exp(o_j - \max(o_k))\exp(\max(o_k))}{\sum_{k} \exp(o_k - \max(o_k))\exp(\max(o_k))}\\ &= \frac{\exp(o_j - \max(o_k))}{\sum_{k} \exp(o_k - \max(o_k))} \end{aligned} $$

Now we face another issue: $o_j - \max(o_k)$ might be very negative, so $\exp(o_j - \max(o_k))$ might be close to 0. Due to finite precision, these values may be rounded to 0, and then we will have $\log(\hat{y}_j) = -\infty$. However, recall that ultimately we plug the softmax outputs into the cross-entropy loss, where we take a log. Hence we get:

$$ \log \hat{y}_j = o_j - \log \left(\sum_{k} \exp(o_k) \right) $$

Hence we can avoid computing $\exp(o_j - \max(o_k))$ by directly computing $\log \hat{y}_j$ inside the cross-entropy loss function. Finally, we can take advantage of the LogSumExp trick:

$$ \log\left(\sum_{k} \exp(o_k) \right) = \max(o_k) + \log \left(\sum_{j} \exp(o_j - \max(o_k))\right) $$

These tricks are implemented in torch.nn.CrossEntropyLoss().
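
A NumPy sketch of the same idea, computing cross-entropy directly from logits via log-softmax and the LogSumExp trick (the function name is my own):

```python
import numpy as np

def cross_entropy_from_logits(o, y):
    # Stable cross-entropy: log-softmax via the LogSumExp trick,
    # never taking the log of an underflowed softmax output.
    m = o.max(axis=-1, keepdims=True)
    log_z = m + np.log(np.exp(o - m).sum(axis=-1, keepdims=True))
    log_probs = o - log_z
    return -(y * log_probs).sum(axis=-1)

o = np.array([1000.0, 0.0, -1000.0])     # naive softmax would overflow here
y = np.array([1.0, 0.0, 0.0])
print(cross_entropy_from_logits(o, y))   # ~0.0
```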

Multilayer Perceptrons

An $L$-layer MLP is given by:

$$ O = \sigma_{L-1}(\dots \sigma_2(\sigma_1(XW^{(1)} + b^{(1)})W^{(2)} + b^{(2)}) \dots)W^{(L)} + b^{(L)} $$

where the $\sigma_i$ are activation functions, which usually operate element-wise or row-wise, and $W^{(i)}$ and $b^{(i)}$ are the trainable model weights and biases. Note that it is key that the $\sigma_i$ are non-linear. Otherwise, we are just building a single-layer linear model with more parameters than necessary.
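
As a concrete sketch, a small MLP of this form in PyTorch with ReLU activations (layer widths are arbitrary):

```python
import torch.nn as nn

# An MLP matching the formula above, with two hidden layers.
mlp = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),   # sigma_1(X W1 + b1)
    nn.Linear(256, 128), nn.ReLU(),   # sigma_2(... W2 + b2)
    nn.Linear(128, 10),               # final affine layer, no activation
)
```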

MLPs are known as universal approximators. Several different statements exist formalizing this intuition. One is that three-layer MLPs with ReLU activations can approximate any $L$-Lipschitz function $f: [0, 1]^d \rightarrow \mathbb{R}$ with $O\left(d\left(\frac{L}{\epsilon}\right)^d\right)$ neurons such that the $L_1$ error is bounded by $\epsilon$:

$$ \int_{[0, 1]^d} \left\lvert f(x) - \hat{f}(x) \right\rvert dx \leq \epsilon $$

Finding such an $\hat{f}$ in a computationally efficient manner is the hard part. Also, note the exponential dependence on the dimension $d$ here.

Activation Functions

Regularization

Weight Decay

Similar to high-dimensional linear regression, a common regularization strategy is to add $L_2$ or $L_1$ penalties on the weight matrices $W^{(i)}$ to force the model to fit smoother functions. Whether the bias terms are penalized varies across applications, but usually the bias in the output layer is not penalized. For simplicity, assume all the weight matrices are vectorized and represented by the parameter vector $w$. The new regularized loss is of the form:

$$ L_{R}(w, b) = \frac{\lambda}{2} w^{T}w + L(w, b) $$

The GD steps look like:

$$ w \leftarrow (1 - \eta\lambda) w - \eta \nabla_{w} L(w, b) $$

So, the weights are shrunk by a constant factor each step before performing the standard GD update. We can also study what happens over the course of the full training procedure. Let $w^{\ast} = \arg\min_w L(w, b)$, the minimizer of the unregularized loss. Then, by a second-order Taylor approximation around $w^{\ast}$:

$$ \hat{L}_{R}(w, b) = L_{R}(w^{\ast}, b) + \lambda (w^{\ast})^{T}(w - w^{\ast}) + \frac{1}{2}(w - w^{\ast})^{T}(H + \lambda I)(w - w^{\ast}) $$

where $H$ is the Hessian of $L(w, b)$ at $w^{\ast}$. Since $w^{\ast}$ is the minimizer, $H$ is PSD. Note that the first-order term coming from $L$ disappears because $\left[\nabla_{w} L(w, b)\right]_{w = w^{\ast}} = 0$. The minimum of $\hat{L}_{R}(w, b)$ occurs when $\nabla_{w} \hat{L}_{R}(w, b) = \lambda w + H(w - w^{\ast}) = 0$:

$$ \hat{w} = (H + \lambda I)^{-1}Hw^{\ast} $$

Note that as $\lambda \rightarrow 0$, we have $\hat{w} \rightarrow w^{\ast}$. Otherwise, taking the eigendecomposition $H = Q\Delta Q^{T}$:

$$ \hat{w} = Q(\Delta + \lambda I)^{-1}\Delta Q^{T} w^{\ast} \tag{1} $$

This is exactly the same analysis as ridge regression. We see that weight decay scales the components of $w^{\ast}$ aligned with the $i$-th eigenvector of $H$ by $\frac{\gamma_i}{\gamma_i + \lambda}$, where the $\gamma_i$ are the eigenvalues. Directions in parameter space that do not contribute much to reducing $L(w, b)$, and hence have small eigenvalues, are significantly shrunk by $L_2$ weight decay.

We can similarly do $L_1$ regularization. Then, we have:

$$ L_{R}(w, b) = \lambda \|w\|_{1} + L(w, b) $$

The GD updates look like:

$$ w \leftarrow w - \eta\lambda\, \text{sign}(w) - \eta \nabla_{w} L(w, b) $$

where $\text{sign}(w)$ is applied component-wise. If we assume the Hessian is diagonal:

$$ \hat{L}_{R}(w, b) = L(w^{\ast}, b) + \sum_{i} \left[\frac{1}{2}H_{i, i}(w_i - w^{\ast}_i)^2 + \lambda \lvert w_i \rvert \right] $$

The component-wise solution is given by:

$$ \hat{w}_i = \text{sign}(w_i^{\ast})\max\left\{\lvert w_i^{\ast} \rvert - \frac{\lambda}{H_{i, i}}, 0\right\} $$

Again, this is exactly the same as the Lasso analysis. The thresholding operation encourages shrinkage plus sparsity.

In PyTorch, $L_2$ penalization is automatically supported by passing the weight_decay argument (the value of $\lambda$) to torch.optim.SGD(). You have to define a custom loss function for $L_1$ penalization.
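
A sketch of both options on a toy model (hyperparameter values are arbitrary):

```python
import torch
import torch.nn as nn

model = nn.Linear(20, 2)
# L2 weight decay is handled by the optimizer itself:
opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)

# L1 penalty added manually to the loss:
def l1_penalty(model, lam=1e-4):
    return lam * sum(p.abs().sum() for p in model.parameters())

x, y = torch.randn(32, 20), torch.randint(0, 2, (32,))
loss = nn.functional.cross_entropy(model(x), y) + l1_penalty(model)
opt.zero_grad()
loss.backward()
opt.step()
```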

Early Stopping

Early stopping refers to terminating training when validation error has trended upward over some number of past epochs. You then return the model parameters from the point where validation error was lowest. Such a strategy can be viewed as an efficient algorithm for hyperparameter search, where the hyperparameter is the number of training steps. Unlike grid search, a single training run automatically evaluates the loss for many values of this hyperparameter. It is also completely unobtrusive, changing neither the loss function nor the learning dynamics. The extra cost comes from evaluating on the validation set at each step and having to store the best model parameters.
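
A minimal patience-based sketch of such a loop in PyTorch (function and argument names are my own):

```python
import copy
import torch

def train_with_early_stopping(model, opt, loss_fn, train_dl, val_dl,
                              patience=5, max_epochs=100):
    # Stop once validation loss has not improved for `patience` epochs,
    # and return the parameters from the best epoch seen so far.
    best_loss, best_state, bad_epochs = float("inf"), None, 0
    for epoch in range(max_epochs):
        model.train()
        for x, y in train_dl:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            val_loss = sum(loss_fn(model(x), y).item() for x, y in val_dl)
        if val_loss < best_loss:
            best_loss, bad_epochs = val_loss, 0
            best_state = copy.deepcopy(model.state_dict())
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    model.load_state_dict(best_state)
    return model
```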

Say we restrict the number of optimization steps to $\tau$ and assume that the gradient of the loss is bounded by $B$. Then, early stopping restricts $\|w_{o} - w\|_{2} \leq \eta \tau B$ where $w_{o}$ is the initialization. For more intuition, let's analyze a linear model with squared loss and optimum $w^{\ast}$. We will show that early stopping is equivalent to $L_2$ regularization. We have:

$$ L(w) = L(w^{\ast}) + \frac{1}{2}(w - w^{\ast})^{T}H(w - w^{\ast}) $$

The GD iterates:

$$ w^t = w^{t - 1} - \eta H(w^{t - 1} - w^{\ast}) \Longrightarrow w^{t} - w^{\ast} = (I - \eta H)(w^{t - 1} - w^{\ast}) $$

Using the eigendecomposition $H = Q \Delta Q^{T}$:

$$ Q^{T}(w^{t} - w^{\ast}) = (I - \eta \Delta)Q^{T}(w^{t - 1} - w^{\ast}) $$

If $\eta$ is chosen such that $\lvert 1 - \eta \gamma_i \rvert < 1$ and $w^{0} = 0$, then we have:

$$ Q^{T}w^{t} = \left[I - (I - \eta \Delta)^{t}\right]Q^{T}w^{\ast} \tag{2} $$

Recall from Equation 1 that for $L_2$ regularization we have:

$$ Q^{T}\hat{w} = (\Delta + \lambda I)^{-1} \Delta Q^{T}w^{\ast} \Longrightarrow Q^{T}\hat{w} = [I - (\Delta + \lambda I)^{-1} \lambda]Q^{T}w^{\ast} $$

The second equality follows from the Woodbury matrix identity. Comparing to Equation 2, the two are equivalent when $\lambda$, $\eta$ and $t$ are chosen such that:

$$ (I - \eta \Delta)^{t} = (\Delta + \lambda I)^{-1} \lambda $$

If the eigenvalues $\gamma_i$ are sufficiently small, one can show that this equivalence corresponds to $t \approx \frac{1}{\eta \lambda}$. That is, the number of training iterations is inversely proportional to the $L_2$ parameter $\lambda$. In the context of early stopping, we see that parameter directions with significant curvature are learned earlier than directions of less curvature.

Dataset Augmentation

More training samples can improve generalization. If we do not have more data, we can create fake data and add it to the training set. In computer vision, adding samples with rotations or small pixel translations can force the NN to be invariant to these transformations. Make sure not to rotate a “9” into a “6” if you are asking your model to classify digits. Adding noise to the inputs can in many cases be shown to be equivalent to adding a weight norm penalty. For a proof, look at B1995.

Adding Noise to Weights

We can also add noise to the parameters of the model. Say for each input sample, we include a perturbation $\epsilon_{W} \sim N(0, \eta I)$ on the network weights. If we are using squared loss, we want to minimize the risk:

$$ \mathbb{E}_{p(x, y, \epsilon_{W})} \left[(\hat{y}_{\epsilon_W}(x) - y)^2 \right] = \mathbb{E}_{p(x, y, \epsilon_{W})}\left[\hat{y}_{\epsilon_W}(x)^2 - 2y\hat{y}_{\epsilon_W}(x) + y^2\right] $$

For sufficiently small $\eta$, this can be shown to be the same as minimizing the unperturbed risk with an added regularization term $\eta \mathbb{E}_{p(x, y)}\left[\|\nabla_{W} \hat{y}(x) \|_{2}^2 \right]$. So, the solution is pushed to regions of parameter space where perturbations of the weights have a small effect on the predictions; we find minima surrounded by flat regions. Note that this only works for non-linear models. In linear regression, the regularization term becomes $\eta \mathbb{E}_{p(x)}\left[\|x \|_{2}^2 \right]$, which does not depend on the parameters $W$.
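
A rough PyTorch sketch of one training step with weight noise: perturb the parameters, take a gradient step on the perturbed loss, then undo the perturbation before the optimizer update (the function name and noise scale are my own choices):

```python
import torch

def noisy_weight_step(model, loss_fn, x, y, opt, eta=1e-3):
    # Sample and apply a Gaussian perturbation to every parameter.
    noises = []
    with torch.no_grad():
        for p in model.parameters():
            eps = torch.randn_like(p) * eta ** 0.5
            p.add_(eps)
            noises.append(eps)
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()              # gradients w.r.t. the perturbed weights
    with torch.no_grad():
        for p, eps in zip(model.parameters(), noises):
            p.sub_(eps)          # undo the perturbation before the update
    opt.step()
    return loss.item()
```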

Dropout

Dropout is a technique developed in SKHSS2014. The idea is to inject noise into the hidden units of a NN. Throughout training, during forward propagation, some fraction of the neurons in a layer are randomly zeroed out before computing the subsequent layer. The original authors proposed a slightly different interpretation. They argued that overfitting in NNs is characterized by co-adaptation, where each layer relies on a specific pattern of activations in the previous layer to fit the data well. Dropout forces neurons to learn more robust features since they cannot rely on other units to correct their mistakes.

The image below from ZLLS2021 shows how dropout amounts to training a smaller sub-network of the original NN:

This provides an interpretation of dropout as an approximation to bagging. Specifically, consider the exponential number of sub-networks that can be formed by masking certain neurons in a NN. In dropout, for each training example in a minibatch, a different sub-network is sampled and then we run the forward pass, backprop and the parameter update. If $\mu$ denotes the mask vector, dropout training can be viewed as minimizing $\mathbb{E}_{\mu}[L(w, b, \mu)]$. We can get an unbiased estimate of the gradient of this objective by sampling values of $\mu$. Unlike bagging, the base learners share their parameters and are not trained to convergence. Parameter sharing makes it possible to “ensemble” exponentially many sub-networks in a memory-efficient manner. Much like bagging, each sub-network sees only a subset of the original training examples due to minibatch sampling.

The noise added during dropout should be added in an unbiased manner: the expected value of a layer, holding all other layers fixed, should equal its value in the absence of noise. Hence an activation $h$ is replaced with the random variable $h^{'}$:

$$ h^{'} = \begin{cases} 0 &\quad \text{with prob } p,\\ \frac{h}{1 - p} &\quad \text{with prob } 1- p \end{cases} $$

where $p$ is the dropout probability. Notice that $\mathbb{E}[h^{'}] = h$. Since dropout is disabled at test time, this scaling ensures that the expected total input to a neuron at test time is the same as the expected total input at train time.

In practice, implementing dropout during training is as simple as drawing random uniforms and multiplying a hidden layer by the resulting mask to convert $h \rightarrow h^{'}$. Usually, the input layer has $p \approx 0$ because we do not want to turn off too many of the input features.
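
A sketch of that masking step in PyTorch (inverted dropout, so no rescaling is needed at test time):

```python
import torch

def dropout_layer(h, p):
    # Zero each activation with probability p and scale the survivors
    # by 1 / (1 - p) so that E[h'] = h.
    if p == 0.0:
        return h
    mask = (torch.rand_like(h) > p).float()
    return mask * h / (1.0 - p)

h = torch.ones(2, 8)
print(dropout_layer(h, p=0.5))   # roughly half zeros, survivors equal 2.0
```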

Adding dropout in PyTorch is as easy as using the torch.nn.Dropout class in your NN specification.

Vanishing and Exploding Gradients

Consider a NN with $L$ layers, input $x$ and output $o$, where each layer $l$ is a transformation $f_{l}$ parameterized by a matrix $W^{(l)}$. Let $h^{(0)} = x$; then we have:

$$ h^{(l)} = f_{l}(h^{(l-1)}) \quad \quad o = f_{L}(f_{L - 1}( \dots f_1(x))) $$

We can write the gradient of $o$ with respect to $W^{(l)}$ as follows:

$$ \partial_{W^{(l)}}(o) = \partial_{h^{(L-1)}} h^{(L)} \cdots \partial_{h^{(l)}} h^{(l + 1)}\, \partial_{W^{(l)}} h^{(l)} = M^{(L)} \cdots M^{(l + 1)} v^{(l)} $$

The gradient is a product of matrices and a vector. Depending on the eigenspectrum of these matrices, the gradient can end up being very small or very large. For simplicity, assume that $M = M^{(L)} = \dots = M^{(l + 1)}$ and that $M$ has eigendecomposition $V \Delta V^{-1}$. Then, we have:

$$ \partial_{W^{(l)}}(o) = M^{L - l}v^{(l)} = V \Delta^{L - l}V^{-1}v^{(l)} $$

Hence, for deep NNs, if the eigenvalues are larger than $1$ in magnitude, the gradient explodes. If the eigenvalues are smaller than $1$ in magnitude, the gradient vanishes. Exploding gradients make learning unstable. Vanishing gradients make it impossible to figure out which direction to move in parameter space. This problem is most pronounced in RNNs, where the same weight matrix $M$ is applied over and over again across many steps, but it can also show up elsewhere.

One source of vanishing gradients is the choice of activation function. For example, consider the sigmoid activation. Since $\sigma^{'}(x) \approx 0$ when $x$ is far from $0$ and $\max_{x} \sigma^{'}(x) = 1/4$, if small weights are multiplied by these activation gradients, the loss gradient can approach zero at a rate exponential in the number of layers during backprop. Early layers then train very slowly. ReLU solves this problem.

Exploding gradients are driven by large weights. They are more of a problem with ReLUs than with squashing functions like the sigmoid and tanh. But for all activations, sufficiently large weights can cause the loss gradient to explode at an exponential rate. As an example, consider the following 3-layer NN where each layer has a single neuron:

$$ y = \tanh(w_3 \cdot \tanh(w_2 \cdot \tanh(w_1 \cdot x))) $$

Then the gradient with respect to the first weight $w_1$ is:

$$ \frac{\partial L}{\partial w_{1}} = \frac{\partial L}{\partial y} \cdot w_3 \cdot w_2 \cdot x \cdot \text{sech}^2(w_1 x) \cdot \text{sech}^2(w_2 \tanh(w_1 x))\cdot \text{sech}^2(w_3 \tanh(w_2 \tanh(w_1 x))) $$

Note that $\text{sech}^{2}(x) \leq 1$. But for a deep network with $L$ layers, the analogous term $\prod_{i = 2}^{L} w_i$ can blow up when $\lvert w_i \rvert > 1$. Similar behavior is possible for the sigmoid activation.
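
A rough numerical illustration of the eigenvalue argument above, using a chain of random linear layers with ReLU activations (depth, width, and scales are arbitrary choices); the only point is how the gradient norm behaves as the weight scale crosses 1:

```python
import torch

def grad_norm_through_chain(weight_scale, depth=50, width=64):
    # Product of `depth` random linear layers with ReLU activations.
    # Small weight_scale -> vanishing gradient; large -> exploding gradient.
    torch.manual_seed(0)
    x = torch.randn(width, requires_grad=True)
    h = x
    for _ in range(depth):
        W = torch.randn(width, width) * weight_scale / width ** 0.5
        h = torch.relu(W @ h)
    h.sum().backward()
    return x.grad.norm().item()

print(grad_norm_through_chain(0.5))   # tiny gradient norm (vanishing)
print(grad_norm_through_chain(4.0))   # enormous gradient norm (exploding)
```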

Weight Initialization

Recall that weight initialization is not an issue for convex problems such as linear or softmax regression since the appropriate optimization algorithms are guaranteed to reach a global minimum. For non-convex problems, initialization affects the solution to which you converge and the numerical stability of the algorithm.

Breaking Symmetry

Fully connected NNs are symmetric in their parametrization. Imagine a simple MLP with one hidden layer of two units. We can permute the weights in $W^{(1)}$ and $W^{(2)}$ and obtain the exact same function. Hence there is permutation symmetry among the hidden units of each layer.

Such symmetry can become a problem during training. If the hidden units of a layer share the same input/output weights and activations, they compute the same values and receive the same gradient during backprop. The GD updates for each unit’s weights will be the same. So, the hidden layer would behave as a single unit, wasting the NN’s capacity.

To break symmetry, we need to add specific forms of randomness to the training process. For a single-hidden-layer MLP, we will see the undesired behavior if we initialize $W^{(1)} = c$ for some constant $c$. Randomly initializing the elements of $W^{(1)}$ to small values from a Gaussian or uniform distribution is the standard strategy. Technically, one can still initialize the weights into the output units as $W^{(2)} = c$, since the output units receive different gradients during backprop based on the functional form of the loss. But if you start with $W^{(2)} = 0$, it will take two iterations before $\frac{\partial L}{\partial W^{(1)}} \neq 0$ and $W^{(1)}$ finally starts updating.

Other strategies can work even with constant hidden-unit weights. For example, dropout would break the symmetry, though note that minibatch SGD alone is not sufficient. Alternatively, one can set the weights to 0 and initialize only the biases randomly. But in that case, an $L$-layer NN will take $L$ iterations to get non-zero weights in all layers.

Xavier Initialization

A commonly used initialization is the Xavier initialization from GB2010, which helps mitigate issues with vanishing/exploding gradients. The idea is that with randomly initialized weights, the variance of a unit's output scales with the number of inputs. Consider an output $o_i$ from a fully-connected layer without non-linearities. Let there be $n_{\text{in}}$ inputs $x_j$ and let $w_{ij}$ be the associated weights. We have:

$$ o_i = \sum_{j = 1}^{n_{\text{in}}} w_{ij}x_{j} $$

Assume that $w_{ij} \sim N(0, \sigma^2)$, $\mathbb{E}[x_{j}] = 0$, $\text{Var}(x_j) = \gamma^2$, and that the $w_{ij}$ are independent of the $x_j$. Then, $\mathbb{E}[o_i] = 0$ and $\text{Var}(o_i) = \mathbb{E}[o_i^{2}] = n_{\text{in}}\sigma^{2}\gamma^{2}$. One way to keep the variance fixed is to set $n_{\text{in}}\sigma^{2} = 1$. Considering backpropagation, one can show that the variance of the gradient blows up unless $n_{\text{out}}\sigma^{2} = 1$. Since we cannot simultaneously satisfy both conditions, the Xavier initialization instead takes $\sigma^{2} \frac{1}{2}(n_{\text{in}} + n_{\text{out}}) = 1$. That is:

$$ \sigma = \sqrt{\frac{2}{n_{\text{in}} + n_{\text{out}}}} $$

Typically, this variance is used with a Gaussian. Alternatively, noting that the variance of $\text{Unif}(-a, a)$ is $\frac{a^2}{3}$, one can sample from:

$$ \text{Unif}\left(-\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}, \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}} \right) $$
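
Both variants are available in PyTorch's init module; a quick sketch (layer sizes are arbitrary):

```python
import torch.nn as nn

layer = nn.Linear(256, 128)
# Gaussian and uniform variants of Xavier initialization.
nn.init.xavier_normal_(layer.weight)    # sigma^2 = 2 / (n_in + n_out)
nn.init.xavier_uniform_(layer.weight)   # Unif(-sqrt(6/(n_in+n_out)), +sqrt(...))
print(layer.weight.std())               # roughly sqrt(2 / (256 + 128))
```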

Handling Distribution Shift

Say we train our model on data from a distribution $p_{S}(x, y)$, but our test set comes from $p_{T}(x, y)$. Without any assumptions on the relationship between $p_{S}(x, y)$ and $p_{T}(x, y)$, we cannot do anything. There are certain restricted assumptions under which we can still make things work: covariate shift, label shift, and concept shift.

Say our train distribution is $q$ and our test distribution is $p$. First, consider covariate shift, where we assume $p(y \mid x) = q(y \mid x)$ while $p(x)$ may differ from $q(x)$. Hence:

$$ \int \int l(f(x), y) p(y \mid x) p(x)\, dx\, dy = \int \int l(f(x), y) q(y \mid x) q(x) \frac{p(x)}{q(x)}\, dx\, dy $$

That is, we need to re-weight samples by the propensity scores:

$$ \beta_i = \frac{p(x_i)}{q(x_i)} $$

To use this in ERM, we need estimates $\hat{\beta}_i$. Any approach will require samples from both $p$ and $q$, but only the covariates, not the labels. The simplest approach is to build a classifier that distinguishes samples from the train and test sets. Let $z = 1$ mean that a sample comes from $p$. Assume for simplicity that $P(z = 1) = P(z = -1) = 0.5$. Then, using logistic regression with logit function $h$, we have:

$$ \hat{\beta_i} = \frac{\hat{\mathbb{P}}(z = 1 \mid x_i)}{\hat{\mathbb{P}}(z = -1 \mid x_i)} = \exp(h(x_i)) $$

The final algorithm is to solve a weighted ERM problem with these estimated importance weights. Usually, we clip the weights at some constant $c$ to lower the variance when the denominator is very small. Theoretically, we need a positivity assumption: whenever $p(x) > 0$, we need $q(x) > 0$.
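
A sketch of this estimator using scikit-learn's logistic regression, assuming roughly equal-sized train and test covariate samples (function and variable names are my own):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def covariate_shift_weights(x_train, x_test, clip=10.0):
    # Fit a classifier to distinguish train (z=0) from test (z=1) covariates,
    # then use exp(logit) as the estimated density ratio p(x)/q(x).
    X = np.vstack([x_train, x_test])
    z = np.concatenate([np.zeros(len(x_train)), np.ones(len(x_test))])
    clf = LogisticRegression().fit(X, z)
    logits = clf.decision_function(x_train)
    return np.clip(np.exp(logits), 0.0, clip)   # clipped importance weights

# These weights can then be passed to a weighted ERM fit, e.g. an estimator
# that accepts sample_weight=covariate_shift_weights(x_train, x_test).
```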

For label shift, a similar analysis yields that we need to estimate:

$$ \beta_i = \frac{p(y_i)}{q(y_i)} $$

Assume we are dealing with a $k$-class classification problem. Label shift tends to be easier to handle statistically because, if we have a good classifier on the train distribution, we do not have to work with the potentially high-dimensional covariates $x$. What we do is estimate a $k \times k$ confusion matrix $C$ from predictions on a validation set (constructed via sample splitting). The columns represent the ground-truth categories and the rows the predicted categories. We do not have access to test labels, but we do have access to test covariates, so we can also compute the distribution over predicted categories on the test set, $p(\hat{y})$. Then:

$$ Cp(y) = p(\hat{y}) $$

This result follows from the fact that $p(\hat{y} \mid y) = q(\hat{y} \mid y)$, since $\hat{y}$ depends on $y$ only through $x$ and we make the label shift assumption $p(x \mid y) = q(x \mid y)$. Hence we have:

$$ p(\hat{y}) = \sum_{y \in \mathcal{Y}} p(\hat{y} \mid y) p(y) = \sum_{y \in \mathcal{Y}} q(\hat{y} \mid y) p(y) $$

Assuming a sufficiently accurate classifier, $C$ is invertible and we can estimate $\hat{p}(y) = C^{-1}p(\hat{y})$. Getting an estimate $\hat{q}(y)$ is simple from our labeled train samples. Finally, we plug the resulting $\hat{\beta}_i$ into a weighted ERM.
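
A NumPy sketch of this correction, assuming integer class labels and that every class appears in the validation split (names are my own):

```python
import numpy as np

def label_shift_weights(y_val, yhat_val, yhat_test, k):
    # C[i, j] = q(yhat = i | y = j), estimated on a held-out validation split.
    C = np.zeros((k, k))
    for j in range(k):
        mask = (y_val == j)
        C[:, j] = np.bincount(yhat_val[mask], minlength=k) / mask.sum()
    p_hat_test = np.bincount(yhat_test, minlength=k) / len(yhat_test)  # p(yhat)
    p_y = np.linalg.solve(C, p_hat_test)                 # estimate of p(y)
    q_y = np.bincount(y_val, minlength=k) / len(y_val)   # q(y) from train labels
    return p_y / q_y        # per-class beta; index by y_i for sample weights
```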

For concept shift, we usually assume that definitions are changing slowly over time. So, we simply collect more samples over time and continue performing parameter updates to our existing NN.