SHNGS2017 - Implicit Bias on Linearly Separable Datasets

One of the key ideas in overparametrized ML is that optimization introduces implicit biases which encourage algorithms to find global minima that generalize well. That is, even though there are many solutions which yield zero training error, we tend to find “good” solutions in terms of risk.

Summary

SHNGS2017 show that, for linearly separable datasets, the gradient descent (GD) solution to an unregularized logistic regression problem converges in direction to the max-margin (hard SVM) solution. This result is notable because, unlike in least squares problems, the logistic or cross-entropy loss has no finite minimizer in these underdetermined settings. Hence their analysis crucially relies on the direction of the predictor, which is all that matters in a classification setting.

Key Theorem 1: For any linearly separable dataset, any $\beta$-smooth, monotone decreasing loss function with an exponential tail (strictly positive and decreasing to zero), any stepsize $\eta < \frac{2}{\beta \sigma_{\text{max}}(X)^2}$, and any starting point $w(0)$, the gradient descent iterates behave as:

$$ w(t) = \hat{w}\log t + \rho(t) $$

where $\hat{w}$ is the $L_2$ max-margin vector:

$$ \hat{w} = \arg\min_{w \in \mathbb{R}^d} \|w\|^2 \quad \text{s.t.} \quad w^{T}x_n \geq 1 \;\; \forall n $$

and the residual grows at most as $\|\rho(t)\| = O(\log \log t)$, and so:

$$ \lim_{t \rightarrow \infty} \frac{w(t)}{\|w(t)\|} = \frac{\hat{w}}{\|\hat{w}\|} $$

Further, for almost all datasets, the residual $\rho(t)$ is bounded.
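
As a concrete example of a loss satisfying these conditions, the logistic loss $\ell(u) = \log(1 + e^{-u})$ is monotone decreasing, strictly positive, and $\frac{1}{4}$-smooth, and it has an exponential tail, since for large $u$:

$$ \ell(u) = \log\left(1 + e^{-u}\right) = e^{-u} - \tfrac{1}{2}e^{-2u} + \cdots \approx e^{-u} $$

This is why the proof sketch below can work directly with $\ell(u) = e^{-u}$.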

Proof Sketch: First, the exponential tail of the loss function is key for the asymptotic convergence to the max-margin vector. Assume the loss function $\ell(u) = e^{-u}$. For linearly separable data, driving the loss to zero requires $\forall n: w(t)^{T}x_n \rightarrow \infty$. If $\frac{w(t)}{\|w(t)\|}$ converges to a limit $w_{\infty}$, then one can write $w(t) = g(t)\, w_{\infty} + \rho(t)$ such that $g(t) \rightarrow \infty$, $\forall n: w_{\infty}^{T}x_n > 0$, and $\lim_{t \rightarrow \infty} \frac{\|\rho(t)\|}{g(t)} = 0$. The negative gradient becomes:

$$ \begin{aligned} -\nabla L(w(t)) &= \sum_{n = 1}^N \exp\left(-w(t)^{T} x_n\right) x_n \\ &= \sum_{n = 1}^N \exp\left(-g(t)\, w_{\infty}^{T}x_n\right)\exp\left(-\rho(t)^{T}x_n\right)x_n \end{aligned} $$

As $g(t) \rightarrow \infty$, the factor $\exp\left(-g(t)\, w_{\infty}^{T}x_n\right)$ decays much more quickly for samples with larger exponents $w_{\infty}^{T}x_n$. The only samples which still contribute to the gradient are the support vectors, i.e. those with the smallest margin, $\arg\min_{n} w_{\infty}^{T}x_n$.

Looking at the negative gradient above, we see that it becomes a non-negatively weighted sum of the support vectors. Since $\|w(t)\| \rightarrow \infty$, the initial conditions become irrelevant and the limit direction $w_{\infty}$ is dominated by the support vectors, as is its margin-scaled version, $\hat{w} = \frac{w_{\infty}}{\min_{n} w_{\infty}^{T}x_n}$. It follows that:

$$ \hat{w} = \sum_{n = 1}^N \alpha_n x_n \quad \text{where } \forall n: \quad \left(\alpha_n \geq 0 \text{ and } \hat{w}^{T}x_n = 1\right) \quad \text{or} \quad \left(\alpha_n = 0 \text{ and } \hat{w}^{T}x_n > 1\right) $$

These are exactly the KKT conditions for hard-margin SVM.
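
To make the support-vector domination concrete, here is a minimal numerical sketch (a hypothetical toy dataset, with labels folded into the $x_n$ as in the analysis above): as a fixed separating direction $w_{\infty}$ is scaled up, the per-sample weights $\exp(-w^{T}x_n)$ in the negative gradient concentrate on the minimum-margin point.

```python
import numpy as np

# Toy illustration (hypothetical data): as the scale of w grows along a fixed
# separating direction w_inf, the negative gradient of the exponential loss,
# -grad L(w) = sum_n exp(-w^T x_n) x_n, is dominated by the min-margin points.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
X[:, 0] = np.abs(X[:, 0]) + 0.5        # ensure positive margin along the first axis
w_inf = np.array([1.0, 0.0])           # a fixed separating direction (assumed)

margins = X @ w_inf
support = np.argmin(margins)           # index of the minimum-margin ("support") point

for scale in [1, 5, 20, 80]:
    coeffs = np.exp(-margins * scale)  # per-sample weights in the negative gradient
    coeffs /= coeffs.sum()             # relative contribution of each sample
    print(f"scale={scale:>3}  weight on min-margin point = {coeffs[support]:.4f}")
```

The printed weight should climb toward one as the scale grows, matching the claim that only the support vectors contribute to the gradient asymptotically.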

Key Theorem 2: For almost every linearly separable dataset, the normalized weight vector converges to the normalized max-margin vector in $L_2$ norm:

$$ \left\| \frac{w(t)}{\|w(t)\|} - \frac{\hat{w}}{\|\hat{w}\|}\right\| = O\left(\frac{1}{\log t}\right) $$

and in angle:

$$ 1 - \frac{w(t)^{T}\hat{w}}{\|w(t)\| \|\hat{w}\|} = O\left(\frac{1}{\log^2 t}\right) $$

On the other hand, the loss decreases as:

$$ L(w(t)) = O\left(\frac{1}{t}\right) $$
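
These rates are consistent with one another under the exponential-loss heuristic from the proof sketch: ignoring the bounded residual $\rho(t)$ and plugging $w(t) \approx \hat{w}\log t$ into the loss, every sample has $\hat{w}^{T}x_n \geq 1$ (with equality on the support vectors), so

$$ L(w(t)) \approx \sum_{n = 1}^N \exp\left(-\log t \cdot \hat{w}^{T}x_n\right) \leq N e^{-\log t} = \frac{N}{t} $$

which recovers the $O\left(\frac{1}{t}\right)$ rate, while the directional convergence is only as fast as the slowly growing $\log t$ scale of $\|w(t)\|$.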

Practical Implications:

Connections to Other Results: AdaBoost can be formulated as a coordinate descent algorithm on the exponential loss of a linear model. With small enough step sizes, AdaBoost does converge precisely to the $L_1$ max-margin solution. For similar loss functions and the regularization path $w_{\lambda} = \arg\min_w L(w) + \lambda R(w)$, where $R(w)$ is an $L_p$ norm penalty, one can show that $\lim_{\lambda \rightarrow 0} \frac{w_{\lambda}}{\|w_{\lambda}\|}$ is proportional to the max $L_{p}$ margin solution. These latter results consider explicit regularization, as opposed to the implicit regularization induced by optimization.

Extensions: The paper proves similar results for the multi-class setting with cross-entropy loss, as well as for neural nets in which only a single weight layer is optimized and, after a sufficient number of iterations, the activation units stop switching.

Other Optimization Algorithms: Experimentally, these results continue to hold for SGD and for momentum variants of GD. However, adaptive methods such as AdaGrad and Adam do not converge to the max-margin $L_2$ solution.

A Quick Empirical Check: I ran a quick simulation to check whether these results hold up empirically. I generated a linearly separable dataset of $N = 100$ samples and $d = 120$ features. Then I ran unregularized logistic regression, keeping track of $w(t)$ along the optimization path. As expected, both the hard SVM solution and the final logistic regression solution have zero training error. The first plot shows $L(w(t))$, which decays as $O\left(\frac{1}{t}\right)$. The second plot shows $\|w(t)\|$, which increases as $O\left(\log t\right)$. The last plot shows the margin gap, which decays as $O\left(\frac{1}{\log t}\right)$. It is hard to eyeball the difference between $\frac{1}{t}$ and $\frac{1}{\log t}$, but I plotted those functions explicitly and they do in fact match these rates.
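
For reference, here is a sketch along the lines of that simulation (not the exact script behind the plots; it assumes scikit-learn's SVC with a very large $C$ as a stand-in for the hard-margin SVM and omits the plotting):

```python
import numpy as np
from scipy.special import expit      # numerically stable sigmoid
from sklearn.svm import SVC

# Sketch of the check described above (assumed setup): separable data with
# d > N, plain GD on the unregularized (sum) logistic loss, and a large-C
# linear SVC as a proxy for the hard-margin SVM direction.
rng = np.random.default_rng(0)
N, d = 100, 120
X = rng.normal(size=(N, d))
y = rng.choice([-1.0, 1.0], size=N)  # with d > N, generic data is separable
Z = y[:, None] * X                   # fold labels into the samples

w_svm = SVC(kernel="linear", C=1e10).fit(X, y).coef_.ravel()
w_svm /= np.linalg.norm(w_svm)       # (approximate) max-margin direction

# Stepsize below 2 / (beta * sigma_max(X)^2), with beta = 1/4 for logistic loss.
eta = 4.0 / np.linalg.norm(X, 2) ** 2

w = np.zeros(d)
checkpoints = {10, 100, 1_000, 10_000, 100_000}
for t in range(1, 100_001):
    margins = Z @ w
    grad = -Z.T @ expit(-margins)    # gradient of sum_n log(1 + exp(-z_n^T w))
    w -= eta * grad
    if t in checkpoints:
        loss = np.logaddexp(0, -margins).sum()
        dir_gap = np.linalg.norm(w / np.linalg.norm(w) - w_svm)
        print(f"t={t:>7}  L={loss:.3e}  ||w||={np.linalg.norm(w):6.2f}  gap={dir_gap:.3f}")
```

Plotting the loss, norm, and direction gap at these checkpoints against $t$ should roughly reproduce the $O\left(\frac{1}{t}\right)$, $O(\log t)$, and $O\left(\frac{1}{\log t}\right)$ behaviors described above.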