Minimum Norm OLS and GD

Here I prove that running gradient descent (GD) on the OLS objective from a zero initialization converges to the minimum $L_2$ norm solution.

Pseudoinverse

Consider an $n \times k$ matrix $X$ with $\text{rank}(X) = r$. By the SVD, we have:

$$ X = U \begin{bmatrix} D & 0\\ 0 & 0\end{bmatrix} V^{T} $$

where $D$ is an $r \times r$ diagonal matrix of singular values $\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_r > 0$. $X$ has an additional $\min\{n, k\} - r$ singular values, all of which equal zero. The $k \times n$ pseudoinverse follows naturally from the SVD:

$$ X^{\dagger} = V \begin{bmatrix} D^{-1} & 0\\ 0 & 0\end{bmatrix} U^{T} = \sum_{i = 1}^{r} \frac{v_i u_i^{T}}{\sigma_i} $$
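
As a quick numerical sanity check (not part of the argument), here is a sketch that builds $X^{\dagger}$ from the rank-$r$ terms of the SVD and compares it against `numpy.linalg.pinv`; the matrix sizes, seed, and tolerance are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, r = 6, 4, 3
X = rng.normal(size=(n, r)) @ rng.normal(size=(r, k))   # a rank-r n x k matrix

# Full SVD: X = U diag(s) V^T, with s padded by zeros beyond the rank
U, s, Vt = np.linalg.svd(X)
r_hat = int(np.sum(s > 1e-10))                           # numerical rank

# X^dagger = sum_{i <= r} v_i u_i^T / sigma_i
X_pinv = sum(np.outer(Vt[i], U[:, i]) / s[i] for i in range(r_hat))

assert np.allclose(X_pinv, np.linalg.pinv(X))
```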

Although the SVD is not unique, $X^{\dagger}$ is unique. This can be shown by arguing that $X^{\dagger}$ is the unique solution to the following four equations:

$$ \begin{aligned} XX^{\dagger}X &= X\\ X^{\dagger}XX^{\dagger} &= X^{\dagger}\\ (XX^{\dagger})^{T} &= XX^{\dagger}\\ (X^{\dagger}X)^{T} &= X^{\dagger}X \end{aligned} $$
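
These four conditions are easy to verify numerically; below is a minimal sketch on a random rank-deficient matrix (sizes and seed are arbitrary choices of mine).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3)) @ rng.normal(size=(3, 4))   # a rank-3, 6 x 4 matrix
P = np.linalg.pinv(X)

assert np.allclose(X @ P @ X, X)        # X X† X = X
assert np.allclose(P @ X @ P, P)        # X† X X† = X†
assert np.allclose((X @ P).T, X @ P)    # X X† is symmetric
assert np.allclose((P @ X).T, P @ X)    # X† X is symmetric
```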

Theorem:

$$ X^{\dagger} = \begin{cases} (X^{T}X)^{-1}X^{T} & \text{when } \text{rank}(X) = k,\\ X^{T}(XX^{T})^{-1} & \text{when } \text{rank}(X) = n \end{cases} $$

Proof: When $\text{rank}(X) = k$, the SVD and pseudoinverse of $X$ take the form:

$$ \begin{aligned} X &= U_{n \times n} \begin{bmatrix} D_{k \times k}\\ 0_{(n - k) \times k}\end{bmatrix}V_{k \times k}^{T}\\ X^{\dagger} &= V_{k \times k} \begin{bmatrix} D^{-1}_{k \times k} & 0_{k \times (n - k)} \end{bmatrix}U_{n \times n}^{T} \end{aligned} $$

Then, $X^{T}X = VD^{2}V^{T}$, so $(X^{T}X)^{-1}X^{T} = VD^{-2}V^{T}V\begin{bmatrix} D & 0 \end{bmatrix}U^{T} = V \begin{bmatrix} D^{-1} & 0 \end{bmatrix} U^{T} = X^{\dagger}$. The other case follows similarly.
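
A quick numerical check of the full-column-rank case (a sketch with arbitrary sizes; a Gaussian $X$ has full column rank with probability one):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 3
X = rng.normal(size=(n, k))              # full column rank with probability one

lhs = np.linalg.inv(X.T @ X) @ X.T       # (X^T X)^{-1} X^T
assert np.allclose(lhs, np.linalg.pinv(X))
```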

Pseudoinverse Solution to OLS

Consider the minimum $L_2$ norm least squares solution:

$$ \beta_{\text{min}} = \arg\min \left\{\|\beta\|_{2}^2 : \beta \text{ minimizes } \|y - X\beta\|_2^2 \right\} $$

Theorem: Consider $\hat{\beta} = X^{\dagger}y$. For both consistent and inconsistent $X\beta = y$, we have $\hat{\beta} = \beta_{\text{min}}$.

Proof: We start with the consistent case. Suppose $X\beta_0 = y$. Then, $y = XX^{\dagger}X\beta_0 = XX^{\dagger}y$, so $X^{\dagger}y$ solves $X\beta = y$. Next, note that the solution set is $X^{\dagger}y + N(X)$; that is, every solution takes the form $z = X^{\dagger}y + n$ where $n \in N(X)$. Further, we have $X^{\dagger}y \in R(X^{\dagger}) = R(X^{T})$, so $X^{\dagger}y \perp n$. By the Pythagorean theorem, it follows that:

$$ \|z\|_2^2 = \|X^{\dagger}y\|_2^2 + \|n\|_2^2 \geq \|X^{\dagger}y\|_2^2 $$

Equality is only possible if $n = 0$, so $X^{\dagger}y$ is the unique minimum norm solution.

For the inconsistent case, note $X^{T}XX^{\dagger}y = X^{T}(XX^{\dagger})^{T}y = (XX^{\dagger}X)^{T}y = X^{T}y$. Thus, $X^{\dagger}y$ solves the least squares normal equations $X^{T}X\beta = X^{T}y$. To prove that this is the unique minimum norm solution, note that every solution of the normal equations takes the form $X^{\dagger}y + n$ with $n \in N(X^{T}X) = N(X)$, so the same argument as in the consistent case applies.
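
As a hedged numerical illustration of the theorem (sizes and seed arbitrary), the sketch below checks that $X^{\dagger}y$ solves the normal equations, matches the minimum-norm solution returned by `numpy.linalg.lstsq`, and that adding a null-space direction preserves the residual while increasing the norm.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 10, 6
X = rng.normal(size=(n, 4)) @ rng.normal(size=(4, k))   # rank-deficient design
y = rng.normal(size=n)

beta_hat = np.linalg.pinv(X) @ y

# X† y solves the normal equations X^T X beta = X^T y
assert np.allclose(X.T @ X @ beta_hat, X.T @ y)

# ...and matches the minimum-norm least squares solution from lstsq
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)

# Adding a null-space direction keeps the fit but increases the norm
null_dir = np.linalg.svd(X)[2][-1]       # right singular vector with sigma = 0
assert np.allclose(X @ (beta_hat + null_dir), X @ beta_hat)
assert np.linalg.norm(beta_hat + null_dir) > np.linalg.norm(beta_hat)
```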

GD Solution to OLS

Theorem: Initialize $\beta^{0} = 0$ and consider running GD on the least squares loss, yielding the iterates:

$$ \beta^{t} = \beta^{t - 1} + \eta X^{T}\left(y - X\beta^{t - 1}\right) $$

Take $\eta < \frac{1}{\lambda_{\text{max}}(X^{T}X)}$, where $\lambda_{\text{max}}(X^{T}X)$ is the largest eigenvalue of $X^{T}X$. Then, $\beta^{t} \rightarrow \beta_{\text{min}}$ as $t \rightarrow \infty$.

Proof: Note that $f(\beta) = \frac{1}{2}\|y - X\beta\|_2^2$ is convex. Additionally, $\nabla f(\beta) = -X^{T}(y - X\beta)$ and $\nabla^2 f(\beta) = X^{T}X$. We have:

$$ \max_{v : \|v\|_{2} = 1} v^{T}X^{T}Xv \leq \lambda_{\text{max}}(X^{T}X) $$

Hence, taking $\eta < \frac{1}{\lambda_{\text{max}}(X^{T}X)}$, the GD Lemma implies that $\beta^{t}$ converges to a least squares solution $\tilde{\beta}$ as $t \rightarrow \infty$. Further, since $\beta^{0} = 0$ and each gradient update adds a vector in $R(X^{T})$, every iterate, and hence the limit $\tilde{\beta}$, lies in $R(X^{T})$. As shown in the proof above, $\beta_{\text{min}} = X^{\dagger}y$ is the unique least squares solution in $R(X^{T})$. Thus, we have $\tilde{\beta} = \beta_{\text{min}}$.
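
The theorem is straightforward to check numerically. Below is a sketch using a synthetic rank-deficient design with known singular values (my own construction, not part of the argument above), zero initialization, and a step size below $1/\lambda_{\text{max}}(X^{T}X)$; the iteration count is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, r = 10, 6, 4

# Rank-deficient design with known singular values (3, 2, 1, 0.5)
U, _ = np.linalg.qr(rng.normal(size=(n, r)))
V, _ = np.linalg.qr(rng.normal(size=(k, r)))
X = U @ np.diag([3.0, 2.0, 1.0, 0.5]) @ V.T
y = rng.normal(size=n)

eta = 0.9 / np.linalg.eigvalsh(X.T @ X).max()   # step size below 1 / lambda_max
beta = np.zeros(k)                              # zero initialization
for _ in range(5000):
    beta = beta + eta * X.T @ (y - X @ beta)    # beta^t = beta^{t-1} + eta X^T (y - X beta^{t-1})

beta_min = np.linalg.pinv(X) @ y
assert np.allclose(beta, beta_min)              # GD from zero converges to the min-norm solution
```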

Note: This result only holds for the zero initialization specified here. If $\beta^{0}$ is a least squares solution with $\beta^{0} \neq \beta_{\text{min}}$, then there are no GD updates, and GD clearly does not converge to $\beta_{\text{min}}$. More generally, for any given initialization, GD converges to the least squares solution with minimum $L_2$ distance from $\beta^{0}$.
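
As a hedged numerical illustration of this last claim, the sketch below runs GD from a random $\beta^{0}$ and compares the limit to $X^{\dagger}y + (I - X^{\dagger}X)\beta^{0}$; this closed form is my own addition, obtained by projecting $\beta^{0}$ onto $N(X)$, and it is the least squares solution closest to $\beta^{0}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, r = 10, 6, 4

U, _ = np.linalg.qr(rng.normal(size=(n, r)))
V, _ = np.linalg.qr(rng.normal(size=(k, r)))
X = U @ np.diag([3.0, 2.0, 1.0, 0.5]) @ V.T     # rank-deficient design
y = rng.normal(size=n)

eta = 0.9 / np.linalg.eigvalsh(X.T @ X).max()
beta0 = rng.normal(size=k)                      # arbitrary nonzero initialization
beta = beta0.copy()
for _ in range(5000):
    beta = beta + eta * X.T @ (y - X @ beta)

X_pinv = np.linalg.pinv(X)
# Least squares solution closest to beta0: X† y plus the projection of beta0 onto N(X)
closest = X_pinv @ y + (np.eye(k) - X_pinv @ X) @ beta0
assert np.allclose(beta, closest)
```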