Here I prove that running GD to solve OLS initialized at zero converges to the minimum $L_2$ norm solution.
Pseudoinverse
Consider an $n \times k$ matrix $X$ with $\text{rank}(X) = r$. By the SVD, we have:
$$X = U \begin{bmatrix} D & 0 \\ 0 & 0 \end{bmatrix} V^T$$
where $D$ is an $r \times r$ diagonal matrix with singular values $\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_r > 0$. $X$ has an additional $\min(n, k) - r$ singular values, all of which equal zero. The $k \times n$ pseudoinverse follows naturally from the SVD:
$$X^\dagger = V \begin{bmatrix} D^{-1} & 0 \\ 0 & 0 \end{bmatrix} U^T = \sum_{i=1}^{r} \frac{v_i u_i^T}{\sigma_i}$$
Although the SVD is not unique, $X^\dagger$ is unique. This can be shown by arguing that $X^\dagger$ is the unique solution to the following four equations:
$$\begin{aligned} X X^\dagger X &= X \\ X^\dagger X X^\dagger &= X^\dagger \\ (X X^\dagger)^T &= X X^\dagger \\ (X^\dagger X)^T &= X^\dagger X \end{aligned}$$
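As a numerical sanity check, here is a minimal numpy sketch (the matrix sizes and random seed are illustrative assumptions) that builds $X^\dagger$ from the SVD, compares it against `np.linalg.pinv`, and verifies the four equations above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Rank-deficient 5x4 matrix; sizes and seed are illustrative choices.
X = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 4))  # rank 2

# Build X^dagger from the SVD by inverting only the nonzero singular values.
U, s, Vt = np.linalg.svd(X)
r = int(np.sum(s > 1e-10))                 # numerical rank
X_pinv = Vt[:r].T @ np.diag(1.0 / s[:r]) @ U[:, :r].T

assert np.allclose(X_pinv, np.linalg.pinv(X))

# The four defining equations above.
assert np.allclose(X @ X_pinv @ X, X)
assert np.allclose(X_pinv @ X @ X_pinv, X_pinv)
assert np.allclose((X @ X_pinv).T, X @ X_pinv)
assert np.allclose((X_pinv @ X).T, X_pinv @ X)
```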
Theorem:
$$X^\dagger = \begin{cases} (X^T X)^{-1} X^T & \text{when } \text{rank}(X) = k, \\ X^T (X X^T)^{-1} & \text{when } \text{rank}(X) = n \end{cases}$$
Proof:
When $\text{rank}(X) = k$, the SVD and pseudoinverse of $X$ take the form:
$$X = U_{n \times n} \begin{bmatrix} D_{k \times k} \\ 0_{(n-k) \times k} \end{bmatrix} V_{k \times k}^T, \qquad X^\dagger = V_{k \times k} \begin{bmatrix} D_{k \times k}^{-1} & 0_{k \times (n-k)} \end{bmatrix} U_{n \times n}^T$$
Then, $X^T X = V D^2 V^T$, so $(X^T X)^{-1} X^T = V D^{-2} V^T V \begin{bmatrix} D & 0 \end{bmatrix} U^T = V \begin{bmatrix} D^{-1} & 0 \end{bmatrix} U^T = X^\dagger$. The other case follows similarly.
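Both closed forms are easy to spot-check numerically; here is a small sketch assuming randomly drawn (hence almost surely full rank) Gaussian matrices:

```python
import numpy as np

rng = np.random.default_rng(1)

# Full column rank: rank(X) = k for a tall matrix (sizes are illustrative).
X = rng.standard_normal((6, 3))
assert np.allclose(np.linalg.pinv(X), np.linalg.inv(X.T @ X) @ X.T)

# Full row rank: rank(X) = n for a wide matrix.
X = rng.standard_normal((3, 6))
assert np.allclose(np.linalg.pinv(X), X.T @ np.linalg.inv(X @ X.T))
```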
Pseudoinverse Solution to OLS
Consider the minimum $L_2$ norm least squares solution:
$$\beta_{\min} = \arg\min \left\{ \|\beta\|_2^2 : \beta \text{ minimizes } \|y - X\beta\|_2^2 \right\}$$
Theorem: Consider $\hat{\beta} = X^\dagger y$. For both consistent and inconsistent $X\beta = y$, we have $\hat{\beta} = \beta_{\min}$.
Proof: We start with the consistent case. Suppose $X\beta_0 = y$. Then $y = X X^\dagger X \beta_0 = X X^\dagger y$, so $X^\dagger y$ solves $X\beta = y$. Every solution therefore takes the form $z = X^\dagger y + n$ where $n \in N(X)$. Further, we have $X^\dagger y \in R(X^\dagger) = R(X^T) = N(X)^\perp$, so $X^\dagger y \perp n$. By the Pythagorean theorem, it follows:
$$\|z\|_2^2 = \|X^\dagger y\|_2^2 + \|n\|_2^2 \geq \|X^\dagger y\|_2^2$$
Equality is only possible if $n = 0$, so $X^\dagger y$ is the unique minimum norm solution.
For the inconsistent case, note $X^T X X^\dagger y = X^T (X X^\dagger)^T y = (X X^\dagger X)^T y = X^T y$. Thus, $X^\dagger y$ solves the least squares normal equations $X^T X \beta = X^T y$. To prove this is the unique minimum norm solution, we use the same argument as in the consistent case.
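To see the minimum-norm property concretely, the sketch below (an illustrative underdetermined system of my choosing, not from the text above) perturbs $X^\dagger y$ by null-space vectors and confirms the system is still solved while the norm only grows:

```python
import numpy as np

rng = np.random.default_rng(2)

# Underdetermined consistent system (n < k): infinitely many exact solutions.
# Sizes and seed are illustrative choices.
X = rng.standard_normal((3, 6))
y = rng.standard_normal(3)

beta_hat = np.linalg.pinv(X) @ y              # candidate minimum-norm solution
assert np.allclose(X @ beta_hat, y)           # solves the system exactly

# Every other solution is beta_hat plus a null-space vector; perturbing
# within N(X) leaves X @ z unchanged but can only increase the norm.
_, _, Vt = np.linalg.svd(X)
null_basis = Vt[3:]                           # rows spanning N(X)
for _ in range(100):
    z = beta_hat + null_basis.T @ rng.standard_normal(3)
    assert np.allclose(X @ z, y)
    assert np.linalg.norm(z) >= np.linalg.norm(beta_hat) - 1e-12
```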
GD Solution to OLS
Theorem: Initialize $\beta_0 = 0$ and consider running GD on the least squares loss, yielding:
$$\beta_t = \beta_{t-1} + \eta X^T (y - X\beta_{t-1})$$
Take $\eta < \frac{1}{\lambda_{\max}(X^T X)}$, where $\lambda_{\max}(X^T X)$ is the maximum eigenvalue of $X^T X$. Then $\lim_{t \to \infty} \beta_t = \beta_{\min}$.
Proof: Note that $f(\beta) = \frac{1}{2}\|y - X\beta\|_2^2$ is convex. Additionally, $\nabla f(\beta) = -X^T(y - X\beta)$ and $\nabla^2 f(\beta) = X^T X$. We have:
$$\max_{v : \|v\|_2 = 1} v^T X^T X v \leq \lambda_{\max}(X^T X)$$
Hence, taking $\eta < \frac{1}{\lambda_{\max}(X^T X)}$, the GD Lemma implies that $\beta_t$ converges to a least squares solution $\tilde{\beta}$ as $t \to \infty$. Further, since $\beta_0 = 0$ and each gradient update adds a vector in $R(X^T)$, every iterate, and hence $\tilde{\beta}$, lies in $R(X^T)$. As shown in the proof above, $\beta_{\min}$ is the unique least squares solution with this property. Thus, we have $\tilde{\beta} = \beta_{\min}$.
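Here is a minimal numpy sketch of the theorem; the problem sizes, seed, step size factor of 0.9, and iteration count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

# Rank-deficient least squares problem; sizes and seed are illustrative.
X = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 5))  # rank 3
y = rng.standard_normal(8)

eta = 0.9 / np.linalg.eigvalsh(X.T @ X).max()  # step size below 1/lambda_max

beta = np.zeros(5)                             # beta_0 = 0
for _ in range(100_000):
    beta = beta + eta * X.T @ (y - X @ beta)

# GD from zero should land on the minimum-norm solution X^dagger y.
print(np.linalg.norm(beta - np.linalg.pinv(X) @ y))  # ~0 after enough steps
```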
Note: This result only holds for the specific initialization we have chosen here. If $\beta_0$ is a least squares solution such that $\beta_0 \neq \beta_{\min}$, then there are no GD updates, and $\beta_t$ clearly does not converge to $\beta_{\min}$. More generally, for any given initialization, GD will converge to the least squares solution with minimum $L_2$ distance from $\beta_0$.
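And a sketch of this more general claim, under the same illustrative assumptions: since the updates never change the $N(X)$ component of $\beta_0$, GD started at a nonzero $\beta_0$ should converge to $\beta_{\min}$ plus the projection of $\beta_0$ onto $N(X)$, which is exactly the least squares solution closest to $\beta_0$:

```python
import numpy as np

rng = np.random.default_rng(4)

# Same illustrative setup, now starting GD from a nonzero beta_0.
X = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 5))  # rank 3
y = rng.standard_normal(8)
eta = 0.9 / np.linalg.eigvalsh(X.T @ X).max()

beta0 = rng.standard_normal(5)
beta = beta0.copy()
for _ in range(100_000):
    beta = beta + eta * X.T @ (y - X @ beta)

# The least squares solution closest to beta_0 is beta_min plus the
# projection of beta_0 onto N(X).
_, s, Vt = np.linalg.svd(X)
null_basis = Vt[int(np.sum(s > 1e-10)):]       # rows spanning N(X)
closest = np.linalg.pinv(X) @ y + null_basis.T @ (null_basis @ beta0)
print(np.linalg.norm(beta - closest))          # ~0 after enough iterations
```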