Minimum Norm OLS and GD

Here I prove that running gradient descent (GD) on the OLS objective from a zero initialization converges to the minimum $L_2$ norm solution.

Pseudoinverse

Consider an $n \times k$ matrix $X$ with $\text{rank}(X) = r$. By the SVD, we have:

$$ X = U \begin{bmatrix} D & 0\\ 0 & 0\end{bmatrix} V^{T} $$

where $D$ is an $r \times r$ diagonal matrix of singular values $\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_r > 0$. $X$ has an additional $\min\{n, k\} - r$ singular values, all of which equal zero. The $k \times n$ pseudoinverse follows naturally from the SVD:

$$ X^{\dagger} = V \begin{bmatrix} D^{-1} & 0\\ 0 & 0\end{bmatrix} U^{T} = \sum_{i = 1}^{r} \frac{v_i u_i^{T}}{\sigma_i} $$
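
As a quick numerical sanity check (not part of the argument), here is a sketch that builds $X^{\dagger}$ from the rank-$r$ terms of the SVD and compares it against `numpy.linalg.pinv`; the matrix sizes, seed, and tolerance are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, r = 6, 4, 3
X = rng.normal(size=(n, r)) @ rng.normal(size=(r, k))   # a rank-r n x k matrix

# Full SVD: X = U diag(s) V^T, with s padded by zeros beyond the rank
U, s, Vt = np.linalg.svd(X)
r_hat = int(np.sum(s > 1e-10))                           # numerical rank

# X^dagger = sum_{i <= r} v_i u_i^T / sigma_i
X_pinv = sum(np.outer(Vt[i], U[:, i]) / s[i] for i in range(r_hat))

assert np.allclose(X_pinv, np.linalg.pinv(X))
```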

Although the SVD is not unique, $X^{\dagger}$ is unique. This can be shown by arguing that $X^{\dagger}$ is the unique solution to the following four equations:

$$ \begin{aligned} XX^{\dagger}X &= X\\ X^{\dagger}XX^{\dagger} &= X^{\dagger}\\ (XX^{\dagger})^{T} &= XX^{\dagger}\\ (X^{\dagger}X)^{T} &= X^{\dagger}X \end{aligned} $$
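
These four conditions are easy to verify numerically; below is a minimal sketch on a random rank-deficient matrix (sizes and seed are arbitrary choices of mine).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3)) @ rng.normal(size=(3, 4))   # a rank-3, 6 x 4 matrix
P = np.linalg.pinv(X)

assert np.allclose(X @ P @ X, X)        # X X† X = X
assert np.allclose(P @ X @ P, P)        # X† X X† = X†
assert np.allclose((X @ P).T, X @ P)    # X X† is symmetric
assert np.allclose((P @ X).T, P @ X)    # X† X is symmetric
```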

Theorem:

$$ X^{\dagger} = \begin{cases} (X^{T}X)^{-1}X^{T} & \text{when } \text{rank}(X) = k,\\ X^{T}(XX^{T})^{-1} & \text{when } \text{rank}(X) = n \end{cases} $$

Proof: When $\text{rank}(X) = k$, the SVD and pseudoinverse of $X$ take the form:

$$ \begin{aligned} X &= U_{n \times n} \begin{bmatrix} D_{k \times k}\\ 0_{(n - k) \times k}\end{bmatrix}V_{k \times k}^{T}\\ X^{\dagger} &= V_{k \times k} \begin{bmatrix} D^{-1}_{k \times k} & 0_{k \times (n - k)} \end{bmatrix}U_{n \times n}^{T} \end{aligned} $$

Then, $X^{T}X = VD^{2}V^{T}$, so $(X^{T}X)^{-1}X^{T} = VD^{-2}V^{T}V\begin{bmatrix} D & 0 \end{bmatrix}U^{T} = V \begin{bmatrix} D^{-1} & 0 \end{bmatrix} U^{T} = X^{\dagger}$. The other case follows similarly.
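
A quick numerical check of the full-column-rank case (a sketch with arbitrary sizes; a Gaussian $X$ has full column rank with probability one):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 3
X = rng.normal(size=(n, k))              # full column rank with probability one

lhs = np.linalg.inv(X.T @ X) @ X.T       # (X^T X)^{-1} X^T
assert np.allclose(lhs, np.linalg.pinv(X))
```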

Pseudoinverse Solution to OLS

Consider the minimum $L_2$ norm least squares solution:

$$ \beta_{\text{min}} = \arg\min \left\{\|\beta\|_{2}^2 : \beta \text{ minimizes } \|y - X\beta\|_2^2 \right\} $$

Theorem: Consider $\hat{\beta} = X^{\dagger}y$. For both consistent and inconsistent $X\beta = y$, we have $\hat{\beta} = \beta_{\text{min}}$.

Proof: We start with the consistent case. Suppose $X\beta_0 = y$. Then, $y = XX^{\dagger}X\beta_0 = XX^{\dagger}y$, so $X^{\dagger}y$ solves $X\beta = y$. Next, note that the solution set is $X^{\dagger}y + N(X)$; that is, every solution takes the form $z = X^{\dagger}y + n$ where $n \in N(X)$. Further, we have $X^{\dagger}y \in R(X^{\dagger}) = R(X^{T})$, so $X^{\dagger}y \perp n$. By the Pythagorean theorem, it follows that:

$$ \|z\|_2^2 = \|X^{\dagger}y\|_2^2 + \|n\|_2^2 \geq \|X^{\dagger}y\|_2^2 $$

Equality is only possible if $n = 0$, so $X^{\dagger}y$ is the unique minimum norm solution.

For the inconsistent case, note $X^{T}XX^{\dagger}y = X^{T}(XX^{\dagger})^{T}y = (XX^{\dagger}X)^{T}y = X^{T}y$. Thus, $X^{\dagger}y$ solves the least squares normal equations $X^{T}X\beta = X^{T}y$. To prove that this is the unique minimum norm solution, note that every solution of the normal equations takes the form $X^{\dagger}y + n$ with $n \in N(X^{T}X) = N(X)$, so the same argument as in the consistent case applies.
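
As a hedged numerical illustration of the theorem (sizes and seed arbitrary), the sketch below checks that $X^{\dagger}y$ solves the normal equations, matches the minimum-norm solution returned by `numpy.linalg.lstsq`, and that adding a null-space direction preserves the residual while increasing the norm.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 10, 6
X = rng.normal(size=(n, 4)) @ rng.normal(size=(4, k))   # rank-deficient design
y = rng.normal(size=n)

beta_hat = np.linalg.pinv(X) @ y

# X† y solves the normal equations X^T X beta = X^T y
assert np.allclose(X.T @ X @ beta_hat, X.T @ y)

# ...and matches the minimum-norm least squares solution from lstsq
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)

# Adding a null-space direction keeps the fit but increases the norm
null_dir = np.linalg.svd(X)[2][-1]       # right singular vector with sigma = 0
assert np.allclose(X @ (beta_hat + null_dir), X @ beta_hat)
assert np.linalg.norm(beta_hat + null_dir) > np.linalg.norm(beta_hat)
```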

GD Solution to OLS

Theorem: Initialize $\beta^{0} = 0$ and consider running GD on the least squares loss, yielding the iterates:

$$ \beta^{t} = \beta^{t - 1} + \eta X^{T}\left(y - X\beta^{t - 1}\right) $$

Take $\eta < \frac{1}{\lambda_{\text{max}}(X^{T}X)}$, where $\lambda_{\text{max}}(X^{T}X)$ is the largest eigenvalue of $X^{T}X$. Then, $\beta^{t} \rightarrow \beta_{\text{min}}$ as $t \rightarrow \infty$.

Proof: Note that $f(\beta) = \frac{1}{2}\|y - X\beta\|_2^2$ is convex. Additionally, $\nabla f(\beta) = -X^{T}(y - X\beta)$ and $\nabla^2 f(\beta) = X^{T}X$. We have:

$$ \max_{v : \|v\|_{2} = 1} v^{T}X^{T}Xv \leq \lambda_{\text{max}}(X^{T}X) $$

Hence, taking $\eta < \frac{1}{\lambda_{\text{max}}(X^{T}X)}$, the GD Lemma implies that $\beta^{t}$ converges to a least squares solution $\tilde{\beta}$ as $t \rightarrow \infty$. Further, since $\beta^{0} = 0$ and each gradient update adds a vector in $R(X^{T})$, every iterate, and hence the limit $\tilde{\beta}$, lies in $R(X^{T})$. As shown in the proof above, $\beta_{\text{min}} = X^{\dagger}y$ is the unique least squares solution in $R(X^{T})$. Thus, we have $\tilde{\beta} = \beta_{\text{min}}$.
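
The theorem is straightforward to check numerically. Below is a sketch using a synthetic rank-deficient design with known singular values (my own construction, not part of the argument above), zero initialization, and a step size below $1/\lambda_{\text{max}}(X^{T}X)$; the iteration count is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, r = 10, 6, 4

# Rank-deficient design with known singular values (3, 2, 1, 0.5)
U, _ = np.linalg.qr(rng.normal(size=(n, r)))
V, _ = np.linalg.qr(rng.normal(size=(k, r)))
X = U @ np.diag([3.0, 2.0, 1.0, 0.5]) @ V.T
y = rng.normal(size=n)

eta = 0.9 / np.linalg.eigvalsh(X.T @ X).max()   # step size below 1 / lambda_max
beta = np.zeros(k)                              # zero initialization
for _ in range(5000):
    beta = beta + eta * X.T @ (y - X @ beta)    # beta^t = beta^{t-1} + eta X^T (y - X beta^{t-1})

beta_min = np.linalg.pinv(X) @ y
assert np.allclose(beta, beta_min)              # GD from zero converges to the min-norm solution
```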

Note: This result only holds for the zero initialization specified here. If $\beta^{0}$ is a least squares solution with $\beta^{0} \neq \beta_{\text{min}}$, then there are no GD updates, and GD clearly does not converge to $\beta_{\text{min}}$. More generally, for any given initialization, GD converges to the least squares solution with minimum $L_2$ distance from $\beta^{0}$.
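
As a hedged numerical illustration of this last claim, the sketch below runs GD from a random $\beta^{0}$ and compares the limit to $X^{\dagger}y + (I - X^{\dagger}X)\beta^{0}$; this closed form is my own addition, obtained by projecting $\beta^{0}$ onto $N(X)$, and it is the least squares solution closest to $\beta^{0}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, r = 10, 6, 4

U, _ = np.linalg.qr(rng.normal(size=(n, r)))
V, _ = np.linalg.qr(rng.normal(size=(k, r)))
X = U @ np.diag([3.0, 2.0, 1.0, 0.5]) @ V.T     # rank-deficient design
y = rng.normal(size=n)

eta = 0.9 / np.linalg.eigvalsh(X.T @ X).max()
beta0 = rng.normal(size=k)                      # arbitrary nonzero initialization
beta = beta0.copy()
for _ in range(5000):
    beta = beta + eta * X.T @ (y - X @ beta)

X_pinv = np.linalg.pinv(X)
# Least squares solution closest to beta0: X† y plus the projection of beta0 onto N(X)
closest = X_pinv @ y + (np.eye(k) - X_pinv @ X) @ beta0
assert np.allclose(beta, closest)
```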