Convergence of
gradient descent

CPSC 406 – Computational Optimization

Convergence of gradient descent

  • iteration complexity
  • quadratic models
  • strong convexity

Smooth functions

A function \(f:\mathbb{R}^n\to\mathbb{R}\) is \(L\)-smooth (i.e., has an \(L\)-Lipschitz gradient) if

\[ \|\nabla f(x) - \nabla f(y)\| \le L\|x-y\| \quad \forall x,y \]


examples

  • linear: \(f(x) = a^Tx\), with \(a\in\mathbb{R}^n\), has \(L=0\)
  • quadratic: \(f(x) = \frac{1}{2}x^TAx + b^Tx + \gamma\), with \(A\succeq 0\), has \(L=\|A\|_2=\lambda_{\max}(A)\)
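A quick numerical sanity check of the quadratic example (a sketch using NumPy; the matrix, vector, and test points below are arbitrary choices, not part of the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

# quadratic example: f(x) = 0.5 x^T A x + b^T x with A symmetric PSD
n = 5
M = rng.standard_normal((n, n))
A = M @ M.T                       # symmetric positive semidefinite
b = rng.standard_normal(n)

grad = lambda x: A @ x + b        # gradient of the quadratic
L = np.linalg.eigvalsh(A).max()   # L = lambda_max(A) = ||A||_2

# verify ||grad(x) - grad(y)|| <= L ||x - y|| at random pairs of points
for _ in range(1000):
    x, y = rng.standard_normal(n), rng.standard_normal(n)
    assert np.linalg.norm(grad(x) - grad(y)) <= L * np.linalg.norm(x - y) + 1e-12
```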


Second-order characterization

If \(f\) is twice continuously differentiable, then \(f\) is \(L\)-smooth if and only if for all \(x\)

\[ -L I \preceq \nabla^2 f(x) \preceq L I \quad\text{i.e.,}\quad \|\nabla^2 f(x)\|_2 \le L \]
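For the quadratic example above the Hessian is constant, so this characterization recovers the constant stated earlier:

\[ f(x) = \frac{1}{2}x^TAx + b^Tx + \gamma \quad\Longrightarrow\quad \nabla^2 f(x) = A \preceq \lambda_{\max}(A)\,I, \]

so the smallest valid constant is \(L = \lambda_{\max}(A) = \|A\|_2\).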

Descent lemma

If \(f\) is \(L\)-smooth, then for all \(x,z\)

\[ f(z) \le f(x) + \nabla f(x)^T(z-x) + \frac{L}{2}\|z-x\|^2 \]

means that every \(L\)-smooth function is globally majorized by the quadratic formed by its first-order approximation at \(x\) plus the penalty \(\frac{L}{2}\|z-x\|^2\)
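One way to see this: write the gap as an integral of the gradient along the segment from \(x\) to \(z\), then apply Cauchy–Schwarz and the Lipschitz bound:

\[ \begin{aligned} f(z) - f(x) - \nabla f(x)^T(z-x) &= \int_0^1 \big(\nabla f(x + t(z-x)) - \nabla f(x)\big)^T(z-x)\,dt \\ &\le \int_0^1 \|\nabla f(x + t(z-x)) - \nabla f(x)\|\,\|z-x\|\,dt \le \int_0^1 L\,t\,\|z-x\|^2\,dt = \frac{L}{2}\|z-x\|^2 \end{aligned} \]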

Projected gradient descent

Projected gradient method for minimizing \(L\)-smooth \(f\) over a convex set \(C\) \[ x_{k+1} = \proj_C(x_k - \alpha \nabla f(x_k)) \]

By the descent lemma, for any \(\alpha\in(0,\frac1L]\),

\[ \begin{aligned} f(z) &\le f(x) + \nabla f(x)^T(z-x) + \frac{L}{2}\|z-x\|^2 \\ &\le f(x) + \nabla f(x)^T(z-x) + \frac{1}{2\alpha}\|z-x\|^2 \end{aligned} \]

the (projected) gradient descent step minimizes this quadratic upper bound:

\[ \begin{aligned} \proj_C(x - \alpha \nabla f(x)) &= \argmin_{z\in C} \frac{1}{2\alpha}\|z - (x-\alpha\nabla f(x))\|^2 \\ &= \argmin_{z\in C} \frac{\alpha}2\|\nabla f(x)\|^2 + \nabla f(x)^T(z-x)+\frac{1}{2\alpha}\|z-x\|^2 \\ &= \argmin_{z\in C} f(x) + \nabla f(x)^T(z-x) + \frac{1}{2\alpha}\|z-x\|^2 \end{aligned} \]
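A minimal sketch of the method in NumPy (the least-squares objective, the nonnegative orthant as \(C\), and all names below are illustrative assumptions, not part of the notes):

```python
import numpy as np

def projected_gradient(grad, proj, x0, alpha, iters=100):
    """Projected gradient descent: x_{k+1} = proj_C(x_k - alpha * grad(x_k))."""
    x = x0.copy()
    for _ in range(iters):
        x = proj(x - alpha * grad(x))
    return x

# example: minimize 0.5 ||Ax - b||^2 over the nonnegative orthant C = {x : x >= 0}
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

grad = lambda x: A.T @ (A @ x - b)      # gradient of the least-squares objective
proj = lambda x: np.maximum(x, 0.0)     # Euclidean projection onto the orthant
L = np.linalg.norm(A, 2) ** 2           # Lipschitz constant of the gradient

x_sol = projected_gradient(grad, proj, np.zeros(5), alpha=1.0 / L, iters=500)
```

With \(\alpha = 1/L\), each iteration minimizes the quadratic upper bound derived above over \(C\).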

Convergence

  • Let \(C=\mathbb{R}^n\) (unconstrained), \(f_k:=f(x_k)\), \(\nabla f_k:=\nabla f(x_k)\). By the descent lemma,

\[ f_{k+1} \le f_k + \nabla f_k^T(x_{k+1}-x_k) + \frac{L}{2}\|x_{k+1}-x_k\|^2 \]

  • take \(x_{k+1} = x_k - \alpha \nabla f_k\), then \(x_{k+1}-x_k = -\alpha \nabla f_k\) and

\[ \begin{aligned} f_{k+1} &\le f_k - \alpha \nabla f_k^T\nabla f_k + \frac{L}{2}\|- \alpha \nabla f_k\|^2 \\ &= f_k - \alpha \|\nabla f_k\|^2 + \frac{L\alpha^2}{2}\|\nabla f_k\|^2 \\ &= f_k - \alpha\left(1-\frac{\alpha L}{2}\right) \|\nabla f_k\|^2 \end{aligned} \]

  • decreasing objective values

\[ f_{k+1} < f_k \quad\text{if}\quad \alpha\in(0,2/L) \quad\text{and}\quad \nabla f_k\ne 0 \]
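A small numerical illustration of this descent guarantee (a sketch on a random convex quadratic; the data and step size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10
M = rng.standard_normal((n, n))
A = M @ M.T                                   # symmetric PSD Hessian
b = rng.standard_normal(n)

f    = lambda x: 0.5 * x @ A @ x + b @ x
grad = lambda x: A @ x + b
L    = np.linalg.eigvalsh(A).max()

alpha = 1.9 / L                               # any step in (0, 2/L) guarantees descent
x = rng.standard_normal(n)
for _ in range(200):
    fx, g = f(x), grad(x)
    x = x - alpha * g
    # per-step decrease predicted by the bound above
    assert f(x) <= fx - alpha * (1 - alpha * L / 2) * (g @ g) + 1e-8
```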

Nonasymptotic rate

  • if \(\alpha\in(0,2/L]\) then \(f_{k+1} \le f_k - \alpha\left(1-\frac{\alpha L}{2}\right) \|\nabla f_k\|^2\)

  • minimizing the RHS over \(\alpha\in(0,2/L]\) gives \(\alpha^* = 1/L\) and

\[ f_{k+1} \le f_k - \frac{1}{2L} \|\nabla f_k\|^2 \]

  • summing over \(k=0,1,\ldots,T\) and telescoping gives

\[ \frac{1}{2L}\sum_{k=0}^{T}\|\nabla f_k\|^2 \le f(x_0) - f(x_{T+1}) \le f(x_0) - f^*, \quad\text{where $f^*$ is the minimum value} \]

  • bounds the minimum gradient norm \[ \min_{k\in\{0,\ldots,T\}}\ \|\nabla f(x_k)\|^2 \le \frac{1}{T+1}\sum_{k=0}^T \|\nabla f(x_k)\|^2 \le \frac{2L(f(x_0) - f^*)}{T+1} = O(1/T) \]
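The rate can be checked numerically on a quadratic where \(f^*\) is available in closed form (a sketch; the data below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10
M = rng.standard_normal((n, n))
A = M @ M.T + np.eye(n)                 # positive definite, so a minimizer exists
b = rng.standard_normal(n)

f    = lambda x: 0.5 * x @ A @ x + b @ x
grad = lambda x: A @ x + b
L    = np.linalg.eigvalsh(A).max()
f_star = f(np.linalg.solve(A, -b))      # minimum value, from A x* + b = 0

x = rng.standard_normal(n)
T = 100
f0 = f(x)
grad_norms = []
for _ in range(T + 1):                  # iterates x_0, ..., x_T with alpha = 1/L
    g = grad(x)
    grad_norms.append(g @ g)
    x = x - (1.0 / L) * g

# min_k ||grad f(x_k)||^2 <= 2 L (f(x_0) - f^*) / (T + 1)
assert min(grad_norms) <= 2 * L * (f0 - f_star) / (T + 1) + 1e-8
```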

🎱 Lipschitz constant

The least-squares objective \[ f(x) = \frac{1}{2}\|Ax-b\|^2 \] is \(L\)-smooth. What is \(L\)?

  a. \(\|A\|_2^2\)
  b. \(\lambda_{\max}(A^TA)\)
  c. \(\|A^T b\|\)
  d. a and c
  e. b and c