Convergence of
gradient descent

CPSC 406 – Computational Optimization

Convergence of gradient descent

  • iteration complexity
  • quadratic models
  • strong convexity

Smooth functions

A function \(f:\mathbb{R}^n\to\mathbb{R}\) is \(L\)-smooth (i.e., has an \(L\)-Lipschitz gradient) if

\[ \|\nabla f(x) - \nabla f(y)\| \le L\|x-y\| \quad \forall x,y \]


examples

  • linear: \(f(x) = a^Tx\), with \(a\in\mathbb{R}^n\), has \(L=0\)
  • quadratic: \(f(x) = \frac{1}{2}x^TAx + b^Tx + \gamma\), with \(A\succeq 0\), has \(L=\|A\|_2=\lambda_{\max}(A)\)
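A quick numerical sanity check of the quadratic example (a sketch using NumPy; the matrix, vector, and test points below are arbitrary choices, not part of the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

# quadratic example: f(x) = 0.5 x^T A x + b^T x with A symmetric PSD
n = 5
M = rng.standard_normal((n, n))
A = M @ M.T                       # symmetric positive semidefinite
b = rng.standard_normal(n)

grad = lambda x: A @ x + b        # gradient of the quadratic
L = np.linalg.eigvalsh(A).max()   # L = lambda_max(A) = ||A||_2

# verify ||grad(x) - grad(y)|| <= L ||x - y|| at random pairs of points
for _ in range(1000):
    x, y = rng.standard_normal(n), rng.standard_normal(n)
    assert np.linalg.norm(grad(x) - grad(y)) <= L * np.linalg.norm(x - y) + 1e-12
```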


Second-order characterization

If \(f\) is twice continuously differentiable, then \(f\) is \(L\)-smooth if and only if for all \(x\)

\[ -L I \preceq \nabla^2 f(x) \preceq L I \quad\text{i.e.,}\quad \|\nabla^2 f(x)\|_2 \le L \]
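For the quadratic example above the Hessian is constant, so this characterization recovers the constant stated earlier:

\[ f(x) = \frac{1}{2}x^TAx + b^Tx + \gamma \quad\Longrightarrow\quad \nabla^2 f(x) = A \preceq \lambda_{\max}(A)\,I, \]

so the smallest valid constant is \(L = \lambda_{\max}(A) = \|A\|_2\).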

Descent lemma

If \(f\) is \(L\)-smooth, then for all \(x,z\)

\[ f(z) \le f(x) + \nabla f(x)^T(z-x) + \frac{L}{2}\|z-x\|^2 \]

means that every \(L\)-smooth function is globally majorized by the quadratic formed by its first-order approximation at \(x\) plus the penalty \(\frac{L}{2}\|z-x\|^2\)
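One way to see this: write the gap as an integral of the gradient along the segment from \(x\) to \(z\), then apply Cauchy–Schwarz and the Lipschitz bound:

\[ \begin{aligned} f(z) - f(x) - \nabla f(x)^T(z-x) &= \int_0^1 \big(\nabla f(x + t(z-x)) - \nabla f(x)\big)^T(z-x)\,dt \\ &\le \int_0^1 \|\nabla f(x + t(z-x)) - \nabla f(x)\|\,\|z-x\|\,dt \le \int_0^1 L\,t\,\|z-x\|^2\,dt = \frac{L}{2}\|z-x\|^2 \end{aligned} \]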

Projected gradient descent

Projected gradient method for minimizing \(L\)-smooth \(f\) over a convex set \(C\) \[ x_{k+1} = \proj_C(x_k - \alpha \nabla f(x_k)) \]

By the descent lemma, for any \(\alpha\in(0,\frac1L]\),

\[ \begin{aligned} f(z) &\le f(x) + \nabla f(x)^T(z-x) + \frac{L}{2}\|z-x\|^2 \\ &\le f(x) + \nabla f(x)^T(z-x) + \frac{1}{2\alpha}\|z-x\|^2 \end{aligned} \]

the (projected) gradient descent step minimizes this quadratic upper bound:

\[ \begin{aligned} \proj_C(x - \alpha \nabla f(x)) &= \argmin_{z\in C} \frac{1}{2\alpha}\|z - (x-\alpha\nabla f(x))\|^2 \\ &= \argmin_{z\in C} \frac{\alpha}2\|\nabla f(x)\|^2 + \nabla f(x)^T(z-x)+\frac{1}{2\alpha}\|z-x\|^2 \\ &= \argmin_{z\in C} f(x) + \nabla f(x)^T(z-x) + \frac{1}{2\alpha}\|z-x\|^2 \end{aligned} \]
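A minimal sketch of the method in NumPy (the least-squares objective, the nonnegative orthant as \(C\), and all names below are illustrative assumptions, not part of the notes):

```python
import numpy as np

def projected_gradient(grad, proj, x0, alpha, iters=100):
    """Projected gradient descent: x_{k+1} = proj_C(x_k - alpha * grad(x_k))."""
    x = x0.copy()
    for _ in range(iters):
        x = proj(x - alpha * grad(x))
    return x

# example: minimize 0.5 ||Ax - b||^2 over the nonnegative orthant C = {x : x >= 0}
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

grad = lambda x: A.T @ (A @ x - b)      # gradient of the least-squares objective
proj = lambda x: np.maximum(x, 0.0)     # Euclidean projection onto the orthant
L = np.linalg.norm(A, 2) ** 2           # Lipschitz constant of the gradient

x_sol = projected_gradient(grad, proj, np.zeros(5), alpha=1.0 / L, iters=500)
```

With \(\alpha = 1/L\), each iteration minimizes the quadratic upper bound derived above over \(C\).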

Convergence

  • Let \(C=\mathbb{R}^n\) (unconstrained), \(f_k:=f(x_k)\), \(\nabla f_k:=\nabla f(x_k)\). By the descent lemma,

\[ f_{k+1} \le f_k + \nabla f_k^T(x_{k+1}-x_k) + \frac{L}{2}\|x_{k+1}-x_k\|^2 \]

  • take \(x_{k+1} = x_k - \alpha \nabla f_k\), then \(x_{k+1}-x_k = -\alpha \nabla f_k\) and

\[ \begin{aligned} f_{k+1} &\le f_k - \alpha \nabla f_k^T\nabla f_k + \frac{L}{2}\|- \alpha \nabla f_k\|^2 \\ &= f_k - \alpha \|\nabla f_k\|^2 + \frac{L\alpha^2}{2}\|\nabla f_k\|^2 \\ &= f_k - \alpha\left(1-\frac{\alpha L}{2}\right) \|\nabla f_k\|^2 \end{aligned} \]

  • decreasing objective values

\[ f_{k+1} < f_k \quad\text{if}\quad \alpha\in(0,2/L) \quad\text{and}\quad \nabla f_k\ne 0 \]
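A small numerical illustration of this descent guarantee (a sketch on a random convex quadratic; the data and step size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10
M = rng.standard_normal((n, n))
A = M @ M.T                                   # symmetric PSD Hessian
b = rng.standard_normal(n)

f    = lambda x: 0.5 * x @ A @ x + b @ x
grad = lambda x: A @ x + b
L    = np.linalg.eigvalsh(A).max()

alpha = 1.9 / L                               # any step in (0, 2/L) guarantees descent
x = rng.standard_normal(n)
for _ in range(200):
    fx, g = f(x), grad(x)
    x = x - alpha * g
    # per-step decrease predicted by the bound above
    assert f(x) <= fx - alpha * (1 - alpha * L / 2) * (g @ g) + 1e-8
```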

Nonasymptotic rate

  • if \(\alpha\in(0,2/L]\) then \(f_{k+1} \le f_k - \alpha\left(1-\frac{\alpha L}{2}\right) \|\nabla f_k\|^2\)

  • minimizing the RHS over \(\alpha\in(0,2/L]\) gives \(\alpha^* = 1/L\) and

\[ f_{k+1} \le f_k - \frac{1}{2L} \|\nabla f_k\|^2 \]

  • summing over \(k=0,1,\ldots,T\) and telescoping gives

\[ \frac{1}{2L}\sum_{k=0}^{T}\|\nabla f_k\|^2 \le f(x_0) - f(x_{T+1}) \le f(x_0) - f^*, \quad\text{where $f^*$ is the minimum value} \]

  • bounds the minimum gradient norm \[ \min_{k\in\{0,\ldots,T\}}\ \|\nabla f(x_k)\|^2 \le \frac{1}{T+1}\sum_{k=0}^T \|\nabla f(x_k)\|^2 \le \frac{2L(f(x_0) - f^*)}{T+1} = O(1/T) \]
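The rate can be checked numerically on a quadratic where \(f^*\) is available in closed form (a sketch; the data below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10
M = rng.standard_normal((n, n))
A = M @ M.T + np.eye(n)                 # positive definite, so a minimizer exists
b = rng.standard_normal(n)

f    = lambda x: 0.5 * x @ A @ x + b @ x
grad = lambda x: A @ x + b
L    = np.linalg.eigvalsh(A).max()
f_star = f(np.linalg.solve(A, -b))      # minimum value, from A x* + b = 0

x = rng.standard_normal(n)
T = 100
f0 = f(x)
grad_norms = []
for _ in range(T + 1):                  # iterates x_0, ..., x_T with alpha = 1/L
    g = grad(x)
    grad_norms.append(g @ g)
    x = x - (1.0 / L) * g

# min_k ||grad f(x_k)||^2 <= 2 L (f(x_0) - f^*) / (T + 1)
assert min(grad_norms) <= 2 * L * (f0 - f_star) / (T + 1) + 1e-8
```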

🎱 Lipschitz constant

The least-squares objective \[ f(x) = \frac{1}{2}\|Ax-b\|^2 \] is \(L\)-smooth. What is \(L\)?

  a. \(\|A\|_2^2\)
  b. \(\lambda_{\max}(A^TA)\)
  c. \(\|A^T b\|\)
  d. a and c
  e. b and c