Convergence of
gradient descent

CPSC 406 – Computational Optimization

Convergence of gradient descent

  • iteration complexity
  • quadratic models
  • descent lemma
  • smoothness and strong convexity

Example I: Step size selection

Gradient descent for \(f(x) = x^2\) with different step sizes.
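
A minimal sketch of this experiment (the step sizes and starting point below are illustrative assumptions, not values from the course): since \(f'(x)=2x\), the smoothness constant is \(L=2\), so step sizes below \(2/L=1\) shrink the iterates while larger ones diverge.

```python
# Sketch: gradient descent on f(x) = x^2 with f'(x) = 2x, so L = 2.
# The step sizes and starting point below are illustrative assumptions.
def grad_descent_1d(x0, alpha, iters=20):
    x = x0
    for _ in range(iters):
        x = x - alpha * 2 * x          # x_{k+1} = x_k - alpha * f'(x_k)
    return x

for alpha in (0.1, 0.5, 0.9, 1.1):     # 2/L = 1: alpha < 1 converges, alpha > 1 diverges
    print(f"alpha = {alpha:3.1f}  ->  x_20 = {grad_descent_1d(1.0, alpha): .3e}")
```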

Example II: Logistic regression

Gradient descent for logistic regression \(f(\theta) = \frac{1}{n} \sum_{i=1}^n \log(1 + \exp(-y_i\, x_i^T \theta))\)
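
A sketch of gradient descent on this objective with synthetic data (the dataset, dimensions, iteration count, and the smoothness bound \(\|X\|_2^2/(4n)\) used for the step size are assumptions for illustration, not from the course):

```python
# Sketch with synthetic data (dataset, dimensions, and iteration count are assumptions):
# gradient descent on f(theta) = (1/n) * sum_i log(1 + exp(-y_i x_i^T theta)).
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = np.sign(X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n))

def loss(theta):
    return np.logaddexp(0.0, -y * (X @ theta)).mean()   # log(1 + exp(-m)), numerically stable

def grad(theta):
    m = np.clip(y * (X @ theta), -30, 30)
    s = 1.0 / (1.0 + np.exp(m))                          # sigmoid(-y_i x_i^T theta)
    return -(X.T @ (y * s)) / n

L = np.linalg.norm(X, 2) ** 2 / (4 * n)                  # smoothness bound for this loss
theta = np.zeros(d)
for _ in range(500):
    theta -= (1.0 / L) * grad(theta)
print("final loss:", loss(theta))
```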

Smooth functions

A function \(f:\mathbb{R}^n\to\mathbb{R}\) is \(L\)-smooth (i.e., has an \(L\)-Lipschitz gradient) if

\[ \|\nabla f(x) - \nabla f(y)\| \le L\|x-y\| \quad \forall x,y \]


Examples

  • linear: \(f(x) = a^Tx\), with \(a\in\mathbb{R}^n\), has \(L=0\)
  • quadratic: \(f(x) = \frac{1}{2}x^TAx + b^Tx + \gamma\), with \(A\succeq 0\), has \(L=\|A\|_2=\lambda_{\max}(A)\)
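
A quick numerical check of the quadratic example (the matrix below is an arbitrary assumed instance): since \(\nabla f(x) = Ax + b\), the gradient difference is \(A(x-y)\), bounded in norm by \(\|A\|_2\|x-y\|\).

```python
# Quick check of the quadratic example (the matrix A is an arbitrary assumed instance):
# grad f(x) = Ax + b, so ||grad f(x) - grad f(y)|| = ||A(x - y)|| <= ||A||_2 ||x - y||.
import numpy as np

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])                    # symmetric positive semidefinite
L = np.linalg.eigvalsh(A).max()               # L = lambda_max(A) = ||A||_2
x, y = np.array([1.0, -2.0]), np.array([0.3, 0.7])
lhs = np.linalg.norm(A @ (x - y))             # b cancels in the gradient difference
print(lhs <= L * np.linalg.norm(x - y))       # True
```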


Second-order characterization

If \(f\) is convex and twice continuously differentiable, then \(f\) is \(L\)-smooth if and only if for all \(x\)

\[ \nabla^2 f(x) \preceq L I \quad\text{ie,}\quad \|\nabla^2 f(x)\|_2 \le L \]

Question

What is the Lipschitz constant \(L\) for the gradient of the function \[ f(x) = \frac{1}{2}\|cAx - b\|^2 \, ? \]

  1. \(L = c\,\|A\|^2\)

  2. \(L = c^2\,\|A\|^2\)

  3. \(L = \|A\|^2\)

  4. \(L = \frac{\|A\|^2}{c}\)

Descent lemma

If \(f\) is \(L\)-smooth, then for all \(x,z\)

\[ f(z) \le f(x) + \nabla f(x)^T(z-x) + \frac{L}{2}\|z-x\|^2 \]

This means that any \(L\)-smooth function is globally majorized by its quadratic approximation at any point
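
A small numerical check of the lemma on an assumed test function, \(f(x)=\log(1+e^x)\), which is \(L\)-smooth with \(L=1/4\):

```python
# Check of the descent lemma on an assumed test function f(x) = log(1 + e^x),
# which is L-smooth with L = 1/4 (its second derivative is at most 1/4).
import numpy as np

f = lambda x: np.log1p(np.exp(x))
g = lambda x: 1.0 / (1.0 + np.exp(-x))        # f'(x) = sigmoid(x)
L = 0.25

x = 0.3
for z in np.linspace(-5.0, 5.0, 11):
    upper = f(x) + g(x) * (z - x) + 0.5 * L * (z - x) ** 2
    assert f(z) <= upper + 1e-12               # quadratic majorization holds at every z
print("descent lemma verified at sampled points")
```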

Projected gradient descent

  • projected gradient method for minimizing \(L\)-smooth \(f\) over a convex set \(C\) \[ x_{k+1} = \proj_C(x_k - \alpha \nabla f(x_k)) \]

  • by the descent lemma, since \(\alpha \le 1/L\) implies \(\tfrac{L}{2} \le \tfrac{1}{2\alpha}\),

\[ \begin{aligned} f(z) &\le f(x) + \nabla f(x)^T(z-x) + \frac{L}{2}\|z-x\|^2 \le f(x) + \nabla f(x)^T(z-x) + \frac{1}{2\alpha}\|z-x\|^2 \end{aligned} \]

  • the projected gradient step (with step size \(\alpha\)) minimizes this quadratic upper bound (a numerical sketch follows the derivation):

\[ \begin{aligned} \proj_C(x - \alpha \nabla f(x)) &= \argmin_{z\in C} \frac{1}{2\alpha}\|z - (x-\alpha\nabla f(x))\|^2 \\ &= \argmin_{z\in C} \frac{\alpha}2\|\nabla f(x)\|^2 + \nabla f(x)^T(z-x)+\frac{1}{2\alpha}\|z-x\|^2 \\ &= \argmin_{z\in C} f(x) + \nabla f(x)^T(z-x) + \frac{1}{2\alpha}\|z-x\|^2 \end{aligned} \]
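
A minimal sketch of the method on an assumed instance: nonnegative least squares, where the projection onto \(C=\{x : x \ge 0\}\) is coordinatewise clipping at zero.

```python
# Sketch of projected gradient descent on an assumed instance: nonnegative least squares
# min_{x >= 0} 0.5 ||Ax - b||^2, where proj_C is coordinatewise clipping at zero.
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 10))
b = rng.standard_normal(30)

L = np.linalg.norm(A, 2) ** 2                  # smoothness constant of 0.5 ||Ax - b||^2
alpha = 1.0 / L
proj = lambda z: np.maximum(z, 0.0)            # projection onto C = {x : x >= 0}

x = np.zeros(10)
for _ in range(300):
    x = proj(x - alpha * A.T @ (A @ x - b))    # x_{k+1} = proj_C(x_k - alpha * grad f(x_k))
print("objective:", 0.5 * np.linalg.norm(A @ x - b) ** 2, " min entry:", x.min())
```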

Convergence

  • Let \(C=\mathbb{R}^n\) (unconstrained), \(f_k:=f(x_k)\), \(\nabla f_k:=\nabla f(x_k)\). By the descent lemma,

\[ f_{k+1} \le f_k + \nabla f_k^T(x_{k+1}-x_k) + \frac{L}{2}\|x_{k+1}-x_k\|^2 \]

  • take \(x_{k+1} = x_k - \alpha \nabla f_k\), then \(x_{k+1}-x_k = -\alpha \nabla f_k\) and

\[ \begin{aligned} f_{k+1} &\le f_k - \alpha \nabla f_k^T\nabla f_k + \frac{L}{2}\|- \alpha \nabla f_k\|^2 \\ &= f_k - \alpha \|\nabla f_k\|^2 + \frac{L\alpha^2}{2}\|\nabla f_k\|^2 \\ &= f_k - \alpha\left(1-\frac{\alpha L}{2}\right) \|\nabla f_k\|^2 \end{aligned} \]

  • decreasing objective values

\[ f_{k+1} < f_k \quad\text{if}\quad \alpha\in(0,2/L) \quad\text{and}\quad \nabla f_k\ne 0 \]

Nonasymptotic rate

  • if \(\alpha\in(0,2/L]\) then \(f_{k+1} \le f_k - \alpha\left(1-\frac{\alpha L}{2}\right) \|\nabla f_k\|^2\)

  • minimizing the right-hand side over \(\alpha\in(0,2/L]\) gives \(\alpha^* = 1/L\) and

\[ f_{k+1} \le f_k - \frac{1}{2L} \|\nabla f_k\|^2 \]

  • sum over \(k=0,1,\ldots,T-1\) and telescope to get

\[ \frac1{2L}\sum_{k=0}^{T-1}\|\nabla f_k\|^2 \le f(x_0) - f(x_T) \le f(x_0) - f^*, \quad\text{where $f^*$ is the minimum value} \]

  • this bounds the smallest gradient norm (a numerical check follows) \[ \min_{k\in\{0,\ldots,T-1\}}\ \|\nabla f(x_k)\|^2 \le \frac1T\sum_{k=0}^{T-1} \|\nabla f(x_k)\|^2 \le \frac{2L(f(x_0) - f^*)}{T} = O(1/T) \]
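
A numerical check of this bound on an assumed least-squares instance, running \(T\) gradient descent steps with \(\alpha = 1/L\):

```python
# Numerical check of min_k ||grad f(x_k)||^2 <= 2 L (f(x_0) - f*) / T on an assumed
# least-squares instance, using gradient descent with alpha = 1/L.
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((40, 15))
b = rng.standard_normal(40)
f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
grad = lambda x: A.T @ (A @ x - b)

L = np.linalg.norm(A, 2) ** 2
fstar = f(np.linalg.lstsq(A, b, rcond=None)[0])   # minimum value, used only in the bound

x, T, min_sq = np.zeros(15), 200, np.inf
for _ in range(T):
    min_sq = min(min_sq, np.linalg.norm(grad(x)) ** 2)
    x -= (1.0 / L) * grad(x)
print(min_sq, "<=", 2 * L * (f(np.zeros(15)) - fstar) / T)
```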

Question

Convergence rate of gradient descent

Let \(f:\mathbb{R}^n\to\mathbb{R}\) be a convex, \(L\)-smooth function. When applying gradient descent with a constant step size \(\alpha=1/L\), which of the following statements about the convergence is true?

  1. The function values \(f(x_k)\) decrease quadratically with the number of iterations \(k\).
  2. The gradient norms \(\|\nabla f(x_k)\|\) converge to zero at a rate \(O(1/k)\).
  3. The method achieves a convergence rate of \(O(e^{-k})\) for the function values.
  4. The sequence \(\{x_k\}\) generated converges to the minimizer in a finite number of steps.

Strong convexity

A function \(f:\mathbb{R}^n\to\mathbb{R}\) is \(\mu\)-strongly convex (with \(\mu>0\)) if for all \(x,z\)

\[ f(z) \ge f(x) + \nabla f(x)^T(z-x) + \frac{\mu}{2}\|z-x\|^2 \]


If \(f\) is twice continuously differentiable, then \(f\) is \(\mu\)-strongly convex if and only if for all \(x\)

\[ d^T\nabla^2 f(x) d \ge \mu\|d\|^2 \quad \forall d\in\mathbb{R}^n \quad\iff\quad \nabla^2 f(x) \succeq \mu I \]

Example (Quadratic functions) For a positive definite matrix \(A\), the function \[ f(x) = \frac{1}{2}x^TAx + b^Tx + \gamma \] is \(\mu\)-strongly convex with \(\mu=\lambda_{\min}(A)\).

Alternative characterization

A function \(f\) is \(\mu\)-strongly convex if and only if the function

\[ g(x) = f(x) - \frac{\mu}{2}\|x\|^2 \] is convex.

  • Implies that Tikhonov regularization (adding a multiple of \(\|x\|^2\)) induces strong convexity; a small illustration follows
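
A minimal illustration on an assumed example: adding \(\frac{\mu}{2}\|x\|^2\) to a quadratic with a singular Hessian shifts every Hessian eigenvalue up by \(\mu\).

```python
# Illustration on an assumed example: adding (mu/2)||x||^2 to a quadratic with a singular
# Hessian shifts every Hessian eigenvalue up by mu, making the sum mu-strongly convex.
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.0]])                     # rank deficient: lambda_min = 0
mu = 0.1
H_reg = A + mu * np.eye(2)                     # Hessian of 0.5 x^T A x + (mu/2)||x||^2
print(np.linalg.eigvalsh(A).min(), np.linalg.eigvalsh(H_reg).min())   # ~0.0 vs ~0.1
```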

Distance to solution

Lemma 1 (Lipschitz smooth) If \(f\) is \(L\)-smooth, then for all \(x\) and all minimizers \(x^*\) with \(f^*=f(x^*)\), \[ \frac{1}{2L}\|\nabla f(x)\|^2 \le f(x) - f^* \le \frac{L}{2}\|x-x^*\|^2 \]

  • gradient norm does not bound the distance to the solution

Lemma 2 (Strongly convex) If \(f\) is \(\mu\)-strongly convex, then for all \(x\) and all minimizers \(x^*\) with \(f^*=f(x^*)\), \[ \frac{\mu}{2}\|x-x^*\|^2 \le f(x) - f^* \le \frac{1}{2\mu}\|\nabla f(x)\|^2 \]

Smoothness and strong convexity

  • \(L\)-smoothness implies

\[ f(y) \le f(x) + \nabla f(x)^T(y-x) + \frac{L}{2}\|y-x\|^2 \]

  • \(\mu\)-strong convexity implies

\[ f(y) \ge f(x) + \nabla f(x)^T(y-x) + \frac{\mu}{2}\|y-x\|^2 \]

  • together, for all \(x,y\)

\[ \frac{\mu}{2}\|y-x\|^2 \le f(y) - f(x) - \nabla f(x)^T(y-x) \le \frac{L}{2}\|y-x\|^2 \]

  • for twice continuously differentiable \(f\), this implies that the Hessian eigenvalues are bounded above and below:

\[ \mu I \preceq \nabla^2 f(x) \preceq L I \quad \forall x \]

Linear convergence

Linear convergence with strong convexity

  • under \(L\)-smoothness, we deduced the per-iteration decrease \[ f_{k+1} \le f_k - \frac{1}{2L} \|\nabla f_k\|^2 \]

  • under \(\mu\)-strong convexity, Lemma 2 gives \(\tfrac{1}{2\mu}\|\nabla f_k\|^2 \ge f_k - f^*\), hence

\[ f_{k+1} \le f_k - \frac{\mu}{L}(f_k - f^*) \quad\iff\quad f_{k+1} - f^* \le (1-\frac{\mu}{L})(f_k - f^*) \]

  • unrolling the recursion from \(k=T\) down to \(k=0\) gives

\[ f_T - f^* \le (1-\frac{\mu}{L})^T(f_0 - f^*) \le \exp\left(-\frac{\mu}{L}T\right)(f_0 - f^*) \]

  • if we require \(f_T - f^* \le \epsilon\), it’s sufficient to run \(T\) iterations such that

\[ T \ge \frac{L}{\mu}\log\left(\frac{f_0 - f^*}{\epsilon}\right) \quad \text{where $\frac{L}{\mu}$ is the condition number} \]
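
A sketch verifying the linear rate on an assumed strongly convex quadratic: each per-iteration ratio \((f_{k+1}-f^*)/(f_k-f^*)\) should be at most \(1-\mu/L\).

```python
# Check of the linear rate on an assumed strongly convex quadratic: the per-iteration
# ratio (f_{k+1} - f*) / (f_k - f*) should never exceed 1 - mu/L.
import numpy as np

rng = np.random.default_rng(3)
M = rng.standard_normal((10, 10))
A = M.T @ M + 0.5 * np.eye(10)                 # positive definite Hessian
b = rng.standard_normal(10)

mu, L = np.linalg.eigvalsh(A).min(), np.linalg.eigvalsh(A).max()
xstar = np.linalg.solve(A, -b)                 # minimizer of 0.5 x^T A x + b^T x
f = lambda x: 0.5 * x @ A @ x + b @ x
fstar = f(xstar)

x, gap = np.zeros(10), []
for _ in range(50):
    gap.append(f(x) - fstar)
    x -= (1.0 / L) * (A @ x + b)               # gradient step with alpha = 1/L
print("max observed ratio:", max(g2 / g1 for g1, g2 in zip(gap, gap[1:])), " bound:", 1 - mu / L)
```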