CPSC 406 – Computational Optimization
\[ \def\argmin{\operatorname*{argmin}} \def\Ball{\mathbf{B}} \def\bmat#1{\begin{bmatrix}#1\end{bmatrix}} \def\Diag{\mathbf{Diag}} \def\half{\tfrac12} \def\int{\mathop{\rm int}} \def\ip#1{\langle #1 \rangle} \def\maxim{\mathop{\hbox{\rm maximize}}} \def\maximize#1{\displaystyle\maxim_{#1}} \def\minim{\mathop{\hbox{\rm minimize}}} \def\minimize#1{\displaystyle\minim_{#1}} \def\norm#1{\|#1\|} \def\Null{{\mathbf{null}}} \def\proj{\mathbf{proj}} \def\R{\mathbb R} \def\Re{\mathbb R} \def\Rn{\R^n} \def\rank{\mathbf{rank}} \def\range{{\mathbf{range}}} \def\sign{{\mathbf{sign}}} \def\span{{\mathbf{span}}} \def\st{\hbox{\rm subject to}} \def\T{^\intercal} \def\textt#1{\quad\text{#1}\quad} \def\trace{\mathbf{trace}} \]
Consider the quadratic function with \(H\) symmetric and positive definite \[ f(x) = \frac{1}{2} x\T H x, \qquad H = U\Lambda U\T \]
The level sets of \(f\) are ellipsoids whose principal axes are the columns of \(U\):
Gradient descent from two starting points; the iterates zigzag toward the minimizer:
Let \(x^1, x^2,\ldots\) be the iterates generated by gradient descent with exact linesearch. Then
\[ (x^{k+1} - x^k)^T (x^{k+2}-x^{k+1}) = 0 \]
Proof: exact steplength satisfies
\[ \alpha^k = \argmin_{\alpha>0} \phi(\alpha):=f(x^k + \alpha d^k), \quad d^k = -\nabla f(x^k) \]
\[ 0 = \phi'(\alpha^k) = \frac{d}{d\alpha} f(\underbrace{x^k + \alpha^k d^k}_{=x^{k+1}}) = (d^k)^T \nabla f(x^{k+1}) = - \nabla f(x^k)^T \nabla f(x^{k+1}) \]
\[ \nabla f(x^k)^T \nabla f(x^{k+1}) =0 \quad\Longleftrightarrow\quad (x^{k+1} - x^k)^T (x^{k+2}-x^{k+1}) = 0 \]
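As a quick numerical sanity check, here is a minimal sketch (not part of the notes; the matrix \(H\) and starting point are made up) that runs gradient descent with the exact steplength on \(f(x)=\half x\T Hx\) and verifies that consecutive steps are orthogonal.

```python
# Minimal sketch: gradient descent with exact linesearch on f(x) = 1/2 x'Hx.
# H and the starting point are made-up; for a quadratic the exact steplength
# has the closed form alpha = (g'g)/(g'Hg) with g = grad f(x) = Hx.
import numpy as np

H = np.array([[10.0, 0.0],
              [0.0,  1.0]])          # symmetric positive definite
x = np.array([1.0, 2.0])             # starting point

steps = []
for k in range(20):
    g = H @ x                         # gradient of 1/2 x'Hx
    alpha = (g @ g) / (g @ H @ g)     # exact minimizer of phi(a) = f(x - a*g)
    x_next = x - alpha * g
    steps.append(x_next - x)
    x = x_next

# consecutive steps are orthogonal: (x^{k+1}-x^k)'(x^{k+2}-x^{k+1}) = 0
print(max(abs(s @ t) for s, t in zip(steps, steps[1:])))   # ~ machine precision
```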
The condition number of an \(n\times n\) positive definite matrix \(H\) is \[ \kappa(H) = \frac{\lambda_{\max}(H)}{\lambda_{\min}(H)}\geq 1 \]
The condition number of \(f\) at a minimizer \(x^*\) is defined as the condition number of its Hessian there: \[ \kappa(f) = \kappa(\nabla^2 f(x^*)) \]
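For example (a made-up matrix, not from the notes), the condition number can be read off from the extreme eigenvalues:

```python
# kappa(H) = lambda_max / lambda_min for a positive definite H (made-up example)
import numpy as np

H = np.array([[10.0, 3.0],
              [3.0,  2.0]])
lam = np.linalg.eigvalsh(H)           # eigenvalues in ascending order
print(lam.max() / lam.min())          # condition number, always >= 1
```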
\[ \min_x f(x) \qquad f:\Rn\to\R \]
Make a linear change of variables \(x=Sy\), where \(S\) is nonsingular, to get the rescaled problem \[ \min_y\ g(y):=f(Sy) \]
Apply gradient descent to the scaled problem \[ y^{k+1} = y^k - \alpha^k \nabla g(y^k) \textt{with} \nabla g(y) = S\T \nabla f(Sy) \]
Multiply on the left by \(S\) to recover the \(x\)-update \[ x^{k+1} = S y^{k+1} = S(y^k - \alpha^k \nabla g(y^k)) = x^k - \alpha^k S S^T\nabla f(x^k) \]
scaled gradient method \[ x^{k+1} = x^k + \alpha^k d^k, \qquad d^k = -\underbrace{S S^T}_{\succ0} \nabla f(x^k) \]
If \(\nabla f(x)\neq 0\), the scaled negative gradient \(d=-SS^T\nabla f(x)\) is a descent direction \[ f'(x; d) = d^T \nabla f(x) = -\nabla f(x)^T(SS^T)\nabla f(x) < 0 \] because \(D := SS^T\succ 0\)
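A minimal sketch of one scaled gradient step, assuming a fixed nonsingular \(S\) (the matrix \(H\), the scaling \(S\), and the function names are illustrative, not from the notes):

```python
# One scaled gradient step: x^{k+1} = x^k - alpha * (S S') grad f(x^k).
# H, S, and the starting point are illustrative choices.
import numpy as np

def scaled_gradient_step(x, grad, S, alpha):
    D = S @ S.T                       # D = SS' is positive definite
    return x - alpha * (D @ grad(x))

H = np.array([[10.0, 0.0], [0.0, 1.0]])
grad = lambda x: H @ x                # gradient of 1/2 x'Hx
S = np.diag(np.diag(H) ** -0.5)       # a diagonal scaling; here D = H^{-1}
x = np.array([1.0, 2.0])
x = scaled_gradient_step(x, grad, S, alpha=1.0)
print(x)                              # exactly 0: one step suffices when D = H^{-1}
```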
Recall: a symmetric matrix \(D\) is positive definite if and only if \(x\T D x > 0\) for all \(x \neq 0\), or equivalently, if all of its eigenvalues are positive.
Observe the relationship between optimizing \(f\) and optimizing its scaled counterpart \(g\):
\[ \min_y g(y) = f(Sy) \quad\text{with}\quad x \equiv S y \]
condition number of \(\nabla^2 f(x)\) governs convergence of gradient descent
\[ \nabla^2 g(y) = S\T \nabla^2 f(Sy) S \]
\[ f(x) = \half x^T H x + b\T x + \gamma, \quad \nabla^2f(x) = H=U\Lambda U^T\succ 0 \]
Choosing \(S = H^{-1/2} = U\Lambda^{-1/2}U\T\) gives \[ \kappa(S^T H S) = \kappa( H^{-1/2} H H^{-1/2}) = \kappa(I) = 1 \]
Close to the solution \(x^*\), the level sets of
\(\bullet\ f\) are ellipsoids and \(\kappa(f)>1\)
\(\bullet\ g\) are circles for ideal \(S\) because \(\kappa(g)\approx 1\)
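A small numerical check (with a made-up \(H\), not from the notes) that the choice \(S = H^{-1/2}\) makes the scaled Hessian perfectly conditioned:

```python
# S = H^{-1/2} gives S'HS = I, so kappa(g) = 1 and the level sets of g are circles.
import numpy as np

H = np.array([[10.0, 3.0],
              [3.0,  2.0]])
lam, U = np.linalg.eigh(H)
S = U @ np.diag(lam ** -0.5) @ U.T    # H^{-1/2}
print(np.round(S.T @ H @ S, 12))      # identity matrix => condition number 1
```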
Make \((S^{(k)})\T\nabla^2 f(x^{(k)})S^{(k)}\) as well conditioned as possible \[ S^{(k)}(S^{(k)})^T = \begin{cases} (\nabla^2 f(x^{(k)}))^{-1} & \text{Newton ($\kappa = 1$)}\\[1ex] (\nabla^2 f(x^{(k)})+\lambda I)^{-1} & \text{damped Newton} \\[1ex] \Diag\left(\frac{\partial^2 f(x^{(k)})}{\partial x_i^2}\right)^{-1} & \text{diagonal scaling} \end{cases} \]
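A sketch of how each scaling choice could be formed in code (the function name, Hessian, and gradient below are made-up illustrations, not from the notes):

```python
# D = S S' for the three scalings above, applied to a made-up Hessian and gradient.
import numpy as np

def scaling_matrix(hess, kind, lam=1.0):
    if kind == "newton":
        return np.linalg.inv(hess)                                # (grad^2 f)^{-1}
    if kind == "damped":
        return np.linalg.inv(hess + lam * np.eye(hess.shape[0]))  # (grad^2 f + lam*I)^{-1}
    if kind == "diagonal":
        return np.diag(1.0 / np.diag(hess))                       # Diag(d^2 f / dx_i^2)^{-1}
    raise ValueError(kind)

H = np.array([[10.0, 3.0], [3.0, 2.0]])   # stand-in for grad^2 f(x^k)
g = np.array([1.0, -1.0])                 # stand-in for grad f(x^k)
for kind in ("newton", "damped", "diagonal"):
    print(kind, -scaling_matrix(H, kind) @ g)   # scaled descent direction d^k
```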
\[ \min_{x\in \Rn} \quad f(x):=\half\|r(x)\|_2^2, \quad r:\Rn\to\R^m \quad\text{(typically, $m > n$).} \]
\[ r(x) = \begin{bmatrix} r_1(x) \\ r_2(x) \\ \vdots \\ r_m(x) \end{bmatrix}, \quad \nabla f(x) = J(x)^T r(x), \quad J(x) = \begin{bmatrix} \nabla r_1(x)^T \\ \nabla r_2(x)^T \\ \vdots \\ \nabla r_m(x)^T \end{bmatrix} \quad \]
For linear least squares the residual is affine, \[ r(x) = Ax-b \] and the Jacobian is constant, \(J(x)=A\).
Example: estimate a position \(x\) from measured distances \(d_i\) to known points \(b_i\), \[ \min_x \quad \half\sum_{i=1}^m r_i(x)^2, \quad r_i(x) = \|x-b_i\|_2 - d_i \]
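A sketch of this residual and its Jacobian in code (the points \(b_i\), the distances \(d_i\), and the function names are made-up illustrations); here each row of the Jacobian is \(\nabla r_i(x)\T = (x-b_i)\T/\|x-b_i\|_2\).

```python
# Residuals r_i(x) = ||x - b_i|| - d_i and their Jacobian for the localization
# example; the points B (rows b_i) and distances d are made-up data.
import numpy as np

B = np.array([[0.0, 0.0],
              [4.0, 0.0],
              [0.0, 3.0]])                      # known points b_i
x_true = np.array([1.0, 1.0])
d = np.linalg.norm(x_true - B, axis=1)          # noiseless distance measurements

def residual(x):
    return np.linalg.norm(x - B, axis=1) - d

def jacobian(x):
    diff = x - B                                 # row i is (x - b_i)'
    return diff / np.linalg.norm(diff, axis=1, keepdims=True)

x = np.array([3.0, 2.0])
print(residual(x))
print(jacobian(x))
```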
Linearize \(r\) about a point \(\bar x\): \[ \begin{aligned} r(x) = \begin{bmatrix} r_1(x) \\ r_2(x) \\ \vdots \\ r_m(x) \end{bmatrix} &= \begin{bmatrix} r_1(\bar x) + \nabla r_1(\bar x)^T(x-\bar x) \\ r_2(\bar x) + \nabla r_2(\bar x)^T(x-\bar x) \\ \vdots \\ r_m(\bar x) + \nabla r_m(\bar x)^T(x-\bar x) \end{bmatrix} + o(\|x-\bar x\|)\\[10pt] &= J(\bar x)(x-\bar x) + r(\bar x) + o(\|x-\bar x\|)\\[10pt] &= J(\bar x) x - \underbrace{(J(\bar x)\bar x - r(\bar x))}_{=: b(\bar x)} + o(\|x-\bar x\|) \end{aligned} \]
The Gauss–Newton iteration repeatedly minimizes the linearized residual: \[ x^{(k+1)} = \argmin_x \ \half\|J(x^k)x - b(x^k)\|_2^2 \textt{or} x^{(k+1)} = J(x^k) \backslash b(x^k) \]
\[ \begin{aligned} x^{(k+1)} &= \argmin_x\ \|J_kx - b_k\|^2 \\ &= (J_k^T J_k)^{-1}J_k\T b_k\\ &= (J_k^T J_k)^{-1}J_k^T (J_k x^k - r_k)\\ &= x^k - (J_k^T J_k)^{-1}J_k^T r_k \end{aligned} \]
\[ x^{k+1} = x^k + d^k, \qquad d^k:=\underbrace{(J_k^T J_k)^{-1}}_{= D_k\succ0}\underbrace{(-J_k^T r_k)}_{=-\nabla f(x^k)} \]
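Putting the pieces together, here is a minimal Gauss–Newton loop on a small made-up residual (the residual, its Jacobian, and the iteration count are illustrative; the step is computed with a least-squares solve rather than by forming \(J_k\T J_k\) explicitly):

```python
# Gauss-Newton: at each iterate solve min_d ||J(x) d + r(x)||^2 and set x <- x + d.
# r and J are a made-up example with an exact (zero-residual) solution at (2, 1).
import numpy as np

def r(x):
    return np.array([x[0]**2 - 4.0, x[0]*x[1] - 2.0, x[1] - 1.0])

def J(x):
    return np.array([[2*x[0], 0.0 ],
                     [x[1],   x[0]],
                     [0.0,    1.0 ]])

x = np.array([3.0, 3.0])
for _ in range(15):
    d, *_ = np.linalg.lstsq(J(x), -r(x), rcond=None)   # d = -(J'J)^{-1} J'r
    x = x + d
print(x, r(x))            # converges to (2, 1) with zero residual
```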
The true Hessian of \(f\) is \[ \nabla^2 f(x) = J(x)^T J(x) + \sum_{i=1}^m r_i(x)\,\nabla^2 r_i(x) \] The Gauss–Newton direction keeps only the first term \(J(x)^T J(x)\), which is a good approximation when the residuals \(r_i(x)\) are small.
\[ \min_x \ f(x) = \half\|r(x)\|_2^2, \quad r:\Rn\to\R^m \]
With a linesearch, the iteration becomes \[ x^{k+1} = x^k + \alpha^k d^k, \qquad d^k = \argmin_d \ \|J_kd + r_k\|^2 \]
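A minimal sketch of one such step with a crude backtracking linesearch (the backtracking rule, its parameters, and the function name are my illustrative choices, not specified in the notes):

```python
# One damped Gauss-Newton step: the direction solves the linearized subproblem,
# and the steplength alpha comes from simple backtracking on f(x) = 1/2 ||r(x)||^2.
import numpy as np

def damped_gauss_newton_step(r, J, x, shrink=0.5, max_halvings=30):
    f = lambda z: 0.5 * np.dot(r(z), r(z))
    d, *_ = np.linalg.lstsq(J(x), -r(x), rcond=None)   # d^k = argmin_d ||J_k d + r_k||^2
    alpha = 1.0
    while f(x + alpha * d) >= f(x) and max_halvings > 0:  # crude decrease test
        alpha *= shrink
        max_halvings -= 1
    return x + alpha * d
```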