CPSC 406 – Computational Optimization
\[ \def\argmin{\operatorname*{argmin}} \def\Ball{\mathbf{B}} \def\bmat#1{\begin{bmatrix}#1\end{bmatrix}} \def\Diag{\mathbf{Diag}} \def\half{\tfrac12} \def\ip#1{\langle #1 \rangle} \def\maxim{\mathop{\hbox{\rm maximize}}} \def\maximize#1{\displaystyle\maxim_{#1}} \def\minim{\mathop{\hbox{\rm minimize}}} \def\minimize#1{\displaystyle\minim_{#1}} \def\norm#1{\|#1\|} \def\Null{{\mathbf{null}}} \def\proj{\mathbf{proj}} \def\R{\mathbb R} \def\Re{\mathbb R} \def\Rn{\R^n} \def\rank{\mathbf{rank}} \def\range{{\mathbf{range}}} \def\span{{\mathbf{span}}} \def\st{\hbox{\rm subject to}} \def\T{^\intercal} \def\textt#1{\quad\text{#1}\quad} \def\trace{\mathbf{trace}} \]
\[ \min_x\ f(x) \textt{with} f:\Rn\to\R \quad \text{continuously differentiable} \]
\[ f'(x;d) = \lim_{\alpha\to 0^+} \frac{f(x+\alpha d) - f(x)}{\alpha} = \nabla f(x)^T d \]
if \(d\) satisfies \[ f'(x;d) < 0 \]
then \(d\) is a descent direction of \(f\) at \(x\): there exists \(\bar\alpha>0\) such that \[ f(x+\alpha d) < f(x) \quad \forall \alpha\in(0,\bar\alpha) \]
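for instance, with \(f(x)=\half\|x\|^2\) and the direction \(d=-x\) at any \(x\ne0\), \[ f'(x;-x) = \nabla f(x)\T(-x) = -\|x\|^2 < 0, \] so \(-x\) is a descent direction at every nonzero point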
Initialize: choose \(x_0\in\Rn\)
For \(k=0,1,2,\ldots\)
Questions: how to choose the search direction \(d\)? the step size \(\alpha^k\)? when to stop? Steepest descent takes the negative gradient as the search direction:
\[ x^{k+1} = x^{k} + \alpha^{k} d, \qquad d = -\nabla f(x^{k}) \]
\[ f'(x;-\nabla f(x)) = -\nabla f(x)^T \nabla f(x) = -\|\nabla f(x)\|^2 < 0 \quad \text{whenever } \nabla f(x)\ne 0 \]
\[ -\frac{\nabla f(x)}{\|\nabla f(x)\|} = \mathop{\rm argmin}_{\|d\|\le1} f'(x;d) \quad\text{(most negative)} \]
proof proceeds via the Cauchy–Schwarz inequality: for any vectors \(w,v\in\Rn\), \[
-\|w\|\cdot\|v\| \le w\T v \le +\|w\|\cdot\|v\|
\]
with the upper bound achieved if and only if \(v\) is a nonnegative multiple of \(w\), and the lower bound if and only if \(v\) is a nonpositive multiple of \(w\)
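applying this with \(w=\nabla f(x)\) and \(v=d\), any \(\|d\|\le1\) gives \[ f'(x;d) = \nabla f(x)\T d \ \ge\ -\|\nabla f(x)\|\,\|d\| \ \ge\ -\|\nabla f(x)\|, \] and both inequalities are tight exactly when \(d = -\nabla f(x)/\|\nabla f(x)\|\) (assuming \(\nabla f(x)\ne0\)), which confirms the claimed minimizer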
Initialize: choose \(x_0\in\Rn\) and tolerance \(\epsilon>0\)
For \(k=0,1,2,\ldots\)
choose step size \(\alpha^k\) to approximately minimize \[ \phi(\alpha) = f(x^k - \alpha \nabla f(x^k)) \]
update \(x^{k+1} = x^k - \alpha^k \nabla f(x^k)\)
stop if \(\|\nabla f(x^k)\| < \epsilon\)
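a minimal sketch of this loop in Python/NumPy; the test function \(f(x)=\half\|x\|^2\), the fixed step size, and the iteration cap are illustrative assumptions, not part of the notes:

```python
import numpy as np

def gradient_descent(grad_f, x0, step=0.5, tol=1e-6, max_iter=10_000):
    """Steepest descent with a constant step size and a gradient-norm stopping test."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:      # stop if ||∇f(x^k)|| < ε
            return x, k
        x = x - step * g                 # x^{k+1} = x^k - α^k ∇f(x^k)
    return x, max_iter

# example: f(x) = 1/2 ||x||^2, so ∇f(x) = x and the minimizer is x* = 0
x_star, iters = gradient_descent(lambda x: x, x0=[4.0, -2.0])
print(x_star, iters)
```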
step size rules typically used in practice:
exact linesearch \[\alpha^{k} \in \mathop{\rm argmin}_{\alpha\ge0}\ \phi(\alpha), \qquad \phi(\alpha):=f(x^{k} + \alpha d^{k})\]
constant step size \[\alpha^{k} = \bar\alpha>0 \quad \forall k\]
\(f(x)=\half x^2\) with \(x\in\R\). Then
\[ \begin{aligned} x^{k+1} &= x^{k} - \bar\alpha \nabla f(x^{k})\\ &= x^{k} - \bar\alpha x^{k}\\ &= (1-\bar\alpha)x^{k}\\ &= (1-\bar\alpha)^{k+1}x^{0} \end{aligned} \]
if \(\bar\alpha\in(0,2)\) then \(|1-\bar\alpha|<1\) and
\[f(x^{k})=\half (1-\bar\alpha)^{2k} (x^{0})^2\to0 \textt{as} k\to\infty \]
\[f(x) = \half x\T Hx + b\T x + \gamma, \textt{with} H\succ0\]
a reliable constant step size \(\bar\alpha\) depends on the maximum eigenvalue of \(H\); observe that \[ d\T H d \le \lambda_{\max}(H) \|d\|^2 \quad \forall d\in\Rn \qquad(1)\]
behaviour of the function value along the steepest descent direction \(d=-\nabla f(x)\) (note \(\|d\|=\|\nabla f(x)\|\)): \[ \begin{aligned} f(x+\alpha d) &= f(x) + \alpha d\T\nabla f(x) + \half \alpha^2 d\T \nabla^2f(x) d & \quad \text{(exact because $f$ quadratic)} \\ &\le f(x) - \alpha\|\nabla f(x)\|^2 + \half \alpha^2 \lambda_{\max}(H)\|\nabla f(x)\|^2 &\quad \text{(by (1))} \\ &= f(x) - \underbrace{(\alpha-\half \alpha^2 \lambda_{\max}(H))}_{(♥)}\|\nabla f(x)\|^2 \end{aligned} \]
if \(♥>0\) then \(f(x+\alpha d)<f(x)\), as required, so choose \[\alpha\in(0,2/\lambda_{\max}(H))\]
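a small sketch of the constant-step rule on a randomly generated strictly convex quadratic; the data \(H\), \(b\), the choice \(\alpha = 1/\lambda_{\max}(H)\), and the iteration count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
H = M @ M.T + np.eye(5)                      # H ≻ 0
b = rng.standard_normal(5)
grad = lambda x: H @ x + b                   # ∇f(x) = Hx + b

alpha = 1.0 / np.linalg.eigvalsh(H).max()    # α ∈ (0, 2/λ_max(H))
x = np.zeros(5)
for _ in range(1000):
    x = x - alpha * grad(x)

# compare with the exact minimizer x* = -H⁻¹ b
print(np.linalg.norm(x + np.linalg.solve(H, b)))
```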
for general smooth functions, constant stepsize depends on the Lipschitz constant of the gradient
Definition 1 (L-smooth functions) The function \(f:\Rn\to\R\) is \(L\)-Lipschitz smooth if \[ \|\nabla f(x) - \nabla f(y)\| \le L\|x-y\| \quad \forall x,y\in\Rn \]
example: the quadratic \[ f(x) = \half x\T Hx + b\T x + \gamma, \textt{with} H\succ0 \] has \(\nabla f(x) - \nabla f(y) = H(x-y)\), so it is \(L\)-Lipschitz smooth with \(L=\lambda_{\max}(H)\)
If \(f\) is twice continuously differentiable, then \(f\) is \(L\)-Lipschitz smooth if and only if its Hessian is bounded in norm by \(L\), i.e., for all \(x\in\Rn\) \[ -LI \preceq \nabla^2 f(x) \preceq L I, \quad\text{in particular}\quad L I - \nabla^2 f(x) \succeq 0, \] which implies that the quadratic approximation built at \(x\) with curvature \(L\) is a local upper bound
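concretely, if \(f\) is \(L\)-smooth then (by Taylor's theorem with the Hessian bound above) \[ f(y) \le f(x) + \nabla f(x)\T(y-x) + \frac{L}{2}\|y-x\|^2 \quad \forall x,y\in\Rn, \] i.e., the quadratic model built at \(x\) with curvature \(L\) lies above \(f\)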
logistic regression: fit labels \(b_i\) to features \(a_i\in\Rn\) via \[ \sigma(a^{\intercal}_i x) \approx b_i, \quad \text{where} \quad \sigma(t) = \frac{1}{1+e^{-t}} \]
\[ \min_x\ f(x):=-\sum_{i=1}^m \Big[ b_i\log(\sigma(a_i^\intercal x)) + (1-b_i)\log(1-\sigma(a_i^\intercal x)) \Big] \]
\[ \nabla f(x) = A\T r, \quad \nabla^2 f(x) = A\T D A, \quad s = \sigma.(Ax), \quad r = s - b, \quad D = \Diag(s_i(1-s_i))_{i=1}^m \]
\[ u\T \nabla^2 f(x) u = u\T(A\T D A)u \le \frac{1}{4} u\T(A\T A)u \le \frac{1}{4} \lambda_{\max}(A\T A)\,\|u\|^2 \quad \forall u\in\Rn, \] so \(f\) is \(L\)-Lipschitz smooth with \(L = \tfrac14\lambda_{\max}(A\T A)\)
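a sketch of gradient descent on this objective with the constant step \(1/L\), \(L = \tfrac14\lambda_{\max}(A\T A)\); the synthetic data \(A\), \(b\) and the iteration count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 200, 5
A = rng.standard_normal((m, n))
b = (rng.random(m) < 0.5).astype(float)       # labels in {0, 1}

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

def grad_f(x):
    return A.T @ (sigmoid(A @ x) - b)         # ∇f(x) = Aᵀ(σ.(Ax) - b)

L = 0.25 * np.linalg.eigvalsh(A.T @ A).max()  # Lipschitz constant of ∇f
x = np.zeros(n)
for _ in range(1000):
    x = x - (1.0 / L) * grad_f(x)

print(np.linalg.norm(grad_f(x)))              # gradient norm should be small
```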
\[ f(x) = \half x\T Hx + b\T x + \gamma, \textt{with} H\succ0 \]
exact linesearch solves the 1-dimensional optimization problem along a descent direction \(d\): \[ \min_{\alpha\ge0} \ \phi(\alpha) := f(x + \alpha d) \]
exact step computation:
\[ \begin{aligned} \phi(\alpha) &= \half (x + \alpha d)\T H(x + \alpha d) + b\T (x + \alpha d) + \gamma\\ \\ \phi'(\alpha) &= \alpha d\T H d + x\T H d + b\T d = \alpha d\T H d + \nabla f(x)\T d\\ \\ \phi'(\alpha^*) &= 0 \quad \Longleftrightarrow \quad \alpha^* = -\frac{\nabla f(x)\T d}{d\T H d} \end{aligned} \]
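a sketch of steepest descent with this exact step on a randomly generated quadratic; the data \(H\), \(b\) and the iteration count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((5, 5))
H = M @ M.T + np.eye(5)                      # H ≻ 0
b = rng.standard_normal(5)
grad = lambda x: H @ x + b

x = np.zeros(5)
for _ in range(200):
    d = -grad(x)                             # steepest descent direction
    if np.linalg.norm(d) < 1e-10:
        break
    alpha = -(grad(x) @ d) / (d @ H @ d)     # α* = -∇f(x)ᵀ d / dᵀ H d
    x = x + alpha * d

print(np.linalg.norm(x + np.linalg.solve(H, b)))   # distance to x* = -H⁻¹ b
```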
backtracking: pull back along the descent direction \(d^{k}\) until a sufficient decrease in \(f\) is achieved
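one common way to implement this pull-back is the Armijo sufficient-decrease test sketched below; the specific condition and the parameters \(\alpha_0\), \(\mu\), \(\beta\) are standard choices assumed here, not taken from the notes:

```python
import numpy as np

def backtracking_step(f, grad_fx, x, d, alpha0=1.0, mu=1e-4, beta=0.5):
    """Shrink α until f(x + α d) ≤ f(x) + μ α ∇f(x)ᵀ d (sufficient decrease)."""
    alpha, fx, slope = alpha0, f(x), grad_fx @ d
    while f(x + alpha * d) > fx + mu * alpha * slope:
        alpha *= beta                         # pull back along d
    return alpha

# usage on f(x) = 1/2 ||x||^2 with the steepest descent direction d = -x
f = lambda x: 0.5 * x @ x
x = np.array([3.0, -1.0])
alpha_k = backtracking_step(f, x, x, -x)      # ∇f(x) = x here
```

inside the descent loop the returned value plays the role of \(\alpha^k\) before the update \(x^{k+1}=x^k+\alpha^k d^k\)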
suppose \(f:\Rn\to\R\) is \(L\)-smooth and consider the iteration \[ x^{k+1} = x^k - \alpha^k \nabla f(x^k) \]
with a suitable step size \(\alpha^k\) (e.g., constant \(\alpha^k\in(0,2/L)\) or chosen by backtracking); then the iterates satisfy
descent (unless \(\nabla f(x^k)=0\)) \[f(x^{k+1}) < f(x^k)\]
convergence \[\|\nabla f(x^k)\| \to 0\]
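both claims follow from the \(L\)-smooth upper bound applied to one step \(x^{k+1}=x^k-\alpha^k\nabla f(x^k)\): \[ f(x^{k+1}) \le f(x^k) - \big(\alpha^k - \tfrac{L}{2}(\alpha^k)^2\big)\,\|\nabla f(x^k)\|^2, \] so any step with \(\alpha^k - \tfrac{L}{2}(\alpha^k)^2 > 0\) (e.g., \(\alpha^k\in(0,2/L)\)) strictly decreases \(f\), and summing these inequalities shows \(\|\nabla f(x^k)\|\to 0\) whenever \(f\) is bounded below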