Second-Order Optimality

CPSC 406 – Computational Optimization

Hessians and second-order optimality

\[ \def\argmin{\operatorname*{argmin}} \def\Ball{\mathbf{B}} \def\bmat#1{\begin{bmatrix}#1\end{bmatrix}} \def\Diag{\mathbf{Diag}} \def\half{\tfrac12} \def\ip#1{\langle #1 \rangle} \def\maxim{\mathop{\hbox{\rm maximize}}} \def\maximize#1{\displaystyle\maxim_{#1}} \def\minim{\mathop{\hbox{\rm minimize}}} \def\minimize#1{\displaystyle\minim_{#1}} \def\norm#1{\|#1\|} \def\Null{{\mathbf{null}}} \def\proj{\mathbf{proj}} \def\R{\mathbb R} \def\Rn{\R^n} \def\rank{\mathbf{rank}} \def\range{{\mathbf{range}}} \def\span{{\mathbf{span}}} \def\st{\hbox{\rm subject to}} \def\T{^\intercal} \def\textt#1{\quad\text{#1}\quad} \def\trace{\mathbf{trace}} \]

  • sufficient optimality conditions in \(\R\)
  • positive definite matrices
  • Hessians
  • quadratic functions
  • sufficient optimality conditions in \(\R^n\)

Necessary and sufficient conditions (1-D)

Suppose that \(f:\R\to\R\) is twice continuously differentiable.


necessary optimality conditions

\(x^*\) is a local minimizer only if

  • (first-order) \(f'(x^*)=0\)
  • (second-order) \(f''(x^*)\geq 0\)

sufficient optimality conditions

\(x^*\) is a local minimizer if

  • (first-order) \(f'(x^*)=0\)
  • (second-order) \(f''(x^*)> 0\)

 

 

  • generalize second-order conditions to \(\R^n\)

Example


\[ \min_{x\in\R^2}\, \frac{x_1+x_2}{x_1^2+x_2^2+1} \]

using ForwardDiff
f(x) = (x[1]+x[2])/(x[1]^2+x[2]^2+1)     # objective
∇f(x) = ForwardDiff.gradient(f, x)       # gradient via automatic differentiation

x = [1, 1]/sqrt(2);                      # the two candidate stationary points are ±x
@show ∇f(+x)
@show ∇f(-x);
∇f(+x) = [1.1102230246251565e-16, 1.1102230246251565e-16]
∇f(-x) = [1.1102230246251565e-16, 1.1102230246251565e-16]
  • Both \(x\) and \(-x\) are stationary. Which is the minimizer and which is the maximizer?

Positive definite matrices

Positive (semi)definite matrices

Let \(H\) be an \(n\)-by-\(n\) matrix with \(H=H\T\) (symmetric)

  • \(H\) is positive semidefinite (\(H\succeq0\)) if

\[x\T H x\geq 0 \textt{for all} x\in\R^n\]

  • \(H\) is positive definite (\(H\succ0\)) if

\[x\T H x> 0 \textt{for all} 0\ne x\in\R^n\]

  • \(H\) is negative semidefinite if \(-H\) is positive semidefinite, ie, \(H\preceq0 \Longleftrightarrow -H\succeq0\)
  • \(H\) is negative definite if \(-H\) is positive definite, ie, \(H\prec0 \Longleftrightarrow -H\succ0\)

 

  • \(H\) is indefinite if it is neither positive nor negative semidefinite, ie,

\[ \exists\ x,y\in\R^n \textt{such that} x\T H x > 0 \textt{and} y\T H y < 0 \]

Question

The matrix \(H = \begin{bmatrix}\phantom+2 & -1\\ -1 & \phantom-1\end{bmatrix}\) is

  1. positive definite
  2. positive semidefinite
  3. negative definite
  4. negative semidefinite
  5. indefinite

Diagonal matrices

\[ D = \begin{bmatrix}d_1 & 0 & \cdots & 0\\ 0 & d_2 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & d_n \end{bmatrix} = \Diag(d_1,d_2,\ldots,d_n) \]

  • \(D\succ0 \quad\Longleftrightarrow\quad d_i>0\) for all \(i\)
  • \(D\succeq0 \quad\Longleftrightarrow\quad d_i\geq0\) for all \(i\)
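
A quick numerical check of these equivalences, using the standard-library `LinearAlgebra` functions `isposdef` and `eigvals` (the diagonal entries below are arbitrary illustrative values):

using LinearAlgebra

# all diagonal entries positive  ⟹  D ≻ 0
D1 = Matrix(Diagonal([3.0, 1.0, 0.5]))
@show isposdef(D1)       # true
@show eigvals(D1)        # the eigenvalues are the diagonal entries

# a zero diagonal entry  ⟹  D ⪰ 0 but not D ≻ 0
D2 = Matrix(Diagonal([3.0, 0.0, 0.5]))
@show isposdef(D2);      # false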

Eigenpairs of symmetric matrices

Let \(H\) be an \(n\)-by-\(n\) matrix. Then \((x,\lambda)\in\Rn\times\R\) with \(x\ne0\) is an eigenvector/eigenvalue pair of \(H\) if \[Hx = \lambda x\]

Theorem 1 (Eigenvalues of symmetric matrices) If \(H\) is \(n\)-by-\(n\) and symmetric, then \(H\) has \(n\) orthonormal eigenvectors and all of its eigenvalues are real.

\[ \left\{ \begin{aligned} Hx_1 &= \lambda_1 x_1\\ Hx_2 &= \lambda_2 x_2\\ &\vdots\\ Hx_n &= \lambda_n x_n \end{aligned} \right\} \textt{or} H X = X\Lambda\ \] where \(X\T = X^{-1}\) (orthogonal) and \(\Lambda\) is a diagonal matrix of eigenvalues: \[ X = \begin{bmatrix}x_1 & x_2 & \cdots & x_n\end{bmatrix} \quad\text{and}\quad \Lambda = \begin{bmatrix}\lambda_1 & 0 & \cdots & 0\\ 0 & \lambda_2 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & \lambda_n \end{bmatrix} \]
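
A small numerical illustration of the decomposition \(HX = X\Lambda\), written as a sketch with an arbitrary symmetric matrix:

using LinearAlgebra

H = [4.0 1.0; 1.0 3.0]           # symmetric
F = eigen(H)                     # F.values = eigenvalues, F.vectors = X
X, Λ = F.vectors, Diagonal(F.values)

@show norm(H*X - X*Λ)            # ≈ 0:  HX = XΛ
@show norm(X'*X - I);            # ≈ 0:  Xᵀ = X⁻¹ (orthogonal)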

Eigenvalues and definiteness

The matrix \(H\) is positive definite (semidefinite) if and only if all of its eigenvalues are positive (nonnegative).


proof (positive definite)

  • by spectral theorem, \[ X\T H X = \Lambda \textt{where} X\T = X^{-1} \textt{and} \Lambda=\Diag(\lambda_1,\lambda_2,\ldots,\lambda_n)\]

  • for any \(x\in\Rn\) there exists \(y=(y_1,\ldots,y_n)\) such that \(x=Xy\) and \[ x\T Hx = y\T X\T H X y = y\T \Lambda y = \sum_{i=1}^n \lambda_i y_i^2 \]

  • thus, \(x\T Hx>0\) for all \(x\neq 0\) (ie, \(H\) positive definite) if and only if \[ \sum_{i=1}^n \lambda_i y_i^2 > 0 \quad\text{for all}\quad y\neq 0 \quad\Longleftrightarrow\quad \lambda_i > 0 \quad\text{for all}\quad i=1:n \]
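
The change of variables \(x = Xy\) in this proof can be checked numerically; a minimal sketch with an arbitrary symmetric matrix and a random vector:

using LinearAlgebra

H = [1.0 2.0; 2.0 -1.0]          # arbitrary symmetric matrix
λ, X = eigen(H)                  # eigenvalues and orthonormal eigenvectors
x = randn(2)
y = X' * x                       # coordinates of x in the eigenvector basis

@show x'*H*x
@show sum(λ .* y.^2);            # same value:  xᵀHx = Σᵢ λᵢ yᵢ²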

Example

using LinearAlgebra 

\[ H = \begin{bmatrix}4 & 1\\ 1& 3\end{bmatrix} \]

H = [4 1; 1 3]
@show eigvals(H);     # both eigenvalues positive ⟹ H ≻ 0
eigvals(H) = [2.381966011250105, 4.618033988749895]

\[ H = \begin{bmatrix} 1 & 1 & 1\\ 1 & 1 & 1\\ 1 & 1 & 1/10 \end{bmatrix}\]

H = ones(3,3)
H[3, 3] = 1/10
@show eigvals(H);     # one negative eigenvalue ⟹ H is indefinite
eigvals(H) = [-0.6536725037400826, -2.3721342664653315e-17, 2.7536725037400815]

Equivalent conditions

Let \(H\) be an \(n\)-by-\(n\) symmetric matrix.


positive definite equivalences:

  1. all eigenvalues of \(H\) are positive
  2. \(x\T H x > 0\) for all \(0\ne x\in\R^n\)
  3. \(H = R\T R\) for some nonsingular \(n\)-by-\(n\) matrix \(R\)
  4. \(H\) is symmetric and all of its leading principal minors are positive


positive semidefinite equivalences:

  1. all eigenvalues of \(H\) are nonnegative
  2. \(x\T H x \ge 0\) for all \(x\in\R^n\)
  3. \(H = R\T R\) for some \(n\)-by-\(n\) matrix \(R\)
  4. \(H\) is symmetric and all of its principal minors are nonnegative
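
For the positive definite case, condition 3 can be made concrete with a Cholesky factorization, which succeeds exactly when \(H\succ0\) and returns an upper-triangular \(R\) with \(H=R\T R\); a minimal sketch:

using LinearAlgebra

H = [4.0 1.0; 1.0 3.0]
R = cholesky(Symmetric(H)).U     # cholesky throws an error unless H ≻ 0
@show norm(R'*R - H)             # ≈ 0:  H = RᵀR

@show isposdef(H)                        # true
@show isposdef([1.0 2.0; 2.0 1.0]);      # false: eigenvalues are 3 and -1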

Hessians

For \(f:\R^n\to\R\) twice continuously differentiable, the Hessian of \(f\) at \(x\in\R^n\) is the \(n\)-by-\(n\) symmetric matrix \[ H(x) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2}(x) & \frac{\partial^2 f}{\partial x_1\partial x_2}(x) & \cdots & \frac{\partial^2 f}{\partial x_1\partial x_n}(x)\\ \frac{\partial^2 f}{\partial x_2\partial x_1}(x) & \frac{\partial^2 f}{\partial x_2^2}(x) & \cdots & \frac{\partial^2 f}{\partial x_2\partial x_n}(x)\\ \vdots & \vdots & \ddots & \vdots\\ \frac{\partial^2 f}{\partial x_n\partial x_1}(x) & \frac{\partial^2 f}{\partial x_n\partial x_2}(x) & \cdots & \frac{\partial^2 f}{\partial x_n^2}(x) \end{bmatrix} \qquad \frac{\partial^2 f}{\partial x_i\partial x_j} = \frac{\partial^2 f}{\partial x_j\partial x_i} \]

example

\[ f(x) = x_1^2 + 8x_1x_2 - 2x_3^3, \quad \nabla f(x) = \begin{bmatrix}2x_1 + 8x_2\\ 8x_1 \\ -6x_3^2\end{bmatrix}, \quad H(x) = \begin{bmatrix}2 & 8 & \phantom-0\\ 8 & 0 & \phantom-0\\ 0 & 0 & -12x_3\end{bmatrix} \]
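
The gradient and Hessian above can be checked with automatic differentiation, reusing the `ForwardDiff` package from the earlier example (a sketch at an arbitrary point):

using ForwardDiff

f(x) = x[1]^2 + 8x[1]*x[2] - 2x[3]^3
x = [1.0, 2.0, 3.0]

@show ForwardDiff.gradient(f, x)   # [2x₁ + 8x₂, 8x₁, -6x₃²] = [18, 8, -54]
@show ForwardDiff.hessian(f, x);   # [2 8 0; 8 0 0; 0 0 -12x₃]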

Quadratic Functions

Quadratic functions

Quadratic functions over \(\Rn\) have the form \[ f(x) = \half x\T H x + b\T x + \gamma \] where \(H\) is symmetric, \(b\in\Rn\), and \(\gamma\in\R\).

  • \(n=1\)

\[ f(x) = \half hx^2 + bx + \gamma, \quad H = [h] \]

  • \(n=2\)

\[ \begin{aligned} f(x) &= \half[\, x_1 \ x_2\, ] \begin{bmatrix}h_{11} & h_{12}\\ h_{21} & h_{22}\end{bmatrix} \begin{bmatrix}x_1\\ x_2\end{bmatrix} + [\,b_1 \ b_2\,] \begin{bmatrix}x_1\\ x_2\end{bmatrix} + \gamma\\ &= \half h_{11}x_1^2 + h_{12}x_1x_2 + \half h_{22}x_2^2 + b_1x_1 + b_2x_2 + \gamma \end{aligned} \]

Quadratic functions and symmetry

\[ f(x) = \half x\T H x + b\T x + \gamma, \quad \nabla f(x) = Hx + b, \quad \nabla^2 f(x) = H \]


We can always assume without loss of generality that \[ H = H\T \quad \text{(symmetric)} \]


Suppose that \(H\ne H\T\): \[ x\T H x = \half x\T H x + \half x\T H\T x = x\T \left[\half\left(H + H\T\right)\right] x \]


Thus we can replace \(H\) with \(\half(H + H\T)\) without changing the function value.
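
A quick numerical check of this symmetrization, sketched with an arbitrary nonsymmetric matrix and a random vector:

using LinearAlgebra

H = [1.0 4.0; 0.0 2.0]           # not symmetric
Hsym = (H + H')/2                # symmetric part of H
x = randn(2)

@show x'*H*x
@show x'*Hsym*x;                 # identical: the quadratic form only sees (H + Hᵀ)/2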


Optimality for quadratic functions

\[ \min_{x\in\Rn} f(x) = \half x\T H x + b\T x + \gamma \]

\[ \nabla f(x) = Hx + b, \quad \nabla^2 f(x) = H \]

optimality conditions

  • (necessary) \(x^*\) is optimal only if \(\nabla f(x^*)=Hx^*+b=0\) (stationary)
  • (sufficient) if stationary and \(H\succeq0\), then \(x^*\) is a global minimizer
  • (sufficient) if stationary and \(H\succ0\), then \(x^*\) is the unique global minimizer


proof

for all \(d\ne0\), \[ f(x^*+d) - f(x^*) = d\T\underbrace{\nabla f(x^*)}_{=0} + \half \underbrace{d\T\overbrace{\nabla^2 f(x^*)}^{=H}d}_{\ge0} \begin{cases} \ge0 & \text{if } H\succeq0\\ >0 & \text{if } H\succ0 \end{cases} \]
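
When \(H\succ0\), the stationarity equation \(Hx^*+b=0\) pins down the unique global minimizer, which can be computed with a linear solve; a minimal sketch (the particular \(H\), \(b\), and \(\gamma\) are arbitrary):

using LinearAlgebra

H = [4.0 1.0; 1.0 3.0]           # H ≻ 0
b = [1.0, -2.0]
γ = 5.0
f(x) = 0.5*x'*H*x + b'*x + γ

xstar = -(H \ b)                 # solve Hx* + b = 0
@show norm(H*xstar + b)          # ≈ 0: x* is stationary
@show f(xstar) ≤ f(xstar + randn(2));   # true: no other point does better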

Nonlinear functions

Directional second derivatives

Given \(f:\Rn\to\R\), recall the directional derivative \[ f'(x;d) = \lim_{\alpha\to 0^+}\frac{f(x+\alpha d)-f(x)}{\alpha} = d\T\nabla f(x) \]

Definition 1 The directional second derivative of \(f\) at \(x\) in the direction \(d\) is \[ f''(x;d) = \lim_{\alpha\to0^+} \frac{f'(x+\alpha d;d) - f'(x;d)}{\alpha} = d\T\nabla^2 f(x)d \]

partial 2nd derivatives are the directional 2nd derivatives along each canonical basis vector \(e_i\): \[ \frac{\partial^2 f}{\partial x_i^2}(x) = f''(x;e_i) \textt{with} e_i(j) = \begin{cases} 1 & \text{if } j=i\\ 0 & \text{if } j\ne i \end{cases} \]
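
Definition 1 can be checked numerically: a finite-difference quotient of directional first derivatives approaches \(d\T\nabla^2 f(x)d\). A sketch using `ForwardDiff`, with an arbitrary point, direction, and step size:

using ForwardDiff

f(x) = (x[1]+x[2])/(x[1]^2+x[2]^2+1)
fp(x, d) = ForwardDiff.gradient(f, x)' * d     # directional derivative f'(x; d)

x, d = [0.3, -0.7], [1.0, 2.0]
α = 1e-6
@show (fp(x + α*d, d) - fp(x, d))/α            # difference quotient of f'(⋅; d)
@show d' * ForwardDiff.hessian(f, x) * d;      # dᵀ∇²f(x)d — agrees up to O(α)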

Linear and quadratic approximations

Suppose \(f:\Rn\to\R\) is twice continuously differentiable.

Theorem 2 (Linear approximation) For all \(x\in\Rn\) and \(\epsilon>0\), for each \(y\in\epsilon\Ball(x)\) there exists \(z\in[x,y]\) such that
\[ f(y) = f(x) + \nabla f(x)\T (y-x) + \half (y-x)\T\nabla^2 f(z)(y-x) \]


Theorem 3 (Quadratic approximation) For all \(x\) and \(d\) in \(\Rn\), \[ f(x+d) = f(x) + \nabla f(x)\T d + \half d\T\nabla^2 f(x)d + o(\|d\|^2) \]
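
A numerical sketch of Theorem 3: as the step shrinks, the error of the quadratic model vanishes even relative to \(\|d\|^2\) (the point and direction below are arbitrary):

using ForwardDiff, LinearAlgebra

f(x) = (x[1]+x[2])/(x[1]^2+x[2]^2+1)
x = [0.5, 1.0]
g = ForwardDiff.gradient(f, x)
H = ForwardDiff.hessian(f, x)

for t in (1e-1, 1e-2, 1e-3)
    d = t*[1.0, -2.0]
    err = f(x + d) - (f(x) + g'*d + 0.5*d'*H*d)
    @show err/norm(d)^2          # → 0 as ‖d‖ → 0
end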

Second-order necessary conditions

For \(f:\Rn\to\R\) twice continuously differentiable and \(\bar x\in\Rn\) stationary (ie, \(\nabla f(\bar x)=0\))

  • \(\bar x\) is a local min \(\quad\Longrightarrow\quad\) \(\nabla^2f(\bar x) \succeq0\)

  • \(\bar x\) is a local max \(\quad\Longrightarrow\quad\) \(\nabla^2f(\bar x) \preceq0\)


proof sketch for local min (analogous for local max). If \(\bar x\) is a local min, then for all \(d\ne0\) and all sufficiently small \(\alpha>0\)

\[ \begin{aligned} 0\le f(\bar x+\alpha d) - f(\bar x) &= \alpha\, d\T\underbrace{\nabla f(\bar x)}_{=0} + \half \alpha^2 d\T\nabla^2 f(\bar x)d + o(\alpha^2\|d\|^2) \end{aligned} \]

Divide both sides by \(\alpha^2\) and take the limit as \(\alpha\to0^+\). Because \(o(\alpha^2\|d\|^2)/\alpha^2\to0\), \[ 0\le d\T\nabla^2 f(\bar x)d \]

Because this holds for all \(d\ne0\), \[ \nabla^2 f(\bar x) \succeq0 \]

Sufficient conditions for optimality

For \(f:\Rn\to\R\) twice continuously differentiable and \(\bar x\in\Rn\) stationary, ie, \(\nabla f(\bar x)=0\),

  • \(\nabla^2 f(\bar x)\succ0 \quad\Longrightarrow\quad\) \(\bar x\) is a local min
  • \(\nabla^2 f(\bar x)\prec0 \quad\Longrightarrow\quad\) \(\bar x\) is a local max

proof sketch for local min (analogous for local max). By the linear approximation theorem and continuity of \(\nabla^2 f\), for any \(x\ne\bar x\) close enough to \(\bar x\) there exists \(z\in[\bar x,x]\) such that

\[ f(x) - f(\bar x) = (x-\bar x)\T\underbrace{\nabla f(\bar x)}_{=0} + \half (x-\bar x)\T\underbrace{\nabla^2 f(z)}_{\succ0}(x-\bar x)>0 \]

Question

\[ f(x) = x_1^2 + 8x_1x_2 - 2x_3^3, \quad \nabla f(x) = \begin{bmatrix}2x_1 + 8x_2\\ 8x_1 \\ -6x_3^2\end{bmatrix}, \quad H(x) = \begin{bmatrix}2 & 8 & \phantom-0\\ 8 & 0 & \phantom-0\\ 0 & 0 & -12x_3\end{bmatrix} \]

The stationary point \(x^* = (0,0,0)\) is a

  1. minimizer
  2. maximizer
  3. saddle point

Example

\[ \min_{x,y}\ f(x,y) = \frac{x+y}{x^2+y^2+1} \]

\[ \nabla f(x,y) = \frac{1}{(x^2+y^2+1)^2} \begin{bmatrix} y^2-2xy-x^2+1\\ x^2-2xy-y^2+1 \end{bmatrix} \]

Stationary points \(\nabla f(x^*,y^*)=0\):

\[ \underbrace{ (x^*_1,y^*_1)=-\frac{1}{\sqrt2}(1,1)}_{\text{minimizer}} \]

\[ \underbrace{(x^*_2,y^*_2)=+\frac{1}{\sqrt2}(1,1)}_{\text{maximizer}} \]

Hessian of \(f\) at these points:

\[ \nabla^2 f(x^*_1,y^*_1) = \frac1{\sqrt{2}}\begin{bmatrix} 1 & 0\\0 & 1 \end{bmatrix}\succ0 \]

\[ \nabla^2 f(x^*_2,y^*_2) = \frac{1}{\sqrt{2}}\begin{bmatrix} -1 & 0\\ 0 & -1 \end{bmatrix}\prec0 \]
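
These Hessians can be confirmed with `ForwardDiff.hessian`, continuing the code from the first example (a sketch):

using ForwardDiff

f(x) = (x[1]+x[2])/(x[1]^2+x[2]^2+1)
x = [1, 1]/sqrt(2)

@show ForwardDiff.hessian(f, -x)   # ≈ (1/√2)·I ≻ 0  ⟹  minimizer
@show ForwardDiff.hessian(f, +x);  # ≈ -(1/√2)·I ≺ 0 ⟹  maximizer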