Second-Order Optimality

CPSC 406 – Computational Optimization

Hessians and second-order optimality

\[ \def\argmin{\operatorname*{argmin}} \def\Ball{\mathbf{B}} \def\bmat#1{\begin{bmatrix}#1\end{bmatrix}} \def\Diag{\mathbf{Diag}} \def\half{\tfrac12} \def\ip#1{\langle #1 \rangle} \def\maxim{\mathop{\hbox{\rm maximize}}} \def\maximize#1{\displaystyle\maxim_{#1}} \def\minim{\mathop{\hbox{\rm minimize}}} \def\minimize#1{\displaystyle\minim_{#1}} \def\norm#1{\|#1\|} \def\Null{{\mathbf{null}}} \def\proj{\mathbf{proj}} \def\R{\mathbb R} \def\Rn{\R^n} \def\rank{\mathbf{rank}} \def\range{{\mathbf{range}}} \def\span{{\mathbf{span}}} \def\st{\hbox{\rm subject to}} \def\T{^\intercal} \def\textt#1{\quad\text{#1}\quad} \def\trace{\mathbf{trace}} \]

  • sufficient optimality conditions in \(\R\)
  • positive definite matrices
  • Hessians
  • quadratic functions
  • sufficient optimality conditions in \(\R^n\)

Necessary and sufficient conditions (1-D)

Suppose that \(f:\R\to\R\) is twice continuously differentiable.


necessary optimality conditions

\(x^*\) is a local minimizer only if

  • (first-order) \(f'(x^*)=0\)
  • (second-order) \(f''(x^*)\geq 0\)

sufficient optimality conditions

\(x^*\) is a local minimizer if

  • (first-order) \(f'(x^*)=0\)
  • (second-order) \(f''(x^*)> 0\)

 

 

  • generalize second-order conditions to \(\R^n\)

Example


\[ \min_{x\in\R^2}\, \frac{x_1+x_2}{x_1^2+x_2^2+1} \]

using ForwardDiff
f(x) = (x[1]+x[2])/(x[1]^2+x[2]^2+1)     # objective
∇f(x) = ForwardDiff.gradient(f, x)       # gradient via automatic differentiation

x = [1, 1]/sqrt(2);                      # the two candidate stationary points are ±x
@show ∇f(+x)
@show ∇f(-x);
∇f(+x) = [1.1102230246251565e-16, 1.1102230246251565e-16]
∇f(-x) = [1.1102230246251565e-16, 1.1102230246251565e-16]
  • Both \(x\) and \(-x\) are stationary. Which is the minimizer and which is the maximizer?

Positive definite matrices

Positive (semi)definite matrices

Let \(H\) be an \(n\)-by-\(n\) matrix with \(H=H\T\) (symmetric)

  • \(H\) is positive semidefinite (\(H\succeq0\)) if

\[x\T H x\geq 0 \textt{for all} x\in\R^n\]

  • \(H\) is positive definite (\(H\succ0\)) if

\[x\T H x> 0 \textt{for all} 0\ne x\in\R^n\]

  • \(H\) is negative semidefinite if \(-H\) is positive semidefinite, ie, \(H\preceq0 \Longleftrightarrow -H\succeq0\)
  • \(H\) is negative definite if \(-H\) is positive definite, ie, \(H\prec0 \Longleftrightarrow -H\succ0\)

 

  • \(H\) is indefinite if it is neither positive nor negative semidefinite, ie,

\[ \exists\ x,y\in\R^n \textt{such that} x\T H x > 0 \textt{and} y\T H y < 0 \]

Question

The matrix \(H = \begin{bmatrix}\phantom+2 & -1\\ -1 & \phantom-1\end{bmatrix}\) is

  1. positive definite
  2. positive semidefinite
  3. negative definite
  4. negative semidefinite
  5. indefinite

Diagonal matrices

\[ D = \begin{bmatrix}d_1 & 0 & \cdots & 0\\ 0 & d_2 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & d_n \end{bmatrix} = \Diag(d_1,d_2,\ldots,d_n) \]

  • \(D\succ0 \quad\Longleftrightarrow\quad d_i>0\) for all \(i\)
  • \(D\succeq0 \quad\Longleftrightarrow\quad d_i\geq0\) for all \(i\)
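
A quick numerical check of these equivalences, using the standard-library `LinearAlgebra` functions `isposdef` and `eigvals` (the diagonal entries below are arbitrary illustrative values):

using LinearAlgebra

# all diagonal entries positive  ⟹  D ≻ 0
D1 = Matrix(Diagonal([3.0, 1.0, 0.5]))
@show isposdef(D1)       # true
@show eigvals(D1)        # the eigenvalues are the diagonal entries

# a zero diagonal entry  ⟹  D ⪰ 0 but not D ≻ 0
D2 = Matrix(Diagonal([3.0, 0.0, 0.5]))
@show isposdef(D2);      # false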

Eigenpairs of symmetric matrices

Let \(H\) be an \(n\)-by-\(n\) matrix. Then \((x,\lambda)\in\Rn\times\R\) with \(x\ne0\) is an eigenvector/eigenvalue pair of \(H\) if \[Hx = \lambda x\]

Theorem 1 (Eigenvalues of symmetric matrices) If \(H\) is \(n\)-by-\(n\) and symmetric, then \(H\) has \(n\) orthonormal eigenvectors and all of its eigenvalues are real.

\[ \left\{ \begin{aligned} Hx_1 &= \lambda_1 x_1\\ Hx_2 &= \lambda_2 x_2\\ &\vdots\\ Hx_n &= \lambda_n x_n \end{aligned} \right\} \textt{or} H X = X\Lambda\ \] where \(X\T = X^{-1}\) (orthogonal) and \(\Lambda\) is a diagonal matrix of eigenvalues: \[ X = \begin{bmatrix}x_1 & x_2 & \cdots & x_n\end{bmatrix} \quad\text{and}\quad \Lambda = \begin{bmatrix}\lambda_1 & 0 & \cdots & 0\\ 0 & \lambda_2 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & \lambda_n \end{bmatrix} \]
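
A small numerical illustration of the decomposition \(HX = X\Lambda\), written as a sketch with an arbitrary symmetric matrix:

using LinearAlgebra

H = [4.0 1.0; 1.0 3.0]           # symmetric
F = eigen(H)                     # F.values = eigenvalues, F.vectors = X
X, Λ = F.vectors, Diagonal(F.values)

@show norm(H*X - X*Λ)            # ≈ 0:  HX = XΛ
@show norm(X'*X - I);            # ≈ 0:  Xᵀ = X⁻¹ (orthogonal)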

Eigenvalues and definiteness

The matrix \(H\) is positive definite (semidefinite) if and only if all of its eigenvalues are positive (nonnegative).


proof (positive definite)

  • by spectral theorem, \[ X\T H X = \Lambda \textt{where} X\T = X^{-1} \textt{and} \Lambda=\Diag(\lambda_1,\lambda_2,\ldots,\lambda_n)\]

  • for any \(x\in\Rn\) there exists \(y=(y_1,\ldots,y_n)\) such that \(x=Xy\) and \[ x\T Hx = y\T X\T H X y = y\T \Lambda y = \sum_{i=1}^n \lambda_i y_i^2 \]

  • thus, \(x\T Hx>0\) for all \(x\neq 0\) (ie, \(H\) positive definite) if and only if \[ \sum_{i=1}^n \lambda_i y_i^2 > 0 \quad\text{for all}\quad y\neq 0 \quad\Longleftrightarrow\quad \lambda_i > 0 \quad\text{for all}\quad i=1:n \]
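
The change of variables \(x = Xy\) in this proof can be checked numerically; a minimal sketch with an arbitrary symmetric matrix and a random vector:

using LinearAlgebra

H = [1.0 2.0; 2.0 -1.0]          # arbitrary symmetric matrix
λ, X = eigen(H)                  # eigenvalues and orthonormal eigenvectors
x = randn(2)
y = X' * x                       # coordinates of x in the eigenvector basis

@show x'*H*x
@show sum(λ .* y.^2);            # same value:  xᵀHx = Σᵢ λᵢ yᵢ²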

Example

using LinearAlgebra 

\[ H = \begin{bmatrix}4 & 1\\ 1& 3\end{bmatrix} \]

H = [4 1; 1 3]
@show eigvals(H);     # both eigenvalues positive ⟹ H ≻ 0
eigvals(H) = [2.381966011250105, 4.618033988749895]

\[ H = \begin{bmatrix} 1 & 1 & 1\\ 1 & 1 & 1\\ 1 & 1 & 1/10 \end{bmatrix}\]

H = ones(3,3)
H[3, 3] = 1/10
@show eigvals(H);     # one negative eigenvalue ⟹ H is indefinite
eigvals(H) = [-0.6536725037400826, -2.3721342664653315e-17, 2.7536725037400815]

Equivalent conditions

Let \(H\) be an \(n\)-by-\(n\) symmetric matrix.


positive definite equivalences:

  1. all eigenvalues of \(H\) are positive
  2. \(x\T H x > 0\) for all \(0\ne x\in\R^n\)
  3. \(H = R\T R\) for some nonsingular \(n\)-by-\(n\) matrix \(R\)
  4. \(H\) is symmetric and all of its leading principal minors are positive


positive semidefinite equivalences:

  1. all eigenvalues of \(H\) are nonnegative
  2. \(x\T H x \ge 0\) for all \(x\in\R^n\)
  3. \(H = R\T R\) for some \(n\)-by-\(n\) matrix \(R\)
  4. \(H\) is symmetric and all of its principal minors are nonnegative
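
For the positive definite case, condition 3 can be made concrete with a Cholesky factorization, which succeeds exactly when \(H\succ0\) and returns an upper-triangular \(R\) with \(H=R\T R\); a minimal sketch:

using LinearAlgebra

H = [4.0 1.0; 1.0 3.0]
R = cholesky(Symmetric(H)).U     # cholesky throws an error unless H ≻ 0
@show norm(R'*R - H)             # ≈ 0:  H = RᵀR

@show isposdef(H)                        # true
@show isposdef([1.0 2.0; 2.0 1.0]);      # false: eigenvalues are 3 and -1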

Hessians

For \(f:\R^n\to\R\) twice continuously differentiable, the Hessian of \(f\) at \(x\in\R^n\) is the \(n\)-by-\(n\) symmetric matrix \[ H(x) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2}(x) & \frac{\partial^2 f}{\partial x_1\partial x_2}(x) & \cdots & \frac{\partial^2 f}{\partial x_1\partial x_n}(x)\\ \frac{\partial^2 f}{\partial x_2\partial x_1}(x) & \frac{\partial^2 f}{\partial x_2^2}(x) & \cdots & \frac{\partial^2 f}{\partial x_2\partial x_n}(x)\\ \vdots & \vdots & \ddots & \vdots\\ \frac{\partial^2 f}{\partial x_n\partial x_1}(x) & \frac{\partial^2 f}{\partial x_n\partial x_2}(x) & \cdots & \frac{\partial^2 f}{\partial x_n^2}(x) \end{bmatrix} \qquad \frac{\partial^2 f}{\partial x_i\partial x_j} = \frac{\partial^2 f}{\partial x_j\partial x_i} \]

example

\[ f(x) = x_1^2 + 8x_1x_2 - 2x_3^3, \quad \nabla f(x) = \begin{bmatrix}2x_1 + 8x_2\\ 8x_1 \\ -6x_3^2\end{bmatrix}, \quad H(x) = \begin{bmatrix}2 & 8 & \phantom-0\\ 8 & 0 & \phantom-0\\ 0 & 0 & -12x_3\end{bmatrix} \]
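
The gradient and Hessian above can be checked with automatic differentiation, reusing the `ForwardDiff` package from the earlier example (a sketch at an arbitrary point):

using ForwardDiff

f(x) = x[1]^2 + 8x[1]*x[2] - 2x[3]^3
x = [1.0, 2.0, 3.0]

@show ForwardDiff.gradient(f, x)   # [2x₁ + 8x₂, 8x₁, -6x₃²] = [18, 8, -54]
@show ForwardDiff.hessian(f, x);   # [2 8 0; 8 0 0; 0 0 -12x₃]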

Quadratic Functions

Quadratic functions

Quadratic functions over \(\Rn\) have the form \[ f(x) = \half x\T H x + b\T x + \gamma \] where \(H\) is symmetric, \(b\in\Rn\), and \(\gamma\in\R\).

  • \(n=1\)

\[ f(x) = \half hx^2 + bx + \gamma, \quad H = [h] \]

  • \(n=2\)

\[ \begin{aligned} f(x) &= \half[\, x_1 \ x_2\, ] \begin{bmatrix}h_{11} & h_{12}\\ h_{21} & h_{22}\end{bmatrix} \begin{bmatrix}x_1\\ x_2\end{bmatrix} + [\,b_1 \ b_2\,] \begin{bmatrix}x_1\\ x_2\end{bmatrix} + \gamma\\ &= \half h_{11}x_1^2 + h_{12}x_1x_2 + \half h_{22}x_2^2 + b_1x_1 + b_2x_2 + \gamma \end{aligned} \]

Quadratic functions and symmetry

\[ f(x) = \half x\T H x + b\T x + \gamma, \quad \nabla f(x) = Hx + b, \quad \nabla^2 f(x) = H \]


We can always assume without loss of generality that \[ H = H\T \quad \text{(symmetric)} \]


Suppose that \(H\ne H\T\): \[ x\T H x = \half x\T H x + \half x\T H\T x = x\T \left[\half\left(H + H\T\right)\right] x \]


Thus we can replace \(H\) with \(\half(H + H\T)\) without changing the function value.
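
A quick numerical check of this symmetrization, sketched with an arbitrary nonsymmetric matrix and a random vector:

using LinearAlgebra

H = [1.0 4.0; 0.0 2.0]           # not symmetric
Hsym = (H + H')/2                # symmetric part of H
x = randn(2)

@show x'*H*x
@show x'*Hsym*x;                 # identical: the quadratic form only sees (H + Hᵀ)/2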


Optimality for quadratic functions

\[ \min_{x\in\Rn} f(x) = \half x\T H x + b\T x + \gamma \]

\[ \nabla f(x) = Hx + b, \quad \nabla^2 f(x) = H \]

optimality conditions

  • (necessary) \(x^*\) is optimal only if \(\nabla f(x^*)=Hx^*+b=0\) (stationary)
  • (sufficient) if stationary and \(H\succeq0\), then \(x^*\) is a global minimizer
  • (sufficient) if stationary and \(H\succ0\), then \(x^*\) is the unique global minimizer


proof

for all \(d\ne0\), \[ f(x^*+d) - f(x^*) = d\T\underbrace{\nabla f(x^*)}_{=0} + \half \underbrace{d\T\overbrace{\nabla^2 f(x^*)}^{=H}d}_{\ge0} \begin{cases} \ge0 & \text{if } H\succeq0\\ >0 & \text{if } H\succ0 \end{cases} \]
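
When \(H\succ0\), the stationarity equation \(Hx^*+b=0\) pins down the unique global minimizer, which can be computed with a linear solve; a minimal sketch (the particular \(H\), \(b\), and \(\gamma\) are arbitrary):

using LinearAlgebra

H = [4.0 1.0; 1.0 3.0]           # H ≻ 0
b = [1.0, -2.0]
γ = 5.0
f(x) = 0.5*x'*H*x + b'*x + γ

xstar = -(H \ b)                 # solve Hx* + b = 0
@show norm(H*xstar + b)          # ≈ 0: x* is stationary
@show f(xstar) ≤ f(xstar + randn(2));   # true: no other point does better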

Nonlinear functions

Directional second derivatives

Given \(f:\Rn\to\R\), recall the directional derivative \[ f'(x;d) = \lim_{\alpha\to 0^+}\frac{f(x+\alpha d)-f(x)}{\alpha} = d\T\nabla f(x) \]

Definition 1 The directional second derivative of \(f\) at \(x\) in the direction \(d\) is \[ f''(x;d) = \lim_{\alpha\to0^+} \frac{f'(x+\alpha d;d) - f'(x;d)}{\alpha} = d\T\nabla^2 f(x)d \]

partial 2nd derivatives are the directional 2nd derivatives along each canonical basis vector \(e_i\): \[ \frac{\partial^2 f}{\partial x_i^2}(x) = f''(x;e_i) \textt{with} e_i(j) = \begin{cases} 1 & \text{if } j=i\\ 0 & \text{if } j\ne i \end{cases} \]
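
Definition 1 can be checked numerically: a finite-difference quotient of directional first derivatives approaches \(d\T\nabla^2 f(x)d\). A sketch using `ForwardDiff`, with an arbitrary point, direction, and step size:

using ForwardDiff

f(x) = (x[1]+x[2])/(x[1]^2+x[2]^2+1)
fp(x, d) = ForwardDiff.gradient(f, x)' * d     # directional derivative f'(x; d)

x, d = [0.3, -0.7], [1.0, 2.0]
α = 1e-6
@show (fp(x + α*d, d) - fp(x, d))/α            # difference quotient of f'(⋅; d)
@show d' * ForwardDiff.hessian(f, x) * d;      # dᵀ∇²f(x)d — agrees up to O(α)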

Linear and quadratic approximations

Suppose \(f:\Rn\to\R\) is twice continuously differentiable.

Theorem 2 (Linear approximation) For all \(x\in\Rn\) and \(\epsilon>0\), for each \(y\in\epsilon\Ball(x)\) there exists \(z\in[x,y]\) such that
\[ f(y) = f(x) + \nabla f(x)\T (y-x) + \half (y-x)\T\nabla^2 f(z)(y-x) \]


Theorem 3 (Quadratic approximation) For all \(x\) and \(d\) in \(\Rn\), \[ f(x+d) = f(x) + \nabla f(x)\T d + \half d\T\nabla^2 f(x)d + o(\|d\|^2) \]
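
A numerical sketch of Theorem 3: as the step shrinks, the error of the quadratic model vanishes even relative to \(\|d\|^2\) (the point and direction below are arbitrary):

using ForwardDiff, LinearAlgebra

f(x) = (x[1]+x[2])/(x[1]^2+x[2]^2+1)
x = [0.5, 1.0]
g = ForwardDiff.gradient(f, x)
H = ForwardDiff.hessian(f, x)

for t in (1e-1, 1e-2, 1e-3)
    d = t*[1.0, -2.0]
    err = f(x + d) - (f(x) + g'*d + 0.5*d'*H*d)
    @show err/norm(d)^2          # → 0 as ‖d‖ → 0
end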

Second-order necessary conditions

For \(f:\Rn\to\R\) twice continuously differentiable and \(\bar x\in\Rn\) stationary (ie, \(\nabla f(\bar x)=0\))

  • \(\bar x\) is a local min \(\quad\Longrightarrow\quad\) \(\nabla^2f(\bar x) \succeq0\)

  • \(\bar x\) is a local max \(\quad\Longrightarrow\quad\) \(\nabla^2f(\bar x) \preceq0\)


proof sketch for local min (analogous for local max). If \(\bar x\) is a local min, then for all \(d\ne0\) and all sufficiently small \(\alpha>0\)

\[ \begin{aligned} 0\le f(\bar x+\alpha d) - f(\bar x) &= \alpha\, d\T\underbrace{\nabla f(\bar x)}_{=0} + \half \alpha^2 d\T\nabla^2 f(\bar x)d + o(\alpha^2\|d\|^2) \end{aligned} \]

Divide both sides by \(\alpha^2\) and take the limit as \(\alpha\to0^+\). Because \(o(\alpha^2\|d\|^2)/\alpha^2\to0\), \[ 0\le d\T\nabla^2 f(\bar x)d \]

Because this holds for all \(d\ne0\), \[ \nabla^2 f(\bar x) \succeq0 \]

Sufficient conditions for optimality

For \(f:\Rn\to\R\) twice continuously differentiable and \(\bar x\in\Rn\) stationary, ie, \(\nabla f(\bar x)=0\),

  • \(\nabla^2 f(\bar x)\succ0 \quad\Longrightarrow\quad\) \(\bar x\) is a local min
  • \(\nabla^2 f(\bar x)\prec0 \quad\Longrightarrow\quad\) \(\bar x\) is a local max

proof sketch for local min (analogous for local max). By the linear approximation theorem and continuity of \(\nabla^2 f\), for any \(x\ne\bar x\) close enough to \(\bar x\) there exists \(z\in[\bar x,x]\) such that

\[ f(x) - f(\bar x) = (x-\bar x)\T\underbrace{\nabla f(\bar x)}_{=0} + \half (x-\bar x)\T\underbrace{\nabla^2 f(z)}_{\succ0}(x-\bar x)>0 \]

Question

\[ f(x) = x_1^2 + 8x_1x_2 - 2x_3^3, \quad \nabla f(x) = \begin{bmatrix}2x_1 + 8x_2\\ 8x_1 \\ -6x_3^2\end{bmatrix}, \quad H(x) = \begin{bmatrix}2 & 8 & \phantom-0\\ 8 & 0 & \phantom-0\\ 0 & 0 & -12x_3\end{bmatrix} \]

The stationary point \(x^* = (0,0,0)\) is a

  1. minimizer
  2. maximizer
  3. saddle point

Example

\[ \min_{x,y}\ f(x,y) = \frac{x+y}{x^2+y^2+1} \]

\[ \nabla f(x,y) = \frac{1}{(x^2+y^2+1)^2} \begin{bmatrix} y^2-2xy-x^2+1\\ x^2-2xy-y^2+1 \end{bmatrix} \]

Stationary points \(\nabla f(x^*,y^*)=0\):

\[ \underbrace{ (x^*_1,y^*_1)=-\frac{1}{\sqrt2}(1,1)}_{\text{minimizer}} \]

\[ \underbrace{(x^*_2,y^*_2)=+\frac{1}{\sqrt2}(1,1)}_{\text{maximizer}} \]

Hessian of \(f\) at these points:

\[ \nabla^2 f(x^*_1,y^*_1) = \frac1{\sqrt{2}}\begin{bmatrix} 1 & 0\\0 & 1 \end{bmatrix}\succ0 \]

\[ \nabla^2 f(x^*_2,y^*_2) = \frac{1}{\sqrt{2}}\begin{bmatrix} -1 & 0\\ 0 & -1 \end{bmatrix}\prec0 \]
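
These Hessians can be confirmed with `ForwardDiff.hessian`, continuing the code from the first example (a sketch):

using ForwardDiff

f(x) = (x[1]+x[2])/(x[1]^2+x[2]^2+1)
x = [1, 1]/sqrt(2)

@show ForwardDiff.hessian(f, -x)   # ≈ (1/√2)·I ≻ 0  ⟹  minimizer
@show ForwardDiff.hessian(f, +x);  # ≈ -(1/√2)·I ≺ 0 ⟹  maximizer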