UBC CPSC 406 2022-T2

Gradients

Gradients provide information on a function's sensitivity to perturbations in the input.

Directional derivatives

The behavior of a function $f:\mathbb R^n\to\mathbb R$ along the ray $\{x+\alpha d \mid \alpha\in\mathbb R_+\}$, where $x$ and $d$ are $n$-vectors, is given by the univariate function

$$\phi(\alpha) = f(x+\alpha d).$$

From standard calculus, the derivative of $\phi$ at the origin, when it exists, is the limit

$$\phi'(0) = \lim_{\alpha\to0^+}\frac{\phi(\alpha)-\phi(0)}{\alpha}.$$

We thus arrive at the following definition.

Definition: (directional derivative) The directional derivative of a function $f:\mathbb R^n\to\mathbb R$ at a point $x\in\mathbb R^n$, along a direction $d\in\mathbb R^n$, is the limit

$$f'(x;d) = \lim_{\alpha\to0^+}\frac{f(x+\alpha d)-f(x)}{\alpha}.$$

It follows immediately from this definition that the partial derivatives of $f$ are simply the directional derivatives of $f$ along each of the canonical unit directions $e_1,\ldots,e_n$, i.e.,

$$\frac{\partial f}{\partial x_i}(x) \equiv f'(x;e_i).$$
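
As a small numerical illustration (the function, point, and step sizes below are illustrative choices, not from the notes), the difference quotient along a unit direction settles down to the corresponding partial derivative:

g(x) = x[1]^2 + 2x[2]^2           # illustrative function
x0, e2 = [1.0, -1.0], [0.0, 1.0]  # a point and the unit direction e₂
[(g(x0 + α*e2) - g(x0)) / α for α in (1e-1, 1e-3, 1e-6)]  # → ∂g/∂x₂(x0) = 4·(-1) = -4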

Descent directions

A nonzero vector $d$ is a descent direction of $f$ at $x$ if the directional derivative is negative:

$$f'(x;d) < 0.$$

It follows directly from the definition of the directional derivative that $f(x+\alpha d) < f(x)$ for all sufficiently small positive $\alpha$.
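
Continuing the illustrative sketch above, take $d = (-2, 4)$ at $x_0 = (1,-1)$. A direct computation gives $g(x_0+\alpha d) = 3 - 20\alpha + 36\alpha^2$, so $g'(x_0;d) = -20 < 0$ and $d$ is a descent direction. A quick check that $g$ indeed decreases for small steps:

d = [-2.0, 4.0]                                    # a descent direction for g at x0
[g(x0 + α*d) < g(x0) for α in (1e-1, 1e-2, 1e-3)]  # all true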

Gradient vector

The gradient of the function $f$ is the collection of all the partial derivatives:

$$\nabla f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1}(x) \\ \vdots \\ \frac{\partial f}{\partial x_n}(x) \end{bmatrix}.$$

When $f$ is differentiable at $x$, the gradient and directional derivative are related via the formula

$$f'(x;d) = \nabla f(x)^T\! d.$$

If, for example, we take the direction $d$ to be the canonical unit direction $e_i$, then this formula reduces to

$$f'(x;e_i) = \nabla f(x)^T\! e_i = [\nabla f(x)]_i = \frac{\partial f}{\partial x_i}(x),$$

which confirms the partial-derivative identity stated above.
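
For the running illustrative sketch, the formula is easy to check by hand: $\nabla g(x_0) = (2,-4)$ at $x_0=(1,-1)$, so $\nabla g(x_0)^T\! d = 2\cdot(-2) + (-4)\cdot 4 = -20$, which matches the directional derivative computed earlier:

∇g(x) = [2x[1], 4x[2]]           # hand-derived gradient of the illustrative g
∇g(x0)' * d                      # -20.0
(g(x0 + 1e-6*d) - g(x0)) / 1e-6  # ≈ -20.0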

Linear approximation

The gradient of a continuously differentiable function $f$ (i.e., $f$ is differentiable at all $x$ and $\nabla f$ is continuous) provides a local linear approximation of $f$ in the following sense:

$$f(x+d) = f(x) + \nabla f(x)^T\! d + o(\|d\|),$$

where the residual $o:\mathbb R_+\to\mathbb R$ of the approximation decays faster than $\|d\|$, i.e.,

$$\lim_{\alpha\to0^+} o(\alpha)/\alpha = 0.$$
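
A numerical sketch of this decay, reusing $g$, $x_0$, and $\nabla g$ from the illustration above (the direction and scales are arbitrary choices):

using LinearAlgebra                       # for norm
res(d) = g(x0 + d) - g(x0) - ∇g(x0)' * d  # residual of the linear model at x0
[res(ϵ*[1.0, 1.0]) / norm(ϵ*[1.0, 1.0]) for ϵ in (1e-1, 1e-2, 1e-3)]  # → 0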

Example

Fortunately, there are good computational tools that automatically produce reliable gradients. Consider the 2-dimensional Rosenbrock function and its gradient:

$$\begin{aligned} f(x) &= (a - x_1)^2 + b(x_2 - x_1^2)^2 \\ \nabla f(x) &= \begin{bmatrix} -2(a-x_1)-4b(x_2-x_1^2)x_1 \\ 2b(x_2-x_1^2) \end{bmatrix} \end{aligned}$$

Here is the code for $f$ and its gradient:

a, b = 1, 100
f(x) = (a - x[1])^2 + b*(x[2] - x[1]^2)^2
∇f(x) = [-2(a - x[1]) - 4b*(x[2] - x[1]^2)*x[1], 2b*(x[2] - x[1]^2)]

Instead of computing gradients by hand, as we did above, we can use automatic differentiation, such as that implemented in the ForwardDiff package, to compute them.

using ForwardDiff
∇fad(x) = ForwardDiff.gradient(f, x)

The function ∇fad returns the value of the gradient at x. Let's compare the hand-computed gradient ∇f against the automatically computed gradient ∇fad at a random point:

x = rand(2) 
∇f(x) == ∇fad(x)
true
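
The same tools also let us spot-check the directional-derivative formula $f'(x;d)=\nabla f(x)^T\! d$ on the Rosenbrock example. This sketch reuses f, ∇f, and x from above; the direction d is an arbitrary choice:

using LinearAlgebra                             # for dot
d = randn(2)
ϕ(α) = f(x + α*d)                               # restriction of f to the ray through x along d
ForwardDiff.derivative(ϕ, 0.0) ≈ dot(∇f(x), d)  # true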

Calculus rules

We derive calculus rules for linear and quadratic functions, which appear often in optimization.

Linear functions

Let $a\in\mathbb R^n$. The linear function

$$f(x) = a^T\! x = \sum_{i=1}^n a_i x_i$$

has the gradient $\nabla f(x) = a$, and so the gradient is constant. Here's a small example:

a = collect(1:5)
ForwardDiff.gradient(x->a'x, rand(5))
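
The result should be a floating-point copy of a no matter which point is passed, since the gradient of a linear function is constant. A quick check (sketch):

ForwardDiff.gradient(x -> a'x, rand(5)) ≈ a  # true at any point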

Quadratic functions

Let $A\in\mathbb R^{n\times n}$ be a square matrix. Consider the quadratic function

$$f(x) = \tfrac12 x^T\! A x.$$

One way to derive the gradient of this function is to write out the quadratic with all of the coefficients of $A$ made explicit. Here's another approach, which uses the product rule: differentiate with respect to each appearance of $x$ in turn, holding the other fixed, so that

$$\nabla f(x) = \tfrac12 \nabla(x^T\! a) + \tfrac12 \nabla(b^T\! x),$$

where $a := Ax$ and $b := A^T\! x$ are held fixed when applying the gradient. Because each function on the right-hand side of this sum is linear in $x$, we can apply the calculus rule for linear functions to deduce that

$$\nabla f(x) = \tfrac12 Ax + \tfrac12 A^T\! x = \tfrac12 (A+A^T\!)x.$$

(Recall that $A$ is square.) The matrix $\tfrac12(A+A^T\!)$ is the symmetric part of $A$. If $A$ is symmetric, i.e., $A = A^T\!$, then the gradient reduces to

$$\nabla f(x) = Ax.$$

In optimization we almost always assume that the matrix defining the quadratic above is symmetric, because $x^T\! A x = \tfrac12 x^T\!(A+A^T\!)x$ for every $x$, so we can always work with the symmetric part of $A$ instead.
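
This rule is also easy to spot-check numerically. In this sketch the matrix B and the point z are arbitrary, and B is deliberately left nonsymmetric:

B = randn(4, 4)                                # a generally nonsymmetric matrix
z = randn(4)
q(w) = w' * B * w / 2                          # quadratic (1/2)wᵀBw
ForwardDiff.gradient(q, z) ≈ (B + B') * z / 2  # true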

Example (2-norm). Consider the two functions

$$f_1(x) = \|x\|_2 \quad\text{and}\quad f_2(x) = \tfrac12\|x\|_2^2.$$

The function $f_2$ is of the quadratic form above with $A=I$, and so $\nabla f_2(x) = x$. Use the chain rule to obtain the gradient of $f_1$:

$$\nabla f_1(x) = \nabla\,(x^T\! x)^{1/2} = \tfrac12 (x^T\! x)^{-1/2}\,\nabla(x^T\! x) = \tfrac12 (x^T\! x)^{-1/2}(2x) = \frac{x}{\|x\|_2},$$

which is undefined at the origin, where $f_1$ fails to be differentiable.
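
Away from the origin, both formulas can be verified numerically. In this sketch z is an arbitrary (almost surely nonzero) point:

using LinearAlgebra                                       # for norm
z = randn(3)
ForwardDiff.gradient(w -> sqrt(w' * w), z) ≈ z / norm(z)  # ∇f₁(z) = z/‖z‖₂
ForwardDiff.gradient(w -> (w' * w) / 2, z) ≈ z            # ∇f₂(z) = z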

Visualizing gradients

Gradients can be understood geometrically in relation to the level sets of the function. The $\alpha$-level set of a function $f:\mathbb R^n\to\mathbb R$ is the set of points at which $f$ takes a value of at most $\alpha$:

$$[f\le\alpha] = \{x\in\mathbb R^n \mid f(x)\le\alpha\}.$$

Fix any $x$ and consider the level set $[f\le f(x)]$. For any direction $d$ that is either a descent direction for $f$ at $x$ or a tangent direction to $[f\le f(x)]$ at $x$,

$$f'(x;d) = \nabla f(x)^T\! d \le 0,$$

which implies that the gradient $\nabla f(x)$ is an outward normal to the level set at $x$.
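
One way to see this geometrically (a sketch that assumes the Plots package; the quadratic, the point, and the plotting ranges are all illustrative choices): draw a few level curves of a function and overlay the gradient at a point. The arrow is perpendicular to the level curve through that point and points away from the sublevel set.

using Plots, ForwardDiff
h(x) = x[1]^2 + 2x[2]^2                               # illustrative convex quadratic
xs = ys = range(-2, 2, length=100)
contour(xs, ys, (u, v) -> h([u, v]), levels=10, aspect_ratio=:equal)
p = [1.0, 0.5]                                        # a point on one of the level curves
gp = ForwardDiff.gradient(h, p)
quiver!([p[1]], [p[2]], quiver=([gp[1]], [gp[2]]))    # gradient: outward normal at p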