UBC CPSC 406 2022-T2

Gradients

Gradients provide information on a function's sensitivity to perturbations in the input.

Directional derivatives

The behavior of a function $f:\mathbb R^n\to\mathbb R$ along the ray $\{x+\alpha d \mid \alpha\in\mathbb R_+\}$, where $x$ and $d$ are $n$-vectors, is given by the univariate function

$$\phi(\alpha) = f(x+\alpha d).$$

From standard calculus, the derivative of $\phi$ at the origin, when it exists, is the limit

$$\phi'(0) = \lim_{\alpha\to0^+}\frac{\phi(\alpha)-\phi(0)}{\alpha}.$$

We thus arrive at the following definition.

Definition: (directional derivative) The directional derivative of a function $f:\mathbb R^n\to\mathbb R$ at a point $x\in\mathbb R^n$, along a direction $d\in\mathbb R^n$, is the limit

$$f'(x;d) = \lim_{\alpha\to0^+}\frac{f(x+\alpha d)-f(x)}{\alpha}.$$

It follows immediately from this definition that the partial derivatives of $f$ are simply the directional derivatives of $f$ along each of the canonical unit directions $e_1,\ldots,e_n$, i.e.,

$$\frac{\partial f}{\partial x_i}(x) \equiv f'(x;e_i).$$
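
As a small numerical illustration (the function, point, and step sizes below are illustrative choices, not from the notes), the difference quotient along a unit direction settles down to the corresponding partial derivative:

g(x) = x[1]^2 + 2x[2]^2           # illustrative function
x0, e2 = [1.0, -1.0], [0.0, 1.0]  # a point and the unit direction e₂
[(g(x0 + α*e2) - g(x0)) / α for α in (1e-1, 1e-3, 1e-6)]  # → ∂g/∂x₂(x0) = 4·(-1) = -4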

Descent directions

A nonzero vector $d$ is a descent direction of $f$ at $x$ if the directional derivative is negative:

$$f'(x;d) < 0.$$

It follows directly from the definition of the directional derivative that $f(x+\alpha d) < f(x)$ for all sufficiently small positive $\alpha$.
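
Continuing the illustrative sketch above, take $d = (-2, 4)$ at $x_0 = (1,-1)$. A direct computation gives $g(x_0+\alpha d) = 3 - 20\alpha + 36\alpha^2$, so $g'(x_0;d) = -20 < 0$ and $d$ is a descent direction. A quick check that $g$ indeed decreases for small steps:

d = [-2.0, 4.0]                                    # a descent direction for g at x0
[g(x0 + α*d) < g(x0) for α in (1e-1, 1e-2, 1e-3)]  # all true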

Gradient vector

The gradient of the function $f$ is the collection of all the partial derivatives:

$$\nabla f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1}(x) \\ \vdots \\ \frac{\partial f}{\partial x_n}(x) \end{bmatrix}.$$

When $f$ is differentiable at $x$, the gradient and directional derivative are related via the formula

$$f'(x;d) = \nabla f(x)^T\! d.$$

If, for example, we take the direction $d$ to be the canonical unit direction $e_i$, then this formula reduces to

$$f'(x;e_i) = \nabla f(x)^T\! e_i = [\nabla f(x)]_i = \frac{\partial f}{\partial x_i}(x),$$

which confirms the partial-derivative identity stated above.
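
For the running illustrative sketch, the formula is easy to check by hand: $\nabla g(x_0) = (2,-4)$ at $x_0=(1,-1)$, so $\nabla g(x_0)^T\! d = 2\cdot(-2) + (-4)\cdot 4 = -20$, which matches the directional derivative computed earlier:

∇g(x) = [2x[1], 4x[2]]           # hand-derived gradient of the illustrative g
∇g(x0)' * d                      # -20.0
(g(x0 + 1e-6*d) - g(x0)) / 1e-6  # ≈ -20.0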

Linear approximation

The gradient of a continuously differentiable function $f$ (i.e., $f$ is differentiable at all $x$ and $\nabla f$ is continuous) provides a local linear approximation of $f$ in the following sense:

$$f(x+d) = f(x) + \nabla f(x)^T\! d + o(\|d\|),$$

where the residual $o:\mathbb R_+\to\mathbb R$ of the approximation decays faster than $\|d\|$, i.e.,

$$\lim_{\alpha\to0^+} o(\alpha)/\alpha = 0.$$
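
A numerical sketch of this decay, reusing $g$, $x_0$, and $\nabla g$ from the illustration above (the direction and scales are arbitrary choices):

using LinearAlgebra                       # for norm
res(d) = g(x0 + d) - g(x0) - ∇g(x0)' * d  # residual of the linear model at x0
[res(ϵ*[1.0, 1.0]) / norm(ϵ*[1.0, 1.0]) for ϵ in (1e-1, 1e-2, 1e-3)]  # → 0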

Example

Fortunately, there are good computational tools that automatically produce reliable gradients. Consider the 2-dimensional Rosenbrock function and its gradient:

$$\begin{aligned} f(x) &= (a - x_1)^2 + b(x_2 - x_1^2)^2 \\ \nabla f(x) &= \begin{bmatrix} -2(a-x_1)-4b(x_2-x_1^2)x_1 \\ 2b(x_2-x_1^2) \end{bmatrix} \end{aligned}$$

Here is the code for $f$ and its gradient:

a, b = 1, 100
f(x) = (a - x[1])^2 + b*(x[2] - x[1]^2)^2
∇f(x) = [-2(a - x[1]) - 4b*(x[2] - x[1]^2)*x[1], 2b*(x[2] - x[1]^2)]

Instead of computing gradients by hand, as we did above, we can use automatic differentiation, such as that implemented in the ForwardDiff package, to compute them.

using ForwardDiff
∇fad(x) = ForwardDiff.gradient(f, x)

The function ∇fad returns the value of the gradient at x. Let's compare the hand-computed gradient ∇f against the automatically computed gradient ∇fad at a random point:

x = rand(2) 
∇f(x) == ∇fad(x)
true
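
The same tools also let us spot-check the directional-derivative formula $f'(x;d)=\nabla f(x)^T\! d$ on the Rosenbrock example. This sketch reuses f, ∇f, and x from above; the direction d is an arbitrary choice:

using LinearAlgebra                             # for dot
d = randn(2)
ϕ(α) = f(x + α*d)                               # restriction of f to the ray through x along d
ForwardDiff.derivative(ϕ, 0.0) ≈ dot(∇f(x), d)  # true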

Calculus rules

We derive calculus rules for linear and quadratic functions, which appear often in optimization.

Linear functions

Let $a\in\mathbb R^n$. The linear function

$$f(x) = a^T\! x = \sum_{i=1}^n a_i x_i$$

has the gradient $\nabla f(x) = a$, and so the gradient is constant. Here's a small example:

a = collect(1:5)
ForwardDiff.gradient(x->a'x, rand(5))
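
The result should be a floating-point copy of a no matter which point is passed, since the gradient of a linear function is constant. A quick check (sketch):

ForwardDiff.gradient(x -> a'x, rand(5)) ≈ a  # true at any point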

Quadratic functions

Let $A\in\mathbb R^{n\times n}$ be a square matrix. Consider the quadratic function

$$f(x) = \tfrac12 x^T\! A x.$$

One way to derive the gradient of this function is to write out the quadratic with all of the coefficients of $A$ made explicit. Here's another approach, which uses the product rule: differentiate with respect to each appearance of $x$ in turn, holding the other fixed, so that

$$\nabla f(x) = \tfrac12 \nabla(x^T\! a) + \tfrac12 \nabla(b^T\! x),$$

where $a := Ax$ and $b := A^T\! x$ are held fixed when applying the gradient. Because each function on the right-hand side of this sum is linear in $x$, we can apply the calculus rule for linear functions to deduce that

$$\nabla f(x) = \tfrac12 Ax + \tfrac12 A^T\! x = \tfrac12 (A+A^T\!)x.$$

(Recall that $A$ is square.) The matrix $\tfrac12(A+A^T\!)$ is the symmetric part of $A$. If $A$ is symmetric, i.e., $A = A^T\!$, then the gradient reduces to

$$\nabla f(x) = Ax.$$

In optimization we almost always assume that the matrix defining the quadratic above is symmetric, because $x^T\! A x = \tfrac12 x^T\!(A+A^T\!)x$ for every $x$, so we can always work with the symmetric part of $A$ instead.
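
This rule is also easy to spot-check numerically. In this sketch the matrix B and the point z are arbitrary, and B is deliberately left nonsymmetric:

B = randn(4, 4)                                # a generally nonsymmetric matrix
z = randn(4)
q(w) = w' * B * w / 2                          # quadratic (1/2)wᵀBw
ForwardDiff.gradient(q, z) ≈ (B + B') * z / 2  # true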

Example (2-norm). Consider the two functions

$$f_1(x) = \|x\|_2 \quad\text{and}\quad f_2(x) = \tfrac12\|x\|_2^2.$$

The function $f_2$ is of the quadratic form above with $A=I$, and so $\nabla f_2(x) = x$. Use the chain rule to obtain the gradient of $f_1$:

$$\nabla f_1(x) = \nabla\,(x^T\! x)^{1/2} = \tfrac12 (x^T\! x)^{-1/2}\,\nabla(x^T\! x) = \tfrac12 (x^T\! x)^{-1/2}(2x) = \frac{x}{\|x\|_2},$$

which is undefined at the origin, where $f_1$ fails to be differentiable.
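
Away from the origin, both formulas can be verified numerically. In this sketch z is an arbitrary (almost surely nonzero) point:

using LinearAlgebra                                       # for norm
z = randn(3)
ForwardDiff.gradient(w -> sqrt(w' * w), z) ≈ z / norm(z)  # ∇f₁(z) = z/‖z‖₂
ForwardDiff.gradient(w -> (w' * w) / 2, z) ≈ z            # ∇f₂(z) = z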

Visualizing gradients

Gradients can be understood geometrically in relation to the level sets of the function. The $\alpha$-level set of a function $f:\mathbb R^n\to\mathbb R$ is the set of points at which $f$ takes a value of at most $\alpha$:

$$[f\le\alpha] = \{x\in\mathbb R^n \mid f(x)\le\alpha\}.$$

Fix any $x$ and consider the level set $[f\le f(x)]$. For any direction $d$ that is either a descent direction for $f$ at $x$ or a tangent direction to $[f\le f(x)]$ at $x$,

$$f'(x;d) = \nabla f(x)^T\! d \le 0,$$

which implies that the gradient $\nabla f(x)$ is an outward normal to the level set at $x$.
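
One way to see this geometrically (a sketch that assumes the Plots package; the quadratic, the point, and the plotting ranges are all illustrative choices): draw a few level curves of a function and overlay the gradient at a point. The arrow is perpendicular to the level curve through that point and points away from the sublevel set.

using Plots, ForwardDiff
h(x) = x[1]^2 + 2x[2]^2                               # illustrative convex quadratic
xs = ys = range(-2, 2, length=100)
contour(xs, ys, (u, v) -> h([u, v]), levels=10, aspect_ratio=:equal)
p = [1.0, 0.5]                                        # a point on one of the level curves
gp = ForwardDiff.gradient(h, p)
quiver!([p[1]], [p[2]], quiver=([gp[1]], [gp[2]]))    # gradient: outward normal at p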