The behavior of a function $f$ along the ray $\{\, x + \alpha d \mid \alpha \ge 0 \,\}$, where $x$ and $d$ are $n$-vectors, is given by the univariate function

$$\phi(\alpha) = f(x + \alpha d).$$

From standard calculus, the derivative of $\phi$ at the origin, when it exists, is the limit

$$\phi'(0) = \lim_{\alpha \to 0^+} \frac{f(x + \alpha d) - f(x)}{\alpha}.$$
We thus arrive at the following definition: the directional derivative of $f$ at $x$ in the direction $d$ is

$$f'(x; d) := \lim_{\alpha \to 0^+} \frac{f(x + \alpha d) - f(x)}{\alpha}.$$
It follows immediately from this definition that the partial derivatives of $f$ are simply the directional derivatives of $f$ along each of the canonical unit directions $e_1, \ldots, e_n$, i.e.,

$$\frac{\partial f}{\partial x_i}(x) = f'(x; e_i), \quad i = 1, \ldots, n. \tag{1}$$
A nonzero vector $d$ is a descent direction of $f$ at $x$ if the directional derivative is negative:

$$f'(x; d) < 0.$$

It follows directly from the definition of the directional derivative that $f(x + \alpha d) < f(x)$ for all positive $\alpha$ small enough.
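As a quick numerical illustration (not part of the development above; the quadratic `q`, the point `x`, and the direction `d` below are arbitrary choices), we can approximate the directional derivative by a one-sided finite difference and check that a negative value indeed comes with a decrease in the function:

```julia
# Sketch: estimate q'(x; d) by a one-sided finite difference and check descent.
q(x) = x[1]^2 + 2*x[2]^2          # an arbitrary smooth example function
x = [1.0, 1.0]                    # base point
d = [-1.0, 0.0]                   # candidate direction

α = 1e-6
dirderiv = (q(x + α*d) - q(x)) / α   # ≈ q'(x; d), here ≈ -2

println("estimated directional derivative: ", dirderiv)
println("q decreases along d for a small step: ", q(x + 1e-3*d) < q(x))
```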
The gradient of the function $f$ is the collection of all the partial derivatives:

$$\nabla f(x) = \begin{bmatrix} \dfrac{\partial f}{\partial x_1}(x) \\ \vdots \\ \dfrac{\partial f}{\partial x_n}(x) \end{bmatrix}.$$

The gradient and directional derivative are related via the formula

$$f'(x; d) = \nabla f(x)^T d.$$
If, for example, we take the direction $d$ to be the canonical unit direction $e_i$, then this formula reduces to

$$f'(x; e_i) = \nabla f(x)^T e_i = \frac{\partial f}{\partial x_i}(x),$$
which confirms the identity (1).
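To see the relation numerically, here's an illustrative sketch (the function `q`, its hand-computed gradient, and the point and direction are arbitrary choices, not from the text): a finite-difference estimate of the directional derivative should match the inner product $\nabla q(x)^T d$.

```julia
# Sketch: the inner product ∇q(x)'d matches a finite-difference estimate of q'(x; d).
q(x)  = x[1]^2 + 2*x[2]^2        # arbitrary smooth example
∇q(x) = [2*x[1], 4*x[2]]         # its gradient, computed by hand

x, d = [1.0, 1.0], [0.3, -0.7]
fd = (q(x + 1e-6*d) - q(x)) / 1e-6   # finite-difference estimate

println("finite difference: ", fd)        # ≈ -2.2
println("∇q(x)'d:           ", ∇q(x)'d)   # ≈ -2.2
```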
The gradient of a continuously differentiable function $f$ (i.e., $f$ is differentiable at all $x$ and $\nabla f$ is continuous) provides a local linear approximation of $f$ in the following sense:

$$f(x + d) \approx f(x) + \nabla f(x)^T d.$$

The residual of the approximation decays faster than $\|d\|$, i.e.,

$$\lim_{d \to 0} \frac{f(x + d) - f(x) - \nabla f(x)^T d}{\|d\|} = 0.$$
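Here's an informal check of this decay rate, reusing the same arbitrary quadratic example as above (redefined so the snippet stands alone): shrink `d` and watch the ratio of the residual to $\|d\|$ go to zero.

```julia
using LinearAlgebra   # norm, dot

q(x)  = x[1]^2 + 2*x[2]^2
∇q(x) = [2*x[1], 4*x[2]]

x, d0 = [1.0, 1.0], [0.3, -0.7]
for t in (1.0, 0.1, 0.01, 0.001)
    d = t * d0
    r = q(x + d) - q(x) - dot(∇q(x), d)      # residual of the linear model
    println("‖d‖ = ", norm(d), "    r/‖d‖ = ", r / norm(d))
end
```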
Fortunately, there are good computational tools that automatically produce reliable gradients. Consider the 2-dimensional Rosenbrock function

$$f(x) = (a - x_1)^2 + b\,(x_2 - x_1^2)^2$$

and its gradient

$$\nabla f(x) = \begin{bmatrix} -2(a - x_1) - 4b\,(x_2 - x_1^2)\,x_1 \\ 2b\,(x_2 - x_1^2) \end{bmatrix}.$$
Here is the code for $f$ and its gradient $\nabla f$:
```julia
a, b = 1, 100
f(x) = (a - x[1])^2 + b*(x[2] - x[1]^2)^2
∇f(x) = [-2(a - x[1]) - 4b*(x[2] - x[1]^2)*x[1], 2b*(x[2] - x[1]^2)]
```
Instead of computing gradients by hand, as we did above, we can use automatic differentiation, as implemented in the ForwardDiff package, to compute these quantities for us.
```julia
using ForwardDiff
∇fad(x) = ForwardDiff.gradient(f, x)
```
The function `∇fad` returns the value of the gradient at `x`. Let's compare the hand-computed gradient `∇f` against the automatically computed gradient `∇fad` at a random point:
```julia
x = rand(2)
∇f(x) == ∇fad(x)    # true
```
We derive calculus rules for linear and quadratic functions, which appear often in optimization.
Let $a$ be an $n$-vector. The linear function

$$f(x) = a^T x = \sum_{i=1}^n a_i x_i \tag{2}$$

has the gradient $\nabla f(x) = a$, and so the gradient is constant. Here's a small example:
```julia
a = collect(1:5)
ForwardDiff.gradient(x->a'x, rand(5))
```
Let $A$ be a square $n \times n$ matrix. Consider the quadratic function

$$f(x) = \tfrac{1}{2}\, x^T\! A x. \tag{3}$$
One way to derive the gradient of this function is to write out the quadratic function, making explicit all of the coefficients in $A$. Here's another approach that uses the product rule:

$$\nabla f(x) = \tfrac{1}{2}\,\nabla\!\left(x^T\! A x\right) = \tfrac{1}{2}\left[\nabla\!\left(x^T v\right) + \nabla\!\left(w^T x\right)\right], \qquad v := A x, \quad w := A^T x,$$

where $v$ and $w$ are held fixed when applying the gradient. Because each of the functions on the right-hand side of this sum is a linear function, we can apply the calculus rule for linear functions to deduce that

$$\nabla f(x) = \tfrac{1}{2}\,(v + w) = \tfrac{1}{2}\,(A + A^T)\,x.$$
(Recall that $A$ is square.) The matrix $\tfrac{1}{2}(A + A^T)$ is the symmetric part of $A$. If $A$ is symmetric, i.e., $A = A^T$, then the gradient reduces to

$$\nabla f(x) = A x.$$
But in optimization, we almost always assume that the matrix $A$ that defines the quadratic in (3) is symmetric, because $x^T\! A x = x^T\big(\tfrac{1}{2}(A + A^T)\big)x$ always, and therefore we can instead work with the symmetric part of $A$.
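As a quick sanity check on the formula (an illustrative sketch, not part of the derivation; the matrix and point are random), ForwardDiff reproduces $\tfrac{1}{2}(A + A^T)x$ for a generally nonsymmetric $A$:

```julia
using ForwardDiff, LinearAlgebra

A = randn(4, 4)                 # random, generally nonsymmetric matrix
quad(x) = x'A*x / 2             # the quadratic in (3)
x = randn(4)

println(ForwardDiff.gradient(quad, x) ≈ (A + A')*x / 2)   # true
```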
Example (2-norm). Consider the two functions

$$f(x) = \tfrac{1}{2}\|x\|_2^2 \quad\text{and}\quad g(x) = \|x\|_2.$$
The function $f$ is of the form (3) with $A = I$, and so $\nabla f(x) = x$. Use the chain rule to obtain the gradient of the 2-norm $g(x) = \sqrt{2 f(x)}$:

$$\nabla g(x) = \frac{\nabla f(x)}{\sqrt{2 f(x)}} = \frac{x}{\|x\|_2},$$
which isn't differentiable at the origin.
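Here's a quick check of the 2-norm gradient with ForwardDiff (again an illustrative sketch with an arbitrary random point, which lies away from the origin with probability one):

```julia
using ForwardDiff, LinearAlgebra

g(x) = norm(x)                 # the 2-norm
x = randn(3)

println(ForwardDiff.gradient(g, x) ≈ x / norm(x))   # true
```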
Gradients can be understood geometrically in relation to the level sets of the function. The $\tau$-level set of a function $f$ is the set of points at which $f$ takes a value equal to or lower than $\tau$:

$$\{\, x \in \mathbb{R}^n \mid f(x) \le \tau \,\}.$$
Fix any $x$ and consider the level set $\{\, z \mid f(z) \le f(x) \,\}$. For any direction $d$ that's either a descent direction for $f$ at $x$ or a tangent direction to the level set at $x$,

$$\nabla f(x)^T d \le 0,$$

which implies that the gradient $\nabla f(x)$ is the outward normal to the level set at $x$.
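To make the geometry concrete, here's a small illustrative sketch (the function $h(x) = \tfrac{1}{2}\|x\|_2^2$ and the point are arbitrary choices): the level sets of $h$ are disks centered at the origin, a tangent direction at $x$ is any $d$ orthogonal to $x$, and the sign conditions above are easy to check.

```julia
# Sketch: for h(x) = ½‖x‖², the gradient x is normal to the circular level-set boundary.
h(x)  = 0.5 * sum(abs2, x)
∇h(x) = x                       # from the quadratic rule with A = I

x = [2.0, 1.0]
d_tangent = [-x[2], x[1]]       # orthogonal to x: tangent to the boundary at x
d_descent = -∇h(x)              # steepest-descent direction

println(∇h(x)'d_tangent)        # 0.0   (no first-order change along the boundary)
println(∇h(x)'d_descent)        # -5.0  (strictly negative: a descent direction)
```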