Gradient descent
\[ \def\argmin{\operatorname*{argmin}} \def\argmax{\operatorname*{argmax}} \def\Ball{\mathbf{B}} \def\bmat#1{\begin{bmatrix}#1\end{bmatrix}} \def\Diag{\mathbf{Diag}} \def\half{\tfrac12} \def\int{\mathop{\rm int}} \def\ip#1{\langle #1 \rangle} \def\maxim{\mathop{\hbox{\rm maximize}}} \def\maximize#1{\displaystyle\maxim_{#1}} \def\minim{\mathop{\hbox{\rm minimize}}} \def\minimize#1{\displaystyle\minim_{#1}} \def\norm#1{\|#1\|} \def\Null{{\mathbf{null}}} \def\proj{\mathbf{proj}} \def\R{\mathbb R} \def\Re{\mathbb R} \def\Rn{\R^n} \def\rank{\mathbf{rank}} \def\range{{\mathbf{range}}} \def\sign{{\mathbf{sign}}} \def\span{{\mathbf{span}}} \def\st{\hbox{\rm subject to}} \def\T{^\intercal} \def\textt#1{\quad\text{#1}\quad} \def\trace{\mathbf{trace}} \def\xbar{\bar x} \]
Partial demos
Quadratic functions, exact linesearch
Below is a Julia implementation of the gradient method with exact linesearch for minimizing quadratic functions of the form \[f(x) = \frac{1}{2}x^T Ax + b^T x,\] where \(A\) is symmetric positive definite.
using LinearAlgebra, Printf

function grad_method_exact_ls(A, b, x0, ϵ=1e-6)
    x = copy(x0)
    ∇f = A*x + b
    k = 0
    xtrace = x'                          # store the iterates as rows
    while norm(∇f) > ϵ
        # exact linesearch along the steepest-descent direction
        α = dot(∇f,∇f) / dot(∇f,A*∇f)
        x = x - α*∇f
        ∇f = A*x + b
        f = (1/2)*x'*A*x + b'*x
        @printf "it = %3d | |∇f| = %8.2e | f = %8.2e\n" k norm(∇f) f
        k += 1
        xtrace = vcat(xtrace, x')
    end
    return xtrace
end
grad_method_exact_ls (generic function with 2 methods)
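The step length used at each iteration is the exact minimizer of \(\alpha \mapsto f(x - \alpha\nabla f(x))\): for the quadratic \(f\), with \(g = \nabla f(x) = Ax + b\), setting the derivative with respect to \(\alpha\) to zero gives \[ \alpha = \frac{g^T g}{g^T A g}, \] which is the formula used inside the loop.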
We use the gradient method with exact linesearch to minimize \(f(x,y) = \tfrac12(x^2+2y^2)\) starting at \((x_0,y_0) = (2,1)\), which corresponds to the matrix \(A\) and vector \(b\) given by \[ A = \begin{bmatrix} 1 & 0 \\0 & 2\end{bmatrix} \quad\text{and}\quad b = \begin{bmatrix} 0 \\0\end{bmatrix}. \]
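A sketch of the corresponding call (the Julia variable names below are assumptions; the trace that follows comes from a run of this form with the default tolerance \(\epsilon = 10^{-6}\)):

A  = [1.0 0.0; 0.0 2.0]    # Hessian of the quadratic
b  = [0.0, 0.0]
x0 = [2.0, 1.0]            # starting point
xtrace = grad_method_exact_ls(A, b, x0)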
it = 0 | |∇f| = 9.43e-01 | f = 3.33e-01
it = 1 | |∇f| = 3.14e-01 | f = 3.70e-02
it = 2 | |∇f| = 1.05e-01 | f = 4.12e-03
it = 3 | |∇f| = 3.49e-02 | f = 4.57e-04
it = 4 | |∇f| = 1.16e-02 | f = 5.08e-05
it = 5 | |∇f| = 3.88e-03 | f = 5.65e-06
it = 6 | |∇f| = 1.29e-03 | f = 6.27e-07
it = 7 | |∇f| = 4.31e-04 | f = 6.97e-08
it = 8 | |∇f| = 1.44e-04 | f = 7.74e-09
it = 9 | |∇f| = 4.79e-05 | f = 8.60e-10
it = 10 | |∇f| = 1.60e-05 | f = 9.56e-11
it = 11 | |∇f| = 5.32e-06 | f = 1.06e-11
it = 12 | |∇f| = 1.77e-06 | f = 1.18e-12
it = 13 | |∇f| = 5.91e-07 | f = 1.31e-13
Plot the path of the gradient method on a contour plot of \(f(x,y) = \tfrac12(x^2+2y^2)\).
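A sketch of the plotting code, assuming the Plots package and the xtrace returned by the run above (fq is a hypothetical two-argument helper used only for the contour):

using Plots

fq(x, y) = (1/2)*(x^2 + 2y^2)            # the quadratic as a function of two scalars
xs = -2.5:0.05:2.5
ys = -1.5:0.05:1.5
contour(xs, ys, fq, levels = 30)
plot!(xtrace[:,1], xtrace[:,2], marker = 3, legend = false, title = "Path of gradient method")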
Condition number of a matrix
The condition number of an \(n\times n\) positive definite matrix \(A\) is defined by \[ \kappa(A) = \frac{\lambda_{\max}(A)}{\lambda_{\min}(A)}. \] An ill-conditioned matrix has a large condition number, and the condition number of the Hessian at the solution influences the speed at which the gradient method converges: if the condition number of the Hessian is small, the gradient method generally converges quickly; if it is large, the gradient method converges slowly.
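As a quick check, the matrix \(A\) in the quadratic example above has eigenvalues 1 and 2, so \(\kappa(A) = 2\), consistent with the fast convergence observed there. A sketch of the computation (assuming LinearAlgebra is loaded, as in the code above):

cond([1.0 0.0; 0.0 2.0])    # 2-norm condition number = λmax/λmin = 2.0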
Consider the Rosenbrock function \[ f(x_1,x_2) = 100(x_2 - x_1^2)^2 + (1-x_1)^2. \] We can show that the unique stationary point of the Rosenbrock function is \((x_1, x_2) = (1,1)\) and that the Hessian of \(f\) at \((1,1)\) is \[\begin{bmatrix} 802 & -400\\-400 & 200\end{bmatrix}.\] This Hessian has a large condition number and, as a result, the gradient method converges slowly on this problem.
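A sketch of the computation of this condition number (assuming LinearAlgebra is loaded, as above):

H = [802.0 -400.0; -400.0 200.0]    # Hessian of f at (1,1)
eigvals(H)                          # ≈ 0.40 and 1001.6
cond(H)                             # κ(H) ≈ 2.5e3

Below is a Julia implementation of the gradient method with a backtracking linesearch.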
function grad_method_backtracking(fObj, gObj, x0; ϵ = 1e-6, μ = 1e-5, maxits = 1000)
    x = copy(x0)
    f = fObj(x)
    ∇f = gObj(x)
    k = 0
    xtrace = x'                          # store the iterates as rows
    while norm(∇f) > ϵ && k < maxits
        α = 1.0
        # backtrack: halve α until the sufficient-decrease (Armijo) condition holds
        while (f - fObj(x - α*∇f)) < μ*α*dot(∇f,∇f)
            α /= 2
        end
        x = x - α*∇f
        f = fObj(x)
        ∇f = gObj(x)
        k += 1
        xtrace = vcat(xtrace, x')
    end
    @printf "it = %3d | |∇f| = %8.2e | f = %8.2e\n" k norm(∇f) f
    return x, xtrace
end
grad_method_backtracking (generic function with 1 method)
Apply the gradient method with backtracking to the Rosenbrock function.
using ForwardDiff

f(x) = 100(x[2] - x[1]^2)^2 + (1 - x[1])^2
∇f(x) = ForwardDiff.gradient(f, x)    # gradient via automatic differentiation
x0 = [2.0, 5.0]
x, xtrace = grad_method_backtracking(f, ∇f, x0, μ = 1e-4, maxits = 1000);
it = 1000 | |∇f| = 1.56e+00 | f = 1.33e+00
Plot the path of the gradient method on a contour plot of the Rosenbrock function.
f(x1, x2) = 100(x2 - x1^2)^2 + (1 - x1)^2    # two-argument method of f for plotting
x1 = 0:0.05:3
x2 = 3:0.05:6
contour(x1, x2, f, levels = 50)
plot!(xtrace[:,1], xtrace[:,2], marker = 3, legend = false, title = "Path of gradient method")
In the plot above, we see that the gradient method converges slowly. We can improve the convergence rate by transforming the minimization problem so that the Hessian is well-conditioned at the minimizer.
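For instance, for a strictly convex quadratic \(f(x) = \tfrac12 x^T A x + b^T x\), the change of variables \(y = A^{1/2}x\) yields \(\tilde f(y) = f(A^{-1/2}y) = \tfrac12 y^T y + b^T A^{-1/2}y\), whose Hessian is the identity, so \(\kappa = 1\) and the gradient method with exact linesearch reaches the minimizer in a single step.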