UBC CPSC 406 2022-T2

Regularized Least Squares

Regularization introduces prior information or assumptions about properties of an optimal solution into an optimization problem. A regularization parameter governs the tradeoff between fidelity to the unmodified problem and the desired solution properties.

The Pareto frontier

The standard least-squares problem can be interpreted as a method for recovering some underlying object (say, a signal or image) encoded in the vector $x_0$ using noisy measurements of the form

$$b := Ax_0 + w_1,$$

where $w_1$ is random noise and the matrix $A$ describes the measurement process. The least-squares problem

$$\min_{x\in\mathbb{R}^n}\ \|Ax-b\|^2$$

seeks a solution $x$ under the implicit assumption that the noise term $w_1$ is small.
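To make this concrete, here is a minimal sketch on a hypothetical random instance (all sizes and values below are illustrative) showing that the least-squares solution recovers $x_0$ well when the noise is small:

using LinearAlgebra

m, n = 50, 10
A  = randn(m, n)            # measurement process
x₀ = randn(n)               # underlying signal
b  = A*x₀ + 0.01*randn(m)   # noisy measurements b = A*x₀ + w₁

xLS = A \ b                 # least-squares recovery
norm(xLS - x₀) / norm(x₀)   # small relative error when the noise is small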

But what if $x$ also needs to satisfy some other competing objective? Suppose, for example, that we also have available another set of measurements of $x_0$ of the form $d = Cx_0 + w_2$, where $w_2$ is also random noise. The "best" choice for $x$ is not necessarily the plain least-squares solution; instead, we wish to choose $x$ to balance the two objective values

$$f_1(x) = \|Ax - b\|^2 \quad\text{and}\quad f_2(x) = \|Cx - d\|^2.$$

Generally, we can make $f_1(x)$ or $f_2(x)$ small, but not both. The figure below sketches the relationship between the pairs $\{f_1(x), f_2(x)\}$ for all values of $x$.

The objective pairs on the boundary between the two regions form the Pareto frontier. We can compute these Pareto-optimal solutions $x$ by minimizing the weighted-sum objective

$$f_1(x) + \lambda f_2(x) = \|Ax-b\|^2 + \lambda\|Cx-d\|^2,$$

where the positive parameter $\lambda$ provides the relative weight between the objectives.

For fixed scalars $\lambda$ and $\alpha$, the set

$$\ell(\lambda,\alpha) = \{\, (f_1(x), f_2(x)) \mid f_1(x) + \lambda f_2(x) = \alpha,\ x \in \mathbb{R}^n \,\}$$

forms the graph of a line with slope $-\lambda$. We may visualize the weighted-sum problem as the problem of finding the smallest value $\alpha$ such that the line is tangent to the Pareto frontier.
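The sweep below is a small sketch of this picture on a hypothetical random instance: it traces an approximation of the Pareto frontier by minimizing the weighted-sum objective for a range of weights $\lambda$ and recording the resulting objective pairs:

using LinearAlgebra, Plots

n = 10
A, b = randn(20, n), randn(20)
C, d = randn(20, n), randn(20)

f₁(x) = norm(A*x - b)^2
f₂(x) = norm(C*x - d)^2

λs = 10 .^ range(-2, 2, length=50)
pairs = [ begin
              x = [A; √λ*C] \ [b; √λ*d]   # minimizes f₁(x) + λ f₂(x)
              (f₁(x), f₂(x))
          end for λ in λs ]
plot(first.(pairs), last.(pairs), marker=:circle, ms=2,
     xlabel="f₁(x)", ylabel="f₂(x)", label="Pareto frontier")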

Tikhonov regularization

A particularly common example of regularized least-squares is Tikhonov regularization, which has the form

$$\min_{x}\ \tfrac12\|Ax-b\|^2 + \lambda\tfrac12\|Dx\|^2,$$

for which the objective can be expressed as

$$\|Ax-b\|^2 + \lambda\|Dx\|^2 = \bigg\|\begin{bmatrix} A\\ \sqrt{\lambda}D \end{bmatrix} x - \begin{bmatrix} b\\ 0 \end{bmatrix}\bigg\|^2.$$
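This identity is easy to verify numerically on random data (the sizes below are arbitrary):

using LinearAlgebra

m, n, p, λ = 8, 5, 4, 2.0
A, D = randn(m, n), randn(p, n)
b, x = randn(m), randn(n)

lhs = norm(A*x - b)^2 + λ*norm(D*x)^2
rhs = norm([A; √λ*D]*x - [b; zeros(p)])^2
lhs ≈ rhs                   # true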

If $D$ has full column rank, then the stacked matrix

$$\begin{bmatrix} A\\ \sqrt{\lambda}D \end{bmatrix}$$

necessarily also has full column rank for any positive $\lambda$, which implies that the regularized problem always has a well-defined unique solution.
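For instance, here is a small sketch (with a hypothetical wide matrix $A$) in which $A$ alone is rank deficient, yet the stacked matrix still has full column rank:

using LinearAlgebra

n = 5
A = randn(3, n)          # 3×5, so rank(A) ≤ 3 < n
D = Matrix(1.0I, n, n)   # identity regularizer: full column rank
λ = 0.1
rank([A; √λ*D]) == n     # true: unique regularized solution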

Example: Signal denoising

⚠ Note

This example uses the following packages:

using LinearAlgebra
using Plots

Consider a noisy measurement

$$b := x^\natural + w$$

of a signal $x^\natural$, where the vector $w$ represents unknown noise. Here's a simple 1-dimensional noisy signal:

n = 300
t = LinRange(0, 4, n)        # sample points
x = @. sin(t) + t*cos(t)^2   # smooth "true" signal
w = 0.1*randn(n)             # additive noise
b = x + w                    # noisy measurements
plot(b, leg=:topleft, label="noisy signal")

The obvious least-squares problem

$$\min_{x}\ \tfrac12\|x-b\|^2$$

isn't useful because the optimal solution is simply the noisy measurement $b$, and so it doesn't yield any new information. But suppose we believe that the signal is "smooth" in the sense that the consecutive elements of

$$x = (x_1, \ldots, x_i, x_{i+1}, \ldots, x_n)$$

change relatively little, i.e., each difference $|x_i - x_{i+1}|$ is small relative to the overall scale of $x$. In this case, we might balance the least-squares fit against the smoothness of the solution by instead solving the regularized least-squares problem

$$\min_{x}\ \tfrac12\underbrace{\|x-b\|^2}_{f_1(x)} + \lambda\tfrac12\underbrace{\sum_{i=1}^{n-1}(x_i - x_{i+1})^2}_{f_2(x)}.$$

The role of the regularizer $f_2(x)$ is to promote smooth changes in the elements of $x$.

Matrix notation

We can alternatively write the above minimization program in matrix notation. Define the $(n-1)$-by-$n$ finite-difference matrix

$$D = \begin{bmatrix}
1 & -1 & 0 & \cdots & 0 & 0\\
0 & 1 & -1 & 0 & \cdots & 0\\
 & \ddots & \ddots & \ddots & & \\
\vdots & & 0 & 1 & -1 & 0\\
0 & \cdots & & 0 & 1 & -1
\end{bmatrix},$$

which, when applied to a vector $x$, yields a vector of the differences:

$$Dx = \begin{bmatrix} x_1 - x_2\\ x_2 - x_3\\ \vdots\\ x_{n-1} - x_n \end{bmatrix}.$$

Then we can rephrase the regularization objective as

$$f_2(x) = \sum_{i=1}^{n-1}(x_i - x_{i+1})^2 = \|Dx\|^2.$$

This allows for a reformulation of the regularized least-squares objective into a familiar least-squares objective:

$$\|x-b\|^2 + \lambda\|Dx\|^2 = \bigg\|\underbrace{\begin{bmatrix} I\\ \sqrt{\lambda}D \end{bmatrix}}_{\hat{A}} x - \underbrace{\begin{bmatrix} b\\ 0 \end{bmatrix}}_{\hat{b}}\bigg\|^2.$$

So the solution of the regularized least-squares problem satisfies the normal equations $\hat{A}^T\!\hat{A}x = \hat{A}^T\!\hat{b}$, which simplify to

$$(I + \lambda D^T\! D)x = b.$$

The following function generates the required finite-difference matrix:

# (n-1)×n matrix with 1 on the diagonal and -1 on the superdiagonal
finiteDiff(n) = diagm(0 => ones(n), +1 => -ones(n-1))[1:n-1,:]

Here's an example of a small finite-difference matrix:

finiteDiff(4)
3×4 Matrix{Float64}:
 1.0  -1.0   0.0   0.0
 0.0   1.0  -1.0   0.0
 0.0   0.0   1.0  -1.0
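As a quick sanity check on a random vector, the sum of squared consecutive differences agrees with $\|Dx\|^2$:

x4 = randn(4)
sum((x4[1:end-1] - x4[2:end]).^2) ≈ norm(finiteDiff(4)*x4)^2   # true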

The function below returns a generator that produces $n$ logarithmically spaced numbers in the interval $[x_1, x_2]$:

LogRange(x1, x2, n) = (10^y for y in range(log10(x1), log10(x2), length=n))
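For example:

collect(LogRange(1, 100, 3))   # yields approximately [1.0, 10.0, 100.0]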

Now solve the regularized least-squares problem for several values of $\lambda \in [1, 10^4]$:

D = finiteDiff(n)
b̂ = [b; zeros(n-1)]                  # stacked right-hand side
plot(t, b, w=1, leg=:topleft, label="noisy data")
for λ in LogRange(1e0, 1e4, 3)
    Â = [ I; √λ*D ]                  # stacked regularized matrix
    xLS = Â \ b̂                      # least-squares solve
    plot!(t, xLS, w=2, label="regularized solution: λ = $(λ)")
end

Note that the expression Â = [ I; √λ*D ] contains the UniformScaling object I, which represents an identity matrix of any size.
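As a consistency check on the normal equations derived above, solving $(I + \lambda D^T\!D)x = b$ directly matches the stacked least-squares solve for any fixed $\lambda$ (the value $\lambda = 100$ below is an arbitrary choice):

λ = 100.0
Â = [ I; √λ*D ]
xLS = Â \ b̂            # stacked least-squares solve
xNE = (I + λ*D'D) \ b   # direct solve of (I + λDᵀD)x = b
xLS ≈ xNE               # true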