UBC CPSC 406 2022-T2

Regularized Least Squares

Regularization introduces prior information or assumptions about properties of an optimal solution into an optimization problem. A regularization parameter governs the tradeoff between fidelity to the unmodified problem and the desired solution properties.

The Pareto frontier

The standard least-squares problem can be interpreted as a method for recovering some underlying object (say, a signal or image) encoded in the vector $x_0$ using noisy measurements of the form

$$b := Ax_0 + w_1,$$

where $w_1$ is random noise and the matrix $A$ describes the measurement process. The least-squares problem

$$\min_{x\in\mathbb{R}^n}\ \|Ax-b\|^2$$

seeks a solution $x$ under the implicit assumption that the noise term $w_1$ is small.
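To make this concrete, here is a minimal sketch on a hypothetical random instance (all sizes and values below are illustrative) showing that the least-squares solution recovers $x_0$ well when the noise is small:

using LinearAlgebra

m, n = 50, 10
A  = randn(m, n)            # measurement process
x₀ = randn(n)               # underlying signal
b  = A*x₀ + 0.01*randn(m)   # noisy measurements b = A*x₀ + w₁

xLS = A \ b                 # least-squares recovery
norm(xLS - x₀) / norm(x₀)   # small relative error when the noise is small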

But what if $x$ also needs to satisfy some other competing objective? Suppose, for example, that we also have available another set of measurements of $x_0$ of the form $d = Cx_0 + w_2$, where $w_2$ is also random noise. The "best" choice for $x$ is not necessarily the plain least-squares solution; instead, we wish to choose $x$ to balance the two objective values

$$f_1(x) = \|Ax - b\|^2 \quad\text{and}\quad f_2(x) = \|Cx - d\|^2.$$

Generally, we can make $f_1(x)$ or $f_2(x)$ small, but not both. The figure below sketches the relationship between the pairs $\{f_1(x), f_2(x)\}$ for all values of $x$.

The objective pairs on the boundary between the two regions form the Pareto frontier. We can compute these Pareto-optimal solutions $x$ by minimizing the weighted-sum objective

$$f_1(x) + \lambda f_2(x) = \|Ax-b\|^2 + \lambda\|Cx-d\|^2,$$

where the positive parameter $\lambda$ provides the relative weight between the objectives.

For fixed scalars $\lambda$ and $\alpha$, the set

$$\ell(\lambda,\alpha) = \{\, (f_1(x), f_2(x)) \mid f_1(x) + \lambda f_2(x) = \alpha,\ x \in \mathbb{R}^n \,\}$$

forms the graph of a line with slope $-\lambda$. We may visualize the weighted-sum problem as the problem of finding the smallest value $\alpha$ such that the line is tangent to the Pareto frontier.
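The sweep below is a small sketch of this picture on a hypothetical random instance: it traces an approximation of the Pareto frontier by minimizing the weighted-sum objective for a range of weights $\lambda$ and recording the resulting objective pairs:

using LinearAlgebra, Plots

n = 10
A, b = randn(20, n), randn(20)
C, d = randn(20, n), randn(20)

f₁(x) = norm(A*x - b)^2
f₂(x) = norm(C*x - d)^2

λs = 10 .^ range(-2, 2, length=50)
pairs = [ begin
              x = [A; √λ*C] \ [b; √λ*d]   # minimizes f₁(x) + λ f₂(x)
              (f₁(x), f₂(x))
          end for λ in λs ]
plot(first.(pairs), last.(pairs), marker=:circle, ms=2,
     xlabel="f₁(x)", ylabel="f₂(x)", label="Pareto frontier")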

Tikhonov regularization

A particularly common example of regularized least-squares is Tikhonov regularization, which has the form

$$\min_{x}\ \tfrac12\|Ax-b\|^2 + \lambda\tfrac12\|Dx\|^2,$$

for which the objective can be expressed as

$$\|Ax-b\|^2 + \lambda\|Dx\|^2 = \bigg\|\begin{bmatrix} A\\ \sqrt{\lambda}D \end{bmatrix} x - \begin{bmatrix} b\\ 0 \end{bmatrix}\bigg\|^2.$$
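This identity is easy to verify numerically on random data (the sizes below are arbitrary):

using LinearAlgebra

m, n, p, λ = 8, 5, 4, 2.0
A, D = randn(m, n), randn(p, n)
b, x = randn(m), randn(n)

lhs = norm(A*x - b)^2 + λ*norm(D*x)^2
rhs = norm([A; √λ*D]*x - [b; zeros(p)])^2
lhs ≈ rhs                   # true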

If $D$ has full column rank, then the stacked matrix

$$\begin{bmatrix} A\\ \sqrt{\lambda}D \end{bmatrix}$$

necessarily also has full column rank for any positive $\lambda$, which implies that the regularized problem always has a well-defined unique solution.
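For instance, here is a small sketch (with a hypothetical wide matrix $A$) in which $A$ alone is rank deficient, yet the stacked matrix still has full column rank:

using LinearAlgebra

n = 5
A = randn(3, n)          # 3×5, so rank(A) ≤ 3 < n
D = Matrix(1.0I, n, n)   # identity regularizer: full column rank
λ = 0.1
rank([A; √λ*D]) == n     # true: unique regularized solution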

Example: Signal denoising

⚠ Note

This example uses the following packages:

using LinearAlgebra
using Plots

Consider a noisy measurement

$$b := x^\natural + w$$

of a signal $x^\natural$, where the vector $w$ represents unknown noise. Here's a simple 1-dimensional noisy signal:

n = 300
t = LinRange(0, 4, n)        # sample points
x = @. sin(t) + t*cos(t)^2   # smooth "true" signal
w = 0.1*randn(n)             # additive noise
b = x + w                    # noisy measurements
plot(b, leg=:topleft, label="noisy signal")

The obvious least-squares problem

$$\min_{x}\ \tfrac12\|x-b\|^2$$

isn't useful because the optimal solution is simply the noisy measurement $b$, and so it doesn't yield any new information. But suppose we believe that the signal is "smooth" in the sense that the consecutive elements of

$$x = (x_1, \ldots, x_i, x_{i+1}, \ldots, x_n)$$

change relatively little, i.e., each difference $|x_i - x_{i+1}|$ is small relative to the overall scale of $x$. In this case, we might balance the least-squares fit against the smoothness of the solution by instead solving the regularized least-squares problem

$$\min_{x}\ \tfrac12\underbrace{\|x-b\|^2}_{f_1(x)} + \lambda\tfrac12\underbrace{\sum_{i=1}^{n-1}(x_i - x_{i+1})^2}_{f_2(x)}.$$

The role of the regularizer $f_2(x)$ is to promote smooth changes in the elements of $x$.

Matrix notation

We can alternatively write the above minimization program in matrix notation. Define the $(n-1)$-by-$n$ finite-difference matrix

$$D = \begin{bmatrix}
1 & -1 & 0 & \cdots & 0 & 0\\
0 & 1 & -1 & 0 & \cdots & 0\\
 & \ddots & \ddots & \ddots & & \\
\vdots & & 0 & 1 & -1 & 0\\
0 & \cdots & & 0 & 1 & -1
\end{bmatrix},$$

which, when applied to a vector $x$, yields a vector of the differences:

$$Dx = \begin{bmatrix} x_1 - x_2\\ x_2 - x_3\\ \vdots\\ x_{n-1} - x_n \end{bmatrix}.$$

Then we can rephrase the regularization objective as

$$f_2(x) = \sum_{i=1}^{n-1}(x_i - x_{i+1})^2 = \|Dx\|^2.$$

This allows for a reformulation of the regularized least-squares objective into a familiar least-squares objective:

$$\|x-b\|^2 + \lambda\|Dx\|^2 = \bigg\|\underbrace{\begin{bmatrix} I\\ \sqrt{\lambda}D \end{bmatrix}}_{\hat{A}} x - \underbrace{\begin{bmatrix} b\\ 0 \end{bmatrix}}_{\hat{b}}\bigg\|^2.$$

So the solution of the regularized least-squares problem satisfies the normal equations $\hat{A}^T\!\hat{A}x = \hat{A}^T\!\hat{b}$, which simplify to

$$(I + \lambda D^T\! D)x = b.$$

The following function generates the required finite-difference matrix:

# (n-1)×n matrix with 1 on the diagonal and -1 on the superdiagonal
finiteDiff(n) = diagm(0 => ones(n), +1 => -ones(n-1))[1:n-1,:]

Here's an example of a small finite-difference matrix:

finiteDiff(4)
3×4 Matrix{Float64}:
 1.0  -1.0   0.0   0.0
 0.0   1.0  -1.0   0.0
 0.0   0.0   1.0  -1.0
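As a quick sanity check on a random vector, the sum of squared consecutive differences agrees with $\|Dx\|^2$:

x4 = randn(4)
sum((x4[1:end-1] - x4[2:end]).^2) ≈ norm(finiteDiff(4)*x4)^2   # true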

The function below returns a generator that produces $n$ logarithmically spaced numbers in the interval $[x_1, x_2]$:

LogRange(x1, x2, n) = (10^y for y in range(log10(x1), log10(x2), length=n))
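For example:

collect(LogRange(1, 100, 3))   # yields approximately [1.0, 10.0, 100.0]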

Now solve the regularized least-squares problem for several values of $\lambda \in [1, 10^4]$:

D = finiteDiff(n)
b̂ = [b; zeros(n-1)]                  # stacked right-hand side
plot(t, b, w=1, leg=:topleft, label="noisy data")
for λ in LogRange(1e0, 1e4, 3)
    Â = [ I; √λ*D ]                  # stacked regularized matrix
    xLS = Â \ b̂                      # least-squares solve
    plot!(t, xLS, w=2, label="regularized solution: λ = $(λ)")
end

Note that the expression Â = [ I; √λ*D ] contains the UniformScaling object I, which represents an identity matrix of any size.
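As a consistency check on the normal equations derived above, solving $(I + \lambda D^T\!D)x = b$ directly matches the stacked least-squares solve for any fixed $\lambda$ (the value $\lambda = 100$ below is an arbitrary choice):

λ = 100.0
Â = [ I; √λ*D ]
xLS = Â \ b̂            # stacked least-squares solve
xNE = (I + λ*D'D) \ b   # direct solve of (I + λDᵀD)x = b
xLS ≈ xNE               # true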