Least Squares
Linear least-squares, also known as ordinary linear regression, is a basic optimization problem used for data fitting. It’s also a fundamental building block for algorithms meant for more general problems.
Example: Multilinear regression
We begin with a simple example and use linear regression to build a model that relates the fuel efficiency of a car to its weight, engine displacement, and engine horsepower. Load the mtcars dataset provided in the RDatasets.jl package:
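One way to load it (a sketch, assuming the RDatasets.jl package is installed; the dataset function returns a DataFrame):

using RDatasets                         # collection of standard R example datasets
mtcars = dataset("datasets", "mtcars")  # load the mtcars table as a DataFrame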
Here are the first five rows of the dataset:
Number of rows: 32

| Row | Model | MPG | Cyl | Disp | HP | DRat | WT | QSec | VS | AM | Gear | Carb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | String31 | Float64 | Int64 | Float64 | Int64 | Float64 | Float64 | Float64 | Int64 | Int64 | Int64 | Int64 |
| 1 | Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.62 | 16.46 | 0 | 1 | 4 | 4 |
| 2 | Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| 3 | Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.32 | 18.61 | 1 | 1 | 4 | 1 |
| 4 | Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| 5 | Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.44 | 17.02 | 0 | 0 | 3 | 2 |
We wish to build a model with coefficients \(x_1,\ldots,x_4\in\R\) such that for each car model \(i=1,2,\ldots,m\) (where \(m\) is the number of rows),
\[ \text{MPG}_i \approx x_1 + x_2\cdot\text{Disp}_i + x_3\cdot\text{HP}_i + x_4\cdot\text{WT}_i. \]
Of course, we shouldn’t expect an exact match for every car model \(i\); instead, the coefficients \(x_1,\ldots,x_4\) computed by ordinary linear regression minimize the sum of squared deviations:
\[ \min_{x_1,\ldots,x_4}\ \sum_{i=1}^{m} (x_1 + x_2\cdot\text{Disp}_i + x_3\cdot\text{HP}_i + x_4\cdot\text{WT}_i - \text{MPG}_i)^2 \tag{1}\]
Because this is such an important problem, there exist excellent software packages that make formulating and solving it easy. Here’s how to do it using the generalized linear models package GLM.jl:
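A sketch of the fitting call, consistent with the formula shown in the output and with the ols variable used in the plotting code later:

using GLM
ols = lm(@formula(MPG ~ Disp + HP + WT), mtcars)   # ordinary least-squares fit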
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}
MPG ~ 1 + Disp + HP + WT
Coefficients:
| | Coef. | Std. Error | t | Pr(>\|t\|) | Lower 95% | Upper 95% |
|---|---|---|---|---|---|---|
| (Intercept) | 37.1055 | 2.11082 | 17.58 | <1e-15 | 32.7817 | 41.4293 |
| Disp | -0.000937009 | 0.0103497 | -0.09 | 0.9285 | -0.0221375 | 0.0202635 |
| HP | -0.0311566 | 0.0114358 | -2.72 | 0.0110 | -0.0545817 | -0.00773139 |
| WT | -3.80089 | 1.06619 | -3.56 | 0.0013 | -5.98488 | -1.6169 |
The column marked Coef. shows the values of the four computed coefficients \(x_1,\ldots,x_4\) (labeled (Intercept), Disp, HP, and WT). The remaining columns show the results of various statistical tests that indicate the reliability of each coefficient. It appears that the car’s weight has the biggest impact on fuel economy. This plot shows the projection of the data and the regression line onto the plane spanned by WT and MPG:
using StatsPlots
x₁, x₂, x₃, x₄ = coef(ols)   # intercept and the Disp, HP, WT coefficients
@df mtcars scatter(:WT, :MPG, xlabel="weight (1000s lbs)", ylabel="MPG", leg=false)
@df mtcars plot!(:WT, x₁ .+ x₄*:WT)   # regression line shown against weight
Matrix formulation
The objective in Equation 1 can be written using matrix notation as follows: define the matrix and vectors \[ A = \begin{bmatrix} 1 & \text{Disp}_1 & \text{HP}_1 & \text{WT}_1 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & \text{Disp}_m & \text{HP}_m & \text{WT}_m \end{bmatrix}, \quad b = \begin{bmatrix} \text{MPG}_1 \\ \vdots \\ \text{MPG}_m \end{bmatrix}, \quad x = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix}, \] where \(m=32\) and the columns of \(A\) are ordered to match the coefficients in Equation 1. Thus, the least-squares solution \(x^*\) minimizes the squared 2-norm \(\|r\|^2\) of the residual \(r = b - Ax\).
The Euclidean norm, or 2-norm, of a vector \(z\in\R^n\) is the function \[ \|z\|_2 = \sqrt{z_1^2+\cdots+z_n^2}. \] This norm has the property that \[ z\T z = \begin{bmatrix} z_1 \\ \vdots \\ z_n \end{bmatrix}\T \begin{bmatrix} z_1 \\ \vdots \\ z_n \end{bmatrix} = \sum_{i=1}^n z_i^2 = \|z\|_2^2. \] We often leave out the subscript and simply write \(\|z\|\) for the 2-norm, unless noted otherwise.
General least-squares formulation
The general linear least-squares problem is formulated as \[ \min_{x\in\R^n}\ \tfrac12\|r\|^2, \quad r = b - Ax, \tag{2}\] where \(A\) is an \(m\)-by-\(n\) matrix, \(b\) is an \(m\)-vector, \(x\) is an \(n\)-vector, and \(r\) is an \(m\)-vector of residuals. Typically there are many more observations \(m\) than variables \(n\), so \(m\gg n\). But many important applications lead to least-squares problems where the opposite is true; these problems are underdetermined, and there are necessarily infinitely many least-squares solutions \(x^*\).
The following fundamental result describes several equivalent conditions that any least-squares solution must satisfy.
Theorem 1 (Least-squares optimality) The vector \(x^*\) solves the least-squares problem Equation 2 if and only if it satisfies the following equivalent conditions:
- \(A\T r = 0\), where \(r=b-Ax^*\),
- \(A\T A x^* = A\T b\),
- \(y:=Ax^*\) is the (unique) orthogonal projection of \(b\) onto \(\range(A)\).
Moreover, \(x^*\) is the unique least-squares solution if and only if \(A\) has full column rank.
This result relies on a fundamental property that for all matrices \(A\), regardless of their shape, the subspaces \[ \begin{align*} \range(A) &= \{y \mid y = A x \text{ for some } x \} \\\Null(A\T\,) &= \{z \mid A\T z = 0 \} \end{align*} \] are orthogonal complements, i.e., \[ \range(A) \oplus \Null(A\T\,) = \R^m. \]
The equivalence of the three properties is straightforward, so it’s sufficient to show that the third condition is equivalent to the optimality of \(x^*\). Orthogonally decompose the residual \(r = b-y\) as \[ b - y = z_n + z_r, \quad z_r\in\range(A), \quad z_n\in\Null(A\T\,). \]
If \(y\) were not the orthogonal projection of \(b\), then some component of \(b-y\) would lie in the range of \(A\). Thus, \(z_r\ne0\) and \[ \begin{align*} \|r\|^2 &= \|b-y\|^2 \\ &= \|z_n+z_r\|^2 \\ &= \|z_n\|^2+\|z_r\|^2 \\ &> \|z_n\|^2=\|b-(y+z_r)\|^2, \end{align*} \] where the third equality uses the orthogonality of \(z_n\) and \(z_r\). This means we have found a new point \((y+z_r)\in\range(A)\) with a smaller residual, which contradicts the optimality of \(x^*\) and the uniqueness of the projection \(y\).
Example: Multilinear regression (continued)
For the multilinear regression example, we can obtain the least-squares solution by solving the normal equations as follows:
using LinearAlgebra # gives `norm` function
A = [ones(size(mtcars,1)) mtcars[:,:Disp] mtcars[:,:HP] mtcars[:,:WT]]
b = mtcars[:,:MPG]
x = A'A \ A'b
@show(norm(x-coef(ols)));
norm(x - coef(ols)) = 7.44856180406681e-14
The coefficients computed using the normal equations are almost the same as those computed by GLM, which likely solves the normal equations by a different method (the DensePredChol type in the output above suggests a pivoted Cholesky factorization).
The least-squares solution should also verify the first condition of the above theorem:
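A one-line check of the first condition, using the A, b, and x computed above (the result should be zero up to rounding error):

@show norm(A'*(b - A*x));   # ≈ 0 when x satisfies the normal equations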
The backslash operator
In this last example, we solved for the least-squares solution by explicitly solving the normal equations \(A\T Ax=A\T b\) via the Julia command x = A'A \ A'b. As we’ll see in a later lecture, this isn’t the best way to solve the least-squares problem. Without getting into the details just yet, it’s enough for now to know that Julia allows the shorthand x = A \ b: Julia recognizes that this is an overdetermined system and solves for the least-squares solution using a mathematically equivalent approach. Let’s verify:
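A sketch of the check, comparing the backslash solution with the normal-equations solution x computed earlier (the printed value should be near machine precision):

xb = A \ b            # least-squares solution via the backslash operator
@show norm(x - xb);   # distance from the normal-equations solution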
Close enough!
Example: Trendline
This example uses least-squares to illustrate the increasing concentrations of nitrous oxide (N₂O), which is a greenhouse gas. We’ll use a dataset from the Global Monitoring Laboratory:
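Here, fname is assumed to point to a local copy of the hourly CATS N₂O data file, for example:

# Hypothetical local path: download the hourly CATS N₂O file from the GML site
# and adjust this name to match your copy.
fname = "mlo_N2O_hourly.txt"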
This dataset provides hourly measurements of N₂O levels in parts per billion (ppb) from 1999 to 2021, from gas sampled at the Mauna Loa Observatory in Hawaii. The following code removes rows with missing concentration levels ppb and creates a new table with average monthly measurements:
using CSV, DataFrames, DataFramesMeta, Statistics
df = CSV.read(fname, DataFrame, comment="#", normalizenames=true, delim=" ", ignorerepeated=true)
dfg = @chain df begin
    @select(:year = :N2OcatsMLOyr, :month = :N2OcatsMLOmon, :ppb = :N2OcatsMLOm)
    @subset((!isnan).(:ppb))
    @by([:year, :month], :avgppb = mean(:ppb))
end
describe(dfg, :min, :max, cols=[:year, :avgppb])
| Row | variable | min | max |
|---|---|---|---|
| | Symbol | Real | Real |
| 1 | year | 1999 | 2022 |
| 2 | avgppb | 315.06 | 337.484 |
A trend line is defined by the univariate function \[ p(t) = \alpha + \beta t, \] where \(\alpha\) is the intercept, \(\beta\) is the slope, and \(t\) is time. Here, we’ll just index the months by the integers \(t=1,2,\ldots,T\). We determine the parameters \(\alpha\) and \(\beta\) by solving the least-squares problem \[ \min_{\alpha,\ \beta}\ \sum_{i=1}^{T}(p(t_i) - \text{ppb}_i)^2, \] which corresponds to the general least-squares problem with \[ A = \begin{bmatrix} 1 & t_1 \\ 1 & t_2 \\ \vdots & \vdots \\ 1 & T \end{bmatrix}, \quad b = \begin{bmatrix} \text{ppb}_1 \\ \text{ppb}_2 \\ \vdots \\ \text{ppb}_T \end{bmatrix}, \quad\text{and}\quad x = \begin{bmatrix} \alpha \\ \beta \end{bmatrix}. \] In Julia,
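one possible construction (a sketch, using the monthly table dfg created above) is:

T = nrow(dfg)       # number of monthly observations
t = collect(1:T)    # month index t = 1, 2, ..., T
A = [ones(T) t]     # intercept column and time-index column
b = dfg.avgppb      # monthly average N₂O concentrations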
The least-squares solution is then
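obtained with the backslash operator (a sketch; the names α and β match those used in the plotting code below):

α, β = A \ b   # intercept and slope of the trend line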
and we can plot the data together with the trend line:
@df dfg scatter(:avgppb)
p(t) = α + β*t
plot!(p, lw=3)
plot!(leg=false, xlab="Month index from 1999 to 2021", ylab="Monthly average N₂O ppb")
But perhaps most interestingly, we can now “detrend” the data to understand how the concentrations vary around the overall 22-year trend:
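A minimal sketch of the detrending step, subtracting the fitted trend p(t) from each monthly average:

detrended = dfg.avgppb .- p.(1:nrow(dfg))   # deviation of each month from the trend
scatter(detrended, leg=false,
        xlab="Month index from 1999 to 2021", ylab="Deviation from trend (ppb)")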