Classification of Handwritten Digits
\[ \def\argmin{\operatorname*{argmin}} \def\argmax{\operatorname*{argmax}} \def\Ball{\mathbf{B}} \def\bmat#1{\begin{bmatrix}#1\end{bmatrix}} \def\Diag{\mathbf{Diag}} \def\half{\tfrac12} \def\int{\mathop{\rm int}} \def\ip#1{\langle #1 \rangle} \def\maxim{\mathop{\hbox{\rm maximize}}} \def\maximize#1{\displaystyle\maxim_{#1}} \def\minim{\mathop{\hbox{\rm minimize}}} \def\minimize#1{\displaystyle\minim_{#1}} \def\norm#1{\|#1\|} \def\Null{{\mathbf{null}}} \def\proj{\mathbf{proj}} \def\R{\mathbb R} \def\Re{\mathbb R} \def\Rn{\R^n} \def\rank{\mathbf{rank}} \def\range{{\mathbf{range}}} \def\sign{{\mathbf{sign}}} \def\span{{\mathbf{span}}} \def\st{\hbox{\rm subject to}} \def\T{^\intercal} \def\textt#1{\quad\text{#1}\quad} \def\trace{\mathbf{trace}} \def\xbar{\bar x} \]
The MNIST dataset (Modified National Institute of Standards and Technology) is a widely used benchmark in machine learning, particularly for handwritten digit recognition. It consists of grayscale images of handwritten digits (0-9), each of size 28 × 28 pixels. In this assignment, we focus on a binary classification problem, distinguishing between the digits “4” and “9”.
The approach we use is not state-of-the-art, but it will still give you experience using the tools you are learning to train a standard model. We will use two different classification models, linear regression and logistic regression, and compare their performance. There are more general approaches to this problem, but our emphasis here is on using basic tools of optimization.
1 MNIST dataset
We will work with a preprocessed, abbreviated version of the MNIST dataset, available for download. The dataset includes separate training and testing sets, where each image is represented as a flattened vector of pixel intensities. Along with the images, we have corresponding labels indicating the digit each image represents.
The first step in our analysis is to load the dataset and visualize some sample images before proceeding with data preprocessing and classification model implementation.
Download the dataset, and partition it into training and testing sets.
using MAT    # provides `matread` for reading MATLAB-format data files

vars = matread("../assets/mnist.mat") # adjust the file path if necessary
testX = vars["testX"]
testY = dropdims(Int.(vars["testY"]), dims=1) # labels
trainX = vars["trainX"]
trainY = dropdims(Int.(vars["trainY"]), dims=1) # labels
varinfo(Regex(".*X|.*Y"))
Here are a few sample images that we’ll be using to train the classifier.
using Plots, Images    # `plot`, plus the `Gray` and `N0f8` types for displaying pixel data

show_digit(X, i) = reinterpret(N0f8, reshape(X[i, :], 28, 28)') .|> Gray

p = map([1, 4, 6, 2]) do i
    plot(show_digit(trainX, i), title="$(trainY[i])",
         axis=nothing, showaxis=false, grid=false)
end
plot(p..., layout=(2, 2))
Here are a few random samples of the numbers 4 and 9 in the dataset.
fours = trainX[trainY .== 4, :]
nines = trainX[trainY .== 9, :]

# select `k` random 4s and 9s
k = 6
p = map(1:k) do i
    X, label = isodd(i) ? (fours, "4") : (nines, "9")
    plot(show_digit(X, rand(1:size(X, 1))), title=label,
         axis=nothing, showaxis=false, grid=false)
end
plot(p..., layout=(2, k ÷ 2))
1.1 Preprocessing
Before training our classification models, we need to preprocess the data to ensure it is in a suitable format for analysis. Preprocessing is a crucial step in machine learning as it helps improve model performance and generalization. Here we filter the dataset to include only the digits “4” and “9,” relabel the data for binary classification, and apply standard normalization techniques. These steps will help standardize the input data, remove biases, and facilitate efficient learning by the models.
1.1.1 Filter
Filter the train and test datasets to include only the digits 4 and 9, converting the data to double precision. (Double precision is used for simplicity, although single precision could save memory.)
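One possible implementation of this step (a sketch; the variable names Atrain, btrain, Atest, and btest match those used in the rest of this assignment):

mask_train = (trainY .== 4) .| (trainY .== 9)
mask_test  = (testY .== 4) .| (testY .== 9)
Atrain = Float64.(trainX[mask_train, :])   # training features, double precision
Atest  = Float64.(testX[mask_test, :])     # testing features, double precision
btrain = trainY[mask_train]                # raw digit labels (4 or 9)
btest  = testY[mask_test]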
1.1.2 Relabel
Convert the label vectors btrain and btest to a binary format by mapping 4 to +1 and 9 to -1, which ensures a consistent representation for the binary classification task: \[ b_i = \begin{cases} +1 & \text{if sample $i$ is a 4}, \\ -1 & \text{if sample $i$ is a 9}. \end{cases} \]
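For example, assuming btrain and btest still hold the raw digit labels from the filtering step:

btrain = Float64.(ifelse.(btrain .== 4, 1, -1))   # 4 ↦ +1, 9 ↦ -1
btest  = Float64.(ifelse.(btest .== 4, 1, -1))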
1.1.3 Center and normalize
To improve model performance and ensure numerical stability, we apply two common preprocessing steps: centering and normalization. Centering removes bias by subtracting the mean of each feature, while normalization scales the data to unit variance. This helps prevent features with larger magnitudes from dominating the learning process.
It’s important to remove the mean before normalizing the variance to ensure that the scaling reflects meaningful variations in the data rather than differences in absolute values.
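A minimal sketch of these two steps, using the Statistics standard library. The name σA (chosen to avoid clashing with the sigmoid σ used later) holds the feature standard deviations, with a guard against dividing by zero for pixels that are constant across the training set:

using Statistics

μ  = mean(Atrain, dims=1)        # 1×n row of feature means
σA = std(Atrain, dims=1)         # 1×n row of feature standard deviations
σA[σA .== 0] .= 1                # constant pixels: leave them unscaled
Atrain = (Atrain .- μ) ./ σA     # center, then normalize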
2 Linear regression
Linear regression is a simple yet effective approach to classification, often used as a baseline model because it’s easy to implement and interpretable. Although it is primarily designed for regression tasks, it can be adapted for classification by fitting a linear function to approximate the relationship between features and class labels. Here we formulate the problem as a least-squares optimization, solving for the model parameters that minimize the squared error between predicted and actual labels. We then evaluate the performance of this approach and discuss its advantages and limitations for binary classification.
2.1 Train the least-squares model
To train the least-squares model, we solve the optimization problem \[
x_{ls} = \mathop{\rm argmin}_x \|Ax - b\|_2^2,
\] where \(A\) and \(b\) refer specifically to the training data Atrain and btrain. Use Julia’s backslash operator (\) to solve the least-squares problem.
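For example, with the preprocessed training data from above, the solve is a single line:

x_ls = Atrain \ btrain    # least-squares solution, computed via a QR factorization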
2.2 Construct a classifier
You will now construct a classifier from the least-squares solution \(x_{ls}\). The classifier function \(C(x, a)\) takes a model vector \(x\in\Re^n\) and a feature vector \(a\in\Re^n\) (a vectorized image) and predicts a label; that is, \(C(x, a)\) should return either +1 or -1.
For the given data set of feature-label pairs \(\{a_i,b_i\}_{i=1}^m\subset\Rn\times\R\), the misclassification error rate is \[ \text{\tt error\_rate} = \frac{1}{m}\sum_{i=1}^m \mathbb{1}\{C(x_{ls}, a_i) \neq b_i\}, \] where \(\mathbb{1}\{p\}\) is the indicator function that evaluates to 1 if \(p\) is true and 0 otherwise and \((a_i, b_i)\) are the \(i\)th row of \(A\) (the \(i\)th feature vector) and the \(i\)th label.
Implement the functions C(x, a) and error_rate(A, b, x).
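Here is a minimal sketch of the two functions; the choice of mapping a score of exactly zero to +1 is arbitrary:

C(x, a) = a'x ≥ 0 ? 1 : -1                      # predicted label from the linear score
error_rate(A, b, x) = count(C(x, A[i, :]) != b[i] for i in 1:length(b)) / length(b)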
2.3 Evaluate the training misclassification rate
Now evaluate the misclassification rate on the training set.
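For example:

err_train = error_rate(Atrain, btrain, x_ls)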
2.4 Evaluate the testing misclassification rate
To assess how well our model generalizes to unseen data, we compute the testing error. The training loss and training error provide insight into how well the model fits the observed data, but don’t indicate how the model will perform on new, unseen examples. Evaluating the testing misclassification rate helps measure the model’s ability to generalize.
Before using the test dataset, we must preprocess it in exactly the same way as the training data. This means applying the same mean and standard deviation computed from the training set to ensure consistency.
Why is this important?
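In code, this amounts to reusing the training-set statistics (μ and σA in the normalization sketch above) rather than recomputing them from the test data:

Atest = (Atest .- μ) ./ σA                   # same shift and scale as the training data
err_test = error_rate(Atest, btest, x_ls)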
3 Logistic regression
Rather than fitting the data linearly to the +1 and -1 labels, we now use logistic regression to model the probability of the label being 0 or 1. In particular, for a feature vector \(a\in\Re^n\) and label \(b\in\{0,1\}\), we define the probability that the label \(b\) is associated with the feature vector \(a\) by evaluating the function
\[ \Pr(b\mid a, x) = \begin{cases} \sigma(x'a) & \text{if } b = 1, \\ 1 - \sigma(x'a) & \text{if } b = 0, \end{cases} \]
where \(\sigma(z) = 1/(1 + \exp(-z))\) is the sigmoid function.
The sigmoid function
This function can be interpreted as a “soft switch”: the input \(z\) can be viewed as a “confidence” that the label is 1, and the output \(\sigma(z)\) is the probability that the label is 1.
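In Julia the sigmoid is a one-liner; the loss function later in this assignment assumes it is available under the name σ:

σ(z) = 1 / (1 + exp(-z))    # sigmoid function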
3.1 Likelihood and log-likelihood
Now assume that our training data \(\{a_1, \ldots, a_m\}\) and labels \(\{b_1, \ldots, b_m\}\) are i.i.d. Then the likelihood of observing this entire training set, given our model \(x\), is
\[ \text{Likelihood}(x) = \prod_{i=1}^m \Pr(b_i\mid a_i, x) = \prod_{i=1}^m \sigma(x^T a_i)^{b_i} (1 - \sigma(x^T a_i))^{1 - b_i}. \]
We instead minimize the negative log-likelihood function (i.e., maximize the log-likelihood function):
\[ f(x) = -\log(\text{Likelihood}(x)), \]
which is the sum of the negative log-likelihoods of each individual data point. This is a standard trick in statistics, and is equivalent to maximizing the likelihood itself.
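Written out, the negative log-likelihood is \[ f(x) = -\sum_{i=1}^m \Big[ b_i \log \sigma(x^T a_i) + (1 - b_i) \log\bigl(1 - \sigma(x^T a_i)\bigr) \Big], \] which is exactly the form implemented in the loss function below.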
- Explain why it’s equivalent to maximize the log-likelihood instead of the likelihood.
- Derive the gradient and Hessian of the log-likelihood function \(f(x)\).
3.2 Train the logistic regression model
In this exercise you’ll train the logistic regression model using steepest descent.
First, you’ll need to convert the labels to 0 or 1. You can do this by adding 1 to the labels and dividing by 2.
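For example, using the name btrain01 that the loss function below expects:

btrain01 = (btrain .+ 1) ./ 2    # +1 ↦ 1, -1 ↦ 0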
Write the loss function and its gradient.
"""
Loss function. Note that we use `min` and `max` to avoid taking the log of zero or one.
"""
function f(x)
z = σ.(Atrain*x)
zadj = min.(max.(z, eps()), 1-eps())
return sum(-btrain01 .* log.(zadj) - (1.0 .- btrain01) .* log.(1.0 .- zadj))
end
"Loss gradient"
∇f(x) = # ... fill in
3.2.1 Steepest descent with constant stepsize
Train the model using steepest descent with
- a constant stepsize of \(\bar\alpha = 1/m\)
- an initial value \(x^{(0)} = 0\)
- for at most 1000 iterations or until the gradient is less than \(10^{-6}\)
As you train the model, keep track of the training and testing error rates and the loss at each iteration.
You can use the following struct to keep track of the training and testing error rates and the loss at each iteration.
"Trace for train and test error rates, and loss"
struct ErrorRateTrace
train::Vector{Float64}
test::Vector{Float64}
loss::Vector{Float64}
end
"Trace constructor"
ErrorRateTrace() = ErrorRateTrace([], [], [])
"Update trace with new values"
function update!(t::ErrorRateTrace, x)
push!(t.train, error_rate(Atrain, btrain, x)),
push!(t.test, error_rate(Atest, btest, x))
push!(t.loss, f(x))
end;
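One possible structure for the training loop, sketched under the settings listed above (the function name and keyword arguments are our own choices, and ∇f is the gradient you implemented earlier):

using LinearAlgebra    # for `norm`

"Steepest descent with constant stepsize, recording error rates and loss at each iteration"
function train_constant_step(; maxiters=1000, tol=1e-6)
    trace = ErrorRateTrace()
    x = zeros(size(Atrain, 2))       # initial value x⁽⁰⁾ = 0
    α = 1 / size(Atrain, 1)          # constant stepsize ᾱ = 1/m
    for _ in 1:maxiters
        g = ∇f(x)
        norm(g) < tol && break       # stop when the gradient is small
        x = x - α * g
        update!(trace, x)
    end
    return x, trace
end

x_lr, trace = train_constant_step()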
p_err = plot(yscale=:log10, xlab="iteration", ylab="error rate", title="Error rate vs. iteration")
plot!(p_err, trace.train, label="train")
plot!(p_err, trace.test, label="test")
p_loss = plot(trace.loss, yscale=:log10, xlabel="iteration", ylabel="loss", title="Loss vs. iteration")
plot(p_err, p_loss, layout=(2, 1))
3.3 Steepest descent with backtracking line search
Repeat the experiments above using steepest descent with backtracking line search.
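One standard choice is an Armijo backtracking rule; the sketch below uses typical parameter values (β = 0.5, c = 1e-4) that are our own defaults, not prescribed by the assignment:

using LinearAlgebra: dot

"Shrink the stepsize until the Armijo sufficient-decrease condition holds"
function backtrack(x, d, g; α = 1.0, β = 0.5, c = 1e-4)
    while f(x + α*d) > f(x) + c*α*dot(g, d)
        α *= β
    end
    return α
end

Inside the descent loop, take the direction d = -∇f(x) and replace the constant stepsize with α = backtrack(x, d, ∇f(x)).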
3.4 Visualize the misfits
Pick three images that fit the least-squares model well and three that do not. Do the same for the logistic model. Display these images and discuss why you think the model may have had trouble with the misfitting images.