Classifying handwritten digits
Now that we know what the fundamental building blocks of convolutional neural networks are, we can build our first image classification project: classifying handwritten digits.
The MNIST Dataset
The MNIST dataset is a widely used benchmark dataset in the field of machine learning and computer vision. It stands for Modified National Institute of Standards and Technology, as it is derived from the larger NIST dataset.
The MNIST dataset consists of a collection of handwritten digits from 0 to 9, represented as grayscale images of size 28x28 pixels. It contains a training set of 60,000 examples and a test set of 10,000 examples.
The goal of using the MNIST dataset is typically to develop and evaluate models that can accurately classify or recognize handwritten digits. It has become a popular dataset for tasks such as image classification, digit recognition, and machine learning algorithm benchmarking.
Due to its simplicity and small size, the MNIST dataset is often used as a starting point for learning and experimenting with various machine learning techniques and models. It provides a relatively simple and well-defined problem for practitioners to work with, allowing them to focus on understanding and implementing different algorithms and approaches.
Loading the dataset using PyTorch
The MNIST dataset is part of the PyTorch ecosystem, as it is often used for teaching convolutional neural networks. To load the MNIST dataset with PyTorch, you can use the torchvision package, which provides useful tools and datasets for computer vision tasks. Here's an example of how you can load the MNIST dataset using PyTorch:
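A minimal sketch of that code, with ./data as an illustrative root directory and 64 as an illustrative batch size:

```python
import torch
from torchvision import datasets, transforms

# Transformation pipeline: convert images to tensors and normalize them
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])

# Download (if necessary) and load the training and test sets
train_dataset = datasets.MNIST(root='./data', train=True, transform=transform, download=True)
test_dataset = datasets.MNIST(root='./data', train=False, transform=transform, download=True)

# Wrap the datasets in data loaders that yield mini-batches
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=64, shuffle=False)
```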
In this example, we first define the transformations to be applied to the data using transforms.Compose. We convert each image to a tensor with transforms.ToTensor() and then normalize it by subtracting a mean and dividing by a standard deviation with transforms.Normalize().
Next, we use the datasets.MNIST class to download and load the MNIST dataset. We specify the root directory where the dataset will be stored, whether to load the training set or the test set, the transformations to apply, and set download=True to automatically download the dataset if it is not already present.
Finally, we create data loaders using torch.utils.data.DataLoader to efficiently load the data in batches during training and testing. We specify the dataset, the batch size, and whether to shuffle the data.
With these data loaders, you can iterate over the dataset in batches during training or testing. For example, you can use a for loop over train_loader to access the training data in batches:
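For instance, a sketch that just inspects the first batch (the shapes in the comments follow from the 28x28 grayscale images and a batch size of 64):

```python
# Iterate over the training data in batches
for images, labels in train_loader:
    print(images.shape)   # torch.Size([64, 1, 28, 28])
    print(labels.shape)   # torch.Size([64])
    break                 # stop after inspecting the first batch
```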
Normalizing the input tensors
Before we can pass the images to the neural network, we need to run some transformations on them. Image data comes in different representations, for example as integer values between 0 and 255 or as floating-point values between 0.0 and 1.0, so results would not be reproducible if we fed the raw image data straight to the network. This is why we standardize the input images. PyTorch allows us to build a transformation pipeline in which we chain several functions that are applied to the image tensors before they are passed to the network. Here is how the code looks:
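A sketch of the pipeline, matching the transformations described below:

```python
from torchvision import transforms

# Chain the transformations applied to every input image
transform = transforms.Compose([
    transforms.ToTensor(),                # PIL Image / numpy array -> float tensor in [0.0, 1.0]
    transforms.Normalize((0.5,), (0.5,))  # (x - 0.5) / 0.5 -> values in [-1.0, 1.0]
])
```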
In this piece of code, we define the data transformations using torchvision.transforms.Compose. Data transformations are applied to the input data to preprocess and augment it before it is fed into the neural network. Let's break down the transformations used:
ToTensor(): transforms.ToTensor() converts the input PIL Image or NumPy array to a PyTorch tensor. It converts the image data from a range of 0 to 255 (integer) to a range of 0.0 to 1.0 (float). It also converts the image from the H x W x C format (height x width x channels) to the C x H x W format (channels x height x width). This transformation is necessary because PyTorch models expect input data in tensor format.
Normalize(): transforms.Normalize((0.5,), (0.5,)) normalizes the input tensor by subtracting a mean and dividing by a standard deviation. In this case, the mean is set to 0.5 and the standard deviation is also set to 0.5, which rescales the values from the range [0.0, 1.0] to the range [-1.0, 1.0] and centers the data around zero. Normalization can improve the convergence speed and performance of the neural network during training.
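To make the arithmetic concrete, here is a small example applying Normalize to a tiny hand-made tensor (the values are chosen to show the boundary cases):

```python
import torch
from torchvision import transforms

normalize = transforms.Normalize((0.5,), (0.5,))
x = torch.tensor([[[0.0, 0.5, 1.0]]])  # a tiny 1x1x3 "image" (C x H x W)
print(normalize(x))                    # tensor([[[-1.,  0.,  1.]]])
```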
Creating a Convolutional Neural Network
Now that we have created our data loaders, we can define the architecture of our convolutional neural network.
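A sketch of such a network: the first block matches the layer-by-layer walkthrough below, while the second block and the classifier sizes are plausible assumptions rather than values fixed by the text:

```python
import torch.nn as nn

class CNN(nn.Module):
    def __init__(self):
        super().__init__()
        # First block: convolution -> ReLU -> max pooling
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, stride=1, padding=1)
        self.relu1 = nn.ReLU()
        self.maxpool1 = nn.MaxPool2d(kernel_size=2, stride=2)
        # Second block (channel sizes are an assumption): 16 -> 32 channels
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1)
        self.relu2 = nn.ReLU()
        self.maxpool2 = nn.MaxPool2d(kernel_size=2, stride=2)
        # Fully connected classifier: 28x28 halved twice -> 7x7 feature maps
        self.fc1 = nn.Linear(32 * 7 * 7, 128)
        self.relu3 = nn.ReLU()
        self.fc2 = nn.Linear(128, 10)  # 10 classes, one per digit

    def forward(self, x):
        x = self.maxpool1(self.relu1(self.conv1(x)))
        x = self.maxpool2(self.relu2(self.conv2(x)))
        x = x.view(x.size(0), -1)      # flatten to (batch_size, 32*7*7)
        x = self.relu3(self.fc1(x))
        return self.fc2(x)

model = CNN()
```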
In this example, we define a class CNN that inherits from nn.Module, the base class for all neural network modules in PyTorch. Inside the CNN class, we define the layers of the model in the __init__ method. We use convolutional layers (nn.Conv2d), ReLU activation functions (nn.ReLU), and max pooling layers (nn.MaxPool2d) to build the network, and we include fully connected (linear) layers (nn.Linear) for the final classification. The forward method defines the forward pass of the network, specifying how the input flows through the layers.
After defining the model, we create an instance of it by calling CNN() and assign it to the model variable. You can then use this model for training and inference on the MNIST dataset.
A detailed look at the network layers
Let's have a more detailed look at what happens in the different layers. We will use the first block of layers, consisting of the conv1, relu1, and maxpool1 layers.
Convolutional layer (conv1): self.conv1 = nn.Conv2d(1, 16, kernel_size=3, stride=1, padding=1)
This layer performs a 2D convolution on the input. The input to this layer is expected to have 1 channel (since MNIST images are grayscale), and it produces 16 output channels. The kernel_size is set to 3, which means a 3x3 filter/kernel is convolved over the input. The stride is set to 1, indicating that the filter moves one pixel at a time. The padding is set to 1, which pads the input with zeros on all sides so that the output has the same spatial dimensions as the input.
ReLU activation (relu1): self.relu1 = nn.ReLU()
ReLU (Rectified Linear Unit) is an activation function that introduces non-linearity to the model. It applies an element-wise activation, setting all negative values to zero and keeping positive values unchanged. ReLU is commonly used in neural networks to add non-linearity and increase the model's representational capacity.
Max pooling layer (maxpool1): self.maxpool1 = nn.MaxPool2d(kernel_size=2, stride=2)
Max pooling is a downsampling operation that reduces the spatial dimensions of the input. The kernel_size is set to 2, indicating that a 2x2 window is used to pool the input. The stride is also set to 2, meaning that the window moves two pixels at a time. Max pooling takes the maximum value within each window and discards the rest. In this case, the max pooling operation reduces the spatial dimensions of the input by a factor of 2, effectively downsampling the feature maps.
To summarize, the first block processes the input through a convolutional layer (conv1) to extract features, applies a ReLU activation function (relu1) to introduce non-linearity, and then performs max pooling (maxpool1) to downsample the feature maps. These operations capture important patterns and features from the input images while reducing the spatial dimensions, extracting relevant information in a more computationally efficient manner.
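You can verify these shape changes by pushing a dummy batch through the first block of the model instance from above:

```python
import torch

x = torch.randn(1, 1, 28, 28)  # one grayscale 28x28 image
x = model.conv1(x)             # -> (1, 16, 28, 28): padding keeps the 28x28 size
x = model.relu1(x)             # -> (1, 16, 28, 28): shape unchanged
x = model.maxpool1(x)          # -> (1, 16, 14, 14): spatial dims halved
print(x.shape)
```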
Training the network
Now that we have defined the datasets, their transformations, and the network, we can build the complete training loop. Here is how the code for training the network looks:
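A sketch of that loop, following the description below; the cross-entropy loss, the Adam optimizer, the learning rate, and the epoch count are common choices rather than values fixed by the text, and val_loader simply reuses the test loader from earlier:

```python
import torch
import torch.nn as nn

num_epochs = 5
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
val_loader = test_loader  # use the test split for validation

for epoch in range(num_epochs):
    # Training phase
    model.train()
    train_loss, train_correct = 0.0, 0
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(images)            # forward propagation
        loss = criterion(outputs, labels)  # compute the loss
        loss.backward()                    # backpropagation
        optimizer.step()                   # optimization step
        train_loss += loss.item() * images.size(0)
        train_correct += (outputs.argmax(dim=1) == labels).sum().item()

    # Validation phase
    model.eval()
    val_loss, val_correct = 0.0, 0
    with torch.no_grad():
        for images, labels in val_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            val_loss += criterion(outputs, labels).item() * images.size(0)
            val_correct += (outputs.argmax(dim=1) == labels).sum().item()

    # Average losses and accuracies for this epoch
    n_train, n_val = len(train_loader.dataset), len(val_loader.dataset)
    print(f"Epoch {epoch + 1}/{num_epochs} | "
          f"train loss {train_loss / n_train:.4f}, acc {train_correct / n_train:.4f} | "
          f"val loss {val_loss / n_val:.4f}, acc {val_correct / n_val:.4f}")
```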
In this code, after defining the model architecture and creating the data loaders as discussed earlier, we set the number of epochs to train (num_epochs) and choose the device to run the model on (device). If a GPU is available, it will be used; otherwise, the code falls back to the CPU.
Within the training loop, for each epoch, we iterate over the training dataset using the train_loader. We move the input images and labels to the chosen device, perform forward propagation through the model (model(images)), calculate the loss (criterion(outputs, labels)), and perform the backpropagation and optimization steps. We keep track of the training loss and the number of correct predictions for calculating accuracy.
After each epoch, we switch the model to evaluation mode (model.eval()) and loop over the validation dataset using the val_loader. We perform forward propagation, calculate the validation loss, and keep track of the number of correct predictions.
Finally, we calculate the average losses and accuracies for both the training and validation datasets. We print these values after each epoch to monitor the training progress.
Your exercises
Use the MNIST notebook, fill in the missing gaps, and train the neural network to classify digits.