Classifying handwritten digits
Now that we know what the fundamental building blocks of convolutional neural networks are, we can build our first image classification project: classifying handwritten digits.
The MNIST Dataset
The MNIST dataset is a widely used benchmark dataset in the field of machine learning and computer vision. It stands for Modified National Institute of Standards and Technology, as it is derived from the larger NIST dataset.
The MNIST dataset consists of a collection of handwritten digits from 0 to 9, represented as grayscale images of size 28x28 pixels. It contains a training set of 60,000 examples and a test set of 10,000 examples.
The goal of using the MNIST dataset is typically to develop and evaluate models that can accurately classify or recognize handwritten digits. It has become a popular dataset for tasks such as image classification, digit recognition, and machine learning algorithm benchmarking.
Due to its simplicity and small size, the MNIST dataset is often used as a starting point for learning and experimenting with various machine learning techniques and models. It provides a relatively simple and well-defined problem for practitioners to work with, allowing them to focus on understanding and implementing different algorithms and approaches.
Loading the dataset using PyTorch
The MNIST dataset is part of the PyTorch ecosystem, as it is often used for teaching convolutional neural networks. To load the MNIST dataset with PyTorch, you can use the torchvision package, which provides useful tools and datasets for computer vision tasks. Here's an example of how you can load the MNIST dataset using PyTorch:
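A minimal sketch of that code, with ./data as an illustrative root directory and 64 as an illustrative batch size:

```python
import torch
from torchvision import datasets, transforms

# Transformation pipeline: convert images to tensors and normalize them
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])

# Download (if necessary) and load the training and test sets
train_dataset = datasets.MNIST(root='./data', train=True, transform=transform, download=True)
test_dataset = datasets.MNIST(root='./data', train=False, transform=transform, download=True)

# Wrap the datasets in data loaders that yield mini-batches
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=64, shuffle=False)
```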
In this example, we first define the transformations to be applied to the data using transforms.Compose. We convert each image to a tensor with transforms.ToTensor() and then normalize it by subtracting a mean and dividing by a standard deviation with transforms.Normalize().
Next, we use the datasets.MNIST class to download and load the MNIST dataset. We specify the root directory where the dataset will be stored, whether to load the training set or the test set, the transformations to apply, and set download=True to automatically download the dataset if it is not already present.
Finally, we create data loaders using torch.utils.data.DataLoader to efficiently load the data in batches during training and testing. We specify the dataset, the batch size, and whether to shuffle the data.
With these data loaders, you can iterate over the dataset in batches during training or testing. For example, you can use a for loop over train_loader to access the training data in batches:
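For instance, a sketch that just inspects the first batch (the shapes in the comments follow from the 28x28 grayscale images and a batch size of 64):

```python
# Iterate over the training data in batches
for images, labels in train_loader:
    print(images.shape)   # torch.Size([64, 1, 28, 28])
    print(labels.shape)   # torch.Size([64])
    break                 # stop after inspecting the first batch
```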
Normalizing the input tensors
Before we can pass the images to the neural network, we need to run some transformations on them. Image data comes in different representations, for example as integer values between 0 and 255 or as floating-point values between 0.0 and 1.0, so results would not be reproducible if we fed the raw image data straight to the network. This is why we standardize the input images. PyTorch allows us to build a transformation pipeline in which we chain several functions that are applied to the image tensors before they are passed to the network. Here is how the code looks:
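A sketch of the pipeline, matching the transformations described below:

```python
from torchvision import transforms

# Chain the transformations applied to every input image
transform = transforms.Compose([
    transforms.ToTensor(),                # PIL Image / numpy array -> float tensor in [0.0, 1.0]
    transforms.Normalize((0.5,), (0.5,))  # (x - 0.5) / 0.5 -> values in [-1.0, 1.0]
])
```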
In this piece of code, we define the data transformations using torchvision.transforms.Compose. Data transformations are applied to the input data to preprocess and augment it before it is fed into the neural network. Let's break down the transformations used:
ToTensor(): transforms.ToTensor() converts the input PIL Image or NumPy array to a PyTorch tensor. It converts the image data from a range of 0 to 255 (integer) to a range of 0.0 to 1.0 (float). It also converts the image from the H x W x C format (height x width x channels) to the C x H x W format (channels x height x width). This transformation is necessary because PyTorch models expect input data in tensor format.
Normalize(): transforms.Normalize((0.5,), (0.5,)) normalizes the input tensor by subtracting a mean and dividing by a standard deviation. In this case, the mean is set to 0.5 and the standard deviation is also set to 0.5, which rescales the values from the range [0.0, 1.0] to the range [-1.0, 1.0] and centers the data around zero. Normalization can improve the convergence speed and performance of the neural network during training.
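To make the arithmetic concrete, here is a small example applying Normalize to a tiny hand-made tensor (the values are chosen to show the boundary cases):

```python
import torch
from torchvision import transforms

normalize = transforms.Normalize((0.5,), (0.5,))
x = torch.tensor([[[0.0, 0.5, 1.0]]])  # a tiny 1x1x3 "image" (C x H x W)
print(normalize(x))                    # tensor([[[-1.,  0.,  1.]]])
```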
Creating a Convolutional Neural Network
Now that we have created our data loaders, we can define the architecture of our convolutional neural network.
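A sketch of such a network: the first block matches the layer-by-layer walkthrough below, while the second block and the classifier sizes are plausible assumptions rather than values fixed by the text:

```python
import torch.nn as nn

class CNN(nn.Module):
    def __init__(self):
        super().__init__()
        # First block: convolution -> ReLU -> max pooling
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, stride=1, padding=1)
        self.relu1 = nn.ReLU()
        self.maxpool1 = nn.MaxPool2d(kernel_size=2, stride=2)
        # Second block (channel sizes are an assumption): 16 -> 32 channels
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1)
        self.relu2 = nn.ReLU()
        self.maxpool2 = nn.MaxPool2d(kernel_size=2, stride=2)
        # Fully connected classifier: 28x28 halved twice -> 7x7 feature maps
        self.fc1 = nn.Linear(32 * 7 * 7, 128)
        self.relu3 = nn.ReLU()
        self.fc2 = nn.Linear(128, 10)  # 10 classes, one per digit

    def forward(self, x):
        x = self.maxpool1(self.relu1(self.conv1(x)))
        x = self.maxpool2(self.relu2(self.conv2(x)))
        x = x.view(x.size(0), -1)      # flatten to (batch_size, 32*7*7)
        x = self.relu3(self.fc1(x))
        return self.fc2(x)

model = CNN()
```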
In this example, we define a class CNN that inherits from nn.Module, the base class for all neural network modules in PyTorch. Inside the CNN class, we define the layers of the model in the __init__ method. We use convolutional layers (nn.Conv2d), ReLU activation functions (nn.ReLU), and max pooling layers (nn.MaxPool2d) to build the network, and we include fully connected (linear) layers (nn.Linear) for the final classification. The forward method defines the forward pass of the network, specifying how the input flows through the layers.
After defining the model, we create an instance of it by calling CNN() and assign it to the model variable. You can then use this model for training and inference on the MNIST dataset.
A detailed look at the network layers
Let's have a more detailed look at what happens in the different layers. We will use the first block of layers, consisting of the conv1, relu1, and maxpool1 layers.
Convolutional layer (conv1): self.conv1 = nn.Conv2d(1, 16, kernel_size=3, stride=1, padding=1)
This layer performs a 2D convolution on the input. The input to this layer is expected to have 1 channel (since MNIST images are grayscale), and it produces 16 output channels. The kernel_size is set to 3, which means a 3x3 filter/kernel is convolved over the input. The stride is set to 1, indicating that the filter moves one pixel at a time. The padding is set to 1, which pads the input with zeros on all sides so that the output has the same spatial dimensions as the input.
ReLU activation (relu1): self.relu1 = nn.ReLU()
ReLU (Rectified Linear Unit) is an activation function that introduces non-linearity to the model. It applies an element-wise activation, setting all negative values to zero and keeping positive values unchanged. ReLU is commonly used in neural networks to add non-linearity and increase the model's representational capacity.
Max pooling layer (maxpool1): self.maxpool1 = nn.MaxPool2d(kernel_size=2, stride=2)
Max pooling is a downsampling operation that reduces the spatial dimensions of the input. The kernel_size is set to 2, indicating that a 2x2 window is used to pool the input. The stride is also set to 2, meaning that the window moves two pixels at a time. Max pooling takes the maximum value within each window and discards the rest. In this case, the max pooling operation reduces the spatial dimensions of the input by a factor of 2, effectively downsampling the feature maps.
To summarize, the first block processes the input through a convolutional layer (conv1) to extract features, applies a ReLU activation function (relu1) to introduce non-linearity, and then performs max pooling (maxpool1) to downsample the feature maps. These operations capture important patterns and features from the input images while reducing the spatial dimensions, extracting relevant information in a more computationally efficient manner.
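You can verify these shape changes by pushing a dummy batch through the first block of the model instance from above:

```python
import torch

x = torch.randn(1, 1, 28, 28)  # one grayscale 28x28 image
x = model.conv1(x)             # -> (1, 16, 28, 28): padding keeps the 28x28 size
x = model.relu1(x)             # -> (1, 16, 28, 28): shape unchanged
x = model.maxpool1(x)          # -> (1, 16, 14, 14): spatial dims halved
print(x.shape)
```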
Training the network
Now that we have defined the datasets, their transformations, and the network, we can build the complete training loop. Here is how the code for training the network looks:
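A sketch of that loop, following the description below; the cross-entropy loss, the Adam optimizer, the learning rate, and the epoch count are common choices rather than values fixed by the text, and val_loader simply reuses the test loader from earlier:

```python
import torch
import torch.nn as nn

num_epochs = 5
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
val_loader = test_loader  # use the test split for validation

for epoch in range(num_epochs):
    # Training phase
    model.train()
    train_loss, train_correct = 0.0, 0
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(images)            # forward propagation
        loss = criterion(outputs, labels)  # compute the loss
        loss.backward()                    # backpropagation
        optimizer.step()                   # optimization step
        train_loss += loss.item() * images.size(0)
        train_correct += (outputs.argmax(dim=1) == labels).sum().item()

    # Validation phase
    model.eval()
    val_loss, val_correct = 0.0, 0
    with torch.no_grad():
        for images, labels in val_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            val_loss += criterion(outputs, labels).item() * images.size(0)
            val_correct += (outputs.argmax(dim=1) == labels).sum().item()

    # Average losses and accuracies for this epoch
    n_train, n_val = len(train_loader.dataset), len(val_loader.dataset)
    print(f"Epoch {epoch + 1}/{num_epochs} | "
          f"train loss {train_loss / n_train:.4f}, acc {train_correct / n_train:.4f} | "
          f"val loss {val_loss / n_val:.4f}, acc {val_correct / n_val:.4f}")
```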
In this code, after defining the model architecture and creating the data loaders as discussed earlier, we set the number of epochs to train (num_epochs) and choose the device to run the model on (device). If a GPU is available, it will be used; otherwise, the code falls back to the CPU.
Within the training loop, for each epoch, we iterate over the training dataset using the train_loader. We move the input images and labels to the chosen device, perform forward propagation through the model (model(images)), calculate the loss (criterion(outputs, labels)), and perform the backpropagation and optimization steps. We keep track of the training loss and the number of correct predictions for calculating accuracy.
After each epoch, we switch the model to evaluation mode (model.eval()) and loop over the validation dataset using the val_loader. We perform forward propagation, calculate the validation loss, and keep track of the number of correct predictions.
Finally, we calculate the average losses and accuracies for both the training and validation datasets. We print these values after each epoch to monitor the training progress.
Your exercises
Use the MNIST notebook, fill in the missing gaps, and train the neural network to classify digits.