# Classifying handwritten digits

## The MNIST Dataset

The MNIST dataset is a widely used benchmark dataset in the field of machine learning and computer vision. It stands for Modified National Institute of Standards and Technology, as it is derived from the larger NIST dataset.

The MNIST dataset consists of a collection of handwritten digits from 0 to 9, represented as grayscale images of size 28x28 pixels. It contains a training set of 60,000 examples and a test set of 10,000 examples.

The goal of using the MNIST dataset is typically to develop and evaluate models that can accurately classify or recognize handwritten digits. It has become a popular dataset for tasks such as image classification, digit recognition, and machine learning algorithm benchmarking.

Due to its simplicity and small size, the MNIST dataset is often used as a starting point for learning and experimenting with various machine learning techniques and models. It provides a relatively simple and well-defined problem for practitioners to work with, allowing them to focus on understanding and implementing different algorithms and approaches.

## Loading the dataset using PyTorch

Because MNIST is used so often for teaching convolutional neural networks, it ships with `torchvision`, the companion package to PyTorch that provides useful tools and datasets for computer vision tasks. Here's an example of how you can load the MNIST dataset using PyTorch:

```python
import torch
from torchvision import datasets, transforms

# Define the data transformations
transform = transforms.Compose([
    transforms.ToTensor(),  # Convert image to tensor
    transforms.Normalize((0.5,), (0.5,))  # Normalize the image
])

# Download and load the MNIST training set
train_dataset = datasets.MNIST(
    root='./data',
    train=True,
    transform=transform,
    download=True
)

# Download and load the MNIST test set
test_dataset = datasets.MNIST(
    root='./data',
    train=False,
    transform=transform,
    download=True
)

# Create data loaders to efficiently load the data in batches
batch_size = 64
train_loader = torch.utils.data.DataLoader(
    dataset=train_dataset,
    batch_size=batch_size,
    shuffle=True
)

test_loader = torch.utils.data.DataLoader(
    dataset=test_dataset,
    batch_size=batch_size,
    shuffle=False
)
```

In this example, we first define the transformations to be applied to the data using `transforms.Compose`. We convert the image to a tensor using `transforms.ToTensor()` and then normalize the image by subtracting the mean and dividing by the standard deviation using `transforms.Normalize()`.

Next, we use the `datasets.MNIST` class to download and load the MNIST dataset. We specify the root directory where the dataset will be stored, whether it's the training set or the test set, the transformations to be applied, and set `download=True` to automatically download the dataset if it's not already downloaded.

Finally, we create data loaders using `torch.utils.data.DataLoader` to efficiently load the data in batches during training and testing. We specify the dataset, batch size, and whether to shuffle the data.

With these data loaders, you can iterate over the dataset in batches during training or testing. For example, you can use a `for` loop to iterate over `train_loader` to access the training data in batches:

```python
for images, labels in train_loader:
    # Your training code here
    pass
```
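
To see what each iteration yields without downloading anything, here is a minimal sketch that swaps in random tensors with the same shape as MNIST images. The names `fake_images`, `fake_labels` and `fake_dataset`, as well as the dataset size of 200, are assumptions made for illustration only.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data with MNIST-like shapes: 200 grayscale images of 28x28 pixels
fake_images = torch.randn(200, 1, 28, 28)
fake_labels = torch.randint(0, 10, (200,))
fake_dataset = TensorDataset(fake_images, fake_labels)

loader = DataLoader(fake_dataset, batch_size=64, shuffle=True)

batch_shapes = []
for images, labels in loader:
    # images: [batch, channels, height, width]; labels: [batch]
    batch_shapes.append((tuple(images.shape), tuple(labels.shape)))

print(len(loader))       # number of batches: ceil(200 / 64) = 4
print(batch_shapes[0])   # ((64, 1, 28, 28), (64,))
print(batch_shapes[-1])  # ((8, 1, 28, 28), (8,)) because only 8 samples remain
```

The last batch is smaller whenever the dataset size is not a multiple of the batch size, which is worth keeping in mind when averaging per-batch values.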

## Normalizing the input tensors

Before we can pass the images to the neural network we need to run some transformations on them. Image files usually store pixel intensities as integer values between 0 and 255, while neural networks train more reliably on small floating-point values in a consistent range. If we just used the raw image data, the results would depend on how the input happened to be encoded and would be hard to reproduce. This is why we have to standardize the input images. PyTorch allows us to build a transformation pipeline in which we chain several functions that are applied to the images before they are passed to the network. Here is what the code looks like:

```python
# Define the data transformations
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])
```

In this piece of code, we define the data transformations using `torchvision.transforms.Compose`. Data transformations are applied to the input data to preprocess and augment it before feeding it into the neural network. Let's break down the transformations used:

1. **ToTensor():**
   * `transforms.ToTensor()` converts the input PIL Image or numpy array to a PyTorch tensor.
   * It converts the image data from a range of 0 to 255 (integer) to a range of 0.0 to 1.0 (float).
   * It also converts the image from the H x W x C format (height x width x channels) to the C x H x W format (channels x height x width).
   * This transformation is necessary because PyTorch models expect input data in tensor format.
2. **Normalize():**
   * `transforms.Normalize((0.5,), (0.5,))` normalizes the input tensor by subtracting a mean and dividing by a standard deviation.
   * In this case, the mean is set to 0.5 and the standard deviation is also set to 0.5.
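   * With these values, pixel values in the range 0.0 to 1.0 are mapped to the range -1.0 to 1.0, so the data is centered around zero. This does not give exactly zero mean and unit variance; for that you would have to use the actual mean and standard deviation of the dataset.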
   * Normalization can improve the convergence speed and performance of the neural network during training.
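
To make the arithmetic of the normalization step concrete, here is a minimal sketch that applies the same formula by hand with plain tensor operations instead of calling `transforms.Normalize`. The sample pixel values are made up for illustration.

```python
import torch

# Pixel values as they look after ToTensor(): floats in [0.0, 1.0]
pixels = torch.tensor([0.0, 0.25, 0.5, 1.0])

mean, std = 0.5, 0.5
# Same per-channel arithmetic that Normalize applies: (value - mean) / std
normalized = (pixels - mean) / std

print(normalized.tolist())  # [-1.0, -0.5, 0.0, 1.0]
```

Black pixels (0.0) end up at -1.0, white pixels (1.0) at 1.0, and mid-gray at 0.0.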

## Creating a Convolutional Neural Network

Now that we have created our data loaders, we can define the architecture of our convolutional neural network.

```python
import torch
import torch.nn as nn

# Define the CNN model
class MNISTCNN(nn.Module):
    def __init__(self):
        super().__init__()
        
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, stride=1, padding=1)
        self.relu1 = nn.ReLU()
        self.maxpool1 = nn.MaxPool2d(kernel_size=2, stride=2)
        
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1)
        self.relu2 = nn.ReLU()
        self.maxpool2 = nn.MaxPool2d(kernel_size=2, stride=2)
        
        self.fc1 = nn.Linear(7 * 7 * 32, 128)
        self.relu3 = nn.ReLU()
        self.fc2 = nn.Linear(128, 10)
    
    def forward(self, x):
        x = self.conv1(x)
        x = self.relu1(x)
        x = self.maxpool1(x)
        
        x = self.conv2(x)
        x = self.relu2(x)
        x = self.maxpool2(x)
        
        x = x.view(x.size(0), -1)
        
        x = self.fc1(x)
        x = self.relu3(x)
        x = self.fc2(x)
        
        return x

# Create an instance of the CNN model
model = MNISTCNN()
```

In this example, we define a class `MNISTCNN` that inherits from `nn.Module`, which is the base class for all neural network modules in PyTorch. Inside the `MNISTCNN` class, we define the layers of the CNN model in the `__init__` method. We use convolutional layers (`nn.Conv2d`), ReLU activation functions (`nn.ReLU`), and max pooling layers (`nn.MaxPool2d`) to build the network. We also include fully connected (linear) layers (`nn.Linear`) for the final classification. The `forward` method defines the forward pass of the network, specifying how the input flows through the layers.

After defining the model, we create an instance of it by calling `MNISTCNN()` and assign it to the `model` variable. You can then use this model for training and inference on the MNIST dataset.
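
A quick way to check an architecture like this is to push a dummy batch through it and look at the output shape. The sketch below uses a compact `nn.Sequential` stand-in with the same layer sizes so that it runs on its own; `sketch_model` and `dummy_batch` are illustrative names, not part of the course code.

```python
import torch
import torch.nn as nn

# Compact stand-in with the same layer sizes as the model above (illustrative only)
sketch_model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, stride=1, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
    nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
    nn.Flatten(),
    nn.Linear(7 * 7 * 32, 128), nn.ReLU(),
    nn.Linear(128, 10),
)

dummy_batch = torch.randn(4, 1, 28, 28)  # 4 random "images" with MNIST dimensions
with torch.no_grad():
    logits = sketch_model(dummy_batch)

predicted_digits = logits.argmax(dim=1)  # index of the largest score per image
num_params = sum(p.numel() for p in sketch_model.parameters())

print(logits.shape)            # torch.Size([4, 10]): one score per digit class
print(predicted_digits.shape)  # torch.Size([4]): one predicted digit per image
print(num_params)              # 206922 trainable parameters
```

Most of the parameters sit in the first fully connected layer (1568 x 128 weights), not in the convolutional layers.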

## A detailed look at the network layers

Let's have a more detailed look at what happens in the different layers. We will use the first block of layers, consisting of the `conv1`, `relu1` and `maxpool1` layers.

1. **Convolutional layer (`conv1`):**
   * `self.conv1 = nn.Conv2d(1, 16, kernel_size=3, stride=1, padding=1)`
   * This layer performs a 2D convolution on the input.
   * The input to this layer is expected to have 1 channel (since MNIST images are grayscale), and it produces 16 output channels.
   * The `kernel_size` is set to 3, which means it uses a 3x3 filter/kernel to convolve over the input.
   * The `stride` is set to 1, indicating that the filter moves one pixel at a time.
   * The `padding` is set to 1, which pads the input with zeros on all sides, to ensure that the output has the same spatial dimensions as the input.
2. **ReLU activation (`relu1`):**
   * `self.relu1 = nn.ReLU()`
   * ReLU (Rectified Linear Unit) is an activation function that introduces non-linearity to the model.
   * It applies an element-wise activation, setting all negative values to zero and keeping the positive values unchanged.
   * ReLU is commonly used in neural networks to add non-linearity and increase the model's representational capacity.
3. **Max pooling layer (`maxpool1`):**
   * `self.maxpool1 = nn.MaxPool2d(kernel_size=2, stride=2)`
   * Max pooling is a downsampling operation that reduces the spatial dimensions of the input.
   * The `kernel_size` is set to 2, indicating that it uses a 2x2 window to pool the input.
   * The `stride` is also set to 2, meaning that the window moves two pixels at a time.
   * Max pooling takes the maximum value within each window, discarding the rest.
   * In this case, the max pooling operation reduces the spatial dimensions of the input by a factor of 2, effectively downsampling the feature maps.

To summarize, the first block processes the input through a convolutional layer (`conv1`) to extract features, applies a ReLU activation function (`relu1`) to introduce non-linearity, and then performs max pooling (`maxpool1`) to downsample the feature maps. These operations help in learning and capturing important patterns and features from the input images, while reducing the spatial dimensions to extract relevant information in a more computationally efficient manner.
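
The effect of the first block on the tensor shape can be traced step by step. This is a minimal sketch using a random input in place of a real digit; the layers are created standalone here with the same arguments as in the model.

```python
import torch
import torch.nn as nn

conv1 = nn.Conv2d(1, 16, kernel_size=3, stride=1, padding=1)
relu1 = nn.ReLU()
maxpool1 = nn.MaxPool2d(kernel_size=2, stride=2)

x = torch.randn(1, 1, 28, 28)  # one random grayscale "image"

after_conv = conv1(x)              # padding=1 keeps 28x28, channels go from 1 to 16
after_relu = relu1(after_conv)     # same shape, negative values replaced by zero
after_pool = maxpool1(after_relu)  # 2x2 windows halve height and width

print(after_conv.shape)  # torch.Size([1, 16, 28, 28])
print(after_relu.shape)  # torch.Size([1, 16, 28, 28])
print(after_pool.shape)  # torch.Size([1, 16, 14, 14])
print(bool((after_relu >= 0).all()))  # True
```

The second block repeats the same pattern with 16 input and 32 output channels, which leaves 32 feature maps of size 7x7. That is why the first fully connected layer expects `7 * 7 * 32` inputs.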

## Training the network

Now that we have defined our datasets and our model, it is time to train the model. Here is the complete training code:

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms

# The MNISTCNN class is defined as shown in the previous section.
# Paste that definition here when running this script on its own.

# Set random seed for reproducibility
torch.manual_seed(42)

# Define the data transformations
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])

# Download and load the MNIST training set
train_dataset = datasets.MNIST(
    root='./data',
    train=True,
    transform=transform,
    download=True
)

# Download and load the MNIST validation set (the MNIST test split is used for validation here)
val_dataset = datasets.MNIST(
    root='./data',
    train=False,
    transform=transform,
    download=True
)

# Create data loaders
batch_size = 32
train_loader = torch.utils.data.DataLoader(
    dataset=train_dataset,
    batch_size=batch_size,
    shuffle=True
)

val_loader = torch.utils.data.DataLoader(
    dataset=val_dataset,
    batch_size=batch_size,
    shuffle=False
)

# Create an instance of the CNN model
model = MNISTCNN()

# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
num_epochs = 10
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

for epoch in range(num_epochs):
    # Training
    model.train()
    train_loss = 0.0
    train_correct = 0
    
    for images, labels in train_loader:
        images = images.to(device)
        labels = labels.to(device)
        
        optimizer.zero_grad()
        
        outputs = model(images)
        _, predicted = torch.max(outputs, 1)
        
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        
        train_loss += loss.item() * images.size(0)
        train_correct += (predicted == labels).sum().item()
    
    train_loss /= len(train_loader.dataset)
    train_accuracy = 100.0 * train_correct / len(train_loader.dataset)
    
    # Validation
    model.eval()
    val_loss = 0.0
    val_correct = 0
    
    with torch.no_grad():
        for images, labels in val_loader:
            images = images.to(device)
            labels = labels.to(device)
            
            outputs = model(images)
            _, predicted = torch.max(outputs, 1)
            
            loss = criterion(outputs, labels)
            
            val_loss += loss.item() * images.size(0)
            val_correct += (predicted == labels).sum().item()
    
    val_loss /= len(val_loader.dataset)
    val_accuracy = 100.0 * val_correct / len(val_loader.dataset)
    
    # Print loss and accuracy
    print(f"Epoch {epoch+1}/{num_epochs}")
    print(f"Train Loss: {train_loss:.4f} | Train Accuracy: {train_accuracy:.2f}%")
    print(f"Val Loss: {val_loss:.4f} | Val Accuracy: {val_accuracy:.2f}%")
    print("------------------")
```

In this code, after defining the model architecture and creating the data loaders as discussed earlier, we set the number of epochs to train (`num_epochs`) and choose the device to run the model on (`device`). If a GPU is available, it will be used; otherwise, it will fall back to CPU.

Within the training loop, for each epoch, we iterate over the training dataset using the `train_loader`. We move the input images and labels to the chosen device. We perform forward propagation through the model (`model(images)`), calculate the loss (`criterion(outputs, labels)`), and perform backpropagation and optimization steps. We keep track of the training loss and the number of correct predictions for calculating accuracy.
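
The order of operations inside one training iteration can be sketched in isolation. To keep the example small, it assumes a single linear layer (`tiny_model`) and random data in place of the CNN and the MNIST batches; only the sequence of calls mirrors the loop above.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

tiny_model = nn.Linear(784, 10)  # hypothetical stand-in for the CNN
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(tiny_model.parameters(), lr=0.001)

images = torch.randn(32, 784)            # random inputs instead of real digits
labels = torch.randint(0, 10, (32,))     # random class labels from 0 to 9

weights_before = tiny_model.weight.detach().clone()

optimizer.zero_grad()              # clear gradients left over from a previous step
outputs = tiny_model(images)       # forward pass: raw class scores (logits)
loss = criterion(outputs, labels)  # compare the scores against the true labels
loss.backward()                    # backpropagation fills .grad on each parameter
optimizer.step()                   # Adam updates the parameters using .grad

weights_changed = not torch.equal(weights_before, tiny_model.weight)
print(f"loss: {loss.item():.4f}, weights changed: {weights_changed}")
```

Calling `optimizer.zero_grad()` at the start of every iteration matters because PyTorch accumulates gradients across `backward()` calls by default.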

After each epoch, we switch the model to evaluation mode (`model.eval()`) and loop over the validation dataset using the `val_loader`. We perform forward propagation, calculate the validation loss, and keep track of the number of correct predictions.
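
How the number of correct predictions is derived from the model outputs can be shown with a few hand-written scores. The values below are invented for illustration and use 3 classes instead of 10 to keep the table readable.

```python
import torch

# Hand-written logits for 4 samples and 3 classes (illustrative values only)
outputs = torch.tensor([
    [2.0, 0.1, 0.3],  # highest score at index 0
    [0.2, 1.5, 0.1],  # highest score at index 1
    [0.3, 0.2, 0.9],  # highest score at index 2
    [1.1, 0.4, 0.2],  # highest score at index 0
])
labels = torch.tensor([0, 1, 2, 1])  # the last sample is labelled 1, so it is a miss

_, predicted = torch.max(outputs, 1)          # class index with the largest score per row
correct = (predicted == labels).sum().item()  # number of matching predictions
accuracy = 100.0 * correct / labels.size(0)

print(predicted.tolist())  # [0, 1, 2, 0]
print(accuracy)            # 75.0
```

`torch.max(outputs, 1)` returns both the maximum values and their indices; only the indices are needed here, which is why the first return value is discarded.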

Finally, we calculate the average losses and accuracies for both the training and validation datasets. We print these values after each epoch to monitor the training progress.
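
The loop multiplies each batch loss by the batch size before summing because `nn.CrossEntropyLoss` returns the mean over a batch by default, and the last batch can be smaller than the others (10,000 validation images with a batch size of 32 leave a final batch of 16). The following sketch uses made-up loss values to show the difference between the two ways of averaging.

```python
# Hypothetical per-batch mean losses and batch sizes (the last batch is smaller)
batch_losses = [0.50, 0.40, 0.10]
batch_sizes = [32, 32, 16]

# Accumulate loss * batch_size, then divide by the total number of samples
running_loss = sum(loss * size for loss, size in zip(batch_losses, batch_sizes))
per_sample_average = running_loss / sum(batch_sizes)

# Averaging the batch means directly gives every batch equal weight
naive_average = sum(batch_losses) / len(batch_losses)

print(round(per_sample_average, 4))  # 0.38
print(round(naive_average, 4))       # 0.3333
```

The per-sample average is the one the training code reports, since it divides by `len(train_loader.dataset)` and `len(val_loader.dataset)`.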

## Your exercises

* Use the MNIST notebook, fill in the gaps and train the neural network to classify digits.
* Try to train the model again with a more advanced preprocessing pipeline like the one shown in this code snippet. Compare the results to the simpler pipeline used before. You can find a list of transformations with examples here: <https://pytorch.org/vision/stable/auto_examples/transforms/plot_transforms_getting_started.html#sphx-glr-auto-examples-transforms-plot-transforms-getting-started-py>

```python
# Define the data transformations
transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),  # Randomly flip the image horizontally
    transforms.RandomVerticalFlip(),    # Randomly flip the image vertically
    transforms.ColorJitter(brightness=0.5, contrast=0.5, saturation=0.5, hue=0.5),  # Randomly change brightness, contrast, saturation and hue
    transforms.RandomRotation(15),      # Randomly rotate the image up to 15 degrees
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])
```

