Notebook Explanation
In this section, we walk through the code of the final project's notebook.
Define Basic Variables
# Imports needed by this snippet
import torch
from datasets import load_dataset
from transformers import AutoModel, AutoTokenizer

# Choose the pretrained DistilBERT model checkpoint
ckpt = "distilbert-base-uncased"
# Set the device: use GPU ("cuda") if available, otherwise fallback to CPU
device = torch.device("cuda") if torch.cuda.is_available() else "cpu"
# Load the "emotion" dataset from the Hugging Face Datasets library
dataset = load_dataset("emotion")
# Extract the list of label names from the dataset
labels = dataset["train"].features["label"].names
# Load the tokenizer for the DistilBERT model
tokenizer = AutoTokenizer.from_pretrained(ckpt)
# Load the DistilBERT model (this gives us the pretrained transformer without a classification head)
model = AutoModel.from_pretrained(ckpt)

In this code snippet, we are preparing everything we need to train a DistilBERT model for text classification on an emotion dataset. Let’s walk through it step by step.
We start by selecting the pretrained model checkpoint distilbert-base-uncased. This is a compact and efficient version of the original BERT model that ignores the case of letters, meaning that “Apple” and “apple” are treated the same. Using this checkpoint allows us to build on a model that has already been trained on a large corpus of English text and has learned general language patterns.
Next, we determine whether we can use a GPU to accelerate training. The line device = torch.device("cuda") if torch.cuda.is_available() else "cpu" checks whether a CUDA-enabled GPU is available on the system. If it is, the model and data will be moved to the GPU for faster computation. Otherwise, we will fall back to the CPU. This makes our code portable and able to run on different machines without modification.
We then load the dataset using the Hugging Face Datasets library. The load_dataset("emotion") command downloads and prepares a dataset containing text samples labeled with emotions such as joy, anger, sadness, fear, and others. This dataset comes pre-split into training, validation, and test sets. To make sense of the labels in the dataset, we extract the list of label names from the training split with labels = dataset["train"].features["label"].names. This gives us a convenient mapping between the numeric label IDs used internally and the human-readable names of the emotions.
After preparing the dataset, we load a tokenizer for the DistilBERT model. Text data cannot be passed directly to the neural network; it needs to be converted into numerical representations first. The tokenizer handles this by splitting the text into smaller units called tokens and mapping them to numerical IDs from the pretrained vocabulary. Using AutoTokenizer.from_pretrained(ckpt) automatically loads the correct tokenizer for our chosen DistilBERT checkpoint.
Finally, we load the DistilBERT model itself using AutoModel.from_pretrained(ckpt). This gives us the pretrained transformer model, which is capable of producing rich, contextualized representations of text. At this stage, the model does not yet include a classification head, so we will need to add a layer on top of it later to predict which emotion each text sample expresses.
Dataset Tokenization
After loading the dataset and preparing the tokenizer, the next step is to preprocess the text data so it can be fed into the model. Neural networks like DistilBERT cannot work directly with raw text; they need numerical representations of words and sentences. This is where tokenization comes in.
We start by defining a function called tokenize. This function takes a batch of sentences from the dataset and applies the tokenizer to them. Inside the function, we call tokenizer(batch["text"], padding=True, truncation=True). The batch["text"] part extracts the actual text data from the batch. The tokenizer then converts these text samples into token IDs, which are the numerical representations the model can understand. We set padding=True so that all sequences in a batch are padded to the same length, which is necessary for efficient computation. The truncation=True parameter ensures that any text longer than the model’s maximum input length is truncated, preventing errors when working with long sentences.
Once the tokenize function is defined, we apply it to the entire dataset using dataset.map(tokenize, batched=True, batch_size=None). The map method allows us to apply a function to every example in the dataset. Setting batched=True means that the dataset will be processed in batches rather than one sample at a time, which is much faster. The batch_size=None argument tells map to pass each split to the function as one single batch; combined with padding=True, this ensures that every sequence in a split is padded to the same length.
The result of this operation is a new dataset called dataset_encoded. In this dataset, each text sample has been transformed into a sequence of token IDs and is ready to be used for model training. When we display dataset_encoded, we can see the tokenized versions of the text alongside the original data and labels.
This step is crucial because it bridges the gap between raw text and the numerical input format required by the DistilBERT model. Without tokenization, the model would not be able to interpret the dataset.
Extract Hidden States
Now that the dataset has been tokenized, the next step is to pass these inputs through the pretrained DistilBERT model to extract useful features for each sentence. These features, called hidden states, are high-dimensional vector representations that capture the semantic meaning of the text.
We begin by defining a function called extract_hidden_states, which takes the model as an argument and returns another function, _extract_hidden_states, that will process batches of data. This inner function handles the actual extraction of hidden states for each batch.
Inside _extract_hidden_states, the first thing we do is prepare the inputs for the model. Since the model expects its inputs to be on the same device as its weights, we move them to the GPU if it is available, or to the CPU otherwise. The line inputs = {k: v.to(device) for k, v in batch.items() if k in tokenizer.model_input_names} creates a dictionary containing only the input IDs and attention masks (the inputs the model actually needs), and places them on the correct device.
Next, we perform a forward pass through the model to compute its outputs. To avoid tracking gradients and save memory (since we are not training at this stage), we wrap this step in torch.no_grad(). The output of the model includes a tensor called last_hidden_state, which contains the hidden state vectors for each token in the input sentences.
We are interested in a single vector representation for each sentence. For this, we extract the hidden state corresponding to the special [CLS] token, which appears at the beginning of every input sequence in BERT-like models. This token is often used as a summary representation of the entire sentence. The line last_hidden_state[:, 0] selects the hidden state of the [CLS] token for every example in the batch. We then move this data back to the CPU and convert it to NumPy arrays so it can be stored in the dataset.
Before applying this function to the dataset, we call dataset_encoded.set_format("torch", columns=["input_ids", "attention_mask", "label"]). This tells the dataset to return its data in PyTorch tensor format and specifies which columns to include.
Finally, we apply our feature extraction function to the dataset using dataset_encoded.map(extract_hidden_states(model=model), batched=True). The map function processes the dataset in batches, passing each batch through the model to compute and store the hidden state representations. The resulting dataset, which we call dataset_hidden, now contains an additional column called hidden_state. This column holds the vector representation of each sentence and will serve as the input to a classification layer or another machine learning model in the next steps.
Split Dataset into Train, Validation and Test Subsets
After extracting the hidden state vectors for every sentence in the dataset, we are ready to prepare our data for training a classifier. The hidden states we extracted earlier from the DistilBERT model are high-dimensional numerical representations of each sentence. These vectors encode rich information about the meaning and context of the text, and we can use them as features for a simpler machine learning model, such as a logistic regression classifier, to predict the emotion labels.
To do this, we first separate the hidden state vectors and their corresponding labels for the training, validation, and test splits of the dataset. The line X_train = np.array(dataset_hidden["train"]["hidden_state"]) converts the list of hidden state vectors from the training set into a NumPy array, which is a convenient format for working with machine learning models in Python. Similarly, y_train = np.array(dataset_hidden["train"]["label"]) extracts the labels for the training set and stores them as a NumPy array.
We repeat the same process for the validation and test splits. The variables X_valid and y_valid hold the features and labels for the validation set, while X_test and y_test hold those for the test set. These splits will allow us to train the classifier on one portion of the data, tune and evaluate it on another, and finally assess its performance on completely unseen data.
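The six conversions follow one pattern, which can be captured in a small helper. Note that `split_to_arrays` is our own name for illustration; in the notebook the lines are written out individually:

```python
import numpy as np

def split_to_arrays(split):
    # Pull the hidden-state vectors and labels of one split into NumPy arrays
    return np.array(split["hidden_state"]), np.array(split["label"])
```

With it, X_train, y_train = split_to_arrays(dataset_hidden["train"]), and likewise for the validation and test splits.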
At this point, we have fully prepared our dataset for the classification task. The raw text has been tokenized, passed through the pretrained DistilBERT model to generate meaningful numerical representations, and organized into NumPy arrays that are ready to be fed into a traditional machine learning algorithm. This approach is known as feature extraction, where we leverage the pretrained model as a fixed feature generator and train only the final classifier. It is a computationally efficient alternative to full fine-tuning and works especially well on smaller datasets.
Visualize Embeddings
After preparing the dataset and extracting hidden states from the DistilBERT model, we now have high-dimensional feature vectors representing each sentence. While these vectors are very powerful for machine learning models, they are not easy for us humans to interpret directly because they typically live in a space with hundreds of dimensions. To better understand how the model’s features separate different emotions, we can use a dimensionality reduction technique to project these vectors into two dimensions and visualize them.
We start by defining a function called create_2d_embeddings, which takes the feature matrix X and the corresponding labels y as inputs and returns a Pandas DataFrame containing the two-dimensional projections. The first step in this function is to scale the features using MinMaxScaler, which rescales each feature to the range between 0 and 1. Scaling is an important preprocessing step for many dimensionality reduction techniques because it ensures that all features contribute equally to the distance calculations.
Next, we use UMAP (Uniform Manifold Approximation and Projection), a powerful algorithm for reducing high-dimensional data to a lower-dimensional space while preserving the global and local structure of the data as much as possible. We initialize UMAP with two output dimensions (n_components=2) and a cosine distance metric, which works well for feature vectors from models like DistilBERT. After fitting UMAP to the scaled feature matrix with fit(X_scaled), we obtain a set of 2D coordinates for each data point.
These coordinates are stored in a Pandas DataFrame with columns x1 and x2. We also add the original labels to this DataFrame so we can color-code or separate the data points by emotion later on. When we call create_2d_embeddings(X=X_train, y=y_train), it creates a DataFrame df_train_2d that holds the two-dimensional projections of the training set.
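A sketch of the function described above. The optional `reducer` argument is our addition (it makes the projection step swappable); the notebook constructs the UMAP reducer directly:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def create_2d_embeddings(X, y, reducer=None):
    # Rescale every feature to [0, 1] before distance-based reduction
    X_scaled = MinMaxScaler().fit_transform(X)
    if reducer is None:
        # Cosine distance works well for transformer feature vectors
        from umap import UMAP
        reducer = UMAP(n_components=2, metric="cosine")
    mapper = reducer.fit(X_scaled)
    # Collect the 2D coordinates and attach the original labels
    df = pd.DataFrame(mapper.embedding_, columns=["x1", "x2"])
    df["label"] = y
    return df
```

Calling create_2d_embeddings(X=X_train, y=y_train) yields df_train_2d.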
Now that we have the 2D embeddings, we want to visualize them to see how well the different emotions are separated in this lower-dimensional space. The plot_2d_embeddings function does exactly that. It creates a grid of subplots—one for each emotion label in the dataset—and uses hexbin plots to display the density of data points in each class. This kind of plot divides the 2D space into hexagonal bins and shades them according to how many data points fall into each bin, giving us a sense of where clusters of points are concentrated.
The function uses different color maps for each label to make the plots visually distinct and sets appropriate titles so we can identify which emotion each subplot corresponds to. Finally, we call plot_2d_embeddings(df_2d=df_train_2d) to display these visualizations.
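A minimal sketch of the plotting function; the exact figure layout, color maps, and the `fig` return value are our assumptions, and the notebook's version may differ in styling:

```python
import matplotlib.pyplot as plt

def plot_2d_embeddings(df_2d, label_names=None):
    # One subplot per label; hexbin shades hexagonal bins by point density
    label_ids = sorted(df_2d["label"].unique())
    cmaps = ["Greys", "Blues", "Oranges", "Reds", "Purples", "Greens"]
    fig, axes = plt.subplots(2, 3, figsize=(9, 6))
    for ax, label_id, cmap in zip(axes.flat, label_ids, cmaps):
        subset = df_2d[df_2d["label"] == label_id]
        ax.hexbin(subset["x1"], subset["x2"], cmap=cmap, gridsize=20)
        # Title each panel with the emotion name if provided
        name = label_names[label_id] if label_names else str(label_id)
        ax.set_title(name)
    plt.tight_layout()
    plt.show()
    return fig
```

plot_2d_embeddings(df_2d=df_train_2d) then renders the grid of density plots.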
The resulting plots give us an intuitive sense of how the pretrained model’s hidden states group sentences expressing similar emotions together and how much overlap there is between different emotion categories. This is not only a useful sanity check to see whether the representations are meaningful but also a powerful way to make abstract high-dimensional data more tangible for us.
Training a Dummy Classifier
Before we move on to training a real classifier, it’s helpful to establish a baseline performance. This gives us a reference point to compare against and helps us understand whether our model is truly learning or just guessing. To do this, we use a very simple model called DummyClassifier from scikit-learn.
The DummyClassifier does not attempt to learn from the data at all. Instead, it makes predictions using a fixed, naive strategy. Here we initialize it with the strategy "uniform", which means that the classifier will randomly guess a label for each input, choosing uniformly at random from all possible classes. In other words, it has no knowledge of the data and assigns each emotion an equal chance.
After creating the dummy model, we fit it to our training data with model_uf.fit(X_train, y_train). This step is a formality because the DummyClassifier doesn’t actually learn any patterns during fitting; it just records the set of possible labels so it can randomly sample from them later.
Finally, we evaluate the dummy model’s performance on the training data using model_uf.score(X_train, y_train). This computes the accuracy—the proportion of sentences for which the randomly guessed label happens to match the true label. Since the model is making predictions completely at random, we expect this accuracy to be very low, around the level of random chance. For a dataset with six emotion classes, this would be roughly 1 divided by 6, or about 16–17%.
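The baseline can be reproduced in a self-contained way; here `X_train` and `y_train` are synthetic stand-ins for the hidden-state features (in the notebook they come from DistilBERT):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic stand-ins: 600 samples, 768 features (DistilBERT's hidden
# size), six emotion classes
rng = np.random.default_rng(42)
X_train = rng.normal(size=(600, 768))
y_train = rng.integers(0, 6, size=600)

# Uniform strategy: guess each of the six classes with equal probability
model_uf = DummyClassifier(strategy="uniform", random_state=42)
model_uf.fit(X_train, y_train)
score = model_uf.score(X_train, y_train)
print(round(score, 2))  # hovers around 1/6 ≈ 0.17
```

As expected, random guessing lands near chance level regardless of how informative the features are.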
This baseline is important because it tells us: if our real classifier doesn’t do any better than the DummyClassifier, then it’s not actually learning anything meaningful from the data. On the other hand, if we achieve significantly higher accuracy, we know our model is finding useful patterns in the DistilBERT representations.
Train a Logistic Regression Model
Now that we’ve established a simple baseline with the DummyClassifier, it’s time to train a real classifier on top of the DistilBERT hidden state features. For this, we use a logistic regression model, which is a widely used and effective algorithm for classification tasks.
We begin by creating the model with LogisticRegression(max_iter=3000). Logistic regression works by finding a linear decision boundary in the feature space that best separates the different emotion classes. The max_iter=3000 parameter increases the maximum number of iterations the optimization algorithm is allowed to run, which is important because our feature space is high-dimensional (DistilBERT’s hidden states have 768 dimensions) and convergence can take longer than usual.
Next, we train the model on the training data by calling model_lr.fit(X_train, y_train). This step involves the logistic regression algorithm finding the weights for each feature that minimize the classification error on the training set. Because the hidden state vectors produced by DistilBERT already contain rich, meaningful representations of the text, we don’t need a very complex classifier—logistic regression is often enough to achieve strong results in this setup.
Finally, we evaluate the trained model’s performance on the validation data using model_lr.score(X_valid, y_valid). This computes the accuracy, which tells us the proportion of validation samples for which the predicted emotion matches the true label. Comparing this accuracy to the random-guessing baseline we computed earlier gives us a clear sense of whether our model is learning meaningful patterns from the data. If the logistic regression classifier achieves a significantly higher accuracy than the baseline, it’s a good indication that the DistilBERT features are informative and our approach is working.
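The same synthetic stand-in idea illustrates why logistic regression is enough once the features are informative. Here the class label shifts the feature means, so a linear boundary separates the classes (hypothetical data, not the notebook's):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic features whose means depend on the class, mimicking
# informative hidden states (768 = DistilBERT's hidden size)
rng = np.random.default_rng(0)
y_train = rng.integers(0, 6, size=600)
X_train = rng.normal(size=(600, 768)) + y_train[:, None]

model_lr = LogisticRegression(max_iter=3000)
model_lr.fit(X_train, y_train)
print(model_lr.score(X_train, y_train))  # near 1.0 on this easy data
```

On real DistilBERT features the score is lower but still far above the dummy baseline.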
This final step completes the pipeline: we started with raw text, passed it through a powerful pretrained language model to obtain high-dimensional representations, and trained a simple classifier on top of those representations to predict emotions. This approach is efficient because it leverages the language understanding already captured by DistilBERT without requiring us to fine-tune the entire transformer model.
Plot Confusion Matrix
After training our logistic regression classifier and checking its overall accuracy, it’s useful to take a closer look at how the model performs on each individual class. Accuracy alone doesn’t tell us which emotions the model finds easy to predict and which ones it struggles with. For this purpose, we use a confusion matrix, which provides a detailed breakdown of the model’s predictions compared to the true labels.
We start by defining a function plot_confusion_matrix that takes three arguments: y_preds (the predicted labels), y_true (the true labels), and labels (the list of emotion names). Inside the function, we calculate the confusion matrix using scikit-learn’s confusion_matrix function. The parameter normalize="true" ensures that the matrix is normalized row-wise, meaning each row shows the proportion of samples from a given true class that were predicted as each possible class. This makes it easier to interpret the results, especially when the dataset is imbalanced.
Next, we create a Matplotlib figure and axis and use ConfusionMatrixDisplay to plot the matrix. The cmap="Blues" color map highlights areas where the model performs better (darker blue indicates higher values), and values_format=".2f" ensures the numbers are displayed as proportions with two decimal places. We also add a title to the plot for clarity and display it using plt.show().
To generate the confusion matrix, we first obtain predictions from our trained logistic regression model on the validation data with model_lr.predict(X_valid). These predictions, together with the true labels and the list of emotion names, are passed to the plot_confusion_matrix function.
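A sketch of the function as described; the figure size and title wording are our choices:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

def plot_confusion_matrix(y_preds, y_true, labels):
    # normalize="true" makes each row sum to 1 across predicted classes
    cm = confusion_matrix(y_true, y_preds, normalize="true")
    fig, ax = plt.subplots(figsize=(6, 6))
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
    # Darker blue = higher proportion; two decimal places per cell
    disp.plot(cmap="Blues", values_format=".2f", ax=ax)
    ax.set_title("Normalized confusion matrix")
    plt.show()
```

It is then called with y_preds = model_lr.predict(X_valid) and plot_confusion_matrix(y_preds, y_valid, labels).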
The resulting plot provides a clear visual summary of the model’s strengths and weaknesses. Each diagonal cell represents the proportion of correctly classified samples for a particular emotion, while the off-diagonal cells show the proportion of misclassifications. For example, if many “fear” sentences are classified as “anger,” we will see a noticeable value in the corresponding cell.
This visualization is extremely helpful for diagnosing the model. It shows whether the classifier struggles with certain emotions, whether some emotions are often confused with others, and whether there are any systematic biases in its predictions. Understanding these patterns can guide us in improving the model further—for instance, by collecting more training data for specific classes or by fine-tuning the DistilBERT model instead of using it purely as a feature extractor.
Train the DistilBERT Transformer Model
So far, we’ve used DistilBERT as a fixed feature extractor, passing sentences through the pretrained model to obtain hidden state vectors and training a logistic regression classifier on top of those representations. While this approach is simple and effective, it doesn’t allow us to adapt the language model itself to our specific dataset. To push performance further, we now move to fine-tuning the entire DistilBERT model on the emotion classification task. This means we will retrain (or partially retrain) the model’s weights so it can learn task-specific features directly from our data.
We begin by defining a helper function compute_metrics that calculates two evaluation metrics: accuracy and the weighted F1 score. This function will be called during training to monitor the model’s performance on the validation set. Inside the function, we first check whether the predictions contain labels, and if they do, we extract the true labels and predicted class IDs (by taking the argmax over the logits output by the model). We then compute the weighted F1 score, which accounts for class imbalance, and the overall accuracy. These two metrics give us a more complete picture of how well the model is performing.
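A sketch of compute_metrics consistent with the description; the Trainer passes it an object with label_ids and predictions fields:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(pred):
    # True labels and the raw logits produced by the model
    labels = pred.label_ids
    # Predicted class ID = index of the largest logit per sample
    preds = np.argmax(pred.predictions, axis=1)
    # Weighted F1 accounts for class imbalance
    f1 = f1_score(labels, preds, average="weighted")
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "f1": f1}
```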
Next, we set some hyperparameters for training. We define the batch size as 16 and calculate the logging_steps parameter, which determines how often training metrics are logged. We also create a directory path where the fine-tuned model will be saved (model_name). The flag train_feature_extractor controls whether the pretrained DistilBERT layers will be updated during training or kept frozen.
We then instantiate the DistilBERT model for sequence classification using AutoModelForSequenceClassification.from_pretrained(ckpt, num_labels=6). Unlike the earlier AutoModel we used as a feature extractor, this version includes a classification head on top of the transformer layers, which directly outputs logits for each of the six emotion classes. We also move the model to the correct device (GPU or CPU).
Depending on the train_feature_extractor flag, we decide whether to fine-tune all of DistilBERT’s parameters or freeze the transformer layers and only train the classification head. If train_feature_extractor is True, all parameters are set to be trainable. Otherwise, we freeze the transformer layers by setting param.requires_grad = False for every parameter in model_finetuned.distilbert.parameters(). Freezing the feature extractor can be useful if we want a faster, lower-resource training run, but fine-tuning the whole model often leads to better performance.
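The freezing logic can be captured in a small helper. Note that `set_trainable` is our own name for illustration; in the notebook the loop over model_finetuned.distilbert.parameters() is written inline:

```python
def set_trainable(model, train_feature_extractor):
    # Freeze (or unfreeze) the DistilBERT body; the classification
    # head on top always remains trainable
    for param in model.distilbert.parameters():
        param.requires_grad = train_feature_extractor
    return model
```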
Next, we set up the training configuration with TrainingArguments. Here, we specify where to save the fine-tuned model (output_dir=model_name), how many epochs to train for (num_train_epochs=2), the learning rate (2e-5), batch sizes for training and evaluation, and other options such as weight decay for regularization. We also keep progress bars visible (disable_tqdm=False) and set push_to_hub=False since we aren’t uploading the model to Hugging Face’s Model Hub.
We then create a Trainer object, which orchestrates the entire training process. The Trainer is initialized with our model, training arguments, metric computation function, training and validation datasets, and tokenizer. This abstraction simplifies fine-tuning by handling data batching, optimization, evaluation, and checkpointing automatically.
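The training configuration described above looks roughly like this; the exact model_name pattern and the per-epoch evaluation setting are assumptions, since the notebook snippet is not shown verbatim:

```python
from transformers import Trainer, TrainingArguments

batch_size = 16
# Log once per pass over the training data
logging_steps = len(dataset_encoded["train"]) // batch_size
model_name = f"{ckpt}-finetuned-emotion"  # assumed naming pattern

training_args = TrainingArguments(
    output_dir=model_name,
    num_train_epochs=2,
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,               # regularization
    evaluation_strategy="epoch",     # assumed: score on validation each epoch
    disable_tqdm=False,              # keep progress bars visible
    logging_steps=logging_steps,
    push_to_hub=False,
)

trainer = Trainer(
    model=model_finetuned,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=dataset_encoded["train"],
    eval_dataset=dataset_encoded["validation"],
    tokenizer=tokenizer,
)
```

Fine-tuning then starts with trainer.train().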
Finally, we call trainer.train() to start fine-tuning DistilBERT on our emotion dataset. During training, the model’s parameters are updated to minimize the classification loss, adapting DistilBERT’s language understanding to the specific task of predicting emotions from text. As training progresses, we can monitor accuracy and F1 score on the validation set to ensure the model is improving and to detect any potential overfitting.
By the end of this process, we will have a fine-tuned DistilBERT model that is specialized for emotion classification and ready to be evaluated or deployed for inference. This approach leverages the full power of transfer learning: starting from a general-purpose language model and adapting it to perform a highly specific task with relatively little labeled data.
Plot Confusion Matrix for DistilBERT Text Classification
After fine-tuning our DistilBERT model on the emotion dataset, we want to evaluate its performance on the validation set and visualize how well it distinguishes between different emotion classes.
We begin by using the trained Trainer to make predictions on the validation dataset. The trainer.predict() method takes care of running the model on the entire validation set, batching the data, and collecting the raw outputs. The result is an object called preds_output, which contains three key pieces of information: the raw logits produced by the model for each sample, the true labels, and the metrics computed during evaluation.
The raw logits are high-dimensional scores for each class, but what we really want are the predicted class IDs. To obtain these, we use np.argmax(preds_output.predictions, axis=1). This selects the index of the highest score along the last dimension of the logits array, which corresponds to the predicted emotion for each sample. We store these predicted labels in the variable y_preds.
With the predictions in hand, we call the plot_confusion_matrix function we defined earlier, passing in the predicted labels, the true labels (y_valid), and the list of emotion names. This creates a normalized confusion matrix, which gives us a clear visual summary of how well the fine-tuned DistilBERT model performs on each class.
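The argmax step can be seen on a toy logits array (a hypothetical stand-in for preds_output.predictions):

```python
import numpy as np

# Hypothetical logits for three validation samples over six emotion classes
logits = np.array([
    [0.1, 2.3, -1.0, 0.0, 0.5, -0.2],  # largest score at index 1
    [1.9, 0.2, 0.1, -0.5, 0.0, 0.3],   # largest score at index 0
    [-0.3, 0.0, 0.2, 0.1, 2.8, 0.4],   # largest score at index 4
])
# Same call as in the notebook, applied to the toy array
y_preds = np.argmax(logits, axis=1)
print(y_preds)  # [1 0 4]
```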
In the resulting plot, each diagonal cell shows the proportion of validation samples correctly classified for that emotion, while the off-diagonal cells show how often the model confuses one emotion with another. A perfect model would have all the values concentrated along the diagonal. If we see strong diagonal dominance with little confusion between classes, it indicates that the fine-tuned model has learned to accurately recognize and distinguish between the different emotional states.
This step completes the full workflow: we began with raw text data, fine-tuned a state-of-the-art language model for emotion classification, and now visualized the results to better understand its strengths and weaknesses.
Plot Hidden States of the Trained DistilBERT Model
After fine-tuning DistilBERT on the emotion dataset, we’ve trained the model to adapt its internal representations specifically for emotion classification. This raises an interesting question: how have the sentence embeddings—the hidden states of the model—changed as a result of fine-tuning? To explore this, we use the same approach as earlier and visualize the updated hidden states in two dimensions.
We begin by passing the tokenized dataset through the fine-tuned DistilBERT model to extract the new hidden states. In the line extract_hidden_states(model=model_finetuned.distilbert.to(device)), we use only the base transformer (without the classification head) because we want the embeddings, not the final class predictions. These embeddings are high-dimensional vectors that should now be optimized for distinguishing between different emotions.
Next, we create NumPy arrays X_train_finetuned and y_train_finetuned from the hidden states and labels of the training set. These will serve as inputs for visualization.
To make these high-dimensional embeddings interpretable, we reduce them to two dimensions using the create_2d_embeddings function. This function scales the features and applies UMAP (Uniform Manifold Approximation and Projection), a dimensionality reduction technique that preserves both local and global structure in the data. The resulting DataFrame df_train_2d_finetuned contains the 2D coordinates of each sentence embedding along with its emotion label.
Finally, we call plot_2d_embeddings(df_2d=df_train_2d_finetuned) to generate a grid of hexbin plots. Each subplot corresponds to one of the six emotion classes and shows where the embeddings for sentences in that class are concentrated in the 2D space.
Comparing these plots to those generated before fine-tuning reveals an important insight: after fine-tuning, we often see much clearer separation between the clusters of different emotions. Sentences with the same label tend to group more tightly together, and there is less overlap between different classes. This demonstrates how fine-tuning helps the model reshape its internal representations so that sentences expressing similar emotions are closer together in embedding space, making classification easier for the final layer.
This visualization not only provides a fascinating look inside the model’s “thought process” but also serves as a powerful teaching tool for understanding how transfer learning and fine-tuning work. It shows, in a very visual way, how a general-purpose language model can be specialized for a specific task like emotion detection.