Datasets

Every NLP project starts with a dataset. This article shows how to use the datasets library from the Hugging Face ecosystem.

Dataset Loading

Load a Dataset from the Hugging Face Hub

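# load_dataset downloads (and caches) a dataset from the Hugging Face Hub
from datasets import load_dataset
import matplotlib.pyplot as plt

# Load all splits of the emotion dataset into a DatasetDict
emotions = load_dataset("emotion")
# Print the DatasetDict to see its splits, features, and row counts
print(emotions)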

Output:

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})

This code imports the load_dataset function from the Hugging Face datasets library, along with the pyplot module from matplotlib, which is commonly used for data visualization. It then loads the dataset named "emotion" with load_dataset("emotion") and stores the result in a variable called emotions. You can find the emotion dataset on the Hugging Face Hub here: https://huggingface.co/datasets/dair-ai/emotion

The emotion dataset is automatically downloaded and returned as a DatasetDict, a dictionary-like object in which each key corresponds to a data split: "train", "validation", and "test". Each split is a Dataset object with two columns: 'text', which contains the input text (typically a short sentence or phrase), and 'label', which is an integer representing the emotion category associated with that text.

The output shows that the training set contains 16,000 examples, while both the validation and test sets contain 2,000 examples each. This structure is typical for datasets used in supervised learning tasks like text classification.

Dataset Inspection

Inspect the train subset

Output:

In this line of code, we access the training split of the emotions dataset by indexing the emotions DatasetDict with the key "train". This returns a Dataset object that contains only the training data. As shown in the output, the training set has 16,000 examples and two features: 'text' and 'label'. Each example in this dataset is a short text paired with a numerical label indicating the corresponding emotion. This subset is what we typically use to train a machine learning model.

Inspect the classes of the dataset

Output:

This line retrieves the metadata associated with the features of the training dataset by accessing the .features attribute. The output is a dictionary that describes the data types and possible values for each column in the dataset.

The 'text' feature is a simple string, meaning each example in this column is a piece of text, such as a sentence or phrase. The 'label' feature is a ClassLabel, which is a special type provided by the datasets library to represent categorical labels. It shows that there are six possible emotion categories: sadness, joy, love, anger, fear, and surprise. Each label in the dataset is stored as an integer internally, but these integers are mapped to the corresponding string names through the ClassLabel object. This makes it easy to work with categorical data while preserving the label meanings.

Get the first ten texts of the dataset

Output:

Here, we retrieve the text content of the first ten examples from the training set. By accessing emotions["train"]["text"], we select the 'text' column, which returns a list of all text entries in the dataset. The slicing operation [:10] then gives us just the first ten items in that list.

The output shows ten short pieces of text, each representing a sentence or phrase that expresses some kind of emotion. These examples are written in informal, conversational language and vary in tone—some express sadness or confusion, while others suggest affection or hopefulness. These texts are the input data that the model will learn to associate with emotion labels during training.

Get the labels of the first ten texts

Output:

This line retrieves the emotion labels corresponding to the first ten examples in the training set. By accessing emotions["train"]["label"], we select the 'label' column, which contains the numerical class IDs for each example. Using the slice [:10], we extract just the first ten labels.

The output is a list of integers: [0, 0, 3, 2, 3, 0, 5, 4, 1, 2]. Each number represents one of the six possible emotion categories defined earlier. For example, 0 corresponds to sadness, 1 to joy, 2 to love, and so on. These numeric values are what the model will predict during classification, but they can easily be converted back to their human-readable names using the ClassLabel mapping from earlier.

Iterating with for-loops over the dataset

Output:

In this code block, we iterate through the first few examples of the training dataset to examine individual entries. Using a for loop with enumerate, we go through each example in emotions["train"] and print both the label and the corresponding text. The enumerate function provides a running index, starting from zero, which we increment by one to display a more natural, human-readable count.

For each item, we print the label—represented as a number—and the associated text. The loop includes a conditional check if i >= 10: break, which stops the iteration after printing the first 10 examples. This is useful for inspecting a small sample of the dataset to better understand what kind of data we are working with.

The output shows how each short text is paired with a numerical label, which, as we saw earlier, can be mapped to emotion names like sadness, joy, anger, and so on. This gives us a concrete view of the data points the model will learn from during training.

Label Conversion

Convert labels from integers to strings

Output:

In this line, we use the int2str method provided by the ClassLabel object for the 'label' feature. This method converts a list of integer labels into their corresponding string names, based on the predefined label mapping.

The input list [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5] consists of repeated integers from 0 to 5, and the output shows the associated emotion names: sadness, joy, love, anger, fear, and surprise. This function is particularly useful when you want to display predictions or dataset entries using readable labels instead of numeric codes, making your results easier to interpret and communicate.

Convert labels from strings to integers

Output:

Here, we do the reverse of what we saw before: instead of converting integers to label names, we convert label names back to their corresponding integers using the str2int method. This method takes a list of emotion strings—such as "surprise", "sadness", and "joy"—and returns the integer IDs that the dataset uses internally to represent them.

The output [5, 0, 4, 1, 3, 2] shows the numeric codes for each emotion in the order they appeared in the input list. This function is helpful when preparing data manually or when converting predicted string labels back into a format suitable for evaluation or further processing.
