Create Custom Datasets

In the last section we saw how to load predefined datasets. Let's take the next step and create our own dataset.

We will create a small sentiment analysis dataset in this section. In its binary form, sentiment analysis is a text classification task in which each text is assigned to one of two classes: positive or negative.

In this dataset, the label 1 is used to indicate a positive sentiment, while the label 0 stands for a negative sentiment. For example, texts like “An unforgettable journey.” and “The quality is top-notch.” are labeled 1 because they express satisfaction or praise. On the other hand, sentences like “The item broke after one use.” and “I want a refund.” are labeled 0 because they convey frustration or disappointment.

Loading a dataset from a CSV file

First we need to create the files train.csv, validation.csv and test.csv. The contents of each file should look like this:

text,label
An unforgettable journey.,1
The service exceeded expectations.,1
I am very happy with the outcome.,1
Will definitely use this again.,1
The quality is top-notch.,1
I am so frustrated.,0
The item broke after one use.,0
This is unacceptable.,0
The movie was a disaster.,0
I want a refund.,0

This CSV file contains a small, custom dataset for sentiment analysis. Each row represents a short text along with a corresponding label that indicates the sentiment expressed in the text.

There are two columns in the dataset. The first column, text, contains the input sentence or phrase that we want to analyze. These are short pieces of natural language that express some kind of opinion, emotion, or evaluation. The second column, label, contains a numerical value that represents the sentiment associated with the corresponding text.

This structure is typical for binary classification tasks in sentiment analysis, where the goal is to train a model to automatically determine whether a new piece of text is positive or negative in tone.
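As a minimal sketch of how such a file could be produced, the snippet below writes a few of the sample rows with Python's built-in csv module. The helper name write_split, the temporary output directory, and the row containing a comma are assumptions made for illustration; they are not part of the original dataset.

```python
import csv
import os
import tempfile

# A few (text, label) pairs from the sample above, plus one hypothetical
# row containing a comma to show how the csv module quotes such fields.
rows = [
    ("An unforgettable journey.", 1),
    ("The quality is top-notch.", 1),
    ("The item broke after one use.", 0),
    ("I want a refund.", 0),
    ("Great value, fast delivery.", 1),  # hypothetical example row
]

def write_split(path, examples):
    """Write (text, label) pairs to a CSV file with a header row."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["text", "label"])
        writer.writerows(examples)

# A temporary directory keeps the sketch from touching real files; in
# practice you would write train.csv, validation.csv and test.csv.
out_dir = tempfile.mkdtemp()
train_path = os.path.join(out_dir, "train.csv")
write_split(train_path, rows)

# Read the file back to confirm that the rows survive a round trip.
with open(train_path, newline="", encoding="utf-8") as f:
    loaded = list(csv.reader(f))
print(loaded)
```

Using the csv module rather than joining strings by hand matters as soon as a text contains a comma: the writer wraps that field in quotes, so the file still has exactly two columns per row.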

Load the CSV files using the datasets library

from datasets import load_dataset

# Load the dataset
data_csv_files = {
    "train": "train.csv",
    "validation": "validation.csv",
    "test": "test.csv"
}
dataset_csv = load_dataset("csv", data_files=data_csv_files)

# Inspect the dataset
dataset_csv

Output:

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 20
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 10
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 10
    })
})

In this code, we load a custom sentiment analysis dataset from three CSV files: one for training, one for validation, and one for testing. We first define a dictionary, data_csv_files, that maps each split name to the path of its file.

We then use the load_dataset function from the datasets library to read in these CSV files. By specifying "csv" as the first argument and passing in our dictionary through the data_files parameter, the function automatically reads the contents of each file and groups them into a DatasetDict with three splits.

Each split contains a Dataset object with two columns: 'text', which holds the actual sentences, and 'label', which contains the sentiment classification (either 0 for negative or 1 for positive). According to the output, the training set has 20 examples, while the validation and test sets each contain 10 examples. This structure mirrors the standard setup for supervised machine learning tasks, where the model is trained on one subset of the data, validated on another during training, and finally evaluated on a separate test set.
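Before loading, it can help to confirm that a file really has the two columns and the label values described above. The following is a small pre-load check written with the standard library only; the function name check_split and the in-memory sample are assumptions for illustration and are not part of the datasets library.

```python
import csv
import io

def check_split(lines):
    """Validate the header and labels, then return the number of examples."""
    reader = csv.DictReader(lines)
    # The first row must name exactly the two expected columns.
    if reader.fieldnames != ["text", "label"]:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    count = 0
    for row in reader:
        # Values read from a CSV file are strings, so compare against "0" and "1".
        if row["label"] not in {"0", "1"}:
            raise ValueError(f"unexpected label: {row['label']!r}")
        count += 1
    return count

# An in-memory stand-in for a file; an open file object works the same way.
sample = io.StringIO(
    "text,label\n"
    "An unforgettable journey.,1\n"
    "I want a refund.,0\n"
)
num_rows = check_split(sample)
print(num_rows)
```

The returned count can be compared with the num_rows value reported for the corresponding split.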

Loading a dataset from a JSONL file

Besides CSV, there is another popular format for storing datasets: JSONL (JSON Lines). A JSONL file is a text file in which each line holds one complete JSON object; the lines are not separated by commas. Our sentiment analysis dataset would look like this in JSONL:

{"text": "An unforgettable journey.", "label": 1}
{"text": "The service exceeded expectations.", "label": 1}
{"text": "I am very happy with the outcome.", "label": 1}
{"text": "Will definitely use this again.", "label": 1}
{"text": "The quality is top-notch.", "label": 1}
{"text": "I am so frustrated.", "label": 0}
{"text": "The item broke after one use.", "label": 0}
{"text": "This is unacceptable.", "label": 0}
{"text": "The movie was a disaster.", "label": 0}
{"text": "I want a refund.", "label": 0}

In this format, each line in the file is a separate JSON object, representing one data point. This makes it easy to read and write data line by line, which is particularly useful when working with large datasets that shouldn’t be loaded entirely into memory at once.

In the given JSONL file, each line corresponds to a single example for a sentiment analysis task. Each JSON object has two keys: "text" and "label". The "text" field contains a sentence or short phrase expressing an opinion or feeling, and the "label" field indicates the sentiment as an integer—1 for positive sentiment and 0 for negative sentiment.

This format is especially useful for text datasets because it is both human-readable and easy to parse programmatically. It also avoids the need for a surrounding array or comma separators, which makes it well-suited for streaming large datasets or appending new entries without rewriting the entire file.
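The line-by-line properties described above can be sketched with Python's built-in json module. The file location, the helper name iter_jsonl, and the choice of sample records are assumptions made for this illustration.

```python
import json
import os
import tempfile

records = [
    {"text": "An unforgettable journey.", "label": 1},
    {"text": "I am so frustrated.", "label": 0},
]

# A temporary location is used so the sketch does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "train.jsonl")

# Write one JSON object per line: no enclosing array, no comma separators.
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Append a new example without rewriting the lines already in the file.
with open(path, "a", encoding="utf-8") as f:
    f.write(json.dumps({"text": "I want a refund.", "label": 0}) + "\n")

def iter_jsonl(file_path):
    """Yield one parsed example at a time instead of loading the whole file."""
    with open(file_path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                yield json.loads(line)

# list() is used only to inspect this tiny sample; for a large file you
# would loop over iter_jsonl(path) directly.
examples = list(iter_jsonl(path))
print(len(examples), examples[-1])
```

Because every line is a complete JSON object, appending is a plain file append, whereas a single JSON array would have to be parsed and rewritten in full.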

Loading the JSONL files using the datasets library

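from datasets import load_dataset

# Map each split name to its JSONL file
data_jsonl_files = {
    "train": "train.jsonl",
    "validation": "validation.jsonl",
    "test": "test.jsonl"
}
# The "json" loader also reads JSON Lines files
dataset_jsonl = load_dataset("json", data_files=data_jsonl_files)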

A third option is a plain tab-separated (TSV) file. In such a file, each line contains two pieces of information: the first is the text itself, a sentence or phrase expressing a sentiment, and the second is the label, either 1 for positive sentiment or 0 for negative sentiment. The tab character serves as the delimiter between the text and its label.

This format is lightweight and easy to read or write with standard text-processing tools. However, unlike the CSV file above, whose first row names the columns, or the JSONL file, where every object carries its keys, a headerless file of this kind does not say what each column represents, so we need to know that in advance. When loading such a file into a dataset processing library like Hugging Face datasets, we typically have to specify that the fields are separated by tabs and optionally provide the column names ourselves.
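The two pieces of information a loader needs for a headerless tab-separated file, the delimiter and the column names, can be illustrated with the standard csv module. This is a sketch on an in-memory sample, not the datasets API; consult the documentation of the datasets CSV loader for the exact parameter names it expects.

```python
import csv
import io

# Two tab-separated lines with no header row, as described above.
tsv_text = "An unforgettable journey.\t1\nI want a refund.\t0\n"

# The column names are supplied by hand because the file has no header,
# and the delimiter is set to a tab instead of the default comma.
reader = csv.DictReader(
    io.StringIO(tsv_text),
    fieldnames=["text", "label"],
    delimiter="\t",
)

# Labels arrive as strings, so convert them to integers explicitly.
parsed = [{"text": row["text"], "label": int(row["label"])} for row in reader]
print(parsed)
```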

