Working with Data Tables

In this section, we will take a look at how you can load tabular data, such as CSV files, and transform it into a 2D tensor.

To load data from a CSV file and build a 2D tensor in PyTorch, you can use the pandas library to read the CSV file and then convert the resulting DataFrame into a PyTorch tensor. Here's an example:

import pandas as pd
import torch

# Load the CSV file using pandas
data_frame = pd.read_csv('mydata.csv')

# Convert the DataFrame to a PyTorch tensor
tensor_data = torch.tensor(data_frame.values)

# Iterate over the rows in the tensor
for row in tensor_data:
    print(row)

In the above code, we first import the necessary libraries: pandas for loading the CSV file and torch for creating the tensor.

Next, we use pd.read_csv('mydata.csv') to read the CSV file and store the data in a DataFrame called data_frame.

Then, we convert the DataFrame to a PyTorch tensor using torch.tensor(data_frame.values). The data_frame.values attribute returns the underlying NumPy array of the DataFrame, which can be directly converted to a PyTorch tensor using torch.tensor().

Finally, we can iterate over the rows in the tensor using a simple for loop and perform any desired operations on each row.

Note that the data type of the resulting tensor is inferred from the NumPy array returned by data_frame.values, so this conversion only works if all columns in the CSV file are numeric. If you need a particular data type for the tensor, you can pass the dtype argument to torch.tensor().
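For example, the following short sketch (assuming mydata.csv contains only numeric columns) converts the values to 32-bit floats, the floating-point type most PyTorch models expect:

import pandas as pd
import torch

# Load the CSV file and convert all values to 32-bit floats
data_frame = pd.read_csv('mydata.csv')
tensor_data = torch.tensor(data_frame.values, dtype=torch.float32)

print(tensor_data.dtype)   # torch.float32
print(tensor_data.shape)   # (number_of_rows, number_of_columns)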

PyTorch Dataset classes

Usually we wrap the code that loads data in a custom Dataset class and then hand that dataset to a DataLoader. Here is how they work:

To build a dataset class in PyTorch that loads data from a CSV file (or any other file), you can create a custom dataset class by subclassing torch.utils.data.Dataset and overriding the __len__ and __getitem__ methods. Here's an example:

import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    """
    This class is responsible for loading data and transforming 
    it to a tensor
    """
    def __init__(self, csv_file):
        """
        This is the construcor method of our dataset.
        Here we load the data and initialize other stuff.
        You can basically write any code here you need for your data.
        """
        self.data_frame = pd.read_csv(csv_file)
        self.tensor_data = torch.tensor(self.data_frame.values)

    def __len__(self):
        """
        This function returns the number of items in the dataset
        """
        return len(self.data_frame)

    def __getitem__(self, index):
        """
        This function returns a single item identified by its index
        """
        row = self.tensor_data[index]
        return row

# Create an instance of your dataset
dataset = MyDataset('mydata.csv')

In this example, we define a custom dataset class MyDataset that extends torch.utils.data.Dataset. In the constructor __init__, we read the CSV file using pd.read_csv(csv_file) and convert the data to a PyTorch tensor.

The __len__ method returns the length of the dataset, which is the number of rows in the CSV file. The __getitem__ method retrieves a single row from the tensor based on the provided index.

We then create an instance of MyDataset by passing the CSV file path 'mydata.csv'. Next, we create a data loader using DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4), specifying the dataset instance, batch size, shuffling, and the number of workers for parallel data loading.

# Create a data loader for your dataset
data_loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4)

Finally, we can iterate over the data loader in a for loop. Each iteration provides a batch of rows from the CSV file as a single tensor of shape (batch_size, number_of_columns), except possibly the last batch, which may be smaller.

You can customize the dataset class and the data loader parameters according to your specific requirements, such as adding transformations, labels, or other data fields; a sketch of a dataset that returns labels is shown after the loop below.

# Iterate over the data loader
for batch in data_loader:
    print(batch)
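
Building on that last point, here is a minimal sketch of a dataset that splits each row into features and a label. It assumes, purely as an example, that the last column of mydata.csv holds the target value and all other columns are numeric features; the class name LabeledCsvDataset is made up for this illustration:

import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader

class LabeledCsvDataset(Dataset):
    """
    Example dataset that returns (features, label) pairs.
    Assumes the last column of the CSV file is the label.
    """
    def __init__(self, csv_file):
        data_frame = pd.read_csv(csv_file)
        values = torch.tensor(data_frame.values, dtype=torch.float32)
        self.features = values[:, :-1]  # every column except the last
        self.labels = values[:, -1]     # the last column

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, index):
        return self.features[index], self.labels[index]

# Each batch is now a pair of tensors: features and labels
labeled_loader = DataLoader(LabeledCsvDataset('mydata.csv'), batch_size=32, shuffle=True)
for features, labels in labeled_loader:
    print(features.shape, labels.shape)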