Notes on dataloaders

pytorch
machine learning
Author

Temi

Published

September 7, 2023

Note

This post is still under construction; I am adding stuff as I find the time.

show code
import torch
import numpy as np, os, sys, pandas as pd
import matplotlib.pyplot as plt
import random

print(f'Kernel used is: {os.path.basename(sys.executable.replace("/bin/python",""))}')
Kernel used is: dl-tools

Introduction

When training deep learning models (or any machine learning model, for that matter), we try to make the most of the available data. One way we do this is to supply a batch of the data to the model at each training iteration. So, if you have 100 observations to train on, you can supply, say, 20 at a time.

In addition, loading all 100 (or more) observations at once may consume a lot of memory, especially if you have limited resources. Instead, we supply mini-batches of the data, each small enough to be held in memory alongside the model during a training iteration.
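To make that concrete, here is a rough sketch (mine, not part of the original workflow) of what mini-batching looks like if you do it by hand with numpy; the Dataset and DataLoader classes described below automate exactly this pattern.

show code
import numpy as np

data = np.random.rand(100, 48)  # 100 observations, 48 features each
batch_size = 20
for start in range(0, data.shape[0], batch_size):
    batch = data[start:start + batch_size]  # one mini-batch of 20 observations
    # ... the forward pass, loss, and backward pass would happen here ...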

pytorch gives us a convenient way to load data in this manner by letting us create our own dataset objects, which are then used by pytorch's dataloader.

Class Dataset

First, you import the Dataset and DataLoader classes from pytorch.

show code
from torch.utils.data import Dataset
from torch.utils.data import DataLoader

I will create a fictitious dataset: observations X and ground truth Y. They will be numpy arrays; this way, I can easily manipulate them.

show code
X = np.random.rand(100, 48) 
Y = np.random.choice([0,1], size=100)
X.shape, Y.shape
((100, 48), (100,))

The trick to creating your Dataset object is that when you attempt to get an item from it, it should return one training observation. Three methods are needed:

  1. the __init__(): initializes your object
  2. the __len__(): used by the DataLoader to figure out how many training observations you have in total; this is needed for shuffling, indexing the right observations, etc.
  3. the __getitem__(): returns a single training unit at a time.

Note that in the __getitem__(), I said “training unit”. Ideally, you want to return one training observation at a time, but depending on your scenario/problem/data, you may want to return more than one observation at a time. The point is, whatever number of observations you get after calling __getitem__() is your training unit.
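As a hypothetical illustration of a training unit that is more than one observation (this example is mine, and you do not need it for the rest of this post), a dataset could return a window of, say, 3 consecutive observations per index, as is common for sequence models:

show code
class WindowDataset(Dataset):
    def __init__(self, X, Y, window=3):
        self.X = X
        self.Y = Y
        self.window = window

    def __len__(self):
        return self.X.shape[0] - self.window + 1 # number of complete windows

    def __getitem__(self, idx):
        sl = slice(idx, idx + self.window)
        return self.X[sl, :], self.Y[sl] # one "training unit" = 3 consecutive observations

WindowDataset(X, Y).__getitem__(0)[0].shape # (3, 48)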

I create a dataset class, MyDataset, that inherits from pytorch’s Dataset and returns just one observation and its ground truth at a time.

show code
class MyDataset(Dataset): # will inherit from the Dataset object
    def __init__(self, X, Y):
        self.X = X
        self.Y = Y
    
    def __len__(self): # the dataloader needs to know the number of observations you have
        return self.X.shape[0]

    def __getitem__(self, idx): # this is what returns just one observation, or one unit of training
        return self.X[idx, :], self.Y[idx] # essentially, I am just slicing the np arrays

Now I can create the dataset object.

show code
mydataset = MyDataset(X, Y)
mydataset
<__main__.MyDataset at 0x7fa39bb2da50>

You can confirm that the dataset object works by doing this: I give it an index, 8, and it pulls the observation and ground truth corresponding to that index.

show code
mydataset.__getitem__(8)
(array([1.43263063e-01, 8.19159823e-01, 9.66615361e-01, 3.04914935e-02,
        6.74951172e-01, 2.28270871e-01, 5.07899833e-01, 7.54816753e-01,
        8.50174001e-01, 5.85532967e-01, 2.13319662e-01, 3.74500070e-02,
        8.69480679e-01, 9.91958073e-01, 1.06552389e-01, 6.75307504e-01,
        4.64268091e-01, 3.97405622e-02, 3.63357637e-01, 8.51468424e-01,
        7.07647608e-01, 3.59670787e-04, 3.27379319e-01, 1.23819926e-01,
        6.51143229e-01, 3.65572306e-01, 8.11721461e-01, 8.81402757e-02,
        1.46144989e-01, 7.60215261e-01, 7.05400679e-01, 6.96563049e-01,
        8.31366812e-01, 3.80790558e-01, 9.90544126e-01, 9.84286220e-01,
        3.50894274e-01, 5.80318077e-01, 8.59732277e-01, 7.51747094e-01,
        3.34853644e-02, 1.76530280e-01, 4.94703167e-01, 7.28400713e-01,
        3.35355319e-01, 2.15013442e-01, 5.02317757e-01, 2.88868790e-01]),
 1)

I can also check the number of observations I have.

show code
mydataset.__len__()
100
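As a side note, you rarely need to call these dunder methods directly; square-bracket indexing and the built-in len() dispatch to __getitem__() and __len__() under the hood, so the following is equivalent:

show code
x8, y8 = mydataset[8] # same as mydataset.__getitem__(8)
print(x8.shape, y8)   # (48,) and the corresponding label
print(len(mydataset)) # same as mydataset.__len__(), i.e. 100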

Class DataLoader

All well and good. But I don’t want to give my model one observation at a time; although people do this, a single observation is usually too small a batch. Instead, I want to give the model a batch of a certain size at a time. DataLoaders help with this. The DataLoader class wraps around the Dataset class to help us shuffle the data, yield batch_size observations at a time, and even make use of parallel backends to yield the data faster, which helps when the dataset is large.

I create a DataLoader object and supply it the batch_size argument. Whenever I ask the object for training examples, it gives me batch_size observations at a time. Here I will set batch_size to 20. Remember that I have 100 observations in total: a batch_size of 20 will yield 20 observations at a time, without replacement, until all 100 are exhausted, giving 5 different batches.

In addition, I can shuffle the loading of the batches by setting shuffle=True. If you have already shuffled your data and intend to keep that ordering, you should probably set this argument to False.
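A small aside: here batch_size=20 divides 100 evenly, so every batch will have exactly 20 observations. If the batch size did not divide evenly, the final batch would simply be smaller; passing drop_last=True to the DataLoader discards that incomplete final batch. For example:

show code
uneven = DataLoader(mydataset, batch_size=30)
print([xb.shape[0] for xb, yb in uneven])         # [30, 30, 30, 10]

uneven_dropped = DataLoader(mydataset, batch_size=30, drop_last=True)
print([xb.shape[0] for xb, yb in uneven_dropped]) # [30, 30, 30]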

show code
mydataloader = DataLoader(mydataset, batch_size=20, shuffle=True)

Now, for each batch, I will print out the number of observations and ground truth labels.

show code
for i, batch in enumerate(mydataloader):
    print(f'batch {i}: number of observations and ground truth are {batch[0].shape[0]} and {batch[1].shape[0]} respectively')
batch 0: number of observations and ground truth are 20 and 20 respectively
batch 1: number of observations and ground truth are 20 and 20 respectively
batch 2: number of observations and ground truth are 20 and 20 respectively
batch 3: number of observations and ground truth are 20 and 20 respectively
batch 4: number of observations and ground truth are 20 and 20 respectively
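Finally, to show where this fits in practice, here is a minimal training-loop sketch. The model, optimizer, and loss function are placeholders I made up for illustration; the important part is that each iteration over the dataloader hands the model one batch of 20 observations.

show code
model = torch.nn.Linear(48, 2) # a toy classifier: 48 features -> 2 classes
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(2): # a couple of passes over the whole dataset
    for xb, yb in mydataloader: # each iteration yields one batch of 20
        optimizer.zero_grad()
        logits = model(xb.float()) # cast from float64 (numpy default) to float32
        loss = loss_fn(logits, yb.long()) # CrossEntropyLoss expects integer class labels
        loss.backward()
        optimizer.step()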