import torch
import numpy as np, os, sys, pandas as pd
import matplotlib.pyplot as plt
import random
print(f'Kernel used is: {os.path.basename(sys.executable.replace("/bin/python",""))}')
Kernel used is: dl-tools
Temi
September 7, 2023
This post is still under construction; I am adding stuff as I get the time to.
When training deep learning models (or any machine learning model, for that matter), we try to make the most of the available data. One way we do this is to supply a batch of the data to the model at each training iteration. So, if you have 100 observations to train on, you can supply, say, 20 at a time.
In addition, loading 100 or more observations at a time may consume a lot of memory, especially if you have limited resources. Instead, we supply mini-batches of the data, small enough to be held in memory alongside the model, during each training iteration.
pytorch gives us a convenient way to load data in this manner by letting us create our own Dataset objects, which are then used by pytorch's DataLoader.
First, you import the Dataset and DataLoader classes from pytorch:
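from torch.utils.data import Dataset, DataLoader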
I will create a fictitious dataset: observations X and ground truth Y. They will be numpy arrays; this way I can easily manipulate them.
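A minimal sketch of generating such data (uniform random features and binary labels are assumptions here; any values would do), ending with the shapes:

X = np.random.rand(100, 48)             # 100 observations, 48 features each (assumed generator)
Y = np.random.randint(0, 2, size=100)   # binary ground truth, one label per observation (assumed)
X.shape, Y.shape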
((100, 48), (100,))
The trick to creating your Dataset object is that when you call the class, or attempt to get an item from the dataset, it should return one training observation. Three methods of that class are needed:

__init__(): initializes your object.
__len__(): used by the DataLoader to figure out how many training observations you have in total; this is needed for shuffling, indexing the necessary observations, etc.
__getitem__(): returns a single training unit at a time.

Note that for __getitem__(), I said “training unit”. Ideally, you want to return one training observation at a time, but depending on your scenario/problem/data, you may want to return more than one observation at a time. The point is, whatever number of observations you get after calling __getitem__() will be your training unit.
I create a dataset class, MyDataset, that will inherit from pytorch's Dataset and return just one observation and ground truth at a time.
class MyDataset(Dataset):  # inherits from pytorch's Dataset class
    def __init__(self, X, Y):
        self.X = X
        self.Y = Y

    def __len__(self):  # the dataloader needs to know how many observations you have
        return self.X.shape[0]

    def __getitem__(self, idx):  # returns one observation, i.e., one unit of training
        return self.X[idx, :], self.Y[idx]  # essentially, just slicing the numpy arrays
Now I can create the dataset object and put it to use. You can confirm that the dataset object works by giving it an index, say 8; it pulls the observation and ground truth corresponding to that index.
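A minimal sketch of that check, assuming the instance is named dataset:

dataset = MyDataset(X, Y)  # wrap the numpy arrays in the dataset class
dataset[8]                 # calls __getitem__() with idx=8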
(array([1.43263063e-01, 8.19159823e-01, 9.66615361e-01, 3.04914935e-02,
6.74951172e-01, 2.28270871e-01, 5.07899833e-01, 7.54816753e-01,
8.50174001e-01, 5.85532967e-01, 2.13319662e-01, 3.74500070e-02,
8.69480679e-01, 9.91958073e-01, 1.06552389e-01, 6.75307504e-01,
4.64268091e-01, 3.97405622e-02, 3.63357637e-01, 8.51468424e-01,
7.07647608e-01, 3.59670787e-04, 3.27379319e-01, 1.23819926e-01,
6.51143229e-01, 3.65572306e-01, 8.11721461e-01, 8.81402757e-02,
1.46144989e-01, 7.60215261e-01, 7.05400679e-01, 6.96563049e-01,
8.31366812e-01, 3.80790558e-01, 9.90544126e-01, 9.84286220e-01,
3.50894274e-01, 5.80318077e-01, 8.59732277e-01, 7.51747094e-01,
3.34853644e-02, 1.76530280e-01, 4.94703167e-01, 7.28400713e-01,
3.35355319e-01, 2.15013442e-01, 5.02317757e-01, 2.88868790e-01]),
1)
I can also check the number of observations I have:
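len(dataset)  # calls __len__(); returns 100 here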
All well and good. But I don't want to give my model one observation at a time; although people do this, it is too small a batch. Instead, I want to give the model a batch of a certain size at a time. Dataloaders help with this. The DataLoader class abstracts over the Dataset class to help us shuffle the data, yield batch_size observations at a time, and even make use of parallel backends to yield the data faster, especially when the dataset is large.
I create a DataLoader object and supply it the argument batch_size. Whenever I ask the object for training examples, it gives me batch_size observations at a time. Here I will set batch_size to 20. Remember that I have 100 observations in total: a batch_size of 20 will yield 20 observations at a time, without replacement, until all 100 are exhausted, giving 5 different batches.
In addition, I can shuffle the loading of the batches by setting shuffle=True. If you have already shuffled your data and intend to keep that order, you should probably set this argument to False.
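A minimal sketch of creating the loader from the dataset above (the name dataloader is my assumption):

dataloader = DataLoader(dataset, batch_size=20, shuffle=True)  # 100 observations -> 5 batches of 20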
Now, for each batch, I will print out the total number of observations and ground truth.
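A sketch of that loop, assuming the dataloader above:

for i, (x_batch, y_batch) in enumerate(dataloader):
    print(f'batch {i}: number of observations and ground truth are {len(x_batch)} and {len(y_batch)} respectively')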
batch 0: number of observations and ground truth are 20 and 20 respectively
batch 1: number of observations and ground truth are 20 and 20 respectively
batch 2: number of observations and ground truth are 20 and 20 respectively
batch 3: number of observations and ground truth are 20 and 20 respectively
batch 4: number of observations and ground truth are 20 and 20 respectively