
Loading Data in PyTorch

In this article, we will discuss how to load different kinds of data in PyTorch.

PyTorch has three companion domain libraries, torchaudio, torchvision, and torchtext, each of which ships ready-made datasets. We can use these demo datasets to understand how to load audio, image, and text data in PyTorch.



Torchaudio Dataset

Loading the demo YESNO audio dataset from torchaudio.

YESNO is an audio dataset in which each item is a tuple of three values, (waveform, sample_rate, labels). Here waveform is the audio signal stored as a tensor, sample_rate is the sampling rate of the recording in Hz, and labels is a list of integers marking each spoken word as yes (1) or no (0).



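To load your custom data, wrap it in a dataset object and pass that object to DataLoader:

Syntax: torch.utils.data.DataLoader(dataset, batch_size, shuffle)

Parameters:

  • dataset – a Dataset object (for example, an audio dataset) from which samples are drawn; a bare file path is not accepted
  • batch_size – the number of samples to load per batch, which matters when the dataset is too large to process at once
  • shuffle – a bool. Setting it to True reshuffles the data at every epoch.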
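As a minimal sketch of these parameters, the example below wraps random tensors in a hypothetical ToyAudioDataset class and iterates over it with DataLoader. The class name, the tensor sizes, and the labels are assumptions made for illustration and are not part of torchaudio.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyAudioDataset(Dataset):
    """Hypothetical dataset of random 'waveforms' with binary labels."""

    def __init__(self, num_items=10, num_samples=8000):
        generator = torch.Generator().manual_seed(0)
        # each item is a mono waveform of shape (1, num_samples)
        self.waveforms = torch.randn(num_items, 1, num_samples,
                                     generator=generator)
        # alternate labels 0, 1, 0, 1, ...
        self.labels = torch.arange(num_items) % 2

    def __len__(self):
        return len(self.waveforms)

    def __getitem__(self, index):
        return self.waveforms[index], self.labels[index]


dataset = ToyAudioDataset()

# batch_size=4 groups four samples per batch;
# shuffle=True draws them in a new random order each epoch
loader = DataLoader(dataset, batch_size=4, shuffle=True)

batch_shapes = []
for waveforms, labels in loader:
    batch_shapes.append((tuple(waveforms.shape), tuple(labels.shape)))
    print(waveforms.shape, labels.shape)
```

With 10 items and a batch size of 4, the last batch holds only the 2 remaining samples.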




# import the torch and torchaudio dataset packages.
import torch
import torchaudio
 
# access the dataset in torchaudio package using
# datasets followed by dataset name.
# './' is the root directory in which the dataset
# is stored (here, the current working directory).
# download=True downloads the data
# if it is not already present
yesno_data = torchaudio.datasets.YESNO('./',
                                       download=True)
 
# load the first 5 samples from yesno_data
for i in range(5):
    waveform, sample_rate, labels = yesno_data[i]
    print("Waveform: {}\nSample rate: {}\nLabels: {}".format(
        waveform, sample_rate, labels))

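Output: each iteration prints the waveform tensor, the sample rate, and the list of labels for one recording.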
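The loop above reads one sample at a time. Recordings generally differ in length, so batching them with DataLoader needs a custom collate function. The sketch below pads each batch to its longest waveform. The pad_collate helper and the stand-in samples are hypothetical, chosen so that the example runs without downloading anything.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader


def pad_collate(batch):
    """Pad variable-length waveforms in a batch to a common length."""
    waveforms, sample_rates, labels = zip(*batch)
    # remember the true lengths before padding
    lengths = torch.tensor([w.shape[-1] for w in waveforms])
    # (1, time) -> (time,), then zero-pad to the longest waveform
    padded = pad_sequence([w.squeeze(0) for w in waveforms],
                          batch_first=True)
    return padded, lengths, torch.tensor(labels)


# stand-in samples shaped like (waveform, sample_rate, labels)
samples = [
    (torch.ones(1, 5), 8000, [0, 1, 0, 1, 0, 1, 0, 1]),
    (torch.ones(1, 3), 8000, [1, 1, 1, 1, 1, 1, 1, 1]),
    (torch.ones(1, 7), 8000, [0, 0, 0, 0, 0, 0, 0, 0]),
]

# a plain list works as a map-style dataset
loader = DataLoader(samples, batch_size=3, shuffle=False,
                    collate_fn=pad_collate)

padded, lengths, labels = next(iter(loader))
print(padded.shape, lengths.tolist(), labels.shape)
```

Returning the original lengths alongside the padded tensor lets later code ignore the zero padding.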

Torchvision Dataset

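Loading the demo ImageNet vision dataset from torchvision. torchvision does not download ImageNet for you: the archives must be obtained from the ImageNet website, which requires signing up, and placed in the root directory passed to the dataset class.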




# import the torch and
# torchvision dataset packages.
import torch
import torchvision
 
# access the dataset in torchvision package using
# .datasets followed by dataset name.
imagenet_data = torchvision.datasets.ImageNet('path/to/imagenet_root/')

Code Explanation:

To load your custom image data, use torch.utils.data.DataLoader(data, batch_size, shuffle) as mentioned above.




# import torch and the necessary modules
# from the torchvision package
import torch
from torchvision import transforms, datasets
import matplotlib.pyplot as plt
 
# specify the image dataset folder
data_dir = r'path to dataset\train'
 
# perform some transformations like resizing,
# center cropping and conversion to a tensor
# using transforms.Compose
transform = transforms.Compose(
    [transforms.Resize(255),
     transforms.CenterCrop(224),
     transforms.ToTensor()])
 
# pass the image data folder and
# the transform to the
# datasets.ImageFolder class
dataset = datasets.ImageFolder(data_dir,
                               transform=transform)
 
# now use DataLoader to load the dataset
# in batches with the specified transformation.
dataloader = torch.utils.data.DataLoader(dataset,
                                         batch_size=32,
                                         shuffle=True)
 
# next(iter(...)) fetches the first batch of
# images and labels and stores it in two variables
images, labels = next(iter(dataloader))
 
# print the number of samples in the batch
print('Number of samples: ', len(images))
image = images[2][0]  # load 3rd sample (first channel only)
 
# visualize the image
plt.imshow(image, cmap='gray')
 
# print the size of image
print("Image Size: ", image.size())
 
# print the labels of the batch
print(labels)

Output:

Image size: torch.Size([224,224])
tensor([0, 0, 0, 1, 1, 1])

Torchtext Dataset

Loading the demo IMDB text dataset from torchtext. To load your custom text data, we use the torch.utils.data.DataLoader() method.

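Syntax: torch.utils.data.DataLoader(dataset, batch_size, shuffle=True), where dataset is a dataset object built from your text data (for example, the IMDB reviews) rather than a path string.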

Code Explanation:




# import the torch and torchtext dataset packages.
import torch
import torchtext
 
# access the dataset in torchtext package
# using .datasets followed by dataset name.
text_data = torchtext.datasets.IMDB(split='train')
 
# define a function to tokenize
# the words in the corpus
def tokenize(label, line):
    return line.split()
 
 
# define an empty list to store
# the tokenized words
tokens = []
 
# iterate over the text_data and
# tokenize each line and store
# it in the list tokens
for label, line in text_data:
    tokens += tokenize(label, line)
 
print('The total no. of tokens in imdb dataset is',
      len(tokens))

Output: the script prints the total number of whitespace-separated tokens in the IMDB training split.
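Token strings cannot be batched directly, so a common next step is to map tokens to integer ids and pad each batch. The sketch below does this for three made-up reviews that stand in for IMDB. The samples, the labels, the vocabulary scheme, and the collate helper are all assumptions for illustration.

```python
from collections import Counter

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# stand-in (label, line) pairs shaped like the IMDB samples
samples = [
    (1, "a great movie"),
    (0, "a dull and slow movie"),
    (1, "great acting"),
]

# count tokens and assign ids starting at 1;
# id 0 is reserved for padding
counter = Counter(tok for _, line in samples for tok in line.split())
vocab = {tok: idx
         for idx, (tok, _) in enumerate(counter.most_common(), start=1)}


def collate(batch):
    """Convert a batch of (label, line) pairs to padded id tensors."""
    labels = torch.tensor([label for label, _ in batch])
    ids = [torch.tensor([vocab[tok] for tok in line.split()])
           for _, line in batch]
    padded = pad_sequence(ids, batch_first=True, padding_value=0)
    return padded, labels


loader = DataLoader(samples, batch_size=3, shuffle=False,
                    collate_fn=collate)

token_ids, labels = next(iter(loader))
print(token_ids.shape, labels.tolist())
```

The same collate function could be passed to a DataLoader built on the IMDB split, provided the vocabulary is built from that split first.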

