Deep Learning Models -- A collection of various deep learning architectures, models, and tips for TensorFlow and PyTorch in Jupyter Notebooks.
%load_ext watermark
%watermark -a 'Sebastian Raschka' -v -p torch
Sebastian Raschka CPython 3.7.1 IPython 7.2.0 torch 1.0.0
This notebook provides an example of how to load a dataset from an HDF5 file created from a CSV file, using PyTorch's data loading utilities. For a more in-depth discussion, please see the official PyTorch documentation on datasets and data loading.
The Hierarchical Data Format (HDF) is a convenient way to allow quick access to individual data instances during minibatch learning when a dataset is too large to fit into memory. The approach outlined in this notebook uses the common HDF5 format and should be accessible to any programming language or tool with an HDF5 API.
In this example, we are going to use the Iris dataset for illustrative purposes. Let's pretend it's our large training dataset that doesn't fit into memory.
import pandas as pd
import numpy as np
import h5py
import torch
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
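Before the conversion, it may help to peek at the first few rows of the raw CSV to see the layout that the conversion code below assumes: four numeric feature columns followed by a class label column, and no header row. (This quick check is just for illustration and is not part of the conversion itself.)

# peek at the first few rows of the raw Iris CSV (no header row;
# columns 0-3 are the features, column 4 is the class label)
pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
            header=None, nrows=3)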
In this first step, we are going to process a CSV file (here, Iris) into an HDF5 database:
# suppose this is a large CSV that does not
# fit into memory:
csv_path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'

# Get number of lines in the CSV file if it's on your hard drive:
#num_lines = subprocess.check_output(['wc', '-l', csv_path])
#num_lines = int(num_lines.split()[0])
num_lines = 150
num_features = 4

class_dict = {'Iris-setosa': 0,
              'Iris-versicolor': 1,
              'Iris-virginica': 2}

# use 10,000 or 100,000 or so for large files
chunksize = 10

# this is your HDF5 database:
with h5py.File('iris.h5', 'w') as h5f:

    # use num_lines-1 if the csv file has a column header
    dset1 = h5f.create_dataset('features',
                               shape=(num_lines, num_features),
                               compression=None,
                               dtype='float32')
    dset2 = h5f.create_dataset('labels',
                               shape=(num_lines,),
                               compression=None,
                               dtype='int32')

    # change range argument from 0 -> 1 if your csv file contains a column header
    for i in range(0, num_lines, chunksize):

        df = pd.read_csv(csv_path,
                         header=None,  # no header; columns are indexed by position
                         nrows=chunksize,  # number of rows to read at each iteration
                         skiprows=i)  # skip rows that were already read

        df[4] = df[4].map(class_dict)

        features = df.values[:, :4]
        labels = df.values[:, -1]

        # use i-1 and i-1+chunksize if the csv file has a column header
        dset1[i:i+chunksize, :] = features
        dset2[i:i+chunksize] = labels
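As an aside to the commented-out wc -l lines above: if wc is not available (for example, on Windows), the number of lines in a local CSV can also be counted in plain Python without loading the whole file into memory. The following is only a minimal sketch, and the file path is a placeholder:

# count lines in a local CSV without loading it into memory;
# 'large_dataset.csv' is a placeholder path
def count_lines(path):
    with open(path, 'r') as f:
        return sum(1 for _ in f)

#num_lines = count_lines('large_dataset.csv')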
After creating the database, let's double-check that everything works correctly:
with h5py.File('iris.h5', 'r') as h5f:
    print(h5f['features'].shape)
    print(h5f['labels'].shape)
(150, 4)
(150,)
with h5py.File('iris.h5', 'r') as h5f:
    print('Features of entry no. 99:', h5f['features'][99])
    print('Class label of entry no. 99:', h5f['labels'][99])
Features of entry no. 99: [5.7 2.8 4.1 1.3]
Class label of entry no. 99: 1
Now, we implement a custom Dataset class for reading the training examples. The __getitem__ method reads a single training example from the HDF5 database based on an index and returns it together with its class label (more on batching later). Note that we keep an open connection to the database for efficiency via self.h5f = h5py.File(h5_path, 'r') -- you may want to close it when you are done (more on this later).
class Hdf5Dataset(Dataset):
    """Custom Dataset for loading entries from HDF5 databases"""

    def __init__(self, h5_path, transform=None):
        self.h5f = h5py.File(h5_path, 'r')
        self.num_entries = self.h5f['labels'].shape[0]
        self.transform = transform

    def __getitem__(self, index):
        features = self.h5f['features'][index]
        label = self.h5f['labels'][index]
        if self.transform is not None:
            features = self.transform(features)
        return features, label

    def __len__(self):
        return self.num_entries
Now that we have created our custom Dataset class, we can initialize a Dataset instance for the training examples using the 'iris.h5' database file. Then, we initialize a DataLoader
that allows us to read from the dataset.
train_dataset = Hdf5Dataset(h5_path='iris.h5',
                            transform=None)

train_loader = DataLoader(dataset=train_dataset,
                          batch_size=50,
                          shuffle=True,
                          num_workers=4)
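As a quick sanity check (not part of the original workflow), indexing the Dataset directly invokes __getitem__ and returns a single (features, label) pair, which should match the entry-99 values we printed from the HDF5 file earlier:

# fetch a single training example directly from the HDF5-backed Dataset
features, label = train_dataset[99]
print('Features:', features)
print('Label:', label)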
That's it! Now we can use the train_loader as an iterator over the training dataset and feed the features and labels to a model for training, as shown below.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
torch.manual_seed(0)

num_epochs = 5

for epoch in range(num_epochs):

    for batch_idx, (x, y) in enumerate(train_loader):

        print('Epoch:', epoch+1, end='')
        print(' | Batch index:', batch_idx, end='')
        print(' | Batch size:', y.size()[0])

        x = x.to(device)
        y = y.to(device)

        # do model training on x and y here
Epoch: 1 | Batch index: 0 | Batch size: 50
Epoch: 1 | Batch index: 1 | Batch size: 50
Epoch: 1 | Batch index: 2 | Batch size: 50
Epoch: 2 | Batch index: 0 | Batch size: 50
Epoch: 2 | Batch index: 1 | Batch size: 50
Epoch: 2 | Batch index: 2 | Batch size: 50
Epoch: 3 | Batch index: 0 | Batch size: 50
Epoch: 3 | Batch index: 1 | Batch size: 50
Epoch: 3 | Batch index: 2 | Batch size: 50
Epoch: 4 | Batch index: 0 | Batch size: 50
Epoch: 4 | Batch index: 1 | Batch size: 50
Epoch: 4 | Batch index: 2 | Batch size: 50
Epoch: 5 | Batch index: 0 | Batch size: 50
Epoch: 5 | Batch index: 1 | Batch size: 50
Epoch: 5 | Batch index: 2 | Batch size: 50
Remember that we kept an open connection to the HDF5 database in the Hdf5Dataset (via self.h5f = h5py.File(h5_path, 'r')). Once we are done, we may want to close this connection:
train_dataset.h5f.close()
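One caveat worth mentioning: if the DataLoader uses multiple worker processes (num_workers > 0), sharing a single HDF5 file handle that was opened in the parent process can cause problems on some platforms, because HDF5 handles are generally not safe to use across forked processes. A common workaround is to open the file lazily on first access in each worker; the following is only a sketch of such a variant, not the approach used above:

class LazyHdf5Dataset(Dataset):
    """Variant that opens the HDF5 file lazily in each worker process."""

    def __init__(self, h5_path, transform=None):
        self.h5_path = h5_path
        self.transform = transform
        self.h5f = None  # opened on the first __getitem__ call
        # read the number of entries once, then close the file again
        with h5py.File(h5_path, 'r') as h5f:
            self.num_entries = h5f['labels'].shape[0]

    def __getitem__(self, index):
        if self.h5f is None:
            self.h5f = h5py.File(self.h5_path, 'r')
        features = self.h5f['features'][index]
        label = self.h5f['labels'][index]
        if self.transform is not None:
            features = self.transform(features)
        return features, label

    def __len__(self):
        return self.num_entries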
%watermark -iv
torch 1.0.0 pandas 0.23.4 numpy 1.15.4 h5py 2.8.0