PyTorch Tutorial by Modar

A basic PyTorch classification tutorial with links and references to useful materials for getting started. This tutorial was presented on the 6th of August 2018 as part of the weekly meetings of the IVUL-KAUST research group.


PyTorch as GPU accelerated Numpy

PyTorch for Deep Learning

High-Level PyTorch Wrappers

There exist high-level APIs for PyTorch, analogous to Keras for TensorFlow, such as:

Also, it is not that hard to use tensorboard with PyTorch:

Example: Classification on Kaggle's Dogs vs. Cats

You can use this official ~100-line example of MNIST classification as a reference.

Import the relevant libraries

In this example, we will only use torch and torchvision and we won't be using any high-level API because we want to demonstrate the power of PyTorch at its core. In fact, we will be building something similar to a high-level API ourselves.

In [1]:
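import importlib.util  # `find_spec` lives in the `importlib.util` submodule
import tensorflow as tf  # to visualize training summaries with tensorboard (TensorFlow 1.x API)

import torch
from torch import nn
import torch.nn.functional as F
from torchvision import transforms, datasets

# for reproducibility
torch.manual_seed(0)  # the seed value is arbitrary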

# a utility function to print the progress of a for-loop
# don't worry about this bit because it is not part of the tutorial
# it is recommended that you install `tqdm` package
def _verbosify(iterable):
    # shows only the iteration number and how many iterations are left
    try:
        len_iterable = len(iterable)
    except Exception:
        len_iterable = None
    for i, element in enumerate(iterable, 1):
        if len_iterable is None:
            print('\rIteration #{}'.format(i), end='')
        else:
            print('\rIteration #{} out of {} iterations [Done {:.2f}%]'.format(
                i, len_iterable, 100 * i / len_iterable), end='')
        yield element
    print('\r', end='', flush=True)

def verbosify(iterable, **kwargs):
    # try to use tqdm (shows the speed and the remaining time left)
    if importlib.util.find_spec('tqdm') is not None:
        tqdm = importlib.import_module('tqdm').tqdm
        if 'file' not in kwargs:
            kwargs['file'] = importlib.import_module('sys').stdout
        if 'leave' not in kwargs:
            kwargs['leave'] = False
        return tqdm(iterable, **kwargs)
    else:
        return iter(_verbosify(iterable))

# try out this example (uncomment to test):
# for i in verbosify(range(10000000)):
#     pass

Prepare the dataset [goal: defining train_loader and valid_loader]

I strongly recommend that you follow this tutorial for a more comprehensive understanding of how to deal with data.

1 - Obtain the dataset

You can find the Dogs vs. Cats dataset here. You can also use your own dataset, but to use ImageFolder you need to make sure that the dataset is in a folder where each subfolder is named after a class label and contains all the images of that class. There are also utility classes under torchvision.datasets that download common datasets like MNIST and CIFAR10 (e.g. torchvision.datasets.MNIST).

After you download the Dogs vs. Cats dataset, you will get a zip file. Unzip it using:

sudo apt-get install unzip
unzip -d all
cd all
cd train

All the images are named [dog|cat].<index>.jpg, but we need to put them in separate folders {cat, dog} as follows:

mkdir dog cat
mv dog.* dog
mv cat.* cat
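If you would rather do this step from Python (for example, on a machine without these shell tools), the same reorganization can be sketched with the standard library. The helper name `sort_into_class_folders` is hypothetical, and the sketch assumes the [dog|cat].<index>.jpg naming described above.

```python
import shutil
from pathlib import Path


def sort_into_class_folders(train_dir, classes=('cat', 'dog')):
    """Move `<class>.<index>.jpg` files into one subfolder per class."""
    train_dir = Path(train_dir)
    moved = {name: 0 for name in classes}
    for name in classes:
        class_dir = train_dir / name
        class_dir.mkdir(exist_ok=True)
        # files of this class are the ones whose name starts with `<class>.`
        for image in sorted(train_dir.glob(name + '.*.jpg')):
            shutil.move(str(image), str(class_dir / image.name))
            moved[name] += 1
    return moved  # number of files moved per class
```

Calling `sort_into_class_folders('./all/train')` would leave the layout that ImageFolder expects: one subfolder per class label.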

2 - Define the transforms

In PyTorch, we call the input to your model data and the output target. When we read images from a folder, they are read as PIL images but in order to feed them as data to our model, they need to be transformed to PyTorch tensors with correct size and normalization. This is why we will create a list of transformation functions for the images, each of which will operate on the output of the previous function while the first function will operate on a single PIL image. You can also create transformation functions for the output labels called target_transform if needed.

In [2]:
train_transform = transforms.Compose([
        transforms.Resize(224),             # takes PIL image as input and outputs PIL image
        transforms.RandomResizedCrop(224),  # takes PIL image as input and outputs PIL image
        transforms.RandomHorizontalFlip(),  # takes PIL image as input and outputs PIL image
        transforms.ToTensor(),              # takes PIL image as input and outputs torch.tensor
        transforms.Normalize(mean=[0.4280, 0.4106, 0.3589],  # takes tensor and outputs tensor
                             std=[0.2737, 0.2631, 0.2601]),  # see next step for mean and std
])
valid_transform = transforms.Compose([  # for validation we don't randomize or augment
        transforms.Resize(224),
        transforms.CenterCrop(224),  # assumption: fixed 224x224 crop so images can be batched
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.4280, 0.4106, 0.3589],
                             std=[0.2737, 0.2631, 0.2601]),
])
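To make the chaining idea concrete, here is a minimal plain-Python sketch of what a composed transform does: each function receives the output of the previous one. This `Compose` class is a simplified stand-in written for illustration, not torchvision's implementation, and the two toy functions are hypothetical.

```python
class Compose:
    """Simplified stand-in for `transforms.Compose` (illustration only)."""

    def __init__(self, functions):
        self.functions = list(functions)

    def __call__(self, x):
        # feed the output of each function to the next one, in order
        for function in self.functions:
            x = function(x)
        return x


def to_unit_range(pixels):
    # toy analogue of `ToTensor`: map values from [0, 255] to [0, 1]
    return [p / 255 for p in pixels]


def normalize(pixels, mean=0.5, std=0.5):
    # toy analogue of `Normalize`: subtract the mean and divide by the std
    return [(p - mean) / std for p in pixels]


pipeline = Compose([to_unit_range, normalize])
result = pipeline([0, 255])
print(result)  # [-1.0, 1.0]
```

The order matters: `normalize` assumes values already scaled to [0, 1], just as `transforms.Normalize` has to come after `transforms.ToTensor` in the lists above.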

3 - Create the dataset

In [3]:
# Just implement __getitem__ and __len__
class DummyDataset(torch.utils.data.Dataset):
    def __init__(self, size=(3, 224, 224), num_samples=1000, num_classes=3):
        self.images = torch.randn(num_samples, *size)
        self.labels = torch.randint(0, num_classes, (num_samples,))
        # this dataset, doesn't need transforms
        # because it is already in the correct size and format

    def __getitem__(self, index):
        return self.images[index, ...], self.labels[index]

    def __len__(self):
        return self.images.size(0)

# Or, we can use `torchvision.datasets.ImageFolder`
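# Note: the training and validation splits created below share this transform
dataset = datasets.ImageFolder(root='./all/train/', transform=train_transform)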

# here you should split this dataset into training and validation
def random_split(dataset, split_frac):
    dataset_length = len(dataset)
    train_length = int(dataset_length * (1 - split_frac))
    valid_length = dataset_length - train_length
    train_set, valid_set = torch.utils.data.random_split(dataset, [train_length, valid_length])
    return train_set, valid_set

split_frac = 0.1  # the ratio of images in the validation set
train_set, valid_set = random_split(dataset, split_frac)
In [4]:
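# getting the mean and std of the images (assuming that you have enough memory)
# Note: do this before adding `Normalize` to `valid_transform`;
#       otherwise you get the statistics of already normalized pixels

# pixels_list = [img.view(3, -1) for img, label in  # this will take a while
#                datasets.ImageFolder(root='all/train', transform=valid_transform)]
# pixels = torch.cat(pixels_list, dim=-1)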
# pixels_mean = pixels.mean(dim=-1)
# pixels_std = pixels.std(dim=-1)
# print(pixels_mean)  # Out: tensor([0.4280, 0.4106, 0.3589])
# print(pixels_std)   # Out: tensor([0.2737, 0.2631, 0.2601])

# if you don't have sufficient memory, you can compute mean as a running average
# and std as described here:
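One common way to compute these statistics without holding every pixel in memory (not necessarily the method behind the link above) is to accumulate a count, a sum, and a sum of squares per channel. The sketch below uses plain Python lists so that it is self-contained; the class name `RunningStats` is hypothetical. In the tutorial's setting you could keep the same three accumulators as tensors, e.g. by adding `img.view(3, -1).sum(dim=-1)` for each image.

```python
import math


class RunningStats:
    """Accumulate per-channel mean and std one image at a time."""

    def __init__(self, num_channels=3):
        self.count = 0  # number of pixels seen so far (per channel)
        self.total = [0.0] * num_channels
        self.total_sq = [0.0] * num_channels

    def update(self, image):
        # `image` is a sequence of channels, each a flat sequence of pixel values
        # (all channels of one image are assumed to have the same length)
        self.count += len(image[0])
        for c, channel in enumerate(image):
            self.total[c] += sum(channel)
            self.total_sq[c] += sum(p * p for p in channel)

    def mean(self):
        return [t / self.count for t in self.total]

    def std(self):
        # unbiased estimate (divides by count - 1), like `torch.std` by default
        return [math.sqrt((sq - t * t / self.count) / (self.count - 1))
                for t, sq in zip(self.total, self.total_sq)]
```

Be aware that the sum-of-squares formula can lose precision when the mean is large relative to the spread; for pixel values in [0, 1] accumulated in double precision this is rarely an issue.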

4 - Define the loaders

In [5]:
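# create the `DataLoader` which will do the loading
def data_loader(dataset, batch_size, train, cuda):
    return torch.utils.data.DataLoader(dataset, batch_size=batch_size,
                                       num_workers=4 if cuda else 0,
                                       shuffle=train)  # shuffle only the training data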

train_loader = data_loader(train_set, 128, train=True, cuda=True)
valid_loader = data_loader(valid_set, 256, train=False, cuda=True)
# Note: look up the rest of the parameters of DataLoader
#       some of the interesting ones are `sampler` and `collate_fn`.
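Conceptually, a loader only needs the two methods that a dataset implements, `__len__` and `__getitem__`. The plain-Python sketch below shows that core loop; the function `iterate_batches` is a hypothetical, simplified stand-in. The real `DataLoader` additionally collates the samples of a batch into tensors (see `collate_fn`) and can load them in parallel worker processes.

```python
import random


def iterate_batches(dataset, batch_size, shuffle=False, seed=None):
    """Yield lists of samples; a simplified stand-in for a data loader."""
    indices = list(range(len(dataset)))  # relies on `__len__`
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        batch_indices = indices[start:start + batch_size]
        yield [dataset[i] for i in batch_indices]  # relies on `__getitem__`


# any object with `__len__` and `__getitem__` works, even a list
batches = list(iterate_batches(list(range(10)), batch_size=4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Note that the last batch can be smaller than `batch_size`, which is why the metrics later in this tutorial are weighted by the number of samples in each batch.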

Define the model [goal: defining Net]

Please refer to the official implementation of AlexNet here for a nicer style of defining an nn.Module. It uses nn.Sequential, which is itself a subclass of nn.Module, and will introduce you to the concept of defining a module using submodules.

In [6]:
class Net(nn.Module):
    def __init__(self):
        super().__init__()  # must run before assigning any submodule
        # define all the parameters of the model here
        # Note: All the layers and modules have to be direct
        #       attributes of Net to be included in training (e.g. self.conv1).
        #       To add them manually: `self.add_module(name, module)`.
        self.conv1 = nn.Conv2d(3, 32, kernel_size=5)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.conv3 = nn.Conv2d(64, 64, kernel_size=5)
        self.conv4 = nn.Conv2d(64, 20, kernel_size=5)
        self.fc1 = nn.Linear(2000, 1024)  # assumes the input is 224x224
        self.fc2 = nn.Linear(1024, 10)

    def forward(self, x):
        # define the forward pass here
        # Note: PyTorch will complain if you tried to do operations between tensors that don't
        #       have the same dtype and/or device but it will allow scalar tensor operations.
        #       Be wary of operations using scalars because it is a big source of errors.
        #       Native Python scalars are mostly fine but Numpy scalars are problematic:
        #       E.g., `np.array([1.])[0] * torch.tensor(2, device='cuda')` will be in 'cpu'.
        #       Always put the scalars in `torch.tensor` to know when you are mixing stuff.
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = F.relu(F.max_pool2d(self.conv3(x), 2))
        x = F.relu(F.max_pool2d(self.conv4(x), 2))
        x = x.view(x.shape[0], -1)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return x
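The 2000 input features of `fc1` follow from the layer arithmetic: every 5x5 convolution without padding shrinks each side by 4, and every 2x2 max-pooling halves it (rounding down). A small sketch to check the number, using hypothetical helper names:

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Spatial size after a convolution (or pooling) layer."""
    return (size + 2 * padding - kernel_size) // stride + 1


def flattened_features(input_size=224, out_channels=20, num_blocks=4):
    size = input_size
    for _ in range(num_blocks):
        size = conv_output_size(size, kernel_size=5)            # nn.Conv2d(..., kernel_size=5)
        size = conv_output_size(size, kernel_size=2, stride=2)  # F.max_pool2d(..., 2)
    return out_channels * size * size


print(flattened_features())  # 2000
```

The side length goes 224, 220, 110, 106, 53, 49, 24, 20, 10, so the last feature map has 20 channels of 10x10 values. If you change the input resolution, `fc1` has to change with it.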

Implement the training procedure [goal: defining full_epoch()]

We need to implement a function, called full_epoch(), that performs a single complete epoch on a given dataset and returns some computed metrics (e.g., loss and accuracy). It will take as input the model, the data loader, the device to do the operations on and optionally an optimizer. If the optimizer is provided, it will do a training epoch, otherwise it will do a validation epoch. Here, we will also implement a helper function called process(), for modularity purposes only, that processes a single input batch at a time and it will be called by full_epoch() at each iteration. Usually, you would only need to modify process().
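One detail of the metrics is worth spelling out. The loss of a batch is a mean over its samples, and the last batch of an epoch is often smaller than the others. This is why process() below scales the loss by the batch size before full_epoch() divides the accumulated value by the total number of samples. A plain-Python sketch of that weighting, with a hypothetical function name:

```python
def average_over_samples(batch_means, batch_sizes):
    """Average per-batch mean values so that every sample counts equally."""
    total = sum(mean * size for mean, size in zip(batch_means, batch_sizes))
    return total / sum(batch_sizes)


# two batches: 128 samples with mean loss 0.5 and a final batch of 32 with mean loss 1.0
print(average_over_samples([0.5, 1.0], [128, 32]))  # 0.6
```

Simply averaging the two batch means would give 0.75 and over-weight the small final batch.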

In [7]:
def softmax_cross_entropy(output, target):
    # equivalent to `F.cross_entropy(output, target)`, which expects raw logits
    # (do not apply `F.softmax` to `output` before calling it)
    return F.nll_loss(F.log_softmax(output, dim=1), target)

def accuracy(output, target):
    predictions = output.max(1, keepdim=True)[1]
    return predictions.eq(target.view_as(predictions)).sum()

def process(model, data, target, optimizer=None):
    '''Perform the forward and backward passes on the given data.

    Args:
        model: An `nn.Module` or a function to process `data`.
        data: The desired input to `model` (e.g., a batch of images).
        target: The desired output of `model` (e.g., ground truth labels).
        optimizer: To perform the backward pass.

    Returns:
        A `dict` of collected metrics.
    '''
    # compute the loss
    output = model(data)  # logits
    loss = softmax_cross_entropy(output, target)
    # if training, update the weights
    if optimizer is not None:
        # you need to zero out the gradients of all the parameters
        optimizer.zero_grad()
        # accumulate the gradients with a backward pass
        loss.backward()
        # update the parameters with the gradients
        optimizer.step()
    # save the metrics
    metrics = {
        'loss': loss.item() * len(data),
        'accuracy': accuracy(output, target).item(),
    }
    return metrics

def full_epoch(model, data_loader, device, optimizer=None):
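    '''Perform a single epoch.

    Args:
        model: An `nn.Module` or a function to process `data`.
        data_loader: A `torch.utils.data.DataLoader`.
        device: On which device to perform the epoch.
        optimizer: To perform a training epoch (validation if `None`).

    Returns:
        A `dict` of collected metrics.
    '''
    # make sure the model is on the same device as the data
    model.to(device)
    # Change `model.training` to True and False accordingly
    if optimizer is None:
        model.eval()
    else:
        model.train()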
    total_count = 0
    accumulated_metrics = {}
    for data, target in verbosify(data_loader):
        # process the batch [data (images) and target (labels)]
        metrics = process(model, data.to(device), target.to(device), optimizer)
        # accumulate the metrics
        total_count += len(data)
        for metric, value in metrics.items():
            if metric not in accumulated_metrics:
                accumulated_metrics[metric] = 0
            accumulated_metrics[metric] += value
    # compute the averaged metrics
    for metric in accumulated_metrics:
        accumulated_metrics[metric] /= total_count
    return accumulated_metrics
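To see why softmax_cross_entropy() above works in log space, here is a plain-Python sketch of the log-sum-exp trick for a single sample. It illustrates the idea only and is not how `F.log_softmax` is implemented internally; both function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax of a list of logits."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp()
    log_sum = shift + math.log(sum(math.exp(x - shift) for x in logits))
    return [x - log_sum for x in logits]


def cross_entropy(logits, target):
    """Negative log-likelihood of the `target` class for one sample."""
    return -log_softmax(logits)[target]


# a naive `math.log(math.exp(1000.0) / ...)` would raise OverflowError here
loss = cross_entropy([1000.0, 0.0], target=0)
```

With two equal logits the loss is log(2), the value expected from a classifier that guesses between two classes; the validation losses printed later in this tutorial start close to it.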

Train a model [goal: defining train() and running it]

But first, we will implement a generic train() function.

In [8]:
import os
import shutil

def train(model, device, num_epochs, optimizer, train_loader, valid_loader,
          scheduler=None, patience=10, load=None, save=None, log_dir=None, restart=False):
    '''Train a model for a certain number of epochs.

    Args:
        model: An `nn.Module` or a function to process the batches from the loaders.
        device: On which device to do the training.
        num_epochs: Number of epochs to train.
        optimizer: A `torch.optim.Optimizer` (e.g. SGD).
        train_loader: The `DataLoader` for the training dataset.
        valid_loader: The `DataLoader` for the validation dataset.
        scheduler: The learning rate scheduler.
        patience: The number of bad epochs to wait before early termination.
        load: Reinitialize the model and its hyper-parameters from this `*.pt` checkpoint file.
        save: The `*.pt` checkpoint file to save all the parameters of the trained model.
        log_dir: The directory to which we want to save tensorboard summaries.
        restart: Whether to remove `log_dir` and `load` before starting the function.

    Returns:
        The best state of the model during training (at the minimum validation loss).
    '''
    # restart if desired by removing old files
    if restart:
        if log_dir is not None and os.path.exists(log_dir):
            shutil.rmtree(log_dir)
        if load is not None and os.path.exists(load):
            os.remove(load)

    # try to resume from a checkpoint file if `load` was provided
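    if load is not None:
        try:
            best_state = torch.load(load)
            model.load_state_dict(best_state['model'])
            optimizer.load_state_dict(best_state['optimizer'])
            if scheduler is not None:
                scheduler.load_state_dict(best_state['scheduler'])
        except FileNotFoundError:
            msg = 'Couldn\'t find checkpoint file! {} (training with random initialization)'
            print(msg.format(load))
            load = None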

    # otherwise, start from the current initialization
    if load is None:
        best_state = {
            'epoch': -1,
            'model': model.state_dict(),
            'optimizer': optimizer.state_dict(),
            'scheduler': scheduler.state_dict() if scheduler is not None else None,
            'loss': float('inf'),
        }
    num_bad_epochs = 0
    for epoch in range(best_state['epoch'] + 1, num_epochs):
        # train and validate
        train_metrics = full_epoch(model, train_loader, device, optimizer)
        valid_metrics = full_epoch(model, valid_loader, device)  # will not do backward pass

        # get the current learning rate
        learning_rate = optimizer.param_groups[0]['lr']
        # Note: an optimizer can have multiple param_groups
        #       each of which can be assigned a different learning rate
        #       but by default we have a single param_group.

        # reduce the learning rate according to the `scheduler` policy
        if scheduler is not None:
            scheduler.step(valid_metrics['loss'])  # `ReduceLROnPlateau` monitors this value

        # print the progress
        print('Epoch #{}: [train: {:.2e} > {:.2f}%][valid: {:.2e} > {:.2f}%] @ {:.2e}'.format(
            epoch, train_metrics['loss'], 100 * train_metrics['accuracy'],
            valid_metrics['loss'], 100 * valid_metrics['accuracy'], learning_rate))

        # save tensorboard summaries
        if log_dir is not None:
            # create the summary writer only the first time
            if not hasattr(log_dir, 'add_summary'):
                log_dir = tf.summary.FileWriter(log_dir)
            summaries = {
                'learning_rate': learning_rate,
            }
            summaries.update({'train/' + name: value for name, value in train_metrics.items()})
            summaries.update({'valid/' + name: value for name, value in valid_metrics.items()})
            values = [tf.Summary.Value(tag=k, simple_value=v) for k, v in summaries.items()]
            log_dir.add_summary(tf.Summary(value=values), epoch)
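        # save the model to disk if it has improved
        if best_state['loss'] < valid_metrics['loss']:
            num_bad_epochs += 1
        else:
            num_bad_epochs = 0
            # Note: `state_dict()` returns references to the live tensors;
            #       use `copy.deepcopy` if you need a frozen in-memory snapshot.
            best_state = {
                'epoch': epoch,
                'model': model.state_dict(),
                'optimizer': optimizer.state_dict(),
                'scheduler': scheduler.state_dict() if scheduler is not None else None,
                'loss': valid_metrics['loss'],
            }
            if save is not None:
                torch.save(best_state, save)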

        # do early stopping
        if num_bad_epochs >= patience:
            print('Validation loss didn\'t improve for {} epochs!'.format(patience))
            print('[Early stopping]')
            break

    # close the summary writer if created
    if log_dir is not None:
        if hasattr(log_dir, 'close'):
            log_dir.close()

    return best_state
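The early stopping rule inside train() can be isolated into a few lines. The sketch below replays a list of validation losses and reports the epoch at which training would stop; the function name and the loss values are made up for illustration.

```python
def early_stopping_epoch(valid_losses, patience):
    """Return the index of the last epoch that would be run."""
    best_loss = float('inf')
    num_bad_epochs = 0
    for epoch, loss in enumerate(valid_losses):
        if best_loss < loss:
            num_bad_epochs += 1   # no improvement this epoch
        else:
            best_loss = loss      # improvement: remember it and reset the counter
            num_bad_epochs = 0
        if num_bad_epochs >= patience:
            return epoch          # stop early
    return len(valid_losses) - 1  # ran through every epoch


print(early_stopping_epoch([0.9, 0.7, 0.8, 0.75, 0.71, 0.6], patience=3))  # 4
```

In this example the best loss (0.7) is reached at epoch 1, epochs 2 to 4 do not beat it, and training stops before it ever sees the 0.6 at epoch 5. A larger `patience` trades training time for a lower risk of stopping too soon.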

Then, we will initialize a model with some hyper-parameters and start the training.

In [9]:
# initialize the model and the hyper-parameters
model = Net()
num_epochs = 10
logs = './all/log/model'
checkpoint = './all/'
device = torch.device('cuda:0')  # e.g., {'cpu', 'cuda:0', 'cuda:1', ...}
optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-4, betas=(0.9, 0.999), weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, 'min', factor=0.2, patience=3, verbose=True)

# train the model
best_state = train(model, device, num_epochs, optimizer, train_loader, valid_loader,
                   scheduler, patience=10, load=checkpoint, save=checkpoint, log_dir=logs, restart=False)
model.load_state_dict(best_state['model'])  # loads the best model acquired during training
Couldn't find checkpoint file! ./all/ (training without reinitialization)
Epoch #0: [train: 8.01e-01 > 51.02%][valid: 6.90e-01 > 54.20%] @ 1.00e-04
Epoch #1: [train: 6.95e-01 > 54.94%][valid: 6.65e-01 > 61.28%] @ 1.00e-04
Epoch #2: [train: 6.80e-01 > 57.67%][valid: 6.53e-01 > 61.56%] @ 1.00e-04
Epoch #3: [train: 6.76e-01 > 58.54%][valid: 6.58e-01 > 60.24%] @ 1.00e-04
Epoch #4: [train: 6.60e-01 > 60.82%][valid: 6.56e-01 > 60.12%] @ 1.00e-04
Epoch #5: [train: 6.51e-01 > 62.10%][valid: 6.34e-01 > 64.76%] @ 1.00e-04
Epoch #6: [train: 6.43e-01 > 63.32%][valid: 6.21e-01 > 65.52%] @ 1.00e-04
Epoch #7: [train: 6.36e-01 > 64.58%][valid: 6.04e-01 > 68.04%] @ 1.00e-04
Epoch #8: [train: 6.29e-01 > 65.19%][valid: 6.06e-01 > 68.40%] @ 1.00e-04
Epoch #9: [train: 6.25e-01 > 65.35%][valid: 6.03e-01 > 68.16%] @ 1.00e-04

Tensorboard snapshot:

Example conclusion

We implemented a standard deep learning workflow in PyTorch. The steps are (in order):

  • Defined train_loader and valid_loader that load and transform our dataset in batches asynchronously
  • Defined a convolutional neural network called Net
  • Implemented full_epoch() that does a single training or validation epoch depending on whether it was fed an optimizer
  • Implemented train(), that uses full_epoch(), with early stopping and nice learning rate scheduling
  • Trained a certain initialization of Net with hand-picked hyper-parameters
  • Usually, you would end this by a testing phase (i.e., test()) but I will leave that to you because now you know how to do it yourself

Final notes and remarks:

  • As an exercise, try to think about how to modify train() to handle the case when we don't have validation data (i.e., valid_loader is None). You will certainly need to make a decent number of changes to incorporate this.
  • If you want to build an API similar to Keras yourself (although you don't need to because such high-level APIs exist already as mentioned at the beginning of this notebook), you will need to add some parameters to train() for callbacks that you call before and after training. Then, put all the early stopping and the learning rate scheduling business outside train() by implementing them through these callbacks.
  • Please don't hesitate to contact me by finding me on GitHub if you have any further inquiries.
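As a starting point for the callback idea in the notes above, here is one possible shape for such an interface, sketched in plain Python. Everything in it (the `Callback` hooks, `EarlyStopping`, `train_with_callbacks`, and the `run_epoch` argument) is hypothetical and only meant to show how early stopping can move out of the training loop.

```python
class Callback:
    """Base class: override only the hooks you need (hypothetical interface)."""

    def on_train_begin(self):
        pass

    def on_epoch_end(self, epoch, metrics):
        """Return True to request that training stops."""
        return False

    def on_train_end(self):
        pass


class EarlyStopping(Callback):
    def __init__(self, patience):
        self.patience = patience
        self.best_loss = float('inf')
        self.num_bad_epochs = 0

    def on_epoch_end(self, epoch, metrics):
        if self.best_loss < metrics['loss']:
            self.num_bad_epochs += 1
        else:
            self.best_loss = metrics['loss']
            self.num_bad_epochs = 0
        return self.num_bad_epochs >= self.patience


def train_with_callbacks(run_epoch, num_epochs, callbacks=()):
    """`run_epoch(epoch)` stands in for a training plus validation epoch."""
    for callback in callbacks:
        callback.on_train_begin()
    last_epoch = -1
    for epoch in range(num_epochs):
        metrics = run_epoch(epoch)
        last_epoch = epoch
        # call every callback (no short-circuit) so each one sees every epoch
        stop_requests = [cb.on_epoch_end(epoch, metrics) for cb in callbacks]
        if any(stop_requests):
            break
    for callback in callbacks:
        callback.on_train_end()
    return last_epoch
```

A learning rate scheduler, checkpointing, or tensorboard logging could each become another `Callback` subclass, which leaves the loop itself free of that logic.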