Transformations help us make the most of a limited dataset. This is especially valuable for video data, which is often difficult to collect in the first place: augmenting with transformations can substantially reduce the amount of raw footage needed.
In this notebook, we show examples of what different transformations look like, to help you decide which ones to use during training.
Import all the functions we need.
import sys
sys.path.append("../../")
import os
import time
import decord
import matplotlib.pyplot as plt
import numpy as np
import warnings
import shutil
from sklearn.metrics import accuracy_score
import torch
import torch.cuda as cuda
import torch.nn as nn
import torchvision
import urllib.request
from utils_cv.action_recognition.dataset import VideoDataset, DEFAULT_MEAN, DEFAULT_STD
from utils_cv.action_recognition.references.functional_video import denormalize
from utils_cv.action_recognition.references.transforms_video import (
CenterCropVideo,
NormalizeVideo,
RandomCropVideo,
RandomHorizontalFlipVideo,
RandomResizedCropVideo,
ResizeVideo,
ToTensorVideo,
)
from utils_cv.common.gpu import system_info
from utils_cv.common.data import data_path
warnings.filterwarnings('ignore')
system_info()
3.7.6 (default, Jan 8 2020, 19:59:22) [GCC 7.3.0]
PyTorch 1.2.0
Torch-vision 0.4.0a0
Available devices:
0: Tesla K80
%load_ext autoreload
%autoreload 2
Set parameters for this notebook:
VIDEO_PATH = os.path.join(str(data_path()), "drinking.mp4")
Since we'll be visualizing various transformations throughout this notebook, here's a simple helper function to display the frames of a clip.
def show_clip(clip, size_factor=600):
    """ Show frames in a clip """
    if isinstance(clip, torch.Tensor):
        # Convert [C, T, H, W] tensor to [T, H, W, C] numpy array
        clip = np.moveaxis(clip.numpy(), 0, -1)
    figsize = np.array([clip[0].shape[1] * len(clip), clip[0].shape[0]]) / size_factor
    fig, axs = plt.subplots(1, len(clip), figsize=figsize)
    for i, f in enumerate(clip):
        axs[i].axis("off")
        axs[i].imshow(f)
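Note that matplotlib clips float pixel values outside [0, 1] and prints a warning each time, which you'll see below once we display a normalized clip. If you prefer to silence those warnings, a small wrapper like the following (a hypothetical helper, not part of utils_cv) can clamp the values first:
# Hypothetical helper (not part of utils_cv): clamp float pixel values
# into [0, 1] before display to avoid matplotlib's clipping warnings.
def show_clip_clamped(clip, size_factor=600):
    if isinstance(clip, torch.Tensor):
        clip = torch.clamp(clip, 0.0, 1.0)
    show_clip(clip, size_factor=size_factor)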
For this notebook, we'll use a sample video stored in Azure blob storage to demonstrate the various kinds of transformations. The video is available at this location:
url = (
"https://cvbp.blob.core.windows.net/public/datasets/action_recognition/drinking.mp4"
)
Download the file from url and save it locally under VIDEO_PATH:
with urllib.request.urlopen(url) as response, open(VIDEO_PATH, "wb") as out_file:
shutil.copyfileobj(response, out_file)
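If you re-run this notebook, you may want to skip the download when the file is already on disk; a minimal guard using the same variables:
# Download only if the video isn't already on disk
if not os.path.exists(VIDEO_PATH):
    with urllib.request.urlopen(url) as response, open(VIDEO_PATH, "wb") as out_file:
        shutil.copyfileobj(response, out_file)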
We'll use decord to see how many frames the video has.
video_reader = decord.VideoReader(VIDEO_PATH)
video_length = len(video_reader)
print("Video length = {} frames".format(video_length))
Video length = 152 frames
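decord can also report the average frame rate, which gives us the video's duration:
fps = video_reader.get_avg_fps()
print("FPS = {:.2f}, duration = {:.1f}s".format(fps, video_length / fps))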
For the rest of the notebook, we'll use the first, middle, and last frames to visualize what the video transformations look like.
clip = [
video_reader[0].asnumpy(),
video_reader[video_length//2].asnumpy(),
video_reader[video_length-1].asnumpy(),
]
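Equivalently, decord's get_batch fetches several frames in one call, which is handy for longer frame lists:
# Same three frames, read in a single batched call
frame_idxs = [0, video_length // 2, video_length - 1]
clip = list(video_reader.get_batch(frame_idxs).asnumpy())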
Here's what the first, middle, and last frames look like without transformations:
show_clip(clip)
Finally, we'll create a t_clip tensor that contains each of these three frames so that we can apply our transformations to all of them at once.
# [T, H, W, C] numpy array to [C, T, H, W] tensor
t_clip = ToTensorVideo()(torch.from_numpy(np.array(clip)))
t_clip.shape
torch.Size([3, 3, 1080, 1920])
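The shape is [C, T, H, W]: 3 color channels, 3 frames, each 1080 x 1920 pixels. ToTensorVideo also converts the pixel values from uint8 in [0, 255] to floats in [0, 1], which a quick check confirms:
# ToTensorVideo rescales uint8 [0, 255] to float [0, 1]
print(t_clip.dtype, t_clip.min().item(), t_clip.max().item())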
This section of the notebook shows a variety of transformations you can apply when training your own model, each demonstrated on the same three-frame clip.
ResizeVideo resizes the frames; by default it preserves the aspect ratio:
show_clip(ResizeVideo(size=800)(t_clip))
Setting keep_ratio=False instead stretches the frames to the exact target size:
show_clip(ResizeVideo(size=800, keep_ratio=False)(t_clip))
CenterCropVideo crops a square of the given size from the center of each frame:
show_clip(CenterCropVideo(size=800)(t_clip))
RandomCropVideo crops at a random location, so running it twice on the same clip produces different results:
random_crop = RandomCropVideo(size=800)
show_clip(random_crop(t_clip))
show_clip(random_crop(t_clip))
RandomResizedCropVideo crops a region of random size and location, then resizes it to the target size:
random_resized_crop = RandomResizedCropVideo(size=800)
show_clip(random_resized_crop(t_clip))
show_clip(random_resized_crop(t_clip))
NormalizeVideo normalizes each channel using the given means and standard deviations. The result is no longer a valid image, so matplotlib clips it for display and warns:
norm_t_clip = NormalizeVideo(mean=DEFAULT_MEAN, std=DEFAULT_STD)(t_clip)
show_clip(norm_t_clip)
WARNING - Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).
WARNING - Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).
WARNING - Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).
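Under the hood, normalization is the per-channel operation (x - mean) / std. Assuming DEFAULT_MEAN and DEFAULT_STD are per-channel tuples (as their use above suggests), this sanity check on channel 0 should print True:
# Manual per-channel normalization for channel 0: (x - mean) / std
manual = (t_clip[0] - DEFAULT_MEAN[0]) / DEFAULT_STD[0]
print(torch.allclose(manual, norm_t_clip[0]))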
To visualize a normalized clip, denormalize reverses the operation:
show_clip(denormalize(norm_t_clip, mean=DEFAULT_MEAN, std=DEFAULT_STD))
RandomHorizontalFlipVideo flips the clip horizontally with the given probability, here 50%, so re-running the cell flips roughly half the time:
show_clip(RandomHorizontalFlipVideo(p=.5)(t_clip))
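At training time these transformations are typically chained together. Here's a minimal sketch using torchvision's Compose, which accepts any callables; the particular transforms and the 224-pixel output size are illustrative choices, not the repo's defaults:
from torchvision.transforms import Compose

# Illustrative augmentation pipeline; sizes and transform choices are assumptions
train_tfms = Compose([
    ToTensorVideo(),                     # [T, H, W, C] uint8 -> [C, T, H, W] float
    RandomResizedCropVideo(size=224),    # random crop + resize to 224 x 224
    RandomHorizontalFlipVideo(p=0.5),    # random left-right flip
    NormalizeVideo(mean=DEFAULT_MEAN, std=DEFAULT_STD),
])

# Apply to a raw [T, H, W, C] uint8 clip, e.g. the frames we read earlier
sample = train_tfms(torch.from_numpy(np.array(clip)))
print(sample.shape)  # torch.Size([3, 3, 224, 224])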