Notebook --- sidebar_label: 'Simple Parallel Workloads' sidebar_position: 2 description: "Parallel Video Resizing via File Sharding" ---

Parallel Video Resizing via File Sharding¶

Many data engineering workloads consist of embarrassingly parallel workloads where you want to run a simple execution on a large number of files. In this notebook, we will use the Sharding functionality in Bacalhau to run a simple video filter on a large number of video files.

Although you would normally you would use your own container and script to make your workloads reproducible, in this example we will use a pre-built container and CLI arguments to allow you to make changes. You can find the container on docker hub.

Prerequistes¶

Make sure you have the latest bacalhau client installed by following the getting started instructions or using the installation command below (which installs Bacalhau local to the notebook).

In [1]:

!command -v bacalhau >/dev/null 2>&1 || (export BACALHAU_INSTALL_DIR=.; curl -sL https://get.bacalhau.org/install.sh | bash)
path=!echo $PATH
%env PATH=./:{path[0]}

Your system is linux_amd64
No BACALHAU detected. Installing fresh BACALHAU CLI...
Getting the latest BACALHAU CLI...
Installing v0.3.15 BACALHAU CLI...
Downloading https://github.com/filecoin-project/bacalhau/releases/download/v0.3.15/bacalhau_v0.3.15_linux_amd64.tar.gz ...
Downloading sig file https://github.com/filecoin-project/bacalhau/releases/download/v0.3.15/bacalhau_v0.3.15_linux_amd64.tar.gz.signature.sha256 ...
Verified OK
Extracting tarball ...
NOT verifying Bin
bacalhau installed into . successfully.
Client Version: v0.3.15
Server Version: v0.3.15
env: PATH=./:/home/gitpod/.pyenv/versions/3.8.13/bin:/home/gitpod/.pyenv/libexec:/home/gitpod/.pyenv/plugins/python-build/bin:/home/gitpod/.pyenv/plugins/pyenv-virtualenv/bin:/home/gitpod/.pyenv/plugins/pyenv-update/bin:/home/gitpod/.pyenv/plugins/pyenv-installer/bin:/home/gitpod/.pyenv/plugins/pyenv-doctor/bin:/home/gitpod/.pyenv/shims:/ide/bin/remote-cli:/home/gitpod/.nix-profile/bin:/home/gitpod/.local/bin:/home/gitpod/.sdkman/candidates/maven/current/bin:/home/gitpod/.sdkman/candidates/java/current/bin:/home/gitpod/.sdkman/candidates/gradle/current/bin:/workspace/.cargo/bin:/home/gitpod/.rvm/gems/ruby-3.1.2/bin:/home/gitpod/.rvm/gems/ruby-3.1.2@global/bin:/home/gitpod/.rvm/rubies/ruby-3.1.2/bin:/home/gitpod/.pyenv/plugins/pyenv-virtualenv/shims:/home/gitpod/.pyenv/shims:/workspace/go/bin:/home/gitpod/.nix-profile/bin:/ide/bin/remote-cli:/home/gitpod/go/bin:/home/gitpod/go-packages/bin:/home/gitpod/.nvm/versions/node/v16.19.0/bin:/home/gitpod/.yarn/bin:/home/gitpod/.pnpm:/home/gitpod/.pyenv/bin:/workspace/.rvm/bin:/home/gitpod/.cargo/bin:/home/linuxbrew/.linuxbrew/bin:/home/linuxbrew/.linuxbrew/sbin/:/home/gitpod/.local/bin:/usr/games:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/home/gitpod/.nvm/versions/node/v16.19.0/bin:/home/gitpod/.rvm/bin

In [2]:

%%bash
bacalhau version

Client Version: v0.3.15
Server Version: v0.3.15

Submit the workload¶

To submit a workload to Bacalhau you can use the bacalhau docker run command. This allows you to pass input data volume with a -v CID:path argument just like Docker, except the left-hand side of the argument is a content identifier (CID). This results in Bacalhau mounting a data volume inside the container. By default, Bacalhau mounts the input volume at the path /inputs inside the container.

Bacalhau also mounts a data volume to store output data. By default bacalhau docker run creates an output data volume mounted at /outputs. This is a convenient location to store the results of your job. See below for an example.

And to shard across files in the input directory, we need to pass three (optional) arguments to the command:

sharding-base-path - the path to the directory you want to shard over
sharding-glob-pattern - the pattern to match files in the directory
sharding-batch-size - the number of files to pass into each job

A Simple Video Resize Example¶

In this example, you will create 72px wide video thumbnails for all the videos in the inputs directory. The outputs directory will contain the thumbnails for each video. We will shard by 1 video per job, and use the linuxserver/ffmpeg container to resize the videos.

Note that Bacalhau overwrites the default entrypoint so we must run the full command after the -- argument. In this line you will list all of the mp4 files in the /inputs directory and execute ffmpeg against each instance.

In [3]:

%%bash --out job_id
bacalhau docker run \
  --wait \
  --wait-timeout-secs 100 \
  --id-only \
  --sharding-base-path "/inputs" \
  --sharding-glob-pattern "*.mp4" \
  --sharding-batch-size 1 \
  -v Qmd9CBYpdgCLuCKRtKRRggu24H72ZUrGax5A9EYvrbC72j:/inputs \
  linuxserver/ffmpeg -- \
  bash -c 'find /inputs -iname "*.mp4" -printf "%f\n" | xargs -I{} ffmpeg -y -i /inputs/{} -vf "scale=-1:72,setsar=1:1" /outputs/scaled_{}'

In [4]:

%env JOB_ID={job_id}

env: JOB_ID=c1ebae42-32b7-4d33-9885-704c7e6253b5

Get Results¶

Now let's download and display the result from the results directory. We can use the bacalhau results command to download the results from the output data volume. The --output-dir argument specifies the directory to download the results to.

In [5]:

%%bash
mkdir -p ./results # Temporary directory to store the results
bacalhau get --output-dir ./results ${JOB_ID} # Download the results

Fetching results of job 'c1ebae42-32b7-4d33-9885-704c7e6253b5'...
Results for job 'c1ebae42-32b7-4d33-9885-704c7e6253b5' have been written to...
./results

In [6]:

%%bash
# Copy the files to the local directory, to allow the documentation scripts to copy them to the right place
cp results/combined_results/outputs/* ./ && rm -rf results/combined_results/outputs/*
# Remove any spaces from the filenames
for f in *\ *; do mv "$f" "${f// /_}"; done

In [7]:

import glob
from IPython.display import Video, display
for file in glob.glob('*.mp4'):
    display(Video(filename=file))

<video src={require('./scaled_Bird_flying_over_the_lake.mp4').default} controls > Your browser does not support the video element.