Many data engineering workloads consist of embarrassingly parallel workloads where you want to run a simple execution on a large number of files. In this notebook, we will use the Sharding functionality in Bacalhau to run a simple video filter on a large number of video files.
Although you would normally you would use your own container and script to make your workloads reproducible, in this example we will use a pre-built container and CLI arguments to allow you to make changes. You can find the container on docker hub.
Make sure you have the latest bacalhau
client installed by following the getting started instructions or using the installation command below (which installs Bacalhau local to the notebook).
!command -v bacalhau >/dev/null 2>&1 || (export BACALHAU_INSTALL_DIR=.; curl -sL https://get.bacalhau.org/install.sh | bash)
path=!echo $PATH
%env PATH=./:{path[0]}
Your system is linux_amd64 No BACALHAU detected. Installing fresh BACALHAU CLI... Getting the latest BACALHAU CLI... Installing v0.3.15 BACALHAU CLI... Downloading https://github.com/filecoin-project/bacalhau/releases/download/v0.3.15/bacalhau_v0.3.15_linux_amd64.tar.gz ... Downloading sig file https://github.com/filecoin-project/bacalhau/releases/download/v0.3.15/bacalhau_v0.3.15_linux_amd64.tar.gz.signature.sha256 ... Verified OK Extracting tarball ... NOT verifying Bin bacalhau installed into . successfully. Client Version: v0.3.15 Server Version: v0.3.15 env: PATH=./:/home/gitpod/.pyenv/versions/3.8.13/bin:/home/gitpod/.pyenv/libexec:/home/gitpod/.pyenv/plugins/python-build/bin:/home/gitpod/.pyenv/plugins/pyenv-virtualenv/bin:/home/gitpod/.pyenv/plugins/pyenv-update/bin:/home/gitpod/.pyenv/plugins/pyenv-installer/bin:/home/gitpod/.pyenv/plugins/pyenv-doctor/bin:/home/gitpod/.pyenv/shims:/ide/bin/remote-cli:/home/gitpod/.nix-profile/bin:/home/gitpod/.local/bin:/home/gitpod/.sdkman/candidates/maven/current/bin:/home/gitpod/.sdkman/candidates/java/current/bin:/home/gitpod/.sdkman/candidates/gradle/current/bin:/workspace/.cargo/bin:/home/gitpod/.rvm/gems/ruby-3.1.2/bin:/home/gitpod/.rvm/gems/ruby-3.1.2@global/bin:/home/gitpod/.rvm/rubies/ruby-3.1.2/bin:/home/gitpod/.pyenv/plugins/pyenv-virtualenv/shims:/home/gitpod/.pyenv/shims:/workspace/go/bin:/home/gitpod/.nix-profile/bin:/ide/bin/remote-cli:/home/gitpod/go/bin:/home/gitpod/go-packages/bin:/home/gitpod/.nvm/versions/node/v16.19.0/bin:/home/gitpod/.yarn/bin:/home/gitpod/.pnpm:/home/gitpod/.pyenv/bin:/workspace/.rvm/bin:/home/gitpod/.cargo/bin:/home/linuxbrew/.linuxbrew/bin:/home/linuxbrew/.linuxbrew/sbin/:/home/gitpod/.local/bin:/usr/games:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/home/gitpod/.nvm/versions/node/v16.19.0/bin:/home/gitpod/.rvm/bin
%%bash
bacalhau version
Client Version: v0.3.15 Server Version: v0.3.15
To submit a workload to Bacalhau you can use the bacalhau docker run
command. This allows you to pass input data volume with a -v CID:path
argument just like Docker, except the left-hand side of the argument is a content identifier (CID). This results in Bacalhau mounting a data volume inside the container. By default, Bacalhau mounts the input volume at the path /inputs
inside the container.
Bacalhau also mounts a data volume to store output data. By default bacalhau docker run
creates an output data volume mounted at /outputs
. This is a convenient location to store the results of your job. See below for an example.
And to shard across files in the input directory, we need to pass three (optional) arguments to the command:
sharding-base-path
- the path to the directory you want to shard oversharding-glob-pattern
- the pattern to match files in the directorysharding-batch-size
- the number of files to pass into each jobIn this example, you will create 72px wide video thumbnails for all the videos in the inputs
directory. The outputs
directory will contain the thumbnails for each video. We will shard by 1 video per job, and use the linuxserver/ffmpeg
container to resize the videos.
Note that Bacalhau overwrites the default entrypoint so we must run the full command after the --
argument. In this line you will list all of the mp4 files in the /inputs
directory and execute ffmpeg
against each instance.
%%bash --out job_id
bacalhau docker run \
--wait \
--wait-timeout-secs 100 \
--id-only \
--sharding-base-path "/inputs" \
--sharding-glob-pattern "*.mp4" \
--sharding-batch-size 1 \
-v Qmd9CBYpdgCLuCKRtKRRggu24H72ZUrGax5A9EYvrbC72j:/inputs \
linuxserver/ffmpeg -- \
bash -c 'find /inputs -iname "*.mp4" -printf "%f\n" | xargs -I{} ffmpeg -y -i /inputs/{} -vf "scale=-1:72,setsar=1:1" /outputs/scaled_{}'
%env JOB_ID={job_id}
env: JOB_ID=c1ebae42-32b7-4d33-9885-704c7e6253b5
Now let's download and display the result from the results directory. We can use the bacalhau results
command to download the results from the output data volume. The --output-dir
argument specifies the directory to download the results to.
%%bash
mkdir -p ./results # Temporary directory to store the results
bacalhau get --output-dir ./results ${JOB_ID} # Download the results
Fetching results of job 'c1ebae42-32b7-4d33-9885-704c7e6253b5'... Results for job 'c1ebae42-32b7-4d33-9885-704c7e6253b5' have been written to... ./results
%%bash
# Copy the files to the local directory, to allow the documentation scripts to copy them to the right place
cp results/combined_results/outputs/* ./ && rm -rf results/combined_results/outputs/*
# Remove any spaces from the filenames
for f in *\ *; do mv "$f" "${f// /_}"; done
import glob
from IPython.display import Video, display
for file in glob.glob('*.mp4'):
display(Video(filename=file))
<video src={require('./scaled_Bird_flying_over_the_lake.mp4').default} controls >
Your browser does not support the video
element.