MapReduce: A Primer with `Hello World!` in bash

This tutorial is a companion to MapReduce_Primer_HelloWorld.ipynb. Here the implementation is written in Bash and takes only a few lines of code.

For this tutorial, we are going to download the core Hadoop distribution and run Hadoop in local standalone mode:

❝ By default, Hadoop is configured to run in a non-distributed mode, as a single Java process. ❞

(see https://hadoop.apache.org/docs/stable/.../Standalone_Operation)

We are going to run a MapReduce job using Hadoop streaming. This is not to be confused with real-time stream processing:

❝ Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer. ❞

Hadoop streaming defaults to IdentityMapper and IdentityReducer, so the job below does not need to specify a mapper or a reducer: every input record is passed through unchanged.
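To get a feel for what an identity job does before running Hadoop, the data flow can be approximated with an ordinary shell pipeline. This is only a rough local sketch, not how Hadoop runs the job: `cat` stands in for both the identity mapper and the identity reducer, `sort` stands in for the shuffle phase, and the temporary directory `workdir` is a made-up name for illustration. Unlike the real job, this pipeline does not attach a key to each record.

```shell
# Rough local emulation of a streaming job: input | mapper | sort | reducer.
# Assumption: `cat` plays the role of the identity mapper and reducer.
workdir=$(mktemp -d)                          # hypothetical scratch directory
printf 'Hello, World!\n' > "$workdir/hello.txt"

# Records pass through both stages unchanged; sort mimics the shuffle step.
cat "$workdir/hello.txt" | cat | sort | cat > "$workdir/part-00000"

cat "$workdir/part-00000"
```

Replacing either `cat` with another executable gives a quick way to try out a mapper or reducer on a small file before submitting it to Hadoop.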

Both input and output are ordinary local files, since Hadoop's default filesystem is the local file system, as specified by the fs.defaultFS property in core-default.xml (default value file:///).

In [1]:

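%%bash
# Download and unpack the Hadoop distribution (skipped if already present)
HADOOP_URL="https://dlcdn.apache.org/hadoop/common/stable/hadoop-3.3.6.tar.gz"
wget --quiet --no-clobber "$HADOOP_URL"
[ ! -d "$(basename "$HADOOP_URL" .tar.gz)" ] && tar -xzf "$(basename "$HADOOP_URL")"
# Put Hadoop's bin directory (which contains the mapred command) on the PATH
HADOOP_HOME="$(pwd)/$(basename "$HADOOP_URL" .tar.gz)"
PATH="$HADOOP_HOME/bin:$PATH"
# Install a Java runtime if none is found, then point JAVA_HOME at it
which java >/dev/null || apt install -y openjdk-19-jre-headless
export JAVA_HOME=$(realpath "$(which java)" | sed 's/\/bin\/java$//')
# Create the input file
echo "Hello, World!" >hello.txt
# Timestamped output directory: the job fails if the output directory already exists
output_dir="output"$(date +"%Y%m%dT%H%M")
# No -mapper or -reducer is given, so the identity defaults are used
mapred streaming -input hello.txt -output "$output_dir" >log 2>&1
cat "$output_dir"/part-00000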

0	Hello, World!
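The leading `0` in the output is the record key. With Hadoop's default text input format, the key of each line is its byte offset within the input file, and the identity mapper and reducer pass that key through unchanged. The sketch below reproduces this "offset, tab, line" layout with plain awk, assuming newline-terminated lines; the temporary file `sample` and its second line are made up for illustration and are not part of the job above.

```shell
# Emulate "byte offset <TAB> line" keys for a small text file.
# The sample file and its second line are hypothetical illustration data.
sample=$(mktemp)
printf 'Hello, World!\nSecond line\n' > "$sample"

# LC_ALL=C makes length() count bytes; the +1 accounts for the newline.
offsets=$(LC_ALL=C awk '{ printf "%d\t%s\n", offset, $0; offset += length($0) + 1 }' "$sample")

echo "$offsets"
```

The first line starts at offset 0, matching the job output above; a second line would start at offset 14, since `Hello, World!` is 13 bytes plus one newline.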
