Hello World! in bash
This tutorial is a companion to MapReduce_Primer_HelloWorld.ipynb. Here the implementation is carried out in Bash and requires only a few lines of code.
For this tutorial, we are going to download the core Hadoop distribution and run Hadoop in local standalone mode:
❝ By default, Hadoop is configured to run in a non-distributed mode, as a single Java process. ❞
(see https://hadoop.apache.org/docs/stable/.../Standalone_Operation)
We are going to run a MapReduce job using Hadoop's streaming utility. This is not to be confused with real-time stream processing:
❝ Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer. ❞
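Since a streaming mapper or reducer is just an executable that reads standard input and writes standard output, a job can be roughly approximated with an ordinary shell pipeline before Hadoop is involved at all. The sketch below is an illustration under assumptions, not part of the original notebook: the function name `simulate_streaming` and the file `sample.txt` are hypothetical, and `sort` only stands in for Hadoop's shuffle-and-sort phase.

```bash
#!/usr/bin/env bash
# Hypothetical helper: approximate a streaming job as
#   input | mapper | sort | reducer
# $1 = input file, $2 = mapper command, $3 = reducer command
simulate_streaming() {
  local input="$1" mapper="$2" reducer="$3"
  # LC_ALL=C makes sort compare bytes, independent of the user's locale
  bash -c "$mapper" < "$input" | LC_ALL=C sort | bash -c "$reducer"
}

# Illustrative input file (not the notebook's hello.txt)
printf 'Hello, World!\n' > sample.txt

# The mapper upper-cases each line; the reducer passes records through unchanged
simulate_streaming sample.txt "tr 'a-z' 'A-Z'" "cat"
```

With this input the pipeline prints `HELLO, WORLD!`. It mimics only the data flow; Hadoop's key/value handling is not reproduced.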
When no mapper or reducer is given, MapReduce streaming defaults to IdentityMapper and IdentityReducer, so neither needs to be specified explicitly.
Both input and output are ordinary files, since Hadoop's default file system is the local file system (as specified by the fs.defaultFS property in core-default.xml).
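To get a feel for what an identity mapper and identity reducer leave behind, the job can be imitated in plain Bash. This is a rough local emulation under stated assumptions, not Hadoop code: it assumes each input line is keyed by its byte offset in the file, that records are ordered numerically by that key, and that key and value are separated by a tab. The function name `identity_job` and the file `demo.txt` are made up for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical emulation of "identity map + identity reduce" over a text file.
# The body runs in a subshell ( ... ) so the locale change stays local.
identity_job() (
  LC_ALL=C            # make ${#line} count bytes rather than characters
  offset=0
  while IFS= read -r line || [ -n "$line" ]; do
    # Emit key (byte offset), a tab, then the unchanged line as the value
    printf '%d\t%s\n' "$offset" "$line"
    offset=$(( offset + ${#line} + 1 ))   # +1 for the newline
  done < "$1" | sort -n                   # order records by numeric key
)

# Illustrative two-line input
printf 'Hello, World!\nGoodbye!\n' > demo.txt
identity_job demo.txt
```

For the two-line `demo.txt` this prints `0`, a tab, and `Hello, World!`, followed by `14`, a tab, and `Goodbye!`. The second key is 14 because the first line occupies 13 bytes plus a newline.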
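%%bash
# set -x  # uncomment to trace each command
HADOOP_URL="https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz"
HADOOP_DIR="$(basename "$HADOOP_URL" .tar.gz)"
# Download and unpack the Hadoop distribution, skipping steps already done
wget --quiet --no-clobber "$HADOOP_URL"
[ -d "$HADOOP_DIR" ] || tar -xzf "$(basename "$HADOOP_URL")"
# Put the Hadoop commands (including mapred) on the PATH
HADOOP_HOME="$(pwd)/$HADOOP_DIR"
PATH="$HADOOP_HOME/bin:$PATH"
# Hadoop needs a Java runtime; install one only if none is found
command -v java >/dev/null || apt install -y openjdk-19-jre-headless
export JAVA_HOME="$(realpath "$(command -v java)" | sed 's/\/bin\/java$//')"
# Create the one-line input file
echo "Hello, World!" > hello.txt
# The output directory must not already exist, so give it a timestamped name
output_dir="output$(date +"%Y%m%dT%H%M")"
sleep 10
# Run the streaming job with the default (identity) mapper and reducer
mapred streaming -input hello.txt -output "$output_dir"
# List all output directories, then print the result of this run
ls -lR output*
cat "$output_dir/part-00000"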
output20240324T1947:
total 4
-rw-r--r-- 1 root root 16 Mar 24 19:47 part-00000
-rw-r--r-- 1 root root  0 Mar 24 19:47 _SUCCESS

output20240324T1948:
total 4
-rw-r--r-- 1 root root 16 Mar 24 19:48 part-00000
-rw-r--r-- 1 root root  0 Mar 24 19:48 _SUCCESS
0	Hello, World!
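The listing shows the two files that a successful run leaves in each output directory: an empty `_SUCCESS` marker and the result in `part-00000`. A script could check for both before reading the result. The following is a minimal sketch; the function name `check_job_output` is hypothetical, and the demonstration builds a stand-in directory (`demo_output`) by hand rather than running Hadoop.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the job result only if the run looks complete.
# $1 = output directory of a streaming job
check_job_output() {
  local dir="$1"
  # _SUCCESS is the empty marker file seen in the listing above
  [ -f "$dir/_SUCCESS" ] || { echo "no _SUCCESS marker in $dir" >&2; return 1; }
  # part-00000 holds the output of the single reduce task
  [ -f "$dir/part-00000" ] || { echo "no part-00000 in $dir" >&2; return 1; }
  cat "$dir/part-00000"
}

# Stand-in for a real output directory, for demonstration only
mkdir -p demo_output
: > demo_output/_SUCCESS
printf '0\tHello, World!\n' > demo_output/part-00000

check_job_output demo_output
```

In the notebook's setting the call would presumably be `check_job_output "$output_dir"`.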
2024-03-24 19:48:27,531 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2024-03-24 19:48:27,701 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2024-03-24 19:48:27,702 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2024-03-24 19:48:27,727 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2024-03-24 19:48:28,055 INFO mapred.FileInputFormat: Total input files to process : 1
2024-03-24 19:48:28,082 INFO mapreduce.JobSubmitter: number of splits:1
2024-03-24 19:48:28,411 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local723241263_0001
2024-03-24 19:48:28,411 INFO mapreduce.JobSubmitter: Executing with tokens: []
2024-03-24 19:48:28,686 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
2024-03-24 19:48:28,688 INFO mapreduce.Job: Running job: job_local723241263_0001
2024-03-24 19:48:28,697 INFO mapred.LocalJobRunner: OutputCommitter set in config null
2024-03-24 19:48:28,700 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter
2024-03-24 19:48:28,709 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2
2024-03-24 19:48:28,713 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2024-03-24 19:48:28,767 INFO mapred.LocalJobRunner: Waiting for map tasks
2024-03-24 19:48:28,773 INFO mapred.LocalJobRunner: Starting task: attempt_local723241263_0001_m_000000_0
2024-03-24 19:48:28,822 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2
2024-03-24 19:48:28,825 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2024-03-24 19:48:28,855 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
2024-03-24 19:48:28,866 INFO mapred.MapTask: Processing split: file:/content/hello.txt:0+14
2024-03-24 19:48:28,884 INFO mapred.MapTask: numReduceTasks: 1
2024-03-24 19:48:28,969 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
2024-03-24 19:48:28,969 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
2024-03-24 19:48:28,969 INFO mapred.MapTask: soft limit at 83886080
2024-03-24 19:48:28,969 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
2024-03-24 19:48:28,969 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
2024-03-24 19:48:28,976 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2024-03-24 19:48:28,985 INFO mapred.LocalJobRunner:
2024-03-24 19:48:28,985 INFO mapred.MapTask: Starting flush of map output
2024-03-24 19:48:28,985 INFO mapred.MapTask: Spilling map output
2024-03-24 19:48:28,985 INFO mapred.MapTask: bufstart = 0; bufend = 22; bufvoid = 104857600
2024-03-24 19:48:28,985 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214396(104857584); length = 1/6553600
2024-03-24 19:48:28,993 INFO mapred.MapTask: Finished spill 0
2024-03-24 19:48:29,009 INFO mapred.Task: Task:attempt_local723241263_0001_m_000000_0 is done. And is in the process of committing
2024-03-24 19:48:29,015 INFO mapred.LocalJobRunner: file:/content/hello.txt:0+14
2024-03-24 19:48:29,015 INFO mapred.Task: Task 'attempt_local723241263_0001_m_000000_0' done.
2024-03-24 19:48:29,025 INFO mapred.Task: Final Counters for attempt_local723241263_0001_m_000000_0: Counters: 17
    File System Counters
        FILE: Number of bytes read=141410
        FILE: Number of bytes written=776343
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
    Map-Reduce Framework
        Map input records=1
        Map output records=1
        Map output bytes=22
        Map output materialized bytes=30
        Input split bytes=75
        Combine input records=0
        Spilled Records=1
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=0
        Total committed heap usage (bytes)=407896064
    File Input Format Counters
        Bytes Read=14
2024-03-24 19:48:29,025 INFO mapred.LocalJobRunner: Finishing task: attempt_local723241263_0001_m_000000_0
2024-03-24 19:48:29,026 INFO mapred.LocalJobRunner: map task executor complete.
2024-03-24 19:48:29,031 INFO mapred.LocalJobRunner: Waiting for reduce tasks
2024-03-24 19:48:29,035 INFO mapred.LocalJobRunner: Starting task: attempt_local723241263_0001_r_000000_0
2024-03-24 19:48:29,046 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2
2024-03-24 19:48:29,047 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2024-03-24 19:48:29,047 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
2024-03-24 19:48:29,054 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@4a68130c
2024-03-24 19:48:29,056 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2024-03-24 19:48:29,079 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=2382574336, maxSingleShuffleLimit=595643584, mergeThreshold=1572499072, ioSortFactor=10, memToMemMergeOutputsThreshold=10
2024-03-24 19:48:29,095 INFO reduce.EventFetcher: attempt_local723241263_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
2024-03-24 19:48:29,158 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local723241263_0001_m_000000_0 decomp: 26 len: 30 to MEMORY
2024-03-24 19:48:29,162 INFO reduce.InMemoryMapOutput: Read 26 bytes from map-output for attempt_local723241263_0001_m_000000_0
2024-03-24 19:48:29,164 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 26, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->26
2024-03-24 19:48:29,168 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
2024-03-24 19:48:29,170 INFO mapred.LocalJobRunner: 1 / 1 copied.
2024-03-24 19:48:29,170 INFO reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
2024-03-24 19:48:29,177 INFO mapred.Merger: Merging 1 sorted segments
2024-03-24 19:48:29,178 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 16 bytes
2024-03-24 19:48:29,179 INFO reduce.MergeManagerImpl: Merged 1 segments, 26 bytes to disk to satisfy reduce memory limit
2024-03-24 19:48:29,180 INFO reduce.MergeManagerImpl: Merging 1 files, 30 bytes from disk
2024-03-24 19:48:29,181 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
2024-03-24 19:48:29,181 INFO mapred.Merger: Merging 1 sorted segments
2024-03-24 19:48:29,182 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 16 bytes
2024-03-24 19:48:29,183 INFO mapred.LocalJobRunner: 1 / 1 copied.
2024-03-24 19:48:29,193 INFO mapred.Task: Task:attempt_local723241263_0001_r_000000_0 is done. And is in the process of committing
2024-03-24 19:48:29,195 INFO mapred.LocalJobRunner: 1 / 1 copied.
2024-03-24 19:48:29,195 INFO mapred.Task: Task attempt_local723241263_0001_r_000000_0 is allowed to commit now
2024-03-24 19:48:29,197 INFO output.FileOutputCommitter: Saved output of task 'attempt_local723241263_0001_r_000000_0' to file:/content/output20240324T1948
2024-03-24 19:48:29,198 INFO mapred.LocalJobRunner: reduce > reduce
2024-03-24 19:48:29,198 INFO mapred.Task: Task 'attempt_local723241263_0001_r_000000_0' done.
2024-03-24 19:48:29,199 INFO mapred.Task: Final Counters for attempt_local723241263_0001_r_000000_0: Counters: 24
    File System Counters
        FILE: Number of bytes read=141502
        FILE: Number of bytes written=776401
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
    Map-Reduce Framework
        Combine input records=0
        Combine output records=0
        Reduce input groups=1
        Reduce shuffle bytes=30
        Reduce input records=1
        Reduce output records=1
        Spilled Records=1
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=0
        Total committed heap usage (bytes)=407896064
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Output Format Counters
        Bytes Written=28
2024-03-24 19:48:29,199 INFO mapred.LocalJobRunner: Finishing task: attempt_local723241263_0001_r_000000_0
2024-03-24 19:48:29,200 INFO mapred.LocalJobRunner: reduce task executor complete.
2024-03-24 19:48:29,694 INFO mapreduce.Job: Job job_local723241263_0001 running in uber mode : false
2024-03-24 19:48:29,696 INFO mapreduce.Job: map 100% reduce 100%
2024-03-24 19:48:29,697 INFO mapreduce.Job: Job job_local723241263_0001 completed successfully
2024-03-24 19:48:29,707 INFO mapreduce.Job: Counters: 30
    File System Counters
        FILE: Number of bytes read=282912
        FILE: Number of bytes written=1552744
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
    Map-Reduce Framework
        Map input records=1
        Map output records=1
        Map output bytes=22
        Map output materialized bytes=30
        Input split bytes=75
        Combine input records=0
        Combine output records=0
        Reduce input groups=1
        Reduce shuffle bytes=30
        Reduce input records=1
        Reduce output records=1
        Spilled Records=2
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=0
        Total committed heap usage (bytes)=815792128
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=14
    File Output Format Counters
        Bytes Written=28
2024-03-24 19:48:29,707 INFO streaming.StreamJob: Output directory: output20240324T1948
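The counters at the end of the log offer a quick sanity check: one record entered the map phase and one record left the reduce phase. If the job's log has been captured to a file, such values can be pulled out with `grep`. The sketch below is illustrative only: the function name `get_counter` and the file `job.log` are assumptions, and the sample log is a short excerpt of the output above rather than a real capture.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read one counter value from a saved job log.
# $1 = counter name exactly as printed, $2 = log file
get_counter() {
  local name="$1" log="$2"
  # -o prints only the matching "name=value" text;
  # tail keeps the last occurrence, which is the job-level summary
  grep -o -- "${name}=[0-9]*" "$log" | tail -n 1 | cut -d= -f2
}

# Short excerpt of the log above, written by hand for the demonstration
cat > job.log <<'EOF'
2024-03-24 19:48:29,707 INFO mapreduce.Job: Counters: 30
        Map input records=1
        Reduce output records=1
EOF

get_counter "Map input records" job.log
get_counter "Reduce output records" job.log
```

Both calls print `1` for this excerpt. Matching on `name=value` rather than on whole lines keeps the helper usable even when the log's line breaks have been lost.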