Following Apache's official documentation, Hadoop: Setting up a Single Node Cluster, we are going to launch a single-node Hadoop cluster in pseudo-distributed mode (see Pseudo-Distributed Operation). As opposed to local (standalone) operation, where a single Java Virtual Machine is responsible for all services, in pseudo-distributed mode each Hadoop daemon runs in a separate Java process.
In this notebook, we are going to rely as much as possible on Python commands. This is the companion notebook to Hadoop: Setting up a Single Node Cluster, which is implemented in bash.
This tutorial deliberately takes a detailed, and sometimes tedious, approach in order to offer a comprehensive learning experience.
The intention is to help you learn thoroughly and understand the motivation behind each step. Even if it may seem a bit pedantic at times, this thoroughness should help you grasp the significance of every action in the tutorial.
While the notebook is designed to be interactive, several commands may not yield real-time output. This is in the nature of the working context: Big Data jobs typically run in batch mode, where data is processed in large chunks rather than interactively, so some latency before results appear is to be expected even in an interactive session.
Additionally, even though in this tutorial we are not really dealing with large amounts of data, keep in mind that a pseudo-distributed Hadoop installation is not the most efficient environment, since it emulates a cluster on a single (virtual) machine.
All this is to say that you should be ready to invest some time in this tutorial and arm yourself with patience.
Image conveying a sense of tranquility (photo by Billy Williams on Unsplash).
# URL for downloading Hadoop
HADOOP_URL = "https://dlcdn.apache.org/hadoop/common/stable/hadoop-3.3.6.tar.gz"
# logging level (should be one of: DEBUG, INFO, WARNING, ERROR, CRITICAL)
LOGGING_LEVEL = "INFO" #@param ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]
import sys
import logging
import subprocess
import os
!pip install Crypto
!pip install pycryptodome
import Crypto
!pip install ssh_utilities
from ssh_utilities import Connection
import shutil
from pathlib import Path
import urllib.request
import tarfile
import xml.etree.ElementTree as ET
import xml.dom.minidom
import glob
# true if running on Google Colab
IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
from google.colab import output
Successfully installed Crypto-1.4.1 Naked-0.1.32 shellescape-3.8.1
Successfully installed pycryptodome-3.20.0
Successfully installed bcrypt-4.1.2 colorama-0.4.6 paramiko-3.4.0 pynacl-1.5.0 ssh_utilities-0.15.2
for handler in logging.root.handlers[:]:
logging.root.removeHandler(handler)
logging_level = getattr(logging, LOGGING_LEVEL.upper(), 10)
logging.basicConfig(level=logging_level, \
format='%(asctime)s - %(levelname)s: %(message)s', \
datefmt='%d-%b-%y %I:%M:%S %p')
logger = logging.getLogger('my_logger')
JAVA_HOME
On Google Colab Java is already available, but we accommodate the general case of an Ubuntu machine where Java needs to be installed (we pick the openjdk-19-jre-headless version).
# set variable JAVA_HOME (install Java if necessary)
def is_java_installed():
    java_path = shutil.which("java")
    if java_path is None:
        return None
    os.environ['JAVA_HOME'] = os.path.realpath(java_path).split('/bin')[0]
    return os.environ['JAVA_HOME']
def install_java():
# Uncomment and modify the desired version
# java_version= 'openjdk-11-jre-headless'
# java_version= 'default-jre'
# java_version= 'openjdk-17-jre-headless'
# java_version= 'openjdk-18-jre-headless'
java_version= 'openjdk-19-jre-headless'
print(f"Java not found. Installing {java_version} ... (this might take a while)")
try:
cmd = f"apt install -y {java_version}"
subprocess_output = subprocess.run(cmd, shell=True, check=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
stdout_result = subprocess_output.stdout
# Process the results as needed
logger.info("Done installing Java {}".format(java_version))
os.environ['JAVA_HOME'] = os.path.realpath(shutil.which("java")).split('/bin')[0]
logger.info("JAVA_HOME is {}".format(os.environ['JAVA_HOME']))
except subprocess.CalledProcessError as e:
# Handle the error if the command returns a non-zero exit code
logger.warn("Command failed with return code {}".format(e.returncode))
logger.warn("stdout: {}".format(e.stdout))
# Install Java if not available
if is_java_installed():
logger.info("Java is already installed: {}".format(os.environ['JAVA_HOME']))
else:
logger.info("Installing Java")
install_java()
19-Feb-24 01:05:23 PM - INFO: Java is already installed: /usr/lib/jvm/java-11-openjdk-amd64
sshd server
The Hadoop cluster needs ssh for communication (even if it runs on a single node). See the official documentation: Setup passphraseless ssh.
We install openssh-server and test whether the installation was successful with the shell command ssh localhost "echo hi!" as well as with Connection from Python's ssh_utilities library.
openssh-server
logger.info("Installing {}".format("openssh-server"))
cmd = ["apt-get", "install", "openssh-server"]
try:
result = subprocess.check_output(cmd, stderr=subprocess.STDOUT, text=True)
except subprocess.CalledProcessError as e:
# Access the error message from the output attribute
error_message = e.output
print(f"Command failed with error message: {error_message}")
19-Feb-24 01:05:23 PM - INFO: Installing openssh-server
ssh configuration
On Colab we add StrictHostKeyChecking no to /etc/ssh/ssh_config, so that connecting to localhost does not require interactive confirmation.
if IN_COLAB:
ssh_config_file = "/etc/ssh/ssh_config"
with open(ssh_config_file, "r+") as f:
var = 'StrictHostKeyChecking no'
line_found = any(line.strip().startswith(var) for line in f)
if not line_found:
f.seek(0, os.SEEK_END)
f.write(var +"\n")
sshd service
logger.info("Starting {}".format("openssh-server"))
cmd = ["/etc/init.d/ssh", "restart"]
try:
result = subprocess.check_output(cmd, stderr=subprocess.STDOUT, text=True)
except subprocess.CalledProcessError as e:
# Access the error message from the output attribute
error_message = e.output
print(f"Command failed with error message: {error_message}")
19-Feb-24 01:05:35 PM - INFO: Starting openssh-server
The private/public key pair is needed for passwordless authentication.
Note: the pip package Crypto does not work on its own; it needs pycryptodome, which provides the importable Crypto module (see ModuleNotFoundError: No module named 'Crypto' Error).
from Crypto.PublicKey import RSA
key = RSA.generate(2048)
if not os.path.exists(os.path.join(os.path.expanduser('~'), '.ssh')):
    os.makedirs(os.path.join(os.path.expanduser('~'), '.ssh'), mode=0o700)
#os.chmod(os.path.join(os.path.expanduser('~'),'.ssh'), int('700', base=8))
with open(os.path.join(os.path.expanduser('~'), '.ssh/id_rsa'), 'wb') as f:
os.chmod(os.path.join(os.path.expanduser('~'),'.ssh/id_rsa'), int('0600', base=8))
f.write(key.export_key('PEM', passphrase=None))
pubkey = key.publickey()
with open(os.path.join(os.path.expanduser('~'), '.ssh/id_rsa.pub'), 'wb') as f:
f.write(pubkey.exportKey('OpenSSH'))
with open(os.path.join(os.path.expanduser('~'), '.ssh/authorized_keys'), 'w') as a:
with open(os.path.join(os.path.expanduser('~'),'.ssh/id_rsa.pub'), 'r') as f:
a.write(f.read())
os.chmod(os.path.join(os.path.expanduser('~'), '.ssh/authorized_keys'), int('600', base=8))
ssh connection
We are going to test if the passphraseless connection is working by running the command "echo hi!" on localhost over ssh. The output should be 'hi!' if the ssh connection is working.
1. Test with a shell command
!ssh localhost "echo hi!"
Warning: Permanently added 'localhost' (ED25519) to the list of known hosts. hi!
2. Test with Python ssh_utilities.Connection
private_key = os.path.join(os.path.expanduser('~'), '.ssh/id_rsa')
conn = Connection.open('root','localhost', private_key)
try:
command_output = conn.subprocess.run(['echo', 'hi!'], suppress_out=True, quiet=False,
capture_output=True, check=True, cwd=Path('/content'))
except Exception as err:
logger.error(err)
print("❌ An exception occurred:")
print(err)
else:
logger.info("Successfully executed command on localhost over ssh")
print("✅ Successfully executed command {} on localhost".format('"ssh localhost \'echo hi!\'"'))
print("Stdout: {}".format(command_output.stdout.decode("utf-8")))
19-Feb-24 01:05:36 PM - INFO: Connection object will not be thread safe
Will login with private RSA key located in /root/.ssh/id_rsa
19-Feb-24 01:05:36 PM - INFO: Will login with private RSA key located in /root/.ssh/id_rsa
Connecting to server: root@localhost
19-Feb-24 01:05:36 PM - INFO: Connecting to server: root@localhost
When running an executale on server always make sure that full path is specified!!!
19-Feb-24 01:05:36 PM - INFO: When running an executale on server always make sure that full path is specified!!! 19-Feb-24 01:05:36 PM - INFO: could not parse key with Ed25519Key 19-Feb-24 01:05:36 PM - INFO: could not parse key with DSSKey 19-Feb-24 01:05:36 PM - INFO: could not parse key with ECDSAKey 19-Feb-24 01:05:36 PM - INFO: trying to authenticate with private-key 19-Feb-24 01:05:36 PM - INFO: Connected (version 2.0, client OpenSSH_8.9p1) 19-Feb-24 01:05:36 PM - INFO: Authentication (publickey) successful! 19-Feb-24 01:05:36 PM - INFO: successfully authenticated with: private-key
Executing command on remote: echo hi!
19-Feb-24 01:05:36 PM - INFO: Successfully executed command on localhost over ssh
✅ Successfully executed command "ssh localhost 'echo hi!'" on localhost Stdout: hi!
Download the latest stable version of the core Hadoop distribution from one of the download mirror locations: https://www.apache.org/dyn/closer.cgi/hadoop/common/.
file_name = os.path.basename(HADOOP_URL)
if os.path.isfile(file_name):
logger.info("{} already exists, not downloading".format(file_name))
else:
logger.info("Downloading {}".format(file_name))
urllib.request.urlretrieve(HADOOP_URL, file_name)
19-Feb-24 01:05:36 PM - INFO: Downloading hadoop-3.3.6.tar.gz
dir_name = file_name[:-7]
if os.path.exists(dir_name):
logger.info("{} is already uncompressed".format(file_name))
else:
logger.info("Uncompressing {}".format(file_name))
tar = tarfile.open(file_name)
tar.extractall()
tar.close()
19-Feb-24 01:06:22 PM - INFO: Uncompressing hadoop-3.3.6.tar.gz
In this section we continue following Apache Hadoop's official documentation on how to set up a single node in pseudo-distributed mode (see Configuration).
os.environ['HADOOP_HOME'] = os.path.join(os.getcwd(), dir_name)
logger.info("HADOOP_HOME is {}".format(os.environ['HADOOP_HOME']))
os.environ['HADOOP_COMMON_HOME'] = os.environ['HADOOP_HOME']
logger.info("HADOOP_COMMON_HOME is {}".format(os.environ['HADOOP_COMMON_HOME']))
os.environ['HADOOP_MAPREDUCE_HOME'] = os.environ['HADOOP_HOME']
logger.info("HADOOP_MAPREDUCE_HOME is {}".format(os.environ['HADOOP_MAPREDUCE_HOME']))
os.environ['HADOOP_YARN_HOME'] = os.environ['HADOOP_HOME']
logger.info("HADOOP_YARN_HOME is {}".format(os.environ['HADOOP_YARN_HOME']))
os.environ['PATH'] = ':'.join([os.path.join(os.environ['HADOOP_HOME'], 'bin'), os.environ['PATH']])
logger.info("PATH is {}".format(os.environ['PATH']))
19-Feb-24 01:06:41 PM - INFO: HADOOP_HOME is /content/hadoop-3.3.6 19-Feb-24 01:06:41 PM - INFO: HADOOP_COMMON_HOME is /content/hadoop-3.3.6 19-Feb-24 01:06:41 PM - INFO: HADOOP_MAPREDUCE_HOME is /content/hadoop-3.3.6 19-Feb-24 01:06:41 PM - INFO: HADOOP_YARN_HOME is /content/hadoop-3.3.6 19-Feb-24 01:06:41 PM - INFO: PATH is /content/hadoop-3.3.6/bin:/opt/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/tools/node/bin:/tools/google-cloud-sdk/bin
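As an optional sanity check, we can verify that the Hadoop command-line tools are now resolvable from the updated PATH; a minimal sketch:
# optional: confirm that the Hadoop executables are reachable now that $HADOOP_HOME/bin is on the PATH
import shutil
for exe in ("hadoop", "hdfs", "yarn", "mapred"):
    path = shutil.which(exe)
    print(f"{exe:7s} -> {path if path else 'NOT FOUND'}")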
Throughout this notebook we're going to edit four important Hadoop configuration files:
core-site.xml, containing the site-specific configuration of the Hadoop core
hdfs-site.xml, containing the site-specific configuration of Hadoop's HDFS
yarn-site.xml, containing the site-specific configuration of Hadoop's YARN
mapred-site.xml, containing the site-specific configuration of Hadoop's MapReduce
Additionally, we are going to edit the file hadoop-env.sh, a helper file containing environment variables needed by the shell scripts used to start the cluster.
All these files can be found in etc/hadoop under $HADOOP_HOME (the Hadoop installation folder).
# site-specific configuration files
core_site_file = os.path.join(os.environ['HADOOP_HOME'],'etc/hadoop/core-site.xml')
hdfs_site_file = os.path.join(os.environ['HADOOP_HOME'],'etc/hadoop/hdfs-site.xml')
yarn_site_file = os.path.join(os.environ['HADOOP_HOME'],'etc/hadoop/yarn-site.xml')
mapred_site_file = os.path.join(os.environ['HADOOP_HOME'],'etc/hadoop/mapred-site.xml')
# Hadoop environment variables
hadoop_env_file = os.path.join(os.environ['HADOOP_HOME'],'etc/hadoop/hadoop-env.sh')
!ls $HADOOP_HOME/etc/hadoop/*xml $HADOOP_HOME/etc/hadoop/hadoop-env.sh
/content/hadoop-3.3.6/etc/hadoop/capacity-scheduler.xml /content/hadoop-3.3.6/etc/hadoop/core-site.xml /content/hadoop-3.3.6/etc/hadoop/hadoop-env.sh /content/hadoop-3.3.6/etc/hadoop/hadoop-policy.xml /content/hadoop-3.3.6/etc/hadoop/hdfs-rbf-site.xml /content/hadoop-3.3.6/etc/hadoop/hdfs-site.xml /content/hadoop-3.3.6/etc/hadoop/httpfs-site.xml /content/hadoop-3.3.6/etc/hadoop/kms-acls.xml /content/hadoop-3.3.6/etc/hadoop/kms-site.xml /content/hadoop-3.3.6/etc/hadoop/mapred-site.xml /content/hadoop-3.3.6/etc/hadoop/yarn-site.xml
In order to edit the site-specific configuration files in an orderly fashion, we have created the functions:
clear_conf_file(file) to remove all properties from the XML file file
edit_conf_file(file, propertyname, propertyvalue) to add a property name and value in the proper format to the XML file file
print_conf_file(file) to pretty-print the XML file file
def clear_conf_file(file):
tree = ET.parse(file)
root = tree.getroot()
# remove all properties
for el in list(root):
root.remove(el)
tree.write(file, encoding='utf-8', xml_declaration=True)
def edit_conf_file(file, propertyname, propertyvalue):
tree = ET.parse(file)
root = tree.getroot()
logger.info("add property {} to {}".format(propertyname, file))
property = ET.SubElement(root, 'property')
name = ET.SubElement(property, 'name')
name.text = propertyname
value = ET.SubElement(property, 'value')
value.text = propertyvalue
tree.write(file, encoding='utf-8', xml_declaration=True)
def print_conf_file(file):
# pretty-print
dom = xml.dom.minidom.parse(file)
print(dom.toprettyxml())
core-site.xml and hdfs-site.xml
Let us configure two properties:
fs.defaultFS (the URI of the default file system) to hdfs://localhost:9000 in core-site.xml
dfs.replication to $1$ (the default is $3$) in hdfs-site.xml
clear_conf_file(core_site_file)
edit_conf_file(core_site_file, 'fs.defaultFS', 'hdfs://localhost:9000')
print_conf_file(core_site_file)
19-Feb-24 01:06:41 PM - INFO: add property fs.defaultFS to /content/hadoop-3.3.6/etc/hadoop/core-site.xml
<?xml version="1.0" ?> <configuration> <property> <name>fs.defaultFS</name> <value>hdfs://localhost:9000</value> </property> </configuration>
clear_conf_file(hdfs_site_file)
edit_conf_file(hdfs_site_file, 'dfs.replication', '1')
print_conf_file(hdfs_site_file)
19-Feb-24 01:06:41 PM - INFO: add property dfs.replication to /content/hadoop-3.3.6/etc/hadoop/hdfs-site.xml
<?xml version="1.0" ?> <configuration> <property> <name>dfs.replication</name> <value>1</value> </property> </configuration>
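If you also need to read a property back from one of these files, a small companion helper in the same spirit could look like this (get_conf_value is a hypothetical name, not used elsewhere in the notebook):
def get_conf_value(file, propertyname):
    # return the value of propertyname from a site-specific XML file, or None if it is not set
    tree = ET.parse(file)
    for prop in tree.getroot().findall('property'):
        if prop.findtext('name') == propertyname:
            return prop.findtext('value')
    return None

print(get_conf_value(core_site_file, 'fs.defaultFS'))    # expected: hdfs://localhost:9000
print(get_conf_value(hdfs_site_file, 'dfs.replication')) # expected: 1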
hadoop-env.sh
users = ['HDFS_NAMENODE_USER', \
'HDFS_DATANODE_USER', \
'HDFS_SECONDARYNAMENODE_USER', \
'YARN_RESOURCEMANAGER_USER', \
'YARN_NODEMANAGER_USER', \
'YARN_PROXYSERVER_USER']
logger.info("Editing {}".format(hadoop_env_file))
with open(hadoop_env_file, "r+") as f:
    for u in users:
        var = 'export ' + u + '=root'
        f.seek(0)  # rewind before scanning for an existing line
        line_found = any(line.startswith(var) for line in f)
        if not line_found:
            f.seek(0, os.SEEK_END)
            f.write(var + "\n")
    f.seek(0)
    line_found = any(line.startswith('export JAVA_HOME=') for line in f)
    if not line_found:
        f.seek(0, os.SEEK_END)
        f.write('export JAVA_HOME=' + os.environ['JAVA_HOME'] + "\n")
19-Feb-24 01:06:41 PM - INFO: Editing /content/hadoop-3.3.6/etc/hadoop/hadoop-env.sh
By providing site-specific configuration files one can override any of the default properties. The default-value files corresponding to the site-specific files are:
core-default.xml
hdfs-default.xml
yarn-default.xml
mapred-default.xml
These are read-only files that contain all default values for Hadoop properties (see cluster setup) and can be viewed at:
list(glob.iglob(os.environ['HADOOP_HOME']+ '/**/*default.xml', recursive=True))
['/content/hadoop-3.3.6/share/doc/hadoop/hadoop-yarn/hadoop-yarn-common/yarn-default.xml', '/content/hadoop-3.3.6/share/doc/hadoop/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml', '/content/hadoop-3.3.6/share/doc/hadoop/hadoop-project-dist/hadoop-hdfs-rbf/hdfs-rbf-default.xml', '/content/hadoop-3.3.6/share/doc/hadoop/hadoop-project-dist/hadoop-common/core-default.xml', '/content/hadoop-3.3.6/share/doc/hadoop/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml']
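Since these default files are plain XML, a property's shipped default can also be looked up programmatically; a minimal sketch, assuming the hdfs-default.xml path matches the glob pattern below:
# look up the shipped default value of dfs.replication in hdfs-default.xml
default_files = glob.glob(os.environ['HADOOP_HOME'] + '/**/hdfs-default.xml', recursive=True)
if default_files:
    tree = ET.parse(default_files[0])
    for prop in tree.getroot().iter('property'):
        if prop.findtext('name') == 'dfs.replication':
            print('dfs.replication defaults to', prop.findtext('value'))  # expected: 3
            break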
print('\n'.join([core_site_file, hdfs_site_file, yarn_site_file, mapred_site_file]))
/content/hadoop-3.3.6/etc/hadoop/core-site.xml /content/hadoop-3.3.6/etc/hadoop/hdfs-site.xml /content/hadoop-3.3.6/etc/hadoop/yarn-site.xml /content/hadoop-3.3.6/etc/hadoop/mapred-site.xml
logger.info("Formatting NameNode")
cmd = ["hdfs", "namenode", "-format", "-nonInteractive"]
result = subprocess.call(cmd, stderr=subprocess.STDOUT)
19-Feb-24 01:06:41 PM - INFO: Formatting NameNode
Launch a single-node HDFS cluster with the command start-dfs.sh.
subprocess_output = subprocess.run([os.path.join(os.environ['HADOOP_HOME'], 'sbin', 'start-dfs.sh')], check=True, stdout=subprocess.PIPE)
print(subprocess_output.stdout.decode())
Starting namenodes on [localhost] Starting datanodes Starting secondary namenodes [5a42c671b202] 5a42c671b202: Warning: Permanently added '5a42c671b202' (ED25519) to the list of known hosts.
!/content/hadoop-3.3.6/sbin/stop-dfs.sh
!/content/hadoop-3.3.6/sbin/start-dfs.sh
Stopping namenodes on [localhost] Stopping datanodes Stopping secondary namenodes [5a42c671b202] Starting namenodes on [localhost] Starting datanodes Starting secondary namenodes [5a42c671b202]
We have started some Java virtual machines corresponding to different Hadoop services. By using lsof to show listening ports you should get something like this:
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
java 13177 root 341u IPv4 357089 0t0 TCP *:9870 (LISTEN)
java 13177 root 349u IPv4 358597 0t0 TCP 127.0.0.1:9000 (LISTEN)
java 13297 root 342u IPv4 359615 0t0 TCP *:9866 (LISTEN)
java 13297 root 345u IPv4 358674 0t0 TCP 127.0.0.1:40349 (LISTEN)
java 13297 root 374u IPv4 358786 0t0 TCP *:9864 (LISTEN)
java 13297 root 375u IPv4 358790 0t0 TCP *:9867 (LISTEN)
java 13515 root 343u IPv4 360840 0t0 TCP *:9868 (LISTEN)
!lsof -n -i -P +c0 -sTCP:LISTEN -ac java
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME java 2534 root 341u IPv4 55167 0t0 TCP *:9870 (LISTEN) java 2534 root 349u IPv4 55722 0t0 TCP 127.0.0.1:9000 (LISTEN) java 2645 root 342u IPv4 54075 0t0 TCP *:9866 (LISTEN) java 2645 root 345u IPv4 54085 0t0 TCP 127.0.0.1:40039 (LISTEN) java 2645 root 374u IPv4 56384 0t0 TCP *:9864 (LISTEN) java 2645 root 375u IPv4 56418 0t0 TCP *:9867 (LISTEN) java 2860 root 343u IPv4 57029 0t0 TCP *:9868 (LISTEN)
As you can see, we have multiple services running on different ports all spawned by different Java processes corresponding to:
port number | description | configured in
---|---|---
9870 | dfs.namenode.http-address | hdfs-default.xml
9000 | fs.defaultFS | as defined in core-site.xml
9864 | dfs.datanode.http.address | hdfs-default.xml
9866 | dfs.datanode.address | hdfs-default.xml
9867 | dfs.datanode.ipc.address | hdfs-default.xml
9868 | dfs.namenode.secondary.http-address | hdfs-default.xml
Once HDFS is up and running there should be a total of $7$ java listening ports; six of them are precisely ports $9870$, $9000$, $9864$, $9866$, $9867$, $9868$ (the seventh is bound to a randomly assigned port).
Let us check that this is the case by introducing a loop of up to 20 retries (one per second) that checks whether all the required ports are open. This check is needed for non-interactive runs of the notebook.
%%bash
counter=0
until [ $counter -gt 20 ]
do
((counter++))
sleep 1
# counting lines: the lsof output includes a header line, so 7 java ports give at least 8 lines; the second test checks the six well-known HDFS ports explicitly (6 + header = 7 lines)
if [ $(lsof -n -i -P +c0 -sTCP:LISTEN -ac java| wc -l ) -ge 8 ] && \
[ $(lsof -n -aiTCP:9870 -aiTCP:9000 -aiTCP:9864 -aiTCP:9866 -aiTCP:9867 -aiTCP:9868 -P +c0 -sTCP:LISTEN -ac java | wc -l) -eq 7 ]
then
echo "HDFS is up and running"
echo "Time to start: $counter secs"
exit
fi
done
echo "Some HDFS ports are missing. Wait some more or restart HDFS."
HDFS is up and running Time to start: 1 secs
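The same check can also be expressed in pure Python with the standard socket module; a rough equivalent of the bash loop above (a sketch, polling the same six well-known ports):
import socket
import time

hdfs_ports = [9870, 9000, 9864, 9866, 9867, 9868]

def ports_open(ports, host="localhost", timeout=1):
    # return True only if every port accepts a TCP connection
    for p in ports:
        try:
            with socket.create_connection((host, p), timeout=timeout):
                pass
        except OSError:
            return False
    return True

for attempt in range(20):
    if ports_open(hdfs_ports):
        print("HDFS is up and running (after {} attempt(s))".format(attempt + 1))
        break
    time.sleep(1)
else:
    print("Some HDFS ports are missing. Wait some more or restart HDFS.")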
lsof for viewing listening ports
lsof is an exceptionally valuable command. With the option -i it can be used to show processes listening for network connections.
An example: check for java listening on a specific port ($9864$)
!lsof -iTCP:9864 -n -P +c0 -sTCP:LISTEN -ac java
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME java 2645 root 374u IPv4 56384 0t0 TCP *:9864 (LISTEN)
Options of lsof:
-i specifies that you want to display only network files, that is, open network connections
-n and -P tell lsof to show IP addresses and ports in numeric form
+c0 is used to show a longer substring of the name of the UNIX command associated with the process (https://linux.die.net/man/8/lsof)
-sTCP:LISTEN filters for TCP connections in state LISTEN
Run the command hdfs dfsadmin -report.
subprocess_output = subprocess.run([os.path.join(os.environ['HADOOP_HOME'], 'bin', 'hdfs'), 'dfsadmin', '-report'], check=True, stdout=subprocess.PIPE)
print(subprocess_output.stdout.decode())
Configured Capacity: 115658190848 (107.72 GB) Present Capacity: 85179183104 (79.33 GB) DFS Remaining: 85179158528 (79.33 GB) DFS Used: 24576 (24 KB) DFS Used%: 0.00% Replicated Blocks: Under replicated blocks: 0 Blocks with corrupt replicas: 0 Missing blocks: 0 Missing blocks (with replication factor 1): 0 Low redundancy blocks with highest priority to recover: 0 Pending deletion blocks: 0 Erasure Coded Block Groups: Low redundancy block groups: 0 Block groups with corrupt internal blocks: 0 Missing block groups: 0 Low redundancy blocks with highest priority to recover: 0 Pending deletion blocks: 0 ------------------------------------------------- Live datanodes (1): Name: 127.0.0.1:9866 (localhost) Hostname: 5a42c671b202 Decommission Status : Normal Configured Capacity: 115658190848 (107.72 GB) DFS Used: 24576 (24 KB) Non DFS Used: 30462230528 (28.37 GB) DFS Remaining: 85179158528 (79.33 GB) DFS Used%: 0.00% DFS Remaining%: 73.65% Configured Cache Capacity: 0 (0 B) Cache Used: 0 (0 B) Cache Remaining: 0 (0 B) Cache Used%: 100.00% Cache Remaining%: 0.00% Xceivers: 0 Last contact: Mon Feb 19 13:07:50 UTC 2024 Last Block Report: Mon Feb 19 13:07:39 UTC 2024 Num of Blocks: 0
The namenode runs a Web interface displaying information about the current status of the cluster (see HdfsUserGuide.html#Web_Interface). By default, the dfs namenode web ui will listen on port $9870$.
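Since the Web UI is served over plain HTTP, the same status information can also be fetched programmatically; a minimal sketch, assuming the default port $9870$ and the standard /jmx endpoint exposed by Hadoop daemons:
import json
import urllib.request

# query the NameNodeInfo JMX bean exposed by the NameNode Web server
url = "http://localhost:9870/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo"
with urllib.request.urlopen(url) as resp:
    info = json.loads(resp.read().decode())["beans"][0]
print("HDFS version :", info.get("Version"))
print("Live nodes   :", info.get("LiveNodes"))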
Let us open the Web UI in another tab in the browser with Google Colab's output
library (source: https://github.com/googlecolab/colabtools/tree/main/google/colab/output, documentation: Browsing to servers executing on the kernel).
!wget localhost:9870
--2024-02-19 13:07:52-- http://localhost:9870/ Resolving localhost (localhost)... 127.0.0.1, ::1 Connecting to localhost (localhost)|127.0.0.1|:9870... connected. HTTP request sent, awaiting response... 302 Found Location: http://localhost:9870/index.html [following] --2024-02-19 13:07:52-- http://localhost:9870/index.html Reusing existing connection to localhost:9870. HTTP request sent, awaiting response... 200 OK Length: 1079 (1.1K) [text/html] Saving to: ‘index.html’ index.html 100%[===================>] 1.05K --.-KB/s in 0s 2024-02-19 13:07:52 (122 MB/s) - ‘index.html’ saved [1079/1079]
%%capture namenode_url
if IN_COLAB:
port = 9870
output.serve_kernel_port_as_window(port, path='/index.html')
if IN_COLAB:
namenode_url.show()
This is what you should see if you open the above link in a newly opened tab or window in your browser.
On port $9864$ you can reach the datanode.
%%capture datanode_url
if IN_COLAB:
port = 9864
output.serve_kernel_port_as_window(port, path='/index.html')
if IN_COLAB:
datanode_url.show()
Throughout this notebook, I use the function output.serve_kernel_port_as_window
to serve Web pages from Google Colab.
You can find the documentation for Google Colab's advanced outputs library at: https://colab.research.google.com/notebooks/snippets/advanced_outputs.ipynb#scrollTo=Browsing_to_servers_executing_on_the_kernel.
Before serving a link, I always use wget
to see if there are any redirections that need to be included in the path
option. For instance:
output.serve_kernel_port_as_window(9870, path='/index.html')
works but
output.serve_kernel_port_as_window(9870)
will not work!
Oddly, the function output.serve_kernel_port_as_window is deprecated in favor of serve_kernel_port_as_iframe, which does not work.
if IN_COLAB:
help(output.serve_kernel_port_as_window)
Help on function serve_kernel_port_as_window in module google.colab.output._util: serve_kernel_port_as_window(port, path='/', anchor_text=None) Displays a link in the output to open a browser tab to a port on the kernel. DEPRECATED; Browser security updates have broken this feature. Use `serve_kernel_port_as_iframe` instead. See https://developer.chrome.com/en/docs/privacy-sandbox/storage-partitioning/. This allows viewing URLs hosted on the kernel in new browser tabs. The URL will only be valid for the current user while the notebook is open in Colab. Args: port: The kernel port to be exposed to the client. path: The path to be navigated to. anchor_text: Text content of the anchor link.
Upload the Hadoop distribution .tar.gz file (simply because it is a reasonably big file that is already on the local filesystem) to HDFS using the command hdfs dfs -put <filename>.
Check DFS usage before:
!hdfs dfsadmin -report | grep "^DFS Used" | tail -2
DFS Used: 24576 (24 KB) DFS Used%: 0.00%
!echo $(basename $HADOOP_URL)
# create home directory on HDFS
!hdfs dfs -mkdir -p /user/root
# upload file to HDFS
!hdfs dfs -put $(basename $HADOOP_URL)
# list HDFS home
!hdfs dfs -ls -h
hadoop-3.3.6.tar.gz Found 1 items -rw-r--r-- 1 root supergroup 696.3 M 2024-02-19 13:08 hadoop-3.3.6.tar.gz
You should now see that the value for DFS Used has increased from $0.0\%$ to a value $>0.0\%$.
!hdfs dfsadmin -report | grep "^DFS Used" | tail -2
DFS Used: 735836062 (701.75 MB) DFS Used%: 0.64%
The same is shown in the Web UI (you might have to refresh the page).
Configured Capacity: | 107.72 GB |
Configured Remote Capacity: | 0 B |
DFS Used: | 701.76 MB (0.64%) |
Clean up.
!hdfs dfs -rm -r $(basename $HADOOP_URL)
!hdfs dfs -ls
Deleted hadoop-3.3.6.tar.gz
If you run a MapReduce job at this point (without starting YARN), the job will run locally: with an empty mapred-site.xml, the property mapreduce.framework.name keeps its default value local.
We are resetting the YARN and MapReduce configuration files to their initial empty state in case you are not running the notebook for the first time.
clear_conf_file(yarn_site_file)
clear_conf_file(mapred_site_file)
The Hadoop distribution comes with several examples (both sources and compiled). Here we are going to show how to find the runnable JAR files.
Search for MapReduce examples with find. The first JAR will do.
!find $HADOOP_HOME -name "*mapreduce*examples*"
/content/hadoop-3.3.6/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar /content/hadoop-3.3.6/share/hadoop/mapreduce/sources/hadoop-mapreduce-examples-3.3.6-test-sources.jar /content/hadoop-3.3.6/share/hadoop/mapreduce/sources/hadoop-mapreduce-examples-3.3.6-sources.jar /content/hadoop-3.3.6/share/doc/hadoop/hadoop-mapreduce-examples
Call the JAR file to get a list of all available MapReduce examples.
!hadoop jar $(find $HADOOP_HOME -name "*mapreduce*examples*"|head -1)
An example program must be given as the first argument. Valid program names are: aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files. aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files. bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi. dbcount: An example job that count the pageview counts from a database. distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi. grep: A map/reduce program that counts the matches of a regex in the input. join: A job that effects a join over sorted, equally partitioned datasets multifilewc: A job that counts words from several files. pentomino: A map/reduce tile laying program to find solutions to pentomino problems. pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method. randomtextwriter: A map/reduce program that writes 10GB of random textual data per node. randomwriter: A map/reduce program that writes 10GB of random data per node. secondarysort: An example defining a secondary sort to the reduce. sort: A map/reduce program that sorts the data written by the random writer. sudoku: A sudoku solver. teragen: Generate data for the terasort terasort: Run the terasort teravalidate: Checking results of terasort wordcount: A map/reduce program that counts the words in the input files. wordmean: A map/reduce program that counts the average length of the words in the input files. wordmedian: A map/reduce program that counts the median length of the words in the input files. wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.
Run the pi example without arguments to get a usage message.
!hadoop jar $(find $HADOOP_HOME -name "*mapreduce*examples*"|head -1) pi
Usage: org.apache.hadoop.examples.QuasiMonteCarlo <nMaps> <nSamples> Generic options supported are: -conf <configuration file> specify an application configuration file -D <property=value> define a value for a given property -fs <file:///|hdfs://namenode:port> specify default filesystem URL to use, overrides 'fs.defaultFS' property from configurations. -jt <local|resourcemanager:port> specify a ResourceManager -files <file1,...> specify a comma-separated list of files to be copied to the map reduce cluster -libjars <jar1,...> specify a comma-separated list of jar files to be included in the classpath -archives <archive1,...> specify a comma-separated list of archives to be unarchived on the compute machines The general command line syntax is: command [genericOptions] [commandOptions]
pi MapReduce app
Run the pi example with <nMaps> $=2$ and <nSamples> $=100$.
!yarn jar $(find $HADOOP_HOME -name "*mapreduce*examples*"|head -1) pi 2 100
Number of Maps = 2 Samples per Map = 100 Wrote input for Map #0 Wrote input for Map #1 Starting Job 2024-02-19 13:08:35,431 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties 2024-02-19 13:08:35,582 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s). 2024-02-19 13:08:35,582 INFO impl.MetricsSystemImpl: JobTracker metrics system started 2024-02-19 13:08:35,868 INFO input.FileInputFormat: Total input files to process : 2 2024-02-19 13:08:35,881 INFO mapreduce.JobSubmitter: number of splits:2 2024-02-19 13:08:36,394 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1436131745_0001 2024-02-19 13:08:36,394 INFO mapreduce.JobSubmitter: Executing with tokens: [] 2024-02-19 13:08:36,613 INFO mapreduce.Job: The url to track the job: http://localhost:8080/ 2024-02-19 13:08:36,614 INFO mapreduce.Job: Running job: job_local1436131745_0001 2024-02-19 13:08:36,622 INFO mapred.LocalJobRunner: OutputCommitter set in config null 2024-02-19 13:08:36,630 INFO output.PathOutputCommitterFactory: No output committer factory defined, defaulting to FileOutputCommitterFactory 2024-02-19 13:08:36,632 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2 2024-02-19 13:08:36,632 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false 2024-02-19 13:08:36,640 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter 2024-02-19 13:08:36,721 INFO mapred.LocalJobRunner: Waiting for map tasks 2024-02-19 13:08:36,723 INFO mapred.LocalJobRunner: Starting task: attempt_local1436131745_0001_m_000000_0 2024-02-19 13:08:36,759 INFO output.PathOutputCommitterFactory: No output committer factory defined, defaulting to FileOutputCommitterFactory 2024-02-19 13:08:36,763 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2 2024-02-19 13:08:36,763 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false 2024-02-19 13:08:36,883 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ] 2024-02-19 13:08:36,909 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/user/root/QuasiMonteCarlo_1708348112809_1640684131/in/part0:0+118 2024-02-19 13:08:37,026 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584) 2024-02-19 13:08:37,026 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100 2024-02-19 13:08:37,026 INFO mapred.MapTask: soft limit at 83886080 2024-02-19 13:08:37,026 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600 2024-02-19 13:08:37,026 INFO mapred.MapTask: kvstart = 26214396; length = 6553600 2024-02-19 13:08:37,031 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer 2024-02-19 13:08:37,179 INFO mapred.LocalJobRunner: 2024-02-19 13:08:37,181 INFO mapred.MapTask: Starting flush of map output 2024-02-19 13:08:37,181 INFO mapred.MapTask: Spilling map output 2024-02-19 13:08:37,182 INFO mapred.MapTask: bufstart = 0; bufend = 18; bufvoid = 104857600 2024-02-19 13:08:37,182 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214392(104857568); length = 5/6553600 2024-02-19 13:08:37,189 INFO mapred.MapTask: Finished spill 0 2024-02-19 13:08:37,204 INFO mapred.Task: Task:attempt_local1436131745_0001_m_000000_0 is done. 
And is in the process of committing 2024-02-19 13:08:37,210 INFO mapred.LocalJobRunner: map 2024-02-19 13:08:37,210 INFO mapred.Task: Task 'attempt_local1436131745_0001_m_000000_0' done. 2024-02-19 13:08:37,220 INFO mapred.Task: Final Counters for attempt_local1436131745_0001_m_000000_0: Counters: 23 File System Counters FILE: Number of bytes read=281714 FILE: Number of bytes written=921239 FILE: Number of read operations=0 FILE: Number of large read operations=0 FILE: Number of write operations=0 HDFS: Number of bytes read=118 HDFS: Number of bytes written=236 HDFS: Number of read operations=7 HDFS: Number of large read operations=0 HDFS: Number of write operations=4 HDFS: Number of bytes read erasure-coded=0 Map-Reduce Framework Map input records=1 Map output records=2 Map output bytes=18 Map output materialized bytes=28 Input split bytes=146 Combine input records=0 Spilled Records=2 Failed Shuffles=0 Merged Map outputs=0 GC time elapsed (ms)=0 Total committed heap usage (bytes)=357564416 File Input Format Counters Bytes Read=118 2024-02-19 13:08:37,222 INFO mapred.LocalJobRunner: Finishing task: attempt_local1436131745_0001_m_000000_0 2024-02-19 13:08:37,223 INFO mapred.LocalJobRunner: Starting task: attempt_local1436131745_0001_m_000001_0 2024-02-19 13:08:37,227 INFO output.PathOutputCommitterFactory: No output committer factory defined, defaulting to FileOutputCommitterFactory 2024-02-19 13:08:37,227 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2 2024-02-19 13:08:37,227 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false 2024-02-19 13:08:37,228 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ] 2024-02-19 13:08:37,241 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/user/root/QuasiMonteCarlo_1708348112809_1640684131/in/part1:0+118 2024-02-19 13:08:37,291 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584) 2024-02-19 13:08:37,291 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100 2024-02-19 13:08:37,291 INFO mapred.MapTask: soft limit at 83886080 2024-02-19 13:08:37,291 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600 2024-02-19 13:08:37,291 INFO mapred.MapTask: kvstart = 26214396; length = 6553600 2024-02-19 13:08:37,306 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer 2024-02-19 13:08:37,324 INFO mapred.LocalJobRunner: 2024-02-19 13:08:37,325 INFO mapred.MapTask: Starting flush of map output 2024-02-19 13:08:37,325 INFO mapred.MapTask: Spilling map output 2024-02-19 13:08:37,325 INFO mapred.MapTask: bufstart = 0; bufend = 18; bufvoid = 104857600 2024-02-19 13:08:37,325 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214392(104857568); length = 5/6553600 2024-02-19 13:08:37,328 INFO mapred.MapTask: Finished spill 0 2024-02-19 13:08:37,340 INFO mapred.Task: Task:attempt_local1436131745_0001_m_000001_0 is done. And is in the process of committing 2024-02-19 13:08:37,349 INFO mapred.LocalJobRunner: map 2024-02-19 13:08:37,349 INFO mapred.Task: Task 'attempt_local1436131745_0001_m_000001_0' done. 
2024-02-19 13:08:37,350 INFO mapred.Task: Final Counters for attempt_local1436131745_0001_m_000001_0: Counters: 23 File System Counters FILE: Number of bytes read=282025 FILE: Number of bytes written=921299 FILE: Number of read operations=0 FILE: Number of large read operations=0 FILE: Number of write operations=0 HDFS: Number of bytes read=236 HDFS: Number of bytes written=236 HDFS: Number of read operations=10 HDFS: Number of large read operations=0 HDFS: Number of write operations=4 HDFS: Number of bytes read erasure-coded=0 Map-Reduce Framework Map input records=1 Map output records=2 Map output bytes=18 Map output materialized bytes=28 Input split bytes=146 Combine input records=0 Spilled Records=2 Failed Shuffles=0 Merged Map outputs=0 GC time elapsed (ms)=21 Total committed heap usage (bytes)=452984832 File Input Format Counters Bytes Read=118 2024-02-19 13:08:37,350 INFO mapred.LocalJobRunner: Finishing task: attempt_local1436131745_0001_m_000001_0 2024-02-19 13:08:37,350 INFO mapred.LocalJobRunner: map task executor complete. 2024-02-19 13:08:37,354 INFO mapred.LocalJobRunner: Waiting for reduce tasks 2024-02-19 13:08:37,354 INFO mapred.LocalJobRunner: Starting task: attempt_local1436131745_0001_r_000000_0 2024-02-19 13:08:37,369 INFO output.PathOutputCommitterFactory: No output committer factory defined, defaulting to FileOutputCommitterFactory 2024-02-19 13:08:37,369 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2 2024-02-19 13:08:37,369 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false 2024-02-19 13:08:37,370 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ] 2024-02-19 13:08:37,374 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@1395dc25 2024-02-19 13:08:37,376 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized! 2024-02-19 13:08:37,406 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=2382574336, maxSingleShuffleLimit=595643584, mergeThreshold=1572499072, ioSortFactor=10, memToMemMergeOutputsThreshold=10 2024-02-19 13:08:37,408 INFO reduce.EventFetcher: attempt_local1436131745_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events 2024-02-19 13:08:37,464 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local1436131745_0001_m_000001_0 decomp: 24 len: 28 to MEMORY 2024-02-19 13:08:37,468 INFO reduce.InMemoryMapOutput: Read 24 bytes from map-output for attempt_local1436131745_0001_m_000001_0 2024-02-19 13:08:37,468 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 24, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->24 2024-02-19 13:08:37,476 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local1436131745_0001_m_000000_0 decomp: 24 len: 28 to MEMORY 2024-02-19 13:08:37,477 INFO reduce.InMemoryMapOutput: Read 24 bytes from map-output for attempt_local1436131745_0001_m_000000_0 2024-02-19 13:08:37,478 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 24, inMemoryMapOutputs.size() -> 2, commitMemory -> 24, usedMemory ->48 2024-02-19 13:08:37,478 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning 2024-02-19 13:08:37,479 INFO mapred.LocalJobRunner: 2 / 2 copied. 
2024-02-19 13:08:37,479 INFO reduce.MergeManagerImpl: finalMerge called with 2 in-memory map-outputs and 0 on-disk map-outputs 2024-02-19 13:08:37,488 INFO mapred.Merger: Merging 2 sorted segments 2024-02-19 13:08:37,488 INFO mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 42 bytes 2024-02-19 13:08:37,490 INFO reduce.MergeManagerImpl: Merged 2 segments, 48 bytes to disk to satisfy reduce memory limit 2024-02-19 13:08:37,490 INFO reduce.MergeManagerImpl: Merging 1 files, 50 bytes from disk 2024-02-19 13:08:37,491 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce 2024-02-19 13:08:37,491 INFO mapred.Merger: Merging 1 sorted segments 2024-02-19 13:08:37,492 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 43 bytes 2024-02-19 13:08:37,493 INFO mapred.LocalJobRunner: 2 / 2 copied. 2024-02-19 13:08:37,513 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords 2024-02-19 13:08:37,609 INFO mapred.Task: Task:attempt_local1436131745_0001_r_000000_0 is done. And is in the process of committing 2024-02-19 13:08:37,612 INFO mapred.LocalJobRunner: 2 / 2 copied. 2024-02-19 13:08:37,612 INFO mapred.Task: Task attempt_local1436131745_0001_r_000000_0 is allowed to commit now 2024-02-19 13:08:37,617 INFO mapreduce.Job: Job job_local1436131745_0001 running in uber mode : false 2024-02-19 13:08:37,618 INFO mapreduce.Job: map 100% reduce 0% 2024-02-19 13:08:37,633 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1436131745_0001_r_000000_0' to hdfs://localhost:9000/user/root/QuasiMonteCarlo_1708348112809_1640684131/out 2024-02-19 13:08:37,634 INFO mapred.LocalJobRunner: reduce > reduce 2024-02-19 13:08:37,634 INFO mapred.Task: Task 'attempt_local1436131745_0001_r_000000_0' done. 2024-02-19 13:08:37,635 INFO mapred.Task: Final Counters for attempt_local1436131745_0001_r_000000_0: Counters: 30 File System Counters FILE: Number of bytes read=282195 FILE: Number of bytes written=921349 FILE: Number of read operations=0 FILE: Number of large read operations=0 FILE: Number of write operations=0 HDFS: Number of bytes read=236 HDFS: Number of bytes written=451 HDFS: Number of read operations=15 HDFS: Number of large read operations=0 HDFS: Number of write operations=7 HDFS: Number of bytes read erasure-coded=0 Map-Reduce Framework Combine input records=0 Combine output records=0 Reduce input groups=2 Reduce shuffle bytes=56 Reduce input records=4 Reduce output records=0 Spilled Records=4 Shuffled Maps =2 Failed Shuffles=0 Merged Map outputs=2 GC time elapsed (ms)=0 Total committed heap usage (bytes)=452984832 Shuffle Errors BAD_ID=0 CONNECTION=0 IO_ERROR=0 WRONG_LENGTH=0 WRONG_MAP=0 WRONG_REDUCE=0 File Output Format Counters Bytes Written=97 2024-02-19 13:08:37,635 INFO mapred.LocalJobRunner: Finishing task: attempt_local1436131745_0001_r_000000_0 2024-02-19 13:08:37,635 INFO mapred.LocalJobRunner: reduce task executor complete. 
2024-02-19 13:08:38,619 INFO mapreduce.Job: map 100% reduce 100% 2024-02-19 13:08:38,620 INFO mapreduce.Job: Job job_local1436131745_0001 completed successfully 2024-02-19 13:08:38,633 INFO mapreduce.Job: Counters: 36 File System Counters FILE: Number of bytes read=845934 FILE: Number of bytes written=2763887 FILE: Number of read operations=0 FILE: Number of large read operations=0 FILE: Number of write operations=0 HDFS: Number of bytes read=590 HDFS: Number of bytes written=923 HDFS: Number of read operations=32 HDFS: Number of large read operations=0 HDFS: Number of write operations=15 HDFS: Number of bytes read erasure-coded=0 Map-Reduce Framework Map input records=2 Map output records=4 Map output bytes=36 Map output materialized bytes=56 Input split bytes=292 Combine input records=0 Combine output records=0 Reduce input groups=2 Reduce shuffle bytes=56 Reduce input records=4 Reduce output records=0 Spilled Records=8 Shuffled Maps =2 Failed Shuffles=0 Merged Map outputs=2 GC time elapsed (ms)=21 Total committed heap usage (bytes)=1263534080 Shuffle Errors BAD_ID=0 CONNECTION=0 IO_ERROR=0 WRONG_LENGTH=0 WRONG_MAP=0 WRONG_REDUCE=0 File Input Format Counters Bytes Read=236 File Output Format Counters Bytes Written=97 Job Finished in 3.496 seconds Estimated value of Pi is 3.12000000000000000000
The script that we are going to use to start YARN is
$HADOOP_HOME/sbin/start-yarn.sh
Before launching YARN, ensure proper configuration in both yarn-site.xml and mapred-site.xml.
For YARN, it is essential to configure yarn.resourcemanager.hostname: once this is set, many other service addresses are derived from ${yarn.resourcemanager.hostname} and do not need to be configured one by one.
In mapred-site.xml we want to configure MapReduce to submit jobs to the YARN queue by default.
We configure only the minimal set of properties needed to run an example in this tutorial. In general, you might want to customize these XML configuration files to align with your system requirements and specifications.
Some useful links for configuring YARN:
yarn-site.xml
clear_conf_file(yarn_site_file)
edit_conf_file(yarn_site_file, 'yarn.resourcemanager.hostname', 'localhost')
# Shuffle service that needs to be set for Map Reduce applications.
edit_conf_file(yarn_site_file, 'yarn.nodemanager.aux-services', 'mapreduce_shuffle')
# Configuration to enable or disable log aggregation (default is false)
edit_conf_file(yarn_site_file, 'yarn.log-aggregation-enable', 'true')
# Environment properties to be inherited by containers from NodeManagers
edit_conf_file(yarn_site_file, 'yarn.nodemanager.env-whitelist', 'JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_HOME,PATH,LANG,TZ,HADOOP_MAPRED_HOME')
# Web proxy address
edit_conf_file(yarn_site_file, 'yarn.web-proxy.address', '127.0.0.1:3141')
# NM Webapp address
#edit_conf_file(yarn_site_file, 'yarn.nodemanager.hostname', 'localhost:8042')
# Configurations for History Server
edit_conf_file(yarn_site_file, 'yarn.log-aggregation.retain-seconds', '600')
edit_conf_file(yarn_site_file, 'yarn.log-aggregation.retain-check-interval-seconds', '60')
print_conf_file(yarn_site_file)
19-Feb-24 01:08:39 PM - INFO: add property yarn.resourcemanager.hostname to /content/hadoop-3.3.6/etc/hadoop/yarn-site.xml 19-Feb-24 01:08:39 PM - INFO: add property yarn.nodemanager.aux-services to /content/hadoop-3.3.6/etc/hadoop/yarn-site.xml 19-Feb-24 01:08:39 PM - INFO: add property yarn.log-aggregation-enable to /content/hadoop-3.3.6/etc/hadoop/yarn-site.xml 19-Feb-24 01:08:39 PM - INFO: add property yarn.nodemanager.env-whitelist to /content/hadoop-3.3.6/etc/hadoop/yarn-site.xml 19-Feb-24 01:08:39 PM - INFO: add property yarn.web-proxy.address to /content/hadoop-3.3.6/etc/hadoop/yarn-site.xml 19-Feb-24 01:08:39 PM - INFO: add property yarn.log-aggregation.retain-seconds to /content/hadoop-3.3.6/etc/hadoop/yarn-site.xml 19-Feb-24 01:08:39 PM - INFO: add property yarn.log-aggregation.retain-check-interval-seconds to /content/hadoop-3.3.6/etc/hadoop/yarn-site.xml
<?xml version="1.0" ?> <configuration> <property> <name>yarn.resourcemanager.hostname</name> <value>localhost</value> </property> <property> <name>yarn.nodemanager.aux-services</name> <value>mapreduce_shuffle</value> </property> <property> <name>yarn.log-aggregation-enable</name> <value>true</value> </property> <property> <name>yarn.nodemanager.env-whitelist</name> <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_HOME,PATH,LANG,TZ,HADOOP_MAPRED_HOME</value> </property> <property> <name>yarn.web-proxy.address</name> <value>127.0.0.1:3141</value> </property> <property> <name>yarn.log-aggregation.retain-seconds</name> <value>600</value> </property> <property> <name>yarn.log-aggregation.retain-check-interval-seconds</name> <value>60</value> </property> </configuration>
yarn.web-proxy.address is needed so that start-yarn.sh also launches the Web proxy (needed for navigating the different Web services). Here is the relevant code from start-yarn.sh:
# start proxyserver
PROXYSERVER=$("${HADOOP_HDFS_HOME}/bin/hdfs" getconf -confKey yarn.web-proxy.address 2>&- | cut -f1 -d:)
if [[ -n ${PROXYSERVER} ]]; then
hadoop_uservar_su yarn proxyserver "${HADOOP_YARN_HOME}/bin/yarn" \
--config "${HADOOP_CONF_DIR}" \
--workers \
--hostnames "${PROXYSERVER}" \
--daemon start \
proxyserver
(( HADOOP_JUMBO_RETCOUNTER=HADOOP_JUMBO_RETCOUNTER + $? ))
fi
We just picked an arbitrary port, 3141.
To check whether yarn.web-proxy.address is set, use:
hdfs getconf -confKey yarn.web-proxy.address
!hdfs getconf -confKey yarn.web-proxy.address
127.0.0.1:3141
mapred-site.xml
We want to run MapReduce jobs on YARN by default (property mapreduce.framework.name) and we want to configure the addresses of the JobHistory daemon.
clear_conf_file(mapred_site_file)
edit_conf_file(mapred_site_file, 'mapreduce.framework.name', 'yarn')
# CLASSPATH needed by MapReduce applications
edit_conf_file(mapred_site_file, 'mapreduce.application.classpath',
os.environ['HADOOP_HOME'] + '/share/hadoop/mapreduce/*:' +
os.environ['HADOOP_HOME']+ '/share/hadoop/mapreduce/lib/*')
# this should also work (with the variable ${HADOOP_HOME})
# edit_conf_file(mapred_site_file, 'mapreduce.application.classpath',
# '${HADOOP_HOME}/share/hadoop/mapreduce/*:' +
# '${HADOOP_HOME}/share/hadoop/mapreduce/lib/*')
# Configurations for History Server
edit_conf_file(mapred_site_file, 'mapreduce.jobhistory.address', 'localhost:10020')
edit_conf_file(mapred_site_file, 'mapreduce.jobhistory.webapp.address', 'localhost:19888')
print_conf_file(mapred_site_file)
19-Feb-24 01:08:40 PM - INFO: add property mapreduce.framework.name to /content/hadoop-3.3.6/etc/hadoop/mapred-site.xml 19-Feb-24 01:08:40 PM - INFO: add property mapreduce.application.classpath to /content/hadoop-3.3.6/etc/hadoop/mapred-site.xml 19-Feb-24 01:08:40 PM - INFO: add property mapreduce.jobhistory.address to /content/hadoop-3.3.6/etc/hadoop/mapred-site.xml 19-Feb-24 01:08:40 PM - INFO: add property mapreduce.jobhistory.webapp.address to /content/hadoop-3.3.6/etc/hadoop/mapred-site.xml
<?xml version="1.0" ?> <configuration> <property> <name>mapreduce.framework.name</name> <value>yarn</value> </property> <property> <name>mapreduce.application.classpath</name> <value>/content/hadoop-3.3.6/share/hadoop/mapreduce/*:/content/hadoop-3.3.6/share/hadoop/mapreduce/lib/*</value> </property> <property> <name>mapreduce.jobhistory.address</name> <value>localhost:10020</value> </property> <property> <name>mapreduce.jobhistory.webapp.address</name> <value>localhost:19888</value> </property> </configuration>
!$HADOOP_HOME/sbin/stop-yarn.sh
!$HADOOP_HOME/sbin/start-yarn.sh
Stopping nodemanagers Stopping resourcemanager Stopping proxy server [127.0.0.1] 127.0.0.1: Warning: Permanently added '127.0.0.1' (ED25519) to the list of known hosts. Starting resourcemanager Starting nodemanagers
!lsof -n -i -P +c0 -sTCP:LISTEN -ac java
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME java 2534 root 341u IPv4 55167 0t0 TCP *:9870 (LISTEN) java 2534 root 349u IPv4 55722 0t0 TCP 127.0.0.1:9000 (LISTEN) java 2645 root 342u IPv4 54075 0t0 TCP *:9866 (LISTEN) java 2645 root 345u IPv4 54085 0t0 TCP 127.0.0.1:40039 (LISTEN) java 2645 root 374u IPv4 56384 0t0 TCP *:9864 (LISTEN) java 2645 root 375u IPv4 56418 0t0 TCP *:9867 (LISTEN) java 2860 root 343u IPv4 57029 0t0 TCP *:9868 (LISTEN) java 4314 root 366u IPv4 79944 0t0 TCP 127.0.0.1:8088 (LISTEN) java 4314 root 373u IPv4 83653 0t0 TCP 127.0.0.1:8033 (LISTEN) java 4428 root 382u IPv4 83999 0t0 TCP *:38267 (LISTEN) java 4428 root 393u IPv4 83153 0t0 TCP *:8040 (LISTEN) java 4428 root 403u IPv4 83159 0t0 TCP *:13562 (LISTEN)
We now have some new java processes added to the list of listening ports (you might need to wait and refresh the previous cell to see them all):
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
java 13177 root 341u IPv4 357089 0t0 TCP *:9870 (LISTEN)
java 13177 root 349u IPv4 358597 0t0 TCP 127.0.0.1:9000 (LISTEN)
java 13297 root 342u IPv4 359615 0t0 TCP *:9866 (LISTEN)
java 13297 root 345u IPv4 358674 0t0 TCP 127.0.0.1:40349 (LISTEN)
java 13297 root 374u IPv4 358786 0t0 TCP *:9864 (LISTEN)
java 13297 root 375u IPv4 358790 0t0 TCP *:9867 (LISTEN)
java 13515 root 343u IPv4 360840 0t0 TCP *:9868 (LISTEN)
java 27447 root 366u IPv4 756489 0t0 TCP 127.0.0.1:8088 (LISTEN)
java 27447 root 373u IPv4 759769 0t0 TCP 127.0.0.1:8033 (LISTEN)
java 27447 root 383u IPv4 758559 0t0 TCP 127.0.0.1:8031 (LISTEN)
java 27447 root 393u IPv4 759823 0t0 TCP 127.0.0.1:8030 (LISTEN)
java 27447 root 403u IPv4 759857 0t0 TCP 127.0.0.1:8032 (LISTEN)
java 27735 root 340u IPv4 759788 0t0 TCP 127.0.0.1:3141 (LISTEN)
YARN ports:
port number | description | configured in
---|---|---
8088 | yarn.resourcemanager.webapp.address | yarn-default.xml
8030 | yarn.resourcemanager.scheduler.address | yarn-default.xml
8031 | yarn.resourcemanager.resource-tracker.address | yarn-default.xml
8032 | yarn.resourcemanager.address | yarn-default.xml
8033 | yarn.resourcemanager.admin.address | yarn-default.xml
8040 | yarn.nodemanager.localizer.address | yarn-default.xml
8042 | yarn.nodemanager.webapp.address | yarn-default.xml
Web Proxy port:
port number | configuration property | configured in
---|---|---
3141 | yarn.web-proxy.address | yarn-site.xml
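Since this notebook favours Python where possible, here is a small sketch that probes the ports from the tables above with the standard socket module; the bash cell below performs an equivalent check with lsof.
import socket

# YARN and Web Proxy ports we expect to be open, as listed in the tables above
yarn_ports = {
    8088: "yarn.resourcemanager.webapp.address",
    8030: "yarn.resourcemanager.scheduler.address",
    8031: "yarn.resourcemanager.resource-tracker.address",
    8032: "yarn.resourcemanager.address",
    8033: "yarn.resourcemanager.admin.address",
    8040: "yarn.nodemanager.localizer.address",
    8042: "yarn.nodemanager.webapp.address",
    3141: "yarn.web-proxy.address",
}
for port, prop in sorted(yarn_ports.items()):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1)
        # connect_ex returns 0 when something accepts connections on the port
        status = "open" if s.connect_ex(("127.0.0.1", port)) == 0 else "closed"
    print(f"{port:5d}  {prop:45s} {status}")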
%%bash
counter=0
until [ $counter -gt 20 ]
do
((counter++))
sleep 1
# expect the 7 YARN ports plus the lsof header line = 8 lines
if [ $(lsof -n -aiTCP:8088 -aiTCP:8030 -aiTCP:8031 -aiTCP:8032 -aiTCP:8033 -aiTCP:8042 -aiTCP:3141 -P +c0 -sTCP:LISTEN -ac java |wc -l ) -eq 8 ]
then
echo "YARN is up and running"
echo "Time to start: $counter secs"
exit
fi
done
echo "Some YARN ports are missing. Wait some more or restart YARN."
YARN is up and running Time to start: 2 secs
If YARN is running correctly, you should see one node listed when calling yarn node -list
!yarn node -list
2024-02-19 13:09:16,252 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at localhost/127.0.0.1:8032 Total Nodes:1 Node-Id Node-State Node-Http-Address Number-of-Running-Containers 5a42c671b202:38267 RUNNING 5a42c671b202:8042 0
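The same check can be scripted. Here is a minimal sketch that runs yarn node -list through subprocess and extracts the node count; parsing the "Total Nodes:" line is an assumption based on the output above.
import os
import subprocess

# run `yarn node -list` and report how many nodes are registered with the ResourceManager
yarn_bin = os.path.join(os.environ['HADOOP_HOME'], 'bin', 'yarn')
subprocess_output = subprocess.run([yarn_bin, 'node', '-list'], check=True,
                                   stdout=subprocess.PIPE, stderr=subprocess.DEVNULL)
for line in subprocess_output.stdout.decode().splitlines():
    if line.startswith('Total Nodes:'):
        print(f"{int(line.split(':', 1)[1])} node(s) registered with the ResourceManager")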
!wget localhost:8088
--2024-02-19 13:09:16-- http://localhost:8088/ Resolving localhost (localhost)... 127.0.0.1, ::1 Connecting to localhost (localhost)|127.0.0.1|:8088... connected. HTTP request sent, awaiting response... 302 Found Location: http://localhost:8088/cluster [following] --2024-02-19 13:09:17-- http://localhost:8088/cluster Reusing existing connection to localhost:8088. HTTP request sent, awaiting response... 200 OK Length: 13683 (13K) [text/html] Saving to: ‘index.html.1’ index.html.1 100%[===================>] 13.36K --.-KB/s in 0s 2024-02-19 13:09:17 (230 MB/s) - ‘index.html.1’ saved [13683/13683]
%%capture yarn_url
if IN_COLAB:
    port = 8088
    output.serve_kernel_port_as_window(port, path='/cluster')
if IN_COLAB:
    yarn_url.show()
YARN Web UI: empty queue
No application is shown in the YARN UI yet, since so far we have only run a job locally.
Next, we want to start the History Server in order to be able to view the job logfiles once the jobs have finished.
As already seen, the HDFS daemons are NameNode, SecondaryNameNode, and DataNode.
The YARN daemons are ResourceManager, NodeManager, and WebAppProxy.
If MapReduce is to be used, then the MapReduce Job History Server will also be running. For large installations, these are generally running on separate hosts.
From: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/ClusterSetup.html.
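As a quick sanity check, we can look for these daemons' main classes in the process list. This is only a rough sketch: it simply searches the output of ps for the fully qualified class names of the daemons mentioned above (the JobHistoryServer will only show up once we have started it below).
import subprocess

# fully qualified main classes of the daemons we expect to find running
daemons = {
    "NameNode": "org.apache.hadoop.hdfs.server.namenode.NameNode",
    "SecondaryNameNode": "org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode",
    "DataNode": "org.apache.hadoop.hdfs.server.datanode.DataNode",
    "ResourceManager": "org.apache.hadoop.yarn.server.resourcemanager.ResourceManager",
    "NodeManager": "org.apache.hadoop.yarn.server.nodemanager.NodeManager",
    "WebAppProxyServer": "org.apache.hadoop.yarn.server.webproxy.WebAppProxyServer",
    "JobHistoryServer": "org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer",
}
ps_output = subprocess.run(["ps", "-eo", "args"], check=True,
                           stdout=subprocess.PIPE).stdout.decode()
for name, main_class in daemons.items():
    print(f"{name:18s} {'running' if main_class in ps_output else 'not running'}")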
!$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh stop historyserver
!$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver
WARNING: Use of this script to stop the MR JobHistory daemon is deprecated. WARNING: Attempting to execute replacement "mapred --daemon stop" instead. WARNING: Use of this script to start the MR JobHistory daemon is deprecated. WARNING: Attempting to execute replacement "mapred --daemon start" instead.
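The warnings are harmless: in Hadoop 3 the mr-jobhistory-daemon.sh script is deprecated and simply delegates to mapred --daemon start historyserver. The History Server needs a few seconds to open its web port (19888), as the retries in the next cell show; a small wait loop like the following sketch avoids polling by hand.
import socket
import time

# wait up to ~20 seconds for the JobHistory Server web UI (localhost:19888,
# i.e. mapreduce.jobhistory.webapp.address) to accept connections
for attempt in range(1, 21):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1)
        if s.connect_ex(("127.0.0.1", 19888)) == 0:
            print(f"History Server is up and running (after ~{attempt} secs)")
            break
    time.sleep(1)
else:
    print("Port 19888 is still closed. Wait some more or restart the History Server.")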
!wget localhost:19888
--2024-02-19 13:09:19-- http://localhost:19888/ Resolving localhost (localhost)... 127.0.0.1, ::1 Connecting to localhost (localhost)|127.0.0.1|:19888... failed: Connection refused. Connecting to localhost (localhost)|::1|:19888... failed: Cannot assign requested address. Retrying. --2024-02-19 13:09:20-- (try: 2) http://localhost:19888/ Connecting to localhost (localhost)|127.0.0.1|:19888... failed: Connection refused. Connecting to localhost (localhost)|::1|:19888... failed: Cannot assign requested address. Retrying. --2024-02-19 13:09:22-- (try: 3) http://localhost:19888/ Connecting to localhost (localhost)|127.0.0.1|:19888... connected. HTTP request sent, awaiting response... 302 Found Location: http://localhost:19888/jobhistory [following] --2024-02-19 13:09:24-- http://localhost:19888/jobhistory Reusing existing connection to localhost:19888. HTTP request sent, awaiting response... 200 OK Length: 7791 (7.6K) [text/html] Saving to: ‘index.html.2’ index.html.2 100%[===================>] 7.61K --.-KB/s in 0s 2024-02-19 13:09:25 (546 MB/s) - ‘index.html.2’ saved [7791/7791]
if IN_COLAB:
    port = 19888
    output.serve_kernel_port_as_window(port, path='/jobhistory')
Submitting the pi example to YARN
After submitting the job, you should now see the pi application listed in the YARN UI. Here is the link in case you closed the window or tab:
if IN_COLAB:
    port = 8088
    output.serve_kernel_port_as_window(port, path='/cluster')
!yarn jar $(find $HADOOP_HOME -name "*mapreduce*examples*"|head -1) pi 2 100
Number of Maps = 2 Samples per Map = 100 Wrote input for Map #0 Wrote input for Map #1 Starting Job 2024-02-19 13:09:30,811 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at localhost/127.0.0.1:8032 2024-02-19 13:09:31,311 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1708348145550_0001 2024-02-19 13:09:31,605 INFO input.FileInputFormat: Total input files to process : 2 2024-02-19 13:09:31,737 INFO mapreduce.JobSubmitter: number of splits:2 2024-02-19 13:09:32,539 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1708348145550_0001 2024-02-19 13:09:32,539 INFO mapreduce.JobSubmitter: Executing with tokens: [] 2024-02-19 13:09:32,815 INFO conf.Configuration: resource-types.xml not found 2024-02-19 13:09:32,816 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'. 2024-02-19 13:09:33,368 INFO impl.YarnClientImpl: Submitted application application_1708348145550_0001 2024-02-19 13:09:33,439 INFO mapreduce.Job: The url to track the job: http://127.0.0.1:3141/proxy/application_1708348145550_0001/ 2024-02-19 13:09:33,440 INFO mapreduce.Job: Running job: job_1708348145550_0001 2024-02-19 13:09:47,771 INFO mapreduce.Job: Job job_1708348145550_0001 running in uber mode : false 2024-02-19 13:09:47,772 INFO mapreduce.Job: map 0% reduce 0% 2024-02-19 13:10:00,963 INFO mapreduce.Job: map 100% reduce 0% 2024-02-19 13:10:09,053 INFO mapreduce.Job: map 100% reduce 100% 2024-02-19 13:10:10,077 INFO mapreduce.Job: Job job_1708348145550_0001 completed successfully 2024-02-19 13:10:10,573 INFO mapreduce.Job: Counters: 54 File System Counters FILE: Number of bytes read=50 FILE: Number of bytes written=829617 FILE: Number of read operations=0 FILE: Number of large read operations=0 FILE: Number of write operations=0 HDFS: Number of bytes read=526 HDFS: Number of bytes written=215 HDFS: Number of read operations=13 HDFS: Number of large read operations=0 HDFS: Number of write operations=3 HDFS: Number of bytes read erasure-coded=0 Job Counters Launched map tasks=2 Launched reduce tasks=1 Data-local map tasks=2 Total time spent by all maps in occupied slots (ms)=21855 Total time spent by all reduces in occupied slots (ms)=4870 Total time spent by all map tasks (ms)=21855 Total time spent by all reduce tasks (ms)=4870 Total vcore-milliseconds taken by all map tasks=21855 Total vcore-milliseconds taken by all reduce tasks=4870 Total megabyte-milliseconds taken by all map tasks=22379520 Total megabyte-milliseconds taken by all reduce tasks=4986880 Map-Reduce Framework Map input records=2 Map output records=4 Map output bytes=36 Map output materialized bytes=56 Input split bytes=290 Combine input records=0 Combine output records=0 Reduce input groups=2 Reduce shuffle bytes=56 Reduce input records=4 Reduce output records=0 Spilled Records=8 Shuffled Maps =2 Failed Shuffles=0 Merged Map outputs=2 GC time elapsed (ms)=327 CPU time spent (ms)=2850 Physical memory (bytes) snapshot=865935360 Virtual memory (bytes) snapshot=8173481984 Total committed heap usage (bytes)=818937856 Peak Map Physical memory (bytes)=331313152 Peak Map Virtual memory (bytes)=2724544512 Peak Reduce Physical memory (bytes)=218918912 Peak Reduce Virtual memory (bytes)=2728366080 Shuffle Errors BAD_ID=0 CONNECTION=0 IO_ERROR=0 WRONG_LENGTH=0 WRONG_MAP=0 WRONG_REDUCE=0 File Input Format Counters Bytes Read=236 File Output Format Counters Bytes Written=97 Job Finished in 39.964 seconds Estimated value of Pi is 3.12000000000000000000
By refreshing the Web page, you should see the application change state from "ACCEPTED" to "RUNNING" and, finally, "FINISHED".
YARN Web UI: application running
YARN Web UI: application completed
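The application and its final state can also be queried from the command line with the yarn application subcommand; a short sketch:
import os
import subprocess

# list all YARN applications regardless of state, so the finished pi job shows up
yarn_bin = os.path.join(os.environ['HADOOP_HOME'], 'bin', 'yarn')
subprocess_output = subprocess.run([yarn_bin, 'application', '-list', '-appStates', 'ALL'],
                                   check=True, stdout=subprocess.PIPE)
print(subprocess_output.stdout.decode())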
By default, Hadoop writes logs to $HADOOP_HOME/logs.
!ls -halt $HADOOP_HOME/logs
total 540K -rw-r--r-- 1 root root 65K Feb 19 13:10 hadoop-root-resourcemanager-5a42c671b202.log -rw-r--r-- 1 root root 101K Feb 19 13:10 hadoop-root-namenode-5a42c671b202.log -rw-r--r-- 1 root root 103K Feb 19 13:10 hadoop-root-datanode-5a42c671b202.log -rw-r--r-- 1 root root 69K Feb 19 13:10 hadoop-root-nodemanager-5a42c671b202.log -rw-r--r-- 1 root root 34K Feb 19 13:09 hadoop-root-historyserver-5a42c671b202.log drwxr-xr-x 3 root root 4.0K Feb 19 13:09 userlogs -rw-r--r-- 1 root root 2.9K Feb 19 13:09 hadoop-root-historyserver-5a42c671b202.out drwxr-xr-x 3 root root 4.0K Feb 19 13:09 . -rw-r--r-- 1 root root 2.9K Feb 19 13:09 hadoop-root-nodemanager-5a42c671b202.out -rw-r--r-- 1 root root 29K Feb 19 13:09 hadoop-root-proxyserver-5a42c671b202.log -rw-r--r-- 1 root root 3.0K Feb 19 13:09 hadoop-root-resourcemanager-5a42c671b202.out -rw-r--r-- 1 root root 828 Feb 19 13:09 hadoop-root-proxyserver-5a42c671b202.out -rw-r--r-- 1 root root 72K Feb 19 13:07 hadoop-root-secondarynamenode-5a42c671b202.log -rw-r--r-- 1 root root 828 Feb 19 13:07 hadoop-root-secondarynamenode-5a42c671b202.out -rw-r--r-- 1 root root 828 Feb 19 13:07 hadoop-root-datanode-5a42c671b202.out -rw-r--r-- 1 root root 828 Feb 19 13:07 hadoop-root-namenode-5a42c671b202.out -rw-r--r-- 1 root root 828 Feb 19 13:07 hadoop-root-secondarynamenode-5a42c671b202.out.1 -rw-r--r-- 1 root root 828 Feb 19 13:06 hadoop-root-datanode-5a42c671b202.out.1 -rw-r--r-- 1 root root 828 Feb 19 13:06 hadoop-root-namenode-5a42c671b202.out.1 -rw-r--r-- 1 root root 0 Feb 19 13:06 SecurityAuth-root.audit drwxr-xr-x 11 1000 1000 4.0K Feb 19 13:06 ..
For instance, here is the tail of the NameNode's log:
!tail $HADOOP_HOME/logs/hadoop-root-namenode-*.log
2024-02-19 13:10:07,280 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* blk_1073741842_1018 is COMMITTED but not COMPLETE(numNodes= 0 < minimum = 1) in file /tmp/hadoop-yarn/staging/root/.staging/job_1708348145550_0001/job_1708348145550_0001_1.jhist 2024-02-19 13:10:07,685 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /tmp/hadoop-yarn/staging/root/.staging/job_1708348145550_0001/job_1708348145550_0001_1.jhist is closed by DFSClient_NONMAPREDUCE_-1816557497_1 2024-02-19 13:10:07,698 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1073741845_1021, replicas=127.0.0.1:9866 for /tmp/hadoop-yarn/staging/history/done_intermediate/root/job_1708348145550_0001.summary_tmp 2024-02-19 13:10:07,716 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* blk_1073741845_1021 is COMMITTED but not COMPLETE(numNodes= 0 < minimum = 1) in file /tmp/hadoop-yarn/staging/history/done_intermediate/root/job_1708348145550_0001.summary_tmp 2024-02-19 13:10:08,119 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /tmp/hadoop-yarn/staging/history/done_intermediate/root/job_1708348145550_0001.summary_tmp is closed by DFSClient_NONMAPREDUCE_-1816557497_1 2024-02-19 13:10:08,232 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1073741846_1022, replicas=127.0.0.1:9866 for /tmp/hadoop-yarn/staging/history/done_intermediate/root/job_1708348145550_0001-1708348172882-root-QuasiMonteCarlo-1708348207241-2-1-SUCCEEDED-default-1708348185932.jhist_tmp 2024-02-19 13:10:08,292 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* blk_1073741846_1022 is COMMITTED but not COMPLETE(numNodes= 0 < minimum = 1) in file /tmp/hadoop-yarn/staging/history/done_intermediate/root/job_1708348145550_0001-1708348172882-root-QuasiMonteCarlo-1708348207241-2-1-SUCCEEDED-default-1708348185932.jhist_tmp 2024-02-19 13:10:08,709 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /tmp/hadoop-yarn/staging/history/done_intermediate/root/job_1708348145550_0001-1708348172882-root-QuasiMonteCarlo-1708348207241-2-1-SUCCEEDED-default-1708348185932.jhist_tmp is closed by DFSClient_NONMAPREDUCE_-1816557497_1 2024-02-19 13:10:08,810 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1073741847_1023, replicas=127.0.0.1:9866 for /tmp/hadoop-yarn/staging/history/done_intermediate/root/job_1708348145550_0001_conf.xml_tmp 2024-02-19 13:10:08,876 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /tmp/hadoop-yarn/staging/history/done_intermediate/root/job_1708348145550_0001_conf.xml_tmp is closed by DFSClient_NONMAPREDUCE_-1816557497_1
Show the most recent error or warning messages in the Hadoop logfiles.
!grep "ERROR\|WARN" $HADOOP_HOME/logs/*.log | sort -r -k2 | head
/content/hadoop-3.3.6/logs/hadoop-root-datanode-5a42c671b202.log:2024-02-19 13:10:08,234 WARN org.apache.hadoop.hdfs.DFSUtil: Unexpected value for data transfer bytes=26543 duration=0 /content/hadoop-3.3.6/logs/hadoop-root-nodemanager-5a42c671b202.log:2024-02-19 13:10:07,656 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: delete returned false for path: [/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1708348145550_0001/container_1708348145550_0001_01_000004/sysfs] /content/hadoop-3.3.6/logs/hadoop-root-nodemanager-5a42c671b202.log:2024-02-19 13:10:07,656 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: delete returned false for path: [/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1708348145550_0001/container_1708348145550_0001_01_000004/launch_container.sh] /content/hadoop-3.3.6/logs/hadoop-root-nodemanager-5a42c671b202.log:2024-02-19 13:10:07,656 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: delete returned false for path: [/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1708348145550_0001/container_1708348145550_0001_01_000004/container_tokens] /content/hadoop-3.3.6/logs/hadoop-root-nodemanager-5a42c671b202.log:2024-02-19 13:10:05,055 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: delete returned false for path: [/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1708348145550_0001/container_1708348145550_0001_01_000002/sysfs] /content/hadoop-3.3.6/logs/hadoop-root-nodemanager-5a42c671b202.log:2024-02-19 13:10:05,055 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: delete returned false for path: [/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1708348145550_0001/container_1708348145550_0001_01_000002/launch_container.sh] /content/hadoop-3.3.6/logs/hadoop-root-nodemanager-5a42c671b202.log:2024-02-19 13:10:05,055 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: delete returned false for path: [/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1708348145550_0001/container_1708348145550_0001_01_000002/container_tokens] /content/hadoop-3.3.6/logs/hadoop-root-nodemanager-5a42c671b202.log:2024-02-19 13:09:59,890 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: delete returned false for path: [/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1708348145550_0001/container_1708348145550_0001_01_000003/sysfs] /content/hadoop-3.3.6/logs/hadoop-root-nodemanager-5a42c671b202.log:2024-02-19 13:09:59,890 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: delete returned false for path: [/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1708348145550_0001/container_1708348145550_0001_01_000003/launch_container.sh] /content/hadoop-3.3.6/logs/hadoop-root-nodemanager-5a42c671b202.log:2024-02-19 13:09:59,890 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: delete returned false for path: [/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1708348145550_0001/container_1708348145550_0001_01_000003/container_tokens]
Show most recent error messages in Hadoop logfiles.
!grep "ERROR" $HADOOP_HOME/logs/*.log | sort -r -k2 | head
/content/hadoop-3.3.6/logs/hadoop-root-historyserver-5a42c671b202.log:2024-02-19 13:09:25,174 ERROR org.apache.hadoop.yarn.logaggregation.AggregatedLogDeletionService: Error reading root log dir this deletion attempt is being aborted /content/hadoop-3.3.6/logs/hadoop-root-secondarynamenode-5a42c671b202.log:2024-02-19 13:07:15,409 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: RECEIVED SIGNAL 15: SIGTERM /content/hadoop-3.3.6/logs/hadoop-root-datanode-5a42c671b202.log:2024-02-19 13:07:12,489 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: RECEIVED SIGNAL 15: SIGTERM /content/hadoop-3.3.6/logs/hadoop-root-namenode-5a42c671b202.log:2024-02-19 13:07:11,206 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: RECEIVED SIGNAL 15: SIGTERM
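The same scan can be done in Python. Here is a sketch that collects WARN/ERROR lines from all daemon logfiles and sorts them by the timestamp at the beginning of each line (continuation lines such as stack traces are ignored by this simple filter).
import glob
import os

# gather WARN/ERROR lines from every Hadoop daemon logfile
matches = []
for logfile in glob.glob(os.path.join(os.environ['HADOOP_HOME'], 'logs', '*.log')):
    with open(logfile, errors='replace') as f:
        for line in f:
            if ' WARN ' in line or ' ERROR ' in line:
                # the first 23 characters are the timestamp, e.g. "2024-02-19 13:10:08,234"
                matches.append((line[:23], os.path.basename(logfile), line.rstrip()))

# show the ten most recent messages
for _, logname, message in sorted(matches, reverse=True)[:10]:
    print(f"{logname}: {message}")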
!$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh stop historyserver
WARNING: Use of this script to stop the MR JobHistory daemon is deprecated. WARNING: Attempting to execute replacement "mapred --daemon stop" instead.
!$HADOOP_HOME/sbin/stop-yarn.sh
Stopping nodemanagers Stopping resourcemanager Stopping proxy server [127.0.0.1]
Stopping Hadoop and the sshd daemon
Stop Hadoop (with the command $HADOOP_HOME/sbin/stop-dfs.sh) and the sshd daemon (with /etc/init.d/ssh stop).
subprocess_output = subprocess.run([os.path.join(os.environ['HADOOP_HOME'], 'sbin', 'stop-dfs.sh')], check=True, stdout=subprocess.PIPE)
print(subprocess_output.stdout.decode())
Stopping namenodes on [localhost] Stopping datanodes Stopping secondary namenodes [5a42c671b202]
subprocess_output = subprocess.run(['/etc/init.d/ssh', 'stop'], check=True, stdout=subprocess.PIPE)
print(subprocess_output.stdout.decode())
* Stopping OpenBSD Secure Shell server sshd ...done.
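At this point no Hadoop daemon should be left listening, which we can verify with the same lsof invocation used earlier; note that lsof exits with a non-zero status when it finds nothing, so we do not pass check=True.
import subprocess

# lsof returns a non-zero exit code when no matching process is found
subprocess_output = subprocess.run(
    ["lsof", "-n", "-i", "-P", "+c0", "-sTCP:LISTEN", "-ac", "java"],
    stdout=subprocess.PIPE)
if subprocess_output.stdout:
    print("Some java processes are still listening:")
    print(subprocess_output.stdout.decode())
else:
    print("No java process is listening any more: all Hadoop daemons are down.")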
If you've come this far by running all of the notebook's cells from the menu (rather than interactively), you can restart Hadoop by issuing the following commands:
# start the ssh daemon
logger.info("Starting {}".format("openssh-server"))
cmd = ["/etc/init.d/ssh", "restart"]
result = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
!ssh localhost "echo hi!"
# start HDFS
logger.info("Starting HDFS")
subprocess_output = subprocess.run([os.path.join(os.environ['HADOOP_HOME'], 'sbin', 'start-dfs.sh')], check=True, stdout=subprocess.PIPE)
print(subprocess_output.stdout.decode())
19-Feb-24 01:10:33 PM - INFO: Starting openssh-server
hi!
19-Feb-24 01:10:34 PM - INFO: Starting HDFS
Starting namenodes on [localhost] Starting datanodes Starting secondary namenodes [5a42c671b202]
%%bash
counter=0
until [ $counter -gt 20 ]
do
((counter++))
sleep 1
# expect the 6 well-known HDFS ports plus the lsof header line = 7 lines;
# restricting to java processes (-ac java) keeps the notebook's own listener on port 9000 out of the count
if [ $(lsof -n -i -P +c0 -sTCP:LISTEN -ac java| wc -l ) -ge 8 ] && \
[ $(lsof -n -aiTCP:9870 -aiTCP:9000 -aiTCP:9864 -aiTCP:9866 -aiTCP:9867 -aiTCP:9868 -P +c0 -sTCP:LISTEN -ac java | wc -l) -eq 7 ]
then
echo "HDFS is up and running"
echo "Time to start: $counter secs"
exit
fi
done
echo "Some HDFS ports are missing. Wait some more or restart HDFS."
HDFS is up and running Time to start: 1 secs
# start YARN
logger.info("Starting YARN")
!$HADOOP_HOME/sbin/start-yarn.sh
19-Feb-24 01:10:59 PM - INFO: Starting YARN
Starting resourcemanager Starting nodemanagers
%%bash
# wait for YARN services to start listening
counter=0
until [ $counter -gt 20 ]
do
((counter++))
sleep 1
# expect the 7 YARN ports plus the lsof header line = 8 lines
if [ $(lsof -n -aiTCP:8088 -aiTCP:8030 -aiTCP:8031 -aiTCP:8032 -aiTCP:8033 -aiTCP:8042 -aiTCP:3141 -P +c0 -sTCP:LISTEN -ac java |wc -l ) -eq 8 ]
then
echo "YARN is up and running"
echo "Time to start: $counter secs"
exit
fi
done
echo "Some YARN ports are missing. Wait some more or restart YARN."
YARN is up and running Time to start: 3 secs
# start History Server
!$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver
WARNING: Use of this script to start the MR JobHistory daemon is deprecated. WARNING: Attempting to execute replacement "mapred --daemon start" instead.
# Web UIs
if IN_COLAB:
    # NameNode
    output.serve_kernel_port_as_window(9870, path='/index.html')
    # YARN
    output.serve_kernel_port_as_window(8088, path='/cluster')
    # Job History Server
    output.serve_kernel_port_as_window(19888, path='/jobhistory')
Run the pi app again.
!yarn jar $(find $HADOOP_HOME -name "*mapreduce*examples*"|head -1) pi 2 100
Number of Maps = 2 Samples per Map = 100 Wrote input for Map #0 Wrote input for Map #1 Starting Job 2024-02-19 13:11:38,012 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at localhost/127.0.0.1:8032 2024-02-19 13:11:39,247 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1708348278844_0001 2024-02-19 13:11:39,875 INFO input.FileInputFormat: Total input files to process : 2 2024-02-19 13:11:40,867 INFO mapreduce.JobSubmitter: number of splits:2 2024-02-19 13:11:41,637 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1708348278844_0001 2024-02-19 13:11:41,637 INFO mapreduce.JobSubmitter: Executing with tokens: [] 2024-02-19 13:11:41,914 INFO conf.Configuration: resource-types.xml not found 2024-02-19 13:11:41,915 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'. 2024-02-19 13:11:42,367 INFO impl.YarnClientImpl: Submitted application application_1708348278844_0001 2024-02-19 13:11:42,531 INFO mapreduce.Job: The url to track the job: http://127.0.0.1:3141/proxy/application_1708348278844_0001/ 2024-02-19 13:11:42,534 INFO mapreduce.Job: Running job: job_1708348278844_0001 2024-02-19 13:11:57,851 INFO mapreduce.Job: Job job_1708348278844_0001 running in uber mode : false 2024-02-19 13:11:57,852 INFO mapreduce.Job: map 0% reduce 0% 2024-02-19 13:12:11,106 INFO mapreduce.Job: map 100% reduce 0% 2024-02-19 13:12:18,177 INFO mapreduce.Job: map 100% reduce 100% 2024-02-19 13:12:20,212 INFO mapreduce.Job: Job job_1708348278844_0001 completed successfully 2024-02-19 13:12:20,361 INFO mapreduce.Job: Counters: 54 File System Counters FILE: Number of bytes read=50 FILE: Number of bytes written=829623 FILE: Number of read operations=0 FILE: Number of large read operations=0 FILE: Number of write operations=0 HDFS: Number of bytes read=528 HDFS: Number of bytes written=215 HDFS: Number of read operations=13 HDFS: Number of large read operations=0 HDFS: Number of write operations=3 HDFS: Number of bytes read erasure-coded=0 Job Counters Launched map tasks=2 Launched reduce tasks=1 Data-local map tasks=2 Total time spent by all maps in occupied slots (ms)=22188 Total time spent by all reduces in occupied slots (ms)=4693 Total time spent by all map tasks (ms)=22188 Total time spent by all reduce tasks (ms)=4693 Total vcore-milliseconds taken by all map tasks=22188 Total vcore-milliseconds taken by all reduce tasks=4693 Total megabyte-milliseconds taken by all map tasks=22720512 Total megabyte-milliseconds taken by all reduce tasks=4805632 Map-Reduce Framework Map input records=2 Map output records=4 Map output bytes=36 Map output materialized bytes=56 Input split bytes=292 Combine input records=0 Combine output records=0 Reduce input groups=2 Reduce shuffle bytes=56 Reduce input records=4 Reduce output records=0 Spilled Records=8 Shuffled Maps =2 Failed Shuffles=0 Merged Map outputs=2 GC time elapsed (ms)=248 CPU time spent (ms)=2910 Physical memory (bytes) snapshot=849063936 Virtual memory (bytes) snapshot=8171692032 Total committed heap usage (bytes)=788529152 Peak Map Physical memory (bytes)=315535360 Peak Map Virtual memory (bytes)=2720288768 Peak Reduce Physical memory (bytes)=230117376 Peak Reduce Virtual memory (bytes)=2731159552 Shuffle Errors BAD_ID=0 CONNECTION=0 IO_ERROR=0 WRONG_LENGTH=0 WRONG_MAP=0 WRONG_REDUCE=0 File Input Format Counters Bytes Read=236 File Output Format Counters Bytes Written=97 Job Finished in 42.765 seconds Estimated value of Pi is 3.12000000000000000000
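As a side note on what the pi example actually computes: it scatters sample points over a unit square and counts how many fall inside the inscribed circle, so that pi is approximately 4 * inside / total. With only 2 maps x 100 samples the estimate is coarse (3.12 here, i.e. 156 of the 200 points landed inside). The Hadoop example uses a deterministic Halton sequence; the following sketch illustrates the same idea with pseudo-random points.
import random

def estimate_pi(samples: int, seed: int = 42) -> float:
    """Monte Carlo estimate of pi: fraction of random points in the unit square
    that fall inside the quarter circle of radius 1, multiplied by 4."""
    rng = random.Random(seed)
    inside = sum(1 for _ in range(samples)
                 if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4.0 * inside / samples

for n in (200, 20_000, 2_000_000):
    print(f"{n:9d} samples -> pi is approximately {estimate_pi(n):.5f}")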