Following Apache's official documentation, Hadoop: Setting up a Single Node Cluster, we are going to launch a single-node Hadoop cluster in pseudo-distributed mode (see Pseudo-Distributed Operation). As opposed to local (standalone) operation, where a single Java Virtual Machine is responsible for all services, in pseudo-distributed mode each Hadoop daemon runs in a separate Java process.
In this notebook, we are going to rely as much as possible on Python commands. This is the companion notebook to Hadoop: Setting up a Single Node Cluster, which is implemented in bash.
This tutorial deliberately takes a detailed, and sometimes tedious, approach in order to offer a comprehensive learning experience.
The intention is to help you learn thoroughly and understand the motivation behind each step. Even if it may seem a bit pedantic at times, this thoroughness should help you grasp the significance of every action in the tutorial.
While the notebook is designed to be interactive, several commands may not yield real-time output. This is in the nature of the working context: Big Data jobs typically run in batch mode, where data is processed in large chunks rather than interactively, so some latency before results appear is to be expected even in an interactive session.
Additionally, even though in this tutorial we are not really dealing with large amounts of data, keep in mind that a pseudo-distributed Hadoop installation is not the most efficient environment, since it emulates a cluster on a single (virtual) machine.
All this is to say that you should be ready to invest some time in this tutorial and arm yourself with patience.
Image conveying a sense of tranquility (photo by Billy Williams on Unsplash).
# URL for downloading Hadoop
HADOOP_URL = "https://dlcdn.apache.org/hadoop/common/stable/hadoop-3.3.6.tar.gz"
# logging level (should be one of: DEBUG, INFO, WARNING, ERROR, CRITICAL)
LOGGING_LEVEL = "INFO" #@param ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]
import sys
import logging
import subprocess
import os
!pip install Crypto
!pip install pycryptodome
import Crypto
!pip install ssh_utilities
from ssh_utilities import Connection
import shutil
from pathlib import Path
import urllib.request
import tarfile
import xml.etree.ElementTree as ET
import xml.dom.minidom
import glob
# true if running on Google Colab
IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
from google.colab import output
Successfully installed Crypto-1.4.1 Naked-0.1.32 shellescape-3.8.1
Successfully installed pycryptodome-3.20.0
Successfully installed bcrypt-4.1.2 colorama-0.4.6 paramiko-3.4.0 pynacl-1.5.0 ssh_utilities-0.15.2
for handler in logging.root.handlers[:]:
logging.root.removeHandler(handler)
logging_level = getattr(logging, LOGGING_LEVEL.upper(), 10)
logging.basicConfig(level=logging_level, \
format='%(asctime)s - %(levelname)s: %(message)s', \
datefmt='%d-%b-%y %I:%M:%S %p')
logger = logging.getLogger('my_logger')
JAVA_HOME
On Google Colab Java is already available, but we accommodate the general case of an Ubuntu machine where Java needs to be installed (we pick the openjdk-19-jre-headless version).
# set variable JAVA_HOME (install Java if necessary)
def is_java_installed():
    java_path = shutil.which("java")
    if java_path is None:
        return None
    os.environ['JAVA_HOME'] = os.path.realpath(java_path).split('/bin')[0]
    return os.environ['JAVA_HOME']
def install_java():
# Uncomment and modify the desired version
# java_version= 'openjdk-11-jre-headless'
# java_version= 'default-jre'
# java_version= 'openjdk-17-jre-headless'
# java_version= 'openjdk-18-jre-headless'
java_version= 'openjdk-19-jre-headless'
print(f"Java not found. Installing {java_version} ... (this might take a while)")
try:
cmd = f"apt install -y {java_version}"
subprocess_output = subprocess.run(cmd, shell=True, check=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
stdout_result = subprocess_output.stdout
# Process the results as needed
logger.info("Done installing Java {}".format(java_version))
os.environ['JAVA_HOME'] = os.path.realpath(shutil.which("java")).split('/bin')[0]
logger.info("JAVA_HOME is {}".format(os.environ['JAVA_HOME']))
except subprocess.CalledProcessError as e:
# Handle the error if the command returns a non-zero exit code
logger.warn("Command failed with return code {}".format(e.returncode))
logger.warn("stdout: {}".format(e.stdout))
# Install Java if not available
if is_java_installed():
logger.info("Java is already installed: {}".format(os.environ['JAVA_HOME']))
else:
logger.info("Installing Java")
install_java()
19-Feb-24 01:05:23 PM - INFO: Java is already installed: /usr/lib/jvm/java-11-openjdk-amd64
sshd server
The Hadoop cluster needs ssh for communication (even if it runs on a single node). See the official documentation: Setup passphraseless ssh.
We install openssh-server and test whether the installation was successful with the shell command ssh localhost "echo hi!" as well as with Connection from Python's ssh_utilities library.
openssh-server
logger.info("Installing {}".format("openssh-server"))
cmd = ["apt-get", "install", "openssh-server"]
try:
result = subprocess.check_output(cmd, stderr=subprocess.STDOUT, text=True)
except subprocess.CalledProcessError as e:
# Access the error message from the output attribute
error_message = e.output
print(f"Command failed with error message: {error_message}")
19-Feb-24 01:05:23 PM - INFO: Installing openssh-server
ssh configuration
On Colab we add StrictHostKeyChecking no to /etc/ssh/ssh_config, so that connecting to localhost does not require interactive confirmation.
if IN_COLAB:
ssh_config_file = "/etc/ssh/ssh_config"
with open(ssh_config_file, "r+") as f:
var = 'StrictHostKeyChecking no'
line_found = any(line.strip().startswith(var) for line in f)
if not line_found:
f.seek(0, os.SEEK_END)
f.write(var +"\n")
sshd service
logger.info("Starting {}".format("openssh-server"))
cmd = ["/etc/init.d/ssh", "restart"]
try:
result = subprocess.check_output(cmd, stderr=subprocess.STDOUT, text=True)
except subprocess.CalledProcessError as e:
# Access the error message from the output attribute
error_message = e.output
print(f"Command failed with error message: {error_message}")
19-Feb-24 01:05:35 PM - INFO: Starting openssh-server
The private/public key pair is needed for passwordless authentication.
Note: the pip package Crypto does not work on its own; it needs pycryptodome, which provides the importable Crypto module (see ModuleNotFoundError: No module named 'Crypto' Error).
from Crypto.PublicKey import RSA
key = RSA.generate(2048)
if not os.path.exists(os.path.join(os.path.expanduser('~'), '.ssh')):
    os.makedirs(os.path.join(os.path.expanduser('~'), '.ssh'), mode=0o700)
#os.chmod(os.path.join(os.path.expanduser('~'),'.ssh'), int('700', base=8))
with open(os.path.join(os.path.expanduser('~'), '.ssh/id_rsa'), 'wb') as f:
os.chmod(os.path.join(os.path.expanduser('~'),'.ssh/id_rsa'), int('0600', base=8))
f.write(key.export_key('PEM', passphrase=None))
pubkey = key.publickey()
with open(os.path.join(os.path.expanduser('~'), '.ssh/id_rsa.pub'), 'wb') as f:
f.write(pubkey.exportKey('OpenSSH'))
with open(os.path.join(os.path.expanduser('~'), '.ssh/authorized_keys'), 'w') as a:
with open(os.path.join(os.path.expanduser('~'),'.ssh/id_rsa.pub'), 'r') as f:
a.write(f.read())
os.chmod(os.path.join(os.path.expanduser('~'), '.ssh/authorized_keys'), int('600', base=8))
ssh connection
We are going to test if the passphraseless connection is working by running the command "echo hi!" on localhost over ssh. The output should be 'hi!' if the ssh connection is working.
1. Test with a shell command
!ssh localhost "echo hi!"
Warning: Permanently added 'localhost' (ED25519) to the list of known hosts. hi!
2. Test with Python ssh_utilities.Connection
private_key = os.path.join(os.path.expanduser('~'), '.ssh/id_rsa')
conn = Connection.open('root','localhost', private_key)
try:
command_output = conn.subprocess.run(['echo', 'hi!'], suppress_out=True, quiet=False,
capture_output=True, check=True, cwd=Path('/content'))
except Exception as err:
logger.error(err)
print("❌ An exception occurred:")
print(err)
else:
logger.info("Successfully executed command on localhost over ssh")
print("✅ Successfully executed command {} on localhost".format('"ssh localhost \'echo hi!\'"'))
print("Stdout: {}".format(command_output.stdout.decode("utf-8")))
19-Feb-24 01:05:36 PM - INFO: Connection object will not be thread safe
Will login with private RSA key located in /root/.ssh/id_rsa
19-Feb-24 01:05:36 PM - INFO: Will login with private RSA key located in /root/.ssh/id_rsa
Connecting to server: root@localhost
19-Feb-24 01:05:36 PM - INFO: Connecting to server: root@localhost
When running an executale on server always make sure that full path is specified!!!
19-Feb-24 01:05:36 PM - INFO: When running an executale on server always make sure that full path is specified!!! 19-Feb-24 01:05:36 PM - INFO: could not parse key with Ed25519Key 19-Feb-24 01:05:36 PM - INFO: could not parse key with DSSKey 19-Feb-24 01:05:36 PM - INFO: could not parse key with ECDSAKey 19-Feb-24 01:05:36 PM - INFO: trying to authenticate with private-key 19-Feb-24 01:05:36 PM - INFO: Connected (version 2.0, client OpenSSH_8.9p1) 19-Feb-24 01:05:36 PM - INFO: Authentication (publickey) successful! 19-Feb-24 01:05:36 PM - INFO: successfully authenticated with: private-key
Executing command on remote: echo hi!
19-Feb-24 01:05:36 PM - INFO: Successfully executed command on localhost over ssh
✅ Successfully executed command "ssh localhost 'echo hi!'" on localhost Stdout: hi!
Download the latest stable version of the core Hadoop distribution from one of the download mirror locations: https://www.apache.org/dyn/closer.cgi/hadoop/common/.
file_name = os.path.basename(HADOOP_URL)
if os.path.isfile(file_name):
logger.info("{} already exists, not downloading".format(file_name))
else:
logger.info("Downloading {}".format(file_name))
urllib.request.urlretrieve(HADOOP_URL, file_name)
19-Feb-24 01:05:36 PM - INFO: Downloading hadoop-3.3.6.tar.gz
dir_name = file_name[:-7]
if os.path.exists(dir_name):
logger.info("{} is already uncompressed".format(file_name))
else:
logger.info("Uncompressing {}".format(file_name))
tar = tarfile.open(file_name)
tar.extractall()
tar.close()
19-Feb-24 01:06:22 PM - INFO: Uncompressing hadoop-3.3.6.tar.gz
In this section we continue following Apache Hadoop's official documentation on how to set up a single node in pseudo-distributed mode (see Configuration).
os.environ['HADOOP_HOME'] = os.path.join(os.getcwd(), dir_name)
logger.info("HADOOP_HOME is {}".format(os.environ['HADOOP_HOME']))
os.environ['HADOOP_COMMON_HOME'] = os.environ['HADOOP_HOME']
logger.info("HADOOP_COMMON_HOME is {}".format(os.environ['HADOOP_COMMON_HOME']))
os.environ['HADOOP_MAPREDUCE_HOME'] = os.environ['HADOOP_HOME']
logger.info("HADOOP_MAPREDUCE_HOME is {}".format(os.environ['HADOOP_MAPREDUCE_HOME']))
os.environ['HADOOP_YARN_HOME'] = os.environ['HADOOP_HOME']
logger.info("HADOOP_YARN_HOME is {}".format(os.environ['HADOOP_YARN_HOME']))
os.environ['PATH'] = ':'.join([os.path.join(os.environ['HADOOP_HOME'], 'bin'), os.environ['PATH']])
logger.info("PATH is {}".format(os.environ['PATH']))
19-Feb-24 01:06:41 PM - INFO: HADOOP_HOME is /content/hadoop-3.3.6 19-Feb-24 01:06:41 PM - INFO: HADOOP_COMMON_HOME is /content/hadoop-3.3.6 19-Feb-24 01:06:41 PM - INFO: HADOOP_MAPREDUCE_HOME is /content/hadoop-3.3.6 19-Feb-24 01:06:41 PM - INFO: HADOOP_YARN_HOME is /content/hadoop-3.3.6 19-Feb-24 01:06:41 PM - INFO: PATH is /content/hadoop-3.3.6/bin:/opt/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/tools/node/bin:/tools/google-cloud-sdk/bin
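As an optional sanity check, we can verify that the Hadoop command-line tools are now resolvable from the updated PATH; a minimal sketch:
# optional: confirm that the Hadoop executables are reachable now that $HADOOP_HOME/bin is on the PATH
import shutil
for exe in ("hadoop", "hdfs", "yarn", "mapred"):
    path = shutil.which(exe)
    print(f"{exe:7s} -> {path if path else 'NOT FOUND'}")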
Throughout this notebook we're going to edit four important Hadoop configuration files:
core-site.xml, containing the site-specific configuration of the Hadoop core
hdfs-site.xml, containing the site-specific configuration of Hadoop's HDFS
yarn-site.xml, containing the site-specific configuration of Hadoop's YARN
mapred-site.xml, containing the site-specific configuration of Hadoop's MapReduce
Additionally, we are going to edit the file hadoop-env.sh, a helper file containing environment variables needed by the shell scripts used to start the cluster.
All these files can be found in etc/hadoop under $HADOOP_HOME (the Hadoop installation folder).
# site-specific configuration files
core_site_file = os.path.join(os.environ['HADOOP_HOME'],'etc/hadoop/core-site.xml')
hdfs_site_file = os.path.join(os.environ['HADOOP_HOME'],'etc/hadoop/hdfs-site.xml')
yarn_site_file = os.path.join(os.environ['HADOOP_HOME'],'etc/hadoop/yarn-site.xml')
mapred_site_file = os.path.join(os.environ['HADOOP_HOME'],'etc/hadoop/mapred-site.xml')
# Hadoop environment variables
hadoop_env_file = os.path.join(os.environ['HADOOP_HOME'],'etc/hadoop/hadoop-env.sh')
!ls $HADOOP_HOME/etc/hadoop/*xml $HADOOP_HOME/etc/hadoop/hadoop-env.sh
/content/hadoop-3.3.6/etc/hadoop/capacity-scheduler.xml /content/hadoop-3.3.6/etc/hadoop/core-site.xml /content/hadoop-3.3.6/etc/hadoop/hadoop-env.sh /content/hadoop-3.3.6/etc/hadoop/hadoop-policy.xml /content/hadoop-3.3.6/etc/hadoop/hdfs-rbf-site.xml /content/hadoop-3.3.6/etc/hadoop/hdfs-site.xml /content/hadoop-3.3.6/etc/hadoop/httpfs-site.xml /content/hadoop-3.3.6/etc/hadoop/kms-acls.xml /content/hadoop-3.3.6/etc/hadoop/kms-site.xml /content/hadoop-3.3.6/etc/hadoop/mapred-site.xml /content/hadoop-3.3.6/etc/hadoop/yarn-site.xml
In order to edit the site-specific configuration files in an orderly fashion, we have created the functions:
clear_conf_file(file) to remove all properties from the XML file file
edit_conf_file(file, propertyname, propertyvalue) to add a property name and value in the proper format to the XML file file
print_conf_file(file) to pretty-print the XML file file
def clear_conf_file(file):
tree = ET.parse(file)
root = tree.getroot()
# remove all properties
for el in list(root):
root.remove(el)
tree.write(file, encoding='utf-8', xml_declaration=True)
def edit_conf_file(file, propertyname, propertyvalue):
tree = ET.parse(file)
root = tree.getroot()
logger.info("add property {} to {}".format(propertyname, file))
property = ET.SubElement(root, 'property')
name = ET.SubElement(property, 'name')
name.text = propertyname
value = ET.SubElement(property, 'value')
value.text = propertyvalue
tree.write(file, encoding='utf-8', xml_declaration=True)
def print_conf_file(file):
# pretty-print
dom = xml.dom.minidom.parse(file)
print(dom.toprettyxml())
core-site.xml and hdfs-site.xml
Let us configure two properties:
fs.defaultFS (the URI of the default file system) to hdfs://localhost:9000 in core-site.xml
dfs.replication to $1$ (the default is $3$) in hdfs-site.xml
clear_conf_file(core_site_file)
edit_conf_file(core_site_file, 'fs.defaultFS', 'hdfs://localhost:9000')
print_conf_file(core_site_file)
19-Feb-24 01:06:41 PM - INFO: add property fs.defaultFS to /content/hadoop-3.3.6/etc/hadoop/core-site.xml
<?xml version="1.0" ?> <configuration> <property> <name>fs.defaultFS</name> <value>hdfs://localhost:9000</value> </property> </configuration>
clear_conf_file(hdfs_site_file)
edit_conf_file(hdfs_site_file, 'dfs.replication', '1')
print_conf_file(hdfs_site_file)
19-Feb-24 01:06:41 PM - INFO: add property dfs.replication to /content/hadoop-3.3.6/etc/hadoop/hdfs-site.xml
<?xml version="1.0" ?> <configuration> <property> <name>dfs.replication</name> <value>1</value> </property> </configuration>
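If you also need to read a property back from one of these files, a small companion helper in the same spirit could look like this (get_conf_value is a hypothetical name, not used elsewhere in the notebook):
def get_conf_value(file, propertyname):
    # return the value of propertyname from a site-specific XML file, or None if it is not set
    tree = ET.parse(file)
    for prop in tree.getroot().findall('property'):
        if prop.findtext('name') == propertyname:
            return prop.findtext('value')
    return None

print(get_conf_value(core_site_file, 'fs.defaultFS'))    # expected: hdfs://localhost:9000
print(get_conf_value(hdfs_site_file, 'dfs.replication')) # expected: 1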
hadoop-env.sh
users = ['HDFS_NAMENODE_USER', \
'HDFS_DATANODE_USER', \
'HDFS_SECONDARYNAMENODE_USER', \
'YARN_RESOURCEMANAGER_USER', \
'YARN_NODEMANAGER_USER', \
'YARN_PROXYSERVER_USER']
logger.info("Editing {}".format(hadoop_env_file))
with open(hadoop_env_file, "r+") as f:
    for u in users:
        var = 'export ' + u + '=root'
        f.seek(0)  # rewind before scanning for an existing line
        line_found = any(line.startswith(var) for line in f)
        if not line_found:
            f.seek(0, os.SEEK_END)
            f.write(var + "\n")
    f.seek(0)
    line_found = any(line.startswith('export JAVA_HOME=') for line in f)
    if not line_found:
        f.seek(0, os.SEEK_END)
        f.write('export JAVA_HOME=' + os.environ['JAVA_HOME'] + "\n")
19-Feb-24 01:06:41 PM - INFO: Editing /content/hadoop-3.3.6/etc/hadoop/hadoop-env.sh
By providing site-specific configuration files one can override any of the default properties. The default-value files corresponding to the site-specific files are:
core-default.xml
hdfs-default.xml
yarn-default.xml
mapred-default.xml
These are read-only files that contain all default values for Hadoop properties (see cluster setup) and can be viewed at:
list(glob.iglob(os.environ['HADOOP_HOME']+ '/**/*default.xml', recursive=True))
['/content/hadoop-3.3.6/share/doc/hadoop/hadoop-yarn/hadoop-yarn-common/yarn-default.xml', '/content/hadoop-3.3.6/share/doc/hadoop/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml', '/content/hadoop-3.3.6/share/doc/hadoop/hadoop-project-dist/hadoop-hdfs-rbf/hdfs-rbf-default.xml', '/content/hadoop-3.3.6/share/doc/hadoop/hadoop-project-dist/hadoop-common/core-default.xml', '/content/hadoop-3.3.6/share/doc/hadoop/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml']
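Since these default files are plain XML, a property's shipped default can also be looked up programmatically; a minimal sketch, assuming the hdfs-default.xml path matches the glob pattern below:
# look up the shipped default value of dfs.replication in hdfs-default.xml
default_files = glob.glob(os.environ['HADOOP_HOME'] + '/**/hdfs-default.xml', recursive=True)
if default_files:
    tree = ET.parse(default_files[0])
    for prop in tree.getroot().iter('property'):
        if prop.findtext('name') == 'dfs.replication':
            print('dfs.replication defaults to', prop.findtext('value'))  # expected: 3
            break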
print('\n'.join([core_site_file, hdfs_site_file, yarn_site_file, mapred_site_file]))
/content/hadoop-3.3.6/etc/hadoop/core-site.xml /content/hadoop-3.3.6/etc/hadoop/hdfs-site.xml /content/hadoop-3.3.6/etc/hadoop/yarn-site.xml /content/hadoop-3.3.6/etc/hadoop/mapred-site.xml
logger.info("Formatting NameNode")
cmd = ["hdfs", "namenode", "-format", "-nonInteractive"]
result = subprocess.call(cmd, stderr=subprocess.STDOUT)
19-Feb-24 01:06:41 PM - INFO: Formatting NameNode
Launch a single-node HDFS cluster with the command start-dfs.sh.
subprocess_output = subprocess.run([os.path.join(os.environ['HADOOP_HOME'], 'sbin', 'start-dfs.sh')], check=True, stdout=subprocess.PIPE)
print(subprocess_output.stdout.decode())
Starting namenodes on [localhost] Starting datanodes Starting secondary namenodes [5a42c671b202] 5a42c671b202: Warning: Permanently added '5a42c671b202' (ED25519) to the list of known hosts.
!/content/hadoop-3.3.6/sbin/stop-dfs.sh
!/content/hadoop-3.3.6/sbin/start-dfs.sh
Stopping namenodes on [localhost] Stopping datanodes Stopping secondary namenodes [5a42c671b202] Starting namenodes on [localhost] Starting datanodes Starting secondary namenodes [5a42c671b202]
We have started some Java virtual machines corresponding to different Hadoop services. By using lsof to show listening ports you should get something like this:
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
java 13177 root 341u IPv4 357089 0t0 TCP *:9870 (LISTEN)
java 13177 root 349u IPv4 358597 0t0 TCP 127.0.0.1:9000 (LISTEN)
java 13297 root 342u IPv4 359615 0t0 TCP *:9866 (LISTEN)
java 13297 root 345u IPv4 358674 0t0 TCP 127.0.0.1:40349 (LISTEN)
java 13297 root 374u IPv4 358786 0t0 TCP *:9864 (LISTEN)
java 13297 root 375u IPv4 358790 0t0 TCP *:9867 (LISTEN)
java 13515 root 343u IPv4 360840 0t0 TCP *:9868 (LISTEN)
!lsof -n -i -P +c0 -sTCP:LISTEN -ac java
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME java 2534 root 341u IPv4 55167 0t0 TCP *:9870 (LISTEN) java 2534 root 349u IPv4 55722 0t0 TCP 127.0.0.1:9000 (LISTEN) java 2645 root 342u IPv4 54075 0t0 TCP *:9866 (LISTEN) java 2645 root 345u IPv4 54085 0t0 TCP 127.0.0.1:40039 (LISTEN) java 2645 root 374u IPv4 56384 0t0 TCP *:9864 (LISTEN) java 2645 root 375u IPv4 56418 0t0 TCP *:9867 (LISTEN) java 2860 root 343u IPv4 57029 0t0 TCP *:9868 (LISTEN)
As you can see, we have multiple services running on different ports all spawned by different Java processes corresponding to:
port number | description | configured in
---|---|---
9870 | dfs.namenode.http-address | hdfs-default.xml
9000 | fs.defaultFS | as defined in core-site.xml
9864 | dfs.datanode.http.address | hdfs-default.xml
9866 | dfs.datanode.address | hdfs-default.xml
9867 | dfs.datanode.ipc.address | hdfs-default.xml
9868 | dfs.namenode.secondary.http-address | hdfs-default.xml
Once HDFS is up and running there should be a total of $7$ java listening ports; six of them are precisely ports $9870$, $9000$, $9864$, $9866$, $9867$, $9868$ (the seventh is bound to a randomly assigned port).
Let us check that this is the case by introducing a loop of up to 20 retries (one per second) that checks whether all the required ports are open. This check is needed for non-interactive runs of the notebook.
%%bash
counter=0
until [ $counter -gt 20 ]
do
((counter++))
sleep 1
# counting lines: the lsof output includes a header line, so 7 java ports give at least 8 lines; the second test checks the six well-known HDFS ports explicitly (6 + header = 7 lines)
if [ $(lsof -n -i -P +c0 -sTCP:LISTEN -ac java| wc -l ) -ge 8 ] && \
[ $(lsof -n -aiTCP:9870 -aiTCP:9000 -aiTCP:9864 -aiTCP:9866 -aiTCP:9867 -aiTCP:9868 -P +c0 -sTCP:LISTEN -ac java | wc -l) -eq 7 ]
then
echo "HDFS is up and running"
echo "Time to start: $counter secs"
exit
fi
done
echo "Some HDFS ports are missing. Wait some more or restart HDFS."
HDFS is up and running Time to start: 1 secs
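The same check can also be expressed in pure Python with the standard socket module; a rough equivalent of the bash loop above (a sketch, polling the same six well-known ports):
import socket
import time

hdfs_ports = [9870, 9000, 9864, 9866, 9867, 9868]

def ports_open(ports, host="localhost", timeout=1):
    # return True only if every port accepts a TCP connection
    for p in ports:
        try:
            with socket.create_connection((host, p), timeout=timeout):
                pass
        except OSError:
            return False
    return True

for attempt in range(20):
    if ports_open(hdfs_ports):
        print("HDFS is up and running (after {} attempt(s))".format(attempt + 1))
        break
    time.sleep(1)
else:
    print("Some HDFS ports are missing. Wait some more or restart HDFS.")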
lsof for viewing listening ports
lsof is an exceptionally valuable command. With the option -i it can be used to show processes listening for network connections.
An example: check for java listening on a specific port ($9864$)
!lsof -iTCP:9864 -n -P +c0 -sTCP:LISTEN -ac java
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME java 2645 root 374u IPv4 56384 0t0 TCP *:9864 (LISTEN)
Options of lsof:
-i specifies that you want to display only network files, that is, open network connections
-n and -P tell lsof to show IP addresses and ports in numeric form
+c0 is used to show a longer substring of the name of the UNIX command associated with the process (https://linux.die.net/man/8/lsof)
-sTCP:LISTEN filters for TCP connections in state LISTEN
Run the command hdfs dfsadmin -report.
subprocess_output = subprocess.run([os.path.join(os.environ['HADOOP_HOME'], 'bin', 'hdfs'), 'dfsadmin', '-report'], check=True, stdout=subprocess.PIPE)
print(subprocess_output.stdout.decode())
Configured Capacity: 115658190848 (107.72 GB) Present Capacity: 85179183104 (79.33 GB) DFS Remaining: 85179158528 (79.33 GB) DFS Used: 24576 (24 KB) DFS Used%: 0.00% Replicated Blocks: Under replicated blocks: 0 Blocks with corrupt replicas: 0 Missing blocks: 0 Missing blocks (with replication factor 1): 0 Low redundancy blocks with highest priority to recover: 0 Pending deletion blocks: 0 Erasure Coded Block Groups: Low redundancy block groups: 0 Block groups with corrupt internal blocks: 0 Missing block groups: 0 Low redundancy blocks with highest priority to recover: 0 Pending deletion blocks: 0 ------------------------------------------------- Live datanodes (1): Name: 127.0.0.1:9866 (localhost) Hostname: 5a42c671b202 Decommission Status : Normal Configured Capacity: 115658190848 (107.72 GB) DFS Used: 24576 (24 KB) Non DFS Used: 30462230528 (28.37 GB) DFS Remaining: 85179158528 (79.33 GB) DFS Used%: 0.00% DFS Remaining%: 73.65% Configured Cache Capacity: 0 (0 B) Cache Used: 0 (0 B) Cache Remaining: 0 (0 B) Cache Used%: 100.00% Cache Remaining%: 0.00% Xceivers: 0 Last contact: Mon Feb 19 13:07:50 UTC 2024 Last Block Report: Mon Feb 19 13:07:39 UTC 2024 Num of Blocks: 0
The namenode runs a Web interface displaying information about the current status of the cluster (see HdfsUserGuide.html#Web_Interface). By default, the dfs namenode web ui will listen on port $9870$.
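Since the Web UI is served over plain HTTP, the same status information can also be fetched programmatically; a minimal sketch, assuming the default port $9870$ and the standard /jmx endpoint exposed by Hadoop daemons:
import json
import urllib.request

# query the NameNodeInfo JMX bean exposed by the NameNode Web server
url = "http://localhost:9870/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo"
with urllib.request.urlopen(url) as resp:
    info = json.loads(resp.read().decode())["beans"][0]
print("HDFS version :", info.get("Version"))
print("Live nodes   :", info.get("LiveNodes"))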
Let us open the Web UI in another tab in the browser with Google Colab's output
library (source: https://github.com/googlecolab/colabtools/tree/main/google/colab/output, documentation: Browsing to servers executing on the kernel).
!wget localhost:9870
--2024-02-19 13:07:52-- http://localhost:9870/ Resolving localhost (localhost)... 127.0.0.1, ::1 Connecting to localhost (localhost)|127.0.0.1|:9870... connected. HTTP request sent, awaiting response... 302 Found Location: http://localhost:9870/index.html [following] --2024-02-19 13:07:52-- http://localhost:9870/index.html Reusing existing connection to localhost:9870. HTTP request sent, awaiting response... 200 OK Length: 1079 (1.1K) [text/html] Saving to: ‘index.html’ index.html 100%[===================>] 1.05K --.-KB/s in 0s 2024-02-19 13:07:52 (122 MB/s) - ‘index.html’ saved [1079/1079]
%%capture namenode_url
if IN_COLAB:
port = 9870
output.serve_kernel_port_as_window(port, path='/index.html')
if IN_COLAB:
namenode_url.show()
This is what you should see if you open the above link in a newly opened tab or window in your browser.
On port $9864$ you can reach the datanode.
%%capture datanode_url
if IN_COLAB:
port = 9864
output.serve_kernel_port_as_window(port, path='/index.html')
if IN_COLAB:
datanode_url.show()
Throughout this notebook, I use the function output.serve_kernel_port_as_window
to serve Web pages from Google Colab.
You can find the documentation for Google Colab's advanced outputs library at: https://colab.research.google.com/notebooks/snippets/advanced_outputs.ipynb#scrollTo=Browsing_to_servers_executing_on_the_kernel.
Before serving a link, I always use wget
to see if there are any redirections that need to be included in the path
option. For instance:
output.serve_kernel_port_as_window(9870, path='/index.html')
works but
output.serve_kernel_port_as_window(9870)
will not work!
Oddly, the function output.serve_kernel_port_as_window is deprecated in favor of serve_kernel_port_as_iframe, which does not work.
if IN_COLAB:
help(output.serve_kernel_port_as_window)
Help on function serve_kernel_port_as_window in module google.colab.output._util: serve_kernel_port_as_window(port, path='/', anchor_text=None) Displays a link in the output to open a browser tab to a port on the kernel. DEPRECATED; Browser security updates have broken this feature. Use `serve_kernel_port_as_iframe` instead. See https://developer.chrome.com/en/docs/privacy-sandbox/storage-partitioning/. This allows viewing URLs hosted on the kernel in new browser tabs. The URL will only be valid for the current user while the notebook is open in Colab. Args: port: The kernel port to be exposed to the client. path: The path to be navigated to. anchor_text: Text content of the anchor link.
Upload the Hadoop distribution .tar.gz file (simply because it is a reasonably big file that is already on the local filesystem) to HDFS using the command hdfs dfs -put <filename>.
Check DFS usage before:
!hdfs dfsadmin -report | grep "^DFS Used" | tail -2
DFS Used: 24576 (24 KB) DFS Used%: 0.00%
!echo $(basename $HADOOP_URL)
# create home directory on HDFS
!hdfs dfs -mkdir -p /user/root
# upload file to HDFS
!hdfs dfs -put $(basename $HADOOP_URL)
# list HDFS home
!hdfs dfs -ls -h
hadoop-3.3.6.tar.gz Found 1 items -rw-r--r-- 1 root supergroup 696.3 M 2024-02-19 13:08 hadoop-3.3.6.tar.gz
You should now see that the value for DFS Used has increased from $0.0\%$ to a value $>0.0\%$.
!hdfs dfsadmin -report | grep "^DFS Used" | tail -2
DFS Used: 735836062 (701.75 MB) DFS Used%: 0.64%
The same is shown in the Web UI (you might have to refresh the page).
Configured Capacity: | 107.72 GB |
Configured Remote Capacity: | 0 B |
DFS Used: | 701.76 MB (0.64%) |
Clean up.
!hdfs dfs -rm -r $(basename $HADOOP_URL)
!hdfs dfs -ls
Deleted hadoop-3.3.6.tar.gz
If you run a MapReduce job at this point (without starting YARN), the job will run locally: with an empty mapred-site.xml, the property mapreduce.framework.name keeps its default value local.
We are resetting the YARN and MapReduce configuration files to their initial empty state in case you are not running the notebook for the first time.
clear_conf_file(yarn_site_file)
clear_conf_file(mapred_site_file)
The Hadoop distribution comes with several examples (both sources and compiled). Here we are going to show how to find the runnable JAR files.
Search for MapReduce examples with find. The first JAR will do.
!find $HADOOP_HOME -name "*mapreduce*examples*"
/content/hadoop-3.3.6/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar /content/hadoop-3.3.6/share/hadoop/mapreduce/sources/hadoop-mapreduce-examples-3.3.6-test-sources.jar /content/hadoop-3.3.6/share/hadoop/mapreduce/sources/hadoop-mapreduce-examples-3.3.6-sources.jar /content/hadoop-3.3.6/share/doc/hadoop/hadoop-mapreduce-examples
Call the JAR file to get a list of all available MapReduce examples.
!hadoop jar $(find $HADOOP_HOME -name "*mapreduce*examples*"|head -1)
An example program must be given as the first argument. Valid program names are: aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files. aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files. bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi. dbcount: An example job that count the pageview counts from a database. distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi. grep: A map/reduce program that counts the matches of a regex in the input. join: A job that effects a join over sorted, equally partitioned datasets multifilewc: A job that counts words from several files. pentomino: A map/reduce tile laying program to find solutions to pentomino problems. pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method. randomtextwriter: A map/reduce program that writes 10GB of random textual data per node. randomwriter: A map/reduce program that writes 10GB of random data per node. secondarysort: An example defining a secondary sort to the reduce. sort: A map/reduce program that sorts the data written by the random writer. sudoku: A sudoku solver. teragen: Generate data for the terasort terasort: Run the terasort teravalidate: Checking results of terasort wordcount: A map/reduce program that counts the words in the input files. wordmean: A map/reduce program that counts the average length of the words in the input files. wordmedian: A map/reduce program that counts the median length of the words in the input files. wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.
Run the pi example without arguments to get a usage message.
!hadoop jar $(find $HADOOP_HOME -name "*mapreduce*examples*"|head -1) pi
Usage: org.apache.hadoop.examples.QuasiMonteCarlo <nMaps> <nSamples> Generic options supported are: -conf <configuration file> specify an application configuration file -D <property=value> define a value for a given property -fs <file:///|hdfs://namenode:port> specify default filesystem URL to use, overrides 'fs.defaultFS' property from configurations. -jt <local|resourcemanager:port> specify a ResourceManager -files <file1,...> specify a comma-separated list of files to be copied to the map reduce cluster -libjars <jar1,...> specify a comma-separated list of jar files to be included in the classpath -archives <archive1,...> specify a comma-separated list of archives to be unarchived on the compute machines The general command line syntax is: command [genericOptions] [commandOptions]
pi MapReduce app
Run the pi example with <nMaps> $=2$ and <nSamples> $=100$.
!yarn jar $(find $HADOOP_HOME -name "*mapreduce*examples*"|head -1) pi 2 100
Number of Maps = 2 Samples per Map = 100 Wrote input for Map #0 Wrote input for Map #1 Starting Job 2024-02-19 13:08:35,431 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties 2024-02-19 13:08:35,582 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s). 2024-02-19 13:08:35,582 INFO impl.MetricsSystemImpl: JobTracker metrics system started 2024-02-19 13:08:35,868 INFO input.FileInputFormat: Total input files to process : 2 2024-02-19 13:08:35,881 INFO mapreduce.JobSubmitter: number of splits:2 2024-02-19 13:08:36,394 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1436131745_0001 2024-02-19 13:08:36,394 INFO mapreduce.JobSubmitter: Executing with tokens: [] 2024-02-19 13:08:36,613 INFO mapreduce.Job: The url to track the job: http://localhost:8080/ 2024-02-19 13:08:36,614 INFO mapreduce.Job: Running job: job_local1436131745_0001 2024-02-19 13:08:36,622 INFO mapred.LocalJobRunner: OutputCommitter set in config null 2024-02-19 13:08:36,630 INFO output.PathOutputCommitterFactory: No output committer factory defined, defaulting to FileOutputCommitterFactory 2024-02-19 13:08:36,632 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2 2024-02-19 13:08:36,632 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false 2024-02-19 13:08:36,640 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter 2024-02-19 13:08:36,721 INFO mapred.LocalJobRunner: Waiting for map tasks 2024-02-19 13:08:36,723 INFO mapred.LocalJobRunner: Starting task: attempt_local1436131745_0001_m_000000_0 2024-02-19 13:08:36,759 INFO output.PathOutputCommitterFactory: No output committer factory defined, defaulting to FileOutputCommitterFactory 2024-02-19 13:08:36,763 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2 2024-02-19 13:08:36,763 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false 2024-02-19 13:08:36,883 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ] 2024-02-19 13:08:36,909 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/user/root/QuasiMonteCarlo_1708348112809_1640684131/in/part0:0+118 2024-02-19 13:08:37,026 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584) 2024-02-19 13:08:37,026 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100 2024-02-19 13:08:37,026 INFO mapred.MapTask: soft limit at 83886080 2024-02-19 13:08:37,026 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600 2024-02-19 13:08:37,026 INFO mapred.MapTask: kvstart = 26214396; length = 6553600 2024-02-19 13:08:37,031 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer 2024-02-19 13:08:37,179 INFO mapred.LocalJobRunner: 2024-02-19 13:08:37,181 INFO mapred.MapTask: Starting flush of map output 2024-02-19 13:08:37,181 INFO mapred.MapTask: Spilling map output 2024-02-19 13:08:37,182 INFO mapred.MapTask: bufstart = 0; bufend = 18; bufvoid = 104857600 2024-02-19 13:08:37,182 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214392(104857568); length = 5/6553600 2024-02-19 13:08:37,189 INFO mapred.MapTask: Finished spill 0 2024-02-19 13:08:37,204 INFO mapred.Task: Task:attempt_local1436131745_0001_m_000000_0 is done. 
And is in the process of committing 2024-02-19 13:08:37,210 INFO mapred.LocalJobRunner: map 2024-02-19 13:08:37,210 INFO mapred.Task: Task 'attempt_local1436131745_0001_m_000000_0' done. 2024-02-19 13:08:37,220 INFO mapred.Task: Final Counters for attempt_local1436131745_0001_m_000000_0: Counters: 23 File System Counters FILE: Number of bytes read=281714 FILE: Number of bytes written=921239 FILE: Number of read operations=0 FILE: Number of large read operations=0 FILE: Number of write operations=0 HDFS: Number of bytes read=118 HDFS: Number of bytes written=236 HDFS: Number of read operations=7 HDFS: Number of large read operations=0 HDFS: Number of write operations=4 HDFS: Number of bytes read erasure-coded=0 Map-Reduce Framework Map input records=1 Map output records=2 Map output bytes=18 Map output materialized bytes=28 Input split bytes=146 Combine input records=0 Spilled Records=2 Failed Shuffles=0 Merged Map outputs=0 GC time elapsed (ms)=0 Total committed heap usage (bytes)=357564416 File Input Format Counters Bytes Read=118 2024-02-19 13:08:37,222 INFO mapred.LocalJobRunner: Finishing task: attempt_local1436131745_0001_m_000000_0 2024-02-19 13:08:37,223 INFO mapred.LocalJobRunner: Starting task: attempt_local1436131745_0001_m_000001_0 2024-02-19 13:08:37,227 INFO output.PathOutputCommitterFactory: No output committer factory defined, defaulting to FileOutputCommitterFactory 2024-02-19 13:08:37,227 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2 2024-02-19 13:08:37,227 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false 2024-02-19 13:08:37,228 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ] 2024-02-19 13:08:37,241 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/user/root/QuasiMonteCarlo_1708348112809_1640684131/in/part1:0+118 2024-02-19 13:08:37,291 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584) 2024-02-19 13:08:37,291 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100 2024-02-19 13:08:37,291 INFO mapred.MapTask: soft limit at 83886080 2024-02-19 13:08:37,291 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600 2024-02-19 13:08:37,291 INFO mapred.MapTask: kvstart = 26214396; length = 6553600 2024-02-19 13:08:37,306 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer 2024-02-19 13:08:37,324 INFO mapred.LocalJobRunner: 2024-02-19 13:08:37,325 INFO mapred.MapTask: Starting flush of map output 2024-02-19 13:08:37,325 INFO mapred.MapTask: Spilling map output 2024-02-19 13:08:37,325 INFO mapred.MapTask: bufstart = 0; bufend = 18; bufvoid = 104857600 2024-02-19 13:08:37,325 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26214392(104857568); length = 5/6553600 2024-02-19 13:08:37,328 INFO mapred.MapTask: Finished spill 0 2024-02-19 13:08:37,340 INFO mapred.Task: Task:attempt_local1436131745_0001_m_000001_0 is done. And is in the process of committing 2024-02-19 13:08:37,349 INFO mapred.LocalJobRunner: map 2024-02-19 13:08:37,349 INFO mapred.Task: Task 'attempt_local1436131745_0001_m_000001_0' done. 
2024-02-19 13:08:37,350 INFO mapred.Task: Final Counters for attempt_local1436131745_0001_m_000001_0: Counters: 23 File System Counters FILE: Number of bytes read=282025 FILE: Number of bytes written=921299 FILE: Number of read operations=0 FILE: Number of large read operations=0 FILE: Number of write operations=0 HDFS: Number of bytes read=236 HDFS: Number of bytes written=236 HDFS: Number of read operations=10 HDFS: Number of large read operations=0 HDFS: Number of write operations=4 HDFS: Number of bytes read erasure-coded=0 Map-Reduce Framework Map input records=1 Map output records=2 Map output bytes=18 Map output materialized bytes=28 Input split bytes=146 Combine input records=0 Spilled Records=2 Failed Shuffles=0 Merged Map outputs=0 GC time elapsed (ms)=21 Total committed heap usage (bytes)=452984832 File Input Format Counters Bytes Read=118 2024-02-19 13:08:37,350 INFO mapred.LocalJobRunner: Finishing task: attempt_local1436131745_0001_m_000001_0 2024-02-19 13:08:37,350 INFO mapred.LocalJobRunner: map task executor complete. 2024-02-19 13:08:37,354 INFO mapred.LocalJobRunner: Waiting for reduce tasks 2024-02-19 13:08:37,354 INFO mapred.LocalJobRunner: Starting task: attempt_local1436131745_0001_r_000000_0 2024-02-19 13:08:37,369 INFO output.PathOutputCommitterFactory: No output committer factory defined, defaulting to FileOutputCommitterFactory 2024-02-19 13:08:37,369 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2 2024-02-19 13:08:37,369 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false 2024-02-19 13:08:37,370 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ] 2024-02-19 13:08:37,374 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@1395dc25 2024-02-19 13:08:37,376 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized! 2024-02-19 13:08:37,406 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=2382574336, maxSingleShuffleLimit=595643584, mergeThreshold=1572499072, ioSortFactor=10, memToMemMergeOutputsThreshold=10 2024-02-19 13:08:37,408 INFO reduce.EventFetcher: attempt_local1436131745_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events 2024-02-19 13:08:37,464 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local1436131745_0001_m_000001_0 decomp: 24 len: 28 to MEMORY 2024-02-19 13:08:37,468 INFO reduce.InMemoryMapOutput: Read 24 bytes from map-output for attempt_local1436131745_0001_m_000001_0 2024-02-19 13:08:37,468 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 24, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->24 2024-02-19 13:08:37,476 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local1436131745_0001_m_000000_0 decomp: 24 len: 28 to MEMORY 2024-02-19 13:08:37,477 INFO reduce.InMemoryMapOutput: Read 24 bytes from map-output for attempt_local1436131745_0001_m_000000_0 2024-02-19 13:08:37,478 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 24, inMemoryMapOutputs.size() -> 2, commitMemory -> 24, usedMemory ->48 2024-02-19 13:08:37,478 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning 2024-02-19 13:08:37,479 INFO mapred.LocalJobRunner: 2 / 2 copied. 
2024-02-19 13:08:37,479 INFO reduce.MergeManagerImpl: finalMerge called with 2 in-memory map-outputs and 0 on-disk map-outputs 2024-02-19 13:08:37,488 INFO mapred.Merger: Merging 2 sorted segments 2024-02-19 13:08:37,488 INFO mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 42 bytes 2024-02-19 13:08:37,490 INFO reduce.MergeManagerImpl: Merged 2 segments, 48 bytes to disk to satisfy reduce memory limit 2024-02-19 13:08:37,490 INFO reduce.MergeManagerImpl: Merging 1 files, 50 bytes from disk 2024-02-19 13:08:37,491 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce 2024-02-19 13:08:37,491 INFO mapred.Merger: Merging 1 sorted segments 2024-02-19 13:08:37,492 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 43 bytes 2024-02-19 13:08:37,493 INFO mapred.LocalJobRunner: 2 / 2 copied. 2024-02-19 13:08:37,513 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords 2024-02-19 13:08:37,609 INFO mapred.Task: Task:attempt_local1436131745_0001_r_000000_0 is done. And is in the process of committing 2024-02-19 13:08:37,612 INFO mapred.LocalJobRunner: 2 / 2 copied. 2024-02-19 13:08:37,612 INFO mapred.Task: Task attempt_local1436131745_0001_r_000000_0 is allowed to commit now 2024-02-19 13:08:37,617 INFO mapreduce.Job: Job job_local1436131745_0001 running in uber mode : false 2024-02-19 13:08:37,618 INFO mapreduce.Job: map 100% reduce 0% 2024-02-19 13:08:37,633 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1436131745_0001_r_000000_0' to hdfs://localhost:9000/user/root/QuasiMonteCarlo_1708348112809_1640684131/out 2024-02-19 13:08:37,634 INFO mapred.LocalJobRunner: reduce > reduce 2024-02-19 13:08:37,634 INFO mapred.Task: Task 'attempt_local1436131745_0001_r_000000_0' done. 2024-02-19 13:08:37,635 INFO mapred.Task: Final Counters for attempt_local1436131745_0001_r_000000_0: Counters: 30 File System Counters FILE: Number of bytes read=282195 FILE: Number of bytes written=921349 FILE: Number of read operations=0 FILE: Number of large read operations=0 FILE: Number of write operations=0 HDFS: Number of bytes read=236 HDFS: Number of bytes written=451 HDFS: Number of read operations=15 HDFS: Number of large read operations=0 HDFS: Number of write operations=7 HDFS: Number of bytes read erasure-coded=0 Map-Reduce Framework Combine input records=0 Combine output records=0 Reduce input groups=2 Reduce shuffle bytes=56 Reduce input records=4 Reduce output records=0 Spilled Records=4 Shuffled Maps =2 Failed Shuffles=0 Merged Map outputs=2 GC time elapsed (ms)=0 Total committed heap usage (bytes)=452984832 Shuffle Errors BAD_ID=0 CONNECTION=0 IO_ERROR=0 WRONG_LENGTH=0 WRONG_MAP=0 WRONG_REDUCE=0 File Output Format Counters Bytes Written=97 2024-02-19 13:08:37,635 INFO mapred.LocalJobRunner: Finishing task: attempt_local1436131745_0001_r_000000_0 2024-02-19 13:08:37,635 INFO mapred.LocalJobRunner: reduce task executor complete. 
2024-02-19 13:08:38,619 INFO mapreduce.Job: map 100% reduce 100% 2024-02-19 13:08:38,620 INFO mapreduce.Job: Job job_local1436131745_0001 completed successfully 2024-02-19 13:08:38,633 INFO mapreduce.Job: Counters: 36 File System Counters FILE: Number of bytes read=845934 FILE: Number of bytes written=2763887 FILE: Number of read operations=0 FILE: Number of large read operations=0 FILE: Number of write operations=0 HDFS: Number of bytes read=590 HDFS: Number of bytes written=923 HDFS: Number of read operations=32 HDFS: Number of large read operations=0 HDFS: Number of write operations=15 HDFS: Number of bytes read erasure-coded=0 Map-Reduce Framework Map input records=2 Map output records=4 Map output bytes=36 Map output materialized bytes=56 Input split bytes=292 Combine input records=0 Combine output records=0 Reduce input groups=2 Reduce shuffle bytes=56 Reduce input records=4 Reduce output records=0 Spilled Records=8 Shuffled Maps =2 Failed Shuffles=0 Merged Map outputs=2 GC time elapsed (ms)=21 Total committed heap usage (bytes)=1263534080 Shuffle Errors BAD_ID=0 CONNECTION=0 IO_ERROR=0 WRONG_LENGTH=0 WRONG_MAP=0 WRONG_REDUCE=0 File Input Format Counters Bytes Read=236 File Output Format Counters Bytes Written=97 Job Finished in 3.496 seconds Estimated value of Pi is 3.12000000000000000000
The script that we are going to use to start YARN is
$HADOOP_HOME/sbin/start-yarn.sh
Before launching YARN, ensure proper configuration in both yarn-site.xml and mapred-site.xml.
For YARN, it is essential to configure yarn.resourcemanager.hostname: once this is set, many other service addresses are derived from ${yarn.resourcemanager.hostname} and do not need to be configured one by one.
In mapred-site.xml we want to configure MapReduce to submit jobs to the YARN queue by default.
We configure only the minimal set of properties needed to run an example in this tutorial. In general, you might want to customize these XML configuration files to align with your system requirements and specifications.
Some useful links for configuring YARN:
yarn-site.xml
clear_conf_file(yarn_site_file)
edit_conf_file(yarn_site_file, 'yarn.resourcemanager.hostname', 'localhost')
# Shuffle service that needs to be set for Map Reduce applications.
edit_conf_file(yarn_site_file, 'yarn.nodemanager.aux-services', 'mapreduce_shuffle')
# Configuration to enable or disable log aggregation (default is false)
edit_conf_file(yarn_site_file, 'yarn.log-aggregation-enable', 'true')
# Environment properties to be inherited by containers from NodeManagers
edit_conf_file(yarn_site_file, 'yarn.nodemanager.env-whitelist', 'JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_HOME,PATH,LANG,TZ,HADOOP_MAPRED_HOME')
# Web proxy address
edit_conf_file(yarn_site_file, 'yarn.web-proxy.address', '127.0.0.1:3141')
# NM Webapp address
#edit_conf_file(yarn_site_file, 'yarn.nodemanager.hostname', 'localhost:8042')
# Configurations for History Server
edit_conf_file(yarn_site_file, 'yarn.log-aggregation.retain-seconds', '600')
edit_conf_file(yarn_site_file, 'yarn.log-aggregation.retain-check-interval-seconds', '60')
print_conf_file(yarn_site_file)
19-Feb-24 01:08:39 PM - INFO: add property yarn.resourcemanager.hostname to /content/hadoop-3.3.6/etc/hadoop/yarn-site.xml 19-Feb-24 01:08:39 PM - INFO: add property yarn.nodemanager.aux-services to /content/hadoop-3.3.6/etc/hadoop/yarn-site.xml 19-Feb-24 01:08:39 PM - INFO: add property yarn.log-aggregation-enable to /content/hadoop-3.3.6/etc/hadoop/yarn-site.xml 19-Feb-24 01:08:39 PM - INFO: add property yarn.nodemanager.env-whitelist to /content/hadoop-3.3.6/etc/hadoop/yarn-site.xml 19-Feb-24 01:08:39 PM - INFO: add property yarn.web-proxy.address to /content/hadoop-3.3.6/etc/hadoop/yarn-site.xml 19-Feb-24 01:08:39 PM - INFO: add property yarn.log-aggregation.retain-seconds to /content/hadoop-3.3.6/etc/hadoop/yarn-site.xml 19-Feb-24 01:08:39 PM - INFO: add property yarn.log-aggregation.retain-check-interval-seconds to /content/hadoop-3.3.6/etc/hadoop/yarn-site.xml
<?xml version="1.0" ?> <configuration> <property> <name>yarn.resourcemanager.hostname</name> <value>localhost</value> </property> <property> <name>yarn.nodemanager.aux-services</name> <value>mapreduce_shuffle</value> </property> <property> <name>yarn.log-aggregation-enable</name> <value>true</value> </property> <property> <name>yarn.nodemanager.env-whitelist</name> <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_HOME,PATH,LANG,TZ,HADOOP_MAPRED_HOME</value> </property> <property> <name>yarn.web-proxy.address</name> <value>127.0.0.1:3141</value> </property> <property> <name>yarn.log-aggregation.retain-seconds</name> <value>600</value> </property> <property> <name>yarn.log-aggregation.retain-check-interval-seconds</name> <value>60</value> </property> </configuration>
yarn.web-proxy.address is needed so that start-yarn.sh also launches the Web proxy (needed for navigating the different Web services). Here is the relevant code from start-yarn.sh:
# start proxyserver
PROXYSERVER=$("${HADOOP_HDFS_HOME}/bin/hdfs" getconf -confKey yarn.web-proxy.address 2>&- | cut -f1 -d:)
if [[ -n ${PROXYSERVER} ]]; then
hadoop_uservar_su yarn proxyserver "${HADOOP_YARN_HOME}/bin/yarn" \
--config "${HADOOP_CONF_DIR}" \
--workers \
--hostnames "${PROXYSERVER}" \
--daemon start \
proxyserver
(( HADOOP_JUMBO_RETCOUNTER=HADOOP_JUMBO_RETCOUNTER + $? ))
fi
We just picked an arbitrary port, 3141.
To check whether yarn.web-proxy.address is set, use:
hdfs getconf -confKey yarn.web-proxy.address
!hdfs getconf -confKey yarn.web-proxy.address
127.0.0.1:3141
mapred-site.xml
We want to run MapReduce jobs on YARN by default (property mapreduce.framework.name) and we want to configure the addresses of the JobHistory daemon.
clear_conf_file(mapred_site_file)
edit_conf_file(mapred_site_file, 'mapreduce.framework.name', 'yarn')
# CLASSPATH needed by MapReduce applications
edit_conf_file(mapred_site_file, 'mapreduce.application.classpath',
os.environ['HADOOP_HOME'] + '/share/hadoop/mapreduce/*:' +
os.environ['HADOOP_HOME']+ '/share/hadoop/mapreduce/lib/*')
# this should also work (with the variable ${HADOOP_HOME})
# edit_conf_file(mapred_site_file, 'mapreduce.application.classpath',
# '${HADOOP_HOME}/share/hadoop/mapreduce/*:' +
# '${HADOOP_HOME}/share/hadoop/mapreduce/lib/*')
# Configurations for History Server
edit_conf_file(mapred_site_file, 'mapreduce.jobhistory.address', 'localhost:10020')
edit_conf_file(mapred_site_file, 'mapreduce.jobhistory.webapp.address', 'localhost:19888')
print_conf_file(mapred_site_file)
19-Feb-24 01:08:40 PM - INFO: add property mapreduce.framework.name to /content/hadoop-3.3.6/etc/hadoop/mapred-site.xml 19-Feb-24 01:08:40 PM - INFO: add property mapreduce.application.classpath to /content/hadoop-3.3.6/etc/hadoop/mapred-site.xml 19-Feb-24 01:08:40 PM - INFO: add property mapreduce.jobhistory.address to /content/hadoop-3.3.6/etc/hadoop/mapred-site.xml 19-Feb-24 01:08:40 PM - INFO: add property mapreduce.jobhistory.webapp.address to /content/hadoop-3.3.6/etc/hadoop/mapred-site.xml
<?xml version="1.0" ?> <configuration> <property> <name>mapreduce.framework.name</name> <value>yarn</value> </property> <property> <name>mapreduce.application.classpath</name> <value>/content/hadoop-3.3.6/share/hadoop/mapreduce/*:/content/hadoop-3.3.6/share/hadoop/mapreduce/lib/*</value> </property> <property> <name>mapreduce.jobhistory.address</name> <value>localhost:10020</value> </property> <property> <name>mapreduce.jobhistory.webapp.address</name> <value>localhost:19888</value> </property> </configuration>
!$HADOOP_HOME/sbin/stop-yarn.sh
!$HADOOP_HOME/sbin/start-yarn.sh
Stopping nodemanagers Stopping resourcemanager Stopping proxy server [127.0.0.1] 127.0.0.1: Warning: Permanently added '127.0.0.1' (ED25519) to the list of known hosts. Starting resourcemanager Starting nodemanagers
!lsof -n -i -P +c0 -sTCP:LISTEN -ac java
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME java 2534 root 341u IPv4 55167 0t0 TCP *:9870 (LISTEN) java 2534 root 349u IPv4 55722 0t0 TCP 127.0.0.1:9000 (LISTEN) java 2645 root 342u IPv4 54075 0t0 TCP *:9866 (LISTEN) java 2645 root 345u IPv4 54085 0t0 TCP 127.0.0.1:40039 (LISTEN) java 2645 root 374u IPv4 56384 0t0 TCP *:9864 (LISTEN) java 2645 root 375u IPv4 56418 0t0 TCP *:9867 (LISTEN) java 2860 root 343u IPv4 57029 0t0 TCP *:9868 (LISTEN) java 4314 root 366u IPv4 79944 0t0 TCP 127.0.0.1:8088 (LISTEN) java 4314 root 373u IPv4 83653 0t0 TCP 127.0.0.1:8033 (LISTEN) java 4428 root 382u IPv4 83999 0t0 TCP *:38267 (LISTEN) java 4428 root 393u IPv4 83153 0t0 TCP *:8040 (LISTEN) java 4428 root 403u IPv4 83159 0t0 TCP *:13562 (LISTEN)
We now have some new java processes added to the list of listening ports (you might need to wait and refresh the previous cell to see them all):
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
java 13177 root 341u IPv4 357089 0t0 TCP *:9870 (LISTEN)
java 13177 root 349u IPv4 358597 0t0 TCP 127.0.0.1:9000 (LISTEN)
java 13297 root 342u IPv4 359615 0t0 TCP *:9866 (LISTEN)
java 13297 root 345u IPv4 358674 0t0 TCP 127.0.0.1:40349 (LISTEN)
java 13297 root 374u IPv4 358786 0t0 TCP *:9864 (LISTEN)
java 13297 root 375u IPv4 358790 0t0 TCP *:9867 (LISTEN)
java 13515 root 343u IPv4 360840 0t0 TCP *:9868 (LISTEN)
java 27447 root 366u IPv4 756489 0t0 TCP 127.0.0.1:8088 (LISTEN)
java 27447 root 373u IPv4 759769 0t0 TCP 127.0.0.1:8033 (LISTEN)
java 27447 root 383u IPv4 758559 0t0 TCP 127.0.0.1:8031 (LISTEN)
java 27447 root 393u IPv4 759823 0t0 TCP 127.0.0.1:8030 (LISTEN)
java 27447 root 403u IPv4 759857 0t0 TCP 127.0.0.1:8032 (LISTEN)
java 27735 root 340u IPv4 759788 0t0 TCP 127.0.0.1:3141 (LISTEN)
YARN ports:
port number | description | configured in
---|---|---
8088 | yarn.resourcemanager.webapp.address | yarn-default.xml
8030 | yarn.resourcemanager.scheduler.address | yarn-default.xml
8031 | yarn.resourcemanager.resource-tracker.address | yarn-default.xml
8032 | yarn.resourcemanager.address | yarn-default.xml
8033 | yarn.resourcemanager.admin.address | yarn-default.xml
8040 | yarn.nodemanager.localizer.address | yarn-default.xml
8042 | yarn.nodemanager.webapp.address | yarn-default.xml
Web Proxy port:
port number | configuration property | configured in
---|---|---
3141 | yarn.web-proxy.address | yarn-site.xml
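Since this notebook favours Python where possible, here is a small sketch that probes the ports from the tables above with the standard socket module; the bash cell below performs an equivalent check with lsof.
import socket

# YARN and Web Proxy ports we expect to be open, as listed in the tables above
yarn_ports = {
    8088: "yarn.resourcemanager.webapp.address",
    8030: "yarn.resourcemanager.scheduler.address",
    8031: "yarn.resourcemanager.resource-tracker.address",
    8032: "yarn.resourcemanager.address",
    8033: "yarn.resourcemanager.admin.address",
    8040: "yarn.nodemanager.localizer.address",
    8042: "yarn.nodemanager.webapp.address",
    3141: "yarn.web-proxy.address",
}
for port, prop in sorted(yarn_ports.items()):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1)
        # connect_ex returns 0 when something accepts connections on the port
        status = "open" if s.connect_ex(("127.0.0.1", port)) == 0 else "closed"
    print(f"{port:5d}  {prop:45s} {status}")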
%%bash
counter=0
until [ $counter -gt 20 ]
do
((counter++))
sleep 1
# expect the 7 YARN ports plus the lsof header line = 8 lines
if [ $(lsof -n -aiTCP:8088 -aiTCP:8030 -aiTCP:8031 -aiTCP:8032 -aiTCP:8033 -aiTCP:8042 -aiTCP:3141 -P +c0 -sTCP:LISTEN -ac java |wc -l ) -eq 8 ]
then
echo "YARN is up and running"
echo "Time to start: $counter secs"
exit
fi
done
echo "Some YARN ports are missing. Wait some more or restart YARN."
YARN is up and running Time to start: 2 secs
If YARN is running correctly, you should see one node listed when calling yarn node -list
!yarn node -list
2024-02-19 13:09:16,252 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at localhost/127.0.0.1:8032 Total Nodes:1 Node-Id Node-State Node-Http-Address Number-of-Running-Containers 5a42c671b202:38267 RUNNING 5a42c671b202:8042 0
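The same check can be scripted. Here is a minimal sketch that runs yarn node -list through subprocess and extracts the node count; parsing the "Total Nodes:" line is an assumption based on the output above.
import os
import subprocess

# run `yarn node -list` and report how many nodes are registered with the ResourceManager
yarn_bin = os.path.join(os.environ['HADOOP_HOME'], 'bin', 'yarn')
subprocess_output = subprocess.run([yarn_bin, 'node', '-list'], check=True,
                                   stdout=subprocess.PIPE, stderr=subprocess.DEVNULL)
for line in subprocess_output.stdout.decode().splitlines():
    if line.startswith('Total Nodes:'):
        print(f"{int(line.split(':', 1)[1])} node(s) registered with the ResourceManager")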
!wget localhost:8088
--2024-02-19 13:09:16-- http://localhost:8088/ Resolving localhost (localhost)... 127.0.0.1, ::1 Connecting to localhost (localhost)|127.0.0.1|:8088... connected. HTTP request sent, awaiting response... 302 Found Location: http://localhost:8088/cluster [following] --2024-02-19 13:09:17-- http://localhost:8088/cluster Reusing existing connection to localhost:8088. HTTP request sent, awaiting response... 200 OK Length: 13683 (13K) [text/html] Saving to: ‘index.html.1’ index.html.1 100%[===================>] 13.36K --.-KB/s in 0s 2024-02-19 13:09:17 (230 MB/s) - ‘index.html.1’ saved [13683/13683]
%%capture yarn_url
if IN_COLAB:
    port = 8088
    output.serve_kernel_port_as_window(port, path='/cluster')
if IN_COLAB:
    yarn_url.show()
YARN Web UI: empty queue
No application is shown in the YARN UI yet, since so far we have only run a job locally.
Next, we want to start the History Server in order to be able to view the job logfiles once the jobs have finished.
As already seen, the HDFS daemons are NameNode, SecondaryNameNode, and DataNode.
The YARN daemons are ResourceManager, NodeManager, and WebAppProxy.
If MapReduce is to be used, then the MapReduce Job History Server will also be running. For large installations, these are generally running on separate hosts.
From: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/ClusterSetup.html.
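As a quick sanity check, we can look for these daemons' main classes in the process list. This is only a rough sketch: it simply searches the output of ps for the fully qualified class names of the daemons mentioned above (the JobHistoryServer will only show up once we have started it below).
import subprocess

# fully qualified main classes of the daemons we expect to find running
daemons = {
    "NameNode": "org.apache.hadoop.hdfs.server.namenode.NameNode",
    "SecondaryNameNode": "org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode",
    "DataNode": "org.apache.hadoop.hdfs.server.datanode.DataNode",
    "ResourceManager": "org.apache.hadoop.yarn.server.resourcemanager.ResourceManager",
    "NodeManager": "org.apache.hadoop.yarn.server.nodemanager.NodeManager",
    "WebAppProxyServer": "org.apache.hadoop.yarn.server.webproxy.WebAppProxyServer",
    "JobHistoryServer": "org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer",
}
ps_output = subprocess.run(["ps", "-eo", "args"], check=True,
                           stdout=subprocess.PIPE).stdout.decode()
for name, main_class in daemons.items():
    print(f"{name:18s} {'running' if main_class in ps_output else 'not running'}")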
!$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh stop historyserver
!$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver
WARNING: Use of this script to stop the MR JobHistory daemon is deprecated. WARNING: Attempting to execute replacement "mapred --daemon stop" instead. WARNING: Use of this script to start the MR JobHistory daemon is deprecated. WARNING: Attempting to execute replacement "mapred --daemon start" instead.
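The warnings are harmless: in Hadoop 3 the mr-jobhistory-daemon.sh script is deprecated and simply delegates to mapred --daemon start historyserver. The History Server needs a few seconds to open its web port (19888), as the retries in the next cell show; a small wait loop like the following sketch avoids polling by hand.
import socket
import time

# wait up to ~20 seconds for the JobHistory Server web UI (localhost:19888,
# i.e. mapreduce.jobhistory.webapp.address) to accept connections
for attempt in range(1, 21):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1)
        if s.connect_ex(("127.0.0.1", 19888)) == 0:
            print(f"History Server is up and running (after ~{attempt} secs)")
            break
    time.sleep(1)
else:
    print("Port 19888 is still closed. Wait some more or restart the History Server.")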
!wget localhost:19888
--2024-02-19 13:09:19-- http://localhost:19888/ Resolving localhost (localhost)... 127.0.0.1, ::1 Connecting to localhost (localhost)|127.0.0.1|:19888... failed: Connection refused. Connecting to localhost (localhost)|::1|:19888... failed: Cannot assign requested address. Retrying. --2024-02-19 13:09:20-- (try: 2) http://localhost:19888/ Connecting to localhost (localhost)|127.0.0.1|:19888... failed: Connection refused. Connecting to localhost (localhost)|::1|:19888... failed: Cannot assign requested address. Retrying. --2024-02-19 13:09:22-- (try: 3) http://localhost:19888/ Connecting to localhost (localhost)|127.0.0.1|:19888... connected. HTTP request sent, awaiting response... 302 Found Location: http://localhost:19888/jobhistory [following] --2024-02-19 13:09:24-- http://localhost:19888/jobhistory Reusing existing connection to localhost:19888. HTTP request sent, awaiting response... 200 OK Length: 7791 (7.6K) [text/html] Saving to: ‘index.html.2’ index.html.2 100%[===================>] 7.61K --.-KB/s in 0s 2024-02-19 13:09:25 (546 MB/s) - ‘index.html.2’ saved [7791/7791]
if IN_COLAB:
    port = 19888
    output.serve_kernel_port_as_window(port, path='/jobhistory')
Submitting the pi example to YARN
After submitting the job, you should now see the pi application listed in the YARN UI. Here is the link in case you closed the window or tab:
if IN_COLAB:
    port = 8088
    output.serve_kernel_port_as_window(port, path='/cluster')
!yarn jar $(find $HADOOP_HOME -name "*mapreduce*examples*"|head -1) pi 2 100
Number of Maps = 2 Samples per Map = 100 Wrote input for Map #0 Wrote input for Map #1 Starting Job 2024-02-19 13:09:30,811 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at localhost/127.0.0.1:8032 2024-02-19 13:09:31,311 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1708348145550_0001 2024-02-19 13:09:31,605 INFO input.FileInputFormat: Total input files to process : 2 2024-02-19 13:09:31,737 INFO mapreduce.JobSubmitter: number of splits:2 2024-02-19 13:09:32,539 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1708348145550_0001 2024-02-19 13:09:32,539 INFO mapreduce.JobSubmitter: Executing with tokens: [] 2024-02-19 13:09:32,815 INFO conf.Configuration: resource-types.xml not found 2024-02-19 13:09:32,816 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'. 2024-02-19 13:09:33,368 INFO impl.YarnClientImpl: Submitted application application_1708348145550_0001 2024-02-19 13:09:33,439 INFO mapreduce.Job: The url to track the job: http://127.0.0.1:3141/proxy/application_1708348145550_0001/ 2024-02-19 13:09:33,440 INFO mapreduce.Job: Running job: job_1708348145550_0001 2024-02-19 13:09:47,771 INFO mapreduce.Job: Job job_1708348145550_0001 running in uber mode : false 2024-02-19 13:09:47,772 INFO mapreduce.Job: map 0% reduce 0% 2024-02-19 13:10:00,963 INFO mapreduce.Job: map 100% reduce 0% 2024-02-19 13:10:09,053 INFO mapreduce.Job: map 100% reduce 100% 2024-02-19 13:10:10,077 INFO mapreduce.Job: Job job_1708348145550_0001 completed successfully 2024-02-19 13:10:10,573 INFO mapreduce.Job: Counters: 54 File System Counters FILE: Number of bytes read=50 FILE: Number of bytes written=829617 FILE: Number of read operations=0 FILE: Number of large read operations=0 FILE: Number of write operations=0 HDFS: Number of bytes read=526 HDFS: Number of bytes written=215 HDFS: Number of read operations=13 HDFS: Number of large read operations=0 HDFS: Number of write operations=3 HDFS: Number of bytes read erasure-coded=0 Job Counters Launched map tasks=2 Launched reduce tasks=1 Data-local map tasks=2 Total time spent by all maps in occupied slots (ms)=21855 Total time spent by all reduces in occupied slots (ms)=4870 Total time spent by all map tasks (ms)=21855 Total time spent by all reduce tasks (ms)=4870 Total vcore-milliseconds taken by all map tasks=21855 Total vcore-milliseconds taken by all reduce tasks=4870 Total megabyte-milliseconds taken by all map tasks=22379520 Total megabyte-milliseconds taken by all reduce tasks=4986880 Map-Reduce Framework Map input records=2 Map output records=4 Map output bytes=36 Map output materialized bytes=56 Input split bytes=290 Combine input records=0 Combine output records=0 Reduce input groups=2 Reduce shuffle bytes=56 Reduce input records=4 Reduce output records=0 Spilled Records=8 Shuffled Maps =2 Failed Shuffles=0 Merged Map outputs=2 GC time elapsed (ms)=327 CPU time spent (ms)=2850 Physical memory (bytes) snapshot=865935360 Virtual memory (bytes) snapshot=8173481984 Total committed heap usage (bytes)=818937856 Peak Map Physical memory (bytes)=331313152 Peak Map Virtual memory (bytes)=2724544512 Peak Reduce Physical memory (bytes)=218918912 Peak Reduce Virtual memory (bytes)=2728366080 Shuffle Errors BAD_ID=0 CONNECTION=0 IO_ERROR=0 WRONG_LENGTH=0 WRONG_MAP=0 WRONG_REDUCE=0 File Input Format Counters Bytes Read=236 File Output Format Counters Bytes Written=97 Job Finished in 39.964 seconds Estimated value of Pi is 3.12000000000000000000
By refreshing the Web page, you should see the application change state from "ACCEPTED" to "RUNNING" and, finally, "FINISHED".
YARN Web UI: application running
YARN Web UI: application completed
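The application and its final state can also be queried from the command line with the yarn application subcommand; a short sketch:
import os
import subprocess

# list all YARN applications regardless of state, so the finished pi job shows up
yarn_bin = os.path.join(os.environ['HADOOP_HOME'], 'bin', 'yarn')
subprocess_output = subprocess.run([yarn_bin, 'application', '-list', '-appStates', 'ALL'],
                                   check=True, stdout=subprocess.PIPE)
print(subprocess_output.stdout.decode())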
By default, Hadoop writes logs to $HADOOP_HOME/logs.
!ls -halt $HADOOP_HOME/logs
total 540K -rw-r--r-- 1 root root 65K Feb 19 13:10 hadoop-root-resourcemanager-5a42c671b202.log -rw-r--r-- 1 root root 101K Feb 19 13:10 hadoop-root-namenode-5a42c671b202.log -rw-r--r-- 1 root root 103K Feb 19 13:10 hadoop-root-datanode-5a42c671b202.log -rw-r--r-- 1 root root 69K Feb 19 13:10 hadoop-root-nodemanager-5a42c671b202.log -rw-r--r-- 1 root root 34K Feb 19 13:09 hadoop-root-historyserver-5a42c671b202.log drwxr-xr-x 3 root root 4.0K Feb 19 13:09 userlogs -rw-r--r-- 1 root root 2.9K Feb 19 13:09 hadoop-root-historyserver-5a42c671b202.out drwxr-xr-x 3 root root 4.0K Feb 19 13:09 . -rw-r--r-- 1 root root 2.9K Feb 19 13:09 hadoop-root-nodemanager-5a42c671b202.out -rw-r--r-- 1 root root 29K Feb 19 13:09 hadoop-root-proxyserver-5a42c671b202.log -rw-r--r-- 1 root root 3.0K Feb 19 13:09 hadoop-root-resourcemanager-5a42c671b202.out -rw-r--r-- 1 root root 828 Feb 19 13:09 hadoop-root-proxyserver-5a42c671b202.out -rw-r--r-- 1 root root 72K Feb 19 13:07 hadoop-root-secondarynamenode-5a42c671b202.log -rw-r--r-- 1 root root 828 Feb 19 13:07 hadoop-root-secondarynamenode-5a42c671b202.out -rw-r--r-- 1 root root 828 Feb 19 13:07 hadoop-root-datanode-5a42c671b202.out -rw-r--r-- 1 root root 828 Feb 19 13:07 hadoop-root-namenode-5a42c671b202.out -rw-r--r-- 1 root root 828 Feb 19 13:07 hadoop-root-secondarynamenode-5a42c671b202.out.1 -rw-r--r-- 1 root root 828 Feb 19 13:06 hadoop-root-datanode-5a42c671b202.out.1 -rw-r--r-- 1 root root 828 Feb 19 13:06 hadoop-root-namenode-5a42c671b202.out.1 -rw-r--r-- 1 root root 0 Feb 19 13:06 SecurityAuth-root.audit drwxr-xr-x 11 1000 1000 4.0K Feb 19 13:06 ..
For instance, here is the tail of the NameNode's log:
!tail $HADOOP_HOME/logs/hadoop-root-namenode-*.log
2024-02-19 13:10:07,280 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* blk_1073741842_1018 is COMMITTED but not COMPLETE(numNodes= 0 < minimum = 1) in file /tmp/hadoop-yarn/staging/root/.staging/job_1708348145550_0001/job_1708348145550_0001_1.jhist 2024-02-19 13:10:07,685 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /tmp/hadoop-yarn/staging/root/.staging/job_1708348145550_0001/job_1708348145550_0001_1.jhist is closed by DFSClient_NONMAPREDUCE_-1816557497_1 2024-02-19 13:10:07,698 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1073741845_1021, replicas=127.0.0.1:9866 for /tmp/hadoop-yarn/staging/history/done_intermediate/root/job_1708348145550_0001.summary_tmp 2024-02-19 13:10:07,716 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* blk_1073741845_1021 is COMMITTED but not COMPLETE(numNodes= 0 < minimum = 1) in file /tmp/hadoop-yarn/staging/history/done_intermediate/root/job_1708348145550_0001.summary_tmp 2024-02-19 13:10:08,119 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /tmp/hadoop-yarn/staging/history/done_intermediate/root/job_1708348145550_0001.summary_tmp is closed by DFSClient_NONMAPREDUCE_-1816557497_1 2024-02-19 13:10:08,232 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1073741846_1022, replicas=127.0.0.1:9866 for /tmp/hadoop-yarn/staging/history/done_intermediate/root/job_1708348145550_0001-1708348172882-root-QuasiMonteCarlo-1708348207241-2-1-SUCCEEDED-default-1708348185932.jhist_tmp 2024-02-19 13:10:08,292 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* blk_1073741846_1022 is COMMITTED but not COMPLETE(numNodes= 0 < minimum = 1) in file /tmp/hadoop-yarn/staging/history/done_intermediate/root/job_1708348145550_0001-1708348172882-root-QuasiMonteCarlo-1708348207241-2-1-SUCCEEDED-default-1708348185932.jhist_tmp 2024-02-19 13:10:08,709 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /tmp/hadoop-yarn/staging/history/done_intermediate/root/job_1708348145550_0001-1708348172882-root-QuasiMonteCarlo-1708348207241-2-1-SUCCEEDED-default-1708348185932.jhist_tmp is closed by DFSClient_NONMAPREDUCE_-1816557497_1 2024-02-19 13:10:08,810 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1073741847_1023, replicas=127.0.0.1:9866 for /tmp/hadoop-yarn/staging/history/done_intermediate/root/job_1708348145550_0001_conf.xml_tmp 2024-02-19 13:10:08,876 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /tmp/hadoop-yarn/staging/history/done_intermediate/root/job_1708348145550_0001_conf.xml_tmp is closed by DFSClient_NONMAPREDUCE_-1816557497_1
Show the most recent error or warning messages in the Hadoop logfiles.
!grep "ERROR\|WARN" $HADOOP_HOME/logs/*.log | sort -r -k2 | head
/content/hadoop-3.3.6/logs/hadoop-root-datanode-5a42c671b202.log:2024-02-19 13:10:08,234 WARN org.apache.hadoop.hdfs.DFSUtil: Unexpected value for data transfer bytes=26543 duration=0 /content/hadoop-3.3.6/logs/hadoop-root-nodemanager-5a42c671b202.log:2024-02-19 13:10:07,656 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: delete returned false for path: [/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1708348145550_0001/container_1708348145550_0001_01_000004/sysfs] /content/hadoop-3.3.6/logs/hadoop-root-nodemanager-5a42c671b202.log:2024-02-19 13:10:07,656 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: delete returned false for path: [/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1708348145550_0001/container_1708348145550_0001_01_000004/launch_container.sh] /content/hadoop-3.3.6/logs/hadoop-root-nodemanager-5a42c671b202.log:2024-02-19 13:10:07,656 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: delete returned false for path: [/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1708348145550_0001/container_1708348145550_0001_01_000004/container_tokens] /content/hadoop-3.3.6/logs/hadoop-root-nodemanager-5a42c671b202.log:2024-02-19 13:10:05,055 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: delete returned false for path: [/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1708348145550_0001/container_1708348145550_0001_01_000002/sysfs] /content/hadoop-3.3.6/logs/hadoop-root-nodemanager-5a42c671b202.log:2024-02-19 13:10:05,055 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: delete returned false for path: [/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1708348145550_0001/container_1708348145550_0001_01_000002/launch_container.sh] /content/hadoop-3.3.6/logs/hadoop-root-nodemanager-5a42c671b202.log:2024-02-19 13:10:05,055 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: delete returned false for path: [/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1708348145550_0001/container_1708348145550_0001_01_000002/container_tokens] /content/hadoop-3.3.6/logs/hadoop-root-nodemanager-5a42c671b202.log:2024-02-19 13:09:59,890 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: delete returned false for path: [/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1708348145550_0001/container_1708348145550_0001_01_000003/sysfs] /content/hadoop-3.3.6/logs/hadoop-root-nodemanager-5a42c671b202.log:2024-02-19 13:09:59,890 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: delete returned false for path: [/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1708348145550_0001/container_1708348145550_0001_01_000003/launch_container.sh] /content/hadoop-3.3.6/logs/hadoop-root-nodemanager-5a42c671b202.log:2024-02-19 13:09:59,890 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: delete returned false for path: [/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1708348145550_0001/container_1708348145550_0001_01_000003/container_tokens]
Show most recent error messages in Hadoop logfiles.
!grep "ERROR" $HADOOP_HOME/logs/*.log | sort -r -k2 | head
/content/hadoop-3.3.6/logs/hadoop-root-historyserver-5a42c671b202.log:2024-02-19 13:09:25,174 ERROR org.apache.hadoop.yarn.logaggregation.AggregatedLogDeletionService: Error reading root log dir this deletion attempt is being aborted /content/hadoop-3.3.6/logs/hadoop-root-secondarynamenode-5a42c671b202.log:2024-02-19 13:07:15,409 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: RECEIVED SIGNAL 15: SIGTERM /content/hadoop-3.3.6/logs/hadoop-root-datanode-5a42c671b202.log:2024-02-19 13:07:12,489 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: RECEIVED SIGNAL 15: SIGTERM /content/hadoop-3.3.6/logs/hadoop-root-namenode-5a42c671b202.log:2024-02-19 13:07:11,206 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: RECEIVED SIGNAL 15: SIGTERM
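The same scan can be done in Python. Here is a sketch that collects WARN/ERROR lines from all daemon logfiles and sorts them by the timestamp at the beginning of each line (continuation lines such as stack traces are ignored by this simple filter).
import glob
import os

# gather WARN/ERROR lines from every Hadoop daemon logfile
matches = []
for logfile in glob.glob(os.path.join(os.environ['HADOOP_HOME'], 'logs', '*.log')):
    with open(logfile, errors='replace') as f:
        for line in f:
            if ' WARN ' in line or ' ERROR ' in line:
                # the first 23 characters are the timestamp, e.g. "2024-02-19 13:10:08,234"
                matches.append((line[:23], os.path.basename(logfile), line.rstrip()))

# show the ten most recent messages
for _, logname, message in sorted(matches, reverse=True)[:10]:
    print(f"{logname}: {message}")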
!$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh stop historyserver
WARNING: Use of this script to stop the MR JobHistory daemon is deprecated. WARNING: Attempting to execute replacement "mapred --daemon stop" instead.
!$HADOOP_HOME/sbin/stop-yarn.sh
Stopping nodemanagers Stopping resourcemanager Stopping proxy server [127.0.0.1]
Stopping Hadoop and the sshd daemon
Stop Hadoop (with the command $HADOOP_HOME/sbin/stop-dfs.sh) and the sshd daemon (with /etc/init.d/ssh stop).
subprocess_output = subprocess.run([os.path.join(os.environ['HADOOP_HOME'], 'sbin', 'stop-dfs.sh')], check=True, stdout=subprocess.PIPE)
print(subprocess_output.stdout.decode())
Stopping namenodes on [localhost] Stopping datanodes Stopping secondary namenodes [5a42c671b202]
subprocess_output = subprocess.run(['/etc/init.d/ssh', 'stop'], check=True, stdout=subprocess.PIPE)
print(subprocess_output.stdout.decode())
* Stopping OpenBSD Secure Shell server sshd ...done.
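At this point no Hadoop daemon should be left listening, which we can verify with the same lsof invocation used earlier; note that lsof exits with a non-zero status when it finds nothing, so we do not pass check=True.
import subprocess

# lsof returns a non-zero exit code when no matching process is found
subprocess_output = subprocess.run(
    ["lsof", "-n", "-i", "-P", "+c0", "-sTCP:LISTEN", "-ac", "java"],
    stdout=subprocess.PIPE)
if subprocess_output.stdout:
    print("Some java processes are still listening:")
    print(subprocess_output.stdout.decode())
else:
    print("No java process is listening any more: all Hadoop daemons are down.")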
If you've come this far by running all of the notebook's cells from the menu (rather than interactively), you can restart Hadoop by issuing the following commands:
# start the ssh daemon
logger.info("Starting {}".format("openssh-server"))
cmd = ["/etc/init.d/ssh", "restart"]
result = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
!ssh localhost "echo hi!"
# start HDFS
logger.info("Starting HDFS")
subprocess_output = subprocess.run([os.path.join(os.environ['HADOOP_HOME'], 'sbin', 'start-dfs.sh')], check=True, stdout=subprocess.PIPE)
print(subprocess_output.stdout.decode())
19-Feb-24 01:10:33 PM - INFO: Starting openssh-server
hi!
19-Feb-24 01:10:34 PM - INFO: Starting HDFS
Starting namenodes on [localhost] Starting datanodes Starting secondary namenodes [5a42c671b202]
%%bash
counter=0
until [ $counter -gt 20 ]
do
((counter++))
sleep 1
# expect the 6 well-known HDFS ports plus the lsof header line = 7 lines;
# restricting to java processes (-ac java) keeps the notebook's own listener on port 9000 out of the count
if [ $(lsof -n -i -P +c0 -sTCP:LISTEN -ac java| wc -l ) -ge 8 ] && \
[ $(lsof -n -aiTCP:9870 -aiTCP:9000 -aiTCP:9864 -aiTCP:9866 -aiTCP:9867 -aiTCP:9868 -P +c0 -sTCP:LISTEN -ac java | wc -l) -eq 7 ]
then
echo "HDFS is up and running"
echo "Time to start: $counter secs"
exit
fi
done
echo "Some HDFS ports are missing. Wait some more or restart HDFS."
HDFS is up and running Time to start: 1 secs
# start YARN
logger.info("Starting YARN")
!$HADOOP_HOME/sbin/start-yarn.sh
19-Feb-24 01:10:59 PM - INFO: Starting YARN
Starting resourcemanager Starting nodemanagers
%%bash
# wait for YARN services to start listening
counter=0
until [ $counter -gt 20 ]
do
((counter++))
sleep 1
# expect the 7 YARN ports plus the lsof header line = 8 lines
if [ $(lsof -n -aiTCP:8088 -aiTCP:8030 -aiTCP:8031 -aiTCP:8032 -aiTCP:8033 -aiTCP:8042 -aiTCP:3141 -P +c0 -sTCP:LISTEN -ac java |wc -l ) -eq 8 ]
then
echo "YARN is up and running"
echo "Time to start: $counter secs"
exit
fi
done
echo "Some YARN ports are missing. Wait some more or restart YARN."
YARN is up and running Time to start: 3 secs
# start History Server
!$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver
WARNING: Use of this script to start the MR JobHistory daemon is deprecated. WARNING: Attempting to execute replacement "mapred --daemon start" instead.
# Web UIs
if IN_COLAB:
    # NameNode
    output.serve_kernel_port_as_window(9870, path='/index.html')
    # YARN
    output.serve_kernel_port_as_window(8088, path='/cluster')
    # Job History Server
    output.serve_kernel_port_as_window(19888, path='/jobhistory')
Run the pi app again.
!yarn jar $(find $HADOOP_HOME -name "*mapreduce*examples*"|head -1) pi 2 100
Number of Maps = 2 Samples per Map = 100 Wrote input for Map #0 Wrote input for Map #1 Starting Job 2024-02-19 13:11:38,012 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at localhost/127.0.0.1:8032 2024-02-19 13:11:39,247 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1708348278844_0001 2024-02-19 13:11:39,875 INFO input.FileInputFormat: Total input files to process : 2 2024-02-19 13:11:40,867 INFO mapreduce.JobSubmitter: number of splits:2 2024-02-19 13:11:41,637 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1708348278844_0001 2024-02-19 13:11:41,637 INFO mapreduce.JobSubmitter: Executing with tokens: [] 2024-02-19 13:11:41,914 INFO conf.Configuration: resource-types.xml not found 2024-02-19 13:11:41,915 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'. 2024-02-19 13:11:42,367 INFO impl.YarnClientImpl: Submitted application application_1708348278844_0001 2024-02-19 13:11:42,531 INFO mapreduce.Job: The url to track the job: http://127.0.0.1:3141/proxy/application_1708348278844_0001/ 2024-02-19 13:11:42,534 INFO mapreduce.Job: Running job: job_1708348278844_0001 2024-02-19 13:11:57,851 INFO mapreduce.Job: Job job_1708348278844_0001 running in uber mode : false 2024-02-19 13:11:57,852 INFO mapreduce.Job: map 0% reduce 0% 2024-02-19 13:12:11,106 INFO mapreduce.Job: map 100% reduce 0% 2024-02-19 13:12:18,177 INFO mapreduce.Job: map 100% reduce 100% 2024-02-19 13:12:20,212 INFO mapreduce.Job: Job job_1708348278844_0001 completed successfully 2024-02-19 13:12:20,361 INFO mapreduce.Job: Counters: 54 File System Counters FILE: Number of bytes read=50 FILE: Number of bytes written=829623 FILE: Number of read operations=0 FILE: Number of large read operations=0 FILE: Number of write operations=0 HDFS: Number of bytes read=528 HDFS: Number of bytes written=215 HDFS: Number of read operations=13 HDFS: Number of large read operations=0 HDFS: Number of write operations=3 HDFS: Number of bytes read erasure-coded=0 Job Counters Launched map tasks=2 Launched reduce tasks=1 Data-local map tasks=2 Total time spent by all maps in occupied slots (ms)=22188 Total time spent by all reduces in occupied slots (ms)=4693 Total time spent by all map tasks (ms)=22188 Total time spent by all reduce tasks (ms)=4693 Total vcore-milliseconds taken by all map tasks=22188 Total vcore-milliseconds taken by all reduce tasks=4693 Total megabyte-milliseconds taken by all map tasks=22720512 Total megabyte-milliseconds taken by all reduce tasks=4805632 Map-Reduce Framework Map input records=2 Map output records=4 Map output bytes=36 Map output materialized bytes=56 Input split bytes=292 Combine input records=0 Combine output records=0 Reduce input groups=2 Reduce shuffle bytes=56 Reduce input records=4 Reduce output records=0 Spilled Records=8 Shuffled Maps =2 Failed Shuffles=0 Merged Map outputs=2 GC time elapsed (ms)=248 CPU time spent (ms)=2910 Physical memory (bytes) snapshot=849063936 Virtual memory (bytes) snapshot=8171692032 Total committed heap usage (bytes)=788529152 Peak Map Physical memory (bytes)=315535360 Peak Map Virtual memory (bytes)=2720288768 Peak Reduce Physical memory (bytes)=230117376 Peak Reduce Virtual memory (bytes)=2731159552 Shuffle Errors BAD_ID=0 CONNECTION=0 IO_ERROR=0 WRONG_LENGTH=0 WRONG_MAP=0 WRONG_REDUCE=0 File Input Format Counters Bytes Read=236 File Output Format Counters Bytes Written=97 Job Finished in 42.765 seconds Estimated value of Pi is 3.12000000000000000000
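As a side note on what the pi example actually computes: it scatters sample points over a unit square and counts how many fall inside the inscribed circle, so that pi is approximately 4 * inside / total. With only 2 maps x 100 samples the estimate is coarse (3.12 here, i.e. 156 of the 200 points landed inside). The Hadoop example uses a deterministic Halton sequence; the following sketch illustrates the same idea with pseudo-random points.
import random

def estimate_pi(samples: int, seed: int = 42) -> float:
    """Monte Carlo estimate of pi: fraction of random points in the unit square
    that fall inside the quarter circle of radius 1, multiplied by 4."""
    rng = random.Random(seed)
    inside = sum(1 for _ in range(samples)
                 if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4.0 * inside / samples

for n in (200, 20_000, 2_000_000):
    print(f"{n:9d} samples -> pi is approximately {estimate_pi(n):.5f}")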