Hadoop is an open-source Apache project for building parallel processing applications over large data sets distributed across networked nodes. It is composed of the Hadoop Distributed File System (HDFS™), which handles the scalability and redundancy of data across nodes, and Hadoop YARN, a framework for job scheduling that executes data processing tasks on all nodes. Developed by the Apache Software Foundation, Hadoop has become a popular tool for big data storage, processing, and analysis across clusters of computers.
Some of the key use cases of Hadoop are:
Data storage: Hadoop Distributed File System (HDFS) is a highly scalable and fault-tolerant distributed file system that is used to store large amounts of data across multiple nodes in a cluster. Hadoop is often used to store and manage large amounts of unstructured and semi-structured data, such as log files, sensor data, and social media data.
Batch processing: Hadoop provides a powerful framework for batch processing of large data sets. This is typically done using the MapReduce programming model, which allows for the parallel processing of data across multiple nodes in a cluster. Hadoop is often used for tasks such as data cleansing, data transformation, and data aggregation.
Data analysis: Hadoop is often used for data analysis and machine learning tasks. Hadoop provides a number of tools for processing and analyzing large data sets, including Apache Pig and Apache Hive. These tools allow users to query and analyze large data sets using SQL-like commands.
Real-time processing: Hadoop can also be used for real-time processing of data, using tools such as Apache Spark and Apache Flink. These tools allow for the processing of data streams in real-time, enabling applications such as fraud detection, real-time recommendations, and IoT data processing.
Apache Hadoop 3.x Benefits
- Supports multiple standby NameNodes.
- Supports multiple NameNodes for multiple namespaces.
- Storage overhead reduced from 200% to 50% through HDFS erasure coding (see the example after this list).
- Supports GPUs.
- Intra-node disk balancing.
- Support for opportunistic containers and distributed scheduling.
- Support for the Microsoft Azure Data Lake and Aliyun Object Storage System filesystem connectors.
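The storage-overhead reduction comes from HDFS erasure coding, which stores parity blocks instead of three full replicas. As a hedged sketch (the policy name and path below are illustrative; confirm the policies shipped with your release using -listPolicies):
-------------------------------
# List the erasure-coding policies known to this Hadoop release
hdfs ec -listPolicies
# Enable a Reed-Solomon policy and apply it to a directory
hdfs ec -enablePolicy -policy RS-6-3-1024k
hdfs ec -setPolicy -path /user/sachinpb/cold-data -policy RS-6-3-1024k
# Verify which policy is in effect on the directory
hdfs ec -getPolicy -path /user/sachinpb/cold-data
-------------------------------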
Architecture of Hadoop Cluster:
Apache Hadoop has two core components:
1) HDFS - for storage
2) YARN - for computation
The HDFS and YARN architectures are illustrated below:
[Figure: HDFS architecture]
[Figure: YARN architecture]
Before configuring the master and worker nodes, it’s good to understand the different components of a Hadoop cluster. A master node keeps knowledge about the distributed file system, like the inode table on an ext3 filesystem, and schedules resource allocation. hadoopNode1 will handle this role in this guide and host two daemons:
• The NameNode: manages the distributed file system and knows where data blocks are stored inside the cluster.
• The ResourceManager: manages the YARN jobs and takes care of scheduling and executing processes on worker nodes.
Worker nodes store the actual data and provide the processing power to run the jobs. Each worker hosts two daemons:
• The DataNode manages the actual data physically stored on the node;
• The NodeManager manages execution of tasks on the node.
Prerequisites for Implementing Hadoop
- Operating system – RHEL 7.6
- Hadoop – Hadoop 3.x package
- Passwordless SSH connections between the nodes in the cluster (a sample key-setup sketch follows this list)
- Firewall settings that allow the Hadoop ports on all machines in the cluster
- Machine details:
Worker nodes: hadoopNode1 & hadoopNode2 (Power8 servers with K80 GPUs running RHEL 7)
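For the passwordless-SSH prerequisite, a minimal sketch of the key setup (the user and host names follow this post's examples; adjust them for your environment):
-------------------------------
# Generate a key pair on the master node (accept the defaults)
ssh-keygen -t rsa -b 4096
# Copy the public key to every node in the cluster, including the master itself
ssh-copy-id sachinpb@hadoopNode1
ssh-copy-id sachinpb@hadoopNode2
# Verify that no password prompt appears
ssh sachinpb@hadoopNode2 hostname
-------------------------------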
-------------------------------------------------------------------
Hadoop Installation Steps:
Step 1: Download the Java 8 package and save the file in your home directory. Java is the primary requirement for running Hadoop on any system: the Hadoop 3.x jar files are compiled against the Java 8 runtime, so Java 8 must be installed, and users still on JDK 7 have to upgrade to JDK 8.
If your machine uses the IBM POWER architecture (ppc64le), you need to get the IBM Java package from the link below:
Download Link : https://developer.ibm.com/javasdk/downloads/sdk8/
Step 2: Extract the Java tar file.
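For example, assuming the IBM SDK tarball was saved in the home directory (the exact archive name depends on the build you downloaded, so treat it as a placeholder):
-------------------------------
cd ~
tar -xzf ibm-java-sdk-8.0-ppc64le-archive.tar.gz   # placeholder file name
-------------------------------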
Step 3: Download the Hadoop 3.x Package.
Download a stable version of Hadoop:
wget http://apache.spinellicreations.com/hadoop/common/hadoop-3.2.0/hadoop-3.2.0.tar.gz
Step 4: Extract the Hadoop tar File.
Extract the files at /home/users/sachinpb/sachinPB/:
tar xzvf hadoop-3.2.0.tar.gz
At the top level, /home/users/sachinpb/sachinPB/hadoop-3.2.0, you will see the following directories:
├── bin
│ ├── container-executor
│ ├── hadoop
│ ├── hadoop.cmd
│ ├── hdfs
│ ├── hdfs.cmd
│ ├── mapred
│ ├── mapred.cmd
│ ├── oom-listener
│ ├── test-container-executor
│ ├── yarn
│ └── yarn.cmd
├── etc
│ └── hadoop
│ ├── core-site.xml
│ ├── hadoop-env.sh
│ ├── hdfs-site.xml
│ ├── log4j.properties
│ ├── mapred-site.xml
│ ├── workers
│ ├── yarn-env.sh
│ └── yarn-site.xml
├── include
├── lib
│ └── native
│ ├── examples
│ ├── libhadoop.a
│ ├── libhadooppipes.a
│ ├── libhadoop.so -> libhadoop.so.1.0.0
│ ├── libhadoop.so.1.0.0
│ ├── libhadooputils.a
│ ├── libnativetask.a
│ ├── libnativetask.so -> libnativetask.so.1.0.0
│ └── libnativetask.so.1.0.0
├── logs
│
├── sbin
│ ├── hadoop-daemon.sh
│ ├── httpfs.sh
│ ├── mr-jobhistory-daemon.sh
│ ├── refresh-namenodes.sh
│ ├── start-all.sh
│ ├── start-balancer.sh
│ ├── start-dfs.sh
│ ├── start-secure-dns.sh
│ ├── start-yarn.sh
│ ├── stop-all.cmd
│ ├── stop-all.sh
│ ├── stop-balancer.sh
│ ├── stop-dfs.sh
│ ├── stop-secure-dns.sh
│ ├── stop-yarn.sh
│ ├── workers.sh
│ ├── yarn-daemon.sh
│
└── share
├── doc
│ └── hadoop
└── hadoop
├── client
├── common
├── hdfs
├── mapreduce
├── tools
└── yarn
Step 5: Add the Hadoop and Java paths in the bash file (.bashrc).
Update ~/.bashrc with the following entries:
export HADOOP_HOME=$HOME/sachinPB/hadoop-3.2.0
export HADOOP_CONF_DIR=$HOME/sachinPB/hadoop-3.2.0/etc/hadoop
export HADOOP_MAPRED_HOME=$HOME/sachinPB/hadoop-3.2.0
export HADOOP_COMMON_HOME=$HOME/sachinPB/hadoop-3.2.0
export HADOOP_HDFS_HOME=$HOME/sachinPB/hadoop-3.2.0
export YARN_HOME=$HOME/sachinPB/hadoop-3.2.0
export PATH=$PATH:$HOME/sachinPB/hadoop-3.2.0/bin
#set Java Home
export JAVA_HOME=/opt/ibm/java-ppc64le-80
export PATH=$PATH:/opt/ibm/java-ppc64le-80/bin
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native:$JAVA_LIBRARY_PATH
source ~/.bashrc
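To confirm the new environment is picked up by the shell, a quick sanity check (the exact paths will reflect your own layout):
-------------------------------
echo $HADOOP_HOME
which hadoop
-------------------------------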
Step 6: Edit the Hadoop configuration files as required by your application.
HOW TO CONFIGURE AND RUN BIG DATA APPLICATIONS?
The configuration files are at: $HOME/sachinPB/hadoop-3.2.0/etc/hadoop
Step 7: Open core-site.xml and edit the property shown below inside the configuration tag.
SET NAMENODE LOCATION
core-site.xml
-----------------------------------------------------------------------------
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoopNode1:9000</value>
</property>
</configuration>
-----------------------------------------------------------------------------
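After editing core-site.xml, you can confirm the value Hadoop actually resolves with a quick check (run from any node that has this configuration):
-------------------------------
hdfs getconf -confKey fs.defaultFS
# expected output: hdfs://hadoopNode1:9000
-------------------------------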
Step 8: Open hdfs-site.xml and edit the properties shown below inside the configuration tag.
SET PATH FOR HDFS
hdfs-site.xml
-----------------------------------------------------------------------------
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:$DATA_DIR/hadoop/hdfs/nn</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:$DATA_DIR/hadoop/hdfs/dn</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>
----------------------------------------------------------------------------
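It is good practice to create the directories referenced above up front, so that they exist with the right ownership before the NameNode is formatted. A sketch assuming the same $DATA_DIR placeholder used in the configuration:
-------------------------------
mkdir -p $DATA_DIR/hadoop/hdfs/nn    # NameNode metadata (master node)
mkdir -p $DATA_DIR/hadoop/hdfs/dn    # DataNode blocks (every worker node)
-------------------------------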
Step 9: Open the mapred-site.xml file and edit the properties shown below inside the configuration tag.
SET YARN AS JOB SCHEDULER
mapred-site.xml
-------------------------------------------------------------------------
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
</configuration>
-----------------------------------------------------------------------------
Step 10: Open yarn-site.xml and edit the property shown below inside the configuration tag.
CONFIGURE YARN
yarn-site.xml
----------------------------------------------------------------
<?xml version="1.0"?>
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
-----------------------------------------------------------------
Step 11: Edit hadoop-env.sh and yarn-env.sh and add the Java path as shown below.
export JAVA_HOME=$JAVA_PPC64LE_PATH
export HADOOP_HOME=$HOME/sachinPB/hadoop-3.2.0
Step 12: The workers file is used by the startup scripts to start the required daemons on all nodes. This file is new in Hadoop 3.x; it replaces the slaves file used in Hadoop 2.x.
CONFIGURE WORKERS
workers
-------------------------------
hadoopNode1
hadoopNode2
---------------------------------
Check the Java version:
[sachinpb@hadoopNode1 hadoop]$ java -version
openjdk version "1.8.0_181"
OpenJDK Runtime Environment (build 1.8.0_181-b13)
OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)
Check Hadoop version:
[sachinpb@hadoopNode1 hadoop]$ hadoop version
Hadoop 3.2.0
This command was run using $HOME/sachinPB/hadoop-3.2.0/share/hadoop/common/hadoop-common-3.2.0.jar
[sachinpb@hadoopNode1 hadoop]$
-----------------------------------------------
Step 13: Next, format the NameNode.
HDFS needs to be formatted like any classical file system. On the master node (hadoopNode1 in this setup), run the following command: "hdfs namenode -format"
[sachinpb@hadoopNode1]$ hdfs namenode -format -clusterId CID***-XYZ
2019-05-07 03:44:04,380 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = hadoopNode1/$HOST1_IPADDRESS
STARTUP_MSG: args = [-format, -clusterId, CID***-XYZ]
STARTUP_MSG: version = 3.2.0
STARTUP_MSG: classpath = $HOME/sachinpb/sachinPB/hadoop-3.2.0/etc/hadoop:$HOME/sachinpb/sachinPB/hadoop-.
.
.
.
.
[$DATA_DIR/hadoop/hdfs/nn/current/VERSION, $DATA_DIR/hadoop/hdfs/nn/current/seen_txid, $DATA_DIR/hadoop/hdfs/nn/current/fsimage_0000000000000000000.md5, $DATA_DIR/hadoop/hdfs/nn/current/fsimage_0000000000000000000, $DATA_DIR/hadoop/hdfs/nn/current/edits_0000000000000000001-0000000000000000002, $DATA_DIR/hadoop/hdfs/nn/current/edits_0000000000000000003-0000000000000000004, $DATA_DIR/hadoop/hdfs/nn/current/edits_0000000000000000005-0000000000000000006, $DATA_DIR/hadoop/hdfs/nn/current/edits_0000000000000000007-0000000000000000008, $DATA_DIR/hadoop/hdfs/nn/current/edits_inprogress_0000000000000000009]
2019-05-07 03:44:08,926 INFO common.Storage: Storage directory $DATA_DIR/hadoop/hdfs/nn has been successfully formatted.
2019-05-07 03:44:08,937 INFO namenode.FSImageFormatProtobuf: Saving image file $DATA_DIR/hadoop/hdfs/nn/current/fsimage.ckpt_0000000000000000000 using no compression
2019-05-07 03:44:09,063 INFO namenode.FSImageFormatProtobuf: Image file $DATA_DIR/hadoop/hdfs/nn/current/fsimage.ckpt_0000000000000000000 of size 401 bytes saved in 0 seconds .
2019-05-07 03:44:09,089 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
2019-05-07 03:44:09,104 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoopNode1/$HOST1_IPADDRESS
************************************************************/
[sachinpb@hadoopNode1 logs]$
------------------------------------------
Step 14: Once the NameNode is formatted, go to the hadoop-3.2.0/sbin directory and start all the daemons.
Your Hadoop installation is now configured and ready to run big data applications.
Step 15: Start the HDFS and YARN daemons:
From directory: $HOME/sachinPB/hadoop-3.2.0/sbin
NOTE: Copy the Hadoop home directory to all the nodes in your cluster (if it is not already available on a shared directory).
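A minimal way to do that copy, assuming the same home-directory layout on both nodes (scp works equally well):
-------------------------------
rsync -a $HOME/sachinPB/hadoop-3.2.0 sachinpb@hadoopNode2:$HOME/sachinPB/
-------------------------------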
[sachinpb@hadoopNode1 sbin]$ ./start-all.sh
WARNING: Attempting to start all Apache Hadoop daemons as sachinpb in 10 seconds..
Starting namenodes on [hadoopNode1]
hadoopNode1: Welcome to hadoopNode1!
hadoopNode1:
hadoopNode1: namenode is running as process 146418.
Starting datanodes
hadoopNode1: Welcome to hadoopNode1!
hadoopNode1:
hadoopNode2: Welcome to hadoopNode2!
hadoopNode2:
hadoopNode1: datanode is running as process 146666.
hadoopNode2: datanode is running as process 112502.
Starting secondary namenodes [hadoopNode1]
hadoopNode1: Welcome to hadoopNode1!
hadoopNode1:
hadoopNode1: secondarynamenode is running as process 147091.
[sachinpb@hadoopNode1 sbin]$
Step 16: All the Hadoop services are up and running. [On other platforms you could use the jps command to see the Hadoop daemons; IBM Java does not provide jps or jstat, so check the Hadoop processes with the ps command.]
[sachinpb@hadoopNode1 sbin]$ ps -ef | grep NameNode
sachinpb 105015 1 0 02:59 ? 00:00:27 $JAVA_PPC64LE_PATH/bin/java -Dproc_namenode -Djava.library.path=$HOME/sachinpb/sachinPB/hadoop-3.2.0/lib -Dhdfs.audit.logger=INFO,NullAppender -Dhadoop.security.logger=INFO,RFAS -Dyarn.log.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0/logs -Dyarn.log.file=hadoop-sachinpb-namenode-hadoopNode1.log -Dyarn.home.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0 -Dyarn.root.logger=INFO,console -Dhadoop.log.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0/logs -Dhadoop.log.file=hadoop-sachinpb-namenode-hadoopNode1.log -Dhadoop.home.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0 -Dhadoop.id.str=sachinpb -Dhadoop.root.logger=INFO,RFA -Dhadoop.policy.file=hadoop-policy.xml org.apache.hadoop.hdfs.server.namenode.NameNode
-----
sachinpb 105713 1 0 02:59 ? 00:00:12 $JAVA_PPC64LE_PATH/bin/java -Dproc_secondarynamenode -Djava.library.path=$HOME/sachinpb/sachinPB/hadoop-3.2.0/lib -Dhdfs.audit.logger=INFO,NullAppender -Dhadoop.security.logger=INFO,RFAS -Dyarn.log.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0/logs -Dyarn.log.file=hadoop-sachinpb-secondarynamenode-hadoopNode1.log -Dyarn.home.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0 -Dyarn.root.logger=INFO,console -Dhadoop.log.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0/logs -Dhadoop.log.file=hadoop-sachinpb-secondarynamenode-hadoopNode1.log -Dhadoop.home.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0 -Dhadoop.id.str=sachinpb -Dhadoop.root.logger=INFO,RFA -Dhadoop.policy.file=hadoop-policy.xml org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode
[sachinpb@hadoopNode1 sbin]$ ps -ef | grep DataNode
sachinpb 105268 1 0 02:59 ? 00:00:19 $JAVA_PPC64LE_PATH/bin/java -Dproc_datanode -Djava.library.path=$HOME/sachinpb/sachinPB/hadoop-3.2.0/lib -Dhadoop.security.logger=ERROR,RFAS -Dyarn.log.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0/logs -Dyarn.log.file=hadoop-sachinpb-datanode-hadoopNode1.log -Dyarn.home.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0 -Dyarn.root.logger=INFO,console -Dhadoop.log.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0/logs -Dhadoop.log.file=hadoop-sachinpb-datanode-hadoopNode1.log -Dhadoop.home.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0 -Dhadoop.id.str=sachinpb -Dhadoop.root.logger=INFO,RFA -Dhadoop.policy.file=hadoop-policy.xml org.apache.hadoop.hdfs.server.datanode.DataNode
------
[sachinpb@hadoopNode1 sbin]$ ps -ef | grep ResourceManager
sachinpb 106257 1 1 02:59 pts/3 00:00:50 $JAVA_PPC64LE_PATH/bin/java -Dproc_resourcemanager -Djava.library.path=$HOME/sachinpb/sachinPB/hadoop-3.2.0/lib -Dservice.libdir=$HOME/sachinpb/sachinPB/hadoop-3.2.0/share/hadoop/yarn,$HOME/sachinpb/sachinPB/hadoop-3.2.0/share/hadoop/yarn/lib,$HOME/sachinpb/sachinPB/hadoop-3.2.0/share/hadoop/hdfs,$HOME/sachinpb/sachinPB/hadoop-3.2.0/share/hadoop/hdfs/lib,$HOME/sachinpb/sachinPB/hadoop-3.2.0/share/hadoop/common,$HOME/sachinpb/sachinPB/hadoop-3.2.0/share/hadoop/common/lib -Dyarn.log.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0/logs -Dyarn.log.file=hadoop-sachinpb-resourcemanager-hadoopNode1.log -Dyarn.home.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0 -Dyarn.root.logger=INFO,console -Dhadoop.log.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0/logs -Dhadoop.log.file=hadoop-sachinpb-resourcemanager-hadoopNode1.log -Dhadoop.home.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0 -Dhadoop.id.str=sachinpb -Dhadoop.root.logger=INFO,RFA -Dhadoop.policy.file=hadoop-policy.xml -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.yarn.server.resourcemanager.ResourceManager
[sachinpb@hadoopNode1 sbin]$ ps -ef | grep NodeManager
sachinpb 106621 1 1 02:59 ? 00:01:08 $JAVA_PPC64LE_PATH/bin/java -Dproc_nodemanager -Djava.library.path=$HOME/sachinpb/sachinPB/hadoop-3.2.0/lib -Dyarn.log.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0/logs -Dyarn.log.file=hadoop-sachinpb-nodemanager-hadoopNode1.log -Dyarn.home.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0 -Dyarn.root.logger=INFO,console -Dhadoop.log.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0/logs -Dhadoop.log.file=hadoop-sachinpb-nodemanager-hadoopNode1.log -Dhadoop.home.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0 -Dhadoop.id.str=sachinpb -Dhadoop.root.logger=INFO,RFA -Dhadoop.policy.file=hadoop-policy.xml -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.yarn.server.nodemanager.NodeManager
Similarly, check the status of the Hadoop daemons on the other worker node [hadoopNode2]:
[sachinpb@hadoopNode2 ~]$ ps -ef | grep hadoop
sachinpb 77718 1 7 21:52 ? 00:00:07 $JAVA_PPC64LE_PATH/bin/java -Dproc_datanode -Djava.library.path=$HOME/sachinpb/sachinPB/hadoop-3.2.0/lib -Dhadoop.security.logger=ERROR,RFAS -Dyarn.log.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0/logs -Dyarn.log.file=hadoop-sachinpb-datanode-hadoopNode2.log -Dyarn.home.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0 -Dyarn.root.logger=INFO,console -Dhadoop.log.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0/logs -Dhadoop.log.file=hadoop-sachinpb-datanode-hadoopNode2.log -Dhadoop.home.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0 -Dhadoop.id.str=sachinpb -Dhadoop.root.logger=INFO,RFA -Dhadoop.policy.file=hadoop-policy.xml org.apache.hadoop.hdfs.server.datanode.DataNode
sachinpb 78006 1 12 21:52 ? 00:00:11 $JAVA_PPC64LE_PATH/bin/java -Dproc_nodemanager -Djava.library.path=$HOME/sachinpb/sachinPB/hadoop-3.2.0/lib -Dyarn.log.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0/logs -Dyarn.log.file=hadoop-sachinpb-nodemanager-hadoopNode2.log -Dyarn.home.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0 -Dyarn.root.logger=INFO,console -Dhadoop.log.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0/logs -Dhadoop.log.file=hadoop-sachinpb-nodemanager-hadoopNode2.log -Dhadoop.home.dir=$HOME/sachinpb/sachinPB/hadoop-3.2.0 -Dhadoop.id.str=sachinpb -Dhadoop.root.logger=INFO,RFA -Dhadoop.policy.file=hadoop-policy.xml -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.yarn.server.nodemanager.NodeManager
[sachinpb@hadoopNode2 ~]$
Step 17: Now open the browser and go to localhost:9870/dfshealth.html to check the NameNode interface.
NOTE: In Hadoop 2.x the web UI port was 50070; in Hadoop 3.x it moved to 9870, so the HDFS web UI is reachable at localhost:9870.
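If the node has no browser, the same page can be probed from the terminal (host name as used in this post):
-------------------------------
curl -s http://hadoopNode1:9870/dfshealth.html | head
-------------------------------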
Step 18: Run a Hadoop application – example: the wordcount MapReduce program.
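The job reads its input from HDFS, so stage an input file there first. A minimal sketch (the file contents are illustrative; the HDFS path matches the command that follows):
-------------------------------
echo "hello world hello" > helloworld
hdfs dfs -mkdir -p /user/sachinpb
hdfs dfs -put helloworld /user/sachinpb/helloworld
-------------------------------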
[sachinpb@hadoopNode1 hadoop-3.2.0]$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.0.jar wordcount /user/sachinpb/helloworld /user/sachinpb/helloworld_out
2019-05-07 04:04:35,044 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
2019-05-07 04:04:36,137 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: $MY_DIR/hadoop-yarn/staging/sachinpb/.staging/job_1557225898252_0003
2019-05-07 04:04:36,374 INFO input.FileInputFormat: Total input files to process : 1
2019-05-07 04:04:36,486 INFO mapreduce.JobSubmitter: number of splits:1
2019-05-07 04:04:36,536 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
2019-05-07 04:04:36,728 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1557225898252_0003
2019-05-07 04:04:36,729 INFO mapreduce.JobSubmitter: Executing with tokens: []
2019-05-07 04:04:36,939 INFO conf.Configuration: resource-types.xml not found
2019-05-07 04:04:36,939 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2019-05-07 04:04:36,997 INFO impl.YarnClientImpl: Submitted application application_1557225898252_0003
2019-05-07 04:04:37,029 INFO mapreduce.Job: The url to track the job: http://hadoopNode1:8088/proxy/application_1557225898252_0003/
2019-05-07 04:04:37,030 INFO mapreduce.Job: Running job: job_1557225898252_0003
2019-05-07 04:04:45,137 INFO mapreduce.Job: Job job_1557225898252_0003 running in uber mode : false
2019-05-07 04:04:45,138 INFO mapreduce.Job: map 0% reduce 0%
2019-05-07 04:04:51,189 INFO mapreduce.Job: map 100% reduce 0%
2019-05-07 04:04:59,223 INFO mapreduce.Job: map 100% reduce 100%
2019-05-07 04:04:59,232 INFO mapreduce.Job: Job job_1557225898252_0003 completed successfully
2019-05-07 04:04:59,348 INFO mapreduce.Job: Counters: 54
File System Counters
FILE: Number of bytes read=41
FILE: Number of bytes written=443547
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=126
HDFS: Number of bytes written=23
HDFS: Number of read operations=8
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
HDFS: Number of bytes read erasure-coded=0
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=3968
Total time spent by all reduces in occupied slots (ms)=4683
Total time spent by all map tasks (ms)=3968
Total time spent by all reduce tasks (ms)=4683
Total vcore-milliseconds taken by all map tasks=3968
Total vcore-milliseconds taken by all reduce tasks=4683
Total megabyte-milliseconds taken by all map tasks=4063232
Total megabyte-milliseconds taken by all reduce tasks=4795392
Map-Reduce Framework
Map input records=1
Map output records=3
Map output bytes=29
Map output materialized bytes=41
Input split bytes=109
Combine input records=3
Combine output records=3
Reduce input groups=3
Reduce shuffle bytes=41
Reduce input records=3
Reduce output records=3
Spilled Records=6
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=288
CPU time spent (ms)=4030
Physical memory (bytes) snapshot=350552064
Virtual memory (bytes) snapshot=3825860608
Total committed heap usage (bytes)=177668096
Peak Map Physical memory (bytes)=226557952
Peak Map Virtual memory (bytes)=1911750656
Peak Reduce Physical memory (bytes)=123994112
Peak Reduce Virtual memory (bytes)=1914109952
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=17
File Output Format Counters
Bytes Written=23
[sachinpb@hadoopNode1 hadoop-3.2.0]$
------------------------
Step 19: Verify the output file in HDFS:
[sachinpb@hadoopNode1 hadoop-3.2.0]$ hdfs dfs -cat /user/sachinpb/helloworld_out/part-r-00000
---------------------
2019 4
hello 6
world 7
---------------------
Step 20: MONITOR YOUR HDFS CLUSTER
[sachinpb@hadoopNode1]$ hdfs dfsadmin -report
Configured Capacity: 1990698467328 (1.81 TB)
Present Capacity: 1794297528320 (1.63 TB)
DFS Remaining: 1794297511936 (1.63 TB)
DFS Used: 16384 (16 KB)
DFS Used%: 0.00%
Replicated Blocks:
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0
Low redundancy blocks with highest priority to recover: 0
Pending deletion blocks: 0
Erasure Coded Block Groups:
Low redundancy block groups: 0
Block groups with corrupt internal blocks: 0
Missing block groups: 0
Low redundancy blocks with highest priority to recover: 0
Pending deletion blocks: 0
-------------------------------------------------
Live datanodes (2):
Name: $HOST1_IPADDRESS:9866 (hadoopNode1)
Hostname: hadoopNode1
Decommission Status : Normal
Configured Capacity: 995349233664 (926.99 GB)
DFS Used: 12288 (12 KB)
Non DFS Used: 118666317824 (110.52 GB)
DFS Remaining: 876682903552 (816.47 GB)
DFS Used%: 0.00%
DFS Remaining%: 88.08%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Thu May 09 23:23:50 PDT 2019
Last Block Report: Thu May 09 23:11:35 PDT 2019
Num of Blocks: 0
Name: $HOST2_IPADDESS:9866 (hadoopNode2)
Hostname: hadoopNode2
Decommission Status : Normal
Configured Capacity: 995349233664 (926.99 GB)
DFS Used: 4096 (4 KB)
Non DFS Used: 77734621184 (72.40 GB)
DFS Remaining: 917614608384 (854.60 GB)
DFS Used%: 0.00%
DFS Remaining%: 92.19%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Thu May 09 23:23:49 PDT 2019
Last Block Report: Thu May 09 23:20:58 PDT 2019
Num of Blocks: 0
NOTE: You can see the two live DataNodes (hadoopNode1 & hadoopNode2) in this cluster, with details about allocated HDFS space, block counts, etc. This is how you check the health of the Hadoop cluster; the wordcount application was also tested on this cluster as shown above.
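Besides dfsadmin -report, the file-system checker gives a quick consistency view; on a healthy cluster it ends with "The filesystem under path '/' is HEALTHY":
-------------------------------
hdfs fsck /
-------------------------------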
Step 21: How to stop the Hadoop daemons in a cluster environment:
cd to $HOME/sachinpb/sachinPB/hadoop-3.2.0/sbin
[sachinpb@hadoopNode1 sbin]$ ./stop-all.sh
WARNING: Stopping all Apache Hadoop daemons as sachinpb in 10 seconds.
WARNING: Use CTRL-C to abort.
Stopping namenodes on [hadoopNode1]
hadoopNode1: Welcome to hadoopNode1!
hadoopNode1:
Stopping datanodes
hadoopNode1: Welcome to hadoopNode1!
hadoopNode1:
hadoopNode2: Welcome to hadoopNode2!
hadoopNode2:
Stopping secondary namenodes [hadoopNode1]
hadoopNode1: Welcome to hadoopNode1!
hadoopNode1:
Stopping nodemanagers
Stopping resourcemanagers on []
[sachinpb@hadoopNode1 sbin]$
I hope this blog helped in understanding how to install Hadoop 3.x in a multi-node setup, i.e. a cluster, and how to perform operations on HDFS files. Overall, Hadoop is a powerful tool for big data processing and analysis, with a wide range of use cases in industries such as finance, healthcare, retail, and telecommunications.
----------------------------------------END-------------------------------------------
Reference:
1) https://hadoop.apache.org/docs/r3.0.3/hadoop-project-dist/hadoop-common/SingleCluster.html
2) https://hadoop.apache.org/docs/r3.0.3/hadoop-project-dist/hadoop-common/ClusterSetup.html
3) https://hadoop.apache.org/docs/r3.0.3/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
4) http://www.sachinpbuzz.com/2014/01/big-data-hadoop-20yarn-multi-node.html