Apache Hadoop 2 / YARN / MR2 Installation for Beginners
Background:
Big Data spans three dimensions: Volume, Velocity and Variety (IBM later defined a fourth property, Veracity). Apache™ Hadoop® is an open-source software project that enables the distributed processing of large data sets (Big Data) across clusters of commodity machines (low-cost servers). It is designed to scale up to thousands of machines with a high degree of fault tolerance; the software has the intelligence to detect and handle failures at the application layer.
NOTE: More details are available at http://hadoop.apache.org/docs/stable/
- Apache Hadoop 2 introduces two new terms for Hadoop 1.0 users: MapReduce 2 (MR2) and YARN.
- Apache Hadoop YARN is the next-generation Hadoop framework, designed to take Hadoop beyond MapReduce for data processing; the result is better cluster utilization, permitting Hadoop to scale to more and larger jobs.
- This blog provides information to help users migrate their Apache Hadoop MapReduce applications from Apache Hadoop 1.x to Apache Hadoop 2.x:
https://hadoop.apache.org/docs/current2/hadoop-yarn/hadoop-yarn-site/YARN.html
Steps to Install Hadoop2.0 on CentOS/RHEL6 on single node Cluster setup:
Step 1: Install Java from: http://www.oracle.com/technetwork/java/javase/downloads/index.html
Set the $JAVA_HOME environment variable appropriately.
NOTE: java-1.6.0-openjdk or one of the other Hadoop-tested Java versions listed at the link below is preferable.
http://wiki.apache.org/hadoop/HadoopJavaVersions
Step 2: Download Apache Hadoop 2.2 to the folder $PACKAGE_HOME from: http://hadoop.apache.org/releases.html#Download
Step 3: Add all Hadoop and Java environment path variables to the .bashrc file.
Example:
Configure $HOME/.bashrc
- HADOOP_HOME
- JAVA_PATH
- PATH
- HADOOP_HDFS_HOME
- HADOOP_YARN_HOME
- HADOOP_MAPRED_HOME
- HADOOP_CONF_DIR
- YARN_CLASS_PATH
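A minimal .bashrc sketch covering these variables is shown below. The paths are illustrative assumptions: it presumes $PACKAGE_HOME points at the directory where the Hadoop tarball was unpacked (Step 2) and that Java lives under /usr/java/default; adjust both for your system.

```shell
# Illustrative values -- adjust JAVA_HOME and PACKAGE_HOME to your system.
export JAVA_HOME=/usr/java/default
export HADOOP_HOME=$PACKAGE_HOME/hadoop-2.2.0
# The component homes usually all point at the same installation directory.
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
# Put the hadoop/hdfs/yarn client scripts and daemon scripts on the PATH.
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
```

After sourcing this file, commands such as hadoop, hdfs and the start/stop scripts resolve without full paths.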
------------------------------------------------------------------------------------------
Step 4: Create a separate group for the Hadoop setup
# groupadd hadoop
Step 5: Add 3 user accounts to the group "hadoop"
# useradd -g hadoop yarn
# useradd -g hadoop hdfs
# useradd -g hadoop mapred
NOTE: It's good practice to run each daemon under its own dedicated account.
Step 6: Create data directories for the namenode, datanode and secondary namenode
# mkdir -p $CONFIG/data/hadoop/hdfs/nn
# mkdir -p $CONFIG/data/hadoop/hdfs/dn
# mkdir -p $CONFIG/data/hadoop/hdfs/snn
Step 7: Set ownership for the "hdfs" account
# chown -R hdfs:hadoop $CONFIG/data/hadoop/hdfs
Step 8: Create Log Directories
# mkdir -p $CONFIG/log/hadoop/yarn
# mkdir logs (in the installation directory, e.g. $PACKAGE_HOME/hadoop-2.2.0/logs)
Step 9: Set ownership to "yarn"
# chown -R yarn:hadoop $CONFIG/log/hadoop/yarn
Go to the Hadoop directory "$PACKAGE_HOME/hadoop-2.2.0/"
# chmod g+w logs
# chown -R yarn:hadoop .
Step 10: Configure the files listed below in $HADOOP_PREFIX/etc/hadoop:
------------------------------------------------------------------------------------------------------------------
i) core-site.xml
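The core-site.xml listing is not reproduced here. For a single-node setup it typically only needs fs.defaultFS, the URI of the NameNode; the hostname and port below are assumptions, adjust as needed:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```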
---------------------------------------------------------------------------------------------------------------------
ii) hadoop-env.sh
[root@spb-master hadoop]# cat hadoop-env.sh
# Copyright 2011 The Apache Software Foundation
export JAVA_HOME=$BIN/java/default
export HADOOP_PREFIX=$PACKAGE_HOME/hadoop-2.2.0
export HADOOP_HDFS_HOME=$PACKAGE_HOME/hadoop-2.2.0
export HADOOP_COMMON_HOME=$PACKAGE_HOME/hadoop-2.2.0
export HADOOP_MAPRED_HOME=$PACKAGE_HOME/hadoop-2.2.0
export HADOOP_YARN_HOME=$PACKAGE_HOME/hadoop-2.2.0
export HADOOP_CONF_DIR=$PACKAGE_HOME/hadoop-2.2.0/etc/hadoop/
export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_PREFIX}/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib"
export HADOOP_LOG_DIR=$PACKAGE_HOME/hadoop-2.2.0/logs
# The maximum amount of heap to use, in MB. Default is 1000.
export HADOOP_HEAPSIZE=500
export HADOOP_NAMENODE_INIT_HEAPSIZE="500"
export HADOOP_JOB_HISTORYSERVER_HEAPSIZE="200"
------------------------------------------------------------------------------------------------------------------------
iii) hdfs-site.xml
[root@spb-master hadoop]# cat hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:$DATA_DIR/data/hadoop/hdfs/nn</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:$DATA_DIR/data/hadoop/hdfs/dn</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>
[root@spb-master hadoop]#
iv) mapred-site.xml
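The mapred-site.xml listing is not reproduced here. The key setting in Hadoop 2 is to route MapReduce jobs through YARN rather than the classic MR1 runtime; a minimal sketch:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```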
---------------------------------------------------------------------------------------------------------------------
v) yarn-env.sh
[root@spb-master hadoop]# cat yarn-env.sh
export JAVA_HOME=$BIN/java/default
export HADOOP_PREFIX=$PACKAGE_HOME/hadoop-2.2.0
export HADOOP_HDFS_HOME=$PACKAGE_HOME/hadoop-2.2.0
export HADOOP_COMMON_HOME=$PACKAGE_HOME/hadoop-2.2.0
export HADOOP_MAPRED_HOME=$PACKAGE_HOME/hadoop-2.2.0
export HADOOP_YARN_HOME=$PACKAGE_HOME/hadoop-2.2.0
export HADOOP_CONF_DIR=$PACKAGE_HOME/hadoop-2.2.0/etc/hadoop/
export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_PREFIX}/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib"
JAVA=$JAVA_HOME/bin/java
JAVA_HEAP_MAX=-Xmx500m
# For setting YARN specific HEAP sizes please use this
# Parameter and set appropriately
YARN_HEAPSIZE=500
vi) yarn-site.xml
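The yarn-site.xml listing is not reproduced here. A minimal single-node configuration enables the MapReduce shuffle as a NodeManager auxiliary service, which the reduce phase relies on:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
```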
---------------------------------------------------------------------------------------------------------------
Step 11: Set up a passwordless ssh session for the "hdfs" user account:
# su - hdfs
hdfs@localhost$ ssh-keygen -t rsa
hdfs@localhost$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
hdfs@localhost$ chmod 0600 ~/.ssh/authorized_keys
For a multi-node setup, also copy the public key to the other hosts:
hdfs@localhost$ ssh-copy-id -i ~/.ssh/id_rsa.pub hostname1
hdfs@localhost$ ssh-copy-id -i ~/.ssh/id_rsa.pub hostname2
hdfs@localhost$ ssh-copy-id -i ~/.ssh/id_rsa.pub hostname3
NOTE: It's important that the user's home directory (/home/USER) has permissions 700 or 755, e.g.:
# chmod 755 /home/hdfs
---------------------------------------------------------------------------------
Step 12: Now you can log in without being prompted for a password:
[hdfs@localhost]$ ssh localhost
Last login: Sun Dec 29 04:31:44 2013 from localhost
[hdfs@localhost ~]$
---------------------------------------------------------------------------------------------------------------
Step 13: Format the Hadoop file system:
Format the NameNode directory as the HDFS superuser (the "hdfs" user account):
# su - hdfs
$ cd $PACKAGE_HOME/hadoop-2.2.0/bin
$ ./hdfs namenode -format
It should report that $CONFIG/data/hadoop/hdfs/nn has been successfully formatted, as shown below:
[hdfs@localhost bin]$ ./hdfs namenode -format
13/12/29 02:36:52 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = localhost.localdomain/127.0.0.x
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.2.0
STARTUP_MSG: classpath = $PACKAGE_HOME/hadoop-2.2.0/etc/hadoop/:$PACKAGE_HOME/hadoop-2.2.0/share/hadoop/common/lib/jetty-6.1.26.jar:$PACKAGE_HOME/hadoop-2.2.0/share/hadoop/common/lib/commons-el-1.0.jar:
STARTUP_MSG: java = 1.7.0_45
************************************************************/
13/12/29 02:36:52 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library //hadoop-2.2.0/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
13/12/29 02:36:53 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Formatting using clusterid: CID-d47a364a-edc6-455f-b3c8-4d2ba54458d5
13/12/29 02:36:54 INFO namenode.HostFileManager: read includes:
HostSet(
)
13/12/29 02:36:54 INFO namenode.HostFileManager: read excludes:
HostSet(
)
13/12/29 02:36:54 INFO blockmanagement.DatanodeManager: dfs.block.invalidate.limit=1000
13/12/29 02:36:54 INFO util.GSet: Computing capacity for map BlocksMap
13/12/29 02:36:54 INFO util.GSet: VM type = 64-bit
13/12/29 02:36:54 INFO util.GSet: 2.0% max memory = 96.7 MB
13/12/29 02:36:54 INFO util.GSet: capacity = 2^18 = 262144 entries
13/12/29 02:36:54 INFO blockmanagement.BlockManager: dfs.block.access.token.enable=false
13/12/29 02:36:54 INFO blockmanagement.BlockManager: defaultReplication = 1
13/12/29 02:36:54 INFO blockmanagement.BlockManager: maxReplication = 512
13/12/29 02:36:54 INFO blockmanagement.BlockManager: minReplication = 1
13/12/29 02:36:54 INFO blockmanagement.BlockManager: maxReplicationStreams = 2
13/12/29 02:36:54 INFO blockmanagement.BlockManager: shouldCheckForEnoughRacks = false
13/12/29 02:36:54 INFO blockmanagement.BlockManager: replicationRecheckInterval = 3000
13/12/29 02:36:54 INFO blockmanagement.BlockManager: encryptDataTransfer = false
13/12/29 02:36:54 INFO namenode.FSNamesystem: fsOwner = hdfs (auth:SIMPLE)
13/12/29 02:36:54 INFO namenode.FSNamesystem: supergroup = supergroup
13/12/29 02:36:54 INFO namenode.FSNamesystem: isPermissionEnabled = true
13/12/29 02:36:54 INFO namenode.FSNamesystem: HA Enabled: false
13/12/29 02:36:54 INFO namenode.FSNamesystem: Append Enabled: true
13/12/29 02:36:54 INFO util.GSet: Computing capacity for map INodeMap
13/12/29 02:36:54 INFO util.GSet: VM type = 64-bit
13/12/29 02:36:54 INFO util.GSet: 1.0% max memory = 96.7 MB
13/12/29 02:36:54 INFO util.GSet: capacity = 2^17 = 131072 entries
13/12/29 02:36:54 INFO namenode.NameNode: Caching file names occuring more than 10 times
13/12/29 02:36:54 INFO namenode.FSNamesystem: dfs.namenode.safemode.threshold-pct = 0.9990000128746033
13/12/29 02:36:54 INFO namenode.FSNamesystem: dfs.namenode.safemode.min.datanodes = 0
13/12/29 02:36:54 INFO namenode.FSNamesystem: dfs.namenode.safemode.extension = 30000
13/12/29 02:36:54 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
13/12/29 02:36:54 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
13/12/29 02:36:54 INFO util.GSet: Computing capacity for map Namenode Retry Cache
13/12/29 02:36:54 INFO util.GSet: VM type = 64-bit
13/12/29 02:36:54 INFO util.GSet: 0.029999999329447746% max memory = 96.7 MB
13/12/29 02:36:54 INFO util.GSet: capacity = 2^12 = 4096 entries
13/12/29 02:36:55 INFO common.Storage: Storage directory $CONFIG/data/hadoop/hdfs/nn has been successfully formatted.
13/12/29 02:36:56 INFO namenode.FSImage: Saving image file $CONFIG/data/hadoop/hdfs/nn/current/fsimage.ckpt_0000000000000000000 using no compression
13/12/29 02:36:56 INFO namenode.FSImage: Image file $CONFIG/data/hadoop/hdfs/nn/current/fsimage.ckpt_0000000000000000000 of size 196 bytes saved in 0 seconds.
13/12/29 02:36:56 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
13/12/29 02:36:56 INFO util.ExitUtil: Exiting with status 0
13/12/29 02:36:56 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost.localdomain/127.0.0.x
************************************************************/
[hdfs@localhost bin]$
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Step 24: This command gives you information on the HDFS file system (in Hadoop 2.x, "hdfs dfsadmin -report" is preferred over the deprecated "hadoop dfsadmin"):
[hdfs@localhost bin]$ ./hadoop dfsadmin -report
Configured Capacity: 16665448448 (15.52 GB)
Present Capacity: 12396371968 (11.55 GB)
DFS Remaining: 12396347392 (11.54 GB)
DFS Used: 24576 (24 KB)
DFS Used%: 0.00%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Datanodes available: 1 (1 total, 0 dead)
Live datanodes:
Name: 127.0.0.x:50010 (localhost)
Hostname: localhost
Decommission Status : Normal
Configured Capacity: 16665448448 (15.52 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 4269076480 (3.98 GB)
DFS Remaining: 12396347392 (11.54 GB)
DFS Used%: 0.00%
DFS Remaining%: 74.38%
Last contact: Sun Dec 29 03:11:02 PST 2013
[hdfs@localhost bin]$
________________________________________________________
Step 25: Stop all the services by running "stop-all.sh"
[hdfs@localhost sbin]$ ./stop-all.sh
This script is Deprecated. Instead use stop-dfs.sh and stop-yarn.sh
Stopping namenodes on [localhost]
localhost: stopping namenode
localhost: stopping datanode
Stopping secondary namenodes [0.0.0.0]
0.0.0.0: stopping secondarynamenode
stopping yarn daemons
stopping resourcemanager
localhost: stopping nodemanager
no proxyserver to stop
[hdfs@localhost sbin]$
________________________________________________________
Step 26: Start all the services by running "start-all.sh"
The YARN architecture block diagram (not reproduced here) shows where each daemon runs within the different components.
[hdfs@localhost sbin]$ ./start-all.sh
Check the status of all services:
[hdfs@localhost sbin]$ jps
6161 NameNode
6260 DataNode
6719 NodeManager
6750 Jps
6355 SecondaryNameNode
6429 ResourceManager
Job definition and control flow between Hadoop/YARN components:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=31822268
________________________________________________________
Step 27: Run the sample application "pi" from hadoop-mapreduce-examples-2.2.0.jar
As a first test, run an existing Hadoop program: launch it, monitor its progress, and get/put files on HDFS. This program calculates the value of pi in parallel, here with 2 maps of 10 samples each:
[hdfs@localhost bin]$ ./hadoop jar ../share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 2 10
Number of Maps = 2
Samples per Map = 10
Wrote input for Map #0
Wrote input for Map #1
Starting Job
13/12/29 04:33:12 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
13/12/29 04:33:13 INFO input.FileInputFormat: Total input paths to process : 2
13/12/29 04:33:13 INFO mapreduce.JobSubmitter: number of splits:2
13/12/29 04:33:14 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1388320369543_0001
13/12/29 04:33:15 INFO impl.YarnClientImpl: Submitted application application_1388320369543_0001 to ResourceManager at /0.0.0.0:8032
13/12/29 04:33:15 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1388320369543_0001/
13/12/29 04:33:15 INFO mapreduce.Job: Running job: job_1388320369543_0001
13/12/29 04:33:38 INFO mapreduce.Job: Job job_1388320369543_0001 running in uber mode : false
13/12/29 04:33:38 INFO mapreduce.Job: map 0% reduce 0%
13/12/29 04:35:22 INFO mapreduce.Job: map 83% reduce 0%
13/12/29 04:35:23 INFO mapreduce.Job: map 100% reduce 0%
13/12/29 04:36:10 INFO mapreduce.Job: map 100% reduce 100%
13/12/29 04:36:16 INFO mapreduce.Job: Job job_1388320369543_0001 completed successfully
13/12/29 04:36:16 INFO mapreduce.Job: Counters: 43
File System Counters
FILE: Number of bytes read=50
FILE: Number of bytes written=238681
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=528
HDFS: Number of bytes written=215
HDFS: Number of read operations=11
HDFS: Number of large read operations=0
HDFS: Number of write operations=3
Job Counters
Launched map tasks=2
Launched reduce tasks=1
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=208977
Total time spent by all reduces in occupied slots (ms)=39840
Map-Reduce Framework
Map input records=2
Map output records=4
Map output bytes=36
Map output materialized bytes=56
Input split bytes=292
Combine input records=0
Combine output records=0
Reduce input groups=2
Reduce shuffle bytes=56
Reduce input records=4
Reduce output records=0
Spilled Records=8
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=1712
CPU time spent (ms)=3320
Physical memory (bytes) snapshot=454049792
Virtual memory (bytes) snapshot=3515953152
Total committed heap usage (bytes)=268247040
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=236
File Output Format Counters
Bytes Written=97
Job Finished in 184.356 seconds
Estimated value of Pi is 3.80000000000000000000
[hdfs@localhost bin]$
________________________________________________________________
Step 28: Verify the running services using the web interface:
The web interface for the ResourceManager can be viewed at:
http://localhost:8088
__________________________________________________________________
Step 29: Create a directory on HDFS
[hdfs@localhost bin]$ ./hadoop fs -mkdir /test1
-------------------------------------------------------------------------
Step 30: Put local file "hellofile" into HDFS (/test1)
[hdfs@localhost bin]$ ./hadoop fs -put hellofile /test1
-------------------------------------------------------------------------
Step 31: Check the input file "hellofile" on HDFS
[hdfs@localhost bin]$ ./hadoop fs -ls /test1
Found 1 items
-rw-r--r-- 1 hdfs supergroup 113 2013-12-29 04:56 /test1/hellofile
[hdfs@localhost bin]$
___________________________________________________________
Step 32: Run application program "WordCount" from hadoop-mapreduce-examples-2.2.0.jar
WordCount Example:
The WordCount example reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occurred, separated by a tab. Each mapper takes a line as input and breaks it into words, then emits a key/value pair of the word and 1. Each reducer sums the counts for each word and emits a single key/value pair with the word and its sum.
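The same map/reduce logic can be sketched with standard shell tools on a small local file (the file path is illustrative): tr plays the mapper by splitting each line into one word per line, and sort | uniq -c plays the reducer by grouping identical words and summing their counts.

```shell
# Create a tiny local input file (illustrative path).
printf 'hello hadoop\nhello yarn\n' > /tmp/wc-input.txt
# "Map": emit one word per line; "Reduce": group identical words and count them.
tr -s ' \t' '\n' < /tmp/wc-input.txt | sort | uniq -c
```

The output lists each distinct word with its count (here "hello" appears twice, "hadoop" and "yarn" once each), which is exactly the word/count pairs the MapReduce job writes to its output directory.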
To run the example, the command syntax is
bin/hadoop jar hadoop-*-examples.jar wordcount <in-dir> <out-dir>
All of the files in the input directory (called in-dir in the command line above) are read, and the counts of the words in the input are written to the output directory (called out-dir above). It is assumed that both inputs and outputs are stored in HDFS. If your input is not already in HDFS but in a local file system, copy the data into HDFS as shown in steps 29-31 above.
NOTE: Similarly, you could process bigger data files (weather data, healthcare data, machine log data, etc.).
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Step 14: Start HDFS service - Namenode Daemon process
$cd ../sbin
[hdfs@localhost bin]$ cd ../sbin/
[hdfs@localhost sbin]$ ./hadoop-daemon.sh start namenode
starting namenode, logging to $PACKAGE_HOME/hadoop-2.2.0/logs/hadoop-hdfs-namenode-localhost.localdomain.out
Step 15: Check the status of namenode daemon
[hdfs@localhost ]$ jps
4537 Jps
4300 NameNode =====> started successfully
[hdfs@localhost sbin]$ ps -ef | grep java
hdfs 4300 1 11 02:38 pts/1 00:00:04 $BIN/java/default/bin/java -Dproc_namenode -Xmx100m -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=$PACKAGE_HOME/hadoop-2.2.0/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=$PACKAGE_HOME/hadoop-2.2.0 -Dhadoop.id.str=hdfs -Dhadoop.root.logger=INFO,console -Djava.library.path=$PACKAGE_HOME/hadoop-2.2.0/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=$PACKAGE_HOME/hadoop-2.2.0/logs -Dhadoop.log.file=hadoop-hdfs-namenode-localhost.localdomain.log -Dhadoop.home.dir=$PACKAGE_HOME/hadoop-2.2.0 -Dhadoop.id.str=hdfs -Dhadoop.root.logger=INFO,RFA -Djava.library.path=$PACKAGE_HOME/hadoop-2.2.0/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Dhadoop.security.logger=INFO,RFAS -Dhdfs.audit.logger=INFO,NullAppender -Dhadoop.security.logger=INFO,RFAS -Dhdfs.audit.logger=INFO,NullAppender -Dhadoop.security.logger=INFO,RFAS -Dhdfs.audit.logger=INFO,NullAppender -Dhadoop.security.logger=INFO,RFAS org.apache.hadoop.hdfs.server.namenode.NameNode
_______________________________________________________________________________
Step 16 : Start HDFS service - Secondary Namenode Daemon process
[hdfs@localhost sbin]$ ./hadoop-daemon.sh start secondarynamenode
starting secondarynamenode, logging to $PACKAGE_HOME/hadoop-2.2.0/logs/hadoop-hdfs-secondarynamenode-localhost.localdomain.out
[hdfs@localhost sbin]$
Step 17 : Check the status of Secondarynamenode daemon
[hdfs@localhost bin]$ jps
4300 NameNode
4913 SecondaryNameNode ======> started successfully
[hdfs@localhost sbin]$ ps -ef | grep java | grep 4913
hdfs 4913 1 7 02:46 pts/1 00:00:04 $BIN/java/default/bin/java -Dproc_secondarynamenode -Xmx100m -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=$PACKAGE_HOME/hadoop-2.2.0/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=$PACKAGE_HOME/hadoop-2.2.0 -Dhadoop.id.str=hdfs -Dhadoop.root.logger=INFO,console -Djava.library.path=$PACKAGE_HOME/hadoop-2.2.0/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=$PACKAGE_HOME/hadoop-2.2.0/logs -Dhadoop.log.file=hadoop-hdfs-secondarynamenode-localhost.localdomain.log -Dhadoop.home.dir=$PACKAGE_HOME/hadoop-2.2.0 -Dhadoop.id.str=hdfs -Dhadoop.root.logger=INFO,RFA -Djava.library.path=$PACKAGE_HOME/hadoop-2.2.0/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Dhadoop.security.logger=INFO,RFAS -Dhdfs.audit.logger=INFO,NullAppender -Dhadoop.security.logger=INFO,RFAS -Dhdfs.audit.logger=INFO,NullAppender -Dhadoop.security.logger=INFO,RFAS -Dhdfs.audit.logger=INFO,NullAppender -Dhadoop.security.logger=INFO,RFAS org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode
_____________________________________________________________________________________________________________
Step 18: Start HDFS service - DataNode Daemon process
[hdfs@localhost sbin]$ ./hadoop-daemon.sh start datanode
starting datanode, logging to $PACKAGE_HOME/hadoop-2.2.0/logs/hadoop-hdfs-datanode-localhost.localdomain.out
[hdfs@localhost sbin]$
Step 19: Check the status of Datanode daemon
[hdfs@localhost bin]$ jps
4300 NameNode
4913 SecondaryNameNode
4949 Jps
4373 DataNode ======> started successfully
[hdfs@localhost sbin]$ ps -ef | grep java | grep 4373
hdfs 4373 1 34 02:39 pts/1 00:00:06 $BIN/java/default/bin/java -Dproc_datanode -Xmx100m -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=$PACKAGE_HOME/hadoop-2.2.0/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=$PACKAGE_HOME/hadoop-2.2.0 -Dhadoop.id.str=hdfs -Dhadoop.root.logger=INFO,console -Djava.library.path=$PACKAGE_HOME/hadoop-2.2.0/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=$PACKAGE_HOME/hadoop-2.2.0/logs -Dhadoop.log.file=hadoop-hdfs-datanode-localhost.localdomain.log -Dhadoop.home.dir=$PACKAGE_HOME/hadoop-2.2.0 -Dhadoop.id.str=hdfs -Dhadoop.root.logger=INFO,RFA -Djava.library.path=$PACKAGE_HOME/hadoop-2.2.0/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -server -Dhadoop.security.logger=ERROR,RFAS -Dhadoop.security.logger=ERROR,RFAS -Dhadoop.security.logger=ERROR,RFAS -Dhadoop.security.logger=INFO,RFAS org.apache.hadoop.hdfs.server.datanode.DataNode
___________________________________________________________________
Step 20: Start YARN service - ResourceManager Daemon process
[hdfs@localhost sbin]$ ./yarn-daemon.sh start resourcemanager
starting resourcemanager, logging to $PACKAGE_HOME/hadoop-2.2.0/logs/yarn-hdfs-resourcemanager-localhost.localdomain.out
Step 21 : Check the status of ResourceManager daemon
[hdfs@localhost bin]$ jps
4300 NameNode
4913 SecondaryNameNode
4949 Jps
4373 DataNode
4500 ResourceManager ======> started successfully
[hdfs@localhost sbin]$ ps -ef | grep java | grep 4500
hdfs 4500 1 3 02:41 pts/1 00:00:08 $BIN/java/default/bin/java -Dproc_resourcemanager -Xmx200m -Dhadoop.log.dir=$PACKAGE_HOME/hadoop-2.2.0/logs -Dyarn.log.dir=$PACKAGE_HOME/hadoop-2.2.0/logs -Dhadoop.log.file=yarn-hdfs-resourcemanager-localhost.localdomain.log -Dyarn.log.file=yarn-hdfs-resourcemanager-localhost.localdomain.log -Dyarn.home.dir= -Dyarn.id.str=hdfs -Dhadoop.root.logger=INFO,RFA -Dyarn.root.logger=INFO,RFA -Djava.library.path=$PACKAGE_HOME/hadoop-2.2.0/lib/native -Dyarn.policy.file=hadoop-policy.xml -Dhadoop.log.dir=$PACKAGE_HOME/hadoop-2.2.0/logs -Dyarn.log.dir=$PACKAGE_HOME/hadoop-2.2.0/logs -Dhadoop.log.file=yarn-hdfs-resourcemanager-localhost.localdomain.log -Dyarn.log.file=yarn-hdfs-resourcemanager-localhost.localdomain.log -Dyarn.home.dir=$PACKAGE_HOME/hadoop-2.2.0 -Dhadoop.home.dir=$PACKAGE_HOME/hadoop-2.2.0 -Dhadoop.root.logger=INFO,RFA -Dyarn.root.logger=INFO,RFA -Djava.library.path=$PACKAGE_HOME/hadoop-2.2.0/lib/native -classpath $PACKAGE_HOME/hadoop-2.2.0/etc/hadoop/:$PACKAGE_HOME/hadoop-2.2.0/etc/hadoop/:$PACKAGE_HOME/hadoop-2.2.0/etc/hadoop/:$PACKAGE_HOME/hadoop-2.2.0/share/hadoop/common/lib/*:$PACKAGE_HOME/hadoop-2.2.0/share/hadoop/common/*:$PACKAGE_HOME/hadoop-2.2.0/share/hadoop/hdfs:$PACKAGE_HOME/hadoop-2.2.0/share/hadoop/hdfs/lib/*:$PACKAGE_HOME/hadoop-2.2.0/share/hadoop/hdfs/*:$PACKAGE_HOME/hadoop-2.2.0/share/hadoop/yarn/lib/*:$PACKAGE_HOME/hadoop-2.2.0/share/hadoop/yarn/*:$PACKAGE_HOME/hadoop-2.2.0/share/hadoop/mapreduce/lib/*:$PACKAGE_HOME/hadoop-2.2.0/share/hadoop/mapreduce/*:/contrib/capacity-scheduler/*.jar:/contrib/capacity-scheduler/*.jar:$PACKAGE_HOME/hadoop-2.2.0/share/hadoop/yarn/*:$PACKAGE_HOME/hadoop-2.2.0/share/hadoop/yarn/lib/*:$PACKAGE_HOME/hadoop-2.2.0/etc/hadoop//rm-config/log4j.properties org.apache.hadoop.yarn.server.resourcemanager.ResourceManager
_____________________
Step 22: Start YARN service - NodeManager Daemon process
[hdfs@localhost sbin]$ ./yarn-daemon.sh start nodemanager
starting nodemanager, logging to $PACKAGE_HOME/hadoop-2.2.0/logs/yarn-hdfs-nodemanager-localhost.localdomain.out
[hdfs@localhost sbin]$
Step 23 : Check the status of Nodemanager daemon
[hdfs@localhost bin]$ jps
4300 NameNode
4744 NodeManager ======> started successfully
4913 SecondaryNameNode
4949 Jps
4373 DataNode
4500 ResourceManager
[hdfs@localhost sbin]$ ps -ef | grep java | grep 4744
hdfs 4744 1 2 02:42 pts/1 00:00:03 $BIN/java/default/bin/java -Dproc_nodemanager -Xmx200m -Dhadoop.log.dir=$PACKAGE_HOME/hadoop-2.2.0/logs -Dyarn.log.dir=$PACKAGE_HOME/hadoop-2.2.0/logs -Dhadoop.log.file=yarn-hdfs-nodemanager-localhost.localdomain.log -Dyarn.log.file=yarn-hdfs-nodemanager-localhost.localdomain.log -Dyarn.home.dir= -Dyarn.id.str=hdfs -Dhadoop.root.logger=INFO,RFA -Dyarn.root.logger=INFO,RFA -Djava.library.path=$PACKAGE_HOME/hadoop-2.2.0/lib/native -Dyarn.policy.file=hadoop-policy.xml -server -Dhadoop.log.dir=$PACKAGE_HOME/hadoop-2.2.0/logs -Dyarn.log.dir=$PACKAGE_HOME/hadoop-2.2.0/logs -Dhadoop.log.file=yarn-hdfs-nodemanager-localhost.localdomain.log -Dyarn.log.file=yarn-hdfs-nodemanager-localhost.localdomain.log -Dyarn.home.dir=$PACKAGE_HOME/hadoop-2.2.0 -Dhadoop.home.dir=$PACKAGE_HOME/hadoop-2.2.0 -Dhadoop.root.logger=INFO,RFA -Dyarn.root.logger=INFO,RFA -Djava.library.path=$PACKAGE_HOME/hadoop-2.2.0/lib/native -classpath $PACKAGE_HOME/hadoop-2.2.0/etc/hadoop/:$PACKAGE_HOME/hadoop-2.2.0/etc/hadoop/:$PACKAGE_HOME/hadoop-2.2.0/etc/hadoop/:$PACKAGE_HOME/hadoop-2.2.0/share/hadoop/common/lib/*:$PACKAGE_HOME/hadoop-2.2.0/share/hadoop/common/*:$PACKAGE_HOME/hadoop-2.2.0/share/hadoop/hdfs:$PACKAGE_HOME/hadoop-2.2.0/share/hadoop/hdfs/lib/*:$PACKAGE_HOME/hadoop-2.2.0/share/hadoop/hdfs/*:$PACKAGE_HOME/hadoop-2.2.0/share/hadoop/yarn/lib/*:$PACKAGE_HOME/hadoop-2.2.0/share/hadoop/yarn/*:$PACKAGE_HOME/hadoop-2.2.0/share/hadoop/mapreduce/lib/*:$PACKAGE_HOME/hadoop-2.2.0/share/hadoop/mapreduce/*:/contrib/capacity-scheduler/*.jar:/contrib/capacity-scheduler/*.jar:$PACKAGE_HOME/hadoop-2.2.0/share/hadoop/yarn/*:$PACKAGE_HOME/hadoop-2.2.0/share/hadoop/yarn/lib/*:$PACKAGE_HOME/hadoop-2.2.0/etc/hadoop//nm-config/log4j.properties org.apache.hadoop.yarn.server.nodemanager.NodeManager ________________________________________________________Step 24: This command gives you information on hdfs system
[hdfs@localhost bin]$ ./hadoop dfsadmin -report
Configured Capacity: 16665448448 (15.52 GB)
Present Capacity: 12396371968 (11.55 GB)
DFS Remaining: 12396347392 (11.54 GB)
DFS Used: 24576 (24 KB)
DFS Used%: 0.00%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Datanodes available: 1 (1 total, 0 dead)
Live datanodes:
Name: 127.0.0.x:50010 (localhost)
Hostname: localhost
Decommission Status : Normal
Configured Capacity: 16665448448 (15.52 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 4269076480 (3.98 GB)
DFS Remaining: 12396347392 (11.54 GB)
DFS Used%: 0.00%
DFS Remaining%: 74.38%
Last contact: Sun Dec 29 03:11:02 PST 2013
[hdfs@localhost bin]$
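The dfsadmin report is easy to post-process in a monitoring script. Below is a minimal sketch that extracts the "DFS Remaining%" figure and warns when free HDFS space drops below a threshold. The report text is the sample from Step 24; on a real node you would pipe in the live output of `hdfs dfsadmin -report` instead, and the 10% threshold is purely illustrative.

```shell
# Sample lines from the Step 24 report; replace with live dfsadmin output.
report='Configured Capacity: 16665448448 (15.52 GB)
DFS Remaining: 12396347392 (11.54 GB)
DFS Remaining%: 74.38%'

# Pull out the percentage value (strip the trailing "%").
remaining=$(printf '%s\n' "$report" \
  | awk -F': ' '/^DFS Remaining%/ { sub(/%/, "", $2); print $2 }')

# Warn when less than 10% of HDFS capacity is left (illustrative threshold).
if awk -v r="$remaining" 'BEGIN { exit (r + 0 < 10) ? 0 : 1 }'; then
  echo "WARNING: HDFS nearly full (${remaining}% left)"
else
  echo "HDFS OK (${remaining}% remaining)"
fi
```

A check like this can run from cron so a filling cluster is noticed before jobs start failing.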
________________________________________________________
Step 25: Stop all the services by running "stop-all.sh"
[hdfs@localhost sbin]$ ./stop-all.sh
This script is Deprecated. Instead use stop-dfs.sh and stop-yarn.sh
Stopping namenodes on [localhost]
localhost: stopping namenode
localhost: stopping datanode
Stopping secondary namenodes [0.0.0.0]
0.0.0.0: stopping secondarynamenode
stopping yarn daemons
stopping resourcemanager
localhost: stopping nodemanager
no proxyserver to stop
[hdfs@localhost sbin]$
________________________________________________________
Step 26: Start all the services by running "start-all.sh"
The YARN architecture block diagram (see the link below) shows which daemons run in which components.
[hdfs@localhost sbin]$ ./start-all.sh
Check the status of all services:
[hdfs@localhost sbin]$ jps
6161 NameNode
6260 DataNode
6719 NodeManager
6750 Jps
6355 SecondaryNameNode
6429 ResourceManager
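On a healthy single-node Hadoop 2 cluster, jps should list the five daemons shown above. The sketch below checks a jps listing for all of them and reports any that failed to start; here it reads the sample output from Step 26 via a shell variable, but on a real node you would substitute the output of `jps` itself.

```shell
# Sample jps output from Step 26; on a real node use: jps_out=$(jps)
jps_out='6161 NameNode
6260 DataNode
6719 NodeManager
6750 Jps
6355 SecondaryNameNode
6429 ResourceManager'

# Collect any expected daemon that does not appear in the listing.
missing=''
for d in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
  printf '%s\n' "$jps_out" | grep -qw "$d" || missing="$missing $d"
done

if [ -z "$missing" ]; then
  echo "all daemons up"
else
  echo "missing:$missing"
fi
```

If a daemon is missing, its log file under $PACKAGE_HOME/hadoop-2.2.0/logs is the first place to look.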
[root@localhost bin]#
Job definition and control flow between Hadoop/YARN components:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=31822268
________________________________________________________
Step 27: Run sample application program "pi" from hadoop-mapreduce-examples-2.2.0.jar
As a first test, run an existing Hadoop example program: launch the job, monitor its progress, and get/put files on HDFS. The "pi" program estimates the value of pi in parallel, here with 2 maps of 10 samples each. The general form is:
$ hadoop jar <path-to-examples-jar> pi <num-maps> <samples-per-map>
[hdfs@localhost bin]$ ./hadoop jar ../share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 2 10
Number of Maps = 2
Samples per Map = 10
Wrote input for Map #0
Wrote input for Map #1
Starting Job
13/12/29 04:33:12 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
13/12/29 04:33:13 INFO input.FileInputFormat: Total input paths to process : 2
13/12/29 04:33:13 INFO mapreduce.JobSubmitter: number of splits:2
13/12/29 04:33:14 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1388320369543_0001
13/12/29 04:33:15 INFO impl.YarnClientImpl: Submitted application application_1388320369543_0001 to ResourceManager at /0.0.0.0:8032
13/12/29 04:33:15 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1388320369543_0001/
13/12/29 04:33:15 INFO mapreduce.Job: Running job: job_1388320369543_0001
13/12/29 04:33:38 INFO mapreduce.Job: Job job_1388320369543_0001 running in uber mode : false
13/12/29 04:33:38 INFO mapreduce.Job: map 0% reduce 0%
13/12/29 04:35:22 INFO mapreduce.Job: map 83% reduce 0%
13/12/29 04:35:23 INFO mapreduce.Job: map 100% reduce 0%
13/12/29 04:36:10 INFO mapreduce.Job: map 100% reduce 100%
13/12/29 04:36:16 INFO mapreduce.Job: Job job_1388320369543_0001 completed successfully
13/12/29 04:36:16 INFO mapreduce.Job: Counters: 43
File System Counters
FILE: Number of bytes read=50
FILE: Number of bytes written=238681
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=528
HDFS: Number of bytes written=215
HDFS: Number of read operations=11
HDFS: Number of large read operations=0
HDFS: Number of write operations=3
Job Counters
Launched map tasks=2
Launched reduce tasks=1
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=208977
Total time spent by all reduces in occupied slots (ms)=39840
Map-Reduce Framework
Map input records=2
Map output records=4
Map output bytes=36
Map output materialized bytes=56
Input split bytes=292
Combine input records=0
Combine output records=0
Reduce input groups=2
Reduce shuffle bytes=56
Reduce input records=4
Reduce output records=0
Spilled Records=8
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=1712
CPU time spent (ms)=3320
Physical memory (bytes) snapshot=454049792
Virtual memory (bytes) snapshot=3515953152
Total committed heap usage (bytes)=268247040
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=236
File Output Format Counters
Bytes Written=97
Job Finished in 184.356 seconds
Estimated value of Pi is 3.80000000000000000000
[hdfs@localhost bin]$
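The coarse result above (3.80) is expected: with only 2 maps of 10 samples each, the estimator has just 20 points to work with. The pi example is a quasi-Monte Carlo method: it scatters points over the unit square and counts how many land inside the quarter circle, whose area is pi/4. Hadoop's implementation (the QuasiMonteCarlo class) generates points from a Halton low-discrepancy sequence; the awk sketch below applies the same idea on a single machine to show that the estimate converges as the sample count grows. It is an illustration of the technique, not Hadoop's exact code.

```shell
# Estimate pi with a Halton-sequence quasi-Monte Carlo, single machine.
est=$(awk '
function halton(i, b,    f, r) {        # radical inverse of i in base b
  f = 1; r = 0
  while (i > 0) { f /= b; r += f * (i % b); i = int(i / b) }
  return r
}
BEGIN {
  n = 100000; inside = 0
  for (i = 1; i <= n; i++) {
    x = halton(i, 2); y = halton(i, 3)  # low-discrepancy point in [0,1)^2
    if (x*x + y*y <= 1) inside++        # falls inside the quarter circle
  }
  printf "%.4f", 4 * inside / n         # quarter-circle area ratio times 4
}')
echo "pi ~ $est with 100000 samples"
```

Rerunning the Hadoop job with, say, `pi 16 100000` gives a correspondingly better estimate, at the cost of more task time.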
________________________________________________________________
Step 28: Verify the Running Services Using the Web Interface:
The web interface for the ResourceManager can be viewed at:
http://localhost:8088
[Screenshot: the running application on the single-node cluster]
[Screenshot: Application Overview, with Final Status FINISHED]
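Besides the ResourceManager UI on port 8088, the other daemons expose their own web interfaces. The ports below are the Hadoop 2.2 defaults (they can be changed in hdfs-site.xml and yarn-site.xml):

```
NameNode           http://localhost:50070
DataNode           http://localhost:50075
SecondaryNameNode  http://localhost:50090
ResourceManager    http://localhost:8088
NodeManager        http://localhost:8042
```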
Step 29: Create a directory on HDFS (use the absolute path /test1 so that it matches Steps 30 and 31):
[hdfs@localhost bin]$ ./hadoop fs -mkdir /test1
-------------------------------------------------------------------------
Step 30: Put local file "hellofile" into HDFS (/test1)
[hdfs@localhost bin]$ ./hadoop fs -put hellofile /test1
-------------------------------------------------------------------------
Step 31: Check the input file "hellofile" on HDFS
[hdfs@localhost bin]$ ./hadoop fs -ls /test1
Found 1 items
-rw-r--r-- 1 hdfs supergroup 113 2013-12-29 04:56 /test1/hellofile
[hdfs@localhost bin]$
___________________________________________________________
Step 32: Run application program "WordCount" from hadoop-mapreduce-examples-2.2.0.jar
WordCount Example:
The WordCount example reads text files and counts how often each word occurs. Both input and output are text files; each output line contains a word and its count, separated by a tab. Each mapper takes a line as input and breaks it into words, then emits a key/value pair of the word and 1. Each reducer sums the counts for each word and emits a single key/value pair of the word and its total.
To run the example, the command syntax is
bin/hadoop jar hadoop-*-examples.jar wordcount <in-dir> <out-dir>
All of the files in the input directory (called in-dir in the command line above) are read, and the word counts are written to the output directory (called out-dir above). Both inputs and outputs are assumed to be stored in HDFS. If your input is not already in HDFS but in a local file system, copy the data into HDFS as shown in Steps 29-31 above.
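The map/shuffle/reduce flow just described can be mimicked with an ordinary Unix pipeline, which is a handy way to sanity-check the expected WordCount output on a small file before submitting the job. This is a single-machine analogy, not how Hadoop actually executes the job:

```shell
# A Unix-pipeline analogue of WordCount:
#   tr      - "map":     split each line into words, one word per line
#   sort    - "shuffle": bring identical keys (words) together
#   uniq -c - "reduce":  sum the occurrences of each word
out=$(printf 'hello world\nhello hadoop\n' \
  | tr -s '[:space:]' '\n' \
  | sort \
  | uniq -c \
  | awk '{ printf "%s\t%d\n", $2, $1 }')   # word<TAB>count, as WordCount emits
printf '%s\n' "$out"
```

Running the same two lines through the Hadoop job should produce the same word/count pairs in its part-r-00000 output file.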
NOTE: Similarly, you can process much larger data files (weather data, healthcare data, machine log data, etc.) in the same way.