Hadoop 0.20.2 bin folder purpose

In my previous post I gave an overview of the folder structure of Hadoop 0.20.2.

In this post, the files in the bin folder and their purposes are discussed in a bit more detail.

A similar post exists for Hadoop 0.23, if you are interested in reading it.

hadoop-config.sh

This is executed with each hadoop command and mainly does two things. First, if we specify the --config parameter, it sets HADOOP_CONF_DIR to the given directory. Second, if we pass the --hosts parameter, it decides which host list file (e.g. masters or slaves) the slave scripts should use.
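A simplified sketch of that logic, paraphrased from the script, looks like this:

# if --config is given, point HADOOP_CONF_DIR at the supplied directory
if [ "--config" = "$1" ]; then
  shift
  confdir=$1
  shift
  HADOOP_CONF_DIR=$confdir
fi

# if --hosts is given, export HADOOP_SLAVES so the slave scripts read
# the supplied host list (e.g. masters) instead of the default slaves file
if [ "--hosts" = "$1" ]; then
  shift
  slavesfile=$1
  shift
  export HADOOP_SLAVES="${HADOOP_CONF_DIR}/$slavesfile"
fi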

hadoop

This is the main Hadoop command script. If we type just the following at the command prompt:

# hadoop

it prints the list of commands we can use, along with brief help for each.
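The output looks roughly like this (abridged):

Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
  namenode -format     format the DFS filesystem
  namenode             run the DFS namenode
  datanode             run a DFS datanode
  dfsadmin             run a DFS admin client
  fs                   run a generic filesystem user client
  balancer             run a cluster balancing utility
  jobtracker           run the MapReduce job Tracker node
  tasktracker          run a MapReduce task Tracker node
  jar <jar>            run a jar file
 or
  CLASSNAME            run the class named CLASSNAME
Most commands print help when invoked w/o parameters.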

Now when we type one of the commands from the help above, the script maps it to a Java class defined in this file.

e.g. hadoop namenode -format

if [ "$COMMAND" = "namenode" ] ; then
  CLASS='org.apache.hadoop.hdfs.server.namenode.NameNode'
  HADOOP_OPTS="$HADOOP_OPTS $HADOOP_NAMENODE_OPTS"

So when we call the namenode command, the script invokes the NameNode class mentioned above and passes the remaining arguments along. Similar mappings exist for the other Hadoop commands.

This file also sources

conf/hadoop-env.sh, where various Hadoop-specific environment variables and configuration values are set. The details about that can be read in the other blog post covering the conf directory.

hadoop-env.sh can also be used to pass daemon-specific environment variables.
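For example, a typical hadoop-env.sh contains entries like the following (the values shown here are illustrative assumptions, not required defaults):

# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/java-6-sun

# The maximum amount of heap to use, in MB.
export HADOOP_HEAPSIZE=2000

# Extra options passed only to the NameNode daemon
export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_NAMENODE_OPTS"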


hadoop-daemon.sh

Runs a Hadoop command as a daemon.
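Its usage is:

hadoop-daemon.sh [--config <conf-dir>] [--hosts hostlistfile] (start|stop) <hadoop-command> <args...>

For example, to start or stop a single DataNode on the current machine:

bin/hadoop-daemon.sh start datanode
bin/hadoop-daemon.sh stop datanode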

hadoop-daemons.sh

Runs a Hadoop command on all slave hosts. It invokes slaves.sh, which in turn runs hadoop-daemon.sh on each host with the particular command passed.
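The core of the script is essentially this single line, which runs hadoop-daemon.sh on every slave via slaves.sh:

exec "$bin/slaves.sh" --config $HADOOP_CONF_DIR cd "$HADOOP_HOME" \; "$bin/hadoop-daemon.sh" --config $HADOOP_CONF_DIR "$@"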

slaves.sh

Runs a shell command on all slave hosts.
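Internally it reads the host list and runs the given command on each host over ssh, roughly like this:

# strip comments and trailing spaces from the slaves file, then run
# the command on each host in parallel, prefixing output with the hostname
for slave in `cat "$HOSTLIST" | sed "s/#.*$//;s/[ ]*$//"`; do
  ssh $HADOOP_SSH_OPTS $slave $"${@// /\\ }" \
    2>&1 | sed "s/^/$slave: /" &
done
wait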

start-dfs.sh

Starts the Hadoop DFS daemons. Optionally, we can pass -upgrade or -rollback to change the DFS state. We run this on the master node.

The implementation is

# start dfs daemons
# start namenode after datanodes, to minimize time namenode is up w/o data
# note: datanodes will log connection errors until namenode starts
"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR start namenode $nameStartOpt
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR start datanode $dataStartOpt
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR --hosts masters start secondarynamenode

nameStartOpt and dataStartOpt are derived from the parameters we pass to the script.
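A sketch of how those options are derived, paraphrased from the script:

# pick up an optional -upgrade / -rollback argument
if [ $# -ge 1 ]; then
  nameStartOpt=$1
  shift
  case $nameStartOpt in
    (-upgrade)
      ;;                            # only the namenode takes -upgrade
    (-rollback)
      dataStartOpt=$nameStartOpt    # datanodes need -rollback as well
      ;;
    (*)
      echo $usage
      exit 1
      ;;
  esac
fi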

start-mapred.sh

Starts the Hadoop MapReduce daemons. Run this on the master node.

# start mapred daemons
# start jobtracker first to minimize connection errors at startup
"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR start jobtracker
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR start tasktracker

start-all.sh

Starts all Hadoop daemons. Run this on the master node.

It calls start-dfs.sh and then start-mapred.sh internally.
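The relevant lines are simply:

# start dfs daemons
"$bin"/start-dfs.sh --config $HADOOP_CONF_DIR

# start mapred daemons
"$bin"/start-mapred.sh --config $HADOOP_CONF_DIR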

stop-dfs.sh

Stops the Hadoop DFS daemons. Run this on the master node.

"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR stop namenode
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR stop datanode
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR --hosts masters stop secondarynamenode

stop-mapred.sh

Stops the Hadoop MapReduce daemons. Run this on the master node.

"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR stop jobtracker
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR stop tasktracker

stop-all.sh

Stops all Hadoop daemons. Run this on the master node.

"$bin"/stop-mapred.sh --config $HADOOP_CONF_DIR
"$bin"/stop-dfs.sh --config $HADOOP_CONF_DIR

start-balancer.sh

Starts the balancer daemon on a particular node to balance data blocks across DataNodes.

"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR start balancer $@

stop-balancer.sh

# Stop balancer daemon.
# Run this on the machine where the balancer is running

"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR stop balancer

rcc.sh

The Hadoop record compiler.

The script maps it to the following class:

CLASS='org.apache.hadoop.record.compiler.generated.Rcc'
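As a hypothetical example, a record DDL file person.jr (the file name and record definition here are made up for illustration) such as:

module org.example {
  class Person {
    ustring name;
    int age;
  }
}

can be compiled into a Java record class with:

bin/rcc --language java person.jr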


Please share your views and comments below.

Thank you.