The purpose of this post is to explain how to install Hadoop on your computer. It assumes you have a Linux-based system available; I am doing this on an Ubuntu system.
If you want to install the latest Hadoop 2.0 instead, see the Hadoop 2.0 Install Tutorial.
Before you begin, create a separate user named hadoop on the system and perform all of the following operations as that user.
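If you need to create that user, the standard Ubuntu commands are shown below (a convenience sketch; any equivalent user-creation method works):
#sudo adduser hadoop
#su - hadoop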
This document covers the steps to:
1) Configure SSH
2) Install JDK
3) Install Hadoop
Update your repository
#sudo apt-get update
You can copy the commands directly from this post and run them on your system.
Hadoop requires that the various machines in a cluster can talk to each other freely. Hadoop uses SSH to prove identity when connecting.
Let's download and configure SSH
#sudo apt-get install openssh-server openssh-client
#ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
#cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
#sudo chmod go-w $HOME $HOME/.ssh
#sudo chmod 600 $HOME/.ssh/authorized_keys
#sudo chown `whoami` $HOME/.ssh/authorized_keys
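SSH refuses key-based logins when the home directory or key files are writable by group or others; the chmod and chown commands above tighten those permissions so that passwordless login works.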
Testing your SSH
#ssh localhost
Say yes when prompted to accept the host key
It should open an SSH connection to localhost without asking for a password
#exit
This closes the SSH session
Java 1.6 is required to run Hadoop
Let's download and install the JDK
#sudo mkdir /usr/java
#cd /usr/java
#sudo wget http://download.oracle.com/otn-pub/java/jdk/6u31-b04/jdk-6u31-linux-i586.bin
Wait until the JDK download completes
Install Java
#sudo chmod o+w jdk-6u31-linux-i586.bin
#sudo chmod +x jdk-6u31-linux-i586.bin
#sudo ./jdk-6u31-linux-i586.bin
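To verify the installation, invoke the new JVM directly (the directory name below assumes the installer unpacked into jdk1.6.0_31, the same path used later in this post):
#/usr/java/jdk1.6.0_31/bin/java -version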
Now comes the Hadoop :)
Let's download and configure Hadoop in pseudo-distributed mode. You can read more about the various modes on the Hadoop website.
Download the latest Hadoop 1.0.x tar.gz from the releases page on the Hadoop website:
http://hadoop.apache.org/common/releases.html
Extract it into some folder (say /home/hadoop/software/); all software in this guide is kept at that location.
For the other modes (standalone and fully distributed), please see the Hadoop documentation.
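A minimal sketch of the download-and-extract steps (the mirror URL and version below are illustrative; use whichever 1.0.x release link the page above gives you):
#cd /home/hadoop/software
#wget http://archive.apache.org/dist/hadoop/core/hadoop-1.0.1/hadoop-1.0.1.tar.gz
#tar -xzf hadoop-1.0.1.tar.gz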
Go to the conf directory inside the Hadoop folder, open core-site.xml, and add the following property inside the (initially empty) configuration tags:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost</value>
  </property>
</configuration>
Do the same for the following two files. (Since fs.default.name gives no port, HDFS will listen on its default port, 8020.)
conf/hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
conf/mapred-site.xml:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>
Environment variables
In the conf/hadoop-env.sh file, change JAVA_HOME to the location where you installed Java, e.g.
export JAVA_HOME=/usr/java/jdk1.6.0_31
(Note that shell assignments must not have spaces around the =.)
Configure the environment variables for the JDK and Hadoop as follows.
Open the ~/.profile file in the current user's home directory
Add the following
You can change these paths if you have installed Hadoop or Java at other locations
export JAVA_HOME="/usr/java/jdk1.6.0_31"
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_INSTALL="/home/hadoop/software/hadoop-1.0.1"
export PATH=$PATH:$HADOOP_INSTALL/bin
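Reload the profile so the new variables take effect in the current shell, then confirm that the hadoop command is on your PATH:
#source ~/.profile
#hadoop version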
Testing your installation
Format the HDFS namenode
# hadoop namenode -format
Then start the HDFS and MapReduce daemons:
hadoop@jj-VirtualBox:~$ start-dfs.sh
starting namenode, logging to /home/hadoop/software/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-namenode-jj-VirtualBox.out
localhost: starting datanode, logging to /home/hadoop/software/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-datanode-jj-VirtualBox.out
localhost: starting secondarynamenode, logging to /home/hadoop/software/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-secondarynamenode-jj-VirtualBox.out
hadoop@jj-VirtualBox:~$ start-mapred.sh
starting jobtracker, logging to /home/hadoop/software/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-jobtracker-jj-VirtualBox.out
localhost: starting tasktracker, logging to /home/hadoop/software/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-tasktracker-jj-VirtualBox.out
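You can also confirm that all five daemons came up with jps (the process IDs below are illustrative; yours will differ):
hadoop@jj-VirtualBox:~$ jps
2287 NameNode
2544 DataNode
2801 SecondaryNameNode
3074 JobTracker
3331 TaskTracker
3407 Jps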
Open the browser and point to these pages:
localhost:50030 (JobTracker status)
localhost:50070 (NameNode status)
These open the status pages for your Hadoop installation.
That's it, this completes the installation of Hadoop; now you are ready to play with it.
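As a quick smoke test (the examples jar name below assumes the hadoop-1.0.1 tarball layout), try listing HDFS and running the bundled pi estimator:
#hadoop fs -ls /
#hadoop jar $HADOOP_INSTALL/hadoop-examples-1.0.1.jar pi 2 10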
This guide made me realize I left out part of a config file, thanks.
Is there a guide to just using HDFS as a distributed file system, like a replacement for NFS, AFS, or Gluster?
Hello Genewitch,
Thank you for your comment.
There is a very interesting discussion of standalone HDFS on the mailing list; just go through this thread:
http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201102.mbox/%3CAANLkTi=+Wic=e4uj3vpHrihctr7Uu84uh8YbS1fuXccw@mail.gmail.com%3E
Hi,
I have a two-node cluster; rsi1 and rsi2 are the hostnames of the two machines.
What should the values of fs.default.name and mapred.job.tracker be on both federated namenodes? I want to make both nodes federated nodes.
Appreciate your reply.
Rashmi
How can we install Hive? Can you please guide?
Download the Hive tarball from the Apache website
Extract it to some place
In the conf file (hive-default.xml), which you can create (or find) in the conf folder, specify the following parameter:
<property>
  <name>mapred.job.tracker</name>
  <value>JobTrackerIP:8021</value>
</property>
Also read this post on my blog on configuring a MySQL metastore for Hive instead of the default Derby DB:
http://jugnu-life.blogspot.com.au/2012/05/hive-mysql-setup-configuration.html
Thanks
So here we will add mapred.job.tracker and JobTrackerIP:8021 as a property inside hive-default.xml, so my new property will look like mapred.job.tracker with value JobTrackerIP:8021... then I will restart Hive and Hadoop... Please reply sir, it's urgent for me now
hive> show tables;
FAILED: Error in metadata: MetaException(message:Got exception: java.net.ConnectException Call to localhost/127.0.0.1:8020 failed on connection exception: java.net.ConnectException: Connection refused)
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
Why does this error occur? Sir, can you help me with this?
How do I know which is my datanode and which is my namenode?
Run jps on the nodes; you can see where the namenode service is running.