Install RHadoop on Hadoop Cluster

The instructions below install the RHadoop packages rmr2 and rhdfs on a Hadoop cluster. I have only a single-node cluster, but that doesn't matter: the same instructions apply if you have more machines.

Install R on the machine by following the instructions at

http://jugnu-life.blogspot.com.au/2013/02/install-r-on-ubuntu_24.html

Let's start installing RHadoop

RHadoop packages are available at

https://github.com/RevolutionAnalytics

I have cloned the git repo for the packages, which makes upgrades easy. So you have two choices here.

Easy 1) Download the tar.gz files for rmr2 and rhdfs

Easy 2 :) Clone the git repo for each of them

Let's go with the first one

Download rmr2, rhdfs and quickcheck from the following locations

https://github.com/RevolutionAnalytics/rhdfs/blob/master/build/rhdfs_1.0.5.tar.gz

https://github.com/RevolutionAnalytics/rmr2/blob/master/build/rmr2_2.1.0.tar.gz

https://github.com/RevolutionAnalytics/quickcheck/blob/master/build/quickcheck_1.0.tar.gz

The links might have changed by the time you read this, so pardon me and get the latest links
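One thing to watch: the links above are GitHub file viewer pages ("blob" URLs), so fetching them directly with wget would save an HTML page rather than the archive. Here is a minimal sketch of turning them into download links, assuming the files are still where the links say they are. blob_to_raw is a helper name made up for this example; it only rewrites the URL and downloads nothing.

```shell
#!/bin/sh
# Turn a GitHub "blob" page URL into a "raw" URL that serves the file itself.
# blob_to_raw is a hypothetical helper, not part of any tool.
blob_to_raw() {
    printf '%s\n' "$1" | sed 's#/blob/#/raw/#'
}

# Print the raw URL for each package listed above (nothing is downloaded here).
for url in \
    https://github.com/RevolutionAnalytics/rhdfs/blob/master/build/rhdfs_1.0.5.tar.gz \
    https://github.com/RevolutionAnalytics/rmr2/blob/master/build/rmr2_2.1.0.tar.gz \
    https://github.com/RevolutionAnalytics/quickcheck/blob/master/build/quickcheck_1.0.tar.gz
do
    blob_to_raw "$url"
done
```

To actually download a file, pass the result to wget, for example wget "$(blob_to_raw <blob-url>)".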

 

I assume you have already installed R on your machine. It needs to be installed on all nodes in your cluster, and these RHadoop packages also need to be installed on all of the nodes.

 

Export variables needed

Change the locations below to match your Hadoop install

 

sudo gedit /etc/environment

Add the following

# Variable added for RHadoop Install
HADOOP_CMD=/home/jj/software/hadoop-1.0.4/bin/hadoop
HADOOP_CONF=/home/jj/software/hadoop-1.0.4/conf
HADOOP_STREAMING=/home/jj/software/hadoop-1.0.4/contrib/streaming/hadoop-streaming-1.0.4.jar
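On Ubuntu, /etc/environment is read at login, so log out and back in before expecting the variables to show up. After that, a quick sanity check can save some head scratching later. This is only a sketch: check_path is a helper name I made up, and all it does is test that each variable is set and points at something that exists.

```shell
#!/bin/sh
# Report whether a variable is set and points at an existing path.
# check_path is a hypothetical helper, not part of Hadoop or RHadoop.
check_path() {
    name=$1
    value=$2
    if [ -z "$value" ]; then
        echo "$name is not set"
        return 1
    elif [ ! -e "$value" ]; then
        echo "$name points to a missing path: $value"
        return 1
    fi
    echo "$name ok: $value"
}

# Check the three variables added to /etc/environment above.
problems=0
check_path HADOOP_CMD "${HADOOP_CMD:-}" || problems=1
check_path HADOOP_CONF "${HADOOP_CONF:-}" || problems=1
check_path HADOOP_STREAMING "${HADOOP_STREAMING:-}" || problems=1
[ "$problems" -eq 0 ] || echo "Fix the entries above before installing rhdfs and rmr2"
```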

Tell R about Java

R is at times unable to figure out a few Java settings, so let's help R by telling it

 

# sudo R CMD javareconf JAVA=/home/jj/software/java/jdk1.6.0_43/bin/java JAVA_HOME=/home/jj/software/java/jdk1.6.0_43 JAVAC=/home/jj/software/java/jdk1.6.0_43/bin/javac JAR=/home/jj/software/java/jdk1.6.0_43/bin/jar JAVAH=/home/jj/software/java/jdk1.6.0_43/bin/javah

 

Please change the following path to match your Java location

/home/jj/software/java/jdk1.6.0_43/bin

 

$ sudo R CMD javareconf JAVA=/home/jj/software/java/jdk1.6.0_43/bin/java JAVA_HOME=/home/jj/software/java/jdk1.6.0_43 JAVAC=/home/jj/software/java/jdk1.6.0_43/bin/javac JAR=/home/jj/software/java/jdk1.6.0_43/bin/jar JAVAH=/home/jj/software/java/jdk1.6.0_43/bin/javah

Updating Java configuration in /usr/lib/R
Done.
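Since every argument in that command hangs off the same JDK directory, it can be less error-prone to build the command from a single path. A small sketch, where javareconf_cmd is a name I made up and the function only prints the command so you can inspect it before running it:

```shell
#!/bin/sh
# Print the javareconf command for a given JDK directory.
# javareconf_cmd is a hypothetical helper; it does not run anything itself.
javareconf_cmd() {
    jdk=$1
    printf '%s\n' "sudo R CMD javareconf JAVA=$jdk/bin/java JAVA_HOME=$jdk JAVAC=$jdk/bin/javac JAR=$jdk/bin/jar JAVAH=$jdk/bin/javah"
}

# Same JDK location as used in this post; change it to yours.
javareconf_cmd /home/jj/software/java/jdk1.6.0_43
```

Copy the printed line into the terminal once the paths look right.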

 

Check that the cluster is up and happy

hadoop fs -ls /

jj@jj-VirtualBox:~/software/R/RHadoop$ hadoop fs -ls /
Warning: $HADOOP_HOME is deprecated.

Found 3 items
drwxr-xr-x   - jj supergroup          0 2013-03-29 09:59 /hbase
drwxr-xr-x   - jj supergroup          0 2013-03-09 21:57 /home
drwxr-xr-x   - jj supergroup          0 2013-03-09 17:35 /user
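If you want to script this check instead of reading the listing by eye, the exit status of the command is enough. A rough sketch, assuming hadoop fs -ls exits with a non-zero status when HDFS cannot be reached; report_status is a helper name made up for this example:

```shell
#!/bin/sh
# Run a command quietly and report whether it succeeded.
# report_status is a hypothetical helper name.
report_status() {
    label=$1
    shift
    if "$@" >/dev/null 2>&1; then
        echo "$label: ok"
    else
        echo "$label: FAILED"
        return 1
    fi
}

# Prints "HDFS listing: ok" when the listing works, otherwise a hint.
report_status "HDFS listing" hadoop fs -ls / || echo "Start the cluster before going further"
```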

 

Install rJava

 

Start R with

# sudo R --save

 

We are just telling R to start as root and to save the workspace when we quit

 

>install.packages('rJava')

It will ask you to choose a CRAN mirror; select one near you and let the install run

After it's done, verify that it's there :)

> library()

 

It will show something like

Packages in library ‘/usr/local/lib/R/site-library’:

rJava                   Low-level R to Java interface

Packages in library ‘/usr/lib/R/library’:

Quit R

> q()

All set

 

Install rhdfs now

Go to the location where you downloaded the tar.gz files and execute the following command

 

jj@jj-VirtualBox:~/software/R/RHadoop/rhdfs/build$ ls
rhdfs_1.0.5.tar.gz

 

 

$ sudo HADOOP_CMD=/home/jj/software/hadoop-1.0.4/bin/hadoop R CMD INSTALL rhdfs_1.0.5.tar.gz

Check

> library('rhdfs')
Loading required package: rJava

HADOOP_CMD=/home/jj/software/hadoop-1.0.4/bin/hadoop

Be sure to run hdfs.init()
> hdfs.init()
> hdfs.ls('/')
  permission owner      group size          modtime   file
1 drwxr-xr-x    jj supergroup    0 2013-03-29 09:59 /hbase
2 drwxr-xr-x    jj supergroup    0 2013-03-09 21:57  /home
3 drwxr-xr-x    jj supergroup    0 2013-03-09 17:35  /user
>

We are able to see HDFS files in R

So all done for rhdfs

 

Install rmr2

 

$ sudo apt-get install -y pdfjam

 

> install.packages(c( 'RJSONIO', 'itertools', 'digest','functional', 'stringr', 'plyr'))

Download the reshape2 and Rcpp source packages from

http://cran.r-project.org/web/packages/reshape2/index.html

http://cran.r-project.org/web/packages/Rcpp/index.html

 

sudo R CMD INSTALL Rcpp_0.10.3.tar.gz

sudo R CMD INSTALL reshape2_1.2.2.tar.gz

sudo R CMD INSTALL quickcheck_1.0.tar.gz
sudo R CMD INSTALL rmr2_2.1.0.tar.gz
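If one of the tarballs is missing, the installs above fail part way through. Here is a sketch of a wrapper that first checks that every file is present and only then installs them in the order given. install_in_order is a name I made up; the install step itself is the same sudo R CMD INSTALL used above.

```shell
#!/bin/sh
# Install R source packages in the order given, but only if all files exist.
# install_in_order is a hypothetical helper name.
install_in_order() {
    missing=0
    for pkg in "$@"; do
        if [ ! -f "$pkg" ]; then
            echo "missing: $pkg"
            missing=1
        fi
    done
    [ "$missing" -eq 0 ] || return 1

    for pkg in "$@"; do
        sudo R CMD INSTALL "$pkg" || return 1
    done
}

# Example call (run it from the directory holding the tarballs):
# install_in_order Rcpp_0.10.3.tar.gz reshape2_1.2.2.tar.gz quickcheck_1.0.tar.gz rmr2_*.tar.gz
```

The rmr2_*.tar.gz pattern picks up whichever rmr2 version you downloaded.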

 

Done

More reading

http://bighadoop.wordpress.com/2013/02/25/r-and-hadoop-data-analysis-rhadoop/
