
Using R and Hadoop for Big Data

The following packages help in working with big data from R.

1. rmr2
2. rhdfs
3. HadoopStreaming
4. Rhipe
5. h2o
6. SparkR

The links to documentation and tutorials for each of them are below.

All these packages build on Hadoop Streaming to run the work on the cluster instead of a single R node. If you are new to Hadoop, read the basics of Hadoop Streaming at https://hadoop.apache.org/docs/stable2/hadoop-streaming/HadoopStreaming.html and a short tutorial on writing streaming jobs in Python at http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
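To make the streaming model concrete, here is a minimal word-count mapper written in R. This is a sketch only (the file name and job invocation below are illustrative): it reads lines from stdin and emits one "word<TAB>1" record per word, which is the contract a streaming mapper must satisfy.

```r
#!/usr/bin/env Rscript
# Minimal Hadoop Streaming mapper in R (illustrative sketch):
# split each input line into words and emit "word<TAB>1" per word.
emit_words <- function(line) {
  words <- unlist(strsplit(line, "[[:space:]]+"))
  words <- words[nzchar(words)]          # drop empty tokens
  paste0(words, "\t1")
}

# Stream over stdin line by line, as Hadoop Streaming feeds the mapper
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1)) > 0) {
  writeLines(emit_words(line))
}
close(con)
```

A job would then be launched along the lines of `hadoop jar hadoop-streaming.jar -input in -output out -mapper mapper.R -file mapper.R` (paths hypothetical; see the streaming docs linked above).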

Package name and useful tutorials / readings

rmr2
Tutorial: https://github.com/RevolutionAnalytics/rmr2/blob/master/docs/tutorial.md
Book: R in a Nutshell, 2nd edition (Chapter 26) http://shop.oreilly.com/product/0636920022008.do

rhdfs
Wiki: https://github.com/RevolutionAnalytics/RHadoop/wiki/user%3Erhdfs%3EHome

HadoopStreaming
CRAN package documentation: https://cran.r-project.org/web/packages/HadoopStreaming/HadoopStreaming.pdf

Rhipe
Install guide: http://tessera.io/docs-RHIPE/#install-and-push

h2o
Using h2o from R: http://h2o-release.s3.amazonaws.com/h2o/rel-slater/1/docs-website/h2o-docs/index.html#%E2%80%A6%20From%20R
R h2o package documentation (~140 pages): http://h2o-release.s3.amazonaws.com/h2o/rel-slater/1/docs-website/h2o-r/h2o_package.pdf

SparkR
API and documentation: https://spark.apache.org/docs/latest/api/R/index.html
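As a taste of the rmr2 programming model, word count looks roughly like this. The first two functions are plain R (so the logic can be checked without a cluster); the commented part sketches how the same logic would be handed to rmr2's mapreduce() using the local backend. The input vector is illustrative.

```r
# Word count expressed as map and reduce steps, in plain R:
wc_map    <- function(k, words) list(key = words, val = rep(1, length(words)))
wc_reduce <- function(word, cnt) list(key = word,  val = sum(cnt))

# With rmr2 installed, roughly the same logic runs through mapreduce():
# library(rmr2)
# rmr.options(backend = "local")   # test locally before using the cluster
# out <- mapreduce(input  = to.dfs(c("a", "b", "a")),
#                  map    = function(k, v) keyval(v, 1),
#                  reduce = function(k, vv) keyval(k, sum(vv)))
# from.dfs(out)
```

The local backend is handy for debugging map/reduce logic before paying the cost of a real Hadoop job.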

Install an R package behind a proxy


# Set the proxy for the current R session; replace username, PASSWORD and proxyurl with your own values
Sys.setenv(http_proxy="http://username:PASSWORD@proxyurl:8080")

install.packages("RPostgreSQL", dependencies=TRUE)

Update R packages

update.packages(ask = FALSE, lib = '/usr/lib64/R/library', repos = "http://local/cran-mirror", dependencies = TRUE)

Install R package only if not installed

options(echo = TRUE) # if you want to see commands in the output file
args <- commandArgs(trailingOnly = TRUE)
print(args)
# trailingOnly = TRUE means that only your arguments are returned; check:
# print(commandArgs(trailingOnly = FALSE))

# Install only if not already installed
# http://stackoverflow.com/questions/9341635/check-for-installed-packages-before-running-install-packages
pkgInstall <- function(x) {
  if (!require(x, character.only = TRUE)) {
    install.packages(pkgs = x, lib = '/usr/lib64/R/library',
                     repos = "http://local/cran-mirror",
                     dependencies = TRUE, verbose = FALSE, quiet = FALSE)
    if (!require(x, character.only = TRUE)) stop("Package not found")
  }
}

pkgInstall(args[1])


Save the above script as installpackage.R

Use it as

sudo Rscript installpackage.R pkgname

Using Hive from R


We need to install the following packages:

R CMD INSTALL rJava
R CMD INSTALL RHive

rm(list = ls())
options(java.parameters = "-Xmx8g")

library(RJDBC)
if (Sys.getenv("JAVA_HOME") != "") Sys.setenv(JAVA_HOME = "")
.jinit()
# Put every Hadoop jar on the JVM classpath
for (l in list.files('/usr/phd/3.0.0.0-249/hadoop', pattern = "\\.jar$", recursive = TRUE)) {
  .jaddClassPath(paste("/usr/phd/3.0.0.0-249/hadoop/", l, sep = ""))
}

hivedrv <- JDBC("org.apache.hive.jdbc.HiveDriver", "/usr/phd/3.0.0.0-249/hive/lib/hive-jdbc.jar")
conn <- dbConnect(hivedrv, "jdbc:hive2://hive:10000/default", "myusername")

counter <- dbGetQuery(conn, "select count(*) from default.tablename")
counter

dbDisconnect(conn)

Change the jar paths above to match your Hadoop distribution.

How to create a CRAN mirror

We wanted to set up a local copy of CRAN to speed up package installs behind the corporate proxy, following the instructions given on the official website.

Install the Apache HTTP server, then create a simple script sync.sh with the following contents:

rsync -rvCtL --delete --include="*.tar.gz" --include="PACKAGES*" --exclude="*/*" cran.r-project.org::CRAN/src/contrib /var/www/cran

The script above deletes all the old packages and copies new ones into the folder /var/www/cran. You can choose any other location.

Rsync used to stop many times due to connection or other issues, so I wrapped it in a loop: if the exit status is not clean, it runs again.

call.sh has the contents:

./sync.sh
while [ $? -ne 0 ]; do
    ./sync.sh
done


Now just schedule call.sh via cron to run at regular intervals; the official website says to run it every 2 days.

http://cran.r-project.org/mirror-howto.html

You can now use this location as the contrib URL when installing R packages.
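For example, installs can be pointed at the mirror like this (the URL below reuses the hypothetical local address from the commands above; replace it with your own server):

```r
# Base URL of the local CRAN mirror (hypothetical; replace with yours)
mirror <- "http://local/cran-mirror"

# contrib.url() appends the standard src/contrib layout that rsync created
repo <- contrib.url(mirror, type = "source")
print(repo)

# Either pass it per install:
# install.packages("data.table", contriburl = repo)
# or make the mirror the default repository for the session:
# options(repos = c(CRAN = mirror))
```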






Error ODBC headers sql.h and sqlext.h not found

Error while installing the RODBC package in R:

checking for sqlext.h... no
configure: error: "ODBC headers sql.h and sqlext.h not found"
ERROR: configuration failed for package ‘RODBC’
* removing ‘/usr/lib64/R/library/RODBC’


Solution

yum install unixODBC-devel



How to configure HAWQ to talk to RStudio and R


HAWQ is installed on a server which analysts use to run all their queries.

The RPostgreSQL package can be used from RStudio (or plain R) to talk to HAWQ, since HAWQ speaks the PostgreSQL protocol.

Here is a command run sheet to do the same.

Load the package
> library(RPostgreSQL)

Declare the driver
> drv <- dbDriver("PostgreSQL")

Create the connection
> con <- dbConnect(drv, host="10.1.1.1", port="5432", user="username", password="mypasswrd", dbname="mydatabase")

Run the query
> rs <- dbSendQuery(con, "select count(*) from mytable")

Fetch the results
> fetch(rs, n=-1)
    count
1 3713399
>
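A tiny helper makes the count pattern reusable across tables. This is an illustrative sketch: the table name is interpolated directly with no quoting, so it is only for trusted input, and the connection calls are shown as comments since they need a live HAWQ server.

```r
# Build the count query for a given table (no sanitizing; trusted names only)
count_query <- function(table) sprintf("select count(*) from %s", table)

# With the 'con' connection from the run sheet above:
# rs <- dbSendQuery(con, count_query("mytable"))
# fetch(rs, n = -1)
# dbClearResult(rs); dbDisconnect(con)   # release resources when done
count_query("mytable")
```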

RS-PostgreSQL.h:23:26: error: libpq-fe.h: No such file or directory

Error while installing RPostgreSQL:

m=ssp-buffer-size=4 -m64 -mtune=generic  -c RS-PQescape.c -o RS-PQescape.o
In file included from RS-PQescape.c:7:
RS-PostgreSQL.h:23:26: error: libpq-fe.h: No such file or directory
RS-PQescape.c: In function ‘RS_PostgreSQL_escape’:
RS-PQescape.c:21: error: ‘PGconn’ undeclared (first use in this function)
RS-PQescape.c:21: error: (Each undeclared identifier is reported only once
RS-PQescape.c:21: error: for each function it appears in.)
RS-PQescape.c:21: error: ‘my_connection’ undeclared (first use in this function)
RS-PQescape.c:28: error: expected expression before ‘)’ token
RS-PQescape.c:32: warning: implicit declaration of function ‘PQescapeStringConn’
make: *** [RS-PQescape.o] Error 1
ERROR: compilation failed for package ‘RPostgreSQL’
* removing ‘/usr/lib64/R/library/RPostgreSQL’

The downloaded source packages are in
        ‘/tmp/RtmpKut82D/downloaded_packages’


Solution

yum install postgresql-devel

rgdal and ggmap R packages install

To install the rgdal and ggmap R packages we need additional dependencies.

I spent a lot of time searching, so I'm dumping everything here.

The versions and binary names below are for Red Hat; please look up the corresponding names if you are on Debian.

geos-devel      3.3.2-1.el6
geos            3.3.2-1.el6
gdal            1.7.3-15.el6
gdal-devel      1.7.3-15.el6
proj-devel      4.7.0-1.el6
proj-epsg       4.7.0-1.el6
proj-nad        4.7.0-1.el6
libpng          2:1.2.49-1.el6_2
libpng-devel    2:1.2.49-1.el6_2


Dumping some of the related errors

Error: proj/epsg not found

Install

proj-epsg                  
proj-nad 

read.c:3:17: error: png.h: No such file or directory

Install
libpng-devel

configure: error: proj_api.h not found in standard or given locations.
ERROR: configuration failed for package ‘rgdal’

Install
proj-devel


Error: gdal-config not found
Make sure gdal-config is in your path; try typing gdal-config.

Install gdal-devel



Install RHadoop on Hadoop Cluster

The instructions below can be used to install the RHadoop rmr2 and rhdfs packages on a Hadoop cluster. I have just a single-node cluster, but it really doesn't matter; the same instructions apply if you have more machines.

Install R on machine by following the instructions at

http://jugnu-life.blogspot.com.au/2013/02/install-r-on-ubuntu_24.html

Let's start installing RHadoop.

RHadoop packages are available at

https://github.com/RevolutionAnalytics

I have cloned the git repo for the packages, which makes it easy to do any upgrades. So you have two choices here.

Option 1) Download the tar.gz files for rmr2 and rhdfs

Option 2) Clone the git repo for each of them

Let's go with the first one.

Download rmr2, rhdfs and quickcheck from the following locations:

https://github.com/RevolutionAnalytics/rhdfs/blob/master/build/rhdfs_1.0.5.tar.gz

https://github.com/RevolutionAnalytics/rmr2/blob/master/build/rmr2_2.1.0.tar.gz

https://github.com/RevolutionAnalytics/quickcheck/blob/master/build/quickcheck_1.0.tar.gz

The links might have changed by the time you read this, so pardon me and get the latest ones.

 

I assume you have already installed R on your machine; it needs to be installed on all nodes in your cluster, and these RHadoop packages also need to be installed on all of the nodes.

 

Export variables needed

Change the location below depending on your install of Hadoop

 

sudo gedit /etc/environment

Add the following

# Variable added for RHadoop Install
HADOOP_CMD=/home/jj/software/hadoop-1.0.4/bin/hadoop
HADOOP_CONF=/home/jj/software/hadoop-1.0.4/conf
HADOOP_STREAMING=/home/jj/software/hadoop-1.0.4/contrib/streaming/hadoop-streaming-1.0.4.jar
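rmr2 and rhdfs read these variables when they load, so it is worth checking from an R session that they are actually visible. The paths below are just the example install locations used above.

```r
# The RHadoop packages locate Hadoop through these environment variables
Sys.getenv(c("HADOOP_CMD", "HADOOP_CONF", "HADOOP_STREAMING"))

# They can also be set for the current R session only, e.g.:
Sys.setenv(HADOOP_CMD = "/home/jj/software/hadoop-1.0.4/bin/hadoop")
Sys.getenv("HADOOP_CMD")
```

Setting them in /etc/environment as above makes them survive reboots and apply to every session; Sys.setenv() is handy for a quick one-off test.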

Tell R about Java

R at times is not able to figure out a few Java settings, so let's tell it explicitly.

 

$ sudo R CMD javareconf JAVA=/home/jj/software/java/jdk1.6.0_43/bin/java JAVA_HOME=/home/jj/software/java/jdk1.6.0_43 JAVAC=/home/jj/software/java/jdk1.6.0_43/bin/javac JAR=/home/jj/software/java/jdk1.6.0_43/bin/jar JAVAH=/home/jj/software/java/jdk1.6.0_43/bin/javah

Updating Java configuration in /usr/lib/R
Done.

Please change /home/jj/software/java/jdk1.6.0_43 in the command above depending on your Java location.

 

Check the cluster is up and happy

hadoop fs -ls /

jj@jj-VirtualBox:~/software/R/RHadoop$ hadoop fs -ls /
Warning: $HADOOP_HOME is deprecated.

Found 3 items
drwxr-xr-x   - jj supergroup          0 2013-03-29 09:59 /hbase
drwxr-xr-x   - jj supergroup          0 2013-03-09 21:57 /home
drwxr-xr-x   - jj supergroup          0 2013-03-09 17:35 /user

 

Install rJava

 

Start R with

$ sudo R --save

 

We are just starting R with sudo so that the packages we install now go into the system-wide library.

 

>install.packages('rJava')

It will ask you to choose a CRAN mirror; select one near you and let the install happen.

After it's done, verify that it's there :)

> library()

 

It will show something like

Packages in library ‘/usr/local/lib/R/site-library’:

rJava                   Low-level R to Java interface

Packages in library ‘/usr/lib/R/library’:

Quit R

> q()

All set

 

Install rhdfs now

Go to the location where you downloaded the tar.gz files and execute the following command:

 

jj@jj-VirtualBox:~/software/R/RHadoop/rhdfs/build$ ls
rhdfs_1.0.5.tar.gz

 

 

$ sudo HADOOP_CMD=/home/jj/software/hadoop-1.0.4/bin/hadoop R CMD INSTALL rhdfs_1.0.5.tar.gz

Check

> library('rhdfs')
Loading required package: rJava

HADOOP_CMD=/home/jj/software/hadoop-1.0.4/bin/hadoop

Be sure to run hdfs.init()
> hdfs.init()
> hdfs.ls('/')
  permission owner      group size          modtime   file
1 drwxr-xr-x    jj supergroup    0 2013-03-29 09:59 /hbase
2 drwxr-xr-x    jj supergroup    0 2013-03-09 21:57  /home
3 drwxr-xr-x    jj supergroup    0 2013-03-09 17:35  /user
>

We are able to see the HDFS files from R, so rhdfs is all done.

 

Install rmr2

 

$ sudo apt-get install -y pdfjam

 

> install.packages(c( 'RJSONIO', 'itertools', 'digest','functional', 'stringr', 'plyr'))

Download package from

http://cran.r-project.org/web/packages/reshape2/index.html

http://cran.r-project.org/web/packages/Rcpp/index.html

 

sudo R CMD INSTALL Rcpp_0.10.3.tar.gz

sudo R CMD INSTALL reshape2_1.2.2.tar.gz

sudo R CMD INSTALL quickcheck_1.0.tar.gz
sudo R CMD INSTALL rmr2_2.0.2.tar.gz

 

Done

More reading

http://bighadoop.wordpress.com/2013/02/25/r-and-hadoop-data-analysis-rhadoop/

Install R on Ubuntu

Install R on Ubuntu Step by step

The supported releases are

Quantal Quetzal (12.10), Precise Pangolin (12.04; LTS), Oneiric Ocelot (11.10), Natty Narwhal (11.04), Lucid Lynx (10.04; LTS) and Hardy Heron (8.04; LTS)

Step 1

Add the software source to your /etc/apt/sources.list file, replacing <my.favorite.cran.mirror> with the actual URL of your favorite CRAN mirror:

deb http://<my.favorite.cran.mirror>/bin/linux/ubuntu precise/

The complete list of mirrors is available at http://cran.r-project.org/mirrors.html

Example

deb http://cran.ma.imperial.ac.uk/bin/linux/ubuntu precise/

 

Step 2

Add the key to access the software

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9

 

Step 3

Install R

$sudo apt-get update
$sudo apt-get install r-base

 

Done :)

http://cran.r-project.org/bin/linux/ubuntu/README

-----

Alternatives to add key

SECURE APT

The Ubuntu archives on CRAN are signed with the key of "Michael Rutter
<marutter@gmail.com>" with key ID E084DAB9.  To add the key to your
system with one command use (thanks to Brett Presnell for the tip):

   sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9

An alternative method is to retrieve the key with

   gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9

and then feed it to apt-key with

   gpg -a --export E084DAB9 | sudo apt-key add -

-----

rjava jdk not found

Error

Make sure you have Java Development Kit installed and correctly registered in R.
If in doubt, re-run "R CMD javareconf" as root

This means that R is not able to detect Java properly, so let's fix it.

I found the following pages useful, so I want to give them credit:

http://r.789695.n4.nabble.com/rjava-JDK-not-found-td889163.html
http://svn.r-project.org/R/trunk/src/scripts/javareconf

Run the command once with sudo

$ sudo R CMD javareconf

and note its output. For me it looked like this:

jj@jj-VirtualBox:~$ sudo R CMD javareconf
Java interpreter : /usr/bin/java
Java version     : 1.6.0_38
Java home path   : /usr/lib/jvm/jdk1.6.0_38/jre
Java compiler    : /usr/bin/javac
Java headers gen.:
Java archive tool:
Java library path: $(JAVA_HOME)/lib/i386/client:$(JAVA_HOME)/lib/i386:$(JAVA_HOME)/../lib/i386:/usr/java/packages/lib/i386:/lib:/usr/lib
JNI linker flags : -L$(JAVA_HOME)/lib/i386/client -L$(JAVA_HOME)/lib/i386 -L$(JAVA_HOME)/../lib/i386 -L/usr/java/packages/lib/i386 -L/lib -L/usr/lib -ljvm
JNI cpp flags    : -I$(JAVA_HOME)/../include -I$(JAVA_HOME)/../include/linux

Updating Java configuration in /etc/R
Done.

Now rerun the same command without sudo

jj@jj-VirtualBox:~$ R CMD javareconf
Java interpreter : /usr/lib/jvm/jdk1.6.0_38/jre/bin/java
Java version     : 1.6.0_38
Java home path   : /usr/lib/jvm/jdk1.6.0_38
Java compiler    : /usr/lib/jvm/jdk1.6.0_38/bin/javac
Java headers gen.: /usr/lib/jvm/jdk1.6.0_38/bin/javah
Java archive tool: /usr/lib/jvm/jdk1.6.0_38/bin/jar
Java library path: /usr/lib/jvm/jdk1.6.0_38/jre/lib/i386/client:/usr/lib/jvm/jdk1.6.0_38/jre/lib/i386:/usr/lib/jvm/jdk1.6.0_38/jre/../lib/i386:/usr/java/packages/lib/i386:/lib:/usr/lib
JNI linker flags : -L/usr/lib/jvm/jdk1.6.0_38/jre/lib/i386/client -L/usr/lib/jvm/jdk1.6.0_38/jre/lib/i386 -L/usr/lib/jvm/jdk1.6.0_38/jre/../lib/i386 -L/usr/java/packages/lib/i386 -L/lib -L/usr/lib -ljvm
JNI cpp flags    : -I/usr/lib/jvm/jdk1.6.0_38/include -I/usr/lib/jvm/jdk1.6.0_38/include/linux

Updating Java configuration in /etc/R
/usr/lib/R/bin/javareconf: 370: /usr/lib/R/bin/javareconf: cannot create /etc/R/Makeconf.new: Permission denied
*** cannot create /etc/R/Makeconf.new
*** Please run as root if required.

We can see that the JAVAH and JAR paths were not detected when the command was run with sudo.

If we read the R javareconf script

http://svn.r-project.org/R/trunk/src/scripts/javareconf

we can rerun the command with sudo, passing the required paths explicitly:

sudo R CMD javareconf JAVA=/usr/lib/jvm/jdk1.6.0_38/jre/bin/java JAVA_HOME=/usr/lib/jvm/jdk1.6.0_38 JAVAC=/usr/lib/jvm/jdk1.6.0_38/bin/javac JAR=/usr/lib/jvm/jdk1.6.0_38/bin/jar JAVAH=/usr/lib/jvm/jdk1.6.0_38/bin/javah

This tells javareconf where to find each tool, so it should now be able to complete the configuration.