Kiji Maven Setup

To develop applications using KijiSchema and Kiji MapReduce, add the following to your Maven setup.

If you have custom changes in your Maven settings.xml, add the following in the relevant sections. Otherwise, you can download the settings.xml provided on the Kiji website.






Changes to pom.xml


Add the dependency as follows



For org.kiji.platforms

Please read below; you need to choose the right version depending on your Hadoop cluster
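A sketch of what the pom.xml dependency section might look like. The groupId, artifactId, and version below are assumptions based on the KijiSchema release used later in this post; check the Kiji website for the exact coordinates.

```xml
<!-- Illustrative only: verify coordinates and versions on the Kiji site -->
<dependencies>
  <dependency>
    <groupId>org.kiji.schema</groupId>
    <artifactId>kiji-schema</artifactId>
    <version>1.0.0-rc5</version>
  </dependency>
</dependencies>
```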

Installing Kiji Schema and Shell

Make sure that Hadoop and HBase are up and running.

If you need help installing Hadoop and HBase, please see the following posts

Download KijiSchema and extract it to some location

Set the following variables

export KIJI_HOME="/home/jj/software/wibi/kiji/kiji-schema-1.0.0-rc5"
export PATH=$PATH:$KIJI_HOME/bin

Install Kiji system tables

$ kiji install


jj@jj-VirtualBox:~$ kiji install
Warning: $HADOOP_HOME is deprecated.

Creating kiji instance: kiji://localhost:2181/default/
Creating meta tables for kiji instance in hbase...
13/03/30 18:03:37 INFO org.kiji.schema.KijiInstaller: Installing kiji instance 'kiji://localhost:2181/default/'.
13/03/30 18:03:43 INFO org.kiji.schema.KijiInstaller: Installed kiji instance 'kiji://localhost:2181/default/'.
Successfully created kiji instance: kiji://localhost:2181/default/


Installing Kiji Schema Shell

Download from

export KIJI_SHELL_HOME="/home/jj/software/wibi/kiji/kiji-schema-shell-1.0.0-rc5"

Start the Kiji shell with


jj@jj-VirtualBox:~$ kiji-schema-shell
Warning: $HADOOP_HOME is deprecated.

Kiji schema shell v1.0.0-rc5
Enter 'help' for instructions (without quotes).
Enter 'quit' to quit.
DDL statements must be terminated with a ';'

Congrats, you have installed KijiSchema successfully. Let's play :)

Handling schema changes and evolution in Hadoop

In Hadoop, if you use Hive and try to have different schemas for different partitions, you cannot have a field inserted in the middle.

If the fields are added at the end, you can use Hive natively.

However, things break if a field is inserted in the middle.

There are a few ways to handle schema evolution and changes in Hadoop

Use Avro

For the flat schema of database tables (or files), generate an Avro schema. This Avro schema can be used anywhere in your programs, or mapped to Hive using the AvroSerDe.

I am exploring the various JSON APIs that can be used, and the various approaches I could take.

Nokia has released code to generate Avro schemas from XML files

Okay, my problem statement and solution are simple.

The ideas in my mind are

  1. Store the schema details of each table in some database
  2. Read the field details from the database and generate an Avro schema
  3. Store it at some location in Hadoop, e.g. /schema/tableschema
  4. Map Hive to use this Avro schema location in HDFS
  5. If some change comes in the schema, update the database and the system would again generate a new Avro schema
  6. Push the new schema to HDFS
  7. Hive would use the new schema without breaking old data, supporting schema changes and evolution for data in Hadoop
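A minimal sketch of steps 2 and 3 above, assuming the field metadata has already been read from the schema database. The table name, field names, and types here are hypothetical, purely for illustration.

```python
import json

# Hypothetical field metadata, as it might be read from the schema
# database in steps 1-2. Names and types are illustrative only.
fields = [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    # A field added later in the schema's evolution: made nullable with
    # a default so that old records written without it can still be read.
    {"name": "email", "type": ["null", "string"], "default": None},
]

def build_avro_schema(table_name, fields):
    """Generate an Avro record schema (as JSON text) for a flat table."""
    schema = {
        "type": "record",
        "name": table_name,
        "fields": fields,
    }
    return json.dumps(schema, indent=2)

schema_json = build_avro_schema("customer", fields)
print(schema_json)
# The generated .avsc text would then be pushed to HDFS (steps 3 and 6),
# e.g. with: hadoop fs -put customer.avsc /schema/customer
```

Hive's AvroSerDe can then be pointed at that HDFS location, so replacing the schema file rolls the new schema out without touching old data (steps 4-7).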

Most NoSQL databases take a similar approach; check the Oracle link below

The Oracle NoSQL solution manages the schema information and changes in its KeyStore



The Hortonworks folks are working on a new file format (ORC) which, like Avro, has the feature of storing the schema within the data

The "versioned metadata" means that the ORC file's metadata is stored in ProtoBufs so that we can add (or remove) fields to the metadata. That means that for some changes to ORC file format we can provide both forward and backward compatibility.

ORC files like Avro files are self-describing. They include the type structure of the records in the metadata of the file. It will take more integration work with hive to make the schemas very flexible with ORC.

Jackson Tutorial

Jackson is a Java API for working with JSON.

Some useful links

GSON tutorials

A few links to GSON tutorials I found

Optional: the first link is about JSON itself

Install RHadoop on Hadoop Cluster

The instructions below can be used to install the RHadoop rmr2 and rhdfs packages on a Hadoop cluster. I have just a single-node cluster, but it really doesn't matter; the same instructions apply if you have more machines.

Install R on the machine by following the instructions at

Let's start installing RHadoop

RHadoop packages are available at

I have cloned the git repo for the packages; this makes it easy to do any upgrades. So you have two choices here.

Easy 1) Download the tar.gz files for rmr2 and rhdfs


Easy 2 :) Clone the git repo for each of them

Let's go with the first one

Download rmr2 and rhdfs from the following locations

The links might have changed by the time you read this, so pardon me and get the latest links


I assume you have already installed R on your machine; it needs to be installed on all nodes in your cluster, and these RHadoop packages also need to be installed on all of the nodes.


Export the needed variables

Change the location below depending on your Hadoop install


sudo gedit /etc/environment

Add the following

# Variable added for RHadoop Install

Tell R about Java

R at times is not able to figure out a few Java settings, so let's help it out


Please change the Java paths in the command below depending on your Java location.

$ sudo R CMD javareconf JAVA=/home/jj/software/java/jdk1.6.0_43/bin/java JAVA_HOME=/home/jj/software/java/jdk1.6.0_43 JAVAC=/home/jj/software/java/jdk1.6.0_43/bin/javac JAR=/home/jj/software/java/jdk1.6.0_43/bin/jar JAVAH=/home/jj/software/java/jdk1.6.0_43/bin/javah

Updating Java configuration in /usr/lib/R


Check cluster is up and happy

hadoop fs -ls /

jj@jj-VirtualBox:~/software/R/RHadoop$ hadoop fs -ls /
Warning: $HADOOP_HOME is deprecated.

Found 3 items
drwxr-xr-x   - jj supergroup          0 2013-03-29 09:59 /hbase
drwxr-xr-x   - jj supergroup          0 2013-03-09 21:57 /home
drwxr-xr-x   - jj supergroup          0 2013-03-09 17:35 /user


Install RJava


Start R with

$ sudo R --save


We are just telling R to start with sudo and to save the settings we make now



From the R prompt, run install.packages('rJava'). It will ask you to choose a CRAN mirror; select one near you and let the install happen.

After it's done, verify that it's there :)

> library()


It will show something like

Packages in library ‘/usr/local/lib/R/site-library’:

rJava                   Low-level R to Java interface

Packages in library ‘/usr/lib/R/library’:

Quit R

> q()

All set


Install rhdfs now

Go to the location where you downloaded the tar.gz files and execute the following command


jj@jj-VirtualBox:~/software/R/RHadoop/rhdfs/build$ ls



$ sudo HADOOP_CMD=/home/jj/software/hadoop-1.0.4/bin/hadoop R CMD INSTALL rhdfs_1.0.5.tar.gz

(Note: "sudo export" is not a valid command; passing the variable directly on the sudo command line, as above, works.)


> library('rhdfs')
Loading required package: rJava


Be sure to run hdfs.init()
> hdfs.init()
  permission owner      group size          modtime   file
1 drwxr-xr-x    jj supergroup    0 2013-03-29 09:59 /hbase
2 drwxr-xr-x    jj supergroup    0 2013-03-09 21:57  /home
3 drwxr-xr-x    jj supergroup    0 2013-03-09 17:35  /user

We are able to see HDFS files in R

So all done for rhdfs


Install rmr2


$ sudo apt-get install -y pdfjam


> install.packages(c( 'RJSONIO', 'itertools', 'digest','functional', 'stringr', 'plyr'))

Download package from


sudo R CMD INSTALL Rcpp_0.10.3.tar.gz

sudo R CMD INSTALL reshape2_1.2.2.tar.gz

sudo R CMD INSTALL quickcheck_1.0.tar.gz
sudo R CMD INSTALL rmr2_2.0.2.tar.gz



More reading

SSH Putty tools and tips

Some useful SSH tools and tips

Use mPutty for tabbed PuTTY SSH sessions; it has lots of features

Make sequence files on disk

Create SequenceFiles from files on your local filesystem

Extract the contents of a SequenceFile back to the filesystem

Convert popular archive formats — tar (including tar.bz2 and tar.gz) and zip — to and from SequenceFile format.

Process XML data in Hadoop

To read XML files

Mahout has XML input format , see the blog post below to read more

Pig has XMLLoader

Import export of Data to HDFS

Various tools and methods exist to import data into HDFS. Depending on the type of data and where it is located, you can use the following tools

Import from Database

This tool can import data from various databases; custom connectors are also available for fast import/export processing.

Import of file based loads


Collect to one place and then push to HDFS

It has various source and sink classes which can be used to push files to HDFS

HDFS File Slurper
A basic tool for import and export

Regular tools
Use an automation tool like cron or autosys to push files to HDFS at some location
# hadoop fs -copyFromLocal src dest
# hadoop fs -copyToLocal src dest

Use oozie
Use an Oozie SSH action to log in to the machine and then execute the above two copy commands

Import export from HBase to HDFS

Use HBase export utility class

$ bin/hbase org.apache.hadoop.hbase.mapreduce.Export <tablename> <outputdir> 
Import to HBase
$ bin/hbase org.apache.hadoop.hbase.mapreduce.Import <tablename> <inputdir>
The HBase export utility writes data in SequenceFile format, so you need to convert it afterwards.

You can also use MapReduce to read from HBase and write to HDFS in plain text or any other format

HBase pseudo mode install

If you have not done so already, install Hadoop by following

Let's get HBase working

Download HBase and Zookeeper Tar ball from Apache website

Extract to some place and set the environment variables (say, in the .profile of your home directory)

export HBASE_HOME="/home/jj/software/hbase-0.94.5"

export ZOOKEEPER_HOME="/home/jj/software/zookeeper-3.4.5"

HBase settings

Check DNS settings

jj@jj-VirtualBox:~$ cat /etc/hosts
127.0.0.1    localhost
127.0.0.1    jj-VirtualBox

Check that both IPs are the same; by default in Ubuntu they are not.
HBase expects the loopback IP address to be 127.0.0.1. Ubuntu and some other distributions, for example, will default to 127.0.1.1, and this will cause problems for you.


export JAVA_HOME="/home/jj/software/java/jdk1.6.0_43"

Changes in
hbase-site.xml properties

Add the following
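A minimal sketch of the properties typically added for a pseudo-distributed setup. The HDFS port 9000 is an assumption; match the value to fs.default.name in your core-site.xml.

```xml
<!-- Sketch only: adjust host/port to match your HDFS configuration -->
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://localhost:9000/hbase</value>
</property>
<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
</property>
```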



Create hbase directory in HDFS

$ hadoop fs -mkdir /hbase

Zookeeper settings

In conf directory of Zookeeper

Rename zoo_sample.cfg to zoo.cfg

Change the path of
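The setting to change is presumably dataDir, which in zoo_sample.cfg points at a /tmp location; the path below is illustrative only.

```
# zoo.cfg: move dataDir off /tmp so ZooKeeper state survives reboots
dataDir=/home/jj/software/zookeeper-3.4.5/data
```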


We are ready to test

Start Hadoop


Warning: $HADOOP_HOME is deprecated.

starting namenode, logging to /home/jj/software/hadoop-1.0.4/libexec/../logs/hadoop-jj-namenode-jj-VirtualBox.out
localhost: starting datanode, logging to /home/jj/software/hadoop-1.0.4/libexec/../logs/hadoop-jj-datanode-jj-VirtualBox.out
localhost: starting secondarynamenode, logging to /home/jj/software/hadoop-1.0.4/libexec/../logs/hadoop-jj-secondarynamenode-jj-VirtualBox.out
starting jobtracker, logging to /home/jj/software/hadoop-1.0.4/libexec/../logs/hadoop-jj-jobtracker-jj-VirtualBox.out
localhost: starting tasktracker, logging to /home/jj/software/hadoop-1.0.4/libexec/../logs/hadoop-jj-tasktracker-jj-VirtualBox.out

Check Hadoop pages


All fine ?

Let's start HBase

HBase automatically starts Zookeeper too, so there is no need to start it on your own


localhost: starting zookeeper, logging to /home/jj/software/hbase-0.94.5/bin/../logs/hbase-jj-zookeeper-jj-VirtualBox.out
starting master, logging to /home/jj/software/hbase-0.94.5/logs/hbase-jj-master-jj-VirtualBox.out
localhost: starting regionserver, logging to /home/jj/software/hbase-0.94.5/bin/../logs/hbase-jj-regionserver-jj-VirtualBox.out

Check HBase pages

Region Server


All done :)

Stuck somewhere?

Post below

Which process is using port

$ netstat -lnp | grep portNo

Note the process ID and process name

To see the full path of the process, use the following command, replacing the processID below

ls -l /proc/processID/exe


ls -l /proc/1222/exe

Configure hiveserver2

Please configure Hive with MySQL before starting hiveserver2

Follow this post

You can also use some other database, like Oracle.

Add the following settings in hive-site.xml

  <description>Enable Hive's Table Lock Manager Service</description>
  <description>Zookeeper quorum used by Hive's Table Lock Manager</description>
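Based on the descriptions above, the properties are presumably the standard Table Lock Manager settings; verify the exact names and values against your Hive version's documentation.

```xml
<property>
  <name>hive.support.concurrency</name>
  <description>Enable Hive's Table Lock Manager Service</description>
  <value>true</value>
</property>
<property>
  <name>hive.zookeeper.quorum</name>
  <description>Zookeeper quorum used by Hive's Table Lock Manager</description>
  <value>localhost</value>
</property>
```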


From command prompt

Start hiveserver2


The hiveserver2 binary is present in the bin folder of the Hive directory, so in case you installed Hive via tarball you can go to that folder and run it from there

How to check

Now just start beeline and see if things are working fine

The username and password are not required if you haven't configured any LDAP settings for hiveserver

$ /usr/lib/hive/bin/beeline
beeline> !connect jdbc:hive2://localhost:10000 username password org.apache.hive.jdbc.HiveDriver
0: jdbc:hive2://localhost:10000> show tables;
+-----------+
| tab_name  |
+-----------+
+-----------+
No rows selected (0.238 seconds)
0: jdbc:hive2://localhost:10000>

The strategy behind the Google Reader phase-out

If I look at Google's strategy of bringing people who use Google Reader over to G+, then it makes sense for them. For example, we all spend lots of time on Facebook; if Facebook had an inbuilt RSS reader like Google Reader, people would in turn spend more time inside FB and also share the posts they read with their friend network. Although there are apps which do this, many don't explore FB apps.

Okay, coming back to Google's decision to phase out Reader. If the whole strategy is to make G+ the default place for people to work in the Google ecosystem, then Reader should soon be embedded there, so that I can share with my circles what I am reading and Google+ also becomes happy with increasing traffic. The major issue with G+ is the lack of colors :) which Google has not understood till now. People don't like dull whites with lots of hidden blurred buttons that make them wonder where to click :) Are you doing any usability studies for G+? The world outside Google is not geeks; they want colorful stuff. #googlereader

Google, are you listening? :)

Find current shell in linux

~$ echo $SHELL


The above shows it is using bash

Decision Tree

Stuff useful for learning Decision Trees.

Intro from Wikipedia


Decision Trees

Chapter 3


Machine Learning by Tom Mitchell

Online lecture at (Highly recommended)


Decision Tree Applet

This applet explains in a great way how the selection policy for the root node affects the decision tree. You can play with each of those policies and see how the outcome varies.

Maven Plugins Where to define and Configure

There are two kinds of plugins, build plugins and reporting plugins:

  • Build plugins are executed during the build and should be configured in the <build/> element.
  • Reporting plugins are executed during the site generation and should be configured in the <reporting/> element.


Specify each build plugin's version in the <build><pluginManagement/></build> elements (generally in a parent POM).

For reporting plugins, specify each version in the <reporting><plugins/></reporting> elements (and also in the <build><pluginManagement/></build>).

The configuration of plugin behaviour can be done using <configuration>.

Configurations inside the <executions> tag differ from those outside <executions> in that they cannot be used from a direct command-line invocation; they are only applied when the lifecycle phase they are bound to is invoked. Alternatively, if you move a configuration section outside of the executions section, it will apply globally to all invocations of the plugin.
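A sketch of the two placements, using maven-antrun-plugin purely as an example:

```xml
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-antrun-plugin</artifactId>
      <!-- Global configuration: applies to every invocation of the
           plugin, including direct command-line runs -->
      <configuration>
        <!-- ... -->
      </configuration>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>run</goal>
          </goals>
          <!-- Execution-scoped configuration: applied only when the
               bound phase (package) runs, not from the command line -->
          <configuration>
            <!-- ... -->
          </configuration>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```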


Read this to complete your understanding

GUI Graphical Interface for HBase region servers

Hannibal is a tool to help monitor and maintain HBase-Clusters that are configured for manual splitting.

This helps us to answer the following questions

  1. How well are regions balanced over the cluster?
  2. How well are the regions split for each table?
  3. How do regions evolve over time?

To install Hannibal follow the steps below; it won't take long.


Download the latest version

$ git clone

$ cd hannibal


Edit your .profile or /etc/environment, your choice

To include the following property

Change the version depending on your HBase version; the options are 0.90, 0.92 and 0.94
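For example (the variable name HANNIBAL_HBASE_VERSION is an assumption based on Hannibal's README; verify it against the release you downloaded):

```shell
# Assumed variable name; check Hannibal's README for your release
export HANNIBAL_HBASE_VERSION="0.94"
```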



Copy the hbase-site.xml from your HBase conf directory to the Hannibal conf directory

4) Build the project

It will take some time, as it downloads dependencies from the internet

$ ./build

Afterwards it shows the message Success

5) Start the server

$ ./start

It will take some time for the server to start, and then you can monitor it at


You can configure the port in case you already have something running there.

Please note that history data about regions is only collected while the application is running; it will need to run for some time until the region detail graphs fill up.

Happy Hadooping :)