Jugnu Life :-): February 2013

Install R on Ubuntu

Install R on Ubuntu Step by step

The supported releases are

Quantal Quetzal (12.10), Precise Pangolin (12.04; LTS), Oneiric Ocelot (11.10), Natty Narwhal (11.04), Lucid Lynx (10.04; LTS) and Hardy
Heron (8.04; LTS)

Step 1

Add the software source. Put a line of the form

deb http://<my.favorite.cran.mirror>/bin/linux/ubuntu precise/

in your /etc/apt/sources.list file, replacing
<my.favorite.cran.mirror> with the actual URL of your favorite CRAN
mirror.

The complete list of mirrors is available at http://cran.r-project.org/mirrors.html

Example

deb http://cran.ma.imperial.ac.uk/bin/linux/ubuntu precise/
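The source line can also be generated instead of typed by hand. Below is a minimal sketch, assuming the mirror host from the example above; the function name cran_source_line is made up for illustration and is not part of apt or R. It only prints the line, so nothing on the system changes until you append the output yourself.

```shell
#!/bin/sh
# Hypothetical helper: print the apt source line for a given CRAN mirror
# host and Ubuntu release codename (for example "precise").
cran_source_line() {
    mirror="$1"
    codename="$2"
    printf 'deb http://%s/bin/linux/ubuntu %s/\n' "$mirror" "$codename"
}

# Example: the Imperial College mirror used above, on Precise Pangolin.
cran_source_line cran.ma.imperial.ac.uk precise
```

To actually add the line, the output could be piped to sudo tee -a /etc/apt/sources.list.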

Step 2

Add the key to access the software

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9

Step 3

Install R

$ sudo apt-get update
$ sudo apt-get install r-base

Done :)

http://cran.r-project.org/bin/linux/ubuntu/README

-----

Alternatives to add key

SECURE APT

The Ubuntu archives on CRAN are signed with the key of "Michael Rutter
<marutter@gmail.com>" with key ID E084DAB9. To add the key to your
system with one command use (thanks to Brett Presnell for the tip):

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9

An alternate method can be used by retrieving the key with

gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9

and then feed it to apt-key with

gpg -a --export E084DAB9 | sudo apt-key add -

-----

rjava jdk not found

Error

Make sure you have Java Development Kit installed and correctly registered in R.
If in doubt, re-run "R CMD javareconf" as root

This means that R is not able to detect Java properly

So let's fix this

I found the following pages useful, so credit goes to them

http://r.789695.n4.nabble.com/rjava-JDK-not-found-td889163.html
http://svn.r-project.org/R/trunk/src/scripts/javareconf

Run the command once with sudo

$ sudo R CMD javareconf

Note its output

For me the following was shown

jj@jj-VirtualBox:~$ sudo R CMD javareconf
Java interpreter : /usr/bin/java
Java version     : 1.6.0_38
Java home path   : /usr/lib/jvm/jdk1.6.0_38/jre
Java compiler    : /usr/bin/javac
Java headers gen.:
Java archive tool:
Java library path: $(JAVA_HOME)/lib/i386/client:$(JAVA_HOME)/lib/i386:$(JAVA_HOME)/../lib/i386:/usr/java/packages/lib/i386:/lib:/usr/lib
JNI linker flags : -L$(JAVA_HOME)/lib/i386/client -L$(JAVA_HOME)/lib/i386 -L$(JAVA_HOME)/../lib/i386 -L/usr/java/packages/lib/i386 -L/lib -L/usr/lib -ljvm
JNI cpp flags    : -I$(JAVA_HOME)/../include -I$(JAVA_HOME)/../include/linux

Updating Java configuration in /etc/R
Done.

Now rerun the same command without sudo

jj@jj-VirtualBox:~$ R CMD javareconf
Java interpreter : /usr/lib/jvm/jdk1.6.0_38/jre/bin/java
Java version     : 1.6.0_38
Java home path   : /usr/lib/jvm/jdk1.6.0_38
Java compiler    : /usr/lib/jvm/jdk1.6.0_38/bin/javac
Java headers gen.: /usr/lib/jvm/jdk1.6.0_38/bin/javah
Java archive tool: /usr/lib/jvm/jdk1.6.0_38/bin/jar
Java library path: /usr/lib/jvm/jdk1.6.0_38/jre/lib/i386/client:/usr/lib/jvm/jdk1.6.0_38/jre/lib/i386:/usr/lib/jvm/jdk1.6.0_38/jre/../lib/i386:/usr/java/packages/lib/i386:/lib:/usr/lib
JNI linker flags : -L/usr/lib/jvm/jdk1.6.0_38/jre/lib/i386/client -L/usr/lib/jvm/jdk1.6.0_38/jre/lib/i386 -L/usr/lib/jvm/jdk1.6.0_38/jre/../lib/i386 -L/usr/java/packages/lib/i386 -L/lib -L/usr/lib -ljvm
JNI cpp flags    : -I/usr/lib/jvm/jdk1.6.0_38/include -I/usr/lib/jvm/jdk1.6.0_38/include/linux

Updating Java configuration in /etc/R
/usr/lib/R/bin/javareconf: 370: /usr/lib/R/bin/javareconf: cannot create /etc/R/Makeconf.new: Permission denied
*** cannot create /etc/R/Makeconf.new
*** Please run as root if required.

So we can see that the JAVAH (headers generator) and JAR (archive tool) paths are not detected when the command runs under sudo.

After reading the javareconf script

http://svn.r-project.org/R/trunk/src/scripts/javareconf

we can rerun the command with sudo, passing the required paths explicitly

sudo R CMD javareconf JAVA=/usr/lib/jvm/jdk1.6.0_38/jre/bin/java JAVA_HOME=/usr/lib/jvm/jdk1.6.0_38 JAVAC=/usr/lib/jvm/jdk1.6.0_38/bin/javac JAR=/usr/lib/jvm/jdk1.6.0_38/bin/jar JAVAH=/usr/lib/jvm/jdk1.6.0_38/bin/javah

This tells javareconf where to find each tool, so it should now be able to update the Java configuration in /etc/R
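Since every path in that command hangs off the same JDK directory, the command can be assembled from a single value. This is a minimal sketch, assuming the usual JDK layout seen in the output above (jre/bin/java, bin/javac, bin/jar, bin/javah); the function name javareconf_cmd is hypothetical. It prints the command rather than running it, so you can check it first.

```shell
#!/bin/sh
# Hypothetical helper: print the javareconf command for one JDK directory.
# Assumes the JDK layout shown in the output above; adjust if yours differs.
javareconf_cmd() {
    jdk="$1"
    printf 'sudo R CMD javareconf JAVA=%s/jre/bin/java JAVA_HOME=%s JAVAC=%s/bin/javac JAR=%s/bin/jar JAVAH=%s/bin/javah\n' \
        "$jdk" "$jdk" "$jdk" "$jdk" "$jdk"
}

# Example with the JDK path from this post.
javareconf_cmd /usr/lib/jvm/jdk1.6.0_38
```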

Chain Mapper Example

Here is an example of how to use the ChainMapper class in Hadoop to call mappers in sequence.
Comments in the code explain each step

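import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.log4j.Logger;

public class ChainDriver {
    public static Logger log = Logger.getLogger(ChainDriver.class);

    /**
     * @param args input path, output path, parameter for MyMapperA,
     *             class name of the second mapper
     * @throws IOException
     * @throws ClassNotFoundException
     * @throws InterruptedException
     */
    public static void main(String[] args) throws IOException,
            InterruptedException, ClassNotFoundException {
        // Start main Chain Job and declare its conf and job
        Configuration chainConf = new Configuration();
        Job chainJob = Job.getInstance(chainConf);
        // Variable names kept like conf1 etc to make code less cluttered
        // Start Mapper for MyMapperA (your own Mapper class)
        Configuration conf1 = new Configuration(false);
        // Example for passing arguments to the mappers
        conf1.set("myParameter", args[2]);
        ChainMapper.addMapper(chainJob, MyMapperA.class,
                LongWritable.class, Text.class, Text.class, Text.class, conf1);
        // Start Mapper for second replacement
        Configuration conf2 = new Configuration(false);
        // Take the class name from an argument to make the chain dynamic
        // (MapperC OR MapperD). It is read from args[3] here so that it
        // does not collide with the parameter passed in args[2].
        ChainMapper.addMapper(chainJob,
                Class.forName(args[3]).asSubclass(Mapper.class), Text.class,
                Text.class, NullWritable.class, Text.class, conf2);
        // Set the parameters for main Chain Job
        chainJob.setJarByClass(ChainDriver.class);
        FileInputFormat.addInputPath(chainJob, new Path(args[0]));
        FileOutputFormat.setOutputPath(chainJob, new Path(args[1]));
        System.exit(chainJob.waitForCompletion(true) ? 0 : 1);
    }
}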

Now a few important points in detail
1)
Configuration conf1 = new Configuration(false);
The configuration objects of the sub-mappers are created with the boolean false
http://hadoop.apache.org/docs/r2.0.3-alpha/api/src-html/org/apache/hadoop/conf/Configuration.html#line.518
using the constructor

public Configuration(boolean loadDefaults)

loadDefaults - specifies whether to load from the default files

2)

Passing arguments
conf1.set("myParameter", args[2]);

You can pass arguments with the same code we use in any driver class

3)

ChainMapper.addMapper(chainJob, MyMapperA.class,
LongWritable.class, Text.class, Text.class, Text.class, conf1);

The method signature is like this

public static void addMapper(Job job,
                             Class<? extends Mapper> klass,
                             Class<?> inputKeyClass,
                             Class<?> inputValueClass,
                             Class<?> outputKeyClass,
                             Class<?> outputValueClass,
                             Configuration mapperConf)

The Job argument here is the Job object of the main driver, chainJob
Then we say which mapper to add and the input and output key/value classes used by that mapper

The last argument is the Configuration of the mapper being added

4)

You can chain as many mappers as you like (plus one reducer, via ChainReducer), but keep in mind that the output of the previous mapper (or reducer) must be directly consumable by the next one in the chain.
For example

Map 1
Map 2
are called in chain

If Map 1 emits
Text as Key
Long as Value

Then

Map 2 should have

Text as Key Input
and Long as Value Input

The framework will not do any conversions for you.

5)
http://hadoop.apache.org/docs/r2.0.3-alpha/api/org/apache/hadoop/mapreduce/lib/chain/ChainMapper.html

The ChainMapper class allows to use multiple Mapper classes within a single Map task.

Quoting from Javadocs

The key functionality of this feature is that the Mappers in the chain do not need to be aware that they are executed in a chain. This enables having reusable specialized Mappers that can be combined to perform composite operations within a single task.
Special care has to be taken when creating chains that the key/values output by a Mapper are valid for the following Mapper in the chain. It is assumed all Mappers and the Reduce in the chain use matching output and input key and value classes as no conversion is done by the chaining code.
Using the ChainMapper and the ChainReducer classes is possible to compose Map/Reduce jobs that look like [MAP+ / REDUCE MAP*]. And immediate benefit of this pattern is a dramatic reduction in disk IO.

I found it a pretty good tool while developing multiple processing pipelines. I just develop reusable classes for the various tasks and call them in a chain.

Update April 6

Based on my experience so far, I would say ChainMapper makes processing slow. So use it only if you really cannot avoid it.

Do you have any tips to improve the performance of ChainMapper? Please share below.

Happy Chaining :)

Installing Python from source on Linux Ubuntu

Download the latest source tarball from

http://www.python.org/download/

Extract it to some location using the tar xfz command

Go to the location where you extracted it

/home/jj/software/programming/Python-3.3.2

Configure the build

jj@jj-VirtualBox:~/software/programming/Python-3.3.2$ ./configure

Make the build

jj@jj-VirtualBox:~/software/programming/Python-3.3.2$ make

Install Python

jj@jj-VirtualBox:~/software/programming/Python-3.3.2$ sudo make install

jj@jj-VirtualBox:~/software/programming/Python-3.3.2$ exit

Test install

jj@jj-VirtualBox:~$ python3
Python 3.3.2 (default, Jun 6 2013, 13:25:22)
[GCC 4.6.3] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

Done :)
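The same check can be done without opening the interactive prompt. A minimal sketch, assuming python3 ended up on your PATH (with the default configure prefix, make install normally places it under /usr/local/bin):

```shell
#!/bin/sh
# Show which python3 is found first on PATH.
command -v python3

# Print the version that interpreter reports, as major.minor.micro.
python3 -c 'import sys; print("%d.%d.%d" % sys.version_info[:3])'
```

If the first line prints a different location than the one you installed to, an older python3 is shadowing the new build on PATH.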

Tell Maven to use specific plugin version

If you want to use a specific version of a Maven plugin, you can say so in the build section

In the version tag, tell Maven which version (the latest or the required one) to use

<build>
    ...
    <pluginManagement>
      <plugins>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-jar-plugin</artifactId>
          <version>2.2</version>
          <executions>
            <execution>
              <id>pre-process-classes</id>
              <phase>compile</phase>
              <goals>
                <goal>jar</goal>
              </goals>
              <configuration>
                <classifier>pre-process</classifier>
              </configuration>
            </execution>
          </executions>
        </plugin>
      </plugins>
    </pluginManagement>
    ...
  </build>

Data representation in Mahout

Mahout uses the concept of a data model to represent the data we are analyzing.

DataModel implementations represent a repository of information about users and their associated preferences for items.

The class diagram below tries to explain more

Refreshable Interface : Implementations of this interface have state that can be periodically refreshed. For example, an implementation instance might contain some pre-computed information that should be periodically refreshed. The refresh(Collection) method triggers such a refresh.

AbstractDataModel : Contains some features common to all implementations.

FileDataModel : A DataModel backed by a delimited file. This class expects a file where each line contains a user ID, followed by item ID, followed by optional preference value, followed by optional timestamp. Commas or tabs delimit fields:

Example code

DataModel model = new FileDataModel(new File(
"src/main/java/com/jugnu.mahout.LearnDatamodel/intro.csv"));

We declare a new file-backed data model and give it the path of the file that holds the data to be analyzed
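To make the file layout concrete, here is a minimal sketch that writes a small input file in the format described above and counts the preferences per user. The IDs, preference values and timestamp are made up for illustration, and the function name prefs_per_user is hypothetical; it uses plain awk and does not involve Mahout itself.

```shell
#!/bin/sh
# Made-up sample data in the layout FileDataModel expects:
# userID,itemID[,preference[,timestamp]]
sample=$(mktemp)
cat > "$sample" <<'EOF'
1,101,5.0
1,102,3.0
2,101,2.0
2,103,4.5,1360000000
EOF

# Hypothetical helper: count preference lines per user ID (first field).
prefs_per_user() {
    awk -F',' '{ count[$1]++ } END { for (u in count) print u, count[u] }' "$1" | sort
}

prefs_per_user "$sample"
```

A quick check like this is handy before pointing FileDataModel at a large file, since a malformed line is easier to spot here than inside a recommender run.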

Plugin execution not covered by lifecycle configuration

Plugin execution not covered by lifecycle configuration: org.scala-tools:maven-scala-plugin:2.15.2:testCompile (execution: default, phase: test-compile)

If your pom.xml was generated automatically, chances are that the pluginManagement tag is missing.

<build>
<pluginManagement>
<plugins>
....
</plugins>
</pluginManagement>
</build>

Just wrap the plugins element in the opening and closing pluginManagement tags and this should work.

Ubuntu proxy switcher

An add-on to quickly switch the Ubuntu proxy

http://erasmusjam.wordpress.com/2011/07/08/a-proxy-switcher-application-indicator-for-ubuntu/

Git Basics

Some of the basic stuff required to use Git

Install

sudo apt-get install git

Clone

What SVN calls checkout is known as clone in Git, so I will try to give the SVN counterparts here. The two work differently in the background, but it is fairly okay to start off and learn like this

git clone URL directory

URL is repository URL
directory is place in local system where you want to clone

e.g

git clone https://github.com/jagatsingh/mapreduce-text-processing.git MapReduceLibrary

Add

Now when you create or modify a file, you need to run the add operation to tell Git that you want this file to go into the next commit

Syntax

git add filename
or
git add directoryname

e.g

git add readme

In SVN, add is not required when you modify a file that is already under version control. In Git, we need to add the file again so that the change goes into the next commit

Commit

git commit -m "Commit message"

After we are done adding, we can commit the set of files to the repository.

e.g

git commit -m "Committing the readme for the code"
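The add and commit steps can be tried safely in a throwaway repository. This is a minimal sketch, assuming git is installed; the directory name demo, the file content and the user name and email are made-up values, set only for this scratch repository.

```shell
#!/bin/sh
# Work in a temporary directory so no real repository is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Create an empty repository instead of cloning one.
git init -q demo
cd demo

# Made-up identity, stored only in this demo repository's config.
git config user.name "Demo User"
git config user.email "demo@example.com"

# Create a file, stage it, and commit it.
echo "Sample readme" > readme
git add readme
git commit -q -m "Committing the readme for the code"

# Show the resulting history.
git log --oneline
```

Running git status between the steps shows the file moving from untracked, to staged, to committed.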

Looking for a more detailed tutorial? Read the link below

http://git-scm.com/book/en/Git-Basics-Recording-Changes-to-the-Repository

Export Hive query result to file

$ hive -e 'select * from myTable' > MyResultsFile.txt
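The Hive CLI separates result columns with tabs, so the exported file is tab-separated. If you would rather have commas, a small filter can sit between hive and the file. A minimal sketch follows; the function name tsv_to_csv is hypothetical, and a printf stands in for the Hive output so the snippet runs without a Hive install.

```shell
#!/bin/sh
# Hypothetical helper: turn tab-separated text on stdin into
# comma-separated text. Naive: fields that already contain commas
# are not quoted.
tsv_to_csv() {
    tr '\t' ','
}

# Stand-in for Hive output (two made-up rows of two columns).
printf '1\talice\n2\tbob\n' | tsv_to_csv
```

With Hive in place, the same filter would be used as: hive -e 'select * from myTable' | tsv_to_csv > MyResultsFile.csv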

Validate oozie workflow and coordinator

Many times when we submit a wrong Oozie workflow, it is difficult to find out what is going on.

We can validate our Oozie workflow, coordinator or bundle XML with the following command

oozie validate workflow.xml

It will show whether it is valid Oozie XML or not

Just replace workflow.xml with the path to your XML file

Why my Oozie job is waiting

Often I wonder why my Oozie job is waiting and not running.

If you are also here to find the reason for the same, the following command will be handy

oozie job -oozie http://myoozieserver:11000/oozie -log replaceooziejobid

Replace the Oozie server details

Replace replaceooziejobid with the ID of the Oozie job whose waiting reason you want to check.

The log shows details about the job and why it is waiting :)

Api diagrams from source

ApiViz is another magical tool which can be helpful in learning about the code

Let's see how it works

Download GraphViz

http://www.graphviz.org/Download..php

Set the environment variable

GRAPHVIZ_HOME=C:\Program Files\Graphviz 2.28\bin

ApiViz

http://code.google.com/p/apiviz/#Prerequisites

We can use it quickly through Maven

Edit the Maven pom.xml

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-javadoc-plugin</artifactId>
            <version>2.9</version>
            <configuration>
                <doclet>org.jboss.apiviz.APIviz</doclet>
                <docletArtifact>
                    <groupId>org.jboss.apiviz</groupId>
                    <artifactId>apiviz</artifactId>
                    <version>1.3.2.GA</version>
                </docletArtifact>
                <useStandardDocletOptions>true</useStandardDocletOptions>
                <charset>UTF-8</charset>
                <encoding>UTF-8</encoding>
                <docencoding>UTF-8</docencoding>
                <breakiterator>true</breakiterator>
                <version>true</version>
                <author>true</author>
                <keywords>true</keywords>
                <additionalparam>
                    -sourceclasspath ${project.build.outputDirectory}
                </additionalparam>
            </configuration>
        </plugin>
    </plugins>
</build>

Let's start the magic

mvn compile javadoc:javadoc

How to learn about internal details of Open source projects

Open source is a good place to learn, but many times the documentation is poor.

Mailing lists are always a great place to learn, but the source code is the one thing that speaks for itself.

I try to

Draw UML diagrams

Draw API diagrams

I want to share some of the free tools I use for this.

Want to read how to use yWorks?

See it here

http://jugnu-life.blogspot.com.au/2013/02/generate-uml-diagrams-for-java-from.html

I found that building the javadocs for a project with the yWorks UML doclet makes it easy to understand how things flow.

Read about using Apiviz here

http://jugnu-life.blogspot.com.au/2013/02/api-diagrams-from-source.html

So after downloading the source, the first thing I do is rebuild the javadocs to dig into the code

Want to have more fun ?

Read here

http://code.google.com/p/apiviz/#Sample

http://www.umlgraph.org/download.html

What are your ideas for learning from open source?

Generate UML diagrams for Java from source

yWorks has a very good doclet for generating UML diagrams automatically from Java source code

Here are the steps to use it.

I am using a Maven based build, although you can also use Ant. Read the documentation on the yWorks website. On Linux based systems you need an X server running for UML generation.

Step 1

Download the doclet jars

http://www.yworks.com/en/downloads.html#yDoc

Extract it to some location say

D:\Development\yworks-uml-doclet-3.0_01-jdk1.5

Step 2

Tell Maven you want to use this doclet

Add the following to your pom.xml

Add a property for the path

Add the build details for javadoc

<project>…..

<properties>
<yworks.uml.path>D:\Development\yworks-uml-doclet-3.0_01-jdk1.5</yworks.uml.path>
</properties>

<build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-javadoc-plugin</artifactId>
                <version>2.9</version>
                <configuration>
                    <doclet>ydoc.doclets.YStandard</doclet>
                    <docletPath>${yworks.uml.path}/lib/ydoc.jar:${yworks.uml.path}/lib/ydoc.jar:${yworks.uml.path}/resources:${yworks.uml.path}/lib</docletPath>
                    <additionalparam>-umlautogen</additionalparam>
                    <bootclasspath>${sun.boot.class.path}</bootclasspath>
                    <doctitle>${project.name} (${project.version})</doctitle>
                    <show>private</show>
                </configuration>
            </plugin>
        </plugins>
</build>

</project>

Step 3

Let the magic begin

mvn package javadoc:aggregate

or just

mvn javadoc:javadoc

Open the javadocs at

/target/site/apidocs/index.html

You can see magical pictures there :)

Credit : http://blog.keyboardplaying.org/2012/05/29/javadoc-uml-diagrams-maven/

Tip 1: If you want to generate UML for specific packages only, add the following to the configuration above

<sourcepath>${basedir}/src/main/java/com/MypakagePath</sourcepath>

Want to have more fun ?

Read here

http://code.google.com/p/apiviz/#Sample

http://www.umlgraph.org/download.html

Code Auto complete for Maven Eclipse projects

If you want the source code and javadocs of your dependencies downloaded as well while you are working, use the following setting.

I have the Eclipse M2E plugin

Window > Preferences > Maven

Select

Download Artifact Sources

Download Artifact JavaDoc

Related discussions

http://stackoverflow.com/questions/310720/get-source-jar-files-attached-to-eclipse-for-maven-managed-dependencies

http://stackoverflow.com/questions/2059431/get-source-jars-from-maven-repository

Add external repository in Project

I wanted to work against the Apache snapshots instead of the default repo

So here is what I did

In the project pom.xml add the following settings

<project>….

<repositories>
        <repository>
            <id>apache.snapshot</id>
            <url>https://repository.apache.org/content/repositories/snapshots/</url>
        </repository>
    </repositories>

</project>

Now you can use this inside your project

Something like

<dependency>
    <groupId>org.apache.mahout</groupId>
    <artifactId>mahout-core</artifactId>
    <version>0.8-SNAPSHOT</version>
</dependency>

Change Maven local repository path

Open the Maven settings file

conf/settings.xml

The path is specified in the localRepository element

The default path is ~/.m2/repository

You can change it to something like the following on Windows based systems

<localRepository>K:/MavenRepository/repository</localRepository>

http://www.mkyong.com/maven/where-is-maven-local-repository/

Zookeeper Tutorials

Basics

http://anismiles.wordpress.com/2010/06/08/zookeeper-primer/

Leader Selection Implementation

http://zookeeper.apache.org/doc/r3.2.2/recipes.html#sc_leaderElection

http://cyberroadie.wordpress.com/2011/11/24/implementing-leader-election-with-zookeeper/

Spring and Servlet based leader selection

https://github.com/erezmazor/projectx/tree/master/org.projectx.zookeeper

Concurrent queues

http://blog.cloudera.com/blog/2009/05/building-a-distributed-concurrent-queue-with-apache-zookeeper/

Distributed coordination

http://www.igvita.com/2010/04/30/distributed-coordination-with-zookeeper/

http://highscalability.com/blog/2008/7/15/zookeeper-a-reliable-scalable-distributed-coordination-syste.html

http://zookeeper-tutorial.blogspot.com.au/

Bindings in various languages

https://cwiki.apache.org/ZOOKEEPER/zkclientbindings.html

Curator is the Java Client library for Zookeeper

https://github.com/Netflix/curator

https://github.com/Netflix/curator/wiki

http://blog.palominolabs.com/2012/08/14/using-netflix-curator-for-service-discovery/

Presentations

https://cwiki.apache.org/ZOOKEEPER/eurosystutorial.data/part-1.pdf

https://cwiki.apache.org/ZOOKEEPER/eurosystutorial.data/part-2.pdf

https://cwiki.apache.org/ZOOKEEPER/eurosystutorial.data/part-3.pdf

https://cwiki.apache.org/ZOOKEEPER/eurosystutorial.data/part-4.pdf

Oozie Install in Bigtop

I assume you have followed the standard steps described on other websites for the Oozie install.

I am documenting a few things that took me time.

If we are configuring Oozie to use MySQL and the web console, then in

/usr/lib/oozie/libext

Copy the MySQL JDBC jar

Copy the ext js zip (in unzipped form) to the above directory

In a Bigtop based environment we have to unzip the ext 2.2 zip in

/var/lib/oozie

instead of just dropping the zip file as suggested on other pages

Also change the owner and group of the above two to oozie:oozie

Text Editor for Ubuntu

If you are looking for something like Notepad++ on Ubuntu, I found that the default gedit editor can be used in many cases. However, gedit does not offer as much functionality as Notepad++.

But I love gedit too.

To enable many of gedit's powerful functions

Go to

Edit > Preferences

There you will see a set of magical things we can do

View > Display line numbers while you are working

Colors > You can choose your theme

Plugins > This is the area where you can choose a set of plugins

I have enabled all of them, and you can start using them too.

One of the good things is syntax completion

Let's say we are writing a shell script

Create a shell script, script.sh

When we want to write some standard code like a for loop

Just write for and press TAB

for<Tab>

The editor automatically completes the syntax.

You can see the set of snippets available for various languages

The post below shares more details about the power of gedit

http://grigio.org/pimp_my_gedit_was_textmate_linux

Besides gedit there are other powerful editors like gvim, kate and emacs. Play with them too :)