Jugnu Life :-): May 2012

Hadoop 0.23 Maven dependencies

Might not be useful , but keeping them here for fast reference

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>0.23.1</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>0.23.1</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
    <version>0.23.1</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-annotations</artifactId>
    <version>0.23.1</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-core</artifactId>
    <version>0.23.1</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-app</artifactId>
    <version>0.23.1</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-hs</artifactId>
    <version>0.23.1</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-yarn-server-tests</artifactId>
    <version>0.23.1</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-yarn-common</artifactId>
    <version>0.23.1</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-yarn-server-common</artifactId>
    <version>0.23.1</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-yarn-server-resourcemanager</artifactId>
    <version>0.23.1</version>
</dependency>
            <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-shuffle</artifactId>
    <version>0.23.1</version>
</dependency>
            <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-yarn-server-nodemanager</artifactId>
    <version>0.23.1</version>
</dependency>
            <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-common</artifactId>
    <version>0.23.1</version>
</dependency>
            <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-yarn-api</artifactId>
    <version>0.23.1</version>
</dependency>
            <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-yarn-server-web-proxy</artifactId>
    <version>0.23.1</version>
</dependency>
            <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-auth</artifactId>
    <version>0.23.1</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-rumen</artifactId>
    <version>0.23.1</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-archives</artifactId>
    <version>0.23.1</version>
</dependency>
            <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-yarn-site</artifactId>
    <version>0.23.1</version>
</dependency>
            <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-extras</artifactId>
    <version>0.23.1</version>
</dependency>
            <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-yarn-applications-distributedshell</artifactId>
    <version>0.23.1</version>
</dependency>
            <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-distcp</artifactId>
    <version>0.23.1</version>
</dependency>
            <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-yarn-server</artifactId>
    <version>0.23.1</version>
</dependency>
            <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-assemblies</artifactId>
    <version>0.23.1</version>
</dependency>
            <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client</artifactId>
    <version>0.23.1</version>
</dependency>
            <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-yarn</artifactId>
    <version>0.23.1</version>
</dependency>
            <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs-httpfs</artifactId>
    <version>0.23.1</version>
</dependency>
            <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-main</artifactId>
    <version>0.23.1</version>
</dependency>
            <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce</artifactId>
    <version>0.23.1</version>
</dependency>
            <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-project-dist</artifactId>
    <version>0.23.1</version>
</dependency>
            <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-yarn-applications</artifactId>
    <version>0.23.1</version>
</dependency>
            <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-examples</artifactId>
    <version>0.23.1</version>
</dependency>
            <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-project</artifactId>
    <version>0.23.1</version>
</dependency>

Hadoop 0.20.2 bin folder purpose

In my previous post i tried to write overview of folder structure of Hadoop 0.20.2.

In this post the files and their purpose is discussed in bit more detail.

A similar kind of post exists for Hadoop 0.23 version if you are interested to read.

hadoop-config.sh

This is executed with each hadoop command , it does two things mainly. If we specify –conf paramter then it sets the HADOOP_CONF_DIR parameter to the same , secondly if we set parameter like –hosts then it decides to use master or slaves file

hadoop

This is the main hadoop command script file. If we just write on the command prompt just

# hadoop

It shows the list of commands which can use , it shows detailed help. The screen shot below also shows the same.

Now when we type one of the commands from help above it calls some Class file written in the file above.

e,g hadoop namenode -format

if [ "$COMMAND" = "namenode" ] ; then
CLASS='org.apache.hadoop.hdfs.server.namenode.NameNode'
HADOOP_OPTS="$HADOOP_OPTS $HADOOP_NAMENODE_OPTS"

So when we call namenode it calls NameNode.java file mentioned above and pass the arguments further. Similarly for other hadoop command

This file also calls

conf/hadoop-env.sh where various hadoop specific environment variables and configurations are set. The details about that can be read in other blog post for conf directory

hadoop-env.sh can be used to pass daemon specific environment variables.

hadoop-daemon.sh

Runs a Hadoop command as a daemon

hadoop-daemons.sh

Run a Hadoop command on all slave hosts. It calls slaves.sh and then calls hadoop-daemon.sh to run particular command passed

slaves.sh

Run a shell command on all slave hosts

start-dfs.sh

Start hadoop dfs daemons. Optionally we can also tell to upgrade or rollback dfs state. We run this on master node.
The implementation is
# start dfs daemons
# start namenode after datanodes, to minimize time namenode is up w/o data
# note: datanodes will log connection errors until namenode starts
"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR start namenode $nameStartOpt
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR start datanode $dataStartOpt
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR --hosts masters start secondarynamenode
nameStartOpt and dataStartOpt are decided on parameters we pass

start-mapred.sh

Start hadoop map reduce daemons. Run this on master node.

# start mapred daemons
# start jobtracker first to minimize connection errors at startup
"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR start jobtracker
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR start tasktracker

start-all.sh

Start all hadoop daemons. Run this on master node.
It calls start-dfs.sh and start-mapred.sh internally

stop-dfs.sh

Stop hadoop DFS daemons. Run this on master node.
"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR stop namenode
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR stop datanode
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR --hosts masters stop secondarynamenode

stop-mapred.sh

Stop hadoop map reduce daemons. Run this on master node.

"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR stop jobtracker
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR stop tasktracker

stop-all.sh

Stop all hadoop daemons. Run this on master node.
"$bin"/stop-mapred.sh --config $HADOOP_CONF_DIR
"$bin"/stop-dfs.sh --config $HADOOP_CONF_DIR

start-balancer.sh

Start balancer daemon at particular node for data block balancing

"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR start balancer $@

stop-balancer.sh

# Stop balancer daemon.
# Run this on the machine where the balancer is running
"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR stop balancer

rcc.sh

The Hadoop record compiler
The implementation for this is present in
CLASS='org.apache.hadoop.record.compiler.generated.Rcc'

Hadoop 0.20.2 Directory Structure

This post explains the directory structure of the Hadoop 0.20.2 version. Along with directory structure attempt has been made to explain the usage of files present in each of those directories.

A separate post for each of the directories explains the detailed purpose of usage of files present in them.

A similar kind of post is present for 0.23 version of Hadoop

bin

This is the main folder where all the shell script files like start-all.sh , stop-all.sh etc are present. The main hadoop binary is also present in this folder.

If you want to read in more detail , you can read this post on hadoop 0.20.2 bin folder files use and purpose.

c++

Contains the C++ libs for x86 and 64 bit architecture of Pipes etc. Hadoop Pipes allows C++ code to use the Hadoop

conf

This is the directory where all configuration files are stored. You specify here the set of slaves present in your cluster , kind of cluster configuration e.g standalone , pseudo mode , fully distributed in this folder

contrib

Set of community developed useful stuff which can be used. e.g thriftfs , hdfsproxy etc

docs

Documentation and API docs for Hadoop distribution

ivy

ivy , pom definition of hadoop code which is used while building it from the source.

lib

Set of third party libraries , jars used by Hadoop

librecordio

Some lib :)

logs

Hadoop daemon log files , it stores the logs for various daemons running in hadoop. This is the place to see when you have some error in hadoop

src

Source code for Hadoop distribution , there is one build.xml file also you can build the full hadoop using that

webapps

webapps contains the files which are used to show to web frontend of the hadoop namenode at localhost:500070 and 500030 etc ports which gives us complete picture about the hadoop daemons what is happening at this time , which job is under process. The servlets are defined in web.xml for each of them and they run on the top of jetty server

Hadoop NoHostToRoute

If you get NoHostToRoute error message then chances are that reason could be one of the those which are mentioned here.

http://wiki.apache.org/hadoop/NoRouteToHost

Hadoop works on IP4 networks so if you have IP6 network then you have to disable it by following command

sudo sed -i 's/net.ipv6.bindv6only\ =\ 1/net.ipv6.bindv6only\ =\ 0/' \
/etc/sysctl.d/bindv6only.conf && sudo invoke-rc.d procps restart

Inside Hadoop 0.23 IP6 check is made inside , libexec/hadoop-config.sh file by following code.

# check if net.ipv6.bindv6only is set to 1
bindv6only=$(/sbin/sysctl -n net.ipv6.bindv6only 2> /dev/null)
if [ -n "$bindv6only" ] && [ "$bindv6only" -eq "1" ] && [ "$HADOOP_ALLOW_IPV6" != "yes" ]
then
echo "Error: \"net.ipv6.bindv6only\" is set to 1 - Java networking could be broken"
echo "For more info: http://wiki.apache.org/hadoop/HadoopIPv6"
exit 1
fi

The following article in Hadoop wiki explains the Hadoop and IP6

http://wiki.apache.org/hadoop/HadoopIPv6

Pentaho Reporting BI Server

After you are done with creation of report using report tool you can publish report to BI server

Go to

biserver-ce/pentaho-solutions/system/publisher_config.xml

Change the publisher password to something

Use the same as publisher password while using report creator tool

How to run simple Hadoop programs

How to run simple Hadoop programs

Let us write simple hadoop program and try to run it in Hadoop

Copy the following code in your eclipse java project in some class file

Configure the eclipse path to remove any errors. If you need help in setting up eclipse for hadoop then please see other post.

package org.jagat.hdfs;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HDFSCopyAll {

   public static void main(String[] args) throws IOException {

        Configuration conf = new Configuration();

        FileSystem hdfs = FileSystem.get(conf);
        FileSystem local = FileSystem.getLocal(conf);

        FileStatus[] localinput = local.listStatus(new Path(
                "/home/hadoop/software/20/pig-0.10.0/docs/api"));

        for(int i=0;i<localinput.length;i++){
            System.out.println(   localinput[i].getLen());

        }

   }

}

It is just using Hadoop API to get the length of files present in api directory.

The intention of this post is not to teach anything about hadoop api or about mapreduce programs , but just how to run a hadoop program you write.

Change the path above (/home/hadoop/software/20/pig-0.10.0/docs/api) to some real path present in your computer.

Now its time to package this as a Jar

Go to File > Export

Eclipse will show us menu.

Choose Main class as the above class name HDFSCopyAll and create a jar in your computer

Now its time to run it.

Open terminal and go to place where you made the jar

and invoke the jar as follows

$hadoop jar learnHadoop.jar

This will run and show you the output.

If you are stuck somewhere just post message in comments below.

Thanks for reading

hadoop ls: Cannot access .: No such file or directory.

Hadoop by default access /user/YOURLOGINNAME

when we try to ls a directory

So just do the following

hadoop fs -mkdir /user
hadoop fs -mkdir /user/YOURLOGINNAME

Then do hadoop fs -ls

It would work

Thanks for reading :)

Hive MySql setup configuration

Hive use Derby database by default for storing its data.

But it has limitation that only one user can access it and the data cannot be shared among multiple machines.

So we can use MySQL Database to store the metadata in hive.

Go to hive-site.xml

and configure the following properties

<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://hostname:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hadoop</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hadoop</value>
</property>

Create one use hadoop in MySQL and give its password chosen above with grant all privileges

CREATE USER 'hadoop'@'hostname' IDENTIFIED BY 'hadoop';
GRANT ALL PRIVILEGES ON *.* TO 'hadoop' WITH GRANT OPTION;

Thats it :)

Start using Hive

Hadoop Study Group

Cloudera Hadoop certification now available worldwide 1 May 2012

At last its 1 May 2012

Cloudera has opened certifications through vue to worldwide people

Details are as follows from

Developer Exam

Exam Name: Cloudera Certified Developer for Apache Hadoop
Current Version: (CCD-333)
Certification Requirement: Required for Cloudera Certified Developer for Apache Hadoop (CCDH)
Number of Questions: 60
Time Limit: 90 minutes
Passing Score : 67%
Languages: English (Japanese forthcoming)

Adminstrator exam

Exam Name: Cloudera Certified Administrator for Apache Hadoop
Current Version: (CCA-332)
Certification Requirement: Required for Cloudera Certified Administrator for Apache Hadoop (CCAH)
Number of Questions: 30
Time Limit: 60 minutes
Passing Score: 67%
Languages: English (Japanese forthcoming)

You can register for it at http://www.pearsonvue.com/cloudera.

Questions can be single choice and multiple choice correct answer types

Syllabus guidelines for Developer exam

Core Hadoop Concepts
Recognize and identify Apache Hadoop daemons and how they function both in data storage and processing. Understand how Apache Hadoop exploits data locality. Given a big data scenario, determine the challenges to large-scale computational models and how distributed systems attempt to overcome various challenges posed by the scenario.

Storing Files in Hadoop
Analyze the benefits and challenges of the HDFS architecture, including how HDFS implements file sizes, block sizes, and block abstraction. Understand default replication values and storage requirements for replication. Determine how HDFS stores, reads, and writes files. Given a sample architecture, determine how HDFS handles hardware failure.

Job Configuration and Submission
Construct proper job configuration parameters, including using JobConf and appropriate properties. Identify the correct procedures for MapReduce job submission. How to use various commands in job submission (“hadoop jar” etc.)

Job Execution Environment
Given a MapReduce job, determine the lifecycle of a Mapper and the lifecycle of a Reducer. Understand the key fault tolerance principles at work in a MapReduce job. Identify the role of Apache Hadoop Classes, Interfaces, and Methods. Understand how speculative execution exploits differences in machine configurations and capabilities in a parallel environment and how and when it runs.

Input and Output
Given a sample job, analyze and determine the correct InputFormat and OutputFormat to select based on job requirements. Understand the role of the RecordReader, and of sequence files and compression.

Job Lifecycle
Analyze the order of operations in a MapReduce job, how data moves from place to place, how partitioners and combiners function, and the sort and shuffle process.

Data processing
Analyze and determine the relationship of input keys to output keys in terms of both type and number, the sorting of keys, and the sorting of values. Given sample input data, identify the number, type, and value of emitted keys and values from the Mappers as well as the emitted data from each Reducer and the number and contents of the output file(s).

Key and Value Types
Given a scenario, analyze and determine which of Hadoop’s data types for keys and values are appropriate for the job. Understand common key and value types in the MapReduce framework and the interfaces they implement.

Common Algorithms and Design Patterns
Evaluate whether an algorithm is well-suited for expression in MapReduce. Understand implementation and limitations and strategies for joining datasets in MapReduce. Analyze the role of DistributedCache and Counters.

The Hadoop Ecosystem
Analyze a workflow scenario and determine how and when to leverage ecosystems projects, including Apache Hive, Apache Pig, Sqoop and Oozie. Understand how Hadoop Streaming might apply to a job workflow.

Syllabus guidelines for Admin exam

Apache Hadoop Cluster Core Technologies
Daemons and normal operation of an Apache Hadoop cluster, both in data storage and in data processing. The current features of computing systems that motivate a system like Apache Hadoop.

Apache Hadoop Cluster Planning
Principal points to consider in choosing the hardware and operating systems to host an Apache Hadoop cluster.

Apache Hadoop Cluster Management
Cluster handling of disk and machine failures. Regular tools for monitoring and managing the Apache Hadoop file system

Job Scheduling
How the default FIFO scheduler and the FairScheduler handle the tasks in a mix of jobs running on a cluster.

Monitoring and Logging
Functions and features of Apache Hadoop’s logging and monitoring systems.

For more details please see

http://university.cloudera.com/certification.html

If you want to form study group with me for preparation then please message me at jagatsingh [at] gmail [dot] com

See more details at Hadoop Study Group

Hadoop Tutorial Series

Found very interesting Hadoop tutorial series.

http://www.philippeadjiman.com/blog/2009/12/07/hadoop-tutorial-part-1-setting-up-your-mapreduce-learning-playground/

I am also planning to write one similar series of posts :) based on latest Hadoop version.

Error retrieving next row Hive Pentaho

Error retrieving next row

Pentaho

Download the latest version of

hive-jdbc-0.7.0-pentaho-SNAPSHOT.jar.

Hive Could not load shims in class null , Pentaho

Error connecting to database [test] : org.pentaho.di.core.exception.KettleDatabaseException:
Error occured while trying to connect to the database

Error connecting to database: (using class org.apache.hadoop.hive.jdbc.HiveDriver)
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: java.lang.RuntimeException: Could not load shims in class null

Solution:

This error is due to incompatibility between version of hadoop and hive supported

Drop the supported Hive , Hadoop version jar in pentaho/lib/jdbc directory

Custom Partitioner in Hadoop

I found few very very good links for writing own custom partitioner class for hadoop , Just wanted to share with you all.

A very good course being taught for Cloud computing and Map Reduce

http://www.cs.bgu.ac.il/~dsp112/The_Map-Reduce_Pattern

A mailing list discussion for writing Custom partitioner with Job context use by Configurable interface

http://lucene.472066.n3.nabble.com/Custom-partitioner-for-hadoop-td1335146.html

A very good blog post in how to use and write Partitioner in new Hadoop API

http://cornercases.wordpress.com/2011/05/06/an-example-configurable-partitioner/

Besides this if you want to see implementations , there are by default following present in hadoop

All are under Package org.apache.hadoop.mapreduce.lib.partition

BinaryPartitioner

Partition keys using a configurable part of the bytes array

HashPartitioner

Partition keys by their Object.hashCode().

KeyFieldBasedPartitioner

Defines a way to partition keys based on certain key fields

TotalOrderPartitioner

Partitioner effecting a total order by reading split points from an externally generated source.

If you know some good link , please do share here in comments.

Thanks for reading.