Go to http://www.infochimps.com/datasets
Filter by Free data sets available
Filter by Downloadable data only
Choose the data type which is of interest to you.
Happy Hadooping :)
I downloaded PigEditor from http://romainr.github.com/PigEditor/ and installed it in Eclipse
I was not able to configure PigPen, which is said to be a better, more feature-rich editor than this one.
I will try to get hold of PigPen as well.
In case you want to use PigEditor, here is what you have to do:
In the Eclipse update site dialog, use the URL
http://romainr.github.com/PigEditor/updates/
Let it install
Restart Eclipse
Create a New General project
It will ask you to allow Xtext use with the project; say yes and you are done.
Pig return codes
Value | Meaning | Comment |
---|---|---|
0 | Success | |
1 | Retriable failure | Would be retried |
2 | Failure | |
3 | Partial failure | Used with multiquery |
4 | Illegal arguments passed to Pig | |
5 | IOException thrown | Would usually be thrown by a UDF |
6 | PigException thrown | Usually means a Python UDF raised an exception |
7 | ParseException thrown | Can happen after variable parsing if variable substitution is being done |
8 | Throwable thrown | An unexpected exception |
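Since Pig returns these values as the process exit status, a wrapper shell script can branch on them. Below is a minimal sketch (the script name myscript.pig is just a placeholder):

#!/bin/bash
# Run a Pig script and react to its exit status (see the table above)
pig myscript.pig
ret=$?
if [ $ret -eq 0 ]; then
    echo "Pig job succeeded"
elif [ $ret -eq 3 ]; then
    echo "Partial failure: with multiquery, some jobs failed while others succeeded"
else
    echo "Pig job failed with return code $ret"
    exit $ret
fi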
Apache Pig is a platform to analyze large data sets.
In simple terms, you have lots and lots of data on which you need to do some processing or analysis. One way is to write MapReduce code and then run that processing over the data.
The other way is to write Pig scripts, which are in turn converted into MapReduce code that processes your data.
Pig consists of two parts:
Pig Latin is a scripting language which allows you to describe how data flowing from one or more inputs should be read, how it should be processed, and where it should be stored.
The flows can be simple or complex, with some processing applied in between, and data can be picked up from multiple inputs.
We can say Pig Latin describes a directed acyclic graph (DAG) where the edges are data flows and the nodes are operators that process the data.
The job of the engine is to execute the data flow written in Pig Latin in parallel on the Hadoop infrastructure.
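As a rough illustration of such a data flow (just a sketch; the file names, field layout, and filter condition are made up for this example), the shell snippet below writes a three-step Pig Latin script, load then filter then store, and runs it locally:

# Write a tiny data flow: one input, one operator, one output
cat > flow_example.pig <<'EOF'
-- load -> filter -> store : a very small directed acyclic graph
records = load 'input.txt' as (word:chararray, count:int);
big     = filter records by count > 10;
store big into 'big_words';
EOF

# Execute the flow against the local filesystem
pig -x local flow_example.pig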
Why is Pig required when we can code everything in MR?
Pig provides all the standard data processing operations like sort, group, join, filter, order by, and union right inside Pig Latin.
In MR we have to do lots of manual coding for these.
Pig optimizes Pig Latin scripts while translating them into MR jobs.
It creates an optimized set of MapReduce jobs to run on Hadoop.
It takes much less time to write a Pig Latin script than to write the corresponding MR code.
Where Pig is useful
Transactional ETL data pipelines (the most common use)
Research on raw data
Iterative processing
You can read next about how to install Pig
Apache Pig can be downloaded from http://pig.apache.org
Download the latest release from its website
Unzip the downloaded tar file
Set the environment variables in your system as
export PIG_HOME="/home/hadoop/software/pig-0.9.2"
export PATH=$PATH:$PIG_HOME/bin
Adjust PIG_HOME to point to the place where you extracted Pig, and add its bin directory to your PATH.
If you plan to run Pig on a Hadoop cluster, then one additional variable needs to be set:
export PIG_CLASSPATH="/home/hadoop/software/hadoop-1.0.1/conf"
It tells Pig where to look for hdfs-site.xml and the other Hadoop configuration files.
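As a quick sanity check (the paths are the ones used above; adjust them to your own install), you can confirm that the variable points at a directory which actually holds the Hadoop config files:

# The directory PIG_CLASSPATH points to should contain the Hadoop config files
ls $PIG_CLASSPATH/core-site.xml $PIG_CLASSPATH/hdfs-site.xml $PIG_CLASSPATH/mapred-site.xml
# With this set, plain `pig` starts in mapreduce mode against that cluster,
# while `pig -x local` keeps working against the local filesystem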
Log out and log back in (or source your profile) so that the variables take effect.
That's it. Now let's test the installation.
On the command prompt, type
# pig -h
It should show the help for Pig and its various commands.
Done :)
Next you can read about
How to run your first Pig script in local mode
Or about various Pig running modes
The example below explains how to start programming in Pig.
I followed the book Programming Pig.
This post assumes that you have already installed Pig on your computer. If you need help, you can read the tutorial to install Pig.
So let's get started writing our first Pig program, using the code example given in chapter 2 of the book.
Download the code examples from the GitHub repository (link below):
https://github.com/alanfgates/programmingpig
Pig can run in local mode and mapreduce mode.
Local mode means that the source data is picked up from a directory on your local computer. So to run a program you go to the directory where the data is and then run the Pig script to analyze it.
I downloaded the code examples from the link above.
Now I go to the data directory where all the data is present.
# cd /home/hadoop/Downloads/PigBook/data
Change the path depending upon where you copied the code on your computer.
Now let's start Pig in local mode:
# pig -x local
The -x local flag says: Dear Pig, let's work locally on this computer.
The output is similar to below
2012-03-11 11:44:13,346 [main] INFO org.apache.pig.Main - Logging error messages to: /home/hadoop/Downloads/PigBook/data/pig_1331446453340.log
2012-03-11 11:44:13,720 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop filesystem at: file:///
It will drop you into the grunt> shell; Grunt is the shell used to write Pig scripts.
Let's list all the files present in the data directory:
grunt> ls
Output is shown below
file:/home/hadoop/Downloads/PigBook/data/webcrawl<r 1> 255068
file:/home/hadoop/Downloads/PigBook/data/baseball<r 1> 233509
file:/home/hadoop/Downloads/PigBook/data/NYSE_dividends<r 1> 17027
file:/home/hadoop/Downloads/PigBook/data/NYSE_daily<r 1> 3194099
file:/home/hadoop/Downloads/PigBook/data/README<r 1> 980
file:/home/hadoop/Downloads/PigBook/data/pig_1331445409976.log<r 1> 823
It shows the list of files present in that folder (data).
Let's run a program. In chapter 2 there is one Pig script.
Go to the PigBook/examples/chap2 folder; there is a script named average_dividend.pig.
The code of the script is as follows:
dividends = load 'NYSE_dividends' as (exchange, symbol, date, dividend);
grouped = group dividends by symbol;
avg = foreach grouped generate group, AVG(dividends.dividend);
store avg into 'average_dividend';
In plain English, the above code says the following:
Load the NYSE_dividends file, which contains the fields exchange, symbol, date, and dividend
Group the records in that file by symbol
Calculate the average dividend for each group, and
Store the averaged results in the average_dividend folder
Result
After lots of processing the output will look like:
Input(s):
Successfully read records from: "file:///home/hadoop/Downloads/PigBook/data/NYSE_dividends"
Output(s):
Successfully stored records in: "file:///home/hadoop/Downloads/PigBook/data/average_dividend"
Job DAG:
job_local_0001
2012-03-11 11:47:10,994 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
To check the output, go to the average_dividend directory, which is created within the data directory (remember we started Pig in this directory).
There is one MR part file, part-r-00000, that has the final results.
That's it. Pig Latin has done all the magic behind the scenes.
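Instead of typing the statements into Grunt, you can also run the script file in batch mode from the data directory. A small sketch (the relative path assumes the PigBook layout described above):

cd /home/hadoop/Downloads/PigBook/data
# Pig resolves 'NYSE_dividends' and 'average_dividend' relative to this directory
pig -x local ../examples/chap2/average_dividend.pig
# Inspect the result written by the store statement
cat average_dividend/part-r-00000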
Coming next: running Pig Latin in mapreduce mode.
If you are using Sqoop 1.4.1 and you try to build it, you can get an error like:
org.apache.hbase#hbase;0.92.0-SNAPSHOT: not found
This is because HBase 0.92.0 has been released, so the SNAPSHOT dependency can no longer be resolved.
Just make the changes shown in the review below to build.xml and run the build again:
https://reviews.apache.org/r/4169/diff/
If you wrap the free-form query in double quotes, you will have to use \$CONDITIONS instead of just $CONDITIONS to disallow your shell from treating it as a shell variable. For example, a double quoted query may look like: "SELECT * FROM x WHERE a='foo' AND \$CONDITIONS"
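For instance, a free-form query import might look like the sketch below (the connection details and table name are taken from the examples in these posts, the split column id is hypothetical, and --query, --split-by and --target-dir are standard sqoop-import options):

$ sqoop-import --connect jdbc:mysql://localhost:3306/sqoop --username root -P \
    --query "SELECT * FROM Employee_Table WHERE \$CONDITIONS" \
    --split-by id --target-dir employeeQueryImport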
The free-form query facility is limited to simple queries where there are no ambiguous projections and no OR conditions in the WHERE clause. Use of complex queries, such as queries that have sub-queries or joins leading to ambiguous projections, can lead to unexpected results.
Example: import using Sqoop into a target directory in HDFS
$ sqoop-import --connect jdbc:mysql://localhost:3306/sqoop --username root --password root --table Employee_Table --target-dir employeeImportAll --m 1
The above command will import the data present in Employee_Table in the sqoop database into the HDFS directory named employeeImportAll.
After the import is done we can check that the data is present.
Just see the output of each of the 3 commands below, one by one.
hadoop@jj-VirtualBox:~$ hadoop fs -ls
hadoop@jj-VirtualBox:~$ hadoop fs -ls /user/hadoop/employeeImportAll
hadoop@jj-VirtualBox:~$ hadoop fs -cat /user/hadoop/employeeImportAll/part-m-00000
All the results are present as a comma separated file.
12/03/05 23:44:31 ERROR tool.ImportTool: Error during import: No primary key could be found for table Employee_Table. Please specify one with --split-by or perform a sequential import with '-m 1'.
Sample query on which I got this error:
$ sqoop-import --connect jdbc:mysql://localhost:3306/sqoop --username root --password root --table Employee_Table --target-dir employeeImportAll
Explanation
While performing parallel imports, Sqoop needs a criterion by which it can split the workload. Sqoop uses a splitting column to split the workload. By default Sqoop will identify the primary key column (if present) in a table and use it as the splitting column.
The low and high values of the splitting column are retrieved from the database, and the map tasks operate on evenly sized components of the total range.
For example, if you had a table with a primary key column id whose minimum value was 0 and maximum value was 1000, and Sqoop was directed to use 4 tasks, Sqoop would run four processes which each execute SQL statements of the form SELECT * FROM sometable WHERE id >= lo AND id < hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750), and (750, 1001) in the different tasks.
Solution
$ sqoop-import --connect jdbc:mysql://localhost:3306/sqoop --username root --password root --table Employee_Table --target-dir employeeImportAll --m 1
Just add --m 1; it tells Sqoop to do a sequential import with 1 mapper.
Another solution is to tell Sqoop to use a particular column as the split column:
$ sqoop-import --connect jdbc:mysql://localhost:3306/sqoop --username root --password root --table Employee_Table --target-dir employeeImportAll --split-by columnName
Spring integration with Hadoop
Spring Hadoop provides support for writing Apache Hadoop applications that benefit from the features of Spring, Spring Batch and Spring Integration.
http://www.springsource.org/spring-data/hadoop
Update : 6 April 2013
Cloudera Certified Administrator for Apache Hadoop (CCAH)
To earn a CCAH certification, candidates must pass an exam designed to test a candidate’s fluency with the concepts and skills required in the following areas:
If you are interested in the Developer exam, then you should read the other post:
http://jugnu-life.blogspot.in/2012/03/cloudera-certified-developer-for-apache.html
Details for the Admin exam are given here, along with where to prepare from.
Test Name: Cloudera Certified Administrator for Apache Hadoop CDH4 (CCA-410)
Number of Questions: 60
Time Limit: 90 minutes
Passing Score: 70%
Languages: English, Japanese
English Release Date: November 1, 2012
Japanese Release Date: December 1, 2012
Price: USD $295, AUD285, EUR225, GBP185, JPY25,500
Cloudera Certified Developer for Apache Hadoop (CCDH)
Update : 6 April 2013
Cloudera has added exam learning resources on its website; please read this link for the latest information:
http://university.cloudera.com/certification/prep/ccdh.html
http://jugnu-life.blogspot.in/2012/05/cloudera-hadoop-certification-now.html
Syllabus and exam contents:
http://university.cloudera.com/certification.html
To earn a CCDH certification, candidates must pass an exam designed to test a candidate’s fluency with the concepts and skills required in the following areas:
If you are interested in the Administrator exam, then you should read the other post:
http://jugnu-life.blogspot.in/2012/03/cloudera-certified-administrator-for.html
The exam syllabus for the Developer exam and study sources are mentioned below.
Snappy is a compression / decompression library built using C++.
The main advantage of Snappy is its high speed when compressing or decompressing data.
http://code.google.com/p/snappy/
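In the Hadoop world, Snappy is typically used to compress intermediate map output. A rough sketch of enabling it for a single job from the command line (this assumes the native Snappy libraries are available to your Hadoop build, uses the bundled wordcount example as a stand-in job, uses the Hadoop 1.x property names, and $HADOOP_INSTALL stands for wherever Hadoop is installed):

hadoop jar $HADOOP_INSTALL/hadoop-examples-1.0.1.jar wordcount \
    -D mapred.compress.map.output=true \
    -D mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
    input output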
Accumulo
Accumulo is a distributed key/value store that provides expressive, cell-level access labels.
Accumulo is a sorted, distributed key/value store based on Google's BigTable design. It is built on top of Apache Hadoop, Zookeeper, and Thrift.
http://www.covert.io/post/18605091231/accumulo-and-pig
The above post explains use of Pig and Accumulo together.
If you are following on from the previous Sqoop import tutorial (http://jugnu-life.blogspot.in/2012/03/sqoop-import-tutorial.html), let's try to do a conditional import from an RDBMS in Sqoop.
$ sqoop import --connect jdbc:mysql://localhost/CompanyDatabase --table Customer --username root --password mysecret -m 1
The sqoop command above would import all the rows present in the table Customer.
Let's say that the Customer table is something like this:
CustomerName | DateOfJoining |
---|---|
Adam | 2012-12-12 |
John | 2002-1-3 |
Emma | 2011-1-3 |
Tina | 2009-3-8 |
Now let's say we want to import only those customers who joined after 2005-1-1.
We can modify the sqoop import as follows:
$ sqoop import --connect jdbc:mysql://localhost/CompanyDatabase --table Customer --username root --password mysecret --where "DateOfJoining > '2005-1-1' "
This would import only 3 records from the above table (John is excluded because he joined in 2002).
Happy sqooping :)
This tutorial explains how to use Sqoop to import data from an RDBMS to HDFS. The tutorial is divided into multiple posts to cover the various functionalities offered by sqoop import.
The general syntax for import is
$ sqoop-import (generic-args) (import-args)
Argument | Description |
---|---|
--connect <jdbc-uri> | Specify JDBC connect string |
--connection-manager <class-name> | Specify connection manager class to use |
--driver <class-name> | Manually specify JDBC driver class to use |
--hadoop-home <dir> | Override $HADOOP_HOME |
--help | Print usage instructions |
-P | Read password from console |
--password <password> | Set authentication password |
--username <username> | Set authentication username |
--verbose | Print more information while working |
--connection-param-file <filename> | Optional properties file that provides connection parameters |
Example run
$ sqoop import --connect jdbc:mysql://localhost/CompanyDatabase --table Customer --username root --password mysecret -m 1
When we run this sqoop command, it will try to connect to the MySQL database named CompanyDatabase with username root, password mysecret, and one map task.
Generally it's not recommended to give the password on the command line; instead it's advisable to use the -P parameter, which tells Sqoop to ask for the password on the console.
One more thing to notice is the use of localhost as the database address; if you are running your Hadoop cluster in distributed mode, then you should give the full hostname or IP of the database server.
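For example, the same import against a remote database, prompting for the password instead of putting it on the command line (the hostname and port here are just placeholders):

$ sqoop import --connect jdbc:mysql://dbserver.example.com:3306/CompanyDatabase \
    --table Customer --username root -P -m 1
# Sqoop prompts for the password on the console before connecting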
The purpose of this post is to explain how to install Hadoop on your computer. The post assumes that you have a Linux based system available for use; I am doing this on an Ubuntu system.
If you want to know how to install the latest version, Hadoop 2.0, then see the Hadoop 2.0 Install Tutorial.
Before you begin, create a separate user named hadoop in the system and do all these operations as that user.
This document covers the Steps to
1) Configure SSH
2) Install JDK
3) Install Hadoop
Update your repository
#sudo apt-get update
You can directly copy the commands from this post and run them on your system.
Hadoop requires that the various systems present in the cluster can talk to each other freely. Hadoop uses SSH to prove identity for these connections.
Let's Download and configure SSH
#sudo apt-get install openssh-server openssh-client
#ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
#cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
#sudo chmod go-w $HOME $HOME/.ssh
#sudo chmod 600 $HOME/.ssh/authorized_keys
#sudo chown `whoami` $HOME/.ssh/authorized_keys
Testing your SSH
#ssh localhost
Say yes
It should open a connection over SSH.
#exit
This will close the SSH session.
Java 1.6 is mandatory for running Hadoop.
Let's download and install the JDK:
#sudo mkdir /usr/java
#cd /usr/java
#sudo wget http://download.oracle.com/otn-pub/java/jdk/6u31-b04/jdk-6u31-linux-i586.bin
Wait till the jdk download completes
Install java
#sudo chmod o+w jdk-6u31-linux-i586.bin
#sudo chmod +x jdk-6u31-linux-i586.bin
#sudo ./jdk-6u31-linux-i586.bin
Now comes the Hadoop :)
Let's download and configure Hadoop in pseudo-distributed mode. You can read more about the various modes on the Hadoop website.
Download the latest hadoop version from its website
http://hadoop.apache.org/common/releases.html
Download the Hadoop 1.0.x tar.gz from the Hadoop website.
Extract it into some folder (say /home/hadoop/software/20/).
All the software in this tutorial has been downloaded to that location.
For other modes (standalone and fully distributed) please see the Hadoop documentation.
Go to the conf directory in the Hadoop folder, open core-site.xml, and add the following property inside the empty configuration tags:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost</value>
</property>
</configuration>
Similarly do for
conf/hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
conf/mapred-site.xml:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:8021</value>
</property>
</configuration>
Environment variables
In the conf/hadoop-env.sh file, change JAVA_HOME to the location where you installed Java,
e.g.
export JAVA_HOME=/usr/java/jdk1.6.0_31
Configure the environment variables for the JDK and Hadoop as follows:
Go to the ~/.profile file in the current user's home directory
Add the following
You can change the variable paths if you have installed hadoop and java at some other locations
export JAVA_HOME="/usr/java/jdk1.6.0_31"
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_INSTALL="/home/hadoop/software/hadoop-1.0.1"
export PATH=$PATH:$HADOOP_INSTALL/bin
Testing your installation
Format the HDFS
# hadoop namenode -format
hadoop@jj-VirtualBox:~$ start-dfs.sh
starting namenode, logging to /home/hadoop/software/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-namenode-jj-VirtualBox.out
localhost: starting datanode, logging to /home/hadoop/software/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-datanode-jj-VirtualBox.out
localhost: starting secondarynamenode, logging to /home/hadoop/software/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-secondarynamenode-jj-VirtualBox.out
hadoop@jj-VirtualBox:~$ start-mapred.sh
starting jobtracker, logging to /home/hadoop/software/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-jobtracker-jj-VirtualBox.out
localhost: starting tasktracker, logging to /home/hadoop/software/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-tasktracker-jj-VirtualBox.out
Open the browser and point it to these pages:
localhost:50030 (JobTracker)
localhost:50070 (NameNode)
They will open the status pages for Hadoop.
That's it, this completes the installation of Hadoop; now you are ready to play with it.
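As a quick smoke test (a sketch; the examples jar name matches the Hadoop 1.0.1 tarball used in this post, and $HADOOP_INSTALL is the variable set above), you can copy a file into HDFS and run one of the bundled example jobs:

# Put a local file into HDFS and list it
hadoop fs -mkdir /user/hadoop/input
hadoop fs -put $HADOOP_INSTALL/conf/core-site.xml /user/hadoop/input
hadoop fs -ls /user/hadoop/input
# Run the bundled pi estimator example (2 maps, 10 samples each)
hadoop jar $HADOOP_INSTALL/hadoop-examples-1.0.1.jar pi 2 10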
I often get this problem where http://localhost:50070/dfshealth.jsp crashes and doesn't show anything.
I am running a pseudo-distributed configuration.
One temporary solution which I found online was to format the DFS again, but this is very frustrating.
Also, in the JobTracker history at
http://localhost:50030/jobtracker.jsp
I get the following message:
HTTP ERROR 500
Problem accessing /jobhistoryhome.jsp. Reason:
INTERNAL_SERVER_ERROR
http://localhost:50030/jobhistoryhome.jsp
I see a similar problem was also observed here:
http://grokbase.com/p/hadoop/common-user/10383vj1gn/namenode-problem
Solution
If you look carefully at the namenode log, we see the error:
org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /tmp/hadoop-hadoop/dfs/name is in an inconsistent state: storage directory does not exist or is not accessible.
This says that the following properties are not properly set.
Normally this is due to the machine having been rebooted and /tmp being cleared out. You do not want to leave the Hadoop name node or data node storage in /tmp for this reason. Make sure you properly configure dfs.name.dir and dfs.data.dir to point to directories outside of /tmp and other directories that may be cleared on boot.
The quick setup guide is really just to help you start experimenting with Hadoop. For setting up a cluster for any real use, you'll want to follow the next guide, Cluster Setup:
http://hadoop.apache.org/common/docs/current/cluster_setup.html
So here is what I did: in hadoop-site.xml I added the following two properties, and now it's working fine.
<property>
<name>dfs.name.dir</name>
<value>/home/hadoop/workspace/hadoop_space/name_dir</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/hadoop/workspace/hadoop_space/data_dir</value>
</property>
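After adding those properties you have to create the directories and reformat the namenode. A sketch of the follow-up steps (note that reformatting wipes whatever was in HDFS, which is fine for a fresh pseudo-distributed setup):

# Create the storage directories referenced in the config
mkdir -p /home/hadoop/workspace/hadoop_space/name_dir
mkdir -p /home/hadoop/workspace/hadoop_space/data_dir
# Stop the daemons, reformat HDFS against the new name dir, and start again
stop-all.sh
hadoop namenode -format
start-dfs.sh
start-mapred.sh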
Source : http://lucene.472066.n3.nabble.com/Directory-tmp-hadoop-root-dfs-name-is-in-an-inconsistent-state-storage-directory-DOES-NOT-exist-or-ie-td812243.html
Found some other solution to this problem? Please share below, thanks.