Sample data for practice with Hadoop

Go to http://www.infochimps.com/datasets

Filter by Free data sets available

Filter by Downloadable data only

Choose the data type which is of interest to you.

Happy Hadooping :)

 

Pig Editor for eclipse

I downloaded PigEditor from http://romainr.github.com/PigEditor/ and installed it in Eclipse.

I was not able to configure PigPen, which is said to be a more feature-rich editor than this one. I will try to get hold of PigPen as well.

In case you want to use PigEditor, here is what you have to do:

In the Eclipse update site dialog, use the URL

http://romainr.github.com/PigEditor/updates/

Let it install.

Restart Eclipse.

Create a new General project.

It will ask you to allow XUnit use with the project; say yes and you are done.

Pig Return codes

Value    Meaning                  Comment
0        Success
1        Retriable failure
2        Failure
3        Partial failure          Used with multiquery
4        Illegal arguments
5        IOException thrown       Usually thrown by a UDF
6        PigException thrown      Usually means a Python UDF raised an exception
7        ParseException thrown    Can happen after variable substitution during parsing
8        Throwable thrown         An unexpected exception
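These return codes show up as the process exit status, so a wrapper shell script can branch on them. A minimal sketch (the script name and path are placeholders):

pig -x local myscript.pig
ret=$?
if [ $ret -eq 0 ]; then
    echo "Pig job succeeded"
elif [ $ret -eq 3 ]; then
    echo "Partial failure (multiquery): some of the jobs failed"
else
    echo "Pig job failed with return code $ret"
fi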

Apache Pig Introduction Tutorial

Apache Pig is a platform to analyze large data sets.

In simple terms, you have lots and lots of data on which you need to do some processing or analysis. One way is to write MapReduce code and run that processing on the data.

The other way is to write Pig scripts, which are in turn converted into MapReduce code that processes your data.

Pig consists of two parts

  • Pig latin language
  • Pig engine


Pig Latin is a scripting language that lets you describe how data from one or more inputs should be read, how it should be processed, and where it should be stored.

The flows can be simple or complex, with processing applied in between, and data can be picked up from multiple inputs.
We can say Pig Latin describes a directed acyclic graph where the edges are data flows and the nodes are operators that process the data.

The job of the engine is to execute the data flow written in Pig Latin in parallel on the Hadoop infrastructure.

Why is Pig required when we can code everything in MR?

Pig provides all the standard data processing operations, such as sort, group, join, filter, order by, and union, right inside Pig Latin.
In MR we would have to do a lot of manual coding.

Pig optimizes Pig Latin scripts while compiling them into MR jobs.
It creates an optimized series of MapReduce jobs to run on Hadoop.

It takes much less time to write a Pig Latin script than to write the corresponding MR code.

Where Pig is useful

Transactional ETL data pipelines (the most common use)
Research on raw data
Iterative processing

Next, you can read about how to install Pig.



 

Oracle Date mapped to TimeStamp while importing with Sqoop

The current version of Sqoop (1.4.1) maps the Oracle DATE type to Timestamp, since the Oracle JDBC driver does this. Read the discussion below:

http://www.oracle.com/technetwork/database/enterprise-edition/jdbc-faq-090281.html#08_01

How to solve this

While importing with Sqoop, pass the driver-specific argument as in the example below:

$ sqoop import -D mapDateToTimestamp=false --connect jdbc:oracle:thin:@//db.example.com/foo --table bar

Setting the mapDateToTimestamp property to false makes the driver revert to the default 9i/10g behavior and map DATE to java.sql.Date.

Installing Pig ( Apache Hadoop Pig)

Apache Pig can be downloaded from http://pig.apache.org

Download the latest release from its website

Unzip the downloaded tar file

Set the environment variables in your system as

export PIG_HOME="/home/hadoop/software/pig-0.9.2"
export PATH=$PATH:$PIG_HOME/bin

Adjust the paths to wherever you extracted Pig.

If you plan to run Pig on a Hadoop cluster, one additional variable needs to be set:

export PIG_CLASSPATH="/home/hadoop/software/hadoop-1.0.1/conf"

It tells Pig where to look for hdfs-site.xml and the other Hadoop configuration files.
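To make these settings persist across sessions, you can append them to your shell profile. A minimal sketch, assuming bash and the paths used above:

cat >> ~/.bashrc <<'EOF'
export PIG_HOME="/home/hadoop/software/pig-0.9.2"
export PATH=$PATH:$PIG_HOME/bin
export PIG_CLASSPATH="/home/hadoop/software/hadoop-1.0.1/conf"
EOF
source ~/.bashrc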

Open a new shell (or source the profile) so the variables take effect; a full restart is not necessary.

That's it, now let's test the installation.

On the command prompt

type

# pig -h

It should show the help for Pig and its various commands.

Done :)

Next you can read about

How to run your first Pig script in local mode

Or about various Pig running modes

Hadoop Pig Local mode Tutorial

The example below explains how to start programming in Pig.

I followed the book Programming Pig.

This post assumes that you have already installed Pig on your computer. If you need help, you can read the tutorial on installing Pig.

So let's get started writing our first Pig program, using the code example given in chapter 2 of the book.

Download the code examples from the GitHub repository (link below):

https://github.com/alanfgates/programmingpig

Pig can run in local mode and MapReduce mode.

Local mode means that the source data is picked up from a directory on your local computer. So to run a program, you go to the directory where the data is and then run the Pig script to analyze it.

I downloaded the code examples from the above link.

Now go to the data directory where all the data is present.

# cd /home/hadoop/Downloads/PigBook/data

Change the path depending on where you copied the code on your computer.

Now let's start Pig in local mode:

# pig -x local

-x local tells Pig: "Dear Pig, let's start working locally on this computer."

The output is similar to below

2012-03-11 11:44:13,346 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/hadoop/Downloads/PigBook/data/pig_1331446453340.log
2012-03-11 11:44:13,720 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop filesystem at: file:///

Pig drops you into the grunt> shell; grunt is the interactive shell for writing Pig scripts.

Let's try to see all the files present in the data directory:

grunt> ls

Output is shown below


file:/home/hadoop/Downloads/PigBook/data/webcrawl<r 1>    255068
file:/home/hadoop/Downloads/PigBook/data/baseball<r 1>    233509
file:/home/hadoop/Downloads/PigBook/data/NYSE_dividends<r 1>    17027
file:/home/hadoop/Downloads/PigBook/data/NYSE_daily<r 1>    3194099
file:/home/hadoop/Downloads/PigBook/data/README<r 1>    980
file:/home/hadoop/Downloads/PigBook/data/pig_1331445409976.log<r 1>    823

It shows the list of files present in that folder (data).

Let's run a program. In chapter 2 there is one Pig script.

Go to the PigBook/examples/chap2 folder; there is a script named average_dividend.pig.

The code of the script is as follows:

dividends = load 'NYSE_dividends' as (exchange, symbol, date, dividend);
grouped   = group dividends by symbol;
avg       = foreach grouped generate group, AVG(dividends.dividend);
store avg into 'average_dividend';

In plain English, the above code says the following:

Load the NYSE_dividends file, which contains the fields exchange, symbol, date, and dividend.
Group the records in that file by symbol.

Calculate the average dividend for each group, and

store the averages in the average_dividend folder.
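To actually execute the script you can run it from the command line in local mode, or invoke it from the grunt shell. The relative path below assumes you are still sitting in the data directory and that the chapter 2 folder is where the book's examples keep it:

# from /home/hadoop/Downloads/PigBook/data
pig -x local ../examples/chap2/average_dividend.pig

# or, from inside grunt:
grunt> run ../examples/chap2/average_dividend.pig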

Result

After lots of processing, the output looks similar to this:

 


Input(s):
Successfully read records from: "file:///home/hadoop/Downloads/PigBook/data/NYSE_dividends"

Output(s):
Successfully stored records in: "file:///home/hadoop/Downloads/PigBook/data/average_dividend"

Job DAG:
job_local_0001


2012-03-11 11:47:10,994 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!

 

To check the output, go to the average_dividend directory that was created inside the data directory (remember, we started Pig in that directory).

There is one MR part file, part-r-00000, which holds the final results.
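To take a quick look at the results (the path assumes the default output location shown above):

head average_dividend/part-r-00000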

That's it, Pig Latin has done all the magic behind the scenes.

Coming next: running Pig Latin in MapReduce mode.

org.apache.hbase#hbase;0.92.0-SNAPSHOT: not found

If you are using Sqoop 1.4.1 and you try to build it, you may get an error like:

org.apache.hbase#hbase;0.92.0-SNAPSHOT: not found

This is because HBase 0.92.0 has since been released, so the 0.92.0-SNAPSHOT artifact can no longer be resolved.

Just make the changes shown in the review below to build.xml and run the build again:

https://reviews.apache.org/r/4169/diff/

Sqoop free form query example

$ sqoop-import --connect jdbc:mysql://localhost:3306/sqoop --username root --password root --target-dir importOnlyEmpName -e 'Select Name from Employee_Table where $CONDITIONS' --m 1

The free-form query is given after -e or --query.

We can write the query in single quotes or double quotes; just read the notes below from the official documentation.

sqoop-import --connect jdbc:mysql://localhost:3306/sqoop --username root --password root --target-dir importOnlyEmpName -e "Select Name from Employee_Table where (employee_Name='David' OR Salary>'2000') AND \$CONDITIONS" --m 1

Example of a Sqoop free-form query with a where clause.

The above query selects just Name from the table Employee_Table, which also has other columns besides Name.

Importance of $CONDITIONS in a free-form query

It's worth noting the importance of $CONDITIONS in a free-form query (this thread explains it well; the information below is taken from there).

If you run a parallel import, the map tasks will execute your query with different values substituted in for $CONDITIONS. For example, one mapper may execute "select bla from foo WHERE (id >= 0 AND id < 10000)", and the next mapper may execute "select bla from foo WHERE (id >= 10000 AND id < 20000)", and so on.

Sqoop does not parse your SQL statement into an abstract syntax tree, which would allow it to modify your query without textual hints. You are free to add further constraints like you suggested in your initial example (read the thread), but the literal string "$CONDITIONS" does need to appear in the WHERE clause of your query so that Sqoop can textually replace it with its own refined constraints.

Setting -m 1 is the only way to force a non-parallel import. You still need $CONDITIONS in there because Sqoop queries the database for column type information etc. in the client before executing the import job, but does not want actual rows returned to the client. So it will execute your query with $CONDITIONS set to '1 = 0' to ensure that it receives type information, but no records.
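For example, a parallel version of the earlier free-form import might look like the following. Note that --split-by becomes mandatory once more than one mapper is used; employee_id and importEmpParallel here are just placeholder column and directory names:

$ sqoop-import --connect jdbc:mysql://localhost:3306/sqoop --username root --password root --target-dir importEmpParallel -e 'Select * from Employee_Table where $CONDITIONS' --split-by employee_id -m 4
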
Notes from Sqoop documentation
If you are issuing the query wrapped with double quotes ("), you will have to use \$CONDITIONS instead of just $CONDITIONS to disallow your shell from treating it as a shell variable. For example, a double quoted query may look like: "SELECT * FROM x WHERE a='foo' AND \$CONDITIONS"
The facility of using free-form query in the current version of Sqoop is limited to simple queries where there are no ambiguous projections and no OR conditions in the WHERE clause. Use of complex queries such as queries that have sub-queries or joins leading to ambiguous projections can lead to unexpected results.

Sqoop --target-dir example

Example of importing with Sqoop into a target directory in HDFS.

$ sqoop-import --connect jdbc:mysql://localhost:3306/sqoop --username root --password root --table Employee_Table --target-dir employeeImportAll --m 1

The above command imports the data present in Employee_Table in the sqoop database into an HDFS directory named employeeImportAll.


After the import is done, we can check whether the data is present.

Just look at the output of each of the 3 commands, one by one:

hadoop@jj-VirtualBox:~$ hadoop fs -ls
hadoop@jj-VirtualBox:~$ hadoop fs -ls /user/hadoop/employeeImportAll
hadoop@jj-VirtualBox:~$ hadoop fs -cat /user/hadoop/employeeImportAll/part-m-00000


All the results are present in a comma-separated file.

ERROR tool.ImportTool: Error during import: No primary key could be found for table

12/03/05 23:44:31 ERROR tool.ImportTool: Error during import: No primary key could be found for table Employee_Table. Please specify one with --split-by or perform a sequential import with '-m 1'.

Sample query on which I got this error:

$ sqoop-import --connect jdbc:mysql://localhost:3306/sqoop --username root --password root --table Employee_Table --target-dir employeeImportAll

Explanation

While performing parallel imports, Sqoop needs a criterion by which it can split the workload; it uses a splitting column for this. By default, Sqoop identifies the primary key column (if present) in a table and uses it as the splitting column.

The low and high values of the splitting column are retrieved from the database, and the map tasks operate on evenly sized slices of the total range.

For example , if you had a table with a primary key column of id whose minimum value was 0 and maximum value was 1000, and Sqoop was directed to use 4 tasks, Sqoop would run four processes which each execute SQL statements of the form SELECT * FROM sometable WHERE id >= lo AND id < hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750), and (750, 1001) in the different tasks.

Solution

$ sqoop-import --connect jdbc:mysql://localhost:3306/sqoop --username root --password root --table Employee_Table --target-dir employeeImportAll --m 1

Just add --m 1; it tells Sqoop to do a sequential import with 1 mapper.

Another solution is to tell Sqoop to use a particular column as the split column:

$ sqoop-import --connect jdbc:mysql://localhost:3306/sqoop --username root --password root --table Employee_Table --target-dir employeeImportAll  --split-by columnName

 

Spring and Hadoop Integration

Spring Hadoop provides support for writing Apache Hadoop applications that benefit from the features of Spring, Spring Batch and Spring Integration.

Features

  • Extension to Spring Batch to support creating an end-to-end data pipeline solution
  • Simplified reading and writing to HDFS using Spring's resource abstraction
  • Spring Batch tasklets for MapReduce and Streaming jobs
  • Integration with Cascading, HBase, Hive and Pig

http://www.springsource.org/spring-data/hadoop

Cloudera Certified Administrator for Apache Hadoop (CCAH) exam topics and syllabus

Update : 6 April 2013

Cloudera Certified Administrator for Apache Hadoop (CCAH)

To earn a CCAH certification, candidates must pass an exam designed to test a candidate’s fluency with the concepts and skills required in the following areas:

If you are interested in the Developer exam, then you should read the other post:

http://jugnu-life.blogspot.in/2012/03/cloudera-certified-developer-for-apache.html


Details of the Admin exam are given below, along with study resources.

 

Test Name: Cloudera Certified Administrator for Apache Hadoop CDH4 (CCA-410)
Number of Questions: 60
Time Limit: 90 minutes
Passing Score: 70%
Languages: English, Japanese
English Release Date: November 1, 2012
Japanese Release Date: December 1, 2012
Price: USD $295, AUD285, EUR225, GBP185, JPY25,500


1. HDFS (38%)

Objectives
  • Describe the function of all Hadoop Daemons
  • Describe the normal operation of an Apache Hadoop cluster, both in data storage and in data processing.
  • Identify current features of computing systems that motivate a system like Apache Hadoop.
  • Classify major goals of HDFS Design
  • Given a scenario, identify appropriate use case for HDFS Federation
  • Identify the components and daemons of an HDFS HA-Quorum cluster
  • Analyze the role of HDFS security (Kerberos)
  • Describe file read and write paths
Section Study Resources

2. MapReduce (10%)

Objectives
  • Understand how to deploy MapReduce v1 (MRv1)
  • Understand how to deploy MapReduce v2 (MRv2 / YARN)
  • Understand basic design strategy for MapReduce v2 (MRv2)
Section Study Resources
  • Apache YARN docs (note: we don't control apache.org links and as of 11 February 2013, they have been experiencing downtime. You may get a 404 error.)
  • CDH4 YARN deployment docs

    3. Hadoop Cluster Planning (12%)

    Objectives
    • Principal points to consider in choosing the hardware and operating systems to host an Apache Hadoop cluster.
    • Analyze the choices in selecting an OS
    • Understand kernel tuning and disk swapping
    • Given a scenario and workload pattern, identify a hardware configuration appropriate to the scenario
    • Cluster sizing: given a scenario and frequency of execution, identify the specifics for the workload, including CPU, memory, storage, disk I/O
    • Disk Sizing and Configuration, including JBOD versus RAID, SANs, virtualization, and disk sizing requirements in a cluster
    • Network Topologies: understand network usage in Hadoop (for both HDFS and MapReduce) and propose or identify key network design components for a given scenario
    Section Study Resources
    • Hadoop Operations: Chapter 4

    4. Hadoop Cluster Installation and Administration (17%)

    Objectives
    • Given a scenario, identify how the cluster will handle disk and machine failures.
    • Analyze a logging configuration and logging configuration file format.
    • Understand the basics of Hadoop metrics and cluster health monitoring.
    • Identify the function and purpose of available tools for cluster monitoring.
    • Identify the function and purpose of available tools for managing the Apache Hadoop file system.
    Section Study Resources
    • Hadoop Operations, Chapter 5

    5. Resource Management (6%)

    Objectives
    • Understand the overall design goals of each of Hadoop schedulers.
    • Understand the role of HDFS quotas.
    • Given a scenario, determine how the FIFO Scheduler allocates cluster resources.
    • Given a scenario, determine how the Fair Scheduler allocates cluster resources.
    • Given a scenario, determine how the Capacity Scheduler allocates cluster resources.
    Section Study Resources

    6. Monitoring and Logging (12%)

    Objectives
    • Understand the functions and features of Hadoop’s metric collection abilities
    • Analyze the NameNode and JobTracker Web UIs
    • Interpret a log4j configuration
    • Understand how to monitor the Hadoop Daemons
    • Identify and monitor CPU usage on master nodes
    • Describe how to monitor swap and memory allocation on all nodes
    • Identify how to view and manage Hadoop’s log files
    • Interpret a log file
    Section Study Resources

      7. The Hadoop Ecosystem (5%)

      Objectives
      • Understand Ecosystem projects and what you need to do to deploy them on a cluster.
      Section Study Resources
    • Cloudera Certified Developer for Apache Hadoop Syllabus exam topics and contents (CCDH)

      Cloudera Certified Developer for Apache Hadoop (CCDH)

      Update : 6 April 2013

Cloudera has added exam learning resources on its website; please read this link for the latest:

      http://university.cloudera.com/certification/prep/ccdh.html

      http://jugnu-life.blogspot.in/2012/05/cloudera-hadoop-certification-now.html

      Syllabus , exam contents

      http://university.cloudera.com/certification.html

      To earn a CCDH certification, candidates must pass an exam designed to test a candidate’s fluency with the concepts and skills required in the following areas:

If you are interested in the Administrator exam, then you should read the other post:

      http://jugnu-life.blogspot.in/2012/03/cloudera-certified-administrator-for.html

       

The exam syllabus for the Developer exam and study sources are given below.

      1. Core Hadoop Concepts (CCD-410:25% | CCD-470: 33%)

      Objectives
      • Recognize and identify Apache Hadoop daemons and how they function both in data storage and processing under both CDH3 and CDH4.
      • Understand how Apache Hadoop exploits data locality, including rack placement policy.
      • Given a big data scenario, determine the challenges to large-scale computational models and how distributed systems attempt to overcome various challenges posed by the scenario.
      • Identify the role and use of both MapReduce v1 (MRv1) and MapReduce v2 (MRv2 / YARN) daemons.
      Section Study Resources

       

      2. Storing Files in Hadoop (7%)

      Objectives
      • Analyze the benefits and challenges of the HDFS architecture
      • Analyze how HDFS implements file sizes, block sizes, and block abstraction.
      • Understand default replication values and storage requirements for replication.
      • Determine how HDFS stores, reads, and writes files.
      • Given a sample architecture, determine how HDFS handles hardware failure.
      Section Study Resources
      • Hadoop: The Definitive Guide, 3rd edition: Chapter 3
      • Hadoop Operations: Chapter 2
      • Hadoop in Practice: Appendix C: HDFS Dissected

      3. Job Configuration and Submission (7%)

      Objectives
      • Construct proper job configuration parameters
      • Identify the correct procedures for MapReduce job submission.
      • How to use various commands in job submission
      Section Study Resources
      • Hadoop: The Definitive Guide, 3rd Edition: Chapter 5

      4. Job Execution Environment (10%)

      Objectives
      • Given a MapReduce job, determine the lifecycle of a Mapper and the lifecycle of a Reducer.
      • Understand the key fault tolerance principles at work in a MapReduce job.
      • Identify the role of Apache Hadoop Classes, Interfaces, and Methods.
      • Understand how speculative execution exploits differences in machine configurations and capabilities in a parallel environment and how and when it runs.
      Section Study Resources
      • Hadoop in Action: Chapter 3
      • Hadoop: The Definitive Guide, 3rd Edition: Chapter 6

      5. Input and Output (6%)

      Objectives
      • Given a sample job, analyze and determine the correct InputFormat and OutputFormat to select based on job requirements.
      • Understand the role of the RecordReader, and of sequence files and compression.
      Section Study Resources
      • Hadoop: The Definitive Guide, 3rd Edition: Chapter 7
      • Hadoop in Action: Chapter 3
      • Hadoop in Practice: Chapter 3

      6. Job Lifecycle (18%)

      Objectives
      • Analyze the order of operations in a MapReduce job.
      • Analyze how data moves through a job.
      • Understand how partitioners and combiners function, and recognize appropriate use cases for each.
      • Recognize the processes and role of the sort and shuffle process.
      Section Study Resources
      • Hadoop: The Definitive Guide, 3rd Edition: Chapter 6
      • Hadoop in Practice: Techniques in section 6.4
      • Two blog posts from Philippe Adjiman’s Hadoop Tutorial Series

      7. Data processing (6%)

      Objectives
      • Analyze and determine the relationship of input keys to output keys in terms of both type and number, the sorting of keys, and the sorting of values.
      • Given sample input data, identify the number, type, and value of emitted keys and values from the Mappers as well as the emitted data from each Reducer and the number and contents of the output file(s).
      Section Study Resources
      • Hadoop: The Definitive Guide, 3rd Edition: Chapter 7 on Input Formats and Output Formats
      • Hadoop in Practice: Chapter 3

      8. Key and Value Types (6%)

      Objectives
      • Given a scenario, analyze and determine which of Hadoop’s data types for keys and values are appropriate for the job.
      • Understand common key and value types in the MapReduce framework and the interfaces they implement.
      Section Study Resources
      • Hadoop: The Definitive Guide, 3rd Edition: Chapter 4
      • Hadoop in Practice: Chapter 3

      9. Common Algorithms and Design Patterns (7%)

      Objectives
      • Evaluate whether an algorithm is well-suited for expression in MapReduce.
      • Understand implementation and limitations and strategies for joining datasets in MapReduce.
      • Analyze the role of DistributedCache and Counters.
      Section Study Resources
      • Hadoop: The Definitive Guide, 3rd Edition: Chapter 8
      • Hadoop in Practice: Chapter 4, 5, 7
      • MapReduce Algorithms tutorial video. Note: uses the old API.
      • Hadoop in Action: Chapter 5.2

      10. The Hadoop Ecosystem (8%)

      Objectives
      • Analyze a workflow scenario and determine how and when to leverage ecosystems projects, including Apache Hive, Apache Pig, Sqoop and Oozie.
      • Understand how Hadoop Streaming might apply to a job workflow.
      Section Study Resources

      Hadoop Certification in India or Outside USA

If you are waiting to write the Cloudera exam for Hadoop certification, then there is good news for you.

Cloudera is going to offer the exams through Pearson VUE centers throughout the world, starting 1 May 2012.

      Exams :
      Cloudera Certified Administrator for Apache Hadoop (CCAH) and Cloudera Certified Developer for Apache Hadoop (CCDH)

      Start date : 1 May 2012
      Testing center : Pearson VUE
      Exam fees : $295 US

More good news: it is no longer necessary to attend training prior to writing the exam, so that huge training fee can be avoided if we study on our own. (At least I cannot afford to pay that USD 1600 training cost, it's huge. Cloudera people, are you listening? USD 1600 is huge when 1 USD = 50 INR.)

This is great news for many people in India, and around the world outside the USA, who wanted to write the certification exam.

      More details at official press release below.

      http://www.cloudera.com/company/press-center/releases/cloudera-university-takes-industry-leading-certification-program-for-apache-hadoop-worldwide/

If you are also planning to write the exam like me, let's plan and study together.

How are you preparing for the exams?

Which technologies are you working with these days? I am working with MR, Hive, and Sqoop, and following Tom White's book on Hadoop.

Do you have an idea about the contents of the exam and the syllabus?

I have written about the contents of the exams in two blog posts:

      http://jugnu-life.blogspot.in/2012/03/cloudera-certified-developer-for-apache.html
      http://jugnu-life.blogspot.in/2012/03/cloudera-certified-administrator-for.html

      Setting up development environment for Sqoop

The post on the official Sqoop wiki explains well the process of setting up a development environment for Sqoop.

I have the following:

Ubuntu 11.10
Eclipse, downloaded
Ant 1.8 on my system
Subclipse (SVN plugin for Eclipse)
Make, already present in Ubuntu
AsciiDoc, downloaded from the Ubuntu software repository, easy part :)
Java 1.6, already there on my system
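Most of the non-Eclipse pieces above can be pulled in from the Ubuntu repositories in one go. A minimal sketch (package names as found in the Ubuntu repositories; adjust to taste):

sudo apt-get update
sudo apt-get install ant make asciidoc subversion
# then follow the Sqoop wiki for checking out the source and building it with ant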

      All set :)

      Snappy compressions library

Snappy is a compression/decompression library built using C++.

The main advantage of Snappy is its high speed at compressing and decompressing data.

      http://code.google.com/p/snappy/

      Integrating Pig and Accumulo

      Accumulo

      Accumulo is a distributed key/value store that provides expressive, cell-level access labels.

      Accumulo is a sorted, distributed key/value store based on Google's BigTable design. It is built on top of Apache Hadoop, Zookeeper, and Thrift.

      http://www.covert.io/post/18605091231/accumulo-and-pig

The above post explains the use of Pig and Accumulo together.

      Sqoop import with where clause

If you are following on from the previous Sqoop import tutorial (http://jugnu-life.blogspot.in/2012/03/sqoop-import-tutorial.html), then let's try a conditional import from an RDBMS with Sqoop.

      $ sqoop import --connect jdbc:mysql://localhost/CompanyDatabase --table Customer --username root --password mysecret -m 1

The sqoop command above would import all the rows present in the table Customer.

Let's say the Customer table looks something like this:

CustomerName    DateOfJoining
Adam            2012-12-12
John            2002-1-3
Emma            2011-1-3
Tina            2009-3-8

Now let's say we want to import only those customers who joined after 2005-1-1.

We can modify the sqoop import as:

      $ sqoop import --connect jdbc:mysql://localhost/CompanyDatabase --table Customer --username root --password mysecret --where "DateOfJoining > '2005-1-1' "

      This would import only 3 records from above table.

      Happy sqooping :)

      Sqoop installation tutorial

Sqoop is a tool used to import/export data between an RDBMS and HDFS.

It can be downloaded from the Apache website. As of writing this post, Sqoop is an Apache Incubator project, but it should become a full top-level project in the near future.

Sqoop is a client tool; you are not required to install it on all nodes of the cluster. The best practice is to install it only on the client (or an edge node of the cluster). The data transfer happens directly between the cluster and the database, in case you are worried about traffic between the machine where you install Sqoop and the database.
       
      Installation steps

      You can download the latest version of sqoop from apache website
      http://sqoop.apache.org/

The installation is fairly simple, at least to start off using Sqoop for development purposes.

      Download the latest sqoop binary file

      Extract it in some folder

Set SQOOP_HOME and add Sqoop to your PATH so that you can run the sqoop commands directly.

For example, I downloaded Sqoop into the following directory, and my environment variables look like this:

export SQOOP_HOME="/home/hadoop/software/sqoop-1.4.3"
export PATH=$PATH:$SQOOP_HOME/bin

Sqoop can connect to various types of databases.

For example, it can talk to MySQL, Oracle, and PostgreSQL databases. It uses JDBC to connect to them, so Sqoop needs the JDBC driver of each database it connects to.

The JDBC driver jar for each database can be downloaded from the vendor's website; for MySQL this is the Connector/J jar.

Download the MySQL Connector/J jar and put it in the lib directory inside the Sqoop home folder.
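For example, something like this (the exact jar file name depends on the Connector/J version you downloaded):

cp mysql-connector-java-*.jar $SQOOP_HOME/lib/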

That's it.

      Just test your installation by typing

      $ sqoop help

You should see the list of Sqoop commands along with their descriptions.

      Happy sqooping :)

      Sqoop import tutorial

This tutorial explains how to use Sqoop to import data from an RDBMS into HDFS. The tutorial is divided into multiple posts to cover the various features offered by sqoop import.

      The general syntax for import is

      $ sqoop-import (generic-args) (import-args)

Argument                              Description
--connect <jdbc-uri>                  Specify JDBC connect string
--connection-manager <class-name>     Specify connection manager class to use
--driver <class-name>                 Manually specify JDBC driver class to use
--hadoop-home <dir>                   Override $HADOOP_HOME
--help                                Print usage instructions
-P                                    Read password from console
--password <password>                 Set authentication password
--username <username>                 Set authentication username
--verbose                             Print more information while working
--connection-param-file <filename>    Optional properties file that provides connection parameters

Example run:

      $ sqoop import --connect jdbc:mysql://localhost/CompanyDatabase --table Customer --username root --password mysecret -m 1


When we run this Sqoop command, it tries to connect to the MySQL database named CompanyDatabase with username root, password mysecret, and one map task.


Generally it is not recommended to give the password on the command line; instead, it is advisable to use the -P parameter, which tells Sqoop to prompt for the password on the console.


One more thing to notice is the use of localhost as the database address; if you are running your Hadoop cluster in distributed mode, you should give the full hostname or IP of the database.
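Putting those two recommendations together, an import against a remote database might look like the following (db.example.com is just a placeholder hostname):

$ sqoop import --connect jdbc:mysql://db.example.com/CompanyDatabase --table Customer --username root -P -m 1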

      Hadoop installation tutorial

The purpose of this post is to explain how to install Hadoop on your computer. It assumes that you have a Linux-based system available for use; I am doing this on an Ubuntu system.

If you want to know how to install the latest version of Hadoop 2.0, then see the Hadoop 2.0 Install Tutorial.

Before you begin, create a separate user named hadoop on the system and do all these operations as that user.
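One way to create such a user on Ubuntu (the group name is just a convention, adjust as you like):

sudo addgroup hadoop
sudo adduser --ingroup hadoop hadoop
su - hadoop    # switch to the new user for the remaining steps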

      This document covers the Steps to
      1) Configure SSH
      2) Install JDK
      3) Install Hadoop

      Update your repository
      #sudo apt-get update

You can copy the commands directly from this post and run them on your system.

Hadoop requires that the various machines in the cluster can talk to each other freely. Hadoop uses SSH to prove identity for these connections.

      Let's Download and configure SSH

      #sudo apt-get install openssh-server openssh-client
      #ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
      #cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

      #sudo chmod go-w $HOME $HOME/.ssh
      #sudo chmod 600 $HOME/.ssh/authorized_keys
      #sudo chown `whoami` $HOME/.ssh/authorized_keys

      Testing your SSH

      #ssh localhost
Say yes.

It should open a connection over SSH.
#exit
This will close the SSH session.

Java 1.6 is mandatory for running Hadoop.

Let's download and install the JDK.

      #sudo mkdir /usr/java
      #cd /usr/java
      #sudo wget http://download.oracle.com/otn-pub/java/jdk/6u31-b04/jdk-6u31-linux-i586.bin

      Wait till the jdk download completes
      Install java
      #sudo chmod o+w jdk-6u31-linux-i586.bin
      #sudo chmod +x jdk-6u31-linux-i586.bin
      #sudo ./jdk-6u31-linux-i586.bin

Now comes Hadoop :)

Let's download and configure Hadoop in pseudo-distributed mode. You can read more about the various modes on the Hadoop website.

      Download the latest hadoop version from its website

      http://hadoop.apache.org/common/releases.html
Download the hadoop-1.0.x tar.gz from the Hadoop website.

Extract it into some folder (say /home/hadoop/software/20/); all the software in this tutorial has been placed at that location.

For the other modes (standalone and fully distributed), please see the Hadoop documentation.

Go to the conf directory in the Hadoop folder, open core-site.xml, and add the following property inside the empty configuration tags:

       

      <configuration>
      <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost</value>
      </property>
      </configuration>

      Similarly do for

      conf/hdfs-site.xml:

      <configuration>
      <property>
      <name>dfs.replication</name>
      <value>1</value>
      </property>
      </configuration>


      conf/mapred-site.xml:

      <configuration>
      <property>
      <name>mapred.job.tracker</name>
      <value>localhost:8021</value>
      </property>
      </configuration>

      Environment variables

In the conf/hadoop-env.sh file, change JAVA_HOME to the location where you installed Java, e.g.

export JAVA_HOME=/usr/java/jdk1.6.0_31

Configure the environment variables for the JDK and Hadoop as follows.

Open the ~/.profile file in the current user's home directory and add the following (change the paths if you installed Hadoop and Java at other locations):

      export JAVA_HOME="/usr/java/jdk1.6.0_31"
      export PATH=$PATH:$JAVA_HOME/bin
      export HADOOP_INSTALL="/home/hadoop/software/hadoop-1.0.1"
      export PATH=$PATH:$HADOOP_INSTALL/bin
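Reload the profile (or log out and back in) so the new variables take effect:

source ~/.profile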

      Testing your installation
      Format the HDFS
      # hadoop namenode -format

      hadoop@jj-VirtualBox:~$ start-dfs.sh
      starting namenode, logging to /home/hadoop/software/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-namenode-jj-VirtualBox.out
      localhost: starting datanode, logging to /home/hadoop/software/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-datanode-jj-VirtualBox.out
      localhost: starting secondarynamenode, logging to /home/hadoop/software/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-secondarynamenode-jj-VirtualBox.out
      hadoop@jj-VirtualBox:~$ start-mapred.sh
      starting jobtracker, logging to /home/hadoop/software/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-jobtracker-jj-VirtualBox.out
      localhost: starting tasktracker, logging to /home/hadoop/software/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-tasktracker-jj-VirtualBox.out

Open the browser and point it to:

localhost:50030 (JobTracker)
localhost:50070 (NameNode)

It should open the status pages for Hadoop.
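As a final smoke test, you can also try a few HDFS commands (the directory name here is arbitrary):

hadoop fs -mkdir /user/hadoop/test
hadoop fs -ls /user/hadoop
hadoop fs -rmr /user/hadoop/test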

That's it, this completes the installation of Hadoop; now you are ready to play with it.

      http://localhost:50070/dfshealth.jsp crash

I often get this problem: http://localhost:50070/dfshealth.jsp crashes and doesn't show anything.

I am running a pseudo-distributed configuration.

One of the temporary solutions I found online was to format DFS again, but that is very frustrating.

      Also in

      http://localhost:50030/jobtracker.jsp

in the JobTracker history I get the following message:

      HTTP ERROR 500

      Problem accessing /jobhistoryhome.jsp. Reason:

      INTERNAL_SERVER_ERROR

      http://localhost:50030/jobhistoryhome.jsp

I see a similar problem was also observed here:

      http://grokbase.com/p/hadoop/common-user/10383vj1gn/namenode-problem

      Solution

If you look carefully at the NameNode log, we have the error:

      org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /tmp/hadoop-hadoop/dfs/name is in an inconsistent state: storage directory does not exist or is not accessible.

This says that the following properties are not set properly.

      Normally this is due to the machine having been rebooted and /tmp being cleared out. You do not want to leave the Hadoop name node or data node storage in /tmp for this reason. Make sure you properly configure dfs.name.dir and dfs.data.dir to point to directories
      outside of /tmp and other directories that may be cleared on boot.

      The quick setup guide is really just to help you start experimenting with Hadoop. For setting up a cluster for any real use, you'll want to
      follow the next guide - Cluster Setup -
      http://hadoop.apache.org/common/docs/current/cluster_setup.html

So here is what I did: I added the following two properties in hadoop-site.xml, and now it's working fine.

       

  <property>
        <name>dfs.name.dir</name>
        <value>/home/hadoop/workspace/hadoop_space/name_dir</value>
  </property>

  <property>
        <name>dfs.data.dir</name>
        <value>/home/hadoop/workspace/hadoop_space/data_dir</value>
  </property>

       

      Source : http://lucene.472066.n3.nabble.com/Directory-tmp-hadoop-root-dfs-name-is-in-an-inconsistent-state-storage-directory-DOES-NOT-exist-or-ie-td812243.html

Found some other solution for this problem? Please share below, thanks.