Go to http://www.infochimps.com/datasets
Filter by Free data sets available
Filter by Downloadable data only
Choose the data type which is of interest to you.
Happy Hadooping :)
I downloaded PigEditor from http://romainr.github.com/PigEditor/ and installed it in Eclipse
I was not able to configure PigPen, which is said to be a better, more feature-rich editor than this one.
I will try to get hold of PigPen as well.
In case you want to use PigEditor, here is what you have to do:
In the Eclipse update site dialog, use the URL
http://romainr.github.com/PigEditor/updates/
Let it install
Restart Eclipse
Create a New General project
It will ask you to allow Xtext use with the project; say yes and you are done.
Pig return codes
Value | Meaning | Comment |
---|---|---|
0 | Success | |
1 | Retriable failure | Would be retried |
2 | Failure | |
3 | Partial failure | Used with multiquery |
4 | Illegal arguments passed to Pig | |
5 | IOException thrown | Would usually be thrown by a UDF |
6 | PigException thrown | Usually means a Python UDF raised an exception |
7 | ParseException thrown | Can happen after variable parsing if variable substitution is being done |
8 | Throwable thrown | An unexpected exception |
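Since Pig returns these values as the process exit status, a wrapper shell script can branch on them. Below is a minimal sketch (the script name myscript.pig is just a placeholder):

#!/bin/bash
# Run a Pig script and react to its exit status (see the table above)
pig myscript.pig
ret=$?
if [ $ret -eq 0 ]; then
    echo "Pig job succeeded"
elif [ $ret -eq 3 ]; then
    echo "Partial failure: with multiquery, some jobs failed while others succeeded"
else
    echo "Pig job failed with return code $ret"
    exit $ret
fi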
Apache Pig is a platform to analyze large data sets.
In simple terms, you have lots and lots of data on which you need to do some processing or analysis. One way is to write MapReduce code and then run that processing over the data.
The other way is to write Pig scripts, which are in turn converted into MapReduce code that processes your data.
Pig consists of two parts:
Pig Latin is a scripting language which allows you to describe how data flowing from one or more inputs should be read, how it should be processed, and where it should be stored.
The flows can be simple or complex, with some processing applied in between, and data can be picked up from multiple inputs.
We can say Pig Latin describes a directed acyclic graph (DAG) where the edges are data flows and the nodes are operators that process the data.
The job of the engine is to execute the data flow written in Pig Latin in parallel on the Hadoop infrastructure.
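As a rough illustration of such a data flow (just a sketch; the file names, field layout, and filter condition are made up for this example), the shell snippet below writes a three-step Pig Latin script, load then filter then store, and runs it locally:

# Write a tiny data flow: one input, one operator, one output
cat > flow_example.pig <<'EOF'
-- load -> filter -> store : a very small directed acyclic graph
records = load 'input.txt' as (word:chararray, count:int);
big     = filter records by count > 10;
store big into 'big_words';
EOF

# Execute the flow against the local filesystem
pig -x local flow_example.pig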
Why is Pig required when we can code everything in MR?
Pig provides all the standard data processing operations like sort, group, join, filter, order by, and union right inside Pig Latin.
In MR we have to do lots of manual coding for these.
Pig optimizes Pig Latin scripts while translating them into MR jobs.
It creates an optimized set of MapReduce jobs to run on Hadoop.
It takes much less time to write a Pig Latin script than to write the corresponding MR code.
Where Pig is useful
Transactional ETL data pipelines (the most common use)
Research on raw data
Iterative processing
You can read next about how to install Pig
Apache Pig can be downloaded from http://pig.apache.org
Download the latest release from its website
Unzip the downloaded tar file
Set the environment variables in your system as
export PIG_HOME="/home/hadoop/software/pig-0.9.2"
export PATH=$PATH:$PIG_HOME/bin
Adjust PIG_HOME to point to the place where you extracted Pig, and add its bin directory to your PATH.
If you plan to run Pig on a Hadoop cluster, then one additional variable needs to be set:
export PIG_CLASSPATH="/home/hadoop/software/hadoop-1.0.1/conf"
It tells Pig where to look for hdfs-site.xml and the other Hadoop configuration files.
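As a quick sanity check (the paths are the ones used above; adjust them to your own install), you can confirm that the variable points at a directory which actually holds the Hadoop config files:

# The directory PIG_CLASSPATH points to should contain the Hadoop config files
ls $PIG_CLASSPATH/core-site.xml $PIG_CLASSPATH/hdfs-site.xml $PIG_CLASSPATH/mapred-site.xml
# With this set, plain `pig` starts in mapreduce mode against that cluster,
# while `pig -x local` keeps working against the local filesystem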
Log out and log back in (or source your profile) so that the variables take effect.
That's it. Now let's test the installation.
On the command prompt, type
# pig -h
It should show the help for Pig and its various commands.
Done :)
Next you can read about
How to run your first Pig script in local mode
Or about various Pig running modes
The example below explains how to start programming in Pig.
I followed the book Programming Pig.
This post assumes that you have already installed Pig on your computer. If you need help, you can read the tutorial to install Pig.
So let's get started writing our first Pig program, using the code example given in chapter 2 of the book.
Download the code examples from the GitHub repository (link below):
https://github.com/alanfgates/programmingpig
Pig can run in local mode and mapreduce mode.
Local mode means that the source data is picked up from a directory on your local computer. So to run a program you go to the directory where the data is and then run the Pig script to analyze it.
I downloaded the code examples from the link above.
Now I go to the data directory where all the data is present.
# cd /home/hadoop/Downloads/PigBook/data
Change the path depending upon where you copied the code on your computer.
Now let's start Pig in local mode:
# pig -x local
The -x local flag says: Dear Pig, let's work locally on this computer.
The output is similar to below
2012-03-11 11:44:13,346 [main] INFO org.apache.pig.Main - Logging error messages to: /home/hadoop/Downloads/PigBook/data/pig_1331446453340.log
2012-03-11 11:44:13,720 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop filesystem at: file:///
It will drop you into the grunt> shell; Grunt is the shell used to write Pig scripts.
Let's list all the files present in the data directory:
grunt> ls
Output is shown below
file:/home/hadoop/Downloads/PigBook/data/webcrawl<r 1> 255068
file:/home/hadoop/Downloads/PigBook/data/baseball<r 1> 233509
file:/home/hadoop/Downloads/PigBook/data/NYSE_dividends<r 1> 17027
file:/home/hadoop/Downloads/PigBook/data/NYSE_daily<r 1> 3194099
file:/home/hadoop/Downloads/PigBook/data/README<r 1> 980
file:/home/hadoop/Downloads/PigBook/data/pig_1331445409976.log<r 1> 823
It shows the list of files present in that folder (data).
Let's run a program. In chapter 2 there is one Pig script.
Go to the PigBook/examples/chap2 folder; there is a script named average_dividend.pig.
The code of the script is as follows:
dividends = load 'NYSE_dividends' as (exchange, symbol, date, dividend);
grouped = group dividends by symbol;
avg = foreach grouped generate group, AVG(dividends.dividend);
store avg into 'average_dividend';
In plain English, the above code says the following:
Load the NYSE_dividends file, which contains the fields exchange, symbol, date, and dividend
Group the records in that file by symbol
Calculate the average dividend for each group, and
Store the averaged results in the average_dividend folder
Result
After lots of processing the output will look like:
Input(s):
Successfully read records from: "file:///home/hadoop/Downloads/PigBook/data/NYSE_dividends"
Output(s):
Successfully stored records in: "file:///home/hadoop/Downloads/PigBook/data/average_dividend"
Job DAG:
job_local_0001
2012-03-11 11:47:10,994 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
To check the output, go to the average_dividend directory, which is created within the data directory (remember we started Pig in this directory).
There is one MR part file, part-r-00000, that has the final results.
That's it. Pig Latin has done all the magic behind the scenes.
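Instead of typing the statements into Grunt, you can also run the script file in batch mode from the data directory. A small sketch (the relative path assumes the PigBook layout described above):

cd /home/hadoop/Downloads/PigBook/data
# Pig resolves 'NYSE_dividends' and 'average_dividend' relative to this directory
pig -x local ../examples/chap2/average_dividend.pig
# Inspect the result written by the store statement
cat average_dividend/part-r-00000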
Coming next: running Pig Latin in mapreduce mode.
If you are using Sqoop 1.4.1 and you try to build it, you can get an error like:
org.apache.hbase#hbase;0.92.0-SNAPSHOT: not found
This is because HBase 0.92.0 has been released, so the SNAPSHOT dependency can no longer be resolved.
Just make the changes shown in the review below to build.xml and run the build again:
https://reviews.apache.org/r/4169/diff/
If you wrap the free-form query in double quotes, you will have to use \$CONDITIONS instead of just $CONDITIONS to disallow your shell from treating it as a shell variable. For example, a double quoted query may look like: "SELECT * FROM x WHERE a='foo' AND \$CONDITIONS"
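For instance, a free-form query import might look like the sketch below (the connection details and table name are taken from the examples in these posts, the split column id is hypothetical, and --query, --split-by and --target-dir are standard sqoop-import options):

$ sqoop-import --connect jdbc:mysql://localhost:3306/sqoop --username root -P \
    --query "SELECT * FROM Employee_Table WHERE \$CONDITIONS" \
    --split-by id --target-dir employeeQueryImport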
The free-form query facility is limited to simple queries where there are no ambiguous projections and no OR conditions in the WHERE clause. Use of complex queries, such as queries that have sub-queries or joins leading to ambiguous projections, can lead to unexpected results.
Example: import using Sqoop into a target directory in HDFS
$ sqoop-import --connect jdbc:mysql://localhost:3306/sqoop --username root --password root --table Employee_Table --target-dir employeeImportAll --m 1
The above command will import the data present in Employee_Table in the sqoop database into the HDFS directory named employeeImportAll.
After the import is done we can check that the data is present.
Just see the output of each of the 3 commands below, one by one.
hadoop@jj-VirtualBox:~$ hadoop fs -ls
hadoop@jj-VirtualBox:~$ hadoop fs -ls /user/hadoop/employeeImportAll
hadoop@jj-VirtualBox:~$ hadoop fs -cat /user/hadoop/employeeImportAll/part-m-00000
All the results are present as a comma separated file.
12/03/05 23:44:31 ERROR tool.ImportTool: Error during import: No primary key could be found for table Employee_Table. Please specify one with --split-by or perform a sequential import with '-m 1'.
Sample query on which I got this error:
$ sqoop-import --connect jdbc:mysql://localhost:3306/sqoop --username root --password root --table Employee_Table --target-dir employeeImportAll
Explanation
While performing parallel imports, Sqoop needs a criterion by which it can split the workload. Sqoop uses a splitting column to split the workload. By default Sqoop will identify the primary key column (if present) in a table and use it as the splitting column.
The low and high values of the splitting column are retrieved from the database, and the map tasks operate on evenly sized components of the total range.
For example, if you had a table with a primary key column id whose minimum value was 0 and maximum value was 1000, and Sqoop was directed to use 4 tasks, Sqoop would run four processes which each execute SQL statements of the form SELECT * FROM sometable WHERE id >= lo AND id < hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750), and (750, 1001) in the different tasks.
Solution
$ sqoop-import --connect jdbc:mysql://localhost:3306/sqoop --username root --password root --table Employee_Table --target-dir employeeImportAll --m 1
Just add --m 1; it tells Sqoop to do a sequential import with 1 mapper.
Another solution is to tell Sqoop to use a particular column as the split column:
$ sqoop-import --connect jdbc:mysql://localhost:3306/sqoop --username root --password root --table Employee_Table --target-dir employeeImportAll --split-by columnName
Spring integration with Hadoop
Spring Hadoop provides support for writing Apache Hadoop applications that benefit from the features of Spring, Spring Batch and Spring Integration.
http://www.springsource.org/spring-data/hadoop
Update : 6 April 2013
Cloudera Certified Administrator for Apache Hadoop (CCAH)
To earn a CCAH certification, candidates must pass an exam designed to test a candidate’s fluency with the concepts and skills required in the following areas:
If you are interested in the Developer exam, then you should read the other post:
http://jugnu-life.blogspot.in/2012/03/cloudera-certified-developer-for-apache.html
Details for the Admin exam are given here, along with where to prepare from.
Test Name: Cloudera Certified Administrator for Apache Hadoop CDH4 (CCA-410)
Number of Questions: 60
Time Limit: 90 minutes
Passing Score: 70%
Languages: English, Japanese
English Release Date: November 1, 2012
Japanese Release Date: December 1, 2012
Price: USD $295, AUD285, EUR225, GBP185, JPY25,500
Cloudera Certified Developer for Apache Hadoop (CCDH)
Update : 6 April 2013
Cloudera has added exam learning resources on its website; please read this link for the latest information:
http://university.cloudera.com/certification/prep/ccdh.html
http://jugnu-life.blogspot.in/2012/05/cloudera-hadoop-certification-now.html
Syllabus and exam contents:
http://university.cloudera.com/certification.html
To earn a CCDH certification, candidates must pass an exam designed to test a candidate’s fluency with the concepts and skills required in the following areas:
If you are interested in the Administrator exam, then you should read the other post:
http://jugnu-life.blogspot.in/2012/03/cloudera-certified-administrator-for.html
The exam syllabus for the Developer exam and study sources are mentioned below.
Snappy is a compression / decompression library built using C++.
The main advantage of Snappy is its high speed when compressing or decompressing data.
http://code.google.com/p/snappy/
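In the Hadoop world, Snappy is typically used to compress intermediate map output. A rough sketch of enabling it for a single job from the command line (this assumes the native Snappy libraries are available to your Hadoop build, uses the bundled wordcount example as a stand-in job, uses the Hadoop 1.x property names, and $HADOOP_INSTALL stands for wherever Hadoop is installed):

hadoop jar $HADOOP_INSTALL/hadoop-examples-1.0.1.jar wordcount \
    -D mapred.compress.map.output=true \
    -D mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
    input output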
Accumulo
Accumulo is a distributed key/value store that provides expressive, cell-level access labels.
Accumulo is a sorted, distributed key/value store based on Google's BigTable design. It is built on top of Apache Hadoop, Zookeeper, and Thrift.
http://www.covert.io/post/18605091231/accumulo-and-pig
The above post explains use of Pig and Accumulo together.
If you are following on from the previous Sqoop import tutorial (http://jugnu-life.blogspot.in/2012/03/sqoop-import-tutorial.html), let's try to do a conditional import from an RDBMS in Sqoop.
$ sqoop import --connect jdbc:mysql://localhost/CompanyDatabase --table Customer --username root --password mysecret -m 1
The sqoop command above would import all the rows present in the table Customer.
Let's say that the Customer table is something like this:
CustomerName | DateOfJoining |
---|---|
Adam | 2012-12-12 |
John | 2002-1-3 |
Emma | 2011-1-3 |
Tina | 2009-3-8 |
Now let's say we want to import only those customers who joined after 2005-1-1.
We can modify the sqoop import as follows:
$ sqoop import --connect jdbc:mysql://localhost/CompanyDatabase --table Customer --username root --password mysecret --where "DateOfJoining > '2005-1-1' "
This would import only 3 records from the above table (John is excluded because he joined in 2002).
Happy sqooping :)
This tutorial explains how to use Sqoop to import data from an RDBMS to HDFS. The tutorial is divided into multiple posts to cover the various functionalities offered by sqoop import.
The general syntax for import is
$ sqoop-import (generic-args) (import-args)
Argument | Description |
---|---|
--connect <jdbc-uri> | Specify JDBC connect string |
--connection-manager <class-name> | Specify connection manager class to use |
--driver <class-name> | Manually specify JDBC driver class to use |
--hadoop-home <dir> | Override $HADOOP_HOME |
--help | Print usage instructions |
-P | Read password from console |
--password <password> | Set authentication password |
--username <username> | Set authentication username |
--verbose | Print more information while working |
--connection-param-file <filename> | Optional properties file that provides connection parameters |
Example run
$ sqoop import --connect jdbc:mysql://localhost/CompanyDatabase --table Customer --username root --password mysecret -m 1
When we run this sqoop command, it will try to connect to the MySQL database named CompanyDatabase with username root, password mysecret, and one map task.
Generally it's not recommended to give the password on the command line; instead it's advisable to use the -P parameter, which tells Sqoop to ask for the password on the console.
One more thing to notice is the use of localhost as the database address; if you are running your Hadoop cluster in distributed mode, then you should give the full hostname or IP of the database server.
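For example, the same import against a remote database, prompting for the password instead of putting it on the command line (the hostname and port here are just placeholders):

$ sqoop import --connect jdbc:mysql://dbserver.example.com:3306/CompanyDatabase \
    --table Customer --username root -P -m 1
# Sqoop prompts for the password on the console before connecting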
The purpose of this post is to explain how to install Hadoop on your computer. The post assumes that you have a Linux based system available for use; I am doing this on an Ubuntu system.
If you want to know how to install the latest version, Hadoop 2.0, then see the Hadoop 2.0 Install Tutorial.
Before you begin, create a separate user named hadoop in the system and do all these operations as that user.
This document covers the Steps to
1) Configure SSH
2) Install JDK
3) Install Hadoop
Update your repository
#sudo apt-get update
You can directly copy the commands from this post and run them on your system.
Hadoop requires that the various systems present in the cluster can talk to each other freely. Hadoop uses SSH to prove identity for these connections.
Let's Download and configure SSH
#sudo apt-get install openssh-server openssh-client
#ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
#cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
#sudo chmod go-w $HOME $HOME/.ssh
#sudo chmod 600 $HOME/.ssh/authorized_keys
#sudo chown `whoami` $HOME/.ssh/authorized_keys
Testing your SSH
#ssh localhost
Say yes
It should open a connection over SSH.
#exit
This will close the SSH session.
Java 1.6 is mandatory for running Hadoop.
Let's download and install the JDK:
#sudo mkdir /usr/java
#cd /usr/java
#sudo wget http://download.oracle.com/otn-pub/java/jdk/6u31-b04/jdk-6u31-linux-i586.bin
Wait till the jdk download completes
Install java
#sudo chmod o+w jdk-6u31-linux-i586.bin
#sudo chmod +x jdk-6u31-linux-i586.bin
#sudo ./jdk-6u31-linux-i586.bin
Now comes the Hadoop :)
Let's download and configure Hadoop in pseudo-distributed mode. You can read more about the various modes on the Hadoop website.
Download the latest hadoop version from its website
http://hadoop.apache.org/common/releases.html
Download the Hadoop 1.0.x tar.gz from the Hadoop website.
Extract it into some folder (say /home/hadoop/software/20/).
All the software in this tutorial has been downloaded to that location.
For other modes (standalone and fully distributed) please see the Hadoop documentation.
Go to the conf directory in the Hadoop folder, open core-site.xml, and add the following property inside the empty configuration tags:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost</value>
</property>
</configuration>
Similarly do for
conf/hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
conf/mapred-site.xml:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:8021</value>
</property>
</configuration>
Environment variables
In the conf/hadoop-env.sh file, change JAVA_HOME to the location where you installed Java,
e.g.
export JAVA_HOME=/usr/java/jdk1.6.0_31
Configure the environment variables for the JDK and Hadoop as follows:
Go to the ~/.profile file in the current user's home directory
Add the following
You can change the variable paths if you have installed hadoop and java at some other locations
export JAVA_HOME="/usr/java/jdk1.6.0_31"
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_INSTALL="/home/hadoop/software/hadoop-1.0.1"
export PATH=$PATH:$HADOOP_INSTALL/bin
Testing your installation
Format the HDFS
# hadoop namenode -format
hadoop@jj-VirtualBox:~$ start-dfs.sh
starting namenode, logging to /home/hadoop/software/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-namenode-jj-VirtualBox.out
localhost: starting datanode, logging to /home/hadoop/software/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-datanode-jj-VirtualBox.out
localhost: starting secondarynamenode, logging to /home/hadoop/software/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-secondarynamenode-jj-VirtualBox.out
hadoop@jj-VirtualBox:~$ start-mapred.sh
starting jobtracker, logging to /home/hadoop/software/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-jobtracker-jj-VirtualBox.out
localhost: starting tasktracker, logging to /home/hadoop/software/hadoop-1.0.1/libexec/../logs/hadoop-hadoop-tasktracker-jj-VirtualBox.out
Open the browser and point it to these pages:
localhost:50030 (JobTracker)
localhost:50070 (NameNode)
They will open the status pages for Hadoop.
That's it, this completes the installation of Hadoop; now you are ready to play with it.
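As a quick smoke test (a sketch; the examples jar name matches the Hadoop 1.0.1 tarball used in this post, and $HADOOP_INSTALL is the variable set above), you can copy a file into HDFS and run one of the bundled example jobs:

# Put a local file into HDFS and list it
hadoop fs -mkdir /user/hadoop/input
hadoop fs -put $HADOOP_INSTALL/conf/core-site.xml /user/hadoop/input
hadoop fs -ls /user/hadoop/input
# Run the bundled pi estimator example (2 maps, 10 samples each)
hadoop jar $HADOOP_INSTALL/hadoop-examples-1.0.1.jar pi 2 10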
I often get this problem where http://localhost:50070/dfshealth.jsp crashes and doesn't show anything.
I am running a pseudo-distributed configuration.
One temporary solution which I found online was to format the DFS again, but this is very frustrating.
Also, in the JobTracker history at
http://localhost:50030/jobtracker.jsp
I get the following message:
HTTP ERROR 500
Problem accessing /jobhistoryhome.jsp. Reason:
INTERNAL_SERVER_ERROR
http://localhost:50030/jobhistoryhome.jsp
I see a similar problem was also observed here:
http://grokbase.com/p/hadoop/common-user/10383vj1gn/namenode-problem
Solution
If you look carefully at the namenode log, we see the error:
org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /tmp/hadoop-hadoop/dfs/name is in an inconsistent state: storage directory does not exist or is not accessible.
This says that the following properties are not properly set.
Normally this is due to the machine having been rebooted and /tmp being cleared out. You do not want to leave the Hadoop name node or data node storage in /tmp for this reason. Make sure you properly configure dfs.name.dir and dfs.data.dir to point to directories outside of /tmp and other directories that may be cleared on boot.
The quick setup guide is really just to help you start experimenting with Hadoop. For setting up a cluster for any real use, you'll want to follow the next guide, Cluster Setup:
http://hadoop.apache.org/common/docs/current/cluster_setup.html
So here is what I did: in hadoop-site.xml I added the following two properties, and now it's working fine.
<property>
<name>dfs.name.dir</name>
<value>/home/hadoop/workspace/hadoop_space/name_dir</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/hadoop/workspace/hadoop_space/data_dir</value>
</property>
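After adding those properties you have to create the directories and reformat the namenode. A sketch of the follow-up steps (note that reformatting wipes whatever was in HDFS, which is fine for a fresh pseudo-distributed setup):

# Create the storage directories referenced in the config
mkdir -p /home/hadoop/workspace/hadoop_space/name_dir
mkdir -p /home/hadoop/workspace/hadoop_space/data_dir
# Stop the daemons, reformat HDFS against the new name dir, and start again
stop-all.sh
hadoop namenode -format
start-dfs.sh
start-mapred.sh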
Source : http://lucene.472066.n3.nabble.com/Directory-tmp-hadoop-root-dfs-name-is-in-an-inconsistent-state-storage-directory-DOES-NOT-exist-or-ie-td812243.html
Found some other solution to this problem? Please share below, thanks.