This example explains how to start programming in Pig.
I followed the book Programming Pig.
This post assumes that you have already installed Pig on your computer. If you need help, you can read the tutorial on installing Pig.
So let's get started and write our first Pig program, using the same code example given in chapter 2 of the book.
Download the code examples from GitHub (link below):
https://github.com/alanfgates/programmingpig
Pig can run in local mode and mapreduce mode.
Local mode means that the source data is read from a directory on your computer's local file system. So to run a program, you go to the directory where the data is and then run the Pig script to analyze it.
I downloaded the code examples from the link above.
Now I go to the data directory, where all the data is present.
# cd /home/hadoop/Downloads/PigBook/data
Change the path depending on where you copied the code on your computer.
Now let's start Pig in local mode:
# pig -x local
The -x local option says: Dear Pig, let's work locally on this computer.
The output is similar to the following:
2012-03-11 11:44:13,346 [main] INFO org.apache.pig.Main - Logging error messages to: /home/hadoop/Downloads/PigBook/data/pig_1331446453340.log
2012-03-11 11:44:13,720 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop filesystem at: file:///
Pig then enters the grunt> shell. Grunt is Pig's interactive shell, where you can type Pig Latin statements.
Let's list all the files present in the data directory:
grunt> ls
The output is shown below:
file:/home/hadoop/Downloads/PigBook/data/webcrawl<r 1> 255068
file:/home/hadoop/Downloads/PigBook/data/baseball<r 1> 233509
file:/home/hadoop/Downloads/PigBook/data/NYSE_dividends<r 1> 17027
file:/home/hadoop/Downloads/PigBook/data/NYSE_daily<r 1> 3194099
file:/home/hadoop/Downloads/PigBook/data/README<r 1> 980
file:/home/hadoop/Downloads/PigBook/data/pig_1331445409976.log<r 1> 823
It shows the list of files present in that folder (data).
Let's run a program. Chapter 2 has one Pig script.
Go to the PigBook/examples/chap2 folder, where there is a script named average_dividend.pig.
The code of the script is as follows:
dividends = load 'NYSE_dividends' as (exchange, symbol, date, dividend);
grouped = group dividends by symbol;
avg = foreach grouped generate group, AVG(dividends.dividend);
store avg into 'average_dividend';
In plain English, the above code says the following:
Load the NYSE_dividends file, which contains the fields exchange, symbol, date and dividend
Group the records in that file by symbol
Calculate the average dividend for each symbol and
Store the average results in the average_dividend folder
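To make the group-and-average step concrete, here is a rough Python sketch of the same calculation. It is only an illustration of what the script computes, not how Pig runs it internally. It assumes the input is tab-separated (Pig's default load format), and the sample rows and the function name average_dividend are made up for the example.

```python
import csv
import io
from collections import defaultdict


def average_dividend(lines):
    """Group (exchange, symbol, date, dividend) rows by symbol
    and return a dict mapping each symbol to its average dividend."""
    totals = defaultdict(lambda: [0.0, 0])  # symbol -> [sum, count]
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) != 4:
            continue  # skip blank or malformed lines
        _exchange, symbol, _date, dividend = row
        totals[symbol][0] += float(dividend)
        totals[symbol][1] += 1
    return {symbol: total / count for symbol, (total, count) in totals.items()}


# Made-up sample rows, not taken from the real NYSE_dividends file.
sample = io.StringIO(
    "NYSE\tAAA\t2009-01-01\t0.10\n"
    "NYSE\tAAA\t2009-04-01\t0.30\n"
    "NYSE\tBBB\t2009-02-01\t0.50\n"
)
result = average_dividend(sample)
for symbol, avg in sorted(result.items()):
    print(symbol, avg)
```

The four Pig Latin lines do the same job, and Pig takes care of splitting the work across machines when the data gets big.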
Result
After a lot of processing, the output looks like this:
Input(s):
Successfully read records from: "file:///home/hadoop/Downloads/PigBook/data/NYSE_dividends"
Output(s):
Successfully stored records in: "file:///home/hadoop/Downloads/PigBook/data/average_dividend"
Job DAG:
job_local_0001
2012-03-11 11:47:10,994 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
To check the output, go to the average_dividend directory, which is created within the data directory (remember, we started Pig in this directory).
There is one MapReduce part file, part-r-00000, that has the final results.
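If you want to look at the results from a program rather than opening the file by hand, a small Python sketch like the one below could read the part files. It assumes the output is tab-separated with one symbol and one average per line (the default format of Pig's store). The function name read_averages and the demo data are made up for illustration.

```python
import glob
import os
import tempfile


def read_averages(output_dir):
    """Read every part-* file in output_dir into a {symbol: average} dict."""
    results = {}
    for path in sorted(glob.glob(os.path.join(output_dir, "part-*"))):
        with open(path) as part_file:
            for line in part_file:
                line = line.rstrip("\n")
                if not line:
                    continue
                symbol, avg = line.split("\t")
                if avg:  # skip rows where no average was written
                    results[symbol] = float(avg)
    return results


# Demo with a temporary directory and made-up values, so the sketch runs
# without the real output. Point it at your average_dividend directory instead.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "part-r-00000"), "w") as demo_file:
    demo_file.write("AAA\t0.2\nBBB\t0.5\n")

demo = read_averages(demo_dir)
print(demo)
```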
That's it. Pig Latin has done all the magic behind the scenes.
Coming next: running Pig Latin in mapreduce mode.