Apache Pig is a platform to analyze large data sets.
In simple terms, you have lots and lots of data that you need to process or analyze. One way is to write Map Reduce code and run it on the data.
The other way is to write Pig scripts, which are in turn converted to Map Reduce code that processes your data.
Pig consists of two parts:
- Pig Latin language
- Pig engine
Pig Latin is a scripting language that lets you describe a data flow: how data should be read from one or more inputs, how it should be processed, and where it should be stored.
The flows can be simple, or complex with several processing steps applied in between. Data can be picked from multiple inputs.
We can say that Pig Latin describes a directed acyclic graph (DAG), where the edges are data flows and the nodes are operators that process the data.
The job of the engine is to execute the data flow written in Pig Latin in parallel on the Hadoop infrastructure.
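To make the idea of a data flow as a directed acyclic graph a bit more concrete, here is a minimal sketch in plain Python. It is not Pig Latin and not how the Pig engine is implemented. The sample data, the operator names and the run_flow helper are all made up for illustration. Each node is an operator, and each dependency between nodes is an edge along which data flows.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical in-memory inputs standing in for two data sources.
users = [("alice", 31), ("bob", 17), ("carol", 45)]
visits = [("alice", "/home"), ("carol", "/cart"), ("alice", "/cart"), ("bob", "/home")]


def load_users():
    return users


def load_visits():
    return visits


def filter_adults(rows):
    # Keep only users aged 18 or more.
    return [row for row in rows if row[1] >= 18]


def join_on_name(adults, visit_rows):
    # Join the filtered users with their visits on the user name.
    ages = dict(adults)
    return [(name, ages[name], url) for name, url in visit_rows if name in ages]


def store(rows):
    # Stand-in for writing the output; here we just return it sorted.
    return sorted(rows)


# Nodes are operators; the list next to each operator names the nodes
# whose output it consumes (the incoming edges of the graph).
operators = {
    "load_users": (load_users, []),
    "load_visits": (load_visits, []),
    "filter_adults": (filter_adults, ["load_users"]),
    "join": (join_on_name, ["filter_adults", "load_visits"]),
    "store": (store, ["join"]),
}


def run_flow(operators):
    # Run every operator after all of the operators it depends on.
    graph = {name: set(deps) for name, (_, deps) in operators.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        func, deps = operators[name]
        results[name] = func(*(results[dep] for dep in deps))
    return results


results = run_flow(operators)
print(results["store"])
```

The sketch runs the operators one after another on a single machine. The point of the Pig engine is that it runs such a flow in parallel on Hadoop.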
Why Pig is needed when we can code everything in MR
Pig provides all the standard data processing operations, like sort, group, join, filter, order by and union, right inside Pig Latin.
In MR we have to do lots of manual coding.
Pig optimizes Pig Latin scripts while converting them into MR jobs, so an optimized version of the Map Reduce code runs on Hadoop.
It takes much less time to write a Pig Latin script than to write the corresponding MR code.
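To get a feel for the manual coding that the Map Reduce style implies, here is a rough sketch of a simple "group and count, then order by count" job written as explicit map, shuffle and reduce steps. It is plain Python on a made-up list of page URLs, not the Hadoop API, and the function names are hypothetical. In Pig Latin, group and order by are built-in operators, so these steps do not have to be written by hand.

```python
from collections import defaultdict

# Hypothetical input: one page URL per record.
records = ["/home", "/cart", "/home", "/search", "/home", "/cart"]


def map_phase(records):
    # Map: emit a (key, 1) pair for every record.
    for url in records:
        yield (url, 1)


def shuffle_phase(pairs):
    # Shuffle: collect all values that share the same key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    # Reduce: collapse each group of values into a single count.
    return {key: sum(values) for key, values in groups.items()}


counts = reduce_phase(shuffle_phase(map_phase(records)))

# Order by count (highest first), breaking ties by URL.
ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
print(ordered)
```

Even in this toy form, grouping and ordering have to be spelled out step by step. A real MR job adds job setup and input/output handling on top of that.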
Where Pig is useful
- Traditional ETL data pipelines (most common use)
- Research on raw data
- Iterative processing
You can read next about how to install Pig