Apache Pig Introduction Tutorial

Apache Pig is a platform for analyzing large data sets.

In simple terms, you have a very large amount of data that needs some processing or analysis. One way is to write MapReduce code and run it on the data.

The other way is to write Pig scripts, which are in turn converted to MapReduce jobs that process your data.

Pig consists of two parts:

  • Pig Latin language
  • Pig engine


Pig Latin is a scripting language that lets you describe how data from one or more inputs should be read, how it should be processed, and where the result should be stored.

The flows can be simple or complex, with one or many processing steps applied in between. Data can be picked up from multiple inputs.
We can say that a Pig Latin script describes a directed acyclic graph (DAG), where the edges are data flows and the nodes are operators that process the data.
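
As a rough illustration of such a data flow, here is a minimal Pig Latin sketch with two inputs that merge into one output. The file names, delimiters and field layouts are made up for this example and are not from any real data set.

```pig
-- Hypothetical inputs: file names and field layouts are assumptions.
users  = LOAD 'users.txt'  USING PigStorage(',') AS (user_id:int, name:chararray);
orders = LOAD 'orders.txt' USING PigStorage(',') AS (order_id:int, user_id:int, amount:double);

-- A processing step on one branch of the graph: keep only the larger orders.
big_orders = FILTER orders BY amount > 100.0;

-- The two flows meet at a single operator (a node with two incoming edges).
joined = JOIN users BY user_id, big_orders BY user_id;

-- Where the result is stored.
STORE joined INTO 'output/big_orders_by_user';
```

Each alias (users, orders, big_orders, joined) names the output of one operator. The LOAD statements are the inputs, STORE is the output, and the statements in between are the processing nodes.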

The job of the engine is to execute the data flow written in Pig Latin in parallel on Hadoop infrastructure.

Why Pig is required when we can code everything in MapReduce (MR)

Pig provides the standard data processing operations such as group, join, filter, order by (sort) and union right inside Pig Latin.
In MR, each of these has to be coded by hand.
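
To give a feel for this, the sketch below uses filter, group and order by in a few lines of Pig Latin. The input file visits.txt and its fields are assumptions made for this example only.

```pig
-- Hypothetical input: tab-separated records of (user, url, seconds spent).
visits = LOAD 'visits.txt' USING PigStorage('\t')
         AS (user:chararray, url:chararray, seconds:int);

-- filter: drop very short visits.
long_visits = FILTER visits BY seconds > 10;

-- group: collect the remaining visits per user.
by_user = GROUP long_visits BY user;

-- aggregate: total time per user, using the built-in SUM function.
totals = FOREACH by_user GENERATE group AS user,
                                  SUM(long_visits.seconds) AS total_seconds;

-- order by: users with the most time first.
ranked = ORDER totals BY total_seconds DESC;

STORE ranked INTO 'output/time_per_user';
```

Writing the same logic directly in MR would typically need mapper, reducer and driver classes, plus a second job for the sort.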

Pig optimizes Pig Latin scripts while converting them into MR jobs.
It produces an optimized set of MapReduce jobs to run on Hadoop.
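
If you want to look at what Pig generates, the EXPLAIN operator prints the plans for an alias. The small script below is only a sketch, and the input file and fields are again assumed for illustration.

```pig
-- Hypothetical input, same layout as a simple visit log.
visits = LOAD 'visits.txt' USING PigStorage('\t')
         AS (user:chararray, url:chararray, seconds:int);

by_user = GROUP visits BY user;
counts  = FOREACH by_user GENERATE group AS user, COUNT(visits) AS num_visits;

-- Print the logical, physical and MapReduce plans Pig builds for 'counts'.
EXPLAIN counts;
```

The exact plan output depends on the Pig version and on which optimizations are enabled.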

It takes much less time to write a Pig Latin script than to write the corresponding MR code.

Where Pig is useful

  • Transactional ETL data pipelines (the most common use)
  • Research on raw data
  • Iterative processing

Next, you can read about how to install Pig.



 

