Introduction to Apache Pig

Pig represents big data processing as data flows. Apache Pig is a high-level platform for processing large datasets. It provides a layer of abstraction over MapReduce and a high-level scripting language, Pig Latin, in which data analysis programs are written. To process data stored in HDFS, programmers first write scripts in Pig Latin. Internally, the Pig Engine (a component of Apache Pig) converts these scripts into a series of map and reduce tasks. These tasks are hidden from the programmer, which is what gives Pig its high level of abstraction. Pig Latin and the Pig Engine are the two main components of Apache Pig. The results of a Pig job are typically stored in HDFS.
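To make the idea of a data flow being turned into map and reduce tasks more concrete, here is a minimal Python sketch. It is not how the Pig Engine is implemented: the record layout, the function names (map_phase, shuffle, reduce_phase, run_flow) and the sample data are all assumptions made for illustration. It mimics a flow that filters records, groups them by a key and counts each group.

```python
from collections import defaultdict

# Hypothetical input records: (user, url, bytes).
# In Pig these would be loaded from a file in HDFS.
records = [
    ("alice", "/home", 2048),
    ("bob", "/home", 512),
    ("alice", "/cart", 4096),
    ("carol", "/home", 1500),
]


def map_phase(rows, min_bytes):
    """Filter rows and emit a (user, 1) pair for each row that passes."""
    for user, _url, size in rows:
        if size > min_bytes:
            yield user, 1


def shuffle(pairs):
    """Group emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Aggregate each group; here, count the rows per user."""
    return {key: sum(values) for key, values in grouped.items()}


def run_flow(rows, min_bytes=1024):
    """Chain the three steps into one data flow."""
    return reduce_phase(shuffle(map_phase(rows, min_bytes)))


print(run_flow(records))  # {'alice': 2, 'carol': 1}
```

In a Pig Latin script the same flow would be written as a few declarative statements, and the split into map and reduce work would be left to the Pig Engine.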

Note: The Pig Engine has two types of execution environment: a local execution environment in a single JVM (used when the dataset is small) and a distributed execution environment on a Hadoop cluster.



Need of Pig: One limitation of MapReduce is that the development cycle is very long. Writing the mapper and reducer, compiling and packaging the code, submitting the job and retrieving the output is a time-consuming task. Apache Pig reduces development time using a multi-query approach. Pig is also beneficial for programmers who do not have a Java background. Roughly 200 lines of Java code can often be written in about 10 lines of Pig Latin. Programmers who know SQL need less effort to learn Pig Latin.

Evolution of Pig: Apache Pig was first developed in 2006 by researchers at Yahoo. At that time, the main goal was to run MapReduce jobs on extremely large datasets. In 2007 it moved to the Apache Software Foundation (ASF), which made it an open source project. The first version (0.1) of Pig was released in 2008. The latest version of Apache Pig is 0.17, released in 2017.



Features of Apache Pig: 

Difference between Pig and MapReduce
| Apache Pig | MapReduce |
| --- | --- |
| It is a scripting language. | Jobs are typically written in a compiled language such as Java. |
| Abstraction is at a higher level. | Abstraction is at a lower level. |
| It needs fewer lines of code than MapReduce. | It needs more lines of code. |
| Less development effort is needed. | More development effort is required. |
| Code is generally less efficient than hand-written MapReduce. | Code can be more efficient than the equivalent Pig script. |
| Pig provides built-in functions for ordering, sorting and union. | Such data operations are hard to perform and must be coded by hand. |
| It allows nested data types such as map, tuple and bag. | It has no built-in nested data types. |

Applications of Apache Pig:  

Types of Data Models in Apache Pig: It consists of 4 types of data models: atom, tuple, bag and map.
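As a rough illustration of the data models, the sketch below uses Python built-in types as stand-ins for Pig's atom, tuple, bag and map. The mapping is an analogy only, and every variable name and value in it is hypothetical. In Pig a bag is an unordered collection of tuples that may contain duplicates, so the Python list used here is only an approximation.

```python
# Atom: a single value, such as a string or a number.
name = "alice"
age = 30

# Tuple: an ordered set of fields.
person = (name, age)

# Bag: a collection of tuples; duplicates are allowed.
# (A list is used as a stand-in; a real Pig bag has no guaranteed order.)
people = [("alice", 30), ("bob", 25), ("alice", 30)]

# Map: a set of key-value pairs.
profile = {"city": "Pune", "visits": 3}

# Nesting: a tuple whose fields are an atom, a bag and a map.
record = (name, people, profile)


def count_in_bag(bag, wanted):
    """Count how many tuples in the bag are equal to `wanted`."""
    return sum(1 for item in bag if item == wanted)


print(count_in_bag(people, person))  # 2
print(record[2]["city"])             # Pune
```

The nested record at the end corresponds to the nested data types that the comparison table lists as an advantage of Pig over plain MapReduce.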

 
