All CategoriesBig Data Hadoop & Spark

Beginner’s Guide for Pig

Pig was initially developed by Yahoo! to allow individuals using Apache Hadoop to focus more on analyzing large datasets and spend less time in writing mapper and reducer programs.Pig enables users to write complex data analysis code without prior knowledge in Java. Pig’s simple SQL-like scripting language is called Pig Latin and has its own Pig runtime environment where PigLatin programs are executed.

Pig translates the Pig Latin script into MapReduce so that it can be executed within YARN(Yet Another Resource Negotiator, a cluster management technology) to access a single dataset stored in the Hadoop Distributed File System (HDFS).

100% Free Course On Big Data Essentials

Subscribe to our blog and get access to this course ABSOLUTELY FREE.

Pig has two execution modes. They are:

  1. MapReduce Mode
  2. Local Mode

MapReduce Mode

When Pig runs in MapReduce mode, it deals with files stored in Hadoop cluster and HDFS. The MapReduce mode is the default mode here.

The command for running Pig in MapReduce mode is ‘pig’.

This enables the user to code on grunt shell.

Note:- all Hadoop daemons should be running before starting pig in MR mode

Local Mode

When Pig runs in local mode, it needs access to a single machine, where all the files are installed and run using local host and local file system.

The command for running Pig in local mode is as follows:

pig -x local

 

Before going further to the examples, let’s look at the steps which summarizes Pig execution steps.

  1. The first step in a Pig program is to Load the data you want to manipulate, from HDFS.
  2. The data is then run through a set of transformations (which are translated into a set of mapper and reducer tasks).
  3. Finally, the data is dumped on to the screen or the results are stored in a file somewhere.

Hadoop

Load

The dataset on which the Pig will operate will contain rows and columns by default and is separated by a tab (‘ \t ’).But in general, not every dataset is in this format. So, this format has to be specified when loading the data.

The command for loading data is as follows:

A = LOAD ‘student’ USING PigStorage(‘\t’) AS (name: chararray, age:int, gpa: float);

This command loads and stores the data as structured text files in Relation Alias named ‘A’.We can also have comma(‘ , ’) separated and semicolon(‘ ; ’) separated datasets.

Pig Commands – Example

Let’s look at some Pig commands and see how it works in MapReduce through an example dataset.

Suppose we have this dataset with some rows and columns whose Data Dictionary and data are defined as follows:

Name

Department

Salary

mohan

IT

55000

raju

MEC

40200

manju

ECE

65400

kiran

CS

45600

prateek

EEE

52700

sanju

IT

57300

sam

CVL

92400

ashish

CS

35900

prateek

CS

56700

manju

IT

64300

kiran

MEC

41020

Let’s see the outcome of some of the Pig commands operation on this dataset.

Load

Load

Describe

Group

 

Average

 

Store

Cat

There are several commands and functions to operate within Pig. The more use cases you practice, more will be the degree of perfection in performing data analysis.

Keep visiting our website Acadgild for more updates on Big Data and other technologies. Click here to learn Big Data Hadoop Development.

Hadoop

Tags

prateek

An alumnus of the NIE-Institute Of Technology, Mysore, Prateek is an ardent Data Science enthusiast. He has been working at Acadgild as a Data Engineer for the past 3 years. He is a Subject-matter expert in the field of Big Data, Hadoop ecosystem, and Spark.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Related Articles

Close
Close