PIG Use Case: Pokemon Data Analysis

In this blog for Apache Pig beginners, we will use simple Pig built-in functions to form two lists of 5 randomly selected Pokémon, subject to some parameters set by the management. The Pig queries are listed below for a data analyst to follow.
The Pokémon Fight League (PFL) management for the 2017 match has first decided on a minimum criterion for entry: a Pokémon's defense power must be greater than 55.
Hence, the eligible list will be formed randomly after filtering out the Pokémon with a defense of 55 or less.
Our job is to produce 2 lists of names of the Pokémon eligible to take part in PFL this year, drawn from all 800 participating Pokémon.
Use the link to download the dataset of all the Pokémon taking part in PFL 2017.
Readers may also go through the link to practice basic HIVE Use Case commands on the Pokémon dataset.
First of all, we will load the dataset into PIG. We can use either local mode or MapReduce (MR) mode. Here, we will use local mode.
Command
Load_Data = LOAD '/home/prateek/Documents/PIG/Pokémon.csv' USING PigStorage(',') AS (Sno:int,Name:chararray,Type1:chararray,Type2:chararray,Total:int,HP:int,Attack:int,Defense:int,SpAtk:int,SpDef:int,Speed:int);

Check that the data has loaded correctly by using the DUMP command (for example, DUMP Load_Data;).
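
To illustrate what a typed schema does to each line at load time, here is a minimal Python sketch. It assumes the CSV starts with a header row, and the three-column sample text and the helper names to_int and load are hypothetical, invented for this illustration. In Pig, a field that cannot be cast to its declared type becomes null, so a header line would show up in the DUMP output as a row with empty numeric fields.

```python
import csv
import io

# Hypothetical extract: a header line plus two invented rows (not real dataset values).
sample = "Sno,Name,Defense\n1,Alpha,49\n2,Bravo,63\n"


def to_int(text):
    """Cast to int; return None on failure, standing in for Pig's null."""
    try:
        return int(text)
    except ValueError:
        return None


def load(text):
    """Split each line on commas and apply the declared types (int, chararray, int)."""
    rows = []
    for sno, name, defense in csv.reader(io.StringIO(text)):
        rows.append((to_int(sno), name, to_int(defense)))
    return rows


loaded = load(sample)
for row in loaded:
    print(row)
```

The first printed row carries None in its numeric positions, which is the kind of thing worth looking for when checking the loaded data.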

Ques 1: Find the list of players that have been selected in the qualifying round (Defense > 55).

Explanation:

Command
selected_list = FILTER Load_Data BY Defense>55;
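
As a rough illustration of what this FILTER keeps, here is a small Python sketch. The rows are made-up sample values and select_eligible is a hypothetical helper, not part of Pig. The comparison is strict, so a Defense of exactly 55 does not qualify, and a row whose Defense is null (None here) is dropped as well, which is how Pig treats a comparison against null.

```python
# Hypothetical sample rows: (Name, Defense). Values are invented for illustration.
rows = [
    ("Name", None),   # e.g. a header line whose Defense could not be cast to int
    ("Alpha", 49),
    ("Bravo", 55),
    ("Charlie", 56),
    ("Delta", 100),
]


def select_eligible(rows, threshold=55):
    """Keep rows whose Defense is known and strictly greater than the threshold."""
    return [row for row in rows if row[1] is not None and row[1] > threshold]


selected = select_eligible(rows)
print(selected)
```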

The dataset is now filtered: out of all 800 Pokémon, only 544 are eligible to take part in the tournament. To get this count, see the next problem statement.

Ques 2: State the number of players taking part in the competition after getting selected in the qualifying round.

Explanation:
Command
group_selected_list = GROUP selected_list ALL;
count_selected_list = FOREACH group_selected_list GENERATE COUNT(selected_list);
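
The two-step pattern above (group everything, then count) can be sketched in Python as follows. The helper names group_all and count and the sample tuples are assumptions made for illustration. GROUP ... ALL produces a single record holding the key 'all' and one bag with every tuple, and COUNT then counts the tuples in that bag, skipping any tuple whose first field is null.

```python
# Hypothetical filtered relation: (Name, Defense). Values are invented.
selected = [("Charlie", 56), ("Delta", 100), ("Echo", 70)]


def group_all(relation):
    """Mimic GROUP ... ALL: one record with the key 'all' and a bag of all tuples."""
    return [("all", list(relation))]


def count(bag):
    """Mimic COUNT: count tuples whose first field is not null."""
    return sum(1 for record in bag if record[0] is not None)


grouped = group_all(selected)
counts = [count(bag) for _key, bag in grouped]
print(counts)
```

Grouping first is needed because COUNT works on a bag, and GROUP ... ALL is the usual way to put a whole relation into one bag.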


All 544 qualifying Pokémon will be arranged alphabetically, and two teams of 5 Pokémon need to be drawn at random from that list.
This way we will have 2 lists of 5 Pokémon each, ready to fight each other.

Ques 3: Using RANDOM(), generate a random number for each Pokémon on the selected list.

Explanation:
Command
random_include1 = foreach selected_list GENERATE RANDOM(),Name,Type1,Type2,Total,HP,Attack,Defense,SpAtk,SpDef,Speed;

[Screenshot: sample of the list after adding random numbers]

Ques 4: Arrange the new list in descending order by the random-number column.

Explanation: This gives us a randomly ordered list from which the 1st player's Pokémon will be picked.
Command
random1_desending = ORDER random_include1 BY $0 DESC;
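
The idea behind Ques 3 and Ques 4 is a shuffle by sorting: attach a random key to every row, then sort on that key. A minimal Python sketch of the same technique is below. The names, the helper random_order, and the seed are assumptions for illustration; the seed is there only to make the sketch repeatable, whereas the Pig script above is unseeded and gives a different order on each run.

```python
import random

# Hypothetical names standing in for the selected list.
names = ["Alpha", "Bravo", "Charlie", "Delta", "Echo", "Foxtrot"]


def random_order(names, rng):
    """Prepend a random key in [0, 1) to each name, then sort by key, largest first."""
    keyed = [(rng.random(), name) for name in names]   # like GENERATE RANDOM(), Name
    return sorted(keyed, key=lambda pair: pair[0], reverse=True)  # like ORDER BY $0 DESC


rng = random.Random(2017)  # seeded only so this sketch is repeatable
ordered = random_order(names, rng)
for key, name in ordered:
    print(f"{key:.4f} {name}")
```

Sorting ascending would work equally well for shuffling; descending order simply mirrors the query above.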

We also want 1 more list with a random arrangement of Pokémon, from which the 2nd player's Pokémon will be chosen later on.

Ques 5: On a new relation, again associate a random number with each Pokémon and arrange the relation in descending order by the random-number column.

Explanation: We repeat the above two steps to form the 2nd list.
Command
random_include2 = foreach selected_list GENERATE RANDOM(),Name,Type1,Type2,Total,HP,Attack,Defense,SpAtk,SpDef,Speed;
random2_desending = ORDER random_include2 BY $0 DESC;

Now we select the top 5 from each list.

Ques 6: From the two randomly ordered lists, select the top 5 Pokémon for each of the 2 players.

Explanation:
Commands
limit_data_random1_desending = LIMIT random1_desending 5 ;
limit_data_random2_desending = LIMIT random2_desending 5 ;
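
Putting the random ordering and the LIMIT together, the team selection can be sketched in Python as below. The helper pick_team, the placeholder names, and the seed are hypothetical. One thing the sketch makes visible: because both teams are drawn independently from the same eligible list, nothing prevents the same Pokémon from landing on both teams, so the overlap is printed as a check.

```python
import random


def pick_team(names, rng, size=5):
    """Shuffle by random key (descending) and keep the first `size` names."""
    keyed = [(rng.random(), name) for name in names]
    keyed.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _key, name in keyed[:size]]  # like LIMIT ... 5


# 20 placeholder names standing in for the eligible Pokémon.
eligible = [f"Pokemon_{i:03d}" for i in range(1, 21)]

rng = random.Random(7)  # seeded only so this sketch is repeatable
team1 = pick_team(eligible, rng)
team2 = pick_team(eligible, rng)
shared = sorted(set(team1) & set(team2))

print("Player 1:", team1)
print("Player 2:", team2)
print("On both teams:", shared)
```

If overlapping teams are not wanted, one option is to shuffle once and take the first 5 rows for one player and the next 5 for the other; that is a variation on the approach above, not what the queries in this post do.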

Ques 7: Store the data on a local drive under the names player1 and player2, to be announced for the final match (only show the NAME and HP).

Explanation:
Commands
filter_only_name1 = FOREACH limit_data_random1_desending GENERATE Name, HP;

filter_only_name2 = FOREACH limit_data_random2_desending GENERATE Name, HP;

[Screenshot: Name and HP of the 5 Pokémon for Player1]

[Screenshot: Name and HP of the 5 Pokémon for Player2]

That completes the querying: using a few simple built-in functions, we obtained 2 sets of 5 randomly selected Pokémon.
In conclusion, let's store this result on our local system so we can use it as input to our next blog, where we will look at UDFs in PIG and perform calculations through user-defined formulas.
STORE limit_data_random1_desending INTO '/home/acadgild/Documents/prateek/PIG/player1.txt';

STORE limit_data_random2_desending INTO '/home/acadgild/Documents/prateek/PIG/player2.txt';

As a result, the Pokémon for both players have been selected. The players will fight in the finals with their assigned Pokémon.
Finally, subscribe and visit our next blog to see what happens when the players clash on the battlefield to win the PFL with data analytics.
Keep visiting our site www.acadgild.com for more updates on Big Data Ecosystem and other technologies.
prateek

An alumnus of the NIE-Institute Of Technology, Mysore, Prateek is an ardent Data Science enthusiast. He has been working at Acadgild as a Data Engineer for the past 3 years. He is a subject-matter expert in the field of Big Data, the Hadoop ecosystem, and Spark.
