Big Data Hadoop & Spark

Hive Use Case – Pokemon Data Analysis

In this post, we will be performing certain Hive queries to perform data analysis on Pokémon Go characters.

So, what is Pokémon Go?

Pokémon Go is a free-to-play, location-based augmented reality game developed by Niantic for iOS and Android devices. It was released only in July 2016 and only in selected countries. You can download Pokémon for free of cost and start playing. You can also use PokéCoins to purchase Pokéballs, the in-game item you need to be able to catch Pokémon. 

Performing Pokémon Data Analysis

You can download the dataset necessary for this data analysis from here.

The dataset consists of 11 columns and their respective description is as follows:

Pokemonid_Number: This column represents id of each Pokémon.

Name: This column represents the name of the Pokémon.

Type 1: This column represents the property of a Pokémon.

Type 2: This column represents the extended property of the same Pokémon.

A Pokémon may be one or both the types. For instance, Charmander is a Fire type, while Bulbasaur is both a Grass type as well as a Poison type. With the current 18-type system, there are 324 possible ways to assign these types to Pokémon, along with 171 unique combinations. As of Generation VI, 133 different type combinations have been used.

Total: This column represents the sum of all character points of a Pokémon (HP, Attack, Defense, Sp. Atk, Sp. Def, and Speed).

HP (Hit Points): This column represents Pokémon Hit Points, which is a value that determines how much damage a Pokémon can receive. When a Pokémon’s HP is down to ‘0’, the Pokémon will faint. HP is the most frequently affected stat of them all, as a depleting HP is a key factor in winning a battle.

Attack: This column represents the Attack stat.

Defense: This column represents the Defense stat.

Sp. Atk: This column represents a Pokémon’s Special Attack stat.

Sp. Def: This column represents a Pokémon’s Special Defense stat.

Speed: This column represents the speed stat of a Pokémon.

Let’s begin by creating a table to hold the dataset, as shown below.

CREATE TABLE pokemon (Number Int,Name String,Type1 String,Type2 String,Total Int,HP Int,Attack Int,Defense Int,Sp_Atk Int,Sp_Def Int,Speed Int) row format delimited fields terminated BY ',' lines terminated BY '\n' tblproperties("skip.header.line.count"="1");

The above command will create a table called ‘pokemon’, with the fields as shown in the dataset description. We have given the parameter to skip the header line, so while loading the dataset, this ‘pokemon’ table will ignore the header line.

Next, let’s load the dataset into the table as shown below.

load data local inpath '/home/acadgild/Desktop/Pokemon.csv ' INTO table pokemon;

Problem Statement 1:

Find out the average HP (Hit points) of all the Pokémon, using the below query.

Select avg(HP) from pokemon;

In the above screenshot, you can see that the average Hit point of the Pokémon is 69.25875.

Problem Statement 2:

Create and insert values of existing table ‘pokemon’ into a new table ‘pokemon1’, with an additional column ‘power_rate’ to find the count of ‘powerful’ and ‘moderate’ from the table ‘pokemon1’.

Now, based on the average hit points, we will create another column called ‘power_rate’.

In order to segregate the Pokémon, we will use if condition inside the select statement, which will create one more column in our dataset. The if condition should be used in the following manner inside a Hive query.

if(expr1,expr2,expr3)

Now, we will create a table based on the condition that if the HP is greater than the average HP, then it is powerful, and if the HP is less than the average, then it is Moderate and a neutral condition is considered as powerless. The same is given as a Hive query below.

create table pokemon1 as select *, IF(HP>69.25875, 'powerful', IF(HP<69.25875, 'Moderate','powerless')) AS power_rate from pokemon;

With the above query, a new column power_rate has been created.

Now, we will find out the number of powerful and moderate HP Pokémons present, using the below query.

select COUNT(name),power_rate from pokemon1 group by power_rate;

You can see that in the above screenshot, we have the result as 422 Pokémons with moderate power and 378 Pokémons with high power.

Problem Statement 3:

Find out the top 10 Pokémons according to their HP’s using the below query.

select name,hp from pokemon1 order by hp desc limit 10;

You can see the list of top 10 Pokémons according to their HP’s.

Problem Statement 4:

Find out the top 10 Pokémons based on their Attack stat, using the below query.

select name,attack from pokemon1 order by attack desc limit 10;

You can see the list of top 10 Pokémons according to their attack.
Hadoop

Problem statement 5:

Find out the top 10 Pokémons based on their Defense stat, using the below query.

Now, we will see the list of top 10 Pokémons according to their defense, using the below query.

select name,defense from pokemon1 order by defense desc limit 10;

In the above screenshot, you can see the list of top 10 Pokémons according to their defense.

Problem statement 6:

Find out the top 10 Pokémons based on their total power.

select name,total from pokemon1 order by total desc limit 10;

In the above screenshot, you can see the list of top 10 Pokémons according to their all round powers.

Problem statement 7:

Find out the top 10 Pokémons having a drastic change in their attack and sp.attack, using the below query.

select name,(attack-sp_atk) as atk_diff from pokemon1 order by atk_diff limit 10;

In the above screenshot, you can see the list of top 10 Pokémons having a drastic change in their attack and sp.attack.

Problem statement 8:

Find out the top 10 Pokémons having a drastic change in their defense and sp.defense, using the below query.

select name,(defense-sp_defense) as def_diff from pokemon1 order by def_diff limit 10;


In the above screenshot, you can see the list of top 10 Pokémons having a drastic change in their defense and sp.defense.

Problem statement 9:

Find out the top 10 fastest Pokémons, using the below query.

Select name, speed from pokemon order by speed desc limit 10;

We hope this post has been helpful in understanding how to perform data analysis using Hive. In the case of any queries, feel free to comment below and we will get back to you at the earliest.

Keep visiting our site www.acadgild.com for more updates on Big Data and other technologies.

Hadoop

2 Comments

  1. Hi there, have you guys truncated the data from the csv file because there are 800 records and the average hitpoints is different from yours

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Related Articles

Close
Close