All CategoriesBig Data Hadoop & Spark - Advanced

Census Data Analysis Using Apache Hive Zeppelin

What is Census data Analysis?

Census is an official count or survey, especially of a population. In India, the census data has been surveyed 15 times as of 2011. Census is conducted every 10 years. It has been conducted by the Registrar General and Census Commissioner of India, under the Ministry of Home Affairs, Government of India.

In this blog, we will perform analysis on the census data collected in 2001 using Hive and we will visualize the data through Apache Zeppelin and derive key points from this analysis.

We recommend our users to go through our previous blog on Apache Zeppelin to know how to install Zeppelin and how to integrate it with Hive.

You can download the dataset from the below link

https://drive.google.com/open?id=0ByJLBTmJojjzWllGZFJFaXFVbU0

Data set description of this data is as follows:

State,District,Persons,Males,Females,Growth_1991_2001_,Rural,Urban,Scheduled_Caste_population,Percentage_SC_to_total,Number_of_households,Household_size_per_household_,Sex_ratio_females_per_1000_males_,Sex_ratio__0_6_years_,Scheduled_Tribe_population,Percentage_to_total_population__ST_,Persons_literate,Males_Literate,Females_Literate,Persons_literacy_rate,Males_Literatacy_Rate,Females_Literacy_Rate,Total_Educated,Data_without_level,Below_Primary,Primary,Middle,Matric_Higher_Secondary_Diploma,Graduate_and_Above,X0_4_years,X5_14_years,X15_59_years,X60_years_and_above__Incl_A_N_S_,Total_workers,Main_workers,Marginal_workers,Non_workers,SC_1_Name,SC_1_Population,SC_2_Name,SC_2_Population,SC_3_Name,SC_3_Population,Religeon_1_Name,Religeon_1_Population,Religeon_2_Name,Religeon_2_Population,Religeon_3_Name,Religeon_3_Population,ST_1_Name,ST_1_Population,ST_2_Name,ST_2_Population,ST_3_Name,ST_3_Population,Imp_Town_1_Name,Imp_Town_1_Population,Imp_Town_2_Name,Imp_Town_2_Population,Imp_Town_3_Name,Imp_Town_3_Population,Total_Inhabited_Villages,Drinking_water_facilities,Safe_Drinking_water,Electricity_Power_Supply_,Electricity_domestic_,Electricity_Agriculture_,Primary_school,Middle_schools,Secondary_Sr_Secondary_schools,College,Medical_facility,Primary_Health_Centre,Primary_Health_Sub_Centre,Post__telegraph_and_telephone_facility,Bus_services,Paved_approach_road,Mud_approach_road,Permanent_House,Semi_permanent_House,Temporary_House

Let’s quickly create a hive table with this data description, and is as follows:

%hive
Create table census(State String,District String,Persons String,Males int,Females int,Growth_1991_2001 int,Rural int,Urban int,Scheduled_Caste_population int,Percentage_SC_to_total int,Number_of_households int,Household_size_per_household int,Sex_ratio_females_per_1000_males int ,Sex_ratio_0_6_years int,Scheduled_Tribe_population int,Percentage_to_total_population_ST int,Persons_literate int,Males_Literate int,Females_Literate int,Persons_literacy_rate int,Males_Literatacy_Rate int,Females_Literacy_Rate int,Total_Educated int,Data_without_level int,Below_Primary int,Primary int,Middle int,Matric_Higher_Secondary_Diploma int,Graduate_and_Above int,X0_4_years int,X5_14_years int,X15_59_years int,X60_years_and_above_Incl_ANS int,Total_workers int,Main_workers int,Marginal_workers int,Non_workers int,SC_1_Name String,SC_1_Population int,SC_2_Name String,SC_2_Population int,SC_3_Name String,SC_3_Population int,Religeon_1_Name String,Religeon_1_Population int,Religeon_2_Name String,Religeon_2_Population int,Religeon_3_Name String,Religeon_3_Population int,ST_1_Name String,ST_1_Population int,ST_2_Name String,ST_2_Population int,ST_3_Name String,ST_3_Population int,Imp_Town_1_Name String,Imp_Town_1_Population int,Imp_Town_2_Name String,Imp_Town_2_Population int,Imp_Town_3_Name String,Imp_Town_3_Population int,Total_Inhabited_Villages int,Drinking_water_facilities int,Safe_Drinking_water int,Electricity_Power_Supply int,Electricity_domestic int,Electricity_Agriculture int,Primary_school int,Middle_schools int,Secondary_Sr_Secondary_schools int,College int,Medical_facility int,Primary_Health_Centre int,Primary_Health_Sub_Centre int,Post_telegraph_and_telephone_facility int,Bus_services int,Paved_approach_road int,Mud_approach_road int,Permanent_House int,Semi_permanent_House int,Temporary_House int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE

We have created a Hive table with the name census through Zeppelin and the table description can be seen in the screen shot shown below.


Now we will load this data into the created table using the load command as shown below.

load data local inpath '/home/kiran/Documents/datasets/census.csv' into table census


We have successfully loaded the data into the census table. This data can be seen using the selected command as show below.

select * from census

  1. State-Wise Population

Here we will find the population of each state and we have arranged them in a descending order. The state of Uttar Pradesh (UP) stays in the first place and the same can be seen in the following bar and pie charts:

%hive
select state,sum(persons) as total_population from census group by state order by total_population desc



2. Growth Rate of Each State Between 1991-2001
Census are calculated for every 10 years. In between, 1991-2001, we can see the growth rate of each state in the following bar and pie charts. Nagaland stands at the top with the highest growth rate.

%hive
select state,avg(Growth_1991_2001) as total_growth from census group by state



3. Literacy Rate of Each State
Hadoop
Literacy is an important aspect of state’s all round growth. Literacy rate of each state can be calculated using the following query. Kerala shows the highest literacy rate with over 90% of people are educated.

%hive
select state,avg(Persons_literacy_rate) from census group by state



4. States with More Female Population
In every state, women are given some special privileges. Here, from the following query, you can find states which have a greater female population.

%hive
select state, sum(Males)-sum(Females) from census group by state

Kerala has more female population and the same can be visualized using the Bar and Area chart graphs.


5. Percentage of Population in Every State
Here, we can find the percentage of a population each state has using the following query. UP itself has 16% of India’s total population.

%hive
select state, (sum(persons) * 100.0) / SUM(sum(persons)) over() as percent_pop_by_state
FROM census group by state



6.Percentage of People Working in Each State
Here, we will find the percentage of people working in each state and visualize the same using bar and pie charts. Percentage of working people can be calculated using the following query.

%hive
select state,sum(Total_workers)*100/sum(persons) from census group by state

Mizoram stays on the top with over 52% its population that works.


Key points drawn from the analysis:

  • Uttar Pradeshhas the highest population that constitutes 16% of India’s total population
  • Top 5 states by population have over 50% of India’s total population
  • Lakshadweep and Andaman Nicobar Islands are the smallest ones
  • Kerala and Pondicherry have more female population than male
  • Nagaland has the highest growth rate of population
  • Mizoram has the highest number of people working
  • Kerala has the highest literacy rate

We hope this blog helped you in understanding how to perform analysis and visualize your data using zeppelin. Keep visiting our site www.acadgild.com for more updates on Big data and other technologies.
Hadoop

Tags

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Related Articles

Close
Close