All CategoriesBig Data Hadoop & Spark - Advanced

H1B Visa Applicants Data Analysis

The H1B is an employment-based, non-immigrant visa category for temporary foreign workers in the United States. For a foreign national to apply for H1B visa, an US employer must offer a job and petition for H1B visa with the US immigration department. This is the most common visa status applied for and held by international students once they complete college/ higher education (Masters, Ph.D.) and work in a full-time position.

In this blog, we will be performing analysis on the H1B visa applicants between the years 2011-2015. After analyzing the data, we can derive the following facts.

  • Is the number of petitions with Data Engineer job title increasing over time?

  • Which part of the US has the most Data Engineer jobs?

  • Which industry has the most number of Data Scientist positions?

  • Which employers file the most petitions each year?

The dataset has nearly 3 million records. You can download the dataset from here.

The dataset description is as follows:

The columns in the dataset include:

  • CASE_STATUS: Status associated with the last significant event or decision. Valid values include “Certified,” “Certified-Withdrawn,” Denied,” and “Withdrawn”.
  • EMPLOYER_NAME: Name of employer submitting labour condition application.
  • SOC_NAME: the Occupational name associated with the SOC_CODE. SOC_CODE is the occupational code associated with the job being requested for temporary labour condition, as classified by the Standard Occupational Classification (SOC) System.
  • JOB_TITLE: Title of the job
  • FULL_TIME_POSITION: Y = Full Time Position; N = Part Time Position
  • PREVAILING_WAGE: Prevailing Wage for the job being requested for temporary labour condition. The wage is listed at annual scale in USD. The prevailing wage for a job position is defined as the average wage paid to similarly employed workers in the requested occupation in the area of intended employment. The prevailing wage is based on the employer’s minimum requirements for the position.
  • YEAR: Year in which the H1B visa petition was filed
  • WORKSITE: City and State information of the foreign worker’s intended area of employment
  • lon: longitude of the Worksite
  • lat: latitude of the Worksite

In the data, few columns are enclosed by double quotes and also we have comma’s in a single column and the column is enclosed by double quotes. So we have used hive csv serve to load the data. In the quoteChar, we have given “(double quote). So this will take the column value in between the double quotes.

Let’s create a table to load the h1b applicant’s data as shown below.

CREATE TABLE h1b_applications(s_no int,case_status string, employer_name string, soc_name string, job_title string, full_time_position string,prevailing_wage int,year string, worksite string, longitute double, latitute double )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "\""
) STORED AS TEXTFILE;

We have created a table with name h1b_applications. Let’s load the data into the table

load data local inpath '/home/kiran/Documents/datasets/h1b.csv' into table h1b_applications;

We have successfully loaded the data into the table. You can see the same in the below screenshot.

Let’s fire a query on the table to check whether the data is loaded successfully or not.

select * from h1b_applications limit 10;

In the below screenshot we can see that data has been loaded successfully.

Let’s get started with the problem statements which we have prepared for the analysis

Is the number of petitions with Data Engineer job title increasing over time?

Here we will find the number of applicants applied for the various kind of data engineer positions

The below query will evaluate the count of different kinds of data engineer positions that are present.

select year,count(year) from h1b_applications where job_title LIKE '%DATA ENGINEER%' group by year;

Here is the result of this query

2011 60
2012 81
2013 151
2014 249
2015 394
2016 786

In the below screenshot you can see the visualization graph.

With this graph, we can definitely say that number of petitions for data engineer positions are definitely increasing over time.

If you want to get the detailed result of the number of positions for each kind of data engineer positions, you can run the below result.

select year,job_title,count(year) from h1b_applications where job_title LIKE '%DATA ENGINEER%' group by year,job_title;

In the below screenshot, you can see the detailed number of petitions for all kinds of data engineer positions.

If you want to collect the list of all the data engineer petitions for every year, you can collect the list of all unique job titles using the collect_set function in hive as shown below.

select a.year,count(a.year),collect_set(a.job_title) from h1b_applications a where a.job_title LIKE '%DATA ENGINEER%' group by a.year;

In the below screenshot, you can see the result for the same. Year, the total number of petitions and the collection of all unique job_titles.

Let’s now evaluate the second problem statement

Which part of the US has the most Data Engineer jobs?

Hadoop

select split(worksite,'[,]')[1] as state, count(split(worksite,'[,]')[1]) as job_cnt from h1b_applications where job_title LIKE '%DATA ENGINEER%' group by split(worksite,'[,]')[1] order by job_cnt desc

With the above query, we will get the count of the number of data engineer petitions as per the state. In the worksite column, we have a combination of city and state separated by a comma so we are splitting the column by comma and selecting the 2nd element in the array.

In the below screenshot, you can see the state and the number of data engineer applicants applied.

You can see the pie chart for the same result.

Let’s now check the number of petitions every year by state for data engineer positions

select split(worksite,'[,]')[1] as state,year,count(split(worksite,'[,]')[1]) as job_cnt from h1b_applications where job_title LIKE '%DATA ENGINEER%' group by year,split(worksite,'[,]')[1] order by year ;

In the below screenshot, you can see the number of petitions in each state every year for different data engineer job roles.

Now we will see the number different kinds of data engineer position that are present in each state. The below query will fetch you the desired results.

select split(worksite,'[,]')[1] as state, job_title,count(split(worksite,'[,]')[1]) as job_cnt from h1b_applications where job_title LIKE '%DATA ENGINEER%' group by job_title,split(worksite,'[,]')[1] order by job_cnt desc;

In the below screenshot, you can see the number of petitions for different kinds of data engineer positions in each state.

Let’s move on to our 3rd problem statement.

Which industry has the most number of Data Scientist positions?

The below query will find the number of petitions each industry received for data scientist position ordered by the count in descending order.

select soc_name,count(soc_name) as cnt from h1b_applications where job_title LIKE '%DATA SCIENTIST%' group by soc_name order by cnt desc;

In the below screenshot you can see the number of data scientist positions present for all the industries. Statisticians industry need more data scientists.

Let’s move on to our next problem statement

Which employers file the most petitions each year?

The below query will select the employer_name and the number of petitions it has received each year.

select employer_name, year,count(year) as cnt from h1b_applications group by year,employer_name order by cnt;

In the below screenshot you can see the result that Infosys ltd has received more number of petitions for 2013,2014,2015,2016 years.

We hope this blog helped you in understanding and analyzing the job trends for foreigners in US. Keep visiting our site www.acadgild.com for more updates on Big Data and other technologies.

 

Hadoop

2 Comments

Leave a Reply

Your email address will not be published. Required fields are marked *

Related Articles

Close