Big Data Hadoop & Spark

MapReduce Use Case – Uber Data Analysis

In this post, we will be performing analysis on the Uber dataset in Hadoop using MapReduce in Java.

The Uber dataset consists of four columns: dispatching_base_number, date, active_vehicles, and trips. You can download the dataset from here.


Problem Statement 1:

In this problem statement, we will compute the total number of trips for each base (the dispatching_base_number, referred to as "basement" in the code below) on each day of the week, from which the busiest days can be read off.

Source Code

Mapper Class:

public static class TokenizerMapper
        extends Mapper<Object, Text, Text, IntWritable> {
    java.text.SimpleDateFormat format = new java.text.SimpleDateFormat("MM/dd/yyyy");
    String[] days = {"Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"};
    private Text basement = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] splits = value.toString().split(",");
        basement.set(splits[0]);
        Date date;
        try {
            date = format.parse(splits[1]);
        } catch (ParseException e) {
            // Skip records whose date does not parse (e.g. the CSV header row).
            return;
        }
        int trips = Integer.parseInt(splits[3]);
        // getDay() is deprecated but still returns 0 (Sunday) through 6 (Saturday).
        String keys = basement.toString() + " " + days[date.getDay()];
        context.write(new Text(keys), new IntWritable(trips));
    }
}

From the mapper, we emit the combination of the base number and the day of the week as the key, and the number of trips as the value.

First, we parse the date, which is in string format, into a Date object using Java's SimpleDateFormat class. Then, to get the day of the week, we use getDay(), which returns an integer for the day of the week (0 for Sunday through 6 for Saturday). We have created an array containing the day abbreviations from Sunday to Saturday, and we index it with the value returned by getDay() to obtain the day name.

After this operation, we emit the combination base number + day of the week as the key and the number of trips as the value.
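The key construction described above can be tried outside Hadoop with a small standalone sketch (the class and method names here are ours, for illustration only):

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class DayKeyDemo {
    static final String[] DAYS = {"Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"};

    // Builds the same "base dayOfWeek" key the mapper emits.
    static String makeKey(String base, String dateStr) throws Exception {
        Date date = new SimpleDateFormat("MM/dd/yyyy").parse(dateStr);
        return base + " " + DAYS[date.getDay()]; // getDay(): 0 = Sunday ... 6 = Saturday
    }

    public static void main(String[] args) throws Exception {
        System.out.println(makeKey("B02512", "09/01/2015")); // 09/01/2015 was a Tuesday
    }
}
```

Note that SimpleDateFormat is lenient by default, so single-digit dates such as 1/1/2015 also parse under the MM/dd/yyyy pattern.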

Reducer Class:

In the reducer, we calculate the total number of trips for each base and day combination, using the lines of code below.

public static class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
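Outside Hadoop, the reduce step's group-by-key-and-sum behavior can be simulated with plain Java collections (this sketch is ours; the names are illustrative):

```java
import java.util.*;

public class ReduceSumDemo {
    // Simulates shuffle-and-reduce: group (key, value) pairs by key, then sum the values.
    static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            sums.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = List.of(
            Map.entry("B02512 Sat", 120),
            Map.entry("B02512 Sat", 80),
            Map.entry("B02512 Sun", 50));
        System.out.println(sumByKey(pairs)); // {B02512 Sat=200, B02512 Sun=50}
    }
}
```

The MapReduce framework performs the grouping for us during the shuffle; the reducer only sees one key at a time with all of its values.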

Whole Source Code:

import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Uber1 {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        SimpleDateFormat format = new SimpleDateFormat("MM/dd/yyyy");
        String[] days = {"Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"};
        private Text basement = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] splits = value.toString().split(",");
            basement.set(splits[0]);
            Date date;
            try {
                date = format.parse(splits[1]);
            } catch (ParseException e) {
                // Skip records whose date does not parse (e.g. the CSV header row).
                return;
            }
            int trips = Integer.parseInt(splits[3]);
            String keys = basement.toString() + " " + days[date.getDay()];
            context.write(new Text(keys), new IntWritable(trips));
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Uber1");
        job.setJarByClass(Uber1.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Running the Program:

First, we need to build a jar file for the above program and run it as a normal Hadoop job, passing the input dataset path and the output directory path as shown below.

hadoop jar uber1.jar /uber /user/output1

In the output directory, a part file is created and contains the below output:

B02512 Sat 15026
B02512 Sun 10487
B02512 Thu 15809
B02512 Tue 12041
B02512 Wed 12691
B02598 Fri 93126
B02598 Mon 60882
B02598 Sat 94588
B02598 Sun 66477
B02598 Thu 90333
B02598 Tue 63429
B02598 Wed 71956
B02617 Fri 125067
B02617 Mon 80591
B02617 Sat 127902
B02617 Sun 91722
B02617 Thu 118254
B02617 Tue 86602
B02617 Wed 94887
B02682 Fri 114662
B02682 Mon 74939
B02682 Sat 120283
B02682 Sun 82825
B02682 Thu 106643
B02682 Tue 76905
B02682 Wed 86252
B02764 Fri 326968
B02764 Mon 214116
B02764 Sat 356789
B02764 Sun 249896
B02764 Thu 304200
B02764 Tue 221343
B02764 Wed 241137
B02765 Fri 34934
B02765 Mon 21974
B02765 Sat 36737
B02765 Sun 22536
B02765 Thu 30408
B02765 Tue 22741
B02765 Wed 24340
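Note that the job emits a total for every base/day combination rather than directly naming the single busiest day per base. If you want that, a small post-processing pass over this output will do it; the sketch below is ours and the names are illustrative:

```java
import java.util.*;

public class BusiestDay {
    // Given "base day total" lines (the job's output format), return the busiest day per base.
    static Map<String, String> busiest(List<String> lines) {
        Map<String, String> bestDay = new HashMap<>();
        Map<String, Integer> bestCount = new HashMap<>();
        for (String line : lines) {
            String[] p = line.trim().split("\\s+");
            String base = p[0], day = p[1];
            int total = Integer.parseInt(p[2]);
            if (total > bestCount.getOrDefault(base, -1)) {
                bestCount.put(base, total);
                bestDay.put(base, day);
            }
        }
        return bestDay;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "B02512 Sat 15026", "B02512 Sun 10487", "B02512 Thu 15809");
        System.out.println(busiest(sample)); // {B02512=Thu}
    }
}
```

Alternatively, a second MapReduce job keyed on the base alone could pick the maximum per base, but for output this small a local pass is simpler.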

Problem Statement 2:

In this problem statement, we will compute the total number of active vehicles for each base on each day of the week.

Source Code

Mapper Class:

public static class TokenizerMapper
        extends Mapper<Object, Text, Text, IntWritable> {
    java.text.SimpleDateFormat format = new java.text.SimpleDateFormat("MM/dd/yyyy");
    String[] days = {"Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"};
    private Text basement = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] splits = value.toString().split(",");
        basement.set(splits[0]);
        Date date;
        try {
            date = format.parse(splits[1]);
        } catch (ParseException e) {
            // Skip records whose date does not parse (e.g. the CSV header row).
            return;
        }
        // active_vehicles is the third column (index 2), not index 3.
        int active_vehicles = Integer.parseInt(splits[2]);
        String keys = basement.toString() + " " + days[date.getDay()];
        context.write(new Text(keys), new IntWritable(active_vehicles));
    }
}

From the mapper, we emit the combination of the base number and the day of the week as the key, and the number of active vehicles as the value.

First, we parse the date, which is in string format, into a Date object using Java's SimpleDateFormat class. Then, to get the day of the week, we use getDay(), which returns an integer for the day of the week (0 for Sunday through 6 for Saturday). We have created an array containing the day abbreviations from Sunday to Saturday, and we index it with the value returned by getDay() to obtain the day name.

After this operation, we emit the combination base number + day of the week as the key and the number of active vehicles as the value.

Reducer Class:

Now, in the reducer, we calculate the total number of active vehicles for each base and day combination, using the lines of code below.

public static class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

Whole Source Code:

import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Uber2 {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        SimpleDateFormat format = new SimpleDateFormat("MM/dd/yyyy");
        String[] days = {"Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"};
        private Text basement = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] splits = value.toString().split(",");
            basement.set(splits[0]);
            Date date;
            try {
                date = format.parse(splits[1]);
            } catch (ParseException e) {
                // Skip records whose date does not parse (e.g. the CSV header row).
                return;
            }
            int active_vehicles = Integer.parseInt(splits[2]);
            String keys = basement.toString() + " " + days[date.getDay()];
            context.write(new Text(keys), new IntWritable(active_vehicles));
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Uber2");
        job.setJarByClass(Uber2.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Running the Program:

First, we need to build a jar file for the above program and run it as a normal Hadoop job, passing the input dataset path and the output directory path as shown below.

hadoop jar uber2.jar /uber /user/output2

In the output directory, a part file is created and contains the below output:

B02512 Fri 2221
B02512 Mon 1676
B02512 Sat 1931
B02512 Sun 1447
B02512 Thu 2187
B02512 Tue 1763
B02512 Wed 1900
B02598 Fri 9830
B02598 Mon 7252
B02598 Sat 8893
B02598 Sun 7072
B02598 Thu 9782
B02598 Tue 7433
B02598 Wed 8391
B02617 Fri 13261
B02617 Mon 9758
B02617 Sat 12270
B02617 Sun 9894
B02617 Thu 13140
B02617 Tue 10172
B02617 Wed 11263
B02682 Fri 11938
B02682 Mon 8819
B02682 Sat 11141
B02682 Sun 8688
B02682 Thu 11678
B02682 Tue 9075
B02682 Wed 10092
B02764 Fri 36308
B02764 Mon 26927
B02764 Sat 33940
B02764 Sun 26929
B02764 Thu 35387
B02764 Tue 27569
B02764 Wed 30230
B02765 Fri 3893
B02765 Mon 2810
B02765 Sat 3612
B02765 Sun 2566
B02765 Thu 3646
B02765 Tue 2896
B02765 Wed 3152

We hope this post has been helpful in understanding the Uber data analysis use case using MapReduce. In case of any queries, feel free to comment below and we will get back to you at the earliest. Keep visiting our site www.acadgild.com for more updates on Big Data and other technologies.


18 Comments

  1. Hi,
    I am getting an error when I run the above code, at the line
    String keys = basement.toString() + " " + days[date.getDay()];
    Unparseable date: "date"
    @SuppressWarnings("deprecation")
    Please support with the correct syntax.

    1. According to the error message, there are two possible scenarios: either the date format you are using does not match the date format in the dataset, or, if the format does match, some other column is being parsed as the date. Please check and revert.

  2. I think we are not finding the correct solution. The problem stated is "In this problem statement, we will find the days on which each basement has more trips," whereas the solution provided just gives the sum for each day, rather than finding ON WHICH DAY the trips/active vehicles are at a maximum.
    So it is just like another word count example.
    Please correct me.

  3. 2016-08-29 01:26:23,391 WARN util.NativeCodeLoader (NativeCodeLoader.java:(62)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 0
    at Uber2.main(Uber2.java:119)

    1. Hi Akansha,
      We provide the input while deploying the jar file using the hadoop jar command. Please have a look at this command: hadoop jar uber1.jar /uber /user/output1. In this command, the parameters you need to pass are: the path to the jar file, the input dataset path, and the output file path. Here the input dataset path is given as /uber, because we have our input dataset in the root directory of HDFS.

  4. Has anyone solved any of the above errors, such as the date one? I really need this source code. Please give me any kind of solution.

    1. Hi Krisha,
      Please remove the data description from the dataset, i.e., the first line. We have defined our date format as 'MM/dd/yyyy' using SimpleDateFormat, and the value being parsed from the header, the literal string 'date', does not match that format, which is why you are getting the error. If you remove the first line of the dataset, all the remaining values in that column will be in the date format we specified.

      1. Thank you, sir, for the answer. Sorry to say, I am now facing another error at runtime: a ClassNotFoundException.

  5. Thanks for the code.
    However, I am having a problem while running the jar file. It throws an error saying 'Exception in thread "main" java.lang.Error'.
    What is it that I'm doing wrong?

  6. Problem statement 1 says "we will find the days on which each basement has more trips," but the solution you have given calculates the total number of trips for a given basement+day. Could you please provide the correct solution?

  7. I am a beginner in Java and Hadoop.
    Why did you include
    Date date = null;
    If we declare a date reference as null, won't we get a NullPointerException?
