Big Data Hadoop & Spark

Implementing Custom Input Format in Spark

In this post, we will be discussing how to implement custom input format in Spark. In Spark, we will implement the custom input format by using Hadoop custom input format. You can refer to our previous post to get an idea of how custom input format has been implemented on the Titanic Dataset.

Problem Statement:

Find the number of people who died and the number of people who survived, along with their genders.

The same custom input classes of Hadoop can be implemented in Spark as well.

In the Hadoop custom input format post, we have aggregated two columns and made as a key.

To implement the same in Spark shell, You need to build a jar file of the source code of the custom input format of Hadoop.
You can download the jar file from the following link:

https://drive.google.com/open?id=0ByJLBTmJojjzMEJCM29GRGRDbzA

You can see the classes inside the jar file, in the below screenshot.

Here, Titanic_input.class is the Input format class and Key_value.class is the custom key class.

You can register the jar file in the Scala shell using the below command:

:cp name of the jar file.jar

Note: Your jar file should be in the class path (Here our class path is $Home directory, so we have copied the jar file in the home directory).

Now, let’s write our custom input format in Spark using these custom classes.

Spark provides support for both the old as well as the new APIs of Hadoop.

Old APIs (Which supports mapred libraries of Hadoop)

New APIs (Which supports mapreduce libraries of Hadoop)

  • newAPIHadoopRDD

  • newAPIHadoopFile

We can implement the APIs using Spark context.

Old APIs

SparkContext.hadoopRDD(conf, inputFormatClass, keyClass, valueClass, minPartitions)

Hadoop

  • Conf – Here, conf to be passed is org.apache.hadoop.mapred.JobConf. In this specific format, we need to pass the input file from the configuration.

  • InputFormatClass – Here, you need to pass the Input format class of Hadoop.

  • KeyClass – Here, you need to pass the input Key class.

  • ValueClass – Here, you need to pass the input Value class.

SparkContext.hadoopFile(path, inputFormatClass, keyClass, valueClass, minPartitions)

  • Path – Here, Input file is passed as the arguments itself in path.

  • InputFormatClass – Here, you need to pass the Input format class of Hadoop.

  • KeyClass – Here, you need to pass the input Key class.

  • ValueClass – Here you need to pass the input Value class.

  • minPartitions – This specifies the minimum number of partitions.

New APIs

sc.newAPIHadoopRDD(conf, fClass, kClass, vClass)

  • Conf – Here conf to be passed is org.apache.hadoop.conf.configuration.

  • fClass – Input format class.

  • kClass – Input key class.

  • vClass – Input value class.

sc.newAPIHadoopFile(input, fClass, kClass, vClass)

  • Input – Here, input path is passed as string as the first argument.

  • fClass – Input format class.

  • kClass – Input key class.

  • vClass – Input value class.

val input = sc.newAPIHadoopFile("/home/kiran/TitanicData.txt", classOf[Titanic_input], classOf[Key_value],classOf[IntWritable])

Now, we will take out the Key and Value and create a pair.

val pairs = input.map(x => ((x._1.toString(),x._2.toString().toInt)))

We have converted our custom input Key to String and the value IntWritable to Int.

Now, let’s perform the reduceByKey operation to count the number of people who died and the number of people survived, along with their genders.

Let’s use the reduceByKey() method to count the values for each unique key.

val reduce = pairs.reduceByKey(_+_)

The collect method is used to check the output.

reduce.collect

res29: Array[(String, Int)] = Array((1,male,109), (1,female,233), (0,female,81), (0,male,468))

You can see the whole trace in the below screenshot.

This is how we can implement the custom input format in Spark shell.

Hope this post has been helpful in understanding the steps to implement custom input format in Spark. In case of any queries, feel free to comment below and we will get back to you at the earliest.

Keep visiting our site www.acadgild.com for more updates on Big Data and other technologies.

Spark

Tags

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Related Articles

Close