
Writing a Custom UDF in Spark

In this blog, we will try to understand what a UDF is and how to write one in Spark.

A User Defined Function (UDF) is a function provided by the user for cases where the built-in functions are not capable of doing the required work.

Apache Spark is a general-purpose processing engine that sits on top of the Hadoop ecosystem. Spark's rich resources cover almost all the components of the Hadoop ecosystem. For example, we can perform both batch processing and real-time data processing in Spark without using additional tools like Kafka or Flume; it has its own streaming engine called Spark Streaming.

We can perform various functions with Spark:

  • SQL operations: It has its own SQL engine called Spark SQL, which covers the features of both SQL and Hive.

  • Machine Learning: It has its own machine learning library, MLlib, so it can perform machine learning without the help of Mahout.

  • Graph processing: It performs graph processing using the GraphX component.


All the above features are inbuilt in Spark.

It can run on different cluster managers such as Hadoop YARN and Apache Mesos, and it has its own standalone scheduler to get started when other frameworks are not available. Spark provides easy access to data storage and can run on many storage systems, for example HDFS, HBase, MongoDB, and Cassandra, and it can also store data in the local file system.

In this post, we will see how to write UDFs in Spark and how to use them in Spark SQL.

For this, we will take Uber data analysis use case.

The Uber dataset consists of 4 columns, matching the case class used in the code below:

  • dispatching_base_number: the code of the base that dispatched the vehicles

  • date: the date of the trips

  • active_vehicles: the number of active vehicles on that date

  • trips: the number of trips made on that date
You can download the dataset from the below link:

Problem Statement:

Find the days on which each base has the most trips.

In this problem statement, we will find the number of trips made from each base on each day of the week and arrange them in descending order.

We will use Spark SQL to achieve this, and to parse the date column, we will write a custom UDF in Scala.

val dataset = sc.textFile("/home/kiran/Documents/datasets/uber")
val header = dataset.first()
val data = dataset.filter(line => line != header)
case class uber(dispatching_base_number: String, date: String, active_vehicles: Int, trips: Int)
val mapping = data.map(x => x.split(",")).map(x => uber(x(0), x(1), x(2).toInt, x(3).toInt)).toDF
mapping.registerTempTable("uber")


In line 1, we are loading the dataset from the local file system using the textFile method.

In line 2, we are creating a variable header, which holds the first line of the dataset (in this dataset, the first line is the header line).

In line 3, we are filtering the header line out of the dataset using the filter transformation.

In line 4, we are declaring a case class that holds the dataset description of the Uber data.

In line 5, we are splitting each line, mapping the fields to the case class structure, and finally converting the result to a DataFrame.

In line 6, we are registering the DataFrame as a temporary table called uber.
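To see how one CSV row maps into the case class, here is a standalone sketch that runs without Spark. The sample line is illustrative, not taken from the actual file:

```scala
// Standalone sketch (no Spark needed): mapping one CSV row into the case class.
case class uber(dispatching_base_number: String, date: String,
                active_vehicles: Int, trips: Int)

val line = "B02512,1/1/2015,190,1132"  // illustrative sample row
val x = line.split(",")
val row = uber(x(0), x(1), x(2).toInt, x(3).toInt)
```

This is exactly what line 5 does, except that Spark applies it to every row of the RDD in parallel.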

Now we need to write a SQL query on the table to find out the number of trips each base made, broken down by day of the week.

To parse the date column, we write a custom UDF in Scala, which is as follows:

val parse = (s: String) => {
  val format = new java.text.SimpleDateFormat("MM/dd/yyyy")
  val days = Array("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
  days(format.parse(s).getDay)
}

A UDF is written just like a normal Scala function; the number of function parameters decides the number of columns the UDF acts on. Here we are working on only one column, so we pass a single parameter.
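For comparison, a two-column UDF would simply take two parameters. The following is a hypothetical sketch; the name `label` and its behavior are illustrative and not part of the original example:

```scala
// Hedged sketch: a hypothetical two-parameter UDF acting on two columns.
val label = (base: String, date: String) => base + " on " + date

// It would be registered and called the same way, passing two columns:
//   sqlContext.udf.register("label", label)
//   sqlContext.sql("select label(dispatching_base_number, date) from uber")
```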

The UDF takes the date, which is in string format, and parses it using Java's SimpleDateFormat class; the getDay method then returns the day number of the week. We have created an array storing the names of the days of the week, so depending on the return value of getDay, the corresponding index of the array is returned. Thus we finally get the day of the week.
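The mapping can be checked in isolation with a made-up sample date (1 Jan 2015 fell on a Thursday):

```scala
import java.text.SimpleDateFormat

// Illustrative check of the parsing logic; the date below is a sample, not from the dataset.
val format = new SimpleDateFormat("MM/dd/yyyy")
val days = Array("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")

// Date.getDay returns 0 for Sunday through 6 for Saturday.
val day = days(format.parse("1/1/2015").getDay)
// day is "Thu"
```

Note that getDay is deprecated in modern Java in favor of java.time, but it still works and keeps the example short.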

To register the UDF with Spark SQL, we need to use:

sqlContext.udf.register("parsed", parse)
Below is the SQL query, which returns the number of trips made from each base, grouped by day. To parse the date, we use the UDF registered above.

val test = sqlContext.sql("select dispatching_base_number as dis, parsed(date) as dt ,sum(trips) as cnt from uber group by dispatching_base_number,parsed(date) order by cnt desc")

You can see the whole output stack in the below screenshot.

Now we can call collect() on the result to print it.

Array[org.apache.spark.sql.Row] = Array([B02764,Sat,356789], [B02764,Fri,326968], [B02764,Thu,304200], [B02764,Sun,249896], [B02764,Wed,241137], [B02764,Tue,221343], [B02764,Mon,214116], [B02617,Sat,127902], [B02617,Fri,125067], [B02682,Sat,120283], [B02617,Thu,118254], [B02682,Fri,114662], [B02682,Thu,106643], [B02617,Wed,94887], [B02598,Sat,94588], [B02598,Fri,93126], [B02617,Sun,91722], [B02598,Thu,90333], [B02617,Tue,86602], [B02682,Wed,86252], [B02682,Sun,82825], [B02617,Mon,80591], [B02682,Tue,76905], [B02682,Mon,74939], [B02598,Wed,71956], [B02598,Sun,66477], [B02598,Tue,63429], [B02598,Mon,60882], [B02765,Sat,36737], [B02765,Fri,34934], [B02765,Thu,30408], [B02765,Wed,24340], [B02765,Tue,22741], [B02765,Sun,22536], [B02765,Mon,21974], [B02512,Fri,16435], [B02512,Thu,15809])

We hope this blog helped you understand how to write a custom UDF in Spark. Keep visiting our site for more updates on Big Data and other technologies.
