Big Data Hadoop & Spark

How to Write a Custom UDF for Hive in Python

In this instructional post, we will see how to write a custom UDF for Hive in Python. Before we begin, let us understand what is UDF.

Hive is a data warehouse system built on top of Hadoop to perform ad-hoc queries and is used to get processed data from large datasets. Hive provides an SQL like querying interface over traditional MapReduce, called HQL to perform data analysis.

An example of a simple Hive query:

SELECT col1, COUNT(col2) FROM mytable GROUP BY col1;

When a user fires this query in the background, then:

  • The query is converted into simple MapReduce job.

  • The job is run on Hadoop platform.

  • The result is displayed on the console.

Hive has a rich set of functions that can be used to perform the analysis. But, sometimes there may come scenarios where our requirements cannot be met by simply using the built-in functions. In such a scenario, users need to implement custom User Defined Functions (UDF) and feed it into the Hive query.

What are UDFs?

User Defined Functions(UDFs) provides us a way to:

  • Extend the functionality of Hive by writing functions that can be evaluated in Hive QL.

  • Custom serializers and/or deserializer (“serdes”), which provide a way of either deserializing a custom file format stored on HDFS to a POJO (plain old Java object), or serializing a POJO to a custom file format (or both).

  • Custom mappers/reducers, which allow you to add a custom map or reduce steps into your Hive query. These map/reduce steps can be written in any programming language, and not just in Java.

Since the Hadoop framework is written in Java, naturally most of the Hadoop developers prefer Java to write the UDFs. However, Apache has also made it easy for non-Java developers to be able to work on Hadoop; this is done using the Hadoop Streaming Interface!

Hive UDFs written in Python should be accessed using the Hadoop Streaming concept, which allows for any executable to be used as either the mapper or reducer in a MapReduce transformation. It is easy to create a script in Python and feed it into a Hive query using the function “TRANSFORM…AS.

Let us try to understand the steps using a simple example.

Step 1 – Understanding the Data Set

We have taken a simple data set that has 2 columns, i.e.; FirstName and LastName

They are both separated by tab.

Below given is the screenshot of same.

Step 2 – Create a Table in Hive and Load the Data

Start the Hive shell and run the following commands:

CREATE table mytable(

fname STRING,

lname STRING)

ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘\t’ STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH ‘/path_to_data’ OVERWRITE INTO TABLE mytable;

In the following screenshot you can find the same steps:

You can cross check the data loaded into hive table by using the SELECT * FROM mytable; command.

Step 3 – Problem Statement

We need to convert the ‘lname’ column into lowercase. For this, we will write the Python script and feed it to the Hive query.

Expected o/p:

RAVI kumar

Anish kumar

Rakesh jha

Vishal kumar

Ananya ghosh

Step 4 – Writing the Custom UDF in Python

import sys
import datetime
for line in sys.stdin:
line = line.strip()
fname , lname = line.split('\t')
l_name = lname.lower()
print '\t'.join([fname, str(l_name)])

Explanation of Code

Line 5: Read data from STDIN. This is accomplished using sys.stdin

Line 7: Strip the data into individual lines.

Line 9: Split the line into words based on tab delimiter and assign it to

variables.

Line 11: Converting “lname” into lowercase using .lower() function of python.

Line 13: Printing the new row by replacing “lname” with “l_name.

Step 5 – Add the Python File into Hive

By using the add FILE command, we can add the Python script into Hive’s classpath. Once this is done, we need not alias the script as it can be directly referred to by its name.

Syntax:

add FILE /path_to_python.py;

Example:

add FILE /home/acadgild/Desktop/my.py;

Step 6 – Use the TRANSFORM…AS Construct in the Hive Shell to Run the Python UDF.

Syntax:

SELECT TRANSFORM (col_name) //Line 1

USING ‘python python_script’ //Line 2

AS (col_names) //Line 3

FROM table_name; //Line 4

Example:

SELECT TRANSFORM(fname, lname) USING ‘python my.py’ AS (fname, l_name) FROM mytable;

  • Line 1 allows the user to select a subset of relevant columns that should be passed to the script as the input parameters.

  • Line 2 specifies which executable should be used to handle the actual transformation. For our purposes, the script is written in Python and has already been placed on the Hive classpath through the add FILE command.

  • Line 3 allows the user to specify the format of the output returned by the script.

  • Line 4 should be treated as a normal FROM clause in a query.

 

We can see that we have received the expected output.

Hope this post helped you in writing your first custom UDF in Python. For any queries feel free to comment below.

Keep visiting www.acadgild.com for more updates on the courses.

Related Popular Courses:

WHAT IS BIG DATA AND HADOOP

ANDROID CERTIFICATION BY GOOGLE

KAFKA CONNECT

DATA SCIENTIST ONLINE COURSE

ANALYTICS COURSES

Tags

7 Comments

  1. Thanks for posting this!
    How about Python scripts where some non-default libraries are imported?
    Not all Hadoop nodes will have the same requirements, and some libraries will fail to be imported.
    Using simple Python script is quite trivial, but most of the time in Python we import some external packages. Do you know how to handle that in submitting Hive queries?

  2. wonderful explanation and it is clear and concise, I like the way your short and clear approach to the concept. It did not take more than 10 minutes to understand the writing UDF and using it in Hive.
    Thanks a lot for the great article.

  3. Thanks for the article. I have a small query- how to add the python file permanently to Hive?
    So each time when starting Hive client, we do not have to do ADD FILE each time before executing the Hive query from CLI.

  4. I want to use a UDF that will return a value not just print on console.
    I want to use the returned value as a parameter to some where condition in a query.
    eg: select * from table where col=’select transform(some col) using ‘udf’ as;
    Can u pls help me

Leave a Reply

Your email address will not be published. Required fields are marked *

Related Articles

Close