In this instructional post, we will see how to write a custom UDF for Hive in Python. Before we begin, let us understand what is UDF.
Hive is a data warehouse system built on top of Hadoop to perform ad-hoc queries and is used to get processed data from large datasets. Hive provides an SQL like querying interface over traditional MapReduce, called HQL to perform data analysis.
An example of a simple Hive query:
SELECT col1, COUNT(col2) FROM mytable GROUP BY col1;
When a user fires this query in the background, then:
The query is converted into simple MapReduce job.
The job is run on Hadoop platform.
The result is displayed on the console.
Hive has a rich set of functions that can be used to perform the analysis. But, sometimes there may come scenarios where our requirements cannot be met by simply using the built-in functions. In such a scenario, users need to implement custom User Defined Functions (UDF) and feed it into the Hive query.
What are UDFs?
User Defined Functions(UDFs) provides us a way to:
Extend the functionality of Hive by writing functions that can be evaluated in Hive QL.
Custom serializers and/or deserializer (“serdes”), which provide a way of either deserializing a custom file format stored on HDFS to a POJO (plain old Java object), or serializing a POJO to a custom file format (or both).
Custom mappers/reducers, which allow you to add a custom map or reduce steps into your Hive query. These map/reduce steps can be written in any programming language, and not just in Java.
Since the Hadoop framework is written in Java, naturally most of the Hadoop developers prefer Java to write the UDFs. However, Apache has also made it easy for non-Java developers to be able to work on Hadoop; this is done using the Hadoop Streaming Interface!
Hive UDFs written in Python should be accessed using the Hadoop Streaming concept, which allows for any executable to be used as either the mapper or reducer in a MapReduce transformation. It is easy to create a script in Python and feed it into a Hive query using the function “TRANSFORM…AS.”
Let us try to understand the steps using a simple example.
Step 1 – Understanding the Data Set
We have taken a simple data set that has 2 columns, i.e.; FirstName and LastName
They are both separated by tab.
Below given is the screenshot of same.
Step 2 – Create a Table in Hive and Load the Data
Start the Hive shell and run the following commands:
CREATE table mytable(
ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘\t’ STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH ‘/path_to_data’ OVERWRITE INTO TABLE mytable;
In the following screenshot you can find the same steps:
You can cross check the data loaded into hive table by using the SELECT * FROM mytable; command.
Step 3 – Problem Statement
We need to convert the ‘lname’ column into lowercase. For this, we will write the Python script and feed it to the Hive query.
Step 4 – Writing the Custom UDF in Python
for line in sys.stdin:
line = line.strip()
fname , lname = line.split('\t')
l_name = lname.lower()
print '\t'.join([fname, str(l_name)])
Explanation of Code
Line 5: Read data from STDIN. This is accomplished using sys.stdin
Line 7: Strip the data into individual lines.
Line 9: Split the line into words based on tab delimiter and assign it to
Line 11: Converting “lname” into lowercase using .lower() function of python.
Line 13: Printing the new row by replacing “lname” with “l_name.”
Step 5 – Add the Python File into Hive
By using the add FILE command, we can add the Python script into Hive’s classpath. Once this is done, we need not alias the script as it can be directly referred to by its name.
add FILE /path_to_python.py;
add FILE /home/acadgild/Desktop/my.py;
Step 6 – Use the TRANSFORM…AS Construct in the Hive Shell to Run the Python UDF.
SELECT TRANSFORM (col_name) //Line 1
USING ‘python python_script’ //Line 2
AS (col_names) //Line 3
FROM table_name; //Line 4
SELECT TRANSFORM(fname, lname) USING ‘python my.py’ AS (fname, l_name) FROM mytable;
Line 1 allows the user to select a subset of relevant columns that should be passed to the script as the input parameters.
Line 2 specifies which executable should be used to handle the actual transformation. For our purposes, the script is written in Python and has already been placed on the Hive classpath through the add FILE command.
Line 3 allows the user to specify the format of the output returned by the script.
Line 4 should be treated as a normal FROM clause in a query.
We can see that we have received the expected output.
Hope this post helped you in writing your first custom UDF in Python. For any queries feel free to comment below.
Keep visiting www.acadgild.com for more updates on the courses.