Execution of Hive UDAF

In this blog, we will discuss how to implement a Hive UDAF that finds the largest integer in an input file.
We expect readers to have basic knowledge of Hive; refer to the links below for the basics of Hive operations.
Hive Beginners Guide
File Formats In Apache
Indexing in Hive
Partitioning in Hive
Let’s start by understanding what a UDAF is.
User-Defined Aggregation Functions (UDAFs) are a powerful way to integrate advanced data processing into Hive. An aggregate function performs a calculation on a set of values and returns a single value.
An aggregate function is more difficult to write than a regular UDF. Values are aggregated in chunks (potentially across many tasks), so the implementation has to be capable of combining partial aggregations into a final result.
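The combining step described above can be illustrated without Hive at all. The following is a minimal plain-Java sketch, not Hive code: the class name PartialMaxSketch, its methods, and the sample numbers are all hypothetical and chosen only for illustration. Each chunk is reduced to a partial maximum, and the partial maxima are then reduced again to get the overall result.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of Hive.
final class PartialMaxSketch {

    // Aggregates one chunk of rows; returns null when the chunk holds no values.
    static Integer partialMax(List<Integer> chunk) {
        Integer best = null;
        for (Integer value : chunk) {
            if (value == null) {
                continue; // skip NULL rows
            }
            if (best == null || value > best) {
                best = value;
            }
        }
        return best;
    }

    // Combines the per-chunk results; the max of partial maxima is the overall max.
    static Integer combine(List<Integer> partials) {
        return partialMax(partials);
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for two chunks of the input.
        List<Integer> chunkA = Arrays.asList(12, 99, 7);
        List<Integer> chunkB = Arrays.asList(45, null, 3);
        Integer result = combine(Arrays.asList(partialMax(chunkA), partialMax(chunkB)));
        System.out.println(result); // prints 99
    }
}
```

Maximum is an easy case because a partial result has the same shape as the final result. Other aggregates may need to carry extra state in their partial results.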
The source code below finds the largest integer in the input file.
To use it, build a jar file from this source code and then add that jar to Hive when running the statements described in the upcoming section.
UDAF to find the largest integer in the table:

package com.hive.udaf;

import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;
import org.apache.hadoop.io.IntWritable;

// UDAF that returns the largest integer in a column.
public class Max extends UDAF
{
    public static class MaxIntUDAFEvaluator implements UDAFEvaluator
    {
        // Largest value seen so far; null until the first non-null input arrives.
        private IntWritable output;

        // Resets the evaluator state before a new aggregation starts.
        public void init()
        {
            output = null;
        }

        // Called once per input row; null values are ignored.
        public boolean iterate(IntWritable maxvalue)
        {
            if (maxvalue == null)
            {
                return true;
            }
            if (output == null)
            {
                output = new IntWritable(maxvalue.get());
            }
            else
            {
                output.set(Math.max(output.get(), maxvalue.get()));
            }
            return true;
        }

        // Returns the partial result computed so far.
        public IntWritable terminatePartial()
        {
            return output;
        }

        // Folds another partial result into this one.
        public boolean merge(IntWritable other)
        {
            return iterate(other);
        }

        // Returns the final result.
        public IntWritable terminate()
        {
            return output;
        }
    }
}
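
To see how the evaluator methods above fit together, here is a small stand-alone sketch of the call sequence. It is a simulation under stated assumptions, not how Hive itself is implemented: the nested Evaluator class is a hypothetical plain-Java stand-in that uses Integer in place of IntWritable so it runs without Hadoop on the classpath, and runMapTask, runReduce, and the sample rows are made up for illustration.

```java
// Hypothetical simulation of the evaluator call order; not Hive code.
final class MaxLifecycleSketch {

    // Stand-in for MaxIntUDAFEvaluator: Integer replaces IntWritable.
    static final class Evaluator {
        private Integer output;

        void init() {
            output = null;
        }

        boolean iterate(Integer value) {
            if (value == null) {
                return true; // ignore NULL input
            }
            output = (output == null) ? value : Integer.valueOf(Math.max(output, value));
            return true;
        }

        Integer terminatePartial() {
            return output;
        }

        boolean merge(Integer other) {
            return iterate(other);
        }

        Integer terminate() {
            return output;
        }
    }

    // One simulated task: init, iterate over its rows, emit a partial result.
    static Integer runMapTask(Integer... rows) {
        Evaluator evaluator = new Evaluator();
        evaluator.init();
        for (Integer row : rows) {
            evaluator.iterate(row);
        }
        return evaluator.terminatePartial();
    }

    // Simulated final stage: merge every partial result, then terminate.
    static Integer runReduce(Integer... partials) {
        Evaluator evaluator = new Evaluator();
        evaluator.init();
        for (Integer partial : partials) {
            evaluator.merge(partial);
        }
        return evaluator.terminate();
    }

    public static void main(String[] args) {
        Integer p1 = runMapTask(12, 99, 7);
        Integer p2 = runMapTask(45, null, 3);
        Integer p3 = runMapTask(); // empty input yields a null partial result
        System.out.println(runReduce(p1, p2, p3)); // prints 99
    }
}
```

Note that a task with no rows produces a null partial result. Because merge delegates to iterate, which ignores null, such a partial does not affect the final value.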

Let’s now look at the steps for executing the UDAF.

  1. Create a new input dataset

We need an input dataset to execute the above example. The dataset used for this demonstration is Numbers_List. It has one column, which contains a list of integer values.

 

  2. Create a new table and load the input dataset

We create a new table Num_list with a single field (column), Num.
Next, we load the contents of the input dataset Numbers_List into the table Num_list.


 

  3. Display the contents of the table Num_list to verify that the input file has been loaded successfully

Using a SELECT statement, we can check whether the contents of the dataset Numbers_List have been loaded into the table Num_list.

 

  4. Add the jar file in Hive using its complete path (the jar built from the source code above must be added)

Here we have added h-udaf.jar in Hive.
 

  5. Create a temporary function

Registering a function lets us call the UDAF by name inside Hive queries, which is much easier than referring to the jar repeatedly during analysis.
Let us create a temporary function max for the newly created UDAF.

 

  6. Use a SELECT statement to find the largest number in the table Num_list

After successfully completing the above steps, we can use a SELECT statement to find the largest number in the table.

The query output shows that the largest number in the table Num_list is 99.
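
For readers who want the statements behind steps 2 to 6 in text form, the following Java sketch assembles them as a Hive script. Treat it as an approximation rather than the exact commands used in this post: the helper class HiveUdafScriptSketch, the INT column type, the use of LOAD DATA LOCAL INPATH, and both file paths are assumptions. The table, column, jar, class, and function names are taken from the post; note that the function name max is also the name of a Hive built-in.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that only builds statement text; it does not connect to Hive.
final class HiveUdafScriptSketch {

    // Returns the HiveQL statements for steps 2 to 6, in order.
    // dataPath and jarPath are supplied by the caller (placeholders below).
    static List<String> buildScript(String dataPath, String jarPath) {
        List<String> statements = new ArrayList<>();
        // Step 2: create the table (INT column type is an assumption) and load the data.
        statements.add("CREATE TABLE Num_list (Num INT)");
        statements.add("LOAD DATA LOCAL INPATH '" + dataPath + "' INTO TABLE Num_list");
        // Step 3: verify that the load worked.
        statements.add("SELECT * FROM Num_list");
        // Step 4: make the jar available to the Hive session.
        statements.add("ADD JAR " + jarPath);
        // Step 5: register the UDAF class under a function name.
        statements.add("CREATE TEMPORARY FUNCTION max AS 'com.hive.udaf.Max'");
        // Step 6: run the aggregation.
        statements.add("SELECT max(Num) FROM Num_list");
        return statements;
    }

    public static void main(String[] args) {
        // Placeholder paths; replace them with the real locations on your machine.
        for (String statement : buildScript("/path/to/Numbers_List", "/path/to/h-udaf.jar")) {
            System.out.println(statement + ";");
        }
    }
}
```

The printed lines can be pasted into the Hive shell one by one after adjusting the two paths.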
We hope this blog helped you in understanding the Hive UDAF and its execution.
Keep visiting our website Acadgild for more updates on Big Data and other technologies.
 

Manjunath is working with AcadGild as a Big Data Engineer and is a Big Data enthusiast with 2+ years of experience in Hadoop development. He is passionate about coding in Hive, Spark, and Scala. Feel free to contact him at [email protected] for any further queries.
