Join Optimization in Apache Hive

July 14

Hive, like other SQL databases, allows users to join tables. However, joins can be computationally expensive, especially on big tables. Hive on top of Hadoop makes data processing so straightforward and scalable that it is easy to forget to optimize our Hive queries. In this post, we will look at some best practices for join optimization in Hive that reduce query execution time.

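Hive supports several join strategies, including repartition (reduce-side) joins, replicated (map-side) joins and semi joins. This post covers three optimization techniques: ordering the tables in a join so that the largest comes last, map-side joins, and sort-merge-bucket (SMB) map joins.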

Before we proceed, let us create two Hive tables and load some data into them so that the examples are easier to follow.

Table creation:
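The original table definitions are not reproduced here, so the following is only a minimal sketch of what they might look like. The post states that 'emp' holds department id, employee name and employee salary; the column names, the types, the 'dept' columns, the delimiter and the file paths are all assumptions made for illustration.

```sql
-- Hypothetical schemas: column names, types and delimiter are assumptions.
CREATE TABLE IF NOT EXISTS emp (
  dept_id  INT,
  emp_name STRING,
  salary   DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

CREATE TABLE IF NOT EXISTS dept (
  dept_id   INT,
  dept_name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Load sample data; the local file paths are placeholders.
LOAD DATA LOCAL INPATH '/tmp/emp.csv'  INTO TABLE emp;
LOAD DATA LOCAL INPATH '/tmp/dept.csv' INTO TABLE dept;
```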


Join Table Ordering (Largest Table Last):

As with any type of tuning, it is important to understand how the system works internally. When Hive executes a join, it must decide which table is streamed and which tables are cached (buffered in memory). Hive streams the last table in the JOIN statement, so we need to make sure that this streamed table is the largest of the tables being joined.

Let’s look at our two example tables. The ‘emp’ table consists of department id, employee name, and employee salary. For any organization, this list keeps growing over time, whereas the ‘dept’ table stays static most of the time. Hence, when these two tables are joined, it is important that the larger table comes last in the query. Let us see the optimized Hive query:
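A query along these lines would put the larger table last. This is a sketch only: the column names dept_id, emp_name, salary and dept_name are assumed, not taken from the original tables.

```sql
-- 'dept' (small, buffered in memory) comes first;
-- 'emp' (large, streamed) comes last.
-- Column names are hypothetical.
SELECT e.emp_name, e.salary, d.dept_name
FROM dept d
JOIN emp e ON (d.dept_id = e.dept_id);
```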

 

Alternatively, you can explicitly tell Hive which table it should stream by using the STREAMTABLE hint.
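Hive's STREAMTABLE hint names the table alias to stream, whatever its position in the query. The sketch below assumes the same hypothetical column names as before (dept_id, emp_name, salary, dept_name).

```sql
-- The hint marks alias 'e' (emp) as the streamed table,
-- even though it is not the last table in the JOIN.
SELECT /*+ STREAMTABLE(e) */ e.emp_name, e.salary, d.dept_name
FROM emp e
JOIN dept d ON (e.dept_id = d.dept_id);
```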

 

Map Side Join:

Also known as a replicated join, a map-side join is a special type of join in which the smaller table is loaded into memory and the join is performed in the map phase of the MapReduce job. Since no reducer is involved in a map-side join, it is much faster than a regular join.

An important point to note is that one table must be small enough to fit into memory. It is recommended to configure Hive so that it automatically attempts to convert joins into map-side joins. A join run without this configuration is not a map-side join.

In the job output of such a join, note that the “number of reducers” is 1: the join goes through a reduce phase, which slows down the join operation.

Now, to perform a map-side join, set a few configuration properties either in hive-site.xml or directly from the Hive shell. Below are the properties that I set from the Hive shell.

hive> set hive.auto.convert.join=true;

hive> set hive.auto.convert.join.noconditionaltask=true;
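How small is "small enough" is itself configurable. The following settings are a sketch of the size thresholds Hive consults when deciding on automatic conversion. The values shown are the commonly documented defaults, but they are not taken from the original post and may differ in your Hive version, so treat them as assumptions to verify.

```sql
-- Tables below this size in bytes (about 25 MB) count as "small"
-- for automatic map-join conversion.
SET hive.mapjoin.smalltable.filesize=25000000;

-- With hive.auto.convert.join.noconditionaltask=true, the join is converted
-- directly when the combined size in bytes of the small tables is below this.
SET hive.auto.convert.join.noconditionaltask.size=10000000;
```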

Once you are done with the configuration, execute the same join operation as we performed above.

This time, the job output shows that no reduce phase is performed for this join operation. Hence, the map-side join is faster than the regular join operation.
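One way to check the conversion without running the job is to inspect the query plan. The sketch below assumes the hypothetical column names dept_id, emp_name, salary and dept_name. On the MapReduce engine, a converted join typically appears in the plan as a "Map Join Operator" rather than a "Join Operator" inside a reduce stage, though the exact plan text varies by Hive version.

```sql
-- Print the execution plan instead of running the query.
EXPLAIN
SELECT e.emp_name, e.salary, d.dept_name
FROM emp e
JOIN dept d ON (e.dept_id = d.dept_id);
```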

Sort-Merge-Bucket (SMB) Map Join:

This is another Hive join optimization technique, in which all the tables must be bucketed and sorted on the join columns. In this case joins are very efficient because they only require a simple merge of the presorted tables.

Let us create bucketed tables from our existing tables, i.e., emp and dept. Before creating a bucketed table, you need to set the properties below.
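The original property list is not reproduced here. On Hive versions before 2.0, the usual settings for populating bucketed, sorted tables are the two below; from Hive 2.0 onward these properties were removed and bucketing is always enforced. Take this as a likely reconstruction rather than the exact settings used in the post.

```sql
-- Needed on Hive 1.x and earlier so that INSERT honours the table's
-- CLUSTERED BY / SORTED BY definition.
SET hive.enforce.bucketing=true;
SET hive.enforce.sorting=true;
```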

 

Now we need to create the bucketed table as shown below:
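A bucketed and sorted copy of 'emp' could be created roughly as follows. The bucket count of 4 matches the post; the table name emp_bucket and the column names and types are assumptions for illustration.

```sql
-- Hypothetical table and column names; 4 buckets as in the post.
CREATE TABLE IF NOT EXISTS emp_bucket (
  dept_id  INT,
  emp_name STRING,
  salary   DOUBLE
)
CLUSTERED BY (dept_id) SORTED BY (dept_id ASC) INTO 4 BUCKETS;

-- Populate it from the existing table; Hive writes one file per bucket.
INSERT OVERWRITE TABLE emp_bucket
SELECT dept_id, emp_name, salary FROM emp;
```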

 

Similarly, create another bucketed table from the ‘dept’ table and insert data into it.
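A matching sketch for 'dept' follows, again with an assumed table name (dept_bucket) and assumed columns. For an SMB join, both tables should be bucketed and sorted on the join column, and using the same number of buckets keeps the bucket-to-bucket merge simple.

```sql
-- Hypothetical table and column names; same join key and bucket count.
CREATE TABLE IF NOT EXISTS dept_bucket (
  dept_id   INT,
  dept_name STRING
)
CLUSTERED BY (dept_id) SORTED BY (dept_id ASC) INTO 4 BUCKETS;

INSERT OVERWRITE TABLE dept_bucket
SELECT dept_id, dept_name FROM dept;
```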

 

Now the stage is set to perform an SMB map join. Again, a few properties need to be changed before running the join.
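The exact properties used in the post are not reproduced here. The settings below are the ones commonly used to enable SMB map joins in Hive, followed by a join on two bucketed tables. The table names emp_bucket and dept_bucket and the column names are hypothetical.

```sql
-- Read each bucket file as its own split.
SET hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

-- Allow Hive to convert eligible joins into sort-merge-bucket map joins.
SET hive.auto.convert.sortmerge.join=true;
SET hive.auto.convert.sortmerge.join.noconditionaltask=true;
SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;

-- Join on the bucketing/sorting column of both tables.
SELECT e.emp_name, e.salary, d.dept_name
FROM emp_bucket e
JOIN dept_bucket d ON (e.dept_id = d.dept_id);
```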

 

In the job output, you will find that 4 mapper tasks are running (as we had 4 buckets). This makes the join faster than a regular join.

We hope this post helped you with join optimization in Apache Hive. For more updates, keep visiting www.acadgild.com
