Big Data Hadoop & Spark

Hadoop Tutorial : Joins in Hive

In this hadoop tutorial post, we will understand various types of JOIN operation that Hive supports. Hive, like alternative SQL database, permits users to join various tables. However, hive doesn’t support JOIN clauses that don’t seem to have the Equality Conditions because it is extremely troublesome to express such conditions as a MapReduce job.
Syntax for Disallowed Joins (Not allowed in hive)
SELECT a.* FROM a JOIN b ON (a.id <> b.id)
Now, let us see different types of JOIN operations supported by Hive.

  • JOIN
  • OUTER JOIN
    • LEFT OUTER JOIN
    • RIGHT OUTER JOIN
    • FULL OUTER JOIN
  • LEFT SEMI JOIN
  • CROSS JOIN

Download sample dataset to practice: Employee.txt  Expense.txt
JOIN
Similar to OUTER JOIN in SQL, join in hive is used to combine records based on common fields from multiple tables. Let us understand the concept by creating two tables and Joining them. Creating a table ’emp’ and loading the data into it as shown below:

100% Free Course On Big Data Essentials

Subscribe to our blog and get access to this course ABSOLUTELY FREE.

CREATE TABLE emp(ID int,NAME varchar(10),AGE int,ADDRESS varchar(10),SALARY int)row format delimited FIELDS terminated BY ',' stored AS textfile;
Load data local inpath '/home/cloudera/Downloads/employee.txt' into table emp;
select * from emp;

 

 
Creating a table named ‘expense’ and loading the data into it is shown below:

CREATE TABLE expense( OID int,DATE varchar(20),emp_ID int,AMOUNT int)row format delimited FIELDS terminated BY ',' stored AS textfile;
Load data local inpath '/home/cloudera/Downloads/expense.txt' into table expense;
select * from expense;

 

Performing the JOIN operation.
Command:
SELECT c.ID, c.NAME, c.AGE, o.AMOUNT FROM emp c JOIN expense o ON (c.ID = o.emp_ID);

The above JOIN is also known as INNER JOIN as the value which is available is both the tables gets joined. Missing values in either table gets discarded. From the figure above we can understand that the values which are common in both the tables get joined and hence the name Inner Join. Check below output screenshot and you will find that few id’s from tables did not appear after performing join operation.

OUTER JOINS

Outer Joins can be further sub-categorized into:

  • Left Outer Join
  • Right Outer Join
  • Full Outer Join

Let us see each of them closely.
LEFT OUTER JOIN
If you have a requirement where in you want to keep all the rows of the table a (present in left side) + the common rows of the table a & b, you can use left outer join.
The syntax “FROM a LEFT OUTER join b” should be written in one line so as to grasp how it works: a is to the LEFT of b in this query, then all rows from a are kept.
The below image in this hadoop tutorial depicts the same.

In the above picture, if you consider ‘O’ as a left-hand side table and ‘E’ as a right-hand side table, you will get an idea that all the rows in the left table are present and the common rows of O+E are present (Shown by the yellow portion in this hadoop tutorial).
Syntax:
SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key=b.key);

RIGHT OUTER JOIN: 
This join will return all the rows from right hand side table along with the common rows present in both left and right table. In case there is no match, join operation will still return the row but with NULL values.
Below image in this hadoop tutorial shows the right outer join.
Image missing.
 
Command:

RIGHT OUTER JOIN
SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM emp c RIGHT OUTER JOIN expense o ON (c.ID = o.emp_ID);


FULL OUTER JOIN:
Full outer join combines all the records present in both left as well as right table based on the condition.In brief, you will not lose the data after performing the join operation.
Command

FULL OUTER JOIN
SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM emp c FULL OUTER JOIN expense o ON (c.ID = o.emp_ID);

Below is the picture that depicts the same.

Consider ‘O’ as a left side table and ‘E’ as right side table. You can see that all the data is available (shown by yellow areas) after join operation.
 

 

LEFT SEMI JOIN

LEFT SEMI JOIN implements the related IN/EXISTS subquery semantics in an efficient manner. Since Hive presently doesn’t support IN/EXISTS subqueries, you’ll be able to rewrite your queries using the LEFT SEMI join. The restrictions of using the LEFT SEMI join is that the right-hand-side table ought to solely be referenced within the JOIN condition (the ON-clause), however not in the WHERE- or SELECT-clauses, etc.

This type of syntax query:
SELECT a.key, a.value
FROM a
WHERE a.key in
(SELECT b.key
FROM B);
Can be written as Syntax:
SELECT a.key, a.val
FROM a LEFT SEMI JOIN b on (a.key = b.key)
Example:
LEFT SEMI JOIN
SELECT c.ID, c.NAME FROM emp c LEFT SEMI JOIN expense e ON (c.ID = e.emp_ID);


 

CROSS JOIN

The SQL CROSS JOIN produces a result set that is the number of rows within the 1st table multiplied by the number of rows within the second table if no WHEREVER clause is used alongside the CROSS join. This kind of a result’s called as a Cartesian product.
If where clause is used with CROSS join, it functions like an INNER join.
An alternative manner for achieving the same results to use column names separated by commas after selecting and mentioning the table names involved, this comes after a FROM clause.

Example:

 

CROSS JOIN
SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM emp c CROSS JOIN expense o ON (c.ID = o.emp_ID);

 

Hope this In this hadoop tutorial helped you in understanding the different types of JOINS available in Hive using Map-Reduce. Feel free to comment below for your queries. And follow AcadGild for more trending blogs on Big Data.

prateek

An alumnus of the NIE-Institute Of Technology, Mysore, Prateek is an ardent Data Science enthusiast. He has been working at Acadgild as a Data Engineer for the past 3 years. He is a Subject-matter expert in the field of Big Data, Hadoop ecosystem, and Spark.

One Comment

  1. from where can I get a sample project or case study to work on Data.
    I learn the Basic and now need to perform the Learning .
    Thanks in Advanced

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Related Articles

Close
Close