In this hadoop tutorial post, we will understand various types of JOIN operation that Hive supports. Hive, like alternative SQL database, permits users to join various tables. However, hive doesn’t support JOIN clauses that don’t seem to have the Equality Conditions because it is extremely troublesome to express such conditions as a MapReduce job.
Syntax for Disallowed Joins (Not allowed in hive)
SELECT a.* FROM a JOIN b ON (a.id <> b.id)
Now, let us see different types of JOIN operations supported by Hive.
- OUTER JOIN
- LEFT OUTER JOIN
- RIGHT OUTER JOIN
- FULL OUTER JOIN
- LEFT SEMI JOIN
- CROSS JOIN
Download sample dataset to practice: Employee.txt Expense.txt
Similar to OUTER JOIN in SQL, join in hive is used to combine records based on common fields from multiple tables. Let us understand the concept by creating two tables and Joining them. Creating a table ’emp’ and loading the data into it as shown below:
CREATE TABLE emp(ID int,NAME varchar(10),AGE int,ADDRESS varchar(10),SALARY int)row format delimited FIELDS terminated BY ',' stored AS textfile; Load data local inpath '/home/cloudera/Downloads/employee.txt' into table emp; select * from emp;
Creating a table named ‘expense’ and loading the data into it is shown below:
CREATE TABLE expense( OID int,DATE varchar(20),emp_ID int,AMOUNT int)row format delimited FIELDS terminated BY ',' stored AS textfile; Load data local inpath '/home/cloudera/Downloads/expense.txt' into table expense; select * from expense;
Performing the JOIN operation.
SELECT c.ID, c.NAME, c.AGE, o.AMOUNT FROM emp c JOIN expense o ON (c.ID = o.emp_ID);
The above JOIN is also known as INNER JOIN as the value which is available is both the tables gets joined. Missing values in either table gets discarded. From the figure above we can understand that the values which are common in both the tables get joined and hence the name Inner Join. Check below output screenshot and you will find that few id’s from tables did not appear after performing join operation.
Outer Joins can be further sub-categorized into:
- Left Outer Join
- Right Outer Join
- Full Outer Join
Let us see each of them closely.
LEFT OUTER JOIN
If you have a requirement where in you want to keep all the rows of the table a (present in left side) + the common rows of the table a & b, you can use left outer join.
The syntax “FROM a LEFT OUTER join b” should be written in one line so as to grasp how it works: a is to the LEFT of b in this query, then all rows from a are kept.
The below image in this hadoop tutorial depicts the same.
In the above picture, if you consider ‘O’ as a left-hand side table and ‘E’ as a right-hand side table, you will get an idea that all the rows in the left table are present and the common rows of O+E are present (Shown by the yellow portion in this hadoop tutorial).
SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key=b.key);
RIGHT OUTER JOIN:
This join will return all the rows from right hand side table along with the common rows present in both left and right table. In case there is no match, join operation will still return the row but with NULL values.
Below image in this hadoop tutorial shows the right outer join.
RIGHT OUTER JOIN SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM emp c RIGHT OUTER JOIN expense o ON (c.ID = o.emp_ID);
FULL OUTER JOIN:
Full outer join combines all the records present in both left as well as right table based on the condition.In brief, you will not lose the data after performing the join operation.
FULL OUTER JOIN SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM emp c FULL OUTER JOIN expense o ON (c.ID = o.emp_ID);
Below is the picture that depicts the same.
Consider ‘O’ as a left side table and ‘E’ as right side table. You can see that all the data is available (shown by yellow areas) after join operation.
LEFT SEMI JOIN
LEFT SEMI JOIN implements the related IN/EXISTS subquery semantics in an efficient manner. Since Hive presently doesn’t support IN/EXISTS subqueries, you’ll be able to rewrite your queries using the LEFT SEMI join. The restrictions of using the LEFT SEMI join is that the right-hand-side table ought to solely be referenced within the JOIN condition (the ON-clause), however not in the WHERE- or SELECT-clauses, etc.
This type of syntax query:
SELECT a.key, a.value
WHERE a.key in
Can be written as Syntax:
SELECT a.key, a.val
FROM a LEFT SEMI JOIN b on (a.key = b.key)
LEFT SEMI JOIN SELECT c.ID, c.NAME FROM emp c LEFT SEMI JOIN expense e ON (c.ID = e.emp_ID);
The SQL CROSS JOIN produces a result set that is the number of rows within the 1st table multiplied by the number of rows within the second table if no WHEREVER clause is used alongside the CROSS join. This kind of a result’s called as a Cartesian product.
If where clause is used with CROSS join, it functions like an INNER join.
An alternative manner for achieving the same results to use column names separated by commas after selecting and mentioning the table names involved, this comes after a FROM clause.
CROSS JOIN SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM emp c CROSS JOIN expense o ON (c.ID = o.emp_ID);
Hope this In this hadoop tutorial helped you in understanding the different types of JOINS available in Hive using Map-Reduce. Feel free to comment below for your queries. And follow AcadGild for more trending blogs on Big Data.