In this video tutorial, we will be discussing about the concepts of Partitioning in Hive.
Hive has been one of the preferred tool for performing queries on large datasets, especially when full table scan is done on the datasets.
In the case of tables which are not partitioned, all the files in a table’s data directory is read and then filters are applied on it as a subsequent phase. This becomes a slow and expensive affair especially in cases of large tables.
Without partitioning Hive reads all the data in the directory and applies the query filters on it. This is slow and expensive since all data has to be read.
Very often users need to filter the data on specific column values. To apply the partitioning in hive , users need to understand the domain of the data on which they are doing analysis.
With this knowledge, identification of the frequently queried or accessesd columns becomes easy and then partitioning feature of Hive can be applied on the selected columns.
Owing the fact that Partitions are horizontal slices of data,larger sets of data can be separated into more manageable chunks.
In this blog we are going to discuss about partitioning in hive and ways to use it. The problem with hive is that when we apply where clause then even a simple query in Hive reads the entire dataset.
This decreases the efficiency and becomes a bottleneck when we are required to run the queries on large tables.
This issue can be overcome by implementing partitions in hive.
When to use Hive Partitioning:
When any user wants data contained within a table to be split across multiple sections in hive table, use of partition is suggested.
The entries for the various columns of dataset are segregated and stored in their respective partition. When we write the query to fetch the values from table , only the required partitions of the table are queried, which reduces the time taken by query to yield the result.
A scenario for partitioning:
Let’s take a scenario:
- Data is present in hdfs coming in from various ecommerce companies.
- We need to run the HiveQl queries on user buying pattern.
- We need to analyse data coming in from last 10 days.
In the above scenario instead of running the queries which involves scanning of entire table, an approach should be followed where query runs on only last 10 days of data.
If any user wants to do some analysis on purchasing patterns of customer in some specific date range and if his query has to run on entire dataset then efficiency will be low in this case.
But running the hive queries on data with in specified date range will increase the efficiency when compared to running the data on entire dataset as the query will run on certain chunk of dataset, so time taken to retrieve the result will be less.
Now we will be demonstrating the above scenario using a sample data:
First Name, Last Name, Roll number are the three columns in the sample data.
We will be creating partitions based on country and state column.
For this we will first create a database and a table:
Note: Columns which are a part of partitioner columns are not included in table definitions.
Using a database:
- Static partitioning
- Dynamic partitioning.
When to use static partitioning:
Static partitioning needs to be applied when we know data(supposed to be inserted) belongs to which partition.
Refer the below section to get a complete picture of static partitioning:
Steps for static partitioning:
1.Creating input files for partitioning:
Let’s take two input file:
2.Copying the input files:
The above two input files need to be copied into table further subdivided into partitions divided on the basis of country and state.
user_info data needs to be copied into the section of table having partitions as country = us and state = fl.
When to use dynamic partitioning:
In static partitioning every partitioning needs to be backed with individual hive statement which is not feasible for large number of partitions as it will require writing of lot of hive statements.
In that scenario dynamic partitioning is suggested as we can create as many number of partitions with single hive statement.
This approach is explained in the below section:
Steps for Dynamic partitioning:
We need to create the partitioned table par_user as shown below.
This table will be populated with the contents from table user1 and partitions will be created based on country and state.
Citing the all above explanation we come to the conclusion that data management becomes very efficient with the usage of partitioning in hive.
Keep visiting our site www.acadgild.com for more updates on Bigdata and other technologies. Click here to learn Bigdata Hadoop Development