In our previous post, we have discussed on the concept of Partitioning in Hive. In this post, we will be discussing the concept of Bucketing in Hive, which gives a fine structure to Hive tables while performing queries on large datasets.
As we all know, Partition helps in increasing the efficiency when performing a query on a table. Instead of scanning the whole table, it will only scan for the partitioned set and does not scan or operate on the unpartitioned sets, which helps us to provide results in lesser time and the details will be displayed very quickly because of Hive Partition.
100% Free Course On Big Data Essentials
Subscribe to our blog and get access to this course ABSOLUTELY FREE.
Now, let’s assume a condition that there is a huge dataset. At times, even after partitioning on a particular field or fields, the partitioned file size doesn’t match with the actual expectation and remains huge and we want to manage the partition results into different parts. To overcome this problem of partitioning, Hive provides Bucketing concept, which allows user to divide table data sets into more manageable parts.
Thus, Bucketing helps user to maintain parts that are more manageable and user can set the size of the manageable parts or Buckets too.
Bucketing Features in Hive
Hive partition divides table into number of partitions and these partitions can be further subdivided into more manageable parts known as Buckets or Clusters. The Bucketing concept is based on Hash function, which depends on the type of the bucketing column. Records which are bucketed by the same column will always be saved in the same bucket.
Here, CLUSTERED BY clause is used to divide the table into buckets.
In Hive Partition, each partition will be created as directory. But in Hive Buckets, each bucket will be created as file.
Bucketing can also be done even without partitioning on Hive tables.
Bucketed tables allows much more efficient sampling than the non-bucketed tables. With sampling, we can try out queries on a section of data for testing and debugging purpose when the original data sets are very huge. Here, the user can fix the size of buckets according to the need.
Bucketing concept also provides the flexibility to keep the records in each bucket to be sorted by one or more columns. Since the data files are equal sized parts, map-side joins will be faster on the bucketed tables.
Let us see how to create and populate bucketed table in the following example, where we try to subdivide a partition table into multiple manageable parts based on a field by using bucketing technique.
Input Dataset to Perform Bucketing Operation:
To perform the bucketing operation on a dataset, we require an input dataset. You can refer to the below link for our input dataset real_state.
Column 1 : Street
Column 2 : City
Column 3 : Zip
Column 4 : State
Column 5 : Beds
Column 6 : Baths
Column 7 : Sq_feet
Column 8 : flat_type
Column 9 : Price
Selecting the Database:
Let us begin with the coding part by selecting a database and performing queries on the bucketed table.
Here, we have used a database named ‘bucketing’ which already exists in the Hive metastore database and we will create new tables in the ‘bucketing’ database, to perform Bucketing operation on the newly created tables.
Creating a New Input Table:
Next, we have created a new input table with name ‘input_table’, where we will save the contents of the input dataset.
Load the Input Dataset:
Now, in the below image, we can see that we have loaded the contents of real_state into input_table by using the load command.
Display the Contents of Table input_table to Ensure Whether the Input File has been Loaded Successfully or Not:
By using ‘select’ statement, we can check the contents of the input_table, whether we have successfully loaded the contents or not.
We can see from the above image that we have successfully loaded the contents of the input dataset real_state into input_table.
Set the Below Properties in Hive Command Line Before Proceeding Further for Bucketing Scripts:
To populate the bucketed table, we have to set hive.enforce.bucketing property to ‘true’, so that the Hive knows to create the number of buckets declared in the table definition.
The property hive.enforce.bucketing = true is similar to hive.exec.dynamic.partition = true, in Hive partitioning. By setting this property, we will enable dynamic bucketing while loading data into the Hive table.
The above hive.enforce.bucketing = true property sets the number of reduce tasks to be equal to the number of buckets mentioned in the table definition (Which is ‘4’ in our case) and automatically selects the clustered by column from table definition.
Creating Bucket Table:
From the above image, we can see that we have created a new bucket table with name ‘bucket_table’, which is partitioned by ‘city’ and clustered by ‘street’ field with the bucket size of ‘4’.
Here, we have decomposed Hive Buckets into ‘4’ parts.
Query to Retrieve Data from Bucketed Table:
We can observe from the above image that we are selecting the columns from input_table and inserting it into our bucketed table ‘bucket_table’, which is partitioned by city.
After successfully inserting the contents of input_table to bucket_table, we can see the bucketed output result in the browser and can also download the required city partitioned bucketed files in our local file system.
The below screenshot shows the path of the bucket_table with city as ‘ANTELOPE’.
We can observe from the above image, four Buckets have been created in the above Hive warehouse path for every city name.
Now, let’s download the output part file result by selecting the download option.
We can observe from the above image, the part file is successfully downloaded in the local file system downloads directory.
It is clear that Bucketing is most suitable for sampling purpose and adds some optimization to the query performance.
We can also execute the above output in command line by copy pasting the same or different part file path, according to the user’s requirement.
Hence, from the above execution steps, we can observe that we have created a structure to decompose the data into more manageable parts or equal parts using Bucketing technique in Hive.
We hope this post has been helpful in understanding the concept of Bucketing in Hive.