Big Data Hadoop & Spark

Indexing in Hive

This blog focuses of the concepts involved in indexing in Hive. This post includes the following topics:

  • When to use indexing.
  • How indexing is helpful.
  • How to create indexes for your tables.
  • Perform some operations regarding the indexing in Hive.

What is an Index?

An Index acts as a reference to the records. Instead of searching all the records, we can refer to the index to search for a particular record. Indexes maintain the reference of the records. So that it is easy to search for a record with minimum overhead. Indexes also speed up the searching of data.

100% Free Course On Big Data Essentials

Subscribe to our blog and get access to this course ABSOLUTELY FREE.

Why to use indexing in Hive?

Hive is a data warehousing tool present on the top of Hadoop, which provides the SQL kind of interface to perform queries on large data sets. Since Hive deals with Big Data, the size of files is naturally large and can span up to Terabytes and Petabytes. Now if we want to perform any operation or a query on this huge amount of data it will take large amount of time.

In a Hive table, there are many numbers of rows and columns. If we want to perform queries only on some columns without indexing, it will take large amount of time because queries will be executed on all the columns present in the table.

The major advantage of using indexing is; whenever we perform a query on a table that has an index, there is no need for the query to scan all the rows in the table. Further, it checks the index first and then goes to the particular column and performs the operation.

So if we maintain indexes, it will be easier for Hive query to look into the indexes first and then perform the needed operations within less amount of time.

Eventually, time is the only factor that everyone focuses on.

When to use Indexing?

Indexing can be use under the following circumstances:

  • If the dataset is very large.
  • If the query execution is more amount of time than you expected.
  • If a speedy query execution is required.
  • When building a data model.

Indexes are maintained in a separate table in Hive so that it won’t affect the data inside the table, which contains the data. Another major advantage for indexing in Hive is that indexes can also be partitioned depending on the size of the data we have.

Types of Indexes in Hive

  • Compact Indexing
  • Bitmap Indexing

Bit map indexing was introduced in Hive  0.8 and is commonly used for columns with distinct values.

Differences between Compact and Bitmap Indexing

The main difference is the storing of the mapped values of the rows in the different blocks. When the data inside a Hive table is stored by default in the HDFS, they are distributed across the nodes in a cluster. There needs to be a proper identification of the data, like the data in block indexing. This data will be able to identity which row is present in which block, so that when a query is triggered it can go directly into that block. So, while performing a query, it will first check the index and then go directly into that block.

Compact indexing stores the pair of indexed column’s value and its blockid.

Bitmap indexing stores the combination of indexed column value and list of rows as a bitmap.

Let’s now understand what is bitmap?

A bitmap is is a type of memory organization or image file format used to store digital images so with this meaning of bitmap, we can redefine bitmap indexing as given below.

“Bitmap index stores the combination of value and list of rows as a digital image.”

The following are the different operations that can be performed on Hive indexes:

  • Creating index
  • Showing index
  • Alter index
  • Dropping index

Creating Index in Hive

Syntax for creating a compact index in Hive is as follows:

CREATE INDEX index_name
ON TABLE table_name (columns,....)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;
  • Here, in the place of index_name we can give any name of our choice, which will be the table’s INDEX NAME.
  • In the ON TABLE line, we can give the table_name for which we are creating the index and the names of the columns in brackets for which the indexes are to be created. We should specify the columns which are available only in the table.
  • The org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler’ line specifies that a built in CompactIndexHandler will act on the created index, which means we are creating a compact index for the table.
  • The WITH DEFERRED REBUILD statement should be present in the created index because we need to alter the index in later stages using this statement.

This syntax will create an index for our table, but to complete the creation, we need to complete the REBUILD statement. For this to happen, we need to add one more alter statement. A MapReduce job will be launched and the index creation is now completed.

Hadoop

ALTER INDEX index_nam on table_name REBUILD;

This ALTER statement will complete our REBUILDED index creation for the table.

Examples – Creating Index

In this section we will first execute the hive query on non-indexed table and will note down the time taken by query to fetch the result.

In the second part, we will be performing the same query on indexed table and then will compare the time taken by query to fetch the result with the earlier case.

We will be demonstrating this difference of time with practical examples.

In first scenario we are performing operations on non-indexed table.

Let’s create a normal managed table to contain the olympic dataset first.

Table Creation

create table olympic(athelete STRING,age INT,country STRING,year STRING,closing STRING,sport STRING,gold INT,silver INT,bronze INT,total INT) row format delimited fields terminated by '\t' stored as textfile;

Here we are creating a table with name ‘olympic’. The schema of the table is as specified and the data inside the input file is delimited by tab space.

At the end of the line, we have specified ‘stored as textfile’, which means we are using a TEXTFILE format.

You can check the schema of your created table using the command ‘describe olympic;’

We can load data into the created table as follows:

load data local inpath ‘path of your file‘into table olympic;

We have successfully loaded the input file data into the table which is in the TEXTFILE format.

Let’s perform an Average operation on this ‘olympic’ data. Let’s calculate the average age of the athletes using the following command:

SELECT AVG(age) from olympic;

Here we can see the average age of the athletes to be 26.405433646812956 and the time for performing this operation is 21.08 seconds.

Now, let’s create the index for this table:

CREATE INDEX olympic_index
ON TABLE olympic (age)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;

ALTER INDEX olympic_index on olympic REBUILD;

Here we have created an index for the ‘olympic’ table on the age column. We can view the indexes created for the table by using the below command:

show formatted index on olympic;

We can see the indexes available for the ‘olympic’ table in the above image.

Now, let’s perform the same Average operation on the same table.

We have now got the average age as 26.405433646812956, which is same as the above, but now the time taken for performing this operation is 17.26 seconds, which is less than the above case.

Now we know that by using indexes we can reduce the time of performing the queries.

Can we have different indexes for the same table?

 Yes! We can have any number of indexes for a particular table and any type of indexes as well.

Let’s create a Bitmap index for the same table:

CREATE INDEX olympic_index_bitmap
ON TABLE olympic (age)
AS 'BITMAP'
WITH DEFERRED REBUILD;
ALTER INDEX olympic_index_bitmap on olympic REBUILD;

Here, As ‘BITMAP’ defines the type of index as BITMAP.

We have successfully created the Bitmap index for the table.

We can check the available indexes for the table using the below command:

show formatted index on olympic;

We can now see that we have two indexes available for our table.

Hadoop

Average Operation with Two Indexes

Now, let’s perform the same Average operation having the two indexes.

This time, we have got the same result in 17.614 seconds which is same as in the case of compact index.

Note: With different types (compact,bitmap) of indexes on the same columns, for the same table, the index which is created first is taken as the index for that table on the specified columns.

Now let’s delete one index using the following command:

DROP INDEX IF EXISTS olympic_index ON olympic;

We can check the available indexes on the table to verify whether the index is deleted or not.

We have successfully deleted one index i.e., olympic_index ,which is a compact index.

We now have only one index available for our table, which is a bitmap index.

Average Operation with Bitmap Index

Let’s perform the same Average age operation on the same table with bitmap index.

 We have got the average age as 26.105433646812956, which is same as the above cases but the operation was done in just 16.47 seconds, which is less than the above two cases.

Through the above examples, we have proved the following:

  • Indexes decrease the time for executing the query.
  • We can have any number of indexes on the same table.
  • We can use the type of index depending on the data we have.
  • In some cases, Bitmap indexes work faster than the Compact indexes and vice versa.

When not to use indexing?

It is essential to know when and where indexing shouldn’t be used. They should not be used in the following scenarios:

  • Indexes are advised to build on the columns on which you frequently perform operations.
  • Building more number of indexes also degrade the performance of your query.
  • Type of index to be created should be identified prior to its creation (if your data requires bitmap you should not create compact).This leads to increase in time for executing your query.

We hope this blog helped you in understanding Indexing in Hive. Keep visiting our site for more updates on BigData and other technologies.

Suggested Reading

Hive Bucketing

Parquet File Format

Related Popular Courses:

HADOOP ONLINE TRAINING

MOBILE APP DEVELOPMENT TRAINING AND CERTIFICATION

APACHE KAFKA

BEST DATA SCIENCE PROGRAMS ONLINE

CERTIFICATION FOR DATA ANALYST

Hadoop

10 Comments

  1. If we are using columnar formats like RC, ORC or formats with built in compression like Avro, Parquet will creating index be helpful?

    1. No. The reason for this is ORC. ORC has build in Indexes which allow the format to skip blocks of data during read, they also support Bloom filters. Together this pretty much replicates what Hive Indexes did and they do it automatically in the data format without the need to manage an external table ( which is essentially what happens in indexes. ). I would rather spend my time to properly setup the ORC tables. https://community.hortonworks.com/questions/18093/creating-indexes-in-hive.html

  2. what is the difference between Partitioning and indexing. In partition we are also divide larger data sets into smaller one and it comes query effectie if you select particular rows from table.

    1. Hi Sachin,
      Partitioning divides the larger dataset into smaller ones so that efficiency in processing the query increases. But if you do indexing it won’t divide the dataset into smaller ones, rather it would create another table containing all the details of the table which you are indexed. So when you try to execute any query on an indexed table it will first query on the index_table based on the data in the index it will directly query on the original table. It is just like the index of any text book.

  3. Hi,
    I tested indexes on hive. I couldn’t see any performance improvements actually I saw performance degradation. I really wonder, which version of hive did you use it during your tests ?

  4. Hi ,
    Thanks for the explanation. Can you please elaborate on how compact and bitmap indexing differs?

    1. When there is a column which has few distinct values like gender then it is suited for bitmap index
      where as if we have a column with more distinct value then we should go for compact indexing
      ex: employeeID

  5. Hi,

    is it possible to add a bloom filter to another column after the table was created and data was already inserted ?

    I have a table with a bloom filter created on 1 column and what to add a filter for another column.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Related Articles

Close
Close