Huge volumes of data are generated around us every minute. Until recently, we relied on Hadoop and related technologies to store and analyze this data. Lately, however, there is a pressing need to deliver results faster. Cloudera therefore developed a query engine that runs on a massively parallel processing (MPP) cluster, in which each node has its own CPU, memory, and disk. Database software and high-speed interconnects make the nodes function as a single system. The system scales as new nodes are added to the cluster and gives fast, interactive responses to the queries submitted. The result is Cloudera Impala.
What is Cloudera Impala?
Cloudera Impala is a massively parallel processing (MPP) SQL query engine that lets users run low-latency SQL queries on data stored in HDFS and HBase, without any data transformation or movement.
The main goal of Impala is to make SQL-on-Hadoop operations fast and efficient, so that Hadoop appeals to new categories of users and opens up to new types of use cases. Impala makes SQL queries simple enough to be accessible to analysts who are familiar with SQL and to users of business intelligence tools that run on Hadoop.
How Is Impala Different from Other Hadoop Components?
Impala gets the same benefits from Hadoop as other components such as Java MapReduce, Pig, Hive, and HBase. In addition, Impala integrates with the Hive metastore database, so databases and tables are shared between the two components. This integration lets users work with either Impala or Hive to create tables, load data, issue queries, and so on.
Why Is Impala Faster than Hive?
Impala differs from Hive in that it executes SQL queries natively, without translating them into Hadoop MapReduce jobs. As a result, it takes less time to execute the submitted queries.
Uses of Impala
Impala can be used when there is a need for results in less time.
Impala is very useful when only part of the data needs to be analyzed, because it works well with the Parquet file format, which stores data column by column (vertically). When running a query, Impala reads only the files for the columns that the query needs rather than the entire data set.
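The benefit of a column-oriented layout can be sketched in a few lines of Python. This is only a toy model, not how Parquet or Impala are implemented: the records, the field names (emp_id, emp_name, salary) and the helper functions are all hypothetical, and "values read" simply counts how many stored values each layout has to touch to total one field.

```python
# Illustrative records; the field names and values are placeholders.
rows = [
    {"emp_id": 1, "emp_name": "A", "salary": 45000},
    {"emp_id": 2, "emp_name": "B", "salary": 62000},
    {"emp_id": 3, "emp_name": "C", "salary": 71000},
]


def to_columns(records):
    """Pivot row-oriented records into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_from_rows(records, field):
    """Row layout: every field of every record is touched to reach one field."""
    total, values_read = 0, 0
    for record in records:
        values_read += len(record)
        total += record[field]
    return total, values_read


def sum_from_columns(columns, field):
    """Columnar layout: only the requested column is touched."""
    values = columns[field]
    return sum(values), len(values)


columns = to_columns(rows)
row_total, row_reads = sum_from_rows(rows, "salary")
col_total, col_reads = sum_from_columns(columns, "salary")
print(row_total == col_total, row_reads, col_reads)
```

Both layouts give the same total, but the columnar one touches only the salary values, which is the effect the paragraph above describes.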
Impala can be used when the same kind of query needs to be processed several times. In practice, we often face situations where some queries must be run repeatedly, and Impala is helpful here. For example, we may want to know the number of visits to a website, or the number of views for a particular channel, on a daily or hourly basis.
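A repeated reporting query of this kind might look like the sketch below. The table page_visits and its columns are hypothetical, and Python's built-in sqlite3 module only stands in for Impala so that the example runs locally; the GROUP BY query itself is plain SQL of the sort that would be rerun every hour.

```python
import sqlite3

# Hypothetical table and column names; sqlite3 is only a local stand-in for Impala.
HOURLY_VISITS_SQL = """
SELECT visit_hour, COUNT(*) AS visits
FROM page_visits
GROUP BY visit_hour
ORDER BY visit_hour
"""


def hourly_visits(conn):
    """Run the same report query each time it is needed."""
    return conn.execute(HOURLY_VISITS_SQL).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_visits (visitor TEXT, visit_hour INTEGER)")
conn.executemany(
    "INSERT INTO page_visits VALUES (?, ?)",
    [("a", 9), ("b", 9), ("c", 10)],
)

first_report = hourly_visits(conn)

# New data arrives, and the identical query is issued again.
conn.execute("INSERT INTO page_visits VALUES (?, ?)", ("d", 10))
second_report = hourly_visits(conn)

print(first_report)
print(second_report)
```

Because the query text never changes, the time each run takes matters more than the effort of writing it, which is where a low-latency engine pays off.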
The Impala shell is similar to a Hive shell where users can create databases and tables, insert data, and issue queries.
Running the impala-shell command opens the Impala shell, where users can create databases and tables and perform SQL query operations.
We can use the show databases command to view the databases present in the Hive metastore database.
Since Impala is integrated with Hive, we can create databases and tables and issue queries in both Hive and Impala without affecting other components.
We can use the use database_name; command to select a particular database from the Hive metastore database, and then create tables in it and perform operations on them as required.
We can use the create table table_name(); command to create a new table in Impala. Unless a file format is given with a STORED AS clause (for example, STORED AS PARQUET), Impala creates the table as a text table, so the clause is needed to store the data in Parquet format.
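The shell steps above can be collected into a small script. The sketch below only assembles the SQL text; it does not connect to a cluster. The database and table names are the ones used in this walkthrough, while the column definitions and the build_impala_script helper are hypothetical. STORED AS PARQUET is written out explicitly so the file format does not depend on any default.

```python
def build_impala_script(database, table, columns):
    """Return the statements from the walkthrough as one SQL script."""
    column_sql = ", ".join(f"{name} {col_type}" for name, col_type in columns)
    statements = [
        "SHOW DATABASES",
        f"USE {database}",
        f"CREATE TABLE {table} ({column_sql}) STORED AS PARQUET",
        "SHOW TABLES",
    ]
    return ";\n".join(statements) + ";\n"


# Column names and types are assumptions made for illustration only.
script = build_impala_script(
    "acadgild_emp1",
    "acadgild_emp1_details",
    [("emp_id", "INT"), ("emp_name", "STRING"), ("salary", "INT")],
)
print(script)
```

The printed statements can be typed into the Impala shell one at a time, or saved to a file and passed to impala-shell with its -f option.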
Hive Meta-store Integration with Impala
As discussed, Hive and Impala share the same metastore database. In this section, we will create a new table through Hive and see that the same table is available in Impala too.
Let us use one of the databases available in the Hive metastore database and create a new table in it.
We can enter the show databases; command in Hive to see the databases which are available in the Hive metastore database.
Now, let’s use the same database acadgild_emp1 which was used in Impala and create a new table acadgild_emp2_details.
Now, by using the show tables command we can see that both acadgild_emp1_details table and the acadgild_emp2_details table are present in the acadgild_emp1 database which can be used in Impala as well as Hive.
Testing Hive and Impala’s Query Execution Speed
Now, let’s take a look at how fast Impala is compared to Hive, while executing queries. Let’s use the table acadgild_emp2_details and process some queries on it using Hive as well as Impala, to know the time taken to execute the queries.
The table acadgild_emp2_details has only 5 rows and 3 columns of data. Now let’s run a query that counts the number of rows in acadgild_emp2_details in both Hive and Impala and compare the time taken.
In our run, Hive took 52.34 seconds to count only 5 rows. Now, let’s run the same query in Impala and note the time.
Impala took only 0.8 seconds to process the query, whereas Hive took 52.34 seconds for the same query.
Now, let’s run a different query in both Hive and Impala to find the employees in the table acadgild_emp2_details who earn more than ₹60,000.
In our run, Hive took 29.579 seconds to run this query. Now, let’s run the same query in Impala and note the time.
Impala took only 0.95 seconds. In other words, Impala needed less than a second to select 2 rows, whereas Hive took 29.579 seconds to fetch the same 2 records.
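The two test queries can be sketched as follows. The column names and the five rows are placeholders invented for illustration (the original table contents are not reproduced here), chosen only so that two employees earn more than 60000. As a local stand-in, Python's sqlite3 module runs the SQL; the timings above come from Hive and Impala, not from this snippet.

```python
import sqlite3

COUNT_SQL = "SELECT COUNT(*) FROM acadgild_emp2_details"
HIGH_EARNERS_SQL = "SELECT * FROM acadgild_emp2_details WHERE salary > 60000"

conn = sqlite3.connect(":memory:")
# Hypothetical schema: three columns, matching only the column count in the text.
conn.execute(
    "CREATE TABLE acadgild_emp2_details "
    "(emp_id INTEGER, emp_name TEXT, salary INTEGER)"
)
# Placeholder rows, not the data used in the original test.
conn.executemany(
    "INSERT INTO acadgild_emp2_details VALUES (?, ?, ?)",
    [
        (1, "emp_a", 40000),
        (2, "emp_b", 55000),
        (3, "emp_c", 60000),
        (4, "emp_d", 65000),
        (5, "emp_e", 80000),
    ],
)

row_count = conn.execute(COUNT_SQL).fetchone()[0]
high_earners = conn.execute(HIGH_EARNERS_SQL).fetchall()
print(row_count, high_earners)
```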
These queries show a massive difference in execution time between Hive and Impala for low-latency queries.
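To put a number on that difference, the timings reported above can be turned into speedup ratios. The sketch below uses only the four measurements from this article; the labels and the speedup helper are ours.

```python
# (Hive seconds, Impala seconds) as reported in the two tests above.
timings = {
    "count rows": (52.34, 0.8),
    "salary filter": (29.579, 0.95),
}


def speedup(hive_seconds, impala_seconds):
    """How many times faster Impala finished than Hive."""
    return hive_seconds / impala_seconds


ratios = {name: speedup(hive, impala) for name, (hive, impala) in timings.items()}
for name, ratio in ratios.items():
    print(f"{name}: Impala was about {ratio:.0f}x faster")
```

These are single runs on a 5-row table, so the ratios mostly reflect per-query startup overhead and should not be read as a general benchmark.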
Thus, Impala is a good choice when queries do not need to run as MapReduce jobs and when results are needed quickly.