Big Data Hadoop & Spark

Most Popular Data Formats in Impala

In this instructional post, we will see how to work with the two most important data formats in Impala, i.e. text and Parquet. Along with this, we will also see how to query a Hive table in Impala.
Before we proceed, let us look at a few key points about Impala:
• It is a low-latency, massively parallel query engine.
• It was developed by Cloudera.
• It allows querying data stored in HDFS and HBase with SQL syntax.
• It does not translate SQL into another processing framework.
• It is not fault-tolerant.
• It does not support complex types such as maps, arrays, and structs.
• It uses the Hive metastore, so it can read Hive tables and perform DDL queries.

Data formats in Impala – text file

Let us discuss the first important data format in Impala, i.e. the text file format.
The text file is one of the most commonly used file formats you will see while exploring new tools. We will look at it in detail.
NOTE: I have already moved the sample file to HDFS. If your file is present in the local filesystem, copy it to HDFS first.

The sample data consists of 3 columns, i.e. id, name, location.
Ex: 1,Edvista,Bangalore
Fields are separated by ','.
That's pretty much it for the data part; let's start the Impala shell and fire some SQL queries.
Command: impala-shell

Once the shell starts, you can either use default database or create your own.
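For example, to create and switch to your own database (the database name here is illustrative, not from the original post):

```sql
-- Create a database if it does not already exist, then switch to it
CREATE DATABASE IF NOT EXISTS demo_db;
USE demo_db;
```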
After this, you need to create an external table and load the data which is there in HDFS.
Command:
CREATE EXTERNAL TABLE my_table (
Id INT,
Name STRING,
Location STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'hdfs://quickstart.cloudera:8020/sqoopout';
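As a quick sanity check that the table sees the HDFS data, you can inspect its schema and a few rows (output depends on your sample file):

```sql
-- Verify the column definitions
DESCRIBE my_table;
-- Peek at the first few rows
SELECT * FROM my_table LIMIT 5;
```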

You can also perform simple aggregation and check the result.
Command:
SELECT location, COUNT(id) FROM my_table GROUP BY location;

Because the SQL statement is not translated into another processing framework, execution is fast; when I checked the query execution time, it was 0.96s.
To compare with Hive, I created a table in Hive and executed the same statement on the same sample data. And guess what!?
The query execution time was far higher than Impala's, because Hive launches a MapReduce job in the background.

Working with Hive tables in Impala

Want to query Hive tables using Impala? Well, that’s pretty easy. All you need to do is, refresh Impala’s metadata and then use Hive table in Impala. To do so, use the command below in Impala shell.
Command: INVALIDATE METADATA acad;
where acad is the table created in Hive.
After this, you can run a query against this table in Impala.
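For example, a simple query against the Hive table from the Impala shell (assuming acad holds the same kind of sample data):

```sql
-- Query the Hive-created table directly from Impala
SELECT COUNT(*) FROM acad;
```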

Data formats in Impala – parquet file

Parquet is another important data format in Impala, preferred when you want a compressed file for space and time efficiency. Using the Parquet format, we can speed up our queries.
We will create a new Parquet table from an existing table in Impala. The existing table need not be a Parquet table. Once the table is created, you can load the data as well.
Command:
CREATE TABLE my_table_parquet LIKE my_table STORED AS PARQUET;
INSERT OVERWRITE TABLE my_table_parquet SELECT * FROM my_table;
SHOW TABLE STATS my_table_parquet;
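The compression codec used for the Parquet files affects the space savings mentioned above. A hedged sketch using Impala's COMPRESSION_CODEC query option (Snappy is Impala's default):

```sql
-- Write the Parquet data with gzip compression instead of the default Snappy
SET COMPRESSION_CODEC=gzip;
INSERT OVERWRITE my_table_parquet SELECT * FROM my_table;
-- Restore the default codec
SET COMPRESSION_CODEC=snappy;
```

Gzip typically yields smaller files at the cost of more CPU; Snappy is the usual balance of speed and size.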

Data formats in Impala – SequenceFile

If you do not have an existing file to use, begin by creating one in the appropriate format.
To create a SequenceFile table:
In the impala-shell interpreter, issue a command similar to:
create table sequencefile_table (column_specs) stored as sequencefile;
Because Impala can query some kinds of tables that it cannot currently write to, after creating tables of certain file formats you may need to use the Hive shell to load the data. See How Impala Works with Hadoop File Formats for details. After loading data into a table through Hive or any other mechanism outside of Impala, issue a REFRESH table_name statement the next time you connect to the Impala node, before querying the table, to make Impala recognize the new data.
For example, here is how you might create a SequenceFile table in Impala, load data through Hive, and query it through Impala:
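A minimal sketch of the workflow just described (the table name and columns are illustrative, mirroring our earlier sample data):

```sql
-- In impala-shell: create the SequenceFile table
CREATE TABLE seq_demo (id INT, name STRING, location STRING)
STORED AS SEQUENCEFILE;

-- In the Hive shell: load the data, since Impala cannot write SequenceFile tables
INSERT INTO TABLE seq_demo SELECT * FROM my_table;

-- Back in impala-shell: make Impala pick up the data loaded through Hive
REFRESH seq_demo;
SELECT COUNT(*) FROM seq_demo;
```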



 

Avro

The first step is to convert our data into JSON with the help of a Pig script. Open up the Pig Editor and run:

REGISTER piggybank.jar
 
data = load '/user/hive/warehouse/review/yelp_academic_dataset_review_clean.json'
    AS (funny:INT, useful:INT, cool:INT, user_id:CHARARRAY, review_id:CHARARRAY, text:CHARARRAY, business_id:CHARARRAY, stars:INT, date:CHARARRAY, type:CHARARRAY);
 
data_clean = FILTER data BY business_id IS NOT NULL AND text IS NOT NULL;   
 
STORE data_clean INTO 'impala/reviews_avro'
USING org.apache.pig.piggybank.storage.avro.AvroStorage(
'{
"schema": {
  "name": "review",
  "type": "record",
  "fields": [
     {"name":"funny", "type":"int"},
     {"name":"useful", "type":"int"},
     {"name":"cool", "type":"int"},
     {"name":"user_id", "type":"string"},
     {"name":"review_id", "type":"string"},
     {"name":"text", "type":"string"},
     {"name":"business_id", "type":"string"},
     {"name":"stars", "type":"int"},
     {"name":"date", "type":"string"},
     {"name":"type", "type":"string"}
  ]}
}');

 

Then, in the Hive Editor create the table with:

CREATE TABLE review_avro
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/romain/impala/reviews_avro'
TBLPROPERTIES ('avro.schema.literal'='{
  "name": "review",
  "type": "record",
  "fields": [
     {"name":"business_id", "type":"string"},
     {"name":"cool", "type":"int"},
     {"name":"date", "type":"string"},
     {"name":"funny", "type":"int"},
     {"name":"review_id", "type":"string"},
     {"name":"stars", "type":"int"},
     {"name":"text", "type":"string"},
     {"name":"type", "type":"string"},
     {"name":"useful", "type":"int"},
     {"name":"user_id", "type":"string"}]}'
);

You can now go back to Impala and use the table, after refreshing the metadata with:

REFRESH review_avro;

That completes our glance at file format support in Impala.

Hope this post helped you in understanding data formats in Impala. For further updates, keep visiting www.acadgild.com.
 