Parquet File Format Hadoop

In this blog, we will discuss one of the most widely used file formats in the Hadoop ecosystem: Parquet, an open-source columnar file format. Parquet stores nested data structures in a flat columnar layout. Compared to a traditional approach where data is stored row by row, Parquet is more efficient in both storage and performance.
 
Parquet can be used with any tool in the Hadoop ecosystem, such as Hive, Impala, Pig, and Spark.
Parquet stores binary data in a column-oriented way: the values of each column are organized so that they are all adjacent, which enables better compression. It is especially good for queries that read particular columns from a "wide" table (one with many columns), since only the needed columns are read and I/O is minimized.
When we process Big Data, the cost of storing it is high (Hadoop stores data redundantly, i.e., three copies of each file, to achieve fault tolerance), and on top of the storage cost, processing the data incurs CPU, network I/O, and other costs. As the data grows, both storage and processing costs increase. Parquet is a popular choice for Big Data because it serves both needs: it is efficient and performant in both storage and processing.
We will see how to use Parquet with Hive to achieve better compression and performance. For demonstration, we will use historical stock data of the S&P 500 index.
To use Parquet with Hive 0.10–0.12, you must download the Parquet Hive package from the Parquet project; you want the parquet-hive-bundle jar from Maven Central. From Hive 0.13 onward, native Parquet support is included.
Creating a table in Hive to store data in Parquet format:
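The original code listing is not available, so here is a minimal sketch of what the Parquet-backed table could look like. The table name (`stocks_parquet`) and column layout are illustrative assumptions for daily S&P 500 quotes, not taken from the original post; the `STORED AS PARQUET` clause requires Hive 0.13 or later.

```sql
-- Hypothetical schema for daily S&P 500 stock quotes,
-- stored in Parquet format (native support from Hive 0.13).
CREATE TABLE stocks_parquet (
  trade_date STRING,
  symbol     STRING,
  open       DOUBLE,
  high       DOUBLE,
  low        DOUBLE,
  close      DOUBLE,
  volume     BIGINT
)
STORED AS PARQUET;
```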

We cannot load a text file directly into a Parquet table; we must first create an intermediate table stored as text, load the file into it, and then use the INSERT OVERWRITE command to write the data in Parquet format.
First, we create a table to store the text data.
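A sketch of the intermediate text table, assuming the input is a comma-separated file; the table name (`stocks_txt`) and schema are illustrative and mirror the hypothetical Parquet table above:

```sql
-- Intermediate table matching the layout of the CSV input;
-- names and column types are illustrative assumptions.
CREATE TABLE stocks_txt (
  trade_date STRING,
  symbol     STRING,
  open       DOUBLE,
  high       DOUBLE,
  low        DOUBLE,
  close      DOUBLE,
  volume     BIGINT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```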

Load the data into the table
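Assuming a text table named `stocks_txt` and a local CSV file (both names are illustrative, as the original listing is missing), the load step could look like:

```sql
-- LOCAL reads from the client filesystem; drop LOCAL to read from HDFS.
-- The path is a placeholder; point it at the actual S&P 500 CSV.
LOAD DATA LOCAL INPATH '/tmp/sp500.csv' INTO TABLE stocks_txt;
```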

 
Check the data
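A quick sanity check, again assuming the illustrative table name `stocks_txt`:

```sql
-- Confirm the rows were parsed into the expected columns.
SELECT * FROM stocks_txt LIMIT 10;
```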

Use the INSERT OVERWRITE command to load the data into the Parquet table.
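With the hypothetical table names used above, the conversion step reads from the text table and writes Parquet files:

```sql
-- Hive rewrites the selected rows in the Parquet table's storage
-- format, replacing any existing contents of stocks_parquet.
INSERT OVERWRITE TABLE stocks_parquet
SELECT * FROM stocks_txt;
```

After this step, queries should be run against `stocks_parquet` to benefit from the columnar layout.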

 

To test performance meaningfully, the queries should be run on a multi-node cluster, where jobs are parallelized and run simultaneously.

Advantages of using Parquet

There are several advantages to columnar formats.

  • Organizing by column allows for better compression, as data is more homogeneous. The space savings are very noticeable at the scale of a Hadoop cluster.
  • I/O will be reduced as we can efficiently scan only a subset of the columns while reading the data. Better compression also reduces the bandwidth required to read the input.
  • As each column stores data of the same type, we can use encodings better suited to modern processors' pipelines, making instruction branching more predictable.
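The I/O-reduction point can be seen with a simple query against the hypothetical `stocks_parquet` table from earlier: because Parquet is columnar, only the column chunks actually referenced are read from disk.

```sql
-- Reads only the symbol and close column chunks from each Parquet
-- file; the other five columns are never touched on disk.
SELECT symbol, AVG(close) AS avg_close
FROM stocks_parquet
GROUP BY symbol;
```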

We hope this blog helped you implement the Parquet file format using Hive. For more free resources and blogs on Big Data and other technologies, visit our site.
