In this blog, we will discuss Parquet, one of the most widely used file formats in the Hadoop ecosystem.
Parquet is an open source file format for Hadoop that stores nested data structures in a flat columnar layout. Compared to a traditional approach where data is stored row by row, Parquet is more efficient in both storage and performance.
Parquet stores binary data in a column-oriented way: the values of each column are organized so that they are all adjacent, enabling better compression. It is especially good for queries that read particular columns from a “wide” table (one with many columns), since only the needed columns are read and I/O is minimized. Read this for more details on Parquet.
When we process Big Data, the cost of storing it is high (Hadoop stores data redundantly, i.e. three copies of each file, to achieve fault tolerance), and on top of the storage cost, processing the data incurs CPU, network I/O, and other costs. As the data grows, the cost of storage and processing grows with it. Parquet is a natural choice for Big Data because it serves both needs: efficiency in storage and performance in processing.
We will see how to use Parquet with Hive to achieve better compression and performance.
For demonstration, we will use historical stock data of the S&P 500 index.
To use Parquet with Hive 0.10 – 0.12 you must download the Parquet Hive package from the Parquet project; you want the parquet-hive-bundle jar from Maven Central.
From Hive 0.13 onwards, native Parquet support is built in.
Creating a table in Hive to store the Parquet format:
We cannot load a text file directly into a Parquet table; we should first create an intermediate table to hold the text data, then use the INSERT OVERWRITE command to write the data out in Parquet format.
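A minimal sketch of the Parquet-backed table. The table name and column schema (daily S&P 500 quotes) are assumptions for illustration; adapt them to your actual dataset:

```sql
-- Hypothetical schema for daily S&P 500 quotes
CREATE TABLE stocks_parquet (
  trade_date STRING,
  open       DOUBLE,
  high       DOUBLE,
  low        DOUBLE,
  close      DOUBLE,
  volume     BIGINT
)
STORED AS PARQUET;  -- native Parquet support, Hive 0.13+
```

On Hive 0.10 – 0.12 you would instead specify the Parquet SerDe and input/output formats from the parquet-hive-bundle jar, since `STORED AS PARQUET` is not available there.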
We will create a table to store the text data:
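A sketch of the staging table, using the same assumed columns as the Parquet table and assuming the input is comma-separated:

```sql
-- Staging table for the raw CSV data
CREATE TABLE stocks_text (
  trade_date STRING,
  open       DOUBLE,
  high       DOUBLE,
  low        DOUBLE,
  close      DOUBLE,
  volume     BIGINT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```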
Load the data into the table:
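For example (the file path and table name here are assumptions):

```sql
-- LOCAL reads from the local filesystem; drop LOCAL to load from HDFS
LOAD DATA LOCAL INPATH '/tmp/sp500.csv' INTO TABLE stocks_text;
```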
Check the data:
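A quick sanity check that the rows parsed into the expected columns (table name assumed as above):

```sql
SELECT * FROM stocks_text LIMIT 10;
```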
Use the INSERT OVERWRITE command to load the data into the Parquet table:
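This step reads the text staging table and rewrites the same rows in columnar Parquet format (table names assumed as above):

```sql
-- Hive runs a job that re-encodes the rows as Parquet
INSERT OVERWRITE TABLE stocks_parquet
SELECT * FROM stocks_text;
```

After this, queries should target `stocks_parquet`; the text table is only a staging area and can be dropped.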
To test performance, we should run the queries on a multi-node cluster, where jobs are parallelized and run simultaneously.
Advantages of using Parquet
There are several advantages to columnar formats.
- Organizing by column allows for better compression, as data is more homogeneous. The space savings are very noticeable at the scale of a Hadoop cluster.
- I/O will be reduced as we can efficiently scan only a subset of the columns while reading the data. Better compression also reduces the bandwidth required to read the input.
- As each column stores data of a single type, we can use encodings better suited to modern processors’ pipelines, making instruction branching more predictable.
We hope this blog helped you implement the Parquet file format using Hive. For more free resources and blogs on Big Data and other technologies, visit here.