Hive Compression Codecs
Hadoop processing components such as Hive and MapReduce move large amounts of data across the network between nodes, and read and write large amounts of data on disk (not to mention the redundant copies stored for fault tolerance). Compression reduces both the network bandwidth used for I/O and the storage consumed. In this blog, we will go through compression in Hive. To reduce the amount of disk space that Hive queries use, you should enable the Hive compression codecs.
There are many different compression codecs that we can use with Hive, namely 4mc, snappy, lzo, lz4, bzip2, and gzip. Each one has its own benefits and drawbacks. The codecs and their classes are:
- 4mc com.hadoop.compression.fourmc.FourMcCodec
- gzip org.apache.hadoop.io.compress.GzipCodec
- lzo com.hadoop.compression.lzo.LzopCodec
- Snappy org.apache.hadoop.io.compress.SnappyCodec
- bzip2 org.apache.hadoop.io.compress.BZip2Codec
- lz4 org.apache.hadoop.io.compress.Lz4Codec
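Before picking a codec, it can help to check which codec classes the cluster has registered. A minimal sketch, assuming the Hive command line interface and assuming the cluster lists its codecs under the standard Hadoop property io.compression.codecs; if the property is not set, Hive reports it as undefined.

```sql
-- Print the current value of a property by naming it without a value.
-- io.compression.codecs is the Hadoop property that lists registered codec classes.
SET io.compression.codecs;
```

If the class you want (for example the 4mc class above) is missing from the printed list, its library probably still needs to be installed and registered on the cluster.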
Here we are going to use the 4mc compression codec in an example. Below are some additional details, including file listings before and after enabling the codec.
Let us see how to enable the compression codecs in Hive. There are two places where compression can be applied in Hive: one is the intermediate data produced between the stages of a query, and the other is the output of a Hive query written to its HDFS location.
Users can always enable or disable compression within the Hive session for each query. These properties are set either in hive-site.xml or within the Hive session via the Hive command line interface.
hive> set hive.exec.compress.output=true;
hive> set mapred.output.compression.codec=com.hadoop.compression.fourmc.FourMcHighCodec;
In this blog, we have used the above properties to compress a sample file, as shown later with an example.
Users can also set the following properties in hive-site.xml and mapred-site.xml to make the settings permanent.
hive.exec.compress.intermediate is false by default, which means the files created in intermediate map steps aren’t compressed. Set this to true so that I/O and network transfers take up less bandwidth.
hive.exec.compress.output is false by default; this parameter determines whether the final output written to HDFS is compressed.
The codecs themselves are chosen with the parameters mapred.map.output.compression.codec and mapred.output.compression.codec in the config file (/etc/hive/conf/mapred-site.xml).
MapReduce is one kind of application that runs on the Hadoop platform, so the settings for all applications that use the MapReduce framework go in this file.
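The intermediate-compression settings described above can also be tried in a single session before making them permanent. The following is a minimal sketch that uses only property names mentioned in this post; the choice of Snappy for intermediate data is an assumption made for illustration, not something the setup above prescribes.

```sql
-- Compress the files produced by intermediate map steps.
SET hive.exec.compress.intermediate=true;
-- Assumption: Snappy for intermediate data (class name taken from the codec list above).
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
-- Naming a property without a value prints its current setting.
SET hive.exec.compress.intermediate;
```

Settings made this way last only for the current Hive session, which makes it easy to compare a query with and without compression.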
Let’s look at an example:
Create a table with the following command and load data into it:
create table Prateek_Emp(line String) row format delimited fields terminated by '\n';
load data local inpath '/home/acadgild/Downloads/employee.txt' into table Prateek_Emp;
Set the properties in Hive for the compression codec:
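A minimal sketch of this step, reusing the session properties shown earlier. The codec here is an assumption: the later steps load files ending in .bz2, which suggests the bzip2 codec was active when the output was written. To follow the 4mc settings instead, substitute the 4mc class.

```sql
-- Compress the final query output written to HDFS.
SET hive.exec.compress.output=true;
-- Assumption: bzip2, to match the *.bz2 files loaded in a later step.
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec;
```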
Write the data from Hive to HDFS after the compression codec is set:
INSERT OVERWRITE DIRECTORY '/user/hive/' SELECT * FROM Prateek_Emp;
Check that a file is present in the output directory:
hadoop dfs -ls /user/hive/
Compare the size of the original file with that of the compressed file:
ls -ltr employee.txt
hadoop dfs -ls /user/hive
Comparing the two listings, the original file on the local file system takes up more space than the compressed file in HDFS.
Let’s see how we can perform a query on the compressed data inside Hive.
First, we need to create another table to load the compressed data.
CREATE TABLE IF NOT EXISTS Prateek_Emp_bzip(name string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\n' STORED AS TEXTFILE LOCATION '/user/hive';
Load the data from the compressed file and verify it:
LOAD DATA INPATH '/user/hive/*.bz2' INTO TABLE Prateek_Emp_bzip;
select * from Prateek_Emp_bzip;
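Beyond eyeballing the rows, one way to check that nothing was lost is to compare row counts between the original table and the table backed by compressed files. This is a sketch using the two tables created above; it assumes both tables still exist and that the load step succeeded.

```sql
-- Row count of the uncompressed source table.
SELECT COUNT(*) FROM Prateek_Emp;
-- Row count read back from the compressed files; the two counts should match.
SELECT COUNT(*) FROM Prateek_Emp_bzip;
```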
In conclusion, if you enable compression codecs such as 4mc with Hive, you can reduce the overall running time of your queries as well as their disk consumption.