
Understanding HBase Compactions

What is HBase?

HBase is the database of the Hadoop ecosystem. It sits on top of a distributed file system (HDFS), which is where the data is physically stored. Within HBase, recent writes are held in memory and frequently read data is cached, which is what gives the ecosystem its speed.

We recommend reading our post on understanding HFile first, as it makes the concepts below easier to follow. We will also discuss how to tune compactions and take most of the control manually.

What are the types of Compactions?

The servers are active all day long, and under a Big Data workload HBase is writing almost continuously. When the in-memory storage area of HBase is almost full, its contents are flushed to disk as a new HFile, so the number of small HFiles keeps growing. Compaction is the process that merges these files, and it comes in two forms: Minor Compaction and Major Compaction.

Here are the various processes involved in Minor Compaction:

  • Smaller HFiles are combined into bigger HFiles.
  • Deleted cells are kept in the resulting HFile; they are not purged at this stage.
  • Fewer files remain in the store, which leaves room for new flushes and gives reads fewer files to check.
  • A merge sort is used in the process.
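
To make the list above concrete, here is a minimal sketch of the merge step in Python. The cell layout (row key, timestamp, kind, value) and the function name `minor_compact` are assumptions made for illustration; they are not HBase's on-disk format or API.

```python
import heapq

# Illustrative cell layout (an assumption, not HBase's real format):
# (row_key, timestamp, kind, value), where kind is "put" or "delete".

def minor_compact(hfiles):
    """Merge several already-sorted HFile-like lists into one sorted list.

    Delete markers are copied through unchanged, as in a minor compaction.
    """
    # Order by row key, then newest timestamp first.
    return list(heapq.merge(*hfiles, key=lambda c: (c[0], -c[1])))


small_a = [("row1", 5, "put", "a"), ("row3", 2, "put", "c")]
small_b = [("row1", 9, "delete", None), ("row2", 4, "put", "b")]

merged = minor_compact([small_a, small_b])
for cell in merged:
    print(cell)
```

Two small files become one bigger file, and the delete marker for row1 is still present in the result.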

The other type is Major Compaction:

  • All the data of one column family in a region is merged into a single HFile.
  • During this process, deleted and expired cells are removed permanently.
  • Read performance improves, because reads are served from the newly created HFile.
  • It generates a lot of I/O.
  • It can cause traffic congestion on a busy cluster.
  • Because existing data is rewritten, the effect of major compaction is also known as write amplification.
  • It should therefore be scheduled for the time when network I/O load is at its lowest.
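
The clean-up part of a major compaction can be sketched the same way. This is a simplified model in Python, assuming a hypothetical cell layout of (row key, timestamp, kind, value), a single TTL value, and delete markers that mask every older put on the same row; real HBase tracks deletes per column and version.

```python
def major_compact(cells, now, ttl):
    """Return the cells that survive a simplified major compaction.

    cells: list of (row_key, timestamp, kind, value), kind "put" or "delete".
    """
    # Find the newest delete marker for each row.
    deleted_at = {}
    for row, ts, kind, _ in cells:
        if kind == "delete":
            deleted_at[row] = max(ts, deleted_at.get(row, ts))

    survivors = []
    for row, ts, kind, value in cells:
        if kind == "delete":
            continue  # the marker itself is dropped
        if row in deleted_at and ts <= deleted_at[row]:
            continue  # masked by a delete marker
        if now - ts > ttl:
            continue  # expired cell
        survivors.append((row, ts, kind, value))
    # One sorted output, standing in for the single resulting HFile.
    return sorted(survivors, key=lambda c: (c[0], -c[1]))


cells = [
    ("row1", 9, "delete", None),
    ("row1", 5, "put", "a"),
    ("row1", 12, "put", "a2"),
    ("row2", 4, "put", "b"),
    ("row3", 1, "put", "old"),
]
compacted = major_compact(cells, now=20, ttl=17)
```

Here the put on row1 at timestamp 5 is masked by the delete marker, the cell on row3 has expired, and neither is written to the new file.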

HBase compaction tuning tips

Short Description:
How to use some lesser-known HBase compaction configuration options to improve the performance and stability of an HBase cluster.

Disabling automatic major compactions

Usually, HBase users want full control over major compaction events, and the only way to get it is to disable periodic automatic major compactions by setting hbase.hregion.majorcompaction to 0.
Unfortunately, this does not give you 100% control of major compactions, because HBase can sometimes promote a minor compaction to a major one automatically. Fortunately, there is another configuration option that helps in this case, covered in the next section.
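
As a small aid, the sketch below renders a setting in the `<property>` layout used by hbase-site.xml. The helper name `render_properties` is hypothetical and written for this post; only the property name and the value 0 come from the text above.

```python
import xml.etree.ElementTree as ET


def render_properties(props):
    """Render a dict of settings as hbase-site.xml style <property> elements."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    if hasattr(ET, "indent"):  # pretty-printing is available on Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# 0 disables the periodic automatic major compactions.
xml_text = render_properties({"hbase.hregion.majorcompaction": 0})
print(xml_text)
```

The printed fragment can be compared against the hbase-site.xml of your cluster before you change anything.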

Maximum compaction selection size

Another config option controls the compaction process:
hbase.hstore.compaction.max.size (the default value is Long.MAX_VALUE)
HBase 1.2+ also has:
hbase.hstore.compaction.max.size.offpeak
These options control the maximum size (in bytes) allowed for a compaction selection. If you need to delay large compactions (major ones) until off-peak hours, you can set, for example:
hbase.hstore.compaction.max.size=500000000 (500MB)
hbase.hstore.compaction.max.size.offpeak=500000000000 (500GB)
The idea is to prevent minor compactions from being promoted to major ones during peak hours. Compactions can still happen during peak hours, but they will be restricted to 500MB in size (or to whatever you set). Of course, if your region size is below 500MB, some major compactions can still happen. What matters here is not whether a compaction is major or minor, but how large it is.
Note: these settings are ignored when you run manual major compaction requests.
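
The effect of the two limits can be illustrated with a simplified model in Python. It assumes that a store file larger than the limit in force is left out of the selection, and that a minor compaction is promoted to a major one only when every file of the store is selected. The function names are hypothetical; the two values are the example settings from above.

```python
PEAK_MAX_SIZE = 500_000_000         # hbase.hstore.compaction.max.size
OFFPEAK_MAX_SIZE = 500_000_000_000  # hbase.hstore.compaction.max.size.offpeak


def eligible_files(file_sizes, off_peak):
    """Keep only the store files that fit under the limit currently in force."""
    limit = OFFPEAK_MAX_SIZE if off_peak else PEAK_MAX_SIZE
    return [size for size in file_sizes if size <= limit]


def could_be_promoted(file_sizes, off_peak):
    """In this model, promotion to major needs every file to be selected."""
    return len(eligible_files(file_sizes, off_peak)) == len(file_sizes)


store = [40_000_000, 120_000_000, 8_000_000_000]  # sizes in bytes

peak_selection = eligible_files(store, off_peak=False)
offpeak_selection = eligible_files(store, off_peak=True)
```

During peak hours the 8GB file is skipped, so the compaction stays small and cannot turn into a major one. During off-peak hours all three files are eligible.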

Off peak compactions

If your deployment has off-peak hours, you can use the off-peak configuration settings.
To enable off-peak compactions, the following config options must be set:
hbase.offpeak.start.hour=0..23
hbase.offpeak.end.hour=0..23
By default, the compaction file ratio is 1.2 for peak hours and 5.0 for off-peak hours.
Both can be changed:
hbase.hstore.compaction.ratio
hbase.hstore.compaction.ratio.offpeak
The higher the file ratio value, the more aggressive (frequent) compaction is going to be. The default values are fine for the majority of deployments.
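
To show how the off-peak window and the file ratio interact, here is a rough Python sketch. It assumes the window may wrap past midnight, that an empty window (start equal to end) means off-peak is disabled, and that a file passes the ratio test when it is no larger than the ratio times the combined size of the newer files. This is a simplification of HBase's selection logic and all function names are hypothetical.

```python
PEAK_RATIO = 1.2     # default file ratio for peak hours
OFFPEAK_RATIO = 5.0  # default file ratio for off-peak hours


def is_off_peak(hour, start_hour, end_hour):
    """True if hour lies in [start_hour, end_hour), wrapping past midnight."""
    if start_hour == end_hour:
        return False  # assumption: an empty window disables off-peak
    if start_hour < end_hour:
        return start_hour <= hour < end_hour
    return hour >= start_hour or hour < end_hour


def ratio_for(hour, start_hour, end_hour):
    """Pick the file ratio that applies at the given hour."""
    return OFFPEAK_RATIO if is_off_peak(hour, start_hour, end_hour) else PEAK_RATIO


def passes_ratio(file_size, newer_sizes, ratio):
    """Simplified ratio test for including a file in a compaction."""
    return file_size <= ratio * sum(newer_sizes)


# Off-peak window from 22:00 to 06:00; sizes in MB.
at_noon = passes_ratio(300, [100, 50], ratio_for(12, 22, 6))
at_night = passes_ratio(300, [100, 50], ratio_for(23, 22, 6))
```

With the peak ratio of 1.2 the 300MB file is left alone at noon, while the off-peak ratio of 5.0 lets it join the compaction at night, which is why a higher ratio means more aggressive compaction.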

We hope this post has helped you understand compactions in HBase and how to take control of them. If you have any questions, feel free to comment below and we will get back to you at the earliest.
Stay tuned to our blog for more posts on Big Data and other technologies.



An alumnus of the NIE-Institute Of Technology, Mysore, Prateek is an ardent Data Science enthusiast. He has been working at Acadgild as a Data Engineer for the past 3 years. He is a Subject-matter expert in the field of Big Data, Hadoop ecosystem, and Spark.
