Processing Small Files in Hadoop

Hadoop is designed to work with large files, so large numbers of small files (files much smaller than the HDFS block size) cause problems. This blog gives a brief overview of how to handle them.
There are two primary reasons why small files are problematic in Hadoop:

  • NameNode memory management and
  • MapReduce performance.

NameNode Memory Management
Every file, directory and block in HDFS is represented as an object in the NameNode's memory, so millions of small files can exhaust it. Two common approaches to this memory problem are Hadoop Archive (HAR) files and federated NameNodes.
Hadoop Archives (HAR) is an archiving facility that packs files into HDFS blocks efficiently, which makes it a good fit for the small files problem in Hadoop. A HAR is created from a collection of files: the archiving tool (a single command) runs a MapReduce job that processes the input files in parallel and creates an archive file.
Federation uses multiple independent NameNodes, each managing its own namespace. The NameNodes do not require coordination with each other. The DataNodes are used as common storage for blocks by all the NameNodes: each DataNode registers with every NameNode in the cluster, sends periodic heartbeats and block reports, and handles commands from the NameNodes.
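To get a feel for the memory pressure, here is a rough back-of-the-envelope sketch. It assumes the commonly quoted rule of thumb of about 150 bytes of NameNode heap per namespace object (file or block); that figure, the file counts and the function name are illustrative assumptions, not measurements from a real cluster.

```python
# Assumption: roughly 150 bytes of NameNode heap per namespace object
# (file or block). This is a rule of thumb, not a measured value.
BYTES_PER_OBJECT = 150


def estimate_namenode_bytes(num_files: int, blocks_per_file: int = 1) -> int:
    """Estimate heap used by the file objects plus their block objects."""
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT


# 10 million small files, each fitting in a single block
small = estimate_namenode_bytes(10_000_000)
# Hypothetically, a similar volume of data consolidated into
# 10,000 large files of 8 blocks each
packed = estimate_namenode_bytes(10_000, blocks_per_file=8)

print(f"small files:  {small / 1e9:.2f} GB")   # 3.00 GB
print(f"consolidated: {packed / 1e6:.2f} MB")  # 13.50 MB
```

Under these assumptions the difference comes almost entirely from the number of objects, not from the amount of data stored.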

MapReduce Performance
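MapReduce typically assigns at least one map task to each input file, so a job over many small files spends most of its time starting tasks rather than processing data. Common solutions include: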

  • Change the ingestion process/interval
  • Batch file consolidation
  • Sequence files
  • HBase
  • S3DistCp (if using Amazon EMR)
  • Using a CombineFileInputFormat
  • Hive configuration settings
  • Using Hadoop’s append capabilities
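Several of the options above share one idea: let a single map task handle many small files. The sketch below illustrates that idea with a simple greedy grouping by size. It is a simplified model and not Hadoop's actual CombineFileInputFormat logic (which also considers node and rack locality); the function name, file names and sizes are hypothetical.

```python
from typing import Dict, List


def pack_into_splits(file_sizes: Dict[str, int], max_split_bytes: int) -> List[List[str]]:
    """Greedily group files so that each group stays within max_split_bytes.

    Simplified model: a file larger than the limit gets a group of its own,
    and data locality is ignored.
    """
    splits: List[List[str]] = []
    current: List[str] = []
    current_bytes = 0
    for name, size in file_sizes.items():
        # Close the current group if adding this file would exceed the limit
        if current and current_bytes + size > max_split_bytes:
            splits.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        splits.append(current)
    return splits


MB = 1024 * 1024
# Hypothetical input: 1,000 files of 1 MB each
files = {f"part-{i:04d}.log": 1 * MB for i in range(1000)}
splits = pack_into_splits(files, max_split_bytes=128 * MB)
print(len(files), "files ->", len(splits), "map tasks")  # 1000 files -> 8 map tasks
```

With one task per file this input would need 1,000 map tasks; grouping up to 128 MB per task brings that down to 8 in this example.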

Step-by-Step Practical Implementation Using the HAR Command
Archiving runs as an additional Hadoop step at the start of your job flow and aggregates the small files.
Let us take two small input files as a sample.

Add these files to HDFS.

 
Syntax:
hadoop archive -archiveName NAME -p <parent path> [-r <replication factor>] <src>* <dest>
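If you script the archive step, it can help to assemble the command from its parts. The following is a minimal sketch that only builds and prints the argument list following the syntax above; the helper name and all paths except /output/location and myhar.har are hypothetical, and nothing is executed against a cluster.

```python
import shlex
from typing import List, Optional


def build_har_command(name: str, parent: str, sources: List[str], dest: str,
                      replication: Optional[int] = None) -> List[str]:
    """Assemble the argument list for `hadoop archive`."""
    cmd = ["hadoop", "archive", "-archiveName", name, "-p", parent]
    if replication is not None:
        cmd += ["-r", str(replication)]
    cmd += list(sources)  # source paths are relative to the parent path
    cmd.append(dest)
    return cmd


# Hypothetical parent and source paths
cmd = build_har_command("myhar.har", "/input/location", ["smallfiles"], "/output/location")
print(shlex.join(cmd))

# To run it on a machine with Hadoop installed:
# import subprocess
# subprocess.run(cmd, check=True)
```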

 
Once a .har file is created, you can list its contents and see that it is made up of index files and part files. Part files are simply the original files concatenated into one big file. Index files are lookup files used to locate the individual small files inside the big part files.
hadoop fs -ls /output/location/myhar.har
/output/location/myhar.har/_index
/output/location/myhar.har/_masterindex
/output/location/myhar.har/part-0
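The relationship between part files and index files can be pictured with a small in-memory model. This is only a sketch of the idea (concatenate the contents, remember where each file starts and how long it is); it does not reproduce the real on-disk format of _index or _masterindex, and the file names and contents are made up.

```python
from typing import Dict, Tuple


def build_archive(files: Dict[str, bytes]) -> Tuple[bytes, Dict[str, Tuple[int, int]]]:
    """Concatenate file contents into one 'part' blob and record (offset, length) per file."""
    part = bytearray()
    index: Dict[str, Tuple[int, int]] = {}
    for name, data in files.items():
        index[name] = (len(part), len(data))
        part.extend(data)
    return bytes(part), index


def read_file(part: bytes, index: Dict[str, Tuple[int, int]], name: str) -> bytes:
    """Look up one original file inside the part blob via the index."""
    offset, length = index[name]
    return part[offset:offset + length]


# Two hypothetical small input files
files = {"file1.txt": b"hello hadoop\n", "file2.txt": b"small files\n"}
part, index = build_archive(files)
print(index)
print(read_file(part, index, "file2.txt"))
```

The storage layer sees one large object, while each original file remains individually addressable through the index.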

 
We can view the result stored in the part file inside the .har directory with the cat command, for example hadoop fs -cat /output/location/myhar.har/part-0.

 
In this way, thousands of small files can be joined into a single large file, which makes them easier for Hadoop to process.
You can also merge files stored in HDFS with the getmerge command, which concatenates them into a single file on the local file system. For more information on this, you can refer to our blog, Merging files in HDFS.
[Figure: sample comparison in execution time. Image source: Snowplow Analytics]
We hope this blog was useful.