
Top 10 Differences Between Hadoop 2.x and 3.x

Every firm, big or small, is coming to understand the benefit of implementing Hadoop and extracting maximum value from its data. Hadoop's 1.0 release came out in 2011, and the platform has since gone through major changes across three major versions.
In this blog, we will look at 10 major changes that Apache Hadoop has made in version 3.x to improve on 2.x.
Readers interested in the differences between Hadoop 1.x and Hadoop 2.x can follow the link to read about them.
At the time of writing, 3.x is still in the testing phase and has moved from the alpha1 release to the alpha2 release. The main release is planned for the middle of this year.
A few features carry over from Hadoop 2.x and remain just as useful in the new release. The first is YARN (Yet Another Resource Negotiator).
YARN introduced containers in Hadoop 2.x. Running programs in containers improves qualities such as scalability, high availability, and multi-tenancy. Ecosystem tools like Hive, Pig, and Sqoop remain available as well. In addition, DataNode resources are not dedicated to MapReduce and can be used by other applications.
Since these features of Hadoop 2.x continue into Hadoop 3.x, let's look at the following table describing the major differences.

10 Major Differences Between Hadoop 2.x and 3.x

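1. Minimum Java version
   Hadoop 2.x: Java 6 was the minimum requirement (raised to Java 7 from release 2.7).
   Hadoop 3.x: Java 8 is the minimum requirement, because most of the dependency libraries used are built for Java 8.

2. Fault tolerance in HDFS
   Hadoop 2.x: HDFS supports replication for fault tolerance.
   Hadoop 3.x: HDFS also supports erasure coding, a technique for durably storing information with significant space savings compared to replication.

3. YARN Timeline Service
   Hadoop 2.x: YARN Timeline Service (v.1).
   Hadoop 3.x: Introduces YARN Timeline Service v.2, with improved scalability and reliability.

4. Shell scripts
   Hadoop 2.x: Limited shell scripts with long-standing bugs.
   Hadoop 3.x: Rewritten Unix shell scripts with many new features, along with fixes for the old bugs.

5. MapReduce performance
   Hadoop 2.x: MapReduce became fast due to YARN.
   Hadoop 3.x: MapReduce became faster thanks to a native implementation of the map output collector. Shuffle-intensive jobs can improve by 30% or more.

6. NameNodes
   Hadoop 2.x: High availability was limited to one active and one standby NameNode.
   Hadoop 3.x: Supports more than 2 NameNodes (one active and multiple standbys).

7. Default ports
   Hadoop 2.x: Several default ports fell inside the Linux ephemeral port range, which could lead to failures when a service tried to reserve its port.
   Hadoop 3.x: The default ports have been moved out of the ephemeral range. For example, the NameNode web UI moved from 50070 to 9870.

8. Microsoft filesystem support
   Hadoop 2.x: Early 2.x releases did not integrate with Microsoft Azure Data Lake.
   Hadoop 3.x: Hadoop now supports integration with Microsoft Azure Data Lake as an alternative Hadoop-compatible filesystem.

9. Balancing data inside a DataNode
   Hadoop 2.x: A single DataNode manages multiple disks. Adding or replacing disks can lead to significant skew within a DataNode, and the existing balancer only evens out data between DataNodes.
   Hadoop 3.x: New intra-DataNode balancing functionality is added, which is invoked via the hdfs diskbalancer CLI.

10. Heap size configuration
   Hadoop 2.x: The heap size for Java and Hadoop tasks has to be set by hand on each host.
   Hadoop 3.x: There are new methods for configuring daemon heap sizes. Notably, auto-tuning is now possible based on the memory size of the host, and the HADOOP_HEAPSIZE variable has been deprecated in favor of HADOOP_HEAPSIZE_MAX and HADOOP_HEAPSIZE_MIN.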
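
To put a number on the space savings mentioned in item 2, here is a rough Python sketch comparing the storage overhead of 3-way replication with a Reed-Solomon layout of 6 data cells and 3 parity cells. The RS(6,3) layout and the function name storage_overhead are assumptions chosen for illustration, not something this post prescribes.

```python
def storage_overhead(data_units: int, redundant_units: int) -> float:
    """Extra storage needed, as a fraction of the original data size."""
    return redundant_units / data_units


# 3-way replication: one original block plus two extra copies.
replication = storage_overhead(data_units=1, redundant_units=2)

# Reed-Solomon (6, 3), assumed here: 6 data cells plus 3 parity cells per stripe.
rs_6_3 = storage_overhead(data_units=6, redundant_units=3)

print(f"3x replication overhead: {replication:.0%}")  # 200%
print(f"RS(6,3) overhead:        {rs_6_3:.0%}")       # 50%
```

Under these assumptions, replication survives the loss of two copies at a cost of 200% extra storage, while RS(6,3) survives the loss of any three cells in a stripe at a cost of only 50% extra storage.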

You can refer to this blog for installing a Hadoop 3.x single-node cluster on your machine.
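
When checking a fresh install, keep the port change from item 7 in mind: the web UIs are no longer on their 2.x ports. The Python sketch below lists a few of the old and new defaults and checks each against the Linux ephemeral port range. The range 32768 to 60999 is a common kernel default and is an assumption here (your host's value is in /proc/sys/net/ipv4/ip_local_port_range), and the helper name in_ephemeral_range is made up for this example.

```python
# Common Linux default for net.ipv4.ip_local_port_range (assumption:
# your host may be configured differently).
EPHEMERAL_RANGE = range(32768, 61000)

# Default ports as (Hadoop 2.x, Hadoop 3.x).
DEFAULT_PORTS = {
    "NameNode HTTP UI": (50070, 9870),
    "NameNode HTTPS UI": (50470, 9871),
    "Secondary NameNode HTTP": (50090, 9868),
    "DataNode HTTP UI": (50075, 9864),
    "DataNode data transfer": (50010, 9866),
    "DataNode IPC": (50020, 9867),
}


def in_ephemeral_range(port: int) -> bool:
    """True if the OS may hand this port out to an outgoing connection."""
    return port in EPHEMERAL_RANGE


for service, (old, new) in DEFAULT_PORTS.items():
    print(
        f"{service}: {old} (ephemeral: {in_ephemeral_range(old)}) "
        f"-> {new} (ephemeral: {in_ephemeral_range(new)})"
    )
```

Every 2.x port in this list falls inside the assumed ephemeral range, so another process could already hold it when a daemon starts. None of the 3.x ports do.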

One Comment

  1. Hi Prateek,
    I want to explore Hadoop, but in Windows 10. Is this possible with optimum results? Which Hadoop and Hive version should I use?
    Also, I installed Hadoop v2.9.0, but after configuration I find that http://localhost:50070 is not working. The DataNode folder is empty. Can you please suggest what could be the reason?
