Hadoop 2.0

Sarvar
8 min read · Oct 22, 2022

Hey,

It’s Sarvar Nadaf again, a senior developer at Luxoft. I have worked on technologies such as Cloud Ops (Azure and AWS), Data Ops, serverless analytics, and DevOps for various clients across the globe.

Let’s take a look at Hadoop 2.0.

Previously, I wrote an article describing Hadoop 1.0 in full; if you haven’t read it, the link is right below. Now we’ll look more closely at the Hadoop 2.0 framework and see exactly what advantages it has over Hadoop 1.0.

Link — (Hadoop 1.0)

Why Hadoop 2.0?

The Hadoop 1.0 architecture supports only the MapReduce framework; alternative engines such as Spark, Hive, Cassandra, HBase, and Impala are not supported. Hadoop 2.0 enables a wide range of tools for all types of data processing, including batch processing, data warehousing, data streaming, and more.

Hadoop 1.0 also has only one Name Node, making it the single point of failure in the framework. The active and standby Name Nodes introduced in Hadoop 2.0 overcome this drawback.

Finally, to perform tasks like resource management, job scheduling, job monitoring, and job rescheduling, Hadoop 1.0 has only two components, the Job Tracker and the Task Tracker, and the Job Tracker is itself a single point of failure. The YARN framework in Hadoop 2.0 offers far better resource management: it includes a Resource Manager with active/standby high availability, along with services such as the Node Manager, the scheduler, and the Application Master that manage individual tasks.

Hadoop 2.0

Now, we’ll take a closer look at the Hadoop 2.0 services.

HDFS:

HDFS is Hadoop’s distributed file system. Its architecture is master-slave: the Name Node is the master service, while the Data Node is a slave service.

Name Node:

The Name Node holds the metadata about the data, such as which node each block is actually stored on and the details of its replicas; the Name Node is the brain of the Hadoop 2.0 ecosystem. The two most crucial files on a Name Node are the Edit Logs and the FS Image. Together they function as the true Hadoop file system. Many people believe Hadoop’s file system is HDFS alone, but the file system’s state actually lives in the FS Image and Edit Logs. Let’s explore why these are the most crucial files.

  1. FS Image: On the Linux server where the Name Node is configured, the FS Image is kept in the directory /dfs/nn/. The FS Image records where data lives in the data blocks, which blocks are kept on which node, and the full directory structure of HDFS (in other words, the namespace). It works closely with the Edit Logs, which contain the most recent operations performed on the cluster: whenever a new edit log becomes available, it is merged into the FS Image.
  2. Edit Logs: As the name implies, the Edit Logs keep track of changes made to the Hadoop cluster. Data Nodes periodically send block reports containing information on the operations performed on each block, including newly created blocks, deleted blocks, updated blocks, new replicas, and so on. The edit log records every operation carried out on the cluster, and the periodic merge of the edit log into the FS Image (mentioned above) keeps the Name Node’s metadata current, with a timestamp for the latest update.
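To make the Name Node’s role concrete, here is a minimal Java sketch (assuming a Hadoop 2.x client on the classpath and a hypothetical file path) that asks the Name Node which Data Nodes hold the blocks of a file. The answer comes straight from the namespace the Name Node rebuilds from the FS Image and Edit Logs:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationLookup {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/data/example.txt"); // hypothetical path
            FileStatus status = fs.getFileStatus(file);
            // The Name Node answers this from its in-memory namespace,
            // which is rebuilt from the FS Image plus Edit Logs at startup.
            BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                        block.getOffset(), block.getLength(),
                        String.join(",", block.getHosts()));
            }
        }
    }
}
```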

Data Node:

In Hadoop 2.0 HDFS, the Data Node is the slave node. It is the actual storage layer, where the chunks of data are kept. Hadoop 2.0’s default block size is 128 MB per block, but in enterprise settings a block size of 256 MB per block is often suggested for better performance and higher throughput.
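As an illustration of the larger block size, here is a minimal sketch that sets it on the client when writing a new file. The output path is hypothetical, and dfs.blocksize only affects files created by this client:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default block size for files created by this client:
        // 256 MB instead of the 128 MB default, as suggested above.
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024);
        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/data/large-file.bin"))) {
            out.writeBytes("sample payload"); // split into blocks as the file grows
        }
    }
}
```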

Since the Data Node holds all the data, it automatically sends block reports to the Name Node every hour and after each restart. Data Nodes also send heartbeat signals to the Name Node every 3 seconds, which keeps the two in sync and marks the Data Node as an alive node.
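Both intervals are configurable. A small sketch of the relevant properties (the heartbeat default is 3 seconds; the block-report default varies by release, so treat the values below as illustrations only):

```java
import org.apache.hadoop.conf.Configuration;

public class DataNodeReportingSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // How often each Data Node heartbeats the Name Node, in seconds.
        conf.setLong("dfs.heartbeat.interval", 3);
        // How often each Data Node sends a full block report, in milliseconds.
        conf.setLong("dfs.blockreport.intervalMsec", 60L * 60 * 1000);
    }
}
```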

Name Node High Availability:

  1. Fencing — The Journal Nodes perform fencing by allowing only one Name Node to be a writer at a time; the standby Name Node can only read from the Journal Nodes to stay in sync.
  2. Epoch Number — Epoch numbers prevent split-brain situations between the Name Nodes. Each time a Name Node becomes active it receives a higher epoch number, and the Journal Nodes reject writes that carry a stale epoch. This keeps only one Name Node active at a time, and after a failover the failed Name Node is marked as standby.
  3. QJM (Quorum Journal Manager) — The QJM shares the edit logs between the active and standby Name Nodes. It follows a strict majority rule: an edit is committed only when a majority of the Journal Nodes acknowledges it.
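To tie these pieces together, here is a rough sketch of the core QJM-based HA properties. The nameservice id, Name Node ids, hostnames, and Journal Node quorum are all placeholders; in practice these settings live in hdfs-site.xml:

```java
import org.apache.hadoop.conf.Configuration;

public class NameNodeHaConfSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // One logical nameservice backed by two Name Nodes.
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "master1:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "master2:8020");
        // The active Name Node writes edits to a quorum of Journal Nodes;
        // the standby tails the same shared edits to stay in sync.
        conf.set("dfs.namenode.shared.edits.dir",
                "qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster");
        // Client-side proxy that probes nn1/nn2 to find the active Name Node.
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                "org.apache.hadoop.hdfs.server.namenode.ha."
                + "ConfiguredFailoverProxyProvider");
    }
}
```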

Zookeeper –

  1. Failure detection — Zookeeper keeps a session open with the NameNode. If the NameNode fails, this session expires, and Zookeeper notifies the other Name Nodes to begin the failover procedure.
  2. Active Name Node election — Zookeeper offers a method for electing a node as the active node. Whenever the active Name Node fails, another Name Node acquires an exclusive lock in Zookeeper and declares that it wishes to become the new active Name Node.

ZKFC handles –

  1. Health Monitoring — ZKFC periodically pings the active NameNode with a health-check command; if the NameNode doesn’t respond in time, ZKFC marks it as unhealthy. This may happen because the NameNode has crashed or frozen.
  2. Zookeeper Session Management — As long as the local NameNode is healthy, ZKFC keeps a session open in Zookeeper. If the local NameNode is the active one, it also holds a special lock znode. If the session expires, the lock is deleted automatically.
  3. Zookeeper-based Election — If the local NameNode is healthy and ZKFC sees that no other node currently holds the lock znode, ZKFC will try to acquire the lock itself. If it succeeds, it has won the election and becomes responsible for running a failover. The failover works like a manual failover: the previously active node is fenced if required, and then the local node becomes the active node. The sketch below shows the idea behind this lock.
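Here is a minimal, illustrative Java sketch of that lock acquisition using the plain ZooKeeper client. The connection string, session timeout, and znode path are placeholders, and the real ZKFC adds watches, retries, and fencing on top of this idea:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ActiveLockSketch {
    public static void main(String[] args) throws Exception {
        // If this process dies, the session expires and ZooKeeper deletes
        // its ephemeral znodes, which is what triggers a new election.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 5000, event -> { });
        try {
            // Only one contender can create the lock znode; EPHEMERAL ties
            // the lock's lifetime to this session.
            zk.create("/active-nn-lock", "nn1".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            System.out.println("Won the election: this node becomes active");
        } catch (KeeperException.NodeExistsException e) {
            System.out.println("Another node is already active; standing by");
        } finally {
            zk.close();
        }
    }
}
```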

YARN

All resource allocation and management is done by YARN. YARN has many components: the Scheduler, the Node Manager, the Application Manager, the Application Master, and so on.

YARN Architecture

Resource Manager — It is the master daemon of YARN and is responsible for resource assignment and management across all applications. Whenever it receives a processing request, it forwards the request to the corresponding Node Manager and allocates the resources needed to complete it. It has two major components:

  1. Scheduler — It performs scheduling based on the submitted applications and the available resources. It is a pure scheduler, meaning it does not perform other tasks such as monitoring or tracking, and it does not guarantee a restart if a task fails. The YARN scheduler supports plug-ins such as the Capacity Scheduler and the Fair Scheduler to partition the cluster resources.
  2. Application Manager — It is responsible for accepting applications and negotiating the first container from the Resource Manager. It also restarts the Application Master container if a task fails.

Node Manager — It takes care of an individual node in the Hadoop cluster, managing the applications and workflows on that particular node. Its primary job is to stay in sync with the Resource Manager. It monitors resource usage, performs log management, and kills containers on directions from the Resource Manager. It is also responsible for creating container processes and starting them at the request of the Application Master.

Application Master — An application is a single job submitted to a framework. The Application Master is responsible for negotiating resources with the Resource Manager, and for tracking the status and monitoring the progress of a single application. Once the application has started, it sends health reports to the Resource Manager from time to time.

Container — It is a collection of physical resources, such as RAM, CPU cores, and disk, on a single node. Actual Hadoop jobs run inside containers.
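To make the client-to-Resource-Manager flow concrete, here is a minimal sketch using the public YarnClient API. It only asks the Resource Manager for a new application id; a real client would then fill in an ApplicationSubmissionContext (the AM launch command and container resources) and call submitApplication():

```java
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnSubmissionSketch {
    public static void main(String[] args) throws Exception {
        // Reads yarn-site.xml from the classpath to find the Resource Manager.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // The Resource Manager hands out the application id; later, its
        // Application Manager negotiates the first container, in which the
        // Application Master is launched.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationId appId = app.getNewApplicationResponse().getApplicationId();
        System.out.println("Got application id from the RM: " + appId);

        yarnClient.stop();
    }
}
```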

Resource Manager High Availability –

The image below shows YARN high availability, which is essentially built on Zookeeper. An active/standby Resource Manager pair helps overcome Resource Manager failure. Zookeeper records the state of the active Resource Manager, and if a failover occurs, the ActiveStandbyElector chooses the new active Resource Manager. The three core services that make up YARN high availability are explained in more detail below.

Source: Apache YARN

ActiveStandbyElector — Instead of a separate ZKFC daemon, the ActiveStandbyElector, which is embedded in the Resource Manager, serves as both failure detector and leader elector. The Resource Managers use this Zookeeper-based elector to decide which of them should be active. When the active fails or becomes unresponsive, another Resource Manager is automatically elected active and takes over.

Round Robin — Application Masters (AMs) and Node Managers (NMs) try the configured Resource Managers in round-robin order until they reach the active Resource Manager. If the active goes down, they resume round-robin polling until they reach the new active. This is the standard retry mechanism.

Resource Manager Restart — The Resource Manager being promoted to active loads the Resource Manager’s internal state and, with the help of the RM restart feature, carries on as much as possible from where the previous active left off. A new attempt is launched for each managed application that was previously submitted to the Resource Manager. Applications can checkpoint frequently to avoid losing work.
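As a rough sketch, the properties below are the ones behind Resource Manager HA and ZooKeeper-backed recovery in Hadoop 2.x. The rm ids, hostnames, and ZooKeeper quorum are placeholders; in practice these settings live in yarn-site.xml:

```java
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class RmHaConfSketch {
    public static void main(String[] args) {
        YarnConfiguration conf = new YarnConfiguration();
        // Two Resource Managers, one active and one standby.
        conf.setBoolean("yarn.resourcemanager.ha.enabled", true);
        conf.set("yarn.resourcemanager.ha.rm-ids", "rm1,rm2");
        conf.set("yarn.resourcemanager.hostname.rm1", "master1");
        conf.set("yarn.resourcemanager.hostname.rm2", "master2");
        // ZooKeeper holds the election lock and the persisted RM state,
        // so the newly promoted active can resume in-flight applications.
        conf.set("yarn.resourcemanager.zk-address", "zk1:2181,zk2:2181,zk3:2181");
        conf.setBoolean("yarn.resourcemanager.recovery.enabled", true);
        conf.set("yarn.resourcemanager.store.class",
                "org.apache.hadoop.yarn.server.resourcemanager.recovery."
                + "ZKRMStateStore");
    }
}
```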

Conclusion: We have seen that Hadoop 2.0 has a number of advantages over Hadoop 1.0. Compared to Hadoop 1.0, Hadoop 2.0 is more secure and highly available, and multiple types of data can be processed with a single Hadoop 2.0 framework. In the world of big data, Hadoop is extremely important.

— — — — — — — —

Here is the End!

I hope you like my article. I’m going to share my knowledge with you in order to make it easier for you to grasp Apache Hadoop. I’ll be publishing more articles like this soon. Please leave a comment below with any questions or requests for clarification on any of the topics covered in this article, and I will do my best to respond.

Happy studying!
