Hey,
It’s Sarvar Nadaf again, a senior developer at Luxoft. I have worked on several technologies, including Cloud Ops (Azure and AWS), Data Ops, Serverless Analytics, and DevOps, for various clients across the globe.
Let’s take a look at Hadoop.
What is Hadoop?
Doug Cutting and Mike Cafarella, the two creators and brains behind this revolutionary effort, introduced the Apache Software Foundation’s open-source project Hadoop in 2006. Hadoop was originally developed for the Nutch search engine project. Today, Hadoop is open-source Apache software used to process and store large amounts of data efficiently.
Hadoop uses the MapReduce framework to process massive amounts of data on commodity hardware. Hadoop is well known for its HDFS storage (Hadoop Distributed File System), which stores data in a distributed and highly available manner. Hadoop is written in the Java programming language.
Why Hadoop:
Until recently, managing and processing large amounts of data presented significant challenges. Data kept growing day by day, and handling it at that scale was difficult. Traditional ETL processing was the norm: after data was processed on an ETL system, it was loaded into databases. But it is impossible to process such large volumes of data on traditional systems, and it was not practical to store huge amounts of data directly in databases either, because doing so was time consuming and required expensive hardware.
Hadoop was invented to address this issue. It can process enormous amounts of data on commodity hardware and store the results in a distributed storage system called HDFS. Hadoop solves the big problem of processing large amounts of data on a cluster and storing the processed data on HDFS.
The Best Aspect of Hadoop:
Hadoop makes it simple to utilize all of the storage and processing power (CPU and memory) of the cluster’s servers and to run distributed processing over large volumes of data.
Hadoop Core Components:
Now let’s look at Hadoop’s core components one by one.
HDFS (Hadoop Distributed File System)
HDFS, Hadoop’s distributed file system, stores data in blocks. Compared to standard file systems, HDFS offers higher data processing throughput, superior fault tolerance, and native support for massive datasets. HDFS is designed to run on commodity hardware and is highly fault-tolerant.
HDFS uses a master-slave architecture. Data Nodes are the slave nodes, whereas the Name Node is the master. A Hadoop cluster has several Data Nodes but only one active Name Node. Let’s explore the responsibilities of the master and slave nodes in more depth.
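To make the client’s view of HDFS concrete, here is a minimal sketch (not production code) that writes and reads a file through Hadoop’s Java FileSystem API. The NameNode address and file path are placeholders you would replace with your cluster’s values; normally the address comes from core-site.xml.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; normally picked up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/hello.txt");

            // Write: the client asks the NameNode (master) for target DataNodes,
            // then streams the bytes to those DataNodes (slaves) directly.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("Hello, HDFS!".getBytes(StandardCharsets.UTF_8));
            }

            // Read: the NameNode returns block locations, and the client
            // reads the blocks from the DataNodes.
            try (FSDataInputStream in = fs.open(file)) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }
}
```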
Name Node:
The Name Node holds the metadata about the data: which blocks belong to which file, on which nodes they are stored, and where their replicas live. The Name Node is the brain of the Hadoop ecosystem. The two most crucial files on a Name Node are the edit logs and the FS Image; together they make up the true file system metadata. Many people believe Hadoop’s file system is just HDFS, but under the hood it is driven by the FS Image and edit logs. Let’s explore why these two files are so critical.
FS Image:
On the Linux server where the Name Node is configured, the FS Image is kept in the metadata directory (for example /dfs/nn/). The FS Image records the full directory structure of HDFS (in other words, the namespace) as well as which blocks make up which files and which nodes hold them.
The FS Image works closely with the edit logs, because the edit logs contain the most recent operations performed on the cluster. Periodically, at a checkpoint, the FS Image is merged with the latest edit logs to produce an up-to-date image.
Edit Logs:
As the name implies, the Edit Logs do exactly that: they keep a log of every change made to the Hadoop file system. Operations that modify the namespace, such as newly created files and blocks, deleted blocks, updated files, and changes to replication, are recorded in the edit log as they happen.
The edit log therefore always contains the most recent operations carried out on the cluster, each with a timestamp. Periodically these entries are merged into the FS Image (the “edit log merge”, or checkpoint, mentioned above), which brings the Name Node’s image up to date.
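If you are curious what these files look like on disk, here is a small sketch. The path below is only an example (the real location comes from dfs.namenode.name.dir in hdfs-site.xml); on a typical Name Node host the current/ subdirectory contains fsimage_* checkpoint files, edits_* segments, a seen_txid file, and a VERSION file, and plain Java is enough to list them:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ListNameNodeMetadata {
    public static void main(String[] args) throws IOException {
        // Example location only; the real path comes from dfs.namenode.name.dir.
        Path metadataDir = Paths.get("/dfs/nn/current");

        // Expect names like fsimage_0000000000000012345,
        // edits_0000000000000012346-0000000000000012400,
        // edits_inprogress_..., seen_txid and VERSION.
        try (var entries = Files.list(metadataDir)) {
            entries.map(Path::getFileName).forEach(System.out::println);
        }
    }
}
```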
Data Node:
In Hadoop HDFS, the Data Node is a slave node. This is the actual storage layer, where the data blocks are kept. Hadoop’s default block size is 128 MB per block (64 MB in older versions). What happens if a file is larger than the block size? Hadoop splits it: the first block holds the first 128 MB and the remaining data goes into additional blocks. If a file is smaller than the block size, it occupies only as much disk space as it actually needs; the block is not padded to the full size.
Data Nodes store the data and perform numerous operations on the HDFS storage layer. To keep the master informed, every Data Node sends a heartbeat to the Name Node every three seconds to declare that it is alive, and it periodically sends block reports describing the blocks it holds. Data on the Data Nodes is highly available because it is replicated to other nodes; the default replication factor on HDFS is 3, so when data is received from the source, three replicas of each block are stored across the cluster.
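To see blocks and replicas for yourself, here is a small sketch (the NameNode address and file path are placeholders) that asks the Name Node for a file’s replication factor, block size, and the Data Nodes that hold each block, using the standard FileSystem API:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020"); // placeholder

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/big-file.dat"); // placeholder

            FileStatus status = fs.getFileStatus(file);
            System.out.println("Replication factor: " + status.getReplication());
            System.out.println("Block size (bytes): " + status.getBlockSize());

            // Ask the NameNode which DataNodes hold each block of the file.
            BlockLocation[] blocks =
                    fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                        block.getOffset(), block.getLength(),
                        String.join(",", block.getHosts()));
            }
        }
    }
}
```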
MapReduce:
The MapReduce framework processes Big Data in parallel on different nodes. By splitting petabytes of data into smaller chunks and processing them in parallel on commodity servers, MapReduce makes concurrent processing straightforward. In the end, it collects the results from the various servers and gives the application a consolidated output. Large amounts of complex data can be analyzed with MapReduce.
The MapReduce framework processes big data on a Hadoop cluster in the stages listed below (a minimal word-count sketch follows the list):
- Split
- Map
- Shuffle
- Sort
- Reduce
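Here is the classic word-count example in Java to make the stages concrete (input and output paths are placeholders): the framework splits the input, the Mapper emits (word, 1) pairs, shuffle and sort group the pairs by word, and the Reducer sums the counts.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map stage: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce stage: after shuffle and sort, sum the counts for each word.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));    // placeholder
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output")); // placeholder
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```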
YARN (Yet Another Resource Negotiator)
YARN is a key feature of the Hadoop framework. As the name implies, YARN is the framework used to manage the resources of a Hadoop cluster. On the Hadoop cluster, YARN is responsible for resource allocation and job scheduling.
The two primary daemons involved in job execution are the Job Tracker and the Task Tracker (these belong to classic Hadoop 1.x MapReduce; in YARN their roles are taken over by the ResourceManager and the NodeManagers). A small YARN client sketch follows the list:
- Job Tracker: The Job Tracker process is required for the cluster to execute MapReduce jobs. It usually does not run on a Data Node but on a separate node. The Job Tracker is a crucial daemon for MapReduce execution: clients send MapReduce execution requests to the Job Tracker, which communicates with the Name Node to locate the data. The Job Tracker monitors the individual Task Trackers and reports the job’s overall status back to the client.
- Task Tracker: The Task Tracker runs on the Data Nodes; a Task Tracker is present on almost every Data Node. It tracks the execution of the tasks assigned to it and reports progress to the Job Tracker in real time. The Data Nodes managed by Task Trackers serve as the execution platform for Mapper and Reducer tasks, which the Job Tracker delegates to the Task Trackers. The failure of a Task Tracker is not fatal: if a Task Tracker stops responding, the Job Tracker reassigns its tasks to another node.
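As a small illustration of YARN’s role as resource manager, here is a sketch that uses the YarnClient API to ask the ResourceManager for the applications it knows about. It assumes yarn-site.xml for your cluster is on the classpath; otherwise the ResourceManager address would have to be set explicitly.

```java
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApplications {
    public static void main(String[] args) throws Exception {
        // YarnConfiguration picks up yarn-site.xml from the classpath.
        Configuration conf = new YarnConfiguration();

        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();
        try {
            // Ask the ResourceManager for all known applications.
            List<ApplicationReport> apps = yarnClient.getApplications();
            for (ApplicationReport app : apps) {
                System.out.printf("%s  %s  state=%s  queue=%s%n",
                        app.getApplicationId(), app.getName(),
                        app.getYarnApplicationState(), app.getQueue());
            }
        } finally {
            yarnClient.stop();
        }
    }
}
```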
ZooKeeper:
ZooKeeper provides coordination functions for a Hadoop cluster. For distributed systems, ZooKeeper manages a naming registry, a synchronization service, high availability and failover control, and a distributed configuration service. Distributed systems use ZooKeeper to store, and coordinate updates to, crucial configuration data.
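As a small sketch of the “distributed configuration service” idea, the ZooKeeper Java client can store a piece of configuration under a znode and read it back; the connection string and the znode path below are placeholders for illustration only.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);

        // Placeholder quorum address; a real ensemble usually has 3 or 5 servers.
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 30_000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        String path = "/demo-config"; // placeholder znode
        byte[] value = "feature-flag=enabled".getBytes(StandardCharsets.UTF_8);

        // Create the znode if it does not exist yet.
        if (zk.exists(path, false) == null) {
            zk.create(path, value, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Any client connected to the ensemble can now read the same value.
        byte[] stored = zk.getData(path, false, null);
        System.out.println(new String(stored, StandardCharsets.UTF_8));

        zk.close();
    }
}
```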
Here is the End!
I hope you liked this article. I’m sharing my knowledge with you to make Apache Hadoop easier to grasp, and I’ll be publishing more articles like this soon.
Happy studying!