Hadoop Ecosystem Overview: A Comprehensive Guide
Introduction to Hadoop
Apache Hadoop is an open-source framework for storing and processing massive amounts of data in a distributed, scalable manner. It lets organizations store, manage, and analyze big data across clusters of commodity computers.
Why Hadoop?
- Scalability – Can handle petabytes of data across multiple nodes.
- Fault Tolerance – Automatically replicates data to prevent data loss.
- Cost-Effective – Uses commodity hardware, reducing infrastructure costs.
- Flexibility – Supports structured, semi-structured, and unstructured data.
- High Throughput – Uses parallel processing to speed up computations.
Core Components of Hadoop
The Hadoop ecosystem consists of multiple tools and frameworks that extend its functionality. The four major components of Hadoop are:
- HDFS (Hadoop Distributed File System)
- YARN (Yet Another Resource Negotiator)
- MapReduce (Processing Framework)
- Common (Hadoop Libraries & Utilities)
We will explore each component and its supporting tools in detail.
1. Hadoop Distributed File System (HDFS)
HDFS is the storage layer of Hadoop that stores data across multiple nodes in a fault-tolerant and distributed manner.
Key Features of HDFS
- Distributed Storage – Stores large files across multiple machines.
- Replication – The default replication factor is 3 (each block is stored on 3 different nodes).
- High Fault Tolerance – If a node fails, data can be retrieved from replicated copies.
- Write Once, Read Many – Files are written once and read many times, which favors high-throughput reads.
HDFS Architecture
Component | Description |
---|---|
NameNode | Master node that stores metadata and manages file system namespace. |
DataNode | Stores actual data blocks and sends health reports to the NameNode. |
Secondary NameNode | Periodically merges the NameNode's edit log into the fsimage (checkpointing) to keep metadata recovery fast; it is not a standby NameNode. |
How HDFS Works
- Data is split into blocks (default block size: 128 MB, often raised to 256 MB).
- Blocks are stored across multiple DataNodes.
- Replication ensures fault tolerance.
- Clients get block locations from the NameNode, then read and write data directly to the DataNodes.
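For example, a 1 GB file with a 128 MB block size is split into 8 blocks; with the default replication factor of 3, HDFS keeps 24 block replicas and therefore uses roughly 3 GB of raw cluster storage.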
Basic HDFS Commands
# List files in HDFS
hdfs dfs -ls /
# Upload a file to HDFS
hdfs dfs -put localfile.txt /hdfs_path/
# Download a file from HDFS
hdfs dfs -get /hdfs_path/file.txt localfile.txt
# Remove a file from HDFS
hdfs dfs -rm /hdfs_path/file.txt
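Applications can also talk to HDFS programmatically. The sketch below is a minimal example using Hadoop's org.apache.hadoop.fs.FileSystem API; it assumes the Hadoop client libraries are on the classpath, that fs.defaultFS in core-site.xml points at your cluster, and the path used here is just a placeholder.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/hdfs_path/example.txt"); // placeholder path

        // Write a small file (overwrite if it already exists)
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back line by line
        try (FSDataInputStream in = fs.open(path);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}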
2. YARN (Yet Another Resource Negotiator)
YARN is Hadoop’s resource management layer that allows multiple applications to run simultaneously on a Hadoop cluster.
Key Features of YARN
- Resource Allocation – Manages CPU and memory for different tasks.
- Multi-Tenancy – Supports multiple applications like Spark, Hive, and MapReduce.
- Scalability – Efficiently scales to handle thousands of nodes.
YARN Architecture
Component | Description |
---|---|
ResourceManager (RM) | Allocates cluster resources to applications. |
NodeManager (NM) | Monitors resource usage on each node. |
ApplicationMaster (AM) | Manages the lifecycle of an application. |
How YARN Works
- A client submits a job to the ResourceManager.
- The ResourceManager launches an ApplicationMaster for the job in a container.
- The ApplicationMaster negotiates additional containers from the ResourceManager and coordinates execution across NodeManagers.
- NodeManagers launch the task containers and report progress back to the ApplicationMaster.
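Once applications are running, the yarn command-line client is a quick way to see what the ResourceManager and NodeManagers are doing; the application ID below is only a placeholder.
# List running applications and their state
yarn application -list
# Show cluster nodes and their resource usage
yarn node -list
# Fetch aggregated logs for a finished application
yarn logs -applicationId application_1700000000000_0001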
3. MapReduce (Processing Framework)
MapReduce is Hadoop’s data processing engine, used for batch processing of large datasets.
Key Features of MapReduce
- Parallel Processing – Splits jobs into smaller tasks that run simultaneously.
- Fault Tolerance – Retries failed tasks automatically.
- Scalability – Can run on clusters with thousands of machines.
MapReduce Workflow
- Map Phase: Processes input data and generates key-value pairs.
- Shuffle & Sort: Groups similar keys together.
- Reduce Phase: Aggregates values and produces the final output.
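For example, for the input line "to be or not to be", the map phase emits (to,1), (be,1), (or,1), (not,1), (to,1), (be,1); shuffle and sort groups these into (be,[1,1]), (not,[1]), (or,[1]), (to,[1,1]); and the reduce phase outputs (be,2), (not,1), (or,1), (to,2).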
Example: Word Count Program in Hadoop MapReduce
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
import java.util.StringTokenizer;

public class WordCount {
    // Mapper: emits (word, 1) for every token in the input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums all counts emitted for the same word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Driver: configures the job; input and output HDFS paths come from the command line.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
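One way to build and submit the job from the command line is shown below; the jar name and the HDFS input/output paths are placeholders, and the output directory must not already exist.
# Compile against the Hadoop client libraries and package the classes
javac -classpath "$(hadoop classpath)" WordCount.java
jar cf wordcount.jar WordCount*.class
# Submit the job to the cluster
hadoop jar wordcount.jar WordCount /input /output
# Inspect the result
hdfs dfs -cat /output/part-r-00000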
4. Hadoop Ecosystem Tools
Apart from the core components, Hadoop includes several tools for data ingestion, processing, querying, and machine learning.
4.1 Data Ingestion Tools
Tool | Description |
---|---|
Sqoop | Transfers data between Hadoop and relational databases (MySQL, PostgreSQL, etc.). |
Flume | Collects and ingests log data from multiple sources. |
Kafka | Real-time data streaming for big data processing. |
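As a rough illustration of data ingestion, a typical Sqoop import from a relational database into HDFS looks like the command below; the JDBC URL, credentials, table name, and target directory are all placeholders.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username dbuser -P \
  --table orders \
  --target-dir /data/orders \
  --num-mappers 4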
4.2 Data Processing & Querying
Tool | Description |
---|---|
Hive | SQL-based data warehousing tool. |
Pig | High-level scripting language for processing big data. |
HBase | NoSQL database for real-time data access. |
Spark | Fast, in-memory data processing framework. |
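For example, Hive can expose files already stored in HDFS as a table and query them with SQL; the table, columns, and HDFS location below are placeholders.
-- Map a CSV directory in HDFS to an external Hive table
CREATE EXTERNAL TABLE orders (
  order_id INT,
  customer_id INT,
  amount DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/orders';
-- Aggregate with plain SQL; Hive compiles the query into distributed jobs
SELECT customer_id, SUM(amount) AS total_spent
FROM orders
GROUP BY customer_id;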
4.3 Machine Learning & AI
Tool | Description |
---|---|
Mahout | Machine learning library for clustering and classification. |
MLlib | Spark-based machine learning library. |
4.4 Workflow & Job Scheduling
Tool | Description |
---|---|
Oozie | Workflow scheduler for managing Hadoop jobs. |
Zookeeper | Coordination service for distributed applications. |
5. Hadoop Ecosystem Use Cases
- Data Warehousing – Companies use Hive for big data querying.
- Log Analysis – Flume + Spark for analyzing real-time log data.
- Fraud Detection – Machine learning with Hadoop for credit card fraud.
- Healthcare – Processing genomic data for medical research.
- Recommendation Systems – Companies such as Amazon and Netflix have used Hadoop for personalized recommendations.
6. Setting Up a Hadoop Cluster
Single-Node Hadoop Installation
# Install Java
sudo apt update && sudo apt install openjdk-8-jdk
# Download Hadoop
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
tar -xvzf hadoop-3.3.1.tar.gz
export HADOOP_HOME=~/hadoop-3.3.1
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
# Point Hadoop at your Java installation (path may differ on your system)
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
# Format the NameNode, then start HDFS and YARN
# (assumes core-site.xml and hdfs-site.xml are configured for pseudo-distributed mode
#  and passwordless SSH to localhost is set up)
hdfs namenode -format
start-dfs.sh
start-yarn.sh
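If everything started cleanly, the daemons and web UIs can be checked as follows; the ports shown are the Hadoop 3.x defaults.
# Running Java daemons should include NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager
jps
# NameNode web UI: http://localhost:9870
# ResourceManager web UI: http://localhost:8088
# Create a home directory and run a first command
hdfs dfs -mkdir -p /user/$USER
hdfs dfs -ls /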