Hadoop Ecosystem Overview: A Comprehensive Guide
Introduction to Hadoop
Apache Hadoop is an open-source framework for storing and processing massive amounts of data in a distributed, scalable manner. It lets organizations store, manage, and analyze big data across clusters of commodity computers.
Why Hadoop?
- Scalability – Can handle petabytes of data across multiple nodes.
- Fault Tolerance – Automatically replicates data to prevent data loss.
- Cost-Effective – Uses commodity hardware, reducing infrastructure costs.
- Flexibility – Supports structured, semi-structured, and unstructured data.
- High Throughput – Uses parallel processing to speed up computations.
Core Components of Hadoop
The Hadoop ecosystem consists of multiple tools and frameworks that extend its functionality. The four major components of Hadoop are:
- HDFS (Hadoop Distributed File System)
- YARN (Yet Another Resource Negotiator)
- MapReduce (Processing Framework)
- Common (Hadoop Libraries & Utilities)
We will explore each component and its supporting tools in detail.
1. Hadoop Distributed File System (HDFS)
HDFS is the storage layer of Hadoop that stores data across multiple nodes in a fault-tolerant and distributed manner.
Key Features of HDFS
- Distributed Storage – Stores large files across multiple machines.
- Replication – The default replication factor is 3 (each block is stored on 3 different nodes).
- High Fault Tolerance – If a node fails, data can be retrieved from replicated copies.
- Write Once, Read Many – Files are written once and read many times, which favors high-throughput reads.
HDFS Architecture
Component | Description |
---|---|
NameNode | Master node that stores metadata and manages file system namespace. |
DataNode | Stores actual data blocks and sends health reports to the NameNode. |
Secondary NameNode | Periodically merges the NameNode's edit log into the fsimage (checkpointing) to keep metadata recovery fast; it is not a standby NameNode. |
How HDFS Works
- Data is split into blocks (default block size: 128 MB, often raised to 256 MB).
- Blocks are stored across multiple DataNodes.
- Replication ensures fault tolerance.
- Clients get block locations from the NameNode, then read and write data directly to the DataNodes.
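For example, a 1 GB file with a 128 MB block size is split into 8 blocks; with the default replication factor of 3, HDFS keeps 24 block replicas and therefore uses roughly 3 GB of raw cluster storage.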
Basic HDFS Commands
# List files in HDFS
hdfs dfs -ls /
# Upload a file to HDFS
hdfs dfs -put localfile.txt /hdfs_path/
# Download a file from HDFS
hdfs dfs -get /hdfs_path/file.txt localfile.txt
# Remove a file from HDFS
hdfs dfs -rm /hdfs_path/file.txt
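Applications can also talk to HDFS programmatically. The sketch below is a minimal example using Hadoop's org.apache.hadoop.fs.FileSystem API; it assumes the Hadoop client libraries are on the classpath, that fs.defaultFS in core-site.xml points at your cluster, and the path used here is just a placeholder.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/hdfs_path/example.txt"); // placeholder path

        // Write a small file (overwrite if it already exists)
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back line by line
        try (FSDataInputStream in = fs.open(path);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}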
2. YARN (Yet Another Resource Negotiator)
YARN is Hadoop’s resource management layer that allows multiple applications to run simultaneously on a Hadoop cluster.
Key Features of YARN
- Resource Allocation – Manages CPU and memory for different tasks.
- Multi-Tenancy – Supports multiple applications like Spark, Hive, and MapReduce.
- Scalability – Efficiently scales to handle thousands of nodes.
YARN Architecture
Component | Description |
---|---|
ResourceManager (RM) | Allocates cluster resources to applications. |
NodeManager (NM) | Monitors resource usage on each node. |
ApplicationMaster (AM) | Manages the lifecycle of an application. |
How YARN Works
- A client submits a job to the ResourceManager.
- The ResourceManager launches an ApplicationMaster for the job in a container.
- The ApplicationMaster negotiates additional containers from the ResourceManager and coordinates execution across NodeManagers.
- NodeManagers launch the task containers and report progress back to the ApplicationMaster.
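Once applications are running, the yarn command-line client is a quick way to see what the ResourceManager and NodeManagers are doing; the application ID below is only a placeholder.
# List running applications and their state
yarn application -list
# Show cluster nodes and their resource usage
yarn node -list
# Fetch aggregated logs for a finished application
yarn logs -applicationId application_1700000000000_0001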
3. MapReduce (Processing Framework)
MapReduce is Hadoop’s data processing engine, used for batch processing of large datasets.
Key Features of MapReduce
- Parallel Processing – Splits jobs into smaller tasks that run simultaneously.
- Fault Tolerance – Retries failed tasks automatically.
- Scalability – Can run on clusters with thousands of machines.
MapReduce Workflow
- Map Phase: Processes input data and generates key-value pairs.
- Shuffle & Sort: Groups similar keys together.
- Reduce Phase: Aggregates values and produces the final output.
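For example, for the input line "to be or not to be", the map phase emits (to,1), (be,1), (or,1), (not,1), (to,1), (be,1); shuffle and sort groups these into (be,[1,1]), (not,[1]), (or,[1]), (to,[1,1]); and the reduce phase outputs (be,2), (not,1), (or,1), (to,2).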
Example: Word Count Program in Hadoop MapReduce
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
import java.util.StringTokenizer;

public class WordCount {
    // Mapper: emits (word, 1) for every token in the input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums all counts emitted for the same word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Driver: configures the job; input and output HDFS paths come from the command line.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
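One way to build and submit the job from the command line is shown below; the jar name and the HDFS input/output paths are placeholders, and the output directory must not already exist.
# Compile against the Hadoop client libraries and package the classes
javac -classpath "$(hadoop classpath)" WordCount.java
jar cf wordcount.jar WordCount*.class
# Submit the job to the cluster
hadoop jar wordcount.jar WordCount /input /output
# Inspect the result
hdfs dfs -cat /output/part-r-00000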
4. Hadoop Ecosystem Tools
Apart from the core components, Hadoop includes several tools for data ingestion, processing, querying, and machine learning.
4.1 Data Ingestion Tools
Tool | Description |
---|---|
Sqoop | Transfers data between Hadoop and relational databases (MySQL, PostgreSQL, etc.). |
Flume | Collects and ingests log data from multiple sources. |
Kafka | Real-time data streaming for big data processing. |
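As a rough illustration of data ingestion, a typical Sqoop import from a relational database into HDFS looks like the command below; the JDBC URL, credentials, table name, and target directory are all placeholders.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username dbuser -P \
  --table orders \
  --target-dir /data/orders \
  --num-mappers 4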
4.2 Data Processing & Querying
Tool | Description |
---|---|
Hive | SQL-based data warehousing tool. |
Pig | High-level scripting language for processing big data. |
HBase | NoSQL database for real-time data access. |
Spark | Fast, in-memory data processing framework. |
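For example, Hive can expose files already stored in HDFS as a table and query them with SQL; the table, columns, and HDFS location below are placeholders.
-- Map a CSV directory in HDFS to an external Hive table
CREATE EXTERNAL TABLE orders (
  order_id INT,
  customer_id INT,
  amount DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/orders';
-- Aggregate with plain SQL; Hive compiles the query into distributed jobs
SELECT customer_id, SUM(amount) AS total_spent
FROM orders
GROUP BY customer_id;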
4.3 Machine Learning & AI
Tool | Description |
---|---|
Mahout | Machine learning library for clustering and classification. |
MLlib | Spark-based machine learning library. |
4.4 Workflow & Job Scheduling
Tool | Description |
---|---|
Oozie | Workflow scheduler for managing Hadoop jobs. |
Zookeeper | Coordination service for distributed applications. |
5. Hadoop Ecosystem Use Cases
- Data Warehousing – Companies use Hive for big data querying.
- Log Analysis – Flume + Spark for analyzing real-time log data.
- Fraud Detection – Machine learning with Hadoop for credit card fraud.
- Healthcare – Processing genomic data for medical research.
- Recommendation Systems – Companies such as Amazon and Netflix have used Hadoop for personalized recommendations.
6. Setting Up a Hadoop Cluster
Single-Node Hadoop Installation
# Install Java
sudo apt update && sudo apt install openjdk-8-jdk
# Download Hadoop
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
tar -xvzf hadoop-3.3.1.tar.gz
export HADOOP_HOME=~/hadoop-3.3.1
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
# Point Hadoop at your Java installation (path may differ on your system)
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
# Format the NameNode, then start HDFS and YARN
# (assumes core-site.xml and hdfs-site.xml are configured for pseudo-distributed mode
#  and passwordless SSH to localhost is set up)
hdfs namenode -format
start-dfs.sh
start-yarn.sh
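If everything started cleanly, the daemons and web UIs can be checked as follows; the ports shown are the Hadoop 3.x defaults.
# Running Java daemons should include NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager
jps
# NameNode web UI: http://localhost:9870
# ResourceManager web UI: http://localhost:8088
# Create a home directory and run a first command
hdfs dfs -mkdir -p /user/$USER
hdfs dfs -ls /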