Distributed Systems & Hadoop
Lecture Notes
## Introduction to Distributed Systems

A distributed system is a computing environment in which components are spread across multiple computers (or nodes) on a network. These nodes split up the work and coordinate with one another, so the job can finish more efficiently than if a single machine had handled it alone. For how we orchestrate these jobs, see [Data Pipelines with Airflow](/courses/data-engineering-mastery/1).

## The Hadoop Ecosystem

Apache Hadoop is an open-source framework for the distributed processing of large data sets across clusters of computers using simple programming models.

### HDFS (Hadoop Distributed File System)

HDFS is designed to store very large files across the machines of a large cluster. It splits each file into blocks and stores them reliably even in the presence of hardware failure.

- **NameNode**: The master node that manages the file system namespace and regulates access to files by clients.
- **DataNodes**: The worker nodes that store and retrieve blocks when they are told to (by clients or the NameNode).

### MapReduce Algorithm

MapReduce is a processing technique and a programming model for distributed computing.

1. **Map Phase**: Takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
2. **Reduce Phase**: Takes the output of the map phase, grouped by key, as its input and combines those tuples into a smaller set of tuples.

In word count, for example, the mapper emits `(word, 1)` for every token it reads, and the reducer sums the counts for each word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// A simple Word Count Mapper in Java
public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    // Reused across calls to avoid allocating new objects for every token
    private static final IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    // Called once per input record; emits (token, 1) for each whitespace-separated token
    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}
```

## Fault Tolerance

A defining characteristic of Hadoop is its fault tolerance.
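By default, HDFS stores three replicas of every block on different DataNodes (and, where the cluster topology allows, on more than one server rack). If a DataNode goes down, the NameNode stops receiving its heartbeats, marks it as dead, and directs client reads to a DataNode that holds another replica. Reads therefore continue without interruption as long as at least one replica remains reachable, and the NameNode schedules new copies of the affected blocks to restore the replication factor.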
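
The read failover described above can be illustrated with a toy, in-memory model. This is only a sketch under simplifying assumptions: it does not use the Hadoop API, it ignores racks and heartbeats, and every name in it (`ReplicaFailoverSketch`, `storeBlock`, `markDead`, `locateForRead`, `liveReplicaCount`) is hypothetical and chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Toy model of block replication and read failover.
// Hypothetical names throughout; this is NOT the Hadoop API.
class ReplicaFailoverSketch {
    // Mirrors the HDFS default of three replicas per block
    static final int REPLICATION_FACTOR = 3;

    // Block ID -> DataNodes that were given a replica of that block
    private final Map<String, List<String>> blockLocations = new HashMap<>();
    // DataNodes the simulated NameNode currently considers alive
    private final Set<String> liveNodes = new HashSet<>();

    void addNode(String node) {
        liveNodes.add(node);
    }

    // Place the block on the first REPLICATION_FACTOR distinct live candidates.
    // Real HDFS placement is rack-aware; this sketch ignores racks.
    void storeBlock(String blockId, List<String> candidates) {
        List<String> replicas = new ArrayList<>();
        for (String node : candidates) {
            if (replicas.size() == REPLICATION_FACTOR) {
                break;
            }
            if (liveNodes.contains(node) && !replicas.contains(node)) {
                replicas.add(node);
            }
        }
        blockLocations.put(blockId, replicas);
    }

    // Simulates the NameNode deciding that a DataNode has failed
    void markDead(String node) {
        liveNodes.remove(node);
    }

    // Returns the first live DataNode holding the block, or empty if none is left
    Optional<String> locateForRead(String blockId) {
        for (String node : blockLocations.getOrDefault(blockId, List.of())) {
            if (liveNodes.contains(node)) {
                return Optional.of(node);
            }
        }
        return Optional.empty();
    }

    // Number of replicas on live nodes; a value below REPLICATION_FACTOR
    // means the block is under-replicated
    long liveReplicaCount(String blockId) {
        return blockLocations.getOrDefault(blockId, List.of()).stream()
                .filter(liveNodes::contains)
                .count();
    }
}
```

In this model, a block stored on `dn1`, `dn2`, and `dn3` is read from `dn1` at first. After `markDead("dn1")`, `locateForRead` returns `dn2`, and `liveReplicaCount` drops to 2, which is the signal a real cluster would use to schedule another copy. Only when all three holders are dead does the read fail.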