MapReduce: Simplified Data Processing on Large Clusters

http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf

Motivation:

Computations within Google usually have very large inputs, forcing the programmer to distribute the work across lots of machines. The problem is: how do we build a distributed computation framework that can process large data sets in parallel and handle fault tolerance?

Overview:

MapReduce is a framework that hides the details of parallelizing your workflow, fault tolerance, data distribution, and load balancing behind two abstractions: map and reduce. The user of MapReduce is responsible for writing the map and reduce functions, while the MapReduce library is responsible for executing that program in a distributed environment.

Programming Model:

The computation to be performed is expressed through two functions: map and reduce. The map function takes an input key/value pair and produces a set of intermediate key/value pairs, while the reduce function accepts an intermediate key and the set of values for that key and merges those values together to form a possibly smaller set of values.
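
For concreteness, here is the canonical word-count example from the paper as a minimal Python sketch. The emit-style signatures (map_fn, reduce_fn, emit) are illustrative conventions for these notes, not the paper's actual C++ API:

```python
def map_fn(key, value, emit):
    """key: document name; value: document contents."""
    for word in value.split():
        emit(word, 1)               # intermediate pair: (word, 1)

def reduce_fn(key, values, emit):
    """key: a word; values: every count emitted for that word."""
    emit(key, sum(values))          # total occurrences of the word
```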

Execution:

Imagine a client invokes MapReduce and supplies the map and reduce functions. The client connects to one of the machines in the cluster, and that machine takes on the special role of the so-called master process.

Any machine can play any role, and the master decides how to split up the job's input. It assumes the input documents are already out there on the machines. (At Google, the Google File System spreads documents across machines and keeps at least three copies of each, placed far away from one another to protect against power failures, etc.) For each document, the master first tries to schedule the map task on a machine where that document is already located, because moving computation is cheaper than moving data. This is not always possible, since the documents may be placed in a skewed fashion: in the pessimistic scenario where all our documents sit on machine A, it makes no sense to respect locality; it makes sense to move documents elsewhere so we get more parallelism. In general, the master schedules tasks as close to the data as it can while also maximizing use of the parallel resources. It is, in some sense, a multi-level optimization problem the master has to solve.[1]

The map tasks have the nice property that they are embarrassingly parallel, and their intermediate results are written out to the mappers' local disks. The master keeps track of the progress of these map tasks by asking the workers to report their progress periodically. After the map phase is finished, the master assigns reduce tasks, and each reducer pulls its partition of the intermediate data from the mappers[2] and processes it.

Fault tolerance:

"What could go wrong?" is the question we should always ask when programming distributed systems, because anything that can go wrong, will.

1. Master failure: The Google authors say it probably won't happen, since the master is a single machine. (So let's just pretend, for a minute, that the master doesn't fail.)

2. Mapper failure: The master asks the workers to send heartbeat messages; if a worker hasn't phoned home for a sufficiently long time, the master can decide that worker is never coming back and reschedule its tasks on other workers. But what if the worker we declared dead does come back — is that a problem? No, because of the way the map functions are structured: they are pure functions, so the two copies of the task compute the same result. We can simply forget that we asked two workers to do the same work; whichever finishes last overwrites the output with the same thing the other one wrote.

3. Reducer failure: The same idea applies — a rescheduled reducer can just go back and re-read the data from the mappers' disks. (Once the mappers' results are written to disk, they act as a sort of checkpoint barrier.)

4. Stragglers (e.g., a bad disk or a bad NIC): Stragglers are machines that take an unusually long time to complete their tasks. When a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks. A task is marked as completed whenever either the primary or the backup execution completes.

[1] See the Delay Scheduling paper for more details.

[2] Which keys go to which reducer is decided via a hash function mod R, where R is the number of reduce tasks.
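
As a purely illustrative version of footnote [2] — the paper just says "a hash function"; MD5 is an arbitrary choice here, picked because it is stable across machines and re-executions (unlike Python's salted built-in hash):

```python
import hashlib

def partition(key: str, R: int) -> int:
    """Assign an intermediate key to one of R reduce tasks."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % R
```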

Note:

1. Users only need to write a Map function and a Reduce function. They don't need to understand parallelism or distributed programming.

2. No reduce task starts before all map tasks are finished: if even one map task were still running, it might generate key/value pairs for a key whose reduce task had already run, and those pairs would be missed.

3. Map tasks are independent of one another, as are reduce tasks (no communication is required between map tasks or between reduce tasks).

4. The input data of the map function comes from the distributed file system (e.g., GFS or HDFS), and the output of the map function is written to the local disk of the node. The reducers read that data from the mappers' disks and write their output back to the distributed file system. (A toy sketch of this whole data flow follows below.)
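
To make notes 2–4 concrete, here is a toy single-process execution loop, assuming the same illustrative emit-style signatures as the word-count sketch above. A list of in-memory dicts stands in for the R partitions that would really live on the mappers' local disks:

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn, R=3):
    # Map phase: one task per input split; output is partitioned R ways.
    # (Within a single process, Python's hash() is consistent enough.)
    partitions = [defaultdict(list) for _ in range(R)]
    for key, value in inputs:
        map_fn(key, value, lambda k, v: partitions[hash(k) % R][k].append(v))
    # Barrier: no reduce runs until every map task has finished (note 2).
    results = {}
    for part in partitions:                     # reducer i reads partition i
        for k in sorted(part):
            reduce_fn(k, part[k], lambda k2, v2: results.update({k2: v2}))
    return results

# With the word-count functions above:
# run_mapreduce([("doc1", "a b a"), ("doc2", "b c")], map_fn, reduce_fn)
# returns {'a': 2, 'b': 2, 'c': 1}
```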

Challenges:

There are six challenges when designing large-scale systems like MapReduce. This paper proposes somewhat "naive" solutions to them, and most of the papers in this section propose more advanced ones. The challenges are:

  • Parallelism

  • Network

  • Stragglers

  • Scheduling

  • Fault-tolerance

  • Locality

Example1 - Distributed Grep:

How to quickly grep for some term in documents spread across many machines?

The search term is hard-coded into the map function. So the map function just says: for every line in the document, emit (print) the line if it contains the term; otherwise emit nothing. The reduce function can be the identity (or omitted entirely).
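
A sketch under the same conventions as above (the term "error" is an arbitrary placeholder baked into the map function):

```python
import re

SEARCH_TERM = re.compile(r"error")     # hard-coded search term

def map_fn(filename, contents, emit):
    """Emit every line that matches the hard-coded term."""
    for lineno, line in enumerate(contents.splitlines(), 1):
        if SEARCH_TERM.search(line):
            emit(filename, (lineno, line))

def reduce_fn(key, values, emit):
    """Identity reduce: pass the matched lines straight through."""
    for v in values:
        emit(key, v)
```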

Example2 - Inverted index:

How to produce a report that, instead of finding a word and how many times it occurred, finds the documents a word occurred in, i.e., a list of URLs? (Later we can use this to build a search engine.)

To build a search engine, we could create a term vector, for example: Apple occurred in urls [2, 3, 4]; Banana occurred in urls [4, 9, 10]. If someone searches for the phrase "Apple Banana", we can look up both terms and take the intersection of their matching documents. In the earlier days of the internet, this is exactly how search worked. Without an index, the documents are spread everywhere but not searchable; we would have to send queries to all the servers. To make them searchable, we transform them into a so-called inverted index, which we can compute using MapReduce and then use to index back into the documents.
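
A minimal sketch, again with the illustrative signatures used throughout these notes: the map emits (word, url) pairs, and since all values for one word reach the same reducer, the reducer's output is exactly that word's posting list:

```python
def map_fn(url, contents, emit):
    """Emit (word, url) once per distinct word in the document."""
    for word in set(contents.split()):
        emit(word, url)

def reduce_fn(word, urls, emit):
    """All URLs containing `word` arrive here; emit the posting list."""
    emit(word, sorted(set(urls)))
```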

Example3 - Terasort:

How to quickly sort 1 TB of data? (You can view the data as an array of strings.)

Naive solution: in the map phase, every mapper gets some subset of the data and sorts it. Then, in the reduce phase, a single reducer gets all the data and performs a "merge sort".

However, the reduce phase becomes the bottleneck, since a single reducer receives all the data.

TeraSort's solution: in the map phase, the input of every task is divided into R partitions, where R is the number of reducers. In particular, every key in R_i is smaller than every key in R_{i+1} for all i > 0 (for example, you can partition by the first character of the string). In the reduce phase, the i-th reduce task is responsible for sorting all the R_i's, which means the output keys of reduce i are all less than the output keys of reduce i+1.

TeraSort consists of three steps: sampling, tagging (map), and sorting (reduce). At a very high level, sampling is performed in the JobClient: we sort a subset of the input data (e.g., 100,000 keys) and divide it into R partitions. Then we find the upper and lower bound of every partition (called the reduce boundaries) and store them. For example, if the sampled data is [b, abc, abd, bcd, abcd, efg, hii, afd, rrr, mnk], then after sorting it the resulting data is [abc, abcd, abd, afd, b, bcd, efg, hii, mnk, rrr]. If you have 4 reducers, the reduce boundaries will be abd, bcd, and mnk.

In the map phase, the inputs of the map tasks are the data and the reduce boundaries. The mapper first uses the reduce boundaries to build a two-level trie (whose leaf nodes are reducer numbers) that quickly indexes into the list of sample keys based on the first two bytes of the key. Then, for every key, the mapper finds the corresponding partition and tags the key with the partition number. For example, abg will be tagged with 2 and mnz will be tagged with 4.
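
The real implementation builds the two-level trie described above; as a simpler illustration, a binary search over the reduce boundaries computes the same tag (using the boundaries from the example in the text):

```python
import bisect

boundaries = ["abd", "bcd", "mnk"]      # reduce boundaries for R = 4 reducers

def tag(key: str) -> int:
    """1-based reducer number. bisect_left counts boundaries strictly
    less than key, so keys <= 'abd' go to reducer 1, keys in
    ('abd', 'bcd'] to reducer 2, and so on."""
    return bisect.bisect_left(boundaries, key) + 1

assert tag("abg") == 2                  # matches the example above
assert tag("mnz") == 4
```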

In the reduce phase, reducer i fetches all partition-i data from every mapper's local disk and sorts it. The final sorted result is obtained by concatenating all the reduce outputs in order.

Example4 - Two sum:

How to find, in a large data set, all pairs of integers that sum to a target value?

We can map all pairs which sum to the target value into the same reducer, and every reducer can perform a local two-sum on its own data. To map the pairs into the same reducer, for every integer i, let h(i) = i if i < target/2, and h(i) = target - i otherwise. Then, if a + b = target and a < target/2, we have h(a) = a = target - b = h(b), which means a and b will go to the same reducer.
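
A sketch of this pairing function (h matches the notation above; the actual reducer assignment would then be hash(h(i)) mod R):

```python
def h(i: int, target: int) -> int:
    """Pairs (a, b) with a + b == target share the same key."""
    return i if i < target / 2 else target - i

target = 10
assert h(3, target) == h(7, target) == 3    # 3 and 7 meet in one reducer
assert h(5, target) == 5                    # i == target/2 maps to itself
```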


Example5 - Group Anagrams:

The mapper takes a list of words and, for every string in the list, outputs (emits) {sorted string, string}. The reducer takes {sorted string, a list of anagram words [A1, A2, ...]} and outputs {[A1, A2, A3]}.
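
A sketch under the same conventions; sorting a word's letters gives a canonical key shared by all of its anagrams:

```python
def map_fn(_, words, emit):
    """Key each word by its sorted letters; anagrams share a key."""
    for word in words:
        emit("".join(sorted(word)), word)

def reduce_fn(sorted_word, words, emit):
    """Every anagram with these letters arrives here; emit the group."""
    emit(sorted_word, list(words))

# e.g. ["eat", "tea", "tan", "ate"] groups into
#   {"aet": ["eat", "tea", "ate"], "ant": ["tan"]}
```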

TeraSort references:
http://kubicode.me/2015/06/27/Hadoop/TeraSort-in-Hadoop/
http://dongxicheng.org/mapreduce/hadoop-terasort-analyse/