Sparrow: Distributed, Low Latency Scheduling

https://cs.stanford.edu/~matei/papers/2013/sosp_sparrow.pdf

TL;DR:

Sparrow is a new scheduler that uses a decentralized, random-sampling approach to provide dramatically higher throughput than current centralized schedulers, while keeping scheduling delays below 10ms and supporting fast scheduler failover. Sparrow does not require any communication between schedulers.

Summary:

The motivation behind this paper is the trend that modern data analytics systems run ever shorter, higher-fanout jobs[1]. As a result, scheduling decisions must be made at very high throughput[2], which existing centralized schedulers cannot sustain.

First, we need to understand the design goals. To support workloads composed of sub-second[3] tasks, a scheduler must meet four requirements:

1. Low latency: provide millisecond-scale scheduling delays.

2. High throughput: support millions of task scheduling decisions per second.

3. Fault tolerance: recover from scheduler failures quickly.

4. Scheduling quality: for example, we might wish to distribute tasks uniformly across workers.

Existing centralized schedulers can provide high-quality placement because they maintain a complete view of which tasks are running on which worker machines, but they do not meet the other requirements. Sparrow instead adopts a decentralized approach in which many stateless schedulers run in parallel[4][7]. Sparrow models each worker as a queue: a worker runs a fixed number of tasks at a time, and if it is assigned more tasks, it adds the excess to its queue.
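
As a concrete illustration of the worker model, here is a toy Python sketch (my own, not code from the paper; the class and field names are hypothetical): a worker with a fixed number of slots runs tasks immediately while slots are free and queues the rest.

```python
# Toy model of a worker with fixed slots and a FIFO queue (illustrative sketch only).
from collections import deque

class Worker:
    def __init__(self, num_slots: int):
        self.num_slots = num_slots
        self.running = []          # tasks currently occupying slots
        self.queue = deque()       # tasks waiting for a free slot

    def assign(self, task):
        if len(self.running) < self.num_slots:
            self.running.append(task)      # free slot: run immediately
        else:
            self.queue.append(task)        # otherwise wait in the queue

    def finish(self, task):
        self.running.remove(task)
        if self.queue:                     # pull the next waiting task, FIFO order
            self.running.append(self.queue.popleft())

w = Worker(num_slots=4)
for t in range(6):
    w.assign(f"task-{t}")
print(len(w.running), "running,", len(w.queue), "queued")   # 4 running, 2 queued
```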

Sparrow uses a technique called batch sampling instead of naive per-task sampling[5][9]. Batch sampling improves on per-task sampling because it allows all probes for a particular job to share information. To schedule a job with m tasks using batch sampling, a scheduler randomly selects d·m worker machines (where d >= 1 is the probe ratio) and places the m tasks on the m least loaded of those workers.
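
To make the two sampling strategies concrete, here is a minimal Python sketch (my own illustration, not Sparrow's code); workers are reduced to their queue lengths, and d is the probe ratio.

```python
# Minimal sketch of per-task sampling vs. batch sampling (not Sparrow's real code).
# Workers are modeled only by their current queue length; probe RTTs are ignored.
import random

def per_task_sampling(queue_lengths, m, d=2):
    """Place each of m tasks on the least loaded of d randomly probed workers."""
    placements = []
    for _ in range(m):
        probed = random.sample(range(len(queue_lengths)), d)
        choice = min(probed, key=lambda w: queue_lengths[w])
        placements.append(choice)
        queue_lengths[choice] += 1          # the task joins that worker's queue
    return placements

def batch_sampling(queue_lengths, m, d=2):
    """Probe d*m workers once for the whole job and pick the m least loaded."""
    probed = random.sample(range(len(queue_lengths)), d * m)
    chosen = sorted(probed, key=lambda w: queue_lengths[w])[:m]
    for w in chosen:
        queue_lengths[w] += 1
    return chosen

if __name__ == "__main__":
    random.seed(0)
    workers = [random.randint(0, 5) for _ in range(100)]   # synthetic queue lengths
    print("per-task :", per_task_sampling(list(workers), m=10))
    print("batch    :", batch_sampling(list(workers), m=10))
```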

One might argue that queue length is a poor indicator of wait time: consider a scenario in which node A has one 500ms task queued while node B has five 10ms tasks. In addition, sampling suffers from race conditions because many schedulers may concurrently place tasks on the same lightly loaded worker. To address both problems, Sparrow uses a technique called late binding. Instead of replying to a probe immediately, a worker places the probe in its queue as if it were a task and replies only when the probe reaches the front of the queue (workers that reply after all of the job's tasks have been launched receive a no-op). Thus, the previous method becomes "the scheduler assigns the job's tasks to the first m workers to reply."
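
Below is a minimal sketch of the idea (my illustration, assuming one slot per worker; names and durations are made up): because a probed worker replies only when the probe reaches the head of its queue, the scheduler effectively ranks workers by the amount of queued work rather than by queue length.

```python
# Minimal sketch of late binding (illustrative only, not Sparrow's implementation).
# Each worker runs one task at a time; its queue holds the durations (ms) of the
# tasks already waiting. A probed worker "replies" only when the probe reaches
# the head of its queue, so its reply time equals the work queued ahead of it.
import random

def reply_time_ms(queue):
    """Time until a newly enqueued probe reaches the head of this worker's queue."""
    return sum(queue)

def schedule_with_late_binding(worker_queues, m, d=2):
    """Probe d*m random workers; assign the job's m tasks to the first m to reply."""
    probed = random.sample(range(len(worker_queues)), d * m)
    # The first m workers to reply are those whose probes clear their queues first.
    return sorted(probed, key=lambda w: reply_time_ms(worker_queues[w]))[:m]

if __name__ == "__main__":
    random.seed(1)
    node_a = [500]                      # one 500 ms task queued
    node_b = [10, 10, 10, 10, 10]       # five 10 ms tasks queued
    print(reply_time_ms(node_a), reply_time_ms(node_b))   # 500 vs 50: B replies first
    others = [[random.choice([10, 50, 100]) for _ in range(random.randint(0, 4))]
              for _ in range(98)]
    print(schedule_with_late_binding([node_a, node_b] + others, m=5))
```

In this toy model, node B (five 10ms tasks, 50ms of queued work) would reply before node A (one 500ms task) if both were probed, even though B's queue is longer.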

Comment:

This paper is extremely well written; the authors explain their approach very clearly. As shown in [10], a centralized scheduler becomes the performance bottleneck: its scheduling quality degrades and its scheduling overhead grows as the number of machines and concurrent jobs in the cluster continues to grow.

I really like the late-binding idea, as it solves two problems at once. However, the authors make some strong assumptions. They assume network latency is negligible, which is reasonable within a data center, but latency across data centers can be on the order of 100ms, making Sparrow a poor fit for geo-distributed data analytics. They also assume jobs run as a single wave of tasks, whereas real-world jobs may run as multiple waves[8].

Lastly, Sparrow's sampling mechanism schedules tasks on the machines with the shortest queues, but it does not consider other factors that affect completion time, such as data locality.

[1] Higher degree of parallelism

[2] For example, the authors estimate that a cluster containing ten thousand 16-core machines and running 100ms tasks may require over 1 million scheduling decisions per second (10,000 machines × 16 slots ÷ 0.1s per task ≈ 1.6 million task launches per second).

[3] The authors predict that tasks will continue to get shorter and shorter.

[4] Some terminology: a cluster is composed of worker machines, each with a fixed number of slots, and schedulers. A job consists of many tasks. If a worker is fully utilized, it queues new tasks until resources become available again.

[5] A naive approach is per-task sampling using the power of two choices: the scheduler probes two randomly selected worker machines for each task and places the task on the less loaded of the two.

[7] A scheduler may share information across a job's tasks when possible, but Sparrow does not require any communication between schedulers.

[8] Multi-wave means the number of tasks in a job or stage is greater than the number of available slots.

[9] The power of two choices technique proposes a simple improvement over purely random assignment of tasks to worker machines: place each task on the least loaded of two randomly selected worker machines.

[10] Apollo: Scalable and Coordinated Scheduling for Cloud-Scale Computing

Related Links:
