Systems for ML: Index

  • A Berkeley View of Systems Challenges for AI - Stoica et al., 2017

Data Validation

  • Automating Large-Scale Data Quality Verification - Schelter et al., VLDB '18

    • Presented a system for automating the verification of data quality at scale

  • Data Validation for Machine Learning - Breck et al., SysML '19

    • Designed a system to monitor the quality of data fed into machine learning algorithms at Google

Federated/Decentralized Learning

  • Towards Federated Learning at Scale: System Design - Bonawitz et al., SysML '19 [Summary]

    • Discussed the general architecture and protocol of federated learning at Google

  • Communication-Efficient Learning of Deep Networks from Decentralized Data - McMahan et al., arXiv '17 [Summary]

    • Described the Federated Averaging (FedAvg) algorithm (a minimal sketch of the update rule appears after this list)

  • The Non-IID Data Quagmire of Decentralized Machine Learning - Hsieh et al., arXiv '19 [Summary]

    • Studied the problem of non-IID data partitions and designed a system-level approach that adapts the communication frequency to reflect the skew in the data.

  • Characterizing Impacts of Heterogeneity in Federated Learning upon Large-Scale Smartphone Data - Yang et al., WWW '21

    • Studied the impact of system heterogeneity on existing federated learning algorithms and reported several observations on potential impact factors
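
To make the FedAvg entry above concrete, here is a minimal, illustrative Python sketch of the server-side loop described by McMahan et al.: sample a fraction of clients, let each run local SGD from the current global model, and average the returned models weighted by each client's data size. The names (`local_sgd`, `client.data`) are placeholders, not an actual API.

```python
# Hypothetical FedAvg server loop (a sketch, not the reference implementation).
import numpy as np

def federated_averaging(global_weights, clients, local_sgd, rounds=100, fraction=0.1):
    for _ in range(rounds):
        # Sample a random fraction of the available clients for this round.
        m = max(1, int(fraction * len(clients)))
        selected = np.random.choice(clients, size=m, replace=False)

        updates, sizes = [], []
        for client in selected:
            # Each client starts from the current global model and runs
            # a few local epochs of SGD on its own (possibly non-IID) data.
            updates.append(local_sgd(global_weights, client.data))
            sizes.append(len(client.data))

        # Average the client models, weighted by their local data sizes.
        total = sum(sizes)
        global_weights = sum(w * (n / total) for w, n in zip(updates, sizes))
    return global_weights
```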

Distributed Machine Learning

  • Large Scale Distributed Deep Networks - Dean et al., NIPS '12

    • Introduced the ideas of model parallelism and data parallelism, as well as Google's first-generation deep network training platform, DistBelief. The ideas are "old", but it is a must-read if you are interested in distributed deep learning platforms.

  • More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server - Ho et al., NIPS '13 [Summary]

    • Discussed the Stale Synchronous Parallel (SSP) model and the implementation of an SSP parameter server

  • Scaling Distributed Machine Learning with the Parameter Server - Li et al., OSDI '14 [Summary]

    • Described the architecture and protocol of the parameter server

  • Project Adam: Building an Efficient and Scalable Deep Learning Training System - Chilimbi et al., OSDI '14

    • Described the design and implementation of a distributed system called Adam, composed of commodity server machines, for training deep neural networks.

  • TensorFlow: A System for Large-Scale Machine Learning - Abadi et al., OSDI '16

    • The core idea behind TensorFlow is the dataflow (with mutable state) representation of deep networks, which the authors claim subsumes existing work on parameter servers and offers a uniform programming model that allows users to harness large-scale heterogeneous systems, both for production tasks and for experimenting with new approaches.

  • Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds - Hsieh et al., NSDI '17 [Summary]

    • Designed a geo-distributed ML system that differentiates communication within a data center from communication across data centers, and presented the Approximate Synchronous Parallel model (similar to SSP).

  • Gradient Coding: Avoiding Stragglers in Distributed Learning - Tandon et al., ICML '17 [Summary]

    • Applied coded computation (gradient coding) to tolerate stragglers in distributed learning

  • Revisiting Distributed Synchronous SGD - Chen et al., arXiv '17

    • Proposed a solution to mitigate stragglers: add b backup workers, but as soon as the parameter servers receive gradients from any N workers, they stop waiting and update the parameters using those N gradients (see the sketch after this list).

  • Adaptive Communication Strategies in Local-Update SGD - Wang et al., SysML '18 [Summary]

    • Proposed an adaptive algorithm for choosing the communication period τ (the number of local updates between synchronization rounds) in local-update SGD, adjusting τ as the algorithm converges.

  • Ray: A Distributed Framework for Emerging AI Applications - Moritz et al., OSDI '18 [Summary]

  • Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads - Jeon et al., ATC '19 [Summary]

  • PipeDream: Generalized Pipeline Parallelism for DNN Training - Narayanan et al., SOSP '19 [Summary]

    • Proposed pipeline-parallel training, which combines data and model parallelism with pipelining.

  • A Generic Communication Scheduler for Distributed DNN Training Acceleration - Peng et al., SOSP '19 [Summary]

    • Key insight: communication for earlier layers of a neural network has higher priority and can preempt communication for later layers.

  • HetPipe: Enabling Large DNN Training on (Whimpy) Heterogeneous GPU Clusters through Integration of Pipelined Model Parallelism and Data Parallelism - Park et al., ATC '20

  • A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters - Jiang et al., OSDI '20

    • Presented a concise overview of the theoretical performance of the PS and all-reduce architectures

    • Proposed an architecture that accelerates distributed DNN training by 1) leveraging spare CPU and network resources, 2) optimizing both inter-machine and intra-machine communication, and 3) moving parameter updates to GPUs
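
To make the backup-worker idea from "Revisiting Distributed Synchronous SGD" (referenced above) concrete, here is a small Python sketch under stated assumptions: the server launches N + b gradient tasks but applies the update as soon as the first N gradients arrive. `compute_gradient` and the futures-based dispatch are stand-ins for whatever RPC layer a real parameter server would use.

```python
# Sketch of one synchronous SGD step with b backup workers (hypothetical helper
# names, not code from the paper). len(workers) = N + b tasks are launched, and
# the update uses only the first n_required = N gradients to arrive.
from concurrent.futures import ThreadPoolExecutor, as_completed

def sync_step_with_backups(params, workers, compute_gradient, lr, n_required):
    pool = ThreadPoolExecutor(max_workers=len(workers))
    futures = [pool.submit(compute_gradient, params, w) for w in workers]

    gradients = []
    for future in as_completed(futures):
        gradients.append(future.result())
        if len(gradients) == n_required:
            break  # stop waiting; the remaining (slow) workers are ignored

    pool.shutdown(wait=False)  # do not block on the stragglers

    # Average the first N gradients and take one SGD step.
    avg_grad = sum(gradients) / n_required
    return params - lr * avg_grad
```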

Deep Learning Scheduler

  • Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters - Peng et al., EuroSys '18

  • Gandiva: Introspective Cluster Scheduling for Deep Learning - Xiao et al., OSDI '18 [Summary]

  • Tiresias: A GPU Cluster Manager for Distributed Deep Learning - Gu et al., NSDI '19 [Summary]

  • Themis: Fair and Efficient GPU Cluster Scheduling - Mahajan et al., NSDI '20

  • AntMan: Dynamic Scaling on GPU Clusters for Deep Learning - Xiao et al., OSDI '20

Inference

  • Clipper: A Low-Latency Online Prediction Serving System - Crankshaw et al., NSDI '17 [Summary]

    • Discussed the challenges of prediction serving and presented a general-purpose, low-latency prediction serving system.

  • Pretzel: Opening the Black Box of Machine Learning Prediction Serving Systems - Lee et al., OSDI '18

  • InferLine: ML Prediction Pipeline Provisioning and Management for Tight Latency Objectives - Crankshaw et al., arXiv '18

  • DeepCPU: Serving RNN-based Deep Learning Models 10x Faster - Zhang et al., ATC '18

  • GRNN: Low-Latency and Scalable RNN Inference on GPUs - Holmes et al., EuroSys '19

  • MArk: Exploiting Cloud Services for Cost-Effective, SLO-Aware Machine Learning Inference Serving - Zhang et al., ATC '19

    • Proposed SLO-aware model scheduling and scaling that selects between AWS EC2 and AWS Lambda to absorb load bursts.

  • Optimizing CNN Model Inference on CPUs - Liu et al., ATC '19

  • Parity Models: Erasure-Coded Resilience for Prediction Serving Systems - Kosaian et al., SOSP '19 [Summary]

    • A learning-based approach to achieving erasure-coded resilience for neural network inference.

  • Nexus: a GPU cluster engine for accelerating DNN-based video analysis - Shen et al., SOSP '19 [Summary]

  • Serving DNNs like Clockwork: Performance Predictability from the Bottom Up - Gujarati et al., OSDI '20

Machine Learning Systems in industry

  • Uber's Machine Learning Platform - Michelangelo: [Blog and Video]

  • Horovod: fast and easy distributed deep learning in TensorFlow - Sergeev et al., 2018 [Github][Summary]

    • Horovod is Uber's distributed training framework for TensorFlow, Keras, PyTorch, and MXNet. Its goal is to make distributed deep learning fast and easy to use (a short usage sketch follows this list).

  • Machine Learning at Facebook: Understanding Inference at the Edge - Wu et al., 2018 [Summary]

    • Facebook's work on bringing machine learning inference to the edge.
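
Since the Horovod entry above emphasizes ease of use, here is a minimal sketch of typical Horovod + PyTorch data-parallel training. `build_model()` and `train_loader` are placeholders; the hvd.* calls (init, local_rank, size, DistributedOptimizer, parameter broadcast) are the standard horovod.torch API, and the script would normally be launched across several processes with horovodrun.

```python
# Minimal Horovod + PyTorch training sketch (placeholders: build_model, train_loader).
import torch
import horovod.torch as hvd

hvd.init()                               # one process per GPU
torch.cuda.set_device(hvd.local_rank())  # pin this process to its local GPU

model = build_model().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are averaged across workers via allreduce.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# Ensure every worker starts from the same initial model and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for inputs, targets in train_loader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(inputs.cuda()), targets.cuda())
    loss.backward()
    optimizer.step()
```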

Misc.

  • Cartel: A System for Collaborative Transfer Learning at the Edge - Daga et al., SoCC '19 [Summary]

    • Proposed a framework for collaborative transfer learning at the edge

  • Collaborative Learning between Cloud and End Devices: An Empirical Study on Location Prediction - Lu et al., SEC '19 [Summary]

  • DeepXplore: Automated Whitebox Testing of Deep Learning Systems - Pei et al., SOSP '17 [Summary]

  • A unifying view on dataset shift in classification - Moreno-Torres et al., 2010

    • Explained various types of dataset shift (i.e., covariate shift, prior probability shift, and concept shift); the defining conditions are sketched after this list.

  • A First Look at Deep Learning Apps on Smartphones - Xu et al., WWW '19
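
For the Moreno-Torres et al. entry above, the three shift types can be stated as conditions on how the training distribution p_tr(x, y) differs from the test distribution p_te(x, y). The notation below is mine, following the paper's distinction between the marginal and conditional factors.

```latex
% Dataset shift taxonomy: conditions under which p_tr and p_te differ.
\begin{align*}
\text{Covariate shift:}         \quad & p_{tr}(y \mid x) = p_{te}(y \mid x) \;\text{ and }\; p_{tr}(x) \neq p_{te}(x) \\
\text{Prior probability shift:} \quad & p_{tr}(x \mid y) = p_{te}(x \mid y) \;\text{ and }\; p_{tr}(y) \neq p_{te}(y) \\
\text{Concept shift:}           \quad & p_{tr}(y \mid x) \neq p_{te}(y \mid x) \;\text{ while }\; p_{tr}(x) = p_{te}(x)
\end{align*}
```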
