Index
A Berkeley View of Systems Challenges for AI - Stoica et al., 2017
Data Validation
Automating Large-Scale Data Quality Verification - Schelter et al., VLDB '18
Presented a system for automating the verification of data quality at scale
Data Validation for Machine Learning - Breck et al., SysML '19
Designed a system to monitor the quality of data fed into machine learning pipelines at Google
Federated/Decentralized Learning
Towards Federated Learning at Scale: System Design - Bonawitz et al., SysML '19 [Summary]
Discussed the general architecture and protocol of federated learning at Google
Communication-Efficient Learning of Deep Networks from Decentralized Data - McMahan et al., arXiv '17 [Summary]
Described the Federated Averaging (FedAvg) algorithm
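For intuition, a minimal NumPy sketch of the FedAvg round structure (synthetic client data and a toy linear model, all names hypothetical; not the paper's implementation):

```python
# FedAvg sketch: each round, a sampled subset of clients runs local SGD from
# the current global model, and the server averages the resulting models
# weighted by each client's data size.
import numpy as np

rng = np.random.default_rng(0)
clients = [(rng.normal(size=(50, 5)), rng.normal(size=50)) for _ in range(10)]
w = np.zeros(5)  # global model (linear-regression weights)

def local_sgd(w, X, y, epochs=5, lr=0.01):
    w = w.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

for rnd in range(20):
    chosen = rng.choice(len(clients), size=3, replace=False)  # client fraction
    updates, sizes = [], []
    for c in chosen:
        X, y = clients[c]
        updates.append(local_sgd(w, X, y))
        sizes.append(len(y))
    # the "federated averaging" step: weighted average of client models
    w = np.average(updates, axis=0, weights=sizes)
```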
The Non-IID Data Quagmire of Decentralized Machine Learning - Hsieh et al., arXiv '19 [Summary]
Studied the problem of non-IID data partitions and designed a system-level approach that adapts the communication frequency to reflect the skewness in the data.
Characterizing Impacts of Heterogeneity in Federated Learning upon Large-Scale Smartphone Data - Yang et al., WWW '21
Studied the impact of system heterogeneity on existing federated learning algorithms and presented several observations on the factors behind it
Distributed Machine Learning
Large Scale Distributed Deep Networks - Dean et al., NIPS '12
Introduced the ideas of model parallelism and data parallelism, as well as DistBelief, Google's first-generation deep network training platform. The ideas are "old", but it is a must-read if you are interested in distributed deep network platforms.
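A toy contrast of the two forms of parallelism (devices simulated as plain arrays; purely conceptual, not DistBelief's implementation):

```python
# Data parallelism vs. model parallelism on a tiny two-layer linear "model".
import numpy as np

X = np.random.randn(8, 4)                       # a mini-batch of 8 examples
W1, W2 = np.random.randn(4, 4), np.random.randn(4, 2)

# Data parallelism: each "device" holds the full model, processes a shard of
# the batch, and per-shard results (e.g. gradients) are aggregated.
shards = np.split(X, 2)                         # 2 replicas, 4 examples each
full_output_dp = np.concatenate([s @ W1 @ W2 for s in shards])

# Model parallelism: the model is split across "devices"; activations flow
# from the device holding the first layer to the device holding the second.
hidden = X @ W1                                 # device 0 holds W1
full_output_mp = hidden @ W2                    # device 1 holds W2

assert np.allclose(full_output_dp, full_output_mp)
```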
More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server - Ho et al., NIPS '13 [Summary]
Discussed the Stale Synchronous Parallel (SSP) model and the implementation of an SSP parameter server
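The core SSP condition is easy to sketch (a hypothetical in-process simulation, not the paper's parameter-server implementation):

```python
# SSP: a worker may proceed only if it is no more than s clocks ahead of the
# slowest worker; otherwise it blocks until the stragglers catch up.
STALENESS = 3                                  # s: max allowed clock gap
clocks = {"w0": 5, "w1": 8, "w2": 4}           # per-worker iteration clocks

def may_proceed(worker):
    """True if the worker is within s clocks of the slowest worker."""
    return clocks[worker] - min(clocks.values()) <= STALENESS

for w in clocks:
    print(w, "proceeds" if may_proceed(w) else "waits for stragglers")
# w1 is 4 clocks ahead of w2, exceeding s=3, so it must wait; w0 and w2 proceed.
```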
Scaling Distributed Machine Learning with the Parameter Server - Li et al., OSDI '14 [Summary]
Described the architecture and protocol of the parameter server
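A conceptual push/pull sketch of the parameter-server interface (a toy in-process stand-in, not Li et al.'s distributed key-value system):

```python
# Workers push gradients keyed by parameter name and pull the latest values.
import numpy as np

class ParameterServer:
    def __init__(self):
        self.store = {}                              # key -> parameter vector

    def push(self, key, grad, lr=0.1):
        """Apply a worker's gradient to the stored parameters for this key."""
        self.store[key] = self.store.get(key, np.zeros_like(grad)) - lr * grad

    def pull(self, key):
        """Return the latest parameters so a worker can compute new gradients."""
        return self.store[key]

ps = ParameterServer()
ps.push("layer0/w", np.ones(4))                      # a worker sends a gradient
weights = ps.pull("layer0/w")                        # another worker fetches weights
```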
Project Adam: Building an Efficient and Scalable Deep Learning Training System - Chilimbi et al., OSDI '14
Described the design and implementation of Adam, a distributed system built from commodity server machines to train deep neural networks.
TensorFlow: A System for Large-Scale Machine Learning - Abadi et al., OSDI '16
The core idea behind TensorFlow is representing deep networks as dataflow graphs with mutable state. The authors claim this subsumes existing work on parameter servers and offers a uniform programming model that lets users harness large-scale heterogeneous systems, both for production tasks and for experimenting with new approaches.
Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds - Hsieh et al., NSDI '17 [Summary]
Designed a geo-distributed ML system that differentiates communication within a data center from communication across data centers, and presented the Approximate Synchronous Parallel (ASP) model (similar in spirit to SSP).
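A sketch of the significance-filter idea: updates are always applied within a data center, but only updates whose relative magnitude exceeds a threshold cross the slow WAN (hypothetical numbers, not the real system):

```python
import numpy as np

THRESHOLD = 0.01                                   # significance threshold (1%)
weights = np.array([1.0, 0.5, 2.0])
local_update = np.array([0.002, 0.02, 0.001])

significant = np.abs(local_update) / np.abs(weights) >= THRESHOLD
wan_traffic = local_update[significant]            # only these cross data centers
# Here only the second update (|0.02| / |0.5| = 4%) is synced over the WAN.
```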
Gradient Coding: Avoiding Stragglers in Distributed Learning - Tandon et al., ICML '17 [Summary]
Proposed gradient coding, a coded-computation scheme in which workers send coded combinations of gradients so the full gradient can be recovered despite stragglers.
Revisiting Distributed Synchronous SGD - Chen et al., arXiv '17
Proposed a straggler-mitigation strategy: add b backup workers, but as soon as the parameter servers receive gradients from any N workers, they stop waiting and update the parameters using those N gradients.
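A sketch of the backup-workers idea, simulated with random compute times (not the paper's infrastructure):

```python
# Launch N + b workers but aggregate the first N gradients that arrive,
# dropping the b stragglers for this step.
import random

N, b = 4, 2
arrivals = sorted(
    (random.uniform(0.1, 2.0), f"worker{i}") for i in range(N + b)
)
used = arrivals[:N]        # parameter server stops waiting after N gradients
dropped = arrivals[N:]     # gradients from the slowest b workers are ignored
print("aggregate:", [w for _, w in used], "dropped:", [w for _, w in dropped])
```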
Adaptive Communication Strategies in Local-Update SGD - Wang et al., SysML '18 [Summary]
Proposed an adaptive algorithm for choosing the communication period (the number of local updates between synchronizations) in local-update SGD, adjusting it as training converges.
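A simplified sketch of the idea: communicate rarely early on and more often as the loss drops. The decay rule below is illustrative, not the paper's exact schedule:

```python
import math

tau0, loss0 = 20, 4.0        # initial communication period and initial loss

def next_tau(current_loss):
    """Shrink the period roughly with the square root of the loss ratio."""
    return max(1, math.ceil(tau0 * math.sqrt(current_loss / loss0)))

for loss in [4.0, 2.0, 1.0, 0.25]:
    print(f"loss={loss:.2f} -> sync every {next_tau(loss)} local steps")
```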
Ray: A Distributed Framework for Emerging AI Applications - Moritz et al., OSDI '18 [Summary]
Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads - Jeon et al., ATC '19 [Summary]
PipeDream: Generalized Pipeline Parallelism for DNN Training - Narayanan et al., SOSP '19 [Summary]
Proposed pipeline-parallel training, which combines data and model parallelism with pipelining.
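A toy illustration of pipeline parallelism: the model is split into stages on different workers and micro-batches flow through them so stages work concurrently. This is a simple fill/drain schedule for intuition only; PipeDream's 1F1B scheduling and weight stashing are more involved.

```python
STAGES, MICRO_BATCHES = 3, 5

# time step at which stage s runs the forward pass of micro-batch m
schedule = {(s, m): s + m for s in range(STAGES) for m in range(MICRO_BATCHES)}

for t in range(STAGES + MICRO_BATCHES - 1):
    active = [f"stage{s}:mb{m}" for (s, m), step in schedule.items() if step == t]
    print(f"t={t}: " + ", ".join(active))
# After the pipeline fills (t >= STAGES - 1), all stages are busy at once.
```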
A Generic Communication Scheduler for Distributed DNN Training Acceleration - Peng et al., SOSP '19 [Summary]
Key insight: communication for the earlier layers of a neural network has higher priority and can preempt communication for the later layers.
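A toy in-process sketch of priority-based gradient communication (not the paper's scheduler): gradients from earlier layers are needed sooner by the next iteration's forward pass, so they are sent first even though the backward pass produces them last.

```python
import heapq

send_queue = []
# backward pass produces gradients from the last layer to the first
for layer in reversed(range(4)):
    heapq.heappush(send_queue, (layer, f"grad_layer{layer}"))  # priority = layer index

while send_queue:
    _, tensor = heapq.heappop(send_queue)
    print("sending", tensor)   # grad_layer0 goes out first, then 1, 2, 3
```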
A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters - Jiang et al., OSDI '20
Presented a concise overview of the theoretical performance of the PS and all-reduce architectures
Proposed an architecture to accelerate distributed DNN training by 1) leveraging spare CPU and network resources, 2) optimizing both inter-machine and intra-machine communication, and 3) moving parameter updates to GPUs
Deep Learning Scheduler
Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters - Peng et al., EuroSys '18
Gandiva: Introspective Cluster Scheduling for Deep Learning - Xiao et al., OSDI '18 [Summary]
Tiresias: A GPU Cluster Manager for Distributed Deep Learning - Gu et al., NSDI '19 [Summary]
Themis: Fair and Efficient GPU Cluster Scheduling - Mahajan et al., NSDI '20
AntMan: Dynamic Scaling on GPU Clusters for Deep Learning - Xiao et al., OSDI '20
Inference
Clipper: A Low-Latency Online Prediction Serving System - Crankshaw et al., NSDI '17 [Summary]
Discussed the challenge of prediction serving systems and presented their general-purpose low-latency prediction serving system.
Pretzel: Opening the Black Box of Machine Learning Prediction Serving Systems - Lee et al., OSDI '18
InferLine: ML Prediction Pipeline Provisioning and Management for Tight Latency Objectives - Crankshaw et al., arXiv '18
DeepCPU: Serving RNN-based Deep Learning Models 10x Faster - Zhang et al., ATC '18
GRNN: Low-Latency and Scalable RNN Inference on GPUs - Holmes et al., EuroSys '19
MArk: Exploiting Cloud Services for Cost-Effective, SLO-Aware Machine Learning Inference Serving - Zhang et al., ATC '19
Proposed SLO-aware model scheduling and scaling that selects between AWS EC2 and AWS Lambda to absorb load bursts.
Optimizing CNN Model Inference on CPUs - Liu et al., ATC '19
Parity Models: Erasure-Coded Resilience for Prediction Serving Systems - Kosaian et al, SOSP '19 [Summary]
A learning-based approach to achieving erasure-coded resilience for neural network inference.
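A sketch of the parity-model idea for k = 2: queries X1 and X2 go to two model replicas, and a parity query X1 + X2 goes to a parity model trained so that F_P(X1 + X2) ≈ F(X1) + F(X2); a lost prediction is reconstructed by subtraction. The toy linear "model" below makes the reconstruction exact, which it is not in general:

```python
import numpy as np

W = np.random.randn(4, 3)
F = lambda x: x @ W              # deployed model (linear, so parity is exact)
F_parity = F                     # in the paper this is a separately trained model

X1, X2 = np.random.randn(4), np.random.randn(4)
y1, y_parity = F(X1), F_parity(X1 + X2)   # suppose F(X2) was lost to a failure
y2_reconstructed = y_parity - y1          # decoder recovers the missing output
assert np.allclose(y2_reconstructed, F(X2))
```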
Nexus: A GPU Cluster Engine for Accelerating DNN-Based Video Analysis - Shen et al., SOSP '19 [Summary]
Serving DNNs like Clockwork: Performance Predictability from the Bottom Up - Gujarati et al., OSDI '20
Machine Learning Systems in Industry
Horovod: fast and easy distributed deep learning in TensorFlow - Sergeev et al., 2018 [Github][Summary]
Horovod is Uber's distributed training framework for TensorFlow, Keras, PyTorch, and MXNet. The goal of Horovod is to make distributed Deep Learning fast and easy to use.
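A typical Horovod (PyTorch) usage skeleton based on its documented API; the model and data here are placeholders, and the current docs should be checked for exact details:

```python
# Wrap the optimizer so gradients are averaged across workers with allreduce,
# broadcast the initial model state, and launch one process per GPU.
import torch
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())

model = torch.nn.Linear(10, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)

# ... training loop ...
# launch: horovodrun -np 4 python train.py
```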
Machine Learning at Facebook: Understanding Inference at the Edge - Wu et al., 2018 [Summary]
Facebook's work on bringing machine learning inference to the edge.
Misc.
Cartel: A System for Collaborative Transfer Learning at the Edge - Daga et al., SoCC '19 [Summary]
Proposed a framework for collaborative transfer learning at the edge
DeepXplore: Automated Whitebox Testing of Deep Learning Systems - Pei et al., SOSP '17 [Summary]
A unifying view on dataset shift in classification - Moreno-Torres et al., 2010
Explained various types of dataset shift (i.e., covariate shift, prior probability shift, and concept shift)
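In shorthand, with $x$ the covariates, $y$ the labels, and train vs. test distributions (the common definitions; the paper treats the X-to-Y and Y-to-X cases more carefully):

```latex
\begin{align*}
\text{Covariate shift:} \quad & P_{\mathrm{tr}}(x) \neq P_{\mathrm{te}}(x),
  && P_{\mathrm{tr}}(y \mid x) = P_{\mathrm{te}}(y \mid x) \\
\text{Prior probability shift:} \quad & P_{\mathrm{tr}}(y) \neq P_{\mathrm{te}}(y),
  && P_{\mathrm{tr}}(x \mid y) = P_{\mathrm{te}}(x \mid y) \\
\text{Concept shift:} \quad & P_{\mathrm{tr}}(y \mid x) \neq P_{\mathrm{te}}(y \mid x),
  && P_{\mathrm{tr}}(x) = P_{\mathrm{te}}(x)
\end{align*}
```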
A First Look at Deep Learning Apps on Smartphones - Xu et al., WWW '19