Index
- Stoica et al., 2017
- Schelter et al., VLDB '18
Presented a system for automating the verification of data quality at scale.
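The core idea, declaring constraints on a dataset once and checking them automatically, can be sketched in a few lines (a toy illustration in plain Python, not the paper's actual API):

```python
# Toy declarative data-quality checks: each named constraint is a
# predicate evaluated over the whole dataset. (Illustrative only.)
records = [
    {"id": 1, "price": 9.99},
    {"id": 2, "price": 12.50},
    {"id": 3, "price": None},
]

checks = {
    "id is complete":    lambda rows: all(r["id"] is not None for r in rows),
    "id is unique":      lambda rows: len({r["id"] for r in rows}) == len(rows),
    "price is complete": lambda rows: all(r["price"] is not None for r in rows),
}

for name, check in checks.items():
    print(f"{name}: {'PASS' if check(records) else 'FAIL'}")
```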
- Breck et al., SysML '19
Designed a system to monitor the quality of data fed into machine learning algorithms at Google.
- Bonawitz et al., SysML '19
Discussed the general architecture and protocol of federated learning at Google.
- McMahan et al., arXiv '17
Described the federated averaging (FedAvg) algorithm.
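A minimal sketch of one FedAvg round, using a toy least-squares model (the objective, client data layout, and hyperparameters are illustrative assumptions, not from the paper):

```python
import numpy as np

def fedavg_round(w_global, clients, lr=0.1, local_steps=5):
    """One round of federated averaging: every client starts from the
    global model, runs a few local gradient steps on its own data, and
    the server averages the results weighted by local dataset size."""
    models, sizes = [], []
    for X, y in clients:                      # each client's private shard
        w = w_global.copy()
        for _ in range(local_steps):          # local SGD on a least-squares loss
            grad = X.T @ (X @ w - y) / len(y)
            w -= lr * grad
        models.append(w)
        sizes.append(len(y))
    total = sum(sizes)
    return sum((n / total) * w for n, w in zip(sizes, models))

# Toy usage: 3 clients, 5 features each, 10 communication rounds.
rng = np.random.default_rng(0)
clients = [(rng.normal(size=(20, 5)), rng.normal(size=20)) for _ in range(3)]
w = np.zeros(5)
for _ in range(10):
    w = fedavg_round(w, clients)
```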
- Hsieh et al., arXiv '19
Studied the problem of non-IID data partitions and designed a system-level approach that adapts the communication frequency to reflect the skew in the data.
- Yang et al., WWW '21
Studied the impact of system heterogeneity on existing federated learning algorithms and offered several observations about potential impact factors.
- Dean et al., NIPS '12
Introduced the ideas of model parallelism and data parallelism, as well as Google's first-generation deep network platform, DistBelief. The ideas are "old", but it is a must-read if you are interested in distributed deep network platforms.
- Ho et al., NIPS '13
Discussed the Stale Synchronous Parallel (SSP) model and the implementation of an SSP parameter server.
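The SSP condition is simple to state: with staleness bound s, the fastest worker may run at most s clocks ahead of the slowest one, otherwise it must wait. A minimal sketch (the variable names and bound are illustrative):

```python
# Stale Synchronous Parallel: a worker may start its next iteration
# only if it is at most `staleness` clocks ahead of the slowest worker.
def can_proceed(worker_clock, all_clocks, staleness=3):
    return worker_clock - min(all_clocks) <= staleness

clocks = [7, 5, 9, 6]  # current iteration of each worker
for i, c in enumerate(clocks):
    status = "proceed" if can_proceed(c, clocks) else "wait"
    print(f"worker {i} at clock {c}: {status}")  # worker 2 must wait
```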
- Li et al., OSDI '14
Described the architecture and protocol of the parameter server.
- Chilimbi et al., OSDI '14
Described the design and implementation of Adam, a distributed system built from commodity server machines to train deep neural networks.
- Abadi et al., OSDI '16
The core idea behind TensorFlow is the dataflow (with mutable state) representation of deep networks, which the authors claim subsumes existing work on parameter servers and offers a uniform programming model that lets users harness large-scale heterogeneous systems, both for production tasks and for experimenting with new approaches.
- Hsieh et al., NSDI '17
Designed a geo-distributed ML system that differentiates communication within a data center from communication across data centers, and presented the Approximate Synchronous Parallel (ASP) model (similar to SSP).
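ASP's core filter, propagating only updates that are "significant" relative to the current parameter value, can be sketched as follows (the function name and the 1% default threshold are illustrative assumptions):

```python
import numpy as np

def significant_mask(delta, weights, threshold=0.01):
    """Simplified significance filter: an update crosses the WAN only
    if it changes its parameter by more than `threshold` relatively."""
    significance = np.abs(delta) / (np.abs(weights) + 1e-12)
    return significance > threshold

weights = np.array([1.0, 0.5, -2.0, 0.01])
delta   = np.array([0.02, 0.001, -0.005, 0.02])
print("send across data centers:", significant_mask(delta, weights))
# -> [ True False False  True ]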
- Tandon et al., ICML '17
Applied coded computation to gradient aggregation, so that the sum of the workers' gradients can be recovered despite stragglers.
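The flavor of the scheme is easy to verify numerically. Below is a three-worker code in the style of the paper's small example, tolerating one straggler: each worker sends one linear combination of two gradients, and any two messages decode the full sum (the gradient values are random stand-ins):

```python
import numpy as np

g1, g2, g3 = (np.random.randn(4) for _ in range(3))
full_sum = g1 + g2 + g3

# Each worker computes two gradients and sends one coded combination.
w1 = g1 / 2 + g2
w2 = g2 - g3
w3 = g1 / 2 + g3

# Any two workers suffice to decode the full sum:
assert np.allclose(2 * w1 - w2, full_sum)   # workers 1 and 2
assert np.allclose(w1 + w3, full_sum)       # workers 1 and 3
assert np.allclose(w2 + 2 * w3, full_sum)   # workers 2 and 3
```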
- Chen et al., arXiv '17
Proposed a solution to mitigate stragglers: add b extra (backup) workers, but as soon as the parameter servers receive gradients from any N workers, they stop waiting and update the parameters using those N gradients.
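A sketch of that first-N aggregation rule, with simulated worker completion times (the sizes and delay distribution are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N, b, dim = 4, 2, 8                           # wait for N of N + b workers

finish_times = rng.exponential(1.0, N + b)    # simulated completion times
gradients = rng.normal(size=(N + b, dim))     # each worker's gradient

fastest = np.argsort(finish_times)[:N]        # first N workers to finish
update = gradients[fastest].mean(axis=0)      # drop the b stragglers
print("used workers:", sorted(fastest.tolist()))
```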
- Wang et al., SysML '18
Proposed an adaptive algorithm for asynchronous training that changes its choice as the algorithm converges.
- Moritz et al., OSDI '18
- Jeon et al., ATC '19
- Narayanan et al., SOSP '19
Proposed pipeline-parallel training, which combines data and model parallelism with pipelining; a toy schedule is sketched after this entry.
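To see the pipelining idea, the sketch below prints a forward wavefront in which micro-batch m reaches stage s at time s + m, so the pipeline fills, runs at full width, then drains (the sizes are made up, and this is the basic GPipe-style wavefront, not the paper's actual 1F1B schedule):

```python
S, M = 3, 4  # pipeline stages, micro-batches (hypothetical sizes)

for t in range(S + M - 1):
    row = []
    for s in range(S):
        m = t - s                    # micro-batch at stage s, time t
        row.append(f"mb{m}" if 0 <= m < M else "----")
    print(f"t={t}: " + " | ".join(row))
```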
- Peng et al., SOSP '19
Key insight: communication for the earlier layers of a neural network has higher priority and can preempt communication for the later layers.
- Park et al., ATC '20
- Jiang et al., OSDI '20
Presented a concise overview of the theoretical performance of the parameter-server and all-reduce architectures, and proposed an architecture that accelerates distributed DNN training by 1) leveraging spare CPU and network resources, 2) optimizing both inter-machine and intra-machine communication, and 3) moving parameter updates to GPUs.
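For reference, the standard back-of-envelope per-step traffic figures behind such comparisons, for n workers, model size M, and k parameter-server shards (these are the textbook formulas, not numbers quoted from the paper):

```latex
% Per-worker traffic for one synchronous data-parallel step:
T^{\mathrm{PS}}_{\mathrm{worker}} = 2M
  \quad\text{(push $M$ gradients, pull $M$ parameters)}, \qquad
T^{\mathrm{PS}}_{\mathrm{server}} = \frac{2nM}{k}, \qquad
T^{\mathrm{AR}}_{\mathrm{worker}} = 2\,\frac{n-1}{n}\,M .
```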
- Peng et al., EuroSys '18
- Xiao et al., OSDI '18
- Gu et al., NSDI '19
- Mahajan et al., NSDI '20
- Xiao et al., OSDI '20
- Crankshaw et al., NSDI '17
Discussed the challenges of prediction-serving systems and presented a general-purpose, low-latency prediction-serving system.
- Lee et al., OSDI '18
- Crankshaw et al., arXiv '18
- Zhang et al., ATC '18
- Holmes et al., EuroSys '19
- Zhang et al., ATC '19
Proposed SLO-aware model scheduling and scaling that selects between AWS EC2 and AWS Lambda to absorb load bursts.
- Liu et al., ATC '19
- Kosaian et al., SOSP '19
A learning-based approach to achieving erasure-coded resilience for neural networks.
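The underlying encode/decode idea, shown here with a linear model where the reconstruction is exact (the paper instead *learns* a parity model so that this works approximately for real, non-linear networks):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 5))          # a linear "model": f(x) = W @ x
f = lambda x: W @ x

x1, x2 = rng.normal(size=5), rng.normal(size=5)
parity_input = x1 + x2               # encoder: sum of the k = 2 inputs
y_parity = f(parity_input)           # parity prediction

# Suppose the server holding f(x2) failed; decode it from the parity.
y2_reconstructed = y_parity - f(x1)
assert np.allclose(y2_reconstructed, f(x2))
```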
- Shen et al., SOSP '19
- Gujarati et al., OSDI '20
Uber's Machine Learning Platform - Michelangelo
- Sergeev et al., 2018
Horovod is Uber's distributed training framework for TensorFlow, Keras, PyTorch, and MXNet. The goal of Horovod is to make distributed deep learning fast and easy to use.
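The typical usage pattern, here with PyTorch and following Horovod's documented API (exact details may vary across versions):

```python
import torch
import horovod.torch as hvd

hvd.init()                                   # one process per GPU
torch.cuda.set_device(hvd.local_rank())      # pin this process to a GPU

model = torch.nn.Linear(10, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradient averaging happens via all-reduce.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# Start all workers from identical parameters and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

# Launch with e.g.: horovodrun -np 4 python train.py
```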
- Wu et al., 2018
Facebook's work on bringing machine learning inference to the edge.
- Daga et al., SoCC '19
Proposed a framework for collaborative learning.
- Lu et al., SEC '19
- Pei et al., SOSP '17
- Moreno-Torres et al., 2010
Explained various types of dataset shift (i.e., covariate shift, prior probability shift, and concept shift); the standard definitions are summarized below.
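Writing P_tr for the training distribution and P_te for the test distribution over inputs x and labels y, the usual definitions (consistent with the paper's taxonomy) are:

```latex
\text{covariate shift:} \quad P_{tr}(x) \neq P_{te}(x), \;\;
  P_{tr}(y \mid x) = P_{te}(y \mid x) \\
\text{prior probability shift:} \quad P_{tr}(y) \neq P_{te}(y), \;\;
  P_{tr}(x \mid y) = P_{te}(x \mid y) \\
\text{concept shift:} \quad P_{tr}(y \mid x) \neq P_{te}(y \mid x), \;\;
  P_{tr}(x) = P_{te}(x)
```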
- Xu et al., WWW '19