Index
- Presented a system for automating the verification of data quality at scale
- Designed a system to monitor the quality of data fed into machine learning algorithms at Google
- Discussed the general architecture and protocol of federated learning at Google
- Communication-Efficient Learning of Deep Networks from Decentralized Data - McMahan et al., arXiv '17 [Summary]
- Described the Federated Averaging (FedAvg) algorithm (a minimal sketch appears at the end of this page)
- Studied the problem of non-IID data partitions and designed a system-level approach that adapts the communication frequency to reflect the skew in the data.
- Characterizing Impacts of Heterogeneity in Federated Learning upon Large-Scale Smartphone Data - Yang et al., WWW '21
- Studied the impact of system heterogeneity on existing federated learning algorithms and offered several observations on potential impact factors
- Introduced the idea of Model Parallelism and Data Parallelism, as well as Google's first-generation Deep Network platform, DistBelief. The ideas are "old", but it is a must-read if you are interested in distributed Deep Network platforms.
- More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server - Ho et al., NIPS '13 [Summary]
- Discussed the Stale Synchronous Parallel (SSP) model and the implementation of an SSP parameter server (see the illustration of the staleness rule at the end of this page)
- Described the architecture and protocol of parameter server
- Project Adam: Building an Efficient and Scalable Deep Learning Training System - Chilimbi et al., OSDI '14
- Described the design and implementation of a distributed system called Adam comprised of commodity server machines to train deep neural networks.
- The core idea behind TensorFlow is the dataflow (with mutable state) representation of Deep Networks, which the authors claim subsumes existing work on parameter servers and offers a uniform programming model that allows users to harness large-scale heterogeneous systems, both for production tasks and for experimenting with new approaches.
- Designed a geo-distributed ML system that differentiates between communication within a data center and communication across data centers, and presented the Approximate Synchronous Parallel model (similar to SSP).
- Coded Computation
- Proposed a solution to mitigate stragglers: add b extra workers, but as soon as the parameter servers receive gradients from any N workers, they stop waiting and update their parameters using those N gradients (a sketch of this aggregation step is included at the end of this page).
- Proposed an adaptive algorithm for choosing the number of workers to wait for in asynchronous training (changing it as the algorithm converges).
- Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads - Jeon et al., ATC '19 [Summary]
- PipeDream: Generalized Pipeline Parallelism for DNN Training - Narayanan et al., SOSP '19 [Summary]
- Proposed Pipeline-parallel training that combines data and model parallelism with pipelining.
- A Generic Communication Scheduler for Distributed DNN Training Acceleration - Peng et al., SOSP '19 [Summary]
- Key insight: communication for the earlier layers of a neural network has higher priority and can preempt communication for the later layers (see the scheduling sketch near the end of this page).
- A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters - Jiang et al., OSDI '20
- Presented a concise overview of the theoretical performance of the PS and all-reduce architectures
- Proposed an architecture to accelerate distributed DNN training by 1) leveraging spare CPU and network resources, 2) optimizing both inter-machine and intra-machine communication, and 3) moving parameter updates to GPUs
- Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters - Peng et al., EuroSys '18
- Discussed the challenge of prediction serving systems and presented their general-purpose low-latency prediction serving system.
- Pretzel: Opening the Black Box of Machine Learning Prediction Serving Systems - Lee et al., OSDI '18
- InferLine: ML Prediction Pipeline Provisioning and Management for Tight Latency Objectives - Crankshaw et al., arXiv '18
- MArk: Exploiting Cloud Services for Cost-Effective, SLO-Aware Machine Learning Inference Serving - Zhang et al., ATC '19
- Proposed an SLO-aware model scheduling and scaling approach that selects between AWS EC2 and AWS Lambda to absorb load bursts.
- Parity Models: Erasure-Coded Resilience for Prediction Serving Systems - Kosaian et al, SOSP '19 [Summary]
- A learning-based approach to achieving erasure-coded resilience for neural networks.
- Nexus: a GPU cluster engine for accelerating DNN-based video analysis - Shen et al., SOSP '19 [Summary]
- Serving DNNs like Clockwork: Performance Predictability from the Bottom Up - Gujarati et al., OSDI '20
- Horovod: fast and easy distributed deep learning in TensorFlow - Sergeev et al., 2018 [Github][Summary]
- Horovod is Uber's distributed training framework for TensorFlow, Keras, PyTorch, and MXNet. The goal of Horovod is to make distributed Deep Learning fast and easy to use (a minimal usage example appears at the end of this page).
- Facebook's work on bringing machine learning inference to the edge.
- Proposed a framework for Collaborative Learning
- Explained various types of data shift (i.e., covariate shift, prior probability shift, and concept shift)
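For the Federated Averaging entry (McMahan et al.), a minimal NumPy-only sketch of the algorithm's structure: each selected client runs a few epochs of local SGD from the current global weights, and the server averages the returned models weighted by local dataset size. The names `local_update`, `federated_averaging`, `grad_fn`, and `clients`, as well as the hyper-parameters, are illustrative placeholders rather than the paper's code.

```python
import numpy as np

def local_update(weights, data, grad_fn, epochs=5, lr=0.01):
    """Run a few epochs of plain SGD on one client's local (x, y) pairs."""
    w = weights.copy()
    for _ in range(epochs):
        for x, y in data:
            w = w - lr * grad_fn(w, x, y)
    return w

def federated_averaging(global_weights, clients, grad_fn, rounds=100, client_frac=0.1):
    """clients: list of local datasets; returns the final global model."""
    for _ in range(rounds):
        m = max(1, int(client_frac * len(clients)))
        chosen = np.random.choice(len(clients), size=m, replace=False)
        updates = [local_update(global_weights, clients[k], grad_fn) for k in chosen]
        sizes = np.array([len(clients[k]) for k in chosen], dtype=float)
        # Server step: weighted average of the client models (FedAvg aggregation).
        global_weights = sum(w * (n / sizes.sum()) for w, n in zip(updates, sizes))
    return global_weights
```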
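For the Stale Synchronous Parallel entry (Ho et al.), a simplified single-machine illustration of the staleness rule: a worker may run ahead of the slowest worker by at most `staleness` iterations, otherwise it blocks until the laggard catches up. `SSPClock` and its fields are made-up names; the paper implements this as a distributed parameter server.

```python
import threading

class SSPClock:
    """Tracks per-worker clocks and enforces a bounded-staleness condition."""

    def __init__(self, num_workers, staleness):
        self.clocks = [0] * num_workers
        self.staleness = staleness
        self.cond = threading.Condition()

    def advance(self, worker_id):
        """Call when a worker finishes an iteration; blocks if it is too far ahead."""
        with self.cond:
            self.clocks[worker_id] += 1
            self.cond.notify_all()
            # SSP rule: no worker may be more than `staleness` iterations
            # ahead of the slowest worker.
            while self.clocks[worker_id] > min(self.clocks) + self.staleness:
                self.cond.wait()
```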
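For the straggler-mitigation entry (N + b workers, wait for the first N gradients), an illustrative sketch of the parameter-server aggregation step. The function name and the use of a `(step, gradient)` queue are assumptions made for the example, not taken from the paper.

```python
def sync_step_with_backups(grad_queue, n_required, step):
    """Aggregate one synchronous step when N + b workers are running.

    grad_queue is a queue.Queue of (step, gradient) tuples pushed by workers.
    The update is applied as soon as the first n_required gradients for this
    step arrive; late gradients from the b straggling backups are ignored.
    """
    grads = []
    while len(grads) < n_required:
        worker_step, g = grad_queue.get()
        if worker_step == step:        # drop stale gradients from earlier steps
            grads.append(g)
    return sum(grads) / n_required     # averaged gradient for the parameter update
```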
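For the communication-scheduler entry (Peng et al.), the "earlier layers first" priority rule can be mimicked with a min-heap keyed by layer index, since the next iteration's forward pass needs the first layers soonest. This sketch only captures the ordering decision; the real system also partitions tensors and preempts in-flight transfers. All names are illustrative.

```python
import heapq

class LayerCommScheduler:
    """Order pending tensor transfers so earlier layers are sent first."""

    def __init__(self):
        self._pending = []   # min-heap of (layer_index, sequence, tensor)
        self._seq = 0        # tie-breaker so tensors never get compared directly

    def enqueue(self, layer_index, tensor):
        heapq.heappush(self._pending, (layer_index, self._seq, tensor))
        self._seq += 1

    def next_transfer(self):
        # Smaller layer indices (earlier layers) are always dequeued first,
        # because the next forward pass consumes them soonest.
        return heapq.heappop(self._pending)[2] if self._pending else None
```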
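For the Horovod entry, a minimal PyTorch usage example following the pattern in Horovod's documentation: initialize, pin each process to one GPU, wrap the optimizer so gradients are averaged via ring all-reduce, and broadcast the initial state from rank 0. The tiny linear model and the learning-rate scaling are placeholders, and one GPU per process is assumed.

```python
import torch
import horovod.torch as hvd

hvd.init()                                  # one process per GPU, launched e.g. via horovodrun
torch.cuda.set_device(hvd.local_rank())     # pin this process to its local GPU

model = torch.nn.Linear(784, 10).cuda()     # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer: gradients are averaged across workers with ring all-reduce.
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())

# Start every worker from identical model and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
```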