Inference Systems

Introduction

The machine learning life-cycle can be divided into three distinct phases: Model Development, Training, and Inference. The final phase of rendering predictions is often referred to as prediction serving, model scoring, or inference. It is the process of using a trained model to make a prediction for a given input (e.g., predicting a user's rating for a movie). Prediction serving requires integrating machine-learning software with other systems, including user-facing application code, live databases, and high-volume data streams. As such, it comes with its own set of challenges and tradeoffs and is the domain of the emerging class of prediction-serving systems. I will try to summarize the challenges, the traditional approaches, and three related papers published recently in top conferences.

Challenges

  • Complexity of Deploying Machine Learning

There is a large and growing number of machine learning frameworks. Each framework has strengths and weaknesses, and many are optimized for specific models or application domains (e.g., computer vision). Thus, there is no dominant framework, and often multiple frameworks are used within a single application (e.g., speech recognition and computer vision in automatic captioning). As a consequence, application developers are forced either to accept reduced accuracy by forgoing a model well-suited to the task, or to incur the substantially increased complexity of integrating and supporting multiple machine learning frameworks.

  • Prediction Latency and Throughput

Ideally, the prediction serving system should render predictions in O(10 ms) even under bursty workloads. (For example, for keyboard prediction models, users typically expect a keyboard response within 20 ms of an input event.) Because prediction serving is often on the critical path, predictions must both be fast and have bounded tail latencies to meet service level objectives.

How to provide predictions?

An ML model can provide predictions in two ways:

  • Offline prediction. This is when your ML model is used in a batch scoring job over a large number of data points, and predictions are not required in real time. In offline recommendations, for example, you only use historical information about customer-item interactions to make the prediction, without any need for online information. Offline recommendations are usually produced for retention campaigns targeting (inactive) customers with a high propensity to churn, for promotion campaigns, and so on.

  • Online prediction. This is when your ML system serves real-time predictions, based on online requests from operational systems and apps. In contrast to offline prediction, online recommendations need the current context of the customer who is using your application, along with historical information, to make the prediction. This context includes information such as the date and time, page views, funnels, items viewed, items in the basket, items removed from the basket, the customer's device, and the device location.

In the remainder of this post, we will look at three related papers: 1) Clipper: A Low-Latency Online Prediction Serving System, 2) Parity Models: Erasure-Coded Resilience for Prediction Serving Systems, and 3) Nexus: a GPU cluster engine for accelerating DNN-based video analysis.

Clipper: A Low-Latency Online Prediction Serving System

Clipper starts with the design goal of easily serving any trained model at interactive latencies. From this starting point, the paper explores techniques for optimizing both inference performance and accuracy while encapsulating the models in a uniform, black-box prediction interface.

Clipper adopts a modular, layered architecture, running each model in a separate Docker container and interposing an intermediate layer between the models and the querying applications. More specifically, Clipper is divided into a model selection layer and a model abstraction layer. The model abstraction layer is responsible for providing a common prediction interface, ensuring resource isolation, and optimizing the query workload for batch-oriented machine learning frameworks. The model selection layer is responsible for dispatching queries to one or more models and combining their predictions based on feedback to improve accuracy, estimate uncertainty, and provide robust predictions.

Model Containers

Each model is managed in a separate Docker container to provide process isolation. Clipper supports replicating model containers, both locally and across a cluster, to improve prediction throughput and leverage additional hardware accelerators.

Model Abstraction Layer

The prediction cache provides a partial pre-materialization mechanism for frequent queries, and the batching component transforms the concurrent stream of prediction queries received by Clipper into batches.

Batching helps amortize the cost of system overheads (e.g., remote procedure calls and feature method invocations) and improves throughput by enabling models to leverage internal parallelism. For example, many machine-learning frameworks are optimized for batch-oriented model training and are therefore capable of using SIMD (single instruction, multiple data) instructions and GPU accelerators to speed up computation on large input batches. However, batching may increase prediction latency, since it requires all queries in the batch to complete before any single prediction is returned. Clipper addresses this problem with an AIMD (additive-increase, multiplicative-decrease) scheme that finds the batch size maximizing throughput subject to the constraint that the batch evaluation latency stays under the target SLO.
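As a concrete illustration (a simplified sketch, not Clipper's actual implementation), an AIMD controller can additively grow the batch size while the measured batch latency stays under the SLO and multiplicatively shrink it when the SLO is violated. `measure_latency_ms` below is a hypothetical callback that times one batch on the model container:

```python
def find_max_batch_size(measure_latency_ms, slo_ms,
                        additive_step=2, backoff=0.9, max_batch=512,
                        probes=100):
    """AIMD search for the largest batch size whose evaluation latency
    stays under the target SLO.

    measure_latency_ms(batch_size): hypothetical callback that runs one
    batch of that size through the model container and returns its
    latency in milliseconds.
    """
    batch_size = 1
    best = 1
    for _ in range(probes):                 # bounded number of probes
        latency = measure_latency_ms(batch_size)
        if latency <= slo_ms:
            best = max(best, batch_size)
            batch_size = min(max_batch, batch_size + additive_step)   # additive increase
        else:
            batch_size = max(1, int(batch_size * backoff))            # multiplicative decrease
    return best
```

In the real system this adaptation runs continuously as batches are served, so the batch size keeps tracking changes in model and hardware behavior.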

Model Selection Layer

The model selection layer uses feedback to dynamically select one or more of the deployed models and combine their outputs to provide more accurate and robust predictions. The selection policy uses reward feedback to choose between, and even combine, multiple candidate models for a given prediction request. By selecting the optimal model or set of models on a per-query basis, Clipper makes machine-learning applications more robust to dynamic environments and allows applications to react in real time to degrading or failing models. The selection policy interface is designed to support ensemble methods and explore/exploit techniques that can express a wide range of such methods, including multi-armed bandit techniques and the Thompson sampling algorithm used by LASER.
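To make feedback-driven selection concrete, here is a toy epsilon-greedy policy over the deployed models. This is a deliberately simplified stand-in, not Clipper's actual selection policies (which include ensembles and Thompson-sampling-style exploration); the model names and reward signal are assumed to come from the application:

```python
import random
from collections import defaultdict

class EpsilonGreedySelector:
    """Toy single-model selection policy: exploit the model with the best
    observed average reward most of the time, explore a random model
    otherwise."""

    def __init__(self, model_names, epsilon=0.1):
        self.models = list(model_names)
        self.epsilon = epsilon
        self.total_reward = defaultdict(float)
        self.count = defaultdict(int)

    def select(self):
        if random.random() < self.epsilon:
            return random.choice(self.models)              # explore
        # Exploit: untried models get priority via +inf average reward.
        return max(self.models,
                   key=lambda m: (self.total_reward[m] / self.count[m]
                                  if self.count[m] else float("inf")))

    def feedback(self, model, reward):
        """Record the application's reward (e.g., prediction accepted or not)."""
        self.total_reward[model] += reward
        self.count[model] += 1
```

The application reports a reward for each served prediction via `feedback()`, and the policy gradually routes more queries to the better-performing models.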

Parity Models: Erasure-Coded Resilience for Prediction Serving Systems

Motivation

Prediction serving systems must deliver predictions with low latency (~10 ms) and adhere to strict SLOs; queries that are not completed by their SLO are often useless to applications. However, prediction serving systems typically run in large-scale, multi-tenant clusters (e.g., public clouds), where there are numerous causes of inflated tail latencies, such as resource contention, hardware unreliability, and failures. This paper focuses on minimizing the tail latency of prediction serving systems.

Two common techniques are: 1) proactively issuing redundant requests to multiple servers and waiting only for the first replica to respond, and 2) reactively issuing redundant requests only after a certain amount of time has elapsed without receiving a result from the server. The proactive approach achieves low latency but suffers from high resource overhead, whereas the reactive approach has low resource overhead but higher latency.

Erasure Codes

Erasure codes are popular tools in storage systems for imparting resilience to data unavailability while remaining agnostic to the cause of unavailability and using fewer resources than replication-based approaches. (Here is a great introduction to erasure codes.) An erasure code encodes k data units to produce r redundant "parity" units in such a way that any k of the (k + r) total data and parity units are sufficient for a decoder to recover the original k data units. The overhead incurred by an erasure code is (k + r) / k, which is typically much less than that of replication (achieved by setting r < k).

To use erasure codes for alleviating the effect of slowdowns and failures in distributed computation, data units are encoded into parity units, and the deployed computation is performed over all data and parity units in parallel. A decoder then uses the outputs from the fastest k of these computations to reconstruct the outputs corresponding to the original data units. For a prediction serving system, employing coded computation would involve encoding queries such that a decoder can recover slow or failed predictions. Coded computation is very different from the traditional use of erasure codes in storage, because the goal is to recover the results of a computation over the data units rather than the data units themselves.

Parity Model

Since erasure codes for coded computation can be learned, it is tempting to use machine learning models (e.g., neural networks) for the encoders and decoders. However, neural network encoders and decoders add significant latency to reconstruction. Instead, the authors propose to use simple, fast encoders and decoders (e.g., summation/subtraction) and to design a new computation over parities, called a "parity model". Parity models are neural network models that transform parities into a form that enables decoding.

The paper illustrates the power of parity models using the simple addition/subtraction erasure code. Under this code, the encoder produces a parity as the summation of the queries in a coding group (e.g., pixel-wise summation for images). The decoder subtracts the k - 1 available predictions from the output of the parity model to reconstruct an unavailable prediction.
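A minimal sketch of this addition/subtraction scheme, with hypothetical `model` and `parity_model` callables (in the paper, the parity model is trained so that its output approximates the sum of the deployed model's outputs over the coding group):

```python
import numpy as np

def encode(queries):
    """Encoder: the parity query is the element-wise sum of the k queries
    in a coding group (e.g., pixel-wise summation for images)."""
    return np.sum(queries, axis=0)

def decode(parity_prediction, available_predictions):
    """Decoder: subtract the k-1 available predictions from the parity
    model's output to approximate the one unavailable prediction."""
    return parity_prediction - np.sum(available_predictions, axis=0)

# Usage sketch for a coding group of size k = 3:
#   queries = [q1, q2, q3]
#   parity_query = encode(queries)
#   parity_pred = parity_model(parity_query)   # trained so this ~= sum_i model(q_i)
# If the replica serving q2 is slow or failed, reconstruct approximately:
#   approx_pred_q2 = decode(parity_pred, [model(q1), model(q3)])
```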

Evaluation Result

Parity models significantly reduce tail latency: in the presence of load imbalance, they reduce 99.9th percentile latency by up to 48%, bringing tail latency up to 3.5x closer to median latency, while maintaining the same median. (Note that the improvement is in tail latency, not median latency.)

However, because the reconstruction relies on a learned model, the recovered outputs are only approximations of the predictions that would have been returned had the original queries not been slow or failed. The evaluation shows that the parity model's degraded-mode accuracy is no more than 6.5% lower than the accuracy when predictions from the deployed model are available.

Nexus: a GPU cluster engine for accelerating DNN-based video analysis

DNN serving is similar to traditional distributed serving, but it imposes some additional constraints: 1) it uses accelerators (e.g., GPUs and TPUs); 2) models must be pre-loaded, since loading a DNN model into memory can take hundreds of milliseconds to seconds; and 3) it relies on batching, which allows kernels to avoid stalling on memory accesses by performing many more operations on each piece of loaded data than would be possible without batching.

Existing DNN serving systems (e.g., Clipper and TensorFlow Serving) are single-application solutions. They do not coordinate resource allocation across DNN applications and rely on external schedulers that cannot perform cross-application optimizations. Moreover, they are not optimized for complex DNN pipelines and limit the granularity of batched execution to whole models.

The authors identified three optimization opportunities.

  • Cluster Level: GPU sharing has to account for SLO and “squishy” load demands across models. (Squishy means the input to the packing algorithm varies with the size of the batch.) In other words, we can change the batch size of the models to increase throughput or meet the SLO requirements.

  • Model Level: Another important observation is that transfer learning adapts a model from one dataset or task to another by re-training only the last few layers. However, DNN frameworks assume that if models differ in any layer, they cannot be executed in a batched fashion at all.

  • Application Level: If a latency SLO is allocated to a query as a whole, the system needs to partition that latency budget across the different components of the DNN pipeline. Similar to the observation in the Chameleon paper, the best allocation changes as the workload changes (Figure 5 in the paper).

Nexus

To tackle the above challenges, the authors propose Nexus, a GPU cluster engine for DNN execution that addresses these problems to attain high execution efficiency on GPU clusters while serving video analysis requests within a specified latency SLO.

When a model is uploaded to Nexus, a profiler measures its execution latency and memory use at different batch sizes. The global scheduler uses load statistics from the runtime and invokes the epoch scheduler to decide which models to execute, at what batch sizes, and on which backends to place them, so as to balance load and maximize utilization. Allocation, scheduling, and routing updates happen at the granularity of an epoch, typically 30-60 s, but a new epoch can also be triggered by large changes in the workload.

Profiling-based batch-aware resource allocator

The scheduling problem has the structure of bin packing, but it must also handle the "squishiness" of tasks and the need to meet latency SLOs. The goal of the formulation is to minimize the total number of GPUs required to serve the workload.

Nexus uses a squishy bin-packing algorithm that runs in two phases. First, it allocates one GPU to each model session and chooses the largest batch size that meets the SLO. Then, it merges these nodes into fewer nodes. To merge two nodes, the algorithm takes the smaller duty cycle of the two as the new duty cycle and adjusts the batch sizes accordingly; such a merge is valid if the occupancy of the merged node is no more than 1. To pick which nodes to merge, it sorts all nodes by occupancy in decreasing order and, for each node, finds the merge that yields the highest occupancy.
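The following is a greatly simplified sketch of that two-phase idea. The duty-cycle and occupancy computations are hidden behind hypothetical `occupancy` and `merge` callbacks, so this shows the shape of the algorithm rather than Nexus's implementation:

```python
def squishy_bin_pack(initial_nodes, occupancy, merge):
    """Simplified two-phase squishy bin packing.

    initial_nodes : phase-1 result, one node (GPU) per model session, each
                    already using the largest batch size that meets its SLO.
    occupancy(n)  : hypothetical callback giving the fraction of the GPU's
                    duty cycle that node n's batch executions occupy.
    merge(a, b)   : hypothetical callback combining two nodes, taking the
                    smaller duty cycle of the two and shrinking batch sizes
                    accordingly; the result's occupancy may exceed 1.
    """
    nodes = list(initial_nodes)
    merged = True
    while merged:                              # phase 2: merge while possible
        merged = False
        nodes.sort(key=occupancy, reverse=True)
        for i, node in enumerate(nodes):
            # Find the merge partner yielding the highest occupancy that
            # still fits on a single GPU (occupancy <= 1).
            best = None
            for j, other in enumerate(nodes):
                if i == j:
                    continue
                cand = merge(node, other)
                if occupancy(cand) <= 1.0 and (
                        best is None or occupancy(cand) > occupancy(best[1])):
                    best = (j, cand)
            if best is not None:
                j, cand = best
                nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [cand]
                merged = True
                break                          # restart with the updated node list
    return nodes                               # one node per GPU in the allocation
```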

Scheduling Complex Queries

The objective is to find the best latency SLO split across the models in the pipeline so as to minimize the total number of GPUs required for the pipeline. Nexus uses dynamic programming to solve this optimization problem; more details are in the paper.
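As a sketch of how such a dynamic program might look for a simple linear pipeline, assume a profiled, hypothetical callback `gpus_needed(stage, slo_ms)` giving how many GPUs a stage needs to sustain its request rate under a per-stage latency budget (the paper's formulation handles more general query structures):

```python
from functools import lru_cache

def min_gpus_for_pipeline(stages, total_slo_ms, gpus_needed, step_ms=5):
    """Split a pipeline's total latency SLO across its stages to minimize
    the total number of GPUs, via dynamic programming over a discretized
    latency budget.

    stages      : list of stage identifiers, executed in sequence.
    gpus_needed : hypothetical callback (stage, slo_ms) -> GPUs required,
                  or float('inf') if the stage cannot meet that budget.
    step_ms     : granularity at which the latency budget is discretized.
    """
    budgets = range(step_ms, total_slo_ms + 1, step_ms)

    @lru_cache(maxsize=None)
    def best(i, remaining_ms):
        if i == len(stages):
            return 0                           # no stages left, no GPUs needed
        answer = float("inf")
        for slo in budgets:
            if slo > remaining_ms:
                break
            # Give `slo` ms to stage i, recurse on the rest of the pipeline.
            answer = min(answer,
                         gpus_needed(stages[i], slo) + best(i + 1, remaining_ms - slo))
        return answer

    return best(0, total_slo_ms)
```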

Batch common prefix across models

Nexus computes a hash of each model's sub-trees to detect common subtrees across models. It loads a common prefix once and executes it on a batch of mixed requests, while the differing suffixes are executed sequentially.
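A small sketch of the prefix-detection idea, assuming each model is described by an ordered list of layer configurations (the names and representation here are illustrative, not Nexus's data structures):

```python
import hashlib

def prefix_hashes(layers):
    """Return a hash for every prefix of a model's layer list, where each
    layer is described by a hashable configuration value."""
    hashes, h = [], hashlib.sha256()
    for layer in layers:
        h.update(repr(layer).encode())
        hashes.append(h.hexdigest())
    return hashes

def shared_prefix_depth(models):
    """models: dict of model name -> layer list. Return how many leading
    layers all models share, by comparing per-prefix hashes."""
    all_hashes = [prefix_hashes(layers) for layers in models.values()]
    depth = 0
    for level in zip(*all_hashes):             # walk prefixes level by level
        if len(set(level)) == 1:
            depth += 1
        else:
            break
    return depth                               # layers [0:depth] can be batched together
```

Requests for all models in such a group can then be batched through the shared prefix layers, with each request finishing on its own suffix.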

Evaluation

(Figures: Game Analysis and Traffic Monitoring case studies.)

Comments:

I liked the observations in Nexus, but I was less fond of the writing. It took me some time to figure out the motivation and the corresponding solutions.

I have a few comments/concerns about Nexus. 1. Nexus assumes that models are small enough that multiple models can fit on one GPU. However, as models grow larger, this assumption may no longer hold. 2. The common-prefix batching technique is not applicable if the models are completely different. Thus, how to apply batching to inherently different models is an interesting question.

References:

Clipper: A Low-Latency Online Prediction Serving System - Crankshaw et al., 2017

Parity Models: Erasure-Coded Resilience for Prediction Serving Systems - Kosaian et al., 2019

Prediction-Serving Systems - Crankshaw et al., 2017
