Don’t Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS

https://www.cs.cmu.edu/~dga/papers/cops-sosp2011.pdf

TL;DR:

This paper introduces a new consistency model, causal+ (causal consistency with convergent conflict handling), which extends causal consistency and sits between the sequential and causal consistency models. The authors claim that causal+ is the strongest consistency model achievable for ALPS systems, but they do not prove why nothing stronger can be achieved.

Summary:

This paper is motivated by the problem of building ALPS systems that provide the strongest consistency achievable under CAP [1]. In other words, the authors are trying to develop a system that achieves both the ALPS properties and causal consistency with convergent conflict handling [2].

There are lots of existing systems, but none of them is satisfactory. Amazon's Dynamo [3] only provides eventual consistency, which is too weak for programmers to build other services on top of: inconsistent views of updates can escape and poison those services. On the other hand, strong consistency (linearizability) is impossible due to CAP. Bayou is probably the closest work, but it does not scale well because it achieves causal+ via log-based replay. Log-exchange-based serialization inhibits replica scalability, as it relies on a single serialization point in each replica to establish ordering.

First, we need to understand what causal consistency with convergent conflict handling is (see Section 3 of the paper). Causal consistency means that the value returned by a get operation must be consistent with the order defined by causality (i.e., it must appear that the operation that writes a value occurs after all operations that causally precede it). However, causal consistency does not order concurrent operations, and concurrent operations on the same key result in a conflict [4]. Thus, we want convergent conflict handling: all conflicting operations must be handled in the same manner at all replicas [5].
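To make the conflict-handling requirement concrete, here is a minimal sketch (my own illustration, not code from the paper) of convergent conflict handling via last-writer-wins: two replicas that see the same pair of concurrent writes in different orders still converge, because the winner is chosen by a deterministic total order on (timestamp, replica id) rather than by arrival order.

```python
def lww_merge(current, incoming):
    """Keep the write with the larger (timestamp, replica_id) tag."""
    if current is None:
        return incoming
    return max(current, incoming, key=lambda w: (w["ts"], w["replica"]))

# Two concurrent writes to the same key from different datacenters.
w1 = {"key": "profile_photo", "value": "a.png", "ts": 100, "replica": "east"}
w2 = {"key": "profile_photo", "value": "b.png", "ts": 100, "replica": "west"}

replica_1 = lww_merge(lww_merge(None, w1), w2)   # sees w1 first, then w2
replica_2 = lww_merge(lww_merge(None, w2), w1)   # sees w2 first, then w1
assert replica_1 == replica_2                    # both converge to the same value
```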

A get operation for a record is served at the closest datacenter and is non-blocking. Since all data is replicated at each datacenter, gets can always be handled locally.

A put operation for a record is 1) translated to a put-after-dependencies operation based on the dependencies the client has observed at this site, 2) committed at the local datacenter and acknowledged to the client (early reply), and 3) queued for asynchronous replication to the other sites/datacenters.

Each operation carries the dependencies it was issued after (see Figure 6 of the paper). When the write is replicated, those dependencies are checked at each remote datacenter, and the value is only installed there once they are satisfied.
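The following is a toy sketch of this mechanism under my own simplifications: a client library tracks the versions it has observed, and a server-side check only applies a replicated write once its dependencies are locally visible. The names (`ClientContext`, `apply_put_after`) and the `store` layout (a dict mapping key to (version, value)) are hypothetical, not the paper's API.

```python
class ClientContext:
    """Per-client dependency tracking, maintained by the client library."""
    def __init__(self):
        self.deps = {}                      # key -> latest version this client has seen

    def on_get(self, key, version):
        self.deps[key] = max(self.deps.get(key, 0), version)

    def on_put(self, key, version):
        # After a put, the new version subsumes the client's earlier dependencies.
        self.deps = {key: version}

def apply_put_after(store, key, version, value, deps):
    """At a remote datacenter: apply a replicated write only if every dependency is visible."""
    for dep_key, dep_version in deps.items():
        local_version, _ = store.get(dep_key, (0, None))
        if local_version < dep_version:
            return False                    # not yet satisfied; queue the write and retry
    store[key] = (version, value)
    return True
```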

Finally, the authors present an extended version of the system, COPS-GT, which also supports get transactions [6].
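The get transaction completes in at most two rounds of local reads. Roughly (this is my reading of the paper's algorithm, not its code, and the helpers `get_with_deps` and `get_by_version` are hypothetical, each returning a (value, version, deps) triple):

```python
def get_transaction(keys, get_with_deps, get_by_version):
    # Round 1: fetch each key along with its version and dependency list.
    results = {k: get_with_deps(k) for k in keys}            # key -> (value, version, deps)

    # Compute the causally-correct version each key must be at: the newest
    # version of that key mentioned by any dependency returned in round 1.
    ccv = {k: results[k][1] for k in keys}
    for _, _, deps in results.values():
        for dep_key, dep_ver in deps.items():
            if dep_key in ccv:
                ccv[dep_key] = max(ccv[dep_key], dep_ver)

    # Round 2: re-fetch only the keys whose first-round version was too old.
    for k in keys:
        if results[k][1] < ccv[k]:
            results[k] = get_by_version(k, ccv[k])            # same triple shape assumed
    return {k: results[k][0] for k in keys}
```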

One last thing: Why not just use vector clocks?

Answer from the authors: the problem with vector clocks and scalability has always been that the size of a vector clock is O(N), where N is the number of nodes. So if we want to scale to a datacenter with 10K nodes, each piece of metadata must have size O(10K). And in fact, vector clocks alone only let you learn Lamport's happens-before relationship; they don't actually govern how you would enforce causal consistency in the system. You'd still need to employ either a serialization point or explicit dependency checking, as we do, in order to provide the desired consistency across the datacenter.

Distributed systems protocols (and DB systems) typically think about sites that include all the data they wish to establish causal ordering over. In COPS, a datacenter node only has a shard of the total data set, so you are fundamentally going to need to do some consistency enforcement or verification protocol.

In short: Vector clocks give you potential ordering between two observed operations. We use explicit dependency metadata to tell a server what other operation it depends on, because that operation likely resides only on other servers in the cluster!
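A back-of-the-envelope illustration of that metadata argument (the numbers are made up): a vector clock carries one entry per node in the system, while explicit dependency metadata only names the few (key, version) pairs an operation actually depends on.

```python
num_nodes = 10_000
vector_clock = {f"node-{i}": 0 for i in range(num_nodes)}    # O(N) entries per item

explicit_deps = {"user:42:profile": 17, "user:42:photo": 9}  # O(#deps) entries per write

print(len(vector_clock), "entries vs", len(explicit_deps), "entries")
```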

[1] ALPS means: 1) Availability: all operations issued to the data store complete successfully. 2) Low latency: operations complete "quickly" (the paper targets an average of a few milliseconds and worst-case, e.g., 99.9th percentile, latency of tens or hundreds of milliseconds). 3) Partition tolerance: the data store continues to operate under network partitions (in the paper, the authors assume partitions only occur across datacenters). 4) High scalability: the data store scales out linearly. We know that under the CAP theorem, we cannot achieve both ALPS and strong consistency (i.e., linearizability).

[2] The authors claim that causal+ (causal consistency with convergent conflict handling) is the strongest consistency model achievable under the CAP theorem for ALPS systems.

[3] Such systems also include LinkedIn's Voldemort and Facebook's Cassandra.

[4] Concurrent operations means that, for operations a and b, a does not happen before b and b does not happen before a. If concurrent operations write to the same key, replicas may diverge forever.

[5] Common ways to handle conflicts are 1) last-writer-wins and 2) application-specific merge procedures.

[6] Recall that a transaction is a group of operations that should occur together or not at all. A get transaction gives clients a consistent view of multiple keys.

Strength:

I think this is a very important paper, as it presents a causally correct system that provides a much stronger guarantee than eventually consistent systems. It also formally defines the crucial properties of distributed data stores (ALPS) as well as causal+ consistency.

Weakness:

The dependency metadata can become very large. Based on my understanding, dependencies are tracked by the client library from the client's history, so if a client issues many operations on many different keys, the dependency list can grow very long. I think one way to mitigate this, as we do at Dropbox, is to use Hybrid Logical Clocks (HLCs). As Murat Demirbas points out, the loosely synchronized real-time component of HLC would help in truncating the dependency list, while the logical clock component of HLC would help in maintaining the dependency list precisely even in the uncertainty region of loosely synchronized clocks.
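For reference, here is a minimal sketch of the standard HLC update rules (following Kulkarni et al.), plus a hypothetical `prune_deps` helper showing the truncation idea: dependencies whose HLC wall-clock component is older than some delta are assumed already replicated and dropped. This is my own illustration of the mitigation, not something COPS itself does.

```python
import time

class HLC:
    def __init__(self):
        self.l = 0   # wall-clock component (ms)
        self.c = 0   # logical counter for events sharing the same l

    def now_ms(self):
        return int(time.time() * 1000)

    def local_or_send_event(self):
        l_old = self.l
        self.l = max(self.l, self.now_ms())
        self.c = self.c + 1 if self.l == l_old else 0
        return (self.l, self.c)

    def receive(self, l_msg, c_msg):
        l_old = self.l
        self.l = max(l_old, l_msg, self.now_ms())
        if self.l == l_old == l_msg:
            self.c = max(self.c, c_msg) + 1
        elif self.l == l_old:
            self.c += 1
        elif self.l == l_msg:
            self.c = c_msg + 1
        else:
            self.c = 0
        return (self.l, self.c)

def prune_deps(deps, hlc_now_ms, delta_ms=10_000):
    """Drop dependencies older than delta; they are assumed already replicated everywhere."""
    return {k: ts for k, ts in deps.items() if ts[0] >= hlc_now_ms - delta_ms}
```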

Related links:

On Dynamo vs. COPS vs. PNUTS

Review by Murat Demirbas:

Youtube Talk:

High Scalability:
