The Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services - Chow et al., OSDI '14
Lineage-driven Fault Injection - Alvaro et al., SIGMOD '15 [Summary]
Early Detection of Configuration Errors to Reduce Failure Damage - Xu et al., OSDI '16
Gray Failure: The Achilles’ Heel of Cloud-Scale Systems - Huang et al., HotOS '17
DQBarge: Improving Data-Quality Tradeoffs in Large-Scale Internet Services - Chow et al., OSDI '16
The Good, the Bad, and the Differences: Better Network Diagnostics with Differential Provenance - Chen et al., SIGCOMM '16 [Summary]
Redundancy Does Not Imply Fault Tolerance - Ganesan et al., FAST '17 [Summary]
REPT: Reverse Debugging of Failures in Deployed Software - Cui et al., OSDI '18 [Summary]
Capturing and Enhancing In Situ System Observability for Failure Detection - Huang et al., OSDI '18
Efficient Scalable Thread-Safety-Violation Detection - Li et al., SOSP '19 [Summary]
Check before You Change: Preventing Correlated Failures in Service Updates - Zhai et al., NSDI '20 [Summary]
Understanding, Detecting and Localizing Partial Failures in Large System Software - Lou et al., NSDI '20
AGAMOTTO: How Persistent is your Persistent Memory Application? - Neal et al., OSDI '20