# Index

* [**The Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services**](https://www.usenix.org/conference/osdi14/technical-sessions/presentation/chow) - Chow et al., OSDI '14
* [**Lineage-driven Fault Injection**](https://people.ucsc.edu/~palvaro/molly.pdf) - Alvaro et al., SIGMOD '15 \[[Summary](https://xzhu0027.gitbook.io/blog/fault-tolerance/index/lineage-driven-fault-injection)]
* [**Early Detection of Configuration Errors to Reduce Failure Damage**](https://www.usenix.org/system/files/conference/osdi16/osdi16-xu.pdf) - Xu et al., OSDI '16
* [**Gray Failure: The Achilles’ Heel of Cloud-Scale Systems**](https://www.cs.jhu.edu/~huang/paper/grayfailure-hotos17.pdf) - Huang et al., HotOS '17
* [**DQBarge: Improving Data-Quality Tradeoffs in Large-Scale Internet Services**](https://www.usenix.org/conference/osdi16/technical-sessions/presentation/chow) - Chow et al., OSDI '16
* [**The Good, the Bad, and the Differences: Better Network Diagnostics with Differential Provenance**](https://www.cs.rice.edu/~angchen/papers/sigcomm-2016.pdf) - Chen et al., SIGCOMM '16 \[[Summary](https://xzhu0027.gitbook.io/blog/fault-tolerance/index/the-good-the-bad-and-the-differences-better-network-diagnostics-with-differential-provenance)]
* [**Redundancy Does Not Imply Fault Tolerance**](https://research.cs.wisc.edu/wind/Publications/fast17-ganesan.pdf) - Ganesan et al., FAST '17 \[[Summary](https://xzhu0027.gitbook.io/blog/fault-tolerance/index/redundancy-does-not-imply-fault-tolerance)]
* [**REPT: Reverse Debugging of Failures in Deployed Software**](https://www.usenix.org/conference/osdi18/presentation/weidong) - Cui et al., OSDI '18 \[[Summary](https://xzhu0027.gitbook.io/blog/fault-tolerance/index/rept-reverse-debugging-of-failures-in-deployed-software)]
* [**Capturing and Enhancing In Situ System Observability for Failure Detection**](https://www.usenix.org/system/files/osdi18-huang.pdf) - Huang et al., OSDI '18
* [**Efficient Scalable Thread-Safety-Violation Detection**](https://www.microsoft.com/en-us/research/uploads/prod/2019/09/sosp19-final193.pdf) - Li et al., SOSP '19 \[[Summary](https://xzhu0027.gitbook.io/blog/fault-tolerance/index/efficient-scalable-thread-safety-violation-detection)]
* [**Check before You Change: Preventing Correlated Failures in Service Updates**](https://ennanzhai.github.io/pub/nsdi20-cloudcanary.pdf) - Zhai et al., NSDI '20 \[[Summary](https://xzhu0027.gitbook.io/blog/fault-tolerance/index/check-before-you-change-preventing-correlated-failures-in-service-updates)]
* [**Understanding, Detecting and Localizing Partial Failures in Large System Software**](https://www.usenix.org/system/files/nsdi20-paper-lou.pdf) - Lou et al., NSDI '20
* [**Automated Reasoning and Detection of Specious Configuration in Large Systems with Symbolic Execution**](https://www.usenix.org/conference/osdi20/presentation/hu) - Hu et al., OSDI '20
* [**AGAMOTTO: How Persistent is your Persistent Memory Application?**](https://www.usenix.org/conference/osdi20/presentation/neal) - Neal et al., OSDI '20