Random Notes

Lineage-driven Fault Injection



“Failure is always an option” and finding bugs in distributed systems is always a hard thing to do. Previous works include verification, which is doing an exhaustive test on all inputs(e.g., model checking and theorem proving and testing, which does random fault injection(e.g., Chaos monkey). However, all existing techniques are unsatisfactory. Verification approach works well on simple protocols(i.e., It is complete: if there exist a violation, this approach will find it), but it suffers from “state explosion problem” as the protocols become more complicated. Random fault injection is good at finding shallow bugs, which are caused by few faults, but it can’t cover all the possible failure scenarios.
This paper introduces a novel approach for discovering bugs in fault-tolerant data management system: Lineage-driven fault injection and a system called Molly, an implementation of LDFI. Unlike random fault injection approach, Molly doesn’t blindly inject faults. Instead, Molly uses provenance to reason backward from successful runs.
Molly works in synchronous, non-Byzantine model. It will record the system runs, especially the success runs. Unlike other approaches which try to find bugs directly, Molly will try to find why it succeeded, based on its lineage. It will construct “lineage graphs” for successful results and an “SAT problem whose variables encode crash failures and message omissions. The off-the-shelf SAT solver will either provide solutions(which are the interesting faults that we want to inject) or no solutions, indicating the program is correctly implemented. For each fault we injected, we will either find a counterexample or discover more redundancy in the program(e.g., Message retry). Since the state space is finite, if the program is truly fault-tolerant, the SAT problem generated by Molly will eventually become unsatisfiable, proving that the program is correct.

Related post: