Distributed Systems
Fault tolerance isn't a feature. It's a series of decisions you made before the failure.
On the difference between systems that tolerate failures because they were designed to, and systems that survive by coincidence. The gap between these is not usually visible until the failure actually happens — which is precisely the problem.