In his book “Designing Data-Intensive Applications”, Martin Kleppmann clearly distinguishes between software faults and failures, and explains why it is important to understand the difference. Here are my notes on his insights:
⚠️ Software Fault
A component of a system deviating from its spec.
- It is impossible to reduce the probability of a fault to zero
- Some faults cannot be cured, which means they need to be prevented instead
- Hardware faults are usually independent of each other: it is unlikely (although not impossible) that many hardware components break at the same time
🔥 Software Failure
A system as a whole fails to deliver a service.
- Systematic errors within a system tend to cause far more failures than uncorrelated hardware faults
- The bugs behind cascading failures often lie dormant for long periods of time and only materialise when a particular unforeseen edge case occurs
- An example of an infamous software failure: the Linux leap second bug of 30 June 2012, which led to many application failures on Linux
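To make the distinction concrete for myself, here is a minimal Python sketch of a fault being tolerated so that it does not turn into a failure. It is not from the book: the names `ReplicaFault`, `ServiceFailure`, `read_with_failover`, `healthy_replica` and `faulty_replica` are all hypothetical, and the "replicas" are plain functions standing in for real components.

```python
class ReplicaFault(Exception):
    """One component deviated from its spec (a fault). Hypothetical name."""


class ServiceFailure(Exception):
    """The system as a whole could not deliver the service (a failure). Hypothetical name."""


def read_with_failover(replicas, key):
    """Try each replica in turn, tolerating faults in individual replicas."""
    faults = []
    for replica in replicas:
        try:
            return replica(key)
        except ReplicaFault as fault:
            # A single fault is recorded and tolerated; move on to the next replica.
            faults.append(fault)
    # Every replica faulted, so the fault has now become a failure.
    raise ServiceFailure(f"all {len(replicas)} replicas faulted for key {key!r}")


def healthy_replica(key):
    # Stand-in for a component that behaves according to its spec.
    return f"value-for-{key}"


def faulty_replica(key):
    # Stand-in for a component that deviates from its spec.
    raise ReplicaFault("simulated disk read error")


# One faulty replica is a fault, but the caller still gets an answer.
print(read_with_failover([faulty_replica, healthy_replica], "user:1"))
```

If every replica in the list is faulty, the same call raises `ServiceFailure` instead, which is the point where a fault has become a failure for the user.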