Software Faults and Failures ⚠️🔥

In his book “Designing Data-Intensive Applications”, Martin Kleppmann draws a clear distinction between software faults and failures, and explains why it is important to understand the difference. Here are my notes from his insights:

⚠️ Software Fault

A component of a system deviating from its spec.

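  • It is impossible to reduce the probability of a fault to zero, so it is usually better to design systems that tolerate faults
  • Some faults cannot be cured after the fact (a security breach, for example), which means they need to be prevented instead
  • Hardware faults are mostly random and independent: it is unlikely (although not impossible) that many hardware components break at the same time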
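
To make the distinction concrete, here is a minimal sketch of a fault being tolerated so that it never becomes a failure. The names (ComponentFault, read_with_failover, the two replicas) are my own for illustration and do not come from the book.

```python
from typing import Callable, Sequence


class ComponentFault(Exception):
    """Raised when a single component deviates from its spec."""


def read_with_failover(replicas: Sequence[Callable[[], str]]) -> str:
    """Try each replica in turn; the service only fails if every replica faults."""
    faults = []
    for replica in replicas:
        try:
            return replica()
        except ComponentFault as fault:
            faults.append(fault)  # fault tolerated: record it and try the next one
    raise RuntimeError(f"service failure: all {len(faults)} replicas faulted")


def broken_replica() -> str:
    # Hypothetical component that always deviates from its spec.
    raise ComponentFault("disk unreadable")


def healthy_replica() -> str:
    # Hypothetical component that behaves as specified.
    return "value"


# One component faults, but the system as a whole still delivers the service.
print(read_with_failover([broken_replica, healthy_replica]))
```

The first replica faults, yet the caller still gets an answer. Only when every replica faults does the fault turn into a failure of the whole service.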

🔥 Software Failure

A system as a whole fails to deliver a service.

  • Systematic errors are correlated across a system, so they tend to cause many more failures than uncorrelated hardware faults
  • The bugs behind these failures (including ones that set off cascading failures) often lie dormant for long periods and only materialise when an unusual, unanticipated edge case occurs
  • An infamous example: the leap second on June 30, 2012, when a bug in the Linux kernel caused many applications to hang at the same time
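
As a rough illustration of a dormant assumption, the sketch below shows code that quietly assumes the time-of-day clock never goes backwards. This is not the actual 2012 kernel bug, just my own hypothetical example of the same shape of problem: the assumption holds for years until the clock is adjusted (for instance around a leap second). The function names and timestamps are made up.

```python
import time


def elapsed_wall_clock(start: float, now: float) -> float:
    """Naive: assumes time-of-day timestamps never go backwards."""
    return now - start


def measure(work) -> float:
    """Safer: time.monotonic() cannot go backwards, so the result is never negative."""
    start = time.monotonic()
    work()
    return time.monotonic() - start


# Hypothetical timestamps: the clock was stepped back while the operation was running.
start, now = 1000.0, 999.5
print(elapsed_wall_clock(start, now))  # -0.5, a "negative duration"
print(measure(lambda: None) >= 0)      # True
```

The naive version works fine in every test run until the one day the clock is stepped back, which is exactly the kind of edge case that stays hidden for a long time.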


I’m Sean

Welcome to the Scalable Human blog. Just a software engineer writing about algo trading, AI, and books. I learn in public, use AI tools extensively, and share what works. Educational purposes only – not financial advice.
