How to Monitor Staleness

What is Staleness?

From an operational and monitoring perspective, staleness is about whether your databases are returning up-to-date results.

How to Manage Staleness?

Even if your application can tolerate stale reads, you need to:

  • Monitor the health of your replication (in particular, whether it falls behind significantly)
  • Set up alerting so that you can investigate the cause
    • Which could be…
      • A problem in the network
      • Overloaded nodes
      • etc

Leader-Based Replication

The database typically exposes metrics for the replication lag, which means you can feed them into a monitoring system.

How is this possible?

This is because writes are applied to the leader and the followers in the same order…

  • Each node has a position in the replication log: the number of writes it has applied locally
  • By subtracting a follower's current position from the leader's current position, you can calculate the lag
    • {leader position} – {follower position} = {current lag}
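
The subtraction above can be sketched in a few lines of Python. This is only an illustration: the names ReplicaStatus, replication_lag, lagging_followers and the max_lag threshold are hypothetical, and a real database will expose its log positions through its own metrics rather than through these types.

```python
from dataclasses import dataclass


@dataclass
class ReplicaStatus:
    """Hypothetical snapshot of one follower, as a monitoring job might see it."""
    name: str
    log_position: int  # number of writes this node has applied locally


def replication_lag(leader_position: int, follower_position: int) -> int:
    """Lag in number of writes: leader position minus follower position.

    Clamped at zero, assuming the two positions may be sampled at slightly
    different moments and the follower can briefly appear to be ahead.
    """
    return max(leader_position - follower_position, 0)


def lagging_followers(leader_position: int,
                      followers: list[ReplicaStatus],
                      max_lag: int) -> list[str]:
    """Return the names of followers whose lag exceeds the alert threshold."""
    return [
        f.name
        for f in followers
        if replication_lag(leader_position, f.log_position) > max_lag
    ]


if __name__ == "__main__":
    followers = [ReplicaStatus("follower-a", 995), ReplicaStatus("follower-b", 400)]
    # With the leader at position 1000 and a threshold of 100 writes,
    # only follower-b is far enough behind to be reported.
    print(lagging_followers(1000, followers, max_lag=100))
```

A monitoring system would typically run a check like this on a schedule and raise an alert when the returned list is not empty.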

Leaderless Replication

There is no fixed order in which writes are applied!

  • This makes monitoring more difficult
  • If the database only uses read repair (and no anti-entropy)
    • There is no limit to how old a value might be!
  • If a value is only infrequently read
    • The value returned by a stale replica may be ancient!

There has been some research on measuring replica staleness in databases with leaderless replication, and on predicting the expected percentage of stale reads depending on the parameters N (number of replicas), W (nodes that must acknowledge a write), and R (nodes queried for a read).

  • Unfortunately this is not yet common practice…
  • But it would be good to include staleness measurements in the standard set of metrics for databases
  • Eventual consistency is a deliberately vague guarantee
  • But for operability it is important to be able to quantify "eventually"
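
To get a feel for how N, W, and R could influence stale reads, here is a toy calculation in Python. It is not the model used in the research mentioned above. It assumes, purely for illustration, that exactly W of the N replicas hold the latest value at the moment of the read, and that the reader contacts R replicas chosen uniformly at random. The function name stale_read_probability is hypothetical.

```python
from math import comb


def stale_read_probability(n: int, w: int, r: int) -> float:
    """Toy estimate of the chance that a read misses the latest write.

    Assumptions (simplifications, not a real database's behaviour):
      * exactly w of the n replicas have the latest value
      * the reader picks r distinct replicas uniformly at random
    The read is stale only if all r chosen replicas are among the
    n - w replicas that have not yet seen the write.
    """
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("w and r must be between 1 and n")
    # comb(n - w, r) is 0 when r > n - w, i.e. when w + r > n,
    # so overlapping read and write sets give a probability of 0.
    return comb(n - w, r) / comb(n, r)


if __name__ == "__main__":
    for n, w, r in [(3, 2, 2), (3, 1, 1), (5, 1, 2)]:
        print(f"n={n} w={w} r={r} -> {stale_read_probability(n, w, r):.2f}")
```

Under these assumptions the estimate drops to zero as soon as W + R is greater than N. Real systems also have to account for writes still in flight, node failures, and timing, which is why measured staleness is more useful than a formula like this one.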
