How to Monitor Staleness

What is Staleness?

From an operational and monitoring perspective, staleness means understanding whether your databases are returning up-to-date results.

How to Manage Staleness?

Even if your application can tolerate stale reads, you need to:

  • Monitor the health of your replication (if it falls behind significantly)
  • Provide alerting so that you can investigate the cause
    • Which could be…
      • A problem in the network
      • Overloaded nodes
      • etc.

Leader-based Replication

The database typically exposes metrics for the replication lag, which means you can feed these into a monitoring system.

How is this possible?

This is because writes are applied to the leader and the followers in the same order…

  • Each node keeps track of its position in the replication log, i.e. the number of writes it has applied locally
  • By subtracting a follower's current position from the leader's current position, you can calculate the lag
    • {leader position} – {follower position} = {current lag}
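
For example, on PostgreSQL (version 10+) the leader's pg_stat_replication view exposes each follower's replay position, and pg_wal_lsn_diff() turns the difference into bytes of lag. Here is a minimal monitoring sketch using the psycopg2 driver – the connection string and alert threshold are hypothetical placeholders:

```python
# Minimal sketch: report per-follower replication lag from a PostgreSQL leader.
# Assumes PostgreSQL 10+ and the psycopg2 driver; the DSN and threshold are hypothetical.
import psycopg2

LEADER_DSN = "host=leader.example.internal dbname=postgres user=monitor"  # hypothetical

LAG_QUERY = """
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM pg_stat_replication;
"""

def report_replication_lag(alert_threshold_bytes: int = 64 * 1024 * 1024) -> None:
    with psycopg2.connect(LEADER_DSN) as conn:
        with conn.cursor() as cur:
            cur.execute(LAG_QUERY)
            for follower, lag_bytes in cur.fetchall():
                # {leader position} – {follower position} = {current lag}
                print(f"{follower}: {lag_bytes} bytes behind the leader")
                if lag_bytes is not None and lag_bytes > alert_threshold_bytes:
                    print(f"ALERT: {follower} has fallen significantly behind")

if __name__ == "__main__":
    report_replication_lag()
```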

Leaderless Replication

There is no fixed order in which writes are processed!

  • This makes monitoring more difficult
  • If the database uses only read repair (and no anti-entropy process), there is no limit to how old a value might be
  • If a value is only read infrequently, the value returned by a stale replica may be ancient!

There has been some research into measuring replica staleness in databases with leaderless replication, and into predicting the expected percentage of stale reads depending on the parameters n (the number of replicas), w (the write quorum) and r (the read quorum).

  • Unfortunately this is not yet common practice…
  • But it would be good if databases included staleness measurements for these parameters in their standard metrics
  • Eventual consistency is a deliberately vague guarantee
  • But for operability it is important to be able to quantify ‘eventually’
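
As a rough feel for how ‘eventually’ could be quantified, here is a small simulation sketch. It is an illustrative model I am assuming for intuition (not the methodology used in that research): it estimates how often a read misses the latest write for chosen values of n, w and r.

```python
# Rough sketch: estimate the probability of a stale read for given n, w, r by
# simulating which replicas a write and a later read happen to reach.
# This is an illustrative model, not the measurement method used in the research.
import random

def estimate_stale_read_probability(n: int, w: int, r: int, trials: int = 100_000) -> float:
    stale = 0
    for _ in range(trials):
        wrote_to = set(random.sample(range(n), w))   # replicas holding the latest write
        read_from = set(random.sample(range(n), r))  # replicas consulted by the read
        if not (wrote_to & read_from):               # no overlap => only stale values returned
            stale += 1
    return stale / trials

# With a strict quorum (w + r > n) the two sets must overlap, so the estimate is 0.
print(estimate_stale_read_probability(n=3, w=2, r=2))  # strict quorum
print(estimate_stale_read_probability(n=3, w=1, r=1))  # weak quorum: stale reads possible
```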

How to Resolve Conflicts with Custom Conflict Resolution?

In some scenarios, the most suitable way to resolve a conflict may depend on the type of application being built.

👉 Most multi-leader replication tools allow you to write custom conflict resolution logic using application code.

  • This could be executed on a write or on a read ✍️ 📖
    • On write ✍️
      • As soon as the database detects a conflict in the log of replicated changes it will call a conflict handler
      • For example, Bucardo allows you to write snippets of Perl for this purpose
        • This handler typically cannot prompt a user
        • It runs as a background process and it must execute quickly
    • On read 📖
      • When a conflict is detected all the conflicting writes are stored
      • The next time data is read, these multiple versions of the data are returned to the application
      • The application may prompt the user or automatically resolve the conflict and write the result back to the database
      • For example: CouchDB works this way!

✋ Note that conflict resolution applies only to an individual row or document, not to an entire transaction…

🤔 Meaning, if you have a transaction that makes several different writes, each write is handled separately for the purposes of conflict resolution.
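
To make the on-read approach concrete, here is a small sketch of conflict resolution in application code. The shopping-cart document and the union-merge rule are hypothetical examples, not any particular database's API:

```python
# Illustrative sketch of on-read conflict resolution in application code.
# The "shopping cart" document and the union-merge rule are hypothetical; the point
# is that the application receives all conflicting versions and decides how to combine them.
from typing import Iterable

def resolve_cart_conflict(conflicting_versions: Iterable[dict]) -> dict:
    """Merge conflicting cart documents by taking the union of their items."""
    merged_items = set()
    for version in conflicting_versions:
        merged_items.update(version.get("items", []))
    return {"items": sorted(merged_items)}

# Two replicas accepted different writes concurrently:
version_a = {"items": ["milk", "bread"]}
version_b = {"items": ["milk", "eggs"]}

resolved = resolve_cart_conflict([version_a, version_b])
print(resolved)  # {'items': ['bread', 'eggs', 'milk']}
# The application would then write `resolved` back, replacing the conflicting versions.
```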

To be continued

The final post will cover the powerful capability of automatic conflict resolution and the reasons why you might consider that approach, as writing custom conflict handlers can be error prone!

For more reading on conflict resolution:

What is Replication?

Conceptually, replication facilitates a distributed architecture. This is achieved by using two fundamental types of node that handle an application's reads and writes to the database(s). These are:

  • Leader node – writable and readable
  • Follower node – read only

The leader has the current data and the followers will update accordingly.

Asynchronous vs Synchronous Replication

Synchronous Replication

  • In practice, you typically make one follower synchronous and keep all the other followers asynchronous
  • If the synchronous follower falls behind or becomes unavailable, it is switched to asynchronous and another follower is made synchronous
  • This guarantees you have an up-to-date copy of the data on at least two nodes: the leader and one follower
  • This configuration is called semi-synchronous

Asynchronous Replication

  • If the leader fails, any writes that have not yet been replicated to the followers are lost
  • This means a write is not guaranteed to be durable, even if it has been confirmed to the client
  • The benefit is that the leader can continue processing writes even if the followers have fallen behind
  • This weakened durability is a trade-off that can benefit geographically distributed applications (lower write latency)
  • Losing writes in this way can be a serious problem for some systems
  • An example of a highly configurable asynchronous replication system is Azure Storage's ‘Object Replication for Block Blobs’ – https://docs.microsoft.com/en-us/azure/storage/blobs/object-replication-overview
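
To make the semi-synchronous configuration concrete, here is a toy sketch – the Leader and Follower classes are hypothetical stand-ins, not a real database API. The leader only confirms a write to the client once the single synchronous follower has acknowledged it, while the remaining followers are updated without waiting:

```python
# Toy sketch of semi-synchronous replication (hypothetical classes, not a real API).
class Follower:
    def __init__(self, name: str):
        self.name = name
        self.log: list[str] = []

    def apply(self, write: str) -> bool:
        self.log.append(write)
        return True  # acknowledgement sent back to the leader

class Leader:
    def __init__(self, sync_follower: Follower, async_followers: list[Follower]):
        self.log: list[str] = []
        self.sync_follower = sync_follower
        self.async_followers = async_followers

    def write(self, data: str) -> bool:
        self.log.append(data)
        # Wait for the one synchronous follower to acknowledge: this is what
        # guarantees an up-to-date copy on at least two nodes (leader + one follower).
        acked = self.sync_follower.apply(data)
        # The asynchronous followers are updated without waiting for acknowledgement
        # (a real system would do this in the background; it is done inline here for brevity).
        for follower in self.async_followers:
            follower.apply(data)
        return acked  # only now is the write confirmed to the client

leader = Leader(Follower("sync-1"), [Follower("async-1"), Follower("async-2")])
print(leader.write("key=value"))  # True once the synchronous follower has the write
```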

Setting up New Followers

This can be either to add new followers or to replace failed ones.

How can you be sure a new follower has an up-to-date copy of the data?

You could lock the database for writes until the new follower is up to date?

But this is no longer a high availability system…

Fortunately, setting up a follower can be achieved without down time! Conceptually, the process goes like this:

  1. Take a consistent snapshot of the leader's database
  2. If possible, this should be done without taking a lock on the entire database
  3. Most databases have snapshot features built in
    • If this is not the case, you can use a third-party backup tool
  4. Copy the snapshot to the new follower node
  5. The follower connects to the leader and requests all the data changes since the snapshot was taken
    • This requires the snapshot to be associated with an exact position in the leader's replication log:
      • Postgres calls this the log sequence number
      • MySQL calls this the binlog coordinates
  6. The follower processes this backlog of changes and then continues to apply new data changes from the leader as they happen – it is now ‘caught up’
    • This type of work can be done in an automated way or manually by a database admin
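
Conceptually, the whole workflow can be sketched like this – the Leader and NewFollower classes are hypothetical stand-ins used only to show the snapshot-plus-log-position idea:

```python
# Conceptual sketch of bringing up a new follower without downtime (hypothetical classes).
class Leader:
    def __init__(self):
        self.log: list[str] = []           # ordered replication log of "key=value" changes
        self.data: dict[str, str] = {}

    def write(self, key: str, value: str) -> None:
        self.data[key] = value
        self.log.append(f"{key}={value}")

    def snapshot(self) -> tuple[dict[str, str], int]:
        # The snapshot is associated with an exact position in the replication log
        # (Postgres: log sequence number, MySQL: binlog coordinates).
        return dict(self.data), len(self.log)

    def changes_since(self, position: int) -> list[str]:
        return self.log[position:]

class NewFollower:
    def __init__(self, leader: Leader):
        snapshot, position = leader.snapshot()   # steps 1-4: take and copy the snapshot
        self.data = snapshot
        self.position = position

    def catch_up(self, leader: Leader) -> None:
        # Steps 5-6: request every change made since the snapshot, then keep consuming.
        for change in leader.changes_since(self.position):
            key, value = change.split("=", 1)
            self.data[key] = value
            self.position += 1
```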

Node Outages (Followers and Leaders)

Possible reasons for node outages:

  • Unexpected faults
  • Planned maintenance, such as rebooting nodes for security patch updates

How do you keep the system up and running during these follower/leader outages, whilst maintaining high availability?

Follower Failure

This can be handled with catch up recovery:

  • On disk, the follower keeps a log of the data changes it has received from the leader
  • In a scenario where the follower crashes or its network connection to the leader is interrupted:
    • The follower can recover easily: from its log it knows the last change it processed before the failure occurred
    • Therefore, the follower can request from the leader all the data changes that occurred while it was disconnected
    • Once it has applied these changes it has caught up, and it can continue receiving data changes from the leader as before
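
A minimal sketch of this idea, assuming the follower durably records the last log position it applied (the state file and the shape of the replication log here are hypothetical):

```python
# Minimal sketch of catch-up recovery: the follower durably records the last log
# position it applied, so after a crash it only requests the changes it missed.
# The state file location and the list-shaped replication log are hypothetical.
import json
from pathlib import Path

STATE_FILE = Path("replication_position.json")  # hypothetical location on the follower's disk

def load_last_position() -> int:
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())["position"]
    return 0

def save_position(position: int) -> None:
    STATE_FILE.write_text(json.dumps({"position": position}))

def catch_up(replication_log: list[str]) -> None:
    """Apply every change the follower missed while it was disconnected."""
    position = load_last_position()
    for change in replication_log[position:]:
        print(f"applying change: {change}")  # stand-in for actually applying the write
        position += 1
        save_position(position)              # record progress durably after each change
```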

Leader Failure

To highlight, leader failure is more challenging to handle. It requires a follower to be promoted to leader, clients to be reconfigured to send their writes to the new leader, and the other followers to start consuming data changes from the new leader.

This process is called leader failover:

  • Manual Failover
    1. The admin is notified that the leader has failed
    2. The admin undertakes the necessary steps to bring up a new leader
  • Automatic Failover
    1. The system determines that the leader has failed
      • Can be due to:
        • Crashes
        • Power outages
        • Network issues
        • Etc…
      • Unfortunately, there is no foolproof way of detecting exactly what has gone wrong
      • This is why timeouts are used (to note, timeouts are typically disabled during planned maintenance)
    2. Choosing a New Leader
      • Election process
        • Usually the node with the most up to date data changes from the old leader wins the election
          • This is to minimise any data loss
        • Getting all the nodes to agree on a new leader is a consensus problem
    3. Reconfiguring the system to use the new leader
      • Clients now need to send their write requests to the new leader
      • If the old leader comes back it may think it is still the leader, not realising the other replicas have forced it to step down
      • The system needs to ensure that the old leader steps down and becomes a follower of the new leader
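
A simplified sketch of steps 2 and 3 – the Node structure and promotion logic below are hypothetical, and a real system must also reach consensus on the choice of new leader:

```python
# Simplified sketch of choosing and installing a new leader (hypothetical structures;
# a real system must also reach consensus on the choice).
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    log_position: int        # how far through the replication log this node has applied
    role: str = "follower"

def promote_new_leader(followers: list[Node]) -> Node:
    # The most up-to-date follower wins, to minimise data loss.
    new_leader = max(followers, key=lambda n: n.log_position)
    new_leader.role = "leader"
    return new_leader

def handle_returning_old_leader(old_leader: Node, current_leader: Node) -> None:
    # If the old leader comes back believing it is still the leader,
    # it must step down and become a follower of the new leader.
    if old_leader is not current_leader:
        old_leader.role = "follower"

followers = [Node("a", 120), Node("b", 118), Node("c", 121)]
print(promote_new_leader(followers))  # Node(name='c', log_position=121, role='leader')
```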
Failover is fraught with things that can go wrong…
  • If asynchronous replication is used
    • The new leader may not have received all the writes from the old leader before it failed
    • If the old leader rejoins, its unreplicated writes may conflict with writes the new leader has since accepted
    • Those writes may need to be discarded, which may be unacceptable for durability in some systems
  • Another issue – two nodes both thinking they're the leader!
    • This issue is called Split Brain 🧠
    • If both leaders accept writes, data can be lost or corrupted
      • As a solution:
        • Some systems have a safety mechanism to shut down one node if two leaders are detected…
        • This is known as fencing, or ‘shoot the other node in the head’ – although if not executed carefully, you can shoot both of them!
  • Too long a timeout before a leader is declared dead
    • It will take longer to recover when the leader really has failed
  • Too short a timeout
    • Can trigger unnecessary failovers
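
The timeout trade-off shows up even in a sketch as simple as this heartbeat monitor (hypothetical – real systems layer consensus and richer health checks on top):

```python
# Sketch of timeout-based failure detection: the leader is presumed dead if no
# heartbeat has been seen within the timeout. The heartbeat source is hypothetical.
import time

FAILOVER_TIMEOUT_SECONDS = 30.0  # too long -> slow recovery; too short -> spurious failovers

class LeaderMonitor:
    def __init__(self, timeout: float = FAILOVER_TIMEOUT_SECONDS):
        self.timeout = timeout
        self.last_heartbeat = time.monotonic()

    def record_heartbeat(self) -> None:
        # Called whenever a heartbeat message arrives from the leader.
        self.last_heartbeat = time.monotonic()

    def leader_presumed_dead(self) -> bool:
        # A timeout cannot tell us *why* the leader is silent (crash, power outage,
        # network issue), only that it has been silent longer than we are willing to wait.
        return time.monotonic() - self.last_heartbeat > self.timeout
```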

There are no easy solutions to the issues that failover can bring, which is why some operations teams choose to perform failovers manually, even if the software supports automatic failover.

Conclusion

Replication is an important process to understand, as many database technologies use it to achieve system scalability, in some cases across different geographical locations. However, as highlighted, there are fundamental problems in distributed systems that this approach is not immune from.

These must be handled accordingly, which initially requires an understanding of the types of issue that can arise, whether node failures, unreliable networks, etc. Trade-offs between consistency, durability, availability and latency must then be made depending on the customer requirements.

As a footnote, replication is something that many database systems include in their feature and functionality set, so the reality is that most developers will be choosing an appropriate database technology as an off-the-shelf solution, rather than building these complex components from scratch.

A recommended starting point for deciding on a replication toolset:

https://www.trustradius.com/data-replication