What is Replication Lag?

Before looking at replication lag, why bother with replication at all?

Reasons we use replication:

Fault tolerance: the system keeps working when a node fails
Scalability: more nodes can serve more requests
Latency: nodes can be placed geographically closer to users

How can replication lag occur with a read scaling architecture?

Let's walk through a common replication pattern, leader-based replication:

Requirements
- Writes go through a single node 🔵
- Read-only queries can go to any replica 🟢 🟢 🟢
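
The two requirements above can be sketched as a tiny router. This is only an illustration: `Node`, `LeaderBasedRouter` and `replicate` are hypothetical names, the nodes are in-memory dictionaries, and `replicate` copies the leader's state in one step as a stand-in for a real replication stream.

```python
import itertools


class Node:
    """A toy database node holding an in-memory key-value store."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class LeaderBasedRouter:
    """Sends every write to the leader and spreads reads over the followers."""

    def __init__(self, leader, followers):
        self.leader = leader
        self.followers = followers
        # Round-robin over followers for read-only queries.
        self._next_follower = itertools.cycle(followers)

    def write(self, key, value):
        # Writes go through a single node: the leader.
        self.leader.data[key] = value
        return self.leader.name

    def read(self, key):
        # Read-only queries go to any replica.
        follower = next(self._next_follower)
        return follower.name, follower.data.get(key)

    def replicate(self):
        # Stand-in for the replication stream: copy leader state to followers.
        for follower in self.followers:
            follower.data = dict(self.leader.data)


leader = Node("leader")
followers = [Node("follower-1"), Node("follower-2"), Node("follower-3")]
router = LeaderBasedRouter(leader, followers)

write_target = router.write("user:1", "Alice")
router.replicate()
reads = [router.read("user:1") for _ in range(4)]
```

Adding read capacity in this sketch means appending another `Node` to `followers`; the write path stays on the single leader.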

When does this work?

This pattern is attractive when the workload has a high percentage of reads and only a small percentage of writes. With few writes, the replicas rarely change, so reads from any replica return more consistent results. ✅

Asynchronous vs Synchronous replication:

Adding more follower nodes only really works with asynchronous replication ✅
With synchronous replication, a single node outage can stall the entire system or cause downtime ❌

With asynchronous replication, followers can fall behind the leader. This is a temporary state, because they will catch up eventually… This is called eventual consistency.
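
The difference between the two modes can be sketched with a toy model. Everything here is an assumption made for illustration: `Follower`, `write_sync` and `write_async` are hypothetical names, and a real synchronous write would block or time out rather than return `False`.

```python
class Follower:
    """A toy follower that may be unreachable."""

    def __init__(self, up=True):
        self.up = up
        self.data = {}   # changes already applied
        self.queue = []  # changes received but not yet applied


def write_sync(leader, followers, key, value):
    """Confirm the write only once every follower has applied it."""
    leader[key] = value
    if not all(f.up for f in followers):
        # One unreachable follower is enough to stop the write being confirmed.
        return False
    for f in followers:
        f.data[key] = value
    return True


def write_async(leader, followers, key, value):
    """Confirm immediately; followers apply the change later."""
    leader[key] = value
    for f in followers:
        f.queue.append((key, value))
    return True


leader = {}
followers = [Follower(), Follower(up=False)]  # one follower is down

sync_confirmed = write_sync(leader, followers, "k", "v1")
async_confirmed = write_async(leader, followers, "k", "v2")
```

With one follower down, the synchronous write cannot be confirmed while the asynchronous write succeeds straight away, at the cost of followers holding unapplied changes.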

Eventual consistency

"Eventually" is deliberately vague, as there is no limit to how far a node can fall behind:

Maybe a fraction of a second (unnoticeable) 🤷‍♂️
If the whole system is under load or the network is struggling, the lag can easily grow to several seconds or even several minutes 🕙
When the lag is that large, it is not just a theoretical issue; it causes real problems for applications ❌
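
One way to picture lag is as the gap between what the leader has written and what a follower has applied. The sketch below is a simplification: it measures lag in log entries rather than seconds, and `Leader`, `Follower`, `catch_up` and `lag` are hypothetical names.

```python
class Leader:
    def __init__(self):
        self.log = []  # append-only list of (key, value) writes

    def write(self, key, value):
        self.log.append((key, value))


class Follower:
    def __init__(self):
        self.applied = 0  # number of log entries applied so far
        self.data = {}

    def catch_up(self, leader, max_entries):
        """Apply at most max_entries pending entries from the leader's log."""
        pending = leader.log[self.applied:self.applied + max_entries]
        for key, value in pending:
            self.data[key] = value
        self.applied += len(pending)

    def lag(self, leader):
        """Lag measured as the number of entries not yet applied."""
        return len(leader.log) - self.applied


leader = Leader()
follower = Follower()
for i in range(10):
    leader.write(f"k{i}", i)

follower.catch_up(leader, max_entries=4)    # a slow follower applies only part
lag_mid = follower.lag(leader)
follower.catch_up(leader, max_entries=100)  # once writes pause, it catches up
lag_end = follower.lag(leader)
```

Nothing in this model bounds how large `lag` can get while the leader keeps writing faster than the follower applies, which is the point behind "eventually".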

What problems does replication lag cause?

Reading your own writes…

Many applications let you submit data and then let you and other users view it
- This may be a record in a customer database, a comment in a forum etc
Asynchronous replication means some nodes may not be up to date 🤔
- So if a user submits changes to the leader
  - They may not see them on the follower they are viewing from… this can cause distress to the user ❌
  - 👉 Especially if it involves transferring your own money to another local account and not seeing the transfer show up immediately 💸
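
The anomaly can be reproduced in a few lines. This is a sketch with hypothetical names (`submit_comment`, `view_comment`, `apply_replication`); the leader and follower are plain dictionaries, and `pending` stands in for changes still in flight.

```python
from collections import deque

leader = {}
follower = {}
pending = deque()  # changes the follower has not applied yet


def submit_comment(comment_id, text):
    """The write goes to the leader and is queued for the follower."""
    leader[comment_id] = text
    pending.append((comment_id, text))


def view_comment(comment_id):
    """The read is served by the follower."""
    return follower.get(comment_id)


def apply_replication():
    """The follower catches up on everything queued so far."""
    while pending:
        comment_id, text = pending.popleft()
        follower[comment_id] = text


submit_comment("c1", "Hello")
seen_before = view_comment("c1")  # follower is behind: comment is missing
apply_replication()
seen_after = view_comment("c1")   # follower has caught up
```

From the user's point of view, the comment they just submitted appears to be lost until the follower catches up.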

How to handle replication lag?

When working with an eventually consistent system, it is worth considering how the application behaves if replication lag grows to several minutes or even hours:

If the answer is no problem… then great!
But if the result is a poor experience for users 👎
- It is important to provide a stronger guarantee 💪
  - Such as read-after-write consistency (I will go deeper on this in the next blog post)

❌ Pretending 🎭 that replication is synchronous when it is actually asynchronous is a recipe for problems later down the line. 🪲🦟

Potential solution

There are ways for an application to provide a stronger guarantee than the underlying database…

By performing certain types of reads on the leader 🤔
- However this is complex to do in the application layer 👎

It would be better if developers did not need to worry about replication issues and could instead just trust that the database is doing the right thing. 🤷‍♂️
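
As a rough sketch of what "certain types of reads on the leader" could look like, the router below sends a user's reads to the leader for a short window after that user's last write. The 60-second window, the `ReadYourWritesRouter` name and the explicit `now` parameter are all assumptions made for illustration, not something a particular database provides.

```python
class ReadYourWritesRouter:
    """Routes a user's reads to the leader shortly after that user writes."""

    def __init__(self, window_seconds=60.0):
        self.window_seconds = window_seconds
        self.last_write_at = {}  # user_id -> timestamp of the latest write

    def record_write(self, user_id, now):
        self.last_write_at[user_id] = now

    def choose_node(self, user_id, now):
        last = self.last_write_at.get(user_id)
        if last is not None and now - last < self.window_seconds:
            return "leader"    # the follower may not have this user's write yet
        return "follower"      # safe enough to read from a replica


router = ReadYourWritesRouter(window_seconds=60.0)
router.record_write("alice", now=100.0)

alice_soon = router.choose_node("alice", now=130.0)   # 30s after her write
alice_later = router.choose_node("alice", now=161.0)  # 61s after her write
bob = router.choose_node("bob", now=130.0)            # bob never wrote
```

Even this small version hints at the complexity: the application has to track every user's writes, and the window is only safe if followers never lag by more than that amount.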

This is why transactions exist
- They are a way for a database to provide strong guarantees so that the application can be simpler ✅
Single node transactions have existed for a long time
- However, in the move to distributed, replicated and partitioned databases, many systems have abandoned them 🤷‍♂️
  - Claiming that transactions are too expensive ❌
  - 👉 And asserting that eventual consistency is inevitable in a scalable system
    - There is some truth in that statement, but it is overly simplistic; there are many more nuances out there

📚 Further Reading & Related Topics

If you’re exploring replication lag in distributed systems, these related articles will provide deeper insights:

• Distributed Data-Intensive Systems: Logical Log Replication – Learn how logical logs help manage replication consistency and reduce lag in distributed environments.

• Distributed Data-Intensive Systems: Reading and Writing Quorums – Understand how quorum-based approaches affect replication performance and data consistency.


Scalable Human Blog