Let's first break down the goals of replication
- High availability
- Keeping the system running at an agreed level of operation.
- Even when one machine or several machines or an entire data centre goes down!
- Disconnected Operations – Clients with offline operations
- Enabling the application to continue working when there is a network interruption
- Latency
- Placing data geographically close to users so that the users can interact with it faster
- Scalability
- The ability to handle a higher volume of reads than a single machine could handle, by performing these reads on replicas
The humbling reality…🙏
Although keeping copies of data on several machines sounds like a simple objective, replication is a remarkably tricky problem! It requires careful attention to:
- 📝Concurrency 👈
- 📝Things that can go wrong 👈
- 📝Dealing with the consequences of the faults 👈
At a minimum, we generally need to deal with the following:
- 📝Unavailable nodes 🔴❌
- 📝Network interruptions 📶 ❌
And that is not even considering the more insidious faults 🧟‍♂️, such as:
- 📝Silent data corruption due to software bugs 🪲
What approaches can we take with replication?
- Single leader replication
- Clients send all writes to a single node (leader)
- The leader sends a stream of data change events to the followers
- Reads can be performed by any replica
- But followers may return stale reads
- Multi-leader replication
- Where clients send each write to one of several leader nodes
- Any of which can accept writes
- Streams of data change events are sent between leaders and to any follower nodes
- Related, on choosing the best multi-leader topology: The Multi-Leader Replication Topologies
- Leaderless replication
- Clients send each write to several nodes
- Clients read from several nodes in parallel ⬇️
- In order to detect and correct nodes with stale data
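To make the leaderless approach a little more concrete, here is a minimal Python sketch of a quorum write followed by a quorum read with read repair. Everything in it (the `Node` class, `quorum_write`, `quorum_read`, the client-supplied version number and the `w`/`r` parameters) is a hypothetical, in-memory illustration and not the API of any real database.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One replica: maps key -> (version, value). Hypothetical, in-memory only."""
    up: bool = True
    data: dict = field(default_factory=dict)


def quorum_write(nodes, key, value, version, w):
    """Send the write to every node; succeed if at least w nodes acknowledge."""
    acks = 0
    for node in nodes:
        if node.up:
            node.data[key] = (version, value)
            acks += 1
    return acks >= w


def quorum_read(nodes, key, r):
    """Read from all reachable nodes, return the newest value, repair stale nodes."""
    replies = [(node, node.data.get(key, (0, None))) for node in nodes if node.up]
    if len(replies) < r:
        raise RuntimeError("not enough replicas reachable for a quorum read")
    newest = max((reply for _, reply in replies), key=lambda reply: reply[0])
    # Read repair: overwrite any replica that returned an older version.
    for node, reply in replies:
        if reply[0] < newest[0]:
            node.data[key] = newest
    return newest[1]


nodes = [Node(), Node(), Node()]
quorum_write(nodes, "cart", ["milk"], version=1, w=2)

nodes[2].up = False                       # one node misses the next write
quorum_write(nodes, "cart", ["milk", "eggs"], version=2, w=2)
nodes[2].up = True                        # it comes back with stale data

value = quorum_read(nodes, "cart", r=2)   # newest value wins, node 2 is repaired
```

In this sketch the stale node is corrected as a side effect of the read, which is the "detect and correct" step from the list above. Real systems typically pair this with a background process that repairs data nobody happens to read.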
Advantages and disadvantages of replication
Single leader replication is the most popular because:
- Easier to understand ✅ ☺️
- No conflict resolution to worry about ✅
Multi-leader replication and Leaderless replication can be more robust in handling:
- Faulty nodes ✅
- Network interruptions ✅
- Latency spikes ✅
At the cost of being:
- Harder to reason about ❌
- Providing only very weak consistency guarantees to end users ❌
Asynchronous and synchronous replication
The choice between the two can have a profound effect on the system behaviour when there is a fault.
Asynchronous replication:
- Can be faster when the system is running as expected ✅
- It is important to figure out what happens when replication lag increases or servers fail 👈📝
- If a leader fails and you promote an asynchronously updated follower to be the new leader 🤔
- There is a risk… ⚠️
- Recently committed data may be lost ❌
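The failover risk can be illustrated with a small Python sketch. The `Replica` and `AsyncLeader` classes are hypothetical and purely in-memory; the point is only that an asynchronous leader acknowledges a write before any follower has received it.

```python
class Replica:
    """A node holding an append-only log of writes (hypothetical, in-memory)."""

    def __init__(self, name):
        self.name = name
        self.log = []


class AsyncLeader(Replica):
    """Acknowledges writes immediately and ships them to followers later."""

    def __init__(self, name, followers):
        super().__init__(name)
        self.followers = followers

    def write(self, record):
        self.log.append(record)
        return "committed"          # acknowledged before any follower has it

    def replicate(self):
        """Ship any log entries the followers have not yet received."""
        for follower in self.followers:
            follower.log.extend(self.log[len(follower.log):])


follower = Replica("follower-1")
leader = AsyncLeader("leader", [follower])

leader.write("order-1")
leader.replicate()                  # follower catches up
leader.write("order-2")             # acknowledged, but not yet replicated

# The leader crashes here; the follower is promoted as-is.
new_leader_log = list(follower.log)
lost = [r for r in leader.log if r not in new_leader_log]
```

Here `lost` contains `"order-2"`: the client was told the write was committed, yet the promoted follower never saw it. A synchronous leader would wait for the follower inside `write` before acknowledging, trading write latency for durability.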
Consistency models help to deal with replication lag by defining how the application should behave when it occurs.
The consistency models:
- Read after write consistency
- Users should always see data that they have submitted themselves
- Monotonic reads
- After the users have seen data at one point in time
- They should not see data from an earlier point in time
- Consistent prefix reads
- Users should see the data in a state that makes causal sense
- For example, seeing a question and its reply in the correct order
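One possible way to get the first two guarantees is to remember, per user, the newest log position they have written or seen, and to read only from replicas that have caught up to it. The Python sketch below assumes a hypothetical `Replica` that exposes its applied log position and a hypothetical `Session` object per user. It does not cover consistent prefix reads, which depend on how causally related writes are ordered.

```python
class Replica:
    """Holds a value plus the log position it has applied (hypothetical)."""

    def __init__(self):
        self.position = 0
        self.value = None

    def apply(self, position, value):
        self.position = position
        self.value = value


class Session:
    """Tracks, per user, the newest log position written or seen so far."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.min_position = 0

    def record_write(self, position):
        # Read-after-write: later reads must include this write.
        self.min_position = max(self.min_position, position)

    def read(self):
        # Only use a replica that is at least as fresh as what we saw.
        for replica in self.replicas:
            if replica.position >= self.min_position:
                # Monotonic reads: never go back before this point.
                self.min_position = replica.position
                return replica.value
        raise RuntimeError("no replica is caught up; retry or ask the leader")


leader, follower = Replica(), Replica()
session = Session([follower, leader])   # prefers the follower when it is fresh

leader.apply(1, "draft")
follower.apply(1, "draft")
first = session.read()                  # served by the follower

leader.apply(2, "published")            # the follower is now lagging
session.record_write(2)                 # this user made that write
second = session.read()                 # follower skipped, leader answers
```

The lagging follower is still fine for other users who have not yet seen position 2; only this session is steered away from it.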
Concurrency issues:
These are inherent in multi-leader and leaderless replication… 🤷‍♂️
- Because they allow multiple writes to happen concurrently, conflicts may occur ❌
- 🧐 There are numerous algorithms that allow a database to determine whether:
- One operation happened before another ✅
- Or the operations happened concurrently ✅
- There are also algorithms for resolving conflicts ➡️💥⬅️
- This can be done by merging concurrent updates together, or with other, more subtle techniques ✅
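As an illustration of both ideas, the Python sketch below uses version vectors (one counter per node) to decide whether one write happened before another or whether they were concurrent, and resolves concurrent writes by taking the union of the two values. The function names and the shopping-cart data are hypothetical, and union is only one of several possible merge strategies.

```python
def compare(a, b):
    """Compare two version vectors (dicts of node name -> counter).

    Returns "before", "after", "equal" or "concurrent".
    """
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


def merge(a, b):
    """Combine two version vectors by taking the per-node maximum."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}


def resolve(value_a, vector_a, value_b, vector_b):
    """Keep the later write, or union the values if the writes were concurrent."""
    order = compare(vector_a, vector_b)
    if order == "before":
        return value_b, vector_b
    if order in ("after", "equal"):
        return value_a, vector_a
    return value_a | value_b, merge(vector_a, vector_b)


# Two leaders each add an item to the same cart without seeing the other.
cart_a, vec_a = {"milk", "eggs"}, {"leader-1": 2, "leader-2": 0}
cart_b, vec_b = {"milk", "ham"}, {"leader-1": 1, "leader-2": 1}

merged_cart, merged_vec = resolve(cart_a, vec_a, cart_b, vec_b)
```

Neither vector dominates the other, so the writes are treated as concurrent and the merged cart contains milk, eggs and ham. A plain union cannot express removals, which is one reason the more subtle techniques exist.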
Final author recommendations
This concludes the blog series on replication in data-intensive systems. Note that most of this information can be found in “Designing Data-Intensive Applications” by Martin Kleppmann, which is, in my opinion, a highly recommended and extensive book on this subject area.