Conceptually, replication underpins a distributed architecture. This is achieved with two fundamental types of node, which handle an application's reads and writes to the database:
- Leader node – writable and readable
- Follower node – read only
The leader holds the current data, and the followers update themselves accordingly.
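The leader/follower split above can be sketched in a few lines of Python. This is a toy in-memory illustration, not a real database API; the names (`Leader`, `Follower`, `write`, `apply`) are illustrative only:

```python
class Follower:
    """Read-only replica: it only applies changes received from the leader."""
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value


class Leader:
    """Writable node: all writes arrive here first, then fan out to followers."""
    def __init__(self, followers):
        self.data = {}
        self.followers = followers

    def write(self, key, value):
        self.data[key] = value
        # Forward the change to every follower.
        for f in self.followers:
            f.apply(key, value)


followers = [Follower(), Follower()]
leader = Leader(followers)
leader.write("user:1", "alice")

# Reads can be served by the leader or by any follower.
assert followers[0].data["user:1"] == "alice"
```

In this sketch replication is instantaneous; the whole synchronous/asynchronous discussion below is about what happens when that fan-out step takes time or fails.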
Asynchronous vs Synchronous Replication
Synchronous Replication
- One follower is synchronous; all the other followers are asynchronous
- If the synchronous follower falls behind, it is switched to asynchronous and another follower becomes synchronous
- This guarantees an up-to-date copy of the data on at least one follower as well as the leader
- This configuration is called semi-synchronous
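The semi-synchronous switch described above can be sketched as follows. This is a toy model, assuming acknowledgements are simple return values; class and method names are made up for illustration, and a real system would do the role switch via cluster management, not inline in the write path:

```python
class Replica:
    def __init__(self, healthy=True):
        self.log = []
        self.healthy = healthy

    def replicate(self, entry):
        if not self.healthy:
            return False  # simulate a follower that has fallen behind
        self.log.append(entry)
        return True


class SemiSyncLeader:
    def __init__(self, sync_follower, async_followers):
        self.log = []
        self.sync_follower = sync_follower
        self.async_followers = list(async_followers)

    def write(self, entry):
        self.log.append(entry)
        # Block until the one synchronous follower acknowledges the write...
        if not self.sync_follower.replicate(entry):
            # ...and if it cannot keep up, demote it and promote an
            # asynchronous follower to the synchronous role.
            self.async_followers.append(self.sync_follower)
            self.sync_follower = self.async_followers.pop(0)
            self.sync_follower.replicate(entry)
        # Asynchronous followers are updated without waiting for an ack.
        for f in self.async_followers:
            f.replicate(entry)


a, b = Replica(), Replica()
leader = SemiSyncLeader(sync_follower=a, async_followers=[b])
leader.write("w1")
assert a.log == ["w1"] and b.log == ["w1"]

a.healthy = False              # the synchronous follower falls behind...
leader.write("w2")             # ...so b is promoted to the synchronous role
assert leader.sync_follower is b and b.log == ["w1", "w2"]
```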
Asynchronous Replication
- If the leader fails, any writes that have not yet been replicated to the followers are lost
- Acknowledged writes are therefore not guaranteed to be durable
- The benefit is that the leader can continue processing writes even if the followers have fallen behind
- This weakened-durability trade-off can benefit geographically distributed applications, where waiting for remote followers would add significant latency
- Losing acknowledged writes when a node fails can be a serious problem for some systems
- An example of a highly configurable asynchronous replication system is Azure Storage's 'Object Replication for Block Blobs' – https://docs.microsoft.com/en-us/azure/storage/blobs/object-replication-overview
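A toy illustration (not any real database's behaviour) of why asynchronous replication weakens durability: the leader acknowledges writes before the follower has received them, so a leader crash can lose acknowledged writes. The fixed `lag` here simulates a follower trailing behind:

```python
class FollowerNode:
    def __init__(self):
        self.log = []


class AsyncLeader:
    def __init__(self, follower, lag=1):
        self.log = []
        self.follower = follower
        self.lag = lag  # number of entries the follower trails behind

    def write(self, entry):
        # The write is acknowledged as soon as the leader records it...
        self.log.append(entry)
        # ...while replication happens in the background (simulated lag).
        while len(self.follower.log) < len(self.log) - self.lag:
            self.follower.log.append(self.log[len(self.follower.log)])


leader = AsyncLeader(FollowerNode(), lag=1)
for entry in ["w1", "w2", "w3"]:
    leader.write(entry)

# If the leader crashes now, "w3" was acknowledged but never replicated.
assert leader.follower.log == ["w1", "w2"]
```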
Setting Up New Followers
This applies both when adding new followers and when replacing failed ones.
How can you be sure a new follower has up-to-date data?
You could lock the database against writes until the follower has copied everything…
But then this is no longer a highly available system…
Fortunately, setting up a follower can be achieved without down time! Conceptually, the process goes like this:
- Take a consistent snapshot of the leader node at some point in time
- If possible, this should be done without taking a lock on an entire database
- Most databases have snapshot features
- If this is not the case, third-party backup tools can be used instead
- Copy the snapshot to the new follower node
- The follower then connects to the leader and requests all the data changes made since the snapshot was taken
- This requires the snapshot to be tied to an exact position in the leader's replication log:
- Postgres calls this the log sequence number
- MySQL calls this the binlog coordinates
- Once the follower has processed the backlog of changes, it has 'caught up' and can continue to process data changes from the leader as they happen
- This work can be done in an automated way, or manually by a database administrator
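The new-follower setup process can be sketched as below. This is a minimal model, assuming the leader keeps an append-only replication log; the integer log position stands in for Postgres's log sequence number or MySQL's binlog coordinates, and all names are illustrative:

```python
class LogLeader:
    def __init__(self):
        self.state = {}
        self.log = []  # append-only list of (key, value) changes

    def write(self, key, value):
        self.state[key] = value
        self.log.append((key, value))

    def snapshot(self):
        # A copy of the current state, tied to an exact log position.
        return dict(self.state), len(self.log)

    def changes_since(self, position):
        return self.log[position:]


class NewFollower:
    def __init__(self, snapshot_state, position):
        self.state = snapshot_state
        self.position = position

    def catch_up(self, leader):
        # Request every change made after the snapshot's log position.
        for key, value in leader.changes_since(self.position):
            self.state[key] = value
        self.position = len(leader.log)


leader = LogLeader()
leader.write("a", 1)
state, pos = leader.snapshot()
leader.write("b", 2)            # writes keep flowing during the copy
follower = NewFollower(state, pos)
follower.catch_up(leader)
assert follower.state == {"a": 1, "b": 2}
```

The key point the sketch shows is why the log position matters: writes that arrive while the snapshot is being copied are not lost, because the follower knows exactly where in the log to resume from.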
Node Outages (Followers and Leaders)
Possible reasons for follower outages:
- Unexpected faults (e.g. crashes or network interruptions)
- Rebooting nodes for security patch updates
How do you keep a system up and running during these follower/leader outages, whilst maintaining high availability?
Follower Failure
This can be handled with catch up recovery:
- On disk, the follower keeps a log of the data changes it has received from the leader
- If the follower crashes, or its network connection to the leader is interrupted:
- The follower can recover with ease: from its log it knows the last change it processed before the failure occurred
- It can therefore request from the leader all the data changes made while it was disconnected
- Once it has applied this backlog it has caught up, and it continues to receive data changes as before
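Catch-up recovery can be sketched as follows, assuming the follower durably records the offset of the last change it applied (a plain integer here standing in for an on-disk position; all names are made up for illustration):

```python
# The leader's change log, shared with followers (key, value pairs).
leader_log = [("x", 1), ("y", 2), ("z", 3), ("x", 4)]


class RecoveringFollower:
    def __init__(self):
        self.state = {}
        self.applied = 0  # persisted position of the last applied change

    def apply_available(self, log, upto):
        # Apply changes until a simulated crash / network interruption.
        while self.applied < upto:
            key, value = log[self.applied]
            self.state[key] = value
            self.applied += 1

    def recover(self, log):
        # After reconnecting, request everything since self.applied --
        # no snapshot needed, only the missed suffix of the log.
        self.apply_available(log, len(log))


f = RecoveringFollower()
f.apply_available(leader_log, 2)   # "crash" after two changes
f.recover(leader_log)              # resume from the recorded position
assert f.state == {"x": 4, "y": 2, "z": 3}
```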
Leader Failure
To highlight, leader failover can be challenging. A follower must be promoted to be the new leader, clients must be reconfigured to send their writes to the new leader, and the other followers must start consuming data changes from it.
This process is called leader failover:
- Manual Failover
- An administrator is notified that the leader has failed
- Admin undertakes the necessary steps to bring up the new leader
- Automatic Failover
- The system determines that the leader has failed
- Can be due to:
- Crashes
- Power outages
- Network issues
- Etc…
- Unfortunately, there is no foolproof way of detecting exactly what has gone wrong
- This is why timeouts are used (note that timeouts are typically disabled during planned maintenance)
- Choosing a New Leader
- Election process
- Usually the node with the most up-to-date data changes from the old leader wins the election
- This is to minimise any data loss
- Getting all the nodes to agree on a new leader is a consensus problem
- Reconfiguring the system to use the new leader
- Clients now need to send their write requests to the new leader
- If the old leader returns, it may think it is still the leader, not realising the other replicas have forced it to step down
- The system needs to ensure the old leader steps down and becomes a follower of the new leader
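The automatic failover steps above can be sketched as below: declare the leader dead after a heartbeat timeout, then elect the follower with the most up-to-date log position to minimise data loss. The timing values and names are illustrative, and a real election is a full consensus protocol rather than a single `max()`:

```python
TIMEOUT = 5  # seconds without a heartbeat before declaring the leader dead


class Node:
    def __init__(self, name, log_position):
        self.name = name
        self.log_position = log_position  # how far this replica has replicated


def detect_failure(last_heartbeat, now):
    # No foolproof detection exists, so a timeout is used as a proxy.
    return (now - last_heartbeat) > TIMEOUT


def elect_leader(followers):
    # Pick the follower with the highest replicated log position,
    # i.e. the most up-to-date copy of the old leader's data.
    return max(followers, key=lambda n: n.log_position)


followers = [Node("f1", 104), Node("f2", 117), Node("f3", 110)]
if detect_failure(last_heartbeat=100, now=110):
    new_leader = elect_leader(followers)
    # Clients must now be reconfigured to send writes to new_leader,
    # and the remaining followers must replicate from it.
    assert new_leader.name == "f2"
```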
Failover is fraught with things that can go wrong…
- If asynchronous replication is used
- The new leader may not have received all the writes from the old leader before it failed
- If the old leader rejoins, its unreplicated writes may conflict with writes the new leader has since accepted
- Conflicting writes may need to be discarded, which may be unacceptable for the durability requirements of some systems
- Another issue – two nodes both thinking they are the leader!
- This issue is called Split Brain 🧠
- If both leaders accept writes, data can be lost or corrupted
- As a solution:
- Some systems have a safety-catch mechanism to shut down one node if two leaders are detected
- This is known as fencing, or ‘shoot the other node in the head’ – although if not executed carefully, you can shoot both of them!
- Too long a timeout before a leader is declared dead
- Means a longer time to recover if the leader genuinely fails
- Too short a timeout
- Can trigger unnecessary failovers (e.g. during a temporary load spike or network glitch)
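One common way to defuse split brain is fencing with an epoch (fencing token): each newly elected leader is issued a monotonically increasing number, and storage rejects writes carrying a stale one, so a revived old leader cannot corrupt data even if it still believes it is in charge. A hedged sketch, with made-up names:

```python
class FencedStore:
    def __init__(self):
        self.data = {}
        self.current_epoch = 0

    def write(self, epoch, key, value):
        # Each elected leader carries a monotonically increasing epoch.
        if epoch < self.current_epoch:
            return False  # stale leader: the write is fenced off
        self.current_epoch = epoch
        self.data[key] = value
        return True


store = FencedStore()
assert store.write(1, "k", "old-leader")       # epoch-1 leader writes
assert store.write(2, "k", "new-leader")       # failover: epoch 2 elected
assert not store.write(1, "k", "split-brain")  # revived old leader rejected
assert store.data["k"] == "new-leader"
```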
There are no easy solutions to the issues failover can bring, which is why some operations teams choose to perform failovers manually, even when the software supports automatic failover.
Conclusion
Replication is an important process to understand, as many database technologies use it to achieve system scalability, in some cases across different geographical locations. As highlighted, though, this approach is not immune to the fundamental problems of distributed systems.
These must be handled accordingly, which first requires an understanding of the kinds of issues that can arise, whether node failures, unreliable networks, etc. Trade-offs between consistency, durability, availability and latency must then be made depending on the customer's requirements.
As a footnote, replication is something many database systems include in their feature set, so in reality most developers will be choosing an appropriate database technology as an off-the-shelf solution, rather than building these complex components from scratch.
A recommended starting point for deciding on a replication toolset:
https://www.trustradius.com/data-replication