What to consider with Replication and Multi-Data Centre Operations?

Leaderless replication is well suited to multi-data-centre operation, due to its ability to tolerate:

  • Concurrent writes
  • Network interruptions
  • Latency spikes

For instance, both Cassandra and Voldemort implement their multi-data-centre support within the normal leaderless model:

  • The number of replicas N includes nodes in all data centres
    • In the configuration you can specify how many of the N replicas you want in each data centre (see the sketch after this list)
  • Each write from a client is sent to all replicas (regardless of data centres)
    • But usually the client only waits for acknowledgment from a quorum of nodes within its local data centre
      • Therefore, it is unaffected by delays and interruptions on the cross-data-centre link
  • Now for the configuration…
    • The higher-latency writes to other data centres are often configured to happen asynchronously (although there is some flexibility in the configuration)
  • For example, Riak keeps all communication between clients and database nodes local to one data centre.
    • So for this, N describes the number of replicas within one data centre
    • Cross-data-centre replication between database clusters happens asynchronously in the background, in a style similar to multi-leader replication
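
To make that a bit more concrete, here is a rough sketch of how such a configuration could look. The class and parameter names are hypothetical, not the actual Cassandra or Voldemort settings:

```python
# Hypothetical multi-data-centre quorum settings -- not a real Cassandra or Voldemort API.
from dataclasses import dataclass

@dataclass
class MultiDcReplication:
    replicas_per_dc: dict[str, int]  # how many of the N replicas live in each data centre
    local_dc: str                    # the data centre this client is in

    @property
    def n(self) -> int:
        # N counts replicas across *all* data centres
        return sum(self.replicas_per_dc.values())

    def local_write_quorum(self) -> int:
        # The client only waits for a quorum of replicas in its own data centre,
        # so cross-data-centre delays do not block the write.
        local_replicas = self.replicas_per_dc[self.local_dc]
        return local_replicas // 2 + 1

config = MultiDcReplication(replicas_per_dc={"eu-west": 3, "us-east": 3}, local_dc="eu-west")
print(config.n)                     # 6 replicas in total, spread across both data centres
print(config.local_write_quorum())  # wait for 2 local acknowledgments; the rest arrive asynchronously
```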

Detecting concurrent writes

Dynamo-style databases allow several clients to write concurrently to the same key.

  • Which means conflicts can occur even if strict quorums are used
  • The situation is similar to multi-leader replication
    • Although with Dynamo-style databases, conflicts can additionally occur during read repair or hinted handoff
  • The problem is that events can arrive in a different order at different nodes, which could be due to:
    • Network delays
    • Partial failures
    • For example:
      • We have two clients A and B
        • Simultaneously writing to a key X
        • In a three-node datastore
        • Node 1 receives a write from A
          • But never receives the write from B due to a transient outage 😦
        • Node 2 first receives the write from A and then the write from B
        • Node 3 first receives the write from B and then the write from A
      • If each node simply overwrote the value for a key whenever it receives a write request from a client…
        • The nodes would become permanently inconsistent
      • Node 2 thinks the final value of X is B 🤷‍♂️
      • Whereas the other nodes think that the value is A (see the sketch after this list)
  • In order to become eventually consistent the replicas should converge towards the same value
    • How do they do that?
      • One might hope that replicated databases would handle this automatically… (unfortunately most implementations are quite poor at it)
      • If you want to avoid data loss
        • You, the application developer, need to know a lot about how your database handles conflicts internally
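
Here is a small sketch of that three-node example (a pure in-memory illustration, no real database involved), showing how simple overwriting leaves the replicas permanently inconsistent:

```python
# Three nodes that blindly overwrite the value for key "X" in whatever order writes arrive.
nodes = {"node1": {}, "node2": {}, "node3": {}}

def apply_write(node: str, key: str, value: str) -> None:
    nodes[node][key] = value  # simply overwrite: the last write applied wins on this node

# Client A's and B's writes arrive in different orders (B's write never reaches node1):
apply_write("node1", "X", "A")
apply_write("node2", "X", "A"); apply_write("node2", "X", "B")
apply_write("node3", "X", "B"); apply_write("node3", "X", "A")

print(nodes)
# {'node1': {'X': 'A'}, 'node2': {'X': 'B'}, 'node3': {'X': 'A'}}
# Node 2 disagrees with nodes 1 and 3, and nothing will ever reconcile them.
```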

In previous blog posts I have mentioned some techniques for conflict resolution.

Last write wins, discarding concurrent writes

One approach to achieving eventual convergence is to declare that each replica need only store the most recent value, and to allow older values to be overwritten and discarded.

  • Then as long as we have some way of unambiguously determining which write is more recent and every write is eventually copied to every replica
    • The replicas will eventually converge to the same value
    • ⚠️ However, this notion of “most recent” is quite misleading
  • In the earlier example, where the two clients simultaneously wrote to the three-node data store
    • Neither client knew about the other one when it sent its writes
      • So it is not clear which one happened first
      • In fact it really does not make sense to say either happened “first”
      • We say the writes are concurrent so the order is undefined
      • Even though the writes do not have a natural ordering
        • We can force an arbitrary order on them
          • For example we can attach a timestamp to each write
          • Pick the biggest timestamp as the most recent and discard any writes with earlier timestamps
          • This conflict resolution algorithm is called “last write wins” (LWW); see the sketch below
          • It is the only conflict resolution method supported in Cassandra
          • And an optional feature in Riak
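
As a rough sketch of the LWW idea (with timestamps attached by a hypothetical client; this is not Cassandra's actual implementation):

```python
# Each write carries a timestamp; the biggest timestamp is treated as "most recent" and wins.
def lww_merge(versions: list[tuple[float, str]]) -> str:
    _timestamp, value = max(versions, key=lambda v: v[0])
    return value  # every other (possibly concurrent) write is silently discarded

write_from_a = (1700000000.001, "A")  # (timestamp, value) attached by the client
write_from_b = (1700000000.002, "B")  # concurrent with A, but happens to get the bigger timestamp
print(lww_merge([write_from_a, write_from_b]))  # -> "B"; the write "A" is quietly lost
```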

Last Write Wins (LWW)

LWW achieves the goal of eventual convergence but at the cost of durability 👎

  • If there are several concurrent writes to the same key
  • Even if they were all reported as successful to the client
    • Because they were written to W replicas
      • Only one of the writes will survive
      • The others will be silently discarded
      • Moreover, LWW may even drop writes that are not concurrent
  • There are some situations such as caching where lost writes can be acceptable
  • If losing data is not acceptable LWW is a poor choice for conflict resolution
    • The only safe way to use a database with LWW
    • Is to ensure a key is only written once and thereafter treated as immutable
      • Thus, avoiding any concurrent updates to the same key
      • For example, a recommended way of using Cassandra is to use a UUID as the key
        • Thus giving each write operation a unique key (see the sketch below)
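
A tiny sketch of that pattern (the insert_event helper and the in-memory store are made up for illustration):

```python
import uuid

# Give every logical write its own UUID key, so each key is written exactly once and is
# immutable afterwards: LWW then never has two writes to the same key to reconcile.
def insert_event(store: dict, payload: str) -> str:
    key = str(uuid.uuid4())
    store[key] = payload
    return key

events: dict = {}
insert_event(events, "user signed up")
insert_event(events, "user signed up")  # same payload, but a fresh key: nothing gets overwritten
print(len(events))  # 2 entries -- no write was lost
```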

What are Sloppy Quorums and Hinted Handoffs?

Why are appropriately configured quorums useful?

Databases with appropriately configured quorums can tolerate the failure of individual nodes, which allows systems to:

  • Operate without the need for failover
  • Tolerate individual slow nodes, because requests do not have to wait for all N nodes (the quorum group) to respond
  • Return as soon as W (write) or R (read) nodes have responded (see the sketch below)
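
As a quick sketch of that arithmetic (just the reasoning, not any particular database's API):

```python
def quorum_properties(n: int, w: int, r: int) -> dict:
    return {
        # when w + r > n, every read quorum overlaps every write quorum in at least one node
        "quorums_overlap": w + r > n,
        # a write succeeds once W nodes acknowledge, so up to n - w nodes can be down or slow
        "write_tolerates": n - w,
        # a read succeeds once R nodes respond, so up to n - r nodes can be down or slow
        "read_tolerates": n - r,
    }

print(quorum_properties(n=3, w=2, r=2))
# {'quorums_overlap': True, 'write_tolerates': 1, 'read_tolerates': 1}
```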

These characteristics make leaderless replication appealing for use cases that:

  • Require high availability
  • Need low latency
  • Can tolerate occasional stale reads

“Achilles heel” of Quorums!

Although quorums can provide reliable outcomes for end users, they are not as fault tolerant as they could be. A network interruption can easily cut off a client from a large number of database nodes!

  • Although those nodes are alive, and other clients may be able to connect to them,
  • To a client that is cut off from them, those database nodes might as well be dead!
    • In this situation it is likely that fewer than W or R reachable nodes remain, so the client can no longer reach a quorum

Large Clusters?

In a large cluster, with many more than N nodes, it is likely that the client can still connect to some database nodes during a network interruption, just not to the nodes it needs to assemble a quorum for a particular value.

Database Designer Trade-offs?

👉 Is it better to return errors for all requests for which we cannot reach a quorum of W or R nodes?

Or… (Sloppy Quorum)

👉 Do we accept writes anyway, and write them to some nodes that are reachable, but that are not among the N nodes on which the value usually lives?

Sloppy Quorums

  • Writes and reads still require W and R successful responses
    • But those may include nodes that are not among the designated N “home” nodes for a value
    • Analogy example:
      1. Locking yourself out of your house…
      2. You knock on the neighbour's door and ask if you can stay at theirs for a bit
        • Once the network interruption is fixed 👨‍🔧
        • Any writes that a node temporarily accepted on behalf of another node are sent to the appropriate “home” nodes
        • This is called Hinted Handoff (see the sketch after this list)
      3. Continuing the analogy…
      4. Once you find your house keys, you go back home
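
To make the mechanics a little more concrete, here is a heavily simplified sketch of a sloppy-quorum write followed by hinted handoff. All of the function and node names are made up; real Riak or Cassandra internals are far more involved:

```python
def sloppy_quorum_write(key, value, home_nodes, reachable, w):
    # Prefer the designated "home" nodes for this key, but if too few of them are reachable,
    # borrow other reachable nodes (the neighbours) so we can still collect W acknowledgments.
    targets = [n for n in home_nodes if n in reachable]
    neighbours = [n for n in reachable if n not in home_nodes]
    while len(targets) < w and neighbours:
        targets.append(neighbours.pop(0))
    if len(targets) < w:
        raise RuntimeError("cannot reach even a sloppy quorum of W nodes")
    # Writes accepted by non-home nodes carry a "hint" recording where they really belong.
    hints = [(n, key, value) for n in targets if n not in home_nodes]
    return targets, hints

def hinted_handoff(hints):
    # Once the network interruption is fixed, each neighbour forwards the write it held
    # temporarily back to one of the appropriate home nodes.
    for neighbour, key, value in hints:
        print(f"{neighbour} hands ({key!r}, {value!r}) back to its home node")

targets, hints = sloppy_quorum_write(
    "X", "A", home_nodes=["n1", "n2", "n3"], reachable=["n3", "n4", "n5"], w=2
)
print(targets)   # ['n3', 'n4'] -- n4 is standing in for an unreachable home node
hinted_handoff(hints)
```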

When to use Quorums?

  • Sloppy quorums are particularly useful for increasing write availability
    • As long as any W nodes are available the database can accept writes 🤝
  • However this means even if W + R > N, you cannot be sure to read the latest value for a key
    • Because the latest value could have been temporarily written to some nodes outside of N

Final note

As we have discovered, a sloppy quorum is not really a quorum at all in the traditional sense. It is only an assurance of durability, in that the data is stored on W nodes somewhere. There is no guarantee that a read of R nodes will see it… 👎 at least not until the hinted handoff has completed.

Sloppy quorums are optional in Dynamo-style (NoSQL) databases. From my current research, they are enabled by default in Riak, whilst in Cassandra and Voldemort they are disabled by default.

What is Leaderless Replication?

Quickie on multi-leader and single-leader replication

Typically a single-leader or multi-leader replication approach is adopted, based on the concept that:

  • A client sends write requests to one node (the leader)
  • The database system takes care of copying that write to the other replicas
  • The leader determines the order in which writes should be processed
  • The followers apply the writes in the same order

Although… some data storage systems take another approach…

The leaderless replication setup

As the name implies, this means abandoning the concept of a leader. So this means:

  • Allowing any replica to accept writes from clients
  • Some of the earliest replicated systems were leaderless
  • The idea was largely forgotten during the era when relational databases (RDBMS) dominated…

Don't lose hope: it has once again become an attractive architecture for databases. For instance:

  • After Amazon used it for its in-house Dynamo system, adoption has grown
  • Riak, Voldemort, and Cassandra are open-source data stores with leaderless replication models inspired by Dynamo
  • This kind of database is therefore called Dynamo-style

In some leaderless replication designs the client sends its writes directly to several replicas:

  • While in others, a co-ordinator node does this on behalf of the client…
  • However, unlike in a leader-based database, that co-ordinator does not enforce a particular ordering of writes (see the sketch below)
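
A bare-bones sketch of that write path (an in-memory toy, not any specific database): the client, or a co-ordinator acting on its behalf, sends the write to every replica and simply counts acknowledgments, without imposing any ordering:

```python
# Toy leaderless write path: fan the write out to every replica and count acknowledgments.
def leaderless_write(replicas: list, key: str, value: str, w: int) -> bool:
    acks = 0
    for replica in replicas:       # any replica accepts writes; there is no leader
        try:
            replica[key] = value   # in reality this is a network call that may fail or time out
            acks += 1
        except Exception:
            pass                   # unreachable replicas are simply skipped
    return acks >= w               # the write succeeds once W replicas have acknowledged it

replicas = [{}, {}, {}]            # three in-memory "nodes"
print(leaderless_write(replicas, "X", "hello", w=2))  # True
```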

As we shall learn in the next blog post, this difference in design has profound consequences for the way the database is used.