Sharding: What it is and why many database architectures rely on it

In this post I will breakdown how sharding works with it advantages and trade offs. In addition, I also aim to assist you with why you may what to consider this powerful database partitioning technique.

As of this writing, sharding has been notably broadcasted to the mainstream media, that’s if you have been following the wild cryptocurrency news, especially with some scalability challenges they face! As the blockchain ETH and many others are pursing this technique in its anticipated release of ETH 2.0. The reason to mention this, is that this technique is being chosen to resolve a huge scalability problem for the latest upcoming blockchain technologies. This optimisation technique requires taking seriously, as it has been knocking around for a while, and many databases that seek to achieve scalability have chosen to utilise this technique with successful results.

To note, we are specifically talking about horizontal sharding, not vertical sharding! This is because this is much more popular and more applicable for large scale problems.

What is (horizontal) Sharding?

  • Optimisation technique
  • Instead of adding resources to a database, you are partitioning this up into smaller databases
  • Enables a system to horizontally scale as much as it requires

Alternative options to Sharding?

  • Vertical scaling
    • 🤔 Scaling up your hardware
    • 🤔 Adding resources to your instance
    • ❌ Can seem easier, although can become more expensive
    • ❌ Diminishing returns on performance, as it does not scale linearly
    • 🌐 For more info on this: Vertical scaling vs Horizontal scaling
  • Node Replication (asynchronous)
    • 🤔 Can consist of making copy’s of your database
    • 🤔 Utilises leader nodes (writable and readable) and follower nodes (readable)
    • ❌ Delayed propagation, nodes eventually update (eventual consistency)
    • ❌ Can result in stale data, adding more replicas can add to problem
  • And then there is Sharding…
    • We separate the databases out with different data
    • Different database instances
    • Multiple leader (master) nodes
    • Horizontal partitioning

How does Sharding work?

Example A
  • Example A – The partitioning key needs to be predictable (customer id incremental)
Example B
  • Example B – How do you know what shard your customer data is on? (Shard 1 or Shard 2)
    • You will need to introduce some complexity using an algorithmic approach, which can encompasses hashing or some type of predictable calculation to find the partition keys
    • What happens if a shard goes down?
      • Your system will fail to make queries on the shard affected
      • The live shards will still be able to process writes and reads
    • How do we determine which shard to query?
      • We require an intermediary layer
        • Responsible for looking up (algorithmic approach) which shard each customer is on and then creating the request for that (on the routing layer i.e. RESTful endpoints) within it to find the shard
        • Issue with this is that we will need to create a new layer of complexity that can also handle node failures, single points of failure etc…
        • This can also be expensive to maintain / build a system like this

When is it worth doing Sharding?

Pros

  • 🚀 Scalability
    • you can always truncate your data horizontally to sale to the problem (infinite basis)
  • 🚀 High Availability
    • If one database goes down, only a small subset of your customers are affected… but at least you can still stream traffic for your other customers
  • 🚀 Robust Fault Tolerance
    • Very unlikely that your entire system will fail at once, unless there is a consistent error across the application to cause this

Cons

  • 👎 Added Complexity:
    • Partition mapping
      • There are two approaches to retrieve mapped partitions
        1. Algorithmic approach – calculating the shard’s partition ID
        2. Centralised database – that keeps track of all the partition IDs that keep an index of where the customers are located on each shard
    • Creating routing layer
      • you may have to build this layer as there are not many solutions that provide this
    • None uniformity of data
      • Having to split your data evenly. This can be achieved with hashing functions:
        • For example, if you have 100 rows or 100 “pieces” of data you want to split this evenly 50/50. Depending on your application, there maybe a customer that has much high data storage demands…
        • What can eventually happen… is that a shard could disproportionately grow larger that the other shards, this means the process of reshuffling or resharding will have to take place to resolve this
  • 👎 Resharding
    • Not an easy task to fulfil as the data will need to be separated out elegantly the across a system
  • 👎 Analytical type queries
    • This can be an expensive operation across shard network as this requires scatter and gather meaning that is may have to query across different shards

Conclusion

To conclude, from this initial analysis of sharding, it can be said that this is an important partitioning technique to consider and understand, when designing your database/application architecture, especially if the latest upcoming blockchain technologies are implementing this. Although, it does come with some initial issues to overcome, there are intuitive solutions out there that have managed these areas, depending on there unique data requirements.

Thanks for reading. Hope this was quick and useful read. Please leave a comment if you have questions on this or any additional inputs.

📚 Further Reading & Related Topics

If you’re exploring sharding in database architectures, these related articles will provide deeper insights:

• Understanding Partitioning: Proportional to Nodes – Learn how partitioning works as a complementary strategy to sharding, optimizing data distribution across nodes in distributed systems to ensure performance and scalability.

• Replication vs. Sharding: Key Differences in Distributed Databases – Dive deeper into how sharding differs from other data distribution techniques like replication and clustering, and how these strategies improve scalability and availability in large systems.

One response to “Sharding: What it is and why many database architectures rely on it”

  1. What is Partitioning? – Scalable Human Avatar

    […] require something that can divide the data up into partitions… aka sharding (previously highlighted in previous blog […]

    Like

Leave a reply to What is Partitioning? – Scalable Human Cancel reply

I’m Sean

Welcome to the Scalable Human blog. Just a software engineer writing about algo trading, AI, and books. I learn in public, use AI tools extensively, and share what works. Educational purposes only – not financial advice.

Let’s connect