Say you have a large amount of data, and you want to partition it…
How do you decide which records to store on which nodes? 🤔
- Our goal with partitioning is to spread the data and query load evenly across nodes
- If every node takes an equal burden…
- In theory, 10 nodes should be able to handle 10x as much data and 10x the load (read/write throughput) of a single node (ignoring replication for now) ✅
Unfair partitioning (Skew) 👎
If the partitioning is unfair, so that some partitions have more data or queries than others…
- We call it skew
- The presence of skew makes partitioning much less effective 👎
- In an extreme case all the load could end up on 1 partition! ❌
- So 9 out of 10 nodes are idle
- The single busy node becomes your bottleneck
- A partition with disproportionately high load is called a hotspot (sketched below)
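As a rough illustration, here's a minimal sketch of what a hotspot looks like: one partition carrying far more than its fair share of the load. The partition names and request counts are made up, and the 2x-average threshold is arbitrary.

```python
# Hypothetical example: requests served by each of 10 partitions.
# Partition "p3" is the hotspot — it takes almost all of the load.
requests_per_partition = {
    "p0": 120, "p1": 95, "p2": 110, "p3": 9800, "p4": 105,
    "p5": 98, "p6": 130, "p7": 90, "p8": 115, "p9": 102,
}

average = sum(requests_per_partition.values()) / len(requests_per_partition)

# Flag any partition carrying far more than its fair share (threshold is arbitrary).
hotspots = {
    partition: load
    for partition, load in requests_per_partition.items()
    if load > 2 * average
}

print(f"average load per partition: {average:.0f}")
print(f"hotspots: {hotspots}")  # -> {'p3': 9800}
```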
Avoiding hot spots
The simplest approach to avoiding hotspots would be to:
- Assign records to nodes randomly, which distributes the data fairly evenly across nodes ✅
Big downside:
- When you are trying to read a particular item
- You have no way of knowing which node it is on
- So you have to query all of the nodes in parallel!! 🤯 (sketched below)
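To make that trade-off concrete, here's a minimal sketch assuming a toy "cluster" of 10 in-memory dicts (the node count and function names are just for illustration): writes land on a random node, so a read has no choice but to ask every node.

```python
import random

NUM_NODES = 10
# Toy "cluster": each node is just an in-memory dict.
nodes = [dict() for _ in range(NUM_NODES)]

def write(key, value):
    # Random placement spreads the data evenly across nodes...
    nodes[random.randrange(NUM_NODES)][key] = value

def read(key):
    # ...but now we have no idea which node holds the key,
    # so every node has to be queried (sequentially here;
    # in parallel in a real system).
    for node in nodes:
        if key in node:
            return node[key]
    return None

write("user:42", {"name": "Ada"})
print(read("user:42"))  # -> {'name': 'Ada'}
```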
We can do better….
Let’s assume we have a key-value data model, in which you always access a record via its primary key.
In the next blog post I will cover partitioning by key range, which builds on this assumption.
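As a teaser for why that assumption matters: if record placement is a deterministic function of the primary key instead of being random, a client can compute exactly which node to ask. The sketch below uses an arbitrary hash-based placeholder for that function; how key-range partitioning actually chooses it is the topic of the next post.

```python
import hashlib

NUM_NODES = 10
nodes = [dict() for _ in range(NUM_NODES)]

def partition_for(key):
    # Any deterministic function of the primary key works here;
    # this hash-based choice is just a placeholder for illustration.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_NODES

def write(key, value):
    nodes[partition_for(key)][key] = value

def read(key):
    # Because placement is derived from the key, a read touches
    # exactly one node instead of all of them.
    return nodes[partition_for(key)].get(key)

write("user:42", {"name": "Ada"})
print(read("user:42"))  # -> {'name': 'Ada'}
```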
📚 Further Reading & Related Topics
If you’re exploring partitioning strategies for key-value stores in distributed systems, these related articles will provide deeper insights:
• How Does Partitioning Work When Requests Are Being Routed? – Learn how partitioning affects request distribution and system scalability in distributed environments.
• Distributed Data-Intensive Systems: Logical Log Replication – Understand how replication strategies ensure data consistency across partitions in key-value storage architectures.