What is Partitioning of Key-Value Data?

Say you have a large amount of data, and you want to partition it…

How do you decide which records to store on which nodes? 🤔

  • Our goal with partitioning is to spread data and query load evenly across nodes
    • If every node takes an equal burden…
      • In theory, 10 nodes should be able to handle 10x as much data and 10x the read and write throughput of a single node (ignoring replication for now) ✅

Unfair partitioning (Skew) 👎

If the partitioning is unfair, so that some partitions have more data or queries than others…

  • We call it skew
    • The presence of skew makes partitioning much less effective 👎
    • In an extreme case all the load could end up on 1 partition! ❌
      • So 9 out of 10 nodes are idle
      • The single busy node is your bottleneck
      • A partition with disproportionately high load is called a hotspot
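
To make skew concrete, here is a minimal sketch that counts how many records each node receives and compares the busiest node to the ideal even share. The helper names (`partition_loads`, `skew_ratio`), the 1,000 integer keys, and the two assignment functions are assumptions made up for illustration, not part of any real database.

```python
from collections import Counter


def partition_loads(keys, num_nodes, assign):
    """Count how many records each node receives under an assignment function."""
    counts = Counter(assign(key, num_nodes) for key in keys)
    return [counts.get(node, 0) for node in range(num_nodes)]


def skew_ratio(loads):
    """Busiest node's load divided by the ideal even share (1.0 means perfectly even)."""
    ideal = sum(loads) / len(loads)
    return max(loads) / ideal


# Hypothetical workload: 1,000 records with integer keys.
keys = list(range(1000))

# Fair assignment: every node gets the same number of records.
even_loads = partition_loads(keys, 10, lambda key, n: key % n)

# Extreme skew: every record lands on node 0, so 9 out of 10 nodes are idle.
hot_loads = partition_loads(keys, 10, lambda key, n: 0)

print(even_loads, skew_ratio(even_loads))
print(hot_loads, skew_ratio(hot_loads))
```

In the even case the ratio is 1.0. In the extreme case node 0 carries ten times its fair share, which is exactly the hotspot described above.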

Avoiding hotspots

The simplest approach to avoiding hotspots would be to:

  • Assign records to nodes randomly, which distributes the data fairly evenly across nodes ✅

Big downside:

  • When you are trying to read a particular item
    • You have no way of knowing which particular node it is on
    • So you have to query all the nodes in parallel!! 🤯
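
A rough sketch of that trade-off, assuming a toy in-memory store where each "node" is just a Python dict. The `RandomStore` class and the `user:…` keys are hypothetical and only here for illustration. Writes pick a node at random, so a read has to ask every node.

```python
import random
from concurrent.futures import ThreadPoolExecutor


class RandomStore:
    """Toy store: each 'node' is a plain dict (an assumption for illustration)."""

    def __init__(self, num_nodes, seed=0):
        self.nodes = [{} for _ in range(num_nodes)]
        self.rng = random.Random(seed)

    def put(self, key, value):
        # Pick a node at random; nothing records where the key went.
        # Assumes each key is written only once.
        self.rng.choice(self.nodes)[key] = value

    def get(self, key):
        # We cannot know which node holds the key, so query all of them in parallel.
        with ThreadPoolExecutor(max_workers=len(self.nodes)) as pool:
            results = list(pool.map(lambda node: node.get(key), self.nodes))
        nodes_queried = len(results)
        value = next((r for r in results if r is not None), None)
        return value, nodes_queried


store = RandomStore(num_nodes=10)
for i in range(1000):
    store.put(f"user:{i}", {"id": i})

value, nodes_queried = store.get("user:42")
print(value, nodes_queried)  # prints {'id': 42} 10
```

The data ends up spread fairly evenly, but a single-key read touches all 10 nodes, so read cost grows with the size of the cluster.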

We can do better…

Let’s assume we have a key-value data model, in which you always access a record by its primary key.

In the next blog post I will cover partitioning by key range, which builds on this assumption.

