Understanding Partitioning Proportional to Nodes

With dynamic partitioning the number of partitions is proportional to the size of the dataset.

This is because the splitting and merging processes keep the size of each partition between some fixed minimum and maximum.

  • With a fixed number of partitions, on the other hand, the size of each partition is proportional to the size of the dataset ✅
  • In both of these cases the number of partitions is independent of the number of nodes (a small sketch contrasting the two follows below)
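To make the contrast concrete, here is a minimal Python sketch. The numbers (1,000 partitions, a 10 GB maximum partition size) are purely illustrative assumptions, not defaults from any real system. Notice that neither calculation involves the number of nodes.

```python
import math

def fixed_number_partition_size(dataset_gb: float, num_partitions: int = 1000) -> float:
    """Fixed-number partitioning: the partition count never changes,
    so each partition's size grows in proportion to the dataset."""
    return dataset_gb / num_partitions

def dynamic_partition_count(dataset_gb: float, max_partition_gb: float = 10.0) -> int:
    """Dynamic partitioning: partitions split once they exceed a maximum size,
    so the partition count grows in proportion to the dataset."""
    return math.ceil(dataset_gb / max_partition_gb)

# Neither calculation mentions the number of nodes.
for dataset_gb in (100, 1_000, 10_000):
    print(f"{dataset_gb:>6} GB -> "
          f"fixed count: {fixed_number_partition_size(dataset_gb):.1f} GB per partition | "
          f"dynamic: {dynamic_partition_count(dataset_gb)} partitions")
```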

But there is another option

A third option is used by Cassandra

  • The idea is to make the number of partitions proportional to the number of nodes (in other words, a fixed number of partitions per node)
  • In this case the size of each partition grows proportionally to the dataset size while the number of nodes remains unchanged

What happens when you increase the number of nodes?

  • 👉 When you increase the number of nodes, the partitions become smaller again!
    • Since a larger data volume generally requires a larger number of nodes to store it on ✅
    • This approach therefore keeps the size of each partition fairly stable ✅ (see the sketch below)
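Here is a minimal Python sketch of that stability. It assumes, purely as an illustration, that the dataset and the cluster grow together, and uses Cassandra's default of 256 partitions per node.

```python
PARTITIONS_PER_NODE = 256  # Cassandra's default number of partitions (vnodes) per node

def partition_size_gb(dataset_gb: float, num_nodes: int) -> float:
    """With partitions proportional to nodes, adding nodes adds partitions,
    which shrinks each partition back down as the dataset grows."""
    return dataset_gb / (num_nodes * PARTITIONS_PER_NODE)

# A growing dataset paired with a growing cluster: size per partition stays stable.
for dataset_gb, num_nodes in [(1_000, 4), (2_000, 8), (4_000, 16)]:
    print(f"{dataset_gb} GB on {num_nodes} nodes -> "
          f"{partition_size_gb(dataset_gb, num_nodes):.2f} GB per partition")
```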

What happens when a new node joins the cluster?

  • When a new node joins the cluster…
    • It randomly chooses a fixed number of existing partitions to split
    • And then it takes ownership of one half of each of those split partitions
    • While leaving the other half of each partition in place (a minimal sketch of this join step follows below)
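As a rough illustration (this is not Cassandra's actual implementation, and the tiny cluster and two splits per join are just for readability), the join step might look something like this:

```python
import random

# Each partition is (start, end, owner) over a hash range of 0 .. 2**32.
partitions = [(i * 2**30, (i + 1) * 2**30, f"node-{i % 2}") for i in range(4)]

def node_joins(partitions, new_owner, splits_per_join=2, rng=random):
    """The new node picks some existing partitions at random, splits each in half,
    and takes ownership of one half while the old owner keeps the other half."""
    chosen = set(rng.sample(range(len(partitions)), k=splits_per_join))
    result = []
    for i, (start, end, owner) in enumerate(partitions):
        if i in chosen:
            mid = (start + end) // 2
            result.append((start, mid, owner))      # old owner keeps one half
            result.append((mid, end, new_owner))    # new node takes the other half
        else:
            result.append((start, end, owner))      # untouched partitions stay put
    return result

partitions = node_joins(partitions, "node-2")
for start, end, owner in partitions:
    print(f"[{start:>10}, {end:>10}) -> {owner}")
```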

Randomly chooses?!

The randomisation can produce an unfair split 👎

But when this is averaged over a larger number of partitions (256 partitions per node by default in Cassandra), a new node ends up taking a fair share of the load from the existing nodes ✅
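To see why the averaging helps, here is a toy simulation (not a measurement of Cassandra, and with made-up, uneven partition sizes). It repeats the random-split join many times and shows how widely the new node's share of the data swings depending on how many partitions each node holds: with 256 partitions per node the share clusters tightly around the expected value instead of depending on the luck of a single split.

```python
import random
import statistics

def new_node_share(partitions_per_node, existing_nodes=10, rng=None):
    """One simulated join: the new node splits `partitions_per_node` randomly chosen
    partitions (of uneven sizes) and takes half of each; return its share of the data."""
    rng = rng or random.Random()
    sizes = [rng.random() for _ in range(existing_nodes * partitions_per_node)]
    chosen = rng.sample(range(len(sizes)), k=partitions_per_node)
    return sum(sizes[i] / 2 for i in chosen) / sum(sizes)

for per_node in (1, 8, 256):
    rng = random.Random(42)
    shares = [new_node_share(per_node, rng=rng) for _ in range(200)]
    print(f"{per_node:>3} partitions per node: new node's share ranges from "
          f"{min(shares):.3f} to {max(shares):.3f} (mean {statistics.mean(shares):.3f})")
```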

Cassandra 3.0 introduced an alternative rebalancing algorithm that avoids unfair splits…

Picking the partition boundaries randomly

  • Requires the use of hash-based partitioning
    • So the boundaries can be picked from the range of numbers produced by the hash function (a minimal sketch follows below)
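As a rough illustration (this is not Cassandra's actual token allocation; the 32-bit range, MD5, and eight boundaries are all arbitrary choices for the sketch), hash-based partitioning with randomly picked boundaries might look like this:

```python
import bisect
import hashlib
import random

HASH_RANGE = 2**32  # pretend our hash function produces 32-bit numbers

def key_hash(key: str) -> int:
    """Hash a key into the 0 .. 2**32 range."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")

# Randomly chosen boundaries split the hash range into partitions.
random.seed(1)
boundaries = sorted(random.randrange(HASH_RANGE) for _ in range(8))

def partition_for(key: str) -> int:
    """Route a key to the partition whose boundary interval contains its hash
    (wrapping the top of the range back around to partition 0, ring-style)."""
    return bisect.bisect_right(boundaries, key_hash(key)) % len(boundaries)

for key in ("alice", "bob", "carol"):
    print(key, "-> partition", partition_for(key))
```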

Indeed this approach corresponds most closely with the original definition of consistent hashing (mentioned earlier in the blog). Newer hash functions can achieve a similar effect with lower metadata overhead.

📚 Further Reading & Related Topics

If you’re exploring partitioning strategies and scalability in distributed systems, these related articles will provide deeper insights:

• What is Dynamic Partitioning? – Learn how dynamic partitioning adapts to changing workloads and improves performance across distributed nodes.

• How Does Partitioning Work When Requests Are Being Routed? – Explore how requests are distributed across partitions and the role of node-aware partitioning in efficient system scaling.
