What is Partitioning and Why is it Important?

In my previous blog posts, I covered the different ways in which large datasets are partitioned into smaller subsets. All of this content is derived from Martin Kleppmann's “Designing Data-Intensive Applications”. In this post I will summarise all the topics covered, with their respective links for further detail if required.

Why do we have Partitioning?

Partitioning is necessary when you have so much data that storing and processing on a single machine is no longer feasible!

The goals of partitioning…
- To spread the data and the query load evenly across multiple machines.
- To avoid hotspots (nodes with disproportionately high load).

How to choose the appropriate Partitioning scheme for you!

Meeting these goals requires choosing a partitioning scheme that is appropriate to your data, and rebalancing the partitions when nodes are added to or removed from the cluster.

Two main approaches to partitioning:

Key range partitioning

Keys are sorted, and each partition owns all the keys from some minimum up to some maximum.

✅ Sorting has the advantage that efficient range queries are possible
❌ But there is a risk of hotspots
- If the application often accesses keys that are close together in a sorted order
🛠 In this approach partitions are typically rebalanced dynamically by splitting the range into two subranges when a partition gets too big.
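
To make the lookup and the dynamic split concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the class name `KeyRangePartitioner`, the `max_keys` threshold and splitting at the median key are hypothetical choices, not how any particular database does it.

```python
import bisect


class KeyRangePartitioner:
    """Toy key-range partitioner (illustrative only).

    Partition i owns the keys in [boundaries[i-1], boundaries[i]).
    """

    def __init__(self, boundaries, max_keys=4):
        self.boundaries = list(boundaries)  # sorted split points between partitions
        self.max_keys = max_keys            # hypothetical "too big" threshold
        self.partitions = [[] for _ in range(len(self.boundaries) + 1)]

    def partition_for(self, key):
        # A key equal to a boundary belongs to the partition on its right.
        return bisect.bisect_right(self.boundaries, key)

    def insert(self, key):
        idx = self.partition_for(key)
        part = self.partitions[idx]
        pos = bisect.bisect_left(part, key)
        if pos < len(part) and part[pos] == key:
            return  # keys are assumed unique in this sketch
        part.insert(pos, key)
        if len(part) > self.max_keys:
            self._split(idx)

    def _split(self, idx):
        keys = self.partitions[idx]
        mid = len(keys) // 2
        # The median key becomes the boundary between the two new subranges.
        self.boundaries.insert(idx, keys[mid])
        self.partitions[idx:idx + 1] = [keys[:mid], keys[mid:]]

    def range_query(self, lo, hi):
        # Only the partitions that overlap [lo, hi] are visited.
        first, last = self.partition_for(lo), self.partition_for(hi)
        return [k for part in self.partitions[first:last + 1]
                for k in part if lo <= k <= hi]
```

Because the keys stay sorted, `range_query` only touches the partitions whose ranges overlap the query. The same sortedness is what lets a run of adjacent keys pile up on one partition and create a hotspot.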

Hash partitioning

A hash function is applied to each key, and each partition owns a range of hashes.

❌ This method destroys the ordering of keys
- Making range queries inefficient…
✅ But it may distribute load more evenly

When partitioning by hash, it is common to create a fixed number of partitions in advance and then:

⚙️ To assign several partitions to each node
⚙️ To move entire partitions from one node to another when nodes are added or removed
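
As a rough sketch of the fixed-partition idea in Python: the number of partitions, the use of SHA-256 with a modulo (a simple stand-in for "each partition owns a range of hashes") and the helper names `partition_of`, `initial_assignment` and `add_node` are all my own assumptions for illustration.

```python
import hashlib
from collections import Counter

NUM_PARTITIONS = 8  # fixed in advance; the value is a hypothetical choice


def partition_of(key: str) -> int:
    # Use a stable hash so every process agrees on the result.
    # (Python's built-in hash() for strings is salted per process.)
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS


def initial_assignment(nodes):
    # Several partitions per node, handed out round-robin.
    return {p: nodes[p % len(nodes)] for p in range(NUM_PARTITIONS)}


def add_node(assignment, new_node):
    """Move whole partitions to new_node until it holds a fair share.

    The key -> partition mapping never changes; only partition -> node does.
    """
    assignment = dict(assignment)
    node_count = len(set(assignment.values()) | {new_node})
    fair_share = NUM_PARTITIONS // node_count
    moved = []
    while list(assignment.values()).count(new_node) < fair_share:
        counts = Counter(n for n in assignment.values() if n != new_node)
        donor = counts.most_common(1)[0][0]  # busiest existing node
        partition = min(p for p, n in assignment.items() if n == donor)
        assignment[partition] = new_node
        moved.append(partition)
    return assignment, moved
```

With three nodes and eight partitions, adding a fourth node moves just two partitions in this sketch, and no key ever changes which partition it belongs to.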

Dynamic partitioning can be used, and hybrid approaches are also possible!

✅ For example a compound key
- Using one part of the key to identify the partition and the other part for the sort order!
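
Here is a small Python sketch of the compound key idea, assuming a hypothetical key of `(user_id, timestamp)`: the `user_id` is hashed to pick the partition, and rows inside the partition are kept sorted by the full key. The names `Partition`, `put` and `range_scan` are made up for this example.

```python
import bisect
import hashlib

NUM_PARTITIONS = 4  # hypothetical value


def partition_of(partition_key: str) -> int:
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS


class Partition:
    def __init__(self):
        self.keys = []    # sorted list of (user_id, timestamp)
        self.values = {}  # (user_id, timestamp) -> value


partitions = [Partition() for _ in range(NUM_PARTITIONS)]


def put(user_id, timestamp, value):
    # Only the first part of the key decides which partition is used.
    part = partitions[partition_of(user_id)]
    key = (user_id, timestamp)
    if key not in part.values:
        bisect.insort(part.keys, key)  # the second part gives the sort order
    part.values[key] = value


def range_scan(user_id, start_ts, end_ts):
    # One partition, one contiguous slice: an efficient range query.
    part = partitions[partition_of(user_id)]
    lo = bisect.bisect_left(part.keys, (user_id, start_ts))
    hi = bisect.bisect_right(part.keys, (user_id, end_ts))
    return [(k[1], part.values[k]) for k in part.keys[lo:hi]]
```

A range query over all users is still not efficient, but a query for one user's rows within a time range only has to read a single partition.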

What about secondary indexes? What happens with them?

A secondary index is just another means of accessing the record(s) you want without using the primary key. Secondary indexes also need to be partitioned, and there are two methods:

Document-partitioned indexes or local indexes
- Where the secondary indexes are stored in the same partition as the primary key and value
  - ✅ This means only a single partition requires updating on write (faster write)
  - ❌ But a read of a secondary index requires a scatter/gather across all partitions (slower read)
Term-partitioned indexes or global indexes
- Where the secondary indexes are partitioned separately using the indexed values
- Any entry in the secondary index may include records from all partitions of the primary key
  - ❌ When a document is written, several partitions of the secondary index may need to be updated! (slower write)
  - ✅ However, a read can be served from a single partition (faster read)
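
The trade-off can be sketched with two toy stores in Python. Everything here is an assumption made for illustration: the class names, the `"field:value"` term format, and the toy partitioning functions. Deletes and updates of existing documents are left out.

```python
from collections import defaultdict

NUM_PARTITIONS = 3  # hypothetical value


def doc_partition(doc_id: int) -> int:
    return doc_id % NUM_PARTITIONS  # toy primary-key partitioning


def term_partition(term: str) -> int:
    return sum(term.encode("utf-8")) % NUM_PARTITIONS  # toy term partitioning


def terms_of(fields):
    return [f"{name}:{value}" for name, value in fields.items()]


class LocalIndexStore:
    """Document-partitioned: each partition indexes only its own documents."""

    def __init__(self):
        self.index = [defaultdict(set) for _ in range(NUM_PARTITIONS)]

    def write(self, doc_id, fields):
        p = doc_partition(doc_id)
        for term in terms_of(fields):
            self.index[p][term].add(doc_id)
        return {p}  # index partitions touched: always exactly one

    def find(self, term):
        hits = set()
        for p in range(NUM_PARTITIONS):  # scatter/gather over every partition
            hits |= self.index[p].get(term, set())
        return hits, NUM_PARTITIONS  # (matches, partitions read)


class GlobalIndexStore:
    """Term-partitioned: the index is partitioned by the indexed value."""

    def __init__(self):
        self.index = [defaultdict(set) for _ in range(NUM_PARTITIONS)]

    def write(self, doc_id, fields):
        touched = set()
        for term in terms_of(fields):
            t = term_partition(term)
            self.index[t][term].add(doc_id)
            touched.add(t)
        return touched  # may be several index partitions

    def find(self, term):
        t = term_partition(term)
        return set(self.index[t].get(term, set())), 1  # a single partition read
```

Both stores return the same matches for a term. The difference is how many index partitions a write touches and how many a read has to visit.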

Summary

A previous blog post also discussed routing queries to the appropriate partition, with techniques ranging from simple partition-aware load balancing to sophisticated parallel query execution engines!

By design, every partition operates mostly independently…

👉 That is what allows a partitioned database to scale!
- However, operations that need to write to several partitions can be difficult to reason about, for example:
  - What happens if a write to one partition succeeds?
  - But another fails?

These questions will be addressed in the upcoming blog posts. Thanks for reading and ignoring my typos 🙂

📚 Further Reading & Related Topics

If you’re exploring partitioning, request routing, and distributed systems, these related articles will provide deeper insights:

• Load Balancing Algorithms Every Developer Should Know – Learn how different load balancing strategies complement partitioning techniques to optimize request distribution.

• Distributed Data-Intensive Systems: Logical Log Replication – Understand how replication mechanisms ensure data consistency across partitions, reinforcing the principles behind efficient request routing.
