How to Partition with Secondary Indexes by Document

Conceptualise a situation where you are operating a website for used cars.

🚗 🏎 🚕 🚓 🚘 🚖 🚔 🚙

Each listing has a unique ID (call it the document ID)
You partition the database by document ID
For example:
- IDs 0-499 = partition 0
- IDs 500-999 = partition 1
- And so on…

The need for the secondary index…

You want to let users search for cars, allowing them to filter by colour and by make.

So you require a secondary index on colour and make
- In a document database they would be fields
- In a relational database they would be columns

If you have declared the index, the database can perform the indexing automatically
- For example, whenever a red car is added to the database
  - The database automatically adds it to the lost of document ids for the index entry where the colour equals red

Secondary index relationships with partitions

In the indexing approach, each partition is completely separate.

👉 Each partition maintains its own secondary indexes covering only the documents in that partition
👉 It does not care what data is stored in the other partitions
👉 Whenever you need to write to the database to add remove or update a document you only need to deal with the partition that contains the document ID that you are writing
- For that reason, a document partition index is also known as a local index as opposed to a global index

Challenge of reading a document partition by index

It requires care reading from a document partition index. Unless you have done something special with the document IDs…

❌ There is no reason why all the cars with a particular colour or make would be in the same partition.

Red cars could appear in both partition 0 and partition 1 🤔
You need to send the query to all partitions and combine all the results you get back
- This approach to querying a partitioned database is sometimes known as “scatter gather“

“Scatter Gather”

This combining of results from what you have queried from each partition, makes read queries quite expensive!

Even if you query the partitions in parallel…

“Scatter gather” is prone to tail latency amplification
- In which the chances of getting a slow call increases if an end user request requires multiple backend calls!

Nevertheless… it is widely used by MongoDB, Riak, Cassandra, Elasticsearch and VoltDB as they use document partition secondary indexes.

Most database vendors recommend you structure your partitioning scheme so that secondary index queries can be served by a single partition
But that is not always possible especially when you are using multiple secondary indexes in a single query!
Such as filtering cars by colour and by make at the same time

📚 Further Reading & Related Topics

If you’re exploring partitioning with secondary indexes in document databases, these related articles will provide deeper insights:

• What is Dynamic Partitioning? – Learn how dynamic partitioning optimizes data distribution and indexing in document-based databases.

• Understanding Partitioning Proportional to Nodes – Explore how partitioning strategies scale dynamically based on system architecture and workload distribution.

Scalable Human Blog