Part 2: Top 4 Considerations When Designing Data-Intensive Applications

Continuing this series on ‘Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems’ by Martin Kleppmann, I will cover four more considerations. These focus on database optimisation with indexes, asynchronous replication, horizontal vs vertical scaling, and finally batch processing.

Please note that for each of these topics I will, for the most part, only be skimming the surface; my aim is to keep these sections short and sweet for now.

1. Database indexes

Firstly, there are a multitude of ways to increase the speed of data reads and writes in a database. In this short summary I will cover what database indexes are and how their implementation can be understood.

  1. The attractiveness of indexes comes from the increased speed at which data can be queried from database tables.
  2. To elaborate, an index lets the database locate data quickly because it avoids the burden of searching through every row whenever data is accessed.
  3. This is similar to using the index at the back of a book to guide the reader to the page they are searching for.
  4. The way this works is that indexes serve as lookup tables, storing references to the data for faster retrieval.
  5. Additionally, this is implemented by maintaining a separate data structure for each index (a table can have several). The trade-off is added write and storage overhead to keep each of these structures up to date.
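
The points above can be sketched in a few lines of Python. This is a toy model, not a real database engine: the table, column names, and data are invented for illustration, with a plain dict standing in for the index structure.

```python
# A minimal sketch of how an index trades extra writes and storage for
# faster reads. All data and names here are illustrative.

table = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": "Grace"},
    {"id": 3, "name": "Alan"},
]

def full_scan(name):
    """Without an index: check every row (O(n))."""
    return [row for row in table if row["name"] == name]

# Building the index is the extra write/storage cost mentioned above:
# a lookup structure maintained alongside the table.
name_index = {}
for row in table:
    name_index.setdefault(row["name"], []).append(row)

def indexed_lookup(name):
    """With an index: jump straight to the matching rows."""
    return name_index.get(name, [])
```

Both functions return the same rows; the difference is that the indexed lookup avoids touching rows that cannot match, at the cost of maintaining `name_index` on every write.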

2. Asynchronous and Synchronous Node Replication

Node Replication

As Martin Kleppmann highlights, a common technique for maintaining high availability between distributed nodes is the leader and follower node technique.

Follower and Leader Nodes (Virtual Machines)

As mentioned, this method uses a leader node that is both writable and readable, whilst the follower nodes are read-only.

This enables a distributed application to run across multiple nodes, allowing horizontal scaling: nodes/VMs can be added as load grows, whilst the followers synchronise with the leader node.
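
The read/write split can be sketched as follows. This is a toy model under my own assumptions (class and method names are illustrative, and "synchronising" is just copying a dict), not how any real replication protocol works.

```python
# Toy sketch of leader/follower routing: the leader accepts reads and
# writes, followers are read-only replicas. Names are illustrative.

class Node:
    def __init__(self, is_leader=False):
        self.is_leader = is_leader
        self.data = {}

    def write(self, key, value):
        if not self.is_leader:
            raise PermissionError("followers are read-only")
        self.data[key] = value

    def read(self, key):
        return self.data.get(key)

leader = Node(is_leader=True)
followers = [Node(), Node()]  # add nodes to scale reads horizontally

leader.write("k", "v")
for f in followers:           # followers synchronise from the leader
    f.data = dict(leader.data)
```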

Asynchronous vs Synchronous Replication
Synchronous Replication

A common issue with synchronous replication is that every write must stop and wait until all the follower nodes are synced. In most cases this is quick and unnoticeable. However, the Achilles heel appears when any node becomes slow for whatever reason (e.g. bandwidth issues, lack of threads, under-resourced nodes); this can cause the entire cluster to lag until all the nodes are in sync again, and that lag can range from seconds to minutes to hours.
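
A back-of-the-envelope model of that cascading lag: with synchronous replication the write commits only after all followers acknowledge, so the client-visible latency is governed by the slowest node. The delay figures below are invented purely for illustration.

```python
# Synchronous replication latency model: the leader cannot confirm the
# write until ALL followers acknowledge, so one slow follower delays
# every client. Delay values are made up for illustration.

def sync_write_latency(follower_delays_ms):
    # Even if followers are contacted in parallel, the write can only
    # complete once the slowest follower has responded.
    return max(follower_delays_ms)

healthy = [5, 7, 6]          # three healthy followers (milliseconds)
one_slow = [5, 7, 60_000]    # one follower lagging by a minute
```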

Asynchronous Replication

This is why asynchronous replication is preferable in this context: the leader does not wait for followers to catch up, and it is still able to process a write even if a follower falls behind. The trade-off of the fully asynchronous approach is that it weakens durability, since a write acknowledged by the leader may not yet exist on any follower. Despite that, this approach is quite popular due to its ability to perform well over distributed geographical areas; if the system does not need to be in sync 100% of the time, it is a viable option.
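
Extending the latency model above to the asynchronous case makes the trade-off concrete. The function names and numbers are again my own illustrative sketch, not an API from the book.

```python
# Asynchronous replication model: the client waits only for the leader;
# followers apply the write later in the background.

def async_write_latency(leader_delay_ms, follower_delays_ms):
    # A lagging follower no longer affects the client-visible latency.
    return leader_delay_ms

def replication_lag(follower_delays_ms):
    # The durability trade-off: the most-behind follower may be missing
    # recent writes if the leader fails before it catches up.
    return max(follower_delays_ms)
```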

Semi-Synchronous Replication (hybrid)

Alternatively, a common method is the semi-synchronous approach, where one follower is synchronous and the rest are asynchronous. If the synchronous follower falls behind, the synchronous flag is swapped to another follower. This guarantees you have an up-to-date copy of the data on at least one follower as well as the leader.
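
In the same toy latency model, the semi-synchronous hybrid and its flag-swapping behaviour might look like this. The threshold and delay values are assumptions for illustration only.

```python
# Semi-synchronous model: the leader waits for exactly one designated
# synchronous follower; the rest replicate asynchronously. If the sync
# follower falls behind, the flag moves to a faster one.

def semi_sync_latency(follower_delays_ms, sync_index):
    # The client waits only for the one synchronous follower.
    return follower_delays_ms[sync_index]

def reassign_sync_flag(follower_delays_ms, sync_index, threshold_ms=100):
    # If the current sync follower is too slow, promote the fastest
    # follower instead (a hypothetical, simplistic policy).
    if follower_delays_ms[sync_index] > threshold_ms:
        return min(range(len(follower_delays_ms)),
                   key=lambda i: follower_delays_ms[i])
    return sync_index
```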

3. Batch processing

Batch processing is the ability to run repetitive, high-volume data jobs. It allows data to be processed when computing resources are available, generally in an automated manner.

The batch method involves a set period of data processing called the “batch window”. The attraction of batch processing is that it can be done at an efficient time for an application. For instance, your application may be used primarily during working hours, while at night it can run its batch processing schedule.

Batch processing becomes increasingly important as data collection grows, as it allows the application to keep on top of its data jobs in an efficient manner. Finally, there are particular parameters to consider when building a batch processing system, including:

  • Who is submitting the job
  • What program is running
  • Where the program is running
  • When the job should be executed
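
The parameters above can be sketched as a simple overnight batch window check. The window hours, job record fields, and names are all hypothetical, chosen to echo the who/what/where/when list.

```python
from datetime import time as dtime

# Sketch of a "batch window": heavy jobs run only while the application
# is quiet (overnight here). All values below are illustrative.

WINDOW_START = dtime(22, 0)   # 10 pm
WINDOW_END = dtime(6, 0)      # 6 am

def in_batch_window(now):
    # The window crosses midnight, so the check is an OR, not an AND.
    return now >= WINDOW_START or now <= WINDOW_END

job = {
    "who": "reporting-service",       # who is submitting the job
    "what": "aggregate_daily_sales",  # what program is running
    "where": "worker-node-3",         # where the program is running
    "when": dtime(22, 30),            # when the job should be executed
}

def should_run(job):
    return in_batch_window(job["when"])
```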

4. Horizontal Scaling vs Vertical Scaling

Vertical Scaling

  • Also known as “scaling up”
  • The ability to add resources to a machine
  • Utilises shared memory architecture
  • Disks and CPUs can often be replaced without shutting down the machine
  • Stuck in one geographic location
  • A machine with twice the resources cannot necessarily handle twice the load
  • Costs significantly more than horizontal scaling

Horizontal Scaling

  • Each virtual machine is called a node
  • Each node uses its own disk, memory and CPUs independently
  • The coordination of the nodes is done at the software level
  • Can use whichever machines have the best price-to-performance ratio
  • Can protect against losing an entire data centre
  • Supports multi-region distributed architecture

📚 Further Reading & Related Topics

If you’re exploring key considerations for designing data-intensive applications, these related articles will provide deeper insights:

• Distributed Data-Intensive Systems: Replication vs. Partitioning vs. Clustering vs. Sharding – Dive into the foundational concepts of distributed systems design, focusing on how replication, partitioning, clustering, and sharding influence the design and scalability of data-intensive applications.

• What Is Partitioning and Why Does It Matter? – Learn how partitioning works in distributed systems, and how it can impact performance, scalability, and fault tolerance when designing data-intensive applications.


I’m Sean

Welcome to the Scalable Human blog. Just a software engineer writing about algo trading, AI, and books. I learn in public, use AI tools extensively, and share what works. Educational purposes only – not financial advice.
