What is Replication and Why is it Important?

Let's first break down the goals of replication:

  • High availability
    • Keeping the system running at an agreed level of operation
    • Even when one machine, several machines, or an entire data centre goes down!
  • Disconnected operation
    • Enabling the application to continue working for clients that are offline or experiencing a network interruption
  • Latency
    • Placing data geographically close to users so that the users can interact with it faster
  • Scalability
    • The ability to handle a higher volume of reads than a single machine could handle, by performing these reads on replicas

The humbling reality…🙏

Despite the seemingly simple objective of keeping copies of the data on several machines, replication is a remarkably tricky problem! It requires careful attention to:

  • 📝Concurrency 👈
  • 📝Things that can go wrong 👈
  • 📝Dealing with the consequences of the faults 👈

At a minimum, we will generally require dealing with the following:

  • 📝Unavailable nodes 🔴❌
  • 📝Network interruptions 📶 ❌

And that is not even considering the more insidious faults 🧟‍♂️, such as:

  • 📝Silent data corruption due to software bugs 🪲

What approaches can we take to replication?

  • Single leader replication
    • Clients send all writes to a single node (leader)
      • A stream of data change events is sent from the leader to its followers
    • Reads can be performed by any replica
      • But followers may return stale reads
  • Multi-leader replication
    • Where clients send each write to one of several leader nodes
      • Any of which can accept writes
    • Streams of data change events are sent between leaders and to any follower nodes
    • Related, on choosing the best multi-leader topology, see: The Multi-Leader Replication Topologies
  • Leaderless replication
    • Clients send each write to several nodes
    • Reads can be performed from several nodes in parallel ⬇️
      • In order to detect and correct nodes with stale data
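As a rough illustration of the leaderless approach above, here is a minimal Python sketch of a parallel-style quorum read with read repair. The replica structure, version numbers and function names are my own assumptions for illustration, not any real database's API:

```python
# Hypothetical sketch of a leaderless quorum read with read repair.
# Each replica stores key -> (value, version); all names are illustrative.

def quorum_read(replicas, key, r):
    """Read from r replicas, return the newest value, and repair
    any replicas that returned stale data."""
    responses = [(rep, rep.get(key, (None, 0))) for rep in replicas[:r]]
    # The response with the highest version number wins
    newest_rep, (value, version) = max(responses, key=lambda p: p[1][1])
    # Read repair: push the newest value to replicas holding stale data
    for rep, (_, v) in responses:
        if v < version:
            rep[key] = (value, version)
    return value

# Usage: three replicas, one of which is stale
r1 = {"x": ("new", 2)}
r2 = {"x": ("old", 1)}   # stale replica
r3 = {"x": ("new", 2)}
value = quorum_read([r1, r2, r3], "x", r=3)
```

After the read, the stale replica `r2` has been repaired to hold the newest value, which is exactly the "detect and correct" behaviour described above.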

Advantages and disadvantages of replication

Single leader replication is the most popular because:

  • Easier to understand ✅ ☺️
  • No conflict resolution to worry about ✅

Multi-leader replication and Leaderless replication can be more robust in handling:

  • Faulty nodes ✅
  • Network interruptions ✅
  • Latency spikes ✅

At the cost of being:

  • Harder to reason about ❌
  • Providing only very weak consistency guarantees to end users ❌

Asynchronous and synchronous replication

This can have a profound effect on the system's behaviour when there is a fault.

Asynchronous replication:

  • Can be faster when the system is running as expected ✅
  • It is important to figure out what happens when replication lag increases or a server fails 👈📝
  • If a leader fails and you promote an asynchronously updated follower to be the new leader 🤔
    • There is a risk… ⚠️
      • the recently committed data may be lost ❌

Replication Lag

Consistency models can be utilised to combat replication lag by giving the replicas a set of instructions on how to behave when this occurs.

The consistency models:

  • Read after write consistency
    • Users should always see data that they have submitted themselves
  • Monotonic reads
    • After the users have seen data at one point in time
      • They should not see data from an earlier point in time
  • Consistent prefix reads
    • Users should see the data in a state that makes causal sense
      • For example, seeing a question and its reply in the correct order
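Monotonic reads can, for example, be achieved by always routing a given user's reads to the same replica, e.g. by hashing the user ID. A minimal sketch of that idea, with hypothetical replica names:

```python
import hashlib

def replica_for_user(user_id, replicas):
    """Route each user's reads to a fixed replica chosen by hashing
    the user ID, so the user never jumps back to an older replica."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return replicas[int(digest, 16) % len(replicas)]

replicas = ["replica-a", "replica-b", "replica-c"]
# The same user is always routed to the same replica
first = replica_for_user("user-42", replicas)
second = replica_for_user("user-42", replicas)
```

One caveat of this scheme: if the chosen replica fails, the user's reads must be rerouted, and the monotonicity guarantee needs more care at that point.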

Concurrency issues:

These are inherent in multi leader and leaderless replication… 🤷‍♂️

Final author recommendations

This concludes the blog series on replication in data-intensive systems. To note, most of this information can be found in “Designing Data Intensive Applications” by Martin Kleppmann, which is in my opinion a highly recommended and extensive book on this subject area.

What is Multi-Leader Replication?

In replication, there are specific scenarios where some replication patterns are more suitable than others, depending on an application's use case. Subsequently, multi-leader replication is an alternative to single-leader replication, as previously mentioned in my initial post on what replication is.

Why Multi-Leader Replication Over Single Leader Replication?

Firstly, I will outline the major flaws with single-leader replication, so that we can understand why you may consider this alternative approach.

Disadvantage of Single Leader Replication

🤔 All writes must go through the leader.

  • For instance, if the database is partitioned, each partition has one leader…
    • Although, different partitions may have their leaders on different nodes…
    • So if for any reason you cannot connect to the leader…
      • ❌ For example, if there is a network disruption between you and the leader, you cannot write to the database

The Solution: Multi-Leader Replication

This is considered a natural extension to the leader-based replication model

  • Aka… multi-leader, master-master, or active-active replication
    • In this setup each leader simultaneously acts as a follower to the other leaders
  • ✅ It allows more than one node to accept writes
  • ✅ Replication still occurs in the same way
  • ⚙️ Each node that processes a write, must forward that data change to all the other nodes
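The forwarding step above can be sketched roughly as follows; the `Leader` class and its methods are purely illustrative, not a real database API:

```python
class Leader:
    """Toy multi-leader node: applies local writes, then forwards the
    data change event to its peer leaders (illustrative only)."""
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.peers = []

    def write(self, key, value):
        self.data[key] = value
        # Forward the data change to every other leader
        for peer in self.peers:
            peer.replicate(key, value)

    def replicate(self, key, value):
        # Apply a change received from a peer (no re-forwarding,
        # to keep the toy example from looping)
        self.data[key] = value

# Usage: two data centres, each with its own leader
a, b = Leader("dc-west"), Leader("dc-east")
a.peers, b.peers = [b], [a]
a.write("user:1", "alice")   # accepted locally, replicated to dc-east
```

A real implementation would forward changes asynchronously and must handle the concurrent-write conflicts discussed later.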

Use Cases for Multi-Leader Replication

⚠️ It rarely makes sense to use a multi-leader setup within a single data centre, because the added complexity usually outweighs the benefits.

🤔 However, there are some situations when this configuration is reasonable…

  • Multi-Data-Centre Operation
    • Consider a scenario where you have a database with replicas in several different data centres
      • Perhaps you can tolerate a failure of an entire data centre
      • Or perhaps you want to be closer to your users

Normal Follower and Leader Replication

🟡 The leader has to be in one of the data centres, and all writes must go through that data centre.

Multi Leader Configuration

🟡 You can have a leader in each of the data centres, while within each data centre normal leader-follower replication is used.

🟡 Each data centre's leader replicates its changes to the leaders in the other data centres.

Comparison: Multi-Data Centre Deployments… Single Leader vs Multi-leader Configuration

📈 Performance?

  • Single leader configuration for multi data centres?
    • Every write must go over the internet to the data centre with the leader
      • This can add significant latency to writes
      • May even defeat the purpose of having multiple data centres in the first place!
  • Multi-leader configuration for multi data centres?
    • Every write can be processed in the local data centre and replicated asynchronously to the other data centres.
      • Thus, the inter-data-centre network delays are hidden, which means perceived performance is noticeably improved for end users.

🛡 Tolerance of Data Centre Outages?

  • Single leader configuration tolerance?
    • If a data centre with the leader fails… failover can be used to promote a follower to a leader in another data centre.
    • More reading on the node failure election process can be found here: Replication – what is it?
  • Multi-leader configuration tolerance?
    • Each data centre can operate independently of the others, and replication catches up when a failed data centre comes back online.

🕸 Tolerance of network problems?

  • Single leader configuration network?
    • ❌ Traffic between data centres usually goes over the public network
      • Which may be less reliable than the local network within a data centre
    • ❌ A single-leader configuration is very sensitive to problems with this inter-data-centre link
      • As writes are made synchronously over this link
  • Multi-leader configuration with asynchronous network?
    • ✅ Can usually tolerate network problems better
    • ✅ A temporary network interruption does not prevent writes from being processed

How do I get my hands on multi leader configurations?

Some databases support multi leader replication by default.

⚠️ Before jumping in there is a disclaimer…

Although multi-leader replication has advantages, it also has disadvantages…

  • 🤔 Issues to consider
    • The same data may be concurrently modified in two different data centres
      • Thus those write conflicts must be resolved
    • As multi-leader replication is typically a retrofitted feature in many databases…
      • ❌ There are drawbacks with other database features!
        • These features can be somewhat problematic:
          • Auto incrementing keys
          • Triggers
          • Integrity constraints
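One common (though lossy) way databases resolve the concurrent-write conflicts mentioned above is last write wins (LWW): each write carries a timestamp, and the latest one is kept. A hedged sketch, with made-up data:

```python
def last_write_wins(conflicting_writes):
    """Resolve concurrent writes by keeping the one with the highest
    timestamp. Simple, but it silently discards the losing writes."""
    return max(conflicting_writes, key=lambda w: w["timestamp"])["value"]

# Two data centres modified the same record concurrently
conflict = [
    {"value": "draft-a", "timestamp": 1700000100},  # written in DC A
    {"value": "draft-b", "timestamp": 1700000200},  # written in DC B
]
winner = last_write_wins(conflict)
```

The silent data loss is exactly why conflict resolution is one of the dangers of multi-leader setups; more careful strategies (merging values, recording conflicts for the application to resolve) exist but are more complex.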

Footnote

For the reasons highlighted in the section above, multi-leader replication is often considered dangerous, and should be avoided if at all possible. Although I am not completely disregarding it, the multi-leader configuration is an important concept to understand, as it is reused in other areas of an application where clients want offline operations… Stay tuned, as this will be covered in my next post.

The Five Design Principles: SOLID

To start, in object-oriented programming there is a helpful acronym used within software development named SOLID, which was created to define five simple design principles. The goal of the principles is to achieve more understandable, maintainable and extendable software.

In addition, to give credit, this acronym originates from the famous software engineer and instructor Robert C. Martin, who has written a multitude of best-selling software design and methodology books within the field of software engineering.

What does SOLID stand for?

  • Single Responsibility Principle
    • A class should only have one reason to change
  • Open-closed Principle
    • Open for extension, closed for modification
  • Liskov Substitution Principle
    • Design by contract
  • Interface Segregation Principle
    • Many client-specific interfaces are better than one general-purpose interface
  • Dependency Inversion Principle
    • Depend upon abstraction, not concretions

[S] Single Responsibility Principle

“A class should have only one reason to change”

Robert C. Martin
❌ Consider a Rectangle Class…

There are problems with this design…

  • Rectangle class is immobile
  • You are forced to use its additional dependencies when you may just want one thing!

🤔 Consider this design

Now to question…

  • Do these methods make sense together?

✅ Solution

Two Interfaces…

⚠️ Do we need to do this all the time?

  • Will parts of your class change when others do not?
    • ✅ If yes – This type of solution is recommended!
    • ❌ If no – This solution may not be required
      • Beware of adding unwanted complexity
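Since the original class diagrams are not reproduced here, a small Python sketch of what the two-interface solution might look like; the names and the rendering responsibility are my own assumptions:

```python
from abc import ABC, abstractmethod

# Assumed reconstruction of the missing diagram: a Rectangle that both
# computes geometry and draws itself has two reasons to change, so the
# responsibilities are split across two small interfaces.

class Shape(ABC):
    @abstractmethod
    def area(self) -> float: ...

class Drawable(ABC):
    @abstractmethod
    def draw(self) -> str: ...

class Rectangle(Shape):
    """Pure geometry: its only reason to change is the maths."""
    def __init__(self, width: float, height: float):
        self.width, self.height = width, height

    def area(self) -> float:
        return self.width * self.height

class RectangleRenderer(Drawable):
    """Drawing logic lives apart from geometry, so a rendering change
    never forces the geometry code to change."""
    def __init__(self, rectangle: Rectangle):
        self.rectangle = rectangle

    def draw(self) -> str:
        return f"rect {self.rectangle.width}x{self.rectangle.height}"

r = Rectangle(3, 4)
```

A caller that only needs areas depends on `Shape` alone and never drags in any drawing dependencies.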

[O] Open-closed Principle

Closed for modification and open to extension.

❌ Bad example

We keep editing the method for new changes.

✅ Good Example

🟡 Bonus principle: the illustration above shows the design principle of favouring composition over inheritance.

To note, with this design it should be quite easy to add different TV providers.
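As the illustration itself is not shown here, a rough Python sketch of what an open-closed design with TV providers might look like; the provider names and channel lists are assumptions:

```python
from abc import ABC, abstractmethod

# Assumed reconstruction of the missing illustration: the guide is
# closed for modification, open for extension via new provider classes.

class TvProvider(ABC):
    @abstractmethod
    def channels(self) -> list[str]: ...

class Freeview(TvProvider):
    def channels(self) -> list[str]:
        return ["BBC One", "ITV"]

class Satellite(TvProvider):
    def channels(self) -> list[str]:
        return ["Sky One", "BBC One"]

class TvGuide:
    """Composed with any provider; adding a new provider never
    requires editing this class (open-closed)."""
    def __init__(self, provider: TvProvider):
        self.provider = provider

    def listing(self) -> str:
        return ", ".join(self.provider.channels())

guide = TvGuide(Satellite())
```

Adding a `Cable` provider is a new class, not an edit to `TvGuide` — which is also the composition-over-inheritance point from the bonus principle.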

[L] Liskov Substitution Principle

You should always be able to substitute subtypes for their base classes.

This can be seen as ensuring a class is behaving as it is intended.

How do we avoid violating this principle?

Design by contract

This technique ensures that the class's design is not breached by specifying these key checks:

  • Pre-conditions
  • Post-conditions
  • Invariants

Why is this useful?

Testing

These particular contractual conditions will help develop code that is easier to test, therefore creating more robust solutions.
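As a small illustration of design by contract, here is a hypothetical account class that asserts its pre-condition, post-condition and invariant; the banking domain is my own choice of example:

```python
class Account:
    """Design by contract: withdraw() states a pre-condition
    (sufficient funds) and a post-condition/invariant
    (the balance never goes negative)."""
    def __init__(self, balance: float):
        assert balance >= 0                 # invariant on construction
        self.balance = balance

    def withdraw(self, amount: float) -> float:
        assert 0 < amount <= self.balance   # pre-condition
        self.balance -= amount
        assert self.balance >= 0            # post-condition / invariant
        return self.balance

# Any subtype of Account must honour the same contract: it may not
# strengthen the pre-condition or weaken the post-condition, which is
# what keeps Liskov substitution safe.
acct = Account(100)
remaining = acct.withdraw(30)
```

These explicit checks also double as executable documentation, which is why they make the class easier to test.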

[I] Interface Segregation Principle

Cohesion

How strong are the relationships between an interface’s methods?

❌ Example of low cohesion

Methods are not that similar and serve different purposes.

✅ Example of high cohesion

Methods are related to their respective classes.

Maybe we need a machine that requires everything!
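Without the original examples to hand, here is a rough sketch of segregated interfaces, including a machine that requires everything; all the names are illustrative:

```python
from abc import ABC, abstractmethod

# Assumed reconstruction of the missing example: small, cohesive
# interfaces instead of one general-purpose "machine" interface.

class Printer(ABC):
    @abstractmethod
    def print_doc(self, doc: str) -> str: ...

class Scanner(ABC):
    @abstractmethod
    def scan(self) -> str: ...

class SimplePrinter(Printer):
    """Implements only what it needs; it is never forced to stub
    out scan() just to satisfy a fat interface."""
    def print_doc(self, doc: str) -> str:
        return f"printed: {doc}"

class MultiFunctionMachine(Printer, Scanner):
    """The machine that needs everything simply implements both
    small interfaces."""
    def print_doc(self, doc: str) -> str:
        return f"printed: {doc}"

    def scan(self) -> str:
        return "scanned page"

mfp = MultiFunctionMachine()
```

Clients that only print can depend on `Printer` alone, so their interface stays highly cohesive.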

[D] Dependency Inversion Principle

High-level modules should not depend on low-level modules.

❌ Bad Example

Low level components are tied to high level components:

✅ Good Example

Both high-level and low-level modules should depend on abstractions:
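A minimal sketch of what such an abstraction-first design could look like; the notification domain is an assumption for illustration:

```python
from abc import ABC, abstractmethod

# Both the high-level NotificationService and the low-level senders
# depend on the MessageSender abstraction, not on each other.

class MessageSender(ABC):
    @abstractmethod
    def send(self, text: str) -> str: ...

class EmailSender(MessageSender):
    def send(self, text: str) -> str:
        return f"email: {text}"

class SmsSender(MessageSender):
    def send(self, text: str) -> str:
        return f"sms: {text}"

class NotificationService:
    """High-level module: knows only the MessageSender abstraction,
    so any concrete sender can be injected."""
    def __init__(self, sender: MessageSender):
        self.sender = sender

    def notify(self, text: str) -> str:
        return self.sender.send(text)

service = NotificationService(SmsSender())
```

Swapping SMS for email is now a constructor argument, and the high-level policy code never changes.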

What is Replication?

Conceptually, replication facilitates a distributed architecture. This is achieved by using two fundamental types of nodes that handle an application's reads and writes to the database(s). These are:

  • Leader node – writable and readable
  • Follower node – read only

The leader has the current data and the followers will update accordingly.

Asynchronous vs Synchronous Replication

Synchronous Replication
  • Have one follower synchronous while all the other followers are asynchronous
  • If the synchronous follower falls behind, it is switched to asynchronous and another follower is made synchronous
  • This guarantees you have an up-to-date copy on at least one follower as well as the leader
  • This configuration is called semi-synchronous
Asynchronous Replication
  • If the leader fails, any writes that have not yet been replicated to the followers are lost
  • The write is therefore not guaranteed to be durable
  • The benefit is that the leader is still able to process writes even if the followers have fallen behind
  • The weakened durability trade-off can come with benefits for geographically distributed applications
  • Can be a serious problem for some systems when nodes fail
  • An example of a highly configurable asynchronous replication system is Azure Storage's 'Object Replication For Block Blobs' – https://docs.microsoft.com/en-us/azure/storage/blobs/object-replication-overview

Setting Up New Followers

This can be done either to add new followers or to replace failed ones.

How can you be sure a follower has up-to-date data?

You could lock the database for writes until all the followers are up to date…

But then this is no longer a highly available system…

Fortunately, setting up a follower can be achieved without down time! Conceptually, the process goes like this:

  1. Take a consistent snapshot of the leader node
  2. If possible, this should be done without taking a lock on the entire database
  3. Most databases have snapshot features
    • If this is not the case, you can use third-party tools like:
  4. Copy the snapshot to the new follower node
  5. The follower connects to the leader and requests all the data changes since the snapshot; once it has processed these it is flagged as 'caught up'
    • The snapshot's position in the leader's replication log has different names:
      • Postgres calls this: log sequence number
      • MySQL calls this: binlog coordinates
  6. Now the follower can continue to process data changes from the leader
    • This type of work can be done in an automated way or by a database admin
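The snapshot-and-catch-up steps above can be sketched conceptually like this; the log format, positions and function names are invented for illustration and do not match any particular database:

```python
# A toy replication log: (log position, (key, value)) pairs.
leader_log = [
    (1, ("x", "a")),
    (2, ("y", "b")),
    (3, ("x", "c")),
]

def take_snapshot(log, position):
    """Snapshot = database state up to a known log position
    (PostgreSQL: log sequence number; MySQL: binlog coordinates)."""
    state = {}
    for pos, (key, value) in log:
        if pos <= position:
            state[key] = value
    return state, position

def catch_up(follower_state, snapshot_position, log):
    """After copying the snapshot, apply every change recorded after
    the snapshot position; the follower is then 'caught up'."""
    for pos, (key, value) in log:
        if pos > snapshot_position:
            follower_state[key] = value
    return follower_state

snapshot, pos = take_snapshot(leader_log, position=2)
follower = catch_up(dict(snapshot), pos, leader_log)
```

Because the snapshot carries its exact log position, no writes are missed and no lock on the whole database is ever needed.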

Node Outages (Followers and Leaders)

Possible reasons for follower outages:

  • Faults
  • Rebooting nodes for security patch updates

How do we keep a system up and running during these follower/leader outages, whilst maintaining high availability?

Follower Failure

This can be handled with catch up recovery:

  • On disk, the follower keeps a log of the data changes received from the leader
  • In a scenario where the follower crashes or the network is interrupted:
    • The follower can recover with ease, as it can find the point in time at which the failure occurred
    • Therefore, the follower can request all the data changes made since it was disconnected from the leader
    • Once this request completes, it resumes receiving data changes as it did before

Leader Failure

To highlight, leader failover can be challenging. It requires a follower to be promoted to a leader. Subsequently, clients will need to be reconfigured to send their writes to the new leader, and the other followers will need to start consuming data changes from the new leader.

This process is called leader failover:

  • Manual Failover
    1. An admin is notified that the leader has failed
    2. The admin undertakes the necessary steps to bring up a new leader
  • Automatic Failover
    1. The system determines that the leader has failed
      • Can be due to:
        • Crashes
        • Power outages
        • Network issues
        • Etc…
      • Unfortunately, there is no foolproof way of detecting what has caused this
      • This is why timeouts are used (to note, typically timeouts are disabled for planned maintenance)
    2. Choosing a New Leader
      • Election process
        • Usually the node with the most up to date data changes from the old leader wins the election
          • This is to minimise any data loss
        • Getting all the nodes to agree on a new leader is a consensus problem
    3. Reconfiguring the system to use the new leader
      • Clients now need to send their write requests to the new leader
      • If the old leader returns, it may think it is still the leader, not realising the other replicas have forced it to step down
      • The system will need to ensure the old leader steps down and becomes a follower to the new leader
Failover is fraught with things that can go wrong…
  • If asynchronous replication is used
    • The new leader may not have received all the writes from the old leader before it failed
    • The new leader may have conflicting writes
    • Writes may need to be discarded, which may be unacceptable for durability in some systems
  • Another issue – two nodes thinking they're the leader!
    • This issue is called Split Brain 🧠
    • If both leaders accept writes, data can be lost or corrupted
      • As a solution:
        • Some systems have a safety-catch mechanism to shut down one node if two leaders are detected…
        • This is known as fencing, or 'shoot the other node in the head' – although if not executed carefully, you can shoot both of them!
  • Too long a timeout before a leader is declared dead
    • Can mean a longer time to recover when the leader fails
  • Too short a timeout
    • Can cause unnecessary failovers
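Step 2 of automatic failover, choosing the most up-to-date follower, can be sketched as below. Real systems need a consensus protocol to agree on the result; this toy version only shows the selection rule, with invented node names and fields:

```python
# Hedged sketch of leader election by replication-log position: the
# follower that has applied the most changes wins, minimising data loss.

def elect_new_leader(followers):
    """Pick the follower that has applied the highest log position."""
    return max(followers, key=lambda f: f["log_position"])

followers = [
    {"name": "follower-1", "log_position": 118},
    {"name": "follower-2", "log_position": 125},  # most caught up
    {"name": "follower-3", "log_position": 101},
]
new_leader = elect_new_leader(followers)
```

Note that with asynchronous replication even the winner may still be missing the old leader's final writes, which is exactly the durability risk listed above.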

There are no easy solutions to the issues that failover can bring, which is why some operations teams choose to perform failover manually, even if the software supports automatic failover.

Conclusion

Replication is an important process to understand, as many database technologies utilise it to achieve system scalability, in some cases across different geographical locations. Although, as highlighted, there are fundamental problems in distributed systems that this approach is not immune from.

This must be handled accordingly, initially requiring an understanding of what types of issues there are, whether these are node failures, unreliable networks, etc. Trade-offs between consistency, durability, availability and latency must be made depending on the customer requirements.

As a footnote, replication is something that many database systems include in their feature and functionality set, so the reality is most developers will be faced with choosing the appropriate database technology as an off-the-shelf solution, rather than building these complex components from scratch.

A recommended starting point for deciding on a replication solution toolset:

https://www.trustradius.com/data-replication

Sharding: What it is and why many database architectures rely on it

In this post I will break down how sharding works, with its advantages and trade-offs. In addition, I also aim to explain why you may want to consider this powerful database partitioning technique.

As of this writing, sharding has notably reached the mainstream media, that is if you have been following the wild cryptocurrency news, especially the scalability challenges blockchains face! The Ethereum blockchain and many others are pursuing this technique in the anticipated release of ETH 2.0. The reason to mention this is that the technique has been chosen to resolve a huge scalability problem for the latest upcoming blockchain technologies. This optimisation technique deserves to be taken seriously: it has been around for a while, and many databases that seek to achieve scalability have utilised it with successful results.

To note, we are specifically talking about horizontal sharding, not vertical sharding! This is because horizontal sharding is much more popular and more applicable to large-scale problems.

What is (horizontal) Sharding?

  • Optimisation technique
  • Instead of adding resources to a single database, you partition the data into smaller databases
  • Enables a system to horizontally scale as much as it requires

Alternative options to Sharding?

  • Vertical scaling
    • 🤔 Scaling up your hardware
    • 🤔 Adding resources to your instance
    • ❌ Can seem easier, although can become more expensive
    • ❌ Diminishing returns on performance, as it does not scale linearly
    • 🌐 For more info on this: Vertical scaling vs Horizontal scaling
  • Node Replication (asynchronous)
    • 🤔 Can consist of making copies of your database
    • 🤔 Utilises leader nodes (writable and readable) and follower nodes (readable)
    • ❌ Delayed propagation, nodes eventually update (eventual consistency)
    • ❌ Can result in stale data, and adding more replicas can add to the problem
  • And then there is Sharding…
    • We separate the databases out with different data
    • Different database instances
    • Multiple leader (master) nodes
    • Horizontal partitioning

How does Sharding work?

Example A
  • Example A – The partitioning key needs to be predictable (e.g. an incremental customer ID)
Example B
  • Example B – How do you know what shard your customer data is on? (Shard 1 or Shard 2)
      • You will need to introduce some complexity using an algorithmic approach, which can encompass hashing or some other predictable calculation to find the partition keys
    • What happens if a shard goes down?
      • Your system will fail to make queries on the shard affected
      • The live shards will still be able to process writes and reads
    • How do we determine which shard to query?
      • We require an intermediary layer
        • Responsible for looking up (via the algorithmic approach) which shard each customer is on, and then routing the request (on the routing layer, i.e. RESTful endpoints) to that shard
        • The issue with this is that we will need to create a new layer of complexity that must also handle node failures, single points of failure, etc…
        • It can also be expensive to build and maintain a system like this
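A minimal sketch of the algorithmic routing approach described above, hashing a customer ID to a shard; the shard layout and function names are assumptions for illustration:

```python
import hashlib

# Toy routing layer: hash the customer ID to a shard number, then
# read/write against that shard only.

def shard_for(customer_id: str, num_shards: int) -> int:
    digest = hashlib.md5(customer_id.encode()).hexdigest()
    return int(digest, 16) % num_shards

shards = {0: {}, 1: {}}   # two in-memory "databases"

def write_customer(customer_id: str, record: dict):
    shards[shard_for(customer_id, len(shards))][customer_id] = record

def read_customer(customer_id: str) -> dict:
    return shards[shard_for(customer_id, len(shards))][customer_id]

write_customer("cust-1001", {"name": "Ada"})
```

One design caveat: simple modulo hashing means that changing the shard count remaps almost every key, which is part of why resharding (discussed below) is painful; consistent hashing schemes mitigate this.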

When is it worth doing Sharding?

Pros

  • 🚀 Scalability
    • You can always partition your data horizontally to scale with the problem (on a near-infinite basis)
  • 🚀 High Availability
    • If one database goes down, only a small subset of your customers is affected… and you can still serve traffic for your other customers
  • 🚀 Robust Fault Tolerance
    • Very unlikely that your entire system will fail at once, unless there is a consistent error across the application to cause this

Cons

  • 👎 Added Complexity:
    • Partition mapping
      • There are two approaches to retrieve mapped partitions
        1. Algorithmic approach – calculating the shard’s partition ID
        2. Centralised database – keeps track of all the partition IDs, maintaining an index of which shard each customer is located on
    • Creating routing layer
      • you may have to build this layer as there are not many solutions that provide this
    • Non-uniformity of data
      • Having to split your data evenly. This can be achieved with hashing functions:
        • For example, if you have 100 rows or 100 “pieces” of data, you may want to split this evenly 50/50. Depending on your application, there may be a customer that has much higher data storage demands…
        • What can eventually happen is that one shard grows disproportionately larger than the other shards, meaning a process of reshuffling or resharding will have to take place to resolve this
  • 👎 Resharding
    • Not an easy task to fulfil, as the data will need to be separated out elegantly across the system
  • 👎 Analytical type queries
    • These can be expensive operations across the shard network, as they require scatter and gather, meaning a single query may have to run across several different shards
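The scatter/gather pattern behind such analytical queries can be sketched like this, with made-up shard contents:

```python
# Scatter/gather sketch: an analytical query (total orders) cannot be
# answered by one shard, so the router asks every shard and merges
# the partial results — which is why these queries are expensive.

shards = [
    {"cust-1": {"orders": 3}, "cust-2": {"orders": 5}},
    {"cust-3": {"orders": 2}},
]

def total_orders(shards):
    """Scatter the query to all shards, then gather and combine."""
    partials = [sum(c["orders"] for c in shard.values()) for shard in shards]
    return sum(partials)

total = total_orders(shards)
```

In a real deployment the scatter step would fan out over the network in parallel, and the slowest shard bounds the latency of the whole query.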

Conclusion

To conclude, from this initial analysis of sharding, it can be said that this is an important partitioning technique to consider and understand when designing your database/application architecture, especially given that the latest upcoming blockchain technologies are implementing it. Although it does come with some initial issues to overcome, there are intuitive solutions out there that have managed these areas, depending on their unique data requirements.

Thanks for reading. I hope this was a quick and useful read. Please leave a comment if you have questions on this or any additional input.

Horizontal Scaling vs Vertical Scaling?

Vertical or horizontal? That is the question… This post will outline the differences between these two types of scalability strategy to consider when architecting your application.

These nuggets of information were extracted from Martin Kleppmann's book “Designing Data Intensive Applications”, a highly recommended read for those who want to learn more about building enterprise systems that handle significant quantities of data at scale.

Since reading this, I have condensed Martin Kleppmann's thoughts and findings into notes that I have found useful when considering how to scale an application.

Scaling up (vertical scaling)

Adding resource to a machine to scale…

  • Can cost significantly more to scale
  • Utilise shared memory architecture
  • Disks, and even CPUs, can be replaced without shutting down the machine
  • Stuck to one location
  • Twice the size does not mean it can handle twice the load

Shared nothing architecture (horizontal scaling)

Adding more machines to scale…

  • Can use disk, memory, CPUs independently
  • The co-ordination of nodes (virtual machines) is done on a software level on a conventional network
  • Flexibility to choose what machine has the best price to performance ratio
  • Ability to protect against losing an entire data centre
  • Multi-region distributed architecture is achievable with this

Conclusion

From what has been highlighted between these two scaling strategies, the most popular solution today is horizontal scaling. Vertical scaling is still widely adopted due to its simplicity, as not all systems need to be multi-region scalable or have to support ongoing growth. However, if the system is expected to grow gradually or exponentially, you may find horizontal scaling to be the more durable architecture.

For more in-depth reading on this, I recommend these links:

Mission – cloud services

IBM – post

Part 2: Top 4 Considerations Designing Data-Intensive Applications

Continuing this series on ‘Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems’ by Martin Kleppmann, I will continue the journey with 4 more considerations, focusing on database optimisation with indexes, asynchronous replication, horizontal vs vertical scaling, and finally batch processing.

Please note, for each of these topics I will for the most part be skimming the surface; my aim is to keep these sections short and sweet for now.

1. Database indexes

Firstly, there are a multitude of ways to increase the speed of data reads and writes from a database. In this short summary I will cover database indexes and how their implementation can be understood.

  1. The attractiveness of indexes comes from the increased speed at which data can be queried from database tables.
  2. To elaborate, indexes can locate data quickly, avoiding the burden of having to search through all the rows of the database whenever data is accessed.
  3. This is similar to the index at the back of a book, which guides the reader to the page they are searching for.
  4. The way this works is that the indexes serve as lookup tables, efficiently stored for faster retrievals.
  5. Additionally, the database maintains a data structure with an index, or even multiple indexes. Consequently, there are added write and storage demands to uphold this data structure.
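The lookup-table idea can be illustrated with a toy in-memory hash index; this is a conceptual sketch, not how any particular database implements its indexes:

```python
# Toy illustration of points 4-5 above: a hash index maps a key to its
# row offset, so a read avoids scanning every row, while every write
# pays a little extra to keep the index up to date.

rows = []      # the "table": list of (key, record) pairs
index = {}     # the index: key -> row offset

def insert(key, record):
    index[key] = len(rows)      # extra write to maintain the index
    rows.append((key, record))

def lookup(key):
    """One dictionary probe instead of scanning all rows."""
    return rows[index[key]][1]

insert("alice", {"age": 30})
insert("bob", {"age": 25})
```

The trade-off in point 5 is visible here: `insert` does two writes (row plus index entry), which is the storage and write overhead that buys the fast reads.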

2. Asynchronous and Synchronous Node Replication

Node Replication

As Martin Kleppmann highlights, there is a common technique for maintaining high availability between distributed nodes: the follower and leader node technique.

Follower and Leader Nodes (Virtual Machines)

As mentioned, this method uses a leader node that is writable and readable, whilst the distributed follower nodes are readable only.

What this does is enable a distributed application to have multiple nodes, therefore allowing horizontal scaling as nodes/VMs can be added to scale, whilst the followers synchronise with the leader node.

Asynchronous vs Synchronous Replication
Synchronous Replication

A common issue with synchronous replication is that all the application nodes have to stop and wait until all the follower nodes are synced. In most cases this is quick and unnoticeable. However, the Achilles heel appears when any node becomes slow for whatever reason (i.e. bandwidth issues, lack of threads, under-resourced nodes, etc.): the entire node cluster lags until all the nodes are in sync again, and that lag can range from seconds to minutes to hours.

Asynchronous Replication

This is a particular reason why asynchronous replication is preferable in this context: the leader does not wait for nodes to catch up, and completes the job based on the available threads. The trade-off of the fully asynchronous approach is that it weakens durability, but the benefit is that the leader is still able to process a write even if a follower falls behind. This approach is quite popular due to its ability to perform well over distributed geographical areas. If the system does not need to be synchronised 100% of the time, this is a viable option.

Semi-Synchronous Replication (hybrid)

Alternatively, a common method is the semi-synchronous approach, where all the followers are asynchronous except one, which is left synchronous. This approach requires swapping the synchronous flag to another follower if the current one falls behind. In addition, this guarantees you have one up-to-date copy on a follower as well as the leader.
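The synchronous-flag swap could be sketched conceptually like this; the lag threshold, log positions and data shapes are all assumptions for illustration:

```python
# Toy semi-synchronous role assignment: keep exactly one follower
# synchronous, and hand the role to another follower if the current
# candidate has fallen too far behind the leader's log position.

def pick_sync_follower(followers, leader_position, max_lag=5):
    """followers: name -> applied log position. Returns the follower
    chosen to be synchronous: the most caught-up one within the
    allowed lag, falling back to the best available otherwise."""
    candidates = {name: pos for name, pos in followers.items()
                  if leader_position - pos <= max_lag}
    pool = candidates or followers
    return max(pool, key=pool.get)

followers = {"f1": 92, "f2": 99, "f3": 100}
sync_follower = pick_sync_follower(followers, leader_position=100)
```

Here `f1` lags by 8 positions and is passed over, so the guarantee of one up-to-date copy is preserved by moving the synchronous role to `f3`.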

3. Batch processing

Batch processing is the ability to run repetitive, high-volume data jobs. It allows data to be processed when computing resources are available, generally in an automated manner.

The batch method involves a period of data processing called the “batch window”. The attraction of batch processing is that it can be done at an efficient time for an application. For instance, your application may be used primarily during working hours, while at night it can run its batch processing schedule.

Batch processing becomes increasingly important as data collection increases, as it allows the application to keep on top of data jobs in an efficient manner. However, to finally note, there are particular parameters to consider when building a batch processing system. These can include:

  • Who is submitting the job
  • What program is running
  • Where the program is running
  • When the job should be executed
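The batch-window idea can be sketched as a simple guard around the job; the window times and the processing step are illustrative assumptions:

```python
from datetime import time

# Only run the job when the current time falls inside the agreed
# low-traffic window ("when the job should be executed" above).

BATCH_WINDOW = (time(1, 0), time(5, 0))   # assumed window: 01:00-05:00

def in_batch_window(now: time, window=BATCH_WINDOW) -> bool:
    start, end = window
    return start <= now <= end

def run_batch_job(records, now: time):
    """Process the queued records only inside the batch window;
    otherwise leave them queued for the next window."""
    if not in_batch_window(now):
        return None
    return [r.upper() for r in records]   # stand-in for real processing

result = run_batch_job(["a", "b"], time(2, 30))
```

A real scheduler would also track who submitted the job and which program runs where, as per the parameter list above.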

4. Horizontal Scaling vs Vertical Scaling

Vertical Scaling

  • Also known as “scaling up”
  • The ability to add resources to a machine
  • Utilises shared memory architecture
  • Can hot-swap disks and CPUs without shutting down the machine
  • Stuck in one geographic location
  • Doubling the resources does not mean handling double the load
  • Costs significantly more than horizontal scaling

Horizontal Scaling

  • Virtual Machines are called nodes
  • This uses disk, memory and CPUs independently
  • The co-ordination of the nodes is done at the software level
  • Can use whatever machines have the best price-to-performance ratio
  • Can protect against losing an entire data centre
  • Supports multi-region distributed architecture

Part 1: Top 5 Considerations Designing Data-Intensive Applications

‘Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems’ by Martin Kleppmann details some of the key considerations to take when designing data-intensive applications. This post, which will run in multiple parts, covers some of the top considerations to make as of this writing.

1. “Big Data” is not a useful term

In Martin Kleppmann’s analysis of today’s buzzword “Big Data”, he argues that it is not useful in conversations about the data world, as it does not have a single meaning and has accumulated multiple definitions over time. The term is simply too abstract for such an extensive area. Therefore, the topic should be addressed in terms of its different components, i.e. capturing data, data analysis, data storage, visualisation, etc.

2. Premature system optimisation

Ensure you are not over-engineering a solution, which can create more work than required. As Martin Kleppmann puts it, this is the “you are not Google” argument: Google may well be using the latest and most scalable data systems, but that does not mean we need to optimise or architect our software systems to the same level. However, we should be mindful of keeping our systems extensible, to allow for future optimisations as they grow.

3. Storage Engines

There are multiple storage engines to choose from, each with its own benefits and drawbacks. The differences can include faster reads or writes to the database, scalability, high availability, etc. This emphasises the importance of researching the best storage engine for the needs of your application. For reference, here is a list of some storage engine structures:

  • Ordered/unordered flat files
  • Hash tables
  • B Trees
  • ISAM
  • Heaps
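To give a feel for the trade-offs, here is a minimal, illustrative sketch (my own, not from the book) combining two items from the list: an append-only flat file for storage with an in-memory hash table as the index. Writes are fast appends; reads cost one seek; the trade-off is that the whole index must fit in memory.

```python
import os

class HashIndexedLog:
    """Sketch of a hash-indexed, append-only log storage engine."""

    def __init__(self, path):
        self.path = path
        self.index = {}                  # key -> byte offset in the log
        open(path, "ab").close()         # create the log file if missing

    def set(self, key, value):
        with open(self.path, "ab") as f:
            f.seek(0, os.SEEK_END)
            offset = f.tell()
            f.write(f"{key},{value}\n".encode())  # fast: append-only write
        self.index[key] = offset         # newest offset wins on overwrite

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)               # one seek per read
            line = f.readline().decode().rstrip("\n")
        return line.split(",", 1)[1]
```

Note that overwritten values are never reclaimed here; a real engine would compact the log periodically.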

For further reading on this, I highly recommend this Medium blog post specifically on this area.

4. Reliable, scalable and maintainable applications

For the longevity of a system, reliability, scalability and maintainability need to be factored in; this consideration encompasses all of the other points in direct or indirect ways. As software engineers, these are the core principles we must uphold when designing and delivering high-performance applications. In addition, measuring these areas is important; as William Thomson, Lord Kelvin, put it…

“If you can’t measure it, you cannot improve it”

Here are some measurements in these areas:

  • Reliability
    • Software failure rate – how often the system as a whole fails to deliver its service
    • Software fault rate – how often a component of the system deviates from its specification
    • System availability – Availability = Uptime ÷ (Uptime + Downtime)
  • Scalability
    • Throughput – the transaction rate processed by the system
    • Resource usage – the utilisation levels of resources such as CPU, memory, disk and bandwidth
    • Cost – price per transaction
  • Maintenance
    • Corrective maintenance – the cost of correcting faults that do not satisfy functional requirements
    • Adaptive maintenance – the cost of changes made to satisfy new functional requirements
    • Perfective maintenance – the cost of improving the performance or design of an application following periodic reviews
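As a worked example of the availability formula above (the uptime figures are illustrative, not from the source):

```python
def availability(uptime_hours, downtime_hours):
    """Availability = Uptime / (Uptime + Downtime), as in the list above."""
    return uptime_hours / (uptime_hours + downtime_hours)

# e.g. 8,756 hours up and 4 hours down over a year:
print(f"{availability(8756, 4):.4%}")  # roughly "three nines" of availability
```

Even a few hours of downtime per year keeps availability just below 99.96%, which shows how demanding “four nines” and above really are.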

5. Storing results of expensive operations

Also known as memoisation, this is an optimisation technique in computing in which the results of expensive function calls are stored, and the cached result is returned when the same inputs occur more than once. This optimisation is useful to understand because it can alleviate database read demands and therefore increase application performance.