System Design Availability

Posted in System Design on April 16, 2023 by Admin ‐ 6 min read

What is Availability, and Why is it Important in System Design?

Availability refers to the ability of a system to remain operational and accessible to users, even in the face of various failures and disruptions. Specifically, availability is a measure of the percentage of time that a system is up and running and able to perform its intended functions.

During system design interviews, always treat availability as one of the main pillars. If your system goes down for any significant length of time, end users have a bad experience, the product loses credibility, and the business loses money.

Here is a real-life example: if Google's YouTube goes down for even one minute, it is a significant loss for the company. In December 2020, YouTube suffered an outage that, according to reports, lasted 37 minutes and cost Google about 1.7 million dollars. That should give you some perspective on how much the availability of a system matters.

To achieve high availability, you must consider a variety of factors, such as:

  1. Redundancy - involves having multiple instances of critical components or systems, so that if one fails, others can take over
  2. Fault tolerance - involves designing systems that can continue to function even if individual components fail
  3. Disaster recovery - involves having backup systems and processes in place to quickly recover from major outages or disruptions
  4. Load balancing - involves distributing workloads across multiple systems, so that no single system becomes overwhelmed and unavailable
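
To make the effect of redundancy concrete, here is a minimal Python sketch. It assumes that instances fail independently and that a single healthy instance is enough to serve traffic. Both are simplifying assumptions that real systems rarely meet exactly, and the function name `parallel_availability` is made up for this illustration.

```python
def parallel_availability(instance_availability: float, instances: int) -> float:
    """Estimate the availability of N redundant instances.

    Assumptions (not from the article): failures are independent, and the
    system is considered up as long as at least one instance is up.
    """
    if not 0.0 <= instance_availability <= 1.0:
        raise ValueError("instance_availability must be between 0 and 1")
    if instances < 1:
        raise ValueError("instances must be at least 1")
    # The system is down only when every instance is down at the same time.
    probability_all_down = (1.0 - instance_availability) ** instances
    return 1.0 - probability_all_down


if __name__ == "__main__":
    for count in (1, 2, 3):
        combined = parallel_availability(0.99, count)
        print(f"{count} instance(s) at 99% each: {combined:.6%}")
```

Under these assumptions, each extra instance multiplies the probability of a total outage by the failure probability of a single instance, which is why redundancy is usually the first tool reached for.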

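The Concept of Nines in Availability

Availability is typically measured, and supported, in a few key ways. The first four items below are metrics or targets; the last three are practices that help meet them: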

  1. Uptime percentage - This is the most common metric. It’s calculated as the total time a system or service was available over a given period, divided by the total possible available time in that period. For example, a system that is up and running for 8760 hours in a year has an uptime percentage of 100%.

  2. Mean time between failures (MTBF) - This measures the average length of time between unplanned failures of a system or component. A high MTBF means fewer failures and higher availability.

  3. Mean time to recover (MTTR) - This is the average time it takes to detect a failure and restore the system or service to an operational state. A low MTTR means faster recovery and higher availability.

  4. Recovery time objectives (RTO) and recovery point objectives (RPO) - These set targets for how quickly a system must be restored after a failure (RTO) and how little data can be lost (RPO). Meeting these targets helps ensure high availability.

  5. Redundancy - The use of redundant components, systems, networks, data centers, etc. provides the backup capacity to prevent outages and ensures continuous availability.

  6. Automation - The use of automated processes for monitoring, failure detection, backup, restore, disaster recovery and other key availability workflows helps minimize downtime.

  7. Geographic redundancy - Placing critical systems and infrastructure in geographically separate locations protects against regional outages due to natural disasters or other localized events.

The generalized formula for availability is given by:

$$ \begin{equation*} Availability = \left(\frac{UpTime}{UpTime + DownTime}\right) \end{equation*} $$
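
The formula above translates directly into code. The sketch below also includes a variant based on MTBF and MTTR, using the common steady-state approximation MTBF / (MTBF + MTTR). That variant is an assumption added here for illustration and is not stated in the text above; both function names are hypothetical.

```python
def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Availability = UpTime / (UpTime + DownTime), as a fraction from 0 to 1."""
    total_hours = uptime_hours + downtime_hours
    if uptime_hours < 0 or downtime_hours < 0 or total_hours == 0:
        raise ValueError("durations must be non-negative and not both zero")
    return uptime_hours / total_hours


def availability_from_mtbf(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state approximation (an assumption, not from the article):
    the system is up for MTBF on average, then down for MTTR on average."""
    return availability(mtbf_hours, mttr_hours)


if __name__ == "__main__":
    # A year (8760 hours) with 8.76 hours of downtime.
    yearly = availability(uptime_hours=8760 - 8.76, downtime_hours=8.76)
    print(f"Availability over the year: {yearly:.3%}")

    # A component that fails on average every 999 hours and takes 1 hour to restore.
    component = availability_from_mtbf(mtbf_hours=999, mttr_hours=1)
    print(f"Availability from MTBF/MTTR: {component:.3%}")
```

Seen this way, availability improves either by failing less often (higher MTBF) or by recovering faster (lower MTTR).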

Availability is often expressed as the number of nines in the percentage (99.9% is "three nines"), rather than as the percentage itself. For high-demand systems and services, measuring availability by nines provides a more granular and stringent target.

99% available = 2 nines of availability. This is a good target for most basic systems and services. It means the service is unavailable for about 3.65 days (roughly 87.6 hours) per year.

99.9% available = 3 nines of availability. This is suitable for mission-critical systems and core business services. Downtime is limited to about 8.76 hours per year.

99.99% available = 4 nines of availability. Only about 52.6 minutes of downtime per year. This is appropriate for critical systems and services.

99.999% available = 5 nines of availability. Just over 5 minutes (about 5.26 minutes) of downtime per year. This is considered the “gold standard” of availability and is targeted for the most mission-critical applications and infrastructure.

99.9999% available = 6 nines. About 31.5 seconds of downtime per year. Only the most strategic and high-profile systems aim for this level of availability.
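
The downtime budget for each level follows from the availability percentage. Here is a small sketch of that arithmetic, assuming a 365-day year (8760 hours); the helper names are hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year (8760 hours)


def downtime_seconds_per_year(availability_percent: float) -> float:
    """Return the downtime allowed per year for a given availability percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return unavailable_fraction * SECONDS_PER_YEAR


def format_duration(seconds: float) -> str:
    """Render a duration in the largest unit that keeps the value >= 1."""
    if seconds >= 86400:
        return f"{seconds / 86400:.2f} days"
    if seconds >= 3600:
        return f"{seconds / 3600:.2f} hours"
    if seconds >= 60:
        return f"{seconds / 60:.2f} minutes"
    return f"{seconds:.2f} seconds"


if __name__ == "__main__":
    for percent in (99.0, 99.9, 99.99, 99.999, 99.9999):
        budget = downtime_seconds_per_year(percent)
        print(f"{percent}% -> {format_duration(budget)} of downtime per year")
```

Each additional nine cuts the allowed downtime by a factor of ten, which is a large part of why every extra nine is so much harder to achieve.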


ⓘ Availability vs DownTime

Achieving a high number of nines requires advanced design, robust architecture, automated monitoring and management, geographic redundancy, disaster recovery planning, and stringent service level agreements (SLAs) to minimize the risk of outages. The higher the nines, the more expensive and complex the solutions become.

Most organizations determine the right balance of cost, complexity and constraint based on the criticality of their systems and services. For many businesses, 3 to 4 nines represent an achievable yet ambitious target for their core infrastructure and applications. 5 nines may be suitable for their most critical functions. The nines approach provides a useful framework for positioning and improving availability across an enterprise’s technology landscape.

Amazon S3 is one of the most widely used object storage services. Its S3 Standard storage class is designed for 99.99% availability, that is, 4 nines.

ⓘ S3 4 nines of availability

Fault Tolerant vs Availability

High availability and fault tolerance are related but different concepts:

  1. High availability is a broad term that refers to any solution that reduces downtime and provides continuous access to systems and services. This includes redundancy, backups, disaster recovery, and failovers. The main goal is minimizing disruption to users and ensuring key business functions remain available.

  2. Fault tolerance is a specific approach to high availability that focuses on designing components and systems that continue operating despite the failure of part of the system. Fault-tolerant systems contain redundant components that take over transparently when a fault is detected, allowing the system as a whole to continue operating without interruption.
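
The idea of a redundant component taking over transparently can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: replicas are plain functions, any exception counts as a fault, and all names (`call_with_failover`, `broken_primary`, `healthy_standby`) are hypothetical. Real fault-tolerant systems also need health checks, state replication, and timeouts, none of which are shown here.

```python
from typing import Callable, List, Sequence


class AllReplicasFailed(Exception):
    """Raised when no replica could handle the request."""


def call_with_failover(replicas: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each replica in order and return the first successful response.

    The caller never sees individual replica failures; it only sees an
    error if every replica fails.
    """
    errors: List[Exception] = []
    for replica in replicas:
        try:
            return replica(request)
        except Exception as exc:  # treat any exception as a replica fault
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} replica(s) failed: {errors}")


def broken_primary(request: str) -> str:
    raise ConnectionError("primary is down")


def healthy_standby(request: str) -> str:
    return f"handled {request!r} on standby"


if __name__ == "__main__":
    # The primary fails, the standby answers, and the caller gets a normal response.
    print(call_with_failover([broken_primary, healthy_standby], "GET /"))
```

From the caller's point of view the request simply succeeds, which is the "without interruption" property described above. A merely highly available design might instead surface a brief error while a failover completes.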

Some key differences between high availability and fault tolerance:

| Aspect | High availability | Fault tolerance |
|---|---|---|
| Scope | High availability is an umbrella term for any solution that improves uptime. | Fault tolerance is one specific approach to building highly available systems with redundant components. |
| Redundancy | High availability can use fault tolerance techniques but also includes other solutions like failover clustering, backups, disaster recovery, etc. | Relies on redundant hardware, networks, data storage, etc. |
| Transparency | Aims to minimize disruptions, but some outages may still be perceptible to end users. | Designed to mask hardware and software faults from the outside world. |
| Mean time to repair | Not all high availability solutions achieve this level of resilience. | Fault-tolerant systems can continue operating even when components are being repaired or replaced. |
| Cost | High availability practices span a range of cost-effective to premium solutions depending on requirements. | Fault tolerance solutions tend to be more expensive to develop and maintain due to the redundant hardware and complexity required. |
| Example | High availability encompasses a much wider range, including services like ecommerce platforms, corporate databases, email servers, etc. | Fault-tolerant system examples include space shuttle computers, military systems, and reactor control systems. |

Fault tolerance is, in a sense, a specialized approach to building highly available systems that are resilient in the face of hardware or software faults.

However, high availability as a whole includes fault tolerance as well as other techniques to minimize disruption and maximize uptime across all critical systems and services. The specific solutions used depend on reliability needs, costs, and risks.

But in general, fault tolerance provides the highest level of availability and resilience.