System Design Masterclass
March 28, 2026 | 12 min read
Lesson 4 / 15

04. Designing for High Availability

TL;DR

High availability = no single point of failure. Replicate everything (stateless services, databases, caches). Use health checks and automatic failover. Understand CAP theorem: in a partition, choose consistency (CP) or availability (AP). Design for failure — everything will break, the question is whether your system recovers automatically.

High availability is the property of a system that operates continuously without failure for a stated period of time. In practice, “high availability” means your system keeps serving requests even when individual components fail — and components will fail. Disks corrupt, servers crash, networks partition, entire data centers go dark.

The goal is not to prevent failure. The goal is to make failure invisible to users.

[Figure: High availability architecture with redundancy and failover]

Measuring Availability — The Nines

Availability is measured as a percentage of uptime over a given period. The industry uses “nines” as shorthand:

| Availability | Downtime/Year | Downtime/Month | Downtime/Week |
|---|---|---|---|
| 99% (two nines) | 3.65 days | 7.3 hours | 1.68 hours |
| 99.9% (three nines) | 8.77 hours | 43.8 minutes | 10.1 minutes |
| 99.99% (four nines) | 52.6 minutes | 4.38 minutes | 1.01 minutes |
| 99.999% (five nines) | 5.26 minutes | 26.3 seconds | 6.05 seconds |

Going from 99.9% to 99.99% does not sound like much. But it means cutting your allowed downtime from 8.7 hours per year to 52 minutes per year. That is a fundamentally different engineering challenge. Every additional nine typically costs 10x more in infrastructure and operational complexity.

SLAs, SLOs, and SLIs

These terms get thrown around loosely. Here is what they actually mean:

  • SLI (Service Level Indicator): A quantitative measurement. Example: “the percentage of requests completing in under 200ms.”
  • SLO (Service Level Objective): A target value for an SLI. Example: “99.9% of requests must complete in under 200ms.”
  • SLA (Service Level Agreement): A contract with consequences. Example: “If we drop below 99.9% availability, customers get service credits.”

Your SLO should be stricter than your SLA. If your SLA promises 99.9%, your internal SLO should target 99.95% so you have a buffer before you owe anyone money.

# Calculate allowed downtime from availability target
def allowed_downtime(availability_percent, period_hours=8760):
    """
    availability_percent: e.g. 99.99
    period_hours: hours in measurement period (8760 = 1 year)
    """
    downtime_fraction = 1 - (availability_percent / 100)
    downtime_hours = period_hours * downtime_fraction
    downtime_minutes = downtime_hours * 60

    print(f"Availability: {availability_percent}%")
    print(f"Allowed downtime per year: {downtime_hours:.2f} hours ({downtime_minutes:.1f} minutes)")
    print(f"Allowed downtime per month: {downtime_minutes / 12:.1f} minutes")
    return downtime_minutes

# Four nines: only 52.6 minutes per year
allowed_downtime(99.99)
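The same arithmetic gives you an error budget: the downtime your SLO permits over a period, minus what you have already spent. A minimal sketch (the 99.95% target and the incident durations are illustrative):

```python
# Error budget: how much downtime the SLO allows, and how much is left.
# The SLO target and incident durations below are illustrative.

def error_budget_remaining(slo_percent, downtime_minutes_so_far, period_hours=730):
    """Remaining downtime budget in minutes for the period (default: ~1 month)."""
    budget_minutes = period_hours * 60 * (1 - slo_percent / 100)
    return budget_minutes - downtime_minutes_so_far

# Internal SLO of 99.95% over a 730-hour month gives a 21.9-minute budget
incidents = [5.0, 8.5]  # minutes of downtime from two incidents this month
remaining = error_budget_remaining(99.95, sum(incidents))
print(f"Budget remaining: {remaining:.1f} minutes")
```

Once the budget is spent, the usual policy is to freeze risky changes until reliability recovers.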

Single Points of Failure

A single point of failure (SPOF) is any component whose failure brings down the entire system. Finding and eliminating SPOFs is the core discipline of high availability engineering.

Common SPOFs in a typical web application:

  1. Single database server — the database goes down, the entire application goes down
  2. Single load balancer — the entry point fails, no traffic reaches any server
  3. Single DNS provider — DNS resolution fails, your domain stops resolving
  4. Single region — a regional outage (power, network, natural disaster) takes out everything
  5. Single configuration store — one bad config push, everything breaks
  6. Single engineer — the one person who knows how the system works is on vacation

The fix for every SPOF follows the same pattern: redundancy. Run two or more of everything, and make sure they can take over for each other automatically.

Calculating System Availability

When components are in series (all must work), multiply their availabilities:

System = A1 * A2 * A3
System = 0.999 * 0.999 * 0.999 = 0.997 (99.7%)

Three components at 99.9% each give you only 99.7% overall. Adding components in series always reduces availability.

When components are in parallel (any one working is sufficient), the formula is:

System = 1 - (1 - A1) * (1 - A2)
System = 1 - (0.001) * (0.001) = 1 - 0.000001 = 0.999999 (99.9999%)

Two components at 99.9% each, in parallel, give you 99.9999%. This is the power of redundancy.

def series_availability(*components):
    """All components must work."""
    result = 1.0
    for a in components:
        result *= a
    return result

def parallel_availability(*components):
    """At least one component must work."""
    failure_prob = 1.0
    for a in components:
        failure_prob *= (1 - a)
    return 1 - failure_prob

# Single DB: 99.9%
single_db = 0.999
print(f"Single DB: {single_db * 100}%")

# Two DBs in active-passive: 99.9999%
dual_db = parallel_availability(0.999, 0.999)
print(f"Dual DB (parallel): {dual_db * 100:.4f}%")

# Full system: LB -> App (x2) -> DB (x2) -> Cache (x2)
lb = parallel_availability(0.999, 0.999)           # Redundant LB
app = parallel_availability(0.999, 0.999, 0.999)    # 3 app servers
db = parallel_availability(0.999, 0.999)             # Primary + replica
cache = parallel_availability(0.999, 0.999)          # Cache cluster

system = series_availability(lb, app, db, cache)
print(f"Full system: {system * 100:.6f}%")

Redundancy Patterns

Stateless Services

The easiest components to make redundant are stateless services. If a service holds no state (no local sessions, no in-memory caches that cannot be lost), you can run N copies behind a load balancer and lose any of them without impact.

Rules for stateless services:

  • Store sessions in Redis or a database, not in local memory
  • Store uploads in object storage (S3), not on local disk
  • Read configuration from a central config store, not from local files
  • Make every instance identical and interchangeable

# Assumes a Flask `app` and a configured `redis_client`
import json
from datetime import timedelta

# BAD: Stateful — session stored in-memory
sessions = {}

@app.route('/login', methods=['POST'])
def login():
    user = authenticate(request)
    session_id = generate_session_id()
    sessions[session_id] = user  # Dies if this server dies
    return jsonify({"session": session_id})

# GOOD: Stateless — session stored in Redis
@app.route('/login', methods=['POST'])
def login():
    user = authenticate(request)
    session_id = generate_session_id()
    redis_client.setex(
        f"session:{session_id}",
        timedelta(hours=24),
        json.dumps({"user_id": user.id, "email": user.email})
    )
    return jsonify({"session": session_id})

Database Redundancy

Databases are harder because they hold state. The standard pattern is primary-replica replication:

-- PostgreSQL: Set up streaming replication
-- On the primary:
ALTER SYSTEM SET wal_level = 'replica';
ALTER SYSTEM SET max_wal_senders = 5;
-- Listing a standby here makes replication synchronous; omit for async
ALTER SYSTEM SET synchronous_standby_names = 'replica1';

-- On the replica:
-- pg_basebackup -h primary-host -D /var/lib/postgresql/data -U replicator -P -R
-- The -R flag creates standby.signal and sets primary_conninfo automatically

Key decisions for database replication:

Synchronous replication: The primary waits for the replica to confirm the write. Zero data loss (RPO = 0), but higher write latency. Use for financial data.

Asynchronous replication: The primary writes and moves on. The replica catches up later. Lower latency, but you can lose the most recent writes if the primary dies. Acceptable for most applications.
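The RPO difference between the two modes can be made concrete with a toy model. This is a sketch, not a real replication client — `Primary` and `Replica` are illustrative stand-ins:

```python
# Toy model of the RPO difference between sync and async replication.
# `Primary` and `Replica` are illustrative stand-ins, not real clients.

class Replica:
    def __init__(self):
        self.log = []

    def apply(self, record):
        self.log.append(record)

    def catch_up(self, primary, up_to):
        # Async replication: the replica applies the stream with some lag
        self.log = primary.log[:up_to]

class Primary:
    def __init__(self, replica, synchronous):
        self.log = []
        self.replica = replica
        self.synchronous = synchronous

    def write(self, record):
        self.log.append(record)
        if self.synchronous:
            # The primary blocks until the replica confirms: RPO = 0
            self.replica.apply(record)

# Async: the replica is 2 writes behind when the primary dies
replica = Replica()
primary = Primary(replica, synchronous=False)
for i in range(5):
    primary.write(f"txn-{i}")
replica.catch_up(primary, up_to=3)
print("Async writes lost on failover:", len(primary.log) - len(replica.log))

# Sync: every acknowledged write is already on the replica
replica2 = Replica()
primary2 = Primary(replica2, synchronous=True)
for i in range(5):
    primary2.write(f"txn-{i}")
print("Sync writes lost on failover:", len(primary2.log) - len(replica2.log))
```

The async model loses the two unreplicated writes; the sync model loses none, at the cost of every write waiting for the replica.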

Cache Redundancy

Redis Sentinel or Redis Cluster provides automatic failover:

# Redis Sentinel configuration
# sentinel.conf
sentinel monitor mymaster 10.0.0.1 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 10000
sentinel parallel-syncs mymaster 1

# "2" is the quorum: 2 sentinels must agree the master is unreachable
# before a failover can start; the failover itself must still be
# authorized by a majority of all sentinels, which prevents split-brain

Health Checks

Health checks are how the system detects failure and triggers recovery. There are two types that serve different purposes:

Liveness Checks

“Is this process alive and not deadlocked?” A liveness check failing means the process should be killed and restarted.

@app.route('/healthz')
def liveness():
    """Simple liveness check — if this endpoint responds, the process is alive."""
    return jsonify({"status": "alive"}), 200

Readiness Checks

“Is this service ready to accept traffic?” A readiness check failing means the service should be removed from the load balancer but NOT killed. It might be warming up caches, running migrations, or waiting for a dependency.

@app.route('/ready')
def readiness():
    """Readiness check — verify all dependencies are reachable."""
    checks = {}

    # Check database
    try:
        db.execute("SELECT 1")
        checks["database"] = "ok"
    except Exception as e:
        checks["database"] = f"failed: {str(e)}"

    # Check cache
    try:
        redis_client.ping()
        checks["cache"] = "ok"
    except Exception as e:
        checks["cache"] = f"failed: {str(e)}"

    # Check downstream service
    try:
        resp = requests.get("http://payment-service/healthz", timeout=2)
        checks["payment_service"] = "ok" if resp.status_code == 200 else "degraded"
    except Exception as e:
        checks["payment_service"] = f"failed: {str(e)}"

    all_ok = all(v == "ok" for v in checks.values())
    status_code = 200 if all_ok else 503

    return jsonify({"ready": all_ok, "checks": checks}), status_code

Kubernetes Health Check Configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3
  template:
    spec:
      containers:
      - name: api
        image: myapp:latest
        ports:
        - containerPort: 8080
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
          failureThreshold: 3      # Kill after 3 consecutive failures
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 3
          failureThreshold: 2      # Remove from LB after 2 failures
        startupProbe:
          httpGet:
            path: /healthz
            port: 8080
          failureThreshold: 30     # Allow up to 150s for slow startup
          periodSeconds: 5

Circuit Breakers

When a downstream dependency is failing, continuing to send it requests makes things worse. A circuit breaker stops calling a failing service and returns a fallback response instead.

import time
from enum import Enum

class CircuitState(Enum):
    CLOSED = "closed"       # Normal operation — requests flow through
    OPEN = "open"           # Failing — all requests short-circuited
    HALF_OPEN = "half_open" # Testing — allow one request through

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30, half_open_max=1):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_max = half_open_max
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.last_failure_time = 0
        self.half_open_calls = 0

    def call(self, func, *args, fallback=None, **kwargs):
        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = CircuitState.HALF_OPEN
                self.half_open_calls = 0
            else:
                # Circuit is open — return fallback immediately
                if fallback:
                    return fallback()
                raise Exception("Circuit breaker is OPEN")

        if self.state == CircuitState.HALF_OPEN:
            if self.half_open_calls >= self.half_open_max:
                if fallback:
                    return fallback()
                raise Exception("Circuit breaker HALF_OPEN limit reached")
            self.half_open_calls += 1

        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception as e:
            self._on_failure()
            if fallback:
                return fallback()
            raise

    def _on_success(self):
        self.failure_count = 0
        self.state = CircuitState.CLOSED

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN

# Usage
payment_breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=30)

def process_payment(order):
    return payment_breaker.call(
        payment_service.charge,
        order.amount,
        order.card_token,
        fallback=lambda: queue_for_retry(order)
    )

[Figure: CAP theorem — consistency, availability, partition tolerance tradeoffs]

The CAP Theorem in Practice

The CAP theorem states that a distributed data store can provide at most two of three guarantees:

  • Consistency (C): Every read receives the most recent write or an error
  • Availability (A): Every request receives a non-error response (but the data might be stale)
  • Partition tolerance (P): The system continues to operate despite network partitions between nodes

Since network partitions are unavoidable in distributed systems, the real choice is between consistency and availability during a partition. Here is what that looks like in practice.

CP Systems: Choose Consistency

When a partition occurs, CP systems refuse to serve potentially stale data. They return errors instead.

ZooKeeper uses a leader-based consensus protocol (ZAB). If a node cannot reach the leader, it stops accepting writes. This guarantees you never read stale data, but the system is unavailable during the partition.

When to choose CP: Financial transactions, inventory management, leader election, distributed locks. Any case where serving stale data causes real damage.

AP Systems: Choose Availability

When a partition occurs, AP systems continue serving requests, even if the data might be stale. They reconcile conflicts after the partition heals.

Cassandra (in its default configuration) accepts writes to any available node. If nodes are partitioned, both sides accept writes. When the partition heals, Cassandra uses last-write-wins or custom conflict resolution to merge divergent data.

DynamoDB similarly provides eventual consistency by default. You can request strongly consistent reads, but that reduces availability.

When to choose AP: Social media feeds, product catalogs, shopping carts, analytics. Cases where showing slightly stale data is better than showing nothing.

The Real-World Nuance

Most databases are not purely CP or AP. They offer tunable consistency:

# Cassandra: tunable consistency per query
from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel

cluster = Cluster(['10.0.0.1', '10.0.0.2', '10.0.0.3'])
session = cluster.connect('myapp')

# AP-style: fast, eventually consistent
session.default_consistency_level = ConsistencyLevel.ONE

# Read from any single replica — fast but possibly stale
result = session.execute("SELECT * FROM users WHERE user_id = %s", [user_id])

# CP-style: strong consistency for critical operations
from cassandra.query import SimpleStatement

stmt = SimpleStatement(
    "INSERT INTO account_balance (user_id, balance) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM
)
# QUORUM requires majority of replicas to acknowledge
# With replication_factor=3, QUORUM=2 must acknowledge
session.execute(stmt, [user_id, new_balance])

[Figure: Failover patterns — active-passive, active-active, and multi-region]

Failover Patterns

Active-Passive (Hot Standby)

One server handles all traffic. A standby server is synchronized and ready to take over if the primary fails.

How it works:

  1. Primary handles all reads and writes
  2. Data is replicated to the standby (synchronous or async)
  3. Health checks monitor the primary
  4. If the primary fails, the standby is promoted
  5. DNS or virtual IP is updated to point to the new primary

RTO (Recovery Time Objective): 30 seconds to 5 minutes
RPO (Recovery Point Objective): Near zero with synchronous replication

Drawback: The standby server is idle, wasting resources. But it is simple and well-understood.
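Steps 3 through 5 above can be sketched as a small promotion loop. Everything here is an illustrative stand-in: `check_health`, `promote`, and `update_virtual_ip` would wrap your real monitoring and VIP tooling.

```python
import time

def run_failover_loop(primary, standby, check_health, promote, update_virtual_ip,
                      interval=2, failure_threshold=3):
    """Promote the standby after `failure_threshold` consecutive failed checks."""
    failures = 0
    while True:
        if check_health(primary):
            failures = 0                    # step 3: primary is healthy
        else:
            failures += 1
            if failures >= failure_threshold:
                promote(standby)            # step 4: promote the standby
                update_virtual_ip(standby)  # step 5: repoint DNS/VIP
                return standby              # the standby is now the primary
        time.sleep(interval)

# Demo: the health check always fails, so the standby gets promoted
new_primary = run_failover_loop(
    primary="db1", standby="db2",
    check_health=lambda node: False,
    promote=lambda node: print(f"Promoting {node}"),
    update_virtual_ip=lambda node: print(f"VIP now points at {node}"),
    interval=0,
)
print(f"New primary: {new_primary}")
```

Requiring several consecutive failures before promoting avoids flapping on a single dropped health check.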

Active-Active

Both servers handle traffic simultaneously. If one fails, the other absorbs the full load.

How it works:

  1. Load balancer distributes traffic across both nodes
  2. Both nodes serve reads and writes
  3. Data is synchronized bidirectionally
  4. If one node fails, the LB routes 100% to the survivor

RTO: Near zero — the load balancer detects failure in seconds
RPO: Depends on replication strategy

Challenge: Conflict resolution. If both nodes accept writes to the same record simultaneously, you need a strategy — last-write-wins, vector clocks, CRDTs, or application-level merge logic.
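Last-write-wins, the simplest of those strategies, fits in a few lines. This sketch uses illustrative wall-clock timestamps, which is also its weakness: the losing concurrent write is dropped silently.

```python
# Last-write-wins merge after a partition heals.
# Each side records (timestamp, value) per key; the newer write wins.
# Note: wall-clock LWW silently discards the losing concurrent write.

def lww_merge(side_a, side_b):
    merged = dict(side_a)
    for key, (ts, value) in side_b.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)
    return merged

# Both sides accepted a write to "cart:42" during the partition
side_a = {"cart:42": (1700000010, ["book"])}
side_b = {"cart:42": (1700000015, ["lamp"]), "cart:7": (1700000001, ["pen"])}

merged = lww_merge(side_a, side_b)
print(merged["cart:42"])  # the later "lamp" write wins; "book" is lost
```

Vector clocks and CRDTs exist precisely to detect or merge such concurrent writes instead of discarding one.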

Multi-Region

Deploy full copies of your stack in multiple geographic regions. A global load balancer routes traffic to the nearest healthy region.

# AWS Route53 health check + failover routing
# Primary region: us-east-1
# Failover region: eu-west-1

resource "aws_route53_health_check" "primary" {
  fqdn               = "us-east-1.api.myapp.com"
  port               = 443
  type               = "HTTPS"
  resource_path      = "/ready"
  failure_threshold  = 3
  request_interval   = 10
}

resource "aws_route53_record" "api" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "api.myapp.com"
  type    = "A"

  failover_routing_policy {
    type = "PRIMARY"
  }

  alias {
    name                   = aws_lb.us_east.dns_name
    zone_id                = aws_lb.us_east.zone_id
    evaluate_target_health = true
  }

  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.primary.id
}

resource "aws_route53_record" "api_failover" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "api.myapp.com"
  type    = "A"

  failover_routing_policy {
    type = "SECONDARY"
  }

  alias {
    name                   = aws_lb.eu_west.dns_name
    zone_id                = aws_lb.eu_west.zone_id
    evaluate_target_health = true
  }

  set_identifier = "secondary"
}

Graceful Degradation

When parts of your system fail, the remaining parts should still provide value. Degrade gracefully instead of failing completely.

class ProductService:
    def get_product(self, product_id):
        """
        Graceful degradation strategy:
        1. Try primary database
        2. Fall back to read replica
        3. Fall back to cache
        4. Return minimal data from search index
        """
        # Attempt 1: Primary database (freshest data)
        try:
            return self.primary_db.get_product(product_id)
        except DatabaseError:
            pass

        # Attempt 2: Read replica (possibly slightly stale)
        try:
            product = self.read_replica.get_product(product_id)
            product["_source"] = "replica"
            return product
        except DatabaseError:
            pass

        # Attempt 3: Redis cache (might be stale by minutes)
        cached = self.redis.get(f"product:{product_id}")
        if cached:
            product = json.loads(cached)
            product["_source"] = "cache"
            product["_stale"] = True
            return product

        # Attempt 4: Elasticsearch (basic info only)
        try:
            result = self.es.get(index="products", id=product_id)
            return {
                "id": product_id,
                "name": result["_source"]["name"],
                "price": result["_source"]["price"],
                "_source": "search_index",
                "_partial": True
            }
        except Exception:
            raise ServiceUnavailableError("Product data unavailable")

Feature flags enable selective degradation:

# Disable non-critical features under load
FEATURE_FLAGS = {
    "recommendations": True,
    "reviews": True,
    "personalization": True,
    "analytics_tracking": True,
}

def get_product_page(product_id):
    product = product_service.get_product(product_id)
    response = {"product": product}

    if FEATURE_FLAGS["recommendations"]:
        try:
            response["recommendations"] = recommendation_service.get(product_id)
        except Exception:
            response["recommendations"] = []  # Empty, not broken

    if FEATURE_FLAGS["reviews"]:
        try:
            response["reviews"] = review_service.get(product_id)
        except Exception:
            response["reviews"] = {"message": "Reviews temporarily unavailable"}

    return response

Chaos Engineering — Testing Your Resilience

Chaos engineering is the practice of intentionally injecting failures into your system to verify that your failover mechanisms actually work. Netflix pioneered this approach with Chaos Monkey.

The core principle: if you have not tested your failover, you do not have failover.

# Simple chaos testing script
import random
import subprocess

class ChaosMonkey:
    def __init__(self, targets):
        self.targets = targets

    def kill_random_instance(self):
        """Randomly terminate one instance during business hours."""
        target = random.choice(self.targets)
        print(f"Chaos Monkey: Terminating {target['instance_id']} "
              f"in {target['service']}")

        # Verify we have enough healthy instances first
        # (count_healthy is assumed to query your orchestrator or LB)
        healthy = self.count_healthy(target['service'])
        if healthy <= target['min_healthy']:
            print(f"Skipping: only {healthy} healthy instances "
                  f"(minimum: {target['min_healthy']})")
            return

        subprocess.run([
            "aws", "ec2", "terminate-instances",
            "--instance-ids", target['instance_id']
        ])

    def simulate_network_partition(self, host):
        """Block traffic to simulate a network partition."""
        # Using iptables to drop packets
        subprocess.run([
            "iptables", "-A", "INPUT",
            "-s", host, "-j", "DROP"
        ])
        print(f"Network partition simulated: blocking traffic from {host}")

    def inject_latency(self, interface, latency_ms):
        """Add artificial latency to network interface."""
        subprocess.run([
            "tc", "qdisc", "add", "dev", interface,
            "root", "netem", "delay", f"{latency_ms}ms"
        ])
        print(f"Injected {latency_ms}ms latency on {interface}")

Start small. Before running Chaos Monkey in production, begin with:

  1. Table-top exercises: Walk through failure scenarios on a whiteboard
  2. Staging chaos: Inject failures in staging environments first
  3. Game days: Scheduled, controlled chaos experiments with the team watching
  4. Progressive rollout: Start with one service, expand gradually

Putting It Together — A Checklist

When designing for high availability, walk through each layer:

| Layer | Question | Strategy |
|---|---|---|
| DNS | What if DNS fails? | Multiple DNS providers, low TTL |
| Load Balancer | What if the LB fails? | Redundant LBs (active-passive) |
| Application | What if a server crashes? | Stateless services, auto-scaling group |
| Database | What if the primary dies? | Replication + automatic failover |
| Cache | What if Redis goes down? | Redis Sentinel/Cluster, degrade gracefully |
| Region | What if a region goes down? | Multi-region deployment, DNS failover |
| Deploy | What if a bad deploy goes out? | Canary deploys, instant rollback |
| Config | What if a config change breaks things? | Feature flags, config versioning |

Key Takeaways

  1. Measure with nines. 99.9% and 99.99% are vastly different engineering challenges. Know your SLO and the cost of each additional nine.

  2. Eliminate every single point of failure. If a component has no redundancy, assume it will fail at the worst possible time. Audit your architecture for SPOFs at every layer.

  3. Parallel redundancy multiplies availability. Two servers at 99.9% each, in parallel, give you 99.9999%. Series components multiply failure probability. Redundancy is not optional.

  4. Health checks must be meaningful. Liveness checks answer “is the process alive?” Readiness checks answer “can this instance serve traffic right now?” Use both, and make readiness checks verify real dependencies.

  5. Circuit breakers prevent cascading failure. When a dependency is down, stop calling it. Return degraded responses instead of letting failures propagate through the entire system.

  6. CAP is a spectrum, not a binary choice. Most databases offer tunable consistency. Use strong consistency (CP) for financial data, eventual consistency (AP) for feeds and catalogs. Choose per-operation, not per-system.

  7. Graceful degradation over hard failure. A product page without reviews is better than a 500 error. Design every feature to degrade independently.

  8. If you have not tested failover, you do not have failover. Run chaos experiments. Kill instances. Simulate partitions. The time to discover your failover is broken is not during a real outage.