High availability is the property of a system that operates continuously without failure for a stated period of time. In practice, “high availability” means your system keeps serving requests even when individual components fail — and components will fail. Disks corrupt, servers crash, networks partition, entire data centers go dark.
The goal is not to prevent failure. The goal is to make failure invisible to users.
Measuring Availability — The Nines
Availability is measured as a percentage of uptime over a given period. The industry uses “nines” as shorthand:
| Availability | Downtime/Year | Downtime/Month | Downtime/Week |
|---|---|---|---|
| 99% (two nines) | 3.65 days | 7.3 hours | 1.68 hours |
| 99.9% (three nines) | 8.77 hours | 43.8 minutes | 10.1 minutes |
| 99.99% (four nines) | 52.6 minutes | 4.38 minutes | 1.01 minutes |
| 99.999% (five nines) | 5.26 minutes | 26.3 seconds | 6.05 seconds |
Going from 99.9% to 99.99% does not sound like much. But it means cutting your allowed downtime from 8.7 hours per year to 52 minutes per year. That is a fundamentally different engineering challenge. Every additional nine typically costs 10x more in infrastructure and operational complexity.
SLAs, SLOs, and SLIs
These terms get thrown around loosely. Here is what they actually mean:
- SLI (Service Level Indicator): A quantitative measurement. Example: “the percentage of requests completing in under 200ms.”
- SLO (Service Level Objective): A target value for an SLI. Example: “99.9% of requests must complete in under 200ms.”
- SLA (Service Level Agreement): A contract with consequences. Example: “If we drop below 99.9% availability, customers get service credits.”
Your SLO should be stricter than your SLA. If your SLA promises 99.9%, your internal SLO should target 99.95% so you have a buffer before you owe anyone money.
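One way to operationalize that buffer is an error budget: the downtime you can still "spend" in the current period before breaching the target. A minimal sketch (function and parameter names are illustrative):

```python
def error_budget_remaining(slo_percent, observed_downtime_min, period_min=43_200):
    """Minutes of downtime still allowed this period (default: a 30-day month)."""
    budget = period_min * (1 - slo_percent / 100)
    return budget - observed_downtime_min

# A 99.9% SLO over a 30-day month allows 43.2 minutes of downtime.
remaining = error_budget_remaining(99.9, observed_downtime_min=10)
print(f"Budget left: {remaining:.1f} minutes")  # Budget left: 33.2 minutes
```

When the budget hits zero, the usual policy is to freeze risky launches and spend engineering time on reliability instead.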
# Calculate allowed downtime from availability target
def allowed_downtime(availability_percent, period_hours=8760):
"""
availability_percent: e.g. 99.99
period_hours: hours in measurement period (8760 = 1 year)
"""
downtime_fraction = 1 - (availability_percent / 100)
downtime_hours = period_hours * downtime_fraction
downtime_minutes = downtime_hours * 60
print(f"Availability: {availability_percent}%")
print(f"Allowed downtime per year: {downtime_hours:.2f} hours ({downtime_minutes:.1f} minutes)")
print(f"Allowed downtime per month: {downtime_minutes / 12:.1f} minutes")
return downtime_minutes
# Four nines: only 52.6 minutes per year
allowed_downtime(99.99)
Single Points of Failure
A single point of failure (SPOF) is any component whose failure brings down the entire system. Finding and eliminating SPOFs is the core discipline of high availability engineering.
Common SPOFs in a typical web application:
- Single database server — the database goes down, the entire application goes down
- Single load balancer — the entry point fails, no traffic reaches any server
- Single DNS provider — DNS resolution fails, your domain stops resolving
- Single region — a regional outage (power, network, natural disaster) takes out everything
- Single configuration store — one bad config push, everything breaks
- Single engineer — the one person who knows how the system works is on vacation
The fix for every SPOF follows the same pattern: redundancy. Run two or more of everything, and make sure they can take over for each other automatically.
Calculating System Availability
When components are in series (all must work), multiply their availabilities:
System = A1 * A2 * A3
System = 0.999 * 0.999 * 0.999 = 0.997 (99.7%)
Three components at 99.9% each give you only 99.7% overall. Adding components in series always reduces availability.
When components are in parallel (any one working is sufficient), the formula is:
System = 1 - (1 - A1) * (1 - A2)
System = 1 - (0.001) * (0.001) = 1 - 0.000001 = 0.999999 (99.9999%)
Two components at 99.9% each, in parallel, give you 99.9999%. This is the power of redundancy.
def series_availability(*components):
"""All components must work."""
result = 1.0
for a in components:
result *= a
return result
def parallel_availability(*components):
"""At least one component must work."""
failure_prob = 1.0
for a in components:
failure_prob *= (1 - a)
return 1 - failure_prob
# Single DB: 99.9%
single_db = 0.999
print(f"Single DB: {single_db * 100}%")
# Two DBs in active-passive: 99.9999%
dual_db = parallel_availability(0.999, 0.999)
print(f"Dual DB (parallel): {dual_db * 100:.4f}%")
# Full system: LB -> App (x2) -> DB (x2) -> Cache (x2)
lb = parallel_availability(0.999, 0.999) # Redundant LB
app = parallel_availability(0.999, 0.999, 0.999) # 3 app servers
db = parallel_availability(0.999, 0.999) # Primary + replica
cache = parallel_availability(0.999, 0.999) # Cache cluster
system = series_availability(lb, app, db, cache)
print(f"Full system: {system * 100:.6f}%")
Redundancy Patterns
Stateless Services
The easiest components to make redundant are stateless services. If a service holds no state (no local sessions, no in-memory caches that cannot be lost), you can run N copies behind a load balancer and lose any of them without impact.
Rules for stateless services:
- Store sessions in Redis or a database, not in local memory
- Store uploads in object storage (S3), not on local disk
- Read configuration from a central config store, not from local files
- Make every instance identical and interchangeable
# BAD: Stateful — session stored in-memory
sessions = {}
@app.route('/login', methods=['POST'])
def login():
user = authenticate(request)
session_id = generate_session_id()
sessions[session_id] = user # Dies if this server dies
return jsonify({"session": session_id})
# GOOD: Stateless — session stored in Redis
@app.route('/login', methods=['POST'])
def login():
user = authenticate(request)
session_id = generate_session_id()
redis_client.setex(
f"session:{session_id}",
timedelta(hours=24),
json.dumps({"user_id": user.id, "email": user.email})
)
return jsonify({"session": session_id})
Database Redundancy
Databases are harder because they hold state. The standard pattern is primary-replica replication:
-- PostgreSQL: Set up streaming replication
-- On the primary:
ALTER SYSTEM SET wal_level = 'replica';
ALTER SYSTEM SET max_wal_senders = 5;
ALTER SYSTEM SET synchronous_standby_names = 'replica1';
-- On the replica:
-- pg_basebackup -h primary-host -D /var/lib/postgresql/data -U replicator -P -R
-- The -R flag creates standby.signal and sets primary_conninfo automatically
Key decisions for database replication:
Synchronous replication: The primary waits for the replica to confirm the write. Zero data loss (RPO = 0), but higher write latency. Use for financial data.
Asynchronous replication: The primary writes and moves on. The replica catches up later. Lower latency, but you can lose the most recent writes if the primary dies. Acceptable for most applications.
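To make the async trade-off concrete, here is a toy model (not a real replication protocol) of the risk: any writes acknowledged on the primary but not yet shipped to the replica are lost when the primary dies. That lost window is your RPO.

```python
class ToyAsyncReplication:
    """Toy model: the primary acks writes immediately; the replica applies them later."""
    def __init__(self):
        self.primary_log = []   # all acknowledged writes
        self.replica_log = []   # writes the replica has applied so far

    def write(self, record):
        self.primary_log.append(record)  # acked before replication: async

    def replicate(self, n=1):
        """Ship the next n unreplicated writes to the replica."""
        start = len(self.replica_log)
        self.replica_log.extend(self.primary_log[start:start + n])

    def primary_crash(self):
        """Failover: the replica becomes the new primary; lagging writes are lost."""
        lost = self.primary_log[len(self.replica_log):]
        self.primary_log = list(self.replica_log)
        return lost

db = ToyAsyncReplication()
for i in range(5):
    db.write(f"txn-{i}")
db.replicate(n=3)        # replica is 2 writes behind
lost = db.primary_crash()
print(lost)              # ['txn-3', 'txn-4']: the RPO in action
```

With synchronous replication, `write` would not return until `replicate` had run, so `lost` would always be empty, at the cost of every write paying a network round trip.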
Cache Redundancy
Redis Sentinel or Redis Cluster provides automatic failover:
# Redis Sentinel configuration
# sentinel.conf
sentinel monitor mymaster 10.0.0.1 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 10000
sentinel parallel-syncs mymaster 1
# "2" means 2 sentinels must agree the master is down before failover
# This prevents split-brain scenarios
Health Checks
Health checks are how the system detects failure and triggers recovery. There are two types that serve different purposes:
Liveness Checks
“Is this process alive and not deadlocked?” A liveness check failing means the process should be killed and restarted.
@app.route('/healthz')
def liveness():
"""Simple liveness check — if this endpoint responds, the process is alive."""
return jsonify({"status": "alive"}), 200
Readiness Checks
“Is this service ready to accept traffic?” A readiness check failing means the service should be removed from the load balancer but NOT killed. It might be warming up caches, running migrations, or waiting for a dependency.
@app.route('/ready')
def readiness():
"""Readiness check — verify all dependencies are reachable."""
checks = {}
# Check database
try:
db.execute("SELECT 1")
checks["database"] = "ok"
except Exception as e:
checks["database"] = f"failed: {str(e)}"
# Check cache
try:
redis_client.ping()
checks["cache"] = "ok"
except Exception as e:
checks["cache"] = f"failed: {str(e)}"
# Check downstream service
try:
resp = requests.get("http://payment-service/healthz", timeout=2)
checks["payment_service"] = "ok" if resp.status_code == 200 else "degraded"
except Exception as e:
checks["payment_service"] = f"failed: {str(e)}"
all_ok = all(v == "ok" for v in checks.values())
status_code = 200 if all_ok else 503
return jsonify({"ready": all_ok, "checks": checks}), status_code
Kubernetes Health Check Configuration
apiVersion: apps/v1
kind: Deployment
metadata:
name: api-server
spec:
replicas: 3
template:
spec:
containers:
- name: api
image: myapp:latest
ports:
- containerPort: 8080
livenessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
failureThreshold: 3 # Kill after 3 consecutive failures
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 5
periodSeconds: 3
failureThreshold: 2 # Remove from LB after 2 failures
startupProbe:
httpGet:
path: /healthz
port: 8080
failureThreshold: 30 # Allow up to 150s for slow startup
periodSeconds: 5
Circuit Breakers
When a downstream dependency is failing, continuing to send it requests makes things worse. A circuit breaker stops calling a failing service and returns a fallback response instead.
import time
from enum import Enum
class CircuitState(Enum):
CLOSED = "closed" # Normal operation — requests flow through
OPEN = "open" # Failing — all requests short-circuited
HALF_OPEN = "half_open" # Testing — allow one request through
class CircuitBreaker:
def __init__(self, failure_threshold=5, recovery_timeout=30, half_open_max=1):
self.failure_threshold = failure_threshold
self.recovery_timeout = recovery_timeout
self.half_open_max = half_open_max
self.state = CircuitState.CLOSED
self.failure_count = 0
self.last_failure_time = 0
self.half_open_calls = 0
def call(self, func, *args, fallback=None, **kwargs):
if self.state == CircuitState.OPEN:
if time.time() - self.last_failure_time > self.recovery_timeout:
self.state = CircuitState.HALF_OPEN
self.half_open_calls = 0
else:
# Circuit is open — return fallback immediately
if fallback:
return fallback()
raise Exception("Circuit breaker is OPEN")
if self.state == CircuitState.HALF_OPEN:
if self.half_open_calls >= self.half_open_max:
if fallback:
return fallback()
raise Exception("Circuit breaker HALF_OPEN limit reached")
self.half_open_calls += 1
try:
result = func(*args, **kwargs)
self._on_success()
return result
except Exception as e:
self._on_failure()
if fallback:
return fallback()
raise
def _on_success(self):
self.failure_count = 0
self.state = CircuitState.CLOSED
def _on_failure(self):
self.failure_count += 1
self.last_failure_time = time.time()
if self.failure_count >= self.failure_threshold:
self.state = CircuitState.OPEN
# Usage
payment_breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=30)
def process_payment(order):
return payment_breaker.call(
payment_service.charge,
order.amount,
order.card_token,
fallback=lambda: queue_for_retry(order)
)
The CAP Theorem in Practice
The CAP theorem states that a distributed data store can provide at most two of three guarantees:
- Consistency (C): Every read receives the most recent write or an error
- Availability (A): Every request receives a non-error response (but the data might be stale)
- Partition tolerance (P): The system continues to operate despite network partitions between nodes
Since network partitions are unavoidable in distributed systems, the real choice is between consistency and availability during a partition. Here is what that looks like in practice.
CP Systems: Choose Consistency
When a partition occurs, CP systems refuse to serve potentially stale data. They return errors instead.
ZooKeeper uses a leader-based consensus protocol (ZAB). If a node cannot reach the leader, it stops accepting writes. This guarantees you never read stale data, but the system is unavailable during the partition.
When to choose CP: Financial transactions, inventory management, leader election, distributed locks. Any case where serving stale data causes real damage.
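Distributed locks show why CP matters: two simultaneous holders defeats the purpose entirely. Production systems use ZooKeeper or etcd for this; the sketch below is a single-process toy showing only the lease idea (a lock that expires, so a crashed holder cannot block everyone forever).

```python
import time

class LeaseLock:
    """Toy lease-based lock: held for at most ttl seconds, then reclaimable."""
    def __init__(self, ttl=10.0):
        self.ttl = ttl
        self.holder = None
        self.expires_at = 0.0

    def acquire(self, client_id, now=None):
        now = time.time() if now is None else now
        if self.holder is None or now >= self.expires_at:
            self.holder = client_id
            self.expires_at = now + self.ttl
            return True
        return False  # someone else holds a live lease

    def release(self, client_id):
        if self.holder == client_id:
            self.holder = None

lock = LeaseLock(ttl=10.0)
assert lock.acquire("worker-a", now=0.0)       # granted
assert not lock.acquire("worker-b", now=5.0)   # refused: lease still live
assert lock.acquire("worker-b", now=11.0)      # lease expired, reclaimed
```

In a real CP store the lease state lives on a consensus quorum, so during a partition the minority side refuses to grant locks rather than risk two holders.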
AP Systems: Choose Availability
When a partition occurs, AP systems continue serving requests, even if the data might be stale. They reconcile conflicts after the partition heals.
Cassandra (in its default configuration) accepts writes to any available node. If nodes are partitioned, both sides accept writes. When the partition heals, Cassandra uses last-write-wins or custom conflict resolution to merge divergent data.
DynamoDB similarly provides eventual consistency by default. You can request strongly consistent reads, but that reduces availability.
When to choose AP: Social media feeds, product catalogs, shopping carts, analytics. Cases where showing slightly stale data is better than showing nothing.
The Real-World Nuance
Most databases are not purely CP or AP. They offer tunable consistency:
# Cassandra: tunable consistency per query
from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
cluster = Cluster(['10.0.0.1', '10.0.0.2', '10.0.0.3'])
session = cluster.connect('myapp')
# AP-style: fast, eventually consistent
session.default_consistency_level = ConsistencyLevel.ONE
# Read from any single replica — fast but possibly stale
result = session.execute("SELECT * FROM users WHERE user_id = %s", [user_id])
# CP-style: strong consistency for critical operations
from cassandra.query import SimpleStatement
stmt = SimpleStatement(
"INSERT INTO account_balance (user_id, balance) VALUES (%s, %s)",
consistency_level=ConsistencyLevel.QUORUM
)
# QUORUM requires majority of replicas to acknowledge
# With replication_factor=3, QUORUM=2 must acknowledge
session.execute(stmt, [user_id, new_balance])
Failover Patterns
Active-Passive (Hot Standby)
One server handles all traffic. A standby server is synchronized and ready to take over if the primary fails.
How it works:
- Primary handles all reads and writes
- Data is replicated to the standby (synchronous or async)
- Health checks monitor the primary
- If the primary fails, the standby is promoted
- DNS or virtual IP is updated to point to the new primary
RTO (Recovery Time Objective): 30 seconds to 5 minutes
RPO (Recovery Point Objective): Near zero with synchronous replication
Drawback: The standby server is idle, wasting resources. But it is simple and well-understood.
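The detection-and-promotion sequence can be sketched as a monitoring loop. Everything here is illustrative: real failover managers (Patroni, Pacemaker, managed RDS) also handle fencing and split-brain, which this toy does not.

```python
class FailoverMonitor:
    """Toy active-passive failover: promote the standby after N failed health checks."""
    def __init__(self, check_primary, promote_standby, failure_threshold=3):
        self.check_primary = check_primary      # callable -> bool (is primary healthy?)
        self.promote_standby = promote_standby  # callable, e.g. moves the virtual IP
        self.failure_threshold = failure_threshold
        self.failures = 0
        self.failed_over = False

    def tick(self):
        """Run one health-check cycle."""
        if self.failed_over:
            return
        if self.check_primary():
            self.failures = 0  # require *consecutive* failures before promoting
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.promote_standby()
                self.failed_over = True

events = []
monitor = FailoverMonitor(
    check_primary=lambda: False,                  # simulate a dead primary
    promote_standby=lambda: events.append("promoted"),
)
for _ in range(5):
    monitor.tick()
print(events)  # ['promoted']
```

Requiring consecutive failures is what keeps a single dropped health check from triggering an unnecessary (and disruptive) failover.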
Active-Active
Both servers handle traffic simultaneously. If one fails, the other absorbs the full load.
How it works:
- Load balancer distributes traffic across both nodes
- Both nodes serve reads and writes
- Data is synchronized bidirectionally
- If one node fails, the LB routes 100% to the survivor
RTO: Near zero — the load balancer detects failure in seconds
RPO: Depends on replication strategy
Challenge: Conflict resolution. If both nodes accept writes to the same record simultaneously, you need a strategy — last-write-wins, vector clocks, CRDTs, or application-level merge logic.
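Last-write-wins is the simplest of those strategies: tag every write with a timestamp and, on merge, keep the newer value. This illustrative sketch glosses over clock skew between nodes, which is precisely why vector clocks and CRDTs exist.

```python
class LWWRegister:
    """Toy last-write-wins register for merging divergent active-active writes."""
    def __init__(self):
        self.value = None
        self.timestamp = 0.0

    def write(self, value, timestamp):
        if timestamp > self.timestamp:
            self.value = value
            self.timestamp = timestamp

    def merge(self, other):
        """After a partition heals, keep whichever side wrote last."""
        self.write(other.value, other.timestamp)

# During a partition, both sides accept writes to the same record
node_a, node_b = LWWRegister(), LWWRegister()
node_a.write("alice@old.example", timestamp=100.0)
node_b.write("alice@new.example", timestamp=105.0)
node_a.merge(node_b)   # partition heals
print(node_a.value)    # alice@new.example, because the later write wins
```

The silent cost: node_a's write is discarded without any error. Whether that is acceptable depends entirely on what the record means.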
Multi-Region
Deploy full copies of your stack in multiple geographic regions. A global load balancer routes traffic to the nearest healthy region.
# AWS Route53 health check + failover routing
# Primary region: us-east-1
# Failover region: eu-west-1
resource "aws_route53_health_check" "primary" {
fqdn = "us-east-1.api.myapp.com"
port = 443
type = "HTTPS"
resource_path = "/ready"
failure_threshold = 3
request_interval = 10
}
resource "aws_route53_record" "api" {
zone_id = aws_route53_zone.main.zone_id
name = "api.myapp.com"
type = "A"
failover_routing_policy {
type = "PRIMARY"
}
alias {
name = aws_lb.us_east.dns_name
zone_id = aws_lb.us_east.zone_id
evaluate_target_health = true
}
set_identifier = "primary"
health_check_id = aws_route53_health_check.primary.id
}
resource "aws_route53_record" "api_failover" {
zone_id = aws_route53_zone.main.zone_id
name = "api.myapp.com"
type = "A"
failover_routing_policy {
type = "SECONDARY"
}
alias {
name = aws_lb.eu_west.dns_name
zone_id = aws_lb.eu_west.zone_id
evaluate_target_health = true
}
set_identifier = "secondary"
}
Graceful Degradation
When parts of your system fail, the remaining parts should still provide value. Degrade gracefully instead of failing completely.
class ProductService:
def get_product(self, product_id):
"""
Graceful degradation strategy:
1. Try primary database
2. Fall back to read replica
3. Fall back to cache
4. Return minimal data from search index
"""
# Attempt 1: Primary database (freshest data)
try:
return self.primary_db.get_product(product_id)
except DatabaseError:
pass
# Attempt 2: Read replica (possibly slightly stale)
try:
product = self.read_replica.get_product(product_id)
product["_source"] = "replica"
return product
except DatabaseError:
pass
# Attempt 3: Redis cache (might be stale by minutes)
cached = self.redis.get(f"product:{product_id}")
if cached:
product = json.loads(cached)
product["_source"] = "cache"
product["_stale"] = True
return product
# Attempt 4: Elasticsearch (basic info only)
try:
result = self.es.get(index="products", id=product_id)
return {
"id": product_id,
"name": result["_source"]["name"],
"price": result["_source"]["price"],
"_source": "search_index",
"_partial": True
}
except Exception:
raise ServiceUnavailableError("Product data unavailable")
Feature flags enable selective degradation:
# Disable non-critical features under load
FEATURE_FLAGS = {
"recommendations": True,
"reviews": True,
"personalization": True,
"analytics_tracking": True,
}
def get_product_page(product_id):
product = product_service.get_product(product_id)
response = {"product": product}
if FEATURE_FLAGS["recommendations"]:
try:
response["recommendations"] = recommendation_service.get(product_id)
except Exception:
response["recommendations"] = [] # Empty, not broken
if FEATURE_FLAGS["reviews"]:
try:
response["reviews"] = review_service.get(product_id)
except Exception:
response["reviews"] = {"message": "Reviews temporarily unavailable"}
return response
Chaos Engineering — Testing Your Resilience
Chaos engineering is the practice of intentionally injecting failures into your system to verify that your failover mechanisms actually work. Netflix pioneered this approach with Chaos Monkey.
The core principle: if you have not tested your failover, you do not have failover.
# Simple chaos testing script
import random
import subprocess
class ChaosMonkey:
def __init__(self, targets):
self.targets = targets
def kill_random_instance(self):
"""Randomly terminate one instance during business hours."""
target = random.choice(self.targets)
print(f"Chaos Monkey: Terminating {target['instance_id']} "
f"in {target['service']}")
# Verify we have enough healthy instances first
healthy = self.count_healthy(target['service'])
if healthy <= target['min_healthy']:
print(f"Skipping: only {healthy} healthy instances "
f"(minimum: {target['min_healthy']})")
return
subprocess.run([
"aws", "ec2", "terminate-instances",
"--instance-ids", target['instance_id']
])
def simulate_network_partition(self, host):
"""Block traffic to simulate a network partition."""
# Using iptables to drop packets
subprocess.run([
"iptables", "-A", "INPUT",
"-s", host, "-j", "DROP"
])
print(f"Network partition simulated: blocking traffic from {host}")
def inject_latency(self, interface, latency_ms):
"""Add artificial latency to network interface."""
subprocess.run([
"tc", "qdisc", "add", "dev", interface,
"root", "netem", "delay", f"{latency_ms}ms"
])
print(f"Injected {latency_ms}ms latency on {interface}")
Start small. Before running Chaos Monkey in production, begin with:
- Table-top exercises: Walk through failure scenarios on a whiteboard
- Staging chaos: Inject failures in staging environments first
- Game days: Scheduled, controlled chaos experiments with the team watching
- Progressive rollout: Start with one service, expand gradually
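A chaos experiment is only useful with a steady-state hypothesis: define what "healthy" looks like, inject the fault, and verify the hypothesis still holds. A minimal harness sketch, where the probe and fault callables are placeholders you would supply:

```python
def run_experiment(steady_state_probe, inject_fault, rollback_fault):
    """Toy chaos harness: verify steady state before and during an injected fault."""
    if not steady_state_probe():
        return "aborted: system unhealthy before the experiment"
    try:
        inject_fault()
        if steady_state_probe():
            return "passed: steady state held during the fault"
        return "failed: steady state broken, failover did not work"
    finally:
        rollback_fault()  # always clean up, pass or fail

# Example run with stub callables (real ones would hit /ready, check error
# rates against the SLO, kill an instance, restore iptables rules, etc.)
result = run_experiment(
    steady_state_probe=lambda: True,
    inject_fault=lambda: None,
    rollback_fault=lambda: None,
)
print(result)  # passed: steady state held during the fault
```

The abort-if-already-unhealthy guard matters as much as the fault itself: injecting chaos into a system that is already degraded turns an experiment into an outage.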
Putting It Together — A Checklist
When designing for high availability, walk through each layer:
| Layer | Question | Strategy |
|---|---|---|
| DNS | What if DNS fails? | Multiple DNS providers, low TTL |
| Load Balancer | What if the LB fails? | Redundant LBs (active-passive) |
| Application | What if a server crashes? | Stateless services, auto-scaling group |
| Database | What if the primary dies? | Replication + automatic failover |
| Cache | What if Redis goes down? | Redis Sentinel/Cluster, degrade gracefully |
| Region | What if a region goes down? | Multi-region deployment, DNS failover |
| Deploy | What if a bad deploy goes out? | Canary deploys, instant rollback |
| Config | What if a config change breaks things? | Feature flags, config versioning |
Key Takeaways
- Measure with nines. 99.9% and 99.99% are vastly different engineering challenges. Know your SLO and the cost of each additional nine.
- Eliminate every single point of failure. If a component has no redundancy, assume it will fail at the worst possible time. Audit your architecture for SPOFs at every layer.
- Parallel redundancy multiplies availability. Two servers at 99.9% each, in parallel, give you 99.9999%. Series components multiply failure probability. Redundancy is not optional.
- Health checks must be meaningful. Liveness checks answer “is the process alive?” Readiness checks answer “can this instance serve traffic right now?” Use both, and make readiness checks verify real dependencies.
- Circuit breakers prevent cascading failure. When a dependency is down, stop calling it. Return degraded responses instead of letting failures propagate through the entire system.
- CAP is a spectrum, not a binary choice. Most databases offer tunable consistency. Use strong consistency (CP) for financial data, eventual consistency (AP) for feeds and catalogs. Choose per-operation, not per-system.
- Graceful degradation over hard failure. A product page without reviews is better than a 500 error. Design every feature to degrade independently.
- If you have not tested failover, you do not have failover. Run chaos experiments. Kill instances. Simulate partitions. The time to discover your failover is broken is not during a real outage.
