Most engineers fail system design interviews not because they lack knowledge, but because they lack structure. They jump straight into drawing boxes and arrows, skip critical constraints, and run out of time before addressing the hard parts.
This lesson gives you a repeatable 6-step framework that works for any system design problem — whether it is a URL shortener, a distributed file system, or a real-time chat application.
Step 1: Clarify Requirements (5 minutes)
The single biggest mistake in system design is solving the wrong problem. Before you draw a single box, understand what you are building and for whom.
Functional Requirements
These define what the system does. Ask questions like: Who are the users? What are the core actions? What data flows in and out?
Example: Design a URL Shortener
Functional Requirements:
1. Given a long URL, generate a short URL
2. Given a short URL, redirect to the original URL
3. Users can optionally set custom short URLs
4. URLs expire after a configurable TTL
5. Analytics: track click counts per URL
Out of scope:
- User accounts / authentication
- URL editing after creation
- Bulk URL creation API

Non-Functional Requirements
These define how the system behaves under load and failure. They drive architectural decisions and are often more important than functional requirements.
Non-Functional Requirements for URL Shortener:
- Availability: 99.99% uptime (< 52 min downtime/year)
- Latency: Redirect in < 50ms (p99)
- Scale: 100M new URLs/day, 10:1 read/write ratio
- Consistency: Eventual for analytics; strong for URL creation
- Durability: URLs must never be lost once created

The key non-functional dimensions you should always consider:
| Dimension | Question | Typical Targets |
|---|---|---|
| Latency | How fast must responses be? | < 100ms reads, < 500ms writes |
| Availability | How much downtime is acceptable? | 99.9% = 8.7 hrs/yr, 99.99% = 52 min/yr |
| Consistency | Can data be stale? For how long? | Strong, eventual, or causal |
| Durability | Can we lose data? | Usually zero tolerance |
| Scale | Users, requests, data volume? | DAU, QPS, storage/year |
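The availability targets in the table convert directly into a downtime budget you can quote in the interview. A quick sketch of the arithmetic (the function name is just illustrative):

```python
def downtime_budget_minutes(availability_pct: float) -> float:
    """Allowed downtime per year, in minutes, for a given availability."""
    minutes_per_year = 365 * 24 * 60  # 525,600
    return minutes_per_year * (1 - availability_pct / 100)

# 99.9%  -> ~525 minutes (~8.7 hours) per year
# 99.99% -> ~52.6 minutes per year
```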
Step 2: Back-of-Envelope Estimation (5 minutes)
Estimation answers one question: how big is this system? The numbers you calculate here drive every architectural decision that follows.
The Estimation Playbook
Start with daily active users (DAU) and work your way to QPS, storage, bandwidth, and memory.
Given: 100M DAU, 1 URL/day/user, 10:1 read/write ratio
WRITES:
- Write QPS = 100M / 86,400 sec ≈ 1,200 writes/sec
- Peak write QPS = 2-5x average ≈ 5,000 writes/sec
READS:
- Read QPS = 1,200 x 10 = 12,000 reads/sec
- Peak read QPS ≈ 50,000 reads/sec
STORAGE (per year):
- Each URL record ≈ 500 bytes (short URL + long URL + metadata)
- Daily: 100M x 500B = 50 GB/day
- Yearly: 50 GB x 365 = ~18 TB/year
- 5-year horizon: ~90 TB
BANDWIDTH:
- Incoming: 1,200 QPS x 500B = 600 KB/s (trivial)
- Outgoing: 12,000 QPS x 500B = 6 MB/s (still modest)
MEMORY (for caching):
- Cache the top 20% of URLs (80/20 rule)
- 20M URLs x 500B = 10 GB
- Fits in a single Redis instance

Numbers to Memorize
Powers of 2:
2^10 ≈ 1 thousand (1 KB)
2^20 ≈ 1 million (1 MB)
2^30 ≈ 1 billion (1 GB)
2^40 ≈ 1 trillion (1 TB)
Time:
1 day = 86,400 seconds ≈ 10^5 seconds
1 month ≈ 2.5 x 10^6 seconds
1 year ≈ 3 x 10^7 seconds
Latency:
L1 cache: 1 ns
L2 cache: 4 ns
RAM: 100 ns
SSD random read: 100 us
HDD seek: 10 ms
Same datacenter round-trip: 500 us
Cross-continent round-trip: 150 ms

Common Estimation Mistakes
- Forgetting peak vs average. Peak is typically 2-5x average. Design for peak.
- Ignoring the read/write ratio. Most systems are 10:1 to 100:1 read-heavy. This changes everything.
- Not projecting growth. Design for 3-5 years ahead.
- Over-precision. You are estimating, not calculating. 86,400 is “about 100,000.” Round aggressively.
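The whole playbook above is mechanical enough to script. Running the URL-shortener numbers through it reproduces the estimate (every constant comes straight from the worked example, with 86,400 seconds rounded to 10^5 as recommended):

```python
DAU = 100_000_000           # daily active users
WRITES_PER_USER = 1         # one new URL per user per day
READ_WRITE_RATIO = 10
PEAK_FACTOR = 4             # peak is typically 2-5x average
RECORD_BYTES = 500          # short URL + long URL + metadata
SECONDS_PER_DAY = 100_000   # 86,400, rounded aggressively

write_qps = DAU * WRITES_PER_USER / SECONDS_PER_DAY   # ~1,000/s
read_qps = write_qps * READ_WRITE_RATIO               # ~10,000/s
peak_read_qps = read_qps * PEAK_FACTOR                # ~40,000/s

daily_storage_gb = DAU * RECORD_BYTES / 1e9           # 50 GB/day
yearly_storage_tb = daily_storage_gb * 365 / 1e3      # ~18 TB/year

cache_gb = 0.2 * DAU * RECORD_BYTES / 1e9             # top 20% -> 10 GB
```

The rounded figures land within the same order of magnitude as the hand calculation, which is all an estimate needs.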
Step 3: API Design (5 minutes)
Define the contract between clients and your system. This forces you to think about data model, operations, and boundaries.
POST /api/v1/urls
Body: { "long_url": "https://...", "custom_alias": "my-link", "ttl_hours": 720 }
Response: { "short_url": "https://short.ly/abc123", "expires_at": "..." }
Status: 201 Created
GET /api/v1/urls/{short_code}
Response: 302 Redirect -> Location: https://original-url.com
(or 404 if not found, 410 if expired)
GET /api/v1/urls/{short_code}/stats
Response: { "clicks": 42891, "created_at": "...", "last_accessed": "..." }
Status: 200 OK
DELETE /api/v1/urls/{short_code}
Status: 204 No Content

API Design Principles
1. Use nouns, not verbs: /urls not /createUrl
2. Version your API: /api/v1/ prefix
3. Use proper HTTP methods: GET (read), POST (create), PUT (replace),
PATCH (partial update), DELETE (remove)
4. Use proper status codes: 201 (created), 400 (bad request),
404 (not found), 429 (rate limited), 500 (server error)
5. Pagination for lists: ?cursor=abc&limit=20
6. Idempotency keys for writes: X-Idempotency-Key header

When REST Does Not Fit
| Pattern | Use Case | Example |
|---|---|---|
| GraphQL | Complex nested data; mobile clients with bandwidth constraints | Social media feed with nested comments |
| gRPC | Service-to-service; low latency; streaming | Microservice communication |
| WebSocket | Real-time bidirectional | Chat, live updates |
| Server-Sent Events | Server-to-client streaming | Notifications, dashboards |
Step 4: High-Level Architecture (10 minutes)
Start with the simplest design that satisfies requirements, then evolve.
Client -> Load Balancer -> Application Server -> Database
That's it. Start here. Then add complexity only when your requirements demand it.

Common Components and When to Add Them
| Component | When to Add |
|---|---|
| Load Balancer | Multiple app servers (almost always) |
| Cache (Redis) | Read-heavy, expensive DB queries |
| CDN | Static content, geographically spread users |
| Message Queue (Kafka) | Async processing, spike buffering |
| Search (Elasticsearch) | Full-text search, faceted filtering |
| Blob Storage (S3) | Images, videos, files |
| Rate Limiter | Public APIs, abuse prevention |

Sketch the Architecture
For our URL shortener, the high-level design:
┌─────────────┐
│ CDN │ (redirect caching)
└──────┬──────┘
│
┌──────────┐ ┌──────┴──────┐ ┌──────────────┐
│ Client │───>│ Load Balancer│───>│ App Servers │
└──────────┘ └─────────────┘ │ (stateless) │
└──────┬───────┘
│
┌────────┴────────┐
│ │
┌──────┴──────┐ ┌──────┴──────┐
│ Redis │ │ PostgreSQL │
│ (cache) │ │ (primary) │
└─────────────┘ └──────┬───────┘
│
┌────────┴────────┐
┌──────┴──────┐ ┌───────┴─────┐
│ Replica 1 │ │ Replica 2 │
└─────────────┘ └─────────────┘

Architecture Diagram Tips
- Draw data flow, not just boxes. Show which direction data moves and label connections (HTTP, gRPC, async).
- Label everything. Every box needs a name and purpose.
- Show the read path and write path separately if they differ.
- Call out the database schema for key entities.
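The redirect read path in the diagram above follows a cache-aside pattern. A sketch, with plain dicts standing in for the Redis and PostgreSQL clients (their interfaces here are placeholders, not a real client API):

```python
def resolve(short_code: str, cache: dict, db: dict):
    """Cache-aside lookup for the redirect hot path."""
    # 1. Check the cache first -- with the top 20% of URLs cached,
    #    this should absorb ~80% of reads.
    if short_code in cache:
        return cache[short_code]
    # 2. On a miss, fall back to the database.
    long_url = db.get(short_code)
    if long_url is None:
        return None  # surfaces as a 404 at the API layer
    # 3. Populate the cache so the next read is a hit.
    cache[short_code] = long_url
    return long_url
```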
Step 5: Deep Dive (15 minutes)
This is where you spend the most time. Pick the 2-3 most interesting or challenging components and go deep. Ask: “What is the hardest part of this system?”
Common deep dive areas:
1. Data model and schema design
2. The read path (if read-heavy)
3. The write path (if write-heavy)
4. Consistency and conflict resolution
5. The specific algorithm that makes the system work
6. Failure modes and recovery

Example: Short URL Generation
This is the core algorithm for a URL shortener. There are several approaches:
# Approach 1: Hash-based
import hashlib, base64

def generate_short_url(long_url: str) -> str:
    hash_bytes = hashlib.md5(long_url.encode()).digest()
    return base64.urlsafe_b64encode(hash_bytes).decode()[:7]

# Problem: Collisions. Solution: check DB, rehash with counter on conflict.

# Approach 2: Counter-based with Base62
import string

ALPHABET = string.digits + string.ascii_letters  # 62 chars

def base62_encode(num: int) -> str:
    if num == 0:
        return ALPHABET[0]
    result = []
    while num > 0:
        result.append(ALPHABET[num % 62])
        num //= 62
    return ''.join(reversed(result))

# Counter 1000000 -> "4c92". No collisions, but predictable.
# Solution: Add random offset or use Snowflake-style IDs.

# Approach 3: Pre-generated key pool
# Generate millions of random 7-char codes ahead of time.
# Store in a "key pool" table. On URL creation, grab one.
# No collisions, no coordination, fast. Must replenish periodically.

Example: Data Model
CREATE TABLE urls (
id BIGSERIAL PRIMARY KEY,
short_code VARCHAR(10) UNIQUE NOT NULL,
long_url TEXT NOT NULL,
created_at TIMESTAMP DEFAULT NOW(),
expires_at TIMESTAMP,
click_count BIGINT DEFAULT 0
);
-- Redirect lookups (the hot path) are served by the index that the
-- UNIQUE constraint on short_code already creates; no extra index needed.
-- Index for cleanup of expired URLs
CREATE INDEX idx_expires_at ON urls(expires_at)
WHERE expires_at IS NOT NULL;
-- Analytics table (append-only, partitioned by date)
CREATE TABLE click_events (
id BIGSERIAL,
short_code VARCHAR(10) NOT NULL,
clicked_at TIMESTAMP DEFAULT NOW(),
user_agent TEXT,
ip_address INET,
referrer TEXT
) PARTITION BY RANGE (clicked_at);

Step 6: Discuss Tradeoffs (5 minutes)
Every design decision has tradeoffs. The interviewer wants to see that you understand this. Do not present your design as perfect — acknowledge the weaknesses and explain what you would do differently under different constraints.
Framework for Discussing Tradeoffs
For each major decision, discuss: (1) what alternatives you considered, (2) why you picked this approach, (3) what you lose, and (4) under what conditions you would switch.
CONSISTENCY vs AVAILABILITY (CAP Theorem)
"We chose eventual consistency for analytics because losing a
few click counts during a partition is acceptable, but URL
creation uses strong consistency to prevent duplicates."
SQL vs NoSQL
"We chose PostgreSQL because our data is relational and we need
strong consistency for writes. If we needed to scale beyond
10TB, we'd consider DynamoDB with hash-based sharding."
CACHE vs NO CACHE
"We cache the top 20% of URLs in Redis. This adds complexity
(invalidation, memory cost) but reduces p99 latency from 50ms
to 2ms and cuts DB load by 80%."
SYNC vs ASYNC
"Click tracking is async via Kafka. We lose real-time accuracy
(clicks may be delayed by up to 30 seconds) but gain
throughput -- the redirect path stays fast."
MONOLITH vs MICROSERVICES
"For a team of 5 building an MVP, a monolith is the right call.
We'd split into services once the team grows past 20 engineers."

Scaling Discussion
Always address how your design scales:
VERTICAL SCALING (Scale Up)
Bigger machines. Works to a point.
PostgreSQL can handle ~10K writes/sec on a beefy machine.
HORIZONTAL SCALING (Scale Out)
Add more machines. Requires partitioning.
Shard URLs by hash(short_code) % num_shards.
READ SCALING
Add read replicas. With our 10:1 read/write ratio, nearly all
traffic is reads; each replica adds roughly one machine's worth
of read capacity, so add replicas until they absorb the load.
WRITE SCALING
Shard the database. Each shard handles a subset of URLs.
Use consistent hashing to distribute evenly.
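One practical note on the shard-selection step: the hash must be stable across processes and deploys, so a sketch would use something like CRC32 rather than Python's built-in `hash`, which is salted per interpreter run:

```python
import zlib

def shard_for(short_code: str, num_shards: int) -> int:
    # crc32 gives the same value in every process; Python's
    # built-in hash() does not (it is randomized per run).
    return zlib.crc32(short_code.encode()) % num_shards
```

Plain modulo remaps most keys whenever num_shards changes, which is exactly why consistent hashing is the better choice once resharding is on the table.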
GEOGRAPHIC SCALING
Deploy in multiple regions. Use CDN for redirects.
Each region has its own database with cross-region replication.

Common Mistakes to Avoid
- Jumping to solutions. Spend the first 10 minutes on requirements and estimation. The interviewer will stop you if you start drawing too early.
- Over-engineering. Do not add Kafka, Elasticsearch, and a service mesh to a system that handles 100 requests per second. Simple is better.
- Ignoring failure modes. What happens when Redis goes down? When the database is unreachable? When a server crashes mid-write? Address these explicitly.
- Not doing the math. “We need a cache” is weak. “We need a cache because our DB handles 5K QPS and we expect 50K peak QPS” is strong.
- Single points of failure. Every component should have redundancy. If you draw one database, explain how you handle its failure.
- Forgetting about data. How does data grow over time? What is your retention policy? How do you archive old data?
The Complete Checklist
Use this as a mental checklist during any system design session:
REQUIREMENTS
[ ] Functional requirements (3-5 core features)
[ ] Non-functional requirements (latency, availability, consistency, scale)
[ ] Explicit out-of-scope items
ESTIMATION
[ ] QPS (average and peak)
[ ] Storage (daily, yearly, 5-year)
[ ] Bandwidth (in and out)
[ ] Memory (cache sizing)
API DESIGN
[ ] Core endpoints with methods, parameters, responses
[ ] Error handling and status codes
[ ] Pagination, rate limiting, authentication
HIGH-LEVEL DESIGN
[ ] Core components labeled
[ ] Data flow arrows
[ ] Read path and write path
[ ] Database schema for key entities
DEEP DIVE (pick 2-3)
[ ] Core algorithm or data structure
[ ] Scaling strategy
[ ] Consistency model
[ ] Failure handling
TRADEOFFS
[ ] Alternatives considered for each major decision
[ ] Weaknesses of current approach acknowledged
[ ] Conditions under which you'd change your design

Key Takeaways
- Every system design follows the same 6-step framework: requirements, estimation, API design, high-level architecture, deep dive, tradeoffs. The framework matters more than memorizing specific architectures.
- Spend at least 10 minutes on requirements and estimation before drawing anything. The numbers you calculate drive every downstream decision.
- Start with the simplest architecture that works, then add complexity only when your requirements demand it. A single PostgreSQL instance handles more than most engineers think.
- The deep dive is where you differentiate yourself. Pick the hardest 2-3 components and show that you understand how they work at a low level.
- Always discuss tradeoffs. There are no perfect designs, only designs that are good enough for specific constraints. Show that you understand what you are giving up with each decision.
- Do the math. “We need caching” is a weak claim. “We need caching because our DB handles 5K QPS but peak load is 50K QPS” is a strong one that demonstrates engineering judgment.
