Most engineers fail system design interviews not because they lack knowledge, but because they lack structure. They jump straight into drawing boxes and arrows, skip critical constraints, and run out of time before addressing the hard parts.
This lesson gives you a repeatable 6-step framework that works for any system design problem — whether it is a URL shortener, a distributed file system, or a real-time chat application.
Step 1: Clarify Requirements (5 minutes)
The single biggest mistake in system design is solving the wrong problem. Before you draw a single box, understand what you are building and for whom.
Functional Requirements
These define what the system does. Ask questions like: Who are the users? What are the core actions? What data flows in and out?
Example: Design a URL Shortener
Functional Requirements:
1. Given a long URL, generate a short URL
2. Given a short URL, redirect to the original URL
3. Users can optionally set custom short URLs
4. URLs expire after a configurable TTL
5. Analytics: track click counts per URL
Out of scope:
- User accounts / authentication
- URL editing after creation
- Bulk URL creation API

Non-Functional Requirements
These define how the system behaves under load and failure. They drive architectural decisions and are often more important than functional requirements.
Non-Functional Requirements for URL Shortener:
- Availability: 99.99% uptime (< 52 min downtime/year)
- Latency: Redirect in < 50ms (p99)
- Scale: 100M new URLs/day, 10:1 read/write ratio
- Consistency: Eventual for analytics; strong for URL creation
- Durability: URLs must never be lost once created

The key non-functional dimensions you should always consider:
| Dimension | Question | Typical Targets |
|---|---|---|
| Latency | How fast must responses be? | < 100ms reads, < 500ms writes |
| Availability | How much downtime is acceptable? | 99.9% = 8.7 hrs/yr, 99.99% = 52 min/yr |
| Consistency | Can data be stale? For how long? | Strong, eventual, or causal |
| Durability | Can we lose data? | Usually zero tolerance |
| Scale | Users, requests, data volume? | DAU, QPS, storage/year |
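The availability targets in the table convert directly into a downtime budget you can quote in the interview. A quick sketch of the arithmetic (the function name is just illustrative):

```python
def downtime_budget_minutes(availability_pct: float) -> float:
    """Allowed downtime per year, in minutes, for a given availability."""
    minutes_per_year = 365 * 24 * 60  # 525,600
    return minutes_per_year * (1 - availability_pct / 100)

# 99.9%  -> ~525 minutes (~8.7 hours) per year
# 99.99% -> ~52.6 minutes per year
```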
Step 2: Back-of-Envelope Estimation (5 minutes)
Estimation answers one question: how big is this system? The numbers you calculate here drive every architectural decision that follows.
The Estimation Playbook
Start with daily active users (DAU) and work your way to QPS, storage, bandwidth, and memory.
Given: 100M DAU, 1 URL/day/user, 10:1 read/write ratio
WRITES:
- Write QPS = 100M / 86,400 sec ≈ 1,200 writes/sec
- Peak write QPS = 2-5x average ≈ 5,000 writes/sec
READS:
- Read QPS = 1,200 x 10 = 12,000 reads/sec
- Peak read QPS ≈ 50,000 reads/sec
STORAGE (per year):
- Each URL record ≈ 500 bytes (short URL + long URL + metadata)
- Daily: 100M x 500B = 50 GB/day
- Yearly: 50 GB x 365 = ~18 TB/year
- 5-year horizon: ~90 TB
BANDWIDTH:
- Incoming: 1,200 QPS x 500B = 600 KB/s (trivial)
- Outgoing: 12,000 QPS x 500B = 6 MB/s (still modest)
MEMORY (for caching):
- Cache the top 20% of URLs (80/20 rule)
- 20M URLs x 500B = 10 GB
- Fits in a single Redis instance

Numbers to Memorize
Powers of 2:
2^10 ≈ 1 thousand (1 KB)
2^20 ≈ 1 million (1 MB)
2^30 ≈ 1 billion (1 GB)
2^40 ≈ 1 trillion (1 TB)
Time:
1 day = 86,400 seconds ≈ 10^5 seconds
1 month ≈ 2.5 x 10^6 seconds
1 year ≈ 3 x 10^7 seconds
Latency:
L1 cache: 1 ns
L2 cache: 4 ns
RAM: 100 ns
SSD random read: 100 us
HDD seek: 10 ms
Same datacenter round-trip: 500 us
Cross-continent round-trip: 150 ms

Common Estimation Mistakes
- Forgetting peak vs average. Peak is typically 2-5x average. Design for peak.
- Ignoring the read/write ratio. Most systems are 10:1 to 100:1 read-heavy. This changes everything.
- Not projecting growth. Design for 3-5 years ahead.
- Over-precision. You are estimating, not calculating. 86,400 is “about 100,000.” Round aggressively.
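The whole playbook above is mechanical enough to script. Running the URL-shortener numbers through it reproduces the estimate (every constant comes straight from the worked example, with 86,400 seconds rounded to 10^5 as recommended):

```python
DAU = 100_000_000           # daily active users
WRITES_PER_USER = 1         # one new URL per user per day
READ_WRITE_RATIO = 10
PEAK_FACTOR = 4             # peak is typically 2-5x average
RECORD_BYTES = 500          # short URL + long URL + metadata
SECONDS_PER_DAY = 100_000   # 86,400, rounded aggressively

write_qps = DAU * WRITES_PER_USER / SECONDS_PER_DAY   # ~1,000/s
read_qps = write_qps * READ_WRITE_RATIO               # ~10,000/s
peak_read_qps = read_qps * PEAK_FACTOR                # ~40,000/s

daily_storage_gb = DAU * RECORD_BYTES / 1e9           # 50 GB/day
yearly_storage_tb = daily_storage_gb * 365 / 1e3      # ~18 TB/year

cache_gb = 0.2 * DAU * RECORD_BYTES / 1e9             # top 20% -> 10 GB
```

The rounded figures land within the same order of magnitude as the hand calculation, which is all an estimate needs.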
Step 3: API Design (5 minutes)
Define the contract between clients and your system. This forces you to think about data model, operations, and boundaries.
POST /api/v1/urls
Body: { "long_url": "https://...", "custom_alias": "my-link", "ttl_hours": 720 }
Response: { "short_url": "https://short.ly/abc123", "expires_at": "..." }
Status: 201 Created
GET /api/v1/urls/{short_code}
Response: 302 Redirect -> Location: https://original-url.com
(or 404 if not found, 410 if expired)
GET /api/v1/urls/{short_code}/stats
Response: { "clicks": 42891, "created_at": "...", "last_accessed": "..." }
Status: 200 OK
DELETE /api/v1/urls/{short_code}
Status: 204 No Content

API Design Principles
1. Use nouns, not verbs: /urls not /createUrl
2. Version your API: /api/v1/ prefix
3. Use proper HTTP methods: GET (read), POST (create), PUT (replace),
PATCH (partial update), DELETE (remove)
4. Use proper status codes: 201 (created), 400 (bad request),
404 (not found), 429 (rate limited), 500 (server error)
5. Pagination for lists: ?cursor=abc&limit=20
6. Idempotency keys for writes: X-Idempotency-Key header

When REST Does Not Fit
| Pattern | Use Case | Example |
|---|---|---|
| GraphQL | Complex nested data; mobile clients with bandwidth constraints | Social media feed with nested comments |
| gRPC | Service-to-service; low latency; streaming | Microservice communication |
| WebSocket | Real-time bidirectional | Chat, live updates |
| Server-Sent Events | Server-to-client streaming | Notifications, dashboards |
Step 4: High-Level Architecture (10 minutes)
Start with the simplest design that satisfies requirements, then evolve.
Client -> Load Balancer -> Application Server -> Database
That's it. Start here. Then add complexity only when your requirements demand it.

Common Components and When to Add Them
| Component | When to Add |
|---|---|
| Load Balancer | Multiple app servers (almost always) |
| Cache (Redis) | Read-heavy, expensive DB queries |
| CDN | Static content, geographically spread users |
| Message Queue (Kafka) | Async processing, spike buffering |
| Search (Elasticsearch) | Full-text search, faceted filtering |
| Blob Storage (S3) | Images, videos, files |
| Rate Limiter | Public APIs, abuse prevention |

Sketch the Architecture
For our URL shortener, the high-level design:
┌─────────────┐
│ CDN │ (redirect caching)
└──────┬──────┘
│
┌──────────┐ ┌──────┴──────┐ ┌──────────────┐
│ Client │───>│ Load Balancer│───>│ App Servers │
└──────────┘ └─────────────┘ │ (stateless) │
└──────┬───────┘
│
┌────────┴────────┐
│ │
┌──────┴──────┐ ┌──────┴──────┐
│ Redis │ │ PostgreSQL │
│ (cache) │ │ (primary) │
└─────────────┘ └──────┬───────┘
│
┌────────┴────────┐
┌──────┴──────┐ ┌───────┴─────┐
│ Replica 1 │ │ Replica 2 │
└─────────────┘ └─────────────┘

Architecture Diagram Tips
- Draw data flow, not just boxes. Show which direction data moves and label connections (HTTP, gRPC, async).
- Label everything. Every box needs a name and purpose.
- Show the read path and write path separately if they differ.
- Call out the database schema for key entities.
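The redirect read path in the diagram above follows a cache-aside pattern. A sketch, with plain dicts standing in for the Redis and PostgreSQL clients (their interfaces here are placeholders, not a real client API):

```python
def resolve(short_code: str, cache: dict, db: dict):
    """Cache-aside lookup for the redirect hot path."""
    # 1. Check the cache first -- with the top 20% of URLs cached,
    #    this should absorb ~80% of reads.
    if short_code in cache:
        return cache[short_code]
    # 2. On a miss, fall back to the database.
    long_url = db.get(short_code)
    if long_url is None:
        return None  # surfaces as a 404 at the API layer
    # 3. Populate the cache so the next read is a hit.
    cache[short_code] = long_url
    return long_url
```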
Step 5: Deep Dive (15 minutes)
This is where you spend the most time. Pick the 2-3 most interesting or challenging components and go deep. Ask: “What is the hardest part of this system?”
Common deep dive areas:
1. Data model and schema design
2. The read path (if read-heavy)
3. The write path (if write-heavy)
4. Consistency and conflict resolution
5. The specific algorithm that makes the system work
6. Failure modes and recovery

Example: Short URL Generation
This is the core algorithm for a URL shortener. There are several approaches:
# Approach 1: Hash-based
import hashlib, base64

def generate_short_url(long_url: str) -> str:
    hash_bytes = hashlib.md5(long_url.encode()).digest()
    return base64.urlsafe_b64encode(hash_bytes).decode()[:7]

# Problem: Collisions. Solution: check DB, rehash with counter on conflict.

# Approach 2: Counter-based with Base62
import string

ALPHABET = string.digits + string.ascii_letters  # 62 chars

def base62_encode(num: int) -> str:
    if num == 0:
        return ALPHABET[0]
    result = []
    while num > 0:
        result.append(ALPHABET[num % 62])
        num //= 62
    return ''.join(reversed(result))

# Counter 1000000 -> "4c92". No collisions, but predictable.
# Solution: Add random offset or use Snowflake-style IDs.

# Approach 3: Pre-generated key pool
# Generate millions of random 7-char codes ahead of time.
# Store in a "key pool" table. On URL creation, grab one.
# No collisions, no coordination, fast. Must replenish periodically.

Example: Data Model
CREATE TABLE urls (
id BIGSERIAL PRIMARY KEY,
short_code VARCHAR(10) UNIQUE NOT NULL,
long_url TEXT NOT NULL,
created_at TIMESTAMP DEFAULT NOW(),
expires_at TIMESTAMP,
click_count BIGINT DEFAULT 0
);
-- Redirect lookups (the hot path) are served by the index that the
-- UNIQUE constraint on short_code already creates; no extra index needed.
-- Index for cleanup of expired URLs
CREATE INDEX idx_expires_at ON urls(expires_at)
WHERE expires_at IS NOT NULL;
-- Analytics table (append-only, partitioned by date)
CREATE TABLE click_events (
id BIGSERIAL,
short_code VARCHAR(10) NOT NULL,
clicked_at TIMESTAMP DEFAULT NOW(),
user_agent TEXT,
ip_address INET,
referrer TEXT
) PARTITION BY RANGE (clicked_at);

Step 6: Discuss Tradeoffs (5 minutes)
Every design decision has tradeoffs. The interviewer wants to see that you understand this. Do not present your design as perfect — acknowledge the weaknesses and explain what you would do differently under different constraints.
Framework for Discussing Tradeoffs
For each major decision, discuss: (1) what alternatives you considered, (2) why you picked this approach, (3) what you lose, and (4) under what conditions you would switch.
CONSISTENCY vs AVAILABILITY (CAP Theorem)
"We chose eventual consistency for analytics because losing a
few click counts during a partition is acceptable, but URL
creation uses strong consistency to prevent duplicates."
SQL vs NoSQL
"We chose PostgreSQL because our data is relational and we need
strong consistency for writes. If we needed to scale beyond
10TB, we'd consider DynamoDB with hash-based sharding."
CACHE vs NO CACHE
"We cache the top 20% of URLs in Redis. This adds complexity
(invalidation, memory cost) but reduces p99 latency from 50ms
to 2ms and cuts DB load by 80%."
SYNC vs ASYNC
"Click tracking is async via Kafka. We lose real-time accuracy
(clicks may be delayed by up to 30 seconds) but gain
throughput -- the redirect path stays fast."
MONOLITH vs MICROSERVICES
"For a team of 5 building an MVP, a monolith is the right call.
We'd split into services once the team grows past 20 engineers."

Scaling Discussion
Always address how your design scales:
VERTICAL SCALING (Scale Up)
Bigger machines. Works to a point.
PostgreSQL can handle ~10K writes/sec on a beefy machine.
HORIZONTAL SCALING (Scale Out)
Add more machines. Requires partitioning.
Shard URLs by hash(short_code) % num_shards.
READ SCALING
Add read replicas. With our 10:1 read/write ratio, nearly all
traffic is reads; each replica adds roughly one machine's worth
of read capacity, so add replicas until they absorb the load.
WRITE SCALING
Shard the database. Each shard handles a subset of URLs.
Use consistent hashing to distribute evenly.
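One practical note on the shard-selection step: the hash must be stable across processes and deploys, so a sketch would use something like CRC32 rather than Python's built-in `hash`, which is salted per interpreter run:

```python
import zlib

def shard_for(short_code: str, num_shards: int) -> int:
    # crc32 gives the same value in every process; Python's
    # built-in hash() does not (it is randomized per run).
    return zlib.crc32(short_code.encode()) % num_shards
```

Plain modulo remaps most keys whenever num_shards changes, which is exactly why consistent hashing is the better choice once resharding is on the table.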
GEOGRAPHIC SCALING
Deploy in multiple regions. Use CDN for redirects.
Each region has its own database with cross-region replication.

Common Mistakes to Avoid
- Jumping to solutions. Spend the first 10 minutes on requirements and estimation. The interviewer will stop you if you start drawing too early.
- Over-engineering. Do not add Kafka, Elasticsearch, and a service mesh to a system that handles 100 requests per second. Simple is better.
- Ignoring failure modes. What happens when Redis goes down? When the database is unreachable? When a server crashes mid-write? Address these explicitly.
- Not doing the math. “We need a cache” is weak. “We need a cache because our DB handles 5K QPS and we expect 50K peak QPS” is strong.
- Single points of failure. Every component should have redundancy. If you draw one database, explain how you handle its failure.
- Forgetting about data. How does data grow over time? What is your retention policy? How do you archive old data?
The Complete Checklist
Use this as a mental checklist during any system design session:
REQUIREMENTS
[ ] Functional requirements (3-5 core features)
[ ] Non-functional requirements (latency, availability, consistency, scale)
[ ] Explicit out-of-scope items
ESTIMATION
[ ] QPS (average and peak)
[ ] Storage (daily, yearly, 5-year)
[ ] Bandwidth (in and out)
[ ] Memory (cache sizing)
API DESIGN
[ ] Core endpoints with methods, parameters, responses
[ ] Error handling and status codes
[ ] Pagination, rate limiting, authentication
HIGH-LEVEL DESIGN
[ ] Core components labeled
[ ] Data flow arrows
[ ] Read path and write path
[ ] Database schema for key entities
DEEP DIVE (pick 2-3)
[ ] Core algorithm or data structure
[ ] Scaling strategy
[ ] Consistency model
[ ] Failure handling
TRADEOFFS
[ ] Alternatives considered for each major decision
[ ] Weaknesses of current approach acknowledged
[ ] Conditions under which you'd change your design

Key Takeaways
- Every system design follows the same 6-step framework: requirements, estimation, API design, high-level architecture, deep dive, tradeoffs. The framework matters more than memorizing specific architectures.
- Spend at least 10 minutes on requirements and estimation before drawing anything. The numbers you calculate drive every downstream decision.
- Start with the simplest architecture that works, then add complexity only when your requirements demand it. A single PostgreSQL instance handles more than most engineers think.
- The deep dive is where you differentiate yourself. Pick the hardest 2-3 components and show that you understand how they work at a low level.
- Always discuss tradeoffs. There are no perfect designs, only designs that are good enough for specific constraints. Show that you understand what you are giving up with each decision.
- Do the math. “We need caching” is a weak claim. “We need caching because our DB handles 5K QPS but peak load is 50K QPS” is a strong one that demonstrates engineering judgment.
