software-design|March 21, 2026|10 min read

Deep Dive on API Gateway: A System Design Interview Perspective

TL;DR

An API Gateway is a reverse proxy that sits between clients and backend services, handling cross-cutting concerns: TLS termination, authentication, rate limiting, routing, load balancing, circuit breaking, caching, and observability. For interviews, know why it exists (decouple cross-cutting concerns from services), the request lifecycle (TLS → auth → rate limit → validate → route → LB → upstream → cache → respond), rate limiting algorithms (token bucket for burst, sliding window for accuracy), and when to use BFF pattern (one gateway per client type). Key tradeoff: single point of failure vs operational simplicity. Mitigate with horizontal scaling, health checks, and graceful degradation.

“An API Gateway is the front door to your microservices. Every request walks through it, and every cross-cutting concern lives there — so you don’t repeat it in 50 services.”

In a monolith, there’s one entry point. In a microservices architecture, there can be dozens or hundreds. Clients shouldn’t need to know about your internal service topology, manage multiple connections, or handle authentication, rate limiting, and retries themselves. That’s the API Gateway’s job.

This article covers everything you need for system design interviews: what an API Gateway does, how each component works, the algorithms behind rate limiting, and the patterns that come up when designing real systems.

What is an API Gateway?

An API Gateway is a reverse proxy that sits between external clients and your backend services. It handles cross-cutting concerns that every request needs but no individual service should implement:

[Figure: API Gateway Architecture]

Without a gateway, every service must independently handle:

  • TLS termination
  • Authentication / authorization
  • Rate limiting
  • Request validation
  • Logging, metrics, tracing
  • CORS headers

With a gateway, services only handle business logic. Everything else is centralized.

Request Lifecycle

Every request through an API Gateway follows a predictable pipeline:

  1. TLS termination — decrypt the incoming HTTPS connection
  2. Authentication — reject with 401 Unauthorized if the token is missing or invalid
  3. Rate limiting — reject with 429 Too Many Requests if the client is over its limit
  4. Request validation — reject malformed payloads early
  5. Route matching — map the path to an upstream service
  6. Circuit breaker check — if the circuit is open, return 503 or a fallback response
  7. Load balancing — pick a healthy upstream instance
  8. Upstream call — on failure, retry with backoff
  9. Response handling — cache the response if it was a GET
  10. Add headers + logging, then return to the client

Let’s break down each stage.

Authentication and Authorization

The gateway validates identity before requests reach your services.

Common Auth Patterns

| Pattern | How | Best For |
|---|---|---|
| API Key | Key in header (X-API-Key) or query param | Service-to-service, simple APIs |
| JWT (Bearer Token) | Verify signature locally, extract claims | Stateless auth, microservices |
| OAuth 2.0 | Token introspection or JWT validation | Third-party access, SSO |
| mTLS | Client presents certificate | Service mesh, zero-trust |

JWT Validation at the Gateway

# Gateway middleware — JWT validation (no database call needed)
# (`Response` stands in for whatever response type your framework provides)
import jwt  # PyJWT
from functools import wraps

PUBLIC_KEY = open('public.pem').read()

def authenticate(func):
    @wraps(func)
    def wrapper(request, *args, **kwargs):
        token = request.headers.get('Authorization', '').replace('Bearer ', '')
        if not token:
            return Response(status=401, body='Missing token')

        try:
            claims = jwt.decode(token, PUBLIC_KEY, algorithms=['RS256'],
                                audience='my-api')
        except jwt.ExpiredSignatureError:
            return Response(status=401, body='Token expired')
        except jwt.InvalidTokenError:
            return Response(status=401, body='Invalid token')

        # Attach user context for downstream services
        request.headers['X-User-ID'] = claims['sub']
        request.headers['X-User-Roles'] = ','.join(claims.get('roles', []))
        return func(request, *args, **kwargs)
    return wrapper

Interview insight: JWT validation is stateless — the gateway verifies the signature using the public key without calling the auth service. This is why JWTs are preferred over opaque tokens in API Gateways — no network call per request.

The tradeoff: you can’t revoke a JWT before it expires. Mitigations: short expiry (15 min) + refresh tokens, or a lightweight token blocklist in Redis.

Rate Limiting

Rate limiting protects your services from abuse and ensures fair usage. The gateway is the natural place to enforce it.

Token Bucket Algorithm

The most common algorithm. Each client gets a bucket with a maximum capacity. Tokens are added at a fixed rate. Each request consumes one token.

import time
import redis

class TokenBucketRateLimiter:
    def __init__(self, redis_client, max_tokens=100, refill_rate=10):
        self.redis = redis_client
        self.max_tokens = max_tokens       # bucket capacity
        self.refill_rate = refill_rate     # tokens per second

    def is_allowed(self, client_id: str) -> bool:
        key = f"ratelimit:{client_id}"
        now = time.time()

        # Lua script for atomic check-and-update
        lua_script = """
        local key = KEYS[1]
        local max_tokens = tonumber(ARGV[1])
        local refill_rate = tonumber(ARGV[2])
        local now = tonumber(ARGV[3])

        local data = redis.call('HMGET', key, 'tokens', 'last_refill')
        local tokens = tonumber(data[1]) or max_tokens
        local last_refill = tonumber(data[2]) or now

        -- Refill tokens based on elapsed time
        local elapsed = now - last_refill
        tokens = math.min(max_tokens, tokens + elapsed * refill_rate)

        if tokens >= 1 then
            tokens = tokens - 1
            redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
            redis.call('EXPIRE', key, 60)
            return 1  -- allowed
        else
            redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
            redis.call('EXPIRE', key, 60)
            return 0  -- rate limited
        end
        """
        return bool(self.redis.eval(lua_script, 1, key,
                                     self.max_tokens, self.refill_rate, now))

Sliding Window Log

More accurate than token bucket for strict per-second limits:

import uuid

def sliding_window_is_allowed(redis_client, client_id, window_sec=60, max_requests=100):
    key = f"ratelimit:sw:{client_id}"
    now = time.time()

    pipe = redis_client.pipeline()
    pipe.zremrangebyscore(key, 0, now - window_sec)  # remove expired entries
    member = f"{now}:{uuid.uuid4()}"                 # unique member, so concurrent
    pipe.zadd(key, {member: now})                    # requests can't overwrite each other
    pipe.zcard(key)                                  # count requests in window
    pipe.expire(key, window_sec)
    results = pipe.execute()

    # Tradeoff: memory is O(requests per window) per client — the cost of accuracy
    return results[2] <= max_requests

Rate Limiting Response Headers

HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 73
X-RateLimit-Reset: 1679000060

# When exceeded:
HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0

Rate Limiting Dimensions

| Dimension | Example | Use Case |
|---|---|---|
| Per API key | 1000 req/min per key | SaaS API tiers |
| Per user | 100 req/min per user | Logged-in users |
| Per IP | 50 req/min per IP | Anonymous/public APIs |
| Per endpoint | 10 req/min on /api/export | Expensive operations |
| Global | 50K req/s total | Cluster protection |

Routing and Load Balancing

Path-Based Routing

The gateway maps incoming paths to backend services:

# NGINX example
location /api/users/ {
    proxy_pass http://user-service:8080/;
}
location /api/orders/ {
    proxy_pass http://order-service:8080/;
}
location /api/payments/ {
    proxy_pass http://payment-service:8080/;
}
# Kong declarative config
services:
  - name: user-service
    url: http://user-service:8080
    routes:
      - name: user-routes
        paths:
          - /api/users
        strip_path: true

  - name: order-service
    url: http://order-service:8080
    routes:
      - name: order-routes
        paths:
          - /api/orders
        methods:
          - GET
          - POST

Header-Based Routing

Route based on headers for A/B testing, canary deployments, or API versioning:

# Version-based routing (nginx `if` works for a single check like this,
# but a map block is the more robust pattern in production)
location /api/ {
    if ($http_api_version = "v2") {
        proxy_pass http://service-v2:8080;
    }
    proxy_pass http://service-v1:8080;
}

Load Balancing Algorithms

| Algorithm | Behavior | Best For |
|---|---|---|
| Round Robin | Rotate through instances sequentially | Homogeneous instances |
| Weighted Round Robin | More traffic to higher-weight instances | Mixed instance sizes |
| Least Connections | Route to instance with fewest active connections | Variable request duration |
| IP Hash | Same client always hits same instance | Session affinity |
| Random | Pick an instance randomly | Simple, surprisingly effective |

# NGINX weighted upstream
upstream order-service {
    server order-1:8080 weight=3;  # gets 3x traffic
    server order-2:8080 weight=1;
    server order-3:8080 weight=1;
}
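Of these, least connections is the one worth being able to sketch in an interview: it reduces to tracking in-flight request counts per instance. A toy illustration (not any proxy's actual implementation):

```python
class LeastConnectionsBalancer:
    def __init__(self, instances: list[str]):
        # in-flight request count per instance
        self.active = {inst: 0 for inst in instances}

    def acquire(self) -> str:
        # Pick the instance with the fewest in-flight requests
        inst = min(self.active, key=self.active.get)
        self.active[inst] += 1
        return inst

    def release(self, inst: str) -> None:
        # Call when the proxied request completes
        self.active[inst] -= 1
```

This is why least connections beats round robin when request durations vary: a slow request keeps its instance's count high, steering new traffic elsewhere until it finishes.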

Circuit Breaker

When a backend service starts failing, the gateway should stop sending requests to it instead of overwhelming it further.

States

The breaker moves between three states:

  • CLOSED (normal traffic) → OPEN when the failure threshold is exceeded (e.g. 50% failures in 10s)
  • OPEN (all requests fail fast) → HALF-OPEN when the recovery timeout expires (e.g. 30s)
  • HALF-OPEN (probe with 1 request) → CLOSED if the probe succeeds, back to OPEN if it fails

import time
from enum import Enum

class CircuitOpenError(Exception):
    """Raised when the circuit is open and the call is rejected fast."""
    pass

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30,
                 success_threshold=2):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.success_threshold = success_threshold
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = 0

    def call(self, func, *args, **kwargs):
        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = CircuitState.HALF_OPEN
            else:
                raise CircuitOpenError("Circuit is open")

        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise

    def _on_success(self):
        if self.state == CircuitState.HALF_OPEN:
            self.success_count += 1
            if self.success_count >= self.success_threshold:
                self.state = CircuitState.CLOSED
                self.failure_count = 0
                self.success_count = 0
        else:
            self.failure_count = 0  # a success while CLOSED resets the failure streak

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN
            self.success_count = 0
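The request lifecycle also lists retry with backoff on upstream failure, which pairs with the circuit breaker. A minimal sketch with jitter (the parameter values are illustrative):

```python
import random
import time

def retry_with_backoff(func, max_attempts=3, base_delay=0.1, max_delay=2.0):
    """Call func(), retrying on exception with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Jitter spreads retries out so clients don't stampede in sync
            time.sleep(delay * random.uniform(0.5, 1.5))
```

Only retry idempotent requests (GETs, and writes with idempotency keys) — a blindly retried POST can double-charge a customer.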

Response Caching

The gateway can cache responses to reduce backend load for idempotent requests.

# NGINX response caching
proxy_cache_path /tmp/cache levels=1:2 keys_zone=api_cache:10m
                 max_size=1g inactive=60m;

location /api/products/ {
    proxy_cache api_cache;
    proxy_cache_methods GET HEAD;
    proxy_cache_valid 200 5m;          # cache 200 responses for 5 minutes
    proxy_cache_valid 404 1m;
    proxy_cache_key "$request_uri|$http_authorization";  # vary by auth
    proxy_cache_bypass $http_cache_control;

    add_header X-Cache-Status $upstream_cache_status;
    proxy_pass http://product-service:8080;
}

What to cache:

  • GET requests with stable responses (product listings, user profiles)
  • Public endpoints (no auth variation)
  • Responses with explicit Cache-Control headers

What NOT to cache:

  • POST/PUT/DELETE (non-idempotent)
  • Responses with user-specific data (unless cache key includes user ID)
  • Real-time data (stock prices, live scores)

Request Aggregation (BFF Pattern)

For mobile or web clients that need data from multiple services in a single call:

Each client type talks to its own gateway, which fans out to the backend services that client needs:

  • Mobile App → Mobile BFF Gateway
  • Web App → Web BFF Gateway
  • Each BFF aggregates across the backend services (User, Order, Recommendation, Product)

# BFF endpoint — aggregate multiple service calls
import asyncio
import aiohttp

async def get_user_dashboard(user_id: str):
    async with aiohttp.ClientSession() as session:
        # Fan out to multiple services in parallel
        user_task = session.get(f'http://user-service/users/{user_id}')
        orders_task = session.get(f'http://order-service/users/{user_id}/orders?limit=5')
        recs_task = session.get(f'http://recommendation-service/users/{user_id}')

        user_resp, orders_resp, recs_resp = await asyncio.gather(
            user_task, orders_task, recs_task,
            return_exceptions=True
        )

        # Aggregate responses (graceful degradation)
        result = {
            'user': await user_resp.json() if not isinstance(user_resp, Exception) else None,
            'recent_orders': await orders_resp.json() if not isinstance(orders_resp, Exception) else [],
            'recommendations': await recs_resp.json() if not isinstance(recs_resp, Exception) else [],
        }
        return result

BFF (Backend for Frontend): One gateway per client type. The mobile BFF returns less data, fewer images, and different aggregations than the web BFF.

API Versioning Strategies

| Strategy | Example | Pros | Cons |
|---|---|---|---|
| URL path | /v1/users, /v2/users | Clear, easy routing | URL pollution |
| Header | Api-Version: 2 | Clean URLs | Hidden, harder to test |
| Query param | /users?version=2 | Easy to test | Caching complications |
| Content negotiation | Accept: application/vnd.api.v2+json | RESTful | Complex |

# URL-path versioning at the gateway
location /v1/users/ {
    proxy_pass http://user-service-v1:8080/users/;
}
location /v2/users/ {
    proxy_pass http://user-service-v2:8080/users/;
}

Security at the Gateway

CORS

location /api/ {
    add_header Access-Control-Allow-Origin "https://myapp.com";
    add_header Access-Control-Allow-Methods "GET, POST, PUT, DELETE, OPTIONS";
    add_header Access-Control-Allow-Headers "Authorization, Content-Type";
    add_header Access-Control-Max-Age 86400;

    if ($request_method = OPTIONS) {
        return 204;
    }
    proxy_pass http://backend;
}

IP Whitelisting and Geo-blocking

# Allow only specific IPs for admin endpoints
# (geo-blocking works the same way, using the geoip module's $geoip_country_code)
location /admin/ {
    allow 10.0.0.0/8;
    allow 192.168.1.0/24;
    deny all;
    proxy_pass http://admin-service:8080;
}

Request Size Limits

client_max_body_size 10m;       # reject requests > 10MB
proxy_read_timeout 30s;          # timeout slow backends
proxy_connect_timeout 5s;

Observability

Every request through the gateway should generate structured logs, trace context, and metrics.

Structured Logging

{
  "timestamp": "2026-03-21T10:00:00Z",
  "method": "POST",
  "path": "/api/orders",
  "status": 201,
  "latency_ms": 45,
  "client_ip": "203.0.113.42",
  "user_id": "user:1001",
  "upstream": "order-service:8080",
  "request_id": "req-abc-123",
  "rate_limit_remaining": 73,
  "cache_status": "MISS"
}

Distributed Tracing

The gateway generates or propagates trace IDs:

import uuid

def add_trace_headers(request):
    # Generate trace ID if not present
    trace_id = request.headers.get('X-Trace-ID', str(uuid.uuid4()))
    span_id = str(uuid.uuid4())[:16]

    request.headers['X-Trace-ID'] = trace_id
    request.headers['X-Span-ID'] = span_id
    request.headers['X-Request-ID'] = request.headers.get('X-Request-ID',
                                                           str(uuid.uuid4()))
    return request

API Gateway vs Service Mesh

| Aspect | API Gateway | Service Mesh (Istio/Linkerd) |
|---|---|---|
| Position | Edge (north-south traffic) | Internal (east-west traffic) |
| Clients | External (web, mobile, 3rd party) | Internal services only |
| Auth | JWT, OAuth, API keys | mTLS between services |
| Rate limiting | Per client/API key | Per service |
| Routing | Path/header based | Service-to-service |
| Protocol | HTTP/REST/GraphQL/WebSocket | gRPC, HTTP, TCP |
| Deployment | Dedicated proxy cluster | Sidecar per service |

Interview insight: They’re complementary, not competing. Use an API Gateway for external traffic and a service mesh for internal service-to-service communication.

Gateway Comparison

| Feature | Kong | AWS API Gateway | Envoy | NGINX | Traefik |
|---|---|---|---|---|---|
| Type | Plugin-based | Managed | L7 proxy | Web server + proxy | Cloud-native proxy |
| Config | Declarative / Admin API | Console / CloudFormation | YAML / xDS | Config files | Auto-discovery |
| Rate limiting | Plugin (Redis-backed) | Built-in (per stage) | Filter | Lua / OpenResty | Plugin |
| Auth | Plugins (JWT, OAuth, etc) | Cognito, Lambda authorizer | ext_authz filter | Lua / modules | Middleware |
| gRPC | Yes | Yes | Native | Limited | Yes |
| WebSocket | Yes | Yes (v2) | Yes | Yes | Yes |
| Best for | General purpose, plugin ecosystem | AWS-native, serverless | Service mesh sidecar, high perf | Simple, battle-tested | Kubernetes-native |

Interview Cheat Sheet

When to Use an API Gateway

  • Multiple backend services behind a single endpoint
  • Need centralized auth, rate limiting, and logging
  • Different client types (web, mobile, IoT) need different APIs
  • API versioning and canary deployments
  • Third-party API access with usage tracking

When NOT to Use

  • Single monolith — a simple reverse proxy (NGINX) is enough
  • Only internal traffic — use a service mesh instead
  • Ultra-low latency — every proxy hop adds 1-5ms

Key Numbers

| Metric | Typical Value |
|---|---|
| Gateway latency overhead | 1-5 ms per request |
| Rate limit check (Redis) | < 1 ms |
| JWT validation | < 0.5 ms (local, no network call) |
| Connection pool to upstream | 100-1000 per service |
| Gateway instances (production) | 2-4 (behind LB) |

Single Point of Failure?

The gateway is on the critical path. Mitigate with:

  1. Multiple instances behind a load balancer (or DNS round-robin)
  2. Health checks — remove unhealthy gateway instances
  3. Graceful degradation — if rate limiter (Redis) is down, fail open
  4. Stateless design — any instance can handle any request (no sessions)

Interview Answer Template

When designing an API Gateway:

  1. Why? — centralize cross-cutting concerns, decouple clients from internal topology
  2. Request pipeline — TLS → Auth → Rate Limit → Validate → Route → LB → Upstream
  3. Auth strategy — JWT for stateless, API keys for external consumers
  4. Rate limiting — token bucket per API key, sliding window for strict limits, backed by Redis
  5. Routing — path-based for service dispatch, header-based for versioning/canary
  6. Resilience — circuit breaker per upstream, retries with exponential backoff, timeouts
  7. Caching — response cache for GET endpoints, vary by auth context
  8. Observability — structured logs, distributed tracing (X-Trace-ID), Prometheus metrics
  9. Scaling — stateless horizontally-scaled instances behind an NLB
  10. BFF — one gateway per client type if mobile and web need different aggregations

Wrapping Up

An API Gateway is the control plane for your external API traffic. It lets your services focus on business logic while the gateway handles the boring-but-critical stuff: authentication, rate limiting, routing, resilience, and observability.

The mental model: think of it as a pipeline of middleware. Each stage in the pipeline either transforms the request, rejects it, or enriches it. The order matters: authenticate before rate limiting (so you know who to limit), rate limit before routing (so you reject early), and cache after routing (so you cache per-service responses).
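That mental model can be made concrete as a chain of handlers, each of which either rejects the request or passes it to the next stage (all names here are illustrative):

```python
from typing import Callable

Request = dict    # stand-in request type
Response = dict   # stand-in response type
Handler = Callable[[Request], Response]
Middleware = Callable[[Request, Handler], Response]

def build_pipeline(middlewares: list[Middleware], endpoint: Handler) -> Handler:
    """Fold middlewares around the endpoint; the first in the list runs first."""
    handler = endpoint
    for mw in reversed(middlewares):
        handler = (lambda m, nxt: lambda req: m(req, nxt))(mw, handler)
    return handler

def authenticate(req: Request, nxt: Handler) -> Response:
    if not req.get("token"):
        return {"status": 401}   # reject before any later stage runs
    return nxt(req)

def rate_limit(req: Request, nxt: Handler) -> Response:
    if req.get("over_limit"):
        return {"status": 429}
    return nxt(req)

pipeline = build_pipeline([authenticate, rate_limit],
                          endpoint=lambda req: {"status": 200})
```

The ordering argument falls out of the list: `[authenticate, rate_limit]` runs auth first, so an unauthenticated request never touches the rate limiter, and a rate-limited one never reaches the upstream.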

Get the pipeline right, and your entire microservices architecture gets cleaner.
