software-design|March 21, 2026|10 min read

Deep Dive on API Gateway: A System Design Interview Perspective

TL;DR

An API Gateway is a reverse proxy that sits between clients and backend services, handling cross-cutting concerns: TLS termination, authentication, rate limiting, routing, load balancing, circuit breaking, caching, and observability. For interviews, know why it exists (decouple cross-cutting concerns from services), the request lifecycle (TLS → auth → rate limit → validate → route → LB → upstream → cache → respond), rate limiting algorithms (token bucket for burst, sliding window for accuracy), and when to use BFF pattern (one gateway per client type). Key tradeoff: single point of failure vs operational simplicity. Mitigate with horizontal scaling, health checks, and graceful degradation.

“An API Gateway is the front door to your microservices. Every request walks through it, and every cross-cutting concern lives there — so you don’t repeat it in 50 services.”

In a monolith, there’s one entry point. In a microservices architecture, there can be dozens or hundreds. Clients shouldn’t need to know about your internal service topology, manage multiple connections, or handle authentication, rate limiting, and retries themselves. That’s the API Gateway’s job.

This article covers everything you need for system design interviews: what an API Gateway does, how each component works, the algorithms behind rate limiting, and the patterns that come up when designing real systems.

What is an API Gateway?

An API Gateway is a reverse proxy that sits between external clients and your backend services. It handles cross-cutting concerns that every request needs but no individual service should implement:

[Figure: API Gateway Architecture]

Without a gateway, every service must independently handle:

  • TLS termination
  • Authentication / authorization
  • Rate limiting
  • Request validation
  • Logging, metrics, tracing
  • CORS headers

With a gateway, services only handle business logic. Everything else is centralized.

Request Lifecycle

Every request through an API Gateway follows a predictable pipeline:

  1. TLS termination — decrypt the incoming HTTPS connection
  2. Authentication — reject with 401 Unauthorized if the token is missing or invalid
  3. Rate limiting — reject with 429 Too Many Requests if the client is over its limit
  4. Request validation — reject malformed payloads early
  5. Route matching — map the path to an upstream service
  6. Circuit breaker check — if the circuit is open, return 503 or a fallback response
  7. Load balancing — pick a healthy upstream instance
  8. Upstream call — on failure, retry with backoff
  9. Response handling — cache the response if it was a GET
  10. Add headers + logging, then return to the client

Let’s break down each stage.

Authentication and Authorization

The gateway validates identity before requests reach your services.

Common Auth Patterns

| Pattern | How | Best For |
|---|---|---|
| API Key | Key in header (X-API-Key) or query param | Service-to-service, simple APIs |
| JWT (Bearer Token) | Verify signature locally, extract claims | Stateless auth, microservices |
| OAuth 2.0 | Token introspection or JWT validation | Third-party access, SSO |
| mTLS | Client presents certificate | Service mesh, zero-trust |

JWT Validation at the Gateway

# Gateway middleware — JWT validation (no database call needed)
# (`Response` stands in for whatever response type your framework provides)
import jwt  # PyJWT
from functools import wraps

PUBLIC_KEY = open('public.pem').read()

def authenticate(func):
    @wraps(func)
    def wrapper(request, *args, **kwargs):
        token = request.headers.get('Authorization', '').replace('Bearer ', '')
        if not token:
            return Response(status=401, body='Missing token')

        try:
            claims = jwt.decode(token, PUBLIC_KEY, algorithms=['RS256'],
                                audience='my-api')
        except jwt.ExpiredSignatureError:
            return Response(status=401, body='Token expired')
        except jwt.InvalidTokenError:
            return Response(status=401, body='Invalid token')

        # Attach user context for downstream services
        request.headers['X-User-ID'] = claims['sub']
        request.headers['X-User-Roles'] = ','.join(claims.get('roles', []))
        return func(request, *args, **kwargs)
    return wrapper

Interview insight: JWT validation is stateless — the gateway verifies the signature using the public key without calling the auth service. This is why JWTs are preferred over opaque tokens in API Gateways — no network call per request.

The tradeoff: you can’t revoke a JWT before it expires. Mitigations: short expiry (15 min) + refresh tokens, or a lightweight token blocklist in Redis.

Rate Limiting

Rate limiting protects your services from abuse and ensures fair usage. The gateway is the natural place to enforce it.

Token Bucket Algorithm

The most common algorithm. Each client gets a bucket with a maximum capacity. Tokens are added at a fixed rate. Each request consumes one token.

import time
import redis

class TokenBucketRateLimiter:
    def __init__(self, redis_client, max_tokens=100, refill_rate=10):
        self.redis = redis_client
        self.max_tokens = max_tokens       # bucket capacity
        self.refill_rate = refill_rate     # tokens per second

    def is_allowed(self, client_id: str) -> bool:
        key = f"ratelimit:{client_id}"
        now = time.time()

        # Lua script for atomic check-and-update
        lua_script = """
        local key = KEYS[1]
        local max_tokens = tonumber(ARGV[1])
        local refill_rate = tonumber(ARGV[2])
        local now = tonumber(ARGV[3])

        local data = redis.call('HMGET', key, 'tokens', 'last_refill')
        local tokens = tonumber(data[1]) or max_tokens
        local last_refill = tonumber(data[2]) or now

        -- Refill tokens based on elapsed time
        local elapsed = now - last_refill
        tokens = math.min(max_tokens, tokens + elapsed * refill_rate)

        if tokens >= 1 then
            tokens = tokens - 1
            redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
            redis.call('EXPIRE', key, 60)
            return 1  -- allowed
        else
            redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
            redis.call('EXPIRE', key, 60)
            return 0  -- rate limited
        end
        """
        return bool(self.redis.eval(lua_script, 1, key,
                                     self.max_tokens, self.refill_rate, now))

Sliding Window Log

More accurate than token bucket for strict per-second limits:

import uuid

def sliding_window_is_allowed(redis_client, client_id, window_sec=60, max_requests=100):
    key = f"ratelimit:sw:{client_id}"
    now = time.time()

    pipe = redis_client.pipeline()
    pipe.zremrangebyscore(key, 0, now - window_sec)  # remove expired entries
    member = f"{now}:{uuid.uuid4()}"                 # unique member, so concurrent
    pipe.zadd(key, {member: now})                    # requests can't overwrite each other
    pipe.zcard(key)                                  # count requests in window
    pipe.expire(key, window_sec)
    results = pipe.execute()

    # Tradeoff: memory is O(requests per window) per client — the cost of accuracy
    return results[2] <= max_requests

Rate Limiting Response Headers

HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 73
X-RateLimit-Reset: 1679000060

# When exceeded:
HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0

Rate Limiting Dimensions

| Dimension | Example | Use Case |
|---|---|---|
| Per API key | 1000 req/min per key | SaaS API tiers |
| Per user | 100 req/min per user | Logged-in users |
| Per IP | 50 req/min per IP | Anonymous/public APIs |
| Per endpoint | 10 req/min on /api/export | Expensive operations |
| Global | 50K req/s total | Cluster protection |

Routing and Load Balancing

Path-Based Routing

The gateway maps incoming paths to backend services:

# NGINX example
location /api/users/ {
    proxy_pass http://user-service:8080/;
}
location /api/orders/ {
    proxy_pass http://order-service:8080/;
}
location /api/payments/ {
    proxy_pass http://payment-service:8080/;
}
# Kong declarative config
services:
  - name: user-service
    url: http://user-service:8080
    routes:
      - name: user-routes
        paths:
          - /api/users
        strip_path: true

  - name: order-service
    url: http://order-service:8080
    routes:
      - name: order-routes
        paths:
          - /api/orders
        methods:
          - GET
          - POST

Header-Based Routing

Route based on headers for A/B testing, canary deployments, or API versioning:

# Version-based routing (nginx `if` works for a single check like this,
# but a map block is the more robust pattern in production)
location /api/ {
    if ($http_api_version = "v2") {
        proxy_pass http://service-v2:8080;
    }
    proxy_pass http://service-v1:8080;
}

Load Balancing Algorithms

| Algorithm | Behavior | Best For |
|---|---|---|
| Round Robin | Rotate through instances sequentially | Homogeneous instances |
| Weighted Round Robin | More traffic to higher-weight instances | Mixed instance sizes |
| Least Connections | Route to instance with fewest active connections | Variable request duration |
| IP Hash | Same client always hits same instance | Session affinity |
| Random | Pick an instance randomly | Simple, surprisingly effective |

# NGINX weighted upstream
upstream order-service {
    server order-1:8080 weight=3;  # gets 3x traffic
    server order-2:8080 weight=1;
    server order-3:8080 weight=1;
}
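Of these, least connections is the one worth being able to sketch in an interview: it reduces to tracking in-flight request counts per instance. A toy illustration (not any proxy's actual implementation):

```python
class LeastConnectionsBalancer:
    def __init__(self, instances: list[str]):
        # in-flight request count per instance
        self.active = {inst: 0 for inst in instances}

    def acquire(self) -> str:
        # Pick the instance with the fewest in-flight requests
        inst = min(self.active, key=self.active.get)
        self.active[inst] += 1
        return inst

    def release(self, inst: str) -> None:
        # Call when the proxied request completes
        self.active[inst] -= 1
```

This is why least connections beats round robin when request durations vary: a slow request keeps its instance's count high, steering new traffic elsewhere until it finishes.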

Circuit Breaker

When a backend service starts failing, the gateway should stop sending requests to it instead of overwhelming it further.

States

The breaker moves between three states:

  • CLOSED (normal traffic) → OPEN when the failure threshold is exceeded (e.g. 50% failures in 10s)
  • OPEN (all requests fail fast) → HALF-OPEN when the recovery timeout expires (e.g. 30s)
  • HALF-OPEN (probe with 1 request) → CLOSED if the probe succeeds, back to OPEN if it fails

import time
from enum import Enum

class CircuitOpenError(Exception):
    """Raised when the circuit is open and the call is rejected fast."""
    pass

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30,
                 success_threshold=2):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.success_threshold = success_threshold
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = 0

    def call(self, func, *args, **kwargs):
        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = CircuitState.HALF_OPEN
            else:
                raise CircuitOpenError("Circuit is open")

        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise

    def _on_success(self):
        if self.state == CircuitState.HALF_OPEN:
            self.success_count += 1
            if self.success_count >= self.success_threshold:
                self.state = CircuitState.CLOSED
                self.failure_count = 0
                self.success_count = 0
        else:
            self.failure_count = 0  # a success while CLOSED resets the failure streak

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN
            self.success_count = 0
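The request lifecycle also lists retry with backoff on upstream failure, which pairs with the circuit breaker. A minimal sketch with jitter (the parameter values are illustrative):

```python
import random
import time

def retry_with_backoff(func, max_attempts=3, base_delay=0.1, max_delay=2.0):
    """Call func(), retrying on exception with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Jitter spreads retries out so clients don't stampede in sync
            time.sleep(delay * random.uniform(0.5, 1.5))
```

Only retry idempotent requests (GETs, and writes with idempotency keys) — a blindly retried POST can double-charge a customer.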

Response Caching

The gateway can cache responses to reduce backend load for idempotent requests.

# NGINX response caching
proxy_cache_path /tmp/cache levels=1:2 keys_zone=api_cache:10m
                 max_size=1g inactive=60m;

location /api/products/ {
    proxy_cache api_cache;
    proxy_cache_methods GET HEAD;
    proxy_cache_valid 200 5m;          # cache 200 responses for 5 minutes
    proxy_cache_valid 404 1m;
    proxy_cache_key "$request_uri|$http_authorization";  # vary by auth
    proxy_cache_bypass $http_cache_control;

    add_header X-Cache-Status $upstream_cache_status;
    proxy_pass http://product-service:8080;
}

What to cache:

  • GET requests with stable responses (product listings, user profiles)
  • Public endpoints (no auth variation)
  • Responses with explicit Cache-Control headers

What NOT to cache:

  • POST/PUT/DELETE (non-idempotent)
  • Responses with user-specific data (unless cache key includes user ID)
  • Real-time data (stock prices, live scores)

Request Aggregation (BFF Pattern)

For mobile or web clients that need data from multiple services in a single call:

Each client type talks to its own gateway, which fans out to the backend services that client needs:

  • Mobile App → Mobile BFF Gateway
  • Web App → Web BFF Gateway
  • Each BFF aggregates across the backend services (User, Order, Recommendation, Product)

# BFF endpoint — aggregate multiple service calls
import asyncio
import aiohttp

async def get_user_dashboard(user_id: str):
    async with aiohttp.ClientSession() as session:
        # Fan out to multiple services in parallel
        user_task = session.get(f'http://user-service/users/{user_id}')
        orders_task = session.get(f'http://order-service/users/{user_id}/orders?limit=5')
        recs_task = session.get(f'http://recommendation-service/users/{user_id}')

        user_resp, orders_resp, recs_resp = await asyncio.gather(
            user_task, orders_task, recs_task,
            return_exceptions=True
        )

        # Aggregate responses (graceful degradation)
        result = {
            'user': await user_resp.json() if not isinstance(user_resp, Exception) else None,
            'recent_orders': await orders_resp.json() if not isinstance(orders_resp, Exception) else [],
            'recommendations': await recs_resp.json() if not isinstance(recs_resp, Exception) else [],
        }
        return result

BFF (Backend for Frontend): One gateway per client type. The mobile BFF returns less data, fewer images, and different aggregations than the web BFF.

API Versioning Strategies

| Strategy | Example | Pros | Cons |
|---|---|---|---|
| URL path | /v1/users, /v2/users | Clear, easy routing | URL pollution |
| Header | Api-Version: 2 | Clean URLs | Hidden, harder to test |
| Query param | /users?version=2 | Easy to test | Caching complications |
| Content negotiation | Accept: application/vnd.api.v2+json | RESTful | Complex |

# URL-path versioning at the gateway
location /v1/users/ {
    proxy_pass http://user-service-v1:8080/users/;
}
location /v2/users/ {
    proxy_pass http://user-service-v2:8080/users/;
}

Security at the Gateway

CORS

location /api/ {
    add_header Access-Control-Allow-Origin "https://myapp.com";
    add_header Access-Control-Allow-Methods "GET, POST, PUT, DELETE, OPTIONS";
    add_header Access-Control-Allow-Headers "Authorization, Content-Type";
    add_header Access-Control-Max-Age 86400;

    if ($request_method = OPTIONS) {
        return 204;
    }
    proxy_pass http://backend;
}

IP Whitelisting and Geo-blocking

# Allow only specific IPs for admin endpoints
# (geo-blocking works the same way, using the geoip module's $geoip_country_code)
location /admin/ {
    allow 10.0.0.0/8;
    allow 192.168.1.0/24;
    deny all;
    proxy_pass http://admin-service:8080;
}

Request Size Limits

client_max_body_size 10m;       # reject requests > 10MB
proxy_read_timeout 30s;          # timeout slow backends
proxy_connect_timeout 5s;

Observability

Every request through the gateway should generate structured logs, trace context, and metrics.

Structured Logging

{
  "timestamp": "2026-03-21T10:00:00Z",
  "method": "POST",
  "path": "/api/orders",
  "status": 201,
  "latency_ms": 45,
  "client_ip": "203.0.113.42",
  "user_id": "user:1001",
  "upstream": "order-service:8080",
  "request_id": "req-abc-123",
  "rate_limit_remaining": 73,
  "cache_status": "MISS"
}

Distributed Tracing

The gateway generates or propagates trace IDs:

import uuid

def add_trace_headers(request):
    # Generate trace ID if not present
    trace_id = request.headers.get('X-Trace-ID', str(uuid.uuid4()))
    span_id = str(uuid.uuid4())[:16]

    request.headers['X-Trace-ID'] = trace_id
    request.headers['X-Span-ID'] = span_id
    request.headers['X-Request-ID'] = request.headers.get('X-Request-ID',
                                                           str(uuid.uuid4()))
    return request

API Gateway vs Service Mesh

| Aspect | API Gateway | Service Mesh (Istio/Linkerd) |
|---|---|---|
| Position | Edge (north-south traffic) | Internal (east-west traffic) |
| Clients | External (web, mobile, 3rd party) | Internal services only |
| Auth | JWT, OAuth, API keys | mTLS between services |
| Rate limiting | Per client/API key | Per service |
| Routing | Path/header based | Service-to-service |
| Protocol | HTTP/REST/GraphQL/WebSocket | gRPC, HTTP, TCP |
| Deployment | Dedicated proxy cluster | Sidecar per service |

Interview insight: They’re complementary, not competing. Use an API Gateway for external traffic and a service mesh for internal service-to-service communication.

Gateway Comparison

| Feature | Kong | AWS API Gateway | Envoy | NGINX | Traefik |
|---|---|---|---|---|---|
| Type | Plugin-based | Managed | L7 proxy | Web server + proxy | Cloud-native proxy |
| Config | Declarative / Admin API | Console / CloudFormation | YAML / xDS | Config files | Auto-discovery |
| Rate limiting | Plugin (Redis-backed) | Built-in (per stage) | Filter | Lua / OpenResty | Plugin |
| Auth | Plugins (JWT, OAuth, etc) | Cognito, Lambda authorizer | ext_authz filter | Lua / modules | Middleware |
| gRPC | Yes | Yes | Native | Limited | Yes |
| WebSocket | Yes | Yes (v2) | Yes | Yes | Yes |
| Best for | General purpose, plugin ecosystem | AWS-native, serverless | Service mesh sidecar, high perf | Simple, battle-tested | Kubernetes-native |

Interview Cheat Sheet

When to Use an API Gateway

  • Multiple backend services behind a single endpoint
  • Need centralized auth, rate limiting, and logging
  • Different client types (web, mobile, IoT) need different APIs
  • API versioning and canary deployments
  • Third-party API access with usage tracking

When NOT to Use

  • Single monolith — a simple reverse proxy (NGINX) is enough
  • Only internal traffic — use a service mesh instead
  • Ultra-low latency — every proxy hop adds 1-5ms

Key Numbers

| Metric | Typical Value |
|---|---|
| Gateway latency overhead | 1-5 ms per request |
| Rate limit check (Redis) | < 1 ms |
| JWT validation | < 0.5 ms (local, no network call) |
| Connection pool to upstream | 100-1000 per service |
| Gateway instances (production) | 2-4 (behind LB) |

Single Point of Failure?

The gateway is on the critical path. Mitigate with:

  1. Multiple instances behind a load balancer (or DNS round-robin)
  2. Health checks — remove unhealthy gateway instances
  3. Graceful degradation — if rate limiter (Redis) is down, fail open
  4. Stateless design — any instance can handle any request (no sessions)

Interview Answer Template

When designing an API Gateway:

  1. Why? — centralize cross-cutting concerns, decouple clients from internal topology
  2. Request pipeline — TLS → Auth → Rate Limit → Validate → Route → LB → Upstream
  3. Auth strategy — JWT for stateless, API keys for external consumers
  4. Rate limiting — token bucket per API key, sliding window for strict limits, backed by Redis
  5. Routing — path-based for service dispatch, header-based for versioning/canary
  6. Resilience — circuit breaker per upstream, retries with exponential backoff, timeouts
  7. Caching — response cache for GET endpoints, vary by auth context
  8. Observability — structured logs, distributed tracing (X-Trace-ID), Prometheus metrics
  9. Scaling — stateless horizontally-scaled instances behind an NLB
  10. BFF — one gateway per client type if mobile and web need different aggregations

Wrapping Up

An API Gateway is the control plane for your external API traffic. It lets your services focus on business logic while the gateway handles the boring-but-critical stuff: authentication, rate limiting, routing, resilience, and observability.

The mental model: think of it as a pipeline of middleware. Each stage in the pipeline either transforms the request, rejects it, or enriches it. The order matters: authenticate before rate limiting (so you know who to limit), rate limit before routing (so you reject early), and cache after routing (so you cache per-service responses).
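That mental model can be made concrete as a chain of handlers, each of which either rejects the request or passes it to the next stage (all names here are illustrative):

```python
from typing import Callable

Request = dict    # stand-in request type
Response = dict   # stand-in response type
Handler = Callable[[Request], Response]
Middleware = Callable[[Request, Handler], Response]

def build_pipeline(middlewares: list[Middleware], endpoint: Handler) -> Handler:
    """Fold middlewares around the endpoint; the first in the list runs first."""
    handler = endpoint
    for mw in reversed(middlewares):
        handler = (lambda m, nxt: lambda req: m(req, nxt))(mw, handler)
    return handler

def authenticate(req: Request, nxt: Handler) -> Response:
    if not req.get("token"):
        return {"status": 401}   # reject before any later stage runs
    return nxt(req)

def rate_limit(req: Request, nxt: Handler) -> Response:
    if req.get("over_limit"):
        return {"status": 429}
    return nxt(req)

pipeline = build_pipeline([authenticate, rate_limit],
                          endpoint=lambda req: {"status": 200})
```

The ordering argument falls out of the list: `[authenticate, rate_limit]` runs auth first, so an unauthenticated request never touches the rate limiter, and a rate-limited one never reaches the upstream.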

Get the pipeline right, and your entire microservices architecture gets cleaner.
