System Design Masterclass
March 28, 2026 | 6 min read
Lesson 3 / 15

03. Scaling Writes — Sharding, Write-Ahead Logs & Partitioning

TL;DR

Scale writes by splitting data across multiple machines (sharding). Choose your shard key carefully — it determines data distribution and query patterns. Use write-ahead logs for durability, LSM trees for write-optimized storage, and consistent hashing for elastic scaling. The hard part: cross-shard queries and rebalancing.

Scaling reads is straightforward — add caches, replicas, CDN. Scaling writes is fundamentally harder. Every write must be durable, consistent, and ordered. You cannot cache a write or serve a stale one. Every single one must reach persistent storage.

When a single database can no longer handle your write throughput, you split the data across multiple servers. This is sharding, and it reshapes your entire architecture.

Sharding strategies compared — range-based, hash-based, and directory-based

Vertical vs Horizontal: Do You Need Sharding?

VERTICAL SCALING (Scale Up):
  PostgreSQL on a single server handles 10K-50K writes/sec and 10TB comfortably.
  If your workload fits, don't shard. Sharding adds massive operational complexity.

HORIZONTAL SCALING (Scale Out):
  Required when write throughput or data volume exceeds one machine,
  or you need geographic locality / fault isolation.

Sharding Strategies

Range-Based Sharding

class RangeShardRouter:
    def __init__(self):
        self.ranges = [
            {"min": "A", "max": "F", "shard": "shard-1"},
            {"min": "G", "max": "M", "shard": "shard-2"},
            {"min": "N", "max": "Z", "shard": "shard-3"},
        ]

    def get_shard(self, key: str) -> str:
        first_char = key[0].upper()
        for r in self.ranges:
            if r["min"] <= first_char <= r["max"]:
                return r["shard"]
        raise ValueError(f"No shard for key: {key}")

Pros: Range queries hit one or two shards. Easy to understand. Cons: Uneven distribution creates hot spots (names S-Z may be 3x larger than A-F).

Hash-Based Sharding

import hashlib

class HashShardRouter:
    def __init__(self, num_shards: int):
        self.num_shards = num_shards

    def get_shard(self, key: str) -> str:
        hash_val = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return f"shard-{hash_val % self.num_shards}"

Pros: Even distribution, no hot spots from skewed values. Cons: Range queries must fan out to ALL shards. Adding/removing shards requires rehashing (solved by consistent hashing).

Directory-Based Sharding

class DirectoryShardRouter:
    def __init__(self, shards: list):
        self.shards = shards
        self.directory = {}                    # tenant_id → shard mapping
        self.load = {s: 0 for s in shards}     # tenants assigned per shard

    def get_shard(self, tenant_id: str) -> str:
        if tenant_id not in self.directory:
            self.directory[tenant_id] = self._assign_least_loaded_shard()
        return self.directory[tenant_id]

    def _assign_least_loaded_shard(self) -> str:
        shard = min(self.load, key=self.load.get)
        self.load[shard] += 1
        return shard

    def migrate(self, tenant_id: str, new_shard: str):
        self.directory[tenant_id] = new_shard  # Easy rebalancing

Pros: Maximum flexibility. Easy rebalancing (update the directory). Perfect for multi-tenant SaaS. Cons: Directory is a single point of failure and potential bottleneck.

Choosing Your Shard Key

The most important decision in a sharded system. A bad shard key will haunt you for years.

GOOD SHARD KEY PROPERTIES:
  High cardinality (many distinct values)
  Even distribution (writes spread across all shards)
  Aligned with queries (most queries include the shard key)
  Stable (doesn't change after creation)

COMMON CHOICES:
  Social media → user_id  |  E-commerce → order_id
  Multi-tenant → tenant_id  |  Time-series → timestamp
  Chat app → channel_id  |  Geo service → region

BAD CHOICES:
  Auto-increment ID (all new writes hit the last shard)
  Boolean fields (only 2 shards)
  Low-cardinality fields like country (195 values → hot spots)
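
To make the auto-increment pitfall concrete, here is a quick sketch. The range boundaries and three-shard setup are made up for illustration: sequential IDs all land on the last range shard, while hashing the same IDs spreads them out.

```python
import hashlib
from collections import Counter

def range_shard(order_id: int) -> int:
    # Illustrative ranges sized for yesterday's data: 0-999, 1000-1999, 2000+
    if order_id < 1000:
        return 0
    if order_id < 2000:
        return 1
    return 2

def hash_shard(order_id: int, n: int = 3) -> int:
    return int(hashlib.md5(str(order_id).encode()).hexdigest(), 16) % n

new_ids = range(5000, 6000)  # the next 1,000 auto-increment IDs
range_counts = Counter(range_shard(i) for i in new_ids)
hash_counts = Counter(hash_shard(i) for i in new_ids)
# range_counts: all 1,000 new writes pile onto shard 2
# hash_counts: roughly 333 per shard
```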

Hot Spots

Even good shard keys can have hot keys (celebrity accounts generating 1000x more activity):

class HotKeyBuffer:
    """Buffer writes for hot keys, flush in batches."""
    def __init__(self, shard_client, threshold=1000, batch_size=100):
        self.shard = shard_client    # Client that talks to the target shard
        self.threshold = threshold
        self.batch_size = batch_size
        self.buffer = {}
        self.write_counts = {}

    def write(self, key: str, value: dict):
        self.write_counts[key] = self.write_counts.get(key, 0) + 1
        if self.write_counts[key] > self.threshold:       # Hot key: buffer it
            self.buffer.setdefault(key, []).append(value)
            if len(self.buffer[key]) >= self.batch_size:
                self._flush_batch(key)                    # Batch-write to shard
        else:
            self.shard.write(key, value)                  # Cold key: direct write

    def _flush_batch(self, key: str):
        batch, self.buffer[key] = self.buffer[key], []
        self.shard.write_batch(key, batch)                # One round-trip per batch

Write-Ahead Logs (WAL)

The foundation of durability in every serious database.

Write-ahead log and LSM tree internals

How WAL Works

Before changing the actual data structure, write the change to a sequential, append-only log. If the machine crashes, replay the log to recover.

import json, os, struct

class WriteAheadLog:
    def __init__(self, path: str):
        self.file = open(path, 'ab')
        self.lsn = 0

    def append(self, operation: str, key: str, value: str) -> int:
        self.lsn += 1
        entry = json.dumps({
            'lsn': self.lsn, 'op': operation, 'key': key, 'value': value
        }).encode()
        self.file.write(struct.pack('I', len(entry)) + entry)
        self.file.flush()
        os.fsync(self.file.fileno())  # Force to disk
        return self.lsn

    def replay(self, from_lsn=0):
        with open(self.file.name, 'rb') as f:
            while (length_bytes := f.read(4)):
                entry = json.loads(f.read(struct.unpack('I', length_bytes)[0]))
                if entry['lsn'] > from_lsn:
                    yield entry

class DurableKVStore:
    def __init__(self, data_dir):
        self.wal = WriteAheadLog(f"{data_dir}/wal.log")
        self.data = {}
        self._recover()

    def put(self, key, value):
        self.wal.append('PUT', key, value)  # WAL first
        self.data[key] = value              # Then memory

    def _recover(self):
        for entry in self.wal.replay():
            if entry['op'] == 'PUT':
                self.data[entry['key']] = entry['value']
            self.wal.lsn = max(self.wal.lsn, entry['lsn'])  # Resume LSN sequence

Why WAL is Fast

Random writes (B-tree): Seek to page, update, seek to index, update index.
Sequential writes (WAL): Append to end of file. No seeking.
Sequential is 100-1000x faster on HDD, 10-100x on SSD.

LSM Trees: Write-Optimized Storage

The engine behind Cassandra, RocksDB, LevelDB, and HBase.

WRITE PATH:
1. Append to WAL (durability)
2. Insert into MEMTABLE (sorted in-memory tree, ~64MB)
3. When full, flush to disk as immutable SSTABLE (Sorted String Table)
4. Background COMPACTION merges SSTables, removes duplicates/tombstones

READ PATH:
1. Check memtable → Check L0 SSTables → L1 → L2 ...
2. BLOOM FILTERS skip SSTables that definitely don't contain the key
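
A Bloom filter is just a bit array plus k hash functions: "no" answers are definite, "yes" answers may be false positives. A toy sketch with made-up sizing (1,024 bits, 3 hashes — not production-tuned):

```python
import hashlib

class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, key: str):
        for i in range(self.hashes):          # k independent hash positions
            h = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, key: str):
        for p in self._positions(key):
            self.bits[p] = True

    def might_contain(self, key: str) -> bool:
        # All bits set → "maybe present"; any bit clear → definitely absent
        return all(self.bits[p] for p in self._positions(key))

bf = BloomFilter()
bf.add("user:42")
bf.might_contain("user:42")   # True: added keys are always reported present
```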

LSM TREE vs B-TREE:
  Write speed:   LSM fast (sequential) vs B-tree slower (random I/O)
  Read speed:    LSM slower (check levels) vs B-tree fast (single lookup)
  Space:         LSM higher (duplicates) vs B-tree lower (in-place)

  LSM users: Cassandra, RocksDB, LevelDB, HBase
  B-tree users: PostgreSQL, MySQL, SQLite, MongoDB
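
The write path above can be sketched as a toy in-memory model. `TinyLSM` and its flush limit are hypothetical; a real engine writes SSTables to disk and compacts them in the background:

```python
class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}       # Stands in for a sorted in-memory tree
        self.sstables = []       # Immutable "SSTables", oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value                   # Step 2: memtable insert
        if len(self.memtable) >= self.memtable_limit:
            self._flush()                            # Step 3: flush when full

    def _flush(self):
        # Sort and freeze the memtable as an immutable SSTable
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

    def get(self, key, default=None):
        if key in self.memtable:                     # Newest data wins
            return self.memtable[key]
        for sst in reversed(self.sstables):          # Then newer SSTables first
            if key in sst:
                return sst[key]
        return default

db = TinyLSM(memtable_limit=2)
db.put("a", 1); db.put("b", 2)   # Second put fills the memtable → flush
db.put("a", 9)                   # Newer value shadows the flushed one
```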

Consistent Hashing

Standard hash(key) % N has a fatal flaw: adding/removing a shard remaps nearly every key. Consistent hashing remaps only 1/N.

Consistent hashing ring showing nodes and key assignment

import hashlib
from bisect import bisect_right

class ConsistentHashRing:
    def __init__(self, virtual_nodes=150):
        self.virtual_nodes = virtual_nodes
        self.ring = []      # Sorted (hash_value, node_name) pairs
        self.nodes = set()

    def add_node(self, node: str):
        self.nodes.add(node)
        for i in range(self.virtual_nodes):
            h = int(hashlib.md5(f"{node}:vn{i}".encode()).hexdigest(), 16)
            idx = bisect_right([x[0] for x in self.ring], h)
            self.ring.insert(idx, (h, node))

    def remove_node(self, node: str):
        self.nodes.discard(node)
        self.ring = [(h, n) for h, n in self.ring if n != node]

    def get_node(self, key: str) -> str:
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        idx = bisect_right([x[0] for x in self.ring], h)
        return self.ring[idx % len(self.ring)][1]

ring = ConsistentHashRing(virtual_nodes=150)
ring.add_node("db-1"); ring.add_node("db-2"); ring.add_node("db-3")
ring.get_node("user:1001")  # Deterministic: this key always maps to the same node
ring.add_node("db-4")       # Only ~25% of keys remap

Virtual nodes (150+ per physical node) ensure even distribution. Without them, 3 nodes on a ring can have wildly uneven load.
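
For contrast, the "nearly every key" claim about naive mod-N is easy to measure: growing from 3 to 4 shards, a key stays put only when its hash gives the same remainder mod 3 and mod 4 (about 1 in 4), so roughly 75% of keys move:

```python
import hashlib

def naive_shard(key: str, n: int) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % n

keys = [f"user:{i}" for i in range(10_000)]
moved = sum(1 for k in keys if naive_shard(k, 3) != naive_shard(k, 4))
fraction = moved / len(keys)   # ~0.75: three out of four keys remap
```

With consistent hashing, the same change would move only ~1/4 of keys — and only onto the new node, never between existing nodes.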

Write Batching

Individual write overhead (network, fsync, commit) dominates for small writes. Batching amortizes it.

import threading, time
from queue import Queue, Empty

class WriteBatcher:
    def __init__(self, db, batch_size=100, flush_ms=50):
        self.db = db
        self.queue = Queue()
        self.batch_size = batch_size
        self.flush_interval = flush_ms / 1000
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def write(self, key, value):
        future = threading.Event()
        self.queue.put((key, value, future))
        future.wait(timeout=1.0)  # Block until the batch containing us commits

    def _flush_loop(self):
        while True:
            batch, deadline = [], time.time() + self.flush_interval
            while len(batch) < self.batch_size and time.time() < deadline:
                try:
                    batch.append(self.queue.get(timeout=0.01))
                except Empty:
                    continue
            if batch:
                self.db.begin_transaction()
                for key, value, _ in batch:
                    self.db.upsert(key, value)
                self.db.commit()
                for _, _, future in batch:
                    future.set()

# Without batching: 10K inserts → 10K transactions → ~30 sec
# With batching: 100 batches of 100 → 100 transactions → ~0.5 sec

Cross-Shard Queries

The biggest tradeoff of sharding. Queries spanning multiple shards are expensive.

import concurrent.futures

def scatter_gather(shards, query, params, order_by, limit):
    """Send query to all shards in parallel, merge results."""
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(shard.execute, f"{query} ORDER BY {order_by} LIMIT {limit}", params)
                   for shard in shards.values()]
        results = []
        for f in concurrent.futures.as_completed(futures):
            results.extend(f.result())
    results.sort(key=lambda r: r[order_by], reverse=True)
    return results[:limit]

Minimizing Cross-Shard Queries

1. CO-LOCATE: Shard users + orders by user_id → single-shard lookups
2. DENORMALIZE: Copy user name into order record to avoid JOINs
3. APP-LEVEL JOINS: Fetch user from shard-1, orders from shard-3, join in code
4. CQRS: Write to sharded DB, project into read-optimized view (Elasticsearch)
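
Strategy 3 (app-level joins) in code; the dict-backed shard clients below are hypothetical stand-ins for real connections:

```python
class StubShard:
    """Dict-backed stand-in for a real shard client."""
    def __init__(self, rows):
        self.rows = rows
    def get(self, key):
        return self.rows[key]
    def scan(self, prefix):
        return [v for k, v in self.rows.items() if k.startswith(prefix)]

def get_user_with_orders(user_shard, order_shard, user_id):
    user = user_shard.get(f"user:{user_id}")                # Hit shard 1
    orders = order_shard.scan(prefix=f"orders:{user_id}:")  # Hit shard 2
    return {**user, "orders": orders}                       # Join in app code

users = StubShard({"user:7": {"id": 7, "name": "Ada"}})
orders = StubShard({"orders:7:1": {"total": 30}, "orders:7:2": {"total": 12}})
profile = get_user_with_orders(users, orders, 7)
```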

Rebalancing

Shards become uneven over time. Online resharding uses a double-write approach:

1. Create new shards alongside existing ones
2. Double-write: every write goes to BOTH old and new topology
3. Backfill existing data from old shards to new shards
4. Verify data consistency between old and new
5. Cut reads over to new topology
6. Stop writes to old shards
7. Decommission after verification period (1-4 weeks total)
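
Step 2 (double-write) in miniature, with hypothetical router objects; reads stay on the old topology until backfill and verification complete:

```python
class DoubleWriteRouter:
    """During migration, every write goes to both topologies."""
    def __init__(self, old_router, new_router):
        self.old, self.new = old_router, new_router

    def write(self, key, value):
        self.old.write(key, value)  # Old topology remains the source of truth
        self.new.write(key, value)  # New topology converges for recent keys

    def read(self, key):
        return self.old.read(key)   # Cut over only after step 4 verification

class DictShard:                    # Stand-in for a real shard client
    def __init__(self): self.data = {}
    def write(self, key, value): self.data[key] = value
    def read(self, key): return self.data.get(key)

old, new = DictShard(), DictShard()
router = DoubleWriteRouter(old, new)
router.write("user:1", {"name": "Ada"})  # Lands in both topologies
```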

Real-World Write Scaling

CASSANDRA (100K+ writes/sec per node):
  LSM trees + consistent hashing + tunable consistency

KAFKA (millions of messages/sec):
  Append-only log + partitioning + sequential disk I/O + zero-copy

DYNAMODB (auto-scaling):
  Hash partitioning + automatic hot partition splitting

VITESS (YouTube's MySQL sharding layer):
  Application-level sharding + routing + rebalancing for MySQL

Key Takeaways

  • Scale writes by splitting data across machines (sharding). But verify you need it first — a single PostgreSQL handles 10K-50K writes/sec on modern hardware.
  • Your shard key is the most consequential decision. It must have high cardinality, even distribution, and align with query patterns. Changing it later is a multi-week migration.
  • Range sharding enables efficient range queries but risks hot spots. Hash sharding distributes evenly but kills range queries. Directory sharding is the most flexible but adds a lookup hop.
  • Write-ahead logs are the foundation of durability. Sequential append-only writes are 10-1000x faster than random writes, which is why WAL-first architectures dominate.
  • LSM trees trade read performance for write performance by buffering in memory and flushing sorted files to disk. Bloom filters compensate for slower reads. This powers Cassandra and RocksDB.
  • Consistent hashing solves the resharding problem. Adding a node remaps only 1/N of keys. Use 150+ virtual nodes for even distribution.
  • Cross-shard queries are the tax you pay for sharding. Minimize them by co-locating related data, denormalizing, or projecting into read-optimized views with CQRS.