System Design Masterclass
March 28, 2026 | 14 min read
Lesson 14 / 15

14. Designing a File Storage System (Dropbox/Google Drive)

TL;DR

Split files into chunks (4MB each), hash each chunk for deduplication, upload only changed chunks (delta sync). Use a metadata service to track file→chunk mappings. Sync via long polling or WebSocket for real-time updates. Handle conflicts with last-writer-wins or version branching. Store chunks in object storage (S3), metadata in a database. Dedup alone saves 30-50% storage.

A cloud file storage system seems straightforward until you think about what actually happens when two people edit the same file on different devices, or when a user uploads a 10GB video over a spotty connection, or when 500 million users each store 200 files and expect instant sync. The core challenge is not storage itself — object storage like S3 is practically infinite. The challenge is sync: keeping every device up to date, handling conflicts gracefully, minimizing bandwidth, and doing it all without losing a single byte of user data.

File storage system architecture — clients, metadata, block storage

Requirements

Functional Requirements

  • Upload and download files (up to 10GB per file)
  • Automatic sync across devices (desktop, mobile, web)
  • File and folder organization with move/rename/delete
  • File versioning (keep last N versions, allow rollback)
  • Share files/folders with other users via link or direct access
  • Offline access with sync when reconnected

Non-Functional Requirements

  • 500 million registered users, 100 million DAU
  • Average 200 files per user, average file size 500KB
  • 99.99% durability (never lose a file)
  • Sync latency under 10 seconds for small file changes
  • Resume interrupted uploads/downloads

Back-of-Envelope Estimation

Total files:             500M users * 200 files = 100 billion files
Total storage:           100B files * 500KB avg = 50 PB (before dedup)
After deduplication:     ~30 PB (40% savings — many shared docs, libraries, default files)
Daily uploads:           100M DAU * 2 changes/day = 200M file operations/day
Upload QPS:              200M / 86400 = ~2,300 QPS sustained
Peak QPS:                ~10,000 QPS

Metadata per file:       ~1KB (path, hashes, permissions, timestamps, version info)
Metadata storage:        100B files * 1KB = 100TB
Chunk metadata:          Each file avg 1 chunk (most are small) = 100B entries

Bandwidth (upload):
  Avg upload size: 500KB (but with delta sync, only changed chunks)
  Effective avg: ~100KB after delta sync
  Daily: 200M * 100KB = 20TB/day upload bandwidth
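The numbers above fall out of straightforward arithmetic:

```python
USERS = 500_000_000
DAU = 100_000_000
FILES_PER_USER = 200
AVG_FILE_KB = 500

total_files = USERS * FILES_PER_USER                  # 100 billion files
total_storage_pb = total_files * AVG_FILE_KB / 1e12   # 1 PB = 1e12 KB -> 50 PB
upload_ops_per_day = DAU * 2                          # 2 changes/user/day = 200M ops
sustained_qps = upload_ops_per_day / 86_400           # ~2,300 QPS
```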

High-Level Architecture

                                 [Web / Desktop / Mobile Clients]
                                            |
                                    [API Gateway / LB]
                                     /      |       \
                            [Upload    [Metadata    [Sync
                             Service]   Service]     Service]
                               |           |            |
                          [Block        [Metadata    [Change
                           Storage]      DB]          Feed]
                           (S3)        (MySQL/       (Kafka)
                                       Postgres)
                                           |
                                     [Share Service]
                                     [Version Service]

Five core services:

  1. Upload/Download Service — handles chunked uploads, assembles downloads
  2. Metadata Service — tracks files, folders, chunks, permissions
  3. Sync Service — notifies clients of changes, resolves ordering
  4. Block Storage — stores actual file chunks (S3 or equivalent)
  5. Share Service — manages access control and shared links

File chunking and delta sync process

File Chunking

Large files are split into chunks before upload. This is the foundation of everything else — dedup, delta sync, resume, and efficient storage all depend on chunking.

Why Chunk?

  • Resumable uploads. If a 2GB upload fails at 1.5GB, you only re-upload the remaining chunks, not the entire file.
  • Delta sync. When a user edits a 100MB file, only the changed chunks are uploaded — not all 100MB.
  • Deduplication. Identical chunks across different files (or different users) are stored once.
  • Parallel transfer. Upload or download multiple chunks simultaneously.

Fixed-Size vs Content-Defined Chunking

Fixed-size chunking splits every file at exactly N-byte boundaries. Simple, but a single byte insertion shifts every chunk boundary, causing all subsequent chunks to look “new.”

def fixed_size_chunk(file_data: bytes, chunk_size: int = 4 * 1024 * 1024) -> list[bytes]:
    """Split file into fixed 4MB chunks."""
    chunks = []
    for i in range(0, len(file_data), chunk_size):
        chunks.append(file_data[i:i + chunk_size])
    return chunks

# Problem: insert 1 byte at position 0
# EVERY chunk changes, even though most of the file is identical
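The boundary-shift problem is easy to demonstrate with a tiny chunk size (4 bytes here purely for illustration):

```python
import hashlib

def fixed_chunks(data: bytes, size: int) -> list[bytes]:
    """Fixed-size chunking, as above, with a configurable chunk size."""
    return [data[i:i + size] for i in range(0, len(data), size)]

original = b"abcdefghijkl"
edited = b"X" + original            # one byte inserted at position 0

before = {hashlib.sha256(c).hexdigest() for c in fixed_chunks(original, 4)}
after = {hashlib.sha256(c).hexdigest() for c in fixed_chunks(edited, 4)}

shared = before & after             # chunks that could have been deduped
print(len(shared))                  # 0 -- every boundary shifted
```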

Content-defined chunking (CDC) uses a rolling hash (like Rabin fingerprint) to find chunk boundaries based on content, not position. A byte insertion only affects the chunk it falls in.

import hashlib

def content_defined_chunk(file_data: bytes, min_size=2*1024*1024, max_size=8*1024*1024,
                          target_size=4*1024*1024) -> list[bytes]:
    """Split file using content-defined boundaries via rolling hash."""
    chunks = []
    start = 0
    window_size = 48
    # Magic mask -- when hash & mask == 0, it's a boundary
    # Probability of boundary = 1/target_size
    mask = target_size - 1  # works when target_size is power of 2

    pos = start + min_size  # skip ahead to minimum chunk size

    while pos < len(file_data):
        # Rolling hash over a window (simplified; a real implementation uses a
        # Rabin fingerprint updated incrementally, not rehashed per byte)
        window = file_data[pos:pos + window_size]
        # Deterministic hash -- Python's built-in hash() is randomized per
        # process, which would break cross-run and cross-device chunking
        h = int.from_bytes(hashlib.blake2b(window, digest_size=8).digest(), "big")

        if (h & mask) == 0 or (pos - start) >= max_size:
            chunks.append(file_data[start:pos])
            start = pos
            pos = start + min_size
        else:
            pos += 1

    # Last chunk
    if start < len(file_data):
        chunks.append(file_data[start:])

    return chunks

Dropbox's sync protocol splits files into 4MB blocks and layers rsync-style delta encoding on top; backup tools like restic and Borg use content-defined chunking. For an interview, either approach works as long as you can explain the tradeoff.

Chunk Hashing

Every chunk gets a SHA-256 hash. This hash serves as the chunk’s identity — its content address.

def hash_chunk(chunk_data: bytes) -> str:
    return hashlib.sha256(chunk_data).hexdigest()

def chunk_file(file_path: str) -> list[dict]:
    """Chunk a file and return chunk metadata."""
    with open(file_path, "rb") as f:
        file_data = f.read()

    raw_chunks = content_defined_chunk(file_data)
    chunk_metadata = []

    for i, chunk_data in enumerate(raw_chunks):
        chunk_hash = hash_chunk(chunk_data)
        chunk_metadata.append({
            "index": i,
            "hash": chunk_hash,
            "size": len(chunk_data),
            "data": chunk_data
        })

    return chunk_metadata

Deduplication

Content-addressable storage means the chunk hash is the storage key. If two files (or two users) have an identical chunk, it is stored exactly once.

from botocore.exceptions import ClientError

class BlockStorageService:
    def __init__(self, s3_client, bloom_filter):
        self.s3 = s3_client
        self.bloom_filter = bloom_filter  # shared, pre-populated on startup
        self.bucket = "file-chunks"

    def upload_chunk(self, chunk_hash: str, chunk_data: bytes) -> bool:
        """Upload chunk only if it doesn't already exist. Returns True if uploaded."""
        if self._chunk_exists(chunk_hash):
            return False  # Already stored -- dedup win

        self.s3.put_object(
            Bucket=self.bucket,
            Key=f"chunks/{chunk_hash[:2]}/{chunk_hash}",  # prefix for S3 partitioning
            Body=chunk_data,
            ContentType="application/octet-stream"
        )
        self.bloom_filter.add(chunk_hash)
        return True

    def _chunk_exists(self, chunk_hash: str) -> bool:
        """Check if chunk already exists using a Bloom filter + S3 HEAD."""
        # Fast check: Bloom filter (in-memory, 0.1% false positive rate)
        if not self.bloom_filter.might_contain(chunk_hash):
            return False
        # Slow check: S3 HEAD request (only on Bloom filter hits).
        # head_object reports a missing key as a 404 ClientError,
        # not NoSuchKey (which only get_object raises).
        try:
            self.s3.head_object(Bucket=self.bucket, Key=f"chunks/{chunk_hash[:2]}/{chunk_hash}")
            return True
        except ClientError as e:
            if e.response["Error"]["Code"] == "404":
                return False
            raise

    def download_chunk(self, chunk_hash: str) -> bytes:
        response = self.s3.get_object(
            Bucket=self.bucket, Key=f"chunks/{chunk_hash[:2]}/{chunk_hash}"
        )
        return response["Body"].read()

Deduplication savings depend on your user base. Systems with many shared documents (corporate environments) see 40-60% savings. Consumer storage with mostly unique photos sees 10-20%.

A Bloom filter in front of S3 avoids the latency of checking S3 for every chunk. The Bloom filter says “definitely not there” or “maybe there.” On a “maybe,” you do the real S3 HEAD check. With 100 billion chunks and 0.1% false positive rate, the Bloom filter needs roughly 180GB of memory (about 14.4 bits per chunk), distributed across multiple nodes.
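The sizing follows from the standard formula for an optimal Bloom filter over n items with false-positive rate p:

```python
import math

def bloom_filter_size_gb(n: int, p: float) -> float:
    """Optimal bit count: m = -n * ln(p) / (ln 2)^2."""
    m_bits = -n * math.log(p) / (math.log(2) ** 2)
    return m_bits / 8 / 1e9

def optimal_hash_count(p: float) -> int:
    """k = -log2(p) hash functions minimize the false-positive rate."""
    return round(-math.log2(p))

size = bloom_filter_size_gb(100_000_000_000, 0.001)  # ~180 GB
k = optimal_hash_count(0.001)                        # 10 hash functions
```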

Upload Flow

Here is the complete upload flow when a user saves a file.

import hashlib
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime

class UploadService:
    def upload_file(self, user_id: str, file_path: str, file_data: bytes) -> dict:
        # Step 1: Chunk the file
        chunks = chunk_file_from_bytes(file_data)
        chunk_hashes = [c["hash"] for c in chunks]

        # Step 2: Check which chunks already exist (batch dedup check)
        existing = self.block_storage.batch_exists(chunk_hashes)
        new_chunks = [c for c in chunks if c["hash"] not in existing]

        # Step 3: Upload only new chunks (parallel)
        with ThreadPoolExecutor(max_workers=8) as pool:
            futures = [
                pool.submit(self.block_storage.upload_chunk, c["hash"], c["data"])
                for c in new_chunks
            ]
            for f in futures:
                f.result()  # raise on failure

        # Step 4: Update metadata
        file_metadata = {
            "user_id": user_id,
            "path": file_path,
            "size": len(file_data),
            "checksum": hashlib.sha256(file_data).hexdigest(),
            "chunks": chunk_hashes,
            "version": self.metadata.get_next_version(user_id, file_path),
            "updated_at": datetime.utcnow()
        }
        self.metadata.upsert_file(file_metadata)

        # Step 5: Publish change event for sync
        self.change_feed.publish({
            "user_id": user_id,
            "event": "file.updated",
            "path": file_path,
            "version": file_metadata["version"],
            "timestamp": file_metadata["updated_at"]
        })

        return {
            "file_path": file_path,
            "version": file_metadata["version"],
            "chunks_uploaded": len(new_chunks),
            "chunks_deduped": len(chunks) - len(new_chunks),
            "bytes_transferred": sum(c["size"] for c in new_chunks)
        }

The key optimization: step 2 eliminates redundant uploads. If you edit a 100MB file and only one 4MB chunk changes, only 4MB goes over the wire. The client can report: “100MB file synced, 4MB transferred.”

Download Flow

Reconstructing a file is the reverse: fetch chunk hashes from metadata, download each chunk, concatenate.

import hashlib
from concurrent.futures import ThreadPoolExecutor, as_completed

class DownloadService:
    def download_file(self, user_id: str, file_path: str, version: int = None) -> bytes:
        # Get file metadata (latest version if not specified)
        metadata = self.metadata.get_file(user_id, file_path, version)
        if not metadata:
            raise FileNotFoundError(f"{file_path} not found")

        # Download chunks in parallel
        chunk_hashes = metadata["chunks"]
        chunks = [None] * len(chunk_hashes)

        with ThreadPoolExecutor(max_workers=8) as pool:
            future_to_index = {
                pool.submit(self.block_storage.download_chunk, h): i
                for i, h in enumerate(chunk_hashes)
            }
            for future in as_completed(future_to_index):
                idx = future_to_index[future]
                chunks[idx] = future.result()

        # Reassemble file
        file_data = b"".join(chunks)

        # Verify integrity
        checksum = hashlib.sha256(file_data).hexdigest()
        if checksum != metadata["checksum"]:
            raise IntegrityError(f"Checksum mismatch for {file_path}")

        return file_data

Metadata Service

The metadata service is the source of truth for the file system hierarchy. It tracks which files exist, where they are, what chunks they contain, and who can access them.

-- File metadata
CREATE TABLE files (
    id              BIGINT PRIMARY KEY AUTO_INCREMENT,
    user_id         BIGINT NOT NULL,
    file_path       VARCHAR(4096) NOT NULL,
    file_name       VARCHAR(255) NOT NULL,
    size_bytes      BIGINT,
    checksum        CHAR(64),           -- SHA-256 of full file
    current_version INT DEFAULT 1,
    is_deleted      BOOLEAN DEFAULT false,
    created_at      TIMESTAMP DEFAULT NOW(),
    updated_at      TIMESTAMP DEFAULT NOW(),
    UNIQUE KEY (user_id, file_path)     -- note: MySQL caps index key length; in practice, index a hash of the path
);

-- File-to-chunk mapping (per version)
CREATE TABLE file_chunks (
    file_id         BIGINT NOT NULL,
    version         INT NOT NULL,
    chunk_index     INT NOT NULL,
    chunk_hash      CHAR(64) NOT NULL,  -- SHA-256, also the storage key
    chunk_size      INT,
    PRIMARY KEY (file_id, version, chunk_index),
    FOREIGN KEY (file_id) REFERENCES files(id)
);

-- Version history
CREATE TABLE file_versions (
    file_id         BIGINT NOT NULL,
    version         INT NOT NULL,
    size_bytes      BIGINT,
    checksum        CHAR(64),
    created_by      BIGINT,             -- user who made the change
    device_id       VARCHAR(100),
    created_at      TIMESTAMP DEFAULT NOW(),
    PRIMARY KEY (file_id, version)
);

-- Reference counting for chunks (for garbage collection)
CREATE TABLE chunk_references (
    chunk_hash      CHAR(64) PRIMARY KEY,
    ref_count       INT DEFAULT 1,
    size_bytes      INT,
    created_at      TIMESTAMP DEFAULT NOW()
);

-- (user_id, file_path) lookups are already covered by the UNIQUE KEY on files
CREATE INDEX idx_chunks_hash ON file_chunks (chunk_hash);

Reference counting is essential for deduplication. When a file is deleted, you decrement the ref count on its chunks. Only when ref_count reaches zero do you actually delete the chunk from S3. This is a background garbage collection process — never delete chunks synchronously.
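A sketch of that background collector, against the chunk_references table above (db and block_storage are placeholder clients; a real system also adds a grace period so a concurrent re-upload cannot race the delete):

```python
class ChunkGarbageCollector:
    """Background GC: a chunk leaves S3 only when its ref_count hits zero."""

    def __init__(self, db, block_storage):
        self.db = db
        self.block_storage = block_storage

    def decrement_ref(self, chunk_hash: str):
        """Called when a file version referencing this chunk is deleted."""
        self.db.execute(
            "UPDATE chunk_references SET ref_count = ref_count - 1 WHERE chunk_hash = %s",
            (chunk_hash,)
        )

    def run_gc_pass(self, batch_size: int = 1000):
        """Periodic job -- deletion is never done inline with the user request."""
        dead = self.db.query(
            "SELECT chunk_hash FROM chunk_references WHERE ref_count <= 0 LIMIT %s",
            (batch_size,)
        )
        for row in dead:
            self.block_storage.delete_chunk(row["chunk_hash"])
            self.db.execute(
                "DELETE FROM chunk_references WHERE chunk_hash = %s",
                (row["chunk_hash"],)
            )
```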

Sync Protocol

Sync is the hardest part. Every device needs to stay up to date with changes from every other device.

Change Detection

The client maintains a local database of file states. On startup (and periodically), it compares local state with server state.

import os

class SyncClient:
    def detect_local_changes(self) -> list[dict]:
        """Scan local filesystem for changes since last sync."""
        changes = []
        for local_file in self.scan_watched_directories():
            server_state = self.local_db.get_server_state(local_file.path)

            if server_state is None:
                changes.append({"type": "create", "path": local_file.path})
            elif local_file.modified_at > server_state["synced_at"]:
                # Compare chunk hashes, not just timestamps
                local_hashes = self.chunk_and_hash(local_file.path)
                if local_hashes != server_state["chunk_hashes"]:
                    changes.append({"type": "modify", "path": local_file.path})

        # Detect deletions
        for server_path in self.local_db.get_all_server_paths():
            if not os.path.exists(server_path):
                changes.append({"type": "delete", "path": server_path})

        return changes

Change Notification

Instead of polling, clients maintain a long-lived connection to the sync service.

import time

class SyncService:
    def long_poll_changes(self, user_id: str, device_id: str, last_cursor: str):
        """Block until there are new changes for this user, or timeout after 60s."""
        timeout = 60
        start = time.time()

        while time.time() - start < timeout:
            changes = self.change_feed.get_changes_since(user_id, last_cursor)
            # Filter out changes from this device (don't echo back)
            changes = [c for c in changes if c["device_id"] != device_id]

            if changes:
                return {
                    "changes": changes,
                    "cursor": changes[-1]["cursor"],
                    "has_more": self.change_feed.has_more(user_id, changes[-1]["cursor"])
                }

            time.sleep(0.5)  # Brief sleep to avoid CPU spin

        return {"changes": [], "cursor": last_cursor, "has_more": False}

The change feed is a per-user ordered log of all file operations. Kafka or a database-backed change log both work. Each change gets a monotonically increasing cursor so clients can say “give me everything after cursor 12345.”

CREATE TABLE change_log (
    id              BIGINT PRIMARY KEY AUTO_INCREMENT,
    user_id         BIGINT NOT NULL,
    device_id       VARCHAR(100),
    event_type      VARCHAR(20),    -- 'create', 'modify', 'delete', 'move', 'share'
    file_path       VARCHAR(4096),
    new_path        VARCHAR(4096),  -- for moves/renames
    version         INT,
    created_at      TIMESTAMP DEFAULT NOW(),
    INDEX (user_id, id)
);
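On the client, consuming the feed is a loop over long polls, checkpointing the cursor only after a batch has been applied (a sketch; sync_api, local_db, and apply_change are placeholders for the pieces above):

```python
def sync_once(sync_api, local_db, device_id: str, apply_change) -> bool:
    """Pull one batch of changes and apply it. Returns True if more are pending."""
    cursor = local_db.get_cursor() or "0"
    resp = sync_api.long_poll_changes(device_id=device_id, last_cursor=cursor)

    for change in resp["changes"]:
        apply_change(change)  # create / modify / delete the local file

    if resp["changes"]:
        # Checkpoint only after applying, so a crash mid-batch replays it
        local_db.save_cursor(resp["cursor"])

    return resp["has_more"]
```

Because the cursor is monotonic and persisted, a device that was offline for a week simply resumes the loop from its last checkpoint.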

File sync conflict detection and resolution

Conflict Resolution

Conflicts happen when two devices edit the same file before either sync completes. This is unavoidable in a distributed system with offline support.

Detecting Conflicts

class ConflictDetector:
    def check_conflict(self, upload_request: dict) -> bool:
        """Check if the upload is based on a stale version."""
        server_version = self.metadata.get_current_version(
            upload_request["user_id"],
            upload_request["file_path"]
        )
        client_base_version = upload_request["base_version"]

        # If client was editing version 5, but server is now version 7,
        # someone else uploaded versions 6 and 7 in between
        return client_base_version < server_version

Resolution Strategies

Last-writer-wins (LWW). The most recent upload overwrites the previous version. Simple but lossy — the earlier edit is gone (though still accessible via version history).

def resolve_lww(self, upload_request: dict):
    """Last writer wins. Just accept the upload."""
    # The "loser" version is preserved in version history
    return self.upload_service.upload_file(
        upload_request["user_id"],
        upload_request["file_path"],
        upload_request["file_data"]
    )

Conflict copy. Create a separate copy of the conflicting file. This is what Dropbox does — you see “budget.xlsx (John’s conflicting copy)” in your folder. No data is lost, but the user has to manually merge.

def resolve_conflict_copy(self, upload_request: dict, conflicting_device: str):
    """Create a conflict copy -- no data loss."""
    original_path = upload_request["file_path"]
    base, ext = os.path.splitext(original_path)
    timestamp = datetime.utcnow().strftime("%Y-%m-%d")
    conflict_path = f"{base} ({conflicting_device}'s conflict {timestamp}){ext}"

    # Save the conflicting version as a new file
    self.upload_service.upload_file(
        upload_request["user_id"],
        conflict_path,
        upload_request["file_data"]
    )

    # Notify user of the conflict
    self.notification_service.notify(
        upload_request["user_id"],
        f"Conflict detected on {original_path}. A copy has been created."
    )

Operational Transform / CRDT. For collaborative editing (Google Docs style), track individual operations and merge them. This is extremely complex and only worth it for real-time co-editing, not general file sync.

For a file storage system, conflict copy is the safest default. Last-writer-wins is acceptable for files where conflicts are rare (most files are owned by a single person).

File Versioning

Every upload creates a new version. Users can browse version history and restore any previous version.

class VersionService:
    MAX_VERSIONS = 100  # Keep last 100 versions
    RETENTION_DAYS = 365

    def create_version(self, file_id: int, metadata: dict):
        self.db.execute(
            "INSERT INTO file_versions (file_id, version, size_bytes, checksum, created_by, device_id) "
            "VALUES (%s, %s, %s, %s, %s, %s)",
            (file_id, metadata["version"], metadata["size"], metadata["checksum"],
             metadata["user_id"], metadata["device_id"])
        )

        # Enforce version limit
        self._prune_old_versions(file_id)

    def _prune_old_versions(self, file_id: int):
        """Delete versions beyond the retention limit."""
        versions = self.db.query(
            "SELECT version FROM file_versions WHERE file_id = %s ORDER BY version DESC",
            (file_id,)
        )

        if len(versions) > self.MAX_VERSIONS:
            old_versions = versions[self.MAX_VERSIONS:]
            for v in old_versions:
                # Get chunk hashes for this version
                chunks = self.db.query(
                    "SELECT chunk_hash FROM file_chunks WHERE file_id = %s AND version = %s",
                    (file_id, v["version"])
                )
                # Decrement reference counts
                for chunk in chunks:
                    self._decrement_chunk_ref(chunk["chunk_hash"])

                # Delete version metadata
                self.db.execute(
                    "DELETE FROM file_versions WHERE file_id = %s AND version = %s",
                    (file_id, v["version"])
                )

    def restore_version(self, user_id: str, file_path: str, target_version: int):
        """Restore a file to a previous version (creates a new version)."""
        old_chunks = self.db.query(
            "SELECT chunk_hash, chunk_index FROM file_chunks "
            "WHERE file_id = (SELECT id FROM files WHERE user_id = %s AND file_path = %s) "
            "AND version = %s ORDER BY chunk_index",
            (user_id, file_path, target_version)
        )

        # Create a new version with the old chunk hashes
        # The chunks still exist in block storage (ref counted)
        new_version = self.metadata.get_next_version(user_id, file_path)
        for chunk in old_chunks:
            self.db.execute(
                "INSERT INTO file_chunks (file_id, version, chunk_index, chunk_hash) VALUES (%s, %s, %s, %s)",
                (self.metadata.get_file_id(user_id, file_path), new_version,
                 chunk["chunk_index"], chunk["chunk_hash"])
            )
            self._increment_chunk_ref(chunk["chunk_hash"])

        return new_version

Restoring a version does not re-upload any data. Since chunks are content-addressed and reference counted, restoring just creates new metadata pointing to existing chunks.

Sharing and Access Control

CREATE TABLE shares (
    id              BIGINT PRIMARY KEY AUTO_INCREMENT,
    file_id         BIGINT NOT NULL,
    owner_id        BIGINT NOT NULL,
    shared_with     BIGINT,            -- NULL for link shares
    permission      VARCHAR(10),       -- 'view', 'edit', 'admin'
    share_link      VARCHAR(100),      -- unique token for link sharing
    link_password   VARCHAR(255),      -- optional password hash
    expires_at      TIMESTAMP,
    created_at      TIMESTAMP DEFAULT NOW(),
    FOREIGN KEY (file_id) REFERENCES files(id)
);

CREATE INDEX idx_shares_user ON shares (shared_with);
CREATE INDEX idx_shares_link ON shares (share_link);

import secrets
from datetime import datetime, timedelta
from passlib.hash import bcrypt  # provides bcrypt.hash()

class ShareService:
    def share_with_user(self, owner_id: str, file_path: str, target_user_id: str, permission: str):
        file = self.metadata.get_file(owner_id, file_path)
        self.db.execute(
            "INSERT INTO shares (file_id, owner_id, shared_with, permission) VALUES (%s, %s, %s, %s)",
            (file["id"], owner_id, target_user_id, permission)
        )
        # The shared file appears in target user's "Shared with me" view
        # Sync service notifies the target user's devices

    def create_share_link(self, owner_id: str, file_path: str, permission: str = "view",
                          password: str = None, expires_in: int = None) -> str:
        file = self.metadata.get_file(owner_id, file_path)
        token = secrets.token_urlsafe(32)
        expires_at = datetime.utcnow() + timedelta(seconds=expires_in) if expires_in else None
        password_hash = bcrypt.hash(password) if password else None

        self.db.execute(
            "INSERT INTO shares (file_id, owner_id, permission, share_link, link_password, expires_at) "
            "VALUES (%s, %s, %s, %s, %s, %s)",
            (file["id"], owner_id, permission, token, password_hash, expires_at)
        )
        return f"https://drive.yourapp.com/s/{token}"

    def check_access(self, user_id: str, file_id: int, required_permission: str) -> bool:
        """Check if user has the required permission on a file."""
        # Owner always has access
        file = self.metadata.get_file_by_id(file_id)
        if file["user_id"] == user_id:
            return True

        # Check shares
        share = self.db.query_one(
            "SELECT permission FROM shares WHERE file_id = %s AND shared_with = %s AND "
            "(expires_at IS NULL OR expires_at > NOW())",
            (file_id, user_id)
        )
        if not share:
            return False

        permission_hierarchy = {"view": 1, "edit": 2, "admin": 3}
        return permission_hierarchy.get(share["permission"], 0) >= permission_hierarchy.get(required_permission, 0)

For shared folders, permission is inherited. If you share a folder with “edit” access, every file inside is editable. The access check walks up the directory tree.
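That walk can be sketched with os.path.dirname (has_direct_share is a hypothetical helper that checks one exact path against the shares table):

```python
import os

def check_access_inherited(share_service, user_id: str, path: str, required: str) -> bool:
    """Check the path itself, then every ancestor folder, nearest first."""
    while True:
        if share_service.has_direct_share(user_id, path, required):  # hypothetical helper
            return True
        parent = os.path.dirname(path)
        if parent == path:  # reached the root -- no share anywhere up the tree
            return False
        path = parent
```

Nearest-first ordering means a share on a subfolder can grant access even if nothing is shared at the top level; caching these lookups matters, since every file operation triggers one.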

Storage Tiering

Not all files are accessed equally. A file last opened three years ago should not sit on expensive SSD-backed storage.

Tier 1 (Hot):   S3 Standard       -- files accessed in last 30 days
Tier 2 (Warm):  S3 Infrequent     -- files accessed in last 90 days
Tier 3 (Cold):  S3 Glacier        -- files not accessed in 90+ days

Cost comparison (per GB/month):
  S3 Standard:          $0.023
  S3 Infrequent:        $0.0125
  S3 Glacier Instant:   $0.004
  S3 Glacier Deep:      $0.00099

class StorageTieringService:
    def run_tiering_job(self):
        """Background job: move chunks to appropriate storage tier."""
        # Find chunks not accessed in 90 days
        cold_chunks = self.db.query(
            "SELECT chunk_hash FROM chunk_access_log "
            "GROUP BY chunk_hash HAVING MAX(accessed_at) < NOW() - INTERVAL '90 days'"
        )

        for chunk in cold_chunks:
            self.s3.copy_object(
                CopySource={"Bucket": "file-chunks", "Key": f"chunks/{chunk['chunk_hash'][:2]}/{chunk['chunk_hash']}"},
                Bucket="file-chunks-glacier",
                Key=f"chunks/{chunk['chunk_hash'][:2]}/{chunk['chunk_hash']}",
                StorageClass="GLACIER"
            )
            # Delete from hot tier after confirming copy
            self.s3.delete_object(
                Bucket="file-chunks",
                Key=f"chunks/{chunk['chunk_hash'][:2]}/{chunk['chunk_hash']}"
            )
            # Update metadata to reflect new tier
            self.db.execute(
                "UPDATE chunk_references SET storage_tier = 'glacier' WHERE chunk_hash = %s",
                (chunk["chunk_hash"],)
            )

When a user requests a cold file, restore it from Glacier (milliseconds for Glacier Instant Retrieval, minutes to hours for Flexible Retrieval, up to 12-48 hours for Deep Archive) and temporarily promote it back to hot storage.
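A sketch of that promotion path against the S3 API (bucket names follow the tiering job above; restore_object is asynchronous, so a real implementation polls head_object for the x-amz-restore header before copying):

```python
def promote_chunk_to_hot(s3, chunk_hash: str):
    """Stage an archived chunk, then copy it back into the hot bucket."""
    key = f"chunks/{chunk_hash[:2]}/{chunk_hash}"

    # Step 1: ask S3 to stage the archived object (asynchronous)
    s3.restore_object(
        Bucket="file-chunks-glacier",
        Key=key,
        RestoreRequest={"Days": 7, "GlacierJobParameters": {"Tier": "Expedited"}},
    )

    # Step 2 (once the restore completes): copy back to the hot tier
    s3.copy_object(
        CopySource={"Bucket": "file-chunks-glacier", "Key": key},
        Bucket="file-chunks",
        Key=key,
        StorageClass="STANDARD",
    )
```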

Scaling the Metadata Service

The metadata service handles every file operation and is the most likely bottleneck.

Sharding by user_id. Each shard holds all files for a set of users. File operations only touch one shard (all within the same user’s namespace). Shared files need cross-shard lookups, but reads dominate writes.
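Routing can be as simple as a stable hash of user_id (64 shards is illustrative; production systems usually route through a lookup table so users can be rebalanced without rehashing everything):

```python
import hashlib

NUM_SHARDS = 64

def shard_for_user(user_id: int) -> int:
    # Stable across processes -- Python's built-in hash() is randomized per run
    digest = hashlib.md5(str(user_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS
```

Every query on a user's own namespace then hits exactly one shard; only shared-file lookups fan out.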

Caching. Cache file metadata aggressively in Redis. File listings, chunk mappings, and permission checks are all highly cacheable. Invalidate on write.

Database choice. MySQL or PostgreSQL with read replicas. The metadata schema is relational (files, chunks, shares all reference each other). NoSQL would make many queries harder without meaningful benefit.

import json
import os

class MetadataCache:
    def get_file(self, user_id: str, file_path: str) -> dict:
        cache_key = f"file:{user_id}:{file_path}"
        cached = self.redis.get(cache_key)
        if cached:
            return json.loads(cached)

        result = self.db.query_one(
            "SELECT * FROM files WHERE user_id = %s AND file_path = %s",
            (user_id, file_path)
        )
        if result:
            self.redis.setex(cache_key, 3600, json.dumps(result, default=str))  # default=str serializes datetime columns
        return result

    def invalidate(self, user_id: str, file_path: str):
        self.redis.delete(f"file:{user_id}:{file_path}")
        self.redis.delete(f"filelist:{user_id}:{os.path.dirname(file_path)}")

Key Takeaways

  • Split files into 4MB chunks using content-defined chunking. This enables deduplication, delta sync, resumable transfers, and efficient versioning.
  • Deduplicate at the chunk level using SHA-256 hashes as content addresses. Identical chunks are stored once regardless of how many files or users reference them. This saves 30-50% storage in practice.
  • Delta sync means uploading only changed chunks. Editing one paragraph of a 100MB file transfers 4MB, not 100MB. This is the biggest bandwidth win.
  • The metadata service is the brain. It maps files to chunks, tracks versions, and enforces permissions. Shard by user_id, cache aggressively, and use a relational database.
  • Sync via long polling or WebSocket. Maintain a per-user change log with monotonic cursors so clients can resume from where they left off.
  • Handle conflicts with conflict copies (Dropbox approach) for safety, or last-writer-wins for simplicity. Always preserve both versions in the version history.
  • Version history is cheap when chunks are reference counted. Restoring an old version just creates new metadata pointing to existing chunks — no data copying.
  • Storage tiering (hot/warm/cold) is essential at petabyte scale. Move untouched files to Glacier automatically, but design for fast retrieval when users need them.
  • The hardest parts of this system are sync ordering (ensuring all devices converge to the same state) and garbage collection (safely deleting chunks only when no versions reference them).