software-design | March 20, 2026 | 12 min read

Efficient Data Modelling: A Practical Guide for Production Systems

TL;DR

Start with your access patterns, not your entities. Normalize your write paths for consistency, denormalize your read paths for speed. Use composite indexes that match your WHERE clause column order (left-prefix rule). Evolve schemas safely with expand-contract migrations. The right model depends on your query frequency and latency targets — a 10K QPS endpoint needs a different schema than a weekly report.

Most engineers learn data modelling backwards. They draw an ER diagram, normalize to 3NF, add some indexes, and call it done. Then production traffic arrives and everything falls apart — queries take seconds, JOINs choke the database, and someone starts adding cache layers to compensate for a bad schema.

Efficient data modelling starts with one question: what queries will your system run 1,000 times per second? Everything else — table structure, normalization level, index selection — flows from that answer.

The Three Layers of Data Modelling

graph TD
    A[Conceptual Model] -->|"What entities exist?"| B[Logical Model]
    B -->|"What are the tables + columns?"| C[Physical Model]
    C -->|"How is data stored on disk?"| D[(Database)]

    subgraph "Conceptual"
        A1["Entities: User, Order, Product"]
        A2["Relationships: User places Order"]
    end

    subgraph "Logical"
        B1["Tables, columns, types, FKs"]
        B2["Normalization decisions"]
    end

    subgraph "Physical"
        C1["Indexes, partitions, tablespaces"]
        C2["Storage engine, compression"]
    end

    style A fill:#2563eb,stroke:#1d4ed8,color:#fff
    style B fill:#059669,stroke:#047857,color:#fff
    style C fill:#f59e0b,stroke:#d97706,color:#fff
    style D fill:#7c3aed,stroke:#6d28d9,color:#fff

Most teams spend too long on the conceptual layer (endless ER diagram debates) and too little on the physical layer (where performance lives). Let’s fix that.

Start With Access Patterns, Not Entities

The biggest mistake in data modelling is designing tables before understanding queries. An entity-first approach gives you clean diagrams but slow systems. An access-pattern-first approach gives you fast systems that might look “messy” on paper — but that’s the right tradeoff.

Access Pattern Decision Matrix

How to Map Access Patterns

Before writing a single CREATE TABLE, document every query your system will run:

| # | Query                              | Frequency | Latency | Type  |
|---|-------------------------------------|-----------|---------|-------|
| 1 | Get user's recent orders            | 10K/s     | < 5ms   | Read  |
| 2 | Search products by category + price | 5K/s      | < 50ms  | Read  |
| 3 | Place a new order                   | 500/s     | < 100ms | Write |
| 4 | Get product with avg rating         | 8K/s      | < 10ms  | Read  |
| 5 | Monthly sales report                | 1/hour    | < 30s   | Read  |

Now look at the numbers. Queries #1 and #4 dominate — they need to be fast. Query #5 runs once an hour, so it can afford a complex JOIN or even scan a separate analytics table. This ranking drives every schema decision.
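The ranking above can be made mechanical. Here's a small sketch in Python: it scores each access pattern by calls per second divided by latency budget, so the queries that most constrain the schema bubble to the top. The scoring heuristic is my own, not a standard formula; the numbers come from the table above.

```python
# Hypothetical heuristic: the higher qps/latency_budget, the more this query
# should drive schema decisions.
def schema_pressure(qps: float, latency_ms: float) -> float:
    return qps / latency_ms

patterns = [
    ("recent orders",         10_000,  5),
    ("category+price search",  5_000,  50),
    ("place order",              500,  100),
    ("product with rating",    8_000,  10),
    ("monthly report",        1/3600,  30_000),
]

ranked = sorted(patterns, key=lambda p: schema_pressure(p[1], p[2]), reverse=True)
for name, qps, ms in ranked:
    print(f"{name}: {schema_pressure(qps, ms):.1f}")
```

Run against the table above, "recent orders" and "product with rating" dominate, matching the analysis: those two queries get first claim on the schema.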

Example: E-Commerce Order System

Entity-first approach (4 tables, 3 JOINs for the most common query):

-- Normalized: clean but slow for hot paths
SELECT o.id, o.created_at, o.total,
       u.name AS user_name,
       p.name AS product_name, oi.quantity
FROM orders o
JOIN users u ON o.user_id = u.id
JOIN order_items oi ON oi.order_id = o.id
JOIN products p ON oi.product_id = p.id
WHERE o.user_id = 42
ORDER BY o.created_at DESC
LIMIT 20;
-- 120ms at 1M orders, gets worse linearly

Access-pattern-first approach (optimized for the 10K/s query):

-- Create a read-optimized view
CREATE MATERIALIZED VIEW user_order_summary AS
SELECT
    o.id AS order_id,
    o.user_id,
    o.created_at,
    o.total,
    u.name AS user_name,
    jsonb_agg(jsonb_build_object(
        'product', p.name,
        'quantity', oi.quantity,
        'price', p.price
    )) AS items
FROM orders o
JOIN users u ON o.user_id = u.id
JOIN order_items oi ON oi.order_id = o.id
JOIN products p ON oi.product_id = p.id
GROUP BY o.id, o.user_id, o.created_at, o.total, u.name;

-- Index for the hot path
CREATE INDEX idx_user_order_summary_user
ON user_order_summary(user_id, created_at DESC);

-- Now the 10K/s query is a simple index scan:
SELECT * FROM user_order_summary
WHERE user_id = 42
ORDER BY created_at DESC
LIMIT 20;
-- 2ms, constant time regardless of table size

The normalized tables still exist for writes. The materialized view serves reads — just remember it must be refreshed (e.g. REFRESH MATERIALIZED VIEW CONCURRENTLY on a schedule or after writes), so reads can be briefly stale. This is the pragmatic middle ground.
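The write-normalized / read-denormalized split can be sketched end-to-end with SQLite standing in for Postgres. Here the "materialized view" is a plain summary table the write path refreshes — an assumption for the demo, since SQLite has no materialized views; table and column names mirror the SQL above.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INT, total REAL, created_at TEXT);
CREATE TABLE user_order_summary (
    order_id INT, user_id INT, created_at TEXT, total REAL, items TEXT
);
CREATE INDEX idx_summary ON user_order_summary(user_id, created_at DESC);
""")

# Write path: insert into the normalized table, then refresh the summary row
# (in Postgres this would be a materialized-view refresh or a trigger).
db.execute("INSERT INTO orders VALUES (1001, 42, 89.99, '2025-03-15')")
db.execute(
    "INSERT INTO user_order_summary VALUES (?, ?, ?, ?, ?)",
    (1001, 42, "2025-03-15", 89.99,
     json.dumps([{"product": "Widget", "quantity": 2}])),
)

# Read path: the hot query is a single index scan on the summary table.
rows = db.execute(
    "SELECT order_id, total FROM user_order_summary "
    "WHERE user_id = ? ORDER BY created_at DESC LIMIT 20", (42,)
).fetchall()
print(rows)  # [(1001, 89.99)]
```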

Normalization: How Far Is Far Enough?

Normalization removes redundancy so that each fact is stored in exactly one place. Each normal form builds on the previous one:

Normalization-Denormalization Spectrum

The Normal Forms in 60 Seconds

1NF: Each column holds a single value (no arrays, no CSVs in a field).

-- Violates 1NF
CREATE TABLE orders (
    id INT,
    products TEXT  -- "iPhone, MacBook, AirPods"  BAD
);

-- 1NF compliant
CREATE TABLE order_items (
    order_id INT,
    product_id INT,
    quantity INT
);

2NF: Every non-key column depends on the entire primary key (no partial dependencies).

-- Violates 2NF (product_name depends only on product_id, not order_id)
CREATE TABLE order_items (
    order_id INT,
    product_id INT,
    quantity INT,
    product_name TEXT,  -- partial dependency
    PRIMARY KEY (order_id, product_id)
);

-- 2NF: move product_name to products table

3NF: No column depends on another non-key column (no transitive dependencies).

-- Violates 3NF (city depends on zip_code, not directly on the primary key)
CREATE TABLE users (
    id INT PRIMARY KEY,
    name TEXT,
    zip_code TEXT,
    city TEXT       -- transitively depends on zip_code
);

-- 3NF: separate addresses table
CREATE TABLE zip_codes (
    zip_code TEXT PRIMARY KEY,
    city TEXT,
    state TEXT
);

When to Stop Normalizing

In practice, 3NF is the sweet spot for transactional systems. Going beyond (BCNF, 4NF, 5NF) rarely provides practical benefits and adds query complexity.

graph LR
    A[1NF] -->|Remove partial deps| B[2NF]
    B -->|Remove transitive deps| C[3NF]
    C -->|"Usually stop here"| D{Good enough?}
    D -->|Yes| E[Production Schema]
    D -->|"Hot read path slow"| F[Strategic Denormalization]
    F --> E

    style A fill:#2563eb,stroke:#1d4ed8,color:#fff
    style B fill:#2563eb,stroke:#1d4ed8,color:#fff
    style C fill:#059669,stroke:#047857,color:#fff
    style D fill:#f59e0b,stroke:#d97706,color:#fff
    style E fill:#059669,stroke:#047857,color:#fff
    style F fill:#c84b2f,stroke:#991b1b,color:#fff

Strategic Denormalization

When you denormalize, do it surgically. Don’t flatten everything into one table. Instead, duplicate specific fields that eliminate JOINs on your hottest paths:

-- Orders table with strategically cached fields
CREATE TABLE orders (
    id BIGSERIAL PRIMARY KEY,
    user_id BIGINT NOT NULL REFERENCES users(id),
    user_name TEXT NOT NULL,           -- denormalized: avoids JOIN to users
    user_email TEXT NOT NULL,          -- denormalized: needed for email notifications
    total DECIMAL(10,2) NOT NULL,
    item_count INT NOT NULL DEFAULT 0, -- denormalized: avoids COUNT on order_items
    status TEXT NOT NULL DEFAULT 'pending',
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

Rules for safe denormalization:

  1. The source of truth stays in the normalized table (users.name)
  2. Updates to users.name trigger an async update to orders.user_name
  3. Accept that the cached field can be briefly stale
  4. Document every denormalized field and its sync mechanism
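Rules 1 and 2 can be sketched concretely. In this SQLite stand-in, the rename updates the source of truth, then a sync job (inlined here; in production a trigger, outbox, or CDC consumer — the function names are mine) refreshes the cached copies:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INT, user_name TEXT);
INSERT INTO users VALUES (42, 'Jane');
INSERT INTO orders VALUES (1001, 42, 'Jane'), (1002, 42, 'Jane');
""")

def sync_denormalized_name(user_id: int) -> None:
    # The async job: copy the authoritative value into the cached column.
    db.execute(
        "UPDATE orders SET user_name = (SELECT name FROM users WHERE id = ?) "
        "WHERE user_id = ?", (user_id, user_id),
    )

def rename_user(user_id: int, new_name: str) -> None:
    # Source of truth updates synchronously...
    db.execute("UPDATE users SET name = ? WHERE id = ?", (new_name, user_id))
    # ...cached copies follow (possibly with a brief stale window).
    sync_denormalized_name(user_id)

rename_user(42, "Jane Doe")
names = [r[0] for r in db.execute("SELECT user_name FROM orders WHERE user_id = 42")]
print(names)  # ['Jane Doe', 'Jane Doe']
```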

Indexing: The Performance Multiplier

A bad index strategy is worse than no indexes at all — you pay the write overhead without getting read benefits. Here’s how to think about indexes systematically.

Indexing Strategies

Composite Index Design

The column order in a composite index determines which queries it can serve. This is the left-prefix rule:

CREATE INDEX idx_orders_lookup
ON orders(user_id, status, created_at DESC);

This single index serves all of these queries:

-- Uses full index
WHERE user_id = 42 AND status = 'shipped' ORDER BY created_at DESC

-- Uses first two columns
WHERE user_id = 42 AND status = 'shipped'

-- Uses first column only
WHERE user_id = 42

-- CANNOT use this index (skips user_id)
WHERE status = 'shipped'  -- needs separate index on (status)

Design rule: Put equality columns first, then range/sort columns last.

-- Pattern: WHERE a = ? AND b = ? AND c > ? ORDER BY d
-- Index:   (a, b,      c,     d)
--           equality   range  sort
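The left-prefix rule is easy to model. This toy function — a deliberate simplification of what a real planner does (real planners also handle ranges and sorts) — returns the leading index columns a query's equality predicates can use:

```python
def usable_prefix(index_cols: list[str], equality_cols: set[str]) -> list[str]:
    """Leading index columns matched by equality predicates; a gap stops the match."""
    used = []
    for col in index_cols:
        if col not in equality_cols:
            break  # left-prefix rule: can't skip a column and keep going
        used.append(col)
    return used

idx = ["user_id", "status", "created_at"]
print(usable_prefix(idx, {"user_id", "status"}))  # ['user_id', 'status']
print(usable_prefix(idx, {"user_id"}))            # ['user_id']
print(usable_prefix(idx, {"status"}))             # [] — skips user_id, index unusable
```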

Covering Indexes (Index-Only Scans)

If the index contains all columns the query needs, PostgreSQL can answer the query without touching the table at all:

-- Query: Get order IDs and totals for a user
SELECT id, total FROM orders WHERE user_id = 42;

-- Covering index: includes id and total
CREATE INDEX idx_orders_user_covering
ON orders(user_id) INCLUDE (id, total);

-- Result: Index-Only Scan (heap access avoided entirely)
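You can watch a covering index kick in even in SQLite, whose EXPLAIN QUERY PLAN reports "COVERING INDEX" when the index alone answers the query. SQLite has no INCLUDE clause, so this sketch simply puts every needed column in the index key:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER, user_id INT, total REAL, status TEXT)")
db.execute("CREATE INDEX idx_user_covering ON orders(user_id, id, total)")

# EXPLAIN QUERY PLAN rows are (id, parent, notused, detail); detail names the index.
plan = db.execute(
    "EXPLAIN QUERY PLAN SELECT id, total FROM orders WHERE user_id = 42"
).fetchall()
print(plan[0][3])
# e.g. "SEARCH orders USING COVERING INDEX idx_user_covering (user_id=?)"
```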

Partial Indexes

Don’t index rows you’ll never query:

-- Only 2% of orders are 'pending', but that's what the dashboard queries
CREATE INDEX idx_orders_pending
ON orders(created_at DESC)
WHERE status = 'pending';

-- Index is tiny (2% of table), yet serves the dashboard perfectly
SELECT * FROM orders
WHERE status = 'pending'
ORDER BY created_at DESC
LIMIT 50;

EXPLAIN ANALYZE: Your Best Friend

Never guess about index usage. Always verify:

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders
WHERE user_id = 42 AND status = 'shipped'
ORDER BY created_at DESC
LIMIT 20;

What to look for in the output:

Index Scan using idx_orders_lookup on orders
  Index Cond: (user_id = 42 AND status = 'shipped')
  Rows Removed by Filter: 0        -- Good: no wasted reads
  Buffers: shared hit=4             -- Good: all from cache
  Planning Time: 0.1 ms
  Execution Time: 0.3 ms            -- Excellent

Red flags:

  • Seq Scan on a large table → missing index
  • Rows Removed by Filter: 50000 → index not selective enough
  • Buffers: shared read=1200 → cold cache or too much data scanned

NoSQL Data Modelling

NoSQL databases (DynamoDB, MongoDB, Cassandra) flip the SQL playbook. Instead of normalizing and JOINing at query time, you pre-JOIN at write time.

DynamoDB Single-Table Design

In DynamoDB, you model all entities in one table using composite partition and sort keys:

// Single table, multiple entity types
const items = [
  // User entity
  { PK: 'USER#42', SK: 'PROFILE',    name: 'Jane', email: '[email protected]' },

  // User's orders (sorted by date)
  { PK: 'USER#42', SK: 'ORDER#2025-03-15#1001', total: 89.99, status: 'shipped' },
  { PK: 'USER#42', SK: 'ORDER#2025-03-10#998',  total: 45.50, status: 'delivered' },

  // Product entity
  { PK: 'PROD#567', SK: 'DETAIL',     name: 'Widget', price: 29.99 },

  // Product reviews (GSI: by rating)
  { PK: 'PROD#567', SK: 'REVIEW#2025-03-14#42', rating: 5, text: 'Great!' },
];

Now the most common queries are single-partition reads:

// Get user profile + recent orders in ONE query
const result = await dynamodb.query({
  TableName: 'app_table',
  KeyConditionExpression: 'PK = :pk AND begins_with(SK, :sk)',
  ExpressionAttributeValues: {
    ':pk': 'USER#42',
    ':sk': 'ORDER#'
  },
  ScanIndexForward: false,  // newest first
  Limit: 20
});

No JOINs, no cross-partition reads, single-digit millisecond latency at any scale.
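The mechanics of that query are simple enough to model in memory: a "partition" is every item sharing a PK, sorted by SK, and the hot read is a prefix scan over SK in reverse order. Item shapes mirror the JavaScript example above:

```python
# In-memory stand-in for the single table (not a real DynamoDB client).
table = [
    {"PK": "USER#42", "SK": "PROFILE", "name": "Jane"},
    {"PK": "USER#42", "SK": "ORDER#2025-03-10#998",  "total": 45.50},
    {"PK": "USER#42", "SK": "ORDER#2025-03-15#1001", "total": 89.99},
    {"PK": "PROD#567", "SK": "DETAIL", "name": "Widget"},
]

def query(pk: str, sk_prefix: str, newest_first: bool = True, limit: int = 20):
    """Partition-key equality + sort-key prefix, like begins_with(SK, :sk)."""
    hits = [it for it in table
            if it["PK"] == pk and it["SK"].startswith(sk_prefix)]
    hits.sort(key=lambda it: it["SK"], reverse=newest_first)  # ScanIndexForward: false
    return hits[:limit]

orders = query("USER#42", "ORDER#")
print([o["SK"] for o in orders])
# ['ORDER#2025-03-15#1001', 'ORDER#2025-03-10#998'] — newest first
```

Note why the prefix match (rather than `SK >= 'ORDER#'`) matters: a plain `>=` comparison would also sweep in the `PROFILE` item, since 'PROFILE' sorts after 'ORDER#'.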

MongoDB: Embed vs Reference

graph TD
    A{How is data accessed?} -->|"Always together"| B[Embed]
    A -->|"Independently queried"| C[Reference]
    A -->|"1:few relationship"| B
    A -->|"1:many (unbounded)"| C

    B --> D["{ order: { items: [...] } }"]
    C --> E["{ order: { item_ids: [...] } }"]

    style A fill:#0e0e0e,stroke:#0e0e0e,color:#fff
    style B fill:#059669,stroke:#047857,color:#fff
    style C fill:#2563eb,stroke:#1d4ed8,color:#fff
    style D fill:#059669,stroke:#047857,color:#fff
    style E fill:#2563eb,stroke:#1d4ed8,color:#fff

Embed when the child data is always accessed with the parent and the array won’t grow unboundedly:

// Good: order with its items (bounded, always fetched together)
{
  _id: ObjectId("..."),
  user_id: 42,
  items: [
    { product_id: 567, name: "Widget", quantity: 2, price: 29.99 },
    { product_id: 890, name: "Gadget", quantity: 1, price: 49.99 }
  ],
  total: 109.97,
  created_at: ISODate("2025-03-15")
}

Reference when the child data is queried independently or can grow without bound:

// Good: product with review references (unbounded, queried separately)
{
  _id: ObjectId("..."),
  name: "Widget",
  price: 29.99,
  review_count: 1247,     // denormalized counter
  avg_rating: 4.3          // denormalized aggregate
  // Reviews live in a separate collection
}

Schema Evolution: Changing Without Breaking

Your data model will change. Features get added, business requirements shift, and you’ll discover your original model doesn’t serve the new access patterns. The key is evolving safely.

Data Model Evolution Pipeline

The Expand-Contract Pattern

This is the safest way to change a column:

-- Step 1: EXPAND — Add new column (nullable, no app changes yet)
ALTER TABLE users ADD COLUMN full_name TEXT;

-- Step 2: BACKFILL — Populate from existing data
UPDATE users SET full_name = first_name || ' ' || last_name
WHERE full_name IS NULL;
-- Run in batches of 1000 to avoid lock contention:
-- UPDATE users SET full_name = ... WHERE id BETWEEN 1 AND 1000;

-- Step 3: DUAL-WRITE — App writes to both columns
-- Deploy code: user.full_name = firstName + ' ' + lastName

-- Step 4: SWITCH READS — App reads from new column
-- Deploy code: display(user.full_name)

-- Step 5: CONTRACT — Drop old columns (after verification period)
ALTER TABLE users DROP COLUMN first_name;
ALTER TABLE users DROP COLUMN last_name;
ALTER TABLE users ALTER COLUMN full_name SET NOT NULL;

Each step is independently deployable and reversible. If step 3 causes problems, roll back the code and the old columns are still there.
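Step 2's batched backfill is worth sketching, since it's the step most likely to cause lock trouble. Here SQLite stands in for Postgres, with one transaction per id-range batch so no single statement holds locks for long:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, first_name TEXT, "
           "last_name TEXT, full_name TEXT)")
db.executemany("INSERT INTO users (id, first_name, last_name) VALUES (?, ?, ?)",
               [(i, f"First{i}", f"Last{i}") for i in range(1, 2501)])

BATCH = 1000
max_id = db.execute("SELECT max(id) FROM users").fetchone()[0]
for lo in range(1, max_id + 1, BATCH):
    with db:  # one transaction per batch; locks released between batches
        db.execute(
            "UPDATE users SET full_name = first_name || ' ' || last_name "
            "WHERE id BETWEEN ? AND ? AND full_name IS NULL",
            (lo, lo + BATCH - 1),
        )

remaining = db.execute(
    "SELECT count(*) FROM users WHERE full_name IS NULL"
).fetchone()[0]
print(remaining)  # 0 — backfill complete
```

The `full_name IS NULL` guard also makes the backfill safely re-runnable: rows already dual-written by the new code path are skipped.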

Adding Indexes Without Downtime

A standard CREATE INDEX locks the table. On a 100M-row table, that can be minutes of downtime. Use CONCURRENTLY:

-- Blocks reads/writes (bad on large tables)
CREATE INDEX idx_orders_status ON orders(status);

-- Builds in background, no locks (use this)
CREATE INDEX CONCURRENTLY idx_orders_status ON orders(status);

The tradeoff: CONCURRENTLY takes longer and can fail if there are concurrent schema changes. Always check it completed:

-- Verify the index is valid
SELECT indexrelid::regclass, indisvalid
FROM pg_index
WHERE indexrelid = 'idx_orders_status'::regclass;

Data Types: Small Choices, Big Impact

Choose the Smallest Type That Works

-- Bad: storing boolean as TEXT
status TEXT  -- 'true' / 'false' = 5 bytes + overhead

-- Good: native boolean
status BOOLEAN  -- 1 byte

-- Bad: storing IPs as TEXT
ip_address TEXT  -- '192.168.1.100' = 15 bytes

-- Good: native inet type
ip_address INET  -- 7 bytes (IPv4) or 19 bytes (IPv6) in Postgres

-- Bad: using BIGINT for everything
user_age BIGINT  -- 8 bytes, max 9.2 quintillion

-- Good: sized appropriately
user_age SMALLINT  -- 2 bytes, max 32,767 (plenty for age)

Timestamps: Always Use TIMESTAMPTZ

-- Bad: naive timestamp (what timezone is this?)
created_at TIMESTAMP  -- '2025-03-15 14:30:00' — UTC? PST? IST?

-- Good: timezone-aware (always stored as UTC in Postgres)
created_at TIMESTAMPTZ DEFAULT now()  -- '2025-03-15 14:30:00+00'

UUIDs vs Serial IDs

-- Auto-increment: simple, ordered, leaks info
id BIGSERIAL PRIMARY KEY  -- 1, 2, 3, ... (competitor can guess your user count)

-- UUIDv4: random, no info leak, but bad for B-tree locality
id UUID DEFAULT gen_random_uuid()  -- random, scattered inserts

-- UUIDv7: time-ordered UUID (best of both worlds)
id UUID DEFAULT uuidv7()  -- built in from PostgreSQL 18; earlier, generate UUIDv7 in the app
-- UUIDv7 is time-prefixed: sorted inserts, good B-tree locality, no info leak
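To make the B-tree-locality point concrete, here's a hand-rolled UUIDv7-style generator: a 48-bit Unix-millisecond timestamp prefix followed by random bits, so later IDs always compare greater. (The field layout is simplified relative to RFC 9562 — version/variant bits are omitted — so this is an illustration, not a spec-compliant implementation.)

```python
import secrets
import time
import uuid

def uuid7_like() -> uuid.UUID:
    """Timestamp-prefixed 128-bit ID: sortable by creation time."""
    ts_ms = time.time_ns() // 1_000_000   # 48-bit millisecond timestamp
    rand = secrets.randbits(80)           # remaining 80 bits random
    return uuid.UUID(int=(ts_ms << 80) | rand)

a = uuid7_like()
time.sleep(0.002)
b = uuid7_like()
assert a < b  # time-ordered: later IDs sort after earlier ones
```

Because inserts arrive in roughly ascending order, they land on the rightmost B-tree page instead of splitting pages randomly across the index, which is exactly the locality win over UUIDv4.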

Partitioning: When One Table Isn’t Enough

When a table grows past roughly 100M rows, even good indexes start degrading. Partitioning splits one logical table into multiple physical segments:

-- Range partition by month (most common for time-series)
CREATE TABLE events (
    id BIGSERIAL,
    event_type TEXT NOT NULL,
    payload JSONB,
    created_at TIMESTAMPTZ NOT NULL
) PARTITION BY RANGE (created_at);

-- Create monthly partitions
CREATE TABLE events_2025_01 PARTITION OF events
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
CREATE TABLE events_2025_02 PARTITION OF events
    FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');
CREATE TABLE events_2025_03 PARTITION OF events
    FOR VALUES FROM ('2025-03-01') TO ('2025-04-01');

-- Queries automatically prune to relevant partitions
SELECT * FROM events
WHERE created_at >= '2025-03-01' AND created_at < '2025-04-01';
-- Only scans events_2025_03, ignores all other partitions

When to partition:

  • Table exceeds 100M rows and growing
  • Queries almost always include the partition key in WHERE
  • You need to drop old data efficiently (DROP TABLE events_2024_01 is instant vs DELETE which generates tons of WAL)
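The routing and pruning behaviour above can be modelled in a few lines: each row maps to a monthly partition by its `created_at`, and a month-bounded query only ever touches one partition (partition names mirror the `events_YYYY_MM` tables above):

```python
from collections import defaultdict
from datetime import date

partitions: dict[str, list] = defaultdict(list)

def partition_for(created_at: date) -> str:
    """Route a row to its monthly range partition, e.g. events_2025_03."""
    return f"events_{created_at.year}_{created_at.month:02d}"

rows = [date(2025, 1, 15), date(2025, 2, 3), date(2025, 3, 9), date(2025, 3, 20)]
for d in rows:
    partitions[partition_for(d)].append(d)

# A query bounded to March 2025 prunes to exactly one partition:
march = partitions["events_2025_03"]
print(len(march), sorted(partitions))
# 2 ['events_2025_01', 'events_2025_02', 'events_2025_03']
```

Dropping January's data is then just deleting the `events_2025_01` bucket — the in-memory analogue of the instant `DROP TABLE` above.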

The Data Modelling Checklist

Before shipping your schema:

  • Access patterns documented — every frequent query has a target latency
  • Hot paths optimized — top 3 queries by frequency use index scans, not sequential scans
  • Normalization appropriate — writes go to normalized tables, reads use views/caches where needed
  • Indexes match queries — composite index column order follows left-prefix rule
  • No over-indexing — unused indexes identified and dropped
  • Data types sized correctly — no BIGINT for ages, no TEXT for booleans
  • Timestamps use TIMESTAMPTZ — timezone-aware everywhere
  • Migration plan is safe — expand-contract, no locking DDL on large tables
  • Partitioning considered — for tables expected to exceed 100M rows
  • Monitoring in place — slow query log enabled, pg_stat_user_indexes reviewed weekly

Common Anti-Patterns

The God Table

-- Everything in one table, 200 columns, NULL everywhere
CREATE TABLE everything (
    id SERIAL,
    type TEXT,  -- 'user', 'order', 'product', 'review', ...
    name TEXT,
    email TEXT,          -- NULL for non-users
    price DECIMAL,       -- NULL for non-products
    quantity INT,        -- NULL for non-order-items
    rating INT,          -- NULL for non-reviews
    -- ... 190 more columns
);

This is the worst of both worlds: no normalization benefits, no denormalization speed (because the rows are enormous and sparse).

The EAV Anti-Pattern

-- Entity-Attribute-Value: flexible but unusable at scale
CREATE TABLE attributes (
    entity_id INT,
    attribute_name TEXT,
    attribute_value TEXT  -- everything is a string, no type safety
);

-- "Get user's name and email" requires two self-joins
SELECT
    n.attribute_value AS name,
    e.attribute_value AS email
FROM attributes n
JOIN attributes e ON n.entity_id = e.entity_id
WHERE n.entity_id = 42
  AND n.attribute_name = 'name'
  AND e.attribute_name = 'email';

If you need flexible attributes, use JSONB columns instead:

CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    email TEXT NOT NULL,
    metadata JSONB DEFAULT '{}'  -- flexible, indexable, type-preserving
);

-- GIN index for JSONB queries
CREATE INDEX idx_users_metadata ON users USING gin(metadata);

-- Query flexible fields naturally
SELECT * FROM users WHERE metadata @> '{"plan": "enterprise"}';
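The semantics of that `@>` containment query are easy to model in Python: the stored document must contain every key/value pair of the probe. (This sketch checks only top-level pairs for brevity; Postgres JSONB containment is recursive into nested objects and arrays.)

```python
def contains(doc: dict, probe: dict) -> bool:
    """Top-level analogue of JSONB @>: every probe pair must appear in doc."""
    return all(doc.get(k) == v for k, v in probe.items())

users = [
    {"id": 1, "metadata": {"plan": "enterprise", "beta": True}},
    {"id": 2, "metadata": {"plan": "free"}},
]
hits = [u["id"] for u in users if contains(u["metadata"], {"plan": "enterprise"})]
print(hits)  # [1]
```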

Conclusion

Efficient data modelling is not about following normal forms religiously or denormalizing everything for speed. It’s about making informed tradeoffs based on your actual workload.

The process, in order:

  1. Map your access patterns — frequency, latency targets, read vs write ratio
  2. Normalize your write path — 3NF for transactional integrity
  3. Optimize your read path — materialized views, strategic denormalization, covering indexes
  4. Index precisely — composite indexes matching your WHERE clause order, partial indexes for hot subsets
  5. Evolve safely — expand-contract migrations, concurrent index builds, never lock a hot table

The best data model is the one that makes your top 5 queries fast and your schema easy to change when requirements shift. Everything else is academic.
