LLMs know a lot, but they don’t know your data. They can’t read your company wiki, your product docs, your private codebase, or last week’s support tickets. Ask them a question about your internal systems and they’ll either hallucinate a confident-sounding wrong answer or admit they don’t know. RAG fixes this. Instead of hoping the model memorized the right training data, you retrieve the relevant documents yourself and hand them to the model as context. It’s the single most important pattern in production AI.
What RAG Is and Why It Exists
RAG stands for Retrieval-Augmented Generation. The idea is simple: before the LLM generates an answer, retrieve relevant documents and stuff them into the prompt.
User question: "What is our refund policy for enterprise customers?"
Without RAG: LLM guesses (hallucination)
With RAG: Search your docs -> find refund-policy.md -> paste it into prompt -> LLM answers accurately
Three reasons RAG dominates production AI:
- No fine-tuning required. Fine-tuning is expensive, slow, and hard to update. RAG uses the model as-is — you just change what context you feed it.
- Always up to date. Your vector database can be updated in real time. Fine-tuned models are frozen at training time.
- Verifiable answers. RAG can cite sources. You can show users which documents the answer came from.
The tradeoff: RAG adds latency (retrieval step), complexity (vector DB infrastructure), and cost (embedding API calls). But for most use cases, it’s the right call.
Embeddings Explained
An embedding is a vector — a list of numbers — that represents the meaning of a piece of text. Similar texts get similar vectors. This is what makes semantic search possible.
from openai import OpenAI
client = OpenAI()
# Embed a single piece of text
response = client.embeddings.create(
model="text-embedding-3-small",
input="How do I reset my password?"
)
vector = response.data[0].embedding
print(f"Dimensions: {len(vector)}") # 1536
print(f"First 5 values: {vector[:5]}")
# [-0.0023, 0.0145, -0.0067, 0.0312, -0.0089]
The key insight: “How do I reset my password?” and “I forgot my login credentials” produce vectors that are close together in 1536-dimensional space, even though they share almost no words. That’s the magic of embeddings — they capture meaning, not just keywords.
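A quick way to see what “close together” means: cosine similarity, computed here on toy 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions; these numbers are purely illustrative):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, ~0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (made up for illustration)
password_reset = [0.9, 0.1, 0.2]
forgot_login = [0.8, 0.2, 0.3]
weather = [0.1, 0.9, 0.1]

print(f"reset vs forgot:  {cosine_similarity(password_reset, forgot_login):.3f}")  # 0.983
print(f"reset vs weather: {cosine_similarity(password_reset, weather):.3f}")       # 0.237
```

Vector databases use this measure (or a fast approximation of it) when you query by similarity.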
Embedding Models
| Model | Dimensions | Provider | Cost |
|---|---|---|---|
| text-embedding-3-small | 1536 | OpenAI | $0.02/1M tokens |
| text-embedding-3-large | 3072 | OpenAI | $0.13/1M tokens |
| voyage-3 | 1024 | Voyage AI | $0.06/1M tokens |
| all-MiniLM-L6-v2 | 384 | Sentence Transformers | Free (local) |
For local/free embeddings, sentence-transformers is the go-to:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
"How do I reset my password?",
"I forgot my login credentials",
"The weather is nice today"
]
embeddings = model.encode(sentences)
print(f"Shape: {embeddings.shape}") # (3, 384)
# Compute similarity
from sklearn.metrics.pairwise import cosine_similarity
sims = cosine_similarity(embeddings)
print(f"password vs credentials: {sims[0][1]:.3f}") # ~0.78 (similar)
print(f"password vs weather: {sims[0][2]:.3f}")  # ~0.12 (unrelated)
Critical rule: the same embedding model must be used for both ingestion and querying. If you embed your documents with text-embedding-3-small, you must embed your queries with text-embedding-3-small. Mixing models produces garbage results.
Vector Databases
You need somewhere to store embeddings and search them by similarity. That’s what vector databases do.
ChromaDB (Local, Zero Config)
ChromaDB runs in-process with no server. Perfect for prototyping and small-to-medium datasets.
import chromadb
client = chromadb.Client() # In-memory
# Or persistent: chromadb.PersistentClient(path="./chroma_data")
collection = client.create_collection(
name="docs",
metadata={"hnsw:space": "cosine"} # cosine similarity
)
# Add documents — Chroma embeds them for you using its default model
collection.add(
ids=["doc1", "doc2", "doc3"],
documents=[
"Our refund policy allows returns within 30 days.",
"Enterprise customers get a dedicated support channel.",
"Password resets can be done from the settings page."
],
metadatas=[
{"source": "policy.md", "section": "refunds"},
{"source": "enterprise.md", "section": "support"},
{"source": "faq.md", "section": "auth"}
]
)
# Query
results = collection.query(
query_texts=["How do I get a refund?"],
n_results=2
)
print(results["documents"])
# [['Our refund policy allows returns within 30 days.',
#  'Enterprise customers get a dedicated support channel.']]
Pinecone (Hosted, Production-Ready)
Pinecone is a managed vector database. You don’t run any infrastructure.
from pinecone import Pinecone
pc = Pinecone(api_key="your-api-key")
index = pc.Index("my-index")
# Upsert vectors (you must embed them yourself)
index.upsert(vectors=[
{"id": "doc1", "values": embedding_vector, "metadata": {"source": "policy.md"}},
{"id": "doc2", "values": embedding_vector_2, "metadata": {"source": "faq.md"}}
])
# Query
results = index.query(vector=query_embedding, top_k=5, include_metadata=True)
for match in results.matches:
    print(f"{match.id}: {match.score:.3f} — {match.metadata['source']}")
pgvector (Postgres Extension)
If you already run Postgres, pgvector adds vector search without a new database.
CREATE EXTENSION vector;
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
content TEXT,
embedding vector(1536),
source TEXT
);
-- Insert
INSERT INTO documents (content, embedding, source)
VALUES ('Refund policy...', '[0.0023, 0.0145, ...]', 'policy.md');
-- Nearest neighbor search
SELECT content, source, embedding <=> $1 AS distance
FROM documents
ORDER BY embedding <=> $1
LIMIT 5;
Quick Comparison
| Database | Setup | Best For | Scales To |
|---|---|---|---|
| ChromaDB | pip install | Prototyping, small projects | ~1M vectors |
| Pinecone | Managed SaaS | Production, zero-ops | Billions |
| pgvector | Postgres extension | Teams already on Postgres | ~10M vectors |
| Weaviate | Self-hosted or cloud | Multi-modal, GraphQL fans | Billions |
For this lesson, we’ll use ChromaDB throughout. Everything transfers to other vector DBs — the concepts are identical.
Chunking Strategies
You can’t embed an entire 50-page PDF as one vector. You need to split documents into chunks. Chunking is where most RAG pipelines succeed or fail.
Fixed-Size Chunking
The simplest approach. Split text every N characters with overlap.
def fixed_size_chunks(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
chunks = []
start = 0
while start < len(text):
end = start + chunk_size
chunks.append(text[start:end])
        start = end - overlap  # Overlap so a sentence cut at one boundary appears whole in the next chunk
return chunks
text = "A very long document... " * 200
chunks = fixed_size_chunks(text, chunk_size=500, overlap=100)
print(f"Created {len(chunks)} chunks")
Recursive Character Splitting
Split on natural boundaries — paragraphs first, then sentences, then words. This is the most commonly used strategy.
from langchain_text_splitters import RecursiveCharacterTextSplitter  # pip install langchain-text-splitters
splitter = RecursiveCharacterTextSplitter(
chunk_size=500,
chunk_overlap=100,
separators=["\n\n", "\n", ". ", " ", ""] # Try each in order
)
text = open("long_document.txt").read()
chunks = splitter.split_text(text)
print(f"Chunks: {len(chunks)}, avg size: {sum(len(c) for c in chunks) / len(chunks):.0f}")
Semantic Chunking
Group sentences that are semantically related. More expensive but produces higher-quality chunks.
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
chunker = SemanticChunker(
OpenAIEmbeddings(),
breakpoint_threshold_type="percentile",
breakpoint_threshold_amount=75
)
chunks = chunker.split_text(text)
Why Chunk Size Matters
This is not a tuning knob you can ignore:
- Too small (< 200 chars): Chunks lack context. “The deadline is 30 days” means nothing without knowing what it refers to.
- Too large (> 2000 chars): Retrieval loses precision. A chunk about five topics matches queries about all five topics — poorly.
- Sweet spot: 300-800 characters for most use cases. Overlap of 10-20% of chunk size.
Test your chunking on real queries before you move on. Bad chunking is the number one cause of bad RAG results.
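Reading your chunks is fast to script. Here is a throwaway inspection helper, a sketch only: the 200-character “too small” flag mirrors the rule of thumb above, and the sample text and naive splitter are stand-ins for your own data and chunker:

```python
def inspect_chunks(chunks: list[str], preview: int = 80) -> None:
    """Print each chunk's length and opening text so a human can sanity-check it."""
    for i, chunk in enumerate(chunks):
        flag = "  <-- too small?" if len(chunk) < 200 else ""
        print(f"[{i}] {len(chunk)} chars{flag}: {chunk[:preview]!r}")

# Stand-in document and a naive fixed-size split (400 chars, 100 overlap)
sample = (
    "Refunds are accepted within 30 days of purchase. "
    "Enterprise customers should contact their account manager for exceptions. "
) * 8
chunks = [sample[i:i + 400] for i in range(0, len(sample), 300)]
inspect_chunks(chunks)
```

If a chunk preview reads like a fragment to you, it will retrieve like a fragment too.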
Building the Ingestion Pipeline
Time to build. This pipeline loads documents, chunks them, embeds them, and stores them in ChromaDB.
# rag_ingest.py — Full ingestion pipeline
import os
import chromadb
from pathlib import Path
# 1. Load documents from a directory
def load_documents(directory: str) -> list[dict]:
"""Load all .txt and .md files from a directory."""
docs = []
for path in Path(directory).rglob("*"):
if path.suffix in (".txt", ".md"):
content = path.read_text(encoding="utf-8")
docs.append({
"content": content,
"source": str(path),
"filename": path.name
})
print(f"Loaded {len(docs)} documents")
return docs
# 2. Chunk documents
def chunk_documents(docs: list[dict], chunk_size: int = 500, overlap: int = 100) -> list[dict]:
"""Split documents into overlapping chunks, preserving metadata."""
chunks = []
for doc in docs:
text = doc["content"]
start = 0
chunk_index = 0
while start < len(text):
end = start + chunk_size
chunk_text = text[start:end]
# Try to break at a sentence boundary
if end < len(text):
last_period = chunk_text.rfind(". ")
if last_period > chunk_size * 0.5:
chunk_text = chunk_text[:last_period + 1]
end = start + last_period + 1
chunks.append({
"id": f"{doc['filename']}_{chunk_index}",
"text": chunk_text.strip(),
"metadata": {
"source": doc["source"],
"filename": doc["filename"],
"chunk_index": chunk_index
}
})
start = end - overlap
chunk_index += 1
print(f"Created {len(chunks)} chunks from {len(docs)} documents")
return chunks
# 3. Store in ChromaDB
def create_vector_store(chunks: list[dict], collection_name: str = "knowledge_base"):
"""Embed and store chunks in ChromaDB."""
client = chromadb.PersistentClient(path="./chroma_data")
# Delete existing collection if it exists
try:
client.delete_collection(collection_name)
    except Exception:  # missing-collection error type varies across Chroma versions
pass
collection = client.create_collection(
name=collection_name,
metadata={"hnsw:space": "cosine"}
)
# ChromaDB handles embedding internally using its default model
# For production, you'd pass your own embedding function
batch_size = 100
for i in range(0, len(chunks), batch_size):
batch = chunks[i:i + batch_size]
collection.add(
ids=[c["id"] for c in batch],
documents=[c["text"] for c in batch],
metadatas=[c["metadata"] for c in batch]
)
print(f" Stored batch {i // batch_size + 1} ({len(batch)} chunks)")
print(f"Vector store created: {collection.count()} chunks indexed")
return collection
# Run the pipeline
if __name__ == "__main__":
docs = load_documents("./my_knowledge_base")
chunks = chunk_documents(docs, chunk_size=500, overlap=100)
    collection = create_vector_store(chunks)
Run it:
pip install chromadb
mkdir -p my_knowledge_base
# Put your .txt or .md files in my_knowledge_base/
python rag_ingest.py
Building the Query Pipeline
Now the retrieval and generation side. Embed the user’s question, find similar chunks, build a prompt, and call the LLM.
# rag_query.py — Full query pipeline
import chromadb
from openai import OpenAI
openai_client = OpenAI()
def retrieve(query: str, collection, n_results: int = 5) -> list[dict]:
"""Search the vector store for relevant chunks."""
results = collection.query(
query_texts=[query],
n_results=n_results
)
retrieved = []
for i in range(len(results["documents"][0])):
retrieved.append({
"text": results["documents"][0][i],
"metadata": results["metadatas"][0][i],
"distance": results["distances"][0][i]
})
return retrieved
def build_prompt(query: str, context_chunks: list[dict]) -> str:
"""Combine retrieved context with the user's question."""
context = "\n\n---\n\n".join([
f"[Source: {c['metadata']['source']}]\n{c['text']}"
for c in context_chunks
])
return f"""You are a helpful assistant. Answer the user's question using ONLY the provided context.
If the context doesn't contain enough information to answer, say "I don't have enough information to answer that."
Always cite which source document you used.
CONTEXT:
{context}
QUESTION: {query}
ANSWER:"""
def generate(prompt: str) -> str:
"""Call the LLM with the augmented prompt."""
response = openai_client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
temperature=0.2, # Low temperature for factual answers
max_tokens=1000
)
return response.choices[0].message.content
def rag_query(query: str, collection, n_results: int = 5) -> dict:
"""Full RAG pipeline: retrieve -> augment -> generate."""
# Step 1: Retrieve
chunks = retrieve(query, collection, n_results)
print(f"Retrieved {len(chunks)} chunks")
for c in chunks:
print(f" [{c['distance']:.3f}] {c['metadata']['source']}")
# Step 2: Augment (build prompt with context)
prompt = build_prompt(query, chunks)
# Step 3: Generate
answer = generate(prompt)
return {
"answer": answer,
"sources": [c["metadata"]["source"] for c in chunks],
"chunks_used": len(chunks)
    }
The Complete RAG Pipeline
Here’s both pipelines wired together in one runnable script:
# rag_complete.py — End-to-end RAG system
import chromadb
from openai import OpenAI
from pathlib import Path
openai_client = OpenAI()
chroma_client = chromadb.PersistentClient(path="./chroma_data")
COLLECTION_NAME = "knowledge_base"
# ---------- Ingestion ----------
def ingest(directory: str, chunk_size: int = 500, overlap: int = 100):
"""Load, chunk, and index documents."""
# Load
docs = []
for path in Path(directory).rglob("*"):
if path.suffix in (".txt", ".md"):
docs.append({"content": path.read_text(), "name": path.name})
# Chunk
chunks, ids, metadatas = [], [], []
for doc in docs:
text = doc["content"]
start, idx = 0, 0
while start < len(text):
end = min(start + chunk_size, len(text))
chunks.append(text[start:end])
ids.append(f"{doc['name']}_{idx}")
metadatas.append({"source": doc["name"], "chunk": idx})
start = end - overlap
idx += 1
# Store
try:
chroma_client.delete_collection(COLLECTION_NAME)
    except Exception:  # missing-collection error type varies across Chroma versions
pass
collection = chroma_client.create_collection(COLLECTION_NAME)
collection.add(ids=ids, documents=chunks, metadatas=metadatas)
print(f"Indexed {len(chunks)} chunks from {len(docs)} documents")
return collection
# ---------- Query ----------
def query(question: str, collection, top_k: int = 5) -> str:
"""Retrieve context and generate an answer."""
results = collection.query(query_texts=[question], n_results=top_k)
context = "\n\n".join([
f"[{results['metadatas'][0][i]['source']}]: {results['documents'][0][i]}"
for i in range(len(results["documents"][0]))
])
prompt = f"""Answer using ONLY the context below. Cite your sources.
If the context doesn't contain the answer, say so.
Context:
{context}
Question: {question}"""
response = openai_client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
temperature=0.2
)
return response.choices[0].message.content
# ---------- Main ----------
if __name__ == "__main__":
# Ingest your docs
collection = ingest("./my_knowledge_base")
# Ask questions
while True:
q = input("\nQuestion (or 'quit'): ")
if q.lower() == "quit":
break
answer = query(q, collection)
print(f"\n{answer}")Install dependencies and run:
pip install chromadb openai
python rag_complete.py
That’s a working RAG system in under 70 lines. Everything else is optimization.
Advanced RAG Patterns
The basic pipeline works, but production systems need more.
Hybrid Search
Combine vector similarity with keyword matching. Vector search misses exact terms (product codes, error IDs). Keyword search misses semantic meaning. Use both.
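The merge step, Reciprocal Rank Fusion, is simple enough to write yourself. A minimal sketch over ranked lists of document IDs (the IDs here are toy data):

```python
def reciprocal_rank_fusion(*ranked_lists: list[str], k: int = 60) -> list[str]:
    """Merge ranked lists of doc IDs: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc3", "doc1", "doc7"]    # semantic ranking
keyword_hits = ["doc1", "doc9", "doc3"]   # BM25 ranking
print(reciprocal_rank_fusion(vector_hits, keyword_hits))
# docs appearing in both lists rise to the top
```

The k constant (60 by convention) damps the difference between adjacent ranks, so a document that appears in both lists tends to beat one that ranks first in only one.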
# Pseudo-code for hybrid search
def hybrid_search(query: str, collection, n_results: int = 10):
# Vector search (semantic)
vector_results = collection.query(query_texts=[query], n_results=n_results)
# Keyword search (BM25 or full-text)
keyword_results = bm25_search(query, documents, top_k=n_results)
# Reciprocal Rank Fusion to merge results
fused = reciprocal_rank_fusion(vector_results, keyword_results, k=60)
    return fused[:n_results]
Reranking
The retrieval step casts a wide net. A reranker scores each result more carefully.
# pip install cohere
import cohere
co = cohere.Client("your-api-key")
# First, retrieve more candidates than you need
candidates = retrieve(query, collection, n_results=20)
# Then rerank to find the best ones
reranked = co.rerank(
model="rerank-english-v3.0",
query=query,
documents=[c["text"] for c in candidates],
top_n=5
)
top_chunks = [candidates[r.index] for r in reranked.results]
HyDE (Hypothetical Document Embeddings)
Instead of embedding the question directly, ask the LLM to generate a hypothetical answer, then embed that. The hypothetical answer is closer in embedding space to the real documents than a short question would be.
def hyde_retrieve(query: str, collection, n_results: int = 5):
# Generate a hypothetical answer
response = openai_client.chat.completions.create(
model="gpt-4o-mini",
messages=[{
"role": "user",
"content": f"Write a short paragraph answering this question: {query}"
}],
temperature=0.5
)
hypothetical_doc = response.choices[0].message.content
# Embed the hypothetical answer (not the question)
results = collection.query(query_texts=[hypothetical_doc], n_results=n_results)
    return results
Metadata Filtering
Filter by source, date, category, or any metadata before searching.
results = collection.query(
query_texts=["refund policy"],
n_results=5,
where={"source": "enterprise_docs.md"}, # Exact match
where_document={"$contains": "enterprise"} # Document content filter
)
# Combine filters
results = collection.query(
query_texts=["deployment guide"],
n_results=5,
where={
"$and": [
{"category": {"$eq": "engineering"}},
            {"updated_at": {"$gte": 1735689600}}  # numeric timestamp (2025-01-01 UTC); Chroma range filters compare numbers, not date strings
]
}
)
Evaluating RAG Quality
You can’t improve what you don’t measure. RAG evaluation has three dimensions:
Context Relevance — Did you retrieve the right documents?
def context_precision(retrieved_docs, relevant_docs):
"""What fraction of retrieved docs are actually relevant?"""
relevant_retrieved = set(retrieved_docs) & set(relevant_docs)
return len(relevant_retrieved) / len(retrieved_docs) if retrieved_docs else 0
def context_recall(retrieved_docs, relevant_docs):
"""What fraction of relevant docs were retrieved?"""
relevant_retrieved = set(retrieved_docs) & set(relevant_docs)
    return len(relevant_retrieved) / len(relevant_docs) if relevant_docs else 0
Faithfulness — Is the answer grounded in the retrieved context, or did the LLM hallucinate?
def check_faithfulness(answer: str, context: str) -> str:
"""Use an LLM to judge if the answer is supported by context."""
prompt = f"""Given the context and answer below, identify any claims in the answer
that are NOT supported by the context.
Context: {context}
Answer: {answer}
List unsupported claims, or say "All claims are supported."
"""
response = openai_client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
temperature=0
)
    return response.choices[0].message.content
Answer Correctness — Is the final answer actually right? This usually requires a ground-truth test set.
# Build an evaluation dataset
eval_set = [
{
"question": "What is the refund policy for enterprise?",
"expected_answer": "Enterprise customers can request refunds within 60 days.",
"relevant_docs": ["enterprise_policy.md"]
},
# ... more test cases
]
# Run evaluation
for test in eval_set:
result = rag_query(test["question"], collection)
precision = context_precision(result["sources"], test["relevant_docs"])
print(f"Q: {test['question']}")
print(f" Precision: {precision:.2f}")
    print(f" Answer: {result['answer'][:100]}...")
For automated evaluation at scale, look at frameworks like Ragas and DeepEval.
Common Pitfalls
These are the mistakes that waste weeks of debugging time.
Bad chunking. This is the most common problem. If your chunks split a paragraph in the middle of a key sentence, retrieval will return fragments that lack context. Always test your chunking by reading the actual chunks. If a chunk doesn’t make sense to a human, it won’t make sense to the retriever.
Wrong K value. Retrieving too few chunks (K=1 or 2) misses relevant context. Retrieving too many (K=20) floods the prompt with noise and confuses the LLM. Start with K=5, then tune based on your evaluation metrics.
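Tuning K is easy to script once you have labeled examples. A sketch using stubbed retrieval results (the file names and relevance labels are placeholders for your own eval set):

```python
def context_precision(retrieved: list[str], relevant: list[str]) -> float:
    """Fraction of retrieved docs that are actually relevant."""
    hits = set(retrieved) & set(relevant)
    return len(hits) / len(retrieved) if retrieved else 0.0

# Stubbed ranked results for one query -- replace with your real vector search
ranked = ["faq.md", "policy.md", "blog.md", "changelog.md", "policy_v2.md"]
relevant = ["policy.md", "policy_v2.md"]

for k in (1, 3, 5):
    print(f"K={k}: precision={context_precision(ranked[:k], relevant):.2f}")
```

Run the same sweep over your whole eval set (and track recall alongside precision) before settling on a K.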
Context window overflow. If you retrieve 10 chunks of 500 tokens each, that’s 5,000 tokens of context before the question and instructions. Add the system prompt and the LLM’s response, and you can blow past the context window. Always calculate your token budget:
def check_token_budget(chunks: list[str], max_context_tokens: int = 4000) -> list[str]:
"""Trim chunks to fit within token budget."""
# Rough estimate: 1 token ~ 4 characters
total_chars = 0
selected = []
for chunk in chunks:
if total_chars + len(chunk) > max_context_tokens * 4:
break
selected.append(chunk)
total_chars += len(chunk)
    return selected
Embedding model mismatch. If you indexed with text-embedding-3-small and query with all-MiniLM-L6-v2, the vectors live in different spaces. Every similarity score will be meaningless. Pin your embedding model and version in configuration.
Not handling empty results. When the vector store returns nothing relevant (all distances > 0.8), the LLM will try to answer with no context — which means hallucination. Add a distance threshold and return “I don’t know” when nothing is close enough:
def retrieve_with_threshold(query, collection, n_results=5, max_distance=0.5):
results = collection.query(query_texts=[query], n_results=n_results)
filtered = []
for i in range(len(results["documents"][0])):
if results["distances"][0][i] <= max_distance:
filtered.append(results["documents"][0][i])
if not filtered:
return None # Signal "no relevant context found"
    return filtered
Ignoring metadata. Vectors alone lose structure. A chunk from a deprecated 2019 doc and a chunk from the current 2025 doc will have similar embeddings if the content is similar. Always store and filter by metadata — date, source, version, category.
Key Takeaways
- RAG = retrieve relevant context + stuff it into the prompt + let the LLM generate. It beats fine-tuning for most use cases.
- Embeddings convert text to vectors that capture meaning. Use the same model for indexing and querying.
- ChromaDB is the fastest way to prototype a vector store. Pinecone and pgvector are solid production choices.
- Chunking is the most under-appreciated part of RAG. Bad chunks produce bad retrieval. Target 300-800 characters with 10-20% overlap.
- The prompt template matters. Tell the LLM to answer only from context and cite sources.
- Measure retrieval precision, recall, and faithfulness. You cannot improve RAG by guessing.
- Advanced patterns (hybrid search, reranking, HyDE) are worth adding once the basic pipeline works and you have evaluation metrics to guide you.
- Always set a distance threshold on retrieval. When nothing matches, say “I don’t know” instead of hallucinating.
