What is RAG?
Retrieval-Augmented Generation (RAG) is an architecture pattern that enhances Large Language Models (LLMs) by giving them access to external knowledge. Instead of relying solely on the model's training data, RAG retrieves relevant documents at query time and includes them in the prompt context.
This approach mitigates several LLM limitations: it reduces hallucinations, enables access to private or recent data, and provides source attribution for generated responses. RAG has become a standard pattern for enterprise AI applications.
"RAG transforms LLMs from impressive but unreliable party tricks into production-ready systems that can be trusted with real business decisions."
RAG Architecture
A RAG pipeline consists of two main phases:
Indexing Phase (offline):
- Load documents from various sources (PDFs, databases, APIs)
- Split documents into chunks of appropriate size
- Generate embeddings for each chunk using an embedding model
- Store embeddings in a vector database with metadata
Query Phase (online):
- Convert user query to an embedding
- Search vector database for similar chunks
- Construct prompt with retrieved context
- Send to LLM and return response
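Before introducing any libraries, the whole flow can be sketched end to end with toy stand-ins; the bag-of-words "embedding" and Jaccard overlap below are illustrative placeholders for a real embedding model and vector database, not techniques to ship.
# Toy end-to-end sketch of both phases (stand-ins, not production techniques)
def embed(text):
    # Stand-in "embedding": a set of lowercase tokens
    return set(text.lower().split())

def similarity(a, b):
    # Jaccard overlap as a stand-in for cosine similarity
    return len(a & b) / max(len(a | b), 1)

# Indexing phase (offline): "embed" each chunk and store it
corpus = [
    "RAG retrieves relevant documents at query time.",
    "Embeddings map text to vectors for similarity search.",
]
store = [(embed(chunk), chunk) for chunk in corpus]

# Query phase (online): embed the question, retrieve, build the prompt
question = "How does RAG find relevant text?"
best = max(store, key=lambda item: similarity(embed(question), item[0]))
prompt = f"Answer based on this context:\n{best[1]}\n\nQuestion: {question}"
print(prompt)  # this prompt would be sent to the LLM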
Document Processing
Document processing is where most RAG implementations succeed or fail:
- Chunking strategy: Chunks that are too small lose context; too large and they dilute relevance. Start with 500-1000 tokens with 100-200 token overlap.
- Metadata extraction: Store source, date, author, and section headers. Metadata enables filtering and improves retrieval.
- Document parsing: PDFs are notoriously difficult. Use specialized parsers like Unstructured, LlamaParse, or cloud document AI services.
- Cleaning: Remove headers, footers, page numbers, and irrelevant content that will confuse retrieval; a simple rule-based pass is sketched right after this list.
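A rule-based cleaning pass can start as small as the sketch below; the patterns (page-number lines, a repeated CONFIDENTIAL footer) are illustrative assumptions to adapt to your own documents.
# Example: Strip common PDF noise before chunking (illustrative patterns)
import re

def clean_page(text: str) -> str:
    lines = []
    for line in text.splitlines():
        stripped = line.strip()
        if re.fullmatch(r"Page \d+( of \d+)?", stripped):  # "Page 3 of 12"
            continue
        if re.fullmatch(r"\d+", stripped):  # bare page numbers
            continue
        if stripped.upper() == "CONFIDENTIAL":  # repeated header/footer
            continue
        lines.append(line)
    return "\n".join(lines)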
# Example: Document chunking with LangChain
# (older LangChain versions import these from langchain.document_loaders
# and langchain.text_splitter instead)
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load PDF
loader = PyPDFLoader("company_handbook.pdf")
documents = loader.load()

# Split into chunks
# Note: chunk_size here is measured in characters; pass a token-based
# length_function if you want to match the token budgets discussed above
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_documents(documents)

# Add metadata
for i, chunk in enumerate(chunks):
    chunk.metadata["chunk_id"] = i
    chunk.metadata["source"] = "company_handbook"
Vector Database Selection
Vector database choice depends on scale and requirements:
- Pinecone: Fully managed, excellent developer experience, best for getting started quickly.
- Weaviate: Open-source, hybrid search (vector + keyword), good for complex queries.
- Chroma: Lightweight, runs in-memory, ideal for development and small datasets.
- pgvector: PostgreSQL extension, good if you're already using Postgres.
- Qdrant: High performance, good filtering capabilities, self-hosted or cloud.
For most production use cases, start with Pinecone or Weaviate, and migrate to a self-hosted option if cost or data residency becomes a concern. For local experimentation, Chroma gets you running in a few lines, as the sketch below shows.
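Here is a minimal Chroma sketch; the handbook snippets are made up, and Chroma's bundled default embedding function is used, so swap in your production embedding model if you want comparable retrieval behavior.
# Example: In-memory Chroma for local development (no server, no API key)
import chromadb

client = chromadb.Client()
collection = client.create_collection("handbook")

# Chroma embeds these with its default embedding function
collection.add(
    documents=[
        "Employees accrue 1.5 PTO days per month.",
        "Remote work requires manager approval.",
    ],
    ids=["chunk-0", "chunk-1"],
    metadatas=[{"source": "handbook"}, {"source": "handbook"}],
)

results = collection.query(query_texts=["How much PTO do I get?"], n_results=1)
print(results["documents"][0])  # best-matching chunk(s)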
Implementation Walkthrough
A minimal RAG implementation in Python:
from openai import OpenAI
from pinecone import Pinecone
import os

# Initialize clients (OpenAI reads OPENAI_API_KEY from the environment)
openai_client = OpenAI()
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("knowledge-base")

def get_embedding(text):
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

def query_rag(user_question):
    # Step 1: Embed the question
    query_embedding = get_embedding(user_question)

    # Step 2: Search the vector database
    results = index.query(
        vector=query_embedding,
        top_k=5,
        include_metadata=True
    )

    # Step 3: Build context from the retrieved chunks
    context = "\n\n".join([
        match.metadata["text"] for match in results.matches
    ])

    # Step 4: Generate a response grounded in the context
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Answer based on this context:\n{context}"},
            {"role": "user", "content": user_question}
        ]
    )
    return response.choices[0].message.content
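Calling it is then a one-liner (the question is illustrative):
answer = query_rag("How many PTO days do employees accrue each month?")
print(answer)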
Production Deployment Tips
Moving RAG from prototype to production requires attention to:
- Evaluation: Build test datasets with known answers. Measure retrieval precision and generation quality.
- Caching: Cache embeddings for repeated queries. Cache LLM responses for identical prompts.
- Streaming: Stream LLM responses for better user experience on long answers.
- Fallbacks: Handle cases where retrieval returns nothing relevant; the sketch after this list combines a no-context fallback with streaming.
- Observability: Log queries, retrieved chunks, and responses for debugging and improvement.
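As a concrete starting point for the fallback and streaming items, the sketch below builds on the openai_client, index, and get_embedding defined in the walkthrough; the 0.4 relevance threshold is an assumed value you should tune against your evaluation set.
# Example: no-context fallback plus streamed generation (sketch; reuses
# openai_client, index, and get_embedding from the walkthrough above)
RELEVANCE_THRESHOLD = 0.4  # assumed value; tune against your eval set

def query_rag_streaming(user_question):
    query_embedding = get_embedding(user_question)
    results = index.query(vector=query_embedding, top_k=5, include_metadata=True)

    # Fallback: refuse rather than let the model answer from thin air
    relevant = [m for m in results.matches if m.score >= RELEVANCE_THRESHOLD]
    if not relevant:
        yield "I couldn't find anything relevant in the knowledge base."
        return

    context = "\n\n".join(m.metadata["text"] for m in relevant)
    stream = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Answer based on this context:\n{context}"},
            {"role": "user", "content": user_question},
        ],
        stream=True,  # tokens arrive as they are generated
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta

# Stream tokens to the user as they arrive
for token in query_rag_streaming("How do I request PTO?"):
    print(token, end="", flush=True)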
Priya Nair
AWS Data Engineer
Priya is an AWS Data Engineer specializing in building scalable data pipelines and real-time analytics solutions. She holds multiple AWS certifications and has led data platform modernization projects for Fortune 500 companies.