
Retrieval-Augmented Generation (RAG)

Difficulty: M.Tech
Read Time: ~15 min

Overcoming LLM Hallucinations

LLMs suffer from two major problems:

Knowledge Cutoff: They don't know about events that happened after they were trained.
Hallucinations: They confidently make up facts when they don't know the answer.

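Retrieval-Augmented Generation (RAG) mitigates both problems by retrieving relevant context from an external knowledge store at query time and injecting it into the prompt, so the model answers from the supplied text rather than from memory alone.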

The RAG Pipeline

Ingestion: Split large documents into smaller chunks (e.g., 500 tokens).
Embedding: Pass each chunk through an embedding model (such as OpenAI's text-embedding-3-small) to convert it into a dense vector.
Storage: Store the vectors, together with the chunk text they came from, in a vector database (e.g., Pinecone, Milvus, or PostgreSQL with the pgvector extension).
Retrieval: When a user submits a query, embed it with the same embedding model, run a cosine-similarity search against the vector DB, and retrieve the top-K most similar chunks.
Generation: Send the original query plus the retrieved chunks to the LLM to generate a grounded answer.
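
The ingestion, embedding, storage, and retrieval steps above can be sketched in a few lines of plain Python. This is a minimal illustration under loud assumptions, not a production setup: whitespace-separated words stand in for tokens, a bag-of-words count vector stands in for a real embedding model, and an in-memory list stands in for the vector database. The function names (chunk_text, embed, cosine_similarity, retrieve) are hypothetical and chosen only for this example.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, chunk_size: int = 500) -> list[str]:
    """Ingestion: split text into chunks of at most chunk_size words.

    Assumption: whitespace-separated words approximate tokens.
    """
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def embed(text: str) -> Counter:
    """Embedding: toy stand-in that returns word counts as a sparse vector.

    A real pipeline would call an embedding model here instead.
    """
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|), or 0.0 for an empty vector."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Retrieval: rank chunks by similarity to the query and keep the top_k."""
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]


# Storage: a plain list plays the role of the vector DB in this sketch.
document = (
    "Paris is the capital of France. "
    "Berlin is the capital of Germany. "
    "Tokyo is the capital of Japan."
)
chunks = chunk_text(document, chunk_size=6)

print(retrieve("What is the capital of Germany?", chunks, top_k=1))
# prints ['Berlin is the capital of Germany.']
```

The sketch re-embeds every chunk on each query to stay short. A real system embeds chunks once at ingestion time and lets the vector database handle the similarity search.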

python

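# Simple RAG Prompt Template
# Placeholder inputs: in a real pipeline these come from the retrieval step and the user.
retrieved_chunks = ["<retrieved chunk 1>", "<retrieved chunk 2>"]
user_query = "<user question>"

# Join the chunks into one text block so the prompt does not contain a Python list repr.
context = "\n\n".join(retrieved_chunks)

prompt = f"""
Use the following context to answer the user's question. If the answer is not in the context, say "I don't know".

Context:
{context}

Question:
{user_query}
"""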