Retrieval-Augmented Generation (RAG)
Difficulty: M.Tech
Read Time: ~15 min
Overcoming LLM Hallucinations
LLMs suffer from two major problems:
- Knowledge Cutoff: They don't know about events that happened after they were trained.
- Hallucinations: They confidently make up facts when they don't know the answer.
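Retrieval-Augmented Generation (RAG) addresses both problems by retrieving relevant context from an external knowledge store at query time and injecting it into the prompt, so the model answers from the supplied text rather than from its training data alone.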
The RAG Pipeline
- Ingestion: Split large documents into smaller chunks (e.g., 500 tokens).
- Embedding: Pass chunks through an embedding model (like text-embedding-3-small) to convert them into dense vectors.
- Storage: Store the vectors in a Vector Database (e.g., Pinecone, Milvus, pgvector).
- Retrieval: When a user submits a query, embed the query with the same embedding model, run a cosine-similarity search in the vector DB, and retrieve the top-K most similar chunks.
- Generation: Send the original query plus the retrieved chunks to the LLM to generate a grounded answer.
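The steps above can be sketched end to end in a few lines. This is a minimal, illustrative sketch only, not a production design: a bag-of-words counter stands in for a real embedding model such as text-embedding-3-small, an in-memory list stands in for the vector database, and chunk size is counted in words rather than tokens. The function names (`chunk_text`, `embed`, `cosine_similarity`, `retrieve`) and the sample documents are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, chunk_size: int = 50) -> list[str]:
    """Ingestion: split text into chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def embed(text: str) -> Counter:
    """Embedding (toy stand-in): a sparse bag-of-words count vector.

    A real pipeline would call an embedding model and get a dense vector.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, index: list[tuple[str, Counter]], top_k: int = 2) -> list[str]:
    """Retrieval: embed the query and return the top_k most similar chunks."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


# Illustrative documents (assumed for this sketch).
documents = [
    "The Eiffel Tower is located in Paris and was completed in 1889.",
    "Photosynthesis converts light energy into chemical energy in plants.",
    "The Great Wall of China stretches across northern China.",
]

# Storage: a plain list of (chunk, vector) pairs stands in for the vector DB.
index = [(chunk, embed(chunk)) for doc in documents for chunk in chunk_text(doc)]

top_chunks = retrieve("Where is the Eiffel Tower located?", index, top_k=1)
print(top_chunks)
```

Swapping `embed` for a call to a real embedding model and `index` for a vector database changes the components but not the shape of the flow: the same embedding function must be used for both the chunks and the query, otherwise the similarity scores are meaningless.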
python
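# Simple RAG prompt template
def build_prompt(retrieved_chunks: str, user_query: str) -> str:
    """Fill the template with the retrieved context and the user's question."""
    return f"""Use the following context to answer the user's question. If the answer is not in the context, say "I don't know".

Context:
{retrieved_chunks}

Question:
{user_query}
"""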
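The template expects the retrieved chunks as a single block of text, while a retriever typically returns a list. One possible way to bridge the two, shown here as a sketch, is to number the chunks and separate them with blank lines so the model can tell them apart. The helper name `format_context` and the `[n]` labelling scheme are assumptions for illustration, not part of any particular library.

```python
def format_context(chunks: list[str]) -> str:
    """Join retrieved chunks into one string, numbering each chunk.

    Returns an empty string when nothing was retrieved, in which case the
    prompt's instruction should lead the model to answer "I don't know".
    """
    return "\n\n".join(f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1))


# Hypothetical top-K result from the retrieval step.
top_chunks = ["First retrieved passage.", "Second retrieved passage."]

retrieved_chunks = format_context(top_chunks)
print(retrieved_chunks)
```

The resulting `retrieved_chunks` string can then be placed in the Context section of the template. Numbering also makes it possible to ask the model to cite which chunk supported its answer.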