NeuronLabs

Retrieval-Augmented Generation (RAG)

Difficulty: M.Tech | Read Time: ~15 min

Overcoming LLM Hallucinations

LLMs suffer from two major problems:

  1. Knowledge Cutoff: They don't know about events that happened after they were trained.
  2. Hallucinations: They confidently make up facts when they don't know the answer.

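Retrieval-Augmented Generation (RAG) mitigates both problems by retrieving relevant context from an external database at query time and injecting it into the prompt, so the model answers from the supplied text rather than from its training memory alone.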

The RAG Pipeline

  1. Ingestion: Split large documents into smaller chunks (e.g., 500 tokens).
  2. Embedding: Pass chunks through an embedding model (like text-embedding-3-small) to convert them into dense vectors.
  3. Storage: Store the vectors in a Vector Database (e.g., Pinecone, Milvus, pgvector).
  4. Retrieval: When a user submits a query, embed it with the same embedding model, run a cosine-similarity search in the vector DB, and retrieve the top-K most similar chunks.
  5. Generation: Send the original query plus the retrieved chunks to the LLM to generate a grounded answer.
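
Steps 1 through 4 can be sketched in plain Python to show how the pieces fit together. This is a minimal illustration under loud assumptions, not a production setup: `chunk_text`, `embed`, `build_index`, and `retrieve` are hypothetical helper names, chunking is done on whitespace-separated words rather than tokens, the "embedding" is a toy bag-of-words count vector standing in for a real model such as text-embedding-3-small, and an in-memory list stands in for the vector database.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_size=50, overlap=10):
    """Step 1 (Ingestion): split text into overlapping word-based chunks.

    Assumption: words approximate tokens; a real pipeline would use a tokenizer.
    """
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap
    # Stop once the remaining words are already covered by the previous chunk.
    starts = range(0, max(len(words) - overlap, 1), step)
    return [" ".join(words[i:i + chunk_size]) for i in starts]


def embed(text):
    """Step 2 (Embedding): toy stand-in that returns sparse word counts.

    A real embedding model returns a dense float vector instead.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_index(chunks):
    """Step 3 (Storage): an in-memory list standing in for a vector DB."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(query, index, k=2):
    """Step 4 (Retrieval): embed the query and return the top-k chunks."""
    query_vec = embed(query)
    ranked = sorted(
        index,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in ranked[:k]]


# Illustrative chunks; each one is short enough to be stored as-is.
documents = [
    "Pinecone is a managed vector database.",
    "pgvector adds vector similarity search to PostgreSQL.",
    "Cosine similarity measures the angle between two vectors.",
]
index = build_index(documents)
top_chunks = retrieve("Which extension adds vector search to PostgreSQL?", index, k=1)
print(top_chunks)
```

With a real embedding model the ranking would reflect semantic similarity rather than exact word overlap, but the control flow (embed once at ingestion, embed the query at request time, rank by cosine similarity, keep the top K) stays the same.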
```python
# Simple RAG Prompt Template
prompt = f"""
Use the following context to answer the user's question. If the answer is not in the context, say "I don't know".

Context:
{retrieved_chunks}

Question:
{user_query}
"""
```
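
One practical detail when filling the template above: if `retrieved_chunks` is a Python list, interpolating it directly into an f-string renders the list's repr, brackets and quotes included. The sketch below shows one way to avoid that by joining the chunks into numbered plain text first. `build_prompt` is a hypothetical helper name, and the `[1]`, `[2]` numbering is an illustrative choice, not something the template requires.

```python
def build_prompt(retrieved_chunks, user_query):
    """Step 5 (Generation): fill the RAG template with numbered chunks."""
    # Join chunks as plain text so the model sees prose, not a Python list repr.
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Use the following context to answer the user's question. "
        'If the answer is not in the context, say "I don\'t know".\n\n'
        f"Context:\n{context}\n\n"
        f"Question:\n{user_query}\n"
    )


prompt = build_prompt(
    ["pgvector adds vector similarity search to PostgreSQL."],
    "Which extension adds vector search to PostgreSQL?",
)
print(prompt)
```

The resulting string is what would be sent to the LLM; numbering the chunks also makes it possible to ask the model to cite which chunk supports its answer.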