Building RAG Systems That Actually Work

The Basic RAG Architecture

A RAG system has two phases: indexing and inference. During indexing, documents are chunked, converted to vector embeddings, and stored in a vector database. During inference, the user’s query is also converted to an embedding, used to retrieve the most relevant document chunks, and those chunks are inserted into the LLM’s context window along with the original query. The LLM then generates an answer grounded in the retrieved content.
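To make the two phases concrete, here is a minimal sketch in plain Python. It is illustrative only: the `embed` function is a toy bag-of-words stand-in for a real embedding model, the "vector database" is an in-memory list, and all function names and the sample documents are assumptions made for this example.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_index(documents, chunk_size=12):
    """Indexing phase: chunk each document, embed each chunk, store both."""
    index = []
    for doc in documents:
        words = doc.split()
        for start in range(0, len(words), chunk_size):
            chunk = " ".join(words[start:start + chunk_size])
            index.append((chunk, embed(chunk)))
    return index

def retrieve(index, query, k=2):
    """Inference phase, step 1: embed the query and rank chunks by similarity."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(query, chunks):
    """Inference phase, step 2: place the chunks in the context with the query."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return f"Context:\n{context}\n\nQuestion: {query}"

# Hypothetical sample data.
docs = [
    "The refund policy allows returns within 30 days of purchase.",
    "Shipping takes five business days for domestic orders.",
]
index = build_index(docs)
question = "What is the refund policy for returns?"
top_chunks = retrieve(index, question, k=1)
prompt = build_prompt(question, top_chunks)
```

The resulting `prompt` is what would be sent to the LLM. In a real system the toy pieces would be swapped for an embedding model and a vector store, but the shape of the pipeline stays the same.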

Common Failure Modes

Poor Retrieval Quality

If the retrieval step doesn’t return the right documents, the LLM cannot generate a good answer regardless of how capable it is. Poor retrieval is usually caused by suboptimal chunking (chunks that are too large, too small, or split in the middle of a semantic unit rather than at its boundaries), weak embedding models, or missing metadata filtering. Investing in retrieval quality almost always yields better results than tweaking the generation prompt.
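One simple way to avoid cutting through the middle of a thought is to chunk on sentence boundaries and carry a small overlap between chunks. The sketch below assumes a naive regex sentence splitter and counts whitespace-separated words rather than real tokens; the function name and parameters are made up for illustration.

```python
import re

def chunk_by_sentence(text, max_words=40, overlap_sentences=1):
    """Group whole sentences into chunks of at most roughly max_words words.

    The last `overlap_sentences` sentences of each chunk are repeated at the
    start of the next one. A sentence is never split, so a chunk can exceed
    max_words when a single sentence (plus the overlap) is longer than that.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], []
    for sentence in sentences:
        length_if_added = sum(len(s.split()) for s in current) + len(sentence.split())
        if current and length_if_added > max_words:
            chunks.append(" ".join(current))
            # Start the next chunk with the overlap (or empty if overlap is 0).
            current = current[-overlap_sentences:] if overlap_sentences else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks

sample = "Alpha one two. Beta three four. Gamma five six. Delta seven eight."
print(chunk_by_sentence(sample, max_words=6, overlap_sentences=1))
# → ['Alpha one two. Beta three four.', 'Beta three four. Gamma five six.',
#    'Gamma five six. Delta seven eight.']
```

Sensible values for chunk size and overlap depend on the corpus and the embedding model, so it is worth treating them as parameters to tune against a retrieval benchmark rather than constants.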

Context Window Overload

Inserting too many retrieved chunks into the context window can cause the LLM to lose focus on the passages that matter, or can overflow the window so that information is truncated. Techniques like re-ranking (using a more sophisticated model to re-order retrieved chunks by relevance), compression (summarizing retrieved chunks before insertion), and selective retrieval (only retrieving what is needed for the specific query) help manage context effectively.
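As a rough illustration of keeping the context under control, the sketch below takes chunks that have already been scored by a re-ranker and greedily keeps the best ones that fit a budget. The scores are assumed to come from elsewhere, the token count is approximated by whitespace-separated words, and the function name and thresholds are hypothetical.

```python
def select_within_budget(scored_chunks, max_tokens, min_score=0.0):
    """Pick the highest-scoring chunks that fit within max_tokens.

    scored_chunks: list of (chunk_text, relevance_score) pairs, e.g. the
    output of a re-ranking model. Token cost is approximated by word count.
    """
    selected, used = [], 0
    for text, score in sorted(scored_chunks, key=lambda pair: pair[1], reverse=True):
        if score < min_score:
            break  # everything after this point scores even lower
        cost = len(text.split())
        if used + cost > max_tokens:
            continue  # too big for the remaining budget; try smaller chunks
        selected.append(text)
        used += cost
    return selected

candidates = [("a b c", 0.9), ("d e f g h", 0.8), ("i j", 0.7), ("k", 0.1)]
print(select_within_budget(candidates, max_tokens=6, min_score=0.5))
# → ['a b c', 'i j']
```

Skipping an oversized chunk instead of stopping lets a smaller, slightly less relevant chunk use the remaining budget. Whether that trade-off is right depends on the application.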

Hallucination Despite Retrieval

LLMs can still hallucinate even when relevant documents are in context, especially if the documents contain ambiguous information or if the model is not properly instructed to ground its answer in the provided sources. Explicit instructions (“Answer based only on the provided documents. If the answer is not in the documents, say ‘I don’t know’.”) and citation requirements (“Cite the specific document section that supports each claim”) significantly reduce hallucinations.
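The instructions above can be baked into a prompt template, and citations can be checked cheaply after generation. This is a sketch under assumptions: the bracketed-ID citation format, the section IDs, and both function names are invented for the example, and the check only verifies that cited IDs exist, not that the cited text actually supports the claim.

```python
import re

GROUNDING_INSTRUCTIONS = (
    "Answer based only on the provided documents. "
    "If the answer is not in the documents, say 'I don't know'. "
    "Cite the specific document section that supports each claim, "
    "using its ID in square brackets."
)

def build_grounded_prompt(question, sections):
    """sections: dict mapping a section ID to that section's text."""
    docs = "\n".join(f"[{sid}] {text}" for sid, text in sections.items())
    return f"{GROUNDING_INSTRUCTIONS}\n\nDocuments:\n{docs}\n\nQuestion: {question}\nAnswer:"

def unknown_citations(answer, sections):
    """Return cited IDs that do not match any provided section."""
    cited = set(re.findall(r"\[([^\[\]]+)\]", answer))
    return sorted(cited - set(sections))

# Hypothetical sections and answers.
sections = {
    "policy-2.1": "Refunds are accepted within 30 days of purchase.",
    "policy-2.2": "Refunds are issued to the original payment method.",
}
prompt = build_grounded_prompt("How long do I have to request a refund?", sections)

good = "You have 30 days from purchase [policy-2.1]."
bad = "You have 90 days from purchase [policy-9.9]."
print(unknown_citations(good, sections))  # → []
print(unknown_citations(bad, sections))   # → ['policy-9.9']
```

An answer that cites a section that was never provided is a strong signal that something was made up, so a check like this can feed directly into monitoring.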

Advanced RAG Techniques

Hybrid Search

Combining dense vector retrieval (semantic search) with sparse retrieval (keyword-based, like BM25) often outperforms either approach alone. Hybrid search runs both retrievers and merges their ranked results, capturing both semantic matches and exact keyword matches.
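The article does not prescribe how to merge the two result lists. One widely used option is reciprocal rank fusion (RRF), which only needs the rank positions and so sidesteps the problem that dense and sparse scores live on different scales. A minimal sketch, with hypothetical document IDs and the conventional constant `k=60`:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one list.

    Each document earns 1 / (k + rank) for every list it appears in
    (rank starts at 1); documents are returned by descending total score,
    with ties broken alphabetically so the output is deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

dense_results = ["d1", "d2", "d3"]   # e.g. from vector search
sparse_results = ["d3", "d1", "d4"]  # e.g. from BM25
print(reciprocal_rank_fusion([dense_results, sparse_results]))
# → ['d1', 'd3', 'd2', 'd4']
```

Documents that both retrievers agree on (`d1`, `d3`) rise to the top, while documents found by only one retriever are still kept further down the list.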

Query Expansion and Rewrite

User queries are often underspecified or use different terminology than the knowledge base. Query expansion rewrites the user’s query to be more specific, adds synonyms, or generates multiple query variants—all of which improve retrieval recall.
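Here is a deliberately simple sketch of generating query variants from a synonym table. The table, the function name, and the sample query are all hypothetical; in practice the variants might instead come from an LLM or a domain thesaurus. Each variant would then be sent through retrieval and the results merged.

```python
def expand_query(query, synonyms, max_variants=5):
    """Return the lowercased query plus variants with one word swapped.

    synonyms: dict mapping a lowercase word to a list of alternatives.
    """
    tokens = query.lower().split()
    variants = [" ".join(tokens)]
    for i, token in enumerate(tokens):
        for alternative in synonyms.get(token, []):
            variant = " ".join(tokens[:i] + [alternative] + tokens[i + 1:])
            if variant not in variants:
                variants.append(variant)
    return variants[:max_variants]

# Hypothetical synonym table for a support knowledge base.
synonyms = {"laptop": ["notebook"], "broken": ["faulty", "defective"]}
print(expand_query("Return broken laptop", synonyms))
# → ['return broken laptop', 'return faulty laptop',
#    'return defective laptop', 'return broken notebook']
```

Capping the number of variants matters because every extra variant is another retrieval call, which adds latency.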

Evaluation Frameworks

You cannot improve what you cannot measure. RAG evaluation requires assessing both retrieval quality (precision@k, recall@k, MRR) and generation quality (faithfulness to retrieved docs, answer relevance, citation accuracy). Frameworks like RAGAS, TruLens, and custom evaluation pipelines are essential for systematic improvement.
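The retrieval metrics are easy to compute once you have labeled relevant documents per query. The sketch below uses the standard definitions; the function names and sample data are made up, and it assumes each retrieved list contains no duplicate IDs.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved IDs that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant IDs that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / len(relevant)

def mean_reciprocal_rank(runs):
    """runs: list of (retrieved_ids, relevant_ids) pairs, one per query.

    Each query contributes 1 / rank of its first relevant result,
    or 0 if no relevant result was retrieved.
    """
    if not runs:
        return 0.0
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)

retrieved = ["a", "b", "c", "d"]
relevant = {"b", "d", "e"}
print(precision_at_k(retrieved, relevant, k=2))  # → 0.5
print(recall_at_k(retrieved, relevant, k=4))     # 2 of 3 relevant docs found
print(mean_reciprocal_rank([(retrieved, relevant), (["x", "y"], {"x"})]))  # → 0.75
```

Generation-side metrics such as faithfulness and answer relevance are harder to compute mechanically and usually need human labels or a model-based judge, which is where the frameworks mentioned above come in.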

Production Considerations

RAG systems in production need monitoring for retrieval failures, answer quality degradation, and latency. They need versioning for both the knowledge base and the retrieval/generation pipeline. And they need guardrails: what to do when retrieval returns nothing, when the LLM refuses to answer, or when the user asks an out-of-scope question.
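The guardrail cases can be handled by a thin wrapper around the pipeline. The following is a sketch, not a production design: `retrieve` and `generate` are caller-supplied functions with assumed signatures, the score threshold and fallback messages are placeholders, and refusal detection by substring match is a crude heuristic. Out-of-scope questions are treated here as the case where no chunk clears the relevance threshold.

```python
NO_CONTEXT_REPLY = "I could not find anything relevant in the knowledge base."
REFUSAL_REPLY = "I don't know based on the available documents."
REFUSAL_MARKER = "i don't know"

def answer_with_guardrails(query, retrieve, generate, min_score=0.3):
    """Run retrieval and generation with explicit fallbacks.

    retrieve(query) -> list of (chunk_text, score) pairs   (assumed signature)
    generate(query, chunks) -> answer string                (assumed signature)
    Returns a dict with a status that monitoring can count.
    """
    hits = [(chunk, score) for chunk, score in retrieve(query) if score >= min_score]
    if not hits:
        # Nothing retrieved, or nothing relevant enough (likely out of scope).
        return {"status": "no_context", "answer": NO_CONTEXT_REPLY}
    answer = (generate(query, [chunk for chunk, _ in hits]) or "").strip()
    if not answer or REFUSAL_MARKER in answer.lower():
        return {"status": "refused", "answer": REFUSAL_REPLY}
    return {"status": "ok", "answer": answer}

# Stub components standing in for the real retriever and LLM.
def fake_retrieve(query):
    if "refund" in query.lower():
        return [("Refunds are accepted within 30 days.", 0.8)]
    return [("Shipping takes five business days.", 0.1)]

def fake_generate(query, chunks):
    return "Refunds are accepted within 30 days."

print(answer_with_guardrails("What is the refund window?", fake_retrieve, fake_generate)["status"])
# → ok
print(answer_with_guardrails("What is the weather today?", fake_retrieve, fake_generate)["status"])
# → no_context
```

Returning a status alongside the answer makes it straightforward to track how often each fallback fires, which ties the guardrails back to the monitoring requirement.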

By admin

14 thoughts on “Building RAG Systems That Actually Work”
  1. The point about establishing an AI Ethics Committee at the board level is crucial. Without top-down commitment, these initiatives wither.

  2. One thing I would add: AI ethics training for non-technical staff. Everyone touches AI products, everyone needs baseline literacy.

  3. The “ethical debt” concept is real. We are paying for rushed AI deployments from 3 years ago. Great call-out.

  4. The section on ongoing monitoring should be emphasized more. Too many organizations treat AI ethics as a one-time audit.

  5. As someone who works in fintech AI, the fairness metrics discussion was gold. Do you have recommendations for handling intersectional fairness?

  6. The external audit recommendation is spot on. Internal review alone is not credible, as we have seen from multiple high-profile AI failures.

  7. The competitive advantage argument is compelling. We are already seeing talent preferentially join companies with strong AI ethics programs.

  8. One question: how do you handle the tension between explainability and performance? Often the best-performing models are the least interpretable.

  9. I appreciate the concrete examples of technical safeguards. The SHAP and LIME references are particularly valuable for practitioners.

  10. I shared this with our legal team. The regulatory compliance angle is particularly relevant for our EU operations.

  11. The explainability tools you mentioned—any experience with which one works best in production? We are evaluating options.

  12. This balanced perspective is rare. Too much AI writing is either utopian or dystopian. This is grounded and useful.

  13. The matrix of fairness definitions (individual vs. group fairness) would be a great addition to this article.

  14. The bias mitigation techniques you mentioned—reweighting and synthetic augmentation—deserve more detail. Any plans for a technical follow-up?
