Retrieval-Augmented Generation


Note

  • RAG stitches together an information retrieval component and a generation component, the latter handled by an LLM.
  • The trend toward longer context windows does not undermine the importance of RAG.
  • Evaluation is a crucial part of any RAG implementation.
  • RAG >> fine-tuning:
    • It is easier and cheaper to keep retrieval indices up-to-date than do continuous pre-training or fine-tuning.
    • More fine-grained control over how we retrieve documents, e.g. separating different organizations’ access by partitioning the retrieval indices.

One study compared RAG against unsupervised finetuning (a.k.a. continued pretraining), evaluating both on a subset of MMLU and on current events. They found that RAG consistently outperformed finetuning, both for knowledge encountered during training and for entirely new knowledge. Another paper compared RAG against supervised finetuning on an agricultural dataset. Similarly, the performance boost from RAG was greater than that from finetuning, especially for GPT-4 (see Table 20).


Vanilla RAG

  • Simplest pipeline: retrieval with a bi-encoder approach, where embeddings for context documents and queries are computed entirely separately, unaware of each other (see the sketch below).
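
A minimal sketch of this bi-encoder step, assuming the sentence-transformers library and the all-MiniLM-L6-v2 checkpoint (any text-embedding model would do): documents and the query are embedded independently of each other and then compared by cosine similarity.

```python
# Bi-encoder retrieval sketch: documents and query are encoded separately.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

documents = [
    "RAG combines a retriever with a generator (an LLM).",
    "BM25 is a classic keyword-based ranking function.",
    "HNSW is a graph-based approximate nearest neighbor index.",
]

# Each side is embedded without any knowledge of the other.
doc_embeddings = model.encode(documents, normalize_embeddings=True)
query_embedding = model.encode("What is RAG?", normalize_embeddings=True)

# Brute-force cosine similarity between the query and every document.
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
for idx in scores.argsort(descending=True).tolist():
    print(round(float(scores[idx]), 3), documents[idx])
```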


  • At larger scales one needs a vector database or an index that allows approximate search, so that you don’t have to compute cosine similarity between the query and every document
    • Popularized by LLMs, approximate search is based on dense representations of queries and documents in a fixed-size latent vector space
      • Compressing hundreds of tokens into a single vector means losing information.
      • Encoding is mainly supervised via transfer learning (text-embedding, encoder-style transformer models)
      • Vector index (IVF, PQ, HNSW, DiskANN++); see the ANN sketch after this list
  • In practice, vector search is usually combined with keyword search, which helps handle specific terms and acronyms. The extra inference overhead is negligible, but the impact can be decisive for certain queries. The good old method is BM25 (a TF-IDF-style ranking function)
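
For the vector-index bullet above, a minimal approximate-search sketch, assuming the faiss library and its HNSW index; toy random vectors stand in for real document embeddings.

```python
# Approximate nearest neighbor search with an HNSW index (faiss-cpu assumed).
import faiss
import numpy as np

dim = 384                                    # embedding dimensionality (toy value)
doc_vectors = np.random.rand(10_000, dim).astype("float32")

index = faiss.IndexHNSWFlat(dim, 32)         # 32 = graph neighbors per node
index.add(doc_vectors)                       # build the index once, offline

query_vector = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query_vector, 5)   # no full scan over all docs
print(ids[0], distances[0])
```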

Reranking

How to select a reranking model for fine-tuning

  • It’s an iterative process; one cannot just pick the perfect model architecture from the start. Instead, it is better to create a framework for testing several models and evaluating them against specific constraints such as latency, cost requirements, and performance (see the cross-encoder sketch after this list).
  • https://bge-model.com/
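
A minimal reranking sketch, assuming sentence-transformers and the ms-marco-MiniLM-L-6-v2 cross-encoder (a BGE reranker from the link above could be dropped in the same way): unlike a bi-encoder, the cross-encoder scores the query and each candidate document jointly.

```python
# Cross-encoder reranking sketch: score (query, candidate) pairs jointly.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # assumed model

query = "What is RAG?"
candidates = [
    "HNSW is a graph-based approximate nearest neighbor index.",
    "RAG combines a retriever with a generator (an LLM).",
]

scores = reranker.predict([(query, doc) for doc in candidates])
for doc, score in sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True):
    print(round(float(score), 3), doc)
```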

Full MVP vanilla RAG

  • A full MVP vanilla RAG pipeline may look like the sketch below (including a combine-the-scores module).
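
A hedged sketch of such a pipeline, assuming sentence-transformers for the dense leg and rank_bm25 for the keyword leg; the two rankings are combined with reciprocal rank fusion (RRF, using the commonly cited constant k=60), and the generation call is left as a stub since it depends on the LLM client.

```python
# MVP vanilla RAG sketch: dense + BM25 retrieval, RRF score combination,
# then the top chunks are packed into a prompt for the LLM.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

docs = [
    "RAG combines a retriever with a generator (an LLM).",
    "BM25 is a classic keyword-based ranking function.",
    "HNSW is a graph-based approximate nearest neighbor index.",
    "Rerankers score query-document pairs jointly with a cross-encoder.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")              # dense leg
doc_emb = encoder.encode(docs, normalize_embeddings=True)
bm25 = BM25Okapi([d.lower().split() for d in docs])            # keyword leg

def rrf(rankings, k=60):
    """Reciprocal rank fusion over several ranked lists of doc indices."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(fused, key=fused.get, reverse=True)

def retrieve(query, top_k=2):
    dense = util.cos_sim(encoder.encode(query, normalize_embeddings=True), doc_emb)[0]
    dense_rank = dense.argsort(descending=True).tolist()
    keyword_rank = bm25.get_scores(query.lower().split()).argsort()[::-1].tolist()
    return [docs[i] for i in rrf([dense_rank, keyword_rank])[:top_k]]

query = "What does a reranker do?"
context = "\n".join(retrieve(query))
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
# answer = llm.generate(prompt)  # plug in whatever LLM client is in use
print(prompt)
```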

Evaluating information retrieval
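
A small sketch of two standard retrieval metrics, recall@k and reciprocal rank (averaged over queries the latter gives MRR); the doc ids below are made up for illustration.

```python
# Retrieval metrics sketch: recall@k and reciprocal rank for one query.
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant document, 0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["doc7", "doc1", "doc3"]     # ranking produced by the retriever
relevant = {"doc1", "doc9"}              # hand-labelled ground truth
print(recall_at_k(retrieved, relevant, k=3))   # 0.5
print(reciprocal_rank(retrieved, relevant))    # 0.5
```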


Challenges with RAG

  • Lost-in-the-Middle effect (not specific to RAG, but rather a long-context issue)
  • RAG retrieval capabilities are often evaluated with needle-in-a-haystack tasks, but that is not what we usually want in real-world tasks (summarization, joining sub-parts of long documents, etc.) - a knowledge graph may be a good improvement here
  • Multi-hop question answering: reasoning-based queries that require connecting information from multiple sources - a pre-constructed knowledge graph or agentic AI workflows are potential solutions
  • Documents’ encoding failures due to formats, tables, or unexpected encodings (e.g. UTF-8 vs Latin-1) silently shrink your knowledge index - monitor processed data at each step, implement error handling, be careful with off-the-shelf PDF extractors
  • Irrelevant documents accumulate and wait to be retrieved for some query, like ticking time bombs - careful curation, metadata filtering
    • Documents can also become irrelevant with time (index staleness); the database needs to stay up-to-date - timestamp metadata filtering
  • Privacy or access rights can be compromised when the same RAG index serves different users
  • The database may contain factually wrong or outdated info (sometimes alongside the correct info) - increase data-quality checks or improve model robustness, put more weight on more recent documents, filter by date
  • The relevant document is missing from the top-K retrievals - improve the embedder or the reranker
  • The relevant document was chopped during context retrieval - use an LLM with a larger context size or improve the context-retrieval mechanism
  • The relevant document got into the top-K, but the LLM didn’t use that info for output generation - finetune the model on contextual data or reduce the noise level in the retrieved context
  • The LLM output does not follow the expected format - finetune the model or improve the prompt
  • Pooling dilutes long-text representations: during the encoding step each token receives a representation, and a pooling step (typically averaging) then collapses them into one vector for the whole query or passage (query sentence -> one vector); see the pooling sketch after this list
  • The chunking strategy is a hyperparameter, and it is not independent of the others.
  • Arbitrary queries
    • Low-information or vague queries (e.g. “health tips”) - detect them through heuristics or classifiers and ask users for clarification
    • Off-topic queries - intent recognition and a fallback scenario
  • Temporal data
    • Challenging because the model needs to keep track of the order of events and their consequences
    • Examples: medical/prescription records, Fed speeches, economic reports
    • Present chunks chronologically; explore the effect of ascending vs descending order
    • Two-stage approach: let the model first extract and reorganize the relevant info, then reason about it
    • Mine reasoning chains from users to create training data
  • Hallucinations - mitigate with inline citations
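
For the pooling point above, a minimal mean-pooling sketch in plain numpy (the token count and embedding size are made-up values): all token representations are averaged into a single vector, which is what dilutes long texts.

```python
# Mean pooling sketch: many token embeddings collapse into one vector.
import numpy as np

rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(128, 384))   # 128 tokens x 384-dim embeddings
attention_mask = np.ones(128)                    # 1 = real token, 0 = padding

# Average only over non-padding tokens -> one vector for the whole text.
masked = token_embeddings * attention_mask[:, None]
text_embedding = masked.sum(axis=0) / attention_mask.sum()
print(text_embedding.shape)   # (384,)
```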

Advanced RAG techniques


Other topics

Inference Scaling for Long-Context Retrieval Augmented Generation

Resources

Data Talks Club

DTC - LLM Zoomcamp
RAG in Action: Next-Level Retrieval Augmented Generation - Leonard Püttmann - YouTube
Implement a Search Engine - Alexey Grigorev - YouTube

deeplearning.ai

DLAI - Building and Evaluating Advanced RAG

