Retrieval-Augmented Generation

Note

  • RAG stitches together an information retrieval component and a generation component, with the latter handled by an LLM.
  • The trend toward longer context windows does not undermine the importance of RAG.
  • Evaluation is a crucial part of any RAG implementation.
  • RAG >> fine-tuning:
    • It is easier and cheaper to keep retrieval indices up-to-date than to do continued pretraining or fine-tuning.
    • More fine-grained control over how we retrieve documents, e.g. separating different organizations’ access by partitioning the retrieval indices.

One study compared RAG against unsupervised finetuning (aka continued pretraining), evaluating both on a subset of MMLU and on current events. They found that RAG consistently outperformed finetuning, both for knowledge encountered during training and for entirely new knowledge. Another paper compared RAG against supervised finetuning on an agricultural dataset; again, the performance boost from RAG was greater than from finetuning, especially for GPT-4 (see Table 20).


Vanilla RAG

  • The simplest retrieval pipeline uses the bi-encoder approach, where embeddings for context documents and queries are computed entirely separately, unaware of each other.


  • At larger scales one needs a vector database or an index that allows approximate search, so that you don’t have to compute the cosine similarity between the query and every single document
    • Popularized by LLMs, approximate search is based on dense representations of queries and documents in a fixed-size latent vector space
      • Compressing hundreds of tokens into a single vector means losing information.
      • Encoding is mainly learned via supervised transfer learning (encoder-style transformer models for text embedding)
      • Vector index (IVF, PQ, HNSW, DiskANN++)
  • In practice, vector search is almost always combined with keyword search, which helps with handling specific terms and acronyms. Its inference overhead is negligible, but the impact can be unbeatable for certain queries. The good old method is BM25 (TF-IDF family); a sketch combining both searches follows below.
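
A minimal sketch of this hybrid setup, assuming `sentence-transformers` and `rank-bm25` are installed; the model name, toy corpus, and score-normalization scheme are illustrative choices, not recommendations:

```python
# Dense bi-encoder retrieval fused with BM25 keyword search (toy example).
from sentence_transformers import SentenceTransformer, util
from rank_bm25 import BM25Okapi

corpus = [
    "RAG combines a retriever with an LLM generator.",
    "BM25 is a classic TF-IDF-style ranking function.",
    "HNSW is a graph-based index for approximate nearest-neighbor search.",
]

# Dense side: documents and the query are embedded completely independently (bi-encoder).
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
doc_emb = encoder.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)

query = "What does BM25 do?"
q_emb = encoder.encode(query, convert_to_tensor=True, normalize_embeddings=True)
dense_scores = util.cos_sim(q_emb, doc_emb)[0]  # cosine similarity per document

# Keyword side: BM25 over whitespace-tokenized documents.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
keyword_scores = bm25.get_scores(query.lower().split())

# Naive fusion: rank by the sum of min-max normalized scores.
def normalize(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo + 1e-9) for x in xs]

fused = [d + k for d, k in zip(normalize(dense_scores.tolist()), normalize(list(keyword_scores)))]
for i in sorted(range(len(corpus)), key=lambda i: fused[i], reverse=True):
    print(f"{fused[i]:.3f}  {corpus[i]}")
```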

Re-ranking

  • To fix the disadvantage of the bi-encoder approach (documents’ and queries’ representations are computed separately), we can add a re-ranking stage with a cross-encoder as an extra step before calling the generator model; a minimal sketch follows after this list. ^2dc17c
    • The idea is to use a powerful, computationally expensive model to score only the subset of documents previously retrieved by a cheaper, more efficient model; scoring every query-document pair with the expensive model is not computationally feasible.
    • A typical re-ranking solution uses open-source Cross-Encoder models from sentence-transformers, which take both the question and a context passage as input and return a relevance score (often between 0 and 1). It is also possible to use GPT-4 plus prompt engineering.
    • Originally, a cross-encoder is a binary classifier whose probability of the positive class is taken as the similarity score. There are now also T5-based re-rankers, RankGPT, …
    • Generally, the bi-encoder is looser (favoring recall) and the re-ranker is stricter (favoring precision)
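
A minimal re-ranking sketch with a sentence-transformers CrossEncoder; the model name and candidate passages are placeholders, and in a real pipeline the candidates would come from the bi-encoder / BM25 stage above:

```python
# Cross-encoder re-ranking: the model sees query and document together,
# so it can capture their interaction (unlike the bi-encoder).
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # assumed model choice

query = "How do I keep a retrieval index up to date?"
candidates = [
    "Retrieval indices can be rebuilt nightly from the document store.",
    "Fine-tuning bakes knowledge into the model weights.",
    "Cosine similarity compares two embedding vectors.",
]

# Score every (query, candidate) pair jointly and sort best-first.
scores = reranker.predict([(query, doc) for doc in candidates])
for score, doc in sorted(zip(scores, candidates), reverse=True):
    print(f"{score:.3f}  {doc}")
```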

Full MVP vanilla RAG

  • A full MVP vanilla RAG pipeline may look like this, including a module that combines the scores from the keyword and vector searches (a sketch of that score-combination step follows below)
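
One common way to implement the score-combination module is reciprocal rank fusion (RRF); treat this as a sketch rather than the exact module from the pipeline above. Document IDs are made up, and k=60 is the constant commonly used in the RRF literature:

```python
# Reciprocal rank fusion: combine several ranked lists using only ranks,
# which avoids having to normalize BM25 scores against cosine similarities.
def reciprocal_rank_fusion(result_lists, k=60):
    """result_lists: iterable of ranked lists of document IDs (best first)."""
    fused = {}
    for ranking in result_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

bm25_ranking = ["doc3", "doc1", "doc7"]    # from keyword search
vector_ranking = ["doc1", "doc3", "doc9"]  # from dense vector search

for doc_id, score in reciprocal_rank_fusion([bm25_ranking, vector_ranking]):
    print(f"{score:.4f}  {doc_id}")
```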

Evaluating information retrieval

  • See general approach to evaluating LLMs: How to evaluate LLMs? and how to evaluate LLM chatbots
  • The impact of RAG depends on the quality of the retrieved documents, which in turn is evaluated along several dimensions:
    • relevance: how good the system is at ranking relevant documents higher and irrelevant documents lower (a sketch of common rank metrics follows after this list)
    • information density: if two documents are equally relevant, we should prefer the one that is more concise and has fewer extraneous details
    • level of detail:
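
Relevance is typically quantified with rank-based metrics such as recall@k, MRR, or nDCG. A minimal sketch of the first two, with hypothetical retrieved results and gold labels:

```python
# Rank-based relevance metrics for the retriever: recall@k and reciprocal rank.
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top-k retrieved list."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant document (0 if none was retrieved)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["doc4", "doc1", "doc8", "doc2"]  # ranked retriever output for one query
relevant = {"doc1", "doc2"}                   # gold labels for that query

print("recall@3:", recall_at_k(retrieved, relevant, k=3))  # 0.5
print("RR:      ", reciprocal_rank(retrieved, relevant))   # 0.5 (first hit at rank 2)
```

In practice these are averaged over a query set (MRR is the mean of the reciprocal ranks).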

Challenges with RAG

  • See Common issues due to tokenization
  • Lost in the Middle effect (not specific to RAG, but rather to long contexts in general)
  • RAG retrieval capabilities are often evaluated with needle-in-a-haystack tasks, but that is not what we usually want in real-world tasks (summarization, joining sub-parts of long documents, etc.); knowledge graphs may be a good improvement here
  • The database needs to be kept up-to-date at all times
  • Multi-hop question answering - knowledge graphs can help here
  • Privacy or access rights can be compromised when the same RAG index is used by different users
  • The database may contain factually wrong or outdated info (sometimes alongside the correct info) - increase data quality checks or improve model robustness, put more weight on more recent documents, filter by date
  • Relevant document is missing in top-K retrievals - improve the embedder or re-ranker
  • Relevant document was truncated during context retrieval - use an LLM with a larger context size or improve the mechanism of context retrieval
  • Relevant document got into the top-K, but the LLM didn’t use that info for output generation - finetune the model on contextual data or reduce the noise level in the retrieved context
  • LLM output does not follow the expected format - finetune the model or improve the prompt
  • Pooling dilutes long-text representations: during encoding, each token in the query receives its own representation, and a pooling step (typically averaging) then collapses them into one vector for all tokens (whole query/document ---> one vector); see the sketch after this list
  • Requires chunking
    • One doc - many chunks and vectors.
    • Retrieve docs or chunks?
  • Fixed vocabulary
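
A minimal sketch of the mean-pooling step behind that dilution: per-token embeddings are averaged into a single fixed-size vector, so individual token detail gets washed out as the text grows. Shapes and values are illustrative stand-ins for real encoder outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, dim = 512, 384                              # e.g. a long chunk, small embedding dim
token_embeddings = rng.normal(size=(num_tokens, dim))   # stand-in for per-token encoder outputs

# One vector for the whole sequence, regardless of how many tokens went in.
sentence_embedding = token_embeddings.mean(axis=0)
print(sentence_embedding.shape)                         # (384,)
```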

Advanced RAG techniques


Other topics

Inference Scaling for Long-Context Retrieval Augmented Generation

Resources

Data Talks Club

DTC - LLM Zoomcamp
RAG in Action: Next-Level Retrieval Augmented Generation - Leonard Püttmann - YouTube
Implement a Search Engine - Alexey Grigorev - YouTube

deeplearning.ai

DLAI - Building and Evaluating Advanced RAG

