Retrieval-Augmented Generation

_{scroll ↓ to Resources}

Note
Vanilla RAG
- [[#Vanilla RAG#Re-ranking|Re-ranking]]
- [[#Vanilla RAG#Full MVP vanilla RAG|Full MVP vanilla RAG]]
Evaluating information retrieval
Challenges with RAG
Advanced RAG techniques
Other topics
Resources

Note

RAG is stitching together information retrieval and generation parts, where the latter is handled by LLMs.
Tendency for longer context length does not undermine the importance of RAG
evaluation is a crucial part of RAG implementation
RAG >> fine-tuning:
- It is easier and cheaper to keep retrieval indices up-to-date than do continuous pre-training or fine-tuning.
- More fine-grained control over how we retrieve documents, e.g. separating different organizations’ access by partitioning the retrieval indices.

One study compared RAG against unsupervised finetuning (aka continued pretraining), evaluating both on a subset of MMLU and current events. They found that RAG consistently outperformed finetuning for knowledge encountered during training as well as entirely new knowledge. In another paper, they compared RAG against supervised finetuning on an agricultural dataset. Similarly, the performance boost from RAG was greater than finetuning, especially for GPT-4 (see Table 20).

Vanilla RAG

Simplest pipeline for retrieval with bi-encoder approach, when context documents and queries are computed entirely separately, unaware of each other.

At larger scales one needs a vector database or an index to allow approximate search so that you don’t have to compute all cosine similarity between each query and document
- Promoted by LLM, approximate search is based on dense representation of queries and documents in latent fixed vector space
  - Compressing hundreds of tokens into a signal vector means losing information.
  - encoding is mainly supervised via transfer learning (text embedding encoder-style transformer models)
  - Vector index (IVF, PQ, HNSW, DiskANN++)
vector search is always combined with keyword search, which help handling specific terms and acronyms. Their inference computational overhead is unnoticeable, but the impact can be unbeatable for certain queries. Good old method is BM25 (TF-IDF)

Re-ranking

To fix the disadvantage of the bi-encoder approach when documents’ and queries’ representations are computed separately, we can add another re-ranking stage with cross-encoder as an extra step, before calling the generator model. ^2dc17c
- The idea is to use a powerful, computationally expensive model to score only a subset of your documents, previously retrieved by a more efficient and cheap model. It is not computationally feasible for each query-document pair.
- A typical re-ranking solution uses open-source Cross-Encoder models from sentence transformers, which take both the question and context as input and return a score from 0 to 1. Though it is also possible to use GPT4 + prompt engineering.gg
- Originally, cross-encoder is a binary classifier where the probability of being a positive class is taken as a similarity score. Now there are also T5-based re-rankers, RankGPT, …
- generally bi-encoder is more loose and reranker is more strict
- Search reranking with cross-encoders
- Retrieve & Re-Rank — Sentence Transformers documentation

Full MVP vanilla RAG

Full MVP vanilla RAG pipeline may look like this (with combine the scores module too)

Evaluating information retrieval

See general approach to evaluating LLMs: How to evaluate LLMs? and how to evaluate LLM chatbots
RAG impact is dependent on the quality of retrieved documents, which in turn is evaluated by:
- relevance: how good the system is at ranking relevant documents higher and irrelevant documents lower
- information density: if two documents are equally relevant, we should prefer one that’s more concise and has fewer extraneous details
- level of detail:

Metrics

Metrics

recall@k: all the relevant

precision@k: nothing, but relevant

Normalized Discounted Cumulative Gain@k: NDCG: What It Is and Where To Use It? AI Essential Lessons

Mean Reciprocal Rank

LGTM@10

Industry-based:

engagement: click, add, dwell

Revenue

Multi-objective ranking, not just optimizing relevance.

Mastering RAG: 8 Scenarios To Evaluate Before Going To Production - Galileo

GitHub - amazon-science/RAGChecker: RAGChecker: A Fine-grained Framework For Diagnosing RAG

Build your own relevance dataset

Build your own relevance dataset

Sample user queries and their outputs from a production RAG and put a few hours to rank the results

Irrelevant, somehow relevant, highly relevant.

Static collection is preferred for consistency

If there is no existing system, use LLM to generate queries for your content.

Create a good prompt for LLM-as-a-judge to achieve automatic ranking at the same level as your own manual

Prompt example

Given a query and a passage, you must provide a score on an integer scale of 0 to 3 with the following meanings:
0 = represent that the passage has nothing to do with the query,
1 = represents that the passage seems related to the query but does not answer it,
2 = represents that the passage has some answer for the query, but the answer may be a bit unclear, or hidden amongst extraneous information and
3 = represents that the passage is dedicated to the query and contains the exact answer. Important Instruction: Assign category 1 if the passage somewhat related to the topic but not completely, category 2 if passage presents something very important related to the entire topic but also has some extra information and category 3 if the passage only and entirely refers to the topic. If none of the above satisfies give it category 0.
Query: (query)
Passage: (passage)
Split this problem into steps:
Consider the underlying intent of the search.
Measure how well the content matches a likely intent of the query (M) .
Measure how trustworthy the passage is (T).
Consider the aspects above and the relative importance of each, and decide on a final score (0). Final score must be an integer value only.
Do not provide any code in result. Provide each score in the format of: ##final score: score without providing any reasoning.

Challenges with RAG

See Common issues due to tokenization
Lost in the Middle effect (not specific to RAG, but rather long context)
RAG retrieval capabilities are often evaluated using needle in a haystack tasks, but that is not what we usually want in real world tasks (summarization, joining of sub-parts of long documents, etc.) ⇒ knowledge graphs may be a good improvement for this
Database needs to be always up-to-date
multi-hop question answering → knowledge graphs
Privacy or access rights can be compromised when RAG is used by various users
Database may contain factual or outdated info (sometimes along with the correct info) ⇒ increase data quality checks or improve model robustness, put more weight on more recent documents, filter by date
Relevant document is missing in top-K retrievals -⇒ improve the embedder or re-ranker
Relevant document was chopped during context retrieval -⇒ use LLM with larger context size or improve the mechanism of context retrieval
Relevant document got into top-K, but the LLM didn’t use that info for output generation ⇒ finetune the model for the contextual data or reduce the noise level in the retrieved context
LLM output is not following expected format -⇒ finetune the model, or improve the prompt
Pooling dilutes long text representation: During the encoding step, each token in the query receives a representation and then there is a pooling step which is typically averaging to provide one vector for all tokens in a query (query-sentence ---> one vector)
Requires chunking
- One doc -⇒ many chunks and vectors.
- Retrieve docs or chunks?
Fixed vocabulary

Advanced RAG techniques

from Advanced RAG techniques

Advanced improvements to RAG

Most probably you have to chunk your context data into smaller pieces. Chunking strategy can have a huge impact on RAG performance.

small chunks ⇒ limited context ⇒ incomplete answers

large chunks ⇒ noise in data ⇒ poor recall

By symbols, sentences, semantic meaning, using dedicated model or an LLM call

semantic chunking by detecting where the change of topic has happened

Consider inference latency, number of tokens embedding models were trained on

Overlapping or not?

Use small chunks on embedding stage and large size during the inference, by appending adjacent chunks before feeding to LLM

re-ranking

see Re-ranking

query expansion and enhancement

Another LLM-call-module can be added to rewrite and expand the initial user query by adding synonyms, rephrasing, complementing with initial LLM output (without RAG context), etc.

In addition, to dense embedding models, historically, there are also sparse representation methods. These can and should be used in addition to vector search, resulting in hybrid search ^f44082

encoding is supervised (e.g splade) or unsupervised (e.g BM25, TF-IDF)

search accelerated with top-k retrieval algorithms like WAND, MaxScore, BM-WAND and more

Using hybrid search (at least full-text + vector search) is standard to RAG, but it requires combining several scores into one ^6fd281

use weighted average

take several top-results from each search module

use Reciprocal Rank Fusion, Mean Average Precision, NDCG, etc.

metadata filtering reduces the search space, hence, improves retrieval and reduces computational burden

dates. freshness, source authority (for health datasets), business-relevant tags

categories: use entity detection models: GliNER

if there is no metadata, one can ask LLM to generate it

Shuffling context chunks will create randomness in outputs, which is comparable to increasing diversity of the downstream output (as an alternative to hyperparameter tuning using softmax temperature) - e.g. previously purchased items are provided in random order to make recommendation engine output more creative

One can generate summary of documents (or questions to each chunk\document) and embed that info too

Not RAG-specific

Off-the-shelf bi-encoders embedding models) can be fine-tuned like any other model, but it is barely done on practice by anyone as there are much lower hanging fruits

Other

AutoML tool for RAG - auto-configuring your RAG

Contextual Retrieval \ Anthropic

Query Classification / Routing - save resources by pre-defining when the query doesn’t need external context and can be answered directly or using chat history.

Multi-modal RAG, in case your queries need access to images, tables, video, etc. Then you need a multi-modal embedding model too.

Self-RAG, Iterative RAG

Hierarchical Index Retrieval - first search for a relevant book, then chapter, etc.

Graph-RAG

Chain-of-Note

Contextual Document Embeddings

Link to original

Resources

paper review Seven Failure Points When Engineering a Retrieval Augmented Generation System - YouTube or Семь точек отказа RAG систем | Дмитрий Колодезев
Back to Basics for RAG w/ Jo Bergum - YouTube
Mastering RAG: How to Select A Reranking Model - Galileo
==Systematically improving RAG applications – Parlance==
Подбор гиперпараметров RAG-системы с помощью Optuna / Хабр
A Beginner-friendly and Comprehensive Deep Dive on Vector Databases: ArchiveBox from dailydoseofds
RAG From Scratch: Part 1 (Overview) - YouTube
Local Retrieval Augmented Generation (RAG) from Scratch (step by step tutorial) - YouTube: 5 hours step by step hands-on tutorial
Retrieval-Augmented Generation for Large Language Models: A Survey

Data Talks Club

DTC - LLM Zomcamp
RAG in Action: Next-Level Retrieval Augmented Generation - Leonard Püttmann - YouTube
Implement a Search Engine - Alexey Grigorev - YouTube

deeplearning.ai

DLAI - Building and Evaluating Advanced RAG

Links to this File

table file.tags from [[]] and !outgoing([[]])  AND -"Changelog"

Notitia Restante 🌱

On this site

Retrieval-Augmented Generation

Contents

Note

Vanilla RAG

Re-ranking

Full MVP vanilla RAG

Evaluating information retrieval

Metrics

Build your own relevance dataset

Prompt example

Challenges with RAG

Advanced RAG techniques

Advanced improvements to RAG

Not RAG-specific

Other

Other topics

Inference Scaling for Long-Context Retrieval Augmented Generation

Resources

Data Talks Club

deeplearning.ai

Links to this File

Graph View

On this page

Backlinks

Recent

ResNet

image retrieval

image segmentation

contrastive loss

test Lowe