Evaluating information retrieval
scroll ↓ to Resources
Contents
- Related
- Note
- Build your own relevance dataset
    - Real data
    - Synthetic data
- Types of experiments
    - [[#Types of experiments#System architecture decisions|System architecture decisions]]
    - [[#Types of experiments#Other|Other]]
- Metrics
    - [[#Metrics#Technical|Technical]]
    - [[#Metrics#Use-case defined|Use-case defined]]
- Resources
Related
- See general approach to evaluating LLMs: How to evaluate LLMs? and how to evaluate LLM chatbots
Note
- end-to-end model evaluation is challenging unless we expect a single short answer. Evaluate separately:
- information extraction (did the system find the correct information?)
- reasoning (given correct information, did the system make the right conclusions?)
- output generation (was the final response clear and actionable?)
- some domains are easier than others
- coding: does the code pass tests?
- user feedback or the way they interact with the results can be the ultimate metric
- for AI-generated emails: do users make edits before sending?
- when evaluating the performance of the system you have, don’t forget to register what is missing
- Inventory issues - lack of data to fulfill certain user requests. A better algorithm can’t help with that.
- Capability issues - Functionality gaps where a system can’t perform certain types of queries or filters.
- RAG impact is dependent on the quality of retrieved documents, which in turn is evaluated by:
- relevance: how good the system is at ranking relevant documents higher and irrelevant documents lower
- information density: if two documents are equally relevant, we should prefer one that’s more concise and has fewer extraneous details
- apparently, built-in Google PDF processing encodes each page with a fixed number of tokens ⇒ retrieval from a dense page will be worse because of higher data compression
- level of detail
- Separate retrieval evaluations vs generation evals and focus on the retrieval part first
- retrieval is cheap, generation expensive
- generation comes later in the pipeline and assumes the retrieval is correct
- Group your evaluation set queries by difficulty into N groups (e.g. 5x20) and only start evaluating the next group once you reach the desired accuracy or recall on simpler questions.
- Build your own relevance dataset (see the dedicated section below)
    - statistically validate potential improvements to quantify confidence in performance differences and avoid investing in unreliable improvements (see the sketch after this list)
        - create a `@dataclass ExperimentConfig` and functions to sample from the available data and calculate metrics: bootstrap on N samples with various RAG configurations and compute confidence intervals
            - plot Recall@k for different k for pairs of `ExperimentConfig`
                - if confidence intervals are too large, increase N
                - if the confidence intervals of two different configurations overlap, it is possible that the difference in performance was due to chance
            - a t-test is another way to tell whether the difference in the means of the two configurations is due to chance
                - use the distribution of the means from bootstrapping, not the means themselves
                - a high p-value and a low t-statistic point to NO statistical significance
- if you have a number of tools for various search use cases (somewhat similar to intent recognition), evaluate them independently
    - ask the model to make a plan for which tools to use; track plan acceptance rates by users
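A minimal sketch of the bootstrap comparison above, assuming per-query Recall@k values have already been computed for two configurations; the `ExperimentConfig` fields and the use of `numpy`/`scipy` are my assumptions, not part of the original workflow:

```python
from dataclasses import dataclass

import numpy as np
from scipy import stats


@dataclass
class ExperimentConfig:
    # Hypothetical knobs for one retrieval setup
    embedding_model: str
    chunk_size: int
    top_k: int


def bootstrap_means(per_query_recall: list[float], n_boot: int = 1000, seed: int = 0) -> np.ndarray:
    """Resample queries with replacement; return n_boot bootstrapped mean Recall@k values."""
    values = np.asarray(per_query_recall)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    return values[idx].mean(axis=1)


def compare_configs(recall_a: list[float], recall_b: list[float]) -> None:
    """Report 95% confidence intervals and a t-test on the bootstrap distributions of the means."""
    boot_a, boot_b = bootstrap_means(recall_a), bootstrap_means(recall_b, seed=1)
    ci_a = np.percentile(boot_a, [2.5, 97.5])
    ci_b = np.percentile(boot_b, [2.5, 97.5])
    t_stat, p_value = stats.ttest_ind(boot_a, boot_b)
    print(f"A: mean={boot_a.mean():.3f}, 95% CI=[{ci_a[0]:.3f}, {ci_a[1]:.3f}]")
    print(f"B: mean={boot_b.mean():.3f}, 95% CI=[{ci_b[0]:.3f}, {ci_b[1]:.3f}]")
    # Overlapping CIs, a high p-value and a low |t| all point to "no significant difference"
    print(f"t={t_stat:.2f}, p={p_value:.4f}")
```

Repeating `compare_configs` per k then gives the Recall@k curves for a pair of configurations; if the intervals stay wide, increase N.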
Build your own relevance dataset
Real data
- If available, sample user queries and their outputs from a production RAG and put in the time to rank the results yourself or with the help of an LLM
- collect unstructured feedback (comments, issue reports)
- hierarchical clustering to identify patterns and create a taxonomy of categories
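One possible implementation of the clustering step above, assuming comments are embedded into vectors first (the `embed` callable is a placeholder, and `scipy` hierarchical clustering is just one option):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def cluster_feedback(comments: list[str], embed, distance_threshold: float = 1.0) -> dict[str, int]:
    """Group feedback comments into rough categories via agglomerative (Ward) clustering."""
    vectors = np.vstack([embed(c) for c in comments])   # one embedding vector per comment
    tree = linkage(vectors, method="ward")              # hierarchical clustering
    labels = fcluster(tree, t=distance_threshold, criterion="distance")
    return dict(zip(comments, labels))                  # comment -> cluster id, to be named by hand
```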
Synthetic data
- If there is no existing system, use an LLM to generate queries for your content. See synthetic data generation for RAG evaluation
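A minimal sketch of one way to do this over your own chunks; the prompt wording and helper are hypothetical, and the actual LLM call is left out:

```python
SYNTHETIC_QUERY_PROMPT = """\
You are helping build a retrieval evaluation set.
Write {n_queries} realistic user questions that the document chunk below
answers well. Return one question per line.

Chunk:
{chunk}
"""


def build_prompts(chunks: list[str], n_queries: int = 3) -> list[str]:
    # Each generated question is later paired with its source chunk as a known-relevant label
    return [SYNTHETIC_QUERY_PROMPT.format(n_queries=n_queries, chunk=c) for c in chunks]
```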
Types of experiments
- Prioritize experiments based on potential impact and required resources; log everything and present it in a tidy format
System architecture decisions
- different embedding models by size, dimensionality, developer, … (e.g. `text-embedding-3-small` vs `text-embedding-3-large`)
- chunking strategy
- test how the re-ranking performance changes depending on the N chunks we pass in
- hybrid search vs vector search-only
- formatting of your documents (markdown, yaml, json, xml)
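One lightweight way to keep these architecture experiments organized is a configuration grid; the specific values below are illustrative assumptions, not recommendations:

```python
from itertools import product

# Hypothetical grid of architecture choices to benchmark one combination at a time
embedding_models = ["text-embedding-3-small", "text-embedding-3-large"]
chunk_sizes = [256, 512, 1024]              # tokens per chunk
doc_formats = ["markdown", "json", "xml"]
search_modes = ["vector", "hybrid"]

experiment_grid = [
    {"embedding_model": m, "chunk_size": c, "doc_format": f, "search_mode": s}
    for m, c, f, s in product(embedding_models, chunk_sizes, doc_formats, search_modes)
]
# Run retrieval + metrics for every entry, log everything, and present the results in a tidy table
```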
Other
- invariance testing
- the system’s output shouldn’t change due to rephrasing, shorter/longer queries, abbreviations, or changes to irrelevant details (names, genders, etc.)
- Experiment with top-K sampling by cosine similarity versus re-ranking down to a top N (N << K) to see how to get better recall with fewer passed-on chunks
- compare latency trade-offs
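A sketch of the top-K vs re-ranked top-N comparison, including the latency cost of the reranker; both callables are placeholders I am assuming, not a specific library API:

```python
import time
from typing import Callable


def compare_rerank(
    query: str,
    vector_search: Callable[[str, int], list[str]],  # (query, k) -> doc ids by cosine similarity
    rerank: Callable[[str, list[str]], list[str]],   # (query, candidates) -> reordered doc ids
    k: int = 50,
    n: int = 5,
) -> tuple[list[str], list[str], float]:
    """Return plain top-n, reranked top-n (from a wider top-k shortlist), and reranking latency."""
    candidates = vector_search(query, k)
    start = time.perf_counter()
    reranked = rerank(query, candidates)[:n]
    latency = time.perf_counter() - start
    return candidates[:n], reranked, latency
```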
Metrics
Technical
- recall@k: include everything that is relevant
- Most current models are very good at recall because they are optimized for needle-in-a-haystack tests, but their sensitivity to irrelevant information is less well optimized, so don’t forget about precision
- precision@k: do not include anything irrelevant
- Normalized Discounted Cumulative Gain@k (NDCG@k): rewards placing the most relevant documents near the top, normalized by the ideal ordering
- Mean Reciprocal Rank (MRR): average of 1/rank of the first relevant result
- LGTM@10
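A sketch of the first four metrics, assuming binary relevance labels for Recall/Precision/MRR and graded gains for NDCG; the function names and signatures are my own, not from a specific library:

```python
import math


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Recall@k: share of all relevant documents that appear in the top k."""
    return len(set(retrieved[:k]) & relevant) / len(relevant) if relevant else 0.0


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Precision@k: share of the top k results that are relevant."""
    return sum(doc in relevant for doc in retrieved[:k]) / k


def ndcg_at_k(retrieved: list[str], gains: dict[str, float], k: int) -> float:
    """NDCG@k: discounted cumulative gain of the ranking, normalized by the ideal ordering."""
    dcg = sum(gains.get(doc, 0.0) / math.log2(i + 2) for i, doc in enumerate(retrieved[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg else 0.0


def mrr(runs: list[tuple[list[str], set[str]]]) -> float:
    """Mean Reciprocal Rank: average of 1/rank of the first relevant result over queries."""
    total = 0.0
    for retrieved, relevant in runs:
        total += next((1.0 / r for r, doc in enumerate(retrieved, 1) if doc in relevant), 0.0)
    return total / len(runs)
```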
Use-case defined
- end-user-engagement: click, add, dwell
- human-involved evaluation, but not end-users:
- for instance, an AI system generates emails to prospective buyers with several price options, conditional discounts and other upselling tricks. Before sending, these emails are reviewed by salespeople. If they make corrections or edits to the email, we consider that something went wrong in the model’s reasoning and analyze the pitfall.
- satisfaction feedback
- ratio of FAQ requests forwarded to a live agent
- Revenue
- Multi-objective ranking, not just optimizing relevance.
Resources
- Evaluating the Effectiveness of LLM-Evaluators (aka LLM-as-Judge)
- Mastering RAG: 8 Scenarios To Evaluate Before Going To Production - Galileo
- RAGChecker: A Fine-grained Framework For Diagnosing RAG
Links to this File
table file.inlinks, file.outlinks from [[]] and !outgoing([[]]) AND -"Changelog"