synthetic data generation for RAG evaluation
Note
- treat this as a dynamic dataset, not a static one: as the system improves, the evaluation data must become more challenging
Steps
- Chunk filtering
- pre-filter the documents/chunks by relevance to users (probability of being queried)
- Aligned LLM-as-a-judge
- manually label a small part of the documents as relevant or irrelevant
- Iterate on the LLM-as-a-judge criteria until its labels match the manual labels (see the relevance-judge sketch below the Steps list)
- Label the rest of the data
- Use context, tags, metadata, date ranges
- contextual chunk rewriting (optional)
- expensive if run on every chunk
- identify chunks that require extra context, such as tables, images, … (see the rewriting sketch below the Steps list)
- Query generation
- generate questions from documents
- use few-shot examples and document context to create realistic queries, both in content and in formulation/format
- "What is the purpose of X in Y?" is too clean and too easy to search for; a real user is more likely to ask something like "X is not working" (see the query-generation sketch below the Steps list)
- review and validate by domain experts
- Ranking generation from questions and chunks
- Create a prompt for the LLM-as-a-judge so that its automatic ranking matches the quality of your own manual ranking (see the grading sketch below the Steps list)
- summarization of ingested documents (optional) ^ea0ca7
- cost-efficiency of summarization drops for modern models with large context windows
- consider a separate search-summaries tool and use summarized chunks as a supplement to the raw data
- design summarization prompts with use-case needs in mind
- good for financial reports; if numbers are crucial, make sure the model retains them and sums them up correctly (see the summarization sketch below the Steps list)
- multi-media content without text captions
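
A minimal sketch of the chunk filtering and judge-alignment steps, assuming the OpenAI Python SDK; the model name, prompt wording, and RELEVANT/IRRELEVANT labels are illustrative assumptions, not a fixed recipe:

```python
# Sketch: LLM-as-a-judge for chunk relevance, aligned against a small hand-labeled set.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You label documentation chunks for a RAG system.
Answer RELEVANT if a real user is likely to ask a question that this chunk answers,
otherwise answer IRRELEVANT. Consider tags, metadata, and dates if present.

Chunk:
{chunk}

Answer with a single word: RELEVANT or IRRELEVANT."""

def judge_chunk(chunk: str) -> bool:
    """True if the judge considers the chunk likely to be queried."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(chunk=chunk)}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper().startswith("RELEVANT")

def alignment(labeled: list[tuple[str, bool]]) -> float:
    """Fraction of manual labels the judge reproduces on the hand-labeled subset."""
    hits = sum(judge_chunk(chunk) == label for chunk, label in labeled)
    return hits / len(labeled)
```

The idea is to iterate on `JUDGE_PROMPT` until `alignment()` on the hand-labeled subset approaches 1.0, then let the judge label the rest of the data.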
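
For the optional contextual chunk rewriting step, a sketch with cheap heuristics so the expensive rewrite is not run on every chunk; the heuristics, prompt, and model are assumptions:

```python
# Sketch: rewrite only the chunks that need surrounding context (tables, images, terse fragments).
from openai import OpenAI

client = OpenAI()

def needs_context(chunk: str) -> bool:
    """Cheap heuristics for chunks that likely need context (assumption, tune per corpus)."""
    return "|" in chunk or "<table" in chunk.lower() or len(chunk.split()) < 30

REWRITE_PROMPT = """Rewrite the chunk so it is understandable on its own.
Use the document context to resolve references, table headers, and abbreviations.
Do not add facts that are not in the document.

Document context:
{context}

Chunk:
{chunk}"""

def rewrite_chunk(chunk: str, context: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": REWRITE_PROMPT.format(context=context, chunk=chunk)}],
        temperature=0,
    )
    return resp.choices[0].message.content
```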
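
For query generation, a sketch that uses few-shot examples to push the model toward messy, user-style questions rather than clean encyclopedic ones; the example queries and model are placeholders:

```python
# Sketch: few-shot query generation that favours realistic, symptom-style questions.
from openai import OpenAI

client = OpenAI()

QUERY_PROMPT = """Generate {n} questions a real user might type into support chat
about the chunk below. Users are terse, describe symptoms, and rarely use exact product terms.

Good examples (realistic):
- "export to csv not working, button greyed out"
- "why is sync so slow after the update"

Bad examples (too clean, too easy to search):
- "What is the purpose of the export module in Product Y?"

Chunk:
{chunk}

Return one question per line."""

def generate_queries(chunk: str, n: int = 3) -> list[str]:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": QUERY_PROMPT.format(n=n, chunk=chunk)}],
        temperature=0.7,  # some diversity in phrasing
    )
    lines = resp.choices[0].message.content.splitlines()
    return [q.strip("- ").strip() for q in lines if q.strip()]
```

Generated queries still go through domain-expert review before they enter the evaluation set.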
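
For ranking generation, a sketch where the LLM-as-a-judge grades each (query, chunk) pair and the grades induce a ranking; the 0-3 scale and prompt are assumptions and should be checked against your manual rankings before being trusted:

```python
# Sketch: grade (query, chunk) pairs with an LLM-as-a-judge and rank chunks by grade.
from openai import OpenAI

client = OpenAI()

GRADE_PROMPT = """Score how well the chunk answers the query on a 0-3 scale:
0 = irrelevant, 1 = tangential, 2 = partially answers, 3 = fully answers.

Query: {query}

Chunk:
{chunk}

Reply with a single digit."""

def grade(query: str, chunk: str) -> int:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": GRADE_PROMPT.format(query=query, chunk=chunk)}],
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip()[0])

def rank_chunks(query: str, chunks: list[str]) -> list[str]:
    """Ranking induced by the judge's grades; compare it with a manual ranking on a sample."""
    return sorted(chunks, key=lambda c: grade(query, c), reverse=True)
```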
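
For the optional summarization step, a sketch of a use-case-aware prompt; the instruction to keep exact figures is the part that matters for financial reports (prompt and model are placeholders):

```python
# Sketch: use-case-aware summarization of an ingested document for a search-summaries tool.
from openai import OpenAI

client = OpenAI()

SUMMARY_PROMPT = """Summarize the document for retrieval.
Keep every concrete number, date, and total exactly as written; do not round or omit figures.
If the document contains tables, restate the key rows in prose.

Document:
{document}"""

def summarize(document: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": SUMMARY_PROMPT.format(document=document)}],
        temperature=0,
    )
    return resp.choices[0].message.content
```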
Resources
- Systematically Improving RAG Applications
- Evaluating the Effectiveness of LLM-Evaluators (aka LLM-as-Judge)