synthetic data generation for RAG evaluation
Contents
Note
- treat this as a dynamic dataset, not a static one: as the system improves, the evaluation data must become more challenging
- in some places it is implicitly assumed that we generate data for an enterprise chatbot application, but the approach is similar across use cases
Steps
- Chunk filtering
- pre-filter the documents/chunks by their relevance to users (the probability of being queried)
- do we want to generate a question for this chunk?
- Aligned LLM-as-a-judge
- manually label a small subset of documents/chunks as relevant or irrelevant
- Iterate on the LLM-as-a-judge criteria until its labels match the hand-labeled data perfectly (see the judge-alignment sketch after this list)
- Label the rest of the data
- Use context, tags, metadata, date ranges
- contextual chunk rewriting (optional; see the rewriting sketch after this list)
- expensive if run on every chunk
- identify the chunks that require additional context, such as tables, images, short notes, …
- Query generation
- generate questions from documents / user queries / transactions / etc.
- use few-shot learning and context to create realistic queries, both in content and in formulation/format (see the query-generation sketch after this list)
- "What is the purpose of X in Y?" is too clean and too easy to search for; a real user is far more likely to ask something like "X is not working"
- review and validate
- by domain experts
- by calculating metrics such as recall@1/3/5 or MRR@1/3/5 (see the metrics sketch after this list)
- by users in production
- compare validation metrics across batches of newly generated data, or against the original hand-labeled data; the generated data should be at least as challenging
- generate more examples by adding the most challenging samples from the already generated and validated batches to the pool of few-shot examples
- randomly sampling few-shot examples from an ever-growing set of good ones is better than generating all needed samples in one go
- there can be an additional question-filter block that tests whether a question was generated properly and, if not, retries once more
- Ranking generation from questions and chunks
- Create a good prompt for the LLM-as-a-judge so that automatic ranking reaches the same quality as your own manual ranking (see the ranking-judge sketch after this list)
- Multi-context question generation (optional)
- ask an LLM to create questions that can only be answered using information from all of the provided chunks (see the multi-context sketch after this list)
- prior summarization of ingested documents (optional) ^ea0ca7
- the cost-efficiency of this step drops for modern models with large context windows
- consider a separate search-summaries tool and use the summarized chunks as a supplement to the raw data
- design summarization prompts with use-case needs in mind
- good for financial reports: if numbers are crucial, make sure the model retains them and sums them up (see the summarization sketch after this list)
- also useful for multimedia content without text captions
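A minimal sketch of the judge-alignment step from *Chunk filtering*. The `call_llm(prompt) -> str` helper and the prompt wording are assumptions, not a specific API; the point is to iterate on the judge prompt until its agreement with the hand-labeled chunks is (close to) perfect before labeling the rest automatically.

```python
# Hypothetical helper: call_llm(prompt) -> str wraps whatever LLM API you use.

JUDGE_PROMPT = """You are filtering documentation chunks for an enterprise chatbot.
Reply RELEVANT if a real user is likely to ask a question that this chunk answers,
otherwise reply IRRELEVANT.

Chunk:
{chunk}

Reply with a single word: RELEVANT or IRRELEVANT."""


def judge_chunk(chunk: str, call_llm) -> bool:
    """True if the judge thinks the chunk is worth generating questions for."""
    verdict = call_llm(JUDGE_PROMPT.format(chunk=chunk)).strip().upper()
    return verdict.startswith("RELEVANT")


def agreement_with_labels(labeled, call_llm) -> float:
    """Fraction of hand-labeled (chunk, is_relevant) pairs the judge agrees with.
    Iterate on JUDGE_PROMPT until this is ~1.0, then label the remaining chunks."""
    hits = sum(judge_chunk(chunk, call_llm) == is_relevant
               for chunk, is_relevant in labeled)
    return hits / len(labeled)
```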
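A rewriting sketch for the optional contextual chunk rewriting step. The trigger heuristics (`CONTEXT_MARKERS`, the length threshold) and `call_llm` are assumptions; the idea is to rewrite only chunks that cannot stand on their own, since rewriting every chunk is expensive.

```python
# Hypothetical call_llm(prompt) -> str helper, as above.

CONTEXT_MARKERS = ("|", "table", "figure", "image")  # crude, illustrative triggers only

REWRITE_PROMPT = """The chunk below was taken from the document titled "{title}".

{chunk}

Rewrite the chunk so that it is understandable on its own: state the document topic,
expand abbreviations, and describe in words what any table or figure shows.
Keep every fact and number unchanged."""


def maybe_rewrite(chunk: str, title: str, call_llm) -> str:
    """Rewrite short or table/image-like chunks; pass everything else through untouched."""
    needs_context = len(chunk) < 200 or any(m in chunk.lower() for m in CONTEXT_MARKERS)
    return call_llm(REWRITE_PROMPT.format(title=title, chunk=chunk)) if needs_context else chunk
```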
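A query-generation sketch: few-shot examples are sampled at random from an ever-growing pool of validated (chunk, question) pairs, seeded with the most challenging samples from earlier batches. The prompt wording, the number of shots, and `call_llm` are assumptions.

```python
import random

GENERATION_PROMPT = """You generate realistic user questions for an enterprise chatbot.
Users write short, messy, problem-driven queries ("X is not working"), not textbook
questions ("What is the purpose of X in Y?").

Examples of chunk -> question:
{examples}

Chunk:
{chunk}

Write one realistic user question that this chunk answers."""


def generate_question(chunk, example_pool, call_llm, n_shots=4):
    """example_pool is a growing list of validated (chunk, question) pairs."""
    shots = random.sample(example_pool, k=min(n_shots, len(example_pool)))
    examples = "\n\n".join(f"Chunk: {c}\nQuestion: {q}" for c, q in shots)
    return call_llm(GENERATION_PROMPT.format(examples=examples, chunk=chunk)).strip()
```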
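A metrics sketch for the validation step: run each generated question through the retriever, record the 1-based rank at which its source chunk comes back (or None if it is missed), and compare recall@k / MRR@k across batches or against the hand-labeled set.

```python
from typing import Optional


def recall_at_k(ranks: list, k: int) -> float:
    """ranks[i] is the 1-based position of the source chunk for question i, or None if missed."""
    return sum(r is not None and r <= k for r in ranks) / len(ranks)


def mrr_at_k(ranks: list, k: int) -> float:
    """Mean reciprocal rank, counting only hits within the top k."""
    return sum(1.0 / r for r in ranks if r is not None and r <= k) / len(ranks)


# Illustrative usage (ranks are made-up numbers):
# hand_labeled_ranks: "list[Optional[int]]" = [1, 2, 1, None, 3]
# new_batch_ranks: "list[Optional[int]]" = [1, 4, None, 2, 1]
# print(recall_at_k(new_batch_ranks, 5), mrr_at_k(new_batch_ranks, 5))
```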
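A ranking-judge sketch for ranking generation: the judge scores how well each chunk answers a question, and the rubric is tuned until its scores match your own manual rankings. The 0-3 scale, the prompt wording, and `call_llm` are assumptions.

```python
RANKING_PROMPT = """Rate how well the chunk answers the question on a 0-3 scale:
3 = fully answers it, 2 = partially answers it, 1 = only mentions the topic, 0 = unrelated.

Question:
{question}

Chunk:
{chunk}

Reply with a single digit."""


def judge_relevance(question: str, chunk: str, call_llm) -> int:
    """Parse the judge's single-digit score; fall back to 0 on malformed replies."""
    reply = call_llm(RANKING_PROMPT.format(question=question, chunk=chunk)).strip()
    return int(reply[0]) if reply and reply[0].isdigit() else 0


def rank_chunks(question: str, chunks: list, call_llm) -> list:
    """Return chunks sorted by judged relevance, highest first."""
    return sorted(chunks, key=lambda c: judge_relevance(question, c, call_llm), reverse=True)
```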
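A multi-context sketch: the prompt explicitly forbids questions that a single chunk could answer, so the resulting items exercise multi-chunk retrieval. Wording and `call_llm` are assumptions.

```python
MULTI_CONTEXT_PROMPT = """Below are {n} chunks from the knowledge base.

{chunks}

Write one question that can only be answered by combining information from ALL of the
chunks above. Do not ask anything that a single chunk answers on its own."""


def generate_multi_context_question(chunks: list, call_llm) -> str:
    """Number the chunks and ask for a question that spans all of them."""
    numbered = "\n\n".join(f"[Chunk {i + 1}]\n{c}" for i, c in enumerate(chunks))
    return call_llm(MULTI_CONTEXT_PROMPT.format(n=len(chunks), chunks=numbered)).strip()
```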
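A summarization sketch for the optional pre-summarization step, using the financial-report example: the prompt forces the model to keep the numbers and state their totals. The prompt wording and `call_llm` are assumptions; adapt the instructions to whatever your use case cannot afford to lose.

```python
FINANCIAL_SUMMARY_PROMPT = """Summarize the report section below in at most five sentences.
Keep every monetary amount, percentage, and date exactly as written; if the section lists
several figures of the same kind, also state their sum.

Section:
{section}"""


def summarize_section(section: str, call_llm) -> str:
    """Produce a number-preserving summary to store alongside the raw chunks."""
    return call_llm(FINANCIAL_SUMMARY_PROMPT.format(section=section))
```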
Tools
Resources
- Systematically Improving RAG Applications
- Evaluating the Effectiveness of LLM-Evaluators (aka LLM-as-Judge)
Links to this File
table file.inlinks, file.outlinks from [[]] and !outgoing([[]]) AND -"Changelog"