Advanced RAG techniques
scroll ↓ to Resources
Advanced improvements to RAG
- You will most likely have to chunk your context data into smaller pieces, and the chunking strategy can have a huge impact on RAG performance.
    - small chunks → limited context → incomplete answers
    - large chunks → noise in the data → poor recall
    - Chunk by characters, sentences, or semantic meaning, using a dedicated model or an LLM call
        - semantic chunking: detect where a change of topic happens, e.g., a drop in embedding similarity between adjacent sentences
    - Consider inference latency and the sequence length (in tokens) the embedding model was trained on
    - Overlapping chunks or not?
    - Use small chunks at the embedding stage and a larger size at inference, by appending adjacent chunks before feeding them to the LLM (see the sketch below)
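A minimal sketch of this "small-to-big" idea: search over small chunks, then expand each hit with its neighbors before handing the passage to the LLM. `embed` is a hypothetical embedding function returning unit-norm vectors; everything else is plain numpy.

```python
import numpy as np

def retrieve_expanded(query: str, chunks: list[str], embed,
                      k: int = 5, window: int = 1) -> list[str]:
    """Search over small chunks, return enlarged passages for the LLM."""
    chunk_vecs = np.stack([embed(c) for c in chunks])  # (n_chunks, dim)
    sims = chunk_vecs @ embed(query)                   # cosine similarity
    top = np.argsort(-sims)[:k]                        # best chunk indices
    passages = []
    for idx in top:
        # Append `window` adjacent chunks on each side to restore context.
        lo, hi = max(0, idx - window), min(len(chunks), idx + window + 1)
        passages.append(" ".join(chunks[lo:hi]))
    return passages
```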
- re-ranking
    - see Re-ranking
- query expansion and enhancement
    - Another LLM-call module can be added to rewrite and expand the initial user query: adding synonyms, rephrasing, complementing it with an initial LLM output (generated without RAG context), etc. (see the sketch below)
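A sketch of such a query-expansion module, assuming a hypothetical `llm_call` wrapper around whatever chat-completion API you use:

```python
def expand_query(query: str, llm_call) -> list[str]:
    """One extra LLM call rewrites the user query before retrieval."""
    prompt = (
        "Rewrite the following search query in 3 different ways, "
        "adding synonyms and related terms. Return one query per line.\n\n"
        f"Query: {query}"
    )
    rewrites = llm_call(prompt).strip().splitlines()
    # Retrieve with the original query plus each rewrite, then merge results
    # (e.g., with Reciprocal Rank Fusion, shown further below).
    return [query] + [r.strip() for r in rewrites if r.strip()]
```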
- In addition to dense embedding models, there are also older sparse representation methods (e.g., BM25, TF-IDF). These can and should be used alongside vector search, resulting in hybrid search ^f44082
    - Using hybrid search (at least full-text + vector search) is standard for RAG, but it requires combining several scores into one ^6fd281
        - use a weighted average
        - take several top results from each search module
        - use Reciprocal Rank Fusion (see the sketch below); metrics like Mean Average Precision and NDCG are evaluation measures for comparing fusion strategies, not fusion methods themselves
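Reciprocal Rank Fusion is attractive because it combines ranked lists from several retrievers without having to calibrate their raw scores. A self-contained sketch (k=60 is the constant from the original RRF paper; the doc IDs are illustrative):

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of doc IDs into one, best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)  # higher rank -> bigger boost
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse full-text and vector search results.
fused = reciprocal_rank_fusion([
    ["doc3", "doc1", "doc7"],  # BM25 top results
    ["doc1", "doc9", "doc3"],  # vector search top results
])
# "doc1" and "doc3" appear high in both lists, so they end up ranked first.
```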
- metadata filtering reduces the search space, hence improving retrieval quality and reducing computational burden (see the sketch below)
    - dates, freshness, source authority (e.g., for health datasets), business-relevant tags
    - categories: use entity-detection models such as GLiNER
    - if there is no metadata, one can ask an LLM to generate it
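Most vector databases expose metadata filtering natively; a plain-Python sketch of the idea, with hypothetical fields (`source`, `published`, `tags`) standing in for whatever metadata your chunks carry:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Chunk:
    text: str
    source: str                 # e.g., "pubmed", "internal-wiki"
    published: date
    tags: list[str] = field(default_factory=list)

def filter_chunks(chunks: list[Chunk], *, min_date: date,
                  allowed_sources: set[str]) -> list[Chunk]:
    """Shrink the search space so the vector index only scores survivors."""
    return [c for c in chunks
            if c.published >= min_date and c.source in allowed_sources]
```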
- Shuffling context chunks adds randomness to outputs, which increases the diversity of the downstream output (an alternative to tuning the softmax temperature hyperparameter). E.g., previously purchased items can be provided in random order to make a recommendation engine's output more creative.
- One can generate summaries of documents (or questions for each chunk/document) and embed that info too (see the sketch below)
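A sketch of the questions-per-chunk variant: synthetic questions are embedded alongside the chunk itself, so a user question can match a question rather than prose. `llm_call` and `embed` are the same hypothetical wrappers as above.

```python
def index_chunk_with_questions(chunk: str, llm_call, embed) -> list[tuple]:
    """Return (vector, original_chunk) pairs to add to the index."""
    prompt = ("Write 3 short questions that the following passage answers, "
              f"one per line.\n\nPassage:\n{chunk}")
    questions = [q.strip() for q in llm_call(prompt).splitlines() if q.strip()]
    # Every entry points back at the original chunk text, so a hit on a
    # synthetic question still retrieves the real passage.
    entries = [(embed(chunk), chunk)]
    entries += [(embed(q), chunk) for q in questions]
    return entries
```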
Not RAG-specific
- Off-the-shelf bi-encoder (embedding) models can be fine-tuned like any other model, but this is rarely done in practice, as there are much lower-hanging fruits (see the sketch below)
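If you do want to try it, a minimal sketch using the sentence-transformers library's classic `fit` API, with in-batch negatives via MultipleNegativesRankingLoss; the (query, relevant passage) pairs below are illustrative placeholders:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

# Positive (query, passage) pairs; other passages in the batch act as negatives.
train_examples = [
    InputExample(texts=["what is hybrid search?", "Hybrid search combines ..."]),
    InputExample(texts=["chunk overlap", "Overlapping chunks help preserve ..."]),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=2)
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
```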
Other
- AutoML tools for RAG: auto-configuring your RAG pipeline
- Contextual Retrieval (Anthropic)
- Query Classification / Routing: save resources by detecting upfront when a query doesn't need external context and can be answered directly or from chat history (see the routing sketch after this list)
- Multi-modal RAG, in case your queries need access to images, tables, video, etc. Then you need a multi-modal embedding model too.
- Self-RAG, Iterative RAG
- Hierarchical Index Retrieval - first search for a relevant book, then chapter, etc.
- Graph-RAG
- Chain-of-Note
- Contextual Document Embeddings
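The routing sketch mentioned above: a cheap LLM call decides whether retrieval is needed at all, so simple queries skip the RAG pipeline. `llm_call` is again a hypothetical wrapper; the two-label scheme is an assumption, not a fixed standard.

```python
def route_query(query: str, llm_call) -> str:
    """Return "rag" if the query needs retrieval, else "direct"."""
    prompt = (
        "Classify the user query. Answer with exactly one word:\n"
        "RETRIEVE if answering requires looking up external documents,\n"
        "DIRECT if it can be answered from general knowledge or chat history.\n\n"
        f"Query: {query}"
    )
    label = llm_call(prompt).strip().upper()
    return "rag" if label.startswith("RETRIEVE") else "direct"
```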
Resources
- GitHub - NirDiamant/RAG_Techniques: This repository showcases various advanced techniques for Retrieval-Augmented Generation (RAG) systems
- Yet another RAG system - implementation details and lessons learned : r/LocalLLaMA