Advanced RAG techniques


Advanced improvements to RAG

  • You will most likely need to chunk your context data into smaller pieces, and the chunking strategy can have a huge impact on RAG performance.
    • small chunks ⇒ limited context ⇒ incomplete answers
    • large chunks ⇒ noise in the data ⇒ poor recall
    • Chunking can be done by characters, by sentences, or by semantic meaning, using a dedicated model or an LLM call
    • Semantic chunking detects where a topic change occurs (e.g. a drop in embedding similarity between adjacent sentences)
    • Consider inference latency and the context length (number of tokens) the embedding model was trained on
    • Decide whether chunks should overlap: overlap preserves context across chunk boundaries at the cost of extra storage
    • Use small chunks at the embedding stage and a larger context at inference time, by appending adjacent chunks before feeding them to the LLM (see the first sketch after this list)
  • re-ranking: re-score the top retrieved candidates with a stronger model, typically a cross-encoder (see the re-ranking sketch after this list)
  • query expansion and enhancement
    • Another LLM-call module can be added to rewrite and expand the initial user query: adding synonyms, rephrasing, complementing it with an initial LLM answer generated without RAG context, etc. (see the query-expansion sketch after this list)
  • In addition to dense embedding models, there are also older sparse representation methods. These can and should be used alongside vector search, resulting in hybrid search (see the BM25 sketch after this list) ^f44082
    • encoding can be supervised (e.g. SPLADE) or unsupervised (e.g. BM25, TF-IDF)
    • search can be accelerated with top-k retrieval algorithms such as WAND, MaxScore, Block-Max WAND, and more
  • Using hybrid search (at least full-text + vector search) is standard in RAG, but it requires combining several scores into one ^6fd281
    • use a weighted average of the normalized scores
    • take the top results from each search module
    • use a rank-based method such as Reciprocal Rank Fusion (see the RRF sketch after this list); metrics like Mean Average Precision and NDCG are then useful for evaluating the combined ranking
  • metadata filtering reduces the search space, which improves retrieval quality and reduces the computational burden (see the filtering sketch after this list)
    • dates/freshness, source authority (e.g. for health datasets), business-relevant tags
    • categories: use entity detection models such as GLiNER
    • if there is no metadata, one can ask an LLM to generate it
  • Shuffling context chunks introduces randomness into outputs, which increases the diversity of the downstream output (an alternative to tuning the softmax temperature). For example, previously purchased items can be provided in random order to make a recommendation engine's output more varied.
  • One can generate a summary of each document (or synthetic questions for each chunk/document) and embed that information too
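
A minimal sketch of the small-to-large idea: index small chunks, but hand the LLM the retrieved chunk together with its neighbors. The character-based chunker and the hard-coded `hit_index` are illustrative; in practice the hit would come from your vector index.

```python
from typing import List

def make_chunks(text: str, size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into overlapping, character-based chunks."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def expand_with_neighbors(chunks: List[str], hit_index: int, window: int = 1) -> str:
    """Join the retrieved chunk with `window` adjacent chunks on each side."""
    lo = max(0, hit_index - window)
    hi = min(len(chunks), hit_index + window + 1)
    return "\n".join(chunks[lo:hi])

chunks = make_chunks("some long document ... " * 50)
context = expand_with_neighbors(chunks, hit_index=3)  # feed `context` to the LLM
```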
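A hedged re-ranking sketch using the sentence-transformers CrossEncoder class; the checkpoint name is one common public model, not a recommendation.

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list, top_k: int = 5) -> list:
    """Score (query, candidate) pairs jointly and keep the best top_k."""
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:top_k]]
```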
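A sketch of the query-expansion module as an extra LLM call, using the OpenAI chat completions API; the model name and the prompt are placeholders.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def expand_query(query: str) -> str:
    """Rewrite the user query with synonyms and a paraphrase to improve recall."""
    prompt = (
        "Rewrite the following search query on a single line, "
        "adding synonyms and one alternative phrasing:\n" + query
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    # search with both the original and the expanded query
    return f"{query} {resp.choices[0].message.content}"
```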
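An unsupervised sparse-retrieval sketch using the rank_bm25 package (one of several BM25 implementations; a full-text index in your search engine or vector DB plays the same role).

```python
from rank_bm25 import BM25Okapi

corpus = [
    "reciprocal rank fusion combines several ranked lists",
    "semantic chunking splits text where the topic changes",
    "bm25 is an unsupervised sparse retrieval method",
]
bm25 = BM25Okapi([doc.split() for doc in corpus])  # naive whitespace tokenizer

query = "sparse retrieval with bm25"
scores = bm25.get_scores(query.split())  # one relevance score per document
best = max(range(len(corpus)), key=lambda i: scores[i])
print(corpus[best])
```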
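A dependency-free Reciprocal Rank Fusion sketch: each retriever contributes only ranks, so heterogeneous scores (BM25 vs. cosine similarity) never need to be normalized against each other. The constant k = 60 is the value commonly used in the RRF literature.

```python
from collections import defaultdict
from typing import Dict, List

def rrf(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank(d))."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc3", "doc1", "doc7"]    # full-text ranking
vector_hits = ["doc1", "doc9", "doc3"]  # vector-search ranking
print(rrf([bm25_hits, vector_hits]))    # doc1 and doc3 float to the top
```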
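A metadata pre-filtering sketch; the field names (`published`, `source`) are illustrative. Only the surviving candidates are passed on to the more expensive embedding/scoring stage.

```python
from datetime import date

docs = [
    {"id": "a", "text": "...", "source": "who.int", "published": date(2024, 5, 1)},
    {"id": "b", "text": "...", "source": "blog", "published": date(2019, 1, 1)},
]

def prefilter(docs: list, min_date: date, trusted_sources: set) -> list:
    """Drop stale documents and documents from untrusted sources."""
    return [
        d for d in docs
        if d["published"] >= min_date and d["source"] in trusted_sources
    ]

candidates = prefilter(docs, date(2023, 1, 1), {"who.int"})
# ...only `candidates` go through vector search / BM25 scoring
```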

Not RAG-specific

  • Off-the-shelf bi-encoder (embedding) models can be fine-tuned like any other model, but in practice this is rarely done, as there is usually lower-hanging fruit (see the sketch below)
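
A hedged fine-tuning sketch with sentence-transformers; the base model is a placeholder, and the two (query, passage) pairs stand in for a real training set of thousands of pairs.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

train_examples = [  # (query, relevant passage) pairs from your domain
    InputExample(texts=["what is hybrid search", "Hybrid search combines ..."]),
    InputExample(texts=["how to chunk documents", "Chunking strategies ..."]),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=2)
loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
```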

Other

Resources


```dataview
table file.inlinks, file.outlinks from [[]] and !outgoing([[]]) AND -"Changelog"
```