prefix caching

Note

  • The prefill operation - computing the attention keys and values for the input passed to the LLM - is one of the most compute-intensive and slowest parts of LLM inference
    • The output of prefill is the set of attention keys and values for every transformer layer, covering the entire input.
  • The KV-cache lets us avoid recomputing keys and values for the input on every autoregressive decode step.
  • Prefix caching refers to retaining the KV-cache itself across subsequent inference requests, in order to reduce the latency and cost of the prefill operation (see the first sketch after this list).
  • LLM chatbots with multi-turn conversations and large document/code uploads are applications that naturally benefit from prefix caching.
  • For it to be effective, the input structure and schema must stay prefix-cache friendly: do not alter the prefix in subsequent requests, since any change invalidates the cache for all tokens that follow (see the second sketch below).
    • For example, putting a fresh timestamp at the very beginning of each request invalidates the cache completely
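
A minimal sketch of how an engine can recognise a reusable prefix: consecutive blocks of prompt token ids are hashed, with each hash chained over everything before it, and cached KV blocks are reused for the longest matching run. The block size, the in-memory dict, and the placeholder KV payload are illustrative assumptions, not any particular engine's API (vLLM's automatic prefix caching works along these lines).

```python
import hashlib

BLOCK_SIZE = 16  # tokens per cache block; size and layout are illustrative

# Maps a chained hash of all token ids up to a block boundary -> that block's KV data.
# The payload is a placeholder here; a real engine stores per-layer key/value tensors.
kv_block_cache: dict[str, object] = {}

def prefix_block_hashes(token_ids: list[int]) -> list[str]:
    """One hash per full block; each hash covers the whole prefix up to that block,
    so a block only matches when everything before it matches too."""
    hashes, h = [], hashlib.sha256()
    for i in range(len(token_ids) // BLOCK_SIZE):
        block = token_ids[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
        h.update(str(block).encode())
        hashes.append(h.copy().hexdigest())
    return hashes

def cached_prefix_length(token_ids: list[int]) -> int:
    """Number of leading tokens whose keys/values can be reused from the cache."""
    reusable = 0
    for i, block_hash in enumerate(prefix_block_hashes(token_ids)):
        if block_hash not in kv_block_cache:
            break
        reusable = (i + 1) * BLOCK_SIZE
    return reusable

# During prefill, only token_ids[cached_prefix_length(token_ids):] need the expensive
# attention computation; the freshly computed blocks are then inserted into the cache.
```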

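A hedged sketch of cache-friendly prompt construction; the prompt text and document contents below are hypothetical. The point is only that the stable parts (system prompt, uploaded document, earlier turns) stay byte-identical at the front, while anything that changes per request goes at the end.

```python
from datetime import datetime, timezone

system_prompt = "You are a helpful assistant."        # identical for every request
document = "<contents of a large uploaded report>"    # big upload, reused across turns
timestamp = datetime.now(timezone.utc).isoformat()    # changes on every request

# Cache-unfriendly: the timestamp is the very first thing in the prompt, so the
# first tokens differ on every request and no previously cached KV blocks match.
bad_prompt = f"[{timestamp}]\n{system_prompt}\n{document}\nUser: summarise section 2"

# Cache-friendly: the unchanged prefix (system prompt + document + conversation so far)
# comes first; only the new, dynamic material is appended at the end.
good_prompt = f"{system_prompt}\n{document}\nUser: summarise section 2\n[{timestamp}]"
```
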
Resources


table file.inlinks, file.outlinks from [[]] and !outgoing([[]])  AND -"Changelog"