inference optimization
scroll ↓ to Resources
Note
- caching ^ef1842
	- prefix caching (reusing the KV cache computed for a shared prompt prefix): PagedAttention, vAttention (see the prefix-cache sketch after this list)
	- prompt caching (can also be partial, e.g. reusing a cached segment from the middle of the prompt)
- query caching
- quantization
	- When quantizing the model to float16, int8, or int4, it is important to check how much the model degrades across different tasks, languages, and modalities (see the perplexity-check sketch after this list).
- speculative decoding (a small draft model proposes tokens that the large model verifies; toy example after this list)
- batching
- routing: using different models depending on prompt complexity; simple questions are sent to a local model, complex ones to ChatGPT (see the routing sketch after this list)
	- RouteLLM is a library built around router models that estimate prompt complexity and pick which model to call.
- Mixture-of-Experts
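
A minimal sketch of the prefix-caching idea: the KV cache computed for a shared prompt prefix (e.g. a system prompt) is stored and reused, so only the new suffix has to be run through the model. Here `compute_kv` is a hypothetical stand-in for the model's forward pass, and the cache is a plain dict rather than the block-level structures used by PagedAttention or vAttention.

```python
from typing import Dict, List, Optional, Tuple

KVCache = List[str]  # placeholder; a real cache holds per-layer key/value tensors per token

_prefix_cache: Dict[Tuple[str, ...], KVCache] = {}

def compute_kv(tokens: List[str], past: Optional[KVCache] = None) -> KVCache:
    """Pretend forward pass: extends the KV cache by one entry per new token."""
    past = list(past) if past else []
    return past + [f"kv({t})" for t in tokens]

def encode_with_prefix_cache(tokens: List[str]) -> KVCache:
    # Find the cached prompt sharing the longest common prefix with `tokens`.
    best_len, best_kv = 0, None
    for cached_tokens, cached_kv in _prefix_cache.items():
        n = 0
        while n < min(len(cached_tokens), len(tokens)) and cached_tokens[n] == tokens[n]:
            n += 1
        if n > best_len:
            best_len, best_kv = n, cached_kv[:n]  # one KV entry per token in this toy
    # Only the uncached suffix is "recomputed"; the shared prefix is reused.
    kv = compute_kv(tokens[best_len:], best_kv)
    _prefix_cache[tuple(tokens)] = kv
    return kv

system = "You are a helpful assistant .".split()
encode_with_prefix_cache(system + "What is Rust ?".split())  # full compute, then cached
encode_with_prefix_cache(system + "What is Go ?".split())    # the system-prompt prefix is reused
```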
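
A rough sketch of the degradation check mentioned above, assuming the Hugging Face transformers and bitsandbytes libraries: the same model is loaded at float16, int8, and int4, and its perplexity is compared on small per-task / per-language samples. The model id and evaluation texts are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "meta-llama/Llama-3.1-8B"  # placeholder model
SAMPLES = {  # placeholder per-task / per-language texts
    "english_qa": ["The capital of France is Paris."],
    "german_qa":  ["Die Hauptstadt von Frankreich ist Paris."],
}
CONFIGS = {
    "float16": dict(torch_dtype=torch.float16),
    "int8": dict(quantization_config=BitsAndBytesConfig(load_in_8bit=True)),
    "int4": dict(quantization_config=BitsAndBytesConfig(load_in_4bit=True)),
}

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

@torch.no_grad()
def perplexity(model, text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    loss = model(ids, labels=ids).loss  # mean cross-entropy over tokens
    return torch.exp(loss).item()

for name, kwargs in CONFIGS.items():
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", **kwargs)
    for task, texts in SAMPLES.items():
        scores = [perplexity(model, t) for t in texts]
        print(name, task, sum(scores) / len(scores))
    del model
    torch.cuda.empty_cache()
```

Large jumps in perplexity for a particular language or task at int4 would flag where the quantized model needs a closer look.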
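
A toy illustration of speculative decoding in its greedy form: a cheap draft model proposes a few tokens, the expensive target model verifies them, and the longest agreeing prefix plus the target's correction is accepted. Both "models" here are stand-in functions, not real LLMs; real implementations verify all drafted positions in a single batched forward pass and use probabilistic acceptance rather than exact match.

```python
from typing import Callable, List

Model = Callable[[List[str]], str]  # maps a context to the next (greedy) token

def speculative_decode(target: Model, draft: Model,
                       prompt: List[str], max_new: int, k: int = 4) -> List[str]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1. The draft model proposes k tokens autoregressively (cheap).
        proposed, ctx = [], list(out)
        for _ in range(k):
            t = draft(ctx)
            proposed.append(t)
            ctx.append(t)
        # 2. The target model checks each position; stop at the first disagreement.
        accepted = []
        for t in proposed:
            expected = target(out + accepted)
            if expected == t:
                accepted.append(t)         # draft token matches the target's choice
            else:
                accepted.append(expected)  # take the target's correction and stop
                break
        out.extend(accepted)
    return out[:len(prompt) + max_new]

story = "the quick brown fox jumps over the lazy dog".split()
target = lambda ctx: story[len(ctx) % len(story)]
draft  = lambda ctx: story[len(ctx) % len(story)] if len(ctx) % 3 else "uh"
print(speculative_decode(target, draft, ["<s>"], max_new=8))
```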
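
A hedged sketch of the routing idea (not RouteLLM's actual API): a cheap complexity score decides whether a prompt goes to a local model or to a remote model such as ChatGPT. `complexity_score`, `call_local_model`, and `call_remote_model` are hypothetical placeholders; RouteLLM replaces the heuristic with a learned router.

```python
def complexity_score(prompt: str) -> float:
    """Crude stand-in for a learned router: longer, more technical prompts score higher."""
    words = prompt.split()
    technical = sum(w.lower() in {"prove", "derive", "optimize", "debug", "design"} for w in words)
    return min(1.0, len(words) / 200 + 0.2 * technical)

def call_local_model(prompt: str) -> str:   # placeholder for e.g. a llama.cpp call
    return f"[local] answer to: {prompt[:40]}"

def call_remote_model(prompt: str) -> str:  # placeholder for e.g. an OpenAI API call
    return f"[remote] answer to: {prompt[:40]}"

def route(prompt: str, threshold: float = 0.5) -> str:
    # Easy prompts stay local; anything above the threshold goes to the stronger model.
    return call_remote_model(prompt) if complexity_score(prompt) >= threshold else call_local_model(prompt)

print(route("What time is it?"))
print(route("Derive and optimize the attention backward pass, then debug the kernel design " * 3))
```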
Resources
Links to this File
table file.inlinks, file.outlinks from [[]] and !outgoing([[]]) AND -"Changelog"