inference optimization
scroll ↓ to Resources
Note
- use smaller, faster models
- caching ^ef1842
- prefix caching: Paged Attention, vAttention
- Structure your prompt so that the most important context comes first and rarely changes, while per-request data that varies appears later (see the prompt-structure sketch after this list)
- prompt caching (also partial, in the middle)
- Prompt Cache: Modular Attention Reuse for Low-Latency Inference
- cache identical prompts so the LLM provider is not called twice for the same input (see the query-cache sketch after this list)
- query caching
- quantization
- When quantizing a model to float16, int8, or int4, it is important to check how much it degrades across different tasks, languages, or modalities (see the quantization sketch after this list)
- pruning
- knowledge distillation
- Transferring knowledge from a larger model to a smaller one
- speculative decoding (toy sketch after this list)
- batching
- use different models depending on prompt complexity: simple questions go to a local model, complex ones to ChatGPT (routing sketch after this list)
- RouteLLM is a library built around a dedicated router model that estimates prompt complexity
- Mixture-of-Experts
- parallelism
- data parallelism - process several documents at the same time (see the thread-pool sketch after this list)
- task parallelism - run independent operations at the same time
- make it look faster
- show users intermediate steps
- first provide a rough answer, then take more time to refine and improve it
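
A minimal sketch of the prompt structure that makes prefix caching effective: the stable instructions and reference context stay byte-identical across requests so the serving stack can reuse their KV cache, and only the per-request data is appended at the end. The prompt text and `build_prompt` helper are made up for illustration.

```python
# Keep the stable part of the prompt identical across requests so its
# KV cache (the shared "prefix") can be reused; append changing data last.
STATIC_PREFIX = (
    "You are a support assistant for ACME Inc.\n"
    "Follow the policies below when answering.\n"
    "Policies: ...\n"  # long, rarely changing reference context
)

def build_prompt(user_question: str, todays_data: str) -> str:
    # Dynamic content goes last so it does not invalidate the cached prefix.
    return f"{STATIC_PREFIX}\nToday's data: {todays_data}\nQuestion: {user_question}"

print(build_prompt("How do I reset my password?", "EU region maintenance window"))
```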
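
A minimal query-cache sketch: identical requests are hashed and served from a local dict instead of triggering a second provider call. `call_llm` is a hypothetical placeholder for whatever client is actually used; in practice the cache would live in something like Redis with a TTL.

```python
import hashlib
import json

_cache: dict[str, str] = {}

def call_llm(prompt: str, model: str) -> str:
    # Placeholder for a real provider call (OpenAI client, local server, ...).
    return f"<answer from {model}>"

def cached_completion(prompt: str, model: str = "some-model") -> str:
    # Hash the full request so only byte-identical inputs share an entry.
    key = hashlib.sha256(json.dumps({"m": model, "p": prompt}).encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_llm(prompt, model)
    return _cache[key]

cached_completion("What is speculative decoding?")
cached_completion("What is speculative decoding?")  # second call is served from the cache
```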
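
A post-training dynamic int8 quantization sketch using PyTorch on a toy linear stack; a real LLM would typically go through bitsandbytes, GPTQ, AWQ or similar, followed by re-running evaluations on the relevant tasks, languages, and modalities.

```python
import torch
import torch.nn as nn

# Toy stand-in for a model; real LLMs are quantized with dedicated tooling.
model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))

# Dynamic quantization: Linear weights stored as int8, activations quantized at runtime.
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(4, 512)
# A first, rough look at degradation; proper checks should use the actual eval suites.
print((model(x) - quantized(x)).abs().max().item())
```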
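
A toy greedy speculative-decoding loop: the draft and target "models" are replaced by simple next-token functions so the sketch stays self-contained. The real technique verifies all draft tokens with a single target forward pass and, when sampling, uses a probabilistic accept/reject rule instead of exact matching.

```python
def draft_next(tokens):   # small, fast model (toy stand-in)
    return (tokens[-1] + 1) % 50 if tokens else 0

def target_next(tokens):  # large, accurate model (toy stand-in)
    return (tokens[-1] + 1) % 50 if len(tokens) % 7 else (tokens[-1] + 2) % 50

def speculative_decode(prompt, n_new=20, k=4):
    tokens = list(prompt)
    while len(tokens) < len(prompt) + n_new:
        # 1) Draft model proposes k tokens cheaply.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2) Target model checks them; keep the matching prefix, fix the first mismatch.
        accepted = []
        for i, tok in enumerate(draft):
            expected = target_next(tokens + draft[:i])
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)
                break
        tokens += accepted
    return tokens[: len(prompt) + n_new]

print(speculative_decode([0]))
```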
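
A sketch of complexity-based routing with a crude heuristic (prompt length plus a few keywords) standing in for a learned router such as the ones RouteLLM trains; the `call_*` helpers and thresholds are made up.

```python
def call_local_model(prompt: str) -> str:
    return f"<small local model answer to: {prompt[:30]}>"

def call_remote_model(prompt: str) -> str:
    return f"<large hosted model answer to: {prompt[:30]}>"

HARD_HINTS = ("prove", "derive", "step by step", "contract", "refactor")

def route(prompt: str) -> str:
    # Crude complexity proxy; RouteLLM replaces this with a trained router model.
    looks_hard = len(prompt.split()) > 60 or any(h in prompt.lower() for h in HARD_HINTS)
    return call_remote_model(prompt) if looks_hard else call_local_model(prompt)

print(route("What time is it in Berlin?"))
print(route("Derive the gradient of softmax cross-entropy step by step."))
```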
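
A minimal data-parallelism sketch: the same `summarize` operation (a placeholder here) is applied to several documents concurrently with a thread pool, which fits well when each call is a network-bound request to an LLM provider.

```python
from concurrent.futures import ThreadPoolExecutor

def summarize(doc: str) -> str:
    # Placeholder for an LLM call; network-bound work overlaps well in threads.
    return doc[:40] + "..."

documents = [f"Document {i}: " + "lorem ipsum " * 50 for i in range(8)]

# Data parallelism: one operation, many inputs, processed at the same time.
with ThreadPoolExecutor(max_workers=4) as pool:
    summaries = list(pool.map(summarize, documents))

print(len(summaries), "summaries produced")
```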
Resources
- [2407.12391] LLM Inference Serving: Survey of Recent Advances and Opportunities
- Transformers Inference Optimization Toolset
Links to this File
table file.inlinks, file.outlinks from [[]] and !outgoing([[]]) AND -"Changelog"