inference optimization
scroll ↓ to Resources
Note
- use smaller, faster models
- caching ^ef1842
- prefix caching: Paged Attention, vAttention
- Structure your prompt so that the most important context comes first and rarely changes, while per-request data that varies appears later (see the prompt-structure sketch after this list)
- prompt caching (also partial, in the middle)
- Prompt Cache: Modular Attention Reuse for Low-Latency Inference
- cache identical prompts so the LLM provider is not called twice for the same input (see the query-cache sketch after this list)
- query caching
- quantization
- When quantizing a model to float16, int8, or int4, it is important to check how much it degrades across different tasks, languages, or modalities (see the quantization sketch after this list)
- pruning
- knowledge distillation
- Transferring knowledge from a larger model to a smaller one
- speculative decoding (toy sketch after this list)
- batching
- use different models depending on prompt complexity: simple questions go to a local model, complex ones to ChatGPT (routing sketch after this list)
- RouteLLM is a library built around a dedicated router model that estimates prompt complexity
- Mixture-of-Experts
- parallelism
- data parallelism - process several documents at the same time (see the thread-pool sketch after this list)
- task parallelism - run independent operations at the same time
- make it look faster
- show users intermediate steps
- first provide a rough answer, then take more time to refine and improve it
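
A minimal sketch of the prompt structure that makes prefix caching effective: the stable instructions and reference context stay byte-identical across requests so the serving stack can reuse their KV cache, and only the per-request data is appended at the end. The prompt text and `build_prompt` helper are made up for illustration.

```python
# Keep the stable part of the prompt identical across requests so its
# KV cache (the shared "prefix") can be reused; append changing data last.
STATIC_PREFIX = (
    "You are a support assistant for ACME Inc.\n"
    "Follow the policies below when answering.\n"
    "Policies: ...\n"  # long, rarely changing reference context
)

def build_prompt(user_question: str, todays_data: str) -> str:
    # Dynamic content goes last so it does not invalidate the cached prefix.
    return f"{STATIC_PREFIX}\nToday's data: {todays_data}\nQuestion: {user_question}"

print(build_prompt("How do I reset my password?", "EU region maintenance window"))
```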
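
A minimal query-cache sketch: identical requests are hashed and served from a local dict instead of triggering a second provider call. `call_llm` is a hypothetical placeholder for whatever client is actually used; in practice the cache would live in something like Redis with a TTL.

```python
import hashlib
import json

_cache: dict[str, str] = {}

def call_llm(prompt: str, model: str) -> str:
    # Placeholder for a real provider call (OpenAI client, local server, ...).
    return f"<answer from {model}>"

def cached_completion(prompt: str, model: str = "some-model") -> str:
    # Hash the full request so only byte-identical inputs share an entry.
    key = hashlib.sha256(json.dumps({"m": model, "p": prompt}).encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_llm(prompt, model)
    return _cache[key]

cached_completion("What is speculative decoding?")
cached_completion("What is speculative decoding?")  # second call is served from the cache
```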
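
A post-training dynamic int8 quantization sketch using PyTorch on a toy linear stack; a real LLM would typically go through bitsandbytes, GPTQ, AWQ or similar, followed by re-running evaluations on the relevant tasks, languages, and modalities.

```python
import torch
import torch.nn as nn

# Toy stand-in for a model; real LLMs are quantized with dedicated tooling.
model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))

# Dynamic quantization: Linear weights stored as int8, activations quantized at runtime.
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(4, 512)
# A first, rough look at degradation; proper checks should use the actual eval suites.
print((model(x) - quantized(x)).abs().max().item())
```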
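
A toy greedy speculative-decoding loop: the draft and target "models" are replaced by simple next-token functions so the sketch stays self-contained. The real technique verifies all draft tokens with a single target forward pass and, when sampling, uses a probabilistic accept/reject rule instead of exact matching.

```python
def draft_next(tokens):   # small, fast model (toy stand-in)
    return (tokens[-1] + 1) % 50 if tokens else 0

def target_next(tokens):  # large, accurate model (toy stand-in)
    return (tokens[-1] + 1) % 50 if len(tokens) % 7 else (tokens[-1] + 2) % 50

def speculative_decode(prompt, n_new=20, k=4):
    tokens = list(prompt)
    while len(tokens) < len(prompt) + n_new:
        # 1) Draft model proposes k tokens cheaply.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2) Target model checks them; keep the matching prefix, fix the first mismatch.
        accepted = []
        for i, tok in enumerate(draft):
            expected = target_next(tokens + draft[:i])
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)
                break
        tokens += accepted
    return tokens[: len(prompt) + n_new]

print(speculative_decode([0]))
```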
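
A sketch of complexity-based routing with a crude heuristic (prompt length plus a few keywords) standing in for a learned router such as the ones RouteLLM trains; the `call_*` helpers and thresholds are made up.

```python
def call_local_model(prompt: str) -> str:
    return f"<small local model answer to: {prompt[:30]}>"

def call_remote_model(prompt: str) -> str:
    return f"<large hosted model answer to: {prompt[:30]}>"

HARD_HINTS = ("prove", "derive", "step by step", "contract", "refactor")

def route(prompt: str) -> str:
    # Crude complexity proxy; RouteLLM replaces this with a trained router model.
    looks_hard = len(prompt.split()) > 60 or any(h in prompt.lower() for h in HARD_HINTS)
    return call_remote_model(prompt) if looks_hard else call_local_model(prompt)

print(route("What time is it in Berlin?"))
print(route("Derive the gradient of softmax cross-entropy step by step."))
```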
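
A minimal data-parallelism sketch: the same `summarize` operation (a placeholder here) is applied to several documents concurrently with a thread pool, which fits well when each call is a network-bound request to an LLM provider.

```python
from concurrent.futures import ThreadPoolExecutor

def summarize(doc: str) -> str:
    # Placeholder for an LLM call; network-bound work overlaps well in threads.
    return doc[:40] + "..."

documents = [f"Document {i}: " + "lorem ipsum " * 50 for i in range(8)]

# Data parallelism: one operation, many inputs, processed at the same time.
with ThreadPoolExecutor(max_workers=4) as pool:
    summaries = list(pool.map(summarize, documents))

print(len(summaries), "summaries produced")
```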
Resources
- [2407.12391] LLM Inference Serving: Survey of Recent Advances and Opportunities
- Transformers Inference Optimization Toolset
Links to this File
table file.inlinks, file.outlinks from [[]] and !outgoing([[]]) AND -"Changelog"