LLM
Contents
- Note
- LLM training stages
- How to evaluate LLMs?
- How to improve LLMs without fine-tuning?
- Prompting
- Choose the best sampling strategy
- Answer validation
- Resources
Note
- post-2022 GPT-like Large Language Models are architecturally based on the ideas from the Attention is all you need paper, with slight modifications within modules like layer normalization and positional encoding. See more about the original Transformer.
- Read the paper review of the Llama 3.1 model by Meta to see an example of a modern LLM implementation
- Some nice visual explanatory links:
How are LLMs different from other models?
- Emergent capabilities: unlike regular models, LLMs show a sharp jump in performance as model size grows; the issue of diminishing returns appears only later, after this huge increase. This behavior wasn’t designed, hence “LLM is not an invention, but a discovery”.
- zero-shot learning performance
- gets better at all tasks at once
- training is VERY expensive
- evaluation data contamination
- ideological biases in the models
- TODO: agentic reasoning
LLM training stages
How are LLMs trained?
- There are several stages of model training
- pre-training:
- super-large text datasets (trillions of tokens) are used for unsupervised learning, namely the next-token-prediction task
- as a result, the network can only continue a given sequence, but in doing so it learns the structure and statistics of the language
- pre-training can cost up to millions of dollars
- post-training: includes supervised finetuning, alignment, etc.
- supervised finetuning: using a relatively small, high-quality annotated dataset, train the model to do a particular task well
- instruction finetuning: improve the model’s ability to follow instructions, an essential skill for any chat model
- domain finetuning: extend pre-training with a domain-specific dataset, which can also be unlabeled (in that case it is, again, next-token-prediction training)
- Reinforcement Learning from Human Feedback (RLHF), with several practical implementations: PPO, DPO
- Other methods:
Libraries for finetuning
- Hugging Face Transformers - the most popular and widely applicable
- Torchtune - PyTorch-native
- Lingua by Meta
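As a quick illustration, a minimal supervised fine-tuning sketch with Hugging Face Transformers (the first library above); the model, dataset, and hyperparameters are illustrative placeholders, not a recommended recipe:

```python
# A minimal sketch of supervised fine-tuning with Hugging Face Transformers.
# Assumptions: a small causal LM ("gpt2") and a small public text dataset;
# the dataset, hyperparameters, and output path are illustrative, not a recipe.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Any text dataset works; wikitext is just a small public example.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# mlm=False -> plain next-token-prediction (causal LM) objective
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="ft-out",
        num_train_epochs=1,
        per_device_train_batch_size=4,
    ),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```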
How to evaluate LLMs?
Model evaluation is often done using multiple metrics and approaches, depending on the task, the available labels, and resources.
Basic ML metrics
- loss function, perplexity - quantities that can be monitored directly during the training stage, on the training and validation sets. They allow one to judge the training dynamics and compare models against each other, but it’s hard to interpret these numbers in terms of the concrete business use case the model is intended for (see the perplexity sketch after this list).
- Accuracy, precision and recall, ROC-AUC, etc. - Good choices for (binary) classification tasks and named entity recognition.
- BLEU, ROUGE, METEOR - all these metrics compare two texts by quantifying token overlap in some way. Texts can be semantically similar yet formulated with completely different vocabulary, which means these metrics aren’t a good fit for tasks where the model is not expected to output exact text. This disadvantage is partially addressed by more sophisticated metrics like BERTScore, which use another, smaller language model under the hood.
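Since perplexity is just the exponent of the average next-token cross-entropy, it can be computed directly from the model loss. A minimal sketch with Hugging Face Transformers, where the model and example sentence are arbitrary choices:

```python
# A minimal sketch of computing perplexity for a causal LM;
# "gpt2" and the example sentence are arbitrary illustrations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels == input_ids makes the model return the mean
    # next-token cross-entropy loss over the sequence.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

perplexity = torch.exp(loss)  # perplexity = exp(cross-entropy)
print(f"perplexity = {perplexity.item():.2f}")
```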
Comparison and high-level requirements
- The model quality can be rated using a scale (1 to 5), either overall or by a specific criterion: language proficiency, reasoning, knowledge, planning, durability, robustness, biases, creativity, safety, helpfulness, trustworthiness, completeness, politeness, etc.
- For each of the selected metrics, there should be a well-described definition of what each score means in each category.
- Scoring itself can be done:
- using human labelers, either experts or regular people, who select the best output or rank all of them.
- This is expensive, subjective, slow and doesn’t scale.
- LLM-as-a-judge - a regular or a specifically trained LLM scores and compares outputs.
- The problem with this approach is that it works worse on complicated tasks; LLMs can have preferences driven by irrelevant, internal criteria
- here is a guide on how to create LLM-judges: Creating a LLM-as-a-Judge That Drives Business Results – Hamel’s Blog
- It is also prone to verbosity bias, self-bias, and position bias (see the debiasing sketch below)
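A rough sketch of pairwise LLM-as-a-judge comparison, assuming the OpenAI Python SDK; the model name, rubric, and prompt wording are illustrative assumptions. Judging both answer orders and keeping only stable verdicts is a simple mitigation of position bias:

```python
# A sketch of pairwise LLM-as-a-judge with a position-bias mitigation.
# Assumes the OpenAI Python SDK; model, rubric, and prompt are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are an impartial judge. Compare two answers to the
same question and reply with exactly "A", "B", or "tie".

Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}"""

def judge(question: str, answer_a: str, answer_b: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,  # make the verdict as deterministic as possible
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer_a=answer_a, answer_b=answer_b)}],
    )
    return resp.choices[0].message.content.strip()

def judge_debiased(question: str, a: str, b: str) -> str:
    """Ask in both orders; keep the verdict only if it survives the swap."""
    first = judge(question, a, b)
    second = judge(question, b, a)
    # Map the swapped-order verdict back to the original order.
    unswapped = {"A": "B", "B": "A", "tie": "tie"}.get(second, "tie")
    return first if first == unswapped else "tie"
```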
Business metrics
- It is best to check the effect of any ML model on the real business process you want to improve: for instance, money or time efficiency, user satisfaction scores, etc. An A/B test is an essential tool here.
- It can also happen that business metrics improve while the model output is still unsatisfying, for instance because of an improper output format.
Task-specific non-business metrics
- Output format correctness and robustness (e.g., valid JSON)
- Hallucinations, toxicity - important for medical and judicial applications: ChainPoll, GPTScore, SelfCheckGPT (see the sketch after this list)
- Robustness to noise, typos, factual errors, short/long context
- Ability to admit “not knowing” the answer or that the input information is insufficient
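A rough sketch of a SelfCheckGPT-style consistency check: resample several answers to the same prompt and treat low agreement with the main answer as a hallucination signal. The embedding model and the use of plain cosine similarity are simplifying assumptions (the actual method also has NLI- and prompt-based variants):

```python
# A SelfCheckGPT-style consistency check (simplified): if resampled answers
# disagree with the main answer, the model may be hallucinating.
# The embedding model and cosine-similarity scoring are illustrative choices.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def consistency_score(main_answer: str, sampled_answers: list[str]) -> float:
    """Mean cosine similarity between the main answer and resampled answers."""
    main_emb = embedder.encode(main_answer, convert_to_tensor=True)
    sample_embs = embedder.encode(sampled_answers, convert_to_tensor=True)
    return util.cos_sim(main_emb, sample_embs).mean().item()

# sampled_answers would come from re-querying the LLM at temperature > 0
score = consistency_score(
    "Paris is the capital of France.",
    ["The capital of France is Paris.", "France's capital city is Paris."],
)
print(f"consistency = {score:.2f}")  # close to 1.0 -> answers agree
```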
Links
- What We’ve Learned From A Year of Building with LLMs – Applied LLMs
- Ragas - a library that provides tools to supercharge the evaluation of Large Language Model (LLM) applications
How to improve LLMs without fine-tuning?
Prompting
- Prompting is the most popular, cheapest, and fastest method of improving LLM output quality.
- reasoning techniques often fall under prompting, but they are so effective that they have developed into a multitude of methods and combinations
- see Reasoning
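A toy illustration of one of the simplest techniques, few-shot prompting: prepend solved examples so the model continues the pattern. The task and examples are made up:

```python
# A toy few-shot prompt: the solved examples steer the model toward
# the desired behavior and output format. Task and examples are made up.
FEW_SHOT_PROMPT = """Classify the sentiment of each review as positive or negative.

Review: "Great battery life, highly recommend." -> positive
Review: "Broke after two days." -> negative
Review: "{review}" ->"""

prompt = FEW_SHOT_PROMPT.format(review="Works exactly as described.")
# Send `prompt` to any completion/chat endpoint; the model is expected
# to continue with " positive".
```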
Choose the best sampling strategy
- Under the hood, an LLM outputs a probability distribution over the whole token vocabulary; the sampling strategy defines how to pick one token from all the probable ones (see the sketch below).
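A minimal sketch of common strategies via Hugging Face Transformers’ generate(); the model and prompt are arbitrary:

```python
# A minimal sketch of sampling strategies with transformers' generate();
# "gpt2" and the prompt are arbitrary illustrations.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The weather today is", return_tensors="pt")

# Greedy decoding: always take the most probable token (deterministic).
greedy = model.generate(**inputs, max_new_tokens=20, do_sample=False)

# Temperature + top-k + nucleus (top-p) sampling: draw from a reshaped,
# truncated distribution instead of taking the argmax.
sampled = model.generate(
    **inputs,
    max_new_tokens=20,
    do_sample=True,
    temperature=0.8,  # <1 sharpens the distribution, >1 flattens it
    top_k=50,         # keep only the 50 most probable tokens...
    top_p=0.95,       # ...then the smallest set covering 95% probability
)

print(tokenizer.decode(greedy[0], skip_special_tokens=True))
print(tokenizer.decode(sampled[0], skip_special_tokens=True))
```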
Answer validation
- Impose requirements on the answer format and validate its consistency and robustness.
- GitHub - dottxt-ai/outlines: Structured Text Generation
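A minimal sketch of the validation side with Pydantic (outlines, linked above, goes further by constraining generation itself so that only schema-valid output can be produced); the schema and the raw output string are illustrative:

```python
# A minimal sketch of validating an LLM's JSON output against a schema.
# The Verdict schema and raw_output string are illustrative assumptions.
from pydantic import BaseModel, ValidationError

class Verdict(BaseModel):
    label: str         # e.g. "positive" / "negative"
    confidence: float  # expected to lie in [0, 1]

raw_output = '{"label": "positive", "confidence": 0.93}'  # from the LLM

try:
    verdict = Verdict.model_validate_json(raw_output)
    print(verdict.label, verdict.confidence)
except ValidationError as err:
    # Typical fallback: retry, appending the validation error to the prompt.
    print("invalid output:", err)
```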
Resources
- How to tune LLM models within the context of your business? | Maksud Ibrahimov
- Open-Source Text Generation & LLM Ecosystem at Hugging Face
- All You Need to Know to Build Your First LLM App | by Dominik Polzer | Jun, 2023 | Towards Data Science
- MLStack.Cafe - Kill Your Next Machine Learning, Data Science & Python Interview. Find your next ML Job.
- NLP for Supervised Learning - A Brief Survey
- Some Intuition on Attention and the Transformer
- The Map Of Transformers. A broad overview of Transformers… | by Soran Ghaderi | Towards Data Science
- Emerging Architectures for LLM Applications | Andreessen Horowitz
- What are Large Language Models? - by Etienne Bernard
- How Large Language Models Work - YouTube
- whitepaper by Google: Foundational Large Language Models & Text Generation