Evolution of embeddings
Contents
- Note
- Evolution of embeddings
    - [[#Evolution of embeddings#Word Embeddings|Word Embeddings]]
    - [[#Evolution of embeddings#Document embeddings|Document embeddings]]
        - [[#Document embeddings#Bag-of-Words models and shallow neural networks|Bag-of-Words models and shallow neural networks]]
        - [[#Document embeddings#Deeper neural networks|Deeper neural networks]]
    - [[#Evolution of embeddings#Image & multimodal embeddings|Image & multimodal embeddings]]
    - [[#Evolution of embeddings#Structured data embeddings|Structured data embeddings]]
    - [[#Evolution of embeddings#Graph embeddings|Graph embeddings]]
- Resources
Note
Evolution of embeddings
Word Embeddings
- Lightweight, context-free word embeddings: each word gets a single static vector, independent of its surrounding context
- Word2Vec
- operates on the principle that “the semantic meaning of a word is defined by its neighbors”, i.e. the words that frequently appear close to it in the training corpus
- accounts well for local statistics (co-occurrences within a sliding window) but does not capture global statistics over the whole corpus
- GloVe
- leverages both global and local statistics of words
- Creates a word–word co-occurrence matrix that captures how often words appear together across the whole corpus, then uses a matrix-factorization technique to learn word vectors from it (see the sketch after this list)
- SWIVEL (Submatrix-Wise Vector Embedding Learner): like GloVe, it factorizes a word co-occurrence (PMI) matrix, working on submatrix shards so training can be distributed
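
A rough NumPy sketch of the co-occurrence-and-factorization idea behind GloVe (and, in submatrix form, SWIVEL): count co-occurrences within a sliding window over a toy corpus, then factorize the log-count matrix. Real GloVe fits a weighted least-squares objective on log counts rather than running an SVD; the toy corpus, window size, and embedding dimension here are arbitrary choices.

```python
import numpy as np

# Toy corpus and sliding-window size (both arbitrary, for illustration only).
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
]
window = 2

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Build the word-word co-occurrence matrix over the whole corpus.
cooc = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                cooc[idx[w], idx[sent[j]]] += 1

# Factorize the log counts (here: truncated SVD as a simple stand-in for
# GloVe's weighted least-squares objective) to get dense word vectors.
U, S, _ = np.linalg.svd(np.log1p(cooc))
dim = 4
embeddings = U[:, :dim] * S[:dim]

for w in vocab:
    print(w, np.round(embeddings[idx[w]], 2))
```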
Document embeddings
Bag-of-Words models and shallow neural networks
- Early embedding algorithms, based on the shallow Bag-of-Words paradigm, treat a document as an unordered collection of words.
- Latent Semantic Analysis (LSA)
- Latent Dirichlet Allocation (LDA)
- TF-IDF models are statistical models that use word frequencies to represent documents (see the sketch after this list). BM25, a TF-IDF-based bag-of-words ranking function, is still a strong baseline on today’s retrieval benchmarks.
- In all of these, word ordering and semantic meaning are ignored
- Doc2Vec uses a shallow neural network to generate document embeddings
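
A minimal scikit-learn sketch of a Bag-of-Words document representation: each document becomes a sparse TF-IDF vector, and ranking is done by cosine similarity in that space. The documents and query below are made up; note that word order plays no role in the resulting vectors.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Embeddings map text to dense vectors.",
    "TF-IDF weights words by frequency and rarity.",
    "Dense vectors capture semantic similarity.",
]

# Each document becomes a sparse vector of term weights (n_docs x vocab_size).
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)

# Rank the documents against a query embedded in the same vector space.
query_vec = vectorizer.transform(["dense vector embeddings"])
scores = cosine_similarity(query_vec, doc_vectors)[0]
for score, doc in sorted(zip(scores, docs), reverse=True):
    print(round(float(score), 3), doc)
```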
Deeper neural networks
- BERT became the base model for multiple later embedding models such as Sentence-BERT, SimCSE, and E5
- A more complex, bidirectional deep neural network
- Massive pre-training on unlabeled data with masked language modeling as the objective, which lets the model use both left and right context
- Sub-word tokenizer
- Outputs a contextualized embedding for every token in the input, but the embedding of the first token, [CLS], is commonly used as the embedding for the whole input (see the sketch after this list)
- T5 with 11B parameters
- PaLM with 540B parameters
- Model families generating multi-vector embeddings: ColBERT, XTR
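
A minimal sketch of taking the [CLS] vector as a whole-input embedding, using the Hugging Face transformers library; bert-base-uncased and the example sentences are just placeholder choices.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = [
    "Embeddings evolved from Word2Vec to BERT.",
    "Contextual models embed each token differently per sentence.",
]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**batch)

# last_hidden_state has shape (batch, seq_len, hidden). Position 0 is the
# [CLS] token; its vector is commonly taken as the embedding of the input.
cls_embeddings = outputs.last_hidden_state[:, 0, :]
print(cls_embeddings.shape)  # e.g. torch.Size([2, 768])
```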
Image & multimodal embeddings
Structured data embeddings
Graph embeddings
Resources
Links to this File
table file.inlinks, file.outlinks from [[]] and !outgoing([[]]) AND -"Changelog"