Evolution of embeddings

Contents

  • Note
  • Evolution of embeddings
    • [[#Evolution of embeddings#Word Embeddings|Word Embeddings]]
    • [[#Evolution of embeddings#Document embeddings|Document embeddings]]
      • [[#Document embeddings#Bag-of-Words models and shallow neural networks|Bag-of-Words models and shallow neural networks]]
      • [[#Document embeddings#Deeper neural networks|Deeper neural networks]]
    • [[#Evolution of embeddings#Image & multimodal embeddings|Image & multimodal embeddings]]
    • [[#Evolution of embeddings#Structured data embeddings|Structured data embeddings]]
    • [[#Evolution of embeddings#Graph embeddings|Graph embeddings]]
  • Resources

Note

Evolution of embeddings

Word Embeddings

  • lightweight, context-free word embeddings
  • Word2Vec
    • operates on the principle that “the semantic meaning of a word is defined by its neighbors”, i.e. by the words that frequently appear close to it in the training corpus (see the sketch after this list)
    • captures the local statistics of words within a sliding window well, but does not capture global statistics (co-occurrence across the whole corpus)
  • GloVe
    • leverages both global and local statistics of words
    • creates a co-occurrence matrix that represents the relationships between words, then uses a factorization technique to learn word representations from this matrix
  • SWIVEL (Submatrix-Wise Vector Embedding Learner)
    • also learns from the word co-occurrence matrix, but shards it into submatrices so training can be distributed across many workers
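
A minimal sketch of the sliding-window idea using gensim's Word2Vec on a toy corpus; the corpus and the vector_size/window/sg values are illustrative assumptions, not from the note:

```python
# Minimal Word2Vec sketch with gensim (toy corpus and parameters are illustrative).
from gensim.models import Word2Vec

# Each "sentence" is a pre-tokenized list of words; a real corpus would be much larger.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "pets"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=100,  # dimensionality of the word embeddings
    window=5,         # sliding-window size: how many neighbors define a word's context
    min_count=1,      # keep every word in this tiny example
    sg=1,             # 1 = skip-gram, 0 = CBOW
)

vector = model.wv["cat"]             # 100-dim embedding for "cat"
print(model.wv.most_similar("cat"))  # nearest neighbors in embedding space
```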

Document embeddings

Bag-of-Words models and shallow neural networks

  • Early embedding algorithms based on the shallow Bag-of-Words paradigm treated a document as an unordered collection of words
  • word ordering and semantic meaning are ignored
  • Doc2Vec uses a shallow neural network to generate document embeddings (see the sketch after this list)
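
A minimal Doc2Vec sketch with gensim; the documents, tags, and parameter values are illustrative assumptions:

```python
# Minimal Doc2Vec sketch with gensim (documents and parameters are illustrative).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

raw_docs = [
    ["word", "embeddings", "capture", "meaning"],
    ["documents", "can", "also", "be", "embedded"],
]

# Doc2Vec expects TaggedDocument objects: tokenized words plus a document tag.
tagged = [TaggedDocument(words=doc, tags=[i]) for i, doc in enumerate(raw_docs)]

model = Doc2Vec(tagged, vector_size=64, window=3, min_count=1, epochs=40)

doc_vec = model.dv[0]                                            # embedding of the first training document
new_vec = model.infer_vector(["embed", "a", "new", "document"])  # embedding for unseen text
```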

Deeper neural networks

  • BERT became the base model for multiple other embedding models, e.g. Sentence-BERT, SimCSE, E5
    • More complex bi-directional deep NN
    • Massive pre-training on unlabeled data with masked language modeling as the objective, so the model uses both left and right context
    • Sub-word tokenizer
    • Outputs a contextualized embedding for every token in the input; the embedding of the first token, [CLS], is used as the embedding for the whole input (see the sketch after this list)
  • T5 with 11B parameters
  • PaLM with 540B parameters
  • Model families that generate multi-vector embeddings: ColBERT, XTR
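
A minimal sketch of pulling a whole-input embedding from the [CLS] token with Hugging Face transformers; the model name bert-base-uncased and the example sentence are illustrative assumptions:

```python
# Minimal sketch: use BERT's [CLS] token as a whole-input embedding
# (model name and input text are illustrative).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Embeddings map text to vectors.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state: (batch, seq_len, hidden); position 0 is the [CLS] token.
cls_embedding = outputs.last_hidden_state[:, 0, :]  # shape (1, 768) for bert-base
print(cls_embedding.shape)
```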

Image & multimodal embeddings

Structured data embeddings

Graph embeddings

Resources


table file.inlinks, file.outlinks from [[]] and !outgoing([[]])  AND -"Changelog"