vector embeddings

unstructured data -> model -> vector embedding
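
A minimal sketch of that pipeline, assuming the sentence-transformers model referenced later in these notes is installed:

```python
from sentence_transformers import SentenceTransformer

# unstructured data (raw text) goes through a pretrained model...
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

# ...and comes out as a fixed-length vector embedding
embedding = model.encode("Vector databases store embeddings for similarity search.")
print(embedding.shape)  # (768,) for this model
```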

vector database

A vector database indexes and stores vector embeddings for fast retrieval and similarity search.
unstructured data -> model -> index (a data structure built around a distance metric) + embeddings
use cases:

  • long-term memory for LLMs
  • semantic search: search based on meaning or context
  • similarity search for text, images, audio or video data
  • recommendation engines
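
A minimal sketch of what the index does conceptually: a brute-force cosine-similarity search over a small embedding matrix (real vector databases use approximate nearest-neighbour indexes to make this fast at scale):

```python
import numpy as np

def cosine_scores(query, matrix):
    # distance metric: cosine similarity between the query and every stored vector
    q = query / np.linalg.norm(query)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return m @ q

# toy "database": 5 stored embeddings of dimension 4
stored = np.random.rand(5, 4).astype("float32")
query = np.random.rand(4).astype("float32")

scores = cosine_scores(query, stored)
top_k = np.argsort(scores)[::-1][:3]   # indices of the 3 most similar vectors
print(top_k, scores[top_k])
```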

LangChain

A Python framework for building applications with LLMs through composability

  • Models: generic interface for LLMs
  • Prompts: prompt management, optimization and serialization
  • Chains: sequences of calls, e.g. chaining together a prompt template and an LLM (see the sketch after this list)
  • Memory: interface for memory and memory implementations
  • Indexes: utility functions to load your own text data
  • Agents & Tools: set up agents that can use tools like Google Search or Wikipedia
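
A minimal sketch of a prompt-template-plus-LLM chain. LangChain's API has changed across versions; this assumes the classic `PromptTemplate` + `LLMChain` interface and an OpenAI key in the environment:

```python
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# Prompts: a template with a variable slot
prompt = PromptTemplate(
    input_variables=["topic"],
    template="Explain {topic} in one sentence.",
)

# Models: a generic LLM interface (requires OPENAI_API_KEY to be set)
llm = OpenAI(temperature=0)

# Chains: wire the prompt template and the LLM together
chain = LLMChain(llm=llm, prompt=prompt)
print(chain.run(topic="vector embeddings"))
```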

Loss table

A loss table records the values of a loss function during model training or evaluation; it is often used in machine learning and deep learning to monitor how well a model is learning over time.
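
A minimal sketch of building such a table during a training loop; the loss values here are placeholders, not real measurements:

```python
# record training and validation loss per epoch, then print as a table
history = []
for epoch, (train_loss, val_loss) in enumerate(
    [(0.92, 0.98), (0.61, 0.70), (0.45, 0.58)], start=1  # placeholder values
):
    history.append({"epoch": epoch, "train_loss": train_loss, "val_loss": val_loss})

print(f"{'epoch':>5} {'train_loss':>11} {'val_loss':>9}")
for row in history:
    print(f"{row['epoch']:>5} {row['train_loss']:>11.4f} {row['val_loss']:>9.4f}")
```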

sentence-transformers

https://huggingface.co/sentence-transformers/all-mpnet-base-v2

A Python object (SentenceTransformer) that wraps:

  • A pretrained transformer model (e.g., BERT, RoBERTa, MiniLM)
  • A pooling layer
  • Tokenizer
  • Pretrained weights
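
A minimal usage sketch for the model linked above (sentence-transformers must be installed; the weights are downloaded on first use):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

sentences = [
    "A vector database stores embeddings.",
    "Embeddings are kept in a vector store.",
    "The weather is nice today.",
]
embeddings = model.encode(sentences)   # one 768-dim vector per sentence

# semantic similarity: the first two sentences should score higher than the third
scores = util.cos_sim(embeddings[0], embeddings[1:])
print(scores)
```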

## Components It Contains:
  • Transformer model (like BERT/MiniLM): trained and fine-tuned for sentence-level tasks (semantic similarity, paraphrase detection, etc.)
  • Tokenizer: converts text into tokens that the transformer can process
  • Pooling layer: aggregates token-level embeddings into a single sentence vector
  • Model configuration files: describe how to use the above components together
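
A sketch of how those components fit together, using the sentence-transformers `models` module to assemble a transformer and a pooling layer by hand (normally the one-line constructor above does this for you):

```python
from sentence_transformers import SentenceTransformer, models

# transformer model + tokenizer + pretrained weights (loaded from the Hugging Face Hub)
word_embedding_model = models.Transformer("sentence-transformers/all-mpnet-base-v2")

# pooling layer: mean-pool token embeddings into a single sentence vector
pooling = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True,
)

# compose the modules into a SentenceTransformer object
model = SentenceTransformer(modules=[word_embedding_model, pooling])
print(model.encode("hello world").shape)
```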

The WordPiece tokenizer, used in BERT, is a subword tokenization algorithm designed to balance the trade-off between word-level and character-level models. It was developed at Google and is used in models like BERT; its key advantage is handling unknown or rare words gracefully.
Instead of splitting text into words or characters, WordPiece splits words into common subword units based on statistical frequency in a large corpus.
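
A quick illustration using the Hugging Face tokenizer for bert-base-uncased; the exact subword split depends on that model's learned vocabulary:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# a common word stays whole; a rarer word is split into subword pieces
print(tokenizer.tokenize("playing"))      # e.g. ['playing']
print(tokenizer.tokenize("embeddings"))   # e.g. ['em', '##bed', '##ding', '##s']
```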