BERT
BERT is a powerful natural language processing (NLP) model developed by Google. It stands for Bidirectional Encoder Representations from Transformers. BERT is designed to understand the context of words in a sentence by looking at both the words before and after them, unlike previous models that read text sequentially. This bidirectional approach allows BERT to grasp the nuances of language more effectively, making it useful for a wide range of NLP tasks.
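A minimal sketch of producing contextual embeddings with a pretrained BERT via the Hugging Face transformers library; the checkpoint name and example sentence are only illustrative.

```python
# Sketch: contextual token embeddings from a pretrained BERT (Hugging Face transformers).
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentence = "The bank raised interest rates."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per token: shape (1, num_tokens, 768).
# The vector for "bank" depends on the words on both sides of it.
print(outputs.last_hidden_state.shape)
```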
vector embeddings
unstructured data -> model -> vector embedding
vector database
A vector database indexes and stores vector embeddings for fast retrieval and similarity search.
unstructured data -> model -> index (data structure including a distance metric) + embeddings
use cases:
- long-term memory for LLMs
- semantic search: search based on meaning or context
- similarity search for text, images, audio, or video data
- recommendation engines
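A minimal sketch of the core operation behind these use cases: store embeddings and return the closest ones under a distance metric (brute-force cosine similarity with NumPy here); the stored vectors are made up for illustration, and a real vector database adds an index structure to make this fast at scale.

```python
# Sketch: brute-force similarity search over stored embeddings
# (what a vector database does, minus the indexing that makes it scale).
import numpy as np

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Pretend these came from an embedding model (unstructured data -> model -> vector).
stored = {
    "doc_about_cats": np.array([0.9, 0.1, 0.0]),
    "doc_about_dogs": np.array([0.8, 0.2, 0.1]),
    "doc_about_finance": np.array([0.0, 0.1, 0.95]),
}

query = np.array([0.85, 0.15, 0.05])  # embedding of the search query

# Rank stored items by similarity to the query (semantic / similarity search).
ranked = sorted(stored.items(), key=lambda kv: cosine_similarity(query, kv[1]), reverse=True)
for name, _ in ranked:
    print(name)
```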
LangChain
Python framework for building applications with LLMs through composability
- Models: generic interface for LLMs
- Prompts: prompt management, optimization and serialization
- Chains: sequences of calls; chain together a prompt template and an LLM (see the sketch after this list)
- Memory: interface for memory and memory implementations
- Indexes: utility functions to load your own text data
- Agents & Tools: set up agents that can use tools like Google Search or Wikipedia
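A minimal sketch of a chain, assuming the classic LangChain API (imports have moved between versions, and newer releases prefer the `prompt | llm` composition syntax); the `OpenAI` LLM wrapper requires an API key.

```python
# Sketch: prompt template + LLM chained together (classic LangChain-style API).
from langchain.prompts import PromptTemplate
from langchain.llms import OpenAI          # assumes OPENAI_API_KEY is set
from langchain.chains import LLMChain

prompt = PromptTemplate(
    input_variables=["product"],
    template="Suggest a name for a company that makes {product}.",
)

llm = OpenAI(temperature=0.7)
chain = LLMChain(llm=llm, prompt=prompt)

print(chain.run(product="eco-friendly water bottles"))
```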
Loss table
A loss table records the values of a loss function during model training or evaluation; it is commonly used in machine learning and deep learning to monitor how well a model is learning over time.
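A minimal sketch of building such a table by hand during training; the loss values are fabricated placeholders standing in for the output of a real training loop.

```python
# Sketch: recording a loss table during training (loss values are made up).
loss_table = []

for epoch, fake_loss in enumerate([0.92, 0.61, 0.44, 0.35, 0.31], start=1):
    loss_table.append({"epoch": epoch, "train_loss": fake_loss})

# Print the table: a steadily falling loss suggests the model is still learning.
print(f"{'epoch':>5} | {'train_loss':>10}")
for row in loss_table:
    print(f"{row['epoch']:>5} | {row['train_loss']:>10.2f}")
```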
sentence-transformers
https://huggingface.co/sentence-transformers/all-mpnet-base-v2
A Python object (SentenceTransformer) that wraps:
- A pretrained transformer model (e.g., BERT, RoBERTa, MiniLM)
- A pooling layer
- Tokenizer
- Pretrained weights
Components it contains:
- Transformer model (like BERT/MiniLM): trained and fine-tuned for sentence-level tasks (semantic similarity, paraphrase detection, etc.)
- Tokenizer: converts text into tokens that the transformer can process
- Pooling layer: aggregates token-level embeddings into a single sentence vector
- Model configuration files: describe how to use the above components together
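A minimal sketch of using the model linked above: `SentenceTransformer` runs tokenization, the transformer forward pass, and pooling, returning one vector per sentence; `util.cos_sim` is the library's cosine-similarity helper.

```python
# Sketch: sentence embeddings + similarity with sentence-transformers.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

sentences = ["A man is eating food.", "Someone is having a meal."]
embeddings = model.encode(sentences)   # shape: (2, 768) for this model

# Cosine similarity between the two sentence vectors (close to 1 = similar meaning).
print(util.cos_sim(embeddings[0], embeddings[1]))
```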
The WordPiece tokenizer, used in BERT, is a subword tokenization algorithm designed to balance the trade-off between word-level and character-level models. It was originally developed by Google for models like BERT and has some key advantages in handling unknown or rare words.
Instead of splitting text into words or characters, WordPiece splits words into common subword units based on statistical frequency in a large corpus.
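A minimal sketch of WordPiece in action via BERT's tokenizer; the exact split depends on the learned vocabulary, but continuation pieces are marked with a "##" prefix.

```python
# Sketch: WordPiece subword tokenization with BERT's tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A rare word is split into subword units; continuation pieces start with "##".
# (The exact split depends on the vocabulary learned from the training corpus.)
print(tokenizer.tokenize("unbelievability"))

# A common word is usually kept whole.
print(tokenizer.tokenize("understanding"))
```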
Generative AI
input -> tokenization -> input layer (understand each token) -> [Attention (understand context) -> feed forward (integrate and process)] Transformer block x N -> output layer
token list -> defined by the model builder, based on their understanding of the language: BPE (Byte Pair Encoding)
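A toy sketch of how BPE builds its token list: count adjacent symbol pairs in a tiny corpus and merge the most frequent pair into a new symbol; real implementations weight by word frequency and repeat for thousands of merges until the target vocabulary size is reached.

```python
# Toy sketch of Byte Pair Encoding: repeatedly merge the most frequent adjacent pair.
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words (each word is a list of symbols)."""
    pairs = Counter()
    for symbols in words:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

# Tiny "corpus", each word split into characters to start.
words = [list(w) for w in ["low", "lower", "lowest", "newer", "newest"]]

for _ in range(4):                      # a real tokenizer runs thousands of merges
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print("merged", pair, "->", words)
```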
embedding: token -> vector
Token embeddings are learned during training, but do not take context into account (e.g., "bank" gets one vector whether it means a financial institution or a riverbank)
Positional Embedding: encodes where each token sits in the sequence
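A minimal sketch of the input layer: a learned lookup table maps token ids to vectors (context-free), and a positional embedding adds position information; the sinusoidal variant from the original Transformer paper is used here, and the sizes and token ids are made up.

```python
# Sketch: token embedding (learned lookup, context-free) + sinusoidal positional embedding.
import math
import torch

vocab_size, d_model, seq_len = 1000, 16, 5

# Token embedding: one trainable vector per token id; the same id always maps
# to the same vector, regardless of context.
token_embedding = torch.nn.Embedding(vocab_size, d_model)
token_ids = torch.tensor([[3, 14, 159, 2, 65]])         # made-up token ids
tok = token_embedding(token_ids)                        # shape: (1, 5, 16)

# Sinusoidal positional embedding (Vaswani et al., 2017).
pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)              # (5, 1)
div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                * (-math.log(10000.0) / d_model))                          # (8,)
pe = torch.zeros(seq_len, d_model)
pe[:, 0::2] = torch.sin(pos * div)
pe[:, 1::2] = torch.cos(pos * div)

# The transformer block's input is the sum of the two.
x = tok + pe.unsqueeze(0)                               # shape: (1, 5, 16)
print(x.shape)
```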
Attention: takes context into account (e.g., "apple" in "Apple computer" vs. "eat an apple"); the relevance between embeddings (the attention weights) is learned during training
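A minimal sketch of scaled dot-product attention: the attention weights (softmax of query-key similarity) say how relevant each other token is, and are used to mix the value vectors; random tensors stand in for the projected embeddings.

```python
# Sketch: scaled dot-product attention, weights = softmax(Q K^T / sqrt(d_k)).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_k = 5, 16

# In a real transformer Q, K, V come from multiplying the token embeddings
# by learned weight matrices; random tensors stand in for them here.
Q = torch.randn(seq_len, d_k)
K = torch.randn(seq_len, d_k)
V = torch.randn(seq_len, d_k)

scores = Q @ K.T / d_k ** 0.5            # how relevant each token is to each other token
weights = F.softmax(scores, dim=-1)      # attention weights, each row sums to 1
context = weights @ V                    # each row is a context-aware mixture of values

print(weights.shape, context.shape)      # (5, 5) and (5, 16)
```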
Feed Forward: multiple embeddings (after attention) -> one embedding
Output Layer: linear transformation + softmax -> probability distribution over the vocabulary
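A minimal sketch of the last two steps: a position-wise feed-forward layer transforms each attention output, then the output layer (linear projection + softmax) turns the final position's embedding into a probability distribution over the vocabulary; all weights here are untrained and the sizes are made up.

```python
# Sketch: position-wise feed-forward layer + output layer (linear + softmax).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_model, vocab_size = 5, 16, 1000

context = torch.randn(seq_len, d_model)           # stand-in for the attention output

# Feed forward: applied to each position independently, integrating the
# information that attention gathered into one updated embedding per position.
feed_forward = torch.nn.Sequential(
    torch.nn.Linear(d_model, 4 * d_model),
    torch.nn.ReLU(),
    torch.nn.Linear(4 * d_model, d_model),
)
hidden = feed_forward(context)                    # shape: (5, 16)

# Output layer: linear transformation to vocabulary size + softmax.
output_layer = torch.nn.Linear(d_model, vocab_size)
logits = output_layer(hidden[-1])                 # predict the next token from the last position
probs = F.softmax(logits, dim=-1)                 # probability distribution over the vocabulary

print(probs.shape, probs.sum())                   # torch.Size([1000]) and ~1.0
```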
Image: Non-autoregressive generation