A Quick Guide to LLMs
This note has not been finished yet.
Main Storyline
Transformer
Google. 2017.6. “Attention is all you need.”
For a more detailed introduction, see my blog.
- Encoder-Decoder Framework. No RNNs.
- Positional Encoding.
- Introduces periodicity via trigonometric functions, allowing the model to handle input lengths longer than those seen during training.
- The positional encoding can be added directly to the token embedding because the two can be treated as (approximately) linearly independent in the embedding space.
- Self-Attention: $QKV$
- Given a query, return the values weighted by how similar their keys are to the query.
- The dot product can measure the similarity of $Q$ and $K$ because queries and keys live in the same embedding space, so their inner product reflects how aligned two vectors are.
- The $QK^\top$ scores are similarity weights, and (after a softmax) $(QK^\top)V$ extracts a weighted mixture of the input content.
- In the decoder, the $QK^\top$ scores are masked so that each position cannot attend to the positions after it.
- The output layer is a softmax over the vocabulary; decoding either takes the highest-probability token greedily or uses beam search.
- The time complexity is $O(n^2 d)$, versus $O(n d^2)$ for RNNs, where $n$ is the sequence length and $d$ is the model dimension, and attention over all positions can be computed in parallel. (A minimal sketch of positional encoding and masked attention follows this list.)
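A minimal NumPy sketch of the two mechanisms above: sinusoidal positional encoding and (optionally masked) scaled dot-product attention. The function names, shapes, and the `causal` flag are my own illustrative choices, not reference code from the paper.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding:
    PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions
    return pe

def attention(Q, K, V, causal=False):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    With causal=True, position t cannot attend to positions after t (decoder-style mask)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n, n) similarity weights
    if causal:
        n = scores.shape[0]
        keep = np.tril(np.ones((n, n), dtype=bool))   # lower triangle: past and present only
        scores = np.where(keep, scores, -1e9)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # weighted sum of the values

# Toy usage: 5 tokens, model dimension 8 (assumed even).
n, d = 5, 8
x = np.random.randn(n, d) + positional_encoding(n, d)  # positions added directly to embeddings
out = attention(x, x, x, causal=True)                   # masked self-attention
print(out.shape)  # (5, 8)
```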
GPT-1
OpenAI. 2018. “Improving language understanding by generative pre-training.”
For a more detailed introduction, see my blog.
- GPT-1 = Decoder (in Transformer, with learnable positional encoding) + Pre-Training + Fine-Tuning.
- Pre-Training: Unsupervised learning. The model is trained on unlabeled data to predict the next word, which familiarizes it with common human knowledge. (The objective is written out after this list.)
- Fine-Tuning: Supervised learning. The model is trained using labeled data for specific downstream NLP tasks.
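Concretely, pre-training maximizes the standard left-to-right language-modeling likelihood over an unlabeled corpus $\mathcal{U} = \{u_1, \dots, u_n\}$ with context window $k$, and fine-tuning on a labeled dataset $\mathcal{C}$ adds the supervised loss $L_2$ while keeping the language-modeling loss as an auxiliary term (notation follows the paper):

$$
L_1(\mathcal{U}) = \sum_i \log P\!\left(u_i \mid u_{i-k}, \dots, u_{i-1}; \Theta\right),
\qquad
L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \, L_1(\mathcal{C})
$$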
BERT
Google. 2018.10. “BERT: Pre-training of deep bidirectional transformers for language understanding.”
- BERT = Encoder (in Transformer) + Pre-Training + Fine-Tuning + Masked Input.
- Some of the input tokens are masked before they pass through the self-attention layers, e.g. (see the masking sketch after this list):
"I love singing because it is fun." -> "I [MASK] singing because it is fun."
- BERT uses the encoder, so its self-attention has no causal mask layer; every token can attend to the whole sequence.
- BERT is good at NLU (understanding), while GPT is good at NLG (generation); NLG is harder because generation is open-ended.
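A minimal sketch of the masking step, assuming the rates published in the paper: about 15% of positions are selected, and of those 80% become [MASK], 10% are replaced by a random token, and 10% are left unchanged. The whitespace tokenizer and the helper name are mine, purely for illustration.

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15, seed=0):
    """BERT-style masking: select ~15% of positions; of those,
    80% -> [MASK], 10% -> a random token, 10% -> unchanged.
    Returns (masked_tokens, labels); labels mark the positions the model must predict."""
    rng = random.Random(seed)
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_rate:
            continue                          # this position is not selected
        labels[i] = tok                       # the model must recover the original token
        r = rng.random()
        if r < 0.8:
            masked[i] = "[MASK]"
        elif r < 0.9:
            masked[i] = rng.choice(vocab)     # random replacement
        # else: keep the original token unchanged
    return masked, labels

sentence = "I love singing because it is fun .".split()
print(mask_tokens(sentence, vocab=["music", "hate", "blue", "run"], seed=3))
```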
GPT-2
OpenAI. 2019. “Language models are unsupervised multitask learners.”
For a more detailed introduction, see my blog.
- GPT-2 = Decoder (in Transformer) + Pre-Training + Turning Fine-Tuning into Pre-Training + More Parameters.
- Enhanced pre-training. Eliminated fine-tuning.
- The unsupervised objective used in pre-training is argued to have the same (global) optimum as the supervised objectives used later in fine-tuning.
- Downstream tasks can be reformulated so that they are described in the same textual form used during pre-training (see the prompting sketch after this list).
- A competent generalist is not an aggregation of narrow experts.
- The scaling law begins to emerge: the more parameters, the better the performance, and the improvement is remarkably stable.
- The Number of Parameters: 1.5B
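To make "downstream tasks rewritten as pre-training-style text" concrete, the sketch below feeds task descriptions straight to a pre-trained next-token predictor with no fine-tuning. It assumes the Hugging Face transformers package and the public gpt2 (124M) checkpoint, neither of which the note itself depends on; the small checkpoint will often answer poorly, the point is the interface.

```python
from transformers import pipeline

# Load the public 124M-parameter GPT-2 checkpoint as a plain text generator.
generator = pipeline("text-generation", model="gpt2")

# Tasks expressed in the same form as the pre-training data: plain text.
prompts = [
    "Translate English to French: cheese =>",
    "Question: What is the capital of France? Answer:",
]
for prompt in prompts:
    out = generator(prompt, max_new_tokens=8, do_sample=False)
    print(out[0]["generated_text"])
```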
Scaling Law
OpenAI. 2020.1. “Scaling Laws for Neural Language Models.”
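The headline result, sketched from the paper: test loss falls as a power law in non-embedding parameter count $N$, dataset size $D$, and training compute $C$ (when the other two are not bottlenecks), with small fitted exponents roughly in the 0.05–0.1 range:

$$
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N},
\qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D},
\qquad
L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}
$$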
GPT-3
OpenAI. 2020.5. “Language models are few-shot learners.”
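The core idea is in-context (few-shot) learning: instead of updating any weights, a handful of demonstrations is placed in the prompt and the model continues the pattern. The translation pairs below are the ones used in the paper's illustration; the model call itself is omitted, since any of the decoder models above could complete the string.

```python
# Few-shot prompting: the "training examples" live in the context window, not in the weights.
few_shot_prompt = """\
Translate English to French.
sea otter => loutre de mer
peppermint => menthe poivrée
cheese =>"""

# A sufficiently large language model should continue this with "fromage".
print(few_shot_prompt)
```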
Code
- Ollama
- OpenSpiel
Memo
- The number of parameters (a rough sanity check follows this list).
- 1K = 1,000
- 1M = 1,000,000 = one million
- 20M = twenty million
- 200M = two hundred million
- 1B = 1,000,000,000 = one billion
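A rough sanity check of the 1.5B figure quoted for GPT-2, assuming its largest configuration (48 layers, $d_{model} = 1600$) and the $\approx 12 \cdot n_{layer} \cdot d_{model}^2$ rule of thumb for non-embedding parameters used in the Scaling Laws paper; the helper name is mine.

```python
def approx_params(n_layer, d_model):
    """Rough non-embedding parameter count of a GPT-style Transformer stack:
    ~4*d^2 for the attention projections + ~8*d^2 for the MLP, per layer."""
    return 12 * n_layer * d_model ** 2

print(f"{approx_params(48, 1600):,}")  # 1,474,560,000 -> about 1.5B (GPT-2 XL, embeddings excluded)
print(f"{approx_params(12, 768):,}")   # 84,934,656    -> GPT-2 small's non-embedding parameters
```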