A Quick Guide to LLMs
This note has not been finished yet.
Main Storyline
Transformer
Google. 2017.6. “Attention is all you need.”
For a more detailed introduction, see my blog.
- Encoder-Decoder Framework. No RNNs.
- Positional Encoding.
- Introduces periodicity via trigonometric functions, allowing the model to handle input lengths longer than those seen during training.
- The positional encoding can be added directly to the token embedding because the two can be treated as (approximately) linearly independent in the embedding space.
- Self-Attention: $QKV$
- Given a query, return the values weighted by how similar their keys are to the query.
- The dot product can measure the similarity of $Q$ and $K$ because queries and keys live in the same embedding space, so their inner product reflects how aligned two vectors are.
- The $QK^\top$ scores are similarity weights, and (after a softmax) $(QK^\top)V$ extracts a weighted mixture of the input content.
- In the decoder, the $QK^\top$ scores are masked so that each position cannot attend to the positions after it.
- The output layer is a softmax over the vocabulary; decoding either takes the highest-probability token greedily or uses beam search.
- The time complexity is $O(n^2 d)$, versus $O(n d^2)$ for RNNs, where $n$ is the sequence length and $d$ is the model dimension, and attention over all positions can be computed in parallel. (A minimal sketch of positional encoding and masked attention follows this list.)
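A minimal NumPy sketch of the two mechanisms above: sinusoidal positional encoding and (optionally masked) scaled dot-product attention. The function names, shapes, and the `causal` flag are my own illustrative choices, not reference code from the paper.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding:
    PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions
    return pe

def attention(Q, K, V, causal=False):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    With causal=True, position t cannot attend to positions after t (decoder-style mask)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n, n) similarity weights
    if causal:
        n = scores.shape[0]
        keep = np.tril(np.ones((n, n), dtype=bool))   # lower triangle: past and present only
        scores = np.where(keep, scores, -1e9)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # weighted sum of the values

# Toy usage: 5 tokens, model dimension 8 (assumed even).
n, d = 5, 8
x = np.random.randn(n, d) + positional_encoding(n, d)  # positions added directly to embeddings
out = attention(x, x, x, causal=True)                   # masked self-attention
print(out.shape)  # (5, 8)
```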
GPT-1
OpenAI. 2018. “Improving language understanding by generative pre-training.”
For a more detailed introduction, see my blog.
- GPT-1 = Decoder (in Transformer, with learnable positional encoding) + Pre-Training + Fine-Tuning.
- Pre-Training: Unsupervised learning. The model is trained on unlabeled data to predict the next word, which familiarizes it with common human knowledge. (The objective is written out after this list.)
- Fine-Tuning: Supervised learning. The model is trained using labeled data for specific downstream NLP tasks.
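Concretely, pre-training maximizes the standard left-to-right language-modeling likelihood over an unlabeled corpus $\mathcal{U} = \{u_1, \dots, u_n\}$ with context window $k$, and fine-tuning on a labeled dataset $\mathcal{C}$ adds the supervised loss $L_2$ while keeping the language-modeling loss as an auxiliary term (notation follows the paper):

$$
L_1(\mathcal{U}) = \sum_i \log P\!\left(u_i \mid u_{i-k}, \dots, u_{i-1}; \Theta\right),
\qquad
L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \, L_1(\mathcal{C})
$$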
BERT
Google. 2018.10. “BERT: Pre-training of deep bidirectional transformers for language understanding.”
- BERT = Encoder (in Transformer) + Pre-Training + Fine-Tuning + Masked Input.
- Some of the input tokens are masked before they pass through the self-attention layers, e.g. (see the masking sketch after this list):
"I love singing because it is fun." -> "I [MASK] singing because it is fun."
- BERT uses the encoder, so its self-attention has no causal mask layer; every token can attend to the whole sequence.
- BERT is good at NLU (understanding), while GPT is good at NLG (generation); NLG is harder because generation is open-ended.
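A minimal sketch of the masking step, assuming the rates published in the paper: about 15% of positions are selected, and of those 80% become [MASK], 10% are replaced by a random token, and 10% are left unchanged. The whitespace tokenizer and the helper name are mine, purely for illustration.

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15, seed=0):
    """BERT-style masking: select ~15% of positions; of those,
    80% -> [MASK], 10% -> a random token, 10% -> unchanged.
    Returns (masked_tokens, labels); labels mark the positions the model must predict."""
    rng = random.Random(seed)
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_rate:
            continue                          # this position is not selected
        labels[i] = tok                       # the model must recover the original token
        r = rng.random()
        if r < 0.8:
            masked[i] = "[MASK]"
        elif r < 0.9:
            masked[i] = rng.choice(vocab)     # random replacement
        # else: keep the original token unchanged
    return masked, labels

sentence = "I love singing because it is fun .".split()
print(mask_tokens(sentence, vocab=["music", "hate", "blue", "run"], seed=3))
```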
GPT-2
OpenAI. 2019. “Language models are unsupervised multitask learners.”
For a more detailed introduction, see my blog.
- GPT-2 = Decoder (in Transformer) + Pre-Training + Turning Fine-Tuning into Pre-Training + More Parameters.
- Enhanced pre-training. Eliminated fine-tuning.
- The unsupervised objective used in pre-training is argued to have the same (global) optimum as the supervised objectives used later in fine-tuning.
- Downstream tasks can be reformulated so that they are described in the same textual form used during pre-training (see the prompting sketch after this list).
- A competent generalist is not an aggregation of narrow experts.
- The scaling law begins to emerge: the more parameters, the better the performance, and the improvement is remarkably stable.
- The Number of Parameters: 1.5B
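To make "downstream tasks rewritten as pre-training-style text" concrete, the sketch below feeds task descriptions straight to a pre-trained next-token predictor with no fine-tuning. It assumes the Hugging Face transformers package and the public gpt2 (124M) checkpoint, neither of which the note itself depends on; the small checkpoint will often answer poorly, the point is the interface.

```python
from transformers import pipeline

# Load the public 124M-parameter GPT-2 checkpoint as a plain text generator.
generator = pipeline("text-generation", model="gpt2")

# Tasks expressed in the same form as the pre-training data: plain text.
prompts = [
    "Translate English to French: cheese =>",
    "Question: What is the capital of France? Answer:",
]
for prompt in prompts:
    out = generator(prompt, max_new_tokens=8, do_sample=False)
    print(out[0]["generated_text"])
```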
Scaling Law
OpenAI. 2020.1. “Scaling Laws for Neural Language Models.”
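The headline result, sketched from the paper: test loss falls as a power law in non-embedding parameter count $N$, dataset size $D$, and training compute $C$ (when the other two are not bottlenecks), with small fitted exponents roughly in the 0.05–0.1 range:

$$
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N},
\qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D},
\qquad
L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}
$$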
GPT-3
OpenAI. 2020.5. “Language models are few-shot learners.”
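The core idea is in-context (few-shot) learning: instead of updating any weights, a handful of demonstrations is placed in the prompt and the model continues the pattern. The translation pairs below are the ones used in the paper's illustration; the model call itself is omitted, since any of the decoder models above could complete the string.

```python
# Few-shot prompting: the "training examples" live in the context window, not in the weights.
few_shot_prompt = """\
Translate English to French.
sea otter => loutre de mer
peppermint => menthe poivrée
cheese =>"""

# A sufficiently large language model should continue this with "fromage".
print(few_shot_prompt)
```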
Code
- Ollama
- OpenSpiel
Memo
- The number of parameters (a rough sanity check follows this list).
- 1K = 1,000
- 1M = 1,000,000 = one million
- 20M = twenty million
- 200M = two hundred million
- 1B = 1,000,000,000 = one billion
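A rough sanity check of the 1.5B figure quoted for GPT-2, assuming its largest configuration (48 layers, $d_{model} = 1600$) and the $\approx 12 \cdot n_{layer} \cdot d_{model}^2$ rule of thumb for non-embedding parameters used in the Scaling Laws paper; the helper name is mine.

```python
def approx_params(n_layer, d_model):
    """Rough non-embedding parameter count of a GPT-style Transformer stack:
    ~4*d^2 for the attention projections + ~8*d^2 for the MLP, per layer."""
    return 12 * n_layer * d_model ** 2

print(f"{approx_params(48, 1600):,}")  # 1,474,560,000 -> about 1.5B (GPT-2 XL, embeddings excluded)
print(f"{approx_params(12, 768):,}")   # 84,934,656    -> GPT-2 small's non-embedding parameters
```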