Transformer Architecture

Traditional text generation approaches used Recurrent Neural Networks (RNNs), but they were limited by the compute and memory they required. Text generation is, at heart, next-word prediction, and no matter how much RNNs were scaled, they still could not see enough of the input to make good predictions. This changed with the 2017 paper Attention Is All You Need, which introduced the transformer architecture. Its main benefit is that it scales efficiently to multi-core GPUs and can process input data in parallel, making use of much larger training datasets, while still paying attention to the semantic meaning of the words it is processing.

Self-attention

Self-attention is the mechanism that captures the context between each word and every other word. Between each pair of words there is an attention weight. For example, in the diagram below the word "book" pays attention mostly to the words "teacher" and "student".

[Figure: Attention Map Diagram]

Self-attention is a key attribute of the transformer architecture; it greatly improves the model's ability to encode language.

Simplified Model Architecture

The transformer architecture is split into two main parts, the encoder and the decoder, which work together and share a number of similarities. The input representation produced by the encoder influences the decoder's self-attention mechanism.

[Figure: Simplified Transformer Architecture]

Machine learning models are statistical calculators, so the inputs must be a numerical representation of the text, also called a text embedding.

To generate the text embedding we first run the text through a tokenizer, which transforms each word into its position in a dictionary of all possible words, also called a token ID. There are different types of tokenizers, which we won't go into in this summary.

[Figure: Tokenizer]

Once we have the tokens, we pass them into the embedding layer, where each token ID is mapped to a multi-dimensional vector. The intuition is that these vectors learn to encode the meaning and context of individual tokens in the input sequence. The following image shows the token IDs mapped to their respective vectors.

[Figure: Embedding Vectors]
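To make this concrete, here is a minimal sketch of tokenization and embedding lookup, assuming the Hugging Face transformers library and PyTorch; the model name and embedding dimension are illustrative, not the course's setup.

```python
# Minimal sketch: map text to token IDs, then look up a vector per token.
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")          # illustrative tokenizer
token_ids = tokenizer("The teacher taught the student", return_tensors="pt").input_ids
print(token_ids)  # each token mapped to its position (token ID) in the vocabulary

embedding = torch.nn.Embedding(num_embeddings=tokenizer.vocab_size, embedding_dim=512)
vectors = embedding(token_ids)   # one 512-dimensional vector per token
print(vectors.shape)             # (1, sequence_length, 512)
```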

In addition to the token embedding vectors that enter the base of the encoder and the decoder, we also add Positional Encodings, which preserve information about word order so the position of each word in the sentence is not lost. The sum of the token embeddings and position embeddings is passed into the self-attention layer.

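The course does not prescribe a specific scheme, but one common choice is the sinusoidal positional encoding from the original transformer paper; a minimal sketch (dimensions are illustrative):

```python
# Sinusoidal positional encodings, added elementwise to the token embeddings.
import math
import torch

def positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(seq_len).unsqueeze(1).float()
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions use sine
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions use cosine
    return pe

embeddings = torch.randn(6, 512)                      # token embeddings from the previous step
inputs = embeddings + positional_encoding(6, 512)     # word order is now encoded in the input
```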

In the self-attention layer, the model analyzes the relationships between the tokens in the input sequence, capturing the contextual dependencies between the words. Self-attention weights are learned during training and stored in the self-attention layers. However, this process does not happen just once: the transformer architecture uses multi-headed self-attention, so multiple sets of self-attention weights (heads) are learned in parallel, independently of each other.

[Figure: Multi-headed self-attention layer]
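A rough sketch of what a single self-attention head computes, assuming the standard scaled dot-product formulation; the projection matrices here are random stand-ins for learned weights:

```python
# Scaled dot-product self-attention for one head (shapes are illustrative).
import math
import torch

def self_attention(x, w_q, w_k, w_v):
    q, k, v = x @ w_q, x @ w_k, x @ w_v                        # project tokens into queries/keys/values
    scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))   # similarity between every pair of tokens
    weights = torch.softmax(scores, dim=-1)                     # attention weights sum to 1 per token
    return weights @ v                                          # context-aware token representations

seq_len, d_model = 6, 64
x = torch.randn(seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # (6, 64)
```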

The number of heads differs from model to model, but it commonly ranges from 12 to 100. The intuition is that each attention head learns a different aspect of language. The output of the attention layers is processed through a fully connected feed-forward network, whose output is a vector of logits proportional to the probability score for each token in the tokenizer dictionary. The logits are then passed to an activation function, in this case a softmax layer, which normalizes them into a probability score for each word. The word with the highest probability is the predicted next word.

[Figure: Complete Transformer Architecture]

There are multiple transformer architectures including encoder-only models, encoder-decoder models and decoder-only models.


Encoder-only models are used for classification tasks such as sentiment analysis; an example of an encoder-only model is BERT. Encoder-decoder models are great for sequence-to-sequence tasks such as translation, where input and output sequences can be of different lengths; an example is BART. Finally, decoder-only models are the most common today, such as the GPT family of models.

In-context Learning (ICL)

Providing examples of the required task to the LLM inside the context window through prompt engineering is called in-context learning. The context window is the amount of text, or memory, that is available for the prompt.

Zero-shot Inference

The prompt contains only the instruction and the input, with no completed examples of the task.

One-shot Inference

The prompt includes a single completed example of the task before the actual input.

Few-shot Inference

The prompt includes multiple completed examples, giving the model more context on the expected output, as in the sketch below.
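A hypothetical few-shot prompt for sentiment classification; the wording and task are illustrative, not taken from the course:

```python
# A made-up few-shot prompt: two completed examples, then the input to classify.
prompt = """Classify the sentiment of each review.

Review: I loved this movie!
Sentiment: Positive

Review: The plot made no sense at all.
Sentiment: Negative

Review: The acting was wonderful and the story kept me hooked.
Sentiment:"""
```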

Inference Parameters

The following configuration parameters are invoked at inference time.

  • Max new tokens: Limits the number of tokens the model will generate; the model can generate fewer than, but never more than, this number.
  • Sample top K: Selects an output from the top-k most probable tokens using the random-weighted sampling strategy. Reducing this value limits random sampling and increases the chance that the output will be sensible.
  • Sample top P: Selects an output after limiting random sampling to the predictions whose cumulative probability does not exceed p.
  • Temperature: Small temperature values concentrate the probability in a small number of words (a strongly peaked probability distribution), which makes the resulting text less random. Higher values produce a flatter probability distribution. Unlike top-k and top-p, temperature alters the shape of the probability distribution itself.
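As a hedged illustration, these knobs map onto the Hugging Face transformers generate API roughly as follows (the model name is a stand-in):

```python
# Setting inference parameters via transformers' generate() (illustrative model and values).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The teacher taught the student", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=50,   # cap on how many tokens to generate
    do_sample=True,      # switch from greedy decoding to random sampling
    top_k=50,            # sample only from the 50 most likely tokens
    top_p=0.9,           # ...restricted to cumulative probability <= 0.9
    temperature=0.7,     # sharpen the probability distribution
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```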


Most LLMs operate with greedy decoding, the simplest form of next-word prediction, where the model always chooses the word with the highest probability. This approach can lead to output that is less natural and less creative. Another method, random sampling, introduces variability by choosing a token at random, weighted by the probability of each token.
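A minimal sketch of how greedy decoding, temperature, top-k, and top-p work over a toy distribution, assuming NumPy; the vocabulary and numbers are made up:

```python
# Toy decoding strategies over a 4-token vocabulary.
import numpy as np

logits = np.array([2.0, 1.0, 0.5, 0.1])
tokens = ["cake", "donut", "banana", "apple"]

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Greedy decoding: always pick the most probable token.
print(tokens[int(np.argmax(softmax(logits)))])

# Temperature: divide logits before softmax; T < 1 sharpens, T > 1 flattens the distribution.
probs = softmax(logits / 0.7)

# Top-k: keep only the k most probable tokens, renormalize, then sample.
k = 2
top_k_idx = np.argsort(probs)[-k:]
p_k = probs[top_k_idx] / probs[top_k_idx].sum()
print(tokens[int(np.random.choice(top_k_idx, p=p_k))])

# Top-p: keep the smallest prefix of tokens whose cumulative probability stays within p.
p = 0.9
order = np.argsort(probs)[::-1]
keep = order[np.cumsum(probs[order]) <= p]
keep = order[: max(len(keep), 1)]            # always keep at least one token
p_nucleus = probs[keep] / probs[keep].sum()
print(tokens[int(np.random.choice(keep, p=p_nucleus))])
```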

Generative AI Project Lifecycle

It's important to keep in mind that these steps are highly iterative, not sequential.

[Figure: LLMs Project Lifecycle]

Pre-training LLMs

LLMs encode a deep statistical representation of language using self-supervised learning where the model internalizes the structures present in the language.

Pretraining Objectives

[Figure: Pretraining objectives]

Each architecture has a typical pretraining objective: encoder-only models are usually trained with masked language modeling, decoder-only models with causal language modeling (next-token prediction), and encoder-decoder models with span corruption.

Efficient Multi-GPU Compute Strategies

When to use distributed compute?

  • Model is too big for a single GPU
  • Model fits on a single GPU, but you want to process batches of training data in parallel to speed up training

Distributed Data Parallel (DDP)

DDP copies the model onto each GPU and sends batches of data to each GPU in parallel. A synchronization step combines the results from each GPU and updates the model identically on every GPU, resulting in faster training. NOTE: DDP requires all of the model weights, plus the additional parameters, gradients, and optimizer states needed for training, to fit on a single GPU.

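A minimal DDP sketch in PyTorch, assuming a single node with multiple GPUs and a launch via torchrun; the model is a stand-in for a full LLM:

```python
# Minimal DDP sketch; run with: torchrun --nproc_per_node=<num_gpus> this_script.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")                    # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(512, 512).cuda(local_rank)  # stand-in for the full LLM
    ddp_model = DDP(model, device_ids=[local_rank])      # replicate weights on each GPU
    optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)

    x = torch.randn(32, 512).cuda(local_rank)            # each rank gets its own data batch
    loss = ddp_model(x).pow(2).mean()
    loss.backward()                                       # gradients are all-reduced across GPUs here
    optimizer.step()                                      # every replica applies the same update

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```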

Fully Sharded Data Parallel (FSDP)

If the model cannot fit onto a single GPU, another technique that can be used is model sharding, or FSDP. In FSDP, each GPU requests the data/weights it needs from the other GPUs on demand, materializing the sharded data into unsharded data for the duration of the operation. After the backward pass, the gradients are synchronized across the GPUs in the same way as in DDP.


FSDP helps to reduce overall GPU memory utilization, and it also supports offloading to CPU if needed. One major consideration is the performance/memory trade-off that comes with sharding, which can be controlled through the sharding factor configuration. A sharding factor of 1 removes all sharding, making the setup similar to DDP. Setting it to the maximum number of GPUs turns on full sharding, which gives the most memory savings but increases the communication volume between GPUs, hence the performance trade-off.
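A minimal FSDP sketch under the same assumptions as the DDP example; the sharding strategy and CPU offload flags are illustrative options from PyTorch's FSDP API:

```python
# Minimal FSDP sketch; launch with torchrun as in the DDP example.
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy, CPUOffload

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(512, 512).cuda(local_rank)     # stand-in for a model too large for one GPU
fsdp_model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,     # shard params, grads, and optimizer states
    cpu_offload=CPUOffload(offload_params=True),       # optionally spill parameters to CPU memory
)  # ShardingStrategy.NO_SHARD behaves like DDP

x = torch.randn(32, 512).cuda(local_rank)
loss = fsdp_model(x).pow(2).mean()
loss.backward()                                         # gradients synchronized like DDP after backward

dist.destroy_process_group()
```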

Chinchilla Scaling Laws

Training Compute-Optimal Large Language Models, a paper published in 2022, discussed the relationship between three factors: model size, number of training tokens, and compute budget. Finding the balance between these factors is important for optimal model pretraining. For compute-optimal training, the number of training tokens and the model size should be scaled in equal proportion. One conclusion of the paper was that models like GPT-3 are under-trained: the number of parameters was increased while the amount of training data was kept roughly constant.


After establishing the relationship between the three factors, the authors trained a new LLM called Chinchilla, which uses the same compute budget as the 280B-parameter Gopher but has 70B parameters and four times more training data. Chinchilla outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B). This result contradicts the earlier scaling laws for LLMs published by OpenAI. Relatively smaller models can give better performance if trained on more data, and they are also easier to fine-tune and have lower latency at inference. To be compute optimal, models need not be trained to their lowest possible loss.
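A rough back-of-the-envelope check using the widely cited Chinchilla rule of thumb of roughly 20 training tokens per parameter (a heuristic from the paper, not an exact law):

```python
# Chinchilla-style rule of thumb: ~20 training tokens per parameter (heuristic only).
def compute_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    return n_params * tokens_per_param

print(f"{compute_optimal_tokens(70e9) / 1e12:.1f}T tokens")  # ~1.4T tokens for a 70B model
```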

Instruction Fine-tuning

Few-shot inference is great for giving the model some context on how to answer an instruction; however, it has several drawbacks:

  • Not very beneficial for small models.

  • Takes up a lot of space in the context window.

As a rule of thumb, if you have to add seven or more examples to the prompt for the model to learn how to answer the instruction, it's better to fine-tune the model. Unlike pretraining, where the model learns in a self-supervised way, fine-tuning is supervised learning, where we use a dataset of labeled examples to update the weights of the LLM. One fine-tuning method is called instruction fine-tuning; it trains the model on examples that demonstrate how the model should respond to a specific instruction.
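As a hypothetical illustration of what a labeled instruction fine-tuning example might look like (the template wording is made up, not a specific dataset's format):

```python
# A made-up prompt/completion pair used as one labeled training example.
training_example = {
    "prompt": "Classify this review: 'I loved this movie!'\nSentiment:",
    "completion": " Positive",
}
# During fine-tuning, the model's predicted completion is compared with the label
# (e.g., via cross-entropy loss) and the LLM weights are updated.
```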

Catastrophic Forgetting

A phenomenon that can occur when we fine-tune a model on a single task. Fine-tuning on a single task can significantly improve performance on that task, but because it modifies the weights of the original LLM, it can degrade performance on other tasks. If the model needs to generalize well across multiple tasks, we can fine-tune it on multiple tasks at the same time. Another way to avoid catastrophic forgetting is Parameter Efficient Fine-tuning (PEFT), which preserves the weights of the original LLM by freezing them and trains only a small number of task-specific adapter layers and parameters.

Fine-tuned LAnguage Net (FLAN)

An instruction fine-tuning method proposed in the 2022 paper Scaling Instruction-Finetuned Language Models. The paper demonstrates that fine-tuning the 540B PaLM model on 1,836 different tasks, while incorporating Chain-of-Thought reasoning data, achieved improvements in generalization, human usability, and zero-shot reasoning.

Model Evaluation

Deterministic Accuracy

For certain tasks, such as classification, deterministic accuracy works well for evaluating the output of LLMs.

\begin{equation} \begin{split} Accuracy &=\frac{\text{Correct Predictions}}{\text{Total Predictions}} \end{split} \end{equation}

Recall Oriented Understudy (ROUGE)

ROUGE is a metric used mainly for text summarization. It compares a generated summary to one or more reference summaries.

ROUGE for Unigrams

\begin{equation} \begin{split} \text{ROUGE-1 Recall} = \frac{\text{unigram matches}}{\text{unigrams in reference}} \end{split} \end{equation}

\begin{equation} \begin{split} \text{ROUGE-1 Precision} = \frac{\text{unigram matches}}{\text{unigrams in output}} \end{split} \end{equation}

\begin{equation} \begin{split} \text{ROUGE-1 F1} = 2 \cdot \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \end{split} \end{equation}
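A toy implementation of the ROUGE-1 formulas above, without clipping; the example sentences are illustrative:

```python
# Toy ROUGE-1: unigram overlap between a generated summary and a reference.
from collections import Counter

def rouge_1(generated: str, reference: str):
    gen = Counter(generated.lower().split())
    ref = Counter(reference.lower().split())
    matches = sum((gen & ref).values())              # unigram matches
    recall = matches / sum(ref.values())             # divided by unigrams in reference
    precision = matches / sum(gen.values())          # divided by unigrams in output
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return recall, precision, f1

print(rouge_1("It is very cold outside", "It is cold outside"))
# -> recall 1.0, precision 0.8, F1 ~0.889
```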

ROUGE for Bigrams

\begin{equation} \begin{split} \text{ROUGE-2 Recall} = \frac{\text{bigram matches}}{\text{bigrams in reference}} \end{split} \end{equation}

\begin{equation} \begin{split} \text{ROUGE-2 Precision} = \frac{\text{bigram matches}}{\text{bigrams in output}} \end{split} \end{equation}

\begin{equation} \begin{split} \text{ROUGE-2 F1} = 2 \cdot \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \end{split} \end{equation}

ROUGE - Longest Common Subsequence (LCS)

\begin{equation} \begin{split} \text{ROUGE-L Recall} = \frac{\text{LCS(Gen, Ref)}}{\text{unigrams in reference}} \end{split} \end{equation}

\begin{equation} \begin{split} \text{ROUGE-L Precision} = \frac{\text{LCS(Gen, Ref)}}{\text{unigrams in output}} \end{split} \end{equation}

\begin{equation} \begin{split} \text{ROUGE-L F1} = 2 \cdot \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \end{split} \end{equation}

ROUGE Clipping

The main advantage of clipping is that it does not falsely inflate the metric when words are repeated in the output. The disadvantage is that the metric still does not account for the order of words in the sentence.

\begin{equation} \begin{split} \text{Modified Precision} = \frac{\text{clip(unigram matches)}}{\text{unigrams in output}} \end{split} \end{equation}

Bilingual Evaluation Understudy (BLEU)

BLEU is used for text translation: it evaluates the quality of machine-translated text by comparing the output to human reference translations. It is similar to ROUGE-1, but it is calculated for multiple n-gram sizes and then averaged.

\begin{equation} \begin{split} \text{BLEU Metric} = \text{Avg(precision across range of n-gram sizes)} \end{split} \end{equation}
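For a quick sketch, NLTK's sentence-level BLEU combines n-gram precisions (up to 4-grams by default) with a brevity penalty; the sentences here are illustrative:

```python
# Sentence-level BLEU using NLTK (simple whitespace tokenization for illustration).
from nltk.translate.bleu_score import sentence_bleu

reference = "the quick brown fox jumps over the lazy dog".split()
candidate = "the quick brown fox jumps over the sleepy dog".split()
print(sentence_bleu([reference], candidate))  # score between 0 and 1
```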

Benchmarks

While ROUGE and BLEU are simple and helpful scores, they don't capture the full complexity of LLMs. To measure and compare LLMs effectively, it's better to use pre-existing benchmark datasets established by LLM researchers.

Parameter Efficient Fine-tuning (PEFT)

Full fine-tuning is not feasible on the majority of consumer GPUs. In contrast to full fine-tuning, where every model weight is updated, PEFT methods either update only a small subset of parameters or add a small number of new parameters/layers and fine-tune only those.

PEFT Methods

Selective

Selects a subset of the initial LLM parameters to fine-tune.

Reparameterization

Reparameterizes model weights using a low-rank representation, e.g., LoRA.

Additive

Adds trainable layers or parameters to the model, e.g., Adapters and Soft Prompts. Soft prompt methods manipulate the input to achieve better performance, either by adding trainable parameters to the prompt embeddings or by keeping the input fixed and retraining the embedding weights (Prompt Tuning).

Low Rank Adaptation of Large Language Models (LoRA)

  • Freeze most of the original LLM weights.
  • Inject two rank-decomposition matrices alongside the original weights.
  • Train the weights of the smaller matrices using supervised learning.

Steps to update the model for inference:

  1. Matrix-multiply the two low-rank matrices.
  2. Add the result to the original weights.

Through this process, we now have a model fine-tuned with LoRA. Because the merged model retains the same number of parameters as the original, there is no impact on inference latency.
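A minimal sketch of the LoRA update in plain PyTorch; the dimensions and rank are illustrative, and the matrices here are stand-ins for trained values:

```python
# LoRA sketch: freeze W, train low-rank A and B, then merge for inference.
import torch

d, k, r = 512, 512, 8
W = torch.randn(d, k)          # frozen pretrained weight
A = torch.randn(r, k) * 0.01   # trainable low-rank matrix
B = torch.zeros(d, r)          # B starts at zero so training begins from the original model

delta_W = B @ A                # step 1: matrix-multiply the low-rank matrices
W_merged = W + delta_W         # step 2: add the result to the original weights

# Same shape as W, so inference latency is unchanged.
print(W_merged.shape)          # torch.Size([512, 512])
```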


Since the rank decomposition matrices are small, we can fine-tune a different set for each task and switch them out at inference time by updating the weights before inference.


Soft Prompts

There are two types of prompts:

  • Hard Prompts: manually crafted text prompts with discrete input tokens; the downside is that creating a good prompt requires a lot of effort.
  • Soft Prompts: learnable tensors concatenated with the input embeddings that can be optimized on a dataset; these learnable tensors are not human readable, as the virtual tokens do not map to any real words (a minimal sketch follows the figures below).
[Figure: Prompt Tuning]
[Figure: Prefix Tuning]
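A minimal sketch of the prompt-tuning idea: a small block of learnable virtual-token embeddings is prepended to the frozen model's input embeddings (sizes are illustrative):

```python
# Prompt tuning sketch: only the soft prompt is trainable; the LLM stays frozen.
import torch

n_virtual_tokens, d_model = 20, 512
soft_prompt = torch.nn.Parameter(torch.randn(n_virtual_tokens, d_model))  # trainable virtual tokens
token_embeddings = torch.randn(6, d_model)                                # frozen model's embeddings for the input

inputs = torch.cat([soft_prompt, token_embeddings], dim=0)  # (26, 512) fed to the frozen LLM
print(inputs.shape)
```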

Reinforcement Learning from Human Feedback (RLHF)

Overview

A method that uses reinforcement learning to fine-tune a model with human feedback data, producing a model that is better aligned with human preferences. RLHF helps maximize helpfulness and relevance, minimize harm, and avoid dangerous topics.

[Figure: Reinforcement Learning from Human Feedback]

Reinforcement Learning (RL)

There are two main concepts in RL: the agent and the environment. The agent continually learns from its experience by taking actions, observing the changes in the environment, and receiving rewards or penalties based on the outcomes of its actions.

[Figure: Reinforcement Learning Overview]

Iterating through this process, the agent gradually refines a strategy/policy to make better decisions.

The following figure shows the process of using reinforcement learning to fine-tune LLMs. In this case the agent's policy that guides the actions is the LLM, and the main objective is to generate text that is aligned with human preferences, meaning, for example, that the text is helpful, accurate, and non-toxic. The environment is the context window of the model. The state the model considers before taking an action is the current context (any text in the context window). The action is the act of generating text, and the action space is the token vocabulary (all possible tokens the model can choose from when generating text). A series of actions and states is called a rollout.

[Figure: RLHF]

At any given moment, the token the model chooses depends on the prompt text in the context window and the probability distribution over the vocabulary. The reward is assigned based on how closely the generated text matches human preferences.

Evaluating Model Completions

  1. Have a human evaluate all of the model completions against some defined alignment metric (e.g., is the generated text toxic or non-toxic?). This feedback can be represented as 0 or 1, and the LLM weights are updated iteratively to maximize the reward obtained from the human classifier. However, this method is incredibly time consuming because it requires a lot of manual labor.

  2. Train a reward model to classify the outputs of the LLM and evaluate the degree of alignment with human preferences. The reward model can be trained on human-labeled examples using traditional supervised learning. It is then used to assign a reward value to each LLM output, which in turn is used to update the weights of the LLM and train a new, human-aligned version.

How the weights get updated depends on the algorithm used to optimize the policy.

Credit

This post is a summary of the Coursera course Generative AI with LLMs.

All the pictures and figures in the post are from the authors of the course.