Efficiently Serving LLMs
The following is a summary of the course Efficiently Serving LLMs, which discusses the technical details of serving large language models. The main ideas revolve around caching parts of the computation inside the transformer network to improve performance, along with techniques for handling many requests from multiple users simultaneously. These techniques include vectorization, which allows the model to process multiple inputs in a single operation, and KV caching, which stores intermediate results of the transformer's attention calculation in memory as each token is generated, avoiding redundant computation. KV caching is especially helpful for auto-regressive models, which generate tokens one at a time.
Other techniques discussed in the course include batching, which packs several prompts into a single tensor so the model can process multiple inputs at the same time, and continuous batching, which extends this idea by dynamically updating the batch as new requests come in and old requests complete. Continuous batching helps maintain low latency and high throughput at the same time.
The course also covers popular techniques for reducing the memory footprint of the model, such as quantization, which converts model weights to a lower-precision representation. LoRA is a parameter-efficient fine-tuning technique that makes it possible to dynamically load fine-tuned adapters at runtime. Combining multiple LoRA adapters with continuous batching makes it possible to serve hundreds of fine-tuned models simultaneously with high throughput and low latency.
KV Caching
In auto-regressive models, predicting the i-th token depends on every j-th token where j < i. Without caching, the attention keys and values of all previous tokens are recomputed for each newly generated token, so the longer the sequence grows, the more redundant the computation becomes. KV caching stores the key and value tensors of all previously processed tokens, so each decoding step only has to compute them for the newest token. This technique only works for auto-regressive models such as the GPT family of architectures.
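As a rough illustration (not the course's code), the sketch below shows single-head attention with a KV cache. The projection matrices `w_q`, `w_k`, `w_v` and the cache layout are simplified placeholders; real implementations cache keys and values per layer and per attention head.

```python
import torch

# Minimal single-head attention with a KV cache (illustrative sketch only).
# w_q, w_k, w_v are hypothetical projection matrices of shape (d_model, d_model).
def attend_with_cache(x_new, w_q, w_k, w_v, cache=None):
    """x_new: (1, d_model) embedding of the newly generated token."""
    q = x_new @ w_q                      # query is only needed for the new token
    k = x_new @ w_k
    v = x_new @ w_v
    if cache is None:
        keys, values = k, v
    else:
        # Reuse cached keys/values instead of recomputing them for old tokens.
        keys = torch.cat([cache["k"], k], dim=0)
        values = torch.cat([cache["v"], v], dim=0)
    cache = {"k": keys, "v": values}
    scores = torch.softmax(q @ keys.T / keys.shape[-1] ** 0.5, dim=-1)
    return scores @ values, cache

# Usage: pass cache=None for the first token, then feed the returned cache
# back in at every subsequent decoding step.
```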
Batching
GPUs are fast thanks to their parallel architecture; however, when running a large language model, most of the memory bandwidth is spent loading model parameters. Batching loads the model parameters once and uses them to process multiple input sequences, which increases GPU utilization and provides higher throughput.
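As an illustration, here is a minimal batched-generation sketch using the Hugging Face transformers API; the model name, prompts, and generation settings are placeholders rather than anything prescribed by the course.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pad several prompts into one tensor and run a single batched generate call.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token        # GPT-2 has no pad token
tokenizer.padding_side = "left"                  # left-pad for decoder-only generation
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompts = ["The weather today is", "Efficient LLM serving means"]
batch = tokenizer(prompts, return_tensors="pt", padding=True)

# Each decoding step now serves every sequence in the batch, so the model
# weights are read from memory once per step, not once per prompt.
outputs = model.generate(**batch, max_new_tokens=20,
                         pad_token_id=tokenizer.eos_token_id)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```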
Static Batching
In static (or naive) batching, the composition of the batch remains constant until inference is complete for every sequence. The following image shows 4 sequences: the yellow tokens represent the prompt tokens, the blue ones are the generated tokens, and the red one is the end-of-sequence token. The second sequence took longer than the others, and the process had to wait for all sequences in the batch to finish, which means the GPU was underutilized.
Static batching achieves the highest GPU utilization when all input and output sequences have the same length.
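A minimal sketch of a static-batch decode loop, assuming a hypothetical `decode_step` function that produces the next token for every slot in the batch; notice that the loop only exits once every sequence has emitted EOS, so slots that finish early sit idle.

```python
# Illustrative static-batching loop (a sketch, not any framework's scheduler).
# decode_step is a hypothetical function returning the next token id for each
# sequence in the batch; EOS_ID marks the end-of-sequence token.
def generate_static(batch, decode_step, EOS_ID, max_steps=256):
    finished = [False] * len(batch)
    for _ in range(max_steps):
        next_tokens = decode_step(batch)          # one step for the whole batch
        for i, tok in enumerate(next_tokens):
            if not finished[i]:
                batch[i].append(tok)
                finished[i] = tok == EOS_ID
        # The batch is only released when *every* sequence has finished,
        # so the slots of early-finishing sequences are wasted GPU work.
        if all(finished):
            break
    return batch
```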
Dynamic Batching
Dynamic or continuous batching is used when sequences have different lengths. Instead of waiting for all sequences to finish generating, the batch composition is determined per iteration: once a sequence in the batch has finished, a new sequence is inserted in its place, which yields higher GPU utilization.
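The following toy scheduler sketches the idea; `decode_step`, `EOS_ID`, and the request objects are hypothetical placeholders, not any particular framework's implementation.

```python
from collections import deque

# Illustrative continuous-batching loop: free slots are refilled from the
# request queue at every decoding iteration.
def serve_continuous(requests, decode_step, EOS_ID, max_batch_size=8):
    queue = deque(requests)
    active, completed = [], []
    while queue or active:
        # Refill free slots with waiting requests before every step.
        while queue and len(active) < max_batch_size:
            active.append(queue.popleft())
        next_tokens = decode_step(active)          # one step over the current batch
        still_running = []
        for req, tok in zip(active, next_tokens):
            req["tokens"].append(tok)
            if tok == EOS_ID:
                completed.append(req)              # this slot frees up immediately
            else:
                still_running.append(req)
        active = still_running
    return completed
```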
Throughput-Latency Trade off
Latency depends on various factors such as network speed, input sequence length, and model size. Throughput is how many requests can be processed per unit of time. In general, using better GPUs improves both throughput and latency. Batching tends to improve throughput but worsens latency, which introduces a throughput-latency trade-off. The following image shows the latency and throughput without batching.
Notice how, after batching, the throughput of the LLM server improved but the latency got worse.
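A back-of-the-envelope way to measure both quantities, assuming a hypothetical `generate_batch` function, might look like this:

```python
import time

# Measurement sketch: generate_batch is a hypothetical function that produces
# completions for a list of prompts in one batch.
def measure(generate_batch, prompts, batch_size):
    num_batches = 0
    start = time.perf_counter()
    for i in range(0, len(prompts), batch_size):
        generate_batch(prompts[i:i + batch_size])
        num_batches += 1
    elapsed = time.perf_counter() - start
    throughput = len(prompts) / elapsed        # requests completed per second
    avg_batch_time = elapsed / num_batches     # rough proxy for per-request latency
    return throughput, avg_batch_time

# Larger batch sizes typically raise throughput but also raise avg_batch_time,
# which is the trade-off shown in the plots above.
```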
Quantization
Quantization methods aim to reduce data size while preserving accuracy by converting values to data types that use fewer bits. For instance, converting model weights from 32-bit to 16-bit floating point halves the model size, easing storage and decreasing memory usage.
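As a simplified illustration, the sketch below applies symmetric int8 quantization to a single weight tensor; production schemes typically use per-channel scales, calibration data, and more careful rounding.

```python
import torch

# Minimal symmetric int8 quantization of one weight tensor (illustrative only).
def quantize_int8(weights: torch.Tensor):
    scale = weights.abs().max() / 127.0            # map the largest magnitude to 127
    q = torch.clamp((weights / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale                       # approximate original weights

w = torch.randn(4096, 4096)                        # ~64 MB as fp32
q, scale = quantize_int8(w)                        # ~16 MB as int8
print((w - dequantize(q, scale)).abs().max())      # small quantization error
```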
Low-Rank Adaptation (LoRA)
Large language models are hard to fine-tune when all model parameters must be updated and stored. GPT-3, for example, has 175B parameters, so deploying independent fully fine-tuned instances of such a model would be extremely expensive. LoRA freezes the pretrained model weights and trains small low-rank matrices instead, greatly reducing the number of trainable parameters. Compared to full fine-tuning of GPT-3, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times.
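A minimal sketch of the idea applied to a single linear layer, with illustrative rank and alpha values (not the course's implementation):

```python
import torch
import torch.nn as nn

# LoRA sketch: the frozen base weight is left untouched and a low-rank
# update B @ A is learned in its place.
class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # y = base(x) + scale * x A^T B^T — only A and B are trainable.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 65,536 trainable values vs. ~16.8M in the full layer
```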
Multi-LoRA Inference
Multiple LoRAs allow us to have several fine-tuned adapters all attached to one base model (a sketch follows the list below). There are multiple use cases for multiple LoRAs, such as:
- Training on different segments of data
- Chaining several related tasks
- Supporting multiple tenants: having multiple users fine-tune and serve adapters on one base model
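As referenced above, here is an illustrative sketch of selecting one of several adapters per request while sharing a single frozen base weight; the adapter names, shapes, and lookup scheme are hypothetical.

```python
import torch

# Multi-LoRA lookup sketch: several low-rank adapters share one frozen base
# weight and are selected per request by an adapter id.
d_model, rank = 4096, 8
base_weight = torch.randn(d_model, d_model)        # shared, frozen

adapters = {
    "customer-support": (torch.randn(rank, d_model) * 0.01, torch.zeros(d_model, rank)),
    "sql-generation":   (torch.randn(rank, d_model) * 0.01, torch.zeros(d_model, rank)),
}

def forward_with_adapter(x, adapter_id):
    A, B = adapters[adapter_id]
    # Shared base projection plus the per-tenant low-rank correction.
    return x @ base_weight.T + x @ A.T @ B.T

x = torch.randn(1, d_model)
y = forward_with_adapter(x, "customer-support")
```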
LoRA eXchange (LoRAX)
LoRAX is an open-source project designed for serving many fine-tuned models at once on a shared set of GPUs (an illustrative client request is sketched after the list below). Compared with conventional dedicated LLM deployments, LoRAX consists of three novel components:
- Dynamic Adapter Loading, allowing each set of fine-tuned LoRA weights to be loaded from storage just-in-time as requests come in at runtime, without blocking concurrent requests.
- Tiered Weight Caching, to support fast exchanging of LoRA adapters between requests, and offloading of adapter weights to CPU and disk to avoid out-of-memory errors.
- Continuous Multi-Adapter Batching, a fair scheduling policy for optimizing aggregate throughput of the system, extending the popular continuous batching strategy to work across multiple sets of LoRA adapters in parallel.
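For a sense of what serving against such a system might look like, here is a hypothetical client request; the endpoint path, JSON field names, and adapter id are assumptions about LoRAX's REST interface and should be checked against its documentation.

```python
import requests

# Hypothetical request to a running LoRAX server: the /generate path and the
# parameter names below are assumptions, and the adapter id is a placeholder.
resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Summarize this support ticket: ...",
        "parameters": {
            "max_new_tokens": 64,
            "adapter_id": "my-org/customer-support-lora",  # loaded just-in-time
        },
    },
)
print(resp.json())
```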
Credit
All credit goes to the course content from Deeplearning.ai Efficiently Serving LLMs