Inference

Initially, let's start with the memory required for a single parameter, which is 4 bytes at FP32 precision.

1 Parameter (Weights) = 4 Bytes (FP32)

To calculate the memory required for 1 billion parameters, we multiply 4 bytes by one billion, which gives us roughly 4 GB.

1 Billion Parameters = 4 Bytes * 10^9 = 4 * 10^9 Bytes ≈ 4 GB

The following table shows the memory requirements for different model precisions per 1 billion parameters.

| | Full Precision (32-bit) | Half Precision (16-bit) | 8-bit |
|---|---|---|---|
| 1 Billion Parameters | 4 GB | 2 GB | 1 GB |
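
If you prefer to compute this programmatically, here is a minimal Python sketch of the same rule of thumb (the dictionary and function names are just illustrative):

```python
# Minimal sketch: weight-only memory by precision (names are illustrative).
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_memory_gb(num_params: int, precision: str) -> float:
    """Memory (in GB, i.e. 10^9 bytes) needed just to hold the weights."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

for precision in BYTES_PER_PARAM:
    print(f"1B params @ {precision}: {weight_memory_gb(10**9, precision):.0f} GB")
# 1B params @ fp32: 4 GB, fp16: 2 GB, int8: 1 GB
```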


Accordingly, you can now multiply the VRAM number in the table above by the number of billions of parameters in the model, based on its precision. The table below shows the minimum memory required to load each model for inference, without accounting for the additional memory consumed while serving requests against the model.

| Model Name | Full Precision (32-bit) | Half Precision (16-bit) | 8-bit |
|---|---|---|---|
| Falcon (7B) | 28 GB | 14 GB | 7 GB |
| Llama2 (7B) | 28 GB | 14 GB | 7 GB |
| Jais (13B) | 52 GB | 26 GB | 13 GB |
| Jais (30B) | 120 GB | 60 GB | 30 GB |
| Falcon (40B) | 160 GB | 80 GB | 40 GB |
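
As a quick sketch of how you would apply this multiplication in practice, the snippet below checks which precisions let each model's weights fit within an example VRAM budget (the 80 GB figure and all names are illustrative; model sizes are taken from the table above):

```python
# Sketch: which precisions allow the weights alone to fit in a given VRAM budget?
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}
MODEL_PARAMS = {
    "Falcon (7B)": 7e9,
    "Llama2 (7B)": 7e9,
    "Jais (13B)": 13e9,
    "Jais (30B)": 30e9,
    "Falcon (40B)": 40e9,
}

def fits(num_params: float, precision: str, vram_gb: float) -> bool:
    """True if the weights alone fit in vram_gb (ignores serving overhead)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9 <= vram_gb

vram_budget_gb = 80  # example budget only, e.g. a single 80 GB GPU
for name, n in MODEL_PARAMS.items():
    ok = [p for p in BYTES_PER_PARAM if fits(n, p, vram_budget_gb)]
    print(f"{name}: fits at {ok} within {vram_budget_gb} GB")
```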


How much more memory you need depends on your system requirements, such as the number of concurrent user queries, caching, and so on. I believe stress testing is required to determine this.

Finetuning

To fine-tune a model, we need to load all of the following components into memory, which means we need roughly 6x the minimum memory required for inference. The following table shows the memory required for a full-precision model per 1 billion parameters.

| Model Component | Full Precision Memory |
|---|---|
| Model Weights | 4 GB |
| Optimizer States | 8 GB |
| Gradients | 4 GB |
| Activations | 8 GB |
| Total | 24 GB |
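
The 24 GB total is simply the sum of the per-parameter costs of each component (4 + 8 + 4 + 8 = 24 bytes per parameter) multiplied by the parameter count. A minimal sketch of that arithmetic, using the component sizes from the table above:

```python
# Sketch: full-precision fine-tuning memory per parameter, from the table above.
BYTES_PER_PARAM_FT = {
    "weights": 4,           # FP32 weights
    "optimizer_states": 8,  # e.g. two FP32 Adam moments per parameter
    "gradients": 4,         # FP32 gradients
    "activations": 8,       # rule-of-thumb estimate used in the table
}

def finetune_memory_gb(num_params: float) -> float:
    """Total full-precision fine-tuning memory in GB."""
    return num_params * sum(BYTES_PER_PARAM_FT.values()) / 1e9

print(finetune_memory_gb(1e9))  # 24.0 GB per 1B parameters
print(finetune_memory_gb(7e9))  # 168.0 GB for a 7B model such as Falcon (7B)
```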


However, this makes fine-tuning very large models at full precision infeasible. It is recommended to use mixed or reduced precision, such as half precision (16-bit) or 8-bit precision, while fine-tuning.

The following table shows the minimum memory requirements for fine-tuning at different precision types.

| Model Name | Full Precision (32-bit) | Half Precision (16-bit) | 8-bit | 4-bit |
|---|---|---|---|---|
| Falcon (7B) | 168 GB | 84 GB | 42 GB | 21 GB |
| Llama2 (7B) | 168 GB | 84 GB | 42 GB | 21 GB |
| Jais (13B) | 312 GB | 156 GB | 78 GB | 39 GB |
| Jais (30B) | 720 GB | 360 GB | 180 GB | 90 GB |
| Falcon (40B) | 960 GB | 480 GB | 240 GB | 120 GB |
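
Equivalently, each number in this table is just 6x the weight-only (inference) memory at the same precision. A small sketch reproducing the table under that assumption (names are illustrative; the sizes and the 6x factor are taken from the tables above):

```python
# Sketch: fine-tuning memory ~= 6x the weight-only memory at the same precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}
MODEL_PARAMS = {"Falcon (7B)": 7e9, "Llama2 (7B)": 7e9, "Jais (13B)": 13e9,
                "Jais (30B)": 30e9, "Falcon (40B)": 40e9}
FINETUNE_FACTOR = 6  # weights + optimizer states + gradients + activations

for name, n in MODEL_PARAMS.items():
    row = {p: FINETUNE_FACTOR * n * b / 1e9 for p, b in BYTES_PER_PARAM.items()}
    print(name, row)
# e.g. Falcon (40B) -> {'fp32': 960.0, 'fp16': 480.0, 'int8': 240.0, 'int4': 120.0}
```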