Inference

Let’s start with the memory required for a single parameter, which is 4 bytes at FP32 precision.

1 Parameter (Weight) = 4 Bytes (FP32)

To calculate the memory required for 1 billion parameters, we multiply 4 bytes by a billion, which gives roughly 4 GB.

1 Billion Parameters = 4 Bytes * 10^9 ≈ 4 GB
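The calculation above can be sketched as a small helper (a minimal sketch; the function name and the 1 GB = 10^9 bytes convention are my own choices, not from the original):

```python
# Approximate memory needed just to hold model weights.
# Assumes a fixed number of bytes per parameter (4 for FP32).
def weight_memory_gb(num_params: float, bytes_per_param: int = 4) -> float:
    """Return approximate weight memory in GB (1 GB = 10^9 bytes)."""
    return num_params * bytes_per_param / 1e9

print(weight_memory_gb(1e9))       # 1B params at FP32 -> 4.0
print(weight_memory_gb(7e9, 2))    # 7B params at FP16 -> 14.0
```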

The following table shows the memory requirements for different model precisions per 1 billion parameters.

|                      | Full Precision (32-bit) | Half Precision (16-bit) | 8-bit |
|----------------------|-------------------------|-------------------------|-------|
| 1 Billion Parameters | 4 GB                    | 2 GB                    | 1 GB  |


Accordingly, you can now multiply the per-billion vRAM figure in the table above by the model's parameter count in billions, based on its precision. The table below shows the minimum memory required just to load each model for inference, without accounting for the additional memory consumed by requests hitting the model.

| Model Name   | Full Precision (32-bit) | Half Precision (16-bit) | 8-bit |
|--------------|-------------------------|-------------------------|-------|
| Falcon (7B)  | 28 GB                   | 14 GB                   | 7 GB  |
| Llama2 (7B)  | 28 GB                   | 14 GB                   | 7 GB  |
| Jais (13B)   | 52 GB                   | 26 GB                   | 13 GB |
| Jais (30B)   | 120 GB                  | 60 GB                   | 30 GB |
| Falcon (40B) | 160 GB                  | 80 GB                   | 40 GB |
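The per-model figures follow directly from parameter count times bytes per parameter. A minimal sketch reproducing the table (the precision labels and dictionary layout are my own; model sizes are from the table above):

```python
# Bytes per parameter for each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

# Model sizes in billions of parameters, from the table above.
MODELS = {
    "Falcon (7B)": 7,
    "Llama2 (7B)": 7,
    "Jais (13B)": 13,
    "Jais (30B)": 30,
    "Falcon (40B)": 40,
}

def inference_memory_gb(billions: int, precision: str) -> int:
    """Minimum GB needed to load the weights at a given precision."""
    return billions * BYTES_PER_PARAM[precision]

for name, size in MODELS.items():
    row = {p: inference_memory_gb(size, p) for p in BYTES_PER_PARAM}
    print(name, row)
```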


How much more memory you need beyond this minimum depends on your system requirements, such as the number of concurrent user queries, caching, and so on. Stress testing is recommended to determine the actual figure.

Finetuning

To fine-tune a model, we need to load all of the following components into memory, which means we need roughly 6x the minimum memory required for inference. The following shows the memory required for a full-precision model per 1 billion parameters.

| Model Component  | Full Precision Memory |
|------------------|-----------------------|
| Model Weights    | 4 GB                  |
| Optimizer States | 8 GB                  |
| Gradients        | 4 GB                  |
| Activations      | 8 GB                  |
| Total            | 24 GB                 |
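The breakdown above can be sketched as follows (a rough sketch under the table's own assumptions; in practice the optimizer-state figure depends on the optimizer, e.g. Adam keeps two FP32 moments per parameter, and activation memory varies with batch size and sequence length):

```python
# Fine-tuning memory per 1 billion parameters at full precision (FP32),
# following the component breakdown in the table above.
COMPONENTS_GB_PER_BILLION = {
    "model_weights": 4,      # 4 bytes/param
    "optimizer_states": 8,   # e.g. Adam: two FP32 moments = 8 bytes/param
    "gradients": 4,          # 4 bytes/param
    "activations": 8,        # rough estimate; depends on batch/sequence
}

total_gb_per_billion = sum(COMPONENTS_GB_PER_BILLION.values())
print(total_gb_per_billion)            # 24 GB per billion parameters
print(total_gb_per_billion / 4)        # 6x the 4 GB needed for inference
```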


However, this makes fine-tuning very large models infeasible at full precision. It is recommended to use reduced precision, either half precision (16-bit) or 8-bit, during fine-tuning.

The following table compares the minimum fine-tuning memory requirements across precision types.

| Model Name   | Full Precision (32-bit) | Half Precision (16-bit) | 8-bit  | 4-bit  |
|--------------|-------------------------|-------------------------|--------|--------|
| Falcon (7B)  | 168 GB                  | 84 GB                   | 42 GB  | 21 GB  |
| Llama2 (7B)  | 168 GB                  | 84 GB                   | 42 GB  | 21 GB  |
| Jais (13B)   | 312 GB                  | 156 GB                  | 78 GB  | 39 GB  |
| Jais (30B)   | 720 GB                  | 360 GB                  | 180 GB | 90 GB  |
| Falcon (40B) | 960 GB                  | 480 GB                  | 240 GB | 120 GB |
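These figures are the inference numbers scaled by the 6x fine-tuning factor, i.e. 24 GB per billion parameters at FP32, halved at each step down in precision. A minimal sketch (the per-precision constants are derived from the 24 GB FP32 figure above; precision labels are my own):

```python
# Fine-tuning GB per billion parameters, derived from the 6x rule:
# 24 GB at FP32, halving with each halving of precision.
FINETUNE_GB_PER_BILLION = {"fp32": 24, "fp16": 12, "int8": 6, "int4": 3}

def finetune_memory_gb(billions: int, precision: str) -> int:
    """Approximate minimum GB needed to fine-tune at a given precision."""
    return billions * FINETUNE_GB_PER_BILLION[precision]

print(finetune_memory_gb(40, "fp32"))  # Falcon (40B) -> 960
print(finetune_memory_gb(7, "int4"))   # Falcon (7B)  -> 21
```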