Inference

Let’s start with the memory required for a single parameter, which is 4 bytes at FP32 precision.

1 Parameter (Weight) = 4 Bytes (FP32)

To calculate the memory required for 1 billion parameters, we multiply 4 bytes by a billion, which gives roughly 4 GB.

1 Billion Parameters = 4 Bytes * 10^9 ≈ 4 GB
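The calculation above can be sketched as a small helper (a minimal sketch; the function name and the 1 GB = 10^9 bytes convention are my own choices, not from the original):

```python
# Approximate memory needed just to hold model weights.
# Assumes a fixed number of bytes per parameter (4 for FP32).
def weight_memory_gb(num_params: float, bytes_per_param: int = 4) -> float:
    """Return approximate weight memory in GB (1 GB = 10^9 bytes)."""
    return num_params * bytes_per_param / 1e9

print(weight_memory_gb(1e9))       # 1B params at FP32 -> 4.0
print(weight_memory_gb(7e9, 2))    # 7B params at FP16 -> 14.0
```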

The following table shows the memory requirements for different model precisions per 1 billion parameters.

|                      | Full Precision (32-bit) | Half Precision (16-bit) | 8-bit |
|----------------------|-------------------------|-------------------------|-------|
| 1 Billion Parameters | 4 GB                    | 2 GB                    | 1 GB  |


Accordingly, you can now multiply the per-billion vRAM figure in the table above by the model's parameter count in billions, based on its precision. The table below shows the minimum memory required just to load each model for inference, without accounting for the additional memory consumed by requests hitting the model.

| Model Name   | Full Precision (32-bit) | Half Precision (16-bit) | 8-bit |
|--------------|-------------------------|-------------------------|-------|
| Falcon (7B)  | 28 GB                   | 14 GB                   | 7 GB  |
| Llama2 (7B)  | 28 GB                   | 14 GB                   | 7 GB  |
| Jais (13B)   | 52 GB                   | 26 GB                   | 13 GB |
| Jais (30B)   | 120 GB                  | 60 GB                   | 30 GB |
| Falcon (40B) | 160 GB                  | 80 GB                   | 40 GB |
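The per-model figures follow directly from parameter count times bytes per parameter. A minimal sketch reproducing the table (the precision labels and dictionary layout are my own; model sizes are from the table above):

```python
# Bytes per parameter for each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

# Model sizes in billions of parameters, from the table above.
MODELS = {
    "Falcon (7B)": 7,
    "Llama2 (7B)": 7,
    "Jais (13B)": 13,
    "Jais (30B)": 30,
    "Falcon (40B)": 40,
}

def inference_memory_gb(billions: int, precision: str) -> int:
    """Minimum GB needed to load the weights at a given precision."""
    return billions * BYTES_PER_PARAM[precision]

for name, size in MODELS.items():
    row = {p: inference_memory_gb(size, p) for p in BYTES_PER_PARAM}
    print(name, row)
```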


How much more memory you need beyond this minimum depends on your system requirements, such as the number of concurrent user queries, caching, and so on. Stress testing is recommended to determine the actual figure.

Finetuning

To fine-tune a model, we need to load all of the following components into memory, which means we need roughly 6x the minimum memory required for inference. The following shows the memory required for a full-precision model per 1 billion parameters.

| Model Component  | Full Precision Memory |
|------------------|-----------------------|
| Model Weights    | 4 GB                  |
| Optimizer States | 8 GB                  |
| Gradients        | 4 GB                  |
| Activations      | 8 GB                  |
| Total            | 24 GB                 |
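The breakdown above can be sketched as follows (a rough sketch under the table's own assumptions; in practice the optimizer-state figure depends on the optimizer, e.g. Adam keeps two FP32 moments per parameter, and activation memory varies with batch size and sequence length):

```python
# Fine-tuning memory per 1 billion parameters at full precision (FP32),
# following the component breakdown in the table above.
COMPONENTS_GB_PER_BILLION = {
    "model_weights": 4,      # 4 bytes/param
    "optimizer_states": 8,   # e.g. Adam: two FP32 moments = 8 bytes/param
    "gradients": 4,          # 4 bytes/param
    "activations": 8,        # rough estimate; depends on batch/sequence
}

total_gb_per_billion = sum(COMPONENTS_GB_PER_BILLION.values())
print(total_gb_per_billion)            # 24 GB per billion parameters
print(total_gb_per_billion / 4)        # 6x the 4 GB needed for inference
```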


However, this makes fine-tuning very large models infeasible at full precision. It is recommended to use reduced precision, either half precision (16-bit) or 8-bit, during fine-tuning.

The following table compares the minimum fine-tuning memory requirements across precision types.

| Model Name   | Full Precision (32-bit) | Half Precision (16-bit) | 8-bit  | 4-bit  |
|--------------|-------------------------|-------------------------|--------|--------|
| Falcon (7B)  | 168 GB                  | 84 GB                   | 42 GB  | 21 GB  |
| Llama2 (7B)  | 168 GB                  | 84 GB                   | 42 GB  | 21 GB  |
| Jais (13B)   | 312 GB                  | 156 GB                  | 78 GB  | 39 GB  |
| Jais (30B)   | 720 GB                  | 360 GB                  | 180 GB | 90 GB  |
| Falcon (40B) | 960 GB                  | 480 GB                  | 240 GB | 120 GB |
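These figures are the inference numbers scaled by the 6x fine-tuning factor, i.e. 24 GB per billion parameters at FP32, halved at each step down in precision. A minimal sketch (the per-precision constants are derived from the 24 GB FP32 figure above; precision labels are my own):

```python
# Fine-tuning GB per billion parameters, derived from the 6x rule:
# 24 GB at FP32, halving with each halving of precision.
FINETUNE_GB_PER_BILLION = {"fp32": 24, "fp16": 12, "int8": 6, "int4": 3}

def finetune_memory_gb(billions: int, precision: str) -> int:
    """Approximate minimum GB needed to fine-tune at a given precision."""
    return billions * FINETUNE_GB_PER_BILLION[precision]

print(finetune_memory_gb(40, "fp32"))  # Falcon (40B) -> 960
print(finetune_memory_gb(7, "int4"))   # Falcon (7B)  -> 21
```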