This post was written before the launch of the Arabic LLMs leaderboard, which you can now view at Hugging Face OALL.


Overview

The objective of this report is to investigate the landscape of open-source large language models (LLMs) that support the Arabic language. It begins with an examination of the existing models, covering their performance evaluations and benchmarks across diverse datasets. The report then looks at the Jais model in detail, including its architecture and the training data it used. Finally, it addresses scalability and resource optimization, outlining the computational resources required and strategies to improve both training efficiency and inference speed.

Challenge of Arabic LLMs

The primary challenge in developing an Arabic LLM is the limited availability of high-quality Arabic data. Compared to English, where corpora of up to two trillion tokens are readily available, Arabic corpora are significantly smaller. [1]

Available Open Source Models

There are plenty of available models that support the Arabic language. Some of them support only Arabic, such as AraBART and AraT5; others are bilingual, such as Jais, which supports both Arabic and English (the reason Jais is bilingual is explained in a later section); and the rest are multilingual models that support many languages, including Arabic.

Available Arabic Language Models

The following table compares the available models that support the Arabic language by size and number of parameters.

| Model | Description | Number of Parameters | Supported Languages |
| --- | --- | --- | --- |
| Jais | Bilingual LLM based on a transformer decoder-only (GPT-3) architecture | 13 Billion | Arabic, English |
| Jais | Bilingual LLM based on a transformer decoder-only (GPT-3) architecture | 30 Billion | Arabic, English |
| BLOOM | Auto-regressive LLM, trained to continue text from a prompt on vast amounts of text data using industrial-scale computational resources | 176 Billion | 46 languages and 13 programming languages |
| LLaMA2 | Auto-regressive LLM that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align to human preferences for helpfulness and safety | 7 Billion | Supports 50+ languages (predominantly English) |
| LLaMA2 | Auto-regressive LLM that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align to human preferences for helpfulness and safety | 13 Billion | Supports 50+ languages (predominantly English) |
| LLaMA2 | Auto-regressive LLM that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align to human preferences for helpfulness and safety | 70 Billion | Supports 50+ languages (predominantly English) |
| AraT5 | Based on the T5 architecture, with encoder and decoder components similar in size and configuration to BERT Base: 12 layers, each with 12 attention heads, and 768 hidden units | 220 Million | 11 languages (predominantly Arabic) |
| AraBART | Based on the BART-Base architecture, which has 6 encoder and 6 decoder layers and 768 hidden dimensions | 139 Million | Arabic |

Performance Evaluation and Benchmarking for Arabic Supported LLMs

Evaluating the performance of large language models (LLMs) involves a multifaceted approach that considers various aspects of their functionality and effectiveness in language processing tasks. Key methodologies include assessing the quality of generated text through human evaluation, where human judges provide subjective ratings based on factors like fluency, coherence, and relevance. Additionally, objective evaluation metrics such as BLEU (Bilingual Evaluation Understudy), ROUGE (Recall-Oriented Understudy for Gisting Evaluation), and METEOR (Metric for Evaluation of Translation with Explicit Ordering) are commonly employed to measure the similarity between model-generated text and reference translations. Furthermore, diagnostic evaluation techniques aim to identify model weaknesses and biases through probing tasks and linguistic analyses.

Figure 1: Evaluation Methods for LLMs
Source: Evaluating Large Language Models https://www.packtpub.com/article-hub/evaluating-large-language-models

In this report I'll dive into the evaluation metrics used by Core42 to evaluate their Arabic LLM, Jais, based on their published research paper [1].

Evaluation Criteria

A comprehensive evaluation conducted by Core42 benchmarked Jais against other leading base language models, focusing on both English and Arabic. The evaluation criteria spanned various dimensions, including:

  1. Knowledge: How well the model answers factual questions.

  2. Reasoning: The model’s ability to answer questions requiring reasoning.

  3. Misinformation/Bias: Assessment of the model’s susceptibility to generating false or misleading information, and its neutrality.

Jais

The Jais model is now considered the benchmark for Arabic LLMs. It offers both foundation and instruction-tuned variants, based on the GPT-3 decoder-only architecture. Jais is pretrained on Arabic and English text in addition to various programming languages. To address the primary challenge of Arabic LLMs, the limited amount of Arabic data, Jais was trained on English as well: of its 395 billion pretraining tokens, 116 billion are Arabic and 232 billion are English, with the remainder being code, so Arabic constitutes roughly 29% of the total pretraining data (about a third of the natural-language text).

Pretraining Data

The following table shows the various datasets the Jais model was trained on, along with their token count.

Figure 2: Composition and Breakdown of Arabic Pretraining Dataset for Jais Model

As can be seen from the table above, the data is collected from multiple sources, including web pages, Wikipedia, news, Arabic books, and social network content. The data was also augmented with Arabic content translated from English using an in-house machine translation model at Jais's founding company. The data is curated, meaning only high-quality data is included.

Data Preprocessing Pipeline

The training data fed to the model went through the preprocessing pipeline shown in the following figure, which guarantees that only high-quality data is used for training. We can reuse the same pipeline if we want to fine-tune the Jais model later on.

Figure 3. Jais Training Data Preprocessing Pipeline

Model Architecture

Jais is based on a standard transformer decoder-only architecture, similar to the GPT-2 and LLaMA models. Decoder-only models currently achieve state-of-the-art performance. The figure below shows a general pipeline of the decoder-only architecture.

Figure 4. GPT Decoder Only Architecture

However, Jais includes further modifications to accommodate the Arabic language and to improve on available models in the literature, namely the Jais tokenizer, ALiBi positional encodings, and the SwiGLU activation function.

Jais Tokenizer

The GPT tokenizer is trained mainly on English, so it cannot tokenize Arabic content accurately or efficiently. A new tokenizer was therefore built using byte-pair encoding (BPE) on equal proportions of English and Arabic text.
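As a rough way to see the difference in practice, the sketch below compares how many tokens an English-centric BPE tokenizer and the Jais tokenizer produce for the same Arabic sentence (fewer tokens generally indicates better coverage). The checkpoint names are assumptions based on the public Hugging Face releases; swap in whichever revisions you actually use.

```python
# Sketch: comparing tokenizer fertility on Arabic text (checkpoint names are assumptions).
from transformers import AutoTokenizer

arabic_text = "اللغة العربية غنية بالمفردات والتراكيب"  # example Arabic sentence

# English-centric BPE tokenizer (GPT-2) vs. the bilingual Jais tokenizer.
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")
jais_tok = AutoTokenizer.from_pretrained("core42/jais-13b", trust_remote_code=True)

for name, tok in [("gpt2", gpt2_tok), ("jais", jais_tok)]:
    ids = tok.encode(arabic_text)
    print(f"{name}: {len(ids)} tokens")
# The English-centric tokenizer typically splits Arabic into many more
# (often byte-level) tokens than a tokenizer trained on Arabic data.
```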

ALiBi Positional Encodings

In simple terms, positional embeddings are used in transformer-based large language models (LLMs) to understand the order of words in a sentence. When training these models, it’s common to limit the amount of context they can consider to manage complexity. However, during actual use (inference), the model may need to understand longer contexts. Recent research has found that traditional methods of representing word order, like learnable positional embeddings or sinusoidal encoding, don’t work well with longer contexts. Instead, a new method called Attention with Linear Biases (ALiBi) positional encodings is used. ALiBi adjusts the attention scores (how much the model focuses on each word) based on their distance from each other, helping the model understand longer contexts more efficiently. Instead of changing the input embeddings directly, ALiBi adjusts the attention mechanism itself.
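To make the idea concrete, here is a minimal sketch (assuming the standard ALiBi formulation, with head-specific slopes that form a geometric sequence) of how the linear distance penalty is built and added to the attention scores:

```python
# Sketch: Attention with Linear Biases (ALiBi) applied to attention scores.
import math
import torch

def alibi_slopes(num_heads: int) -> torch.Tensor:
    # Standard geometric slopes: 2^(-8/num_heads), 2^(-16/num_heads), ...
    start = 2 ** (-8.0 / num_heads)
    return torch.tensor([start ** (i + 1) for i in range(num_heads)])

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    # Distance term: bias[i, j] = slope * (j - i), i.e. a larger penalty for
    # tokens further in the past (j < i); future positions are handled by the causal mask.
    positions = torch.arange(seq_len)
    distance = positions[None, :] - positions[:, None]    # (seq, seq), negative below the diagonal
    slopes = alibi_slopes(num_heads)                       # (heads,)
    return slopes[:, None, None] * distance[None, :, :]    # (heads, seq, seq)

# Usage: add the bias to the raw attention scores before the softmax.
heads, seq, dim = 4, 8, 16
q = torch.randn(heads, seq, dim)
k = torch.randn(heads, seq, dim)
scores = q @ k.transpose(-2, -1) / math.sqrt(dim)
scores = scores + alibi_bias(heads, seq)  # closer tokens are penalized less
```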

SwiGLU Activation Function

The Swish-Gated Linear Unit (SwiGLU) is a variant of the GLU family of activation functions that combines the Swish activation with the GLU gating mechanism. In experiments it improved performance on language tasks compared with other activation functions.
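For reference, a minimal SwiGLU feed-forward block can be written as below. This is a sketch of the common formulation, Swish(xW) elementwise-multiplied by xV, not necessarily the exact dimensions used in Jais.

```python
# Sketch: a SwiGLU feed-forward block as used in many modern transformer MLPs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)  # gate branch
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)    # value branch
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)  # projection back to d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: Swish(x W_gate) * (x W_up), then project back down.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

x = torch.randn(2, 16, 512)
out = SwiGLUFeedForward(d_model=512, d_hidden=1376)(x)
print(out.shape)  # torch.Size([2, 16, 512])
```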

Model Hyperparameters

The following table shows the training and optimization hyperparameters used for the Jais-13b model.

| Model | Layers | Heads | Dimension | Learning Rate | Batch Size |
| --- | --- | --- | --- | --- | --- |
| Jais-13b | 40 | 40 | 5120 | 1.2e-2 | 3392 |

When a document was shorter than 2048 tokens, several documents were concatenated into one sequence, with <|endoftext|> used to demarcate the end of each document. This gives the language model the information necessary to infer that tokens separated by <|endoftext|> are unrelated.
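A minimal sketch of this packing step might look as follows (assuming a tokenizer whose EOS token encodes <|endoftext|>; the exact implementation used for Jais is not published here):

```python
# Sketch: packing short documents into fixed-length training sequences.
from typing import Iterable

def pack_documents(docs: Iterable[str], tokenizer, seq_len: int = 2048):
    """Concatenate tokenized documents, separated by the EOS token,
    then slice the stream into fixed-length sequences."""
    eos_id = tokenizer.eos_token_id  # assumed to encode <|endoftext|>
    stream = []
    for doc in docs:
        stream.extend(tokenizer.encode(doc))
        stream.append(eos_id)  # marks the boundary between unrelated documents
    # Drop the trailing partial chunk for simplicity.
    return [stream[i:i + seq_len] for i in range(0, len(stream) - seq_len + 1, seq_len)]
```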

Model Evaluation

Extensive evaluation has been done to compare Jais with other models that support the Arabic language. The main evaluation methodology is the LM-Evaluation-Harness framework, used to evaluate each model in a zero-shot setting. Multiple datasets are used for each of the three main criteria: World Knowledge, Commonsense Reasoning, and Misinformation and Bias. The following table shows the results for the Arabic-language datasets.

Figure 5. Zero-shot Evaluation Results for Arabic (%). Average is the mean score.

As can be seen from the previous figure, Jais-chat (13B) achieved the best average performance among the compared models.
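To reproduce this kind of zero-shot comparison, the lm-evaluation-harness exposes a simple Python API. The sketch below is a minimal example; the checkpoint name and task names are assumptions (the Arabic benchmark versions used in the Jais paper are custom translations and may need to be registered as custom tasks).

```python
# Sketch: zero-shot evaluation with lm-evaluation-harness (model/task names are assumptions).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=core42/jais-13b-chat,trust_remote_code=True,dtype=bfloat16",
    tasks=["hellaswag", "piqa"],  # replace with the Arabic task names you have registered
    num_fewshot=0,                # zero-shot setting, as in the Jais evaluation
    batch_size=8,
)
print(results["results"])
```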

Safety Layer

The safety layer is a crucial step in making the model safer to interact with. The Jais model applies multiple safety measures:

  1. Safety via instruction-tuning, so the model avoids generating certain harmful content.

  2. Safety via prompting, by modifying the user prompt to explicitly instruct the model not to generate any harmful content.

  3. Safety via external models, using hate- and offensive-content classifiers on top of the LLM.

  4. Safety via keywords, which applies a keyword-based filtering mechanism.

A simple illustration of the prompt- and keyword-level mechanisms is sketched below.
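This is only an illustrative sketch, not the actual Jais safety layer: it wraps the user prompt with an explicit safety instruction and applies a keyword filter to the generated text. The keyword list and wording are hypothetical.

```python
# Sketch: prompt-level safety instruction plus a keyword-based output filter
# (illustrative only; not the production Jais safety layer).
BLOCKED_KEYWORDS = {"bomb recipe", "credit card dump"}  # hypothetical keyword list

def add_safety_instruction(user_prompt: str) -> str:
    safety_prefix = (
        "You are a helpful assistant. Do not produce harmful, hateful, "
        "or unsafe content. If the request is unsafe, refuse politely.\n\n"
    )
    return safety_prefix + user_prompt

def keyword_filter(generated_text: str) -> str:
    lowered = generated_text.lower()
    if any(keyword in lowered for keyword in BLOCKED_KEYWORDS):
        return "I can't help with that request."
    return generated_text
```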

Scalability and Resource Optimization

Optimizing the scalability and resource efficiency of LLMs is possible through various techniques such as distributed training, model compression, and deployment on cloud infrastructure.

Distributed Fine-tuning

Fine-tuning large language models with instruction-following data is less memory-intensive than training them from scratch, but distributed parallel training is still necessary to manage memory usage effectively.

The figure below shows one way of tackling the memory challenges of fine-tuning an LLM. Fine-tuning requires loading the model weights, gradients, and optimizer states into GPU memory, which can quickly become a bottleneck. DeepSpeed implements a three-stage strategy known as ZeRO (Zero Redundancy Optimizer) to partition this memory across workers. [2]

Figure 6. ZeRO Partitioning Strategy
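With the Hugging Face Trainer integration, enabling ZeRO mostly comes down to passing a DeepSpeed configuration. The sketch below shows a minimal ZeRO stage-3 config with CPU offloading; the values are illustrative and not tuned for any particular model.

```python
# Sketch: a minimal DeepSpeed ZeRO stage-3 config passed to the Hugging Face Trainer.
from transformers import TrainingArguments

ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                              # partition optimizer state, gradients, and parameters
        "offload_optimizer": {"device": "cpu"},  # push optimizer state to CPU RAM
        "offload_param": {"device": "cpu"},      # push parameters to CPU RAM when not in use
    },
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    deepspeed=ds_config,  # the Trainer launches DeepSpeed with this config
)
```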

Parallelism Strategies

The choice depends on whether the model fits on a single GPU. When it does, the primary parallelization strategies are data parallelism and ZeRO. When it does not fit on a single GPU, the primary options are model parallelism and tensor parallelism.

Data Parallelism

Each GPU worker processes a portion of the data, computes gradients, which are then averaged across all workers to update model weights. In PyTorch DDP, each GPU stores model weights, optimizer state, and gradients for its data fraction.
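A minimal PyTorch DDP training step looks like the sketch below, assuming the script is launched with torchrun so that the process-group environment variables are already set.

```python
# Sketch: minimal PyTorch DDP training step (launch with `torchrun --nproc_per_node=N script.py`).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")        # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])    # gradients are averaged across workers
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    x = torch.randn(32, 1024, device=local_rank)   # each rank sees its own shard of the data
    loss = model(x).pow(2).mean()
    loss.backward()                                # DDP all-reduces gradients here
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```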

Model Parallelism

Different layers of the model are placed on different GPU workers.
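In its simplest (naive) form this just means assigning blocks of layers to different devices and moving activations between them, as in this sketch, which assumes two CUDA devices are available:

```python
# Sketch: naive model parallelism, placing halves of the network on different GPUs.
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Linear(4096, 1024)).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        return self.part2(x.to("cuda:1"))  # activations are moved between devices

out = TwoGPUModel()(torch.randn(8, 1024))
```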

Tensor Parallelism

Also called tensor slicing: each GPU processes a slice of a tensor, and the full tensor is aggregated only for operations that require it. [3]
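The core idea can be illustrated with a column-parallel linear layer: the weight matrix is split column-wise across workers, each worker computes its slice of the output, and the slices are gathered only when the full activation is needed. The sketch below simulates the two shards on CPU for clarity.

```python
# Sketch: column-parallel (tensor-sliced) matrix multiply, simulated on CPU.
import torch

d_in, d_out, batch = 512, 1024, 4
x = torch.randn(batch, d_in)
w = torch.randn(d_in, d_out)

# Split the weight matrix column-wise into two shards, one per "device".
w_shard0, w_shard1 = w.chunk(2, dim=1)

# Each worker multiplies the same input by its own shard of the weights...
y_shard0 = x @ w_shard0
y_shard1 = x @ w_shard1

# ...and the full output is only materialized when an op needs it (here, a concat/all-gather).
y = torch.cat([y_shard0, y_shard1], dim=1)
assert torch.allclose(y, x @ w, atol=1e-5)
```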

Quantization

Quantization methods aim to reduce data size while preserving accuracy by converting values to data types that use fewer bits. For instance, converting model weights from 32-bit to 16-bit floating point halves the model size, easing storage and decreasing memory usage. Lower precision also speeds up inference by reducing computation time. There are several LLM compression methods, described below.
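Before looking at the individual methods, a quick back-of-the-envelope calculation shows why the number of bits per weight matters for a model the size of Jais-13b (weights only, ignoring activations and optimizer state):

```python
# Sketch: approximate weight-memory footprint of a 13B-parameter model at different precisions.
params = 13e9
for name, bits in [("float32", 32), ("float16/bfloat16", 16), ("int8", 8), ("int4", 4)]:
    gigabytes = params * bits / 8 / 1e9
    print(f"{name:>18}: ~{gigabytes:.1f} GB")
# float32 ~52 GB, float16 ~26 GB, int8 ~13 GB, int4 ~6.5 GB
```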

Additive Quantization of Language Models (AQLM)

AQLM quantizes multiple weights together, taking advantage of the interdependencies between them. It represents groups of 8-16 weights as a sum of multiple vector codes. Inference support for AQLM is provided by the aqlm library.

$ pip install aqlm[gpu,cpu]

The aqlm library supports Parameter-Efficient Fine-Tuning (PEFT) in the form of LoRA, and AQLM is integrated into the PEFT library. [4]

Activation-aware Weight Quantization (AWQ)

AWQ quantizes most of the weights in the model while preserving a small percentage of weights that are important for LLM performance. The main benefit of this approach is that models can run in 4-bit precision without experiencing significant performance degradation. [4]

$ pip install autoawq

Generative Post-Training Quantization (GPTQ)

GPTQ is a post-training quantization technique for 4-bit quantization that focuses mainly on GPU inference and performance. GPTQ observes that for large models, quantizing weights in any fixed order can perform just as well: even though some weights might introduce more error individually, they are quantized later in the process, when there are few other weights left that could increase the error. Accordingly, GPTQ quantizes all weights in the same order for all rows of a matrix, which makes the process faster because certain computations have to be done only once per column rather than once per weight. [6]

$ pip install auto-gptq
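Through the transformers integration (which uses auto-gptq under the hood), quantizing and saving a model looks roughly like the sketch below; the model name is just a small example and the calibration dataset choice is an assumption.

```python
# Sketch: 4-bit GPTQ quantization via the transformers integration (model name is an example).
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # small example model; replace with your target LLM
tokenizer = AutoTokenizer.from_pretrained(model_id)

gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)  # calibration on C4 samples
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", quantization_config=gptq_config)

model.save_pretrained("opt-125m-gptq-4bit")  # weights are stored already quantized
tokenizer.save_pretrained("opt-125m-gptq-4bit")
```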

bitsandbytes

bitsandbytes is the most popular and easiest-to-use library for quantizing models to 8-bit and 4-bit.

8-bit Quantization

8-bit quantization is the natural progression after 16-bit precision, but the implementation is not as simple as moving from FP32 to FP16: those two floating-point types share the same representation scheme, while 8-bit does not. The 8-bit scheme can represent fewer distinct values than FP16 or FP32, so model performance may be affected; it is good to be aware of this trade-off. [8]

4-bit Quantization

The aim of 4-bit quantization is to reduce the memory usage of the model parameters by using lower-precision types than full (float32) or half (bfloat16) precision. In other words, 4-bit quantization compresses models with billions of parameters, such as Llama 2, so that they require much less memory. [7] One popular method that builds on 4-bit quantization is QLoRA, which is discussed in more detail in the next section.
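With bitsandbytes and transformers, loading a model in 4-bit (NF4, the data type used by QLoRA) is a one-line configuration change. The model name below is only an example.

```python
# Sketch: loading a model in 4-bit with bitsandbytes (model name is an example).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # store weights in 4-bit
    bnb_4bit_quant_type="nf4",               # NormalFloat4, the QLoRA data type
    bnb_4bit_compute_dtype=torch.bfloat16,   # compute in bf16 for stability
    bnb_4bit_use_double_quant=True,          # quantize the quantization constants too
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",              # example model; use load_in_8bit=True for 8-bit instead
    quantization_config=bnb_config,
    device_map="auto",
)
```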

Finetuning

Two techniques massively improve the efficiency of fine-tuning LLMs: Low-Rank Adaptation (LoRA) and Quantized Low-Rank Adaptation (QLoRA). The following table shows the main differences between them. [5]

| Feature | LoRA | QLoRA |
| --- | --- | --- |
| Parameter Reduction | Low-rank approximation of the weight update matrix | LoRA adapters on top of a 4-bit quantized, frozen base model |
| Memory Footprint | Reduced | Further reduced |
| Fine-tuning Speed | Fast | Slower than LoRA |
| Performance | Close to traditional fine-tuning | Slightly below LoRA |


Low-Rank Adaptation (LoRA)

Fine-tuning large language models while updating all model parameters is expensive. For example, GPT-3 has 175B parameters, so deploying independent fine-tuned instances of such a model would be extremely costly. LoRA freezes the pretrained model weights and injects small trainable low-rank matrices, greatly reducing the number of trainable parameters. LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times. [10]
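With the Hugging Face peft library, applying LoRA to a causal LM takes only a few lines. The sketch below uses an example base model, and the target module names are typical for Llama-style architectures (an assumption for other models).

```python
# Sketch: wrapping a causal LM with LoRA adapters via peft (target modules are an assumption).
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # example model

lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the full model
```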

Quantized Low-Rank Adaptation (QLoRA)

QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA). [11]
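In practice, QLoRA combines the two previous sketches: load the base model in 4-bit with bitsandbytes, prepare it for k-bit training, then attach LoRA adapters. A minimal sketch (example model name):

```python
# Sketch: QLoRA = 4-bit quantized, frozen base model + trainable LoRA adapters.
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                                bnb_4bit_compute_dtype=torch.bfloat16)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb_config, device_map="auto")

model = prepare_model_for_kbit_training(model)  # casts norms and readies the model for k-bit training
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
# Only the LoRA adapter weights receive gradients; the 4-bit base model stays frozen.
```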

The following table shows which method is recommended to use for fine-tuning based on requirements.

Figure 7. Tuning Recommendations for LoRA and QLoRA

Inference

Training and fine-tuning large language models typically involves performing orders of magnitude more calculations on large data sets. To speed this process up, specialized hardware like GPUs is used for much faster data-parallel operations. Having access to these specialized compute resources becomes essential for both training and deploying large language models. The cost of inference can also make model compression and distillation techniques important. [12] One library that helps in the deployment of large language models is vLLM.

vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving. It uses the PagedAttention technique for efficient memory management, implements continuous batching of incoming requests, and applies quantization using techniques such as GPTQ and AWQ. [13] vLLM has supported the Jais model since early this year.
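Offline batched generation with the vLLM Python API looks like the sketch below. The checkpoint name is an assumption, and loading Jais depends on the vLLM version installed.

```python
# Sketch: offline batched generation with vLLM (model name is an assumption;
# requires a vLLM version with Jais support).
from vllm import LLM, SamplingParams

llm = LLM(model="core42/jais-13b-chat", trust_remote_code=True)
sampling = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["اكتب فقرة قصيرة عن أهمية اللغة العربية."], sampling)
for out in outputs:
    print(out.outputs[0].text)
```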

References

[1] N. Sengupta et al., Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models. 2023.

[2] Tsai, Y.-C. (2023, October 20). Fine-tuning large language models: A guide into distributed parallel training with DeepSpeed, Ray… Medium

[3] Hegde, S. R. (2023, October 1). Everything about Distributed Training and Efficient Finetuning. Sumanth’s Personal Website

[4] “Quantization.” Accessed 4 Mar. 2024.

[5] Mudadla, Sujatha. “Difference between QLoRA and LoRA for Fine-Tuning LLMs.” Medium, 12 Dec. 2023. Accessed 4 Mar. 2024.

[6] Labonne, Maxime. “ML Blog - 4-Bit LLM Quantization with GPTQ.” mlabonne.github.io, 30 July 2023.

[7] Goheen, Justin. “4-Bit Quantization with Lightning Fabric.” Lightning AI, 6 Nov. 2023. Accessed 4 Mar. 2024.

[8] “8-Bit Quantization with Lightning Fabric.” Lightning AI, 15 Nov. 2023. Accessed 4 Mar. 2024.

[9] “LoRA and QLoRA Recommendations for LLMs.” Vertex AI, Google Cloud. Accessed 4 Mar. 2024.

[10] Hu, Edward J., et al. “LoRA: Low-Rank Adaptation of Large Language Models.” ArXiv:2106.09685 [Cs], 16 Oct. 2021.

[11] Dettmers, Tim, et al. “QLoRA: Efficient Finetuning of Quantized LLMs.” ArXiv:2305.14314 [Cs], 23 May 2023.

[12] “What Is LLMOps?” Databricks, 12 June 2023. Accessed 4 Mar. 2024.

[13] “Vllm-Project/Vllm.” GitHub, 4 Oct. 2023.