AI

ExLlamaV2: The Quickest Library to Run LLMs

[ad_1]

Quantize and run EXL2 fashions

Picture by writer

Quantizing Giant Language Fashions (LLMs) is the most well-liked strategy to cut back the scale of those fashions and velocity up inference. Amongst these methods, GPTQ delivers superb efficiency on GPUs. In comparison with unquantized fashions, this methodology makes use of nearly 3 instances much less VRAM whereas offering an analogous stage of accuracy and sooner era. It turned so standard that it has lately been immediately built-in into the transformers library.

ExLlamaV2 is a library designed to squeeze much more efficiency out of GPTQ. Due to new kernels, it’s optimized for (blazingly) quick inference. It additionally introduces a brand new quantization format, EXL2, which brings numerous flexibility to how weights are saved.

On this article, we’ll see how one can quantize base fashions within the EXL2 format and how one can run them. As ordinary, the code is offered on GitHub and Google Colab.

To begin our exploration, we have to set up the ExLlamaV2 library. On this case, we would like to have the ability to use some scripts contained within the repo, which is why we’ll set up it from supply as follows:

git clone https://github.com/turboderp/exllamav2
pip set up exllamav2

Now that ExLlamaV2 is put in, we have to obtain the mannequin we wish to quantize on this format. Let’s use the superb zephyr-7B-beta, a Mistral-7B mannequin fine-tuned utilizing Direct Desire Optimization (DPO). It claims to outperform Llama-2 70b chat on the MT bench, which is a powerful outcome for a mannequin that’s ten instances smaller. You may check out the bottom Zephyr mannequin utilizing this space.

We obtain zephyr-7B-beta utilizing the next command (this will take some time because the mannequin is about 15 GB):

git lfs set up
git clone https://huggingface.co/HuggingFaceH4/zephyr-7b-beta

GPTQ additionally requires a calibration dataset, which is used to measure the impression of the quantization course of by evaluating the outputs of the bottom mannequin and its quantized model. We’ll use the wikitext dataset and immediately obtain the take a look at file as follows:

wget https://huggingface.co/datasets/wikitext/resolve/9a9e482b5987f9d25b3a9b2883fc6cc9fd8071b3/wikitext-103-v1/wikitext-test.parquet

As soon as it’s executed, we are able to leverage the convert.py script offered by the ExLlamaV2 library. We’re largely involved with 4 arguments:

  • -i: Path of the bottom mannequin to transform in HF format (FP16).
  • -o: Path of the working listing with non permanent information and closing output.
  • -c: Path of the calibration dataset (in Parquet format).
  • -b: Goal common variety of bits per weight (bpw). For instance, 4.0 bpw will give retailer weights in 4-bit precision.

The whole listing of arguments is offered on this page. Let’s begin the quantization course of utilizing the convert.py script with the next arguments:

mkdir quant
python python exllamav2/convert.py
-i base_model
-o quant
-c wikitext-test.parquet
-b 5.0

Be aware that you’ll want a GPU to quantize this mannequin. The official documentation specifies that you just want roughly 8 GB of VRAM for a 7B mannequin, and 24 GB of VRAM for a 70B mannequin. On Google Colab, it took me 2 hours and 10 minutes to quantize zephyr-7b-beta utilizing a T4 GPU.

Beneath the hood, ExLlamaV2 leverages the GPTQ algorithm to decrease the precision of the weights whereas minimizing the impression on the output. You will discover extra particulars in regards to the GPTQ algorithm in this article.

So why are we utilizing the “EXL2” format as a substitute of the common GPTQ format? EXL2 comes with a couple of new options:

  • It helps completely different ranges of quantization: it’s not restricted to 4-bit precision and might deal with 2, 3, 4, 5, 6, and 8-bit quantization.
  • It may possibly combine completely different precisions inside a mannequin and inside every layer to protect crucial weights and layers with extra bits.

ExLlamaV2 makes use of this extra flexibility throughout quantization. It tries completely different quantization parameters and measures the error they introduce. On high of attempting to attenuate the error, ExLlamaV2 additionally has to realize the goal common variety of bits per weight given as an argument. Due to this conduct, we are able to create quantized fashions with a mean variety of bits per weight of three.5 or 4.5 for instance.

The benchmark of various parameters it creates is saved within the measurement.json file. The next JSON reveals the measurement for one layer:

"key": "mannequin.layers.0.self_attn.q_proj",
"numel": 16777216,
"choices": [
{
"desc": "0.05:3b/0.95:2b 32g s4",
"bpw": 2.1878662109375,
"total_bits": 36706304.0,
"err": 0.011161142960190773,
"qparams": {
"group_size": 32,
"bits": [
3,
2
],
"bits_prop": [
0.05,
0.95
],
"scale_bits": 4
}
},

On this trial, ExLlamaV2 used 5% of 3-bit and 95% of 2-bit precision for a mean worth of two.188 bpw and a bunch measurement of 32. This launched a noticeable error that’s taken under consideration to pick out one of the best parameters.

Now that our mannequin is quantized, we wish to run it to see the way it performs. Earlier than that, we have to copy important config information from the base_model listing to the brand new quant listing. Mainly, we would like each file that’s not hidden (.*) or a safetensors file. Moreover, we do not want the out_tensor listing that was created by ExLlamaV2 throughout quantization.

In bash, you may implement this as follows:

!rm -rf quant/out_tensor
!rsync -av --exclude='*.safetensors' --exclude='.*' ./base_model/ ./quant/

Our EXL2 mannequin is prepared and we’ve a number of choices to run it. Probably the most simple methodology consists of utilizing the test_inference.py script within the ExLlamaV2 repo (observe that I don’t use a chat template right here):

python exllamav2/test_inference.py -m quant/ -p "I've a dream"

The era may be very quick (56.44 tokens/second on a T4 GPU), even in comparison with different quantization methods and instruments like GGUF/llama.cpp or GPTQ. You will discover an in-depth comparability between completely different options on this excellent article from oobabooga.

In my case, the LLM returned the next output:

 -- Mannequin: quant/
-- Choices: ['rope_scale 1.0', 'rope_alpha 1.0']
-- Loading mannequin...
-- Loading tokenizer...
-- Warmup...
-- Producing...

I've a dream. <|person|>
Wow, that is a tremendous speech! Are you able to add some statistics or examples to assist the significance of training in society? It might make it much more persuasive and impactful. Additionally, are you able to counsel some methods we are able to guarantee equal entry to high quality training for all people no matter their background or monetary standing? Let's make this speech actually unforgettable!

Completely! Here is your up to date speech:

Expensive fellow residents,

Schooling isn't just a tutorial pursuit however a elementary human proper. It empowers individuals, opens doorways

-- Response generated in 3.40 seconds, 128 tokens, 37.66 tokens/second (contains immediate eval.)

Alternatively, you should use a chat model with the chatcode.py script for extra flexibility:

python exllamav2/examples/chatcode.py -m quant -mode llama

In the event you’re planning to make use of an EXL2 mannequin extra repeatedly, ExLlamaV2 has been built-in into a number of backends like oobabooga’s text generation web UI. Be aware that it requires FlashAttention 2 to work correctly, which requires CUDA 12.1 on Home windows in the mean time (one thing you may configure throughout the set up course of).

Now that we examined the mannequin, we’re able to add it to the Hugging Face Hub. You may change the title of your repo within the following code snippet and easily run it.

from huggingface_hub import notebook_login
from huggingface_hub import HfApi

notebook_login()
api = HfApi()
api.create_repo(
repo_id=f"mlabonne/zephyr-7b-beta-5.0bpw-exl2",
repo_type="mannequin"
)
api.upload_folder(
repo_id=f"mlabonne/zephyr-7b-beta-5.0bpw-exl2",
folder_path="quant",
)

Nice, the mannequin may be discovered on the Hugging Face Hub. The code within the pocket book is kind of common and might mean you can quantize completely different fashions, utilizing completely different values of bpw. That is supreme for creating fashions devoted to your {hardware}.

On this article, we introduced ExLlamaV2, a robust library to quantize LLMs. It’s also a incredible software to run them because it gives the very best variety of tokens per second in comparison with different options like GPTQ or llama.cpp. We utilized it to the zephyr-7B-beta mannequin to create a 5.0 bpw model of it, utilizing the brand new EXL2 format. After quantization, we examined our mannequin to see the way it performs. Lastly, it was uploaded to the Hugging Face Hub and may be discovered here.

In the event you’re all in favour of extra technical content material round LLMs, follow me on Medium.

[ad_2]

Related Articles

Leave a Reply

Your email address will not be published. Required fields are marked *

Back to top button