# Introduction to Weight Quantization | Towards Data Science

The Python implementation is quite straightforward:

```python
import torch

def zeropoint_quantize(X):
    # Calculate value range (denominator)
    x_range = torch.max(X) - torch.min(X)
    x_range = 1 if x_range == 0 else x_range

    # Calculate scale
    scale = 255 / x_range

    # Shift by zero-point
    zeropoint = (-scale * torch.min(X) - 128).round()

    # Scale and round the inputs
    X_quant = torch.clip((X * scale + zeropoint).round(), -128, 127)

    # Dequantize
    X_dequant = (X_quant - zeropoint) / scale

    return X_quant.to(torch.int8), X_dequant
```
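To make the arithmetic concrete, here is a minimal dependency-free sketch of the same steps applied to a handful of numbers. The helper name `zeropoint_quantize_list` and the sample values are illustrative assumptions, not part of the tutorial's code; it uses plain Python lists instead of tensors so each intermediate value is easy to inspect.

```python
def zeropoint_quantize_list(values):
    """Zero-point quantize a list of floats to the INT8 range [-128, 127]."""
    lo, hi = min(values), max(values)
    value_range = (hi - lo) or 1.0  # avoid dividing by zero for constant inputs
    scale = 255 / value_range
    zeropoint = round(-scale * lo - 128)
    # Scale, shift, round, then clip to the INT8 range
    quantized = [max(-128, min(127, round(v * scale + zeropoint))) for v in values]
    # Reverse the mapping to see how much information was lost
    dequantized = [(q - zeropoint) / scale for q in quantized]
    return quantized, dequantized, scale, zeropoint


values = [-0.4738, -0.2614, -0.0978, 0.0513, 0.3668]  # illustrative numbers
quantized, dequantized, scale, zeropoint = zeropoint_quantize_list(values)

for v, q, d in zip(values, quantized, dequantized):
    print(f"{v:+.4f} -> {q:4d} -> {d:+.4f} (error {abs(v - d):.5f})")
```

The smallest input lands on -128 and the largest on 127, and the round-trip error of each value stays within half a quantization step (`0.5 / scale`).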

Instead of relying on complete toy examples, we can use these two functions on a real model thanks to the `transformers` library.

We start by loading the model and tokenizer for GPT-2. This is a very small model we probably don't want to quantize, but it will be good enough for this tutorial. First, we want to observe the model's size so we can compare it later and evaluate the memory savings due to 8-bit quantization.

```
!pip install -q "bitsandbytes>=0.39.0"
!pip install -q git+https://github.com/huggingface/accelerate.git
!pip install -q git+https://github.com/huggingface/transformers.git
```

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
torch.manual_seed(0)

# Set device to CPU for now
device = 'cpu'

# Load model and tokenizer
model_id = 'gpt2'
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Print model size
print(f"Model size: {model.get_memory_footprint():,} bytes")
```

```
Model size: 510,342,192 bytes
```

The size of the GPT-2 model is approximately 487 MB in FP32. The next step consists of quantizing the weights using zero-point and absmax quantization. In the following example, we apply these techniques to the first attention layer of GPT-2 to see the results.
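As a quick sanity check on that figure, the conversion from bytes can be sketched as below. The INT8 number is an idealized estimate that assumes every 4-byte FP32 value becomes a single byte; a real quantized model usually keeps some tensors in higher precision, so actual savings will differ.

```python
footprint_bytes = 510_342_192  # value reported by get_memory_footprint()
bytes_per_mib = 1024 ** 2

size_mib = footprint_bytes / bytes_per_mib
print(f"FP32 footprint: {size_mib:.1f} MiB")

# Idealized estimate: each 4-byte FP32 value replaced by a 1-byte INT8 value
int8_estimate_mib = size_mib / 4
print(f"Idealized INT8 footprint: {int8_estimate_mib:.1f} MiB")
```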

```python
# Extract weights of the first layer
weights = model.transformer.h[0].attn.c_attn.weight.data
print("Original weights:")
print(weights)

# Quantize layer using absmax quantization
weights_abs_quant, _ = absmax_quantize(weights)
print("\nAbsmax quantized weights:")
print(weights_abs_quant)

# Quantize layer using zero-point quantization
weights_zp_quant, _ = zeropoint_quantize(weights)
print("\nZero-point quantized weights:")
print(weights_zp_quant)
```
```
Original weights:
tensor([[-0.4738, -0.2614, -0.0978,  ...,  0.0513, -0.0584,  0.0250],
        [ 0.0874,  0.1473,  0.2387,  ..., -0.0525, -0.0113, -0.0156],
        [ 0.0039,  0.0695,  0.3668,  ...,  0.1143,  0.0363, -0.0318],
        ...,
        [-0.2592, -0.0164,  0.1991,  ...,  0.0095, -0.0516,  0.0319],
        [ 0.1517,  0.2170,  0.1043,  ...,  0.0293, -0.0429, -0.0475],
        [-0.4100, -0.1924, -0.2400,  ..., -0.0046,  0.0070,  0.0198]])

Absmax quantized weights:
tensor([[-21, -12,  -4,  ...,   2,  -3,   1],
        [  4,   7,  11,  ...,  -2,  -1,  -1],
        [  0,   3,  16,  ...,   5,   2,  -1],
        ...,
        [-12,  -1,   9,  ...,   0,  -2,   1],
        [  7,  10,   5,  ...,   1,  -2,  -2],
        [-18,  -9, -11,  ...,   0,   0,   1]], dtype=torch.int8)

Zero-point quantized weights:
tensor([[-20, -11,  -3,  ...,   3,  -2,   2],
        [  5,   8,  12,  ...,  -1,   0,   0],
        [  1,   4,  18,  ...,   6,   3,   0],
        ...,
        [-11,   0,  10,  ...,   1,  -1,   2],
        [  8,  11,   6,  ...,   2,  -1,  -1],
        [-18,  -8, -10,  ...,   1,   1,   2]], dtype=torch.int8)
```

The difference between the original (FP32) and quantized values (INT8) is clear, but the difference between absmax and zero-point weights is more subtle. In this case, the zero-point values are offset by about 1 from the absmax ones. This suggests that the weight distribution in this layer is quite symmetric.

We can compare these techniques by quantizing every layer in GPT-2 (linear layers, attention layers, etc.) and creating two new models: `model_abs` and `model_zp`. To be precise, we will actually replace the original weights with dequantized ones. This has two benefits: it allows us to 1/ compare the distribution of our weights (same scale) and 2/ actually run the models.

Indeed, PyTorch doesn't allow INT8 matrix multiplication by default. In a real scenario, we would dequantize the weights to run the model (in FP16, for example) but store them as INT8. In the next section, we will use the `bitsandbytes` library to solve this issue.

```python
import numpy as np
from copy import deepcopy

# Store original weights
weights = [param.data.clone() for param in model.parameters()]

# Create model to quantize
model_abs = deepcopy(model)

# Quantize all model weights
weights_abs = []
for param in model_abs.parameters():
    _, dequantized = absmax_quantize(param.data)
    param.data = dequantized
    weights_abs.append(dequantized)

# Create model to quantize
model_zp = deepcopy(model)

# Quantize all model weights
weights_zp = []
for param in model_zp.parameters():
    _, dequantized = zeropoint_quantize(param.data)
    param.data = dequantized
    weights_zp.append(dequantized)
```

Now that our models have been quantized, we want to check the impact of this process. Intuitively, we want to make sure that the quantized weights are close to the original ones. A visual way to check this is to plot the distribution of the dequantized and original weights. If the quantization is too lossy, it will drastically change the weight distribution.

The following figure shows this comparison, where the blue histogram represents the original (FP32) weights, and the red one represents the dequantized (from INT8) weights. Note that we only display this plot between -2 and 2 because of outliers with very high absolute values (more on that later).

Both plots are quite similar, with a surprising spike around 0. This spike shows that our quantization is quite lossy since reversing the process doesn't output the original values. This is particularly true for the absmax model, which displays both a lower valley and a higher spike around 0.

Let's compare the performance of the original and quantized models. For this purpose, we define a `generate_text()` function that generates sequences of up to 50 tokens with top-k sampling.

```python
def generate_text(model, input_text, max_length=50):
    input_ids = tokenizer.encode(input_text, return_tensors='pt').to(device)
    output = model.generate(inputs=input_ids,
                            max_length=max_length,
                            do_sample=True,
                            top_k=30,
                            pad_token_id=tokenizer.eos_token_id,
                            attention_mask=input_ids.new_ones(input_ids.shape))
    # generate() returns a batch; decode its first (and only) sequence
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Generate text with original and quantized models
original_text = generate_text(model, "I have a dream")
absmax_text   = generate_text(model_abs, "I have a dream")
zp_text       = generate_text(model_zp, "I have a dream")

print(f"Original model:\n{original_text}")
print("-" * 50)
print(f"Absmax model:\n{absmax_text}")
print("-" * 50)
print(f"Zeropoint model:\n{zp_text}")
```
```
Original model:
I have a dream, and it is a dream I believe I would get to live in my future. I love my mother, and there was that one time I had been told that my family wasn't even that strong. And then I got the
--------------------------------------------------
Absmax model:
I have a dream to find out the origin of her hair. She loves it. But there's no way you could be honest about how her hair is made. She must be crazy.

We found a photo of the hairstyle posted on
--------------------------------------------------
Zeropoint model:
I have a dream of creating two full-time jobs in America—one for people with mental health issues, and one for people who do not suffer from mental illness—or at least have an employment and family history of substance abuse, to work part
```

Instead of trying to see if one output makes more sense than the others, we can quantify it by calculating the perplexity of each output. This is a common metric used to evaluate language models, which measures the uncertainty of a model in predicting the next token in a sequence. In this comparison, we make the common assumption that the lower the score, the better the model is. In practice, a sentence with a high perplexity could also be correct.

We implement it using a minimal function: because our sentences are short, it doesn't need to handle details like the length of the context window.
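To build intuition for the metric, here is a small sketch that computes perplexity directly from per-token probabilities, using the standard definition (the exponential of the average negative log-likelihood). The probability lists are made-up values for illustration, not outputs of GPT-2.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-likelihood per token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Hypothetical probabilities a model assigned to the true next tokens
confident = [0.5, 0.4, 0.6, 0.5]
uncertain = [0.1, 0.05, 0.2, 0.1]

print(f"confident model: {perplexity(confident):.2f}")
print(f"uncertain model: {perplexity(uncertain):.2f}")
```

A model that always spreads its probability evenly over four candidates (0.25 for the true token) gets a perplexity of 4, which is why the score is often read as an effective number of choices per token.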

```python
def calculate_perplexity(model, text):
    # Encode the text
    encodings = tokenizer(text, return_tensors='pt').to(device)

    # Define input_ids and target_ids
    input_ids = encodings.input_ids
    target_ids = input_ids.clone()

    with torch.no_grad():
        outputs = model(input_ids, labels=target_ids)

    # Loss calculation
    neg_log_likelihood = outputs.loss

    # Perplexity calculation
    ppl = torch.exp(neg_log_likelihood)

    return ppl

# Score each model on the text it generated itself
ppl     = calculate_perplexity(model, original_text)
ppl_abs = calculate_perplexity(model_abs, absmax_text)
ppl_zp  = calculate_perplexity(model_zp, zp_text)

print(f"Original perplexity:  {ppl.item():.2f}")
print(f"Absmax perplexity:    {ppl_abs.item():.2f}")
print(f"Zeropoint perplexity: {ppl_zp.item():.2f}")
```
```
Original perplexity:  15.53
Absmax perplexity:    17.92
Zeropoint perplexity: 17.97
```

We see that the perplexity of the original model is slightly lower than that of the two others. A single experiment is not very reliable, but we could repeat this process multiple times to see the difference between each model. In theory, zero-point quantization should be slightly better than absmax, but it is also more costly to compute.

In this example, we applied quantization techniques to entire layers (per-tensor basis). However, we could apply them at different granularity levels: from the entire model to individual values. Quantizing the entire model in one pass would seriously degrade the performance, while quantizing individual values would create a big overhead. In practice, we often prefer vector-wise quantization, which considers the variability of values in rows and columns inside the same tensor.
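The effect of granularity can be sketched with a tiny matrix whose rows have very different magnitudes. This is only an illustration under stated assumptions: it uses the usual absmax definition (scale = 127 / max absolute value), plain Python lists, made-up numbers, and a per-row scale as the simplest form of vector-wise quantization.

```python
def absmax_quantize_row(row):
    """Absmax-quantize a list of floats and return the dequantized values."""
    max_abs = max(abs(v) for v in row) or 1.0
    scale = 127 / max_abs
    return [round(v * scale) / scale for v in row]

def mean_abs_error(original, restored):
    pairs = [(o, r) for orow, rrow in zip(original, restored)
             for o, r in zip(orow, rrow)]
    return sum(abs(o - r) for o, r in pairs) / len(pairs)

matrix = [
    [0.011, -0.024, 0.037, -0.008],  # small-magnitude row
    [1.900, -2.500, 0.700, 3.100],   # large-magnitude row
]

# Per-tensor: one scale shared by every value in the matrix
flat = [v for row in matrix for v in row]
flat_restored = absmax_quantize_row(flat)
per_tensor = [flat_restored[0:4], flat_restored[4:8]]

# Vector-wise (here: per-row): each row gets its own scale
per_row = [absmax_quantize_row(row) for row in matrix]

print(f"per-tensor error: {mean_abs_error(matrix, per_tensor):.5f}")
print(f"per-row error:    {mean_abs_error(matrix, per_row):.5f}")
```

With a shared scale, the small row is squeezed into a handful of integer levels; giving each row its own scale restores most of its precision at the cost of storing one extra scale per row.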

However, even vector-wise quantization doesn't solve the problem of outlier features. Outlier features are extreme values (negative or positive) that appear in all transformer layers when the model reaches a certain scale (>6.7B parameters). This is an issue since a single outlier can reduce the precision for all other values. But discarding these outlier features is not an option since it would greatly degrade the model's performance.
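A short sketch shows why one outlier is so damaging. It assumes the usual absmax round trip (scale = 127 / max absolute value) and uses made-up feature values; the helper name `absmax_roundtrip` is hypothetical.

```python
def absmax_roundtrip(values):
    """Quantize to INT8 with absmax, then dequantize (assumed definition)."""
    max_abs = max(abs(v) for v in values) or 1.0
    scale = 127 / max_abs
    return [round(v * scale) / scale for v in values]

features = [0.12, -0.30, 0.25, -0.08, 0.41]  # made-up "regular" values
with_outlier = features + [60.0]              # same values plus one outlier

clean = absmax_roundtrip(features)
damaged = absmax_roundtrip(with_outlier)[:-1]  # drop the outlier to compare

print("without outlier:", [f"{v:+.3f}" for v in clean])
print("with outlier:   ", [f"{v:+.3f}" for v in damaged])
```

Because the outlier sets the scale, the five regular values collapse onto only three distinct levels, whereas without it they are reconstructed almost exactly.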

Introduced by Dettmers et al. (2022), LLM.int8() is a solution to the outlier problem. It relies on a vector-wise (absmax) quantization scheme and introduces mixed-precision quantization. This means that outlier features are processed in FP16 to retain their precision, while the other values are processed in INT8. As outliers represent about 0.1% of values, this effectively reduces the memory footprint of the LLM by almost 2x.
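The idea of mixed-precision decomposition can be illustrated with a toy matrix-vector product. This is a simplified sketch, not the `bitsandbytes` implementation: the function name, the cutoff value, and the data are assumptions, and ordinary Python floats stand in for FP16. Feature dimensions whose input magnitude reaches the cutoff take the high-precision path, while the rest are absmax-quantized, multiplied as integers, and dequantized.

```python
THRESHOLD = 6.0  # outlier cutoff; an assumed value for this sketch

def absmax_scale(values):
    max_abs = max(abs(v) for v in values) or 1.0
    return 127 / max_abs

def mixed_precision_matvec(x, W):
    """Compute x @ W, keeping outlier feature dimensions in floating point."""
    n_out = len(W[0])
    outlier_idx = [i for i, v in enumerate(x) if abs(v) >= THRESHOLD]
    regular_idx = [i for i, v in enumerate(x) if abs(v) < THRESHOLD]

    # High-precision path: outlier features times their weight rows
    result = [sum(x[i] * W[i][j] for i in outlier_idx) for j in range(n_out)]

    if regular_idx:
        # INT8 path: one scale for the input vector, one per weight column
        sx = absmax_scale([x[i] for i in regular_idx])
        xq = {i: round(x[i] * sx) for i in regular_idx}
        for j in range(n_out):
            sw = absmax_scale([W[i][j] for i in regular_idx])
            acc = sum(xq[i] * round(W[i][j] * sw) for i in regular_idx)  # integer math
            result[j] += acc / (sx * sw)  # dequantize the accumulated product
    return result

x = [0.3, -1.2, 45.0, 0.8]  # the third feature is an outlier
W = [[0.10, -0.20], [0.05, 0.30], [0.02, -0.01], [-0.40, 0.25]]

exact = [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]
approx = mixed_precision_matvec(x, W)
print("exact:", [f"{v:.4f}" for v in exact])
print("mixed:", [f"{v:.4f}" for v in approx])
```

Since the outlier never enters the INT8 path, it no longer dictates the quantization scale of the regular features, and the mixed result stays close to the exact product.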