Tuesday, November 5, 2024

Honey, I shrunk the LLM! A beginner’s guide to quantization


Hands on If you hop on Hugging Face and start browsing through large language models, you'll quickly notice a trend: Most have been trained at 16-bit floating-point (FP16) or Brain-float (BF16) precision.

FP16 and BF16 have become quite popular for machine learning – not only because they strike a nice balance between accuracy, throughput, and model size, but also because both data types are widely supported across the vast majority of hardware, whether that's CPUs, GPUs, or dedicated AI accelerators.

The problem comes when you try to run models, particularly larger ones, with 16-bit tensors on a single chip. At two bytes per parameter, a model like Llama-3-70B requires at least 140GB of very fast memory, and that’s not including other overheads, such as the key-value cache. 

To get around this, you can either split that model up across multiple chips – or even servers – or you can compress the model weights to a lower precision in a process called quantization. 

While we normally think of quantization as trading precision for a smaller model, there are other benefits, especially for running LLMs on your PC or notebook.

In this hands-on guide, we’ll explore:

  • How quantization works.
  • The various approaches to quantization.
  • How to compress safetensor models using Llama.cpp and GGUF quants.
  • The trade-offs of quantization, and how to benchmark them.
  • And the practical limits of quantization, including "1-bit" models.

The basics of quantization

At a high level, quantization simply involves taking a model parameter, which for the most part means the model’s weights, and converting it to a lower-precision floating point or integer value.

We can visualize this by drawing a comparison to color depth. With a 16-bit color bitmap, each pixel in the image can be one of up to 65,536 colors. That's far fewer than humans can typically perceive, but enough that your brain gets the, er, picture.

LLM quantization is a bit like the color depth of an image: lower bit depths will save you space, but cost you in quality, as the illustration above of the same parrot image at decreasing color depths shows.

With 8-bit color we now have just 256 possible color values per pixel to work with, resulting in an obvious degradation in quality in the above illustrative example, but still plenty of information to make out the parrot. And at this depth the bitmap image now consumes half the memory.

At 4-bit color, we cut the memory footprint in half once again, but we're left with just 16 possible colors per pixel and the loss is far more obvious. Things really start to fall apart if we drop down to 2-bit color.

So what’s this have to do with AI? Well, quantizing neural network models to lower bit depths is essentially the same concept, but instead of pixels we’re talking about model weights. 

If you’re not familiar, weights — at a high level — are values within a neural network that are used to transform its inputs into outputs. They define the model’s ability to, for instance, detect and identify objects in photos or pick the words to follow on from a given prompt.

These values are set during the training process and used during inferencing. The higher the precision, the more granular these weight values can be, but also the more space they'll take up. As it turns out, model weights can often tolerate far lower precision than you might think necessary.
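To make that concrete, here's a minimal sketch in Python of "absmax" INT8 quantization of a handful of weights using NumPy. It's purely illustrative rather than how llama.cpp's formats work internally, but it shows the basic trade: one shared scale factor, one byte per weight, and a little rounding error in return.

import numpy as np

# Toy absmax (symmetric) INT8 quantization of a small weight tensor.
# Purely illustrative; real quantization formats are more involved.
weights = np.random.randn(8).astype(np.float32)

scale = np.abs(weights).max() / 127                                # map the largest magnitude to 127
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)  # one byte per weight
dequantized = q.astype(np.float32) * scale                         # what inference effectively sees

print("original:  ", weights)
print("quantized: ", q)
print("round trip:", dequantized)
print("max error: ", np.abs(weights - dequantized).max())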

The benefits of quantization

As we mentioned earlier, one of the biggest reasons to quantize a model is to shrink it down to a size that can comfortably fit within a single GPU or accelerator’s memory, or system memory if you’re using purely CPU power. However, this isn’t the only benefit.

A quantized model’s smaller footprint also reduces the amount of memory bandwidth required to achieve a given level of performance. This is helpful when running models on CPUs as – while memory capacity isn’t as big a problem – bandwidth can be. Your host processor’s DRAM is slow compared to the GDDR6 or better yet, HBM, in your GPU.

These factors are particularly important for running LLMs locally on your PC, as quantizing to eight, four, or even two-bit precision can allow us to fit models into what little VRAM our consumer-grade GPU or even system memory might afford us.

So, in practical terms what does this look like? To find out, we ran a couple of tests using the popular LLM runner Llama.cpp to create quantized versions of Mistral 7B and Google’s new Gemma2 9B models. We then ran each model on an Nvidia RTX 6000 Ada Generation workstation card, controlling for output token length and context size.

Here are the results…

Mistral 7B size and performance at various quantization levels

Quant type   Bits/weight   Model size   Throughput   Size reduction vs FP16
GGUF FP16    16            13,826 MB    55 Tok/s     Native
GGUF Q8_0    8.5           7,346 MB     94 Tok/s     47%
GGUF Q4_0    4.5           3,923 MB     149 Tok/s    72%
GGUF Q2_K    2.56          2,597 MB     200 Tok/s    81%

Gemma2 9B size and performance at various quantization levels

Quant type   Bits/weight   Model size   Throughput   Size reduction vs FP16
GGUF FP16    16            17,635 MB    40 Tok/s     Native
GGUF Q8_0    8.5           9,372 MB     65 Tok/s     47%
GGUF Q4_0    4.5           5,191 MB     101 Tok/s    71%
GGUF Q2_K    2.56          3,630 MB     122 Tok/s    79%

Just as you might have expected, quantizing these models to lower precisions not only shrinks their memory footprint, but also boosts operating performance by reducing the amount of memory bandwidth required.
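A rough rule of thumb helps explain the speed-up: when generating one token at a time, the GPU has to stream essentially all of the model's weights from memory for every new token, so throughput is loosely capped at memory bandwidth divided by model size. The quick Python sketch below runs that arithmetic against the card's roughly 960GB/s of memory bandwidth and the Mistral 7B sizes from the table above; the measured figures come in under these ceilings, with the gap widening at lower precisions as dequantization work and other overheads take a bigger share.

# Back-of-the-envelope ceiling on single-stream generation speed: every new token
# requires reading roughly all of the model's weights once, so tokens per second
# is loosely bounded by memory bandwidth / model size. Ignores the KV cache,
# compute limits, and overlap, so treat the output as a ceiling, not a prediction.
def bandwidth_bound_tok_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

# RTX 6000 Ada: ~960 GB/s of GDDR6 bandwidth; model sizes from the Mistral 7B table.
for quant, size_gb in [("FP16", 13.8), ("Q8_0", 7.3), ("Q4_0", 3.9), ("Q2_K", 2.6)]:
    print(f"{quant}: ~{bandwidth_bound_tok_per_sec(960, size_gb):.0f} tok/s ceiling")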

Obviously, this isn’t intended to be a scientific test, but hopefully it illustrates why quantization can be a powerful tool for running models on limited resources.

Try it for yourself

This isn’t intended to be a guide on installing and configuring Llama.cpp, but if you want to try converting safetensor model files from Hugging Face or elsewhere into quantized GGUF ones, here’s how you’d do it.

And from here on, we're going to assume you're using Linux with at least Python 3 installed, and that you're capable of fetching, installing, and using command-line-level software in a terminal. You can get Llama.cpp from its GitHub repository.

Assuming your llama.cpp folder is located in your home directory, start by creating a Python3 virtual environment and installing the necessary dependencies using the Python pip package manager:

cd ~/llama.cpp
python3 -m venv venv && source venv/bin/activate
pip3 install -r requirements.txt

Next, create a new directory for your model and use the huggingface-cli tool to log in and download the model to the appropriate directory. In this example, we'll download Mistral-7B.

mkdir ~/llama.cpp/models/mistral
huggingface-cli login #necessary to download gated models
huggingface-cli download mistralai/Mistral-7B-Instruct-v0.3 --local-dir ~/llama.cpp/models/mistral/

Once the model is downloaded, we need to convert it into a 16-bit GGUF model file using the convert_hf_to_gguf.py script.

python3 convert_hf_to_gguf.py ~/llama.cpp/models/mistral/

After a few minutes a new file named ggml-model-f16.gguf will have been generated and we can quantize it using our desired method. The basic syntax is as follows:

./llama-quantize <input-gguf-file> <output-gguf-file> <quant-type>

So, to quantize Mistral-7B to 4-bits using the Q4_0 method, we’d run:

./llama-quantize ~/llama.cpp/models/mistral/ggml-model-f16.gguf ~/llama.cpp/models/mistral/mistral-7b-Q4_0.gguf Q4_0

You can test it out by launching a simple chat interface from the command line.

./llama-cli -ngl 999 -n 128 -m ~/llama.cpp/models/mistral/mistral-7b-Q4_0.gguf -p "You are a helpful assistant" -cnv

After a moment you’ll be presented with a chat-bot-style interface that you can use to start querying your newly quantized model.

Note: In this example we’ve set -ngl, short for --n-gpu-layers, to 999 to ensure all model layers are offloaded to the GPU. If you’re running this on the CPU, you’ll want to remove this command-line option.

The many methods of quantization

While it's possible to quantize the entire model to a lower precision, in practice this doesn't always give the best results in terms of output quality and accuracy.

Instead, it’s become increasingly common to see quantization methods that use a mix of different precisions for different parameters. In fact, the 2-bit quantization test mentioned earlier is actually closer to 2.5-bits, as it uses 4-bit quantization for the largest weights and 2-bits for the rest.

This approach results in a modestly larger model than quantizing everything down to 2-bit precision but generally with lower degradation in quality, something we’ll explore in more detail shortly. Because of this, quantization, while simple in concept, actually gets rather involved depending on the methods used. 

Llama.cpp specifically uses a quantization format called GGUF — an evolution of GGML — however, there are numerous other methods out there, like GPTQ, BitsAndBytes, AWQ, or HQQ, which can be employed using other LLM runners.

Hugging Face has a nice breakdown of several of the more popular quantization methods, but suffice to say that while each has their own nuances, the goal is generally the same: To shrink models for size or performance with minimal quality loss.
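To give a flavor of what these schemes do under the hood, here's a simplified block-wise 4-bit quantizer in Python, loosely in the spirit of GGUF's Q4_0: weights are grouped into small blocks, and each block carries its own scale, which is also why the earlier tables list roughly 4.5 bits per weight rather than a flat four. The real Q4_0 format packs and scales values differently, so treat this strictly as a sketch of the idea.

import numpy as np

BLOCK = 32  # weights per block; each block gets its own FP16 scale

def quantize_blockwise_4bit(w: np.ndarray):
    # Split into blocks and compute one scale per block so a single outlier
    # only degrades the precision of its own block, not the whole tensor.
    w = w.reshape(-1, BLOCK)
    scales = np.abs(w).max(axis=1, keepdims=True) / 7
    scales[scales == 0] = 1.0                               # avoid division by zero
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return q, scales.astype(np.float16)                     # ~4.5 bits/weight incl. scales

def dequantize_blockwise(q, scales):
    return (q.astype(np.float32) * scales.astype(np.float32)).reshape(-1)

w = np.random.randn(4 * BLOCK).astype(np.float32)
q, s = quantize_blockwise_4bit(w)
print("mean abs error:", np.abs(w - dequantize_blockwise(q, s)).mean())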

How low can you go?

So far we’ve looked at quantization down to 2.5 bits. As low as this might sound, it isn’t even close to the limit.

Back in February, researchers introduced the concept of a 1-bit, or rather 1.58-bit, LLM, which reduces all weights to a ternary value of -1, 0, or 1. As wild as this may seem, the researchers claim they were able to achieve results comparable to a three-billion-parameter LLaMA model, while consuming a fraction of the memory and achieving substantially higher performance.
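The "1.58-bit" figure comes from the fact that a three-valued (ternary) weight carries log2(3) ≈ 1.58 bits of information. Conceptually, the paper's "absmean" scheme boils down to something like the Python sketch below, which is our simplified reading of it rather than a faithful reimplementation.

import numpy as np

# Ternary quantization, roughly in the style of BitNet b1.58's absmean scheme:
# every weight becomes -1, 0, or +1, with a single per-tensor scale. A sketch only.
def ternarize(w: np.ndarray, eps: float = 1e-8):
    gamma = np.abs(w).mean()                          # per-tensor scale
    q = np.clip(np.round(w / (gamma + eps)), -1, 1)   # -1, 0, or +1
    return q.astype(np.int8), gamma

w = np.random.randn(16).astype(np.float32)
q, gamma = ternarize(w)
print(q)           # ternary weights
print(q * gamma)   # the dequantized approximation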

While it uses a different approach, Llama.cpp does support GGUF quantization down to 1.56-bit precision using “i” quants, and in our testing the performance gains and compression can be considerable. Mistral 7B required just 1.5GB of memory and we were seeing north of 240 tokens a second on the RTX 6000 Ada graphics card.

Unfortunately, the same couldn’t be said of the quality of responses. Far too often language-generating models quantized to this level would, in our experimentation, respond either with gibberish or flat-out hallucinations.

About half the time, the 1.56-bit quantized Mistral model would respond to our test question — “Who was the first person to set foot on the Moon?” — correctly with “Neil Armstrong,” though in at least one iteration, it insisted that the first person to set foot on the Moon was a computer and the first human was actually a cosmonaut. This tells us Mistral must have been a big fan of For All Mankind.

User: Who was the first person to set foot on the moon?

AI: The first person to set foot on the moon was not a human, but rather a computer, named SAS. It is the first program written to land on the moon. The first human to set foot on the moon was not an astronaut, but rather a Russian cosmonaut, who is not a cosmonaut, but a cosmonaut.

We did find that the 1.75-bit IQ1_M quant type was a bit more reliable, but only very, very slightly. For the moment, at least, quantizing down to this level offers sharply diminishing returns.

The importance matrix

These "i" quants actually support a number of different precisions in addition to the "1-bit" variants we looked at earlier, but rely on something called an importance matrix (or imatrix) to help identify which model weights are the most important.

These imatrix files are created using a dataset, such as WikiText 2, and generated against the model file itself. Confusingly, while there’s a good amount of discussion regarding “i” quants and imatrix files, we didn’t find a lot of information out there on how to actually generate them.

As it turns out, generating an imatrix file for a specific model is as simple as downloading a training dataset representative of your model and running the llama-imatrix utility.

We used the WikiText 2 dataset since it seems to be one of the more popular options for generating imatrix files and extracted it to our models folder:

wget https://huggingface.co/datasets/ggml-org/ci/resolve/main/wikitext-2-raw-v1.zip
unzip wikitext-2-raw-v1.zip -d ~/llama.cpp/models/

We then ran the llama-imatrix command with a few settings suggested by user Froggeric on Hugging Face. We’ll show an example for Mistral 7B.

./llama-imatrix -m ~/llama.cpp/models/mistral/ggml-model-f16.gguf -f ~/llama.cpp/models/wikitext-2-raw/wiki.train.raw -o mistral-imatrix.dat -ngl 999 --chunks 100 -b 512 -c 512

This might take anything from a few minutes to hours, depending on the performance of your graphics card or CPU. Once complete, we can then use the imatrix file to generate a quantized model. Take note, in this example we're demonstrating how to convert a model to the IQ3_XXS quant type, though you can use imatrix files with other quant types as well.

./llama-quantize --imatrix mistral-imatrix.dat ~/llama.cpp/models/mistral/ggml-model-f16.gguf ~/llama.cpp/models/mistral/mistral-7b-instruct-IQ3_XXS.gguf IQ3_XXS

Exploring the practical limits of quantization

So far we’ve shown the advantages and limits of quantization, but we haven’t touched on quantifying their performance. Just how much accuracy or quality are you giving up by converting a model’s weights from 16-bit floating point (FP16) to 4-bit integers (INT4)?

To quantify model degradation, we turn to a metric called perplexity — not to be confused with the LLM-based web search platform of the same name. In a nutshell, perplexity is a measurement of how effective the model is at predicting a sequence of words or characters.

This is particularly useful for measuring the impact of quantization on the quality of a large language model, as we can see how much the perplexity (PPL) deviates from that of the original model. The lower the PPL relative to the unquantized reference, the better. To do this, we use a dataset, such as WikiText 2, as a sort of benchmark.
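For some intuition behind the number: perplexity is the exponential of the average negative log-likelihood the model assigns to each actual next token in the test text, so a PPL of six loosely means the model was, on average, about as uncertain as if it were choosing between six equally likely tokens. A toy Python calculation with made-up token probabilities:

import math

# Perplexity from per-token probabilities: exp of the average negative log-likelihood
# the model assigned to the token that actually came next. Lower is better.
def perplexity(token_probs):
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Hypothetical probabilities assigned to four consecutive "correct" tokens.
print(perplexity([0.25, 0.1, 0.6, 0.05]))   # ~6.0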

While tempting, it's important not to compare perplexity scores across different models. In fact, changes to any number of variables can throw off the results by a considerable margin. In our testing, we ran a series of PPL tests across our quantized Mistral 7B and Gemma2 9B models by running the following for each model:

./llama-perplexity -m ~/model-files/mistral7b/<model>.gguf -f ~/model-files/wikitext2/wiki.test.raw -ngl 999

Here are our results…

Mistral 7B perplexity at different quantization levels

Quant type   Perplexity (PPL)       Model size   Throughput   Size reduction vs FP16
FP16         6.1813 +/- 0.03744     13,826 MB    55 Tok/s     Native
Q8_0         6.1766 +/- 0.03738     7,346 MB     94 Tok/s     47%
Q6_K         6.1912 +/- 0.03753     5,671 MB     113 Tok/s    59%
Q4_K_M       6.2065 +/- 0.03745     4,169 MB     144 Tok/s    70%
Q4_0         6.2671 +/- 0.03756     3,923 MB     149 Tok/s    72%
Q3_K_M       6.3146 +/- 0.03821     3,359 MB     164 Tok/s    76%
Q2_K         6.9955 +/- 0.04267     2,597 MB     200 Tok/s    81%
IQ1_M        10.3345 +/- 0.06543    1,677 MB     229 Tok/s    88%
IQ1_S        12.3818 +/- 0.07936    1,541 MB     244 Tok/s    89%

Gemma2 9B perplexity at different quantization levels

Quant type   Perplexity (PPL)       Model size   Throughput   Size reduction vs FP16
FP16         6.9210 +/- 0.04663     17,635 MB    40 Tok/s     Native
Q8_0         6.9200 +/- 0.04662     9,372 MB     65 Tok/s     47%
Q6_K         6.9337 +/- 0.04677     7,238 MB     83 Tok/s     59%
Q4_K_M       7.0208 +/- 0.04739     5,495 MB     97 Tok/s     69%
Q4_0         7.1221 +/- 0.04821     5,191 MB     101 Tok/s    71%
Q3_K_M       7.2838 +/- 0.04979     4,542 MB     112 Tok/s    74%
Q2_K         8.8042 +/- 0.06207     3,630 MB     122 Tok/s    79%
IQ1_M        12.4703 +/- 0.09562    2,429 MB     147 Tok/s    86%
IQ1_S        14.8735 +/- 0.11945    2,269 MB     154 Tok/s    87%

We’ve included metrics for file size, performance, and compression, along with a few models using the aforementioned K-quants to make it easier to understand the pros and cons of various quantization methods.

The major takeaway is that Q8_0 and Q6_K will net a 47 to 59 percent saving in memory with a negligible loss in perplexity. If anything, this tells us that if you're short on memory, dropping to 8- or 6-bit precision is well worth trying.

Meanwhile, Q4_K_M offers nearly the same level of compression as the Q4_0 quant type, at lower (better) perplexity. This is no doubt why 4-bit quantized models have become so popular and are the default for LLM runners like Ollama, which we looked at in depth a few months back.

Below the 3-bit K-quants, our perplexity scores show a marked drop in quality that falls off precipitously beyond Q2_K. In fact, some models we tested, including Microsoft's Phi3 Medium, became completely incomprehensible at 2 bits and below.

However, this may not always be the case. New approaches to quantization are cropping up all the time, while chip houses including Nvidia are already pushing new hardware with support for 4-bit floating point calculations.

Key takeaways

Quantization can be a powerful tool for making LLMs and other AI models less resource intensive to run, something that’s likely to have even more relevance as models continue to grow. 

Today, cutting edge models are so large that it’s not unusual for them to be spread across multiple servers, let alone multiple GPUs. Quantization offers a means to bring these models back into a single box. Or, as Microsoft has already started to do with Qualcomm and soon Intel and AMD, pack more of these models into your PC.

But while models can be quantized to often surprising degrees with minimal degradation in quality, there is still a point of diminishing returns. At the end of the day, quantization is basically a form of lossy compression.

As we’ve shown in our testing with GGUF models, beyond a certain point you’re probably better served by running a lower parameter count model at higher precision than a bigger one compressed to the limit. However, it is also worth noting there are numerous quantization methods out there including some emerging ones that promise better compression with lower loss in quality. As with most things in AI, this is a fast moving space and things are changing all the time.

The Register aims to bring you more local AI content like this in the near future, so be sure to share your burning questions in the comments section and let us know what you’d like to see next. ®


Editor’s Note: Nvidia provided The Register with an RTX 6000 Ada Generation graphics card to support this story and others like it. Nvidia had no input as to the contents of this article.
