
Quantization for local LLMs

What Q4_K_M, Q5_K_M, MLX-4bit, and the rest actually mean — and how to pick the right one for your hardware.

The one-paragraph version

A quantized LLM stores its weights in fewer bits than the original model — usually 4, 5, or 8 bits per weight instead of 16. This makes the model 2–4× smaller (and faster), at the cost of some quality. Modern 4-bit quants (Q4_K_M and friends) preserve roughly 95–98% of the original model's quality, a loss that's invisible in normal use. If your hardware can fit a Q4_K_M version of a model, that's almost always the right starting point.

Why quantization matters

Open-weight LLMs are typically published in 16-bit precision (BF16 or FP16), which means each parameter takes 2 bytes. A 70B model is therefore about 140GB of weights — far more than fits in any consumer GPU and most workstation cards. Without quantization, running these models locally would be impossible for almost everyone.

Quantization compresses the weights into a smaller representation. The compression isn't lossless, but for inference it turns out you can throw away a lot of precision before the model's outputs change in any noticeable way. A 4-bit quant uses one quarter of the memory of the original, which means a 70B model that needed an A100 (80GB) suddenly fits on a single 48GB card or a Mac with 64GB of unified memory.
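
The arithmetic is simple enough to check yourself. A back-of-the-envelope sketch in Python (nominal sizes only; real files carry a little overhead, and the KV cache comes on top):

    params = 70e9                     # 70B parameters
    fp16_gb = params * 2.0 / 1e9      # 2 bytes per weight at BF16/FP16
    q4_gb = params * 0.5 / 1e9        # 4 bits = 0.5 bytes per weight
    print(f"FP16: {fp16_gb:.0f}GB  Q4: {q4_gb:.0f}GB")   # FP16: 140GB  Q4: 35GB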

The quant levels, by VRAM cost

The table below shows the major GGUF quantization levels used by llama.cpp, Ollama, and LM Studio. The VRAM column is approximate bytes per weight, which you can read directly as GB per billion parameters — a 70B model in Q4_K_M, for instance, takes roughly 70 × 0.5 ≈ 35GB of weights (KV cache and context add more; a small estimator sketch follows below).

Quant         Bits  Bytes/weight  Quality
Q2_K          2     0.25          low — visible degradation
Q3_K_M        3     0.38          fair — usable but compromised
IQ4_XS        4     0.45          good — competes with Q4_K_M at smaller size
Q4_K_M        4     0.50          very good — the most popular quant
Q5_K_M        5     0.62          excellent — small step up from Q4
Q6_K          6     0.75          near-perfect
Q8_0          8     1.00          indistinguishable from FP16
BF16 / FP16   16    2.00          full precision (reference)

Most users settle on Q4_K_M for the largest model that fits, then step up to Q5_K_M or Q6_K if they have headroom. Q8 and FP16 are rarely worth it: the quality difference vs Q5 is barely measurable and the memory cost is significant.
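
To put numbers on the table, here's a minimal estimator sketch (the dict and function names are mine, not from any library; real GGUF files run a little larger than the nominal figure because some tensors stay at higher precision):

    # Approximate bytes per weight, from the table above.
    BYTES_PER_WEIGHT = {
        "Q2_K": 0.25, "Q3_K_M": 0.38, "IQ4_XS": 0.45, "Q4_K_M": 0.50,
        "Q5_K_M": 0.62, "Q6_K": 0.75, "Q8_0": 1.00, "BF16": 2.00,
    }

    def weight_gb(params_billions: float, quant: str) -> float:
        """Nominal weight size in GB (KV cache and context not included)."""
        return params_billions * BYTES_PER_WEIGHT[quant]

    for q in ("Q4_K_M", "Q5_K_M", "Q8_0"):
        print(f"70B at {q}: ~{weight_gb(70, q):.0f}GB")
    # 70B at Q4_K_M: ~35GB
    # 70B at Q5_K_M: ~43GB
    # 70B at Q8_0: ~70GB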

What the letters and numbers mean

GGUF quant names look cryptic but follow a pattern.

  • Q4, Q5, Q8 — bits per weight (on average; some weights use more, some less).
  • _K — uses k-quants, llama.cpp's modern quantization method (better quality than the legacy Q4_0/Q4_1 scheme).
  • _S / _M / _L — small, medium, large variant. Each step up uses ~10% more memory for a small quality gain.
  • IQ — i-quants, which use importance matrices to pack bits more intelligently. IQ4_XS often matches Q4_K_M quality at smaller size.
  • _0, _1 — legacy (pre-k-quant) variants. Generally superseded by the K variants.

So Q4_K_M reads as "4-bit, k-quant family, medium variant". That's it.
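
The naming is regular enough that you can parse it mechanically. A toy sketch to make the pattern concrete (the regex and field names are mine; llama.cpp ships nothing like this):

    import re

    def parse_quant(name: str) -> dict:
        """Split a GGUF quant name like 'Q4_K_M' or 'IQ4_XS' into its parts."""
        m = re.fullmatch(r"(I?Q)(\d+)(?:_(K|0|1))?(?:_(XS|S|M|L))?", name)
        if not m:
            raise ValueError(f"unrecognized quant name: {name}")
        family, bits, scheme, size = m.groups()
        return {
            "i_quant": family == "IQ",   # importance-matrix quant
            "bits": int(bits),           # average bits per weight
            "k_quant": scheme == "K",    # modern k-quant vs legacy _0/_1
            "variant": size,             # XS / S / M / L, if present
        }

    print(parse_quant("Q4_K_M"))
    # {'i_quant': False, 'bits': 4, 'k_quant': True, 'variant': 'M'}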

GGUF vs EXL2 vs AWQ vs MLX

GGUF isn't the only quant format. The format you pick depends on the engine you're running; the major pairings are listed below and condensed into a small lookup table after the list.

  • GGUF — used by llama.cpp, Ollama, LM Studio, KoboldCpp. The most portable format. Runs on every backend (CPU, CUDA, ROCm, Metal, Vulkan).
  • EXL2 — used by ExLlamaV2 (its successor, ExLlamaV3, uses the newer EXL3 format). NVIDIA-only. Often the highest single-stream throughput on CUDA.
  • AWQ / GPTQ — used by vLLM and similar serving engines. NVIDIA-first. Tuned for batched multi-user serving.
  • MLX quants — used by MLX. Apple Silicon only. Typically the fastest option on a Mac.
  • FP8, NVFP4 — newer hardware-accelerated formats (FP8 on Ada / Hopper and later, NVFP4 on Blackwell). Used by vLLM, TensorRT-LLM. Very fast but limited hardware support.
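
For quick reference, here are those pairings as a lookup table (a restatement of the list above, nothing authoritative):

    # Engine → quant format(s), summarizing the list above.
    ENGINE_FORMATS = {
        "llama.cpp": ("GGUF",),
        "Ollama": ("GGUF",),
        "LM Studio": ("GGUF",),
        "KoboldCpp": ("GGUF",),
        "ExLlamaV2": ("EXL2",),
        "vLLM": ("AWQ", "GPTQ", "FP8", "NVFP4"),
        "MLX": ("MLX",),
        "TensorRT-LLM": ("FP8", "NVFP4"),
    }

    print(ENGINE_FORMATS["Ollama"])   # ('GGUF',)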

How to pick a quant for your hardware

  1. Start from your VRAM. Whatever model you want to run, the weights need to fit with at least ~4–8GB of headroom for context and KV cache. (The sketch after this list turns these steps into arithmetic.)
  2. Try Q4_K_M of the largest model that fits. A larger model in Q4 almost always beats a smaller model in Q8. Quality scales with parameters more than with quant level.
  3. Step up to Q5 or Q6 if you have headroom. The quality gain is small but free if you're not memory-bound.
  4. Don't go below Q4 unless you have to. Q3 and Q2 quants degrade quality noticeably. If a Q4 doesn't fit, prefer a smaller model.
  5. Match the format to your engine. GGUF if you're on llama.cpp / Ollama / LM Studio. EXL2 if on ExLlamaV2. AWQ/GPTQ if on vLLM. MLX if you're on Apple Silicon.
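
Steps 1–4 reduce to a few lines of code. A sketch using the nominal bytes-per-weight figures from the table (the function, its names, and the 6GB headroom default are all mine):

    # Quant ladder from the table, cheapest first; step up while it still fits.
    QUANT_LADDER = [("Q4_K_M", 0.50), ("Q5_K_M", 0.62), ("Q6_K", 0.75)]

    def pick_quant(params_billions: float, vram_gb: float,
                   headroom_gb: float = 6.0) -> str | None:
        """Best quant whose weights plus headroom fit in VRAM, or None."""
        best = None
        for quant, bytes_per_weight in QUANT_LADDER:
            if params_billions * bytes_per_weight + headroom_gb <= vram_gb:
                best = quant
        return best

    print(pick_quant(70, 48))   # Q4_K_M (35 + 6 fits in 48; Q5 at 43.4 + 6 would not)
    print(pick_quant(8, 12))    # Q6_K (an 8B model has headroom even at 6 bits)

If it returns None, step 4 applies: drop to a smaller model rather than reaching for a Q3.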

Frequently asked

What does Q4_K_M mean?

Q4_K_M is a 4-bit GGUF quantization scheme used by llama.cpp. The K means it uses k-quants (a more sophisticated method than legacy Q4_0). The M stands for medium — there are also S (small) and L (large) variants that trade file size for quality. In practice, Q4_K_M is the most commonly recommended quant for local LLMs because it offers a near-optimal quality-per-byte ratio.

Is Q4 quantization too lossy for serious use?

For most users, no. Modern Q4 k-quants (Q4_K_M, Q4_K_S) preserve roughly 95–98% of the full-precision model's quality on standard benchmarks while using a quarter of the memory. The drop is usually invisible in conversation. Q3 and below start to noticeably degrade quality.

Should I use Q4_K_M, Q5_K_M, or Q8?

If your hardware can fit it, Q5_K_M is a small step up from Q4_K_M in quality at ~25% more memory. Q8 is essentially indistinguishable from full precision but uses ~60% more memory than Q5. Most users settle on Q4_K_M for the largest model that fits, or Q5/Q6 for slightly smaller models. Q8 is rarely worth it.

What's the difference between GGUF and EXL2 and AWQ quants?

GGUF (used by llama.cpp, Ollama, LM Studio) is the most portable format and works on every backend. EXL2 (used by ExLlamaV2) is NVIDIA-only and offers higher throughput on CUDA. AWQ and GPTQ (used by vLLM and others) are designed for batched serving on NVIDIA. MLX has its own quantization scheme for Apple Silicon. They're not interchangeable.

Do quantized models need less VRAM?

Yes — that's their main purpose. A 70B model in full BF16 precision is ~140GB. At Q8 it's ~70GB. At Q5_K_M it's ~50GB. At Q4_K_M it's ~40GB. The same model can run on hardware ten times cheaper depending on which quant you choose.

What are i-quants (IQ4_XS, IQ3_M)?

i-quants are a newer family of llama.cpp quants that use importance matrices (computed from a calibration dataset) to allocate bits more intelligently. They preserve more quality than a k-quant of the same file size. IQ4_XS, for example, is smaller than Q4_K_M but with comparable quality.
