ExLlamaV2
Highly optimized, NVIDIA-only runtime for EXL2-quantized models; best raw single-stream throughput on a single GPU.
ExLlamaV2 (and its successor, ExLlamaV3) is built specifically to wring maximum throughput out of NVIDIA hardware on single-stream inference. It pairs the custom EXL2 quantization format (ExLlamaV3 moves to a newer EXL3 format) with aggressive CUDA kernels.
For users prioritizing raw tokens per second on a single 3090/4090/A6000, exllamav2 is typically the fastest option, often 20-50% faster than llama.cpp on the same GPU and model. The cost is portability: it is NVIDIA-only, covers a narrower set of model architectures, and requires its own quant format.
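To make the workflow concrete, here is a minimal single-stream generation sketch following the pattern in the exllamav2 project's own examples. The model path is a placeholder and class names can shift between releases, so treat it as an illustration rather than version-pinned code.

```python
# Minimal single-stream generation with exllamav2.
# Assumes an EXL2-quantized model directory on disk (path is a placeholder).
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

config = ExLlamaV2Config("/models/llama3-8b-exl2-4.0bpw")  # placeholder path
model = ExLlamaV2(config)

# A lazy cache plus load_autosplit spreads weights across available GPU memory.
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)

output = generator.generate(prompt="Explain EXL2 quantization in one sentence.",
                            max_new_tokens=128)
print(output)
```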
Frequently asked
Is exllamav2 faster than llama.cpp?
On NVIDIA single-stream inference, yes: typically 20-50% faster at the same model and quant level. On other hardware the comparison is moot, since exllamav2 does not run there, and for batched multi-user serving, engines built around continuous batching such as vLLM usually pull ahead.
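Those numbers vary with GPU, quant bitrate, and context length, so it is worth measuring on your own hardware. A rough throughput check, assuming the `generator` set up in the loading sketch above:

```python
import time

# Rough single-stream throughput check. Assumes `generator` from the
# loading sketch above; prompt and token budget are arbitrary.
prompt = "Write a short story about a robot learning to paint."
max_new = 256

generator.generate(prompt=prompt, max_new_tokens=16)  # warm up CUDA kernels

start = time.time()
generator.generate(prompt=prompt, max_new_tokens=max_new)
elapsed = time.time() - start

# Approximation: assumes generation used the full token budget (no early EOS).
print(f"~{max_new / elapsed:.1f} tokens/s")
```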
Does exllamav2 work on AMD or Apple GPUs?
No. It's CUDA-only. For AMD GPUs, use llama.cpp's ROCm build or vLLM; for Apple Silicon, use MLX or llama.cpp's Metal backend.