ExLlamaV2
Highly optimized, NVIDIA-only runtime for EXL2-quantized models; best raw single-stream throughput on a single GPU.
ExLlamaV2 (and its successor, ExLlamaV3) is built specifically to wring maximum throughput out of NVIDIA hardware on single-stream inference. It pairs the custom EXL2 quantization format (ExLlamaV3 moves to a newer EXL3 format) with aggressive CUDA kernels.
For users prioritizing raw tokens per second on a single 3090/4090/A6000, exllamav2 is typically the fastest option, often 20-50% faster than llama.cpp on the same GPU and model. The cost is portability: it is NVIDIA-only, covers a narrower set of model architectures, and requires its own quant format.
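To make the workflow concrete, here is a minimal single-stream generation sketch following the pattern in the exllamav2 project's own examples. The model path is a placeholder and class names can shift between releases, so treat it as an illustration rather than version-pinned code.

```python
# Minimal single-stream generation with exllamav2.
# Assumes an EXL2-quantized model directory on disk (path is a placeholder).
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

config = ExLlamaV2Config("/models/llama3-8b-exl2-4.0bpw")  # placeholder path
model = ExLlamaV2(config)

# A lazy cache plus load_autosplit spreads weights across available GPU memory.
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)

output = generator.generate(prompt="Explain EXL2 quantization in one sentence.",
                            max_new_tokens=128)
print(output)
```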
Frequently asked
Is exllamav2 faster than llama.cpp?
On NVIDIA single-stream inference, yes: typically 20-50% faster at the same model and quant level. On other hardware the comparison is moot, since exllamav2 does not run there, and for batched multi-user serving, engines built around continuous batching such as vLLM usually pull ahead.
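Those numbers vary with GPU, quant bitrate, and context length, so it is worth measuring on your own hardware. A rough throughput check, assuming the `generator` set up in the loading sketch above:

```python
import time

# Rough single-stream throughput check. Assumes `generator` from the
# loading sketch above; prompt and token budget are arbitrary.
prompt = "Write a short story about a robot learning to paint."
max_new = 256

generator.generate(prompt=prompt, max_new_tokens=16)  # warm up CUDA kernels

start = time.time()
generator.generate(prompt=prompt, max_new_tokens=max_new)
elapsed = time.time() - start

# Approximation: assumes generation used the full token budget (no early EOS).
print(f"~{max_new / elapsed:.1f} tokens/s")
```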
Does exllamav2 work on AMD or Apple GPUs?
No. It's CUDA-only. For AMD GPUs, use llama.cpp's ROCm build or vLLM; for Apple Silicon, use MLX or llama.cpp's Metal backend.