llama.cpp
The reference C/C++ inference runtime for GGUF-quantized open-weight LLMs.
10 community reports
llama.cpp is the de facto reference implementation for running open-weight LLMs locally. It introduced the GGUF format that almost every other consumer-facing local LLM tool now consumes, and supports CPU, CUDA, ROCm, Metal, Vulkan, and SYCL backends.
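For a sense of how thin the core API is, here is a minimal sketch of loading a GGUF model through llama.cpp's C interface. Function names have shifted across releases, so treat this as illustrative rather than canonical: it assumes a llama.h where `llama_backend_init()` takes no arguments, and `model.gguf` is a placeholder path.

```c
// Minimal sketch (not the project's official example): load a GGUF model and
// create an inference context via llama.cpp's C API.
#include <stdio.h>
#include "llama.h"

int main(void) {
    llama_backend_init();  // assumes the no-argument variant in recent llama.h

    struct llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99;  // offload as many layers as fit to the compiled GPU backend

    struct llama_model * model = llama_load_model_from_file("model.gguf", mparams);
    if (model == NULL) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    struct llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx = 4096;  // context window to allocate

    struct llama_context * ctx = llama_new_context_with_model(model, cparams);
    if (ctx == NULL) {
        fprintf(stderr, "failed to create context\n");
        llama_free_model(model);
        return 1;
    }

    // ... tokenize the prompt, call llama_decode(), sample tokens ...

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```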
Almost every higher-level local LLM tool you've heard of (Ollama, LM Studio, Jan, GPT4All) wraps llama.cpp under the hood. Reports tagged 'Ollama' or 'LM Studio' on llamaperf are typically running llama.cpp internally; the engine field captures the user-facing tool, not the underlying runtime.
Performance is competitive across hardware. On NVIDIA, exllamav2 and vLLM beat it on raw throughput for batched workloads, but llama.cpp wins on portability, quant variety (Q2_K through Q8_0 plus i-quants), and single-user latency.
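Those quant types are produced by re-encoding an existing GGUF file, and the `llama-quantize` tool is a thin wrapper around a single library call. A rough sketch of the programmatic route is below; it assumes the `llama_model_quantize()` entry point and the `LLAMA_FTYPE_MOSTLY_Q4_K_M` constant as exposed in recent versions of llama.h, and the file names are placeholders.

```c
// Sketch: quantize an F16 GGUF down to Q4_K_M via llama.cpp's C API.
#include <stdio.h>
#include <stdint.h>
#include "llama.h"

int main(void) {
    llama_model_quantize_params qparams = llama_model_quantize_default_params();
    qparams.ftype   = LLAMA_FTYPE_MOSTLY_Q4_K_M;  // target quant type
    qparams.nthread = 8;                          // worker threads for quantization

    // Re-encodes the input GGUF into a new, quantized GGUF file.
    uint32_t ret = llama_model_quantize("model-f16.gguf", "model-q4_k_m.gguf", &qparams);
    if (ret != 0) {
        fprintf(stderr, "quantization failed\n");
        return 1;
    }
    return 0;
}
```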
Top GPUs running llama.cpp
| GPU | VRAM | Reports | Fastest t/s |
|---|---|---|---|
| RTX A6000 (NVIDIA) | 48GB | 4 | 16.9 |
| AMD Threadripper | 256GB | 4 | 8.8 |
| RTX 3090 (NVIDIA) | 24GB | 1 | 28.0 |
| RTX 5090 (NVIDIA) | 32GB | 1 | — |
Top models on llama.cpp
Frequently asked
Is llama.cpp the fastest engine for local LLMs?
It depends on the workload. For single-user interactive inference on consumer hardware, llama.cpp is competitive with or faster than alternatives. For batched serving on NVIDIA, vLLM and exllamav2 are typically faster. On Apple Silicon, MLX often edges it out.
What hardware does llama.cpp support?
CPU (any architecture with reasonable SIMD), CUDA (NVIDIA), ROCm (AMD), Metal (Apple Silicon), Vulkan (cross-vendor), and SYCL (Intel). The portability is unmatched.
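A quick way to check what your particular build supports is `llama_print_system_info()`, which returns a string describing the CPU features and backends compiled into the binary. A tiny sketch, assuming a recent llama.h; the exact contents of the string vary by version and build flags:

```c
// Print the CPU/backend capabilities this llama.cpp build was compiled with.
#include <stdio.h>
#include "llama.h"

int main(void) {
    llama_backend_init();
    printf("%s\n", llama_print_system_info());
    llama_backend_free();
    return 0;
}
```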
What is GGUF?
GGUF is the file format llama.cpp uses to package quantized model weights and metadata in a single file. It superseded the older GGML format and is now the most widely used local-LLM file format.
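The layout is simple enough to inspect by hand: a 4-byte magic (`GGUF`), a 32-bit version, then 64-bit tensor and metadata-kv counts, followed by the metadata key-value pairs and the tensor data, all in one file. A hedged sketch that reads just the fixed header (matching GGUF v2/v3; v1 used 32-bit counts) might look like this:

```c
// Illustrative reader for the fixed GGUF header; not an official parser.
// Fields are little-endian: magic "GGUF", uint32 version, uint64 tensor count,
// uint64 metadata key-value count.
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <model.gguf>\n", argv[0]);
        return 1;
    }
    FILE * f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    char     magic[4];
    uint32_t version    = 0;
    uint64_t n_tensors  = 0;
    uint64_t n_kv       = 0;

    if (fread(magic, 1, 4, f) != 4 || memcmp(magic, "GGUF", 4) != 0) {
        fprintf(stderr, "not a GGUF file\n");
        fclose(f);
        return 1;
    }
    // Assumes a little-endian host, which is what GGUF uses on disk.
    if (fread(&version,   sizeof(version),   1, f) != 1 ||
        fread(&n_tensors, sizeof(n_tensors), 1, f) != 1 ||
        fread(&n_kv,      sizeof(n_kv),      1, f) != 1) {
        fprintf(stderr, "truncated header\n");
        fclose(f);
        return 1;
    }
    fclose(f);

    printf("GGUF v%u: %llu tensors, %llu metadata key-value pairs\n",
           (unsigned) version,
           (unsigned long long) n_tensors,
           (unsigned long long) n_kv);
    return 0;
}
```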