llama.cpp
The reference C/C++ inference runtime for GGUF-quantized open-weight LLMs.
10 community reports
llama.cpp is the de facto reference implementation for running open-weight LLMs locally. It introduced the GGUF format that almost every other consumer-facing local LLM tool now consumes, and supports CPU, CUDA, ROCm, Metal, Vulkan, and SYCL backends.
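For a sense of how thin the core API is, here is a minimal sketch of loading a GGUF model through llama.cpp's C interface. Function names have shifted across releases, so treat this as illustrative rather than canonical: it assumes a llama.h where `llama_backend_init()` takes no arguments, and `model.gguf` is a placeholder path.

```c
// Minimal sketch (not the project's official example): load a GGUF model and
// create an inference context via llama.cpp's C API.
#include <stdio.h>
#include "llama.h"

int main(void) {
    llama_backend_init();  // assumes the no-argument variant in recent llama.h

    struct llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99;  // offload as many layers as fit to the compiled GPU backend

    struct llama_model * model = llama_load_model_from_file("model.gguf", mparams);
    if (model == NULL) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    struct llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx = 4096;  // context window to allocate

    struct llama_context * ctx = llama_new_context_with_model(model, cparams);
    if (ctx == NULL) {
        fprintf(stderr, "failed to create context\n");
        llama_free_model(model);
        return 1;
    }

    // ... tokenize the prompt, call llama_decode(), sample tokens ...

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```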
Almost every higher-level local LLM tool you've heard of (Ollama, LM Studio, Jan, GPT4All) wraps llama.cpp under the hood. Reports tagged 'Ollama' or 'LM Studio' on llamaperf are typically running llama.cpp internally; the engine field captures the user-facing tool, not the underlying runtime.
Performance is competitive across hardware. On NVIDIA, exllamav2 and vLLM beat it on raw throughput for batched workloads, but llama.cpp wins on portability, quant variety (Q2_K through Q8_0 plus i-quants), and single-user latency.
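Those quant types are produced by re-encoding an existing GGUF file, and the `llama-quantize` tool is a thin wrapper around a single library call. A rough sketch of the programmatic route is below; it assumes the `llama_model_quantize()` entry point and the `LLAMA_FTYPE_MOSTLY_Q4_K_M` constant as exposed in recent versions of llama.h, and the file names are placeholders.

```c
// Sketch: quantize an F16 GGUF down to Q4_K_M via llama.cpp's C API.
#include <stdio.h>
#include <stdint.h>
#include "llama.h"

int main(void) {
    llama_model_quantize_params qparams = llama_model_quantize_default_params();
    qparams.ftype   = LLAMA_FTYPE_MOSTLY_Q4_K_M;  // target quant type
    qparams.nthread = 8;                          // worker threads for quantization

    // Re-encodes the input GGUF into a new, quantized GGUF file.
    uint32_t ret = llama_model_quantize("model-f16.gguf", "model-q4_k_m.gguf", &qparams);
    if (ret != 0) {
        fprintf(stderr, "quantization failed\n");
        return 1;
    }
    return 0;
}
```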
Top GPUs running llama.cpp
| GPU | VRAM | Reports | Fastest t/s |
|---|---|---|---|
| RTX A6000 (NVIDIA) | 48GB | 4 | 16.9 |
| AMD Threadripper | 256GB | 4 | 8.8 |
| RTX 3090 (NVIDIA) | 24GB | 1 | 28.0 |
| RTX 5090 (NVIDIA) | 32GB | 1 | — |
Top models on llama.cpp
Frequently asked
Is llama.cpp the fastest engine for local LLMs?
It depends on the workload. For single-user interactive inference on consumer hardware, llama.cpp is competitive with or faster than alternatives. For batched serving on NVIDIA, vLLM and exllamav2 are typically faster. On Apple Silicon, MLX often edges it out.
What hardware does llama.cpp support?
CPU (any architecture with reasonable SIMD), CUDA (NVIDIA), ROCm (AMD), Metal (Apple Silicon), Vulkan (cross-vendor), and SYCL (Intel). The portability is unmatched.
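A quick way to check what your particular build supports is `llama_print_system_info()`, which returns a string describing the CPU features and backends compiled into the binary. A tiny sketch, assuming a recent llama.h; the exact contents of the string vary by version and build flags:

```c
// Print the CPU/backend capabilities this llama.cpp build was compiled with.
#include <stdio.h>
#include "llama.h"

int main(void) {
    llama_backend_init();
    printf("%s\n", llama_print_system_info());
    llama_backend_free();
    return 0;
}
```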
What is GGUF?
GGUF is the file format llama.cpp uses to package quantized model weights and metadata in a single file. It superseded the older GGML format and is now the most widely used local-LLM file format.
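The layout is simple enough to inspect by hand: a 4-byte magic (`GGUF`), a 32-bit version, then 64-bit tensor and metadata-kv counts, followed by the metadata key-value pairs and the tensor data, all in one file. A hedged sketch that reads just the fixed header (matching GGUF v2/v3; v1 used 32-bit counts) might look like this:

```c
// Illustrative reader for the fixed GGUF header; not an official parser.
// Fields are little-endian: magic "GGUF", uint32 version, uint64 tensor count,
// uint64 metadata key-value count.
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <model.gguf>\n", argv[0]);
        return 1;
    }
    FILE * f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    char     magic[4];
    uint32_t version    = 0;
    uint64_t n_tensors  = 0;
    uint64_t n_kv       = 0;

    if (fread(magic, 1, 4, f) != 4 || memcmp(magic, "GGUF", 4) != 0) {
        fprintf(stderr, "not a GGUF file\n");
        fclose(f);
        return 1;
    }
    // Assumes a little-endian host, which is what GGUF uses on disk.
    if (fread(&version,   sizeof(version),   1, f) != 1 ||
        fread(&n_tensors, sizeof(n_tensors), 1, f) != 1 ||
        fread(&n_kv,      sizeof(n_kv),      1, f) != 1) {
        fprintf(stderr, "truncated header\n");
        fclose(f);
        return 1;
    }
    fclose(f);

    printf("GGUF v%u: %llu tensors, %llu metadata key-value pairs\n",
           (unsigned) version,
           (unsigned long long) n_tensors,
           (unsigned long long) n_kv);
    return 0;
}
```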