Local LLM inference engines
The runtime you pick matters as much as the GPU. Each engine has its own strengths — single-user latency, batched throughput, hardware coverage, quantization support. Below is every engine tracked on llamaperf, with its count of community performance reports; minimal usage sketches for the most-reported engines follow the list.
- Ollama (43 reports)
A user-friendly wrapper around llama.cpp with a model registry and a one-line install; it also exposes a local HTTP API (see the sketch after this list).
- llama.cpp (11 reports)
The reference C/C++ inference runtime for GGUF-quantized open-weight LLMs; it ships llama-cli for the terminal and llama-server for an OpenAI-compatible endpoint (sketch below).
- vLLM (11 reports)
High-throughput inference server with PagedAttention — built for batched serving at scale (offline-batch sketch below).
- LM Studio (4 reports)
GUI app for running local LLMs — wraps llama.cpp with a polished chat UI and model browser, and can serve the loaded model over the same OpenAI-compatible schema as the llama-server sketch below (default port 1234).
- MLX (2 reports)
Apple's machine-learning framework — generally the fastest way to run LLMs on Apple Silicon, usually via the mlx-lm package (sketch below).
- ExLlamaV2 (no data yet)
Highly-optimized NVIDIA-only runtime for EXL2-quantized models — best raw single-stream throughput on a single GPU.
- ExLlamaV3 (no data yet)
The successor to ExLlamaV2, rebuilt around the newer EXL3 quantization format.
- KoboldCpp (no data yet)
A self-contained llama.cpp-based engine that bundles an API server and web UI into a single binary.
- TabbyAPI (no data yet)
An OpenAI-compatible API server built on the ExLlama backends for EXL2-quantized models.
- text-generation-webui (no data yet)
A Gradio-based web UI that fronts multiple loaders, including llama.cpp and the ExLlama backends.
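
To make the list concrete, here are minimal usage sketches for the most-reported engines. First, Ollama: a sketch that queries its native HTTP API on the default port, assuming `ollama serve` is running and a model has been pulled (the "llama3" tag is an assumption; substitute any model you have pulled).

```python
# Query Ollama's local HTTP API (default port 11434).
import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps({
        "model": "llama3",      # assumed tag: substitute any pulled model
        "prompt": "Why is the sky blue?",
        "stream": False,        # return one JSON object, not a token stream
    }).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```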
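For llama.cpp, a sketch against llama-server's OpenAI-compatible endpoint; it assumes the server was started separately with something like `llama-server -m ./model.gguf --port 8080` (the model path is a placeholder). The same request shape also works against LM Studio's local server on port 1234.

```python
# Chat completion against llama-server's OpenAI-compatible endpoint.
import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps({
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 128,
    }).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["choices"][0]["message"]["content"])
```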
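vLLM's batched-throughput strength shows most clearly in its offline Python API, where a whole list of prompts is scheduled together; a sketch, with the model id as an assumption:

```python
# Offline batched generation with vLLM; prompts are scheduled as one batch.
from vllm import LLM, SamplingParams

prompts = [
    "The capital of France is",
    "The fastest land animal is",
]
params = SamplingParams(temperature=0.8, max_tokens=64)

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # assumed model id
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```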
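For MLX on Apple Silicon, the usual entry point is the mlx-lm package (`pip install mlx-lm`); a sketch, with the model id as an assumption (any mlx-community quantized conversion works):

```python
# Text generation on Apple Silicon with the mlx-lm package.
from mlx_lm import load, generate

# Assumed model id: any mlx-community conversion works here.
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")
print(generate(model, tokenizer, prompt="What is MLX?", max_tokens=128))
```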