Local LLM inference engines
The runtime you pick matters as much as the GPU. Each engine has its own strengths — single-user latency, batched throughput, hardware coverage, quantization support. Below is every engine tracked on llamaperf, with its count of community performance reports; minimal usage sketches for the most-reported engines follow the list.
- Ollama (43 reports)
A user-friendly wrapper around llama.cpp with a model registry and a one-line install; it also exposes a local HTTP API (see the sketch after this list).
- llama.cpp (11 reports)
The reference C/C++ inference runtime for GGUF-quantized open-weight LLMs; it ships llama-cli for the terminal and llama-server for an OpenAI-compatible endpoint (sketch below).
- vLLM (11 reports)
High-throughput inference server with PagedAttention — built for batched serving at scale (offline-batch sketch below).
- LM Studio (4 reports)
GUI app for running local LLMs — wraps llama.cpp with a polished chat UI and model browser, and can serve the loaded model over the same OpenAI-compatible schema as the llama-server sketch below (default port 1234).
- MLX (2 reports)
Apple's machine-learning framework — generally the fastest way to run LLMs on Apple Silicon, usually via the mlx-lm package (sketch below).
- ExLlamaV2 (no data yet)
Highly-optimized NVIDIA-only runtime for EXL2-quantized models — best raw single-stream throughput on a single GPU.
- ExLlamaV3 (no data yet)
The successor to ExLlamaV2, rebuilt around the newer EXL3 quantization format.
- KoboldCpp (no data yet)
A self-contained llama.cpp-based engine that bundles an API server and web UI into a single binary.
- TabbyAPI (no data yet)
An OpenAI-compatible API server built on the ExLlama backends for EXL2-quantized models.
- text-generation-webui (no data yet)
A Gradio-based web UI that fronts multiple loaders, including llama.cpp and the ExLlama backends.
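
To make the list concrete, here are minimal usage sketches for the most-reported engines. First, Ollama: a sketch that queries its native HTTP API on the default port, assuming `ollama serve` is running and a model has been pulled (the "llama3" tag is an assumption; substitute any model you have pulled).

```python
# Query Ollama's local HTTP API (default port 11434).
import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps({
        "model": "llama3",      # assumed tag: substitute any pulled model
        "prompt": "Why is the sky blue?",
        "stream": False,        # return one JSON object, not a token stream
    }).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```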
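For llama.cpp, a sketch against llama-server's OpenAI-compatible endpoint; it assumes the server was started separately with something like `llama-server -m ./model.gguf --port 8080` (the model path is a placeholder). The same request shape also works against LM Studio's local server on port 1234.

```python
# Chat completion against llama-server's OpenAI-compatible endpoint.
import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps({
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 128,
    }).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["choices"][0]["message"]["content"])
```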
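vLLM's batched-throughput strength shows most clearly in its offline Python API, where a whole list of prompts is scheduled together; a sketch, with the model id as an assumption:

```python
# Offline batched generation with vLLM; prompts are scheduled as one batch.
from vllm import LLM, SamplingParams

prompts = [
    "The capital of France is",
    "The fastest land animal is",
]
params = SamplingParams(temperature=0.8, max_tokens=64)

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # assumed model id
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```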
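For MLX on Apple Silicon, the usual entry point is the mlx-lm package (`pip install mlx-lm`); a sketch, with the model id as an assumption (any mlx-community quantized conversion works):

```python
# Text generation on Apple Silicon with the mlx-lm package.
from mlx_lm import load, generate

# Assumed model id: any mlx-community conversion works here.
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")
print(generate(model, tokenizer, prompt="What is MLX?", max_tokens=128))
```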