vLLM
High-throughput inference server with PagedAttention — built for batched serving at scale.
11 community reports
vLLM is the standard for serving local LLMs to multiple concurrent users. PagedAttention dramatically improves throughput on batched workloads by managing the KV cache like a virtual memory system, letting many requests share GPU memory efficiently.
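As a rough sketch of what that looks like in practice, here is batched offline generation with vLLM's Python API; the model id, memory setting, and prompts are placeholders rather than anything taken from the reports below.

```python
from vllm import LLM, SamplingParams

prompts = [
    "Summarize PagedAttention in one sentence.",
    "List three things a KV cache stores.",
    "Explain continuous batching briefly.",
]

# One engine instance; vLLM batches these prompts and pages their
# KV-cache blocks so the requests share GPU memory efficiently.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", gpu_memory_utilization=0.90)  # placeholder model id
params = SamplingParams(temperature=0.7, max_tokens=128)

for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```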
Single-user latency is comparable to other engines; vLLM shines once you have 4, 16, or 64 concurrent requests, where aggregate throughput stays high while llama.cpp or Ollama would largely serialize them.
vLLM is NVIDIA-first. AMD support via ROCm exists but is less mature, and there is no Apple Silicon backend. If you're building a multi-user service rather than a single-user chat client, vLLM is almost always the right choice.
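For the multi-user case, the usual setup is vLLM's OpenAI-compatible server with any OpenAI SDK pointed at it. A minimal sketch, assuming a server already launched with `vllm serve`; the model id, port, and prompt are placeholders.

```python
# Assumes a server is already running, e.g.:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
# Model id and port are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello from a vLLM client"}],
)
print(resp.choices[0].message.content)
```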
Top GPUs running vLLM
| GPU | Vendor | VRAM | Reports | Fastest (tok/s) |
|---|---|---|---|---|
| RX 7900 XTX | AMD | 24GB | 4 | 58.0 |
| RTX 5090 | NVIDIA | 32GB | 2 | 106.5 |
| Instinct MI300X | AMD | 192GB | 2 | 60.0 |
| RTX 3090 | NVIDIA | 24GB | 1 | 66.0 |
| H100 | NVIDIA | 80GB | 1 | 45.0 |
| Instinct MI250X | AMD | 128GB | 1 | 20.0 |
Top models on vLLM
Frequently asked
Is vLLM faster than llama.cpp?
For multi-user concurrent serving, yes — significantly so. For single-user inference on a 7B–70B model, throughput is comparable. vLLM's win is in batching, not raw single-stream speed.
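To see that batching advantage from the client side, one rough approach is to fire requests concurrently and let the server's continuous batching absorb them; the endpoint, model id, and request count below are illustrative, not measured configurations.

```python
import asyncio
from openai import AsyncOpenAI

# Placeholder endpoint and model id; assumes a running vLLM server.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

async def ask(i: int) -> str:
    resp = await client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": f"Request {i}: reply with one word."}],
        max_tokens=16,
    )
    return resp.choices[0].message.content

async def main() -> None:
    # 16 concurrent requests are batched by the server rather than handled one by one.
    answers = await asyncio.gather(*(ask(i) for i in range(16)))
    print(f"{len(answers)} responses")

asyncio.run(main())
```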
Does vLLM run on Apple Silicon?
No. vLLM is built around CUDA (with ROCm support for AMD). For Apple Silicon, use llama.cpp's Metal backend or MLX.
What models does vLLM support?
Most popular open-weight architectures: Llama, Qwen, Mistral, DeepSeek, Phi, Gemma, and many more. Quantization support includes AWQ, GPTQ, FP8, and BitsAndBytes.
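As an illustration of the quantization options, here is a sketch of loading an AWQ checkpoint through the Python API; the repo id is a placeholder, and the same `quantization` argument accepts the other methods listed above.

```python
from vllm import LLM

# Placeholder repo id; "gptq", "fp8", or "bitsandbytes" would be passed the same way.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct-AWQ",
    quantization="awq",
    max_model_len=8192,
)
print(llm.generate(["Hello"])[0].outputs[0].text)
```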