Best GPUs for running local LLMs
Picking a GPU for local LLM inference comes down to VRAM (does the model fit?), memory bandwidth (how fast it generates), and software support. The list below is ranked by how many community reports each card has on llamaperf — a rough proxy for how heavily it gets used in practice — and surfaces the fastest tokens-per-second observed on each.
Ranked by community reports — 73 in total on llamaperf.
| # | GPU | Vendor | VRAM | Reports | Fastest t/s |
|---|---|---|---|---|---|
| 1 | RX 7900 XTX | AMD | 24GB | 6 | 58.0 |
| 2 | RTX 5090 | NVIDIA | 32GB | 4 | 106.5 |
| 3 | RTX 4060 Ti 16GB | NVIDIA | 16GB | 4 | 45.0 |
| 4 | RTX A6000 48GB | NVIDIA | 48GB | 4 | 16.9 |
| 5 | AMD Threadripper 256GB | AMD | 256GB | 4 | 8.8 |
| 6 | RTX 4090 | NVIDIA | 24GB | 3 | 149.6 |
| 7 | RTX 3090 | NVIDIA | 24GB | 3 | 66.0 |
| 8 | RTX 3060 12GB | NVIDIA | 12GB | 3 | 60.0 |
| 9 | Instinct MI300X 192GB | AMD | 192GB | 2 | 60.0 |
| 10 | RTX 4070 | NVIDIA | 12GB | 2 | 55.0 |
| 11 | RX 7900 XT | AMD | 20GB | 2 | 38.0 |
| 12 | Instinct MI250X 128GB | AMD | 128GB | 2 | 35.0 |
| 13 | Intel Arc B580 12GB | Intel | 12GB | 2 | 30.0 |
| 14 | RX 7800 XT 16GB | AMD | 16GB | 2 | 27.0 |
| 15 | RX 7600 8GB | AMD | 8GB | 2 | 25.0 |
| 16 | M5 Max 128GB | Apple | 128GB | 2 | 7.5 |
| 17 | H100 80GB | NVIDIA | 80GB | 1 | 45.0 |
| 18 | RTX 5060 Ti 16GB | NVIDIA | 16GB | 1 | 45.0 |
| 19 | M5 Max 64GB | Apple | 64GB | 1 | 32.0 |
| 20 | M4 Max 64GB | Apple | 64GB | 1 | 23.0 |
| 21 | M4 16GB | Apple | 16GB | 1 | 23.0 |
| 22 | M4 Max 36GB | Apple | 36GB | 1 | 21.0 |
| 23 | M3 16GB | Apple | 16GB | 1 | 21.0 |
| 24 | M1 Pro 16GB | Apple | 16GB | 1 | 20.0 |
| 25 | M4 Pro 24GB | Apple | 24GB | 1 | 19.0 |
| 26 | M3 Max 48GB | Apple | 48GB | 1 | 18.0 |
| 27 | M2 16GB | Apple | 16GB | 1 | 18.0 |
| 28 | M3 8GB | Apple | 8GB | 1 | 18.0 |
| 29 | M1 8GB | Apple | 8GB | 1 | 17.5 |
| 30 | M3 Max 36GB | Apple | 36GB | 1 | 16.0 |
| 31 | M2 Max 32GB | Apple | 32GB | 1 | 16.0 |
| 32 | M2 8GB | Apple | 8GB | 1 | 16.0 |
| 33 | M2 Ultra 64GB | Apple | 64GB | 1 | 14.0 |
| 34 | M3 Pro 18GB | Apple | 18GB | 1 | 14.0 |
| 35 | M1 16GB | Apple | 16GB | 1 | 14.0 |
| 36 | M1 Ultra 64GB | Apple | 64GB | 1 | 12.0 |
| 37 | M2 Pro 16GB | Apple | 16GB | 1 | 12.0 |
| 38 | M1 Max 32GB | Apple | 32GB | 1 | 10.0 |
| 39 | AMD MI50 32GB | AMD | 32GB | 1 | 9.7 |
| 40 | M3 Max 128GB | Apple | 128GB | 1 | 5.5 |
| 41 | M2 Ultra 192GB | Apple | 192GB | 1 | — |
| 42 | DGX Spark | NVIDIA | 128GB | 1 | — |
What to look for
VRAM is the gating constraint
Whether a model runs at all is decided by memory. A Q4_K_M quant of a 7B model needs ~5GB; a 13B needs ~8GB; a 30B needs ~20GB; a 70B needs ~40GB, plus headroom for the KV cache, which grows with context length. If the weights don't fit, generation either crawls (CPU offload) or fails outright. A rough estimator is sketched below.
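To put rough numbers on that, here is a minimal sketch of the arithmetic. It assumes ~4.8 bits per weight for Q4_K_M and a GQA-style KV cache of roughly 128 KB per token at FP16 (in the ballpark of an 8B Llama-3-class model); the helper `estimate_vram_gb` is hypothetical, not part of any library.

```python
# Back-of-envelope VRAM estimate for a quantized model.
# All constants are ballpark assumptions, not measurements.

def estimate_vram_gb(params_b: float, bits_per_weight: float = 4.8,
                     context: int = 8192, kv_kb_per_token: float = 128.0) -> float:
    """params_b: parameter count in billions.
    bits_per_weight: ~4.8 for Q4_K_M, ~8.5 for Q8_0, 16 for FP16.
    kv_kb_per_token: model-dependent; larger models need more per token."""
    weights_gb = params_b * bits_per_weight / 8   # 1e9 params x bits / 8 -> GB
    kv_gb = context * kv_kb_per_token / 1e6       # total KV cache, KB -> GB
    overhead_gb = 0.75                            # compute buffers, scratch space
    return weights_gb + kv_gb + overhead_gb

for size in (7, 13, 30, 70):
    print(f"{size}B @ Q4_K_M, 8k context: ~{estimate_vram_gb(size):.1f} GB")
```

The output lines up with the figures above once KV cache and scratch headroom are added on top of the raw weight sizes.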
Bandwidth determines tokens-per-second
Once the weights fit, throughput is dominated by memory bandwidth, not raw FLOPs: generating each token requires streaming the entire weight set from memory. An RTX 3090 (936 GB/s) and an RTX 4090 (1008 GB/s) land within ~10% of each other on token generation despite the 4090's much larger compute budget. M-series Macs trade off here: a massive unified memory pool, but Pro-tier bandwidth closer to a midrange discrete card.
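A quick way to see why: since every generated token streams the full weight set through memory once, bandwidth divided by weight size gives a hard ceiling on tokens per second. A minimal sketch using published bandwidth figures and the ~4.2 GB Q4_K_M 7B weight size from above (`max_tokens_per_s` is an illustrative helper, not a library function):

```python
# Decode-speed ceiling: t/s <= memory bandwidth / bytes of weights read per token.
# Real-world results land well below this due to KV reads and kernel overhead.

def max_tokens_per_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    return bandwidth_gb_s / weights_gb

weights_gb = 4.2  # 7B model at Q4_K_M (~4.8 bits/weight)
for name, bw in [("RTX 3090", 936), ("RTX 4090", 1008), ("M4 Pro", 273)]:
    print(f"{name}: <= {max_tokens_per_s(bw, weights_gb):.0f} t/s")
```

The 3090 and 4090 ceilings sit ~8% apart, matching the observation above, while the M4 Pro's 273 GB/s caps it at roughly a quarter of either.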
Software support gates which engines you can use
NVIDIA has CUDA kernels in every major engine (llama.cpp, vLLM, exllamav2, TensorRT-LLM). AMD support has improved sharply via ROCm but still trails on engine coverage. Apple Silicon is best-in-class for MLX and llama.cpp Metal but unsupported by vLLM. Match the engine you want to use to the hardware ecosystem.
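If you're unsure which backend your Python stack actually sees, PyTorch exposes simple probes. A minimal check, assuming a recent PyTorch install (note that ROCm builds report through `torch.version.hip` and also answer `True` to the CUDA probe):

```python
import torch

# Which accelerator backends this PyTorch build can reach.
print("CUDA available:", torch.cuda.is_available())          # NVIDIA, or AMD via ROCm's HIP layer
print("ROCm build:    ", torch.version.hip is not None)      # AMD ROCm
print("MPS available: ", torch.backends.mps.is_available())  # Apple Silicon Metal
```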
Frequently asked
What is the best GPU for running local LLMs?
There is no single answer — it depends on which model size you want to run. For 7B–13B models, an RTX 3060 12GB or RTX 4060 Ti 16GB is enough. For 30B-class models, an RTX 3090 or 4090 (24GB) is the sweet spot. For 70B-class, you need 40GB+ of VRAM (RTX A6000, dual 3090s, or an M-series Mac with 64GB+ unified memory).
Is more VRAM or more compute better for local LLMs?
VRAM, by a wide margin. Inference throughput is memory-bandwidth bound, not compute bound. A card with enough VRAM to fit your model and decent bandwidth will outperform a faster GPU that has to offload weights to system memory.
Do I need an NVIDIA GPU for local LLMs?
No. AMD GPUs work via ROCm with most major engines, and Apple Silicon Macs run llama.cpp Metal and MLX natively. NVIDIA still has the broadest engine support and best out-of-the-box experience, but it's no longer the only option.
How is this list ranked?
By the number of community submissions on llamaperf for each GPU. More reports indicate a GPU is widely used in practice for local LLM inference. The fastest tokens-per-second observed on each is shown alongside as a quality signal.
How we rank
Hardware is sorted by the number of community submissions on llamaperf — a proxy for how widely each card is used in practice for local LLM inference. Within that, we surface the fastest tokens-per-second observed on each as a quality signal. Submissions come primarily from r/LocalLLaMA discussions and direct user uploads. Nothing here is sponsored or affiliate-driven.