llamaperf

Best NVIDIA GPUs for local LLMs

NVIDIA still has the deepest software stack for local inference: CUDA kernel support in every major engine, exllamav2 for high-throughput quantized inference, and vLLM for batched serving. The ranking below comes from real community submissions, not synthetic benchmarks.

Ranked from 26 community reports on llamaperf.

Ranked by community reports

 #  GPU               VRAM   Reports  Fastest t/s
 1  RTX 5090          32GB   4        106.5
 2  RTX 4060 Ti 16GB  16GB   4        45.0
 3  RTX A6000 48GB    48GB   4        16.9
 4  RTX 4090          24GB   3        149.6
 5  RTX 3090          24GB   3        66.0
 6  RTX 3060 12GB     12GB   3        60.0
 7  RTX 4070          12GB   2        55.0
 8  H100 80GB         80GB   1        45.0
 9  RTX 5060 Ti 16GB  16GB   1        45.0
10  DGX Spark         128GB  1        -


What to look for

Used 3090s remain the price/performance king

A used RTX 3090 with 24GB VRAM consistently runs 30B-class models at usable speeds and can handle 70B with two cards in tensor-parallel mode. Despite being two generations old, the bandwidth (936 GB/s) is competitive with new cards costing 3× as much.
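As a concrete illustration of the two-card setup, here is a minimal vLLM sketch that shards a quantized 70B checkpoint across two 24GB cards with tensor parallelism. The repository name and memory settings are illustrative assumptions, not a tested recommendation.

```python
# Sketch: serving a quantized 70B model across two RTX 3090s with vLLM.
# The model repo is illustrative; substitute any AWQ/GPTQ 70B checkpoint
# that fits in 2 x 24 GB after quantization.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-70B-Chat-AWQ",  # example repo, swap for your own
    quantization="awq",
    tensor_parallel_size=2,        # shard the weights across both 3090s
    gpu_memory_utilization=0.90,   # leave headroom for the KV cache
)

outputs = llm.generate(
    ["Explain tensor parallelism in one short paragraph."],
    SamplingParams(max_tokens=128, temperature=0.7),
)
print(outputs[0].outputs[0].text)
```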

RTX 4090 vs RTX 3090

On pure inference (memory-bandwidth-bound), the 4090 is only ~10–15% faster than the 3090 despite massively more compute. The 4090 wins on prompt processing (compute-bound) and any workload involving training/fine-tuning, but for dollar-per-token-per-second, the 3090 is still hard to beat.
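The bandwidth-bound intuition can be made concrete with a back-of-envelope estimate: single-stream decode speed is roughly memory bandwidth divided by the bytes streamed per token, which is about the quantized model size. The efficiency factor and model size below are assumptions for illustration, not measurements.

```python
def decode_tps(bandwidth_gbs: float, model_gb: float, efficiency: float = 0.6) -> float:
    """Rough single-stream decode estimate for a bandwidth-bound GPU."""
    # Each generated token streams (roughly) the whole quantized model through
    # memory once; the efficiency factor is an assumed fudge for real-world overhead.
    return bandwidth_gbs * efficiency / model_gb

MODEL_30B_Q4_GB = 18.0  # ballpark size of a 30B-class model at ~4.5 bits/weight

for name, bw in [("RTX 3090", 936.0), ("RTX 4090", 1008.0)]:
    print(f"{name}: ~{decode_tps(bw, MODEL_30B_Q4_GB):.0f} t/s on a 30B Q4 model")
```

The two cards' bandwidth figures differ by under 10%, which is why the decode gap stays small even though the 4090 has far more raw compute.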

Workstation cards (A6000, A100, H100)

These give you 48–80GB of VRAM in a single card — enough for 70B+ models without partitioning. The premium over consumer cards is steep but justified for production serving or research workloads where multi-card scaling adds latency overhead.
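For the single-card case, a minimal llama-cpp-python sketch shows what "no partitioning" buys you: every layer offloaded to one device, no tensor-parallel launch. The GGUF path is illustrative.

```python
# Sketch: loading a Q4 70B GGUF entirely onto a single 48 GB card.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3-70b-instruct.Q4_K_M.gguf",  # illustrative path
    n_gpu_layers=-1,   # offload all layers; a Q4 70B (~40GB) fits in 48GB
    n_ctx=8192,
)

result = llm("Q: Why run a 70B model on one card?\nA:", max_tokens=64)
print(result["choices"][0]["text"])
```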

Frequently asked

What is the best NVIDIA GPU for local LLMs?

For most users, an RTX 4090 (new) or RTX 3090 (used) — both 24GB — hit the sweet spot for 30B-class models. For 70B work, an RTX A6000 (48GB) or two 3090s in tensor-parallel mode are the standard recommendations.

Is an RTX 4090 worth it over an RTX 3090 for local LLMs?

For pure inference, the 4090 is only modestly faster (~10–15%) because both are limited by memory bandwidth on the same model. The 4090 wins clearly on prompt processing speed and any compute-bound workload (training, fine-tuning, very long contexts).

Can I run local LLMs on an RTX 3060 or 4060?

Yes, on smaller models. A 12GB card runs 7B–13B models comfortably at Q4 quantization. 8GB cards are workable for 7B but tight on context length.
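A rough way to sanity-check whether a quantized model fits is to add up weights, KV cache, and fixed overhead. The constants in this sketch are ballpark assumptions and vary by engine, context length, and quantization format.

```python
def fits(params_b: float, vram_gb: float, bits_per_weight: float = 4.5,
         ctx_tokens: int = 4096, kv_gb_per_1k_tokens: float = 0.5) -> bool:
    """Very rough check: weights + KV cache + fixed overhead vs. available VRAM."""
    weights_gb = params_b * bits_per_weight / 8       # 13B at ~4.5 bpw ≈ 7.3 GB
    kv_gb = ctx_tokens / 1000 * kv_gb_per_1k_tokens   # grows linearly with context
    overhead_gb = 1.5                                 # CUDA context, buffers, activations
    return weights_gb + kv_gb + overhead_gb <= vram_gb

print(fits(13, 12))  # 13B Q4 on a 12GB card -> True
print(fits(13, 8))   # 13B Q4 on an 8GB card -> False
print(fits(7, 8))    # 7B Q4 on an 8GB card  -> True, but tight on context
```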

How we rank

Hardware is sorted by the number of community submissions on llamaperf — a proxy for how widely each card is used in practice for local LLM inference. Within that, we surface the fastest tokens-per-second observed on each as a quality signal. Submissions come primarily from r/LocalLLaMA discussions and direct user uploads. Nothing here is sponsored or affiliate-driven.
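In code, the ranking described above amounts to a two-key sort. The secondary tiebreak on fastest tokens-per-second is an assumption inferred from the table ordering, not a documented rule.

```python
# (gpu, report_count, fastest_tps) taken from the table above
reports = [
    ("RTX 5090", 4, 106.5), ("RTX 4060 Ti 16GB", 4, 45.0),
    ("RTX A6000 48GB", 4, 16.9), ("RTX 4090", 3, 149.6),
    ("RTX 3090", 3, 66.0), ("RTX 3060 12GB", 3, 60.0),
    ("RTX 4070", 2, 55.0), ("H100 80GB", 1, 45.0),
    ("RTX 5060 Ti 16GB", 1, 45.0), ("DGX Spark", 1, None),
]

# Primary key: report count (desc). Assumed secondary key: fastest t/s (desc),
# treating "no throughput recorded" as 0.
ranked = sorted(reports, key=lambda r: (-r[1], -(r[2] or 0.0)))

for rank, (gpu, n, tps) in enumerate(ranked, start=1):
    print(f"{rank:>2}  {gpu:<18} {n} reports  {tps if tps is not None else '-'}")
```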
