llamaperf

Best NVIDIA GPUs for local LLMs

NVIDIA still has the deepest software stack for local inference: CUDA kernel support in every major engine, exllamav2 for high-throughput quantized inference, and vLLM for batched serving. The ranking below comes from real community submissions, not synthetic benchmarks.

Ranked from 26 community reports on llamaperf.

Ranked by community reports

 #  GPU               VRAM   Reports  Fastest t/s
 1  RTX 5090          32GB   4        106.5
 2  RTX 4060 Ti 16GB  16GB   4        45.0
 3  RTX A6000 48GB    48GB   4        16.9
 4  RTX 4090          24GB   3        149.6
 5  RTX 3090          24GB   3        66.0
 6  RTX 3060 12GB     12GB   3        60.0
 7  RTX 4070          12GB   2        55.0
 8  H100 80GB         80GB   1        45.0
 9  RTX 5060 Ti 16GB  16GB   1        45.0
10  DGX Spark         128GB  1        -


What to look for

Used 3090s remain the price/performance king

A used RTX 3090 with 24GB VRAM consistently runs 30B-class models at usable speeds and can handle 70B with two cards in tensor-parallel mode. Despite being two generations old, the bandwidth (936 GB/s) is competitive with new cards costing 3× as much.
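As a concrete illustration of the two-card setup, here is a minimal vLLM sketch that shards a quantized 70B checkpoint across two 24GB cards with tensor parallelism. The repository name and memory settings are illustrative assumptions, not a tested recommendation.

```python
# Sketch: serving a quantized 70B model across two RTX 3090s with vLLM.
# The model repo is illustrative; substitute any AWQ/GPTQ 70B checkpoint
# that fits in 2 x 24 GB after quantization.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-70B-Chat-AWQ",  # example repo, swap for your own
    quantization="awq",
    tensor_parallel_size=2,        # shard the weights across both 3090s
    gpu_memory_utilization=0.90,   # leave headroom for the KV cache
)

outputs = llm.generate(
    ["Explain tensor parallelism in one short paragraph."],
    SamplingParams(max_tokens=128, temperature=0.7),
)
print(outputs[0].outputs[0].text)
```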

RTX 4090 vs RTX 3090

On pure inference (memory-bandwidth-bound), the 4090 is only ~10–15% faster than the 3090 despite massively more compute. The 4090 wins on prompt processing (compute-bound) and any workload involving training/fine-tuning, but for dollar-per-token-per-second, the 3090 is still hard to beat.
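The bandwidth-bound intuition can be made concrete with a back-of-envelope estimate: single-stream decode speed is roughly memory bandwidth divided by the bytes streamed per token, which is about the quantized model size. The efficiency factor and model size below are assumptions for illustration, not measurements.

```python
def decode_tps(bandwidth_gbs: float, model_gb: float, efficiency: float = 0.6) -> float:
    """Rough single-stream decode estimate for a bandwidth-bound GPU."""
    # Each generated token streams (roughly) the whole quantized model through
    # memory once; the efficiency factor is an assumed fudge for real-world overhead.
    return bandwidth_gbs * efficiency / model_gb

MODEL_30B_Q4_GB = 18.0  # ballpark size of a 30B-class model at ~4.5 bits/weight

for name, bw in [("RTX 3090", 936.0), ("RTX 4090", 1008.0)]:
    print(f"{name}: ~{decode_tps(bw, MODEL_30B_Q4_GB):.0f} t/s on a 30B Q4 model")
```

The two cards' bandwidth figures differ by under 10%, which is why the decode gap stays small even though the 4090 has far more raw compute.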

Workstation cards (A6000, A100, H100)

These give you 48–80GB of VRAM in a single card — enough for 70B+ models without partitioning. The premium over consumer cards is steep but justified for production serving or research workloads where multi-card scaling adds latency overhead.
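For the single-card case, a minimal llama-cpp-python sketch shows what "no partitioning" buys you: every layer offloaded to one device, no tensor-parallel launch. The GGUF path is illustrative.

```python
# Sketch: loading a Q4 70B GGUF entirely onto a single 48 GB card.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3-70b-instruct.Q4_K_M.gguf",  # illustrative path
    n_gpu_layers=-1,   # offload all layers; a Q4 70B (~40GB) fits in 48GB
    n_ctx=8192,
)

result = llm("Q: Why run a 70B model on one card?\nA:", max_tokens=64)
print(result["choices"][0]["text"])
```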

Frequently asked

What is the best NVIDIA GPU for local LLMs?

For most users, an RTX 4090 (new) or RTX 3090 (used) — both 24GB — hit the sweet spot for 30B-class models. For 70B work, an RTX A6000 (48GB) or two 3090s in tensor-parallel mode are the standard recommendations.

Is an RTX 4090 worth it over an RTX 3090 for local LLMs?

For pure inference, the 4090 is only modestly faster (~10–15%) because both are limited by memory bandwidth on the same model. The 4090 wins clearly on prompt processing speed and any compute-bound workload (training, fine-tuning, very long contexts).

Can I run local LLMs on an RTX 3060 or 4060?

Yes, on smaller models. A 12GB card runs 7B–13B models comfortably at Q4 quantization. 8GB cards are workable for 7B but tight on context length.
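A rough way to sanity-check whether a quantized model fits is to add up weights, KV cache, and fixed overhead. The constants in this sketch are ballpark assumptions and vary by engine, context length, and quantization format.

```python
def fits(params_b: float, vram_gb: float, bits_per_weight: float = 4.5,
         ctx_tokens: int = 4096, kv_gb_per_1k_tokens: float = 0.5) -> bool:
    """Very rough check: weights + KV cache + fixed overhead vs. available VRAM."""
    weights_gb = params_b * bits_per_weight / 8       # 13B at ~4.5 bpw ≈ 7.3 GB
    kv_gb = ctx_tokens / 1000 * kv_gb_per_1k_tokens   # grows linearly with context
    overhead_gb = 1.5                                 # CUDA context, buffers, activations
    return weights_gb + kv_gb + overhead_gb <= vram_gb

print(fits(13, 12))  # 13B Q4 on a 12GB card -> True
print(fits(13, 8))   # 13B Q4 on an 8GB card -> False
print(fits(7, 8))    # 7B Q4 on an 8GB card  -> True, but tight on context
```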

How we rank

Hardware is sorted by the number of community submissions on llamaperf — a proxy for how widely each card is used in practice for local LLM inference. Within that, we surface the fastest tokens-per-second observed on each as a quality signal. Submissions come primarily from r/LocalLLaMA discussions and direct user uploads. Nothing here is sponsored or affiliate-driven.
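In code, the ranking described above amounts to a two-key sort. The secondary tiebreak on fastest tokens-per-second is an assumption inferred from the table ordering, not a documented rule.

```python
# (gpu, report_count, fastest_tps) taken from the table above
reports = [
    ("RTX 5090", 4, 106.5), ("RTX 4060 Ti 16GB", 4, 45.0),
    ("RTX A6000 48GB", 4, 16.9), ("RTX 4090", 3, 149.6),
    ("RTX 3090", 3, 66.0), ("RTX 3060 12GB", 3, 60.0),
    ("RTX 4070", 2, 55.0), ("H100 80GB", 1, 45.0),
    ("RTX 5060 Ti 16GB", 1, 45.0), ("DGX Spark", 1, None),
]

# Primary key: report count (desc). Assumed secondary key: fastest t/s (desc),
# treating "no throughput recorded" as 0.
ranked = sorted(reports, key=lambda r: (-r[1], -(r[2] or 0.0)))

for rank, (gpu, n, tps) in enumerate(ranked, start=1):
    print(f"{rank:>2}  {gpu:<18} {n} reports  {tps if tps is not None else '-'}")
```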
