
Best GPUs for 30B local LLMs

30B-class models in Q4 fit comfortably in 24GB of VRAM with room for a useful context window, making this the sweet spot for a single consumer GPU. Apple Silicon Macs with 32GB+ of unified memory also handle them well.

Ranked from 61 community reports on llamaperf.


#  | GPU                    | Vendor | VRAM  | Reports | Fastest t/s
1  | RX 7900 XTX            | amd    | 24GB  | 6       | 58.0
2  | RTX 5090               | nvidia | 32GB  | 4       | 106.5
3  | RTX 4060 Ti 16GB       | nvidia | 16GB  | 4       | 45.0
4  | RTX A6000 48GB         | nvidia | 48GB  | 4       | 16.9
5  | AMD Threadripper 256GB | amd    | 256GB | 4       | 8.8
6  | RTX 4090               | nvidia | 24GB  | 3       | 149.6
7  | RTX 3090               | nvidia | 24GB  | 3       | 66.0
8  | Instinct MI300X 192GB  | amd    | 192GB | 2       | 60.0
9  | RX 7900 XT             | amd    | 20GB  | 2       | 38.0
10 | Instinct MI250X 128GB  | amd    | 128GB | 2       | 35.0
11 | RX 7800 XT 16GB        | amd    | 16GB  | 2       | 27.0
12 | M5 Max 128GB           | apple  | 128GB | 2       | 7.5
13 | H100 80GB              | nvidia | 80GB  | 1       | 45.0
14 | RTX 5060 Ti 16GB       | nvidia | 16GB  | 1       | 45.0
15 | M5 Max 64GB            | apple  | 64GB  | 1       | 32.0
16 | M4 Max 64GB            | apple  | 64GB  | 1       | 23.0
17 | M4 16GB                | apple  | 16GB  | 1       | 23.0
18 | M4 Max 36GB            | apple  | 36GB  | 1       | 21.0
19 | M3 16GB                | apple  | 16GB  | 1       | 21.0
20 | M1 Pro 16GB            | apple  | 16GB  | 1       | 20.0
21 | M4 Pro 24GB            | apple  | 24GB  | 1       | 19.0
22 | M3 Max 48GB            | apple  | 48GB  | 1       | 18.0
23 | M2 16GB                | apple  | 16GB  | 1       | 18.0
24 | M3 Max 36GB            | apple  | 36GB  | 1       | 16.0
25 | M2 Max 32GB            | apple  | 32GB  | 1       | 16.0
26 | M2 Ultra 64GB          | apple  | 64GB  | 1       | 14.0
27 | M3 Pro 18GB            | apple  | 18GB  | 1       | 14.0
28 | M1 16GB                | apple  | 16GB  | 1       | 14.0
29 | M1 Ultra 64GB          | apple  | 64GB  | 1       | 12.0
30 | M2 Pro 16GB            | apple  | 16GB  | 1       | 12.0
31 | M1 Max 32GB            | apple  | 32GB  | 1       | 10.0
32 | AMD MI50 32GB          | amd    | 32GB  | 1       | 9.7
33 | M3 Max 128GB           | apple  | 128GB | 1       | 5.5
34 | M2 Ultra 192GB         | apple  | 192GB | 1       | n/a
35 | DGX Spark              | nvidia | 128GB | 1       | n/a

Models that fit

No reports yet

These models match the profile, but nobody has submitted a report for them yet.

What to look for

24GB cards are the sweet spot

RTX 3090s and 4090s (both 24GB) hold a 30B-class model in Q4 with plenty of headroom for an 8–16K context. This is arguably the best price/capability point in local LLM inference today — you get most of the quality of a 70B model at a fraction of the hardware cost.
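As a rough sanity check on that budget, here is a minimal back-of-the-envelope sketch. The architecture numbers (a 32B dense model with 64 layers and grouped-query attention: 8 KV heads of dimension 128) and the ~4.85 effective bits per weight for Q4_K_M are illustrative assumptions; real models vary:

```python
GIB = 1024**3

def weights_gib(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight memory for a quantized model."""
    return params_billions * 1e9 * bits_per_weight / 8 / GIB

def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context: int, bytes_per_elem: int = 2) -> float:
    """KV cache at fp16: one K and one V tensor per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * context * bytes_per_elem / GIB

# Illustrative 32B-class model; Q4_K_M is roughly 4.85 bits/weight effective.
w = weights_gib(32, 4.85)                 # ~18.1 GiB of weights
kv_8k = kv_cache_gib(64, 8, 128, 8192)    # ~2.0 GiB
kv_16k = kv_cache_gib(64, 8, 128, 16384)  # ~4.0 GiB
print(f"8K context:  ~{w + kv_8k:.1f} GiB")   # ~20.1 GiB
print(f"16K context: ~{w + kv_16k:.1f} GiB")  # ~22.1 GiB, still inside 24GB
```

Compute buffers and the desktop take another GiB or two on top of this, which is part of why 24GB feels roomy here while 16GB does not.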

16GB cards work with tighter quants

An RTX 4060 Ti 16GB or RTX 4070 Ti Super 16GB can run 30B models at Q3/Q4 with shorter contexts, though you'll feel the squeeze with longer prompts. Q3 quants noticeably hurt quality on most models — Q4 is the practical floor.
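To make the squeeze concrete, a quick sketch of the weight footprint alone at common GGUF quants; the effective bits-per-weight figures are ballpark assumptions, and actual file sizes vary by architecture:

```python
GIB = 1024**3

# Approximate effective bits per weight for common GGUF quants
# (illustrative figures, not exact file sizes).
for name, bpw in [("Q3_K_M", 3.9), ("Q4_K_M", 4.85)]:
    gib = 30e9 * bpw / 8 / GIB
    print(f"{name}: ~{gib:.1f} GiB of weights")
# Q3_K_M: ~13.6 GiB, Q4_K_M: ~16.9 GiB -- before any KV cache,
# so a 16GB card has little or no room left at Q4.
```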

Frequently asked

What's the best GPU for a 30B local LLM?

RTX 3090 (used) or RTX 4090 (new), both with 24GB, are the standard recommendations. They hold a 30B model in Q4 with headroom for a useful context window and run at 25–50 tokens per second on most engines.

Can a 16GB GPU run 30B models?

Yes, with caveats. Q3/Q4 quants of 30B-class models fit in ~14–17GB depending on the architecture. You'll have less context room and may need to lower precision further than ideal. A 24GB card is meaningfully better.

How we rank

Hardware is sorted by the number of community submissions on llamaperf, a proxy for how widely each card is used in practice for local LLM inference. Within that, we surface the fastest tokens-per-second figure observed on each card as a performance signal. Submissions come primarily from r/LocalLLaMA discussions and direct user uploads. Nothing here is sponsored or affiliate-driven.
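In code, that ordering is just a group-by with two aggregates. A minimal sketch, with the record shape and sample numbers made up for illustration rather than taken from llamaperf's actual schema:

```python
from collections import defaultdict

# Hypothetical sample reports; the real data comes from user submissions.
reports = [
    {"gpu": "RX 7900 XTX", "tps": 58.0},
    {"gpu": "RTX 4090", "tps": 149.6},
    {"gpu": "RX 7900 XTX", "tps": 41.0},
]

by_gpu: defaultdict[str, list[float]] = defaultdict(list)
for r in reports:
    by_gpu[r["gpu"]].append(r["tps"])

# Primary key: number of reports (descending). The fastest observed
# tokens-per-second per card is surfaced alongside, not sorted on.
rows = sorted(
    ((gpu, len(tps), max(tps)) for gpu, tps in by_gpu.items()),
    key=lambda row: row[1],
    reverse=True,
)
for rank, (gpu, n_reports, best_tps) in enumerate(rows, start=1):
    print(rank, gpu, n_reports, best_tps)
```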
