Best GPUs for 70B local LLMs

70B models in Q4_K_M quantization need roughly 40GB of VRAM plus headroom for context. That puts the realistic floor at a 48GB card (or two 24GB cards), or an Apple Silicon Mac with 64GB+ unified memory. The list below filters to hardware that has actually been reported running models at this scale.

Rankings below are based on 32 community reports submitted to llamaperf.

Ranked by community reports

| # | GPU | Vendor | VRAM | Reports | Fastest t/s |
|---|-----|--------|------|---------|-------------|
| 1 | RTX 5090 | NVIDIA | 32GB | 4 | 106.5 |
| 2 | RTX A6000 48GB | NVIDIA | 48GB | 4 | 16.9 |
| 3 | AMD Threadripper 256GB | AMD | 256GB | 4 | 8.8 |
| 4 | Instinct MI300X 192GB | AMD | 192GB | 2 | 60.0 |
| 5 | Instinct MI250X 128GB | AMD | 128GB | 2 | 35.0 |
| 6 | M5 Max 128GB | Apple | 128GB | 2 | 7.5 |
| 7 | H100 80GB | NVIDIA | 80GB | 1 | 45.0 |
| 8 | M5 Max 64GB | Apple | 64GB | 1 | 32.0 |
| 9 | M4 Max 64GB | Apple | 64GB | 1 | 23.0 |
| 10 | M4 Max 36GB | Apple | 36GB | 1 | 21.0 |
| 11 | M3 Max 48GB | Apple | 48GB | 1 | 18.0 |
| 12 | M3 Max 36GB | Apple | 36GB | 1 | 16.0 |
| 13 | M2 Max 32GB | Apple | 32GB | 1 | 16.0 |
| 14 | M2 Ultra 64GB | Apple | 64GB | 1 | 14.0 |
| 15 | M1 Ultra 64GB | Apple | 64GB | 1 | 12.0 |
| 16 | M1 Max 32GB | Apple | 32GB | 1 | 10.0 |
| 17 | AMD MI50 32GB | AMD | 32GB | 1 | 9.7 |
| 18 | M3 Max 128GB | Apple | 128GB | 1 | 5.5 |
| 19 | M2 Ultra 192GB | Apple | 192GB | 1 | n/a |
| 20 | DGX Spark | NVIDIA | 128GB | 1 | n/a |


What to look for

VRAM math for 70B models

At Q4_K_M, a 70B model is roughly 40–43GB of weights. At Q5/Q6 you're at 48–55GB. KV cache scales with context length (a 4K context on a 70B model adds another ~3GB at full precision; less with KV quantization). Plan for 48GB minimum for comfortable Q4, 64GB+ for higher quants or longer contexts.
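The arithmetic above fits in a few lines. The sketch below assumes approximate bits-per-weight figures (~4.8 for Q4_K_M, ~5.7 for Q5_K_M) and a Llama-style 70B attention layout (80 layers, 8 KV heads, head dimension 128); these are typical values, not llamaperf measurements, and real GGUF files carry some extra overhead for metadata and non-quantized tensors.

```python
def weight_gb(params_b: float, bits_per_weight: float) -> float:
    """Weight footprint in GB: parameter count x average bits per weight."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9


def kv_cache_gb(ctx_tokens: int, layers: int = 80, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 4) -> float:
    """KV cache in GB: 2 (K and V) x layers x tokens x kv_heads x head_dim."""
    return 2 * layers * ctx_tokens * kv_heads * head_dim * bytes_per_elem / 1e9


print(f"Q4_K_M weights (~4.8 bpw): {weight_gb(70, 4.8):.0f} GB")   # ~42 GB
print(f"Q5_K_M weights (~5.7 bpw): {weight_gb(70, 5.7):.0f} GB")   # ~50 GB
print(f"KV cache, 4K ctx, fp32:    {kv_cache_gb(4096):.1f} GB")    # ~2.7 GB
print(f"KV cache, 4K ctx, 8-bit:   {kv_cache_gb(4096, bytes_per_elem=1):.1f} GB")
```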

Single big card vs multi-card

An RTX A6000 (48GB) or H100 (80GB) holds the whole model in one device — no inter-GPU communication overhead. Two RTX 3090s with NVLink approach this but pay a small latency penalty for tensor-parallel split. For interactive use both work; for serving at scale, single-card is simpler.
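For the multi-card route, a two-GPU split can be expressed directly in the llama-cpp-python bindings. This is a minimal sketch assuming a CUDA build and two roughly equal cards; the model path and the 50/50 split ratio are placeholders, not values reported on llamaperf.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-70b.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,                       # offload all layers to GPU
    tensor_split=[0.5, 0.5],               # share of the model on each of two cards
    n_ctx=4096,
)

out = llm("Summarize why 70B models need ~48GB of VRAM.", max_tokens=64)
print(out["choices"][0]["text"])
```

The llama.cpp CLI follows the same idea: fully offload the layers and give the runtime a split ratio across the visible devices.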

Apple Silicon as a 70B host

M2 Ultra and M3 Ultra Macs with 128GB+ run 70B models well, often at 5–12 tokens-per-second depending on quant. The advantage is total cost — a Mac Studio Ultra is competitive in price with a single H100 and runs the same model class without datacenter cooling.

Frequently asked

What's the minimum VRAM for a 70B local LLM?

Around 40GB at Q4 quantization, but practical use needs 48GB or more once you account for context and KV cache. Below that, you'll need to offload layers to system memory, which sharply degrades tokens-per-second.

Can I run a 70B model on a single RTX 4090?

Not well. A 24GB card requires aggressive quantization (Q2/Q3) or offloading large portions of the model to CPU memory, both of which substantially degrade quality and speed. The standard solution is two 3090s/4090s or a single A6000/H100.

Is a Mac Studio Ultra good for 70B models?

Yes. M-Ultra Macs with 128GB+ unified memory run 70B models comfortably with no special setup. Throughput is lower than on a discrete A100/H100, but so is the total cost.

How we rank

Hardware is sorted by the number of community submissions on llamaperf, a proxy for how widely each card is used in practice for local LLM inference. Within each submission count, entries are ordered by the fastest tokens-per-second observed, which we surface as a quality signal. Submissions come primarily from r/LocalLLaMA discussions and direct user uploads. Nothing here is sponsored or affiliate-driven.
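In code, that ordering amounts to a two-key sort. A minimal sketch, with field names and sample rows that are illustrative rather than the actual llamaperf schema:

```python
# Most community reports first; ties broken by fastest observed tokens/s.
reports = [
    {"gpu": "RTX A6000 48GB", "reports": 4, "fastest_tps": 16.9},
    {"gpu": "H100 80GB",      "reports": 1, "fastest_tps": 45.0},
    {"gpu": "RTX 5090",       "reports": 4, "fastest_tps": 106.5},
]

ranked = sorted(reports, key=lambda r: (-r["reports"], -r["fastest_tps"]))
for rank, row in enumerate(ranked, start=1):
    print(rank, row["gpu"], row["reports"], row["fastest_tps"])
```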

See also