llamaperf

Best local LLMs by hardware tier

Rankings only make sense once you fix the hardware. Pick a tier below — the leaderboard re-sorts to the models the community actually runs there, weighted by report count, fastest observed tokens-per-second, and recency.


NVIDIA 24 GB consumer

The community sweet spot — 30B-class in Q4 with headroom for context. Ranked from 6 reports.

| # | Model family | Variant / quant / GPU | Reports | Fastest t/s |
|---|---|---|---|---|
| 1 | Gemma 4 (Google DeepMind) | 26B A4B Instruct (MoE) · 25B-A3.8B · Q4_K_M · on RTX 4090 | 2 | 149.6 |
| 2 | Qwen3.6 (Alibaba) | on RTX 3090 | 3 | 66.0 |
| 3 | Qwen2.5 (Alibaba) | Q4_K_M · on RTX 3090 | 1 | 28.0 |

How we rank

A single global "best models" list doesn't really exist — what runs well on a 5090 is often unrunnable on a 4060, and a 7B that screams on an M3 Max is usually a poor pick on an H100. So we fix the hardware first, then rank the families that actually have community reports on it. The score blends popularity (log-scaled report count), fastest observed tokens-per-second normalized within the bucket, recency (90-day half-life), and a small bias for rows where we know the variant + quant + GPU cleanly. Click into a family for the full breakdown of records.
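The blend described above can be sketched as a single scoring function. This is a minimal illustration, not the site's actual formula: the weights (`w_pop`, `w_speed`, `w_recency`, `w_meta`) and the exact normalization are assumptions; only the ingredients — log-scaled report count, bucket-normalized fastest t/s, a 90-day half-life, and a small metadata-completeness bias — come from the text.

```python
import math

def rank_score(report_count, fastest_tps, bucket_max_tps, age_days,
               has_full_metadata,
               w_pop=0.4, w_speed=0.35, w_recency=0.2, w_meta=0.05):
    """Blend the four signals into one score (hypothetical weights)."""
    popularity = math.log1p(report_count)        # log-scaled report count
    speed = fastest_tps / bucket_max_tps         # normalized within the hardware bucket
    recency = 0.5 ** (age_days / 90)             # 90-day half-life on the newest report
    meta = 1.0 if has_full_metadata else 0.0     # variant + quant + GPU all known
    return (w_pop * popularity + w_speed * speed
            + w_recency * recency + w_meta * meta)

# Families in a bucket are then sorted by this score, descending.
```

With weights like these, a family with few but very recent, very fast reports can outrank one with more (but stale or slower) reports, which matches the intent of fixing the hardware first and letting fresh evidence dominate.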