llamaperf

How open-weight LLMs run on your hardware.
Crowdsourced from the community.

📊 Community Benchmarks Snapshot

82 records · 22 GPUs · 8 model families · 5 engines

🖥️ Records by GPU

RTX 5090 11 RTX 3090 10 RTX 3060 12GB 5 M5 Max 128GB 5 RTX 4090 4 RX 7900 XTX 4 M5 Max 64GB 3 H100 80GB 3 AMD Strix Halo 128GB 3 RTX Pro 6000 Blackwell 3 M2 Max 96GB 2 RTX 5060 Ti 16GB 2 AMD MI50 32GB 2 M3 Max 128GB 2 RTX A6000 48GB 1 RTX 4070 1 RX 9070 1 RTX 5080 1 M1 8GB 1 AMD Threadripper 256GB 1 DGX Spark 1 RTX 4070 Ti Super 1

🧠 Records by Model

76 total
Qwen3.651
Gemma 418
Qwen2.52
Kimi K2.61
Qwen3.51
GLM-5.11
Llama 3.11
Llama 3.31

⚙️ Engine

56 total
llama.cpp28
vLLM13
Ollama9
MLX4
LM Studio2

🏷️ Use Cases

coding 34agentic 11text-generation 9tool-use 7summarization 6long-context 5vision 3creative-writing 3math 3multilingual 1
coding 34 agentic 11 text-generation 9 tool-use 7 summarization 6 long-context 5 vision 3 creative-writing 3 math 3 multilingual 1

⚡ Avg Gen TPS by GPU

Pro 6000 3500 5090 648 4070 Ti Super 110 4090 98 H100 80GB 85 M5 Max 64GB 64 3090 60 5080 56 4070 55 3060 12GB 53 RX 7900 XTX 49 RX 9070 47 Strix Halo 128GB 21 5060 Ti 16GB 21 M2 Max 96GB 18 M1 8GB 18 A6000 48GB 17 MI50 32GB 10 Threadripper 256GB 8 M5 Max 128GB 7 M3 Max 128GB 6

⚡ Avg Gen TPS by Model

Qwen3.6 212 Gemma 4 102 Qwen2.5 28 Kimi K2.6 10 Llama 3.1 1

🔢 Quants

Q4_K_M 15 IQ4_XS 6 NVFP4 4 Q8 4 Q4 3 Q5_K_S 2 Q6_K 2 Q8_0 2 Q2-XS 1 IQ4_KS 1 Q4_K_S 1 F16 1 4bit 1 UD-Q5_K_XL 1 Q4_K_XL 1 FP16 1 Q5_K_M 1 FP8 1 INT4 1 Q6 1 Q4_0 1 bf16 1 4-bit 1 IQ4_XS-4.19bpw 1 AWQ-4bit 1

Qwen3.6 27B

RTX 5060 Ti 16GB · llama.cpp · 131,072 ctx

Tone: positive
throughput:
19.0 t/s gen
quant:
IQ4_XS (gguf)
kv:
F16

User reports that offloading KV cache to RAM (with -nkvo) allows fitting the whole model on GPU with f16 KV cache, achieving 19 tps peak and 14 tps during long generation at 65k context. With 128k context and 63 layers on GPU, speed remained similar. KV cache quant to RAM didn't improve performance.

Gemma 4

RX 7900 XTX · llama-swap

Tone: positive

Benchmark of Gemma 4 QAT vs regular quants on AMD 7900 XTX. No token/s reported, but wall clock times show significant speedups (e.g., 12B QAT 45% faster, 83% throughput increase). Quality reported identical. Models tested: 12B, 26B, 31B, E4B.

Benchmark of abliteration tools (Apostate, Huihui, Heretic) on Qwen 2.5 7B. Evaluated with lm-evaluation-harness via vLLM 0.19.0, bf16 on RTX 5090 32GB. Reports MMLU, GSM8K, HellaSwag, ARC Challenge, WinoGrande, TruthfulQA MC2, PiQA, LAMBADA ppl, HarmBench ASR, KL divergence. No tokens/sec reported.

Tone: positive
throughput:
138.0 t/s gen
coding

Benchmark of Gemma 4 26B-A4B vs 12B on RTX 4090. 26B-A4B used 15GB VRAM, 138 tok/s; 12B used 9GB, 80 tok/s. 26B-A4B won every scene and ran ~1.7x faster. 12B ideal for 16GB laptop.

Qwen3.6 27B

RTX 3090 · Ollama · 32,000 ctx

Tone: mixed
quant:
Q6_K (gguf)
codingagentic

User replaced Claude with Qwen3.6-27B in multi-agent orchestrator for 2 weeks. Plan generation good, tool-call reliability poor (12% format error), long-context drift past ~14k tokens, cascade-failure handling weak. Viable as reasoning layer but not execution layer.

quant:
4bit
agenticcoding

User currently runs Qwen3.6-35B-A3B-4bit on M3 Max 128GB for production sub-agent delegations. Also mentions GLM-5.1 for orchestration. Considering building a 5090 rig.

Qwen3.6 27B

RTX 3090 · 128,000 ctx

throughput:
104.0 t/s gen · 1399.0 t/s pp
quant:
Q8
kv:
F16

User recommends getting enough GPUs to avoid VRAM hacks. Uses 2x RTX 3090s.

Tone: positive
throughput:
70.0 t/s gen
creative-writing

User mentions using Qwen3.6-27B on dual RTX 3090s for generating interactive HTML content inline with chat. Reports ~70 t/s.

Qwen3.6 27B

RTX 3060 12GB · llama.cpp · 64,000 ctx

Tone: positive
throughput:
43.3 t/s gen · 456.1 t/s pp
quant:
Q4_K_S (gguf)

Dual RTX 3060 setup with tensor parallel. MTP enabled. Context 64k. Prefill 456 t/s, generation 43.26 t/s at 12k context. Without MTP, context 96k, generation 31 t/s. User praises value and stability of CUDA.

throughput:
243.9 t/s gen · 13809.2 t/s pp
quant:
Q4_K_M (gguf)
kv:
Q8

Benchmark of FWHT CUDA implementation for kv-cache quantization. Results show 1-2% pp boost and 7-9% tg boost on Gemma 4 26B.A4B Q4_K_M with -ctk q8_0 -ctv q8_0. pp2048 and tg128 values reported; highest t/s from cuda-fwt column.

throughput:
3500.0 t/s gen · 30000.0 t/s pp

Two benchmarks: Qwen3.6 27B BF16 and Qwen3.6 35B BF16. For 35B, best gen tps 3500 at 128 concurrency with MTP off, prompt tps 30000. Also tested 27B with MTP on/off.

Tone: positive
visionsummarization

Model based on Qwen3.5-4B. Trained on 8xH100 for 3 days. Supports Safetensors, GGUF, MLX weights. Requires as little as 4GB VRAM. Multiple quantizations available (GPTQ, W8A8, FP8, Q4, Q6). Tested with vLLM, SGLang, llama.cpp.

Qwen3.6 27B

RX 9070 · llama.cpp · 131,072 ctx

Tone: positive
throughput:
46.9 t/s gen · 398.4 t/s pp
quant:
UD-Q5_K_XL (gguf)
flash attention:
on
mtp (multi-token prediction):
on
codingagentic

User runs two RX 9070 XTs with ROCm, uses MTP (spec-type = draft-mtp, spec-draft-n-max = 2). Prompt t/s varies; generation t/s around 45-52. Draft acceptance rate ~0.8-0.99. User praises speed, smarts, steerability for agentic coding tasks. Quant is UD-Q5_K_XL (unsloth GGUF).

Qwen3.6 35B (3B active)

RTX 4070 Ti Super · ik_llama.cpp · 131,072 ctx

Tone: positive
throughput:
110.2 t/s gen
quant:
IQ4_XS-4.19bpw (gguf)
kv:
Q8
mtp (multi-token prediction):
on
codingsummarizationmath

Benchmark comparing llama.cpp (89.76 t/s) vs ik_llama.cpp (110.24 t/s) with MTP on Qwen3.6-35B-A3B IQ4_XS quant. 23% speed increase. CPU: Ryzen 7 9700X, OS: CachyOS. GPU used as secondary with iGPU for display.

Qwen3.6 35B (3B active)

RTX 5080 · llama.cpp · 131,072 ctx

throughput:
56.0 t/s gen · 1584.0 t/s pp
quant:
Q4_K_XL (gguf)
kv:
Q8
flash attention:
on
mtp (multi-token prediction):
off
codingagentic

Best config for 35B Q4_K_XL at 128k context: no MTP, --fit-target 1536. MTP doesn't help at 128k. 27B IQ3 fits fully on GPU and benefits from MTP (73 tok/s).

Qwen3.6 27B

M2 Max 96GB · llama.cpp · 256,000 ctx

Tone: positive
throughput:
8.0 t/s gen
quant:
F16 (gguf)
mtp (multi-token prediction):
on
codingagentic

User benchmarked Qwen 3.6 27b F16 on M2 Max 96GB using llama.cpp with MTP speculative decoding. Generation speed varied 8-18 tok/s depending on task; without MTP got 6.6 tok/s. Used for agentic coding to create a Pacman game. Also tested Q8 quant but results were worse. Context up to 150k+ tokens usable. Chat template fixes were critical.

Qwen3.6 27B

RTX 3090 · BeeLlama · 128,000 ctx

quant:
Q5_K_S (gguf)
kv:
Q8
long-context

Benchmark of KV cache quantization methods using Qwen3.6 27B at 64k and 128k context. Also tested IQ4_XS quant. Article at https://anbeeld.com/articles/kv-cache-quantization-benchmarks-for-long-context

Qwen3.6 27B

RTX 3090 · ik_llama.cpp · 156,000 ctx

Tone: positive
throughput:
72.9 t/s gen · 1261.0 t/s pp
quant:
IQ4_KS (gguf)
kv:
Q8
flash attention:
on
mtp (multi-token prediction):
on
coding

Best setup on RTX 3090 24GB: ik_llama.cpp + Qwen3.6-27B-MTP-IQ4_KS.gguf, 156k context, q8_0/q8_0 KV, MTP, vision on CPU. Prefill 1261 tok/s, decode 72.9 tok/s. Also tested llama.cpp and BeeLlama.

Qwen3.6 27B

AMD MI50 32GB · llama.cpp

quant:
Q8_0 (gguf)
kv:
Q8
mtp (multi-token prediction):
on

Benchmark comparing MTP KV cache quantization (Q8_0) vs no quantization on Qwen3.6-27B-Q8_0 with llama.cpp. Results show negligible wall time difference (~0.14s) with quantized draft KV cache. Also tested with tensor parallelism on 2xMI50 32GB. Aggregate accept rate ~0.735-0.741.

throughput:
21.2 t/s gen
quant:
Q4_K_M (gguf)
mtp (multi-token prediction):
on

MTP enabled with --spec-type draft-mtp --spec-draft-n-max 3. Baseline without MTP: 11.7 tok/s. Also tested Q8_0: 7.4 → 18.1 tok/s (2.44×).

Showing 120 of 82
Page 1 of 5