How open-weight LLMs run on your hardware

75 performance reports, crowdsourced from the community.

Pick your GPU - see which models fit, at which quant, and how fast they run.

Gemma 4

M5 32GB · MLX · 130,173 ctx

throughput:: 3029.0 t/s pp

User built custom W8A8 kernels for an M5 MacBook Air. Prefill improved from a 2193 t/s baseline to 3029 t/s on a 130k-token prompt. Also mentions llama.cpp for Macs.
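
The gain in this report can be expressed as a speedup ratio. The following is a minimal Python sketch that uses only the two prefill figures quoted above; the function name `speedup` is illustrative and is not part of MLX or any tool mentioned in the report.

```python
def speedup(baseline_tps: float, improved_tps: float) -> float:
    """Return how many times faster the improved run is than the baseline."""
    if baseline_tps <= 0:
        raise ValueError("baseline throughput must be positive")
    return improved_tps / baseline_tps


# Prefill figures quoted in the report (tokens per second).
ratio = speedup(2193, 3029)
print(f"{ratio:.2f}x prefill speedup ({(ratio - 1) * 100:.0f}% more tokens per second)")
# prints "1.38x prefill speedup (38% more tokens per second)"
```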

Qwen3.6 27B

4× RTX 5060 Ti 16GB · 256,000 ctx

throughput:: 52.2 t/s gen · 608.0 t/s pp
quant:: Q8
kv:: F16
mtp (multi-token prediction):: on

coding

Benchmark on Vast AI instance with 4x RTX 5060 Ti 16GB. Q8 quant, FP16 KV cache, MTP enabled. 256K context. Cold prefill 608 t/s, decode 52.2 t/s. User considers this excellent for $2K hardware.
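
To put the prefill figure in perspective, it can be converted into a wait time for a full context window. This is a rough sketch that assumes the 608 t/s rate holds across the whole prompt; the report does not state the prompt length at which it was measured, so treat the result as an optimistic estimate. The helper name `prefill_seconds` is hypothetical.

```python
def prefill_seconds(prompt_tokens: int, prefill_tps: float) -> float:
    """Estimate prompt-processing time, assuming a constant prefill rate."""
    if prefill_tps <= 0:
        raise ValueError("prefill throughput must be positive")
    return prompt_tokens / prefill_tps


# 256,000-token context at the reported cold prefill rate.
seconds = prefill_seconds(256_000, 608.0)
print(f"{seconds:.0f} s (~{seconds / 60:.1f} min) to prefill a full context")
# prints "421 s (~7.0 min) to prefill a full context"
```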

Kimi K2.6

RTX 5090 · llama.cpp

throughput:: 471.4 t/s pp
quant:: IQ3_M (gguf)

Prompt-processing benchmark with Kimi K2.5 IQ3_M (80GB offload) at 500W: the RTX 5090 reached 471.40 t/s PP. GLM 5.1 IQ4_NL (70GB offload) was also tested, at 574.98 t/s PP. Results are compared against a shunt-modded RTX 6000 PRO MaxQ.

Qwen3.6 35B (3B active)

AMD Strix Halo 128GB

throughput:: 50.0 t/s gen
quant:: Q8_XL

User emphasizes the energy efficiency and versatility of Strix Halo compared to Nvidia cards. Mentions running Qwen3.6 35B at 50 t/s with the Q8_XL quant.

Qwen3.6 30B (3B active)

RTX 5060 Ti 16GB

throughput:: 52.0 t/s gen
quant:: float8

Custom CUDA/C++ engine, 50-54 tok/s, 50% improvement over llama.cpp (33-34 tok/s).

Qwen2.5 27B

2× RTX 3090 · llama.cpp

throughput:: 70.0 t/s gen · 1850.0 t/s pp
quant:: Q6_K_XL (gguf)

coding

Multi-token prediction enabled. The report cites 96GB, which most likely refers to system memory, since two RTX 3090s provide 48GB of VRAM (24GB each). Reliable for code generation and codebase ingestion.

Qwen3.6 27B

2× RTX 3090 Ti · llama.cpp · 196,608 ctx

throughput:: 100.0 t/s gen
quant:: Q8_0 (gguf)

Tensor split mode raised generation speed from 70+ to 100+ t/s. Peak 130 t/s reported. Power draw 750W+.

Qwen3.6 27B

M2 Max 96GB · llama.cpp

quant:: Q4_K_M (gguf)

vision

Champion model in vision benchmark. Best quality and stability with thinking disabled. 90/90 successful runs. Speed: 70 s/img. Tested on Apple M2 Max 96GB with llama.cpp b9690.

Qwen3.6 27B

RTX 5060 Ti 16GB · llama.cpp · 131,072 ctx

throughput:: 19.0 t/s gen
quant:: IQ4_XS (gguf)
kv:: F16

User reports that keeping the KV cache in system RAM (with -nkvo) lets the whole model fit on the GPU with an F16 KV cache, reaching 19 t/s peak and 14 t/s during long generation at 65k context. With 128k context and 63 layers on GPU, speed remained similar. Quantizing the RAM-resident KV cache did not improve performance.

Gemma 4

RX 7900 XTX · llama-swap

Benchmark of Gemma 4 QAT vs regular quants on AMD 7900 XTX. No token/s reported, but wall clock times show significant speedups (e.g., 12B QAT 45% faster, 83% throughput increase). Quality reported identical. Models tested: 12B, 26B, 31B, E4B.
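
The two figures quoted for the 12B model line up if "45% faster" is read as 45% less wall-clock time for the same work. That reading is an assumption, since the report does not define its terms. A small Python sketch of the conversion, with hypothetical function names:

```python
def throughput_gain_from_time_reduction(time_reduction: float) -> float:
    """Convert a fractional wall-clock reduction into a fractional throughput gain."""
    if not 0 <= time_reduction < 1:
        raise ValueError("time_reduction must be in [0, 1)")
    return 1 / (1 - time_reduction) - 1


def time_reduction_from_throughput_gain(gain: float) -> float:
    """Convert a fractional throughput gain into a fractional wall-clock reduction."""
    if gain <= -1:
        raise ValueError("gain must be greater than -1")
    return 1 - 1 / (1 + gain)


print(f"{throughput_gain_from_time_reduction(0.45):.1%}")  # prints "81.8%"
print(f"{time_reduction_from_throughput_gain(0.83):.1%}")  # prints "45.4%"
```

Under this reading, an 83% throughput increase corresponds to about 45.4% less wall-clock time, which rounds to the 45% quoted.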

Qwen2.5 7B

RTX 5090 · vLLM

Benchmark of abliteration tools (Apostate, Huihui, Heretic) on Qwen 2.5 7B. Evaluated with lm-evaluation-harness via vLLM 0.19.0, bf16 on RTX 5090 32GB. Reports MMLU, GSM8K, HellaSwag, ARC Challenge, WinoGrande, TruthfulQA MC2, PiQA, LAMBADA ppl, HarmBench ASR, KL divergence. No tokens/sec reported.

Gemma 4 26B (4B active)

RTX 4090

throughput:: 138.0 t/s gen

coding

Benchmark of Gemma 4 26B-A4B vs 12B on RTX 4090. 26B-A4B used 15GB VRAM, 138 tok/s; 12B used 9GB, 80 tok/s. 26B-A4B won every scene and ran ~1.7x faster. 12B ideal for 16GB laptop.

Qwen3.6 27B

RTX 3090 · Ollama · 32,000 ctx

quant:: Q6_K (gguf)

coding · agentic

User replaced Claude with Qwen3.6-27B in multi-agent orchestrator for 2 weeks. Plan generation good, tool-call reliability poor (12% format error), long-context drift past ~14k tokens, cascade-failure handling weak. Viable as reasoning layer but not execution layer.

Qwen3.6 35B (3B active)

M3 Max 128GB

quant:: 4bit

agentic · coding

User currently runs Qwen3.6-35B-A3B-4bit on M3 Max 128GB for production sub-agent delegations. Also mentions GLM-5.1 for orchestration. Considering building a 5090 rig.

Qwen3.6 27B

2× RTX 3090 · 128,000 ctx

throughput:: 104.0 t/s gen · 1399.0 t/s pp
quant:: Q8
kv:: F16

User recommends getting enough GPUs to avoid VRAM hacks. Uses 2x RTX 3090s.

Qwen3.6 27B

2× RTX 3090

throughput:: 70.0 t/s gen

creative-writing

User mentions using Qwen3.6-27B on dual RTX 3090s for generating interactive HTML content inline with chat. Reports ~70 t/s.

Qwen3.6 27B

2× RTX 3060 12GB · llama.cpp · 64,000 ctx

throughput:: 43.3 t/s gen · 456.1 t/s pp
quant:: Q4_K_S (gguf)

Dual RTX 3060 setup with tensor parallel. MTP enabled. Context 64k. Prefill 456 t/s, generation 43.26 t/s at 12k context. Without MTP, context 96k, generation 31 t/s. User praises value and stability of CUDA.

Gemma 4 26B (4B active)

RTX 5090 · llama.cpp

throughput:: 243.9 t/s gen · 13809.2 t/s pp
quant:: Q4_K_M (gguf)
kv:: Q8

Benchmark of an FWHT CUDA implementation for KV-cache quantization. Results show a 1-2% pp boost and a 7-9% tg boost on Gemma 4 26B-A4B Q4_K_M with -ctk q8_0 -ctv q8_0. pp2048 and tg128 values reported; the throughput above is the highest t/s from the cuda-fwt column.
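
For readers unfamiliar with the transform named in this report, the following is a textbook fast Walsh-Hadamard transform in plain Python. It is not the benchmarked CUDA kernel, and how that kernel applies the transform to the KV cache is not described in the report. The sketch only shows the butterfly structure that makes the transform run in O(n log n).

```python
def fwht(values):
    """Return the unnormalized fast Walsh-Hadamard transform of a sequence.

    The input length must be a power of two.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        # Combine pairs of elements that sit h positions apart.
        for start in range(0, n, 2 * h):
            for j in range(start, start + h):
                x, y = out[j], out[j + h]
                out[j], out[j + h] = x + y, x - y
        h *= 2
    return out


x = [1, 0, 1, 0, 0, 1, 1, 0]
print(fwht(x))  # prints [4, 2, 0, -2, 0, 2, 0, 2]

# Applying the transform twice returns the input scaled by its length.
print([v // len(x) for v in fwht(fwht(x))] == x)  # prints True
```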

Qwen3.6 35B

2× RTX Pro 6000 Blackwell · vLLM

throughput:: 3500.0 t/s gen · 30000.0 t/s pp

Two benchmarks: Qwen3.6 27B BF16 and Qwen3.6 35B BF16. For 35B, best gen tps 3500 at 128 concurrency with MTP off, prompt tps 30000. Also tested 27B with MTP on/off.

Qwen3.5 4B NuExtract3

8× H100 80GB

vision · summarization

Model based on Qwen3.5-4B. Trained on 8xH100 for 3 days. Supports Safetensors, GGUF, MLX weights. Requires as little as 4GB VRAM. Multiple quantizations available (GPTQ, W8A8, FP8, Q4, Q6). Tested with vLLM, SGLang, llama.cpp.

Community benchmarks snapshot

90 records · 24 GPUs · 8 model families · 5 engines

Records by GPU

Records by model

Qwen3.6:: 56
Gemma 4:: 19
Qwen2.5:: 3
Kimi K2.6:: 2
GLM-5.1:: 1
Qwen3.5:: 1
other:: 2

Records by engine

llama.cpp:: 32
vLLM:: 13
Ollama:: 9
MLX:: 5
LM Studio:: 2

Use cases

coding 36 · agentic 11 · text-generation 9 · tool-use 7 · summarization 6 · long-context 5 · vision 4 · creative-writing 3 · math 3 · multilingual 1
