- throughput:
- 19.0 t/s gen
- quant:
- IQ4_XS (gguf)
- kv:
- F16
User reports that offloading KV cache to RAM (with -nkvo) allows fitting the whole model on GPU with f16 KV cache, achieving 19 tps peak and 14 tps during long generation at 65k context. With 128k context and 63 layers on GPU, speed remained similar. KV cache quant to RAM didn't improve performance.
Benchmark of Gemma 4 QAT vs regular quants on AMD 7900 XTX. No token/s reported, but wall clock times show significant speedups (e.g., 12B QAT 45% faster, 83% throughput increase). Quality reported identical. Models tested: 12B, 26B, 31B, E4B.
Benchmark of abliteration tools (Apostate, Huihui, Heretic) on Qwen 2.5 7B. Evaluated with lm-evaluation-harness via vLLM 0.19.0, bf16 on RTX 5090 32GB. Reports MMLU, GSM8K, HellaSwag, ARC Challenge, WinoGrande, TruthfulQA MC2, PiQA, LAMBADA ppl, HarmBench ASR, KL divergence. No tokens/sec reported.
- throughput:
- 138.0 t/s gen
coding
Benchmark of Gemma 4 26B-A4B vs 12B on RTX 4090. 26B-A4B used 15GB VRAM, 138 tok/s; 12B used 9GB, 80 tok/s. 26B-A4B won every scene and ran ~1.7x faster. 12B ideal for 16GB laptop.
- quant:
- Q6_K (gguf)
codingagentic
User replaced Claude with Qwen3.6-27B in multi-agent orchestrator for 2 weeks. Plan generation good, tool-call reliability poor (12% format error), long-context drift past ~14k tokens, cascade-failure handling weak. Viable as reasoning layer but not execution layer.
- quant:
- 4bit
agenticcoding
User currently runs Qwen3.6-35B-A3B-4bit on M3 Max 128GB for production sub-agent delegations. Also mentions GLM-5.1 for orchestration. Considering building a 5090 rig.
- throughput:
- 104.0 t/s gen · 1399.0 t/s pp
- quant:
- Q8
- kv:
- F16
User recommends getting enough GPUs to avoid VRAM hacks. Uses 2x RTX 3090s.
- throughput:
- 70.0 t/s gen
creative-writing
User mentions using Qwen3.6-27B on dual RTX 3090s for generating interactive HTML content inline with chat. Reports ~70 t/s.
- throughput:
- 43.3 t/s gen · 456.1 t/s pp
- quant:
- Q4_K_S (gguf)
Dual RTX 3060 setup with tensor parallel. MTP enabled. Context 64k. Prefill 456 t/s, generation 43.26 t/s at 12k context. Without MTP, context 96k, generation 31 t/s. User praises value and stability of CUDA.
- throughput:
- 243.9 t/s gen · 13809.2 t/s pp
- quant:
- Q4_K_M (gguf)
- kv:
- Q8
Benchmark of FWHT CUDA implementation for kv-cache quantization. Results show 1-2% pp boost and 7-9% tg boost on Gemma 4 26B.A4B Q4_K_M with -ctk q8_0 -ctv q8_0. pp2048 and tg128 values reported; highest t/s from cuda-fwt column.
- throughput:
- 3500.0 t/s gen · 30000.0 t/s pp
Two benchmarks: Qwen3.6 27B BF16 and Qwen3.6 35B BF16. For 35B, best gen tps 3500 at 128 concurrency with MTP off, prompt tps 30000. Also tested 27B with MTP on/off.
visionsummarization
Model based on Qwen3.5-4B. Trained on 8xH100 for 3 days. Supports Safetensors, GGUF, MLX weights. Requires as little as 4GB VRAM. Multiple quantizations available (GPTQ, W8A8, FP8, Q4, Q6). Tested with vLLM, SGLang, llama.cpp.
- throughput:
- 46.9 t/s gen · 398.4 t/s pp
- quant:
- UD-Q5_K_XL (gguf)
- flash attention:
- on
- mtp (multi-token prediction):
- on
codingagentic
User runs two RX 9070 XTs with ROCm, uses MTP (spec-type = draft-mtp, spec-draft-n-max = 2). Prompt t/s varies; generation t/s around 45-52. Draft acceptance rate ~0.8-0.99. User praises speed, smarts, steerability for agentic coding tasks. Quant is UD-Q5_K_XL (unsloth GGUF).
- throughput:
- 110.2 t/s gen
- quant:
- IQ4_XS-4.19bpw (gguf)
- kv:
- Q8
- mtp (multi-token prediction):
- on
codingsummarizationmath
Benchmark comparing llama.cpp (89.76 t/s) vs ik_llama.cpp (110.24 t/s) with MTP on Qwen3.6-35B-A3B IQ4_XS quant. 23% speed increase. CPU: Ryzen 7 9700X, OS: CachyOS. GPU used as secondary with iGPU for display.
- throughput:
- 56.0 t/s gen · 1584.0 t/s pp
- quant:
- Q4_K_XL (gguf)
- kv:
- Q8
- flash attention:
- on
- mtp (multi-token prediction):
- off
codingagentic
Best config for 35B Q4_K_XL at 128k context: no MTP, --fit-target 1536. MTP doesn't help at 128k. 27B IQ3 fits fully on GPU and benefits from MTP (73 tok/s).
- throughput:
- 8.0 t/s gen
- quant:
- F16 (gguf)
- mtp (multi-token prediction):
- on
codingagentic
User benchmarked Qwen 3.6 27b F16 on M2 Max 96GB using llama.cpp with MTP speculative decoding. Generation speed varied 8-18 tok/s depending on task; without MTP got 6.6 tok/s. Used for agentic coding to create a Pacman game. Also tested Q8 quant but results were worse. Context up to 150k+ tokens usable. Chat template fixes were critical.
- quant:
- Q5_K_S (gguf)
- kv:
- Q8
long-context
Benchmark of KV cache quantization methods using Qwen3.6 27B at 64k and 128k context. Also tested IQ4_XS quant. Article at https://anbeeld.com/articles/kv-cache-quantization-benchmarks-for-long-context
- throughput:
- 72.9 t/s gen · 1261.0 t/s pp
- quant:
- IQ4_KS (gguf)
- kv:
- Q8
- flash attention:
- on
- mtp (multi-token prediction):
- on
coding
Best setup on RTX 3090 24GB: ik_llama.cpp + Qwen3.6-27B-MTP-IQ4_KS.gguf, 156k context, q8_0/q8_0 KV, MTP, vision on CPU. Prefill 1261 tok/s, decode 72.9 tok/s. Also tested llama.cpp and BeeLlama.
- quant:
- Q8_0 (gguf)
- kv:
- Q8
- mtp (multi-token prediction):
- on
Benchmark comparing MTP KV cache quantization (Q8_0) vs no quantization on Qwen3.6-27B-Q8_0 with llama.cpp. Results show negligible wall time difference (~0.14s) with quantized draft KV cache. Also tested with tensor parallelism on 2xMI50 32GB. Aggregate accept rate ~0.735-0.741.
- throughput:
- 21.2 t/s gen
- quant:
- Q4_K_M (gguf)
- mtp (multi-token prediction):
- on
MTP enabled with --spec-type draft-mtp --spec-draft-n-max 3. Baseline without MTP: 11.7 tok/s. Also tested Q8_0: 7.4 → 18.1 tok/s (2.44×).