- throughput:
- 19.0 t/s gen
- quant:
- IQ4_XS (gguf)
- kv:
- F16
User reports that offloading KV cache to RAM (with -nkvo) allows fitting the whole model on GPU with f16 KV cache, achieving 19 tps peak and 14 tps during long generation at 65k context. With 128k context and 63 layers on GPU, speed remained similar. KV cache quant to RAM didn't improve performance.
- quant:
- Q6_K (gguf)
codingagentic
User replaced Claude with Qwen3.6-27B in multi-agent orchestrator for 2 weeks. Plan generation good, tool-call reliability poor (12% format error), long-context drift past ~14k tokens, cascade-failure handling weak. Viable as reasoning layer but not execution layer.
- quant:
- 4bit
agenticcoding
User currently runs Qwen3.6-35B-A3B-4bit on M3 Max 128GB for production sub-agent delegations. Also mentions GLM-5.1 for orchestration. Considering building a 5090 rig.
- throughput:
- 104.0 t/s gen · 1399.0 t/s pp
- quant:
- Q8
- kv:
- F16
User recommends getting enough GPUs to avoid VRAM hacks. Uses 2x RTX 3090s.
- throughput:
- 70.0 t/s gen
creative-writing
User mentions using Qwen3.6-27B on dual RTX 3090s for generating interactive HTML content inline with chat. Reports ~70 t/s.
- throughput:
- 43.3 t/s gen · 456.1 t/s pp
- quant:
- Q4_K_S (gguf)
Dual RTX 3060 setup with tensor parallel. MTP enabled. Context 64k. Prefill 456 t/s, generation 43.26 t/s at 12k context. Without MTP, context 96k, generation 31 t/s. User praises value and stability of CUDA.
- throughput:
- 3500.0 t/s gen · 30000.0 t/s pp
Two benchmarks: Qwen3.6 27B BF16 and Qwen3.6 35B BF16. For 35B, best gen tps 3500 at 128 concurrency with MTP off, prompt tps 30000. Also tested 27B with MTP on/off.
- throughput:
- 46.9 t/s gen · 398.4 t/s pp
- quant:
- UD-Q5_K_XL (gguf)
- flash attention:
- on
- mtp (multi-token prediction):
- on
codingagentic
User runs two RX 9070 XTs with ROCm, uses MTP (spec-type = draft-mtp, spec-draft-n-max = 2). Prompt t/s varies; generation t/s around 45-52. Draft acceptance rate ~0.8-0.99. User praises speed, smarts, steerability for agentic coding tasks. Quant is UD-Q5_K_XL (unsloth GGUF).
- throughput:
- 110.2 t/s gen
- quant:
- IQ4_XS-4.19bpw (gguf)
- kv:
- Q8
- mtp (multi-token prediction):
- on
codingsummarizationmath
Benchmark comparing llama.cpp (89.76 t/s) vs ik_llama.cpp (110.24 t/s) with MTP on Qwen3.6-35B-A3B IQ4_XS quant. 23% speed increase. CPU: Ryzen 7 9700X, OS: CachyOS. GPU used as secondary with iGPU for display.
- throughput:
- 56.0 t/s gen · 1584.0 t/s pp
- quant:
- Q4_K_XL (gguf)
- kv:
- Q8
- flash attention:
- on
- mtp (multi-token prediction):
- off
codingagentic
Best config for 35B Q4_K_XL at 128k context: no MTP, --fit-target 1536. MTP doesn't help at 128k. 27B IQ3 fits fully on GPU and benefits from MTP (73 tok/s).
- throughput:
- 8.0 t/s gen
- quant:
- F16 (gguf)
- mtp (multi-token prediction):
- on
codingagentic
User benchmarked Qwen 3.6 27b F16 on M2 Max 96GB using llama.cpp with MTP speculative decoding. Generation speed varied 8-18 tok/s depending on task; without MTP got 6.6 tok/s. Used for agentic coding to create a Pacman game. Also tested Q8 quant but results were worse. Context up to 150k+ tokens usable. Chat template fixes were critical.
- quant:
- Q5_K_S (gguf)
- kv:
- Q8
long-context
Benchmark of KV cache quantization methods using Qwen3.6 27B at 64k and 128k context. Also tested IQ4_XS quant. Article at https://anbeeld.com/articles/kv-cache-quantization-benchmarks-for-long-context
- throughput:
- 72.9 t/s gen · 1261.0 t/s pp
- quant:
- IQ4_KS (gguf)
- kv:
- Q8
- flash attention:
- on
- mtp (multi-token prediction):
- on
coding
Best setup on RTX 3090 24GB: ik_llama.cpp + Qwen3.6-27B-MTP-IQ4_KS.gguf, 156k context, q8_0/q8_0 KV, MTP, vision on CPU. Prefill 1261 tok/s, decode 72.9 tok/s. Also tested llama.cpp and BeeLlama.
- quant:
- Q8_0 (gguf)
- kv:
- Q8
- mtp (multi-token prediction):
- on
Benchmark comparing MTP KV cache quantization (Q8_0) vs no quantization on Qwen3.6-27B-Q8_0 with llama.cpp. Results show negligible wall time difference (~0.14s) with quantized draft KV cache. Also tested with tensor parallelism on 2xMI50 32GB. Aggregate accept rate ~0.735-0.741.
- throughput:
- 21.2 t/s gen
- quant:
- Q4_K_M (gguf)
- mtp (multi-token prediction):
- on
MTP enabled with --spec-type draft-mtp --spec-draft-n-max 3. Baseline without MTP: 11.7 tok/s. Also tested Q8_0: 7.4 → 18.1 tok/s (2.44×).
- quant:
- Q8_0 (gguf)
- flash attention:
- on
long-context
Benchmark compares MTP vs non-MTP for 27B and 35B-A3B models. 27B-MTP shows significant speedup in generation and overall wall time for long-context chat; 35B-MTP shows mixed results with faster generation but slower end-to-end due to prefill overhead.
- throughput:
- 21.1 t/s gen
- quant:
- Q4_K_M (gguf)
coding
Strix Halo 128GB, ROCm backend, Q4_K_M quant, chat workload. Also tested RTX 3090 and RTX 5070. Multiple models and quants reported.
- throughput:
- 29.0 t/s gen · 239.0 t/s pp
Benchmark of 4x RTX 3090 with Qwen3.6-27B FP16 using vLLM TP=4. Power limit sweep from 200W to unrestricted. Peak efficiency at 220W. User expresses satisfaction with Qwen 3.6 27B as daily driver.
- throughput:
- 3238.0 t/s gen
Benchmark of Qwen3.6-35B-A3B with LDLM diffusion model on RTX 5090 32GB. Throughput 3,238 tok/s at 10 diffusion steps, seq len 64, batch size 1. Also reports ~6,500 tok/s for 4 steps (extrapolated). Untrained weights. Also mentions Qwen3.6-27B at 745 tok/s (10 steps) and ~1,500 tok/s (4 steps).
- throughput:
- 70.0 t/s gen · 780.0 t/s pp
- quant:
- Q2-XS (gguf)
- kv:
- Q8
- flash attention:
- off
- rating:
- 4/5
codingcreative-writingtool-usesummarizationvisionagenticmultilingual
defnitly want to try an higher qwant. I vé took that one beacause gguf sise 11,9 go, ans barely offload this dense modèle to cpu
summarizationtool-use
User is a vet building a dictation/SOAP scribe. Reports inconsistent output from local models (Gemma 4, Qwen 3.6 35B A3B) compared to frontier models. System prompt is a 25-30k token markdown file. Hardware: Core Ultra 9, 128GB RAM, RTX 5090, Proxmox, AnythingLLM + Ollama (llama.cpp).
- throughput:
- 80.0 t/s gen
- quant:
- Q4_K_M (gguf)
- kv:
- Q4
MTP draft acceptance ~73%, TBQ4_0 KV cache, MTP draft 3. Fork: https://github.com/Indras-Mirror/llama.cpp-mtp
- throughput:
- 46.8 t/s gen · 914.0 t/s pp
- quant:
- IQ4_XS (gguf)
- kv:
- Q8
- flash attention:
- on
coding
Best plain llama-bench: pp512 ~914 t/s, tg128 ~46.8 t/s. Practical coding profile: 32k context, ~43.4 t/s generation. MTP gave ~47.7 t/s (2% improvement).
- throughput:
- 28.0 t/s gen
- quant:
- Q5_K_M (gguf)
- kv:
- Q4
codingagentic
MTP speculative decoding gives 2.5x speedup. Tested on M2 Max 96GB with Q5_K_M quant and q4_0 KV cache. Also provides hardware recommendations for various Apple Silicon and NVIDIA GPUs.
- throughput:
- 73.6 t/s gen · 2883.0 t/s pp
- quant:
- NVFP4 (safetensors)
- kv:
- Q8
- flash attention:
- on
coding
MTP enabled with 3 speculative tokens. KV cache fp8_e4m3. Prefix caching tested. Stability pass at 200k: 10/10 runs. Generation speed varies 59-111 tok/s. Mean MTP acceptance length 2.28.
- throughput:
- 215.1 t/s gen
- quant:
- Q4 (gguf)
MTP grafted model; Q4 speed increase only 6% on 5090. Also tested Q8 on 5090+3090: 148.20 t/s without MTP, 152.02 t/s with MTP.
- throughput:
- 50.0 t/s gen
- quant:
- Q4_K_M (gguf)
- kv:
- Q4
- flash attention:
- on
coding
Uses MTP (Multi-Token Prediction) GGUF with am17an commit for faster inference. KV cache quantized to Q4_0. Speculative draft set to 2. At 100k context, VRAM usage ~19GB. Performance degrades above 90k context.
- throughput:
- 54.5 t/s gen
- kv:
- Q8
codingtool-use
MTP branch of llama.cpp by am17an. 29-30 t/s without MTP, 54-55 t/s with MTP at 150W power limit. Falls to 40-45 t/s after 50k tokens. Used as vscode copilot.
- throughput:
- 22.0 t/s gen · 760.0 t/s pp
- quant:
- IQ4_XS (gguf)
- kv:
- Q8
- flash attention:
- on
User tested Qwen3.6 27B IQ4_XS on RTX 5060 Ti 16GB with llama.cpp (TheTom's TurboQuant fork). Prompt processing 760 t/s, generation 22 t/s. Context window limited to 75k. KV cache quant turbo4/turbo2. Also tested BF16, Q8_0, Q6_K, Q5_K_XL, Q4_K_XL, IQ4_XS, IQ3_XXS, Q3_K_XL, Q3_K_M, Q2_K_XL on L40S or RTX 5060 Ti. Quality comparison using chess board SVG generation task. Recommends IQ4_XS as minimum.
- throughput:
- 63.0 t/s gen
- quant:
- 4-bit (mlx)
codingcreative-writing
MTPLX engine achieves 63 tok/s on Qwen3.6-27B 4-bit MLX on M5 Max 64GB, up from 28 tok/s baseline. Uses native MTP heads with temperature 0.6, top_p 0.95, top_k 20. Optimal depth D3. Custom patched MLX fork with Metal kernels.
- throughput:
- 80.0 t/s gen
- quant:
- FP8 (safetensors)
- kv:
- F16
- flash attention:
- on
codingagentic
Uses vLLM 0.20.1 with CUDA 12.9. BF16 KV cache. MTP=2 speculative decoding. Performance range 60-90 TPS, reported 80 TPS typical.
codingsummarization
Compared Qwen3.6-27B (with and without thinking) against Coder-Next. 27B with thinking disabled was most consistent (95.8% ship rate). 27B and Coder-Next statistically tied overall. Also mentions Qwen3.6-35B-A3B performed poorly and was dropped.
- throughput:
- 23.0 t/s gen
- kv:
- Q8
- flash attention:
- on
agentic
Running on a 5-year-old laptop with RTX 2060 Max-Q 6GB VRAM and 24GB RAM. Uses llama.cpp with Q8 KV cache, flash attention, and 64k context. Also mentions a long context variant with 128k context using Tom's fork. Model is Qwen3.6-35B-A3B (MoE, 3B active).
- throughput:
- 5.5 t/s gen · 160.0 t/s pp
- quant:
- Q8 (mlx)
long-context
User reports 160 tok/s prefill, 5-6 tok/s generation (later 4-5 tok/s) on M5 Max 128GB with Qwen 3.6 27B Q8 MLX at 290k context. GPU utilization 36-50%. User feels performance is lower than expected and seeks comparison.
- throughput:
- 32.0 t/s pp
- quant:
- Q6 (gguf)
coding
User is considering adding a second 7900 XTX for 48GB VRAM to run larger models. Currently running Qwen 27B Q6 dense with 32K context at 32 t/s prompt processing. Main use case is coding via opencode.
- throughput:
- 5.5 t/s gen · 160.0 t/s pp
- quant:
- Q8 (mlx)
long-context
User reports 160 tok/s prefill and 5-6 tok/s generation on M5 Max 128GB with Qwen 3.6 27B Q8 MLX at 290k context. GPU utilization only 36-50%, feels off compared to expected 8-14 tok/s generation. Seeking comparison from others.
- throughput:
- 32.0 t/s gen
coding
Qwen 3.6 27B on MacBook Pro M5 Max 64GB: 32 tokens/sec, 18m04s, 33946 tokens. Compared to Gemma 4 31B (27 t/s, 3m51s, 6209 tokens). Qwen showed more creativity but Gemma won for game logic.
- throughput:
- 66.0 t/s gen
tool-usecoding
~218K context @ ~50/66 TPS (text, narr/code). Tool calls with ~25K-token outputs now complete without OOM after fixing Genesis patch (PN12). Lower TPS than earlier config but higher context + stability.
- throughput:
- 5.5 t/s gen · 160.0 t/s pp
- quant:
- Q8 (mlx)
long-context
User reports 160 tok/s prefill, 5-6 tok/s generation on M5 Max 128GB with Qwen 3.6 27B Q8 MLX at 290k context. GPU utilization only 36-50%, feels off compared to expected 8-14 tok/s generation. Asks for comparison with other setups.
- throughput:
- 45.0 t/s gen
codingagentic
User rents GPU instance with 2x H100s (160GB VRAM) to run Qwen3.6-27B at 45 t/s. Uses vLLM for inference. Runs multiple agents (Claude Code, QwenCode, social media bots) hitting the API simultaneously. Context length 128K. Cost ~$0.90/hr, spent $120 last month. Model outperformed 120B model in tests.
- throughput:
- 25.0 t/s gen
- quant:
- Q4_0 (gguf)
- kv:
- Q4
coding
User reports 3000 tokens in ~2 minutes (25 t/s) with Q4_0 quant, 120k context, both caches quantized to 4_0. Seeking faster performance. Also includes a reply with vLLM benchmark on RTX 3090: 27B INT4 quant, 125K context, TurboQuant 3-bit NC KV cache, MTP speculative decoding, 82 tok/s generation, 0.3-0.6s TTFT.
- throughput:
- 38.0 t/s gen · 1021.0 t/s pp
- quant:
- IQ4_XS (gguf)
- kv:
- Q8
- flash attention:
- on
codingtool-usesummarizationagentic
Poster used llama-server.exe with Vulkan backend on a 16GB VRAM GPU (model not specified). Model is Qwen3.6-35B-A3B (MoE, 34.66B params, ~4.25 bpw) in IQ4_XS quant from Unsloth. Used --n-gpu-layers 99, --n-cpu-moe 16 (offloading MoE experts to CPU), --threads 14, --batch-size 1024, --ubatch-size 1024, --flash-attn 1, --cache-type-k q8_0, --cache-type-v q8_0, --ctx-size 80000, --cache-ram 2048, --no-mmap. Prompt processing: 1021.05 ± 1.24 t/s (pp80000). Generation: 37.96 ± 0.10 t/s (tg1000). Combined with a pi coding agent for file operations, tool calls, summarizations, and MCP calls. Poster says it's usable for someone with a 16GB GPU.
- throughput:
- 27.6 t/s gen
- quant:
- Q6_K (gguf)
vision
Used open-visual in Open WebUI for image generation. Multiple prompts with generation speeds around 27 t/s.
- throughput:
- 106.5 t/s gen
- quant:
- INT4 (safetensors)
- kv:
- Q8
Qwen3.6-27B-INT4 via vllm 0.19 on 1x RTX 5090. Achieves 105-108 tps generation with 256k context. Uses fp8_e4m3 KV cache, flashinfer attention, MTP speculative decoding (3 tokens). Model from Lorbus quant (AutoRound).
- throughput:
- 55.0 t/s gen
- quant:
- IQ4_XS (gguf)
coding
Switched from Qwen3.6 35b-a3b IQ4_XS to Qwen3.6 27b IQ3_M. 35b-a3b got 50-60 t/s but slow prompt processing; 27b got ~40 t/s but consistent. 27b found a bug the 35b couldn't. Dense model handles compression better than MoE.
- throughput:
- 30.0 t/s gen
- quant:
- Q5_K_S (gguf)
User found that larger quants (Q4_K_XL, Q5_K_S) gave better speed than smaller Q4 (IQ4_XS) on MoE model Qwen3.6-35B-A3B. With Q5_K_S, ~30 t/s at 128k context. Also tested Q4_K_XL: 32 t/s. IQ4_XS gave 25-30 t/s with 32k context. System: RTX 3070 8GB + 64GB DDR4.
- throughput:
- 80.0 t/s gen
- quant:
- NVFP4 (safetensors)
Qwen3.6-27B at ~80 tps with 218k context window on 1x RTX 5090 served by vllm 0.19.1rc1. Uses NVFP4 quantization.
- throughput:
- 19.2 t/s gen · 186.8 t/s pp
- quant:
- Q4_K_M (gguf)
- kv:
- Q8
User reports using a 5070Ti 16GB and a 2060 6GB to run Qwen3.6-27B Q4_K_M with llama-server. At 71k actual context, pp=186.76 t/s, tg=19.21 t/s. Also provides llama-bench results with CUDA showing tg speeds around 16-25 t/s depending on configuration.
- throughput:
- 22.5 t/s gen
- quant:
- Q4_K_M (gguf)
codingtool-use
Evaluated Qwen 3.6 27B across BF16, Q4_K_M, and Q8_0 GGUF quant variants with llama-cpp-python using Neo AI Engineer. Benchmarks: HumanEval, HellaSwag, BFCL. Q4_K_M best practical variant: 1.45x faster than BF16, 48% less peak RAM, 68.8% smaller model file, nearly identical function calling score.
- throughput:
- 913.0 t/s pp
- quant:
- Q4_K_M (gguf)
- kv:
- Q4
- flash attention:
- on
codingmath
Speculative decoding (DFlash) on single RTX 3090. Target: Qwen3.6-27B Q4_K_M GGUF (~16 GB). Draft: z-lab Qwen3.6-27B-DFlash bf16 (~3.46 GB). DDTree tree-verify, block size 16, budget 22, greedy verify. KV cache compressed to TQ3_0 (3.5 bpv, ~9.7x vs F16) with 4096-slot ring buffer enabling 256K context in 24 GB. Sliding-window flash attention (2048-token window) at decode. Prefill ubatch auto-bumps from 16 to 192 for prompts >2048 tokens. OpenAI-compatible HTTP endpoint. CUDA only, no Metal/ROCm/multi-GPU. Bit-identical output to autoregressive in AR mode; draft matches z-lab PyTorch reference at cos sim 0.999812.