- quant:
- Q6_K (gguf)
codingagentic
User replaced Claude with Qwen3.6-27B in multi-agent orchestrator for 2 weeks. Plan generation good, tool-call reliability poor (12% format error), long-context drift past ~14k tokens, cascade-failure handling weak. Viable as reasoning layer but not execution layer.
- throughput:
- 104.0 t/s gen · 1399.0 t/s pp
- quant:
- Q8
- kv:
- F16
User recommends getting enough GPUs to avoid VRAM hacks. Uses 2x RTX 3090s.
- throughput:
- 70.0 t/s gen
creative-writing
User mentions using Qwen3.6-27B on dual RTX 3090s for generating interactive HTML content inline with chat. Reports ~70 t/s.
- quant:
- Q5_K_S (gguf)
- kv:
- Q8
long-context
Benchmark of KV cache quantization methods using Qwen3.6 27B at 64k and 128k context. Also tested IQ4_XS quant. Article at https://anbeeld.com/articles/kv-cache-quantization-benchmarks-for-long-context
- throughput:
- 72.9 t/s gen · 1261.0 t/s pp
- quant:
- IQ4_KS (gguf)
- kv:
- Q8
- flash attention:
- on
- mtp (multi-token prediction):
- on
coding
Best setup on RTX 3090 24GB: ik_llama.cpp + Qwen3.6-27B-MTP-IQ4_KS.gguf, 156k context, q8_0/q8_0 KV, MTP, vision on CPU. Prefill 1261 tok/s, decode 72.9 tok/s. Also tested llama.cpp and BeeLlama.
- throughput:
- 29.0 t/s gen · 239.0 t/s pp
Benchmark of 4x RTX 3090 with Qwen3.6-27B FP16 using vLLM TP=4. Power limit sweep from 200W to unrestricted. Peak efficiency at 220W. User expresses satisfaction with Qwen 3.6 27B as daily driver.
- throughput:
- 50.0 t/s gen
- quant:
- Q4_K_M (gguf)
- kv:
- Q4
- flash attention:
- on
coding
Uses MTP (Multi-Token Prediction) GGUF with am17an commit for faster inference. KV cache quantized to Q4_0. Speculative draft set to 2. At 100k context, VRAM usage ~19GB. Performance degrades above 90k context.
- throughput:
- 66.0 t/s gen
tool-usecoding
~218K context @ ~50/66 TPS (text, narr/code). Tool calls with ~25K-token outputs now complete without OOM after fixing Genesis patch (PN12). Lower TPS than earlier config but higher context + stability.
- throughput:
- 28.0 t/s gen · 450.0 t/s pp
- quant:
- Q4_K_M (gguf)
- kv:
- Q8
coding
Solid for autocomplete, occasionally hallucinates imports in multi-file refactors. Build b4400.
- throughput:
- 913.0 t/s pp
- quant:
- Q4_K_M (gguf)
- kv:
- Q4
- flash attention:
- on
codingmath
Speculative decoding (DFlash) on single RTX 3090. Target: Qwen3.6-27B Q4_K_M GGUF (~16 GB). Draft: z-lab Qwen3.6-27B-DFlash bf16 (~3.46 GB). DDTree tree-verify, block size 16, budget 22, greedy verify. KV cache compressed to TQ3_0 (3.5 bpv, ~9.7x vs F16) with 4096-slot ring buffer enabling 256K context in 24 GB. Sliding-window flash attention (2048-token window) at decode. Prefill ubatch auto-bumps from 16 to 192 for prompts >2048 tokens. OpenAI-compatible HTTP endpoint. CUDA only, no Metal/ROCm/multi-GPU. Bit-identical output to autoregressive in AR mode; draft matches z-lab PyTorch reference at cos sim 0.999812.