- quant:
- Q4_K_M (gguf)
- kv:
- Q4
- flash-attn:
- on
coding, math
Speculative decoding (DFlash) on single RTX 3090. Target: Qwen3.6-27B Q4_K_M GGUF (~16 GB). Draft: z-lab Qwen3.6-27B-DFlash bf16 (~3.46 GB). DDTree tree-verify, block size 16, budget 22, greedy verify. KV cache compressed to TQ3_0 (3.5 bpv, ~9.7x vs F16) with 4096-slot ring buffer enabling 256K context in 24 GB. Sliding-window flash attention (2048-token window) at decode. Prefill ubatch auto-bumps from 16 to 192 for prompts >2048 tokens. OpenAI-compatible HTTP endpoint. CUDA only, no Metal/ROCm/multi-GPU. Bit-identical output to autoregressive in AR mode; draft matches z-lab PyTorch reference at cos sim 0.999812.
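A minimal sketch of exercising the endpoint; the entry only states it is OpenAI-compatible, so the host, port, and model name below are placeholders:

```sh
# Host/port and model name are assumptions; only the OpenAI-compatible
# /v1/chat/completions route is stated in the entry.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3.6-27b-dflash",
        "messages": [{"role": "user", "content": "Factor 2^32 + 1."}],
        "max_tokens": 256
      }'
```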
- quant:
- IQ4_NL (gguf)
- kv:
- Q8
- flash-attn:
- on
coding
Running llama.cpp server with pi.dev (coding agent). Uses n-gram speculation (spec-type ngram-mod) for throughput. --parallel 2 allows two simultaneous requests; --keep 3000 prevents the system prompt from being shifted out of context; --jinja with preserve_thinking enables multi-turn reasoning. Reports "generating as high as 38 t/s" and "35 t/s throughput" with speculation.
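A sketch of the described launch, assuming a model filename and full GPU offload. --spec-type ngram-mod is spelled as the post reports it, and passing preserve_thinking as a chat-template kwarg is a guess; exact flag syntax varies by llama.cpp version:

```sh
# Sketch of the reported server config. Filename and -ngl are assumptions;
# --spec-type ngram-mod is spelled as in the post, and the preserve_thinking
# kwarg is one plausible way to pass it, not confirmed by the source.
llama-server -m Qwen3.6-27B-IQ4_NL.gguf -ngl 99 \
  --parallel 2 \
  --keep 3000 \
  --jinja --chat-template-kwargs '{"preserve_thinking": true}' \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --spec-type ngram-mod
```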
- quant:
- UD-Q6XL (gguf)
User also mentions running the 27B UD-Q4XL at 0.5 tps; the 35B UD-Q6XL gives ~30 tps. Running on a server with an EPYC 7763 CPU, 128 GB system RAM, and an RTX 5060 Ti, using llama.cpp/vLLM on Kubernetes. The model name in the post title is "Qwen3.6 35B A3B Heretic (KLD 0.0015!)", which appears to be a Qwen3-family MoE model (35B total parameters, 3B active).
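For a MoE of this shape on a small GPU plus large system RAM, a common llama.cpp trick is to pin the expert tensors to CPU memory with --override-tensor while keeping the dense/attention weights on the GPU. The filename and regex below are assumptions, not the poster's exact command:

```sh
# Hypothetical MoE offload: expert tensors stay in the EPYC's 128 GB RAM,
# dense/attention tensors go to the RTX 5060 Ti. Filename and regex assumed.
llama-server -m Qwen3.6-35B-A3B-UD-Q6_K_XL.gguf -ngl 99 \
  --override-tensor ".ffn_.*_exps.=CPU"
```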
- quant:
- Q4_K_XL (gguf)
- kv:
- Q8
- flash-attn:
- on
User ran llama-bench with flash attention (fa=1), showing pp512 at 1600.11 t/s and tg128 at 44.43 t/s, then ran llama-server with a Q8 KV cache (ctk/ctv q8_0) and 131072 context, reporting an average of 41.9 t/s on a prompt with ~2k tokens of output. The model variant is Qwen3.6-27B-UD (26.9B params); the GPU is an RTX 3090 Ti (24 GB VRAM).
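Commands consistent with the two reported runs; the filename is an assumption, pp512/tg128 are llama-bench's defaults, and exact flag syntax varies by llama.cpp version:

```sh
# Bench run with flash attention on, matching the pp512/tg128 figures:
llama-bench -m Qwen3.6-27B-UD-Q4_K_XL.gguf -fa 1
# Server run with Q8 KV cache and 131072 context:
llama-server -m Qwen3.6-27B-UD-Q4_K_XL.gguf -c 131072 -fa \
  --cache-type-k q8_0 --cache-type-v q8_0
```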
- quant:
- IQ4
User reports 37 t/s with llama-server (llama.cpp), using the IQ4 quant because Q4_K_M runs out of memory. The model is "Luce DFlash: Qwen3.6-27B", which appears to be a Qwen3 variant at 27B parameters.
- quant:
- IQ3_XXS (gguf)
- kv:
- Q4
User mentions Qwen3.x models are good at managing KV cache and suggests using a Q4 KV cache. Reports running Qwen3.5-27B-UD-IQ3_XXS.gguf with llama.cpp under a desktop environment (LXQt) with ~30K context. Device memory usage is ~11744 MiB out of ~11783 MiB free; those figures are consistent with a 12 GB card rather than a 24 GB one. Also mentions --ctx-size 60960 when running without X11. Prompt eval: 160.11 t/s (6198 tokens); generation: 23.31 t/s (148 tokens).
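A sketch of that run, using the filename given in the entry; note that llama.cpp requires flash attention when the V cache is quantized, and flag syntax varies by version:

```sh
# ~30K context with a Q4 KV cache; -fa is required for the quantized V cache.
llama-cli -m Qwen3.5-27B-UD-IQ3_XXS.gguf \
  --ctx-size 30000 -fa \
  --cache-type-k q4_0 --cache-type-v q4_0
```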
- quant:
- Q8 (gguf)
- kv:
- Q8
agentic, coding
User is running on Strix Halo (AMD Ryzen AI Max). They mention Q8 of Qwen3.6 27B with a Q8 KV cache and 115k context, getting 46 tok/sec, which they find slow for agentic coding tasks. The model name "Qwen3.6" is unusual, likely a Qwen3 variant or a typo, but is kept as-is per instructions.
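A sketch of that configuration, with the filename assumed; Strix Halo typically runs llama.cpp's Vulkan or ROCm backend against unified memory:

```sh
# Q8_0 weights, Q8 KV cache, 115k context on unified memory (filename assumed).
llama-server -m Qwen3.6-27B-Q8_0.gguf -c 115000 -fa -ngl 99 \
  --cache-type-k q8_0 --cache-type-v q8_0
```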
- quant:
- IQ4_XS (gguf)
- kv:
- Q8
- rating:
- 5/5
coding, tool-use
User reports Qwen 3.6 27B is excellent for PySpark/Python and data-transformation debugging. Running on an ASUS ROG Strix SCAR 18 with an RTX 5090 laptop GPU (24 GB VRAM) and 64 GB DDR5 RAM. Uses llama.cpp with the IQ4_XS quant at 200k context and a Q8_0 KV cache; initially tried Q4_K_M with q4_0 (presumably the KV cache type). Cancelling cloud subscriptions due to local performance. No tokens/sec reported.
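A sketch of the final configuration as described (filename assumed); Q8_0 KV roughly halves the KV footprint versus F16 at 200k context:

```sh
# 200k context with Q8_0 KV cache; filename assumed, flag syntax varies by version.
llama-server -m Qwen3.6-27B-IQ4_XS.gguf -c 200000 -fa -ngl 99 \
  --cache-type-k q8_0 --cache-type-v q8_0
```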
- quant:
- turbo quant
coding, agentic, tool-use
User reports Qwen 3.6 27B working in Claude Code (CC) with 125k context and a turbo quant on a single RTX 3090. Speculative decoding produced garbage output. The user fixed repetition issues by launching Claude with --model claude-3-haiku-20240307 instead of claude-4-haiku, which disables extended thinking. A generation speed of 85 t/s is mentioned in the linked Medium article, whose setup the user implemented. The user also tried AWQ/GGUF formats with llama.cpp, vLLM, and LM Studio before settling on this config.
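One plausible wiring of Claude Code to a local endpoint, using Claude Code's documented environment variables; the URL and token are placeholders, and the --model value is the one the user reports:

```sh
# Point Claude Code at a local Anthropic-compatible proxy (URL/token are
# placeholders, not from the post).
export ANTHROPIC_BASE_URL="http://localhost:8080"
export ANTHROPIC_AUTH_TOKEN="dummy"
# Per the post, pinning the older Haiku model name disables extended thinking:
claude --model claude-3-haiku-20240307
```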