- quant: Q4_K_M (gguf)
- kv: Q4
- flash-attn: on
coding, math
Speculative decoding (DFlash) on a single RTX 3090. Target: Qwen3.6-27B Q4_K_M GGUF (~16 GB). Draft: z-lab Qwen3.6-27B-DFlash bf16 (~3.46 GB). DDTree tree verify with block size 16, tree budget 22, and greedy verification. KV cache compressed to TQ3_0 (3.5 bpv, ~9.7x vs F16) with a 4096-slot ring buffer, enabling 256K context in 24 GB. Sliding-window flash attention (2048-token window) at decode. Prefill ubatch auto-bumps from 16 to 192 for prompts over 2048 tokens. Serves an OpenAI-compatible HTTP endpoint. CUDA only; no Metal, ROCm, or multi-GPU. Greedy verification makes output bit-identical to plain autoregressive (AR) decoding; the draft matches the z-lab PyTorch reference at cosine similarity 0.999812.
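For readers unfamiliar with block verification, here is a minimal sketch of the linear (non-tree) case in plain Python. The model calls are hypothetical placeholders, the block size follows the post's setting of 16, and the DDTree verify generalizes this by scoring a 22-node token tree per step instead of a single chain. The acceptance rule (exact greedy match against the target) is what makes the output bit-identical to autoregressive decoding.

```python
# Sketch of greedy block verification, the linear case of the tree
# verify described above. draft_step and target_argmax are hypothetical
# placeholders for the two models.

from typing import Callable, List

Token = int

def verify_block(
    prefix: List[Token],
    draft_step: Callable[[List[Token]], Token],          # draft's greedy next token
    target_argmax: Callable[[List[Token]], List[Token]], # target's greedy prediction
                                                         # at every position, one pass
    block_size: int = 16,
) -> List[Token]:
    """One speculative step: the draft proposes block_size tokens, the
    target verifies them in a single forward pass, and we keep the
    longest prefix the target agrees with, plus the target's own token
    at the first disagreement (so every step emits >= 1 token)."""
    # 1. Draft rolls out a candidate block autoregressively (cheap model).
    block: List[Token] = []
    ctx = list(prefix)
    for _ in range(block_size):
        t = draft_step(ctx)
        block.append(t)
        ctx.append(t)

    # 2. Target scores prefix + block in ONE pass; preds[i] is the
    #    target's greedy next token after prefix + block[:i].
    preds = target_argmax(prefix + block)[len(prefix) - 1:]

    # 3. Accept the longest matching prefix of the drafted block.
    accepted: List[Token] = []
    for drafted, predicted in zip(block, preds):
        if drafted == predicted:
            accepted.append(drafted)
        else:
            accepted.append(predicted)  # target's correction at the split
            break
    else:
        accepted.append(preds[block_size])  # whole block accepted: bonus token

    return accepted
```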
- quant: Q6
User reports a maximum of 26-29 t/s with a Q6 quant on a single RTX 3090 and asks what quant the OP is using. The model name appears to be Qwen3.6-27B, likely a variant or fine-tune of Qwen3.
- quant: IQ4
User reports 37 t/s with llama-server (llama.cpp), using an IQ4 quant because Q4_K_M runs out of memory. Model is "Luce DFlash: Qwen3.6-27B", which appears to be a 27B-parameter variant of Qwen3.
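A back-of-envelope budget using only the sizes quoted in the top post shows why Q4_K_M is tight on a 24 GB card; the overhead items and bits-per-weight figures in the comments are assumptions.

```python
# VRAM budget for the reported OOM, from the sizes quoted in the top post.

target_q4_k_m_gb = 16.0   # Qwen3.6-27B Q4_K_M GGUF (from the post)
draft_gb         = 3.46   # DFlash draft weights, bf16 (from the post)
vram_gb          = 24.0   # single RTX 3090

weights_gb  = target_q4_k_m_gb + draft_gb    # 19.46 GB before anything else
headroom_gb = vram_gb - weights_gb           # ~4.5 GB
print(f"weights: {weights_gb:.2f} GB, headroom: {headroom_gb:.2f} GB")

# KV cache, CUDA context, activations, and the ubatch-192 prefill buffers
# all have to fit in that ~4.5 GB, so an OOM is plausible. An IQ4-class
# quant (~4.25 bits/weight vs ~4.85 for Q4_K_M, assumed figures) shaves
# roughly 2 GB off the target, which matches the user's workaround.
```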
- quant: UD-IQ4_XS
User reports 13 t/s with Qwen3.6-27B UD-IQ4_XS on a single RTX 3090 and is concerned this seems low against the post's claim of up to 2x throughput.
Same setup as earlier in the thread; the speculative decoding mentioned earlier raised throughput to about 65 t/s on average (see the sanity check below).
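A quick sanity check on the numbers reported so far in the thread. The reports come from different quants and configs, so treating 37 t/s as "the" AR baseline is an assumption and the ratios are illustrative only.

```python
# Illustrative ratios from the throughput numbers reported in this thread.

ar_baseline = 37.0   # plain llama-server, IQ4 (earlier report)
with_spec   = 65.0   # average with speculative decoding on
low_report  = 13.0   # the UD-IQ4_XS figure in question

print(f"spec dec vs baseline: {with_spec / ar_baseline:.2f}x")
# ~1.76x, broadly consistent with the post's "up to 2x" claim.

print(f"slow report vs baseline: {low_report / ar_baseline:.2f}x")
# ~0.35x, well below even plain AR decoding, suggesting a different
# bottleneck (e.g. partial CPU offload or spec dec not actually
# engaged) rather than a failure of the 2x claim itself.
```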
- quant: Q8_0 (gguf)
Same setup as the parent post (Luce DFlash: Qwen3.6-27B). Running Unsloth's Q8_0 with full context at ~23 t/s with the cards set to a 75% power limit.
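The plural "cards" is consistent with a quick size estimate. The parameter count below is read off the model name, Q8_0 at ~8.5 bits/weight is llama.cpp's usual figure, and 350 W is the 3090's stock power limit; all three are assumptions, not values from the comment.

```python
# Rough sizing for the Q8_0 report.

params  = 27e9      # from the model name (assumed)
q8_bpw  = 8.5       # llama.cpp Q8_0 effective bits per weight
size_gb = params * q8_bpw / 8 / 1e9
print(f"Q8_0 weights: ~{size_gb:.1f} GB")
# ~28.7 GB > 24 GB, so this cannot fit on a single 3090:
# "cards" plural implies at least two.

stock_w = 350       # RTX 3090 reference board power (assumed)
print(f"75% power limit: ~{stock_w * 0.75:.0f} W per card")
```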
- quant: turbo quant
coding, agentic, tool-use
User reports Qwen 3.6 27B working in Claude Code (CC) with 125k context and a turbo quant on a single RTX 3090. Speculative decoding produced garbage output in this setup. The user fixed repetition issues by launching Claude with --model claude-3-haiku-20240307 instead of claude-4-haiku, which disables extended thinking. The linked Medium article the user followed mentions a generation speed of 85 t/s. The user also tried AWQ/GGUF formats with llama.cpp, vllm, and LM Studio before settling on this config.
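For verifying one of these local servers independently of Claude Code, a minimal smoke test against an OpenAI-compatible endpoint (llama.cpp's llama-server, vLLM, and LM Studio all expose one, as does the server in the top post) might look like the following. The base URL, API key, and model name are placeholders, not values from the thread.

```python
# Minimal smoke test for a local OpenAI-compatible endpoint.
# base_url, api_key, and model are placeholders; adjust to whatever
# your local server actually reports (see client.models.list()).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

resp = client.chat.completions.create(
    model="qwen3.6-27b",   # placeholder model name
    messages=[{"role": "user", "content": "Say hi in five words."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)
```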