- throughput:
- 38.0 t/s gen · 1021.0 t/s pp
- quant:
- IQ4_XS (gguf)
- kv:
- Q8
- flash-attn:
- on
coding · tool-use · summarization · agentic
Poster used llama-server.exe with Vulkan backend on a 16GB VRAM GPU (model not specified). Model is Qwen3.6-35B-A3B (MoE, 34.66B params, ~4.25 bpw) in IQ4_XS quant from Unsloth. Used --n-gpu-layers 99, --n-cpu-moe 16 (offloading MoE experts to CPU), --threads 14, --batch-size 1024, --ubatch-size 1024, --flash-attn 1, --cache-type-k q8_0, --cache-type-v q8_0, --ctx-size 80000, --cache-ram 2048, --no-mmap. Prompt processing: 1021.05 ± 1.24 t/s (pp80000). Generation: 37.96 ± 0.10 t/s (tg1000). Combined with a pi coding agent for file operations, tool calls, summarizations, and MCP calls. Poster says it's usable for someone with a 16GB GPU.
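The flags listed above assemble into roughly the following invocation (the GGUF filename/path is a placeholder; the poster did not share the exact command line):

```shell
# Reconstructed from the flags reported above; model path is illustrative.
llama-server.exe \
  --model Qwen3.6-35B-A3B-IQ4_XS.gguf \
  --n-gpu-layers 99 \
  --n-cpu-moe 16 \
  --threads 14 \
  --batch-size 1024 \
  --ubatch-size 1024 \
  --flash-attn 1 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --ctx-size 80000 \
  --cache-ram 2048 \
  --no-mmap
```

The `--n-cpu-moe 16` flag is what makes a ~35B MoE fit on 16GB VRAM: expert layers are kept on the CPU while attention and shared weights stay on the GPU.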
- quant:
- i1-q4_k_s (gguf)
- kv:
- Q8
- flash-attn:
- on
- rating:
- 5/5
coding · tool-use
User reports Qwen 3.6 35B-A3B (i1-q4_k_s quant) running on AMD 7700 XT with 32GB DDR4 RAM and Ryzen 5 5600. All 40 layers offloaded to GPU, 128k context, flash attention enabled, Q8_0 KV cache quantization. Used LM Studio as OpenAI-compatible endpoint with GitHub Copilot and RooCode frontends. Successfully fixed scraper bugs in one shot (25 min) and updated project README with emulator screenshots (45 min). User describes it as a game-changer for local coding, reserving cloud models only for demanding work.
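Frontends like RooCode talk to LM Studio through its OpenAI-compatible endpoint. A minimal sketch of the request shape such a frontend sends (the base URL, port, and model id are assumptions; LM Studio defaults to port 1234 but this setup was not specified):

```python
import json

# Hypothetical local endpoint exposed by LM Studio (adjust host/port/model id).
BASE_URL = "http://localhost:1234/v1/chat/completions"

def build_chat_request(prompt, model="qwen3.6-35b-a3b", max_tokens=512):
    """Build an OpenAI-compatible /chat/completions request payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.2,
    }

payload = build_chat_request("Fix the off-by-one bug in scraper.py")
body = json.dumps(payload)  # POST this to BASE_URL with Content-Type: application/json
```

Because the wire format is the standard OpenAI schema, any OpenAI-compatible client (GitHub Copilot with a custom endpoint, RooCode, plain `curl`) can drive the local model without changes.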
- throughput:
- 913.0 t/s pp
- quant:
- Q4_K_M (gguf)
- kv:
    - TQ3_0
- flash-attn:
- on
coding · math
Speculative decoding (DFlash) on single RTX 3090. Target: Qwen3.6-27B Q4_K_M GGUF (~16 GB). Draft: z-lab Qwen3.6-27B-DFlash bf16 (~3.46 GB). DDTree tree-verify, block size 16, budget 22, greedy verify. KV cache compressed to TQ3_0 (3.5 bpv, ~9.7x vs F16) with 4096-slot ring buffer enabling 256K context in 24 GB. Sliding-window flash attention (2048-token window) at decode. Prefill ubatch auto-bumps from 16 to 192 for prompts >2048 tokens. OpenAI-compatible HTTP endpoint. CUDA only, no Metal/ROCm/multi-GPU. Bit-identical output to autoregressive in AR mode; draft matches z-lab PyTorch reference at cos sim 0.999812.
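The greedy draft-then-verify loop at the core of this setup can be sketched in a few lines. This is a toy illustration with stand-in callables, not the DFlash/DDTree implementation (which verifies a tree of candidates, not a single chain); it shows why greedy verification yields output bit-identical to plain autoregressive decoding: the target model decides every token at the first disagreement.

```python
def speculative_step(ctx, draft_model, target_model, block_size=16):
    """One draft-then-verify step (block size 16, as in the setup above).

    The draft model proposes `block_size` tokens autoregressively; the
    target model then checks them greedily. The accepted output is the
    longest draft prefix the target agrees with, plus one corrected token
    from the target at the first mismatch (so decoding always advances).
    """
    # Draft phase: cheap model proposes a block of tokens.
    draft, d_ctx = [], list(ctx)
    for _ in range(block_size):
        t = draft_model(d_ctx)
        draft.append(t)
        d_ctx.append(t)

    # Verify phase: target accepts matching tokens, overrides the first miss.
    accepted, v_ctx = [], list(ctx)
    for t in draft:
        target_t = target_model(v_ctx)
        if target_t != t:
            accepted.append(target_t)  # target's choice wins; stop here
            break
        accepted.append(t)
        v_ctx.append(t)
    return accepted

# Demo: a deterministic stand-in "target" and a "draft" that agrees on the
# first two proposed tokens, then diverges.
target = lambda ctx: len(ctx) % 7
draft = lambda ctx: len(ctx) % 7 if len(ctx) < 5 else 0

out = speculative_step([1, 2, 3], draft, target)  # [3, 4, 5]
```

When the draft agrees with the target often (as a distilled draft like DFlash is trained to), most blocks are accepted whole, amortizing one target pass over many emitted tokens.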