llamaperf

Qwen3.6

Alibaba · 4 reports

Qwen3.6 35B (3B active)

16GB VRAM (unspecified GPU model) · llama.cpp (llama-server.exe, Vulkan backend) · 80,000 ctx

throughput:
38.0 t/s gen · 1021.0 t/s pp
quant:
IQ4_XS (gguf)
kv:
Q8
flash-attn:
on
coding · tool-use · summarization · agentic

The poster ran llama-server.exe with the Vulkan backend on a 16GB-VRAM GPU (model not specified). The model is Qwen3.6-35B-A3B (MoE, 34.66B params, ~4.25 bpw) in Unsloth's IQ4_XS quant. Launch flags: --n-gpu-layers 99, --n-cpu-moe 16 (MoE experts offloaded to CPU), --threads 14, --batch-size 1024, --ubatch-size 1024, --flash-attn 1, --cache-type-k q8_0, --cache-type-v q8_0, --ctx-size 80000, --cache-ram 2048, --no-mmap. Prompt processing: 1021.05 ± 1.24 t/s (pp80000); generation: 37.96 ± 0.10 t/s (tg1000). Combined with a pi coding agent for file operations, tool calls, summarization, and MCP calls. The poster considers it usable for anyone with a 16GB GPU.
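For reference, a minimal sketch of that launch, wrapped in Python so the flags are easy to diff against your own setup; every flag is quoted from the report, while the model path is a placeholder, not taken from the post.

```python
import subprocess

# Reported llama-server.exe flags, assembled into a launch command.
cmd = [
    "llama-server.exe",
    "--model", r"models\Qwen3.6-35B-A3B-IQ4_XS.gguf",  # placeholder path, not from the report
    "--n-gpu-layers", "99",
    "--n-cpu-moe", "16",       # keep the expert tensors of 16 MoE layers on the CPU
    "--threads", "14",
    "--batch-size", "1024",
    "--ubatch-size", "1024",
    "--flash-attn", "1",
    "--cache-type-k", "q8_0",  # Q8_0 KV cache (keys)
    "--cache-type-v", "q8_0",  # Q8_0 KV cache (values)
    "--ctx-size", "80000",
    "--cache-ram", "2048",
    "--no-mmap",
]
subprocess.run(cmd, check=True)
```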

Qwen3.6 35B (3B active)

AMD 7700 XT · LM Studio · 128,000 ctx

quant:
i1-q4_k_s (gguf)
kv:
Q8
flash-attn:
on
rating:
5/5
coding · tool-use

User reports Qwen 3.6 35B-A3B (i1-q4_k_s quant) running on an AMD 7700 XT with 32GB DDR4 RAM and a Ryzen 5 5600. All 40 layers offloaded to GPU, 128k context, flash attention enabled, Q8_0 KV cache quantization. LM Studio served as an OpenAI-compatible endpoint, with GitHub Copilot and RooCode as frontends. The model fixed scraper bugs in one shot (25 min) and updated a project README with emulator screenshots (45 min). The user calls it a game-changer for local coding, reserving cloud models for only the most demanding work.
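As a sketch of that endpoint wiring (not the user's actual client code): LM Studio's local server exposes OpenAI-compatible routes, so anything that speaks the OpenAI API can target it. The port below is LM Studio's default, and the model id is a placeholder.

```python
from openai import OpenAI

# LM Studio's local server defaults to http://localhost:1234/v1; the key is
# ignored by the server but required by the client library.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

resp = client.chat.completions.create(
    model="qwen3.6-35b-a3b",  # placeholder model id, not from the report
    messages=[{"role": "user", "content": "Fix the off-by-one bug in this scraper: ..."}],
)
print(resp.choices[0].message.content)
```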

Qwen3.6 27B

RTX 3090 · Luce DFlash (custom C++/CUDA stack on ggml) · 256,000 ctx

tone:
positive
throughput:
913.0 t/s pp
quant:
Q4_K_M (gguf)
kv:
Q4
flash-attn:
on
coding · math

Speculative decoding (DFlash) on a single RTX 3090. Target: Qwen3.6-27B Q4_K_M GGUF (~16 GB). Draft: z-lab Qwen3.6-27B-DFlash bf16 (~3.46 GB). DDTree tree-verify with block size 16, budget 22, and greedy verification. KV cache compressed to TQ3_0 (3.5 bpv, ~9.7x vs F16) with a 4096-slot ring buffer, enabling 256K context in 24 GB. Sliding-window flash attention (2048-token window) at decode. Prefill ubatch auto-bumps from 16 to 192 for prompts over 2048 tokens. Serves an OpenAI-compatible HTTP endpoint. CUDA only; no Metal, ROCm, or multi-GPU support. Output is bit-identical to autoregressive decoding in AR mode; the draft matches the z-lab PyTorch reference at cosine similarity 0.999812.
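For readers new to speculative decoding, here is a generic greedy-verify loop. It is not Luce DFlash's code (DFlash verifies a DDTree of candidates under a 22-token budget rather than a single chain), but it shows why greedy verification is lossless, which is what the bit-identical claim above rests on. target_next_token and the 16-token block are illustrative.

```python
# Generic greedy speculative-verify loop (illustrative, not DFlash's code).
# The draft proposes a block of tokens; the target accepts the longest
# prefix matching its own greedy choices, then emits one corrected token.
# Because every kept token equals the target's greedy pick, output is
# bit-identical to plain autoregressive decoding.
def greedy_verify(target_next_token, context, draft_block):
    accepted = []
    ctx = list(context)
    for tok in draft_block:              # e.g. a block of 16 draft tokens
        choice = target_next_token(ctx)  # target's greedy pick at this position
        if choice != tok:
            accepted.append(choice)      # first mismatch: keep the correction
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next_token(ctx))  # whole block matched: bonus token
    return accepted
```

In a real implementation the target scores the whole draft block (or tree) in one batched forward pass; the speedup comes from replacing many sequential target steps with that single pass.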

Qwen3.6 27B

RTX 5090 · llama.cpp · 200,000 ctx

tone:
positive
quant:
IQ4_XS (gguf)
kv:
Q8
rating:
5/5
coding · tool-use

User reports Qwen 3.6 27B is excellent for PySpark/Python work and debugging data transformations. Running on an ASUS ROG Strix SCAR 18 with a laptop RTX 5090 (24GB VRAM) and 64GB DDR5 RAM. Uses llama.cpp with the IQ4_XS quant at 200k context and a Q8_0 KV cache; initially tried q4_k_m with a q4_0 KV cache. The user is cancelling cloud subscriptions because local performance suffices. No tokens/sec reported.
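To see why the Q8_0 KV cache is the lever that makes 200k context plausible on 24GB, a back-of-the-envelope footprint calculator follows. The attention dimensions are placeholders (the report gives no layer/head counts for Qwen 3.6 27B); the 8.5 bits/value figure is GGML's actual q8_0 rate (blocks of 32 int8 values plus an fp16 scale).

```python
# Rough KV-cache footprint: K and V each store one value per
# (layer, position, kv-head, head-dim) slot.
def kv_gib(n_ctx, n_layers, n_kv_heads, head_dim, bits_per_value):
    values = 2 * n_ctx * n_layers * n_kv_heads * head_dim  # factor 2: K and V
    return values * bits_per_value / 8 / 2**30

# Placeholder dims for illustration only -- not Qwen 3.6 27B's real config.
dims = dict(n_ctx=200_000, n_layers=48, n_kv_heads=4, head_dim=128)
print(f"f16 : {kv_gib(**dims, bits_per_value=16):.1f} GiB")
print(f"q8_0: {kv_gib(**dims, bits_per_value=8.5):.1f} GiB")
```

Whatever the real dims, q8_0 cuts the cache to about 53% of f16 (8.5/16 bits per value), which is what buys the headroom for long context alongside the model weights.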