Qwen3

Alibaba · 9 reports

Qwen3 27.00B Qwen3.6-27B

RTX 3090 · 256,000 ctx

913.0 t/s pp
quant:
Q4_K_M (gguf)
kv:
Q4
flash-attn:
on
coding · math

Speculative decoding (DFlash) on single RTX 3090. Target: Qwen3.6-27B Q4_K_M GGUF (~16 GB). Draft: z-lab Qwen3.6-27B-DFlash bf16 (~3.46 GB). DDTree tree-verify, block size 16, budget 22, greedy verify. KV cache compressed to TQ3_0 (3.5 bpv, ~9.7x vs F16) with 4096-slot ring buffer enabling 256K context in 24 GB. Sliding-window flash attention (2048-token window) at decode. Prefill ubatch auto-bumps from 16 to 192 for prompts >2048 tokens. OpenAI-compatible HTTP endpoint. CUDA only, no Metal/ROCm/multi-GPU. Bit-identical output to autoregressive in AR mode; draft matches z-lab PyTorch reference at cos sim 0.999812.

Qwen3 27.00B Qwen3.6-27B

Unknown GPU · llama.cpp · 163,840 ctx

38.0 t/s gen
quant:
IQ4_NL (gguf)
kv:
Q8
flash-attn:
on
coding

Running llama.cpp server with pi.dev (coding agent). Uses n-gram speculation (spec-type ngram-mod) for throughput. --parallel 2 allows two simultaneous requests; --keep 3000 prevents the system prompt from being shifted out of context; --jinja with preserve_thinking keeps reasoning across multi-turn conversations. The user mentions "generating as high as 38 t/s" and "35 t/s throughput" with speculation enabled.
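A hedged sketch of the llama-server invocation this report implies. The model filename is hypothetical, and the n-gram speculation flag spelling is copied verbatim from the report; flag names vary across llama.cpp versions, so check `llama-server --help` on your build.

```shell
# Sketch of the reported setup (filename hypothetical, speculation flag per the report).
llama-server \
  -m Qwen3.6-27B-IQ4_NL.gguf \
  --ctx-size 163840 \
  --flash-attn on \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --spec-type ngram-mod \
  --parallel 2 \
  --keep 3000 \
  --jinja
```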

30.0 t/s gen
quant:
UD-Q6XL (gguf)

The user also runs a 27B UD-Q4XL at 0.5 t/s; the 35B UD-Q6XL gives ~30 t/s. Server hardware: EPYC 7763 CPU, 128GB system RAM, RTX 5060 Ti. Runs llama.cpp/vLLM on Kubernetes. The model name in the post title is "Qwen3.6 35B A3B Heretic (KLD 0.0015!)", which appears to be a Qwen3-family MoE model (35B total, 3B active).

Qwen3 26.90B Qwen3.6-27B-UD

RTX 3090 Ti · llama.cpp · 131,072 ctx

41.9 t/s gen · 1600.1 t/s pp
quant:
Q4_K_XL (gguf)
kv:
Q8
flash-attn:
on

User ran llama-bench with flash attention (fa=1) showing pp512 at 1600.11 t/s and tg128 at 44.43 t/s. Then ran llama-server with Q8 KV cache (ctk q8_0, ctv q8_0), context 131072, and reported 41.9 t/s average on a 2k output prompt. The model variant is Qwen3.6-27B-UD (26.9B params). GPU is RTX 3090 Ti (24GB VRAM).
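A sketch of how these numbers were likely produced, assuming a hypothetical filename: llama-bench with flash attention for pp512/tg128, then llama-server with the quantized KV cache at full context.

```shell
# Benchmark with flash attention enabled (pp512 / tg128, as reported).
llama-bench -m Qwen3.6-27B-UD-Q4_K_XL.gguf -fa 1 -p 512 -n 128

# Serve with Q8_0 KV cache at the full 131,072 context.
llama-server -m Qwen3.6-27B-UD-Q4_K_XL.gguf \
  --ctx-size 131072 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --flash-attn on
```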

37.0 t/s gen
quant:
IQ4

User reports 37 t/s with llama-server (llama.cpp), using an IQ4 quant because Q4_K_M runs out of memory. The model is "Luce DFlash: Qwen3.6-27B", which appears to be a Qwen3 variant at 27B parameters.

Qwen3 27.00B Qwen3.5-27B-UD-IQ3_XXS

Unknown GPU · llama.cpp · 30,000 ctx

23.3 t/s gen · 160.1 t/s pp
quant:
IQ3_XXS (gguf)
kv:
Q4

User notes that Qwen3.x models manage KV cache well and suggests a Q4 KV cache. Reports running Qwen3.5-27B-UD-IQ3_XXS.gguf with llama.cpp under a desktop environment (LXQt) at ~30K context. Device memory usage is ~11744 MiB of ~11783 MiB free; the ~11.8 GiB total points to a 12 GB card rather than a 24 GB one. Without X11 they can push --ctx-size to 60960. Prompt eval: 160.11 t/s (6198 tokens); generation: 23.31 t/s (148 tokens).
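A sketch of the low-VRAM configuration this report describes, pairing IQ3_XXS weights with a Q4 KV cache to fit ~30K context in roughly 12 GB; the filename follows the report, and flag spellings should be checked against your llama.cpp build.

```shell
# IQ3_XXS weights plus Q4 KV cache; raise --ctx-size to 60960 when running headless.
llama-server -m Qwen3.5-27B-UD-IQ3_XXS.gguf \
  --ctx-size 30000 \
  --cache-type-k q4_0 --cache-type-v q4_0
```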

Qwen3 27.00B Qwen3.6 27b

AMD Strix Halo 128GB · 115,000 ctx

46.0 t/s gen
quant:
Q8 (gguf)
kv:
Q8
agentic · coding

User is running on Strix Halo (AMD Ryzen AI Max). They run a Q8 quant of Qwen3.6 27b with Q8 KV cache and 115k context, getting 46 t/s, which they find slow for agentic coding tasks. The model name "Qwen3.6" is unusual, likely a Qwen3 variant or a typo, and is reported as written.

Qwen3 27.00B Qwen 3.6 27B

RTX 5090 · llama.cpp · 200,000 ctx

quant:
IQ4_XS (gguf)
kv:
Q8
rating:
5/5
coding · tool-use

User reports Qwen 3.6 27B is excellent for PySpark/Python and data-transformation debugging. Running on an ASUS ROG Strix SCAR 18 with a laptop RTX 5090 (24GB VRAM) and 64GB DDR5 RAM. Uses llama.cpp with the IQ4_XS quant at 200k context and Q8_0 KV cache; initially tried Q4_K_M with a Q4_0 KV cache. Cancelling cloud subscriptions due to local performance. No tokens/sec reported.

Qwen3 27.00B Qwen 3.6 27B

RTX 3090 · llama.cpp · 125,000 ctx

85.0 t/s gen
quant:
turbo quant
coding · agentic · tool-use

User reports Qwen 3.6 27B working in Claude Code (CC) with 125k context and a turbo quant on a single RTX 3090. Speculative decoding produced garbage output. The user fixed repetition issues by launching Claude with --model claude-3-haiku-20240307 instead of claude-4-haiku, which disables extended thinking. The 85 t/s generation speed comes from the linked Medium article, whose setup the user implemented. The user also tried AWQ/GGUF formats with llama.cpp, vLLM, and LM Studio before settling on this config.