llamaperf

Qwen3.6

Alibaba · 50 reports

Qwen3.6 27B

RTX 5060 Ti 16GB · llama.cpp · 131,072 ctx

Tone: positive
throughput:
19.0 t/s gen
quant:
IQ4_XS (gguf)
kv:
F16

User reports that offloading KV cache to RAM (with -nkvo) allows fitting the whole model on GPU with f16 KV cache, achieving 19 tps peak and 14 tps during long generation at 65k context. With 128k context and 63 layers on GPU, speed remained similar. KV cache quant to RAM didn't improve performance.

Qwen3.6 27B

RTX 3090 · Ollama · 32,000 ctx

Tone: mixed
quant:
Q6_K (gguf)
codingagentic

User replaced Claude with Qwen3.6-27B in multi-agent orchestrator for 2 weeks. Plan generation good, tool-call reliability poor (12% format error), long-context drift past ~14k tokens, cascade-failure handling weak. Viable as reasoning layer but not execution layer.

quant:
4bit
agenticcoding

User currently runs Qwen3.6-35B-A3B-4bit on M3 Max 128GB for production sub-agent delegations. Also mentions GLM-5.1 for orchestration. Considering building a 5090 rig.

Qwen3.6 27B

RTX 3090 · 128,000 ctx

throughput:
104.0 t/s gen · 1399.0 t/s pp
quant:
Q8
kv:
F16

User recommends getting enough GPUs to avoid VRAM hacks. Uses 2x RTX 3090s.

Tone: positive
throughput:
70.0 t/s gen
creative-writing

User mentions using Qwen3.6-27B on dual RTX 3090s for generating interactive HTML content inline with chat. Reports ~70 t/s.

Qwen3.6 27B

RTX 3060 12GB · llama.cpp · 64,000 ctx

Tone: positive
throughput:
43.3 t/s gen · 456.1 t/s pp
quant:
Q4_K_S (gguf)

Dual RTX 3060 setup with tensor parallel. MTP enabled. Context 64k. Prefill 456 t/s, generation 43.26 t/s at 12k context. Without MTP, context 96k, generation 31 t/s. User praises value and stability of CUDA.

throughput:
3500.0 t/s gen · 30000.0 t/s pp

Two benchmarks: Qwen3.6 27B BF16 and Qwen3.6 35B BF16. For 35B, best gen tps 3500 at 128 concurrency with MTP off, prompt tps 30000. Also tested 27B with MTP on/off.

Qwen3.6 27B

RX 9070 · llama.cpp · 131,072 ctx

Tone: positive
throughput:
46.9 t/s gen · 398.4 t/s pp
quant:
UD-Q5_K_XL (gguf)
flash attention:
on
mtp (multi-token prediction):
on
codingagentic

User runs two RX 9070 XTs with ROCm, uses MTP (spec-type = draft-mtp, spec-draft-n-max = 2). Prompt t/s varies; generation t/s around 45-52. Draft acceptance rate ~0.8-0.99. User praises speed, smarts, steerability for agentic coding tasks. Quant is UD-Q5_K_XL (unsloth GGUF).

Qwen3.6 35B (3B active)

RTX 4070 Ti Super · ik_llama.cpp · 131,072 ctx

Tone: positive
throughput:
110.2 t/s gen
quant:
IQ4_XS-4.19bpw (gguf)
kv:
Q8
mtp (multi-token prediction):
on
codingsummarizationmath

Benchmark comparing llama.cpp (89.76 t/s) vs ik_llama.cpp (110.24 t/s) with MTP on Qwen3.6-35B-A3B IQ4_XS quant. 23% speed increase. CPU: Ryzen 7 9700X, OS: CachyOS. GPU used as secondary with iGPU for display.

Qwen3.6 35B (3B active)

RTX 5080 · llama.cpp · 131,072 ctx

throughput:
56.0 t/s gen · 1584.0 t/s pp
quant:
Q4_K_XL (gguf)
kv:
Q8
flash attention:
on
mtp (multi-token prediction):
off
codingagentic

Best config for 35B Q4_K_XL at 128k context: no MTP, --fit-target 1536. MTP doesn't help at 128k. 27B IQ3 fits fully on GPU and benefits from MTP (73 tok/s).

Qwen3.6 27B

M2 Max 96GB · llama.cpp · 256,000 ctx

Tone: positive
throughput:
8.0 t/s gen
quant:
F16 (gguf)
mtp (multi-token prediction):
on
codingagentic

User benchmarked Qwen 3.6 27b F16 on M2 Max 96GB using llama.cpp with MTP speculative decoding. Generation speed varied 8-18 tok/s depending on task; without MTP got 6.6 tok/s. Used for agentic coding to create a Pacman game. Also tested Q8 quant but results were worse. Context up to 150k+ tokens usable. Chat template fixes were critical.

Qwen3.6 27B

RTX 3090 · BeeLlama · 128,000 ctx

quant:
Q5_K_S (gguf)
kv:
Q8
long-context

Benchmark of KV cache quantization methods using Qwen3.6 27B at 64k and 128k context. Also tested IQ4_XS quant. Article at https://anbeeld.com/articles/kv-cache-quantization-benchmarks-for-long-context

Qwen3.6 27B

RTX 3090 · ik_llama.cpp · 156,000 ctx

Tone: positive
throughput:
72.9 t/s gen · 1261.0 t/s pp
quant:
IQ4_KS (gguf)
kv:
Q8
flash attention:
on
mtp (multi-token prediction):
on
coding

Best setup on RTX 3090 24GB: ik_llama.cpp + Qwen3.6-27B-MTP-IQ4_KS.gguf, 156k context, q8_0/q8_0 KV, MTP, vision on CPU. Prefill 1261 tok/s, decode 72.9 tok/s. Also tested llama.cpp and BeeLlama.

Qwen3.6 27B

AMD MI50 32GB · llama.cpp

quant:
Q8_0 (gguf)
kv:
Q8
mtp (multi-token prediction):
on

Benchmark comparing MTP KV cache quantization (Q8_0) vs no quantization on Qwen3.6-27B-Q8_0 with llama.cpp. Results show negligible wall time difference (~0.14s) with quantized draft KV cache. Also tested with tensor parallelism on 2xMI50 32GB. Aggregate accept rate ~0.735-0.741.

throughput:
21.2 t/s gen
quant:
Q4_K_M (gguf)
mtp (multi-token prediction):
on

MTP enabled with --spec-type draft-mtp --spec-draft-n-max 3. Baseline without MTP: 11.7 tok/s. Also tested Q8_0: 7.4 → 18.1 tok/s (2.44×).

Qwen3.6 27B

AMD Strix Halo 128GB · llama.cpp · 128,000 ctx

Tone: mixed
quant:
Q8_0 (gguf)
flash attention:
on
long-context

Benchmark compares MTP vs non-MTP for 27B and 35B-A3B models. 27B-MTP shows significant speedup in generation and overall wall time for long-context chat; 35B-MTP shows mixed results with faster generation but slower end-to-end due to prefill overhead.

throughput:
21.1 t/s gen
quant:
Q4_K_M (gguf)
coding

Strix Halo 128GB, ROCm backend, Q4_K_M quant, chat workload. Also tested RTX 3090 and RTX 5070. Multiple models and quants reported.

Tone: positive
throughput:
29.0 t/s gen · 239.0 t/s pp

Benchmark of 4x RTX 3090 with Qwen3.6-27B FP16 using vLLM TP=4. Power limit sweep from 200W to unrestricted. Peak efficiency at 220W. User expresses satisfaction with Qwen 3.6 27B as daily driver.

throughput:
3238.0 t/s gen

Benchmark of Qwen3.6-35B-A3B with LDLM diffusion model on RTX 5090 32GB. Throughput 3,238 tok/s at 10 diffusion steps, seq len 64, batch size 1. Also reports ~6,500 tok/s for 4 steps (extrapolated). Untrained weights. Also mentions Qwen3.6-27B at 745 tok/s (10 steps) and ~1,500 tok/s (4 steps).

Qwen3.6 27B unsloth

RTX 3060 12GB · llama.cpp · 32,000 ctx

Tone: positive
throughput:
70.0 t/s gen · 780.0 t/s pp
quant:
Q2-XS (gguf)
kv:
Q8
flash attention:
off
rating:
4/5
codingcreative-writingtool-usesummarizationvisionagenticmultilingual

defnitly want to try an higher qwant. I vé took that one beacause gguf sise 11,9 go, ans barely offload this dense modèle to cpu

Tone: mixed
summarizationtool-use

User is a vet building a dictation/SOAP scribe. Reports inconsistent output from local models (Gemma 4, Qwen 3.6 35B A3B) compared to frontier models. System prompt is a 25-30k token markdown file. Hardware: Core Ultra 9, 128GB RAM, RTX 5090, Proxmox, AnythingLLM + Ollama (llama.cpp).

Qwen3.6 27B Heretic-v2

RTX 4090 · llama.cpp · 262,000 ctx

Tone: positive
throughput:
80.0 t/s gen
quant:
Q4_K_M (gguf)
kv:
Q4

MTP draft acceptance ~73%, TBQ4_0 KV cache, MTP draft 3. Fork: https://github.com/Indras-Mirror/llama.cpp-mtp

Qwen3.6 35B (3B active)

RTX 3060 12GB · llama.cpp · 32,768 ctx

Tone: positive
throughput:
46.8 t/s gen · 914.0 t/s pp
quant:
IQ4_XS (gguf)
kv:
Q8
flash attention:
on
coding

Best plain llama-bench: pp512 ~914 t/s, tg128 ~46.8 t/s. Practical coding profile: 32k context, ~43.4 t/s generation. MTP gave ~47.7 t/s (2% improvement).

Qwen3.6 27B

M2 Max 96GB · llama.cpp · 262,144 ctx

Tone: positive
throughput:
28.0 t/s gen
quant:
Q5_K_M (gguf)
kv:
Q4
codingagentic

MTP speculative decoding gives 2.5x speedup. Tested on M2 Max 96GB with Q5_K_M quant and q4_0 KV cache. Also provides hardware recommendations for various Apple Silicon and NVIDIA GPUs.

Qwen3.6 27B

RTX 5090 · vLLM · 200,000 ctx

Tone: positive
throughput:
73.6 t/s gen · 2883.0 t/s pp
quant:
NVFP4 (safetensors)
kv:
Q8
flash attention:
on
coding

MTP enabled with 3 speculative tokens. KV cache fp8_e4m3. Prefix caching tested. Stability pass at 200k: 10/10 runs. Generation speed varies 59-111 tok/s. Mean MTP acceptance length 2.28.

Tone: mixed
throughput:
215.1 t/s gen
quant:
Q4 (gguf)

MTP grafted model; Q4 speed increase only 6% on 5090. Also tested Q8 on 5090+3090: 148.20 t/s without MTP, 152.02 t/s with MTP.

Qwen3.6 27B

RTX 3090 · llama.cpp · 100,000 ctx

Tone: positive
throughput:
50.0 t/s gen
quant:
Q4_K_M (gguf)
kv:
Q4
flash attention:
on
coding

Uses MTP (Multi-Token Prediction) GGUF with am17an commit for faster inference. KV cache quantized to Q4_0. Speculative draft set to 2. At 100k context, VRAM usage ~19GB. Performance degrades above 90k context.

Qwen3.6 27B

V100 32GB · llama.cpp · 200,000 ctx

Tone: positive
throughput:
54.5 t/s gen
kv:
Q8
codingtool-use

MTP branch of llama.cpp by am17an. 29-30 t/s without MTP, 54-55 t/s with MTP at 150W power limit. Falls to 40-45 t/s after 50k tokens. Used as vscode copilot.

Qwen3.6 27B

RTX 5060 Ti 16GB · llama.cpp · 75,000 ctx

Tone: positive
throughput:
22.0 t/s gen · 760.0 t/s pp
quant:
IQ4_XS (gguf)
kv:
Q8
flash attention:
on

User tested Qwen3.6 27B IQ4_XS on RTX 5060 Ti 16GB with llama.cpp (TheTom's TurboQuant fork). Prompt processing 760 t/s, generation 22 t/s. Context window limited to 75k. KV cache quant turbo4/turbo2. Also tested BF16, Q8_0, Q6_K, Q5_K_XL, Q4_K_XL, IQ4_XS, IQ3_XXS, Q3_K_XL, Q3_K_M, Q2_K_XL on L40S or RTX 5060 Ti. Quality comparison using chess board SVG generation task. Recommends IQ4_XS as minimum.

Tone: positive
throughput:
63.0 t/s gen
quant:
4-bit (mlx)
codingcreative-writing

MTPLX engine achieves 63 tok/s on Qwen3.6-27B 4-bit MLX on M5 Max 64GB, up from 28 tok/s baseline. Uses native MTP heads with temperature 0.6, top_p 0.95, top_k 20. Optimal depth D3. Custom patched MLX fork with Metal kernels.

Qwen3.6 27B

RTX 5000 PRO 48GB · vLLM · 196,608 ctx

Tone: positive
throughput:
80.0 t/s gen
quant:
FP8 (safetensors)
kv:
F16
flash attention:
on
codingagentic

Uses vLLM 0.20.1 with CUDA 12.9. BF16 KV cache. MTP=2 speculative decoding. Performance range 60-90 TPS, reported 80 TPS typical.

Tone: mixed
codingsummarization

Compared Qwen3.6-27B (with and without thinking) against Coder-Next. 27B with thinking disabled was most consistent (95.8% ship rate). 27B and Coder-Next statistically tied overall. Also mentions Qwen3.6-35B-A3B performed poorly and was dropped.

Qwen3.6 35B (3B active)

RTX 2060 Max-Q 6GB · llama.cpp · 65,536 ctx

Tone: positive
throughput:
23.0 t/s gen
kv:
Q8
flash attention:
on
agentic

Running on a 5-year-old laptop with RTX 2060 Max-Q 6GB VRAM and 24GB RAM. Uses llama.cpp with Q8 KV cache, flash attention, and 64k context. Also mentions a long context variant with 128k context using Tom's fork. Model is Qwen3.6-35B-A3B (MoE, 3B active).

Qwen3.6 27B

M5 Max 128GB · MLX · 290,000 ctx

Tone: mixed
throughput:
5.5 t/s gen · 160.0 t/s pp
quant:
Q8 (mlx)
long-context

User reports 160 tok/s prefill, 5-6 tok/s generation (later 4-5 tok/s) on M5 Max 128GB with Qwen 3.6 27B Q8 MLX at 290k context. GPU utilization 36-50%. User feels performance is lower than expected and seeks comparison.

Tone: mixed
throughput:
32.0 t/s pp
quant:
Q6 (gguf)
coding

User is considering adding a second 7900 XTX for 48GB VRAM to run larger models. Currently running Qwen 27B Q6 dense with 32K context at 32 t/s prompt processing. Main use case is coding via opencode.

Qwen3.6 27B

M5 Max 128GB · MLX · 290,000 ctx

Tone: mixed
throughput:
5.5 t/s gen · 160.0 t/s pp
quant:
Q8 (mlx)
long-context

User reports 160 tok/s prefill and 5-6 tok/s generation on M5 Max 128GB with Qwen 3.6 27B Q8 MLX at 290k context. GPU utilization only 36-50%, feels off compared to expected 8-14 tok/s generation. Seeking comparison from others.

Tone: mixed
throughput:
32.0 t/s gen
coding

Qwen 3.6 27B on MacBook Pro M5 Max 64GB: 32 tokens/sec, 18m04s, 33946 tokens. Compared to Gemma 4 31B (27 t/s, 3m51s, 6209 tokens). Qwen showed more creativity but Gemma won for game logic.

Qwen3.6 27B

RTX 3090 · vLLM · 218,000 ctx

Tone: positive
throughput:
66.0 t/s gen
tool-usecoding

~218K context @ ~50/66 TPS (text, narr/code). Tool calls with ~25K-token outputs now complete without OOM after fixing Genesis patch (PN12). Lower TPS than earlier config but higher context + stability.

Qwen3.6 27B

M3 Max 128GB · MLX · 290,000 ctx

Tone: mixed
throughput:
5.5 t/s gen · 160.0 t/s pp
quant:
Q8 (mlx)
long-context

User reports 160 tok/s prefill, 5-6 tok/s generation on M5 Max 128GB with Qwen 3.6 27B Q8 MLX at 290k context. GPU utilization only 36-50%, feels off compared to expected 8-14 tok/s generation. Asks for comparison with other setups.

Qwen3.6 27B

H100 80GB · vLLM · 128,000 ctx

Tone: positive
throughput:
45.0 t/s gen
codingagentic

User rents GPU instance with 2x H100s (160GB VRAM) to run Qwen3.6-27B at 45 t/s. Uses vLLM for inference. Runs multiple agents (Claude Code, QwenCode, social media bots) hitting the API simultaneously. Context length 128K. Cost ~$0.90/hr, spent $120 last month. Model outperformed 120B model in tests.

Qwen3.6 27B

RTX 4090 · LM Studio · 120,000 ctx

Tone: mixed
throughput:
25.0 t/s gen
quant:
Q4_0 (gguf)
kv:
Q4
coding

User reports 3000 tokens in ~2 minutes (25 t/s) with Q4_0 quant, 120k context, both caches quantized to 4_0. Seeking faster performance. Also includes a reply with vLLM benchmark on RTX 3090: 27B INT4 quant, 125K context, TurboQuant 3-bit NC KV cache, MTP speculative decoding, 82 tok/s generation, 0.3-0.6s TTFT.

Qwen3.6 35B (3B active)

16GB VRAM (Vulkan backend, unspecified GPU model) · llama.cpp (llama-server.exe, Vulkan backend) · 80,000 ctx

Tone: positive
throughput:
38.0 t/s gen · 1021.0 t/s pp
quant:
IQ4_XS (gguf)
kv:
Q8
flash attention:
on
codingtool-usesummarizationagentic

Poster used llama-server.exe with Vulkan backend on a 16GB VRAM GPU (model not specified). Model is Qwen3.6-35B-A3B (MoE, 34.66B params, ~4.25 bpw) in IQ4_XS quant from Unsloth. Used --n-gpu-layers 99, --n-cpu-moe 16 (offloading MoE experts to CPU), --threads 14, --batch-size 1024, --ubatch-size 1024, --flash-attn 1, --cache-type-k q8_0, --cache-type-v q8_0, --ctx-size 80000, --cache-ram 2048, --no-mmap. Prompt processing: 1021.05 ± 1.24 t/s (pp80000). Generation: 37.96 ± 0.10 t/s (tg1000). Combined with a pi coding agent for file operations, tool calls, summarizations, and MCP calls. Poster says it's usable for someone with a 16GB GPU.

Qwen3.6 27B

Unknown GPU

Tone: positive
throughput:
27.6 t/s gen
quant:
Q6_K (gguf)
vision

Used open-visual in Open WebUI for image generation. Multiple prompts with generation speeds around 27 t/s.

Qwen3.6 27B

RTX 5090 · vLLM · 262,144 ctx

Tone: positive
throughput:
106.5 t/s gen
quant:
INT4 (safetensors)
kv:
Q8

Qwen3.6-27B-INT4 via vllm 0.19 on 1x RTX 5090. Achieves 105-108 tps generation with 256k context. Uses fp8_e4m3 KV cache, flashinfer attention, MTP speculative decoding (3 tokens). Model from Lorbus quant (AutoRound).

Qwen3.6 35B (3B active)

5070 ti · LM Studio

Tone: positive
throughput:
55.0 t/s gen
quant:
IQ4_XS (gguf)
coding

Switched from Qwen3.6 35b-a3b IQ4_XS to Qwen3.6 27b IQ3_M. 35b-a3b got 50-60 t/s but slow prompt processing; 27b got ~40 t/s but consistent. 27b found a bug the 35b couldn't. Dense model handles compression better than MoE.

Qwen3.6 35B (3B active)

3070 8gb · llama.cpp · 128,000 ctx

Tone: positive
throughput:
30.0 t/s gen
quant:
Q5_K_S (gguf)

User found that larger quants (Q4_K_XL, Q5_K_S) gave better speed than smaller Q4 (IQ4_XS) on MoE model Qwen3.6-35B-A3B. With Q5_K_S, ~30 t/s at 128k context. Also tested Q4_K_XL: 32 t/s. IQ4_XS gave 25-30 t/s with 32k context. System: RTX 3070 8GB + 64GB DDR4.

Qwen3.6 27B

RTX 5090 · vLLM · 218,000 ctx

Tone: positive
throughput:
80.0 t/s gen
quant:
NVFP4 (safetensors)

Qwen3.6-27B at ~80 tps with 218k context window on 1x RTX 5090 served by vllm 0.19.1rc1. Uses NVFP4 quantization.

Qwen3.6 26.9B

5070Ti 16GB, 2060 6GB · llama-server · 128,000 ctx

Tone: positive
throughput:
19.2 t/s gen · 186.8 t/s pp
quant:
Q4_K_M (gguf)
kv:
Q8

User reports using a 5070Ti 16GB and a 2060 6GB to run Qwen3.6-27B Q4_K_M with llama-server. At 71k actual context, pp=186.76 t/s, tg=19.21 t/s. Also provides llama-bench results with CUDA showing tg speeds around 16-25 t/s depending on configuration.

Qwen3.6 27B

Unknown GPU · llama-cpp-python · 32,768 ctx

Tone: positive
throughput:
22.5 t/s gen
quant:
Q4_K_M (gguf)
codingtool-use

Evaluated Qwen 3.6 27B across BF16, Q4_K_M, and Q8_0 GGUF quant variants with llama-cpp-python using Neo AI Engineer. Benchmarks: HumanEval, HellaSwag, BFCL. Q4_K_M best practical variant: 1.45x faster than BF16, 48% less peak RAM, 68.8% smaller model file, nearly identical function calling score.

Qwen3.6 27B

RTX 3090 · Luce DFlash (custom C++/CUDA stack on ggml) · 256,000 ctx

Tone: positive
throughput:
913.0 t/s pp
quant:
Q4_K_M (gguf)
kv:
Q4
flash attention:
on
codingmath

Speculative decoding (DFlash) on single RTX 3090. Target: Qwen3.6-27B Q4_K_M GGUF (~16 GB). Draft: z-lab Qwen3.6-27B-DFlash bf16 (~3.46 GB). DDTree tree-verify, block size 16, budget 22, greedy verify. KV cache compressed to TQ3_0 (3.5 bpv, ~9.7x vs F16) with 4096-slot ring buffer enabling 256K context in 24 GB. Sliding-window flash attention (2048-token window) at decode. Prefill ubatch auto-bumps from 16 to 192 for prompts >2048 tokens. OpenAI-compatible HTTP endpoint. CUDA only, no Metal/ROCm/multi-GPU. Bit-identical output to autoregressive in AR mode; draft matches z-lab PyTorch reference at cos sim 0.999812.