RTX 3090

NVIDIA · 24GB · 7 reports

Qwen3 · 27B · Qwen3.6-27B

RTX 3090 · 256,000 ctx

913.0 t/s pp
quant: Q4_K_M (gguf)
kv: Q4
flash-attn: on
coding · math

Speculative decoding (DFlash) on single RTX 3090. Target: Qwen3.6-27B Q4_K_M GGUF (~16 GB). Draft: z-lab Qwen3.6-27B-DFlash bf16 (~3.46 GB). DDTree tree-verify, block size 16, budget 22, greedy verify. KV cache compressed to TQ3_0 (3.5 bpv, ~9.7x vs F16) with 4096-slot ring buffer enabling 256K context in 24 GB. Sliding-window flash attention (2048-token window) at decode. Prefill ubatch auto-bumps from 16 to 192 for prompts >2048 tokens. OpenAI-compatible HTTP endpoint. CUDA only, no Metal/ROCm/multi-GPU. Bit-identical output to autoregressive in AR mode; draft matches z-lab PyTorch reference at cos sim 0.999812.
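Since the only interface is an OpenAI-compatible HTTP endpoint, any stock OpenAI-style client can drive it. A minimal sketch in Python using `requests`; the base URL and served-model id are assumptions, since the report doesn't state the listen address or model name:

```python
import requests

# Assumed listen address -- the report doesn't give one.
BASE_URL = "http://localhost:8080/v1"

resp = requests.post(
    f"{BASE_URL}/chat/completions",
    json={
        # Hypothetical served-model id, not confirmed by the report.
        "model": "qwen3.6-27b-dflash",
        "messages": [
            {"role": "user", "content": "Write a binary search in Python."}
        ],
        "temperature": 0.0,  # greedy sampling, matching the greedy-verify setup
        "max_tokens": 512,
    },
    timeout=300,
)
resp.raise_for_status()
body = resp.json()
print(body["choices"][0]["message"]["content"])
# OpenAI-compatible servers typically return prompt/completion token
# counts in body["usage"], which is what t/s figures are computed from.
```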

Unknown family

RTX 3090

27.5 t/s gen
quant: Q6

User reports a maximum of 26-29 t/s with a Q6 quant on a single RTX 3090 and asks what quant the OP is using. The model appears to be Qwen3.6-27B (likely a variant or fine-tune of Qwen3).

37.0 t/s gen
quant: IQ4

User reports 37 t/s with llama-server (llama.cpp), using an IQ4 quant because Q4_K_M runs out of memory. The model is "Luce DFlash: Qwen3.6-27B", which appears to be a 27B-parameter Qwen3 variant.
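The OOM is plausible from a back-of-the-envelope VRAM budget: the parent report puts the Q4_K_M target at ~16 GB and the bf16 draft at ~3.46 GB, and the KV cache plus runtime overhead must also fit in 24 GB. A rough sketch; the IQ4 weight size, KV size, and overhead figures are assumptions, not numbers from the thread:

```python
# Rough single-RTX-3090 VRAM budget: target weights + draft weights
# + KV cache + runtime overhead must stay under 24 GB.
VRAM_GB = 24.0

DRAFT_BF16_GB = 3.46   # from the parent report
KV_CACHE_GB = 4.0      # assumption: depends heavily on context length
OVERHEAD_GB = 1.0      # assumption: CUDA context, compute buffers, fragmentation

targets = {
    "Q4_K_M": 16.0,    # from the parent report
    "IQ4": 14.5,       # assumption: IQ4-class quants are somewhat smaller
}

for name, weights_gb in targets.items():
    total = weights_gb + DRAFT_BF16_GB + KV_CACHE_GB + OVERHEAD_GB
    verdict = "fits" if total <= VRAM_GB else "OOM"
    print(f"{name}: {total:.2f} GB / {VRAM_GB:.0f} GB -> {verdict}")
```

With these (assumed) figures, Q4_K_M lands just over the 24 GB line while IQ4 squeaks under, matching the reported behavior.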

Unknown family

RTX 3090

13.0 t/s gen
quant: UD-IQ4_XS

User reports 13 t/s with Qwen3.6-27B UD-IQ4_XS on a single RTX 3090 and expresses concern that this seems low compared to the post's claim of up to 2x throughput.

Unknown family

RTX 3090

65.0 t/s gen

Same setup as earlier in the thread. The speculative decoding mentioned earlier helped raise generation speed to about 65 t/s on average.
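The jump to ~65 t/s from the ~27-37 t/s reported elsewhere in the thread is in line with the post's up-to-2x claim. As a sanity check, classic chain speculative decoding with per-token acceptance probability α and draft length k accepts an expected (1 − α^(k+1))/(1 − α) tokens per target pass; DFlash's tree verification differs, so this is only an intuition sketch, and α and the overhead figure below are assumptions:

```python
# Expected tokens accepted per target-model forward pass for classic
# chain speculative decoding: E = sum_{i=0}^{k} alpha**i
#                               = (1 - alpha**(k + 1)) / (1 - alpha)
def expected_accepted(alpha: float, k: int) -> float:
    return (1 - alpha ** (k + 1)) / (1 - alpha)

baseline_tps = 27.0   # autoregressive t/s reported upthread (Q6 run)
k = 16                # block size from the parent report
alpha = 0.65          # assumption: per-token acceptance rate
overhead = 0.20       # assumption: drafting + verify cost per block

e = expected_accepted(alpha, k)
speedup = e / (1 + overhead)
print(f"E[accepted] = {e:.2f} tokens/pass, "
      f"~{speedup:.2f}x over AR -> ~{baseline_tps * speedup:.0f} t/s")
```

With α ≈ 0.65 this lands close to the reported ~65 t/s, though the real acceptance rate and overheads are unknown.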

Unknown family

RTX 3090

23.0 t/s gen
quant: Q8_0 (gguf)

Same setup as the parent post (Luce DFlash: Qwen3.6-27B). Running Unsloth's Q8_0 with full context at ~23 t/s, with the cards set to a 75% power limit.
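For reference, the RTX 3090's default board power limit is 350 W, so a 75% cap is ~262 W, set per card with nvidia-smi (requires root; the device indices below are assumptions):

```python
import subprocess

# RTX 3090 reference board power is 350 W; 75% of that is ~262 W.
limit_w = int(350 * 0.75)  # 262

# The report mentions multiple cards, so apply the cap per device
# index (which indices are in use is an assumption).
for gpu_index in (0, 1):
    subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index), "-pl", str(limit_w)],
        check=True,
    )
```

Decode throughput is largely memory-bandwidth-bound, so a moderate power cap usually costs only a few percent of t/s, consistent with still seeing ~23 t/s here.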

Qwen3 · 27B · Qwen 3.6 27B

RTX 3090 · llama.cpp · 125,000 ctx

85.0 t/s gen
quant: turbo quant
coding · agentic · tool-use

User reports Qwen 3.6 27B working in Claude Code (CC) with 125k context and a turbo quant on a single RTX 3090. Speculative decoding produced garbage output. The user fixed repetition issues by launching Claude with --model claude-3-haiku-20240307 instead of claude-4-haiku (disabling extended thinking). The 85 t/s generation speed is cited from the linked Medium article, whose setup the user implemented. The user also tried AWQ and GGUF formats with llama.cpp, vLLM, and LM Studio before settling on this config.