llamaperf

RTX 5090

NVIDIA · 32GB · 3 reports

Qwen3.6

RTX 5090 · vLLM · 262,144 ctx

Tone: positive
throughput:
106.5 t/s gen
quant:
INT4 (safetensors)
kv:
FP8 (e4m3)

Qwen3.6-27B-INT4 via vLLM 0.19 on 1x RTX 5090. Achieves 105-108 t/s generation with 256k context. Uses fp8_e4m3 KV cache, FlashInfer attention, and MTP speculative decoding (3 tokens). Model from Lorbus quant (AutoRound).
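A setup like the one reported could be launched roughly as below. This is a sketch, not the reporter's actual command: the model path, port, and the exact speculative-decoding syntax are assumptions, and flag names for MTP vary between vLLM versions.

```shell
# Sketch of a vLLM launch matching the reported config (assumptions noted above).
# FlashInfer attention backend is selected via environment variable.
VLLM_ATTENTION_BACKEND=FLASHINFER \
vllm serve Lorbus/Qwen3.6-27B-AutoRound-INT4 \
  --max-model-len 262144 \
  --kv-cache-dtype fp8_e4m3 \
  --gpu-memory-utilization 0.95 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'
```

The fp8_e4m3 KV cache roughly halves KV memory versus fp16, which is what makes a 256k context fit alongside an INT4 27B model in 32GB.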

Qwen3.6

RTX 5090 · vLLM · 218,000 ctx

Tone: positive
throughput:
80.0 t/s gen
quant:
NVFP4 (safetensors)

Qwen3.6-27B at ~80 t/s with a 218k context window on 1x RTX 5090, served by vLLM 0.19.1rc1. Uses NVFP4 quantization.

Qwen3.6 27B

RTX 5090 · llama.cpp · 200,000 ctx

Tone: positive
quant:
IQ4_XS (gguf)
kv:
Q8
rating:
5/5
coding · tool-use

User reports Qwen 3.6 27B is excellent for PySpark/Python work and data-transformation debugging. Running on an ASUS ROG Strix SCAR 18 with a laptop RTX 5090 (24GB VRAM) and 64GB DDR5 RAM. Using llama.cpp with the IQ4_XS quant at 200k context and a Q8_0 KV cache; initially tried Q4_K_M with a Q4_0 KV cache. Cancelling cloud subscriptions due to local performance. No tokens/sec reported.
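The reported llama.cpp setup could be reproduced with a server launch along these lines. The model filename is an assumption, and `--flash-attn` syntax differs slightly across llama.cpp builds:

```shell
# Sketch of a llama-server launch matching the report (filename assumed).
# -c sets the context window; --cache-type-k/v quantize the KV cache to Q8_0.
llama-server \
  -m Qwen3.6-27B-IQ4_XS.gguf \
  -c 200000 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  -ngl 99
```

`-ngl 99` offloads all layers to the GPU; with 24GB VRAM the Q8_0 KV cache is what keeps a 200k context feasible next to the IQ4_XS weights.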