RTX 4090

NVIDIA · 24GB · 3 reports


Gemma 4 30.7B Instruct

RTX 4090 · Ollama

throughput:: 7.8 t/s gen · 26.3 t/s pp
quant:: Q4_K_M (gguf)

text-generation

23.5GB of the card's 24GB VRAM in use (maxed out). The VRAM bottleneck causes slow generation. Source: n1n.ai
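A rough estimate of why this model sits so close to the 24GB limit can be sketched as below. The figure of about 4.8 bits per weight for Q4_K_M is an assumption (the effective average varies by model), the helper name `weight_footprint_gb` is hypothetical, and the estimate covers the weights only, not the KV cache or runtime buffers.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


# 30.7B parameters comes from the report title.
# ASSUMPTION: ~4.8 bits/weight as an average for Q4_K_M; not from the report.
weights_gb = weight_footprint_gb(30.7, 4.8)

# 24 GB is the card's VRAM; what remains must hold KV cache and buffers.
headroom_gb = 24.0 - weights_gb

print(f"weights ~ {weights_gb:.1f} GB, headroom ~ {headroom_gb:.1f} GB")
```

Under these assumptions the weights alone take most of the card, which would be consistent with the reported 23.5GB usage once cache and buffers are added.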

Gemma 4 26B Instruct (MoE, 25.2B total, 3.8B active)

RTX 4090 · Ollama

throughput:: 149.6 t/s gen · 15.6 t/s pp
quant:: Q4_K_M (gguf)

text-generation

~150 tok/s generation. Star performer. Source: n1n.ai

Qwen3.6 27B

RTX 4090 · LM Studio · 120,000 ctx

throughput:: 25.0 t/s gen
quant:: Q4_0 (gguf)
kv:: Q4

coding

A user reports generating 3,000 tokens in about 2 minutes (25 t/s) with a Q4_0 quant at 120k context, with both the K and V caches quantized to Q4_0, and is looking for faster performance. The thread also includes a reply with a vLLM benchmark on an RTX 3090: 27B INT4 quant, 125K context, TurboQuant 3-bit NC KV cache, MTP speculative decoding, 82 tok/s generation, 0.3-0.6s TTFT.
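The 25 t/s figure in the note follows directly from the reported token count and wall-clock time. A minimal sketch of that arithmetic (the function name `tokens_per_second` is hypothetical):

```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    """Average generation throughput over a whole run."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return tokens / seconds


# 3000 tokens in roughly 2 minutes, as stated in the report.
rate = tokens_per_second(3000, 2 * 60)
print(f"{rate:.1f} t/s")  # 25.0 t/s
```

Since the duration is given only as "~2 minutes", the result is an average over the run and approximate to the same degree.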