- throughput: 138.0 t/s gen
Benchmark of Gemma 4 26B-A4B vs. 12B on an RTX 4090. The 26B-A4B used 15GB VRAM and generated 138 tok/s; the 12B used 9GB and generated 80 tok/s. The 26B-A4B won every scene and ran ~1.7x faster (138 / 80 ≈ 1.7). The 12B is ideal for a 16GB laptop.
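The ~1.7x figure can be checked from the two reported generation rates. The sketch below is only illustrative arithmetic on the numbers quoted in this report; the helper name `speedup` and the tok/s-per-GB comparison are assumptions added here, not part of the original benchmark.

```python
def speedup(fast_tps: float, slow_tps: float) -> float:
    """Ratio of two generation throughputs (tokens per second)."""
    if slow_tps <= 0:
        raise ValueError("slow_tps must be positive")
    return fast_tps / slow_tps


# Figures quoted in the report above (RTX 4090).
tps_26b, vram_26b_gb = 138.0, 15.0
tps_12b, vram_12b_gb = 80.0, 9.0

ratio = speedup(tps_26b, tps_12b)
print(f"26B-A4B vs 12B speedup: {ratio:.3f}x")  # 1.725x, i.e. ~1.7x

# Hypothetical extra metric: throughput per GB of VRAM used.
print(f"26B-A4B: {tps_26b / vram_26b_gb:.2f} tok/s per GB")
print(f"12B:     {tps_12b / vram_12b_gb:.2f} tok/s per GB")
```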
NVIDIA · 24GB · 4 reports
RTX 4090 · llama.cpp · 262,000 ctx
MTP draft acceptance ~73%, TBQ4_0 KV cache, MTP draft 3. Fork: https://github.com/Indras-Mirror/llama.cpp-mtp
~150 tok/s generation. Star performer. Source: n1n.ai
RTX 4090 · LM Studio · 120,000 ctx
A user reports generating 3,000 tokens in ~2 minutes (25 t/s) with a Q4_0 quant at 120k context, with both KV caches (K and V) quantized to Q4_0, and is looking for faster performance. A reply gives a vLLM benchmark on an RTX 3090: 27B INT4 quant, 125K context, TurboQuant 3-bit NC KV cache, MTP speculative decoding, 82 tok/s generation, 0.3-0.6s TTFT.
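The 25 t/s figure follows directly from the token count and wall-clock time. A minimal sketch of that conversion is below. The helper names are hypothetical, and comparing against the 82 tok/s reply is only indicative, since that run used a different GPU, runtime, and model.

```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    """Average generation throughput over a run."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return tokens / seconds


def seconds_for(tokens: int, tps: float) -> float:
    """Time needed to generate `tokens` at a steady `tps`."""
    if tps <= 0:
        raise ValueError("tps must be positive")
    return tokens / tps


# LM Studio report: 3000 tokens in roughly 2 minutes.
lm_studio_tps = tokens_per_second(3000, 2 * 60)
print(f"LM Studio run: {lm_studio_tps:.1f} t/s")  # 25.0 t/s

# Same 3000 tokens at the 82 tok/s quoted in the vLLM reply.
vllm_seconds = seconds_for(3000, 82.0)
print(f"At 82 tok/s: {vllm_seconds:.1f} s")
```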