- throughput: 7.5 t/s gen
- flash-attn: on
User reports poor performance with Gemma4-31B (7.5 tok/s) and Qwen3.6-27B (locking up) on M5 Max 128GB, while Qwen3.6-35B-A3 is fast. Mentions using DFLASH.
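A back-of-envelope check makes the dense-vs-MoE gap in this report concrete: single-stream decoding is memory-bandwidth-bound, so the ceiling is roughly bandwidth divided by the bytes of active weights read per token. The bandwidth and quantization figures in this sketch are illustrative assumptions, not measurements from the report.

```python
# Single-stream decoding is memory-bandwidth-bound: every generated token has
# to read all active weights, so tok/s <= bandwidth / bytes_read_per_token.
# All figures below are illustrative assumptions, not measurements.

def decode_ceiling_tps(active_params_b: float, bits_per_weight: float,
                       bandwidth_gbs: float) -> float:
    """Upper bound on tokens/s if each active weight is read once per token."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gbs * 1e9 / bytes_per_token

BW = 546  # GB/s -- assumed unified-memory bandwidth for an M-series Max chip

# Dense 31B at ~4.5 bits/weight: all 31B parameters stream through per token.
print(f"dense 31B:  {decode_ceiling_tps(31, 4.5, BW):6.1f} tok/s ceiling")
# MoE 35B-A3: only ~3B parameters are active per token.
print(f"MoE 35B-A3: {decode_ceiling_tps(3, 4.5, BW):6.1f} tok/s ceiling")
```

On these assumptions the dense 31B tops out near ~31 tok/s while the 35B-A3 MoE could reach several hundred, which is consistent with the 35B-A3 feeling fast and suggests the reported 7.5 tok/s dense run is also losing speed somewhere below the bandwidth ceiling.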
Google DeepMind · 44 reports
| Engine | Avg t/s | Range (t/s) | Reports |
|---|---|---|---|
| llama.cpp | 7.7 | 1–17 | 8 |
| Ollama | 26.7 | 8–150 | 33 |
~0.5-2 tok/s CPU-only. The 26B MoE is painfully slow without a GPU. Source: gemma4-ai.com hardware guide
AMD Threadripper 256GB · llama.cpp
~2-5 tok/s on CPU. E4B usable but slow. Source: gemma4-ai.com hardware guide
AMD Threadripper 256GB · llama.cpp
~5-10 tok/s on CPU. E2B is usable CPU-only. Source: gemma4-ai.com hardware guide
~18 tok/s on Intel Arc B580 12GB. 26B MoE fits tight — short context only. Source: compute-market.com
~30 tok/s on Intel Arc B580 12GB. Handles E4B comfortably. Source: compute-market.com
~40-50 tok/s on RTX 5060 Ti 16GB. Blackwell FP4 native support. Top pick for 26B MoE. Source: compute-market.com
~30 tok/s on RTX 4070 12GB. 26B MoE fits at Q4 with short context. Source: compute-market.com
RTX 4070 · Ollama
~55 tok/s on RTX 4070 12GB. Ada Lovelace efficiency. Source: estimated from compute-market tiers
~8 tok/s on RTX 4060 Ti 16GB. 31B at Q4 barely fits — very limited context. Source: compute-market.com
~15 tok/s. 26B MoE at Q8 on 16GB — tight but runs. Source: compute-market.com
~25 tok/s on RTX 4060 Ti 16GB. 26B MoE Q4 fits with 8K context — the sweet spot. Source: compute-market.com
~45 tok/s on RTX 4060 Ti 16GB. E4B at Q4. Source: compute-market.com
~60 tok/s on RTX 3060 12GB. E2B runs effortlessly. Source: estimated from compute-market tiers
~25 tok/s on RTX 3060 12GB. 26B MoE Q4 fits with ~8K context. Great value option. Source: compute-market.com
RTX 3060 12GB · Ollama
~45 tok/s on RTX 3060 12GB. E4B fits easily. Source: compute-market.com
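Several of the 12-16GB entries above come down to the same arithmetic: quantized weights plus a KV cache that grows linearly with context must fit under the VRAM ceiling. A minimal fit check follows; the layer/head counts and weight footprint are hypothetical placeholders, since the actual Gemma 4 dimensions aren't given in these reports.

```python
# Rough "does it fit" check: VRAM must hold the quantized weights plus a KV
# cache that grows linearly with context. All architecture numbers below are
# hypothetical placeholders, not published Gemma 4 dimensions.

def kv_cache_gb(ctx: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_elem: int = 2) -> float:
    """KV cache bytes per token = 2 (K and V) * layers * kv_heads * head_dim."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * ctx / 1e9

weights_gb = 9.5    # assumed quantized footprint of the 26B MoE
overhead_gb = 1.0   # assumed runtime/activation overhead
vram_gb = 12.0      # e.g. an RTX 3060 12GB

for ctx in (2048, 8192, 32768):
    kv = kv_cache_gb(ctx, layers=40, kv_heads=8, head_dim=128)
    need = weights_gb + overhead_gb + kv
    verdict = "fits" if need <= vram_gb else "does not fit"
    print(f"ctx={ctx:6d}: {need:5.1f} GB needed -> {verdict} in {vram_gb:.0f} GB")
```

On these placeholder numbers, 8K context just squeezes into 12GB while 32K does not, matching the "fits with ~8K context" pattern in the reports above.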
RTX 5090 · 50,000 ctx
nvidia/Gemma-4-26B-A4B-NVFP4 runs on a 5090 with 80% VRAM allocation (of 32GB), leaving room for around 50k tokens of context. Model size: 18.8GB. Reported benchmarks:

| Benchmark | Baseline | NVFP4 |
|---|---|---|
| GPQA Diamond | 80.30% | 79.90% |
| AIME 2025 | 88.95% | 90.00% |
| MMLU Pro | 85.00% | 84.80% |
| LiveCodeBench (pass@1) | 80.50% | 79.80% |
| IFBench | 77.77% | 78.10% |
| IFEval | 96.60% | 96.40% |
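The report doesn't name the serving engine. A minimal sketch of one plausible setup, assuming vLLM (whose gpu_memory_utilization knob matches the "80% allocation" description) and the model id quoted above:

```python
# One way to reproduce the reported setup. The engine choice (vLLM) is an
# assumption; the model id, 80% allocation, and ~50k context come from the report.
from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/Gemma-4-26B-A4B-NVFP4",  # model id from the report
    gpu_memory_utilization=0.80,           # "80% allocation (of 32GB)"
    max_model_len=50_000,                  # ~50k context the report reached
)

out = llm.generate(
    ["Explain NVFP4 quantization in one sentence."],
    SamplingParams(max_tokens=64, temperature=0.0),
)
print(out[0].outputs[0].text)
```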
AMD Threadripper 256GB · llama.cpp
CPU-only, no GPU. Beat the 4090 on 31B generation speed (8.8 vs 7.8 t/s): the 31B model maxes out the 4090's 24GB VRAM and bottlenecks there, while 256GB of system RAM imposes no such limit. Source: n1n.ai
RTX 4090 · Ollama
23.5GB of 24GB VRAM in use; the VRAM bottleneck slows generation. Source: n1n.ai
~150 tok/s generation. Star performer. Source: n1n.ai
RTX A6000 48GB · llama.cpp
bf16, no quantization. 43.82GB VRAM. Very slow: ~0.5 tok/s (one token every 2s). Source: dev.to Gaurav Vij
bf16 no quantization. 42.30GB VRAM. 18x faster than dense 31B. Source: dev.to Gaurav Vij
RTX A6000 48GB · llama.cpp
bf16 no quantization. ~16GB VRAM. Source: dev.to Gaurav Vij
RTX A6000 48GB · llama.cpp
bf16 no quantization. 10.25GB VRAM, 61ms TTFT. Source: dev.to Gaurav Vij
M4 Max 64GB · Ollama
20-26 tok/s range. Source: gemma4-ai.com
M4 Max 36GB · Ollama
18-24 tok/s range. Source: gemma4-ai.com
16-22 tok/s range. Source: gemma4-ai.com
M4 16GB · Ollama
20-26 tok/s range. Source: gemma4-ai.com
M3 Max 48GB · Ollama
16-20 tok/s range. Source: gemma4-ai.com
M3 Max 36GB · Ollama
14-18 tok/s range. Source: gemma4-ai.com
12-16 tok/s range. Source: gemma4-ai.com
M3 16GB · Ollama
18-24 tok/s range. Source: gemma4-ai.com
M3 8GB · Ollama
16-20 tok/s range. Source: gemma4-ai.com
M2 Ultra 64GB · Ollama
12-16 tok/s range. Source: gemma4-ai.com
14-18 tok/s range. Source: gemma4-ai.com
10-14 tok/s range. Source: gemma4-ai.com
M2 16GB · Ollama
16-20 tok/s range. Source: gemma4-ai.com
M2 8GB · Ollama
14-18 tok/s range. Source: gemma4-ai.com
M1 Ultra 64GB · Ollama
10-14 tok/s range. Source: gemma4-ai.com
8-12 tok/s range. Source: gemma4-ai.com
M1 Pro 16GB · Ollama
18-22 tok/s range. Source: gemma4-ai.com
M1 16GB · Ollama
12-16 tok/s range. Source: gemma4-ai.com
M1 8GB · Ollama
15-20 tok/s range, usable for simple tasks. Source: gemma4-ai.com
M5 Pro (18-core CPU, 20-core GPU) · Ollama
User reports very fast performance on M5 Pro with 48GB RAM. Prompt eval rate: 204.07 t/s, generation rate: 68.76 t/s.
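The prompt-eval and generation rates quoted here match the format of `ollama run --verbose` output; the same numbers can be computed from the timing fields Ollama's HTTP API returns. A small sketch, with the model tag as an assumption:

```python
# Compute prompt-eval and generation rates from Ollama's /api/generate
# response. Durations are reported in nanoseconds. The model tag below is
# an assumed placeholder, not a tag confirmed by the report.
import requests

r = requests.post("http://localhost:11434/api/generate", json={
    "model": "gemma4",  # assumed tag
    "prompt": "Summarize flash attention in two sentences.",
    "stream": False,
})
stats = r.json()

prompt_tps = stats["prompt_eval_count"] / stats["prompt_eval_duration"] * 1e9
gen_tps = stats["eval_count"] / stats["eval_duration"] * 1e9
print(f"prompt eval rate: {prompt_tps:.2f} t/s, generation rate: {gen_tps:.2f} t/s")
```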