- throughput: 7.5 t/s gen
- flash-attn: on
User reports poor performance with Gemma4-31B (7.5 tok/s) and Qwen3.6-27B (locking up) on M5 Max 128GB, while Qwen3.6-35B-A3 is fast. Mentions using DFLASH.
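A back-of-envelope check makes the dense-vs-MoE gap in this report concrete: single-stream decoding is memory-bandwidth-bound, so the ceiling is roughly bandwidth divided by the bytes of active weights read per token. The bandwidth and quantization figures in this sketch are illustrative assumptions, not measurements from the report.

```python
# Single-stream decoding is memory-bandwidth-bound: every generated token has
# to read all active weights, so tok/s <= bandwidth / bytes_read_per_token.
# All figures below are illustrative assumptions, not measurements.

def decode_ceiling_tps(active_params_b: float, bits_per_weight: float,
                       bandwidth_gbs: float) -> float:
    """Upper bound on tokens/s if each active weight is read once per token."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gbs * 1e9 / bytes_per_token

BW = 546  # GB/s -- assumed unified-memory bandwidth for an M-series Max chip

# Dense 31B at ~4.5 bits/weight: all 31B parameters stream through per token.
print(f"dense 31B:  {decode_ceiling_tps(31, 4.5, BW):6.1f} tok/s ceiling")
# MoE 35B-A3: only ~3B parameters are active per token.
print(f"MoE 35B-A3: {decode_ceiling_tps(3, 4.5, BW):6.1f} tok/s ceiling")
```

On these assumptions the dense 31B tops out near ~31 tok/s while the 35B-A3 MoE could reach several hundred, which is consistent with the 35B-A3 feeling fast and suggests the reported 7.5 tok/s dense run is also losing speed somewhere below the bandwidth ceiling.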
Google DeepMind · 44 reports
| Engine | Avg t/s | Range (t/s) | Reports |
|---|---|---|---|
| llama.cpp | 7.7 | 1–17 | 8 |
| Ollama | 26.7 | 8–150 | 33 |
~0.5-2 tok/s CPU-only. The 26B MoE is painfully slow without a GPU. Source: gemma4-ai.com hardware guide
AMD Threadripper 256GB · llama.cpp
~2-5 tok/s on CPU. E4B usable but slow. Source: gemma4-ai.com hardware guide
AMD Threadripper 256GB · llama.cpp
~5-10 tok/s on CPU. E2B is usable CPU-only. Source: gemma4-ai.com hardware guide
~18 tok/s on Intel Arc B580 12GB. 26B MoE fits tight — short context only. Source: compute-market.com
~30 tok/s on Intel Arc B580 12GB. Handles E4B comfortably. Source: compute-market.com
~40-50 tok/s on RTX 5060 Ti 16GB. Blackwell FP4 native support. Top pick for 26B MoE. Source: compute-market.com
~30 tok/s on RTX 4070 12GB. 26B MoE fits at Q4 with short context. Source: compute-market.com
RTX 4070 · Ollama
~55 tok/s on RTX 4070 12GB. Ada Lovelace efficiency. Source: estimated from compute-market tiers
~8 tok/s on RTX 4060 Ti 16GB. 31B at Q4 barely fits — very limited context. Source: compute-market.com
~15 tok/s. 26B MoE at Q8 on 16GB — tight but runs. Source: compute-market.com
~25 tok/s on RTX 4060 Ti 16GB. 26B MoE Q4 fits with 8K context — the sweet spot. Source: compute-market.com
~45 tok/s on RTX 4060 Ti 16GB. E4B at Q4. Source: compute-market.com
~60 tok/s on RTX 3060 12GB. E2B runs effortlessly. Source: estimated from compute-market tiers
~25 tok/s on RTX 3060 12GB. 26B MoE Q4 fits with ~8K context. Great value option. Source: compute-market.com
RTX 3060 12GB · Ollama
~45 tok/s on RTX 3060 12GB. E4B fits easily. Source: compute-market.com
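Several of the 12-16GB entries above come down to the same arithmetic: quantized weights plus a KV cache that grows linearly with context must fit under the VRAM ceiling. A minimal fit check follows; the layer/head counts and weight footprint are hypothetical placeholders, since the actual Gemma 4 dimensions aren't given in these reports.

```python
# Rough "does it fit" check: VRAM must hold the quantized weights plus a KV
# cache that grows linearly with context. All architecture numbers below are
# hypothetical placeholders, not published Gemma 4 dimensions.

def kv_cache_gb(ctx: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_elem: int = 2) -> float:
    """KV cache bytes per token = 2 (K and V) * layers * kv_heads * head_dim."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * ctx / 1e9

weights_gb = 9.5    # assumed quantized footprint of the 26B MoE
overhead_gb = 1.0   # assumed runtime/activation overhead
vram_gb = 12.0      # e.g. an RTX 3060 12GB

for ctx in (2048, 8192, 32768):
    kv = kv_cache_gb(ctx, layers=40, kv_heads=8, head_dim=128)
    need = weights_gb + overhead_gb + kv
    verdict = "fits" if need <= vram_gb else "does not fit"
    print(f"ctx={ctx:6d}: {need:5.1f} GB needed -> {verdict} in {vram_gb:.0f} GB")
```

On these placeholder numbers, 8K context just squeezes into 12GB while 32K does not, matching the "fits with ~8K context" pattern in the reports above.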
RTX 5090 · 50,000 ctx
nvidia/Gemma-4-26B-A4B-NVFP4 runs on a 5090 with 80% VRAM allocation (of 32GB), leaving room for around 50k tokens of context. Model size: 18.8GB. Reported benchmarks:

| Benchmark | Baseline | NVFP4 |
|---|---|---|
| GPQA Diamond | 80.30% | 79.90% |
| AIME 2025 | 88.95% | 90.00% |
| MMLU Pro | 85.00% | 84.80% |
| LiveCodeBench (pass@1) | 80.50% | 79.80% |
| IFBench | 77.77% | 78.10% |
| IFEval | 96.60% | 96.40% |
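The report doesn't name the serving engine. A minimal sketch of one plausible setup, assuming vLLM (whose gpu_memory_utilization knob matches the "80% allocation" description) and the model id quoted above:

```python
# One way to reproduce the reported setup. The engine choice (vLLM) is an
# assumption; the model id, 80% allocation, and ~50k context come from the report.
from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/Gemma-4-26B-A4B-NVFP4",  # model id from the report
    gpu_memory_utilization=0.80,           # "80% allocation (of 32GB)"
    max_model_len=50_000,                  # ~50k context the report reached
)

out = llm.generate(
    ["Explain NVFP4 quantization in one sentence."],
    SamplingParams(max_tokens=64, temperature=0.0),
)
print(out[0].outputs[0].text)
```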
AMD Threadripper 256GB · llama.cpp
CPU-only, no GPU. Beat the 4090 on 31B generation speed (8.8 vs 7.8 t/s): the 31B model maxes out the 4090's 24GB VRAM and bottlenecks there, while 256GB of system RAM imposes no such limit. Source: n1n.ai
RTX 4090 · Ollama
23.5GB of 24GB VRAM in use; the VRAM bottleneck slows generation. Source: n1n.ai
~150 tok/s generation. Star performer. Source: n1n.ai
RTX A6000 48GB · llama.cpp
bf16, no quantization. 43.82GB VRAM. Very slow: ~0.5 tok/s (one token every 2s). Source: dev.to Gaurav Vij
bf16 no quantization. 42.30GB VRAM. 18x faster than dense 31B. Source: dev.to Gaurav Vij
RTX A6000 48GB · llama.cpp
bf16 no quantization. ~16GB VRAM. Source: dev.to Gaurav Vij
RTX A6000 48GB · llama.cpp
bf16 no quantization. 10.25GB VRAM, 61ms TTFT. Source: dev.to Gaurav Vij
M4 Max 64GB · Ollama
20-26 tok/s range. Source: gemma4-ai.com
M4 Max 36GB · Ollama
18-24 tok/s range. Source: gemma4-ai.com
16-22 tok/s range. Source: gemma4-ai.com
M4 16GB · Ollama
20-26 tok/s range. Source: gemma4-ai.com
M3 Max 48GB · Ollama
16-20 tok/s range. Source: gemma4-ai.com
M3 Max 36GB · Ollama
14-18 tok/s range. Source: gemma4-ai.com
12-16 tok/s range. Source: gemma4-ai.com
M3 16GB · Ollama
18-24 tok/s range. Source: gemma4-ai.com
M3 8GB · Ollama
16-20 tok/s range. Source: gemma4-ai.com
M2 Ultra 64GB · Ollama
12-16 tok/s range. Source: gemma4-ai.com
14-18 tok/s range. Source: gemma4-ai.com
10-14 tok/s range. Source: gemma4-ai.com
M2 16GB · Ollama
16-20 tok/s range. Source: gemma4-ai.com
M2 8GB · Ollama
14-18 tok/s range. Source: gemma4-ai.com
M1 Ultra 64GB · Ollama
10-14 tok/s range. Source: gemma4-ai.com
8-12 tok/s range. Source: gemma4-ai.com
M1 Pro 16GB · Ollama
18-22 tok/s range. Source: gemma4-ai.com
M1 16GB · Ollama
12-16 tok/s range. Source: gemma4-ai.com
M1 8GB · Ollama
15-20 tok/s range, usable for simple tasks. Source: gemma4-ai.com
M5 Pro (18-core CPU, 20-core GPU) · Ollama
User reports very fast performance on M5 Pro with 48GB RAM. Prompt eval rate: 204.07 t/s, generation rate: 68.76 t/s.
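The prompt-eval and generation rates quoted here match the format of `ollama run --verbose` output; the same numbers can be computed from the timing fields Ollama's HTTP API returns. A small sketch, with the model tag as an assumption:

```python
# Compute prompt-eval and generation rates from Ollama's /api/generate
# response. Durations are reported in nanoseconds. The model tag below is
# an assumed placeholder, not a tag confirmed by the report.
import requests

r = requests.post("http://localhost:11434/api/generate", json={
    "model": "gemma4",  # assumed tag
    "prompt": "Summarize flash attention in two sentences.",
    "stream": False,
})
stats = r.json()

prompt_tps = stats["prompt_eval_count"] / stats["prompt_eval_duration"] * 1e9
gen_tps = stats["eval_count"] / stats["eval_duration"] * 1e9
print(f"prompt eval rate: {prompt_tps:.2f} t/s, generation rate: {gen_tps:.2f} t/s")
```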