llamaperf

Gemma 4

Google DeepMind · 44 reports

By engine

Engine      Avg t/s   Range    N
llama.cpp   7.7       1–17     8
Ollama      26.7      8–150    33
Tone: negative
throughput:
7.5 t/s gen
flash-attn:
on

User reports poor performance with Gemma4-31B (7.5 tok/s) and Qwen3.6-27B (locking up) on an M5 Max with 128GB, while Qwen3.6-35B-A3 runs fast. Mentions using DFLASH.

throughput:
55.0 t/s gen
quant:
Q4_K_M (gguf)
text-generation

~55 tok/s on an RTX 4070 (12GB), consistent with Ada Lovelace efficiency. Source: estimated from compute-market tiers.

quant:
NVFP4 (safetensors)

nvidia/Gemma-4-26B-A4B-NVFP4 works on a 5090 with an 80% VRAM allocation (of 32GB) and gets around 50k tokens of context. Model size: 18.8GB. Benchmarks provided (baseline vs NVFP4):

Benchmark                 Baseline   NVFP4
GPQA Diamond              80.30%     79.90%
AIME 2025                 88.95%     90.00%
MMLU Pro                  85.00%     84.80%
LiveCodeBench (pass@1)    80.50%     79.80%
IFBench                   77.77%     78.10%
IFEval                    96.60%     96.40%
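The VRAM-budget arithmetic behind a report like this can be sketched as follows. The architecture numbers (layer count, KV heads, head dim) are illustrative placeholders, not the real Gemma-4-26B-A4B config, so the resulting token count is only a ballpark, not a reproduction of the ~50k figure.

```python
# Rough sketch: how much KV cache fits after the weights are loaded.
# All architecture parameters below are assumed for illustration.

def max_context_tokens(vram_gb, alloc_frac, model_gb,
                       n_layers, n_kv_heads, head_dim, kv_bytes=2):
    """Estimate how many KV-cache tokens fit in the remaining VRAM."""
    budget_gb = vram_gb * alloc_frac - model_gb            # GB left after weights
    # One token stores K and V per layer: 2 * layers * kv_heads * head_dim elems
    per_token = 2 * n_layers * n_kv_heads * head_dim * kv_bytes
    return int(budget_gb * 1024**3 / per_token)

# 5090 (32 GB) at 80% allocation with an 18.8 GB model, FP16 KV cache,
# using placeholder architecture numbers:
print(max_context_tokens(32, 0.80, 18.8,
                         n_layers=48, n_kv_heads=8, head_dim=128, kv_bytes=2))
```

With a smaller per-token KV footprint (fewer KV heads, quantized cache, or sliding-window layers) the same 6.8GB headroom stretches to considerably more context, which is likely how the reported ~50k was reached.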

Gemma 4 26B

M5 Pro (18-core CPU, 20-core GPU) · Ollama

Tone: positive
throughput:
68.8 t/s gen · 204.1 t/s pp

User reports very fast performance on M5 Pro with 48GB RAM. Prompt eval rate: 204.07 t/s, generation rate: 68.76 t/s.
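Rates like these come straight from the token counts and durations Ollama reports per request: its /api/generate response includes prompt_eval_count, prompt_eval_duration, eval_count, and eval_duration (durations in nanoseconds), and the t/s figures are just count divided by duration. A minimal sketch, with the example durations back-derived from the reported M5 Pro numbers rather than taken from a real response:

```python
# Derive "prompt eval rate" and "eval rate" (tokens/sec) from the
# count/duration fields in an Ollama /api/generate response.
# Durations are in nanoseconds.

def rates(stats):
    pp = stats["prompt_eval_count"] / stats["prompt_eval_duration"] * 1e9
    gen = stats["eval_count"] / stats["eval_duration"] * 1e9
    return round(pp, 2), round(gen, 2)

# Synthetic stats chosen to reproduce the reported figures:
example = {
    "prompt_eval_count": 1020,
    "prompt_eval_duration": int(1020 / 204.07 * 1e9),
    "eval_count": 688,
    "eval_duration": int(688 / 68.76 * 1e9),
}
print(rates(example))  # ≈ (204.07, 68.76)
```

The same arithmetic applies to the summary table above: any engine's "t/s gen" is generated tokens over generation wall time, so prompt-processing speed and generation speed should always be read as two separate numbers.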