Ollama

A user-friendly wrapper around llama.cpp with a model registry and one-line install.

19 community reports

Ollama is the easiest entry point into local LLMs. It wraps llama.cpp with a model registry (so 'ollama run llama3' just works), a daemon-style server, and an OpenAI-compatible API. Most casual users run their first local model through Ollama.
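As a minimal sketch of the OpenAI-compatible API, assuming Ollama is running on its default port (11434), `llama3` has already been pulled, and the `openai` Python package is installed (the `api_key` is required by the client but ignored by Ollama):

```python
from openai import OpenAI

# Point the standard OpenAI client at the local Ollama server.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Explain quantization in one sentence."}],
)
print(resp.choices[0].message.content)
```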

Performance characteristics are essentially identical to llama.cpp since Ollama uses it as the underlying runtime. The tradeoff is convenience vs. control: Ollama hides quant selection and engine flags, which is fine for getting started but limits tuning for advanced users.
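That said, Ollama isn't entirely closed to tuning: its native API accepts a per-request "options" object that overrides engine defaults. A sketch, assuming a local server and the `requests` package; the specific values here are illustrative, not recommendations:

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": "Why is the sky blue?",
        "stream": False,
        "options": {
            "num_ctx": 8192,  # request a larger context window than the default
            "num_gpu": 99,    # offload as many layers as fit on the GPU
        },
    },
    timeout=300,
)
print(resp.json()["response"])
```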

Reports tagged 'Ollama' on llamaperf use Ollama's defaults unless the submitter notes otherwise. Throughput numbers will closely match llama.cpp reports on the same hardware.
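For anyone reproducing a report, one way to measure decode throughput is from Ollama's own counters: with "stream": False, the /api/generate response includes eval_count (tokens generated) and eval_duration (generation time in nanoseconds). A sketch assuming a local server and the `requests` package; `ollama run --verbose` prints a similar eval rate at the CLI:

```python
import requests

r = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Write a haiku about GPUs.", "stream": False},
    timeout=300,
).json()

# tokens per second = tokens generated / seconds spent generating
print(f"{r['eval_count'] / r['eval_duration'] * 1e9:.1f} t/s")
```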

Top GPUs running Ollama

| GPU                 | Vendor | VRAM | Reports | Fastest t/s |
|---------------------|--------|------|---------|-------------|
| RTX 4060 Ti 16GB    | nvidia | 16GB | 4       | 45.0        |
| RTX 3060 12GB       | nvidia | 12GB | 3       | 60.0        |
| RTX 4090            | nvidia | 24GB | 2       | 149.6       |
| RTX 4070            | nvidia | 12GB | 2       | 55.0        |
| RX 7900 XTX         | amd    | 24GB | 2       | 40.0        |
| RX 7900 XT          | amd    | 20GB | 2       | 38.0        |
| Intel Arc B580 12GB | intel  | 12GB | 2       | 30.0        |
| RX 7800 XT 16GB     | amd    | 16GB | 2       | 27.0        |

Frequently asked

Is Ollama the same as llama.cpp?

Ollama uses llama.cpp as its inference backend, so raw throughput is essentially the same. The difference is the user experience: Ollama provides a model registry, daemon, and API; llama.cpp is the underlying engine.

What's the best GPU for Ollama?

Same answer as for llama.cpp: a 24GB NVIDIA card (RTX 3090/4090) for 30B-class models, 48GB+ or two-card setups for 70B, or an Apple Silicon Mac with sufficient unified memory.

Does Ollama support AMD or Apple Silicon GPUs?

Yes, both. Ollama inherits llama.cpp's hardware support: ROCm for AMD GPUs on Linux, Metal for Apple Silicon, plus CUDA for NVIDIA and a CPU fallback.