llamaperf

Best Mac for running local LLMs

Apple Silicon's unified memory is the killer feature for local LLMs: because the entire system memory is GPU-addressable, you can fit much larger models than on a comparable discrete GPU. The tradeoff is generation speed, since memory bandwidth scales with the chip tier (Pro < Max < Ultra).

Ranked from 24 community reports on llamaperf.

Ranked by community reports

#  | GPU            | VRAM  | Reports | Fastest t/s
1  | M5 Max 128GB   | 128GB | 2       | 7.5
2  | M5 Max 64GB    | 64GB  | 1       | 32.0
3  | M4 Max 64GB    | 64GB  | 1       | 23.0
4  | M4 16GB        | 16GB  | 1       | 23.0
5  | M4 Max 36GB    | 36GB  | 1       | 21.0
6  | M3 16GB        | 16GB  | 1       | 21.0
7  | M1 Pro 16GB    | 16GB  | 1       | 20.0
8  | M4 Pro 24GB    | 24GB  | 1       | 19.0
9  | M3 Max 48GB    | 48GB  | 1       | 18.0
10 | M2 16GB        | 16GB  | 1       | 18.0
11 | M3 8GB         | 8GB   | 1       | 18.0
12 | M1 8GB         | 8GB   | 1       | 17.5
13 | M3 Max 36GB    | 36GB  | 1       | 16.0
14 | M2 Max 32GB    | 32GB  | 1       | 16.0
15 | M2 8GB         | 8GB   | 1       | 16.0
16 | M2 Ultra 64GB  | 64GB  | 1       | 14.0
17 | M3 Pro 18GB    | 18GB  | 1       | 14.0
18 | M1 16GB        | 16GB  | 1       | 14.0
19 | M1 Ultra 64GB  | 64GB  | 1       | 12.0
20 | M2 Pro 16GB    | 16GB  | 1       | 12.0
21 | M1 Max 32GB    | 32GB  | 1       | 10.0
22 | M3 Max 128GB   | 128GB | 1       | 5.5
23 | M2 Ultra 192GB | 192GB | 1       |


What to look for

Unified memory is the headline feature

Mac Studio Ultras with 192GB+ can run models that would require a multi-GPU server rack on the discrete side. Even a base M-Pro at 32GB will comfortably hold a 30B-class quant. Match memory size to the largest model you want to run plus a few GB of headroom.

Bandwidth scales with chip tier

M-base chips are around 100 GB/s; M-Pro is ~200 GB/s; M-Max is ~400 GB/s; M-Ultra is ~800 GB/s. Tokens-per-second roughly scales with bandwidth on inference workloads, so an Ultra runs 70B noticeably faster than a Max — even when both have enough memory.
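As a rough sanity check on those numbers: single-stream generation is memory-bound, since every new token has to stream roughly the full set of quantized weights through the GPU, so an upper bound on tokens-per-second is bandwidth divided by weight size. The sketch below uses the approximate tier bandwidths above and an assumed ~40GB 70B Q4 model; the figures are illustrative, not llamaperf measurements.

```python
# Back-of-envelope ceiling: tokens/sec <= memory bandwidth / bytes read per token.
# Bandwidth figures and the 40GB model size are rough assumptions, not measurements.
TIER_BANDWIDTH_GB_S = {"M-base": 100, "M-Pro": 200, "M-Max": 400, "M-Ultra": 800}

def tps_ceiling(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Each generated token reads roughly the full quantized weights once."""
    return bandwidth_gb_s / weights_gb

weights_gb = 40.0  # ~70B model at a Q4-class quant
for tier, bw in TIER_BANDWIDTH_GB_S.items():
    print(f"{tier:8s} ~{tps_ceiling(bw, weights_gb):4.1f} t/s ceiling on a 70B Q4 model")
```

Real-world numbers land below this ceiling because of compute overhead and prompt processing, but the relative ordering across tiers holds.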

MLX vs llama.cpp

MLX (Apple's framework) tends to be slightly faster on Apple Silicon and supports lower-bit quants well. llama.cpp's Metal backend is more mature and supports a wider range of model formats. Most users start with llama.cpp and migrate to MLX for the largest models.
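If you want to try the MLX path, a minimal sketch with the mlx-lm Python package looks roughly like this. It assumes `pip install mlx-lm` on an Apple Silicon Mac; the model repo named here is only an example, and the exact generate() keywords can differ between mlx-lm versions.

```python
# Minimal MLX generation sketch; treat as illustrative rather than a pinned recipe.
from mlx_lm import load, generate

# Any MLX-converted checkpoint works here; this repo name is an example, not a recommendation.
model, tokenizer = load("mlx-community/Meta-Llama-3-8B-Instruct-4bit")

text = generate(
    model,
    tokenizer,
    prompt="Explain unified memory in one sentence.",
    max_tokens=128,
)
print(text)
```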

Frequently asked

What is the best Mac for running local LLMs?

For most users, an M-Max or M-Ultra Mac with 64GB+ unified memory is the best balance: that is enough to run a 70B-class model at Q4. Configurations with 128GB+ add headroom for higher-bit quants and long contexts, and 192GB+ Ultras can handle the largest open-weight models.

Is an Apple Silicon Mac fast enough for local LLMs?

For interactive use, yes — Macs deliver 5–30 tokens-per-second on common model sizes, which is fast enough for conversational use. They are slower than top-end NVIDIA GPUs at the same quant, but the unified memory advantage often more than makes up for it.

How much RAM do I need on a Mac for local LLMs?

A rough rule: model weights (in GB) ≈ parameters (B) × bytes-per-weight. Q4-class quantization works out to roughly 0.55–0.6GB per billion parameters once quantization overhead is included, so a 70B model needs ~40GB of weights, and you want at least 24GB of headroom for the OS, the KV cache, and context. Plan for 64GB+ for serious 70B work.
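That rule of thumb is easy to turn into a quick check. The sketch below assumes ~0.57 bytes per weight for a Q4-class quant and the 24GB headroom figure from above; real GGUF or MLX files vary a few GB either way depending on quant type.

```python
# Rough sizing from the rule of thumb above; real quantized files vary by format.
def weights_gb(params_b: float, bytes_per_weight: float = 0.57) -> float:
    """Approximate size of quantized weights in GB (Q4-class ~0.55-0.6 bytes/weight)."""
    return params_b * bytes_per_weight

def planned_memory_gb(params_b: float, headroom_gb: float = 24.0) -> float:
    """Weights plus headroom for the OS, KV cache, and other apps."""
    return weights_gb(params_b) + headroom_gb

print(f"70B Q4: ~{weights_gb(70):.0f}GB weights, "
      f"plan for ~{planned_memory_gb(70):.0f}GB unified memory")
```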

Should I get an M-Pro or M-Max?

M-Max if you care about throughput. M-Max chips have roughly twice the memory bandwidth of M-Pro chips, which translates almost linearly to tokens-per-second on the same model.

How we rank

Hardware is sorted by the number of community submissions on llamaperf, a proxy for how widely each configuration is used in practice for local LLM inference. Within that, we surface the fastest tokens-per-second observed on each configuration as a quality signal. Submissions come primarily from r/LocalLLaMA discussions and direct user uploads. Nothing here is sponsored or affiliate-driven.
