llamaperf

VRAM requirements for local LLMs

How much VRAM you actually need, by model size and quantization level. Real numbers, no marketing fluff.

The fast answer

As a working approximation, model weights take (parameters in billions) × (bits per weight) ÷ 8 gigabytes. So a 70B model at a nominal 4 bits is 70 × 4 ÷ 8 = 35GB of weights; real Q4_K_M files average closer to 4.8 bits per weight, so figure ~40GB in practice. Add ~4–8GB for context and KV cache and you land at the realistic 40–48GB requirement.
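If you want the arithmetic in code form, here is a minimal sketch of that approximation (the function name and the headroom default are ours, not from any library):

```python
# Back-of-envelope VRAM estimate: weights plus a flat context allowance.
# Nominal bits per weight understate real GGUF quants slightly
# (Q4_K_M averages ~4.8 bits/weight rather than 4.0).

def estimate_vram_gb(params_b: float, bits_per_weight: float,
                     headroom_gb: float = 4.0) -> float:
    """(params in billions) x (bits/weight) / 8, plus KV/context headroom."""
    weights_gb = params_b * bits_per_weight / 8
    return weights_gb + headroom_gb

print(estimate_vram_gb(70, 4.0))        # 39.0, low end of the 40-48GB range
print(estimate_vram_gb(70, 4.8, 6.0))   # 48.0, high end with real Q4_K_M bits
```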

The table below collapses this into the numbers you actually want. Values include modest headroom for context and assume Q4_K_M, the most common quant. See our quantization guide for what the quant codes mean.

VRAM by model size and quant (in GB)

Model size    Q4_K_M   Q5_K_M   Q6_K   Q8_0   FP16
7B            5        6        7      8      14
13B           8        10       12     14     26
30B           18       22       26     32     60
70B           40       50       58     70     140
120B (MoE)    60       75       88     110    240

Numbers include weights plus ~2–4GB of headroom for typical context lengths (4K–8K). Long contexts (32K+) and full-precision KV cache add meaningfully more. MoE models (Mixtral 8×22B, DeepSeek V3) are listed by total parameters, not active.

By VRAM tier — what fits

Looking at it from the other direction: given your hardware, what can you actually run?

VRAM     Examples                                                      What fits
8GB      RTX 3060 8GB, RTX 4060 8GB, M1/M2/M3 8GB Macs                 7B at any quant; 13B at Q3 with short context
12GB     RTX 3060 12GB, RTX 4070, M1/M2/M3 16GB Macs                   13B at Q4; 30B at Q3 with short context
16GB     RTX 4060 Ti 16GB, RTX 4070 Ti Super 16GB, RTX 5060 Ti 16GB    13B at Q8; 30B at Q3/Q4
24GB     RTX 3090, RTX 4090                                            30B at Q4–Q6; 70B at Q2 with offloading
32GB+    RTX 5090 (32GB), M-Pro/M-Max Macs with 32–64GB                30B at Q8; 70B at Q4 (single card 32GB+ or dual 24GB)
48GB+    RTX A6000, dual 3090/4090, M-Max 48GB+                        70B at Q4 with full context; 120B-class MoE at Q3
80GB+    A100 80GB, H100 80GB, M-Ultra 96GB+, RTX Pro 6000 96GB        70B at Q8; 120B+ at Q4; multi-user serving
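A quick way to sanity-check this table is to invert the weights formula from the top of the page: how many parameters does a given VRAM budget leave room for? A sketch (the headroom figure is an assumption):

```python
# Invert the weights formula: largest model (in billions of parameters)
# that a VRAM budget fits at a given quant, after reserving headroom.

def max_params_b(vram_gb: float, bits_per_weight: float,
                 headroom_gb: float = 3.0) -> float:
    return max(0.0, (vram_gb - headroom_gb) * 8 / bits_per_weight)

for vram in (8, 12, 16, 24, 48):
    print(f"{vram}GB -> ~{max_params_b(vram, 4.8):.0f}B at Q4_K_M")
# 8GB -> ~8B, 12GB -> ~15B, 16GB -> ~22B, 24GB -> ~35B, 48GB -> ~75B
# consistent with the tiers above: 7B on 8GB, 13B on 12GB, 30B on 24GB
```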

The four things that consume VRAM

  1. Model weights. Fixed once you choose a model and quant. The dominant cost.
  2. KV cache. Grows linearly with context length. A 70B model at 4K context uses ~3GB at FP16, or ~750MB with KV-Q8. At 32K context, multiply by 8 (a sizing sketch follows this list).
  3. Activations and overhead. Engine-internal buffers. Usually a few hundred MB to ~1GB depending on engine and batch size.
  4. Display / OS overhead. On a card driving a monitor, browser and OS chrome eat 1–2GB before you even start. On a headless server or secondary card this disappears.
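To make item 2 concrete, here is a sketch of the standard KV-cache sizing formula. The model dimensions are assumptions for a Llama-3-70B-style architecture (80 layers, 8 KV heads via grouped-query attention, head_dim 128); GQA models like this come in well under the conservative ~3GB figure above, while models without GQA use far more.

```python
# KV cache: 2 tensors (K and V) per layer, each n_kv_heads * head_dim
# wide, one entry per token of context. Dims below are assumptions for
# a Llama-3-70B-style model; check the model card for real values.

def kv_cache_gb(context_len: int, n_layers: int = 80, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: float = 2.0) -> float:
    """bytes_per_elem: 2.0 = FP16, 1.0 = KV-Q8, 0.5 = KV-Q4."""
    return (2 * n_layers * n_kv_heads * head_dim
            * context_len * bytes_per_elem / 1e9)

print(kv_cache_gb(4096))                       # ~1.3GB at FP16 with GQA
print(kv_cache_gb(4096, bytes_per_elem=1.0))   # ~0.7GB with KV-Q8
print(kv_cache_gb(32768))                      # ~10.7GB, 8x the 4K figure
```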

Apple Silicon is different

Macs use unified memory: the same RAM is shared between CPU and GPU. macOS lets the GPU address up to 75% of installed RAM by default, configurable up to ~92% via sudo sysctl iogpu.wired_limit_mb=....

Practical rule of thumb: a 64GB Mac can host ~48GB of model. A 96GB Mac handles ~72GB. A 192GB Mac Studio Ultra can run essentially any open-weight model that ships, including 70B at Q8 or 120B-class MoE at Q4.
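If you want to compute this yourself, a small sketch (the helper names are ours; the 75% default is macOS's out-of-the-box behavior described above):

```python
# Unified-memory budgeting on Apple Silicon.

def mac_gpu_budget_gb(installed_ram_gb: int, fraction: float = 0.75) -> float:
    """GPU-addressable memory at a given fraction of installed RAM."""
    return installed_ram_gb * fraction

def wired_limit_mb(installed_ram_gb: int, fraction: float = 0.90) -> int:
    """Candidate value for `sudo sysctl iogpu.wired_limit_mb=<N>`."""
    return int(installed_ram_gb * 1024 * fraction)

print(mac_gpu_budget_gb(64))   # 48.0, matches the ~48GB rule of thumb
print(wired_limit_mb(64))      # 58982, raises the cap to ~57.6GB
```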

See our best Mac for local LLMs ranking for community-reported throughput by chip.

Frequently asked

How much VRAM do I need for a 7B local LLM?

About 5–6GB at Q4 quantization, or 8GB at Q8. An 8GB GPU comfortably runs 7B models with room for context. Examples: RTX 3060 8GB, RTX 4060 8GB, M1/M2/M3 8GB Macs.

How much VRAM do I need for a 13B local LLM?

About 8GB at Q4, 14GB at Q8 (see the table above). A 12GB GPU is the practical floor (RTX 3060 12GB, RTX 4070 12GB). 16GB cards add comfortable headroom for 8K+ contexts.

How much VRAM do I need for a 30B local LLM?

About 18GB at Q4_K_M with a useful context. A 24GB card (RTX 3090, RTX 4090) is the sweet spot. 16GB cards work with tighter quants but feel cramped.

How much VRAM do I need for a 70B local LLM?

About 40GB at Q4_K_M for weights alone, plus 4–8GB for context. The realistic floor is 48GB of VRAM (RTX A6000, dual 3090s/4090s, or an Apple Silicon Mac with 64GB+ unified memory).

How is KV cache calculated?

KV cache size scales with context length, layer count, and head dimension. As a rough estimate: a 70B model at 4K context uses ~3GB of KV cache at full precision; at 32K context, ~24GB. Quantizing the KV cache (KV-Q8 or KV-Q4) cuts this by 2–4×.

Can I run a model that doesn't fit in my VRAM?

Yes, with caveats. llama.cpp can offload some layers to system RAM and run them on CPU — but throughput drops sharply (often 5–10× slower) the moment you spill to CPU. For interactive use, fitting fully in VRAM is strongly preferred.
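As a concrete example, partial offload with the llama-cpp-python bindings looks roughly like this (the model path and layer count are placeholders; tune n_gpu_layers down until the model stops running out of memory):

```python
# Partial GPU offload via llama-cpp-python (pip install llama-cpp-python).
# Layers that don't fit in VRAM stay in system RAM and run on CPU.

from llama_cpp import Llama

llm = Llama(
    model_path="./models/your-model-Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=40,   # transformer layers to keep in VRAM; -1 = all
    n_ctx=4096,        # context length also drives KV-cache memory use
)
out = llm("Q: What uses VRAM besides model weights? A:", max_tokens=64)
print(out["choices"][0]["text"])
```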

See also