RTX 5060 Ti 16GB

NVIDIA · 16GB · 4 reports


Qwen3.6 27B

4× RTX 5060 Ti 16GB · 256,000 ctx

throughput:: 52.2 t/s gen · 608.0 t/s pp
quant:: Q8
kv:: F16
mtp (multi-token prediction):: on

coding

Benchmarked on a Vast.ai instance with 4× RTX 5060 Ti 16GB: Q8 quant, FP16 KV cache, MTP enabled, 256K context. Cold prefill reached 608 t/s and decode 52.2 t/s. The user considers this excellent for $2K hardware.
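To put the throughput figures in wall-clock terms, here is a rough sketch that converts them into waiting time. It assumes the reported rates hold steady across the whole context, which the report does not claim (prefill speed typically falls as context grows), and the 1,000-token reply length is an illustrative choice, not a figure from the report.

```python
# Figures taken from the report above.
CONTEXT_TOKENS = 256_000
PREFILL_TPS = 608.0  # cold prompt processing, tokens per second
DECODE_TPS = 52.2    # generation, tokens per second


def seconds_for(tokens: int, tokens_per_second: float) -> float:
    """Time to process `tokens` at a constant rate (a simplifying assumption)."""
    return tokens / tokens_per_second


full_prefill_s = seconds_for(CONTEXT_TOKENS, PREFILL_TPS)
reply_s = seconds_for(1_000, DECODE_TPS)  # 1,000 tokens is an illustrative reply size

print(f"cold prefill of {CONTEXT_TOKENS:,} tokens: {full_prefill_s / 60:.1f} min")
print(f"1,000-token reply: {reply_s:.1f} s")
```

Under those assumptions, filling the entire 256K window from cold takes about seven minutes, while a 1,000-token reply takes under 20 seconds.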

Qwen3.6 30B (3B active)

RTX 5060 Ti 16GB

throughput:: 52.0 t/s gen
quant:: float8

Custom CUDA/C++ engine reaching 50-54 tok/s, about a 50% improvement over llama.cpp (33-34 tok/s).
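The "50%" figure can be checked against the two reported ranges. The sketch below only does arithmetic on the numbers quoted above. Pairing the range endpoints to get a low and a high bound is an assumption about how to read the report, not something the report states.

```python
# Reported generation speeds in tok/s, as (low, high) ranges.
custom_engine = (50.0, 54.0)
llama_cpp = (33.0, 34.0)


def improvement_pct(new: float, old: float) -> float:
    """Percentage speedup of `new` over `old`."""
    return (new / old - 1.0) * 100.0


# Most pessimistic pairing: slowest custom run vs fastest llama.cpp run.
low = improvement_pct(custom_engine[0], llama_cpp[1])
# Most optimistic pairing: fastest custom run vs slowest llama.cpp run.
high = improvement_pct(custom_engine[1], llama_cpp[0])
# Midpoint of each range.
mid = improvement_pct(sum(custom_engine) / 2, sum(llama_cpp) / 2)

print(f"improvement: {low:.0f}% to {high:.0f}% (midpoints: {mid:.0f}%)")
```

The bounds bracket the quoted 50%, so the claim is a fair round number for the reported ranges.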

Qwen3.6 27B

RTX 5060 Ti 16GB · llama.cpp · 131,072 ctx

throughput:: 19.0 t/s gen
quant:: IQ4_XS (gguf)
kv:: F16

User reports that keeping the KV cache in system RAM (llama.cpp's -nkvo flag) lets the whole model fit on the GPU with an F16 KV cache, reaching 19 t/s peak and 14 t/s during long generation at 65k context. At 128k context with 63 layers on the GPU, speed remained similar. Quantizing the KV cache held in RAM didn't improve performance.
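As a sketch of what the two configurations might look like on the command line, the snippet below assembles llama.cpp invocations from the settings named in the report. Several details are assumptions: the report does not say which binary was used (`llama-server` here), the model path is hypothetical, "65k" is taken to mean 65,536 tokens, and `-ngl 99` is used as a common way to request all layers on the GPU.

```python
import shlex

MODEL_PATH = "models/qwen3.6-27b-iq4_xs.gguf"  # hypothetical path, not from the report


def build_command(ctx_tokens: int, gpu_layers: int) -> list[str]:
    """Build a llama.cpp argv that keeps the KV cache in system RAM."""
    return [
        "llama-server",           # assumption: the report does not name the binary
        "-m", MODEL_PATH,
        "-c", str(ctx_tokens),    # context size in tokens
        "-ngl", str(gpu_layers),  # number of layers offloaded to the GPU
        "-nkvo",                  # --no-kv-offload: KV cache stays in system RAM
        "-ctk", "f16",            # K cache type
        "-ctv", "f16",            # V cache type
    ]


# (context, GPU layers): whole model on GPU at 65k, 63 layers on GPU at 128k.
for ctx, layers in [(65_536, 99), (131_072, 63)]:
    print(shlex.join(build_command(ctx, layers)))
```

Building the argument list in code rather than as a single string makes it easy to sweep context sizes or layer counts when reproducing the comparison.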

Qwen3.6 27B

RTX 5060 Ti 16GB · llama.cpp · 75,000 ctx

throughput:: 22.0 t/s gen · 760.0 t/s pp
quant:: IQ4_XS (gguf)
kv:: Q8
flash attention:: on

User tested Qwen3.6 27B IQ4_XS on an RTX 5060 Ti 16GB with llama.cpp (TheTom's TurboQuant fork): prompt processing 760 t/s, generation 22 t/s, with the context window limited to 75k. KV cache quantized with turbo4/turbo2. Also tested BF16, Q8_0, Q6_K, Q5_K_XL, Q4_K_XL, IQ4_XS, IQ3_XXS, Q3_K_XL, Q3_K_M, and Q2_K_XL on an L40S or the RTX 5060 Ti, comparing quality on a chess board SVG generation task. Recommends IQ4_XS as the minimum.