llamaperf

RTX 5060 Ti 16GB

NVIDIA · 16GB · 2 reports

Qwen3.6 27B

RTX 5060 Ti 16GB · llama.cpp · 131,072 ctx

Tone: positive
throughput: 19.0 t/s gen
quant: IQ4_XS (gguf)
kv: F16

The user reports that keeping the KV cache in system RAM (llama.cpp's -nkvo flag) lets the whole model fit on the GPU with an F16 KV cache, reaching a peak of 19 t/s and about 14 t/s during long generation at 65k context. With 128k context and 63 layers on the GPU, speed stayed similar. Quantizing the RAM-resident KV cache did not improve performance.
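
A minimal sketch of how the setup in this report might be launched, written as a small Python helper that assembles a llama.cpp command line. The report does not give the exact command, so everything here is an assumption except the -nkvo flag, the F16 KV cache, and the 63-layer figure: the model filename is hypothetical, "65k" is taken to mean 65,536 tokens, and 99 is used only as a common way of requesting "all layers" on the GPU.

```python
import shlex


def build_llama_server_cmd(model_path, ctx_size, gpu_layers,
                           kv_type="f16", kv_in_ram=True):
    """Assemble a llama-server argument list (illustrative, not the reporter's command)."""
    cmd = [
        "llama-server",
        "-m", model_path,          # path to the GGUF file (hypothetical name below)
        "-c", str(ctx_size),       # context size in tokens
        "-ngl", str(gpu_layers),   # number of layers to place on the GPU
        "-ctk", kv_type,           # K cache type
        "-ctv", kv_type,           # V cache type
    ]
    if kv_in_ram:
        cmd.append("-nkvo")        # --no-kv-offload: keep the KV cache in system RAM
    return cmd


# 65k context, all layers on the GPU (99 is an assumed "all layers" value).
cmd_65k = build_llama_server_cmd("qwen3.6-27b-iq4_xs.gguf", 65536, 99)

# 128k context with 63 layers on the GPU, as described in the report.
cmd_128k = build_llama_server_cmd("qwen3.6-27b-iq4_xs.gguf", 131072, 63)

print(shlex.join(cmd_65k))
print(shlex.join(cmd_128k))
```

With -nkvo the VRAM that the KV cache would otherwise occupy is freed for model weights, at the cost of the attention step reading the cache from system RAM.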

Qwen3.6 27B

RTX 5060 Ti 16GB · llama.cpp · 75,000 ctx

Tone: positive
throughput: 22.0 t/s gen · 760.0 t/s pp
quant: IQ4_XS (gguf)
kv: Q8
flash attention: on

The user tested Qwen3.6 27B IQ4_XS on an RTX 5060 Ti 16GB with llama.cpp (TheTom's TurboQuant fork), measuring 760 t/s prompt processing and 22 t/s generation. The context window was limited to 75k, with the KV cache quantized using turbo4/turbo2. The user also tested BF16, Q8_0, Q6_K, Q5_K_XL, Q4_K_XL, IQ4_XS, IQ3_XXS, Q3_K_XL, Q3_K_M, and Q2_K_XL on an L40S or the RTX 5060 Ti, comparing quality with a chess board SVG generation task, and recommends IQ4_XS as the minimum quant.
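
To put the reported rates in wall-clock terms, here is a rough back-of-the-envelope sketch. It assumes both rates stay constant over the whole run, which the report does not claim (throughput usually drops as the context fills), and the 1,000-token reply length is a hypothetical value chosen for illustration.

```python
def seconds_for(tokens, tokens_per_second):
    """Time to process a number of tokens at a fixed rate."""
    return tokens / tokens_per_second


PROMPT_TOKENS = 75_000   # the full 75k context from the report
PP_RATE = 760.0          # reported prompt processing rate, t/s
GEN_RATE = 22.0          # reported generation rate, t/s
REPLY_TOKENS = 1_000     # hypothetical reply length

prompt_s = seconds_for(PROMPT_TOKENS, PP_RATE)
gen_s = seconds_for(REPLY_TOKENS, GEN_RATE)

print(f"prompt: {prompt_s:.1f} s, reply: {gen_s:.1f} s")
```

Under these assumptions, ingesting a full 75k-token prompt takes roughly a minute and a half, and a 1,000-token reply takes about 45 seconds.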