Qwen2.5

Alibaba · 3 reports

Qwen2.5 27B

2× RTX 3090 · llama.cpp

throughput:: 70.0 t/s gen · 1850.0 t/s pp
quant:: Q6_K_XL (gguf)

coding

Multi-token prediction enabled. The user reports 96GB, which appears to be system memory: two RTX 3090s provide 48GB of VRAM (24GB each). Reliable for code generation and codebase ingestion.

Benchmark of abliteration tools (Apostate, Huihui, Heretic) on Qwen2.5 7B. Evaluated with lm-evaluation-harness via vLLM 0.19.0 in bf16 on an RTX 5090 (32GB). Reports MMLU, GSM8K, HellaSwag, ARC Challenge, WinoGrande, TruthfulQA MC2, PIQA, LAMBADA perplexity, HarmBench attack success rate (ASR), and KL divergence. No throughput (t/s) reported.

Qwen2.5 32B Coder

RTX 3090 · llama.cpp · 32,768 ctx

throughput:: 28.0 t/s gen · 450.0 t/s pp
quant:: Q4_K_M (gguf)
kv:: Q8

coding

Solid for autocomplete, but occasionally hallucinates imports in multi-file refactors. Build b4400.
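
The throughput figures in the two llama.cpp reports above can be turned into rough wall-clock estimates. The sketch below is only a back-of-envelope calculation: it assumes the reported prompt-processing (pp) and generation (gen) rates stay constant at every context depth, which real runs generally do not. The `Report` class, the `estimate_seconds` helper, and the 8,000-token prompt / 500-token reply scenario are hypothetical and chosen for illustration; only the t/s values come from the reports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    """One throughput report; rates are copied from the listing above."""
    name: str
    pp_tps: float   # prompt processing, tokens per second
    gen_tps: float  # generation, tokens per second


def estimate_seconds(report: Report, prompt_tokens: int, output_tokens: int) -> float:
    """Rough latency: prefill time plus decode time.

    Assumption: both rates are constant regardless of context depth.
    """
    if prompt_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return prompt_tokens / report.pp_tps + output_tokens / report.gen_tps


REPORTS = [
    Report("Qwen2.5 27B, 2x RTX 3090, Q6_K_XL", pp_tps=1850.0, gen_tps=70.0),
    Report("Qwen2.5 32B Coder, RTX 3090, Q4_K_M", pp_tps=450.0, gen_tps=28.0),
]

if __name__ == "__main__":
    # Hypothetical scenario: an 8,000-token prompt and a 500-token reply.
    for r in REPORTS:
        seconds = estimate_seconds(r, prompt_tokens=8000, output_tokens=500)
        print(f"{r.name}: ~{seconds:.1f} s")
```

Under the same constant-rate assumption, passing `prompt_tokens=32768` for the second report gives an upper-bound feel for how long filling its full reported context would take before the first generated token appears.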