← blog

January 22, 2026

Choosing a quantized model

On Hugging Face, a model page is a wall of filenames. Q4_K_M. Q5_K_S. Q8_0. The letters are a ladder, not a lottery.

The rule: fit the weights in VRAM first, then climb toward quality until speed or memory says stop.

The memory ladder

Think of quantization like resolution:

LevelWhen to use
Q8_0Near-lossless. Use when VRAM is generous.
Q6_KLong context, JSON, reasoning you cannot afford to slip.
Q5_K_MStrong default when Q4 shows cracks.
Q4_K_MSweet spot for most machines. Start here.
Q3_K_MOnly when you must run a larger model on tight VRAM. Test before you trust.

Prefer _K_M variants over plain Q4_0. Mixed per-layer precision keeps more quality at similar size.

Do not go below Q4 for work you care about. Below that, logic drifts, JSON wanders, code gets off-by-one in quiet ways.

VRAM is not just the file size

The GGUF on disk is not the whole story. You also pay for the KV cache—memory that grows with context length. A model that loads fine at 4K context may choke at 32K.

Before you download, estimate:

  • Model weights (quant file size ≈ VRAM for weights)
  • Plus context budget (long chats and RAG eat this fast)
  • Plus a gigabyte or two for framework overhead

If the ladder fails at Q4_K_M, do not reach for Q2 first. Consider a smaller parameter count. A sharp 8B beats a drowning 70B.

How to decide in practice

  1. Pick your task: chat, code, RAG, structured JSON.
  2. Note your VRAM (dedicated GPU memory is the hard ceiling).
  3. Download Q4_K_M and Q5_K_M of the same model.
  4. Run the same prompts. If you cannot tell them apart, keep Q4.

For strict tool calls and long documents, bias toward Q5_K_M or Q6_K. For casual notes on an 8GB card, Q4_K_M is enough.

One sentence

Start at Q4_K_M; step up only when quality complains, step down only when the machine refuses to load.