← blog

March 11, 2026

Local inference — tips

Local inference rewards patience more than hardware bravado. The model is only half the system; how you load and serve it is the other half.

Fit beats size

A model fully on GPU often runs an order of magnitude faster than one spilling to system RAM. Partial offload is not a free lunch—it is a trade you should make consciously.

If tokens crawl, check offload before you blame the model.

Context is a bill

Every extra thousand tokens of context reserves KV cache memory. Ollama in particular may default to a modest context window; raising num_ctx without checking VRAM invites silent truncation or OOM.

Raise context only when the task needs it. Start at 4096 or 8192. Grow after you measure.

Quantization before heroics

Before buying RAM, try Q4_K_M instead of Q8. Before Q3, try a smaller model. The goal is steady tokens per second, not the largest badge on Hugging Face.

Flash Attention and batch

On supported stacks, Flash Attention reduces memory pressure and can improve throughput. Batch size (num_batch in llama.cpp/Ollama tuning) trades latency for throughput—raise it for batch jobs, lower it for interactive chat.

API compatibility

Both major local stacks expose OpenAI-compatible HTTP APIs. Point your app at http://localhost:11434/v1 (Ollama) or http://localhost:1234/v1 (LM Studio). Swap the base URL; drop the cloud API key. Same client code, different ground.

Verify the server before you debug the app:

curl -s http://localhost:11434/v1/models
curl -s http://localhost:1234/v1/models

Modelfiles and presets

Repeatable setups beat one-off flags. Ollama Modelfiles pin system prompt, temperature, and context. LM Studio saves presets per model. Version them like config—not folklore in a terminal history.

Measure, then optimize

Benchmark tokens/sec and answer quality on your prompts. Cloud baselines mislead. Local wins on privacy and cost; it must still win on usefulness for the task at hand.

Closing

The best local stack is the one that loads cleanly, stays in VRAM, and answers without drama. Tuning is subtractive: remove context, quant, and model size until the machine breathes.