Using Ollama — thoughts

Ollama is the shortest path from zero to local inference. It wraps llama.cpp (and other backends), handles model downloads, and serves an HTTP API at http://localhost:11434: a native API under /api and an OpenAI-compatible one under /v1.

Install and run

macOS and Linux:

curl -fsSL https://ollama.com/install.sh | sh
ollama serve  # skip if the installer already started Ollama as a background service

Windows: use the installer from ollama.com. The daemon starts the API automatically.
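
In scripts it can help to wait until the API answers before sending work. The sketch below polls the server's /api/version endpoint using only the standard library. The function names, retry count, and delay are illustrative choices, not part of Ollama.

```python
import time
import urllib.error
import urllib.request


def probe_version(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET <base_url>/api/version answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or an HTTP error status.
        return False


def wait_for_server(base_url: str = "http://localhost:11434",
                    attempts: int = 10,
                    delay: float = 0.5,
                    probe=probe_version) -> bool:
    """Poll until the probe succeeds or the attempts run out (hypothetical helper)."""
    for attempt in range(attempts):
        if probe(base_url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Calling wait_for_server() at the top of a script returns True once the API is reachable and False if it never comes up; the probe argument exists so the loop can be exercised without a live server.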

Pull a model

ollama pull qwen2.5:7b
ollama run qwen2.5:7b

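Tags usually encode parameter count and sometimes quantization (the 7b in qwen2.5:7b is the parameter count; longer tags may end in a quant name such as q4_K_M). For tight VRAM, favor smaller instruct models (7B–8B class) before chasing 32B+.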
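
A back-of-the-envelope estimate makes the sizing advice concrete. This sketch multiplies parameter count by bits per weight to approximate the size of the weights, then adds headroom for the KV cache and runtime. The 4.5 bits per weight used in the example and the 2 GiB headroom are assumptions for illustration, not measured values; real memory use depends on the quant format and context size.

```python
def estimate_weights_gib(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def fits(params_billions: float, bits_per_weight: float,
         vram_gib: float, headroom_gib: float = 2.0) -> bool:
    """Rough check: weights plus an assumed headroom must fit in VRAM."""
    return estimate_weights_gib(params_billions, bits_per_weight) + headroom_gib <= vram_gib


# Assumed ~4.5 bits per weight for a 4-bit quant, on an 8 GiB card.
print(fits(7, 4.5, vram_gib=8))   # 7B class
print(fits(32, 4.5, vram_gib=8))  # 32B class
```

Treat the result as a first filter only; loading the model and watching actual memory use is the real test.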

List what you have:

ollama list
ollama show qwen2.5:7b

ollama show prints the model's context length, parameters, and prompt template. Do not set num_ctx above the context length the model supports.
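
The same check can be scripted against the native /api/show endpoint. The sketch below assumes the response carries a model_info object with a key ending in ".context_length" (for example "qwen2.context_length"), as recent Ollama versions return; if that key is absent, the requested value is passed through unchanged. The helper names are hypothetical.

```python
import json
import urllib.request
from typing import Optional


def fetch_model_info(model: str, base_url: str = "http://localhost:11434") -> dict:
    """POST /api/show and return the decoded JSON body."""
    req = urllib.request.Request(
        f"{base_url}/api/show",
        data=json.dumps({"model": model}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def context_length(show_payload: dict) -> Optional[int]:
    """Find the '<architecture>.context_length' entry, if present (assumed layout)."""
    for key, value in show_payload.get("model_info", {}).items():
        if key.endswith(".context_length"):
            return int(value)
    return None


def safe_num_ctx(requested: int, show_payload: dict) -> int:
    """Clamp a requested num_ctx to the model's reported context length."""
    limit = context_length(show_payload)
    return requested if limit is None else min(requested, limit)
```

With a running server, safe_num_ctx(8192, fetch_model_info("qwen2.5:7b")) gives a value that is safe to put in a Modelfile or request options.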

Chat from the CLI

ollama run opens an interactive session. /bye exits. Useful for a quick sanity check before wiring an IDE.

HTTP API

Native chat endpoint:

curl http://localhost:11434/api/chat -d '{
  "model": "qwen2.5:7b",
  "messages": [{"role": "user", "content": "Explain KV cache in one paragraph."}],
  "stream": false
}'
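
With "stream": true, the native endpoint returns one JSON object per line, each carrying a fragment of the reply, and the last one is marked "done": true. A minimal sketch of consuming that stream from Python with the standard library follows; iter_chunks and stream_chat are illustrative names, and error handling is omitted.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def iter_chunks(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield the text fragment from each JSON line until done is true."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank keep-alive lines
        event = json.loads(raw)
        yield event.get("message", {}).get("content", "")
        if event.get("done"):
            break


def stream_chat(prompt: str, model: str = "qwen2.5:7b",
                base_url: str = "http://localhost:11434") -> Iterator[str]:
    """POST to /api/chat with streaming enabled and yield reply fragments."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # The HTTP response object iterates line by line.
        yield from iter_chunks(resp)
```

Usage, with the server running: for piece in stream_chat("Explain KV cache in one paragraph."): print(piece, end="", flush=True).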

OpenAI-compatible (drop-in for many SDKs):

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by SDK; ignored locally
)

response = client.chat.completions.create(
    model="qwen2.5:7b",
    messages=[{"role": "user", "content": "Hello."}],
)
print(response.choices[0].message.content)

Modelfiles — config as code

Create Modelfile:

FROM qwen2.5:7b
PARAMETER num_ctx 8192
PARAMETER temperature 0.3
SYSTEM "You are a concise technical assistant."

Build and run:

ollama create my-assistant -f Modelfile
ollama run my-assistant

Check Modelfiles into git. Teammates run ollama create and get the same behavior, rather than copying settings from a screenshot.
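
If you need several variants of one assistant (say, different num_ctx values for different machines), generating the Modelfile from a small script keeps them consistent. This is a sketch with a hypothetical render_modelfile helper; it only covers the FROM, PARAMETER, and SYSTEM instructions shown above and rejects system prompts that would need more careful quoting.

```python
def render_modelfile(base: str, parameters: dict, system: str) -> str:
    """Render a Modelfile with FROM, PARAMETER, and SYSTEM lines."""
    if '"' in system or "\n" in system:
        # Quoting rules for such prompts are out of scope for this sketch.
        raise ValueError("system prompt must be a single line without double quotes")
    lines = [f"FROM {base}"]
    lines += [f"PARAMETER {name} {value}" for name, value in parameters.items()]
    lines.append(f'SYSTEM "{system}"')
    return "\n".join(lines) + "\n"


text = render_modelfile(
    "qwen2.5:7b",
    {"num_ctx": 8192, "temperature": 0.3},
    "You are a concise technical assistant.",
)
print(text, end="")
```

Write the result to a file named Modelfile and build it with ollama create as before.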

Common fixes

Symptom	Try
Slow generation	Smaller quant or model; run `ollama ps` to confirm the model is loaded on the GPU
OOM on load	`ollama pull` a smaller tag; lower `num_ctx`
Truncated long docs	Raise `num_ctx` in Modelfile if VRAM allows
IDE cannot connect	Confirm `ollama serve` is running and that `curl localhost:11434/v1/models` returns a model list
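
The last row can be turned into a small diagnostic that separates "server not reachable" from "server up, model missing". The sketch assumes the OpenAI-compatible /v1/models endpoint returns a JSON object with a "data" list whose entries have an "id" field; fetch_models and diagnose are hypothetical helper names.

```python
import json
import urllib.error
import urllib.request
from typing import Optional


def fetch_models(base_url: str = "http://localhost:11434",
                 timeout: float = 3.0) -> Optional[dict]:
    """GET /v1/models; return the decoded body, or None if the call fails."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
            return json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Unreachable server, timeout, HTTP error, or a non-JSON body.
        return None


def diagnose(model: str, payload: Optional[dict]) -> str:
    """Classify the result of fetch_models for a given model name."""
    if payload is None:
        return "server unreachable: is `ollama serve` running?"
    ids = [entry.get("id") for entry in payload.get("data", [])]
    if model not in ids:
        return f"server up, but {model} is not listed: try `ollama pull {model}`"
    return "ok"
```

Usage: print(diagnose("qwen2.5:7b", fetch_models())).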

When Ollama fits

Scripts, CI-like workflows, servers, and developers who want one command and a stable local API. Pair with Q4_K_M models sized to your VRAM.

Less ceremony. More runs.