Ollama is the shortest path from zero to local inference. It wraps llama.cpp (and other backends), handles model downloads, and serves a local HTTP API at http://localhost:11434, with OpenAI-compatible endpoints under /v1.
Install and run
macOS and Linux:
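```bash
curl -fsSL https://ollama.com/install.sh | sh
# Skip this if the server is already running, for example as the
# systemd service the Linux installer normally sets up.
ollama serve
```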
Windows: use the installer from ollama.com. The daemon starts the API automatically.
Pull a model
```bash
ollama pull qwen2.5:7b
ollama run qwen2.5:7b
```
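Tags carry the parameter count (7b here) and, in their longer variants, the quantization; the short default tags typically point to a 4-bit quant. For tight VRAM, favor smaller instruct models (7B–8B class) before chasing 32B+.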
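As a rough way to judge whether a tag will fit, weight memory is approximately parameter count times bits per weight. The sketch below assumes about 5 bits per weight as a ballpark for a Q4_K_M-style quant and leaves out the KV cache and runtime overhead, which come on top. The function name and the figures are illustrative assumptions, not measurements.

```python
def estimate_weight_gb(params_billion: float, bits_per_weight: float = 5.0) -> float:
    """Approximate size of the model weights in GB (10^9 bytes).

    Assumption: ~5 bits per weight is a ballpark for 4-bit K-quants.
    KV cache and runtime overhead are NOT included.
    """
    return params_billion * bits_per_weight / 8


for size in (7, 14, 32):
    print(f"{size}B: roughly {estimate_weight_gb(size):.1f} GB of weights")
```

Treat the result as a lower bound: leave headroom for the context window before deciding a model fits.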
List what you have:
```bash
ollama list
ollama show qwen2.5:7b
```
ollama show reports the model's context length and chat template. Do not set num_ctx above the context length the model actually supports.
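The same details can be read over HTTP. This is a minimal sketch, assuming the native /api/show endpoint returns a model_info object whose context length sits under an architecture-prefixed key such as qwen2.context_length. The helper names are hypothetical.

```python
import json
import urllib.request
from typing import Optional


def max_context(model_info: dict) -> Optional[int]:
    """Return the context length found in a model_info dict, if any."""
    for key, value in model_info.items():
        # Assumption: the key is prefixed by the architecture name.
        if key.endswith(".context_length"):
            return int(value)
    return None


def fetch_model_info(model: str, host: str = "http://localhost:11434") -> dict:
    """POST to /api/show and return the model_info section (empty if absent)."""
    request = urllib.request.Request(
        f"{host}/api/show",
        data=json.dumps({"model": model}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response).get("model_info", {})
```

With the server running, `max_context(fetch_model_info("qwen2.5:7b"))` gives an upper bound to compare against any num_ctx you plan to set.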
Chat from the CLI
ollama run opens an interactive session. /bye exits. Useful for a quick sanity check before wiring an IDE.
HTTP API
Native chat endpoint:
```bash
curl http://localhost:11434/api/chat -d '{
  "model": "qwen2.5:7b",
  "messages": [{"role": "user", "content": "Explain KV cache in one paragraph."}],
  "stream": false
}'
```
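Without "stream": false, the native endpoint streams its reply as newline-delimited JSON. Here is a minimal standard-library sketch of consuming that stream, assuming each line carries a message.content fragment and the final line has "done": true. The names iter_content and stream_chat are hypothetical helpers.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def iter_content(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from NDJSON chat chunks until a chunk says done."""
    for raw in lines:
        if not raw.strip():
            continue  # ignore blank keep-alive lines
        chunk = json.loads(raw)
        yield chunk.get("message", {}).get("content", "")
        if chunk.get("done"):
            break


def stream_chat(prompt: str, model: str = "qwen2.5:7b",
                host: str = "http://localhost:11434") -> Iterator[str]:
    """Send one user message to /api/chat and yield the reply as it arrives."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode()
    request = urllib.request.Request(
        f"{host}/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        yield from iter_content(response)  # the response iterates line by line
```

A caller would print fragments as they arrive, for example `for piece in stream_chat("Explain KV cache in one paragraph."): print(piece, end="", flush=True)`.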
OpenAI-compatible (drop-in for many SDKs):
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the SDK; ignored by Ollama
)
response = client.chat.completions.create(
    model="qwen2.5:7b",
    messages=[{"role": "user", "content": "Hello."}],
)
print(response.choices[0].message.content)
```
Modelfiles — config as code
Create Modelfile:
```
FROM qwen2.5:7b
PARAMETER num_ctx 8192
PARAMETER temperature 0.3
SYSTEM "You are a concise technical assistant."
```
Build and run:
```bash
ollama create my-assistant -f Modelfile
ollama run my-assistant
```
Check Modelfiles into git. Teammates run ollama create and get the same behavior, rather than copying settings from a screenshot.
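One way to make that repeatable is a small script checked in next to the Modelfiles. This sketch assumes a hypothetical repo layout in which each file is named `<model-name>.Modelfile` inside a `modelfiles/` directory. Only the `ollama create NAME -f FILE` invocation comes from the steps above.

```python
import subprocess
from pathlib import Path


def create_commands(directory: str) -> list[list[str]]:
    """Build one `ollama create` command per *.Modelfile, sorted by name."""
    return [
        # Assumption: the file stem doubles as the model name.
        ["ollama", "create", path.stem, "-f", str(path)]
        for path in sorted(Path(directory).glob("*.Modelfile"))
    ]


def build_all(directory: str = "modelfiles") -> None:
    """Run each command in turn; check=True stops at the first failure."""
    for command in create_commands(directory):
        subprocess.run(command, check=True)
```

Keeping command construction separate from execution makes the script easy to dry-run: print `create_commands("modelfiles")` before calling `build_all()`.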
Common fixes
| Symptom | Try |
|---|---|
| Slow generation | Smaller quant or model; run ollama ps to confirm the model is loaded on the GPU rather than the CPU |
| OOM on load | ollama pull a smaller tag; lower num_ctx |
| Truncated long docs | Raise num_ctx in Modelfile if VRAM allows |
| IDE cannot connect | Confirm ollama serve is running and that curl http://localhost:11434/v1/models returns a model list |
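
The connectivity check in the last row can also be scripted. This is a minimal sketch, assuming the OpenAI-compatible /v1/models endpoint returns a data array of objects that each carry an id field. The function names are hypothetical.

```python
import json
import urllib.request
from typing import Optional


def model_ids(payload: dict) -> list[str]:
    """Extract model names from an OpenAI-style model list response."""
    return [entry["id"] for entry in payload.get("data", [])]


def check_server(host: str = "http://localhost:11434") -> Optional[list[str]]:
    """Return the served model ids, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{host}/v1/models", timeout=5) as response:
            return model_ids(json.load(response))
    except OSError:  # covers connection refused, timeouts, and URLError
        return None
```

A result of None points at the server or the port; an empty list means the server is up but no model has been pulled yet.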
When Ollama fits
Ollama fits scripts, CI-like workflows, servers, and developers who want one command and a stable local API. Pair it with Q4_K_M models sized to your VRAM.
Less ceremony. More runs.