Using LM Studio — thoughts

LM Studio is the GUI path to local LLMs. Discover models from Hugging Face, download GGUF quants, chat immediately, and optionally run a local OpenAI-compatible server—default port 1234.

Install

Download from lmstudio.ai. No account required for basic use. Apple Silicon users benefit from the MLX engine on many models—often faster than llama.cpp alone on the same chip.

First model

Open Discover (or Search).
Find an instruct model—e.g. Llama 3.1 8B Instruct, Qwen 2.5 7B Instruct, Mistral 7B Instruct.
Choose Q4_K_M unless VRAM is tight or you need Q5/Q6 for harder reasoning.
Download. Wait—the files are gigabytes.
Open Chat, select the model, send a test prompt.

Green compatibility indicators mean comfortable VRAM headroom; red means CPU offload and slower inference.

Load settings that matter

Open the model gear / load panel:

GPU offload: slide to 100% if the model fits. Anything less spills layers to CPU and can cost an order of magnitude in speed.
Context length: start 4096–8192. Increase only when needed; KV cache grows with context.
Flash Attention: enable when available—saves memory, helps speed.
Offload KV cache to GPU: on if you have VRAM headroom; off if you need longer context on a tight card.

If load fails with OOM: smaller quant (Q4_K_M → Q3_K_M), shorter context, or smaller model—not heroic context on a 70B.

Server mode

Go to Developer / Local Server.
Toggle the server on.
Default: http://localhost:1234/v1 (OpenAI-compatible).

Point Cursor, Continue, or your own app at that base URL. Same pattern as Ollama—different port.

Check models:

curl -s http://localhost:1234/v1/models

To reach the server from another machine on your LAN, bind to 0.0.0.0 only if you understand the exposure—local inference is private until you open the door.

Hidden usefulness

--estimate-only: see whether a GGUF fits before a long download.
Presets: save temperature, system prompt, and load config per workflow.
Structured output: JSON mode for tool pipelines.
Speculative decoding: pair a small draft model with a large one for faster generation when supported.

LM Studio vs Ollama

	LM Studio	Ollama
Interface	GUI-first	CLI-first
Model pick	Visual browse	`ollama pull`
API port	1234	11434
Best for	Exploration, tuning sliders	Scripts, servers, quick dev

Many people use both: LM Studio to find the right quant; Ollama to serve the winner in production-like loops.

Practical default

14B Q4_K_M on 12GB VRAM with 8K context. 8B Q4_K_M on 8GB. Slide GPU to max. Measure tok/s. Then adjust one knob at a time.

The garden grows by attention, not by the largest stone you can carry.