Running Claude Code Offline with a Local LLM
Claude Code is built around Anthropic’s API, but the CLI is just a harness — you can point it at any OpenAI-compatible endpoint. That includes Ollama running locally, which means Claude Code can work fully offline.
This is a practical guide on how to set it up, which models to use, and what to expect.
How It Works
Claude Code talks to ANTHROPIC_BASE_URL. Override that variable to point at a local Ollama instance and it will use your local model instead of the cloud.
export ANTHROPIC_AUTH_TOKEN=ollama
export ANTHROPIC_API_KEY=""
export ANTHROPIC_BASE_URL=http://localhost:11434
claudeThat’s the core of it. The rest is model selection and managing expectations.
Runtime: oMLX vs Ollama (Apple Silicon)
If you’re on a Mac, use oMLX instead of Ollama.
Ollama works, but on Apple Silicon it uses roughly a third of the hardware’s memory bandwidth. oMLX is built directly on Apple’s MLX framework and saturates the unified memory bandwidth on the GPU side. For a 26–35B model, the tool-call latency difference is immediately obvious once you’re chaining multiple calls in a loop.
The practical effect: on the same M2 hardware, Qwen3 runs noticeably faster through oMLX than through a comparably-quantized Ollama build.
On Linux, Ollama is your primary option and works fine.
Model Selection
Not all models are equal for this use case. Claude Code’s workflow is built around tool chaining — the model has to reliably call tools, parse results, and chain several calls per task. A model that’s good at code completion is not necessarily good at this.
| Model | Size | Verdict |
|---|---|---|
| Qwen3.6-27b / Qwen3 variants | ~17 GB | Best community pick for Claude Code tool use |
| qwen3-coder | varies | Built for coding agents; solid choice |
| gemma4:26b | ~17 GB | RL-trained for tool use; ~70% of normal workflow |
| qwen2.5-coder:14b | ~9 GB | Avoid — 25–52s per tool call, unusable |
The key criterion: pick models that are RL-trained for tool use, not just next-token code completion. The tool loop is what separates usable from unusable here.
qwen2.5-coder:14b is the model everyone recommends in local-LLM threads, but for Claude Code specifically it falls apart — 25–52 seconds per tool call, with five or six calls per task, makes it completely unusable in practice.
Setup
1. Pull the model (do this at home, not on the go)
# ~17 GB download
ollama pull qwen3-coder2. Configure Claude Code
export ANTHROPIC_AUTH_TOKEN=ollama
export ANTHROPIC_API_KEY=""
export ANTHROPIC_BASE_URL=http://localhost:11434Or as shell aliases for easy switching:
alias claude-local='ANTHROPIC_AUTH_TOKEN=ollama ANTHROPIC_BASE_URL=http://localhost:11434 claude'
alias claude-cloud='claude'3. Verify offline before you need it
Turn off wifi and test locally before relying on this remotely. If it works in airplane mode at home, it works anywhere.
Context Settings (Ollama, 32 GB RAM)
For gemma4:26b or a similarly-sized model on a 32 GB machine:
num_ctx: 32768The architecture supports up to 256k context, but at roughly 1 GB per 8k context tokens, 32k leaves enough headroom for Claude Code, the OS, and a browser open at the same time. 64k is viable if you close everything else.
Temperature defaults (temperature: 1, top_k: 64, top_p: 0.95) are fine for coding tasks.
What Works Offline (~70%)
- Targeted single-file edits
- Refactoring within a known scope
- Drafting and iterating before spending cloud tokens
- Privacy-sensitive or NDA work where cloud is off the table
What Breaks (~30%)
- Heavy whole-repository reasoning
- Multi-tool agentic workflows with subagents
- MCP server integrations
- Anything going to production — use cloud Claude for that
Caveats
Endpoint compatibility — Ollama’s API shape is not 100% identical to Anthropic’s. Unexpected parsing errors mid-stream usually come from this, not from the model. Switch to claude-cloud and continue.
Battery — Running a 26B model with an editor and browser open drains battery noticeably faster than normal.
Memory headroom — Keep your context length within what your RAM can comfortably handle. Swapping kills latency.
When to Use Which
| Scenario | Use |
|---|---|
| No network / traveling | claude-local |
| NDA / client code, privacy-sensitive | claude-local |
| Drafting prompts or exploring before committing | claude-local |
| Multi-agent workflows, MCP servers | claude-cloud |
| Whole-repo refactors | claude-cloud |
| Shipping to production | claude-cloud |
Summary
Claude Code offline is genuinely usable at roughly 70% of the normal experience — enough for focused feature work, edits, and drafting. The critical variables are the runtime (oMLX over Ollama on Apple Silicon) and the model (Qwen3 variants, not code-completion models).
The setup is three environment variables. The preparation is pulling the model the night before.