Running Claude Code Offline with a Local LLM

Claude Code is built around Anthropic’s API, but the CLI is just a harness: you can point it at any endpoint that implements Anthropic’s Messages API. That includes Ollama running locally, provided your version exposes its Anthropic-compatible API, which means Claude Code can work fully offline.

This is a practical guide on how to set it up, which models to use, and what to expect.


How It Works

Claude Code sends its requests to the URL in ANTHROPIC_BASE_URL. Override that variable to point at a local Ollama instance and it will use your local model instead of the cloud.

export ANTHROPIC_AUTH_TOKEN=ollama
export ANTHROPIC_API_KEY=""
export ANTHROPIC_BASE_URL=http://localhost:11434

claude

That’s the core of it. The rest is model selection and managing expectations.
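
When switching back and forth, it helps to see which endpoint the current shell would use. A minimal sketch; endpoint_target is a hypothetical helper name, and it only inspects the variable shown above:

```shell
# Report where Claude Code requests would go from this shell.
# endpoint_target is an illustrative name, not part of any CLI.
endpoint_target() {
  if [ -n "${ANTHROPIC_BASE_URL:-}" ]; then
    echo "override: ${ANTHROPIC_BASE_URL}"
  else
    echo "default endpoint (Anthropic cloud)"
  fi
}
```

Run endpoint_target before starting claude if you are unsure whether an earlier export is still in effect.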


Runtime: oMLX vs Ollama (Apple Silicon)

If you’re on a Mac, use oMLX instead of Ollama.

Ollama works, but on Apple Silicon it uses roughly a third of the hardware’s memory bandwidth. oMLX is built directly on Apple’s MLX framework and saturates the unified memory bandwidth on the GPU side. For a 26–35B model, the tool-call latency difference is immediately obvious once you’re chaining multiple calls in a loop.

The practical effect: on the same M2 hardware, Qwen3 runs noticeably faster through oMLX than through a comparably-quantized Ollama build.

On Linux, Ollama is your primary option and works fine.


Model Selection

Not all models are equal for this use case. Claude Code’s workflow is built around tool chaining — the model has to reliably call tools, parse results, and chain several calls per task. A model that’s good at code completion is not necessarily good at this.

Model                         | Size   | Verdict
Qwen3.6-27b / Qwen3 variants  | ~17 GB | Best community pick for Claude Code tool use
qwen3-coder                   | varies | Built for coding agents; solid choice
gemma4:26b                    | ~17 GB | RL-trained for tool use; ~70% of normal workflow
qwen2.5-coder:14b             | ~9 GB  | Avoid: 25–52 s per tool call, unusable

The key criterion: pick models that are RL-trained for tool use, not just next-token code completion. The tool loop is what separates usable from unusable here.

qwen2.5-coder:14b is the model everyone recommends in local-LLM threads, but for Claude Code specifically it falls apart — 25–52 seconds per tool call, with five or six calls per task, makes it completely unusable in practice.
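
To put those numbers in perspective, here is the per-task wait worked out from the figures quoted above (25–52 s per call, five or six calls per task). Nothing here is measured; it is just the multiplication:

```shell
# Per-task wait = seconds per tool call x tool calls per task.
# Inputs are the figures quoted above for qwen2.5-coder:14b.
best=$((25 * 5))    # fastest call time, fewest calls
worst=$((52 * 6))   # slowest call time, most calls
echo "per-task wait: ${best}-${worst} s (about $((best / 60))-$((worst / 60)) min)"
# prints: per-task wait: 125-312 s (about 2-5 min)
```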


Setup

1. Pull the model (do this at home, not on the go)

# Large download: ~17 GB for the 26–27B models above; qwen3-coder varies by tag
ollama pull qwen3-coder
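
Before leaving, it is worth confirming the pull actually finished. A small sketch that checks the output of ollama list for a model name; model_in_list is a hypothetical helper, and it assumes the first column of that output is the model name with a header row on top:

```shell
# Succeeds if a model whose name starts with $1 appears in `ollama list`
# output read from stdin. Skips the header row; column 1 is assumed to be
# the model name (for example "qwen3-coder:latest").
model_in_list() {
  awk -v want="$1" 'NR > 1 && index($1, want) == 1 { found = 1 } END { exit !found }'
}

# Usage (needs a running Ollama, so it is left commented out):
#   ollama list | model_in_list qwen3-coder && echo "ready for offline use"
```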

2. Configure Claude Code

export ANTHROPIC_AUTH_TOKEN=ollama
export ANTHROPIC_API_KEY=""
export ANTHROPIC_BASE_URL=http://localhost:11434

Or as shell aliases for easy switching:

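alias claude-local='ANTHROPIC_AUTH_TOKEN=ollama ANTHROPIC_API_KEY="" ANTHROPIC_BASE_URL=http://localhost:11434 claude'
# Clear the overrides in case they were exported earlier in this shell
alias claude-cloud='env -u ANTHROPIC_AUTH_TOKEN -u ANTHROPIC_BASE_URL claude'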
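
The aliases above do not name a model. One way to make that explicit is a small wrapper function. This is a sketch under assumptions: it relies on the claude CLI's --model flag accepting the local model tag, and LOCAL_MODEL, LOCAL_LLM_URL, and DRY_RUN are hypothetical names invented here, not variables Claude Code reads.

```shell
# Hypothetical wrapper: run Claude Code against a local endpoint with an
# explicit model. LOCAL_MODEL, LOCAL_LLM_URL and DRY_RUN are illustrative.
claude_local() {
  # Build the command line; extra arguments are passed through to claude.
  set -- claude --model "${LOCAL_MODEL:-qwen3-coder}" "$@"
  # DRY_RUN=1 prints the command instead of launching it.
  [ "${DRY_RUN:-0}" = "1" ] && set -- echo "$@"
  # The overrides apply to this one invocation only.
  ANTHROPIC_AUTH_TOKEN=ollama \
  ANTHROPIC_API_KEY="" \
  ANTHROPIC_BASE_URL="${LOCAL_LLM_URL:-http://localhost:11434}" \
    "$@"
}
```

For example, DRY_RUN=1 LOCAL_MODEL=gemma4:26b claude_local shows the command that would run without starting a session.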

3. Verify offline before you need it

Turn off wifi and test locally before relying on this remotely. If it works in airplane mode at home, it works anywhere.
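
With wifi off, a quick first check is whether the local endpoint still answers. A minimal sketch, assuming curl is installed and the runtime is Ollama (/api/tags is its model-listing route); check_local_endpoint is a hypothetical helper name:

```shell
# Returns 0 if the endpoint answers an HTTP request within 2 seconds.
# Falls back to ANTHROPIC_BASE_URL, then to Ollama's default port.
# /api/tags is Ollama-specific; adjust the path for other runtimes.
check_local_endpoint() {
  curl --silent --fail --max-time 2 --output /dev/null \
    "${1:-${ANTHROPIC_BASE_URL:-http://localhost:11434}}/api/tags"
}

# Usage (left commented out, since it depends on your machine):
#   check_local_endpoint && echo "local model reachable" || echo "start the runtime first"
```

This only proves the server is up; a short real session in airplane mode is still the actual test.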


Context Settings (Ollama, 32 GB RAM)

For gemma4:26b or a similarly-sized model on a 32 GB machine:

num_ctx: 32768

The architecture supports up to 256k context, but at roughly 1 GB per 8k context tokens, 32k costs about 4 GB on top of the model weights. That leaves enough headroom for Claude Code, the OS, and a browser open at the same time. 64k (about 8 GB) is viable if you close everything else.
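
The budget can be worked out from the figures in this guide: about 17 GB of weights for gemma4:26b plus about 1 GB per 8k tokens of context. A rough sketch that ignores everything else running on the machine; ctx_budget_gb is a hypothetical helper:

```shell
# Rough memory budget from the figures quoted in this guide:
# ~17 GB of weights plus ~1 GB per 8k (8192) context tokens.
weights_gb=17
ctx_budget_gb() {
  # $1 = context length in tokens; integer arithmetic only
  echo $(( weights_gb + $1 / 8192 ))
}

for ctx in 32768 65536 262144; do
  echo "num_ctx=${ctx}: ~$(ctx_budget_gb "$ctx") GB"
done
# prints:
#   num_ctx=32768: ~21 GB
#   num_ctx=65536: ~25 GB
#   num_ctx=262144: ~49 GB
```

On a 32 GB machine that is roughly 11 GB of slack at 32k, 7 GB at 64k, and no fit at all at the full 256k.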

The sampling defaults (temperature: 1, top_k: 64, top_p: 0.95) are fine for coding tasks.
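
One way to apply the context setting in Ollama is a Modelfile that derives a new model with num_ctx pinned. A sketch only: write_modelfile is a hypothetical helper and gemma4-32k is a made-up name for the derived model. The sampling defaults are left untouched.

```shell
# Write a Modelfile that pins num_ctx for a derived model.
# The base tag is the one used in this section.
write_modelfile() {
  # $1 = target directory (defaults to the current one)
  cat > "${1:-.}/Modelfile" <<'EOF'
FROM gemma4:26b
PARAMETER num_ctx 32768
EOF
}

# Then, with Ollama installed (gemma4-32k is a hypothetical name):
#   write_modelfile . && ollama create gemma4-32k -f Modelfile
```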


What Works Offline (~70%)

  • Targeted single-file edits
  • Refactoring within a known scope
  • Drafting and iterating before spending cloud tokens
  • Privacy-sensitive or NDA work where cloud is off the table

What Breaks (~30%)

  • Heavy whole-repository reasoning
  • Multi-tool agentic workflows with subagents
  • MCP server integrations
  • Anything going to production — use cloud Claude for that

Caveats

Endpoint compatibility — Ollama’s API shape is not 100% identical to Anthropic’s. Unexpected parsing errors mid-stream usually come from this, not from the model. Switch to claude-cloud and continue.

Battery — Running a 26B model with an editor and browser open drains battery noticeably faster than normal.

Memory headroom — Keep your context length within what your RAM can comfortably handle. Swapping kills latency.


When to Use Which

Scenario                                         | Use
No network / traveling                           | claude-local
NDA / client code, privacy-sensitive             | claude-local
Drafting prompts or exploring before committing  | claude-local
Multi-agent workflows, MCP servers               | claude-cloud
Whole-repo refactors                             | claude-cloud
Shipping to production                           | claude-cloud

Summary

Claude Code offline is genuinely usable at roughly 70% of the normal experience — enough for focused feature work, edits, and drafting. The critical variables are the runtime (oMLX over Ollama on Apple Silicon) and the model (Qwen3 variants, not code-completion models).

The setup is three environment variables. The preparation is pulling the model the night before.