How we cut our coding agent's bill by 60% with model routing

The first Anthropic invoice that surprised us was $312 for what felt like a normal Saturday. One person, one machine, one coding agent doing a refactor across maybe forty files. We were not running anything fancy. The agent was just on Sonnet for every turn, the way most setups start.

That bill is what made us look at where the tokens were actually going. Two weeks of telemetry later, we had a different agent. Same UX, same model quality on the things that mattered, 60% cheaper on average. The trick is dull, almost insultingly so. We just stopped sending Sonnet to do Haiku's job.

This post is the breakdown: what an average session looks like, why a single-model setup is wasteful, the rules we landed on, and the things we tried that didn't help.

What an average session actually looks like

Before the optimisation, we ran two weeks of digitorn-code (our open-source coding agent built like the Claude Code clone we wrote up earlier) and tagged every turn by what kind of work it was doing. The split was more lopsided than expected.

| Turn type | Tools | Share of turns | Avg size |
|---|---|---|---|
| Exploration | grep, glob, read | 52% | ~1.8K |
| Editing | write, edit | 22% | ~3.4K |
| Testing | bash, pytest | 14% | ~1.2K |
| Planning | set_goal, plan | 8% | ~2.1K |
| Other | compaction, recovery | 4% | ~0.6K |

Half of every session is exploration. That half is the cheapest to offload to a smaller model, and it's where the savings live.

Half of every session is the agent looking around, figuring out where things live. Grep, Glob, Read truncated to a few hundred lines, then a short summary back to the coordinator. These turns are short on input, short on output, mechanical, and almost always answerable by a small model. They were costing us full Sonnet rates anyway.

Editing is the next biggest slice. Editing turns do warrant the bigger model because they're where the actual code lives. An off-by-one that Sonnet would have avoided can cost you ten retries on Haiku.

Testing is mostly Bash invocation and parsing the output. Cheap to run, doesn't need much intelligence.

Planning is the one that surprised us. Only 8% of turns by count, but they're disproportionately important because every other turn references the plan. We tried moving these to Haiku and it broke. The plans got vague, the agent lost the thread, and we ended up running more turns to compensate. Net loss.

Compaction and recovery are the long tail. They happen, they cost something, you can't really optimise them.

The naive single-model setup is an accident, not a design

Most agent codebases start the same way. You pick a good model (Sonnet, GPT-4o, whatever), you make every turn use it, and you ship. You don't make this choice deliberately. It's the path of least resistance: one provider key in the config, one model name, one billing line. Routing seems like premature optimisation.

It's not. Once you measure, the math becomes hard to argue with. Sonnet output on Anthropic's price list is several times more expensive per token than Haiku output, with the exact multiple depending on which model generations you compare. If half your turns are exploration and you're paying full Sonnet rates for them, roughly half your bill is going to work the cheaper model would do just as well.

Saying it that way makes it sound obvious. It wasn't obvious to us until we looked at the bill.
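
The "half your bill" estimate can be sanity-checked against the session table. The sketch below weights each turn type's share of turns by its average size. It assumes the "~1.8K avg" figures are average tokens per turn and that every turn is billed at one per-token rate, as in a single-model setup. The names `SESSION` and `token_share` are ours, for illustration only.

```python
# Share of turns and average size (in thousands) per turn type,
# copied from the session table. Treating the size as tokens per
# turn is an assumption.
SESSION = {
    "exploration": (0.52, 1.8),
    "editing": (0.22, 3.4),
    "testing": (0.14, 1.2),
    "planning": (0.08, 2.1),
    "other": (0.04, 0.6),
}


def token_share(session):
    """Return each turn type's fraction of total tokens."""
    weights = {kind: share * avg for kind, (share, avg) in session.items()}
    total = sum(weights.values())
    return {kind: weight / total for kind, weight in weights.items()}


shares = token_share(SESSION)
for kind, fraction in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{kind:12s} {fraction:.1%}")
```

Under those assumptions exploration comes out at about 46% of tokens, so on a single model it is the largest line on the bill even though its turns are short.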

The rules we landed on

We tried a few different splits before settling on something simple. Every time we got clever (per-tool routing, dynamic switching mid-turn, content-based heuristics) we hit a corner case that broke the agent's coherence. The version that ended up sticking is six rules across three agent types, and they fit on one page.

| Agent | Job | Model | Tier |
|---|---|---|---|
| Coordinator | writes code, owns the plan | Sonnet | premium |
| Explorer / search | grep, glob, summarise | Haiku | fast |
| Fact checker | verifies claims, no prose | Haiku | fast |
| Writer | produces the final output | Sonnet | premium |
| Editor | polish, terseness, voice | Sonnet | premium |
| Reviewer / triage | scoring, classification | Haiku | fast |

The cheap rule of thumb: anything that produces prose for a human gets the premium model. Anything that filters or classifies gets the fast one.

The pattern under all of these is one cheap heuristic: does this turn produce something a human will read? If yes, premium model. If no, fast model.

A coordinator turn produces code that ends up in a file. Premium. A writer turn produces prose for the user. Premium. An editor turn shapes a final answer. Premium. An explorer turn produces a list of file:line strings the coordinator alone will see. Fast. A fact checker emits a yes/no/citation tuple. Fast. A reviewer scores something on a rubric. Fast.

There's a sub-rule too, which we missed for the first month: the planner is premium, even though nobody reads the plan directly. The plan steers every other turn. A bad plan multiplies cost across the whole session. We learned this the hard way.
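
The heuristic and its planner exception are small enough to write down as a function. This is a minimal sketch, not how digitorn-code implements it: the `TurnKind` type, its two flags, and `tier_for` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

PREMIUM = "premium"
FAST = "fast"


@dataclass(frozen=True)
class TurnKind:
    """A kind of turn and the two properties the rules care about."""
    name: str
    human_reads_output: bool
    steers_other_turns: bool = False


def tier_for(kind: TurnKind) -> str:
    # Main rule: output a human will read gets the premium model.
    # Sub-rule: so does anything that steers every other turn (the plan).
    if kind.human_reads_output or kind.steers_other_turns:
        return PREMIUM
    return FAST


KINDS = [
    TurnKind("coordinator", human_reads_output=True),
    TurnKind("explorer", human_reads_output=False),
    TurnKind("fact_checker", human_reads_output=False),
    TurnKind("writer", human_reads_output=True),
    TurnKind("editor", human_reads_output=True),
    TurnKind("reviewer", human_reads_output=False),
    TurnKind("planner", human_reads_output=False, steers_other_turns=True),
]

TIERS = {kind.name: tier_for(kind) for kind in KINDS}
```

Keeping the exception as an explicit flag makes it harder to "optimise" the planner back onto the fast tier by accident.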

What it costs in YAML

Once you've decided which agent gets which model, expressing it on Digitorn is one block per agent. Here's the shape:

```yaml
agents:
  - id: coordinator
    role: coordinator
    brain:
      provider: anthropic
      model: claude-sonnet-4-6     # premium: writes the code
      max_tokens: 8192
      temperature: 0.2

  - id: explorer
    role: specialist
    specialty: "Find files, grep symbols, sample contents"
    modules:
      - {filesystem: [read, grep, glob]}
      - {shell: [bash]}
    brain:
      provider: anthropic
      model: claude-haiku-4-5      # fast: triage, no prose
      temperature: 0.0
```

That's the whole change. No new framework. No special routing layer. Just two model: strings that point at different things. The coordinator hands work off via Agent(specialist="explorer"), the runtime spawns the explorer with its own brain config, the result comes back, and the coordinator decides what to do next on Sonnet. Each agent uses its own model on its own tokens.

Tip

If you're in a Python framework today, the same shape works there too. The win is in the routing decision, not the runtime. We just think YAML makes it harder to forget the rule when you're tired at 11pm.
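
As a rough illustration of that shape in plain Python, here is a sketch that mirrors the two `brain` blocks. Nothing in it is a Digitorn or framework API: `Brain`, `AGENTS`, `run_agent`, and the `call_model` callback are hypothetical, and the callback stands in for whatever function your framework uses to reach the provider.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Brain:
    """Per-agent model settings, mirroring the YAML `brain` block."""
    provider: str
    model: str
    temperature: float


# One entry per agent; the model string is the whole routing decision.
AGENTS: Dict[str, Brain] = {
    "coordinator": Brain("anthropic", "claude-sonnet-4-6", 0.2),
    "explorer": Brain("anthropic", "claude-haiku-4-5", 0.0),
}


def run_agent(agent_id: str, prompt: str,
              call_model: Callable[[Brain, str], str]) -> str:
    """Look up the agent's own brain and run the prompt on it."""
    return call_model(AGENTS[agent_id], prompt)


def fake_call_model(brain: Brain, prompt: str) -> str:
    # Stand-in for a real provider call, so the sketch runs offline.
    return f"[{brain.model}] {prompt}"


explorer_reply = run_agent("explorer", "find usages of parse_config", fake_call_model)
coordinator_reply = run_agent("coordinator", "apply the rename", fake_call_model)
```

The point of the lookup table is that no call site names a model directly, so changing a tier is a one-line edit.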

What didn't work

A few optimisations look obvious on paper and turn out to be net-negative or net-zero in practice.

Prompt caching alone. Anthropic's prompt cache is great, and we use it. But the cache hit rate on a long-running coding session is bounded by how often the same context shows up across turns, which on a real session with active edits and tool results is low. Caching cut our bill by maybe 12%. Routing cut it by 60%. Stack them, but don't expect cache to do the work.

Switching to a smaller model for the whole agent. We tried running everything on Haiku to see how bad it would be. It was worse than expected. The coordinator started losing the thread on multi-step tasks and re-asking for things it had already seen. Net token consumption went up, not down. The savings on per-token rate were eaten by the higher turn count. Quality also dropped to "annoying to use", which we couldn't ship.

Truncating system prompts aggressively. The system prompt is included in every Sonnet call. Trimming it sounds like it would save tokens. It does, by a tiny amount, and it consistently degraded behaviour. Keep the system prompt focused but don't turn this into a sport.

Switching providers per turn. We tried mixing Anthropic for Sonnet and DeepSeek V3 for explorer turns. Worked technically. The DeepSeek-emitted tool calls had subtle format differences that the coordinator (Sonnet, on a different family) sometimes mis-parsed when it received them as worker results. We lost more time debugging the seam than we saved on the bill. Not saying don't do it, saying check the integration carefully.

What changed downstream

The 60% number is the headline. The thing that actually mattered more, week to week, is that we stopped flinching at the bill. Before routing, every long debugging session came with a low-grade anxiety about cost. After routing, sessions feel free in the way that local tools feel free, and we use the agent more often as a result.

That second-order effect is the underrated one. The cost of an agent isn't just dollars-per-call. It's how often you reach for it. If you're rationing yourself because every call feels like a small purchase, you're getting less value out of the tool than you should. Routing the cheap turns to a cheap model is what made our team actually use the agent the way we'd hoped.

How to try this on your own setup

If you have an agent on a single model right now, the cheapest experiment to run is: instrument turns by type for one week. Tag exploration vs editing vs the rest. Look at the split. If exploration is north of 30%, you have routing money on the table.
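
A minimal version of that instrumentation might look like the sketch below. It assumes you can get one tool name per turn out of your own logs. The tool-to-type mapping follows the session table, while `tag_turn`, `split_by_kind`, and the sample log are made up for illustration.

```python
from collections import Counter

# Tool name -> turn type, following the session table.
TOOL_TO_KIND = {
    "grep": "exploration", "glob": "exploration", "read": "exploration",
    "write": "editing", "edit": "editing",
    "bash": "testing", "pytest": "testing",
    "set_goal": "planning", "plan": "planning",
}


def tag_turn(tool_name: str) -> str:
    """Classify one turn by the tool it called; unknown tools are 'other'."""
    return TOOL_TO_KIND.get(tool_name, "other")


def split_by_kind(tool_names):
    """Return each turn type's share of turns, by count."""
    counts = Counter(tag_turn(name) for name in tool_names)
    total = sum(counts.values())
    return {kind: n / total for kind, n in counts.items()} if total else {}


def has_routing_headroom(split, threshold=0.30):
    """True when exploration exceeds the 30% rule of thumb."""
    return split.get("exploration", 0.0) > threshold


# Made-up sample log: one tool name per turn.
sample_log = ["grep", "read", "edit", "bash", "glob",
              "read", "plan", "edit", "read", "compaction"]
split = split_by_kind(sample_log)
```

Counting by turn is the cheapest cut. Weighting by tokens per turn is a natural follow-up once the tags exist.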

The simplest first move is to add one specialist agent for exploration, give it the cheap model, and route grep/glob/read calls through it. That alone usually cuts a bill by a third. The other rules are refinements you find by measuring once you've made the first cut.

If you want a head start, the digitorn-code builtin already ships with this routing wired up, and the YAML is readable in one screen. Install it, run a session against your codebase, and check the daemon's session log under ~/.digitorn/logs/ to confirm which model serviced which turn.

```bash
curl -sSL https://digitorn.ai/install | sh
digitorn install hub://digitorn/digitorn-code
digitorn dev chat digitorn-code
```

