A single Sonnet-class agent answers everything. Half the queries are 'what is X' lookups that a Haiku-class model handles for a tenth of the cost. The expensive model spends most of its budget on questions it never needed to see.
- Cost per session is uniformly high regardless of query complexity
- Average response is short but the model is configured large
- Users say simple questions feel slow
Apps with a wide query distribution where many requests are simple and a few are complex. Customer support, knowledge base assistants, internal helpdesks.
Apps where every query touches the same tool surface and the cost of misrouting (a small model failing) is high.
The YAML
Drop this into an app.yaml. Adjust the credential refs and module names to fit your existing setup.
1modules:2 web: {}3 rag: { config: { backend: { type: qdrant, path: ./kb } } }45agents:6 - id: router7 modules: [{agent_spawn: [Agent]}]8 brain:9 provider: anthropic10 model: claude-haiku-4-511 credential: anthropic_main12 system_prompt: |13 You route. Classify the user message in one word:14 SIMPLE -> single fact or definition, dispatch fast_helper15 RESEARCH -> needs the web, dispatch researcher16 EXPERT -> multi-step or ambiguous, dispatch expert17 Then call Agent(specialist=<picked>, prompt=<original>, wait=true)18 and return the result verbatim.1920 - id: fast_helper21 role: specialist22 modules: [{rag: [search]}]23 brain: { model: claude-haiku-4-5, credential: anthropic_main }24 system_prompt: "Answer in one paragraph using rag.search results."2526 - id: researcher27 role: specialist28 modules: [{web: [search, fetch]}, {rag: [search]}]29 brain: { model: claude-sonnet-4-6, credential: anthropic_main }30 system_prompt: "Research, cite sources, write a concise answer."3132 - id: expert33 role: specialist34 modules: [{web: [search, fetch]}, {rag: [search]}]35 brain: { model: claude-opus-4-7, credential: anthropic_main }36 system_prompt: "Reason carefully, ask clarifying questions if needed."How it works
Walking through the YAML one block at a time so the design is clear, not memorised.
A cheap classifier sits at the front
The router runs Haiku-class. Its only job is one-word classification of incoming messages, almost free per call.
Specialists are sized to their workload
Three specialists: Haiku for trivial lookups, Sonnet for research, Opus for hard reasoning. Each only sees the queries that need it.
The router forwards verbatim
Returning the specialist's output as-is keeps the response identical to what a single big model would have produced. No double summarisation.
Cost falls because the distribution is skewed
Most workloads are 70% simple, 25% research, 5% expert. With a 1:10:100 cost ratio between Haiku and Opus, the average query cost drops 5-15x.
Other ways to solve it
The pattern above is not the only answer. Here is when something else is the right call.
Single expensive model
Simpler config, predictable quality. You pay the expert price on every query.
Embedding-based router
Replace the LLM router with a local embedding classifier trained on past queries. Near-zero cost, lower flexibility.
Get the next post in your inbox.
Engineering notes from the Digitorn team. No marketing, no launch announcements, no "10 prompts that will change your life". Just the things we write that we'd want to read.