Digitorn
performance pattern

Semantic router

Pick the cheapest specialist that can answer, instead of one big model.

The problem

A single Sonnet-class agent answers everything. Half the queries are 'what is X' lookups that a Haiku-class model handles for a tenth of the cost. The expensive model spends most of its budget on questions it never needed to see.

Symptoms
  • Cost per session is uniformly high regardless of query complexity
  • Average response is short but the model is configured large
  • Users say simple questions feel slow
Use when

Apps with a wide query distribution where many requests are simple and a few are complex. Customer support, knowledge base assistants, internal helpdesks.

Skip when

Apps where every query touches the same tool surface and the cost of misrouting (a small model failing) is high.

The YAML

Drop this into an app.yaml. Adjust the credential refs and module names to fit your existing setup.

app.yaml
modules:
  web: {}
  rag: { config: { backend: { type: qdrant, path: ./kb } } }

agents:
  - id: router
    modules: [{agent_spawn: [Agent]}]
    brain:
      provider: anthropic
      model: claude-haiku-4-5
      credential: anthropic_main
    system_prompt: |
      You route. Classify the user message in one word:
        SIMPLE   -> single fact or definition, dispatch fast_helper
        RESEARCH -> needs the web, dispatch researcher
        EXPERT   -> multi-step or ambiguous, dispatch expert
      Then call Agent(specialist=<picked>, prompt=<original>, wait=true)
      and return the result verbatim.

  - id: fast_helper
    role: specialist
    modules: [{rag: [search]}]
    brain: { model: claude-haiku-4-5, credential: anthropic_main }
    system_prompt: "Answer in one paragraph using rag.search results."

  - id: researcher
    role: specialist
    modules: [{web: [search, fetch]}, {rag: [search]}]
    brain: { model: claude-sonnet-4-6, credential: anthropic_main }
    system_prompt: "Research, cite sources, write a concise answer."

  - id: expert
    role: specialist
    modules: [{web: [search, fetch]}, {rag: [search]}]
    brain: { model: claude-opus-4-7, credential: anthropic_main }
    system_prompt: "Reason carefully, ask clarifying questions if needed."

How it works

Walking through the YAML one block at a time so the design is clear, not memorised.

01

A cheap classifier sits at the front

The router runs on a Haiku-class model. Its only job is to classify each incoming message in one word, which costs almost nothing per call.
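A one-word classifier still needs defensive parsing, because a model can return stray whitespace, punctuation, or a full sentence. The sketch below is illustrative only and not part of Digitorn: pick_specialist and the fall-back-to-expert policy are assumptions, while the labels and specialist ids are taken from the app.yaml above.

```python
# Hypothetical helper: maps the router's one-word label to a specialist id.
# Labels and ids mirror the app.yaml; the fallback policy is an assumption.
ROUTES = {
    "SIMPLE": "fast_helper",
    "RESEARCH": "researcher",
    "EXPERT": "expert",
}


def pick_specialist(raw_label: str, default: str = "expert") -> str:
    """Normalise the classifier output and look up the matching specialist."""
    words = raw_label.strip().split()
    if not words:
        # Empty output: send the query to the most capable specialist.
        return default
    # Keep only the first word, drop trailing punctuation, ignore case.
    label = words[0].strip(".,:;!\"'").upper()
    # Unknown labels also fall back to the default rather than failing.
    return ROUTES.get(label, default)


if __name__ == "__main__":
    for raw in ["simple", " RESEARCH.\n", "I think this is hard", ""]:
        print(repr(raw), "->", pick_specialist(raw))
```

Falling back to the largest specialist trades some cost for safety: an unparseable label becomes an expensive answer rather than a wrong one.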

02

Specialists are sized to their workload

Three specialists: Haiku for trivial lookups, Sonnet for research, Opus for hard reasoning. Each only sees the queries that need it.

03

The router forwards verbatim

The router returns the specialist's output as-is, so the user sees exactly what the specialist wrote. The router does not rewrite or summarise it a second time, which avoids double summarisation and the detail it would lose.
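The forwarding contract can be sketched in a few lines: the original message goes to the specialist unchanged, and the specialist's answer comes back unchanged. This is a minimal sketch with stub functions standing in for real agents; route, the stubs, and the label-keyed dictionary are hypothetical names, not Digitorn APIs.

```python
from typing import Callable, Dict

# A specialist is modelled as a plain function from prompt to answer.
Specialist = Callable[[str], str]


def route(
    message: str,
    classify: Callable[[str], str],
    specialists: Dict[str, Specialist],
    default: str = "EXPERT",
) -> str:
    """Classify, dispatch the original message, return the answer verbatim."""
    label = classify(message).strip().upper()
    handler = specialists.get(label, specialists[default])
    # No rewriting on the way in, no summarising on the way out.
    return handler(message)


if __name__ == "__main__":
    # Stub specialists that tag their output so the dispatch is visible.
    stubs: Dict[str, Specialist] = {
        "SIMPLE": lambda prompt: "fast_helper: " + prompt,
        "RESEARCH": lambda prompt: "researcher: " + prompt,
        "EXPERT": lambda prompt: "expert: " + prompt,
    }
    print(route("what is RAG?", lambda m: "simple", stubs))
    print(route("design a migration plan", lambda m: "unclear", stubs))
```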

04

Cost falls because the distribution is skewed

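Suppose a workload is 70% simple, 25% research, and 5% expert, and the per-query cost ratio of Haiku to Sonnet to Opus is roughly 1:10:100. The blended cost is then 0.70 × 1 + 0.25 × 10 + 0.05 × 100 = 8.2 units per query, against 100 units if every query went to Opus: about a 12x drop. Depending on the actual mix, expect something in the 5-15x range relative to an Opus-only setup. The saving is smaller if your baseline is a mid-sized model.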
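The arithmetic is easy to rerun with your own numbers. The sketch below uses the illustrative 70/25/5 mix and the 1:10:100 cost ratio from the text, and additionally charges one Haiku-priced router call per query, which is an assumption about how the classification call is billed. All figures are relative units, not real prices.

```python
# Share of queries per tier and relative cost per query (illustrative values).
MIX = {"simple": 0.70, "research": 0.25, "expert": 0.05}
COST = {"simple": 1.0, "research": 10.0, "expert": 100.0}
ROUTER_COST = 1.0  # assumption: one Haiku-priced classification per query


def blended_cost(mix, cost, router_cost=0.0):
    """Expected cost per query: router overhead plus the weighted tier costs."""
    return router_cost + sum(share * cost[tier] for tier, share in mix.items())


def saving_factor(baseline_cost, routed_cost):
    """How many times cheaper the routed setup is than a single-model baseline."""
    return baseline_cost / routed_cost


if __name__ == "__main__":
    routed = blended_cost(MIX, COST, ROUTER_COST)
    print("routed cost per query:", round(routed, 2))
    print("vs Opus-only:", round(saving_factor(COST["expert"], routed), 1), "x")
    print("vs Sonnet-only:", round(saving_factor(COST["research"], routed), 1), "x")
```

With these inputs the routed cost is 9.2 units per query, roughly 11x cheaper than an Opus-only baseline but only slightly cheaper than a Sonnet-only one, so the baseline you compare against matters as much as the mix.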

Other ways to solve it

The pattern above is not the only answer. Here is when something else is the right call.

Alternative

Single expensive model

Simpler config, predictable quality. You pay the expert price on every query.

Prefer when: You cannot afford a misroute and the workload is uniformly hard.
Alternative

Embedding-based router

Replace the LLM router with a local embedding classifier trained on past queries. Near-zero cost, lower flexibility.

Prefer when: High-volume apps where even a Haiku call per query is too expensive and the routing rules are stable.
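To make the embedding-based alternative concrete, here is a toy nearest-centroid router. It is a sketch under loud assumptions: embed uses bag-of-words counts as a stand-in for a real embedding model, the labelled past queries are invented for illustration, and the 0.1 similarity threshold is arbitrary. A production version would swap in real embeddings and keep the same structure.

```python
import math
from collections import Counter


def embed(text):
    # Stand-in for a real embedding model: bag-of-words token counts.
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(examples):
    """Build one centroid per label by summing the vectors of past queries."""
    centroids = {}
    for text, label in examples:
        centroids.setdefault(label, Counter()).update(embed(text))
    return centroids


def classify(text, centroids, default="EXPERT", threshold=0.1):
    """Pick the closest centroid; fall back to default below the threshold."""
    vec = embed(text)
    best_label, best_score = default, threshold
    for label, centroid in centroids.items():
        score = cosine(vec, centroid)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training data, for illustration only.
PAST_QUERIES = [
    ("what is a vector database", "SIMPLE"),
    ("what is rag", "SIMPLE"),
    ("find recent news about qdrant releases", "RESEARCH"),
    ("compare and design a migration plan for our cluster", "EXPERT"),
]

if __name__ == "__main__":
    centroids = train(PAST_QUERIES)
    for query in ["what is qdrant", "recent news about releases", "zzz"]:
        print(query, "->", classify(query, centroids))
```

Queries that resemble nothing in the training set fall through to the default label, which mirrors the trade-off named above: near-zero cost per query, but the router only knows the patterns it was trained on.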

Related patterns

  • Fan out, join (performance): Spawn N specialists in parallel, wait for all, fold the results.
  • Summarise and feed back (performance): Compress past turns when the context window starts to bite.