cost pattern

Rate limit with fallback

Cap the agent's external calls and degrade gracefully when the cap fires.

The problem

An agent in a loop can hammer an expensive API and spike your bill. You want a hard ceiling on calls per session or per minute, not a soft promise.

Symptoms

Single sessions costing 10x the average
One runaway agent exhausting an external API quota
Cost reports show heavy concentration in a small fraction of sessions

Use when

Any tool with per-call cost or upstream rate limits. LLM providers, paid search APIs, vector store queries against managed services.

Skip when

Free internal services where rate limiting just adds latency without preventing real harm.

The YAML

Drop this into an app.yaml. Adjust the credential refs and module names to fit your existing setup.

app.yaml

modules:
  web: {}
  cache: {}

execution:
  mode: conversation
  hooks:
    - id: cap_searches_per_session
      "on": tool_start
      condition:
        type: all_of
        conditions:
          - { type: tool_name, match: "web.search" }
          - { type: expression, expr: "session.tool_calls.web_search.count >= 10" }
      action:
        type: gate
        result:
          status: "rate_limited"
          message: "session cap reached, use cached results"
          fallback_tool: "cache.get"
          fallback_params: { key: "last_search:{{tool.params.query}}" }

    - id: warn_at_5
      "on": tool_end
      condition:
        type: all_of
        conditions:
          - { type: tool_name, match: "web.search" }
          - { type: expression, expr: "session.tool_calls.web_search.count == 5" }
      action:
        type: inject_message
        role: system
        content: "You have used 5 of 10 search calls this session. Consolidate."

agents:
  - id: helper
    modules: [{web: [search, fetch]}, {cache: [get, set]}]
    brain: { model: claude-haiku-4-5, credential: anthropic_main }

How it works

Walking through the YAML one block at a time so the design is clear, not memorised.

01

Cap is enforced at the runtime, not the prompt

Telling the model 'don't make too many calls' fails about a third of the time. The runtime hook is deterministic: once the session has made 10 web.search calls, the next one is gated.
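To make the determinism concrete, here is a minimal Python sketch of the same idea outside the runtime. SessionCap and its methods are hypothetical names for illustration, not part of the runtime's API; the only things taken from the YAML are the tool name and the limit of 10.

```python
from collections import Counter


class SessionCap:
    """Hypothetical stand-in for the runtime hook: counts calls per tool, gates at a limit."""

    def __init__(self, limits):
        self.limits = dict(limits)  # e.g. {"web.search": 10}
        self.counts = Counter()     # completed calls per tool

    def allow(self, tool_name):
        """Checked before a tool starts; False once that tool's cap is reached."""
        limit = self.limits.get(tool_name)
        return limit is None or self.counts[tool_name] < limit

    def record(self, tool_name):
        """Called after a tool call finishes."""
        self.counts[tool_name] += 1


cap = SessionCap({"web.search": 10})
allowed = 0
for _ in range(25):  # a runaway loop asks for 25 searches
    if cap.allow("web.search"):
        cap.record("web.search")
        allowed += 1
print(allowed)  # 10
```

However many times the loop asks, the counter decides, not the model.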

02

Soft warning at the halfway point

An inject_message hook tells the agent it has used half its budget. Models react to this signal and consolidate their calls.
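The warning condition uses == 5 rather than >= 5, so the message is injected once instead of after every later call. A small Python sketch of that check follows; budget_warning is a hypothetical helper, and the message text is copied from the YAML.

```python
def budget_warning(count, cap=10, warn_at=5):
    """Return the system message only when the count lands exactly on warn_at."""
    if count == warn_at:
        return f"You have used {count} of {cap} search calls this session. Consolidate."
    return None


# Simulate the count after each of 10 searches and collect injected messages.
messages = [m for m in (budget_warning(n) for n in range(1, 11)) if m]
print(len(messages))  # 1
```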

03

Graceful fallback to cache

When the cap fires, the gate's fallback_tool returns a cached result instead of an error. The agent does not crash; it gets degraded data.
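The gate-then-fallback flow can be sketched in Python as below. This is an illustration under assumptions, not the runtime's implementation: gated_search, the state dict, and the plain-dict cache are hypothetical, and the sketch assumes each successful search is written to the cache under the same last_search:<query> key the YAML reads from.

```python
def gated_search(query, state, cache, search_fn, cap=10):
    """Run search_fn until the cap is hit, then serve the cached result instead."""
    key = f"last_search:{query}"
    if state["count"] >= cap:
        return {
            "status": "rate_limited",
            "message": "session cap reached, use cached results",
            "data": cache.get(key),  # None on a cache miss
        }
    result = search_fn(query)
    state["count"] += 1
    cache[key] = result  # assumption: results are cached under the fallback key
    return {"status": "ok", "data": result}


def fake_search(query):
    """Stand-in for the paid search API."""
    return f"results for {query}"


session, store = {"count": 0}, {}
for _ in range(10):
    gated_search("llm pricing", session, store, fake_search)
print(gated_search("llm pricing", session, store, fake_search)["status"])  # rate_limited
```

In this sketch a query that was never searched before the cap fired comes back with data set to None, so the fallback only helps for repeated queries.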

04

Bills become predictable

Worst-case web.search cost per session is 10 calls. You can compute an upper bound for monthly search spend without staring at the dashboard.
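The upper bound is a single multiplication. The Python sketch below uses made-up numbers (20,000 sessions a month at 0.005 per search call); neither figure comes from a real price list, so substitute your own.

```python
def monthly_search_ceiling(sessions_per_month, price_per_call, cap_per_session=10):
    """Upper bound on monthly search spend: every session uses its full cap."""
    return sessions_per_month * cap_per_session * price_per_call


# Illustrative inputs only: 20,000 sessions, 0.005 currency units per call.
ceiling = monthly_search_ceiling(20_000, 0.005)
print(f"{ceiling:.2f}")
```

With those inputs the ceiling comes to about 1,000 currency units a month. Real spend will normally sit below it, because most sessions never reach the cap.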

Other ways to solve it

The pattern above is not the only answer. Here is when something else is the right call.

Alternative

Cost ceiling on the runtime

max_tokens_per_run kills the whole turn when total token cost exceeds a threshold. Coarser, simpler, no per-tool granularity.

Prefer when: You don't care which tool blew the budget, only that the budget is respected.
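For contrast with the per-tool cap, a coarse token ceiling can be sketched in a few lines of Python. This only illustrates the idea; it is not how max_tokens_per_run is implemented, and RunBudget and TokenBudgetExceeded are hypothetical names.

```python
class TokenBudgetExceeded(Exception):
    """Raised when a run's cumulative token use passes its ceiling."""


class RunBudget:
    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, tokens):
        """Add tokens from any source; abort the run once the total is over budget."""
        self.used += tokens
        if self.used > self.max_tokens:
            raise TokenBudgetExceeded(f"{self.used} > {self.max_tokens}")


budget = RunBudget(max_tokens=1000)
budget.charge(400)  # model call
budget.charge(400)  # tool output fed back to the model
try:
    budget.charge(300)  # pushes the total to 1100
    aborted = False
except TokenBudgetExceeded:
    aborted = True
print(aborted)  # True
```

Note that charge takes no tool name: the budget has no idea which call was expensive, which is the trade-off described above.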

Alternative

Rate limit at the gateway

Put a proxy in front of the external API that enforces RPM globally. This works for many agents at once, but it doesn't help with single-session cost.

Prefer when: Multi-tenant deployments where the upstream's quota is shared and per-session caps aren't enough.
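A gateway-side RPM limit is often built as a sliding window over recent request timestamps. The Python sketch below shows that technique in-process; RpmLimiter is a hypothetical class, and a real proxy would keep this state somewhere shared between workers.

```python
import time
from collections import deque


class RpmLimiter:
    """Sliding-window limiter: at most max_per_minute requests in any 60 s span."""

    def __init__(self, max_per_minute, clock=time.monotonic):
        self.max_per_minute = max_per_minute
        self.clock = clock     # injectable so the limiter can be tested without sleeping
        self.stamps = deque()  # timestamps of accepted requests, oldest first

    def try_acquire(self):
        now = self.clock()
        # Drop timestamps that have aged out of the 60-second window.
        while self.stamps and now - self.stamps[0] >= 60.0:
            self.stamps.popleft()
        if len(self.stamps) >= self.max_per_minute:
            return False
        self.stamps.append(now)
        return True


limiter = RpmLimiter(max_per_minute=3)
results = [limiter.try_acquire() for _ in range(4)]
print(results)  # [True, True, True, False]
```

The limiter is shared by every caller, so one session can still consume the whole minute's allowance. That is why it complements the per-session cap rather than replacing it.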

Newsletter

Get the next post in your inbox.

Engineering notes from the Digitorn team. No marketing, no launch announcements, no "10 prompts that will change your life". Just the things we write that we'd want to read.

Related patterns

reliability
Circuit breaker
Stop hammering a failing service and route to a graceful fallback.

data
Audit everything
Every tool call, every credential read, hash-chained for replay.
