An agent in a loop can hammer an expensive API and spike your bill. You want a hard ceiling on calls per session or per minute, not a soft promise.
- Single sessions costing 10x the average
- An external API quota gets exhausted by one runaway agent
- Cost reports show heavy concentration in a small fraction of sessions
Any tool with per-call cost or upstream rate limits. LLM providers, paid search APIs, vector store queries against managed services.
Free internal services where rate limiting just adds latency without preventing real harm.
The YAML
Drop this into an app.yaml. Adjust the credential refs and module names to fit your existing setup.
1modules:2 web: {}3 cache: {}45execution:6 mode: conversation7 hooks:8 - id: cap_searches_per_session9 "on": tool_start10 condition:11 type: all_of12 conditions:13 - { type: tool_name, match: "web.search" }14 - { type: expression, expr: "session.tool_calls.web_search.count >= 10" }15 action:16 type: gate17 result:18 status: "rate_limited"19 message: "session cap reached, use cached results"20 fallback_tool: "cache.get"21 fallback_params: { key: "last_search:{{tool.params.query}}" }2223 - id: warn_at_524 "on": tool_end25 condition:26 type: all_of27 conditions:28 - { type: tool_name, match: "web.search" }29 - { type: expression, expr: "session.tool_calls.web_search.count == 5" }30 action:31 type: inject_message32 role: system33 content: "You have used 5 of 10 search calls this session. Consolidate."3435agents:36 - id: helper37 modules: [{web: [search, fetch]}, {cache: [get, set]}]38 brain: { model: claude-haiku-4-5, credential: anthropic_main }How it works
Walking through the YAML one block at a time so the design is clear, not memorised.
Cap is enforced at the runtime, not the prompt
Telling the model 'don't make too many calls' fails about a third of the time. The runtime hook is deterministic: at 10 calls the next one is gated.
Soft warning at the halfway point
An inject_message hook tells the agent it has used half its budget. Models react to this signal and consolidate their calls.
Graceful fallback to cache
When the cap fires, the gate's fallback_tool returns a cached result instead of an error. The agent does not crash, it gets degraded data.
Bills become predictable
Worst-case cost per session = 10 calls. You can compute upper bounds for monthly spend without staring at the dashboard.
Other ways to solve it
The pattern above is not the only answer. Here is when something else is the right call.
Cost ceiling on the runtime
max_tokens_per_run kills the whole turn when total token cost exceeds a threshold. Coarser, simpler, no per-tool granularity.
Rate limit at the gateway
Put a proxy in front of the external API that enforces RPM globally. Works for many agents at once, doesn't help with single-session cost.
Get the next post in your inbox.
Engineering notes from the Digitorn team. No marketing, no launch announcements, no "10 prompts that will change your life". Just the things we write that we'd want to read.