An agent in a loop can hammer an expensive API and spike your bill. You want a hard ceiling on calls per session or per minute, not a soft promise.
Any tool with per-call cost or upstream rate limits. LLM providers, paid search APIs, vector store queries against managed services.
Free internal services where rate limiting just adds latency without preventing real harm.
Drop this into an app.yaml. Adjust the credential refs and module names to fit your existing setup.
1modules:2 web: {}3 cache: {}45execution:6 mode: conversation7 hooks:8 - id: cap_searches_per_session9 "on": tool_start10 condition:11 type: all_of12 conditions:13 - { type: tool_name, match: "web.search" }14 - { type: expression, expr: "session.tool_calls.web_search.count >= 10" }15 action:16 type: gate17 result:18 status: "rate_limited"19 message: "session cap reached, use cached results"20 fallback_tool: "cache.get"21 fallback_params: { key: "last_search:{{tool.params.query}}" }2223 - id: warn_at_524 "on": tool_end25 condition:26 type: all_of27 conditions:28 - { type: tool_name, match: "web.search" }29 - { type: expression, expr: "session.tool_calls.web_search.count == 5" }30 action:31 type: inject_message32 role: system33 content: "You have used 5 of 10 search calls this session. Consolidate."3435agents:36 - id: helper37 modules: [{web: [search, fetch]}, {cache: [get, set]}]38 brain: { model: claude-haiku-4-5, credential: anthropic_main }Walking through the YAML one block at a time so the design is clear, not memorised.
Telling the model 'don't make too many calls' fails about a third of the time. The runtime hook is deterministic: at 10 calls the next one is gated.
An inject_message hook tells the agent it has used half its budget. Models react to this signal and consolidate their calls.
When the cap fires, the gate's fallback_tool returns a cached result instead of an error. The agent does not crash, it gets degraded data.
Worst-case cost per session = 10 calls. You can compute upper bounds for monthly spend without staring at the dashboard.
The pattern above is not the only answer. Here is when something else is the right call.
max_tokens_per_run kills the whole turn when total token cost exceeds a threshold. Coarser, simpler, no per-tool granularity.
Put a proxy in front of the external API that enforces RPM globally. Works for many agents at once, doesn't help with single-session cost.
Engineering notes from the Digitorn team. No marketing, no launch announcements, no "10 prompts that will change your life". Just the things we write that we'd want to read.