A downstream service is degraded. Retrying every call keeps the upstream pinned and prevents recovery. Worse, the agent burns tokens on retries that would have failed.
- A spike in errors from one specific tool
- Agent latency climbs while the offending service is degraded
- Cost per session jumps because every turn waits for retries
When a tool depends on a single external service that has well-understood failure modes and a viable fallback (cache, secondary provider, graceful 'service unavailable' message).
When there is no fallback and the service is critical to the agent's output. Better to surface the failure than to lie with cached data.
The YAML
Drop this into an app.yaml. Adjust the credential refs and module names to fit your existing setup.
1# Trip the circuit after 3 consecutive failures, route to fallback2modules:3 web: {}4 cache: {}56execution:7 mode: conversation8 hooks:9 - id: trip_circuit10 "on": tool_end11 condition:12 type: all_of13 conditions:14 - { type: tool_name, match: "web.fetch" }15 - { type: tool_failed }16 - { type: expression, expr: "session.consecutive_failures.web_fetch >= 3" }17 action:18 type: chain19 actions:20 - { type: module_action, module: cache, action: set, params: { key: "circuit:web_fetch", value: "open", ttl: 60 } }21 - { type: log, level: warn, message: "web.fetch circuit opened" }2223 - id: serve_from_cache_when_open24 "on": tool_start25 condition:26 type: all_of27 conditions:28 - { type: tool_name, match: "web.fetch" }29 - { type: expression, expr: "cache.get('circuit:web_fetch') == 'open'" }30 action:31 type: gate32 result:33 status: "service_unavailable"34 fallback: "cache"35 retry_after: 603637agents:38 - id: helper39 modules: [{web: [fetch]}, {cache: [get, set]}]40 brain: { model: claude-haiku-4-5, credential: anthropic_main }41 system_prompt: |42 If a tool returns status: service_unavailable, do not retry, tell43 the user the live data is unavailable and offer the cached version.How it works
Walking through the YAML one block at a time so the design is clear, not memorised.
Track consecutive failures per tool
The runtime exposes session.consecutive_failures.{tool_name}. The hook fires when the count hits three.
Open the circuit by storing a flag
A cache entry with a 60s TTL is the open-circuit signal. Long enough to give the upstream room to recover, short enough that the next call after the cooldown probes the service again.
Block the tool while the circuit is open
The gate action inside a tool_start hook intercepts the call before it reaches the network. The agent sees a structured result, not an error.
Self-healing on the next probe
After 60s the circuit cache entry expires. The next call goes through. If it fails three times again, the circuit reopens with the same logic.
Other ways to solve it
The pattern above is not the only answer. Here is when something else is the right call.
Half-open state with single-probe
Classic CB pattern: after the cooldown, allow one request through. If it succeeds, close the circuit; if it fails, reopen for another cycle. Slightly more code, much smoother under intermittent failures.
Plain retry with backoff
Simpler. Costs more in degraded scenarios because every call still tries.
Get the next post in your inbox.
Engineering notes from the Digitorn team. No marketing, no launch announcements, no "10 prompts that will change your life". Just the things we write that we'd want to read.