reliability pattern

Circuit breaker

Stop hammering a failing service and route to a graceful fallback.

The problem

A downstream service is degraded. Retrying every call keeps it under load and delays its recovery. Worse, the agent burns tokens on retries that were always going to fail.

Symptoms

A spike in errors from one specific tool
Agent latency climbs while the offending service is degraded
Cost per session jumps because every turn pays for retries that fail

Use when

When a tool depends on a single external service that has well-understood failure modes and a viable fallback (cache, secondary provider, graceful 'service unavailable' message).

Skip when

When there is no fallback and the service is critical to the agent's output. Better to surface the failure than to lie with cached data.

The YAML

Drop this into an app.yaml. Adjust the credential refs and module names to fit your existing setup.

app.yaml

# Trip the circuit after 3 consecutive failures, route to fallback
modules:
  web: {}
  cache: {}

execution:
  mode: conversation
  hooks:
    - id: trip_circuit
      "on": tool_end
      condition:
        type: all_of
        conditions:
          - { type: tool_name, match: "web.fetch" }
          - { type: tool_failed }
          - { type: expression, expr: "session.consecutive_failures.web_fetch >= 3" }
      action:
        type: chain
        actions:
          - { type: module_action, module: cache, action: set, params: { key: "circuit:web_fetch", value: "open", ttl: 60 } }
          - { type: log, level: warn, message: "web.fetch circuit opened" }

    - id: serve_from_cache_when_open
      "on": tool_start
      condition:
        type: all_of
        conditions:
          - { type: tool_name, match: "web.fetch" }
          - { type: expression, expr: "cache.get('circuit:web_fetch') == 'open'" }
      action:
        type: gate
        result:
          status: "service_unavailable"
          fallback: "cache"
          retry_after: 60

agents:
  - id: helper
    modules: [{web: [fetch]}, {cache: [get, set]}]
    brain: { model: claude-haiku-4-5, credential: anthropic_main }
    system_prompt: |
      If a tool returns status: service_unavailable, do not retry, tell
      the user the live data is unavailable and offer the cached version.

How it works

Walking through the YAML one block at a time so the design is clear, not memorised.

01

Track consecutive failures per tool

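The runtime exposes session.consecutive_failures.{tool_name}, with web.fetch addressed as web_fetch in the YAML. The trip_circuit hook runs on tool_end and fires once a failed web.fetch call brings the count to three or more.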
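To make the counter concrete, here is a minimal Python sketch of per-tool consecutive-failure tracking. It is not the runtime's implementation: FailureTracker and record are hypothetical names, and resetting the streak on any success is an assumption about what "consecutive" means here.

```python
from collections import defaultdict


class FailureTracker:
    """Counts consecutive failures per tool; any success resets that tool's count."""

    def __init__(self) -> None:
        self._counts: dict[str, int] = defaultdict(int)

    def record(self, tool: str, ok: bool) -> int:
        """Record one call outcome and return the tool's current failure streak."""
        if ok:
            self._counts[tool] = 0  # assumption: a success clears the streak
        else:
            self._counts[tool] += 1
        return self._counts[tool]


tracker = FailureTracker()
outcomes = [False, False, True, False, False, False]
streaks = [tracker.record("web_fetch", ok) for ok in outcomes]

# The success in the middle resets the streak, so the threshold of 3
# is only reached on the final call.
tripped = [streak >= 3 for streak in streaks]
```

Counting consecutive failures rather than total failures means an occasional error on an otherwise healthy service never trips the circuit.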

02

Open the circuit by storing a flag

A cache entry with a 60s TTL is the open-circuit signal. Long enough to give the upstream room to recover, short enough that the next call after the cooldown probes the service again.
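The flag-with-expiry idea can be sketched in a few lines of Python. TTLCache is a hypothetical in-memory stand-in for the cache module, not its real API, and the injectable clock exists only to keep the example deterministic.

```python
import time
from typing import Callable, Optional


class TTLCache:
    """Tiny in-memory cache whose entries expire after a TTL given in seconds."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entries: dict[str, tuple[str, float]] = {}

    def set(self, key: str, value: str, ttl: float) -> None:
        self._entries[key] = (value, self._clock() + ttl)

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: the flag disappears on its own
            return None
        return value


# A fake clock stands in for real time.
now = [0.0]
cache = TTLCache(clock=lambda: now[0])
cache.set("circuit:web_fetch", "open", ttl=60)

now[0] = 59.0
open_at_59 = cache.get("circuit:web_fetch") == "open"  # still inside the cooldown

now[0] = 60.0
open_at_60 = cache.get("circuit:web_fetch") == "open"  # TTL elapsed, flag gone
```

Because expiry closes the circuit, nothing has to remember to reset it. The cost is that the cooldown is fixed at the moment the flag is written.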

03

Block the tool while the circuit is open

The gate action inside a tool_start hook intercepts the call before it reaches the network. The agent sees a structured result, not an error.
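As a rough illustration of the gate, the Python sketch below short-circuits a call when a predicate reports the circuit as open. gated_call and fake_fetch are hypothetical helpers; only the shape of the returned result is taken from the YAML.

```python
from typing import Any, Callable


def gated_call(
    is_open: Callable[[], bool],
    tool: Callable[[str], dict[str, Any]],
    url: str,
) -> dict[str, Any]:
    """Return a structured 'unavailable' result instead of invoking the tool."""
    if is_open():
        return {"status": "service_unavailable", "fallback": "cache", "retry_after": 60}
    return tool(url)


calls: list[str] = []


def fake_fetch(url: str) -> dict[str, Any]:
    """Stand-in for the real network call; records that it was reached."""
    calls.append(url)
    return {"status": "ok", "body": "live data"}


blocked = gated_call(lambda: True, fake_fetch, "https://example.com")
passed = gated_call(lambda: False, fake_fetch, "https://example.com")
```

Only the second call reaches fake_fetch. The first returns a plain dictionary the agent can reason about, which is what lets the system prompt tell it not to retry.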

04

Self-healing on the next probe

After 60s the circuit cache entry expires. The next call goes through. If it fails three times again, the circuit reopens with the same logic.
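The full trip, block, expire, probe cycle can be simulated with a small Python class. Breaker is a hypothetical sketch, not the runtime. It resets the failure streak when the circuit opens so that three fresh failures are needed to reopen, which matches the description above but is an assumption about how the runtime counts.

```python
class Breaker:
    """Open after `threshold` consecutive failures, close again after `cooldown` seconds."""

    def __init__(self, threshold: int = 3, cooldown: float = 60.0) -> None:
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.open_until = 0.0

    def call(self, now: float, succeeds: bool) -> str:
        if now < self.open_until:
            return "blocked"  # circuit open: the service is never contacted
        if succeeds:
            self.failures = 0
            return "ok"
        self.failures += 1
        if self.failures >= self.threshold:
            self.open_until = now + self.cooldown
            self.failures = 0  # assumption: a fresh streak is needed to reopen
        return "failed"


breaker = Breaker()
timeline = [
    (0, False), (1, False), (2, False),     # three failures trip the circuit at t=2
    (30, True),                             # inside the cooldown: blocked regardless
    (62, False), (63, False), (64, False),  # cooldown over: calls go through and fail
    (70, True),                             # reopened at t=64, so blocked until t=124
    (124, True),                            # second cooldown over, service recovered
]
results = [breaker.call(t, ok) for t, ok in timeline]
```

Note the cost of this simple design: after each cooldown the service absorbs up to three more failing calls before the circuit reopens.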

Other ways to solve it

The pattern above is not the only answer. Here is when something else is the right call.

Alternative

Half-open state with single-probe

The classic circuit-breaker pattern: after the cooldown, allow one request through. If it succeeds, close the circuit; if it fails, reopen for another cycle. Slightly more code, much smoother under intermittent failures.

Prefer when: The downstream service flaps for short windows and you want to avoid thundering-herd behavior at every cooldown.
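A half-open breaker is a three-state machine. The Python sketch below shows one way to write it. HalfOpenBreaker, allow, and record are hypothetical names, and the threshold and cooldown values simply mirror the YAML.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class HalfOpenBreaker:
    def __init__(self, threshold: int = 3, cooldown: float = 60.0) -> None:
        self.threshold = threshold
        self.cooldown = cooldown
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def allow(self, now: float) -> bool:
        """Decide whether a call may reach the service at time `now`."""
        if self.state is State.CLOSED:
            return True
        if self.state is State.OPEN and now - self.opened_at >= self.cooldown:
            self.state = State.HALF_OPEN  # admit exactly one probe
            return True
        return False  # still cooling down, or a probe is already in flight

    def record(self, now: float, ok: bool) -> None:
        """Report the outcome of a call that `allow` admitted."""
        if ok:
            self.state, self.failures = State.CLOSED, 0
            return
        self.failures += 1
        if self.state is State.HALF_OPEN or self.failures >= self.threshold:
            self.state, self.opened_at = State.OPEN, now


b = HalfOpenBreaker()
for t in (0, 1, 2):
    b.allow(t)
    b.record(t, ok=False)         # the third failure opens the circuit at t=2
during_cooldown = b.allow(30)      # rejected: cooldown still running
first_probe = b.allow(62)          # admitted: cooldown elapsed, now half-open
second_probe = b.allow(62)         # rejected: only one probe at a time
b.record(62, ok=False)             # a failed probe reopens immediately
after_failed_probe = b.allow(63)   # rejected: a new cooldown has started
probe_again = b.allow(122)         # admitted after the second cooldown
b.record(122, ok=True)             # a successful probe closes the circuit
final_state = b.state
```

The difference from the TTL version is that a single failed probe reopens the circuit, so a still-broken service sees one request per cooldown instead of three.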

Alternative

Plain retry with backoff

Simpler. Costs more in degraded scenarios because every call still reaches the failing service.

Prefer when: Tools where failures are isolated, the upstream does not benefit from being left alone, and the cost of failed retries is small.
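For comparison, plain retry with exponential backoff fits in one function. This is a generic Python sketch, not a built-in of the runtime. retry_with_backoff and its parameters are hypothetical, and the sleep function is injectable so the example runs instantly.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn` up to `attempts` times, doubling the wait after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts - 1):
        try:
            return fn()
        except Exception:
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    return fn()  # final attempt: let any error propagate to the caller


delays: list[float] = []
outcomes = iter([False, False, True])


def flaky() -> str:
    """Fails twice, then succeeds."""
    if not next(outcomes):
        raise ConnectionError("service unavailable")
    return "ok"


result = retry_with_backoff(flaky, sleep=delays.append)
```

Every call still goes to the service, which is exactly the trade-off named above. A production version would usually add random jitter to the delays and catch only errors known to be transient.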

Newsletter

Get the next post in your inbox.

Engineering notes from the Digitorn team. No marketing, no launch announcements, no "10 prompts that will change your life". Just the things we write that we'd want to read.

Related patterns

Retry with backoff (reliability): Soak transient failures with exponential backoff before they reach the user.

Rate limit with fallback (cost): Cap the agent's external calls and degrade gracefully when the cap fires.
