External services fail. Rate limits, network blips, 5xx from a flaky upstream. A naive agent surfaces every glitch as an error and the user sees a tool failure that would have worked on retry.
- Tool calls fail intermittently with 429, 502, 503, or socket errors
- Users see error messages for transient problems
- Manual reruns of the same prompt usually succeed
Any tool call that hits a network or external service with a known retry-after semantic. Search APIs, LLM providers, webhook deliveries, vector stores backed by a remote cluster.
Side-effectful operations without idempotency keys. Retrying a non-idempotent POST can double-charge a user or send the same email twice.
The YAML
Drop this into an app.yaml. Adjust the credential refs and module names to fit your existing setup.
1# retry-with-backoff hook attached to flaky tool calls2modules:3 web: {}4 channels:5 config:6 slack: { bot_token: { credential: slack_bot } }78execution:9 mode: conversation10 hooks:11 - id: retry_web_search12 "on": tool_end13 condition:14 type: all_of15 conditions:16 - { type: tool_name, match: "web.search" }17 - { type: tool_failed }18 - { type: error_type, match: "RateLimit|ServiceUnavailable|Timeout" }19 action:20 type: chain21 actions:22 - { type: log, message: "retrying web.search after {{tool.error}}" }23 - { type: shell, command: "sleep $(( (2 ** {{tool.attempt}}) ))" }24 - { type: module_action, module: web, action: search, params: "{{tool.params}}" }25 max_fires: 52627agents:28 - id: helper29 modules: [{web: [search, fetch]}]30 brain: { model: claude-haiku-4-5, credential: anthropic_main }How it works
Walking through the YAML one block at a time so the design is clear, not memorised.
Hook on tool_end with a typed error filter
The hook fires only when web.search ends with a recognised transient error. Permanent errors (auth, validation) skip the retry entirely so the user sees them.
Exponential delay with shell sleep
Backoff doubles per attempt: 2s, 4s, 8s, 16s, 32s. The shell action templates {{tool.attempt}} from the hook context. Total ceiling around a minute.
Re-invoke via module_action
The hook calls the same tool with the original params. The runtime swaps the failed tool result with the retry result, so the model sees only the final outcome, never the noise.
Bounded by max_fires
Five tries maximum. After that the original failure surfaces and the agent decides what to do with it.
Other ways to solve it
The pattern above is not the only answer. Here is when something else is the right call.
Inline retry inside the tool
Bake retry logic into the module action itself. Simpler call site, harder to reason about because the retry is invisible to the agent loop.
Circuit breaker without retry
Skip the retry, open a circuit, return a cached result or a graceful fallback. Better for high-traffic services where retries amplify the load.
10 apps you can ship in 50 lines of YAML
Get the next post in your inbox.
Engineering notes from the Digitorn team. No marketing, no launch announcements, no "10 prompts that will change your life". Just the things we write that we'd want to read.