External services fail. Rate limits, network blips, 5xx from a flaky upstream. A naive agent surfaces every glitch as an error and the user sees a tool failure that would have worked on retry.
Use this pattern for any tool call that hits a network or external service with known retry-after semantics: search APIs, LLM providers, webhook deliveries, vector stores backed by a remote cluster.
Do not use it for side-effectful operations without idempotency keys. Retrying a non-idempotent POST can double-charge a user or send the same email twice.
Drop this into an app.yaml. Adjust the credential refs and module names to fit your existing setup.
```yaml
# retry-with-backoff hook attached to flaky tool calls
modules:
  web: {}
  channels:
    config:
      slack: { bot_token: { credential: slack_bot } }

execution:
  mode: conversation
  hooks:
    - id: retry_web_search
      "on": tool_end
      condition:
        type: all_of
        conditions:
          - { type: tool_name, match: "web.search" }
          - { type: tool_failed }
          - { type: error_type, match: "RateLimit|ServiceUnavailable|Timeout" }
      action:
        type: chain
        actions:
          - { type: log, message: "retrying web.search after {{tool.error}}" }
          - { type: shell, command: "sleep $(( (2 ** {{tool.attempt}}) ))" }
          - { type: module_action, module: web, action: search, params: "{{tool.params}}" }
      max_fires: 5

agents:
  - id: helper
    modules: [{web: [search, fetch]}]
    brain: { model: claude-haiku-4-5, credential: anthropic_main }
```

Walking through the YAML one block at a time so the design is clear, not memorised.
The hook fires only when web.search ends with a recognised transient error. Permanent errors (auth, validation) skip the retry entirely so the user sees them.
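The condition can be read as a plain boolean function. The Python sketch below mirrors the three clauses; `should_retry` is a hypothetical name, and using `re.search` for the `match` field is an assumption about how the runtime applies the pattern.

```python
import re

# Same alternation as the hook's error_type matcher.
TRANSIENT = re.compile(r"RateLimit|ServiceUnavailable|Timeout")


def should_retry(tool_name: str, failed: bool, error_type: str) -> bool:
    """Mirror of the all_of condition: every clause must hold."""
    return (
        tool_name == "web.search"
        and failed
        and TRANSIENT.search(error_type) is not None
    )
```

An auth or validation failure does not match the alternation, so it falls through to the user untouched.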
Backoff doubles per attempt: 2s, 4s, 8s, 16s, 32s. The shell action templates {{tool.attempt}} from the hook context. The sleeps add up to 62 seconds, so the total ceiling is around a minute plus the time spent in the calls themselves.
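The schedule is easy to check by hand. A small Python sketch, assuming attempts are numbered from 1 (which is what the 2s starting point implies); `backoff_schedule` is a hypothetical helper, not a runtime function.

```python
def backoff_schedule(max_attempts: int = 5, base: int = 2) -> list[int]:
    """Delay in seconds before each retry, with attempts numbered from 1."""
    return [base ** attempt for attempt in range(1, max_attempts + 1)]


delays = backoff_schedule()  # [2, 4, 8, 16, 32]
total = sum(delays)          # 62 seconds of sleep in the worst case
```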
The hook calls the same tool with the original params. The runtime swaps the failed tool result with the retry result, so the model sees only the final outcome, never the noise.
Five tries maximum. After that the original failure surfaces and the agent decides what to do with it.
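Put together, the hook behaves roughly like the loop below. This is a Python sketch of the observable behaviour, not the runtime's implementation: `TransientError` and `call_with_retries` are hypothetical names, and it assumes `max_fires` caps the number of retries rather than the total number of calls.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical stand-in for RateLimit, ServiceUnavailable, or Timeout."""


def call_with_retries(
    tool: Callable[[], T],
    max_fires: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call the tool once, then retry up to max_fires times with 2**n backoff."""
    original: Optional[TransientError] = None
    for attempt in range(0, max_fires + 1):
        if attempt > 0:
            sleep(2 ** attempt)  # 2s, 4s, 8s, ... before each retry
        try:
            return tool()  # the caller sees only the final result
        except TransientError as exc:
            if original is None:
                original = exc  # remember the first failure
    assert original is not None
    raise original  # retries exhausted: surface the original failure
```

Passing `sleep` as a parameter keeps the sketch testable without waiting out the real delays.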
The pattern above is not the only answer. Here is when something else is the right call.
One alternative is to bake the retry logic into the module action itself. The call site gets simpler, but the behaviour is harder to reason about because the retry is invisible to the agent loop.
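A retry baked into the action might look like a decorator around the function that does the work. This is a hedged Python sketch: `with_retries` is a hypothetical helper, and retrying only on `TimeoutError` is an illustrative choice, not what any particular module does.

```python
import functools
import time


def with_retries(max_attempts=5, retry_on=(TimeoutError,), sleep=time.sleep):
    """Wrap a function so transient failures are retried with 2**n backoff."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except retry_on:
                    if attempt == max_attempts:
                        raise  # out of attempts: let the error propagate
                    sleep(2 ** attempt)
        return wrapper
    return decorator
```

The caller just invokes the decorated function, which is the appeal and also the cost: nothing outside the function can see how many attempts a result took.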
Alternatively, use a circuit breaker: skip the retry, open the circuit, and return a cached result or a graceful fallback. This works better for high-traffic services, where retries amplify the load.
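For comparison, here is a minimal circuit-breaker sketch in Python. The class, its thresholds, and the `fallback` callable are hypothetical; a production breaker would also need thread safety and per-service state.

```python
import time


class CircuitBreaker:
    """Stop calling a failing upstream for a while and serve a fallback."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()  # open: skip the upstream entirely
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip (or re-trip) the breaker
            return fallback()
        self.failures = 0  # success closes the circuit again
        return result
```

While the circuit is open the upstream receives no traffic at all, which is the property that makes this a better fit than retries when the service is already overloaded.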
Engineering notes from the Digitorn team. No marketing, no launch announcements, no "10 prompts that will change your life". Just the things we write that we'd want to read.