Digitorn
Digitorn
← All patterns
reliability pattern

Retry with backoff

Soak transient failures with exponential backoff before they reach the user.

The problem

External services fail. Rate limits, network blips, 5xx from a flaky upstream. A naive agent surfaces every glitch as an error and the user sees a tool failure that would have worked on retry.

Symptoms
  • Tool calls fail intermittently with 429, 502, 503, or socket errors
  • Users see error messages for transient problems
  • Manual reruns of the same prompt usually succeed
Use when

Any tool call that hits a network or external service with a known retry-after semantic. Search APIs, LLM providers, webhook deliveries, vector stores backed by a remote cluster.

Skip when

Side-effectful operations without idempotency keys. Retrying a non-idempotent POST can double-charge a user or send the same email twice.

The YAML

Drop this into an app.yaml. Adjust the credential refs and module names to fit your existing setup.

app.yaml
1# retry-with-backoff hook attached to flaky tool calls2modules:3  web: {}4  channels:5    config:6      slack: { bot_token: { credential: slack_bot } }78execution:9  mode: conversation10  hooks:11    - id: retry_web_search12      "on": tool_end13      condition:14        type: all_of15        conditions:16          - { type: tool_name, match: "web.search" }17          - { type: tool_failed }18          - { type: error_type, match: "RateLimit|ServiceUnavailable|Timeout" }19      action:20        type: chain21        actions:22          - { type: log, message: "retrying web.search after {{tool.error}}" }23          - { type: shell, command: "sleep $(( (2 ** {{tool.attempt}}) ))" }24          - { type: module_action, module: web, action: search, params: "{{tool.params}}" }25      max_fires: 52627agents:28  - id: helper29    modules: [{web: [search, fetch]}]30    brain: { model: claude-haiku-4-5, credential: anthropic_main }

How it works

Walking through the YAML one block at a time so the design is clear, not memorised.

01

Hook on tool_end with a typed error filter

The hook fires only when web.search ends with a recognised transient error. Permanent errors (auth, validation) skip the retry entirely so the user sees them.

02

Exponential delay with shell sleep

Backoff doubles per attempt: 2s, 4s, 8s, 16s, 32s. The shell action templates {{tool.attempt}} from the hook context. Total ceiling around a minute.

03

Re-invoke via module_action

The hook calls the same tool with the original params. The runtime swaps the failed tool result with the retry result, so the model sees only the final outcome, never the noise.

04

Bounded by max_fires

Five tries maximum. After that the original failure surfaces and the agent decides what to do with it.

Other ways to solve it

The pattern above is not the only answer. Here is when something else is the right call.

Alternative

Inline retry inside the tool

Bake retry logic into the module action itself. Simpler call site, harder to reason about because the retry is invisible to the agent loop.

Prefer when: When the same retry policy applies to every consumer of the tool and there's nothing the agent could decide differently.
Alternative

Circuit breaker without retry

Skip the retry, open a circuit, return a cached result or a graceful fallback. Better for high-traffic services where retries amplify the load.

Prefer when: Public-facing apps under load, where a 429 from a downstream means it's already overwhelmed.
Read the deep dive

10 apps you can ship in 50 lines of YAML

Read article
Newsletter

Get the next post in your inbox.

Engineering notes from the Digitorn team. No marketing, no launch announcements, no "10 prompts that will change your life". Just the things we write that we'd want to read.

One-click unsubscribe. We never share your address. Powered by our own infrastructure, not a tracker.

Related patterns

reliabilityCircuit breakerStop hammering a failing service and route to a graceful fallback.costRate limit with fallbackCap the agent's external calls and degrade gracefully when the cap fires.