Retry & Circuit Breaker
When a delivery fails, it is automatically retried with exponential backoff. If an endpoint fails repeatedly, the circuit breaker disables it to prevent wasting resources.
Retry Flow
Delivery attempt fails (non-2xx or network error)
│
├─ attempts++
│
├─ attempts < maxAttempts?
│   ├─ yes → status=PENDING, next_attempt_at = now + backoff delay
│   └─ no  → status=FAILED, store last_error, set completed_at
│
└─ Circuit breaker: incrementFailures(endpointId)
    ├─ consecutiveFailures < threshold  → continue
    └─ consecutiveFailures >= threshold → disable endpoint
Backoff Schedule
The default exponential backoff schedule:
| Attempt | Delay | Cumulative |
|---|---|---|
| 1 | 30 seconds | 30s |
| 2 | 5 minutes | ~5.5m |
| 3 | 30 minutes | ~35.5m |
| 4 | 2 hours | ~2.6h |
| 5 | 24 hours | ~26.6h |
| 6 | FAILED | — |
With jitter: true (the default), each delay is randomized by ±10% to prevent a thundering herd when many deliveries retry at the same time.
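The schedule and the retry decision can be sketched as follows. This is a minimal illustration, not the library's implementation: DEFAULT_SCHEDULE_MS, backoffDelayMs, onDeliveryFailure, and the FailureOutcome type are hypothetical names, and the sketch assumes that attempts counts failed attempts starting at 1 and that maxAttempts is supplied by the caller.

```typescript
// Delays from the default schedule table, in milliseconds
// (index 0 is the delay after the first failed attempt).
const DEFAULT_SCHEDULE_MS: readonly number[] = [
  30_000, // 30 seconds
  5 * 60_000, // 5 minutes
  30 * 60_000, // 30 minutes
  2 * 3_600_000, // 2 hours
  24 * 3_600_000, // 24 hours
];

// Hypothetical helper: delay to wait after the given 1-based failed attempt.
// `random` is injectable so the jitter can be tested deterministically.
function backoffDelayMs(
  attempt: number,
  jitter = true,
  random: () => number = Math.random,
): number {
  // Clamp the attempt number to the range covered by the schedule.
  const index = Math.min(Math.max(attempt, 1), DEFAULT_SCHEDULE_MS.length) - 1;
  const base = DEFAULT_SCHEDULE_MS[index];
  if (!jitter) return base;
  // Map random() in [0, 1) to a factor in [0.9, 1.1), i.e. ±10%.
  const factor = 1 + (random() * 2 - 1) * 0.1;
  return Math.round(base * factor);
}

type FailureOutcome =
  | { status: 'PENDING'; nextAttemptAt: Date }
  | { status: 'FAILED'; completedAt: Date };

// Mirrors the retry flow: `attempts` is the count after attempts++.
function onDeliveryFailure(
  attempts: number,
  maxAttempts: number,
  now: Date,
  jitter = true,
  random: () => number = Math.random,
): FailureOutcome {
  if (attempts < maxAttempts) {
    const delay = backoffDelayMs(attempts, jitter, random);
    return { status: 'PENDING', nextAttemptAt: new Date(now.getTime() + delay) };
  }
  return { status: 'FAILED', completedAt: now };
}
```

Passing the random source as a parameter keeps the jitter reproducible in tests; in production the default Math.random is enough, since the goal is only to spread retries apart.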
Configuration
WebhookModule.forRoot({
  prisma: prismaService,
  delivery: {
    maxRetries: 5,          // default: 5
    backoff: 'exponential', // fixed to 'exponential'
    jitter: true,           // default: true (±10%)
  },
  circuitBreaker: {
    failureThreshold: 5, // default: 5 consecutive failures
    cooldownMinutes: 60, // default: 60 minutes
  },
})

| Option | Type | Default | Description |
|---|---|---|---|
| delivery.maxRetries | number | 5 | Max delivery attempts before FAILED |
| delivery.jitter | boolean | true | Add ±10% random jitter to delays |
| circuitBreaker.failureThreshold | number | 5 | Consecutive failures before disabling endpoint |
| circuitBreaker.cooldownMinutes | number | 60 | Minutes before attempting recovery |
Circuit Breaker
The circuit breaker tracks consecutive failures per endpoint. It operates in three states:
Healthy (closed)
- Endpoint is active, deliveries are processed normally
- Each successful delivery resets consecutiveFailures to 0
- Each failure increments consecutiveFailures
Disabled (open)
When consecutiveFailures >= failureThreshold:
- Endpoint is marked active = false
- disabled_at is set to the current timestamp
- disabled_reason is set to 'consecutive_failures_exceeded'
- No new deliveries are created for this endpoint
Recovery (half-open)
After cooldownMinutes have passed since disabled_at:
- The delivery worker calls recoverEligibleEndpoints() every poll cycle
- Eligible endpoints are re-enabled: active = true, disabled_at = null, disabled_reason = null
- consecutiveFailures remains unchanged, so one more failure will re-disable the endpoint immediately
- A successful delivery resets the counter to 0, fully restoring the endpoint
Healthy ──[threshold reached]──> Disabled
   ^                                │
   │                       [cooldown elapsed]
   │                                │
   └──[success]── Recovery <────────┘
                     │
                     └──[failure]──> Disabled (immediately)
WARNING
During recovery, the endpoint has one chance. A single failure immediately re-disables it. This prevents flapping between healthy and disabled states.
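The three states can be modeled as pure transitions over a small endpoint record. The sketch below is only an illustration under stated assumptions: EndpointState, recordFailure, recordSuccess, and recoverIfEligible are hypothetical names (recoverIfEligible is a per-endpoint stand-in for what recoverEligibleEndpoints() is described as doing), and the real module persists this state through Prisma rather than in memory.

```typescript
// Hypothetical in-memory view of the endpoint fields described above.
interface EndpointState {
  active: boolean;
  consecutiveFailures: number;
  disabledAt: Date | null;
  disabledReason: string | null;
}

// Failure: increment the counter and open the breaker at the threshold.
function recordFailure(
  ep: EndpointState,
  failureThreshold: number,
  now: Date,
): EndpointState {
  const consecutiveFailures = ep.consecutiveFailures + 1;
  if (consecutiveFailures >= failureThreshold) {
    return {
      active: false,
      consecutiveFailures,
      disabledAt: now,
      disabledReason: 'consecutive_failures_exceeded',
    };
  }
  return { ...ep, consecutiveFailures };
}

// Success: reset the counter, which also ends the recovery phase.
function recordSuccess(ep: EndpointState): EndpointState {
  return { ...ep, consecutiveFailures: 0 };
}

// Recovery: re-enable once the cooldown has elapsed, leaving the counter
// untouched so that a single further failure re-disables the endpoint.
function recoverIfEligible(
  ep: EndpointState,
  cooldownMinutes: number,
  now: Date,
): EndpointState {
  if (ep.active || ep.disabledAt === null) return ep;
  const elapsedMs = now.getTime() - ep.disabledAt.getTime();
  if (elapsedMs < cooldownMinutes * 60_000) return ep;
  return { ...ep, active: true, disabledAt: null, disabledReason: null };
}
```

Because the counter is not reset on recovery, the "one chance" behavior needs no separate half-open flag: the endpoint is simply active with a counter already at the threshold.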
Stale Delivery Recovery
If a worker crashes while a delivery is in SENDING status, the delivery may be stuck indefinitely. The worker automatically recovers stale deliveries:
- Every poll cycle, deliveries in SENDING with claimed_at older than staleSendingMinutes (default: 5 minutes) are reset to PENDING
- The claimed_at column acts as a lease; exceeding the threshold implies the worker is dead
WebhookModule.forRoot({
  prisma: prismaService,
  polling: {
    staleSendingMinutes: 5, // default: 5
  },
})
TIP
Setting staleSendingMinutes too low may cause deliveries to be re-attempted while the original request is still in flight. Keep it well above your delivery.timeout value.
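The lease check amounts to comparing claimed_at against a cutoff. The following sketch shows that comparison on an in-memory list; the Delivery shape and the recoverStaleDeliveries function are hypothetical, and the actual worker presumably performs the equivalent update in the database.

```typescript
type DeliveryStatus = 'PENDING' | 'SENDING' | 'SENT' | 'FAILED';

// Hypothetical in-memory view of a delivery row.
interface Delivery {
  id: string;
  status: DeliveryStatus;
  claimedAt: Date | null;
}

// Reset deliveries whose SENDING lease is older than staleSendingMinutes.
function recoverStaleDeliveries(
  deliveries: Delivery[],
  staleSendingMinutes: number,
  now: Date,
): Delivery[] {
  const cutoff = now.getTime() - staleSendingMinutes * 60_000;
  return deliveries.map((d): Delivery => {
    const isStale =
      d.status === 'SENDING' &&
      d.claimedAt !== null &&
      d.claimedAt.getTime() < cutoff;
    // Only the status changes here; other fields are left as they were.
    return isStale ? { ...d, status: 'PENDING' } : d;
  });
}
```

A delivery claimed 4 minutes ago is left alone with the default of 5 minutes, while one claimed 6 minutes ago goes back to PENDING and is picked up again on a later poll.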
Successful Delivery Reset
On a successful delivery (2xx response):
- Delivery is marked SENT with completedAt and response details
- consecutiveFailures for the endpoint is reset to 0
- If the endpoint was in recovery, it is now fully healthy