Skip to content
← Chaos & Resilience · intermediate · 12 min · 02 / 06

Resilience Patterns

Timeouts, retries, circuit breakers, bulkheads, and graceful degradation — the building blocks that keep failures contained.

circuit breakerbulkheadtimeoutretrygraceful degradation

Real-World Analogy

A ship’s watertight compartments: if one section floods, the bulkheads contain the water to that compartment. The ship keeps sailing. Without compartments, one breach sinks everything. Resilience patterns are your system’s watertight compartments.

Timeouts: The Foundation

Every outbound call needs a timeout. Without one, a slow dependency holds connections open indefinitely — eventually exhausting your connection pool and taking down your entire service.

// WRONG — hangs forever if service is slow
const response = await fetch('https://payment-service/charge', {
  method: 'POST',
  body: JSON.stringify(payload),
});

// RIGHT — fail fast, release the connection
const response = await fetch('https://payment-service/charge', {
  method: 'POST',
  body: JSON.stringify(payload),
  signal: AbortSignal.timeout(3000), // 3 second timeout
});

Timeout hierarchy: Set timeouts at every layer:

const TIMEOUTS = {
  connect: 1000,    // time to establish TCP connection
  request: 3000,    // time to send request + receive first byte
  response: 10000,  // total time for full response body
};

// Axios example with all three
const client = axios.create({
  timeout: TIMEOUTS.response,
  httpAgent: new http.Agent({ keepAlive: true }),
});

Timeout budget: When service A calls B which calls C, total latency is A + B + C. Set timeouts so inner calls leave budget for outer calls:

Client timeout: 10s
Service A timeout on B: 5s
Service B timeout on C: 2s

Each call has room to fail and retry without blowing the client's timeout.

Retries

Transient failures (network blip, brief overload) often resolve on retry. But retry naively and you amplify load on an already-struggling dependency.

async function withRetry<T>(
  fn: () => Promise<T>,
  opts: { maxAttempts: number; baseDelayMs: number },
): Promise<T> {
  let lastError: Error;

  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err as Error;

      if (attempt === opts.maxAttempts) break;

      // Exponential backoff with jitter
      const delay = opts.baseDelayMs * Math.pow(2, attempt - 1);
      const jitter = Math.random() * delay * 0.2; // ±20%
      await sleep(delay + jitter);
    }
  }

  throw lastError!;
}

// Usage
const result = await withRetry(
  () => fetch('https://api.service/data'),
  { maxAttempts: 3, baseDelayMs: 500 },
);

Only retry idempotent operations. A POST that creates a record will create duplicates on retry unless it’s idempotent. Use idempotency keys:

// Safe to retry with same key
const response = await fetch('https://payments/charge', {
  method: 'POST',
  headers: { 'Idempotency-Key': `charge-${orderId}` },
  body: JSON.stringify({ amount, customerId }),
});

Retry on the right errors:

function isRetryable(err: unknown): boolean {
  if (err instanceof TypeError) return false; // network error — retryable
  if (!(err instanceof Response)) return true;

  // 429: rate limited — retry with backoff
  if (err.status === 429) return true;

  // 502, 503, 504: upstream issues — retryable
  if (err.status >= 502 && err.status <= 504) return true;

  // 400, 401, 403, 404: client errors — not retryable
  if (err.status < 500) return false;

  return false; // default: don't retry 500 (might be our fault)
}

Circuit Breaker

When a dependency is failing consistently, retrying just makes it worse. A circuit breaker tracks failure rate and “opens” — stopping all calls — to give the dependency time to recover.

type CircuitState = 'closed' | 'open' | 'half-open';

class CircuitBreaker {
  private state: CircuitState = 'closed';
  private failures = 0;
  private successes = 0;
  private lastFailureTime = 0;

  constructor(
    private readonly threshold: number = 5,  // failures to open
    private readonly resetTimeMs: number = 30_000, // open duration
    private readonly halfOpenRequests: number = 3, // test requests
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      if (Date.now() - this.lastFailureTime > this.resetTimeMs) {
        this.state = 'half-open';
        this.successes = 0;
      } else {
        throw new Error('Circuit open — dependency unavailable');
      }
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (err) {
      this.onFailure();
      throw err;
    }
  }

  private onSuccess(): void {
    this.failures = 0;
    if (this.state === 'half-open') {
      this.successes++;
      if (this.successes >= this.halfOpenRequests) {
        this.state = 'closed'; // recovered
      }
    }
  }

  private onFailure(): void {
    this.failures++;
    this.lastFailureTime = Date.now();
    if (this.failures >= this.threshold || this.state === 'half-open') {
      this.state = 'open';
    }
  }

  get isOpen(): boolean { return this.state === 'open'; }
}

Use opossum in production rather than rolling your own:

import CircuitBreaker from 'opossum';

const breaker = new CircuitBreaker(callPaymentService, {
  timeout: 3000,
  errorThresholdPercentage: 50, // open if 50% of requests fail
  resetTimeout: 30_000,         // try again after 30s
  volumeThreshold: 10,          // need at least 10 calls to calculate error rate
});

breaker.on('open', () => logger.warn('Payment circuit OPEN'));
breaker.on('halfOpen', () => logger.info('Payment circuit HALF-OPEN'));
breaker.on('close', () => logger.info('Payment circuit CLOSED'));

// Provide a fallback when circuit is open
breaker.fallback(() => ({ status: 'pending', message: 'Payment queued for processing' }));

Bulkheads

Isolate resources so that one slow consumer can’t exhaust resources for all consumers. Named after ship compartments.

Thread/connection pool isolation:

// Without bulkheads: one pool for all downstream services
const sharedPool = new ConnectionPool({ max: 20 });

// With bulkheads: separate pools, failure in one doesn't affect others
const paymentPool = new ConnectionPool({ max: 5 });  // max 5 connections to payments
const inventoryPool = new ConnectionPool({ max: 5 }); // max 5 to inventory
const emailPool = new ConnectionPool({ max: 2 });     // non-critical — fewer resources

If payment service becomes slow and saturates paymentPool, inventory and email calls are unaffected. Without bulkheads, payment latency would exhaust the shared pool and cascade to all services.

Queue-based bulkheads:

// Separate queues per job type — high priority jobs don't wait behind low priority
const criticalQueue = new Queue('critical-ops', { concurrency: 20 });
const bulkQueue     = new Queue('bulk-exports',  { concurrency: 2 });

// Bulk export job surge doesn't block critical operations

Graceful Degradation

When a non-critical dependency fails, serve a degraded but functional response rather than a hard error.

async function getProductPage(productId: string): Promise<ProductPage> {
  // Critical: product data — required
  const product = await productService.get(productId);

  // Non-critical: recommendations — degrade gracefully
  const recommendations = await recommendationService.get(productId)
    .catch((err) => {
      logger.warn({ productId, err: err.message }, 'Recommendations unavailable');
      return []; // empty, not an error
    });

  // Non-critical: reviews — show cached or empty
  const reviews = await reviewService.get(productId)
    .catch(async () => {
      return cache.get(`reviews:${productId}`) ?? { items: [], total: 0 };
    });

  return { product, recommendations, reviews };
}

Design your UI to handle empty states for non-critical sections. A product page without recommendations is fine. A product page that errors because recommendations timed out is not.

Feature flags for graceful degradation:

async function checkout(cart: Cart): Promise<CheckoutResult> {
  const result = await processPayment(cart);

  // Fraud check: non-blocking — fail open if service is down
  if (await featureFlags.isEnabled('fraud-check')) {
    const fraudSignal = await fraudService.check(cart, result)
      .catch(() => ({ score: 0, block: false })); // fail open

    if (fraudSignal.block) {
      await refundPayment(result.chargeId);
      throw new Error('Order flagged for review');
    }
  }

  return result;
}

Hedged Requests

Send the same request to two backends simultaneously, use whichever responds first. Reduces tail latency at the cost of double load.

async function hedgedRequest<T>(
  primary: () => Promise<T>,
  secondary: () => Promise<T>,
  hedgeAfterMs: number, // delay before sending second request
): Promise<T> {
  return new Promise((resolve, reject) => {
    let settled = false;

    const settle = (result: T | Error) => {
      if (settled) return;
      settled = true;
      if (result instanceof Error) reject(result);
      else resolve(result);
    };

    // Start primary immediately
    primary().then(settle).catch((err) => {
      if (!settled) settle(err);
    });

    // Start secondary after hedge delay
    setTimeout(() => {
      if (!settled) {
        secondary().then(settle).catch((err) => {
          if (!settled) settle(err);
        });
      }
    }, hedgeAfterMs);
  });
}

// Usage: hedge after 100ms — if primary doesn't respond in 100ms,
// fire secondary request. Use whichever answers first.
const data = await hedgedRequest(
  () => fetchFromPrimary(id),
  () => fetchFromReplica(id),
  100,
);

Hedged requests effectively reduce p99 latency toward p50 latency at the cost of ~2x request volume to the slower percentile requests.

Combining Patterns

In production, use these together:

const paymentBreaker = new CircuitBreaker(callPaymentService, {
  timeout: 3000,
  errorThresholdPercentage: 50,
  resetTimeout: 30_000,
});

async function chargeCustomer(order: Order): Promise<PaymentResult> {
  try {
    // Circuit breaker wraps retry-with-timeout
    return await paymentBreaker.fire(async () => {
      return await withRetry(
        () => paymentPool.call(() => // bulkhead
          fetchWithTimeout('https://payments/charge', order, 2500)
        ),
        { maxAttempts: 2, baseDelayMs: 500 },
      );
    });
  } catch (err) {
    if (paymentBreaker.opened) {
      // Circuit open — queue for async retry
      await paymentQueue.add('retry-payment', { orderId: order.id });
      return { status: 'queued', message: 'Payment processing — you will be notified' };
    }
    throw err;
  }
}

Timeout → Retry → Circuit Breaker → Bulkhead → Graceful Degradation: each layer handles a different failure mode. Together they make the system fail gracefully instead of catastrophically.