Observability & Production Gateway

Access logs, distributed tracing, circuit breakers, and the operational checklist before putting a gateway in front of real traffic.

Tags: observability, tracing, circuit breaker, Kong, production

Real-World Analogy

Air traffic control does more than direct planes (routing): it maintains a real-time picture of every flight, detects problems early, and follows clear procedures when something goes wrong. The gateway in production is your ATC for API traffic.

Access Logging

Every request through the gateway should be logged with enough context to reconstruct what happened:

log_format gateway escape=json
  '{'
    '"time":"$time_iso8601",'
    '"method":"$request_method",'
    '"path":"$request_uri",'
    '"status":$status,'
    '"upstream":"$upstream_addr",'
    '"request_time":$request_time,'
    '"upstream_time":"$upstream_response_time",'
    '"request_id":"$request_id",'
    '"user_id":"$http_x_user_id",'
    '"bytes_sent":$bytes_sent'
  '}';

access_log /var/log/nginx/gateway.log gateway;

Structured logs in a Node.js gateway:

import pino from 'pino';
import type { Request, Response, NextFunction } from 'express';

const logger = pino({ level: 'info' });

function loggingMiddleware(req: Request, res: Response, next: NextFunction): void {
  const start = Date.now();
  const requestId = req.headers['x-request-id'] as string;

  res.on('finish', () => {
    logger.info({
      requestId,
      method:        req.method,
      path:          req.path,
      status:        res.statusCode,
      userId:        req.headers['x-user-id'],
      durationMs:    Date.now() - start,
      upstream:      req.headers['x-upstream-service'],
      contentLength: res.get('content-length'),
    });
  });

  next();
}
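
The logging middleware reads `x-request-id` but assumes something earlier in the chain has set it. A minimal sketch of that earlier step, assuming the gateway reuses a well-formed inbound ID and generates one otherwise. The name `ensureRequestId` and the 128-character limit are illustrative choices, not part of any library:

```typescript
import { randomUUID } from 'node:crypto';

type HeaderValue = string | string[] | undefined;

// Accept only short, printable IDs so a client cannot inject junk into the logs.
// The allowed character set and length are assumptions for this sketch.
const SAFE_ID = /^[A-Za-z0-9._-]{1,128}$/;

// Returns the request ID to use and writes it back into the headers so that
// logging and the upstream proxy call both see the same value.
function ensureRequestId(
  headers: Record<string, HeaderValue>,
  generate: () => string = randomUUID,
): string {
  const incoming = headers['x-request-id'];
  const candidate = Array.isArray(incoming) ? incoming[0] : incoming;
  const id =
    candidate !== undefined && SAFE_ID.test(candidate) ? candidate : generate();
  headers['x-request-id'] = id;
  return id;
}
```

Called at the top of the middleware chain (for example `ensureRequestId(req.headers)`), this guarantees that every log line carries an ID, including requests from clients that never send one.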

Distributed Tracing

Inject trace context so spans from the gateway and all downstream services appear in one trace:

import { trace, context, propagation, SpanKind } from '@opentelemetry/api';
import type { Request, Response, NextFunction } from 'express';

function tracingMiddleware(req: Request, res: Response, next: NextFunction): void {
  // Extract trace context from incoming request (if any)
  const parentContext = propagation.extract(context.active(), req.headers);

  const tracer = trace.getTracer('api-gateway');
  const span = tracer.startSpan(
    `${req.method} ${req.path}`,
    { kind: SpanKind.SERVER },
    parentContext,
  );

  span.setAttributes({
    'http.method':   req.method,
    'http.url':      req.originalUrl,
    'http.route':    req.route?.path,
    'user.id':       req.headers['x-user-id'] as string,
  });

  // Inject the new span's context into the headers forwarded upstream.
  // Building on parentContext keeps any baggage the caller sent.
  propagation.inject(trace.setSpan(parentContext, span), req.headers);

  res.on('finish', () => {
    span.setAttributes({ 'http.status_code': res.statusCode });
    span.end();
  });

  next();
}

With this, your Jaeger or Tempo dashboard shows the full request path: gateway → service A → database, with latency at each hop.
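
What actually travels between hops is a header. Assuming the W3C Trace Context propagator is in use (the usual OpenTelemetry default once an SDK is registered), `propagation.inject` writes a `traceparent` header of the form `version-traceid-parentid-flags`. The parser below is a sketch of the version `00` layout, intended for debugging and log correlation; `parseTraceParent` is a hypothetical helper, not an OpenTelemetry API:

```typescript
interface TraceParent {
  version: string;
  traceId: string;  // 32 lowercase hex chars, shared by every span in the trace
  parentId: string; // 16 lowercase hex chars, the span ID of the caller
  sampled: boolean; // lowest bit of the flags byte
}

const TRACEPARENT = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceParent(header: string): TraceParent | null {
  const match = TRACEPARENT.exec(header.trim());
  if (match === null) return null;
  const [, version, traceId, parentId, flags] = match;
  // The spec forbids version "ff" and all-zero trace or parent IDs.
  if (version === 'ff') return null;
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return {
    version,
    traceId,
    parentId,
    sampled: (parseInt(flags, 16) & 0x01) === 1,
  };
}
```

Logging `traceId` next to `requestId` in the access log makes it possible to jump from a log line straight to the matching trace.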

Circuit Breaker

Prevent a slow or failing backend from tying up gateway connections until the gateway itself stops responding:

import CircuitBreaker from 'opossum';

const options = {
  timeout: 3000,           // request > 3s = failure
  errorThresholdPercentage: 50,  // open circuit if 50% fail
  resetTimeout: 30000,     // try again after 30s
};

const breaker = new CircuitBreaker(callBackend, options);

breaker.on('open',     () => logger.warn('Circuit breaker OPEN'));
breaker.on('halfOpen', () => logger.info('Circuit breaker HALF-OPEN'));
breaker.on('close',    () => logger.info('Circuit breaker CLOSED'));

async function proxyRequest(req: Request, res: Response): Promise<void> {
  try {
    const response = await breaker.fire(req);
    res.status(response.status).json(response.data);
  } catch (err) {
    if (breaker.opened) {
      // Return cached or degraded response
      res.status(503).json({
        error: 'Service temporarily unavailable',
        cached: await getCachedResponse(req.path),
      });
    } else {
      res.status(502).json({ error: 'Bad gateway' });
    }
  }
}
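
To make the three states concrete, here is a hand-rolled sketch of the state machine behind a circuit breaker. It is not how opossum is implemented: it trips on consecutive failures rather than an error percentage, and it lets every request through while half-open instead of a single trial. The class name and constructor parameters are assumptions for illustration:

```typescript
type BreakerState = 'closed' | 'open' | 'halfOpen';

class SimpleBreaker {
  private state: BreakerState = 'closed';
  private failures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly resetTimeoutMs: number;
  private readonly now: () => number;

  constructor(failureThreshold: number, resetTimeoutMs: number, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now; // injectable clock so the behavior can be tested
  }

  // Current state, moving open -> halfOpen once the reset timeout has elapsed.
  current(): BreakerState {
    if (this.state === 'open' && this.now() - this.openedAt >= this.resetTimeoutMs) {
      this.state = 'halfOpen';
    }
    return this.state;
  }

  // While open, callers should fail fast instead of contacting the backend.
  allowRequest(): boolean {
    return this.current() !== 'open';
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = 'closed';
  }

  // A failure while half-open reopens immediately; while closed it opens
  // only after failureThreshold consecutive failures.
  recordFailure(): void {
    this.failures += 1;
    if (this.current() === 'halfOpen' || this.failures >= this.failureThreshold) {
      this.state = 'open';
      this.openedAt = this.now();
    }
  }
}
```

The open state is what protects the gateway: rejected requests return in microseconds instead of holding a connection until the 3-second timeout.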

Gateway Metrics

Key metrics to expose and alert on:

import { Counter, Histogram, Registry } from 'prom-client';

const registry = new Registry();

const requestCounter = new Counter({
  name: 'gateway_requests_total',
  help: 'Total requests through gateway',
  labelNames: ['method', 'route', 'status', 'upstream'],
  registers: [registry],
});

const latencyHistogram = new Histogram({
  name: 'gateway_request_duration_seconds',
  help: 'Request latency',
  labelNames: ['method', 'route', 'upstream'],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5],
  registers: [registry],
});

// Metrics endpoint for Prometheus scraping
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', registry.contentType);
  res.end(await registry.metrics());
});

Alert thresholds:

  • Gateway p99 latency > 500ms: investigate upstream
  • Error rate (4xx + 5xx) > 5%: page on-call
  • Circuit breaker open: immediate page
  • Rate limit rejections spike: possible abuse or misconfiguration
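
In practice these thresholds would live in Prometheus alerting rules rather than application code. The sketch below only shows the arithmetic behind the first two, computed over a window of request samples. The `Sample` shape and function names are hypothetical, and the percentile uses the nearest-rank method, which differs slightly from Prometheus's bucket interpolation:

```typescript
interface Sample {
  durationMs: number;
  status: number;
}

// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

function evaluateWindow(samples: Sample[]): { p99Ms: number; errorRate: number; alerts: string[] } {
  const p99Ms = percentile(samples.map((s) => s.durationMs), 99);
  // Matches the threshold above: 4xx and 5xx both count as errors.
  const errors = samples.filter((s) => s.status >= 400).length;
  const errorRate = samples.length === 0 ? 0 : errors / samples.length;

  const alerts: string[] = [];
  if (p99Ms > 500) alerts.push('p99 latency above 500ms');
  if (errorRate > 0.05) alerts.push('error rate above 5%');
  return { p99Ms, errorRate, alerts };
}
```

Note how a handful of slow requests is enough to move p99 while leaving the average almost unchanged, which is why the alert is on the tail.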

Production Checklist

□ TLS termination configured with modern cipher suites (TLS 1.2+)
□ HTTP/2 enabled for client connections
□ Timeouts set on all routes (connect, send, read)
□ Health check endpoint for the gateway itself
□ Rate limiting enabled on all public routes
□ Request ID injected on all requests
□ Structured access logs shipping to log aggregator
□ Distributed tracing context propagated
□ Circuit breakers on backends with known instability
□ Graceful shutdown: drain connections before process exit
□ Horizontal scaling tested: multiple gateway instances behind a load balancer
□ Config changes tested in staging before production
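
The graceful-shutdown item is the one most often skipped. One way to approach it, sketched under the assumption that the gateway counts in-flight requests itself: stop admitting new work once shutdown begins, then exit when the count reaches zero. `DrainTracker` is a hypothetical helper, not part of Express or Node.js:

```typescript
class DrainTracker {
  private inFlight = 0;
  private draining = false;
  private onDrained: (() => void) | null = null;

  // Call when a request arrives. Returns false once shutdown has begun,
  // so the caller can answer 503 instead of starting new work.
  tryStart(): boolean {
    if (this.draining) return false;
    this.inFlight += 1;
    return true;
  }

  // Call when a response completes, for example from res.on('finish').
  finish(): void {
    this.inFlight -= 1;
    this.maybeDone();
  }

  // Call from the SIGTERM handler. The callback fires once, as soon as
  // no requests remain in flight (immediately if there are none).
  beginShutdown(onDrained: () => void): void {
    this.draining = true;
    this.onDrained = onDrained;
    this.maybeDone();
  }

  private maybeDone(): void {
    if (this.draining && this.inFlight === 0 && this.onDrained !== null) {
      const callback = this.onDrained;
      this.onDrained = null; // guarantee the callback runs at most once
      callback();
    }
  }
}
```

In a Node.js gateway this would typically sit alongside `server.close()` and a hard deadline timer, so that a stuck request cannot block the process from exiting forever.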

Choosing a Gateway

               nginx                  Traefik                       Kong                      AWS API Gateway
Config         Static files           Dynamic (Docker labels, K8s)  Admin API + DB            Console/Terraform
Auth           Plugin                 Plugin                        Built-in                  Built-in
Rate limiting  Built-in (limit_req)   Built-in                      Built-in                  Built-in
Best for       High-perf proxy        Docker/K8s native             Feature-rich self-hosted  AWS-native serverless
Ops burden     Low                    Low                           Medium                    None

Start with nginx or Traefik. Graduate to Kong when you need the plugin ecosystem. Use a managed gateway (AWS, Cloudflare) when ops burden matters more than per-request cost.