Health Checks
Active vs passive detection, failure thresholds, graceful draining — keeping dead backends out of rotation.
Real-World Analogy
A hospital triage system: a nurse periodically checks each room to see if the patient is stable (active checks). But the nurse also notices immediately if a patient codes while being treated (passive detection). You need both — scheduled checks catch slow degradation, real-time observation catches sudden failures.
Passive Health Checks
Passive checks detect failure by watching live request outcomes. If a backend returns errors or times out, the LB marks it down.
nginx:
upstream backend {
server 10.0.0.10:3000 max_fails=3 fail_timeout=30s;
server 10.0.0.11:3000 max_fails=3 fail_timeout=30s;
} max_fails=3— mark server down after 3 failures withinfail_timeoutfail_timeout=30s— window for counting failures AND how long the server stays down before retry
The limitation: passive checks require real traffic to detect failure. A server that goes down between requests isn’t detected until a user hits it and gets an error. The user’s request fails.
Active Health Checks
The LB sends synthetic requests to each backend on a schedule, independent of real traffic. Failed backends are removed before real requests reach them.
nginx Plus (commercial) active health check:
upstream backend {
zone backend 64k;
server 10.0.0.10:3000;
server 10.0.0.11:3000;
}
server {
location / {
proxy_pass http://notes;
health_check interval=10s fails=3 passes=2 uri=/health;
}
} HAProxy (active checks built into open source):
backend api_servers
option httpchk GET /health HTTP/1.1\r\nHost:\ api.internal
server s1 10.0.0.10:3000 check inter 10s fall 3 rise 2
server s2 10.0.0.11:3000 check inter 10s fall 3 rise 2 Parameters:
inter 10s— check every 10 secondsfall 3— mark down after 3 consecutive failuresrise 2— mark up after 2 consecutive successes (prevents flapping)
The Health Endpoint
The backend must expose a /health endpoint the LB can call:
// Express
app.get('/health', (req, res) => {
res.status(200).json({ status: 'ok' });
}); A basic /health that always returns 200 only catches process crashes. A useful health check verifies the dependencies the server needs:
app.get('/health', async (req, res) => {
try {
await db.query('SELECT 1');
await redis.ping();
res.status(200).json({ status: 'ok', db: 'ok', cache: 'ok' });
} catch (err) {
res.status(503).json({ status: 'error', error: err.message });
}
}); 503 signals the LB to remove this server from rotation. The LB doesn’t parse the JSON — it just looks at the HTTP status code.
Be careful: if the database goes down and all servers return 503, the LB removes all backends. That’s usually correct (the app is broken) but plan for it — some teams split health into liveness (is the process alive?) and readiness (can it serve traffic?) and configure the LB to use readiness.
// /health/live — always 200 while process is up
app.get('/health/live', (req, res) => res.sendStatus(200));
// /health/ready — checks dependencies
app.get('/health/ready', async (req, res) => {
try {
await db.query('SELECT 1');
res.sendStatus(200);
} catch {
res.sendStatus(503);
}
}); Configure LB to use /health/ready.
TCP Health Checks
For non-HTTP backends (databases, Redis, custom TCP):
HAProxy TCP check:
backend postgres
mode tcp
option tcp-check
server db1 10.0.0.10:5432 check
server db2 10.0.0.11:5432 check HAProxy opens a TCP connection, checks it succeeds, and closes it. Doesn’t send any data — just verifies the port is open.
For Redis, use a more specific check:
backend redis
option tcp-check
tcp-check connect
tcp-check send PING\r\n
tcp-check expect string +PONG
server redis1 10.0.0.10:6379 check Connection Draining
When a backend needs to come down (deploy, scale-in), don’t kill connections instantly. Drain them:
- Mark the server as “draining” — stop routing new requests to it
- Let existing connections finish
- After a timeout, shut down
HAProxy runtime API:
# Mark server for drain (no new connections, finish existing)
echo "set server api_servers/s1 state drain" | \
socat stdio /var/run/haproxy/admin.sock
# Wait for connections to finish (watch until 0)
watch -n1 "echo 'show servers conn api_servers' | socat stdio /var/run/haproxy/admin.sock"
# When 0: take fully down
echo "set server api_servers/s1 state maint" | \
socat stdio /var/run/haproxy/admin.sock nginx graceful shutdown:
# Reload nginx config (zero-downtime, existing connections finish)
nginx -s reload
# Or full graceful quit (waits for connections)
nginx -s quit Application-side draining with SIGTERM:
process.on('SIGTERM', async () => {
server.close(async () => { // stop accepting new connections
await db.end(); // close DB pool after in-flight requests complete
process.exit(0);
});
// Force exit after 30s if connections don't drain
setTimeout(() => process.exit(1), 30_000);
}); Pair this with Kubernetes terminationGracePeriodSeconds: 30 and a preStop sleep to give the LB time to stop routing before SIGTERM arrives:
lifecycle:
preStop:
exec:
command: ["sleep", "5"] # give LB time to deregister
terminationGracePeriodSeconds: 35 Flapping Prevention
A server that oscillates between up and down (network hiccup, intermittent error) causes the LB to constantly change routing. The rise parameter prevents this:
server s1 10.0.0.10:3000 check fall 3 rise 2 - Down after 3 consecutive failures
- Back up only after 2 consecutive successes
This means a single successful check after failure won’t immediately restore the server — it needs to prove stability.
Health Check Overhead
Each active health check is a real HTTP request. With 10 backends, every-5s checks, and 3 LB instances: 10 × 12/min × 3 = 360 requests/min to /health. Usually negligible, but protect the endpoint from heavy checks:
app.get('/health', async (req, res) => {
// Don't run expensive checks on every health probe
// Cache the result for a few seconds
const cached = healthCache.get('status');
if (cached) return res.status(cached.code).json(cached.body);
// ... actual checks ...
}); Or use HAProxy’s fastinter for the first failure and normal inter otherwise:
server s1 10.0.0.10:3000 check inter 30s fastinter 5s downinter 10s inter 30s— healthy: check every 30sfastinter 5s— recovery: check every 5s once server comes back up (confirm stability quickly)downinter 10s— down: check every 10s (detect when it recovers)