Structured Logging
JSON logs, log levels, correlation IDs, log aggregation with Loki — building logs you can actually search in production.
Real-World Analogy
A well-organized filing cabinet vs a pile of papers: unstructured logs are a pile — you know the information is in there somewhere, but finding it requires reading every page. Structured logs are the filing cabinet: every document has labeled fields, filed in a consistent location, retrievable in seconds. Same information; radically different searchability.
Why Structured Logging
Unstructured log:
[2024-01-15 10:23:41] ERROR: Payment failed for order ord-123 (user usr-456): card declined To extract order_id from this, you write regex. Multiply by 10k log lines per second. Now count how many different developers wrote how many different formats.
Structured log (JSON):
{"level":"error","time":"2024-01-15T10:23:41Z","service":"order-service","order_id":"ord-123","user_id":"usr-456","event":"payment_failed","reason":"card_declined","duration_ms":234} Every field is a key-value pair. Query: {service="order-service"} | json | reason="card_declined" | order_id != "" — instant, no regex.
Pino (Node.js)
Pino is the fastest Node.js logger — synchronous JSON output with minimal allocation:
import pino from 'pino';
const log = pino({
level: process.env.LOG_LEVEL ?? 'info',
base: {
service: 'order-service',
version: process.env.GIT_SHA ?? 'dev',
env: process.env.NODE_ENV,
},
// In development: pretty-print. In production: raw JSON.
transport: process.env.NODE_ENV === 'development'
? { target: 'pino-pretty', options: { colorize: true } }
: undefined,
});
export { log }; // Usage
log.info({ orderId, customerId }, 'Order created');
log.error({ orderId, err: err.message, stack: err.stack }, 'Order creation failed');
log.warn({ queueDepth: 500 }, 'Queue depth high');
// Child logger — inherits context
const reqLog = log.child({ requestId, userId });
reqLog.info({ orderId }, 'Processing order');
// Output: {"requestId":"...","userId":"...","orderId":"...","msg":"Processing order"} Correlation IDs
A request passes through multiple services. To follow it across all logs, generate one ID at ingress and propagate it everywhere.
// Express middleware — generate or propagate correlation ID
import { randomUUID } from 'crypto';
app.use((req, res, next) => {
const correlationId = req.headers['x-correlation-id'] as string ?? randomUUID();
req.correlationId = correlationId;
res.setHeader('x-correlation-id', correlationId);
// Attach to request logger
req.log = log.child({ correlationId, method: req.method, path: req.path });
next();
});
// Request handler
app.post('/orders', async (req, res) => {
req.log.info('Creating order');
try {
const order = await createOrder(req.body, req.log);
req.log.info({ orderId: order.id }, 'Order created');
res.json(order);
} catch (err) {
req.log.error({ err: err.message }, 'Order creation failed');
res.status(500).json({ error: 'Order creation failed' });
}
}); Pass the correlation ID to downstream services:
async function callPaymentService(order: Order, log: Logger) {
const response = await fetch('http://payment-service/charge', {
method: 'POST',
headers: {
'x-correlation-id': log.bindings().correlationId,
'content-type': 'application/json',
},
body: JSON.stringify(order),
});
} Now a single query {correlationId="abc-123"} across all services shows the complete request journey.
Log Levels
Use levels consistently — they determine what gets stored and what triggers alerts:
ERROR — unexpected failure requiring investigation; fires an alert
WARN — degraded state, expected to recover; may fire a low-priority alert
INFO — significant business events (order created, user registered)
DEBUG — diagnostic detail; disabled in production, enabled per-request when debugging
TRACE — everything (query parameters, raw HTTP bodies); never in production // Good level usage
log.error({ err, orderId }, 'Payment service unreachable'); // alert
log.warn({ queueDepth, threshold }, 'Queue depth approaching limit'); // investigate soon
log.info({ orderId, total }, 'Order confirmed'); // business event
log.debug({ sql, params }, 'Executing query'); // dev only
// Common mistake: ERROR for expected failures
log.error('Order not found'); // NOT_FOUND is normal — use warn or info
log.info({ orderId }, 'Order not found, returning 404'); // correct Dynamic log levels in production:
// Change level at runtime without restart
process.on('SIGUSR1', () => {
if (log.level === 'info') {
log.level = 'debug';
log.info('Debug logging enabled');
} else {
log.level = 'info';
log.info('Debug logging disabled');
}
}); Log Aggregation with Loki
Loki stores logs indexed by labels (like Prometheus, but for logs). Promtail ships logs from files or journald to Loki.
# docker-compose.yml
services:
loki:
image: grafana/loki:latest
ports: ["3100:3100"]
command: -config.file=/etc/loki/loki.yml
volumes:
- ./loki.yml:/etc/loki/loki.yml
- loki-data:/loki
promtail:
image: grafana/promtail:latest
volumes:
- /var/log:/var/log # host logs
- /var/run/docker.sock:/var/run/docker.sock
- ./promtail.yml:/etc/promtail/config.yml
command: -config.file=/etc/promtail/config.yml # promtail.yml — ship Docker container logs
server:
http_listen_port: 9080
clients:
- url: http://loki:3100/loki/api/v1/push
scrape_configs:
- job_name: docker
docker_sd_configs:
- host: unix:///var/run/docker.sock
refresh_interval: 5s
relabel_configs:
- source_labels: ['__meta_docker_container_name']
target_label: container
- source_labels: ['__meta_docker_container_label_service']
target_label: service
pipeline_stages:
- json:
expressions:
level: level
correlation_id: correlationId
- labels:
level:
correlation_id: journald (systemd services)
For bare-metal or VM deployments (not Docker), logs go to journald:
# All logs from a service
journalctl -u order-service -f
# Logs since yesterday
journalctl -u order-service --since yesterday
# JSON output (for parsing)
journalctl -u order-service -o json | jq '.MESSAGE | fromjson | select(.level == "error")'
# Filter by time range
journalctl -u order-service --since "2024-01-15 10:00:00" --until "2024-01-15 11:00:00" Forward journald to Loki:
# promtail.yml — journald source
scrape_configs:
- job_name: journal
journal:
max_age: 12h
labels:
job: systemd-journal
relabel_configs:
- source_labels: ['__journal__systemd_unit']
target_label: unit Loki Queries (LogQL)
# All errors from order-service
{service="order-service"} |= "error"
# Parse JSON and filter
{service="order-service"} | json | level="error"
# Filter by specific field
{service="order-service"} | json | order_id="ord-123"
# Count errors per minute
count_over_time({service="order-service"} | json | level="error" [1m])
# Rate of errors
rate({service="order-service"} | json | level="error" [5m])
# Top error reasons
{service="order-service"} | json | level="error"
| line_format "{{.reason}}"
| topk(10, count_over_time[1h]) What to Log
Log these:
- Business events (order created, payment charged, user registered)
- All errors with full context (user, resource ID, error code, message)
- Slow operations (requests > 1s, queries > 100ms)
- Security events (failed auth, permission denied, unusual access patterns)
- Service startup and shutdown
Don’t log these:
- Passwords, tokens, card numbers (PCI), personal data (GDPR)
- Successful health checks (100% noise)
- Debug-level SQL in production (volume)
- Stacktraces on expected errors (404, 401)
// Sanitize sensitive data before logging
function sanitizeOrder(order: Order) {
return {
...order,
paymentMethod: { last4: order.paymentMethod.cardNumber.slice(-4) },
// never log full card number
};
}
log.info({ order: sanitizeOrder(order) }, 'Order created');