Service Discovery
How services find each other — DNS-based discovery, Consul, client-side vs server-side load balancing, and health-integrated routing.
Real-World Analogy
A company directory vs a receptionist: the directory lists everyone’s extension — you look it up and call directly (client-side discovery). The receptionist knows who’s in today, routes your call to someone available, and handles transfers when someone’s out (server-side discovery). The receptionist adds a step but shields you from needing to know who’s at their desk.
The Problem
In a monolith, calling a function is an in-process jump to an address fixed at build or load time. In microservices, calling a service requires:
- Knowing its current IP and port
- Knowing which instances are healthy
- Deciding which instance to call (load balancing)
These can’t be hardcoded — containers restart with new IPs, instances scale in and out, deployments replace instances.
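The three requirements above can be modeled with a very small in-memory registry. This is only a sketch to make the moving parts concrete: the `Registry` class, the `Instance` shape, and the method names are hypothetical and do not correspond to any real library.

```typescript
// Hypothetical in-memory registry; every name here is illustrative.
interface Instance {
  id: string;
  address: string; // "ip:port"
  healthy: boolean;
}

class Registry {
  private instances = new Map<string, Instance[]>();

  // Called when an instance starts, or restarts with a new IP.
  register(service: string, instance: Instance): void {
    const current = this.instances.get(service) ?? [];
    // Drop any stale entry with the same id before adding the new one.
    const others = current.filter((i) => i.id !== instance.id);
    this.instances.set(service, [...others, instance]);
  }

  // Called on scale-in or when a deploy replaces the instance.
  deregister(service: string, id: string): void {
    const current = this.instances.get(service) ?? [];
    this.instances.set(service, current.filter((i) => i.id !== id));
  }

  // Requirements 1 and 2: current addresses, restricted to healthy instances.
  healthy(service: string): Instance[] {
    return (this.instances.get(service) ?? []).filter((i) => i.healthy);
  }

  // Requirement 3: choose one instance (uniformly at random here).
  pick(service: string, random: () => number = Math.random): Instance {
    const candidates = this.healthy(service);
    if (candidates.length === 0) {
      throw new Error(`No healthy instances of ${service}`);
    }
    return candidates[Math.floor(random() * candidates.length)];
  }
}
```

Every approach in the rest of this section is a production-grade answer to the same three questions: who keeps this table, who updates the `healthy` flag, and who runs `pick`.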
DNS-Based Discovery
The simplest approach: each service has a stable DNS name that resolves to one or more IPs.
In Kubernetes: every Service gets a stable DNS name automatically.
apiVersion: v1
kind: Service
metadata:
  name: order-service
  namespace: production
spec:
  selector:
    app: order-service
  ports:
    - port: 50051
      targetPort: 50051
Within the cluster:
order-service.production.svc.cluster.local:50051
# Or just (from within the same namespace):
order-service:50051
Kubernetes DNS resolves this name to the Service’s ClusterIP, and traffic to the ClusterIP is load-balanced across the pods that currently pass their readiness probes. No separate service registry is needed: Kubernetes is the registry.
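As a small illustration of the naming scheme, the two forms can be built from their parts. The helper names below are hypothetical, and the sketch assumes the default cluster domain `cluster.local`, which a cluster operator can change.

```typescript
// Builds the fully qualified in-cluster target for a Service.
// Assumption: the cluster uses the default domain "cluster.local".
function k8sServiceTarget(
  service: string,
  namespace: string,
  port: number,
  clusterDomain: string = 'cluster.local',
): string {
  return `${service}.${namespace}.svc.${clusterDomain}:${port}`;
}

// Short form: resolves only for callers in the same namespace as the Service.
function k8sLocalTarget(service: string, port: number): string {
  return `${service}:${port}`;
}

const crossNamespace = k8sServiceTarget('order-service', 'production', 50051);
const sameNamespace = k8sLocalTarget('order-service', 50051);
```

Using the fully qualified form in configuration is a reasonable default when callers may live in other namespaces, since the short form silently resolves to a different Service (or nothing) there.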
Outside Kubernetes: use Route 53 or any DNS server with health checks.
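# Route 53 health check
# Note: Route 53 health checkers run outside your VPC, so an IP-based check needs a
# publicly reachable endpoint. For private addresses like the one below, a health
# check driven by a CloudWatch alarm is the usual substitute.
aws route53 create-health-check \
  --caller-reference "$(date +%s)" \
  --health-check-config '{
    "IPAddress": "10.0.0.10",
    "Port": 50051,
    "Type": "TCP",
    "RequestInterval": 10,
    "FailureThreshold": 3
  }'
# A record with health check: Route 53 stops returning instances whose check fails.
# A health check only takes effect when Route 53 chooses among several records with
# the same name and type, so this uses multivalue answer routing: one record per
# instance, each with its own SetIdentifier.
aws route53 change-resource-record-sets \
  --hosted-zone-id ZXXX \
  --change-batch '{
    "Changes": [{
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "order-service.internal",
        "Type": "A",
        "SetIdentifier": "order-service-1",
        "MultiValueAnswer": true,
        "TTL": 30,
        "HealthCheckId": "abc-123",
        "ResourceRecords": [{"Value": "10.0.0.10"}]
      }
    }]
  }'
TTL matters: with a short TTL (30s), clients stop using a failed instance within about half a minute of its record being withdrawn. With a long TTL (5 min), clients keep calling stale addresses for minutes after a deploy. Keep internal DNS TTLs at 10-30s.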
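To make the TTL trade-off concrete, here is a rough back-of-the-envelope calculation. It assumes that detecting a failure takes about `RequestInterval × FailureThreshold` seconds and that a client may hold a cached answer for one full TTL afterwards. The function name is hypothetical, and real behavior also depends on resolver caching and connection reuse, so treat the result as an estimate rather than a guarantee.

```typescript
// Rough upper bound, in seconds, on how long clients may keep calling a dead instance.
// Assumption: detection takes interval * threshold, then cached DNS answers
// survive for at most one more TTL.
function worstCaseStaleSeconds(
  checkIntervalSec: number,
  failureThreshold: number,
  ttlSec: number,
): number {
  const detectionSec = checkIntervalSec * failureThreshold;
  return detectionSec + ttlSec;
}

// Using the health check settings from the example (10s interval, threshold 3):
const withShortTtl = worstCaseStaleSeconds(10, 3, 30);  // 30 + 30 = 60
const withLongTtl = worstCaseStaleSeconds(10, 3, 300);  // 30 + 300 = 330
```

Under these assumptions, the 5-minute TTL dominates the total, which is why shortening the TTL helps more than tightening the health check interval.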
Consul
Consul is a purpose-built service registry with health checks, KV store, and service mesh capabilities.
# Start Consul agent (dev mode)
consul agent -dev
# Production: 3-node cluster
consul agent \
  -server \
  -bootstrap-expect=3 \
  -datacenter=us-east-1 \
  -data-dir=/var/lib/consul \
  -bind=10.0.0.10 \
  -retry-join=10.0.0.11 \
  -retry-join=10.0.0.12
Service registration:
/etc/consul.d/order-service.json:
{
  "service": {
    "name": "order-service",
    "id": "order-service-1",
    "port": 50051,
    "tags": ["grpc", "v1"],
    "check": {
      "grpc": "localhost:50051",
      "interval": "10s",
      "deregister_critical_service_after": "1m"
    }
  }
}
consul reload
# Service is now registered and health-checked
Querying Consul:
# DNS interface (built-in)
dig @127.0.0.1 -p 8600 order-service.service.consul SRV
# Returns: IP + port of all healthy instances
# HTTP API (URL quoted so the shell does not treat "?" as a glob character)
curl 'http://localhost:8500/v1/health/service/order-service?passing=true'
Application integration:
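import Consul from 'consul';
// Assumed: createClient and createGrpcTransport come from the Connect-ES packages,
// and OrderService is the generated service definition (import path depends on your codegen).
import { createClient } from '@connectrpc/connect';
import { createGrpcTransport } from '@connectrpc/connect-node';

const consul = new Consul();

async function discoverService(name: string): Promise<string> {
  const services = await consul.health.service({
    service: name,
    passing: true, // only healthy instances
  });
  if (services.length === 0) throw new Error(`No healthy instances of ${name}`);
  // Random selection (not round-robin): spreads load evenly on average without shared state
  const instance = services[Math.floor(Math.random() * services.length)];
  // Service.Address is empty when the service registered without an explicit address;
  // in that case Consul expects callers to fall back to the node's address.
  const address = instance.Service.Address || instance.Node.Address;
  return `${address}:${instance.Service.Port}`;
}

// Resolved once here; a long-lived client should re-resolve periodically or on failure.
const orderServiceAddr = await discoverService('order-service');
const client = createClient(OrderService, createGrpcTransport({
  baseUrl: `https://${orderServiceAddr}`,
}));
Client-Side vs Server-Side Load Balancing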
Server-side (traditional):
Client → Load Balancer → [picks instance] → Service instance
The LB has all the knowledge. Clients just call a single stable address.
Client-side:
Client → Consul (get all instances) → Client picks one → Service instance
The client does its own load balancing. This puts more logic in every client, but it removes the load balancer as a bottleneck and as an extra network hop, and it allows smarter routing: the client can retry a failed call on a different instance automatically.
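A minimal sketch of that client-side behavior follows: a true round-robin picker plus a failover helper that tries each instance at most once. The class and function names are hypothetical. `call` is synchronous only to keep the sketch short; a real RPC would be asynchronous with the same control flow.

```typescript
class RoundRobinPicker {
  private addresses: string[];
  private cursor = 0;

  constructor(addresses: string[]) {
    this.addresses = addresses;
  }

  // Returns the next address in rotation, skipping any the caller already tried.
  next(tried: ReadonlySet<string> = new Set<string>()): string | undefined {
    for (let i = 0; i < this.addresses.length; i++) {
      const addr = this.addresses[this.cursor];
      this.cursor = (this.cursor + 1) % this.addresses.length;
      if (!tried.has(addr)) return addr;
    }
    return undefined; // empty list, or every instance has been tried
  }
}

// Tries instances in rotation until one call succeeds or all have failed.
function callWithFailover<T>(
  picker: RoundRobinPicker,
  call: (addr: string) => T,
): T {
  const tried = new Set<string>();
  let lastError: unknown = new Error('No instances available');
  for (let addr = picker.next(tried); addr !== undefined; addr = picker.next(tried)) {
    try {
      return call(addr);
    } catch (err) {
      lastError = err;
      tried.add(addr); // do not retry the instance that just failed
    }
  }
  throw lastError;
}
```

Automatic failover like this is only safe for idempotent calls, since the failed instance may have partially processed the request before the error surfaced.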
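gRPC with a service registry naturally uses client-side load balancing: the gRPC runtime resolves the target name to multiple addresses and balances across them. Note that gRPC’s default policy is pick_first, which sticks to a single address; round_robin has to be selected explicitly, for example through the service config.
// gRPC client-side LB with multiple addresses
const transport = createGrpcTransport({
  baseUrl: 'https://order-service:50051',
  // Intent: a resolver (for example one backed by Consul DNS) returns all instance
  // addresses and the gRPC runtime round-robins across them. Whether this happens
  // depends on the client library: confirm that yours supports multi-address
  // resolution and a round_robin policy before relying on it.
});
Service Mesh with Envoy/Istio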
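A service mesh moves service discovery, load balancing, retries, circuit breaking, and mTLS out of the application and into a sidecar proxy (Envoy). With Istio, application code just calls a stable name such as order-service:50051; the sidecar transparently intercepts the connection and handles the rest. (Some meshes, such as Consul’s, instead expose each upstream on a localhost port.)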
# Kubernetes: Istio injects Envoy automatically
apiVersion: v1
kind: Pod
metadata:
  name: order-service
  annotations:
    sidecar.istio.io/inject: "true"
spec:
  containers:
    - name: order-service
      image: myorg/order-service:1.2.0
      ports:
        - containerPort: 50051
    # Istio injects envoy sidecar here automatically
Traffic policy with Istio:
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: order-service
spec:
  hosts:
    - order-service
  http:
    - route:
        - destination:
            host: order-service
            subset: v1
          weight: 90
        - destination:
            host: order-service
            subset: v2
          weight: 10  # canary: 10% to v2
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: order-service
spec:
  host: order-service
  trafficPolicy:
    connectionPool:
      http:
        h2UpgradePolicy: UPGRADE  # HTTP/2 for gRPC
    outlierDetection:
      consecutive5xxErrors: 5  # replaces the deprecated consecutiveErrors field
      interval: 30s
      baseEjectionTime: 30s  # circuit breaker: eject failing instances
  subsets:
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2
The application knows nothing about canary routing or circuit breaking: Envoy handles both.
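To show what the proxy is doing on the application’s behalf, here is a simplified model of consecutive-error ejection, using the same numbers as the outlier detection settings (five errors, 30s ejection). This is an illustrative sketch, not Envoy’s implementation: the class name is hypothetical, and real Envoy adds details such as periodic sweeps, ejection times that grow with repeated ejections, and a cap on the share of hosts that can be ejected.

```typescript
// Illustrative model of consecutive-error ejection; not Envoy's actual code.
class OutlierDetector {
  private maxConsecutiveErrors: number;
  private baseEjectionMs: number;
  private now: () => number;
  private errorStreak = new Map<string, number>();
  private ejectedUntil = new Map<string, number>();

  constructor(
    maxConsecutiveErrors: number,
    baseEjectionMs: number,
    now: () => number = () => Date.now(),
  ) {
    this.maxConsecutiveErrors = maxConsecutiveErrors;
    this.baseEjectionMs = baseEjectionMs;
    this.now = now;
  }

  // Record the outcome of one request to `host`.
  record(host: string, ok: boolean): void {
    if (ok) {
      this.errorStreak.set(host, 0); // any success resets the streak
      return;
    }
    const streak = (this.errorStreak.get(host) ?? 0) + 1;
    if (streak >= this.maxConsecutiveErrors) {
      this.ejectedUntil.set(host, this.now() + this.baseEjectionMs);
      this.errorStreak.set(host, 0);
    } else {
      this.errorStreak.set(host, streak);
    }
  }

  // An ejected host becomes eligible again once its ejection time has passed.
  isEjected(host: string): boolean {
    const until = this.ejectedUntil.get(host);
    return until !== undefined && this.now() < until;
  }

  // Filter a host list down to those currently eligible for traffic.
  eligible(hosts: string[]): string[] {
    return hosts.filter((h) => !this.isEjected(h));
  }
}
```

Without a mesh, logic of this kind has to live in every client library; moving it into the sidecar is what lets the application stay unaware of it.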
Health Check Conventions
Services must expose health checks that discovery systems can query:
gRPC Health Check Protocol:
import { HealthImplementation } from 'grpc-health-check';
// Statuses are plain strings ('SERVING' | 'NOT_SERVING' | 'UNKNOWN') in the
// grpc-health-check v2 API, which is the version that exports HealthImplementation.
const healthImpl = new HealthImplementation({
  '': 'SERVING',
  'order.v1.OrderService': 'SERVING',
});
// Register the health service on your grpc-js server: healthImpl.addToServer(server);
// Update when service degrades
async function checkDatabaseHealth() {
  try {
    await db.query('SELECT 1');
    healthImpl.setStatus('order.v1.OrderService', 'SERVING');
  } catch {
    healthImpl.setStatus('order.v1.OrderService', 'NOT_SERVING');
  }
}
setInterval(checkDatabaseHealth, 10_000);
HTTP health check (for non-gRPC services):
app.get('/health/ready', async (req, res) => {
  try {
    await Promise.all([db.query('SELECT 1'), redis.ping()]);
    res.json({ status: 'ready' });
  } catch (err) {
    res.status(503).json({ status: 'not ready', error: err.message });
  }
});
app.get('/health/live', (req, res) => {
  res.json({ status: 'alive' });
});
/health/live: is the process running and responsive? Kubernetes uses it to restart containers that have hung.
/health/ready: can the service take traffic? Service discovery uses it to decide where to route requests.
Zero-Downtime Deploys
The moment between “old instance stops” and “new instance is ready” is when discovery goes wrong.
# Kubernetes deployment with readiness probe
spec:
  strategy:
    rollingUpdate:
      maxSurge: 1        # spin up 1 new pod before killing old
      maxUnavailable: 0  # never kill before replacement is ready
  template:
    spec:
      containers:
        - readinessProbe:
            grpc:
              port: 50051
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 3
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "5"]  # wait for LB to deregister before SIGTERM
The preStop sleep delays SIGTERM, giving Kubernetes time to remove the pod from the Service endpoints and for that change to reach proxies and load balancers. Without it, there is a brief window where the LB still routes to a pod that is shutting down.
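On the application side, the counterpart to the preStop delay is a graceful drain once SIGTERM does arrive: report not-ready, stop taking new work, and exit only after in-flight requests finish. The sketch below models that as a small state holder. `DrainState` and its methods are hypothetical names, and wiring it to a signal handler and to the readiness endpoint is left out.

```typescript
// Hypothetical shutdown coordinator; names are illustrative, not a library API.
class DrainState {
  private inFlight = 0;
  private draining = false;

  // The readiness endpoint should report this value.
  isReady(): boolean {
    return !this.draining;
  }

  // Call when SIGTERM arrives: fail readiness so no new traffic is routed here.
  beginDrain(): void {
    this.draining = true;
  }

  // Call at the start of each request; returns false if it should be rejected.
  tryStart(): boolean {
    if (this.draining) return false;
    this.inFlight++;
    return true;
  }

  // Call when a request that was started has completed.
  finish(): void {
    if (this.inFlight > 0) this.inFlight--;
  }

  // The process may exit once draining has begun and in-flight work is done.
  canExit(): boolean {
    return this.draining && this.inFlight === 0;
  }
}
```

A real server would typically also bound the drain with a timeout shorter than the pod’s termination grace period, so that a stuck request cannot block shutdown until Kubernetes sends SIGKILL.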