Service Mesh Internals

Envoy, Istio, Linkerd, sidecar vs ambient, mTLS, xDS, retries, circuit breakers, traffic shifting. What a mesh actually does and when it earns its complexity.

Real-World Analogy

A service mesh is like the electrical wiring in a building — every room gets power, circuit breakers, and grounding without each room wiring itself. You pay for the infrastructure once; every tenant benefits automatically.

What a service mesh is, in one paragraph

A service mesh moves cross-cutting networking concerns — mTLS, retries, timeouts, traffic shifting, telemetry, circuit breakers, load balancing — out of every service and into a dedicated proxy that sits next to (sidecar) or under (ambient/eBPF) every service. The proxies are the data plane. A separate control plane configures them and ships them updated routing and policy.

You buy: consistent zero-trust networking, language-agnostic resilience, deep telemetry, traffic-shifted deploys.

You pay: an extra hop per call, a new control plane to operate, a learning curve, and a new failure mode (the mesh itself).

When a mesh actually earns its keep

Situation	Mesh worth it?
1 monolith + 3 services, single language	No. Use HTTP keepalive + a library.
50 services, 5 languages, mTLS required	Yes. The library cost dominates.
Zero-trust mandate, policy-as-code	Yes. Mesh is the natural enforcement point.
Pure event-driven (Kafka/SQS) services	No. The mesh sits on RPC paths, not queues.
Heavy egress to third-party SaaS	Partial. Egress gateway is useful; full mesh isn’t.

The honest test: list the cross-cutting concerns you’d otherwise build into N libraries. If the list is short, skip the mesh.

Envoy — the data plane the industry standardized on

Istio uses Envoy. Linkerd has its own (Rust-based linkerd2-proxy). Cilium has its own (eBPF + Envoy for L7). Most “service mesh” articles are really Envoy articles.

Envoy is a high-performance L4/L7 proxy with a few defining ideas:

- Configuration is dynamic. xDS APIs (LDS/RDS/CDS/EDS) push updates
  without restart.
- Filter chains. Each connection runs through a stack of filters
  (TLS termination → HTTP parsing → routing → load balancing →
   rate limit → upstream connection pool).
- Listeners (downstream) and Clusters (upstream). Listeners terminate
  client connections; Clusters represent groups of upstream endpoints.
- First-class observability. Stats and access logs are part of the data
  model, not bolted on.
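
The filter-chain idea above can be sketched as a toy model. This is not Envoy's API: the filter names, the request dictionary, and the routing rule are all hypothetical, chosen only to show how each filter either passes the request along or short-circuits the chain.

```python
from typing import Callable, List, Optional

# A filter inspects (and may mutate) the request. Returning a string
# short-circuits the chain with that response; returning None continues.
Filter = Callable[[dict], Optional[str]]


def tls_filter(request: dict) -> Optional[str]:
    # Reject anything that did not arrive over TLS.
    return None if request.get("tls") else "rejected: plaintext connection"


def router_filter(request: dict) -> Optional[str]:
    # Hypothetical routing rule: pick an upstream cluster by path prefix.
    prefix = "/v1/charges"
    request["cluster"] = "payments" if request["path"].startswith(prefix) else "default"
    return None


def rate_limit_filter(request: dict) -> Optional[str]:
    # The "over_limit" key stands in for a real rate-limit decision.
    return "rejected: rate limited" if request.get("over_limit") else None


def run_chain(filters: List[Filter], request: dict) -> str:
    """Run filters in order; the first non-None result ends the chain."""
    for apply_filter in filters:
        response = apply_filter(request)
        if response is not None:
            return response
    return f"forwarded to cluster {request['cluster']}"


CHAIN = [tls_filter, router_filter, rate_limit_filter]
```

Order matters in this model the same way it does in a real proxy: a request rejected by the first filter never reaches the router.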

xDS — the Envoy config protocol

LDS — Listener Discovery Service.    Where Envoy listens.
RDS — Route Discovery Service.       HTTP routing rules.
CDS — Cluster Discovery Service.     Upstream service definitions.
EDS — Endpoint Discovery Service.    Endpoints inside each cluster.
SDS — Secret Discovery Service.      Certificates for mTLS.

Istiod, Cilium’s mesh agent, Consul Connect — all speak xDS to Envoy. If you understand the xDS taxonomy, you can debug any Envoy-based mesh.

# Dump live Envoy config from an Istio sidecar
istioctl proxy-config listener payments-api-7c5d8 -n team-payments
istioctl proxy-config cluster  payments-api-7c5d8 -n team-payments -o json
istioctl proxy-config route    payments-api-7c5d8 -n team-payments
istioctl proxy-config endpoint payments-api-7c5d8 -n team-payments
# SDS: certificates the sidecar currently holds
istioctl proxy-config secret   payments-api-7c5d8 -n team-payments

The first time a route doesn’t work, dump RDS. The first time mTLS fails, dump SDS. Treat the proxy as inspectable, not magical.
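
One way to internalize the taxonomy is a small lookup from symptom to the xDS API that owns it. The sketch below is a memory aid, not a tool: the symptom strings are made up for illustration, while the subcommand names are the ones `istioctl proxy-config` provides.

```python
# Hypothetical triage table: symptom -> (xDS API, istioctl proxy-config subcommand).
XDS_TRIAGE = {
    "port not accepting traffic": ("LDS", "listener"),
    "request routed to wrong service": ("RDS", "route"),
    "upstream service unknown": ("CDS", "cluster"),
    "no healthy upstream": ("EDS", "endpoint"),
    "TLS handshake failure": ("SDS", "secret"),
}


def xds_api_for(symptom: str) -> str:
    """Name the xDS API whose config slice should be inspected first."""
    return XDS_TRIAGE[symptom][0]


def dump_command(symptom: str, pod: str, namespace: str) -> str:
    """Build the istioctl command that dumps the matching config slice."""
    subcommand = XDS_TRIAGE[symptom][1]
    return f"istioctl proxy-config {subcommand} {pod} -n {namespace}"


print(dump_command("no healthy upstream", "payments-api-7c5d8", "team-payments"))
```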

Sidecar vs ambient — the architectural fight

For a decade, “service mesh” meant “sidecar mesh”: every pod gets an Envoy container running next to it, and iptables rules redirect pod traffic through that Envoy.

Sidecar pros:
  - Per-pod isolation (proxy shares fate with app).
  - Mature, well-understood model.
Sidecar cons:
  - +50–200 MiB RAM per pod and CPU per request.
  - 2 extra hops on every call (in + out).
  - Lifecycle pain (pod must wait for sidecar before serving;
    sidecar must drain before pod terminates).

Ambient mesh (Istio’s newer mode) and Cilium service mesh take a different shape:

Ambient / eBPF mesh:
  - L4 layer ("ztunnel") runs once per node. Handles mTLS for every pod.
  - L7 layer (Envoy in a "waypoint" deployment) only for namespaces that need it.
  - Pods get mTLS without any sidecar.
  - Big drop in resource overhead at scale.

Tradeoffs:
  - Less mature (Istio ambient went GA in 2024).
  - L7 features still need a proxy somewhere — just deployed differently.
  - Per-tenant blast radius slightly larger (shared ztunnel per node).

If you’re starting fresh in 2026, evaluate ambient/eBPF before sidecar. The resource math at 10,000 pods is dramatic.
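
The resource math can be checked with back-of-the-envelope arithmetic. The sketch below uses the 50–200 MiB per-sidecar range quoted above; the node count and the per-node proxy footprint are hypothetical placeholders, so substitute measurements from your own cluster.

```python
def sidecar_memory_gib(pods: int, mib_per_sidecar: float) -> float:
    """Total sidecar memory across the fleet, in GiB."""
    return pods * mib_per_sidecar / 1024


def per_node_proxy_memory_gib(nodes: int, mib_per_node_proxy: float) -> float:
    """Total memory for one shared L4 proxy per node, in GiB."""
    return nodes * mib_per_node_proxy / 1024


PODS = 10_000
low = sidecar_memory_gib(PODS, 50)     # 488.28125 GiB
high = sidecar_memory_gib(PODS, 200)   # 1953.125 GiB

# ASSUMPTION: 200 nodes and 100 MiB per node-level proxy are illustrative
# numbers, not measurements of any real mesh.
shared = per_node_proxy_memory_gib(200, 100)  # 19.53125 GiB

print(f"sidecars: {low:.0f} to {high:.0f} GiB, per-node proxies: {shared:.1f} GiB")
```

This ignores waypoint proxies for L7 features, which add back some cost for the namespaces that need them.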

mTLS — the feature that pays for the mesh

Mutual TLS: every connection inside the cluster is TLS-encrypted, and both ends present certificates the other validates. The mesh:

Provisions per-workload certs automatically (SPIFFE/SPIRE identities or Istio’s CA).
Rotates them regularly (default 24 h in Istio).
Enforces mTLS via policy (PeerAuthentication: STRICT).

# Istio: require mTLS in production namespaces
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
  namespace: team-payments
spec:
  mtls:
    mode: STRICT

The two operational gotchas:

PERMISSIVE → STRICT migration must be staged. Start PERMISSIVE (accept both plain + mTLS), confirm all clients have sidecars, then switch to STRICT.
Cert rotation depends on the control plane. An Istiod outage that outlasts the remaining certificate lifetime leaves pods with expired certs. Monitor cert age.
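
"Monitor cert age" can be made concrete with a small check on a certificate's validity window. This is a minimal sketch: the 25% threshold is a hypothetical alerting choice, and fetching the actual notBefore/notAfter timestamps from the proxy is left out.

```python
from datetime import datetime, timedelta, timezone


def cert_needs_alert(
    not_before: datetime,
    not_after: datetime,
    now: datetime,
    min_remaining_fraction: float = 0.25,  # hypothetical alert threshold
) -> bool:
    """Alert when less than the given fraction of the cert lifetime remains."""
    lifetime = not_after - not_before
    remaining = not_after - now
    return remaining <= lifetime * min_remaining_fraction


issued = datetime(2026, 1, 1, tzinfo=timezone.utc)
expires = issued + timedelta(hours=24)  # matches the 24 h default mentioned above

# Half the lifetime left: rotation is not overdue yet.
print(cert_needs_alert(issued, expires, issued + timedelta(hours=12)))
# Only 4 h left: rotation should have happened, so alert.
print(cert_needs_alert(issued, expires, issued + timedelta(hours=20)))
```

Alerting on remaining lifetime rather than on control-plane health catches the case where Istiod looks up but rotation has silently stalled.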

Authorization — workload-level RBAC over the network

Once you have identity (mTLS gives every workload a verifiable name), you can write network policies that look like API authorization:

apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: payments-api-allow
  namespace: team-payments
spec:
  selector:
    matchLabels: { app: payments-api }
  action: ALLOW
  rules:
    - from:
        - source:
            principals:
              - cluster.local/ns/team-checkout/sa/checkout-api
              - cluster.local/ns/team-admin/sa/admin-tools
      to:
        - operation:
            methods: [POST, PUT]
            paths: [/v1/charges/*]

Compare to NetworkPolicy: NP says “this pod label can talk to this pod label.” AuthorizationPolicy says “this identity can perform this RPC.” That’s the leap from network ACL to service-level RBAC.
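
To make the "identity can perform this RPC" idea concrete, here is a toy evaluator for a single ALLOW rule shaped like the policy above. It is a simplified model, not Istio's implementation: it handles only exact paths and trailing-`*` prefix patterns, and the `Rule` class is a hypothetical stand-in for the YAML.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Rule:
    principals: FrozenSet[str]
    methods: FrozenSet[str]
    paths: Tuple[str, ...]


def path_matches(pattern: str, path: str) -> bool:
    # A trailing "*" means prefix match; anything else must match exactly.
    if pattern.endswith("*"):
        return path.startswith(pattern[:-1])
    return path == pattern


def allowed(rule: Rule, principal: str, method: str, path: str) -> bool:
    """True only if identity, method, and path all match the rule."""
    return (
        principal in rule.principals
        and method in rule.methods
        and any(path_matches(p, path) for p in rule.paths)
    )


PAYMENTS_RULE = Rule(
    principals=frozenset({
        "cluster.local/ns/team-checkout/sa/checkout-api",
        "cluster.local/ns/team-admin/sa/admin-tools",
    }),
    methods=frozenset({"POST", "PUT"}),
    paths=("/v1/charges/*",),
)
```

Note that under prefix semantics the pattern `/v1/charges/*` does not match the bare path `/v1/charges`, so a policy meant to cover a create call on the collection would need that path listed as well.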

Traffic management — the second feature you’ll actually use

Canary deploys, blue-green, mirror traffic for testing, fault injection — all become declarative.

# Send 5% to v2, 95% to v1
# (subsets v1 and v2 must be defined in a DestinationRule for this host)
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: payments-api
spec:
  hosts: [payments-api]
  http:
    - route:
        - destination: { host: payments-api, subset: v1 }
          weight: 95
        - destination: { host: payments-api, subset: v2 }
          weight: 5

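# Inject 100ms delay for 0.1% of requests, to test client timeouts
# (shown as its own resource for clarity; in a sidecar mesh, keep one
#  VirtualService per host and fold the fault block into it)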
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: payments-api-fault
spec:
  hosts: [payments-api]
  http:
    - fault:
        delay:
          percentage: { value: 0.1 }
          fixedDelay: 100ms
      route:
        - destination: { host: payments-api }

Argo Rollouts + a mesh make progressive delivery (auto-canary, auto-rollback on SLO breach) a 50-line config.

Resilience features — and the trap of double retries

Mesh-level retries, timeouts, and circuit breakers are powerful and dangerous.

apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: payments-api
spec:
  hosts: [payments-api]
  http:
    - route: [{ destination: { host: payments-api } }]
      timeout: 2s
      retries:
        attempts: 3
        perTryTimeout: 500ms
        retryOn: 5xx,connect-failure,reset

The trap: if the application and the mesh both retry, the attempts multiply (retry amplification). Three layers that each make up to 3 attempts (client, mesh, backend) turn one failed request into 3 × 3 × 3 = 27 calls. At scale this turns a tiny upstream blip into a stampede that takes the upstream out.
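
The amplification is plain multiplication, which a few lines of arithmetic make visible. This sketch assumes the worst case, where every attempt at every layer fails and triggers the next one.

```python
from math import prod
from typing import Sequence


def worst_case_calls(attempts_per_layer: Sequence[int]) -> int:
    """Calls reaching the last hop when every attempt at every layer fails."""
    return prod(attempts_per_layer)


# Three layers, each making up to 3 attempts.
print(worst_case_calls([3, 3, 3]))   # 27

# If "3 retries" means 1 original try plus 3 retries, each layer makes 4 attempts.
print(worst_case_calls([4, 4, 4]))   # 64

# Retrying at a single layer keeps the growth linear.
print(worst_case_calls([1, 4, 1]))   # 4
```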

The senior-team rule: decide where retries live, and disable them everywhere else. Mesh-level retries are the right answer most of the time because they share a budget across services. Application-level retries should be the exception — and labeled “no further retry” so the mesh doesn’t retry on top.
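
A retry budget caps retries as a fraction of normal traffic instead of a fixed count per request. The class below is a minimal sketch of that idea under simplifying assumptions (no time window, no decay); it is not the implementation used by any particular mesh, and the 25% ratio is a hypothetical setting.

```python
class RetryBudget:
    """Permit retries only while they stay under a fraction of requests seen."""

    def __init__(self, ratio: float = 0.25) -> None:
        self.ratio = ratio      # hypothetical: retries may add at most 25% load
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        self.requests += 1

    def try_retry(self) -> bool:
        # Spend one unit of budget if the retry keeps us within the ratio.
        if self.retries + 1 <= self.requests * self.ratio:
            self.retries += 1
            return True
        return False


budget = RetryBudget(ratio=0.25)
for _ in range(8):
    budget.record_request()

# 8 requests at a 25% ratio leave room for exactly 2 retries.
print([budget.try_retry() for _ in range(3)])
```

Unlike per-request retry counts, the extra load here is bounded no matter how many requests fail at once.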

Circuit breakers and outlier detection in Envoy

apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: payments-api
spec:
  host: payments-api
  trafficPolicy:
    connectionPool:
      tcp: { maxConnections: 100 }
      http: { http2MaxRequests: 1000, maxRequestsPerConnection: 10 }
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 50

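This says two things. The connectionPool block is the circuit breaker proper: it caps connections and in-flight requests to payments-api, and requests over the cap are rejected immediately rather than queued. The outlierDetection block says: if a backend pod returns 5 consecutive 5xx, eject it for at least 30 s, with at most 50% of pods ejected at once. The maxEjectionPercent cap is critical: Istio defaults it to 10%, and raising it to 100 lets a bad deploy eject every replica and leave you with zero capacity.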
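
The effect of the ejection cap is easy to see in a simplified model. The function below is a sketch, not Envoy's algorithm: it ignores the detection interval and growing ejection times, and simply picks which hosts would be ejected given their consecutive-5xx counts.

```python
from typing import Dict, List


def hosts_to_eject(
    consecutive_5xx: Dict[str, int],
    threshold: int = 5,
    max_ejection_percent: int = 50,
) -> List[str]:
    """Return hosts over the error threshold, limited by the ejection cap."""
    cap = len(consecutive_5xx) * max_ejection_percent // 100
    failing = sorted(h for h, n in consecutive_5xx.items() if n >= threshold)
    return failing[:cap]


# A bad deploy: every replica is returning 5xx.
bad_deploy = {"pod-a": 7, "pod-b": 5, "pod-c": 9, "pod-d": 6}

print(hosts_to_eject(bad_deploy))                             # capped at half
print(hosts_to_eject(bad_deploy, max_ejection_percent=100))   # nothing left
```

With the cap at 50%, two pods keep receiving traffic even though they are failing, which is usually preferable to having no upstream at all.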

Observability that comes for free

Every mesh ships:

Per-RPC metrics (request rate, error rate, duration percentiles) — RED automatically.
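Distributed traces with mesh-injected B3/W3C trace headers (the network spans show up free, but each app must forward the incoming trace headers on its outbound calls for spans to join into one trace, and you still need a tracer for app spans).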
Access logs for every request, in JSON or any format you like.

This alone is the value pitch for many teams: even before you use traffic shifting or mTLS, the mesh gives you uniform telemetry across every service in every language.
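
As an illustration of what "RED automatically" means, the sketch below derives rate, error ratio, and a p99 duration from a batch of access-log-like records. The field names `status` and `duration_ms` are assumptions for this example; real meshes export these numbers as metrics rather than computing them from logs like this.

```python
from typing import Dict, List


def red_summary(entries: List[dict], window_seconds: float) -> Dict[str, float]:
    """Compute Rate, Errors, Duration (p99) over one window of requests."""
    if not entries:
        return {"rate_rps": 0.0, "error_ratio": 0.0, "p99_ms": 0.0}
    durations = sorted(e["duration_ms"] for e in entries)
    errors = sum(1 for e in entries if e["status"] >= 500)
    # Nearest-rank p99 using integer arithmetic: ceil(0.99 * n) - 1.
    rank = -(-99 * len(durations) // 100)
    return {
        "rate_rps": len(entries) / window_seconds,
        "error_ratio": errors / len(entries),
        "p99_ms": durations[rank - 1],
    }


# Synthetic sample: 100 requests in 10 s, the first 5 of them failing.
sample = [
    {"status": 503 if i < 5 else 200, "duration_ms": i + 1}
    for i in range(100)
]
print(red_summary(sample, window_seconds=10))
```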

Failure modes the mesh adds

Every layer is a layer that can fail. The mesh-specific failure patterns:

1. Sidecar crash loops cause pod failure even when the app is fine.
   → Set sidecar resource requests; monitor sidecar restart count.

2. Control plane outage stalls config rollouts.
   → Mesh data plane keeps working with last known config. Long outages
     stall cert rotation though — alert on cert TTL.

3. Misconfigured VirtualService routes 100% of traffic to a non-existent subset.
   → Always canary route changes through staging first.

4. Egress gateway dies → all external traffic dies.
   → Egress gateway is a SPOF unless replicated. Multi-AZ + PDB.

5. mTLS cert expiry across the cluster (control plane was down for 36h).
   → Alert on cert age, not just cluster status.

Each of these has caused a Sev-1 in some team somewhere. The mesh gives capability; it adds another component you must operate.

Mesh comparison — what to pick in 2026

	Istio (sidecar)	Istio Ambient	Linkerd	Cilium Service Mesh
Data plane	Envoy sidecar	Ztunnel + Envoy waypoint	linkerd2-proxy (Rust)	Envoy + eBPF
Resource overhead	High	Low	Lowest of sidecar meshes	Lowest at scale
Maturity	Very mature	GA 2024	Mature	Mature (CNCF graduated)
L7 features	Full Envoy	Via waypoint	Less than Envoy	Full Envoy (when needed)
Best for	Existing Istio shops	Greenfield K8s	Simplicity-focused	Already Cilium for CNI
mTLS	Yes (Istio CA)	Yes	Yes	Yes (SPIFFE)
Multi-cluster	Yes (complex)	Yes	Yes	Yes

The real-world picks:

Cilium for CNI already? Cilium service mesh is the lowest-friction path.
Greenfield K8s, want simplest mesh? Linkerd. Boring is good.
Already on Istio? Migrate to ambient when stable for your use case.
Heavy on Envoy already (e.g. front proxy fleet)? Istio gives consistency.

A real mesh debugging walkthrough

Symptom: a downstream service intermittently sees 503s with upstream connect error or disconnect/reset before headers. Five-minute spike, then quiet, then back. No app logs.

Step 1: Check the mesh access log on the SOURCE side (assumes access logging is enabled).
   kubectl logs payments-api-xxx -c istio-proxy
   Look for response_flags:
     UH = no healthy upstream → endpoint discovery problem
     UF = upstream connection failure → backend died
     UO = upstream overflow (circuit breaker tripped) → check connectionPool limits
     URX = upstream retry limit exceeded
     LR = local reset (sidecar killed connection itself)

Step 2: If UH, check EDS. Does the source see the destination's endpoints?
   istioctl pc endpoint payments-api-xxx | grep payments-backend

Step 3: If endpoints look right, check the destination side.
   kubectl logs payments-backend-xxx -c istio-proxy
   Look for sidecar startup time, mTLS cert mismatches.

Step 4: Common root causes for this exact pattern:
   - Backend pods restarting faster than EDS propagates → set
     terminationGracePeriodSeconds on pods, drain via preStop hook.
   - Outlier detection ejecting healthy pods → tune the
     consecutive5xxErrors threshold.
   - Sidecar OOMKilled because backend bursts traffic to it →
     bump the sidecar memory request and limit.

This kind of triage is impossible without knowing the mesh’s data model. Which is why the chapter exists.
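
The response-flag lookup from Step 1 can be kept as a small decoder so that triage starts from the flag rather than from memory. The hint text is this sketch's own wording; the flag codes are Envoy's, and Envoy writes multiple flags comma-separated with `-` meaning none. UO is raised by connection pool limits, so its hint points at connectionPool.

```python
RESPONSE_FLAG_HINTS = {
    "UH": "no healthy upstream: check EDS (proxy-config endpoint)",
    "UF": "upstream connection failure: check backend pod health",
    "UO": "upstream overflow: check connectionPool limits",
    "URX": "retry limit exceeded: check why every attempt failed",
    "LR": "local reset: check sidecar logs and resource limits",
}


def triage(response_flags: str) -> list:
    """Map an access-log response_flags value to next-step hints."""
    flags = [f for f in response_flags.split(",") if f and f != "-"]
    return [RESPONSE_FLAG_HINTS.get(f, f"unrecognized flag {f}") for f in flags]


print(triage("UF,URX"))
print(triage("-"))
```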

Common mistakes

Adopting a mesh because it’s trendy. It’s a tax. Make sure you collect.
Sidecar without resource requests. Sidecar OOMs cause weird app errors.
Both app and mesh retry. Pick one layer; disable on the other.
PERMISSIVE forever. STRICT mTLS is the goal; staying PERMISSIVE means you don’t know what’s encrypted.
Ignoring control-plane HA. Istiod is a SPOF for cert rotation. Run multiple replicas across zones.
Treating VirtualService as immutable. Route changes need canary + monitoring like any deploy.

Tools tier list

Tier S (know cold)
  istioctl / linkerd CLI / cilium CLI, kubectl + access to sidecar logs

Tier A (worth a week)
  Argo Rollouts + mesh integration (auto-canary)
  Kiali (Istio service graph), Linkerd dashboard
  Goldpinger (synthetic connectivity checks between nodes)

Tier B
  Custom EnvoyFilter (you're going off-paved-road)
  WASM filters in Envoy (powerful, niche)
  SPIRE / SPIFFE for cross-cluster identity

Tier F
  Mesh in front of Kafka/queues. Wrong layer.
  iptables hand-tuned to "fix" sidecar redirection. You'll regret it.

Stay current

Istio docs — sidecar + ambient mode reference
Linkerd docs — Rust-based, simpler alternative
Cilium service mesh — eBPF-native mesh
Envoy docs — the data plane underneath most meshes

Key Takeaways

A mesh is a tax with a payback — adopt it when the cross-cutting library cost dominates.
Envoy + xDS is the underlying data model of Istio, Cilium, Consul Connect. Learn it once.
Ambient / eBPF meshes are the future at scale — sidecar overhead becomes brutal past 10k pods.
mTLS + workload identity-based authz is what makes a mesh strategically valuable.
Retry budgets, not multiplicative retries — pick one layer to retry from.
The mesh adds failure modes — control plane HA, cert TTLs, sidecar OOMs are now your problem.