← SRE · beginner · 25 min · 11 / 21 বাংলা

Scaling & Distributed Systems — 8-Week Companion Roadmap

Zero to designing, building, and operating scalable systems. 1–2 hrs/day, 8 weeks, 8 real projects, 5 case studies, mini-YouTube capstone.

scalingdistributed systemsroadmapcapstonesystem design

Real-World Analogy

A construction schedule — each week’s work depends on the last, you can’t pour the roof before the foundation cures, and the capstone at the end only makes sense because everything beneath it is solid.

What this roadmap is

A second 8-week plan, complementary to the fullstack-to-SRE roadmap. Where the SRE roadmap focuses on operating production (SLOs, on-call, postmortems), this one focuses on building the scalable systems you will operate: networking primitives, load balancing, replication, caching, queues, distributed-systems theory, containers, observability, and a real capstone.

Pick whichever roadmap maps to your gap. Many learners run them sequentially — scaling first to build the system, SRE second to operate it.

8 weeks   ·   8 real projects   ·   5+ case studies   ·   1–2 h/day

Pacing

This roadmap is paced at 1–2 hrs/day instead of 15–18 hrs/week. It is gentler than the SRE roadmap and built around evening study after a full-time job. Expect to extend a week if a project stretches; that is fine — depth beats schedule adherence.

Phase 1 — Foundation (Weeks 1–2)
  W01  How the internet & servers actually work
  W02  Load balancing, reverse proxies & failover

Phase 2 — Core Scaling Concepts (Weeks 3–5)
  W03  Databases, replication & sharding
  W04  Caching strategies & message queues
  W05  Distributed systems theory

Phase 3 — Real-World Scaling (Weeks 6–8)
  W06  Containers, horizontal scaling & service discovery
  W07  Observability, resilience & failure patterns
  W08  Capstone — design & build a real scalable system

Phase 1 — Foundation (Weeks 1–2)

Week 01 — How the internet & servers actually work

Topics

DNS resolution — how a hostname becomes an IP address
TCP/IP handshake — how two machines establish a connection
HTTP/1.1 vs HTTP/2 vs HTTP/3 — what changed and why
What a server process actually is — sockets, file descriptors, accept loops
Linux fundamentals — processes, threads, file I/O, networking tools (netstat, ss, curl, dig)
Blocking vs non-blocking I/O — why Node.js handles 10k connections and Apache chokes

Key concepts

TCP 3-way handshake · DNS TTL · OSI model · file descriptors · epoll / kqueue · keep-alive · TLS termination

Resources

Hussein Nasser — Fundamentals of Networking (YouTube)
roadmap.sh/devops
Linux Journey — linuxjourney.com (free)
High Performance Browser Networking — Ilya Grigorik (free online)

Case study — How Cloudflare handles 100M+ requests/sec

They run a single event loop per CPU core using epoll, avoid thread context switching entirely, and terminate TLS at the edge. The same principles apply when you run Nginx — it uses the same non-blocking architecture.

Weekly project — Mini HTTP server from scratch

Build a basic HTTP server in any language (Node.js, Python, Go) that handles concurrent connections without blocking. Forces you to understand what Nginx actually does under the hood.

Open a TCP socket, bind to port 8080, listen for connections
Parse a raw HTTP GET request (method, path, headers)
Respond with a valid HTTP/1.1 response (status line + headers + body)
Handle 100 concurrent connections — measure where it breaks
Add keep-alive support — observe connection reuse in tcpdump
Benchmark your server vs Nginx using wrk — record req/sec difference

Deliverable: Working HTTP server + benchmark report (your server vs Nginx req/sec).

Lab (Thu–Fri)

# Provision 2 Hetzner CX11 VMs (~€4/mo each). SSH in.
ss -tlnp                           # observe what's listening on every port
dig google.com                     # trace the full DNS resolution chain
curl -v https://example.com        # read every line of the TLS handshake
# Deploy your project to VM1, hit it from VM2

Week 02 — Load balancing, reverse proxies & failover

Topics

What a reverse proxy is — TLS offloading, compression, routing
Load balancing algorithms — round robin, least connections, IP hash, weighted
Active vs passive health checks — how HAProxy detects a dead backend
Session persistence (sticky sessions) — when and why to avoid it
keepalived + VRRP — Virtual IP failover at the network layer
Connection draining — removing a node without dropping live requests

Key concepts

VIP / virtual IP · VRRP · upstream pool · active-passive HA · connection draining · layer 4 vs layer 7 LB · PROXY protocol

Resources

Hussein Nasser — HAProxy series (YouTube)
HAProxy official docs
Nginx upstream module docs
keepalived.org — user guide

Case study — GitHub’s 24-hour outage (2018)

A network partition caused primary and replica databases to diverge. HAProxy health checks failed too slowly, routing traffic to degraded nodes for minutes before detection. The fix: health check interval reduced to 500ms + circuit breakers added. Exactly the pattern you’re building this week.

Weekly project — Production-grade streaming node failover

Rebuild a real failover setup. Measure everything and target zero dropped requests during switchover.

Install HAProxy on a 3rd VM. Configure Node 01 as primary, Node 02 as backup
Write a custom health check script that verifies the stream process (not just TCP port)
Set health check interval 500ms, rise=2, fall=3
Install keepalived on Node 01 and 02. Assign shared VIP. Point HAProxy to the VIP.
Run wrk load test (-t4 -c100 -d60s). Kill Node 01 mid-test. Record failed requests.
Tune until zero requests fail during failover. Document your final failover time in ms.

Deliverable: Zero-downtime failover proof via wrk benchmark. Failover time documented in milliseconds.

Chaos test (Weekend)

# Break it deliberately
pkill node                                            # kill Node 01 process — does HAProxy detect within 1s?
poweroff                                              # kill entire Node 01 VM — does VIP move to Node 02?
tc qdisc add dev eth0 root netem delay 200ms          # add latency
dd if=/dev/zero of=/tmp/fill bs=1M                    # fill Node 01 disk to 100% — does your health check catch non-TCP failures?

Phase 2 — Core Scaling Concepts (Weeks 3–5)

Week 03 — Databases, replication & sharding

Topics

How Postgres stores data — heap files, pages, B-tree indexes
WAL (Write-Ahead Log) — how replication works at the byte level
Read replicas — synchronous vs asynchronous replication tradeoffs
Connection pooling — why 10,000 Postgres connections kills performance; how PgBouncer fixes it
Sharding strategies — range, hash, directory-based
When NOT to shard — most apps never need it
EXPLAIN ANALYZE — reading a query plan, spotting seq scans

Key concepts

WAL · replication lag · MVCC · connection pooling · index scan vs seq scan · shard key · hot standby · VACUUM

Resources

DDIA — Ch. 3, 5 (storage engines, replication)
Arpit Bhayani — How Instagram sharded their DB (YouTube)
Postgres docs — WAL and streaming replication
Use The Index, Luke — use-the-index-luke.com (free)

Case study — How Instagram scaled to 1 billion users with PostgreSQL

They ran a single Postgres primary for years by using read replicas aggressively and PgBouncer for connection pooling. They only sharded when a single machine’s RAM could no longer hold the working set. Key lesson: add replicas and fix slow queries before you consider sharding.

Weekly project — Postgres primary + replica with automatic failover

Set up real database replication and observe what happens during failover. This is what every production web app runs.

Install Postgres on 2 VMs. Configure VM1 as primary (pg_hba.conf, postgresql.conf)
Stream WAL to VM2 using pg_basebackup + streaming replication config
Route writes to primary, reads to replica via app-level connection string env var
Install PgBouncer on both. Set pool_mode=transaction. Observe connection count drop.
Simulate primary failure — promote replica manually using pg_ctl promote
Install Patroni to automate promotion without any manual steps

Deliverable: Auto-failover Postgres cluster. Measure replication lag under 1,000 writes/sec.

Lab

-- EXPLAIN ANALYZE a slow query. Add an index. Run again. Record time difference.
EXPLAIN ANALYZE SELECT * FROM events WHERE user_id = 42;

-- INSERT 1M rows. Compare seq scan vs index scan timing.
INSERT INTO events (user_id, payload) SELECT (random()*100000)::int, '...' FROM generate_series(1,1000000);

-- Check lag from the replica
SELECT now() - pg_last_xact_replay_timestamp();

Week 04 — Caching strategies & message queues

Topics

Why caching exists — latency numbers every engineer should know (RAM vs disk vs network)
Cache-aside, read-through, write-through, write-behind patterns
Cache invalidation — TTL vs event-driven expiry, the thundering herd problem
Redis data structures — strings, hashes, sorted sets, pub/sub, streams
Message queues — decoupling producers from consumers, backpressure
Kafka fundamentals — topics, partitions, consumer groups, offsets
At-least-once vs exactly-once delivery — when each matters

Key concepts

cache-aside · TTL · thundering herd · cache stampede · pub/sub · Kafka partition · consumer group · backpressure · dead letter queue

Resources

DDIA — Ch. 11 (stream processing)
ByteByteGo — caching strategies (YouTube)
Redis University — RU101 (free)
Kafka quickstart — kafka.apache.org

Case study — How Twitter’s timeline works

When a celebrity with 10M followers tweets, writing to all 10M timelines synchronously would take minutes. Twitter uses fan-out-on-read with Redis sorted sets — timelines are materialised in Redis, not recomputed per request. Pure caching solving a pure scaling problem.

Weekly project — URL shortener with Redis cache + async analytics queue

Classic real-world project. High read:write ratio (perfect for caching). Background analytics processing (perfect for queues). Used by bit.ly and TinyURL at scale.

Build URL shortener API: POST /shorten returns short code, GET /:code redirects
Store mappings in Postgres. Add Redis cache-aside layer on reads.
Benchmark: measure p99 latency with cache cold vs warm (expect 10–50x improvement)
On every redirect, publish a “click” event to a Redis pub/sub channel or Kafka topic
Write a consumer that reads click events and writes analytics to a separate DB table
Implement a mutex lock to prevent thundering herd on cache miss

Deliverable: URL shortener + benchmark (cache hit vs miss latency) + click analytics flowing through queue.

Lab

# Watch every Redis command in real time while your app runs
redis-cli monitor

# Manually expire a key — observe app behaviour on next request
redis-cli expire mykey 1

# Thundering herd test: restart Redis, hit /api 100 times concurrently — count DB queries
sudo systemctl restart redis && wrk -t8 -c100 -d10s http://localhost/api

Week 05 — Distributed systems theory

Topics

Why distributed systems are hard — partial failures, no global clock, message delays
CAP theorem in practice — what “partition tolerance” actually means day-to-day
Consistency models — strong, eventual, causal, read-your-own-writes
Consensus algorithms — why Raft exists and what problem it solves
Distributed transactions — 2-phase commit and why it’s often avoided
Idempotency — designing operations that are safe to retry
Vector clocks — why timestamps alone don’t establish event ordering

Key concepts

CAP theorem · eventual consistency · Raft consensus · 2PC · idempotency key · vector clock · split brain · quorum

Resources

DDIA — Ch. 8, 9 (read slowly — the hardest chapters)
Martin Kleppmann — Cambridge lecture series (YouTube, free)
MIT 6.824 — Lectures 1–4 (YouTube, free)
raft.github.io — interactive Raft visualisation

Case study — Amazon DynamoDB paper (2007)

Amazon chose eventual consistency deliberately. For a shopping cart, showing a slightly stale cart is better than showing an error. This became the blueprint for every eventually-consistent database. CAP isn’t a flaw — it’s a deliberate design choice with tradeoffs.

Weekly project — Distributed counter with conflict resolution

Build a counter that multiple nodes can increment simultaneously. Solve the split-brain problem. This is the core problem inside every distributed database.

Run 3 instances of a counter service on different ports
Increment from all 3 simultaneously — observe and document the inconsistency
Implement last-write-wins using timestamps — observe why clock skew breaks it
Implement a CRDT G-Counter — each node tracks its own count, merge on read
Implement leader election using Redis SETNX as a distributed lock
Simulate network partition using iptables — block traffic between node 1 and 2, observe behaviour

Deliverable: Counter that stays consistent under concurrent writes and survives a simulated network partition.

Conceptual exercise — theory-heavy week, also do these

Play raft.github.io — elect a leader, kill it, watch automatic re-election
Draw your streaming platform’s CAP tradeoffs on paper — where did you choose consistency? Where availability?
Write a 1-page “consistency contract” — what data in your system can go stale? What must never?

Phase 3 — Real-World Scaling (Weeks 6–8)

Week 06 — Containers, horizontal scaling & service discovery

Topics

Why containers exist — environment consistency, isolation, fast startup
Docker internals — Linux namespaces, cgroups, union filesystems (overlayfs)
Stateless vs stateful services — why stateless scales horizontally and stateful doesn’t
Docker Compose for multi-service local development
Service discovery — how services find each other when IPs are ephemeral
Kubernetes core concepts — pods, deployments, services, ingress
Horizontal Pod Autoscaler — scaling based on CPU or custom metrics

Key concepts

Linux namespaces · cgroups · stateless service · service discovery · sidecar pattern · HPA · rolling deploy · readiness probe

Resources

TechWorld with Nana — Docker full course (YouTube)
labs.play-with-docker.com (free browser playground)
kubernetes.io — interactive tutorials (free)
Ivan Velichko — Container networking from scratch (blog)

Case study — How Spotify moved to Kubernetes (2018)

300+ engineering teams deploying independently without stepping on each other. Kubernetes gave them isolated namespaces per team and rolling deployments with zero downtime. The hard part wasn’t Kubernetes itself — it was making their services stateless first. That’s always the prerequisite.

Weekly project — Containerised multi-node app with auto-scaling on k3s

Take the Week 4 URL shortener and make it run as multiple containers behind a load balancer, scaling automatically under load.

Write a Dockerfile for your URL shortener. Build and run locally.
Write docker-compose.yml: 3 app replicas + Nginx LB + Redis + Postgres
Confirm Nginx distributes requests across replicas (add hostname to API response)
Make the app truly stateless — move any local session/state to Redis
Deploy to k3s cluster (3 VMs). Write Deployment + Service YAML files.
Install metrics-server. Configure HPA: scale 1–5 pods above 50% CPU. Run load test, watch it scale.

Deliverable: App on k3s. HPA scaling proof: screenshot of kubectl get pods growing during load test.

Lab

docker stats                                        # watch CPU/memory per container during load test
docker exec -it container bash                      # explore the container filesystem
kubectl rollout restart deployment/app              # observe zero-downtime rolling restart

Week 07 — Observability, resilience & failure patterns

Topics

The 3 pillars of observability — metrics, logs, distributed traces
RED method — Rate, Errors, Duration (the right metrics to track for any service)
Prometheus data model — time series, labels, PromQL queries
Grafana dashboards — visualising system health in real time
Circuit breaker pattern — fail fast to prevent cascading failure
Retry with exponential backoff and jitter — why plain retry makes things worse
Bulkhead pattern — isolating failures so one bad service can’t kill everything
SLOs, SLAs, error budgets — how Google measures reliability

Key concepts

RED method · SLO / SLA / SLI · p99 latency · circuit breaker · exponential backoff · jitter · bulkhead · error budget · chaos engineering

Resources

Grafana Labs — free tutorials (grafana.com/tutorials)
Prometheus docs — prometheus.io
Site Reliability Engineering — Google (free online)
Netflix Tech Blog — chaos engineering articles

Case study — How Netflix invented Chaos Engineering

In 2010 Netflix built “Chaos Monkey” — a tool that randomly kills production servers. The idea: if your system survives random production failures, you’ll never have a surprise outage. They now run “Chaos Kong” which kills entire AWS availability zones. Insight: you only know your system is resilient after it survives real failure.

Weekly project — Full observability stack + structured chaos test

Instrument your URL shortener fully, build a Grafana dashboard, set up alerts, then deliberately break the system and observe everything in real time.

Add Prometheus client to your app. Expose /metrics: request count, latency histogram, active connections
Deploy Prometheus. Configure scrape every 15s from your app.
Deploy Grafana. Build dashboard: req/sec, p50/p95/p99 latency, error rate, DB pool usage
Write alert rule: fire if error_rate > 1% for 2 minutes. Route to Slack webhook.
Chaos 1: kill Redis container mid-load-test. Does circuit breaker open? Does alert fire?
Chaos 2: add 500ms latency to Postgres (tc netem). Does p99 spike? Write a post-mortem doc.

Deliverable: Grafana dashboard screenshot during chaos. Alert firing proof. Post-mortem doc (what failed, why, what you fixed).

Chaos commands reference — inject failure with these

tc qdisc add dev eth0 root netem delay 200ms        # add network latency
tc qdisc add dev eth0 root netem loss 10%           # drop 10% of packets
dd if=/dev/zero of=/tmp/fill bs=1M                  # fill disk until full
stress --cpu 4 --timeout 60                         # spike CPU, watch HPA scale
iptables -A INPUT -p tcp --dport 5432 -j DROP       # simulate DB unreachable

Week 08 — Capstone: design and build a real scalable system

Topics

Back-of-envelope estimation — calculating storage, bandwidth, QPS before designing
API gateway pattern — rate limiting, auth, routing at the edge
Read-heavy vs write-heavy systems — different architectures for different problems
Database selection — when to use SQL, NoSQL, time-series, graph
CQRS pattern — separate read and write models for high-scale reads
Cost optimisation — the real constraint that shapes real architecture decisions

Key concepts

back-of-envelope · QPS estimation · API gateway · rate limiting · fan-out · CQRS · event sourcing

Resources

System Design Interview Vol.1 — Alex Xu (book)
system-design-primer — github.com/donnemartin (free)
High Scalability — highscalability.com (real case studies)
DDIA — Ch. 1–2 (re-read now — it all clicks)

Case study — How Twitch handles 8 million concurrent viewers

Separate ingest (RTMP) from delivery (HLS). Transcode in parallel across worker pools. Global CDN means viewers pull from edge, not origin. Chat is rate-limited independently from video. Each component scales independently — this is your domain, and it’s exactly the architecture you’re building in the capstone below.

Capstone project — build a mini YouTube

Combines every concept from weeks 1–7. A real video upload + streaming platform. Not a toy — design it to handle actual traffic.

Component 1 — Upload service

Accept video via multipart upload
Store raw file to MinIO (S3-compatible)
Publish video.uploaded event to Kafka
Return job ID immediately (async response)

Component 2 — Transcoding worker

Consume video.uploaded from Kafka
Transcode to 360p / 720p using FFmpeg
Upload segments back to MinIO
Update job status in Postgres

Component 3 — API + serving layer

GET /video/:id — serve HLS playlist
Cache playlist + metadata in Redis
Postgres read replica for metadata queries
Nginx serves .ts segment files directly

Component 4 — Infrastructure

All services in Docker Compose
HAProxy in front of API replicas
Prometheus + Grafana monitoring
Chaos: kill transcoding worker mid-job

Deliverables — what to build and document

Working demo: upload a video, wait for transcode, play it back in a browser
Architecture diagram showing all components, data flows, and failure modes
Load test: how many concurrent viewers before quality degrades?
Failure test: transcoding worker dies mid-job — does it retry from Kafka offset?
Scale plan: back-of-envelope showing how to reach 10,000 concurrent viewers

Books — when to read them

When	Book	How to read
Week 3–5	Designing Data-Intensive Applications — Martin Kleppmann	The best book on distributed systems. Read Ch.3 (storage engines) alongside Week 3, Ch.8–9 (consensus, distributed systems) alongside Week 5. Don’t read cover-to-cover cold — use it alongside the labs and it will make complete sense.
Week 8	System Design Interview Vol.1 — Alex Xu	Practical walkthroughs of designing URL shorteners, Twitter feeds, YouTube, WhatsApp at scale. Read after you’ve built similar things yourself — it’s far more useful when you have context.
After	Site Reliability Engineering — Google (free online)	How Google operates systems at planet scale. SLOs, error budgets, on-call, incident management. Best read once you have something running in production.
Reference	The Linux Command Line — William Shotts (free online)	Essential for all the lab work. Don’t read cover-to-cover — use it as a reference when you’re stuck on a shell command.

Daily rhythm — 1–2 hrs/day

Days	Mode	What to do
Mon – Wed	Watch + Read	1 topic/day. Take notes. Draw diagrams by hand.
Thu – Fri	Build the project	Hands-on only. No videos. Just terminal and editor.
Weekend	Break things	Chaos tests. Write a post-mortem. Review the week.

Both roadmaps end with a working artifact. The SRE roadmap (chapter 0) ends with a fully observable Go service on Kubernetes with SLOs, chaos, and DR. This roadmap ends with a mini-YouTube that survives chaos and scales horizontally. Either is a portfolio piece worth interviewing on. Doing both back-to-back is roughly four months of real, documented systems work.

Stay current

Kubernetes tutorials — version-current hands-on
AWS Builders’ Library — scaling case studies
High Scalability blog — architecture writeups across companies
InfoQ architecture track — talks from real scale operators

Key Takeaways

Eight weeks, 1–2 hrs/day — gentler pace than the SRE roadmap; runs alongside a job
Phase 1 builds the primitives (HTTP, LB, failover) that everything else assumes
Phase 2 covers the core scaling levers — replication, caching, queues, distributed-systems theory
Phase 3 puts it on Kubernetes, observes it, breaks it, then ships the capstone
Mini-YouTube capstone combines every prior week — it is the portfolio artifact
Read DDIA alongside the labs, not cover-to-cover cold — context unlocks the book

What this roadmap is

Contents

Phase 1 — Foundation (Weeks 1–2)

Week 01 — How the internet & servers actually work

Week 02 — Load balancing, reverse proxies & failover

Phase 2 — Core Scaling Concepts (Weeks 3–5)

Week 03 — Databases, replication & sharding

Week 04 — Caching strategies & message queues

Week 05 — Distributed systems theory

Phase 3 — Real-World Scaling (Weeks 6–8)

Week 06 — Containers, horizontal scaling & service discovery

Week 07 — Observability, resilience & failure patterns

Week 08 — Capstone: design and build a real scalable system

Capstone project — build a mini YouTube

Component 1 — Upload service

Component 2 — Transcoding worker

Component 3 — API + serving layer

Component 4 — Infrastructure

Deliverables — what to build and document

Books — when to read them

Daily rhythm — 1–2 hrs/day

Stay current

Key Takeaways