
RTO, RPO, and What They Actually Mean

Two numbers that define your recovery requirements — and why getting them wrong makes your DR plan useless.

Tags: RTO, RPO, disaster recovery, SLA, business continuity

Real-World Analogy

Two questions after a house fire: “How long until we’re back in a home?” (RTO — Recovery Time Objective) and “How much stuff did we lose?” (RPO — Recovery Point Objective). A family that backs up photos to the cloud daily has a 24-hour RPO for photos. A family with a hotel booked in 2 hours has a 2-hour RTO. Disaster recovery planning is answering both questions before the fire.

The Two Numbers

RTO (Recovery Time Objective): How long can your system be down before the business suffers unacceptable harm? The maximum allowable downtime from incident to recovery.

RPO (Recovery Point Objective): How much data can you lose? The maximum acceptable data loss measured in time — if your RPO is 1 hour, you can afford to lose at most 1 hour of transactions.

Timeline of a disaster:

12:00  →  Normal operation
12:30  →  Disaster strikes (database corrupted)
         ↑
         Recovery point: how far back is the last good backup?
         If backups run at midnight: 12.5 hours of data lost
         (this only meets the objective if the RPO is 12.5 hours or more)

12:30  →  Incident detected, recovery begins
13:30  →  System restored and accepting traffic
         ←——————————————→
         Actual downtime: 1 hour (meets any RTO of 1 hour or more)

These are objectives — targets you design your system to meet. They’re not automatic guarantees.
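To make the difference between an objective and what actually happened concrete, here is a minimal Python sketch that measures the timeline above. The function name `measure_incident`, the calendar date, and the two objective values are assumptions chosen for illustration; only the clock times come from the timeline.

```python
from datetime import datetime, timedelta

def measure_incident(last_backup, disaster, restored):
    """Return (data_loss, downtime) as timedeltas for one incident."""
    data_loss = disaster - last_backup  # everything written since the last good backup
    downtime = restored - disaster      # outage from failure until traffic is accepted
    return data_loss, downtime

# Clock times from the timeline above; the calendar date is arbitrary.
last_backup = datetime(2024, 1, 1, 0, 0)   # midnight backup
disaster = datetime(2024, 1, 1, 12, 30)
restored = datetime(2024, 1, 1, 13, 30)

# Hypothetical objectives, picked only to show one pass and one miss.
rpo = timedelta(hours=1)
rto = timedelta(hours=2)

data_loss, downtime = measure_incident(last_backup, disaster, restored)
print(f"data loss {data_loss}, RPO met: {data_loss <= rpo}")
print(f"downtime  {downtime}, RTO met: {downtime <= rto}")
```

With these assumed objectives the recovery is fast enough, but the midnight backup leaves far more data lost than a 1-hour RPO allows.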

Deriving RTO and RPO from Business Requirements

Don’t pick numbers arbitrarily. Work backwards from business impact:

RTO calculation:

What is the hourly cost of downtime?
  Lost revenue:          $5,000/hour
  Staff idle time:       $2,000/hour
  Customer support load: $500/hour
  Reputation damage:     hard to quantify but real

At what point does the cumulative loss justify the cost of faster recovery?
  4 hours of downtime = 4 × $7,500/hour = $30,000 in losses
  Cost to achieve 4-hour RTO: $2,000/month in standby infrastructure
  → 4-hour RTO is economically justified

  1 hour = $7,500 in losses
  Cost to achieve 1-hour RTO: $15,000/month in hot standby + ops
  → 1-hour RTO is probably not justified unless contractually required
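The arithmetic above can be reproduced in a few lines of Python. This is a sketch, not a costing model: the names `HOURLY_COSTS` and `downtime_loss` are made up for the example, and the break-even calculation at the end is an added assumption (it compares the extra yearly spend on a faster tier with the loss avoided per outage), which the text itself does not spell out.

```python
# Hourly downtime costs from the text. Reputation damage is left out
# because the text gives no figure for it.
HOURLY_COSTS = {
    "lost_revenue": 5_000,
    "staff_idle_time": 2_000,
    "customer_support_load": 500,
}

def downtime_loss(hours, hourly_costs=HOURLY_COSTS):
    """Cumulative loss in dollars for one outage lasting `hours`."""
    return hours * sum(hourly_costs.values())

# Candidate RTOs (in hours) and the monthly cost to achieve each, from the text.
monthly_cost = {4: 2_000, 1: 15_000}

for rto_hours, cost in monthly_cost.items():
    loss = downtime_loss(rto_hours)
    print(f"{rto_hours}h RTO: ${loss:,} lost per outage, ${cost:,}/month to achieve")

# Assumption: weigh the extra yearly spend against the loss avoided per outage.
extra_per_year = (monthly_cost[1] - monthly_cost[4]) * 12
avoided_per_outage = downtime_loss(4) - downtime_loss(1)
breakeven_outages_per_year = extra_per_year / avoided_per_outage
print(f"1h RTO breaks even at about {breakeven_outages_per_year:.1f} outages per year")
```

Under these numbers the 1-hour tier would need roughly seven full outages a year to pay for itself, which is one way to read "probably not justified".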

RPO calculation:

What is the cost of losing N hours of data?
  Losing 1 hour of orders: ~500 orders × $80 avg = $40,000 unrecoverable
  Losing 5 minutes of orders: ~40 orders = $3,200

  Cost to achieve 5-minute RPO (continuous WAL archival): $200/month
  → 5-minute RPO clearly justified; 1-hour RPO is unacceptable for orders
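The same estimate works for any RPO if you treat order volume as a steady rate. The sketch below assumes a constant 500 orders per hour at an $80 average, as in the text; the function name `data_loss_cost` is hypothetical. Note that the exact 5-minute figure comes out near $3,333, because the text rounds 41.7 orders down to about 40.

```python
def data_loss_cost(rpo_minutes, orders_per_hour=500, avg_order_value=80):
    """Estimated dollar value of orders lost if `rpo_minutes` of data is gone.

    Assumes orders arrive at a constant rate, which real traffic rarely does.
    """
    lost_orders = orders_per_hour * rpo_minutes / 60
    return lost_orders * avg_order_value

for minutes in (60, 5):
    print(f"{minutes:>2}-minute RPO: about ${data_loss_cost(minutes):,.0f} of orders at risk")
```

If your traffic has strong peaks, running the same calculation with the peak-hour order rate gives a more conservative number.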

Different parts of your system have different RTO/RPO requirements:

System           RTO                      RPO               Reason
Order database   1 hour                   5 minutes         Revenue impact
User accounts    4 hours                  1 hour            Login disruption
Analytics DB     24 hours                 24 hours          Non-operational
Email logs       72 hours                 24 hours          Compliance, not ops
CDN assets       Minutes (CDN failover)   N/A (no writes)   -

Design and budget per system. Don’t apply the tightest requirement uniformly.

Recovery Tiers

RTO/RPO targets map to infrastructure tiers with different costs:

Tier 1: Cold Standby (RTO: hours–days, RPO: hours)

Backups stored in S3/object storage
No hot infrastructure waiting
Recovery: provision new server, restore from backup, catch up
Cost: storage only (~$20/month for 100GB of daily backups)

Tier 2: Warm Standby (RTO: 15 min–1 hour, RPO: minutes)

Backup infrastructure running at reduced scale
Replication keeping it near-current
Recovery: scale up + promote replica + redirect traffic
Cost: 30-50% of full production cost

Tier 3: Hot Standby (RTO: seconds–minutes, RPO: seconds)

Full duplicate production environment
Synchronous replication
Recovery: DNS failover or load balancer redirect
Cost: ~100% additional (2x total infrastructure cost)

Tier 4: Active-Active (RTO: ~0, RPO: ~0)

Traffic distributed across multiple sites simultaneously
Automatic failover with no human intervention
Cost: 2x+ infrastructure + significant engineering complexity

Most applications live at Tier 1–2. Only systems where any downtime is catastrophic (financial trading, healthcare systems, payment processing) justify Tier 3–4.
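One way to apply the tiers is to pick the cheapest one whose best-case recovery still fits the target. The Python sketch below does that under a loud assumption: the text gives ranges such as "hours" or "minutes", and the single thresholds here (for example, 1 hour as the best a cold standby can do) are illustrative guesses, not measured values. The `Tier` class and `cheapest_tier` function are hypothetical names.

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass(frozen=True)
class Tier:
    name: str
    best_rto: timedelta  # fastest recovery this tier is assumed to deliver
    best_rpo: timedelta  # smallest data-loss window this tier is assumed to deliver

# Ordered cheapest first. Thresholds are assumptions read off the ranges in the text.
TIERS = [
    Tier("Tier 1: Cold Standby", timedelta(hours=1), timedelta(hours=1)),
    Tier("Tier 2: Warm Standby", timedelta(minutes=15), timedelta(minutes=1)),
    Tier("Tier 3: Hot Standby", timedelta(seconds=1), timedelta(seconds=1)),
    Tier("Tier 4: Active-Active", timedelta(0), timedelta(0)),
]

def cheapest_tier(rto, rpo):
    """Return the name of the first (cheapest) tier that can meet both targets."""
    for tier in TIERS:
        if tier.best_rto <= rto and tier.best_rpo <= rpo:
            return tier.name
    return TIERS[-1].name  # not reached for non-negative targets; kept as a safe default

# Targets taken from the per-system table earlier in this article.
systems = {
    "Order database": (timedelta(hours=1), timedelta(minutes=5)),
    "User accounts": (timedelta(hours=4), timedelta(hours=1)),
    "Analytics DB": (timedelta(hours=24), timedelta(hours=24)),
}

for name, (rto, rpo) in systems.items():
    print(f"{name}: {cheapest_tier(rto, rpo)}")
```

With these thresholds only the order database is pushed up to warm standby, and it is the 5-minute RPO, not the 1-hour RTO, that forces the move.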

The Plan Is Worthless Without Testing

RTO is a commitment, not a hope. The only way to know if you can actually recover in 1 hour is to practice recovering in 1 hour — regularly, under realistic conditions.

Types of recovery tests:

Tabletop exercise:
  Walk through the runbook in a meeting room
  Identify gaps in documentation and ownership
  Time: 2 hours, no infrastructure required
  Frequency: quarterly

Backup restore test:
  Restore last night's backup to a test environment
  Verify data integrity and application health
  Measure actual restore time
  Time: 2-4 hours
  Frequency: monthly

Full DR drill:
  Simulate actual disaster (production DB unavailable)
  Follow runbook under time pressure
  Measure actual RTO achievement
  Time: half day
  Frequency: twice yearly

If you’ve never actually restored from backup, your RPO is theoretical. If you’ve never timed a full recovery, your RTO is a guess.
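Measuring restore time does not need special tooling. As a minimal sketch, the Python helper below times any restore step and compares it with the RTO. The names `timed_drill` and `fake_restore` are hypothetical, and `fake_restore` only sleeps; in a real drill you would pass a function that runs your actual restore procedure.

```python
import time
from datetime import timedelta

def timed_drill(restore, rto):
    """Run `restore`, then return (elapsed, met) where met means elapsed <= rto."""
    start = time.monotonic()  # monotonic clock: unaffected by wall-clock adjustments
    restore()
    elapsed = timedelta(seconds=time.monotonic() - start)
    return elapsed, elapsed <= rto

def fake_restore():
    """Stand-in for a real restore procedure (assumption: yours goes here)."""
    time.sleep(0.05)

elapsed, met = timed_drill(fake_restore, rto=timedelta(hours=1))
print(f"restore took {elapsed.total_seconds():.2f}s, RTO met: {met}")
```

Recording the `elapsed` value from every drill gives you a measured history to compare against the target, rather than a single estimate.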

Common Failure Modes in DR Plans

Backup exists, restore never tested: Backups are corrupt, incomplete, or require software that’s no longer installed. Discovered during actual disaster.

RTO set by wishful thinking: “We can restore in 1 hour” because that sounds good, not because anyone has measured it. Actual restore time: 6 hours.

RPO mismatch with backup schedule: Claiming a 4-hour RPO with daily backups. If the backup runs at midnight and disaster strikes at 11pm, you’ve lost 23 hours of data.

Single region, single AZ backups: Backups stored in the same location as the primary. A failure of that AZ or region destroys both.

No runbook, knowledge in one person’s head: The person who knows the restore procedure is on vacation. Or left the company.
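The RPO mismatch described above is the easiest of these to catch mechanically. The sketch below assumes the simplest model: with periodic backups, the worst case is a failure just before the next backup, so up to one full interval of data is lost. The function names are hypothetical, and the model ignores how long a backup takes to complete.

```python
from datetime import timedelta

def worst_case_data_loss(backup_interval):
    """Upper bound on data loss with periodic backups.

    A failure just before the next backup loses almost a full interval.
    """
    return backup_interval

def rpo_is_achievable(claimed_rpo, backup_interval):
    """True if the backup schedule can meet the claimed RPO even in the worst case."""
    return worst_case_data_loss(backup_interval) <= claimed_rpo

# The mismatch from the text: a 4-hour RPO claimed on top of daily backups.
print(rpo_is_achievable(timedelta(hours=4), timedelta(hours=24)))  # False
# The same claim with hourly backups.
print(rpo_is_achievable(timedelta(hours=4), timedelta(hours=1)))   # True
```

A check like this can run wherever backup schedules are defined, so a claimed RPO that the schedule cannot meet is flagged before an incident rather than during one.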

Document the actual measured RTO from your last drill. If it was 4 hours and your SLA says 2 hours, you have a gap to close — not a plan to point to.