Inside a Real Disaster Recovery Day: What Actually Happens When Everything Breaks
We’ve all seen the glossy DR diagrams.
Colour‑coded tiers, neat arrows flowing between regions, “RTO 15 mins” stuck in bold above some perfectly aligned icons. It all looks calm and controlled – until the day something actually breaks.
This post is about that day.
Not a lab test, not a vendor demo, but a composite of real incidents from the world of Data Recovery and Disaster Recovery, stitched together and suitably anonymised. The goal isn’t to scare you; it’s to show you what really happens when the lights go out – and what you can do now so future‑you doesn’t hate present‑you.
02:17 – The Alert
It always starts with a noise.
Phone buzzes, smartwatch vibrates, laptop lights up on the nightstand. You know, instantly, whether it’s a “have a look in the morning” type of alert or something nastier.
This one’s the latter.
- Multiple backup jobs failing in parallel
- Unusual encryption‑like behaviour on primary workloads
- Users are suddenly unable to access a key line‑of‑business app
You don’t have the full picture yet, but the pattern smells like ransomware or a major storage/platform failure. You grab a drink, open the laptop, and start piecing it together.
02:27 – Runbooks vs Reality
On paper, this is the moment where you flip open the DR runbook, follow the pretty flowchart, and calmly progress through the steps.
In reality, it’s more like:
- VPN in, check monitoring
- Correlate what the customer is seeing with what your platforms are seeing
- Confirm whether the issue is still unfolding or has stabilised
- Decide who needs to be woken up right now, and who can sleep a bit longer
The runbook is useful – but as a guide, not a script.
You quickly find out how up‑to‑date it really is:
- Does it still reference that database that was retired last year?
- Does it assume certain people are on‑call who have since moved roles?
- Does it rely on a network path that changed after the last big refresh?
This is where the gap between the diagram and the real world starts to show.
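Some of that runbook drift can be caught before the 2 am call. Here’s a minimal sketch that diffs a runbook’s referenced systems and contacts against a current inventory and on‑call roster – all names and data structures here are illustrative assumptions, not a real tool:

```python
# Sketch: flag stale runbook references against a current inventory export.
# All system and contact names below are illustrative assumptions.

def find_stale_references(runbook_systems, live_inventory,
                          runbook_contacts, on_call_roster):
    """Return runbook entries that no longer match reality."""
    return {
        # Systems the runbook mentions that no longer exist (e.g. retired DBs)
        "missing_systems": sorted(set(runbook_systems) - set(live_inventory)),
        # Contacts the runbook expects who are no longer on call
        "stale_contacts": sorted(set(runbook_contacts) - set(on_call_roster)),
    }

# Example: a database retired last year, and an engineer who changed roles
report = find_stale_references(
    runbook_systems=["app-db-01", "legacy-db-07", "file-srv-02"],
    live_inventory=["app-db-01", "file-srv-02", "app-db-02"],
    runbook_contacts=["dana", "former-engineer"],
    on_call_roster=["dana", "lee"],
)
print(report)
# {'missing_systems': ['legacy-db-07'], 'stale_contacts': ['former-engineer']}
```

Run something like this on a schedule and the “does it still reference that retired database?” question gets answered in daylight, not mid‑incident.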
02:45 – Who’s Actually in Charge?
Disaster recovery is as much about people and decision‑making as it is about snapshots and replicas.
On the customer side, you’ll typically have some mix of:
- A technical contact who knows the environment inside out
- A manager who is responsible for the service but not the nuts and bolts
- A senior stakeholder (or three) who are very good at asking, “When will it be back?”
On your side, you’ve got:
- DR/backup engineers
- Platform or cloud specialists
- Possibly vendor support on standby if the underlying platform looks suspect
One of the first real decisions isn’t technical at all:
Who gets to decide what’s recovered first, and what does “recovered” actually mean for this incident?
If that conversation has never happened before, it’s going to happen now – at 2 am, under pressure. This is not ideal.
03:05 – Priorities, RTOs and Reality Checks
Every DR design talks about RTO (Recovery Time Objective) and RPO (Recovery Point Objective). Those numbers look very impressive on slides.
In a real incident, priorities can shift very quickly:
- An internal reporting system that was classed as “Tier 3” suddenly becomes critical because the execs want it for situational awareness
- A “non‑critical” file server turns out to hold shared templates that many other systems silently depend on
- The one tiny VM that runs a licence server stops half the estate from working
You work with the customer to agree on phased priorities:
- Phase 1 – Keep the business breathing
  - Authentication (AD/Entra ID equivalents)
  - Core line‑of‑business apps that directly generate revenue
  - Any shared infrastructure the rest of the estate relies on (DNS, DHCP, certificate services, etc.)
- Phase 2 – Stabilise operations
  - Supporting systems that make the day‑to‑day work feasible
  - Portals, ticketing, document storage, and collaboration tools
- Phase 3 – Nice‑to‑haves / long tail
  - Historical reporting, archive workloads, dev/test environments
And then you have the awkward conversation:
“The RTOs you have on paper assume a clean, controlled DR test.
Tonight, we have live ransomware/failed hardware/ongoing issues in play. We’ll aim for those numbers, but we need to expect some deviation.”
The expectations reset here are critical. Over‑promise at this point, and you’ll pay for it all the way through the incident.
03:45 – Immutable Backups, Air‑Gaps and “Oh Thank Goodness”
If this is a ransomware event, the next question is brutal:
Are the backups safe?
This is where design decisions either save the day or make it much, much worse.
Good signs:
- Backups or copies stored on immutable object storage, with retention that outlives the attack window
- No direct, standing access from compromised accounts to the backup repositories
- Separate credentials and hardened repositories for the backup infrastructure
Red flags:
- Backup servers on the same domain, same privileges, same blast radius as everything else
- No immutable storage, or immutability windows that are too short
- Backup infrastructure itself is behaving oddly (encryption, suspicious changes, or simply gone)
If the immutable backups are intact, you feel that brief, guilty surge of relief:
“Okay. This is going to be painful, but we can get them back.”
That relieved feeling is a direct result of the often‑thankless work of designing, selling, and implementing proper backup and DR architectures – the sort of work that never gets fanfare in BAU, but is absolutely everything on days like this.
04:30 – The Messy Bits You Never See in Diagrams
This is the part we don’t usually talk about in polished case studies.
1. The “mystery” dependencies
You bring up a recovered app, and it looks fine… until users try to log in and hit a brick wall.
- Some ancient SSO plugin you’ve never heard of
- A hard‑coded IP pointing at a legacy firewall
- A DNS record that was manually tweaked on someone’s laptop years ago and never documented
You end up doing forensic archaeology on a live system while the clock ticks.
2. Performance and capacity surprises
On paper, the DR design says:
- “We can run this entire estate in Site B / the cloud region for 7 days.”
In practice:
- Some workloads have grown significantly since the last full DR test
- Core storage is shared with other workloads that weren’t accounted for in the original modelling
- Network bandwidth and latency from user locations to the DR site aren’t what anyone expected
You find yourself tuning:
- Which VMs to power on first
- Which can run in reduced capacity mode (fewer app servers, tighter maintenance windows)
- Where you can trade performance for availability just to get people back to work
3. Human factors
People are tired. People are stressed.
- Someone forgets to tick a crucial box in a restore wizard
- A firewall rule is applied to the wrong group
- A miscommunication means two teams both “own” the same change and it doesn’t get done
The best DR setups account for this with checklists and peer reviews, even at silly o’clock in the morning.
06:15 – Turning the Corner
At some point, there’s a subtle shift.
You move from “everything is on fire” to “this is horrible, but under control.”
Signs you’re turning that corner:
- Priority 1 apps are back up and being used (even if sub‑optimally)
- You have a stable plan for the next phases with realistic timings
- Comms have gone from frantic to structured: regular updates, clear actions, defined owners
This is where experience really shows. Having seen a few of these, you start to get better at:
- Not chasing every small issue at the expense of the bigger picture
- Communicating clearly without either sugar‑coating or doom‑mongering
- Making sensible trade‑offs: “Yes, we could fix that now, but it would risk destabilising this. Let’s schedule it for later.”
10:00 – The Post‑Incident Hangover
By late morning, people are back in systems, tickets are flying around, and the business is trying to pretend this was all just a short “blip”.
For IT and DR teams, this is where the second half of the work begins.
1. Post‑mortem (done properly)
A good post‑incident review isn’t about blame; it’s about finding the weak points:
- Where did detection fail or arrive too late?
- Which runbooks were inaccurate or missing steps?
- Which dependencies surprised everyone?
- Where did tooling get in the way rather than help?
And crucially:
- What can we change now, before the next incident, that actually moves the needle?
2. Architecture and process changes
This is often where you finally get approval for the things you’ve been asking for:
- Immutable backups where they didn’t previously exist
- Separation of duties and tighter security for backup infrastructure
- More realistic DR tiers, instead of “everything is Tier 1 because we said so”
- Regular, full‑fat DR tests that go beyond “can we restore a VM?” into “can people actually work?”
If you’re in a service provider/vendor role, this is also when you lean on community experience – others in similar roles have likely seen the same patterns and got battle‑tested fixes and runbook improvements ready to share.
What This Has Taught Me About “Good” DR
After enough of these days, a few themes keep coming back.
1. Fancy tech is worthless without boring basics
Cool new features are great – and things like immutability, object storage, and instant recovery absolutely matter – but they’re never a substitute for:
- Up‑to‑date documentation
- Clear ownership
- Regular testing
- Sensible, implemented security around the backup and DR stack
2. People and process beat products
When everything goes sideways, it’s not the prettiest GUI that saves you; it’s:
- The person who knows where the skeletons are hidden
- The team that has rehearsed this enough times that the 2 am version is “just another day”
- The organisation that treats DR as a business capability, not an afterthought
3. RTOs and RPOs are earned, not declared
Anyone can write “RTO 15 minutes, RPO 0” on a slide.
You earn those numbers when:
- You’ve tested them regularly
- You’ve proven you can hit them in conditions that look like production
- You’ve included real‑world messiness in your scenarios – partial failures, human error, network weirdness, third‑party dependencies
A Practical DR Sanity Check You Can Do This Month
If you’ve made it this far, here are a few concrete checks you can run against your own environment, without waiting for a real disaster to do it for you.
- Pick one critical application and walk through a full recovery scenario.
  - Where would you restore it to?
  - What accounts/credentials would you need?
  - How would users reach it?
  - Who would decide it’s “good enough” to go live?
- Verify your backups are actually usable, not just “green ticks” in a console.
  - Do you have recent, tested restores for your most important services?
  - Have you tested restoring to an alternate location or platform, not just back into the comfy, known one?
- Check that your backup and DR infrastructure aren’t in the same blast radius.
  - Are the backup servers/domain/accounts hardened and separate?
  - Are you using immutable storage or equivalent safeguards where it really matters?
  - Do you have a copy of the key configuration somewhere that doesn’t disappear if the primary site does?
- Make sure decision‑makers know their role before the incident.
  - Who declares a disaster?
  - Who signs off on failing over to a DR site?
  - Who gets informed vs. who gets to decide?
- Schedule one grim, realistic DR test.
  - Not the “happy path” where everything behaves.
  - Include something awkward: a missing dependency, degraded bandwidth, a key person “unavailable”.
  - See how your team and your design respond.
Why I Still Love Doing This
For all the 2 am calls, the stress, and the messy realities, there’s something deeply satisfying about helping a business through its worst day.
When things go dark, and you can say, “It’s bad, but here’s the plan, and we’re going to get you back,” all those hours spent thinking about backup jobs, object storage, immutability, and DR runbooks suddenly feel very worthwhile.
Disaster recovery done right isn’t about perfection. It’s about being ready enough, often enough, that when everything breaks, you can still tell a story with a reasonably happy ending.
And if this post nudges you to poke at your own DR plans and ask a few uncomfortable questions, then it’s already been a good DR day.
Disclaimer: Although this post draws upon my own experiences, it was enhanced with AI to bring further realism to the recovery flow.

