Disaster Recovery Is Not a Diagram: Planning, People and Buy-In
We’ve all seen the shiny disaster recovery and business continuity decks.
Neat tiers. Arrows between regions. RPO/RTO numbers in big bold fonts. Maybe a sprinkle of buzzwords around immutable backups, zero trust, or AI-driven response for good measure.
And then a real incident happens.
Suddenly, it’s not about the pretty diagram anymore – it’s about the people, the process, and whether everyone actually agreed what “good” looked like before things broke.
This post is about that gap: why planning and stakeholder buy-in matter more than any single technology choice when you’re defining what your DR/BC solution should be.
Start with one uncomfortable question
Before you pick a product, region, or replication method, ask:
“When something really bad happens, what does ‘acceptable’ look like for this business?”
Not “what can this appliance do” or “what’s the vendor’s reference architecture.”
Instead:
- How long can key services be down before there’s real pain?
- How much data can we afford to lose and still function?
- Who decides what’s acceptable – and have they actually decided?
If you can’t point to a shared, written understanding of those answers – not just in IT, but with business stakeholders – you don’t have a DR/BC strategy. You have some technology and a set of assumptions.
Why planning matters more than product
Good planning turns “we bought some kit” into “we can recover on purpose”.
A few reasons planning is non‑negotiable:
1. Technology is configurable chaos without context
The same tool can be configured to:
- Recover a critical app in minutes, or
- Sit there happily taking copies that are useless in a real incident.
The difference is the plan:
- What’s in scope and what’s explicitly out
- Which workloads are tier 0 vs tier 3
- What the recovery sequence looks like
- Who has the authority to make trade‑offs under pressure
Without that, you’re relying on “whatever the engineer thought was sensible at the time.”
2. Planning forces prioritisation
When you run an actual planning workshop and say:
“You can’t have everything at RTO 0 and RPO 0. Pick.”
You surface all the hidden assumptions:
- Finance want near‑zero data loss.
- Operations care most about warehouse systems.
- Sales just want the CRM up first so they don’t lose deals.
- IT is thinking in terms of infrastructure layers, not business services.
That conversation only happens if you intentionally plan it. If you skip it, you end up with:
- Over‑protected low‑value systems
- Under‑protected high‑value ones
- And nobody quite sure why
3. Planning is where people learn how to behave in a crisis
The plan isn’t just a document; it’s a rehearsal in written form.
A good planning process answers:
- Who declares a disaster, and based on what?
- Who speaks to customers, regulators, and partners?
- Who is allowed to shut things down or fail over?
- Where do we communicate if email and chat are dead?
If you only discover these answers during the incident, you’re improvising – and improvisation plus stress rarely ends well.
Stakeholder buy‑in: the difference between “IT’s problem” and “our strategy”
You can have the best DR design in the world, but if it lives purely in IT, it will fail where it matters most: execution and funding.
Who actually needs to buy in?
Think of three broad groups:
- Business owners
  - Heads of departments, product owners, process owners.
  - They own the impact when things go down.
  - They must sign off on RPO/RTO, priorities, and acceptable risk.
- Executive leadership
  - They decide budgets and risk appetite.
  - They need clear options: “Here’s what you get for this level of spend, and here’s what you don’t.”
- Operational teams
  - IT, service desk, security, facilities, vendors.
  - They live with the day‑to‑day processes and on‑call realities.
  - Without their input, your runbooks will look lovely and never match reality.
What buy‑in actually looks like (and what it doesn’t)
Real buy‑in:
- They’ve been involved in workshops, not just sent a PDF.
- They can explain – in their own words – what the plan is for their area.
- They’ve signed off on recovery objectives and priorities.
- They show up for tests and treat them as business events, not “IT fire drills.”
Fake buy‑in:
- A PowerPoint was emailed around with “Any comments?” at the end.
- Silence was taken as agreement.
- RPO/RTO numbers were copied from a template, not debated.
- The first time they see the runbook is the day of the incident.
It’s never just technology: people and process decide the outcome
You absolutely need the right tech: backups that work, replication that’s tested, infrastructure that doesn’t crumble under failover.
But when you look at real‑world failures, the root causes are often:
- Nobody knew who was in charge once systems started failing.
- Conflicting decisions came from different leaders under pressure.
- Runbooks were out of date with the actual architecture.
- Vendors and partners were never included in the plan.
- Comms chaos – customers got mixed messages, or none at all.
In other words: people and process, not the product.
People
The human side of DR/BC includes:
- Clear roles: incident manager, technical lead, comms lead, business lead.
- Training: not just how to click “failover”, but how to run a war room.
- Psychology under stress: simple checklists, clear language, no blame.
- Availability: Who is actually reachable at 3 am on a bank holiday?
If your plan assumes “someone will know what to do” – they won’t.
Process
Solid processes turn your design into something usable:
- Change control: DR/BC impact is considered with every major change.
- Documentation lifecycle: updating runbooks when architecture shifts.
- Regular tests: not just technical restores, but full scenario exercises.
- Post‑incident reviews: feeding lessons back into the plan.
Good processes make your technology predictable. Bad processes make it surprising, and surprises during DR are rarely pleasant.
Defining what your DR/BC solution should look like
So how do you actually go from “we need DR” to a solution that people understand and support?
Here’s a pragmatic flow.
1. Map business services, not just servers
Don’t start with VMs, clusters, or storage tiers. Start with:
- Named business services – e.g. “Online ordering”, “Payroll”, “Manufacturing control”.
- For each service, identify:
  - Supporting apps
  - Data stores
  - Upstream/downstream dependencies
  - External partners (SaaS, payment providers, logistics, etc.)
This is what you’ll be asked about in a crisis: “When will ordering be back?” – not “Is the hypervisor up?”
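A service-first map doesn’t need a fancy tool to start – it can live in a simple structured file that anyone can review in a workshop. Here’s a minimal sketch in Python; the service names, apps, and partners are illustrative placeholders, not a real environment:

```python
# Minimal service catalogue: business services first, infrastructure second.
# All names below are illustrative examples, not a real environment.
services = {
    "online-ordering": {
        "apps": ["web-storefront", "order-api"],
        "data_stores": ["orders-db"],
        "depends_on": ["payments"],  # upstream business services
        "external_partners": ["payment-provider"],
    },
    "payments": {
        "apps": ["payment-gateway"],
        "data_stores": ["transactions-db"],
        "depends_on": [],
        "external_partners": ["acquiring-bank"],
    },
}

def recovery_order(catalogue):
    """Order services so that dependencies recover before their dependants."""
    ordered, seen = [], set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for dep in catalogue[name]["depends_on"]:
            visit(dep)
        ordered.append(name)

    for name in catalogue:
        visit(name)
    return ordered

print(recovery_order(services))  # payments recovers before online-ordering
```

Even a toy version like this makes the dependency conversation concrete: the moment someone adds a service, the recovery order changes visibly, and the business can argue about it before the incident rather than during it.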
2. Run a proper RPO/RTO and impact workshop
Get the right people in a (virtual) room:
- Service owners
- IT operations
- Security/risk
- Someone who can say “no” to unrealistic expectations
For each service:
- RTO: “If this is down for X hours/days, what happens?”
- RPO: “If we lose Y hours of data, what happens?”
- Impact: Revenue, reputation, compliance, safety.
Force trade‑offs:
- “If everything is ‘critical’, nothing is.”
- “If you want 15‑minute RPO, that has a cost – is it worth it for this service?”
Capture decisions in plain language, not just numbers.
3. Design tiers and patterns from those decisions
Now bring in the technology.
- Group services by similar RPO/RTO and criticality.
- Define a small set of standard patterns, for example:
  - Tier 0: Active/active or hot standby, RTO < 1 hour, RPO < 15 mins
  - Tier 1: Warm standby, RTO < 4 hours, RPO < 1 hour
  - Tier 2: Backup restore only, RTO < 24 hours, RPO < 24 hours
Pick technologies and configurations that implement those patterns consistently, rather than bespoke snowflakes for every system.
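Once the tiers are standardised, assigning a service to a pattern can become a mechanical lookup instead of a per-system debate. A small sketch, assuming the example tier thresholds above (the function and names are illustrative, not a product feature):

```python
# Tier thresholds taken from the example patterns above, in hours:
# (tier name, max RTO, max RPO)
TIERS = [
    ("tier-0", 1, 0.25),
    ("tier-1", 4, 1),
    ("tier-2", 24, 24),
]

def assign_tier(rto_hours, rpo_hours):
    """Return the cheapest standard tier that still meets the agreed RTO/RPO."""
    # Try the cheapest (most relaxed) tier first.
    for name, max_rto, max_rpo in reversed(TIERS):
        if rto_hours >= max_rto and rpo_hours >= max_rpo:
            return name
    # Targets tighter than any standard pattern: escalate to the top tier
    # (and, in practice, to a cost conversation with the business owner).
    return "tier-0"

print(assign_tier(rto_hours=4, rpo_hours=1))      # tier-1
print(assign_tier(rto_hours=0.5, rpo_hours=0.1))  # tier-0
```

The useful part isn’t the code – it’s the rule it encodes: a service only gets a more expensive pattern when its agreed objectives demand it, which keeps the tiering honest.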
4. Build runbooks that humans can actually follow
For each pattern and key service, document:
- Trigger: When do we invoke this, and who decides?
- Initial checklist: First 5–10 steps, in order.
- Technical steps: Clear, tested, with screenshots or commands.
- Decision points: If X fails, who can authorise plan B?
Runbooks should be written so that a tired but competent engineer can follow them at 3 am without guessing.
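If runbooks are kept as structured data rather than free-form documents, that structure can even be lightly enforced – for instance, refusing a runbook into the library until it names a decision-maker and has a real first checklist. A sketch under that assumption (field names and the example runbook are illustrative):

```python
# Illustrative runbook skeleton matching the fields described above.
runbook = {
    "service": "online-ordering",
    "trigger": {
        "condition": "primary site unreachable for more than 15 minutes",
        "decided_by": "incident manager",
    },
    "initial_checklist": [
        "Declare the incident and open the war-room bridge",
        "Confirm scope: which services and sites are affected",
        "Notify the comms lead before any customer-facing change",
    ],
    "decision_points": [
        {"if": "replica is stale", "authorised_by": "business lead"},
    ],
}

def lint_runbook(rb):
    """Return a list of problems; an empty list means the runbook passes."""
    problems = []
    if not rb.get("trigger", {}).get("decided_by"):
        problems.append("no named decision-maker for invocation")
    if len(rb.get("initial_checklist", [])) < 3:
        problems.append("initial checklist too short to be useful")
    for dp in rb.get("decision_points", []):
        if not dp.get("authorised_by"):
            problems.append("decision point without an authoriser")
    return problems

print(lint_runbook(runbook))  # [] -- this one passes
```

A lint like this won’t make a runbook good, but it catches the most common failure mode: a beautifully formatted document with no named human allowed to pull the trigger.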
5. Test like you mean it (and invite the business)
Schedule regular exercises:
- Tabletop: Walk through a scenario with key stakeholders.
- Technical: Prove you can restore and fail over within targets.
- End‑to‑end: Can users actually do their jobs in the recovered environment?
Most importantly, make business stakeholders attend at least some of these.
Nothing drives home the value (or gaps) of your DR/BC plan like watching a simulated outage play out in real time.
Bringing it all together
If you take nothing else away from this post, let it be this:
A disaster recovery and business continuity solution is a social contract, not a set of features.
It’s an agreement between:
- The business, about what’s acceptable to lose
- IT and vendors, about what’s realistically achievable
- The entire organisation, about how everyone will behave on a bad day
Technology absolutely matters. You need solid backups, resilient infrastructure, and tested failover paths.
But without planning and stakeholder buy‑in, you don’t have DR/BC – you just have infrastructure with a hopeful name.
If you’re not there yet, don’t start with another product evaluation.
Start with a whiteboard, the right people in the room, and the question:
“When the lights go out, what do we really expect to happen – and who’s willing to sign their name next to that?”
That’s where real disaster recovery and business continuity begin.

