Disaster Recovery Plan for SMBs: What to Do on Total Server Failure

Disaster Recovery, Backup, Security

A server is down. Not a single service, not one VM: the hypervisor itself has stopped responding. Employees are calling, the phone lines are busy, someone has already alerted the IT hotline, and the managing director is asking, “When will it be back?” This is the moment that shows whether a company has an emergency plan or is improvising.

This article is not a backup strategy and not a disaster recovery concept in the broader sense (we have separate articles for that). It answers a narrower question: What do you do in the first 60 minutes and the hours that follow? With concrete checklists, roles, and a restore order that has proven itself in practice.

Prerequisites: What Must Exist Before the Incident?

An emergency plan is worthless if it comes out of the cabinet for the first time during a real outage. These items must exist beforehand and be verified at least once a year:

  • Current backup, tested (not just “runs green daily”)
  • RTO and RPO defined per business process, in writing
  • Escalation list with contact info available outside business hours
  • Hardware inventory with serial numbers, service tags, warranty dates
  • License key directory in a place that is secure but reachable in an emergency
  • Emergency communication channel independent of the company mail server (Signal group, external mail, phone list)

If even one of these points is missing, the emergency plan is not yet ready for an emergency. The following phases assume this foundation is in place.

Clarify RTO and RPO — Before Anything Happens

Before we get to the 4-phase plan, every company needs to know the two key figures that drive every decision during an incident:

Metric | Meaning | SMB example
RTO (Recovery Time Objective) | Maximum acceptable outage duration | 4 hours for ERP, 24 hours for file server
RPO (Recovery Point Objective) | Maximum acceptable data loss | 1 hour for databases, 24 hours for documents

Without these numbers, the emergency plan turns into a “we work as fast as possible” effort, and afterwards management and IT argue about whether the result was acceptable. Defining the numbers in advance gives you a measuring stick.
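To make the measuring stick concrete, here is a minimal Python sketch that compares one recovery against its written targets. The names `RecoveryTarget` and `evaluate` are hypothetical, the 4-hour RTO and 1-hour RPO come from the SMB examples in the table (assuming the ERP sits on a database), and the timestamps are invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RecoveryTarget:
    """Written RTO/RPO for one business process (hypothetical structure)."""
    process: str
    rto: timedelta  # maximum acceptable outage duration
    rpo: timedelta  # maximum acceptable data loss


def evaluate(target, outage_start, last_good_backup, service_restored):
    """Compare one actual recovery against its written targets."""
    downtime = service_restored - outage_start
    data_loss = outage_start - last_good_backup
    return {
        "process": target.process,
        "rto_met": downtime <= target.rto,
        "rpo_met": data_loss <= target.rpo,
        "downtime": downtime,
        "data_loss": data_loss,
    }


# Assumed example: ERP with a 4-hour RTO and a 1-hour RPO.
erp = RecoveryTarget("ERP", rto=timedelta(hours=4), rpo=timedelta(hours=1))

# Invented timestamps, only to show the calculation.
result = evaluate(
    erp,
    outage_start=datetime(2024, 1, 15, 9, 0),
    last_good_backup=datetime(2024, 1, 15, 8, 30),
    service_restored=datetime(2024, 1, 15, 12, 15),
)
print(result["rto_met"], result["rpo_met"], result["downtime"])
```

In this sketch the data loss is measured from the last good backup to the start of the outage, which is one common reading of RPO. If your backups are verified later than they are taken, the last verified backup may be the more honest reference point.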

Phase 1: Detect (Minute 0 to 15)

The outage does not begin when IT notices it — it begins when an employee notices something is not working. Keep that gap small.

What must happen in phase 1:

  1. Confirm the outage via two independent paths: monitoring alert AND manual ping / web UI test
  2. Scope clarification: which systems exactly? Hypervisor, single VM, storage, network? Three pings to three different IPs is usually enough
  3. First entry in the incident log (time, symptom, first observation) — on paper or in a system outside the affected infrastructure
  4. First notification to the IT lead or on-call

Tools that help: A standalone monitoring stack (Zabbix, Checkmk, Uptime Kuma) running outside the main infrastructure. If your monitoring runs on the same hypervisor that fails, it goes down with it and you get no alert at all.
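The manual check and the scope question from steps 1 and 2 can be sketched in a few lines of Python. This is an illustration under assumptions: it uses TCP connection attempts instead of ICMP ping (raw ICMP typically needs elevated privileges), the function names are hypothetical, and the addresses are placeholders from the documentation range 192.0.2.0/24. Port 8006 is the default Proxmox web UI port; replace the targets with your own.

```python
import socket


def tcp_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def classify_scope(targets, probe=tcp_reachable):
    """Probe each named target and return (unreachable_names, summary)."""
    down = [name for name, (host, port) in targets.items()
            if not probe(host, port)]
    if not down:
        summary = "all probes answered: re-check the monitoring alert"
    elif len(down) == len(targets):
        summary = "nothing answers: suspect hypervisor, storage, or network"
    else:
        summary = "partial outage: " + ", ".join(down)
    return down, summary


# Placeholder targets (assumptions): three different IPs, three services.
TARGETS = {
    "hypervisor web UI": ("192.0.2.10", 8006),
    "file server SMB": ("192.0.2.20", 445),
    "core switch SSH": ("192.0.2.1", 22),
}
```

Calling `classify_scope(TARGETS)` from a laptop or a machine outside the affected hypervisor runs the real probes. The `probe` parameter exists so the logic can be tried without touching the network.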

What must NOT happen in phase 1: No repair attempts. Nobody should reboot storage because “it helped before”. Diagnose first, act second.

Phase 2: Contain (Minute 15 to 60)

A total server failure has two possible cause categories: hardware/software defect or security incident (ransomware, compromised admin accounts). Containment looks different in each case — and not knowing which one applies is the most common reason responders get it wrong.

Suspected security incident: isolate immediately

  • Physically or VLAN-isolate affected systems — disable switch port or pull cable
  • Disable WLAN for the affected site
  • Deliberately do NOT spin up backup systems immediately: they could restore an already compromised state
  • Consider forensic imaging before starting repairs (law enforcement, cyber insurance)

Suspected hardware/software defect: preserve data integrity

  • Do not hard power off if avoidable: data still held in caches could be lost
  • Collect storage logs (smartctl, IPMI, SEL log)
  • For RAID sets: read the array status first, then decide. Do not swap any disk until the replacement order is clear
  • Escalate hardware support (Dell ProSupport, HPE, Wortmann TERRA Service) with service tag

Escalation matrix — who calls whom?

Tier | Role | Example
1 | IT lead (internal) | Admin, IT manager
2 | IT service provider | DATAZONE or in-house IT
3 | Hardware vendor support | Dell, HPE, Wortmann TERRA
4 | Management | when RTO breach is imminent
5 | Cyber insurer / police | on security incident
6 | Customers, suppliers | if external communication is affected

This list must exist with phone numbers and availability windows before the incident. Nobody searches LinkedIn for the provider’s mobile number at 2 AM.
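As a sketch of how the matrix can be kept as data next to the phone list, the following Python snippet returns who to notify for a given situation. The triggers for tiers 4 to 6 follow the matrix; treating tiers 1 and 2 as "always" and tier 3 as hardware-only is an assumption of this example, and the names `Tier`, `MATRIX`, and `who_to_call` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    level: int
    role: str
    trigger: str  # situation in which this tier is notified


# Roles copied from the matrix; phone numbers deliberately left out.
MATRIX = [
    Tier(1, "IT lead (internal)", "always"),           # assumption
    Tier(2, "IT service provider", "always"),          # assumption
    Tier(3, "Hardware vendor support", "hardware"),    # assumption
    Tier(4, "Management", "rto_at_risk"),
    Tier(5, "Cyber insurer / police", "security"),
    Tier(6, "Customers, suppliers", "external_impact"),
]


def who_to_call(flags=()):
    """Return the roles to notify, in tier order, for the given flags."""
    active = {"always"} | set(flags)
    return [tier.role for tier in MATRIX if tier.trigger in active]


print(who_to_call({"security", "rto_at_risk"}))
```

The point of the structure is that the decision "who gets called" is made before the incident. During the incident, only the flags change.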

Phase 3: Restore (Hour 1 to RTO)

This is the technical part. Most emergency plans fail not on the “whether” but on the “in what order”. A wrong restore sequence can easily double the actual recovery time.

Recommended restore order:

  1. Infrastructure services first: DNS, DHCP, NTP — nothing else works cleanly without these
  2. Active Directory / domain controllers: logins, Kerberos tickets, group policies. With multiple DCs, restore the FSMO role holder first
  3. Storage / file server: SMB shares, home directories — needed before ERP because many applications have paths on shares
  4. Mail server / mail routing: so external communication works again
  5. ERP / line-of-business software: only once all dependencies above are running
  6. Secondary services: print, telephony (if VoIP), SharePoint, internal web services
  7. Workstations: only when the backbone stands — otherwise everyone runs into login errors

This order applies to most mid-market setups. In detail it can shift — e.g. when ERP has its own database VM that must come up before the ERP server.
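One way to document such an order, including site-specific shifts like a separate ERP database VM, is to write down the dependencies and let a topological sort derive a valid sequence. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later). The dependency edges are assumptions for illustration; your own environment will differ.

```python
from graphlib import TopologicalSorter

# Each key lists the services that must be running before it is restored.
# The edges are illustrative assumptions, not a universal rule.
DEPENDS_ON = {
    "infrastructure (DNS, DHCP, NTP)": set(),
    "domain controllers": {"infrastructure (DNS, DHCP, NTP)"},
    "file server": {"domain controllers"},
    "mail": {"domain controllers"},
    "ERP database VM": {"domain controllers"},
    "ERP server": {"file server", "ERP database VM"},
    "secondary services": {"ERP server", "mail"},
    "workstations": {"secondary services"},
}

# static_order() yields every service after all of its prerequisites.
order = list(TopologicalSorter(DEPENDS_ON).static_order())

for position, service in enumerate(order, start=1):
    print(position, service)
```

The sorter only guarantees that prerequisites come first. Where two services have no dependency on each other (here mail and the file server), the business priority from the list above still decides. A circular dependency raises `graphlib.CycleError`, which is a useful finding in its own right.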

Choose the restore method:

Method | When it makes sense | Typical RTO
Bare-metal restore | Hardware available, full image exists | 4-8 hours
VM restore from Proxmox Backup Server | Hypervisor running, individual VMs broken | 30 min - 2 hrs per VM
Replica failover | Replication to second site exists | 15-60 min
Cloud failover | Cloud DR site set up | 1-4 hours
Rebuild + data restore | Worst case, everything from scratch | Days

A working Proxmox Backup Server and TrueNAS replication typically put you in the second or third row (VM restore or replica failover), well below most SMB RTOs.
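The table can also be read as a filter: given the RTO of a process and the methods you have actually prepared, which options fit? The following Python sketch uses the typical ranges from the table and compares the upper bound against the RTO. The function name `methods_within` is hypothetical, the rebuild row is left out because "days" gives no usable number, and the VM restore time is per VM, so several VMs restored in sequence add up.

```python
from datetime import timedelta

# (method, typical lower bound, typical upper bound) from the table.
METHODS = [
    ("Replica failover", timedelta(minutes=15), timedelta(minutes=60)),
    ("VM restore from Proxmox Backup Server",
     timedelta(minutes=30), timedelta(hours=2)),  # per VM
    ("Cloud failover", timedelta(hours=1), timedelta(hours=4)),
    ("Bare-metal restore", timedelta(hours=4), timedelta(hours=8)),
]


def methods_within(rto, available):
    """Prepared methods whose typical upper bound still fits the RTO."""
    return [name for name, _low, high in METHODS
            if name in available and high <= rto]


prepared = {"Replica failover", "Cloud failover", "Bare-metal restore"}
print(methods_within(timedelta(hours=4), prepared))
```

Comparing against the upper bound is the conservative choice. The numbers are typical values, not guarantees, so a real plan should replace them with times measured in your own restore tests.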

During the restore:

  • Brief status update every 30-60 minutes in the defined communication channel
  • Write restore logs — what was restored from where and when
  • Before production release, at least one smoke test per system: login, a few typical actions
  • Only then bring employees back to the system
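The restore log and the smoke-test gate can be combined in a small structure. This Python sketch is an illustration under assumptions: the class `RestoreEntry`, the function `releasable`, and the system and source names are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class RestoreEntry:
    system: str
    source: str  # where the data came from, e.g. a backup snapshot
    smoke_test_passed: bool
    restored_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))


def releasable(log):
    """Systems whose most recent log entry has a passed smoke test."""
    latest = {}
    for entry in log:  # later entries overwrite earlier ones
        latest[entry.system] = entry
    return sorted(name for name, entry in latest.items()
                  if entry.smoke_test_passed)


# Hypothetical log: DC01 failed its first smoke test, then passed.
log = [
    RestoreEntry("DC01", "backup snapshot (nightly)", False),
    RestoreEntry("DC01", "backup snapshot (nightly)", True),
    RestoreEntry("ERP", "replica at second site", False),
]
print(releasable(log))
```

Keeping failed attempts in the log instead of overwriting them pays off later, because the post-mortem can see how many tries each system needed.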

Phase 4: Learn (24 to 72 Hours After Restore)

This phase is the one SMBs skip most often — and it is the most important. Without a post-mortem, the same mistake happens again.

Post-mortem meeting with all involved parties:

  • IT, management, affected business departments, external provider if applicable
  • Timebox: 1-2 hours, no longer
  • No place for blame — focus is process improvement, not punishment
  • Outputs: written action items with owners and deadlines

Structured walkthrough — what happened when?

Time | What happened? | What could have gone better?
T+0 (outage) | first detection via monitoring | Could an employee have noticed first? Was monitoring late?
T+15min | escalation to on-call | Did on-call respond quickly?
T+45min | diagnosis complete | Was diagnosis tooling available?
T+2h | first restore begins | Was the restore path clear?
T+RTO | last system back | RTO met? If not: why not?
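For the walkthrough it helps to see where the time actually went. The following Python sketch computes the duration between consecutive timeline entries and picks the longest one. The offsets follow the sample timeline; the final 4-hour mark is an assumption (the example RTO for ERP), and the function name `phase_durations` is hypothetical.

```python
from datetime import timedelta

# Offsets from the start of the outage, taken from the sample timeline.
timeline = [
    (timedelta(0), "first detection via monitoring"),
    (timedelta(minutes=15), "escalation to on-call"),
    (timedelta(minutes=45), "diagnosis complete"),
    (timedelta(hours=2), "first restore begins"),
    (timedelta(hours=4), "last system back"),  # assumed value for T+RTO
]


def phase_durations(events):
    """Duration between each pair of consecutive timeline events."""
    return [(f"{earlier[1]} -> {later[1]}", later[0] - earlier[0])
            for earlier, later in zip(events, events[1:])]


durations = phase_durations(timeline)
longest = max(durations, key=lambda item: item[1])

for label, duration in durations:
    print(f"{duration}  {label}")
print("longest step:", longest[0])
```

In this example the restore itself is the longest step, but the 75 minutes between finished diagnosis and the first restore are the more interesting question for the meeting.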

Typical findings from real post-mortems:

  • Monitoring detected the outage too late (e.g. heartbeat was not an end-to-end test)
  • Restore order was not documented — order was decided in the moment
  • Nobody knew where the backup encryption passphrase was stored
  • Employees had no information on how to communicate during the outage (mail was down too)
  • Spare hardware was not in stock — lead time extended RTO by days

Each finding turns into a concrete action: monitoring extension, runbook update, emergency mail group, spare hardware on shelf.

Sample Checklist: First 60 Minutes of Total Server Failure

This checklist belongs in every server cabinet — laminated, with current numbers:

[ ] T+0:    Symptom recorded (time, what does not work)
[ ] T+5:    Monitoring check, manual ping check
[ ] T+10:   Scope clarified: hypervisor / storage / network / single VM
[ ] T+15:   IT on-call informed
[ ] T+15:   Security or hardware incident? Decision made
[ ] T+20:   Security: network isolation; hardware: collect logs
[ ] T+30:   Management informed, RTO status discussed
[ ] T+45:   Provider / vendor support contacted
[ ] T+60:   Recovery plan in place, first steps begin
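The same checklist can be kept as data, for example to show during a drill which items are overdue. This is a minimal Python sketch; the function name `overdue` is hypothetical, and the laminated paper version remains the primary copy because it works without any running system.

```python
# (due minute, item) copied from the checklist.
CHECKLIST = [
    (0, "Symptom recorded (time, what does not work)"),
    (5, "Monitoring check, manual ping check"),
    (10, "Scope clarified: hypervisor / storage / network / single VM"),
    (15, "IT on-call informed"),
    (15, "Security or hardware incident? Decision made"),
    (20, "Security: network isolation; hardware: collect logs"),
    (30, "Management informed, RTO status discussed"),
    (45, "Provider / vendor support contacted"),
    (60, "Recovery plan in place, first steps begin"),
]


def overdue(elapsed_minutes, done):
    """Items that were due by now but are not ticked off yet."""
    return [item for due, item in CHECKLIST
            if due <= elapsed_minutes and item not in done]


ticked = {
    "Symptom recorded (time, what does not work)",
    "Monitoring check, manual ping check",
    "Scope clarified: hypervisor / storage / network / single VM",
}
for item in overdue(20, ticked):
    print("OVERDUE:", item)
```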

What DATAZONE Maintains for Customers

As part of our DATAZONE Control managed services, we maintain for customers:

  • Current emergency plans as living documents
  • Escalation lists with verified availability
  • Restore tests at least annually on a test VM
  • Backup validation, not just job status
  • 24/7 on-call with defined response time

In practice we regularly see emergency plans that are formally well documented but fail in the real event: because a phone number is outdated, because the restore was never tested under time pressure, or because nobody has the restore order in their head. The annual exercise costs a few hours and can cut the actual recovery time several times over.

Conclusion

A total server failure is not the end of a company if an emergency plan exists, has been rehearsed, and comes out of the cabinet during the real incident. The four phases (detect, contain, restore, learn) are not a theoretical structure but an order that works under stress. The most important investment is not the next backup tool, but the annual drill with a real restore under time pressure.

If you want to test how robust your own emergency plan is, three questions get you far:

  1. Where is the current escalation list, and when were the numbers last verified?
  2. In which order are systems restored — and who decides?
  3. When was the last real restore tested on separate hardware?

If any of these questions produces an “I don’t know”, there is homework to do. We are happy to help — before the real incident demands the homework.
