A server is down. Not a single service, not one VM: the hypervisor itself has stopped responding. Employees are calling, phone lines are busy, someone has already alerted the IT hotline, and the managing director is asking, “When will it be back?” This is the moment that shows whether a company has an emergency plan or whether it improvises.

This article is not a backup strategy and not a disaster recovery concept in the broader sense (we have separate articles for that). It answers a narrower question: What do you do in the first 60 minutes and the hours that follow? With concrete checklists, roles, and a restore order that has proven itself in practice.

Prerequisites: What Must Exist Before the Incident?

An emergency plan is worthless if it comes out of the cabinet for the first time during a real outage. These items must exist beforehand and be verified at least once a year:

Current backup, tested (not just “runs green daily”)
RTO and RPO defined per business process, in writing
Escalation list with contact info available outside business hours
Hardware inventory with serial numbers, service tags, warranty dates
License key directory in a place that is secure but reachable in an emergency
Emergency communication channel independent of the company mail server (Signal group, external mail, phone list)

If even one of these points is missing, the emergency plan is not yet ready for an emergency. The following phases assume this foundation is in place.

Clarify RTO and RPO — Before Anything Happens

Before we get to the 4-phase plan, every company needs to know the two key figures that drive every decision during an incident:

Metric	Meaning	SMB example
RTO (Recovery Time Objective)	Maximum acceptable outage duration	4 hours for ERP, 24 hours for file server
RPO (Recovery Point Objective)	Maximum acceptable data loss	1 hour for databases, 24 hours for documents

Without these numbers, the emergency plan boils down to “we work as fast as possible”, and afterwards management and IT argue about whether the result was acceptable. Defining the numbers in advance gives both sides a yardstick.
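One way to keep these figures usable during an incident is to store them in machine-readable form. The following is a minimal sketch, not a definitive implementation. The names `RecoveryTarget`, `TARGETS`, and `evaluate` are hypothetical, the values are the SMB examples from the table, and mapping the 1-hour database RPO to the ERP system is an assumption made for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryTarget:
    """RTO and RPO for one business process (illustrative structure)."""
    process: str
    rto: timedelta  # maximum acceptable outage duration
    rpo: timedelta  # maximum acceptable data loss


# Example values from the table above; replace with your own agreed figures.
TARGETS = {
    "erp": RecoveryTarget("erp", rto=timedelta(hours=4), rpo=timedelta(hours=1)),
    "file_server": RecoveryTarget(
        "file_server", rto=timedelta(hours=24), rpo=timedelta(hours=24)
    ),
}


def evaluate(process: str, outage: timedelta, data_loss: timedelta) -> dict:
    """Compare an actual outage and data loss against the agreed targets."""
    target = TARGETS[process]
    return {
        "rto_met": outage <= target.rto,
        "rpo_met": data_loss <= target.rpo,
    }
```

After an incident, `evaluate("erp", outage, data_loss)` turns the discussion about "was this acceptable?" into a comparison against numbers that were agreed in writing.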

Phase 1: Detect (Minute 0 to 15)

The outage does not begin when IT notices it — it begins when an employee notices something is not working. Keep that gap small.

What must happen in phase 1:

Confirm the outage via two independent paths: monitoring alert AND manual ping / web UI test
Clarify the scope: which systems exactly? Hypervisor, single VM, storage, network? Three pings to three different IPs are usually enough
First entry in the incident log (time, symptom, first observation) — on paper or in a system outside the affected infrastructure
First notification to the IT lead or on-call
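The scope check from the list above can be sketched in a few lines of Python. This is an assumption-laden example: it uses a TCP connect instead of an ICMP ping (ICMP needs raw sockets and elevated rights), the addresses are documentation placeholders, and the ports (8006 for a Proxmox web UI, 443 elsewhere) as well as the names `tcp_reachable`, `check_scope`, and `summarize` are hypothetical.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers timeouts, refused connections, unreachable hosts
        return False


# Placeholder targets: replace with your own addresses and ports.
TARGETS = {
    "hypervisor": ("192.0.2.10", 8006),
    "storage": ("192.0.2.20", 443),
    "gateway": ("192.0.2.1", 443),
}


def check_scope(targets: dict, probe=tcp_reachable) -> dict:
    """Probe every target once and return name -> reachable."""
    return {name: probe(host, port) for name, (host, port) in targets.items()}


def summarize(results: dict) -> str:
    """Turn probe results into a one-line entry for the incident log."""
    down = sorted(name for name, up in results.items() if not up)
    if not down:
        return "all targets reachable"
    if len(down) == len(results):
        return "all targets down: suspect network or site-wide failure"
    return "down: " + ", ".join(down)
```

Run it from a machine outside the affected infrastructure, for example `print(summarize(check_scope(TARGETS)))`, and copy the result into the incident log. It complements the monitoring alert as a second path; it does not replace it.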

Tools that help: A standalone monitoring stack (Zabbix, Checkmk, Uptime Kuma) running outside the main infrastructure. If your monitoring runs on the same hypervisor that fails, it goes down with it and you are not alerted at all.

What must NOT happen in phase 1: No repair attempts. Nobody should reboot storage because “it helped before”. Diagnose first, act second.

Phase 2: Contain (Minute 15 to 60)

A total server failure has two possible cause categories: hardware/software defect or security incident (ransomware, compromised admin accounts). Containment looks different in each case — and not knowing which one applies is the most common reason responders get it wrong.

Suspected security incident: isolate immediately

Physically or VLAN-isolate affected systems — disable switch port or pull cable
Disable WLAN for the affected site
Deliberately do NOT bring up backup systems immediately: the restored data could itself be compromised
Consider forensic imaging before starting repairs (law enforcement, cyber insurance)

Suspected hardware/software defect: preserve data integrity

Do not hard power off if avoidable: data still held in write caches could be lost
Collect storage logs (smartctl, IPMI, SEL log)
For RAID sets: read the array status first, then decide. Do not swap any disk until it is clear which disk failed and in what order disks must be replaced
Escalate hardware support (Dell ProSupport, HPE, Wortmann TERRA Service) with service tag

Escalation matrix — who calls whom?

Tier	Role	Example
1	IT lead (internal)	Admin, IT manager
2	IT service provider	DATAZONE or in-house IT
3	Hardware vendor support	Dell, HPE, Wortmann TERRA
4	Management	when RTO breach is imminent
5	Cyber insurer / police	on security incident
6	Customers, suppliers	if external communication is affected

This list must exist with phone numbers and availability windows before the incident. Nobody searches LinkedIn for the provider’s mobile number at 2 AM.
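As a sketch of how the matrix could be kept as data, the following Python example selects who to call from a set of incident flags. The conditions for tiers 4 to 6 come from the table; the rule that tiers 1 and 2 are always called and that tier 3 applies to hardware incidents is an assumption, as are the names `Contact`, `MATRIX`, and `who_to_call`. Phone numbers are deliberately left out and belong in your own list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contact:
    tier: int
    role: str
    trigger: str  # flag that must be active for this tier to be called


MATRIX = [
    Contact(1, "IT lead (internal)", "always"),         # assumption: always called
    Contact(2, "IT service provider", "always"),        # assumption: always called
    Contact(3, "Hardware vendor support", "hardware"),  # assumption: hardware only
    Contact(4, "Management", "rto_at_risk"),
    Contact(5, "Cyber insurer / police", "security"),
    Contact(6, "Customers, suppliers", "external_comms"),
]


def who_to_call(flags: set) -> list:
    """Return the roles to contact, in tier order, for the active flags."""
    active = set(flags) | {"always"}
    ordered = sorted(MATRIX, key=lambda contact: contact.tier)
    return [contact.role for contact in ordered if contact.trigger in active]
```

For a suspected ransomware case with an imminent RTO breach, `who_to_call({"security", "rto_at_risk"})` returns the IT lead, the service provider, management, and the cyber insurer or police, in that order.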

Phase 3: Restore (Hour 1 to RTO)

This is the technical part. Most emergency plans fail not on the “whether” but on the “in what order”. A wrong restore sequence easily doubles the time of the actual recovery.

Recommended restore order:

Infrastructure services first: DNS, DHCP, NTP — nothing else works cleanly without these
Active Directory / domain controllers: logins, Kerberos tickets, group policies. With multiple DCs, the FSMO holder first
Storage / file server: SMB shares, home directories — needed before ERP because many applications have paths on shares
Mail server / mail routing: so external communication works again
ERP / line-of-business software: only once all dependencies above are running
Secondary services: print, telephony (if VoIP), SharePoint, internal web services
Workstations: only when the backbone stands — otherwise everyone runs into login errors

This order applies to most mid-market setups. In detail it can shift — e.g. when ERP has its own database VM that must come up before the ERP server.
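A restore order like the one above can be derived from a dependency map rather than memorized. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+). The service names and dependencies in `DEPENDS_ON` are illustrative assumptions, including a separate ERP database VM as in the example just mentioned; your own map will differ.

```python
from graphlib import TopologicalSorter

# service -> set of services that must be running first (illustrative)
DEPENDS_ON = {
    "dns_dhcp_ntp": set(),
    "domain_controller": {"dns_dhcp_ntp"},
    "file_server": {"domain_controller"},
    "mail": {"domain_controller"},
    "erp_database": {"domain_controller"},
    "erp": {"file_server", "erp_database"},
    "secondary_services": {"domain_controller"},
    "workstations": {"file_server", "mail", "erp", "secondary_services"},
}


def restore_order(depends_on: dict) -> list:
    """Return one valid restore sequence; raises CycleError on circular deps."""
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    for step, service in enumerate(restore_order(DEPENDS_ON), start=1):
        print(f"{step}. {service}")
```

Several sequences can be valid for the same map; the sorter only guarantees that every service comes after its dependencies. The benefit is that the order is written down and checked for cycles before the incident, not decided in the moment.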

Choose the restore method:

Method	When it makes sense	Typical RTO
Bare-metal restore	Hardware available, full image exists	4-8 hours
VM restore from Proxmox Backup Server	Hypervisor running, individual VMs broken	30 min - 2 hrs per VM
Replica failover	Replication to second site exists	15-60 min
Cloud failover	Cloud DR site set up	1-4 hours
Rebuild + data restore	Worst case, everything from scratch	Days

A working Proxmox Backup Server and TrueNAS replication typically put you in the second or third row (VM restore or replica failover), well below most SMB RTOs.
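The choice of method can be framed as a filter over the table: which methods have their prerequisites in place and whose typical upper bound still fits into the remaining time? The following sketch assumes hypothetical prerequisite flags (`replica`, `hypervisor_up`, `cloud_dr`, `hardware_and_image`) and uses the upper ends of the typical ranges from the table. It leaves out "rebuild + data restore" because the table gives no bounded duration for it.

```python
from datetime import timedelta

# (method, prerequisite flag, upper end of the typical range from the table)
METHODS = [
    ("replica failover", "replica", timedelta(minutes=60)),
    ("vm restore from backup server", "hypervisor_up", timedelta(hours=2)),  # per VM
    ("cloud failover", "cloud_dr", timedelta(hours=4)),
    ("bare-metal restore", "hardware_and_image", timedelta(hours=8)),
]


def candidates(available: set, time_left: timedelta) -> list:
    """Return methods whose prerequisite exists and whose upper bound fits."""
    return [
        name
        for name, prerequisite, upper_bound in METHODS
        if prerequisite in available and upper_bound <= time_left
    ]
```

With a running hypervisor, spare hardware, and four hours left until the RTO, `candidates({"hypervisor_up", "hardware_and_image"}, timedelta(hours=4))` returns only the VM restore. The figures are typical values, not guarantees, so treat the result as a starting point for the decision.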

During the restore:

Brief status update every 30-60 minutes in the defined communication channel
Write restore logs — what was restored from where and when
Before production release, at least one smoke test per system: login, a few typical actions
Only then bring employees back to the system
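The restore log and the smoke-test gate can be combined in a very small tool. This is a minimal sketch, assuming a JSON Lines file on a system outside the affected infrastructure; the function names `log_restore` and `ready_for_users` and the field names are hypothetical.

```python
import json
from datetime import datetime, timezone


def log_restore(path: str, system: str, source: str, smoke_test_passed: bool) -> dict:
    """Append one restore log entry: what was restored from where, and when."""
    entry = {
        "time": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "system": system,
        "restored_from": source,
        "smoke_test_passed": smoke_test_passed,
    }
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")
    return entry


def ready_for_users(path: str) -> bool:
    """True only if the latest entry of every logged system passed its smoke test."""
    latest = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            entry = json.loads(line)
            latest[entry["system"]] = entry["smoke_test_passed"]
    return bool(latest) and all(latest.values())
```

An append-only file keeps the full history, including failed attempts, which is useful material for the post-mortem. Note that `ready_for_users` only knows about systems that were logged; it cannot tell whether a system is still missing from the restore.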

Phase 4: Learn (24 to 72 Hours After Restore)

This phase is the one SMBs skip most often — and it is the most important. Without a post-mortem, the same mistake happens again.

Post-mortem meeting with all involved parties:

IT, management, affected business departments, external provider if applicable
Timebox: 1-2 hours, no longer
No place for blame — focus is process improvement, not punishment
Outputs: written action items with owners and deadlines

Structured walkthrough — what happened when?

Time	What happened?	What could have gone better?
T+0 (outage)	first detection via monitoring	Could an employee have noticed first — was monitoring late?
T+15min	escalation to on-call	Did on-call respond quickly?
T+45min	diagnosis complete	Was diagnosis tooling available?
T+2h	first restore begins	Was the restore path clear?
T+RTO	last system back	RTO met? If not: why not?
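If the incident log contains timestamps, the walkthrough table can be filled in mechanically. The sketch below assumes a chronological list of `(label, datetime)` pairs taken from the log; `phase_durations` and `rto_met` are hypothetical helper names.

```python
from datetime import datetime, timedelta


def phase_durations(timeline: list) -> list:
    """Return (transition, duration) for each pair of consecutive events.

    timeline: chronological list of (label, datetime) tuples.
    """
    return [
        (f"{label_a} -> {label_b}", time_b - time_a)
        for (label_a, time_a), (label_b, time_b) in zip(timeline, timeline[1:])
    ]


def rto_met(timeline: list, rto: timedelta) -> bool:
    """Compare the time from first to last event against the agreed RTO."""
    return timeline[-1][1] - timeline[0][1] <= rto
```

The durations show where the time actually went, for example between diagnosis and the first restore, which tends to be a more productive discussion in the meeting than the total alone.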

Typical findings from real post-mortems:

Monitoring detected the outage too late (e.g. heartbeat was not an end-to-end test)
Restore order was not documented — order was decided in the moment
Nobody knew where the backup encryption passphrase was stored
Employees had no information on how to communicate during the outage (mail was down too)
Spare hardware was not in stock — lead time extended RTO by days

Each finding turns into a concrete action: monitoring extension, runbook update, emergency mail group, spare hardware on shelf.

Sample Checklist: First 60 Minutes of Total Server Failure

This checklist belongs in every server cabinet — laminated, with current numbers:

[ ] T+0:    Symptom recorded (time, what does not work)
[ ] T+5:    Monitoring check, manual ping check
[ ] T+10:   Scope clarified: hypervisor / storage / network / single VM
[ ] T+15:   IT on-call informed
[ ] T+15:   Security or hardware incident? Decision made
[ ] T+20:   Security: network isolation; hardware: collect logs
[ ] T+30:   Management informed, RTO status discussed
[ ] T+45:   Provider / vendor support contacted
[ ] T+60:   Recovery plan in place, first steps begin

What DATAZONE Maintains for Customers

As part of our DATAZONE Control managed services, we maintain for customers:

Current emergency plans as living documents
Escalation lists with verified availability
Restore tests at least annually on a test VM
Backup validation, not just job status
24/7 on-call with defined response time

In practice we regularly see emergency plans that are formally well documented but fail in the real event: because a phone number is outdated, because the restore was never tested under time pressure, or because nobody has the restore order in their head. The annual exercise costs a few hours and can shorten the real-world recovery time several times over.

Conclusion

A total server failure is not the end of a company — if an emergency plan exists, has been rehearsed, and comes out of the cabinet during the real incident. The four phases detect, contain, restore, learn are not a theoretical structure but an order that works under stress. The most important investment is not the next backup tool, but the annual drill with a real restore under time pressure.

If you want to test how robust your own emergency plan is, three questions get you far:

Where is the current escalation list, and when were the numbers last verified?
In which order are systems restored — and who decides?
When was the last real restore tested on separate hardware?

If any of these questions produces an “I don’t know”, there is homework to do. We are happy to help — before the real incident demands the homework.

Disaster Recovery Plan for SMBs: What to Do on Total Server Failure
