A server is down. Not a single service, not one VM: the hypervisor itself has stopped responding. Employees are calling, phone lines are busy, someone has already alerted the IT hotline, and the managing director is asking, “When will it be back?” This is the moment that shows whether a company has an emergency plan or whether it improvises.
This article is not a backup strategy and not a disaster recovery concept in the broader sense (we have separate articles for that). It answers a narrower question: What do you do in the first 60 minutes and the hours that follow? With concrete checklists, roles, and a restore order that has proven itself in practice.
Prerequisites: What Must Exist Before the Incident?
An emergency plan is worthless if it comes out of the cabinet for the first time during a real outage. These items must exist beforehand and be verified at least once a year:
- Current backup, tested (not just “runs green daily”)
- RTO and RPO defined per business process, in writing
- Escalation list with contact info available outside business hours
- Hardware inventory with serial numbers, service tags, warranty dates
- License key directory in a place that is secure but reachable in an emergency
- Emergency communication channel independent of the company mail server (Signal group, external mail, phone list)
If even one of these points is missing, the emergency plan is not yet ready for an emergency. The following phases assume this foundation is in place.
Clarify RTO and RPO — Before Anything Happens
Before we get to the 4-phase plan, every company needs to know the two key figures that drive every decision during an incident:
| Metric | Meaning | SMB example |
|---|---|---|
| RTO (Recovery Time Objective) | Maximum acceptable outage duration | 4 hours for ERP, 24 hours for file server |
| RPO (Recovery Point Objective) | Maximum acceptable data loss | 1 hour for databases, 24 hours for documents |
Without these numbers, the emergency plan turns into “we work as fast as possible”, and afterwards management and IT argue about whether the result was acceptable. Defining the numbers in advance gives both sides a yardstick.
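To show how the two figures turn into something you can check during an incident, here is a minimal Python sketch. The values come from the table above; assigning the 1-hour database RPO to the ERP system is an assumption made for illustration, and the names `RecoveryTarget`, `rto_remaining` and `rpo_met` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RecoveryTarget:
    """RTO/RPO for one business process, as agreed in writing."""
    name: str
    rto: timedelta  # maximum acceptable outage duration
    rpo: timedelta  # maximum acceptable data loss, expressed as time


# Example values from the table; ERP is assumed to be database-backed.
TARGETS = {
    "erp": RecoveryTarget("ERP", rto=timedelta(hours=4), rpo=timedelta(hours=1)),
    "fileserver": RecoveryTarget("File server", rto=timedelta(hours=24), rpo=timedelta(hours=24)),
}


def rto_remaining(target, outage_start, now):
    """Time left before the RTO is breached (negative once breached)."""
    return target.rto - (now - outage_start)


def rpo_met(target, last_good_backup, outage_start):
    """True if the newest usable backup is recent enough for the RPO."""
    return outage_start - last_good_backup <= target.rpo


start = datetime(2024, 1, 15, 9, 0)
print(rto_remaining(TARGETS["erp"], start, datetime(2024, 1, 15, 11, 30)))  # 1:30:00
```

A negative result from `rto_remaining` is the signal that management needs to be in the loop.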
Phase 1: Detect (Minute 0 to 15)
The outage does not begin when IT notices it — it begins when an employee notices something is not working. Keep that gap small.
What must happen in phase 1:
- Confirm the outage via two independent paths: monitoring alert AND manual ping / web UI test
- Scope clarification: which systems exactly? Hypervisor, single VM, storage, network? Three pings to three different IPs is usually enough
- First entry in the incident log (time, symptom, first observation) — on paper or in a system outside the affected infrastructure
- First notification to the IT lead or on-call
Tools that help: A standalone monitoring stack (Zabbix, Checkmk, Uptime Kuma) running outside the main infrastructure. If your monitoring runs on the same hypervisor that fails, it goes down with it and you are never alerted to the outage.
What must NOT happen in phase 1: No repair attempts. Nobody should reboot storage because “it helped before”. Diagnose first, act second.
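The scope clarification in this phase can be scripted from a machine outside the affected infrastructure. The sketch below is one possible approach, not a replacement for your monitoring: it uses TCP connection attempts rather than ICMP pings, and the addresses in the usage note are placeholders from the documentation range.

```python
import socket


def tcp_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, timed out, DNS failure
        return False


def classify_scope(targets, probe=tcp_reachable):
    """Probe each target and summarise how wide the outage is.

    targets maps a label to a (host, port) tuple. The probe function is
    injectable so the logic can be tested without a network.
    """
    results = {label: probe(host, port) for label, (host, port) in targets.items()}
    down = sorted(label for label, ok in results.items() if not ok)
    if not down:
        scope = "no outage confirmed"
    elif len(down) == len(results):
        scope = "everything unreachable: suspect hypervisor, storage or network"
    else:
        scope = "partial outage: " + ", ".join(down)
    return scope, results
```

A call could look like `classify_scope({"hypervisor": ("192.0.2.10", 8006), "erp_vm": ("192.0.2.20", 443), "gateway": ("192.0.2.1", 443)})`. The result belongs in the incident log and is a diagnosis only, not a trigger for repair attempts.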
Phase 2: Contain (Minute 15 to 60)
A total server failure has two possible cause categories: hardware/software defect or security incident (ransomware, compromised admin accounts). Containment looks different in each case — and not knowing which one applies is the most common reason responders get it wrong.
Suspected security incident: isolate immediately
- Physically or VLAN-isolate affected systems — disable switch port or pull cable
- Disable WLAN for the affected site
- Deliberately do NOT spin up backup systems immediately: they could restore data that is already compromised or become a target themselves
- Consider forensic imaging before starting repairs (law enforcement, cyber insurance)
Suspected hardware/software defect: preserve data integrity
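- Do not hard power off if avoidable: data still held in caches could be lost
- Collect hardware and storage logs (SMART data via smartctl, IPMI system event log)
- For RAID sets: read the array status first, then decide. No disk swap until it is clear which disk failed and in what order disks are replaced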
- Escalate hardware support (Dell ProSupport, HPE, Wortmann TERRA Service) with service tag
Escalation matrix — who calls whom?
| Tier | Role | Example |
|---|---|---|
| 1 | IT lead (internal) | Admin, IT manager |
| 2 | IT service provider | DATAZONE or in-house IT |
| 3 | Hardware vendor support | Dell, HPE, Wortmann TERRA |
| 4 | Management | when RTO breach is imminent |
| 5 | Cyber insurer / police | on security incident |
| 6 | Customers, suppliers | if external communication is affected |
This list must exist with phone numbers and availability windows before the incident. Nobody searches LinkedIn for the provider’s mobile number at 2 AM.
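If the matrix is also kept in machine-readable form, the call list for a given situation can be derived from it. The following Python sketch assumes hypothetical condition labels (`hardware`, `rto_at_risk`, `security`, `external_impact`) and placeholder phone numbers; the source table only states conditions for tiers 4 to 6, so treating tiers 1 and 2 as "always" and tier 3 as hardware-only is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contact:
    tier: int
    role: str
    phone: str                 # placeholder values; fill in verified numbers
    condition: str = "always"  # when this tier is called


ESCALATION = [
    Contact(1, "IT lead (internal)", "+00 000 0001"),
    Contact(2, "IT service provider", "+00 000 0002"),
    Contact(3, "Hardware vendor support", "+00 000 0003", condition="hardware"),
    Contact(4, "Management", "+00 000 0004", condition="rto_at_risk"),
    Contact(5, "Cyber insurer / police", "+00 000 0005", condition="security"),
    Contact(6, "Customers, suppliers", "+00 000 0006", condition="external_impact"),
]


def call_list(active_conditions):
    """Contacts to notify, in tier order, for the given set of conditions."""
    active = set(active_conditions) | {"always"}
    return [c for c in sorted(ESCALATION, key=lambda c: c.tier) if c.condition in active]


for contact in call_list({"security", "rto_at_risk"}):
    print(contact.tier, contact.role, contact.phone)
```

The printed version still belongs on paper next to the checklist, because a script on a failed server helps nobody.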
Phase 3: Restore (Hour 1 to RTO)
This is the technical part. Most emergency plans fail not on the “whether” but on the “in what order”. A wrong restore sequence can easily double the actual recovery time.
Recommended restore order:
1. Infrastructure services first: DNS, DHCP, NTP. Nothing else works cleanly without these
2. Active Directory / domain controllers: logins, Kerberos tickets, group policies. With multiple DCs, the FSMO holder first
3. Storage / file server: SMB shares, home directories. Needed before ERP because many applications have paths on shares
4. Mail server / mail routing: so external communication works again
5. ERP / line-of-business software: only once all dependencies above are running
6. Secondary services: print, telephony (if VoIP), SharePoint, internal web services
7. Workstations: only when the backbone stands, otherwise everyone runs into login errors
This order applies to most mid-market setups. In detail it can shift — e.g. when ERP has its own database VM that must come up before the ERP server.
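Where the dependencies of your own environment differ from this list, a valid sequence can be derived from them instead of being decided in the moment. The sketch below uses a topological sort from the Python standard library; the dependency map is an illustrative assumption, not a description of any particular setup. A topological sort yields one order that respects all dependencies, but it does not encode business priority such as "mail before ERP".

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each system maps to the set of systems that must be running before it.
# Hypothetical example, including a separate ERP database VM.
DEPENDS_ON = {
    "dns_dhcp_ntp": set(),
    "domain_controller": {"dns_dhcp_ntp"},
    "file_server": {"domain_controller"},
    "mail": {"domain_controller"},
    "erp_database": {"domain_controller"},
    "erp": {"file_server", "erp_database"},
    "print_voip_web": {"domain_controller"},
    "workstations": {"erp", "mail", "file_server"},
}


def restore_order(depends_on):
    """Return one valid restore sequence.

    Raises graphlib.CycleError if the dependencies are circular,
    which is itself a finding worth fixing before an incident.
    """
    return list(TopologicalSorter(depends_on).static_order())


for step, system in enumerate(restore_order(DEPENDS_ON), start=1):
    print(step, system)
```

Keeping the map in the runbook and regenerating the order after every infrastructure change is one way to keep the documented sequence current.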
Choose the restore method:
| Method | When it makes sense | Typical RTO |
|---|---|---|
| Bare-metal restore | Hardware available, full image exists | 4-8 hours |
| VM restore from Proxmox Backup Server | Hypervisor running, individual VMs broken | 30 min - 2 hrs per VM |
| Replica failover | Replication to second site exists | 15-60 min |
| Cloud failover | Cloud DR site set up | 1-4 hours |
| Rebuild + data restore | Worst case, everything from scratch | Days |
A working Proxmox Backup Server and TrueNAS replication typically put you in the second or third row of the table (VM restore or replica failover), well below most SMB RTOs.
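One way to make the table usable under time pressure is to filter it by what is actually available and by the time left until the RTO. The Python sketch below does that. The precondition keys such as `replica_exists` are hypothetical names, and the durations are the upper bounds from the "Typical RTO" column; "Rebuild + data restore" is left out because it is the fallback when nothing else applies.

```python
from datetime import timedelta

# (method, upper bound from the table, precondition key)
# Note: the Proxmox Backup Server figure is per VM, so multiply for several VMs.
METHODS = [
    ("Replica failover", timedelta(minutes=60), "replica_exists"),
    ("VM restore from Proxmox Backup Server", timedelta(hours=2), "hypervisor_running"),
    ("Cloud failover", timedelta(hours=4), "cloud_dr_ready"),
    ("Bare-metal restore", timedelta(hours=8), "hardware_and_image_available"),
]


def viable_methods(facts, time_left):
    """Methods whose precondition holds and whose upper bound fits into time_left."""
    return [
        name
        for name, worst_case, needs in METHODS
        if facts.get(needs, False) and worst_case <= time_left
    ]


facts = {"hypervisor_running": True, "hardware_and_image_available": True}
print(viable_methods(facts, timedelta(hours=4)))
```

An empty result means the RTO can no longer be met with the available options, which is the point at which management has to be informed.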
During the restore:
- Brief status update every 30-60 minutes in the defined communication channel
- Write restore logs — what was restored from where and when
- Before production release, at least one smoke test per system: login, a few typical actions
- Only then bring employees back to the system
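The restore log and the smoke-test gate from the list above can be combined into one small helper. This is a sketch under assumptions: the check functions are placeholders you would replace with real login or application tests, and the JSON field names are made up for illustration.

```python
import json
from datetime import datetime, timezone


def run_smoke_tests(checks):
    """Run named zero-argument checks; a check passes if it returns a truthy value.

    An exception counts as a failure so that one broken check cannot
    abort the whole run.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


def restore_log_entry(system, source, results, now=None):
    """Build one JSON log line: what was restored, from where, when, and
    whether the system may be released (all smoke tests passed)."""
    now = now or datetime.now(timezone.utc)
    return json.dumps({
        "time": now.isoformat(timespec="seconds"),
        "system": system,
        "restored_from": source,
        "smoke_tests": results,
        "release": bool(results) and all(results.values()),
    })


results = run_smoke_tests({"login": lambda: True, "open_invoice": lambda: True})
print(restore_log_entry("erp", "nightly backup", results))
```

A system without any smoke test is deliberately not released: `release` is only true when at least one check ran and all of them passed.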
Phase 4: Learn (24 to 72 Hours After Restore)
This phase is the one SMBs skip most often — and it is the most important. Without a post-mortem, the same mistake happens again.
Post-mortem meeting with all involved parties:
- IT, management, affected business departments, external provider if applicable
- Timebox: 1-2 hours, no longer
- No place for blame — focus is process improvement, not punishment
- Outputs: written action items with owners and deadlines
Structured walkthrough — what happened when?
| Time | What happened? | What could have gone better? |
|---|---|---|
| T+0 (outage) | first detection via monitoring | Did an employee notice before monitoring did, i.e. was monitoring late? |
| T+15min | escalation to on-call | Did on-call respond quickly? |
| T+45min | diagnosis complete | Was diagnosis tooling available? |
| T+2h | first restore begins | Was the restore path clear? |
| T+RTO | last system back | RTO met? If not: why not? |
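For the walkthrough it helps to see where the time actually went. The following Python sketch computes the gap between consecutive events and the RTO overrun from the incident log. The first four offsets mirror the example table; the 5-hour value for the last system and the 4-hour RTO in the usage are hypothetical.

```python
from datetime import timedelta

# Offsets from the start of the outage, as reconstructed from the incident log.
TIMELINE = [
    ("outage detected", timedelta(0)),
    ("escalation to on-call", timedelta(minutes=15)),
    ("diagnosis complete", timedelta(minutes=45)),
    ("first restore begins", timedelta(hours=2)),
    ("last system back", timedelta(hours=5)),  # hypothetical measurement
]


def phase_durations(timeline):
    """Time spent between consecutive events, labelled 'from -> to'."""
    return [
        (f"{a} -> {b}", tb - ta)
        for (a, ta), (b, tb) in zip(timeline, timeline[1:])
    ]


def rto_overrun(timeline, rto):
    """How far the last event exceeded the RTO (zero if the RTO was met)."""
    return max(timeline[-1][1] - rto, timedelta(0))


for label, duration in phase_durations(TIMELINE):
    print(f"{label}: {duration}")
print("RTO overrun:", rto_overrun(TIMELINE, timedelta(hours=4)))
```

In this example the longest gap before the restore is the 1 hour 15 minutes between finished diagnosis and the first restore step, which is where the post-mortem questions should start.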
Typical findings from real post-mortems:
- Monitoring detected the outage too late (e.g. heartbeat was not an end-to-end test)
- Restore order was not documented — order was decided in the moment
- Nobody knew where the backup encryption passphrase was stored
- Employees had no information on how to communicate during the outage (mail was down too)
- Spare hardware was not in stock — lead time extended RTO by days
Each finding turns into a concrete action: monitoring extension, runbook update, emergency mail group, spare hardware on shelf.
Sample Checklist: First 60 Minutes of Total Server Failure
This checklist belongs in every server cabinet — laminated, with current numbers:
[ ] T+0: Symptom recorded (time, what does not work)
[ ] T+5: Monitoring check, manual ping check
[ ] T+10: Scope clarified: hypervisor / storage / network / single VM
[ ] T+15: IT on-call informed
[ ] T+15: Security or hardware incident? Decision made
[ ] T+20: Security: network isolation; hardware: collect logs
[ ] T+30: Management informed, RTO status discussed
[ ] T+45: Provider / vendor support contacted
[ ] T+60: Recovery plan in place, first steps begin
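Teams that also track the checklist digitally, on a device outside the affected infrastructure, can flag overdue steps automatically. The sketch below is one possible way to do that in Python; the task labels are shortened from the checklist above and the function name `overdue` is hypothetical.

```python
# (due minute after T+0, task), taken from the checklist above.
CHECKLIST = [
    (0, "Symptom recorded"),
    (5, "Monitoring check, manual ping check"),
    (10, "Scope clarified"),
    (15, "IT on-call informed"),
    (15, "Security or hardware incident decided"),
    (20, "Isolation (security) or log collection (hardware)"),
    (30, "Management informed, RTO status discussed"),
    (45, "Provider / vendor support contacted"),
    (60, "Recovery plan in place, first steps begin"),
]


def overdue(elapsed_minutes, done):
    """Tasks whose due time has passed and that are not ticked off yet."""
    return [
        task
        for due, task in CHECKLIST
        if due <= elapsed_minutes and task not in done
    ]


done = {"Symptom recorded", "Monitoring check, manual ping check"}
for task in overdue(15, done):
    print("OVERDUE:", task)
```

The laminated paper copy remains the primary version; the script is only a convenience for whoever keeps the incident log.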
What DATAZONE Maintains for Customers
As part of our DATAZONE Control managed services, we maintain for customers:
- Current emergency plans as living documents
- Escalation lists with verified availability
- Restore tests at least annually on a test VM
- Backup validation, not just job status
- 24/7 on-call with defined response time
In practice we regularly see emergency plans that are formally well documented but fail in the real event: because a phone number is outdated, because the restore was never tested under time pressure, or because nobody has the system order in their head. The annual exercise costs a few hours and can shorten the actual recovery time in a real incident several times over.
Conclusion
A total server failure is not the end of a company — if an emergency plan exists, has been rehearsed, and comes out of the cabinet during the real incident. The four phases detect, contain, restore, learn are not a theoretical structure but an order that works under stress. The most important investment is not the next backup tool, but the annual drill with a real restore under time pressure.
If you want to test how robust your own emergency plan is, three questions get you far:
- Where is the current escalation list, and when were the numbers last verified?
- In which order are systems restored — and who decides?
- When was the last real restore tested on separate hardware?
If any of these questions produces an “I don’t know”, there is homework to do. We are happy to help — before the real incident demands the homework.