
OPNsense Multi-WAN: Failover Done Right for SMB

Multi-WAN setups are among the most misunderstood OPNsense topics. On the data sheet it sounds like “double bandwidth, double availability”; in practice that is a simplification that can get expensive. This article explains how to cleanly configure real failover between two WAN connections (typically fibre plus LTE/5G backup) on OPNsense, and why load balancing is not the right choice for most SMBs.

The scenario

Typical SMB setup we configure at DATAZONE:

  • WAN1: fibre, symmetric 500/500 Mbit/s, fixed ISP contract, static IP
  • WAN2: LTE or 5G backup, asymmetric, monthly data quota (e.g. 100 GB), changing IP
  • Requirement: on WAN1 failure switch automatically to WAN2, do not disrupt VoIP telephony, keep critical services reachable (mail, ERP, VPN dial-in)

Goal: failover-only, no load balancing. Justification below.

Configure gateways

In OPNsense multi-WAN lives at the gateway level. A gateway is defined per WAN interface (System → Gateways → Single). Important fields:

  • Monitor IP: do not use the ISP's default gateway IP; use a real public IP on the internet, e.g. 1.1.1.1 (Cloudflare) or 9.9.9.9 (Quad9). Why: the ISP router often still responds to pings when the connection is “down” (technically reachable, but no internet behind it). A real public IP is a more reliable health check.
  • Latency threshold: thresholds such as 500 ms warning / 1000 ms alarm are too generous for fibre. We typically set 200 ms / 500 ms; check which values are actually in effect on your installation.
  • Packet loss threshold: 10% warning / 20% alarm — at smaller thresholds failover triggers on normal fluctuations.
  • Probe interval: 1 second
  • Time period: 60 seconds. Probe results are averaged over this window, so the gateway is only considered “down” once the 60-second average exceeds the alarm threshold.

These settings are a compromise between fault tolerance (no flapping) and reaction time (failover within ~1 minute). Anyone needing faster failover can go to 30 seconds time period — at the risk of false positives on short ISP hiccups.
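How the thresholds and the time period interact can be sketched with a small model. This is a deliberately simplified illustration, not the monitoring daemon OPNsense actually runs; the class name `GatewayMonitor` and the status strings are assumptions made for this example. It keeps one probe result per second in a rolling window and compares the averaged latency and loss against the warning and alarm values from the list above.

```python
from collections import deque
from statistics import mean


class GatewayMonitor:
    """Toy model of threshold-based gateway monitoring (hypothetical, simplified)."""

    def __init__(self, window=60, latency_ms=(200, 500), loss_pct=(10, 20)):
        # With a 1-second probe interval, `window` samples cover `window` seconds.
        self.samples = deque(maxlen=window)
        self.latency_warn, self.latency_down = latency_ms
        self.loss_warn, self.loss_down = loss_pct

    def record(self, rtt_ms):
        """Store one probe result; rtt_ms is None when the probe got no reply."""
        self.samples.append(rtt_ms)

    def status(self):
        if not self.samples:
            return "pending"
        answered = [s for s in self.samples if s is not None]
        loss = 100.0 * (len(self.samples) - len(answered)) / len(self.samples)
        latency = mean(answered) if answered else float("inf")
        if loss >= self.loss_down or latency >= self.latency_down:
            return "down"
        if loss >= self.loss_warn or latency >= self.latency_warn:
            return "warning"
        return "online"


monitor = GatewayMonitor()
for _ in range(60):
    monitor.record(12.0)   # healthy fibre line
print(monitor.status())    # online
```

A shorter window makes the average react faster to a failure, but also to short hiccups, which is the trade-off described above.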

Create gateway group

The gateway group is the central configuration for the failover logic (System → Gateways → Group). Example configuration:

Name: WAN_FAILOVER
Gateway priority:
  - WAN1_GW: Tier 1
  - WAN2_GW: Tier 2
Trigger level: Packet Loss or High Latency

Tier 1 is used as long as it is “up”. On failure OPNsense switches to Tier 2. Several gateways on the same tier would activate load balancing — which we deliberately avoid here.
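The tier logic can be illustrated in a few lines of Python. This is a sketch of the selection rule as described here, not OPNsense code; the function name `active_gateways` and the dictionary layout are assumptions for the example.

```python
def active_gateways(tiers, is_up):
    """Return the gateways of the lowest-numbered tier that has a live member.

    tiers: mapping gateway name -> tier number
    is_up: mapping gateway name -> True/False from the health check
    """
    for tier in sorted(set(tiers.values())):
        members = [gw for gw, t in tiers.items()
                   if t == tier and is_up.get(gw, False)]
        if members:
            return members
    return []  # every gateway in the group is down


group = {"WAN1_GW": 1, "WAN2_GW": 2}
print(active_gateways(group, {"WAN1_GW": True, "WAN2_GW": True}))   # ['WAN1_GW']
print(active_gateways(group, {"WAN1_GW": False, "WAN2_GW": True}))  # ['WAN2_GW']
```

If both gateways were placed on Tier 1, the function would return both names while both are up, which corresponds to the load-balancing case.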

Switch firewall rules to the gateway group

The real lever: firewall rules that allow outbound traffic must get the gateway group as gateway selection — not the individual WAN gateway.

In Firewall → Rules → LAN for the default outbound rule (“from LAN net to any”):

  • Advanced features → Gateway: WAN_FAILOVER instead of default

Anyone who forgets this has working gateway health checks, but the traffic still follows the system default route, which does not change on failover because it is a static entry. A classic configuration mistake.

Outbound NAT per WAN

OPNsense default NAT does source NAT to the respective WAN IP. With multi-WAN this must work cleanly per WAN — otherwise packets go out via WAN2 but with WAN1 source IP, and the ISP LTE gateway drops the traffic.

In Firewall → NAT → Outbound:

  • Switch mode to Hybrid Outbound NAT (default is Automatic)
  • Manual rules per WAN: “Source LAN net → Translation interface address WAN1” and “Source LAN net → Translation interface address WAN2”
  • Order is not decisive, because OPNsense applies the NAT rule matching the chosen outgoing interface
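Why the rule order does not matter can be shown with a small lookup sketch. The addresses below are placeholders from documentation ranges, and the rule layout and the function `translated_source` are assumptions for illustration only; the actual translation is done by the firewall's packet filter.

```python
import ipaddress

# Hypothetical example values; replace with your own LAN and WAN addresses.
NAT_RULES = [
    {"interface": "WAN1", "source": "192.168.1.0/24", "translate_to": "198.51.100.10"},
    {"interface": "WAN2", "source": "192.168.1.0/24", "translate_to": "203.0.113.77"},
]


def translated_source(out_interface, src_ip, rules=NAT_RULES):
    """Return the source address a packet would carry when leaving via out_interface."""
    src = ipaddress.ip_address(src_ip)
    for rule in rules:
        in_source_net = src in ipaddress.ip_network(rule["source"])
        if rule["interface"] == out_interface and in_source_net:
            return rule["translate_to"]
    return None  # no rule for this interface/source combination


print(translated_source("WAN2", "192.168.1.50"))  # 203.0.113.77
```

Because each rule is bound to exactly one outgoing interface, reversing the list gives the same result.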

Health check with monitor IP — the most important part

We briefly mentioned this above, here in more detail. The monitor IP decides whether failover works. Common mistakes:

  • ISP gateway IP as monitor: ISP routers keep responding to pings of themselves even when their internet uplink is down. Failover does not trigger.
  • Own public IP as monitor: makes no sense — goes over the same interface we want to check, plus possible asymmetry problems.
  • Unreachable IP: same mistake in the other direction — gateway is permanently detected as down.

Well proven:

  • WAN1 monitor IP: 1.1.1.1 (Cloudflare)
  • WAN2 monitor IP: 9.9.9.9 (Quad9), deliberately a different public IP than the one used for WAN1, so that both monitor targets cannot fail at the same time (e.g. during a large DDoS on Cloudflare)
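The common mistakes listed above can be caught with a simple sanity check before the configuration goes live. The following is a minimal sketch using Python's standard `ipaddress` module; the function name `check_monitor_ips` and the input layout are assumptions for this example, and it cannot detect an IP that is public but simply unreachable.

```python
import ipaddress


def check_monitor_ips(wans):
    """Return warnings for a dict like {"WAN1": {"gateway": "...", "monitor": "..."}}."""
    warnings = []
    seen = {}  # monitor IP -> first WAN that uses it
    for name, cfg in wans.items():
        monitor = ipaddress.ip_address(cfg["monitor"])
        if cfg.get("gateway") == cfg["monitor"]:
            warnings.append(f"{name}: monitor IP is the ISP gateway itself")
        elif not monitor.is_global:
            warnings.append(f"{name}: monitor IP {monitor} is not a public address")
        if monitor in seen:
            warnings.append(f"{name}: shares monitor IP {monitor} with {seen[monitor]}")
        seen.setdefault(monitor, name)
    return warnings


# Gateway addresses here are made-up placeholders.
config = {
    "WAN1": {"gateway": "198.51.100.1", "monitor": "1.1.1.1"},
    "WAN2": {"gateway": "10.64.0.1", "monitor": "9.9.9.9"},
}
print(check_monitor_ips(config))  # []
```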

Sticky connections for VoIP

This is the point where most multi-WAN setups fail in practice. VoIP connections (SIP, RTP) do not tolerate mid-call failover. If the gateway switches mid-call, the source IP for RTP changes, the SIP provider drops the packets, and the call drops.

Solution in OPNsense:

  • Reply-To mechanism: in firewall rules (Advanced → Reply-to) ensure that established connections keep going over the original gateway, even if the default route has switched.
  • Sticky connections (Firewall → Settings → Advanced → Sticky connections active): forces an existing connection to stay on the same gateway.

With sticky active, only new connections switch to WAN2. Active calls stay on WAN1 and drop if WAN1 is really down. That is acceptable: a cleanly dropped call is better than one that keeps sending packets into the void.
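The behaviour described here (existing flows stay where they started, new flows follow the current gateway) can be modelled in a few lines. This is a toy model of the idea only, not how the packet filter implements its state table; `StateTable` and the flow names are hypothetical.

```python
class StateTable:
    """Toy model: a flow keeps the gateway it was created on."""

    def __init__(self):
        self.flows = {}  # flow id -> gateway

    def route(self, flow_id, current_gateway):
        # setdefault only assigns a gateway the first time a flow is seen.
        return self.flows.setdefault(flow_id, current_gateway)

    def expire_dead(self, dead_gateway):
        """Remove and return the flows pinned to a gateway that is really gone."""
        dropped = [f for f, gw in self.flows.items() if gw == dead_gateway]
        for f in dropped:
            del self.flows[f]
        return dropped


table = StateTable()
table.route("call-1", "WAN1_GW")          # call starts while WAN1 is active
print(table.route("call-1", "WAN2_GW"))   # WAN1_GW (existing call stays put)
print(table.route("call-2", "WAN2_GW"))   # WAN2_GW (new call uses the backup)
print(table.expire_dead("WAN1_GW"))       # ['call-1']
```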

Load balancing — why we rarely recommend it

Gateway groups with two Tier 1 gateways activate load balancing. That sounds tempting (“double bandwidth!”) but causes real problems in practice:

  • Sessions break: a TCP session that started on WAN1 must stay there. Sticky is mandatory, otherwise HTTPS connections break regularly.
  • Changing source IPs: some web applications are sensitive to a source IP that changes mid-session (cookies, session tokens, captchas).
  • Cloud apps with geo-IP: Microsoft 365 or Google Workspace notice when the source IP jumps between ISPs — account security alarms follow.
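  • VPN stability: IPsec tunnels usually break when the source IP changes, and WireGuard, although it can roam to a new address, can also suffer when packets keep alternating between two source IPs.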
  • Asymmetric WAN bandwidth (e.g. 500 Mbit/s fibre plus 50 Mbit/s LTE) benefits little from load balancing: the fast line waits for the slow one.

For most SMBs failover-only is the right choice. Load balancing makes sense for high-load setups with two equivalent symmetric lines and workloads that do not need sticky (e.g. backup replication, bulk downloads).

What happens at failover — expectation management

Even a cleanly configured failover is not an uninterrupted connection. What actually happens:

  • Active TCP sessions break: HTTPS connections, RDP, SSH — all existing sessions terminate. Browsers reload after that, RDP clients reconnect — mostly within 5–15 seconds.
  • VPN tunnels must rebuild: WireGuard is faster than IPsec (typically under 5 seconds), but there is an interruption.
  • DNS caches still hold the old public IP: in the first seconds after failover, connections to your own services can still be directed at the WAN1 address. Dynamic DNS for those services can mitigate this.
  • VoIP: active calls drop (see sticky), new calls go over WAN2.

A good failover plan communicates this internally: “on WAN outage we switch automatically to LTE. There is a 15–30 second interruption. Calls may drop — dial again. ERP web client reloads.”

Monitoring and alerting

Once the setup is configured, make sure a failover does not go unnoticed. Options in OPNsense:

  • Email alert on gateway switch (System → Settings → Notifications)
  • Webhook alert for integration into Slack, Mattermost or MS Teams
  • Zabbix/Prometheus polling via the OPNsense plugin/API

Important: after recovery of WAN1, OPNsense switches back automatically. This switch is also an alarm-worthy event.
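Whichever channel is used, the alert logic boils down to comparing two consecutive gateway status snapshots and reporting both directions of change. The sketch below assumes the snapshots arrive as plain dictionaries from your polling method of choice; the dictionary format, the status strings and the function `gateway_events` are assumptions for illustration.

```python
def gateway_events(previous, current):
    """Compare two status snapshots and describe every change.

    Both arguments map gateway name -> status string (e.g. "online", "down").
    """
    events = []
    for name in sorted(set(previous) | set(current)):
        old = previous.get(name, "unknown")
        new = current.get(name, "unknown")
        if old != new:
            kind = "recovered" if new == "online" else "degraded"
            events.append(f"{kind}: {name} {old} -> {new}")
    return events


before = {"WAN1_GW": "online", "WAN2_GW": "online"}
after = {"WAN1_GW": "down", "WAN2_GW": "online"}
print(gateway_events(before, after))  # ['degraded: WAN1_GW online -> down']
print(gateway_events(after, before))  # ['recovered: WAN1_GW down -> online']
```

Reporting the recovery as its own event covers the switch back to WAN1 as well as the initial failover.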

Testing — before the real case

An untested failover is only hope. What we always do in DATAZONE setups:

  1. Controlled WAN1 shutdown: pull WAN1 cable (or disable ISP modem), start stopwatch. When does OPNsense detect the failure, when does the first traffic route over WAN2?
  2. VoIP test during failover: hold an active call during the test. Expected: call drops. New call over WAN2 works.
  3. Test VPN reconnect: check home-office VPN during failover — does the tunnel rebuild on the new WAN IP?
  4. Failback test: re-enable WAN1, check whether OPNsense switches back automatically.

Document the result. When the setup is tested next (in 12 months at the maintenance appointment), you have a baseline.
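For the baseline, the stopwatch can be replaced by a log of timestamped reachability probes taken from a LAN client during the test. The following sketch computes the longest interruption from such a log; the sample format and the function `longest_outage` are assumptions, and how the probes are collected (for example a ping once per second) is left open.

```python
def longest_outage(samples):
    """Return the longest interruption in seconds.

    samples: list of (timestamp_seconds, reachable) tuples in time order.
    An outage runs from the first failed probe to the next successful one;
    an outage still open at the end of the log is not counted.
    """
    longest, start = 0.0, None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts                      # outage begins
        elif ok and start is not None:
            longest = max(longest, ts - start)
            start = None                    # outage ended
    return longest


log = [(0, True), (1, True), (2, False), (3, False), (4, False), (5, True), (6, True)]
print(longest_outage(log))  # 3
```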

Realistic recommendation for SMB

For the typical mid-market customer we advise:

  • One fast, stable line (fibre, possibly SDSL as bundle) as WAN1
  • LTE/5G backup as WAN2 — with a contract with sufficient data quota (on failover-heavy days the backup can consume several hundred GB)
  • Failover-only setup, no load balancing
  • Sticky connections on, health checks on real public IPs
  • Outbound NAT per WAN cleanly configured
  • Alerting to the IT distribution list

This is a setup that helps in emergencies without creating problems in everyday operation. Anyone needing a more complex setup with load balancing should justify this with workload analysis — not because “multi-WAN” is on the data sheet.
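To judge whether a data quota is sufficient, a quick calculation helps. The sketch below uses figures that appear in this article (a 100 GB quota and a 50 Mbit/s LTE line) and assumes decimal units and a fully saturated line, which is a worst case rather than typical office traffic; the function name is made up for this example.

```python
def hours_until_quota_exhausted(quota_gb, sustained_mbit_s):
    """Hours a quota lasts at a constant throughput (decimal units: 1 GB = 8000 Mbit)."""
    seconds = quota_gb * 8000 / sustained_mbit_s
    return seconds / 3600


print(round(hours_until_quota_exhausted(100, 50), 1))  # 4.4
print(round(hours_until_quota_exhausted(100, 10), 1))  # 22.2
```

At full line rate a 100 GB quota is gone in well under a working day, which is why the quota should be sized against realistic outage durations.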

DATAZONE recommendation

OPNsense multi-WAN with failover-only is standard repertoire in our firewall setups. We typically configure it with two hours of preparation and a 30-minute test — the result lasts years, as long as ISP contracts and hardware do not change.

Anyone migrating from pfSense or from an old Sophos/Fortinet solution finds the multi-WAN configuration in OPNsense well structured — the UI is clear, the logic comprehensible.

Anyone who wants their multi-WAN setup configured by an OPNsense expert: please book a meeting — we set this up remotely too.
