Sophos UTM WAN Link Failover Fix



Quick Overview

Problem

Why A Backup WAN Connection Is Always Needed

Prerequisites

Current Configuration

To Confirm The Root Cause Of The Problem

How To Fix The WAN Failover Issue

The Bash Script In Action

The Final Result

Problem

If the primary ADSL connection fails, the failover to the standby interface succeeds with no issues. What doesn’t work as intended is falling back to the primary connection once it is back up…

It could be argued that the monitoring for the primary connection is not set up as intended, but that is not the case… I configured the UTM to monitor the default gateway of the WAN connection (the first hop in a traceroute over the primary ADSL ISP), because if I monitored the router itself or any public IP, it would stay reachable regardless of which WAN connection is in use.
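
To find the address to monitor, the hop can be pulled out of a traceroute. This is only a minimal sketch: `hop_address` is a helper name I made up, it assumes the usual `traceroute -n` layout (hop number in the first column, address in the second), and 8.8.8.8 is just an arbitrary public target.

```shell
#!/bin/sh
# hop_address N: print the address of hop N from `traceroute -n` output on stdin.
# Hypothetical helper; assumes the hop number is column 1 and the address column 2.
hop_address() {
    awk -v n="$1" '$1 == n { print $2; exit }'
}

# Example, run on the router itself (any public target will do):
#   traceroute -n 8.8.8.8 | hop_address 1
```

Run on the router, hop 1 should be the ISP’s gateway; run from a machine inside the LAN, the router is usually hop 1 and the ISP’s gateway hop 2. A hop that does not answer shows up as `*`.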

Why A Backup WAN Connection Is Always Needed

The main issue is not just losing the internet connection. There are other things to consider, like being able to connect remotely to a machine in the network when the primary connection goes down to see what’s wrong, and being able to reach the wireless security cameras…

Prerequisites

-A router flashed with open source firmware like DD-WRT, or any Linux device in the network with a bash shell.

-A 4G modem with an ethernet interface, or any other backup WAN connection, preferably from a different ISP.

-A Sophos UTM with Uplink Balancing & Uplink Monitoring configured, running either on a VM or on a dedicated device.

Current Configuration


The VM has three virtual interfaces configured in bridged mode; this way they are seen by other devices in the network as normal physical interfaces

I have a Sophos UTM running in an Oracle VirtualBox virtual machine on an Intel NUC. The NUC has one ethernet interface and one wireless interface, but I am only using the ethernet interface. In the VM configuration, I have created three virtual interfaces in bridged mode:

-LAN interface.

-Primary WAN interface set to use the main router as default gateway.

-Backup WAN interface set to use the 4G modem as default gateway; in my case this is a TP-Link MR-3020.

-If you use the TP-Link MR-3020 or a similar device, you will also need a 4G USB modem plugged into the USB port of the MR-3020.

Creating the interfaces in bridged mode allows them to appear to other devices in the network as physical interfaces; no device in the network can tell the difference.


Problematic setup – Uplink Balancing is set to use the default gateway of the ISP as monitoring host

The issue arises when the UTM tries to ping the aforementioned first hop via the backup WAN interface, which will never work, simply because the two connections belong to totally different ISPs.

To Confirm The Root Cause Of The Problem

To confirm the root cause of the problem, we will initiate a failover by disabling the primary WAN connection on the router and then watching the monitoring traffic 🙂


Disconnecting the WAN from the DD-WRT router’s interface and simulating a failover

But first, we have to set up tcpdump on the UTM and note the behaviour of the normal monitoring traffic. We will use the tcpdump flag -e to show the MAC addresses of the interfaces:

elutm:/root # tcpdump -nei eth1 host 10.45.3.134

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode

listening on eth1, link-type EN10MB (Ethernet), capture size 65535 bytes

08:22:14.847351 08:00:27:a1:c4:f6 > e0:3f:49:9c:5a:78, ethertype IPv4 (0x0800), length 98: 192.168.1.156 > 10.45.3.134: ICMP echo request, id 30819, seq 294, length 64

08:22:14.869246 e0:3f:49:9c:5a:78 > 08:00:27:a1:c4:f6, ethertype IPv4 (0x0800), length 98: 10.45.3.134 > 192.168.1.156: ICMP echo reply, id 30819, seq 294, length 64

08:22:29.849211 08:00:27:a1:c4:f6 > e0:3f:49:9c:5a:78, ethertype IPv4 (0x0800), length 98: 192.168.1.156 > 10.45.3.134: ICMP echo request, id 30819, seq 295, length 64

08:22:29.870805 e0:3f:49:9c:5a:78 > 08:00:27:a1:c4:f6, ethertype IPv4 (0x0800), length 98: 10.45.3.134 > 192.168.1.156: ICMP echo reply, id 30819, seq 295, length 64

08:22:44.852065 08:00:27:a1:c4:f6 > e0:3f:49:9c:5a:78, ethertype IPv4 (0x0800), length 98: 192.168.1.156 > 10.45.3.134: ICMP echo request, id 30819, seq 296, length 64

08:22:44.873276 e0:3f:49:9c:5a:78 > 08:00:27:a1:c4:f6, ethertype IPv4 (0x0800), length 98: 10.45.3.134 > 192.168.1.156: ICMP echo reply, id 30819, seq 296, length 64

The ICMP echo requests and replies are working as expected, to the correct IP and to the router’s MAC address
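
Instead of eyeballing the capture, the requests and replies can be counted. A rough sketch: `icmp_summary` is a name I made up, and it only matches the “ICMP echo request” / “ICMP echo reply” text that tcpdump prints in the lines above.

```shell
#!/bin/sh
# icmp_summary: read tcpdump text output on stdin and count
# ICMP echo requests versus echo replies.
# Hypothetical helper; it matches the wording tcpdump prints for ICMP echoes.
icmp_summary() {
    awk '
        /ICMP echo request/ { req++ }
        /ICMP echo reply/   { rep++ }
        END { printf "requests=%d replies=%d\n", req, rep }
    '
}

# Example on the UTM, stopping after 20 packets:
#   tcpdump -nei eth1 -c 20 host 10.45.3.134 | icmp_summary
```

When the monitoring is healthy the two counts should be roughly equal; requests with no replies mean the monitored host is not answering on that interface.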

Now let’s disable and re-enable the WAN connection and capture the monitoring traffic on both WAN interfaces, eth1 and eth2:

elutm:/root # tcpdump -nei eth1 host 10.45.3.134

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode

listening on eth1, link-type EN10MB (Ethernet), capture size 65535 bytes

08:39:00.768403 08:00:27:a1:c4:f6 > e0:3f:49:9c:5a:78, ethertype IPv4 (0x0800), length 98: 192.168.1.156 > 10.45.3.134: ICMP echo request, id 3148, seq 9, length 64

08:39:01.021330 08:00:27:a1:c4:f6 > e0:3f:49:9c:5a:78, ethertype IPv4 (0x0800), length 98: 192.168.1.156 > 10.45.3.134: ICMP echo request, id 3148, seq 9, length 64

08:39:01.272891 08:00:27:a1:c4:f6 > e0:3f:49:9c:5a:78, ethertype IPv4 (0x0800), length 98: 192.168.1.156 > 10.45.3.134: ICMP echo request, id 3148, seq 9, length 64

08:39:01.524324 08:00:27:a1:c4:f6 > e0:3f:49:9c:5a:78, ethertype IPv4 (0x0800), length 98: 192.168.1.156 > 10.45.3.134: ICMP echo request, id 3148, seq 9, length 64

elutm:/root # tcpdump -nei eth1 host 10.45.3.134

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode

listening on eth1, link-type EN10MB (Ethernet), capture size 65535 bytes

08:46:00.924674 08:00:27:a1:c4:f6 > e0:3f:49:9c:5a:78, ethertype IPv4 (0x0800), length 98: 192.168.1.156 > 10.45.3.134: ICMP echo request, id 3148, seq 37, length 64

08:46:01.175700 08:00:27:a1:c4:f6 > e0:3f:49:9c:5a:78, ethertype IPv4 (0x0800), length 98: 192.168.1.156 > 10.45.3.134: ICMP echo request, id 3148, seq 37, length 64

08:46:01.427553 08:00:27:a1:c4:f6 > e0:3f:49:9c:5a:78, ethertype IPv4 (0x0800), length 98: 192.168.1.156 > 10.45.3.134: ICMP echo request, id 3148, seq 37, length 64

08:46:01.678668 08:00:27:a1:c4:f6 > e0:3f:49:9c:5a:78, ethertype IPv4 (0x0800), length 98: 192.168.1.156 > 10.45.3.134: ICMP

08:46:04.442774 08:00:27:a1:c4:f6 > e0:3f:49:9c:5a:78, ethertype IPv4 (0x0800), length 98: 192.168.1.156 > 10.45.3.134: ICMP

As per the captures above, although the primary WAN connection is back up, the UTM’s echo requests get no replies, so it cannot reach the monitoring host on either interface and therefore never falls back to the primary WAN interface…

How To Fix The WAN Failover Issue

To fix this, I came up with the idea of having the UTM monitor an IP inside the LAN instead of the default gateway of the ISP, while making the reachability of that IP depend on the reachability of the default gateway of the ISP…

The solution is very simple… A small bash script on the router (or any Linux-based OS with a CLI) pings the default gateway of the primary ADSL ISP every 30 seconds. As long as the ping succeeds, it keeps an additional IP on the router’s LAN interface, and this IP address is used for the Sophos UTM’s monitoring instead of the gateway of the ISP. Here’s how it works:

-On router boot, assume the WAN connection is up and add an additional IP address (192.168.1.250) to the router’s LAN interface.

-The Sophos UTM tests the reachability of the internal IP 192.168.1.250.

-Every 30 seconds, the router pings the default gateway of the ISP.

-If the default gateway of the ISP is not reachable, the additional IP address on the router’s LAN interface is changed to any other IP address.

-If at any time 192.168.1.250 is unreachable, the UTM fails over to the backup WAN connection.

-A while true loop continues to ping the default gateway of the ISP.

-While we are on the backup WAN connection, the monitoring still works, because the IP being tested is an internal IP address that the UTM can reach regardless of which WAN connection is active.

-If the router can ping the default gateway of the ISP again, it re-adds the previously removed additional IP 192.168.1.250 on the router’s LAN interface.

-The UTM detects that the monitored host is up and falls back to the primary WAN connection :).


Under “Interfaces > Uplink Balancing”, this is how it looks with the local IP address used for monitoring

The Bash Script In Action

One of the most powerful features of DD-WRT is the ability to add custom scripts that are executed at startup and can do virtually anything. I have added this script to the “Custom scripts” section under Administration > Commands and called it via the “Startup script” section.

#!/bin/sh
# Additional monitored IP address on the LAN interface
ifconfig br0:2 192.168.1.250

while true
do
    # Ping the default gateway of the ISP three times, waiting up to 5 seconds for a reply
    ping -c 3 -W 5 10.45.3.134

    # Check the exit status of the ping command: a value other than 0 means no reply was received
    if [ $? -ne 0 ]; then
        # Change the IP to any IP other than the one used as the UTM's monitored host
        ifconfig br0:2 192.168.1.251
    else
        # The default gateway is reachable: keep or change the additional IP back to the monitored one
        ifconfig br0:2 192.168.1.250
    fi

    # Wait 30 seconds before the next check
    sleep 30
done
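
To try the logic without touching a live router, the same decision can be wrapped in functions with the commands swapped out. This is only a sketch under my own assumptions: the names `PING_CMD`, `IFCONFIG_CMD`, `PARKED_IP`, `pick_alias_ip` and `run_cycle` are made up for illustration, while the addresses and the br0:2 alias are the ones used in the script above.

```shell
#!/bin/sh
# Dry-runnable sketch of one monitoring cycle. All variable and function
# names are illustrative; the addresses come from the script above.
ISP_GW="10.45.3.134"         # default gateway of the primary ISP
MONITOR_IP="192.168.1.250"   # IP the UTM monitors
PARKED_IP="192.168.1.251"    # any LAN IP the UTM does not monitor
ALIAS_IF="br0:2"             # alias on the router's LAN bridge

# Overridable, so the logic can be tested without network access.
PING_CMD="${PING_CMD:-ping -c 3 -W 5}"
IFCONFIG_CMD="${IFCONFIG_CMD:-ifconfig}"

# Print the IP the alias should carry, based on gateway reachability.
pick_alias_ip() {
    if $PING_CMD "$ISP_GW" >/dev/null 2>&1; then
        echo "$MONITOR_IP"
    else
        echo "$PARKED_IP"
    fi
}

# One cycle: ping the gateway, then set the alias accordingly.
run_cycle() {
    $IFCONFIG_CMD "$ALIAS_IF" "$(pick_alias_ip)"
}

# On the router the cycle would be repeated, for example:
#   while true; do run_cycle; sleep 30; done
```

Setting `PING_CMD=false` and `IFCONFIG_CMD=echo` before calling `run_cycle` simulates a dead gateway and prints the ifconfig arguments instead of changing anything.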

The Final Result

Now the UTM sees the primary interface as up and the interface connected to the 4G modem as
standby, which is exactly the intended behaviour.


The ADSL connection is up and the 4G connection is on standby in case of any issues with the primary WAN

mo

Information Security Engineer with vast experience in a large array of devices and technologies.