Multihoming bare-metal infrastructures - part 2

Previously on Multihoming...

As you may recall, we designed a multihomed peering setup to secure our nines and ensure a reliable service for our customers.

We previously saw that it's possible to do some ECMP load sharing with BGP. There were several caveats to this design choice though.

  1. If you're looking into this, you probably are looking for a load-balancing solution.
  2. If you check who gets what portion of production traffic, you'll soon see that load-sharing is not load balancing.
  3. You will probably see some network congestion on the most used links.

The following is a run-down of the modifications we made to get the widest bandwidth out of a reliable design.

What were we missing?

Load-balancing!

The basic principle of the previously described architecture was to abstract transport between two hosting providers' networks. We relied on BGP to route packets on this peering link, and it seemed like a good idea, but BGP isn't aware of the other routers' load at all and doesn't implement any load-balancing logic.
Quite often our traffic repartition was looking like this:
[Graph: per-link traffic distribution, heavily skewed toward a single link]

As you can see, load-sharing is not load-balancing.

BGP picks a path to send its packets on and sticks to it forever. It can handle a network failure on either side, but when it comes to network performance, you're going to feel left behind.

How to implement network load-balancing on a multi-hosted environment?

Let's talk about solutions.

Since we already had a 4-hop interconnection, it didn't hurt to shake its components up a bit. We first chose pfSense for its simplicity, but mostly for the incredible performance of Packet Filter (pf).

pfSense's downside

Despite pfSense's ability to be highly available and highly reliable, it lacks a key feature for the design we had in mind. It can handle load balancing on an ingress link, but egress is another topic: it turned out to be impossible to do round-robin on an uplink with pfSense.

Packet Filter is good though

Since we were already using a BSD-based distro, why not check out FreeBSD? It also has CARP redundancy, ARP proxying and Packet Filter. And the lack of a UI means you get access to all the non-essential features that pfSense left aside.

TL;DR, packet-filter's configuration

To those who already know what feature I'm talking about, here is the configuration that we used:

set limit { states 1601000, frags 20000, src-nodes 1601000, table-entries 400000 }

lan_net = "private_range_from_our_network/8"
interlan_net = "foo.bar.baz.0/24"
remote_net = "sub_private_range_from_our_network/24"

batch = "our.private.public.address/32"

int_if  = "igb1"
ext_if = "igb0"

vpn_1 = "foo.bar.baz.2"
vpn_2 = "foo.bar.baz.3"
vpn_3 = "foo.bar.baz.4"
vpn_4 = "foo.bar.baz.5"

gateways = "$vpn_1, $vpn_2, $vpn_3, $vpn_4"

set block-policy drop
set loginterface egress
set skip on lo0

pass in on $int_if from any to any
pass out on $int_if from any to any
pass out on $int_if from $lan_net
pass out on $int_if from $interlan_net
pass in on $int_if from $remote_net

block in on $ext_if from { any , ! $batch }

pass  in quick on $int_if route-to \
       { ($int_if $vpn_1), ($int_if $vpn_2), ($int_if $vpn_3), ($int_if $vpn_4) } \
       round-robin from $lan_net to $remote_net keep state

pass inet proto icmp icmp-type echoreq
pass out on $ext_if proto { tcp, udp, icmp } all

As you might guess, the most important line here is:

pass  in quick on $int_if route-to \
       { ($int_if $vpn_1), ($int_if $vpn_2), ($int_if $vpn_3), ($int_if $vpn_4) } \
       round-robin from $lan_net to $remote_net keep state

Since some BSD distros maintain their own pf port, keep in mind that this is the FreeBSD syntax; the feature also exists in OpenBSD, AFAIK.

This way, Packet Filter load-balances per state, i.e. per connection: once your connection is established, it keeps the same transport route until it ends.

ARP

ARP proxying is quite simple: just follow FreeBSD's guidelines to add your custom ARP entries to the table. We will see a bit later how to update the ARP table when a router becomes unavailable.
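As a sketch (the address and MAC are placeholders), a published (proxy) ARP entry can be added and removed with arp(8):

```shell
# Publish an entry: this router now answers ARP requests for
# 198.51.100.10 with the given MAC, on behalf of the real host.
arp -s 198.51.100.10 00:aa:bb:cc:dd:ee pub

# Delete it again, e.g. when this router stops being MASTER:
arp -d 198.51.100.10
```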

High Availability

This is where it gets interesting. Now that we have a router with load-balancing capabilities, we don't want to see it fail. Using CARP, it's relatively straightforward:

/etc/rc.conf:

ifconfig_igb1="inet private_network netmask your_mask"
ifconfig_igb1_alias0="inet interlan_network_primary_address/netmask"
ifconfig_igb1_alias1="inet vhid an_id_to_apply pass mysupersecretpassword alias interlan_shared_virtual_address/netmask"

You can check FreeBSD's documentation if this feature is still blurry to you. With this configuration, the routers share a virtual IP address: if a router is seen as down, it is dropped from the cluster and its peer takes over.
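To see which role a router currently holds, look at the carp line in the interface output (interface name per the config above):

```shell
# The interface's carp line shows the current role, e.g.:
#   carp: MASTER vhid 1 advbase 1 advskew 0
ifconfig igb1 | grep carp
```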

The same config is applied on the backup router; you'll want to look at the advskew option there to ensure consistent mastership during failover.
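As a minimal sketch (vhid, password and virtual address are placeholders), the backup router's CARP alias line only differs by a higher advskew, which makes it less preferred:

```shell
# Hypothetical /etc/rc.conf excerpt on the backup router: advskew 100
# makes it advertise less aggressively, so it stays BACKUP as long as
# the primary (advskew 0) is alive.
ifconfig_igb1_alias1="inet vhid 1 advskew 100 pass mysupersecretpassword alias 192.0.2.1/32"
```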

ARP Failover

There is now a routing failover mechanism, but we also need to load/unload our ARP entries from the table to enable/disable proxying.

FreeBSD has a device state change daemon called devd. This daemon enables you to run some commands on device state changes. It's configured via a simple config file in /etc/devd/ifconfig.conf:

notify 10 {
        match "system"	   "CARP";
        match "subsystem"	   "your_vhid_id@igb1";
        match "type"		   "BACKUP";
        action "/usr/local/scripts/unload.sh	$subsystem $type";
};
notify 10 {
        match "system"	   "CARP";
        match "subsystem"	   "your_vhid_id@igb1";
        match "type"		   "MASTER";
        action "/usr/local/scripts/load.sh	$subsystem $type";
};

Those action scripts are called on BACKUP and MASTER events; there are a few other event types, and they are well documented.
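The scripts themselves can stay simple. A hypothetical sketch of the unload side (path, address and MAC are placeholders; load.sh mirrors it by republishing the entries with arp -s ... pub):

```shell
#!/bin/sh
# Hypothetical /usr/local/scripts/unload.sh: called by devd when this
# router transitions to BACKUP.
# $1 = subsystem (e.g. "1@igb1"), $2 = event type ("BACKUP").
logger "CARP $1 -> $2: withdrawing proxy ARP entries"

# Stop answering ARP for the proxied host(s); the new MASTER's
# load.sh republishes them.
arp -d 198.51.100.10
```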

TCP States

Since we are able to "hot swap" our routers, we also need to synchronize states. We previously saw that keep state was used in our round-robin pf rule. Those states are kept locally, but we can synchronize them to the backup router via pfsync, which is noteworthy enough to deserve a whole section of its own.
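Until then, a minimal pfsync sketch (interface name and peer address are placeholders; use a dedicated, private link, since pfsync traffic is not authenticated):

```shell
# Hypothetical /etc/rc.conf excerpt: replicate the pf state table
# to the peer router over the internal interface.
pfsync_enable="YES"
pfsync_syncdev="igb1"
pfsync_syncpeer="peer_private_address"  # unicast to the peer instead of multicast
```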

What does it look like now?

Well, it looks like this:
[Diagram: the resulting multihoming architecture]
We kept using BGP between our VPN servers, as previously described, to ensure failover on a VPN link. Now our routers are aware of their uplinks' availability, our load balancers spread packets around, and TCP handles any sequence re-ordering within sessions.

Is it better now?

Yes, it is.

So far, we've peaked for several hours at 800 Mbps+ without breaking a sweat. Latency is not impacted by the additional hop, since the RTT mostly depends on the distance between our two DCs. Our uplinks are properly used, there is no more ARP flapping between hosts, and latency between our datacenters is rather stable.
