Where did it come from?
Since we had already decided to use Quagga and BGP to deploy our overlay network infrastructure internally, extending this logic to a multihomed point of peering wasn't much of a stretch.
Why do we need multihoming?
Relying on a single hosting provider is quite problematic. We use OVH to provide most of our servers, so we inherit all of their difficulties; OVH's CEO Octave has some hashtags to prove it. Those issues were an early concern on our side, so we did a little digging in that direction.
From there our options were limited: either take the blue pill and change our hosting strategy to be more 'cloud reliant', or toughen up our private networking stack and extend our network onto a new and shiny bare-metal infrastructure with another hosting company. Quite obviously (hence this article), we chose the latter. The core concept, as with overlay networking, relies heavily on BGP.
At first, we only had a couple of load balancers and a few servers to fit our needs.
This could be presented like this:
After we felt the need to multi-home our services, the topology looked like this:
Our services, isolated at first, are now deployed in two different datacenters with private, isolated networks. Those networks are interconnected via a router stack that shares its routes and topology via BGP. We also chose to apply weights to our two DCs, so that each receives traffic proportional to its computing power. Fortunately, the Linux kernel provides every tool we need.
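As a rough sketch of that weighting, a Quagga bgpd.conf fragment could look like the following. The ASN, neighbor addresses, route-map names, and weight values are all illustrative, not our actual configuration:

```
! Hypothetical bgpd.conf fragment — DC1 has more computing power,
! so routes learned from it get a higher weight and attract more traffic
router bgp 64512
 neighbor 10.0.0.2 remote-as 64512
 neighbor 10.0.0.2 route-map DC1-IN in
 neighbor 10.1.0.2 remote-as 64512
 neighbor 10.1.0.2 route-map DC2-IN in
!
route-map DC1-IN permit 10
 set weight 200
!
route-map DC2-IN permit 10
 set weight 100
```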
rp_filter and proxy_arp
To ensure some sort of network isolation, we chose to extend only a few limited
/24 ranges between DCs. You may have noticed that we use 3 routers on each side of the PoP, and because of that we had to loosen some of the kernel's default networking behaviour on our private networks and create some abstractions.
Since we no longer filter anything at the
MAC level, we also need some abstractions to avoid propagating
MAC addresses where they shouldn't be propagated. We all know the classical
who has 02:42:32:10:bc:7d. In this use case, we want our routers to reply "It's a me" to ensure packets are transported to the right peer.
Each of our 3 routers has an operational
eth1 with proxy ARP enabled, spoofing what happens on the other side. Disabling that proxy and flushing the ARP cache on the
monitoring host reshapes the path followed to reach the host called
rmt. Disable them all and no path will be found.
So, at first our router stack looks like this:
Each router has its own arp table and proxy_arp enabled, so hosts just pick the first responder when they want to talk to the other side of the network.
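To make that toggle persistent across reboots, it can be sketched in a sysctl.d file; the file name and interface name below are illustrative:

```
# /etc/sysctl.d/99-proxy-arp.conf — hypothetical file and interface
# Reply to ARP requests on behalf of hosts on the other side of the PoP
net.ipv4.conf.eth1.proxy_arp = 1
```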
Things get complicated when you remove the proxy to trigger this reshaping:
At this step, I run:
sysctl -w net.ipv4.conf.eth1.proxy_arp=0
on the first router, and then I clear the arp cache on the monitoring host:
arp -d remote.ip.address.of.the.host
You either have to delete the arp entry from the arp table or wait for it to expire (60s by default on most Linux distributions). I run the previous commands on the two remaining routers and get this:
A perfect netsplit.
Re-enable proxy_arp and you get your network back:
sysctl -w net.ipv4.conf.eth1.proxy_arp=1
Reverse path filtering, described in this old (but gold) article and in the famous TLDP (you can check over here and here for some explanations in French), is a burden in this case. Indeed, we need to act like an
L3 switch. We disabled
rp_filter on our private interfaces, enabling dynamic packet transfer between routers and servers: a router is able to forward a packet even if the forward trip wasn't made through its pipes. With that filter on, it would be impossible for two hosts to communicate other than through the same router pair. With the filter off, hosts can communicate regardless of the path chosen for each trip. This way, a round trip between two hosts can be handled by either 2 routers or 4.
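A minimal sysctl sketch of that change, assuming the private interface is eth1 (the file and interface names are illustrative):

```
# /etc/sysctl.d/99-rp-filter.conf — hypothetical file
# 0 = no source validation: a router may forward return traffic for flows
# whose forward leg went through a different router
net.ipv4.conf.eth1.rp_filter = 0
# the kernel applies max(all, interface), so "all" must be loosened too
net.ipv4.conf.all.rp_filter = 0
```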
We can see in the schema below a "normal" request, using 4 nodes to roundtrip:
If we cut the link to
RT5, BGP will reshape the route based on the weight of each router:
What we've done is basically a run of
iperf with the
-r flag, asking for a two-way bandwidth benchmark. We ran
iftop -i tap0 -f "$myfilter" to watch only the traffic between the two observed hosts.
Equal cost multipath
We talked about what happens on the first two layers of the OSI model, but what happens on the third?
BGP has a feature that serves our purpose in this architecture: ECMP. In a few config lines, you allow your daemon to make routing decisions in order to share the network load between peers:
bgp bestpath as-path ignore
bgp bestpath as-path multipath-relax
maximum-paths ibgp 5
maximum-paths 5
Since each router has the same weight and the same network capability, they share the load with each other.
As a matter of fact, ECMP is a requirement for this routing architecture to work as expected; otherwise the
rp_filter tweak would serve no purpose, since BGP would tend to reuse the same routing path every time.
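What BGP with ECMP ends up installing in the kernel is a multipath route. A hand-built equivalent with iproute2 (prefix and next-hop addresses are hypothetical) would look like:

```
# Static sketch of the multipath route ECMP produces — one prefix,
# several next hops, flows shared between them
ip route add 10.2.0.0/24 \
    nexthop via 10.0.0.11 weight 1 \
    nexthop via 10.0.0.12 weight 1 \
    nexthop via 10.0.0.13 weight 1
```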
Many:many > 1:1
A side effect of this setup is that we can scale our multihoming PoP bandwidth "horizontally". Each router pair has its own bandwidth limitation, but since peering is not done by a single router pair, the total bandwidth available is roughly equivalent to
(n_hosts/2)*max_bandwidth_per_host. We've seen this six-instance routing infrastructure peak at 450Mbps (both upload and download) for a few hours without breaking a sweat. Latency is around 5ms RTT.
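With illustrative numbers (not our actual link speeds), the back-of-envelope maths goes:

```shell
# Hypothetical figures: 6 routers (3 pairs), each capped at 1000 Mbps
n_hosts=6
max_bandwidth_per_host=1000   # Mbps
# aggregate ceiling ~ (n_hosts / 2) * max_bandwidth_per_host
echo "$(( n_hosts / 2 * max_bandwidth_per_host )) Mbps"
# → 3000 Mbps
```

On paper that gives a 3Gbps ceiling, so the 450Mbps peak mentioned above leaves plenty of headroom.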
Towards abstracted transport
Since we have 3 routers on each end of this setup, we have 3 potential candidates for an
arp response. That is fine in a normal situation, but if you ever lose a router on either side, you will face some issues. For instance, Debian keeps
arp table entries for a long time, even when the peer is no longer reachable. That makes sense most of the time, but in this case it becomes a problem, and the more transport capacity you add, the more exposed to it you become.
Pfsense has a nice feature allowing you to enable
arp proxying for host addresses or a whole network. It also comes with FRR, a fork of Quagga. FreeBSD's kernel also has a nice switch for this:
net.link.ether.inet.max_age, which defines the maximum age of an arp table entry, allowing quicker arp expiration.
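On FreeBSD, shortening that lifetime could be sketched like this (the value shown is illustrative):

```
# Lower the ARP entry lifetime to 120s so entries pointing at a dead
# router age out faster than with the stock default
sysctl net.link.ether.inet.max_age=120
```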
So, we added 2 peering points and moved the router stack previously described into a separate network, creating a CIDR separation between the router stack and the rest of the network. Both new routers become route servers and arp proxies, serving peers. The 6-router stack is converted to
route-reflector-client, enabling both points of peering to advertise their networks without interfering in the process.
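A sketch of the reflector side of that change, with an illustrative private ASN and addresses (not our real config):

```
! Hypothetical bgpd.conf fragment on one of the two new peering routers.
! Each router of the 6-router stack is declared a client of this
! reflector, so they learn each other's routes without a full iBGP mesh.
router bgp 64512
 neighbor 10.0.0.11 remote-as 64512
 neighbor 10.0.0.11 route-reflector-client
 neighbor 10.0.0.12 remote-as 64512
 neighbor 10.0.0.12 route-reflector-client
```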
The architecture now looks like this:
Pfsense also comes with CARP redundancy, which allows us to make the 2 new peering points highly available.
This part may seem overkill, but it really speeds up network convergence in case of downtime on any router. Before this, with a stale
arp table entry, we sometimes saw hosts take over an hour to get a new valid
arp entry without human or automated intervention.
So, we went from this:
It's also important to keep in mind that peering health between 2 hosts depends on network congestion management. Before the addition of the 2 peering points, a congested link often left servers unable to communicate with each other.
Another way: VXLAN
Solutions like Calico and others rely on a BGP/VXLAN couple. Smart people like Vincent Bernat have published articles about this too; we warmly recommend this one and that one (they are also available in French). While the end result is much the same, the means are quite different. We chose the solution presented above for a simple reason: it is easy to extend, and we don't have to deal with Linux bridges. It relies only on plain old networking protocols.
This architecture offers redundancy and some load sharing, but it is still somewhat short of what you might want to run in production. In a follow-up article, we will discuss how load sharing can be turned into load balancing with a rather similar architecture.