What we needed
Most of our infrastructure runs on Docker, on bare-metal clusters. At some point we needed to let our apps talk to each other over an overlay network instead of the usual Docker port binding. We could have used service discovery and port binding, but it would have been more complicated than we needed: to run several replicas per host, we would have had to bind several ports for a single app on a single host and change our load-balancing configuration each time the topology changed. Plus, setting up an overlay network is a nice prelude to a migration to a Kubernetes cluster.
Why did we choose BGP?
Our software choice comes from a simple observation:
The Internet relies on a protocol that few people outside networking know about: BGP. Its core purpose is to provide dynamic routing built on a few basic principles. A lot of network-defining software relies on either OSPF or BGP. We chose BGP for its simplicity and proven robustness. A lot of open-source software projects have made similar design choices.
Why did we choose Quagga?
Quagga is quite reliable and is often seen in networking stacks alongside proprietary hardware from vendors such as Cisco or Juniper. It is still in active development and has received many updates over time, even though Zebra (the routing program behind Quagga) seems a bit outdated from an outside perspective.
A bit of tech
Since Kubernetes requires overlay networking, its documentation suggests a variety of tools to achieve this. A lot of those tools are either black boxes or impractical for our use case. We thought this network abstraction would also be nice for our "not yet Kubernetes-compliant" applications, so we did a bit of testing and got a satisfying result.
A cluster, a lot of networks.
The very principle of the overlay network is to give each node in a cluster its own network, reachable from every other node. From a logical point of view, our stack looked like this:
As you can see, each Docker container needs an individual port binding to work properly. Those ports then have to be mapped in our load-balancers to receive traffic.
Then, with the overlay network, we created a basic abstraction of this bare-metal entanglement:
Now each container has its own routable IP address and can use the same port even on the same host.
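To make the difference concrete, here is a minimal Python sketch of the backend list a load balancer would need in each model. The function names, the base host port 8000 and the application port 8080 are illustrative assumptions, not values from our setup. The host address and container subnet are borrowed from the sample routing table shown later in this article.

```python
from ipaddress import ip_network
from itertools import islice


def port_binding_backends(host_ip, replicas, base_port=8000):
    """Port binding: every replica on a host needs its own host port."""
    return [f"{host_ip}:{base_port + i}" for i in range(replicas)]


def overlay_backends(container_subnet, replicas, app_port=8080):
    """Routable container IPs: every replica keeps the same application port."""
    usable = iter(ip_network(container_subnet).hosts())
    # Skip the first usable address, assumed here to be the docker0 bridge.
    return [f"{ip}:{app_port}" for ip in islice(usable, 1, 1 + replicas)]


print(port_binding_backends("172.16.200.10", 3))
print(overlay_backends("10.0.2.0/24", 3))
```

With port binding, adding a replica means finding a free host port and updating the load balancer with it. With routable container addresses, only the IP changes and the port stays the same everywhere.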
This topology change is handled by bgpd, Quagga's BGP daemon. A sample configuration snippet could look like this:
! Ansible managed
log file /var/log/quagga/bgpd.log
!debug bgp events
!debug bgp filters
!debug bgp fsm
!debug bgp keepalives
!debug bgp updates
router bgp 65500
 bgp router-id 172.16.200.1
 ! # This is the ipaddress of the observed host
 timers bgp 30 90
 redistribute static
 ! # we want to send away our static routes
 network 10.0.0.0 mask 255.255.255.0
 ! # This is the docker0 network, so we need to append
 ! # the "bip": "10.0.0.1/24" flag to docker's daemon.json
 !
 ! # Following : a description of our neighbors in the same AS
 neighbor 172.16.200.10 remote-as 65500
 neighbor 172.16.200.10 route-map foo out
 neighbor 172.16.200.10 route-map foo in
 neighbor 172.16.200.10 activate
 neighbor 172.16.200.200 remote-as 65500
 neighbor 172.16.200.200 route-map bar out
 neighbor 172.16.200.200 route-map bar in
 neighbor 172.16.200.200 activate
!
!
!
! # We set the same preference to each router
route-map foo permit 10
 set local-preference 222
!
route-map bar permit 10
 set local-preference 222
The resulting routing table is quite straightforward:
$ ip r|grep -i zebra
10.0.2.0/24 via 172.16.200.10 dev eth1 proto zebra
10.0.3.0/24 via 172.16.200.200 dev eth1 proto zebra
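Because every host peers with every other host in the same AS, the per-host files differ only in the router ID, the announced docker0 network and the neighbor list. As a rough sketch of what a template has to produce (this is not our actual Ansible template), the Python below renders a stripped-down bgpd.conf and the matching daemon.json fragment. The INVENTORY mapping and both function names are hypothetical. The addresses come from the sample configuration and routing table in this article, and route-maps, timers and logging are deliberately left out.

```python
import json
from ipaddress import ip_network

AS_NUMBER = 65500  # private AS number used in the sample configuration

# Hypothetical inventory: router IP -> docker0 subnet announced by that host.
INVENTORY = {
    "172.16.200.1": "10.0.0.0/24",
    "172.16.200.10": "10.0.2.0/24",
    "172.16.200.200": "10.0.3.0/24",
}


def render_bgpd(router_id, inventory, asn=AS_NUMBER):
    """Render a minimal bgpd.conf for one host, peering with every other host."""
    subnet = ip_network(inventory[router_id])
    lines = [
        f"router bgp {asn}",
        f" bgp router-id {router_id}",
        f" network {subnet.network_address} mask {subnet.netmask}",
    ]
    for peer in inventory:
        if peer != router_id:  # a router does not peer with itself
            lines.append(f" neighbor {peer} remote-as {asn}")
    return "\n".join(lines) + "\n"


def render_daemon_json(router_id, inventory):
    """Render the Docker daemon.json fragment that pins the bridge IP ("bip")."""
    subnet = ip_network(inventory[router_id])
    bridge_ip = next(iter(subnet.hosts()))  # first usable address of the subnet
    return json.dumps({"bip": f"{bridge_ip}/{subnet.prefixlen}"}, indent=2)


print(render_bgpd("172.16.200.1", INVENTORY))
print(render_daemon_json("172.16.200.1", INVENTORY))
```

Keeping the subnet in a single inventory entry means the network announced by bgpd and the "bip" value given to Docker cannot drift apart.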
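The same check can be scripted. The following Python sketch parses the text printed by `ip route` and keeps only the routes installed by zebra, so the result can be compared with the subnets each neighbor is expected to announce. The function names and the extra default route in the sample are assumptions added for illustration. The two zebra lines are the ones shown above.

```python
def field_after(fields, keyword):
    """Return the token that follows keyword, or None if there is none."""
    if keyword in fields:
        i = fields.index(keyword)
        if i + 1 < len(fields):
            return fields[i + 1]
    return None


def zebra_routes(ip_route_output):
    """Return {destination: next_hop} for routes whose protocol is zebra."""
    routes = {}
    for line in ip_route_output.splitlines():
        fields = line.split()
        if not fields or field_after(fields, "proto") != "zebra":
            continue
        next_hop = field_after(fields, "via")
        if next_hop is not None:
            routes[fields[0]] = next_hop
    return routes


# Sample input: the default route is a made-up line that must be ignored.
SAMPLE = """\
default via 172.16.200.254 dev eth1
10.0.2.0/24 via 172.16.200.10 dev eth1 proto zebra
10.0.3.0/24 via 172.16.200.200 dev eth1 proto zebra
"""

for destination, next_hop in zebra_routes(SAMPLE).items():
    print(f"{destination} is reachable through {next_hop}")
```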
We now have a fully operational overlay network. At this point, you may think that a classic SDN tool like Calico would be easier to manage. Beyond a certain scale that could be true, but we also need to take into account the main constraint of our environment: we are not on a public cloud. Therefore, we need to manage some things manually. Fortunately for us, Ansible came along a long time ago to make our lives easier. Today, rolling out a topology change in our overlay stack is a matter of seconds, with no service interruption whatsoever.
Since Quagga is not self-sufficient, we added a home-brewed service discovery tool to ensure that all of our live apps can communicate with each other and receive traffic from our load-balancers. This networking feature has since let us enable automated gossiping between our apps and do a lot of other fun stuff.
For Kubernetes: Load balancers, ingresses and stuff
Since there is a lot to read on the Internet about these topics, I think it's better to point to the "good ones" rather than paraphrase them poorly:
- Julia Evans' article about networking in Kubernetes
- Kubernetes networking documentation
- Understanding Kubernetes networking pods
- The amazing talk "Life of a Packet" (slides are here)
- Last but not least, HAProxy's article
Our load-balancers are aware of every app in every container, and they can send packets to each application based on our ACLs. Ingresses will be covered in another article, yet to be written.
Paving the way to multihomed infrastructure
Since we have a reproducible network model, why not apply it to an L2/L3 interconnection? You can check why and how here.
Even though the setup is quite simple once it is running, starting it from scratch was quite a ride. We want to acknowledge the help Paul Jakma provided at several stages of debugging our BGP setup.