My home network is complicated. A good friend recently described it as having the complexity of a small-to-medium enterprise, and I’m inclined to agree: my network participates in multiple internal and external Border Gateway Protocol (BGP) arrangements and has close to a dozen internal security zones, each with its own set of firewall and access control rules. The network includes split-horizon DNS, both dynamic and static DHCP configurations, and a number of servers and distributed switches. All of this complexity has been managed in several ways over the years, each with its own strengths and weaknesses. Recently though, I’ve moved the entire network to be managed via HashiCorp Terraform, and the experience has been pretty good.

But how did I get here? My network didn’t just spring into existence like this, so let’s look briefly at the path that got me here. First, a disclaimer: I despise the term “homelab”. My home network is not a lab. Labs are networks where you can break things and it doesn’t matter, and that’s not true of the core network powering all the lights, HVAC, and creature comforts of my home, as well as my ability to telework. My home network is very much a production zone, with the appropriate ACLs, maintenance windows, and backups that a production environment demands. I have a lab environment as well, but it’s homed off a different router specifically so that it can be islanded when problems occur.

About 15 years ago I learned about dd-wrt as an alternate firmware for cheap routers. Living relatively near a Goodwill store at the time, this led to an unhealthy number of cheap router/access point combo devices making their way to my office. This setup worked for a couple of years, with physically distinct networks cascading off a single root switch. Things worked, but they weren’t easy to manage: I was effectively using a bunch of independent subnets glued together via what, as far as dd-wrt was concerned, were WAN addresses. That doesn’t scale, and it eventually led me to try Smoothwall. I don’t remember much about their solution other than that, at the time, it was really hard to get various things to work correctly when putting a NAT between two private networks, and that the forum was not a welcoming place. I’m sure things have improved since, but it was enough to put me off of the project.

My next routing device ran pfSense, and while I would not recommend it at all anymore (use OPNsense), it worked well enough at the time. I was still using a dd-wrt device for my WiFi around this time, and it played really well with the pfSense router. What I concluded was that separating out the wireless layer made a lot of things much easier to operate.

After working for an organization that introduced me to really heavy Ansible usage, as well as a local OpenBSD evangelist, I migrated my network to OpenBSD. This was great since it meant I could manage my network with Ansible, configuring my servers and my routing infrastructure using the same system. It also let me get rid of my jump boxes by using OpenBSD’s security features to my advantage. The packet filter syntax of pf was a joy to work with, and I’m still annoyed by how clunky iptables and nftables are in comparison. If you want to see what a network like that looks like, I codified most of the configuration I had into the SimpleGateway project, which is not at all simple. This collection of Ansible roles abuses a number of “features” in Ansible to do things that are trivial now but were almost impossible at the time. I link the project here mostly as a historical curiosity; today you should really use collections and make heavy use of custom filters to have better control over your data shapes, rather than abusing the Python environment inside of a conditional evaluator node on the Jinja2 parse tree. OpenBSD worked well and provided a robust gateway solution, but the fact that OpenBSD still uses a very fragile filesystem always annoyed me, as did some limitations I ran into as a residential network customer. A lot of the value of OpenBSD was hitless upgrades via CARP, but you can’t easily do that with a residential ISP: getting multiple addresses is not easy, and most CPE balks at having a switch plugged in between it and the device holding the IPs.

This led me down the route of “what about appliances?”. You see, I’d bought a Ubiquiti access point when I started the OpenBSD adventure, and it served me well through the four years that I was running OpenBSD. I was working for a company that had their entire network built out on Ubiquiti products, and we had massively expanded our footprint in a new building with an all-Ubiquiti core. This was around the time of the Ubiquiti Dream Machine Professional, and I bought in as a launch customer. It was an okay system that worked pretty well, but only so long as I stayed within the lines set out by Ubiquiti. I found their SDN solution to be extremely clunky, with the lack of a zone-based firewall being a deal breaker. Add in that a router positioned as a professional product couldn’t do BGP, and it was time to move on again.

On the recommendation of a friend I tried out Mikrotik products. I was shocked at the low price, and unsurprisingly, you get what you pay for. The RB4011 I was using ran insanely hot, did not mount securely in the rack, and was obnoxious to configure due to the many moving parts I had to contend with in order to use CAPsMAN access points. Once I had it working, though, the whole thing was incredibly fast and reliable. I left the Mikrotik ecosystem when I tried to figure out how to set up native VLANs and neither the documentation nor the incredibly frustrating IRC channel was of any use.

This led me to consider going full-on enterprise, which I did with a Juniper SRX300 router. Even before the HPE acquisition, this was the product that basically assured I will not buy Juniper hardware in the future. The company makes it extremely difficult to do business with them, and I never did find a workable way to get a support contract for the device so it could receive updates. This, coupled with how slow changes were to apply on the SRX300, has really pushed me away from Juniper Networks, which is a shame because I really did like the build quality of the device, as well as the passively cooled silence it brought to my network. I migrated my WiFi first to OpenWISP and later to Aruba Instant. The OpenWISP WiFi worked well and was extremely stable, but I don’t run Ubuntu at home, so I was maintaining a very custom Docker container to run the OpenWISP controller, a container that launched the entire runit service supervisor internally to keep all the various OpenWISP services self-contained. I think it was a workable solution, but it required too much upkeep for me to be happy with it long term, which led to the replacement with the Aruba access points. Aruba Instant hosts the controller on an access point itself rather than on a server somewhere else, which means the access points are once again self-contained devices on my network without tight requirements against the routing and switching infrastructure.

To manage the SRX300 I used the excellent Terraform provider that applies changes via JUNOS NETCONF. NETCONF is an API-based way to apply any change the CLI can make, and it’s a great example of network device feature parity that doesn’t involve just driving a CLI and scraping a terminal. It was much cleaner to configure than my previous Ansible attempts, and it produced a nice diff of the config changes I was in the middle of applying. The ability to use HCL2 dynamic blocks also made it much, much easier to construct my complicated firewall mappings, finally allowing the entire network to be broken apart into unique subnets with subnet-level routes and ACLs. The problem with this setup was that it was slow, painfully slow. A configuration commit could take upwards of 12 minutes to apply, whether I was using Terraform or not. When I need to make a change, it’s usually in response to a device that needs a route, an ACL issue I need to resolve, or a change request coming in from one of the networks I peer with. I have neither the time nor the patience to wait 12 minutes every time I want to change something. Add to this that I wasn’t using the DHCP implementation in JUNOS due to some serious stability problems with reserved leases, as well as a need to integrate external split-horizon DNS, and I was in a position where I managed most of the network with Terraform and a handful of structural components with Ansible.
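
To give a flavor of the dynamic-block pattern, here’s a minimal sketch of how a map of zone pairs can be expanded into security policies. The `zone_policies` variable shape is something I’ve invented for illustration, and the resource and attribute names only approximate the community junos provider’s schema rather than quoting it exactly:

```hcl
# Sketch: expand data-driven zone pairs into firewall policies with
# HCL2 dynamic blocks. The variable shape is invented for this example,
# and the resource/attribute names approximate the community junos
# provider rather than quoting its exact schema.
variable "zone_policies" {
  type = map(object({
    from_zone = string
    to_zone   = string
    rules = list(object({
      name         = string
      sources      = list(string)
      destinations = list(string)
      applications = list(string)
    }))
  }))
}

resource "junos_security_policy" "zone_pair" {
  for_each = var.zone_policies

  from_zone = each.value.from_zone
  to_zone   = each.value.to_zone

  # One policy block per rule in the data, instead of hand-writing
  # every zone pair.
  dynamic "policy" {
    for_each = each.value.rules
    content {
      name                      = policy.value.name
      match_source_address      = policy.value.sources
      match_destination_address = policy.value.destinations
      match_application         = policy.value.applications
      then                      = "permit"
    }
  }
}
```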

The straw that broke the camel’s back was WireGuard. JUNOS does not have native WireGuard support, so the core router had to peer with a separate peering router for the places where I wanted to do road-warrior setups or have a remote access device phone home to my network. IPsec is a great VPN technology, but it’s a technology that requires more patience than I have, and it’s annoying to debug when the tunnels don’t come up or won’t maintain a stable connection. Because I was now maintaining some peers on a different router managed with Ansible, and installing cheater routes as well as increasingly complicated NAT rules onto the SRX, my network had lost almost all of the elegance of a fully configuration-managed environment and had instead picked up only the clunkiness of a poorly scaled solution.

I set out trying to get down to just one tool, so I looked for a device that I could manage with Terraform and that could do both BGP and WireGuard. It turns out that Mikrotik devices fit this bill well, as there is an extremely high-quality third-party provider with good coverage of the Mikrotik feature set. Even better, the applies were lightning fast. This allowed me to port my entire peering presence to an extremely cheap hEX Lite device as a test. This device is only capable of 100 Mbit/s throughput, but for a $40 device delivered the next morning from Amazon, I have no complaints. The overall workflow was very straightforward. Sometimes the applies were a little too fast, and I quickly learned to keep a terminal connected to the router using RouterOS’s “safe mode” feature so I could back out the last transaction if I lost connection to it. My core routers all have serial console ports, but the hEX Lite does not.
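
As a taste of what that looks like, here’s a minimal sketch of a WireGuard interface and road-warrior peer through the community RouterOS provider. The resource and attribute names are written from memory and may differ between provider versions, and the key and address values are placeholders:

```hcl
# Sketch of a WireGuard road-warrior peer via the community RouterOS
# provider. Resource/attribute names are from memory and may not match
# your provider version exactly; keys and addresses are placeholders.
terraform {
  required_providers {
    routeros = {
      source = "terraform-routeros/routeros"
    }
  }
}

variable "laptop_public_key" {
  description = "Public key of the roaming device"
  type        = string
}

resource "routeros_interface_wireguard" "roadwarrior" {
  name        = "wg-roadwarrior"
  listen_port = 51820
}

resource "routeros_interface_wireguard_peer" "laptop" {
  interface       = routeros_interface_wireguard.roadwarrior.name
  public_key      = var.laptop_public_key
  allowed_address = "192.0.2.10/32"
  comment         = "road warrior laptop"
}
```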

After a couple of days of futzing around, I got the entire system to a point where I could add and remove peers and change settings very quickly on the Mikrotik side, but the Juniper side was still painfully slow. Not having been bitten by RouterOS’s complexity this time around, since Terraform was managing it all for me, I decided to get a larger device and try porting my entire home network over to it. I initially selected a RouterBoard 3011, which comes in a rack-mount case and occupies an impossibly shallow space. I really like this device in terms of both size and build quality, though I wish it had an internal power supply. Fortunately, the case was machined with a knockout for the appropriate connector, and there’s plenty of room inside to fit a Meanwell IRM-60-24ST. If you choose to do this kind of modification, make sure you’re comfortable working with mains voltage and have your work checked by someone just to be sure.

It took me about a week to port the entire config over from the existing Terraform into a new workspace for the new RouterBoard. It was extremely convenient to re-use the configuration I’d already worked out for BGP, since I’d modularized that code and ingested all the critical settings, including the security-sensitive values, from an external tfvars file. If you use this pattern a lot, it’s handy to use a .auto.tfvars file so that your secure values stay out of git and are loaded automatically on each plan/apply. One of the major advantages of using Terraform for this was that I could avoid having the same networks appear on both routers by migrating blocks of network definitions from one tfvars file to another. This made it super clean to configure the new RouterBoard from my desktop without losing access to the still-running production network.
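
The pattern itself is nothing fancy: declare the variables as usual, put the values in a git-ignored file whose name ends in .auto.tfvars, and Terraform will pick it up on every plan and apply without any -var-file flags. The variable names below are invented for illustration:

```hcl
# variables.tf -- declared like any other inputs; names here are
# invented for illustration.
variable "bgp_auth_key" {
  description = "BGP session authentication key; never committed to git"
  type        = string
  sensitive   = true
}

variable "bgp_peers" {
  description = "Peer definitions shared between router workspaces"
  type = map(object({
    peer_address = string
    remote_as    = number
  }))
}
```

```hcl
# secrets.auto.tfvars -- listed in .gitignore, loaded automatically by
# Terraform on every plan/apply.
bgp_auth_key = "example-placeholder-key"

bgp_peers = {
  upstream = {
    peer_address = "198.51.100.1"
    remote_as    = 64511
  }
}
```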

Using Terraform for this also made it really clean to test that the configuration could be applied from nothing: I reinstalled RouterOS with a custom startup script that got the chassis just far enough that Terraform could talk to it, then applied my configuration. In my final configuration there are a few bootstrap concerns that I don’t have a good way to deal with, mostly places where I need to create certain resources and then reference them in firewall rules. I’m pretty sure I could solve this problem by having a global in_bootstrap flag that would disable certain firewall rules and suppress certain security features, but that’s an exercise for another time.
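
If I did tackle it, I imagine it would look something like the sketch below. The flag and resource here are hypothetical, and the firewall resource only approximates the RouterOS provider’s schema; the idea is just to gate lockout-prone rules behind a boolean during the first apply:

```hcl
# Hypothetical bootstrap gate: rules that could lock Terraform out are
# only created once in_bootstrap is flipped back to false. The firewall
# resource approximates the RouterOS provider schema rather than
# quoting it exactly.
variable "in_bootstrap" {
  description = "Disable lockout-prone rules during the initial apply"
  type        = bool
  default     = false
}

resource "routeros_ip_firewall_filter" "drop_unmanaged_mgmt" {
  count = var.in_bootstrap ? 0 : 1

  chain       = "input"
  action      = "drop"
  src_address = "!192.0.2.0/24" # everything except the admin subnet
  comment     = "Drop management traffic from outside the admin subnet"
}
```

Bootstrapping would then be an apply with -var="in_bootstrap=true", followed by a normal apply once Terraform can reach the router through the finished rule set.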

I installed the configured rb3011 and verified that the entire network came up, only to discover a problem! With my configuration loaded, I was only able to manage about 400 Mbit/s of throughput on the rb3011. This seems low, and I’m pretty sure there’s stuff I could optimize, but as I’d sneakily planned to replace a VyOS router with the rb3011 if I liked it, I instead chose to solve the performance problem by throwing more hardware at it. I upgraded the router for the home network to a Mikrotik RB1100AHx4, an overkill dual-PSU model with powerfail bypass and a number of other unique features. Since my configuration was in Terraform, I was able to just make a backup of my old tfstate file, on the off chance I needed to roll back to the rb3011, and then perform an apply. From cold, my entire network was reloaded into the chassis in about 37 seconds. Not bad at all in my opinion, even if I don’t intend to do a reinstall from cold very often, if indeed ever again.

With the configuration loaded, a complete apply cycle with no changes takes 3.6 seconds to pass over the 254 configured resources. This provides a great way to validate the config, as well as an extremely fast way to roll out new changes to the system.

Whether or not this network is actually software defined is a matter of debate. The end result is a network that I have defined using configuration software and checked into source control, but whether that makes it a true SDN in the same sense as something like Cilium is a question I’ll leave open.

If you manage networks for a living, I’d really recommend you take a look at Terraform and see what the level of support is for your hardware. Juniper equipment is well supported, and both Palo Alto and Arista Networks have first-party Terraform providers with extremely high coverage. At some point I’ll do a follow-up to this with my findings from using Terraform to manage the routers I use to support robotics competitions and computer shows, as that is an interesting look at doing high-speed turnaround on equipment in a way that only git branches can enable. For now though, I hope I’ve provided you some food for thought on how you manage your own networks, be they at home or at work.