In the last post, I talked about how I set up my Nomad cluster running on bare metal using Void Linux and Ansible. I talked a lot about the basic design, but not how I actually executed on that design. That’s what we’ll get into in this post.

There are three main elements to configure in the cluster. First, the network, routing, and data plane need to be configured as a foundation that can be used to PXE boot the rest of the fleet and then be managed on an ongoing basis. Second, the Operating System (OS) on the servers needs to be configured to run the services that turn this collection of machines into a Nomad cluster. Finally, Nomad itself needs to be configured with security policies and other runtime tweaks to suit the applications that I want to install into the cluster.

Building the Network

Let’s start with the network, since that’s the building block that the entire rest of the cluster stands upon.

From the last post, you know that I am using a MikroTik RB3011UiAS-RM, which provides me with a very capable Ethernet router. I manage this device using Terraform and this provider. This allows me to preview any change that I am considering making to the network and, if necessary, roll back changes with confidence that I have captured everything I changed.

Terraforming physical hardware is different from the cloud, as you usually need some kind of bootstrap config keyed in by hand to set up just enough of a connection that Terraform can then access the hardware and continue to provision it. There are many ways to solve this problem, but for RouterOS, I chose to reinstall the entire device OS and use a “user script” during install to configure the absolute bare minimum required for Terraform to talk to the device. In my environment this looks something like this:

/interface/vlan/add comment="Bootstrap Interface" interface=ether1 name=bootstrap0 vlan-id=2
/ip/address/add address=100.64.1.1/24 interface=bootstrap0
/user/set admin password=password
/ip service
set www disabled=no

These five lines of configuration are applied during the OS install and are enough to configure a bootstrap interface on VLAN 2 and set a password on the default admin user. The last two lines ensure that the www service is not disabled, which makes the HTTP API accessible.

In a production environment, you would generally also want to set up TLS here, but when I can see both ends of the cable I’m using to bootstrap the system, and one end is plugged into my laptop (which is necessarily trusted for this process), I can be reasonably sure nobody is eavesdropping on me.

Once I can connect, it’s a simple matter of having Terraform push the rest of the configuration into the network. There’s something extremely satisfying about this, because you can have all the cables hooked up and then just push in a complete system configuration. Watching BGP come up midway through the terraform apply cycle is a gratifying payoff for the work it takes to achieve this level of configuration capture.
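
If you haven’t terraformed a router before, the provider connection is about what you’d expect. The sketch below assumes the community RouterOS provider’s naming, so the exact attribute and resource names may differ depending on which provider you pick, and the interface names here are placeholders:

provider "routeros" {
  # The bootstrap address and credentials keyed in during install.
  hosturl  = "http://100.64.1.1"
  username = "admin"
  password = "password"
}

# From here, every VLAN, address, firewall rule, and BGP peer is just another resource.
resource "routeros_interface_vlan" "servers" {
  interface = "bridge0"
  name      = "servers0"
  vlan_id   = 10
}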

The last step of my network bootstrap is to remove the resources that Terraform doesn’t know about, namely that bootstrap0 interface and its associated IP address. Leaving them doesn’t necessarily hurt anything, but it can cause strange behaviors when the underlying physical port is managed by Terraform. A quick command to remove the VLAN is the last time I touch the device by hand; from this point forward, I can make any further changes to the network through automation.

Configuring Void Linux

Void Linux is a conventional multi-user, general purpose Linux distribution. As such, it works well with tools like Chef, Ansible, and Salt, since it follows the models those tools are designed to configure. I use Ansible most frequently, so it was the natural choice for configuring my Nomad cluster.

I layer my Ansible playbooks; there are three that every machine needs to run through:

  • bootstrap.yml: This playbook consists of a single raw command to connect and install Python. It is not generally necessary anymore because I install Python as one of the very last steps during machine install, but I keep the playbook around in case I need to bootstrap a machine that didn’t go through the automated install.

  • base.yml: This playbook configures everything that’s common across all machine types. This includes SSH, repository and mirror settings, making sure NTP is synchronized, and similar baseline configuration tasks.

  • nomad.yml: As the name implies, this playbook actually installs Nomad and joins the worker nodes to the larger cluster. The file contains two plays back to back that target Nomad servers and clients independently, but having a single playbook is much more ergonomic to use day to day. A sketch of the agent configuration these plays lay down follows just below.
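
The agent configuration this playbook lays down is nothing exotic. The values below are illustrative rather than lifted from my roles (the datacenter, path, and hostname are placeholders), but the shape is the same: a base file shared by every machine, with a server or client fragment layered on top.

# 10-base.hcl, rendered on every machine
datacenter = "MATRIX"
data_dir   = "/var/lib/nomad"

# 20-server.hcl, rendered only on the control nodes
server {
  enabled          = true
  bootstrap_expect = 3
}

# 20-client.hcl, rendered only on the workers
client {
  enabled = true
  servers = ["nomad-server.example.net"]
}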

After running the three playbooks against a host once, Nomad is up and running, and I can use more sophisticated automation to manage the system long term, which we’ll get to at the end of this post.

Configuring Nomad

Nomad needs relatively little reconfiguration, but it does have a few tunable settings, mostly related to elevated permissions assigned to a handful of jobs and to the anonymous ACL policy, which allows snooping on the cluster without logging in. Since my network perimeter is well contained, I am willing to make that trade-off for the convenience of being able to look at the contents of the cluster without authenticating.
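
Nomad treats requests that arrive without a token as the anonymous token, which picks up any ACL policy named anonymous, so allowing read-only snooping is just a matter of loading a policy by that name. The rules below are a sketch of the idea rather than a verbatim copy of mine:

namespace "*" {
  policy = "read"
}

node {
  policy = "read"
}

agent {
  policy = "read"
}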

Nomad is configured via its API once the server is up and running. You can technically use the UI for this, but that mostly just presents a series of boxes for you to paste policy language fragments into, and I would prefer to maintain my entire cluster using IaC techniques, so it’s time for more Terraform. The policies being loaded are not complex; the most complicated one in the system to date is a permission over-grant to Traefik:

resource "nomad_acl_policy" "proxy" {
  name = "proxy-read"
  job_acl {
    namespace = "default"
    job_id    = "proxy"
  }

  rules_hcl = <<EOT
namespace "*" {
  policy = "read"
}
EOT
}

This is an over-grant because Traefik does not need, and therefore should not have, full read access in all namespaces. All it actually needs is the ability to list and read services in all namespaces, but as in every production system, this temporary fix has lived far beyond the point where it should have been revisited.
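
For the record, the tighter version would swap the blanket read policy for specific namespace capabilities, something like the sketch below. I haven’t verified that this is actually the minimal set Traefik needs for service discovery, which is precisely the revisiting that keeps not happening:

namespace "*" {
  capabilities = ["list-jobs", "read-job"]
}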

Ongoing Upkeep

I’ve alluded several times now to solving some of the fundamental problems that Ansible has when scaling out, so it’s time to stop teasing and look at how this is accomplished. In a nutshell, the magic is a Nomad system batch (sysbatch) job that schedules onto every machine in the fleet and runs Ansible in a sort-of pull mode. I say a sort-of pull mode because this isn’t done with Ansible’s built-in ansible-pull mechanism; it’s done with Ansible running inside a container.

I run Ansible from a container because it makes the configuration management system independent of Git, by baking the files from a specific Git revision into the image, and it also bakes in Ansible itself so that its entire Python runtime comes along for the ride. Since I need to make use of some custom collections, this works really well.

Let’s look at the Dockerfile that I use to build this container:

# syntax=docker/dockerfile:1-labs
FROM alpine:latest AS ansible
WORKDIR /ansible
RUN --mount=type=bind,source=requirements.txt,target=requirements.txt \
    apk update && apk add python3 bash tini && \
    python3 -m venv venv && \
    ./venv/bin/pip install -r requirements.txt && \
    ./venv/bin/ansible-galaxy collection install void.network

FROM ansible
ENV ARA_API_CLIENT=http \
    ARA_API_SERVER=http://ara.matrix.michaelwashere.net \
    ANSIBLE_CALLBACK_PLUGINS=/ansible/venv/lib/python3.12/site-packages/ara/plugins/callback \
    ANSIBLE_ACTION_PLUGINS=/ansible/venv/lib/python3.12/site-packages/ara/plugins/action \
    ANSIBLE_LOOKUP_PLUGINS=/ansible/venv/lib/python3.12/site-packages/ara/plugins/lookup
COPY --exclude=*.txt . .

LABEL org.opencontainers.image.title="Ansible Configuration" \
    org.opencontainers.image.description="Complete Ansible configuration for the Matrix" \
    org.opencontainers.image.licenses=Proprietary \
    org.opencontainers.image.vendor="M. Aldridge" \
    org.opencontainers.image.authors="M. Aldridge"

This container does a few important things. First, it uses the 1-labs version of the Dockerfile syntax, which is necessary to gain access to --exclude on the COPY instruction. I can’t add the requirements.txt file to .dockerignore, because then it wouldn’t be included in the context used to bind mount it into the container during the Python module install. Second, it makes use of --mount=type=bind to bring in the requirements file, which I don’t actually want to end up in the image but need temporarily to build the virtualenv; this is a neat trick for bringing non-secret information into a build, if you didn’t already know it. Finally, the build is structured so that the layers that change frequently, the ones containing the actual Ansible configuration files, sit at the very top of the stack. Structuring the build this way improves caching dramatically.

This results in the container having Ansible installed in a virtualenv, and then having this file structure adjacent to it:

.
├── ansible.cfg
├── base.yml
├── bootstrap.yml
├── group_vars
│   ├── all.yml
│   └── nomad_client.yml
├── host_vars
│   └── matrix-core.yml
├── inventory
├── nomad.yml
└── roles
    ├── chrony
    │   └── tasks
    │       └── main.yml
    ├── docker
    │   ├── handlers
    │   │   └── main.yml
    │   └── tasks
    │       └── main.yml
    ├── nomad
    │   ├── handlers
    │   │   └── main.yml
    │   ├── tasks
    │   │   └── main.yml
    │   └── templates
    │       ├── 10-base.hcl.j2
    │       └── conf.j2
    ├── nomad-client
    │   ├── defaults
    │   │   └── main.yml
    │   ├── meta
    │   │   └── main.yml
    │   ├── tasks
    │   │   └── main.yml
    │   └── templates
    │       └── 20-client.hcl.j2
    ├── nomad-server
    │   ├── meta
    │   │   └── main.yml
    │   ├── tasks
    │   │   └── main.yml
    │   └── templates
    │       └── 20-server.hcl.j2
    ├── ssh_keys
    │   └── tasks
    │       └── main.yml
    └── xbps-repoconf

25 directories, 23 files

Those 23 files really are all you need to build a Nomad cluster on Void Linux. It may seem slightly odd, then, that this container is based on Alpine and not Void, especially when Void provides comparably sized container base images. This boils down to copy-pasta: I’m re-using some other development work on putting Ansible into a container.

This gets us pretty far along, but what we’ve got now is Ansible running in a container on every host in the fleet. How do we get Ansible to configure the host itself, especially if we don’t want to give Ansible an SSH principal it could use to connect from the container back to the machine? There are lots of neat ways you could do this, but the approach I’ve settled on for now is using a special connection plugin and giving the Ansible container some elevated security permissions.

Let me introduce you to community.general.chroot. This connection plugin allows you to use Ansible to manage chroots. This is already interesting in and of itself, but how does this help us use Nomad to run Ansible on the actual physical host? There’s nothing technically stopping us from claiming that / is a chroot and going there, and then applying Ansible to our “chrooted” environment. Well, almost nothing. There are a few things you have to change to make this work.

By default, Nomad won’t give you access to arbitrary paths on the host filesystem. This is a good thing, and it helps keep your machines secure. There are a number of ways to expose host paths, but in my case I took the least secure of them, the one that offered me the most flexibility while doing development work: I enabled Docker volume support, which lets me mount arbitrary host paths into a container. The better way to do this would be to create a Nomad Host Volume, but those exist in every namespace concurrently. If there were a way to make Host Volumes, which provide a named handle to a specific base directory, exist in only a single namespace, then this would be easy: you could create the volume in the namespace, restrict access to that namespace, and call it a day. This may be possible using a combination of wildcard deny rules and explicit allow rules, but that is unclear from Nomad’s documentation and is something I want to play with further.
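
Concretely, the relaxation is one small block in the Nomad client configuration, shown here alongside what the Host Volume alternative would look like (the volume name is just a placeholder):

# Let the Docker driver bind-mount arbitrary host paths into containers.
plugin "docker" {
  config {
    volumes {
      enabled = true
    }
  }
}

# The Host Volume alternative, which today is visible from every namespace.
client {
  host_volume "hostroot" {
    path      = "/"
    read_only = false
  }
}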

Once we have the ability to mount the host filesystem, the rest is pretty straightforward. All we need to do is mount it into the container, use the chroot connection plugin to access it, and set ansible_host to wherever we mounted it. This makes the entire system reasonably transparent. To make it even more transparent, we run the Docker container in privileged mode, which removes most of the security restrictions between the container and the host. Finally, we run the container in the host’s network namespace so that it sees what the host sees. The net effect is a portable runtime for Ansible that has full access to the host to actually configure it.

As near as I can tell, this is side-effect free with one notable exception: the ansible_virtualization_type fact always comes back as docker. I’m fairly confident this is a fingerprinted fact picking up the environment the container spawns with rather than anything the chroot connection does, but since I don’t use the fact anywhere and it hasn’t broken anything, fixing the misdetection hasn’t moved up the priority list yet.

Putting all of this together, we can arrive at the following Nomad jobspec, which makes Ansible just another task that Nomad is aware of:

job "ansible" {
  name = "ansible"
  type = "sysbatch"
  datacenters = ["MATRIX-CONTROL", "MATRIX"]

  parameterized {
    payload = "forbidden"
    meta_required = ["COMMIT", "ANSIBLE_PLAYBOOK"]
    meta_optional = []
  }

  group "ansible" {
    count = 1

    network { mode = "host" }

    task "ansible" {
      driver = "docker"
      config {
        image = "registry.matrix.michaelwashere.net:5000/ansible/ansible:${NOMAD_META_COMMIT}"
        network_mode = "host"
        privileged = true
        command = "/ansible/venv/bin/ansible-playbook"
        args = [
          "-D", "${NOMAD_META_ANSIBLE_PLAYBOOK}",
          "-c", "community.general.chroot",
          "-e", "ansible_host=/host",
          "--limit", "${node.unique.name}",
        ]
        volumes = ["/:/host"]
      }
    }
  }
}

This short (35 lines!) file lets us deploy Ansible across any machine that Nomad is aware of, with the full scheduling speed that Nomad has. If you aren’t aware, Nomad is Very Fast at scale. One of my biggest complaints with Ansible is that it’s slow. That slowness is bad for iteration, and if you can’t iterate quickly or re-apply your configuration data at speed, you’re likely to end up in a world of slowly drifting configs. Since the runtime for any of my Ansible playbooks is now effectively however long the slowest host takes, I can run my playbooks on a timer to ensure no configuration has drifted. This doesn’t hold a candle to immutable infrastructure systems like The ResinStack, but it’s still a much better situation than most production systems I’ve seen in my career, and it gets there without any clunky external scheduler beyond Nomad.

Sure, you could achieve this kind of workflow with Puppet or Chef, but you’d have to install additional daemons, additional scheduling components, and additional runtime components to make it happen. By using Nomad as the execution scheduler for Ansible, I’m able to have supervised, daemon-style Ansible runs at no additional cost, while leveraging all the logging, monitoring, and access control systems I already put in place for the Nomad cluster that runs the applications I actually care about. This last part is extremely important: I often see infrastructure engineering teams getting lost in technology for technology’s sake, but somewhere there is a hopefully revenue-generating application that all of your infrastructure exists to run, and anything that isn’t necessary to run that service is overhead - including your configuration management.

Best of all, the user experience here is extremely simple. I have a git hook that builds the container, pushes it to an internal registry, and then asks Nomad to apply it. If I want to re-apply changes that I’ve already run, I can use either the CLI or Nomad’s web interface to do so:

Nomad Dispatch Interface showing Ansible

All that’s needed to dispatch the job is the commit and the playbook I want to run. I intentionally did not give these defaults, because I don’t want to get into the habit of using this interface. Manually dispatching Ansible should be something I do to resolve a problem, not a standard workflow.

This brings us to the final challenge with running Ansible in this Nomad-supervised workflow: observing it. With Ansible no longer attached to a terminal on a machine I’m sitting in front of, how can I get the logs, the status of the plays, and information about problems that may have happened during the run? There are lots of solutions to this problem, and the logs are in fact already captured by syslog, which is streamed to my cluster-wide logging system. For Ansible, however, I want something more like the status readout at the end of a terminal run, which provides quick at-a-glance feedback.

Eagle-eyed readers will already know my answer to this, because the configuration for it is in the Dockerfile above. I chose Ara, which works as a callback plugin for Ansible and not only gives me a way to observe any changes made by the automated system, but is also configured to keep track of playbooks I run by hand.

We’ve covered a lot of ground in this post, and hopefully you’ve learned something new: a different way you can run your Ansible playbooks, that Nomad isn’t that complicated, or that Ara provides a great single pane of glass for seeing when configuration was last applied. In the next post I’ll look at the core services in my cluster that provide HTTP routing, logging, and container management.