Programmatic Process Groups with runit

Fast integration tests are the best. Being in the flow with unit tests is good, but I love having that same rapid iteration cycle with integration tests. And I think it's better to exercise as much real code in them as possible. Why rely on mocking, especially for your own services, if you can run the real thing quickly? So I built a mechanism for launching groups of arbitrary binaries and Docker containers quickly and simultaneously.

I mentioned this in the previous post Serverless Rust Testing. I got some pointed questions about the runit mention in particular so I'll focus on that here.

Suppose you want integration tests that need multiple processes. For example, you want to run real service endpoints, real supporting processes like Redis, and/or Docker containers in your tests as a logical group. You want to launch as many groups as needed, all simultaneously and very quickly. Instead of over-relying on mocks, you can exercise the real code without a significant time penalty in a lot of cases.

What I built is a bit like Docker Compose except:

It reserves local ports on the fly so that groups running at the same time never conflict with each other.
It doesn't require Docker. This is better for fast iteration especially on OSX because we can use the local, incrementally-compiled binaries that will be tested.
It dynamically generates configuration files to feed into whatever it's launching.
It is quite fast.

I wanted hundreds of these groups to be launched at once, one group per test, without any conflicting files or ports. I needed it to be portable to at least OSX and a number of Linux variants. And painless to install. It should also never leak processes (and if it ever did, things should be trivial to clean up after testing, even programmatically).

Instead of writing all the process management myself (a bigger PITA than you might think), I experimented with process watchdogs like runit, daemontools, systemd, Supervisor, immortal, etc.

I ruled out some options because I wanted something that was just a binary (or a simple collection of them like with runit) and didn't involve any complicated installation steps or dependencies. That way, it could run on any laptop or CI node more easily. (It turns out I will use this group mechanism to kick off processes in production in some places, but that is a story for another time. 😏)

Some of them introduce a fixed waiting period before checking on the child processes and declaring them up and running. In Supervisor's case, it would've meant every test artificially took at least one second longer, and usually more.

Others were ruled out because of pains running without root and/or expectations of files in /etc/.

The two candidates I tested more extensively were runit and immortal. I think either could have worked out but runit is more time-tested and has a better story with config files and running many, many groups of them that are launched, queried, and stopped independently and simultaneously.

I'll talk about this in terms of group create, start, stop, and terminate. You can do whatever you need to do, obviously. This functionality is accessible from binaries (for debugging and to launch groups in production) and from direct library calls (in my case, called from each integration test via a Rust macro that generates boilerplate wrapping the test logic in the setup/teardown code).

The group create functionality takes a directory that has a group.toml file and any other configuration file templates that will be needed by any of the processes. It returns a unique ID (let's call it $group_id) and generates a unique ID for each process in the group. It's important the group ID is unique to the node because of the way the start and stop functionality will work.

The template files are copied into a unique directory per group (let's call it $group_dir, which in my implementation is exactly equal to the $group_id). Ports are reserved, via a SQLite database, for any process that indicates it needs one.
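To make the port reservation idea concrete, here is a minimal sketch in Python. It is not the real implementation: the reserved_ports table, its columns, and the reserve_port function are made up for illustration. It asks the OS for a free port by binding to port 0, then records the claim in SQLite so that a second claim on the same port fails.

```python
import socket
import sqlite3


def reserve_port(db_path, group_id, process_id, attempts=50):
    """Reserve a free local port and record it for one process in a group.

    The table name and schema are hypothetical, chosen for this sketch.
    """
    conn = sqlite3.connect(db_path, timeout=10)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS reserved_ports ("
            "port INTEGER PRIMARY KEY, "
            "group_id TEXT NOT NULL, "
            "process_id TEXT NOT NULL)"
        )
        conn.commit()
        for _ in range(attempts):
            # Binding to port 0 asks the OS for a port that is free right now.
            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
                sock.bind(("127.0.0.1", 0))
                port = sock.getsockname()[1]
            try:
                # The PRIMARY KEY rejects a second claim on the same port,
                # even when it comes from another test process.
                with conn:
                    conn.execute(
                        "INSERT INTO reserved_ports VALUES (?, ?, ?)",
                        (port, group_id, process_id),
                    )
                return port
            except sqlite3.IntegrityError:
                continue  # already reserved by someone else; try another
        raise RuntimeError("could not reserve a port")
    finally:
        conn.close()
```

The table only protects against other users of the same table. Something unrelated on the machine could still grab the port between the probe socket closing and the real process binding it, so treat this as a reservation among cooperating tests rather than a guarantee.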

There are a few other dynamic substitutions based on the configs in group.toml. I'll spare you the details and syntax of all that; the main point is that many different tests can re-use the same group config template and be run simultaneously (because of dynamic port reservation in particular). Each of those tests might want the same process setup but each is testing a different area of the logic or whatever so you want different tests and want each of them to start with a clean slate. Not one big monster.

During create, we write out two runit-specific files for each process:

$group_dir/$process_dir/run
$group_dir/$process_dir/log/run

In the examples here, let's say the group id is pgrp-123 and the process id is proc-456. (In the real implementation, I used nice, readable IDs with a UUID component to ensure uniqueness.)

The pgrp-123/proc-456/run file:

#!/bin/bash

exec 2>&1
exec /path/to/redis --port 13342

The pgrp-123/proc-456/log/run file:

#!/bin/bash

exec svlogd -tt /path/to/logs/pgrp-123/proc-456

You can add whatever arguments the process needs in the main run file. The log/run file should be exactly what is there, the only variable being what directory you want the logs dumped in.

So, each process in the group gets one subdirectory and its own customized run and log/run files. After those are all written out to the filesystem, the group is good to go.
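Writing those two files can be sketched like this, again in Python and with hypothetical names (write_service_files and its parameters are not from the real implementation). Two details are easy to miss: both scripts need the executable bit set for runsv to run them, and svlogd expects its log directory to already exist.

```python
import os
import shlex


def write_service_files(group_dir, process_id, command, log_dir):
    """Write runit's run and log/run scripts for one process in a group.

    `command` is the argv list for the process, e.g.
    ["/path/to/redis", "--port", "13342"].
    """
    process_dir = os.path.join(group_dir, process_id)
    os.makedirs(os.path.join(process_dir, "log"), exist_ok=True)
    # svlogd expects the target directory to exist before it starts.
    os.makedirs(log_dir, exist_ok=True)

    quoted = " ".join(shlex.quote(arg) for arg in command)
    run_script = "#!/bin/bash\n\nexec 2>&1\nexec " + quoted + "\n"
    log_script = "#!/bin/bash\n\nexec svlogd -tt " + shlex.quote(log_dir) + "\n"

    scripts = (
        (os.path.join(process_dir, "run"), run_script),
        (os.path.join(process_dir, "log", "run"), log_script),
    )
    for path, body in scripts:
        with open(path, "w") as handle:
            handle.write(body)
        # runsv executes these files directly, so they must be executable.
        os.chmod(path, 0o755)
    return process_dir
```

Calling it with group_dir /path/to/pgrp-123, process_id proc-456, and the Redis command from the example produces the same two files shown above.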

To start the group:

runsvdir /path/to/pgrp-123

The runsvdir program "starts and monitors a collection of runsv(8) processes".

It starts a runsv process for every subdirectory found in the given directory. This is why we've laid out the $group_dir to have one subdirectory per process, each with its own run file. runsvdir will scan through each subdirectory and kick off runsv per process.

Note that this will create a process on your system with runsvdir /path/to/$unique_group_id, a fact we will use to stop it.

To stop the group, find the runsvdir process with the unique group ID in the running processes list. Get its PID and run this to stop the group:

kill -1 $PID

That's a one (as in SIGHUP) not an L. That brings the whole group down gracefully.
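Done programmatically, the lookup and signal might look like the following Python sketch. The function names are hypothetical, and it assumes the group path contains no whitespace. It compares whole arguments instead of substrings so that pgrp-123 never matches pgrp-1234.

```python
import os
import signal
import subprocess


def find_runsvdir_pid(ps_output, group_dir):
    """Return the PID of the runsvdir process watching group_dir, or None.

    Expects lines of the form "<pid> <command> <args...>".
    """
    for line in ps_output.splitlines():
        fields = line.split()
        if (
            len(fields) >= 3
            and fields[0].isdigit()
            # argv[0] may be a bare name or a full path to the binary.
            and os.path.basename(fields[1]) == "runsvdir"
            # Exact argument match, so similar group IDs do not collide.
            and group_dir in fields[2:]
        ):
            return int(fields[0])
    return None


def stop_group(group_dir):
    """Send SIGHUP to the group's runsvdir. True if a process was signalled."""
    ps_output = subprocess.run(
        ["ps", "-A", "-o", "pid=", "-o", "args="],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    pid = find_runsvdir_pid(ps_output, group_dir)
    if pid is None:
        return False
    os.kill(pid, signal.SIGHUP)
    return True
```

Keeping the parsing separate from the ps call makes the matching logic easy to test without runit installed.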

Let's say your test fails. Your code can stop the group, choose to skip file cleanup, and have a human come debug things at their leisure. At that point, the person can start it back up with group start if the logs weren't sufficient for debugging.

The terminate functionality for me is to make sure the group has been stopped, release the network port 'reservation', and clean up all the files.
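As a rough Python sketch of that sequence (all names here are hypothetical, including the reserved_ports table with a group_id column, which is assumed to exist already): stop first, then release the ports, then delete the files.

```python
import shutil
import sqlite3


def terminate_group(group_dir, group_id, db_path, stop_and_wait):
    """Stop a group, release its port reservations, and remove its files.

    `stop_and_wait` is a caller-supplied function that stops the group and
    blocks until its processes are gone. The `reserved_ports` table is an
    assumption of this sketch.
    """
    # 1. Make sure nothing is still running out of the directory.
    stop_and_wait(group_dir)

    # 2. Release every port this group had reserved.
    conn = sqlite3.connect(db_path)
    try:
        with conn:
            conn.execute(
                "DELETE FROM reserved_ports WHERE group_id = ?", (group_id,)
            )
    finally:
        conn.close()

    # 3. Only now remove the run scripts and supervise directories.
    shutil.rmtree(group_dir, ignore_errors=True)
```

The ordering is the point: the files go away last, after the processes that depend on them have stopped.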

That's it. You should be able to create a system like this now. Let me know if something is unclear or doesn't work out.

One more thing to make it better:

I ended up adding an optional step to the group start functionality that makes sure every process is running before the start call returns. This isn't strictly necessary, given there's boilerplate around calling the services under test that waits until they're responding (or times out). But it's cleaner to sanity check first, and it may be of interest to you.

And there's the reverse idea to enhance group stop: waiting to make sure the signal sent to runsvdir has actually stopped all of your group processes before the stop call returns.

These are both possible because the status for each process can be queried given the path to the unique $group_dir/$process_dir.

I represent the result of each query as: Down, Fail, HasNeverRun, Run, Unknown.

First check for the existence of a $group_dir/$process_dir/supervise subdirectory as a shortcut. If it is not present, that is HasNeverRun. It means runit has never seen the process before.

Remember that we started the whole group with runsvdir, which launched a runsv for each of the processes in the group. Now we will use a third binary from runit: sv. The sv program lets you query or manage one runsv at a time. In other words, we can check up on each process.

sv status /path/to/pgrp-123/proc-456

This will print a line beginning with down, run, or fail which I map accordingly. The Unknown mentioned before is for completeness in case something unexpected happens, etc.
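Putting the supervise shortcut and the sv status mapping together, a Python sketch might look like this. The enum and function names are hypothetical. The parser only looks at the word before the first colon, which is all the mapping needs.

```python
import enum
import os
import subprocess


class Status(enum.Enum):
    DOWN = "Down"
    FAIL = "Fail"
    HAS_NEVER_RUN = "HasNeverRun"
    RUN = "Run"
    UNKNOWN = "Unknown"


def parse_sv_status(output):
    """Map the first word of `sv status` output to a Status."""
    stripped = output.strip()
    if not stripped:
        return Status.UNKNOWN
    state = stripped.splitlines()[0].split(":", 1)[0].strip()
    mapping = {"run": Status.RUN, "down": Status.DOWN, "fail": Status.FAIL}
    # Anything unexpected falls through to UNKNOWN.
    return mapping.get(state, Status.UNKNOWN)


def process_status(process_dir):
    """Query runit for the state of one process directory."""
    # Shortcut: no supervise/ directory means runsv has never seen it.
    if not os.path.isdir(os.path.join(process_dir, "supervise")):
        return Status.HAS_NEVER_RUN
    try:
        result = subprocess.run(
            ["sv", "status", process_dir],
            capture_output=True,
            text=True,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return Status.UNKNOWN
    # Fall back to stderr in case this sv build reports there (an assumption).
    return parse_sv_status(result.stdout or result.stderr)
```

Note that sv status exits non-zero for some states, so the sketch reads the output text and ignores the return code.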

For group start, the nice extra functionality is to wait for each process to get to the Run state. For stop, you can wait for the opposite, or alternatively make sure the PID for your unique group is no longer present in the OS process list.
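The waiting itself is a small polling loop. Here is a Python sketch with hypothetical names. It takes any function that returns the current state of every process in the group, so the same loop serves both start (wait for Run) and stop (wait for Down). The timeout and interval defaults are arbitrary.

```python
import time


def wait_for_state(get_states, wanted, timeout=10.0, interval=0.05):
    """Poll until every process reports `wanted`, or the timeout passes.

    `get_states` is a caller-supplied function returning a list with the
    current state of each process in the group. Returns True on success
    and False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if all(state == wanted for state in get_states()):
            return True
        if time.monotonic() >= deadline:
            return False
        # A short interval keeps the added latency per test small.
        time.sleep(interval)
```

Because the loop checks before it sleeps, a group that is already in the wanted state returns immediately with no added delay.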

I found that runit would freak out about the files going away if I terminated the group (which removes all the files and directories) without doing the graceful stop first. And it should freak out, that is fine. So, group stop is always called with that wait baked in.

Again, let me know if something is unclear or doesn't work out.