Self-hosting our CI, three months later
A couple of folks have written in over the past month asking how my experiments in self-hosting GitHub Actions runners have held up. And three months seems like a reasonable amount of time to post a follow-up.
I'll also treat this as a bit of a first draft for a longer engineering post on the Buttondown blog itself — so please write in if there's something you wish I'd covered here.
The short version: the experiment graduated into the plan. That March post ended on a half-joking aside — "maybe we're just going to self-host the entire CI suite" — and, reader, we did.
When I wrote that post back in March, I called out two things that were true at the time:
- We were continuing to use Blacksmith for our heaviest suite of jobs — the backend test suite.
- We had touched essentially nothing since setting it up; it just ran.
Neither is quite true anymore, and (as you might expect) the ways each stopped being true turn out to be somewhat intertwined.
A month or so ago, on a lark and off the back of a Blacksmith incident, I decided to try running the entirety of CI on Pythia just to see what would happen. As is often the case, experiments are really just preludes to decisions; what seemed fairly promising (a 30% slowdown on a quote-unquote hot server, in exchange for saving a hundred or so bucks a month) felt good enough to commit to.
To put some numbers on it: that January Blacksmith bill was $300, and by the time I got around to this it had crept up toward $400 as the team merged more. The backend suite was the single fattest line item — around $100/month on its own. Moving it over was the last domino; our Blacksmith bill now rounds to zero.
The 30% is less scary than it sounds in the body of an actual workday. The backend suite that used to go green in around six minutes on Blacksmith's 8-CPU runner now lands closer to eight. You notice it; you do not particularly care.
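If you want to sanity-check that tradeoff for your own setup, the arithmetic is small enough to sketch. The six-minute baseline, the 30% slowdown, and the roughly $100/month are the figures from this post; the run count is a made-up placeholder, not something I measured, so swap in your own.

```python
def slowed_minutes(baseline_minutes: float, slowdown_pct: float) -> float:
    """Wall-clock time of one run after applying a percentage slowdown."""
    return baseline_minutes * (1 + slowdown_pct / 100)


def extra_wait_hours(baseline_minutes: float, slowed: float, runs_per_month: int) -> float:
    """Total extra waiting per month, in hours, summed over every run."""
    return (slowed - baseline_minutes) * runs_per_month / 60


def dollars_saved_per_wait_hour(monthly_savings: float, wait_hours: float) -> float:
    """How many dollars each extra hour of waiting buys back."""
    return monthly_savings / wait_hours


if __name__ == "__main__":
    RUNS_PER_MONTH = 300  # hypothetical placeholder, not a measured number

    slowed = slowed_minutes(6, 30)  # six-minute suite, 30% slower
    hours = extra_wait_hours(6, slowed, RUNS_PER_MONTH)
    rate = dollars_saved_per_wait_hour(100, hours)  # ~$100/month backend line item

    print(f"per run: {slowed:.1f} min")
    print(f"extra waiting: {hours:.1f} h/month")
    print(f"saved per hour of waiting: ${rate:.2f}")
```

The last number is the one worth arguing about: it depends entirely on how many runs you do and on whether anyone is actually blocked while the suite runs.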
The flip side is that it's very easy to unintentionally starve yourself.
Just like in the initial post, we have five GitHub Actions runners running in parallel on a single box. This works well for our smaller jobs, which are IO-bound and don't really use multi-threading in any real capacity. But we do still have very big jobs, and poor Pythia suffers the consequences when we try to run three or so backend test jobs simultaneously.
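GitHub Actions helpfully provides a `concurrency` setting, but it's a bit finicky to get working in exactly the right way. Set at the workflow level, it applies to the whole workflow and not to a single job. To enforce the concurrency I actually want, I'd have to either isolate the backend tests into their own workflow or move the setting down onto the job itself (`jobs.<job_id>.concurrency`), and even then a concurrency group lets only one run proceed at a time, with older pending runs getting canceled.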
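For what it's worth, the policy I'm after is easy to state outside of Actions: every job needs a runner slot, and heavy jobs additionally need one of a small number of "heavy" permits. Here is a minimal sketch of that idea in Python, as a toy simulation rather than anything wired into our CI. The five slots match the five runners on the box; the cap of two heavy jobs, the job names, and the durations are all hypothetical.

```python
import asyncio
from contextlib import asynccontextmanager

RUNNER_SLOTS = 5   # five runners on one box, as described in this post
HEAVY_JOB_CAP = 2  # assumption: illustrative cap, not a tuned value


@asynccontextmanager
async def admitted(heavy, slots, heavy_gate):
    """Hold a runner slot; heavy jobs take a heavy permit first."""
    if heavy:
        # Take the permit before the slot so a queued heavy job
        # does not sit on a runner while it waits.
        async with heavy_gate:
            async with slots:
                yield
    else:
        async with slots:
            yield


async def run_job(name, heavy, seconds, slots, heavy_gate, stats):
    """Simulate one job and track how many heavy jobs overlap."""
    async with admitted(heavy, slots, heavy_gate):
        if heavy:
            stats["heavy_now"] += 1
            stats["heavy_peak"] = max(stats["heavy_peak"], stats["heavy_now"])
        await asyncio.sleep(seconds)  # stand-in for the real work
        if heavy:
            stats["heavy_now"] -= 1
        stats["finished"].append(name)


async def simulate(jobs):
    """Run (name, heavy, seconds) jobs under both limits."""
    slots = asyncio.Semaphore(RUNNER_SLOTS)
    heavy_gate = asyncio.Semaphore(HEAVY_JOB_CAP)
    stats = {"heavy_now": 0, "heavy_peak": 0, "finished": []}
    await asyncio.gather(
        *(run_job(n, h, s, slots, heavy_gate, stats) for n, h, s in jobs)
    )
    return stats


def run_simulation(jobs):
    """Synchronous entry point for the simulation."""
    return asyncio.run(simulate(jobs))


if __name__ == "__main__":
    jobs = [(f"backend-{i}", True, 0.02) for i in range(4)]
    jobs += [(f"lint-{i}", False, 0.01) for i in range(6)]
    result = run_simulation(jobs)
    print("peak heavy jobs at once:", result["heavy_peak"])
```

The point is only that the rule I want is a counter, not a lock. Whether it maps cleanly onto Actions is a separate question.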
All of this is a very in-the-weeds way of saying: migrating the whole gestalt of our CI onto a dinky little box in my office has resulted in tiny amounts of pain that could probably go away if I spent an afternoon thinking really hard about the problem.
And then there's the objection everyone reaches for first: you've put all of your CI behind a single point of failure, and that single point of failure is a box in your office. Which is true! But that single point of failure resides in my office, which — notably — is not the same thing as residing in my head. The total number of times I have had to bike over to the office and restart the server over the past 90 days has been exactly one.[^bike]
[^bike]: It would be disingenuous not to also cop to the other time this happened — except that one doesn't count, because I was already at the office.
But the fact that I've been able to save the team and myself almost $400 a month — with very, very little work, and no marginal cost — has been a bit of a revelation. I highly encourage you to spend an afternoon giving it a shot yourself.