Starchart 0.7

Today we're shipping Starchart v0.7.  This felt like a big one, so I wanted to discuss what we did. I'd say the theme of this release is "2 steps forward, 1 step backward."  I'll discuss some of the highlights.

CI/CD and Staging

First, we now have a working CI/CD pipeline to staging!  This has taken much longer than I anticipated, and required all sorts of people to lend a hand (thank you to many members of the team who helped me, to Mehrdad in ITS, Chris Tyler, and even ChatGPT!).  Our GitHub Actions Workflow runs all of our tests and static analysis tools, then builds and pushes a Docker image to the GitHub Container Registry.  We then use a webhook from GitHub Actions to staging to trigger Docker to update the running Docker swarm to the new image tag.

ITS asked us to see if we could avoid sending secrets over the network, which we're doing via an HMAC SHA-256 digest on the webhook body.  We have the same encrypted secret on GitHub Actions as well as the staging server, and use it to calculate our digest, which has to match for the webhook to get processed.  It works really smoothly.
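
For readers who haven't used this pattern, here is a minimal sketch of signing and verifying a webhook body with HMAC SHA-256 using Node's built-in crypto module. The function names (`signBody`, `verifyBody`) and the hex encoding are assumptions made for illustration; they are not taken from Starchart's actual workflow or staging code.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Sender side: compute a hex-encoded HMAC SHA-256 digest of the raw body.
// (Hypothetical helper name; the shared secret itself never goes over the wire.)
function signBody(secret: string, body: string): string {
  return createHmac('sha256', secret).update(body, 'utf8').digest('hex');
}

// Receiver side: recompute the digest with the same secret and compare.
function verifyBody(secret: string, body: string, receivedDigest: string): boolean {
  const expected = Buffer.from(signBody(secret, body), 'hex');
  const received = Buffer.from(receivedDigest, 'hex');
  // timingSafeEqual throws if the lengths differ, so check that first.
  if (expected.length !== received.length) {
    return false;
  }
  // Constant-time comparison avoids leaking how many bytes matched.
  return timingSafeEqual(expected, received);
}
```

The sender would attach the digest to the request (for example in a header), and the receiver only processes the webhook when `verifyBody` returns true for the exact bytes it received.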

A major benefit of having a live CI/CD pipeline like this is that you can more easily test features as they land.  But testing leads to finding bugs, and this week we found lots of bugs!  For example, we learned that Let's Encrypt won't support underscores in domain names (our AWS Route53 Hosted Zone had _stage_ in it).  We also learned that our SMTP setup for email notifications, which works great in development, isn't working against Office365.

Reconciling our Tech Debt

Another thing that testing does is to reveal inadequate architectural choices.  Imagine building a bridge and working from both ends simultaneously toward the middle.  Eventually, a small error on one side is going to reveal itself in the middle when things don't line up the way they should.

We experienced this when we landed our Let's Encrypt certificate order code this week and tried to use it for the first time on staging.  Denes and I were excited to witness this small victory live, but the server logs told us a different story: our DNS data model and internal API won't work the way we wanted.  This realization has led us to move to a new approach, which Denes spent the weekend hacking together.

Our DNS works across two systems: 1) we store DNS record data in MySQL; 2) we deploy those records to AWS Route53.  Previously, we did atomic DNS changes.  If someone wanted to add a DNS record, we issued a request, waited for Route53 to sync the change, then updated our MySQL database.  However, we're going to move to a kind of "Virtual DOM" approach.  In much the same way that React applies state changes to a virtual DOM and diffs it to work out what to update in the real DOM, we're going to maintain our DNS "state" in MySQL and reconcile that with Route53.  We take the state of the database and our hosted zone's records and do a comparison: if a record exists in Route53 that's not in MySQL, it needs to get deleted; similarly, if a record exists in MySQL but is missing from (or different in) Route53, it needs to get created or updated to match.

Using this new DNS "reconciler," we'll continually sync the state of our database with Route53 in the background, allowing us to simplify the app in the foreground, where we can work exclusively with the database as our "source of truth" and ignore Route53 completely--eventually Route53 will match our database.  It will also allow us to quickly reflect the database into Route53 when we make a move (e.g., when we moved to a new hosted zone this week to get rid of the _stage_ name).
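
The comparison step described above can be sketched as a pure function over two lists of records. This is only an illustration, not Denes's implementation: the `DnsRecord` shape, the `reconcile` name, and the choice to identify a record by name plus type (with a single value each) are all assumptions made to keep the example small.

```typescript
// Hypothetical record shape; the real Starchart data model may differ.
type DnsRecord = { name: string; type: string; value: string };

type ChangeSet = { toUpsert: DnsRecord[]; toDelete: DnsRecord[] };

// Assumption for this sketch: a record is identified by its name and type.
const keyOf = (r: DnsRecord): string => `${r.name.toLowerCase()}|${r.type}`;

// desired = records in the database (source of truth)
// actual  = records currently in the hosted zone
function reconcile(desired: DnsRecord[], actual: DnsRecord[]): ChangeSet {
  const desiredKeys = new Set(desired.map(keyOf));
  const actualByKey = new Map(actual.map((r) => [keyOf(r), r] as const));

  // In the database but missing from, or different in, the zone: create or update.
  const toUpsert = desired.filter((r) => actualByKey.get(keyOf(r))?.value !== r.value);

  // In the zone but not in the database: delete.
  const toDelete = actual.filter((r) => !desiredKeys.has(keyOf(r)));

  return { toUpsert, toDelete };
}
```

A background job could call something like this on a schedule and turn the resulting change set into Route53 requests.  A real reconciler would presumably also need to skip records it doesn't own, such as the hosted zone's own NS and SOA records, which this sketch ignores.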

Making this overhaul at such a late stage of development is a concern.  Do we have enough time to properly shift all of our code over to the new architecture, fix the bugs it causes, and also complete the other existing work we have?  It's a great question!  I'm fairly confident that we can, but it's going to require a bunch of co-ordination across the team to sort out all the various systems that need updating. Changes of this scale aren't without risk and cost.

Another side-effect of doing this change is that it's going to require us to delete a huge amount of code and tests that Won and others worked hard to write over the past 2 months.  I know from experience that it can feel pretty demoralizing to watch your hard work get flushed down the drain.  "What was the point of all that?" And yet, without that amazing code having been written in the first place, there's no way that we could have come to understand what our design should actually look like.  Brooks says it better than I can:

“Where a new system concept or new technology is used, one has to build a system to throw away, for even the best planning is not so omniscient as to get it right the first time. Hence plan to throw one away; you will, anyhow.”  --Fred Brooks, "The Mythical Man-Month"

Like all project leaders, I wish I'd figured this out sooner. What if I'd insisted on more integration testing earlier in the process?  Would we be at this stage now, or would it have dragged the inevitable out longer?  The truth is, this always happens. Real software projects are full of this stuff and above all, I'm trying to give the students an authentic experience of real-world, open source software development.  I'd say this is "mission accomplished!"

Other Notable Changes

I took a stroll through the changesets from 0.6.0 until today, and they're packed with amazing work by the team.  Here's some of what I saw:

  • Won's added code to limit the number of records a user can create, and fixed a bug where we counted the Let's Encrypt TXT records in that value by mistake.
  • Tymur fixed the UI buttons for the DNS Records so they're properly disabled when they shouldn't be usable, and added validation to our create/edit form, which is amazing.
  • Eakam chipped away at many of our front-end "papercuts": fixing the styling on the Renew button in the DNS Records table, improving the front-page's accessibility, and creating a first-run experience when users create their first DNS Record.  We also got new icons and a favicon added (goodbye Remix logo!) and fixed some HTML issues that Lighthouse exposed.
  • Eakam and Chen-Yuan worked a bunch on our tests and test infrastructure, and we added support for Playwright's new and very cool --ui flag.
  • TD wrote an epic PR to rename DNS Domain to DNS Record across the entire tree, all APIs, data model, etc. to standardize our naming.  Won did something similar with the terms Subdomain and FQDN (fully-qualified domain name). We've needed both of these for a while.  He also got our notifications wired up to the DNS worker, which I discussed above.
  • Stefan got our SLO flow finalized and fixed a bad bug we had with overly strict cookie settings.  The SAML stuff on staging has been working amazingly well!
  • I rewrote our DNS worker flow so it properly dealt with error cleanup, and we learned a lot about BullMQ's flows and error patterns this week.
  • Denes finished the Let's Encrypt certificate order flow and landed the first version of our DNS reconciler code, which I discussed above.
  • Mario's been working to get the front-end pieces of the certificate order logic landed, which he demoed on GitHub this week. I think it will land in 0.8.
  • Dependabot updated a bunch of our deps.  Good bot!

Toward 0.8

This week we're going to focus on paying down the tech debt I outlined above, and also try to get the last of our main features in place. Stefan, Tymur and others are working on adding an Admin UI and the ability for admin users to assume the role of a regular user.  Mario, Eakam and others are finalizing our UI to add missing copy, include instruction pages for users, etc.  We also need to add the logic for disabling a user, finish implementing all the expiration and notification logic, add error UI, and dozens of other small things.

There's lots to do, but also lots that's already done.  We're making good progress.