How I debugged chaos in our CI pipeline

The story of how one seemingly innocent PR caused all hell to break loose in our CI/CD flows.

Just about a week ago, a colleague of mine pulled me in to have a look at some weird issues he was seeing.
Our CI pipeline, specifically a GitHub Action that spins up a Docker container and verifies that the project can be built, was failing.

He knew I had spent some time uncovering deep TypeScript issues before and thought I might be able to help.
Sure, I’m always happy to help a colleague out! And I get to exercise the debugging muscles again.

In short, his PR introduced a new take on some of our infrastructure: a working prototype of getting trigger.dev up and running. Very cool stuff.

He was puzzled, to say the least.

The build was working locally, but as soon as the GitHub Action ran to build it, it failed with errors neither of us had ever seen before.

Worse yet, the errors seemed to change on every build! This caused a tremendous amount of confusion.

The Iceberg of symptoms

My debugging muscles immediately kicked into overdrive and, almost automatically, I started thinking about the problem:

“If a build is working in one place, but not in another, that seems like the environments must be different”.

My colleague and I were working on this in tandem (no, he didn’t just hand it over to me), so we could cover more ground and get out of the office sooner rather than later.

He was taking a thorough look at whether the code changes could have introduced some unexpected side effects, type issues for example.

While he focused on that, I hypothesized that the GitHub Action must somehow be producing a different environment than our local one, simply because it was producing different results.
To me, it seemed like the only logical explanation.

So that raises some questions:

  • How are the environments different?
  • Why are we producing different environments?

And still to be answered: if this is an environment issue, why are we seeing different errors on every build?

To make matters worse, why would it be a different build at all, if we are installing with the best-practice --frozen-lockfile flag?

Great questions.

In essence, this was one of those very frustrating issues that often seem hopeless to go through.
But hey, it was Friday, I had finished my own tasks early and was in a good mood, so I decided to give it a crack.

For lack of a better term (at least that I know of), I’ve called this the iceberg of symptoms. Very much like an iceberg that’s light on top and hides most of its mass under the surface, I hypothesized that the errors we were seeing were symptoms of something else being wrong.
Some other, more major issue, hiding under the surface.

When debugging something like this, it’s extremely important not to jump to conclusions, and instead to treat every possible explanation as a hypothesis, to keep the right mental model of things. Like any hypothesis, it needs to be verified. That is roughly done by:

  • Eliminating variables around the issue
  • Consistently producing the expected outcome from the same input

Given that we saw errors related to, among other things, TypeScript and other dev tools, I decided to verify my hypothesis with a few simple steps:

  1. Pin the version(s) of any library producing an error to a previously known working version
  2. Try to typecheck and build the project at least twice
  3. Avoid making any other changes to the codebase

The important bit here is that we pinned to a previous version, not a newer one. This is done in order to isolate variables in the debugging process.
Since we know that the previous versions worked, we should expect them to work again. If they don’t, the hypothesis is not verified.

This is an important step, because if we can’t verify the issue, we are simply treating symptoms – the top of the iceberg.

Why are we looking at packages instead of code?!

So how on earth did I even land on this being the issue?

Without having verified this yet, I was working with a confused colleague. But as in any healthy workplace, he trusted me to handle my own attempts well, same as I trust him.

Well, first, the discrepancies between the CI build and the local build tipped me off. As is common in JavaScript/TypeScript land, we declare versions something like: "typescript": "^5.8.2".
The ^ symbol is important, because it has a specific behaviour. It allows a package to be upgraded to newer patch and minor versions (according to semver), but not to a new major version.
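To make the caret behaviour concrete, here is a minimal sketch of the rule in TypeScript. The function names are mine, and it assumes plain major.minor.patch versions with no pre-release tags, so treat it as an illustration rather than how pnpm actually resolves ranges.

```typescript
// Simplified caret-range check. Assumes plain "major.minor.patch" versions
// without pre-release or build tags; real semver handles more cases.
type Version = [number, number, number];

function parseVersion(input: string): Version {
  const match = /^(\d+)\.(\d+)\.(\d+)$/.exec(input);
  if (!match) {
    throw new Error(`Unsupported version: ${input}`);
  }
  return [Number(match[1]), Number(match[2]), Number(match[3])];
}

// Negative when a < b, zero when equal, positive when a > b.
function compareVersions(a: Version, b: Version): number {
  for (let i = 0; i < 3; i++) {
    if (a[i] !== b[i]) return a[i] - b[i];
  }
  return 0;
}

// True when `candidate` is allowed by the caret range `range`.
function satisfiesCaret(candidate: string, range: string): boolean {
  if (!range.startsWith("^")) {
    throw new Error(`Not a caret range: ${range}`);
  }
  const base = parseVersion(range.slice(1));
  const version = parseVersion(candidate);

  // Anything older than the declared version is never allowed.
  if (compareVersions(version, base) < 0) return false;

  // The caret keeps the left-most non-zero component fixed,
  // which gives an exclusive upper bound.
  let upper: Version;
  if (base[0] > 0) upper = [base[0] + 1, 0, 0];
  else if (base[1] > 0) upper = [0, base[1] + 1, 0];
  else upper = [0, 0, base[2] + 1];

  return compareVersions(version, upper) < 0;
}
```

With this sketch, any 5.x release at or above 5.8.2 satisfies "^5.8.2", while 6.0.0 does not. That is exactly the room a fresh install has to move in.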

The second telltale sign was that, while reviewing his PR (which was actually his initial ask before dragging me in to help solve this), I noticed that the file pnpm-lock.yaml had a lot of changes.

Normally this file shouldn’t have too many changes, especially if only a single new library was added.

That many changes could mean that the lock file had been deleted, regenerated with pnpm install, and then pushed with the branch.

Those two signs together produce a unique situation:

If the lock file is deleted and regenerated, pnpm is free to resolve every package to the newest patch and minor version its range allows when creating the new lock file.

This is important because the regenerated lock file now records those newer versions. When CI runs pnpm install --frozen-lockfile, it installs exactly what the committed lock file says, so it picks up the new versions, whereas we will never notice these errors locally unless we replicate the exact steps.

Thus my hypothesis was exactly this: the pnpm lock file was removed, pnpm install was run, and the new lock file was committed.

To verify that, we would need to be able to reproduce it. Ideally this should happen in a local development environment.
After all, if the root issue is packages and their versions being accidentally updated, then we should see the same issues locally if we follow the same steps.

This is why having a hypothesis, and being clear on the mental model of it being a hypothesis, is so important. It helps keep in mind that we need to be able to reproduce the issue.

Now, this replication is fairly easy, as the exact state things ended up in is part of the hypothesis: newer versions of packages.

All I had to do was run rm pnpm-lock.yaml && pnpm install.

As it turned out, that did in fact cause errors, the exact same ones we were seeing in CI.

Problem verified 😎

Which versions and packages to pin?

Great, so now that we have the problem verified, and we know that it has to do with some package, how do we figure out which packages and versions to pin?

Well, as unfortunate as it is, the most straightforward solution is simply to check the diff of the pnpm-lock.yaml file and see which versions were bumped in the new lock file.
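I did this by eye, but the comparison itself is simple enough to sketch. The snippet below assumes you have already extracted two plain name-to-version maps, one from the old lock file and one from the new one. The names ResolvedVersions and findBumpedPackages are made up for this illustration, and parsing pnpm-lock.yaml itself is deliberately left out.

```typescript
// Maps a package name to the version the lock file resolved it to.
type ResolvedVersions = Record<string, string>;

interface VersionBump {
  name: string;
  from: string; // version in the old, known-good lock file
  to: string;   // version in the regenerated lock file
}

// Lists every package present in both maps whose resolved version changed.
// Packages that only exist in `after` are ignored on purpose: a newly added
// library is expected, an existing one silently moving is not.
function findBumpedPackages(
  before: ResolvedVersions,
  after: ResolvedVersions,
): VersionBump[] {
  const bumps: VersionBump[] = [];
  for (const name of Object.keys(before).sort()) {
    const from = before[name];
    const to = after[name];
    if (to !== undefined && to !== from) {
      bumps.push({ name, from, to });
    }
  }
  return bumps;
}
```

The "from" side of each entry is the interesting part here, because that is the known-good version to pin back to.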

As it turned out, the list was rather long (this is non-exhaustive):

typescript
langchain
langfuse
zod
ioredis
ai
resend
tsx
esbuild
@types/node

All of these are major packages that could have a significant impact on both the typecheck and build steps.

See, the problem is that not everyone follows semver dogmatically (nor do I do so strictly myself), which means that even if – at least in theory – it shouldn’t break anything to use the caret (^) symbol on packages, it often does.

In our case, even just a few types changing in a package could make it extremely difficult to track down the error across a large monorepo.

As such, a stable environment is more important to us than getting the newest version of some dependency.
Updating packages can still be handled with a manual process, or with tooling like Dependabot.

I would even go as far as to argue that you should not keep the caret indicator on your package versions!
Because at the end of the day, predictability and stability are often much more valuable than having a minor or patch version accidentally updated.

In any case, in order to finally fix the issue, there was no way around doing it the hard way:
I looked thoroughly at the diff for the pnpm-lock.yaml file, to see what the previous version was, for each package.

From there, I went through our monorepo’s many package.json files and specified those versions, now without the ^ symbol, so that we can ensure we don’t have an accident like this again in the future.
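I made these edits by hand, but the transformation is mechanical. As a rough sketch, assuming a map of known-good versions collected from the old lock file (the names PackageManifest, pinSection and pinManifest are hypothetical, not part of any tool), it could look like this:

```typescript
type DependencyMap = Record<string, string>;

// Only the parts of package.json this sketch cares about.
interface PackageManifest {
  dependencies?: DependencyMap;
  devDependencies?: DependencyMap;
}

// Replaces the declared range with the exact known-good version when we
// have one; entries we know nothing about are left untouched.
function pinSection(section: DependencyMap, knownGood: DependencyMap): DependencyMap {
  const pinned: DependencyMap = {};
  for (const name of Object.keys(section)) {
    const exact = knownGood[name];
    pinned[name] = exact !== undefined ? exact : section[name];
  }
  return pinned;
}

// Returns a copy of the manifest with pinned versions; the input is not mutated.
function pinManifest(manifest: PackageManifest, knownGood: DependencyMap): PackageManifest {
  const result: PackageManifest = { ...manifest };
  if (manifest.dependencies) {
    result.dependencies = pinSection(manifest.dependencies, knownGood);
  }
  if (manifest.devDependencies) {
    result.devDependencies = pinSection(manifest.devDependencies, knownGood);
  }
  return result;
}
```

Pinning from a known-good map, rather than just stripping the ^ from whatever is declared, matters here: the declared range only gives the lower bound, while the old lock file tells us which version was actually working.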

Yes, it is possible to solve it in other ways, for example by checking out the lock file at a known-good state, adding only the new package, and pushing the updated file to the branch.

But as with many problems, this one helped surface a deeper one. Semantic versioning in JavaScript is somewhat finicky. At the core of it, while it sounds nice to have minor and patch versions automatically updated, it is not as valuable as having a stable development environment.

So, we used the opportunity to pin all the package versions to known working states, and we can now go over updating them one at a time when it fits us.

And in case you were wondering, yes! Pinning the versions absolutely did solve the issue and I believe it to be a better approach than trying to fix any issues that appeared 😎