How I Almost Took Down Production Due to a Near-Empty Config File
Some web frameworks only load an environment-specific config file when the app runs in that environment, which means a broken production config can go unnoticed until it reaches production.
Prefer video? Here’s a recorded version of this post on YouTube. The video also demos some code not in the post near the end.
This is going to be a post mortem of how I almost took down production and is a case for performing sanity checks and manual tests for production config files in development.
I had this happen to me recently for client work where I was working with PHP CodeIgniter 3. It’s a perfect example of knowing just enough to be dangerous because while I have worked with PHP for years in the early 2000s, I wouldn’t say I’m a PHP developer today.
At this place I play the role of a solo SRE (Site Reliability Engineer) / platform / ops / whatever and I sometimes find myself making app changes for infrastructure or hosting related things like creating a health check controller, converting config items to environment variables, etc.
It works out nicely because the dev team can focus on delivering business value and we all coordinate together to fill in knowledge gaps across the board.
The work I was doing would be code reviewed and tested by someone on the dev team. That’s nice because it means a proper PHP developer can offer suggestions and improvements which makes this process OK in my book.
# The Story behind This Post
CodeIgniter 3 allows you to define multiple config files for different environments and it only loads a specific config file when you run your app in let’s say `production` or `staging` mode. Quite a few web frameworks support this pattern to let you configure your app per environment. This is pretty standard so far.
In this case we had a few config items duplicated across multiple environment config files. Using a single environment variable that would be read within our default config felt like a better fit, so I started doing the work.
I don’t know CodeIgniter a ton but in all of our config files we reference a variable called `$config` and set values on it like `$config["hello"] = "Hello world!"`.
We had a specific `config/production/api.php` file where I removed a custom config item we had and moved it to the default `config/api.php` file which would then read the value from an environment variable. This default file applies to all environments unless you override it.
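As a rough illustration of that kind of entry, the default file might end up with something like the sketch below. The `api_base_url` key, the `API_BASE_URL` variable and the localhost fallback are hypothetical names picked for this example, not our real config.

```php
<?php
// config/api.php: the default file, used by every environment unless overridden.
// Hypothetical names: "api_base_url" and API_BASE_URL are made up for this sketch.

// getenv() returns false when the variable is not set.
$api_base_url = getenv('API_BASE_URL');

// Fall back to a local default (also an assumption) so development works
// without having to export the env var first.
$config['api_base_url'] = $api_base_url !== false
    ? $api_base_url
    : 'http://localhost:8000';
```

Each environment then only needs to set `API_BASE_URL` instead of carrying its own copy of the config item.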
This feels like an innocent refactor right? Remove the config item from a couple of config files across multiple environments and replace it with a single entry in the default config file which reads from an environment variable. Then we can set that env var value as needed in all of our environments.
That’s exactly what I did and that `config/production/api.php` file looked like this in the end:

```php
<?php
# [Redacted, but we had a single if condition here.]
# [Redacted, we had a bunch of comments I didn't want to delete "just in case".]
```
Since the file wasn’t empty of code (it had an if condition) and we had a bunch of comments, I made a judgment call to keep the file around. Normally I’m a fan of doing something and asking for forgiveness later, which would have meant deleting the file, but since I’m not super familiar with PHP I went with what I thought was a safer approach by leaving it.
So I made my changes and created the pull request. In the testing instructions I mentioned this is modifying config files for a few environments and just to be safe maybe the tester can manually print out the config item to make sure the value is being populated from the env var correctly.
We don’t have unit tests at this level where every single individual config item is tested and this manual test felt like a decent sanity check for the PR even though our other tests would have picked this up by testing the controller action the config option is referenced in.
## Quick Aside
We mostly trust our tests but the context here is out of the norm. I’m not a grizzled CodeIgniter veteran and I’m committing code to a large app for a large company.
The company is private so I’m not going to disclose any details beyond that. The reason I was even confident enough in making this patch is because I have absolute trust in the dev team and our release process. I’ve also made a decent amount of changes to the code base over the last year without any issues.
## Back to the Story
In any case, everything passed because the manual test was done in development mode, which loaded the development config, and the automated tests also passed because they loaded the test config.
It got the seal of approval and was pushed to production. Technically we don’t have any pre-prod environments yet, but to be fair this hasn’t been something that routinely kills us. I also haven’t been here long enough to address it.
I’ve been focusing on moving their infrastructure into Terraform and also getting all of their services in Kubernetes (which happened recently). I’m not making any excuses but did want to shed a tiny bit of light on why things are the way they are right now.
Fortunately Kubernetes prevented this version from going live because the app failed Kubernetes’ health check due to a CodeIgniter error that said it failed to merge in `$config` from that `config/production/api.php` file.
Nothing out of the ordinary happened in development or testing because those config files had other `$config` references.
While the PHP file itself wasn’t empty of code, the production config didn’t have a single `$config` variable and under the hood CodeIgniter 3 will expect that if a config file exists it will have at least 1 `$config` to merge. Ours didn’t so it crashed and the app didn’t start.
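To make that failure mode concrete, here is a simplified stand-in for a per-environment config loader. It is not CodeIgniter’s actual code, just a sketch of the behavior described above under my own assumptions. The `load_config` function, the temp directory layout and the error message are all made up for this example.

```php
<?php
// Simplified stand-in for a per-environment config loader.
// NOT CodeIgniter's real implementation; names here are hypothetical.
function load_config(array $file_paths): array
{
    $merged = [];

    foreach ($file_paths as $file_path) {
        if (!is_file($file_path)) {
            continue; // a missing override is fine, it is simply skipped
        }

        // Reset so an earlier file's array cannot mask a file that sets nothing.
        $config = null;

        // The include runs in this function's scope, so the file can set $config.
        include $file_path;

        if (!is_array($config)) {
            throw new RuntimeException("$file_path has no valid \$config array to merge.");
        }

        $merged = array_merge($merged, $config);
    }

    return $merged;
}

// Demo with throwaway files: a default config plus a near-empty production override.
$dir = sys_get_temp_dir() . '/config_demo_' . uniqid();
mkdir($dir . '/production', 0777, true);
file_put_contents($dir . '/api.php', "<?php\n\$config['hello'] = 'Hello world!';\n");
file_put_contents(
    $dir . '/production/api.php',
    "<?php\nif (PHP_VERSION_ID > 0) {\n    // logic and comments, but no \$config\n}\n"
);

// No staging override exists, so only the default file gets merged.
$default_only = load_config([$dir . '/api.php', $dir . '/staging/api.php']);

// The production override exists but never sets $config, so loading fails.
try {
    load_config([$dir . '/api.php', $dir . '/production/api.php']);
    $production_error = null;
} catch (RuntimeException $e) {
    $production_error = $e->getMessage();
}
```

In this sketch a missing override file is skipped without complaint, while a file that exists without setting `$config` is treated as an error. That difference is why a near-empty file can be riskier than no file at all.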
Since I’m not a hardcore CodeIgniter user I didn’t know that and the person who reviewed the PR also didn’t catch it. I’m not blaming them because this is a pretty obscure thing and it might have been the first time this came up in the lifetime of the app.
What’s interesting here is that it’s not guaranteed that a staging or pre-prod environment would have caught this. As long as our staging and production configs had the same config options defined it would, but there could be use cases where this isn’t the case. Perhaps a default config is used for every environment except for production.
# Takeaways and Lessons Learned
If your web framework supports configs for multiple environments then treat changes to these files with a lot of care and respect.
Consider introducing a policy and checklist around changes to these files which ensures that your app gets run in production mode or whatever environment has changes before your PR gets merged. Perhaps this could be a manual test by the developer doing the PR and the dedicated reviewer of the PR. I also think there’s room here to maybe automate starting the app up in production mode (with development `.env` values) within CI to ensure it at least starts successfully.
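Short of booting the whole app in CI, a cheaper check could scan each environment’s config directory for files that exist but never define `$config`. The sketch below is one hypothetical way to do that and isn’t something CodeIgniter ships. The function names and the explicit list of config names are assumptions, and it defines `BASEPATH` on the assumption that config files may exit early without it.

```php
<?php
// Hypothetical CI check, not part of CodeIgniter: list environment config
// files that exist but never define a $config array.

function defines_config_array(string $file_path): bool
{
    // Assumption: config files may bail out unless BASEPATH is defined.
    defined('BASEPATH') || define('BASEPATH', __DIR__ . '/');

    // The include runs in this function's scope, so $config starts out undefined.
    include $file_path;

    return isset($config) && is_array($config);
}

function find_broken_env_configs(string $config_dir, array $names): array
{
    $broken = [];

    // Each sub-directory of the config directory is treated as one environment.
    foreach (glob($config_dir . '/*', GLOB_ONLYDIR) ?: [] as $env_dir) {
        foreach ($names as $name) {
            $file_path = $env_dir . '/' . $name . '.php';
            if (is_file($file_path) && !defines_config_array($file_path)) {
                $broken[] = basename($env_dir) . '/' . $name . '.php';
            }
        }
    }

    sort($broken);

    return $broken;
}

// Demo against a throwaway directory instead of a real app.
$config_dir = sys_get_temp_dir() . '/env_config_check_' . uniqid();
mkdir($config_dir . '/development', 0777, true);
mkdir($config_dir . '/production', 0777, true);
file_put_contents($config_dir . '/development/api.php', "<?php\n\$config['debug'] = true;\n");
file_put_contents($config_dir . '/production/api.php', "<?php\n// only comments left\n");

$broken = find_broken_env_configs($config_dir, ['api']);
print_r($broken);
```

A CI job could fail the build whenever `$broken` isn’t empty. The sketch takes an explicit list of names because other files in the config directory may use different variable names, so checking every file for `$config` would likely produce false alarms.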
Having a health check that must pass before your app gets put into a load balancer to serve traffic is a really good thing to have. You don’t need Kubernetes to have this ability but Kubernetes does make this pretty easy to do with its probes.
Another takeaway is that having a staging or pre-prod environment shouldn’t be blindly trusted to catch everything. In practice it should catch a lot but you should still put in the due diligence to check as much as you can within reason before you ship a release.
Also, even if you have a lot of checks and balances in your release process with a strong technical team things are going to slip through the cracks. Embrace these as positive events.
When this happened I didn’t get upset or feel embarrassed. It made me happy to discover a weakness in both my own personal ability and our release process. This is something I now have an awareness of and it can follow me around no matter which tech stack I’m using.
# Side Topic about Rails
Not all multi-environment config file frameworks are affected by this. For example Rails supports this pattern too but you typically build your production assets with `RAILS_ENV=production rails assets:precompile`. This command would happen during your build process and it would fail to run if your `environments/production.rb` file had issues.
But with that said, I think all of the above takeaways are worth applying to Rails or any other tech stack you might use. Ideally you want to catch these things as early as possible and have multiple levels of defense.
The video below covers everything we went over in this post and the 13:21 point in the video picks up from this point and goes over a Rails example app to demonstrate the issue.
# Video Walkthrough
## Timestamps
- 0:10 – Loading in environment specific configs
- 0:42 – How I got myself into this situation
- 2:10 – Making an innocent refactor and playing it safe
- 4:33 – Coding the change and opening a PR to be reviewed
- 5:16 – A quick aside on the context of this change
- 6:02 – Getting the release approved and shipping it to production
- 7:10 – Kubernetes helped prevent this error from taking down prod
- 9:36 – Takeaways and lessons learned
- 12:10 – Looking at some Rails code to demonstrate what happened
Has something like this ever happened to you? Let me know below!