Creating Checklists for High Stakes Changes
The use case we'll go over is performing a major database upgrade for a large application that's running in production.
Prefer video? This video covers what’s in this post with a little bit more detail.
Checklists can be used for more than just database upgrades of course but I’m a big fan of using practical examples to demonstrate something and I recently performed a checklist driven database upgrade.
There’s a reason airline pilots use checklists even if they have 40+ years of experience and have taken off and landed thousands of times. Having the workflow in front of your face lets you focus on executing the steps instead of trying to think about the workflow while executing the steps at the same time.
For one of my clients we were preparing a major database version upgrade (MySQL 5.7 to 8.0) and we wanted to minimize downtime and what could go wrong.
Needless to say a checklist was invaluable.
It let us do all of the hard work of defining the workflow while there’s no pressure well before the time when the upgrade was happening. We also got a chance to test the workflow in a test environment before applying it to the live database.
Database Upgrade Workflow
It was on AWS using RDS which is AWS’ managed MySQL / Postgres service but similar steps could be used for other database upgrades too. After all of the steps are listed out we’ll go over the details and my thought process of each one in the next section.
None of the links below go anywhere. I added them here because in reality these led to the exact resource being acted on in a step. This helps avoid human error and needing to click around and find things on the fly. It’s a minor but useful detail in my opinion.
- Ensure you can connect to the live database before doing anything
- Confirm the SQL user has all necessary permissions to run any queries afterwards
- Create maintenance page message for example.com
- Enable maintenance page at the firewall level
- Verify maintenance page is being served for example.com
- Create manual database snapshot as a backup:
- Name it
- Name it
- Modify the live
- Set the DB engine to MySQL
- Set the parameter group to
- Set the DB engine to MySQL
- Wait until RDS is updated while looking at the status
- It may take 10-20 minutes
- Connect to the live database and run SQL queries to verify it works:
SELECT ...and expect
SELECT ...and expect
SELECT ...and expect
- Disable maintenance page
- Verify example.com is loading successfully
- Create a calendar reminder to delete the manual snapshot in 30 days
- Notify the business that the site is back up and operational
Here’s my thought process for defining the checklist the way it was:
Steps 1 and 2 are critically important because you could get to step 9 and then try to run a query only to find out that the DB user you’re connecting with doesn’t have permissions.
For example, our “app” user can’t create or drop tables but our “migrate” user can.
This happened in our test environment and it caused a blocker to obtain the correct permissions because the person performing the upgrade (which was me in this case) didn’t have access to modify SQL user permissions.
Step 1 is also nice because it’s a good idea to ensure your system is in a working state before you modify it.
Steps 3, 4 and 5 are a handy way to toggle maintenance mode without needing to perform a code change. It’s a firewall rule we can turn on and off on demand for 1 or more domains.
Step 6 is an insurance policy in case something goes wrong we have the latest version of our DB to ensure no data loss. That’s why we had to turn maintenance mode on first to ensure no new data is being written to the DB.
Technically this isn’t flawless because in between steps 6 and 7 a background job could modify the state of our database even with no new traffic being sent to the site. For example any scheduled work that may happen at X time in the future.
We have pretty robust logging and observability in place to know when this happens along with a means to reconcile everything as needed if something went wrong and we needed to use the snapshot.
Again, this only becomes an issue if we need to use the snapshot. If a background job runs while the database is down during the upgrade itself that’s ok, the job will fail and will be auto-retried. Eventually the DB will come back up and it’ll work.
Step 7 is mainly clicking buttons in a web UI. That custom parameter group was created ahead of time in preparation for the upgrade.
Ideally all of these “click buttons in web UI” steps could be put into code using Terraform so they all become independent pull requests that can be reviewed but this client has a 10+ year old app and not all of their infrastructure has been ported into Terraform yet.
Step 8 is a waiting game. The AWS config tab for your RDS instance will show the status, such as if it’s applying the new parameter group or if it’s in the process of rebooting. The time range is pretty sporadic, I noticed upwards of a 2x difference in total time based on nothing we could control.
Step 9 are a few sanity check queries to ensure things updated to their correct version and certain tables are cleared based on what the dev team suggested.
Steps 10 and 11 allow the site to be live again within a few seconds of enabling it. Automated smoke tests would be a way to automate step 11 a bit but for now a human went to a couple of pages to make sure they worked.
Step 12 is important because a snapshot costs money to store. We wanted to keep it as a snapshot to quickly restore it as needed. We could have done a SQL dump and stored it on S3 but realistically spending a couple of bucks to keep the snapshot around is worth it.
If something goes wrong it lets the developers quickly spin up a test DB based off the snapshot to help figure out what’s happening and address the situation.
Step 13 is always nice to let stakeholders know how things went. In our case everything went flawlessly and we had around 15 minutes of expected downtime which was done during a time where our traffic usually dips. The preparation work paid off.
You’ll notice there’s no steps or workflow for if something goes wrong. That’s because I both wrote and executed the steps. At this point I was ok with letting my overall general experience guide troubleshooting anything that might go wrong.
If I were writing these steps for someone else, that would be different. I’d add more details.
I also executed them in a test environment multiple times to work out the kinks. By the time it was executed in production I was feeling really good about it.
We did have a disaster recovery plan though. It wasn’t a formal document but we thought about what we could do if certain things went wrong. Realistically steps 7 and 8 are the riskiest ones and it’s technically out of our control. Either RDS will successfully do it or not.
While it’s Happening
While executing the steps I created screenshots and quickly jotted down how long each step took. I really like this idea because once the whole process is finished you can go back and make a little post with a timeline that mentions anything interesting that may have happened along the way.
This really helps your future self or someone else understand the process in case they need to do a similar thing in the future. When performing a scary upgrade, having the extra context and details is well worth the 15 minutes writing up a summary when it happened.
Video Version of This Post
- 0:44 – Checklists are valuable
- 2:09 – Going over the database upgrade checklist
- 3:26 – Ensure you can connect to your DB
- 4:27 – Turning on maintenance mode for the site
- 5:55 – Create a manual snapshot as a backup and perform the upgrade
- 10:49 – Confirm it works with a few SQL queries
- 11:30 – Turning off maintenance mode for the site
- 12:10 – Create a calendar reminder to delete the snapshot backup
- 13:53 – Notifying the business that everything went well
- 14:24 – What if something went wrong?
- 16:30 – Making a timeline of interesting details afterwards
- 17:39 – What was the last thing you made a checklist for?
What was the last thing you made a checklist for? Let me know below.