Updated on February 14, 2023 in #deployment

Lessons Learned from Re-rolling 60+ Production Credentials


It's worth documenting and running through the process before disaster strikes while the stakes are low.


Prefer video? There’s a version of this post on YouTube that covers the same story and takeaways with a little bit more flavor.

I was involved with a company that had been using LastPass for 5+ years. In December 2022, LastPass had a really bad situation where encrypted vaults were compromised. That means an attacker can try to crack your master password offline and if they succeed, they’ll have access to all of your credentials. 2FA on your LastPass account won’t save you here.

I think this company was very fortunate because LastPass has 25+ million customers and a bad actor would need to crack a master password from an employee who had a lot of production credentials in there to do serious damage.

One of the folks with production credentials had a very old LastPass account with only 5,000 key derivation iterations and a decent but not great password in terms of password entropy. Basically this means the only reasonable option here is to re-roll everything.
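To get a feel for why the iteration count matters, here’s a rough back-of-the-envelope sketch. The entropy value and raw hash rate are made-up numbers purely for illustration, not measurements of real hardware or of the account above. The only real takeaway is that an attacker’s cost per guess scales linearly with the iteration count.

```python
def average_crack_seconds(entropy_bits: float, iterations: int, raw_hashes_per_sec: float) -> float:
    """Average brute force time, assuming each guess costs `iterations`
    hash operations and half of the search space is tried on average."""
    guesses_per_sec = raw_hashes_per_sec / iterations
    average_guesses = 2 ** (entropy_bits - 1)
    return average_guesses / guesses_per_sec


# Hypothetical inputs, chosen only to show the scaling.
ENTROPY_BITS = 40
RAW_HASHES_PER_SEC = 1e10

old = average_crack_seconds(ENTROPY_BITS, 5_000, RAW_HASHES_PER_SEC)
new = average_crack_seconds(ENTROPY_BITS, 100_000, RAW_HASHES_PER_SEC)

print(f"5,000 iterations:   {old / 86_400:.1f} days")
print(f"100,000 iterations: {new / 86_400:.1f} days")
print(f"ratio: {new / old:.0f}x")
```

With everything else held equal, a vault set to 5,000 iterations is 20 times cheaper to attack than one set to 100,000.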

It also means the company has a very real security threat in front of them but they’re not having to securely re-roll everything under pressure while a breach is happening in real-time. In my opinion that’s a very favorable outcome.

What was stored in their LastPass you might ask? Good question!

One feature of LastPass is you can have shared folders which allows teams to share credentials. This could be handy if you have a shared set of keys that everyone uses for specific things such as pre-prod tokens and credentials. More on this later!

There was all sorts of stuff stored, such as:

Root cloud provider key pairs (not always a literal “root” key, but full admin access)
Dozens of logins to various sites (Bitbucket, GitLab, Datadog, etc.)
Dozens of API keys (payment gateways, document signing services, etc.)
Encryption keys related to decrypting very sensitive PII (Personally Identifiable Information)
All sorts of credentials to pre-prod / prod databases and services

Not all of these were shared with all developers by the way.

Even though everything was re-rolled I’m trying to be a little vague on purpose, but you get the idea. It was the literal keys to the kingdom, especially for the folks with production keys and access.

# How Do You Even Begin to Re-roll Everything?

If you have a long running organization with a decent amount of developers and services then credentials accumulate over time. Maybe the person who signed up for an external service in 1 area of a large code base doesn’t even work there anymore.

The first thing was identifying groups of credentials such as:

Individual LastPass items that each developer has in their own local LastPass
Shared LastPass items that all developers have access to
Shared LastPass items that only a few devs have access to (production, etc.)
Environment variables for each service and tool

The next step was to create a spreadsheet and document to go along with it.

The goal is to assess everything and also have a spot to track re-roll progress and jot down the workflow for re-rolling all of these items.

I started with the spreadsheet and made a couple of sheets, basically 1 for each bullet point above. The only exception was the first bullet point. We asked devs to re-roll all of their individual site passwords, which can be confirmed because LastPass shows the password history. It was also not too bad. The devs cranked this out in like 15-20 minutes.

For the environment variables I made a sheet and dropped in all of the services there. There were a lot of environment variables here.

Tracking Environment Variables in a Spreadsheet

You can imagine having a shared LastPass with an .env file for 1 specific service where you have 40 passwords, tokens and other credentials.

I added these columns to the sheet:

Secret
- This would be the environment variable’s name (obviously not the value!)
Purpose
- What does it do?
Service
- Which service is it for?
Status
- When was it re-rolled, such as a YYYY-MM-DD date?
- Any blockers on being able to re-roll it?
- Is there a ticket assigned to this one?
Risk
- What would be compromised if someone got access to this?

Initially I had a “Risk Score” column that was scored on a 1-5 scale where 5 is catastrophic but in the end this turned out to be busy work and I quickly axed that idea. Since we were re-rolling everything the score didn’t matter. The risk description was more than enough.

Customer impact was another criterion that was important but I ended up color coding rows for this instead of adding a column. For example, if a key were to be updated, would it mean customers need to reset their password because the key was used to encrypt and sign cookies? In the end I highlighted those rows and left the rest as their default color.

This really helped a lot.
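If you’d rather keep this in a file next to your docs instead of a spreadsheet, the same columns map nicely to a CSV. Here’s a minimal sketch of that idea. The `SecretRow` class, the `customer_impact` flag (a stand-in for the highlighted rows) and the example services are all things I made up for illustration.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class SecretRow:
    secret: str    # the environment variable's name, never its value
    purpose: str
    service: str
    status: str    # re-roll date (YYYY-MM-DD), blockers or a ticket reference
    risk: str
    customer_impact: bool = False  # stand-in for a highlighted row


def to_csv(rows: list[SecretRow]) -> str:
    """Render the rows as CSV text with one column per field."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(SecretRow)])
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buffer.getvalue()


# Hypothetical rows, names only and no secret values.
rows = [
    SecretRow("ACME_API_KEY", "API key for ACME", "billing", "Ask the devs", "Access to ACME's data"),
    SecretRow("SESSION_SECRET", "Signs session cookies", "web", "", "Forged sessions", customer_impact=True),
]
print(to_csv(rows))
```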

Initially I started with comically bad descriptions like:

Secret         Purpose            Status         Risk
ACME_API_KEY   API key for ACME   Ask the devs   Access to ACME's data

Fortunately for about half of them I did know what they did and what the risk was, but as I gained more information I added more accurate descriptions and risk assessments.

My process was to fill out as much as I could and also grep the code bases for references. I identified a number of variables that were no longer being used which was a nice side benefit of assessing everything. We got rid of years’ worth of cruft.
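Plain grep is all you need for this, but if you have a long list of variable names it can be handy to check them in one pass. Here’s a small sketch of that. The function name, the file suffixes and the example variables are assumptions for illustration, adjust them to match your code bases.

```python
import re
import tempfile
from pathlib import Path


def find_unreferenced(
    names: list[str],
    root: Path,
    suffixes: tuple[str, ...] = (".py", ".sh", ".yml", ".yaml"),
) -> list[str]:
    """Return the variable names that never appear in any matching file under root."""
    remaining = set(names)
    for path in root.rglob("*"):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(errors="ignore")
        # Word boundaries keep ACME_API_KEY_OLD from counting as a hit for ACME_API_KEY.
        remaining -= {n for n in remaining if re.search(rf"\b{re.escape(n)}\b", text)}
        if not remaining:
            break
    return sorted(remaining)


# Tiny demo against a throwaway directory with hypothetical variable names.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "app.py").write_text('import os\nkey = os.environ["ACME_API_KEY"]\n')
    unused = find_unreferenced(["ACME_API_KEY", "LEGACY_FTP_PASSWORD"], root)
    print(unused)
```

Treat a miss as a hint rather than proof. A name can be built dynamically or referenced by deploy tooling that lives outside of the repo, so it’s still worth confirming with the team before deleting anything.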

For everything I didn’t know I reached out to the dev team and asked questions asynchronously. It worked really well and wasn’t super disruptive to the team as far as I know. I tried to be as thorough as I could before reaching out to them.

Now that the sheet was filled out, it was time to start re-rolling keys.

# Re-rolling Everything in Order

To do this properly requires figuring out which keys to re-roll first.

For example if you have cloud provider credentials with root access then re-rolling a Stripe API key first doesn’t do anything for you. An attacker can use the root cloud credentials to look at anything, such as the output of env in a running container within a Kubernetes cluster, since they could just kubectl exec into one of your app’s pods.

At some point your app needs to decrypt your secrets. Whether it’s through environment variables or encrypted secrets in a secret manager it doesn’t matter. Once someone has access to your app’s run-time environment that’s it.

That wouldn’t require SSH access either. It’s all done over cloud provider command line tools where access is handled through access keys and user permissions.

With that in mind, I started re-rolling all of the cloud provider keys and web console logins. Fortunately this was easy because they allowed you to create new keys and deactivate the old keys so it could be done without service disruption.

I ended up performing this workflow:

Create new key
Update apps to use new key (SDK access, etc.)
Deactivate leaked key
Create another “new new” key
Update apps to use the “new new” key
Deactivate “old new” key

The second round guards against a potential attacker who still had access to the account in between steps 1-2, where they might have gotten to see the new key name from step 1. Although technically the secret component of the key wouldn’t be known, I took the safer approach because it helps me sleep better at night.
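The 6 steps above are really the same 3 step round performed twice. Here’s a minimal sketch of that shape using an in-memory stand-in for a provider’s key API. `KeyStore`, `roll` and `double_roll` are hypothetical names, a real provider’s SDK or CLI would take their place.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Callable


@dataclass
class KeyStore:
    """Hypothetical in-memory stand-in for a provider's access key API."""
    active: set[str] = field(default_factory=set)
    inactive: set[str] = field(default_factory=set)
    _ids: count = field(default_factory=lambda: count(1))

    def create(self) -> str:
        key_id = f"key-{next(self._ids)}"
        self.active.add(key_id)
        return key_id

    def deactivate(self, key_id: str) -> None:
        self.active.remove(key_id)
        self.inactive.add(key_id)


def roll(store: KeyStore, current: str, deploy: Callable[[str], None]) -> str:
    """One round: create a new key, move the apps over, then deactivate the old key."""
    new_key = store.create()
    deploy(new_key)  # apps must be using the new key before the old one is cut off
    store.deactivate(current)
    return new_key


def double_roll(store: KeyStore, leaked: str, deploy: Callable[[str], None]) -> str:
    """Two rounds, so the key created while the leaked key was still active is retired too."""
    intermediate = roll(store, leaked, deploy)
    return roll(store, intermediate, deploy)


store = KeyStore()
leaked = store.create()
deployed: list[str] = []
final = double_roll(store, leaked, deployed.append)
print(final, sorted(store.active), sorted(store.inactive))
```

The old key is deactivated rather than deleted in each round, which leaves room to reactivate it if an app turns out to still depend on it.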

As a side topic, this is also a good opportunity to ensure these new keys only have minimal access for the features your app needs. For example if your app only needs to read and write to an S3 bucket, then create permissions just for that. Don’t give the key full admin access so it can delete a Kubernetes cluster!
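As a rough illustration of what “just that” could look like on AWS, here’s a sketch that builds an IAM policy document allowing only object reads and writes in a single bucket. The bucket name is made up, and your app may need a slightly different set of actions (for example listing objects requires `s3:ListBucket` on the bucket itself).

```python
import json


def s3_read_write_policy(bucket: str) -> dict:
    """IAM policy document limited to reading and writing objects in one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            }
        ],
    }


# "example-app-uploads" is a hypothetical bucket name.
policy = s3_read_write_policy("example-app-uploads")
print(json.dumps(policy, indent=2))
```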

Re-rolling the Rest

Now that we’ve got a controlled environment it was time to re-roll everything else. I made judgment calls on this based on what I thought was most important and also factored in how time consuming it was to re-roll and whether or not I was blocked.

For example, re-rolling keys to obtain access to pull private repos is important and relatively easy so I did that first. Our DB credentials are also important of course, but access is only obtainable within a private network while connected to a VPN. Re-rolling them still matters, but I aimed to first handle credentials for external services that aren’t restricted by an IP address whitelist.
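These were judgment calls rather than a formula, but if you wanted to rough out an order from the spreadsheet you could sort on the same factors. This is only a sketch, the `Credential` fields, the example names and the hour estimates are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Credential:
    name: str
    externally_reachable: bool  # usable without the VPN or an IP address whitelist
    effort_hours: float
    blocked: bool = False


def reroll_order(creds: list[Credential]) -> list[Credential]:
    """Unblocked first, then externally reachable, then whatever is quickest."""
    return sorted(creds, key=lambda c: (c.blocked, not c.externally_reachable, c.effort_hours))


# Hypothetical credentials and estimates.
creds = [
    Credential("DATABASE_PASSWORD", externally_reachable=False, effort_hours=3),
    Credential("REPO_PULL_TOKEN", externally_reachable=True, effort_hours=0.5),
    Credential("SIGNING_SERVICE_KEY", externally_reachable=True, effort_hours=1, blocked=True),
]
print([c.name for c in reroll_order(creds)])
```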

Keeping Progress

As I re-rolled everything I modified the spreadsheet to include the date when they were re-rolled in the “Status” column. I also highlighted that cell to be light green so it was easy to see which items were done.

Likewise, as I performed the process I wrote down the steps in a document for each environment variable that got re-rolled with concise bullet points and added commands or screenshots as needed.

Some of them had really important steps like if a token changes in service A then service B also needs the same token updated at the same time.

A couple of them also uncovered just how much of a single source of truth I was for the process on re-rolling something. For example re-rolling an Argo CD password for a read-only or admin user was something I’ve always done but it was a couple of manual steps.

This event prompted me to automate that with a script where you pass in the current and new password as environment variable names and it does the rest.
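The nice part about passing environment variable names instead of the passwords themselves is the values never show up as command line arguments. Here’s a sketch of that pattern, it’s not the actual script. The function names and variable names are hypothetical and the `update` callback is a placeholder for whatever really changes the password.

```python
import os
from typing import Callable, Mapping


def resolve_passwords(
    current_var: str, new_var: str, env: Mapping[str, str] = os.environ
) -> tuple[str, str]:
    """Look up both passwords by environment variable name and sanity check them."""
    missing = [name for name in (current_var, new_var) if not env.get(name)]
    if missing:
        raise SystemExit(f"Missing or empty environment variable(s): {', '.join(missing)}")
    current, new = env[current_var], env[new_var]
    if current == new:
        raise SystemExit("The new password matches the current one")
    return current, new


def reroll(
    current_var: str,
    new_var: str,
    update: Callable[[str, str], None],
    env: Mapping[str, str] = os.environ,
) -> None:
    current, new = resolve_passwords(current_var, new_var, env)
    update(current, new)  # placeholder for the real password change


# Demo with a fake environment and an update callback that only records the call.
calls: list[tuple[str, str]] = []
reroll(
    "ARGOCD_CURRENT_PASSWORD",
    "ARGOCD_NEW_PASSWORD",
    update=lambda current, new: calls.append((current, new)),
    env={"ARGOCD_CURRENT_PASSWORD": "old-example", "ARGOCD_NEW_PASSWORD": "new-example"},
)
print(len(calls))
```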

Datadog’s API token for Kubernetes with apiKeyExistingSecret was another interesting one. If you do an in-place secret update, you need to manually restart both the agent’s Deployment and the DaemonSet. This wasn’t documented but I discovered it by combing through their Helm chart’s source code looking for dependencies on the secret.
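Here’s a small sketch of what those restarts could look like when scripted. It only builds the `kubectl rollout restart` commands and prints them. The namespace and resource names are assumptions, check what your Helm release actually created before running anything.

```python
import shlex


def restart_commands(namespace: str, deployment: str, daemonset: str) -> list[list[str]]:
    """Build the kubectl commands to restart both workloads that read the secret."""
    return [
        ["kubectl", "rollout", "restart", f"deployment/{deployment}", "-n", namespace],
        ["kubectl", "rollout", "restart", f"daemonset/{daemonset}", "-n", namespace],
    ]


# Hypothetical namespace and resource names.
commands = restart_commands("monitoring", "datadog-cluster-agent", "datadog")
for cmd in commands:
    print(shlex.join(cmd))
```

Each command is an argument list, so it could be handed to `subprocess.run(cmd, check=True)` once you’re happy with what gets printed.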

Another interesting one was database credentials.

The approach we took was to create a new DB user / pass combo to ensure we could re-roll everything without downtime or disruption. These DB users are limited to only being able to do what the app needs so they can’t manage users themselves. Once the new user was being used everywhere we dropped the old user.

As a quick aside, locking down your DB user’s permissions like that is very much worth it. The peace of mind knowing a user can only perform a specific list of actions is great. This is especially nice for giving read-only access to a specific environment.

How Did It Go?

It was mostly smooth sailing. Everything got re-rolled with minimal to no service disruption.

The site itself never went down but there was 1 case where I changed a token and the UI on Bitbucket made me think I applied something when it didn’t get applied. This caused a minor ~2 min hiccup where CI broke due to permission issues, which was quickly fixed.

Another fun mini-disruption that had no real impact was due to the crazy way Slack handles re-rolling “Bot user” tokens for Slack apps (they are the ones that start with xoxb-). You have to revoke all tokens for the app, then after you re-install the app into your workspace you get a new token.

The above leaves a time gap where you have no valid token until you update / deploy all of your apps to use the new token.

That only affected a few dev focused tools that post Slack updates for things that get reported elsewhere. Slack’s webhook URLs can be re-rolled with zero downtime because it lets you make a new one before deleting the old one.

All in all, this whole process to re-roll everything was done casually over a few weeks (a few hours a day) where a huge amount of that time was researching and understanding what most of these secrets do.

The next time around will be much faster now that there’s a workflow defined. The next step would be re-rolling things “for fun” to really automate and work out the kinks so that if disaster strikes for real and your back is against the wall, you’ll be prepared to quickly and calmly handle it.

Even if that doesn’t happen in the near term, they’re in a much better spot now than before.

# Lessons Learned

Here are a couple of things I learned after having gone through a low stress mass re-rolling.

It’s Worth Doing This Now

I’ve always taken precautions with passwords and use offline password managers like pass but I never went the extra mile to define a workflow for my own stuff.

After experiencing and seeing all of this unfold I’m going to start defining workflows for my own personal sites and projects. It’s not really a lot of extra time to do this.

I write tests and I also perform backups along with exercising how to restore from them. Handling re-rolling keys feels like it’s in the same category as that, especially if your site is offering a paid service for others.

Don’t wait until something bad happens and you’re forced to do this in real-time while your site is actively being breached.

Are You an API Provider?

If you have a service with any form of API or programmatic access please offer a way to add, deactivate and remove keys so folks can perform safely monitored zero disruption re-rolls.

There’s an important distinction between revoking or deactivating a key and deleting it. Having the key remain visible in the UI but not allowed to be used is great. It lets you reactivate it in case something really unexpected happens.

Being able to create multiple keys and handle decommissioning them on your own terms is important for giving you peace of mind and providing less service disruption.

Also, showing the last accessed time when a key is used is very helpful to figure out if you’re still using a key in some unknown spot. For example if it says “last used 4 minutes ago” you know you might have missed updating something but if you let it ride over the weekend and come back to “last used 2 days ago”, you’re likely safe. Please include this, it helps a lot!
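If a provider exposes that timestamp through an API, the same check is easy to script. Here’s a minimal sketch where the function name and the 2 day quiet period are my own assumptions. It treats a missing timestamp as “unknown” instead of “unused”.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def looks_unused(
    last_used: Optional[datetime],
    now: datetime,
    quiet_period: timedelta = timedelta(days=2),
) -> bool:
    """True only if the key has a known last used time at least quiet_period ago."""
    if last_used is None:
        return False  # no data means we can't conclude anything
    return now - last_used >= quiet_period


now = datetime(2023, 2, 14, 12, 0, tzinfo=timezone.utc)
print(looks_unused(now - timedelta(minutes=4), now))  # used 4 minutes ago
print(looks_unused(now - timedelta(days=2), now))     # used 2 days ago
```

Pick a quiet period that covers your least frequent job. A key that’s only used by a monthly task would look unused after a weekend.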

Take Your Time

Part of the luxury of doing this while not under pressure is you’re not rushed into making hasty decisions.

Even if you are under pressure it’s not worth rushing through things. If you make a hasty decision like deleting the wrong key you might end up locking yourself out of something for an undetermined amount of time.

Or maybe you think something is good so you let your guard down and step away for a while only to discover 8 hours later one area of your site isn’t working and you happened to not have monitoring for that 1 thing.

For example, if a key provider’s UI is bad and doesn’t show you stats like when a key was last used and you decide to give the new key the same name as the old one, you may end up in a spot where you accidentally delete the wrong key because you didn’t triple check the creation time of the key.

Take nothing for granted when it comes to the UI of tools you haven’t created yourself.

Minimize Where You Have Secrets Stored and Referenced

It’ll be a lot easier to re-roll and track things if you don’t need to update it in 10 different spots. It’s also less error prone and more secure.

One easy win is to leverage workspace variables if you’re using continuous integration (CI) that apply to an entire organization or subset of projects. For example if CI sends a Slack notification out from a custom script, don’t add your Slack webhook URL to the 8 repos that use it. Add it in a single spot where all 8 repos in that project or sub-org can reference it.

Instead of copying an entire .env file to a password manager for backups, maybe store it as a secret file in your cloud provider’s secret store. If it’s sitting there as a backup that’s not being read from all the time it’s really cheap. You can also save them in an encrypted S3 bucket with extremely limited access. That comes with versioning too!

Create Unique Developer Keys Instead of Sharing Keys

For example, if in development your app needs AWS SDK access and it’s expected you pass in a key pair then don’t use a key pair that’s shared for all devs.

Let devs create and use their own key pair. This has a number of advantages:

It never lands in a shared password manager
If a dev leaves you can remove their user and don’t need to re-roll a shared key
You can get an audit log of each dev’s activity

This applies to database access too, such as a sanitized staging database.

You can apply this to other spots beyond the above use cases but it doesn’t always apply. Oftentimes you end up with a business or enterprise license for a service and only have 1 API token that you can use. In that case, hopefully you have an easy way to re-roll that key on demand.

That’s about it. I’m thankful this happened in a low stress scenario.

# Video Version of this Post

Timestamps

0:29 – Yes, it’s worth creating a re-roll process now
0:51 – LastPass had a breach in December 2022
4:31 – How do you even begin to re-roll everything?
6:59 – Creating a spreadsheet to track what each env variable is for
11:33 – How do you figure out which keys to re-roll first?
15:08 – Re-rolling the rest
16:34 – Keeping progress in the spreadsheet and document
21:17 – How did it go? It was mostly smooth sailing
24:49 – It’s worth doing this now
26:22 – Are you an API service provider?
28:14 – Take your time
29:48 – Minimize where you have secrets stored and referenced
32:03 – Create unique dev keys instead of sharing keys

What are your best tips for re-rolling credentials? Let us know below!
