It's OK to Sleep
It's really basic but sleeping before running a command or tool could be a quick way to solve a specific problem.
Prefer video? Here it is on YouTube.
Oftentimes using `sleep` within a script or program gets a bad rap and there’s all sorts of memes like “Hey I improved the performance of our site by 1000%” and the punchline is showing a git diff where they removed `sleep 60` which they forgot about.
But sometimes it’s a reasonable choice to solve a specific problem.
For example, adding a delay before running a script such as `sleep 300 && ./some-tool` or whatever the syntax would be with your programming language of choice.
# Use Case
In my case, there’s this manual workflow that I’m responsible for where a developer requests a new development DB. This dev DB is shared with a group of developers and when a new one gets created, the old one is first deleted so there’s about 15 minutes of downtime for the process to complete.
That amount of time is purely waiting for AWS RDS to delete and create an instance.
We’ve come up with a workflow to wait 15 minutes as a buffer time before making the new DB after a request has been made. This allows folks who are using it to finish up anything they might be doing or reply saying they want to request a delay.
Technically everyone has their own local isolated DB that’s seeded off this dev DB so it’s almost never an issue for anyone. There have been literally 0 requests for a delay beyond the 15 minutes in the more than 2 years since we started this process.
If it’s an urgent request, such as needing new data from production to fix an issue, they mention it’s urgent in the request and we don’t wait 15 minutes.
It’s also worth pointing out these dev DB requests usually happen once or maybe twice a week so the above isn’t too painful. There’s quite a few ways to improve this process such as making it self-serve. Right now it’s isolated to me running it due to requiring pretty elevated AWS permissions since it deals with production snapshots and creating RDS instances.
## You’re Doing It Wrong
I know, you might be thinking “create a new DB before deleting the old one to eliminate the downtime”. Yes, but it’s not that simple.
We want to keep the same RDS DNS name so no one has to reconfigure their SQL clients with a new hostname. Even if we created a distinct name for each one and then used a CNAME to map it to a static hostname the underlying DB will change which could affect someone actively doing something.
This isn’t the focus of this post but given this site is focused on software development I have a hunch about 94.82% of you probably read the use case and immediately went into problem solving mode (I know I would have!).
## The Manual Workflow Looks Like This
- A developer requests a new DB in our chat channel
- If it’s urgent (10% of the time)
  - I run the command to make it right now
- If it’s not urgent (90% of the time)
  - I wait 15 minutes between the time of their request and making it
    - For example if they request it at 10:00am and I see their message at 10:05am, I’ll make it at 10:15am
  - I reply and acknowledge the request, I look at their time and mention when I’ll start it
  - I run the script to create the new set of dev DBs at the calculated time
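The "calculated time" in the steps above is just a bit of arithmetic. Here's a minimal sketch of it, assuming times are given as epoch seconds. The `remaining_delay` function is a hypothetical helper made up for illustration; it isn't part of the real workflow.

```shell
# Hypothetical helper (not from the original workflow): prints how many
# seconds of the buffer are left. All times are epoch seconds.
#   $1 = when the request was made
#   $2 = current time
#   $3 = buffer length in seconds (defaults to 900, i.e. 15 minutes)
remaining_delay() {
  remaining=$(( $1 + ${3:-900} - $2 ))
  # If the buffer already passed, don't wait at all.
  if [ "${remaining}" -lt 0 ]; then
    remaining=0
  fi
  echo "${remaining}"
}

# Requested at 10:00, seen 5 minutes (300 seconds) later: 10 minutes remain.
remaining_delay 36000 36300   # prints 600
```

With GNU `date` you could feed it real times, for example `remaining_delay "$(date -d '10:00' +%s)" "$(date +%s)"`. The `-d` flag is GNU-specific, so this may need adjusting on macOS or BSD.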
If you think about this from my perspective, there’s kind of a lot of ways for things to go wrong and it involves multiple interrupts on my end:
- Interrupt 1: Seeing the chat message and needing to react
- Interrupt 2: Remembering to run the command
There used to be a third interrupt, which was posting a message saying it was complete and then pinning it, but that was added to the script that makes the dev DBs so it’s automated now. Since the whole process takes 15 minutes and there’s that 15 minute buffer, it used to mean being around for ~30 minutes – eww.
But, if you factor in just the 2nd interrupt alone…
I need to remember to switch to my tmux session to run the command at about that time but what if I have a live demo meeting that starts in a few minutes? What if I start to get into the zone on something else and completely forget? What if I happen to want to go to lunch?
I’ll admit things rarely fall through the cracks for me but I’m not perfect. I once forgot to run the command for an hour because I got in the zone on something else. To combat that I actually resorted to a system involving physical paper: I place a little sticky note under my monitor when I have a pending request.
## It Was So Obvious
While interviewing candidates for a DevOps engineer / platform / SRE role I was thinking about questions to ask and used my previous post of 120+ tech skills I use to help think about what I wanted to ask.
As soon as I read “Writing scripts to solve ADHOC business tasks”, I instantly thought to myself why don’t you just do `sleep 900 && ./dev-db` and in this case I can adjust the sleep time to be whatever it needs to be (or none for urgent requests).
Now I can run that command at the time of interrupt 1 and interrupt 2 never happens. It’s been transformed into a set-it-and-forget-it task. The script itself is super robust and hasn’t failed in longer than I can remember. If it ever does fail, I can always enhance the script to notify me.
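If failure notifications ever become necessary, one possible shape is a small wrapper around the command. This is only a sketch: `notify_failure` and `run_and_report` are hypothetical names, and the notifier here just prints to stderr where a real one might post to chat.

```shell
# Hypothetical notifier: this one only prints to stderr. Swap the body for
# a chat webhook or desktop notification of your choice.
notify_failure() {
  echo "dev DB script failed with exit code $1" >&2
}

# Runs the given command and calls notify_failure if it exits non-zero.
run_and_report() {
  status=0
  "$@" || status=$?
  if [ "${status}" -ne 0 ]; then
    notify_failure "${status}"
  fi
  return "${status}"
}

run_and_report true   # stand-in for ./dev-db, succeeds quietly
```

Assuming these functions are defined in the current shell, the one-liner would become `sleep 900 && run_and_report ./dev-db`. The wrapper passes the original exit code through, so anything chained after it still sees the failure.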
Also, since the script runs on my work machine and not within a CI pipeline or serverless function the added idle time of the script waiting doesn’t matter.
It was so basic and so easy to implement. I can even bake the sleep into the script and have it default to 900 seconds (15 minutes) but that didn’t feel worth it, especially since this whole process and workflow will get revamped entirely at some point.
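For reference, baking the delay in could look roughly like the sketch below. It assumes the delay arrives as the first argument. `create_dev_db` is a made-up stand-in that only prints a message, since the real script isn't shown here.

```shell
# Stand-in for the real work. The actual dev DB script isn't shown in this
# post, so this only prints a message.
create_dev_db() {
  echo "creating dev DB"
}

# First argument is the delay in seconds, defaulting to 900 (15 minutes).
dev_db() {
  delay="${1:-900}"
  echo "waiting ${delay} seconds"
  sleep "${delay}" && create_dev_db
}

dev_db 0   # urgent request: skip the wait
```

Calling `dev_db` with no argument would wait the full 15 minutes, and `dev_db 600` would cover the "seen 5 minutes late" case. The trade-off is that the script then owns a concern that plain `sleep` already handles on its own.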
This is a nice example of the Unix philosophy where we use small focused tools to solve specific problems. Sleep can be used on its own, the dev DB script can be used on its own but when we combine them then we solve this specific problem.
The video below goes over this post.
# Demo Video
Timestamps
- 0:25 – Using sleep is not just for memes
- 0:57 – A real world use case
- 3:57 – Going over the manual workflow
- 5:52 – Reducing human error
- 7:01 – Sleep for the win
When was the last time you used sleep to solve a problem? Let me know below.