Debugging a Container Error When It Produces No Log Output
It can get even more fun when the problem only happens in production. Here’s a scenario where something like this happened and how I fixed it.
Prefer video? Here’s a recorded version of this post on YouTube.
This is a fun one, imagine this:
- Your web app and worker are happily running in production and everything works
- You deploy a new version of your code
- Your web container doesn’t start up successfully
- Your web container errors out with a generic exit code such as 255
- Your stopped web container that crashed has no log output, not even 1 line
- If you’re using Kubernetes, the events only say CrashLoopBackOff with no details
- Your background worker using the same Docker image does start up successfully
But:
- The new version fully runs fine on your local dev box
- The new version builds and fully runs successfully in CI
- You pull down the image from your registry that CI built and run it locally successfully
- You even remove all volumes to better replicate production; nope, same thing
What would you do in this spot to start debugging the problem?
This recently happened for client work and figuring out what happened was an adventure.
# What Happened to Your Container’s Logs?
I always liked that one scene from The Matrix which is “What good is a phone call if you’re unable… to… speak?” and then Neo’s lips fuse together. It’s so unsettling.
What if your process cannot speak because something prevented it from running?
That something could be your `ENTRYPOINT` script. A pretty basic one is:
```bash
#!/usr/bin/env bash

set -o errexit

echo "TODO: Your custom entrypoint script logic goes here"

exec "${@}"
```
The last line passes control back to your process and runs it. Everything above that line will run before your process starts.
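As a quick sanity check (assuming the script above is saved as `entrypoint.sh`), you can run it directly and watch the handoff happen:

```bash
# Save the script above as entrypoint.sh, make it executable and pass any
# command after it. That command becomes "${@}" and gets exec'd.
chmod +x entrypoint.sh
./entrypoint.sh echo "hello from the handed-off process"

# Expected output:
#   TODO: Your custom entrypoint script logic goes here
#   hello from the handed-off process
```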
Since we’re using `errexit`, the script will halt on any error. If a command in the `TODO` section happens to exit with a non-zero exit code without producing any log output, then your container will stop with that exit code and produce no log output. Your process never had a chance to run.
You can test this by running the command `false`, which exits 1 and produces no output.
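For example, here’s the same script with `false` dropped in where the custom logic would go (a sketch of the failure, not something you’d ship):

```bash
#!/usr/bin/env bash

set -o errexit

# "false" exits 1 and prints nothing, so errexit halts the script right here.
# The exec below never runs and the container stops with exit code 1 and
# zero lines of log output.
false

exec "${@}"
```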
You can confirm that by commenting out the `errexit` line; now if an error happens the script will keep running. At the very least your process should start or get called and error out with more specific details.
A takeaway is to produce some type of helpful log message if your custom command or logic fails because that will give you direction.
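One way to do that, sketched below with `some-cache-command` standing in for whatever your entrypoint actually runs, is to wrap the command and log before exiting:

```bash
#!/usr/bin/env bash

set -o errexit

# Hypothetical example: wrap the command so a failure logs a hint instead of
# killing the container silently.
if ! some-cache-command; then
  echo "ERROR: some-cache-command failed in the entrypoint" >&2
  exit 1
fi

exec "${@}"
```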
Also, having a health check on your container is quite useful here because if you don’t see your health check’s output (such as an HTTP request) you can at least conclude that your web app’s process didn’t get a chance to run.
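For instance, a minimal HTTP health check could be as simple as the command below (the port and path are made up for the example), wired up as a Docker `HEALTHCHECK` or a Kubernetes probe:

```bash
# If this never succeeds (or your app never logs the request), you can assume
# the web process never started.
curl --fail --silent http://localhost:8000/up || exit 1
```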
# Conditional Logic Depending on Your App’s Env
Just to expand on the above entrypoint, what if you have something like this:
```bash
#!/usr/bin/env bash

set -o errexit

if [ "${APP_ENV}" != "development" ]; then
  echo "TODO: Run some command here for non-development environments"
fi

exec "${@}"
```
That’s not too crazy. When your container starts, maybe you run a quick cache-related command to optimize things in production and other non-development envs.
This explains why something might work in development and CI but fail in production. In development and CI maybe you’re running your app with `APP_ENV="development"`, so the condition to run that command never fires to catch the failure early.
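A cheap way to catch this before deploying is to run the CI-built image locally with a non-development value set (the image name here is a placeholder):

```bash
# Run the image with a non-development APP_ENV so the conditional block in
# the entrypoint actually executes.
docker run --rm -e APP_ENV=production your-app-image:latest

# A silent non-zero exit code here reproduces the production failure.
echo "${?}"
```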
That’s also an example of how dangerous it can be to have conditional logic like this in your entrypoint. It’s not doing anything wrong, but it’s important to be aware that not all code paths may be exercised by your test suite.
That’s what happened in this case: the command being run was something a developer set up which produced no log output, and it was only a problem in production. This adventure could have been a 5 minute non-adventure with log details.
# (Worker == Working && Web == Failing) == WTF
Right, so maybe you’re using Docker Compose in development but Kubernetes in production. That’s not too unreasonable.
In your Docker Compose setup, your web and worker services both run the entrypoint because you didn’t set `entrypoint: []` in your worker to cancel it out.
But in Kubernetes your web deployment’s pod runs your app’s web server, whereas your worker deployment’s pod runs a different command. The web container uses `args` to set the command but the worker container uses `command`.

There is a huge difference between `command` and `args` (there’s a Docker-flavored sketch after this list):
- If you supply `command` but not `args`, only the command runs; your entrypoint is ignored
- If you supply `args` but not `command`, your entrypoint will run with the args you supply
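If it helps to reason about this in Docker terms, Kubernetes’ `command` overrides the image’s `ENTRYPOINT` and `args` overrides its `CMD`. Here’s a rough sketch where the image and command names are placeholders:

```bash
# Roughly what the worker deployment does: "command" replaces the ENTRYPOINT,
# so the entrypoint script baked into the image never runs.
docker run --rm --entrypoint run-worker your-app-image:latest

# Roughly what the web deployment does: "args" replaces the CMD, so the
# image's ENTRYPOINT still runs first and then exec's run-web-server.
docker run --rm your-app-image:latest run-web-server
```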
Yep, that explains why the worker came up successfully whereas the web container did not. You may have done that as an optimization because technically you don’t need your entrypoint to run in the background worker.
The takeaway here is to make your Docker Compose setup more aligned with what you’re doing in Kubernetes. If you’re not running an entrypoint for something in production you should also not run it in development.
The video below goes over some of these scenarios in an example project.
# Demo Video
Timestamps
- 0:29 – Explaining the scenario
- 2:16 – Check your entrypoint script
- 6:20 – Environment conditionals in your entrypoint
- 7:44 – How it worked for the worker but failed for web
Have you encountered this before? Let us know below.