Debugging a Container Error When It Produces No Log Output
It can get even more fun when the problem only happens in production. Here’s a scenario where something like this happened and how I fixed it.
Prefer video? Here’s a recorded version of this post on YouTube.
This is a fun one, imagine this:
- Your web app and worker are happily running in production and everything works
- You deploy a new version of your code
- Your web container doesn’t start up successfully
- Your web container errors out with a generic exit code such as 255
- Your stopped web container that crashed has no log output, not even 1 line
- If you’re using Kubernetes, the events only say CrashLoopBackOff with no details
- Your background worker using the same Docker image does start up successfully
But:
- The new version fully runs fine on your local dev box
- The new version builds and fully runs successfully in CI
- You pull down the image from your registry that CI built and run it locally successfully
- You even remove all volumes to better replicate production; nope, same thing
What would you do in this spot to start debugging the problem?
This recently happened for client work and figuring out what happened was an adventure.
# What Happened to Your Container’s Logs?
I always liked that one scene from The Matrix which is “What good is a phone call if you’re unable… to… speak?” and then Neo’s lips fuse together. It’s so unsettling.
What if your process cannot speak because something prevented it from running?
That something could be your `ENTRYPOINT` script. A pretty basic one is:
```bash
#!/usr/bin/env bash

set -o errexit

echo "TODO: Your custom entrypoint script logic goes here"

exec "${@}"
```
The last line passes control back to your process and runs it. Everything above that line will run before your process starts.
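As a quick sanity check (assuming the script above is saved as `entrypoint.sh`), you can run it directly and watch the handoff happen:

```bash
# Save the script above as entrypoint.sh, make it executable and pass any
# command after it. That command becomes "${@}" and gets exec'd.
chmod +x entrypoint.sh
./entrypoint.sh echo "hello from the handed-off process"

# Expected output:
#   TODO: Your custom entrypoint script logic goes here
#   hello from the handed-off process
```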
Since we’re using `errexit`, the script will halt on any error. If a command in the `TODO` section happens to exit with a non-zero exit code without producing any log output, then your container will stop with that exit code and produce no log output. Your process never had a chance to run.
You can test this by running the command `false`, which exits 1 and produces no output.
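For example, here’s the same script with `false` dropped in where the custom logic would go (a sketch of the failure, not something you’d ship):

```bash
#!/usr/bin/env bash

set -o errexit

# "false" exits 1 and prints nothing, so errexit halts the script right here.
# The exec below never runs and the container stops with exit code 1 and
# zero lines of log output.
false

exec "${@}"
```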
You can confirm that by commenting out the `errexit` line; now if an error happens the script will keep running. At the very least your process should start or get called and error out with more specific details.
A takeaway is to produce some type of helpful log message if your custom command or logic fails because that will give you direction.
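One way to do that, sketched below with `some-cache-command` standing in for whatever your entrypoint actually runs, is to wrap the command and log before exiting:

```bash
#!/usr/bin/env bash

set -o errexit

# Hypothetical example: wrap the command so a failure logs a hint instead of
# killing the container silently.
if ! some-cache-command; then
  echo "ERROR: some-cache-command failed in the entrypoint" >&2
  exit 1
fi

exec "${@}"
```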
Also, having a health check on your container is quite useful here because if you don’t see your health check’s output (such as an HTTP request) you can at least conclude that your web app’s process didn’t get a chance to run.
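For instance, a minimal HTTP health check could be as simple as the command below (the port and path are made up for the example), wired up as a Docker `HEALTHCHECK` or a Kubernetes probe:

```bash
# If this never succeeds (or your app never logs the request), you can assume
# the web process never started.
curl --fail --silent http://localhost:8000/up || exit 1
```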
# Conditional Logic Depending on Your App’s Env
Just to expand on the above entrypoint, what if you have something like this:
```bash
#!/usr/bin/env bash

set -o errexit

if [ "${APP_ENV}" != "development" ]; then
  echo "TODO: Run some command here for non-development environments"
fi

exec "${@}"
```
That’s not too crazy. When your container starts, maybe you run a quick cache-related command to optimize things in production and other non-development envs.
This explains why something might work in development and CI but fail in production. In development and CI maybe you’re running your app with `APP_ENV="development"`, so the condition to run that command never fires to catch the failure early.
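A cheap way to catch this before deploying is to run the CI-built image locally with a non-development value set (the image name here is a placeholder):

```bash
# Run the image with a non-development APP_ENV so the conditional block in
# the entrypoint actually executes.
docker run --rm -e APP_ENV=production your-app-image:latest

# A silent non-zero exit code here reproduces the production failure.
echo "${?}"
```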
That’s also an example of how dangerous it can be to have conditional logic like this in your entrypoint. It’s not doing anything wrong, but it’s important to be aware that not all code paths may be exercised by your test suite.
That’s what happened in this case: the command being run was something a developer set up which produced no log output, and it was only a problem in production. This adventure could have been a 5 minute non-adventure with log details.
# (Worker == Working && Web == Failing) == WTF
Right, so maybe you’re using Docker Compose in development but Kubernetes in production. That’s not too unreasonable.
In your Docker Compose setup, your web and worker services both run the entrypoint because you didn’t set `entrypoint: []` in your worker to cancel it out.
But in Kubernetes your web deployment’s pod runs your app’s web server, whereas your worker deployment’s pod runs a different command. The web container uses `args` to set the command but the worker container uses `command`.

There is a huge difference between `command` and `args` (there’s a Docker-flavored sketch after this list):
- If you supply `command` but not `args`, only the command runs; your entrypoint is ignored
- If you supply `args` but not `command`, your entrypoint will run with the args you supply
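If it helps to reason about this in Docker terms, Kubernetes’ `command` overrides the image’s `ENTRYPOINT` and `args` overrides its `CMD`. Here’s a rough sketch where the image and command names are placeholders:

```bash
# Roughly what the worker deployment does: "command" replaces the ENTRYPOINT,
# so the entrypoint script baked into the image never runs.
docker run --rm --entrypoint run-worker your-app-image:latest

# Roughly what the web deployment does: "args" replaces the CMD, so the
# image's ENTRYPOINT still runs first and then exec's run-web-server.
docker run --rm your-app-image:latest run-web-server
```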
Yep, that explains why the worker came up successfully whereas the web container did not. You may have done that as an optimization because technically you don’t need your entrypoint to run in the background worker.
The takeaway here is to make your Docker Compose setup more aligned with what you’re doing in Kubernetes. If you’re not running an entrypoint for something in production you should also not run it in development.
The video below goes over some of these scenarios in an example project.
# Demo Video
Timestamps
- 0:29 – Explaining the scenario
- 2:16 – Check your entrypoint script
- 6:20 – Environment conditionals in your entrypoint
- 7:44 – How it worked for the worker but failed for web
Have you encountered this before? Let us know below.