Debugging an Idempotency-Related User Group Issue with Ansible
Today I had a one-hour adventure debugging a very subtle issue with Ansible. Here's what went wrong, how to fix it and how I approached it.
One really cool thing about Ansible is once you have things set up, you kind of sit back and use the building blocks you’ve created until you come across limitations or issues where you feel compelled to update something.
That’s pretty much what happened here.
I’ve been working with Ansible for about half a decade. During that time I’ve written dozens upon dozens of roles, tons of playbooks and even some third party tools to help make working with large Ansible projects easier.
Some roles are more up to date than others, but as I’m preparing my next course, which will be on using Ansible to deploy real-world web apps, I wanted to polish a bunch of my roles and add a few new ones.
My current `user` role is designed to create 1 admin user, but I am currently doing some client work where they requested an ability to add multiple users instead of just 1. That seems like a reasonable request so I wanted to roll that into my `user` role.
# It All Started with Updating My User Role
The task that creates a user in my old user role was this:

    - name: "Create user"
      user:
        name: "{{ user_name }}"
        groups: "{{ (user_groups | join(',')) }}"
        generate_ssh_key: "{{ user_generate_ssh_key }}"
        shell: "{{ user_shell }}"

As you can see, there’s no loop. It just sets up a single user. It’s also worth pointing out that by default `user_groups` is an empty list. In addition to working with multiple users, I wanted the updated role to put these users into an `admins` and `sshusers` group by default.
So, I went to my whiteboard and wrote out an objectives list for the updated role:
- Should be able to add a user to N amount of groups
- Should default to putting users into `admins` and `sshusers` groups
- Should be able to add 1 or more users, each with their own SSH keys and groups
- Should optionally enable passwordless sudo for a specific group
- Should be able to use with zero configuration but can override everything
After I had my objectives out of the way, I went to work and came up with a solution that hits every point in the above list.
# The New Role’s Create User Task
Here’s what it looked like initially:

    - name: Create user(s)
      user:
        name: "{{ item.name | default(user_default_name) }}"
        groups: "{{ item.groups | default(user_default_belongs_to_groups) }}"
        shell: "/bin/bash"
      loop: "{{ user_default_accounts + user_accounts }}"

And here’s what the default variables looked like for context:

    user_create_groups: ["admins", "sshusers"]

    # lookup('env', ...) yields an empty string when USER is unset, so
    # default() needs its second argument set to true to fall back to "admin".
    user_default_name: "{{ lookup('env', 'USER') | default('admin', true) }}"
    user_default_belongs_to_groups: "{{ user_create_groups }}"
    user_default_local_ssh_public_key: "~/.ssh/id_rsa.pub"

    user_default_accounts:
      - name: "{{ user_default_name }}"
        groups: "{{ user_default_belongs_to_groups }}"
        local_ssh_public_key: "{{ user_default_local_ssh_public_key }}"

    user_accounts: []

    user_passwordless_sudo_group_name: "admins"

So far so good. The idea here is a default user will be created, and you can define your own `user_accounts` without having to define a new default user. This makes the role a bit nicer to use for adding additional users.
This pattern of using a `defaults_` list, a regular list and then combining them together is something I first saw in the DebOps project years ago. It’s pretty sweet!
Anyways, things look pretty similar to the original role for the “Create user” task. The only real difference is it’s looping over a list of users instead of just 1 user.
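To make the defaults-plus-overrides pattern concrete, here’s a rough sketch of what overriding `user_accounts` might look like. The file location, user names and key paths are made up for illustration; only the variable names come from the defaults shown above.

```yaml
# group_vars/all.yml (hypothetical file and values, for illustration only)
user_accounts:
  - name: "alice"
    groups: ["admins", "sshusers"]
    local_ssh_public_key: "~/.ssh/alice.pub"

  # "groups" is omitted here, so the "Create user(s)" task falls back to
  # user_default_belongs_to_groups through the default() filter.
  - name: "bob"
    local_ssh_public_key: "~/.ssh/bob.pub"
```

With something like this in place, the loop would run over the default account plus these two, since the task iterates over `user_default_accounts + user_accounts`.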
# Testing the New Role in Isolation
Now that the role was written it was time to test it. I’m a big fan of running Ansible tests inside of a Docker container. It’s super fast to iterate on a role and get feedback. I’ll talk more about that another time.
But the takeaway here is I was running the new user role by itself against a fresh Ubuntu 18.04 base Docker image and it passed a bunch of rigorous tests I had written.
It was also idempotent, meaning if you ran the role a 2nd time and your variables didn’t change then it would report back 0 changes. This is good, and you should strive for this for all of your roles.
# Doing a Bit of Client Work and Testing a Real System
Writing and testing roles locally is cool, but the real test for Ansible is running all of the roles you plan to use against a real system to make sure things work.
This client I’m doing some work for is deploying a Dockerized app and about a week ago I released a pretty big update to my Ansible Docker role.
I’ve been using it for my own projects and some client work and it’s been working out well. Naturally I wanted to use it for this current client too.
However, the new user role is mostly untested in production. I had just put it together for a separate client, but that’s how the real world is. You do the work, write some tests and cross your fingers that it all works.
There’s no way in a million years I would ever run untested Ansible roles against a production machine – especially not for a client.
So I spun up my own DigitalOcean server and did a full run. Both the new user role and Docker role functioned in the sense that they handled all of the objectives they set out to do, but the run wasn’t idempotent.
Every time I ran the playbook, it would report changes in the “Create user” task, along with another task in the Docker role.
This task in the Docker role was always changing:

    - name: Add user(s) to "docker" group
      user:
        name: "{{ item }}"
        groups: "docker"
        append: True
      loop: "{{ docker_users }}"
      when: not docker_remove_package and docker_users

In the past with the old user role, both of these tasks were idempotent, and the new versions of both of them are idempotent when run individually inside of a Docker container.
# The Debugging Process
I’ve talked about debugging a lot in the past, such as how using print is handy and rubber duck debugging. I used both tactics here, especially print.
The first thing I did was open up a notepad document and start writing down things I knew about the problem. That looked something like this:
- The old user role was idempotent when run by itself
- The Docker role was idempotent when run by itself
- The new user role was idempotent when run by itself
- The new user role and Docker role were NOT idempotent when run together
My first line of attack was to compare Ansible versions:
The Docker based test environment for individual role tests was running Ansible 2.5.1 and I was using Ansible 2.6.4 when testing everything together on DigitalOcean.
The first thing I did was drop back to using Ansible 2.5.9 on my system, thinking no backwards incompatible changes would be present in 2.5.9 vs 2.5.1, but the problem persisted with 2.5.9. Then I dropped down to 2.5.1 just to be sure, and yep, the same idempotency issues happened.
So that rules out Ansible breaking between versions.
Next up, I dropped in some debug outputs:
I attached a debug output to each task and it looked like this:

    - name: Create user(s)
      user:
        name: "{{ item.name | default(user_default_name) }}"
        groups: "{{ item.groups | default(user_default_belongs_to_groups) }}"
        shell: "/bin/bash"
      loop: "{{ user_default_accounts + user_accounts }}"
      register: create_users

    - debug:
        msg: "{{ create_users }}"

And this produced a ton of output which I should have read more carefully but after looking over it twice I didn’t see anything out of the norm so I gave up on this approach.
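One way to cut that output down is to print only the loop items that reported a change. This is a sketch rather than what was actually run during the debugging session; it assumes the `create_users` variable registered above, relies on a looped task storing its per-item results under `results`, and uses the `changed` test that exists in Ansible 2.5 and later.

```yaml
# Hypothetical follow-up task, placed right after the registered task.
- name: Show only the loop items that reported a change
  debug:
    # Each entry in "results" carries its own "changed" flag, so
    # selectattr() keeps just the items that were modified.
    msg: "{{ create_users.results | selectattr('changed') | list }}"
  # Skip the debug output entirely when nothing changed.
  when: create_users is changed
```

Each remaining entry still includes the module’s return values for that user, which is a much smaller haystack to read through.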
Then I learned about the `--diff` flag which will output any changes that were made in a git style diff inline with the task. That’s perfect, except in this case the diff produced nothing. I then learned it only works for some modules, doh!
After that, I looked at the system’s state:
When in doubt, looking at what Ansible did on the system can’t ever be a bad thing, so I started to dig around and see what was up with the user.
I ran `id` on the user that was created, and it produced:

    uid=1000(nick) gid=1002(nick) groups=1002(nick),999(docker),1000(admins),1001(sshusers)

That looks normal. I ran the playbook again which reported a change but when I ran `id` again, it produced identical results to the above. Weird!
So then I ran `getent group` and it reported back with:

    admins:x:1000:nick
    sshusers:x:1001:nick
    nick:x:1002:
    docker:x:999:nick

That looks good too. Everything lines up with the output of `id` and nothing looks malformed. I ran the playbook once again, and of course the output was the same, yet it still reported a change. That’s when I knew it was going to be one of those days. :D
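The same check can also be done from inside a playbook, which makes it easier to see the group membership at a specific point in a run instead of only at the end. The two tasks below are a sketch under that assumption; the task names and the `user_group_membership` variable are made up, and `user_default_name` is the role default shown earlier.

```yaml
# Hypothetical inspection tasks, not part of either role.
- name: Read the user's current group names
  command: "id -nG {{ user_default_name }}"
  register: user_group_membership
  # Read-only command, so never count it as a change.
  changed_when: false

- name: Print the group names
  debug:
    var: user_group_membership.stdout
```

Dropping a pair like this after each role’s user task would show the membership as each role leaves it, rather than the final state that `id` reports once the whole playbook has finished.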
Starting to get desperate, I ran 1 role at a time:
My fact list said running the roles independently works, so I tried running them in isolation against the DigitalOcean host just to make sure the Docker test setup wasn’t funky.
So I tagged the run to only execute the `user` role by itself, and running it once still reported a change on the same droplet, but then I ran it again, and WHOA it didn’t change.
Then I did the same test with the Docker role by itself and yep, it was the same deal. I had to run it at least 3 times before it was idempotent.
Naturally I began thinking “there be dragons here, running the role 3 times on DigitalOcean is ok, but running it twice is not ok. Docker is broken, DigitalOcean’s server is weird, fuck this!”.
Eliminating code, round #1:
After that, I started to eliminate potential things that could go wrong, so I removed all of the loops and variables. I peeled things back until I was just working with a single hard coded user, and the same problem happened.
Eliminating code, round #2:
Then I commented out the `groups` property and suddenly everything started to work as expected. Then I added back the loops and variables and it still worked.
Yay, finally some progress. It’s been determined that something with the groups is broken. Now that we know the problem, fixing it isn’t too bad.
Finally, progress with the groups property:
So then I ran each role independently and looked at the state of the system in between each run and then I saw the problem:
- Ran the user role & my user belonged to the `admins` and `sshusers` groups
- Ran the Docker role & my user belonged to the `admins`, `sshusers` and `docker` groups
- Ran the user role again & my user belonged to the `admins` and `sshusers` groups
And there’s the problem. When I ran everything together, the Docker role was appending the `docker` group to my user, but the user role wasn’t appending the `admins` and `sshusers` groups. Instead it was setting the user’s groups to exactly that list, which stripped away the `docker` group. On the next run the Docker role added it back, so both tasks reported a change every single time.
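The tug-of-war is easier to see when it’s boiled down to a single playbook. The one below is a hypothetical minimal reproduction, not code from either role; the `demo` user and the file name are made up. It relies on the `user` module’s `append` option defaulting to false, in which case `groups` is treated as the complete list of supplementary groups.

```yaml
# repro.yml (hypothetical file, for illustration only)
- hosts: all
  become: true
  tasks:
    - name: Ensure the groups exist
      group:
        name: "{{ item }}"
      loop: ["admins", "sshusers", "docker"]

    # Without append, this sets the supplementary groups to exactly
    # this list, removing "docker" if an earlier run added it.
    - name: Set groups the way the new user role did
      user:
        name: "demo"
        groups: ["admins", "sshusers"]

    # With append, this only adds "docker" and leaves the rest alone.
    - name: Add the docker group the way the Docker role does
      user:
        name: "demo"
        groups: "docker"
        append: true
```

Run this repeatedly against the same host and both `user` tasks should keep reporting a change, because each one undoes or redoes part of the other’s work.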
# The Fix and Explaining What Went Wrong
The fix was simple. Here’s the broken task:

    - name: Create user(s)
      user:
        name: "{{ item.name | default(user_default_name) }}"
        groups: "{{ item.groups | default(user_default_belongs_to_groups) }}"
        shell: "/bin/bash"
      loop: "{{ user_default_accounts + user_accounts }}"

And here’s the fixed task:

    - name: Create user(s)
      user:
        name: "{{ item.name | default(user_default_name) }}"
        groups: "{{ item.groups | default(user_default_belongs_to_groups) }}"
        append: True
        shell: "/bin/bash"
      loop: "{{ user_default_accounts + user_accounts }}"

I just had to tell the user role to append the groups. Mystery solved!
# Lesson Learned
One takeaway for me here is you can test the heck out of a role in isolation but never forget to run all of your associated roles together on a real system to make sure they play together nicely.
I still think initially developing a role in isolation will make you more productive. Inside of a Docker container you can spin up, run a role, run your tests and destroy everything in seconds, whereas a full multi-role run on some type of VM or remote server might take minutes.
Although I’m thankful DigitalOcean’s rebuild feature is so fast. It lets you rebuild a droplet back to its default image in about 10 seconds and you keep your old IP address.
Also, I’m thankful I have programming buddies I can talk to. For example, one of my Ansible mentors (@drybjed) let me use him as a rubber duck. I just spammed him with about 100 lines of text on IRC (to his 3-4 lines), and eventually I figured it out.
What type of fun Ansible issues have you had to debug? Let me know below!