Updated on April 16, 2024 in #dev-mindset

120+ Skills I Use in an SRE / Platform / DevOps Developer Position

These are technical topics I've actively used or thought about for the last 2 years while working in 1 role.

Prefer video? Here it is on YouTube. It includes more detail about each skill.

I mentioned 2 years but that doesn’t mean you’re expected to know everything that’s listed with 2 years of experience. I’ve been building and deploying web apps for over 20 years. These are some skills I exercised in 1 role for ~2 years.

I rarely think about skills like this but thought it could be interesting to write out because:

- If you want to get into this field, these are a few topics to explore
- If you’re a company hiring someone, you can use this to help write a job description

In this post, we’ll be focusing mostly on technical skills and not soft skills. Soft skills can be covered in a future post because there’s a lot to think about there too. Examples of soft skills are managing up, big picture thinking, emotional intelligence and more.

# The ‘DevOps’ Elephant in the Room

Yes, I also don’t like the term “DevOps developer” or “DevOps engineer” as a role type because in my mind DevOps is a business wide philosophy but I do think folks use this term to describe a very real combination of technical skills.

It boils down to someone who likes infrastructure and most things related to an SRE (Site Reliability Engineer) or platform role but also has a lot of experience building web apps.

Topics like web servers, databases, caching, background workers, keeping app package dependencies up to date and other stuff come into play. This unlocks a whole category of ways you can provide business value. Suddenly you know how to scale background workers and have no problem jumping into app code to make changes – even if you don’t fully know the programming language!

Hopefully there are only minor code changes, like replacing hard coded config values with environment variables, setting up health check endpoints and things of that nature.
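As a rough illustration of those two kinds of minor changes, here’s a minimal Python sketch using only the standard library. The setting names (`DATABASE_URL`, `WORKER_COUNT`), their defaults and the `/healthz` path are assumptions made up for this example, not something from a specific app.

```python
import json
import os
from http.server import BaseHTTPRequestHandler


def load_settings(env=os.environ):
    """Read settings from environment variables instead of hard coding them.

    The variable names and defaults here are hypothetical.
    """
    return {
        "database_url": env.get("DATABASE_URL", "sqlite:///dev.db"),
        "worker_count": int(env.get("WORKER_COUNT", "2")),
    }


def health_payload():
    """Return the status code and JSON body for a basic health check."""
    return 200, json.dumps({"status": "ok"})


class HealthHandler(BaseHTTPRequestHandler):
    """Answer GET /healthz with a 200 and anything else with a 404."""

    def do_GET(self):
        if self.path == "/healthz":
            status, body = health_payload()
        else:
            status, body = 404, json.dumps({"error": "not found"})

        data = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)
```

To try it locally you could pass `HealthHandler` to `http.server.HTTPServer` and call `serve_forever()`. In a real app the health check would usually live in whatever web framework the app already uses, and it might also verify that dependencies like the database are reachable.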

You can talk shop with developers and understand their perspective on things. For bigger code changes such as uploading files to S3 instead of a local file system when moving into Kubernetes, you can write tickets in enough detail to help developers think about edge cases and other concerns.

Also, I’ve found in practice that a lot of developers don’t think about, or more commonly aren’t interested in, what happens to their code after it’s running in production. If you have experience building and deploying web apps, you can help level up everyone (yourself included) by sharing knowledge and applying examples directly to your org’s apps.

This can really move the ball forward on building more resilient, stable and scalable applications. I think there’s a lot of value in having this combo of experience.

# How Do You Describe This Role?

Besides the 120+ skills listed below, there are also things like being an on demand consultant for a dev team to solve general problems, maximizing developer happiness with tools that make things easier, keeping an eye on technical debt while advocating and promoting engineering principles to help guide a team on delivering maintainable code at a sustainable pace – all while attempting to avoid stepping on anyone’s toes.

You may also occasionally dabble in risk level topics like ensuring the company’s SaaS app is compliant with US sales tax liabilities, or be responsible for helping a business line move from Slack to Teams because the CTO determined that’s the best move for the business.

You also participate outside of your team to help shape system and technology choices for your whole organization. You can be counted on by the org to be an “action taker” and “problem solver” for almost anything.

You’re comfortable chatting with vendors and providing demos or presentations to anyone in the org, including on non-technical topics such as helping non-technical folks move from tool A to tool B.

You’re also trusted to come up with a system to run technical interviews for engineering positions along with participating in those interviews.

The above list can go on and on. That’s a tiny handful of ancillary topics that came up in the last 2 years. I wanted to include them because besides solving pure technical problems like writing a script, there are things like that which come up in your day to day.

It’s kind of like a jack of all trades tech focused role but instead of being a master of none, you have enough knowledge in enough related things that you can cross a bunch of wires internally to provide solutions to a wide range of problems.

In one role, the title was “LORE” which is Lead Operations and Reliability Engineer. It was picked solely due to the acronym sounding cool and it has a fun easter egg related to Star Trek: The Next Generation.

For anyone who knows who Lore is from TNG, no I don’t really think of myself in that way.

I like keeping notes and it gave me personal joy to name my work notes directory “lore” because the definition of the word is related to storing knowledge about a subject.

Now onto the skills…

# The Skills

I time boxed myself to about an hour to come up with this list (minus the descriptions) so it’s likely I’m missing quite a few things. This should be good enough to begin with and I’ll add to it if I think of something later.

I also tried to keep it high level. For example the first item in the list has about 100 sub-topics based on what you’re doing. I didn’t bother putting “S3” or “object storage” as part of the top level list.

AWS
- IAM, EC2, EKS, ALB, S3, RDS, ElastiCache and other resources to host web apps
AWS SAM
- Used to build and deploy Serverless Lambda functions
Access control
- Who can access what across multiple apps, resources, services, and tools
Ansible
- Configure your servers in an automated way
Apache
- A web server (perhaps the org you’re with is already using it)
Architectural patterns
- How can we develop technical systems that work for our business requirements?
Argo CD
- Helps deploy Kubernetes apps by converging the state of a git repo to your cluster
Automation
- Taking any manual task and automating it with tools, services, etc.
Back-end development
- Creating the server side component of a web app
Background job processing
- Executing tasks outside of the request / response cycle
Bash / Shell scripting
- Writing scripts to solve ad hoc business tasks
Batch processing
- Purposely breaking up tasks in batches of N to prevent performance issues
Benchmarking and profiling
- How does a site perform when under varied load? How can it be measured?
Bitbucket
- A service to host source code using git
Building CLI tools
- Internal tools to help automate workflows or make processes easier
CI pipelines
- Automate things like linting, testing and building images / deployable artifacts
Cache busting
- Digest static assets so web servers or a CDN can use long lived cache headers
Caching
- Prevent expensive operations like DB queries or API calls if nothing has changed
CloudFormation
- An AWS tool to help automate creating and configuring resources
Code formatting
- Using and configuring tools to assist in formatting code (Black in Python, etc.)
Code linting
- Using and configuring tools to find issues in your code (Flake8 in Python, etc.)
Command line
- Using tools like grep, sed, cut and friends to solve ad hoc business problems
Concurrency and parallelism
- Something that can be done to help reduce how long tasks take to complete
Configuration management
- Tools like Ansible or AWS SAM are examples of this
Confluence
- Store knowledge in a way that’s digestible and searchable by a business
Container runtimes
- Knowing the difference between containerd and things like dockershim (removed in K8s 1.24)
Cost optimization and analysis
- How can we keep an eye on cloud costs and what can we do to reduce costs?
Cron jobs
- Run tasks on a recurring schedule, such as making an API call at 2am every day
DDoS prevention
- What can we do to prevent our site from going down if under a large scale attack?
DKIM
- Typically combined with DMARC and SPF to verify emails are sent by a domain
DMARC
- Typically combined with DKIM and SPF to verify emails are sent by a domain
DNS
- Ultimately map hostnames to an IP address, weighted DNS, etc.
Database backups and restoring
- Backup and restore your data in a resilient way (disaster recovery, etc.)
Databases
- Which DB works well for this use case? What size do we need? Writing queries
Datadog
- A service to help collect, view and alert on logs and metrics
Debugging
- Something isn’t working as expected, how do we find the root cause?
Docker
- Run a process in isolation and gain cross-OS portability
Docker Compose
- Run 1 or more containers together, such as everything related to 1 project
Documentation
- Explaining how to use something and hopefully also cover the “why”
Dotfiles
- Configure various shell related tools to create a personalized experience
Email delivery validation
- Help ensure emails can be trusted and don’t get marked as spam
Encryption
- Protecting user data at rest and in transit
Environment variables
- Tweak settings that change in dev vs prod and help protect secrets
Event driven architectures
- For example, triggering a function after an S3 file is uploaded, etc.
Excel
- Sometimes a spreadsheet is really what you need to solve something quickly
Feature flags
- Enabling or disabling features for a subset of users until it’s fully rolled out
File systems
- At least knowing what’s different between Linux and macOS, performance, etc.
Firewalls
- Monitor and potentially block traffic within a network
Flask
- Python based micro-web framework for building web apps and APIs
Front-end development
- Tools to help manage CSS and JS but you can also think of user experience too
Full text search
- A method to effectively search for strings and phrases within a database
General hardware knowledge
- AMD64 vs ARM64, SSD / IOPS and knowing how to measure / pick hardware
Git
- Make commits, merge branches, resolve conflicts, hooks, create workflows, etc.
GitLab
- A service to host source code using git
HTTP or generally how the web works
- Generally knowing how everything works together, HTTP verbs, etc.
Health checks
- Continuous monitoring to ensure your app is available
Helm
- A template based approach to help manage Kubernetes configs and apps
Horizontal scaling
- Adding more servers or containers to help handle more traffic
Infrastructure as code
- The idea of provisioning resources and config files with code
Jira
- A service to help manage working on tickets
Kubernetes
- A tool to automate deploying and scaling containerized applications
Kustomize
- A patch based approach to help manage Kubernetes configs and apps
Linux
- Being proficient in using a Linux distro of your choice (Debian, Ubuntu, etc.)
Load balancing
- A way to distribute traffic across resources (random, round robin, etc.)
Load testing
- Using tools like wrk to see how a site performs under load
Logging
- What to log? What not to log? How to parse logs and control costs
Markdown
- A light weight text format to help write documentation and notes
MFA / 2FA
- An added level of protection around securing access to an account
Memcached
- A tool to help cache information
Microservices
- A strategy to help organize and break up an application (it’s not a silver bullet)
Monitoring and alerting
- If something goes wrong, how can you get notified?
Multiple deployment environments
- Pull request environments, shared staging environment, production, demo, etc.
NGiNX
- Web server, reverse proxy, load balancer, TLS termination, redirects and more
Networking
- General knowledge around protocols (TCP, HTTP) and types (LAN, WAN, etc.)
OpenAPI spec
- A specification for documenting APIs
Operating systems
- Knowing your way around Windows, macOS and Linux
POSIX compliance
- Do your shell scripts need maximum compatibility? How can you achieve this?
Package managers
- Knowing how to use and configure tools like apt, brew, etc.
Patch management
- Keeping your servers and app dependencies up to date
Payment gateways
- Accepting multiple types of payments (Credit card, PayPal, Apple pay, etc.)
Penetration testing
- Evaluating the results of pen tests and knowing which reports apply to your app
Performance monitoring
- Actively understand how your app is performing and how to improve things
Pingdom
- A service to monitor and track uptime as well as notify you of downtime
Port forwarding
- Knowing when you want to make a port available within a network
Privacy
- Protecting the privacy of your customers
Process management
- Understanding systemd and how Docker / Kubernetes keep processes up
Proxies
- Passing traffic through a server, this includes reverse proxies such as NGiNX
Python
- A programming language, it’s oftentimes a nice choice for scripting and web apps
Quality assurance
- Thinking about the happy and unhappy cases, edge cases, etc.
RESTful APIs
- A way to organize APIs to keep them predictable and maintainable
Redis
- A swiss army knife for caching, queue back-end, pub / sub and more
Release management
- Creating tools and workflows for efficient and dependable releases
Reverse engineering
- Figuring out how something works by breaking it down
SMTP
- A protocol for sending email
SOC 2 Type II compliance
- A framework to help ensure service providers process client data securely
SPF
- Typically combined with DKIM and DMARC to verify emails are sent by a domain
SSH / SCP
- Configuring secure access to a server and transferring files
SSO (Single Sign-On)
- This includes setting up social logins but also org wide SSO such as Okta
Sanitizing PII (Personally Identifiable Information)
- Protect customer data by redacting it in non-production environments
Scaling web apps
- Knowing where the bottlenecks are and how to address them
SealedSecrets
- A GitOps friendly way to manage secrets within Kubernetes
Security
- Hashing, encryption, access, social engineering, avoiding malware, etc.
Signal processing
- Graceful shutdowns, SIGKILL vs SIGTERM, etc.
Snyk
- A service to scan your apps and infrastructure for security vulnerabilities
Static code analysis
- Anything you can run against your source code (formatting, linting, vuln. scanning)
Stripe
- A payment gateway to accept credit cards and offer money related services
Swagger
- A way to view API docs that’s compatible with the OpenAPI specification
System design
- Creating a solution after understanding the input and output requirements
TLS / SSL certificates
- Understanding the moving parts around this and what’s secure vs what’s not
Task queues
- A way to run tasks that are executed by a background worker
Terraform
- Configure your infrastructure (servers, DNS, etc.) in an automated way
Testing
- Knowing what and when to test and how to verify those tests
Tool comparisons
- Evaluate different tools and arrive at a good solution
Troubleshooting
- Systematic steps you can take to resolve both known and unknown problems
Vertical scaling
- Adding more compute resources (CPU, memory, etc.) to a single server / container
VPN
- Create a secure connection between networks and provide a static IP address
VPS
- Virtual private servers are isolated environments to run an OS
VSCode
- A popular code editor, I don’t use it but your dev team might so it’s worth knowing
Virtual Machines
- Spin up virtualized operating systems for testing or usage
Vulnerability scanning
- Knowing how to identify, document, assess, triage and / or fix vulnerabilities
Web development
- Building web apps, such as internal tools to help folks get things done
Websockets
- A protocol for handling bi-directional web traffic such as chat, etc.
Webhooks
- Sending a web request to another service after an event triggers
WordPress
- Maybe your company’s blog is run on this, knowing how plugins / updates work
Zero downtime app deployments
- When deploying new versions of your app, its service isn’t disrupted
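
To make two of the items above a bit more concrete (batch processing and cache busting), here’s a minimal Python sketch. The function names `batched` and `digest_filename` are made up for this example, and the 8 character SHA-256 prefix is just an assumption. Asset build tools pick their own hash and length.

```python
import hashlib
from pathlib import PurePosixPath


def batched(items, size):
    """Yield consecutive lists of at most `size` items from a sequence.

    Working through N items at a time keeps any single DB query or
    API call from growing without bound.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def digest_filename(name, content, length=8):
    """Insert a short content hash before the file extension.

    The file name only changes when the content changes, so a web
    server or CDN can serve it with long lived cache headers.
    """
    digest = hashlib.sha256(content).hexdigest()[:length]
    path = PurePosixPath(name)
    return str(path.with_name(f"{path.stem}-{digest}{path.suffix}"))
```

For example, `list(batched([1, 2, 3, 4, 5], 2))` gives `[[1, 2], [3, 4], [5]]`, and `digest_filename("css/app.css", b"body{}")` returns a path shaped like `css/app-<hash>.css`.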

The video below goes into more detail about each of these skills.

By the way, if anyone is hiring I am available for hire.

# Video Explanation

Timestamps

1:15 – The DevOps elephant in the room
3:25 – How do you describe this role?
7:10 – The skills
8:03 – Skills 1-25
20:47 – Skills 26-50
29:43 – Skills 51-75
39:25 – Skills 76-100
47:57 – Skills 101-124

Which skills do you commonly use in this type of role? Let us know below.
