120+ Skills I Use in an SRE / Platform / DevOps Developer Position
These are technical topics I've actively used or thought about for the last 2 years while working in 1 role.
Prefer video? Here it is on YouTube. It includes more detail about each skill.
I mentioned 2 years but that doesn’t mean you’re expected to know everything listed here after 2 years of experience. I’ve been building and deploying web apps for over 20 years. These are some skills I exercised in 1 role for ~2 years.
I rarely think about skills like this but thought it could be interesting to write out because:
- If you want to get into this field, these are a few topics to explore
- If you’re a company hiring someone, you can use this to help write a job description
In this post, we’ll be focusing mostly on technical skills and not soft skills. Soft skills can be covered in a future post because there’s a lot to think about there too. Examples of soft skills are managing up, big picture thinking, emotional intelligence and more.
# The ‘DevOps’ Elephant in the Room
Yes, I also don’t like the term “DevOps developer” or “DevOps engineer” as a role type because in my mind DevOps is a business wide philosophy. Still, I do think folks use this term to describe a very real combination of technical skills.
It boils down to someone who likes infrastructure and most things related to an SRE (Site Reliability Engineer) or platform role but also has a lot of experience building web apps.
Topics like web servers, databases, caching, background workers, keeping app package dependencies up to date and other stuff come into play. This unlocks a whole category of ways you can provide business value. Suddenly you know how to scale background workers and have no problem jumping into app code to make changes – even if you don’t fully know the programming language!
Hopefully there are only minor code changes, like replacing hard coded config values with environment variables, setting up health check endpoints and things of that nature.
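To make those two minor changes a bit more concrete, here’s a rough sketch using only Python’s standard library. The variable names (`PORT`, `DATABASE_URL`, `DEBUG`), their defaults and the `/healthz` path are all made up for illustration. Your app’s framework will likely have its own way of doing both.

```python
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


def load_config(env=os.environ):
    """Read settings from environment variables instead of hard coding them.

    The variable names and defaults here are hypothetical examples.
    """
    return {
        "port": int(env.get("PORT", "8000")),
        "database_url": env.get("DATABASE_URL", "sqlite:///dev.db"),
        "debug": env.get("DEBUG", "false").lower() in ("1", "true", "yes"),
    }


class HealthHandler(BaseHTTPRequestHandler):
    """Answer GET /healthz (an assumed path) with a small JSON payload."""

    def do_GET(self):
        if self.path == "/healthz":
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        # Stay quiet so frequent health check polling doesn't flood the logs.
        pass


def run(config):
    """Serve requests until the process is stopped."""
    HTTPServer(("0.0.0.0", config["port"]), HealthHandler).serve_forever()
```

Calling `run(load_config())` starts the server. The same code can then run in dev and prod with only the environment changing, and a load balancer or Kubernetes probe can poll the health check path.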
You can talk shop with developers and understand their perspective on things. For bigger code changes such as uploading files to S3 instead of a local file system when moving into Kubernetes, you can write tickets in enough detail to help developers think about edge cases and other concerns.
Also, I’ve found in practice that a lot of developers don’t think about, or more commonly aren’t interested in, what happens to their code once it’s running in production. If you have experience building and deploying web apps, you can help level up everyone (yourself included) by sharing knowledge and applying examples directly to your org’s apps.
This can really move the ball forward on building more resilient, stable and scalable applications. I think there’s a lot of value in having this combo of experience.
# How Do You Describe This Role?
Besides the 120+ skills listed below, there are also things like being an on demand consultant for a dev team to solve general problems, maximizing developer happiness with tools that make things easier, keeping an eye on technical debt while advocating and promoting engineering principles to help guide a team on delivering maintainable code at a sustainable pace – all while attempting to avoid stepping on anyone’s toes.
You may also occasionally dabble in risk level topics like ensuring the company’s SaaS app is compliant with US sales tax liabilities or being responsible for helping a business line move from Slack to Teams because the CTO determined that’s the best move for the business.
You also participate outside of your team to help shape system and technology choices for your whole organization. You can be counted on by the org to be an “action taker” and “problem solver” for almost anything.
You’re comfortable chatting with vendors and providing demos or presentations to anyone in the org, including on non-technical topics such as helping non-technical folks move from tool A to B.
You’re also trusted to come up with a system to run technical interviews for engineering positions along with participating in those interviews.
The above list can go on and on. That’s a tiny handful of ancillary topics that came up in the last 2 years. I wanted to include them because besides solving pure technical problems like writing a script, things like these come up in your day to day.
It’s kind of like a jack of all trades tech focused role but instead of being a master of none, you have enough knowledge in enough related things that you can cross a bunch of wires internally to provide solutions to a wide range of problems.
In one role, the title was “LORE” which is Lead Operations and Reliability Engineer. It was picked solely due to the acronym sounding cool and it has a fun easter egg related to Star Trek: The Next Generation.
For anyone who knows who Lore is from TNG, no I don’t really think of myself in that way.
I like keeping notes and it gave me personal joy to name my work notes directory “lore” because the definition of the word is related to storing knowledge about a subject.
Now onto the skills…
# The Skills
I time boxed myself to about an hour to come up with this list (minus the descriptions) so it’s likely I’m missing quite a few things. This should be good enough to begin with and I’ll add to it if I think of something later.
I also tried to keep it high level. For example the first item in the list has about 100 sub-topics based on what you’re doing. I didn’t bother putting “S3” or “object storage” as part of the top level list.
- AWS
- IAM, EC2, EKS, ALB, S3, RDS, ElastiCache and other resources to host web apps
- AWS SAM
- Used to build and deploy Serverless Lambda functions
- Access control
- Who can access what across multiple apps, resources, services, and tools
- Ansible
- Configure your servers in an automated way
- Apache
- A web server (perhaps the org you’re with is already using it)
- Architectural patterns
- How can we develop technical systems that work for our business requirements?
- Argo CD
- Helps deploy Kubernetes apps by converging the state of a git repo to your cluster
- Automation
- Taking any manual task and automating it with tools, services, etc.
- Back-end development
- Creating the server side component of a web app
- Background job processing
- Executing tasks outside of the request / response cycle
- Bash / Shell scripting
- Writing scripts to solve ad hoc business tasks
- Batch processing
- Purposely breaking up tasks in batches of N to prevent performance issues
- Benchmarking and profiling
- How does a site perform when under varied load? How can it be measured?
- Bitbucket
- A service to host source code using git
- Building CLI tools
- Internal tools to help automate workflows or make processes easier
- CI pipelines
- Automate things like linting, testing and building images / deployable artifacts
- Cache busting
- Digest static assets so web servers or a CDN can use long lived cache headers
- Caching
- Prevent expensive operations like DB queries or API calls if nothing has changed
- CloudFormation
- An AWS tool to help automate creating and configuring resources
- Code formatting
- Using and configuring tools to assist in formatting code (Black in Python, etc.)
- Code linting
- Using and configuring tools to find issues in your code (Flake8 in Python, etc.)
- Command line
- Using tools like grep, sed, cut and friends to solve ad hoc business problems
- Concurrency and parallelism
- Something that can be done to help reduce how long tasks take to complete
- Configuration management
- Tools like Ansible or AWS SAM are examples of this
- Confluence
- Store knowledge in a way that’s digestible and searchable by a business
- Container runtimes
- Knowing the difference between containerd and things like dockershim (removed in K8s 1.24)
- Cost optimization and analysis
- How can we keep an eye on cloud costs and what can we do to reduce costs?
- Cron jobs
- Run tasks on a recurring schedule, such as making an API call at 2am every day
- DDoS prevention
- What can we do to prevent our site from going down if under a large scale attack?
- DKIM
- Typically combined with DMARC and SPF to verify emails are sent by a domain
- DMARC
- Typically combined with DKIM and SPF to verify emails are sent by a domain
- DNS
- Ultimately map hostnames to an IP address, weighted DNS, etc.
- Database backups and restoring
- Backup and restore your data in a resilient way (disaster recovery, etc.)
- Databases
- Which DB works well for this use case? What size do we need? Writing queries
- Datadog
- A service to help collect, view and alert on logs and metrics
- Debugging
- Something isn’t working as expected, how do we find the root cause?
- Docker
- Run a process in isolation and gain cross-OS portability
- Docker Compose
- Run 1 or more containers together, such as everything related to 1 project
- Documentation
- Explaining how to use something and hopefully also cover the “why”
- Dotfiles
- Configure various shell related tools to create a personalized experience
- Email delivery validation
- Help ensure emails can be trusted and don’t get marked as spam
- Encryption
- Protecting user data at rest and in transit
- Environment variables
- Tweak settings that change in dev vs prod and help protect secrets
- Event driven architectures
- For example, triggering a function after an S3 file is uploaded, etc.
- Excel
- Sometimes a spreadsheet is really what you need to solve something quickly
- Feature flags
- Enabling or disabling features for a subset of users until it’s fully rolled out
- File systems
- At least knowing what’s different between Linux and macOS, performance, etc.
- Firewalls
- Monitor and potentially block traffic within a network
- Flask
- Python based micro-web framework for building web apps and APIs
- Front-end development
- Tools to help manage CSS and JS but you can also think of user experience too
- Full text search
- A method to effectively search for strings and phrases within a database
- General hardware knowledge
- AMD64 vs ARM64, SSD / IOPS and knowing how to measure / pick hardware
- Git
- Make commits, merge branches, resolve conflicts, hooks, create workflows, etc.
- GitLab
- A service to host source code using git
- HTTP or generally how the web works
- Generally knowing how everything works together, HTTP verbs, etc.
- Health checks
- Continuous monitoring to ensure your app is available
- Helm
- A template based approach to help manage Kubernetes configs and apps
- Horizontal scaling
- Adding more servers or containers to help handle more traffic
- Infrastructure as code
- The idea of provisioning resources and config files with code
- Jira
- A service to help manage working on tickets
- Kubernetes
- A tool to automate deploying and scaling containerized applications
- Kustomize
- A patch based approach to help manage Kubernetes configs and apps
- Linux
- Being proficient in using a Linux distro of your choice (Debian, Ubuntu, etc.)
- Load balancing
- A way to distribute traffic across resources (random, round robin, etc.)
- Load testing
- Using tools like wrk to see how a site performs under load
- Logging
- What to log? What not to log? How to parse logs and control costs
- Markdown
- A light weight text format to help write documentation and notes
- MFA / 2FA
- An added level of protection around securing access to an account
- Memcached
- A tool to help cache information
- Microservices
- A strategy to help organize and break up an application (it’s not a silver bullet)
- Monitoring and alerting
- If something goes wrong, how can you get notified?
- Multiple deployment environments
- Pull request environments, shared staging environment, production, demo, etc.
- NGiNX
- Web server, reverse proxy, load balancer, TLS termination, redirects and more
- Networking
- General knowledge around protocols (TCP, HTTP) and types (LAN, WAN, etc.)
- OpenAPI spec
- A specification for documenting APIs
- Operating systems
- Knowing your way around Windows, macOS and Linux
- POSIX compliance
- Do your shell scripts need maximum compatibility? How can you achieve this?
- Package managers
- Knowing how to use and configure tools like apt, brew, etc.
- Patch management
- Keeping your servers and app dependencies up to date
- Payment gateways
- Accepting multiple types of payments (credit card, PayPal, Apple Pay, etc.)
- Penetration testing
- Evaluating the results of pen tests and knowing which reports apply to your app
- Performance monitoring
- Actively understand how your app is performing and how to improve things
- Pingdom
- A service to monitor and track uptime as well as notify you of downtime
- Port forwarding
- Knowing when you want to make a port available within a network
- Privacy
- Protecting the privacy of your customers
- Process management
- Understanding systemd and how Docker / Kubernetes keep processes up
- Proxies
- Passing traffic through a server, this includes reverse proxies such as NGiNX
- Python
- A programming language, it’s oftentimes a nice choice for scripting and web apps
- Quality assurance
- Thinking about the happy and unhappy cases, edge cases, etc.
- RESTful APIs
- A way to organize APIs to keep them predictable and maintainable
- Redis
- A swiss army knife for caching, queue back-end, pub / sub and more
- Release management
- Creating tools and workflows for efficient and dependable releases
- Reverse engineering
- Figuring out how something works by breaking it down
- SMTP
- A protocol for sending email
- SOC 2 Type II compliance
- A framework to help ensure service providers process client data securely
- SPF
- Typically combined with DKIM and DMARC to verify emails are sent by a domain
- SSH / SCP
- Configuring secure access to a server and transferring files
- SSO (Single Sign-On)
- This includes setting up social logins but also org wide SSO such as Okta
- Sanitizing PII (Personally Identifiable Information)
- Protect customer data by redacting it in non-production environments
- Scaling web apps
- Knowing where the bottlenecks are and how to address them
- SealedSecrets
- A GitOps friendly way to manage secrets within Kubernetes
- Security
- Hashing, encryption, access, social engineering, avoiding malware, etc.
- Signal processing
- Handling OS signals for graceful shutdowns, SIGKILL vs SIGTERM, etc.
- Snyk
- A service to scan your apps and infrastructure for security vulnerabilities
- Static code analysis
- Anything you can run against your source code (formatting, linting, vuln. scanning)
- Stripe
- A payment gateway to accept credit cards and offer money related services
- Swagger
- A way to view API docs that’s compatible with the OpenAPI specification
- System design
- Creating a solution after understanding the input and output requirements
- TLS / SSL certificates
- Understanding the moving parts around this and what’s secure vs what’s not
- Task queues
- A way to run tasks that are executed by a background worker
- Terraform
- Configure your infrastructure (servers, DNS, etc.) in an automated way
- Testing
- Knowing what and when to test and how to verify those tests
- Tool comparisons
- Evaluate different tools and arrive at a good solution
- Troubleshooting
- Systematic steps you can take to resolve both known and unknown problems
- Vertical scaling
- Adding more compute resources (CPU, memory, etc.) to a single server / container
- VPN
- Create a secure connection between networks and provide a static IP address
- VPS
- Virtual private servers are isolated environments to run an OS
- VSCode
- A popular code editor, I don’t use it but your dev team might so it’s worth knowing
- Virtual Machines
- Spin up virtualized operating systems for testing or usage
- Vulnerability scanning
- Knowing how to identify, document, assess, triage and / or fix vulnerabilities
- Web development
- Building web apps, such as internal tools to help folks get things done
- Websockets
- A protocol for handling bi-directional web traffic such as chat, etc.
- Webhooks
- Sending a web request to another service after an event triggers
- WordPress
- Maybe your company’s blog is run on this, knowing how plugins / updates work
- Zero downtime app deployments
- When deploying new versions of your app, its service isn’t disrupted
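A couple of the items above are easier to picture with a bit of code. Here’s a minimal Python sketch of batch processing and cache busting. The helper names `in_batches` and `digest_name`, the 8 character digest length and the file names are all made up for illustration. In a real project your framework or asset pipeline probably handles the digest part for you.

```python
import hashlib
from pathlib import PurePosixPath


def in_batches(items, size):
    """Yield successive lists of at most `size` items (batches of N)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Whatever is left over forms a final, smaller batch.
        yield batch


def digest_name(path, content, length=8):
    """Return `path` with a short hash of `content` added before the extension.

    The file name only changes when the content changes, which is what lets
    a web server or CDN serve the file with long lived cache headers.
    """
    digest = hashlib.sha256(content).hexdigest()[:length]
    p = PurePosixPath(path)
    return str(p.with_name(f"{p.stem}-{digest}{p.suffix}"))
```

For example, `for batch in in_batches(user_ids, 500)` would let a script update 500 rows at a time instead of everything at once, and `digest_name("css/app.css", css_bytes)` would produce a name like `css/app-<hash>.css` to reference from your templates. Here `user_ids` and `css_bytes` stand in for your own data.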
The video below goes into more detail about each of these skills.
By the way, if anyone is hiring, I’m available.
# Video Explanation
Timestamps
- 1:15 – The DevOps elephant in the room
- 3:25 – How do you describe this role?
- 7:10 – The skills
- 8:03 – Skills 1-25
- 20:47 – Skills 26-50
- 29:43 – Skills 51-75
- 39:25 – Skills 76-100
- 47:57 – Skills 101-124
Which skills do you commonly use in this type of role? Let us know below.