120+ Skills I Use in an SRE / Platform / DevOps Developer Position

These are technical topics I've actively used or thought about for the last 2 years while working in 1 role.

Prefer video? Here it is on YouTube. It includes more detail about each skill.

I mentioned 2 years but that doesn’t mean you’re expected to know everything that’s listed with 2 years of experience. I’ve been building and deploying web apps for over 20 years. These are some skills I exercised in 1 role for ~2 years.

I rarely think about skills like this but thought it could be interesting to write out because:

  • If you want to get into this field, these are a few topics to explore
  • If you’re a company hiring someone, you can use this to help write a job description

In this post, we’ll be focusing mostly on technical skills and not soft skills. Soft skills can be covered in a future post because there’s a lot to think about there too. Examples of soft skills are managing up, big picture thinking, emotional intelligence and more.

# The ‘DevOps’ Elephant in the Room

Yes, I also don’t like the term “DevOps developer” or “DevOps engineer” as a role type because in my mind DevOps is a business wide philosophy but I do think folks use this term to describe a very real combination of technical skills.

It boils down to someone who likes infrastructure and most things related to an SRE (Site Reliability Engineer) or platform role but also has a lot of experience building web apps.

Topics like web servers, databases, caching, background workers, keeping app package dependencies up to date and other stuff come into play. This unlocks a whole category of ways you can provide business value. Suddenly you know how to scale background workers and have no problem jumping into app code to make changes – even if you don’t fully know the programming language!

Hopefully there are only minor code changes, like replacing hard coded config values with environment variables, setting up health check endpoints and things of that nature.
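
Here’s a minimal sketch of what those two changes could look like in a Python app using only the standard library. The setting names (`DATABASE_URL`, `PORT`), their defaults and the `/up` path are assumptions for illustration, not something from a specific app:

```python
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


def load_config(env=os.environ) -> dict:
    """Read settings from environment variables instead of hard coding them.

    The variable names and defaults here are hypothetical.
    """
    return {
        "database_url": env.get("DATABASE_URL", "postgresql://localhost:5432/app_dev"),
        "port": int(env.get("PORT", "8000")),
    }


def health_payload() -> dict:
    """Body returned by the health check endpoint."""
    return {"status": "ok"}


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # "/up" is an assumed path, use whatever your load balancer expects.
        if self.path == "/up":
            body = json.dumps(health_payload()).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)


def serve() -> None:
    """Start the server on the configured port (blocks until stopped)."""
    config = load_config()
    HTTPServer(("0.0.0.0", config["port"]), Handler).serve_forever()
```

Calling `serve()` starts the server. Setting `PORT=9000` in the environment changes the port without touching the code, which is the whole point of moving the value out of the source.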

You can talk shop with developers and understand their perspective on things. For bigger code changes such as uploading files to S3 instead of a local file system when moving into Kubernetes, you can write tickets in enough detail to help developers think about edge cases and other concerns.

Also, I’ve found in practice that a lot of developers don’t think about, or more commonly aren’t interested in, what happens to their code after it’s running in production. If you have experience building and deploying web apps, you can help level up everyone (yourself included) by sharing knowledge and applying examples directly to your org’s apps.

This can really move the ball forward on building more resilient, stable and scalable applications. I think there’s a lot of value in having this combo of experience.

# How Do You Describe This Role?

Besides the 120+ skills listed below, there’s also things like being an on demand consultant for a dev team to solve general problems, maximizing developer happiness with tools that make things easier, keeping an eye on technical debt while advocating and promoting engineering principles to help guide a team on delivering maintainable code at a sustainable pace – all while attempting to avoid stepping on anyone’s toes.

You may also occasionally dabble in risk level topics like ensuring the company’s SaaS app is compliant with US sales tax liabilities, or be responsible for helping a business line move from Slack to Teams because the CTO determined that’s the best move for the business.

You also participate outside of your team to help shape system and technology choices for your whole organization. You can be counted on by the org to be an “action taker” and “problem solver” for almost anything.

You’re comfortable chatting with vendors and providing demos or presentations to anyone in the org, including on non-technical topics such as helping non-technical folks move from tool A to B.

You’re also trusted to come up with a system to run technical interviews for engineering positions along with participating in those interviews.

The above list can go on and on. That’s a tiny handful of ancillary topics that came up in the last 2 years. I wanted to include them because besides solving pure technical problems like writing a script, there’s things like that which come up in your day to day.

It’s kind of like a jack of all trades tech focused role but instead of being a master of none, you have enough knowledge in enough related things that you can cross a bunch of wires internally to provide solutions to a wide range of problems.

In one role, the title was “LORE” which is Lead Operations and Reliability Engineer. It was picked solely due to the acronym sounding cool and it has a fun easter egg related to Star Trek: The Next Generation.

For anyone who knows who Lore is from TNG, no I don’t really think of myself in that way.

I like keeping notes and it gave me personal joy to name my work notes directory “lore” because the definition of the word is related to storing knowledge about a subject.

Now onto the skills…

# The Skills

I time boxed myself to about an hour to come up with this list (minus the descriptions) so it’s likely I’m missing quite a few things. This should be good enough to begin with and I’ll add to it if I think of something later.

I also tried to keep it high level. For example the first item in the list has about 100 sub-topics based on what you’re doing. I didn’t bother putting “S3” or “object storage” as part of the top level list.

  1. AWS
    • IAM, EC2, EKS, ALB, S3, RDS, ElastiCache and other resources to host web apps
  2. AWS SAM
    • Used to build and deploy Serverless Lambda functions
  3. Access control
    • Who can access what across multiple apps, resources, services, and tools
  4. Ansible
    • Configure your servers in an automated way
  5. Apache
    • A web server (perhaps the org you’re with is already using it)
  6. Architectural patterns
    • How can we develop technical systems that work for our business requirements?
  7. Argo CD
    • Helps deploy Kubernetes apps by converging the state of a git repo to your cluster
  8. Automation
    • Taking any manual task and automating it with tools, services, etc.
  9. Back-end development
    • Creating the server side component of a web app
  10. Background job processing
    • Executing tasks outside of the request / response cycle
  11. Bash / Shell scripting
    • Writing scripts to solve ad hoc business tasks
  12. Batch processing
    • Purposely breaking up tasks in batches of N to prevent performance issues
  13. Benchmarking and profiling
    • How does a site perform when under varied load? How can it be measured?
  14. Bitbucket
    • A service to host source code using git
  15. Building CLI tools
    • Internal tools to help automate workflows or make processes easier
  16. CI pipelines
    • Automate things like linting, testing and building images / deployable artifacts
  17. Cache busting
  18. Caching
    • Prevent expensive operations like DB queries or API calls if nothing has changed
  19. CloudFormation
    • An AWS tool to help automate creating and configuring resources
  20. Code formatting
    • Using and configuring tools to assist in formatting code (Black in Python, etc.)
  21. Code linting
    • Using and configuring tools to find issues in your code (Flake8 in Python, etc.)
  22. Command line
    • Using tools like grep, sed, cut and friends to solve ad hoc business problems
  23. Concurrency and parallelism
    • Something that can be done to help reduce how long tasks take to complete
  24. Configuration management
    • Using tools like Ansible or AWS SAM is an example of this
  25. Confluence
    • Store knowledge in a way that’s digestible and searchable by a business
  26. Container runtimes
    • Knowing the difference between containerd and things like dockershim (removed in K8s 1.24)
  27. Cost optimization and analysis
    • How can we keep an eye on cloud costs and what can we do to reduce costs?
  28. Cron jobs
    • Run tasks on a recurring schedule, such as making an API call at 2am every day
  29. DDoS prevention
    • What can we do to prevent our site from going down if under a large scale attack?
  30. DKIM
    • Typically combined with DMARC and SPF to verify emails are sent by a domain
  31. DMARC
    • Typically combined with DKIM and SPF to verify emails are sent by a domain
  32. DNS
    • Ultimately map hostnames to an IP address, weighted DNS, etc.
  33. Database backups and restoring
    • Backup and restore your data in a resilient way (disaster recovery, etc.)
  34. Databases
    • Which DB works well for this use case? What size do we need? Writing queries
  35. Datadog
    • A service to help collect, view and alert on logs and metrics
  36. Debugging
    • Something isn’t working as expected, how do we find the root cause?
  37. Docker
    • Run a process in isolation and gain cross-OS portability
  38. Docker Compose
    • Run 1 or more containers together, such as everything related to 1 project
  39. Documentation
    • Explaining how to use something and hopefully also cover the “why”
  40. Dotfiles
  41. Email delivery validation
    • Help ensure emails can be trusted and don’t get marked as spam
  42. Encryption
    • Protecting user data at rest and in transit
  43. Environment variables
    • Tweak settings that change in dev vs prod and help protect secrets
  44. Event driven architectures
    • For example, triggering a function after an S3 file is uploaded, etc.
  45. Excel
    • Sometimes a spreadsheet is really what you need to solve something quickly
  46. Feature flags
    • Enabling or disabling features for a subset of users until it’s fully rolled out
  47. File systems
    • At least knowing what’s different between Linux and macOS, performance, etc.
  48. Firewalls
    • Monitor and potentially block traffic within a network
  49. Flask
    • Python based micro-web framework for building web apps and APIs
  50. Front-end development
    • Tools to help manage CSS and JS but you can also think of user experience too
  51. Full text search
    • A method to effectively search for strings and phrases within a database
  52. General hardware knowledge
    • AMD64 vs ARM64, SSD / IOPS and knowing how to measure / pick hardware
  53. Git
    • Make commits, merge branches, resolve conflicts, hooks, create workflows, etc.
  54. GitLab
    • A service to host source code using git
  55. HTTP or generally how the web works
    • Generally knowing how everything works together, HTTP verbs, etc.
  56. Health checks
    • Continuous monitoring to ensure your app is available
  57. Helm
    • A template based approach to help manage Kubernetes configs and apps
  58. Horizontal scaling
    • Adding more servers or containers to help handle more traffic
  59. Infrastructure as code
    • The idea of provisioning resources and config files with code
  60. Jira
    • A service to help manage working on tickets
  61. Kubernetes
    • A tool to automate deploying and scaling containerized applications
  62. Kustomize
    • A patch based approach to help manage Kubernetes configs and apps
  63. Linux
    • Being proficient in using a Linux distro of your choice (Debian, Ubuntu, etc.)
  64. Load balancing
    • A way to distribute traffic across resources (random, round robin, etc.)
  65. Load testing
    • Using tools like wrk to see how a site performs under load
  66. Logging
    • What to log? What not to log? How to parse logs and controlling costs
  67. Markdown
    • A lightweight text format to help write documentation and notes
  68. MFA / 2FA
    • An added level of protection around securing access to an account
  69. Memcached
    • A tool to help cache information
  70. Microservices
    • A strategy to help organize and break up an application (it’s not a silver bullet)
  71. Monitoring and alerting
    • If something goes wrong, how can you get notified?
  72. Multiple deployment environments
    • Pull request environments, shared staging environment, production, demo, etc.
  73. NGiNX
    • Web server, reverse proxy, load balancer, TLS termination, redirects and more
  74. Networking
    • General knowledge around protocols (TCP, HTTP) and types (LAN, WAN, etc.)
  75. OpenAPI spec
    • A specification for documenting APIs
  76. Operating systems
    • Knowing your way around Windows, macOS and Linux
  77. POSIX compliance
    • Do your shell scripts need maximum compatibility? How can you achieve this?
  78. Package managers
    • Knowing how to use and configure tools like apt, brew, etc.
  79. Patch management
    • Keeping your servers and app dependencies up to date
  80. Payment gateways
    • Accepting multiple types of payments (credit card, PayPal, Apple Pay, etc.)
  81. Penetration testing
    • Evaluating the results of pen tests and knowing which reports apply to your app
  82. Performance monitoring
    • Actively understand how your app is performing and how to improve things
  83. Pingdom
    • A service to monitor and track uptime as well as notify you of downtime
  84. Port forwarding
    • Knowing when you want to make a port available within a network
  85. Privacy
    • Protecting the privacy of your customers
  86. Process management
    • Understanding systemd and how Docker / Kubernetes keep processes up
  87. Proxies
    • Passing traffic through a server, this includes reverse proxies such as NGiNX
  88. Python
    • A programming language, it’s oftentimes a nice choice for scripting and web apps
  89. Quality assurance
    • Thinking about the happy and unhappy cases, edge cases, etc.
  90. RESTful APIs
    • A way to organize APIs to keep them predictable and maintainable
  91. Redis
    • A Swiss Army knife for caching, queue back-end, pub / sub and more
  92. Release management
    • Creating tools and workflows for efficient and dependable releases
  93. Reverse engineering
    • Figuring out how something works by breaking it down
  94. SMTP
    • A protocol for sending email
  95. SOC 2 Type II compliance
    • A framework to help ensure service providers process client data securely
  96. SPF
    • Typically combined with DKIM and DMARC to verify emails are sent by a domain
  97. SSH / SCP
    • Configuring secure access to a server and transferring files
  98. SSO (Single Sign-On)
    • This includes setting up social logins but also org wide SSO such as Okta
  99. Sanitizing PII (Personally Identifiable Information)
    • Protect customer data by redacting it in non-production environments
  100. Scaling web apps
    • Knowing where the bottlenecks are and how to address them
  101. SealedSecrets
    • A GitOps friendly way to manage secrets within Kubernetes
  102. Security
    • Hashing, encryption, access, social engineering, avoiding malware, etc.
  103. Signal processing
    • Graceful shutdowns, SIGKILL vs SIGTERM, etc.
  104. Snyk
    • A service to scan your apps and infrastructure for security vulnerabilities
  105. Static code analysis
    • Anything you can run against your source code (formatting, linting, vuln. scanning)
  106. Stripe
    • A payment gateway to accept credit cards and offer money related services
  107. Swagger
    • A way to view API docs that’s compatible with the OpenAPI specification
  108. System design
    • Creating a solution after understanding the input and output requirements
  109. TLS / SSL certificates
    • Understanding the moving parts around this and what’s secure vs what’s not
  110. Task queues
    • A way to run tasks that are executed by a background worker
  111. Terraform
    • Configure your infrastructure (servers, DNS, etc.) in an automated way
  112. Testing
    • Knowing what and when to test and how to verify those tests
  113. Tool comparisons
    • Evaluate different tools and arrive at a good solution
  114. Troubleshooting
    • Systematic steps you can take to resolve both known and unknown problems
  115. Vertical scaling
    • Adding more compute resources (CPU, memory, etc.) to a single server / container
  116. VPN
    • Create a secure connection between networks and provide a static IP address
  117. VPS
    • Virtual private servers are isolated environments to run an OS
  118. VSCode
    • A popular code editor, I don’t use it but your dev team might so it’s worth knowing
  119. Virtual Machines
    • Spin up virtualized operating systems for testing or usage
  120. Vulnerability scanning
    • Knowing how to identify, document, assess, triage and / or fix vulnerabilities
  121. Web development
    • Building web apps, such as internal tools to help folks get things done
  122. Websockets
    • A protocol for handling bi-directional web traffic such as chat, etc.
  123. Webhooks
    • Sending a web request to another service after an event triggers
  124. WordPress
    • Maybe your company’s blog is run on this, knowing how plugins / updates work
  125. Zero downtime app deployments
    • When deploying new versions of your app, its service isn’t disrupted
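
To make one of the items above a bit more concrete, here’s a rough sketch of batch processing (item 12): splitting work into batches of N so a single run doesn’t load or send everything at once. The `in_batches` helper is a hypothetical name I made up for this example and Python is just the language I picked for it:

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def in_batches(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of up to `size` items, one batch at a time."""
    if size < 1:
        raise ValueError("size must be at least 1")

    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []

    # Emit whatever is left over when the total isn't a multiple of `size`.
    if batch:
        yield batch
```

For example, `for ids in in_batches(user_ids, 500): ...` would let you update or email 500 users per iteration instead of all of them in one shot. The right batch size depends on what you’re calling, so treat 500 as a placeholder.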

The video below goes into more detail about each of these skills.

By the way, if anyone is hiring I am available for hire.

# Video Explanation

Timestamps

  • 1:15 – The DevOps elephant in the room
  • 3:25 – How do you describe this role?
  • 7:10 – The skills
  • 8:03 – Skills 1-25
  • 20:47 – Skills 26-50
  • 29:43 – Skills 51-75
  • 39:25 – Skills 76-100
  • 47:57 – Skills 101-124

Which skills do you commonly use in this type of role? Let us know below.
