The Many Shapes of Site Reliability
by Rob Cummings
In my role as a Cloud and SRE Practice Lead at Slalom Build, I am fortunate to talk to a wide range of organizations, from smaller mid-market companies all the way to astoundingly large and complex enterprises, all from an equally wide range of industries.
There is no doubt about it: Site Reliability Engineering (SRE) is the latest hot topic. These companies are looking to reduce the impact and risk of failure that can come from moving quickly at scale with increasingly complex systems.
What is SRE? It is a specific implementation of DevOps that applies a software engineering mindset towards solving traditional operations problems, with a focus on creating reliable and scalable technology.
Much like DevOps, it turns out no one is following a common team blueprint when it comes to SRE. However, we are seeing several distinct patterns emerge as organizations adopt SRE.
Please note: This post is not about gatekeeping and declaring there is only one true way to approach SRE. After being through too many “DevOps is not a team or title” debates, I have mellowed out considerably when it comes to these things. It is more important that you go with what works for your context and stay focused on the outcomes.
SRE Implementation Patterns
With that in mind, let's get into the more common team structures emerging in organizations adopting Site Reliability Engineering:
The Google Model
- A dedicated engineering team focused on running and scaling a product or platform.
- Team is made up of highly skilled software and systems engineers who both update the product code base directly for reliability and build the associated tooling to support the product.
- Team is on-call for the product.
- Can hand on-call support back to the feature teams if reliability falls below an agreed-upon threshold (Google uses an Error Budget model for driving these conversations).
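To make the Error Budget idea above a little more concrete, here is a minimal sketch in Python. It is not Google's tooling or policy; the function names, the request-based budget, and the "hand back when the budget is overspent" rule are all assumptions made for illustration.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent).

    slo_target is a fraction such as 0.999. Hypothetical helper for illustration.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


def should_hand_back_on_call(slo_target: float, total_requests: int, failed_requests: int) -> bool:
    """Assumed policy: on-call goes back to the feature team once the budget is overspent."""
    return error_budget_remaining(slo_target, total_requests, failed_requests) < 0.0
```

With a 99.9% target and one million requests, roughly 1,000 failures are tolerated in the window. 500 failures leaves about half the budget, while 1,500 failures would trigger the hand-back conversation under this assumed rule.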
We Are Now SRE
- Ops, DevOps, and platform teams have rebranded and reimagined themselves as Site Reliability Engineers.
- Goal is to put a stronger engineering emphasis around improving reliability and scalability.
- These teams typically do not make significant changes to product code, but do play a heavy role in the underlying infrastructure, tooling, platforms, or day-to-day support of the product.
- Team is on-call and typically relies on a development team for escalating application-specific issues.
SRE Center of Practice
- A centralized team focused on creating and advocating for reliability tools and processes.
- On-call responsibilities are limited to tooling that is not external-customer-facing.
- Is an internal consulting arm to help with adopting SRE patterns and tooling, but does not have direct product accountability.
- These teams are uniquely positioned to see the forest for the trees while also staying sharp on the latest technologies, trends, and research in the SRE space.
Embedded SRE
- SRE engineers are embedded into cross-functional teams that own the end-to-end lifecycle of a product, from build through decommission.
- This can take two shapes. First is a matrix SRE organization where engineers belong to a single capability and are also embedded full time within a product team (this is the Slalom Build SRE model). Alternatively, product teams may hire their own dedicated SRE engineers.
- The SRE engineer role is a hands-on reliability/scalability Subject Matter Expert who helps the team adopt the engineering practices and tooling to ensure right-sized scalability and reliability throughout the lifecycle.
- The entire team participates in an on-call rotation.
The Long Tail
- I still get into a significant number of conversations where teams or entire organizations have not yet heard of SRE.
- Although it’s obviously the new hotness, we are still in the early stages of adoption. It is an exciting time for all of us to share what we are learning for the benefit of those starting down the path.
If you are embarking on an SRE implementation, how do you ensure you achieve the reliability and scalability benefits you are looking for? A couple of things to keep in mind:
Make it Humane
- My absolute favorite aspect of the SRE community today is the dialog around the interaction points between humans and the complex systems they are supporting.
- Make improving the on-call experience and reducing the inherent stress of dealing with large-scale production systems a key tenet of your SRE program.
Customer Outcome Driven
- Maintain a direct line of sight to customer impacts (positive and negative), regardless of whether your SRE engineers are directly embedded in a product team, or building and supporting a platform used by product teams.
- A common red flag is a team so focused on its platform and associated tooling that it can’t quantify or get visibility into the external customer impact it is having. In fact, this was the driver for the creation of Google’s Customer Reliability Engineering team: to establish a bridge between actual customer products and the underlying Google Cloud Platform teams.
Focus on Rightsizing
- SRE teams must have real discussions with their internal customers, product teams, and business partners around the cost of downtime and the cost of uptime. Every additional nine has a real cost that needs to be quantified and made visible.
- Make both missing and greatly surpassing Service Level Objectives (SLOs) undesirable. This is a new concept for most traditional production support teams that consider all downtime an unattractive or unacceptable risk. Exceeding your SLO over too many consecutive periods might be an indication that you have over-invested in reliability at the expense of other customer features.
- Transparency through real SLOs that are created in partnership with the business, tracked, and regularly evaluated for relevancy is a must-have in the SRE world. Too many teams sidestep this or only pay lip service to it during their journey to SRE.
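The rightsizing points above can be put into rough numbers. The sketch below, in Python, shows how much downtime each additional nine allows over a 30-day window, and flags results that either miss or greatly surpass an SLO. The function names, the 30-day window, and the 0.05-point "over-delivery" margin are assumptions chosen for illustration, not recommended values.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes in an assumed 30-day window


def allowed_downtime_minutes(availability_percent: float,
                             period_minutes: int = MINUTES_PER_30_DAYS) -> float:
    """Downtime permitted in the period by an availability target such as 99.9."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    return period_minutes * (100.0 - availability_percent) / 100.0


def classify_slo_result(measured_percent: float, slo_percent: float,
                        over_margin_percent: float = 0.05) -> str:
    """Label one period's result; the margin is a hypothetical tuning knob."""
    if measured_percent < slo_percent:
        return "missed"
    if measured_percent > slo_percent + over_margin_percent:
        # Repeatedly landing here may signal over-investment in reliability.
        return "exceeded"
    return "on_target"


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        print(f"{target}% allows {allowed_downtime_minutes(target):.2f} minutes per 30 days")
```

Each extra nine cuts the allowed downtime by a factor of ten, which is why the cost of the next nine deserves an explicit conversation. Tracking how often a service lands in the "exceeded" bucket over consecutive periods is one simple way to start the over-investment discussion.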
Ownership and Accountability
- One of the most painful parts of traditional ops on-call is the complete decoupling of product development and the running of that product. Getting called at 3am is bad enough; not being empowered to fix the underlying causes is much worse.
- Ensure a solid feedback path into the product team's backlog is in place for operational fixes and improvements. Make sure the engineers accountable for production have ownership over reliability and scalability fixes, or at least very strong input into their prioritization.
Include both Software and Systems Engineering Expertise
- Transitioning to SRE is most successful with a mix of Software and Systems Engineering expertise on the team. This means substantially updated or additional job descriptions are usually needed.
- If you are looking to transition a traditional Ops team to an SRE model, push hard to bring in at least a pair of software engineers. A pair creates enough critical mass to ensure their voices are heard, lets them bounce ideas off each other, and creates capacity for two-way knowledge sharing with the rest of the team. The same works in reverse: bring a pair of Systems Engineers into a traditional Software Engineering team transitioning to SRE.
- You should experience this as an uncomfortable transition, at first. Otherwise, you aren’t pushing yourselves forward enough to get the benefits you are looking for and instead will only get the “feel good” effects of being associated with SRE.
- Your business partners will be even more uncomfortable. They are taking a leap of faith by fundamentally changing core support processes into a new model they may not yet fully understand. This is ok, but be sure to have empathy and treat them as equal partners in this journey.
How to move to SRE?
- “Start small” advice definitely applies.
- Look for both technology and business teams excited about the opportunity SRE brings and put a strong focus on quick wins.
- Have a bias towards action, and measure your progress and impact to inform your next action. Make sure you are adjusting direction as you learn.
- Be transparent about what is working, what isn’t, and where you have learned something unexpected.
Read more about what we build at our Slalom Build Engineering blog
Look for upcoming posts to share more tactical starting points, as well as why Slalom Build chose Embedded SRE as the right model for us.
Special thanks to Arielle Allen, Sascha Bates, Jeremiah Dangler, Joel Forman, Jeff Knecht, Dan Mazur, and Kevin McClelland for their help making this post better.