Site Reliability Engineering
“That’s kind of where we are today. We have a lot of operational people working under the title of SRE. They are trying to bring more automation and self-healing to the site engineering side of the world”
– Benjamin Treynor Sloss, Vice President of Engineering at Google.
What is Site Reliability Engineering?
Technology keeps expanding and branching as new innovations emerge. With the shift to cloud software and microservices in particular, the need for more agile services has grown alongside a surge in new kinds of threats. As services become more complex, the number of unknown factors naturally increases.
Hence, the everyday responsibilities of developers and operations engineers have evolved to include finding new ways to improve stability and reliability and to adopt automation-first practices. This gave rise to the site reliability engineer.
Coined by Benjamin Treynor Sloss, the term Site Reliability Engineering has become an integral part of IT. The discipline aims to advance automated solutions for operational concerns such as on-call monitoring, performance, capacity planning, and disaster response.
“SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise, and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, design and implement automation with software to replace human labor. In general, an SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.”
– Benjamin Treynor Sloss, VP of Engineering, Google
Before Site Reliability Engineering, systems administrators were tasked with assembling the application components behind cloud services and on-premises environments. That setup handled issues reactively as they arose; SRE was introduced to rework it and provide more reliable access to these services.
How Did SRE Improve the Role of Systems Administrator?
A systems administrator, or sysadmin, assembles existing software components and deploys them to work together as a service. Sysadmins are also responsible for continually monitoring the running service and methodically applying fixes and updates as issues arise.
As cloud systems and microservices grew in number and complexity, traffic increased, and so did the workload on the administration team.
Because the sysadmin role demands a markedly different skill set from that of developers, developers and sysadmins are typically split into discrete teams: development and operations (ops).
Conventional operations teams and their counterparts in product development therefore often end up in conflict over releasing software to production. Development teams want to launch new features, whereas ops teams are responsible for making sure the service does not break under unexpected conditions.
In response to such conflicts, Google chose to run its systems differently. Site Reliability Engineering teams hire software engineers to run products and to build systems that do the work that would otherwise be performed, often manually, by system administrators.
SRE is essentially performing work that has historically been done by an operations team, but using engineers with software expertise. The bet is that these engineers are both predisposed to, and capable of, designing and implementing automation with software to replace human labor.
SRE teams must stay focused on engineering. Without consistent engineering, the operations burden increases, and teams need more people just to keep pace with the workload. A traditional operations-focused group typically scales linearly with service size: if the products supported by the service succeed, the operational load grows with traffic.
Attributes Contributing to SRE
Site reliability engineers need a holistic understanding of their systems and the connections among them. SREs must view the system as a whole and treat its interconnections with as much care as the components themselves.
Beyond understanding systems, site reliability engineers are also responsible for specific tasks and outcomes. These are outlined in the following tenets of SRE.
- Operations is a software problem: The primary tenet of SRE is that doing operations well is a software problem. SRE should therefore use software engineering approaches to solve it.
- Service Level Objectives (SLOs): Maintaining 100% availability is not the goal of SRE. Instead, the product team and the SRE team choose an appropriate availability target for the service and its user base, and the service is managed to that SLO. Deciding on a target requires active collaboration from the business.
- Automation: Reducing toil through automation means determining what to automate, under what conditions, and how.
- Reducing the cost of failure: The longer it takes to detect a problem in a system, the harder and costlier it is to fix. SREs are explicitly charged with improving late problem discovery, which yields benefits for the company as a whole.
- Working alongside developers: SRE strives to break down barriers. Ideally, product development and SRE teams should both maintain a holistic view of the stack (the front end, back end, libraries, storage, kernels, and physical machines).
- Tools: SRE discourages separate teams from using different sets of tools to achieve the same goal. Managing a service with one set of tools for SRE teams and another for product development teams becomes a monumental task, and tools may behave differently in different situations, which can sometimes prove catastrophic.
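To make the SLO tenet above concrete, here is a minimal Python sketch of how an availability target translates into an error budget, that is, the amount of downtime the SLO tolerates over a window. The function name `error_budget_minutes` and the 30-day window are assumptions chosen for illustration, not part of any standard tooling.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return the downtime (in minutes) an availability SLO allows over a window.

    slo_target is a fraction, e.g. 0.999 for "three nines".
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be a fraction in (0, 1]")
    window_minutes = window_days * 24 * 60
    # Whatever the SLO does not promise is the budget for failure.
    return (1.0 - slo_target) * window_minutes


if __name__ == "__main__":
    # Hypothetical targets, printed for comparison.
    for target in (0.99, 0.999, 0.9999):
        budget = error_budget_minutes(target)
        print(f"{target:.2%} SLO: {budget:.1f} minutes of budget per 30 days")
```

Under these assumptions, a 99.9% target leaves roughly 43 minutes of budget per 30 days, and each additional nine shrinks the budget tenfold.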
DevOps vs. SRE
Adopting SRE practices does not require upending or immediately replacing existing procedures. SRE methodologies can supplement both DevOps and ITIL. The key is to ensure that, whatever an organization’s operating model or toolchain, there is shared visibility, communication, and collaboration across teams.
SRE and DevOps methodologies overlap in the following ways:
- Continuous improvement through learning: SRE practices treat every incident as a learning opportunity and use regular postmortems to document findings. SRE also sets guardrails for failure through SLOs and error budgets.
- Progressive change: Instead of shipping one massive update to everyone at once, SRE rolls new features out to a small subset of customers before granting access to the rest. Smaller changes are easier and safer to analyze and iterate on.
- Measuring: SRE focuses on measuring toil and reliability so that both the software teams involved and the end users remain satisfied with the service.
These principles make up the overlap between SRE and DevOps, with SRE formalizing disciplines that make it easier to deliver on the promises of DevOps.
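The progressive change principle can be sketched in a few lines of Python. The example below assumes a hypothetical `in_rollout` helper that hashes a user ID into a stable bucket, so the same user always gets the same answer and raising the percentage only ever adds users. Real rollout systems vary; this is one simple approach, not a description of any particular product.

```python
import hashlib


def in_rollout(user_id: str, feature: str, percent: float) -> bool:
    """Deterministically decide whether a user is inside the rollout.

    percent is the share of users (0 to 100) that should see the feature.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a bucket in [0, 10000).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    # Users whose bucket falls below the cutoff get the new feature.
    return bucket < percent * 100


if __name__ == "__main__":
    # Hypothetical feature name and user IDs.
    users = [f"user-{i}" for i in range(10)]
    enabled = [u for u in users if in_rollout(u, "new-checkout", 20)]
    print(enabled)
```

Including the feature name in the hash means different features pick different subsets of users, so the same early adopters are not exposed to every experiment.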
Benefits of Implementing SRE
The candidates chosen for SRE roles and their technical contributions vary with how the organization defines the position. One company might expect more experience in software engineering and coding, while another might emphasize operations or a more QA-based skill set. For a balanced service, SRE provides a blend of practices and qualities that complement technical expertise.
Source: Inspiredbytech – Mahesh Patil
Focus on the Broader Horizon
SRE reaches its full potential when the software developers in charge deeply understand how their code helps drive the overall business.
A successful site reliability engineer can recognize and evaluate circumstances from a higher vantage point. Because a change can introduce uncertainty or have knock-on effects well beyond the immediate context, a good SRE performs a thorough analysis before making it.
SREs are expected to understand how a change will affect the rest of the system, the team, and the wider infrastructure.
Implementing SRE lets engineers collectively consider future decisions, and how current changes will affect them, across all the departments involved. Those decisions affect people much further up the stack, and sound decisions make for smoother development.
Consider Automation at Every Turn
SREs increase the reliability of the services they provide without hindering the company’s ability to ship software promptly. Automation plays a significant role in that efficiency and lets SREs be more proactive. By reducing manual work, automation frees SREs to focus on the bigger picture.
SREs are constantly on the lookout for time-consuming, repetitive tasks that can be automated, saving future team members valuable time that they can spend on other important work.
As noted earlier, automation is one of SRE’s core principles and a major benefit of implementing SRE in any company. It is a key responsibility that every site reliability engineer is expected to carry out.
Unique Tools and Approaches
Site Reliability Engineering focuses on making operations more effective while reducing the time required to deliver services. Site reliability engineers come from diverse backgrounds.
Many engineers who now hold the title worked in other roles before becoming SREs. This lets hiring managers draw on a wide range of backgrounds, some from development and some from traditional operations. A traditional QA engineer, for example, might have the right makeup for an SRE position.
SREs bring varied qualifications and are also expected to learn additional tools and approaches from the rest of the team. An operations practitioner might benefit from learning a programming language. At the same time, someone with a development background should be willing and able to think much more deeply about operational processes and challenges than they did in the past. The best SREs embrace this kind of broad-based learning and skills development.
Tackling Engineering Problems Through SRE
Optimization is pivotal to a successful SRE practice and to the proper implementation of DevOps principles. It is difficult to manage a service well without understanding which outcomes matter for that service. SRE’s concepts of SLIs (service level indicators), SLOs (service level objectives), and error budgets help keep the organization aligned on optimizing services to better match customers’ needs.
Service Level Indicators
An SLI, or Service Level Indicator, is a carefully defined quantitative measure of some aspect of the level of service provided. A common SLI is request latency, which measures how long it takes to return a response to a request.
Another common SLI is the error rate, often expressed as a fraction of all requests received. These measurements are usually aggregated after collection: raw data is gathered over a measurement window and then turned into a rate, an average, or a percentile.
These measurements also give SREs a basis for designing backend fixes and setting development priorities.
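As a rough illustration of how raw measurements become SLIs, the Python sketch below aggregates a batch of request records into a latency percentile and an error rate. The `Request` record, its field names, and the nearest-rank percentile method are assumptions made for this example; production monitoring systems usually compute these over streaming data.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    latency_ms: float  # time taken to return a response
    ok: bool           # False if the request ended in an error


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples to aggregate")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def error_rate(requests: list[Request]) -> float:
    """Fraction of all requests received that failed."""
    if not requests:
        raise ValueError("no requests to aggregate")
    return sum(1 for r in requests if not r.ok) / len(requests)


if __name__ == "__main__":
    # Made-up sample data for one measurement window.
    window = [Request(120, True), Request(95, True), Request(480, False), Request(130, True)]
    print("p50 latency (ms):", percentile([r.latency_ms for r in window], 50))
    print("error rate:", error_rate(window))
```

Percentiles are often preferred over averages for latency because a handful of very slow requests can hide behind a healthy-looking mean.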
Service Level Objectives
Service Level Objectives, or SLOs, are target values for each service’s metrics, such as uptime or response speed, and are therefore tied closely to customer satisfaction.
SLOs enable site reliability engineers to address issues before they degrade the user experience. A major part of choosing and publishing SLOs is setting users’ expectations about how the service will perform.
Without published SLOs, users often form their own beliefs about desired performance, which may be unrelated to the beliefs held by the people designing and operating the service.
This can lead both to over-reliance on the service, when users mistakenly believe it will be more available than it actually is, and to under-reliance, when prospective users believe it is flakier and less reliable than it actually is.
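One way to picture an SLO is as a target attached to an SLI, which can then be checked mechanically. The following Python sketch assumes a hypothetical `SLO` record and made-up targets; the names and numbers are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    name: str               # name of the SLI this objective applies to
    target: float           # value the measured SLI is compared against
    higher_is_better: bool  # True for availability, False for latency

    def is_met(self, measured: float) -> bool:
        """Check a measured SLI value against this objective."""
        if self.higher_is_better:
            return measured >= self.target
        return measured <= self.target


def unmet_slos(slos: list[SLO], measurements: dict[str, float]) -> list[str]:
    """Return the names of all objectives the measurements fail to meet."""
    return [slo.name for slo in slos if not slo.is_met(measurements[slo.name])]


if __name__ == "__main__":
    # Hypothetical objectives and measurements for one reporting period.
    slos = [
        SLO("availability", target=0.999, higher_is_better=True),
        SLO("p99_latency_ms", target=300.0, higher_is_better=False),
    ]
    measured = {"availability": 0.9995, "p99_latency_ms": 412.0}
    print(unmet_slos(slos, measured))
```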
Service Level Agreements
Service Level Agreements, or SLAs, are explicit or implicit contracts with users. They generally spell out the SLOs they cover and the consequences of meeting or missing them.
Because SLAs are closely tied to business and product decisions, SREs are not typically heavily involved in constructing them. SRE does, however, pay attention to missed SLOs and their consequences when the agreements are drawn up. SREs can also help define the SLIs: there must be an objective way to measure the SLOs in the agreement, or disagreements will arise.
Source: New Relic Blog
Service Reliability Hierarchy
Essentially, SREs are responsible for running services: sets of related systems that ultimately serve a specific purpose for end users, whether internal or external. They oversee the entire life cycle of the service and deal with issues promptly.
Successfully operating a service entails a wide range of activities: developing monitoring systems, planning capacity, responding to incidents, ensuring the root causes of outages are addressed, and so on.
The health of a service can be visualized as a hierarchy, stacking the most basic requirements for a system to function at all beneath the higher levels of function, which permit self-actualization and active control of the service’s direction rather than purely reactive handling of issues.
The foundation of the service reliability hierarchy is monitoring: checking whether the service is working correctly through proper monitoring infrastructure. Monitoring essentially means being notified of problems that might hinder the service before they come to the attention of end users.
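A very small Python sketch of the idea: watch a stream of request outcomes and raise an alert once the recent error rate crosses a threshold, ideally before users notice. The `ErrorRateAlert` class, the window size, and the threshold are hypothetical choices for illustration; real monitoring stacks offer far richer alerting rules.

```python
from collections import deque


class ErrorRateAlert:
    """Fire when the error rate over the last `window` requests exceeds `threshold`."""

    def __init__(self, window: int, threshold: float) -> None:
        if window <= 0:
            raise ValueError("window must be positive")
        self.threshold = threshold
        # A bounded deque drops the oldest outcome automatically.
        self._outcomes = deque(maxlen=window)

    def record(self, ok: bool) -> bool:
        """Record one request outcome; return True if the alert should fire."""
        self._outcomes.append(ok)
        if len(self._outcomes) < self._outcomes.maxlen:
            return False  # not enough data yet to judge
        failures = sum(1 for outcome in self._outcomes if not outcome)
        return failures / len(self._outcomes) > self.threshold


if __name__ == "__main__":
    alert = ErrorRateAlert(window=4, threshold=0.25)
    # Made-up stream of outcomes: True means the request succeeded.
    for ok in (True, True, True, False, False):
        print(ok, alert.record(ok))
```

Waiting until the window is full avoids paging someone over a single failed request right after startup.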
When a fault is detected, SREs are not obliged to fix every small issue at its root immediately. While their focus remains on the larger purpose, temporary mitigations are put in place until the relevant teams can address the underlying problem.
This might mean shutting off specific services until the issue subsides, or redirecting traffic to another instance that is working. Responding effectively to incidents, however, is something that applies to all teams.
Postmortem and Root-Cause Analysis
One of the key differentiators between the SRE philosophy and more traditional operations-focused environments is the practice of analyzing issues that occur and, through comprehensive reports, preventing them from recurring.
If left unchecked, incidents can multiply in complexity or even cascade, overwhelming a system and its operators and ultimately impacting end users. Postmortems are therefore an essential tool for SRE.
The postmortem is a well-known and widely used concept in the technology industry. A postmortem is a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and the follow-up actions to prevent the incident from recurring.
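The elements of a postmortem listed above map naturally onto a simple record. The Python sketch below is one possible shape, with hypothetical field names and a helper that flags sections still left empty; it is not a standard template.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Postmortem:
    incident: str  # what happened
    impact: str    # who and what was affected
    mitigation_actions: list[str] = field(default_factory=list)
    root_causes: list[str] = field(default_factory=list)
    follow_up_actions: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Return the names of sections that have not been filled in yet."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


if __name__ == "__main__":
    # Entirely made-up incident, for illustration only.
    draft = Postmortem(
        incident="Checkout API returned errors for 20 minutes",
        impact="Some customers could not complete purchases",
        mitigation_actions=["Rolled back the latest release"],
    )
    print(draft.missing_sections())
```

A check like `missing_sections` can be used to keep a postmortem open until root causes and follow-up actions have been written down.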
One of the key responsibilities of site reliability engineers is to quantify confidence in the systems they maintain. SREs perform this task by adapting standard software testing techniques to systems at scale.
Confidence can be measured by both past reliability and future reliability. The former is captured by analyzing data from monitoring historic system behavior or from postmortem reports, while the latter is quantified by making predictions from data about past system behavior.
Testing is the mechanism used to demonstrate specific areas of equivalence when changes occur. Each test that passes both before and after a change reduces the uncertainty for which the analysis needs to allow. Thorough testing helps predict the future reliability of a given service with enough detail to be practically useful.
Capacity planning is a never-ending cycle: assumptions change, deployments slip, and budgets are cut, resulting in revision upon revision of the plan. Each revision has knock-on effects that must be carried through all subsequent plans.
Once the necessary measures are in place, SREs are responsible for improving on the decisions that have already been made.
Several other concerns, including data processing, data integrity, and data protection, also factor into the development side of Site Reliability Engineering.
Finally, at the top of the service reliability hierarchy lies the product launch. Launching a new product or feature is the moment of truth for every company: the point at which months or years of effort are introduced to the world.
Traditional companies launch new products at a fairly low rate. For internet services, launches and rapid iterations are far more manageable because new features can be rolled out on the server side, rather than requiring a software rollout on individual customer workstations.
Source: Google file on SRE
Once the site reliability engineer role is established, the implementation must be carried out smoothly, with the proper incentives in place. SRE is a vast field that demands a wide range of skills and traits to run services and systems effectively.
Embracing SRE’s culture and mindset gives new processes a unifying value at their center, which helps keep the initiative in place. Every company, large or small, has multiple applications under development and frequent code deployments (and redeployments). Many such enterprises run a combination of legacy and modern systems, supported by separate Development and Operations teams.
While DevOps promotes a smooth, automated approach to such deployments, there remains a need for a more dedicated focus on ensuring the availability of end-to-end business functions. This is what SREs bring to the forefront, which is why the role is attracting so much attention in the industry.
SREs consider applications holistically from the end-user perspective, assess what already exists, and identify the gaps in service reliability. This gives any organization a view of where it stands on reliability and what needs to be addressed immediately.