Site Reliability Engineering

Site Reliability Engineering (SRE) is a practice, now common at large technology companies, in which an organization's operations problems are treated as software engineering problems. In other words, developers are assigned to solve operations problems. SREs are software engineers who build software that makes systems more reliable. The natural question is: isn’t that DevOps? And which is better, SRE or DevOps?

History:

The term was coined in 2003 by Ben Treynor, a software engineer at Google, so the practice started well before the DevOps movement. After implementing SRE internally, Treynor’s team later published an SRE book to make the industry aware of the practice.

Key principles of Site Reliability Engineering:

1. Service Level Objectives (SLOs):

SLOs specify the level of reliability that a service needs to achieve. They are precise, quantifiable targets that help align technical and business goals. They help prioritize work against customer expectations and act as a basis for decision-making.

2. Error budgets:

An error budget is the maximum amount of errors or downtime permitted in a specified period of time, and it is derived from the SLO: whatever the SLO does not require is the budget. It gives a numerical measure of how much unreliability the system can tolerate. Error budgets let SREs decide when to invest in new features and when to concentrate on improving reliability.
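As a rough illustration of how an error budget follows from an SLO, the sketch below converts an availability target into allowed downtime for a time window. The class name ErrorBudget, its method names, and the 99.9% target over 30 days are assumptions chosen for this example, not values taken from any particular organization.

```java
// Hypothetical helper: all names and figures here are illustrative assumptions.
final class ErrorBudget {

    // Fraction of the window that may be "bad" for a given SLO percentage.
    // An SLO of 99.9 leaves a budget fraction of roughly 0.001.
    static double budgetFraction(double sloPercent) {
        if (sloPercent < 0.0 || sloPercent > 100.0) {
            throw new IllegalArgumentException("SLO must be between 0 and 100");
        }
        return (100.0 - sloPercent) / 100.0;
    }

    // Allowed downtime, in minutes, over a window of the given length in days.
    static double allowedDowntimeMinutes(double sloPercent, int windowDays) {
        double windowMinutes = windowDays * 24.0 * 60.0;
        return budgetFraction(sloPercent) * windowMinutes;
    }

    // Budget left after some downtime has been spent (negative means overspent).
    static double remainingMinutes(double sloPercent, int windowDays,
                                   double downtimeSoFarMinutes) {
        return allowedDowntimeMinutes(sloPercent, windowDays) - downtimeSoFarMinutes;
    }

    public static void main(String[] args) {
        double allowed = allowedDowntimeMinutes(99.9, 30);
        double remaining = remainingMinutes(99.9, 30, 12.5);
        System.out.printf("Allowed downtime: %.1f minutes%n", allowed);
        System.out.printf("Remaining budget: %.1f minutes%n", remaining);
    }
}
```

With these assumed inputs, a 99.9% target over 30 days allows roughly 43.2 minutes of downtime. A team that has already spent most of that budget would favor reliability work over new feature launches.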

3. Monitoring and alerting:

Prompt detection of and response to issues depend heavily on effective monitoring and alerting. SREs make sure that monitoring systems gather the relevant information and that alerts are clear, actionable and free from false positives.
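One common way to reduce false positives is to fire an alert only when a metric stays above its threshold for several consecutive checks, so that a single brief spike does not page anyone. The sketch below shows that idea only; the class name SustainedThresholdAlert, the 5% threshold and the three-sample streak are assumptions for illustration, not a rule prescribed by SRE.

```java
// Hypothetical alert rule: fires only after several consecutive bad samples.
final class SustainedThresholdAlert {
    private final double threshold;     // e.g. 0.05 means a 5% error rate
    private final int requiredBreaches; // consecutive breaches needed to fire
    private int consecutiveBreaches = 0;

    SustainedThresholdAlert(double threshold, int requiredBreaches) {
        if (requiredBreaches < 1) {
            throw new IllegalArgumentException("requiredBreaches must be >= 1");
        }
        this.threshold = threshold;
        this.requiredBreaches = requiredBreaches;
    }

    // Records one sample and returns true while the alert should be firing.
    boolean observe(double errorRate) {
        if (errorRate > threshold) {
            consecutiveBreaches++;
        } else {
            consecutiveBreaches = 0; // a healthy sample resets the streak
        }
        return consecutiveBreaches >= requiredBreaches;
    }

    public static void main(String[] args) {
        SustainedThresholdAlert alert = new SustainedThresholdAlert(0.05, 3);
        double[] samples = {0.01, 0.09, 0.02, 0.08, 0.07, 0.10};
        for (double sample : samples) {
            System.out.println(sample + " -> firing: " + alert.observe(sample));
        }
    }
}
```

In this run the isolated spike at 0.09 does not fire the alert, while the last three samples, which all exceed the threshold, do. The trade-off is slower detection in exchange for fewer noisy pages.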

4. Automation:

Automation is a core SRE concept that emphasizes cutting down on manual work and improving operational efficiency. By automating tedious and error-prone tasks, SREs free teams to concentrate on more strategic and creative work. To keep a system dependable and scalable, this includes automating monitoring, incident response and deployment procedures.

5. A culture of reliability:

For SRE practices to succeed, they must create a culture of reliability. This means developing a mindset that values dependability and a willingness to learn from mistakes.

Responsibilities of Site Reliability Engineers (SREs):

SREs are accountable for the systems running in production and take on-call duty for them.
SREs develop software that improves the reliability of those systems.
SREs perform post-incident reviews of systems that fail.

SRE vs DevOps: Which is better?

An analogy helps to understand the two terms. Consider DevOps as an interface, that is, something like an abstract class that declares methods without defining them, and SRE as a concrete class that implements DevOps.

interface DevOps {
    void reduceOrganizationalSilos();
    void acceptFailures();
    void implementGradualChanges();
    void leverageAutomation();
    void measureEverything();
}

SRE, as the concrete class, implements DevOps and defines each of its methods as follows:

Reducing organizational silos: ownership is shared among software engineers, the product team and SREs, who all use the same set of tools.
Accepting failures: no system is 100% reliable, so faults will occur. SREs run blameless postmortems on failures and record the findings.
Implementing gradual changes: the smaller a change is, the easier it is to identify a problem and the faster it is to fix or roll back, which reduces the cost of failure.
Leveraging automation: manual tasks on production systems, such as user creation, package installation, alerting and logging, are automated wherever possible.
Measuring everything: monitor the right things, so that at the end of the day there are numbers and clear metrics that demonstrate success.

SRE and DevOps are therefore not competing standards; they go hand in hand. It is SRE with DevOps, not SRE versus DevOps.
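The interface and concrete class analogy can be sketched as compilable Java. This is only an illustration of the analogy: the String return types, the practices() helper and the wording of each returned sentence are assumptions added so that the example produces visible output.

```java
// Sketch of the analogy: SRE is one concrete way of doing DevOps.
// Return types and sentence wording are illustrative assumptions.
final class SRE implements DevOps {
    @Override
    public String reduceOrganizationalSilos() {
        return "Share ownership and tooling across developers, product and SREs";
    }

    @Override
    public String acceptFailures() {
        return "Run blameless postmortems, since no system is 100% reliable";
    }

    @Override
    public String implementGradualChanges() {
        return "Ship small changes that are quick to diagnose and roll back";
    }

    @Override
    public String leverageAutomation() {
        return "Automate manual production tasks wherever possible";
    }

    @Override
    public String measureEverything() {
        return "Track clear metrics that demonstrate success";
    }

    // Collects every practice in the order the interface declares them.
    String[] practices() {
        return new String[] {
            reduceOrganizationalSilos(), acceptFailures(),
            implementGradualChanges(), leverageAutomation(), measureEverything()
        };
    }

    public static void main(String[] args) {
        DevOps devOps = new SRE(); // callers depend only on the interface
        System.out.println(devOps.acceptFailures());
        for (String practice : new SRE().practices()) {
            System.out.println("- " + practice);
        }
    }
}

// The five DevOps principles, declared without implementations.
interface DevOps {
    String reduceOrganizationalSilos();
    String acceptFailures();
    String implementGradualChanges();
    String leverageAutomation();
    String measureEverything();
}
```

Because callers hold a DevOps reference, another implementation of the same principles could be swapped in without changing them, which is the point of the analogy: DevOps states what to do, and SRE is one way of doing it.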
