SLIs, SLOs & error budgets · DevOps

What are SLIs, SLOs, and error budgets?

An SLI (service level indicator) is a measured indicator of service health, such as the percentage of requests that succeed. An SLO (service level objective) is the target for that indicator over a time window, say 99.9% over 30 days. The error budget is the remainder, 100% minus the SLO: the small amount of failure you are allowed. It turns reliability from a vibe into a number you can manage.
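How the three terms relate can be sketched in a few lines of Python. The request counts and the 99.9% target below are made-up illustration values, and the function names are hypothetical, not taken from any library.

```python
def success_rate(good: int, total: int) -> float:
    """SLI: fraction of requests that succeeded."""
    if total <= 0:
        raise ValueError("total must be positive")
    return good / total


def error_budget(slo: float, total: int) -> float:
    """Number of requests that may fail without breaking the SLO."""
    return (1.0 - slo) * total


# Illustration values (assumed): one million requests, 600 of them failed.
total_requests = 1_000_000
good_requests = 999_400
slo_target = 0.999  # the SLO: 99.9% of requests should succeed

sli = success_rate(good_requests, total_requests)
budget = error_budget(slo_target, total_requests)
failed = total_requests - good_requests
remaining = budget - failed

print(f"SLI: {sli:.2%}, SLO: {slo_target:.1%}")
print(f"Budget: {budget:.0f} failed requests, used: {failed}, left: {remaining:.0f}")
```

With these numbers the service meets its SLO and has spent a little over half of its budget for the window.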

Why it matters

"Make it reliable" is unmeasurable; an SLO makes it concrete and shared between engineering and the business. The error budget resolves the eternal fight between shipping features and staying stable: if you have budget, ship; if you have burned it, slow down and fix. It is core to modern SRE practice.

What to learn

SLIs: choosing indicators users actually feel
SLOs: setting realistic targets
Error budgets and what they permit
The cost of each extra nine of reliability
Burn-rate alerts
Using the budget to gate releases
Negotiating SLOs with stakeholders
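As a rough illustration of the burn-rate item above: burn rate is the observed error rate divided by the error rate the SLO allows, so a burn rate of 1 spends the budget exactly over the SLO window. The sketch below assumes a 30-day window and uses 14.4 as a paging threshold, a value often quoted for a one-hour window because it spends about 2% of a 30-day budget in that hour. Treat both as assumptions to tune, and the function names as hypothetical.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    allowed_error_rate = 1.0 - slo
    if allowed_error_rate <= 0:
        raise ValueError("an SLO of 100% leaves no budget to burn")
    return error_rate / allowed_error_rate


def budget_consumed(rate: float, window_hours: float,
                    period_hours: float = 30 * 24) -> float:
    """Fraction of the period's budget spent if `rate` holds for the window."""
    return rate * window_hours / period_hours


def should_page(error_rate: float, slo: float, threshold: float = 14.4) -> bool:
    """Assumed policy: page when the burn rate reaches the threshold."""
    return burn_rate(error_rate, slo) >= threshold


# Illustration: 2% of requests failing against a 99.9% SLO.
rate = burn_rate(error_rate=0.02, slo=0.999)
print(f"burn rate: {rate:.1f}")
print(f"page on-call: {should_page(0.02, 0.999)}")
print(f"budget spent in 1h at 14.4x: {budget_consumed(14.4, 1):.1%}")
```

Alerting on burn rate rather than raw error rate ties the alert to the SLO: the same 2% error rate is an emergency for a 99.9% target and within budget for a 95% one.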

Common pitfall

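Chasing 100% reliability. It is impossible, ruinously expensive, and not what users need: in practice they can rarely tell 99.9% from 100%, because the networks and devices between them and your service are usually less reliable than that. Set a target that matches user expectations and leaves an error budget. Every change carries some risk, so a budget of zero means you can never ship or take any risk at all.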
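The arithmetic behind this can be made concrete. The sketch below assumes a 30-day period and measures the budget as minutes of full downtime. Each extra nine cuts the allowance by a factor of ten, and a 100% target leaves nothing.

```python
PERIOD_MINUTES = 30 * 24 * 60  # assumed 30-day period = 43,200 minutes


def allowed_downtime_minutes(slo_percent: float) -> float:
    """Minutes of full downtime a target permits per 30-day period."""
    return PERIOD_MINUTES * (100.0 - slo_percent) / 100.0


for target in (99.0, 99.9, 99.99, 99.999, 100.0):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:>7}% -> {minutes:8.2f} minutes per 30 days")
```

At 99.9% the allowance is roughly 43 minutes a month. At 99.999% it is under half a minute, which is less time than most teams need to notice an incident, let alone fix it.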

Resources

Primary (free):

Google SRE — Service level objectives · docs
Google SRE — Embracing risk · docs
Google SRE workbook — Implementing SLOs · docs

Practice

For a service, define one SLI (such as request success rate), set an SLO, and compute the resulting monthly error budget in minutes. Then describe what your team would do when the budget is half spent versus fully burned. Done when you can express reliability as a number and a decision rule.
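One possible shape for the decision rule, offered as a starting point rather than a model answer: the thresholds (half spent, fully spent) come from the exercise, while the actions attached to them and the function name are assumptions your team would replace with its own policy.

```python
def release_decision(slo: float, downtime_minutes: float,
                     period_minutes: float = 30 * 24 * 60) -> str:
    """Map budget consumption to an action (assumed example policy)."""
    budget_minutes = period_minutes * (1.0 - slo)
    if budget_minutes <= 0:
        # A 100% target has no budget, so nothing can ever ship.
        return "freeze"
    spent = downtime_minutes / budget_minutes
    if spent >= 1.0:
        return "freeze"      # budget burned: stop feature releases, fix reliability
    if spent >= 0.5:
        return "slow down"   # half spent: smaller rollouts, prioritize fixes
    return "ship"            # plenty of budget: release as normal


# Illustration: a 99.9% SLO over 30 days allows about 43.2 minutes of downtime.
for downtime in (5, 25, 45):
    print(f"{downtime:>2} min down -> {release_decision(0.999, downtime)}")
```

Writing the rule down as a function forces the team to agree on the thresholds before an incident, not during one.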

Outcomes

Choose SLIs that reflect what users experience.
Set realistic SLOs and compute the error budget.
Use the budget to balance shipping against stability.
Explain why 100% reliability is the wrong target.