What are SLIs, SLOs, and error budgets?
An SLI (service level indicator) is a measured indicator of service health, like the percentage of successful requests. An SLO (service level objective) is the target for that indicator over a stated window, say 99.9% over 30 days. The error budget is the gap between that target and 100%: with a 99.9% SLO, up to 0.1% of requests may fail. It turns reliability from a vibe into a number you can manage.
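To make the three terms concrete, here is a minimal sketch that computes a request-based SLI and the matching error budget. The function names and the traffic numbers are made up for illustration, and treating an empty window as fully healthy is an assumption you may not want.

```python
def sli_success_rate(good: int, total: int) -> float:
    """SLI: fraction of requests that succeeded in the window."""
    if total == 0:
        return 1.0  # assumption: no traffic counts as no failures
    return good / total


def error_budget_requests(slo: float, total: int) -> float:
    """Error budget: how many failed requests the SLO tolerates."""
    return (1.0 - slo) * total


# Hypothetical numbers for one SLO window.
total_requests = 1_000_000
good_requests = 999_400
slo = 0.999

sli = sli_success_rate(good_requests, total_requests)
allowed_failures = error_budget_requests(slo, total_requests)
actual_failures = total_requests - good_requests
remaining = allowed_failures - actual_failures

print(f"SLI: {sli:.4%}  SLO: {slo:.1%}")
print(f"Budget: {allowed_failures:.0f} failures, "
      f"used {actual_failures}, remaining {remaining:.0f}")
```

With these numbers the budget is about 1,000 failed requests, 600 of which are already spent, so the service is inside its SLO with roughly 40% of the budget left.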
Why it matters
"Make it reliable" is unmeasurable; an SLO makes it concrete and shared between engineering and the business. The error budget resolves the eternal fight between shipping features and staying stable: if you have budget, ship; if you have burned it, slow down and fix. It is core to modern SRE practice.
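The ship-or-slow-down rule can be written down as a tiny function. This is only a sketch of the idea: the name `release_decision` and the two return strings are hypothetical, and a real policy would usually be agreed with stakeholders and have more than two states.

```python
def release_decision(budget_total: float, budget_spent: float) -> str:
    """Return 'ship' while error budget remains, else 'slow down and fix'."""
    if budget_total <= 0:
        # A 100% SLO leaves nothing to spend, so no release is ever allowed.
        raise ValueError("SLO leaves no error budget")
    remaining = budget_total - budget_spent
    return "ship" if remaining > 0 else "slow down and fix"


# Hypothetical usage: budgets expressed in failed requests.
print(release_decision(budget_total=1000, budget_spent=600))
print(release_decision(budget_total=1000, budget_spent=1000))
```

The first call returns "ship" and the second returns "slow down and fix". The point is that the argument is settled by a number both sides agreed on in advance.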
What to learn
- SLIs: choosing indicators users actually feel
- SLOs: setting realistic targets
- Error budgets and what they permit
- The cost of each extra nine of reliability
- Burn-rate alerts
- Using the budget to gate releases
- Negotiating SLOs with stakeholders
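Of the topics above, burn rate is the easiest to pin down with arithmetic. It is the observed error rate divided by the budgeted error rate, so a burn rate of 1 spends the budget exactly at the end of the SLO window. The sketch below assumes a 30-day window. The helper names are hypothetical. The default threshold of 14.4 is the example the Google SRE workbook gives for a one-hour alert window (2% of a 30-day budget), so treat it as a starting point, not a rule.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'sustainable' the budget is being spent."""
    budget = 1.0 - slo
    if budget <= 0:
        raise ValueError("SLO leaves no error budget")
    return error_rate / budget


def hours_to_exhaustion(rate: float, window_hours: float = 30 * 24) -> float:
    """Hours until a full budget is gone if this burn rate continues."""
    if rate <= 0:
        return float("inf")
    return window_hours / rate


def should_page(error_rate: float, slo: float, threshold: float = 14.4) -> bool:
    """Page when the current burn rate meets or exceeds the threshold."""
    return burn_rate(error_rate, slo) >= threshold


# Hypothetical reading: 2% of requests failing against a 99.9% SLO.
rate = burn_rate(error_rate=0.02, slo=0.999)
print(f"burn rate {rate:.1f}, budget gone in {hours_to_exhaustion(rate):.0f} h")
print("page" if should_page(0.02, 0.999) else "no page")
```

At a 2% error rate the burn rate is about 20, which would empty a 30-day budget in roughly 36 hours, so this reading pages.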
Common pitfall
Chasing 100% reliability. It is effectively impossible, ruinously expensive, and not what users need: the networks and devices between them and your service fail more often than that, so they rarely notice the difference between 99.9% and 100%. Set a target that matches user expectations and leaves an error budget, because a budget of zero means you can never ship or take any risk at all.
Resources
Primary (free):
- Google SRE — Service level objectives · docs
- Google SRE — Embracing risk · docs
- Google SRE workbook — Implementing SLOs · docs
Practice
For a service, define one SLI (such as request success rate), set an SLO, and compute the resulting monthly error budget in minutes. Then describe what your team would do when the budget is half spent versus fully burned. Done when you can express reliability as a number and a decision rule.
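As a starting point for the exercise, this sketch converts an SLO into minutes of allowed unavailability. It assumes a 30-day month and treats the SLI as time-based availability. For a request-based SLI you would count failed requests instead. The name `budget_minutes` is illustrative.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes


def budget_minutes(slo: float, window_minutes: int = MINUTES_PER_30_DAYS) -> float:
    """Minutes of unavailability the SLO allows over the window."""
    return (1.0 - slo) * window_minutes


# Compare a few candidate targets; each extra nine cuts the budget tenfold.
for slo in (0.99, 0.999, 0.9999):
    total = budget_minutes(slo)
    print(f"SLO {slo:.2%}: budget {total:.1f} min, half spent at {total / 2:.1f} min")
```

A 99.9% target allows about 43.2 minutes a month, 99% allows about 432, and 99.99% only about 4.3. Your decision rule then attaches actions to points on that scale, for example what changes once 21.6 of the 43.2 minutes are gone.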
Outcomes
- Choose SLIs that reflect what users experience.
- Set realistic SLOs and compute the error budget.
- Use the budget to balance shipping against stability.
- Explain why 100% reliability is the wrong target.