Observability · Intermediate · 4h

On-call fundamentals.

Alerts, escalation, runbooks, and sustainable rotations.

What is being on-call?

On-call means being the person responsible for responding when something breaks, often outside business hours. Done well, it is a fair rotation with actionable alerts and clear runbooks. Done badly, it is a pager that screams all night at things nobody can fix.

Why it matters

On-call is the human side of reliability, and it is where good observability pays off or fails. Engineers who design humane on-call — useful alerts, written runbooks, blameless follow-up — keep teams healthy and systems reliable. Burnout from bad on-call drives people out of the field.

What to learn

  • Actionable alerts versus noise
  • Severity levels and what each demands
  • Escalation paths and rotations
  • Runbooks: what to check and what to do
  • Acknowledging, mitigating, then fixing
  • Alert fatigue and how to fight it
  • Following up so the same page does not recur
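
To make severity levels and escalation paths concrete, here is a minimal sketch of a policy lookup. The severity names, role names, and timings are all hypothetical; a real team would set these in its paging tool rather than in code.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SEV1 = 1  # user-facing outage: page immediately
    SEV2 = 2  # degraded service: page, but escalate more slowly
    SEV3 = 3  # no user impact yet: ticket for business hours


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes unacknowledged before this step applies
    target: str         # who gets notified (hypothetical role names)


# Hypothetical policy: each severity maps to an ordered escalation path.
POLICY = {
    Severity.SEV1: [
        EscalationStep(0, "primary"),
        EscalationStep(5, "secondary"),
        EscalationStep(15, "engineering-manager"),
    ],
    Severity.SEV2: [
        EscalationStep(0, "primary"),
        EscalationStep(30, "secondary"),
    ],
    Severity.SEV3: [],  # never pages; handled through the ticket queue
}


def who_to_notify(severity: Severity, minutes_unacked: int) -> list[str]:
    """Return everyone who should have been notified by now."""
    return [
        step.target
        for step in POLICY[severity]
        if minutes_unacked >= step.after_minutes
    ]
```

For example, `who_to_notify(Severity.SEV1, 6)` includes the secondary because the primary has not acknowledged within five minutes. Acknowledging the page is what stops the escalation.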

Common pitfall

The common mistake is alerting on causes instead of symptoms: the pager fires for high CPU that users never notice while a real outage slips through. Alert on user-facing symptoms (errors, latency, unavailability) that genuinely need a human now. Every alert that is not actionable trains people to ignore the pager.
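
The difference between a cause-based and a symptom-based alert can be sketched as a paging decision. Everything here is illustrative: the field names, the thresholds, and the idea of a synthetic probe result are assumptions, not any monitoring tool's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowStats:
    """Measurements for one evaluation window (hypothetical fields)."""
    requests: int
    errors: int
    p99_latency_ms: float
    probe_ok: bool      # result of a synthetic availability check
    cpu_percent: float  # cause-level signal: dashboard material, not a page


def should_page(
    stats: WindowStats,
    max_error_rate: float = 0.02,  # assumed threshold
    max_p99_ms: float = 1500.0,    # assumed threshold
) -> bool:
    """Page only on symptoms a user would notice."""
    if not stats.probe_ok:
        return True  # unavailability
    if stats.requests == 0:
        return False  # no traffic to judge; the probe covers availability
    error_rate = stats.errors / stats.requests
    # cpu_percent is deliberately ignored: high CPU alone hurts no user.
    return error_rate > max_error_rate or stats.p99_latency_ms > max_p99_ms
```

With these assumed thresholds, a window at 97% CPU but with normal errors and latency does not page, while a 5% error rate on an idle CPU does.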

Resources

Primary (free):

Practice

Take an existing alert and judge it: does it fire on a user-facing symptom, is it actionable, and is there a runbook? Rewrite one cause-based alert as a symptom-based one, and write a short runbook for it. Done when the alert would only wake someone for something worth waking them.
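
One way to structure the review is as a small checklist in code. This is only a sketch: the `AlertReview` record and its fields are hypothetical, and the three checks simply mirror the three questions in the exercise.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AlertReview:
    """Answers to the three review questions for one alert."""
    name: str
    user_facing_symptom: bool
    actionable: bool
    runbook: Optional[str]  # link or path to the runbook, if one exists


def review_findings(alert: AlertReview) -> list[str]:
    """List the reasons this alert should not page anyone yet."""
    findings = []
    if not alert.user_facing_symptom:
        findings.append("fires on a cause, not a user-facing symptom")
    if not alert.actionable:
        findings.append("no clear action for the responder")
    if not alert.runbook:
        findings.append("no runbook")
    return findings


def worth_paging(alert: AlertReview) -> bool:
    """An alert is worth a page only if the review finds nothing."""
    return not review_findings(alert)
```

An alert that fails any check is a candidate for rewriting, demoting to a ticket, or deleting.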

Outcomes

  • Distinguish actionable alerts from noise.
  • Alert on user-facing symptoms, not raw causes.
  • Write a runbook that guides a responder.
  • Design a rotation that does not burn people out.