What is observability?
Observability is the ability to understand what your running service is doing from the outside, using the data it emits. It rests on three pillars: logs (what happened), metrics (how much and how fast), and traces (the path of one request across components).
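To make the third pillar concrete, here is a minimal sketch of how a trace stays connected as a request moves between components. It assumes the W3C Trace Context `traceparent` header layout (version, trace id, parent span id, flags). The helper names `new_traceparent` and `child_traceparent` are hypothetical and exist only for illustration; a real service would normally let a tracing library such as OpenTelemetry handle this.

```python
import secrets


def new_traceparent():
    """Start a new trace: fresh trace id, fresh span id, sampled flag set."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex characters
    return f"00-{trace_id}-{span_id}-01"


def child_traceparent(parent_header):
    """Build the header for an outgoing call made while handling `parent_header`.

    The trace id is kept so all spans join the same trace;
    only the span id changes.
    """
    version, trace_id, _parent_span_id, flags = parent_header.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


incoming = new_traceparent()            # e.g. created at the edge of the system
outgoing = child_traceparent(incoming)  # sent along with a downstream call
print(incoming)
print(outgoing)
```

Both headers share the same trace id, which is what lets a tracing backend stitch the spans from different components into one request path.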
Why it matters
During a live incident you usually cannot attach a debugger to the production process that is serving users. Observability is what lets you answer "why is it slow?" or "which requests are failing?" in minutes. A service you cannot observe is one you cannot operate with confidence, and operating services is a large part of backend work.
What to learn
- The three pillars: logs, metrics, traces
- Structured logging and what to include
- Key metrics: latency percentiles, error rate, throughput
- Distributed tracing and span context
- Dashboards and alerts that signal real problems
- The RED and USE methods for what to measure
- Avoiding alert fatigue
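As a rough illustration of the RED method from the list above (Rate, Errors, Duration, usually applied per service; USE covers Utilization, Saturation, and Errors per resource), the sketch below summarises one time window of requests. The `Request` record, the `red_summary` function, and the choice to count only 5xx responses as errors are assumptions made for this example, not a standard API.

```python
import statistics
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    duration_ms: int  # how long the request took
    status: int       # HTTP status code returned


def red_summary(requests, window_s):
    """Summarise one time window as Rate, Errors, Duration."""
    total = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)  # assumption: 5xx = error
    durations = [r.duration_ms for r in requests]
    # statistics.quantiles(n=100) returns 99 cut points; index 94 is the p95 estimate.
    p95 = statistics.quantiles(durations, n=100)[94] if total >= 2 else None
    return {
        "rate_per_s": total / window_s,
        "error_ratio": errors / total if total else 0.0,
        "p95_ms": p95,
    }


# Twenty made-up requests observed in a 10 second window, two of them failing.
requests = [
    Request(duration_ms=10 * (i + 1), status=500 if i in (3, 17) else 200)
    for i in range(20)
]
summary = red_summary(requests, window_s=10.0)
print(summary)
```

In a real service these three numbers would come from a metrics library and be graphed over time; the point here is only which three questions RED asks of every window.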
Common pitfall
Watching averages instead of percentiles. An average latency of 100ms can hide the fact that 1% of requests take 5 seconds. Track p95 and p99, because the tail is where real users suffer and where the bugs that matter hide. An average smooths outliers away; a percentile tells you what the slowest share of requests actually experienced.
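The pitfall can be checked with a few lines of arithmetic. The sketch below uses made-up latencies and a simple nearest-rank percentile; the `percentile` helper is written for this example and is only one of several common percentile definitions. The slow share is set to 2% rather than 1% so that the nearest-rank p99 lands inside the slow tail.

```python
# Synthetic data: 980 fast requests at 50 ms and 20 slow ones at 5000 ms.
latencies_ms = [50] * 980 + [5000] * 20


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    # Ceiling of pct * n / 100, done in integers to avoid float rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms} ms")  # mean=149.0 ms
print(f"p50={p50_ms} ms")    # p50=50 ms
print(f"p95={p95_ms} ms")    # p95=50 ms
print(f"p99={p99_ms} ms")    # p99=5000 ms
```

The mean of 149 ms matches nothing any request actually experienced, and even p95 still looks healthy here. Only p99 exposes the 5 second tail, which is why it helps to track more than one percentile.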
Resources
Primary (free):
- Google SRE — Monitoring distributed systems · docs
- OpenTelemetry — Documentation · docs
- Grafana — Introduction to observability · docs
Practice
Instrument your API to emit structured logs and a request-duration metric. Expose the metric, then use a benchmarking tool to apply light load and compute the p95 latency. Include a request id in the log lines so you can follow a single request end to end. Done when you can report your p95 latency as a concrete number.
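One possible starting point for the exercise, using only the Python standard library. This is a sketch under assumptions: the `JsonFormatter` class, the `observed_request` context manager, the field names, and the in-memory `durations_ms` list (a stand-in for a real metrics backend) are all hypothetical choices, and computing p95 from the recorded durations is left as part of the exercise.

```python
import json
import logging
import time
import uuid
from contextlib import contextmanager


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Values passed through `extra=` become attributes on the record.
            "request_id": getattr(record, "request_id", None),
            "route": getattr(record, "route", None),
            "duration_ms": getattr(record, "duration_ms", None),
        }
        return json.dumps(payload)


logger = logging.getLogger("api")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False  # avoid duplicate output through the root logger

durations_ms = []  # stand-in for a real metrics backend


@contextmanager
def observed_request(route):
    """Time one request, record its duration, and log it with a request id."""
    request_id = uuid.uuid4().hex
    start = time.perf_counter()
    try:
        yield request_id
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        durations_ms.append(elapsed_ms)
        logger.info(
            "request finished",
            extra={
                "request_id": request_id,
                "route": route,
                "duration_ms": round(elapsed_ms, 2),
            },
        )


# Usage: wrap the body of a handler.
with observed_request("/health") as request_id:
    time.sleep(0.01)  # pretend to do some work
```

Because every line is JSON and carries the same `request_id`, a log search for that id returns everything the service recorded about one request.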
Outcomes
- Explain the roles of logs, metrics, and traces.
- Track latency percentiles instead of averages.
- Trace a single request across components with span context.
- Set an alert that fires on a real problem, not noise.
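For the last outcome, one common way to separate a real problem from noise is to require the condition to hold for several consecutive windows before alerting. The sketch below illustrates that idea only; the `should_alert` function, the 5% threshold, and the three-window requirement are assumed values for illustration, not recommendations.

```python
def should_alert(error_ratios, threshold=0.05, sustained_windows=3):
    """Return True only if the most recent windows all exceed the threshold.

    `error_ratios` holds one error ratio per time window, oldest first.
    """
    if len(error_ratios) < sustained_windows:
        return False  # not enough history to call it sustained
    recent = error_ratios[-sustained_windows:]
    return all(ratio > threshold for ratio in recent)


brief_spike = [0.00, 0.01, 0.20, 0.01, 0.00]  # one bad window, then recovery
sustained = [0.01, 0.08, 0.12, 0.15]          # errors stay high

print(should_alert(brief_spike))  # False
print(should_alert(sustained))    # True
```

The trade-off is detection delay: a longer sustained period means fewer false alarms but a slower page when something is really broken.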