Production · Intermediate · 6h

Observability.

Logs, metrics, and traces — knowing what your service does.

What is observability?

Observability is the ability to understand what your running service is doing from the outside, using the data it emits. It rests on three pillars: logs (what happened), metrics (how much and how fast), and traces (the path of one request across components).
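
To make "the path of one request across components" concrete, here is a minimal sketch of how trace context can travel between two services. The `Span` class and the helper names are hypothetical and use only the standard library; the header layout follows the W3C Trace Context `traceparent` format (version, trace id, parent span id, flags). A real service would normally rely on a tracing library rather than hand-rolled code like this.

```python
import secrets
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    trace_id: str             # shared by every span that belongs to one request
    span_id: str              # unique to this unit of work
    parent_id: Optional[str]  # None for the root span
    name: str


def start_span(name: str, parent: Optional[Span] = None) -> Span:
    """Start a root span, or a child span that inherits the parent's trace id."""
    trace_id = parent.trace_id if parent else secrets.token_hex(16)
    parent_id = parent.span_id if parent else None
    return Span(trace_id, secrets.token_hex(8), parent_id, name)


def to_traceparent(span: Span) -> str:
    """Serialize the span context for an outgoing call (flags 01 = sampled)."""
    return f"00-{span.trace_id}-{span.span_id}-01"


def from_traceparent(header: str, name: str) -> Span:
    """Continue the caller's trace inside the receiving component."""
    _version, trace_id, caller_span_id, _flags = header.split("-")
    return Span(trace_id, secrets.token_hex(8), caller_span_id, name)


# Hypothetical flow: a gateway calls an orders service over HTTP.
gateway_span = start_span("gateway")
header = to_traceparent(gateway_span)
orders_span = from_traceparent(header, "orders-service")
print(gateway_span.trace_id == orders_span.trace_id)  # same trace, different spans
```

Because both spans carry the same trace id and the child records its caller's span id, a tracing backend can reassemble them into one request path.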

Why it matters

In production you cannot attach a debugger to a live incident affecting users. Observability is what lets you answer "why is it slow?" or "which requests are failing?" in minutes. A service you cannot observe is one you cannot operate with confidence, and operations is half of backend work.

What to learn

  • The three pillars: logs, metrics, traces
  • Structured logging and what to include
  • Key metrics: latency percentiles, error rate, throughput
  • Distributed tracing and span context
  • Dashboards and alerts that signal real problems
  • The RED and USE methods for what to measure
  • Avoiding alert fatigue
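
As a rough illustration of the RED method (Rate, Errors, Duration), the sketch below summarizes one window of requests. The `RequestRecord` type, the choice to count only 5xx responses as errors, and the sample numbers are all assumptions made for the example, not part of any particular monitoring tool.

```python
import statistics
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RequestRecord:
    status: int         # HTTP status code returned to the client
    duration_ms: float  # time spent handling the request


def red_summary(records: List[RequestRecord], window_seconds: float) -> Dict[str, float]:
    """Summarize one time window as Rate, Errors, and Duration."""
    if len(records) < 2:
        raise ValueError("need at least two requests to estimate a percentile")
    total = len(records)
    # Assumption: only server-side failures (5xx) count as errors.
    errors = sum(1 for r in records if r.status >= 500)
    # statistics.quantiles with n=100 returns 99 cut points; index 94 is p95.
    p95 = statistics.quantiles([r.duration_ms for r in records], n=100)[94]
    return {
        "rate_per_s": total / window_seconds,  # Rate
        "error_ratio": errors / total,         # Errors
        "p95_ms": p95,                         # Duration
    }


# Made-up window: 120 requests in 60 seconds, 6 of them slow failures.
sample = [RequestRecord(200, 40.0)] * 114 + [RequestRecord(500, 900.0)] * 6
print(red_summary(sample, window_seconds=60))
```

RED describes request-driven services; USE (Utilization, Saturation, Errors) plays the same role for resources such as CPU, disks, and connection pools.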

Common pitfall

Watching averages instead of percentiles. An average latency of 100ms can hide the fact that 1% of users wait 5 seconds. Track p95 and p99, because the tail is where real users suffer and where the bugs that matter hide. An average smooths the tail away; percentiles expose it.
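
A small worked example of this effect, using made-up numbers: 1,000 requests where 988 take 50ms and 12 (slightly over 1%) take 5 seconds. The `percentile` helper is a hypothetical nearest-rank implementation written for this sketch; monitoring systems often estimate percentiles from histograms instead.

```python
import statistics
from typing import List


def percentile(samples: List[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling of pct% of n
    return ordered[rank - 1]


# Made-up latencies in milliseconds: 988 fast requests, 12 very slow ones.
latencies_ms = [50] * 988 + [5000] * 12

print(f"mean = {statistics.mean(latencies_ms):.1f} ms")  # mean = 109.4 ms
print(f"p50  = {percentile(latencies_ms, 50)} ms")       # p50  = 50 ms
print(f"p95  = {percentile(latencies_ms, 95)} ms")       # p95  = 50 ms
print(f"p99  = {percentile(latencies_ms, 99)} ms")       # p99  = 5000 ms
```

The mean looks healthy and even p95 shows nothing unusual here; only p99 reveals the requests that take 5 seconds, which is why it helps to track more than one percentile.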

Practice

Instrument your API to emit structured logs and a request-duration metric. Expose the metric, then generate a little load with a benchmarking tool and compute the p95 latency. Include a request id in every log line so you can follow a single request end to end. You are done when you can state your p95 latency as a concrete number.
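
One possible starting point for this exercise, sketched with the standard library only. The `timed_call` wrapper, the in-memory `durations_ms` list, and the log field names are all hypothetical; in a real service the wrapper would usually be framework middleware and the durations would go to a metrics library rather than a list.

```python
import json
import logging
import statistics
import time
import uuid
from typing import Callable, List, Optional

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("api")

# In-memory stand-in for a real metrics backend (assumption for this sketch).
durations_ms: List[float] = []


def timed_call(handler: Callable[[str], int], path: str,
               request_id: Optional[str] = None) -> int:
    """Run handler(path), record its duration, and emit one JSON log line."""
    request_id = request_id or uuid.uuid4().hex
    start = time.perf_counter()
    status = 500  # reported if the handler raises before returning a status
    try:
        status = handler(path)
    finally:
        duration_ms = (time.perf_counter() - start) * 1000
        durations_ms.append(duration_ms)
        logger.info(json.dumps({
            "event": "request_finished",
            "request_id": request_id,  # the same id should appear in downstream logs
            "path": path,
            "status": status,
            "duration_ms": round(duration_ms, 3),
        }))
    return status


def p95(samples: List[float]) -> float:
    """95th percentile; statistics.quantiles needs at least two samples."""
    return statistics.quantiles(samples, n=100)[94]


def hello_handler(path: str) -> int:
    """Hypothetical handler that pretends to do a millisecond of work."""
    time.sleep(0.001)
    return 200


for _ in range(20):
    timed_call(hello_handler, "/hello")
print(f"p95 = {p95(durations_ms):.2f} ms")
```

Emitting each log line as a single JSON object keeps it machine-parseable, so the request id can be searched across every component that logs it.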

Outcomes

  • Explain the roles of logs, metrics, and traces.
  • Track latency percentiles instead of averages.
  • Trace a single request across components with span context.
  • Set an alert that fires on a real problem, not noise.
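
One common way to keep an alert tied to real problems rather than noise is to require the condition to hold for several consecutive evaluation windows. The sketch below illustrates that idea; the `SustainedAlert` class, the 5% threshold, and the sample error ratios are assumptions chosen for the example.

```python
from collections import deque
from typing import Deque


class SustainedAlert:
    """Fires only when the error ratio stays above a threshold for N windows in a row."""

    def __init__(self, threshold: float, windows_required: int) -> None:
        self.threshold = threshold
        self.windows_required = windows_required
        # Keeps only the most recent N breach/no-breach results.
        self.recent: Deque[bool] = deque(maxlen=windows_required)

    def observe(self, error_ratio: float) -> bool:
        """Record one window's error ratio and report whether the alert should fire."""
        self.recent.append(error_ratio > self.threshold)
        return len(self.recent) == self.windows_required and all(self.recent)


# Made-up error ratios, one per evaluation window.
alert = SustainedAlert(threshold=0.05, windows_required=3)
for ratio in [0.01, 0.20, 0.02, 0.08, 0.09, 0.11, 0.02]:
    print(ratio, alert.observe(ratio))
```

In this run the single spike to 0.20 does not fire the alert, while the three consecutive windows above 5% (0.08, 0.09, 0.11) do.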