Security & reliability · Advanced · 4h

Chaos engineering (light).

Injecting failure on purpose to find weakness first.

What is chaos engineering?

Chaos engineering is the practice of deliberately injecting failure (killing a pod, adding latency, cutting a dependency) to discover how your system actually behaves under stress, before a real outage shows you the hard way. The "light" version starts with small, controlled experiments rather than breaking production at random.

Why it matters

Systems fail in ways nobody predicted, and the calm of a working system hides fragile assumptions. Testing failure on purpose surfaces those weaknesses while you are watching and ready, instead of at 3am. It builds genuine confidence that your redundancy and failover actually work.

What to learn

  • The hypothesis-driven experiment model
  • Starting in staging before production
  • Blast radius and limiting it
  • Common experiments: kill instances, add latency, drop dependencies
  • Steady-state metrics to measure against
  • Automated rollback if things go wrong
  • Game days as a team practice
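
The hypothesis-driven model from the list above can be sketched in a few lines of Python. This is a minimal illustration, not a real chaos tool: the `Experiment` class, the `run` function, and the in-memory `replicas` dictionary are all hypothetical names standing in for a real system and its metrics.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    hypothesis: str                    # what we expect to stay true
    steady_state: Callable[[], bool]   # True while the system looks healthy
    inject: Callable[[], None]         # introduce the failure
    rollback: Callable[[], None]       # undo the failure


def run(exp: Experiment) -> str:
    # Refuse to start if the system is already unhealthy.
    if not exp.steady_state():
        return "skipped: steady state not met before injection"
    exp.inject()
    try:
        held = exp.steady_state()
    finally:
        # Roll back whether or not the check passed or raised.
        exp.rollback()
    return "hypothesis held" if held else "hypothesis refuted"


# Toy stand-in for a service with three replicas (hypothetical).
replicas = {"a": True, "b": True, "c": True}

exp = Experiment(
    hypothesis="If one replica dies, at least two keep serving",
    steady_state=lambda: sum(replicas.values()) >= 2,
    inject=lambda: replicas.update(a=False),
    rollback=lambda: replicas.update(a=True),
)
result = run(exp)
print(exp.hypothesis, "->", result)
```

The rollback sits in a `finally` block so that the failure is undone even if the steady-state check itself throws, which is the property that matters most once an experiment moves beyond a toy.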

Common pitfall

Running chaos experiments in production with no blast-radius limit and no plan to stop. That is not engineering; it is causing an outage. Start in staging, define the steady state you expect to hold, limit the experiment to a small slice, and have an automatic abort ready before you touch anything live.
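
One way to picture the two safeguards named above, a capped blast radius and an automatic abort, is the following Python sketch. The function names, the instance names, and the 5% error threshold are assumptions chosen for illustration; a real setup would read error rates from its monitoring system.

```python
import math
import random
from typing import Callable, Iterable, List


def pick_targets(instances: List[str], max_fraction: float,
                 rng: random.Random) -> List[str]:
    """Choose at most max_fraction of the instances, and never all of them."""
    if not 0 < max_fraction < 1:
        raise ValueError("max_fraction must be between 0 and 1 (exclusive)")
    limit = min(math.floor(len(instances) * max_fraction), len(instances) - 1)
    return rng.sample(instances, max(limit, 0))


def run_with_abort(error_rates: Iterable[float], abort_threshold: float,
                   stop: Callable[[], None]) -> bool:
    """Watch error-rate samples; return False if the experiment was aborted."""
    try:
        for rate in error_rates:
            if rate > abort_threshold:
                return False  # abort early: steady state was violated
        return True
    finally:
        stop()  # always remove the injected failure


pods = [f"pod-{i}" for i in range(10)]
targets = pick_targets(pods, max_fraction=0.2, rng=random.Random(7))

stopped = []
completed = run_with_abort(
    error_rates=[0.0, 0.01, 0.08, 0.02],   # made-up samples
    abort_threshold=0.05,
    stop=lambda: stopped.append(True),
)
print(len(targets), "targets; completed:", completed)
```

The cap of `len(instances) - 1` is a deliberate design choice in this sketch: even a misconfigured fraction cannot select every instance at once.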

Practice

In a staging environment, form a hypothesis — "if one replica dies, traffic keeps flowing with no errors" — then kill a pod and watch your steady-state metrics. Note whether the system held and what you would fix. Done when you have tested one failure mode against a clear hypothesis.
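
The exercise above could be scored with a small Python helper like the one below. It is a sketch under stated assumptions: the pod name, the namespace, and the status-code samples are invented, and the command list is only built here, not executed. `kubectl delete pod` is the standard way to remove a pod; everything else is hypothetical scaffolding.

```python
from typing import List, Sequence


def kill_pod_command(pod: str, namespace: str) -> List[str]:
    """Build (but do not run) the kubectl command that deletes one pod."""
    return ["kubectl", "delete", "pod", pod, "--namespace", namespace]


def error_rate(status_codes: Sequence[int]) -> float:
    """Fraction of responses that were server errors (HTTP 5xx)."""
    if not status_codes:
        raise ValueError("need at least one sample")
    return sum(1 for code in status_codes if code >= 500) / len(status_codes)


def verdict(before: Sequence[int], during: Sequence[int],
            tolerance: float = 0.0) -> str:
    """Compare error rates from before and during the experiment."""
    delta = error_rate(during) - error_rate(before)
    return "held" if delta <= tolerance else "refuted"


# Hypothetical names and made-up samples for a staging run.
cmd = kill_pod_command("web-7d9f", "staging")
before = [200] * 50
during = [200] * 48 + [503] * 2

outcome = verdict(before, during)
print(" ".join(cmd))
print("hypothesis 'no errors when one replica dies':", outcome)
```

With these samples the hypothesis is refuted, because two of fifty requests failed while the pod was gone. That is a useful result: it points at something to fix, such as readiness checks or retry behavior, before trying again.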

Outcomes

  • Frame a chaos experiment as a testable hypothesis.
  • Limit blast radius and start outside production.
  • Measure against a defined steady state.
  • Build confidence that redundancy actually works.