Chaos engineering (light) · DevOps

What is chaos engineering?

Chaos engineering means deliberately injecting failure (killing a pod, adding latency, cutting off a dependency) to discover how your system actually behaves under stress, before a real outage shows you. The "light" version starts with small, controlled experiments rather than breaking production at random.

Why it matters

Systems fail in ways nobody predicted, and the calm of a working system hides fragile assumptions. Testing failure on purpose surfaces those weaknesses while you are watching and ready, instead of at 3 a.m. Doing so builds genuine confidence that your redundancy and failover actually work.

What to learn

The hypothesis-driven experiment model
Starting in staging before production
Blast radius and limiting it
Common experiments: kill instances, add latency, drop dependencies
Steady-state metrics to measure against
Automated rollback if things go wrong
Game days as a team practice
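
To make the first items on this list concrete, here is a minimal Python sketch of a hypothesis-driven experiment: check the steady state, inject the failure, check again, and always roll back. The Experiment class, the run function, and the in-memory replicas dictionary are hypothetical names invented for illustration, not part of any chaos tool.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    hypothesis: str                    # what we expect to stay true
    steady_state: Callable[[], bool]   # True while the system looks healthy
    inject: Callable[[], None]         # start the failure
    rollback: Callable[[], None]       # undo the failure


def run(experiment: Experiment) -> str:
    """Return 'skipped', 'held', or 'violated'."""
    # Never inject failure into a system that is already unhealthy.
    if not experiment.steady_state():
        return "skipped"
    try:
        experiment.inject()
        held = experiment.steady_state()
    finally:
        # Rollback runs even if the injection or the check raises.
        experiment.rollback()
    return "held" if held else "violated"


# Toy in-memory stand-in for a service with three replicas (an assumption).
replicas = {"a": True, "b": True, "c": True}

demo = Experiment(
    hypothesis="If one replica dies, at least two keep serving traffic",
    steady_state=lambda: sum(replicas.values()) >= 2,
    inject=lambda: replicas.update(a=False),
    rollback=lambda: replicas.update(a=True),
)

result = run(demo)
print(demo.hypothesis, "->", result)
```

In a real setup the steady-state check would query your monitoring and the injection would call your orchestrator; the shape of the loop is the part worth keeping.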

Common pitfall

Running chaos experiments in production with no blast-radius limit and no plan to stop. That is not engineering; it is causing an outage. Start in staging, define the steady state you expect to hold, limit the experiment to a small slice, and have an automatic abort ready before you touch anything live.
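
One way to guard against this pitfall is a pre-flight check that refuses to start an experiment until the safeguards are in place. The sketch below is an assumption-laden illustration: the Plan fields, the preflight and pick_targets functions, and the 10% default cap are all hypothetical choices, not a standard.

```python
import math
from dataclasses import dataclass


@dataclass
class Plan:
    environment: str            # e.g. "staging" or "production"
    rehearsed_in_staging: bool  # the same experiment already ran in staging
    steady_state: str           # the metric and threshold expected to hold
    target_fraction: float      # share of instances the experiment may touch
    has_auto_abort: bool        # an automatic stop condition is wired up


def preflight(plan: Plan, max_fraction: float = 0.1) -> list[str]:
    """Return reasons not to run; an empty list means the plan may proceed."""
    problems = []
    if plan.environment == "production" and not plan.rehearsed_in_staging:
        problems.append("not rehearsed in staging first")
    if not plan.steady_state.strip():
        problems.append("no steady state defined")
    if not 0 < plan.target_fraction <= max_fraction:
        problems.append("blast radius is unbounded or too large")
    if not plan.has_auto_abort:
        problems.append("no automatic abort")
    return problems


def pick_targets(instances: list[str], fraction: float) -> list[str]:
    """Pick a small slice of instances, never fewer than one."""
    count = max(1, math.floor(len(instances) * fraction))
    return instances[:count]


reckless = Plan(
    environment="production",
    rehearsed_in_staging=False,
    steady_state="",
    target_fraction=1.0,
    has_auto_abort=False,
)
careful = Plan(
    environment="staging",
    rehearsed_in_staging=False,
    steady_state="error rate below 1%",
    target_fraction=0.1,
    has_auto_abort=True,
)

print(preflight(reckless))
print(preflight(careful))
print(pick_targets([f"web-{i}" for i in range(10)], careful.target_fraction))
```

Note that pick_targets always returns at least one instance, so on a very small fleet the real blast radius can exceed the requested fraction.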

Resources

Primary (free):

Principles of Chaos Engineering · docs
Netflix — Chaos Monkey · tool
Gremlin — Chaos engineering guide · article

Practice

In a staging environment, form a hypothesis such as "if one replica dies, traffic keeps flowing with no errors", then kill a pod and watch your steady-state metrics. Record whether the system held and what you would fix. Done when you have tested one failure mode against a clear hypothesis.
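
If your staging environment runs on Kubernetes, the exercise could be scripted roughly as follows. This is a sketch under assumptions: the pod name web-0, a namespace literally called staging, and treating only 5xx responses as errors are all illustrative choices. The script defaults to a dry run and only calls kubectl delete pod when dry_run is False.

```python
import subprocess


def kill_pod_command(pod: str, namespace: str) -> list[str]:
    """Build the kubectl call that deletes one pod."""
    # Guard rail for this sketch: refuse anything outside staging.
    if namespace != "staging":
        raise ValueError("this sketch only targets the staging namespace")
    return ["kubectl", "delete", "pod", pod, "--namespace", namespace]


def kill_pod(pod: str, namespace: str, dry_run: bool = True) -> list[str]:
    """Delete the pod unless dry_run is set; return the command either way."""
    command = kill_pod_command(pod, namespace)
    if not dry_run:
        # Needs kubectl on PATH and access to the cluster.
        subprocess.run(command, check=True)
    return command


def hypothesis_held(status_codes: list[int]) -> bool:
    """True if traffic kept flowing (at least one sample) with no 5xx errors."""
    return bool(status_codes) and all(code < 500 for code in status_codes)


command = kill_pod("web-0", "staging")  # dry run: nothing is deleted
print("would run:", " ".join(command))

# Status codes you sampled while the pod was down (made-up values).
observed = [200, 200, 204, 200]
print("hypothesis held:", hypothesis_held(observed))
```

An empty sample list counts as a failed hypothesis here, since no observed traffic is not evidence that traffic kept flowing.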

Outcomes

Frame a chaos experiment as a testable hypothesis.
Limit blast radius and start outside production.
Measure against a defined steady state.
Build confidence that redundancy actually works.