Production patternsAdvanced5h

Safety & red-teaming.

Prompt injection, jailbreaks, and guardrails.

What is AI safety and red-teaming?

For applied AI, safety means stopping your system from being manipulated or producing harmful output. Red-teaming is deliberately attacking your own system — prompt injection, jailbreaks, data exfiltration — to find the holes before users or attackers do.

Why it matters

An LLM connected to tools, data, or users is an attack surface. Prompt injection can hijack an agent, leak private data, or make your product say things that damage trust. Building guardrails and testing them is now a core responsibility for anyone shipping AI features, not an afterthought.

What to learn

  • Prompt injection, direct and indirect
  • Jailbreaks and bypassing instructions
  • Data exfiltration through the model
  • Input and output guardrails
  • Least privilege for tools an agent can call
  • Red-teaming your own prompts and pipelines
  • Never trusting model output as safe by default

Common pitfall

Putting untrusted content — a web page, a user document — directly into the prompt and trusting the model to ignore any malicious instructions inside it. That is indirect prompt injection, and models do follow embedded instructions. Treat all external content as untrusted, isolate it, and constrain what the model is allowed to do with it.

Resources

Primary (free):

Practice

Take an LLM feature and red-team it: try a prompt injection, a jailbreak, and embedding a malicious instruction in content the model reads. Note what got through, then add a guardrail — input checks, output filtering, or tighter tool permissions. Done when at least one attack you found is now blocked.

Outcomes

  • Explain prompt injection, jailbreaks, and exfiltration.
  • Treat all external content as untrusted input.
  • Add input and output guardrails.
  • Red-team your own AI system before attackers do.
Back to AI / ML roadmap