Controlled fault injection for LLM agent systems.
Inject realistic API faults at the HTTP transport layer.
Evaluate robustness across 65 fault configs — zero agent code changes.
AgentChaos runs your agent under controlled fault injection — not static analysis, but real execution under chaos. You provide the agent + task, we inject the faults and trace every LLM call.
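A minimal usage sketch, assuming the API listed in the reference further down; the fault name and run_my_agent are illustrative placeholders, not real identifiers:

```python
import agentchaos

def run_my_agent(task: str) -> str:
    # Placeholder for your own agent entry point (LangChain, AutoGen, custom, ...).
    ...

# Illustrative fault name; agentchaos.list_faults() enumerates the actual 65 configs.
agentchaos.inject("timeout")
try:
    answer = run_my_agent("Write a function that merges overlapping intervals")
finally:
    # Stop injection/tracing and persist the trace for later diagnosis.
    agentchaos.disable()
    agentchaos.save_trace("trace.json")
```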
5 agent systems × 4 backbone LLMs × 7 benchmarks × 65 fault configs. Result: up to 50pp Δpass@1 degradation — architecture determines robustness, not model capability.
| System | Pattern | HumanEval | HumanEval+ | MBPP | MBPP+ | MMLU-Pro | MATH-500 |
|---|---|---|---|---|---|---|---|
| MapCoder | Pipeline | 48.61 | 49.30 | 41.07 | 40.85 | 38.25 | 34.27 |
| MAD | Multi-agent Debate | 24.20 | 24.84 | 24.49 | 15.08 | 20.64 | 20.70 |
| AutoGen | Conversation | 19.44 | 21.13 | 17.31 | 11.61 | 7.05 | 8.38 |
| EvoMAC | Evolutionary | 18.48 | 18.18 | 16.67 | 14.73 | 13.63 | 15.85 |
Mini-SE (Single + Tools pattern) was evaluated on SWE-bench Pro, where it degrades by only 0.87 pp.
Values are Δpass@1 degradation in percentage points for the Claude-Sonnet-4.5 backbone; consistent patterns were observed across GPT-5.2, DeepSeek-V3.2, and Seed-1.8.
| System | Claude-Sonnet-4.5 | GPT-5.2 | DeepSeek-V3.2 | Seed-1.8 |
|---|---|---|---|---|
| MapCoder | 48.61 | 49.32 | 46.48 | 46.76 |
| MAD | 24.20 | 18.26 | 17.73 | 19.11 |
| AutoGen | 19.44 | 21.42 | 15.38 | 20.00 |
| EvoMAC | 18.48 | 23.53 | 19.87 | 20.55 |
Robustness ranking is identical across all 4 backbone LLMs — agent architecture determines robustness, not model capability.
All LLM agent systems access models through the same HTTP interface. AgentChaos intercepts at the httpx transport layer to inject faults and record traces — completely transparent to your agent code.
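A rough illustration of the idea, not AgentChaos's actual internals: httpx lets you mount a custom transport, so every request an agent's LLM client makes can be recorded or faulted without touching the agent code.

```python
import httpx

class ChaosTransport(httpx.BaseTransport):
    """Wraps the real HTTP transport; records and optionally faults every LLM call."""

    def __init__(self, inject_fault: bool = False):
        self.inner = httpx.HTTPTransport()
        self.inject_fault = inject_fault
        self.trace = []

    def handle_request(self, request: httpx.Request) -> httpx.Response:
        self.trace.append({"method": request.method, "url": str(request.url)})
        if self.inject_fault:
            # Simulate a crash fault: the upstream LLM API appears to return HTTP 500.
            return httpx.Response(500, request=request, text="injected fault")
        return self.inner.handle_request(request)

# The agent's LLM client uses this transport; the agent code itself never changes.
client = httpx.Client(transport=ChaosTransport(inject_fault=True))
```

Because the interception sits below the LLM SDKs, at the transport layer, no agent code changes are needed.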
Wrap inject() / disable() around your existing code.
| | Static Scanners (e.g. agent-audit) | AgentChaos |
|---|---|---|
| Approach | Scan source code for patterns | Inject faults into agent at runtime |
| Requires | Source files (.py, .yaml) | A runnable agent + task |
| Finds | Code vulnerabilities (XSS, injection) | Robustness gaps (crash, silent fault propagation) |
| Like | ESLint / Ruff / Bandit | Chaos Monkey / Jepsen |
| Agent runs? | No — just reads files | Yes — real execution + fault injection |
| Output | Vulnerability list + CWE IDs | Trace + diagnosis + robustness evaluation |
| Function | Description |
|---|---|
| agentchaos.inject(fault) | Start fault injection + trace (None = trace only) |
| agentchaos.disable() | Stop fault injection and trace |
| agentchaos.save_trace(path) | Save execution trace to JSON |
| agentchaos.eval(agent_fn, query, faults) | Batch robustness evaluation across fault configs |
| agentchaos.diagnose(text) | Diagnose fault type from agent output |
| agentchaos.list_faults() | List all 65 fault configs |
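A batch-evaluation sketch built from the functions above; the agent stub, the slicing of list_faults(), and the shape of eval()'s return value are assumptions:

```python
import agentchaos

def my_agent(query: str) -> str:
    # Stand-in for your agent system; should return its final answer as text.
    return "..."

# Take a handful of the 65 fault configs (assuming list_faults() returns a list).
faults = agentchaos.list_faults()[:5]

# Run the same task under each fault config and collect robustness results.
results = agentchaos.eval(my_agent, "Sort a list of intervals by start time", faults)

# Classify a suspicious output: crash, truncation, or a genuine model mistake?
label = agentchaos.diagnose("def sort_intervals(xs):\n    return sorted(xs, key=lam")
```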
6 fault types × 2 targets × 4 injection strategies + 8 compound + 9 positional. Each fault maps to a real-world LLM API failure mode.
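That is 6 × 2 × 4 = 48 single-fault configs plus 8 compound and 9 positional ones, 65 in total. A toy enumeration of the arithmetic; the type, target, and strategy names below are placeholders, not AgentChaos's actual identifiers:

```python
from itertools import product

# Placeholder names; the document mentions HTTP 500, timeout, truncation,
# empty output, and wrong finish_reason among the real fault types.
fault_types = ["http_500", "timeout", "truncation", "empty", "wrong_finish_reason", "garbled"]
targets = ["request", "response"]
strategies = ["once", "every_call", "random", "burst"]

single_faults = [f"{t}/{tgt}/{s}" for t, tgt, s in product(fault_types, targets, strategies)]
print(len(single_faults))           # 48 single-fault configs
print(len(single_faults) + 8 + 9)   # + 8 compound + 9 positional = 65
```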
Six findings from evaluating agent robustness under controlled chaos.
Every agent system tested shows significant performance drops. No system is immune — even the most robust architectures lose accuracy.
Crash faults (HTTP 500, timeouts) are detected and retried. Omission faults (truncation, empty output) look like valid output and propagate silently through the agent pipeline (see the sketch after these findings).
Developers tend to misattribute truncation faults to model capability. Rule-based diagnosis reaches ~52% accuracy; LLM-based diagnosis tops out at ~47%.
Agent robustness ranking is identical across all 4 backbone LLMs. Pipeline agents are most vulnerable; iterative/evolutionary agents are most robust.
When faults are injected on every LLM call, even robust agents collapse. Persistent injection erases the robustness gap between architectures.
Combining multiple fault types (e.g. truncation + wrong finish_reason) creates cascading failures far worse than individual faults.
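To make the crash-versus-omission finding above concrete, here is a sketch of a typical retry wrapper around an LLM HTTP call; the endpoint and response shape are illustrative, not AgentChaos code:

```python
import httpx

def call_llm_with_retry(client: httpx.Client, payload: dict) -> str:
    for _ in range(3):
        try:
            resp = client.post("https://llm.example.com/v1/chat/completions", json=payload)
            resp.raise_for_status()  # crash faults (HTTP 500) raise here and get retried
            return resp.json()["choices"][0]["message"]["content"]
        except (httpx.HTTPStatusError, httpx.TransportError):
            continue  # timeouts and 5xx responses are visible, so the loop retries
    raise RuntimeError("LLM call failed after 3 attempts")

# An omission fault returns HTTP 200 with truncated or empty content, so nothing
# above raises; the damaged text flows into the next agent step as "valid" output.
```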
Inject faults into your LangChain, AutoGen, ADK, or custom agents. Diagnose retry gaps, missing error handlers, and silent fault propagation before users hit them.
Evaluate agent robustness with reproducible fault injection experiments. Generate Δpass@1 degradation curves and fault diagnosis accuracy metrics for publications.
Inject faults into agent pipelines before deployment. Trace execution to diagnose failures, and evaluate robustness to identify blind spots.