Controlled fault injection for LLM agent systems.
Inject realistic API faults at the HTTP transport layer.
Evaluate robustness across 65 fault configs — zero agent code changes.
AgentChaos runs your agent under controlled fault injection — not static analysis, but real execution under chaos. You provide the agent + task, we inject the faults and trace every LLM call.
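A minimal usage sketch, assuming the API listed in the reference further down; the fault name and run_my_agent are illustrative placeholders, not real identifiers:

```python
import agentchaos

def run_my_agent(task: str) -> str:
    # Placeholder for your own agent entry point (LangChain, AutoGen, custom, ...).
    ...

# Illustrative fault name; agentchaos.list_faults() enumerates the actual 65 configs.
agentchaos.inject("timeout")
try:
    answer = run_my_agent("Write a function that merges overlapping intervals")
finally:
    # Stop injection/tracing and persist the trace for later diagnosis.
    agentchaos.disable()
    agentchaos.save_trace("trace.json")
```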
5 agent systems × 4 backbone LLMs × 7 benchmarks × 65 fault configs. Result: up to 50pp Δpass@1 degradation — architecture determines robustness, not model capability.
| System | Pattern | HumanEval | HumanEval+ | MBPP | MBPP+ | MMLU-Pro | MATH-500 |
|---|---|---|---|---|---|---|---|
| MapCoder | Pipeline | 48.61 | 49.30 | 41.07 | 40.85 | 38.25 | 34.27 |
| MAD | Multi-agent Debate | 24.20 | 24.84 | 24.49 | 15.08 | 20.64 | 20.70 |
| AutoGen | Conversation | 19.44 | 21.13 | 17.31 | 11.61 | 7.05 | 8.38 |
| EvoMAC | Evolutionary | 18.48 | 18.18 | 16.67 | 14.73 | 13.63 | 15.85 |
Mini-SE (Single + Tools pattern) was evaluated on SWE-bench Pro, where it degrades by only 0.87 pp.
Values are Δpass@1 degradation in percentage points for the Claude-Sonnet-4.5 backbone; consistent patterns were observed across GPT-5.2, DeepSeek-V3.2, and Seed-1.8.
| System | Claude-Sonnet-4.5 | GPT-5.2 | DeepSeek-V3.2 | Seed-1.8 |
|---|---|---|---|---|
| MapCoder | 48.61 | 49.32 | 46.48 | 46.76 |
| MAD | 24.20 | 18.26 | 17.73 | 19.11 |
| AutoGen | 19.44 | 21.42 | 15.38 | 20.00 |
| EvoMAC | 18.48 | 23.53 | 19.87 | 20.55 |
Robustness ranking is identical across all 4 backbone LLMs — agent architecture determines robustness, not model capability.
All LLM agent systems access models through the same HTTP interface. AgentChaos intercepts at the httpx transport layer to inject faults and record traces — completely transparent to your agent code.
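A rough illustration of the idea, not AgentChaos's actual internals: httpx lets you mount a custom transport, so every request an agent's LLM client makes can be recorded or faulted without touching the agent code.

```python
import httpx

class ChaosTransport(httpx.BaseTransport):
    """Wraps the real HTTP transport; records and optionally faults every LLM call."""

    def __init__(self, inject_fault: bool = False):
        self.inner = httpx.HTTPTransport()
        self.inject_fault = inject_fault
        self.trace = []

    def handle_request(self, request: httpx.Request) -> httpx.Response:
        self.trace.append({"method": request.method, "url": str(request.url)})
        if self.inject_fault:
            # Simulate a crash fault: the upstream LLM API appears to return HTTP 500.
            return httpx.Response(500, request=request, text="injected fault")
        return self.inner.handle_request(request)

# The agent's LLM client uses this transport; the agent code itself never changes.
client = httpx.Client(transport=ChaosTransport(inject_fault=True))
```

Because the interception sits below the LLM SDKs, at the transport layer, no agent code changes are needed.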
Wrap inject() / disable() around your existing code.
| | Static Scanners (e.g. agent-audit) | AgentChaos |
|---|---|---|
| Approach | Scan source code for patterns | Inject faults into agent at runtime |
| Requires | Source files (.py, .yaml) | A runnable agent + task |
| Finds | Code vulnerabilities (XSS, injection) | Robustness gaps (crash, silent fault propagation) |
| Like | ESLint / Ruff / Bandit | Chaos Monkey / Jepsen |
| Agent runs? | No — just reads files | Yes — real execution + fault injection |
| Output | Vulnerability list + CWE IDs | Trace + diagnosis + robustness evaluation |
| Function | Description |
|---|---|
| agentchaos.inject(fault) | Start fault injection + trace (None = trace only) |
| agentchaos.disable() | Stop fault injection and trace |
| agentchaos.save_trace(path) | Save execution trace to JSON |
| agentchaos.eval(agent_fn, query, faults) | Batch robustness evaluation across fault configs |
| agentchaos.diagnose(text) | Diagnose fault type from agent output |
| agentchaos.list_faults() | List all 65 fault configs |
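A batch-evaluation sketch built from the functions above; the agent stub, the slicing of list_faults(), and the shape of eval()'s return value are assumptions:

```python
import agentchaos

def my_agent(query: str) -> str:
    # Stand-in for your agent system; should return its final answer as text.
    return "..."

# Take a handful of the 65 fault configs (assuming list_faults() returns a list).
faults = agentchaos.list_faults()[:5]

# Run the same task under each fault config and collect robustness results.
results = agentchaos.eval(my_agent, "Sort a list of intervals by start time", faults)

# Classify a suspicious output: crash, truncation, or a genuine model mistake?
label = agentchaos.diagnose("def sort_intervals(xs):\n    return sorted(xs, key=lam")
```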
6 fault types × 2 targets × 4 injection strategies + 8 compound + 9 positional. Each fault maps to a real-world LLM API failure mode.
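That is 6 × 2 × 4 = 48 single-fault configs plus 8 compound and 9 positional ones, 65 in total. A toy enumeration of the arithmetic; the type, target, and strategy names below are placeholders, not AgentChaos's actual identifiers:

```python
from itertools import product

# Placeholder names; the document mentions HTTP 500, timeout, truncation,
# empty output, and wrong finish_reason among the real fault types.
fault_types = ["http_500", "timeout", "truncation", "empty", "wrong_finish_reason", "garbled"]
targets = ["request", "response"]
strategies = ["once", "every_call", "random", "burst"]

single_faults = [f"{t}/{tgt}/{s}" for t, tgt, s in product(fault_types, targets, strategies)]
print(len(single_faults))           # 48 single-fault configs
print(len(single_faults) + 8 + 9)   # + 8 compound + 9 positional = 65
```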
Six findings from evaluating agent robustness under controlled chaos.
Every agent system tested shows significant performance drops. No system is immune — even the most robust architectures lose accuracy.
Crash faults (HTTP 500, timeouts) are detected and retried. Omission faults (truncation, empty output) look like valid output and propagate silently through the agent pipeline (see the sketch after these findings).
Developers tend to misattribute truncation faults to model capability. Rule-based diagnosis reaches ~52% accuracy; LLM-based diagnosis tops out at ~47%.
Agent robustness ranking is identical across all 4 backbone LLMs. Pipeline agents are most vulnerable; iterative/evolutionary agents are most robust.
When faults are injected on every LLM call, even robust agents collapse. Persistent injection erases the robustness gap between architectures.
Combining multiple fault types (e.g. truncation + wrong finish_reason) creates cascading failures far worse than individual faults.
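To make the crash-versus-omission finding above concrete, here is a sketch of a typical retry wrapper around an LLM HTTP call; the endpoint and response shape are illustrative, not AgentChaos code:

```python
import httpx

def call_llm_with_retry(client: httpx.Client, payload: dict) -> str:
    for _ in range(3):
        try:
            resp = client.post("https://llm.example.com/v1/chat/completions", json=payload)
            resp.raise_for_status()  # crash faults (HTTP 500) raise here and get retried
            return resp.json()["choices"][0]["message"]["content"]
        except (httpx.HTTPStatusError, httpx.TransportError):
            continue  # timeouts and 5xx responses are visible, so the loop retries
    raise RuntimeError("LLM call failed after 3 attempts")

# An omission fault returns HTTP 200 with truncated or empty content, so nothing
# above raises; the damaged text flows into the next agent step as "valid" output.
```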
Inject faults into your LangChain, AutoGen, ADK, or custom agents. Diagnose retry gaps, missing error handlers, and silent fault propagation before users hit them.
Evaluate agent robustness with reproducible fault injection experiments. Generate Δpass@1 degradation curves and fault diagnosis accuracy metrics for publications.
Inject faults into agent pipelines before deployment. Trace execution to diagnose failures, and evaluate robustness to identify blind spots.