v0.1.0 — 65 faults · zero code changes · 4 LLMs tested

Evaluate agent robustness
before deployment

Controlled fault injection for LLM agent systems.
Inject realistic API faults at the HTTP transport layer.
Evaluate robustness across 65 fault configs — zero agent code changes.

65 Fault Configs
~50pp Max Pass@1 Drop
5 Agent Systems
4 Backbone LLMs

Inject faults. Trace execution. Evaluate robustness.

AgentChaos runs your agent under controlled fault injection — not static analysis, but real execution under chaos. You provide the agent + task, we inject the faults and trace every LLM call.

~/my-agent-project
# Install
pip install agentchaos-sdk

# Your agent code needs ZERO changes
import agentchaos

agentchaos.inject("llm_error_single")  # start fault injection + tracing
result = await my_agent(query)         # run your agent unmodified
agentchaos.disable()                   # stop injection
agentchaos.save_trace("trace.json")    # persist the execution trace

# Examples
git clone https://github.com/floritange/AgentChaos.git
cd AgentChaos && uv sync
uv run python examples/list_faults.py # list all 65 faults
uv run python examples/agent_openai.py # OpenAI agent: normal vs faulted
uv run python examples/agent_langchain.py # LangChain agent
uv run python examples/agent_adk.py # Google ADK agent
uv run python examples/eval_batch.py # batch evaluation

Agent robustness evaluation

5 agent systems × 4 backbone LLMs × 7 benchmarks × 65 fault configs. Result: up to 50pp Δpass@1 degradation — architecture determines robustness, not model capability.

Δpass@1 degradation (percentage points, averaged across fault configs)

System    Pattern             HumanEval  HumanEval+  MBPP   MBPP+  MMLU-Pro  MATH-500
MapCoder  Pipeline            48.61      49.30       41.07  40.85  38.25     34.27
MAD       Multi-agent Debate  24.20      24.84       24.49  15.08  20.64     20.70
AutoGen   Conversation        19.44      21.13       17.31  11.61   7.05      8.38
EvoMAC    Evolutionary        18.48      18.18       16.67  14.73  13.63     15.85
Mini-SE   Single + Tools      evaluated on SWE-bench Pro: Δ0.87pp

Results shown for Claude-Sonnet-4.5 backbone. Consistent patterns observed across GPT-5.2, DeepSeek-V3.2, and Seed-1.8.

Cross-LLM consistency (HumanEval Δpass@1)

System    Claude-Sonnet-4.5  GPT-5.2  DeepSeek-V3.2  Seed-1.8
MapCoder  48.61              49.32    46.48          46.76
MAD       24.20              18.26    17.73          19.11
AutoGen   19.44              21.42    15.38          20.00
EvoMAC    18.48              23.53    19.87          20.55

Robustness ranking is identical across all 4 backbone LLMs — agent architecture determines robustness, not model capability.

Non-intrusive HTTP-layer fault injection

All LLM agent systems access models through the same HTTP interface. AgentChaos intercepts at the httpx transport layer to inject faults and record traces — completely transparent to your agent code.

🤖 Your Agent (any framework)
→ 📡 httpx.send() (transport layer)
→ 🔥 FaultEngine (intercept & mutate)
→ 📦 Faulty Response (looks real to the agent)
→ 🔍 Diagnose (detect & report)
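The interception idea can be sketched in a few lines of stdlib Python. Everything here (`Response`, `wrap_send`, the stand-in backend) is illustrative, not the SDK's internals; the real tool hooks the httpx transport layer rather than wrapping a function:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Response:
    status_code: int
    text: str

def wrap_send(send: Callable[[str], Response],
              fault: Callable[[Response], Response]) -> Callable[[str], Response]:
    """Return a send() that forwards the request, then mutates the response in flight."""
    def faulty_send(request: str) -> Response:
        response = send(request)  # real (here: stubbed) backend call
        return fault(response)    # the "FaultEngine" step: mutate before the agent sees it
    return faulty_send

# Stand-in backend plus a "truncate at 30%" fault:
backend = lambda req: Response(200, "def add(a, b):\n    return a + b\n")
truncate = lambda r: Response(r.status_code, r.text[: int(len(r.text) * 0.3)])

send = wrap_send(backend, truncate)
print(send("write add()").text)  # prints "def add(a"
```

The agent keeps calling `send()` as before, which is why the faulty response "looks real" to it.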

Properties

  • ✓ Works with any framework using OpenAI-compatible APIs (OpenAI, LangChain, ADK, AutoGen, LiteLLM)
  • ✓ Zero code changes: just inject() / disable() around your existing code
  • ✓ Records the full execution trace (raw input/output, token usage, timing) for every LLM call
  • ✓ 65 pre-built fault configurations covering common real-world failure modes

Static Analysis vs. Runtime Fault Injection

             Static Scanners (e.g. agent-audit)     AgentChaos
Approach     Scan source code for patterns          Inject faults into the agent at runtime
Requires     Source files (.py, .yaml)              A runnable agent + task
Finds        Code vulnerabilities (XSS, injection)  Robustness gaps (crashes, silent fault propagation)
Like         ESLint / Ruff / Bandit                 Chaos Monkey / Jepsen
Agent runs?  No (just reads files)                  Yes (real execution + fault injection)
Output       Vulnerability list + CWE IDs           Trace + diagnosis + robustness evaluation

API

Function                                  Description
agentchaos.inject(fault)                  Start fault injection + tracing (fault=None: trace only)
agentchaos.disable()                      Stop fault injection and tracing
agentchaos.save_trace(path)               Save the execution trace to JSON
agentchaos.eval(agent_fn, query, faults)  Batch robustness evaluation across fault configs
agentchaos.diagnose(text)                 Diagnose the fault type from agent output
agentchaos.list_faults()                  List all 65 fault configs

65 fault configurations

6 fault types × 2 targets × 4 injection strategies (48 base configs), plus 8 compound and 9 positional configs. Each fault maps to a real-world LLM API failure mode.

💥 Error Crash
Inject HTTP 500 / server error into LLM API responses. Simulates server overload, rate limiting, and service outages.
Content Tool Call 4 strategies
⏱️ Timeout Crash
Inject timeout into LLM API responses. Simulates network congestion, backend delays, and API latency spikes.
Content Tool Call 4 strategies
🕳️ Empty Omission
Inject empty content into LLM responses. Simulates safety filter triggers, content policy blocks, and silent rejections.
Content Tool Call 4 strategies
✂️ Truncate Omission
Inject truncation at 30% length. Simulates token limit hits, TCP disconnects, and incomplete streaming.
Content Tool Call 4 strategies
🔀 Corrupt Value
Inject unicode/mojibake corruption into responses. Simulates encoding errors, proxy charset mismatches, and garbled payloads.
Content Tool Call 4 strategies
📋 Schema Value
Inject JSON/HTML instead of expected format. Simulates parsing errors, schema violations, and structural anomalies.
Content Tool Call 4 strategies
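As a sketch of what the mutation step might look like, here is how three of these faults could be applied to an OpenAI-style chat-completion payload (the payload shape follows the public chat-completions format; the function names are illustrative, not the SDK's):

```python
import copy

# A minimal OpenAI-style chat-completion payload (illustrative).
RESPONSE = {
    "choices": [{
        "message": {"role": "assistant", "content": "The answer is 42 because ..."},
        "finish_reason": "stop",
    }]
}

def inject_empty(resp):
    """Empty fault: content vanishes, as with a silent safety-filter rejection."""
    out = copy.deepcopy(resp)
    out["choices"][0]["message"]["content"] = ""
    return out

def inject_truncate(resp, ratio=0.3):
    """Truncate fault: keep only the first 30% of the content."""
    out = copy.deepcopy(resp)
    msg = out["choices"][0]["message"]
    msg["content"] = msg["content"][: int(len(msg["content"]) * ratio)]
    return out

def inject_schema(resp):
    """Schema fault: replace the expected text with unexpected JSON."""
    out = copy.deepcopy(resp)
    out["choices"][0]["message"]["content"] = '{"error": "unexpected JSON"}'
    return out

print(inject_truncate(RESPONSE)["choices"][0]["message"]["content"])  # prints "The answ"
```

Note that every mutation still returns a structurally valid 200 response, which is exactly why omission faults slip past naive error handling.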

Injection Strategies

Single
Fire once, then stop
Transient network glitch
Persistent
Every call fails
Expired API key
Intermittent
30% probability per call
Flaky connection
Burst
First 3 calls fail, then recover
Rate limit burst
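The four strategies boil down to a per-call yes/no decision. A minimal, seeded sketch (class and parameter names are illustrative, not the SDK's):

```python
import random

class FaultSchedule:
    """Decide, per LLM call, whether a fault fires (illustrative, not the SDK)."""

    def __init__(self, strategy: str, p: float = 0.3, burst: int = 3, seed: int = 0):
        self.strategy, self.p, self.burst = strategy, p, burst
        self.calls = 0
        self.rng = random.Random(seed)  # seeded so runs are reproducible

    def should_inject(self) -> bool:
        self.calls += 1
        if self.strategy == "single":
            return self.calls == 1              # fire once, then stop
        if self.strategy == "persistent":
            return True                         # every call fails
        if self.strategy == "intermittent":
            return self.rng.random() < self.p   # 30% probability per call
        if self.strategy == "burst":
            return self.calls <= self.burst     # first 3 calls fail, then recover
        return False

sched = FaultSchedule("burst")
print([sched.should_inject() for _ in range(5)])  # [True, True, True, False, False]
```

Seeding the intermittent strategy matters for evaluation: without it, two runs of the same agent under the same fault config would see different failure patterns.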

Compound Scenarios

API Degradation
Delay then error response
Cascading backend failure
Content Filter
Strip tool calls + filter message
Safety policy block
Max Tokens
Truncate + finish_reason=length
Context window overflow
Proxy HTML
Replace with HTML error page
CDN / reverse proxy error
Stale Cache
Replay previous response
Caching layer staleness
Wrong Entity
Ambiguous tool arguments
Entity resolution failure
Stale Data
Wrong values in tool args
Outdated function params
Slow Response
Add delay, content unchanged
High latency, no error

What fault injection reveals

Six findings from evaluating agent robustness under controlled chaos.

Finding 01

All systems degrade under fault injection

Every agent system tested shows significant performance drops. No system is immune — even the most robust architectures lose accuracy.

Δpass@1 up to 50 percentage points
Finding 02

Severe faults are NOT the most harmful

Crash faults (HTTP 500, timeouts) are detected and retried. Omission faults (truncation, empty) look like valid output and propagate silently through the agent pipeline.

Truncation causes the highest Δpass@1 but is diagnosed at only 4.3% accuracy
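The asymmetry is easy to demonstrate with a toy guard (illustrative; `naive_guard` is not part of the SDK): a status-code check catches crash faults, while a truncated 200 body is accepted as valid output.

```python
def naive_guard(status_code: int, content: str) -> str:
    """A typical agent-side check: crashes are caught, omissions slip through."""
    if status_code != 200:
        return "retry"   # HTTP 500 / timeout: detected, so the agent retries
    return "accept"      # any 200 body is trusted as valid model output

print(naive_guard(500, ""))           # crash fault -> "retry"
print(naive_guard(200, "def add(a"))  # truncated body -> "accept"
```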
Finding 03

Most harmful faults are hardest to diagnose

Developers misattribute truncation faults to model capability. Rule-based diagnosis achieves ~52% accuracy; LLM-based diagnosis caps at ~47%.

Fault diagnosis accuracy ceiling: <56% for both methods
Finding 04

Architecture determines robustness, not model

Agent robustness ranking is identical across all 4 backbone LLMs. Pipeline agents are most vulnerable; iterative/evolutionary agents are most robust.

Ranking consistent across Claude, GPT, DeepSeek, Seed
Finding 05

Persistent fault injection overrides architecture

When faults are injected on every LLM call, even robust agents collapse. Persistent injection erases the robustness gap between architectures.

Up to 62.39pp Δpass@1 under persistent fault injection
Finding 06

Compound faults amplify degradation

Combining multiple fault types (e.g. truncation + wrong finish_reason) creates cascading failures far worse than individual faults.

Up to 86.36pp Δpass@1 under compound fault injection (MapCoder)

Built for agent builders and researchers

👨‍💻

Agent Developers

Inject faults into your LangChain, AutoGen, ADK, or custom agents. Diagnose retry gaps, missing error handlers, and silent fault propagation before users hit them.

🔬

Researchers

Evaluate agent robustness with reproducible fault injection experiments. Generate Δpass@1 degradation curves and fault diagnosis accuracy metrics for publications.

🏢

Production Teams

Inject faults into agent pipelines before deployment. Trace execution to diagnose failures and evaluate robustness to identify blind spots.