AI / Machine Learning | 2026

agentrial

The pytest for AI agents. Run your agent 100 times, get confidence intervals instead of anecdotes. Published open-source framework with Wilson confidence intervals, step-level failure attribution via Fisher exact test, and real cost tracking across 45+ models.

Python · Statistics · LLM APIs · FastAPI · VS Code Extension · CI/CD · PyPI
$ agentrial run tests/calculator_agent.yml --trials 25
agentrial v0.5.2 — calculator_agent
Model: claude-3.5-haiku · Parallel: 8 · Provider: auto-selected
  basic-math         $0.08   0.25s   100%   PASS
  micro-step-color   $0.06   0.18s   100%   PASS
  add-subtraction    $0.06   0.26s   100%   PASS
  multi-digit        $0.07   0.30s    60%   WARN
  error-handling     $0.05   0.19s    80%   PASS
    → Step: output format mismatch  (Fisher p=0.027)
Results
  Suites: 5 pass · 1 warn · 0 fail
  Reliability: 94.0%  [83.5% — 98.7%]  (Wilson CI)
  Score: 87/100
Cost Summary
  Total: $0.31  (25 trials)  ·  Avg: $0.012/trial
  Duration: 8.2s parallel — est. 62s serial
  ✓ Report saved to ./reports/calculator_agent_2026-02-08.html

Motivation

AI agents are unreliable. The same agent, same prompt, same model can pass on Monday and fail on Wednesday. Research shows LLMs exhibit up to 72% variance across runs, even at temperature=0. Yet the standard practice is to run an agent once, see it work, and ship it.

I built agentrial because I was tired of anecdotes. I wanted confidence intervals.

The Core Insight

Agent evaluation is a statistical problem, not a pass/fail one. A single successful run tells you nothing about production reliability. You need multiple trials, proper statistical tests, and a framework that tracks where failures happen, not just that they happen.

agentrial borrows from clinical trial methodology: run your agent N times, compute Wilson confidence intervals on pass rates, and use Fisher's exact test to pinpoint which step in the pipeline is the weak link. No more guessing.
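To make that concrete, here is a small sketch of the two statistics the report above leans on. It is the textbook math (via scipy), not agentrial's internal code, and the pass/fail counts are invented for illustration:

from math import sqrt
from scipy.stats import fisher_exact

def wilson_interval(passes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for an observed pass rate."""
    p = passes / trials
    denom = 1 + z ** 2 / trials
    center = (p + z ** 2 / (2 * trials)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2))
    return center - half, center + half

# 47 of 50 trials passed end to end (invented counts).
lo, hi = wilson_interval(47, 50)
print(f"pass rate {47 / 50:.1%}, 95% CI [{lo:.1%}, {hi:.1%}]")

# One plausible 2x2 framing for step-level attribution: is a malformed output
# from the formatting step associated with the whole trial failing?
#                   trial passed  trial failed
table = [[40, 2],   # step output well-formed
         [3, 5]]    # step output malformed
odds_ratio, p_value = fisher_exact(table)
print(f"Fisher exact p = {p_value:.3f}")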

What It Does

The framework wraps around any agent (LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, or custom) and provides:

  • Multi-trial execution with configurable parallelism
  • Wilson confidence intervals on every metric
  • Step-level failure attribution via Fisher exact test: if step 3 fails 40% of the time, you see it
  • Real cost tracking across 45+ models with per-test breakdowns
  • Drift detection for production: CUSUM, Page-Hinkley, and Kolmogorov-Smirnov detectors catch regressions before users do (a generic CUSUM sketch follows this list)
  • Agent Reliability Score (ARS): a composite [0, 100] metric combining success rate, latency, cost, and consistency
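Of these, the drift detectors are the easiest to show in a few lines. The sketch below is a generic one-sided CUSUM for a rising failure rate, not agentrial's implementation; the baseline, slack, and threshold values are illustrative and would be tuned against real trial history.

def cusum_failure_drift(outcomes, baseline_fail=0.05, slack=0.02, threshold=3.0):
    """Return the index at which cumulative upward drift in failures first alarms, else None."""
    s = 0.0
    for i, failed in enumerate(outcomes):  # outcomes: iterable of 0/1, one per trial
        s = max(0.0, s + (failed - baseline_fail - slack))
        if s > threshold:
            return i  # alarm: failure rate has drifted above the baseline
    return None

# 60 mostly-healthy trials followed by a regression with ~50% failures;
# the alarm fires a handful of trials into the regression.
healthy = [0] * 19 + [1] + [0] * 19 + [1] + [0] * 20
regressed = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0] * 3
print(cusum_failure_drift(healthy + regressed))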

Design Decisions

Local-first, no SaaS. agentrial runs entirely on your machine. No telemetry, no cloud dependency. Tests are defined in YAML or Python, results are stored locally, and everything integrates with CI/CD via GitHub Actions.

Framework-agnostic. Adapter-based architecture means adding support for a new framework is ~50 lines of code. The 6 built-in adapters cover the majority of the ecosystem.
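As a rough illustration of what "one adapter per framework" can look like: the class and method names below are assumptions made up for this sketch, not agentrial's actual API.

from dataclasses import dataclass, field
from typing import Any

@dataclass
class TrialResult:
    output: str
    steps: list[dict[str, Any]] = field(default_factory=list)  # per-step traces for attribution
    input_tokens: int = 0
    output_tokens: int = 0

class AgentAdapter:
    """Translate one framework's invoke/trace conventions into a common TrialResult."""
    def run_trial(self, task: str) -> TrialResult:
        raise NotImplementedError

class CallableAgentAdapter(AgentAdapter):
    """Wraps any agent exposed as a plain callable, e.g. a custom in-house agent."""
    def __init__(self, agent):
        self.agent = agent

    def run_trial(self, task: str) -> TrialResult:
        result = self.agent(task)  # however your agent is invoked
        return TrialResult(
            output=str(result),
            steps=list(getattr(result, "intermediate_steps", [])),
        )

The point of the pattern is that everything above this layer (trial orchestration, statistics, cost tracking) stays framework-agnostic.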

MCP Security Scanner. As Model Context Protocol tools proliferate, security becomes critical. The built-in scanner analyzes MCP integrations for common vulnerabilities before they reach production.

Impact

The project is published on PyPI as an open-source package with MIT license, 450 passing tests, and a VS Code extension for inline results. The goal is to make "we ran it once and it worked" an unacceptable standard for agent deployment.