Standards for building agents, better
Agentic testing for agentic codebases
Ship agents you can audit.
The pre-flight check for AI agents
The open-source MultiAgentOps evaluation harness for any industry scenario.
Deterministic runtime for agent evaluation
GitHub template for agent-testable SaaS apps. Next.js 16 + shadcn/ui + Neon Postgres + agent-browser e2e testing via accessibility tree.
Diagnose your AI agents in production. Extract policies from prompts, evaluate traces, generate diagnostic reports.
Qualitative benchmark suite for evaluating AI coding agents and orchestration paradigms on realistic, complex development tasks
Agent testing automation 🤖 by simulating users 👥 and agents 🤝 with a judge ⚖️ (langwatch-scenario)
Simulation environment for testing and validating autonomous agents
The definitive benchmark for AI agents on OpenClaw. 45 tasks across 4 tiers. Powered by MyClaw.ai
🔬 Playwright for AI Agents — Test, record, and replay agent behaviors
Token-efficient stochastic testing for AI agents. 5-20x cost reduction. 10 framework adapters. Paper: arXiv:2603.02601
PHP testing framework for LLM agents — multi-turn dialogs, cassette replay, tool calling, LLM-as-judge assertions
A Multi-Agent System for Cross-Checking Phishing URLs.
AI Agent Evaluation and Monitoring Guide
Behavior test framework for AI agents. Define tests in YAML. Run against transcripts. Get scored reports.
QUEEF — Quality User Experience Enforcement Framework. Test structural UX of agent responses.
Holdout scenario evaluation harness for AI agents. Doer/Judge/Adversary/Observer roles, probabilistic satisfaction scoring, append-only JSONL audit trails with integrity hashes. Created Dec 2025.