
Tools for Agent Development and Testing

Table of contents
  1. Tools for Agent Development and Testing
    1. Agent Development Tools
      1. Agent Development Kit
      2. AGENTS.md
      3. Any Agent
      4. CUGA - ConfigUrable Generalist Agent
      5. Weave CLI
      6. Other References on Agent Development
    2. Agent Evaluation and Testing Tools
      1. Arize Phoenix
      2. AssetOpsBench
      3. Braintrust
      4. Benchmarks, Registries, Competitions, and Leaderboards
      5. CUBE
      6. DoomArena
      7. Harbor
      8. LangSmith
      9. LastMile AI’s MCP Eval
      10. Simulation Tools
    3. What’s Next?

The previous chapters discussed several agent implementation and testing tools we used, specifically LangChain’s Deep Agents library and Agent Skills.

TBD - evaluation tools…

This chapter lists some other tools for implementation and evaluation that may be of interest. You can skip this chapter if you aren’t interested in exploring additional tools.

Highlights:

  1. What tools have you used? Feedback is welcome on the list here, especially experiences with any of the tools mentioned. What tools should we add?

We group these tools into development and testing categories, although there is some overlap. Within each category, the tools are listed in alphabetical order.

Agent Development Tools

There is a rapidly growing list of tools for developing agents. In addition to the tools mentioned above, here are some others worth considering, all of which offer integrated support for evaluation in one form or another.

Agent Development Kit

Google’s Agent Development Kit (ADK) is a framework for building agents, and its documentation provides general guidance on agent development. The documentation also includes a chapter called Why Evaluate Agents?, which provides tips for writing evaluations specifically tailored to agents.

AGENTS.md

Similar in spirit to Skills, OpenAI’s AGENTS.md (GitHub) is a simple Markdown format for guiding coding agents. They describe it as a README for agents: a dedicated, predictable place to provide the context and instructions that help AI coding agents work on a project.
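The format itself is free-form Markdown, so there is no required schema. A minimal, hypothetical example might look like the following; the project layout, commands, and conventions are invented purely for illustration:

```markdown
# AGENTS.md

## Project overview
A Flask web service with a React front end. Backend code lives in `server/`,
front-end code in `web/`.

## Setup and test commands
- Install: `pip install -e ".[dev]"` and `npm install --prefix web`
- Run tests: `pytest server/tests` and `npm test --prefix web`
- Lint before committing: `ruff check server` and `npm run lint --prefix web`

## Conventions
- Use type hints on all new Python functions.
- Do not modify files under `server/migrations/` by hand.
```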

Any Agent

Mozilla AI’s any-agent (blog post) abstracts over other agent frameworks, providing common services like observability while letting you switch the underlying agent framework as needed.
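As a rough illustration of that abstraction idea (this is not any-agent’s actual API; the class and function names below are invented), a thin adapter layer over multiple frameworks might look like this:

```python
from typing import Protocol


class AgentBackend(Protocol):
    """Minimal interface every framework adapter must satisfy."""

    def run(self, prompt: str) -> str:
        ...


class LangGraphBackend:
    """Hypothetical adapter that would wrap a LangGraph agent."""

    def run(self, prompt: str) -> str:
        # Delegate to the real framework here; stubbed for illustration.
        return f"[langgraph] {prompt}"


class SmolagentsBackend:
    """Hypothetical adapter that would wrap a smolagents agent."""

    def run(self, prompt: str) -> str:
        return f"[smolagents] {prompt}"


def make_agent(framework: str) -> AgentBackend:
    """Swap frameworks by name without changing the calling code."""
    backends = {"langgraph": LangGraphBackend, "smolagents": SmolagentsBackend}
    return backends[framework]()


if __name__ == "__main__":
    agent = make_agent("langgraph")
    print(agent.run("Summarize the latest evaluation results."))
```

The calling code only sees the common `run` interface, which is what makes cross-framework observability and evaluation hooks possible in tools like any-agent.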

CUGA - ConfigUrable Generalist Agent

CUGA (ConfigUrable Generalist Agent) (GitHub, IBM blog post, HuggingFace blog post) is an agent framework from IBM Research that is purpose-built for enterprise automation.

CUGA integrates several popular agentic patterns, such as ReAct, CodeAct (and here), and Planner-Executor.
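As a generic sketch of the planner-executor pattern mentioned above (this is not CUGA’s implementation; the functions are placeholders for LLM and tool calls), the basic control flow is: a planner decomposes the task into steps, an executor carries out each step, and the results are accumulated:

```python
def plan(task: str) -> list[str]:
    """Placeholder planner: in practice, an LLM call that decomposes the task."""
    return [f"step 1 for: {task}", f"step 2 for: {task}"]


def execute(step: str, context: dict) -> str:
    """Placeholder executor: in practice, an LLM or tool call that performs one step."""
    return f"completed {step!r}"


def planner_executor(task: str) -> dict:
    """Run the plan, executing each step with access to accumulated context."""
    context: dict = {"task": task, "results": []}
    for step in plan(task):
        context["results"].append(execute(step, context))
    return context


if __name__ == "__main__":
    print(planner_executor("compile a report of open customer tickets"))
```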

CUGA provides a modular architecture enabling trustworthy, policy-aware, and composable automation across web interfaces, APIs, and custom enterprise systems. It also takes evaluation seriously, with built-in tools and examples.

A related, more recent project from the same team is the Agent Lifecycle Toolkit, which helps agent builders create better-performing agents by addressing common errors, such as failing to follow instructions, struggling to find the right tool to use, and violating business rules.

Weave CLI

weave-cli is a tool that makes it easier to work with vector databases and related agents. It has built-in features for running evaluations.

Other References on Agent Development

The following resources offer useful guidance on various aspects of agent development.

Agent Evaluation and Testing Tools

Some of the tools listed above also support evaluation and testing, e.g., Google’s Agent Development Kit.

Arize Phoenix

(Mentioned in the Anthropic evaluation post.) Arize Phoenix is an open-source platform for LLM tracing, debugging, and offline or online evaluations. AX is their SaaS offering with additional scalability and other capabilities.

AssetOpsBench

AssetOpsBench from IBM is a unified framework for developing, orchestrating, and evaluating domain-specific AI agents in industrial asset operations and maintenance. Designed for maintenance engineers, reliability specialists, and facility planners, it allows reproducible evaluation of multi-step workflows in simulated industrial environments.

Braintrust

(Mentioned in the Anthropic evaluation post.) Braintrust integrates offline evaluation with production traces, for example allowing interesting traces to be easily converted into evaluations.

Benchmarks, Registries, Competitions, and Leaderboards

Several agent benchmarks, registries, competitions, and leaderboards have emerged that are good resources for finding agents that work well, along with the evaluations used to assess them.

  • Agent Beats is a registry for agents, benchmarks, and a running competition, organized by Berkeley RDI and their Agentic AI MOOC.
  • Humanity’s Last Exam (HLE) addresses the problem that state-of-the-art LLMs now achieve over 90% accuracy on the most popular benchmarks, limiting informed measurement of capabilities. HLE is a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind, with broad subject coverage. The dataset consists of 2,500 challenging questions across over a hundred subjects. HLE is a global collaborative effort, with questions from nearly 1,000 subject-expert contributors affiliated with over 500 institutions across 50 countries, most of them professors, researchers, and graduate degree holders.
  • Exgentic (ArXiv paper) is an open-source leaderboard for general agents. Based on their observations, they conclude that general-purpose agents often outperform specialized agents, and that model choice has a bigger impact than the agent framework or patterns used. Model choice also has the biggest impact on the cost profile.

CUBE

The CUBE (Common Unified Benchmark Environment) projects were discussed in the [Testing Agents](/ai-application-testing/testing-strategies/testing-agents/evaluating-agents/) chapter; see that chapter and the References for more details. In short, these projects attempt to standardize techniques for building agent evaluations, along with an effort to build and catalog evaluations that follow the standard.

DoomArena

DoomArena is a framework for testing AI Agents against evolving security threats. It offers a modular, configurable, plug-in framework for threat modeling and testing the security of AI agents across multiple attack scenarios. See the References for more details.

Harbor

(Mentioned in the Anthropic evaluation post.) Harbor is a framework for evaluating and optimizing agents and models in container environments. It can run at scale in cloud environments.

LangSmith

(Mentioned in the Anthropic evaluation post.) LangSmith, part of the LangChain ecosystem, integrates offline and online evaluation. Langfuse offers similar capabilities in an open-source package that supports on-premises use.

LastMile AI’s MCP Eval

MCP Eval is an evaluation framework for testing Model Context Protocol (MCP) servers and the agents that use them, in contrast to traditional testing approaches that mock interactions or test components in isolation. It is built on MCP Agent, LastMile AI’s agent framework that emphasizes MCP as the communication protocol.

Simulation Tools

Agents interact with other agents, tools, and systems, often with complex behaviors. Agent evaluations can’t always interact with the real systems, so digital twins or simulations of those systems are necessary.

Simulation of environments has long been an important part of Reinforcement Learning (RL). Gymnasium, the successor to OpenAI’s Gym, is one popular framework.
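For reference, the standard Gymnasium interaction loop looks roughly like this, using the built-in CartPole environment and random actions in place of a real agent policy:

```python
import gymnasium as gym

# Create a simulated environment and run one episode with random actions.
env = gym.make("CartPole-v1")
obs, info = env.reset(seed=42)

done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()  # replace with an agent's policy
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated

print(f"Episode finished with total reward {total_reward}")
env.close()
```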

The requirements for simulation environments have evolved along with RL’s growing role in model post-training.

The PyTorch community recently announced OpenEnv, “an end-to-end framework designed to standardize how agents interact with execution environments during reinforcement learning (RL) training.” (Other links: GitHub, HuggingFace, HuggingFace blog post.)

Some of the benefits of OpenEnv compared to other options include better type safety, Docker containers that provide both sandboxed execution and cluster deployment for scaling, support for users beyond Python, and support for sharing environments.

While oriented towards RL, OpenEnv can be used to build environment simulations for use by agent evaluations, especially where a task’s step-by-step state evolution needs to be observed and progress measured.
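To make that last point concrete, here is a minimal sketch of that style of evaluation written against a generic Gymnasium-style environment (not OpenEnv’s actual API); the environment, policy, and progress metric are all supplied by the evaluation author:

```python
def evaluate_episode(env, agent_policy, progress_metric, max_steps: int = 100) -> list[float]:
    """Run one episode and record a task-progress score after each step.

    `env` follows the Gymnasium reset/step convention; `agent_policy` maps an
    observation to an action; `progress_metric` maps an observation to a score
    in [0, 1] indicating how close the agent is to completing the task.
    """
    obs, info = env.reset()
    progress = [progress_metric(obs)]
    for _ in range(max_steps):
        action = agent_policy(obs)
        obs, reward, terminated, truncated, info = env.step(action)
        progress.append(progress_metric(obs))
        if terminated or truncated:
            break
    return progress
```

The returned list of per-step scores lets an evaluation report not just whether the task was completed, but how the agent progressed (or stalled) along the way.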

Similarly, Patronus AI has described a technology they are working on called Generative Simulators (paper - PDF), an outgrowth of their work on various benchmarks, e.g., for the financial sector.

What’s Next?

Review the highlights summarized above, then proceed to the chapters in the Advanced Techniques section.