Link Search Menu Expand Document

AI Safety, Governance, and Education

Collaborate on the necessary enablers of successful AI applications.

In order for the objectives of the Open Agent Hub and the Open Data and Model Foundry to be achieved, fundamental requirements must be met for safety, governance, and the expertise required to use AI technologies effectively.

AI Safety encompasses classic cybersecurity, as well as AI-specific concerns, such as suppression of undesirable content and compliance with regulations and social norms. A more general term is trustworthiness, which adds concerns about ensuring accuracy (i.e., minimizing hallucinations) and meeting the specific requirements for application use cases, etc. Enterprises won’t deploy AI applications into production scenarios if they don’t trust them to behave as expected.

Governance is an aspect of trustworthiness, specifically the assurances that all end-to-end processes used to create all AI application components are secure, licensed for use, etc. AI models are created with data; they are mostly data themselves. Hence, models, like data, need to be governed.

Finally, Education addresses the needs where organizations struggle to learn all the things they need to know in order to use AI safely and effectively. Not only has AI introduced new tools and techniques to software application development, it has fundamentally altered some of the ways software works, for example, introducing stochastic behaviors as core aspects of application features, where previously deterministic behaviors were the norm. Most AI Alliance projects have dual missions, not only to innovate and create, but to educate.

The following projects address these concerns.

Links Description
The AI Trust and Safety User Guide AI Alliance icon
An introduction to trust and safety concepts from diverse experts, followed by recommendations for how to meet your application's needs. Start here if you are new to trust and safety, then leverage the projects discussed next to implement what you need.
Testing Generative AI Applications AI Alliance icon
Are you an enterprise developer? How should you test AI applications? You know how to write deterministic tests for your "pre-AI" applications. What should you do when you add generative AI models, which aren't deterministic? This project adapts existing evaluation techniques for the "last mile" of AI evaluation; verifying that an AI application correctly implements its requirements and use cases, going beyond the general concerns of evaluation for safety, security, etc. We are building nontrivial, reusable examples and instructional materials, so you can use these techniques effectively in combination with the traditional tools you already know. This project is part of the Trust and Safety Evaluation Initiative (TSEI). (It was previously called Achieving Confidence in Enterprise AI Applications.)
DoomArena

AI agents are becoming increasingly powerful and ubiquitous. They now interact with users, tools, web pages, and databases—each of which introduces potential attack vectors for malicious actors. As a result, the security of AI agents has become a critical concern. DoomArena provides a modular, configurable framework that enables the simulation of realistic and evolving security threats against AI agents. It helps researchers and developers explore vulnerabilities, test defenses, and improve the security of AI systems. The DoomArena architecture comprises several key components that work together to create a flexible, powerful security testing environment for AI agents:

  • Attack Gateway: Functions as a wrapper around original agentic environments (TauBench, BrowserGym, OSWorld), injecting malicious content into the user-agent-environment loop as the AI agent interacts with it.
  • Threat Model: Defines which components of the agentic framework are attackable and specifies targets for the attacker, enabling fine-grained security testing.
  • Attack Config: Specifies the AttackableComponent, the AttackChoice (drawn from a library of implemented attacks), and the SuccessFilter which evaluates attack success.

DoomArena offers several advanced capabilities that make it a powerful and flexible framework for security testing of AI agents:

  • Plug-in: Plug into to your favorite agentic framework and environments τ-Bench, BrowserGym, OSWorld without requiring any modifications to their code.
  • Customizable threat models: Test agents against various threat models including malicious users and compromised environments.
  • Generic Attacker Agents: Develop and reuse attacker agents across multiple environments.
  • Defense Evaluation: Compare effectiveness of guardrail-based, LLM-powered, and security-by-design defenses.
  • Composable Attacks: Reuse and combine previously published attacks for comprehensive and fine-grained security testing.
  • Trade-off Analysis: Analyze the utility/security trade-off under various threat models.

Evaluation Is for Everyone AI Alliance icon
Evaluation Is for Everyone addresses two problems: 1) many AI application builders don't know what they should do to ensure trust and safety, and 2) it should be as easy as possible to add trust and safety capabilities to AI applications. Many trust and safety evaluation suites are available that can be executed on the Evaluation Reference Stack. We are making it as easy as possible for AI application developers to find and deploy the evaluations they need. See also the companion Testing Generative AI Applications project. This project is part of the Trust and Safety Evaluation Initiative (TSEI).
Evaluation Reference Stack AI Alliance icon
The companion projects Testing Generative AI Applications and Evaluation Is for Everyone require a runtime stack that is flexible and easy to deploy and manage. This project is collating popular tools for writing and running evaluations into easy-to-consume packages. This project is part of the Trust and Safety Evaluation Initiative (TSEI).
unitxt
Unitxt is a Python library for enterprise-grade evaluation of AI performance, offering the world's largest catalog of tools and data for end-to-end AI benchmarking. (Principal developer: IBM Research)