Evaluation Is for Everyone
Welcome to the Evaluation Is for Everyone project, part of the AI Alliance Trust and Safety Evaluation Initiative (TSEI). Our goal is to drive widespread adoption of AI trust and safety technologies, both by educating application developers about these concepts and by making it as easy as possible to use state-of-the-art tools that support them.
Tip: The links for italicized terms go to this glossary.
Unlike traditional software systems that rely on prescribed specifications and mostly deterministic application code, AI systems built on generative AI models depend on training data to map inputs to probabilistic outputs. As a consequence, these systems are inherently non-deterministic and may even generate erroneous or undesirable output. Benchmarks are commonly used to evaluate such systems by measuring how models behave in these areas of concern.
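To make this concrete, here is a minimal sketch of what a benchmark-style evaluation does: run a model over a labeled set of prompts and aggregate a score. The `generate` function, the toy dataset, and the keyword-based scoring are illustrative assumptions, not any particular benchmark or tool.

```python
from typing import Callable

def run_benchmark(generate: Callable[[str], str], dataset: list[dict]) -> float:
    """Return the fraction of prompts whose response passes a simple check."""
    passed = 0
    for example in dataset:
        response = generate(example["prompt"])
        # Real benchmarks use more robust scoring (classifiers, model-based
        # judges, semantic similarity); keyword matching keeps this sketch simple.
        if example["expected_keyword"].lower() in response.lower():
            passed += 1
    return passed / len(dataset)

# A tiny, hypothetical dataset just to show the shape of the data.
toy_dataset = [
    {"prompt": "What is the capital of France?", "expected_keyword": "paris"},
    {"prompt": "What is 2 + 2?", "expected_keyword": "4"},
]

# score = run_benchmark(my_model_generate, toy_dataset)
```

Because the model's outputs are probabilistic, the resulting score is a statistical summary of behavior rather than a pass/fail guarantee, which is exactly why benchmark selection and scoring criteria matter so much.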
It is essential to establish a flexible evaluation framework that supports rapid updates to evaluation criteria and benchmark selection, in part because benchmark data often becomes part of the training data corpus, so models improve on existing benchmarks over time, whether or not they are deliberately trained to do so. Given the critical role of testing and evaluation in deploying AI systems with confidence, there is a pressing need for a consistent methodology and robust tool support for these activities.
Within the AI Alliance’s Trust and Safety work group, the projects under the Trust and Safety Evaluation Initiative umbrella are designed to promote best-of-breed tools for running evaluations and existing evaluation suites for common purposes, and to ensure that these tools can be adopted and adapted efficiently and effectively for evolving uses.
- Evaluation Is for Everyone (this project) has two goals:
- Educate application developers about the importance of building AI trustworthiness and safety into their AI-enabled applications from the beginning, just as we have needed to build cybersecurity into our apps for decades.
- Make it easy to find and adopt the appropriate set of evaluations for particular application requirements.
- Achieving Confidence in Enterprise AI Applications addresses the problem that AI application developers struggle to test that their applications meet the requirements and perform the use cases they were designed for. Enterprise developers are accustomed to writing repeatable tests for software that is (mostly) deterministic, but the inherent probabilistic nature of the underlying generative AI models defeats these techniques. This project is adapting evaluation techniques for these testing purposes and teaching developers how to use them (see the sketch after this list).
- Evaluation Reference Stack is documenting the industry’s best tools for running evaluations and making them easy to adopt and manage.
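To illustrate the shift described in the Achieving Confidence item above, the sketch below replaces a brittle exact-match assertion with an evaluation-style test that samples the model several times and requires a minimum pass rate against a checking predicate. The `my_model_generate` function, the prompt, and the thresholds are hypothetical; real test suites typically use more robust scoring, such as classifiers or model-based judges.

```python
import re
from typing import Callable

def passes_often_enough(generate: Callable[[str], str],
                        prompt: str,
                        check: Callable[[str], bool],
                        trials: int = 10,
                        min_pass_rate: float = 0.9) -> bool:
    """Sample the model several times; pass if enough responses satisfy `check`."""
    passes = sum(1 for _ in range(trials) if check(generate(prompt)))
    return passes / trials >= min_pass_rate

# Example check: the answer must mention a 30-day refund window,
# however the model chooses to phrase it.
def mentions_30_day_refund(response: str) -> bool:
    return re.search(r"\b30[- ]day", response, re.IGNORECASE) is not None

# Deterministic style (brittle for generative output):
#     assert my_model_generate("What is our refund policy?") == "Refunds within 30 days."
# Evaluation style:
#     assert passes_often_enough(my_model_generate,
#                                "What is our refund policy?",
#                                mentions_30_day_refund)
```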
Related projects include Ranking AI Safety Priorities by Domain and our AI Trust and Safety User Guide.
Activities in This Project
There are several work streams in this project that serve our goals.
Understand and Grow Existing Evaluation Taxonomies
Many organizations have worked on taxonomies or “suites” of evaluations, usually focused on specific areas of interest, such as categories of harmful speech. Other areas of interest remain under-served, such as common concerns in particular domains; for example, evaluating how well legal applications understand established case law and provide responses consistent with it.
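As a purely illustrative example, a taxonomy might map each area of concern to the evaluation suites that probe it. The category names and suite identifiers in this sketch are hypothetical, not drawn from any published taxonomy.

```python
# A hypothetical, machine-readable slice of an evaluation taxonomy:
# areas of concern mapped to the evaluation suites that probe them.
taxonomy = {
    "harmful_speech": {
        "description": "Hate speech, harassment, and other abusive language",
        "evaluation_suites": ["hate_speech_detection", "harassment_refusal"],
    },
    "legal_domain": {
        "description": "Consistency with established case law in legal applications",
        "evaluation_suites": ["case_law_grounding"],
    },
}

# Tooling built on the reference stack could then select suites by category:
suites_to_run = taxonomy["legal_domain"]["evaluation_suites"]
```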
Because this project aims to make it easy for developers to adopt evaluation for trust and safety, as well as for other uses, we have a long-running work stream to identify existing evaluation suites that users might adopt and to identify gaps that we can help fill.
This work stream will also explore how to make it easy to run a suite of evaluations on the evaluation reference stack.
Current activities: evaluations and taxonomy
Provide Useful Leaderboards
Leaderboards are user-friendly tools that help users find suites of useful evaluations, along with data about how well particular models perform against them.
In addition to the leaderboards we currently support, we plan to build graphical tools that allow users to browse and search for evaluations that support their needs, then download them in a packaged form that is easy to deploy in different environments using the reference stack.
Current activities: leaderboards
Educate Developers about Using Evaluations
Besides our AI Trust and Safety User Guide, which provides general guidance, this project will build examples of finding and using appropriate evaluations for particular categories of needs. The companion projects mentioned above will also build examples for their needs, but this project should provide clear adoption guidance for the most important taxonomies of trust and safety.
Current activities: documentation
Overview of This Website
The rest of this website is organized into the following sections:
- Glossary of Terms
- User Personae and Their Needs
- Evaluations and Benchmarks
- Taxonomies of Evaluations
- Leaderboards
Getting Involved
Are you interested in contributing? If so, please see the Contributing page for information on how you can get involved. See the About Us page for more details about this project and the AI Alliance.
Additional Links
- This project’s GitHub Repo
- Companion projects:
- The AI Alliance:
| Authors | The AI Alliance Trust and Safety Work Group (see About Us) |
| Last Update | V0.5.0, 2025-07-21 |