Leaderboards

This section describes the leaderboards and related tools that are maintained by this initiative or separately by other AI Alliance members.

The leaderboards report the results of running the evaluators' benchmark suites against various models and against AI systems built on those models.

The other tools help software engineers identify the risks that matter for their use cases and find the evaluators and benchmarks that test for those risks.

Plans for Leaderboards and Other Tools

Planned leaderboards will include leading open-source models, serving both as evaluation targets and as evaluation judges. Initially, we are focusing on Meta’s Llama and IBM’s Granite families of models, with others to follow.

As we fill in the evaluation taxonomy, we will add the corresponding evaluators and benchmarks to the leaderboards, along with search capabilities for finding topics of interest.

Finally, we plan to provide downloadable, deployable configurations of the Evaluation Platform Reference Stack, preconfigured with selected evaluators for quick and easy use.

The child pages listed next describe the leaderboards and other tools that are currently available.