
SafetyBAT Leaderboard

The SafetyBAT Leaderboard on the AI Alliance Hugging Face Community uses BenchBench to rate benchmarks according to their agreement with the defined Aggregate Benchmark, a single reference that combines many of the available benchmarks.

BenchBench is a useful tool for users with the following needs:

You have a new benchmark and you want to see if it agrees or disagrees with other known benchmarks.
You are looking for a benchmark to run so that you can trust a system or model you want to use. BenchBench helps you find efficient alternatives that provide acceptable coverage and also meet other needs, such as the ability to run the benchmark privately or with less overhead.

The leaderboard reports agreement as the BenchBench Score, the relative agreement (Z-score) of each benchmark with the Aggregate Benchmark.
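To make the idea of a relative agreement score concrete, here is a minimal sketch in plain Python. It is not the BenchBench implementation. It assumes the aggregate is the mean of min-max normalized scores across benchmarks, that agreement is the Kendall tau rank correlation between a benchmark and that aggregate, and that the Z-score is taken across the benchmarks being compared. The function names and the example scores are hypothetical.

```python
from itertools import combinations
from statistics import mean, pstdev


def kendall_tau(x, y):
    """Kendall tau-a rank correlation between two equal-length score lists."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def min_max(scores):
    """Rescale scores to [0, 1] so benchmarks with different units can be averaged."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.5 for _ in scores]  # constant benchmark: no ranking information
    return [(s - lo) / (hi - lo) for s in scores]


def agreement_z_scores(benchmarks):
    """Map each benchmark name to the Z-score of its agreement with the aggregate.

    `benchmarks` maps a name to a list of per-model scores (same model order).
    Assumption: the aggregate is the per-model mean of normalized scores and
    includes the benchmark being scored, which is a simplification.
    """
    normalized = {name: min_max(s) for name, s in benchmarks.items()}
    n_models = len(next(iter(normalized.values())))
    aggregate = [mean(s[i] for s in normalized.values()) for i in range(n_models)]
    agreement = {name: kendall_tau(s, aggregate) for name, s in normalized.items()}
    mu = mean(agreement.values())
    sigma = pstdev(agreement.values())
    if sigma == 0:
        return {name: 0.0 for name in agreement}
    return {name: (a - mu) / sigma for name, a in agreement.items()}


# Hypothetical scores for four models on three made-up benchmarks.
example = {
    "bench_a": [0.9, 0.7, 0.5, 0.3],
    "bench_b": [0.8, 0.85, 0.4, 0.2],
    "bench_c": [0.2, 0.4, 0.6, 0.9],  # ranks the models roughly in reverse
}
z_scores = agreement_z_scores(example)
for name, z in sorted(z_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {z:+.2f}")
```

In this toy data, bench_c orders the models almost opposite to the other two, so it ends up with the lowest Z-score. A positive Z-score means a benchmark agrees with the aggregate more than the average benchmark in the comparison set does.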

Read more about BenchBench in the paper Benchmark Agreement Testing Done Right and the BenchBench repo.

Using SafetyBAT for Your Own Evaluations

If you are interested in cloning the source code for your own use or contributing to this leaderboard, see this README.