Evaluators and Benchmarks
This section describes the evaluators that implement the evaluations identified in the taxonomy. Evaluations include some combination of code and data.
Benchmarks that aggregate evaluators for larger goals, e.g., domain-specific scenarios, are also cataloged here.
For now, see the following resources, which overlap with each other.
- unitxt catalog: a set of evaluators implemented using unitxt.
- lm-evaluation-harness tasks: a set of evaluators implemented directly on lm-evaluation-harness, including examples that use unitxt, too (see the sketch after this list).
- Llama Guard: Meta’s system for safeguarding human-AI conversations.
- Granite Guardian: IBM’s risk detection models for enterprise use cases.
- MLCommons AILuminate: The MLCommons benchmark that assesses the safety of text-to-text interactions with a general purpose AI chat model in the English language.
- The AI Alliance Open Trusted Data Initiative catalogs open-access datasets, including many used for benchmarks and evaluations.
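As a minimal, hedged illustration of how evaluators from these catalogs are typically invoked, the sketch below runs a single lm-evaluation-harness task from Python. It assumes a recent lm-eval release (pip install lm-eval); the model and task names are placeholders, not items from our catalog.

```python
# A minimal sketch: run one lm-evaluation-harness task from Python.
# Assumes `pip install lm-eval`; the model and task names are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                     # Hugging Face backend
    model_args="pretrained=EleutherAI/pythia-160m", # any HF causal LM
    tasks=["hellaswag"],                            # any task in the harness catalog
    num_fewshot=0,
    batch_size=8,
)
print(results["results"]["hellaswag"])              # per-task metrics, e.g., accuracy
```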
Evaluators and Benchmarks to Explore
A list of possible candidates to incorporate in our catalog.
More Coming Soon
Help Wanted: Do you have datasets, benchmarks, or other evaluators that you believe should be included? See our Contributing page!
NeurIPS 2024 Datasets and Benchmarks
The NeurIPS 2024 Datasets and Benchmarks track includes many recently created datasets of interest for evaluation.
do-not-answer
Developed by the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), do-not-answer is an open-source dataset for evaluating LLMs’ safety mechanisms at low cost. The dataset is curated and filtered to consist only of prompts that responsible language models do not answer. Besides human annotations, do-not-answer also implements model-based evaluation, where a fine-tuned 600M BERT-like evaluator achieves results comparable to human and GPT-4 evaluation.
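As a hedged sketch of how one might load and inspect the dataset, the snippet below uses the Hugging Face datasets library; the dataset ID ("LibrAI/do-not-answer") and the column name ("question") are assumptions to verify against the project’s documentation.

```python
# A minimal sketch: load and inspect the do-not-answer prompts.
# The dataset ID and column name below are assumptions, not confirmed by this catalog.
from datasets import load_dataset

ds = load_dataset("LibrAI/do-not-answer", split="train")
print(ds.column_names)      # risk-area / harm-type annotations plus the prompt text
print(ds[0]["question"])    # one prompt a responsible model should decline to answer
```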
Human-Centric Face Representations
Human-Centric Face Representations is a collaboration between Sony AI and the University of Tokyo to generate a dataset of 638,180 human judgments on face similarity. Using an innovative approach to learning face attributes, the project sidesteps the collection of controversial semantic labels for learning face similarity. The dataset and modeling approach also enable a comprehensive examination of annotator bias and its influence on AI model creation.
Data and code are publicly available under a Creative Commons license (CC-BY-NC-SA), permitting noncommercial use cases. See the GitHub repo.
Social Stigma Q&A
Social Stigma Q&A is a dataset from IBM Research. From the arXiv paper abstract:
Current datasets for unwanted social bias auditing are limited to studying protected demographic features such as race and gender. In this work, we introduce a comprehensive benchmark that is meant to capture the amplification of social bias, via stigmas, in generative language models. Taking inspiration from social science research, we start with a documented list of 93 US-centric stigmas and curate a question-answering (QA) dataset which involves simple social situations. Our benchmark, SocialStigmaQA, contains roughly 10K prompts, with a variety of prompt styles, carefully constructed to systematically test for both social bias and model robustness. We present results for SocialStigmaQA with two open source generative language models and we find that the proportion of socially biased output ranges from 45% to 59% across a variety of decoding strategies and prompting styles. We demonstrate that the deliberate design of the templates in our benchmark (e.g., adding biasing text to the prompt or using different verbs that change the answer that indicates bias) impacts the model tendencies to generate socially biased output. Additionally, through manual evaluation, we discover problematic patterns in the generated chain-of-thought output that range from subtle bias to lack of reasoning.
For more information, see arXiv:2312.07492.
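As a hedged sketch of the kind of aggregate metric the abstract reports (the proportion of socially biased output), the snippet below scores a list of model answers against the answers that indicate bias. The scoring rule and toy data are illustrative assumptions, not IBM’s evaluation code; consult the paper and dataset card for the actual templates and labels.

```python
# A minimal sketch (not IBM's evaluation code): fraction of model outputs that match
# the answer indicating social bias. The matching rule and toy data are assumptions.
from typing import Sequence

def biased_output_rate(model_answers: Sequence[str],
                       bias_indicating_answers: Sequence[str]) -> float:
    """Fraction of answers equal to the bias-indicating answer (case-insensitive)."""
    assert len(model_answers) == len(bias_indicating_answers)
    hits = sum(
        ans.strip().lower() == biased.strip().lower()
        for ans, biased in zip(model_answers, bias_indicating_answers)
    )
    return hits / len(model_answers)

# Toy usage: two of three answers match the bias-indicating answer -> 0.667
print(biased_output_rate(["Yes", "No", "Yes"], ["yes", "yes", "yes"]))
```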
Kepler
Kepler (paper) measures resource utilization for sustainable computing purposes. From the repo:
Kepler (Kubernetes-based Efficient Power Level Exporter) uses eBPF to probe performance counters and other system stats, uses ML models to estimate workload energy consumption based on these stats, and exports them as Prometheus metrics.
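As a hedged sketch of how Kepler’s exported metrics might be consumed, e.g., to attribute energy use to an evaluation workload, the snippet below queries a Prometheus server over its HTTP API. The server URL and the metric name (kepler_container_joules_total) are assumptions to check against your deployment and the Kepler documentation.

```python
# A minimal sketch: read Kepler's energy metrics from a Prometheus server.
# The URL and metric name are assumptions; adjust for your cluster and Kepler version.
import requests

PROMETHEUS_URL = "http://localhost:9090"  # placeholder

resp = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query",
    params={"query": "sum by (container_namespace) (kepler_container_joules_total)"},
    timeout=10,
)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    labels, (_ts, value) = result["metric"], result["value"]
    print(f"{labels.get('container_namespace', 'unknown')}: {value} J")
```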