
Evaluation Platform Reference Stack

This section describes the reference stack that can be used to run evaluators and the benchmarks aggregated from them.

It is important to note the separation between the stack itself, which is agnostic about the particular evaluations of interest, and the “plug-in” evaluators. A set of evaluators in a given stack deployment may constitute a defined benchmark for particular objectives.

The evaluation platform is under development, driven by the shared needs of its users. A recurring theme in those needs is the ability to run the platform both for public, collaborative tasks and leaderboards, and for private deployments that evaluate proprietary models and systems. Both offline evaluation, such as for leaderboards and research investigations, and online inference should be able to use the same stack, with appropriate scaling and hardening of the deployments as required.

Architecture

Schematically, a trust and safety deployment using the reference stack with example evaluators is shown in Figure 1:

Reference Stack schematic diagram

Figure 1: Schematic architecture of a deployment.

Note that some evaluators will not use unitxt and some will not run on lm-evaluation-harness. This is the practical reality of the technology today. However, we hope that the reference stack will prove so compelling and productive to use that it will be widely adopted by teams doing evaluation R&D.

Execution Framework

The execution framework provides mechanisms to run evaluations and benchmarks in a consistent manner; to aggregate results, compute metrics, and report them; and to handle logging and error recovery.

The open-source software (OSS) components in the stack include the following projects, which have emerged as de facto standard tools for evaluation.

  • EleutherAI’s LM Evaluation Harness (GitHub repo), a widely used, efficient platform for inference-time (i.e., runtime) evaluation and for leaderboards.
  • IBM’s Unitxt library, the framework for individual evaluators. A key benefit is that evaluators can be declaratively defined and executed without running third-party, untrusted code, which supports several of the user needs involving open collaboration in a pragmatic way.
  • Arize Phoenix (GitHub) and similar tools for observability and metrics collection. The reference stack needs to provide the desired information, but it also needs to be agnostic about the specific tools used, as different environments will have different standard tools in place already.

We are working on easy-to-use examples for all of these components, discussed below.

Leaderboard Deployments

While the stack supports private, on-premises deployments for proprietary evaluation requirements, it will also be used to implement public leaderboards hosted in the AI Alliance Hugging Face Community.

In progress.

LM Evaluation Harness Installation

To install LM Evaluation Harness, use the following commands from the lm-evaluation-harness repo README:

git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .

The README has examples and other ways to run evaluations in different deployment scenarios.
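
As a quick smoke test after installation, an evaluation can also be driven from Python. The following is a minimal sketch, assuming the harness’s lm_eval.simple_evaluate API; the model and task names are placeholders to adapt to your environment:

# Minimal sketch: run a small task against a Hugging Face model as a smoke test.
# Assumes lm_eval's Python API (simple_evaluate); the model and task names are
# placeholders -- substitute ones appropriate for your deployment.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face backend
    model_args="pretrained=EleutherAI/pythia-160m",  # placeholder (small) model
    tasks=["hellaswag"],                             # placeholder task
    num_fewshot=0,
    batch_size=8,
    limit=10,                                        # only a few examples for the smoke test
)

# Per-task metrics are reported under the "results" key.
print(results["results"])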

Unitxt Examples

Several examples using unitxt are available in the IBM Granite Community, in the Granite “Snack” Cookbook repo, under the recipes/Evaluation folder. These examples only require running Jupyter locally, because all inference is done remotely by the community’s back-end services.
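
For orientation, the sketch below shows the shape of a direct unitxt evaluation. It assumes unitxt’s Python API (load_dataset and evaluate); the card and template names are illustrative catalog entries, and the predictions are a trivial placeholder for the system under evaluation:

# Minimal sketch of declarative evaluation with unitxt (API and catalog names
# assumed as described above; adapt to your unitxt version and task).
from unitxt import load_dataset, evaluate

# Declaratively load a small slice of a catalog task.
dataset = load_dataset(
    card="cards.wnli",
    template="templates.classification.multi_class.relation.default",
    loader_limit=20,
    split="test",
)

# Placeholder predictions; in practice these come from the model or system under test.
predictions = ["entailment"] * len(dataset)

# Score the predictions with the metrics the task defines.
results = evaluate(predictions=predictions, data=dataset)

# The exact results layout varies across unitxt versions; the global scores are
# typically attached to each scored instance.
print(results[0]["score"]["global"])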

Using LM Evaluation Harness and Unitxt Together

Start on this Unitxt page. Then look at the unitxt tasks in the lm-evaluation-harness repo.
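
As a rough sketch of how the two fit together: the harness ships task definitions that delegate to unitxt recipes, so a unitxt-backed task is selected like any other harness task. The snippet below assumes lm_eval’s TaskManager and simple_evaluate APIs; the substring filter is only a heuristic for discovering unitxt-backed task names in your installation:

# Sketch: discover unitxt-backed tasks registered with lm-evaluation-harness
# and run one of them. TaskManager and simple_evaluate are assumed as described
# above; the name filter is a heuristic, and the model is a placeholder.
import lm_eval
from lm_eval.tasks import TaskManager

task_manager = TaskManager()
unitxt_tasks = [name for name in task_manager.all_tasks if "unitxt" in name.lower()]
print(unitxt_tasks[:20])  # inspect what is available

# Run one discovered task exactly like any other harness task.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",  # placeholder model
    tasks=unitxt_tasks[:1],
    limit=10,                                        # smoke test only
)
print(results["results"])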

Easy-to-use examples are under development for publication here.

Observability with Arize Phoenix

Arize Phoenix (GitHub) is an open-source LLM tracing and evaluation platform designed to provide seamless support for evaluating, experimenting with, and optimizing AI applications.

See the home page for details on installing and using Phoenix. We are working on an example deployment that demonstrates the integration with the rest of the reference stack discussed above.
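
As a starting point, the sketch below launches a local Phoenix instance and registers an OpenTelemetry tracer provider so that instrumented application calls appear in its UI. It assumes the arize-phoenix package and its phoenix.otel.register helper; the project name is a placeholder and application instrumentation is left to your inference client:

# Sketch: run Phoenix locally and point OpenTelemetry tracing at it
# (package and helper names assumed as described above).
import phoenix as px
from phoenix.otel import register

# Start a local Phoenix instance; the session exposes the UI URL.
session = px.launch_app()
print(f"Phoenix UI: {session.url}")

# Register a tracer provider that exports spans to the local Phoenix collector.
tracer_provider = register(project_name="reference-stack-demo")  # placeholder project name

# From here, instrument the evaluation or inference code (for example with an
# OpenInference instrumentor) so traces and metrics show up in the Phoenix UI.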

