Evaluation Platform Reference Stack
This section describes the reference stack used to run evaluators and the benchmarks aggregated from them.
Note the separation between the stack itself, which is agnostic about the particular evaluations of interest, and the “plug-in” evaluators. A set of evaluators in a given stack deployment may constitute a defined benchmark for particular objectives.
The evaluation platform is under development, based on the shared needs of its users. A recurring theme in those needs is the ability to run the platform both for public, collaborative tasks and leaderboards, and in private deployments for evaluating proprietary models and systems. Both offline evaluation, such as for leaderboards and research investigations, and online inference should be able to use the same stack, with appropriate scaling and hardening of the deployments as required.
Architecture
Schematically, a trust and safety deployment using the reference stack with example evaluators is shown in Figure 1:
Figure 1: Schematic architecture of a deployment.
Note that some evaluators will not use unitxt, and some will not run on lm-evaluation-harness. This is the practical reality of the technology today. However, we hope the reference stack will prove so compelling and productive to use that it will be widely adopted by teams doing evaluation R&D.
Execution Framework
The execution framework provides mechanisms to run evaluations and benchmarks in a consistent manner; to aggregate results, compute metrics, and report them; and to handle logging and error recovery.
The open-source software (OSS) components in the stack include the following projects, which have emerged as de facto standard tools for evaluation:
- EleutherAI’s LM Evaluation Harness (GitHub repo), a widely used, efficient evaluation platform for inference-time (i.e., runtime) evaluation and for leaderboards.
- IBM’s Unitxt library, the framework for individual evaluators. A key benefit is that evaluators can be declaratively defined and executed without running third-party, untrusted code, which supports several of the user needs involving open collaboration in a pragmatic way.
- Arize Phoenix (GitHub) and similar tools for observability and metrics collection. The reference stack needs to provide the desired information, but it also needs to be agnostic about the specific tools used, as different environments will have different standard tools in place already.
We are working on easy-to-use examples for all of these components, discussed below.
Leaderboard Deployments
In addition to supporting private, on-premises deployments for proprietary evaluation requirements, the stack will be used to implement public leaderboards hosted in the AI Alliance Hugging Face Community.
In progress.
LM Evaluation Harness Installation
To install LM Evaluation Harness, use the following commands from the lm-evaluation-harness repo README:
git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
The README provides examples and describes other ways to run evaluations in different deployment scenarios.
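For example, the following invocation, adapted from the README, evaluates a small Hugging Face model on a single task as a quick smoke test. The model and task names here are illustrative placeholders; substitute your own:

lm_eval --model hf \
    --model_args pretrained=EleutherAI/pythia-160m \
    --tasks hellaswag \
    --device cuda:0 \
    --batch_size 8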
Unitxt Examples
Several examples using unitxt are available in the IBM Granite Community, in the Granite “Snack” Cookbook repo, under the recipes/Evaluation folder. These examples only require running Jupyter locally, because all inference is done remotely by the community’s back-end services:
- Unitxt_Quick_Start.ipynb - A quick introduction to unitxt.
- Unitxt_Demo_Strategies.ipynb - Various ways to use unitxt.
- Unitxt_Granite_as_Judge.ipynb - Using unitxt to drive the LLM-as-a-judge pattern.
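If you prefer a plain Python script to a notebook, the core pattern in these notebooks looks roughly like the following sketch. This is a minimal, hedged example assuming the unitxt package is installed; the card and template names are illustrative entries from the unitxt catalog, the dummy predictions stand in for real model output, and exact API details may differ between unitxt versions (consult the notebooks above for current usage):

from unitxt import evaluate, load_dataset

# Declaratively specify the evaluation: the card selects the dataset and task,
# the template controls how each example is rendered into a prompt.
dataset = load_dataset(
    card="cards.wnli",
    template="templates.classification.multi_class.relation.default",
)
test_data = dataset["test"]

# Dummy predictions; in practice these come from your model or inference service.
predictions = ["entailment" for _ in test_data]

# Compute the metrics defined by the card and print the aggregated scores.
results = evaluate(predictions=predictions, data=test_data)
print(results[0]["score"]["global"])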
Using LM Evaluation Harness and Unitxt Together
Start on this Unitxt page. Then look at the unitxt tasks in the lm-evaluation-harness repo.
Easy-to-use examples are under development for publication here.
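In the meantime, you can check which unitxt-backed tasks are bundled with your installation of the harness; the exact task names vary by harness version:

lm_eval --tasks list | grep -i unitxt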
Observability with Arize Phoenix
Arize Phoenix (GitHub) is an open-source LLM tracing and evaluation platform designed to provide seamless support for evaluating, experimenting with, and optimizing AI applications.
See the home page for details on installing and using Phoenix. We are working on an example deployment that demonstrates the integration with the rest of the reference stack discussed above.
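As a minimal sketch of what that integration might look like, the following launches a local Phoenix instance from Python so that traces emitted by an instrumented application appear in the Phoenix UI. It assumes the arize-phoenix package is installed; how you instrument your particular inference stack (for example, with an OpenInference instrumentor) is covered in the Phoenix documentation:

import phoenix as px

# Start a local Phoenix server and UI from within this process.
session = px.launch_app()

# Open this URL in a browser to inspect traces, spans, and evaluation results.
print(f"Phoenix UI available at: {session.url}")

# From here, instrument your application so that its traces are exported to
# Phoenix, then run your evaluations as usual.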