Evaluation Platform Reference Stack
This section describes the reference stack that can be used to run the evaluators and benchmarks aggregated from them.
It is important to note the separation between the stack itself, which is agnostic about the particular evaluations of interest, and the “plug-in” evaluators. A set of evaluators in a given stack deployment may constitute a defined benchmark for particular objectives.
The evaluation platform is based on the shared needs of all users. A common theme is the need to run the platform both for public collaborative tasks and leaderboards and for private deployments that evaluate proprietary models and systems. The same stack must also support offline evaluation, such as for leaderboards and research investigations, and online inference, with appropriate scaling and hardening of the deployments as required.
There is no industry-standard evaluation stack, but several tools have achieved wide adoption, such as EleutherAI’s lm-evaluation-harness and IBM’s unitxt. Evaluations can be implemented using the lm-evaluation-harness or unitxt API. Evaluations implemented for unitxt can be executed on top of lm-evaluation-harness or separately by unitxt.
IBM’s EvalAssist is a relatively new tool that makes writing certain kinds of unitxt-based evaluations easier, as discussed below. IBM’s Risk Atlas Nexus and SafetyBAT Leaderboard provide accessible tools for viewing how different models perform against user-specified criteria. They are not discussed further here; see Leaderboards.
Infosys’ Responsible AI Toolkit is a suite of tools for various evaluation purposes.
Many other evaluation suites are written using less well-known or “home-grown” tools. Hence, today’s AI engineer may need to support a heterogeneous runtime environment to run all the required evaluations, although hopefully the industry will mature and consolidate on a standard suite of tools soon.
Architecture
Schematically, an evaluation deployment using the reference stack with example evaluators is shown in Figure 1:
Evaluations can be written and deployed using any combination of EvalAssist, unitxt, lm-evaluation-harness, or other tools. Runtime support is provided by one or more of Llama Stack, Infosys Responsible AI Toolkit, lm-evaluation-harness, or other tools.
Not shown are other production support tools like those for observability, security, horizontal scaling, etc. Tools like Arize Phoenix provide AI-centric observability and metrics collection (discussed below). Some deployments will use Kubernetes.
Execution Framework
The execution framework provides mechanisms to run evaluations and benchmarks in a consistent manner; to aggregate results, compute metrics, and report them; and to support logging and error recovery.
The open-source software (OSS) components in the reference stack include the following projects:
- EleutherAI’s LM Evaluation Harness (a.k.a. lm-evaluation-harness; GitHub repo), a widely used, efficient evaluation platform for inference-time (i.e., runtime) evaluation and for leaderboards.
- IBM’s Unitxt library, a framework for individual evaluators. An interesting benefit is that evaluators can be declaratively defined and executed without the need to write and run third-party, possibly-untrusted code. This model supports several of the user needs involving open collaboration in a pragmatic way. (A minimal usage sketch follows this list.)
- IBM’s EvalAssist is a relatively new tool that makes writing unitxt-based evaluations easier. Specifically, EvalAssist is an application that simplifies using LLMs as evaluators (LLM-as-a-Judge) of the output of other LLMs by supporting users in iteratively refining evaluation criteria in a web-based user experience, with other features designed for the incremental process of building evaluations.
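The following sketch illustrates unitxt’s declarative flow. It is a minimal example, not the library’s definitive API: it assumes the documented top-level functions unitxt.load_dataset and unitxt.evaluate, the card and template names are illustrative, my_model_generate is a hypothetical placeholder for your own inference call, and field and parameter names may vary across unitxt versions, so check the unitxt documentation.

from unitxt import load_dataset, evaluate

def my_model_generate(prompt: str) -> str:
    # Hypothetical placeholder: replace with a real call to your model
    # (an HF pipeline, an OpenAI-compatible endpoint, etc.).
    return "entailment"

# A "card" plus a "template" declaratively defines the task and its metrics;
# no custom evaluator code needs to be written or executed.
dataset = load_dataset(
    card="cards.wnli",
    template="templates.classification.multi_class.relation.default",
    split="test",
)

# Run inference on the formatted inputs (assumed to be in the "source" field).
predictions = [my_model_generate(instance["source"]) for instance in dataset]

# Score the predictions with the metrics declared by the card.
results = evaluate(predictions=predictions, data=dataset)
print(results)  # inspect global and per-instance scores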
The evaluation tools can be run on platforms that provide broader services for AI applications:
- Arize’s Phoenix (GitHub) is an AI application-centric tool for observability and metrics collection. Real deployments of the reference stack need to provide these capabilities, but the stack needs to be agnostic about the specific tools used, as different deployments will use different tools.
- Infosys’ Responsible AI Toolkit can be used to deploy evaluation tools for various evaluation purposes.
- Meta’s Llama Stack is a full-featured stack that provides built-in integrations for evaluation tools, agents, etc.
We are working on easy-to-use examples for all these components. For now, here is some information you can use.
Leaderboard Deployments
Components of the stack are used to implement the leaderboards described in Leaderboards, which are hosted in the AI Alliance Hugging Face Community or member communities.
LM Evaluation Harness Installation
To install LM Evaluation Harness, use the following commands, from the lm-evaluation-harness repo README:
git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
The README has examples and other ways to run evaluations in different deployment scenarios.
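As one example of what the README covers, the harness can also be driven from Python. The sketch below assumes a recent release that exposes lm_eval.simple_evaluate; the model and task names are illustrative, and the exact keyword arguments may differ slightly between versions.

import lm_eval

# "hf" selects the Hugging Face transformers backend; model_args is the
# usual comma-separated string of backend options.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["hellaswag"],
    num_fewshot=0,
    batch_size=8,
)

# Per-task metrics (e.g., acc, acc_norm) are reported under "results".
print(results["results"]["hellaswag"])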
Unitxt Examples
Several examples using unitxt are available in the IBM Granite Community, in the Granite “Snack” Cookbook repo, under the recipes/Evaluation folder. These examples only require running Jupyter locally, because all inference is done remotely by the community’s back-end services:
- Unitxt_Quick_Start.ipynb - A quick introduction to unitxt.
- Unitxt_Demo_Strategies.ipynb - Various ways to use unitxt.
- Unitxt_Granite_as_Judge.ipynb - Using unitxt to drive the LLM as a judge pattern.
Using LM Evaluation Harness and Unitxt Together
Start on this Unitxt page. Then look at the unitxt tasks in the lm-evaluation-harness repo.
Easy to use examples are under development for publication here.
EvalAssist
One popular evaluation technique is LLM as a judge, which uses a smart “teacher model” to evaluate the quality of benchmark datasets and model responses, because having humans do this is expensive and not sufficiently scalable. (This is different from data synthesis with a teacher model.) EvalAssist is designed to make writing these kinds of evaluations easier, including incremental development. It uses unitxt to implement evaluations.
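To illustrate the pattern itself (not EvalAssist’s or unitxt’s actual APIs), here is a generic, hypothetical sketch of LLM-as-a-judge scoring, where generate is any callable that sends a prompt to the judge (“teacher”) model and returns its reply:

from typing import Callable

JUDGE_PROMPT = """You are grading a model's answer against the criteria below.
Question: {question}
Answer: {answer}
Criteria: {criteria}
Reply with a score from 1 (poor) to 5 (excellent) and a one-sentence rationale."""

def judge(generate: Callable[[str], str], question: str, answer: str, criteria: str) -> str:
    # Ask the judge model to grade one answer; returns its raw reply,
    # which the caller can parse into a score and rationale.
    prompt = JUDGE_PROMPT.format(question=question, answer=answer, criteria=criteria)
    return generate(prompt)

# Usage (hypothetical): judgments = [judge(my_judge_client, q, a,
#     "Is the answer factually correct and concise?")
#     for q, a in benchmark_responses]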
Observability with Arize Phoenix
Arize Phoenix (GitHub) is an open-source LLM tracing and evaluation platform designed to provide seamless support for evaluating, experimenting with, and optimizing AI applications.
See the home page for details on installing and using Phoenix. We are working on an example deployment that demonstrates the integration with the rest of the reference stack discussed above.
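In the meantime, here is a minimal local sketch, assuming the arize-phoenix Python package (pip install arize-phoenix) and its documented launch_app entry point; production deployments typically run Phoenix as a standalone service instead.

import phoenix as px

# Launch the Phoenix app (UI plus trace collector) in the current process.
session = px.launch_app()
print(session.url)  # open this URL to browse traces, spans, and evaluations

# Application code is then instrumented with OpenTelemetry/OpenInference
# instrumentors (for example, for LangChain, LlamaIndex, or OpenAI clients),
# which export traces to the Phoenix collector; see the Phoenix docs for the
# instrumentor that matches your stack.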