Deployment of the Reference Stack Components
We are working on easy-to-use examples for all the components in various deployment configurations. For now, here is some guidance to get started.
LM Evaluation Harness Installation
To install the LM Evaluation Harness, use the following commands, taken from the lm-evaluation-harness repo README:
git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
The README also has examples and describes other ways to run the tools and specific evaluations in different deployment scenarios.
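For a quick smoke test after installation, the harness can also be driven from Python. The following is a minimal sketch, assuming the v0.4+ API; the model and task names are illustrative, so substitute whatever you want to evaluate:

import lm_eval

# Evaluate a small Hugging Face model on one benchmark task.
# "hf" selects the transformers backend; any HF model id works.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["hellaswag"],
    batch_size=8,
)
print(results["results"])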
Unitxt Examples
Several examples using unitxt are available in the IBM Granite Community, e.g., in the Granite “Snack” Cookbook repo (see the recipes/Evaluation folder). These examples only require running Jupyter locally; all inference is done remotely by the community’s back-end services. Here are the specific Jupyter notebooks:
- Unitxt_Quick_Start.ipynb - A quick introduction to unitxt.
- Unitxt_Demo_Strategies.ipynb - Various ways to use unitxt.
- Unitxt_Granite_as_Judge.ipynb - Using unitxt to drive the LLM-as-a-judge pattern.
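If you prefer a script to a notebook, here is a minimal sketch of the unitxt flow those notebooks walk through; the card and template names are illustrative choices from the unitxt catalog, and the hard-coded predictions stand in for real model output:

from unitxt import load_dataset, evaluate

# Build a benchmark from a "card" (dataset + task) and a "template"
# (prompt format); loader_limit keeps the example small.
dataset = load_dataset(
    card="cards.wnli",
    template="templates.classification.multi_class.relation.default",
    loader_limit=20,
)
test_set = dataset["test"]

# Normally these come from your model; fixed values just show the flow.
predictions = ["entailment"] * len(test_set)

results = evaluate(predictions=predictions, data=test_set)
print(results[0]["score"]["global"])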
Using LM Evaluation Harness and Unitxt Together
Start with this Unitxt page, then look at the unitxt tasks described in the lm-evaluation-harness repo.
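Hypothetically, running a unitxt-backed benchmark through the harness then looks like any other evaluation. In this sketch the task name is a placeholder (the real names are listed in the lm-evaluation-harness repo) and the Granite model id is just an example:

import lm_eval

# Replace the placeholder with one of the unitxt task names
# defined in the lm-evaluation-harness repo.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=ibm-granite/granite-3.0-2b-instruct",
    tasks=["<unitxt-task-name>"],
)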
EvalAssist
One popular evaluation technique is LLM as a judge, which uses a smart “teacher model” to evaluate the quality of benchmark datasets and model responses, because having humans do this is expensive and not sufficiently scalable. (This is different from data synthesis with a teacher model.) EvalAssist is designed to make writing these kinds of evaluations easier, including incremental development. It uses unitxt to implement evaluations.
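Schematically, the pattern looks like the sketch below. Here call_judge_model is a hypothetical stand-in for whatever inference endpoint serves your teacher model; in practice, EvalAssist generates unitxt evaluators rather than raw prompts like this:

# Schematic sketch of the LLM-as-a-judge pattern.
JUDGE_PROMPT = """You are grading a model response.

Question: {question}
Response: {response}

Rate the response from 1 (poor) to 5 (excellent) for factual accuracy.
Reply with only the number."""

def call_judge_model(prompt: str) -> str:
    # Hypothetical stand-in: replace with a real call to your teacher model.
    raise NotImplementedError

def judge(question: str, response: str) -> int:
    reply = call_judge_model(JUDGE_PROMPT.format(question=question, response=response))
    return int(reply.strip())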
Observability with Arize Phoenix
Arize Phoenix (GitHub) is an open-source LLM tracing and evaluation platform designed to support evaluating, experimenting with, and optimizing AI applications.
See the Phoenix home page for details on installing and using it. We are working on an example deployment that demonstrates its integration with the rest of the reference stack discussed above.
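As a starting point, a local Phoenix instance can be launched from Python. This minimal sketch assumes pip install arize-phoenix and shows only the UI launch, not application instrumentation (see the Phoenix docs for that):

import phoenix as px

# Start the Phoenix UI in-process; traces sent via OpenTelemetry
# instrumentation will appear here.
session = px.launch_app()
print(session.url)  # open this URL in a browser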
