Deployment of the Reference Stack Components
We are working on easy-to-use examples for all the components in various deployment configurations. For now, here is some guidance to get started.
LM Evaluation Harness Installation
To install the LM Evaluation Harness, use the following commands, taken from the lm-evaluation-harness repo README:
git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
The README also has examples and describes other ways to run the tools and specific evaluations in different deployment scenarios.
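For a quick smoke test after installation, the harness can also be driven from Python. The following is a minimal sketch, assuming the v0.4+ API; the model and task names are illustrative, so substitute whatever you want to evaluate:

import lm_eval

# Evaluate a small Hugging Face model on one benchmark task.
# "hf" selects the transformers backend; any HF model id works.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["hellaswag"],
    batch_size=8,
)
print(results["results"])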
Unitxt Examples
Several examples using unitxt are available in the IBM Granite Community, e.g., in the Granite “Snack” Cookbook repo (see the recipes/Evaluation folder). These examples only require running Jupyter locally; all inference is done remotely by the community’s back-end services. Here are the specific Jupyter notebooks:
- Unitxt_Quick_Start.ipynb - A quick introduction to unitxt.
- Unitxt_Demo_Strategies.ipynb - Various ways to use unitxt.
- Unitxt_Granite_as_Judge.ipynb - Using unitxt to drive the LLM-as-a-judge pattern.
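If you prefer a script to a notebook, here is a minimal sketch of the unitxt flow those notebooks walk through; the card and template names are illustrative choices from the unitxt catalog, and the hard-coded predictions stand in for real model output:

from unitxt import load_dataset, evaluate

# Build a benchmark from a "card" (dataset + task) and a "template"
# (prompt format); loader_limit keeps the example small.
dataset = load_dataset(
    card="cards.wnli",
    template="templates.classification.multi_class.relation.default",
    loader_limit=20,
)
test_set = dataset["test"]

# Normally these come from your model; fixed values just show the flow.
predictions = ["entailment"] * len(test_set)

results = evaluate(predictions=predictions, data=test_set)
print(results[0]["score"]["global"])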
Using LM Evaluation Harness and Unitxt Together
Start with this Unitxt page, then look at the unitxt tasks described in the lm-evaluation-harness repo.
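Hypothetically, running a unitxt-backed benchmark through the harness then looks like any other evaluation. In this sketch the task name is a placeholder (the real names are listed in the lm-evaluation-harness repo) and the Granite model id is just an example:

import lm_eval

# Replace the placeholder with one of the unitxt task names
# defined in the lm-evaluation-harness repo.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=ibm-granite/granite-3.0-2b-instruct",
    tasks=["<unitxt-task-name>"],
)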
EvalAssist
One popular evaluation technique is LLM as a judge, which uses a smart “teacher model” to evaluate the quality of benchmark datasets and model responses, because having humans do this is expensive and not sufficiently scalable. (This is different from data synthesis with a teacher model.) EvalAssist is designed to make writing these kinds of evaluations easier, including incremental development. It uses unitxt to implement evaluations.
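Schematically, the pattern looks like the sketch below. Here call_judge_model is a hypothetical stand-in for whatever inference endpoint serves your teacher model; in practice, EvalAssist generates unitxt evaluators rather than raw prompts like this:

# Schematic sketch of the LLM-as-a-judge pattern.
JUDGE_PROMPT = """You are grading a model response.

Question: {question}
Response: {response}

Rate the response from 1 (poor) to 5 (excellent) for factual accuracy.
Reply with only the number."""

def call_judge_model(prompt: str) -> str:
    # Hypothetical stand-in: replace with a real call to your teacher model.
    raise NotImplementedError

def judge(question: str, response: str) -> int:
    reply = call_judge_model(JUDGE_PROMPT.format(question=question, response=response))
    return int(reply.strip())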
Observability with Arize Phoenix
Arize Phoenix (GitHub) is an open-source LLM tracing and evaluation platform designed to support evaluating, experimenting with, and optimizing AI applications.
See the Phoenix home page for details on installing and using it. We are working on an example deployment that demonstrates its integration with the rest of the reference stack discussed above.
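As a starting point, a local Phoenix instance can be launched from Python. This minimal sketch assumes pip install arize-phoenix and shows only the UI launch, not application instrumentation (see the Phoenix docs for that):

import phoenix as px

# Start the Phoenix UI in-process; traces sent via OpenTelemetry
# instrumentation will appear here.
session = px.launch_app()
print(session.url)  # open this URL in a browser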
