Deployment of the Reference Stack Components
We are working on easy-to-use examples for all the components in various deployment configurations. For now, here is some guidance for getting started.
LM Evaluation Harness Installation
To install LM Evaluation Harness, use the following commands, taken from the lm-evaluation-harness repo README:
git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
The README has examples and other ways to run the tools and specific evaluations in different deployment scenarios.
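As a rough sketch of what comes after installation, the following Python snippet checks whether the harness is importable and assembles a command line for a small evaluation run. It assumes the package installs a module named lm_eval and a CLI of the same name that accepts the --model, --model_args, --tasks, and --batch_size flags. The model and task names are placeholders chosen for illustration, and the helper functions are hypothetical, not part of the harness. Check the README for the options supported by your installed version.

```python
from __future__ import annotations

import importlib.util


def harness_installed() -> bool:
    """Return True if a module named lm_eval can be found in this environment."""
    return importlib.util.find_spec("lm_eval") is not None


def build_lm_eval_command(model: str, tasks: list[str], batch_size: int = 8) -> list[str]:
    """Assemble an argument list for the lm_eval CLI (hypothetical helper).

    Assumes the Hugging Face backend ("hf"); other backends take
    different --model and --model_args values.
    """
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model}",
        "--tasks", ",".join(tasks),
        "--batch_size", str(batch_size),
    ]


# Placeholder model and task names, used only for illustration.
command = build_lm_eval_command("EleutherAI/pythia-160m", ["hellaswag"])
print("lm_eval importable:", harness_installed())
print("Command to run:", " ".join(command))
```

The argument list can be passed to subprocess.run once the installation check succeeds. Building it as a list rather than a single string avoids shell quoting problems when model arguments contain commas or equals signs.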
Unitxt Examples
Several examples using unitxt are available in the IBM Granite Community, e.g., in the Granite “Snack” Cookbook repo. See the recipes/Evaluation folder. These examples only require running Jupyter locally, because all inference is done remotely by the community’s back-end services. Here are the specific Jupyter notebooks:
Unitxt_Quick_Start.ipynb - A quick introduction to unitxt.
Unitxt_Demo_Strategies.ipynb - Various ways to use unitxt.
Unitxt_Granite_as_Judge.ipynb - Using unitxt to drive the LLM as a judge pattern.
Using LM Evaluation Harness and Unitxt Together
Start on this Unitxt page. Then look at the unitxt tasks described in the lm-evaluation-harness repo.
EvalAssist
One popular evaluation technique is LLM as a judge, which uses a smart “teacher model” to evaluate the quality of benchmark datasets and model responses, because having humans do this is expensive and not sufficiently scalable. (This is different from data synthesis with a teacher model.) EvalAssist is designed to make writing these kinds of evaluations easier, including incremental development. It uses unitxt to implement evaluations.
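To make the LLM as a judge pattern concrete, here is a minimal Python sketch of its three steps: build a grading prompt, send it to a judge model, and parse a numeric score from the reply. This is not the EvalAssist or unitxt API. The prompt wording, the 1 to 5 scale, and every function name are assumptions made for illustration, and the judge model is replaced by a stub so the example runs without any inference service.

```python
import re
from typing import Callable, Optional

# Hypothetical grading prompt; real judge prompts are usually more detailed.
JUDGE_PROMPT = (
    "You are grading an answer to a question.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Rate the answer's quality from 1 (poor) to 5 (excellent). "
    "Reply in the form 'Score: <number>'."
)


def build_judge_prompt(question: str, answer: str) -> str:
    """Fill the grading template with the item under evaluation."""
    return JUDGE_PROMPT.format(question=question, answer=answer)


def parse_score(reply: str) -> Optional[int]:
    """Extract the first standalone digit 1-5 from the judge's reply, if any."""
    match = re.search(r"\b([1-5])\b", reply)
    return int(match.group(1)) if match else None


def judge(question: str, answer: str, call_model: Callable[[str], str]) -> Optional[int]:
    """Run one judgment: prompt the judge model, then parse its score."""
    reply = call_model(build_judge_prompt(question, answer))
    return parse_score(reply)


def stub_judge_model(prompt: str) -> str:
    """Stand-in for a real teacher model; always returns the same verdict."""
    return "Score: 4"


score = judge("What is the capital of France?", "Paris", stub_judge_model)
print("Judge score:", score)
```

Passing the model as a callable keeps the judging logic independent of any particular inference back end, so the stub can be swapped for a real client during incremental development.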
Observability with Arize Phoenix
Arize Phoenix (GitHub) is an open-source LLM tracing and evaluation platform designed to provide seamless support for evaluating, experimenting with, and optimizing AI applications.
See the home page for details on installing and using Phoenix. We are working on an example deployment that demonstrates the integration with the rest of the reference stack discussed above.