
LLM as a Judge
In LLM as a Judge, a separate model serves as the judge that generates the Q&A pairs for the benchmarks. The judge is usually a very capable model that is too expensive, too slow, or otherwise unsuitable for production use in your application.
The judge model can also decide whether the application model's response to a particular question is sufficiently close to the expected answer.
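To make the flow concrete, here is a minimal sketch of the two roles the judge plays: generating a benchmark Q&A pair, and then grading the application model's answer against it. It assumes an OpenAI-compatible chat endpoint; the model names, prompts, and parsing are placeholders for illustration, not part of any particular framework.

```python
"""Minimal LLM-as-a-Judge sketch: a strong judge model writes a Q&A pair,
then grades the application model's answer against it."""
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY; point base_url at any compatible server

JUDGE_MODEL = "judge-model"        # placeholder: large, capable model
APP_MODEL = "application-model"    # placeholder: the model you actually deploy

def complete(model: str, prompt: str) -> str:
    """Single-turn chat completion, returning the text of the reply."""
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

# 1. The judge generates a benchmark Q&A pair from a source document.
document = "Unitxt is an open-source library for building and running LLM evaluations."
qa_text = complete(
    JUDGE_MODEL,
    "Write one factual question answerable from the text below, then its answer.\n"
    "Format:\nQ: <question>\nA: <answer>\n\nText:\n" + document,
)
# Naive parsing of the "Q: ... / A: ..." format, for illustration only.
question, expected = (line.partition(": ")[2] for line in qa_text.splitlines()[:2])

# 2. The application model answers the question.
candidate = complete(APP_MODEL, question)

# 3. The judge decides whether the candidate answer is close enough to the expected one.
verdict = complete(
    JUDGE_MODEL,
    f"Question: {question}\nExpected answer: {expected}\nCandidate answer: {candidate}\n"
    "Is the candidate answer correct and sufficiently close to the expected answer? "
    "Reply with exactly CORRECT or INCORRECT.",
)
print("Judge verdict:", verdict.strip())
```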
An Example
Popular frameworks for implementing evaluations include unitxt and lm-evaluation-harness.
An example of using an LLM as a judge can be found in the IBM Granite Community, in the Granite "Snack" Cookbook repo, under the recipes/Evaluation folder. The recipes in this folder use unitxt. They only require running Jupyter locally, because all inference is done remotely by the community's back-end services.
In addition, these notebooks demonstrate other aspects of using unitxt:
- Unitxt_Quick_Start.ipynb - A quick introduction to unitxt.
- Unitxt_Demo_Strategies.ipynb - Various ways to use unitxt.
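For orientation before opening the notebooks, the basic unitxt evaluation flow looks roughly like the sketch below. It follows unitxt's documented quick-start pattern, but treat it as an assumption: the catalog card, template, and model names are illustrative, and the exact API may differ between unitxt versions (the notebooks above are the authoritative examples).

```python
# Rough sketch of a unitxt evaluation run; names and exact API may vary by version.
from unitxt import load_dataset, evaluate
from unitxt.inference import HFPipelineBasedInferenceEngine

# Load a benchmark task from the unitxt catalog as a ready-to-run dataset.
dataset = load_dataset(
    card="cards.wnli",
    template="templates.classification.multi_class.relation.default",
    split="test",
)

# Run the model being evaluated over the task inputs.
model = HFPipelineBasedInferenceEngine(
    model_name="google/flan-t5-small", max_new_tokens=32
)
predictions = model.infer(dataset)

# Score the predictions with the metrics defined by the catalog card.
results = evaluate(predictions=predictions, data=dataset)
print(results)
```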
Issues You Have to Consider
- How do you validate that the judge model is producing good Q&A pairs and accurately evaluating the application model's results? This requires some human inspection of the Q&A pairs, and possibly of some test results, until confidence is established. Statistical techniques may also be useful (see the sketch after this list).
- If the judge model is expensive or slow, how do you use it economically?
- …
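As one example of the statistical techniques mentioned above, the sketch below compares judge verdicts with human labels on the same spot-checked items, reporting raw agreement and Cohen's kappa. The label lists are made up purely for illustration.

```python
# Sketch: checking the judge against human spot-checks of the same items.
from sklearn.metrics import cohen_kappa_score

# Made-up illustrative verdicts for six spot-checked responses.
human_labels = ["correct", "correct", "incorrect", "correct", "incorrect", "correct"]
judge_labels = ["correct", "incorrect", "incorrect", "correct", "incorrect", "correct"]

# Fraction of items where judge and human agree.
agreement = sum(h == j for h, j in zip(human_labels, judge_labels)) / len(human_labels)
# Chance-corrected agreement.
kappa = cohen_kappa_score(human_labels, judge_labels)

print(f"Raw agreement: {agreement:.2f}")
print(f"Cohen's kappa: {kappa:.2f}")
```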
TODO