LLM as a Judge

In LLM as a Judge, a separate model serves as a judge that generates Q&A pairs for the benchmarks. The judge is typically a highly capable model that is too expensive, too slow, or otherwise unsuitable for production use in your application.

The judge model can also be used to decide whether or not the application model’s response to a particular question is sufficiently close to the expected answer.
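
Concretely, the judging step can be as simple as prompting the judge model to rate the candidate answer against the reference answer and applying a threshold to the rating. The following sketch illustrates that pattern; the prompt wording, the 1–5 scale, and the `call_judge_model` callable are illustrative assumptions, not part of any particular framework.

```python
# A minimal sketch of the judging pattern described above. The judge prompt,
# the `call_judge_model` callable, and the 1-5 rating scale are assumptions
# for illustration, not the API of any specific library.

JUDGE_PROMPT = """You are an impartial judge. Compare the candidate answer to
the reference answer for the given question. Reply with a single integer from
1 (completely wrong) to 5 (equivalent to the reference), and nothing else.

Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Rating:"""


def judge_answer(call_judge_model, question: str, reference: str,
                 candidate: str, threshold: int = 4) -> bool:
    """Return True if the judge rates the candidate close enough to the reference.

    `call_judge_model` is any callable that sends a prompt string to the judge
    LLM and returns its text completion.
    """
    prompt = JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate)
    reply = call_judge_model(prompt)
    try:
        rating = int(reply.strip().split()[0])
    except (ValueError, IndexError):
        return False  # treat unparsable judge output as a failure, to be safe
    return rating >= threshold
```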

Tools

Popular frameworks for implementing evaluations include unitxt and lm-evaluation-harness.

IBM Research recently open-sourced EvalAssist, an application that simplifies writing unitxt-based evaluators that use LLM as a Judge. It also helps users iteratively refine evaluation criteria through a web-based user experience.
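
To make the idea of an evaluation criterion more concrete, here is a small sketch of how a criterion might be represented as data and rendered into a judge prompt. It is purely illustrative; it does not use the EvalAssist or unitxt APIs, and all names in it are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """An illustrative evaluation criterion of the kind a judge might apply."""
    name: str
    description: str
    options: dict[str, str]  # option label -> what choosing that label means


conciseness = Criterion(
    name="conciseness",
    description="Is the response short and to the point, without unnecessary detail?",
    options={
        "Yes": "The response answers the question with no superfluous content.",
        "No": "The response is verbose or includes irrelevant material.",
    },
)


def criterion_prompt(criterion: Criterion, question: str, response: str) -> str:
    """Render a judge prompt that asks for exactly one of the criterion's labels."""
    options = "\n".join(
        f"- {label}: {meaning}" for label, meaning in criterion.options.items())
    return (
        f"Evaluate the response against the criterion '{criterion.name}': "
        f"{criterion.description}\n"
        f"Choose exactly one option label:\n{options}\n\n"
        f"Question: {question}\nResponse: {response}\n"
        f"Answer with the label only:"
    )
```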

An Example

An example of using an LLM as a judge can be found in the IBM Granite Community, in the Granite “Snack” Cookbook repo, under the recipes/Evaluation folder. The recipes in this folder use unitxt. They only require running Jupyter locally, because all inference is done remotely by the community’s back-end services.

In addition, these notebooks demonstrate other aspects of using unitxt.

Issues You Have to Consider

  1. How do you validate that the judge model is producing good Q&A pairs or accurately evaluating the application model’s results, depending on the usage pattern? Most likely, some human inspection of the Q&A pairs, and possibly of some test results, will be necessary until sufficient confidence is established. Statistical techniques are useful for quantifying that confidence (see the sketch after this list).
  2. If the judge model is expensive or slow, how do you use it economically? On the other hand, it is only used during the testing process, not during normal inference, so the higher inference costs may not matter much in practice.
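
One way to build the statistical confidence mentioned in item 1 is to spot-check a sample of the judge’s verdicts against human labels and compute an agreement rate with a confidence interval. The sketch below is illustrative only; the function name, the sample data, and the normal-approximation interval are assumptions, not part of any framework mentioned above.

```python
import math


def agreement_with_confidence(judge_verdicts: list[bool],
                              human_verdicts: list[bool],
                              z: float = 1.96) -> tuple[float, float, float]:
    """Return the judge/human agreement rate and an approximate 95% confidence
    interval for it, using the normal approximation to the binomial."""
    assert judge_verdicts and len(judge_verdicts) == len(human_verdicts)
    n = len(judge_verdicts)
    agreed = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    p = agreed / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - margin), min(1.0, p + margin)


# Hypothetical spot check: 100 items labeled by a human; the judge agrees on 87.
judge_says_pass = [True] * 87 + [False] * 13
human_says_pass = [True] * 100
rate, low, high = agreement_with_confidence(judge_says_pass, human_says_pass)
print(f"agreement = {rate:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```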

TODO