
LLM as a Judge

In LLM as a Judge, a separate model serves as the judge. It is typically a more capable model that is too expensive, too slow, or otherwise unsuitable for production use in your application, and it is used to generate the Q&A pairs for the benchmarks.
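As a minimal sketch of the Q&A generation step, the snippet below prompts a judge model to produce question/answer pairs from a source passage. The `call_judge` function and the prompt wording are hypothetical placeholders, not part of any particular library; wire `call_judge` to whatever client you use to reach the judge model.

```python
import json

def call_judge(prompt: str) -> str:
    """Placeholder for a call to the judge model; returns its text reply."""
    raise NotImplementedError("wire this to your judge model client")

# Hypothetical prompt asking the judge to emit Q&A pairs as JSON.
GENERATION_PROMPT = """\
You are creating an evaluation benchmark.
Read the passage below and write {n} question/answer pairs that test
understanding of it. Reply with a JSON list of objects, each with
"question" and "answer" fields.

Passage:
{passage}
"""

def generate_qa_pairs(passage: str, n: int = 5) -> list[dict]:
    """Ask the judge model for n Q&A pairs grounded in the passage."""
    reply = call_judge(GENERATION_PROMPT.format(n=n, passage=passage))
    return json.loads(reply)  # assumes the judge returns valid JSON
```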

The judge model can also decide whether the application model’s response to a particular question is sufficiently close to the expected answer.
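The grading step might look like the following sketch, which reuses the `call_judge` placeholder from the previous snippet. The prompt wording, the 1–5 scale, and the threshold are illustrative assumptions, not a standard.

```python
# Hypothetical grading prompt: compare the application model's answer
# against the expected answer from a Q&A pair.
JUDGE_PROMPT = """\
Question: {question}
Expected answer: {expected}
Candidate answer: {candidate}

On a scale of 1 (completely wrong) to 5 (equivalent to the expected answer),
how close is the candidate answer to the expected answer?
Reply with a single integer.
"""

def judge_answer(question: str, expected: str, candidate: str,
                 threshold: int = 4) -> bool:
    """Return True if the judge rates the candidate at or above threshold."""
    reply = call_judge(JUDGE_PROMPT.format(
        question=question, expected=expected, candidate=candidate))
    return int(reply.strip()) >= threshold
```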

Issues you have to manage:

  1. How do you validate that the judge model is producing good Q&A pairs and accurately evaluating the application model’s results? This requires some human inspection of the Q&A pairs, and possibly of some test results, until confidence is established; statistical techniques may also help (see the sketch after this list).
  2. If the judge model is expensive or slow, how do you use it economically (also sketched below)?
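The sketch below illustrates two simple mitigations under assumed names: drawing a random sample of generated Q&A pairs for human review (issue 1), and caching judge verdicts on disk so repeated runs do not re-invoke an expensive model (issue 2). It reuses the hypothetical `call_judge` placeholder from the earlier sketches.

```python
import json
import random
from pathlib import Path

def sample_for_review(qa_pairs: list[dict], k: int = 20, seed: int = 0) -> list[dict]:
    """Pick k Q&A pairs for a human reviewer to spot-check."""
    rng = random.Random(seed)
    return rng.sample(qa_pairs, min(k, len(qa_pairs)))

_CACHE = Path("judge_cache.json")  # assumed cache location

def cached_judge(prompt: str) -> str:
    """Look up a judge verdict on disk before calling the judge model."""
    cache = json.loads(_CACHE.read_text()) if _CACHE.exists() else {}
    if prompt not in cache:
        cache[prompt] = call_judge(prompt)  # placeholder from the first sketch
        _CACHE.write_text(json.dumps(cache))
    return cache[prompt]
```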

TODO