LLM as a Judge
In LLM as a Judge, a separate model serves as a judge. The judge is typically a highly capable model that is expensive to run or otherwise unsuitable for production use in your application. One use of the judge is to generate Q&A pairs for the benchmarks.
The judge model can also be used to decide whether or not the application model’s response to a particular question is sufficiently close to the expected answer.
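A minimal sketch of this second use, in Python, is shown here. It assumes the judge is reachable through a plain function that takes a prompt and returns the reply text. The names `JUDGE_PROMPT`, `judge_answer`, and `fake_judge`, the prompt wording, and the PASS/FAIL reply convention are all hypothetical choices made for illustration, not part of any particular library.

```python
from typing import Callable

# Hypothetical prompt template; the wording and the PASS/FAIL convention
# are assumptions made for this sketch.
JUDGE_PROMPT = """You are grading an answer to a benchmark question.
Question: {question}
Expected answer: {expected}
Model answer: {actual}
Reply with exactly one word: PASS if the model answer is sufficiently close
to the expected answer, otherwise FAIL."""


def judge_answer(
    judge: Callable[[str], str],
    question: str,
    expected: str,
    actual: str,
) -> bool:
    """Ask the judge whether `actual` is close enough to `expected`.

    `judge` is any function that sends a prompt to the judge model and
    returns its text reply (hypothetical; wire it to your own client).
    """
    prompt = JUDGE_PROMPT.format(
        question=question, expected=expected, actual=actual
    )
    reply = judge(prompt).strip().upper()
    if reply.startswith("PASS"):
        return True
    if reply.startswith("FAIL"):
        return False
    # Surface unparseable replies instead of silently guessing a verdict.
    raise ValueError(f"Unparseable judge reply: {reply!r}")


def fake_judge(prompt: str) -> str:
    """Stand-in for a real judge: passes only case-insensitive exact matches.

    It reads the single-line fields back out of the prompt, so it is only
    suitable for testing the plumbing.
    """
    fields = {}
    for line in prompt.splitlines():
        key, sep, value = line.partition(": ")
        if sep:
            fields[key] = value
    expected = fields["Expected answer"].strip().lower()
    actual = fields["Model answer"].strip().lower()
    return "PASS" if expected == actual else "FAIL"
```

Passing the judge in as a function keeps the grading logic independent of any one model provider, and lets you swap in `fake_judge` while testing the rest of the benchmark harness.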
Issues you have to manage:
- How do you validate that the judge model is producing good Q&A pairs and accurately evaluating the application model’s results? This requires human inspection of a sample of the Q&A pairs, and possibly of some test results, until you have confidence in the judge. Statistical techniques may also be useful, for example measuring how often the judge agrees with human reviewers.
- If the judge model is expensive or slow, how do you use it economically?
- …
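
As a rough sketch of the first two issues above, the Python below shows one possible statistical check and one possible cost control. Cohen's kappa is used here as an example of an agreement statistic between human reviewers and the judge; it is an illustrative choice, not one prescribed by this guide. The caching wrapper assumes that reusing a stored verdict for an identical prompt is acceptable, which holds best when the judge is configured to be deterministic. The names `cohen_kappa` and `cached_judge` are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, Sequence


def cohen_kappa(human: Sequence[bool], judge: Sequence[bool]) -> float:
    """Chance-corrected agreement between human and judge verdicts.

    1.0 means perfect agreement; 0.0 means no better than chance.
    """
    if len(human) != len(judge) or len(human) == 0:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    # Fraction of items on which the two raters gave the same verdict.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    expected = sum(
        human_counts[label] * judge_counts[label] for label in human_counts
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def cached_judge(judge: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a judge function so each distinct prompt is sent only once."""
    cache: Dict[str, str] = {}

    def wrapper(prompt: str) -> str:
        if prompt not in cache:
            cache[prompt] = judge(prompt)
        return cache[prompt]

    return wrapper
```

For example, with human verdicts `[True, True, False, False]` and judge verdicts `[True, True, False, True]`, the raw agreement is 0.75 and kappa is 0.5, because half of the agreement would be expected by chance. An in-memory cache like this only saves calls within one run; persisting verdicts to disk would be a natural extension if benchmark runs are repeated.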
TODO