Statistical Evaluation

Table of contents
  1. Statistical Evaluation
    1. Statistical Analysis of Data for Stochastic Systems
    2. Other Examples of Using Statistics in AI Testing Situations
      1. The Use of Statistics at Netflix
      2. Plurai’s Intellagent
    3. Evaluating Our Synthetic Data and Healthcare ChatBot Test Results
    4. Experiments to Try
    5. For More Information
    6. What’s Next?

So far, we have explored various techniques for testing the stochastic behaviors of applications that use generative AI. We found scenarios where we could enforce mostly-deterministic behavior, such as handling FAQs in our example ChatBot. In general, however, we need ways to assess non-deterministic behaviors, such as deciding the thresholds at which an AI-related test passes or fails, and whether a synthetic datum is acceptable or not.

TODO:

This chapter needs contributions from experts in statistics. See this issue and Contributing if you would like to help.

In the Unit Benchmarks chapter’s Experiments to Try and in various parts of the LLM as a Judge chapter, we raised questions to begin thinking about these decisions. Now we will put the concepts on a more formal foundation. Specifically, we will apply statistical analysis to test results and use that information to inform our thinking.

Highlights:

  1. Statistical analysis helps us make sense of observed behaviors of stochastic processes, like generated AI responses.
  2. Acceptable pass/fail thresholds are often use-case dependent. Choosing them requires weighing the tolerance for “suboptimal” responses, the risks at stake for the application, and overall intuition about acceptable performance for the use case.

Statistical Analysis of Data for Stochastic Systems

TODO: Expand this section to provide a very concise overview of the basic concepts the reader needs to understand and their uses.

The section How to think about non-determinism in evaluations for agents in Anthropic’s post Demystifying evals for AI agents discusses approaches to non-determinism in evaluations. Two useful metrics from their discussion are summarized here:

pass@k measures the probability that at least one correct solution occurs in k attempts. As k increases, the pass@k score rises, which makes sense: the more attempts you make, the higher the likelihood that at least one of them succeeds. For example, a score of 50% pass@1 means that the test succeeds on half the tasks on the first try. This metric is most useful when multiple attempts are acceptable, as long as at least one attempt is highly likely to succeed.

pass^k measures the probability that all k trials succeed. As k increases, pass^k falls, since consistency is harder to maintain across more trials. For example, if the per-trial success rate is 75% and you run 3 trials, the probability of passing all three is (0.75)³ ≈ 42%. This metric is important for applications where reliable behavior is expected every time, like ChatBots.
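Both metrics are straightforward to compute. The sketch below assumes you have recorded n trials with c successes; it uses the standard unbiased pass@k estimator, 1 − C(n−c, k)/C(n, k), and models pass^k as p^k for an independent per-trial success rate p:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k: the probability that at least one of
    k samples drawn without replacement from n trials (c of them correct)
    is correct."""
    if n - c < k:
        # Fewer than k failures exist, so every k-subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(p: float, k: int) -> float:
    """pass^k: the probability that all k independent trials succeed,
    given a per-trial success rate p."""
    return p ** k

# Matches the examples above: 50% pass@1, and (0.75)^3 ≈ 42%.
print(pass_at_k(10, 5, 1))    # → 0.5
print(pass_hat_k(0.75, 3))    # → 0.421875
```

Note the opposite trends: pass_at_k(10, 5, k) rises toward 1.0 as k grows, while pass_hat_k(0.75, k) falls toward 0.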

More content is TODO…

Other Examples of Using Statistics in AI Testing Situations

The Use of Statistics at Netflix

In Testing Problems, we mentioned that Netflix faced the same testing challenges back in 2008 for their recommendation systems. Part of their solution leveraged statistical analysis. They computed plausibility scores that gave them sufficient confidence in the results.

TODO:

Fill in more details.

Plurai’s Intellagent

More recently, a new open-source project called Intellagent from Plurai.ai combines recent research on automated test-data generation, knowledge graphs built from an application’s constraints and requirements, and automated test generation to verify that the system aligns with those requirements.

TODO:

Expand the explanation of what Intellagent does and show use of it in our example.

Evaluating Our Synthetic Data and Healthcare ChatBot Test Results

In our healthcare ChatBot, we realized we could design our prompts to detect FAQs, like prescription refill requests, and return a deterministic response. However, we have left open the question of how to handle edge cases, such as messages that are ambiguous and may or may not be actual refill requests. Let’s explore this issue now.

First, can we establish our confidence that we have a real refill request? We raised this question informally in the Experiments to Try in Unit Benchmarks.

For simplicity, we will continue to ignore the possibility that a prompt contains other content in addition to a refill request.
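One way to quantify our confidence is to classify the same message several times and treat the fraction of “refill request” verdicts as a binomial proportion. The sketch below is a hypothetical illustration, not our ChatBot’s actual implementation; it computes a Wilson score interval, which behaves better than the simple normal approximation when the number of trials is small:

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a binomial proportion."""
    if trials == 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    margin = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - margin, center + margin

# Suppose 18 of 20 repeated classifications of an ambiguous message
# said "refill request". The lower bound is only about 0.70, so under
# a strict threshold (say, 0.90) we would ask the user to confirm.
low, high = wilson_interval(18, 20)
print(low, high)
```

The pass/fail threshold for the lower bound is exactly the kind of use-case-dependent decision discussed in the highlights above.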

TODO:

Complete…

Experiments to Try

TODO:

Expand this section once more content is provided above.

For More Information

The paper Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations by Evan Miller discusses the use of error bars, a standard technique in statistical analysis for quantifying the uncertainty of a result. For example, in science experiments, no measurement has infinite precision, and potential false signals (i.e., noise) must be accounted for.
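The core idea can be sketched in a few lines. Assuming each eval question is scored 0 or 1, the standard error of the mean yields an approximate 95% confidence interval (this is the simple normal approximation; Miller’s paper also covers refinements such as clustered questions and paired comparisons):

```python
import math

def accuracy_with_error_bars(scores: list[int], z: float = 1.96) -> tuple[float, float, float]:
    """Mean accuracy over eval questions, with an approximate 95%
    confidence interval based on the standard error of the mean."""
    n = len(scores)
    mean = sum(scores) / n
    se = math.sqrt(mean * (1 - mean) / n)  # standard error for a 0/1 score
    return mean, mean - z * se, mean + z * se

# 80 correct out of 100 questions: report 80% ± ~8%, not a bare "80%".
print(accuracy_with_error_bars([1] * 80 + [0] * 20))  # ≈ (0.80, 0.72, 0.88)
```

Reporting the interval rather than a single number makes it clear when two models’ eval scores are statistically indistinguishable.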

What’s Next?

Review the highlights summarized above, then proceed to Lessons from Systems Testing.