Statistical Tests
Are there cases where the behavior is nondeterministic, but reasonable bounds can be specified statistically? In other words, if the results fall within some measurable confidence window, the test is considered to pass.
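One way to make this idea concrete is to run a nondeterministic check many times and pass the test only if a confidence bound on the success rate clears a threshold. The sketch below is a minimal illustration, not an established method from this project: the names `wilson_lower_bound`, `statistical_test_passes`, and `flaky_check`, the 200 trials, and the 0.9 threshold are all hypothetical choices. It uses the Wilson score interval for a binomial proportion.

```python
import math
import random


def wilson_lower_bound(successes: int, trials: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to a two-sided 95% interval.
    """
    if trials == 0:
        return 0.0
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = p_hat + z**2 / (2 * trials)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2))
    return (center - margin) / denom


def statistical_test_passes(check, trials: int = 200,
                            min_pass_rate: float = 0.9,
                            z: float = 1.96) -> bool:
    """Run `check` repeatedly; pass only if the lower confidence bound
    on its success rate is at least `min_pass_rate`."""
    successes = sum(1 for _ in range(trials) if check())
    return wilson_lower_bound(successes, trials, z) >= min_pass_rate


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the demo is repeatable

    def flaky_check() -> bool:
        # Hypothetical stand-in for a nondeterministic system under test:
        # it "succeeds" about 97% of the time.
        return rng.random() < 0.97

    print("passes:", statistical_test_passes(flaky_check))
```

Comparing the lower bound, rather than the raw success rate, against the threshold means that a small number of lucky runs is not enough to pass; more trials tighten the interval.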
Use of Statistics at Netflix
Adrian Cockcroft told one of us that Netflix took this approach for their recommendation systems, computing plausibility scores that gave them sufficient confidence in the results.
TODOs:
- Examples, perhaps inspired by classifiers.
- Use of standard deviations, …
- See "Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations", a paper by Evan Miller.
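As a possible starting point for the items above, here is a minimal, classifier-inspired sketch of reporting an accuracy score with error bars. It is an assumption about what such an example could look like, not code taken from the paper: the function name `mean_with_error_bars` and the sample labels are hypothetical, and the interval is the usual normal approximation (mean plus or minus 1.96 standard errors), which assumes independent examples and a reasonably large sample.

```python
import math
import statistics


def mean_with_error_bars(scores, z: float = 1.96):
    """Return (mean, low, high) for per-example scores.

    The standard error of the mean is the sample standard deviation
    divided by sqrt(n); z = 1.96 gives an approximate 95% interval.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to estimate a standard error")
    mean = statistics.fmean(scores)
    sem = statistics.stdev(scores) / math.sqrt(n)
    return mean, mean - z * sem, mean + z * sem


if __name__ == "__main__":
    # Hypothetical classifier outputs versus expected labels.
    expected = ["spam", "ham", "ham", "spam", "ham", "ham", "spam", "ham"]
    predicted = ["spam", "ham", "spam", "spam", "ham", "ham", "ham", "ham"]

    # Score each example 1 for a correct prediction, 0 otherwise.
    scores = [1 if p == e else 0 for p, e in zip(predicted, expected)]

    mean, low, high = mean_with_error_bars(scores)
    print(f"accuracy = {mean:.3f}, approx. 95% interval = [{low:.3f}, {high:.3f}]")
```

With only a handful of examples, as in this demo, the interval is wide and the normal approximation is rough; the point is only to show the shape of the calculation.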