
Statistical Tests

Are there cases where the behavior is nondeterministic, yet reasonable bounds on it can be specified statistically? If so, results that fall within a measurable confidence window can be treated as acceptable, i.e., passing.
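One way to make this idea concrete is to run a nondeterministic check many times and pass the test only if a confidence bound on the observed success rate clears a threshold. The following is a minimal sketch, not an established implementation: the function names, the choice of the Wilson score interval, the 200 trials, and the 0.9 threshold are all assumptions made for illustration.

```python
import math
import random


def wilson_lower_bound(successes: int, trials: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to a two-sided 95% interval.
    """
    if trials == 0:
        return 0.0
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = p_hat + z**2 / (2 * trials)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2))
    return (center - margin) / denom


def passes_statistically(check, trials: int = 200, min_rate: float = 0.9) -> bool:
    """Call `check` repeatedly; pass only if the lower bound clears `min_rate`.

    `check` is a hypothetical zero-argument callable returning True when a
    single nondeterministic run produced an acceptable result.
    """
    successes = sum(1 for _ in range(trials) if check())
    return wilson_lower_bound(successes, trials) >= min_rate


if __name__ == "__main__":
    # Hypothetical stand-in for a nondeterministic component that is
    # acceptable about 97% of the time.
    rng = random.Random(0)
    print(passes_statistically(lambda: rng.random() < 0.97))
```

Comparing the lower bound, rather than the raw success rate, against the threshold makes the test stricter when few trials are run, because a small sample yields a wide interval.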

Use of Statistics at Netflix

Adrian Cockcroft told one of us that Netflix took this approach for their recommendation systems, computing plausibility scores that gave them sufficient confidence in the results.

TODOs:

  1. Examples, perhaps inspired by classifiers.
  2. Use of standard deviations, …
  3. See “Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations,” a paper by Evan Miller.
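As a possible starting point for the standard-deviation item, the sketch below computes a mean score with error bars from the sample standard deviation and checks whether the whole interval lies inside an acceptable range. The function names, the normal-approximation multiplier of 1.96, and the acceptance rule are illustrative assumptions, not a method taken from this page or from the paper.

```python
import math
import statistics


def mean_with_error_bars(scores, z=1.96):
    """Return (mean, low, high) using the standard error of the mean.

    Assumes the scores are independent samples, so that the normal
    approximation behind z = 1.96 (about 95%) is reasonable.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to estimate a deviation")
    mean = statistics.fmean(scores)
    sem = statistics.stdev(scores) / math.sqrt(n)  # sample std dev / sqrt(n)
    return mean, mean - z * sem, mean + z * sem


def within_expected_range(scores, expected_low, expected_high):
    """Pass only if the entire interval falls inside the acceptable range."""
    _, low, high = mean_with_error_bars(scores)
    return expected_low <= low and high <= expected_high


if __name__ == "__main__":
    # Hypothetical per-run scores from repeated evaluations.
    runs = [0.8, 0.9, 1.0, 0.9]
    print(mean_with_error_bars(runs))
    print(within_expected_range(runs, 0.8, 1.0))
```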