Statistical Tests

Are there cases where the behavior is nondeterministic, but reasonable bounds on it can be specified statistically? In other words, a test is considered passing if its results fall within a specified confidence interval.
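
One way to make this idea concrete is sketched below. It is a minimal illustration, not an established method from this guide: it assumes each run of the nondeterministic component yields a numeric score, that a normal approximation to the confidence interval of the mean is adequate, and that "passing" means the lower bound of that interval stays at or above a chosen threshold. The names `confidence_interval`, `passes_statistically`, `flaky_component`, and `min_acceptable` are hypothetical and exist only for this example.

```python
import math
import random
import statistics


def confidence_interval(samples, z=1.96):
    """Return (low, high) for the mean, using a normal approximation.

    z=1.96 corresponds to a roughly 95% two-sided interval.
    Requires at least two samples.
    """
    mean = statistics.mean(samples)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    sem = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean - z * sem, mean + z * sem


def passes_statistically(samples, min_acceptable, z=1.96):
    """Pass only if the lower bound of the interval is >= min_acceptable."""
    low, _ = confidence_interval(samples, z)
    return low >= min_acceptable


def flaky_component(rng):
    """Hypothetical stand-in for a nondeterministic system.

    Returns 1.0 (correct) about 90% of the time, otherwise 0.0.
    """
    return 1.0 if rng.random() < 0.9 else 0.0


rng = random.Random(42)  # fixed seed so the example is repeatable
scores = [flaky_component(rng) for _ in range(500)]
low, high = confidence_interval(scores)
print(f"mean={statistics.mean(scores):.3f}, CI=({low:.3f}, {high:.3f})")
print("pass" if passes_statistically(scores, min_acceptable=0.8) else "fail")
```

Comparing the lower bound, rather than the raw mean, to the threshold makes the test stricter when there are few samples, because the interval widens as the sample count drops. For pass/fail scores with rates close to 0 or 1, or for small sample sizes, the normal approximation may be too loose, and a different interval may be a better fit.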

TODOs:

  1. Examples, perhaps inspired by classifiers.
  2. Use of standard deviations, …
  3. See Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations.