Join Our Work Group GitHub Repo
Statistical Tests
Are there cases where the behavior is nondeterministic, but reasonable bounds can be specified statistically? In other words, if the results fall within some measurable confidence window, they are considered acceptable, i.e., passing.
TODOs:
- Examples, perhaps inspired by classifiers.
- Use of standard deviations, …
- See Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations.