Statistical Evaluation

Table of contents
  1. Statistical Evaluation
    1. Statistical Analysis of Data for Stochastic Systems
    2. Other Examples of Using Statistics in AI Testing Situations
      1. The Use of Statistics at Netflix
      2. Plurai’s Intellagent
    3. Evaluating Our Synthetic Data and Healthcare ChatBot Test Results
    4. Experiments to Try
    5. For More Information
    6. What’s Next?

So far, we have explored various techniques for testing the stochastic behaviors of applications that use generative AI. We found scenarios where we could enforce mostly-deterministic behavior, such as handling FAQs in our example ChatBot. In general, however, we need ways to assess non-deterministic behaviors, such as deciding the thresholds at which an AI-related test passes or fails, and whether a synthetic datum is acceptable or not.

TODO:

This chapter needs contributions from experts in statistics. See this issue and Contributing if you would like to help.

In the Unit Benchmarks chapter’s Experiments to Try and in various parts of the LLM as a Judge chapter, we raised questions to begin thinking about these decisions. Now we will put the concepts on a more formal foundation. Specifically, we will apply statistical analysis to test results and use that information to inform our thinking.

Highlights:

  1. Statistical analysis helps us make sense of observed behaviors of stochastic processes, like generated AI responses.
  2. Acceptable pass/fail thresholds are often use-case dependent. Choosing them requires weighing the tolerance for “suboptimal” responses, the risks at stake for the application, and overall intuition about acceptable performance for the use case.

Statistical Analysis of Data for Stochastic Systems

TODO: Expand this section to provide a very concise overview of the basic concepts the reader needs to understand and their uses.

The section How to think about non-determinism in evaluations for agents in Anthropic’s post Demystifying evals for AI agents discusses approaches to non-determinism in evaluations. Two useful metrics from their discussion are summarized here:

pass@k measures the probability that at least one correct solution occurs in k attempts. As k increases, the pass@k score rises, which makes sense: the more attempts you make, the higher the likelihood that at least one of them succeeds. For example, a score of 50% pass@1 means that the test succeeds on half the tasks on the first try. This metric is most useful when multiple attempts are acceptable, as long as at least one attempt is highly likely to succeed.

pass^k measures the probability that all k trials succeed. As k increases, pass^k falls, since consistency is harder to maintain across more trials. For example, if the per-trial success rate is 75% and you run 3 trials, the probability of passing all three is (0.75)³ ≈ 42%. This metric is important for applications where reliable behavior is expected every time, like ChatBots.
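Both metrics are straightforward to compute. The sketch below assumes you have recorded n trials with c successes; it uses the standard unbiased pass@k estimator, 1 − C(n−c, k)/C(n, k), and models pass^k as p^k for an independent per-trial success rate p:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k: the probability that at least one of
    k samples drawn without replacement from n trials (c of them correct)
    is correct."""
    if n - c < k:
        # Fewer than k failures exist, so every k-subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(p: float, k: int) -> float:
    """pass^k: the probability that all k independent trials succeed,
    given a per-trial success rate p."""
    return p ** k

# Matches the examples above: 50% pass@1, and (0.75)^3 ≈ 42%.
print(pass_at_k(10, 5, 1))    # → 0.5
print(pass_hat_k(0.75, 3))    # → 0.421875
```

Note the opposite trends: pass_at_k(10, 5, k) rises toward 1.0 as k grows, while pass_hat_k(0.75, k) falls toward 0.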

More content is TODO…

Other Examples of Using Statistics in AI Testing Situations

The Use of Statistics at Netflix

In Testing Problems, we mentioned that Netflix faced the same testing challenges back in 2008 for their recommendation systems. Part of their solution leveraged statistical analysis. They computed plausibility scores that gave them sufficient confidence in the results.

TODO:

Fill in more details.

Plurai’s Intellagent

More recently, a new open-source project called Intellagent from Plurai.ai combines recent research on automated test-data generation, knowledge graphs built from an application’s constraints and requirements, and automated test generation to verify that the system aligns with those requirements.

TODO:

Expand the explanation of what Intellagent does and show use of it in our example.

Evaluating Our Synthetic Data and Healthcare ChatBot Test Results

In our healthcare ChatBot, we realized we could design our prompts to detect FAQs, like prescription refill requests, and return a deterministic response. However, we have left open the question of how to handle edge cases, such as messages that are ambiguous and may or may not be actual refill requests. Let’s explore this issue now.

First, can we establish our confidence that we have a real refill request? We raised this question informally in the Experiments to Try in Unit Benchmarks.

For simplicity, we will continue to ignore the possibility that a prompt contains other content in addition to a refill request.
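One way to quantify our confidence is to classify the same message several times and treat the fraction of “refill request” verdicts as a binomial proportion. The sketch below is a hypothetical illustration, not our ChatBot’s actual implementation; it computes a Wilson score interval, which behaves better than the simple normal approximation when the number of trials is small:

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a binomial proportion."""
    if trials == 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    margin = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - margin, center + margin

# Suppose 18 of 20 repeated classifications of an ambiguous message
# said "refill request". The lower bound is only about 0.70, so under
# a strict threshold (say, 0.90) we would ask the user to confirm.
low, high = wilson_interval(18, 20)
print(low, high)
```

The pass/fail threshold for the lower bound is exactly the kind of use-case-dependent decision discussed in the highlights above.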

TODO:

Complete…

Experiments to Try

TODO:

Expand this section once more content is provided above.

For More Information

The paper Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations by Evan Miller discusses the use of error bars, a standard technique in statistical analysis for quantifying the uncertainty of a result. For example, in science experiments, no measurement has infinite precision, and potential false signals (i.e., noise) must be accounted for.
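The core idea can be sketched in a few lines. Assuming each eval question is scored 0 or 1, the standard error of the mean yields an approximate 95% confidence interval (this is the simple normal approximation; Miller’s paper also covers refinements such as clustered questions and paired comparisons):

```python
import math

def accuracy_with_error_bars(scores: list[int], z: float = 1.96) -> tuple[float, float, float]:
    """Mean accuracy over eval questions, with an approximate 95%
    confidence interval based on the standard error of the mean."""
    n = len(scores)
    mean = sum(scores) / n
    se = math.sqrt(mean * (1 - mean) / n)  # standard error for a 0/1 score
    return mean, mean - z * se, mean + z * se

# 80 correct out of 100 questions: report 80% ± ~8%, not a bare "80%".
print(accuracy_with_error_bars([1] * 80 + [0] * 20))  # ≈ (0.80, 0.72, 0.88)
```

Reporting the interval rather than a single number makes it clear when two models’ eval scores are statistically indistinguishable.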

What’s Next?

Review the highlights summarized above, then proceed to Lessons from Systems Testing.