
Testing Problems Caused by Generative AI Nondeterminism

Let’s first review why Determinism is an important concept in software development, then discuss why the use of Generative AI Models makes deterministic behavior difficult to achieve.

Highlights:

  1. Traditional software is mostly Deterministic. This makes it much easier to reason about its behavior and to write repeatable, reliable tests. This is a central, enabling assumption made in software development.
  2. Generative AI Model outputs are Stochastic, governed by a Random Probability Distribution, which means that some values are more likely than others in a given context, but you can’t predict exactly what you will get in any single observation.
  3. Testing AI-enabled applications requires understanding and using the same tools based on statistical techniques that are used to assess model performance, such as Benchmarks.
  4. Many generative models support a temperature setting that lets you reduce the amount of randomness down to “none”, when desired. This feature can be useful in tests, but some randomness is almost always desired in the running production system.

Why Determinism is an Important Tool for Software Development

We have learned from decades of experience that creating and maintaining reliable software requires deterministic Behavior, whenever possible, and principled handling of unavoidable nondeterminism. Simply stated, the more Predictable and Repeatable the behavior, the easier it is to reason about its State and Behavior, including aspects of design, testing, and interactions with other software.

To frame the following discussion, we will use the term Unit for the lowest-granularity encapsulation of some sort of work done by code execution. (This term was popularized by the Test-Driven Development community.) We will use Component for larger-granularity collections of units. Depending on the context, a unit will usually be a Function or a Class. We will normally use component for a whole distributed service we test or use.

Furthermore, suppose the unit in question is Immutable, meaning its State never changes. Also, suppose it never performs Side Effects, a term meaning it doesn’t modify any state outside of itself, like writing to a file (which modifies the state of the file), updating a database record (a similar state change), or reading user input (which will be different every time). Such a unit will always be deterministic, which means that if we invoke it with the same input repeatedly, we must always receive the same value back. For example, the Mathematics equation cos(π) == -1 will always be true, and a software implementation of it will always return the same result, as well (ignoring potential floating point round-off errors…). For such a unit, you can write an automated test that checks this result and it will never, ever fail, unless some new bug, a Regression, causes its behavior to change.
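To make this concrete, here is a minimal Python sketch of such a deterministic unit and a repeatable test for it (the function and test names are ours, purely illustrative):

```python
import math

def cos_pi() -> float:
    """A pure, side-effect-free unit: same input, same output, always."""
    return math.cos(math.pi)

def test_cos_pi():
    # This assertion is fully deterministic; it can only fail if a
    # regression changes the unit's behavior. math.isclose guards
    # against floating point round-off.
    assert math.isclose(cos_pi(), -1.0)
```

Run under a test framework like pytest, this test will pass on every execution, forever, unless the code itself changes.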

There are necessary exceptions to this deterministic behavior for real-world systems. Some units will have Mutable state, like files, databases, and many in-memory data structures. In addition, distributed systems, including multi-threaded applications, cannot guarantee how events will be ordered or even which events will occur.

Fortunately, all these more complex behaviors are well understood. For example, the range of possible values and orders of occurrence are usually bounded, which allows exhaustive tests to be written and makes these behaviors manageable, both when testing dependencies that use them and when everything runs in production deployments. We have effective techniques for handling these scenarios, some of which we will review in Architecture and Design for Testing.

To summarize, application developers expect the following:

  • Most software behaves deterministically, with known exceptions, allowing reasoning about behaviors and writing tests to verify expectations.
  • The ability to use repeatable, automated tests to validate new behaviors work as designed and to ensure that no regressions occur as the application code base evolves.
  • The ability to work with high productivity. Robust, comprehensive test suites provide confidence in the current safety and reliability of the application, and because they also catch regressions as the software evolves, developers can work relatively quickly.

How Generative AI Changes This Picture

Generative AI models are inherently Stochastic and hence nondeterministic. As an extreme example, sending the same query to a model, such as “Write a haiku about the beauty of Spring” or “Create an image of a dog in a space suit walking on Mars”, is expected to return a different result every time. How do you reason about and write reliable tests for such “expected” behavior? Introducing AI-generated content into an application makes it difficult, if not impossible, to write deterministic tests that are repeatable and automatable.

More precisely, the simple view of an LLM is that it generates output one Token at a time, each chosen based on the tokens generated so far and guided by any additional context information that was supplied in the prompt. It picks the next token randomly from all possible tokens, based on the probability that each one would be a “suitable” choice to appear next. Those tokens with the highest Probabilities are chosen more often, but occasionally less-probable tokens are chosen. Multimodal Models that generate images, audio, and video work similarly.
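This core loop can be sketched in a few lines of Python. The toy “model” below, with an invented vocabulary and invented probabilities, is nothing like a real LLM, but it shows why repeated runs produce different outputs:

```python
import random

# A toy stand-in for a model: given the tokens so far, return candidate
# next tokens and their probabilities. (Invented numbers, purely
# illustrative; a real model computes these from billions of parameters.)
def toy_next_token_distribution(tokens: list[str]) -> dict[str, float]:
    return {"spring": 0.5, "blossoms": 0.3, "rain": 0.15, "<end>": 0.05}

def generate(prompt_tokens: list[str], max_tokens: int = 10) -> list[str]:
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        dist = toy_next_token_distribution(tokens)
        # Sample the next token in proportion to its probability:
        # high-probability tokens are chosen more often, but
        # low-probability tokens are sometimes chosen, too.
        token = random.choices(list(dist), weights=list(dist.values()))[0]
        if token == "<end>":
            break
        tokens.append(token)
    return tokens
```

Two calls to generate with the same prompt will usually return different token sequences, which is exactly the nondeterminism we have to test around.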

Therefore, model outputs are an example of a Stochastic process, where each observation, a Token in this case, can’t be predicted exactly, but if you collect enough observations, the frequencies of the observed values will fit a random probability distribution (see Probability and Statistics) that represents the model’s behavior.

Many models support an adjustable parameter, called the temperature, that controls how much randomness is allowed in token selection. In these models, you can turn this parameter down to zero, which forces the model to always pick the most probable token in every situation. This makes the output effectively deterministic for any given prompt! However, we normally want some variability. Nevertheless, a zero temperature can be useful in some tests.
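One common way temperature is applied is to divide the model’s raw scores (“logits”) by the temperature before converting them to probabilities. The sketch below illustrates the idea (implementations vary; the token names and scores here are invented):

```python
import math
import random

def sample_with_temperature(logits: dict[str, float], temperature: float) -> str:
    if temperature == 0.0:
        # Temperature zero: always pick the highest-scoring token,
        # making the output effectively deterministic.
        return max(logits, key=logits.get)
    # Scale the logits by the temperature, then apply a softmax.
    # Low temperatures sharpen the distribution toward the most
    # probable tokens; high temperatures flatten it.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    z = max(scaled.values())  # subtracted for numerical stability
    weights = {tok: math.exp(s - z) for tok, s in scaled.items()}
    total = sum(weights.values())
    probs = [w / total for w in weights.values()]
    return random.choices(list(weights), weights=probs)[0]

# With temperature 0.0, the same token comes back every time;
# with temperature 1.0 or higher, results vary from call to call.
print(sample_with_temperature({"spring": 2.0, "rain": 1.0, "mud": 0.1}, 0.0))
```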

Temperature is a useful metaphor for randomness; think of how the surface of a pot of water behaves as you heat it up, going from cold and flat (and “predictable”) to hot and very bubbly, where the height at any point on the surface can vary a lot around the average level.

Two other simple random probability distribution examples are useful to consider. Consider the behavior of flipping a fair (unweighted) coin. For each flip, you have no way of knowing whether you will observe a head or a tail, each of which has a 50% probability of occurring. However, if you flip the coin 100 times, you will observe approximately 50 heads and 50 tails. For 1000 flips, the proportions will be even closer to 50-50. A less simple example distribution is the values observed when rolling two six-sided dice. It is much more probable to get two values that add up to 5, 6, or 7 on a roll than to get a total of 2 or 12, because more combinations of the two dice produce the middle totals (a 7 can be rolled six ways, while 2 and 12 can each be rolled only one way).
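Both distributions are easy to explore empirically. This short simulation, using only Python’s standard library, shows the head/tail proportions converging and the two-dice totals clustering around 7:

```python
import random
from collections import Counter

# Fair coin: each individual flip is unpredictable, but the
# proportion of heads converges toward 50% as the flip count grows.
for n in (100, 1_000, 100_000):
    heads = sum(random.choice((0, 1)) for _ in range(n))
    print(f"{n} flips: {heads / n:.1%} heads")

# Two six-sided dice: totals near 7 occur far more often than 2 or 12,
# because more combinations of the two dice produce them.
totals = Counter(random.randint(1, 6) + random.randint(1, 6)
                 for _ in range(100_000))
for total in sorted(totals):
    print(f"total {total:2d}: {totals[total]}")
```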

Furthermore, the nondeterminism introduced by generative AI isn’t peripheral to the application logic, like an implementation detail that is independent of the user experience. Rather, the nondeterminism is a core enabler of fundamentally new capabilities that were previously impossible.

So, we can’t avoid this nondeterminism. We have to learn how to write tests that are still repeatable and automatable, that are deterministic where feasible, but otherwise effectively evaluate the stochastic behavior that occurs. These tests are necessary to give us confidence our application works as intended. This is the challenge this guide explores.

How Do AI Experts Test Models?

AI experts also need to understand how well generative models (and the AI Systems that use them) perform against various criteria, like skill in Mathematics and suppression of hate speech and hallucinations.

Sometimes you can model a stochastic process with Mathematics to understand it, like tossing a coin or a pair of dice. Often, though, the behavior is too complex to model mathematically or the mathematical formulas are unknown. In these cases, you have to collect as many observations as you can, then look at the percentages for all the observed values.

Generative AI falls into the latter category, and the models are so complex that people don’t normally try to capture their distributions, even “experimentally”. Instead, they focus on observed, aggregate behavior, i.e., rather than focus on token-by-token probabilities, they focus on whole responses of text, images, etc.

This is what Benchmarks focus on. A common way to implement a benchmark is to curate a set of question and answer (Q&A) pairs that cover as much of the space of possible questions and expected answers as possible in the domain of interest. To use a benchmark, a model is sent each question and the answer is evaluated for correctness. Deciding whether or not an answer is “correct” is another challenge, which we’ll return to in several places in Testing Strategies and Techniques.
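In sketch form, a benchmark run is just a loop over curated Q&A pairs. Everything below is hypothetical: query_model stands in for a real model API call, and exact string matching stands in for a real (and usually more sophisticated) correctness evaluator:

```python
# A tiny, invented benchmark; real benchmarks curate hundreds or
# thousands of Q&A pairs covering the domain of interest.
benchmark = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What is the capital of France?", "answer": "Paris"},
]

def run_benchmark(query_model, benchmark) -> float:
    correct = 0
    for item in benchmark:
        reply = query_model(item["question"])
        # Deciding "correctness" is itself hard; exact matching is the
        # simplest evaluator, and usually too strict for free-form text.
        if reply.strip() == item["answer"]:
            correct += 1
    return correct / len(benchmark)  # e.g., 0.85 means an 85% score
```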

It’s usually not necessary for you to run these benchmarks yourself. The results of applying the more popular benchmarks against the more popular models are published in leaderboards, allowing you to browse models based on how well they do against the benchmarks you care most about.

So, when you see that a particular model scores 85% on a benchmark, it means the model’s replies to the questions were judged to be correct 85% of the time. Now, is 85% good enough? The answer depends on the application requirements. You may have to choose the best-performing model for a given benchmark, while considering performance of the model against other benchmarks to be less critical.

This approach to validating behaviors is very different from the unambiguous 100% “pass/fail” answers software developers are accustomed to seeing.

Some models can output a confidence score, expressing how much trust they have in the answer they provided. Is there a correlation between those confidence scores and the answers that were judged good or bad? That information can tell you whether the confidence scores are themselves trustworthy.
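One simple probe is to compute the correlation between confidence scores and correctness judgments, sketched here with Python’s statistics module (Python 3.10+; the data is invented):

```python
import statistics

# Invented data: each pair is (model's confidence in its answer,
# 1 if the answer was judged correct, else 0).
results = [(0.95, 1), (0.90, 1), (0.80, 1), (0.70, 0),
           (0.60, 1), (0.55, 0), (0.40, 0), (0.30, 0)]

confidences = [conf for conf, _ in results]
correctness = [ok for _, ok in results]

# A correlation near +1 suggests confidence is a trustworthy signal;
# a value near 0 means confidence tells you little about answer quality.
r = statistics.correlation(confidences, correctness)
print(f"confidence/correctness correlation: {r:.2f}")
```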

More advanced Statistical Analysis techniques can be used to probe the results of a stochastic process more deeply.

You are probably already familiar with statistical concepts like the mean and standard deviation. For example, what is the mean (average) score across all models against a particular benchmark? A low mean suggests that the benchmark is hard for many of the models, while a high mean suggests that models are now very good in this area, on average.

The standard deviation of model scores tells you how much variability there is across the models. For example, this value is fairly large if newer and larger models tend to significantly outperform older and smaller models. In contrast, if the standard deviation is low, then all models are closer in performance. Combined with a low mean, it suggests all models struggle about equally with the benchmark, while a high mean with a low standard deviation suggests that the benchmark is easy for most models to perform well on.
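Computing these summary statistics is straightforward; here is a sketch over invented leaderboard scores:

```python
import statistics

# Invented scores (% correct) for several models on one benchmark.
scores = [62.0, 71.5, 74.0, 78.5, 80.0, 85.5]

mean = statistics.mean(scores)    # high mean: models do well on average
stdev = statistics.stdev(scores)  # low stdev: models perform similarly
print(f"mean = {mean:.1f}%, standard deviation = {stdev:.1f}%")
```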

This Is Not a New Problem

Recently, one of us posted a link on Mastodon to the slides for a talk, Generative AI: Should We Say Goodbye to Deterministic Testing?. In a private conversation afterwards, Adrian Cockcroft said that Netflix encountered similar problems around 2008 with their content recommendation systems: “The content inventory (movies or products) changes constantly, and the recommendations are personalized so that everyone sees a different result. We had to build some novel practices and tools for our QA engineers.”

The specific tools and practices he mentioned are discussed in Test Doubles at Netflix and The Use of Statistics at Netflix.


Review the highlights summarized above. Then, before we turn to Testing Strategies, we first discuss Architecture and Design, informed by our testing concerns.