
The Venerable Principles of Coupling and Cohesion

Real applications, AI-enabled or not, combine many subsystems, usually including web pages for the user experience (UX), database and/or streaming systems for data retrieval and management, various libraries and modules, and calls to external services. Each of these Components can be tested in isolation, and most are deterministic or can be made to behave deterministically for testing. Good software design is, at its heart, a divide-and-conquer strategy.

An AI application adds one or more Generative AI Models invoked through libraries, web services, or Agents, increasingly using the Model Context Protocol (MCP).

Everything that isn’t model output should be made as deterministic as possible and tested using the traditional, deterministic techniques.

Invocations of the model should be hidden behind an API abstraction that can be replaced at test time with a Test Double. Even for some integration and acceptance tests, use a model test double for tests that aren’t exercising the behavior of the model itself.
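
For example, here is a minimal sketch of this kind of encapsulation; the names (`InferenceService`, `CannedInference`, `summarize`) are hypothetical, not part of any particular library:

```python
from typing import Protocol


class InferenceService(Protocol):
    """Abstraction over model invocation; application code depends only on this."""
    def complete(self, prompt: str) -> str: ...


class CannedInference:
    """Test double: returns deterministic, pre-approved responses per prompt."""
    def __init__(self, responses: dict[str, str]):
        self._responses = responses

    def complete(self, prompt: str) -> str:
        return self._responses.get(prompt, "UNKNOWN PROMPT")


def summarize(text: str, llm: InferenceService) -> str:
    """Application code receives the abstraction, never a concrete model client."""
    return llm.complete(f"Summarize the following text:\n{text}")
```

In production, an implementation of `InferenceService` that calls the real model provider is injected; in tests, `CannedInference` (or the richer doubles discussed below) takes its place.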

Possible “Tactics”

Let’s consider ways our encapsulation APIs can be most effective in the context of generative AI.

Test Doubles at Netflix

Adrian Cockcroft told one of us that Netflix wrote model Test Doubles that would “… dynamically create similar input content for tests classified along the axes that mattered for the algorithm.” In other words, while traditional test doubles usually hard-code deterministic outputs for specific inputs, a test double for a probabilistic model can instead generate nondeterministic outputs that stay within the expected bounds of acceptability. Tests that use such doubles exercise the unit under test against the full range of possible, but acceptable, outputs.
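
The Netflix code is not public, so the following is only a hypothetical sketch of the idea for a recommendation-style result; the class and method names are invented:

```python
import random


class RecommenderDouble:
    """Test double that returns varied, but in-bounds, recommendations on each
    call, instead of a single hard-coded response."""

    def __init__(self, catalog: list[str], min_items: int = 5,
                 max_items: int = 10, seed: int | None = None):
        self._catalog = catalog
        self._min = min_items
        self._max = max_items
        self._rng = random.Random(seed)  # seed for reproducible test runs, if desired

    def recommend(self, user_id: str) -> list[str]:
        count = self._rng.randint(self._min, self._max)
        return self._rng.sample(self._catalog, k=min(count, len(self._catalog)))
```

Tests built on such a double assert on properties of the output (its length stays within bounds, every item comes from the catalog, there are no duplicates) rather than on exact values.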

However, this also suggests the need for test doubles that deliberately produce “unacceptable” output. These are used to test error handling and graceful degradation in the components that ingest and process model output.
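
A complementary, equally hypothetical double injects the kinds of bad output a real model can emit, so tests can verify that downstream components fail safely:

```python
import random


class MisbehavingModelDouble:
    """Test double that deliberately returns 'unacceptable' model output."""

    FAILURE_MODES = [
        "",                                    # empty response
        "I'm sorry, I can't help with that.",  # refusal instead of data
        '{"recommendations": ',                # truncated, invalid JSON
        "lorem ipsum " * 5000,                 # absurdly long output
    ]

    def __init__(self, seed: int | None = None):
        self._rng = random.Random(seed)

    def complete(self, prompt: str) -> str:
        return self._rng.choice(self.FAILURE_MODES)
```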

Netflix also added extra, hidden output that showed the workings of the algorithm, i.e., for Explainability, when running a test configuration. Details about model weights, algorithmic details, etc. were encoded as HTML comments, visible if their developers viewed the page source. This information helped them understand, for example, why a particular list of movies was chosen in a test scenario.

The generative AI equivalent of their approach might be to include in the prompt a clause such as, “in a separate section, explain how you came up with the answer”. The output of that section is then hidden from end users, but recorded for monitoring and debugging purposes by the engineering team.
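
A hypothetical sketch of that split, assuming an abstraction like the `InferenceService` above; the marker string and function name are invented:

```python
import logging

logger = logging.getLogger("model_explanations")

EXPLANATION_MARKER = "### EXPLANATION ###"

PROMPT_SUFFIX = (
    f"\n\nAfter your answer, add a section that starts with the line "
    f"'{EXPLANATION_MARKER}' and explains how you arrived at the answer."
)


def answer_with_hidden_explanation(question: str, llm) -> str:
    """Return only the answer to the caller; log the explanation section."""
    raw = llm.complete(question + PROMPT_SUFFIX)
    answer, _, explanation = raw.partition(EXPLANATION_MARKER)
    if explanation:
        # Recorded for monitoring and debugging, never shown to end users.
        logger.info("Explanation for %r: %s", question, explanation.strip())
    return answer.strip()
```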

Designing APIs in AI-based Applications

A hallmark of good software design is clear, unambiguous abstractions, with APIs between modules that eliminate potential misunderstandings and guide users toward correct usage.

Let’s be clear: from the perspective of good software development practices, exchanging free-form text for human-to-tool or tool-to-tool interactions is the worst possible interface, because it undermines predictable, testable behavior. We will only get the benefits of generative AI if we successfully compensate for this serious disadvantage.

Consider pydantic-ai, part of the Pydantic family of tools. It is one example (of many…) of an agent framework that type-checks the results returned by models and other tool invocations. This introduces an extra level of rigor and validation of the information exchanged.
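
The underlying idea can be illustrated with plain Pydantic, which pydantic-ai builds on; the schema and field names here are invented for illustration:

```python
import logging

from pydantic import BaseModel, Field, ValidationError


class MovieRecommendation(BaseModel):
    """Schema the model's JSON output must satisfy."""
    title: str
    confidence: float = Field(ge=0.0, le=1.0)
    reasons: list[str]


def parse_recommendation(raw_json: str) -> MovieRecommendation | None:
    """Validate model output, so callers get a typed object or a clear failure."""
    try:
        return MovieRecommendation.model_validate_json(raw_json)
    except ValidationError as err:
        # Reject and report, instead of passing unchecked text downstream.
        logging.getLogger("model_validation").warning("Output rejected: %s", err)
        return None
```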

Projects like OpenDXA with DANA are working to establish better control over model behaviors in part by automatically learning to be more effective.

In general, the API that encapsulates model inference should interpret the results and translate them into a more predictable, if not fully deterministic, format, so that components invoking the API see behavior closer to what they expect from traditional components, which we already know how to design and test effectively.
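
One hedged sketch of such an encapsulating API (all names hypothetical): it retries when the model’s output cannot be interpreted and returns a small, well-defined result type, so callers see conventional, testable behavior:

```python
from dataclasses import dataclass

ALLOWED_LABELS = {"positive", "negative", "neutral"}


@dataclass
class SentimentResult:
    label: str      # always one of ALLOWED_LABELS
    fallback: bool  # True if the model's raw output could not be interpreted


def classify_sentiment(text: str, llm, max_attempts: int = 2) -> SentimentResult:
    """Translate free-form model output into a predictable result type."""
    prompt = (
        "Classify the sentiment of the following text as exactly one word: "
        f"positive, negative, or neutral.\n\n{text}"
    )
    for _ in range(max_attempts):
        raw = llm.complete(prompt).strip().lower()
        if raw in ALLOWED_LABELS:
            return SentimentResult(label=raw, fallback=False)
    # Deterministic, documented fallback instead of propagating arbitrary text.
    return SentimentResult(label="neutral", fallback=True)
```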

Abstractions Encapsulate Complexities

Michael Feathers recently gave a talk called The Challenge of Understandability at Codecamp Romania 2024.

Near the end, he discussed how the software industry has a history of introducing new levels of abstractions when complexity becomes a problem. For example, high-level programming languages removed most of the challenges of writing lower-level assembly code.

From this perspective, the nondeterministic nature of generative AI is a significant source of complexity. While generative AI has the potential to provide many benefits (e.g., ease of use for non-technical users, generation of new ideas, productivity acceleration, etc.), it also makes testing and reliability much harder. What kinds of abstractions make sense for AI that would help us manage this new form of complexity?

Is This Enough?

So, we should carefully design our applications to control where non-deterministic AI behaviors occur and keep the rest of the components as deterministic as possible. Those components can be tested in the traditional ways.

We still have the challenge of testing model behaviors themselves, especially for Integration and Acceptance tests that are intended to exercise whole systems or subsystems, including how parts of the system interact with models, both when creating queries and when processing results.

The rest of the strategies and techniques explore these concerns, starting with External Tool Verification.