
The Venerable Principles of Coupling and Cohesion

Real applications, AI-enabled or not, combine many subsystems, usually including web pages for the user experience (UX), database and/or streaming systems for data retrieval and management, various libraries and modules, and calls to external services. Each of these Components can be tested in isolation and most are deterministic or can be made to behave in a deterministic way for testing.

An AI application adds one or more Generative AI Models invoked through libraries or services. Everything else should be tested in the traditional, deterministic ways. Invocations of the model should be hidden behind an API abstraction that can be replaced at test time with a Test Double. Even for some integration and acceptance tests, use a model test double when the test isn't exercising the behavior of the model itself.
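For instance, a minimal Python sketch of such an abstraction might look like the following. The `ModelClient` protocol, the `complete` method, and the class names are illustrative assumptions, not taken from any particular library:

```python
from typing import Protocol


class ModelClient(Protocol):
    """Narrow interface that hides how the generative model is invoked."""

    def complete(self, prompt: str) -> str: ...


class ServiceModelClient:
    """Production implementation that would call the real model service."""

    def complete(self, prompt: str) -> str:
        raise NotImplementedError("invoke the real model API here")


class CannedModelClient:
    """Test double: returns a fixed, deterministic response."""

    def __init__(self, response: str) -> None:
        self.response = response

    def complete(self, prompt: str) -> str:
        return self.response
```

Application code depends only on `ModelClient`, so tests can swap in `CannedModelClient` (or the richer doubles discussed below) without touching the code under test.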

Possible “Tactics”

Let’s consider ways to make these encapsulating APIs as effective as possible.

Test Doubles at Netflix

Adrian Cockcroft told one of us that Netflix wrote model Test Doubles that would “… dynamically create similar input content for tests classified along the axes that mattered for the algorithm.” In other words, while traditional test doubles usually hard-code deterministic outputs for specific inputs, a test double for a probabilistic model should generate nondeterministic outputs that stay within the expected bounds of acceptability, so that tests exercise the unit under test across the full range of possible, but acceptable, outputs.
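A sketch of that idea, reusing the illustrative `ModelClient` shape from above (the output format and names are assumptions made for this example): the double varies its output on every call, but always stays within bounds the caller is expected to handle.

```python
import random


class BoundedRandomModelClient:
    """Test double that generates varied output on every call, but always
    within the bounds of what the unit under test should accept."""

    def __init__(self, seed: int | None = None) -> None:
        # A seed can be supplied to make a failing test reproducible.
        self.rng = random.Random(seed)

    def complete(self, prompt: str) -> str:
        # Vary the number and identity of recommendations, but always
        # return a well-formed reply.
        count = self.rng.randint(1, 5)
        titles = ", ".join(f"movie-{self.rng.randint(1, 500)}" for _ in range(count))
        return f"Recommended: {titles}"
```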

However, this also suggests that we need test doubles that deliberately produce “unacceptable” output. These would be used to test the error handling of components that ingest and process model output.
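A companion sketch, again with purely illustrative names (the `Recommender` class is a hypothetical unit under test, not part of any real system): the double returns garbage so a test can assert that the consuming component fails in a controlled way.

```python
import json

import pytest


class MalformedModelClient:
    """Test double that deliberately produces unacceptable output."""

    def complete(self, prompt: str) -> str:
        return "<<not JSON, not a recommendation list>>"


class Recommender:
    """Hypothetical component under test; it expects the model to reply
    with JSON containing a 'titles' list."""

    def __init__(self, model) -> None:
        self.model = model

    def recommend(self, user_id: str) -> list[str]:
        raw = self.model.complete(f"Recommend movies for user {user_id}")
        try:
            return json.loads(raw)["titles"]
        except (json.JSONDecodeError, KeyError, TypeError) as e:
            raise ValueError(f"unusable model output: {raw!r}") from e


def test_recommender_rejects_malformed_model_output():
    recommender = Recommender(model=MalformedModelClient())
    with pytest.raises(ValueError):
        recommender.recommend(user_id="u-123")
```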

Netflix also added extra hidden output that showed the workings of the algorithm (i.e., for Explainability) when running a test configuration. Model weights, algorithmic details, etc. were encoded as HTML comments, visible if you viewed the page source. This information helped them understand why a particular list of movies was chosen, for example, in a test scenario.

The generative AI equivalent of their approach might be to include in the prompt a clause that says something like, “in a separate section explain how you came up with the answer”. The output of that section is then hidden from end users, but visible to engineers through a page comment or logged somewhere.
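One possible sketch of that flow, where the marker string, prompt wording, and function names are all assumptions rather than any standard convention: append the clause to the prompt, then split the reply so the explanation is logged for engineers rather than shown to users.

```python
import logging

logger = logging.getLogger("model.explanations")

EXPLANATION_MARKER = "### Explanation"

PROMPT_SUFFIX = (
    "\n\nAfter your answer, add a section that starts with the line "
    "'### Explanation' and explains how you came up with the answer."
)


def split_answer_and_explanation(model_output: str) -> tuple[str, str]:
    """Split the model reply into (user-visible answer, hidden explanation)."""
    answer, _, explanation = model_output.partition(EXPLANATION_MARKER)
    return answer.strip(), explanation.strip()


def handle_reply(model_output: str) -> str:
    answer, explanation = split_answer_and_explanation(model_output)
    if explanation:
        # Hidden from end users; engineers can read it in the logs, or it
        # could be embedded as an HTML comment in a test configuration.
        logger.debug("model explanation: %s", explanation)
    return answer
```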

APIs in AI-based Applications

A hallmark of good software design is clear and unambiguous abstractions, with API interfaces between modules that eliminate potential misunderstandings and guide the user toward correct usage. From this perspective, exchanging free-form text between users and tools is the worst possible interface.

For example, pydantic-ai, part of the pydantic family of tools, is an agent framework (one of many…) that is appealing because, among other benefits, it type checks the values exchanged between tools.
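Even without a full agent framework, plain pydantic can enforce a typed contract on model output. The sketch below assumes pydantic v2; the schema and function names are illustrative, not pydantic-ai’s API:

```python
from pydantic import BaseModel, ValidationError


class MovieRecommendations(BaseModel):
    """Typed contract for what the model is expected to return."""

    titles: list[str]
    rationale: str


def parse_model_reply(raw_json: str) -> MovieRecommendations:
    """Validate the model's JSON reply against the contract, failing loudly
    instead of passing free-form text downstream."""
    try:
        return MovieRecommendations.model_validate_json(raw_json)
    except ValidationError as e:
        raise ValueError(f"model reply did not match the expected schema: {e}") from e
```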

Abstractions Encapsulate Complexities

Michael Feathers recently gave a talk called The Challenge of Understandability at Codecamp Romania, 2024.

Near the end, he discussed how the software industry has a history of introducing new levels of abstraction when complexity becomes a problem. For example, high-level programming languages removed most of the challenges of writing lower-level assembly code.

From this perspective, the nondeterministic nature of generative AI is a new source of complexity. While it obviously provides benefits (e.g., new ideas, summarization, etc.), it also makes testing harder. What kinds of abstractions make sense for AI that would help us with this form of complexity?

Is This Enough?

We still have the challenge of testing model behaviors themselves, especially for Integration and Acceptance tests that exercise how other parts of the application interact with models, both creating queries and processing results. The rest of the strategies and techniques explore these concerns, starting with External Verification.