The Venerable Principles of Coupling and Cohesion
Real applications, AI-enabled or not, combine many subsystems, usually including web pages for the user experience (UX), database and/or streaming systems for data retrieval and management, various libraries and modules, and calls to external services. Each of these components can be tested in isolation, and most are deterministic or can be made to behave deterministically for testing.
An AI application adds one or more Generative AI Models invoked through libraries or services. Everything else should be tested in the traditional, deterministic ways. Invocations of the model should be hidden behind an API abstraction that can be replaced at test time with a Test Double. Even for some Integration and Acceptance tests, use a model test double when the test isn't exercising the behavior of the model itself.
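The idea can be sketched as follows. This is a minimal illustration, not a specific library's API; the names `ChatModel`, `ChatModelDouble`, and `summarize` are hypothetical. The application depends only on the abstraction, so tests can inject a deterministic double.

```python
class ChatModel:
    """The API abstraction the rest of the application codes against.
    (Hypothetical name, for illustration only.)"""
    def complete(self, prompt: str) -> str:
        raise NotImplementedError


class ProductionChatModel(ChatModel):
    """Wraps the real generative model service (call omitted here)."""
    def complete(self, prompt: str) -> str:
        ...  # invoke the real model endpoint


class ChatModelDouble(ChatModel):
    """Deterministic test double: canned responses keyed by prompt."""
    def __init__(self, canned: dict[str, str]):
        self.canned = canned
        self.calls: list[str] = []  # record calls so tests can assert on them

    def complete(self, prompt: str) -> str:
        self.calls.append(prompt)
        return self.canned.get(prompt, "UNEXPECTED PROMPT")


def summarize(model: ChatModel, text: str) -> str:
    """Application code that depends only on the abstraction."""
    return model.complete(f"Summarize: {text}")


# In a test, the double makes the "model" fully deterministic:
double = ChatModelDouble({"Summarize: long report": "short summary"})
assert summarize(double, "long report") == "short summary"
assert double.calls == ["Summarize: long report"]
```

Because `summarize` never mentions the concrete model class, the same code path runs unchanged in production and under test.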
But that still leaves the challenge of testing model behaviors, as well as Integration and Acceptance tests that exercise how other parts of the application interact with models, both creating queries and processing results. The rest of the strategies and techniques explore these concerns.
Possible “Tactics”
APIs in AI-based Applications
A hallmark of good software design is clear and unambiguous abstractions, with API interfaces between modules that try to eliminate potential misunderstandings and guide the user toward correct usage. Exchanging free-form text between users and tools is the worst possible interface from this perspective.
Tools like pydantic-ai, part of the pydantic family of tools, help here. It is an agent framework (one of many…) that is appealing because it type-checks the values exchanged between tools, among other benefits.
Abstractions Encapsulate Complexities
Michael Feathers recently gave a talk called The Challenge of Understandability at Codecamp Romania, 2024.
Near the end, he discussed how the software industry has a history of introducing new levels of abstraction when complexity becomes a problem. For example, high-level programming languages removed most of the challenges of writing lower-level assembly code.
From this perspective, the nondeterministic nature of generative AI is a complexity. While it obviously provides benefits (e.g., new ideas, summarization, etc.), it also makes testing harder. What kinds of abstractions make sense for AI that would help us with this form of complexity?