Component Design
What makes a good Unit? What makes a good Component, consisting of one or more units? How do good abstraction boundaries help us divide and conquer our challenges?
NOTE: Although technically this chapter is about designing Units and Components, we’ll just use component, which is the more conventional term in the literature about software design principles. We’ll use unit more frequently in the next chapter, on Test-Driven Development (TDD), because unit is the more conventional term in the TDD literature. Sorry for any confusion…
Highlights:
- The classic design principles of Coupling and Cohesion still apply: “Loosely couple” to dependencies and design each Component with one purpose and a clear abstraction.
- Use non-AI components for functionality that is mature and robustly implemented, where performance will be better and more predictable than relying on AI to provide the functionality. Examples include logical reasoning, planning, code verification, etc.
- Encapsulate each generative AI Model and Agent in a component separate from non-AI components. Build and test the non-AI components in “traditional” ways.
- Use an AI component’s interface to constrain allowed inputs and ensure usable outputs for optimal control and testability, while retaining AI’s unique utility.
- Make testing as deterministic as possible by using Test Doubles as a stand-in for an AI component when testing other components that depend on it. If you can’t make it completely deterministic, try to bound the scope of responses for test prompts.
- Leverage other tools, like type checking, to make your interfaces more robust.
The Venerable Principles of Coupling and Cohesion
Real applications, AI-enabled or not, combine many components, such as web pages for the user experience (UX), database and/or streaming systems for data retrieval and management, third-party libraries, and external web services. Each of these components should be testable in isolation, which is easiest when their dependencies are well-encapsulated and easy to replace with Test Doubles (see also below), and when most behave deterministically or can be made to behave deterministically for testing. Good software design is a divide and conquer strategy.
A good component should have a clear purpose with understandable State and Behaviors.
Good abstraction boundaries are key. The terms Coupling and Cohesion embody the qualities of good abstractions, as expressed through programming language interfaces or web APIs. A well-designed component interface is loosely coupled to its dependencies. It also has high cohesion, which means it supports a single, logical purpose, with clear behaviors for all its Functions (or other ways of invocation), and state that’s easy to comprehend. If the component state is Mutable, state transitions follow a well-designed State Machine that defines how transitions can happen.
Abstractions Encapsulate Complexities
Michael Feathers recently gave a talk called The Challenge of Understandability at Codecamp Romania, 2024.
Near the end, he discussed how the software industry has a history of introducing new levels of abstractions when complexity becomes a problem. For example, high-level programming languages removed most of the challenges of writing low-level assembly code.
From this perspective, the nondeterministic nature of generative AI is a significant source of complexity. While generative AI has the potential to provide many benefits (e.g., ease of use for non-technical users, generation of new ideas, productivity acceleration, etc.), it also makes testing and reliability much harder. What kinds of abstractions make sense for AI that would help us manage this new form of complexity?
An AI application is like any other application, except it adds one or more Generative AI Models invoked directly through libraries and web services, or invoked indirectly through Agents and the Model Context Protocol (MCP).
The first lesson we should apply is to clearly encapsulate AI dependencies separately from the rest of the components that behave deterministically. Then we can focus on the nondeterministic behaviors and how to design and test them.
All units and components that don’t encapsulate models or directly handle model responses should be designed and implemented to be as deterministic as possible and tested using the traditional, deterministic testing techniques.
Bring in the Experts (i.e., Other Services)
Given the application’s responsibilities, which ones should be implemented with AI and which ones should not? We know we need to ensure safe output (e.g., free of hate speech), avoid hallucination, and, in general, ensure that generative AI outputs are suitable for the intended purpose. Here are some thoughts about how to assign responsibilities to different kinds of components in your applications.
Bias Towards Non-AI Tools
Social media is full of examples of otherwise highly capable AI systems failing to get basic factual information correct, like historical events, science, etc.
When possible, rely on more dependable methods to find factual information, like searching reputable information repositories, internal data sources accessed through RAG, etc.
Use non-AI tools to perform logical and mathematical reasoning, to do planning and routing, validate code quality, etc. At the very least, use non-AI tools to validate AI responses in live systems, not just as a testing strategy, which we discuss in more detail in External Tool Verification. Agents and MCP are popular approaches for tool integration.
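As a minimal illustration of non-AI verification, here is a sketch (the function name and surrounding workflow are hypothetical) that uses Python’s standard `ast` module to check that AI-generated code at least parses before it is accepted:

```python
# A sketch of non-AI verification of an AI response: before accepting
# AI-generated Python code, check that it is syntactically valid.
import ast

def validate_generated_code(code: str) -> bool:
    """Return True if the generated code parses as valid Python."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

if __name__ == "__main__":
    good = "def add(a, b):\n    return a + b\n"
    bad = "def add(a, b)\n    return a + b\n"   # missing colon
    assert validate_generated_code(good)
    assert not validate_generated_code(bad)
```

The same pattern extends to other deterministic checkers: linters, type checkers, schema validators, or a planner that verifies a proposed sequence of actions.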
As much as possible, restrict use of generative AI to tasks for which it is most reliable and useful, such as translating between human language and tool invocations (and vice-versa), summarizing retrieved information (without extrapolation or speculation), and other well-constrained tasks that would not require deep intuition or subject-matter expertise if a human performed them instead.
Popular architectural patterns like RAG and agents emerged because generative models by themselves are not sufficient to do “everything”. Bringing together the best tools for each task creates more reliable AI-enabled applications.
We will look at more specific examples of tools in External Tool Verification.
Mitigate Risks with Human in the Loop
Use a human in the loop, meaning require human intervention for any decision, or to approve any action, with significant consequences. Over time, your confidence in the system should grow to allow greater autonomy, but make sure this confidence is earned. A small sketch of such a gate follows.
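Here is a minimal sketch of a human-in-the-loop gate. The `Action` type and the approval callback are hypothetical placeholders; in a real system the approver would be a review UI, a ticket queue, or an on-call person.

```python
# A sketch of a simple human-in-the-loop gate: any action tagged as
# high-consequence requires explicit approval before it runs.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Action:
    description: str
    high_consequence: bool
    execute: Callable[[], None]

def run_with_approval(action: Action, approver: Callable[[str], bool]) -> bool:
    """Run the action, asking a human approver first when the stakes are high."""
    if action.high_consequence and not approver(action.description):
        return False  # rejected; a real system would log and surface this
    action.execute()
    return True

if __name__ == "__main__":
    action = Action("Cancel customer order #42", True, lambda: print("cancelled"))
    # Simulate a human rejecting the action.
    print(run_with_approval(action, approver=lambda description: False))
```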
Encapsulate Each Model Behind an API
Let’s discuss abstractions that wrap AI components, especially direct invocations of models. These abstractions provide several benefits.
Manipulate the Prompts and Responses
It is common for any interface to an underlying component to do some transformation and filtering of inputs and outputs, and also to impose restrictions on invocations (see, in particular, Design by Contract). Similarly, it may post-process the results into a more usable form. In production systems, logging and tracing of activity, security enforcement, etc. may occur at these boundaries.
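The following sketch shows the kind of boundary described above: a thin wrapper that validates and transforms inputs, logs the exchange, and post-processes the raw model output. The `ModelClient` protocol and `QueryService` class are hypothetical, not part of any particular library.

```python
# A sketch of an AI component boundary: input validation, logging, and
# post-processing of the raw model response all happen here.
import logging
from typing import Protocol

logger = logging.getLogger("ai_boundary")

class ModelClient(Protocol):
    def complete(self, prompt: str) -> str: ...

class QueryService:
    MAX_INPUT_CHARS = 500

    def __init__(self, model: ModelClient):
        self._model = model

    def answer(self, question: str) -> str:
        question = question.strip()
        if not question or len(question) > self.MAX_INPUT_CHARS:
            raise ValueError("question must be 1-500 characters")  # precondition
        logger.info("question received: %r", question)
        raw = self._model.complete(f"Answer concisely: {question}")
        answer = raw.strip()  # post-processing would go here
        logger.info("answer returned: %r", answer)
        return answer
```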
For an AI-enabled unit, allowing open-ended prompts greatly increases the care required to prevent undesirable use and the resulting testing burden. How can the allowed inputs to this unit be constrained, so the AI benefits are still available, but the potential downsides are easier to manage?
From the perspective of good software engineering practices, exchanging free-form text between humans and tools, or between tools, is the worst possible interface you can use, because it is impossible to reason about the behavior, enforce constraints, predict all possible behaviors, and write repeatable, reliable, and comprehensive tests. This is a general paradox for APIs: the more open-ended the “exchanges” that are allowed, the more progress and utility are constrained. So, we will get the benefits of generative AI only if we successfully manage this serious disadvantage against its potential benefits.
When possible, don’t provide an open-ended chat interface for inputs, but instead constrain inputs to a set of values from which a prompt is generated for the underlying AI model. This approach allows you to retain the control you need, while often providing a better user experience, too.
A familiar analog is the known security vulnerability, SQL Injection, where we should never allow users to specify SQL fragments in plain text that are executed by the system. A malicious user could cause a destructive query to execute. Instead, the user is offered a constrained interface for data and actions that are permitted. The underlying SQL query is generated from that input.
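Here is a sketch of that idea: the user chooses from enumerated values, and the prompt is generated from a template, much like a parameterized SQL query. The enumerations, template, and function name are hypothetical.

```python
# A sketch of constraining inputs instead of accepting free-form text:
# only enumerated values reach the prompt template.
from enum import Enum

class ReportTopic(Enum):
    BILLING = "billing"
    SHIPPING = "shipping"
    RETURNS = "returns"

class Tone(Enum):
    BRIEF = "brief"
    DETAILED = "detailed"

PROMPT_TEMPLATE = (
    "Summarize the customer's {topic} history in a {tone} style. "
    "Use only the records provided. Do not speculate."
)

def build_prompt(topic: ReportTopic, tone: Tone) -> str:
    """No raw user text is interpolated into the prompt."""
    return PROMPT_TEMPLATE.format(topic=topic.value, tone=tone.value)

if __name__ == "__main__":
    print(build_prompt(ReportTopic.BILLING, Tone.BRIEF))
```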
If you do have a chat component, what can you do immediately within the component to transform the user input into a safer, more usable form?
Similarly, avoid returning “raw” AI-generated replies. This creates the same kind of significant burden for handling results, which this time has to be borne by the components that depend on the AI component. For their benefit, can you restrict or translate the response in some way that narrows the “space” of possible results returned to them?
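One way to narrow that space is to map the raw reply onto a small, closed set of outcomes, with an explicit fallback when it doesn’t fit. The intent categories below are hypothetical, loosely echoing the healthcare ChatBot example discussed next.

```python
# A sketch of translating a raw model reply into one of a few well-defined
# intents, so downstream components handle three known values, not arbitrary text.
from enum import Enum

class Intent(Enum):
    PRESCRIPTION_RENEWAL = "prescription_renewal"
    APPOINTMENT = "appointment"
    OTHER = "other"   # explicit fallback, routed to a human or another path

def classify_reply(raw_reply: str) -> Intent:
    normalized = raw_reply.strip().lower()
    for intent in (Intent.PRESCRIPTION_RENEWAL, Intent.APPOINTMENT):
        if intent.value in normalized:
            return intent
    return Intent.OTHER
```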
In the TDD section, we will explore an example involving frequently asked questions in a healthcare ChatBot, such as the common request for a prescription renewal. We will see how we can design our prompts so that such questions are mapped to a single, deterministic reply, which is easy to handle downstream, as well as to test effectively. Other, more general patient prompts will require different handling.
Hence, we will see that the idea of transforming arbitrary user input into a more constrained and manageable form, even deterministic outputs, is feasible and reduces our challenges.
Hide Model Details
The encapsulation also hides details of the underlying generative model from components that depend on it. We can substitute updated model versions or wholly different models with no API impact. However, even updating an existing model to a newer version often changes how it responds to the same prompts, and it may require rewriting the prompt template used. Fortunately, such changes can be kept invisible to the users of the AI component. Also, such changes can be tested thoroughly using the test suite you already have (right?) for the component.
If there are breaking changes that affect dependents, can you modify how you construct the prompt or process the results to keep the behavior of the abstraction unchanged? If not, and you decide to proceed with the upgrade anyway, dependents will have to be modified accordingly to accommodate the changed behavior.
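A sketch of that separation: each backend owns its model name and prompt template, so a model upgrade (and any prompt rewrite it forces) never changes the API seen by dependents. The `Summarizer` interface, backend classes, and `_call_model` placeholder are hypothetical.

```python
# A sketch of keeping model details hidden behind a stable interface.
from typing import Protocol

class Summarizer(Protocol):
    def summarize(self, text: str) -> str: ...

class ModelV1Summarizer:
    _TEMPLATE = "Summarize the following text in one paragraph:\n{text}"

    def summarize(self, text: str) -> str:
        return _call_model("model-v1", self._TEMPLATE.format(text=text))

class ModelV2Summarizer:
    # The newer model needed a rewritten prompt, but dependents never see it.
    _TEMPLATE = "You are a careful editor. Produce a one-paragraph summary:\n{text}"

    def summarize(self, text: str) -> str:
        return _call_model("model-v2", self._TEMPLATE.format(text=text))

def _call_model(model_name: str, prompt: str) -> str:
    # Placeholder for an actual model invocation (HTTP client, SDK call, etc.).
    raise NotImplementedError
```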
Design Considerations for Test Doubles
In TDD and Generative AI, we start our exploration of how to create tests for AI-enabled components. Here, we consider the case where we are Unit Testing another, non-AI component that depends on an AI-enabled dependency. Tests, like components, should have a single purpose (cohesion). The unit tests we are writing now exercise other aspects of behavior and require deterministic behavior to be reliable, so they should not have to handle the Stochastic behavior the AI-enabled dependency normally provides.
We said that all the unit tests for this non-AI component should use a test double, not the real AI dependency. We must write unit tests to exercise how the component responds to any potential responses it might receive from the real AI dependency. The easiest way to do this is to first understand, as best we can, the space of all possible behaviors, including error scenarios, and then write tests that explore this space exhaustively and ensure the component being tested handles them all correctly. In contrast, it should be the Integration Tests that explore what happens with real interactions.
So, we still need to test the behavior of the component when it interacts with the real AI dependency. This is the role of some of the Integration Tests. We fully expect these tests to occasionally catch query-response interactions that we didn’t anticipate in our space analysis of possibilities, so they aren’t covered by our existing test doubles and unit tests. When this happens, we will need to add or modify our unit tests and test doubles to account for the new behaviors observed.
Also, some integration tests will focus on other, non-stochastic aspects of the integration. Those tests should use test doubles for the AI component, too.
In contrast, Acceptance Tests should never use test doubles, because their sole purpose is final validation that a feature is working as designed, running in the full, real system, including all generative AI and other real world dependencies.
So, Test Doubles take the place of dependencies when needed to ensure predictable behavior, to eliminate overhead not needed for the test (like calling a remote service), and to simulate all possible behaviors the real dependency might exhibit, including error scenarios. The simulation role is important to ensure the component being tested is fully capable of handling anything the dependency throws at it. It can also be very difficult to “force” the real dependency to produce some behaviors, like triggering certain error scenarios.
In traditional software, it is somewhat uncommon for a component developer to also deliver test doubles of the component for use in tests of other components that depend on it. At best, the test suite for the dependency might cover all known behaviors the dependency might exhibit, but it is also essential to test the components that use it to ensure they respond to all these behaviors correctly. Hence, they need a way to trigger all possible behaviors in the dependency. In current practice, it is up to the user of a dependency to understand all the behaviors (which is good to do) and write her own test doubles to simulate all of them (which is a burden).
For a component with non-trivial behaviors, especially complex error scenarios, AI-based or not, consider delivering test doubles of the component along with the real component, where the test doubles simulate every possible component behavior.
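A sketch of what a delivered test double might look like, together with a unit test for a component that depends on it. The `ChatServiceDouble`, `SupportDesk`, and error type are illustrative names, not from any particular library.

```python
# A sketch of a test double for a hypothetical AI-backed chat service, plus a
# unit test showing how a dependent component uses it to exercise error handling.
class ChatServiceError(Exception):
    pass

class ChatServiceDouble:
    """Simulates the real AI component: canned replies or a forced failure."""

    def __init__(self, canned_reply: str = "OK", fail: bool = False):
        self._canned_reply = canned_reply
        self._fail = fail
        self.prompts_seen: list[str] = []   # lets tests assert on usage

    def ask(self, prompt: str) -> str:
        self.prompts_seen.append(prompt)
        if self._fail:
            raise ChatServiceError("simulated service failure")
        return self._canned_reply

class SupportDesk:
    """The non-AI component under test; degrades gracefully on failure."""

    def __init__(self, chat_service):
        self._chat = chat_service

    def respond(self, question: str) -> str:
        try:
            return self._chat.ask(question)
        except ChatServiceError:
            return "Sorry, please try again later."

def test_support_desk_degrades_gracefully():
    desk = SupportDesk(ChatServiceDouble(fail=True))
    assert desk.respond("renew my prescription") == "Sorry, please try again later."
```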
Lessons Learned Writing Test Doubles at Netflix
In Testing Problems, we mentioned that Netflix faced similar testing challenges back in 2008 for their recommendation systems. Part of their solution was to write model test doubles that would “… dynamically create similar input content for tests classified along the axes that mattered for the algorithm.”
Netflix also added extra hidden output that showed the workings of the algorithm, i.e., for Explainability, when running a test configuration. Details about model weights, algorithmic details, etc. were encoded as HTML comments, visible if their developers viewed the page source. This information helped them understand why a particular list of movies was chosen, for example, in a test scenario.
In their experience, it was not feasible for all AI test doubles to return deterministic responses. However, they could constrain the responses to fit into defined “classes” with boundaries of some sort. So, some of our test doubles may need to use a stochastic model of some kind (generative AI or not) that generates nondeterministic outputs that fit within our identified “classes”. Those generators are used in test doubles so that tests for dependent components can see the full range of possible outputs in each “class”. The tests will have to be designed to handle nondeterministic, but constrained, behaviors.
This also suggests that you should have test doubles that deliberately return unacceptable responses, meaning out of acceptable bounds. These test doubles would be used for testing error handling and graceful degradation scenarios. Note that we used the word unacceptable, not unexpected. While it’s not possible to fully anticipate every specific generative model output, we have to work extra hard to anticipate all possible kinds of outputs, good and bad, and design handling accordingly.
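Here is a sketch of both kinds of doubles: one whose replies are nondeterministic but stay within a known “class” of acceptable outputs, and one that deliberately returns unacceptable replies for error-handling tests. The recommender interface and genre values are hypothetical.

```python
# A sketch of a constrained-stochastic test double and a deliberately
# unacceptable one, plus a test that tolerates variation within bounds.
import random

class ConstrainedRecommenderDouble:
    """Nondeterministic, but every reply stays within a bounded class."""

    ACCEPTABLE_GENRES = ("drama", "comedy", "documentary")

    def recommend(self, user_id: str) -> list[str]:
        count = random.randint(1, 3)                          # bounded list length
        return random.sample(self.ACCEPTABLE_GENRES, count)   # bounded values

class UnacceptableRecommenderDouble:
    """Deliberately returns out-of-bounds replies to exercise error handling."""

    def recommend(self, user_id: str) -> list[str]:
        return []   # an empty result: one kind of unacceptable response

def test_dependent_handles_constrained_variation():
    double = ConstrainedRecommenderDouble()
    result = double.recommend("user-123")
    # The test tolerates variation but enforces the class boundaries.
    assert 1 <= len(result) <= 3
    assert set(result) <= set(ConstrainedRecommenderDouble.ACCEPTABLE_GENRES)
```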
Successful, reliable software systems are designed to expect all possible Scenarios in all Use Cases, including failures of any kind. Encountering an unexpected scenario should be considered a design failure.
More Tools for API Design
Finally, for completeness, there are other traditional tools that make designs more robust.
Type checking is a programming language technique to constrain the allowed values for variables, Function arguments, and return values. Dynamically typed languages like Python don’t require explicit declarations of types, but many of these languages permit optional type declarations, with type-checking tools that catch many errors where values of incompatible types are used. This checking eliminates a lot of potential bugs.
In the Python community, pydantic is one of these type-checking tools. The project has an Agent framework called pydantic-ai that uses type checking of results returned by models and other tool invocations to make these interactions more robust.
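To illustrate the general idea (not the internals of pydantic-ai), here is a sketch, assuming pydantic v2, that validates a model’s JSON reply against a declared schema before any other component sees it. The `TriageResult` schema and the replies are hypothetical.

```python
# A sketch of schema-validating a model's JSON reply with pydantic.
from typing import Literal
from pydantic import BaseModel, ValidationError

class TriageResult(BaseModel):
    category: Literal["billing", "technical", "other"]
    confidence: float

def parse_model_reply(raw_json: str) -> TriageResult | None:
    """Accept the reply only if it matches the schema; otherwise signal failure."""
    try:
        return TriageResult.model_validate_json(raw_json)
    except ValidationError:
        return None   # caller falls back to a safe default or retries

if __name__ == "__main__":
    print(parse_model_reply('{"category": "billing", "confidence": 0.92}'))
    print(parse_model_reply('{"category": "pizza", "confidence": "high"}'))  # rejected -> None
```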
A different approach to achieving greater resiliency is OpenDXA with DANA. Here, they seek to establish better control over model behaviors by automatically learning to be more effective.
What’s Next?
Review the highlights summarized above, then proceed to our discussion of Test-Driven Development.