Testing Strategies and Techniques
After discussing architecture and design considerations, we turn to testing strategies and techniques with the goal of creating reliable, repeatable tests for generative AI applications.
For educational purposes, we demonstrate techniques using relatively simple tools that are easy to try. Although these tools are suitable for real project development, we also briefly discuss more sophisticated tools that may be preferable for more advanced uses, larger teams, and similar situations. These additional tools are described in sections with titles that begin with Other Tools… near the end of each chapter. Note also that many startups and consulting organizations now offer proprietary tools and services to aid developer testing, but we won’t cover those options.
We emphasize concepts in these chapters. For the details of applying them to automated test suites for real-world projects, see the companion ChatBot application, specifically the section Automated Testing: Practical Enhancements.
Finally, the end of each chapter has an Experiments to Try section for further exploration.
TIPS:
- These user guide chapters focus on concepts, while the ChatBot application applies them to a demonstration project that uses generative AI.
- Anthropic’s post, Demystifying evals for AI agents, provides a valuable overview of evaluation concepts, with advanced guidance on testing Agents. It is recommended reading. (See also the References.)
NOTE:
Using a Generative AI Model can mean that the application manages the model itself behind library APIs, or that it accesses the model as a remote service, such as ChatGPT, or through a protocol like MCP. The model can also be part of more advanced design patterns like Agents and RAG. Evaluating models alone is not sufficient, because these surrounding tools can modify prompts and responses. So, just as classic Unit Tests, Integration Tests, and Acceptance Tests cover everything from individual Units to the Components that aggregate them, our AI tests need to cover not just model prompts and responses, but also the units and components that contain the models. Nevertheless, for simplicity, we will often work with models directly.
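To make the distinction between testing a model and testing the units and components around it more concrete, here is a minimal Python sketch. It is an illustration only, not code from the ChatBot application: the names build_prompt, clean_response, answer, and stub_model, as well as the prompt template, are all hypothetical. The model is assumed to be any callable that maps a prompt string to a response string, so a deterministic stub can stand in for it during tests.

```python
from typing import Callable

# Assumption: a "model" is any callable from prompt text to response text.
Model = Callable[[str], str]


def build_prompt(question: str) -> str:
    """Unit: wrap the user's question in a (hypothetical) prompt template."""
    return f"Answer concisely.\nQuestion: {question.strip()}\nAnswer:"


def clean_response(raw: str) -> str:
    """Unit: strip whitespace and an optional leading 'Answer:' label."""
    text = raw.strip()
    if text.startswith("Answer:"):
        text = text[len("Answer:"):]
    return text.strip()


def answer(question: str, model: Model) -> str:
    """Component: aggregates both units around the model call."""
    return clean_response(model(build_prompt(question)))


def stub_model(prompt: str) -> str:
    """Deterministic stand-in for a real model, used only in tests."""
    return "  Answer: 4\n"


# Unit-level checks: no model is involved at all.
assert build_prompt("  What is 2 + 2? ") == (
    "Answer concisely.\nQuestion: What is 2 + 2?\nAnswer:"
)
assert clean_response("  Answer: 4\n") == "4"

# Component-level check: the stub isolates the surrounding code from the model.
assert answer("What is 2 + 2?", stub_model) == "4"
```

Because the prompt template and the response cleanup both change what the model sees and what the user receives, a test of the model alone would miss defects in either one. Replacing stub_model with a real model turns the last check into an integration-style test, whose results may no longer be deterministic.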
What’s Next?
Start with Unit Benchmarks. Refer to the Glossary regularly for definitions of terms. See the References for more information.
