References
This page collects references that provide more details on testing, especially in the AI context, and on other topics. Note that external references to particular tools mentioned elsewhere on this website are not repeated here.
Table of contents
- References
- Adrian Cockcroft
- AI for Education
- Allen Institute for AI
- Alignment Forum
- Babeş-Bolyai University
- CVS Health Data Science Team
- Dean Wampler
- Ekimetrics
- EleutherAI
- Evan Miller
- Google
- Hamel Husain
- IBM
- James Thomas
- Jiayi Yuan, et al.
- John Snow Labs and Pacific.ai
- LastMile AI
- Merriam-Webster Dictionary
- Meta
- Michael Feathers
- MLCommons Glossary
- Nathan Lambert
- NIST Risk Management Framework
- OpenAI
- Open Data Science
- Patronus
- PlurAI
- Red Hat
- RedMonk
- ServiceNow
- Specification-Driven Development
- University of Tübingen
- Unsloth
- Wikipedia
 
 
Adrian Cockcroft
Dean Wampler and Adrian Cockcroft exchanged messages on Mastodon about lessons learned at Netflix, which are quoted in several sections of this website. See also Dean Wampler and the discussion in Testing Problems Caused by Generative AI Nondeterminism.
AI for Education
The AI for Education organization provides lots of useful guidance on how to evaluate AI for different education use cases and select benchmarks for them. See also their Hugging Face page.
Allen Institute for AI
Open Instruct from the Allen Institute for AI pursues goals similar to those of InstructLab. It is discussed by Nathan Lambert (below). See From Testing to Tuning for more details.
Alignment Forum
The Alignment Forum works on many aspects of alignment.
Babeş-Bolyai University
Synthetic Data Generation Using Large Language Models: Advances in Text and Code surveys techniques that use LLMs, like those we explore in the Unit Benchmarks chapter and elsewhere.
CVS Health Data Science Team
CVS, the US-based retail pharmacy and healthcare services company, has a large data science team. They recently open-sourced uqlm, where UQLM stands for Uncertainty Quantification for Language Models. It is a Python package for UQ-based LLM hallucination detection.
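As a rough illustration of the underlying idea, black-box uncertainty quantification can be approximated by sampling several responses to the same prompt and scoring their agreement; low agreement suggests the model may be guessing. The sketch below is generic and is not uqlm's actual API; the `generate()` stub and its canned answers are invented.

```python
# A minimal sketch of black-box, UQ-based hallucination detection: sample
# several responses and measure how much they agree. The generate() stub
# below is a toy stand-in, not uqlm's actual API.
import random
from difflib import SequenceMatcher
from itertools import combinations

def generate(prompt: str) -> str:
    """Toy stand-in for a sampled LLM call; replace with your model client."""
    return random.choice([
        "George Eliot wrote Middlemarch.",
        "Middlemarch was written by George Eliot.",
        "Jane Austen wrote Middlemarch.",
    ])

def consistency_score(prompt: str, num_samples: int = 5) -> float:
    """Mean pairwise text similarity across sampled responses, in [0, 1]."""
    responses = [generate(prompt) for _ in range(num_samples)]
    sims = [SequenceMatcher(None, a, b).ratio()
            for a, b in combinations(responses, 2)]
    return sum(sims) / len(sims)

score = consistency_score("Who wrote 'Middlemarch'?")
print(f"consistency = {score:.2f}")  # low scores can be routed to a human or a retrieval step
```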
Dean Wampler
In Generative AI: Should We Say Goodbye to Deterministic Testing? Dean Wampler (one of this project’s contributors) summarizes the work of this project. After posting the link to the slides, Dean and Adrian Cockcroft discussed lessons learned at Netflix, which have informed this project’s content.
Ekimetrics
ClairBot from the Responsible AI Team at Ekimetrics is a research project that compares several model responses for domain-specific questions, where each model has been tuned for a particular domain; in this case, ad serving, laws and regulations, and social sciences and ethics. See also the Unit Benchmarks chapter.
EleutherAI
EleutherAI’s definition of Alignment is quoted in the glossary definition for it.
Evan Miller
Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations is a research paper arguing that evaluations (see the Trust and Safety Evaluation Initiative for more details) should use proper statistical analysis of their results. It is discussed in Statistical Evaluation.
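As a rough illustration of the kind of analysis the paper advocates (this sketch and its sample scores are invented, not taken from the paper), an eval score can be reported with a confidence interval derived from the standard error of the mean, rather than as a bare point estimate:

```python
# A minimal sketch of reporting an eval score with error bars: treat
# per-question scores as samples and report the mean with a ~95% confidence
# interval based on the standard error of the mean. The scores list is
# illustrative data, not results from a real eval.
import math

def mean_with_ci(scores: list[float], z: float = 1.96) -> tuple[float, float]:
    """Return (mean, half-width of ~95% CI) via the normal approximation."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)  # sample variance
    sem = math.sqrt(var / n)                              # standard error of the mean
    return mean, z * sem

scores = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0]  # 1 = pass, 0 = fail
mean, half_width = mean_with_ci(scores)
print(f"accuracy = {mean:.2f} +/- {half_width:.2f} (95% CI)")
```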
Google
Google’s Agent Development Kit has a chapter called Why Evaluate Agents?, which provides tips for writing evaluations specifically tailored for agents. See the discussion in the [Testing Agents](/ai-application-testing/testing-strategies/testing-agents/) chapter.
Hamel Husain
Your AI Product Needs Evals is a long blog post that discusses testing of AI applications and makes many of the same points this user guide makes.
IBM
RAG
This IBM blog post, What is retrieval-augmented generation?, provides a good overview of RAG.
Evaluation and Benchmark Tools
For the following tool, see the LLM as a Judge chapter for more details:
- EvalAssist (paper) is designed to make LLM as a Judge evaluations of data easier for users, including incremental refinement of the evaluation criteria using a web-based user experience. EvalAssist supports direct assessment (scoring) of data items individually, which we used in our LLM as a Judge chapter, or pairwise comparisons, where the better of two answers is chosen. (A sketch of the direct-assessment pattern follows this list.)
 
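Here is a rough sketch of the direct-assessment pattern mentioned above; the `judge_model()` stub, the rubric, and the JSON reply format are hypothetical and are not EvalAssist's API:

```python
# A rough sketch of "direct assessment" LLM-as-a-Judge scoring: the judge is
# given explicit criteria and one answer, and returns a structured score.
# judge_model(), the rubric, and the JSON format are hypothetical.
import json

RUBRIC = (
    "Score the ANSWER to the QUESTION from 1 (poor) to 5 (excellent) for "
    'factual accuracy and completeness. Reply with JSON: {"score": N, "reason": "..."}'
)

def judge_model(prompt: str) -> str:
    """Placeholder for a call to the judge LLM."""
    raise NotImplementedError("wire this to your judge model client")

def direct_assessment(question: str, answer: str) -> dict:
    prompt = f"{RUBRIC}\n\nQUESTION: {question}\nANSWER: {answer}"
    return json.loads(judge_model(prompt))

# Example usage (once judge_model is wired up):
# result = direct_assessment("What is RAG?", candidate_answer)
# assert result["score"] >= 4, result["reason"]
```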
For the following tool, see the [Testing Agents](/ai-application-testing/testing-strategies/testing-agents/) chapter for more details:
- AssetOpsBench is a unified framework for developing, orchestrating, and evaluating domain-specific AI agents in industrial asset operations and maintenance. Designed for maintenance engineers, reliability specialists, and facility planners, it allows reproducible evaluation of multi-step workflows in simulated industrial environments.
 
For the following tools, see the [Unit Benchmarks](/ai-application-testing/testing-strategies/unit-benchmarks/) chapter for more details:
- FailureSensorIQ is a dataset for multiple aspects of reasoning through failure modes, sensor data, and the relationships between them across various industrial assets.
 - FIBEN Benchmark is a finance dataset benchmark for natural language queries.
 - HELM Enterprise Benchmark is an enterprise benchmark framework for LLM evaluation. It extends HELM, an open-source benchmark framework developed by Stanford CRFM, to enable users to evaluate LLMs with domain-specific datasets in areas such as finance, legal, climate, and cybersecurity.
 
James Thomas
James Thomas is a QA engineer who posted a link to a blog post How do I Test AI? that lists criteria to consider when testing AI-enabled systems. While the post doesn’t provide a lot of details behind the list items, the list is excellent for stimulating further investigation.
Jiayi Yuan, et al.
The paper Give Me FP32 or Give Me Death? Challenges and Solutions for Reproducible Reasoning examines the influence of floating-point precision on the reproducibility of inference results, even when randomness is restricted, such as by using a low “temperature”. Of course, the theme of our project is dealing with the inherent randomness of inference, but there are also times when limiting that randomness is important.
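To make the precision issue concrete, here is a small, self-contained illustration (not taken from the paper): floating-point addition is not associative, and the same value rounds differently at different precisions, so reordered or reduced-precision accumulations inside a model can change outputs even with identical inputs.

```python
# Small illustration (not from the paper) of why precision affects
# reproducibility: grouping and precision both change rounding behavior.
import numpy as np

# Associativity failure in float64: same numbers, different grouping.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))   # False

# The same value rounds differently at different precisions.
print(np.float32(0.1) == np.float64(0.1))        # False
print(np.float16(2049.0) == np.float16(2048.0))  # True: 2049 is not representable in fp16
```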
John Snow Labs and Pacific.ai
John Snow Labs has created langtest, a test generation and execution framework with “60+ test types for comparing LLM & NLP models on accuracy, bias, fairness, robustness & more.”
The affiliated company Pacific.ai offers a commercial testing system with similar features.
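To give a flavor of what such tests look like (a generic sketch with invented names, not langtest's API), a robustness check might perturb an input and assert that the model's output is stable:

```python
# A generic sketch of a robustness test (not langtest's API): perturb the
# input and check that the model's answer does not change. classify() is a
# toy stand-in for the model under test.
def classify(text: str) -> str:
    """Toy stand-in: a keyword-based sentiment label."""
    return "negative" if any(w in text.lower() for w in ("delayed", "rude")) else "positive"

def perturb(text: str) -> str:
    """A crude perturbation: uppercase the text and drop one character."""
    return text.upper()[:-1]

def test_robust_to_noise():
    original = "The flight was delayed by three hours and the staff were rude."
    assert classify(original) == classify(perturb(original))

test_robust_to_noise()
```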
LastMile AI
MCP Eval is an evaluation framework for testing Model Context Protocol (MCP) servers and the agents that use them, rather than mocking interactions or testing components in isolation, as traditional approaches do. It is built on MCP Agent, their agent framework that emphasizes MCP as the communication protocol. See the [Testing Agents](/ai-application-testing/testing-strategies/testing-agents/) chapter for more details.
Merriam-Webster Dictionary
The Merriam-Webster Dictionary is quoted in our Glossary for several terms.
Meta
Meta’s synthetic-data-kit provides scalable support for larger-scale data synthesis and processing (such as translating between formats), especially for model Tuning with Llama models. See the Unit Benchmarks chapter for more details.
The Llama Stack project provides a Kubernetes Benchmark suite.
Michael Feathers
Michael Feathers gave a talk recently called The Challenge of Understandability at Codecamp Romania, 2024, which is discussed in Abstractions Encapsulate Complexities.
MLCommons Glossary
The MLCommons AI Safety v0.5 Benchmark Proof of Concept Technical Glossary is used to inform our Glossary.
Nathan Lambert
How to approach post-training for AI applications is a tutorial presented at NeurIPS 2024 by Nathan Lambert. It is discussed in the From Testing to Tuning chapter. See also this Interconnects post and the Allen Institute for AI entry above.
NIST Risk Management Framework
The U.S. National Institute of Standards and Technology’s (NIST) Artificial Intelligence Risk Management Framework (AI RMF 1.0) is used to inform our Glossary.
OpenAI
An OpenAI paper on reinforcement fine tuning is discussed in From Testing to Tuning.
Announcing OpenAI Pioneers Program describes an effort designed to help application developers optimize model performance in their domains.
Open Data Science
Nine Open-Source Tools to Generate Synthetic Data lists several tools that use different approaches for data generation. See the Unit Benchmarks chapter for more details.
Patronus
The Patronus guide, LLM Testing: The Latest Techniques & Best Practices, covers the unique testing challenges raised by generative AI and discusses various techniques for testing these systems.
FinanceBench is their benchmark for finance applications. See the Unit Benchmarks chapter for more details.
Evaluating Copyright Violations in LLMs has data and tools for detecting examples of responses that violate one or more copyrights. (This work isn’t discussed elsewhere in this user guide.)
PlurAI
Plurai.ai recently created an open-source project called Intellagent that demonstrates how to exploit recent research on automated test-data generation, knowledge graphs built from an application’s constraints and requirements, and automated test generation to verify the system’s alignment with those requirements. These techniques are designed to provide more exhaustive test coverage of behaviors, including catching corner cases. See the Statistical Evaluation chapter for more details.
Red Hat
InstructLab is a project started by IBM Research and developed by Red Hat. InstructLab provides conventions for organizing specific, manually created examples into a domain hierarchy, along with tools to perform model tuning, including synthetic data generation. Hence, InstructLab is an alternative way to generate synthetic data for Unit Benchmarks. See also From Testing to Tuning.
RedMonk
The analyst firm RedMonk posted this interesting piece on Specification-Driven Development.
ServiceNow
DoomArena is a framework for testing AI agents against evolving security threats. It offers a modular, configurable, plug-in architecture for testing the security of AI agents across multiple attack scenarios.
DoomArena enables detailed threat modeling, adaptive testing, and fine-grained security evaluations through real-world case studies, such as τ-Bench and BrowserGym. These case studies showcase how DoomArena evaluates vulnerabilities in AI agents interacting in airline customer service and e-commerce contexts.
Furthermore, DoomArena serves as a laboratory for AI agent security research, revealing insights about agent vulnerabilities, defense effectiveness, and attack interactions. See the [Testing Agents](/ai-application-testing/testing-strategies/testing-agents/) chapter for more details.
Specification-Driven Development
Specification-Driven Development (SDD) is a more structured approach to prompting LLMs, with explicit “phases” such as planning versus task execution, so that LLMs can do a better job of generating production-quality code that meets our requirements. We list several references here; see the discussion in the Specification-Driven Development chapter, where we explore them. A sketch illustrating the phased approach follows this list.
- How I Apply Spec-Driven AI Coding
 - Spec Kit
 - AWS Kiro, an AI IDE designed to support specification-driven development.
 
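The sketch below illustrates the “explicit phases” idea in the abstract; the `llm()` stub, the prompts, and the example spec are hypothetical and are not tied to any of the tools above.

```python
# A minimal sketch (not any specific SDD tool) of the "explicit phases" idea:
# ask the model for a plan against a written spec, review it, then ask for an
# implementation of one task at a time. llm() is a hypothetical placeholder.
SPEC = """Feature: password reset
- Users request a reset link by email.
- Links expire after 30 minutes.
- All attempts are logged for audit."""

def llm(prompt: str) -> str:
    """Placeholder for a call to your code-generation model."""
    raise NotImplementedError("wire this to your model client")

def plan_phase(spec: str) -> str:
    """Planning phase: produce a reviewable task list, no code yet."""
    return llm(f"Read this spec and reply with a numbered task plan only.\n\n{spec}")

def implement_phase(spec: str, task: str) -> str:
    """Execution phase: implement a single task, constrained by the spec."""
    return llm(f"Implement ONLY this task, following the spec.\n\nSPEC:\n{spec}\n\nTASK:\n{task}")

# plan = plan_phase(SPEC)   # review and edit the plan before proceeding
# code = implement_phase(SPEC, "1. Generate a signed, expiring reset token")
```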
University of Tübingen
Beyond Benchmarks: A Novel Framework for Domain-Specific LLM Evaluation and Knowledge Mapping is a research effort that explores an alternative approach to knowledge representation, compared to the Q&A pairs we use in this guide for benchmarks, without using LLMs to generate the data. See the [Unit Benchmarks](/ai-application-testing/testing-strategies/unit-benchmarks/) chapter for more details.
Unsloth
Unsloth is an OSS tool suite for model training and tuning. Their documentation includes guides for the following:
- A concise summary of best practices and tools for synthetic data generation.
- Tuning models to improve Chain of Thought reasoning.
Wikipedia
Many Wikipedia articles are used as references in our Glossary and other locations.
- Bertrand Meyer
- Cyclomatic complexity
- Deferent and epicycle
- Design by contract
- DevOps
- Eiffel (programming language)
- Generative adversarial network
- SQL injection
- Test-driven development
- Transmission Control Protocol
 
