
References

References with more details on testing, especially in the AI context, and on other topics. Note that external references for particular tools already mentioned elsewhere on this website are not repeated here.

Table of contents
  1. References
    1. Adrian Cockcroft
    2. AI for Education
      1. Allen Institute for AI
    3. Alignment Forum
    4. Babeş-Bolyai University
    5. CVS Health Data Science Team
    6. Dean Wampler
      1. Ekimetrics
    7. EleutherAI
    8. Evan Miller
    9. Google
    10. Hamel Husain
    11. IBM
      1. RAG
      2. Evaluation and Benchmark Tools
    12. James Thomas
    13. Jiayi Yuan, et al.
    14. John Snow Labs and Pacific.ai
    15. LastMile AI
    16. Merriam-Webster Dictionary
    17. Meta
    18. Michael Feathers
    19. MLCommons Glossary
    20. Nathan Lambert
    21. NIST Risk Management Framework
    22. OpenAI
    23. Open Data Science
    24. Patronus
    25. PlurAI
    26. RedHat
    27. RedMonk
    28. ServiceNow
    29. Specification-Driven Development
    30. University of Tübingen
    31. Unsloth
    32. Wikipedia

Adrian Cockcroft

Dean Wampler and Adrian Cockcroft exchanged messages on Mastodon about lessons learned at Netflix, which are quoted in several sections of this website. See also Dean Wampler and the discussion in Testing Problems Caused by Generative AI Nondeterminism.

AI for Education

The AI for Education organization provides lots of useful guidance on how to evaluate AI for different education use cases and select benchmarks for them. See also their Hugging Face page.

Allen Institute for AI

Open Instruct from the Allen Institute for AI pursues goals similar to those of InstructLab. It is discussed by Nathan Lambert (below). See From Testing to Tuning for more details.

Alignment Forum

The Alignment Forum works on many aspects of alignment.

Babeş-Bolyai University

Synthetic Data Generation Using Large Language Models: Advances in Text and Code surveys techniques that use LLMs, like those we explore in the Unit Benchmarks chapter and elsewhere.

CVS Health Data Science Team

CVS, the US-based retail pharmacy and healthcare services company, has a large data science team. They recently open-sourced uqlm, where UQLM stands for Uncertainty Quantification for Language Models. It is a Python package for UQ-based LLM hallucination detection.

Among the useful tools in this repository are scorers for black-box uncertainty quantification (based on the consistency of multiple sampled responses), white-box uncertainty quantification (based on token probabilities), LLM-as-a-judge scoring, and ensembles of these scorers.
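
To make the black-box, consistency-based idea concrete, here is a minimal sketch (not uqlm's actual API): sample several responses to the same prompt and treat low agreement as a signal of possible hallucination. The `generate` stub is a hypothetical stand-in for a real LLM call.

```python
# Minimal sketch of consistency-based (black-box) hallucination detection.
# `generate` is a hypothetical, nondeterministic LLM call, stubbed here so
# the example runs; uqlm provides more sophisticated scorers than this.
import itertools
import random

def generate(prompt: str) -> str:
    # Stand-in for a real LLM call; returns one of a few canned answers.
    return random.choice([
        "The Eiffel Tower is 330 meters tall.",
        "The Eiffel Tower is 330 meters tall.",
        "The Eiffel Tower is about 300 meters tall.",
    ])

def agreement_score(responses: list[str]) -> float:
    """Fraction of response pairs that match exactly (a crude consistency proxy)."""
    pairs = list(itertools.combinations(responses, 2))
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)

responses = [generate("How tall is the Eiffel Tower?") for _ in range(5)]
score = agreement_score(responses)
print(f"consistency = {score:.2f}")
if score < 0.5:
    print("Low consistency; flag the answer for review (possible hallucination).")
```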

Dean Wampler

In Generative AI: Should We Say Goodbye to Deterministic Testing? Dean Wampler (one of this project’s contributors) summarizes the work of this project. After posting the link to the slides, Dean and Adrian Cockcroft discussed lessons learned at Netflix, which have informed this project’s content.

Ekimetrics

ClairBot from the Responsible AI Team at Ekimetrics is a research project that compares several model responses for domain-specific questions, where each of the models has been tuned for a particular domain, in this case ad serving, laws and regulations, and social sciences and ethics. See also the Unit Benchmarks chapter.

EleutherAI

EleutherAI’s definition of Alignment is quoted in the glossary definition for it.

Evan Miller

Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations is a research paper arguing that evaluations (see the Trust and Safety Evaluation Initiative for more details) should use proper statistical analysis of their results. It is discussed in Statistical Evaluation.
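
As a minimal sketch of the kind of reporting the paper advocates, the example below (with made-up numbers) treats each benchmark question as a Bernoulli trial and attaches a standard error and a normal-approximation 95% confidence interval to the raw pass rate; the paper covers more careful treatments than this.

```python
# Report an eval pass rate with a standard error and 95% confidence interval,
# using the normal approximation for a binomial proportion.
import math

def pass_rate_with_ci(num_passed: int, num_questions: int, z: float = 1.96):
    p = num_passed / num_questions
    se = math.sqrt(p * (1 - p) / num_questions)  # standard error of the proportion
    return p, se, (p - z * se, p + z * se)

# Hypothetical results: 172 of 200 questions passed.
p, se, (low, high) = pass_rate_with_ci(num_passed=172, num_questions=200)
print(f"pass rate = {p:.3f} +/- {1.96 * se:.3f} (95% CI: {low:.3f} to {high:.3f})")
```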

Google

Google’s Agent Development Kit has a chapter called Why Evaluate Agents?, which provides tips for writing evaluations specifically tailored for agents. See the discussion in the [Testing Agents](/ai-application-testing/testing-strategies/testing-agents/) chapter.

Hamel Husain

Your AI Product Needs Evals is a long blog post that discusses testing of AI applications and makes many of the same points this user guide makes.

IBM

RAG

This IBM blog post, What is retrieval-augmented generation?, provides a good overview of RAG.

Evaluation and Benchmark Tools

For the following tool, see the LLM as a Judge chapter for more details:

  • EvalAssist (paper) is designed to make LLM as a Judge evaluations of data easier for users, including incremental refinement of the evaluation criteria through a web-based user experience. EvalAssist supports direct assessment (scoring) of individual items, which we use in our LLM as a Judge chapter, as well as pair-wise comparisons, where the better of two answers is chosen. A generic sketch of both judging modes follows.
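
The following is a generic sketch, not EvalAssist's API, contrasting the two judging modes: direct assessment scores one answer against explicit criteria, while pair-wise comparison asks the judge to pick the better of two answers. The `judge` function is a hypothetical wrapper around whatever LLM you use as the judge.

```python
# Generic LLM-as-a-judge prompts for the two evaluation modes described above.
# `judge` is a hypothetical wrapper around your LLM of choice; it only needs
# to return the model's text response.
def judge(prompt: str) -> str:
    raise NotImplementedError("call your LLM of choice here")

def direct_assessment(question: str, answer: str, criteria: str) -> str:
    """Score a single answer against explicit criteria (e.g., 1 to 5)."""
    return judge(
        f"Criteria: {criteria}\nQuestion: {question}\nAnswer: {answer}\n"
        "Rate the answer from 1 (poor) to 5 (excellent) and explain briefly."
    )

def pairwise_comparison(question: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge which of two candidate answers is better."""
    return judge(
        f"Question: {question}\nAnswer A: {answer_a}\nAnswer B: {answer_b}\n"
        "Which answer is better, A or B? Explain briefly."
    )
```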

For the following tool, see the [Testing Agents](/ai-application-testing/testing-strategies/testing-agents/) chapter for more details:

  • AssetOpsBench is a unified framework for developing, orchestrating, and evaluating domain-specific AI agents in industrial asset operations and maintenance. Designed for maintenance engineers, reliability specialists, and facility planners, it allows reproducible evaluation of multi-step workflows in simulated industrial environments.

For the following tools, see the [Unit Benchmarks](/ai-application-testing/testing-strategies/unit-benchmarks/) chapter for more details:

  • FailureSensorIQ is a dataset for multiple aspects of reasoning through failure modes, sensor data, and the relationships between them across various industrial assets.
  • FIBEN Benchmark is a finance dataset benchmark for natural language queries.
  • HELM Enterprise Benchmark is an enterprise benchmark framework for LLM evaluation. It extends HELM, an open-source benchmark framework developed by Stanford CRFM, to enable users to evaluate LLMs with domain-specific datasets in areas such as finance, legal, climate, and cybersecurity.

James Thomas

James Thomas is a QA engineer who posted a link to a blog post, How do I Test AI?, that lists criteria to consider when testing AI-enabled systems. While the post doesn’t provide many details behind the list items, the list is excellent for stimulating further investigation.

Jiayi Yuan, et al.

The paper Give Me FP32 or Give Me Death? Challenges and Solutions for Reproducible Reasoning examines the influence of floating-point precision on the reproducibility of inference results, even when randomness is restricted, for example by using a low “temperature”. Of course, the theme of our project is dealing with the inherent randomness of inference, but there are also times when limiting that randomness is important.
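
As a toy illustration of why precision matters (this is not the paper's experiment), the snippet below shows that at float16 precision even summing the same numbers in a different order can change the result, which can be enough to flip downstream decisions such as an argmax over logits.

```python
# Toy demonstration: at reduced precision, floating-point addition is not
# associative, so the same values summed in a different order can produce
# slightly different results. Higher precision narrows, but does not always
# eliminate, such discrepancies.
import numpy as np

rng = np.random.default_rng(0)
values = rng.standard_normal(10_000).astype(np.float16)

forward = np.sum(values)                  # sum in the original order
backward = np.sum(values[::-1])           # same numbers, reversed order
precise = np.sum(values.astype(np.float64))

print(f"float16 forward : {forward}")
print(f"float16 backward: {backward}")
print(f"float64 sum     : {precise}")
# The two float16 sums often differ from each other and from the float64 result.
```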

John Snow Labs and Pacific.ai

John Snow Labs has created langtest, a test generation and execution framework with “60+ test types for comparing LLM & NLP models on accuracy, bias, fairness, robustness & more.”

The affiliated company Pacific.ai offers a commercial testing system with similar features.

LastMile AI

MCP Eval is an evaluation framework for testing Model Context Protocol (MCP) servers and the agents that use them, unlike traditional testing approaches that mock interactions or test components in isolation. It is built on MCP Agent, their agent framework that emphasizes MCP as the communication protocol. See the [Testing Agents](/ai-application-testing/testing-strategies/testing-agents/) chapter for more details.

Merriam-Webster Dictionary

The Merriam-Webster Dictionary is quoted in our Glossary for several terms.

Meta

Meta’s synthetic-data-kit provides scalable support for larger-scale data synthesis and processing (such as translating between formats), especially for model Tuning with Llama models. See the Unit Benchmarks chapter for more details.

The Llama Stack project provides a Kubernetes Benchmark suite.

Michael Feathers

Michael Feathers recently gave a talk called The Challenge of Understandability at Codecamp Romania, 2024, which is discussed in Abstractions Encapsulate Complexities.

MLCommons Glossary

The MLCommons AI Safety v0.5 Benchmark Proof of Concept Technical Glossary is used to inform our Glossary.

Nathan Lambert

How to approach post-training for AI applications is a tutorial presented at NeurIPS 2024 by Nathan Lambert. It is discussed in the From Testing to Tuning chapter; see also this Interconnects post and the Allen Institute for AI entry above.

NIST Risk Management Framework

The U.S. National Institute of Standards and Technology’s (NIST) Artificial Intelligence Risk Management Framework (AI RMF 1.0) is used to inform our Glossary.

OpenAI

An OpenAI paper on reinforcement fine tuning is discussed in From Testing to Tuning.

Announcing OpenAI Pioneers Program introduces the OpenAI Pioneers Program, an effort designed to help application developers optimize model performance in their domains.

Open Data Science

Nine Open-Source Tools to Generate Synthetic Data lists several tools that use different approaches for data generation. See the Unit Benchmarks chapter for more details.

Patronus

The Patronus guide, LLM Testing: The Latest Techniques & Best Practices, discusses the unique testing challenges raised by generative AI and various techniques for testing these systems.

FinanceBench is their benchmark for finance applications. See the Unit Benchmarks chapter for more details.

Evaluating Copyright Violations in LLMs has data and tools for detecting examples of responses that violate one or more copyrights. (This work isn’t discussed elsewhere in this user guide.)

PlurAI

Plurai.ai recently created an open-source project called Intellagent that demonstrates how to exploit some recent research: automated generation of test data, knowledge graphs built from an application’s constraints and requirements, and automated test generation to verify that the system aligns with those requirements. These techniques are designed to provide more exhaustive test coverage of behaviors, including catching corner cases. See the Statistical Evaluation chapter for more details.

RedHat

InstructLab is a project started by IBM Research and developed by RedHat. InstructLab provides conventions for organizing specific, manually-created examples into a domain hierarchy, along with tools to perform model tuning, including synthetic data generation. Hence, InstructLab is an alternative way to generate synthetic data for Unit Benchmarks. See also From Testing to Tuning.

RedMonk

The analyst firm RedMonk posted this interesting piece on Specification-Driven Development.

ServiceNow

DoomArena is a modular, configurable, plug-in framework for testing the security of AI agents against evolving threats across multiple attack scenarios.

DoomArena enables detailed threat modeling, adaptive testing, and fine-grained security evaluations through real-world case studies, such as τ-Bench and BrowserGym. These case studies showcase how DoomArena evaluates vulnerabilities in AI agents interacting in airline customer service and e-commerce contexts.

Furthermore, DoomArena serves as a laboratory for AI agent security research, revealing fascinating insights about agent vulnerabilities, defense effectiveness, and attack interactions. See the [Testing Agents](/ai-application-testing/testing-strategies/testing-agents/) chapter for more details.

Specification-Driven Development

SDD is a more structured approach to prompting LLMs that uses explicit “phases”, such as planning vs. task execution, so LLMs can do a better job of generating production-quality code that meets our requirements. Here we list many references; see the discussion in the Specification-Driven Development chapter, where we explore them.

University of Tübingen

Beyond Benchmarks: A Novel Framework for Domain-Specific LLM Evaluation and Knowledge Mapping is a research effort that explores an alternative approach to knowledge representations, such as the Q&A pairs we use in this guide for benchmarks, without using LLMs to generate data. See the [Unit Benchmarks](/ai-application-testing/testing-strategies/unit-benchmarks/) chapter for more details.

Unsloth

Unsloth is an OSS tool suite for model training and tuning. Their documentation includes a number of useful how-to guides.

Wikipedia

Many Wikipedia articles are used as references in our Glossary and other locations.