
References

This page lists references for more details on testing, especially in the AI context, and on other topics. Most outside references to particular tools that are mentioned elsewhere in this guide are not repeated here.

Table of contents
  1. References
    1. Adrian Cockcroft
    2. AI for Education
    3. Allen Institute for AI
    4. Alignment Forum
    5. Anthropic
    6. Babeş-Bolyai University
    7. CVS Health Data Science Team
    8. Dean Wampler
      1. Ekimetrics
    9. EleutherAI
    10. Elvis Saravia
    11. EvalEval Coalition
    12. Evan Miller
    13. Google
    14. Hamel Husain
    15. IBM
      1. RAG
      2. Evaluation and Benchmark Tools
      3. Mellea
    16. James Thomas
    17. Jiayi Yuan, et al.
    18. John Snow Labs and Pacific.ai
    19. LastMile AI
    20. Merriam-Webster Dictionary
    21. Meta
    22. Michael Feathers
    23. MLCommons Glossary
    24. Nathan Lambert
    25. NIST Risk Management Framework
    26. OpenAI
    27. Open Data Science
    28. Patronus
    29. PlurAI
    30. Red Hat
    31. RedMonk
    32. ServiceNow
    33. Specification-Driven Development
    34. Stanford
    35. University of Tübingen
    36. Unsloth
    37. Wikipedia

Adrian Cockcroft

Dean Wampler and Adrian Cockcroft exchanged messages on Mastodon about lessons learned at Netflix, which are quoted in several sections of this guide. See also Dean Wampler and the discussion in Testing Problems Caused by Generative AI Nondeterminism.

AI for Education

The AI for Education organization provides lots of useful guidance on how to evaluate AI for different education use cases and select benchmarks for them. See also their Hugging Face page.

Allen Institute for AI

Open Instruct from the Allen Institute for AI is discussed by Nathan Lambert below. It pursues goals similar to those of InstructLab. See From Testing to Tuning for more details.

Alignment Forum

The Alignment Forum works on many aspects of alignment.

Anthropic

Anthropic’s post, Demystifying evals for AI agents, provides valuable tips on testing complex agents, as well as general guidance on evaluation concepts. Highly recommended.

Babeş-Bolyai University

Synthetic Data Generation Using Large Language Models: Advances in Text and Code surveys techniques that use LLMs, like those we explore in the Unit Benchmarks chapter and elsewhere.

CVS Health Data Science Team

CVS, the US-based retail pharmacy and healthcare services company, has a large data science team. They recently open-sourced uqlm, where UQLM stands for Uncertainty Quantification for Language Models. It is a Python package for UQ-based LLM hallucination detection.

Among the useful tools in this repository are:

Dean Wampler

In Generative AI: Should We Say Goodbye to Deterministic Testing? Dean Wampler (one of this project’s contributors) summarizes the work of this project. After posting the link to the slides, Dean and Adrian Cockcroft discussed lessons learned at Netflix, which have informed this project’s content.

Ekimetrics

ClairBot from the Responsible AI Team at Ekimetrics is a research project that compares the responses of several models to domain-specific questions, where each model has been tuned for a particular domain, in this case ad serving, laws and regulations, and social sciences and ethics. See also the Unit Benchmarks chapter.

EleutherAI

EleutherAI’s definition of Alignment is quoted in the glossary definition for it.

Elvis Saravia

Elvis Saravia’s Prompt Engineering Guide, part of his DAIR.AI learning academy, provides in-depth information on Prompt Engineering.

EvalEval Coalition

EvalEval (Blog post, GitHub organization) is a research coalition on evaluating evaluations, hence the name EvalEval. Its work is hosted by Hugging Face, University of Edinburgh, and EleutherAI.

The main project is Every Eval Ever, “a shared schema and crowdsourced evaluation database. It defines a standardized metadata format for storing AI evaluation results — from leaderboard scrapes and research papers to local evaluation runs — so that results from different frameworks can be compared, reproduced, and reused.”

Evan Miller

Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations is a research paper arguing that evaluations (see the Trust and Safety Evaluation Initiative for more details) should use proper statistical analysis of their results. It is discussed in Statistical Evaluation.
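As a rough illustration of the kind of analysis this entry refers to, the sketch below computes a mean evaluation score with a standard error and an approximate 95% confidence interval. It is a minimal example under stated assumptions, not the paper's own code: the function name `mean_with_error_bars`, the sample scores, and the normal-approximation multiplier `z=1.96` are all illustrative choices, and the scores are assumed to be independent.

```python
import math


def mean_with_error_bars(scores, z=1.96):
    """Return (mean, standard_error, (low, high)) for a list of per-question scores.

    Hypothetical helper for illustration. Assumes the scores are independent
    and that a normal approximation is acceptable for the interval.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to estimate a standard error")
    mean = sum(scores) / n
    # Sample variance with Bessel's correction (divide by n - 1).
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    # Standard error of the mean.
    sem = math.sqrt(variance / n)
    return mean, sem, (mean - z * sem, mean + z * sem)


if __name__ == "__main__":
    # Illustrative pass/fail results (1 = pass, 0 = fail) for ten questions.
    example_scores = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
    mean, sem, (low, high) = mean_with_error_bars(example_scores)
    print(f"mean={mean:.3f} sem={sem:.3f} 95% CI=({low:.3f}, {high:.3f})")
```

With only a handful of questions the interval is wide, which is one reason to report it alongside the raw score rather than comparing two scores directly.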

Google

Google’s Agent Development Kit has a chapter called Why Evaluate Agents?, which provides tips for writing evaluations specifically tailored for agents. See the discussion in the [Testing Agents](/ai-application-testing/testing-strategies/testing-agents/) chapter.

Hamel Husain

Your AI Product Needs Evals is a long blog post that discusses testing of AI applications and makes many of the same points this user guide makes.

IBM

RAG

This IBM blog post, What is retrieval-augmented generation?, provides a good overview of RAG.

Evaluation and Benchmark Tools

For the following tool, see the LLM as a Judge chapter for more details:

  • EvalAssist (paper) is designed to make LLM as a Judge evaluations of data easier for users, including incremental refinement of the evaluation criteria using a web-based user experience. EvalAssist supports direct assessment (scoring) of data individually, which we used in our LLM as a Judge chapter, or pair-wise comparisons, where the best of two answers is chosen.

For the following tool, see the [Testing Agents](/ai-application-testing/testing-strategies/testing-agents/) chapter for more details:

  • AssetOpsBench is a unified framework for developing, orchestrating, and evaluating domain-specific AI agents in industrial asset operations and maintenance. Designed for maintenance engineers, reliability specialists, and facility planners, it allows reproducible evaluation of multi-step workflows in simulated industrial environments.

For the following tools, see the [Unit Benchmarks](/ai-application-testing/testing-strategies/unit-benchmarks/) chapter for more details:

  • FailureSensorIQ is a data set for multiple aspects of reasoning through failure modes, sensor data, and the relationships between them across various industrial assets.
  • FIBEN Benchmark is a finance data set benchmark for natural language queries.
  • HELM Enterprise Benchmark is an enterprise benchmark framework for LLM evaluation. It extends HELM, an open-source benchmark framework developed by Stanford CRFM, to enable users to evaluate LLMs with domain-specific data sets such as finance, legal, climate, and cybersecurity.
  • Stanford ML’s MedAgentBench is a virtual environment and benchmark suite for assessing the performance of LLMs in the context of electronic health records (EHR).

Mellea

Mellea has concepts similar to those of Specification-Driven Development (see below), such as emphasizing the specification of requirements through a Python API rather than arbitrary prompts.

James Thomas

James Thomas is a QA engineer who posted a link to a blog post, How do I Test AI?, that lists criteria to consider when testing AI-enabled systems. While the post doesn’t provide many details behind the list items, the list is excellent for stimulating further investigation.

Jiayi Yuan, et al.

The paper Give Me FP32 or Give Me Death? Challenges and Solutions for Reproducible Reasoning examines the influence of floating point precision on the reproducibility of inference results, even when randomness is restricted, such as by using a low “temperature”. Of course, the theme of our project is dealing with the inherent randomness of inference, but there are also times when limiting that randomness is important.

John Snow Labs and Pacific.ai

John Snow Labs has created langtest, a test generation and execution framework with “60+ test types for comparing LLM & NLP models on accuracy, bias, fairness, robustness & more.”

The affiliated company Pacific.ai offers a commercial testing system with similar features.

LastMile AI

MCP Eval is an evaluation framework for testing Model Context Protocol (MCP) servers and the agents that use them. Unlike traditional testing approaches, it does not mock interactions or test components in isolation. It is built on MCP Agent, their agent framework that emphasizes MCP as the communication protocol. See the [Testing Agents](/ai-application-testing/testing-strategies/testing-agents/) chapter for more details.

Merriam-Webster Dictionary

The Merriam-Webster Dictionary is quoted in our Glossary for several terms.

Meta

Meta’s synthetic-data-kit provides scalable support for larger-scale data synthesis and processing (such as translating between formats), especially for model Tuning with Llama models. See the Unit Benchmarks chapter for more details.

The Llama Stack project provides a Kubernetes Benchmark suite.

Michael Feathers

Michael Feathers gave a talk called The Challenge of Understandability at Codecamp Romania in 2024, which is discussed in Abstractions Encapsulate Complexities.

MLCommons Glossary

The MLCommons AI Safety v0.5 Benchmark Proof of Concept Technical Glossary is used to inform our Glossary.

Nathan Lambert

How to approach post-training for AI applications is a tutorial presented at NeurIPS 2024 by Nathan Lambert. The same content can be found in this Interconnects blog post. From Testing to Tuning discusses these ideas. See also this Interconnects post and the Allen Institute for AI entry above.

NIST Risk Management Framework

The U.S. National Institute of Standards and Technology’s (NIST) Artificial Intelligence Risk Management Framework (AI RMF 1.0) is used to inform our Glossary.

OpenAI

An OpenAI paper on reinforcement fine-tuning is discussed in From Testing to Tuning.

The post Announcing OpenAI Pioneers Program introduced the OpenAI Pioneers Program, an effort designed to help application developers optimize model performance in their domains.

Open Data Science

Nine Open-Source Tools to Generate Synthetic Data lists several tools that use different approaches for data generation. See the Unit Benchmarks chapter for more details.

Patronus

The Patronus guide, LLM Testing: The Latest Techniques & Best Practices, discusses the unique testing challenges raised by generative AI and surveys various techniques for testing these systems.

FinanceBench is their benchmark for finance applications. See the Unit Benchmarks chapter for more details.

Evaluating Copyright Violations in LLMs has data and tools for detecting examples of responses that violate one or more copyrights. (This work isn’t discussed elsewhere in this user guide.)

PlurAI

Plurai.ai recently created an open-source project called Intellagent that demonstrates how to exploit some recent research on automated generation of test data, knowledge graphs based on the constraints and requirements for an application, and automated test generation to verify alignment of the system to the requirements. These techniques are designed to provide more exhaustive test coverage of behaviors, including catching corner cases. See the Statistical Evaluation chapter for more details.

Red Hat

InstructLab is a project started by IBM Research and developed by Red Hat. InstructLab provides conventions for organizing specific, manually created examples into a domain hierarchy, along with tools to perform model Tuning, including Synthetic Data Generation. Hence, InstructLab is an alternative way to generate synthetic data for Unit Benchmarks. See also From Testing to Tuning.

RedMonk

The analyst firm RedMonk posted this interesting piece on Specification-Driven Development, which we discuss in this chapter. See also the other references in the Specification-Driven Development section below and the glossary entry.

ServiceNow

DoomArena is a framework for testing AI Agents against evolving security threats. It offers a modular, configurable, plug-in framework for testing the security of AI agents across multiple attack scenarios.

DoomArena enables detailed threat modeling, adaptive testing, and fine-grained security evaluations through real-world case studies, such as τ-Bench and BrowserGym. These case studies showcase how DoomArena evaluates vulnerabilities in AI agents interacting in airline customer service and e-commerce contexts.

Furthermore, DoomArena serves as a laboratory for AI agent security research, revealing fascinating insights about agent vulnerabilities, defense effectiveness, and attack interactions. See the [Testing Agents](/ai-application-testing/testing-strategies/testing-agents/) chapter for more details.

Specification-Driven Development

SDD is a more structured approach to prompting LLMs that uses explicit “phases”, such as planning and task execution, so LLMs can do a better job of generating production-quality code that meets our requirements. Here we list many references. See the discussion in the Specification-Driven Development chapter, where we explore them, and the RedMonk section above.

YouTube videos about SDD:

See also Mellea.

Stanford

What Makes a Good AI Benchmark? from Stanford’s Human-Centered Artificial Intelligence project provides a careful analysis of the qualities of good benchmarks, along with assessments of many well-known, public benchmarks. See also their BetterBench repository of assessments.

Some of the criteria pertain to documentation, ease of adoption, and feedback mechanisms, which may be less important for small-scale and especially private benchmarks, like the unit benchmarks discussed in this guide. Other criteria are more applicable, such as clearly defining the goals of the benchmark, how those goals are implemented by the benchmark, how to interpret the results, and how involved domain experts were in constructing the benchmark.

University of Tübingen

Beyond Benchmarks: A Novel Framework for Domain-Specific LLM Evaluation and Knowledge Mapping is a research effort that explores an alternative approach to knowledge representations, like the Q&A pairs we use in this guide for benchmarks, without using LLMs for generating data.

Unsloth

Unsloth is an OSS tool suite for model training and tuning. Their documentation includes guides for the following:

Wikipedia

Many Wikipedia articles are used as references in our Glossary and other locations.