References
References providing more details on testing, especially in the AI context, and on other topics. Note that outside references to particular tools mentioned elsewhere on this website are not repeated here.
Table of contents
- References
- Adrian Cockcroft
- AI for Education
- Alignment Forum
- Babeş-Bolyai University
- CVS Health Data Science Team
- Dean Wampler
- Ekimetrics
- EleutherAI
- Evan Miller
- Hamel Husain
- IBM
- James Thomas
- Jiayi Yuan, et al.
- John Snow Labs and Pacific.ai
- Merriam-Webster Dictionary
- Meta
- Michael Feathers
- MLCommons Glossary
- Nathan Lambert
- NIST Risk Management Framework
- OpenAI
- Open Data Science
- Patronus
- PlurAI
- Specification-Driven Development
- University of Tübingen
- Wikipedia
Adrian Cockcroft
Dean Wampler and Adrian Cockcroft exchanged messages on Mastodon about lessons learned at Netflix, which are quoted in several sections of this website. See also Dean Wampler.
AI for Education
The AI for Education organization provides a wealth of useful guidance on how to evaluate AI for different education use cases and how to select benchmarks for them. See also their Hugging Face page.
Alignment Forum
The Alignment Forum works on many aspects of alignment.
Babeş-Bolyai University
Synthetic Data Generation Using Large Language Models: Advances in Text and Code surveys techniques that use LLMs to generate synthetic data, as we do in this guide.
CVS Health Data Science Team
CVS, the US-based retail pharmacy and healthcare services company, has a large data science team. They recently open-sourced uqlm, where UQLM stands for Uncertainty Quantification for Language Models; it is a Python package for UQ-based LLM hallucination detection. A minimal sketch of the underlying idea appears after the list below.
Among the useful tools in this repository are:
- A concise summary of best practices and tools for synthetic data generation.
- Tuning models to improve Chain of Thought reasoning.
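The core idea behind consistency-based (“black-box”) uncertainty quantification can be sketched in a few lines. The following is our own illustration of the technique, not the uqlm API; the `generate` helper is a hypothetical stand-in for whatever LLM client you use. Sample several responses to the same prompt and treat disagreement among them as a signal of higher uncertainty and hallucination risk:

```python
# Illustrative sketch of consistency-based ("black-box") uncertainty
# quantification, the general idea behind tools like uqlm. This is NOT
# the uqlm API; `generate` stands in for any LLM call you already have.
from collections import Counter

def generate(prompt: str, n: int = 5) -> list[str]:
    """Hypothetical helper: call your LLM n times with sampling enabled."""
    raise NotImplementedError("wire this to your model client")

def consistency_score(responses: list[str]) -> float:
    """Fraction of responses that agree with the most common answer.

    Near 1.0 means the model answers consistently (lower uncertainty);
    near 1/len(responses) means answers scatter (higher hallucination risk).
    """
    normalized = [r.strip().lower() for r in responses]
    most_common_count = Counter(normalized).most_common(1)[0][1]
    return most_common_count / len(normalized)

# Usage: flag prompts whose answers disagree for human review.
# responses = generate("What year was the first COBOL standard published?")
# if consistency_score(responses) < 0.6:
#     print("Low consistency; treat this answer as unreliable.")
```

Exact string matching is deliberately simplistic here; real tools compare responses with semantic similarity or judge models, but the uncertainty signal is the same.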
Dean Wampler
In Generative AI: Should We Say Goodbye to Deterministic Testing? Dean Wampler (one of this project’s contributors) summarizes the work of this project. After posting the link to the slides, Dean and Adrian Cockcroft discussed lessons learned at Netflix, which have informed this project’s content.
Ekimetrics
ClairBot from the Responsible AI Team at Ekimetrics is a research project that compares several model responses for domain-specific questions, where each of the models has been tuned for a particular domain, in this case ad serving, laws and regulations, and social sciences and ethics.
EleutherAI
EleutherAI’s definition of Alignment is quoted in the glossary definition for it.
Evan Miller
Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations is a research paper arguing that evaluations (see the Trust and Safety Evaluation Initiative for more details) should use proper statistical analysis of their results. It is discussed in Statistical Evaluation.
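As a concrete illustration of the paper’s basic point (our sketch, not code from the paper), the snippet below attaches a standard error and an approximate 95% confidence interval to a benchmark pass rate, treating each question as an independent pass/fail trial:

```python
# Minimal sketch: report an eval score with error bars rather than a
# bare point estimate. Treats each question as an independent pass/fail
# (Bernoulli) trial and uses the normal approximation for the interval.
import math

def score_with_error_bars(passes: int, total: int) -> tuple[float, float, float]:
    p = passes / total                      # observed pass rate
    se = math.sqrt(p * (1 - p) / total)     # standard error of the mean
    margin = 1.96 * se                      # ~95% confidence interval
    return p, p - margin, p + margin

# Example: 780 of 1000 questions passed.
p, lo, hi = score_with_error_bars(780, 1000)
print(f"pass rate = {p:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
# Two models whose intervals overlap may not be meaningfully different.
```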
Hamel Husain
Your AI Product Needs Evals is a long blog post that discusses testing of AI applications and makes many of the same points this user guide makes.
IBM
The IBM blog post What is retrieval-augmented generation? provides a good overview of RAG.
James Thomas
James Thomas is a QA engineer who posted a link to a blog post, How do I Test AI?, which lists criteria to consider when testing AI-enabled systems. While the post doesn’t provide a lot of detail behind the list items, the list is excellent for stimulating further investigation.
Jiayi Yuan, et al.
The paper Give Me FP32 or Give Me Death? Challenges and Solutions for Reproducible Reasoning examines the influence of floating-point precision on the reproducibility of inference results, even when randomness is restricted, for example by using a low “temperature”. Of course, the theme of our project is dealing with the inherent randomness of inference, but there are also times when limiting that randomness is important.
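To see the underlying numerical effect at a small scale, here is a minimal sketch (our illustration, not code from the paper): summing the same values in two different orders yields different rounding error, and the discrepancy grows as precision drops, which is one reason identical prompts can yield slightly different results across runs and hardware even with greedy decoding.

```python
# Illustrative sketch: why reduced precision hurts reproducibility.
# Summing the same values in a different order changes the rounding
# at low precision, so accumulated results (e.g., logits) can differ
# run to run even when sampling randomness is turned off.
import numpy as np

rng = np.random.default_rng(0)
values = rng.standard_normal(100_000)

for dtype in (np.float16, np.float32, np.float64):
    x = values.astype(dtype)
    forward = x.sum()            # one reduction order
    backward = x[::-1].sum()     # another reduction order
    print(dtype.__name__, abs(float(forward) - float(backward)))
# The gap is largest at float16 and shrinks as precision increases.
```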
John Snow Labs and Pacific.ai
John Snow Labs has created langtest, a test generation and execution framework with “60+ test types for comparing LLM & NLP models on accuracy, bias, fairness, robustness & more.”
The affiliated company Pacific.ai offers a commercial testing system with similar features.
Merriam-Webster Dictionary
The Merriam-Webster Dictionary is quoted in our Glossary for several terms.
Meta
Meta’s synthetic-data-kit (discussed in Unit Benchmarks and From Testing to Tuning) provides scalable support for larger-scale data synthesis and processing (such as translating between formats), especially for model Tuning with Llama models.
Michael Feathers
Michael Feathers gave a talk, The Challenge of Understandability, at Codecamp Romania 2024, which is discussed in Abstractions Encapsulate Complexities.
MLCommons Glossary
The MLCommons AI Safety v0.5 Benchmark Proof of Concept Technical Glossary is used to inform our Glossary.
Nathan Lambert
How to approach post-training for AI applications is a tutorial presented by Nathan Lambert at NeurIPS 2024. It is discussed in From Testing to Tuning. See also this Interconnects post.
NIST Risk Management Framework
The U.S. National Institute of Standards and Technology’s (NIST) Artificial Intelligence Risk Management Framework (AI RMF 1.0) is used to inform our Glossary.
OpenAI
An OpenAI paper on reinforcement fine tuning is discussed in From Testing to Tuning.
Announcing OpenAI Pioneers Program introduces the OpenAI Pioneers Program, an effort designed to help application developers optimize model performance in their domains.
Open Data Science
Nine Open-Source Tools to Generate Synthetic Data lists several tools that use different approaches for data generation.
Patronus
The Patronus guide, LLM Testing: The Latest Techniques & Best Practices, discusses the unique testing challenges raised by generative AI and covers various techniques for testing these systems.
PlurAI
Plurai.ai recently created an open-source project called Intellagent that demonstrates how to combine recent research on automated generation of test data, knowledge graphs built from an application’s constraints and requirements, and automated test generation to verify that a system aligns with its requirements. These techniques are designed to provide more exhaustive test coverage of behaviors, including corner cases.
Specification-Driven Development
Specification-Driven Development (SDD) is a more structured approach to prompting LLMs that uses explicit “phases”, such as planning versus task execution, so LLMs can do a better job of generating production-quality code that meets our requirements. Several references are listed below; see the Specification-Driven Development chapter, where we explore them.
- How I Apply Spec-Driven AI Coding
- Spec Kit
- AWS Kiro, an AI IDE designed to support specification-driven development.
University of Tübingen
Beyond Benchmarks: A Novel Framework for Domain-Specific LLM Evaluation and Knowledge Mapping is a research effort that explores an alternative approach to knowledge representations, such as the Q&A pairs we use for benchmarks in this guide, without relying on LLMs to generate the data.
Wikipedia
Many Wikipedia articles are used as references in our Glossary and other locations.
- Bertrand Meyer
- Cyclomatic complexity
- Deferent and epicycle
- Design by contract
- DevOps
- Eiffel (programming language)
- Generative adversarial network
- SQL injection
- Test-driven development
- Transmission Control Protocol