Glossary of Terms
Some of the terms defined here are industry standards, while others are not standard, but they are useful for our purposes.
Some definitions are adapted from the following sources, which are cited below by number, i.e., [1] and [2]:
- [1] MLCommons AI Safety v0.5 Benchmark Proof of Concept Technical Glossary
- [2] NIST Artificial Intelligence Risk Management Framework (AI RMF 1.0)
Table of contents
- Glossary of Terms
- Acceptance Test
- Agent
- AI Actor
- AI System
- Alignment
- Automatable
- Benchmark
- Component
- Dataset
- Determinism
- Explainability
- Evaluation
- Evaluation Framework
- Evaluator
- Fairness
- Feature
- Function
- Functional Programming
- Generative AI Model
- Hallucination
- Inference
- Integration Test
- Large Language Model
- Object-Oriented Programming
- Multimodal Model
- Refactoring
- Regression
- Repeatable
- Robustness
- Side Effect
- Test
- Test Double
- Test-Driven Development
- Token
- Unit Test
Acceptance Test
A test that verifies a user-visible feature works as required, often by driving the user interface or calling the external API. These tests are system-wide and are sometimes executed manually. However, it is desirable to automate them, in which case all operations with Side Effects need to be replaced with Deterministic Test Doubles. See also Test, Unit Test, and Integration Test.
Agent
An old concept in AI that is currently seeing a renaissance as the most flexible architecture pattern for AI-based applications. Agents are orchestrations of model and external service invocations, e.g., planners, schedulers, reasoning engines, data sources (weather, search, …), etc. In this architecture, the best capabilities of each service and model are leveraged, rather than assuming models can do everything successfully.
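As a rough illustration of this orchestration idea, the following minimal sketch routes a query to one of two tools. Everything here is hypothetical: `stub_planner` stands in for a model that chooses a tool, and `weather_tool` and `search_tool` return canned values rather than calling real services.

```python
from typing import Callable, Dict


def weather_tool(query: str) -> str:
    """Hypothetical data source; a real one would call a weather service."""
    return "Sunny, 21 C (canned value)"


def search_tool(query: str) -> str:
    """Hypothetical data source; a real one would call a search service."""
    return f"Top result for '{query}' (canned value)"


def stub_planner(query: str) -> str:
    """Stand-in for a model that decides which tool fits the query."""
    return "weather" if "weather" in query.lower() else "search"


TOOLS: Dict[str, Callable[[str], str]] = {
    "weather": weather_tool,
    "search": search_tool,
}


def run_agent(
    query: str,
    planner: Callable[[str], str] = stub_planner,
    tools: Dict[str, Callable[[str], str]] = TOOLS,
) -> str:
    """Ask the planner for a tool name, then delegate the query to that tool."""
    tool_name = planner(query)
    return tools[tool_name](query)
```

Because the planner and tools are passed in as arguments, each one can be swapped for a Test Double when the agent itself is under test.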
AI Actor
[2] An organization or individual building an AI System.
AI System
Umbrella term for an application or system with AI components, including Datasets, Generative AI Models, Evaluation Framework and Evaluators for safety detection and mitigation, etc., plus external services, databases for runtime queries, and other application logic that together provide functionality.
Alignment
A general term for how well an AI System’s outputs (e.g., replies to queries) and behaviors correspond to end-user and service provider objectives, including the quality and utility of results, as well as safety requirements. Quality implies factual correctness and utility implies the results are fit for purpose, e.g., a Q&A system should answer user questions concisely and directly, a Python code-generation system should output valid, bug-free, and secure Python code. EleutherAI defines alignment this way, “Ensuring that an artificial intelligence system behaves in a manner that is consistent with human values and goals.” See also the Alignment Forum.
Automatable
Can an action, like a test, be automated so it can be executed without human intervention?
Benchmark
[1] A methodology or function used for offline Evaluation of a Generative AI Model or AI System for a particular purpose and to interpret the results. It consists of:
- A set of tests with metrics.
- A summarization of the results.
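The two parts listed above can be sketched in a few lines. This is only an illustration under stated assumptions: `stub_model`, `exact_match`, `run_benchmark`, and the two test cases are invented for the example and do not come from [1].

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple


def exact_match(expected: str, actual: str) -> float:
    """Metric: 1.0 if the answers match, ignoring case and outer whitespace."""
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0


def run_benchmark(
    model: Callable[[str], str], cases: List[Tuple[str, str]]
) -> Dict[str, float]:
    """Run every (prompt, expected) case and summarize the scores."""
    scores = [exact_match(expected, model(prompt)) for prompt, expected in cases]
    return {"cases": len(scores), "mean_exact_match": mean(scores)}


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model, returning canned answers."""
    canned = {
        "What is the capital of France?": "paris",
        "What is 2 + 2?": "5",
    }
    return canned.get(prompt, "")


CASES = [
    ("What is the capital of France?", "Paris"),
    ("What is 2 + 2?", "4"),
]

report = run_benchmark(stub_model, CASES)
print(report)
```

Here one of the two canned answers is wrong, so the summary reports a mean exact-match score of 0.5.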
Component
An ill-defined, but often-used term in software. In this case, we use it to generically refer to anything with well-defined boundaries and access APIs: libraries, web services, etc.
Dataset
(See also [1]) A collection of data items used for training, evaluation, etc. Usually, a given dataset has a schema (which may be “this is unstructured text”) and some metadata about provenance, licenses for use, transformations and filters applied, etc.
Determinism
The output of a Function for a given input is always known precisely. This affords writing repeatable, predictable software and automated, reliable tests.
In contrast, nondeterminism means that identical inputs to a component can yield different results, which removes repeatability and makes it harder to predict behavior and to write automated, reliable tests.
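One common way to recover determinism in tests is to make the source of randomness an explicit input. The sketch below assumes a hypothetical `pick_greeting` function. With the same seed it returns the same value, while an unseeded generator may not.

```python
import random


def pick_greeting(rng: random.Random) -> str:
    """The result depends only on the state of the generator passed in."""
    return rng.choice(["Hello", "Hi", "Welcome"])


# Two generators seeded identically produce identical results.
first = pick_greeting(random.Random(42))
second = pick_greeting(random.Random(42))
assert first == second

# An unseeded generator gives no such guarantee across runs.
unpredictable = pick_greeting(random.Random())
```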
Explainability
Can humans understand why the system behaves the way that it does in a particular scenario?
Evaluation
The capability of measuring and quantifying how a Generative AI Model or AI System that uses models responds to inputs. Much like other software, models and AI systems need to be trusted and useful to their users. Evaluation aims to provide the evidence needed to gain users’ confidence. See also Evaluation Framework and Evaluator.
Evaluation Framework
An umbrella term for the software tools, runtime services, benchmark systems, etc. used to perform Evaluations by running different Evaluators to measure AI Systems for trust and safety risks and mitigations, and other kinds of measurements.
Evaluator
A classifier Generative AI Model or similar tool, possibly including a Dataset, that can quantify an AI System’s inputs and outputs to detect the presence of risky content, such as hate speech, hallucinations, etc. For our purposes, an evaluator is API compatible for execution within an Evaluation Framework. In general, an evaluator could be targeted towards non-safety needs, such as measuring other aspects of Alignment, Inference model latency and throughput, carbon footprint, etc. Also, a given evaluator could be used at many points in the total AI life cycle, e.g., for a benchmark and an inference-time test.
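To make "API compatible" concrete, here is a minimal sketch of a shared evaluator interface. The `Evaluator` protocol, the two example evaluators, and `run_evaluators` are all hypothetical names chosen for illustration, not the API of any particular Evaluation Framework.

```python
from typing import Dict, List, Protocol


class Evaluator(Protocol):
    """Assumed common interface: a name plus a numeric score for some text."""

    name: str

    def score(self, text: str) -> float: ...


class BlocklistEvaluator:
    """Toy safety evaluator: 1.0 if any blocked term appears, else 0.0."""

    name = "blocklist"

    def __init__(self, blocked_terms: List[str]) -> None:
        self._blocked = [term.lower() for term in blocked_terms]

    def score(self, text: str) -> float:
        lowered = text.lower()
        return 1.0 if any(term in lowered for term in self._blocked) else 0.0


class LengthEvaluator:
    """Toy non-safety evaluator: the number of whitespace-separated words."""

    name = "length"

    def score(self, text: str) -> float:
        return float(len(text.split()))


def run_evaluators(text: str, evaluators: List[Evaluator]) -> Dict[str, float]:
    """A framework can run any evaluator that follows the shared interface."""
    return {evaluator.name: evaluator.score(text) for evaluator in evaluators}
```

The same evaluator objects could, in principle, be invoked from a benchmark run or from an inference-time check, since both only rely on the `score` method.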
Fairness
Do the AI System’s behaviors exhibit social biases, preferential treatment, or other forms of non-objectivity?
Feature
For our purposes, a small bit of functionality provided by an application. It is the increment of change in a single cycle of the Test-Driven Development process, which could be enhancing some user-visible functionality or adding new functionality in small increments.
Function
Used here as an umbrella term for any unit of execution, including actual functions, methods, code blocks, etc. Many functions are free of Side Effects, meaning they don’t read or write state external to the function and shared by other functions. These functions are always Deterministic; for a given input(s) they always return the same output.
Other functions that read and possibly write external state are usually Nondeterministic. Examples include functions that retrieve data, like a database record, functions that generate UUIDs, and functions that call other processes or systems.
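The distinction can be seen in a small sketch. The function names are hypothetical. `full_name` has no side effects, `next_ticket` reads and writes shared state, and `new_request_id` relies on the standard library's `uuid.uuid4`.

```python
import uuid


def full_name(first: str, last: str) -> str:
    """No side effects: the result depends only on the arguments."""
    return f"{first} {last}"


_ticket_counter = 0  # state shared outside the function


def next_ticket() -> int:
    """Side effect: reads and writes module-level state on every call."""
    global _ticket_counter
    _ticket_counter += 1
    return _ticket_counter


def new_request_id() -> str:
    """Nondeterministic: a fresh random UUID is generated on every call."""
    return str(uuid.uuid4())
```

Calling `full_name("Ada", "Lovelace")` always returns the same string, whereas two calls to `next_ticket()` or `new_request_id()` return different values.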
Functional Programming
FP is a design methodology that attempts to formalize components and their properties, inspired by constructs in mathematics. State is maintained in a small set of abstractions, like Maps, Lists, and Sets, with operations that are implemented separately following protocol abstractions exposed by the collections. Like mathematical objects and unlike objects in Object-Oriented Programming, mutation of state is prohibited; any operation, like adding elements to a collection, creates a new copy.
FP became popular when concurrent software became more widespread in the 2000s, because the immutable objects lead to far fewer concurrency bugs.
Contrast with Object-Oriented Programming. Many programming languages combine elements of FP and OOP.
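Python is not a purely functional language, but its immutable tuples can illustrate the "new copy" idea. In this sketch, the hypothetical `append_item` function returns a new tuple and leaves the original untouched.

```python
from typing import Tuple


def append_item(items: Tuple[str, ...], item: str) -> Tuple[str, ...]:
    """Return a new tuple with the item added; the input is never mutated."""
    return items + (item,)


cart = ("apple",)
bigger_cart = append_item(cart, "pear")

print(cart)         # the original tuple still holds one element
print(bigger_cart)  # the new tuple holds both elements
```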
Generative AI Model
A combination of data and code, usually trained on a Dataset, to support Inference of some kind. See also Large Language Model and Multimodal Model.
For convenience, in the text, we use the term model to refer to the generative AI component that has Nondeterministic behavior, whether it is a model invoked directly through an API in the same application or invoked by calling another service (e.g., ChatGPT). The goal of this project is to better understand how developers can test models.
Hallucination
When a Generative AI Model generates text that seems plausible, but is not factually accurate. Lying is not the right term, because there is no malice intended by the model, which only knows how to generate a sequence of Tokens that are plausible, i.e., probabilistically likely.
Inference
Sending information to a Generative AI Model or AI System to have it return an analysis of some kind, summarization of the input, or newly generated information, such as text. The term query is typically used when working with LLMs. The term inference comes from traditional statistical analysis, including model building, that is used to infer information from data.
Integration Test
A test for several Functions that verifies they interoperate properly. These “functions” could be other, distributed systems, too. When any of the functions being tested have Side Effects, perhaps indirectly through other functions they call, all such side effects must be replaced with Test Doubles to make the test Deterministic. See also Test, Unit Test, and Acceptance Test.
Large Language Model
Abbreviated LLM, a state-of-the-art Generative AI Model, often with billions of parameters, that has the ability to summarize, classify, and even generate text in one or more spoken and programming languages. See also Multimodal Model.
Object-Oriented Programming
OOP (or OOSD - object-oriented software development) is a design methodology that creates software components with boundaries that mimic real-world objects (like Person, Automobile, Shopping Cart, etc.). Each object encapsulates state and behavior behind its abstraction.
Introduced in the Simula language in the 1960s, it gained widespread interest in the 1980s with the emergence of graphical user interfaces (GUIs), where objects like Window, Buttons, and Menus were an intuitive way to organize such software.
Contrast with Functional Programming. Many programming languages combine elements of FP and OOP.
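As a small illustration of encapsulating state and behavior, the sketch below models the Shopping Cart example as a class. The class and method names are hypothetical. Note that, unlike the FP style, `add` mutates the object in place.

```python
from typing import Dict


class ShoppingCart:
    """Encapsulates the cart's state (items) behind its behavior (methods)."""

    def __init__(self) -> None:
        self._items: Dict[str, float] = {}  # internal state, not accessed directly

    def add(self, name: str, price: float) -> None:
        """Mutates the cart in place by recording the item's price."""
        self._items[name] = price

    def total(self) -> float:
        """Computes the total from the encapsulated state."""
        return sum(self._items.values())


cart = ShoppingCart()
cart.add("book", 12.5)
cart.add("pen", 1.5)
print(cart.total())
```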
Multimodal Model
Generative AI Models that usually extend the text-based capabilities of LLMs with additional support for other media, such as video, audio, still images, or other kinds of data.
Refactoring
Modifying code to change its structure as required to support a new feature. No behavior changes are introduced, so that the existing automated Tests can verify that no regressions are introduced as the code is modified. This is the first step in the Test-Driven Development cycle.
Regression
When an unexpected behavior change is introduced into a previously working Function because of a change made to the code base, often in other functions for unrelated functionality.
Automated Tests are designed to catch regressions as soon as they occur, making it easier to diagnose the change that caused the regression, as well as detecting the regression in the first place.
Repeatable
If an action, like running a test, is run repeatedly with no code or data changes, does it return the same results every time? By design, Generative AI Models are expected to return different results each time a query is repeated.
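When exact outputs are not repeatable, one possible approach is to assert on a property that every acceptable output should share. In this sketch, `stub_model` is a hypothetical stand-in that picks one of several canned phrasings at random, and the check only requires that each answer mentions "Paris".

```python
import random


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for a model whose wording varies between calls."""
    return random.choice(
        ["Paris.", "The capital is Paris.", "It is Paris, France."]
    )


# Repeating the same query can produce different strings...
answers = {stub_model("What is the capital of France?") for _ in range(20)}

# ...so the check targets a shared property instead of an exact match.
assert all("Paris" in answer for answer in answers)
```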
Robustness
How well does the AI System continue to perform within acceptable limits or degrade “gracefully” when stressed in some way? For example, how well does a Generative AI Model respond to prompts that deviate from its training data?
Side Effect
Reading and/or writing state shared outside a Function with other functions. See also Determinism.
Test
For our purposes, a Unit, Integration, or Acceptance test.
Test Double
A test-only replacement for a Function with Side Effects, so it returns Deterministic values or behaviors when a dependent function uses it. For example, a function that queries a database can be replaced with a version that always returns a fixed value expected by the test.
See also Test, Unit Test, Integration Test, and Acceptance Test.
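The database example above might look like the following sketch. The names are hypothetical: `fetch_user_name_from_db` represents the real function with side effects (it simply raises here because no database exists in this example), and `fake_fetch_user_name` is the test double.

```python
from typing import Callable


def fetch_user_name_from_db(user_id: int) -> str:
    """Placeholder for the real query; no database exists in this sketch."""
    raise RuntimeError("no database available in this sketch")


def greeting(
    user_id: int,
    fetch_name: Callable[[int], str] = fetch_user_name_from_db,
) -> str:
    """Dependent function: builds a greeting from the looked-up name."""
    return f"Hello, {fetch_name(user_id)}!"


def fake_fetch_user_name(user_id: int) -> str:
    """Test double: always returns the fixed value the test expects."""
    return "Ada"


# The test passes the double in place of the real database function.
assert greeting(1, fetch_name=fake_fetch_user_name) == "Hello, Ada!"
```

Passing the dependency as a parameter is only one way to substitute a double. Patching with a mocking library is another common choice.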
Test-Driven Development
When adding a Feature to a code base using TDD, the tests are written before the code is written. A three-step “virtuous” cycle is used, where changes are made incrementally and iteratively in small steps, one at a time:
- Refactor the code to change its structure as required to support the new feature, using the existing automated Tests to verify that no regressions are introduced. For example, it might be necessary to introduce an abstraction to support two “choices” where previously only one choice existed.
- Write a Test for the new feature. This is primarily a design exercise, because thinking about testing makes you think about usability, behavior, etc., even though you are also creating a reusable test that will become part of the Regression test suite. Note that the test suite will fail to run at the moment, because the code doesn’t yet exist to make it pass!
- Write the new feature to make the new test (as well as all previously written tests) pass.
The Wikipedia article on TDD is a good place to start for more information.
Token
For language Generative AI Models, the training texts and query prompts are split into tokens, usually whole words or fractions of words, according to a vocabulary of tens of thousands of tokens that can include common single characters, several characters, and “control” tokens (like “end of input”). As a rule of thumb, a corpus will have roughly 1.5 times as many tokens as words.
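To give a feel for how words break into tokens, here is a toy greedy longest-match tokenizer. It is a simplified sketch, not the algorithm of any real tokenizer. The tiny `VOCAB` and the `<eos>` control token are invented for illustration, and unknown text falls back to single characters.

```python
from typing import List, Set

# Hypothetical, tiny vocabulary; real vocabularies have tens of thousands of entries.
VOCAB: Set[str] = {"test", "ing", "un", "able", "s"}


def tokenize(text: str, vocab: Set[str] = VOCAB) -> List[str]:
    """Split each word into the longest vocabulary pieces, left to right."""
    tokens: List[str] = []
    for word in text.split():
        start = 0
        while start < len(word):
            # Try the longest candidate first; a single character always matches.
            for end in range(len(word), start, -1):
                piece = word[start:end]
                if piece in vocab or end == start + 1:
                    tokens.append(piece)
                    start = end
                    break
    tokens.append("<eos>")  # control token marking the end of input
    return tokens


print(tokenize("untestable tests"))
# → ['un', 'test', 'able', 'test', 's', '<eos>']
```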
Unit Test
A test for a function that exercises its behavior in isolation from all other functions and state. When the function being tested has Side Effects, perhaps indirectly through other functions it calls, all such side effects must be replaced with Test Doubles to make the test Deterministic. See also Test, Integration Test, and Acceptance Test.
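A minimal sketch of such a test, using Python's standard `unittest` and `unittest.mock` modules, follows. The function under test, `timestamped`, is hypothetical. Its side effect is reading the system clock, which the test replaces with a fixed value.

```python
import time
import unittest
from unittest import mock


def timestamped(message: str) -> str:
    """Function under test: prefixes a message with the current Unix time."""
    return f"{int(time.time())}: {message}"


class TimestampedTest(unittest.TestCase):
    def test_prefixes_message_with_time(self) -> None:
        # Replace the clock with a test double so the result is deterministic.
        with mock.patch("time.time", return_value=1700000000.0):
            self.assertEqual(timestamped("ok"), "1700000000: ok")
```

If this code were saved in a module, it could be run with `python -m unittest` followed by the module name.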