
Glossary of Terms

Some of the terms defined here are industry standards, while others are not standard, but they are useful for our purposes.

Some definitions are adapted from the following sources, which are indicated below using the same numbers, i.e., [1] and [2]:

  1. MLCommons AI Safety v0.5 Benchmark Proof of Concept Technical Glossary
  2. NIST Artificial Intelligence Risk Management Framework (AI RMF 1.0)

Sometimes we will use a term that could be defined, but we won’t provide a definition for brevity. We show these terms in italics. You can assume the usual, plain-sense meaning for the term, or in some cases it is easy to search for a definition.

Table of contents
  1. Glossary of Terms
    1. Acceptance Benchmark
    2. Acceptance Test
    3. Agent
    4. AI Actor
    5. AI System
    6. Alignment
    7. Automatable
    8. Behavior
    9. Benchmark
    10. Class
    11. Component
    12. Concurrent
    13. Context
    14. Cohesion
    15. Coupling
    16. Design by Contract
    17. Dataset
    18. Determinism
    19. Explainability
    20. Evaluation
    21. Evaluation Framework
    22. Fairness
    23. Feature
    24. Function
    25. Functional Programming
    26. Generative AI Model
    27. Hallucination
    28. Immutable
    29. Inference
    30. Integration Benchmark
    31. Integration Test
    32. Large Language Model
    33. Model Context Protocol
    34. Object-Oriented Programming
    35. Multimodal Model
    36. Mutable
    37. Paradigm
    38. Predictable
    39. Probability and Statistics
    40. Prompt
    41. Prompt Engineering
    42. Refactoring
    43. Regression
    44. Reinforcement Learning
    45. Repeatable
    46. Retrieval-augmented Generation
    47. Response
    48. Robustness
    49. Scenario
    50. Sequential
    51. Side Effect
    52. State
    53. State Machine
    54. Stochastic
    55. System Prompt
    56. Teacher Model
    57. Test
    58. Test Double
    59. Test-Driven Development
    60. Token
    61. Training
    62. Tuning
    63. Unit
    64. Unit Benchmark
    65. Use Case
    66. Unit Test

Acceptance Benchmark

The analog of Acceptance Tests for an AI-enabled system that has Stochastic behaviors. Benchmark technology is adapted for the purpose.

See also Unit Test, Unit Benchmark, Integration Test, Integration Benchmark, and Acceptance Test.

Acceptance Test

A test that verifies a user-visible feature works as required, often by driving the user interface or calling the external API. These tests are system-wide and end-to-end. They are sometimes executed manually, if automation isn’t feasible.

However, it is desirable to make them automated, in which case all operations with Side Effects need to be replaced with Deterministic Test Doubles.

See also Test, Unit Test, Unit Benchmark, Integration Test, Integration Benchmark, and Acceptance Benchmark.

Agent

An old concept in AI, but now experiencing a renaissance as the most flexible architecture pattern for AI-based applications. Agents are orchestrations of Generative AI Model and external service invocations, e.g., planners, schedulers, reasoning engines, data sources (weather, search, …), etc. In this architecture, the best capabilities of each service and model are leveraged, rather than assuming that models can do everything successfully themselves. Agent-based applications sometimes use multiple models, one per agent, where each one provides some specific capabilities. For example, one model might process user prompts into back-end API invocations, including to other models, and interpret the results into user-friendly responses.

AI Actor

[2] An organization or individual building an AI System.

AI System

Umbrella term for an application or system with AI Components, including Datasets, Generative AI Models (e.g., LLMs), Evaluation Frameworks and Evaluations for safety detection and mitigation, etc., plus external services, databases for runtime queries, and other application logic that together provide functionality.

Alignment

A general term for how well an AI System’s outputs (e.g., replies to queries) and Behaviors correspond to end-user and service provider objectives, including the quality and utility of results, as well as safety requirements. Quality implies factual correctness and utility implies the results are fit for purpose, e.g., a Q&A system should answer user questions concisely and directly, a Python code-generation system should output valid, bug-free, and secure Python code. EleutherAI defines alignment this way, “Ensuring that an artificial intelligence system behaves in a manner that is consistent with human values and goals.” See also the work of the Alignment Forum.

Automatable

Can an action, like a test, be automated so it can be executed without human intervention?

Behavior

What does a Component do, either autonomously on its own (e.g., a security monitoring tool that is constantly running) or when invoked by another component through an API or Function call? This is a general-purpose term that could cover a single Feature, a whole Use Case or anything in between.

Benchmark

[1] A methodology or Function used for offline Evaluation of a Generative AI Model or AI System for a particular purpose and to interpret the results. It consists of:

  • A set of tests with metrics.
  • A summarization of the results.

See also Unit Benchmark, Integration Benchmark, and Acceptance Benchmark.
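To make the two parts concrete, here is a minimal sketch in Python of what a benchmark could look like; the test cases, metrics, and names are purely hypothetical, not taken from any benchmark suite:

```python
from statistics import mean

# Hypothetical benchmark sketch: each test case pairs a prompt with a metric
# that scores the model's response; a summary aggregates the scores.
test_cases = [
    {"prompt": "Translate 'hello' to French.",
     "metric": lambda response: float("bonjour" in response.lower())},
    {"prompt": "Is the sky green? Answer yes or no.",
     "metric": lambda response: float("no" in response.lower())},
]

def run_benchmark(model_fn, cases):
    """model_fn is any callable mapping a prompt string to a response string."""
    scores = [case["metric"](model_fn(case["prompt"])) for case in cases]
    return {"mean_score": mean(scores), "num_cases": len(scores)}

# A trivial stand-in "model", just to show the shape of the summarized result:
print(run_benchmark(lambda prompt: "No -- bonjour!", test_cases))
```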

Class

The primary Component abstraction in Object-Oriented Programming, although not necessarily the only one.

Component

An ill-defined, but often-used term in software. Here we use it generically to refer to any piece of software with a well-defined purpose and an access API that defines clear boundaries. Depending on the programming language, it may group together Functions, Classes, etc. Particular programming languages and “paradigms” (like OOP and FP) use terms like packages, modules, subsystems, and libraries; even web services can be considered components.

In principle, a component could contain a single Unit. So, for simplicity in the rest of the text, we will use Component as an umbrella term that could also mean an individual Unit, unless it is important to make finer distinctions.

Concurrent

When work can be partitioned into smaller steps that can be executed in any order and the runtime executes them in an unpredictable order. If the order is predictable, no matter how it is executed, we can say it is effectively Sequential.

Context

Additional information passed to an LLM as part of a user Prompt, intended to provide useful context so that the Response is better than if the user’s prompt were passed to the LLM alone. This additional content may include a System Prompt, relevant documents retrieved using RAG, etc.

Cohesion

Does a Component feel like “one thing” with a single purpose, exhibiting well-defined Behaviors with a coherent State? Or does it feel like a miscellaneous collection of behaviors or state?

Coupling

How closely connected is one Component to others in the system? “Loose” coupling is preferred, because it makes it easier to test components in isolation, substitute replacements when needed, etc. Strongly coupled components often indicate poor abstraction boundaries between them.

Design by Contract

Design by Contract (“DbC”) is an idea developed by Bertrand Meyer and incorporated into his Eiffel programming language. In Eiffel, all functions can define a contract for allowed inputs, invariants, and guaranteed responses if the input requirements are met. The runtime system handles any failures of these contracts. A core principle of DbC is that contract failures should terminate the application immediately, forcing the developers to fix the issue; failing to do so becomes an excuse to let bugs accumulate. If this principle was rigorously followed during development, it is often considered acceptable (or at least “expedient”) to log contract failures, but not terminate execution, in production runs. DbC can be used in other languages through built-in features (like assertions), libraries, or various runtime features.
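A minimal sketch of the idea in Python, approximating Eiffel-style preconditions and postconditions with plain assertions (Python has no built-in DbC support, and the example function is made up):

```python
def withdraw(balance: float, amount: float) -> float:
    # Preconditions: callers must request a positive amount no larger than the balance.
    assert amount > 0, "precondition violated: amount must be positive"
    assert amount <= balance, "precondition violated: amount exceeds balance"

    new_balance = balance - amount

    # Postcondition: the returned balance is never negative.
    assert new_balance >= 0, "postcondition violated: negative balance"
    return new_balance

print(withdraw(100.0, 30.0))   # 70.0
# withdraw(100.0, 500.0) would raise AssertionError, terminating by default.
```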

DbC provides many of the same design benefits as TDD, which emerged later, such as directing attention to more rigorous API design. Because of the additional benefits of TDD, DbC has largely fallen out of practice, but its formalism for what constitutes good contracts is still highly valuable and recommended for study.

Dataset

(See also [1]) A collection of data items used for training, evaluation, etc. Usually, a given dataset has a schema (which may be “this is unstructured text”) and some metadata about provenance, licenses for use, transformations and filters applied, etc.

Determinism

The output of a Component for a given input is always known precisely. This affords writing repeatable, predictable software and automated, reliable tests.

In contrast, nondeterminism means identical inputs yield different results, removing Repeatability and complicating Predictability, and the ability to write automated, reliable tests.

Explainability

Can humans understand why the system behaves the way that it does in a particular Use Case? Can the system provide additional information about why it produced a particular output?

Evaluation

Much like other software, models and AI systems need to be trusted and useful to their users. Evaluation aims to provide the evidence needed to gain users’ confidence for an AI System.

A particular evaluation is the capability of measuring and quantifying how a Generative AI Model, e.g., an LLM, or an AI System as a whole handles Prompts and the kinds of Responses produced. For example, an evaluation might be used to detect hate speech in prompts and responses, to check whether responses contain hallucinations, to measure the overhead (time and compute) for processing, and, for our purposes, to verify that a required Use Case is implemented, etc.

An evaluation may be implemented in one of several ways. A classifier LLM or another kind of model might be used to score content. A Dataset of examples is commonly used. For our purposes, an implementation is API compatible for execution within an Evaluation Framework.

See also Evaluation Framework.

Evaluation Framework

An umbrella term for the software tools, runtime services, benchmark systems, etc. used to perform Evaluations by running their implementations to measure AI systems for trust and safety risks and mitigations, and other concerns.

Fairness

Do the AI system’s responses exhibit social biases, preferential treatment, or other forms of non-objectivity?

Feature

For our purposes, a small bit of functionality provided by an application. It is the increment of change in a single cycle of the Test-Driven Development process, which could be enhancing some user-visible functionality or adding new functionality in small increments. See also Use Case.

Function

In most languages, the most fundamental Unit of abstraction and execution. Depending on the language, the term function or method might be used, where the latter are special functions associated with Classes in OOP languages. Some languages allow code blocks outside of functions, perhaps inside alternative Component boundaries, but this is not important for our purposes.

Many functions are free of Side Effects, meaning they don’t read or write State external to the function and shared by other functions. These functions are always Deterministic; for a given input(s) they always return the same output. This is a very valuable property for design, testing, and reuse.

Other functions read and possibly write external state, making them nondeterministic, as are functions implemented with Concurrency in a way that makes the order of results nondeterministic. Examples include functions that retrieve data, such as a database record, functions that generate UUIDs, and functions that call other processes or systems.
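A small illustration of the difference, using hypothetical functions:

```python
import uuid

def add(x: int, y: int) -> int:
    # Pure: no shared state is read or written, so the result is fully
    # determined by the arguments -- easy to test and reuse.
    return x + y

counter = 0

def next_request_id() -> str:
    # Impure: reads and writes module-level state and generates a UUID,
    # so repeated calls return different values -- a Side Effect, nondeterministic.
    global counter
    counter += 1
    return f"{counter}-{uuid.uuid4()}"
```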

Functional Programming

FP is a design methodology, inspired by the behavior of mathematical functions, that attempts to formalize the properties of Functions. State is maintained in a small set of abstractions, like Maps, Lists, and Sets, with operations that are implemented separately following protocol abstractions exposed by the collections. Like mathematical objects and unlike objects in Object-Oriented Programming, mutation of State is prohibited; any operation, like adding elements to a collection, creates a new, Immutable copy.
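For example, in a functional style an “update” returns a new, immutable value instead of mutating the original, sketched here with Python tuples:

```python
def append_item(items: tuple, item) -> tuple:
    # Returns a new tuple; the original is left untouched.
    return items + (item,)

original = (1, 2, 3)
updated = append_item(original, 4)
print(original)  # (1, 2, 3) -- unchanged
print(updated)   # (1, 2, 3, 4)
```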

FP became popular when concurrent software became more widespread in the 2000s, because the immutable objects lead to far fewer concurrency bugs. FP languages may have other Component constructs for grouping of functions, e.g., into libraries.

Contrast with Object-Oriented Programming. Many programming languages combine aspects of FP and OOP.

Generative AI Model

A combination of data and code, usually trained on a Dataset, to support Inference of some kind.

For convenience, in the text, we use the shorthand term model to refer to the generative AI Component that has Nondeterministic Behavior, whether it is a model invoked directly through an API in the same application or invoked by calling another service (e.g., ChatGPT). The goal of this project is to better understand how developers can test models.

See also Large Language Model (LLMs) and Multimodal Model.

Hallucination

When a Generative AI Model generates text that seems plausible, but is not factually accurate. Lying is not the right term, because there is no malice intended by the model, which only knows how to generate a sequence of Tokens that are plausible. Which token is actually returned in a given context is a Stochastic process, i.e., a random process governed by a Probability distribution.

Immutable

A Unit’s or Component’s State cannot be modified, once it has been initialized. If all units in a Component are immutable, then the component itself is considered immutable. Contrast with Mutable. See also State.

Inference

Sending information to a Generative AI Model or AI System to have it return an analysis of some kind, summarization of the input, or newly generated information, such as text. The term query is typically used when working with LLMs. The term inference comes from traditional statistical analysis, including model building, that is used to infer information from data.

Integration Benchmark

The analog of Integration Tests for several Units and Components working together, where some of them are AI-enabled and exhibit Stochastic behaviors. Benchmark technology is adapted for the purpose.

See also Unit Test, Unit Benchmark, Integration Test, Acceptance Test, and Acceptance Benchmark.

Integration Test

A test for several Units and Components working together that verifies they interoperate properly. These components could be distributed systems, too. When any of the units that are part of the test have Side Effects and the purpose of the test is not to explore handling of such side effects, all units with side effects should be replaced with Test Doubles to make the test Deterministic.

See also Test, Unit Test, Unit Benchmark, Integration Benchmark, Acceptance Test, and Acceptance Benchmark.

Large Language Model

Abbreviated LLM, a state-of-the-art Generative AI Model, often with billions of parameters, that has the ability to summarize, classify, and even generate text in one or more spoken and programming languages. See also Multimodal Model.

Model Context Protocol

Abbreviated MCP, a de-facto standard for communications between models, agents, and other tools. See modelcontextprotocol.io for more information.

Object-Oriented Programming

OOP (or OOSD - object-oriented software development) is a design methodology that creates software Components with boundaries that mimic real-world objects (like Person, Automobile, Shopping Cart, etc.). Each object encapsulates State and Behavior behind its abstraction.

Introduced in the Simula language in the 1960s, it gained widespread interest in the 1980s with the emergence of graphical user interfaces (GUIs), where objects like Windows, Buttons, and Menus were an intuitive way to organize such software.

Contrast with Functional Programming. Many programming languages combine elements of FP and OOP.

Multimodal Model

Generative AI Models that usually extend the text-based capabilities of LLMs with additional support for other media, such as video, audio, still images, or other kinds of data.

Mutable

A Unit’s State can be modified during execution, either through direct manipulation by another unit or indirectly by invoking the unit (e.g., calling a Function that changes the state). If any one unit in a Component is mutable, then the component itself is considered mutable. Contrast with Immutable. See also State.

Paradigm

From the Merriam-Webster Dictionary definition of paradigm: “a philosophical and theoretical framework of a scientific school or discipline within which theories, laws, and generalizations and the experiments performed in support of them are formulated.”

Predictable

In the context of software, the quality that, knowing a Unit’s or Component’s design and its history of past Behavior, you can predict its future behavior reliably. See also State Machine.

Probability and Statistics

Two interrelated branches of mathematics, where statistics concerns such tasks as collecting, analyzing, and interpreting data, while probability concerns observations, in particular the percentage likelihood that certain values will be measured when observations are made of a random process, or more precisely, a random probability distribution, like heads or tails when flipping a coin. This probability distribution is the simplest possible; there is a 50-50 chance of heads or tails (assuming a fair coin). The probability distribution for rolling a particular sum with a pair of dice is less simple, but straightforward. The probability distribution for the heights of women in the United States is more complicated, where historical data determines the distribution, not a simple formula.

Both disciplines emerged together to solve practical problems in science, industry, sociology, etc. It is common for researchers to build a mathematical model (in the general sense of the word, not just an AI model) of the system being studied, in part to compare actual results with predictions from the model, confirming or rejecting the underlying theories about the system upon which the model was built. Also, if the model is accurate, it provides predictive capabilities for possible and likely future observations.

Contrast with Determinism. See also Stochastic.

Prompt

The query a user (or another system) sends to an LLM. Often, additional Context information is added by an AI System before sending the prompt to the LLM. See also Prompt Engineering.

Prompt Engineering

A term for the careful construction of good Prompts to maximize the quality of Inference responses. It is really considered more art than science or engineering because of the subjective relationship between prompts and responses for Generative AI Models.

Refactoring

Modifying code to change its structure as required to support a new feature. No Behavior changes are introduced, so that the existing automated Tests can verify that no regressions are introduced as the code is modified. This is the first step in the Test-Driven Development cycle.

Regression

When an unexpected Behavior change is introduced into a previously-working Unit, because of a change made to the code base, often in other units for unrelated functionality.

Automated Tests are designed to catch regressions as soon as they occur, making it easier to diagnose the change that caused the regression, as well as detecting the regression in the first place.

Reinforcement Learning

Reinforcement learning (RL) is a form of machine learning, often used for optimizing control or similar systems. In RL, an agent performs a loop where it observes the state of the “world” visible to it at the current time, then takes what it thinks is a suitable action for the next step, chosen to maximize a reward signal, often with the goal of maximizing the long-term reward, such as winning a game. The reinforcement aspect is an update at each step to a model of some kind used by the agent to assess which steps produce which rewards, given the current state. However, when choosing the next step, the best choice is not always made. Some degree of randomness is introduced so that the agent explores all possible states and rewards, rather than getting stuck always making choices that are known to be good, but may be less optimal than unknown choices.

In the generative AI context, RL is a popular tool in the suite of model Tuning processes that are used to improve model performance in various ways.

See also Reinforcement Finetuning in From Testing to Tuning.

Repeatable

If an action, like running a test, is run repeatedly with no code or data changes, does it return the same results every time? By design, Generative AI Models are expected to return different results each time a query is repeated.

Retrieval-augmented Generation

RAG was one of the first AI-specific design patterns for applications. It uses one or more data stores with information relevant to an application’s use cases. For example, a ChatBot for automotive repair technicians would use RAG to retrieve sections from repair manuals and logs from past service jobs, selecting the ones that are most relevant to a particular problem or subsystem the technician is working on. This Context is passed as part of the Prompt to the LLM.

A key design challenge is determining relevancy and structuring the data so that relevant information is usually retrieved. This is typically done by breaking the reference data into “chunks” and encoding each chunk in a vector representation (an embedding), over which a similarity metric can be computed. During inference, the prompt is passed through the same encoding and the top few nearest neighbors, based on the metric, are returned for the context, thereby attempting to ensure maximum relevancy.
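A minimal sketch of the retrieval step, assuming a hypothetical embed function that maps text to a vector; real systems use trained embedding models and vector databases rather than this brute-force search:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_context(prompt, chunks, embed, top_k=3):
    """Return the top_k chunks most similar to the prompt under the embedding."""
    prompt_vec = embed(prompt)
    scored = [(cosine_similarity(prompt_vec, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]

# The retrieved chunks are then prepended to the user's prompt as Context.
```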

See this IBM blog post for a description of RAG.

Response

The generic term for outputs from a Generative AI Model or AI System. Sometimes results is also used.

Robustness

How well does the AI System continue to perform within acceptable limits or degrade “gracefully” when stressed in some way? For example, how well does a Generative AI Model respond to prompts that deviate from its training data?

Scenario

One path through a use case, such as one “happy path” from beginning to end where a user completes a task or accomplishes a goal. A failure scenario is a path through the use case where the user is unable to succeed, due to system or user errors.

Note: When the text doesn’t link to this definition, it is because the word is being used generically or because the text already linked to this definition. Hopefully the context will be clear.

Sequential

The steps of some work are performed in a predictable, repeatable order. This property is one of the requirements for Deterministic Behavior. Contrast with Concurrent.

Side Effect

Reading and/or writing State shared outside a Unit, e.g., a Function reading or writing state shared with other functions. See also Determinism. If a Component contains units that perform side effects, then the component itself is considered to perform side effects.

State

Used in software to refer to a set of values in some context, like a Component. The values determine how the component will behave in subsequent invocations to perform some work. The values can sometimes be read directly by other components. If the component is Mutable, then the state can be changed by other components either directly or through invocations of the component that cause state transitions to occur. (For example, popping the top element of a stack changes the contents of the stack, the number of elements it currently holds, etc.)

Often, these state transitions are modeled with a State Machine, which constrains the allowed transitions.

State Machine

A formal model of how the State of a component can transition from one value (or set of values) to another. As an example, the TCP protocol has a well-defined state machine.
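A minimal sketch in Python, using a made-up order-processing example rather than the TCP state machine:

```python
# Allowed transitions: each state maps to the states it may move to next.
TRANSITIONS = {
    "created":   {"paid", "cancelled"},
    "paid":      {"shipped", "cancelled"},
    "shipped":   {"delivered"},
    "delivered": set(),
    "cancelled": set(),
}

def transition(current: str, target: str) -> str:
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target

state = "created"
state = transition(state, "paid")     # ok
state = transition(state, "shipped")  # ok
# transition(state, "paid") would raise ValueError -- not an allowed transition.
```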

Stochastic

The behavior of a system where observed values are governed by a random probability distribution. For example, when flipping a coin repeatedly, the observed values, heads or tails, are governed by a distribution that predicts 50% of the time heads will be observed and 50% of the time tails will be observed, assuming a fair coin (not weighted on one side or the other). The value you observe for any given flip is random; you can’t predict exactly which possibility will happen, only that there is an equal probability of heads or tails. After performing more and more flips, the total count of heads and tails should be very close to equal. See also Probabilities and Statistics.
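The coin-flip example is easy to simulate; a small sketch:

```python
import random
from collections import Counter

# Each flip is unpredictable, but over many flips the counts of heads and
# tails should each be close to 50% of the total.
flips = Counter(random.choice(["heads", "tails"]) for _ in range(100_000))
print(flips)
```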

System Prompt

A commonly-used, statically-coded part of the Context information added by an AI System to the Prompt before sending it to the LLM. System prompts are typically used to provide the model with overall guidance about the application’s purpose and how the LLM should respond. For example, it might include phrases like “You are a helpful software development assistant.”
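A sketch of how a system prompt and other Context might be assembled before calling the model; the chat-style message structure shown is one common convention, not a specific API:

```python
SYSTEM_PROMPT = "You are a helpful software development assistant."

def build_messages(user_prompt: str, retrieved_chunks: list[str]) -> list[dict]:
    # Combine the static system prompt with retrieved Context and the user's prompt.
    context = "\n\n".join(retrieved_chunks)
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {user_prompt}"},
    ]
```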

Teacher Model

A Generative AI Model that can be used as part of a Tuning (“teach”) process for other models, to generate synthetic data, to evaluate the quality of data, etc. These models are usually relatively large, sophisticated, and powerful, so they are very capable for these purposes, but they are often considered too costly to use as an application’s runtime model, where smaller, lower-overhead models are necessary. However, for software development purposes, less frequent use of teacher models is worth the higher cost for the services they provide.

Test

For our purposes, a Unit Test, Integration Test, or Acceptance Test.

Test Double

A test-only replacement for a Unit or a whole Component, usually because it has Side Effects and we need the Behavior to be Deterministic for the purposes of testing a dependent unit that uses it. For example, a function that queries a database can be replaced with a version that always returns a fixed value expected by the test. A mock is a popular kind of test double that uses the underlying runtime environment (e.g., the Python interpreter, the Java Virtual Machine - JVM) to intercept invocations of a unit and programmatically behave as desired by the tester.
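For example, in Python the standard library’s unittest.mock can replace a side-effecting function with a fixed value; the price-lookup functions below are hypothetical:

```python
from unittest.mock import patch

def lookup_price(item_id: str) -> float:
    # Imagine this queries a real database -- a Side Effect.
    raise NotImplementedError("requires a database connection")

def total_with_tax(item_id: str, tax_rate: float = 0.25) -> float:
    return lookup_price(item_id) * (1 + tax_rate)

def test_total_with_tax():
    # Replace the side-effecting unit with a deterministic Test Double.
    with patch(f"{__name__}.lookup_price", return_value=100.0):
        assert total_with_tax("sku-42") == 125.0
```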

See also Test, Unit Test, Integration Test, and Acceptance Test.

Test-Driven Development

When adding a Feature to a code base using TDD, the tests are written before the code is written. A three-step “virtuous” cycle is used, where changes are made incrementally and iteratively in small steps, one at a time (a small sketch follows the list):

  1. Refactor the code to change its structure as required to support the new feature, using the existing automated Tests to verify that no regressions are introduced. For example, it might be necessary to introduce an abstraction to support two “choices” where previously only one choice existed.
  2. Write a Test for the new feature. This is primarily a design exercise, because thinking about testing makes you think about usability, Behavior, etc., even though you are also creating a reusable test that will become part of the Regression test suite. Note that the test suite will fail to run at the moment, because the code doesn’t yet exist to make it pass!
  3. Write the new feature to make the new test (as well as all previously written tests) pass.
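As a small illustration of steps 2 and 3, a test for a hypothetical slugify feature is written first and fails until the function exists:

```python
# Step 2: write the test first -- it fails because slugify doesn't exist yet.
def test_slugify_replaces_spaces_and_lowercases():
    assert slugify("Hello TDD World") == "hello-tdd-world"

# Step 3: write just enough code to make the new test (and all earlier tests) pass.
def slugify(title: str) -> str:
    return "-".join(title.lower().split())
```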

The Wikipedia TDD article is a good place to start for more information.

See also Design by Contract.

Token

For language Generative AI Models, the training texts and query prompts are split into tokens, usually whole words or word fragments, according to a vocabulary of tens of thousands of tokens that can include common single characters, multi-character sequences, and “control” tokens (like “end of input”). The rule of thumb is a corpus will have roughly 1.5 times the number of tokens as it will have words.

Training

In our context, training refers to the processes used to teach a model, such as a Generative AI Model, how to do its intended job.

In the generative AI case, we often speak of pretraining, the training process that uses a massive data corpus to teach the model facts about the world, how to speak and understand human language, and how to perform some skills. However, the resulting model often does poorly on specialized tasks and even basic skills like following a user’s instructions, conforming to social norms (e.g., avoiding hate speech), etc.

That’s where a second Tuning phase comes in, a suite of processes used to improve the model’s performance on many general or specific skills.

Tuning

Tuning refers to one or more processes used to transform a Pretrained model into one that exhibits much better desired Behaviors (like instruction following) or specialized domain knowledge.

Unit

For our purposes, a unit is the subject of a Unit Test, the smallest granularity of functionality we care about. A unit can be a single Function that is being designed and written, but this may be happening in the larger context of a Component, such as a Class in an Object-Oriented Programming language or some other self-contained construct.

For simplicity, rather than say “unit and/or component” frequently in the text, we will often use just “component” as an umbrella term that could also mean either or both concepts, unless it is important to make finer distinctions.

Unit Benchmark

An adaptation of Benchmark tools and techniques for more fine-grained and targeted testing purposes, such as verifying Features and Use Cases work as designed. See the Unit Benchmarks chapter for details.

The same idea generalizes to the analogs of Integration Tests, namely Integration Benchmarks, and Acceptance Tests, namely Acceptance Benchmarks.

Use Case

A common term for an end-to-end user activity done with a system, often broken down into several Scenarios that describe different “paths” through the use case, including error scenarios, in addition to happy paths. Hence, scenarios would be the next level of granularity. Compare with Features, which would be the capabilities implemented one at a time to support the scenarios that make up a use case.

Unit Test

A test for a Unit that exercises its Behavior in isolation from all other Functions and State. When the unit being tested has Side Effects, because of other units it invokes, all such side effects must be replaced with Test Doubles to make the test Deterministic. Note that writing a unit test as part of Test-Driven Development inevitably begins with a Refactoring step to modify the code, while preserving the current behavior, so that it is better positioned to support implementing the new functionality.

See also Test, Unit Benchmark, Integration Test, Integration Benchmark, Acceptance Test, and Acceptance Benchmark.