
Glossary of Terms

Some of the terms defined here are industry standards, while others are not standard, but they are useful for our purposes.

Some definitions are adapted from the following sources, which are indicated below using the same numbers, i.e., [1] and [2]:

  1. MLCommons AI Safety v0.5 Benchmark Proof of Concept Technical Glossary
  2. NIST Artificial Intelligence Risk Management Framework (AI RMF 1.0)

Sometimes we will use a term that could be defined, but we won’t provide a definition for brevity. We show these terms in italics. You can assume the usual, plain-sense meaning for the term, or in some cases it is easy to search for a definition.

Table of contents
  1. Glossary of Terms
    1. Acceptance Benchmark
    2. Acceptance Test
    3. Agent
    4. AI Actor
    5. AI System
    6. Alignment
    7. Automatable
    8. Behavior
    9. Benchmark
    10. Class
    11. Component
    12. Concurrent
    13. Context
    14. Cohesion
    15. Coupling
    16. Design by Contract
    17. Dataset
    18. Determinism
    19. Explainability
    20. Evaluation
    21. Evaluation Framework
    22. Fairness
    23. Feature
    24. Function
    25. Functional Programming
    26. Generative AI Model
    27. Hallucination
    28. Immutable
    29. Inference
    30. Integration Benchmark
    31. Integration Test
    32. Large Language Model
    33. Model Context Protocol
    34. Object-Oriented Programming
    35. Multimodal Model
    36. Mutable
    37. Paradigm
    38. Predictable
    39. Probability and Statistics
    40. Prompt
    41. Prompt Engineering
    42. Refactoring
    43. Regression
    44. Reinforcement Learning
    45. Repeatable
    46. Retrieval-augmented Generation
    47. Response
    48. Robustness
    49. Scenario
    50. Sequential
    51. Side Effect
    52. State
    53. State Machine
    54. Stochastic
    55. System Prompt
    56. Teacher Model
    57. Test
    58. Test Double
    59. Test-Driven Development
    60. Token
    61. Training
    62. Tuning
    63. Unit
    64. Unit Benchmark
    65. Use Case
    66. Unit Test

Acceptance Benchmark

The analog of Acceptance Tests for an AI-enabled system that has Stochastic behaviors. Benchmark technology is adapted for the purpose.

See also Unit Test, Unit Benchmark, Integration Test, Integration Benchmark, and Acceptance Test.

Acceptance Test

A test that verifies a user-visible feature works as required, often by driving the user interface or calling the external API. These tests are system-wide and end-to-end. They are sometimes executed manually, if automation isn’t feasible.

However, it is desirable to make them automated, in which case all operations with Side Effects need to be replaced with Deterministic Test Doubles.

See also Test, Unit Test, Unit Benchmark, Integration Test, Integration Benchmark, and Acceptance Benchmark.

Agent

An old concept in AI, but now experiencing a renaissance as the most flexible architecture pattern for AI-based applications. Agents are orchestrations of Generative AI Model and external service invocations, e.g., planners, schedulers, reasoning engines, data sources (weather, search, …), etc. In this architecture, the best capabilities of each service and model are leveraged, rather than assuming that models can do everything successfully themselves. Agent-based applications sometimes use multiple models, one per agent, where each one provides some specific capabilities. For example, one model might process user prompts into back-end API invocations, including to other models, and interpret the results into user-friendly responses.

AI Actor

[2] An organization or individual building an AI System.

AI System

Umbrella term for an application or system with AI Components, including Datasets, Generative AI Models (e.g., LLMs), Evaluation Frameworks and Evaluations for safety detection and mitigation, etc., plus external services, databases for runtime queries, and other application logic that together provide functionality.

Alignment

A general term for how well an AI System’s outputs (e.g., replies to queries) and Behaviors correspond to end-user and service provider objectives, including the quality and utility of results, as well as safety requirements. Quality implies factual correctness and utility implies the results are fit for purpose, e.g., a Q&A system should answer user questions concisely and directly, a Python code-generation system should output valid, bug-free, and secure Python code. EleutherAI defines alignment this way, “Ensuring that an artificial intelligence system behaves in a manner that is consistent with human values and goals.” See also the work of the Alignment Forum.

Automatable

Can an action, like a test, be automated so it can be executed without human intervention?

Behavior

What does a Component do, either autonomously on its own (e.g., a security monitoring tool that is constantly running) or when invoked by another component through an API or Function call? This is a general-purpose term that could cover a single Feature, a whole Use Case or anything in between.

Benchmark

[1] A methodology or Function used for offline Evaluation of a Generative AI Model or AI System for a particular purpose and to interpret the results. It consists of:

  • A set of tests with metrics.
  • A summarization of the results.

See also Unit Benchmark, Integration Benchmark, and Acceptance Benchmark.
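To make the two parts concrete, here is a minimal sketch in Python of what a benchmark could look like; the test cases, metrics, and names are purely hypothetical, not taken from any benchmark suite:

```python
from statistics import mean

# Hypothetical benchmark sketch: each test case pairs a prompt with a metric
# that scores the model's response; a summary aggregates the scores.
test_cases = [
    {"prompt": "Translate 'hello' to French.",
     "metric": lambda response: float("bonjour" in response.lower())},
    {"prompt": "Is the sky green? Answer yes or no.",
     "metric": lambda response: float("no" in response.lower())},
]

def run_benchmark(model_fn, cases):
    """model_fn is any callable mapping a prompt string to a response string."""
    scores = [case["metric"](model_fn(case["prompt"])) for case in cases]
    return {"mean_score": mean(scores), "num_cases": len(scores)}

# A trivial stand-in "model", just to show the shape of the summarized result:
print(run_benchmark(lambda prompt: "No -- bonjour!", test_cases))
```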

Class

The primary Component abstraction in Object-Oriented Programming, although not necessarily the only one.

Component

An ill-defined, but often-used term in software. Here we use it generically to refer to any piece of software with a well-defined purpose and an access API that defines clear boundaries. Depending on the programming language, it may group together Functions, Classes, etc. Particular programming languages and “paradigms” (like OOP and FP) use terms like packages, modules, subsystems, and libraries; even web services can be considered components.

In principle, a component could contain a single Unit. So, for simplicity in the rest of the text, we will use Component as an umbrella term that could also mean an individual Unit, unless it is important to make finer distinctions.

Concurrent

When work can be partitioned into smaller steps that can be executed in any order and the runtime executes them in an unpredictable order. If the order is predictable, no matter how it is executed, we can say it is effectively Sequential.

Context

Additional information passed to an LLM as part of a user Prompt, intended to provide useful context so that the Response is better than if the user’s prompt were passed to the LLM alone. This additional content may include a System Prompt, relevant documents retrieved using RAG, etc.

Cohesion

Does a Component feel like “one thing” with a single purpose, exhibiting well-defined Behaviors with a coherent State? Or does it feel like a miscellaneous collection of behaviors or state?

Coupling

How closely connected is one Component to others in the system? “Loose” coupling is preferred, because it makes it easier to test components in isolation, substitute replacements when needed, etc. Strongly coupled components often indicate poor abstraction boundaries between them.

Design by Contract

Design by Contract (“DbC”) is an idea developed by Bertrand Meyer and incorporated into his Eiffel programming language. In Eiffel, all functions can define a contract for allowed inputs, invariants, and guaranteed responses if the input requirements are met. The runtime system handles any failures of these contracts. A core principle of DbC is that contract failures should terminate the application immediately, forcing the developers to fix the issue; failing to do so becomes an excuse to let bugs accumulate. If this principle was rigorously followed during development, it is often considered acceptable (or at least “expedient”) to log contract failures, but not terminate execution, in production runs. DbC can be used in other languages through built-in features (like assertions), libraries, or various runtime features.
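A minimal sketch of the idea in Python, approximating Eiffel-style preconditions and postconditions with plain assertions (Python has no built-in DbC support, and the example function is made up):

```python
def withdraw(balance: float, amount: float) -> float:
    # Preconditions: callers must request a positive amount no larger than the balance.
    assert amount > 0, "precondition violated: amount must be positive"
    assert amount <= balance, "precondition violated: amount exceeds balance"

    new_balance = balance - amount

    # Postcondition: the returned balance is never negative.
    assert new_balance >= 0, "postcondition violated: negative balance"
    return new_balance

print(withdraw(100.0, 30.0))   # 70.0
# withdraw(100.0, 500.0) would raise AssertionError, terminating by default.
```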

DbC provides many of the same design benefits as TDD, which emerged later, such as directing attention to more rigorous API design. Because of the additional benefits of TDD, DbC has largely fallen out of practice, but its formalism for what constitutes good contracts is still highly valuable and recommended for study.

Dataset

(See also [1]) A collection of data items used for training, evaluation, etc. Usually, a given dataset has a schema (which may be “this is unstructured text”) and some metadata about provenance, licenses for use, transformations and filters applied, etc.

Determinism

The output of a Component for a given input is always known precisely. This affords writing repeatable, predictable software and automated, reliable tests.

In contrast, nondeterminism means identical inputs yield different results, removing Repeatability and complicating Predictability, and the ability to write automated, reliable tests.

Explainability

Can humans understand why the system behaves the way that it does in a particular Use Case? Can the system provide additional information about why it produced a particular output?

Evaluation

Much like other software, models and AI systems need to be trusted and useful to their users. Evaluation aims to provide the evidence needed to gain users’ confidence for an AI System.

A particular evaluation is the capability of measuring and quantifying how a Generative AI Model, e.g., an LLM, or an AI System as a whole handles Prompts and the kinds of Responses produced. For example, an evaluation might be used to detect hate speech in prompts and responses, to check whether responses contain hallucinations, to measure the overhead (time and compute) for processing, and, for our purposes, to verify that a required Use Case is implemented, etc.

An evaluation may be implemented in one of several ways. A classifier LLM or another kind of model might be used to score content. A Dataset of examples is commonly used. For our purposes, an implementation is API compatible for execution within an Evaluation Framework.

See also Evaluation Framework.

Evaluation Framework

An umbrella term for the software tools, runtime services, benchmark systems, etc. used to perform Evaluations by running their implementations to measure AI systems for trust and safety risks and mitigations, and other concerns.

Fairness

Do the AI system’s responses exhibit social biases, preferential treatment, or other forms of non-objectivity?

Feature

For our purposes, a small bit of functionality provided by an application. It is the increment of change in a single cycle of the Test-Driven Development process, which could be enhancing some user-visible functionality or adding new functionality in small increments. See also Use Case.

Function

In most languages, the most fundamental Unit of abstraction and execution. Depending on the language, the term function or method might be used, where the latter are special functions associated with Classes in OOP languages. Some languages allow code blocks outside of functions, perhaps inside alternative Component boundaries, but this is not important for our purposes.

Many functions are free of Side Effects, meaning they don’t read or write State external to the function and shared by other functions. These functions are always Deterministic; for a given input(s) they always return the same output. This is a very valuable property for design, testing, and reuse.

Other functions read and possibly write external state, making them nondeterministic, as are functions implemented with Concurrency in a way that makes the order of results nondeterministic. Examples include functions that retrieve data, such as a database record, functions that generate UUIDs, and functions that call other processes or systems.
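A small illustration of the difference, using hypothetical functions:

```python
import uuid

def add(x: int, y: int) -> int:
    # Pure: no shared state is read or written, so the result is fully
    # determined by the arguments -- easy to test and reuse.
    return x + y

counter = 0

def next_request_id() -> str:
    # Impure: reads and writes module-level state and generates a UUID,
    # so repeated calls return different values -- a Side Effect, nondeterministic.
    global counter
    counter += 1
    return f"{counter}-{uuid.uuid4()}"
```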

Functional Programming

FP is a design methodology, inspired by the behavior of mathematical functions, that attempts to formalize the properties of Functions. State is maintained in a small set of abstractions, like Maps, Lists, and Sets, with operations that are implemented separately following protocol abstractions exposed by the collections. Like mathematical objects and unlike objects in Object-Oriented Programming, mutation of State is prohibited; any operation, like adding elements to a collection, creates a new, Immutable copy.
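For example, in a functional style an “update” returns a new, immutable value instead of mutating the original, sketched here with Python tuples:

```python
def append_item(items: tuple, item) -> tuple:
    # Returns a new tuple; the original is left untouched.
    return items + (item,)

original = (1, 2, 3)
updated = append_item(original, 4)
print(original)  # (1, 2, 3) -- unchanged
print(updated)   # (1, 2, 3, 4)
```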

FP became popular when concurrent software became more widespread in the 2000s, because the immutable objects lead to far fewer concurrency bugs. FP languages may have other Component constructs for grouping of functions, e.g., into libraries.

Contrast with Object-Oriented Programming. Many programming languages combine aspects of FP and OOP.

Generative AI Model

A combination of data and code, usually trained on a Dataset, to support Inference of some kind.

For convenience, in the text, we use the shorthand term model to refer to the generative AI Component that has Nondeterministic Behavior, whether it is a model invoked directly through an API in the same application or invoked by calling another service (e.g., ChatGPT). The goal of this project is to better understand how developers can test models.

See also Large Language Model (LLMs) and Multimodal Model.

Hallucination

When a Generative AI Model generates text that seems plausible, but is not factually accurate. Lying is not the right term, because there is no malice intended by the model, which only knows how to generate a sequence of Tokens that are plausible. Which token is actually returned in a given context is a Stochastic process, i.e., a random process governed by a Probability distribution.

Immutable

A Unit’s or Component’s State cannot be modified, once it has been initialized. If all units in a Component are immutable, then the component itself is considered immutable. Contrast with Mutable. See also State.

Inference

Sending information to a Generative AI Model or AI System to have it return an analysis of some kind, summarization of the input, or newly generated information, such as text. The term query is typically used when working with LLMs. The term inference comes from traditional statistical analysis, including model building, that is used to infer information from data.

Integration Benchmark

The analog of Integration Tests for several Units and Components working together, where some of them are AI-enabled and exhibit Stochastic behaviors. Benchmark technology is adapted for the purpose.

See also Unit Test, Unit Benchmark, Integration Test, Acceptance Test, and Acceptance Benchmark.

Integration Test

A test for several Units and Components working together that verifies they interoperate properly. These components could be distributed systems, too. When any of the units that are part of the test have Side Effects and the purpose of the test is not to explore handling of such side effects, all units with side effects should be replaced with Test Doubles to make the test Deterministic.

See also Test, Unit Test, Unit Benchmark, Integration Benchmark, Acceptance Test, and Acceptance Benchmark.

Large Language Model

Abbreviated LLM, a state-of-the-art Generative AI Model, often with billions of parameters, that has the ability to summarize, classify, and even generate text in one or more spoken and programming languages. See also Multimodal Model.

Model Context Protocol

Abbreviated MCP, a de-facto standard for communications between models, agents, and other tools. See modelcontextprotocol.io for more information.

Object-Oriented Programming

OOP (or OOSD - object-oriented software development) is a design methodology that creates software Components with boundaries that mimic real-world objects (like Person, Automobile, Shopping Cart, etc.). Each object encapsulates State and Behavior behind its abstraction.

Introduced in the Simula language in the 1960s, it gained widespread interest in the 1980s with the emergence of graphical user interfaces (GUIs), where objects like Windows, Buttons, and Menus were an intuitive way to organize such software.

Contrast with Functional Programming. Many programming languages combine elements of FP and OOP.

Multimodal Model

Generative AI Models that usually extend the text-based capabilities of LLMs with additional support for other media, such as video, audio, still images, or other kinds of data.

Mutable

A Unit’s State can be modified during execution, either through direct manipulation by another unit or indirectly by invoking the unit (e.g., calling a Function that changes the state). If any one unit in a Component is mutable, then the component itself is considered mutable. Contrast with Immutable. See also State.

Paradigm

From the Merriam-Webster Dictionary definition of paradigm: “a philosophical and theoretical framework of a scientific school or discipline within which theories, laws, and generalizations and the experiments performed in support of them are formulated.”

Predictable

In the context of software, the quality that, knowing a Unit’s or Component’s design and its history of past Behavior, you can predict its future behavior reliably. See also State Machine.

Probability and Statistics

Two interrelated branches of mathematics, where statistics concerns such tasks as collecting, analyzing, and interpreting data, while probability concerns observations, in particular the percentage likelihood that certain values will be measured when observations are made of a random process, or more precisely, a random probability distribution, like heads or tails when flipping a coin. This probability distribution is the simplest possible; there is a 50-50 chance of heads or tails (assuming a fair coin). The probability distribution for rolling a particular sum with a pair of dice is less simple, but straightforward. The probability distribution for the heights of women in the United States is more complicated, where historical data determines the distribution, not a simple formula.

Both disciplines emerged together to solve practical problems in science, industry, sociology, etc. It is common for researchers to build a mathematical model (in the general sense of the word, not just an AI model) of the system being studied, in part to compare actual results with predictions from the model, confirming or rejecting the underlying theories about the system upon which the model was built. Also, if the model is accurate, it provides predictive capabilities for possible and likely future observations.

Contrast with Determinism. See also Stochastic.

Prompt

The query a user (or another system) sends to an LLM. Often, additional Context information is added by an AI System before sending the prompt to the LLM. See also Prompt Engineering.

Prompt Engineering

A term for the careful construction of good Prompts to maximize the quality of Inference responses. It is really considered more art than science or engineering because of the subjective relationship between prompts and responses for Generative AI Models.

Refactoring

Modifying code to change its structure as required to support a new feature. No Behavior changes are introduced, so that the existing automated Tests can verify that no regressions are introduced as the code is modified. This is the first step in the Test-Driven Development cycle.

Regression

When an unexpected Behavior change is introduced into a previously-working Unit, because of a change made to the code base, often in other units for unrelated functionality.

Automated Tests are designed to catch regressions as soon as they occur, making it easier to diagnose the change that caused the regression, as well as detecting the regression in the first place.

Reinforcement Learning

Reinforcement learning (RL) is a form of machine learning, often used for optimizing control or similar systems. In RL, an agent performs a loop where it observes the state of the “world” visible to it at the current time, then takes what it thinks is a suitable action for the next step, chosen to maximize a reward signal, often with the goal of maximizing the long-term reward, such as winning a game. The reinforcement aspect is an update at each step to a model of some kind used by the agent to assess which steps produce which rewards, given the current state. However, when choosing the next step, the best choice is not always made. Some degree of randomness is introduced so that the agent explores all possible states and rewards, rather than getting stuck always making choices that are known to be good, but may be less optimal than unknown choices.

In the generative AI context, RL is a popular tool in the suite of model Tuning processes that are used to improve model performance in various ways.

See also Reinforcement Finetuning in From Testing to Tuning.

Repeatable

If an action, like running a test, is run repeatedly with no code or data changes, does it return the same results every time? By design, Generative AI Models are expected to return different results each time a query is repeated.

Retrieval-augmented Generation

RAG was one of the first AI-specific design patterns for applications. It uses one or more data stores with information relevant to an application’s use cases. For example, a ChatBot for automotive repair technicians would use RAG to retrieve sections from repair manuals and logs from past service jobs, selecting the ones that are most relevant to a particular problem or subsystem the technician is working on. This Context is passed as part of the Prompt to the LLM.

A key design challenge is determining relevancy and structuring the data so that relevant information is usually retrieved. This is typically done by breaking the reference data into “chunks” and encoding each chunk in a vector representation (an embedding), over which a similarity metric can be computed. During inference, the prompt is passed through the same encoding and the top few nearest neighbors, based on the metric, are returned for the context, thereby attempting to ensure maximum relevancy.
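A minimal sketch of the retrieval step, assuming a hypothetical embed function that maps text to a vector; real systems use trained embedding models and vector databases rather than this brute-force search:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_context(prompt, chunks, embed, top_k=3):
    """Return the top_k chunks most similar to the prompt under the embedding."""
    prompt_vec = embed(prompt)
    scored = [(cosine_similarity(prompt_vec, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]

# The retrieved chunks are then prepended to the user's prompt as Context.
```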

See this IBM blog post for a description of RAG.

Response

The generic term for outputs from a Generative AI Model or AI System. Sometimes results is also used.

Robustness

How well does the AI System continue to perform within acceptable limits or degrade “gracefully” when stressed in some way? For example, how well does a Generative AI Model respond to prompts that deviate from its training data?

Scenario

One path through a use case, such as one “happy path” from beginning to end where a user completes a task or accomplishes a goal. A failure scenario is a path through the use case where the user is unable to succeed, due to system or user errors.

Note: When the text doesn’t link to this definition, it is because the word is being used generically or because the text already linked to this definition. Hopefully the context will be clear.

Sequential

The steps of some work are performed in a predictable, repeatable order. This property is one of the requirements for Deterministic Behavior. Contrast with Concurrent.

Side Effect

Reading and/or writing State shared outside a Unit, e.g., a Function reading or writing state shared with other functions. See also Determinism. If a Component contains units that perform side effects, then the component itself is considered to perform side effects.

State

Used in software to refer to a set of values in some context, like a Component. The values determine how the component will behave in subsequent invocations to perform some work. The values can sometimes be read directly by other components. If the component is Mutable, then the state can be changed by other components either directly or through invocations of the component that cause state transitions to occur. (For example, popping the top element of a stack changes the contents of the stack, the number of elements it currently holds, etc.)

Often, these state transitions are modeled with a State Machine, which constrains the allowed transitions.

State Machine

A formal model of how the State of a component can transition from one value (or set of values) to another. As an example, the TCP protocol has a well-defined state machine.
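A minimal sketch in Python, using a made-up order-processing example rather than the TCP state machine:

```python
# Allowed transitions: each state maps to the states it may move to next.
TRANSITIONS = {
    "created":   {"paid", "cancelled"},
    "paid":      {"shipped", "cancelled"},
    "shipped":   {"delivered"},
    "delivered": set(),
    "cancelled": set(),
}

def transition(current: str, target: str) -> str:
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target

state = "created"
state = transition(state, "paid")     # ok
state = transition(state, "shipped")  # ok
# transition(state, "paid") would raise ValueError -- not an allowed transition.
```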

Stochastic

The behavior of a system where observed values are governed by a random probability distribution. For example, when flipping a coin repeatedly, the observed values, heads or tails, are governed by a distribution that predicts 50% of the time heads will be observed and 50% of the time tails will be observed, assuming a fair coin (not weighted on one side or the other). The value you observe for any given flip is random; you can’t predict exactly which possibility will happen, only that there is an equal probability of heads or tails. After performing more and more flips, the total count of heads and tails should be very close to equal. See also Probabilities and Statistics.
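The coin-flip example is easy to simulate; a small sketch:

```python
import random
from collections import Counter

# Each flip is unpredictable, but over many flips the counts of heads and
# tails should each be close to 50% of the total.
flips = Counter(random.choice(["heads", "tails"]) for _ in range(100_000))
print(flips)
```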

System Prompt

A commonly-used, statically-coded part of the Context information added by an AI System to the Prompt before sending it to the LLM. System prompts are typically used to provide the model with overall guidance about the application’s purpose and how the LLM should respond. For example, it might include phrases like “You are a helpful software development assistant.”
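A sketch of how a system prompt and other Context might be assembled before calling the model; the chat-style message structure shown is one common convention, not a specific API:

```python
SYSTEM_PROMPT = "You are a helpful software development assistant."

def build_messages(user_prompt: str, retrieved_chunks: list[str]) -> list[dict]:
    # Combine the static system prompt with retrieved Context and the user's prompt.
    context = "\n\n".join(retrieved_chunks)
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {user_prompt}"},
    ]
```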

Teacher Model

A Generative AI Model that can be used as part of a Tuning (“teach”) process for other models, to generate synthetic data, to evaluate the quality of data, etc. These models are usually relatively large, sophisticated, and powerful, so they are very capable for these purposes, but they are often considered too costly to use as an application’s runtime model, where smaller, lower-overhead models are necessary. However, for software development purposes, less frequent use of teacher models is worth the higher cost for the services they provide.

Test

For our purposes, a Unit Test, Integration Test, or Acceptance Test.

Test Double

A test-only replacement for a Unit or a whole Component, usually because it has Side Effects and we need the Behavior to be Deterministic for the purposes of testing a dependent unit that uses it. For example, a function that queries a database can be replaced with a version that always returns a fixed value expected by the test. A mock is a popular kind of test double that uses the underlying runtime environment (e.g., the Python interpreter, the Java Virtual Machine - JVM) to intercept invocations of a unit and programmatically behave as desired by the tester.
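For example, in Python the standard library’s unittest.mock can replace a side-effecting function with a fixed value; the price-lookup functions below are hypothetical:

```python
from unittest.mock import patch

def lookup_price(item_id: str) -> float:
    # Imagine this queries a real database -- a Side Effect.
    raise NotImplementedError("requires a database connection")

def total_with_tax(item_id: str, tax_rate: float = 0.25) -> float:
    return lookup_price(item_id) * (1 + tax_rate)

def test_total_with_tax():
    # Replace the side-effecting unit with a deterministic Test Double.
    with patch(f"{__name__}.lookup_price", return_value=100.0):
        assert total_with_tax("sku-42") == 125.0
```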

See also Test, Unit Test, Integration Test, and Acceptance Test.

Test-Driven Development

When adding a Feature to a code base using TDD, the tests are written before the code is written. A three-step “virtuous” cycle is used, where changes are made incrementally and iteratively in small steps, one at a time (a small sketch follows the list):

  1. Refactor the code to change its structure as required to support the new feature, using the existing automated Tests to verify that no regressions are introduced. For example, it might be necessary to introduce an abstraction to support two “choices” where previously only one choice existed.
  2. Write a Test for the new feature. This is primarily a design exercise, because thinking about testing makes you think about usability, Behavior, etc., even though you are also creating a reusable test that will become part of the Regression test suite. Note that the test suite will fail to run at the moment, because the code doesn’t yet exist to make it pass!
  3. Write the new feature to make the new test (as well as all previously written tests) pass.
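As a small illustration of steps 2 and 3, a test for a hypothetical slugify feature is written first and fails until the function exists:

```python
# Step 2: write the test first -- it fails because slugify doesn't exist yet.
def test_slugify_replaces_spaces_and_lowercases():
    assert slugify("Hello TDD World") == "hello-tdd-world"

# Step 3: write just enough code to make the new test (and all earlier tests) pass.
def slugify(title: str) -> str:
    return "-".join(title.lower().split())
```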

The Wikipedia TDD article is a good place to start for more information.

See also Design by Contract.

Token

For language Generative AI Models, the training texts and query prompts are split into tokens, usually whole words or word fragments, according to a vocabulary of tens of thousands of tokens that can include common single characters, multi-character sequences, and “control” tokens (like “end of input”). The rule of thumb is a corpus will have roughly 1.5 times the number of tokens as it will have words.

Training

In our context, training refers to the processes used to teach a model, such as a Generative AI Model, how to do its intended job.

In the generative AI case, we often speak of pretraining, the training process that uses a massive data corpus to teach the model facts about the world, how to speak and understand human language, and how to perform some skills. However, the resulting model often does poorly on specialized tasks and even basic skills like following a user’s instructions, conforming to social norms (e.g., avoiding hate speech), etc.

That’s where a second Tuning phase comes in, a suite of processes used to improve the model’s performance on many general or specific skills.

Tuning

Tuning refers to one or more processes used to transform a Pretrained model into one that exhibits much better desired Behaviors (like instruction following) or specialized domain knowledge.

Unit

For our purposes, a unit is the subject of a Unit Test, the smallest granularity of functionality we care about. A unit can be a single Function that is being designed and written, but this may be happening in the larger context of a Component, such as a Class in an Object-Oriented Programming language or some other self-contained construct.

For simplicity, rather than say “unit and/or component” frequently in the text, we will often use just “component” as an umbrella term that could also mean either or both concepts, unless it is important to make finer distinctions.

Unit Benchmark

An adaptation of Benchmark tools and techniques for more fine-grained and targeted testing purposes, such as verifying Features and Use Cases work as designed. See the Unit Benchmarks chapter for details.

The same idea generalizes to the analogs of Integration Tests, namely Integration Benchmarks, and Acceptance Tests, namely Acceptance Benchmarks.

Use Case

A common term for an end-to-end user activity done with a system, often broken down into several Scenarios that describe different “paths” through the use case, including error scenarios, in addition to happy paths. Hence, scenarios would be the next level of granularity. Compare with Features, which would be the capabilities implemented one at a time to support the scenarios that make up a use case.

Unit Test

A test for a Unit that exercises its Behavior in isolation from all other Functions and State. When the unit being tested has Side Effects, because of other units it invokes, all such side effects must be replaced with Test Doubles to make the test Deterministic. Note that writing a unit test as part of Test-Driven Development inevitably begins with a Refactoring step to modify the code, while preserving the current behavior, so that it is better positioned to support implementing the new functionality.

See also Test, Unit Benchmark, Integration Test, Integration Benchmark, Acceptance Test, and Acceptance Benchmark.