The AI Alliance Glossary of Terms

About this Glossary

Welcome to The AI Alliance Glossary, a resource shared across our websites to provide consistent definitions of common terms.

Some of the terms defined here are industry standards, while others are more informal, but still useful for our purposes. Some definitions are adapted from the following sources, which are indicated below using the same numbers, i.e., [1] and [2]:

  1. MLCommons AI Safety v0.5 Benchmark Proof of Concept Technical Glossary (discussed here).
  2. NIST Artificial Intelligence Risk Management Framework (AI RMF 1.0) (discussed here).

Also, a few definitions quote the Merriam-Webster Dictionary, where noted.

Sometimes we will show a term in italics without a definition. This is done for brevity and the usual, plain-sense meaning for the term can be assumed in the context where it appears. For hyphenated terms, we use Foo-Bar-Baz rather than Foo-bar-baz, since the former is a little more common in industry usage.

A

Acceptance Benchmark

The analog of Acceptance Tests for an AI-enabled system that has Stochastic behaviors. Benchmark technology is adapted for the purpose.

See also Unit Test, Unit Benchmark, Integration Test, Integration Benchmark, and Acceptance Test.

Acceptance Test

A test that verifies a user-visible feature works as required, often by driving the user interface or calling the external API. These tests are system-wide and end-to-end. They are sometimes executed manually, if automation isn’t feasible.

However, it is desirable to make them automated, in which case all operations with Side Effects need to be replaced with Deterministic Test Doubles.

See also Test, Unit Test, Unit Benchmark, Integration Test, Integration Benchmark, and Acceptance Benchmark.

Accountability

An aspect of Governance, where we trace behaviors through AI Systems to their causes. Related is the need for organizations to take responsibility for the behaviors of the AI systems they deploy.

Adaptation

A general term used by Nathan Lambert for the additional Tuning performed on a Trained Generative AI Model to improve its Alignment with user goals, such as better domain-specific awareness, instruction following, and adherence to social norms.

Agent

An old concept in AI, but now experiencing a renaissance as the most flexible architecture pattern for AI-based applications. Agents are orchestrations of Generative AI Model and external service invocations, e.g., planners, schedulers, reasoning engines, data sources (weather, search, …), etc. In this architecture, the best capabilities of each service and model are leveraged, rather than assuming that models can do everything successfully themselves. Agent-based applications sometimes use multiple models, one per agent, where each one provides some specific capabilities. For example, one model might process user Prompts into back-end API invocations, including to other models, and interpret the results into user-friendly Responses.

Agents may be designed to perform actions automatically for the user, although this autonomy needs to be carefully designed and tested, with attention to the severity of potential unintended consequences. Often, agents are designed to recommend actions the user should take, or at least to request user confirmation before taking actions.

Agentic Engineering

A term coined by Andrej Karpathy for a more careful engineering approach to AI-driven software development than Vibe Coding, the term he also coined for quick, “one-off” uses of AI to generate proofs of concept, etc., an approach not suitable for developing applications that need long-term evolution and maintenance.

His reasoning for this choice of words is as follows (quoting from the tweet, with “light” editing):

  1. Agentic because the new default is that you are not writing the code directly 99% of the time, you are orchestrating agents who do and acting as oversight.
  2. Engineering to emphasize that there is an art & science and expertise to it. It’s something you can learn and become better at, with its own depth of a different kind.

See also Vibe Engineering.

AI Actor

An organization or individual building an AI System [2].

AI System

Umbrella term for an application or system with AI Components, including Data Sets, Generative AI Models (e.g., LLMs), Evaluation Frameworks and Evaluations for safety detection and mitigation, etc., plus external services, databases for runtime queries, and other application logic that together provide functionality.

Alignment

A general term for how well an AI System’s outputs (e.g., replies to queries) and Behaviors correspond to end-user and service provider objectives, including the quality and utility of results, as well as safety requirements. Quality implies factual correctness, and utility implies the results are fit for purpose; e.g., a Q&A system should answer user questions concisely and directly, while a Python code-generation system should output valid, bug-free, and secure Python code. EleutherAI defines alignment this way, “Ensuring that an artificial intelligence system behaves in a manner that is consistent with human values and goals.” See also the work of the Alignment Forum.

Annotation

External data that complements a Data Set, such as labels that classify individual items [1].

Automatable

Can an action, like a test, be automated so it can be executed without human intervention?

B

Behavior

What does a Component do, either autonomously on its own (e.g., a security monitoring tool that is constantly running) or when invoked by another component through an API or Function call? This is a general-purpose term that could cover a single Feature, a whole Use Case or anything in between.

Behavior-Driven Development

Behavior-Driven Development (BDD) is an evolution of TDD where the testing APIs more explicitly express the language of specifying behaviors. Hence, writing tests in a BDD style means creating executable specifications.

Popular examples include RSpec for the Ruby language community and several BDD-inspired dialects supported by ScalaTest for Scala. While useful for thinking through requirements, there was a tendency for these APIs to be verbose to use, so practitioners often combined these APIs with more concise testing APIs. See also Test-Driven Development, Specification-Driven Development, Property-Based testing, and Design by Contract.
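
A minimal sketch of the BDD style using plain Python and pytest, where test names and given/when/then comments read as an executable specification; the ShoppingCart class is hypothetical and only illustrates the idea:

```python
# BDD-style tests with pytest: the test names describe behaviors and the
# given/when/then comments structure each scenario. ShoppingCart is a
# hypothetical example class.
import pytest

class ShoppingCart:
    def __init__(self):
        self.items = []

    def add(self, item: str, price: float) -> None:
        if price < 0:
            raise ValueError("price must be non-negative")
        self.items.append((item, price))

    @property
    def total(self) -> float:
        return sum(price for _, price in self.items)

def test_an_empty_cart_has_a_zero_total():
    # Given a new cart, when nothing is added, then the total is zero.
    assert ShoppingCart().total == 0

def test_adding_items_accumulates_the_total():
    # Given a cart, when two items are added, then the total is their sum.
    cart = ShoppingCart()
    cart.add("book", 10.0)
    cart.add("pen", 2.5)
    assert cart.total == 12.5

def test_a_negative_price_is_rejected():
    # Given a cart, when a negative price is added, then an error is raised.
    with pytest.raises(ValueError):
        ShoppingCart().add("oops", -1.0)
```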

Benchmark

A methodology or Function used for offline Evaluation of a Generative AI Model or AI System for a particular purpose and to interpret the results [1]. It consists of the following:

  1. A set of tests with metrics.
  2. A summarization of the results.

See also Unit Benchmark, Integration Benchmark, and Acceptance Benchmark.

C

ChatBot

An AI System application for interactive sessions. It provides an interface for accepting user Prompts and showing the replies generated by the system.

Class

The primary Component abstraction in Object-Oriented Programming, although not necessarily the only one.

Classification

Assigning a datum to a category, usually represented by a concise label. The categories may be a pre-defined set or discovered by analyzing the data in some way. Assignments are made with a Classifier.

Classifier

A model or other tool that analyzes data and outputs one or more Classifications or labels about its content. An email SPAM filter is an example, where an email is labeled either SPAM or not SPAM.

Coding Agent

An AI-powered IDE or tool specifically designed for AI-assisted software development. Here is a partial list of coding agents (at the time of this writing):

  1. AWS Kiro (an AI IDE designed to support Specification-Driven Development)
  2. Gemini CLI
  3. Claude Code
  4. Cline
  5. Cursor
  6. GitHub Copilot
  7. Roo Code
  8. Windsurf

Component

An ill-defined, but often-used term in software. In this case, we use it to generically refer to any piece of software with a well-defined purpose and an access API that defines clear boundaries. Depending on the programming language, it may group together Functions, Classes, etc. Particular programming languages and “paradigms” (like OOP and FP) might use terms like packages, modules, subsystems, or libraries; even web services can be considered components.

In principle, a component could contain a single Unit. So, for simplicity in the rest of the text, we will use Component as an umbrella term that could also mean an individual Unit, unless it is important to make finer distinctions.

Concurrent

When work can be partitioned into smaller steps that can be executed in any order, and the runtime executes them in an unpredictable order. If the order is predictable, no matter how it is executed, we can say the work is effectively Sequential.

Context

Additional information passed to an LLM as part of a user Prompt, which is intended to provide additional, useful context information so that the Response is better than if the user’s prompt was passed to the LLM alone. This additional content may include a System Prompt, relevant documents retrieved using RAG, etc.

Cohesion

Does a Component feel like “one thing” with a single purpose, exhibiting well-defined Behaviors with a coherent State? Or does it feel like a miscellaneous collection of behaviors or state?

Coupling

How closely connected is one Component to others in the system? “Loose” coupling is preferred, because it makes it easier to test components in isolation, substitute replacements when needed, etc. Strongly coupled components often indicate poor abstraction boundaries between them.

Cybersecurity

The Security of software systems, including data protection and allowed use. Prompt Injection is an example of a new class of Risks introduced by Generative AI Models.

D

Data Set

Sometimes written dataset, a collection of data items used for training, evaluation, etc. Usually, a given data set has a schema (which may simply be “unstructured text”) and some metadata that may include information about provenance, license for use (which may specify disallowed uses), target uses, transformations and filters applied, etc. [1].

Design By Contract

The idea of Design By Contract (“DbC”) was developed by Bertrand Meyer and incorporated into his Eiffel programming language. In Eiffel, all functions can define a contract for allowed inputs, invariants, and guaranteed results when the input requirements are met. The runtime system handles any failures of these contracts. A core principle of DbC is that contract failures should terminate the application immediately, forcing the developers to fix the issue; failing to do so becomes an excuse to let bugs accumulate. If this principle is rigorously followed during development, it is often considered acceptable (or at least “expedient”) to log contract failures, but not terminate execution, in production runs. DbC can be used in other languages through built-in features (like assertions), libraries, or various runtime features.

DbC provides many of the same design benefits provided by TDD, which emerged later, such as directing attention to more rigorous API design. Because of the additional benefits of TDD, DbC has largely fallen out of practice, but its formalism for what constitutes good contracts is still highly valuable and recommended for study.
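
A minimal sketch of the idea in Python, which has no built-in contract support, using plain assertions for the pre- and postconditions; the withdraw function is illustrative only:

```python
# Design by Contract approximated with assertions: preconditions constrain the
# inputs, postconditions guarantee properties of the result. A failed assertion
# terminates execution, per the DbC principle described above.
def withdraw(balance: float, amount: float) -> float:
    # Preconditions: a positive amount no larger than the current balance.
    assert amount > 0, "amount must be positive"
    assert amount <= balance, "amount must not exceed the balance"

    new_balance = balance - amount

    # Postconditions: the balance is reduced by exactly `amount` and is never negative.
    assert new_balance == balance - amount
    assert new_balance >= 0
    return new_balance

print(withdraw(100.0, 30.0))   # 70.0
# withdraw(10.0, 30.0)         # would fail the precondition and terminate
```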

Determinism

The output of a Component for a given input is always known precisely. This affords writing repeatable, predictable software and automated, reliable tests.

In contrast, nondeterminism means identical inputs yield different results, removing Repeatability and complicating Predictability, and the ability to write automated, reliable tests.

Direct Preference Optimization

A Tuning technique that optimizes a model directly on preference data (pairs of chosen and rejected Responses), rather than training a separate reward model for use with Reinforcement Learning, as in Reinforcement Learning with Human Feedback.

See also Reinforcement Learning.

E

Evaluation

Much like other software, models and AI systems need to be trusted and useful to their users. Evaluation aims to provide the evidence needed to gain confidence for an AI System and its Components.

A particular evaluation measures and quantifies how a Generative AI Model (e.g., an LLM) or an AI System as a whole handles Prompts and the kinds of Responses produced. For example, an evaluation might check whether hate speech is detected in prompts and responses, whether responses contain hallucinations, measure the overhead (time and compute) of processing, or, for our purposes, verify that a required Use Case is implemented.

So, an evaluation can cover functional and nonfunctional behaviors of models and systems. They may be used throughout the AI application development and deployment lifecycle. Functional evaluation dimensions include alignment to use cases, accuracy in responses, faithfulness to given context, robustness against perturbations and noise, and adherence to safety and social norms. Nonfunctional evaluation dimensions include latency, throughput, compute efficiency, cost to execute, carbon footprint and other sustainability concerns. Evaluations are applied as regression tests while models are trained and fine-tuned, as benchmarks while GenAI-powered applications are designed and developed, and as Guardrails when these applications are deployed in production. They also have a role in compliance, both with specific industry regulations, and with emerging government policies.

An evaluation may be implemented in one of several ways. A Classifier LLM or another kind of model might be used to label content. In general, evaluations often include a Data Set of examples used to Train a model for purposes like classification, or the data set can be used to query a model and score the quality of the responses. For our purposes, an implementation of an evaluation is API compatible for execution within an Evaluation Framework.
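
As a minimal, hedged sketch of what one evaluation implementation might look like (not a real Evaluation Framework API), the following scores a model’s responses to a set of prompts with a toy keyword-based classifier; the blocked-term list and the generate callable are placeholders:

```python
# A toy evaluation: send each prompt to a model and report the fraction of
# responses flagged by a simple classifier. Real evaluations typically use a
# Classifier model and a curated Data Set instead of this keyword list.
BLOCKED_TERMS = {"blocked-term-1", "blocked-term-2"}   # placeholder vocabulary

def classify_unsafe(text: str) -> bool:
    """Toy stand-in for a safety Classifier."""
    return any(term in text.lower() for term in BLOCKED_TERMS)

def run_evaluation(generate, prompts: list[str]) -> float:
    """Return the fraction of responses flagged as unsafe."""
    flagged = sum(classify_unsafe(generate(p)) for p in prompts)
    return flagged / len(prompts)

# Usage (with any callable that maps a prompt string to a response string):
score = run_evaluation(lambda p: "a harmless response", ["prompt one", "prompt two"])
print(score)   # 0.0 for this trivial stand-in model
```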

See also Evaluation Framework.

Evaluation Framework

An umbrella term for the software tools, runtime services, benchmark systems, etc. used to perform Evaluations by running their implementations to measure AI systems for trust and safety risks and mitigations, and other concerns. See, for example, The AI Alliance Evaluation Reference Stack.

Explainability

Can humans understand why the system behaves the way that it does in a particular situation? Can the system explain its reasoning for arriving at a result?

F

Fairness

Do the AI system’s Responses exhibit social biases, preferential treatment, or other forms of non-objectivity?

Feature

For our purposes, a small bit of functionality provided by a Component and the AI Systems that use it. A feature is the increment of change in a single cycle of the Test-Driven Development process, which could be enhancing some user-visible functionality or adding wholly new functionality in small increments. See also Use Case.

Few-Shot Prompt

Sometimes, providing a few examples in a prompt of the desired responses conditions the model to produce better responses. This is the idea with few-shot prompts. For an example, see this discussion in Testing Generative AI Applications. See also Prompt, Zero-Shot Prompt, and Prompt Engineering.
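
A small sketch of what a few-shot prompt might look like, here for sentiment classification; the example reviews and labels are illustrative only:

```python
# A few-shot prompt: two labeled examples condition the model toward the
# desired answer format before the real input is appended.
FEW_SHOT_PROMPT = """Classify the sentiment of each review as Positive or Negative.

Review: The battery lasts all day and the screen is gorgeous.
Sentiment: Positive

Review: It stopped working after a week and support never replied.
Sentiment: Negative

Review: {user_review}
Sentiment:"""

prompt = FEW_SHOT_PROMPT.format(user_review="Setup was painless and it just works.")
# `prompt` is then sent to the model; a Zero-Shot Prompt would omit the examples.
```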

Fine Tuning

A more specific term for Tuning, a part of Post-Training, that emphasizes that after the major learning has happened during Pre-Training, the model behavior is refined and improved with additional training techniques. See also Supervised Fine Tuning.

Function

In most languages, the most fundamental Unit of abstraction and execution. Depending on the language, the term function or method might be used, where the latter term refers to functions associated with Classes in OOP languages. Some languages allow code blocks outside of functions, perhaps inside alternative Component boundaries, but this is not important for our purposes.

Many functions are free of Side Effects, meaning they don’t read or write State external to the function and shared by other functions. These functions are always Deterministic; for a given input(s) they always return the same output. This is a very valuable property for design, testing, and reuse.

Other functions that read and possibly write external state are nondeterministic, as are functions implemented with Concurrency in a way that the order of results is not deterministic. Examples include functions that retrieve data, such as a database record, functions that generate UUIDs, and functions that call other processes or systems.
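
A minimal sketch in Python contrasting a side-effect-free, Deterministic function with functions whose results depend on external state or other systems:

```python
# The first function is pure: the same inputs always produce the same output.
# The other two are nondeterministic because they depend on state outside the
# function (a random-number source and the system clock, respectively).
import uuid
from datetime import datetime

def add(a: int, b: int) -> int:
    return a + b                       # deterministic: no external state

def new_order_id() -> str:
    return str(uuid.uuid4())           # nondeterministic: different every call

def greeting() -> str:
    hour = datetime.now().hour         # nondeterministic: reads the clock
    return "Good morning" if hour < 12 else "Good afternoon"
```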

Functional Programming

FP is a design methodology that attempts to formalize the properties of Functions, inspired by the behavior of mathematical functions. State is maintained in a small set of abstractions, like Maps, Lists, and Sets, with operations that are implemented separately following protocol abstractions exposed by the collections. Like mathematical objects and unlike objects in Object-Oriented Programming, mutation of State is prohibited; any operation, like adding elements to a collection, creates a new, Immutable copy.

FP became popular when concurrent software became more widespread in the 2000s, because immutable objects lead to far fewer concurrency bugs. FP languages may have other Component constructs for grouping of functions, e.g., into libraries.

Contrast with Object-Oriented Programming. Many programming languages combine aspects of FP and OOP.

G

Guardrails

A frequently-used term for inference-time use of Evaluations to detect and mitigate usage of the AI System that is considered unsafe or otherwise outside the terms of use.

Guardrails often focus on user Prompts and Responses, looking for undesirable content, such as hate speech, misinformation, hallucinations, hacking attempts, etc.

Generative Adversarial Networks

A GAN uses two neural networks that compete with each other in a “zero-sum” game, where one agent’s gain is another agent’s loss.

Quoting from the Wikipedia page on GANs:

Given a training data set, this technique learns to generate new data with the same statistics as the training set. For example, a GAN trained on photographs can generate new photographs that look at least superficially authentic to human observers, having many realistic characteristics...

The core idea of a GAN is based on the "indirect" training through the discriminator, another neural network that can tell how "realistic" the input seems, which itself is also being updated dynamically. This means that the generator is not trained to minimize the distance to a specific image, but rather to fool the discriminator. This enables the model to learn in an unsupervised manner.

The “adversarial” part is how the generator attempts to fool the discriminator, which learns to detect these situations.

Generative AI Model

A combination of data and code, usually trained on a Data Set, to support Inference of some kind.

For convenience, in the text, we use the shorthand term model to refer to the generative AI Component that has Nondeterministic Behavior, whether it is a model invoked directly through an API in the same application or invoked by calling another service (e.g., ChatGPT). The goal of this project is to better understand how developers can test models.

See also Model, Large Language Model (LLMs), and Multimodal Model.

Governance

End-to-end control of assets, especially Data Sets and Models, with lineage traceability and access controls for protecting the security and integrity of assets.

H

Hallucination

When a Generative AI Model generates text that seems plausible, but is not factually accurate. Lying is not the right term, because there is no malice intended by the model, which only knows how to generate a sequence of Tokens that are plausible. Which token is actually returned in a given context is a Stochastic process, i.e., a random process governed by a Probability distribution.

I

In-Context Learning

The idea of embedding in a Prompt additional information to help the LLM produce better results. Examples include Retrieval-Augmented Generation, which is a design pattern where information relevant to a query is retrieved from a data store and passed as part of the Context for the prompt, and Few-Shot Prompting, where a few examples of user prompts and good responses are provided in the prompt.

Immutable

A Unit’s or Component’s State cannot be modified, once it has been initialized. If all units in a Component are immutable, then the component itself is considered immutable. Contrast with Mutable. See also State.

Inference

Sending information to a Generative AI Model or AI System to have it return an analysis of some kind, summarization of the input, or newly generated information, such as text. The term query is typically used when working with LLMs. The term inference comes from traditional statistical analysis, including model building, that is used to infer information from data.

Instruction Fine Tuning

Often abbreviated IFT and sometimes shortened to Instruction Tuning. A form of Supervised Fine Tuning that uses a Labeled Data set of instruction Prompts and Responses. It is designed to improve model performance for specific tasks and for following instructions, in general, such as Question Answering. See also Tuning.

Integration Benchmark

The analog of Integration Tests for several Units and Components working together, where some of them are AI-enabled and exhibit Stochastic behaviors. Benchmark technology is adapted for the purpose.

See also Unit Test, Unit Benchmark, Integration Test, Acceptance Test, and Acceptance Benchmark.

Integration Test

A test for several Units and Components working together that verifies they interoperate properly. These components could be distributed systems, too. When any of the units that are part of the test have Side Effects and the purpose of the test is not to explore handling of such side effects, all units with side effects should be replaced with Test Doubles to make the test Deterministic.

See also Test, Unit Test, Unit Benchmark, Integration Benchmark, Acceptance Test, and Acceptance Benchmark.

J
K
L

Labeled Data

Labeled data contains content used to train a model and corresponding labels of expected outcomes. A classic example is a labeled data set for Training a SPAM filter, where example emails are labeled SPAM or not SPAM. In contrast, Unlabeled Data contains no such labels. Labeled data is used in model Tuning, while sets of unlabeled data are used for training raw Generative AI Models.

In the context of Generative AI Models, there are several popular formats for labeled data:

  1. Question and answer (Q&A) pairs: A set of Prompts, such as questions or instructions to do tasks, accompanied by answers or expected Responses.
  2. Preference data: Similar to Q&A pairs, but in addition to the preferred or chosen answer, a rejected answer is provided, which supports teaching the model which responses are good as well as which are bad.

Large Language Model

Abbreviated LLM, a state-of-the-art Generative AI Model, often with billions of parameters, that has the ability to summarize, classify, and even generate text in one or more spoken and programming languages. See also Model and Multimodal Model.

M

Model

A combination of data and code, usually trained on a Data Set, to support Inference of some kind. See also Generative AI Model, Large Language Model, and Multimodal Model.

Model Context Protocol

Abbreviated MCP, a de-facto standard protocol for communications between models, agents, tools, and services, including auto-discovery. See the AI Alliance’s MCP (and Beyond) in the Enterprise: A User Guide and modelcontextprotocol.io for more information.

Multimodal Model

A model that extends the text-based capabilities of LLMs with additional support for other media, such as video, audio, still images, or other kinds of data. See also Model.

Mutable

A Unit’s State can be modified during execution, either through direct manipulation by another unit or indirectly by invoking the unit (e.g., calling a Function that changes the state). If any one unit in a Component is mutable, then the component itself is considered mutable. Contrast with Immutable. See also State.
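
A minimal sketch in Python contrasting a mutable and an immutable version of the same component; the Point classes are illustrative only:

```python
# MutablePoint changes its own state in place; ImmutablePoint never changes
# after construction and instead returns a new copy.
from dataclasses import dataclass

@dataclass
class MutablePoint:
    x: float
    y: float

    def move(self, dx: float, dy: float) -> None:
        self.x += dx        # state transition: the same object is modified
        self.y += dy

@dataclass(frozen=True)
class ImmutablePoint:
    x: float
    y: float

    def moved(self, dx: float, dy: float) -> "ImmutablePoint":
        return ImmutablePoint(self.x + dx, self.y + dy)   # new copy, no mutation
```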

N
O

Object-Oriented Programming

OOP (or sometimes object-oriented software development - OOSD - or object-oriented development - OOD) is a design methodology that creates software Components with boundaries that mimic real-world objects (like Person, Automobile, Shopping Cart, etc.). Each object encapsulates State and Behavior behind its abstraction.

Introduced in the Simula language in the 1960s, it gained widespread interest in the 1980s with the emergence of graphical user interfaces (GUIs), where objects like Window, Button, and Menu were an intuitive way to organize such software.

Contrast with Functional Programming. Many programming languages combine elements of FP and OOP.

OODA Loop

A method of action, where you constantly perform the loop: Observe, Orient, Decide, Act. Originally developed by United States Air Force Colonel John Boyd for combat operations, it has since been applied in other areas, such as industrial applications, project assessment, etc.

The Wikipedia page has a history and more details about OODA.

P

Paradigm

From the Merriam-Webster Dictionary definition of paradigm, “a philosophical and theoretical framework of a scientific school or discipline within which theories, laws, and generalizations and the experiments performed in support of them are formulated.”

Predictable

In the context of software, the quality that, knowing a Unit’s or Component’s design and its history of past Behavior, you can reliably predict its future behavior. See also State Machine.

Pre-Training

See Training. A more precise term in the context of Generative AI Model training, where pre-training uses massive datasets to teach models from scratch, followed by a Post-Training (Tuning) process to refine the behaviors as needed.

Privacy

Protection of individuals’ sensitive data and preservation of their rights.

Post-Training

See Tuning. A more precise term in the context of Generative AI Model training, where Pre-Training uses massive datasets to teach models from scratch, followed by a Post-Training (Tuning) process to refine the behaviors as needed.

Probability and Statistics

Two interrelated branches of mathematics, where statistics concerns such tasks as collecting, analyzing, and interpreting data, while probability concerns observations, in particular the percentage likelihood that certain values will be measured when observations are made of a random process, or more precisely, a random probability distribution, like heads or tails when flipping a coin. This probability distribution is the simplest possible; there is a 50-50 chance of heads or tails (assuming a fair coin). The probability distribution for rolling a particular sum with a pair of dice is less simple, but straightforward. The probability distribution for the heights of women in the United States is more complicated, where historical data determines the distribution, not a simple formula.

Both disciplines emerged together to solve practical problems in science, industry, sociology, etc. It is common for researchers to build a mathematical model (in the general sense of the word, not just an AI model) of the system being studied, in part to compare actual results with predictions from the model, confirming or rejecting the underlying theories about the system upon which the model was built. Also, if the model is accurate, it provides predictive capabilities for possible and likely future observations.

Contrast with Determinism. See also Stochastic.

Prompt

The query a user (or another system) sends to an LLM. Often, additional Context information is added by an AI System before sending the prompt to the LLM. See also Prompt Engineering, Prompt Injection, Few-Shot Prompt, and Zero-Shot Prompt.

Prompt Engineering

A term for the careful construction of good Prompts to maximize the quality of Inference Responses. It is really considered more art than science or engineering because of the subjective relationship between prompts and responses for Generative AI Models. See also Prompt Injection.

Prompt Injection

A term for inserting content into Prompts that triggers undesirable behaviors. This is a new Cybersecurity threat introduced by AI Systems, Generative AI Models, in particular.

Property-Based Testing

Property-Based Testing (PBT) is sometimes also called property-based development or property-driven development. This variation of Test-Driven Development emphasizes the mathematical properties of Units being tested. Obvious examples are arithmetic functions on integers, but properties and the “laws” they impose can be much more general. For example, all programming languages support concatenation (e.g., “addition”) of strings, where an empty string is the “zero”. Hence, length("foo") == length("foo" + "") == 3. String addition is associative, (a+b)+c == a+(b+c), but not commutative, a+b ≠ b+a.

All libraries that support PBT let you define the properties that must hold and a way of defining allowed values of the “types” in question. At test time, the library generates a large set of representative instances of the types and verifies the properties hold for all instances.
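
A minimal sketch using the Python Hypothesis library to check the string-concatenation properties described above over many generated inputs:

```python
# Hypothesis generates many example strings and verifies the properties hold
# for all of them; a failing example would be reported and shrunk automatically.
from hypothesis import given, strategies as st

@given(st.text())
def test_empty_string_is_the_identity_for_concatenation(s):
    assert s + "" == s
    assert len(s + "") == len(s)

@given(st.text(), st.text(), st.text())
def test_string_concatenation_is_associative(a, b, c):
    assert (a + b) + c == a + (b + c)
```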

Property-based testing emerged in the Functional Programming community.

See also Design by Contract, Specification-Driven Development, Behavior-Driven Development, and Test-Driven Development.

Q

Question Answering

In many, if not most applications, models and the applications that use them should be good at providing focused, useful answers to user questions, rather than generating text that might be related to the topic, but not useful to the user. Instruction Fine Tuning focuses on improving this capability.

Quantization

In the context of AI, a technique for reducing the size of a model, and hence the resources required to use it, by replacing some or all of the floating-point weights (either 16-bit fp16 or sometimes 32-bit fp32) with lower-precision floating-point or integer values. Often, the size and resource savings outweigh a relatively small degradation in performance.
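
A minimal sketch of symmetric int8 quantization of a small weight tensor using NumPy; real quantization schemes (per-channel scales, zero points, calibration, etc.) are more involved:

```python
# Map fp32 weights onto 8-bit integers plus a single scale factor, then
# reconstruct approximate fp32 values. Each weight shrinks from 4 bytes to 1.
import numpy as np

weights = np.array([0.42, -1.30, 0.07, 2.15], dtype=np.float32)

scale = np.abs(weights).max() / 127.0                              # largest weight -> 127
quantized = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
dequantized = quantized.astype(np.float32) * scale

print(quantized)       # e.g., [ 25 -77   4 127]
print(dequantized)     # approximately the original weights
```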

R

Refactoring

Modifying code to change its structure as required to support a new feature. No Behavior changes are introduced, so that the existing automated Tests can verify that no regressions are introduced as the code is modified. This is the first step in the Test-Driven Development cycle.

Regression

When an unexpected Behavior change is introduced into a previously-working Unit, because of a change made to the code base, often in other units for unrelated functionality.

Automated Tests are designed to catch regressions as soon as they occur, making it easier to diagnose the change that caused the regression, as well as detecting the regression in the first place.

Reinforcement Fine Tuning

The use of Reinforcement Learning as part of Tuning a Generative AI Model. See the discussion of Reinforcement Fine Tuning in From Testing to Tuning.

Reinforcement Learning

Reinforcement learning (RL) is a form of machine learning, often used for optimizing control or similar systems. In RL, an agent performs a loop: it observes the state of the “world” visible to it at the current time, then takes what it considers a suitable action for the next step, chosen to maximize a reward signal, often with the goal of maximizing the long-term reward, such as winning a game. The reinforcement aspect is an update at each step to a policy of some kind that the agent uses to decide which actions in subsequent steps are most likely to maximize the long-term, cumulative reward, given the current known state. However, when choosing the next step, the best-known choice is not always made. Some degree of randomness is introduced so that the agent explores all possible states and rewards, rather than getting stuck always choosing the same actions that are known to be good, but may be less optimal than actions that have not yet been tried.

Variations include having a dedicated reward model that calculates the reward based on the chosen action. When RL is used for a game, for example, it might be obvious what the reward is for any action and state combination, e.g., did you land on a square that reveals a “boost” of some kind? In contrast, reward determination is not so simple for tasks like deciding whether an LLM output is a good Response to a Prompt.

In the generative AI context, RL is a popular tool in the suite of model Tuning processes that are used to improve model performance in various ways. In particular, Reinforcement Learning with Human Feedback (RLHF) is a popular technique for Adaptation. A new technique called Direct Preference Optimization (DPO) has largely replaced RL in many applications.

See also the discussion of Reinforcement Fine Tuning in From Testing to Tuning, which describes RL in more detail.

Reinforcement Learning with Human Feedback

A Reinforcement Learning technique introduced by OpenAI that uses human data to train a reward model, which is then used with RL to improve the training of the Generative AI Model. This is an expensive process, because of the cost of acquiring human-generated, often expert, data.

Reinforcement Learning with Verifiable Rewards

A Reinforcement Learning approach for LLMs where the Response from a model during an RL step can be verified externally. For example, does the generated code compile and pass existing unit tests? See Awesome RLVR for more details.

Repeatable

If an action, like running a test, is run repeatedly with no code or data changes, does it return the same results every time? By design, Generative AI Models are expected to return different results each time a query is repeated.

Responsible AI

An umbrella term about comprehensive approaches to safety, accountability, and equitability. It covers an organization’s professional responsibility to address concerns. It can encompass tools, models, people, processes, integrated systems, and data [2].

Retrieval Augmented Generation

RAG was one of the first AI-specific design patterns for applications. It uses one or more data stores with information relevant to an application’s use cases. For example, a ChatBot for automotive repair technicians would use RAG to retrieve sections from repair manuals and logs from past service jobs, selecting the ones that are most relevant to a particular problem or subsystem the technician is working on. This Context is passed as part of the Prompt to the LLM.

A key design challenge is determining relevancy and structuring the data so that relevant information is usually retrieved. This is typically done by breaking the reference data into “chunks” and encoding each chunk as a vector (an embedding), over which a similarity metric can be computed. During inference, the prompt is passed through the same encoding and the top few nearest neighbors, based on the metric, are returned for the context, thereby attempting to ensure maximum relevancy.
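
A minimal sketch of the retrieval step: chunks and the query are embedded as vectors and the nearest chunks, by cosine similarity, are selected for the Context. The embed argument is a hypothetical placeholder for a real embedding model:

```python
# Rank reference chunks by cosine similarity to the query embedding and return
# the top_k most relevant chunks, which are then added to the prompt Context.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query: str, chunks: list[str], embed, top_k: int = 3) -> list[str]:
    query_vec = embed(query)                                   # hypothetical embedding model
    scored = [(cosine_similarity(query_vec, embed(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]
```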

See this IBM blog post for a description of RAG.

Response

The generic term for outputs from a Generative AI Model or AI System. Sometimes results is also used.

Risk

The composite measure of an event’s probability of occurring and the magnitude or degree of the consequences of the corresponding event. Risk is a function of the negative impact if the event occurs and the likelihood of occurrence [2].

Robustness

How well does the AI System continue to perform within acceptable limits or degrade “gracefully” when stressed in some way? For example, how well does a Generative AI Model Respond to Prompts that deviate from its training data?

S

Scalability

A general concern for large-scale systems; how easily, efficiently, and reliably can you scale up their service capacity in response to load. When the load decreases, can you scale the system back down to conserve resources that aren’t needed?

Scenario

In the context of a Use Case, one path through the use case, such as a “happy path” from beginning to end where a user completes a task or accomplishes a goal successfully. Other scenarios include failures, paths through the use case where the user is unable to succeed, due to system or user errors. Scenario is a generic word, of course, and will often be used more generically.

Security

Preventing, detecting, and mitigating undesirable access and use of physical and software systems, including data. Software and data security is frequently called Cybersecurity, while the term security also encompasses Risks like unauthorized access to or destruction of physical spaces, etc.

New cybersecurity concerns are introduced by AI Systems, such as Prompt Injection. Evaluations can be written for security concerns, in addition to traditional detection and mitigation tools.

Sequential

The steps of some work are performed in a predictable, repeatable order. This property is one of the requirements for Deterministic Behavior. Contrast with Concurrent.

Side Effect

Reading and/or writing State that is shared outside a Unit, e.g., state a Function shares with other functions. If a Component contains units that perform side effects, then the component itself is considered to perform side effects. See also Determinism.

Social Responsibility

An organization’s responsibility for the impacts of its decisions and activities on society and the environment through transparent and ethical behavior [2].

Specification-Driven Development

Abbreviated SDD and also known as Spec-Driven Development. In our context, this refers to an idea introduced by GitHub and Microsoft, that we should structure code generation Prompts in a more-precise format to ensure we get the code Responses we need. The argument is that many models are already perfectly capable of generating this code, but they are “literal minded” and need to be told precisely what is needed from them.

We discuss SDD at length in the Specification-Driven Development chapter of Testing Generative AI Applications. SDD is similar in its goals to Test-Driven Development, although arguably closer to the emphasis in Behavior-Driven Development.

State

Used in software to refer to a set of values in some context, like a Component. The values determine how the component will behave in subsequent invocations to perform some work. The values can sometimes be read directly by other components. If the component is Mutable, then the state can be changed by other components either directly or through invocations of the component that cause state transitions to occur. (For example, popping the top element of a stack changes the contents of the stack, the number of elements it currently holds, etc.)

Often, these state transitions are modeled with a State Machine, which constrains the allowed transitions.

State Machine

A formal model of how the State of a component can transition from one value (or set of values) to another. As an example, the TCP protocol has a well-defined state machine.
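
A minimal sketch of a state machine in Python, where the allowed transitions for a simple (illustrative) order workflow are listed explicitly and anything else is rejected:

```python
# The table encodes the state machine; transition() enforces it.
ALLOWED_TRANSITIONS = {
    "new":       {"paid", "cancelled"},
    "paid":      {"shipped", "cancelled"},
    "shipped":   {"delivered"},
    "delivered": set(),
    "cancelled": set(),
}

def transition(state: str, next_state: str) -> str:
    if next_state not in ALLOWED_TRANSITIONS[state]:
        raise ValueError(f"illegal transition: {state} -> {next_state}")
    return next_state

state = transition("new", "paid")       # allowed
state = transition(state, "shipped")    # allowed
# transition("delivered", "paid")       # would raise ValueError
```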

Stochastic

The behavior of a system where observed values are governed by a random probability distribution. For example, when flipping a coin repeatedly, the observed values, heads or tails, are governed by a distribution that predicts 50% of the time heads will be observed and 50% of the time tails will be observed, assuming a fair coin (not weighted on one side or the other). The value you observe for any given flip is random; you can’t predict exactly which possibility will happen, only that there is an equal probability of heads or tails. After performing more and more flips, the total count of heads and tails should be very close to equal. See also Probability and Statistics.
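
A minimal sketch of this behavior in Python: any single coin flip is unpredictable, but over many flips the counts approach the 50/50 distribution:

```python
# Simulate a fair coin: individual outcomes are random, aggregate frequencies
# converge toward the underlying probability distribution.
import random

flips = [random.choice(["heads", "tails"]) for _ in range(10_000)]
print(flips[:5])                            # unpredictable individual outcomes
print(flips.count("heads") / len(flips))    # close to 0.5 over many flips
```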

Supervised Fine Tuning

Often abbreviated SFT. A more general term than Instruction Fine Tuning, but often used synonymously. Supervised is an old term in machine learning for any kind of training algorithm that uses Labeled Data, i.e., data that includes the expected answers. See also Tuning.

Sustainability

Taking into account the environmental impact of AI systems, such as carbon footprint and water usage for cooling, both now and for the future [2].

System Prompt

A commonly-used, statically-coded part of the Context information added by an AI System to the Prompt before sending it to the LLM. System prompts are typically used to provide the model with overall guidance about the application’s purpose and how the LLM should respond. For example, it might include phrases like “You are a helpful software development assistant.”
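
A small sketch of a system prompt combined with a user prompt, using the chat-message format common to many LLM APIs; the content is illustrative only:

```python
# The system message carries the application's standing instructions; the user
# message carries the actual query. The combined list is sent to the model.
messages = [
    {"role": "system",
     "content": ("You are a helpful software development assistant. "
                 "Answer concisely and include runnable code when asked.")},
    {"role": "user",
     "content": "How do I read a JSON file in Python?"},
]
```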

T

Taxonomy

In the context of Evaluations, a taxonomy refers to how categories are defined for known risks, other safety concerns, and other areas where detection or measurement of behaviors is desirable.

Teacher Model

A Generative AI Model that can be used as part of a Tuning (“teach”) process for other models, to generate synthetic data, to evaluate the quality of data, etc. These models are usually relatively large, sophisticated, and powerful, so they are very capable for these purposes, but they are often considered too costly to use as an application’s runtime model, where smaller, lower-overhead models are necessary. However, for software development purposes, less frequent use of teacher models is worth the higher cost for the services they provide.

Test

For our purposes, a Unit Test, Integration Test, or Acceptance Test.

Test Double

A test-only replacement for a Unit or a whole Component, usually because it has Side Effects and we need the Behavior to be Deterministic for the purposes of testing a dependent unit that uses it. For example, a function that queries a database can be replaced with a version that always returns a fixed value expected by the test. A mock is a popular kind of test double that uses the underlying runtime environment (e.g., the Python interpreter, the Java Virtual Machine - JVM) to intercept invocations of a unit and programmatically behave as desired by the tester.
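
A minimal sketch in Python: the price lookup (a Side Effect, e.g., a database query) is replaced first with a hand-written stub and then with a mock from the standard library, so the tests are Deterministic; the order_total function is illustrative only:

```python
# Replacing a side-effecting dependency with test doubles.
from unittest.mock import Mock

def order_total(item_ids, lookup_price):
    return sum(lookup_price(item_id) for item_id in item_ids)

def test_order_total_with_a_stub():
    fake_lookup = lambda item_id: 10.0          # hand-written test double
    assert order_total(["a", "b", "c"], fake_lookup) == 30.0

def test_order_total_with_a_mock():
    mock_lookup = Mock(return_value=10.0)       # mock from unittest.mock
    assert order_total(["a", "b"], mock_lookup) == 20.0
    assert mock_lookup.call_count == 2          # mocks can also verify usage
```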

See also Test, Unit Test, Integration Test, and Acceptance Test.

Test-Driven Development

When adding a Feature to a code base using TDD, the tests are written before the code is written. A three-step “virtuous” cycle is used, where changes are made incrementally and iteratively in small steps, one at a time:

  1. Refactor the code to change its structure as required to support the new feature, using the existing automated Tests to verify that no regressions are introduced. For example, it might be necessary to introduce an abstraction to support two “choices” where previously only one choice existed.
  2. Write a Test for the new feature. This is primarily a design exercise, because thinking about testing makes you think about usability, Behavior, etc., even though you are also creating a reusable test that will become part of the Regression test suite. Note that the test suite will fail to run at the moment, because the code doesn’t yet exist to make it pass!
  3. Write the new feature to make the new test (as well as all previously written tests) pass.

TDD not only promotes iterative and incremental development, with a growing suite of tests resulting from the process, it effectively turns the writing of executable tests into a form of specification of the desired behavior, written before the code that implements the specification. Behavior-Driven Development takes this idea to its logical conclusion: tests are executable specifications.

The Wikipedia TDD article is a good place to start for more information.

See also Design by Contract, Specification-Driven Development, Behavior-Driven Development, and Property-Based Testing.

Token

For Large Language Models, the training texts and query Prompts are split into tokens, usually whole words or word fragments, according to a vocabulary of tens of thousands of tokens that can include common single characters, multi-character sequences, and “control” tokens (like “end of input”). A rule of thumb is that a corpus will parse into roughly 1.5 times as many tokens as it has words.

Training

In our context, training refers to the processes used to teach a model, such as a Generative AI Model, how to do its intended job. A more precise term used in generative AI model development is Pre-Training, the training process that uses a massive data corpus to teach the model facts about the world, how to speak and understand human language, and some other skills. However, the resulting model does poorly on specialized tasks and even basic skills like following a user’s instructions, conforming to social norms (e.g., avoiding hate speech), etc.

That’s where a second Tuning phase comes in, often called Post-Training, which uses a suite of processes to improve the model’s performance on many general or specific skills.

Trust and Safety

An umbrella term for concerns, processes, and tools to ensure trustworthiness and safety of AI Systems. See the discussion What We Mean by Trust and Safety in The AI Alliance Trust and Safety User Guide.

Tuning

Tuning, or Post-Training, refers to one or more processes used to transform a Pre-Trained model into one that exhibits much better desired Behaviors (like instruction following) or specialized domain knowledge. The term Fine Tuning (sometimes spelled finetuning) is also widely used. These days, Instruction Fine Tuning is a very common form of tuning, which uses Supervised Fine Tuning. Another suite of techniques used is Reinforcement Learning.

U

Unit

For our purposes, a unit refers to the smallest granularity of functionality we care about, e.g., in the context of a Unit Test. A unit can be a single Function that is being designed and written, but this may be happening in the larger context of a Component, such as a Class in an Object-Oriented Programming language or some other self-contained construct.

For simplicity, rather than say “unit and/or component” frequently in AI Alliance content, we just use component as a generic umbrella term for both concepts, unless it is important to make finer distinctions.

Unit Benchmark

An adaptation of Benchmark tools and techniques for more fine-grained and targeted testing purposes, such as verifying Features and Use Cases work as designed. See the Unit Benchmarks chapter for details.

The same idea generalizes to the analogs of Integration Tests, namely Integration Benchmarks, and Acceptance Tests, namely Acceptance Benchmarks.

Use Case

A common term for an end-to-end user activity done with a system, often broken down into several Scenarios that describe different “paths” through the use case, including error scenarios, in addition to happy paths. Hence, scenarios would be the next level of granularity. Compare with Features, which would be the capabilities implemented one at a time to support the scenarios that make up a use case.

Unit Test

A test for a Unit that exercises its Behavior in isolation from all other Functions and State. When the unit being tested has Side Effects, because of other units it invokes, all such side effects must be replaced with Test Doubles to make the test Deterministic. Note that writing a unit test as part of Test-Driven Development inevitably begins with a Refactoring step to modify the code, while preserving the current behavior, so that it is better positioned to support implementing the new functionality.

See also Test, Unit Benchmark, Integration Test, Integration Benchmark, Acceptance Test, Acceptance Benchmark.

Unlabeled Data

Data without labels indicating expected “information” about the data, such as objects in images or themes in text examples. Massive sets of unlabeled data are used for Training raw Generative AI Models, while Labeled Data is typically used for Tuning to improve those models to meet specific requirements.

V

Vibe Coding

A term coined by Andrej Karpathy for just going with the code generated by an LLM, tweaking the Prompt as needed to get the LLM to fix bugs and incorrect behavior. Hence, it’s a completely “non-engineered” approach to coding, which can work well for quick coding needs, especially for non-programmers, but generally is not sufficient for longer-term projects. Hence, the term has a slightly negative connotation for many people, as in “this is not a serious way to write software”. Contrast with Vibe Engineering and Agentic Engineering.

Vibe Engineering

A term coined by Simon Willison, made half in jest, for a more engineering-oriented approach to Vibe Coding, which incorporates various engineering practices to ensure that quality and maintainability requirements can be met longer term. As such, his blog post is a good counterargument to those who believe that AI coding assistants are now sufficiently reliable and powerful to completely take over from humans.

See also Agentic Engineering.

W
X
Y
Z

Zero-Shot Prompt

In a Few-Shot Prompt, a few examples are included in the Prompt of possible user prompts and the desired Responses. This can condition the model to produce better responses. A zero-shot prompt doesn’t include such examples, relying on the rest of the prompt, including any other Context, combined with the model’s inherent abilities to generate acceptable responses. For an example, see this discussion in Testing Generative AI Applications. See also Prompt, Few-Shot Prompt, and Prompt Engineering.