
Glossary of Terms

Let’s define the common terms we use. Some of the terms defined here are industry standards; others are not standard, but they are useful for our purposes.

Some definitions are adapted from the following sources, which are indicated below using the same numbers, i.e., [1] and [2]:

  1. MLCommons AI Safety v0.5 Benchmark Proof of Concept Technical Glossary
  2. NIST Artificial Intelligence Risk Management Framework (AI RMF 1.0)

Note: For simplicity, we will sometimes refer to sending queries to an AI System, when in fact the query might go directly to a Large Language Model (LLM) or to an AI System that includes one or more LLMs and other “components”. We use AI system queries as shorthand for all of these cases.

Table of contents
  1. Glossary of Terms
    1. Accountability
    2. Agent
    3. AI Actor
    4. AI System
    5. Alignment
    6. Annotation
    7. Benchmark
    8. ChatBot
    9. Classification
    10. Context
    11. Cybersecurity
    12. Dataset
    13. Evaluation
    14. Evaluation Framework
    15. Explainability
    16. Fairness
    17. Guardrails
    18. Governance
    19. Hallucination
    20. Inference
    21. Large Language Model
    22. Model
    23. Multimodal Model
    24. Privacy
    25. Prompt
    26. System Prompt
    27. Response
    28. Responsible AI
    29. Retrieval-augmented Generation
    30. Risk
    31. Robustness
    32. Social Responsibility
    33. Sustainability
    34. Token
    35. Trust and Safety

Accountability

An aspect of Governance, where we trace behaviors through AI Systems to their causes. Related is the need for organizations to take responsibility for the behaviors of the AI Systems they deploy.

Agent

The AI equivalent of a component or service, which orchestrates other components and services in an autonomous or semi-autonomous way to help a user perform a task. For example, an agent might invoke non-AI tools, such as a web search or a weather-reporting service, and use an LLM both to construct queries for these services and to interpret their results, format a Response for the user, etc. Agents may be designed to perform actions automatically for the user, although this “power” needs to be carefully designed and tested, depending on the severity of unintended consequences. Often, agents are instead designed to recommend actions the user should take, or at least to request user confirmation before acting. Agents are an old concept in AI research that has seen a recent resurgence of interest.

AI Actor

[2] An organization or individual building an AI System.

AI System

Umbrella term for an application or system with AI components, including Datasets, Models, an Evaluation Framework and Evaluations for safety detection and mitigation, etc., plus external services, databases for runtime queries, and other application logic that together provide the system’s functionality.

Alignment

A general term for how well an AI System’s outputs (e.g., replies to queries) and behaviors correspond to end-user and service provider objectives, including the quality and utility of results, as well as safety requirements. Quality implies factual correctness and utility implies the results are fit for purpose, e.g., a Q&A system should answer user questions concisely and directly, a Python code-generation system should output valid, bug-free, and secure Python code. EleutherAI defines alignment this way, “Ensuring that an artificial intelligence system behaves in a manner that is consistent with human values and goals.” See also the Alignment Forum.

Annotation

[1] External data that complements a Dataset, such as labels that classify individual items.

Benchmark

[1] A methodology or function used for offline Evaluation of a Model or AI System for a particular purpose, together with a way to interpret the results. It consists of:

  • A set of tests with metrics.
  • A summarization of the results.

ChatBot

An AI System application for interactive sessions. It accepts user Prompts and shows the replies generated by the system.

Classification

Assigning a datum to a category, usually represented by a concise label. The categories may be a pre-defined set or discovered by analyzing the data in some way.

Context

Additional information passed to an LLM as part of a user Prompt, intended to produce a better Response than if the user’s prompt were passed to the LLM alone. This additional content may include a System Prompt, relevant documents retrieved using RAG, etc.

Cybersecurity

The catch-all term for “classic” security of systems, predating AI. AI Systems not only need to implement classic cybersecurity techniques correctly; they also introduce new security concerns.

Dataset

(See also [1]) A collection of data items used for training, Evaluation, etc. Usually, a given dataset has a schema (which may be “this is unstructured text”) and some metadata about provenance, licenses for use, transformations and filters applied, etc.

Evaluation

The capability of measuring and quantifying how a Model or AI System that uses models responds to inputs. Much like other software, models and AI systems need to be trusted and useful to their users. Evaluation aims to provide the evidence needed to gain users’ confidence.

Evaluations can cover functional and nonfunctional dimensions of models, and are applicable throughout the model development and deployment lifecycle. Functional evaluation dimensions include alignment to use cases, accuracy in responses, faithfulness to given context, robustness against perturbations and noise, and adherence to safety and social norms. Nonfunctional evaluation dimensions include latency, throughput, compute efficiency, cost to execute, carbon footprint and other sustainability concerns. Evaluations are applied as regression tests while models are trained and fine-tuned, as benchmarks while GenAI-powered applications are designed and developed, and as guardrails when these applications are deployed in production. They also have a role in compliance, both with specific industry regulations, and with emerging government policies.

Evaluations can be implemented in many ways. A Model might be used to judge results or some executable code might be used for simpler cases. Often an evaluation includes a Dataset, such as question-answer pairs that represent the desired behavior. Other techniques include rule-based systems, evaluation with LLMs acting as judges, and human evaluation.
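The question-answer-pair technique can be sketched as a small script. This is a minimal illustration, not a real framework’s API: the `model` callable, the toy dataset, and exact-match scoring are all assumptions made for the example.

```python
# Minimal sketch of a dataset-driven evaluation: score a model's answers
# against question-answer pairs by exact match (case-insensitive).
# The `model` callable is a hypothetical stand-in for a real inference API.

def evaluate(model, qa_pairs):
    """Return the fraction of questions answered exactly as expected."""
    correct = sum(
        1 for question, expected in qa_pairs
        if model(question).strip().lower() == expected.strip().lower()
    )
    return correct / len(qa_pairs)

# Toy "model" and dataset for illustration only.
toy_model = {"What is 2+2?": "4", "Capital of France?": "Paris"}.get
dataset = [("What is 2+2?", "4"), ("Capital of France?", "London")]
print(evaluate(toy_model, dataset))  # 0.5
```

Real evaluations usually replace exact match with fuzzier scoring, such as semantic similarity or an LLM acting as a judge.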

For our purposes, an evaluation must be executable within an Evaluation Framework, such as our Evaluation Reference Stack.

Evaluation Framework

An umbrella term for the software tools, runtime services, benchmark systems, etc. used to run Evaluations to measure AI Systems behaviors for trust and safety risks and mitigations, and other kinds of measurements.

Explainability

Can humans understand why the system behaves the way that it does in a particular scenario?

Fairness

Does the AI System’s behaviors exhibit social biases, preferential treatment, or other forms of non-objectivity?

Guardrails

A general term for one or more subsystems in production AI Systems that use various techniques, including specialized Models, to detect and mitigate content in user Prompts and system Responses that is undesirable in some way, such as hate speech, misinformation, hallucinations, etc.
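As a toy illustration of the rule-based end of this spectrum, the sketch below screens text against a blocked-term list. The policy terms are invented for the example; production guardrails typically combine rules like this with specialized classifier Models.

```python
# A toy rule-based guardrail: scan text for blocked terms before it
# reaches the model or the user. The term list is purely illustrative.

BLOCKED_TERMS = {"credit card number", "ssn"}  # hypothetical policy

def passes_guardrail(text: str) -> bool:
    """Return True if the text contains no blocked terms."""
    lowered = text.lower()
    return not any(term in lowered for term in BLOCKED_TERMS)

print(passes_guardrail("Here is the weather forecast."))  # True
print(passes_guardrail("Please send me your SSN."))       # False
```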

Governance

End-to-end control of assets, especially Datasets and Models, with lineage traceability and access controls for protecting the security and integrity of assets.

Hallucination

When a Model generates text that seems plausible, but is not factually accurate. Lying is not the right term, because there is no malice intended by the model, which only knows how to generate a sequence of Tokens that are plausible, i.e., probabilistically likely.

Inference

Sending a Prompt to an AI System to have it return an analysis of some kind, summarization of the prompt, or newly generated information, such as text or an image. The term inference comes from traditional statistical analysis, including model building, that is used to infer information from data.

Large Language Model

Abbreviated LLM, a state-of-the-art Model, often with billions of parameters, that can summarize, classify, and even generate text in one or more natural and programming languages. See also Multimodal Model.

Model

A combination of data and code, usually trained on a Dataset, to support Inference of some kind or other processing like Classification. See also Large Language Model and Multimodal Model.

Multimodal Model

Models that extend the text-based capabilities of LLMs with additional support for other media, such as video, audio, still images, or other kinds of data.

Privacy

Protection of individuals’ sensitive data and preservation of their rights.

Prompt

The query a user (or another system) sends to an LLM. Often, additional Context information is added by an AI System before sending the prompt to the LLM.

System Prompt

A commonly-used, statically-coded part of the Context information added by an AI System to the Prompt before sending it to the LLM. System prompts are typically used to give the model overall guidance about the application’s purpose and how the LLM should respond. For example, a system prompt might include phrases like “You are a helpful software development assistant.”
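One way an AI System might combine a system prompt, retrieved context, and the user’s query into the final Prompt is sketched below. The function name and prompt layout are illustrative assumptions, not a specific framework’s API.

```python
# Sketch of final-prompt assembly: static system prompt + retrieved
# context documents + the user's query. Layout is illustrative only.

SYSTEM_PROMPT = "You are a helpful software development assistant."

def build_prompt(user_query: str, context_docs: list[str]) -> str:
    """Concatenate the system prompt, context, and user query."""
    context = "\n".join(context_docs)
    return f"{SYSTEM_PROMPT}\n\nContext:\n{context}\n\nUser: {user_query}"

prompt = build_prompt("How do I open a file in Python?",
                      ["Use open() with a context manager."])
print(prompt.startswith(SYSTEM_PROMPT))  # True
```

Many chat APIs instead accept the system prompt as a separate, structured message, but the effect is the same: it is prepended guidance the end user never typed.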

Response

The generic term for outputs from a Model or AI System. Sometimes results is also used.

Responsible AI

(See also [2]) An umbrella term for comprehensive approaches to safety, accountability, and equitability. It covers an organization’s professional responsibility to address these concerns, and it can encompass tools, models, people, processes, integrated systems, and data.

Retrieval-augmented Generation

RAG was one of the first AI-specific design patterns for applications. It uses one or more data stores with information relevant to an application’s use cases. For example, a ChatBot for automotive repair technicians would use RAG to retrieve sections from repair manuals and logs from past service jobs, selecting the ones that are most relevant to a particular problem or subsystem the technician is working on. This Context is passed as part of the Prompt to the LLM. A key design challenge is determining relevancy and structuring the data so that relevant information is usually retrieved. See this IBM blog post for a description of RAG.
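The retrieval step can be sketched with a deliberately simple relevance measure: word overlap between the query and each stored document. Real RAG systems typically rank with vector embeddings; the scoring below and the sample “manuals” are illustrative assumptions.

```python
# Sketch of RAG retrieval: rank stored documents by word overlap with the
# user's query, then the top matches become Context in the prompt.
# Real systems use embedding similarity; overlap scoring is illustrative.

def retrieve(query: str, documents: list[str], top_k: int = 1) -> list[str]:
    """Return the top_k documents sharing the most words with the query."""
    q_words = set(query.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:top_k]

manuals = [
    "Brake pads should be replaced every 50,000 km.",
    "Engine oil must meet the 5W-30 specification.",
]
print(retrieve("When should brake pads be replaced?", manuals))
```

The retrieved text would then be placed into the Prompt’s Context before calling the LLM.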

Risk

[2] The composite measure of an event’s probability of occurring and the magnitude or degree of the consequences of the corresponding event. Risk is a function of the negative impact if the event occurs and the likelihood of occurrence.
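The composite measure reduces to a simple product, sketched below. The numeric scales (probability between 0 and 1, impact on a 0 to 10 scale) are illustrative assumptions; organizations choose their own scales.

```python
# Risk as a composite measure: likelihood of the event times the
# magnitude of its negative impact. Scales here are illustrative:
# likelihood in [0, 1], impact in [0, 10].

def risk_score(likelihood: float, impact: float) -> float:
    """Return the composite risk of an event."""
    return likelihood * impact

# A rare but severe event can outrank a common but mild one.
print(risk_score(0.25, 8.0))  # 2.0
print(risk_score(0.90, 1.0))  # 0.9
```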

Robustness

How well does the AI System continue to perform within acceptable limits, or degrade “gracefully”, when stressed in some way? For example, how does the system perform when a Prompt covers an out-of-band topic, meaning a topic that wasn’t covered in the training data?

Social Responsibility

[2] An organization’s responsibility for the impacts of its decisions and activities on society and the environment through transparent and ethical behavior.

Sustainability

(See also [2]) Taking into account the environmental impact of AI Systems, such as carbon footprint and water usage for cooling, both now and for the future.

Token

For language Models, training texts and Prompts are split into tokens, usually whole words or word fragments, according to a vocabulary of tens of thousands of tokens that can include common single characters, multi-character strings, and “control” tokens (like “end of input”). A rule of thumb is that a corpus will have roughly 1.5 times as many tokens as words.
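To make word splitting concrete, here is a toy greedy tokenizer over a tiny, made-up vocabulary. Real LLM tokenizers use large learned vocabularies (e.g., built with byte-pair encoding); the three-entry vocabulary is an assumption for illustration.

```python
# Toy greedy tokenizer: repeatedly take the longest vocabulary match,
# falling back to a single character when nothing matches. The tiny
# vocabulary below is purely illustrative.

def tokenize(word: str, vocab: set[str]) -> list[str]:
    """Split a word into tokens by greedy longest match against vocab."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                tokens.append(word[i:j])
                i = j
                break
    return tokens

vocab = {"token", "iza", "tion"}
print(tokenize("tokenization", vocab))  # ['token', 'iza', 'tion']
```

The single-character fallback mirrors how real vocabularies guarantee that any input can be encoded.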

Trust and Safety

See our definition in What We Mean by Trust and Safety.


Next, we explore trust and safety concepts as expressed by various expert organizations.