Glossary of Terms
Let’s define the common terms we use. Some of the terms defined here are industry standards, while others are not standard, but they are useful for our purposes.
Some definitions are adapted from the following sources, which are cited below by number, i.e., [1] and [2]:
- [1] MLCommons AI Safety v0.5 Benchmark Proof of Concept Technical Glossary
- [2] NIST Artificial Intelligence Risk Management Framework (AI RMF 1.0)
Table of contents
- Glossary of Terms
- Accountability
- AI Actor
- AI System
- Alignment
- Annotation
- Benchmark
- Dataset
- Explainability
- Evaluation
- Evaluation Framework
- Evaluator
- Fairness
- Governance
- Hallucination
- Inference
- Large Language Model
- Model
- Multimodal Model
- Privacy
- Responsible AI
- Risk
- Robustness
- Social Responsibility
- Sustainability
- Taxonomy
- Token
Accountability
An aspect of Governance, where we trace behaviors through AI Systems to their causes. Related is the need for organizations to take responsibility for the behaviors of the AI Systems they deploy.
AI Actor
[2] An organization or individual building an AI System.
AI System
Umbrella term for an application or system with AI components, including Datasets, Models, Evaluation Frameworks and Evaluators for safety detection and mitigation, etc., plus external services, databases for runtime queries, and other application logic that together provide functionality.
Alignment
A general term for how well an AI System’s outputs (e.g., replies to queries) and behaviors correspond to end-user and service provider objectives, including the quality and utility of results, as well as safety requirements. Quality implies factual correctness, and utility implies the results are fit for purpose. For example, a Q&A system should answer user questions concisely and directly, and a Python code-generation system should output valid, bug-free, and secure Python code. EleutherAI defines alignment this way: “Ensuring that an artificial intelligence system behaves in a manner that is consistent with human values and goals.” See also the Alignment Forum.
Annotation
[1] External data that complements a Dataset, such as labels that classify individual items.
Benchmark
[1] A methodology or function used for offline Evaluation of a Model or AI System for a particular purpose, and for interpreting the results. Typically, a benchmark consists of:
- A set of Evaluations with metrics.
- A summarization of the results.
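As a minimal sketch of that two-part structure, a benchmark run can be modeled as a list of per-evaluation metric scores plus a function that summarizes them. The names `EvaluationResult` and `summarize`, the choice of a per-metric mean, and all of the scores are assumptions made for illustration; they do not come from [1].

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class EvaluationResult:
    """One evaluation's outcome (hypothetical structure for illustration)."""
    name: str     # which evaluation was run
    metric: str   # the metric it reports, e.g. "accuracy"
    score: float  # metric value, assumed here to lie in [0, 1]


def summarize(results: list[EvaluationResult]) -> dict[str, float]:
    """Summarize a benchmark run as the mean score per metric."""
    by_metric: dict[str, list[float]] = {}
    for r in results:
        by_metric.setdefault(r.metric, []).append(r.score)
    return {metric: mean(scores) for metric, scores in by_metric.items()}


# Made-up results, purely to exercise the summary step.
results = [
    EvaluationResult("qa-short", "accuracy", 0.80),
    EvaluationResult("qa-long", "accuracy", 0.60),
    EvaluationResult("toxicity-probe", "refusal_rate", 0.90),
]
summary = summarize(results)
```

A real benchmark would likely use a more careful summarization than a plain mean, for example grading against thresholds, but the separation between individual evaluations and the summary step stays the same.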
Dataset
(See also [1]) A collection of data items used for training, evaluation, etc. Usually, a given dataset has a schema (which may be “this is unstructured text”) and some metadata about provenance, licenses for use, transformations and filters applied, etc.
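To make the schema and metadata concrete, here is a small sketch of a record that could accompany a dataset. `DatasetCard` and its field names are hypothetical, chosen only to mirror the items listed above, and the example values are made up.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetCard:
    """Hypothetical metadata record for a dataset (illustrative field names)."""
    name: str
    schema: dict[str, str]  # column name -> type; {"text": "str"} for unstructured text
    source: str             # provenance: where the data came from
    license: str            # license governing use
    # Filters and transformations applied, in order.
    transformations: list[str] = field(default_factory=list)


card = DatasetCard(
    name="example-corpus",  # made-up dataset for illustration
    schema={"text": "str"},
    source="hypothetical web crawl",
    license="CC-BY-4.0",
    transformations=["deduplicate", "remove PII"],
)
```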
Explainability
Can humans understand why the system behaves the way that it does in a particular scenario?
Evaluation
The capability of measuring and quantifying how a Model or AI System that uses models responds to inputs. Much like other software, models and AI systems need to be trusted and useful to their users. Evaluation aims to provide the evidence needed to gain users’ confidence.
Evaluations can cover functional and nonfunctional dimensions of models, and are applicable throughout the model development and deployment lifecycle. Functional evaluation dimensions include alignment to use cases, accuracy in responses, faithfulness to given context, robustness against perturbations and noise, and adherence to safety and social norms. Nonfunctional evaluation dimensions include latency, throughput, compute efficiency, cost to execute, carbon footprint, and other sustainability concerns.
Evaluations are applied as regression tests while models are trained and fine-tuned, as benchmarks while GenAI-powered applications are designed and developed, and as guardrails when these applications are deployed in production. They also have a role in compliance, both with specific industry regulations and with emerging government policies.
Lastly, numerous techniques are used to implement evaluations. Common ones are rule-based automatic evaluation, evaluation with LLMs acting as judges, and human evaluation.
See also Evaluation Framework and Evaluator.
Evaluation Framework
An umbrella term for the software tools, runtime services, benchmark systems, etc. used to perform Evaluations by running different Evaluators to measure AI Systems for trust and safety risks and mitigations, and other kinds of measurements.
Evaluator
A classifier Model or similar tool, possibly including a Dataset, that can quantify an AI System’s inputs and outputs to detect the presence of risky content, such as hate speech, hallucinations, etc. For our purposes, an evaluator is API-compatible for execution within an Evaluation Framework. In general, an evaluator could be targeted towards non-safety needs, such as measuring other aspects of Alignment, Inference model latency and throughput, carbon footprint, etc. Also, a given evaluator could be used at many points in the total AI life cycle, e.g., for a benchmark and an inference-time test.
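One way to picture "API-compatible" is a shared interface that every evaluator implements, so a framework can run any of them the same way. The sketch below is an assumption for illustration only: the `Evaluator` protocol, the `evaluate` signature, the toy `BlocklistEvaluator`, and `run_evaluators` are hypothetical names, not the API of any particular evaluation framework.

```python
from typing import Protocol


class Evaluator(Protocol):
    """Hypothetical interface an evaluation framework might require."""
    name: str

    def evaluate(self, prompt: str, response: str) -> float:
        """Return a score in [0, 1]; higher means more risk detected."""
        ...


class BlocklistEvaluator:
    """Toy rule-based evaluator: flags responses containing blocked terms."""

    def __init__(self, blocked_terms: set[str]) -> None:
        self.name = "blocklist"
        self.blocked_terms = {t.lower() for t in blocked_terms}

    def evaluate(self, prompt: str, response: str) -> float:
        # Normalize words by lowercasing and stripping common punctuation.
        words = {w.strip(".,!?;:").lower() for w in response.split()}
        return 1.0 if words & self.blocked_terms else 0.0


def run_evaluators(
    evaluators: list[Evaluator], prompt: str, response: str
) -> dict[str, float]:
    """Framework-side loop: run every evaluator on one prompt/response pair."""
    return {e.name: e.evaluate(prompt, response) for e in evaluators}


scores = run_evaluators(
    [BlocklistEvaluator({"password"})],
    prompt="How do I reset my account?",
    response="Send me your password.",
)
```

Because the framework only depends on the interface, the same loop could run a classifier model, an LLM acting as a judge, or a latency probe, at benchmark time or at inference time.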
Fairness
Do the AI System’s behaviors exhibit social biases, preferential treatment, or other forms of non-objectivity?
Governance
End-to-end control of assets, especially Datasets and Models, with lineage traceability and access controls for protecting the security and integrity of assets.
Hallucination
When a Model generates text that seems plausible, but is not factually accurate. Lying is not the right term, because there is no malice intended by the model, which only knows how to generate a sequence of Tokens that are plausible, i.e., probabilistically likely.
Inference
Sending information to a Model or AI System to have it return an analysis of some kind, a summarization of the input, or newly generated information, such as text. The term query is typically used when working with LLMs. The term inference comes from traditional statistical analysis, including model building, which is used to infer information from data.
Large Language Model
Abbreviated LLM, a state-of-the-art Model, often with billions of parameters, that can summarize, classify, and even generate text in one or more human and programming languages. See also Multimodal Model.
Model
A combination of data and code, usually trained on a Dataset, to support Inference of some kind. See also Large Language Model and Multimodal Model.
Multimodal Model
Models that extend the text-based capabilities of LLMs with additional support for other media, such as video, audio, still images, or other kinds of data.
Privacy
Protection of individuals’ sensitive data and preservation of their rights.
Responsible AI
(See also [2]) An umbrella term about comprehensive approaches to safety, accountability, and equitability. It covers an organization’s professional responsibility to address concerns. It can encompass tools, models, people, processes, integrated systems, and data.
Risk
[2] The composite measure of an event’s probability of occurring and the magnitude or degree of the consequences of the corresponding event. Risk is a function of the negative impact if the event occurs and the likelihood of occurrence.
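A common simplification, shown here as a sketch, scores risk as probability multiplied by impact. The product form, the `risk_score` name, the 0 to 10 impact scale, and the example numbers are all assumptions for illustration; the definition above only requires that risk be some function of both quantities.

```python
def risk_score(probability: float, impact: float) -> float:
    """Expected-impact score: likelihood times magnitude of harm.

    The product form is an assumption made for illustration; the
    definition only says risk is a function of both quantities.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if impact < 0:
        raise ValueError("impact must be non-negative")
    return probability * impact


# Hypothetical events with made-up numbers: (probability, impact on a 0-10 scale).
events = {
    "rare_but_severe": (0.01, 10.0),
    "common_but_minor": (0.50, 1.0),
}

# Rank events from highest to lowest risk score.
ranked = sorted(events, key=lambda name: risk_score(*events[name]), reverse=True)
```

Under this scoring, a frequent minor event can outrank a rare severe one, which is one reason some organizations prefer a different function, such as a risk matrix that weights severe outcomes more heavily.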
Robustness
How well does the AI System continue to perform within acceptable limits or degrade “gracefully” when stressed in some way? For example, how well does a Model respond to prompts that deviate from its training data?
Social Responsibility
[2] An organization’s responsibility for the impacts of its decisions and activities on society and the environment through transparent and ethical behavior.
Sustainability
(See also [2]) Taking into account the environmental impact of AI Systems, such as carbon footprint and water usage for cooling, both now and for the future.
Taxonomy
In this context, taxonomy refers to how categories are defined for known risks, other safety concerns, and other areas where detection or measurement is desirable.
Token
For language Models, the training texts and query prompts are split into tokens, usually whole words or fractions of words, according to a vocabulary of tens of thousands of tokens that can include common single characters, sequences of several characters, and “control” tokens (like “end of input”). A rule of thumb is that a corpus has roughly 1.5 times as many tokens as words.
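The rule of thumb can be turned into a quick estimator, sketched below. It counts whitespace-separated words rather than running a real tokenizer, and `estimate_tokens` is a hypothetical helper; actual ratios vary with the tokenizer and the language of the text.

```python
import math

# Rule-of-thumb ratio from the definition above; real ratios vary
# by tokenizer and by language.
TOKENS_PER_WORD = 1.5


def estimate_tokens(text: str) -> int:
    """Estimate the token count from a whitespace word count."""
    word_count = len(text.split())
    return math.ceil(word_count * TOKENS_PER_WORD)


# Nine words, so the estimate is ceil(9 * 1.5) = 14 tokens.
estimate = estimate_tokens("The quick brown fox jumps over the lazy dog")
```

An estimate like this is useful for rough sizing of prompts or corpora. For exact counts, run the tokenizer that belongs to the model in question.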
Next, we explore trust and safety concepts as expressed by various expert organizations.