Glossary of Terms

Some of the terms defined here are industry standards, while others are not standard but are useful for our purposes.

Some definitions are adapted from the following sources, which are indicated below using the same numbers, i.e., [1] and [2]:

  1. MLCommons AI Safety v0.5 Benchmark Proof of Concept Technical Glossary
  2. NIST Artificial Intelligence Risk Management Framework (AI RMF 1.0)

Sometimes we will use a term that could be defined, but for brevity we won’t provide a definition. We show these terms in italics. You can assume the usual, plain-sense meaning for the term, or search for a definition when needed.

Table of contents
  1. Glossary of Terms
    1. Acceptance Benchmark
    2. Acceptance Test
    3. Agent
    4. AI Actor
    5. AI System
    6. Alignment
    7. Automatable
    8. Behavior
    9. Behavior-Driven Development
    10. Benchmark
    11. Class
    12. Coding Agent
    13. Component
    14. Concurrent
    15. Context
    16. Cohesion
    17. Coupling
    18. Design by Contract
    19. Dataset
    20. Determinism
    21. Explainability
    22. Evaluation
    23. Evaluation Framework
    24. Fairness
    25. Feature
    26. Function
    27. Functional Programming
    28. Generative Adversarial Networks
    29. Generative AI Model
    30. Hallucination
    31. Immutable
    32. Inference
    33. Integration Benchmark
    34. Integration Test
    35. Large Language Model
    36. Model Context Protocol
    37. Object-Oriented Programming
    38. Multimodal Model
    39. Mutable
    40. Paradigm
    41. Predictable
    42. Probability and Statistics
    43. Prompt
    44. Prompt Engineering
    45. Property-Based Testing
    46. Refactoring
    47. Regression
    48. Reinforcement Learning
    49. Repeatable
    50. Retrieval-augmented Generation
    51. Response
    52. Robustness
    53. Scenario
    54. Sequential
    55. Side Effect
    56. Specification-Driven Development
    57. State
    58. State Machine
    59. Stochastic
    60. System Prompt
    61. Teacher Model
    62. Test
    63. Test Double
    64. Test-Driven Development
    65. Token
    66. Training
    67. Tuning
    68. Unit
    69. Unit Benchmark
    70. Use Case
    71. Unit Test
    72. Vibe Coding
    73. Vibe Engineering

Acceptance Benchmark

The analog of Acceptance Tests for an AI-enabled system that has Stochastic behaviors. Benchmark technology is adapted for the purpose.

See also Unit Test, Unit Benchmark, Integration Test, Integration Benchmark, and Acceptance Test.

Acceptance Test

A test that verifies a user-visible feature works as required, often by driving the user interface or calling the external API. These tests are system-wide and end-to-end. They are sometimes executed manually, if automation isn’t feasible.

However, it is desirable to make them automated, in which case all operations with Side Effects need to be replaced with Deterministic Test Doubles.

See also Test, Unit Test, Unit Benchmark, Integration Test, Integration Benchmark, Acceptance Test, and Acceptance Benchmark.

Agent

An old concept in AI, but now experiencing a renaissance as the most flexible architecture pattern for AI-based applications. Agents are orchestrations of Generative AI Model invocations and external services, e.g., planners, schedulers, reasoning engines, data sources (weather, search, …), etc. In this architecture, the best capabilities of each service and model are leveraged, rather than assuming that models can do everything successfully themselves. Agent-based applications sometimes use multiple models, one per agent, where each one provides some specific capabilities. For example, one model might process user prompts into back-end API invocations, including to other models, and interpret the results into user-friendly responses.

AI Actor

[2] An organization or individual building an AI System.

AI System

Umbrella term for an application or system with AI Components, including Datasets, Generative AI Models (e.g., LLMs), Evaluation Frameworks and Evaluations for safety detection and mitigation, etc., plus external services, databases for runtime queries, and other application logic that together provide functionality.

Alignment

A general term for how well an AI System’s outputs (e.g., replies to queries) and Behaviors correspond to end-user and service provider objectives, including the quality and utility of results, as well as safety requirements. Quality implies factual correctness and utility implies the results are fit for purpose, e.g., a Q&A system should answer user questions concisely and directly, a Python code-generation system should output valid, bug-free, and secure Python code. EleutherAI defines alignment this way, “Ensuring that an artificial intelligence system behaves in a manner that is consistent with human values and goals.” See also the work of the Alignment Forum.

Automatable

Can an action, like a test, be automated so it can be executed without human intervention?

Behavior

What does a Component do, either autonomously on its own (e.g., a security monitoring tool that is constantly running) or when invoked by another component through an API or Function call? This is a general-purpose term that could cover a single Feature, a whole Use Case or anything in between.

Behavior-Driven Development

Behavior-Driven Development (BDD) is an evolution of TDD where the testing APIs more explicitly express the language of specifying behaviors. Hence, writing tests in a BDD style means creating executable specifications.

Popular examples include RSpec for the Ruby language community and several BDD-inspired dialects supported by ScalaTest for Scala. While useful for thinking through requirements, there was a tendency for these APIs to be verbose to use, so practitioners often combined these APIs with more concise testing APIs. See also Test-Driven Development, Specification-Driven Development, Property-Based testing, and Design by Contract.

Benchmark

[1] A methodology or Function used for offline Evaluation of a Generative AI Model or AI System for a particular purpose and to interpret the results. It consists of:

  • A set of tests with metrics.
  • A summarization of the results.

See also Unit Benchmark, Integration Benchmark, and Acceptance Benchmark.

Class

The primary Component abstraction in Object-Oriented Programming, although not necessarily the only one.

Coding Agent

An AI-powered IDE or tool specifically designed for AI-assisted software development.

Component

An ill-defined, but often-used term in software. In this case, we use it to refer generically to any piece of software with a well-defined purpose and an access API that defines clear boundaries. Depending on the programming language, it may group together Functions, Classes, etc. Particular programming languages and “paradigms” (like OOP and FP) use terms like packages, modules, subsystems, and libraries; even web services can be considered components.

In principle, a component could contain a single Unit. So, for simplicity in the rest of the text, we will use Component as an umbrella term that could also mean an individual Unit, unless it is important to make finer distinctions.

Concurrent

When work can be partitioned into smaller steps that can be executed in any order and the runtime executes them in an unpredictable order. If the order is predictable, no matter how the work is executed, we can say it is effectively Sequential.

Context

Additional information passed to an LLM as part of a user Prompt, intended to provide useful context so that the Response is better than if the user’s prompt were passed to the LLM alone. This additional content may include a System Prompt, relevant documents retrieved using RAG, etc.

Cohesion

Does a Component feel like “one thing” with a single purpose, exhibiting well-defined Behaviors with a coherent State? Or does it feel like a miscellaneous collection of behaviors or state?

Coupling

How closely connected is one Component to others in the system? “Loose” coupling is preferred, because it makes it easier to test components in isolation, substitute replacements when needed, etc. Strongly coupled components often indicate poor abstraction boundaries between them.

Design by Contract

Design by Contract (“DbC”) is an idea developed by Bertrand Meyer and incorporated into his Eiffel programming language. In Eiffel, all functions can define a contract for allowed inputs, invariants, and guaranteed responses, if the input requirements are met. The runtime system handles any failures of these contracts. A core principle of DbC is that contract failures should terminate the application immediately, forcing the developers to fix the issue; failing to do so becomes an excuse to let bugs accumulate. If this principle was rigorously followed during development, it is often considered acceptable (or at least “expedient”) to log contract failures, but not terminate execution, in production runs. DbC can be used in other languages through built-in features (like assertions), libraries, or various runtime features.

DbC provides many of the same design benefits provided by TDD, which emerged later, such as directing attention to more rigorous API design. Because of the additional benefits of TDD, DbC has largely fallen out of practice, but its formalism for what constitutes good contracts is still highly valuable and recommended for study.
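
To make the contract idea concrete, here is a minimal sketch in Python (an assumption; Eiffel has dedicated contract syntax), using plain assertions for the preconditions and postconditions of a hypothetical withdraw function:

```python
def withdraw(balance: float, amount: float) -> float:
    """Withdraw `amount` from `balance`, returning the new balance."""
    # Preconditions: the caller must request a positive amount no larger than the balance.
    assert amount > 0, "precondition violated: amount must be positive"
    assert amount <= balance, "precondition violated: amount exceeds balance"

    new_balance = balance - amount

    # Postcondition: the guaranteed response, given that the preconditions held.
    assert new_balance >= 0, "postcondition violated: balance went negative"
    return new_balance
```

During development, a failed assertion terminates the run, per the principle above; a production build might log the failure instead.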

Dataset

(See also [1]) A collection of data items used for training, evaluation, etc. Usually, a given dataset has a schema (which may be “this is unstructured text”) and some metadata about provenance, licenses for use, transformations and filters applied, etc.

Determinism

The output of a Component for a given input is always known precisely. This affords writing repeatable, predictable software and automated, reliable tests.

In contrast, nondeterminism means identical inputs yield different results, removing Repeatability and complicating Predictability, and the ability to write automated, reliable tests.

Explainability

Can humans understand why the system behaves the way that it does in a particular Use Case? Can the system provide additional information about why it produced a particular output?

Evaluation

Much like other software, models and AI systems need to be trusted and useful to their users. Evaluation aims to provide the evidence needed to gain users’ confidence for an AI System.

A particular evaluation is the capability of measuring and quantifying how a Generative AI Model, e.g., an LLM, or an AI System as a whole handles Prompts and the kinds of Responses produced. For example, an evaluation might be used to see whether hate speech is detected in prompts and responses, whether responses contain hallucinations, to measure the overhead (time and compute) of processing, or, for our purposes, to verify that a required Use Case is implemented, etc.

An evaluation may be implemented in one of several ways. A classifier LLM or another kind of model might be used to score content. A Dataset of examples is commonly used. For our purposes, an implementation is API compatible for execution within an Evaluation Framework.

See also Evaluation Framework.

Evaluation Framework

An umbrella term for the software tools, runtime services, benchmark systems, etc. used to perform Evaluations by running their implementations to measure AI systems for trust and safety risks and mitigations, and other concerns.

Fairness

Do the AI system’s responses exhibit social biases, preferential treatment, or other forms of non-objectivity?

Feature

For our purposes, a small bit of functionality provided by an application. It is the increment of change in a single cycle of the Test-Driven Development process, which could be enhancing some existing user-visible functionality or adding new functionality in a small step. See also Use Case.

Function

In most languages, the most fundamental Unit of abstraction and execution. Depending on the language, the term function or method might be used, where the latter are special functions associated with Classes in OOP languages. Some languages allow code blocks outside of functions, perhaps inside alternative Component boundaries, but this is not important for our purposes.

Many functions are free of Side Effects, meaning they don’t read or write State external to the function and shared by other functions. These functions are always Deterministic; for given inputs, they always return the same output. This is a very valuable property for design, testing, and reuse.

Other functions that read and possibly write external state are nondeterministic, as are functions implemented with Concurrency in a way that the order of results is not deterministic. Examples include functions that retrieve data, like a database record, functions that generate UUIDs, and functions that call other processes or systems.
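
A small illustration of the distinction, sketched in Python with hypothetical function names:

```python
import uuid

# Side-effect free: depends only on its arguments, so it is Deterministic.
def add(x: int, y: int) -> int:
    return x + y

# Reads and writes state shared outside the function (a module-level counter),
# so repeated calls with the same argument return different results.
_counter = 0

def next_label(prefix: str) -> str:
    global _counter
    _counter += 1
    return f"{prefix}-{_counter}"

# Nondeterministic for a different reason: each call generates a fresh UUID.
def new_id() -> str:
    return str(uuid.uuid4())
```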

Functional Programming

FP is a design methodology that attempts to formalize the properties of Functions, inspired by the behavior of mathematical functions. State is maintained in a small set of abstractions, like Maps, Lists, and Sets, with operations that are implemented separately, following protocol abstractions exposed by the collections. Like mathematical objects and unlike objects in Object-Oriented Programming, mutation of State is prohibited; any operation, like adding elements to a collection, creates a new, Immutable copy.

FP became popular when concurrent software became more widespread in the 2000s, because immutable objects lead to far fewer concurrency bugs. FP languages may have other Component constructs for grouping functions, e.g., into libraries.

Contrast with Object-Oriented Programming. Many programming languages combine aspects of FP and OOP.
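
For illustration, a small Python sketch of the immutable style, where “modifying” a value produces a new copy (the Point class is hypothetical):

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)   # frozen=True makes instances immutable
class Point:
    x: int
    y: int

p1 = Point(1, 2)
p2 = replace(p1, x=10)    # creates a new Point; p1 is unchanged

# The same idea with a built-in collection: build a new tuple rather than
# mutating a shared list in place.
xs = (1, 2, 3)
ys = xs + (4,)            # xs is still (1, 2, 3)
```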

Generative Adversarial Networks

A GAN uses two neural networks that compete with each other in a “zero-sum” game, where one agent’s gain is another agent’s loss.

Quoting from the Wikipedia page on GANs:

Given a training set, this technique learns to generate new data with the same statistics as the training set. For example, a GAN trained on photographs can generate new photographs that look at least superficially authentic to human observers, having many realistic characteristics…

The core idea of a GAN is based on the “indirect” training through the discriminator, another neural network that can tell how “realistic” the input seems, which itself is also being updated dynamically. This means that the generator is not trained to minimize the distance to a specific image, but rather to fool the discriminator. This enables the model to learn in an unsupervised manner.

The “adversarial” part is how the generator attempts to fool the discriminator, which learns to detect these situations.

Generative AI Model

A combination of data and code, usually trained on a Dataset, to support Inference of some kind.

For convenience, in the text, we use the shorthand term model to refer to the generative AI Component that has Nondeterministic Behavior, whether it is a model invoked directly through an API in the same application or invoked by calling another service (e.g., ChatGPT). The goal of this project is to better understand how developers can test models.

See also Large Language Model (LLMs) and Multimodal Model.

Hallucination

When a Generative AI Model generates text that seems plausible, but is not factually accurate. Lying is not the right term, because there is no malice intended by the model, which only knows how to generate a sequence of Tokens that are plausible. Which token is actually returned in a given context is a Stochastic process, i.e., a random process governed by a Probability distribution.

Immutable

A Unit’s or Component’s State cannot be modified, once it has been initialized. If all units in a Component are immutable, then the component itself is considered immutable. Contrast with Mutable. See also State.

Inference

Sending information to a Generative AI Model or AI System to have it return an analysis of some kind, summarization of the input, or newly generated information, such as text. The term query is typically used when working with LLMs. The term inference comes from traditional statistical analysis, including model building, that is used to infer information from data.

Integration Benchmark

The analog of Integration Tests for several Units and Components working together, where some of them are AI-enabled and exhibit Stochastic behaviors. Benchmark technology is adapted for the purpose.

See also Unit Test, Unit Benchmark, Integration Test, Acceptance Test, and Acceptance Benchmark.

Integration Test

A test for several Units and Components working together that verifies they interoperate properly. These components could be distributed systems, too. When any of the units that are part of the test have Side Effects and the purpose of the test is not to explore handling of such side effects, all units with side effects should be replaced with Test Doubles to make the test Deterministic.

See also Test, Unit Test, Unit Benchmark, Integration Benchmark, Acceptance Test, and Acceptance Benchmark.

Large Language Model

Abbreviated LLM, a state-of-the-art Generative AI Model, often with billions of parameters, that has the ability to summarize, classify, and even generate text in one or more spoken and programming languages. See also Multimodal Model.

Model Context Protocol

Abbreviated MCP, a de-facto standard for communications between models, agents, and other tools. See modelcontextprotocol.io for more information.

Object-Oriented Programming

OOP (or OOSD - object-oriented software development) is a design methodology that creates software Components with boundaries that mimic real-world objects (like Person, Automobile, Shopping Cart, etc.). Each object encapsulates State and Behavior behind its abstraction.

Introduced in the Simula language in the 1960s, it gained widespread interest in the 1980s with the emergence of graphical user interfaces (GUIs), where objects like Windows, Buttons, and Menus were an intuitive way to organize such software.

Contrast with Functional Programming. Many programming languages combine elements of FP and OOP.

Multimodal Model

Generative AI Models that usually extend the text-based capabilities of LLMs with additional support for other media, such as video, audio, still images, or other kinds of data.

Mutable

A Unit’s State can be modified during execution, either through direct manipulation by another unit or indirectly by invoking the unit (e.g., calling a Function that changes the state). If any one unit in a Component is mutable, then the component itself is considered mutable. Contrast with Immutable. See also State.

Paradigm

From the Merriam-Webster Dictionary definition of paradigm: “a philosophical and theoretical framework of a scientific school or discipline within which theories, laws, and generalizations and the experiments performed in support of them are formulated.”

Predictable

In the context of software, the quality that, knowing a Unit’s or Component’s history of past Behavior and its design, you can predict its future behavior reliably. See also State Machine.

Probability and Statistics

Two interrelated branches of mathematics, where statistics concerns such tasks as collecting, analyzing, and interpreting data, while probability concerns observations, in particular the percentage likelihood that certain values will be measured when observations are made of a random process, or more precisely, a random probability distribution, like heads or tails when flipping a coin. This probability distribution is the simplest possible; there is a 50-50 chance of heads or tails (assuming a fair coin). The probability distribution for rolling a particular sum with a pair of dice is less simple, but straightforward. The probability distribution for the heights of women in the United States is more complicated, where historical data determines the distribution, not a simple formula.

Both disciplines emerged together to solve practical problems in science, industry, sociology, etc. It is common for researchers to build a mathematical model (in the general sense of the word, not just an AI model) of the system being studied, in part to compare actual results with predictions from the model, confirming or rejecting the underlying theories about the system upon which the model was built. Also, if the model is accurate, it provides predictive capabilities for possible and likely future observations.

Contrast with Determinism. See also Stochastic.

Prompt

The query a user (or another system) sends to an LLM. Often, additional Context information is added by an AI System before sending the prompt to the LLM. See also Prompt Engineering.

Prompt Engineering

A term for the careful construction of good Prompts to maximize the quality of Inference responses. It is really considered more art than science or engineering because of the subjective relationship between prompts and responses for Generative AI Models.

Property-Based Testing

Property-Based Testing (PBT) is sometimes also called property-based development or property-driven development. This variation of Test-Driven Development emphasizes the mathematical properties of Units being tested. Obvious examples are arithmetic functions on integers, but properties and the “laws” they impose can be much more general. For example, all programming languages support concatenation (e.g., “addition”) of strings, where an empty string is the “zero”. Hence, length("foo") == length("foo" + "") == 3. String addition is associative, (a+b)+c == a+(b+c), but not commutative, a+b ≠ b+a.

All libraries that support PBT let you define the properties that must hold and a way of defining allowed values of the “types” in question. At test time, the library generates a large set of representative instances of the types and verifies the properties hold for all instances.
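
For example, here is a minimal sketch using the Python Hypothesis library (one such PBT library) to check the string concatenation properties described above; run it with pytest:

```python
from hypothesis import given, strategies as st

# Property: string concatenation is associative.
@given(st.text(), st.text(), st.text())
def test_concat_is_associative(a, b, c):
    assert (a + b) + c == a + (b + c)

# Property: the empty string is the identity ("zero") for concatenation.
@given(st.text())
def test_empty_string_is_identity(s):
    assert s + "" == s
    assert "" + s == s
```

Hypothesis generates many representative strings for a, b, and c and reports any combination that violates a property.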

Property-based testing emerged in the Functional Programming community.

See also Design by Contract, Specification-Driven Development, Behavior-Driven Development, and Test-Driven Development.

Refactoring

Modifying code to change its structure as required to support a new feature. No Behavior changes are introduced, so that the existing automated Tests can verify that no regressions are introduced as the code is modified. This is the first step in the Test-Driven Development cycle.

Regression

When an unexpected Behavior change is introduced into a previously-working Unit, because of a change made to the code base, often in other units for unrelated functionality.

Automated Tests are designed to catch regressions as soon as they occur, making it easier to diagnose the change that caused the regression, as well as detecting the regression in the first place.

Reinforcement Learning

Reinforcement learning (RL) is a form of machine learning, often used for optimizing control or similar systems. In RL, an agent performs a loop where it observes the state of the “world” visible to it at the current time, then takes what it thinks is a suitable action for the next step, chosen to maximize a reward signal, often with the goal of maximizing the long-term reward, such as winning a game. The reinforcement aspect is an update, at each step, to a model of some kind that the agent uses to assess which actions produce which rewards, given the current state. However, when choosing the next step, the best-known choice is not always made. Some degree of randomness is introduced so that the agent explores all possible states and rewards, rather than getting stuck always making choices that are known to be good, but may be less optimal than unknown choices.
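
As a toy illustration of this loop (not how generative AI tuning is actually implemented), here is an epsilon-greedy agent for a three-armed bandit, sketched in Python with made-up reward probabilities:

```python
import random

# The agent usually exploits the arm with the best estimated reward, but with
# probability epsilon it explores a random arm so it doesn't get stuck on a
# known-good but suboptimal choice.
true_means = [0.2, 0.5, 0.8]     # hidden reward probabilities (hypothetical)
estimates = [0.0, 0.0, 0.0]
counts = [0, 0, 0]
epsilon = 0.1

for step in range(10_000):
    if random.random() < epsilon:
        arm = random.randrange(3)                 # explore
    else:
        arm = estimates.index(max(estimates))     # exploit current best estimate
    reward = 1.0 if random.random() < true_means[arm] else 0.0
    counts[arm] += 1
    # Incremental update of the running average reward for this arm.
    estimates[arm] += (reward - estimates[arm]) / counts[arm]

print(estimates)   # rough estimates of the hidden values [0.2, 0.5, 0.8]
```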

In the generative AI context, RL is a popular tool in the suite of model Tuning processes that are used to improve model performance in various ways.

See also Reinforcement Finetuning in From Testing to Tuning.

Repeatable

If an action, like running a test, is run repeatedly with no code or data changes, does it return the same results every time? By design, Generative AI Models are expected to return different results each time a query is repeated.

Retrieval-augmented Generation

RAG was one of the first AI-specific design patterns for applications. It uses one or more data stores with information relevant to an application’s use cases. For example, a ChatBot for automotive repair technicians would use RAG to retrieve sections from repair manuals and logs from past service jobs, selecting the ones that are most relevant to a particular problem or subsystem the technician is working on. This Context is passed as part of the Prompt to the LLM.

A key design challenge is determining relevancy and structuring the data so that relevant information is usually retrieved. This is typically done by breaking the reference data into “chunks” and encoding each chunk in a vector representation (an embedding), which supports a similarity metric. During inference, the prompt is passed through the same encoding and the top few nearest neighbors, based on the metric, are returned for the context, thereby attempting to ensure maximum relevancy.
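
A minimal, self-contained sketch of the retrieval step, using a toy bag-of-words “embedding” and cosine similarity in place of a real embedding model (the chunks and function names are hypothetical):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": a bag-of-words count vector. A real system would use a
    # learned embedding model instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

chunks = [
    "Replacing the brake pads on the rear axle",
    "Diagnosing a failed oxygen sensor",
    "Flushing and refilling the coolant system",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(prompt: str, k: int = 2) -> list[str]:
    q = embed(prompt)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

context = retrieve("customer reports grinding noise when braking")
# The retrieved chunks are added to the Context passed to the LLM.
```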

See this IBM blog post for a description of RAG.

Response

The generic term for outputs from a Generative AI Model or AI System. Sometimes results is also used.

Robustness

How well does the AI System continue to perform within acceptable limits or degrade “gracefully” when stressed in some way? For example, how well does a Generative AI Model respond to prompts that deviate from its training data?

Scenario

One path through a use case, such as one “happy path” from beginning to end where a user completes a task or accomplishes a goal. A failure scenario is a path through the use case where the user is unable to succeed, due to system or user errors.

Note: When the text doesn’t link to this definition, it is because the word is being used generically or because the text already linked to this definition. Hopefully the context will be clear.

Sequential

The steps of some work are performed in a predictable, repeatable order. This property is one of the requirements for Deterministic Behavior. Contrast with Concurrent.

Side Effect

Reading and/or writing State that a Unit, e.g., a Function, shares with other units. See also Determinism. If a Component contains units that perform side effects, then the component itself is considered to perform side effects.

Specification-Driven Development

Abbreviated SDD and also known as Spec-Driven Development. In our context, this refers to an idea introduced by GitHub and Microsoft, that we should structure code generation prompts in a more-precise format to ensure we get the code we need. The argument is that many models are already perfectly capable of generating this code, but they are “literal minded” and need to be told precisely what is needed from them.

We discuss SDD at length in the Specification-Driven Development chapter. SDD is similar in its goals to Test-Driven Development, although arguably closer to the emphasis in Behavior-Driven Development.

State

Used in software to refer to a set of values in some context, like a Component. The values determine how the component will behave in subsequent invocations to perform some work. The values can sometimes be read directly by other components. If the component is Mutable, then the state can be changed by other components either directly or through invocations of the component that cause state transitions to occur. (For example, popping the top element of a stack changes the contents of the stack, the number of elements it currently holds, etc.)

Often, these state transitions are modeled with a State Machine, which constrains the allowed transitions.

State Machine

A formal model of how the State of a component can transition from one value (or set of values) to another. As an example, the TCP protocol has a well-defined state machine.
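
A drastically simplified sketch in Python (the states and events are hypothetical, not the real TCP state machine), where a table of allowed transitions rejects everything else:

```python
from enum import Enum, auto

class ConnState(Enum):
    CLOSED = auto()
    LISTEN = auto()
    ESTABLISHED = auto()

# Allowed transitions: (current state, event) -> next state.
TRANSITIONS = {
    (ConnState.CLOSED, "listen"): ConnState.LISTEN,
    (ConnState.LISTEN, "accept"): ConnState.ESTABLISHED,
    (ConnState.ESTABLISHED, "close"): ConnState.CLOSED,
}

def step(state: ConnState, event: str) -> ConnState:
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {event} from {state.name}")
```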

Stochastic

The behavior of a system where observed values are governed by a random probability distribution. For example, when flipping a coin repeatedly, the observed values, heads or tails, are governed by a distribution that predicts 50% of the time heads will be observed and 50% of the time tails will be observed, assuming a fair coin (not weighted on one side or the other). The value you observe for any given flip is random; you can’t predict exactly which possibility will happen, only that there is an equal probability of heads or tails. After performing more and more flips, the total count of heads and tails should be very close to equal. See also Probabilities and Statistics.

System Prompt

A commonly-used, statically-coded part of the Context information added by an AI System to the Prompt before sending it to the LLM. System prompts are typically used to provide the model with overall guidance about the application’s purpose and how the LLM should respond. For example, it might include phrases like “You are a helpful software development assistant.”
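
For illustration, a sketch of how an AI System might combine a static system prompt, retrieved context, and the user’s query using the common chat-messages convention (the exact field names and format vary by model API):

```python
# Assemble the full prompt sent to the LLM from its parts.
system_prompt = "You are a helpful software development assistant."

def build_messages(user_prompt: str, retrieved_context: list[str]) -> list[dict]:
    context_block = "\n\n".join(retrieved_context)
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Context:\n{context_block}\n\nQuestion: {user_prompt}"},
    ]
```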

Teacher Model

A Generative AI Model that can be used as part of a Tuning (“teach”) process for other models, to generate synthetic data, to evaluate the quality of data, etc. These models are usually relatively large, sophisticated, and powerful, so they are very capable for these purposes, but they are often considered too costly to use as an application’s runtime model, where smaller, lower-overhead models are necessary. However, for software development purposes, less frequent use of teacher models is worth the higher cost for the services they provide.

Test

For our purposes, a Unit Test, Integration Test, or Acceptance Test.

Test Double

A test-only replacement for a Unit or a whole Component, usually because it has Side Effects and we need the Behavior to be Deterministic for the purposes of testing a dependent unit that uses it. For example, a function that queries a database can be replaced with a version that always returns a fixed value expected by the test. A mock is a popular kind of test double that uses the underlying runtime environment (e.g., the Python interpreter, the Java Virtual Machine - JVM) to intercept invocations of a unit and programmatically behave as desired by the tester.
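
A small sketch using Python’s unittest.mock, where a hypothetical repository that would normally query a database is replaced with a mock returning a fixed value:

```python
from unittest.mock import Mock

# Production code (sketch): depends on a repository that queries a database.
def describe_user(repo, user_id: int) -> str:
    user = repo.find(user_id)   # a database read in production: a side effect
    return f"{user['name']} <{user['email']}>"

# In the test, the repository is replaced with a test double that always
# returns a fixed value, making the test deterministic.
def test_describe_user():
    repo = Mock()
    repo.find.return_value = {"name": "Ada", "email": "ada@example.com"}

    assert describe_user(repo, 42) == "Ada <ada@example.com>"
    repo.find.assert_called_once_with(42)
```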

See also Test, Unit Test, Integration Test, and Acceptance Test.

Test-Driven Development

When adding a Feature to a code base using TDD, the tests are written before the code is written. A three-step “virtuous” cycle is used, where changes are made incrementally and iteratively using small steps, one at a time:

  1. Refactor the code to change its structure as required to support the new feature, using the existing automated Tests to verify that no regressions are introduced. For example, it might be necessary to introduce an abstraction to support two “choices” where previously only one choice existed.
  2. Write a Test for the new feature. This is primarily a design exercise, because thinking about testing makes you think about usability, Behavior, etc., even though you are also creating a reusable test that will become part of the Regression test suite. Note that the test suite will fail to run at the moment, because the code doesn’t yet exist to make it pass!
  3. Write the new feature to make the new test (as well as all previously written tests) pass.

TDD not only promotes iterative and incremental development, with a growing suite of tests resulting from the process, it effectively turns the writing of executable tests into a form of specification of the desired behavior, writing before the code is written to implement the specification. Behavior-Driven Development would take this idea to its logical conclusion, that tests are executable specifications.
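
A tiny illustration of steps 2 and 3 for a hypothetical slugify feature, written as a pytest-style test that is authored before the implementation:

```python
# Step 2: the test is written first; it fails until the feature exists.
def test_slugify_replaces_spaces_and_lowercases():
    assert slugify("Hello World") == "hello-world"

# Step 3: write just enough code to make this test, and all earlier tests, pass.
def slugify(title: str) -> str:
    return title.strip().lower().replace(" ", "-")
```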

The Wikipedia TDD article is a good place to start for more information.

See also Design by Contract, Specification-Driven Development, Behavior-Driven Development, and Property-Based Testing.

Token

For language Generative AI Models, the training texts and query prompts are split into tokens, usually whole words or word fragments, according to a vocabulary of tens of thousands of tokens that can include common single characters, multi-character sequences, and “control” tokens (like “end of input”). The rule of thumb is that a corpus will have roughly 1.5 times as many tokens as words.
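
For example, using OpenAI’s tiktoken library (one of several tokenizers; the vocabulary name below is an assumption) to see how a sentence splits into tokens:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("Testing generative AI systems requires new techniques.")
print(len(tokens))                           # number of tokens
print([enc.decode([t]) for t in tokens])     # the text of each individual token
```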

Training

In our context, training is the process used to teach a model, such as a Generative AI Model, how to do its intended job.

In the generative AI case, we often speak of pretraining, the training process that uses a massive data corpus to teach the model facts about the world, how to speak and understand human language, and some basic skills. However, the resulting model often does poorly on specialized tasks and even basic skills like following a user’s instructions, conforming to social norms (e.g., avoiding hate speech), etc.

That’s where a second Tuning phase comes in, a suite of processes used to improve the model’s performance on many general or specific skills.

Tuning

Tuning refers to one or more processes used to transform a Pretrained model into one that exhibits much better desired Behaviors (like instruction following) or specialized domain knowledge.

Unit

For our purposes, a unit is what a Unit Test exercises: the smallest granularity of functionality we care about. A unit can be a single Function that is being designed and written, but this may be happening in the larger context of a Component, such as a Class in an Object-Oriented Programming language or some other self-contained construct.

For simplicity, rather than say “unit and/or component” frequently in the text, we will often use just “component” as an umbrella term that could also mean either or both concepts, unless it is important to make finer distinctions.

Unit Benchmark

An adaptation of Benchmark tools and techniques for more fine-grained and targeted testing purposes, such as verifying Features and Use Cases work as designed. See the Unit Benchmarks chapter for details.

The same idea generalizes to the analogs of Integration Tests, namely Integration Benchmarks, and Acceptance Tests, namely Acceptance Benchmarks.

Use Case

A common term for an end-to-end user activity done with a system, often broken down into several Scenarios that describe different “paths” through the use case, including error scenarios, in addition to happy paths. Hence, scenarios would be the next level of granularity. Compare with Features, which would be the capabilities implemented one at a time to support the scenarios that make up a use case.

Unit Test

A test for a Unit that exercises its Behavior in isolation from all other Functions and State. When the unit being tested has Side Effects, because of other units it invokes, all such side effects must be replaced with Test Doubles to make the test Deterministic. Note that writing a unit test as part of Test-Driven Development inevitably begins with a Refactoring step to modify the code, while preserving the current behavior, so that it is better positioned to support implementing the new functionality.

See also Test, Unit Benchmark, Integration Test, Integration Benchmark, Acceptance Test, Acceptance Benchmark.

Vibe Coding

A term coined by Andrej Karpathy for just going with the code generated by an LLM, tweaking the prompt as needed to get the LLM to fix bugs and incorrect behavior. Hence, it’s a completely “non-engineered” approach to coding, which can work well for quick coding needs, especially for non-programmers, but generally is not sufficient for longer-term projects. Hence, the term has a slightly negative connotation for many people, as in “this is not a serious way to write software”. Contrast with Vibe Engineering.

Vibe Engineering

Simon Willison’s term, made half in jest, for a more engineering-oriented approach to Vibe Coding, which incorporates various engineering practices to ensure that quality and maintainability requirements can be met over the longer term. As such, this blog post is a good counterargument to those who believe that AI coding assistants are now sufficiently reliable and powerful to completely take over from humans.