Dataset Specification
Note: The specification documented here is the “V0.1.7” version of the criteria we believe are required for datasets cataloged by OTDI. We need and welcome your feedback! Either contact us or consider using pull requests with your suggestions. See the AI Alliance community page on contributing for more details.
Also contact us if you are interested in contributing a dataset but have questions or concerns about meeting the following specification.
About This Specification
The specification attempts to be minimally sufficient, to impose just enough constraints to meet our goals for cataloged datasets.
Sources and Inspirations
The details of the specification and how we are implementing it build on the prior and parallel work of several organizations:
- The metadata fields and concepts defined for Hugging Face Dataset Cards, with a few extensions and clarifications for our provenance and governance purposes.
- MLCommons Croissant for the metadata storage format. Croissant is an emerging de facto standard for metadata. It is used by Hugging Face and other dataset repositories for cataloging metadata and providing search capabilities.
- Some defined metadata fields are inspired by the Data Provenance Standard from the Data and Trust Alliance.
- The Stack dataset for the BigCode model project. See the dataset card.
- Common Crawl Foundation’s current work on provenance tracking, multilingual data, etc.
- Coalition for Secure AI has a work group on software supply chain security concerns.
The metadata are captured in the dataset card that every version of a dataset carries, including after various stages of processing.
Let’s begin.
Core Requirements
Ownership
First, to promote fully traceable provenance and governance for all data within the dataset, the contributor must affirm that they either (a) own the dataset or (b) have rights from the owner of the data that enable the dataset to be provided to anyone under the CDLA Permissive 2.0 license. For example, in case (b) the dataset owner has been granted permission by the source data owners to act on their behalf with respect to enabling others to use the data without restriction.
This provision is necessary because many datasets contain data that was obtained by crawling the web, which frequently has mixed provenance and licenses for use.
NOTE: One of the data processing pipelines we are building will carefully filter datasets for such crawled data to ensure our requirements are met for ownership, provenance, license for use, and quality. Until these tools are ready, we are limiting acceptance of crawled datasets.
Dataset Hosting
Almost all datasets we catalog will remain hosted by their owners, but the AI Alliance can host a dataset for you, when desired.
A Dataset Card
All useful datasets include metadata about their provenance, license(s), target uses, known limitations and risks, etc. To provide a uniform, standardized way of expressing this metadata, we require every dataset to have a dataset card (or data card) that follows the Hugging Face Dataset Card format, where the README.md file functions as the dataset card, with our refinements discussed below. This choice reflects the fact that most AI-centric datasets are already likely to be available on the Hugging Face Hub.
TIP: For a general introduction to Hugging Face datasets, see here.
Quick Steps to Create a Dataset Card
If you need to create a dataset card:
- Download our version of the Hugging Face dataset card template, datasetcard_otdi_template.md. (If you already have a card in Hugging Face, i.e., the README.md, compare our template to your card and add the new fields.)
- Edit the Markdown in the template file to provide the details, as described below.
- Create the card in the Hugging Face UI (or edit your existing card).
- Fill in the metadata fields shown in their editor UI. (See Table 1 below.)
- Paste the rest of your prepared Markdown into the file, after the YAML block delimited by ---.
- Commit your changes.
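To make the layout of the card file concrete, here is a minimal Python sketch that separates the YAML block from the Markdown body of a README.md-style card. The function name `split_dataset_card` and the sample card contents are hypothetical, chosen only for illustration; the sketch assumes the YAML block opens on the first line and is closed by a second `---` line.

```python
def split_dataset_card(text: str) -> tuple[str, str]:
    """Return (yaml_block, markdown_body) for a README.md-style dataset card."""
    lines = text.splitlines()
    # The metadata block is assumed to open on the very first line with '---'.
    if not lines or lines[0].strip() != "---":
        return "", text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            yaml_block = "\n".join(lines[1:i])
            body = "\n".join(lines[i + 1:])
            return yaml_block, body
    # No closing delimiter was found: treat the whole file as Markdown body.
    return "", text


# Hypothetical card contents, for illustration only.
card = """---
license: cdla-permissive-2.0
tags:
- example
---
# Dataset Card for Example

Describe the dataset here.
"""

yaml_block, body = split_dataset_card(card)
print(yaml_block)
print(body)
```

A check like this can help confirm that the prepared Markdown was pasted after the closing delimiter rather than inside the metadata block.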
Required Metadata Fields
Refer to the datasetcard.md for details about the metadata fields Hugging Face recommends for inclusion in a YAML block at the top of the README.md. We comment on these fields below, in Table 1.
The templates/README_guide.md provides additional information about the template fields in their Markdown template file, datasetcard_template.md in the huggingface-hub GitHub repo. However, we recommend that you use our extended version: datasetcard_otdi_template.md.
YAML Metadata Block
TIP: The following tables are long, but starting with the datasetcard_template.md and the dataset card process will handle most of the details. Then you can add the additional fields requested in Table 2.
Table 1 lists all the fields in the dataset card YAML block. In the Required or Disallowed? column, ✔ indicates a field we require, ❌ indicates a field we don’t allow (because it is incompatible with this project), and a blank entry indicates an optional field.
Table 1: Hugging Face datacard YAML metadata block.
NOTES: Some additional points about several of the fields:
1. For the license and related fields, see the list of permissive licenses we accept in the Catalog index page. Use the names shown there, or for consistency with Hugging Face dataset card conventions, consider using their names for these licenses, when different.
   - Our recommended licenses:
     - For datasets: Community Data License Agreement – Permissive, Version 2.0
     - For source code: Apache 2.0
     - For documentation: The Creative Commons License, Version 4.0.
   - See the Alliance community/CONTRIBUTING page for more details about licenses.
2. For tags values, there is no industry-standard list of values, but the Data Classification section in the References lists some well-known taxonomies to consider using.
3. For task_categories values, see the Appendix below.
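As a rough illustration of the license note, the following Python sketch checks whether a card's parsed metadata names one of the recommended licenses. The identifier spellings (`cdla-permissive-2.0`, `apache-2.0`) are assumed to match the Hugging Face license names, and the helper `uses_recommended_license` is hypothetical; confirm the accepted names against the Catalog index page.

```python
# Assumed Hugging Face-style identifiers for two of the recommended licenses.
# Verify these spellings against the Catalog index page before relying on them.
RECOMMENDED_LICENSES = {
    "dataset": "cdla-permissive-2.0",
    "code": "apache-2.0",
}


def uses_recommended_license(metadata: dict, kind: str = "dataset") -> bool:
    """Return True if the card's 'license' field names the recommended license."""
    value = metadata.get("license")
    # The field may hold a single string or a list of strings.
    values = [value] if isinstance(value, str) else list(value or [])
    return RECOMMENDED_LICENSES[kind] in [v.lower() for v in values]


print(uses_recommended_license({"license": "cdla-permissive-2.0"}))
print(uses_recommended_license({"license": ["mit"]}))
```

Other permissive licenses may also be acceptable, so a `False` result here means only that the recommended license was not used.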
The Markdown Content in the Dataset Card
Table 2 lists content that we require or recommend in the Markdown body of the dataset card, below the YAML header block. The Source column in the table contains the following:
- “HF” for fields in the Hugging Face datasetcard_template.md. See the README_guide.md for descriptions of many of these fields.
- “OTDI” for additional fields we believe are necessary.
Table 2: Additional markdown metadata content in the dataset card (README.md).
For the personal_and_sensitive_information field, we recommend using one or more of the following values:
- Personal Information (PI)/Demographic
- Payment Card Industry (PCI)
- Personal Financial Information (PFI)
- Personally Identifiable Information (PII)
- Personal Health Information (PHI)
- Sensitive Personal Information (SPI)
- Other (please specify)
- None
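A small Python sketch of how these recommended values could be checked when reviewing a card. The helper name `unrecognized_values` is hypothetical, and because the values are recommendations rather than a closed vocabulary, an unrecognized entry is a prompt for review, not an error.

```python
# The recommended values, copied from the list above.
RECOMMENDED_VALUES = (
    "Personal Information (PI)/Demographic",
    "Payment Card Industry (PCI)",
    "Personal Financial Information (PFI)",
    "Personally Identifiable Information (PII)",
    "Personal Health Information (PHI)",
    "Sensitive Personal Information (SPI)",
    "Other (please specify)",
    "None",
)


def unrecognized_values(values: list[str]) -> list[str]:
    """Return the entries that are not on the recommended list."""
    return [v for v in values if v not in RECOMMENDED_VALUES]


print(unrecognized_values(["Personal Health Information (PHI)", "Biometrics"]))
```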
Other Considerations for the Data Itself
The dataset card template has sections for all the required and optional metadata. This section discusses the data in the dataset.
Formats
We endeavor to be flexible on dataset file formats and how they are organized. For text, we recommend formats like CSV, JSON, Parquet, ORC, and Avro. Supporting PDFs, where extraction will be necessary, can be difficult, but not impossible.
NOTE: Using Parquet has the benefit that MLCommons Croissant can be used to automatically extract some metadata. See this Hugging Face page and the mlcroissant library, which supports loading a dataset using the Croissant metadata.
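The following Python sketch shows one way to load records through Croissant metadata with mlcroissant. It assumes the dataset is hosted on the Hugging Face Hub, that the Hub serves Croissant JSON-LD at the `/api/datasets/<repo>/croissant` endpoint, and that the record set name you pass exists in that metadata; the helper names `croissant_url` and `first_records` are hypothetical.

```python
import itertools


def croissant_url(repo_id: str) -> str:
    """Build the Hub endpoint assumed to serve Croissant JSON-LD for a dataset."""
    return f"https://huggingface.co/api/datasets/{repo_id}/croissant"


def first_records(repo_id: str, record_set: str, limit: int = 3) -> list:
    """Stream the first few records of one record set via mlcroissant."""
    # Imported lazily so this module loads even when mlcroissant is absent.
    import mlcroissant as mlc  # third-party: pip install mlcroissant

    dataset = mlc.Dataset(jsonld=croissant_url(repo_id))
    return list(itertools.islice(dataset.records(record_set=record_set), limit))


print(croissant_url("owner/dataset-name"))
```

Calling `first_records` requires network access and an installed mlcroissant package. Record set names vary by dataset, so inspect the JSON-LD first to find the one you need.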
Diverse Datasets
Diverse datasets are desired for creating a variety of AI models and applications with special capabilities.
We are particularly interested in new datasets that can be used to train and tune models to excel in particular domains, or support them through design patterns like RAG and Agents. See What Kinds of Datasets Do We Want? for more information.
Use the tags metadata field discussed above to indicate this information, when applicable.
Derived or Synthetic Dataset Specification
Every dataset that is derived or synthesized via a processing pipeline from one or more other datasets or models requires its own dataset card, which must reference all upstream datasets and models that feed into it (and by extension, their dataset and model cards of metadata).
For example, suppose a derived dataset is the filtered output of one or more raw (defined below) datasets, where deduplication and offensive content removal were performed. The new dataset may now support different recommended uses (e.g., it is now more suitable for model training or more useful for a specific domain) and have different bias_risks_limitations, and it will need to identify the upstream (ancestor) source_datasets.
Suppose a new version of an existing dataset is created, where data is added or removed, but no other changes occur. It also needs a new dataset card, even though most of the metadata will be unchanged.
Finally, what if several datasets are used to derive a new dataset and these upstream data sources have different licenses? What if synthetic data is generated using a model? The “most restrictive” upstream license must be used or a suitable alternative. For example, if one upstream source is not permissively licensed, the data from it in the derived dataset can’t be “made” permissive by using a more permissive license. The whole derived dataset must use the most restrictive license attached to the upstream datasets. Similarly, a synthetic dataset generated from a model has to be licensed in accordance with the terms of use for the model. Some commercial models don’t allow generated content to be used in permissively-licensed datasets, for example.
NOTE: The derived dataset license must match the “most restrictive” upstream license or a similarly-restrictive alternative must be used. For synthetic data generated by a model, the terms of service for the model must be supported by the new dataset’s license.
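The “most restrictive” rule can be sketched in Python as follows. The `RESTRICTIVENESS` ranking is entirely hypothetical and exists only to make the selection logic runnable; real licenses do not fall on a single linear scale, so an actual determination needs review of the license terms.

```python
# Hypothetical ranking for illustration: higher numbers mean more restrictive.
RESTRICTIVENESS = {
    "cdla-permissive-2.0": 0,
    "apache-2.0": 0,
    "cc-by-sa-4.0": 1,
    "cc-by-nc-4.0": 2,
}


def most_restrictive(upstream_licenses: list[str]) -> str:
    """Pick the license a derived dataset must at least match."""
    if not upstream_licenses:
        raise ValueError("A derived dataset needs at least one upstream license.")
    unknown = [name for name in upstream_licenses if name not in RESTRICTIVENESS]
    if unknown:
        # Refuse to guess: unranked licenses need a manual decision.
        raise ValueError(f"No ranking for: {unknown}")
    return max(upstream_licenses, key=RESTRICTIVENESS.__getitem__)


print(most_restrictive(["cdla-permissive-2.0", "cc-by-nc-4.0"]))
```

Raising an error for unranked licenses, instead of defaulting to permissive, keeps the sketch from silently “making” restricted data permissive.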
Table 3 lists the minimum set of metadata fields that must change in a derived dataset:
Table 3: Minimum required dataset card changes for a derived dataset.
Categories of Dataset Transformations
At this time, we use the following concepts for original and derived datasets, concerning levels of quality and cleanliness. This list corresponds to stages in our ingestion process and subsequent possible derivations of datasets. This list is subject to change.
- Raw: A dataset as it is discovered, validated, and cataloged. For all datasets, our most important concern is unambiguous provenance and clear openness. Raw datasets may go through filtering and analysis to remove potentially objectionable content.
- Filtered: A raw dataset that has gone through a processing pipeline to make it more suitable for specific purposes. This might include removing duplicate records, filtering out unacceptable content (e.g., hate speech, PII), or filtering for domain-specific content. Since the presence of some content in the raw data, such as some forms of PII and confidential information, could have legal implications for OTDI, we may decline to catalog an otherwise “good” raw dataset and only catalog a suitable filtered dataset.
- Structured: A filtered dataset that has also been reformatted to be most suitable for some AI purpose, such as model training, RAG, etc. For example, PDFs are more convenient to use when converted to JSON or YAML.
- Derived: Any dataset created from one or more other datasets. Filtered and structured datasets are derived datasets.
See How We Process Datasets for more details on these levels and how we process datasets.
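One way to model these categories in code is a small Python enumeration. The names `Stage` and `is_derived` are hypothetical, and since the list above is subject to change, treat this as a sketch of the current definitions only.

```python
from enum import Enum


class Stage(Enum):
    """Quality and cleanliness levels, following the list above."""
    RAW = "raw"
    FILTERED = "filtered"
    STRUCTURED = "structured"
    DERIVED = "derived"


def is_derived(stage: Stage) -> bool:
    """Filtered and structured datasets count as derived; only raw does not."""
    return stage is not Stage.RAW


for stage in Stage:
    print(stage.value, is_derived(stage))
```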
After you have prepared or updated the dataset card as required, we will automatically pick up the changes from Hugging Face. If you are not hosting your dataset there, then contribute your dataset.
Appendix: Task Categories
The task_categories field in Table 1 above recommends using the “types” in this list in Hugging Face source code. For convenience, here is the same list, as of November 2025:
Here we group the task types by modality (e.g., nlp). Some tasks have defined subtask types, which are listed with them. If no subtasks are shown, none are defined for the task type.
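As a quick illustration of how the task list can be used, this Python sketch flags task_categories values that are not recognized task types. The `KNOWN_TASKS` set is a deliberately partial sample of task types from the tables in this appendix, and the helper `unknown_task_categories` is hypothetical; a real check would load the full list.

```python
# Partial sample of task types from this appendix, for illustration only.
KNOWN_TASKS = {
    "text-classification",
    "question-answering",
    "translation",
    "summarization",
    "text-generation",
    "image-classification",
    "object-detection",
    "tabular-classification",
}


def unknown_task_categories(metadata: dict) -> list[str]:
    """Return task_categories entries that are not in KNOWN_TASKS."""
    return [t for t in metadata.get("task_categories", []) if t not in KNOWN_TASKS]


print(unknown_task_categories({"task_categories": ["translation", "poetry"]}))
```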
Natural Language Processing - nlp
Table 4 lists tasks and subtasks related to natural language processing (nlp).
| Task Type | Subtask Type | Name |
|---|---|---|
| text-classification | | Text Classification |
| | acceptability-classification | Acceptability Classification |
| | entity-linking-classification | Entity Linking Classification |
| | fact-checking | Fact Checking |
| | intent-classification | Intent Classification |
| | language-identification | Language Identification |
| | multi-class-classification | Multi Class Classification |
| | multi-label-classification | Multi Label Classification |
| | multi-input-text-classification | Multi-input Text Classification |
| | natural-language-inference | Natural Language Inference |
| | semantic-similarity-classification | Semantic Similarity Classification |
| | sentiment-classification | Sentiment Classification |
| | topic-classification | Topic Classification |
| | semantic-similarity-scoring | Semantic Similarity Scoring |
| | sentiment-scoring | Sentiment Scoring |
| | sentiment-analysis | Sentiment Analysis |
| | hate-speech-detection | Hate Speech Detection |
| | text-scoring | Text Scoring |
| token-classification | | Token Classification |
| | named-entity-recognition | Named Entity Recognition |
| | part-of-speech | Part of Speech |
| | parsing | Parsing |
| | lemmatization | Lemmatization |
| | word-sense-disambiguation | Word Sense Disambiguation |
| | coreference-resolution | Coreference-resolution |
| table-question-answering | | Table Question Answering |
| question-answering | | Question Answering |
| | extractive-qa | Extractive QA |
| | open-domain-qa | Open Domain QA |
| | closed-domain-qa | Closed Domain QA |
| zero-shot-classification | | Zero-Shot Classification |
| translation | | Translation |
| summarization | | Summarization |
| | news-articles-summarization | News Articles Summarization |
| | news-articles-headline-generation | News Articles Headline Generation |
| feature-extraction | | Feature Extraction |
| text-generation | | Text Generation |
| | dialogue-modeling | Dialogue Modeling |
| | dialogue-generation | Dialogue Generation |
| | conversational | Conversational |
| | language-modeling | Language Modeling |
| | text-simplification | Text Simplification |
| | explanation-generation | Explanation Generation |
| | abstractive-qa | Abstractive QA |
| | open-domain-abstractive-qa | Open Domain Abstractive QA |
| | closed-domain-qa | Closed Domain QA |
| | open-book-qa | Open Book QA |
| | closed-book-qa | Closed Book QA |
| text2text-generation | | Text2Text Generation |
| fill-mask | | Fill Mask |
| | slot-filling | Slot Filling |
| | masked-language-modeling | Masked Language Modeling |
| table-to-text | | Table to Text |
| multiple-choice | | Multiple Choice |
| | multiple-choice-qa | Multiple Choice QA |
| | multiple-choice-coreference-resolution | Multiple Choice Coreference Resolution |
| text-ranking | | Text Ranking |
| text-retrieval | | Text Retrieval |
| | document-retrieval | Document Retrieval |
| | utterance-retrieval | Utterance Retrieval |
| | entity-linking-retrieval | Entity Linking Retrieval |
| | fact-checking-retrieval | Fact Checking Retrieval |
Table 4: Tasks and subtasks related to natural language processing (nlp).
Audio - audio
Table 5 lists tasks and subtasks related to audio processing (audio).
| Task Type | Subtask Type | Name |
|---|---|---|
| sentence-similarity | | Sentence Similarity |
| text-to-speech | | Text-to-Speech |
| text-to-audio | | Text-to-Audio |
| automatic-speech-recognition | | Automatic Speech Recognition |
| audio-to-audio | | Audio-to-Audio |
| audio-classification | | Audio Classification |
| | keyword-spotting | Keyword Spotting |
| | speaker-identification | Speaker Identification |
| | audio-intent-classification | Audio Intent Classification |
| | audio-emotion-recognition | Audio Emotion Recognition |
| | audio-language-identification | Audio Language Identification |
| voice-activity-detection | | Voice Activity Detection |
Table 5: Tasks and subtasks related to audio processing (audio).
Multimodal - multimodal
For visual-question-answering and document-question-answering, the Hugging Face source file lists each as its own subtask, which looks like a data error, but we show it for consistency.
Table 6 lists tasks and subtasks related to multimodal processing:
| Task Type | Subtask Type | Name |
|---|---|---|
| audio-text-to-text | | Audio-Text-to-Text |
| image-text-to-text | | Image-Text-to-Text |
| visual-question-answering | | Visual Question Answering |
| | visual-question-answering | Visual Question Answering |
| document-question-answering | | Document Question Answering |
| | document-question-answering | Document Question Answering |
| video-text-to-text | | Video-Text-to-Text |
| visual-document-retrieval | | Visual Document Retrieval |
| any-to-any | | Any-to-Any |
Table 6: Tasks and subtasks related to multimodal processing.
Computer Vision - cv
Table 7 lists tasks and subtasks related to computer vision (cv):
| Task Type | Subtask Type | Name |
|---|---|---|
| depth-estimation | | Depth Estimation |
| image-classification | | Image Classification |
| | multi-label-image-classification | Multi Label Image Classification |
| | multi-class-image-classification | Multi Class Image Classification |
| object-detection | | Object Detection |
| | face-detection | Face Detection |
| | vehicle-detection | Vehicle Detection |
| image-segmentation | | Image Segmentation |
| | instance-segmentation | Instance Segmentation |
| | semantic-segmentation | Semantic Segmentation |
| | panoptic-segmentation | Panoptic Segmentation |
| text-to-image | | Text-to-Image |
| image-to-text | | Image-to-Text |
| | image-captioning | Image Captioning |
| image-to-image | | Image-to-Image |
| | image-inpainting | Image Inpainting |
| | image-colorization | Image Colorization |
| | super-resolution | Super Resolution |
| image-to-video | | Image-to-Video |
| unconditional-image-generation | | Unconditional Image Generation |
| video-classification | | Video Classification |
| text-to-video | | Text-to-Video |
| zero-shot-image-classification | | Zero-Shot Image Classification |
| mask-generation | | Mask Generation |
| zero-shot-object-detection | | Zero-Shot Object Detection |
| text-to-3d | | Text-to-3D |
| image-to-3d | | Image-to-3D |
| image-feature-extraction | | Image Feature Extraction |
| keypoint-detection | | Keypoint Detection |
| | pose-estimation | Pose Estimation |
| video-to-video | | Video-to-Video |
Table 7: Tasks and subtasks related to computer vision (cv).
Reinforcement Learning - rl
Table 8 lists tasks and subtasks related to reinforcement learning (rl):
| Task Type | Subtask Type | Name |
|---|---|---|
| reinforcement-learning | | Reinforcement Learning |
| robotics | | Robotics |
| | grasping | Grasping |
| | task-planning | Task Planning |
Table 8: Tasks and subtasks related to reinforcement learning (rl).
Tabular - tabular
Table 9 lists tasks and subtasks related to tabular data processing:
| Task Type | Subtask Type | Name |
|---|---|---|
| tabular-classification | | Tabular Classification |
| | tabular-multi-class-classification | Tabular Multi Class Classification |
| | tabular-multi-label-classification | Tabular Multi Label Classification |
| tabular-regression | | Tabular Regression |
| | tabular-single-column-regression | Tabular Single Column Regression |
| tabular-to-text | | Tabular to Text |
| | rdf-to-text | RDF to text |
| time-series-forecasting | | Time Series Forecasting |
| | univariate-time-series-forecasting | Univariate Time Series Forecasting |
| | multivariate-time-series-forecasting | Multivariate Time Series Forecasting |
Table 9: Tasks and subtasks related to tabular data processing.
Other - other
Table 10 lists other special-case tasks and subtasks that don’t fit in the other modality categories.
| Task Type | Subtask Type | Name |
|---|---|---|
| graph-ml | | Graph Machine Learning |
| other | | Other |
Table 10: Other special-case tasks and subtasks that don’t fit in the other modality categories.
