
Dataset Specification
Note: The specification documented here is the “V0.1.5” version of the criteria we believe are required for datasets cataloged by OTDI. We need and welcome your feedback! Either contact us or consider using pull requests with your suggestions. See the AI Alliance community page on contributing for more details.
Also contact us if you are interested in contributing a dataset but have any questions or concerns about meeting the following specification.
About This Specification
The specification aims to be minimally sufficient: it imposes just enough constraints to meet our goals for cataloged datasets.
Sources and Inspirations
The details of the specification and how we are implementing it build on the prior and parallel work of several organizations:
- The metadata fields and concepts defined for Hugging Face Dataset Cards, with a few extensions and clarifications for our provenance and governance purposes.
- MLCommons Croissant for the metadata storage format. Croissant is an emerging de facto standard for metadata. It is used by Hugging Face and other dataset repositories for cataloging metadata and providing search capabilities.
- Some defined metadata fields are inspired by the Data Provenance Standard from the Data and Trust Alliance.
- The Stack dataset for the BigCode model project. See the dataset card.
- Common Crawl Foundation’s current work on provenance tracking, multilingual data, etc.
- The Coalition for Secure AI’s work group on software supply chain security concerns.
The metadata are captured in the dataset card that every version of a dataset carries, including after various stages of processing.
Let’s begin.
Core Requirements
Ownership
First, to promote fully-traceable provenance and governance for all data within the dataset, the owner must affirm that they either (a) own the dataset or (b) have rights from the owner of the data that enable the dataset to be provided to anyone under the CDLA Permissive 2.0 license. For example, the dataset owner has been granted permission by the source data owners to act on their behalf with respect to enabling others to use the data without restriction.
This provision is necessary because many datasets contain data that was obtained by crawling the web, which frequently has mixed provenance and licenses for use.
NOTE: One of the data processing pipelines we are building will carefully filter datasets for such crawled data to ensure our requirements are met for ownership, provenance, license for use, and quality. Until these tools are ready, we are limiting acceptance of crawled datasets.
Dataset Hosting
Almost all datasets we catalog will remain hosted by their owners, but the AI Alliance can host a dataset for you, when desired.
A Dataset Card
All useful datasets include metadata about their provenance, license(s), target uses, known limitations and risks, etc. To provide a uniform, standardized way of expressing this metadata, we require every dataset to have a dataset card (or data card) that follows the Hugging Face Dataset Card format, where the `README.md` file functions as the dataset card, with our refinements discussed below. This choice reflects the fact that most AI-centric datasets are already likely to be available on the Hugging Face Hub.
TIP: For a general introduction to Hugging Face datasets, see here.
Quick Steps to Create a Dataset Card
If you need to create a dataset card:
- Download our version of the Hugging Face dataset card template, `datasetcard_otdi_template.md`. (If you already have a card in Hugging Face, i.e., the `README.md`, compare our template to your card and add the new fields.)
- Edit the Markdown in the template file to provide the details, as described below.
- Create the card in the Hugging Face UI (or edit your existing card).
- Fill in the metadata fields shown in their editor UI. (See Table 1 below.)
- Paste the rest of your prepared Markdown into the file, after the YAML block delimited by `---`. (A sketch of the resulting file structure appears below.)
- Commit your changes.
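For orientation, here is a minimal sketch (with purely illustrative values) of what the finished `README.md` looks like after these steps: a YAML metadata block delimited by `---`, followed by the Markdown body.

```markdown
---
# YAML metadata block: see Table 1 for the required fields.
license: cdla-permissive-2.0
pretty_name: Example Dataset    # illustrative name only
---

# Dataset Card for Example Dataset

<!-- The rest of your prepared Markdown goes here: see Table 2 for the required content. -->
A concise summary of the dataset and its purpose.
```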
Required Metadata Fields
Refer to the `datasetcard.md` for details about the metadata fields Hugging Face recommends for inclusion in a YAML block at the top of the `README.md`. We comment on these fields below, in Table 1.

The `templates/README_guide.md` provides additional information about the template fields in their Markdown template file, `datasetcard_template.md`, in the `huggingface-hub` GitHub repo. However, we recommend that you use our extended version: `datasetcard_otdi_template.md`.
YAML Metadata Block
TIP: The following tables are long, but starting with `datasetcard_template.md` and following the dataset card process will handle most of the details. Then you can add the additional fields requested in Table 2, those marked with “OTDI”.
Table 1 lists all the fields in the dataset card YAML block. In the Required? column, ☑ indicates a field we require, ☒ indicates a field we don’t allow because it is incompatible with this project, and a blank entry indicates the field is optional.
| Field Name | Description | Required? |
|---|---|---|
| `license` | We strongly recommend `cdla-permissive-2.0` for the Community Data License Agreement – Permissive, Version 2.0, and may require it in the future.[^1] Use these names for licenses. Also covers the OTDI *License to use* field. | ☑ |
| `license_name` | E.g., Community Data License Agreement – Permissive, Version 2.0. | ☑ |
| `license_link` | E.g., `LICENSE` or `LICENSE.md` in the same repo, or a URL to another location. | ☑ |
| `license_details` | Not needed if you use a standard license. | |
| `tags` | Useful for indicating target areas for searches, like `chemistry`, `synthetic`, etc. See also `task_categories`. Where applicable, we recommend that you use the categories described below in Diverse Datasets. | |
| `annotations_creators` | If appropriate. Examples: `crowdsourced`, `found`, `expert-generated`, `machine-generated` (e.g., using LLMs as judges). | |
| `language_creators` | If appropriate. Examples: `crowdsourced`, `found`, `expert-generated`, `machine-generated` (i.e., synthetic data). | |
| `language_details` | One or more of, for example, `en-US`, `fr-FR`, etc. | ☑ |
| `pretty_name` | E.g., Common Chemistry. This is equivalent to the Dataset title/name field in the Data Provenance Standard (OTDI). | ☑ |
| `size_categories` | E.g., `n<1K`, `100K<n<1M`. | |
| `source_datasets` | A YAML list; zero or more. Recall our emphasis on provenance. This list is very important, as each source must meet our provenance standards. See also the discussions below. | ☑ |
| `task_categories` | A YAML list; one or more from the list in this code. | ☑ |
| `task_ids` | A YAML list; “a unique identifier in the format `lbpp/{idx}`, consistent with HumanEval and MBPP” from here. See also examples here. | |
| `paperswithcode_id` | Dataset id on PapersWithCode (from the URL). | |
| `configs` | Can be used to pass additional parameters to the dataset loader, such as `data_files`, `data_dir`, and any builder-specific parameters. | |
| `config_name` | One or more dataset subsets, if applicable. See the example in `datasetcard.md` and the discussions here and here. | |
| `dataset_info` | Can be used to store the feature types and sizes of the dataset to be used in Python. See the discussion in `datasetcard.md`. Also covers the OTDI Data format field. | |
| `extra_gated_fields` | Used for protected datasets and hence incompatible with the goals of OTDI. | ☒ |
| `train-eval-index` | Add this if you want to encode train and evaluation info in a structured way for AutoTrain or Evaluation on the Hub. See the discussion in `datasetcard.md`. | |

Table 1: Hugging Face Dataset Card Metadata
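To make the required fields concrete, here is a minimal sketch of a YAML metadata block that satisfies the ☑ entries in Table 1. The dataset name, language, tags, and task category are illustrative placeholders, not prescribed values, and `language_details` is shown as a list on the assumption that multiple locales may apply.

```markdown
---
license: cdla-permissive-2.0
license_name: Community Data License Agreement - Permissive, Version 2.0
license_link: LICENSE.md
language_details:
  - en-US
pretty_name: Example Chemistry Corpus   # illustrative name only
tags:                                   # optional, but recommended for discoverability
  - chemistry
source_datasets: []                     # empty list: this dataset is not derived from others
task_categories:
  - text-generation
---
```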
The Markdown Content in the Dataset Card
Our second table lists content that we require or recommend in the Markdown body of the dataset card, below the YAML header block. The Source column in the table contains the following:
- “HF” for fields in the Hugging Face `datasetcard_template.md`. See the `README_guide.md` for descriptions of many of these fields.
- “OTDI” for additional fields we believe are necessary.
| Field Name | Description | Required? | Source |
|---|---|---|---|
| `standards_version_used` | A schema version. Standard schemas are not currently specified and TBD. | | OTDI |
| `unique_metadata_identifier` | A UUID that is globally unique. Derived datasets must have their own UUIDs. The UUID is very useful for unambiguous lineage tracking, which is why we require it. | ☑ | OTDI |
| `dataset_summary` | A concise summary of the dataset and its purpose. | ☑ | HF |
| `dataset_description` | Describe the contents, scope, and purpose of the dataset, which helps users understand what the data represents, how it was collected, and any limitations or recommended uses. However, this field should not include redundant information covered elsewhere. | ☑ | HF |
| `curated_by` | One or more legal entities responsible for creating the dataset, providing accountability and a point of contact for inquiries. See also `dataset_card_authors` below. | ☑ | HF |
| `signed_by` | A legal review process has determined the dataset is free of any license or governance concerns, and is therefore potentially more trustworthy. The entities that performed the review are listed. (This is not yet required, but is under consideration.) | | OTDI |
| `dataset_sources` | HF template section (from `datasetcard_template.md`). Complements the information provided above for `source_datasets`. The Repository URL “subfield” is required for each source dataset, unless it was provided by `source_datasets` in Table 1. The Paper and Demo subfields are optional. See also `source_data` and `source_metadata_for_dataset` next. | ☑ | HF |
| `source_data` | HF template section. Use the subsections described next, `data_collection_and_processing_section` and `source_data_producers_section`, to describe important provenance information. Is the data synthetic or not? Note our specification above that you can only submit datasets where you have the necessary rights (see also `consent_documentation_location` below). | ☑ | HF |
| `data_collection_and_processing_section` | HF template section. Describes the data collection and processing, such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. | ☑ | HF |
| `source_data_producers_section` | HF template section. Describes the people or systems who originally created the data. It should also include self-reported demographic or identity information for the source data creators, if this information is available. | ☑ | HF |
| `source_metadata_for_dataset` | Additional content for `source_data`; if the corresponding metadata for any dataset is not part of that dataset, then it must be explicitly linked here. This information is necessary for lineage tracking, part of our provenance objectives. Marked required, but if all metadata is part of all datasets (e.g., in `README.md` dataset cards), then this field can be omitted. | ☑ | OTDI |
| `consent_documentation_location` | “Specifies where consent documentation or agreements related to the data can be found, which help enable legal compliance and regulatory use.” Required for third-party datasets you are contributing. | ☑ | OTDI |
| `data_origin_geography` | “The geographical location where the data was originally collected, which can be important for compliance with regional laws and understanding the data’s context.” Required if restrictions apply. | | OTDI |
| `data_processing_geography_inclusion_exclusion` | “Defines the geographical boundaries within which the data can or cannot be processed, often for legal or regulatory reasons.” Required if restrictions apply. | | OTDI |
| `data_storage_geography_inclusion_exclusion` | “Specifies where the data is stored and any geographical restrictions on storage locations, crucial for compliance with data sovereignty laws.” Required if restrictions apply. | | OTDI |
| `uses` | See the HF template section. Optional, but useful for describing Direct Use (field name: `direct_use`) and Out-of-Scope Use (field name: `out_of_scope_use`) for the dataset. Consider structuring the Direct Use as described in the Supported Tasks and Leaderboards section in the `templates/README_guide.md`. | | HF |
| `annotations` | HF template section. Add any additional information for the `annotations_creators` above, if any. Subsections are `annotation_process_section` and `who_are_annotators_section`. | | HF |
| `annotation_process_section` | HF template section. Describes the annotation process, such as the annotation tools used, the amount of data annotated, annotation guidelines provided to the annotators, inter-annotator statistics, annotation validation, etc. | | HF |
| `who_are_annotators_section` | HF template section. Describes the people or systems who created the annotations. | | HF |
| `bias_risks_limitations` | HF template section. While provenance and governance are the top priorities for OTDI, we also want to communicate to potential users what risks they need to understand about our cataloged datasets. Therefore, we require any information you can provide in this section, along with the Recommendations subsection for mitigations, if known. | ☑ | HF |
| `personal_and_sensitive_information` | State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). Consider using one or more of the values listed below, after this table. If efforts were made to anonymize the data, describe the anonymization process and also fill in `use_of_privacy_enhancing_technologies_pets`. | ☑ | HF, OTDI |
| `use_of_privacy_enhancing_technologies_pets` | “Indicates whether techniques were used to protect personally identifiable information (PII) or sensitive personal information (SPI), highlighting the dataset’s privacy considerations.” | ☑ | OTDI |
| `citation` | HF template section. A place to add BibTeX (field name: `citation_bibtex`) and APA (field name: `citation_apa`) citations. | | HF |
| `glossary` | HF template section. Define useful terms. | | HF |
| `dataset_card_authors` | HF template section. We need to know the authors. | ☑ | HF |
| `dataset_card_contact` | HF template section. We need to know whom to contact when needed. Okay to leave blank if the authors’ contact information is provided. | | HF |
| `dataset_issue_date` | When the dataset was compiled or created. (New versions require new dataset cards.) Recommended format: `YYYY-MM-DDTHH:MM:SS`. | ☑ | OTDI |
| `date_previously_issued_version_dataset` | Timestamp for the previously issued version of the dataset, if applicable. Redundant with other traceability tools, so could be omitted. | | OTDI |
| `range_dates_data_generation` | The span of time during which the data within the dataset was collected or generated, offering insight into the dataset’s timeliness and relevance. | ☑ | OTDI |
Table 2: Additional Content for the Dataset Card (`README.md`)
For the `personal_and_sensitive_information` field, we recommend using one or more of the following values:

- Personal Information (PI)/Demographic
- Payment Card Industry (PCI)
- Personal Financial Information (PFI)
- Personally Identifiable Information (PII)
- Personal Health Information (PHI)
- Sensitive Personal Information (SPI)
- Other (please specify)
- None
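The sketch below suggests how a few of the Table 2 entries might appear in the Markdown body. The section headings approximate the Hugging Face template; the placement and labels shown for the OTDI-specific fields (`unique_metadata_identifier`, `dataset_issue_date`) and for `personal_and_sensitive_information` are assumptions for illustration only, since `datasetcard_otdi_template.md` defines the actual layout.

```markdown
# Dataset Card for Example Chemistry Corpus

### Dataset Description

A concise summary of the dataset, its scope, and its intended purpose.

<!-- OTDI-specific fields; labels and placement are illustrative. -->
- **Unique metadata identifier:** 123e4567-e89b-12d3-a456-426614174000
- **Dataset issue date:** 2025-01-15T00:00:00

#### Personal and Sensitive Information

None

## Bias, Risks, and Limitations

Known limitations of the data, with recommended mitigations where known.
```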
Other Considerations for the Data Itself
The dataset card template has sections for all the required and optional metadata. This section discusses the data in the dataset.
Formats
We endeavor to be flexible about dataset file formats and how they are organized. For text, we recommend formats like CSV, JSON, Parquet, ORC, and Avro. Supporting PDFs, where text extraction is necessary, can be difficult, but not impossible.
NOTE: Using Parquet has the benefit that MLCommons Croissant can be used to automatically extract some metadata. See this Hugging Face page and the `mlcroissant` library, which supports loading a dataset using the Croissant metadata.
Diverse Datasets
Diverse datasets are desired for creating a variety of AI models and applications with special capabilities.
We are particularly interested in new datasets that can be used to train and tune models to excel in particular domains, or support them through design patterns like RAG and Agents. See What Kinds of Datasets Do We Want? for more information.
Use the `tags` metadata field discussed above to indicate this information, when applicable.
Derived Dataset Specification
Every dataset that is derived via a processing pipeline from one or more other datasets requires its own dataset card, which must reference all upstream datasets that feed into it (and by extension, their dataset cards of metadata).
For example, when a derived dataset is the filtered output of one or more raw datasets (defined below), where duplicate records and offensive content were removed, the new dataset may now support different recommended uses (i.e., it is now more suitable for model training or more useful for a specific domain), have different `bias_risks_limitations`, and it will need to identify the upstream (ancestor) `source_datasets`.
Suppose a new version of an existing dataset is created, where data is added or removed but nothing else changes. It also needs a new dataset card, even though most of the metadata will be unchanged.
Table 3 lists the minimum set of metadata fields that must change in a derived dataset:
| Field Name | Possible Updates | Required? |
|---|---|---|
| `pretty_name` | A modified name is strongly recommended to avoid potential confusion. It might just embed a version string. | |
| `unique_metadata_identifier` | Must be new! | ☑ |
| `dataset_issue_date` | The date for this new card. | ☑ |

Table 3: Minimum Metadata Changes for a Derived Dataset
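As an illustration, a derived dataset’s card might differ from its parent’s card only in the entries below; the names, UUID, and date are placeholders, and the upstream dataset is referenced via `source_datasets`, as discussed above.

```markdown
---
pretty_name: Example Chemistry Corpus (Filtered v2)    # modified name embedding a version string
source_datasets:
  - example-org/example-chemistry-corpus               # the upstream (ancestor) dataset
---

<!-- In the Markdown body (see Table 2); labels and placement are illustrative. -->
- **Unique metadata identifier:** 7c9e6679-7425-40de-944b-e07fc1f90ae7   # a new UUID, never reused from the parent
- **Dataset issue date:** 2025-03-01T00:00:00
```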
Categories of Dataset Transformations
At this time, we use the following concepts for original and derived datasets, concerning levels of quality and cleanliness. This list corresponds to stages in our ingestion process and subsequent possible derivations of datasets. This list is subject to change.
- Raw: A dataset as it is discovered, validated, and cataloged. For all datasets, our most important concern is unambiguous provenance and clear openness. Raw datasets may go through filtering and analysis to remove potentially objectionable content.
- Filtered: A raw dataset that has gone through a processing pipeline to make it more suitable for specific purposes. This might include removal of duplicate records, filtering out unacceptable content (e.g., hate speech, PII), filtering for domain-specific content, etc. Since the presence of some content in the raw data could have legal implications for OTDI, such as some forms of PII and confidential information, we may reject cataloging an otherwise “good” raw dataset and only catalog a suitable filtered dataset.
- Structured: A filtered dataset that has also been reformatted to be most suitable for some AI purpose, such as model training, RAG, etc. For example, PDFs are more convenient to use when converted to JSON or YAML.
- Derived: Any dataset created from one or more other datasets. Filtered and structured datasets are derived datasets.
See How We Process Datasets for more details on these levels and how we process datasets.
After you have prepared or updated the dataset card as required, it’s time to contribute your dataset!
[^1]: For source code, e.g., the code used for the data processing pipelines, the AI Alliance standard code license is Apache 2.0. For documentation, it is the Creative Commons Attribution license, Version 4.0 (CC BY 4.0). See the Alliance `community/CONTRIBUTING` page for more details about licenses.