Dataset Requirements
Note: The requirements documented here are a “draft V0.1” version of what we think will be required. We need and welcome your feedback! Either contact us or consider using pull requests with your suggestions. See the AI Alliance community page on contributing for more details.
Also contact us if you are interested in contributing a dataset but have questions or concerns about meeting the following requirements.
About These Requirements
The requirements attempt to be minimally sufficient, to impose just enough constraints to meet our goals.
The requirements are adapted from the following sources:
- Hugging Face Dataset Card
- The Data Provenance Standard from the Data and Trust Alliance.
- Unique requirements for this project.
Possible additional sources, still TBD, include the following examples:
- BigCode’s dataset card for The Stack.
- …
Most of the details are captured in requirements for the dataset card that every version of a dataset carries (e.g., after various stages of processing). Other requirements described here cover data governance for those processing stages.
Let’s begin.
The Data Is Yours to Contribute
To ensure fully-traceable provenance and governance, you must affirm that you are the owner of the dataset or you received the dataset from a source that offers the data for use without restriction, for example, that you have been granted permission by the owner to act on their behalf with respect to the dataset.
WARNING: Do not contribute any data that was obtained by crawling or scraping public data from the Internet. At this time, we cannot accept such datasets because of concerns about verifying the provenance of such data.
Dataset Hosting
You can either retain your current hosting location or have the AI Alliance host the dataset for you.
Dataset Card
All useful datasets include metadata about their provenance, license(s), target uses, known limitations and risks, etc. To provide a uniform, standardized way of expressing this metadata, we ask you to provide a dataset card (or data card) when you contribute the dataset.
Since your dataset is likely already available on the Hugging Face Hub, we ask you to create a “complete” Hugging Face Dataset Card with the metadata fields they support. The project `README.md` file functions as the card. We provide a list below of the fields we consider necessary and sufficient for our purposes, including a few additional items we need to know that you should add to the `README.md` you create.
TIP: For a general introduction to Hugging Face datasets, see here.
Quick Steps
Here are the steps to create your dataset card, summarized. Read the rest of this page for details:
1. Download our version of the Hugging Face dataset card template, `datasetcard_otdi_template.md`. (If you already have a card in Hugging Face, i.e., the `README.md`, compare our template to your card and add the new fields.)
2. Edit the Markdown in the template file to provide the details, as described below.
3. Create the card in the Hugging Face UI (or edit your existing card).
4. Fill in the metadata fields shown in their editor UI. (See Table 1 below.)
5. Paste the rest of your prepared Markdown into the file, after the YAML block delimited by `---`. (A sketch of the resulting file layout follows these steps.)
6. Commit your changes!
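The following is a minimal sketch of how the finished `README.md` is laid out. The dataset name, field values, and section headings shown are hypothetical placeholders; Tables 1 and 2 below define what belongs in each part.

```markdown
---
# YAML metadata block (Table 1 fields), delimited by ---
license: cdla-permissive-2.0
pretty_name: Example Chemistry Corpus    # hypothetical name
language_details:
- en-US
---

# Dataset Card for Example Chemistry Corpus

<!-- Markdown body (Table 2 content), prepared from datasetcard_otdi_template.md -->

## Dataset Summary

A short summary of the dataset and its purpose.
```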
Details
Refer to the `datasetcard.md` for details about the metadata fields Hugging Face (and we!) recommend for inclusion in a YAML block at the top of the `README.md`. We comment on these fields below; see Table 1.
The `templates/README_guide.md` provides additional information about the template fields in their Markdown template file, `datasetcard_template.md`, in the `huggingface-hub` GitHub repo. However, we recommend that you use our extended version: `datasetcard_otdi_template.md`. (You might need to right click on the link…)
If you want to contribute a dataset that isn’t currently hosted in Hugging Face, use the template above to create a dataset card yourself. Manually add the YAML header block, too. Finally, follow the Hugging Face convention of using the README.md
in the top-level directory as the dataset card.
Required Metadata
This section describes the minimum set of metadata we require, combining elements of the Hugging Face dataset card, concepts from the Data Provenance Standard (DPS) and additional OTDI project requirements.
All of the fields apply to synthesized data as well as real data, but of course details will be different.
NOTE: In the tables that follow, many of the fields appear in both the Hugging Face dataset card template and the Data Provenance Standard, but use different names. We ask you to use the Hugging Face names for consistency and convenience. When unique DPS fields are specified below, we convert their names to lowercase and use underscores as separators, for consistency.
YAML Metadata Block
TIP: The following tables are long, but starting with the `datasetcard_template.md` and following the dataset card process will handle most of the details. Then you can add the additional fields requested in Table 2, those marked with “DPS”.
Our first table describes the metadata encoded in the YAML header block at the beginning of the Hugging Face README format. See datasetcard.md
for details.
For completeness, the optional fields in that block are also shown. The Required? column uses ☑ to indicate the field is required, empty for optional fields (but often recommended), and ☒ for fields that we don’t allow, because they are incompatible with this project.
| Field Name | Description | Required? |
|---|---|---|
| `license` | We strongly recommend `cdla-permissive-2.0` for the Community Data License Agreement – Permissive, Version 2.0, and may require it in the future¹. Use these names for licenses. Also covers the DPS `License to use` field. | ☑ |
| `license_name` | E.g., Community Data License Agreement – Permissive, Version 2.0. | ☑ |
| `license_link` | E.g., `LICENSE` or `LICENSE.md` in the same repo, or a URL to another location. | ☑ |
| `license_details` | Not needed if you use a standard license. | |
| `tags` | Useful for indicating target areas for searches, like `chemistry`, `synthetic`, etc. See also `task_categories`. Where applicable, we recommend that you use the categories described below in Diverse Datasets…. | |
| `annotations_creators` | If appropriate. Examples: `crowdsourced`, `found`, `expert-generated`, `machine-generated` (e.g., using LLMs as judges). | |
| `language_creators` | If appropriate. Examples: `crowdsourced`, `found`, `expert-generated`, `machine-generated` (i.e., synthetic data). | |
| `language_details` | One or more of, for example, `en-US`, `fr-FR`, etc. | ☑ |
| `pretty_name` | E.g., `Common Chemistry`. This is equivalent to the `Dataset title/name` field in the Data Provenance Standard (DPS). | ☑ |
| `size_categories` | E.g., `n<1K`, `100K<n<1M`. | |
| `source_datasets` | A YAML list; zero or more, e.g., `wikipedia`, `common-crawl`. Recall our emphasis on provenance: this list is very important and each source must meet our provenance standards. See also the discussions below. | ☑ |
| `task_categories` | A YAML list; one or more from the list in this code. | ☑ |
| `task_ids` | A YAML list; “a unique identifier in the format `lbpp/{idx}`, consistent with HumanEval and MBPP” from here. See also examples here. | |
| `paperswithcode_id` | Dataset id on PapersWithCode (from the URL). | |
| `configs` | Can be used to pass additional parameters to the dataset loader, such as `data_files`, `data_dir`, and any builder-specific parameters. | |
| `config_name` | One or more dataset subsets, if applicable. See the example in `datasetcard.md` and the discussions here and here. | |
| `dataset_info` | Can be used to store the feature types and sizes of the dataset to be used in Python. See the discussion in `datasetcard.md`. Also covers the DPS `Data format` field. | |
| `extra_gated_fields` | Used for protected datasets and hence incompatible with the goals of OTDI. | ☒ |
| `train-eval-index` | Add this if you want to encode train and evaluation info in a structured way for AutoTrain or Evaluation on the Hub. See the discussion in `datasetcard.md`. | |
¹ For source code, e.g., the code used for the data processing pipelines, the AI Alliance standard code license is Apache 2.0. For documentation, it is the Creative Commons Attribution 4.0 International license (CC BY 4.0). See the Alliance `community/CONTRIBUTING` page for more details about licenses.
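For illustration, a YAML header block that covers the required fields from Table 1 might look like the following. The dataset name, tags, sizes, sources, and task categories are hypothetical placeholders; substitute your own values and add any optional fields that apply.

```yaml
license: cdla-permissive-2.0
license_name: Community Data License Agreement – Permissive, Version 2.0
license_link: LICENSE.md
language_details:
- en-US
pretty_name: Example Chemistry Corpus      # hypothetical
tags:
- chemistry
- synthetic
size_categories:
- 100K<n<1M
source_datasets:
- example-parent-dataset                   # zero or more; each must meet our provenance standards
task_categories:
- text-generation
```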
The Markdown Content in the Dataset Card
Our second table lists content that we require or recommend in the Markdown body of the dataset card, below the YAML header block. The Source column in the table contains the following:
- “HF” for fields in the Hugging Face `datasetcard_template.md`. See the `README_guide.md` for descriptions of many of these fields.
- “DPS” for additional fields derived from the Data Provenance Standard (DPS). Where we require DPS fields, add them to the `README.md` where they seem to fit best.
- “OTDI” for project-specific requirements.
As noted above for Table 1, many of the fields appear in both the Hugging Face dataset card template and the Data Provenance Standard, but use different names. We ask you to use the Hugging Face names for consistency and convenience. When unique DPS fields are used, we convert their field names to lowercase and use underscores as separators, for consistency.
| Field Name | Description | Required? | Source |
|---|---|---|---|
| `standards_version_used` | (DPS name: `Standards version used`) The DPS schema version. Since our dataset card requirements are not strictly conformant to any DPS schema, this is optional. | | DPS |
| `unique_metadata_identifier` | (DPS name: `Unique Metadata Identifier`) A UUID (DPS allows other choices) that is globally unique. Derived datasets must have their own UUIDs. The UUID is very useful for unambiguous lineage tracking, which is why we require it. | ☑ | DPS, OTDI |
| `metadata_location` | (DPS name: `Metadata Location`) Where the metadata is located; but by definition we require the `README.md` file, so omit this field. | ☒ | DPS |
| `dataset_summary` | A concise summary of the dataset and its purpose. | ☑ | HF |
| `dataset_description` | (DPS name: `Description of the Dataset`) Describe the contents, scope, and purpose of the dataset, which helps users understand what the data represents, how it was collected, and any limitations or recommended uses. However, this field should not include redundant information covered elsewhere. | ☑ | HF, DPS |
| `curated_by` | One or more legal entities responsible for creating the dataset, providing accountability and a point of contact for inquiries. Called `Dataset issuer` in DPS. See also `dataset_card_authors` below. | ☑ | HF, DPS |
| `dataset_sources` | HF template section (from `datasetcard_template.md`). Complements the information provided above for `source_datasets`. The `Repository` URL “subfield” is required for each source dataset, unless it was provided by `source_datasets` in Table 1. The `Paper` and `Demo` subfields are optional. See also `source_data` and `source_metadata_for_dataset` next. | ☑ | HF |
| `source_data` | HF template section. Use the subsections described next, `data_collection_and_processing_section` and `source_data_producers_section`, to describe important provenance information. Is the data synthetic or not? This section also covers the DPS `Method` and `Source (if different from issuer)` fields; the latter is more explicit about when data comes from third-party sources. Note our requirement above that you can only submit datasets where you have the necessary rights (see also `consent_documentation_location` below). | ☑ | HF, DPS |
| `data_collection_and_processing_section` | HF template section. Describes how the data was collected and processed, such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. | ☑ | HF |
| `source_data_producers_section` | HF template section. Describes the people or systems who originally created the data. It should also include self-reported demographic or identity information for the source data creators, if this information is available. | ☑ | HF |
| `source_metadata_for_dataset` | (DPS name: `Source metadata for dataset`) Additional content for `source_data`; if the corresponding metadata for any source dataset is not part of that dataset, then it must be explicitly linked here. This information is necessary for lineage tracking, part of our provenance objectives. Marked required, but if all metadata is part of all source datasets (e.g., in `README.md` dataset cards), then this field can be omitted. | ☑ | DPS |
| `consent_documentation_location` | (DPS name: `Consent documentation location`) “Specifies where consent documentation or agreements related to the data can be found, ensuring legal compliance and regulatory use.” Required for third-party datasets you are contributing. | ☑ | DPS |
| `data_origin_geography` | (DPS name: `Data origin geography`) “The geographical location where the data was originally collected, which can be important for compliance with regional laws and understanding the data’s context.” Required if restrictions apply. | | DPS |
| `data_processing_geography_inclusion_exclusion` | (DPS name: `Data Processing Geography Inclusion/Exclusion`) “Defines the geographical boundaries within which the data can or cannot be processed, often for legal or regulatory reasons.” Required if restrictions apply. | | DPS |
| `data_storage_geography_inclusion_exclusion` | (DPS name: `Data Storage Geography Inclusion/Exclusion`) “Specifies where the data is stored and any geographical restrictions on storage locations, crucial for compliance with data sovereignty laws.” Required if restrictions apply. | | DPS |
| `uses` | HF template section. Optional, but useful for describing Direct Use (field name: `direct_use`) and Out-of-Scope Use (field name: `out_of_scope_use`) for the dataset. Consider structuring the Direct Use content as described in the Supported Tasks and Leaderboards section of `templates/README_guide.md`. | | HF |
| `annotations` | HF template section. Add any additional information for the `annotations_creators` above, if any. Subsections are `annotation_process_section` and `who_are_annotators_section`. | | HF |
| `annotation_process_section` | HF template section. Describes the annotation process, such as the annotation tools used, the amount of data annotated, annotation guidelines provided to the annotators, interannotator statistics, annotation validation, etc. | | HF |
| `who_are_annotators_section` | HF template section. Describes the people or systems who created the annotations. | | HF |
| `bias_risks_limitations` | HF template section. While provenance and governance are the top priorities for OTDI, we also want to communicate to potential users what risks they need to understand about our cataloged datasets. Therefore, we require any information you can provide in this section, along with the Recommendations subsection for mitigations, if known. | ☑ | HF |
| `personal_and_sensitive_information` | (DPS name: `Confidentiality classification`) State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). Consider using one or more of the values listed below, after this table. If efforts were made to anonymize the data, describe the anonymization process and also fill in `use_of_privacy_enhancing_technologies_pets`. | ☑ | HF, DPS |
| `use_of_privacy_enhancing_technologies_pets` | (DPS name: `Use of Privacy Enhancing Technologies (PETs)...`) “Indicates whether techniques were used to protect personally identifiable information (PII) or sensitive personal information (SPI), highlighting the dataset’s privacy considerations.” | ☑ | DPS |
| `citation` | HF template section. A place to add BibTeX (field name: `citation_bibtex`) and APA (field name: `citation_apa`) citations. | | HF |
| `glossary` | HF template section. Define useful terms. | | HF |
| `dataset_card_authors` | HF template section. We need to know the authors. | ☑ | HF |
| `dataset_card_contact` | HF template section. We need to know whom to contact when needed. Okay to leave blank if the authors’ contact information is provided. | | HF |
| `dataset_issue_date` | (DPS name: `Dataset Issue Date`) When the dataset was compiled or created. (New versions require new dataset cards.) Recommended format: `YYYY-mm-ddTHH:MM:SS`. | ☑ | DPS |
| `date_previously_issued_version_dataset` | (DPS name: `Date of Previously Issued Version of the Dataset`) Timestamp for previous releases, if applicable. Redundant with other traceability tools, so not recommended. | | DPS |
| `range_dates_data_generation` | (DPS name: `Range of dates for data generation`) The span of time during which the data within the dataset was collected or generated, offering insight into the dataset’s timeliness and relevance. | ☑ | DPS |
| `intended_data_use` | (DPS name: `Intended Data Use`) Covered by other fields, so omit. | ☒ | DPS |
| `proprietary_data_presence` | (DPS name: `Proprietary Data Presence`) Incompatible with OTDI goals, so either omit or always use `no`. | ☒ | DPS |
The Source Metadata for Dataset
field provides lineage from a dataset to its ancestors. It is not necessary to list the entire lineage, just the immediate “parents”, because the full lineage can be reconstructed from this information.
For the `personal_and_sensitive_information` (DPS `Confidentiality classification`) field, consider using one or more of the following values defined by DPS:
- Personal Information (PI)/Demographic
- Payment Card Industry (PCI)
- Personal Financial Information (PFI)
- Personally Identifiable Information (PII)
- Personal Health Information (PHI)
- Sensitive Personal Information (SPI)
- Other (please specify)
- None
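As an illustration, one way to record the DPS-derived fields in the Markdown body is a simple list in whatever section fits best. All values below (the UUID, issuer, and dates) are hypothetical placeholders.

```markdown
## Dataset Details

- **unique_metadata_identifier:** 123e4567-e89b-12d3-a456-426614174000  <!-- hypothetical UUID; generate your own -->
- **curated_by:** Example Research Lab                                   <!-- hypothetical issuer -->
- **dataset_issue_date:** 2024-06-30T00:00:00
- **range_dates_data_generation:** 2015-01-01 to 2023-12-31
- **personal_and_sensitive_information:** None
- **use_of_privacy_enhancing_technologies_pets:** Not applicable; no PII or SPI present.
```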
Some Requirements for the Data Itself
The dataset card template has sections for all the required and optional information. Here we discuss a few points.
Formats
We endeavor to be flexible about dataset file formats and how they are organized. For text, we recommend formats like CSV, JSON, Parquet, ORC, and Avro. Supporting PDFs, from which content must first be extracted, is more difficult, but not impossible.
NOTE: Using Parquet has the benefit that Hugging Face Croissant can be used to automatically extract some metadata. The `mlcroissant` library supports loading a dataset using its Croissant metadata.
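If you store Parquet files in the dataset repository, one way to point the Hugging Face dataset loader at them is the `configs` field from Table 1. A minimal sketch, assuming a single hypothetical `train` split stored under `data/`:

```yaml
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*.parquet    # hypothetical path pattern
```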
Diverse Datasets Desired for Diverse AI Models and Applications
We are particularly interested in new datasets that can be used to train and tune models to excel in various domains, although general-purpose datasets are also welcome, of course.
These are the current domains of particular interest. Use the tags
metadata field discussed above to indicate domains, when applicable.
Science and Industrial
- Climate: Supporting research in climate change, modeling vegetation and water cover, studying agriculture, etc.
- Marine: Supporting research on, and applications targeted towards, marine environments.
- Materials: Known chemical and mechanical properties of chemicals, useful for research into potential new and existing materials.
- Semiconductors: A specific area of materials research focused on improving the state of the art for semiconductor performance and manufacturing.
Other science and industrial domains are welcome, too.
Other Domains
- Finance: Historical market activity and behaviors, and connections to influences like climate, weather events, political events, etc.
- Healthcare: Everything from synthetic patient data for modeling outcomes, to public literature on known diseases and conditions, to diagnostic results and their analysis.
- Legal: Jurisdiction-specific data about case law, etc., including specific applications.
- Social Sciences: Social dynamics, political activity and sentiments, etc.
- Timeseries: Data for training, tuning, and testing time series models, including specific applications.
Modalities
In addition, we welcome datasets with different modalities. Hugging Face attempts to determine the modalities of datasets, but you can also use the `tags` field to indicate modalities, such as the following:
- Text
- Image: i.e., still images
- Audio
- Video: including optional audio
Derived Dataset Requirements
Every dataset that is derived via a processing pipeline from another dataset, or is a new version of an existing dataset (e.g., because of additional data), requires its own dataset card, which must reference all upstream datasets that fed into it.
A derived dataset could include a dataset that has been processed to remove duplication, hate speech, etc., or transformed to different formats. Much of the dataset card content will be unchanged, but some fields may or will require updating in the new card.
First, consider a new version of an otherwise-identical dataset. You should examine all fields, but the following are most likely to need changing.
| Field Name | Possible Updates | Required? |
|---|---|---|
| `tags` | For example, if the expanded dataset covers new domains compared to its “parent”. | |
| `language_details` | If new languages were added or existing languages removed. | |
| `pretty_name` | Consider whether a more descriptive name or an added version string is desirable. | |
| `size_categories` | If the size changed significantly. | |
| `source_datasets` | If the source datasets changed. | |
| `task_categories` | If used and the tasks changed, usually meaning that more are supported. | |
| `task_ids` | If used and the task ids changed. | |
| `config_name` | If used and the dataset subsets have changed. | |
| `dataset_info` | If used and the feature types, etc., for Python usage have changed. | |
| `unique_metadata_identifier` | Must be new! | ☑ |
| `curated_by` | If different. | |
| `dataset_description` | If changed. | |
| `dataset_sources` | Locations of new sources will be different. | ☑ |
| `source_data` | What information has changed about the sources? | ☑ |
| `uses` | Have the uses changed? | |
| `bias_risks_limitations` | Have these concerns changed? | |
| `dataset_card_authors` | The authors of this new card. | ☑ |
| `dataset_card_contact` | Required, if changed. | |
| `dataset_issue_date` | The date for this new card. | ☑ |
| `range_dates_data_generation` | The date range for this new dataset. | ☑ |
If other changes were made besides just the addition of new data, consider whether the dataset should instead be treated as a different, derived dataset.
Second, consider a new dataset derived from another dataset through a processing pipeline. You should examine all fields, but the following are most likely to need changing.
| Field Name | Possible Updates | Required? |
|---|---|---|
| `tags` | For example, if the derived dataset is more focused on a domain than its “parent” or it adds new domains. | |
| `pretty_name` | A new name is essential to avoid confusion. | ☑ |
| `size_categories` | If the size changed significantly. | |
| `source_datasets` | Effective governance requires careful lineage tracking, so all parent datasets must be identified. | ☑ |
| `task_categories` | If used and the tasks changed. | |
| `task_ids` | If used and the task ids changed. | |
| `configs` | If used and the parameters to pass to the dataset loaders have changed. | |
| `config_name` | If used and the dataset subsets have changed. | |
| `dataset_info` | If used and the feature types, etc., for Python usage have changed. | |
| `unique_metadata_identifier` | Must be new! | ☑ |
| `curated_by` | Most likely different! | ☑ |
| `dataset_description` | Change it to reflect the transformations, but avoid repeating information provided elsewhere. | |
| `dataset_sources` | New dataset, so new source (location) information. | ☑ |
| `source_data` | New dataset, so new information about the sources. | ☑ |
| `uses` | New dataset, so the uses are likely to be different. | ☑ |
| `bias_risks_limitations` | How have these concerns changed with the processing done to create this dataset? | ☑ |
| `dataset_card_authors` | The authors of this new card. | ☑ |
| `dataset_card_contact` | Required, if changed. | |
| `dataset_issue_date` | The date for this new card. | ☑ |
| `range_dates_data_generation` | The date range for this new dataset. | ☑ |
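For example, excerpts from a hypothetical filtered derivative’s card might look like the following; the names, UUID, and dates are placeholders, and the immediate parent dataset is listed explicitly for lineage tracking.

```markdown
---
pretty_name: Example Chemistry Corpus (Filtered)    # new name to avoid confusion (hypothetical)
source_datasets:
- example-chemistry-corpus                           # the immediate "parent" dataset (hypothetical)
---

- **unique_metadata_identifier:** 9f3b2c1a-0d4e-4a6b-8c7d-5e6f7a8b9c0d  <!-- a new UUID, not the parent's -->
- **dataset_issue_date:** 2024-07-15T00:00:00
- **range_dates_data_generation:** 2015-01-01 to 2023-12-31
```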
Categories of Dataset Transformations
At this time, we have the following concepts for original and derived datasets, concerning levels of quality and cleanliness. This list corresponds to stages in our ingestion process and subsequent possible derivations of datasets. This list is subject to change.
- Raw: The dataset as submitted, which could already be in “good shape”. Our most important concern at this stage is unambiguous provenance. Raw datasets may go through filtering and analysis to remove potentially objectionable content. However, the presence of some content in the raw data could have legal implications, such as some forms of PII and company confidential information, which may force us to reject the contribution. (Should this happen, we will discuss mitigation options with you.)
- Filtered: A raw dataset that has gone through a processing pipeline to remove duplicates, filter out objectionable content, etc.
- Structured: A filtered dataset that has been reformatted to be most suitable for model training (LLMs, time series, etc.), RAG patterns, and similar purposes. For example, PDFs converted to JSON.
See How We Process Datasets for more details on these levels and how we process datasets.
After you have prepared or updated the dataset card as required, it’s time to contribute your dataset!