
Dataset Specification

Note: The specification documented here is the “V0.1” version of what we think will be required for cataloged datasets. We need and welcome your feedback! Either contact us or consider using pull requests with your suggestions. See the AI Alliance community page on contributing for more details.

Also contact us if you are interested in contributing a dataset but have questions or concerns about meeting the following specification.

Table of contents
  1. Dataset Specification
    1. About This Specification
    2. The Data Must Be Yours to Contribute
    3. Dataset Hosting
    4. Dataset Card
    5. Quick Steps
      1. Details
    6. Required Metadata
      1. YAML Metadata Block
    7. The Markdown Content in the Dataset Card
    8. Other Considerations for the Data Itself
      1. Formats
      2. Diverse Datasets
        1. Science and Industrial
        2. Other Domains
        3. Modalities
    9. Derived Dataset Specification
      1. Categories of Dataset Transformations

About This Specification

The specification attempts to be minimally sufficient, to impose just enough constraints to meet our goals for cataloged datasets.

The specification is adapted from the Hugging Face Dataset Card, with a few extensions for clearer provenance and governance.

In addition, we are exploring incorporation of the following sources:

Most of the details are captured in the dataset card that every version of a dataset carries (e.g., after various stages of processing).

Let’s begin.

The Data Must Be Yours to Contribute

To promote fully traceable provenance and governance, you must affirm, for all data within the dataset, that you either (a) are the owner of the dataset or (b) have rights from the owner of the data that enable you to provide it to anyone under the CDLA Permissive 2.0 license; for example, the owner has granted you permission to act on their behalf with respect to the data and to enable others to use it without restriction.

WARNING: Do not contribute any data that was obtained by crawling or scraping public data from the Internet or other public places. At this time, we are not accepting such data because we are seeking to build datasets with a heightened level of clarity around ownership, provenance, and quality.

Dataset Hosting

You can either retain your current hosting location or you can have the AI Alliance host it for you.

Dataset Card

All useful datasets include metadata about their provenance, license(s), target uses, known limitations and risks, etc. To provide a uniform, standardized way of expressing this metadata, we ask you to provide a dataset card (or data card) when you contribute the dataset.

Because your dataset is likely already available on the Hugging Face Hub, we ask you to create a “complete” Hugging Face Dataset Card with the metadata fields they support. The project README.md file functions as the card. Below we list the fields we consider necessary for our purposes, including a few additional items that you should add to the README.md you create.

TIP: For a general introduction to Hugging Face datasets, see here.

Quick Steps

Here are the steps to create your dataset card, summarized. Read the rest of this page for details:

  1. Download our version of the Hugging Face dataset card template, datasetcard_otdi_template.md. (If you already have a card in Hugging Face, i.e., the README.md, compare our template to your card and add the new fields.)
  2. Edit the Markdown in the template file to provide the details, as described below.
  3. Create the card in the Hugging Face UI (or edit your existing card).
  4. Fill in the metadata fields shown in their editor UI. (See Table 1 below.)
  5. Paste the rest of your prepared Markdown into the file, after the YAML block delimited by ---.
  6. Commit your changes!
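
The layout described in step 5, a YAML block delimited by --- followed by the Markdown body, can be sketched as follows. This is a minimal illustration only: `build_card` is a hypothetical helper, the field values are made up, the license identifier is an assumption about how CDLA Permissive 2.0 is spelled on the Hub, and the YAML writer handles only plain strings and flat lists. A real workflow would more likely rely on the Hugging Face editor UI or a proper YAML library.

```python
def build_card(metadata: dict[str, object], body_markdown: str) -> str:
    """Return README.md text: a YAML block between '---' lines, then Markdown.

    Only plain scalar values and flat lists are handled; nothing is quoted
    or escaped, so this is not a general YAML serializer.
    """
    lines = ["---"]
    for key, value in metadata.items():
        if isinstance(value, list):
            lines.append(f"{key}:")
            lines.extend(f"- {item}" for item in value)
        else:
            lines.append(f"{key}: {value}")
    lines.append("---")
    return "\n".join(lines) + "\n\n" + body_markdown.strip() + "\n"


# Illustrative values only; the license identifier is an assumption.
card = build_card(
    {
        "license": "cdla-permissive-2.0",
        "pretty_name": "Example Dataset",
        "tags": ["climate", "text"],
    },
    "# Dataset Card for Example Dataset\n\nDescribe the dataset here.",
)
print(card)
```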

Details

Refer to datasetcard.md for details about the metadata fields that Hugging Face (and we!) recommend including in a YAML block at the top of the README.md. We comment on these fields below; see Table 1.

The templates/README_guide.md provides additional information about the template fields in their Markdown template file, datasetcard_template.md in the huggingface-hub GitHub repo. However, we recommend that you use our extended version: datasetcard_otdi_template.md. (You might need to right click on the link…)

If you want to contribute a dataset that isn’t currently hosted in Hugging Face, use the template above to create a dataset card yourself. Manually add the YAML header block, too. Finally, follow the Hugging Face convention of using the README.md in the top-level directory as the dataset card.
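
For a dataset hosted elsewhere, a quick local check that the top-level README.md really starts with a YAML header block might look like the sketch below. The helper names `split_card` and `read_card` are hypothetical, and the code only separates the header text from the Markdown body; it does not parse or validate the YAML itself.

```python
from pathlib import Path


def split_card(text: str) -> tuple[str, str]:
    """Split card text into (yaml_header, markdown_body).

    Raises ValueError if the text does not start with a '---' delimited block.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("card does not start with a '---' line")
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            header = "\n".join(lines[1:index])
            body = "\n".join(lines[index + 1:]).strip()
            return header, body
    raise ValueError("no closing '---' line found")


def read_card(dataset_dir: str) -> tuple[str, str]:
    """Read README.md from the top-level directory and split it."""
    readme = Path(dataset_dir) / "README.md"
    return split_card(readme.read_text(encoding="utf-8"))
```

Calling `read_card` on a directory without a README.md raises `FileNotFoundError`, which is usually the desired signal that the card is missing.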

Required Metadata

This section describes the minimum set of metadata we expect, including some optional elements of the Hugging Face dataset card that we believe are essential.

All of the fields apply to synthesized data as well as real data, although the details will differ.

YAML Metadata Block

TIP: The following tables are long, but starting with the datasetcard_template.md and the dataset card process will handle most of the details. Then you can add the additional fields requested in Table 2, those marked with “OTDI”.

Our first table describes the metadata encoded in the YAML header block at the beginning of the Hugging Face README format. See datasetcard.md for details.

For completeness, the optional fields in that block are also shown. The Required? column uses ☑ to indicate the field is required, empty for optional fields (but often recommended), and ☒ for fields that we don’t allow, because they are incompatible with this project.

Table 1: Hugging Face Datacard Metadata
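
The three states of the Required? column (required, optional, not allowed) suggest a simple check over the parsed YAML metadata. The sketch below assumes the metadata has already been loaded into a dictionary. The function `check_fields` is hypothetical, and the field sets in the example are placeholders, not the actual contents of Table 1; substitute the real required and disallowed field names.

```python
def check_fields(
    metadata: dict[str, object],
    required: set[str],
    disallowed: set[str],
) -> list[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for name in sorted(required):
        # Treat absent, empty-string, and empty-list values as missing.
        if metadata.get(name) in (None, "", []):
            problems.append(f"missing required field: {name}")
    for name in sorted(disallowed):
        if name in metadata:
            problems.append(f"field not allowed: {name}")
    return problems


# Placeholder field sets for illustration only.
example_problems = check_fields(
    {"license": "cdla-permissive-2.0", "tags": [], "some_disallowed_field": 1},
    required={"license", "pretty_name", "tags"},
    disallowed={"some_disallowed_field"},
)
for problem in example_problems:
    print(problem)
```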

The Markdown Content in the Dataset Card

Our second table lists content that we require or recommend in the Markdown body of the dataset card, below the YAML header block. The Source column in the table contains the following:

  • “HF” for fields in the Hugging Face datasetcard_template.md. See the README_guide.md for descriptions of many of these fields.
  • “OTDI” for additional fields we believe are necessary.

Table 2: Additional Content for the Dataset Card (`README.md`)

For the personal_and_sensitive_information field, consider using one or more of the following values:

  • Personal Information (PI)/Demographic
  • Payment Card Industry (PCI)
  • Personal Financial Information (PFI)
  • Personally Identifiable Information (PII)
  • Personal Health Information (PHI)
  • Sensitive Personal Information (SPI)
  • Other (please specify)
  • None
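
Because these values are suggestions, a card-checking script might flag unfamiliar values for review instead of rejecting them. A minimal sketch, assuming the field holds a list of strings and that any value beginning with "Other" is a free-form entry (both assumptions; `unrecognized_values` is a hypothetical helper):

```python
SUGGESTED_VALUES = frozenset({
    "Personal Information (PI)/Demographic",
    "Payment Card Industry (PCI)",
    "Personal Financial Information (PFI)",
    "Personally Identifiable Information (PII)",
    "Personal Health Information (PHI)",
    "Sensitive Personal Information (SPI)",
    "None",
})


def unrecognized_values(values: list[str]) -> list[str]:
    """Return the values that are neither suggested nor an 'Other ...' entry."""
    unknown = []
    for value in values:
        if value in SUGGESTED_VALUES or value.startswith("Other"):
            continue
        unknown.append(value)
    return unknown


flagged = unrecognized_values(
    ["Personally Identifiable Information (PII)", "Other: biometric data", "PIII"]
)
print(flagged)
```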

Other Considerations for the Data Itself

The dataset card template has sections for all the required and optional information. Here we discuss a few points.

Formats

We endeavor to be flexible on dataset file formats and how they are organized. For text, we recommend formats like CSV, JSON, Parquet, ORC, and Avro. Supporting PDFs, where extraction will be necessary, can be difficult but is not impossible.

NOTE: Using Parquet has the benefit that MLCommons Croissant can be used to automatically extract some metadata. See this Hugging Face page and the mlcroissant library, which supports loading a dataset using the Croissant metadata.
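
One rough way to triage files against the formats mentioned above is by file extension. The mapping from format names to extensions, the category labels, and the `classify_file` helper are all assumptions made for illustration; an extension is only a hint about the actual file contents.

```python
from pathlib import PurePath

# Assumed extensions for the recommended text formats named above.
RECOMMENDED_SUFFIXES = {".csv", ".json", ".parquet", ".orc", ".avro"}
# PDFs are accepted in principle but require an extraction step.
EXTRACTION_SUFFIXES = {".pdf"}


def classify_file(filename: str) -> str:
    """Classify a file name as 'recommended', 'needs extraction', or 'unlisted'."""
    suffix = PurePath(filename).suffix.lower()
    if suffix in RECOMMENDED_SUFFIXES:
        return "recommended"
    if suffix in EXTRACTION_SUFFIXES:
        return "needs extraction"
    return "unlisted"


for name in ["data/train.parquet", "docs/report.pdf", "notes.txt"]:
    print(name, "->", classify_file(name))
```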

Diverse Datasets

Diverse datasets are desired for creating a variety of AI models and applications with special capabilities.

We are particularly interested in new datasets that can be used to train and tune models to excel in particular domains, although general-purpose datasets are also welcome, of course.

These are the current domains of particular interest. Use the tags metadata field discussed above to indicate domains, when applicable.

Science and Industrial

  • Climate: Supporting research in climate change, modeling vegetation and water cover, studying agriculture, etc.
  • Marine: Supporting research on and applications targeted towards marine environments.
  • Materials: Known chemical and mechanical properties of chemicals useful for research into potential new and existing materials.
  • Semiconductors: Specific area of materials research focused on improving the state of the art for semiconductor performance and manufacturing.

Other science and industrial domains are welcome, too.

Other Domains

  • Finance: Historical market activity and behaviors. Connections to influences like climate, weather events, political events, etc.
  • Healthcare: Everything from synthetic patient data for modeling outcomes, to public literature on known diseases and conditions, to diagnostics results and their analysis.
  • Legal: Jurisdiction-specific data about case law, etc., including specific applications.
  • Social Sciences: Social dynamics, political activity and sentiments, etc.
  • Timeseries: Data for training, tuning, and testing time series models, including specific applications.

Modalities

In addition, we welcome datasets with different modalities. Hugging Face attempts to determine the modalities of datasets, but you can also use the tags to indicate modalities, such as the following:

  • Text
  • Image: i.e., still images
  • Audio
  • Video: including optional audio

Derived Dataset Specification

Every dataset that is derived via a processing pipeline from one or more other datasets requires its own dataset card, which must reference all upstream datasets that feed into it (and, by extension, their dataset cards). Similarly, each new version of an existing dataset, where data is only added or removed, also needs an updated card, although more of the metadata will be unchanged.

Note: We are considering a way to allow a derived dataset card to just specify what’s new or changed and inherit unchanged metadata from its ancestors. Also, when automated pipelines are used to create derived datasets, our processing pipelines will automatically generate some updated metadata, such as timestamps, processing tools and steps, etc.

For example, when the derived dataset is the filtered output of one or more raw datasets (defined below), with duplicates and offensive content removed, the new dataset may support different uses and have different bias_risks_limitations, and it must identify its upstream (ancestor) source_datasets.

Table 3 lists the fields that must change (with some exceptions), to avoid ambiguities:

| Field Name | Possible Updates | Required? |
| :--- | :--- | :--- |
| pretty_name | A modified name is strongly recommended to avoid potential confusion. It might just embed a version string. | |
| unique_metadata_identifer | Must be new! | |
| dataset_issue_date | The date for this new card. | |

Categories of Dataset Transformations

At this time, we have the following concepts for original and derived datasets, concerning levels of quality and cleanliness. This list corresponds to stages in our ingestion process and subsequent possible derivations of datasets. This list is subject to change.

  • Raw: The dataset as submitted, which could already be in “good shape”. Our most important concern at this stage is unambiguous provenance. Raw datasets may go through filtering and analysis to remove potentially objectionable content. However, the presence of some content in the raw data could have legal implications, such as some forms of PII and company-confidential information, which may force us to reject the contribution. (Should this happen, we will discuss mitigation options with you.)
  • Filtered: A raw dataset that has gone through a processing pipeline to remove duplicates, filter for objectionable content, etc.
  • Structured: A filtered dataset that has been reformatted to be most suitable for model training (LLMs, time series, etc.), RAG patterns, and similar purposes. For example, PDFs converted to JSON.
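
Read as a progression, the three categories above form a simple chain: raw, then filtered, then structured. The sketch below models that reading with a hypothetical `Stage` enum and `can_derive` helper. It assumes each derivation advances exactly one stage, which is an interpretation of the list, not a stated rule, and the list itself is subject to change.

```python
from enum import Enum


class Stage(Enum):
    RAW = "raw"
    FILTERED = "filtered"
    STRUCTURED = "structured"


# Assumed ordering: a filtered dataset comes from a raw one,
# and a structured dataset comes from a filtered one.
NEXT_STAGE = {
    Stage.RAW: Stage.FILTERED,
    Stage.FILTERED: Stage.STRUCTURED,
}


def can_derive(source: Stage, target: Stage) -> bool:
    """Return True if target is the stage that directly follows source."""
    return NEXT_STAGE.get(source) == target


print(can_derive(Stage.RAW, Stage.FILTERED))
print(can_derive(Stage.RAW, Stage.STRUCTURED))
```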

See How We Process Datasets for more details on these levels and how we process datasets.

After you have prepared or updated the dataset card as required, it’s time to contribute your dataset!

  1. For source code, e.g., the code used for the data processing pipelines, the AI Alliance standard code license is Apache 2.0. For documentation, it is The Creative Commons License, Version 4.0, CC BY 4.0. See the Alliance community/CONTRIBUTING page for more details about licenses. 

