
Use the following template to create your dataset card. Replace all the content marked with {...} with appropriate values and add additional text as you see fit. Note the suggestions in italics, which you should remove. Also, pay attention to sections marked required. Keep in mind our goals for OTDI and how this metadata supports those goals.
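Before submitting, it can help to confirm that no `{ ... }` placeholders remain in the card. The following is a minimal sketch, not part of any official tooling; the function name `find_unfilled_placeholders` and the regular expression are assumptions chosen for illustration, based only on the placeholder style used in this template.

```python
import re

# Assumed pattern: an opening brace, optional spaces, a name that starts with
# a letter or underscore, then anything up to the closing brace.
# Examples it matches: "{ dataset_name }", "{ shared_by, e.g., your name and email }".
PLACEHOLDER = re.compile(r"\{\s*([A-Za-z_][^{}]*?)\s*\}")


def find_unfilled_placeholders(card_text: str) -> list[str]:
    """Return the contents of every template placeholder still in the text."""
    return [match.group(1) for match in PLACEHOLDER.finditer(card_text)]


if __name__ == "__main__":
    draft = "Dataset Card for My Corpus\n\nFunded by (optional): { funded_by }\n"
    for name in find_unfilled_placeholders(draft):
        print(f"Unfilled placeholder: {name}")
```

Note that a filled-in BibTeX entry legitimately contains braces, so this check may report false positives in the Citation section; treat its output as a reminder, not a verdict.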

If you are uncertain about what a particular section requires, add questions in that section! When you submit this card with your dataset, we will provide answers, as well as other feedback.

You might have nothing to enter for some sections that are optional. If so, just use “N/A”.

For more information on dataset card metadata, see the Hugging Face guide and their card specification, from which this card template is adapted.

NOTE: We intend to turn this template into a form for easier preparation. Apologies in the meantime…

WARNING: At this time, we can only accept text files with one of the following extensions: *.txt, *.md, or *.markdown.
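The extension restriction above can be checked locally before submitting. This is a minimal sketch; the helper name `has_accepted_extension` is hypothetical, and treating extensions as case-insensitive is an assumption, since the warning does not say how uppercase extensions are handled.

```python
from pathlib import Path

# Extensions listed in the warning above.
ALLOWED_EXTENSIONS = {".txt", ".md", ".markdown"}


def has_accepted_extension(filename: str) -> bool:
    """Return True if the file name ends in one of the accepted extensions.

    Assumption: the comparison ignores case, so "CARD.MD" is accepted.
    """
    return Path(filename).suffix.lower() in ALLOWED_EXTENSIONS


if __name__ == "__main__":
    for name in ["dataset_card.md", "notes.txt", "card.pdf"]:
        status = "ok" if has_accepted_extension(name) else "not accepted"
        print(f"{name}: {status}")
```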

Here is the template. Click here to download a copy. (You might need to right click on the link…)

Dataset Card for { dataset_name }

A descriptive and unique name is best!

Short Description (Required)

A quick summary of the dataset and its purpose.

{ dataset_summary }

Dataset Details

Dataset Description (Required)

Longer details about this dataset, its purpose, goals, etc.

{ dataset_description }

Curated by (required): { curators_list }
Funded by (optional): { funded_by }
Shared by (optional): { shared_by, e.g., your name and email } The submission form will also have this.
Language(s) (NLP): { language_list } Include the primary languages you know of.

Some of these bullet list items can be expanded upon in sections below, so use the bullet points when a single, concise entry is known, or use the longer sections below.

Dataset Card Authors (Required)

Names and email addresses for the authors.

{ dataset_card_authors }

Dataset Card Contacts (Required)

Names and email addresses for the primary contact people.

{ dataset_card_contacts }

Dataset Sources

The link where the dataset lives today. (Preferably one link, but add more if necessary.) While this information will also be in the submission form, we want to have it in the data card, as well.

Repository (required): { repo_URL } e.g., https://huggingface.co/datasets/…
Paper (optional): { paper_URL } e.g., arxiv.org link
GitHub (optional): { GitHub_URL } e.g., for supporting code and documentation
Other Demo or Documentation Links (optional): { URL_list } e.g., a Just the Docs link.

Notes on How to Use the Dataset

Address questions about how the dataset is intended to be used. For example, is it only suitable for use with/for certain models, modalities, tools? Are there significant limitations to be aware of?

{ how_to_use_the_dataset }

Target Use Cases

Describe the best use cases for the dataset.

{ target_use_cases }

Out-of-Scope Use Cases

Note use cases for which the dataset is ill-suited. This could include scenarios for misuse and malicious activity.

{ out_of_scope_use_cases }

Dataset Structure

Provide a description of the dataset format, directory structure, schema (if structured), and additional useful information, such as criteria that were used to create splits, known relationships between data points, etc.

{ dataset_structure }

Dataset Creation

Curation Rationale

Motivation for the creation of this dataset.

{ curation_rationale_section }

Source Data (Required)

Describe the source data (e.g. news text and headlines, social media posts, translated sentences, …) used to create the dataset. Because of our emphasis on provenance, you must provide explicit details about any sources you used to derive this dataset, including information about provenance, license to use, etc.

{ source_data }

Data Collection and Processing (Required)

Describe the data collection and processing tools and techniques, such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. While we understand you may not want to reveal any proprietary methods used, please provide enough information to satisfy our provenance concerns.

{ data_collection_and_processing }

Source Data Producers (Required)

Describe the people or systems who originally created the data. Include self-reported demographic or identity information for the source data creators, if this information is available. Provide enough information to satisfy our provenance concerns.

{ source_data_producers }

Annotations (Optional)

If the dataset contains annotations which are not an “inherent” part of the initial data collection, use this section to describe them.

{ annotations }

Annotation Process

Describe the annotation process used, such as the amount of data annotated, annotation guidelines provided to the annotators, inter-annotator statistics, annotation validation, etc.

{ annotation_process }

Who Are the Annotators?

Describe the people or systems who created the annotations, for example, workers on Amazon Mechanical Turk.

{ who_are_annotators }

Personal and Sensitive Information (Required)

State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize or filter the data, describe this process.

{ personal_and_sensitive_information }

Bias, Risks, and Limitations (Required)

Describe any other known technical and social limitations of the dataset.

{ bias_risks_limitations }

Recommendations

Describe any particular recommendations for handling known bias, risk, and technical limitations when using this dataset. Note that, as a matter of common practice, we will always warn users “to be aware of potential risks, biases, and limitations of this dataset, which may not be known.”

{ bias_recommendations }

Licensing Information

The dataset is released under the Community Data License Agreement – Permissive, Version 2.0 license.

Future Work

Describe planned work, if any.

{ future_work }

Citation (Optional)

In addition to research papers mentioned above, add APA and BibTeX citations here, if any.

BibTeX

{ citation_bibtex }

APA

{ citation_apa }

Glossary (Optional)

If relevant, include terms and calculations in this section that can help readers understand the dataset or dataset card. For example, are there domain-specific terms used that might be unclear to a reader outside the domain?

{ glossary }

More Information (Optional)

Anything else you want to add?

{ more_information }