Contribute Your Dataset!

NOTE: Be sure to read the Dataset Specification details before proceeding. If you have questions or concerns about the specification, please contact us.

Contribution means adding your dataset to our catalog. You continue to own and host the dataset where you see fit.

What Kinds of Datasets Do We Seek?

Broad, effective use of AI requires datasets covering the breadth of human languages, domains, modalities, and target applications. See the current list of keywords we have cataloged.

We have particular interests in these areas:

Science and Industry

Climate: Supporting research in climate change, modeling vegetation and water cover, studying agriculture, etc.
Marine: Supporting research on and applications targeted towards marine environments.
Materials: Known chemical and mechanical properties of chemicals useful for research into potential new and improved materials.
Drug Discovery: Known chemical and medicinal properties of chemicals useful for research into potential new and improved pharmaceuticals.
Semiconductors: Specific area of materials research focused on improving the state of the art for semiconductor performance and manufacturing.
Physics: Data for physical systems.
Software: Software code bases and supporting datasets, e.g., vulnerability datasets, analyses of software-related failures, etc.

Other science and industrial domains are welcome, too.

Vertical Domains

Finance: Historical market activity and behaviors. Connections to influences like climate, weather events, political events, etc.
Healthcare: Everything from synthetic patient data for modeling outcomes, to public literature on known diseases and conditions, to diagnostics results and their analysis.
Legal: Jurisdiction-specific data about case law and related material for specific applications.
Social Sciences: Social dynamics, political activity and sentiments, etc.

Some concerns apply across all industries and must be addressed for success:

Security: Security vulnerabilities, incidents, etc. for software and other systems, including datasets for red teaming, penetration testing, and other security practices.
Safety: AI safety in all its forms, including suppressing hate speech, refusing assistance with harmful activities, and reducing hallucinations.

Modalities

In addition, we welcome datasets in different modalities. Hugging Face attempts to determine the modalities of datasets automatically, but you can also use tags to indicate modalities, such as the following:

Text: Especially for under-served languages.
Image: Still images.
Audio:
Video: Optionally including audio.
Time series: Data for training, tuning, and testing time series models, both general-purpose and for domain-specific applications.
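As a rough illustration of tagging modalities, the sketch below renders YAML front matter of the kind that sits at the top of a Hugging Face dataset card (README.md). The helper name render_front_matter and the example tag and language values are assumptions chosen for illustration, not a list required by the Dataset Specification.

```python
import json


def render_front_matter(license_id, languages, tags):
    """Render dataset card front matter as a YAML string.

    Values are written as double-quoted strings (valid YAML) so that
    short codes such as "no" are not misread as booleans.
    """
    lines = ["---", f"license: {json.dumps(license_id)}"]
    lines.append("language:")
    lines += [f"- {json.dumps(code)}" for code in languages]
    lines.append("tags:")
    lines += [f"- {json.dumps(tag)}" for tag in tags]
    lines.append("---")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical example values; replace them with your dataset's own.
    print(render_front_matter("cdla-permissive-2.0", ["sw"], ["text", "time-series"]))
```

Paste the printed block at the very top of the dataset card, before any other content.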

In addition, some industry-specific datasets have their own custom formats.

Synthetic Datasets

For all of the above categories, synthetic data is important for filling gaps, especially in domains where open datasets are hard to find, such as patient data in healthcare.

The Contribution Process

The process follows these steps:

Prepare your contribution: Make sure you meet the Dataset Specification and prepare the dataset card.
Tell us about your dataset: Follow the instructions in Contribute Your Dataset below to submit your dataset for consideration.
Receive feedback from us: After we evaluate the submission, we will provide feedback and request clarifications, where needed.
Be added to our dataset catalog: Once your contribution is accepted, your dataset will be added to our catalog.
Review your details: After publication in our catalog, verify that the information about your dataset is correct.

License

The Open Trusted Data Initiative is focused on obtaining datasets from submitters who either own them or have an unrestricted, free-to-use license from all owners of data included in the dataset. By contributing a dataset to the catalog, you affirm that with respect to the dataset and all of its data, you either (1) are the owner or (2) have been granted a license by all owner(s) of the data enabling you to license it to others under an acceptable open license, which gives anyone the right to use, modify, copy, and create derivative works of the data and dataset, among other things. Do not contribute any data that was obtained merely by collecting publicly visible data from the Internet or from other sources that you do not own or to which you do not have a suitable license.

We prefer the Community Data License Agreement - Permissive, Version 2.0, although the Creative Commons Attribution 4.0 International license (CC BY 4.0) is also sometimes used.
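Before submitting, you may want to check which of these licenses your dataset card declares. The following is a minimal sketch that assumes the license is recorded as an SPDX-style identifier (CDLA-Permissive-2.0 or CC-BY-4.0). The function name and the three result labels are hypothetical and do not describe how the Initiative itself evaluates submissions.

```python
# SPDX identifiers for the two licenses named above.
PREFERRED_LICENSE = "CDLA-Permissive-2.0"
ALSO_USED_LICENSES = {"CC-BY-4.0"}


def classify_license(license_id):
    """Return 'preferred', 'also used', or 'needs review' for a license ID.

    Matching ignores case and surrounding whitespace, because dataset
    hosts often store these identifiers in lowercase.
    """
    normalized = license_id.strip().lower()
    if normalized == PREFERRED_LICENSE.lower():
        return "preferred"
    if normalized in {name.lower() for name in ALSO_USED_LICENSES}:
        return "also used"
    return "needs review"


if __name__ == "__main__":
    for candidate in ("cdla-permissive-2.0", "CC-BY-4.0", "unknown"):
        print(candidate, "->", classify_license(candidate))
```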

By contributing the dataset to the Initiative, you grant anyone a license to the dataset and its data under the Developer Certificate of Origin, Version 1.1 (see also our community contributors page). This does not affect your ownership of, copyrights and other interests in, or rights and title to the dataset and its data.

Contribute Your Dataset

Use this form to tell us about your dataset. It will open your email client with the data added and formatted. After we receive your email, we will follow up with next steps.

For questions, email us at data@thealliance.ai.

Dataset name:
Dataset location:
Dataset card: Leave blank if the README at the dataset location is the dataset card.
Email:
	I agree to the terms for contribution.
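
If the form is unavailable, the same email can be drafted by hand. The sketch below builds a mailto link carrying the form fields listed above. It assumes submissions go to the same data@thealliance.ai address given for questions, and the subject line, body layout, and function name are hypothetical rather than what the form actually generates.

```python
from urllib.parse import quote, urlencode

# Assumption: the contribution email goes to the address listed for questions.
RECIPIENT = "data@thealliance.ai"


def build_contribution_mailto(name, location, contact_email, card="", agreed=False):
    """Build a mailto: URL whose body carries the form fields."""
    if not agreed:
        raise ValueError("You must agree to the terms for contribution.")
    body_lines = [
        f"Dataset name: {name}",
        f"Dataset location: {location}",
        f"Dataset card: {card or '(README at the dataset location)'}",
        f"Email: {contact_email}",
        "I agree to the terms for contribution.",
    ]
    # quote_via=quote encodes spaces as %20; mail clients do not reliably
    # treat '+' as a space. Lines are joined with CRLF, as is usual in mailto bodies.
    query = urlencode(
        {"subject": f"Dataset contribution: {name}", "body": "\r\n".join(body_lines)},
        quote_via=quote,
    )
    return f"mailto:{RECIPIENT}?{query}"


if __name__ == "__main__":
    # Hypothetical example values.
    print(build_contribution_mailto(
        "My Dataset", "https://example.com/my-dataset", "me@example.com", agreed=True
    ))
```

Opening the printed link in a browser should launch your email client with the fields filled in.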