
How to Contribute

There are many ways you can contribute to the Open Trusted Data Initiative.

Report Errors in Our Catalog

See a mistake in our catalog? Send us email, post an issue, or start a discussion.

Help Us Implement Our Data Processing Pipelines

We are building data processing pipelines, for example to evaluate how well datasets match their metadata and their license claims. We discuss these pipelines on the How We Process Datasets page.

Want to learn more? Send us email, check out our planned work, or start a discussion.

Let Us Know of a Dataset We Should Catalog

Do you know of someone else’s open dataset we should catalog? Send us email, post an issue, or start a discussion.

Do you have an open dataset of your own that we should catalog? Let’s discuss!

NOTE: Be sure to read the Dataset Specification details before proceeding. If you have questions or concerns about the specification, please contact us.

Contribution means adding your dataset to our catalog. You continue to own and host the dataset where you see fit.

What Kinds of Datasets Do We Seek?

Broad, effective use of AI requires datasets covering the breadth of human languages, domains, modalities, and target applications. See the current list of keywords we have cataloged.

We have particular interests in these areas:

Science and Industry

  • Climate: Supporting research in climate change, modeling vegetation and water cover, studying agriculture, etc.
  • Marine: Supporting research on and applications targeted towards marine environments.
  • Materials: Known chemical and mechanical properties of chemicals useful for research into potential new and improved materials.
  • Drug Discovery: Known chemical and medicinal properties of chemicals useful for research into potential new and improved pharmaceuticals.
  • Semiconductors: Specific area of materials research focused on improving the state of the art for semiconductor performance and manufacturing.
  • Physics: Data for physical systems.
  • Software: Software code bases and supporting datasets, e.g., vulnerability datasets, analyses of software-related failures, etc.

Other science and industrial domains are welcome, too.

Vertical Domains

  • Finance: Historical market activity and behaviors. Connections to influences like climate, weather events, political events, etc.
  • Healthcare: Everything from synthetic patient data for modeling outcomes, to public literature on known diseases and conditions, to diagnostics results and their analysis.
  • Legal: Jurisdiction-specific data about case law, etc., for specific applications.
  • Social Sciences: Social dynamics, political activity and sentiments, etc.

Some concerns cut across all industries and must be addressed for success:

  • Security: Security vulnerabilities, incidents, etc. for software and other systems, including datasets for red teaming, penetration testing, and other security practices.
  • Safety: AI safety in all its forms, including suppression of hate speech, assistance with harmful activities, and hallucinations.

Modalities

In addition, we welcome datasets with different modalities. Hugging Face attempts to determine the modalities of datasets, but you can also use the tags to indicate modalities, such as the following:

  • Text: especially for under-served languages.
  • Image: i.e., still images.
  • Audio:
  • Video: optionally including audio.
  • Time series: Data for training, tuning, and testing time series models, both general-purpose and for domain-specific applications.
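
As an illustration of tagging modalities yourself, here is a minimal sketch that renders a YAML metadata header for a dataset card. The helper name `modality_front_matter`, the exact tag spellings, and the license identifier shown are assumptions made for this example. Check the Hugging Face dataset card documentation and our Dataset Specification for the values actually expected.

```python
# Hypothetical helper, not part of any library: renders a minimal YAML
# metadata header for a dataset card with modality tags.
# The tag spellings below are assumptions for illustration only.
KNOWN_MODALITIES = ("text", "image", "audio", "video", "timeseries")


def modality_front_matter(license_id, modalities):
    """Return a YAML front-matter block listing the given modality tags."""
    unknown = [m for m in modalities if m not in KNOWN_MODALITIES]
    if unknown:
        raise ValueError(f"unrecognized modalities: {unknown}")
    lines = ["---", f"license: {license_id}", "tags:"]
    lines += [f"- {m}" for m in modalities]
    lines.append("---")
    return "\n".join(lines)


if __name__ == "__main__":
    # Example: a dataset that combines text and audio.
    print(modality_front_matter("cdla-permissive-2.0", ["text", "audio"]))
```

The printed block would go at the top of the dataset card README, before the human-readable description.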

In addition, some industry-specific datasets have their own custom formats.

Synthetic Datasets

For all of the above categories, synthetic data is important for filling gaps, especially in domains where open datasets are hard to find, such as patient data in healthcare.

The Contribution Process

The process follows these steps:

  1. Prepare your contribution: Make sure you meet the Dataset Specification and prepare the dataset card.
  2. Tell us about your dataset: Follow the instructions in Contribute Your Dataset below to submit your dataset for consideration.
  3. Receive feedback from us: After we evaluate the submission, we will provide feedback and request clarifications, where needed.
  4. Be added to our dataset catalog: Once your contribution is accepted, your dataset will be added to our catalog.
  5. Review your details: After publication in our catalog, verify that the information about your dataset is correct.

License

The Open Trusted Data Initiative is focused on obtaining datasets from submitters who either own them or have an unrestricted, free-to-use license from all owners of data included in the dataset. By contributing a dataset to the catalog, you affirm that, with respect to the dataset and all of its data, you either (1) are the owner or (2) have been granted a license by all owner(s) of the data enabling you to license it to others under an acceptable open license, which gives anyone the right to use, modify, copy, and create derivative works of the data and dataset, among other things. Do not contribute any data that was obtained merely by collecting publicly-visible data from the Internet or from other sources that you do not own or to which you do not have a suitable license.

We prefer the Community Data License Agreement - Permissive, Version 2.0, although the Creative Commons License, Version 4.0 - CC BY 4.0 is also sometimes used.
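
If you want to check a dataset's declared license before submitting, a small sketch like the following may help. It compares a license string against the SPDX identifiers for the two licenses named above. The function name `classify_license` and the "needs discussion" outcome are assumptions for this example. They do not describe how the Initiative actually evaluates submissions.

```python
# SPDX identifiers for the two licenses named in the text above.
PREFERRED_LICENSE = "CDLA-Permissive-2.0"
ALSO_USED_LICENSES = {"CC-BY-4.0"}


def classify_license(spdx_id):
    """Classify a license identifier (hypothetical helper for illustration)."""
    normalized = spdx_id.strip().upper()
    if normalized == PREFERRED_LICENSE.upper():
        return "preferred"
    if normalized in {name.upper() for name in ALSO_USED_LICENSES}:
        return "also used"
    # Assumption: anything else is a question to raise with us,
    # not an automatic rejection.
    return "needs discussion"


if __name__ == "__main__":
    for candidate in ("cdla-permissive-2.0", "CC-BY-4.0", "MIT"):
        print(candidate, "->", classify_license(candidate))
```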

By contributing the dataset to the Initiative, you grant anyone a license to the dataset and its data under the Developer Certificate of Origin, Version 1.1 (see also our community repo’s contributing page). This does not affect your ownership, copyrights and other interests, and rights to and title to the dataset and its data.

Contribute Your Dataset

Note: If your dataset is hosted by Hugging Face and you meet our requirements above, we will pick it up automatically for the catalog. You can skip the following form. However, if you host your dataset elsewhere, you will need to tell us about it.

Use this form to tell us about your dataset and where it is hosted. It will open your email client with the data added and formatted. After we receive your email, we will follow up with next steps.
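
If you prefer to compose the email yourself, the following sketch shows one way to build a `mailto:` link with the details pre-filled. The helper name `build_mailto`, the subject line, and the field names are hypothetical. The actual form may collect different fields and format them differently.

```python
from urllib.parse import quote, urlencode


def build_mailto(to_address, subject, fields):
    """Build a mailto: URL whose body lists one 'name: value' pair per line."""
    body = "\n".join(f"{name}: {value}" for name, value in fields.items())
    # quote_via=quote encodes spaces as %20, which mail clients handle
    # more reliably than '+'.
    query = urlencode({"subject": subject, "body": body}, quote_via=quote)
    return f"mailto:{to_address}?{query}"


if __name__ == "__main__":
    # Field names and values here are illustrative assumptions.
    url = build_mailto(
        "data@thealliance.ai",
        "Dataset contribution",
        {"name": "My Dataset", "url": "https://example.com/data"},
    )
    print(url)
```

Opening the resulting URL in a browser should launch your default email client with the subject and body filled in.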

For questions, send us email at data@thealliance.ai.

Yet More Ways to Contribute…

Join the Initiative

See also Join the Open Trusted Data Initiative! on the About Us page.

Contribute to This Website

We welcome your contributions to this website itself. The sources are in the docs directory of this GitHub repo. Please post issues or contribute changes as pull requests. Also, notice that every page has Edit this page on GitHub links, making it easy to go straight to the source of a page to make edits and submit a PR! This is the best way to help us fix typos and make single-page edits.

The repo’s GITHUB_PAGES file explains more details for testing the documentation website locally and for creating more extensive changes as PRs.