How to Contribute to OTDI
There are many ways to contribute. In particular, tell us about other datasets we should catalog!
Tell Us About Other Datasets
Tell us about other datasets using the form below. If the datasets are already hosted at Hugging Face, we have already scanned the metadata for them. However, they won’t appear in the OTDI catalog unless they meet some minimum requirements. For example, they must have a permissive license.
NOTE: Be sure to read the Dataset Specification details before proceeding. If you have questions or concerns about the specification, please contact us. See also the Catalog page, where we discuss common metadata problems that you should avoid.
If you prefer, you can also send us email, post an issue, or start a discussion.
What Kinds of Datasets Do We Seek?
Broad, effective use of AI requires datasets covering the breadth of human languages, domains, modalities, and target applications. See the current list of keywords we have cataloged.
We have particular interests in these areas:
Science and Industry
| Topic | Description |
|---|---|
| Climate | Supporting research in climate change, modeling vegetation and water cover, studying agriculture, etc. |
| Marine | Supporting research on and applications targeted towards marine environments. |
| Materials | Known chemical and mechanical properties of chemicals useful for research into potential new and improved materials. |
| Drug Discovery | Known chemical and medicinal properties of chemicals useful for research into potential new and improved pharmaceuticals. |
| Semiconductors | Specific area of materials research focused on improving the state of the art for semiconductor performance and manufacturing. |
| Physics | Data for physical systems. |
| Software | Software code bases and supporting datasets, e.g., vulnerability datasets, analyses of software-related failures, etc. |
Other science and industrial domains are welcome, too.
Vertical Domains
| Topic | Description |
|---|---|
| Finance | Historical market activity and behaviors. Connections to influences like climate, weather events, political events, etc. |
| Healthcare | Everything from synthetic patient data for modeling outcomes, to public literature on known diseases and conditions, to diagnostics results and their analysis. |
| Legal | Jurisdiction-specific data about case law, etc., for specific applications. |
| Social Sciences | Social dynamics, political activity and sentiments, etc. |
Across industries, some general concerns must be addressed for success:
| Topic | Description |
|---|---|
| Security | Security vulnerabilities, incidents, etc. for software and other systems, including datasets for red teaming, penetration testing, and other security practices. |
| Safety | AI safety in all its forms, including suppression of hate speech, assistance with harmful activities, and hallucinations. |
Modalities
In addition, we welcome datasets with different modalities. Hugging Face attempts to determine the modalities of datasets, but you can also use the tags to indicate modalities, such as the following:
| Topic | Description |
|---|---|
| Text | Especially for under-served languages. |
| Image | I.e., still images. |
| Audio | |
| Video | Optionally including audio. |
| Time Series | Data for training, tuning, and testing time series models, both general-purpose and for domain-specific applications. |
In addition, some industry-specific datasets have their own custom formats.
Synthetic Datasets
For all of the above categories, synthetic data is important for filling gaps, especially in domains where open datasets are hard to find, such as patient data in healthcare.
Are Your Datasets Truly Open?
Think about these aspects of your datasets:
- Permissively licensed? We can’t catalog datasets that have no license or that don’t specify one of the permissive licenses listed here.
- Other requirements are met? See the Dataset Specification and prepare the dataset card accordingly. Note that we don’t yet enforce the requirements shown, except for the license, but we plan to enforce the whole specification, meaning we will filter out datasets that don’t meet its requirements.
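Before submitting, you may want to check locally that your dataset card declares a license. The following is a minimal sketch, not OTDI's actual validation code. It assumes the card is a Hugging Face style `README.md` with YAML front matter containing a single scalar `license` field. The `ASSUMED_PERMISSIVE` set and the function names are hypothetical and used only for illustration; consult the list of permissive licenses referenced above for the real set.

```python
from typing import Optional

# Hypothetical allow-list for illustration only. The authoritative list is
# the set of permissive licenses that OTDI publishes.
ASSUMED_PERMISSIVE = {"apache-2.0", "mit", "cc-by-4.0", "cdla-permissive-2.0"}


def read_license(card_text: str) -> Optional[str]:
    """Return the `license` value from a card's YAML front matter, if any.

    This is a deliberately simple line-based reader, not a full YAML parser.
    It only handles a single scalar value such as `license: apache-2.0`.
    """
    lines = card_text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        return None  # the card has no front matter
    for line in lines[1:]:
        if line.strip() == "---":
            break  # reached the end of the front matter
        key, sep, value = line.partition(":")
        if sep and key.strip() == "license":
            return value.strip().strip("'\"").lower() or None
    return None


def has_permissive_license(card_text: str) -> bool:
    """True if the card declares a license found in the assumed allow-list."""
    license_id = read_license(card_text)
    return license_id is not None and license_id in ASSUMED_PERMISSIVE


example_card = """---
license: apache-2.0
tags:
- climate
---
# My Dataset
"""

print(has_permissive_license(example_card))
```

A card with no front matter, no `license` field, or a license outside the allow-list is reported as not permissive, which mirrors the requirement that a dataset must specify a permissive license to be cataloged.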
Let Us Know About Your Dataset
Note: If your dataset is hosted by Hugging Face and it meets our requirements discussed above, we will pick it up automatically for the catalog, so you can skip the following form. However, we would love to hear from you anyway. If you host your dataset elsewhere, you will need to tell us about it here.
Use this form to tell us about your dataset and where it is hosted. It will open your email client with the data added and formatted. After we receive your email, we will follow up with next steps.
Other Ways to Contribute to OTDI
There are many ways you can contribute to the Open Trusted Data Initiative.
Report Errors in Our Catalog
See a mistake in our catalog? Send us email, post an issue, or start a discussion.
Help Us Implement Our Data Processing Pipelines
We are working on data processing pipelines, e.g., for evaluating how well datasets match their metadata, claims about licenses, etc., which we discuss on the How We Process Datasets page.
Want to learn more? Send us email, check out our planned work, or start a discussion.
Contribute to This Website
We welcome your contributions to this website itself. The sources are in the docs directory of this GitHub repo. Please post issues or contribute changes as pull requests. Also, notice that every page has Edit this page on GitHub links, making it easy to go straight to the source of a page to make edits and submit a PR! This is the best way to help us fix typos and make single-page edits.
The repo’s GITHUB_PAGES file explains in more detail how to test the documentation website locally and how to create more extensive changes as PRs.
Join the Initiative Work Group
See also Join the Open Trusted Data Initiative Work Group! on the About Us page.
