Building the Future of Open, Trusted Data for AI
Join the AI Alliance Open Trusted Data Initiative (OTDI). Our mission is to create a comprehensive, widely sourced catalog of datasets with clear licenses for use, explicit provenance guarantees, and governed transformations, intended for AI model training, tuning, and application patterns like RAG (retrieval-augmented generation) and agents.
In our context, trusted data means that the provenance and governance of a dataset are clear and unambiguous. The dataset's metadata states its intended purposes, safety, and other considerations, along with any filtering and other processing steps that were applied to it.
News:
- December 11, 2024: Added ServiceNow datasets.
- November 20, 2024: BrightQuery joins the AI Alliance and the Open Trusted Data Initiative: LinkedIn announcement.
- November 4, 2024: pleias joins the AI Alliance and the Open Trusted Data Initiative: LinkedIn announcement.
- October 15, 2024: Common Crawl Foundation joins the AI Alliance and the Open Trusted Data Initiative.
Authors | The AI Alliance Open Trusted Data Work Group |
Last Update | V0.2.4, 2025-01-21 |
Why Is Trusted Data Important?
A significant challenge for dataset users today is the lack of clear licenses to use the data, assurances that the data was sourced appropriately (provenance), and trust that the data has been securely and traceably managed (governance).
OTDI aims to address these concerns with an industry-wide effort to specify governance requirements and to catalog and process datasets fully in the open, allowing model developers and users to have full confidence in the provenance and governance of the data they use.
Delivering Trust
What does delivering trust mean? We wish to enable the following:
- Data Exploration: Finding datasets that meet our governance specification and fully support your needs.
- Data Cleaning: Datasets processed for specific objectives (e.g., deduplication or hate speech removal) with open-source data pipelines.
- Data Auditing: End-to-end governance, i.e., traceability, of all activity involving the dataset.
- Data Documentation: Metadata that covers all important aspects of a dataset.
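To make the deduplication example under Data Cleaning concrete, here is a minimal sketch of exact-match deduplication by content hash. It is an illustration only, not the pipeline OTDI uses; the function names `normalize` and `deduplicate` and the normalization rule (lowercasing and collapsing whitespace) are assumptions made for this example.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match.

    This normalization rule is an assumption for the sketch.
    """
    return " ".join(text.lower().split())


def deduplicate(records: list[str]) -> list[str]:
    """Keep the first occurrence of each record, compared by normalized hash."""
    seen: set[str] = set()
    unique: list[str] = []
    for record in records:
        digest = hashlib.sha256(normalize(record).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique
```

Real pipelines often add near-duplicate detection on top of exact matching; this sketch covers only the exact case.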
Our deliverables to the industry will include the following:
- Baseline Knowledge Datasets: Openly accessible, permissively licensed language, code, image, audio, and video data that embodies a diverse range of global knowledge.
- Domain Knowledge Datasets: A rich, comprehensive set of datasets pertinent to tuning foundation models to a set of application domains: legal, finance, chemistry, manufacturing, etc.
- Tooling and Platform Engineering: Hosted pipelines, platform services, and compute capacity for synthetic dataset generation and data preparation at the scale needed to achieve the vision. Fully open-source, so you can use these tools as you see fit.
The Value of Governance
Governance of datasets delivers these benefits:
- Strengthens Trust: Demonstrates a commitment to safeguarding data and enhancing its reputation.
- Boosts Operational Efficiency: Reduces redundancies and inefficiencies by ensuring consistent data management and quality practices.
- Supports Innovation: Having reliable, well-managed data can fuel analytics, AI, and other technological innovations.
- Supports Regulatory Compliance: Helps organizations meet legal and industry-specific requirements (e.g., GDPR, HIPAA) by ensuring data is properly managed.
- Facilitates Accountability: Clarifies stewardship of data, ensuring responsibility for its integrity and usage.
- Enhances Decision-Making: Provides access to trusted, high-quality data, enabling better consumption and outcomes.
Contributing Datasets
If you just want to browse the current catalog, click here.
So, why should you get involved?
- Collaborate on AI Innovation: Your data can help build more accurate, fair, versatile, and open AI models. You can also connect with like-minded data scientists, AI researchers, and industry leaders in the AI Alliance.
- Transparency & Trust: Every contribution is transparent, with robust data provenance, governance, and trust mechanisms. We welcome your expertise to help us improve all aspects of these processes.
- Tailored Contributions: The world needs domain-specific datasets for tuning open foundation models to domains such as time series and branches of science and industrial engineering. The world also needs more multilingual datasets, including for underserved languages, and more multimodal datasets. In many areas, the available real-world data is insufficient to support innovation, so synthetic datasets are also needed.
Next Steps
Interested in contributing a dataset to our catalog? Follow these steps:
- Review our Dataset Specification, including creation of a Hugging Face Dataset Card.
- See How We Process Datasets, i.e., the filtering and analysis steps we perform.
- Finally, visit Contribute Your Dataset and let us know about your dataset.
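Before submitting, it can help to check that your dataset card metadata is complete. The following is a minimal sketch of such a check. The field names in `REQUIRED_FIELDS` are hypothetical placeholders chosen for illustration, not the official list; consult the Dataset Specification for the actual requirements.

```python
# Hypothetical required fields, for illustration only.
# These are NOT the official OTDI Dataset Specification.
REQUIRED_FIELDS = ("license", "source", "processing_steps")


def missing_fields(card_metadata: dict, required=REQUIRED_FIELDS) -> list[str]:
    """Return the required fields that are absent or empty in the metadata."""
    return [field for field in required if not card_metadata.get(field)]


# Example metadata, as it might be parsed from a dataset card.
example = {"license": "cc-by-4.0", "source": "", "language": ["en"]}
print(missing_fields(example))  # ['source', 'processing_steps']
```

An empty result means every field in the list is present and non-empty; it says nothing about whether the values themselves are accurate.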
More Information
- References: More details and other viewpoints on open, trusted data.
- About Us: More about the AI Alliance and this project.
Version History
Version | Date |
---|---|
V0.2.4 | 2025-01-21 |
V0.2.3 | 2025-01-08 |
V0.2.2 | 2024-12-11 |
V0.2.1 | 2024-12-05 |
V0.2.0 | 2024-12-04 |
V0.1.0 | 2024-11-13 |
V0.0.4 | 2024-11-04 |
V0.0.3 | 2024-09-06 |
V0.0.2 | 2024-09-06 |
V0.0.1 | 2024-09-01 |