Why Is Trusted Data Important?

Users of datasets today face a significant challenge: they need clear licenses to use the data, assurances that the data was sourced appropriately (provenance), and trust that the data has been securely and traceably managed (governance).

The Value of Governance

Governance of datasets delivers these benefits:

  • Strengthens Trust: Demonstrates a commitment to safeguarding data, which enhances the reputation of the organization that manages it.
  • Boosts Operational Efficiency: Reduces redundancies and inefficiencies by ensuring consistent data management and quality practices.
  • Supports Innovation: Provides reliable, well-managed data that can fuel analytics, AI, and other technological innovations.
  • Aids Regulatory Compliance: Helps organizations meet legal and industry-specific requirements (e.g., GDPR, HIPAA) by ensuring data is properly managed.
  • Facilitates Accountability: Clarifies stewardship of data, ensuring responsibility for its integrity and usage.
  • Enhances Decision-Making: Trusted, high-quality data enables easier consumption and more effective applications.

Delivering Trust

The Open Trusted Data Initiative (OTDI) addresses these concerns with an industry-wide effort to specify trustworthiness criteria and to catalog compliant datasets, so that model builders and other users can have full confidence in the openness, provenance, and governance of the data they use.

Specifically, we are implementing the following:

  • Data Discovery: We are finding datasets that meet our governance criteria.
  • Data Exploration: We are making the catalog of trustworthy datasets easy to browse and search, so you can find the datasets that support your needs. This means tracking important metadata for each dataset.
  • Data Auditing: For every dataset, we explore its provenance and how it was governed.
  • Data Cleaning: We are building derived datasets processed for specific objectives, such as minimizing duplication and removing hate speech, using open-source data processing pipelines with full governance.
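
To make the activities above more concrete, here is a minimal Python sketch of two of them: checking a dataset's metadata against governance criteria, and removing duplicates during cleaning. It is an illustration only, not OTDI's actual pipeline. The `DatasetRecord` fields, the `ALLOWED_LICENSES` set, and the `meets_criteria` and `deduplicate` helpers are all hypothetical names chosen for this example, and the deduplication shown is simple exact matching on normalized text rather than any particular production technique.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical catalog entry; field names are illustrative only."""
    name: str
    license: str       # e.g. an SPDX identifier such as "CC-BY-4.0"
    source: str        # where the data came from (provenance)
    modalities: tuple  # e.g. ("text",)


# Illustrative allow-list, NOT the initiative's actual list of accepted licenses.
ALLOWED_LICENSES = {"CC-BY-4.0", "CDLA-Permissive-2.0", "Apache-2.0"}


def meets_criteria(record: DatasetRecord) -> bool:
    """Discovery/auditing check: an allowed license and a non-empty source."""
    return record.license in ALLOWED_LICENSES and bool(record.source.strip())


def deduplicate(texts):
    """Cleaning step: drop exact duplicates after normalizing case and whitespace."""
    seen = set()
    unique = []
    for text in texts:
        normalized = " ".join(text.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)  # keep the first occurrence, unmodified
    return unique


catalog = [
    DatasetRecord("example-a", "CC-BY-4.0", "https://example.org/a", ("text",)),
    DatasetRecord("example-b", "unknown", "", ("text",)),
]
accepted = [r.name for r in catalog if meets_criteria(r)]
print(accepted)  # ['example-a']

print(deduplicate(["Hello  world", "hello world", "Goodbye"]))
# ['Hello  world', 'Goodbye']
```

Hashing the normalized text keeps memory use bounded by the number of distinct records rather than their size. A real pipeline would likely add near-duplicate detection and record each transformation for governance purposes.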

Our deliverables to the industry include the following:

  • Baseline Knowledge Datasets: Openly accessible and permissively licensed text, code, image, audio, and video data that embodies a diverse range of global knowledge.
  • Domain-specific Datasets: Comprehensive collections of datasets for tuning foundation models for target domains and applications.
  • Tooling and Platform Engineering: Hosted pipelines, platform services, and compute capacity for synthetic dataset generation and data preparation at the scale needed to achieve the vision. These tools are fully open source, so you can use them as you see fit.

Add Your Dataset to the Catalog

Interested in adding your dataset to our catalog? Follow these steps:

  1. Review our Dataset Criteria.
  2. See How We Process Datasets.
  3. Visit Contribute Your Dataset.