
Open Trusted Data Initiative (OTDI)

We are building the world’s largest, most diverse collection of open and transparently sourced datasets for AI. Join us!

Datasets for Languages

Datasets in different human languages.

Subcategories

  • African Languages
  • Languages in the Americas
  • Asian Languages
  • European Languages
  • Languages in the Middle East
  • Languages of the Pacific Islands and Nations

Datasets for Domains

Domains like chemistry, healthcare, etc.

Keywords

Advertising, Agriculture, Art, Astronomy, Automation, Banking, Biology, Chemistry, Climate, Code, Economics, Education, Environment, Fashion, Finance, Food, Game, Geospatial, Government, History, Insurance, Legal, Logic, Mathematics, Medical, Music, Philosophy, Physics, Politics, Psychology, Robotics, Science, Sports, Time Series, Web

Datasets for Modalities

Modalities include text and video, along with widely applicable concepts such as data formats and how the data was collected or transformed from other data (e.g., see text-to-...). This category also covers general usage guidance, such as data intended for pretraining, reinforcement learning, or chain of thought.

Keywords

3D Agents Alignment Arrow Arxiv Audio Benchmark Classification Chain Of Thought Chat Crowd Sourced CSV Embeddings Evaluation Fine Tuning Generated Data Feature Extraction Graph Handwritten Image Instruction Following LLM JSON Monolingual Multi Lingual Multimodal Multiple Choice Named Entity Recognition News NLP Planning Pretraining Problem Solving Prompt Question Answering RAG Reasoning Regression Reinforcement Learning Safety Search Security Sentence Similarity Sentence Transformers Sentiment Analysis Speech Summarization Tabular Retrieval Text To … To Text Translation Tutorial Unlearning Video Vision Wikipedia

Help Us Build the Future of Trustworthy Data for AI

The mission of the Open Trusted Data Initiative (OTDI) is to create a comprehensive, widely sourced catalog of datasets with clear licenses for use, explicit provenance guarantees, and governed lifecycles. These datasets are suitable for AI model training, tuning, and application patterns like RAG (retrieval-augmented generation) and agents.

What Does Trusted Data Mean?

Is the provenance and governance of a dataset clear and unambiguous? Does the metadata about the dataset provide clarity about its intended purposes, safety, and other considerations? What sources and processing were used to create the dataset?

Creating a catalog of trusted data involves several projects. We welcome your contributions:

Define the Criteria for Open and Trustworthy Data

Our definition of these criteria is evolving. Help us refine them.

Find and Catalog Datasets for Diverse Topics

AI models and applications need datasets covering a broad range of topics, including:

  • Text: Especially for under-served languages
  • Multimedia: Images, video, audio
  • Time series: General purpose and domain-specific
  • Science and Technology: Materials, drug discovery, geospatial, physics, etc.
  • Specific domains and use cases: Healthcare, legal, financial, education, chat bots, etc.
  • Synthetic datasets: For all of the above categories, synthetic datasets are needed, too.

Add your datasets to our catalog.

Build Data Processing Pipelines

Data pipelines are used to validate datasets proposed for inclusion in our catalog and to derive new datasets specialized for particular purposes. Are you a data processing expert? We need your help.

Build a Searchable Dataset Catalog

Currently, the Dataset Catalog is a static resource. Help us make it browsable and searchable.

See this short presentation (PDF) for more information about the Open Trusted Data Initiative.

More Information

  • What trustworthiness means to us.
  • About Us: More about the AI Alliance, this initiative, how to get involved, and how to contact us.
  • References: Other viewpoints on open, trusted data.