
Open Trusted Data Initiative (OTDI)

We are building the world’s largest, most diverse catalog of open and transparently sourced datasets for AI. Join us!

Datasets for Languages

Datasets in different human languages.

Subcategories

African Languages
Languages in the Americas
Asian Languages
European Languages
Languages in the Middle East
Languages of the Pacific Islands and Nations

Datasets for Domains

Domains like chemistry, healthcare, etc.

Keywords

Advertising, Agriculture, Art, Astronomy, Automation, Banking, Biology, Chemistry, Climate, Code, Economics, Education, Environment, Fashion, Finance, Food, Game, Geospatial, Government, History, Insurance, Legal, Logic, Mathematics, Medical, Music, Philosophy, Physics, Politics, Psychology, Robotics, Science, Sports, Time Series, Web

Datasets for Modalities

Modalities include text and video, along with widely applicable concepts such as data formats and how the data was collected or transformed from other data (e.g., see text-to-...), as well as general usage guidance, such as data intended for pretraining, reinforcement learning, or chain of thought.

Keywords

3D, Agents, Alignment, Arrow, Arxiv, Audio, Benchmark, Classification, Chain Of Thought, Chat, Crowd Sourced, CSV, Embeddings, Evaluation, Fine Tuning, Generated Data, Feature Extraction, Graph, Handwritten, Image, Instruction Following, LLM, JSON, Monolingual, Multi Lingual, Multimodal, Multiple Choice, Named Entity Recognition, News, NLP, Planning, Pretraining, Problem Solving, Prompt, Question Answering, RAG, Reasoning, Regression, Reinforcement Learning, Safety, Search, Security, Sentence Similarity, Sentence Transformers, Sentiment Analysis, Speech, Summarization, Tabular, Retrieval, Text To …, … To Text, Translation, Tutorial, Unlearning, Video, Vision, Wikipedia

Help Us Build the Future of Trustworthy Data for AI

The mission of Open Trusted Data Initiative (OTDI) is to create a comprehensive, widely-sourced catalog of datasets with these qualities:

Clear licenses for use
Explicit provenance guarantees
Governed life cycles

These datasets are needed for AI model training and tuning, as well as for domain-specific applications that use agents, RAG (retrieval-augmented generation), and other “patterns”.

What Does Trusted Data Mean?

Is the provenance and governance of a dataset clear and unambiguous? Does the metadata about the dataset provide clarity about its intended purposes, safety, and other considerations? What sources and processing were used to create the dataset?

Creating a catalog of trusted data involves several projects. We welcome your contributions:

Define the Criteria for Open and Trustworthy Data

Our definition of these criteria is evolving. Help us refine them.

Find and Catalog Datasets for Diverse Topics

AI models and applications need datasets covering a broad range of topics including:

Text: Especially for under-served languages
Multimedia: Images, video, audio
Time series: General purpose and domain-specific
Science and Technology: Materials, drug discovery, geospatial, physics, etc.
Specific domains and use cases: Healthcare, legal, financial, education, chat bots, etc.
Synthetic datasets: For all of the above categories, synthetic datasets are needed, too.

Add your datasets to our catalog.

Build Data Processing Pipelines

Data pipelines are used to validate datasets proposed for inclusion in our catalog and to derive new datasets specialized for particular purposes. Are you a data processing expert? We need your help.
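As a rough illustration of what a validation step could look like, here is a minimal Python sketch that checks a dataset's metadata for the three qualities named in our mission: a license, provenance, and governance information. The `DatasetCard` class, its field names, and `validate_card` are hypothetical and exist only for this example; they are not the initiative's actual schema or pipeline code.

```python
from dataclasses import dataclass, field


# Hypothetical metadata record. The field names are illustrative
# assumptions, not the initiative's actual schema.
@dataclass
class DatasetCard:
    name: str
    license: str = ""
    provenance: str = ""
    governance: str = ""
    keywords: list[str] = field(default_factory=list)


def validate_card(card: DatasetCard) -> list[str]:
    """Return a list of problems; an empty list means the card passes."""
    problems = []
    # Each required quality must be present and not just whitespace.
    if not card.license.strip():
        problems.append("missing license")
    if not card.provenance.strip():
        problems.append("missing provenance")
    if not card.governance.strip():
        problems.append("missing governance")
    return problems
```

A real pipeline would likely go further, for example by checking the license against an approved list or verifying the stated sources, but the same shape applies: each check either passes or reports a problem that blocks inclusion.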

Build a Searchable Dataset Catalog

Currently, the Dataset Catalog is a static resource. Help us make it browsable and searchable.
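One simple way to approach keyword search is an inverted index that maps each keyword to the datasets tagged with it. The Python sketch below assumes each catalog entry is a dictionary with `name` and `keywords` fields; that layout and the function names `build_index` and `search` are assumptions made for illustration, not the catalog's actual format.

```python
from collections import defaultdict


def build_index(catalog):
    """Map each lowercased keyword to the set of dataset names tagged with it."""
    index = defaultdict(set)
    for entry in catalog:
        for keyword in entry["keywords"]:
            index[keyword.lower()].add(entry["name"])
    return index


def search(index, *keywords):
    """Return the sorted names of datasets tagged with all given keywords."""
    if not keywords:
        return []
    # Look up each keyword without adding empty entries to the index.
    matches = [index.get(k.lower(), set()) for k in keywords]
    return sorted(set.intersection(*matches))
```

Calling `search(index, "Audio", "Speech")` would return only the datasets tagged with both keywords, regardless of letter case. A production catalog would probably add full-text search and filtering by license or language, which this sketch does not attempt.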

For More Information

See this short presentation (PDF) for more information about the Open Trusted Data Initiative.

What trustworthiness means to us.
Our current catalog.
About Us: More about the AI Alliance, this initiative, how to get involved, and how to contact us.
References: Other viewpoints on open, trusted data.
