
Open Trusted Data Initiative (OTDI)
We are building the world’s largest, most diverse collection of open and transparently sourced datasets for AI. Join us!
Datasets for Languages
Subcategories
African Languages, Languages in the Americas, Asian Languages, European Languages, Languages in the Middle East, Languages of the Pacific Islands and Nations
Datasets for Domains
Keywords
Advertising, Agriculture, Art, Astronomy, Automation, Banking, Biology, Chemistry, Climate, Code, Economics, Education, Environment, Fashion, Finance, Food, Game, Geospatial, Government, History, Insurance, Legal, Logic, Mathematics, Medical, Music, Philosophy, Physics, Politics, Psychology, Robotics, Science, Sports, Time Series, Web
Datasets for Modalities
Modalities such as text and video; widely applicable concepts, like data formats and how the data was collected or transformed from other data (e.g., see text-to-...); and general usage guidance, like data intended for pretraining, reinforcement learning, chain of thought, etc.
Keywords
3D, Agents, Alignment, Arrow, Arxiv, Audio, Benchmark, Classification, Chain Of Thought, Chat, Crowd Sourced, CSV, Embeddings, Evaluation, Fine Tuning, Generated Data, Feature Extraction, Graph, Handwritten, Image, Instruction Following, LLM, JSON, Monolingual, Multi Lingual, Multimodal, Multiple Choice, Named Entity Recognition, News, NLP, Planning, Pretraining, Problem Solving, Prompt, Question Answering, RAG, Reasoning, Regression, Reinforcement Learning, Safety, Search, Security, Sentence Similarity, Sentence Transformers, Sentiment Analysis, Speech, Summarization, Tabular, Retrieval, Text To … To Text, Translation, Tutorial, Unlearning, Video, Vision, Wikipedia
Help Us Build the Future of Trustworthy Data for AI
The mission of the Open Trusted Data Initiative (OTDI) is to create a comprehensive, widely sourced catalog of datasets with clear licenses for use, explicit provenance guarantees, and governed lifecycles. These datasets are suitable for AI model training, tuning, and application patterns like RAG (retrieval-augmented generation) and agents.
What Does Trusted Data Mean?
Is the provenance and governance of a dataset clear and unambiguous? Does the metadata about the dataset provide clarity about its intended purposes, safety, and other considerations? What sources and processing were used to create the dataset?
Creating a catalog of trusted data involves several projects. We welcome your contributions:
Define the Criteria for Open and Trustworthy Data
Our definition of these criteria is evolving. Help us refine them.
Find and Catalog Datasets for Diverse Topics
AI models and applications need datasets covering a broad range of topics including:
- Text: Especially for underserved languages
- Multimedia: Images, video, audio
- Time series: General purpose and domain-specific
- Science and Technology: Materials, drug discovery, geospatial, physics, etc.
- Specific domains and use cases: Healthcare, legal, financial, education, chatbots, etc.
- Synthetic datasets: For all of the above categories, synthetic datasets are needed, too.
Add your datasets to our catalog.
Build Data Processing Pipelines
Data pipelines are used to validate datasets proposed for inclusion in our catalog and to derive new datasets specialized for particular purposes. Are you a data processing expert? We need your help.
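To give a feel for what a validation step in such a pipeline might look like, here is a minimal Python sketch. It is not the initiative's actual pipeline: the required field names (`name`, `license`, `provenance`), the `validate_metadata` helper, and the sample record are all hypothetical, chosen only to mirror the license and provenance requirements described above.

```python
from dataclasses import dataclass, field

# Hypothetical required fields; the real criteria are still evolving.
REQUIRED_FIELDS = ("name", "license", "provenance")


@dataclass
class ValidationResult:
    name: str
    problems: list[str] = field(default_factory=list)

    @property
    def ok(self) -> bool:
        # A dataset passes only if no problems were recorded.
        return not self.problems


def validate_metadata(metadata: dict) -> ValidationResult:
    """Flag missing or empty required fields in a proposed dataset's metadata."""
    result = ValidationResult(name=str(metadata.get("name", "<unnamed>")))
    for key in REQUIRED_FIELDS:
        value = metadata.get(key)
        if value is None or not str(value).strip():
            result.problems.append(f"missing or empty field: {key}")
    return result


# Illustrative record only; it does not describe a real catalog entry.
candidate = {"name": "example-corpus", "license": "CDLA-Permissive-2.0", "provenance": ""}
report = validate_metadata(candidate)
print(report.ok, report.problems)
```

A real pipeline would likely go further, for example by checking the license against an allow-list or inspecting the data itself, but the shape is the same: each check records a problem, and a dataset is accepted only when none remain.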
Build a Searchable Dataset Catalog
Currently, the Dataset Catalog is a static resource. Help us make it browsable and searchable.
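As a rough illustration of keyword search over catalog entries, the Python sketch below filters an in-memory list by keyword. The entry names, the keyword sets, and the `search` function are assumptions made for this example; they do not reflect the catalog's real schema or contents.

```python
# Hypothetical catalog entries; names and keywords are illustrative only.
CATALOG = [
    {"name": "example-swahili-news", "keywords": {"african languages", "news", "text"}},
    {"name": "example-weather-series", "keywords": {"climate", "time series"}},
    {"name": "example-legal-qa", "keywords": {"legal", "question answering", "text"}},
]


def search(catalog, *terms):
    """Return names of entries whose keywords include every search term (case-insensitive)."""
    wanted = {term.strip().lower() for term in terms}
    # Set inclusion: every wanted term must appear among the entry's keywords.
    return [entry["name"] for entry in catalog if wanted <= entry["keywords"]]


print(search(CATALOG, "Text"))
print(search(CATALOG, "text", "legal"))
```

Requiring every term to match keeps results narrow; a browsable catalog might instead rank entries by the number of matching keywords.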
See this short presentation (PDF) for more information about the Open Trusted Data Initiative.
More Information
- What trustworthiness means to us.
- About Us: More about the AI Alliance, this initiative, how to get involved, and how to contact us.
- References: Other viewpoints on open, trusted data.