Open Data and Model Foundry Projects
Collaborate, experiment, and build the datasets and models essential to agent-based AI applications.
The Open Data and Model Foundry addresses key needs for customized, domain-specific datasets and models.
Projects for Open Trusted Data and Tooling
Good datasets are essential for building good models and applications. The AI Alliance is cataloging, and in some cases building, datasets that have clear licenses for open use, backed by unambiguous provenance and governance constraints.
| Links | Description |
|---|---|
| The Open, Trusted Data Initiative | Open data has a clear license for use, across a wide range of topic areas, with clear provenance and governance. OTDI seeks to clarify the criteria for openness and to catalog the world’s datasets that meet those criteria. |
| SYNTH Initiative | The SYNTH Initiative aims to address the critical gap in open-source AI development by creating a cutting-edge, open-source data corpus for training sovereign AI models and advanced AI agents. This involves curating permissively licensed, high-quality multimodal and multilingual datasets, with a focus on underrepresented languages, and generating synthetic data specifically designed to enhance frontier-level reasoning capabilities in these languages. The ultimate mission is to enable global access to advanced AI reasoning by fostering an inclusive data ecosystem that supports the full training pipeline of sophisticated models and agents. |
| Docling | Docling simplifies document processing, parsing diverse formats (including advanced PDF understanding) and providing seamless integrations with the gen AI ecosystem. Docling is a key tool for the project Parsing PDFs to Build AI Datasets for Science, discussed above. (Principal developer: IBM Research) |
Open Models and Tooling for New Domains and Modalities
The AI Alliance is building new models for many domains and modalities at the intersection of research and engineering. Our projects include models for industrial AI, molecular discovery, geospatial, and time series applications.
| Links | Description |
|---|---|
| Open Models | Several AI Alliance work groups are collaborating on the development of domain-specific models. |
| TerraTorch | TerraTorch is a library based on PyTorch Lightning and the TorchGeo domain library for geospatial data. (Principal developer: IBM Research) |
| GEO-Bench | GEO-Bench is a General Earth Observation benchmark for evaluating the performance of large pre-trained models on geospatial data. (Principal developer: ServiceNow) |
