Link Search Menu Expand Document

Open Data and Model Foundry Projects

Collaborate, experiment, and build data sets and models essential for building agent-based AI applications.

The Open Data and Model Foundry addresses key needs for customized, domain-specific data sets and models.

Projects for Open Trusted Data and Tooling

Good datasets are essential for building good models and applications. The AI Alliance is cataloging datasets, and in some cases building them, that have clear licenses for open use, backed by unambiguous provenance and governance constraints.

Links Description
The Open, Trusted Data Initiative AI Alliance icon
Open data has clear license for use, across a wide range of topic areas, with clear provenance and governance. OTDI seeks to clarify the criteria for openness and catalog the world’s datasets that meet the criteria. Our projects: See also the SYNTH Initiative in the next row!
SYNTH Initiative AI Alliance icon
The SYNTH Initiative aims to address the critical gap in open-source AI development by creating a cutting-edge, open-source data corpus for training sovereign AI models and advanced AI agents. This involves curating permissively licensed, high-quality multimodal and multilingual datasets, with a focus on underrepresented languages, and generating synthetic data specifically designed to enhance frontier-level reasoning capabilities in these languages. The ultimate mission is to enable global access to advanced AI reasoning by fostering an inclusive data ecosystem that supports the full training pipeline of sophisticated models and agents.
Docling
Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem. Docling is a key tool for the project Parsing PDFs to Build AI Datasets for Science, discussed above. (Principal developer: IBM Research)

Open Models and Tooling for New Domains and Modalities

The AI Alliance is building new models for many domains and modalities at the intersection of research and engineering. Our projects include models for industrial AI, molecular discovery, geospatial, and time series applications.

Links Description
Open Models
Several AI Alliance work groups are collaborating on the development of domain-specific models:
  • Semikong - The world's first open LLM tuned specifically for the semiconductor industry. (Principal developers: Aitomatic, Tokyo Electron Ltd., FPT Software, and The AI Alliance)
  • Llamarine - An LLM tuned specifically for the needs of the maritime shipping industry.
  • Materials and Chemistry work group (Several developers, including IBM Research):
    • smi-ted - SMILES-based Transformer Encoder-Decoder (SMILES-TED) is an encoder-decoder model pre-trained on a curated dataset of 91 million SMILES samples sourced from PubChem, equivalent to 4 billion molecular tokens. SMI-TED supports various complex tasks, including quantum property prediction, with two main variants (289M and 8×289M).
    • selfies-ted - SMI-SSED (SMILES-SSED) is a Mamba-based encoder-decoder model pre-trained on a curated dataset of 91 million SMILES samples, encompassing 4 billion molecular tokens sourced from PubChem. The model is tailored for complex tasks such as quantum property prediction and offers efficient, high-speed inference capabilities.
    • mhg-ged - SELFIES-based Transformer Encoder-Decoder (SELFIES-TED) is an encoder-decoder model based on BART that not only learns molecular representations but also auto-regressively generates molecules. Pre-trained on a dataset of ~1B molecules from PubChem and Zinc-22.
    • smi-ssed - Molecular Hypergraph Grammar with Graph-based Encoder Decoder (MHG-GED) is an autoencoder that combines a GNN-based encoder with a sequential MHG-based decoder. The GNN encodes molecular input to achieve strong predictive performance on molecular graphs, while the MHG decodes structurally valid molecules. Pre-trained on a dataset of ~1.34M molecules curated from PubChem.
  • More to be announced soon.
TerraTorch
TerraTorch is a library based on PyTorch Lightning and the TorchGeo domain library for geospatial data. (Principal developer: IBM Research)
GEO-bench
GEO-Bench is a General Earth Observation benchmark for evaluating the performance of large pre-trained models on geospatial data. (Principal developer: ServiceNow)