
Open Models and Data Projects

The Open Models and Data Projects address key needs for customized, domain-specific models and datasets, while respecting concerns about sovereignty and governance.

Project Tapestry

Project Tapestry is a global initiative to build and tune foundation models with full and flexible support for sovereignty concerns.

Project Tapestry¹
The AI Alliance launched Project Tapestry to build a collaborative foundation for open and sovereign AI. Project Tapestry will be an open-source platform designed to enable globally federated development of frontier open models while preserving sovereignty, local control, and long-term independence.

Projects for Open Trusted Data and Tooling

Good datasets are essential for building good models and applications. The AI Alliance is cataloging, and in some cases building, datasets that have clear licenses for open use, backed by unambiguous provenance and governance constraints.

The Open, Trusted Data Initiative¹
Open data has a clear license for use, across a wide range of topic areas, with clear provenance and governance. OTDI seeks to clarify the criteria for openness and to catalog the world's datasets that meet those criteria. See also the SYNTH Initiative below.
SYNTH Initiative¹
The SYNTH Initiative aims to address the critical gap in open-source AI development by creating a cutting-edge, open-source data corpus for training sovereign AI models and advanced AI agents. This involves curating permissively licensed, high-quality multimodal and multilingual datasets, with a focus on underrepresented languages, and generating synthetic data specifically designed to enhance frontier-level reasoning capabilities in these languages. The ultimate mission is to enable global access to advanced AI reasoning by fostering an inclusive data ecosystem that supports the full training pipeline of sophisticated models and agents.
Docling
Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem. Docling is a key tool for the project Parsing PDFs to Build AI Datasets for Science, discussed above. (Principal developer: IBM Research)

¹ Indicates an Alliance core project.
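The openness criteria above (a clear license, documented provenance, and named governance) can be sketched as a simple metadata filter. The field names and the license allowlist below are illustrative assumptions for this sketch, not OTDI's actual schema or criteria:

```python
# Illustrative sketch of an "open dataset" filter in the spirit of OTDI.
# The metadata fields and license allowlist are assumptions, not the
# actual OTDI schema.
OPEN_LICENSES = {"cc0-1.0", "cc-by-4.0", "apache-2.0", "mit"}

def is_open_dataset(meta: dict) -> bool:
    """True only if all three criteria hold: a recognized open license,
    stated provenance, and a named governance steward."""
    return (
        meta.get("license", "").lower() in OPEN_LICENSES
        and bool(meta.get("provenance"))   # where the data came from
        and bool(meta.get("governance"))   # who maintains/stewards it
    )

catalog = [
    {"name": "good-corpus", "license": "CC-BY-4.0",
     "provenance": "crawl documented 2024", "governance": "Example Org"},
    {"name": "mystery-dump", "license": "unknown", "provenance": ""},
]

accepted = [d["name"] for d in catalog if is_open_dataset(d)]
print(accepted)  # → ['good-corpus']
```

The point of the sketch is that all three checks are conjunctive: a permissive license alone does not make a dataset "open" in the OTDI sense if provenance or governance is missing.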

Open Models and Tooling for New Domains and Modalities

The AI Alliance is building new models for many domains and modalities at the intersection of research and engineering. Our projects include models for industrial AI, molecular discovery, geospatial, and time series applications.

Open Models
Several AI Alliance work groups are collaborating on the development of domain-specific models:
  • Semikong - The world's first open LLM tuned specifically for the semiconductor industry. (Principal developers: Aitomatic, Tokyo Electron Ltd., FPT Software, and The AI Alliance)
  • Llamarine - An LLM tuned specifically for the needs of the maritime shipping industry.
  • Materials and Chemistry work group (Several developers, including IBM Research):
    • smi-ted - SMILES-based Transformer Encoder-Decoder (SMILES-TED) is an encoder-decoder model pre-trained on a curated dataset of 91 million SMILES samples sourced from PubChem, equivalent to 4 billion molecular tokens. SMI-TED supports various complex tasks, including quantum property prediction, with two main variants (289M and 8×289M).
    • selfies-ted - SELFIES-based Transformer Encoder-Decoder (SELFIES-TED) is an encoder-decoder model based on BART that not only learns molecular representations but also auto-regressively generates molecules. Pre-trained on a dataset of ~1B molecules from PubChem and Zinc-22.
    • mhg-ged - Molecular Hypergraph Grammar with Graph-based Encoder-Decoder (MHG-GED) is an autoencoder that combines a GNN-based encoder with a sequential MHG-based decoder. The GNN encodes molecular input to achieve strong predictive performance on molecular graphs, while the MHG decodes structurally valid molecules. Pre-trained on a dataset of ~1.34M molecules curated from PubChem.
    • smi-ssed - SMI-SSED (SMILES-SSED) is a Mamba-based encoder-decoder model pre-trained on a curated dataset of 91 million SMILES samples, encompassing 4 billion molecular tokens sourced from PubChem. The model is tailored for complex tasks such as quantum property prediction and offers efficient, high-speed inference capabilities.
  • More to be announced soon.
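To make the "molecular tokens" counted above concrete: SMILES-based models like SMI-TED first split a SMILES string into atom-level tokens. The sketch below uses a regex pattern common in the chemistry-NLP literature; it is illustrative and not the exact tokenizer these models ship with:

```python
import re

# Minimal atom-level SMILES tokenizer, a typical preprocessing step for
# SMILES-based models such as SMI-TED. The regex is a widely used pattern
# from the chemistry-NLP literature, shown here for illustration only.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|N|O|S|P|F|I|B|C|n|o|s|p|c|b|"
    r"\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%\d{2}|\d)"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into atom/bond/ring tokens; the round-trip
    check guards against characters the pattern does not recognize."""
    tokens = SMILES_PATTERN.findall(smiles)
    assert "".join(tokens) == smiles, "unrecognized characters in SMILES"
    return tokens

print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin: 21 tokens
```

Note that two-letter elements (Br, Cl) must come before single letters in the alternation, and bracket atoms like [NH4+] are kept as single tokens; the 4-billion-token corpus sizes quoted above count tokens at roughly this granularity.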
TerraTorch
TerraTorch is a library based on PyTorch Lightning and the TorchGeo domain library for geospatial data. (Principal developer: IBM Research)
GEO-Bench
GEO-Bench is a General Earth Observation benchmark for evaluating the performance of large pre-trained models on geospatial data. (Principal developer: ServiceNow)
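Benchmarks like GEO-Bench score models on Earth-observation tasks such as land-cover segmentation, typically with metrics like per-class intersection-over-union (IoU). The sketch below is plain Python for clarity and is not GEO-Bench's own evaluation code; the toy class labels are invented for illustration:

```python
# Illustrative per-class IoU, a standard metric for the segmentation
# tasks that Earth-observation benchmarks include. Plain Python for
# clarity; not GEO-Bench's actual evaluation code.

def iou_per_class(pred, target, num_classes):
    """Intersection-over-union for each class over flat label sequences."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        ious.append(inter / union if union else float("nan"))
    return ious

# Toy flattened "masks" with two invented land-cover classes
# (0 = water, 1 = forest).
pred   = [0, 0, 1, 1, 1, 0]
target = [0, 1, 1, 1, 0, 0]
print(iou_per_class(pred, target, 2))  # → [0.5, 0.5]
```

IoU penalizes both false positives and false negatives per class, which is why it is preferred over plain pixel accuracy when classes are imbalanced, as they often are in satellite imagery.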