
Open Models and Data Projects

The Open Models and Data Projects address key needs for customized, domain-specific models and datasets, while respecting concerns about sovereignty and governance.

Project Tapestry

Project Tapestry is a global initiative to build and tune foundation models with full and flexible support for sovereignty concerns.

Project Tapestry¹
The AI Alliance launched Project Tapestry to build a collaborative foundation for open and sovereign AI. Project Tapestry will be an open-source platform designed to enable globally federated development of frontier open models while preserving sovereignty, local control, and long-term independence.

Projects for Open Trusted Data and Tooling

Good datasets are essential for building good models and applications. The AI Alliance is cataloging, and in some cases building, datasets that have clear licenses for open use, backed by unambiguous provenance and governance constraints.

The Open, Trusted Data Initiative¹
Open data has a clear license for use, across a wide range of topic areas, with clear provenance and governance. OTDI seeks to clarify the criteria for openness and to catalog the world's datasets that meet those criteria. See also the SYNTH Initiative below.
SYNTH Initiative¹
The SYNTH Initiative aims to address the critical gap in open-source AI development by creating a cutting-edge, open-source data corpus for training sovereign AI models and advanced AI agents. This involves curating permissively licensed, high-quality multimodal and multilingual datasets, with a focus on underrepresented languages, and generating synthetic data specifically designed to enhance frontier-level reasoning capabilities in these languages. The ultimate mission is to enable global access to advanced AI reasoning by fostering an inclusive data ecosystem that supports the full training pipeline of sophisticated models and agents.
Docling
Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem. Docling is a key tool for the project Parsing PDFs to Build AI Datasets for Science, discussed above. (Principal developer: IBM Research)

¹ Indicates an Alliance core project.
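The openness criteria above (a clear license, documented provenance, and named governance) can be sketched as a simple metadata filter. The field names and the license allowlist below are illustrative assumptions for this sketch, not OTDI's actual schema or criteria:

```python
# Illustrative sketch of an "open dataset" filter in the spirit of OTDI.
# The metadata fields and license allowlist are assumptions, not the
# actual OTDI schema.
OPEN_LICENSES = {"cc0-1.0", "cc-by-4.0", "apache-2.0", "mit"}

def is_open_dataset(meta: dict) -> bool:
    """True only if all three criteria hold: a recognized open license,
    stated provenance, and a named governance steward."""
    return (
        meta.get("license", "").lower() in OPEN_LICENSES
        and bool(meta.get("provenance"))   # where the data came from
        and bool(meta.get("governance"))   # who maintains/stewards it
    )

catalog = [
    {"name": "good-corpus", "license": "CC-BY-4.0",
     "provenance": "crawl documented 2024", "governance": "Example Org"},
    {"name": "mystery-dump", "license": "unknown", "provenance": ""},
]

accepted = [d["name"] for d in catalog if is_open_dataset(d)]
print(accepted)  # → ['good-corpus']
```

The point of the sketch is that all three checks are conjunctive: a permissive license alone does not make a dataset "open" in the OTDI sense if provenance or governance is missing.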

Open Models and Tooling for New Domains and Modalities

The AI Alliance is building new models for many domains and modalities at the intersection of research and engineering. Our projects include models for industrial AI, molecular discovery, geospatial, and time series applications.

Open Models
Several AI Alliance work groups are collaborating on the development of domain-specific models:
  • Semikong - The world's first open LLM tuned specifically for the semiconductor industry. (Principal developers: Aitomatic, Tokyo Electron Ltd., FPT Software, and The AI Alliance)
  • Llamarine - An LLM tuned specifically for the needs of the maritime shipping industry.
  • Materials and Chemistry work group (Several developers, including IBM Research):
    • smi-ted - SMILES-based Transformer Encoder-Decoder (SMILES-TED) is an encoder-decoder model pre-trained on a curated dataset of 91 million SMILES samples sourced from PubChem, equivalent to 4 billion molecular tokens. SMI-TED supports various complex tasks, including quantum property prediction, with two main variants (289M and 8×289M).
    • selfies-ted - SELFIES-based Transformer Encoder-Decoder (SELFIES-TED) is an encoder-decoder model based on BART that not only learns molecular representations but also auto-regressively generates molecules. Pre-trained on a dataset of ~1B molecules from PubChem and Zinc-22.
    • mhg-ged - Molecular Hypergraph Grammar with Graph-based Encoder-Decoder (MHG-GED) is an autoencoder that combines a GNN-based encoder with a sequential MHG-based decoder. The GNN encodes molecular input to achieve strong predictive performance on molecular graphs, while the MHG decodes structurally valid molecules. Pre-trained on a dataset of ~1.34M molecules curated from PubChem.
    • smi-ssed - SMI-SSED (SMILES-SSED) is a Mamba-based encoder-decoder model pre-trained on a curated dataset of 91 million SMILES samples, encompassing 4 billion molecular tokens sourced from PubChem. The model is tailored for complex tasks such as quantum property prediction and offers efficient, high-speed inference capabilities.
  • More to be announced soon.
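To make the "molecular tokens" counted above concrete: SMILES-based models like SMI-TED first split a SMILES string into atom-level tokens. The sketch below uses a regex pattern common in the chemistry-NLP literature; it is illustrative and not the exact tokenizer these models ship with:

```python
import re

# Minimal atom-level SMILES tokenizer, a typical preprocessing step for
# SMILES-based models such as SMI-TED. The regex is a widely used pattern
# from the chemistry-NLP literature, shown here for illustration only.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|N|O|S|P|F|I|B|C|n|o|s|p|c|b|"
    r"\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%\d{2}|\d)"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into atom/bond/ring tokens; the round-trip
    check guards against characters the pattern does not recognize."""
    tokens = SMILES_PATTERN.findall(smiles)
    assert "".join(tokens) == smiles, "unrecognized characters in SMILES"
    return tokens

print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin: 21 tokens
```

Note that two-letter elements (Br, Cl) must come before single letters in the alternation, and bracket atoms like [NH4+] are kept as single tokens; the 4-billion-token corpus sizes quoted above count tokens at roughly this granularity.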
TerraTorch
TerraTorch is a library based on PyTorch Lightning and the TorchGeo domain library for geospatial data. (Principal developer: IBM Research)
GEO-Bench
GEO-Bench is a General Earth Observation benchmark for evaluating the performance of large pre-trained models on geospatial data. (Principal developer: ServiceNow)
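Benchmarks like GEO-Bench score models on Earth-observation tasks such as land-cover segmentation, typically with metrics like per-class intersection-over-union (IoU). The sketch below is plain Python for clarity and is not GEO-Bench's own evaluation code; the toy class labels are invented for illustration:

```python
# Illustrative per-class IoU, a standard metric for the segmentation
# tasks that Earth-observation benchmarks include. Plain Python for
# clarity; not GEO-Bench's actual evaluation code.

def iou_per_class(pred, target, num_classes):
    """Intersection-over-union for each class over flat label sequences."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        ious.append(inter / union if union else float("nan"))
    return ious

# Toy flattened "masks" with two invented land-cover classes
# (0 = water, 1 = forest).
pred   = [0, 0, 1, 1, 1, 0]
target = [0, 1, 1, 1, 0, 0]
print(iou_per_class(pred, target, 2))  # → [0.5, 0.5]
```

IoU penalizes both false positives and false negatives per class, which is why it is preferred over plain pixel accuracy when classes are imbalanced, as they often are in satellite imagery.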