Parsing PDFs to Build AI Datasets for Science

Welcome to The AI Alliance project: Parsing PDFs to Build AI Datasets for Science.

Effective applications of AI to specialized domains, such as scientific research, require customized datasets that can be used to train and tune models and to build applications with agents and the RAG (retrieval-augmented generation) pattern, among other uses.

However, much of the world’s scientific knowledge is contained in PDFs, which can be challenging to parse for automation purposes. Text extraction is not too difficult, but extracting useful information from diagrams and tables is harder.

Recent advances in tools such as Docling have made automated information extraction from PDFs and other document types possible as never before. Our project is building the tools to do this extraction at scale to create new domain-specific datasets, starting with datasets for scientific disciplines. The project will begin by parsing the Math-PDF dataset published by PleIAs.
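
As a rough illustration of the kind of extraction described above, the sketch below converts a single PDF to Markdown with Docling's Python API (`DocumentConverter.convert` followed by `export_to_markdown`). It is not the project's actual pipeline: the helper name `pdf_to_markdown` and its `converter` parameter are hypothetical, introduced here only for illustration, and the sketch assumes Docling is installed (`pip install docling`).

```python
def pdf_to_markdown(pdf_path, converter=None):
    """Convert one PDF to Markdown text (illustrative helper, not project code).

    `converter` may be any object with a Docling-style `convert(source)` method.
    If it is omitted, a Docling DocumentConverter is created.
    """
    if converter is None:
        # Imported lazily so this module still loads where Docling is absent.
        from docling.document_converter import DocumentConverter

        converter = DocumentConverter()

    # Docling accepts a local path or URL and returns a conversion result
    # whose `document` attribute holds the parsed document.
    result = converter.convert(str(pdf_path))

    # Export the parsed document, including recognized tables, as Markdown.
    return result.document.export_to_markdown()
```

Calling `pdf_to_markdown("paper.pdf")` (a hypothetical file name) would return the document as one Markdown string. Allowing the converter to be passed in means one converter can be reused across many files, which is likely to matter when processing documents at scale.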

These datasets will be hosted in the AI Alliance Hugging Face organization and will be part of the Open Trusted Data Initiative catalog of datasets.

Please join us! See our contributing page for details.

TODO: This is a work-in-progress website. More information coming soon.

Authors	Foundation Models and Datasets (See the Contributors)
Last Update	V0.1.1, 2025-06-03