Other Datasets
Many open datasets are not hosted on Hugging Face, so they are not yet part of our catalog. Others are hosted there but aren’t picked up by our catalog-building process for various reasons, some of which are discussed in About This Catalog. For example, Croissant metadata might not be available, the license may be missing or incorrectly defined, or the dataset may require a manual access request before you can even see its Croissant metadata.
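As a rough illustration of the exclusion reasons listed above, the following minimal sketch flags why a dataset might be skipped. It is not the catalog's actual build process: the `DatasetRecord` fields, the `exclusion_reasons` helper, and the `ASSUMED_OPEN_LICENSES` allow-list are all hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical record shape. These field names are assumptions for
# illustration and are not the catalog's real schema.
@dataclass
class DatasetRecord:
    name: str
    has_croissant: bool       # was Croissant metadata retrievable?
    license: Optional[str]    # declared license identifier, if any
    gated: bool               # does access require a manual request?

# Illustrative allow-list only; the licenses the catalog accepts may differ.
ASSUMED_OPEN_LICENSES = {"apache-2.0", "mit", "cc-by-4.0", "cdla-permissive-2.0"}

def exclusion_reasons(record: DatasetRecord) -> list[str]:
    """Return the reasons a dataset would be skipped (empty list if none)."""
    reasons = []
    if record.gated:
        reasons.append("access requires manual approval")
    if not record.has_croissant:
        reasons.append("Croissant metadata not available")
    if record.license is None:
        reasons.append("license missing")
    elif record.license.lower() not in ASSUMED_OPEN_LICENSES:
        reasons.append("license not recognized as open")
    return reasons
```

A gated dataset will typically fail more than one check at once, because its Croissant metadata cannot be read until access is granted.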
For now, here is a list of notable datasets that don’t appear in the catalog pages, grouped into general topic areas. See also the Contributors page.
Benchmark and Other Evaluation Datasets
NeurIPS 2024 Datasets Benchmarks
The NeurIPS 2024 Datasets Benchmarks is a list of recently created datasets of interest for evaluation.
Chemistry
Many datasets for chemistry are open for use.
CartBlanche
CartBlanche is an interface to ZINC-22, a free database of commercially available compounds for virtual screening. From the website:
ZINC-22 focuses on make-on-demand (“tangible”) compounds from a small number of large catalogs: Enamine, WuXi and Mcule. Our sister database, ZINC20 focuses on smaller catalogs. ZINC-22 currently has about 54.9 billion molecules in 2D and 5.9 billion in 3D.
PubChem
PubChem is a free-to-use chemistry database. From the website:
PubChem is a free to use database with most of the data readily available for download. Exceptions may exist in cases where licensing agreements prevent our data contributors from allowing bulk downloads of some data sets.
Please consult the NCBI Policies and Disclaimers webpage and the NLM Web Policies webpage.
The data in PubChem comes from hundreds of data contributors. A data source may provide explicit data license information. One should check with the PubChem data source for the most current data licensing information.
PubChem strives to make clear the data provenance of all content. Within a given data table row or beneath provided content, the data provenance is provided. For example, this data shows Medical Subject Headings (MeSH) as the data source for the assertion of a chemical being a “Fibrinolytic Agent”:
Text
Common Pile
Another large open dataset, Common Pile (HF announcement, HF location, HF blog, Paper, Code), was published in June 2025 by a consortium of researchers from the University of Toronto, Vector Institute, Hugging Face, EleutherAI, the Allen Institute for Artificial Intelligence, Teraflop AI, Cornell University, University of Maryland College Park, MIT, CMU, Lila Sciences, Lawrence Livermore National Laboratory, and others. See also PleIAs’ Common Corpus dataset.
The Common Pile collaborators trained two 7B-parameter models, Comma-v0.1-1t and Comma-v0.1-2t, on 1-trillion-token and 2-trillion-token subsets of Common Pile, respectively.
NOTE: Because this dataset is published on Hugging Face, it will appear in our catalog soon.
Institutional Data Initiative
The Institutional Data Initiative at the Harvard Law School Library has published The Institutional Books Corpus. This dataset is available on Hugging Face, but it is not in our catalog because access to it, including its Croissant metadata, currently requires prior approval.
Other Datasets?
If you know of other open datasets that we should include in our catalog, let us know.