Other Datasets and Data Initiatives

Many open datasets are not hosted on Hugging Face, so they are not yet part of our catalog. Other datasets that are hosted there aren’t picked up by our catalog-building process for various reasons, some of which are discussed in About This Catalog. For example, Croissant metadata might not be available, the license may be incorrectly defined or missing, or you may have to request access manually before you can even see a dataset’s Croissant metadata.
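The kinds of problems listed above can be sketched as a small check over a dataset's Croissant metadata. This is only an illustration, not the catalog's actual build process: the function name `catalog_blockers` is hypothetical, and it assumes the metadata has already been loaded into a Python dictionary with a top-level `license` key, as in schema.org-style JSON-LD. Passing `None` stands in for metadata that is missing or hidden behind an access request.

```python
from typing import Any, Optional


def catalog_blockers(croissant: Optional[dict[str, Any]]) -> list[str]:
    """Return reasons a dataset might be skipped (hypothetical helper).

    `croissant` is the parsed Croissant JSON-LD document, or None when no
    metadata could be retrieved (for example, because access is gated).
    An empty list means this sketch found no problem.
    """
    if croissant is None:
        return ["Croissant metadata is not available"]

    problems: list[str] = []

    # Assumption: the license is stored under the top-level "license" key.
    license_value = croissant.get("license")
    if license_value is None or license_value == "" or license_value == []:
        problems.append("license is missing")
    elif isinstance(license_value, str) and not license_value.strip():
        problems.append("license is blank")

    return problems


if __name__ == "__main__":
    examples = {
        "gated-or-missing": None,
        "no-license": {"name": "example-dataset"},
        "ok": {"name": "example-dataset", "license": "https://example.org/license"},
    }
    for label, metadata in examples.items():
        print(label, catalog_blockers(metadata))
```

A real pipeline would also need to decide which license values count as acceptable, which this sketch deliberately leaves out.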

In addition, other data initiatives are fostering the creation, maintenance, and cataloging of datasets for specific purposes, such as under-represented language families, domains and use cases, and areas of science.

Here is a list of notable datasets and initiatives that don’t appear in the catalog pages, grouped into general topic areas. See also the Contributors page.

Avoiding “AI Slop”

The blog Low-background Steel (Pre AI) catalogs datasets known to predate the announcement of ChatGPT, after which AI-generated content became increasingly prevalent in datasets. The site aims to ensure that purely human-generated datasets remain available for research and development. From the site:

Sources of data that haven’t been contaminated by AI-created content. Low-background Steel (and lead) is a type of metal uncontaminated by radioactive isotopes from nuclear testing. That steel and lead is usually recovered from ships that sunk before the Trinity Test in 1945. This blog is about uncontaminated content that I’m terming “Low-background Steel”. The idea is to point to sources of text, images and video that were created prior to the explosion of AI-generated content that occurred in 2022.
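The same idea can be applied to your own data as a date cutoff. The following is a minimal sketch, not something taken from the blog: it assumes each record is a dictionary with an ISO-formatted creation date under a hypothetical `created` key, and it uses 30 November 2022, the date ChatGPT was announced, as the cutoff. A creation date is only a proxy, since content can be edited or re-dated later.

```python
from datetime import date

# Cutoff: the day ChatGPT was announced.
CHATGPT_ANNOUNCED = date(2022, 11, 30)


def predates_chatgpt(records, date_field="created"):
    """Keep records whose creation date is strictly before the cutoff.

    `records` is an iterable of dicts; `date_field` (a hypothetical key)
    holds a "YYYY-MM-DD" string. Records without a date are dropped,
    because their origin cannot be established.
    """
    kept = []
    for record in records:
        raw = record.get(date_field)
        if raw is None:
            continue
        if date.fromisoformat(raw) < CHATGPT_ANNOUNCED:
            kept.append(record)
    return kept


if __name__ == "__main__":
    sample = [
        {"id": "a", "created": "2019-05-01"},
        {"id": "b", "created": "2023-01-15"},
        {"id": "c"},  # no date recorded
    ]
    for record in predates_chatgpt(sample):
        print(record["id"])
```

Dropping undated records is the conservative choice here; keeping them would defeat the purpose of the filter.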

Benchmark and Other Evaluation Datasets

NeurIPS 2024 Datasets Benchmarks

The NeurIPS 2024 Datasets Benchmarks list collects recently created datasets of interest for evaluation.

Chemistry

Many datasets for chemistry are open for use.

CartBlanche

CartBlanche is an interface to ZINC-22, a free database of commercially available compounds for virtual screening. From the website:

ZINC-22 focuses on make-on-demand (“tangible”) compounds from a small number of large catalogs: Enamine, WuXi and Mcule. Our sister database, ZINC20 focuses on smaller catalogs. ZINC-22 currently has about 54.9 billion molecules in 2D and 5.9 billion in 3D.

PubChem

PubChem is a free-to-use chemistry database. From the website:

PubChem is a free to use database with most of the data readily available for download. Exceptions may exist in cases where licensing agreements prevent our data contributors from allowing bulk downloads of some datasets.

Please consult the NCBI Policies and Disclaimers webpage and the NLM Web Policies webpage.

The data in PubChem comes from hundreds of data contributors. A data source may provide explicit data license information. One should check with the PubChem data source for the most current data licensing information.

PubChem strives to make clear the data provenance of all content. Within a given data table row or beneath provided content, the data provenance is provided. For example, this data shows Medical Subject Headings (MeSH) as the data source for the assertion of a chemical being a “Fibrinolytic Agent”:

Language

Aquarium

Aquarium (blog post) is “An Open Data Platform for Southeast Asian Languages.” A collaboration between AI Singapore and Google, Aquarium is a platform that promotes gathering and sharing datasets for the hundreds of languages and dialects spoken by over 650 million people in Southeast Asia. Most of these languages and dialects are underrepresented in the datasets currently used to train AI models.

Common Pile

Another large open dataset, Common Pile (HF announcement, HF location, HF blog, Paper, Code), was published in June 2025 by a consortium of researchers from University of Toronto, Vector Institute, Hugging Face, EleutherAI, The Allen Institute for Artificial Intelligence, Teraflop AI, Cornell University, University of Maryland College Park, MIT, CMU, Lila Sciences, Lawrence Livermore National Laboratory, and other institutions. See also PleIAs’ Common Corpus dataset.

The Common Pile collaborators used 1-trillion-token and 2-trillion-token subsets of Common Pile to train two models, Comma-v0.1-1t and Comma-v0.1-2t, respectively. Both are 7B-parameter models.

NOTE: Because this dataset is published on Hugging Face, it will appear in our catalog soon.

WAXAL: A Large-Scale Multilingual African Language Speech Corpus

WAXAL (paper) is a large-scale multilingual African language speech corpus. Quoting from the abstract:

The advancement of speech technology has predominantly favored high-resource languages, creating a significant digital divide for speakers of most Sub-Saharan African languages. To address this gap, we introduce WAXAL, a large-scale, openly accessible speech dataset for 21 languages representing over 100 million speakers. The collection consists of two main components: an Automated Speech Recognition (ASR) dataset containing approximately 1,250 hours of transcribed, natural speech from a diverse range of speakers, and a Text-to-Speech (TTS) dataset with over 180 hours of high-quality, single-speaker recordings reading phonetically balanced scripts…

Domain-specific Datasets

Finance

SEC Filings

Institutional Data Initiative

The Institutional Data Initiative at the Harvard Law School Library has published The Institutional Books Corpus. This dataset is available on Hugging Face, but it is not in our catalog, because access to it, including its Croissant metadata, currently requires prior approval. (See our discussion of this issue here.)

Legal

Medical

PubMed Central

Source Code

BigCode datasets:

Time Series

New York TLC Trip Record

Other General-purpose Training Datasets

arXiv
Common Crawl (See also Common Crawl Foundation)
FineWeb
GitHub Clean
Hacker News
OpenWebMath
OpenWebText
The Pile
Project Gutenberg
RefinedWeb
StackExchange Datadump
Wikipedia/Wikimedia (See also Wikimedia Enterprise)

What Other Important Datasets Should We Add?

If you know of other open datasets that we should include in our catalog, let us know.