
Other Datasets and Data Initiatives

Many open datasets are not hosted on Hugging Face, so they are not yet part of our catalog. Other datasets that are hosted there aren’t picked up by our catalog-building process for various reasons, some of which are discussed in About This Catalog. For example, Croissant metadata might not be available, licenses may be missing or incorrectly specified, or you may need to manually request access to a dataset before you can even see its Croissant metadata!
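As a concrete illustration of that last point, Hugging Face serves a dataset’s Croissant record from a public REST endpoint, and gated or metadata-less datasets answer with an HTTP error instead. Below is a minimal sketch using only the Python standard library; the endpoint path follows Hugging Face’s published Hub API, but treat the error behavior described in the docstring as an observation, not a guarantee:

```python
import json
import urllib.request


def croissant_url(dataset_id: str) -> str:
    """Build the Hugging Face Hub API URL that serves a dataset's Croissant metadata."""
    return f"https://huggingface.co/api/datasets/{dataset_id}/croissant"


def fetch_croissant(dataset_id: str) -> dict:
    """Download and parse the Croissant JSON-LD record for a dataset.

    For gated datasets, or datasets with no Croissant record, this raises
    urllib.error.HTTPError instead of returning metadata -- one of the
    reasons a dataset may be skipped by a catalog-building process.
    """
    with urllib.request.urlopen(croissant_url(dataset_id)) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `fetch_croissant("glue")` returns the JSON-LD dictionary describing that dataset, while a manually gated dataset raises an HTTP error until access has been granted.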

In addition, other data initiatives are fostering the creation, maintenance, and cataloging of datasets for specific purposes, such as under-represented language families, domains and use cases, and areas of science.

Here is a list of notable datasets and initiatives that don’t appear in the catalog pages, grouped into general topic areas. See also the Contributors page.

Table of contents
  1. Other Datasets and Data Initiatives
    1. Other General-Purpose Data Initiatives
      1. Mozilla Data Collective
      2. Avoiding “AI Slop”
    2. Domain-Specific Datasets
      1. Chemistry
        1. CartBlanche
        2. PubChem
      2. Finance
      3. General and Regional Languages
        1. Aquarium
        2. Common Pile
        3. WAXAL: A Large-Scale Multilingual African Language Speech Corpus
      4. Legal
      5. Medical
      6. Source Code
      7. Time Series
    3. Datasets for Training, Benchmarks, and Other Evaluation Purposes
    4. Other Important Datasets We Should Catalog?

Other General-Purpose Data Initiatives

Mozilla Data Collective

The Mozilla Data Collective has curated over 470 high-quality datasets, sourced globally and built in a transparent, ethical way.

Avoiding “AI Slop”

The blog Low-background Steel (Pre AI) catalogs datasets known to predate the announcement of ChatGPT in late 2022, after which AI-generated content became increasingly prevalent in datasets. The site aims to ensure that purely human-generated datasets remain available for research and development. From the site:

Sources of data that haven’t been contaminated by AI-created content. Low-background Steel (and lead) is a type of metal uncontaminated by radioactive isotopes from nuclear testing. That steel and lead is usually recovered from ships that sunk before the Trinity Test in 1945. This blog is about uncontaminated content that I’m terming “Low-background Steel”. The idea is to point to sources of text, images and video that were created prior to the explosion of AI-generated content that occurred in 2022.

Domain-Specific Datasets

Chemistry

Many datasets for chemistry are open for use.

CartBlanche

CartBlanche is an interface to ZINC-22, a free database of commercially available compounds for virtual screening. From the website:

ZINC-22 focuses on make-on-demand (“tangible”) compounds from a small number of large catalogs: Enamine, WuXi and Mcule. Our sister database, ZINC20 focuses on smaller catalogs. ZINC-22 currently has about 54.9 billion molecules in 2D and 5.9 billion in 3D.

PubChem

PubChem is a free-to-use chemistry database. From the website:

PubChem is a free to use database with most of the data readily available for download. Exceptions may exist in cases where licensing agreements prevent our data contributors from allowing bulk downloads of some datasets.

Please consult the NCBI Policies and Disclaimers webpage and the NLM Web Policies webpage.

The data in PubChem comes from hundreds of data contributors. A data source may provide explicit data license information. One should check with the PubChem data source for the most current data licensing information.

PubChem strives to make clear the data provenance of all content; it is provided within a given data table row or beneath the content itself. For example, PubChem shows Medical Subject Headings (MeSH) as the data source for the assertion that a chemical is a “Fibrinolytic Agent”.
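Programmatic access to PubChem goes through its PUG REST interface, where a single URL encodes the input domain (e.g., a compound name), the requested operation (e.g., a property table), and the output format. Here is a minimal sketch of building such a request URL with the Python standard library; the property names in the usage example are taken from PubChem’s documented property list:

```python
import urllib.parse

PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def compound_property_url(name: str, properties: list[str]) -> str:
    """Build a PUG REST URL that looks up a compound by name and requests
    the given computed properties as a JSON property table."""
    props = ",".join(properties)
    quoted_name = urllib.parse.quote(name)
    return f"{PUG_REST_BASE}/compound/name/{quoted_name}/property/{props}/JSON"
```

For instance, `compound_property_url("aspirin", ["MolecularFormula", "MolecularWeight"])` yields a URL that, when fetched, returns a small JSON property table for aspirin.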

Finance

General and Regional Languages

Aquarium

Aquarium (blog post) is “An Open Data Platform for Southeast Asian Languages.” A collaboration between AI Singapore and Google, Aquarium is a platform to promote the gathering and sharing of datasets for the hundreds of languages and dialects spoken by over 650 million people in Southeast Asia. Most of these languages and dialects are under-represented in current training datasets used for AI.

Common Pile

Another large open dataset, Common Pile (HF announcement, HF location, HF blog, Paper, Code), was published in June 2025 by a consortium of researchers from the University of Toronto, Vector Institute, Hugging Face, EleutherAI, the Allen Institute for Artificial Intelligence, Teraflop AI, Cornell University, University of Maryland College Park, MIT, CMU, Lila Sciences, Lawrence Livermore National Laboratory, and others. See also PleIAs’ Common Corpus dataset.

The Common Pile collaborators used 1-trillion- and 2-trillion-token subsets of Common Pile as training datasets for two models, Comma-v0.1-1t and Comma-v0.1-2t, respectively. Both are 7B-parameter models.

NOTE: Because this dataset is published on Hugging Face, it will appear in our catalog soon.

WAXAL: A Large-Scale Multilingual African Language Speech Corpus

WAXAL (paper) is a large-scale multilingual African language speech corpus. Quoting from the abstract:

The advancement of speech technology has predominantly favored high-resource languages, creating a significant digital divide for speakers of most Sub-Saharan African languages. To address this gap, we introduce WAXAL, a large-scale, openly accessible speech dataset for 21 languages representing over 100 million speakers. The collection consists of two main components: an Automated Speech Recognition (ASR) dataset containing approximately 1,250 hours of transcribed, natural speech from a diverse range of speakers, and a Text-to-Speech (TTS) dataset with over 180 hours of high-quality, single-speaker recordings reading phonetically balanced scripts…

Legal

Medical

Source Code

The BigCode project publishes several open source-code datasets, such as The Stack.

See also Common Pile.

Time Series

Datasets for Training, Benchmarks, and Other Evaluation Purposes

Other Important Datasets We Should Catalog?

If you know of other open datasets that we should include in our catalog, let us know.