The Dataset Catalog

Table of contents
  1. The Dataset Catalog
    1. PleIAs
    2. ServiceNow
    3. SemiKong
    4. Coming Soon

See also the AI Alliance’s Hugging Face organization, which hosts the Open Trusted Data Initiative catalog that includes the datasets listed here.

TODO: We plan to provide an integrated search and browsing feature to make it easier to select datasets for your particular needs.
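Until that feature exists, a local copy of the list can be filtered by hand. The following is only a minimal sketch, assuming a few entries transcribed manually from the tables on this page; `CatalogEntry`, `CATALOG`, and `search` are hypothetical names chosen for illustration and are not part of any AI Alliance tooling.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CatalogEntry:
    """One row of the catalog (hypothetical structure, for illustration only)."""
    owner: str
    name: str
    description: str
    added: date


# A few rows copied by hand from this page; this is not an official export.
CATALOG = [
    CatalogEntry("PleIAs", "Common Corpus",
                 "Largest multilingual pretraining data", date(2024, 11, 4)),
    CatalogEntry("PleIAs", "Open Culture",
                 "A multilingual dataset of public domain books and newspapers",
                 date(2024, 11, 4)),
    CatalogEntry("ServiceNow", "The Stack Dedup",
                 "Near deduplicated version of The Stack (recommended for training).",
                 date(2024, 12, 11)),
    CatalogEntry("SemiKong", "SemiKong",
                 "An open model training dataset for semiconductor technology",
                 date(2024, 9, 1)),
]


def search(entries, keyword="", owner=None, added_since=None):
    """Return entries whose name or description contains `keyword`
    (case-insensitive), optionally restricted by owner and earliest date added."""
    needle = keyword.lower()
    results = []
    for entry in entries:
        haystack = (entry.name + " " + entry.description).lower()
        if needle not in haystack:
            continue
        if owner is not None and entry.owner != owner:
            continue
        if added_since is not None and entry.added < added_since:
            continue
        results.append(entry)
    return results


if __name__ == "__main__":
    for entry in search(CATALOG, keyword="multilingual"):
        print(entry.owner, "-", entry.name)
```

All filters are optional and combine with a logical AND, so `search(CATALOG, owner="ServiceNow")` and `search(CATALOG, added_since=date(2024, 12, 1))` each narrow the list along a single dimension.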

Here is the current list of datasets, organized by owner.

BETA: This is a provisional list of datasets. We are not yet validating datasets against our draft requirements.

PleIAs

Domain-specific, clean datasets.

| Name | Description | URL | Date Added |
|------|-------------|-----|------------|
| Common Corpus | Largest multilingual pretraining data | Hugging Face | 2024-11-04 |
| Toxic Commons | Tools for de-toxifying public domain data, especially multilingual and historical text data and data with OCR errors | Hugging Face | 2024-11-04 |
| Finance Commons | A large collection of multimodal financial documents in open data | Hugging Face | 2024-11-04 |
| Bad Data Toolbox | PleIAs collection of models for the data processing of challenging document and data sources | Hugging Face | 2024-11-04 |
| Open Culture | A multilingual dataset of public domain books and newspapers | Hugging Face | 2024-11-04 |

ServiceNow

Multimodal, code, and other datasets.

| Name | Description | URL | Date Added |
|------|-------------|-----|------------|
| BigDocs-Bench | A dataset for a comprehensive benchmark suite designed to evaluate downstream tasks that transform visual inputs into structured outputs, such as GUI2UserIntent (fine-grained reasoning) and Image2Flow (structured output). We are actively working on releasing additional components of BigDocs-Bench and will update this repository as they become available. | Hugging Face | 2024-12-11 |
| RepLiQA | RepLiQA is an evaluation dataset that contains Context-Question-Answer triplets, where contexts are non-factual but natural-looking documents about made-up entities such as people or places that do not exist in reality… | Hugging Face | 2024-12-11 |
| The Stack | Exact deduplicated version of The Stack dataset used for the BigCode project. | Hugging Face | 2024-12-11 |
| The Stack Dedup | Near deduplicated version of The Stack (recommended for training). | Hugging Face | 2024-12-11 |
| StarCoder Data | Pretraining dataset of StarCoder. | Hugging Face | 2024-12-11 |

SemiKong

The training dataset from the SemiKong collaboration, which trained an open model for the semiconductor industry.

| Name | Description | URL | Date Added |
|------|-------------|-----|------------|
| SemiKong | An open model training dataset for semiconductor technology | Hugging Face | 2024-09-01 |

Coming Soon

In addition to the above organizations, the following are collaborating with us on additional datasets to be published soon.

| Organization | Kind |
|--------------|------|
| BrightQuery | BrightQuery (“BQ”) provides proprietary financial, legal, and employment information on private and public companies derived from regulatory filings and disclosures. BQ proprietary data is used in capital markets for investment decisions, banking and insurance for KYC & credit checks, and enterprises for master data management, sales, and marketing purposes. In addition, BQ provides public information consisting of clean and standardized statistical data from all the major government agencies and NGOs around the world, and is doing so in partnership with the source agencies. BQ public datasets will be published in OTDI spanning all topics: economics, demographics, healthcare, crime, climate, education, sustainability, etc. The data will in general be tabular time series. (TBD) |
| Common Crawl Foundation | Tagged and filtered crawl subsets for English and other languages |

To expand this catalog, we welcome contributions.