
Our Key Contributors and Their Datasets

The following AI Alliance member or affiliate organizations, shown in alphabetical order, maintain open datasets that are becoming part of our catalog. See also the Other Datasets page.

Table of contents
  1. Our Key Contributors and Their Datasets
    1. BrightQuery
    2. Common Crawl Foundation
    3. EPFL
    4. IBM Research
      1. Social Stigma Q&A
      2. Kepler
    5. Meta
      1. Data for Good at Meta
      2. OMol25
    6. Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
      1. do-not-answer
    7. PleIAs
    8. ServiceNow
    9. SemiKong
    10. Sony AI and the University of Tokyo
    11. Your Contributions?

NOTE: See also the AI Alliance’s Hugging Face organization and the dataset collection there, which list some datasets discussed below, as well as others that were donated or created by Alliance members.

BrightQuery

BrightQuery (“BQ”) provides proprietary financial, legal, and employment information on private and public companies, derived from regulatory filings and disclosures. BQ proprietary data is used in capital markets for investment decisions, in banking and insurance for KYC and credit checks, and in enterprises for master data management, sales, and marketing.

In addition, BQ provides public information consisting of clean, standardized statistical data from major government agencies and NGOs around the world, in partnership with the source agencies. BQ public datasets will be published at opendata.org/ and cataloged in the Open Trusted Data Initiative (OTDI), spanning topics such as economics, demographics, healthcare, crime, climate, education, and sustainability. See also their documentation about the datasets they are building. Much of the data will be tabular (i.e., structured) time-series data, along with unstructured text.

More specific information is coming soon.

Common Crawl Foundation

Common Crawl Foundation is working on tagged and filtered crawl subsets for English and other languages.

More specific information is coming soon.

EPFL

The EPFL LLM team curated a dataset to train their Meditron models. An open-access subset of the medical guidelines data is published on Hugging Face.

See the Meditron GitHub repo README for more details about the whole dataset used to train Meditron.
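
If the subset follows the usual Hugging Face conventions, loading it for inspection should look roughly like this sketch; the repository ID epfl-llm/guidelines and the split name are assumptions to verify on the EPFL LLM team's Hugging Face page.

```python
# A minimal sketch, assuming the open-access subset is published under the
# repository ID "epfl-llm/guidelines" -- verify the exact ID and split names
# on the team's Hugging Face page before relying on this.
from datasets import load_dataset

guidelines = load_dataset("epfl-llm/guidelines", split="train")
print(guidelines)            # row count and column names
print(guidelines[0].keys())  # inspect fields rather than assuming their names
```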

IBM Research

Social Stigma Q&A

Social Stigma Q&A is a dataset from IBM Research. From the arXiv paper abstract:

Current datasets for unwanted social bias auditing are limited to studying protected demographic features such as race and gender. In this work, we introduce a comprehensive benchmark that is meant to capture the amplification of social bias, via stigmas, in generative language models. Taking inspiration from social science research, we start with a documented list of 93 US-centric stigmas and curate a question-answering (QA) dataset which involves simple social situations. Our benchmark, SocialStigmaQA, contains roughly 10K prompts, with a variety of prompt styles, carefully constructed to systematically test for both social bias and model robustness. We present results for SocialStigmaQA with two open source generative language models and we find that the proportion of socially biased output ranges from 45% to 59% across a variety of decoding strategies and prompting styles. We demonstrate that the deliberate design of the templates in our benchmark (e.g., adding biasing text to the prompt or using different verbs that change the answer that indicates bias) impacts the model tendencies to generate socially biased output. Additionally, through manual evaluation, we discover problematic patterns in the generated chain-of-thought output that range from subtle bias to lack of reasoning.

For more information, see arXiv:2312.07492.
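
As a rough illustration of pulling the benchmark for local inspection, here is a minimal sketch; the repository ID ibm/SocialStigmaQA is an assumption, so check the dataset card for the canonical name, splits, and columns.

```python
# A minimal sketch, assuming the benchmark is published under the repository
# ID "ibm/SocialStigmaQA"; treat the ID, splits, and columns as assumptions
# and confirm them on the dataset card.
from datasets import load_dataset

ssqa = load_dataset("ibm/SocialStigmaQA")  # a DatasetDict of whatever splits exist
print(ssqa)                                # roughly 10K prompts total, per the paper

first_split = next(iter(ssqa.values()))
print(first_split[0])                      # one templated prompt and its metadata
```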

Kepler

Kepler (paper) measures resource utilization for sustainable computing purposes. From the repo:

Kepler (Kubernetes-based Efficient Power Level Exporter) uses eBPF to probe performance counters and other system stats, uses ML models to estimate workload energy consumption based on these stats, and exports them as Prometheus metrics.
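
Since Kepler publishes its estimates as Prometheus metrics, they can be read back with an ordinary PromQL query. The following is a minimal sketch, not Kepler's own tooling; the metric name kepler_container_joules_total, its labels, and the Prometheus URL are assumptions to check against your deployment.

```python
# A sketch of reading Kepler's exported metrics out of Prometheus via its
# HTTP API. The metric name "kepler_container_joules_total", the
# "container_namespace" label, and the Prometheus URL are assumptions.
import requests

PROM_URL = "http://localhost:9090/api/v1/query"
# rate() over a joules counter yields watts; summed per namespace.
query = "sum by (container_namespace) (rate(kepler_container_joules_total[5m]))"

resp = requests.get(PROM_URL, params={"query": query}, timeout=10)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    namespace = series["metric"].get("container_namespace", "<none>")
    watts = float(series["value"][1])
    print(f"{namespace}: {watts:.2f} W")
```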

Meta

Data for Good at Meta

Data for Good at Meta empowers partners with privacy-preserving data that strengthens communities and advances social issues. Data for Good is helping organizations respond to crises around the world and supporting research that advances economic opportunity.

There are 220 datasets available. See Meta’s page at the Humanitarian Data Exchange for the full list of datasets.

OMol25

OMol25 is an open dataset for molecules and electrolytes, possibly the largest ab-initio dataset ever released in terms of compute cost. It is released alongside a family of Universal Model for Atoms (UMA) models trained on all of the open-science datasets the team has released in the past five years (materials, catalysts, molecules, MOFs, organic crystals).

For more information, including a demo to see how it works on different materials, see the following:

  • Blog post, including links to the research paper, the dataset, the trained model, and code.
  • Demo
  • Press coverage: SEMAFOR

Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)

do-not-answer

Developed by the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), do-not-answer is an open-source dataset for evaluating LLM safety mechanisms at low cost. It is curated and filtered to contain only prompts that responsible language models should decline to answer. Besides human annotations, do-not-answer also implements model-based evaluation, in which a fine-tuned, 600M-parameter, BERT-like evaluator achieves results comparable to human and GPT-4 evaluation.
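
For local inspection, loading should look roughly like the following sketch; the repository ID LibrAI/do-not-answer and the risk_area column are assumptions drawn from common descriptions of the release, so confirm them on the dataset card.

```python
# A minimal sketch, assuming the prompts are published under the repository
# ID "LibrAI/do-not-answer" with a "risk_area" column; both are assumptions.
from collections import Counter
from datasets import load_dataset

dna = load_dataset("LibrAI/do-not-answer", split="train")
print(dna)  # column names and prompt count

if "risk_area" in dna.column_names:
    print(Counter(dna["risk_area"]))  # how prompts distribute across risk areas
```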

PleIAs

Domain-specific, clean datasets.

| Name | Description | URL | Date Added |
| ---- | ----------- | --- | ---------- |
| Common Corpus | Largest multilingual pretraining dataset | Hugging Face (paper) | 2024-11-04 |
| Toxic Commons | Tools for de-toxifying public domain data, especially multilingual and historical text data and data with OCR errors | Hugging Face | 2024-11-04 |
| Finance Commons | A large collection of multimodal financial documents in open data | Hugging Face | 2024-11-04 |
| Bad Data Toolbox | PleIAs collection of models for the data processing of challenging document and data sources | Hugging Face | 2024-11-04 |
| Open Culture | A multilingual dataset of public domain books and newspapers | Hugging Face | 2024-11-04 |
| Math PDF | A collection of open-source PDFs on mathematics | Hugging Face | 2025-03-19 |

ServiceNow

Multimodal, code, and other datasets.

| Name | Description | URL | Date Added |
| ---- | ----------- | --- | ---------- |
| BigDocs-Bench | A comprehensive benchmark suite designed to evaluate downstream tasks that transform visual inputs into structured outputs, such as GUI2UserIntent (fine-grained reasoning) and Image2Flow (structured output). Additional components of BigDocs-Bench are being released over time. | Hugging Face | 2024-12-11 |
| RepLiQA | An evaluation dataset of Context-Question-Answer triplets, where contexts are non-factual but natural-looking documents about made-up entities, such as people or places that do not exist in reality… | Hugging Face | 2024-12-11 |
| The Stack | Exact-deduplicated version of The Stack dataset used for the BigCode project. | Hugging Face | 2024-12-11 |
| The Stack Dedup | Near-deduplicated version of The Stack (recommended for training); see the streaming sketch after this table. | Hugging Face | 2024-12-11 |
| StarCoder Data | Pretraining dataset of StarCoder. | Hugging Face | 2024-12-11 |
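
Because the BigCode datasets run to terabytes, streaming a single-language slice is usually preferable to a full download. This is a minimal sketch under the assumptions noted in the comments, not ServiceNow's or BigCode's own tooling.

```python
# A sketch of streaming one language slice of The Stack Dedup instead of
# downloading the whole dataset. The "data/python" layout and the "content"
# column follow the dataset card; the dataset may be gated, requiring a
# logged-in Hugging Face token.
from datasets import load_dataset

stack_py = load_dataset(
    "bigcode/the-stack-dedup",
    data_dir="data/python",  # one language subset
    split="train",
    streaming=True,          # iterate lazily, no local copy
)
for i, example in enumerate(stack_py):
    print(example["content"][:200])  # the source-file text
    if i >= 2:
        break
```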

SemiKong

The training dataset used by the SemiKong collaboration to train an open model for the semiconductor industry.

| Name | Description | URL | Date Added |
| ---- | ----------- | --- | ---------- |
| SemiKong | An open model training dataset for semiconductor technology | Hugging Face | 2024-09-01 |

Sony AI and the University of Tokyo

A collaboration between Sony AI and the University of Tokyo created Human-Centric Face Representations, a dataset of 638,180 human judgments on face similarity. Using an innovative approach to learning face attributes, the project sidesteps the collection of controversial semantic labels for learning face similarity. The dataset and modeling approach also enable a comprehensive examination of annotator bias and its influence on AI model creation.

Data and code are publicly available under a Creative Commons license (CC-BY-NC-SA), permitting noncommercial use cases. See the GitHub repo.

Your Contributions?

To expand our catalog, we welcome your contributions.