Our Key Contributors and Their Datasets
The following organizations, shown in alphabetical order, maintain open datasets that are becoming part of our catalog.
BrightQuery
BrightQuery (“BQ”) provides proprietary financial, legal, and employment information on private and public companies derived from regulatory filings and disclosures. BQ proprietary data is used in capital markets for investment decisions, banking and insurance for KYC & credit checks, and enterprises for master data management, sales, and marketing purposes.
In addition, BQ provides public information consisting of clean and standardized statistical data from all the major government agencies and NGOs around the world, in partnership with the source agencies. BQ public datasets will be published at opendata.org/ and cataloged in OTDI, spanning all topics: economics, demographics, healthcare, crime, climate, education, sustainability, etc. See also their documentation about the datasets they are building. Much of the data will be tabular (i.e., structured) time series data; some will be unstructured text.
More specific information is coming soon.
Common Crawl Foundation
Common Crawl Foundation is working on tagged and filtered crawl subsets for English and other languages.
More specific information is coming soon.
EPFL
The EPFL LLM team has curated a dataset to train their Meditron models. An open-access subset of the medical guidelines data is published on Hugging Face.
See the Meditron GitHub repo README for more details about the whole dataset used to train Meditron.
Meta
Data for Good at Meta
Data for Good at Meta empowers partners with privacy-preserving data that strengthens communities and advances social issues. Data for Good is helping organizations respond to crises around the world and supporting research that advances economic opportunity.
There are 220 datasets available. See Meta’s page at the Humanitarian Data Exchange for the full list of datasets.
OMol25
OMol25 is an open dataset for molecules and electrolytes, possibly the largest ab-initio dataset ever released in terms of compute cost. It is accompanied by a family of Universal Model for Atoms (UMA) models trained against all of the open-science datasets the team has released in the past five years (materials, catalysts, molecules, MOFs, organic crystals).
For more information, including a demo to see how it works on different materials, see the following:
- Blog post: including links to the research paper, the dataset, the trained model, and code.
- Demo
- Press coverage: SEMAFOR
PleIAs
Domain-specific, clean datasets.
- PleIAs website
- PleIAs Hugging Face organization
- PleIAs Collections on Hugging Face
Name | Description | URL | Date Added |
---|---|---|---|
Common Corpus | Largest multilingual pretraining data | Hugging Face | 2024-11-04 |
Toxic Commons | Tools for de-toxifying public domain data, especially multilingual and historical text data and data with OCR errors | Hugging Face | 2024-11-04 |
Finance Commons | A large collection of multimodal financial documents in open data | Hugging Face | 2024-11-04 |
Bad Data Toolbox | A PleIAs collection of models for processing challenging documents and data sources | Hugging Face | 2024-11-04 |
Open Culture | A multilingual dataset of public domain books and newspapers | Hugging Face | 2024-11-04 |
Math PDF | A collection of open source PDFs on Mathematics | Hugging Face | 2025-03-19 |
ServiceNow
Multimodal, code, and other datasets.
- ServiceNow website
- ServiceNow Hugging Face organization
- BigCode Hugging Face organization
Name | Description | URL | Date Added |
---|---|---|---|
BigDocs-Bench | A dataset for a comprehensive benchmark suite designed to evaluate downstream tasks that transform visual inputs into structured outputs, such as GUI2UserIntent (fine-grained reasoning) and Image2Flow (structured output). We are actively working on releasing additional components of BigDocs-Bench and will update this repository as they become available. | Hugging Face | 2024-12-11 |
RepLiQA | RepLiQA is an evaluation dataset that contains Context-Question-Answer triplets, where contexts are non-factual but natural-looking documents about made-up entities such as people or places that do not exist in reality… | Hugging Face | 2024-12-11 |
The Stack | Exact deduplicated version of The Stack dataset used for the BigCode project. | Hugging Face | 2024-12-11 |
The Stack Dedup | Near deduplicated version of The Stack (recommended for training). | Hugging Face | 2024-12-11 |
StarCoder Data | Pretraining dataset of StarCoder. | Hugging Face | 2024-12-11 |
SemiKong
The training dataset for the SemiKong collaboration that trained an open model for the semiconductor industry.
Name | Description | URL | Date Added |
---|---|---|---|
SemiKong | An open model training dataset for semiconductor technology | Hugging Face | 2024-09-01 |
Your Contributions?
To expand our catalog, we welcome your contributions.
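The tables above all use the same four columns: Name, Description, URL, and Date Added. As a minimal sketch, assuming you want to sanity-check a row in that layout before proposing it, the Python below parses one row into a record. The `CatalogEntry` type and `parse_row` helper are hypothetical names introduced for illustration; they are not part of any official tooling, and the actual submission process is not described on this page.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CatalogEntry:
    """One row of a contributor table (hypothetical type, for illustration only)."""
    name: str
    description: str
    url: str
    date_added: date


def parse_row(line: str) -> CatalogEntry:
    """Parse a 'Name | Description | URL | Date Added |' row into a CatalogEntry."""
    # Drop the trailing pipe, then split into the four expected cells.
    cells = [cell.strip() for cell in line.strip().rstrip("|").split("|")]
    if len(cells) != 4:
        raise ValueError(f"expected 4 cells, got {len(cells)}: {line!r}")
    name, description, url, added = cells
    if not name or not description or not url:
        raise ValueError(f"empty cell in row: {line!r}")
    # date.fromisoformat raises ValueError on a malformed date string.
    return CatalogEntry(name, description, url, date.fromisoformat(added))


# Example row copied from the SemiKong table above.
row = "SemiKong | An open model training dataset for semiconductor technology | Hugging Face | 2024-09-01 |"
entry = parse_row(row)
print(entry.name, entry.date_added.isoformat())
```

This sketch assumes that no cell contains a literal `|` character; a description that does would need escaping or a proper Markdown table parser.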