
References: Other Information About Trusted, Open Data

Here is an evolving list of writings from other sources about the importance of open, trusted data, its implications, the technologies involved, and more. Of course, the opinions expressed do not necessarily reflect the views of the AI Alliance. However, many of these sources influence our work.

Help Wanted: If you have other references you like, please let us know through email, data@thealliance.ai, or edit this page!

This section is organized by topic.

Table of contents
  1. References: Other Information About Trusted, Open Data
    1. General Data Concerns
      1. Hugging Face: Training Data Transparency in AI: Tools, Trends, and Policy Recommendations
      2. U.S. Department of Commerce
      3. The European Union AI Act - Data Implications
    2. FAIR Principles
    3. Licensing and Attribution
      1. A Large-scale Audit of Dataset Licensing and Attribution in AI
    4. Data Provenance and Governance
      1. Data Provenance Initiative
      2. Data and Trust Alliance - Data Provenance Standards
    5. Data Classification
      1. Interactive Advertising Bureau Taxonomy
      2. IBM watsonx Natural Language Processing Categories
    6. Searching for Datasets
      1. University of California Berkeley
    7. Bill of Materials
      1. The Linux Foundation - Implementing AI Bill of Materials (AI BOM) with SPDX 3.0

General Data Concerns

Hugging Face: Training Data Transparency in AI: Tools, Trends, and Policy Recommendations

Blog post by Yacine Jernite.

A call for “minimum meaningful public transparency standards to support effective AI regulation.”

U.S. Department of Commerce

Generative Artificial Intelligence and Open Data: Guidelines and Best Practices (PDF). This guidance is intended for use by the department and its bureaus, but it is generally useful.

Note that it was published January 16, 2025, just before the end of the Biden administration. It is not clear whether these guidelines will be retained by the new administration.

The European Union AI Act - Data Implications

The European AI Office of the European Union has responsibility for implementing the AI Act, which “… is the first-ever legal framework on AI, which addresses the risks of AI and positions Europe to play a leading role globally.”

Open Future, in collaboration with the Mozilla Foundation, has authored a white paper called Sufficiently Detailed? A proposal for implementing the AI Act’s training data transparency specification for GPAI (general-purpose AI). This paper discusses a new specification requiring model developers to produce a sufficiently detailed summary of the content used for model training. The announcement says the following:

The purpose of the paper we are sharing today is twofold. It clarifies the categories of rights and legitimate interests that justify access to information about training data. In addition to copyright, these include, among others, privacy and personal data protection, scientific freedom, the prohibition of discrimination, and respect for cultural and linguistic diversity. Moreover, it provides a blueprint for the forthcoming template for the “sufficiently detailed summary,” which is intended to serve these interests while respecting the rights of all parties concerned.

FAIR Principles

Website

Quoting from the website:

In 2016, the FAIR Guiding Principles for scientific data management and stewardship were published in Scientific Data. The authors intended to provide guidelines to improve the Findability, Accessibility, Interoperability, and Reuse of digital assets. The principles emphasize machine-actionability (i.e., the capacity of computational systems to find, access, interoperate, and reuse data with none or minimal human intervention) because humans increasingly rely on computational support to deal with data as a result of the increase in volume, complexity, and creation speed of data.

Both data and metadata must be findable to be usable, for example through storage in known, indexed locations. Machine readability is essential for practical use.

Data must be accessible through established tools (e.g., web APIs), possibly including authentication and authorization.

Datasets often need to be combined, requiring interoperable tools, storage, etc.

Ultimately, the goal is to make data reusable, leading to requirements for clear licensing, provenance, and governance.

Finally, they define three types of entities: “data (or any digital object), metadata (information about that digital object), and infrastructure.” For instance, findable means that both metadata and data are registered or indexed in a searchable resource (an infrastructure component).
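As a sketch of how these principles translate into practice, the example below builds a minimal machine-readable metadata record touching each principle: a persistent identifier for findability, a standard retrieval protocol for accessibility, a common serialization format for interoperability, and explicit license and provenance fields for reusability. All field names and values here are hypothetical illustrations, not drawn from any particular metadata standard.

```python
import json

def make_dataset_metadata(identifier, title, access_url, license_id, provenance):
    """Build a minimal, machine-readable metadata record loosely
    following the FAIR principles. Field names are illustrative."""
    return {
        # Findable: a globally unique, persistent identifier
        "identifier": identifier,
        "title": title,
        # Accessible: a standard retrieval protocol and location
        "access": {"protocol": "https", "url": access_url},
        # Interoperable: a widely supported serialization format
        "format": "application/json",
        # Reusable: explicit license and provenance information
        "license": license_id,
        "provenance": provenance,
    }

record = make_dataset_metadata(
    identifier="doi:10.0000/example-dataset",  # hypothetical DOI
    title="Example Open Dataset",
    access_url="https://data.example.org/example-dataset.json",
    license_id="CC-BY-4.0",
    provenance={"creator": "Example Lab", "created": "2025-01-16"},
)

# Serialize the record so a searchable index (an infrastructure
# component, in FAIR terms) could register both data and metadata.
print(json.dumps(record, indent=2))
```

Note that the record describes both the data (via its access URL) and metadata about it, and is itself machine-actionable once registered in a searchable resource.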

Licensing and Attribution

A Large-scale Audit of Dataset Licensing and Attribution in AI

A large-scale audit of dataset licensing and attribution in AI is a Nature paper from MIT researchers and others. From a corresponding MIT News article, the paper describes their “… systematic audit of more than 1,800 text datasets on popular hosting sites. They found that more than 70 percent of these datasets omitted some licensing information, while about 50 percent had information that contained errors.

“Building off these insights, they developed a user-friendly tool called the Data Provenance Explorer that automatically generates easy-to-read summaries of a dataset’s creators, sources, licenses, and allowable uses.”

Data Provenance and Governance

Data Provenance Initiative

Website, GitHub

Their mission is to uncover the datasets used to train large language models. From their website:

The Data Provenance Initiative is a volunteer collective of AI researchers from around the world. We conduct large-scale audits of the massive datasets that power state-of-the-art AI models. We have audited over 4,000 popular text, speech, and video datasets, tracing them from origin to creation, cataloging data sources, licenses, creators, and other metadata, which researchers can examine using our Explorer tool. We recently analyzed 14,000 web domains, to understand the evolving provenance and consent signals behind AI data. The purpose of this work is to map the landscape of AI data, improving transparency, documentation, and informed use of data.

Data and Trust Alliance - Data Provenance Standards

The Data and Trust Alliance has defined a standard for provenance, as well as other projects.

Here is their statement about the purpose of this standard, quoted from the project web page:

For AI to create value for business and society, the data that trains and feeds models must be trustworthy.

Trust in data starts with transparency into provenance; assessing where data comes from, how it’s created, and whether it can be used, legally. Yet the ecosystem needs a common language to provide that transparency.

This is why we developed the first cross-industry data provenance standards.
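To make the idea of a common provenance language concrete, the sketch below records the kinds of facts such a standard asks for: where data came from, how it was created, and whether it can legally be used. The field names are invented for illustration only; they are not the Data and Trust Alliance’s actual schema.

```python
from dataclasses import dataclass, field, asdict

@dataclass
class ProvenanceRecord:
    """Illustrative provenance metadata for a dataset.
    Field names are hypothetical, not an official standard."""
    source: str             # where the data comes from
    collection_method: str  # how it was created or gathered
    license: str            # whether and how it can legally be used
    collected_at: str       # when the data was gathered
    lineage: list = field(default_factory=list)  # upstream datasets

record = ProvenanceRecord(
    source="https://data.example.org/raw-crawl",  # hypothetical source
    collection_method="web crawl honoring robots.txt",
    license="CC-BY-4.0",
    collected_at="2024-06",
    lineage=["example-common-crawl-subset"],  # hypothetical upstream set
)

# A consumer can inspect provenance before deciding to use the data.
print(asdict(record))
```

The point of a shared standard is that every party in the ecosystem reads and writes the same fields, so a record like this can travel with the dataset.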

Data Classification

Interactive Advertising Bureau Taxonomy

The Interactive Advertising Bureau (on GitHub) has defined a taxonomy (on GitHub) covering content, audience, and ad products (latest version: V3.1).

IBM watsonx Natural Language Processing Categories

IBM’s watsonx Natural Language Processing (NLP) system works with a defined taxonomy of categories.

Searching for Datasets

University of California Berkeley

It Took Longer than I was Expecting: Why is Dataset Search Still so Hard? analyzes why searching for datasets is harder than it might seem.

Bill of Materials

The Linux Foundation - Implementing AI Bill of Materials (AI BOM) with SPDX 3.0

A bill of materials (BOM) is a traditional concept that specifies, for producers and consumers, exactly which parts are contained in the whole. Bills of materials have been used in shipping, for example, for a very long time.

Software BOMs have the same goal: to state clearly which components a software artifact contains.

This Linux Foundation report discusses the concept in the context of AI. A quote from the website:

A Software Bill of Materials (SBOM) is becoming an increasingly important tool in regulatory and technical spaces to introduce more transparency and security into a project’s software supply chain.

Artificial intelligence (AI) projects face unique challenges beyond the security of their software, and thus require a more expansive approach to a bill of materials. In this report, we introduce the concept of an AI-BOM, expanding on the SBOM to include the documentation of algorithms, data collection methods, frameworks and libraries, licensing information, and standard compliance.
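As a sketch of what an AI-BOM adds beyond a conventional SBOM, the example below starts from an SBOM-style list of software components and extends it with the documentation the report calls for: algorithms, data collection methods, licensing, and standards compliance. This is an illustrative structure only; the field names are not the actual SPDX 3.0 schema.

```python
import json

# An SBOM-style component list: the software supply chain.
sbom_components = [
    {"name": "pytorch", "version": "2.3.0", "license": "BSD-3-Clause"},
    {"name": "numpy", "version": "1.26.4", "license": "BSD-3-Clause"},
]

# An AI-BOM extends this with model- and data-specific documentation.
# All field names below are illustrative, not SPDX 3.0 identifiers.
ai_bom = {
    "model_name": "example-classifier",       # hypothetical model
    "algorithms": ["transformer encoder"],
    "training_data": {
        "datasets": ["example-open-corpus"],  # hypothetical dataset
        "collection_method": "curated open text",
    },
    "frameworks_and_libraries": sbom_components,  # the SBOM portion
    "licensing": {"model": "Apache-2.0", "data": "CC-BY-4.0"},
    "standards_compliance": ["example-ai-governance-policy"],
}

print(json.dumps(ai_bom, indent=2))
```

The design point is that the software component list remains a strict subset of the AI-BOM, so existing SBOM tooling still has something to consume while the added fields document the data and model supply chain.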