References: What Others Are Saying About Trusted, Open Data
Here is an evolving list of writings from other sources about the importance of open, trusted data. Note that the opinions expressed do not necessarily reflect the views of the AI Alliance.
Help Wanted: If you have other references you like, please let us know by email (data@thealliance.ai), or edit this page!
A Large-scale Audit of Dataset Licensing and Attribution in AI
A large-scale audit of dataset licensing and attribution in AI is a paper in Nature Machine Intelligence from MIT researchers and others. A corresponding MIT News article describes their “… systematic audit of more than 1,800 text datasets on popular hosting sites. They found that more than 70 percent of these datasets omitted some licensing information, while about 50 percent had information that contained errors.
“Building off these insights, they developed a user-friendly tool called the Data Provenance Explorer that automatically generates easy-to-read summaries of a dataset’s creators, sources, licenses, and allowable uses.”
The European Union AI Act - Data Implications
The European AI Office of the European Union has responsibility for implementing the AI Act, which “… is the first-ever legal framework on AI, which addresses the risks of AI and positions Europe to play a leading role globally.”
Open Future, in collaboration with the Mozilla Foundation, has authored a white paper called Sufficiently Detailed? A proposal for implementing the AI Act’s training data transparency specification for GPAI (general-purpose AI). This paper discusses the new requirement for model developers to produce a sufficiently detailed summary of the content used for model training. The announcement says the following:
The purpose of the paper we are sharing today is twofold. It clarifies the categories of rights and legitimate interests that justify access to information about training data. In addition to copyright, these include, among others, privacy and personal data protection, scientific freedom, the prohibition of discrimination, and respect for cultural and linguistic diversity. Moreover, it provides a blueprint for the forthcoming template for the “sufficiently detailed summary,” which is intended to serve these interests while respecting the rights of all parties concerned.
The Linux Foundation - Implementing AI Bill of Materials (AI BOM) with SPDX 3.0
A bill of materials is a traditional concept used to specify for producers and consumers exactly what parts are contained in the whole. Bills of materials have been used in manufacturing and shipping for a very long time, for example.
Software BOMs have the same goal: to state clearly which components a software artifact contains.
This Linux Foundation report discusses the concept in the context of AI. A quote from the website:
A Software Bill of Materials (SBOM) is becoming an increasingly important tool in regulatory and technical spaces to introduce more transparency and security into a project’s software supply chain.
Artificial intelligence (AI) projects face unique challenges beyond the security of their software, and thus require a more expansive approach to a bill of materials. In this report, we introduce the concept of an AI-BOM, expanding on the SBOM to include the documentation of algorithms, data collection methods, frameworks and libraries, licensing information, and standard compliance.
Data Provenance Initiative
Their mission is to uncover the datasets used to train large language models. From their website:
The Data Provenance Initiative is a volunteer collective of AI researchers from around the world. We conduct large-scale audits of the massive datasets that power state-of-the-art AI models. We have audited over 4,000 popular text, speech, and video datasets, tracing them from origin to creation, cataloging data sources, licenses, creators, and other metadata, which researchers can examine using our Explorer tool. We recently analyzed 14,000 web domains, to understand the evolving provenance and consent signals behind AI data. The purpose of this work is to map the landscape of AI data, improving transparency, documentation, and informed use of data.
Hugging Face: Training Data Transparency in AI: Tools, Trends, and Policy Recommendations
A call for “minimum meaningful public transparency standards to support effective AI regulation.”