How We Process Datasets
Provenance and Governance
Given the importance of provenance and governance for the datasets in this initiative, we plan to analyze proposed datasets to ensure they meet our dataset specification. Derived datasets produced by various forms of filtering are also planned, as discussed below.
We will publish the technical details of these processes as they are developed. We will open source all source code and deployment information for these pipelines under the AI Alliance standard code license: Apache 2.0. (See the Alliance community/CONTRIBUTING page for more details about our license conventions.)
Data Quality and “Cleanliness”
In the Dataset Specification, we described several levels of quality and cleanliness that guide how we categorize datasets in our catalog. Think of the following as a rough outline of our ingestion and processing steps (a sketch of how these levels might be recorded in catalog metadata follows the list):
- Raw: The dataset as submitted. Our most important criterion at this stage is unambiguous provenance. Raw datasets may contain some objectionable content, but appropriate labels and usage guidance will be provided. For example, a dataset with hate speech may be suitable for use by researchers studying hate speech and working on detectors for it, but model developers may decide to avoid the dataset. However, in some cases, legal or other considerations may prevent us from accepting some content without additional filtering.
- Filtered: A dataset created by passing a raw dataset through a processing pipeline to perform modifications such as removal of duplicate data, filtering out objectionable content, etc.
- Structured: A dataset created from a filtered dataset where the new structure is more suitable for model training (LLMs, time series, etc.), RAG usage, tuning, and other purposes. For example, JSON-formatted data is often desirable.
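As an illustration of how these levels might be recorded, the following minimal Python sketch tags a catalog entry with its processing level. The class and field names are hypothetical, not the actual catalog schema:

```python
# Illustrative sketch only: the level names follow the text above, but the
# field names and catalog schema here are hypothetical, not the OTDI schema.
from dataclasses import dataclass
from enum import Enum


class ProcessingLevel(str, Enum):
    RAW = "raw"                # as submitted; provenance verified
    FILTERED = "filtered"      # deduplicated, cleaned, objectionable content removed
    STRUCTURED = "structured"  # reshaped for training, RAG, tuning, etc.


@dataclass
class CatalogEntry:
    name: str
    level: ProcessingLevel
    derived_from: str | None = None  # name of the parent dataset, if any


entry = CatalogEntry(
    name="example-corpus-filtered",
    level=ProcessingLevel.FILTERED,
    derived_from="example-corpus-raw",
)
print(entry)
```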
How We Process Datasets - Proposed
To go from Raw to Filtered, we currently plan to use processes with the following checks and filtering steps. These lists will mature over time:
Raw Data Ingestion
An initial quality analysis is performed, including the following checks (a minimal sketch of such checks follows the list):
- Meets the Dataset Specification - e.g., license, provenance, etc.
- No evident corruption - e.g., PDFs, JSON, etc. have valid formats.
- No detectable inconsistencies between the data and the dataset card metadata.
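As an illustration only, a minimal sketch of such checks might look like the following. The file layout, required card fields (`license`, `provenance`), and helper names are assumptions for the example, not our actual pipeline code:

```python
# Illustrative sketch of basic ingestion checks; paths, card field names, and
# the surrounding pipeline are hypothetical assumptions, not the real tooling.
import json
from pathlib import Path

REQUIRED_CARD_FIELDS = {"license", "provenance"}  # assumed minimal metadata


def check_dataset_card(card_path: Path) -> list[str]:
    """Return a list of problems found in the dataset card, empty if none."""
    try:
        card = json.loads(card_path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError) as exc:
        return [f"cannot read dataset card: {exc}"]
    if not isinstance(card, dict):
        return ["dataset card is not a JSON object"]
    missing = REQUIRED_CARD_FIELDS - card.keys()
    return [f"dataset card is missing fields: {sorted(missing)}"] if missing else []


def check_json_files(data_dir: Path) -> list[str]:
    """Flag JSON data files that do not parse (a simple corruption check)."""
    problems = []
    for path in data_dir.glob("*.json"):
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError) as exc:
            problems.append(f"{path.name}: {exc}")
    return problems


if __name__ == "__main__":
    issues = check_dataset_card(Path("dataset_card.json"))
    issues += check_json_files(Path("data"))
    for issue in issues:
        print("INGESTION CHECK FAILED:", issue)
```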
Creating a Filtered Dataset
There could be several filtered datasets derived from a single raw dataset, each of which would use one or more of the following transformations (a few of which are sketched after the list):
- Exact and “fuzzy” deduplication
- Removal of low-quality content (e.g., HTML tags)
- PII removal
- Removal of copyrighted data (where detectable)
- Removal of data covered by non-open access licenses (where detectable)
- Toxic content removal (e.g., bias, hate speech, etc.)
- Decontamination from known, public datasets for benchmarks and other evaluations
- Other consistency and quality improvements
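To make the flavor of these transformations concrete, here is a minimal, illustrative sketch of exact deduplication, HTML tag stripping, and naive PII redaction over a list of text records. The regexes are deliberately simplistic assumptions; a production pipeline would rely on much more robust tooling:

```python
# Minimal, illustrative filtering steps; the regexes are deliberately simple
# assumptions and would be replaced by more robust tools in a real pipeline.
import hashlib
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")  # crude PII example
TAG_RE = re.compile(r"<[^>]+>")                      # crude HTML tag removal


def strip_html(text: str) -> str:
    return TAG_RE.sub(" ", text)


def redact_pii(text: str) -> str:
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)


def exact_dedup(records: list[str]) -> list[str]:
    """Drop records whose normalized text hashes to an already-seen value."""
    seen: set[str] = set()
    unique = []
    for record in records:
        digest = hashlib.sha256(record.strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique


def filter_records(records: list[str]) -> list[str]:
    cleaned = [redact_pii(strip_html(r)) for r in records]
    return exact_dedup(cleaned)


print(filter_records([
    "<p>Contact me at jane@example.com</p>",
    "Contact me at jane@example.com",   # duplicate once the tags are stripped
    "A distinct record.",
]))
```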
Creating a Structured Dataset
The transformations to create one or more structured datasets from a filtered dataset may include the following (see the sketch after this list):
- Tokenization
- Conversion to JSON, YAML, or other format
- Conversion of PDFs and other “rich” formats to text and images
- Embedding - encoding with an embedding model and chunking for use in RAG and similar patterns
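As one illustration, the following sketch converts filtered text records into JSON Lines and splits them into fixed-size chunks of the kind typically fed to an embedding model for RAG. The chunk size, record fields, and file names are arbitrary assumptions, and the embedding step itself is only indicated in a comment rather than tied to any particular model:

```python
# Illustrative structuring step: chunk filtered text and write JSON Lines.
# Chunk size, record fields, and file names are assumptions for the example.
import json
from pathlib import Path

CHUNK_CHARS = 1000  # simplistic fixed-size chunking; real pipelines often
                    # chunk by tokens, sentences, or document structure


def chunk_text(text: str, size: int = CHUNK_CHARS) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]


def write_jsonl(records: list[str], out_path: Path) -> None:
    with out_path.open("w", encoding="utf-8") as f:
        for doc_id, text in enumerate(records):
            for chunk_id, chunk in enumerate(chunk_text(text)):
                row = {"doc_id": doc_id, "chunk_id": chunk_id, "text": chunk}
                f.write(json.dumps(row, ensure_ascii=False) + "\n")
                # An embedding step would typically follow here, encoding each
                # chunk with an embedding model and storing the vectors in a
                # vector index for RAG-style retrieval.


write_jsonl(["A long filtered document ... " * 200], Path("structured.jsonl"))
```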
For All Processing
All ingestion and transformation steps will include full auditing to support our data governance specification, so that provenance and lineage back to the original sources are tracked, with full visibility available to users of the datasets. Each dataset will have its governance metadata in its own dataset card that is publicly available with the dataset. For example, interested parties can use it to create Bills of Materials (see here).
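As a purely illustrative sketch of the kind of lineage record such auditing could produce, the following appends one entry per transformation step to a dataset card's governance metadata. All field names are hypothetical and do not reflect the actual dataset card or governance schema:

```python
# Hypothetical lineage/audit record; field names are illustrative only and do
# not reflect the actual dataset card or governance schema.
import json
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone


@dataclass
class LineageStep:
    step: str              # e.g., "exact_dedup", "pii_removal"
    input_dataset: str     # identifier of the dataset consumed
    output_dataset: str    # identifier of the dataset produced
    code_version: str      # e.g., a git commit hash of the pipeline code
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def append_lineage(card_path: str, step: LineageStep) -> None:
    """Append a lineage entry to the governance section of a dataset card."""
    with open(card_path, encoding="utf-8") as f:
        card = json.load(f)
    card.setdefault("governance", {}).setdefault("lineage", []).append(asdict(step))
    with open(card_path, "w", encoding="utf-8") as f:
        json.dump(card, f, indent=2)


# Example: record that a filtered dataset was derived from a raw one.
# append_lineage("dataset_card.json", LineageStep(
#     step="pii_removal",
#     input_dataset="example-corpus-raw",
#     output_dataset="example-corpus-filtered",
#     code_version="abc1234",
# ))
```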