
How We Process Datasets


Data Quality and “Cleanliness”

In Dataset Requirements, we described several levels of quality and cleanliness that we use to categorize datasets. Think of these levels as a rough outline of our ingestion process (a short sketch follows the list):

  • Raw: The dataset as submitted, which could already be in good shape. Our most important criterion at this stage is unambiguous provenance. Nevertheless, datasets that contain objectionable content with legal implications, such as some forms of PII and company-confidential information, may have to be rejected outright.
  • Filtered: A raw dataset that has gone through our processing pipeline to remove duplicates, filter for objectionable content, etc.
  • Structured: A filtered dataset that has been reformatted to be most suitable for model training (LLMs, time series, etc.), RAG patterns, and similar purposes. For example, JSON-formatted data is often desirable.
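
To make these levels concrete, here is a minimal sketch of how a dataset's level and provenance might be tracked during ingestion. The `QualityLevel` and `DatasetRecord` names are hypothetical illustrations, not part of any published tooling.

```python
from dataclasses import dataclass
from enum import Enum


class QualityLevel(Enum):
    """The three quality levels described above."""
    RAW = "raw"                 # as submitted; provenance must be unambiguous
    FILTERED = "filtered"       # deduplicated, objectionable content removed
    STRUCTURED = "structured"   # reformatted for training, RAG, etc.


@dataclass
class DatasetRecord:
    """Hypothetical record tracking a dataset through ingestion."""
    name: str
    provenance: str             # e.g., source URL or submitter attestation
    level: QualityLevel = QualityLevel.RAW
```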

How We Process Datasets

To go from Raw to Filtered, we use a process with the following checks, which will evolve over time (a sketch of a few of them follows the list):

  • An initial quality check:
    • Acceptable format
    • Not corrupted (e.g., a valid PDF)
  • Filtering:
    • Duplicate removal
    • Low-quality data removal (e.g., stripping leftover HTML tags)
    • PII removal
    • Copyright data removal (where feasible)
    • Toxic content removal
    • Bias detection and mitigation
    • Decontamination against known evaluation and benchmark datasets
    • License verification (where feasible, detecting data known to be covered by a different, incompatible license)
    • Other consistency and quality improvements
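
Below is a minimal, illustrative sketch of three of these checks: a cheap format validation, exact-duplicate removal, and a naive PII redaction pass. It is not our production pipeline; the function names and the single email regex are simplifying assumptions (real PII detection covers far more than email addresses).

```python
import hashlib
import re

# Naive email pattern, for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def looks_like_valid_pdf(raw: bytes) -> bool:
    """Cheap format check: a valid PDF starts with the '%PDF-' magic bytes."""
    return raw.startswith(b"%PDF-")


def redact_pii(text: str) -> str:
    """Replace email addresses with a placeholder token."""
    return EMAIL_RE.sub("[EMAIL]", text)


def filter_records(records):
    """Yield deduplicated, PII-redacted records from an iterable of strings."""
    seen = set()  # content hashes of records already emitted
    for record in records:
        digest = hashlib.sha256(record.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate; drop it
        seen.add(digest)
        yield redact_pii(record)


if __name__ == "__main__":
    sample = [
        "Contact me at jane@example.com",
        "Contact me at jane@example.com",  # duplicate, will be dropped
        "Some clean text",
    ]
    print(list(filter_records(sample)))
    # ['Contact me at [EMAIL]', 'Some clean text']
```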

The transformations to create Structured datasets may include one or more of the following (a sketch follows the list):

  • Tokenization
  • Conversion to JSON or YAML
  • “Chunkification” (e.g., for use in RAG data stores)
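
As a sketch of the last two transformations, the following converts a document into overlapping chunks and serializes them as JSON Lines, a common format for RAG data stores. The chunk size, overlap, and field names are arbitrary examples, and real pipelines usually chunk by tokens rather than characters.

```python
import json


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character-based chunks."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]


def to_jsonl(doc_id: str, text: str) -> str:
    """Serialize each chunk as one JSON object per line (JSON Lines)."""
    lines = []
    for n, chunk in enumerate(chunk_text(text)):
        lines.append(json.dumps({"doc_id": doc_id, "chunk": n, "text": chunk}))
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_jsonl("example-001", "some long document text " * 100))
```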

All steps include auditing of provenance and lineage, with full visibility available to users of the datasets.
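
As an illustration of what such lineage auditing could look like, each processing step might append an entry to an audit log that ties the step's input to its output by content hash, so dataset users can verify the full chain. The field names below are hypothetical:

```python
import hashlib
import json
from datetime import datetime, timezone


def lineage_entry(step: str, input_bytes: bytes, output_bytes: bytes) -> str:
    """Build one append-only audit-log entry linking a step's input and
    output by content hash."""
    return json.dumps({
        "step": step,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
        "output_sha256": hashlib.sha256(output_bytes).hexdigest(),
    })


if __name__ == "__main__":
    print(lineage_entry("pii_removal",
                        b"raw dataset bytes",
                        b"filtered dataset bytes"))
```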