
Datasets for Different Modalities

Modalities include text and video, as well as widely applicable concepts such as data formats and how the data was collected or transformed from other data (e.g., see text-to-...), and general usage guidance, such as whether the data is intended for pretraining, reinforcement learning, chain of thought, etc.

Keywords

3D, Agents, Alignment, Arrow, Arxiv, Audio, Benchmark, Classification, Chain Of Thought, Chat, Crowd Sourced, CSV, Embeddings, Evaluation, Fine Tuning, Generated Data, Feature Extraction, Graph, Handwritten, Image, Instruction Following, LLM, JSON, Monolingual, Multi Lingual, Multimodal, Multiple Choice, Named Entity Recognition, News, NLP, Planning, Pretraining, Problem Solving, Prompt, Question Answering, RAG, Reasoning, Regression, Reinforcement Learning, Safety, Search, Security, Sentence Similarity, Sentence Transformers, Sentiment Analysis, Speech, Summarization, Tabular, Retrieval, Text To …, To Text, Translation, Tutorial, Unlearning, Video, Vision, Wikipedia

Datasets for the Modality Keywords

3D (keyword: 3d)

Three-dimensional data.

This set includes the following additional keywords: depth-estimation, image-to-3d, text-to-3d

Click a row to see the description. Use the line below the table to resize it. See About These Datasets for important details.
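Each set below pairs one top-level keyword with a list of additional keywords. As a rough illustration of how such a grouping could be applied to a dataset's tags, here is a minimal Python sketch. The names `KEYWORD_SETS` and `matching_sets` are hypothetical and not part of this catalog's tooling, only three sets are copied in, and the case-insensitive exact-match rule is an assumption.

```python
# Hypothetical sketch: map a dataset's tags to the keyword sets listed on this page.
# Only a few sets are copied here; the matching rule (exact, case-insensitive) is assumed.
KEYWORD_SETS = {
    "3d": {"depth-estimation", "image-to-3d", "text-to-3d"},
    "chain-of-thought": {"cot"},
    "json": {"jsonl"},
}


def matching_sets(tags):
    """Return the sorted top-level keywords whose set contains any of the given tags."""
    normalized = {tag.strip().lower() for tag in tags}
    return sorted(
        key
        for key, extras in KEYWORD_SETS.items()
        if key in normalized or normalized & extras
    )


if __name__ == "__main__":
    print(matching_sets(["Text-to-3D", "jsonl"]))
```

A dataset tagged with an additional keyword such as `jsonl` would then appear under the JSON set even if it never uses the `json` tag itself.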


Agents (keyword: agents)

This set includes the following additional keywords: agent, downstream-task, downstream-tasks, function-calling, language-agent


Alignment (keyword: alignment)

This set includes the following additional keywords: acceptability-classification, alignment-lab-ai, explainability, fairness, grounding, hallucination, relevance


Arrow (keyword: arrow)

Arrow-formatted data.


Arxiv (keyword: arxiv)

References to arXiv articles. (There are many keywords starting with arxiv:.)


Audio (keyword: audio)

This set includes the following additional keywords: audio-classification, audio-to-audio, speaker-identification, text-to-audio, voice, voice-activity-detection


Benchmark (keyword: benchmark)

Datasets associated with benchmarks of any kind.

This set includes the following additional keywords: alignment, aveni-bench, benchmarks, gsm8k, mteb, nli, test, testing


Chain Of Thought (keyword: chain-of-thought)

This set includes the following additional keywords: cot


Chat (keyword: chat)

This set includes the following additional keywords: argument, argumentation, chat-dataset, conversation, conversational, conversational-ai, conversations, debate, dialog, dialogue, dialogue-modeling, discussion, fictitious dialogues, multiple-turn-dialogue, roleplay, role-play


Classification (keyword: classification)

All aspects of classification: text, images, etc.

This set includes the following additional keywords: acceptability-classification, audio-classification, entity-linking-classification, image-classification, intent-classification, multi-class-classification, multi-class-image-classification, multi-input-text-classification, multi-label-classification, multi-label-image-classification, segmentation, semantic-segmentation, semantic-similarity-classification, semantic-similarity-scoring, sentiment-classification, sentiment-scoring, tabular-classification, tabular-multi-class-classification, tabular-multi-label-classification, text-classification, text-scoring, token classification, token-classification, topic-classification, video-classification, zero-shot-classification, zero-shot-image-classification


Crowd Sourced (keyword: crowdsourced)


CSV (keyword: csv)

CSV-formatted data.


Embeddings (keyword: embeddings)

This set includes the following additional keywords: embedding


Evaluation (keyword: evaluation)

This set includes the following additional keywords: eval, quality


Feature Extraction (keyword: feature-extraction)

This set includes the following additional keywords: image-feature-extraction


Fine Tuning (keyword: finetuning)

Post-training refinement of models for alignment, safety, etc.

This set includes the following additional keywords: finetune, fine-tune, fine-tuning, instruct, instruction-finetuning, instruction-fine-tuning, instruction-following, instruction tuning, instruction-tuning, preference, preferences, sft, structured-fine-tuning


Generated Data (keyword: generated-data)

Datasets that were generated by humans or automation.

This set includes the following additional keywords: ai-generated, conditional-text-generation, code-generation, dialog-generation, explanation-generation, generation, generated, expert-generated, machine-generated, ocr, text generation, text-generation, text2text-generation, synthetic, synthetic-captions, synthetic-data, synthetic-dataset, synthgenai


Graph (keyword: graph)

This set includes the following additional keywords: graphs, graph-ml, knowledge graph, knowledge-graph, knowledge graphs, knowledge-graphs


Handwritten (keyword: handwritten)


Image (keyword: image)

Datasets of images and analysis of them, such as object detection.

This set includes the following additional keywords: anime, chart, caption, danbooru, diagram, geometry-diagram, images, image-captioning, image-captions, image-caption pairs, image-caption-pairs, image classification, image-classification, image-data, image-feature-extraction, image-generation, image-segmentation, image-text-dataset, image-text-to-text, image-to-image, image-to-text, image-to-video, multi-class-image-classification, object detection, object-detection, photo, photos, photograph, photographs, scientific-figure, super-resolution, text-to-image, unconditional-image-generation


Instruction Following (keyword: instruction-following)

This set includes the following additional keywords: instruct, instruction, instruction-finetuning, instruction-fine-tuning, instruction-tuning, multiturn, multi-turn


JSON (keyword: json)

JSON-formatted data.

This set includes the following additional keywords: jsonl


LLM (keyword: llm)

This set includes the following additional keywords: alpaca, large-language-model, large-language-models, language model, language-modeling, llms, masked-language-modeling


Monolingual (keyword: monolingual)

Primarily one language.


Multi Lingual (keyword: multilingual)

Datasets with more than one language.

This set includes the following additional keywords: machine translation, multi-lingual, squad_v2_french_translated, translated


Multimodal (keyword: multimodal)

This set includes the following additional keywords: multimodality, multi-modal, multi-modal-qa


Multiple Choice (keyword: multiple-choice)


Named Entity Recognition (keyword: named-entity-recognition)


News (keyword: news)

This set includes the following additional keywords: news-articles-summarization


NLP (keyword: nlp)

This set includes the following additional keywords: explanation, explanation-generation, natural-language-inference, natural-language-processing, natural-language-understanding


Planning (keyword: planning)


Pretraining (keyword: pretraining)

Training of foundation models, before (or 'pre') tuning for alignment, safety, etc.

This set includes the following additional keywords: long context, long-context, distillation, pretrain, preservation-loss-training


Problem Solving (keyword: problem-solving)


Prompt (keyword: prompt)

This set includes the following additional keywords: dfp, french prompts, prompts, prompt engineering, prompt-generation


Question Answering (keyword: question-answering)

Datasets with question and answer pairs and related content.

This set includes the following additional keywords: abstractive-qa, camel, closed-book-qa, closed-domain-qa, document-question-answering, extractive-qa, Figure Q&A, Math Q&A, multiple-choice-qa, multi-modal-qa, open-domain-qa, open-book-qa, q-and-a, qa, qna, q&a, questions, question-generation, table-question-answering, visual-question-answering, vqa


RAG (keyword: rag)

Retrieval-Augmented Generation.

This set includes the following additional keywords: retrieval augmented generation, retrieval-augmented-generation


Reasoning (keyword: reasoning)

Datasets to improve a model's ability to reason.

This set includes the following additional keywords: reflection, step-by-step, logical-reasoning, mathematical-reasoning


Regression (keyword: regression)

This set includes the following additional keywords: tabular-regression


Reinforcement Learning (keyword: reinforcement-learning)

This set includes the following additional keywords: dpo, expert trajectory, human-feedback, rl, rlhf, rlaif


Retrieval (keyword: retrieval)

This set includes the following additional keywords: document-retrieval, entity-linking-retrieval, fact-checking, fact-checking-retrieval, information-retrieval, text-retrieval


Safety (keyword: safety)

This set includes the following additional keywords: deepfake, deep-fake, fairness, hallucination, hate-speech, hate-speech-detection, misinformation, red-teaming, toxicity


Search (keyword: search)

This set includes the following additional keywords: codesearchnet, search-queries, semantic-search


Security (keyword: security)

This set includes the following additional keywords: cybersecurity, jailbreak, red-teaming
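Some additional keywords appear in more than one set. For example, red-teaming is listed under both Safety and Security, and fairness and hallucination are listed under both Alignment and Safety. The Python sketch below builds a reverse index to surface such overlaps. The names `KEYWORD_SETS` and `shared_tags` are hypothetical, and only the three sets mentioned here are copied in, so this is an illustration rather than the catalog's actual tooling.

```python
from collections import defaultdict

# Hypothetical sketch: three keyword sets copied from this page.
KEYWORD_SETS = {
    "alignment": [
        "acceptability-classification", "alignment-lab-ai", "explainability",
        "fairness", "grounding", "hallucination", "relevance",
    ],
    "safety": [
        "deepfake", "deep-fake", "fairness", "hallucination", "hate-speech",
        "hate-speech-detection", "misinformation", "red-teaming", "toxicity",
    ],
    "security": ["cybersecurity", "jailbreak", "red-teaming"],
}


def shared_tags(keyword_sets):
    """Return {tag: sorted owning sets} for tags listed in more than one set."""
    owners = defaultdict(set)
    for key, extras in keyword_sets.items():
        for tag in extras:
            owners[tag].add(key)
    return {tag: sorted(keys) for tag, keys in owners.items() if len(keys) > 1}


if __name__ == "__main__":
    for tag, keys in sorted(shared_tags(KEYWORD_SETS).items()):
        print(f"{tag}: {', '.join(keys)}")
```

A dataset carrying one of these shared tags would be expected to show up in each of the corresponding sets.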


Sentence Similarity (keyword: sentence-similarity)


Sentence Transformers (keyword: sentence-transformers)


Sentiment Analysis (keyword: sentiment-analysis)

This set includes the following additional keywords: emotion, emotions, sentiment-classification, sentiment, sentiments


Speech (keyword: speech)

This set includes the following additional keywords: automatic-speech-recognition, grammar, hate-speech, hate-speech-detection, linguistics, parts-of-speech, sarcasm-detection, speech-detection, speech-recognition, text-to-speech


Summarization (keyword: summarization)

This set includes the following additional keywords: news-articles-summarization, paraphrase, paraphrase-identification, summary, text-simplification


Tabular (keyword: tabular)

Data in table formats.

This set includes the following additional keywords: table, table-to-text


Text To ... (keyword: text-to-...)

Generating images, videos, etc. from text.

This set includes the following additional keywords: image-text-to-text, text-to-audio, text-to-image, text-to-speech, text-to-sql, Text to Video, text-to-video, video-text-to-text


To Text (keyword: to-text)

Datasets for generating text from different data sources.

This set includes the following additional keywords: data-to-text, image-caption pairs, image-caption-pairs, image-text-to-text, image-to-text, table-to-text, video-text-to-text, video-to-text


Translation (keyword: translation)

This set includes the following additional keywords: machine translation, translated


Tutorial (keyword: tutorial)


Unlearning (keyword: unlearning)

This set includes the following additional keywords: tofu


Video (keyword: video)

This set includes the following additional keywords: drone, image-to-video, likert, lvlm, movie, movies, synthetic-captions, Text to Video, text-to-video, video-classification, video-text-to-text, video-to-text, vision-language, vlm, vlms, youtube


Vision (keyword: vision)

This set includes the following additional keywords: computer-vision, computer vision


Wikipedia (keyword: wikipedia)

This set includes the following additional keywords: nanodbpedia, extended, wikipedia, wiki, wikidata, wikimedia/wit_base, wikisql
