The Dataset Catalog
About This Catalog
The tables in this catalog list the metadata for Hugging Face-hosted datasets that were gathered as follows:
- The tables reflect a snapshot of the datasets as of June 5th, 2025. (Periodic updates are planned.)
- Of the approximately 413,000 Hugging Face datasets, about 329,000 have queryable Croissant metadata.
- Many open datasets require you to request permission before you can use them, or even query their Croissant metadata, so they are not in our catalog tables. However, some of them are listed separately in our Contributors and Other Datasets pages, along with some datasets that are not available on Hugging Face.
- Of the 329,000 datasets, we discarded those with no license specified, leaving just 77,000!
- Licenses are specified as choosealicense.com/licenses/ URLs. Unfortunately, about 17,000 datasets use undefined (“404”) URLs. We discarded those datasets, leaving 60,000 (see note 1 below).
- The groupings are based on the presence of relevant keywords. Note that all the datasets list their language as “en” (English), but many have keywords for other languages. Those keywords are the basis for the Languages tables (including the one for English!).
- All keywords were converted to lower case before “grouping”.
- When a section for a keyword lists additional keywords, it means we grouped together different keywords that we believe are related to the same topic, including synonyms. (Please point out any errors!) In these cases, we also show a Keyword column in the corresponding tables, so you can see which keyword was used to include the dataset. (This also means that occasionally some datasets will be listed more than once in their table.)
- Important: At this time, we are not yet validating datasets to ensure their metadata accurately reflect the data records themselves.
1: Some of the bad license links clearly intend to reference known licenses. We’ll revisit those cases.
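The filtering and grouping steps above can be sketched roughly as follows. This is a minimal illustration, not the catalog's actual pipeline: the record shape (`name`, `license`, `keywords`), the small `KNOWN_LICENSE_SLUGS` set, and the `KEYWORD_TOPICS` synonym map are all assumptions made for the example. In particular, a real run would need to check whether each license URL actually resolves rather than compare against a hard-coded set.

```python
from urllib.parse import urlparse

# Assumption: a tiny sample of license slugs that resolve on choosealicense.com.
# A real pipeline would use the full set of pages (or request each URL).
KNOWN_LICENSE_SLUGS = {"mit", "apache-2.0", "cc-by-4.0"}

# Assumption: a hypothetical synonym map from lower-case keyword to grouped topic.
KEYWORD_TOPICS = {
    "persian": "persian",
    "farsi": "persian",
    "iranian persian": "persian",
}


def license_slug(url):
    """Return the slug of a choosealicense.com/licenses/<slug>/ URL, else None."""
    if not url:
        return None
    parsed = urlparse(url)
    parts = [part for part in parsed.path.split("/") if part]
    if parsed.netloc.lower() != "choosealicense.com":
        return None
    if len(parts) != 2 or parts[0] != "licenses":
        return None
    return parts[1].lower()


def has_usable_license(record):
    """True when the record names a license URL whose slug is in the known set."""
    return license_slug(record.get("license")) in KNOWN_LICENSE_SLUGS


def catalog_rows(records):
    """Yield one table row per (dataset, matching keyword) pair."""
    for record in records:
        if not has_usable_license(record):
            continue  # no license, or a license URL that is not recognized
        # Lower-case and de-duplicate keywords before grouping.
        keywords = sorted({kw.lower() for kw in record.get("keywords", [])})
        for keyword in keywords:
            topic = KEYWORD_TOPICS.get(keyword)
            if topic is not None:
                yield {"topic": topic, "keyword": keyword, "name": record["name"]}


# Hypothetical sample records, not real datasets.
records = [
    {"name": "example/a",
     "license": "https://choosealicense.com/licenses/mit/",
     "keywords": ["Farsi", "Persian"]},
    {"name": "example/b", "license": None, "keywords": ["persian"]},
    {"name": "example/c",
     "license": "https://choosealicense.com/licenses/not-a-license/",
     "keywords": ["persian"]},
]

for row in catalog_rows(records):
    print(row)
```

In this sketch only `example/a` survives the license checks, and it appears twice in the grouped output, once per matching keyword. That mirrors why some datasets are listed more than once in a table that has a Keyword column.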
Do you know of any datasets that should be shown, but aren’t? Let us know!
The Current Keywords Cataloged
Datasets For Languages
Subcategories
- African Languages
- Languages in the Americas
- Asian Languages
- European Languages
- Languages in the Middle East
- Languages of the Pacific Islands and Nations
Keywords
African Languages
Subcategories
Keywords
Afar Afrikaans Akan Amharic Baatonum Bambara Bemba (zambia) Ber Birwa Central kanuri Chichewa Chokwe Cwi bwamu Dagbani Dinka Dyula Egyptian (ancient) Fanti Fulah Ganda Ghomálá’ Hausa Herero Igbo Kabiyè Kabuverdianu Kabyle Kachin Kamba (kenya) Kanuri Kikuyu Kimbundu Kinyarwanda Ko Kongo Koyraboro senni songhai Kutu Kwere Lingala Luba Katanga Luba Lulua Luo (kenya and tanzania) Makhuwa Makonde Malagasy Mamara senoufo Mossi N’ko Ndonga Nigerian fulfulde Nigerian pidgin Ndebele Nuer Nyankole Oromo Plateau malagasy Rundi Sango Sar Seselwa creole french Shona Somali Suba Swahili Susu Swati Tachelhit Tamasheq Tamazight Tigrigna Tsonga Setswana Tumbuka Umbundu Venda West central oromo Wolaytta Wolof Xhosa Yoruba Zulu
Languages in the Americas
Subcategories
Keywords
Achuar Shiwiar Algonquin Arabela Asháninka Aymara Bora Candoshi Shapra Caquinte Caribbean hindustani Cashibo Cacataibo Cashinahua Central aymara Central bikol Central bontok Central mazahua Chachi Chayahuita Cherokee Chimborazo highland quichua Chácobo Cofán Cree Eastern huasteca nahuatl Eastern maroon creole Galibi carib Garifuna Guarani Haitian Highland puebla nahuatl Huastec Huichol Imbabura highland quichua Inuktitut Inupiaq Ixil Jamaican creole english K’iche’ Kalaallisut Kaqchikel Kekchí Mapudungun Mezquital otomi Mi’kmaq Murui huitoto Navajo Ngäbere Nomatsiguenga Orpo Quechua Papantla totonac Purepecha Saramaccan Sharanahua Shipibo Conibo Shuar Siona Sirionó Sranan tongo Ticuna Tzotzil Wayuu Yine Yosondúa mixtec Yucateco Zapotec
Asian Languages
Subcategories
Keywords
Abkhaz Altai Amis Angika Armenian Assamese Avaric Awadhi Azerbaijani Balinese Balochi Bashkir Bengali Bishnupriya Bodo (india) Bolinao Burmese Carpathian romani Cebuano Central kurdish Chechen Chhattisgarhi Chinese Chuvash Crimean tatar Dari Dhivehi Dimli (individual language) Divehi Dogri (macrolanguage) Dzongkha Eastern tamang Erzya Farsi Filipino Gilaki Goan konkani Gujarati Hakha chin Hakka chinese Halh mongolian Hindi Hinglish Iloko Ingush Iranian persian Japanese Kalmyk Kankanaey Kannada Karachay Balkar Karelian Kashmiri Kazakh Khmer Kirghiz Kirmanjki (individual language) Komi Korean Kurdish Kyrgyz Lak Lao Lezghian Limbu Lushai Magahi Maithili Malay Malayalam Manipuri Mansi Marathi Mari (russia) Mazanderani Mingrelian Mongolian Nepali Nepali (individual language) Newari North azerbaijani Northern kurdish Northern uzbek Odia Oriya Oriya (macrolanguage) Ossetian Pampanga Pangasinan Panjabi Pashto Persian Russia buriat Sanskrit Santali Saraiki Sediq Shan Sindhi Sinhala South azerbaijani Tagalog Tajik Tamil Tatar Telugu Thai Tibetan Turkish Turkmen Tuvan Udmurt Uyghur Uzbek Vietnamese Waray (philippines) Western bukidnon manobo Yakut
European Languages
Subcategories
Keywords
Adyghe Albanian Aragonese Arpitan Asturian Basque Bavarian Belarusian Bosnian Breton Bulgarian Catalan Cornish Corsican Croatian Czech Danish Dutch English Esperanto Estonian Faroese Finnish French Frisian Friulian Gagauz Galician German Georgian Greek Hungarian Icelandic Ido Irish Italian Kashubian Kölsch Ladin Ladino Latgalian Latin Latvian Ligurian Limburgan Limburgish Lithuanian Liv Livvi Lombard Low german Lower sorbian Luxembourgish Macedonian Maltese Manx Mari Mirandese Moksha Neapolitan Northern sami Norwegian Occitan Occitan (post 1500) Old church slavonic Old english (ca. 450 1100) Old norse Picard Piemontese Polish Portuguese Romanian Romansh Romany Russian Spanish Sardinian Saterfriesisch Scots Scottish gaelic Serbian Serbo Croatian Sicilian Silesian Slovak Slovenian Swedish Turkish Ukrainian Upper sorbian Venetian Vlax romani Volapük Walloon Welsh Yiddish
Languages in the Middle East
Subcategories
Keywords
Arabic Akkadian Ancient hebrew Assyrian neo Aramaic Hebrew
Languages of the Pacific Islands and Nations
Subcategories
Keywords
Ambulas Banjar Batak toba Benabena Betawi Bhojpuri Bine Bislama Buginese Bunama Burarra Chamorro Chuukese Dhao Doromu Koki Fiji hindi Fijian Halia Hawaiian Highland popoluca Hiri motu Iamalele Indonesian Javanese Kriol Kto Kâte Madurese Makasar Maori Marshallese Mende (papua new guinea) Minangkabau Mountain koiali Musi Muyuw Nauru Ngaju Nias Novial Pele Ata Pohnpeian Pular Rejang Samoan Sinaugoro Somba Siawari Sundanese Tahitian Tetun dili Tok pisin Tonga Warlpiri West kewa Yapese Yele
Datasets For Domains
Subcategories
Keywords
Advertising Agriculture Art Astronomy Automation Banking Biology Chemistry Climate Code Economics Education Environment Fashion Finance Food Game Geospatial Government History Insurance Legal Logic Mathematics Medical Music Philosophy Physics Politics Psychology Robotics Science Sports Time Series Web
Datasets For Modalities
This category covers modalities such as text and video; different widely applicable concepts, like data formats and how the data was collected or transformed from other data (e.g., see text-to-...); and general usage guidance, like data intended for pretraining, reinforcement-learning, chain of thought, etc.
Subcategories
Keywords
3D Agents Alignment Arrow Arxiv Audio Benchmark Classification Chain Of Thought Chat Crowd Sourced CSV Embeddings Evaluation Fine Tuning Generated Data Feature Extraction Graph Handwritten Image Instruction Following LLM JSON Monolingual Multi Lingual Multimodal Multiple Choice Named Entity Recognition News NLP Planning Pretraining Problem Solving Prompt Question Answering RAG Reasoning Regression Reinforcement Learning Safety Search Security Sentence Similarity Sentence Transformers Sentiment Analysis Speech Summarization Tabular Retrieval Text To … To Text Translation Tutorial Unlearning Video Vision Wikipedia