The Dataset Catalog
About This Catalog
The tables in this catalog list the metadata for Hugging Face-hosted datasets that were gathered as follows:
- The tables reflect a snapshot of the datasets as of May 5th, 2025. (Periodic updates are planned.)
- Of the approximately 350,000 datasets, only those queryable using Croissant metadata are considered, about 260,000.
- Of those, we discarded datasets without a specified license, leaving only about 60,000!
- The licenses are specified as corresponding choosealicense.com/licenses/ URLs. Unfortunately, many of the specified URLs are undefined (they return “404”). We discarded those datasets, leaving 45,000. [1]
- The groupings are based on the presence of relevant keywords. Note that all the datasets list their language as en (English), but many have keywords for other languages. Those keywords are the basis for the Languages tables (including the one for English).
- All keywords were converted to lower case before “grouping”. When a keyword entry lists additional keywords, it means we grouped together different keywords that we believe are related to the same topic, including synonyms. In these cases, we also show a Keyword column in the corresponding tables, so you can see which keyword was used to include a dataset.
- Important: At this time, we are not yet validating datasets to ensure their metadata accurately reflect the data records themselves.
- Do you know of any datasets that should be shown, but aren’t? Let us know!
1: Some of the bad license links clearly intend to reference known licenses. We’ll revisit those cases.
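The filtering and grouping steps above can be sketched roughly as follows. This is only an illustration, not the catalog's actual pipeline: the record layout (`name`, `license`, `keywords`), the `KNOWN_LICENSE_SLUGS` set (a stand-in for checking whether a license URL resolves), and the `KEYWORD_GROUPS` mapping are all assumptions made for the example.

```python
from collections import defaultdict

LICENSE_URL_PREFIX = "https://choosealicense.com/licenses/"

# Assumption: a small allow-list stands in for checking that a URL is not a "404".
KNOWN_LICENSE_SLUGS = {"mit", "apache-2.0", "cc-by-4.0"}

# Hypothetical grouping of related lower-case keywords under one topic.
KEYWORD_GROUPS = {
    "medical": "medical",
    "medicine": "medical",
    "healthcare": "medical",
}


def has_valid_license(record):
    """Return True if the record names a license URL with a known slug."""
    url = record.get("license") or ""
    if not url.startswith(LICENSE_URL_PREFIX):
        return False
    slug = url[len(LICENSE_URL_PREFIX):].strip("/")
    return slug in KNOWN_LICENSE_SLUGS


def group_by_keyword(records):
    """Group licensed datasets by topic, keeping the keyword that matched."""
    groups = defaultdict(list)
    for record in records:
        if not has_valid_license(record):
            continue  # no license, or a license URL that is not recognized
        for keyword in record.get("keywords", []):
            lowered = keyword.lower()  # keywords are lower-cased before grouping
            topic = KEYWORD_GROUPS.get(lowered)
            if topic is not None:
                groups[topic].append((record["name"], lowered))
    return dict(groups)


# Hypothetical sample records, invented for this sketch.
records = [
    {"name": "a/ds1", "license": LICENSE_URL_PREFIX + "mit/", "keywords": ["Medicine", "text"]},
    {"name": "b/ds2", "license": None, "keywords": ["medical"]},
    {"name": "c/ds3", "license": LICENSE_URL_PREFIX + "not-a-license/", "keywords": ["Healthcare"]},
    {"name": "d/ds4", "license": LICENSE_URL_PREFIX + "apache-2.0/", "keywords": ["HEALTHCARE"]},
]
result = group_by_keyword(records)
```

Keeping the matched keyword next to each dataset name mirrors the Keyword column described above, which shows which keyword caused a dataset to be included in a group.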
The Current Keywords Cataloged
Datasets For Languages
Subcategories
- African Languages
- Languages in the Americas
- Asian Languages
- European Languages
- Languages in the Middle East
- Languages of the Pacific Islands and Nations
African Languages
Keywords
Afar Afrikaans Akan Amharic Baatonum Bambara Bemba (zambia) Ber Birwa Central kanuri Chichewa Chokwe Cwi bwamu Dagbani Dinka Dyula Egyptian (ancient) Fanti Fulah Ganda Ghomálá’ Hausa Herero Igbo Kabiyè Kabuverdianu Kabyle Kachin Kamba (kenya) Kanuri Kikuyu Kimbundu Kinyarwanda Ko Kongo Koyraboro senni songhai Kutu Kwere Lingala Luba Katanga Luba Lulua Luo (kenya and tanzania) Makhuwa Makonde Malagasy Mamara senoufo Mossi N’ko Ndonga Nigerian fulfulde Nigerian pidgin Ndebele Nuer Nyankole Oromo Plateau malagasy Rundi Sango Sar Seselwa creole french Shona Somali Suba Swahili Susu Swati Tachelhit Tamasheq Tamazight Tigrigna Tsonga Setswana Tumbuka Umbundu Venda West central oromo Wolaytta Wolof Xhosa Yoruba Zulu
Languages in the Americas
Keywords
Achuar Shiwiar Algonquin Arabela Asháninka Aymara Bora Candoshi Shapra Caquinte Caribbean hindustani Cashibo Cacataibo Cashinahua Central aymara Central bikol Central bontok Central mazahua Chachi Chayahuita Cherokee Chimborazo highland quichua Chácobo Cofán Cree Eastern huasteca nahuatl Eastern maroon creole Galibi carib Garifuna Guarani Haitian Highland puebla nahuatl Huastec Huichol Imbabura highland quichua Inuktitut Inupiaq Ixil Jamaican creole english K’iche’ Kalaallisut Kaqchikel Kekchí Mapudungun Mezquital otomi Mi’kmaq Murui huitoto Navajo Ngäbere Nomatsiguenga Orpo Quechua Papantla totonac Purepecha Saramaccan Sharanahua Shipibo Conibo Shuar Siona Sirionó Sranan tongo Ticuna Tzotzil Wayuu Yine Yosondúa mixtec Yucateco Zapotec
Asian Languages
Keywords
Abkhaz Altai Amis Angika Armenian Assamese Avaric Awadhi Azerbaijani Balinese Balochi Bashkir Bengali Bishnupriya Bodo (india) Bolinao Burmese Carpathian romani Cebuano Central kurdish Chechen Chhattisgarhi Chinese Chuvash Crimean tatar Dari Dhivehi Dimli (individual language) Divehi Dogri (macrolanguage) Dzongkha Eastern tamang Erzya Farsi Filipino Gilaki Goan konkani Gujarati Hakha chin Hakka chinese Halh mongolian Hindi Hinglish Iloko Ingush Iranian persian Japanese Kalmyk Kankanaey Kannada Karachay Balkar Karelian Kashmiri Kazakh Khmer Kirghiz Kirmanjki (individual language) Komi Korean Kurdish Kyrgyz Lak Lao Lezghian Limbu Lushai Magahi Maithili Malay Malayalam Manipuri Mansi Marathi Mari (russia) Mazanderani Mingrelian Mongolian Nepali Nepali (individual language) Newari North azerbaijani Northern kurdish Northern uzbek Odia Oriya Oriya (macrolanguage) Ossetian Pampanga Pangasinan Panjabi Pashto Persian Russia buriat Sanskrit Santali Saraiki Sediq Shan Sindhi Sinhala South azerbaijani Tagalog Tajik Tamil Tatar Telugu Thai Tibetan Turkish Turkmen Tuvan Udmurt Uyghur Uzbek Vietnamese Waray (philippines) Western bukidnon manobo Yakut
European Languages
Keywords
Adyghe Albanian Aragonese Arpitan Asturian Basque Bavarian Belarusian Bosnian Breton Bulgarian Catalan Cornish Corsican Croatian Czech Danish Dutch English Esperanto Estonian Faroese Finnish French Frisian Friulian Gagauz Galician German Georgian Greek Hungarian Icelandic Ido Irish Italian Kashubian Kölsch Ladin Ladino Latgalian Latin Latvian Ligurian Limburgan Limburgish Lithuanian Liv Livvi Lombard Low german Lower sorbian Luxembourgish Macedonian Maltese Manx Mari Mirandese Moksha Neapolitan Northern sami Norwegian Occitan Occitan (post 1500) Old church slavonic Old english (ca. 450 1100) Old norse Picard Piemontese Polish Portuguese Romanian Romansh Romany Russian Spanish Sardinian Saterfriesisch Scots Scottish gaelic Serbian Serbo Croatian Sicilian Silesian Slovak Slovenian Swedish Turkish Ukrainian Upper sorbian Venetian Vlax romani Volapük Walloon Welsh Yiddish
Languages in the Middle East
Keywords
Arabic Akkadian Ancient hebrew Assyrian neo Aramaic Hebrew
Languages of the Pacific Islands and Nations
Keywords
Ambulas Banjar Batak toba Benabena Betawi Bhojpuri Bine Bislama Buginese Bunama Burarra Chamorro Chuukese Dhao Doromu Koki Fiji hindi Fijian Halia Hawaiian Highland popoluca Hiri motu Iamalele Indonesian Javanese Kriol Kto Kâte Madurese Makasar Maori Marshallese Mende (papua new guinea) Minangkabau Mountain koiali Musi Muyuw Nauru Ngaju Nias Novial Pele Ata Pohnpeian Pular Rejang Samoan Sinaugoro Somba Siawari Sundanese Tahitian Tetun dili Tok pisin Tonga Warlpiri West kewa Yapese Yele
Datasets For Domains
Keywords
Advertising Agriculture Art Astronomy Automation Banking Biology Chemistry Climate Code Economics Education Environment Fashion Finance Food Game Geospatial Government History Insurance Legal Logic Mathematics Medical Music Philosophy Physics Politics Psychology Robotics Science Sports Time Series Web
Datasets For Modalities
This category covers modalities like text and video, different widely-applicable concepts, like data formats and how the data was collected or transformed from other data (e.g., see text-to-...), and general usage guidance, like data intended for pretraining, reinforcement-learning, chain of thought, etc.
Keywords
3D Agents Alignment Arrow Arxiv Audio Benchmark Classification Chain Of Thought Chat Crowd Sourced CSV Embeddings Evaluation Fine Tuning Generated Data Feature Extraction Graph Handwritten Image Instruction Following LLM JSON Monolingual Multi Lingual Multimodal Multiple Choice Named Entity Recognition News NLP Planning Pretraining Problem Solving Prompt Question Answering RAG Reasoning Regression Reinforcement Learning Safety Search Security Sentence Similarity Sentence Transformers Sentiment Analysis Speech Summarization Tabular Retrieval Text To … To Text Translation Tutorial Unlearning Video Vision Wikipedia