The Dataset Catalog

About This Catalog

We learned a lot about the quality of datasets by examining the metadata for the datasets hosted by Hugging Face. The tables in this catalog list the metadata for a small subset of these datasets, small because of how we had to filter them. Here are the details of that process:

The tables reflect a snapshot of the datasets as of October 24th, 2025. We are updating the snapshot approximately monthly while we work on a more automated, incremental, and iterative process.

The characteristics we describe below haven’t changed since we started these periodic updates on June 5th, 2025. However, the numbers gradually increase as more datasets are added to Hugging Face every day. We also round the numbers to the nearest thousand (“K”).

Of the approximately 554K Hugging Face datasets (as of October 24th…), 493K of them have queryable Croissant metadata. Among the remaining 61K, 33K don’t have Croissant metadata (which is actually an improvement over previous snapshots) and 29K may have this metadata, but you are required to request permission to use the dataset first, even to query its metadata!

Of those 493K datasets with queryable metadata, 370K do not specify a license of any kind, so we discard them, leaving just 123K, or 25% of the 493K!

Of the remaining 123K datasets, 11K “attempt” to define licenses, but do so improperly. Licenses are specified as choosealicense.com/licenses/ URLs. Unfortunately, these 11K datasets specify undefined (i.e., “404”) URLs. Previously, we discarded those datasets. However, some of the bad license links clearly intend to reference known licenses. We found that the list of licenses supported by choosealicense.com/licenses/ is actually quite small, but this is deliberate to encourage people to pick recent versions and to not be overwhelmed by too many choices. However, this also means that many valid licenses can’t be properly specified this way.

Instead of discarding them, we examined all these cases and found corresponding definitions for most of these additional licenses. Over 1,500 of these improperly specified datasets turned out to have permissive licenses (defined below).

After this processing, 109K datasets have identifiable, known licenses, permissive or not.

Of these 109K datasets, 95K of them have permissive licenses. These are the 95K datasets you will find in the catalog.
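The filtering steps above can be sketched as a small classifier that reports where a dataset drops out. This is only an illustration under stated assumptions: the record fields `has_croissant` and `license_url`, the function names, and the four-entry license table are hypothetical and are not the catalog's actual code or schema.

```python
from urllib.parse import urlparse

# Hypothetical lookup keyed by license slug. The real analysis relies on the
# ScanCode LicenseDB categories; this tiny table is for illustration only.
LICENSE_CATEGORIES = {
    "mit": "Permissive",
    "apache-2.0": "Permissive",
    "cc0-1.0": "Public Domain",
    "gpl-3.0": "Copyleft",
}
OPEN_CATEGORIES = {"Permissive", "Public Domain"}


def license_slug(url):
    """Return the lowercased last path segment of a license URL, or None."""
    if not url:
        return None
    path = urlparse(url).path.rstrip("/")
    slug = path.rsplit("/", 1)[-1].lower()
    return slug or None


def classify(record):
    """Return the stage at which a dataset record is dropped, or 'cataloged'."""
    if not record.get("has_croissant"):
        return "no queryable metadata"
    slug = license_slug(record.get("license_url"))
    if slug is None:
        return "no license"
    category = LICENSE_CATEGORIES.get(slug)
    if category is None:
        return "unidentified license"
    if category not in OPEN_CATEGORIES:
        return "not permissive"
    return "cataloged"
```

Applying `classify` to every record and counting the results would give the kind of funnel described above, with only the `"cataloged"` records reaching the tables.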

How we group the remaining 95K datasets into tables:

The groupings into tables are based on the corresponding keywords associated with the datasets.

The metadata for the datasets all have a language field, but all values are either en (English) or NULL, so we ignore this field.

However, many datasets have keywords for other languages. Those keywords are the basis for the Languages tables (including the one for English!).

All keywords were converted to lower case before grouping.

When a table for a keyword lists additional keywords (e.g., advertising), it means we grouped together different keywords that we believe are related to the same topic, including synonyms. (Please email us about any errors or report problems another way.) In these cases, we also show a Keyword column in the corresponding table, so you can see which keyword was used to include the dataset. (This also means that occasionally a dataset will be listed multiple times in its table, once for each keyword.)
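The grouping rules above can be illustrated with a short sketch. The synonym table, the dataset names, and the function name `group_by_topic` are hypothetical; the actual keyword groupings used by the catalog are not reproduced here.

```python
from collections import defaultdict

# Hypothetical synonym table: each topic maps to the keywords grouped under it.
TOPIC_KEYWORDS = {
    "advertising": {"advertising", "ads", "marketing"},
    "finance": {"finance", "financial"},
}


def group_by_topic(datasets):
    """Build {topic: [(dataset_name, matched_keyword), ...]} from keywords."""
    keyword_to_topic = {
        kw: topic for topic, kws in TOPIC_KEYWORDS.items() for kw in kws
    }
    tables = defaultdict(list)
    for name, keywords in datasets.items():
        # Lowercase before grouping, and deduplicate so "Ads" and "ads" match once.
        for kw in sorted({k.lower() for k in keywords}):
            topic = keyword_to_topic.get(kw)
            if topic is not None:
                # One row per matching keyword, so a dataset can appear
                # several times in the same table.
                tables[topic].append((name, kw))
    return dict(tables)


example = {
    "org/ad-clicks": ["Advertising", "Marketing", "CSV"],
    "org/bank-notes": ["Finance"],
}
tables = group_by_topic(example)
```

Here `org/ad-clicks` lands in the advertising table twice, once for each matching keyword, which mirrors the Keyword column behavior described above.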

More About the Licenses

More details of our analysis of the licenses can be found in the GitHub repo’s license-notes.md. Here we provide a few more of the interesting details. The ScanCode LicenseDB project classifies licenses into one of six categories. The 109K “good” datasets are categorized as shown in Table 1:

Category	Count
Permissive	`92617`
Source-available	`6181`
Copyleft Limited	`3608`
Unstated License	`3262`
Public Domain	`2096`
Copyleft	`1112`

Table 1: Categories of licenses.
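The rounded 109K and 95K figures quoted in this catalog follow from the counts in Table 1. The short check below reproduces them; the helper name `to_k` is an illustrative assumption, and treating Permissive plus Public Domain as the cataloged set follows the definition of “open” used in this section.

```python
# Counts copied from Table 1.
category_counts = {
    "Permissive": 92617,
    "Source-available": 6181,
    "Copyleft Limited": 3608,
    "Unstated License": 3262,
    "Public Domain": 2096,
    "Copyleft": 1112,
}


def to_k(n):
    """Round a count to the nearest thousand and format it with a K suffix."""
    return f"{round(n / 1000)}K"


total = sum(category_counts.values())
open_total = category_counts["Permissive"] + category_counts["Public Domain"]

print(total, to_k(total))            # 108876 109K
print(open_total, to_k(open_total))  # 94713 95K
```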

For our purposes, Permissive and Public Domain qualify as “open”, yielding 95K datasets. A total of 19 Permissive licenses were found, shown in Table 2:

Table 2: The permissive licenses.

Important: At this time, we are not yet validating datasets to ensure their metadata accurately reflect the data records themselves.

Note: Some of the datasets filtered out for one of the reasons discussed above are listed separately in our Contributors or Other Datasets pages, where we also describe some other useful datasets that are not available on Hugging Face and not yet included in this catalog.

Help Wanted: Do you know of any datasets that should be shown, but aren’t? Let us know through email or another way.

The Current Keywords Cataloged

Datasets For Languages

Datasets with different human languages.

Subcategories

African Languages
Languages in the Americas
Asian Languages
European Languages
Languages in the Middle East
Languages of the Pacific Islands and Nations

African Languages

Ancient and modern languages in Africa.

Keywords

Afar Afrikaans Akan Amharic Baatonum Bambara Bemba (zambia) Ber Birwa Central kanuri Chichewa Chokwe Cwi bwamu Dagbani Dinka Dyula Egyptian (ancient) Fanti Fulah Ganda Ghomálá’ Hausa Herero Igbo Kabiyè Kabuverdianu Kabyle Kachin Kamba (kenya) Kanuri Kikuyu Kimbundu Kinyarwanda Ko Kongo Koyraboro senni songhai Kutu Kwere Lingala Luba Katanga Luba Lulua Luo (kenya and tanzania) Makhuwa Makonde Malagasy Mamara senoufo Mossi N’ko Ndonga Nigerian fulfulde Nigerian pidgin Ndebele Nuer Nyankole Oromo Plateau malagasy Rundi Sango Sar Seselwa creole french Shona Somali Suba Swahili Susu Swati Tachelhit Tamasheq Tamazight Tigrigna Tsonga Setswana Tumbuka Umbundu Venda West central oromo Wolaytta Wolof Xhosa Yoruba Zulu

Languages in the Americas

Ancient and modern languages in the Americas.

Keywords

Achuar Shiwiar Algonquin Arabela Asháninka Aymara Bora Candoshi Shapra Caquinte Caribbean hindustani Cashibo Cacataibo Cashinahua Central aymara Central bikol Central bontok Central mazahua Chachi Chayahuita Cherokee Chimborazo highland quichua Chácobo Cofán Cree Eastern huasteca nahuatl Eastern maroon creole Galibi carib Garifuna Guarani Haitian Highland puebla nahuatl Huastec Huichol Imbabura highland quichua Inuktitut Inupiaq Ixil Jamaican creole english K’iche’ Kalaallisut Kaqchikel Kekchí Mapudungun Mezquital otomi Mi’kmaq Murui huitoto Navajo Ngäbere Nomatsiguenga Orpo Quechua Papantla totonac Purepecha Saramaccan Sharanahua Shipibo Conibo Shuar Siona Sirionó Sranan tongo Ticuna Tzotzil Wayuu Yine Yosondúa mixtec Yucateco Zapotec

Asian Languages

Ancient and modern languages in Asia.

Keywords

Abkhaz Altai Amis Angika Armenian Assamese Avaric Awadhi Azerbaijani Balinese Balochi Bashkir Bengali Bishnupriya Bodo (india) Bolinao Burmese Carpathian romani Cebuano Central kurdish Chechen Chhattisgarhi Chinese Chuvash Crimean tatar Dari Dhivehi Dimli (individual language) Divehi Dogri (macrolanguage) Dzongkha Eastern tamang Erzya Farsi Filipino Gilaki Goan konkani Gujarati Hakha chin Hakka chinese Halh mongolian Hindi Hinglish Iloko Ingush Iranian persian Japanese Kalmyk Kankanaey Kannada Karachay Balkar Karelian Kashmiri Kazakh Khmer Kirghiz Kirmanjki (individual language) Komi Korean Kurdish Kyrgyz Lak Lao Lezghian Limbu Lushai Magahi Maithili Malay Malayalam Manipuri Mansi Marathi Mari (russia) Mazanderani Mingrelian Mongolian Nepali Nepali (individual language) Newari North azerbaijani Northern kurdish Northern uzbek Odia Oriya Oriya (macrolanguage) Ossetian Pampanga Pangasinan Panjabi Pashto Persian Russia buriat Sanskrit Santali Saraiki Sediq Shan Sindhi Sinhala South azerbaijani Tagalog Tajik Tamil Tatar Telugu Thai Tibetan Turkish Turkmen Tuvan Udmurt Uyghur Uzbek Vietnamese Waray (philippines) Western bukidnon manobo Yakut

European Languages

Ancient and modern languages in Europe.

Keywords

Adyghe Albanian Aragonese Arpitan Asturian Basque Bavarian Belarusian Bosnian Breton Bulgarian Catalan Cornish Corsican Croatian Czech Danish Dutch English Esperanto Estonian Faroese Finnish French Frisian Friulian Gagauz Galician German Georgian Greek Hungarian Icelandic Ido Irish Italian Kashubian Kölsch Ladin Ladino Latgalian Latin Latvian Ligurian Limburgan Limburgish Lithuanian Liv Livvi Lombard Low german Lower sorbian Luxembourgish Macedonian Maltese Manx Mari Mirandese Moksha Neapolitan Northern sami Norwegian Occitan Occitan (post 1500) Old church slavonic Old english (ca. 450 1100) Old norse Picard Piemontese Polish Portuguese Romanian Romansh Romany Russian Spanish Sardinian Saterfriesisch Scots Scottish gaelic Serbian Serbo Croatian Sicilian Silesian Slovak Slovenian Swedish Turkish Ukrainian Upper sorbian Venetian Vlax romani Volapük Walloon Welsh Yiddish

Languages in the Middle East

Ancient and modern languages from the Middle East.

Keywords

Arabic Akkadian Ancient hebrew Assyrian neo Aramaic Hebrew

Languages of the Pacific Islands and Nations

Ancient and modern languages in the Pacific islands, Australia, and New Zealand.

Keywords

Ambulas Banjar Batak toba Benabena Betawi Bhojpuri Bine Bislama Buginese Bunama Burarra Chamorro Chuukese Dhao Doromu Koki Fiji hindi Fijian Halia Hawaiian Highland popoluca Hiri motu Iamalele Indonesian Javanese Kriol Kto Kâte Madurese Makasar Maori Marshallese Mende (papua new guinea) Minangkabau Mountain koiali Musi Muyuw Nauru Ngaju Nias Novial Pele Ata Pohnpeian Pular Rejang Samoan Sinaugoro Somba Siawari Sundanese Tahitian Tetun dili Tok pisin Tonga Warlpiri West kewa Yapese Yele

Datasets For Domains

Domains like chemistry, healthcare, etc.

Keywords

Advertising Agriculture Art Astronomy Automation Banking Biology Chemistry Climate Code Economics Education Environment Fashion Finance Food Game Geospatial Government History Insurance Legal Logic Mathematics Medical Music Philosophy Physics Politics Psychology Robotics Science Sports Time Series Web

Datasets For Modalities

Modalities include text and video, as well as widely applicable concepts such as data formats and how the data was collected or transformed from other data (e.g., see text-to-...). They also cover general usage guidance, such as data intended for pretraining, reinforcement learning, chain of thought, etc.

Keywords

3D Agents Alignment Arrow Arxiv Audio Benchmark Classification Chain Of Thought Chat Crowd Sourced CSV Embeddings Evaluation Fine Tuning Generated Data Feature Extraction Graph Handwritten Image Instruction Following LLM JSON Monolingual Multi Lingual Multimodal Multiple Choice Named Entity Recognition News NLP Planning Pretraining Problem Solving Prompt Question Answering RAG Reasoning Regression Reinforcement Learning Safety Search Security Sentence Similarity Sentence Transformers Sentiment Analysis Speech Summarization Tabular Retrieval Text To … To Text Translation Tutorial Unlearning Video Vision Wikipedia
