Other Datasets
Many other open datasets are not hosted at Hugging Face and are not yet part of our catalog. For now, they are listed here.
Chemistry
Many datasets for chemistry are open for use.
CartBlanche
CartBlanche is an interface to ZINC-22, a free database of commercially available compounds for virtual screening. From the website:
ZINC-22 focuses on make-on-demand (“tangible”) compounds from a small number of large catalogs: Enamine, WuXi and Mcule. Our sister database, ZINC20 focuses on smaller catalogs. ZINC-22 currently has about 54.9 billion molecules in 2D and 5.9 billion in 3D.
PubChem
PubChem is a free-to-use chemistry database. From the website:
PubChem is a free to use database with most of the data readily available for download. Exceptions may exist in cases where licensing agreements prevent our data contributors from allowing bulk downloads of some data sets.
Please consult the NCBI Policies and Disclaimers webpage (https://www.ncbi.nlm.nih.gov/home/about/policies/) and the NLM Web Policies webpage (https://www.nlm.nih.gov/web_policies.html).
The data in PubChem comes from hundreds of data contributors (https://pubchem.ncbi.nlm.nih.gov/source/). A data source may provide explicit data license information. One should check with the PubChem data source for the most current data licensing information.
PubChem strives to make clear the data provenance of all content. Within a given data table row or beneath provided content, the data provenance is provided. For example, this data shows Medical Subject Headings (MeSH) as the data source for the assertion of a chemical being a “Fibrinolytic Agent”:
Other Datasets?
If you know of other open datasets that we should include in our catalog, let us know.