The tmQM dataset provides the quantum properties, including geometries, atomic charges, and bond orders, of 108k transition metal complexes (TMCs) representing the 3d, 4d, and 5d series. All 30 transition metals are present, in combination with more than 30k different ligands. tmQM contains organometallic, bioinorganic, and Werner complexes. Structures were extracted from the Cambridge Structural Database (CSD; 2024 release) with a series of filters, yielding mononuclear TMCs with a charge of -1, 0, or +1. Electronic structure properties, including the energy, dipole moment, polarizability, and HOMO-LUMO gap, were all computed for the closed-shell singlet state. Two levels of theory were used: xTB (geometries) and DFT (single-point properties). tmQM is distributed under the MIT license in the four files linked and documented below, in which all TMCs are labeled with their CSD code.

The dataset provided on this web page is an extension of the original tmQM dataset reported in this article: The tmQM Dataset - Quantum Geometries and Properties of 86k Transition Metal Complexes. tmQM has also been used to derive a 60k graph dataset (tmQMg) and a 30k ligand library (tmQMg-L). In general, the purpose of tmQM is to provide the scientific community with a reliable source for developing and testing machine learning models for the exploration of the transition metal complex chemical space. Reviews, applications, and extensions of tmQM can be found at this link.

 


Download the dataset files here:


.xyz file: includes the geometries optimized at the GFN2-xTB level.

.csv file: includes the quantum properties computed on the xTB-optimized geometries at the TPSSh/def2SVP level*, i.e. electronic and dispersion energies, dipole moment, metal charge, HOMO/LUMO gap and energies, and polarizability.

.q file: includes the natural atomic charges at the TPSSh/def2SVP level.

.BO file: includes the Wiberg bond orders and atomic valence indices at the GFN2-xTB level.

*Except the polarizability, which was computed at the GFN2-xTB level
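Since all four files label each TMC with its CSD code, the geometries and properties can be linked by that key. The following is a minimal sketch of that idea, not an official loader. It assumes that the .xyz file concatenates standard XYZ blocks separated by blank lines, that the comment line of each block holds "key = value" pairs separated by "|", and that the .csv file is semicolon-delimited with a CSD_code column. The helper names, the HL_Gap column, and the toy entries TOYAAA and TOYBBB are hypothetical; check the downloaded files and adjust the delimiter and keys accordingly.

```python
import csv
import io


def read_multi_xyz(text):
    """Parse concatenated XYZ blocks into a list of (comment, atoms) tuples."""
    lines = text.splitlines()
    structures = []
    i = 0
    while i < len(lines):
        if not lines[i].strip():  # skip blank separator lines
            i += 1
            continue
        n_atoms = int(lines[i])          # first line of a block: atom count
        comment = lines[i + 1].strip()   # second line: free-form comment
        atoms = []
        for row in lines[i + 2 : i + 2 + n_atoms]:
            symbol, x, y, z = row.split()[:4]
            atoms.append((symbol, float(x), float(y), float(z)))
        structures.append((comment, atoms))
        i += 2 + n_atoms
    return structures


def parse_comment(comment):
    """Split an assumed 'key = value | key = value' comment into a dict."""
    fields = {}
    for part in comment.split("|"):
        if "=" in part:
            key, value = part.split("=", 1)
            fields[key.strip()] = value.strip()
    return fields


def read_properties(text, delimiter=";", key="CSD_code"):
    """Index the rows of a delimited property table by the given key column."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return {row[key]: row for row in reader}


# Toy data in the assumed layout (not real tmQM entries).
SAMPLE_XYZ = """3
CSD_code = TOYAAA | q = 0
O 0.000 0.000 0.000
H 0.757 0.586 0.000
H -0.757 0.586 0.000

2
CSD_code = TOYBBB | q = 1
H 0.000 0.000 0.000
H 0.740 0.000 0.000
"""
SAMPLE_CSV = "CSD_code;HL_Gap\nTOYAAA;0.25\nTOYBBB;0.40\n"

structures = read_multi_xyz(SAMPLE_XYZ)
properties = read_properties(SAMPLE_CSV)

# Pair each geometry with its property row through the shared CSD code.
linked = {}
for comment, atoms in structures:
    code = parse_comment(comment)["CSD_code"]
    linked[code] = (atoms, properties[code])
```

For the real files, the same functions would be called on the text read from disk, e.g. with open(path).read(), after confirming the actual column names and separators.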

All data is also available in this GitHub repo.

 


Further technical details:


In 2024, the CSD already provided structural data for over 1.3M chemical compounds, of which nearly 0.5M contained transition metals. However, not all of these compounds were of interest for tmQM, nor were they all suitable for it, given the quantum nature of this dataset. TMCs were thus selected and curated using these filters:

Chemical composition filter: Mononuclear TMCs including any of the 30 transition metals bound to any of these elements: B, C, Si, N, P, As, O, S, Se, F, Cl, Br, and I, and including at least one C atom.

Geometry filters (I): Non-polymeric and with 3D coordinates available, excluding disordered structures.

Geometry filters (II): Only the heaviest fragment containing a transition metal was kept, which excludes co-crystallizing molecules (e.g. solvents and counterions).

Electronic structure filters: Neutral and singly charged (-1 or +1) TMCs with an even number of electrons.

Curation filter (I): Only the TMCs for which both the xTB and DFT calculations converged were included.

Curation filter (II): The 7% of TMCs with the largest deviation of the xTB geometry relative to the CSD structure (normalized by the R factor and the number of atoms) were excluded.

Curation filter (III): xTB-optimized geometries without any H atom or having C atoms with missing Hs were excluded using the filter developed by Ulissi and Blau in this article.

Curation filter (IV): xTB-optimized geometries with missing Hs, as well as dissociated, isolated ligands, were excluded using a C-focused multi-radii geometric filter.
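Part of the selection logic above can be illustrated with a short sketch. It is not the code used to build tmQM: it only checks what can be decided from the element symbols and the total charge, namely exactly one transition metal (mononuclear), at least one C atom, a charge of -1, 0, or +1, and an even number of electrons. The check on which elements are bound to the metal and all geometry and curation filters are omitted because they need connectivity and coordinates. The function name passes_basic_filters is hypothetical, counting La as the first 5d metal is an assumption, and rejecting any element outside the table is a simplification.

```python
# Atomic numbers of the 30 transition metals (La taken as the first 5d metal,
# which is an assumption of this sketch).
TRANSITION_METALS = {
    "Sc": 21, "Ti": 22, "V": 23, "Cr": 24, "Mn": 25,
    "Fe": 26, "Co": 27, "Ni": 28, "Cu": 29, "Zn": 30,
    "Y": 39, "Zr": 40, "Nb": 41, "Mo": 42, "Tc": 43,
    "Ru": 44, "Rh": 45, "Pd": 46, "Ag": 47, "Cd": 48,
    "La": 57, "Hf": 72, "Ta": 73, "W": 74, "Re": 75,
    "Os": 76, "Ir": 77, "Pt": 78, "Au": 79, "Hg": 80,
}

# H plus the non-metal elements named in the chemical composition filter.
MAIN_GROUP = {
    "H": 1, "B": 5, "C": 6, "N": 7, "O": 8, "F": 9,
    "Si": 14, "P": 15, "S": 16, "Cl": 17,
    "As": 33, "Se": 34, "Br": 35, "I": 53,
}

ATOMIC_NUMBER = {**TRANSITION_METALS, **MAIN_GROUP}


def passes_basic_filters(symbols, charge):
    """Return True if a complex passes the composition and electron-count checks.

    symbols: list of element symbols, one per atom.
    charge:  total charge of the complex.
    """
    # Simplification: reject any element that is not in the table above.
    if any(s not in ATOMIC_NUMBER for s in symbols):
        return False
    # Mononuclear: exactly one transition metal atom.
    if sum(s in TRANSITION_METALS for s in symbols) != 1:
        return False
    # At least one carbon atom.
    if "C" not in symbols:
        return False
    # Neutral or singly charged.
    if charge not in (-1, 0, 1):
        return False
    # Even electron count, as needed for a closed-shell singlet.
    n_electrons = sum(ATOMIC_NUMBER[s] for s in symbols) - charge
    return n_electrons % 2 == 0


# Illustrative compositions (not taken from the dataset).
ferrocene = ["Fe"] + ["C"] * 10 + ["H"] * 10      # Fe(C5H5)2
nickel_carbonyl = ["Ni"] + ["C"] * 4 + ["O"] * 4  # Ni(CO)4
diiron_carbonyl = ["Fe"] * 2 + ["C"] * 9 + ["O"] * 9  # Fe2(CO)9, dinuclear

results = {
    "ferrocene, q=0": passes_basic_filters(ferrocene, 0),
    "ferrocenium, q=+1": passes_basic_filters(ferrocene, 1),
    "Ni(CO)4, q=0": passes_basic_filters(nickel_carbonyl, 0),
    "Fe2(CO)9, q=0": passes_basic_filters(diiron_carbonyl, 0),
}
```

In this sketch, neutral ferrocene has 96 electrons and passes, while its cation has 95 electrons and is rejected by the even-electron check.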


Comments, questions, any feedback: Please contact us.