Molecular and defect identification combining Atomic Force Microscopy images with deep learning and tunneling currents
Rubén Pérez, 1,2
1 Departamento de Física Teórica de la Materia Condensada, Universidad Autónoma de Madrid (UAM), 28049 Madrid, Spain
2 Condensed Matter Physics Center (IFIMAC), UAM, 28049 Madrid, Spain
ruben [dot] perez [at] uam [dot] es
Recent advances in the interpretation of the contrast provided by AFM with CO-functionalized tips, on porphycenes [1] and on self-assembled molecular layers driven by either halogen [2] or hydrogen bonds [3], show that there are clear connections between fundamental chemical properties of the molecules and key features imprinted in AFM images. Inspired by these results, we address the problem of the complete identification (structure and composition) of molecular systems based solely on AFM images, without any prior information, by exploiting deep learning (DL) techniques.
As a first step, we restrict ourselves to a small set of 60 flat molecules and demonstrate the automatic classification of experimental AFM images by a DL model trained essentially with a theoretically generated data set [4]. Learning from the successes and the limitations of this proof of concept, we have developed QUAM-AFM, the largest data set of simulated AFM images, generated from a selection of 685,513 molecules that span the most relevant bonding structures and chemical species in organic chemistry [5]. QUAM-AFM contains, for each molecule, 24 3D image stacks, each computed with a different combination of AFM operational parameters and consisting of constant-height images simulated for 10 tip–sample distances, resulting in a total of about 165 million images. The data for each molecule include, besides AFM images, ball-and-stick depictions, IUPAC names, chemical formulas, atomic coordinates, and maps of atom heights. A graphical user interface allows the search for structures by CID number, IUPAC name, or chemical formula.
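The size quoted for QUAM-AFM can be checked directly from the three numbers given above. The short Python sketch below does only that arithmetic; the variable names are illustrative and are not part of the data set or its tools.

```python
# Figures quoted in the abstract for the QUAM-AFM data set.
N_MOLECULES = 685_513        # molecules in the selection
STACKS_PER_MOLECULE = 24     # one 3D stack per combination of operational parameters
IMAGES_PER_STACK = 10        # one constant-height image per tip-sample distance

# Each molecule contributes stacks x images-per-stack simulated images.
images_per_molecule = STACKS_PER_MOLECULE * IMAGES_PER_STACK
total_images = N_MOLECULES * images_per_molecule

print(images_per_molecule)      # 240
print(f"{total_images:,}")      # 164,523,120
```

The product, 164,523,120, rounds to the 165 million images stated in the text.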
Using QUAM-AFM to train different deep learning models, we explore alternatives to go beyond the classification of limited groups of molecules and achieve the complete identification of an arbitrarily complex, unknown molecule. First, we frame molecular identification as an image captioning problem and design an architecture, composed of two multimodal recurrent neural networks, capable of providing the IUPAC name of an unknown molecule using a 3D image stack as input [6]. Second, we use a Conditional Generative Adversarial Network (CGAN) to convert the 3D stack of AFM images into a ball-and-stick depiction, where balls of different color and size represent the chemical species and sticks represent the bonds, providing complete information on the structure and chemical composition [7]. Tests with a large set of theoretical images and a few experimental examples demonstrate the accuracy and potential of the two approaches for molecular identification from AFM images.
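Treating identification as image captioning means the network has to emit the IUPAC name as a sequence of discrete terms. The Python sketch below illustrates one way such target sequences could be built: a greedy longest-match split of a name against a small vocabulary. Both the vocabulary and the `tokenize` helper are hypothetical and chosen only for illustration; the decomposition actually used in [6] may differ.

```python
# Hypothetical vocabulary of IUPAC name fragments (illustrative only,
# not the term set used by the authors).
VOCAB = ["benz", "ene", "meth", "eth", "prop", "an", "ol", "yl",
         "chloro", "hydroxy", "di", "tri", "1", "2", "3", "4", ",", "-"]


def tokenize(name, vocab=VOCAB):
    """Split a name into vocabulary terms by greedy longest match."""
    # Try longer terms first so that "chloro" wins over any shorter prefix.
    terms = sorted(vocab, key=len, reverse=True)
    tokens, i = [], 0
    while i < len(name):
        for term in terms:
            if name.startswith(term, i):
                tokens.append(term)
                i += len(term)
                break
        else:
            # No term matches: the vocabulary does not cover this name.
            raise ValueError(f"no vocabulary term matches {name[i:]!r}")
    return tokens


# A captioning model would be trained to emit this sequence, usually
# wrapped in start and end markers, one term per decoding step.
target = ["<start>"] + tokenize("1,2-dichlorobenzene") + ["<end>"]
print(target)
```

Greedy matching is the simplest choice and can mis-split names when one term is a prefix of another valid split, so a production tokenizer would need a more careful grammar.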
In the last part of the talk, we shall discuss how AFM measurements with a CO-functionalized tip, combined with theoretical simulations of AFM [2,3] and STM images (where the effect of the CO relaxation is included) [8], allowed us to determine the regular patterns formed by CO2 molecules adsorbed on Au(111) and the nature of the characteristic defects observed both in assemblies driven by pre-adsorbed 1,4-phenylene diisocyanide (PDI) chains and in free-standing CO2 islands.
[1] T. K. Shimizu et al., J. Phys. Chem. C 124, 26759 (2020)
[2] J. Tschakert et al., Nat. Commun. 11, 5630 (2020)
[3] P. Zahl et al., Nanoscale 13, 18473 (2021)
[4] J. Carracedo-Cosme et al., Nanomaterials 11, 1658 (2021)
[5] J. Carracedo-Cosme et al., J. Chem. Inf. Model. 62, 1214 (2022)
[6] J. Carracedo-Cosme et al. (2022), http://arxiv.org/abs/2205.00449
[7] J. Carracedo-Cosme and R. Pérez (2022), http://arxiv.org/abs/2205.00447
[8] E. Ventura-Macias et al. (2022), submitted