The Evolution of Data-Driven Modeling in Organic Chemistry

Published under a license that permits non-commercial access and re-use, provided that author attribution and integrity are maintained, but does not permit creation of adaptations or other derivative works (https://creativecommons.org/licenses/by-nc-nd/4.0/).

Abstract

Organic chemistry is replete with complex relationships: for example, how a reactant’s structure relates to the product formed; how reaction conditions relate to yield; how a catalyst’s structure relates to enantioselectivity. Questions like these are at the foundation of understanding reactivity and developing novel and improved reactions. An approach to probing these questions that is both longstanding and contemporary is data-driven modeling. Here, we provide a synopsis of the history of data-driven modeling in organic chemistry and the terms used to describe these endeavors. We include a timeline of the steps that led to its current state. The case studies included highlight how, as a community, we have advanced physical organic chemistry tools with the aid of computers and data to augment the intuition of expert chemists and to facilitate the prediction of structure–activity and structure–property relationships.

Short abstract

Data science has emerged as a powerful tool to study organic chemistry. Herein, we provide a history of data-driven modeling in organic chemistry and a discussion of the current state of the field.

Introduction

In recent years, machine learning and artificial intelligence have emerged as powerful tools in organic chemistry. 1−8 As a consequence, we thought it prudent to provide the community with a timeline of events that have both inspired and contributed to the clear uptick in the applications of various data-driven strategies to the chemical sciences. These strategies are rooted in linear free energy relationships (LFERs), of which the Hammett relationship is the paradigmatic example. 9 Classically, these analyses related a single parameter, a mathematical way to describe a subunit or the entirety of a molecule, to chemical reactivity. 10 Although LFERs initially were used to gain mechanistic insight, if a model captures underlying chemical reactivity, it can in principle predict the reactivity of unknown reactions. Nevertheless, the simplicity of LFERs can limit their predictive ability, particularly in complex chemical systems.

Thus, over time, multiparameter approaches for correlating chemical reactivity with structure were introduced. In addition, the advent of computers facilitated the use of increasingly larger data sets and more advanced algorithms to describe and predict the reactivity of more complex systems. 11,12 In parallel with technological advances, many new terms, such as chemometrics and chemoinformatics, were introduced to describe these endeavors. Further, the realization that structure–activity modeling is restricted neither to classic substituent effects nor to seeking linear relationships between data and experimental observables led to another terminology evolution, e.g., that of machine learning.

Here, we provide a synopsis of the history of data-driven approaches in organic chemistry alongside a tour through the evolution of terminology used to describe these endeavors (Figure 1). We include key historical steps that led us to the current state of machine learning in chemical synthesis (Figure 2). This Outlook does not aim to be a comprehensive review of modern work but rather will highlight advances in the field with select case studies. For a more comprehensive review of modern approaches, we refer readers to other recent reviews of machine learning in organic chemistry. 2−8

Figure 1. Fields that have contributed to the development of data science in organic chemistry. 13−15

Figure 2. Timeline of major developments of data-driven modeling in organic chemistry.

Linear Free Energy Relationships

Linear free energy relationships (LFERs) represent a well-established and powerful method to relate reactivity with chemical structure, historically represented by quantitative experimental parameters or descriptors (Figure 3A). Parameters describe the influence of a subunit (substituent) of a molecule, an entire chemical structure, or even the solvent. The relationships are linear in energy, meaning that the changes in Gibbs free energy resulting from structural modifications are additive. Thus, these relationships involve logarithms of thermodynamic and kinetic data (e.g., equilibrium and rate constants, respectively). 10,16 This is readily understood by recalling that ΔG° = −RT ln Keq.

Figure 3. (A) Schematic workflow for linear free energy relationships. (B) Ionization constant of benzoic acid used to derive Hammett parameters.

Many of the first parameters in LFERs were derived from reaction equilibria. In 1924, Brønsted and co-workers derived the first quantitative relationship between equilibria and reaction rate. 17 The linear relationship, known now as the Brønsted catalysis law, relates the ionization constant of acids (Ka) to the rate of general-acid-catalyzed reactions via a sensitivity factor α (Figure 4, eq 1). Thus, acid/base dissociation could serve as a reference process that is related to the outcomes of entirely different reactions. One may consider this the first correlation that allows for the prediction of reaction behavior based on quantitative parameters (Ka and α). The Brønsted catalysis law marked the beginning of a revolution in physical organic chemistry. Through the 1930s, many papers noted quantitative relationships between reference reactions and entirely new processes. Specifically, benzoic acid acidity was found to correlate with the rate of various reactions involving substrates, reagents, or catalysts bearing substituted aromatics as well as other fragments. 18−20

Figure 4. Equations for the evolution of free energy relationships. (a) Difference between the analogue and the unsubstituted compound. (b) xi = 1 if the fragment is present; xi = 0 if it is absent.

Thus, in 1937, Hammett introduced an equation that provided a quantitative description of these relationships. 9 The Hammett parameter, σx, for a substituent x is derived from the ionization constant of the corresponding substituted benzoic acid (Figure 3B). The relative stability of an x-substituted benzoate ion is influenced by the electronics of the substituent; e.g., an electron-donating substituent destabilizes the benzoate ion whereas an electron-withdrawing substituent stabilizes the benzoate ion. The Hammett relationship (Figure 4, eq 2) correlates induction and resonance contributions from substituents (the σx-values) to the reactivity of a wide range of organic structures. The ρ-value reveals the sensitivity of a reaction to the induction and resonance changes imparted by the x-substituents relative to x = H. Nearly every class of organic reaction has been analyzed using the Hammett equation or its extended forms (Figure 4).
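
In practice, fitting a Hammett relationship reduces to simple least squares on log relative rates. The sketch below (pure Python) uses approximate literature σp values, but the "measured" log(k x /k H ) values are synthetic, constructed so the underlying sensitivity is ρ ≈ 2; it is illustrative only, not a reproduction of any published data set.

```python
# Least-squares fit of the Hammett equation: log(k_x / k_H) = rho * sigma_x.
# sigma_p values are approximate literature constants; the log relative
# rates are synthetic, generated for illustration only.
sigma_p = {"OMe": -0.27, "Me": -0.17, "H": 0.00, "Cl": 0.23, "NO2": 0.78}
log_rel_rate = {"OMe": -0.55, "Me": -0.33, "H": 0.01, "Cl": 0.45, "NO2": 1.57}

xs = [sigma_p[s] for s in sigma_p]
ys = [log_rel_rate[s] for s in sigma_p]
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

# Slope of the best-fit line through the data = the reaction's sensitivity rho.
num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
den = sum((x - x_bar) ** 2 for x in xs)
rho = num / den
print(f"rho = {rho:.2f}")
```

A positive ρ of this magnitude would suggest negative charge buildup in the transition state; in practice one would also inspect the quality of the fit and look for breaks in linearity.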

An illustrative example in the area of asymmetric catalysis comes from the Jacobsen group in their development of a Mn(III) salen-catalyzed enantioselective epoxidation of alkenes (Figure 5A). Jacobsen and co-workers used an LFER to understand the impact of changing the salen ligand substituent on the enantioselectivity (related to ΔΔG‡) and the mechanism of the asymmetric epoxidation reaction. 21 A Hammett plot demonstrated a linear correlation of the donating ability of the substituent, as measured by σp (the subscript p refers to a substituent in the para position), with the enantioselectivity. On the basis of this observation and other experimental evidence, the researchers concluded that the variation in enantioselectivity resulted from changes in the position of the epoxidation transition state along the reaction coordinate (Figure 5B). 22 An increase in electron density of the ligand resulted in a milder oxidant that would proceed via a more product-like transition state with greater nonbonding interactions between the catalyst and substrate due to proximity, thus increasing enantioselectivity.

Figure 5. (A) Hammett plot for the Mn(III) salen-catalyzed enantioselective epoxidation of alkenes. (B) Mechanistic implications for the position of the transition state along the reaction coordinate. Adapted from ref (22). Copyright 1998 American Chemical Society.

Mechanistic applications of univariate LFERs, such as the Jacobsen example, have been successful in cases where substrates and catalysts can be systematically modified to isolate the impact of a single molecular property and mitigate the effects of other parameters. 23 Further, breaks in linearity, or outliers of the model, often provide additional insights into the mechanism. 24 However, the simplicity of the traditional descriptors used in an LFER can limit the obtainable insight in more complex scenarios. A univariate LFER assumes a linear relationship between a single parameter and reactivity or selectivity; however, chemical reactions are generally more complex as reaction outcomes are dependent on numerous factors and often in a nonlinear manner.

The challenge of modeling reactions influenced by multiple parameters has been investigated throughout the history of LFERs. In an overly simplistic sense, all chemical reactivity can be divided into at least steric and electronic effects (of course, solvent effects are critical, but our focus is on individual structural changes to reactants). In this vein, in 1952, Taft reported a two-variable approach that derived electronic and steric parameters from the rates of acid- and base-catalyzed esterification/hydrolysis (Figure 6). 25 Taft assumed that base-catalyzed hydrolysis would be influenced by both steric and electronic effects, whereas acid-catalyzed hydrolysis would only be influenced by steric effects. This assumption is founded upon the nature of the rate-determining step: formation of the tetrahedral-carbon intermediate. In the case of base catalysis, this step involves a change in charge: a neutral substrate is converted to a negatively charged intermediate, implicating a dominating role for electronic effects. In contrast, under acidic conditions, a positively charged substrate is converted into a positively charged intermediate. Thus, for acid catalysis, there is no change in formal charge; therefore, electronic effects are mitigated and steric effects dominate. Accordingly, Taft derived electronic and steric parameters from the rates of ester hydrolysis under basic and acidic conditions, respectively, arriving at a dual-substituent LFER that separates electronic (σ*) and steric (Es) effects (Figure 4, eq 3). 25−27

Figure 6. Mechanisms of ester hydrolysis under acid or base catalysis.
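
Taft's dual-substituent relationship, log(k/k0) = ρ*σ* + δEs, is a two-variable least-squares problem. The sketch below solves the normal equations directly by Cramer's rule (no intercept, since k0 supplies the reference); the σ* and Es values are illustrative and the log(k/k0) responses are synthetic, constructed so that ρ* ≈ 1.5 and δ ≈ 0.7.

```python
# Two-parameter Taft-type fit: log(k/k0) = rho_star*sigma_star + delta*Es.
# All numbers are synthetic, generated for a well-behaved illustrative fit.
data = [
    # (sigma_star, Es, log(k/k0))
    ( 0.00,  0.00,  0.00),   # reference substituent
    (-0.10, -0.07, -0.20),
    (-0.12, -0.36, -0.43),
    (-0.19, -0.47, -0.62),
    ( 0.60, -0.24,  0.73),
]

# Build the 2x2 normal equations for a no-intercept two-variable regression.
s_ss = sum(s * s for s, e, y in data)
s_se = sum(s * e for s, e, y in data)
s_ee = sum(e * e for s, e, y in data)
b1 = sum(s * y for s, e, y in data)
b2 = sum(e * y for s, e, y in data)

# Solve by Cramer's rule.
det = s_ss * s_ee - s_se * s_se
rho_star = (b1 * s_ee - s_se * b2) / det
delta = (s_ss * b2 - s_se * b1) / det
print(f"rho* = {rho_star:.2f}, delta = {delta:.2f}")
```

Comparing the magnitudes of ρ* and δ indicates how strongly polar versus steric effects control the modeled rates.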

This multiparameter approach inspired others to introduce additional parameters. One particularly well-known approach, the Taft-Topsom equation, separated several substituent effects (e.g., Figure 4, eq 4). Here, parameters for field, induction, polarizability, and resonance and their contributions to the observed reactivity (the associated σx-values) were defined. 28 Building on this, in 1962, Hansch, Fujita, and co-workers moved the field of LFERs toward phenomena more relevant to the pharmaceutical sciences and biochemistry (Figure 4, eq 5), introducing correlations with partition coefficients (such as log P and π). 29 This advance is recognized as the origin of quantitative structure–activity relationships (QSAR) as well as the foundation for the field of chemoinformatics, a term introduced several decades later (vide infra).

Computers

Figure 7. (A) Computed highest occupied molecular orbital. (B) LFER relating (Cnx)², a measure of the energy of the HOMO, to oxidation potential (Vc). (C) Predicted oxidation potential based on computed (Cnx)². 32

Computational parameters offer many benefits over experimental ones, such as the ability to parametrize a structure prior to synthesis or access to parameters with no observable experimental equivalent. A level of automation can also be introduced with the derivation of computational parameters. 30 The relatively good accuracy at low computational cost of density functional theory (DFT) 33 has facilitated the use of computational parameters in linear free energy relationships. 31 However, when computing several structures, especially in cases where multiple conformers need to be surveyed, computational cost can become a challenge. In addition, computational parameters are sensitive to the model system used, such as the functional/basis set, the solvent model, or the parametrization of a single conformer versus several conformers.

The introduction of computers also marked an epochal shift in physical organic chemistry by facilitating the use of large data sets with more extensive computational approaches. 11 Along with the Free-Wilson approach developed in 1964 (Figure 4, eq 6), the early QSAR models used multivariate regression to relate biological activity to the presence or absence of certain substructures in a molecule. 34,35 Also, in this same decade, pattern recognition approaches born from the field of applied mathematics of the 1930s entered the chemistry literature (Figure 2), giving rise to the origin of chemometrics. 36 We note that even LFERs are a form of pattern recognition when σ, σ+, and σ− Hammett values are compared to find the best linear fit, which thereby imparts insight into charge and resonance effects. Importantly, as delineated here, we see a gradual progression from single-variable linear free energy relationships to multivariate algorithms, leading ultimately to large-data chemometric methods.
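
The additivity at the heart of the Free-Wilson scheme (eq 6) is easy to state in code: a predicted activity is the parent activity plus a fixed contribution for each fragment present (xi = 1) and nothing for absent fragments (xi = 0). All fragment names and contribution values below are hypothetical, for illustration only.

```python
# Free-Wilson-style additivity: predicted activity = base activity of the
# parent compound + the contribution of each fragment that is present.
# All values are hypothetical, chosen purely for illustration.
base_activity = 4.0  # activity of the unsubstituted parent (hypothetical)
contribution = {"4-Cl": 0.35, "4-OMe": -0.12, "3-Me": 0.08, "N-Et": 0.50}

def predict(fragments):
    """Sum the parent activity and the contributions of present fragments."""
    return base_activity + sum(contribution[f] for f in fragments)

print(predict(["4-Cl", "N-Et"]))  # 4.0 + 0.35 + 0.50
```

In practice, the contribution table itself is obtained by multivariate regression of measured activities on the 0/1 indicator variables, which is why this analysis is valid only within a closely related compound series.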

Data Sets

Another consideration in the progression of data science in organic chemistry is the availability of large, high-quality experimental data sets. 37,38 These data sets have been compiled either from high-throughput experimentation (HTE) or by combining disparate data sets from the literature. Although HTE has a rich history in the field of biology, its adoption in chemistry occurred only recently, mainly in industrial settings. 39 Alternatively, the organic literature contains a large volume of data, but it is often stored in different unstructured formats. Pioneering work from Lowe introduced an open-access database that extracted data from the USPTO. 40 Other efforts, such as the Open Reaction Database, are seeking to expand access to experimental data through open-access schema and a centralized repository. a While compiling data from the literature has enabled the use of larger data sets, a challenge with this approach is the bias of literature reactions toward positive results, such that only reactions with high yields or selectivity are reported. However, negative results provide important insight into a chemical system and are necessary to build predictive models.

Chemometrics

Chemometrics emerged as a discipline partially due to the ability to use computers in chemistry. 41−43 In the 1960s, many branches of chemistry were generating large data sets from spectroscopy, chromatography, kinetics, and other experimental methods. However, no statistical methods available at the time could cope with data sets containing many variables (often several hundred). Coinciding with the advent of chemometrics, pattern recognition techniques, or classification methods, now referred to as machine learning, were introduced to the chemical sciences. 44,45 In fact, chemical pattern recognition is now regarded, in part, as an origin of chemometrics, and the two terms are often synonymous.

In 1971 (Figure 2), Kowalski and Wold coined the word “chemometrics” (Figure 8) and shortly after founded the International Chemometrics Society in 1974. Their definition of chemometrics is very broad: “the application of mathematical and statistical tools to chemistry”. 41 While other definitions of chemometrics have been published, 46,47 one that we believe is particularly contemporary comes from Massart et al. 13 in 1998: “A chemical discipline that applies mathematics, statistics and formal logic (a) to design and select optimal experimental procedures; (b) to provide maximum relevant chemical information by analyzing chemical data; and (c) to obtain knowledge about chemical systems.” These definitions are broad enough to encompass machine learning or artificial intelligence in any chemical endeavor, including synthetic methodology development.

Figure 8. General workflow for chemometrics.

From 1969 onward (Figure 2), a series of papers 48−51 applied a “computerized learning machine” (a historical term for machine learning methods) to chemical problems. For example, Jurs, Kowalski, and Isenhour applied a learning machine to the interpretation of low-resolution mass spectra of organic compounds, initiating an area of research that would later culminate in the fully fledged chemical data analysis software ARTHUR. 41 They used a single threshold logic unit (TLU) for binary classification (Figure 9). The TLU is an early model of an artificial neuron that returns +1 or −1 based on the sign of the sum of all vector elements after a linear transformation of the input. The model is trained by a gradient descent method now known as the “delta rule” from iterative observations of individual training set members. 52 The iterative training method means that this model truly “learns from experience”, a notion commonly associated with machine learning.

Figure 9. Schematic representation of the binary pattern classifier, the result of which is multicategory pattern classification by least squares. Reproduced from ref (48). Copyright 1969 American Chemical Society.

In this study, each compound was represented by the scaled intensity at integer m/z ratios in the fixed range of 12–132 u. The interpretation task was then transformed into a series of 26 binary classifications to determine the number of carbon, hydrogen, oxygen, and nitrogen atoms in a molecule. This method was further refined to recognize substructures or combine information from mass and infrared spectra in the interpretation tasks. 53,54 Aspects of the modeling procedure, such as training set design and feature selection, remain relevant to data-driven predictive analysis today (see below). 49
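
A minimal version of such a threshold logic unit, trained with the delta rule, fits in a few lines of Python. The feature vectors below are tiny synthetic stand-ins for the scaled spectral intensities used in the original work; the class labels are likewise invented for illustration.

```python
# A single threshold logic unit (TLU) trained with the delta rule:
# predict sign(w . x + b), and after each wrong prediction nudge the
# weights toward the correct answer. Data here are synthetic.
def sign(v):
    return 1 if v >= 0 else -1

def train_tlu(samples, lr=0.1, epochs=50):
    n = len(samples[0][0])
    w = [0.0] * n
    b = 0.0
    for _ in range(epochs):
        for x, target in samples:
            pred = sign(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = target - pred  # delta rule: nonzero only on a mistake
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

# Two linearly separable synthetic classes (labels +1 / -1).
data = [([1.0, 0.2], 1), ([0.9, 0.1], 1), ([0.1, 0.9], -1), ([0.2, 1.0], -1)]
w, b = train_tlu(data)
preds = [sign(sum(wi * xi for wi, xi in zip(w, x)) + b) for x, _ in data]
print(preds)
```

Because the weights are updated only when the unit misclassifies a training member, the model improves with repeated exposure to the data, which is exactly the "learns from experience" behavior described above.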

Another example of a learning algorithm that was used for pattern recognition is Cora (“cortex”), which consists of feature selection and a voting scheme. 55 Ioffe and co-workers utilized this tool for modeling catalytic reactivity in two case studies: 56 (1) the activity of oxides as heterogeneous catalysts for CO oxidation, in which the components were represented by several physicochemical properties, and (2) the use of V2O5 as a catalyst for oxidation of various hydrocarbons, in which the starting materials and products were described by quantum chemical calculations.

Shortly thereafter, Kowalski and Bender applied unsupervised learning to visualize high-dimensional chemical feature spaces in two-dimensional plots using linear projections as well as nonlinear manifold learning techniques. 57,58 They first demonstrated the utility of such two-dimensional representations for subsequent clustering using a divisive hierarchical clustering method and classification by the k-nearest neighbor method. As an example, they demonstrated the clustering and classification of the acid/base character of element oxides on the basis of six physicochemical properties of the elements themselves, thus predicting reactivity from parameters indicative of chemical structure. This ties directly to the use of multiparameter LFERs for predictive purposes.

In 1976, Wold 59 reported the method of Soft Independent Modeling of Class Analogy (SIMCA). SIMCA classifies data by first performing a principal component analysis (PCA) on a data set to determine key features and then separating the data into classes on the basis of these features (Figure 10A). SIMCA is considered to be the origin of modern chemometrics, as opposed to simple curve fitting such as that used in LFERs. 36,59,60 As the first example, Wold and co-workers performed SIMCA analysis of 13C NMR data of norbornanes. The data were analyzed to determine whether the structure of a norbornane is exo or endo and whether there existed consistent patterns for each type of molecule (Figure 10B). 60 In fact, most of the early advances involving SIMCA were for classification. 36 As Figure 11 displays, a number of chemometrics approaches have been developed for a variety of disciplines: 36 chemoinformatics, 61 metabolomics, 62−64 medicinal and pharmaceutical chemistry, 65,66 forensic science, 47 and food science. 67,68 In Figure 11, we depict a breakdown of the common chemometrics methods and their most common graphical results as well as their general utility. The application of Bayesian statistics has also been explored. 69

Figure 10. (A) Workflow for Soft Independent Modeling of Class Analogy (SIMCA). (B) SIMCA used for classifying exo and endo norbornanes. 60
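
The core of the SIMCA idea can be sketched compactly: model each class by its mean and first principal component, then assign a new observation to the class whose model leaves the smaller residual. The two-dimensional data below are synthetic, and a real SIMCA analysis would additionally place statistical limits on the residual distances.

```python
# SIMCA-flavored sketch (pure Python, 2-D data): a disjoint one-component
# PCA model per class, classification by smallest residual distance.
def mean(rows):
    n = len(rows)
    return [sum(r[i] for r in rows) / n for i in range(len(rows[0]))]

def first_pc(rows, iters=100):
    """Class mean + top eigenvector of the 2x2 covariance (power iteration)."""
    m = mean(rows)
    c = [[r[i] - m[i] for i in range(2)] for r in rows]
    sxx = sum(r[0] * r[0] for r in c)
    sxy = sum(r[0] * r[1] for r in c)
    syy = sum(r[1] * r[1] for r in c)
    v = [1.0, 0.0]
    for _ in range(iters):
        v = [sxx * v[0] + sxy * v[1], sxy * v[0] + syy * v[1]]
        norm = (v[0] ** 2 + v[1] ** 2) ** 0.5
        v = [v[0] / norm, v[1] / norm]
    return m, v

def residual(x, model):
    """Distance from x to the class's one-component PCA model."""
    m, v = model
    d = [x[0] - m[0], x[1] - m[1]]
    t = d[0] * v[0] + d[1] * v[1]          # score along the class PC
    r = [d[0] - t * v[0], d[1] - t * v[1]]  # part the class model misses
    return (r[0] ** 2 + r[1] ** 2) ** 0.5

exo = [[1.0, 1.1], [1.2, 1.3], [1.4, 1.5], [1.1, 1.2]]      # synthetic class A
endo = [[1.0, -1.1], [1.2, -1.3], [1.4, -1.5], [1.1, -1.2]]  # synthetic class B
models = {"exo": first_pc(exo), "endo": first_pc(endo)}

new_point = [1.3, 1.4]
label = min(models, key=lambda k: residual(new_point, models[k]))
print(label)
```

The "soft" in SIMCA comes from the classes being modeled independently: a new point can fall inside one class, several classes, or none, depending on those residual thresholds.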

Figure 11. Representative data analysis methods used in chemometrics and their contributions to other disciplines. Exploratory analysis summarizes the main characteristics of multidimensional data: for example, HCA clusters data by distance, while PCA projects data onto the first few principal components to seek the largest variance. Pattern recognition analysis and discriminant analysis can both classify data into different groups. Pattern recognition analysis creates general patterns and classifies new objects into groups; for example, kNN uses a plurality vote of the k-nearest neighbors (e.g., inside the solid/dashed circles), and SIMCA calculates the residual distance from the disjoint PCA models for each group. Discriminant analysis requires the labels of the independent variables (X) for classification: LDA projects X data to seek the greatest separation between the different groups, and PLS-DA is a PLS variant based on categorical dependent variables (Y). As quantitative methods, regression analyses build models that give continuous predictions: MLR regresses Y on X directly; PCR regresses Y on a subset of the principal components of X; PLS projects both X and Y into a new space in which X explains the maximum variance in Y. Black and gray axes describe the original and new data spaces, respectively. yn stands for the nth dependent variable.

Chemoinformatics

With the increasing reliance on informatics in many scientific fields, in silico chemistry, i.e., chemoinformatics, has significantly expanded the range of possible chemical investigations (Figure 12). 70 Even though the term “chemoinformatics” took shape in the late 1990s, the field grew from several origins. 71 Brown first defined the term in 1998: “Chemoinformatics is the mixing of those information resources to transform data into information and information into knowledge for the intended purpose of making better decisions faster in the area of drug lead identification and organization”. 14 However, modern definitions no longer imply that chemoinformatics is necessarily only linked to drug discovery. 70,72−74 For example, Gasteiger and Engel generally defined this discipline as “the application of [the] informatics method to solve chemical problems”. 75 Chemoinformatics can be described as a theoretical chemistry discipline complementary to quantum chemistry and force-field molecular modeling, 1,76 which focuses on describing molecular structure in a favorable format (for example, as matrices) for use in statistical modeling. Irrespective of this broader definition, chemoinformatics is primarily associated with QSAR or quantitative structure–property relationships (QSPRs) focused upon drug-lead identification. 76

Figure 12. General workflow for chemoinformatics.

Early QSAR models, such as Hansch and Free-Wilson analysis, were generally based on multivariate regression with limited features. 11,34 Although groundbreaking, these approaches were only valid for closely related compounds and when linear modeling was applicable. Modern QSAR has increased the use of global models, which are trained on a broad range of compounds, even those lacking structural similarity. Also, the application of sophisticated computational algorithms, embodied in machine learning techniques (discussed below), makes chemoinformatics capable of handling large-scale data sets. 76,77 Chemoinformatics covers a broad range of scientific strategies from chemical data collection and analysis to the exploration of structure–activity relationships and prediction of in vivo compound activities. 78

A general chemoinformatics model often has a “two-part process” to convert molecules to features and then to properties: (1) encode a compound as feature vectors; (2) map the feature vectors to the property of interest by applying chemoinformatics methods (Figure 12). 76 Compared with other branches of computational chemistry, chemoinformatics involves data processing that cannot be done without in silico mathematics and depends on large data sets that cannot be compressed into standard mathematical models. 70 As with chemometrics, chemoinformatics depends upon mathematical, statistical, and machine learning methods to translate chemical data into chemical information with the assistance of a computer. The two fields have borrowed heavily from each other and use many of the same methods. 79 The difference is that chemometrics uses multivariate data from instruments (e.g., spectral data), which often requires no information about chemical structure, while chemoinformatics concentrates on generating data from the description of the chemical structure. Although these two disciplines have a different focus on solving problems in chemistry, some literature 70,79 regards chemometrics as part of chemoinformatics.

Chemoinformatics can be considered as a very specific application of machine learning with an emphasis on modeling structure–property relationships for molecules. Similar to chemometrics, knowledge external to chemistry (e.g., graph theory for developing chemical descriptors) can be integrated into the workflow before machine learning methods are applied. 1
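
The two-part encode-then-map workflow can be made concrete in a few lines: a compound is encoded as a binary fragment fingerprint, and a property is then predicted by nearest-neighbor lookup under Tanimoto similarity. The fragment list, compounds, and property values below are all hypothetical, chosen only to show the shape of the pipeline.

```python
# Minimal two-part chemoinformatics sketch:
# (1) encode a compound as a binary fragment fingerprint;
# (2) map fingerprint -> property by a Tanimoto nearest-neighbor lookup.
# Fragments, compounds, and property values are hypothetical.
FRAGMENTS = ["OH", "C=O", "aryl", "NH2", "Cl"]

def fingerprint(fragments_present):
    return [1 if f in fragments_present else 0 for f in FRAGMENTS]

def tanimoto(a, b):
    both = sum(x & y for x, y in zip(a, b))
    either = sum(x | y for x, y in zip(a, b))
    return both / either if either else 0.0

# Training set: fingerprint -> measured property (hypothetical values).
train = [
    (fingerprint({"OH", "aryl"}), -1.2),
    (fingerprint({"C=O", "Cl"}), -2.8),
    (fingerprint({"NH2"}), -0.5),
]

query = fingerprint({"OH", "aryl", "Cl"})
best_fp, prediction = max(train, key=lambda t: tanimoto(query, t[0]))
print(prediction)
```

Real chemoinformatics pipelines differ mainly in scale and sophistication of each part: richer descriptors (structural keys, circular fingerprints, computed physicochemical properties) in step 1, and regression or machine learning models in step 2.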

Artificial Intelligence and Machine Learning

Artificial intelligence is a general term for the study and construction of “intelligent agents”: devices or programs capable of cognitive functions such as learning, problem solving, and decision making upon perceiving stimuli. 80 This encompasses, but is not limited to, the field of machine learning, which refers to programs that improve with experience at performing a task. 15

With these notions in mind, the application of artificial intelligence and machine learning to chemical problems began in the late 1960s. Several seminal papers were published in 1969, including two highly influential projects that were based on heuristics and thus belong to artificial intelligence in a broader sense: Dendral and Logic and Heuristics Applied to Synthetic Analysis (LHASA).

Referred to as the first expert system, the Dendral project led by Feigenbaum, Buchanan, Lederberg, and Djerassi made extensive use of heuristics with the aim of scientific hypothesis generation. 81,82 Its utility for chemical questions was first demonstrated by the enumeration of isomers of organic molecules given a molecular formula 81 as well as the interpretation of mass spectral data of ketones. 82 Corey and Wipke’s LHASA 83 was the first implementation of the formalized rules of retrosynthesis that Corey had published two years prior. 84 This marked the beginning of the ongoing and active development of computer-assisted synthesis planning software. 85 Other groups pursued this goal early on, 86 including Dugundji and Ugi’s use of a matrix representation of molecules 87 or Gelernter et al. applying another heuristics-based approach. 88

Around 1988, the term “machine learning” (ML) started appearing in the titles of chemistry literature (Figure 2). 89−92 The introduction of machine learning techniques in the early 1990s marked a pivotal point in the evolution of chemical analysis methodology. 93 This blurred the line between what is considered chemometrics and what is considered machine learning, but we believe a subtle distinction has evolved: a reliance on linear relationships is now more associated with chemometrics, whereas nonlinear relationships and large data sets are more commonly considered ML. 70 In actuality, there is no sharp distinction between the statistical methods of chemometrics and machine learning. In both cases, computers are used to generate models with increasingly sophisticated model selection algorithms as the machine learning/chemometrics community improves its approaches.

While chemometricians will claim support vector machines (SVMs), artificial neural networks (ANNs), and forest methods (such as random forest, RF) for their field, organic chemists generally consider these “advanced chemometrics” methods as machine learning (Figure 13). There is literature 94−97 that compares the results from traditional chemometric methods to what is now commonly termed machine learning methods (e.g., SVM, ANN, RF). However, we believe these methods (“traditional” or “advanced”) should not be compared simply by their performance. The performance of a model relies on whether the algorithm is suitable for the data, which means that methods should be selected according to the properties of the data and the hypothesis to be analyzed. For example, as Brereton and Lloyd articulated in their review, most applications of SVM in analytical chemistry are on data sets with small numbers of variables. 98 However, there is no inherent reason they cannot be extended to highly multivariable data sets. Both chemometrics and machine learning evolved from the fields of pattern recognition and computational learning theory by applying statistical methods to improve model performance. 99 For relatively small or sparse data sets, simple machine learning algorithms (e.g., multiple linear regression, linear discriminant analysis (LDA), PCA, and PLS) may work well. With larger amounts of data and higher complexity, especially in high-throughput screening (HTS) assays, the sought-after predictions often benefit from more sophisticated algorithms (e.g., SVM, RF, ANN). 3

An external file that holds a picture, illustration, etc. Object name is oc1c00535_0023.jpg

Machine learning methods. (A) SVM models represent data as points in space and separate classes of the data with hyperplanes. (B) ANN models contain a series of artificial neurons that receive, process, and output data. (C) RF models analyze data by a series of decision trees.

Applications of SVMs have developed rapidly since the late 1990s in several research areas, including bioinformatics and biometrics. SVMs map data as points in higher-dimensional spaces and perform classification by identifying hyperplanes that separate clusters in the data. In the early 2000s, SVMs were introduced to chemistry for QSAR and protein structure studies. 102−105 SVMs can be applied to both classification and regression problems. 98 ANNs are highly complex methods that pass information through interconnected layers of mathematical transformations, thereby generating internal representations of the original data. 106 In chemistry, interest in neural-network computing has grown rapidly since 1986, 107 and different aspects of ANN methods have been investigated in QSAR studies since the 1990s. 108 More recently, the complex cognitive capacities of some ANN architectures have enabled applications beyond the prediction of numerical targets, as described below. RFs are an ensemble method that improves prediction accuracy with a majority-voting scheme, extending decision tree algorithms. 100,101
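The majority-voting scheme underlying forest methods can be sketched in a few lines. The code below is a deliberately minimal illustration, with bootstrap-resampled one-split “stumps” standing in for full decision trees and a toy one-descriptor reactive/unreactive data set; it is an assumption for this Outlook, not any published chemometric model.

```python
import random

def train_stump(X, y):
    """Fit the best one-split decision stump found by random search."""
    best = None
    for _ in range(20):
        f = random.randrange(len(X[0]))          # random feature
        t = random.choice(X)[f]                  # random threshold from data
        pred = [int(x[f] > t) for x in X]
        acc = sum(p == yi for p, yi in zip(pred, y)) / len(y)
        for flip in (False, True):               # also try the inverted rule
            a = 1 - acc if flip else acc
            if best is None or a > best[0]:
                best = (a, f, t, flip)
    return best[1:]                              # (feature, threshold, flip)

def stump_predict(stump, x):
    f, t, flip = stump
    p = int(x[f] > t)
    return 1 - p if flip else p

def train_forest(X, y, n_trees=25):
    """Train each stump on a bootstrap resample of the data."""
    n = len(X)
    forest = []
    for _ in range(n_trees):
        idx = [random.randrange(n) for _ in range(n)]
        forest.append(train_stump([X[i] for i in idx], [y[i] for i in idx]))
    return forest

def forest_predict(forest, x):
    """Majority vote over all stumps in the ensemble."""
    votes = sum(stump_predict(s, x) for s in forest)
    return int(2 * votes >= len(forest))

random.seed(0)
X = [[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]]   # one toy descriptor
y = [0, 0, 0, 1, 1, 1]                           # e.g., unreactive/reactive
forest = train_forest(X, y)
```

Real RF implementations grow deep trees over random feature subsets, but the essential ensemble idea, averaging many weak, partially decorrelated learners by voting, is already visible here.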

The use of chemometrics and ML in chemical applications corresponds mainly to descriptive and predictive settings, respectively. In descriptive modeling, the focus is on finding a quantitative relationship among the data that can be interpreted and then applied to make predictions. Conversely, in a predictive setting, machine learning is the more commonly used terminology: the primary purpose of these mathematical applications is to make predictions, such as the outcome of a reaction, and the interpretation is perhaps secondary. Impressive applications have become possible using machine learning techniques, some of which autonomously generate scientific hypotheses (see below). Not only can this streamline the chemical discovery process, but it can also lead to experiments and discoveries that might not have been considered on the basis of human intuition or reasoning alone. This has been illustrated for several tasks relevant to organic chemistry, including molecular design, synthesis planning, and reaction optimization and discovery. We now briefly highlight some representative examples; a full discussion of the more recent achievements is beyond the scope of this Outlook. 109

Reaction Optimization and Catalyst Design

Machine learning has been applied to several aspects of reaction optimization, including: (a) the qualitative prediction of what reaction occurs between a set of starting materials and reagents, (b) the quantitative prediction of reaction outcomes given examples of a known reaction with variations of reaction conditions, reagents, or catalysts, and (c) autonomous reaction exploration, which requires one to select reaction conditions to try in each successive test iteration.

Qualitative Prediction

Deciding if a reaction occurs between certain starting materials in the presence of certain reagents and predicting the product are key intuitive skills that chemists learn during their training. This skill also serves as the basis for suggesting novel reactions. Work toward computer models with such capabilities has been carried out throughout the history of AI applications in chemistry. 87,110,111 Most early approaches utilized expert-coded reaction templates to map reactions to starting materials, a daunting task given the sheer quantity of possible reactions.

An alternative to expert-coded reaction templates is to learn chemical reactivity from a large reaction database with appropriate ML models. One approach is the use of graph-convolutional networks (GCNs). In a recent example, Coley et al. achieved this by representing molecules as annotated graphs and using a GCN to learn an internal representation of the atoms and molecules and, finally, to predict the bond changes occurring in a reaction (Figure 14). 112 Another recent approach to template-free reaction prediction utilizes techniques originally developed for natural language processing. Formally, the prediction task is treated as a “translation” from the language of reactants/reagents to the language of the products. In practice, reaction information is already stored in text form, most commonly as SMILES (Simplified Molecular-Input Line-Entry System) strings. 113 Schwaller et al. showed that a transformer model with a multihead attention mechanism, termed the molecular transformer, was able to perform this translation task (Figure 14). 114 Both groups employed data sets from the USPTO patent database 40 to train and test their models. On a common subset consisting of ca. 400k reactions for training and 40k for testing, the GCN model predicted the highest-ranked product correctly in ca. 86% of the test cases and the molecular transformer in ca. 90%. Both models output ranked lists of possible products, which can further be used to predict possible side products of a reaction. In the molecular transformer model, a visualization of the attention mechanism revealed a fascinating finding: the model learned to perform atom mapping, correctly connecting the atoms in the products to the corresponding atoms in the reagents, without having been trained on mechanistic information. 115 In fact, many other models for reactivity prediction require atom mapping as input information along with the reagents and products of a reaction, which is a major drawback because atom mapping is a tedious and error-prone procedure for large reaction data sets. For some reactions, even the determination of the correct atom mapping can require difficult mechanistic studies.
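Treating reaction prediction as translation presupposes a tokenization of the SMILES “language”. The short sketch below uses a regular expression of the kind employed for such sequence models; the exact pattern here is a simplified assumption for illustration, not the published tokenizer.

```python
import re

# Token classes: bracket atoms, two-letter halogens, organic-subset atoms,
# branches, bonds, charges, the reaction ">" separator, and ring-closure digits.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%\d{2}|\d)"
)

def tokenize(smiles):
    """Split a SMILES string into model-ready tokens, losslessly."""
    tokens = SMILES_TOKEN.findall(smiles)
    assert "".join(tokens) == smiles, "unrecognized SMILES characters"
    return tokens

print(tokenize("CC(=O)Oc1ccccc1C(=O)O"))   # aspirin
```

Note that “Cl” and “Br” must be matched before the single-letter atoms so that, e.g., chlorine is not split into a carbon token plus a stray character; concatenating the tokens must reproduce the input exactly so that no structural information is lost in “translation”.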


Graph-convolutional network (GCN) and a molecular transformer as applied to modeling molecules and predicting reaction outcomes. Adapted from ref (112) with permission from the Royal Society of Chemistry, Copyright 2019. Adapted from ref (114). Copyright 2019 American Chemical Society.

Quantitative Prediction


Machine learning for the quantitative prediction of reaction outcomes. (A) Prediction of the reaction yield for Buchwald–Hartwig C–N couplings. 117 (B) Prediction of the enantioselectivity of chiral phosphoric acid-catalyzed thiol additions to N-acylimines. 118

In another example, the Denmark group investigated the enantioselectivity of chiral phosphoric acid-catalyzed thiol additions to N-acylimines as a function of both the catalyst and the substrates (Figure 15B). 118 The catalysts were represented by their average steric occupancy on a three-dimensional grid in order to reflect conformational flexibility in the molecular representation. Using a total of 2150 reactions, they found that support vector regression and deep feed-forward neural networks were best suited to predict the enantioselectivity of each reaction. Although the training set for the deep feed-forward neural network comprised only catalysts that gave less than 80% ee, the model was able to predict the enantioselectivity of catalysts that gave higher than 80% ee.
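The core of the average-steric-occupancy idea can be illustrated with a small sketch: each conformer's atomic coordinates are binned onto a three-dimensional grid, and the occupied voxels are averaged over the ensemble. The grid extent, resolution, and toy two-atom “conformers” below are assumptions for illustration; the published descriptor is computed more elaborately (e.g., accounting for atomic volumes).

```python
import numpy as np

def average_steric_occupancy(conformers, lo=-2.0, hi=2.0, n_bins=8):
    """Fraction of conformers whose atoms occupy each voxel of a cubic grid."""
    edges = np.linspace(lo, hi, n_bins + 1)
    grid = np.zeros((n_bins, n_bins, n_bins))
    for coords in conformers:                    # coords: (n_atoms, 3) array
        idx = np.clip(np.searchsorted(edges, coords, side="right") - 1,
                      0, n_bins - 1)             # voxel index of each atom
        occupied = np.zeros_like(grid, dtype=bool)
        occupied[idx[:, 0], idx[:, 1], idx[:, 2]] = True
        grid += occupied                         # count each voxel once per conformer
    return grid / len(conformers)                # occupancies in [0, 1]

# Two toy "conformers" sharing one atom position
conf_a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
conf_b = np.array([[0.0, 0.0, 0.0], [-1.0, 0.0, 0.0]])
aso = average_steric_occupancy([conf_a, conf_b])
```

Flattening `aso` gives a fixed-length numerical vector, which is the form in which a conformationally flexible catalyst can be handed to a regression model.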

Many other approaches have been taken to represent molecules to predictive algorithms without the need for DFT-computed descriptors. 119 These include representations rooted in chemoinformatics, such as molecular fingerprints 120 or text-based fingerprints, 121 as well as representations intended to provide a physical description of the molecules, such as the Coulomb matrix and its successor SLATM, 122 but a full discussion of this field is beyond the scope of this Outlook.

Autonomous Reaction Exploration

Reaction automation has also been used to search for entirely new reactions. Cronin’s group used an automated synthesis platform to carry out experiments in a limited chemical space defined by a certain number of molecules as potential starting materials (Figure 16). 123 A support vector machine classifier was trained to detect whether a reaction had occurred in each experiment, and the results were used to populate a reaction database. From this database, the chemical space was modeled by linear discriminant analysis (LDA) to suggest successive experiments with a higher probability of a reactive combination of starting materials. Using this workflow, four new reactions were identified and reproduced in separate batch experiments.


Automated synthesis platform. Figure adapted with permission from ref (123). Copyright 2018 Springer Nature.


Retrosynthesis algorithm. (A) Schematic overview of the Monte Carlo tree search. (B) Schematic overview of the expansion procedure. Figure adapted with permission from ref (125). Copyright 2018 Springer Nature.

A common question about machine learning models concerns their potential for creativity. In domains outside of chemistry, models have been developed that are capable of generating images, text, or music by sampling from a learned latent space of their domain of applicability. 128,129 The utility of such generative models has also been explored in the context of chemical discovery, most commonly in medicinal chemistry, where molecules with specific physiological and physicochemical properties need to be designed. 130,131 One approach is to represent molecules as text that encodes the full structure, for example, as SMILES strings, 113 and to adapt text-generating models, such as those developed for natural language processing, to generate SMILES strings corresponding to new molecules. 132 This can be used to generate potential drug candidates by applying desired properties, such as biological activity or solubility, as constraints on the generated SMILES strings 133 and, historically, falls under chemoinformatics (see above).

Modern Examples of LFERs

As presented in this Outlook, ways to describe and predict chemical reactivity have greatly expanded in the field of chemistry. Since the initial introduction of linear free energy relationships, models capable of describing much more complex problems have emerged. However, these new methods do not diminish the descriptive and predictive power of classic models like linear regression. In a recent report from the Biscoe and Sigman groups, multivariate LFERs were highlighted as a way to analyze a reaction mechanism and predict reactivity (Figure 18). 134 The authors found that the enantiospecificity (es) of a Pd-catalyzed alkyl-Suzuki reaction could be described by the computed orbital energy of the phosphorus ligand lone pair (ELP(P)) and the computed energy of the P–C σ* orbitals (Eσ*(P–C)), a measure of the π-backbonding ability of the ligand. This correlation suggests that the stereoinvertive transmetalation proceeds through a coordinatively unsaturated intermediate that would be stabilized by strong σ donation from the ligand. In contrast, the stereoretentive transmetalation is stabilized by π-backbonding, indicating that this transformation involves the precoordination of a donor on the substrate, likely OH–. The addition of two steric parameters, the Sterimol parameter B1Boltz and the ligand length (L), to account for competitive β-hydride elimination further improved the fit of the model. The model, which was based on a series of ligands in a training set, gave an excellent fit (R2 = 0.94). It could also predict the es for a validation set of ligands not included in the original model (R2(EV) = 0.87).
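The workflow behind fitting and validating such a multivariate LFER can be sketched generically. The descriptor matrix and response below are synthetic stand-ins (not the published ELP(P), Eσ*(P–C), or Sterimol values); the point is the least-squares fit and the R2 bookkeeping for the training and external-validation sets.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic descriptor matrix standing in for four computed parameters
X = rng.normal(size=(24, 4))
true_coef = np.array([1.5, -0.8, 0.4, 0.2])
y = X @ true_coef + 0.05 * rng.normal(size=24)      # noisy "es" response

A = np.column_stack([np.ones(len(X)), X])           # prepend intercept column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)        # multivariate regression

def r_squared(y_obs, y_pred):
    ss_res = np.sum((y_obs - y_pred) ** 2)
    ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)
    return 1 - ss_res / ss_tot

r2 = r_squared(y, A @ coef)                         # training-set fit

# External validation: score the same model on unseen "ligands"
X_val = rng.normal(size=(8, 4))
y_val = X_val @ true_coef + 0.05 * rng.normal(size=8)
A_val = np.column_stack([np.ones(len(X_val)), X_val])
r2_ev = r_squared(y_val, A_val @ coef)
```

A model that fits the training set well but scores poorly on the external validation set is overfit, which is why reports such as this one quote both R2 and R2(EV).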


(A) Pd-catalyzed Suzuki reaction. (B) Multivariate regression model for the full data set. (C) Contributing parameters and mechanistic implications. Adapted from ref (134) with permission from AAAS, Copyright 2018.

Developments in machine learning methods have also aided the implementation of classic models. For example, volcano plots have been used in heterogeneous and homogeneous catalysis to estimate catalyst performance on the basis of Sabatier’s principle, which states that an active catalyst should bind substrate neither too tightly nor too loosely (the plateau of the volcano plot). 135,136 In the context of transition metal-catalyzed cross-coupling, the Corminboeuf group demonstrated that a descriptor such as the relative energy (ΔE) of oxidative addition can determine whether a catalyst falls into this active range (Figure 19A). 137 While this descriptor can be computed by DFT, the computational cost of doing so for thousands of catalysts is intractable. Instead, machine learning can be used to estimate the ΔE values of oxidative addition, circumventing costly DFT computations. 138 The catalyst library (Figure 19B) studied in this work consisted of combinations of 91 ligands (CO, phosphines, N-heterocyclic carbenes, and pyridines) and 6 transition metals (Ni, Pd, Pt, Cu, Ag, and Au), for a total of 25 116 possible species for each intermediate. A kernel ridge regression (KRR) model, trained on 7054 complexes, predicted the ΔE values of 18 062 additional complexes. Using a preconstructed volcano plot, 557 complexes were identified as having ΔE descriptors within the active window for catalysis. This work highlights the ability of machine learning to readily screen thousands of possible catalyst/ligand combinations without the need for costly DFT computations.
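Kernel ridge regression itself is compact enough to sketch from scratch: the model solves a regularized linear system in kernel space and predicts new values as kernel-weighted sums over the training set. The radial-basis kernel, the hyperparameters, and the one-dimensional surrogate “descriptor → energy” map below are illustrative assumptions, not the representation used in the cited work.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Gaussian (RBF) kernel matrix between two sets of descriptor vectors."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def krr_fit(X, y, lam=1e-3, gamma=0.5):
    """Solve (K + lam*I) alpha = y for the dual coefficients alpha."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_new, gamma=0.5):
    """Predictions are kernel-weighted sums over the training points."""
    return rbf_kernel(X_new, X_train, gamma) @ alpha

# Toy surrogate: learn a smooth nonlinear "descriptor -> energy" map
X_train = np.linspace(0.0, 3.0, 31).reshape(-1, 1)
y_train = np.sin(X_train).ravel()
alpha = krr_fit(X_train, y_train)
pred = krr_predict(X_train, alpha, np.array([[1.55]]))
```

Once trained on a few thousand DFT-computed examples, evaluating such a model on tens of thousands of candidate complexes costs only kernel evaluations, which is the source of the screening speedup described above.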


(A) Volcano plot showing ΔE(pds) vs ΔE(rxn). ΔE(pds) is the energy difference of the potential-determining step; region I is reductive elimination, region II is transmetalation, region III is oxidative addition. ΔE(rxn) is the energy difference of oxidative addition. (B) Catalyst library. Adapted from ref (137). Copyright 2021 American Chemical Society.

Conclusion

It is clear from the timeline embodied in Figure 2 that the use of data-driven modeling in chemistry has a long and rich history. Over a period of nearly 100 years, chemists have created a multitude of approaches for examining experimental data to draw mechanistic conclusions and make predictions of reactivity. The earliest correlations evaluated differences in free energies and took the form of linear univariate (or sometimes multivariate) relationships involving parameters, mainly derived from experimental measurements of substituent effects dictated by systematic changes in chemical structure (e.g., pKa, σ, E, etc.). These linear free energy relationships correlated substituent effects (induction, resonance, sterics, etc.) to reactivity and are primarily used to explore reaction mechanisms, but, as we have noted, the correlations can also be predictive. If parameters for new chemical structures are known, the linear relationships will reveal where the new structures fall in a spectrum of reactivities. By the early 21st century, this predictive power compelled chemists to become increasingly sophisticated in the kinds of correlations used, resulting in the use of nonlinear kernel functions, multilayered neural networks, and random forests (Figure 13), as well as other mathematical and statistical approaches commonly referred to as “machine learning”. In these studies, the parameters have become far broader and often include spectral or computational data, while still retaining elements of electronic and steric substituent effects.

Along this 100-year journey, new terminology was introduced into the literature to differentiate the applications and advent of mathematical techniques as well as experimental analyses and predictive approaches. To follow this evolution of terms, we return to the timeline of Figure 2 and the definitions we chose for this Outlook in Figure 1. Chemometrics, as originally defined (see the discussion of Figure 8 above), is so broad that it encompasses the use of any kind of mathematical and/or statistical approach involving structural changes, experimental data, or computational parameters to understand and predict a chemical phenomenon. This would include computer analysis, and, if being able to predict depends upon having learned, it would include the use of machine learning in chemistry. We have emphasized that exactly the same protocols, i.e., PCA, SVM, RF, ANN, etc., serve as the tools of both chemometrics and machine learning. The definition of chemoinformatics is similarly broad, encompassing the use of “informatics” to solve chemical problems of any kind (see the discussion of Figure 12 above), where informatics means describing a molecular structure in a computer-readable format, such as a matrix of values. While chemoinformatics was, and still is, primarily associated with drug discovery, its terminology is only subtly different from that used in chemometrics and, therefore, machine learning. This brings us to the terminology associated with machine learning, where one additional feature is explicit irrespective of the field in which it is applied: automatic improvement with experience. If performed with a computer, this implies artificial intelligence, in which a cognitive function such as learning, problem solving, or decision making is exercised upon perceiving stimuli. When used in organic chemistry for reaction discovery or optimization, the application of computational and statistical methods to perform these cognitive functions is a subtle, but important, difference from chemometrics and chemoinformatics. The upshot is a tangled web of interrelated terminology as the field of data-driven science in organic chemistry has evolved over the past 100 years.

With all of these traditional tools in play, it is no wonder that we have seen a significant uptick in machine learning reports in organic chemistry. This has been aligned with questions of how to use available data more effectively, especially in industrial settings, and how to design data acquisition with the intention of using machine learning techniques from the outset. This focus on the “data” aspect supports many exciting directions for streamlining synthetic goals by integrating modern and sophisticated data/computer science algorithms, such as molecular/catalyst design, complex molecule synthesis, reaction optimization, and mechanistic interrogation. Each of these areas will also be aided by updates to parallel reaction screening technologies that integrate data-rich outputs (e.g., temporal and kinetic measures 139 ), likely resulting in fully automated reaction discovery and optimization workflows. Finally, the entire premise of this field, providing understanding of chemical processes through quantitative featurization, is foundational to how one can imagine using data science in everyday mechanistic investigations and reaction methodology development.

Acknowledgments

We thank Dr. Jose A. Garrido and Dr. Andrzej M. Żurański for helpful discussions and Dr. Jose A. Garrido for assistance with the graphics. A.G.D. and M.S.S. gratefully acknowledge the NSF under the CCI Center for Computer Assisted Synthesis (CHE-1925607) for support. E.V.A. thanks the Welch Regents Chair (F-0046) for support. T.G. is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy – EXC 2008/1 – 390540038 and by a Liebig Fellowship of the Fonds der Chemischen Industrie.

Author Contributions

‡ W.L.W. and L.Z. contributed equally.

Notes

The authors declare no competing financial interest.
