This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Empowering Machines to Think Like Chemists: Unveiling Molecular Structure-Polarity Relationships with Hierarchical Symbolic Regression

Siyu Lou,1,2† Chengchun Liu,3† Yuntian Chen2∗ Fanyang Mo3,4∗

1Department of Computer Science and Engineering, Shanghai Jiao Tong University,
Shanghai 200240, P.R.China
2Ningbo Institute of Digital Twin, Eastern Institute of Technology,
Ningbo 315200, P.R.China
3School of Materials Science and Engineering, Peking University,
Beijing 100871, P.R.China
4AI for Science (AI4S)-Preferred Program, Peking University Shenzhen Graduate School,
Shenzhen 518500, P.R.China
These authors contributed equally to this work.
To whom correspondence should be addressed;
E-mail: [email protected], [email protected]

Thin-layer chromatography (TLC) is a crucial technique in molecular polarity analysis. Despite its importance, the interpretability of predictive models for TLC, especially those driven by artificial intelligence, remains a challenge. Current approaches, utilizing either high-dimensional molecular fingerprints or domain-knowledge-driven feature engineering, often face a dilemma between expressiveness and interpretability. To bridge this gap, we introduce Unsupervised Hierarchical Symbolic Regression (UHiSR), combining hierarchical neural networks and symbolic regression. UHiSR automatically distills chemical-intuitive polarity indices, and discovers interpretable equations that link molecular structure to chromatographic behavior.

Introduction

Polarity reflects uniformity and symmetry of charge distributions in electrets  (?). In particular, molecular polarity is crucial for understanding how molecules interact with each other, and it plays a fundamental role in the characterization of molecules  (?). Thin-layer chromatography (TLC) has gained widespread acceptance (?, ?) and provides critical insights into the behavior of organic molecules within varied solvent environments. Despite the established importance of TLC in determining the retardation factor (RfR_{f}), a crucial measure in polarity studies, the method is labor-intensive and involves large numbers of repetitive trials.

In recent years, significant progress has been made in the field of artificial intelligence-assisted chemical analysis, especially in the development of quantitative structure–activity relationships (QSAR) prediction models  (?, ?, ?, ?) and quantitative structure–property relationships (QSPR) prediction models  (?, ?, ?). There are some attempts have been made to construct prediction models for structure-retardation factors  (?, ?), but their performance is constrained by the limited quantity and standardization of TLC data. To address this challenge, our prior work  (?, ?) introduced an automatic high-throughput platform for TLC analysis, providing a more extensive and standardized dataset for model training. The enhanced dataset serves to improve the performance of the model, allowing for more accurate and robust predictions of structure-retardation factors in TLC experiments.

Although machine learning models, especially deep neural networks (DNNs) have demonstrated excellent performance, their “black box” nature raises concerns about interpretability and understanding of the underlying mechanisms. In this paper, we propose an innovative framework that integrates chemists’ experiential knowledge, aiming to “unpack” the “black box” of molecular structure-polarity relationships.

Essentially, our approach seeks a formalized expression to reveal the intricate mathematical relationship between the RfR_{f} value and the solute molecular structures as well as the eluent solvents in TLC analysis. Symbolic regression (SR) has recently emerged as a powerful tool for extracting the underlying mathematical equations from observed data  (?, ?, ?, ?, ?, ?, ?). It has gained prominence for scientific discovery in diverse fields  (?, ?, ?). However, existing SR methods encounter performance limitations, particularly when dealing with datasets containing multiple variables. Therefore, when using SR methods to explain structure-polarity prediction models, a dilemma emerges: molecular fingerprints, such as MACCS key molecular fingerprints  (?), despite their strong expressive power, pose great challenges for SR methods due to their high dimensionality; on the other hand, utilizing feature engineering and domain knowledge with a limited set of input variables results in discovered formulas lacking expressiveness.

To address this dilemma between expressiveness and interpretability, we present unsupervised hierarchical symbolic regression (UHiSR) (see Fig. 1), a new SR approach guided by hierarchical neural network. UHiSR introduces novel polarity indices, e.g., solvent polarity index and solute polarity index, which are learned through a semantic hierarchical model. Guided by domain knowledge, this model comprises multiple sub-models that govern the mapping relationships between a set of input variables, e.g., solvent compositions, and a specific polarity index, e.g., solvent polarity index, as well as the mapping between polarity indices and the output variable RfR_{f}. It is noteworthy that this process aligns with the chemists’ thinking processes. Human brain struggles with high-dimensional information, opting not to directly comprehend intricate mappings but rather to decompose and analyze through individual sub-models. Then, we can apply the SR method on the output RfR_{f} and the polarity indices, effectively reducing the dimension of the input variables of SR algorithm.

Refer to caption
Figure 1: Overview of Unsupervised Hierarchical Symbolic Regression (UHiSR). (A) Illustration of TLC experiment and the calculation of the retardation factor (Rf)R_{f}). (B) Feature engineering, involving five solvent features based on volume percentages and the decomposition of target molecules into functional groups (FG). The molecular structure is treated as a composite formed by stacking various functional group modules. (C) UHiSR framework with three main stages: chemist-guided feature clustering, hierarchical neural network for latent variable extraction (e.g., solute polarity index), and symbolic regression for discovering explicit equations between the target value and latent variables.

In addition, experienced organic chemists often do not use the molecular fingerprints to evaluate what solvent system should be used for TLC experiments for a given target molecule. Instead, organic chemists evaluate polarity primarily based on the nature, number, and molecular structure of functional groups in solute compounds  (?). Therefore, in this study, we seek a simpler and more chemically intuitive perspective, focusing on the basic characteristics of complex molecules, moving from computer engineering to a human-scientist-like thinking mode.

In conclusion, our work introduces several novel chemical-intuitive polarity indices, including the solvent polarity index and solute polarity index, in TLC experiments. These indices provide quantitative insights into the interaction dynamics among molecules in both solid-liquid and liquid-liquid systems. Furthermore, we present a concise-yet-accurate formula that links the RfR_{f} value with these polarity indices. The UHiSR framework, driven by both data and domain knowledge, not only uncovers these meaningful polarity indices directly from experimental data, but also constructs an equation that governs the underlying mechanisms. It empowers the machine to think like chemists, and builds a bridge between AI models and interpretable insights in molecular structure-polarity predictive modeling.

Results

Data driven interpretable polarity indices

It is essential for chemists to accurately characterize the overall polarity of molecules. For this purpose, there are many known descriptors such as the molecular polarity index (MPI), which is based on theoretical calculations. However, they often come with issues like high computational cost or only providing global information. Another way to estimate the molecule’s polarity is based on TLC experiments, where chemists typically keep solutes or solvents fixed, evaluate different solvent compounds or solute compounds, and utilize the RfR_{f} value to infer the corresponding polarities.

Based on the TLC dataset, we used a data-driven method to extract two polarity indices, namely Ψ\Psi and ξ\xi, acting as empirical descriptors providing information about the polarity of the solvent system and solute molecule, respectively. In contrast to traditional physicochemical descriptors, our proposed polarity indices offer enhanced interpretability and the ability to capture detailed structural information.

Chemical-intuitive feature engineering.

Rather than employing molecular fingerprints or conventional physicochemical descriptors like common machine learning does, our approach deliberately labels functional groups, just like human chemists typically do. In particular, this study encompasses an array of solute molecules, including those containing carbonyl, hydroxyl, amino, halogen, nitro, and cyano functionalities. This enables a tailored analysis of how these functional groups influence the compound’s RfR_{f} value across various solvents in terms of type, number, and allocation. For notational simplicity, we utilize the abbreviations from Table 1 for the features employed in our study. This table also provides the descriptions and statistics of these features. Our method merely uses 1919 distinct features for describing solute molecules, and thereby demonstrates an effective reduction of dimensionality compared with MACCS key molecular fingerprints (167167 dimensions). Furthermore, in contrast to traditional physicochemical descriptors, which are commonly low dimensional, our features offer a more localized and interpretable representation of the target molecule.

To evaluate the effectiveness of the selected features, we conducted a comparative analysis of their representation power (see Materials and Method for more details). As shown in Fig. 2, our proposed features demonstrate comparable fitting performance to MACCS, and slightly surpass conventional physicochemical descriptors. The results emphasize the substantial representational power of our features.

Table 1: Description and statistic analysis of the features used in this study. The abbreviations, descriptions, mean values along with the standard deviations (Std.) are presented. Additionally, Pearson’s correlation coefficients (rr) are computed to quantify the linear relationship between each feature and the observed RfR_{f} value.
Feature abbr. Feature description Mean(Std.) Pearson’s rr
Eluent solvent
Hex n-Hexane concentration 0.36(0.40)0.36(0.40) 0.54-0.54
EA Ethyl acetate concentration 0.15(0.28)0.15(0.28) 0.290.29
DCM Dichloromethane concentration 0.40(0.48)0.40(0.48) 0.210.21
MeOH Methanol concentration 0.014(0.025)0.014(0.025) 0.310.31
Et2O Diethyl ether concentration 0.071(0.22)0.071(0.22) 0.130.13
Solute molecule
NBen Count of benzene 1.12(0.57)1.12(0.57) 0.070.07
MSD Molecular skeleton descriptor 3.79(2.32)3.79(2.32) 0.070.07
DM Dipole moment 1.34(0.74)1.34(0.74) 0.02-0.02
CtPhenol Count of phenolic hydroxyl group 0.15(0.40)0.15(0.40) 0.15-0.15
CtOH Count of alcoholic hydroxyl group 0.04(0.21)0.04(0.21) 0.13-0.13
CtAldehyde Count of aldehyde group 0.19(0.40)0.19(0.40) 0.007-0.007
CtCO2H Count of carboxylic acid group 0.03(0.18)0.03(0.18) 0.17-0.17
CtRCO2R Count of ester group 0.10(0.32)0.10(0.32) 0.030.03
CtR2C=O Count of ketone group 0.35(0.54)0.35(0.54) 0.06-0.06
CtROR Count of ether group 0.17(0.47)0.17(0.47) 0.04-0.04
CtCN Count of cyano group 0.06(0.25)0.06(0.25) 0.01-0.01
CtNH2 Count of amine group 0.08(0.28)0.08(0.28) 0.20-0.20
CtNO2 Count of nitro group 0.06(0.25)0.06(0.25) 0.0030.003
CtAmide Count of amide group 0.02(0.14)0.02(0.14) 0.14-0.14
CtMe Count of methyl group 0.20(0.54)0.20(0.54) 0.100.10
CtF Count of fluorine 0.25(0.81)0.25(0.81) 0.070.07
CtCl Count of chlorine 0.11(0.39)0.11(0.39) 0.130.13
CtBr Count of bromine 0.15(0.39)0.15(0.39) 0.150.15
CtI Count of iodine 0.12(0.33)0.12(0.33) 0.220.22
Refer to caption
Figure 2: Comparative analysis of different molecular feature sets: the fitting accuracy is evaluated using the XGBoost model across three different feature groups. These feature groups comprise: (A) Features introduced in this paper, detailed in Table 1; (B) MACCS keys; (C) physicochemical descriptors.

Polarity indices.

In this study, the stationary phase on the TLC plate is silica gel. The determined RfR_{f} values epitomize a dynamic equilibrium. It reflects the competitive interplay between the highly polar stationary phase and the solute molecules, which are mediated through interactive forces by changing the polarities of the mobile phase during the experiment. This interplay exhibits two primary patterns: the solvent facilitates the upward transit of solute via capillary action; concurrently, solvent molecules intervene by dislodging solute molecules from the stationary phase’s surface, attenuating their interactions (as illustrated in Fig. 3).

The two polarity indices distinctly capture the interaction dynamics between the solvent compound and the silica gel, as well as between the solute molecule and the silica gel. Importantly, our approach enables a quantitative assessment of the aforementioned interactions. In the following paragraphs, we will separately analyze these two indices.

Refer to caption
Figure 3: Illustration of the polarity indices and their impact on chromatographic behavior. (A) The input features corresponds to different polarity indices. Here, FG stands for functional group. (B) TLC experiment can be understood as a process where the stationary phase (silica gel) and the mobile phase (solvent) compete for the solute molecules. This competition’s outcome is reflected in the RfR_{f} value. The factors influencing the RfR_{f} value can be categorized into three types: interaction between the stationary phase and the solvent (I), interaction between the stationary phase and the solute (II), and interaction between the solvent and the solute (III). The two polarity indices (Ψ\Psi and ξ)\xi) characterize how solvent and solute impact the chromatographic behavior separately. (C) Illustration of interactions between the stationary phase and different functional groups.

Solute polarity index.

Considering that the dataset contains 387387 organic compounds, previous research normally involves high dimensional features and physicochemical descriptors to encode the solute molecule into numerical values. However, the former are often challenging for human understanding, while the latter lack structural information.

To carry out this study, we carefully selected 1616 representative functional groups and used their counts as features for the target molecules. Additionally, we characterized the distribution of these functional groups within the molecule by using three features: count of benzene (NBen), molecular skeleton descriptor (MSD), and dipole moment (DM). Although we were able to reduce the feature numbers to 1919, establishing a explainable connection between the polarity of the molecule and these input features remains a major challenge.

We distinguished two types of solute features: functional group (FG) counts and FG distribution, as they may have distinct effects on the polarity of a given molecule (Fig. 3A). For instance, a molecule with a higher count of polarity-functional groups but with a symmetric structure may exhibit lower polarity than molecules with fewer polarity-functional groups. Therefore, we further extracted two indices, namely FG distribution polarity index α\alpha and FG polarity index β\beta, to isolate these effects.

Despite the polarity of functional groups being of great interest, there is a lack of quantitative and scalable methods to compare the polarities of different functional groups; in current practice, chemists often rely mainly on qualitative analysis and their experience. Our FG polarity index provides the first empirical quantification of the impact of different functional groups on molecular polarity. Considering the 1616 functional groups involved in this study, the input features for the sub-model used to obtain the FG polarity index β\beta consist of 1616-dimensional vectors, with each dimension representing the number of the respective functional group. To quantify the polarity of a specific functional group FGi, we configured the input feature vector such that the ii-th position was set to 11 and the others to 0. As illustrated in Fig. 3B, the sub-model’s outputs reveal a distinct polarity order among the functional groups, i.e., amides (3.363.36) >> carboxylic acid (2.902.90) >> amine (2.812.81) >> alcoholic hydroxyl (2.212.21) >> phenolic hydroxyl (2.052.05) >> aldehyde (0.460.46) >> nitro (0.05-0.05) >> ketone (0.31-0.31) >> ester (0.72-0.72) >> fluorine (1.25-1.25)>> ether (1.59-1.59) >> methyl (1.74-1.74) >> bromine (2.04-2.04) >> cyano (2.20-2.20) >> chlorine (2.22-2.22) >> iodine (2.94-2.94).

Solvent polarity index.

Another important interaction is between the solvent compound and the silica gel. The solvent polarity index Ψ\Psi provides an analytical lens through which we are able to assess the influence of disparate solvents on the RfR_{f} values.

Methanol (MeOH) possesses one hydrogen bond acceptor (HBA) and one hydrogen bond donor (HBD), capable of forming stable six- and seven-membered chelates with the silica gel. It also boasts an HBA that can engage in intermolecular hydrogen bonding with the siloxane (Si-O-Si) bridges of the silica gel. Ethyl acetate (EA) harbors two HBAs that can establish an eight-membered ring complex with the silica gel. The oxygen atoms of the ester moiety can partake in two distinct types of intermolecular hydrogen bonds with the gel. Diethyl ether (Et2O) contains one HBA and is thus limited to a singular mode of intermolecular hydrogen bonding with the silica gel. Dichloromethane (DCM) is characterized by two chlorine atoms, which can form a relatively less stable eight-membered ring structure with the gel, and induce weak intermolecular hydrogen bonding. The interaction between n-hexane (Hex) molecules and the silica gel is considerably feeble, with negligible capacity to displace solute molecules. Consequently, under the paradigm of “like dissolves like”, we infer a polarity hierarchy as follows:

MeOH>EA>Et2O>DCM>Hex.{\rm MeOH}>{\rm EA}>{\rm Et}_{2}{\rm O}>{\rm DCM}>{\rm Hex}.

The input features of the sub-model used to obtain Ψ\Psi consist of five-dimensional vectors, with each dimension representing the volume percentage of a solvent compound. Consequently, variations in the compound ratio within the solvent system directly influence the value of Ψ\Psi, thereby changing the polarity of the solvent. Moreover, to isolate the effect of each compound, we can assign the value 11 to the for the volume percentage that corresponds to the target solvent compound, while setting the others to 0. This approach yields the following values (see Fig. 3B):

ΨHex=1.238,ΨDCM=0.392,ΨEt2O=0.534,ΨEA=0.733,ΨMeOH=5.124.\Psi_{{\rm Hex}}=-1.238,\quad\Psi_{{\rm DCM}}=-0.392,\quad\Psi_{{\rm Et}_{2}{\rm O}}=0.534,\quad\Psi_{{\rm EA}}=0.733,\quad\Psi_{{\rm MeOH}}=5.124.

As shown in Table 2, we also conducted a comparative analysis for our empirical solvent descriptor Ψ\Psi and experimental miscibility indices such as LogP and aqueous solubility (?). The observed numerical consistency and trends align with our preliminary postulates (see Table 2).

Table 2: Solvent properties and solvent polarity index Ψ\Psi.
Solvent HBD HBA Solubility in water (g/100g)  (?) LogP Ψ\Psi
Hex 0 0 0.01 3.42 -1.238
DCM 0 0 1.32 1.01 -0.392
Et2O 0 1 6.9 0.76 0.534
EA 0 2 7.7 0.29 0.733
MeOH 1 1 Miscible -0.27 5.124

The RfR_{f} governing equation

To comprehensively understand the relationship between the RfR_{f} value and the two polarity indices, we used the symbolic regression to formulate empirical mathematical expressions connecting the RfR_{f} value and the two polarity indices (Fig. 4A). The observed RfR_{f} values were obtained directly from our automatic high-throughput platform  (?), while the computed RfR_{f} values were calculated from the RfR_{f} governing equation (see Equation (1) below). The polarity indices can also be determined through a hierarchical equation system, based on the values of the input variables (Fig. 4A), a discussion of which will follow.

To ensure that the computed RfR_{f} values lie within the range from 0 to 11, a sigmoidal function was employed (see Fig. 4B). By using symbolic regression, one obtains several candidates for the RfR_{f} governing equation (see Table S1 in the Supplementary Materials). Since a simple formula is desired, and considering the prescribed fitting accuracy (RMSE =0.091=0.091, R2=0.93R^{2}=0.93), we selected

Rf=σ(3.48Ψ+3.08ξ+1.86),σ(x)=11+ex.R_{f}=\sigma\left(3.48\Psi+3.08\xi+1.86\right),\quad\sigma(x)=\frac{1}{1+e^{-x}}. (1)

In order to separately analyze the individual influence of solvent and solute on the RfR_{f} value, we further decomposed the formula for RfR_{f} in Equation (1) into h(Ψ)h(\Psi) and g(ξ)g(\xi). For instance, as illustrated in Fig. 4C, when considering a specific solute molecule (i.e., fixing a value for ξ\xi), we can reformulate Equation (1) as Rf=h(Ψ)=σ(3.48Ψ+C1)R_{f}=h(\Psi)=\sigma(3.48\Psi+C_{1}), where C1C_{1} depends on the chosen solute molecule. The pattern observed in Fig. 4C shows that RfR_{f} increases with Ψ\Psi. Since a larger value of Ψ\Psi indicates greater polarity, this finding emphasizes that the Ψ\Psi correlates to the polarity of the solvent. Furthermore, we observe that the shape of h(Ψ)h(\Psi)’s graph varies significantly for different solute molecules. In particular, Fig. 4C indicates that, within the context of our dataset, changing the solvent’s composition only has a subtle effect on RfR_{f} for solute molecules with extreme polarities. Similarly, when the solvent is fixed, Equation (1) can be reformulated as Rf=g(ξ)=σ(3.08ξ+C2)R_{f}=g(\xi)=\sigma(3.08\xi+C_{2}), where C2C_{2} depends on the chosen value for Ψ\Psi. By analysis similar to the previous case (fixed solute), the RfR_{f} value increases with ξ\xi. But now, a larger value of ξ\xi indicates a lower polarity. Moreover, we observe that the shape of g(ξ)g(\xi)’s graph is of the same S-shaped type across different solvent systems (see Fig. 4C).

Quantification of solvent polarity

In addition to characterizing the polarity of the eluent solvent, the UHiSR framework offers an explainable solution for understanding how different compounds influence the solvent polarity index Ψ\Psi. In this study, we consider five distinct solvent compounds: Hex, EA, DCM, MeOH and Et2O. To achieve the designed solvent polarity, chemists usually adjust the ratios of these compounds in the mobile phase system. By directly applying the SR algorithm, we obtained the empirical mathematical equation between solvent polarity index Ψ\Psi and its corresponding input variables, i.e., the volume ratio of Hex, DCM, Et2O, EA and MeOH in the mobile phase system. The following formula achieved high R2R^{2} of 0.9830.983 and a low RMSE of 0.0940.094 with a simple form,

Ψ=Hex+1.59EA0.411DCM+11.1MeOH+Et2O2+0.142.\Psi=-{\rm Hex}+1.59{\rm EA}-0.411{\rm DCM}+11.1{\rm MeOH}+{\rm Et}_{2}{\rm O}^{2}+0.142. (2)

Specifically, there are three mobile phase systems involved in the dataset, which are Hex/EA, Hex/Et2O and MeOH/DCM. Equation (2) quantifies the impact of the two compounds on solvent polarity within each system. For instance, in the presence of Hex and EA, the solvent polarity index is calculated as Ψ=Hex+1.59EA\Psi=-{\rm Hex}+1.59{\rm EA}. As the volume of EA increases, the solvent polarity exhibits a corresponding increase. Moreover, the constant term of 11.1 with MeOH indicates that even a slight increase/decrease in the volume of MeOH will lead to a significant increase/decrease in solvent polarity.

Quantification of solute polarity

As mentioned earlier, we introduce two additional functional group-related indices, FG distribution polarity index α\alpha and FG polarity index β\beta, to enhance our understanding of the relationship between the molecular polarity and the molecular structure. With the help of symbolic regression, we obtained an explicit mathematical relationship between ξ\xi and (α,β)(\alpha,\beta) as follows:

ξ=c1β+c2αeα+c3eα+c4α+c5eαβ+c6,\xi=c_{1}\beta+c_{2}\alpha e^{\alpha}+c_{3}e^{\alpha}+c_{4}\alpha+c_{5}e^{\alpha}\beta+c_{6}, (3)

where c1=0.257,c2=0.464,c3=1.093,c4=0.0246,c5=0.464,c6=0.058c_{1}=-0.257,c_{2}=-0.464,c_{3}=1.093,c_{4}=0.0246,c_{5}=0.464,c_{6}=-0.058. Breaking down Equation (3), the molecular polarity can be understood as a combination of the polarity of the functional groups within the molecule (f1(α)=c2αeα+c3eα+c4αf_{1}(\alpha)=c_{2}\alpha e^{\alpha}+c_{3}e^{\alpha}+c_{4}\alpha), the distribution of these functional groups (f2(β)=c1βf_{2}(\beta)=c_{1}\beta) and the interaction between these two factors (f3(α,β)=c5eαβ+c6f_{3}(\alpha,\beta)=c_{5}e^{\alpha}\beta+c_{6}).

Regarding the FG polarity index β\beta, which categorizes the polarity of different functional groups, we observe a distinct trend: an increase in β\beta corresponds to a decrease in the solute polarity index ξ\xi, consequently leading to a smaller RfR_{f} value (Fig. 4D). This signifies that a higher β\beta value is indicative of greater polarity. As for the FG distribution polarity index α\alpha, we observe an exponential growth of ξ\xi when α\alpha increases. However, α\alpha is a composite element that includes the functional group distribution within- and the symmetry of the target molecule. In Fig. 4E, we provide 1010 example molecules with distinct α\alpha values.

Hierarchical equation system

UHiSR has the capability to reveal the relationship between the FG distribution polarity index α\alpha and its corresponding input variables, as well as between the FG polarity index β\beta and the counts of 1616 functional groups. Consequently, we can establish a hierarchical equation system that fully elucidates the mathematical relationship between RfR_{f} and all input variables, as presented in Table 3.

Table 3: Optimal hierarchical equation systems corresponding to RfR_{f} and input variables.
Equation R2R^{2} RMSE
Rf=σ(3.48Ψ+3.08ξ+1.86),σ(x)=1/(1+ex)R_{f}=\sigma\left(3.48\Psi+3.08\xi+1.86\right),\quad\sigma(x)=1/(1+e^{-x}) 0.9300.930 0.0910.091
Ψ=Hex+1.59EA0.411DCM+11.1MeOH+Et2O2+0.142\Psi=-{\rm Hex}+1.59{\rm EA}-0.411{\rm DCM}+11.1{\rm MeOH}+{\rm Et}_{2}{\rm O}^{2}+0.142 0.9830.983 0.0940.094
ξ=0.232β0.232(eα0.0531)(2α+2β+4.71)\xi=-0.232\beta-0.232\left(e^{\alpha}-0.0531\right)\left(-2\alpha+2\beta+4.71\right) 0.8410.841 0.2700.270
α=2NBen(DM+0.412)1.33NBen(DM+0.412)MSD0.04670.743\alpha=-2{\rm NBen}\left({\rm DM}+0.412\right)-\frac{1.33{\rm NBen}\left({\rm DM}+0.412\right)}{{\rm MSD}-0.0467}-0.743 0.8980.898 0.630.63
β=0.218γ1/γ26+0.413γ2+0.435γ40.435γ5+0.0223(γ3+γ4)2\beta=-0.218\gamma_{1}/\gamma_{2}^{6}+0.413\gamma_{2}+0.435\gamma_{4}-0.435\gamma_{5}+0.0223\left(\gamma_{3}+\gamma_{4}\right)^{2} +0.493~{}\quad~{}\quad~{}\quad~{}+0.493 0.8100.810 0.9340.934
γ1=3.09CtAmides3.91CtCO2H+1.87\gamma_{1}=-3.09{\rm CtAmides}-3.91{\rm CtCO}_{2}{\rm H}+1.87 0.9970.997 0.1520.152
γ2=CtRNH2+2CtOH1.76CtPhenol(1.76CtRNH2)\gamma_{2}=-{\rm CtRNH}_{2}+2{\rm CtOH}-1.76{\rm CtPhenol}\left(1.76-{\rm CtRNH}_{2}\right) +(CtRNH22+CtOHCtPhenol+0.912)2~{}\quad~{}\quad~{}\quad~{}~{}+\left(-{\rm CtRNH}_{2}^{2}+{\rm CtOH}-{\rm CtPhenol}+0.912\right)^{2} 0.9990.999 0.2040.204
γ3=CtNO2(CtRCO2R23.65)+0.762\gamma_{3}={\rm CtNO}_{2}\left(-{\rm CtRCO}_{2}{\rm R}^{2}-3.65\right)+0.762 0.9610.961 0.4130.413
γ4=CtAldehyde3CtAldehyde2(CtR2C=OCtF)2\gamma_{4}={\rm CtAldehyde}^{3}-{\rm CtAldehyde}^{2}\left({\rm CtR}_{2}{\rm C}\!\!=\!\!{\rm O}-{\rm CtF}\right)^{2} 3CtAldehyde+CtR2C=O2+2CtR2C=O+4CtCN~{}\quad~{}\quad~{}\quad~{}~{}-3{\rm CtAldehyde}+{\rm CtR}_{2}{\rm C}\!\!=\!\!{\rm O}^{2}+2{\rm CtR}_{2}{\rm C}\!\!=\!\!{\rm O}+4{\rm CtCN} 2CtF+eCtF(CtROR2CtF)20.746~{}\quad~{}\quad~{}\quad~{}~{}-2{\rm CtF}+e^{{\rm CtF}-\left({\rm CtROR}-2{\rm CtF}\right)^{2}}-0.746 0.9610.961 0.4460.446
γ5=(CtCl+2CtI)2+(CtBrCtBr0.305+CtMethyl)2\gamma_{5}=\left({\rm CtCl}+2{\rm CtI}\right)^{2}+\left(\frac{{\rm CtBr}}{{\rm CtBr}-0.305}+{\rm CtMethyl}\right)^{2} 0.9310.931 0.8050.805
Refer to caption
Figure 4: Visualization of the latent variables and the decomposition of the retrieved formula. (A)Visualization of the observed and calculated RfR_{f} values with two polarity indices Ψ\Psi and ξ\xi. (B) Fitting observed Rf values with a Sigmoid function. (C) Decomposition of Equation (1) into h(Ψ)h(\Psi) and g(ξ)g(\xi). (D) Decomposition of Equation (3) into f1(α)f_{1}(\alpha) and f2(β)f_{2}(\beta). (E) FG distribution polarity index α\alpha of 10 example compounds.

Discussion

Recent methods to identify quantitative structure-polarity relationships usually include various machine learning models and learning informative feature representations  (?, ?). Despite achieving superior prediction accuracy, these models or feature representations often lack interpretability, hindering a deeper understanding the underlying relationships within the dataset. One strategy to enhance interpretability is to articulate explicit formulas between the output and input variables. Symbolic regression emerges as an effective approach for unveiling complex relationships within datasets. However, feasibility and interpretability are challenged when dealing with a large number of variables, as the performance significantly decreases while the complexity of the discovered formulas increases drastically.

In this work, we introduce the integration of hierarchical neural networks and symbolic regression, aiming to enhance the interpretability of the model. We propose, for the first time, two important polarity indices, the solvent polarity index Ψ\Psi and the solute polarity index ξ\xi. These two polarity indices not only provide interpretable insights, but also enable direct RfR_{f} prediction through a formula-driven approach. Compared to high-dimensional feature representations, as we only use few indices (two) that are strongly tied to the given dataset, our approach offers increased ease of understanding and better alignment with the underlying task and the dataset. Additionally, the predictive accuracy is comparable to that of currently used DNN models. Besides, the data-driven discovery of FG distribution polarity index and FG polarity index facilitates a systematic exploration of our 1919-dimensional solute feature space, with each index focusing on different aspects (counts and distribution of functional groups).

The integration of symbolic regression offers an equation-based model, which has several advantages compared with traditional machine learning models. First, equation-based models have extremely low computational cost, making it an ideal choice in resource-constrained environments. Second, the equation-based model usually has only few parameters, especially compared to DNN models, increasing its transferablility and robustness. Due to these advantages, there are potential applications of equation-based models in edge computing. Moreover, unlike traditional neural network models that are deployed in centralized systems and require substantial data transfer for each new scenario, equation-based models offer much cheaper adaptability. This opens up new possibilities for applications in areas such as automated experimental platforms.

As final remark, we have focused only on examples from the chemistry community; however, building interpretable frameworks, that facilitate human understanding of the employed models, can be of general importance for science applications.

Materials and Methods

TLC Dataset

Several variables contribute to the variation in RfR_{f} values, including factors such as the nature of the stationary phase, temperature, humidity. Traditional methods of obtaining TLC data manually suffer from a lack of standardization, making scalability challenging. To address this issue, our prior work  (?) introduced a robotic platform designed for high-throughput collection of TLC data (Fig. 4A), which not only ensures efficiency in data acquisition but also establishes a standardized and scalable methodology.

The TLC dataset comprises in total 4,9444,944 measurements of the RfR_{f} values, involving 387387 organic compounds and three mobile phase systems. These mobile phase systems are n-Hexane/Ethyl acetate, Diethyl ether/n-Hexane and Methanol/Dichloromethane systems, with a total of 1717 different solvent compositions.

Refer to caption
Figure 5: Hierarchical structure of learning latent variables. Three stages are organized from top to bottom. (A) At the first stage, two latent variables, Ψ\Psi and ξ\xi, are generated to encapsulate the overall polarity of the solvent and the solute, respectively. (B) At the second stage, two additional latent variables, α\alpha and β\beta, are learned to assess the solute molecule’s polarity from two distinct perspectives. α\alpha is linked to the distribution of functional groups, while β\beta pertains to the quantity of individual functional groups. (C) The third stage characterizes five latent variables, each representing the impact of specific groups of functional groups.

Framework of UHiSR.

To address the inherent complexity imbalance between solvent and solute compounds, our framework, unsupervised hierarchical symbolic regression (UHiSR), is strategically organized hierarchically in a top-to-down manner. It is motivated by the rationale grounded in established practices observed in TLC experiments. As illustrated in Fig. 5, each stage focuses on specific aspects of the feature space, allowing the model to learn different latent variables which discern patterns at different levels of granularity.

For instance, the top stage provides a broad overview of the interaction between solvent and solute, capturing their global polarity characteristics. Subsequent stages then delve into more detailed and specific characteristics of the solute molecular structure, offering intricate insights regarding the contributions of different molecular features to the overall molecular polarity. As illustrated in Fig. 1C, the framework contains three key parts: feature engineering, hierarchical neural network and symbolic regressions. In this subsection, we introduce the algorithms and methods.

Feature engineering.

Organic molecules are composed of various functional groups arranged along their molecular skeletons. The arrangement of these groups influences the polarity of a given molecule. Organic chemists typically assess the polarity of target molecules by considering the type, number, and distribution of functional groups. Here, we focused on 1616 representative functional groups (Fig. 1C). Besides the type and number of functional groups, their spatial distribution is also crucial for molecular polarity. To account for this, we included additional descriptors such as the number of aromatic rings, dipole moments, and molecular skeleton descriptors (detailed in the Supplementary Materials).

In terms of features related to the solvent, we considered the volume ratios (V%V\%) of five different solvent compounds, i.e., n-Hexane, Ethyl acetate, Dichloromethane, Diethyl ether and Methanol.

To evaluate the representation power of the selected features, we conducted a comparative analysis. This analysis employed the XGBoost model and examined three different sets of features. Hereto, like in previous studies  (?), we primarily focus on solute molecular features. To maintain consistency, the features related to the eluent solvent were kept the same across all experiments. Feature group A consists of 1919-dimensional solute features (Table 1). Feature group B encompasses 167167-dimensional features, containing 167167-dimensional MACCS keys for describing molecular structure. Feature group C contains 88-dimensional features, focusing on conventional physicochemical descriptors. Specifically, these descriptors include molecular weight (MW), topological polar surface area (TPSA), the logarithm of nn-octanol–water partition coefficient (LogP), hydrogen bond donor number (HBD), hydrogen bond acceptor number (HBA), molecular polarity index (MPI), polar surface area percentage (PSA) and dipole moment (DM).

Architecture design of hierarchical neural network.

The architecture of our network is motivated by the rationale grounded in established practices observed in TLC experiments. TLC experiments conventionally entail the execution of controlled experiments designed to meticulously dissect the influence of solute polarity and solvent polarity on RfR_{f}. In this context, we modified conventional Multi-Layer Perceptron (MLP) architecture to meet the specific demand of our task. Specifically, given an input sample 𝐱𝐑d\mathbf{x}\in\mathbf{R}^{d}, the initial step involves the classification of all features into kk clusters, denoted as 𝐱1,,𝐱k\mathbf{x}_{1},\dots,\mathbf{x}_{k}. Here, 𝐱i𝐑di\mathbf{x}_{i}\in\mathbf{R}^{d_{i}} and their dimensions sum up to dd, i.e., i=1kdi=d\sum_{i=1}^{k}d_{i}=d. To enhance interpretability of these clusters, we incorporated a chemists-guided strategy, and the clusters used at each stage are demonstrated in Fig. 5.

Then, a key innovation lies in the formulation of the target latent representation layer jj, where zi(j)=gi(𝐱i)z^{(j)}_{i}=g_{i}(\mathbf{x}_{i}). Each function gig_{i} operates independently in the modified MLP, in the sense that there is no parameter sharing across g1,,gkg_{1},\dots,g_{k}. Thus, we can consider gig_{i} as a sub-model. This design ensures the preservation of feature-specific information. The final output o=h(𝐳(j))o=h(\mathbf{z}^{(j)}) encapsulates the information from the target latent representation layer.

This design restricts the receptive field of the latent variables, therefore ensuring that each latent variable solely depends on a specific group of input features. For instance, in the first stage, we considered all features and categorized them into two distinct clusters, solvent features and solute features. This categorization aligns with practical considerations where the polarity of the solvent and the solute is often analyzed separately. Consequently, the learned latent variables in this layer can be regarded as the solvent polarity index and the solute polarity index, respectively. The connections between the latent variables and the output layer capture the nonlinear relationship inherent in the data. This approach enables the model to learn informative latent variables in an unsupervised manner, which disentangles and highlights the distinct contributions of different groups of input features.

Symbolic regression.

SR aims to identify mathematical expressions that capture the inherent relationships within a given dataset. Unlike approaches that involve fitting parameters to overly complex general models (such as machine learning models), SR explores a space of simple analytic expressions. The goal is to discover accurate and interpretable models that directly represent the underlying patterns in the data. SR is usually implemented by evolutionary algorithms, such as genetic programming (GP)  (?, ?, ?).

The most common way to visualize a symbolic expression is a tree-structure with nodes and branches, as shown in Fig. 4A. This structure includes primitive functions (e.g., +,,×,÷,exp)+,-,\times,\div,\exp) and terminal nodes (input features and numeric constants). Through a series of mutation operations, the GP algorithm seeks to determine the optimal number of nodes and terminals that provide the best fit to a given dataset. For our implementation of the widely used GP-SR algorithm, we employed the open-source Python library PySR111https://github.com/MilesCranmer/PySR  (?). The hyperparameter configurations for our implementation are detailed in the Supplementary Materials.

It is worth mentioning that our method is not limited to GP algorithms. In fact, the UHiSR framework is designed to accommodate various symbolic regression methods; comparable results obtained from deep reinforcement learning based SR methods (such as DISCOVER  (?)) are included in the Supplementary Materials.

Experimental settings and training details

When constructing the hierarchical neural network, at each stage we set the number of neurons in the hidden layers of each sub-model to 5050. Furthermore, two hidden layers were employed. To facilitate the discovery of concise equations, we leaned towards employing a relatively simple network structure, which helps to prevent overfitting to experimental noise. The activation function chosen for each sub-model was LeakyReLU. In the first stage, the activation function of the output layer was set to a sigmoid, which ensures a standardized output within the range of 0 and 11 (Fig. 4B). The batch size was set to 20482048, and the Adam optimizer implemented in PyTorch was applied. The learning rate was 0.010.01. The dataset was randomly divided into 80/10/1080/10/10 for train/valid/test splits. Each hierarchical neural network was trained for 10001000 epochs, then the best validation checkpoint was selected for testing. To assess the prediction performance, we employed two key metrics: the RR-squared coefficient (R2R^{2}) and the root mean square error (RMSE), which are defined as

R2=1i=1N(yiy^i)2i=1N(yiy¯)2,RMSE=i=1N(yiy^i)2N,R^{2}=1-\frac{\sum_{i=1}^{N}(y_{i}-\hat{y}_{i})^{2}}{\sum_{i=1}^{N}(y_{i}-\bar{y})^{2}},\quad{\rm RMSE}=\sqrt{\frac{\sum_{i=1}^{N}(y_{i}-\hat{y}_{i})^{2}}{N}}, (4)

where NN is the total number of samples, yiy_{i} and y^i\hat{y}_{i} represent the label and the prediction value for the ii-th sample, and y¯\bar{y} denotes the mean of the ground truth values.

References

  • 1. G. M. Sessler, “Physical principles of electrets” in Electrets (Springer, Berlin, Heidelberg, 1987), pp. 13–80.
  • 2. A. R. Katritzky, D. C. Fara, H. Yang, K. Tämm, T. Tamm, M. Karelson, Quantitative measures of solvent polarity, Chem. Rev. 104, 175-198 (2004).
  • 3. K. Ciura, S. Dziomba, J. Nowakowska, M. J. Markuszewski, Thin layer chromatography in drug discovery process, J. Chromatogr. A 1520, 9-22 (2017).
  • 4. C. F. Poole, Thin-layer chromatography: challenges and opportunities, J. Chromatogr. A 1000, 963-984 (2003).
  • 5. A. F. Zahrt, J. J. Henle, B. T. Rose, Y. Wang, W. T. Darrow, S. E. Denmark, Prediction of higher-selectivity catalysts by computer-driven workflow and machine learning, Science 363, eaau5631 (2019).
  • 6. E. N. Muratov, J. Bajorath, R. P. Sheridan, I. V. Tetko, D. Filimonov, V. Poroikov, T. I. Oprea, I. I. Baskin, A. Varnek, A. Roitberg, et al., Qsar without borders, Chem. Soc. Rev. 49, 3525–3564 (2020).
  • 7. C. B. Wahl, M. Aykol, J. H. Swisher, J. H. Montoya, S. K. Suram, C. A. Mirkin, Machine learning–accelerated design and synthesis of polyelemental heterostructures, Sci. Adv. 7, eabj5505 (2021).
  • 8. N. I. Rinehart, R. K. Saunthwal, J. Wellauer, A. F. Zahrt, L. Schlemper, A. S. Shved, R. Bigler, S. Fantasia, S. E. Denmark, A machine-learning tool to predict substrate-adaptive conditions for pd-catalyzed c–n couplings, Science 381, 965-972 (2023).
  • 9. Q. Yang, Y. Li, J.-D. Yang, Y. Liu, L. Zhang, S. Luo, J.-P. Cheng, Holistic prediction of the pka in diverse solvents based on a machine-learning approach, Angew. Chem. Int. Ed. 59, 19282-19291 (2020).
  • 10. Y. Yao, Q. Dong, A. Brozena, J. Luo, J. Miao, M. Chi, C. Wang, I. G. Kevrekidis, Z. J. Ren, J. Greeley, G. Wang, A. Anapolsky, L. Hu, High-entropy nanoparticles: Synthesis-structure-property relationships and data-driven discovery, Science 376, eabn3103 (2022).
  • 11. B. A. Koscher, R. B. Canty, M. A. McDonald, K. P. Greenman, C. J. McGill, C. L. Bilodeau, W. Jin, H. Wu, F. H. Vermeire, B. Jin, T. Hart, T. Kulesza, S.-C. Li, T. S. Jaakkola, R. Barzilay, R. Gómez-Bombarelli, W. H. Green, K. F. Jensen, Autonomous, multiproperty-driven molecular discovery: From predictions to measurements and back, Science 382, eadi1407 (2023).
  • 12. Łukasz Komsta, A functional-based approach to the retention in thin layer chromatographic screening systems, Anal. Chim. Acta 629, 66-72 (2008).
  • 13. S. Yousefinejad, F. Honarasa, N. Saeed, Quantitative structure–retardation factor relationship of protein amino acids in different solvent mixtures for normal-phase thin-layer chromatography, J. Sep. Sci. 38, 1771–1776 (2015).
  • 14. H. Xu, J. Lin, Q. Liu, Y. Chen, J. Zhang, Y. Yang, M. C. Young, Y. Xu, D. Zhang, F. Mo, High-throughput discovery of chemical structure-polarity relationships combining automation and machine-learning techniques, Chem 8, 3202-3214 (2022).
  • 15. H. Xu, D. Zhang, F. Mo, High-throughput automated platform for thin layer chromatography analysis, STAR protocols 3, 101893 (2022).
  • 16. L. Billard, E. Diday, From the statistics of data to the statistics of knowledge: symbolic data analysis, J. Am. Stat. Assoc. 98, 470–487 (2003).
  • 17. I. Arnaldo, K. Krawiec, U.-M. O’Reilly, paper presented at the 2014 Annual Conference on Genetic and Evolutionary Computation, Vancouver, BC, Canada, 12 July 2014.
  • 18. M. Quade, M. Abel, K. Shafi, R. K. Niven, B. R. Noack, Prediction of dynamical systems by symbolic regression, Phys. Rev. E 94, 012214 (2016).
  • 19. S. L. Brunton, J. L. Proctor, J. N. Kutz, Discovering governing equations from data by sparse identification of nonlinear dynamical systems, Proc. Natl. Acad. Sci. U.S.A. 113, 3932–3937 (2016).
  • 20. B. K. Petersen, M. Landajuela, T. N. Mundhenk, C. P. Santiago, S. Kim, J. T. Kim, paper presented at the 9th International Conference on Learning Representations, Virtual Event, Austria, 3 May 2021.
  • 21. M. Cranmer, A. Sanchez-Gonzalez, P. Battaglia, R. Xu, K. Cranmer, D. Spergel, S. Ho, paper presented at the 34th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 06 December 2020.
  • 22. S.-M. Udrescu, M. Tegmark, Ai feynman: A physics-inspired method for symbolic regression, Sci. Adv. 6, eaay2631 (2020).
  • 23. F. Sun, Y. Liu, J.-X. Wang, H. Sun, paper presented at the 11th International Conference on Learning Representations, Kigali, Rwanda, 1 May 2023.
  • 24. J. Zeng, H. Xu, Y. Chen, D. Zhang, Deep learning discovery of macroscopic governing equations for viscous gravity currents from microscopic simulation data, Comput. Geosci. pp. 1–14 (2023).
  • 25. A. E. Allen, A. Tkatchenko, Machine learning of material properties: Predictive and interpretable multilinear models, Sci. Adv. 8, eabm7185 (2022).
  • 26. J. L. Durant, B. A. Leland, D. R. Henry, J. G. Nourse, Reoptimization of mdl keys for use in drug discovery, J. Chem. Inf. Model. 42, 1273–1280 (2002).
  • 27. J. Sherma, B. Fried, Handbook of thin-layer chromatography (CRC, Eastern, Pennsylvania, 2003).
  • 28. I. Smallwood, Handbook of organic solvent properties (Butterworth-Heinemann, Oxford, 2012).
  • 29. M. Cranmer, Interpretable machine learning for science with pysr and symbolicregression. jl, arXiv preprint arXiv:2305.01582 (2023).
  • 30. M. Du, Y. Chen, D. Zhang, Discover: Deep identification of symbolic open-form pdes via enhanced reinforcement-learning, arXiv preprint arXiv:2210.02181 (2022).

Acknowledgments

We thank the High Performance Computing Platform of Peking University for machine learning model training. We also thank Mengge Du for assisting in the implementation of DISCOVER.

Funding:

National Natural Science Foundation of China No.22071004
National Natural Science Foundation of China No.21933001
National Natural Science Foundation of China No.22150013
National Natural Science Foundation of China No.62106116

Author contributions:

Conceptualization: YC, FM
Methodology: SL, CL, YC, FM
Investigation: SL, CL
Visualization: SL, CL
Supervision: YC, FM
Writing-original draft: SL, CL
Writing-review & editing: SL, YC, FM

Competing interests:

Authors declare that they have no competing interests.

Data and materials availability:

The dataset utilized in this manuscript is accessible at the following link: https://github.com/SiyuLou/UnsupervisedHierarchicalSymbolicRegression.
All original code for the experiments and analysis in this work has been deposited at the website https://github.com/SiyuLou/UnsupervisedHierarchicalSymbolicRegression.

Supplementary Materials

Candidate equations for the hierarchical equation system

We present the candidate equations for each the hierarchical equation system, along with the corresponding R2R^{2} and RMSE values on the test dataset as reported in the Table S1-S10. Symbolic regression algorithms often face a trade-off between simplicity and fitting performance. Considering the noise in real-world datasets, we prioritize simpler equations, such as choosing Rf=σ(3.48Ψ+3.08ξ+1.86)R_{f}=\sigma(3.48\Psi+3.08\xi+1.86) as the RfR_{f} governing equation, even if there is a slight loss in fitting accuracy. On one hand, concise equations are easier to understand; on the other hand, they help avoid overfitting to noise, making the model more robust.

Table S1: Candidate equations for Rf(Ψ,ξ)R_{f}\sim(\Psi,\xi).
Equation R2R^{2} RMSE
Rf=σ(3.48Ψ+3.08ξ+1.86),σ(x)=1/(1+ex)R_{f}=\sigma(3.48\Psi+3.08\xi+1.86),\quad\sigma(x)=1/(1+e^{-x}) 0.9300.930 0.0910.091
Rf=σ(3.48Ψ+0.696ξ2+3.48ξ+1.86),σ(x)=1/(1+ex)R_{f}=\sigma(3.48\Psi+0.696\xi^{2}+3.48\xi+1.86),\quad\sigma(x)=1/(1+e^{-x}) 0.9310.931 0.0900.090
Rf=σ(3.48Ψ+ξ4Ψ2+1.44ξ2+3.48ξ+1.86),σ(x)=1/(1+ex)R_{f}=\sigma(3.48\Psi+\frac{\xi^{4}}{\Psi^{2}+1.44\xi^{2}}+3.48\xi+1.86),\quad\sigma(x)=1/(1+e^{-x}) 0.9330.933 0.0880.088
Rf=σ(3.48Ψ+ξ48.40Ψ4+1.44ξ2+3.48ξ+1.86),σ(x)=1/(1+ex)R_{f}=\sigma(3.48\Psi+\frac{\xi^{4}}{8.40\Psi^{4}+1.44\xi^{2}}+3.48\xi+1.86),\quad\sigma(x)=1/(1+e^{-x}) 0.9340.934 0.0880.088
Rf=σ(3.42Ψ+3.48ξ+1.86+2Ψ3ξ2+Ψξ2+1.01ξ46.33Ψ4+1.44ξ2),σ(x)=1/(1+ex)R_{f}=\sigma(3.42\Psi+3.48\xi+1.86+\frac{2\Psi^{3}\xi^{2}+\Psi\xi^{2}+1.01\xi^{4}}{6.33\Psi^{4}+1.44\xi^{2}}),\quad\sigma(x)=1/(1+e^{-x}) 0.9340.934 0.0880.088
Table S2: Candidate equations for Ψ(\Psi\sim(Hex, EA, DCM, MeOH, Et2O)).
Equation R2R^{2} RMSE
Ψ=Hex+2EA0.259DCM+12.5MeOH+Et2O\Psi=-{\rm Hex}+2{\rm EA}-0.259{\rm DCM}+12.5{\rm MeOH}+{\rm Et}_{2}{\rm O} 0.9600.960 0.1440.144
Ψ=Hex+1.59EA0.411DCM+11.1MeOH+Et2O2+0.142\Psi=-{\rm Hex}+1.59{\rm EA}-0.411{\rm DCM}+11.1{\rm MeOH}+{\rm Et}_{2}{\rm O}^{2}+0.142 0.9830.983 0.0940.094
Ψ=Hex+1.59EA+1.55DCM2MeOH0.102EA0.411DCM+Et2O2+0.141\Psi=-{\rm Hex}+1.59{\rm EA}+\frac{1.55{\rm DCM}^{2}{\rm MeOH}}{0.102-{\rm EA}}-0.411{\rm DCM}+{\rm Et}_{2}{\rm O}^{2}+0.141 0.9870.987 0.0820.082
Ψ=Hex+1.56EA+1.31DCM2MeOHEt2O+0.08740.436DCM+Et2O2+0.145\Psi=-{\rm Hex}+1.56{\rm EA}+\frac{1.31{\rm DCM}^{2}{\rm MeOH}}{{\rm Et}_{2}{\rm O}+0.0874}-0.436{\rm DCM}+{\rm Et}_{2}{\rm O}^{2}+0.145 0.9880.988 0.0790.079
Ψ=Hex+1.52EA+21.1DCM3MeOH(Hex+DCM)20.477DCMMeOH\Psi=-{\rm Hex}+1.52{\rm EA}+21.1{\rm DCM}^{3}{\rm MeOH}\left({\rm Hex}+{\rm DCM}\right)^{2}-0.477{\rm DCM}-{\rm MeOH} +Et2O2+0.149~{}\quad\quad+{\rm Et}_{2}{\rm O}^{2}+0.149 0.9890.989 0.0740.074
Table S3: Candidate equations for ξ(α,β)\xi\sim(\alpha,\beta).
Equation R2R^{2} RMSE
ξ=0.367+β+1.89α1.47\xi=0.367+\frac{\beta+1.89}{\alpha-1.47} 0.8270.827 0.280.28
ξ=0.232β0.232(eα0.0531)(2α+2β+4.71)\xi=-0.232\beta-0.232\left(e^{\alpha}-0.0531\right)\left(-2\alpha+2\beta+4.71\right) 0.8410.841 0.2700.270
ξ=0.232β0.232(eα0.0531)(3α+α4β2+2β+4.71)\xi=-0.232\beta-0.232\left(e^{\alpha}-0.0531\right)\left(-3\alpha+\frac{\alpha}{4\beta^{2}}+2\beta+4.71\right) 0.8580.858 0.2550.255
ξ=0.232β0.232(eα0.0531)(3α+2β+(4.70α+4.70β)eβ2+4.71)\xi=-0.232\beta-0.232\left(e^{\alpha}-0.0531\right)\left(-3\alpha+2\beta+\left(4.70\alpha+4.70\beta\right)e^{-\beta^{2}}+4.71\right) 0.8750.875 0.2380.238
ξ=0.232β0.232(eα0.0531)(3α+2β+(4.72α+4.72β)e(β0.160)2+4.71)\xi=-0.232\beta-0.232\left(e^{\alpha}\!-\!0.0531\right)\left(-3\alpha+2\beta+\left(4.72\alpha+4.72\beta\right)e^{-\left(\beta-0.160\right)^{2}}\!+\!4.71\right) 0.8780.878 0.2360.236
Table S4: Candidate equations for α(\alpha\sim( NBen, MSD, DM )).
Equation R2R^{2} RMSE
α=2NBen1.33NBenMSD0.0467DM+0.4120.743\alpha=\frac{-2{\rm NBen}-\frac{1.33{\rm NBen}}{{\rm MSD}-0.0467}}{{\rm DM}+0.412}-0.743 0.8980.898 0.630.63
α=2NBen1.82(NBen+MSD)MSD2+1.12DM+0.4050.529\alpha=\frac{-2{\rm NBen}-\frac{1.82\left({\rm NBen}+{\rm MSD}\right)}{{\rm MSD}^{2}+1.12}}{{\rm DM}+0.405}-0.529 0.9070.907 0.6040.604
α=2NBen+(NBen+MSD)(0.147MSD1.10)MSD+0.0978DM+0.4050.512\alpha=\frac{-2{\rm NBen}+\frac{\left({\rm NBen}+{\rm MSD}\right)\left(0.147{\rm MSD}-1.10\right)}{{\rm MSD}+0.0978}}{{\rm DM}+0.405}-0.512 0.9120.912 0.5870.587
α=2NBen+(NBen+MSD)(0.147MSD1.10)MSD+0.0978DM+0.4050.512\alpha=\frac{-2{\rm NBen}+\frac{\left({\rm NBen}+{\rm MSD}\right)\left(0.147{\rm MSD}-1.10\right)}{{\rm MSD}+0.0978}}{{\rm DM}+0.405}-0.512 0.9220.922 0.5520.552
α=2NBen+(NBen2+(DM+0.386)(NBen2+NBen+MSD))(0.153MSD+0.153DM1.27)MSD+0.327DM+0.4050.529\alpha=\frac{-2{\rm NBen}+\frac{\left({\rm NBen}^{2}+\left({\rm DM}+0.386\right)\left(-{\rm NBen}^{2}+{\rm NBen}+{\rm MSD}\right)\right)\left(0.153{\rm MSD}+0.153{\rm DM}-1.27\right)}{{\rm MSD}+0.327}}{{\rm DM}+0.405}-0.529 0.9230.923 0.5490.549
Table S5: Candidate equations for β(γ1,γ2,γ3,γ4,γ5)\beta\sim(\gamma_{1},\gamma_{2},\gamma_{3},\gamma_{4},\gamma_{5}).
Equation R2R^{2} RMSE
β=(γ1+γ5)(0.0212γ3γ40.441)+e0.238γ2+0.238γ40.488\beta=\left(\gamma_{1}+\gamma_{5}\right)\left(0.0212\gamma_{3}\gamma_{4}-0.441\right)+e^{0.238\gamma_{2}+0.238\gamma_{4}}-0.488 0.7740.774 1.021.02
β=0.218γ1γ26+0.412γ2+0.412γ40.412γ5+0.0212(γ3+γ4)2+0.333\beta=-\frac{0.218\gamma_{1}}{\gamma_{2}^{6}}+0.412\gamma_{2}+0.412\gamma_{4}-0.412\gamma_{5}+0.0212\left(\gamma_{3}+\gamma_{4}\right)^{2}+0.333 0.8040.804 0.9500.950
β=0.218γ1γ26+0.413γ2+0.435γ40.435γ5+0.0223(γ3+γ4)2+0.493\beta=-\frac{0.218\gamma_{1}}{\gamma_{2}^{6}}+0.413\gamma_{2}+0.435\gamma_{4}-0.435\gamma_{5}+0.0223\left(\gamma_{3}+\gamma_{4}\right)^{2}+0.493 0.8100.810 0.9340.934
β=0.241γ1γ26+0.367\beta=-\frac{0.241\gamma_{1}}{\gamma_{2}^{6}}+0.367 +0.241(0.0933γ10.0933γ31.52)(γ2γ4+γ50.0820(γ3+γ4)2)~{}\quad\quad+0.241\left(-0.0933\gamma_{1}-0.0933\gamma_{3}-1.52\right)\left(-\gamma_{2}-\gamma_{4}+\gamma_{5}-0.0820\left(\gamma_{3}+\gamma_{4}\right)^{2}\right) 0.8230.823 0.9040.904
β=0.241γ1γ26+0.367\beta=-\frac{0.241\gamma_{1}}{\gamma_{2}^{6}}+0.367 +0.241(0.150γ10.150γ31.52)(γ2γ4+γ50.0784(γ3+γ4)20.501)~{}\quad\quad+0.241\left(-0.150\gamma_{1}-0.150\gamma_{3}-1.52\right)\left(-\gamma_{2}-\gamma_{4}+\gamma_{5}-0.0784\left(\gamma_{3}+\gamma_{4}\right)^{2}-0.501\right) 0.8310.831 0.8830.883
Table S6: Candidate equations for γ1(\gamma_{1}\sim(CtAmides, CtCO2H)).
Equation R2R^{2} RMSE
γ1=3.26CtAmides3.26CtCO2H+1.87\gamma_{1}=-3.26{\rm CtAmides}-3.26{\rm CtCO}_{2}{\rm H}+1.87 0.9890.989 0.1520.152
γ1=3.09CtAmides3.91CtCO2H+1.87\gamma_{1}=-3.09{\rm CtAmides}-3.91{\rm CtCO}_{2}{\rm H}+1.87 0.9970.997 0.0740.074
γ1=2CtAmidesCtCO2H(CtAmides+CtCO2H+2.66)+1.86eCtAmides2\gamma_{1}=-2{\rm CtAmides}-{\rm CtCO}_{2}{\rm H}\left({\rm CtAmides}+{\rm CtCO}_{2}{\rm H}+2.66\right)+1.86e^{-{\rm CtAmides}^{2}} 0.9990.999 0.0320.032
γ1=2CtAmidesCtCO2H(CtAmides+CtCO2H+1.57)\gamma_{1}=-2{\rm CtAmides}-{\rm CtCO}_{2}{\rm H}\left({\rm CtAmides}+{\rm CtCO}_{2}{\rm H}+1.57\right) +1.86eCtAmidesCtCO2H~{}\quad~{}\quad+1.86e^{-{\rm CtAmides}-{\rm CtCO}_{2}{\rm H}} 0.9990.999 0.0350.035
γ1=2CtAmides0.617CtCO2H23.15CtCO2H\gamma_{1}=-2{\rm CtAmides}-0.617{\rm CtCO}_{2}{\rm H}^{2}-3.15{\rm CtCO}_{2}{\rm H} +1.87eCtAmides(CtAmides+CtCO2H)~{}\quad~{}\quad+1.87e^{-{\rm CtAmides}\left({\rm CtAmides}+{\rm CtCO}_{2}{\rm H}\right)} 1.0001.000 0.0120.012
Table S7: Candidate equations for γ2(\gamma_{2}\sim(CtRNH2, CtOH, CtPhenol)).

. Equation R2R^{2} RMSE γ2=1.86(0.734CtRNH22+0.734CtOH0.734CtPhenol+1)2eCtPhenol\gamma_{2}=1.86\left(-0.734{\rm CtRNH}_{2}^{2}+0.734{\rm CtOH}-0.734{\rm CtPhenol}+1\right)^{2}-e^{{\rm CtPhenol}} 0.9550.955 0.3890.389 γ2=CtRNH22CtPhenol2.79\gamma_{2}={\rm CtRNH}_{2}^{2}-{\rm CtPhenol}-2.79 +3.59(0.528CtRNH22+0.528CtOH0.528CtPhenol+1)2~{}\quad~{}\quad+3.59\left(-0.528{\rm CtRNH}_{2}^{2}+0.528{\rm CtOH}-0.528{\rm CtPhenol}+1\right)^{2} 0.9870.987 0.2040.204 γ2=CtRNH2+2CtOH1.76CtPhenol(1.76CtRNH2)\gamma_{2}=-{\rm CtRNH}_{2}+2{\rm CtOH}-1.76{\rm CtPhenol}\cdot\left(1.76-{\rm CtRNH}_{2}\right) +(CtRNH22+CtOHCtPhenol+0.912)2~{}\quad~{}\quad+\left(-{\rm CtRNH}_{2}^{2}+{\rm CtOH}-{\rm CtPhenol}+0.912\right)^{2} 0.9990.999 0.0710.071 γ2=1.06CtRNH2+2CtOHCtPhenol(2.132CtRNH2)CtPhenol\gamma_{2}=-1.06{\rm CtRNH}_{2}+2{\rm CtOH}-{\rm CtPhenol}\cdot\left(2.13-2{\rm CtRNH}_{2}\right)-{\rm CtPhenol} +(CtRNH22+CtOHCtPhenol+0.912)2~{}\quad~{}\quad+\left(-{\rm CtRNH}_{2}^{2}+{\rm CtOH}-{\rm CtPhenol}+0.912\right)^{2} 0.9990.999 0.0600.060 γ2=1.13CtRNH2(10.939CtOH)2+CtOH3+CtOH\gamma_{2}=-1.13{\rm CtRNH}_{2}\left(1-0.939{\rm CtOH}\right)^{2}+{\rm CtOH}^{3}+{\rm CtOH} CtPhenol(CtRNH220.853CtRNH2+2.20)CtPhenol~{}\quad~{}\quad-{\rm CtPhenol}\left(-{\rm CtRNH}_{2}^{2}-0.853{\rm CtRNH}_{2}+2.20\right)-{\rm CtPhenol} +(CtRNH22+CtOHCtPhenol+0.912)2~{}\quad~{}\quad+\left(-{\rm CtRNH}_{2}^{2}+{\rm CtOH}-{\rm CtPhenol}+0.912\right)^{2} 1.0001.000 0.0360.036

Table S8: Candidate equations for γ3(\gamma_{3}\sim(CtNO2, CtRCO2R)).
Equation R2R^{2} RMSE
γ3=CtNO2(CtRCO2R23.65)+0.762\gamma_{3}={\rm CtNO}_{2}\left(-{\rm CtRCO}_{2}{\rm R}^{2}-3.65\right)+0.762 0.9610.961 0.4130.413
γ3=CtNO2(CtRCO2R23.65)+0.872e8CtRCO2R3\gamma_{3}={\rm CtNO}_{2}\left(-{\rm CtRCO}_{2}{\rm R}^{2}-3.65\right)+0.872e^{-8{\rm CtRCO}_{2}{\rm R}^{3}} 0.9790.979 0.3050.305
γ3=CtNO2(CtRCO2R23.65)CtRCO2RCtNO2+CtRCO2R30.329+0.853\gamma_{3}={\rm CtNO}_{2}\left(-{\rm CtRCO}_{2}{\rm R}^{2}-3.65\right)-\frac{{\rm CtRCO}_{2}{\rm R}}{{\rm CtNO}_{2}+{\rm CtRCO}_{2}{\rm R}^{3}-0.329}+0.853 0.9910.991 0.2020.202
γ3=0.346CtNO22CtRCO2R2CtRCO2R\gamma_{3}=-0.346{\rm CtNO}_{2}^{2}{\rm CtRCO}_{2}{\rm R}^{2}-{\rm CtRCO}_{2}{\rm R} +(CtNO20.246)(CtRCO2R2CtNO23+CtNO2+CtRCO2R1.523.70)~{}\quad~{}\quad+\left({\rm CtNO}_{2}-0.246\right)\left(-\frac{{\rm CtRCO}_{2}{\rm R}^{2}}{{\rm CtNO}_{2}^{3}+{\rm CtNO}_{2}+{\rm CtRCO}_{2}{\rm R}-1.52}-3.70\right) 0.9950.995 0.1500.150
γ3=CtRCO2R3.70(CtNO20.246)\gamma_{3}=-{\rm CtRCO}_{2}{\rm R}-3.70\left({\rm CtNO}_{2}-0.246\right) +(CtNO20.246)(0.168CtNO27CtRCO2R31.26CtRCO2R2CtNO23+CtNO2+CtRCO2R1.52)~{}\quad~{}\quad+\left({\rm CtNO}_{2}\!-\!0.246\right)\left(-0.168{\rm CtNO}_{2}^{7}{\rm CtRCO}_{2}{\rm R}^{3}-\frac{1.26{\rm CtRCO}_{2}{\rm R}^{2}}{{\rm CtNO}_{2}^{3}+{\rm CtNO}_{2}+{\rm CtRCO}_{2}{\rm R}-1.52}\right) 0.9970.997 0.1160.116
Table S9: Candidate equations for γ4(\gamma_{4}\sim(CtF, CtAldehyde, CtR2CO, CtCN, CtROR)).
Equation R2R^{2} RMSE
γ4=(CtF+(1.66CtAldehyde)(CtAldehyde+CtR2CO+CtCN))\gamma_{4}=\left(-{\rm CtF}+\left(1.66-{\rm CtAldehyde}\right)\left(-{\rm CtAldehyde}+{\rm CtR}_{2}{\rm CO}+{\rm CtCN}\right)\right) (CtAldehyde2CtROR3+2.05)~{}\quad~{}\quad\left({\rm CtAldehyde}^{2}-{\rm CtROR}^{3}+2.05\right) 0.9370.937 0.5650.565
γ4=CtAldehyde+(CtAldehyde24CtROR2+2.05)\gamma_{4}=-{\rm CtAldehyde}+\left(-{\rm CtAldehyde}^{2}-4{\rm CtROR}^{2}+2.05\right) (CtAldehyde+1.60CtR2CO+1.60CtCNCtF+0.0807)~{}\quad~{}\quad\left(-{\rm CtAldehyde}+1.60{\rm CtR}_{2}{\rm CO}+1.60{\rm CtCN}-{\rm CtF}+0.0807\right) 0.9390.939 0.5570.557
γ4=CtAldehyde33CtAldehyde+3CtR2CO+2CtCN2CtF\gamma_{4}={\rm CtAldehyde}^{3}-3{\rm CtAldehyde}+3{\rm CtR}_{2}{\rm CO}+2{\rm CtCN}-2{\rm CtF} +eCtCN(CtROR2CtF)20.746~{}\quad~{}\quad+e^{{\rm CtCN}-\left({\rm CtROR}-2{\rm CtF}\right)^{2}}-0.746 0.9580.958 0.4590.459
γ4=CtAldehyde3CtAldehydeCtR2CO3CtAldehyde+CtR2CO2+2CtR2CO\gamma_{4}={\rm CtAldehyde}^{3}-{\rm CtAldehyde}{\rm CtR}_{2}{\rm CO}-3{\rm CtAldehyde}+{\rm CtR}_{2}{\rm CO}^{2}+2{\rm CtR}_{2}{\rm CO} +CtCN2CtF+eCtCN(CtROR2CtF)20.746~{}\quad~{}\quad+{\rm CtCN}-2{\rm CtF}+e^{{\rm CtCN}-\left({\rm CtROR}-2{\rm CtF}\right)^{2}}-0.746 0.9610.961 0.4460.446
γ4=CtAldehyde3CtAldehyde2(CtR2COCtF)23CtAldehyde+CtR2CO2\gamma_{4}={\rm CtAldehyde}^{3}-{\rm CtAldehyde}^{2}\left({\rm CtR}_{2}{\rm CO}-{\rm CtF}\right)^{2}-3{\rm CtAldehyde}+{\rm CtR}_{2}{\rm CO}^{2} +2CtR2CO+4CtCN2CtF+eCtF(CtROR2CtF)20.746~{}\quad~{}\quad+2{\rm CtR}_{2}{\rm CO}+4{\rm CtCN}-2{\rm CtF}+e^{{\rm CtF}-\left({\rm CtROR}-2{\rm CtF}\right)^{2}}-0.746 0.9700.970 0.3910.391
Table S10: Candidate equations γ5(\gamma_{5}\sim(CtCl, CtBr, CtI, CtMethyl)).
Equation R2R^{2} RMSE
γ5=(CtCl+2CtI)2+(CtBrCtBr0.305+CtMethyl)2\gamma_{5}=\left({\rm CtCl}+2{\rm CtI}\right)^{2}+\left(\frac{{\rm CtBr}}{{\rm CtBr}-0.305}+{\rm CtMethyl}\right)^{2} 0.9310.931 0.8050.805
γ5=CtICtMethyl+(CtCl+2CtI)2+(CtBrCtBr0.359+CtMethyl)20.372\gamma_{5}={\rm CtI}{\rm CtMethyl}+\left({\rm CtCl}+2{\rm CtI}\right)^{2}+\left(\frac{{\rm CtBr}}{{\rm CtBr}-0.359}+{\rm CtMethyl}\right)^{2}-0.372 0.9510.951 0.6770.677
γ5=CtBr(CtBr3CtMethyl)2(CtCl+CtI+0.245)2\gamma_{5}={\rm CtBr}-\left({\rm CtBr}^{3}-{\rm CtMethyl}\right)^{2}\left({\rm CtCl}+{\rm CtI}+0.245\right)^{2} +(CtCl+CtBr+2CtI+CtMethyl)20.330~{}\quad~{}\quad+\left({\rm CtCl}+{\rm CtBr}+2{\rm CtI}+{\rm CtMethyl}\right)^{2}-0.330 0.9650.965 0.5780.578
γ5=CtCl2+2CtCl+CtBr(CtCl+CtI)3(CtBr+CtMethyl2+CtMethyl)\gamma_{5}=-{\rm CtCl}^{2}+2{\rm CtCl}+{\rm CtBr}-\left({\rm CtCl}+{\rm CtI}\right)^{3}\left({\rm CtBr}+{\rm CtMethyl}^{2}+{\rm CtMethyl}\right) 0.486(CtBr20.879)2+(CtCl+CtBr+2CtI+CtMethyl)2~{}\quad~{}\quad-0.486\left({\rm CtBr}^{2}-0.879\right)^{2}+\left({\rm CtCl}+{\rm CtBr}+2{\rm CtI}+{\rm CtMethyl}\right)^{2} 0.9820.982 0.4110.411
γ5=CtCl2+CtClCtI+1.46CtCl+CtBr\gamma_{5}=-{\rm CtCl}^{2}+{\rm CtCl}{\rm CtI}+1.46{\rm CtCl}+{\rm CtBr} (CtCl+CtI)3(CtBr+CtMethyl2+CtMethyl)~{}\quad~{}\quad-\left({\rm CtCl}+{\rm CtI}\right)^{3}\left({\rm CtBr}+{\rm CtMethyl}^{2}+{\rm CtMethyl}\right) 0.486(CtBr20.879)2+(CtCl+CtBr+2CtI+CtMethyl)2~{}\quad~{}\quad-0.486\left({\rm CtBr}^{2}-0.879\right)^{2}+\left({\rm CtCl}+{\rm CtBr}+2{\rm CtI}+{\rm CtMethyl}\right)^{2} 0.9850.985 0.3770.377

Molecular skeleton descriptor

The distribution of functional groups within a molecule influences its properties. To quantify this effect, we developed a molecular skeleton descriptor (MSD) focusing on aromatic rings. In the MSD framework, the benzene ring constitutes the core structure. The descriptor values are based on the pattern of substitution on the benzene ring. The value 0 indicates the absence of a benzene ring. The value 11 denotes a mono-substituted aromatic ring. The values 22, 33, and 44 represent di-substituted aromatic rings in ortho, meta, and para positions, respectively. Values 55, 66, and 77 correspond to tri-substituted aromatic rings with 11,22,33-substitution, 11,22,44-substitution, and 11,33,55-substitution patterns, respectively. Furthermore, values 88, 99, and 1010 are assigned to tetra-substituted aromatic rings with 11,22,33,44-substitution, 11,22,33,55-substitution, and 11,22,44,55-substitution patterns, respectively. Lastly, values 1111 and 1212 are used to denote penta-substituted and hexa-substituted aromatic rings, respectively. 1212 MSD structures with the aromatic ring are illustrated in Fig. S1. This descriptor provides a systematic approach to categorize the substitution patterns of benzene rings within molecular structures.

Refer to caption
Figure S1: Illustration of molecular skeleton descriptors.

Hyperparameter configurations for pySR

We implemented the GP algorithm using the open-source Python library pySR. The hyperparameter configurations are detailed in Tables S11 and S12. To capture the higher non-linearity in the relationships between various polarity indices (e.g., α\alpha, β\beta) and their corresponding input features, we increased the number of cycles per iteration. This adjustment aims to discover equations that have better fitting performance, with a trade-off of increased computation time.

Table S11: Hyperparameters setup for discovering RfR_{f} governing equation.
Hyperparameter Value
procs 44
population 88
population size 5050
ncyclesperiteration 5050
niterations 200200
maxsize 5050
maxdepth 1010
binary operators [“++”, “×\times”, “-”, “//”]
unary operators “square”
complexity of constants 22
Table S12: Hyperparameters setup for discovering polarity indices governing equation.
Hyperparameter Value
procs 44
population 88
population size 5050
ncyclesperiteration 500500
niterations 200200
maxsize 5050
maxdepth 1010
binary operators [“++”, “×\times”, “-”, “//”]
unary operators [“square”, “cube”, “exp”]
complexity of constants 22

More results for various SR algorithms

The UHiSR framework is flexible and supports multiple symbolic regression methods. In addition to the GP algorithm, we also present results from SR algorithms employing deep reinforcement learning. The open source Python package, DISCOVER222https://github.com/menggedu/discover, was utilized for the implementation of this method. The discovered hierarchical equation system is detailed in Table S13. The hyperparameter configurations for our implementations are detailed in Table S14.

Table S13: Hierarchical equation systems from DISCOVER.
Equation R2R^{2} RMSE
Rf=σ(3.48Ψ+3.08ξ+1.86),σ(x)=1/(1+ex)R_{f}=\sigma\left(3.48\Psi+3.08\xi+1.86\right),\quad\sigma(x)=1/(1+e^{-x}) 0.9300.930 0.0910.091
Ψ=0.73EA+7.37MeOH0.27DCM+0.51Et2O21.29Hex4\Psi=0.73{\rm EA}+7.37{\rm MeOH}-0.27{\rm DCM}+0.51{\rm Et}_{2}{\rm O}^{2}-1.29{\rm Hex}^{4} 0.9620.962 0.110.11
ξ=0.03α+0.04(eαβ)eα0.21β1.57eα\xi=-0.03\alpha+0.04(e^{\alpha}-\beta)e^{\alpha}-0.21\beta-1.57e^{\alpha} 0.8400.840 0.2710.271
        α=1.97NBen+0.22DM1.49eNBenNBen×DM+0.12NBen×MSD\alpha=-1.97{\rm NBen}+0.22{\rm DM}-1.49e^{{\rm NBen}-{\rm NBen}\times{\rm DM}}+0.12{\rm NBen}\times{\rm MSD} 0.8670.867 0.7220.722
β=0.45γ4+0.026(γ5γ1)20.33γ1+0.064γ22+0.036γ42+0.055γ3γ40.56γ5\beta=0.45\gamma_{4}+0.026(\gamma_{5}-\gamma_{1})^{2}-0.33\gamma_{1}+0.064\gamma_{2}^{2}+0.036\gamma_{4}^{2}+0.055\gamma_{3}\gamma_{4}-0.56\gamma_{5} +0.27γ2~{}\quad~{}\quad~{}\quad~{}+0.27\gamma_{2} 0.7930.793 0.9750.975
        γ1=1.88eCtAmides3.93CtAmides2.48CtAmides23.92CtCO2H\gamma_{1}=1.88e^{\rm CtAmides}-3.93{\rm CtAmides}-2.48{\rm CtAmides}^{2}-3.92{\rm CtCO}_{2}{\rm H} 0.9990.999 0.0460.046
γ2=7.95CtRNH2+9.30CtOH5.98CtPhenol+1.09eCtPhenolCtRNH2\gamma_{2}=-7.95{\rm CtRNH}_{2}+9.30{\rm CtOH}-5.98{\rm CtPhenol}+1.09e^{{\rm CtPhenol}-{\rm CtRNH}_{2}} 2.66eCtOH+2.41eCtRNH2~{}\quad~{}\quad\quad\quad-2.66e^{\rm CtOH}+2.41e^{{\rm CtRNH}_{2}} 0.9960.996 0.0460.046
γ3=0.93eCtCO2HCtAmides3.14CtAmides0.014eCtAmides2.81CtCO2H\gamma_{3}=0.93e^{{\rm CtCO}_{2}{\rm H}-{\rm CtAmides}}-3.14{\rm CtAmides}-0.014e^{{\rm CtAmides}}-2.81{\rm CtCO}_{2}{\rm H} 0.001eCtCO2H2~{}\quad~{}\quad\quad\quad-0.001e^{{\rm CtCO}_{2}{\rm H}^{2}} 0.9910.991 0.2030.203
γ4=1.92CtCN0.61CtR2CO+1.92CtCN+0.32eCtAldehyde+2.34CtAldehyde\gamma_{4}=1.92{\rm CtCN}-0.61{\rm CtR}_{2}{\rm CO}+1.92{\rm CtCN}+0.32e^{{\rm CtAldehyde}}+2.34{\rm CtAldehyde} 0.095CtROR34.32CtROR+0.99CtROR2+0.51CtF43.00CtF~{}\quad~{}\quad\quad\quad-0.095{\rm CtROR}^{3}-4.32{\rm CtROR}+0.99{\rm CtROR}^{2}+0.51{\rm CtF}^{4}-3.00{\rm CtF} 0.9710.971 0.3850.385
γ5=3.66CtBr+2.43CtMethyl+0.63CtCl0.64eCtBr+0.57(CtCl+4CtI)2\gamma_{5}=3.66{\rm CtBr}+2.43{\rm CtMethyl}+0.63{\rm CtCl}-0.64e^{{\rm CtBr}}+0.57({\rm CtCl}+4{\rm CtI})^{2} 4.42CtI~{}\quad~{}\quad\quad\quad-4.42{\rm CtI} 0.9350.935 0.7800.780
Table S14: Hyperparameters setup for DISCOVER.
Category Hyperparameter Value
task function set [“add”, “sub”, “mul”, “div”,“n2”,“n3”,“exp”]
training n samples 200000200000
batch size 500500
epsilon 0.20.2
controller learning rate 0.00250.0025
entropy weight 0.030.03
entropy gamma 0.70.7