Empowering Machines to Think Like Chemists: Unveiling Molecular Structure-Polarity Relationships with Hierarchical Symbolic Regression
Thin-layer chromatography (TLC) is a crucial technique in molecular polarity analysis. Despite its importance, the interpretability of predictive models for TLC, especially those driven by artificial intelligence, remains a challenge. Current approaches, utilizing either high-dimensional molecular fingerprints or domain-knowledge-driven feature engineering, often face a dilemma between expressiveness and interpretability. To bridge this gap, we introduce Unsupervised Hierarchical Symbolic Regression (UHiSR), combining hierarchical neural networks and symbolic regression. UHiSR automatically distills chemical-intuitive polarity indices, and discovers interpretable equations that link molecular structure to chromatographic behavior.
Introduction
Polarity reflects uniformity and symmetry of charge distributions in electrets (?). In particular, molecular polarity is crucial for understanding how molecules interact with each other, and it plays a fundamental role in the characterization of molecules (?). Thin-layer chromatography (TLC) has gained widespread acceptance (?, ?) and provides critical insights into the behavior of organic molecules within varied solvent environments. Despite the established importance of TLC in determining the retardation factor (), a crucial measure in polarity studies, the method is labor-intensive and involves large numbers of repetitive trials.
In recent years, significant progress has been made in the field of artificial intelligence-assisted chemical analysis, especially in the development of quantitative structure–activity relationships (QSAR) prediction models (?, ?, ?, ?) and quantitative structure–property relationships (QSPR) prediction models (?, ?, ?). There are some attempts have been made to construct prediction models for structure-retardation factors (?, ?), but their performance is constrained by the limited quantity and standardization of TLC data. To address this challenge, our prior work (?, ?) introduced an automatic high-throughput platform for TLC analysis, providing a more extensive and standardized dataset for model training. The enhanced dataset serves to improve the performance of the model, allowing for more accurate and robust predictions of structure-retardation factors in TLC experiments.
Although machine learning models, especially deep neural networks (DNNs) have demonstrated excellent performance, their “black box” nature raises concerns about interpretability and understanding of the underlying mechanisms. In this paper, we propose an innovative framework that integrates chemists’ experiential knowledge, aiming to “unpack” the “black box” of molecular structure-polarity relationships.
Essentially, our approach seeks a formalized expression to reveal the intricate mathematical relationship between the value and the solute molecular structures as well as the eluent solvents in TLC analysis. Symbolic regression (SR) has recently emerged as a powerful tool for extracting the underlying mathematical equations from observed data (?, ?, ?, ?, ?, ?, ?). It has gained prominence for scientific discovery in diverse fields (?, ?, ?). However, existing SR methods encounter performance limitations, particularly when dealing with datasets containing multiple variables. Therefore, when using SR methods to explain structure-polarity prediction models, a dilemma emerges: molecular fingerprints, such as MACCS key molecular fingerprints (?), despite their strong expressive power, pose great challenges for SR methods due to their high dimensionality; on the other hand, utilizing feature engineering and domain knowledge with a limited set of input variables results in discovered formulas lacking expressiveness.
To address this dilemma between expressiveness and interpretability, we present unsupervised hierarchical symbolic regression (UHiSR) (see Fig. 1), a new SR approach guided by hierarchical neural network. UHiSR introduces novel polarity indices, e.g., solvent polarity index and solute polarity index, which are learned through a semantic hierarchical model. Guided by domain knowledge, this model comprises multiple sub-models that govern the mapping relationships between a set of input variables, e.g., solvent compositions, and a specific polarity index, e.g., solvent polarity index, as well as the mapping between polarity indices and the output variable . It is noteworthy that this process aligns with the chemists’ thinking processes. Human brain struggles with high-dimensional information, opting not to directly comprehend intricate mappings but rather to decompose and analyze through individual sub-models. Then, we can apply the SR method on the output and the polarity indices, effectively reducing the dimension of the input variables of SR algorithm.

In addition, experienced organic chemists often do not use the molecular fingerprints to evaluate what solvent system should be used for TLC experiments for a given target molecule. Instead, organic chemists evaluate polarity primarily based on the nature, number, and molecular structure of functional groups in solute compounds (?). Therefore, in this study, we seek a simpler and more chemically intuitive perspective, focusing on the basic characteristics of complex molecules, moving from computer engineering to a human-scientist-like thinking mode.
In conclusion, our work introduces several novel chemical-intuitive polarity indices, including the solvent polarity index and solute polarity index, in TLC experiments. These indices provide quantitative insights into the interaction dynamics among molecules in both solid-liquid and liquid-liquid systems. Furthermore, we present a concise-yet-accurate formula that links the value with these polarity indices. The UHiSR framework, driven by both data and domain knowledge, not only uncovers these meaningful polarity indices directly from experimental data, but also constructs an equation that governs the underlying mechanisms. It empowers the machine to think like chemists, and builds a bridge between AI models and interpretable insights in molecular structure-polarity predictive modeling.
Results
Data driven interpretable polarity indices
It is essential for chemists to accurately characterize the overall polarity of molecules. For this purpose, there are many known descriptors such as the molecular polarity index (MPI), which is based on theoretical calculations. However, they often come with issues like high computational cost or only providing global information. Another way to estimate the molecule’s polarity is based on TLC experiments, where chemists typically keep solutes or solvents fixed, evaluate different solvent compounds or solute compounds, and utilize the value to infer the corresponding polarities.
Based on the TLC dataset, we used a data-driven method to extract two polarity indices, namely and , acting as empirical descriptors providing information about the polarity of the solvent system and solute molecule, respectively. In contrast to traditional physicochemical descriptors, our proposed polarity indices offer enhanced interpretability and the ability to capture detailed structural information.
Chemical-intuitive feature engineering.
Rather than employing molecular fingerprints or conventional physicochemical descriptors like common machine learning does, our approach deliberately labels functional groups, just like human chemists typically do. In particular, this study encompasses an array of solute molecules, including those containing carbonyl, hydroxyl, amino, halogen, nitro, and cyano functionalities. This enables a tailored analysis of how these functional groups influence the compound’s value across various solvents in terms of type, number, and allocation. For notational simplicity, we utilize the abbreviations from Table 1 for the features employed in our study. This table also provides the descriptions and statistics of these features. Our method merely uses distinct features for describing solute molecules, and thereby demonstrates an effective reduction of dimensionality compared with MACCS key molecular fingerprints ( dimensions). Furthermore, in contrast to traditional physicochemical descriptors, which are commonly low dimensional, our features offer a more localized and interpretable representation of the target molecule.
To evaluate the effectiveness of the selected features, we conducted a comparative analysis of their representation power (see Materials and Method for more details). As shown in Fig. 2, our proposed features demonstrate comparable fitting performance to MACCS, and slightly surpass conventional physicochemical descriptors. The results emphasize the substantial representational power of our features.
Feature abbr. | Feature description | Mean(Std.) | Pearson’s |
---|---|---|---|
Eluent solvent | |||
Hex | n-Hexane concentration | ||
EA | Ethyl acetate concentration | ||
DCM | Dichloromethane concentration | ||
MeOH | Methanol concentration | ||
Et2O | Diethyl ether concentration | ||
Solute molecule | |||
NBen | Count of benzene | ||
MSD | Molecular skeleton descriptor | ||
DM | Dipole moment | ||
CtPhenol | Count of phenolic hydroxyl group | ||
CtOH | Count of alcoholic hydroxyl group | ||
CtAldehyde | Count of aldehyde group | ||
CtCO2H | Count of carboxylic acid group | ||
CtRCO2R | Count of ester group | ||
CtR2C=O | Count of ketone group | ||
CtROR | Count of ether group | ||
CtCN | Count of cyano group | ||
CtNH2 | Count of amine group | ||
CtNO2 | Count of nitro group | ||
CtAmide | Count of amide group | ||
CtMe | Count of methyl group | ||
CtF | Count of fluorine | ||
CtCl | Count of chlorine | ||
CtBr | Count of bromine | ||
CtI | Count of iodine |

Polarity indices.
In this study, the stationary phase on the TLC plate is silica gel. The determined values epitomize a dynamic equilibrium. It reflects the competitive interplay between the highly polar stationary phase and the solute molecules, which are mediated through interactive forces by changing the polarities of the mobile phase during the experiment. This interplay exhibits two primary patterns: the solvent facilitates the upward transit of solute via capillary action; concurrently, solvent molecules intervene by dislodging solute molecules from the stationary phase’s surface, attenuating their interactions (as illustrated in Fig. 3).
The two polarity indices distinctly capture the interaction dynamics between the solvent compound and the silica gel, as well as between the solute molecule and the silica gel. Importantly, our approach enables a quantitative assessment of the aforementioned interactions. In the following paragraphs, we will separately analyze these two indices.

Solute polarity index.
Considering that the dataset contains organic compounds, previous research normally involves high dimensional features and physicochemical descriptors to encode the solute molecule into numerical values. However, the former are often challenging for human understanding, while the latter lack structural information.
To carry out this study, we carefully selected representative functional groups and used their counts as features for the target molecules. Additionally, we characterized the distribution of these functional groups within the molecule by using three features: count of benzene (NBen), molecular skeleton descriptor (MSD), and dipole moment (DM). Although we were able to reduce the feature numbers to , establishing a explainable connection between the polarity of the molecule and these input features remains a major challenge.
We distinguished two types of solute features: functional group (FG) counts and FG distribution, as they may have distinct effects on the polarity of a given molecule (Fig. 3A). For instance, a molecule with a higher count of polarity-functional groups but with a symmetric structure may exhibit lower polarity than molecules with fewer polarity-functional groups. Therefore, we further extracted two indices, namely FG distribution polarity index and FG polarity index , to isolate these effects.
Despite the polarity of functional groups being of great interest, there is a lack of quantitative and scalable methods to compare the polarities of different functional groups; in current practice, chemists often rely mainly on qualitative analysis and their experience. Our FG polarity index provides the first empirical quantification of the impact of different functional groups on molecular polarity. Considering the functional groups involved in this study, the input features for the sub-model used to obtain the FG polarity index consist of -dimensional vectors, with each dimension representing the number of the respective functional group. To quantify the polarity of a specific functional group FGi, we configured the input feature vector such that the -th position was set to and the others to . As illustrated in Fig. 3B, the sub-model’s outputs reveal a distinct polarity order among the functional groups, i.e., amides () carboxylic acid () amine () alcoholic hydroxyl () phenolic hydroxyl () aldehyde () nitro () ketone () ester () fluorine () ether () methyl () bromine () cyano () chlorine () iodine ().
Solvent polarity index.
Another important interaction is between the solvent compound and the silica gel. The solvent polarity index provides an analytical lens through which we are able to assess the influence of disparate solvents on the values.
Methanol (MeOH) possesses one hydrogen bond acceptor (HBA) and one hydrogen bond donor (HBD), capable of forming stable six- and seven-membered chelates with the silica gel. It also boasts an HBA that can engage in intermolecular hydrogen bonding with the siloxane (Si-O-Si) bridges of the silica gel. Ethyl acetate (EA) harbors two HBAs that can establish an eight-membered ring complex with the silica gel. The oxygen atoms of the ester moiety can partake in two distinct types of intermolecular hydrogen bonds with the gel. Diethyl ether (Et2O) contains one HBA and is thus limited to a singular mode of intermolecular hydrogen bonding with the silica gel. Dichloromethane (DCM) is characterized by two chlorine atoms, which can form a relatively less stable eight-membered ring structure with the gel, and induce weak intermolecular hydrogen bonding. The interaction between n-hexane (Hex) molecules and the silica gel is considerably feeble, with negligible capacity to displace solute molecules. Consequently, under the paradigm of “like dissolves like”, we infer a polarity hierarchy as follows:
The input features of the sub-model used to obtain consist of five-dimensional vectors, with each dimension representing the volume percentage of a solvent compound. Consequently, variations in the compound ratio within the solvent system directly influence the value of , thereby changing the polarity of the solvent. Moreover, to isolate the effect of each compound, we can assign the value to the for the volume percentage that corresponds to the target solvent compound, while setting the others to . This approach yields the following values (see Fig. 3B):
As shown in Table 2, we also conducted a comparative analysis for our empirical solvent descriptor and experimental miscibility indices such as LogP and aqueous solubility (?). The observed numerical consistency and trends align with our preliminary postulates (see Table 2).
Solvent | HBD | HBA | Solubility in water (g/100g) (?) | LogP | |
---|---|---|---|---|---|
Hex | 0 | 0 | 0.01 | 3.42 | -1.238 |
DCM | 0 | 0 | 1.32 | 1.01 | -0.392 |
Et2O | 0 | 1 | 6.9 | 0.76 | 0.534 |
EA | 0 | 2 | 7.7 | 0.29 | 0.733 |
MeOH | 1 | 1 | Miscible | -0.27 | 5.124 |
The governing equation
To comprehensively understand the relationship between the value and the two polarity indices, we used the symbolic regression to formulate empirical mathematical expressions connecting the value and the two polarity indices (Fig. 4A). The observed values were obtained directly from our automatic high-throughput platform (?), while the computed values were calculated from the governing equation (see Equation (1) below). The polarity indices can also be determined through a hierarchical equation system, based on the values of the input variables (Fig. 4A), a discussion of which will follow.
To ensure that the computed values lie within the range from to , a sigmoidal function was employed (see Fig. 4B). By using symbolic regression, one obtains several candidates for the governing equation (see Table S1 in the Supplementary Materials). Since a simple formula is desired, and considering the prescribed fitting accuracy (RMSE , ), we selected
(1) |
In order to separately analyze the individual influence of solvent and solute on the value, we further decomposed the formula for in Equation (1) into and . For instance, as illustrated in Fig. 4C, when considering a specific solute molecule (i.e., fixing a value for ), we can reformulate Equation (1) as , where depends on the chosen solute molecule. The pattern observed in Fig. 4C shows that increases with . Since a larger value of indicates greater polarity, this finding emphasizes that the correlates to the polarity of the solvent. Furthermore, we observe that the shape of ’s graph varies significantly for different solute molecules. In particular, Fig. 4C indicates that, within the context of our dataset, changing the solvent’s composition only has a subtle effect on for solute molecules with extreme polarities. Similarly, when the solvent is fixed, Equation (1) can be reformulated as , where depends on the chosen value for . By analysis similar to the previous case (fixed solute), the value increases with . But now, a larger value of indicates a lower polarity. Moreover, we observe that the shape of ’s graph is of the same S-shaped type across different solvent systems (see Fig. 4C).
Quantification of solvent polarity
In addition to characterizing the polarity of the eluent solvent, the UHiSR framework offers an explainable solution for understanding how different compounds influence the solvent polarity index . In this study, we consider five distinct solvent compounds: Hex, EA, DCM, MeOH and Et2O. To achieve the designed solvent polarity, chemists usually adjust the ratios of these compounds in the mobile phase system. By directly applying the SR algorithm, we obtained the empirical mathematical equation between solvent polarity index and its corresponding input variables, i.e., the volume ratio of Hex, DCM, Et2O, EA and MeOH in the mobile phase system. The following formula achieved high of and a low RMSE of with a simple form,
(2) |
Specifically, there are three mobile phase systems involved in the dataset, which are Hex/EA, Hex/Et2O and MeOH/DCM. Equation (2) quantifies the impact of the two compounds on solvent polarity within each system. For instance, in the presence of Hex and EA, the solvent polarity index is calculated as . As the volume of EA increases, the solvent polarity exhibits a corresponding increase. Moreover, the constant term of 11.1 with MeOH indicates that even a slight increase/decrease in the volume of MeOH will lead to a significant increase/decrease in solvent polarity.
Quantification of solute polarity
As mentioned earlier, we introduce two additional functional group-related indices, FG distribution polarity index and FG polarity index , to enhance our understanding of the relationship between the molecular polarity and the molecular structure. With the help of symbolic regression, we obtained an explicit mathematical relationship between and as follows:
(3) |
where . Breaking down Equation (3), the molecular polarity can be understood as a combination of the polarity of the functional groups within the molecule (), the distribution of these functional groups () and the interaction between these two factors ().
Regarding the FG polarity index , which categorizes the polarity of different functional groups, we observe a distinct trend: an increase in corresponds to a decrease in the solute polarity index , consequently leading to a smaller value (Fig. 4D). This signifies that a higher value is indicative of greater polarity. As for the FG distribution polarity index , we observe an exponential growth of when increases. However, is a composite element that includes the functional group distribution within- and the symmetry of the target molecule. In Fig. 4E, we provide example molecules with distinct values.
Hierarchical equation system
UHiSR has the capability to reveal the relationship between the FG distribution polarity index and its corresponding input variables, as well as between the FG polarity index and the counts of functional groups. Consequently, we can establish a hierarchical equation system that fully elucidates the mathematical relationship between and all input variables, as presented in Table 3.
Equation | RMSE | |
---|---|---|

Discussion
Recent methods to identify quantitative structure-polarity relationships usually include various machine learning models and learning informative feature representations (?, ?). Despite achieving superior prediction accuracy, these models or feature representations often lack interpretability, hindering a deeper understanding the underlying relationships within the dataset. One strategy to enhance interpretability is to articulate explicit formulas between the output and input variables. Symbolic regression emerges as an effective approach for unveiling complex relationships within datasets. However, feasibility and interpretability are challenged when dealing with a large number of variables, as the performance significantly decreases while the complexity of the discovered formulas increases drastically.
In this work, we introduce the integration of hierarchical neural networks and symbolic regression, aiming to enhance the interpretability of the model. We propose, for the first time, two important polarity indices, the solvent polarity index and the solute polarity index . These two polarity indices not only provide interpretable insights, but also enable direct prediction through a formula-driven approach. Compared to high-dimensional feature representations, as we only use few indices (two) that are strongly tied to the given dataset, our approach offers increased ease of understanding and better alignment with the underlying task and the dataset. Additionally, the predictive accuracy is comparable to that of currently used DNN models. Besides, the data-driven discovery of FG distribution polarity index and FG polarity index facilitates a systematic exploration of our -dimensional solute feature space, with each index focusing on different aspects (counts and distribution of functional groups).
The integration of symbolic regression offers an equation-based model, which has several advantages compared with traditional machine learning models. First, equation-based models have extremely low computational cost, making it an ideal choice in resource-constrained environments. Second, the equation-based model usually has only few parameters, especially compared to DNN models, increasing its transferablility and robustness. Due to these advantages, there are potential applications of equation-based models in edge computing. Moreover, unlike traditional neural network models that are deployed in centralized systems and require substantial data transfer for each new scenario, equation-based models offer much cheaper adaptability. This opens up new possibilities for applications in areas such as automated experimental platforms.
As final remark, we have focused only on examples from the chemistry community; however, building interpretable frameworks, that facilitate human understanding of the employed models, can be of general importance for science applications.
Materials and Methods
TLC Dataset
Several variables contribute to the variation in values, including factors such as the nature of the stationary phase, temperature, humidity. Traditional methods of obtaining TLC data manually suffer from a lack of standardization, making scalability challenging. To address this issue, our prior work (?) introduced a robotic platform designed for high-throughput collection of TLC data (Fig. 4A), which not only ensures efficiency in data acquisition but also establishes a standardized and scalable methodology.
The TLC dataset comprises in total measurements of the values, involving organic compounds and three mobile phase systems. These mobile phase systems are n-Hexane/Ethyl acetate, Diethyl ether/n-Hexane and Methanol/Dichloromethane systems, with a total of different solvent compositions.

Framework of UHiSR.
To address the inherent complexity imbalance between solvent and solute compounds, our framework, unsupervised hierarchical symbolic regression (UHiSR), is strategically organized hierarchically in a top-to-down manner. It is motivated by the rationale grounded in established practices observed in TLC experiments. As illustrated in Fig. 5, each stage focuses on specific aspects of the feature space, allowing the model to learn different latent variables which discern patterns at different levels of granularity.
For instance, the top stage provides a broad overview of the interaction between solvent and solute, capturing their global polarity characteristics. Subsequent stages then delve into more detailed and specific characteristics of the solute molecular structure, offering intricate insights regarding the contributions of different molecular features to the overall molecular polarity. As illustrated in Fig. 1C, the framework contains three key parts: feature engineering, hierarchical neural network and symbolic regressions. In this subsection, we introduce the algorithms and methods.
Feature engineering.
Organic molecules are composed of various functional groups arranged along their molecular skeletons. The arrangement of these groups influences the polarity of a given molecule. Organic chemists typically assess the polarity of target molecules by considering the type, number, and distribution of functional groups. Here, we focused on representative functional groups (Fig. 1C). Besides the type and number of functional groups, their spatial distribution is also crucial for molecular polarity. To account for this, we included additional descriptors such as the number of aromatic rings, dipole moments, and molecular skeleton descriptors (detailed in the Supplementary Materials).
In terms of features related to the solvent, we considered the volume ratios () of five different solvent compounds, i.e., n-Hexane, Ethyl acetate, Dichloromethane, Diethyl ether and Methanol.
To evaluate the representation power of the selected features, we conducted a comparative analysis. This analysis employed the XGBoost model and examined three different sets of features. Hereto, like in previous studies (?), we primarily focus on solute molecular features. To maintain consistency, the features related to the eluent solvent were kept the same across all experiments. Feature group A consists of -dimensional solute features (Table 1). Feature group B encompasses -dimensional features, containing -dimensional MACCS keys for describing molecular structure. Feature group C contains -dimensional features, focusing on conventional physicochemical descriptors. Specifically, these descriptors include molecular weight (MW), topological polar surface area (TPSA), the logarithm of -octanol–water partition coefficient (LogP), hydrogen bond donor number (HBD), hydrogen bond acceptor number (HBA), molecular polarity index (MPI), polar surface area percentage (PSA) and dipole moment (DM).
Architecture design of hierarchical neural network.
The architecture of our network is motivated by the rationale grounded in established practices observed in TLC experiments. TLC experiments conventionally entail the execution of controlled experiments designed to meticulously dissect the influence of solute polarity and solvent polarity on . In this context, we modified conventional Multi-Layer Perceptron (MLP) architecture to meet the specific demand of our task. Specifically, given an input sample , the initial step involves the classification of all features into clusters, denoted as . Here, and their dimensions sum up to , i.e., . To enhance interpretability of these clusters, we incorporated a chemists-guided strategy, and the clusters used at each stage are demonstrated in Fig. 5.
Then, a key innovation lies in the formulation of the target latent representation layer , where . Each function operates independently in the modified MLP, in the sense that there is no parameter sharing across . Thus, we can consider as a sub-model. This design ensures the preservation of feature-specific information. The final output encapsulates the information from the target latent representation layer.
This design restricts the receptive field of the latent variables, therefore ensuring that each latent variable solely depends on a specific group of input features. For instance, in the first stage, we considered all features and categorized them into two distinct clusters, solvent features and solute features. This categorization aligns with practical considerations where the polarity of the solvent and the solute is often analyzed separately. Consequently, the learned latent variables in this layer can be regarded as the solvent polarity index and the solute polarity index, respectively. The connections between the latent variables and the output layer capture the nonlinear relationship inherent in the data. This approach enables the model to learn informative latent variables in an unsupervised manner, which disentangles and highlights the distinct contributions of different groups of input features.
Symbolic regression.
SR aims to identify mathematical expressions that capture the inherent relationships within a given dataset. Unlike approaches that involve fitting parameters to overly complex general models (such as machine learning models), SR explores a space of simple analytic expressions. The goal is to discover accurate and interpretable models that directly represent the underlying patterns in the data. SR is usually implemented by evolutionary algorithms, such as genetic programming (GP) (?, ?, ?).
The most common way to visualize a symbolic expression is a tree-structure with nodes and branches, as shown in Fig. 4A. This structure includes primitive functions (e.g., and terminal nodes (input features and numeric constants). Through a series of mutation operations, the GP algorithm seeks to determine the optimal number of nodes and terminals that provide the best fit to a given dataset. For our implementation of the widely used GP-SR algorithm, we employed the open-source Python library PySR111https://github.com/MilesCranmer/PySR (?). The hyperparameter configurations for our implementation are detailed in the Supplementary Materials.
It is worth mentioning that our method is not limited to GP algorithms. In fact, the UHiSR framework is designed to accommodate various symbolic regression methods; comparable results obtained from deep reinforcement learning based SR methods (such as DISCOVER (?)) are included in the Supplementary Materials.
Experimental settings and training details
When constructing the hierarchical neural network, at each stage we set the number of neurons in the hidden layers of each sub-model to . Furthermore, two hidden layers were employed. To facilitate the discovery of concise equations, we leaned towards employing a relatively simple network structure, which helps to prevent overfitting to experimental noise. The activation function chosen for each sub-model was LeakyReLU. In the first stage, the activation function of the output layer was set to a sigmoid, which ensures a standardized output within the range of and (Fig. 4B). The batch size was set to , and the Adam optimizer implemented in PyTorch was applied. The learning rate was . The dataset was randomly divided into for train/valid/test splits. Each hierarchical neural network was trained for epochs, then the best validation checkpoint was selected for testing. To assess the prediction performance, we employed two key metrics: the -squared coefficient () and the root mean square error (RMSE), which are defined as
(4) |
where is the total number of samples, and represent the label and the prediction value for the -th sample, and denotes the mean of the ground truth values.
References
- 1. G. M. Sessler, “Physical principles of electrets” in Electrets (Springer, Berlin, Heidelberg, 1987), pp. 13–80.
- 2. A. R. Katritzky, D. C. Fara, H. Yang, K. Tämm, T. Tamm, M. Karelson, Quantitative measures of solvent polarity, Chem. Rev. 104, 175-198 (2004).
- 3. K. Ciura, S. Dziomba, J. Nowakowska, M. J. Markuszewski, Thin layer chromatography in drug discovery process, J. Chromatogr. A 1520, 9-22 (2017).
- 4. C. F. Poole, Thin-layer chromatography: challenges and opportunities, J. Chromatogr. A 1000, 963-984 (2003).
- 5. A. F. Zahrt, J. J. Henle, B. T. Rose, Y. Wang, W. T. Darrow, S. E. Denmark, Prediction of higher-selectivity catalysts by computer-driven workflow and machine learning, Science 363, eaau5631 (2019).
- 6. E. N. Muratov, J. Bajorath, R. P. Sheridan, I. V. Tetko, D. Filimonov, V. Poroikov, T. I. Oprea, I. I. Baskin, A. Varnek, A. Roitberg, et al., Qsar without borders, Chem. Soc. Rev. 49, 3525–3564 (2020).
- 7. C. B. Wahl, M. Aykol, J. H. Swisher, J. H. Montoya, S. K. Suram, C. A. Mirkin, Machine learning–accelerated design and synthesis of polyelemental heterostructures, Sci. Adv. 7, eabj5505 (2021).
- 8. N. I. Rinehart, R. K. Saunthwal, J. Wellauer, A. F. Zahrt, L. Schlemper, A. S. Shved, R. Bigler, S. Fantasia, S. E. Denmark, A machine-learning tool to predict substrate-adaptive conditions for pd-catalyzed c–n couplings, Science 381, 965-972 (2023).
- 9. Q. Yang, Y. Li, J.-D. Yang, Y. Liu, L. Zhang, S. Luo, J.-P. Cheng, Holistic prediction of the pka in diverse solvents based on a machine-learning approach, Angew. Chem. Int. Ed. 59, 19282-19291 (2020).
- 10. Y. Yao, Q. Dong, A. Brozena, J. Luo, J. Miao, M. Chi, C. Wang, I. G. Kevrekidis, Z. J. Ren, J. Greeley, G. Wang, A. Anapolsky, L. Hu, High-entropy nanoparticles: Synthesis-structure-property relationships and data-driven discovery, Science 376, eabn3103 (2022).
- 11. B. A. Koscher, R. B. Canty, M. A. McDonald, K. P. Greenman, C. J. McGill, C. L. Bilodeau, W. Jin, H. Wu, F. H. Vermeire, B. Jin, T. Hart, T. Kulesza, S.-C. Li, T. S. Jaakkola, R. Barzilay, R. Gómez-Bombarelli, W. H. Green, K. F. Jensen, Autonomous, multiproperty-driven molecular discovery: From predictions to measurements and back, Science 382, eadi1407 (2023).
- 12. Łukasz Komsta, A functional-based approach to the retention in thin layer chromatographic screening systems, Anal. Chim. Acta 629, 66-72 (2008).
- 13. S. Yousefinejad, F. Honarasa, N. Saeed, Quantitative structure–retardation factor relationship of protein amino acids in different solvent mixtures for normal-phase thin-layer chromatography, J. Sep. Sci. 38, 1771–1776 (2015).
- 14. H. Xu, J. Lin, Q. Liu, Y. Chen, J. Zhang, Y. Yang, M. C. Young, Y. Xu, D. Zhang, F. Mo, High-throughput discovery of chemical structure-polarity relationships combining automation and machine-learning techniques, Chem 8, 3202-3214 (2022).
- 15. H. Xu, D. Zhang, F. Mo, High-throughput automated platform for thin layer chromatography analysis, STAR protocols 3, 101893 (2022).
- 16. L. Billard, E. Diday, From the statistics of data to the statistics of knowledge: symbolic data analysis, J. Am. Stat. Assoc. 98, 470–487 (2003).
- 17. I. Arnaldo, K. Krawiec, U.-M. O’Reilly, paper presented at the 2014 Annual Conference on Genetic and Evolutionary Computation, Vancouver, BC, Canada, 12 July 2014.
- 18. M. Quade, M. Abel, K. Shafi, R. K. Niven, B. R. Noack, Prediction of dynamical systems by symbolic regression, Phys. Rev. E 94, 012214 (2016).
- 19. S. L. Brunton, J. L. Proctor, J. N. Kutz, Discovering governing equations from data by sparse identification of nonlinear dynamical systems, Proc. Natl. Acad. Sci. U.S.A. 113, 3932–3937 (2016).
- 20. B. K. Petersen, M. Landajuela, T. N. Mundhenk, C. P. Santiago, S. Kim, J. T. Kim, paper presented at the 9th International Conference on Learning Representations, Virtual Event, Austria, 3 May 2021.
- 21. M. Cranmer, A. Sanchez-Gonzalez, P. Battaglia, R. Xu, K. Cranmer, D. Spergel, S. Ho, paper presented at the 34th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 06 December 2020.
- 22. S.-M. Udrescu, M. Tegmark, Ai feynman: A physics-inspired method for symbolic regression, Sci. Adv. 6, eaay2631 (2020).
- 23. F. Sun, Y. Liu, J.-X. Wang, H. Sun, paper presented at the 11th International Conference on Learning Representations, Kigali, Rwanda, 1 May 2023.
- 24. J. Zeng, H. Xu, Y. Chen, D. Zhang, Deep learning discovery of macroscopic governing equations for viscous gravity currents from microscopic simulation data, Comput. Geosci. pp. 1–14 (2023).
- 25. A. E. Allen, A. Tkatchenko, Machine learning of material properties: Predictive and interpretable multilinear models, Sci. Adv. 8, eabm7185 (2022).
- 26. J. L. Durant, B. A. Leland, D. R. Henry, J. G. Nourse, Reoptimization of mdl keys for use in drug discovery, J. Chem. Inf. Model. 42, 1273–1280 (2002).
- 27. J. Sherma, B. Fried, Handbook of thin-layer chromatography (CRC, Eastern, Pennsylvania, 2003).
- 28. I. Smallwood, Handbook of organic solvent properties (Butterworth-Heinemann, Oxford, 2012).
- 29. M. Cranmer, Interpretable machine learning for science with pysr and symbolicregression. jl, arXiv preprint arXiv:2305.01582 (2023).
- 30. M. Du, Y. Chen, D. Zhang, Discover: Deep identification of symbolic open-form pdes via enhanced reinforcement-learning, arXiv preprint arXiv:2210.02181 (2022).
Acknowledgments
We thank the High Performance Computing Platform of Peking University for machine learning model training. We also thank Mengge Du for assisting in the implementation of DISCOVER.
Funding:
National Natural Science Foundation of China No.22071004
National Natural Science Foundation of China No.21933001
National Natural Science Foundation of China No.22150013
National Natural Science Foundation of China No.62106116
Author contributions:
Conceptualization: YC, FM
Methodology: SL, CL, YC, FM
Investigation: SL, CL
Visualization: SL, CL
Supervision: YC, FM
Writing-original draft: SL, CL
Writing-review & editing: SL, YC, FM
Competing interests:
Authors declare that they have no competing interests.
Data and materials availability:
The dataset utilized in this manuscript is accessible at the following link: https://github.com/SiyuLou/UnsupervisedHierarchicalSymbolicRegression.
All original code for the experiments and analysis in this work has been deposited at the website https://github.com/SiyuLou/UnsupervisedHierarchicalSymbolicRegression.
Supplementary Materials
Candidate equations for the hierarchical equation system
We present the candidate equations for each the hierarchical equation system, along with the corresponding and RMSE values on the test dataset as reported in the Table S1-S10. Symbolic regression algorithms often face a trade-off between simplicity and fitting performance. Considering the noise in real-world datasets, we prioritize simpler equations, such as choosing as the governing equation, even if there is a slight loss in fitting accuracy. On one hand, concise equations are easier to understand; on the other hand, they help avoid overfitting to noise, making the model more robust.
Equation | RMSE | |
---|---|---|
Equation | RMSE | |
---|---|---|
Equation | RMSE | |
---|---|---|
Equation | RMSE | |
---|---|---|
Equation | RMSE | |
---|---|---|
Equation | RMSE | |
---|---|---|
. Equation RMSE
Equation | RMSE | |
---|---|---|
Equation | RMSE | |
---|---|---|
Equation | RMSE | |
---|---|---|
Molecular skeleton descriptor
The distribution of functional groups within a molecule influences its properties. To quantify this effect, we developed a molecular skeleton descriptor (MSD) focusing on aromatic rings. In the MSD framework, the benzene ring constitutes the core structure. The descriptor values are based on the pattern of substitution on the benzene ring. The value indicates the absence of a benzene ring. The value denotes a mono-substituted aromatic ring. The values , , and represent di-substituted aromatic rings in ortho, meta, and para positions, respectively. Values , , and correspond to tri-substituted aromatic rings with ,,-substitution, ,,-substitution, and ,,-substitution patterns, respectively. Furthermore, values , , and are assigned to tetra-substituted aromatic rings with ,,,-substitution, ,,,-substitution, and ,,,-substitution patterns, respectively. Lastly, values and are used to denote penta-substituted and hexa-substituted aromatic rings, respectively. MSD structures with the aromatic ring are illustrated in Fig. S1. This descriptor provides a systematic approach to categorize the substitution patterns of benzene rings within molecular structures.

Hyperparameter configurations for pySR
We implemented the GP algorithm using the open-source Python library pySR. The hyperparameter configurations are detailed in Tables S11 and S12. To capture the higher non-linearity in the relationships between various polarity indices (e.g., , ) and their corresponding input features, we increased the number of cycles per iteration. This adjustment aims to discover equations that have better fitting performance, with a trade-off of increased computation time.
Hyperparameter | Value |
---|---|
procs | |
population | |
population size | |
ncyclesperiteration | |
niterations | |
maxsize | |
maxdepth | |
binary operators | [“”, “”, “”, “”] |
unary operators | “square” |
complexity of constants |
Hyperparameter | Value |
---|---|
procs | |
population | |
population size | |
ncyclesperiteration | |
niterations | |
maxsize | |
maxdepth | |
binary operators | [“”, “”, “”, “”] |
unary operators | [“square”, “cube”, “exp”] |
complexity of constants |
More results for various SR algorithms
The UHiSR framework is flexible and supports multiple symbolic regression methods. In addition to the GP algorithm, we also present results from SR algorithms employing deep reinforcement learning. The open source Python package, DISCOVER222https://github.com/menggedu/discover, was utilized for the implementation of this method. The discovered hierarchical equation system is detailed in Table S13. The hyperparameter configurations for our implementations are detailed in Table S14.
Equation | RMSE | |
---|---|---|
Category | Hyperparameter | Value |
---|---|---|
task | function set | [“add”, “sub”, “mul”, “div”,“n2”,“n3”,“exp”] |
training | n samples | |
batch size | ||
epsilon | ||
controller | learning rate | |
entropy weight | ||
entropy gamma |