T-Rex: Text-assisted Retrosynthesis Prediction

Yifeng Liu^∗α, Hanwen Xu^∗β, Tangqi Fang^α, Haocheng Xi^α,
Zixuan Liu^β, Sheng Zhang^γ, Hoifung Poon^γ, Sheng Wang^†β
^αInstitute for Interdisciplinary Information Sciences, Tsinghua University
^β Paul G. Allen School of Computer Science and Engineering, University of Washington
^γ Microsoft Research
{liuyifen20, fangtq20, xihc20}@mails.tsinghua.edu.cn,
{xuhw, zucksliu, swang}@cs.washington.edu,
{zhang.sheng, hoifung}@microsoft.com

Abstract

∗: Co-first author. †: Corresponding author.

As a fundamental task in computational chemistry, retrosynthesis prediction aims to identify a set of reactants to synthesize a target molecule. Existing template-free approaches only consider the graph structures of the target molecule, which often cannot generalize well to rare reaction types and large molecules. Here, we propose T-Rex, a text-assisted retrosynthesis prediction approach that exploits pre-trained text language models, such as ChatGPT, to assist the generation of reactants. T-Rex first exploits ChatGPT to generate a description for the target molecule and ranks candidate reaction centers based on both the description and the molecular graph. It then re-ranks these candidates by querying the descriptions for each reactant and examining which group of reactants can best synthesize the target molecule. We observed that T-Rex substantially outperformed graph-based state-of-the-art approaches on two datasets, indicating the effectiveness of considering text information. We further found that T-Rex outperformed the variant that only uses the ChatGPT-based description without the re-ranking step, demonstrating that our framework outperforms a straightforward integration of ChatGPT and graph information. Collectively, we show that text generated by pre-trained language models can substantially improve retrosynthesis prediction, opening up new avenues for exploiting ChatGPT to advance computational chemistry. The code is available at https://github.com/lauyikfung/T-Rex.

1 Introduction

Retrosynthesis, which aims to find a few small reactants that can synthesize a given target product, is a fundamental problem in organic chemistry (Fig. 1) Corey (1988); Szymkuć et al. (2016); Segler et al. (2018); Strieth-Kalthoff et al. (2020); Jiang et al. (2022). Retrosynthesis analysis is time-consuming and challenging even for experienced chemists due to the extremely large search space of all possible transformations. Recently, deep learning approaches have been proposed for retrosynthesis prediction Coley et al. (2018). For example, the state-of-the-art retrosynthesis prediction method formulates this task as a graph generation problem, where a large graph of the target product is used to generate two small graphs of reactants.

Figure 1: Illustration of retrosynthesis prediction. We formulate retrosynthesis prediction as a two-step approach. First, we identify the bond that splits the target molecule into two synthons. This step is formulated as a multi-class classification problem. Second, each synthon is used to generate a reactant. This step is formulated as a graph-to-graph generation problem.

Existing retrosynthesis prediction approaches can be classified into template-based and template-free. Template-based approaches Hartenfeller et al. (2011); Szymkuć et al. (2016); Coley et al. (2017a); Law et al. (2009) match the subgraph patterns to templates, decompose the target molecule, and then use the indicated atomic modifications to obtain the reactants. Despite the substantial progress, template-based approaches are inefficient since the process of subgraph matching can yield hundreds of potential templates for a target molecule. As a result, template-free methods have become a promising alternative. Template-free methods utilize sequence-to-sequence techniques to generate target reactants using SMILES Weininger (1988) strings of the product Liu et al. (2017); Karpov et al. (2019); Tu and Coley (2022); Sacha et al. (2021a); Sun et al. (2020). Recently, MEGAN Sacha et al. (2021b) and G2Gs Shi et al. (2020) have achieved the state-of-the-art retrosynthesis prediction results by transforming SMILES strings into graph structures and then exploiting a graph-to-graph generation framework.

Here, we propose T-Rex, a novel ranking and re-ranking approach for retrosynthesis prediction. The key idea of T-Rex is to use large language models, including ChatGPT, to provide descriptions for the reaction procedure, and then integrate these descriptions with molecular graphs for retrosynthesis prediction. In particular, we first use ChatGPT to obtain the description of the target product based on its SMILES representation, and use this textual description and the molecular graph to rank candidate reaction centers. For each candidate reaction center, we then use ChatGPT to obtain the descriptions for both reactants. Finally, we use these descriptions to re-rank the candidate reaction centers. After the reaction center is identified, we exploit an existing graph-to-graph approach for synthon completion.

We evaluated T-Rex on two large-scale retrosynthesis datasets, USPTO-50k Coley et al. (2017b) and USPTO-MIT Jin et al. (2017). We found that T-Rex substantially outperformed the G2Gs model on both within-dataset cross-validation and cross-dataset prediction. We further demonstrated that the T-Rex framework can be used to predict reaction types and that these predicted reaction types can further boost retrosynthesis prediction. Finally, we extended T-Rex to deconstruct a product into three reactants and showed that T-Rex also achieves the best performance on this task. Collectively, T-Rex utilizes text data generated by language models to boost retrosynthesis prediction, opening up new avenues for using language models in computational chemistry.

2 Preliminaries

Retrosynthesis prediction. In this paper, a molecular graph is described as $G=(\mathbf{V},\mathbf{E})$ with $\mathbf{V}\in\{0,1\}^{n\times n\times b}$ and $\mathbf{E}\in\{0,1\}^{n\times d}$, where $n$, $b$ and $d$ are the number of atoms, the number of bond types and the dimension of node features, respectively. $\mathbf{V}_{ijk}=1$ if and only if atom nodes $i$ and $j$ are connected by a bond of type $k$. $\mathbf{E}$ is the matrix of atom node features.
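As an illustration of this representation, the following minimal sketch builds $\mathbf{V}$ and $\mathbf{E}$ for a toy molecule. The bond-type list, the atom vocabulary and the one-hot atom features are assumptions made for the example; the featurization actually used by G2Gs and T-Rex is richer.

```python
# Toy encoding of a molecular graph G = (V, E) as defined above.
# Assumptions (not from the paper): b = 3 bond types and a one-hot
# atom-type feature over the hypothetical vocabulary ATOM_TYPES.
BOND_TYPES = ["single", "double", "triple"]
ATOM_TYPES = ["C", "N", "O"]


def build_graph(atoms, bonds):
    """Return (V, E) as nested lists.

    atoms: list of element symbols, one per atom.
    bonds: list of (i, j, bond_type_name) tuples.
    """
    n, b, d = len(atoms), len(BOND_TYPES), len(ATOM_TYPES)
    # V[i][j][k] = 1 iff atoms i and j are connected by a bond of type k.
    V = [[[0] * b for _ in range(n)] for _ in range(n)]
    for i, j, name in bonds:
        k = BOND_TYPES.index(name)
        V[i][j][k] = 1
        V[j][i][k] = 1  # bonds are undirected
    # E[i] is the feature vector of atom i (here: one-hot atom type).
    E = [[1 if ATOM_TYPES[t] == a else 0 for t in range(d)] for a in atoms]
    return V, E


# Heavy atoms of acetamide, CC(=O)N: C0-C1, C1=O2, C1-N3.
V, E = build_graph(
    ["C", "C", "O", "N"],
    [(0, 1, "single"), (1, 2, "double"), (1, 3, "single")],
)
```

Here `V` has shape $4\times 4\times 3$ and `E` has shape $4\times 3$, matching $n=4$, $b=3$ and $d=3$.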

A retrosynthesis pair $R$ is represented by a pair of molecular graphs $(G_{p},G_{r})$ . Here $G_{p}=\{G_{p_{c}}\}_{c=1}^{C_{p}}$ are the products and $G_{r}=\{G_{r_{c}}\}_{c=1}^{C_{r}}$ are the reactants. In this paper, we focus on the single product case, therefore $C_{p}=1$ and we simplify the $R$ as $(G_{p},\{G_{r_{c}}\}_{c=1}^{C_{r}})$ .
Reaction center identification. We follow G2Gs Shi et al. (2020) by defining a reaction center as an atom pair $(i,j)$ such that there is no bond between the $i^{th}$ and $j^{th}$ nodes in $G_{r}$ but there is a bond between the $i^{th}$ and $j^{th}$ nodes in $G_{p}$. The task of reaction center identification is to find the reaction centers and thereby obtain the synthons $\{S_{c}\}_{c=1}^{C_{r}}$, which are subgraphs extracted from $G_{p}$ by breaking the corresponding bonds. Since a reaction center can lie on a ring, breaking it may yield only one synthon (i.e., one reactant) for a single product. A synthon may not be a valid molecule; it is transformed into a target reactant by the reactant generation task.
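The definition above can be sketched as a set difference over bonded atom pairs. This is an illustrative sketch only: it assumes that product and reactant atoms share the same indices (i.e., that an atom mapping is given) and ignores bond types and leaving groups.

```python
def bonded_pairs(bonds):
    """Set of unordered atom pairs that share a bond."""
    return {frozenset((i, j)) for i, j in bonds}


def reaction_centers(product_bonds, reactant_bonds):
    """Atom pairs bonded in the product but not in the reactants.

    Both bond lists are assumed to use the same atom indices.
    """
    centers = bonded_pairs(product_bonds) - bonded_pairs(reactant_bonds)
    return sorted(tuple(sorted(pair)) for pair in centers)


# Toy example: the bond between atoms 1 and 3 exists only in the product.
product_bonds = [(0, 1), (1, 2), (1, 3), (3, 4)]
reactant_bonds = [(0, 1), (1, 2), (3, 4)]  # two reactants, shared indices
centers = reaction_centers(product_bonds, reactant_bonds)
```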
Reactant generation. For a retrosynthesis pair $R=(G_{p},\{G_{r_{c}}\}_{c=1}^{C_{r}})$, denote by $\{S_{c}\}_{c=1}^{C_{r}}$ the corresponding synthons of $G_{p}$. The reactant generation task is to predict the reactants $\{G_{r_{c}}\}_{c=1}^{C_{r}}$ from these synthons. Since the same synthon may correspond to different reactants in different retrosynthesis pairs, the information of the product should also be considered; in the G2Gs model it is represented by a latent vector $z$.
Molecular graph feature learning. To encode the overall characteristics of the graph in both tasks, the G2Gs model introduced an L-layer Relational Graph Convolutional Network (R-GCN) Schlichtkrull et al. (2018) to obtain the node embeddings and graph-level embeddings. Let $s\in\mathbb{N}$ be the embedding dimension and $\mathbf{h}^{l}\in\mathbb{R}^{n\times s}$ be the hidden state at the $l^{th}$ layer computed by the R-GCN, where $\mathbf{h}^{0}=\mathbf{E}$. For each layer, the R-GCN computes the hidden state by:

$$\mathbf{h}^{l}=\text{Ag}\left(\text{ReLU}\left(\{\mathbf{D}_{i}\mathbf{h}^{l-1}\mathbf{W}_{i}^{l}\}\mid i\in(1,\cdots,b)\right)\right) \tag{1}$$

where $\mathbf{D}_{i}=\mathbf{V}_{[:,:,i]}+I$ is the adjacency matrix for the $i^{th}$ bond type with self-loops and $\mathbf{W}_{i}^{l}$ is a trainable weight matrix. $\text{Ag}(\cdot)$ is an aggregation function selected from summation, averaging and concatenation. The overall graph embedding $\mathbf{h}_{G}$ is calculated by summing $\mathbf{h}^{L}$ over the nodes, i.e., $\mathbf{h}_{G}=\text{Sum}(\mathbf{h}^{L})$.
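To make Eq. (1) concrete, here is a minimal pure-Python sketch of one R-GCN layer with summation as the aggregation function, followed by the sum readout. The helper names (`matmul`, `rgcn_layer`, `graph_embedding`) and the toy weights are hypothetical; a practical implementation would use a tensor library and learned weights.

```python
def matmul(A, B):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def relu(M):
    return [[max(0.0, x) for x in row] for row in M]


def rgcn_layer(V, h_prev, weights):
    """One layer of Eq. (1), assuming Ag is summation over bond types.

    V: n x n x b adjacency tensor, h_prev: n x s_in hidden state,
    weights: list of b matrices, each s_in x s_out.
    """
    n = len(V)
    s_out = len(weights[0][0])
    h = [[0.0] * s_out for _ in range(n)]
    for k, W in enumerate(weights):
        # D_k = V[:, :, k] + I (adjacency for bond type k with self-loops)
        D = [[V[i][j][k] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
        msg = relu(matmul(matmul(D, h_prev), W))
        h = [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(h, msg)]
    return h


def graph_embedding(h):
    """h_G = Sum(h^L): sum the node embeddings over all atoms."""
    return [sum(col) for col in zip(*h)]


# Two atoms joined by one bond of the only bond type (b = 1).
V = [[[0], [1]], [[1], [0]]]
h0 = [[1.0], [2.0]]
W = [[[1.0, -1.0]]]  # one 1 x 2 weight matrix
h1 = rgcn_layer(V, h0, W)
h_G = graph_embedding(h1)
```

In this toy case each node aggregates both node features ($1+2=3$), the negative output channel is zeroed by the ReLU, and the readout sums the two node rows.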

3 Methodology

Our proposed framework T-Rex is a novel text-assisted retrosynthesis prediction approach. The core idea of T-Rex is to use modern large-scale language models to enhance current molecular structure-based representations. Following previous approaches, we formulate the retrosynthesis task as reaction center identification followed by synthon completion. Notably, reaction center identification is a prerequisite for robust synthon completion and can be a bottleneck for retrosynthesis. Therefore, our approach investigates how textual descriptions can correct mispredicted reaction centers. We introduce a two-stage center identification strategy, based on ranking and re-ranking of candidate reaction centers with their textual descriptions. In the first stage, we integrate the text information into a graph neural network to roughly identify a few candidate reaction centers. In the second stage, we exploit a molecule-text translation framework to train a re-ranking model. The re-ranking model recognizes the correct reaction centers from the list of reaction centers identified by the first-stage model. This two-stage framework enables accurate and robust molecule retrosynthesis, as illustrated in Fig. 2.

3.1 Candidate ranking using ChatGPT

We first denote a retrosynthesis pair as $R=(G_{p},\{G_{r_{c}}\}_{c=1}^{C_{r}})$. In reaction center identification, we use a binary matrix $\mathbf{B}\in\{0,1\}^{n\times n}$ as the labels for reaction centers, where $\mathbf{B}_{ij}=1$ indicates that the chemical bond between the $i^{th}$ node and the $j^{th}$ node is a reaction center. Therefore, the center identification task is formulated as a binary prediction task. The input of this task is the structure information of the product molecule, e.g., the SMILES sequence, and the output is the probability of each bond being a reaction center.

To obtain a high-quality molecule structure representation, we integrate textual information and graph information. We first obtain the text description of the product from ChatGPT, which generates descriptions based on the IUPAC name of the product. The IUPAC name is a unique representation of a chemical structure, based on a set of mapping rules between structures and language phrases, and is closer to natural language than SMILES. Therefore, the IUPAC name can serve as a bridge between chemical molecules and large-scale language models. Here we use the PubChemPy Pub package to generate the IUPAC name for each molecule in our dataset. We then generate the text description with the following prompt:

Please delineate the structural features, functional aspects, and applicable implementations of the molecule {{ NAME }}, commencing with the introduction:"The molecule is {{ NAME }}". Reasoning the most plausible type for synthesizing this molecule in the final step, and offer a rationale for your choice.

where the {{ NAME }} item is replaced by the IUPAC name. For the generated text description, we trained a BERT model to obtain the textual embedding $\mathbf{H}_{t}$. We also utilize an L-layer R-GCN, introduced in Sec. 2, to compute the graph embedding vector $\mathbf{H}_{g}^{L}$:

$$\mathbf{h}^{L}=\text{R-GCN}_{1}(G_{p}),\quad\mathbf{H}_{g}^{L}=\text{Sum}(\mathbf{h}^{L}). \tag{2}$$

After that, we obtain the embedding for each edge by cross concatenation:

$$\mathbf{e}_{ij}=\mathbf{h}_{i}^{L}\,||\,\mathbf{h}_{j}^{L}\,||\,\mathbf{V}_{ij}\,||\,\mathbf{H}_{g}^{L}\,||\,\mathbf{H}_{t} \tag{3}$$

where $||$ stands for vector concatenation, $\mathbf{h}_{i}^{L}\in\mathbb{R}^{s}$ denotes the $i^{th}$ row of $\mathbf{h}^{L}$, representing the features of the $i^{th}$ node, and $\mathbf{V}_{ij}=\mathbf{V}_{[i,j,:]}\in\{0,1\}^{b}$ is the edge type between the $i^{th}$ node and the $j^{th}$ node. When the reaction type is given, we incorporate this information by concatenating its embedding with $\mathbf{e}_{ij}$. The final probability of the chemical bond being the reaction center is computed by:

$$\mathbf{r}_{ij}=\sigma(\text{FF}_{1}(\mathbf{e}_{ij})). \tag{4}$$

Here $\text{FF}_{1}$ denotes a feedforward network and $\sigma$ is the sigmoid activation function. Subsequently, we take the reaction center to be the bond connecting the $i^{th}$ node and the $j^{th}$ node with the highest $\mathbf{r}_{ij}$ value. The model was trained using the cross entropy loss

$$\mathcal{L}_{1}=-\sum_{x\in X}\sum_{i\neq j}\left[\lambda\mathbf{B}_{ij}\log(\mathbf{r}_{ij})+(1-\mathbf{B}_{ij})\log(1-\mathbf{r}_{ij})\right], \tag{5}$$

where $X$ is the training dataset and $\lambda\in[1,+\infty)$ is a weighting hyper-parameter that up-weights the positive term to compensate for the extremely small values of $\mathbf{r}_{ij}$.
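The scoring and loss in Eqs. (4) and (5) can be sketched for a single product as follows. The dictionaries keyed by atom pairs, the toy logits standing in for $\text{FF}_{1}(\mathbf{e}_{ij})$ and the function names are assumptions made for illustration; only $\lambda=20$ is taken from the experimental setting reported later.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def weighted_center_loss(scores, labels, lam=20.0):
    """Eq. (5) for one product.

    scores: dict mapping (i, j) to r_ij in (0, 1).
    labels: dict mapping (i, j) to B_ij in {0, 1}.
    lam: weight on the positive (reaction center) term.
    """
    loss = 0.0
    for pair, r in scores.items():
        b = labels[pair]
        loss -= lam * b * math.log(r) + (1 - b) * math.log(1.0 - r)
    return loss


def top_candidates(scores, k):
    """Candidate reaction centers ranked by r_ij, highest first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy logits standing in for FF_1(e_ij) on three bonds.
logits = {(0, 1): -2.0, (1, 2): 0.5, (1, 3): 1.5}
scores = {pair: sigmoid(z) for pair, z in logits.items()}
labels = {(0, 1): 0, (1, 2): 0, (1, 3): 1}

loss = weighted_center_loss(scores, labels)
ranked = top_candidates(scores, k=2)
```

With $\lambda>1$, a missed reaction center costs more than a false alarm on one of the many non-center bonds, which is the stated purpose of the weighting.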

After center identification, we cut the chemical bonds at the reaction centers to obtain the synthons, where each synthon is a connected subgraph. We then generate the reactant for each synthon using a synthon completion model. We use the same architecture as in Shi et al. (2020) to predict the candidate reactant pairs by beam search. We collect a list of the most probable candidate pairs to train the model in the re-ranking stage.
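The bond-cutting step can be sketched as a connected-components search on the product graph with the reaction-center bond removed. The function name and the index-based graph encoding are assumptions for illustration; the second call shows the ring case, where cutting a ring bond leaves a single synthon.

```python
def synthons(n_atoms, bonds, center):
    """Cut the `center` bond and return the connected components.

    Each component is a sorted list of atom indices (one synthon).
    """
    cut = frozenset(center)
    adj = {i: set() for i in range(n_atoms)}
    for i, j in bonds:
        if frozenset((i, j)) != cut:
            adj[i].add(j)
            adj[j].add(i)
    seen, parts = set(), []
    for start in range(n_atoms):
        if start in seen:
            continue
        # Depth-first search from an unvisited atom.
        stack, comp = [start], []
        seen.add(start)
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        parts.append(sorted(comp))
    return parts


# Acyclic product: cutting bond (1, 3) yields two synthons.
chain = synthons(5, [(0, 1), (1, 2), (1, 3), (3, 4)], center=(1, 3))
# Three-membered ring: cutting a ring bond yields one synthon.
ring = synthons(3, [(0, 1), (1, 2), (2, 0)], center=(0, 1))
```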

3.2 Re-ranking using ChatGPT

In the previous stage, we collected a set of candidate retrosynthesis pairs. However, the generated reactants may not match the ground truth due to the limited capabilities of GNN-based molecule representations. To determine the most probable reactants, we propose a re-ranking stage that utilizes the capability of ChatGPT. Consider a set of retrosynthesis pairs $S=\{R_{i}\}_{i=1}^{k}$ for the product $x$, where $k$ is the beam size in the synthon completion and $R_{i}=(G_{p},\{G_{r_{c}}\}_{c=1}^{C_{r_{i}}})$ is an individual retrosynthesis pair sharing the same product with the other pairs in the set. The re-ranking model is a classifier that identifies the correct reactants among these candidates, all of which were predicted as retrosynthesis paths with high probability in the first stage. The input of the re-ranking classifier has the same form as that of the first-stage model, i.e., the concatenation of text embeddings and molecular graph embeddings. The textual description $T^{i}$ of the reactant candidate $R_{i}$ is generated by ChatGPT using a different prompt:

Please delineate the structural features, functional aspects, and applicable implementations of the molecules {{ NAME }} and possible reactants {{ REACTANT1 }} and {{ REACTANT2 }} to synthesize it. You should generate the descriptions respectively as above example. These descriptions are linked by " [SEP] ", and each commences with the introduction:"The molecule is …".

where {{ NAME }}, {{ REACTANT1 }} and {{ REACTANT2 }} are the IUPAC names of the product and the two reactants. When there is only one reactant, we repeat its IUPAC name twice to make it compatible with our paradigm. We then use a BERT model as the encoder of the textual descriptions:

$$\mathbf{H}_{t}^{i}=\text{BERT}_{2}(T^{i}). \tag{6}$$
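The construction of the re-ranking prompt that produces $T^{i}$, including the single-reactant convention, can be sketched as simple placeholder substitution. The template text is copied from the prompt above; the function name and the example molecule names are illustrative assumptions, and the call to ChatGPT itself is omitted.

```python
RERANK_PROMPT = (
    'Please delineate the structural features, functional aspects, and '
    'applicable implementations of the molecules {{ NAME }} and possible '
    'reactants {{ REACTANT1 }} and {{ REACTANT2 }} to synthesize it. You '
    'should generate the descriptions respectively as above example. These '
    'descriptions are linked by " [SEP] ", and each commences with the '
    'introduction:"The molecule is …".'
)


def fill_rerank_prompt(product_name, reactant_names):
    """Fill the template with IUPAC names.

    A single reactant is repeated so that both slots are filled.
    """
    names = list(reactant_names)
    if len(names) == 1:
        names = names * 2
    if len(names) != 2:
        raise ValueError("expected one or two reactant names")
    return (
        RERANK_PROMPT.replace("{{ NAME }}", product_name)
        .replace("{{ REACTANT1 }}", names[0])
        .replace("{{ REACTANT2 }}", names[1])
    )


# Illustrative names only; they are not taken from the datasets.
two = fill_rerank_prompt("N-methylacetamide", ["acetic acid", "methanamine"])
one = fill_rerank_prompt("cyclohexanol", ["cyclohexanone"])
```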

To embed the molecular graph information, we also utilized the R-GCN structure as the encoder:

$$\mathbf{h}^{L,i}=\text{R-GCN}_{2}(\{G_{r_{c}}\}_{c=1}^{C_{r_{i}}}),\quad\mathbf{H}_{g}^{i}=\text{Sum}(\mathbf{h}^{L,i}). \tag{7}$$

Here we exclude the graph of the product since its information has already been extracted in the previous stage. The embedding for the retrosynthesis pair is then obtained by concatenating the textual and graph embedding vectors:

$$\mathbf{H}^{i}=\mathbf{H}_{g}^{i}\,||\,\mathbf{H}_{t}^{i}. \tag{8}$$

Finally, we get a probability score for $R_{i}$ being recognized by the classifier:

$$a^{i}=\text{Softmax}(\text{FF}_{2}(\mathbf{H}^{i})), \tag{9}$$

where $\text{FF}_{2}$ is another feedforward network and $a^{i}=[a_{-}^{i},a_{+}^{i}]$ is the softmax output of the classifier. The confidence score, i.e., the probability that candidate $R_{i}$ is correct, is $a_{+}^{i}$.
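A minimal sketch of how the confidence score $a_{+}^{i}$ can be used to re-rank candidates is given below. The two-element logit vectors stand in for $\text{FF}_{2}(\mathbf{H}^{i})$ and the candidate identifiers are placeholders; both are assumptions for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rerank(candidates, logits):
    """Sort candidates by a_+ (probability of the positive class).

    candidates: list of candidate identifiers.
    logits: list of [negative, positive] logits, one pair per candidate.
    Returns (candidate, a_plus) tuples, highest a_plus first.
    """
    scored = [(c, softmax(z)[1]) for c, z in zip(candidates, logits)]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Three first-stage candidates with toy classifier logits.
ranked = rerank(["R1", "R2", "R3"], [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]])
```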

In our implementation, we selected the top 3 reactant candidates from the first-stage model as the inputs to the re-ranking model. We label the candidates matching the ground truth as positive examples and the others as negative examples. Notably, we observed a significant imbalance in the label distribution: two-thirds of the generated candidates have negative labels. To increase the number of positive samples in the training set of the re-ranking model, we additionally add the ground truth reactants with positive labels. We then train the re-ranking model as a binary classifier using the cross entropy loss. Additionally, we incorporate a contrastive learning loss term to optimize the molecule representations. The total loss is computed as follows:

$$\mathcal{L}_{2}=\sum_{S\in\mathbb{S}}\sum_{i}\Big[\text{CELoss}(a_{S}^{i},label_{S}^{i})-\alpha\sum_{j\neq i}\log\frac{\exp(\text{sim}(\mathbf{H}_{t}^{i},\mathbf{H}_{t}^{j}))}{\sum_{m}\exp(\text{sim}(\mathbf{H}_{t}^{i},\mathbf{H}_{t}^{m}))}\Big], \tag{10}$$

where $\text{sim}(\mathbf{u},\mathbf{v})=\frac{\mathbf{u}^{T}\mathbf{v}}{||\mathbf{u}||\cdot||\mathbf{v}||}$ is the cosine similarity function and $label_{S}^{i}$ is the label for the $i^{th}$ retrosynthesis pair in set $S$. The hyper-parameter $\alpha$ balances the two loss terms.
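The loss in Eq. (10) can be sketched for a single candidate set $S$ as follows. Two points are assumptions rather than statements from the text: the index $m$ in the denominator is taken to range over all candidates in the set (including $i$), and the classifier outputs are passed in as probabilities. The value $\alpha=0.2$ follows the experimental setting reported later.

```python
import math


def cosine(u, v):
    """Cosine similarity sim(u, v)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_term(text_embs):
    """-sum_i sum_{j != i} log(exp(sim_ij) / sum_m exp(sim_im)).

    Assumption: m ranges over every candidate in the set, including i.
    """
    n = len(text_embs)
    total = 0.0
    for i in range(n):
        denom = sum(math.exp(cosine(text_embs[i], text_embs[m])) for m in range(n))
        for j in range(n):
            if j != i:
                total -= math.log(math.exp(cosine(text_embs[i], text_embs[j])) / denom)
    return total


def rerank_loss(probs, labels, text_embs, alpha=0.2):
    """Cross entropy plus alpha times the contrastive term for one set S.

    probs: list of [a_minus, a_plus] per candidate; labels: 0 or 1.
    """
    ce = sum(-math.log(p[y]) for p, y in zip(probs, labels))
    return ce + alpha * contrastive_term(text_embs)


# Two candidates with identical toy text embeddings.
embs = [[1.0, 0.0], [1.0, 0.0]]
loss = rerank_loss([[0.5, 0.5], [0.5, 0.5]], [1, 0], embs)
```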

During the inference phase, we compute the probability for each reactant pair among the top $k$ generated candidates. In the evaluation, we calculate the top-$k^{\prime}$ exact match accuracy by considering the reactant pairs with the $k^{\prime}$ largest scores. In our implementation, we evaluate top $k^{\prime}=1,3,5$ accuracy based on the top $k=3,5,10$ candidates, respectively.
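The top-$k^{\prime}$ exact match accuracy can be sketched as below. The sketch assumes that each prediction and each ground truth is a canonicalized reactant string, so that exact match reduces to string equality; the example strings are illustrative.

```python
def top_k_accuracy(ranked_predictions, ground_truths, k):
    """Fraction of products whose ground truth is among the top-k predictions.

    ranked_predictions: one list of reactant strings per product,
        sorted by re-ranking score (best first).
    ground_truths: one ground-truth reactant string per product.
    """
    hits = sum(
        1
        for preds, truth in zip(ranked_predictions, ground_truths)
        if truth in preds[:k]
    )
    return hits / len(ground_truths)


# Three products with three ranked candidates each (toy strings).
ranked_predictions = [
    ["CC(=O)O.CN", "CC(=O)Cl.CN", "CC=O.CN"],
    ["CCO.CC(=O)O", "CCBr.CC(=O)O", "CCO.CC(=O)Cl"],
    ["c1ccccc1.CCl", "c1ccccc1.CBr", "c1ccccc1.CI"],
]
ground_truths = ["CC(=O)O.CN", "CCBr.CC(=O)O", "c1ccccc1.CO"]

top1 = top_k_accuracy(ranked_predictions, ground_truths, k=1)
top3 = top_k_accuracy(ranked_predictions, ground_truths, k=3)
```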

4 Experimental setting

We evaluated our method and the comparison approaches on two widely-used retrosynthesis prediction datasets: USPTO-50k Coley et al. (2017b) and the filtered USPTO-MIT Jin et al. (2017). Reactions that appear in both datasets are excluded from USPTO-MIT. The USPTO-50k dataset contains 50,016 reactions with 10 reaction types. The USPTO-MIT dataset does not contain reaction type information. In USPTO-MIT, we only considered reactions that have one product and two reactants, which results in 62,212 reactions. We followed previous work Shi et al. (2020) to split the retrosynthesis pairs into 80% training, 10% validation and 10% test using cross-validation.

We compared our method to three other template-free approaches. Seq2Seq Liu et al. (2017) and Transformer Karpov et al. (2019) are two template-free end-to-end frameworks that directly generate SMILES strings of reactants from product SMILES strings. G2Gs Shi et al. (2020) is one of the state-of-the-art template-free approaches and splits the prediction process into center identification and reactant generation. We obtained the results of the G2Gs model from https://torchdrug.ai/docs/benchmark/retrosynthesis.html, which are slightly different from those in Shi et al. (2020). Following G2Gs Shi et al. (2020), we use the top-k exact match accuracy as our evaluation metric, where k is chosen from 1, 3, 5, and 10. Moreover, we also compute the center identification accuracy to examine the benefit of using text embeddings.

We used RDKit Landrum et al. (2016) to preprocess SMILES strings. We used the Transformers library to generate text with MolT5 Edwards et al. (2022) and to obtain text embeddings with BERT-small Bhargava et al. (2021); Turc et al. (2019) and PubMedBERT Gu et al. (2020). We also used the API for the gpt-3.5-turbo-0301 model provided by OpenAI to generate more detailed descriptions of the products, with no frequency or presence penalty, the temperature set to 0.7 and the maximum number of tokens set to 512. We implement the R-GCN with 4 layers, each with an embedding size of 512. The graph embedding sizes for the center identification and synthon completion tasks are set to 1536. We set $\lambda$ to 20 and the beam size to 10 in the candidate generation stage, and $\alpha$ to 0.2 in the re-ranking stage. We optimize our model with Adam Kingma and Ba (2014). The training hyperparameters are set as follows. We set the batch size to 32 for center identification, 128 for synthon completion and 8 for the re-ranking stage. The learning rates for the two stages are set to 1e-3 and 1e-5, respectively, and the warm-up ratio to 0.1. The weight decay is set to 0.01. We trained the center identification model, the synthon completion model and the re-ranking model for 50, 10 and 10 epochs, respectively. All experiments are carried out on a single NVIDIA RTX A4000.
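For reference, the hyperparameters listed above can be collected in a single configuration object. The values are taken from the paragraph above; the dictionary layout and key names are hypothetical and do not correspond to the configuration format of the released code.

```python
# Hyperparameters as stated in the text; key names are illustrative only.
CONFIG = {
    "chatgpt": {
        "model": "gpt-3.5-turbo-0301",
        "temperature": 0.7,
        "max_tokens": 512,
        "frequency_penalty": 0.0,
        "presence_penalty": 0.0,
    },
    "rgcn": {"num_layers": 4, "embedding_size": 512, "graph_embedding_size": 1536},
    "candidate_generation": {
        "lambda": 20,
        "beam_size": 10,
        "learning_rate": 1e-3,
        "batch_size": {"center_identification": 32, "synthon_completion": 128},
        "epochs": {"center_identification": 50, "synthon_completion": 10},
    },
    "re_ranking": {
        "alpha": 0.2,
        "learning_rate": 1e-5,
        "batch_size": 8,
        "epochs": 10,
    },
    "optimizer": {"name": "Adam", "warmup_ratio": 0.1, "weight_decay": 0.01},
}
```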

5 Experimental results

5.1 Substantial improvement on retrosynthesis prediction

We first evaluated the retrosynthesis prediction performance of T-Rex. We calculated the top-k accuracy of the proposed approach on the USPTO-50k and filtered USPTO-MIT datasets in both the reaction-class-known and reaction-class-unknown settings. The results of the T-Rex, G2Gs and G2Gs+Text models when the reaction class was not given are summarized in Fig. 3 and Appendix A. We first noticed that G2Gs+Text outperformed the state-of-the-art template-free model G2Gs on Top 1 and Top 3 accuracy and achieved comparable performance on Top 5 and Top 10 accuracy on both datasets. This result demonstrates the effectiveness of incorporating text descriptions generated by ChatGPT into the graph-to-graph retrosynthesis model. Furthermore, our T-Rex model exhibits superior performance on top-1 and top-3 accuracy on both datasets compared to the G2Gs+Text model, highlighting the benefits of the re-ranking stage in the T-Rex model.

5.2 Cross-dataset prediction

We next examine the generalizability of our T-Rex model by evaluating cross-dataset retrosynthesis prediction. Specifically, we trained a model on USPTO-50k to predict retrosynthesis on USPTO-MIT (Table 1). Since USPTO-MIT and USPTO-50k have disjoint reaction types, these two datasets present substantially different distributions. We found that T-Rex achieved the best results on Top 1 and Top 3 accuracy in this cross-dataset setting. In the reverse direction (training on USPTO-MIT and testing on USPTO-50k), our method also obtained the best performance on Top 1 and Top 3 accuracy, reflecting the good generalizability of our T-Rex model.

Next, we conducted a series of incremental experiments by training on the entire set of USPTO-50k and part of USPTO-MIT, and testing on the remaining USPTO-MIT. We summarized the performance in Fig. 4. T-Rex consistently outperformed other approaches with regard to different proportions of USPTO-MIT training set. This further confirms the strong generalizability of T-Rex and the benefit of using text for retrosynthesis prediction.

Finally, since USPTO-MIT does not have reaction type information, we exploited T-Rex to predict the reaction type. Specifically, we used USPTO-50k to train a reaction type classifier and then used this classifier to predict the reaction type for each target product in a multi-class classification setting. We then incorporated this reaction type into our framework. We found that this can further improve the retrosynthesis prediction performance (Section 5.4), again demonstrating the ability of T-Rex to jointly learn from heterogeneous datasets.

5.3 Ablation studies for the re-ranking stage

After observing substantial improvement of T-Rex over the G2Gs+Text baseline, we then conducted ablation studies for different components in the re-ranking stage. In particular, we assess the variants without text embedding (denoted as T-Rex w/o text), without graph embedding (denoted as T-Rex w/o graph), without contrastive learning loss on USPTO-50k (denoted as T-Rex w/o c.l.l.) or with the texts generated by MolT5 (denoted as T-Rex(MolT5)). The results of top 1, top 3 and top 5 exact match accuracy are shown in Table 2.

Models	Top-1	Top-3	Top-5	Top-10
USPTO-50k $\rightarrow$ USPTO-MIT
G2Gs	0.171	0.292	0.338	0.396
G2Gs+Text	0.166	0.306	0.347	0.391
T-Rex	0.204	0.307	0.329	0.391
USPTO-MIT $\rightarrow$ USPTO-50k
G2Gs	0.233	0.424	0.479	0.548
G2Gs+Text	0.199	0.383	0.442	0.499
T-Rex	0.300	0.431	0.456	0.499

Table 1: Results of the cross-dataset prediction (top-k exact match accuracy).

USPTO-50k	Top-1	Top-3	Top-5
T-Rex w/o text	0.453	0.633	0.647
T-Rex w/o graph	0.434	0.677	0.707
T-Rex w/o c.l.l.	0.491	0.705	0.740
T-Rex(MolT5)	0.509	0.712	0.755
T-Rex	0.523	0.735	0.785

Table 2: Results of the ablation study for different components in the re-ranking stage (top-k exact match accuracy).

We found that all four components are important, since the full T-Rex outperformed all four variants. The inclusion of text information contributes most substantially to Top 3 and Top 5 accuracy, as shown by the large drops for the variant that does not use text information. Moreover, the variant that uses MolT5 instead of ChatGPT to generate textual descriptions performs worse than T-Rex (e.g., 0.509 versus 0.523 Top-1 accuracy), indicating the importance of using high-quality text in our framework and further supporting the benefits of text information for retrosynthesis prediction.

5.4 Transferring Test on Predicted Reaction Type

We also tried to improve the performance of our T-Rex model in the transferring test with the help of reaction type. Specifically, we started by training a reaction-type classifier on the USPTO-50k dataset. We then employed this classifier to predict the reaction types within the filtered USPTO-MIT dataset, even in cases where certain reactions cannot be definitively classified under any specific reaction type.

We have explored various potential classifier architectures, encompassing different combinations of graph features solely pertaining to the product (denoted as g.p.), graphic and textual information associated with the product (denoted as g.p.+t.p.), as well as graphic features of the reactants and product in conjunction with textual information specific to the product (denoted as g.p.+t.p.+g.r.). To make the experiments comparable, we only gave the reaction type information in the re-ranking stage. The top 1, 3, and 5 exact match accuracies are listed in Table 3.

USPTO-MIT	Top-1	Top-3	Top-5
g.p.	0.568	0.701	0.719
g.p.+t.p.	0.568	0.700	0.718
g.p.+t.p.+g.r.	0.574	0.706	0.722
T-Rex(no reaction type)	0.523	0.735	0.785

Table 3: Results of the models trained on filtered USPTO-MIT dataset with the predicted reaction types (top-k exact match accuracy, reported as fractions).

We can see that with the help of predicted reaction types, the top-1 exact match accuracy improves substantially (from 0.523 to as much as 0.574), demonstrating the potential of T-Rex to incorporate information from new sources.

5.5 Selection of Text Embedding Model

In our experimental setting, we used two types of BERT: BERT-small and PubMedBERT. We also ran experiments with only one type of BERT on USPTO-50k; the results are presented in Table 4. The choice of embedding model also affects the performance of the retrosynthesis prediction task, with the combination of both models performing best.

USPTO-50k	Top-1	Top-3	Top-5
BERT-small-only	0.502	0.712	0.753
PubMedBERT-only	0.509	0.713	0.756
BERT-small + PubMedBERT	0.523	0.735	0.785

Table 4: Results of the models trained on USPTO-50k for different BERT type combinations (top-k exact match accuracy, reported as fractions).

5.6 Deep Investigation on Re-ranking Stage

In Figure 3, we noticed that T-Rex achieved a large improvement over G2Gs on top-1 accuracy, but smaller gains on top-3, top-5 and top-10 accuracy. We therefore investigated the re-ranking stage more closely from two aspects. First, the increase in top-1 accuracy results from the re-ranking stage. We recorded the accuracies before and after re-ranking for each set of retrosynthesis pairs. Among ground truth retrosynthesis pairs ranked first in the candidate ranking stage, 87.3% remain top-1 after the re-ranking stage. Among ground truth pairs ranked second or third in the first stage, 52.3% are promoted to top-1 in the second stage. Second, when we used only graph features in the re-ranking stage, the top-1 and top-3 accuracies were only 45.3% and 63.3%, respectively (Table 2), showing that text information is important for improving accuracy. Overall, we conclude that the large improvement in top-1 accuracy may result from the architecture of the ranker and the information from ChatGPT used in the re-ranking stage.

5.7 Generalization to Rare Reaction Types

To assess the generalizability of the T-Rex model, we computed accuracies for each of the 10 reaction types in one direction of the cross-dataset experiments (USPTO-MIT $\rightarrow$ USPTO-50k). T-Rex consistently outperforms G2Gs in top-1 accuracy across all reaction types. The margin is especially large for some rare reaction types, including oxidations, functional group interconversion (FGI) and reductions. The top-1 accuracies of G2Gs and T-Rex for these reaction types within the test dataset, along with the corresponding proportions of these types, are presented in Table 5. T-Rex exhibits substantially higher accuracy than G2Gs for these types, showing its potential for generalization to rare reaction types.

Reaction type (proportion in test dataset, %)	Accuracy of G2Gs (%)	Accuracy of T-Rex (%)
Oxidations (1.52)	33.8	48.6
FGI (3.52)	11.6	21.5
Reductions (9.38)	3.28	44.8

Table 5: The accuracies for G2Gs and T-Rex for rare reaction types of oxidations, FGI as well as reductions.

Moreover, for top-5 and top-10 accuracies, we noticed that G2Gs can slightly outperform T-Rex for some reaction types. For example, for acylation and related processes, which account for 24.3% of the test dataset, the top-5 and top-10 accuracies for G2Gs are 64.8% and 70.2%, while those for T-Rex are 63.0% and 66.2%. However, for some rare reaction types, T-Rex instead outperforms G2Gs by a large margin. For example, for functional group addition (FGA) reactions (0.52% of the test dataset), the top-5 and top-10 accuracies for G2Gs are both 0% while those for T-Rex are both 12%; for FGI reactions (3.52% of the test dataset), the top-5 and top-10 accuracies for G2Gs are 26.7% and 34.3%, while those for T-Rex are 30.3% and 40.2%. These types again illustrate the potential of T-Rex to generalize to rare reaction types.

5.8 Case Study

We present t-SNE visualizations of the embeddings of G2Gs, G2Gs with text description (G2Gs+Text), and our T-Rex model in Fig. 5. Here, G2Gs and G2Gs+Text show the embeddings from the reaction center identification stage, while T-Rex shows the embeddings from the re-ranking stage, which are unavailable for the other two methods. We can see that the embeddings of both the G2Gs and G2Gs+Text models do not show visible patterns, while our T-Rex shows a clear pattern for the retrosynthesis pairs, again demonstrating the effectiveness of the re-ranking stage. Furthermore, we visualize an example in which the re-ranking stage successfully corrects a mispredicted reaction center by using the text information (Fig. 6). More examples are presented in Section E in the Appendix.

6 Related Work

Due to the large search space in retrosynthesis, many machine learning approaches have been developed for retrosynthesis prediction. Existing approaches can be classified into template-based and template-free models. Template-based models rely on reaction templates, which are either hand-crafted by experts Hartenfeller et al. (2011); Szymkuć et al. (2016) or extracted automatically from large databases Coley et al. (2017a); Law et al. (2009). Prior works have exploited various techniques to select from multiple candidate templates Dai et al. (2019); Coley et al. (2017b); Segler and Waller (2017); Chen and Jung (2021); Zhu et al. (2023); Baylon et al. (2019). Nevertheless, the process of identifying templates is extremely time-consuming. To address this issue, template-free methods Liu et al. (2017); Karpov et al. (2019); Tu and Coley (2022); Sacha et al. (2021a); Somnath et al. (2020); Sun et al. (2020); Sacha et al. (2021b) utilize the SMILES Weininger (1988) representation to formulate the task as a sequence-to-sequence generation problem. G2Gs Shi et al. (2020) converts SMILES strings to graph structures in order to capture the intrinsically rich chemical contexts. To the best of our knowledge, none of these approaches has incorporated text data, especially text generated by pre-trained language models, into retrosynthesis prediction. In contrast, our T-Rex framework exploits text descriptions of molecules as additional features for modeling molecular graphs, achieving substantial improvement on two retrosynthesis prediction datasets.

ChatGPT and similar large models have recently transformed natural language processing Leiter et al. (2023); Bordt and von Luxburg (2023); Desaire et al. (2023); Siu (2023); Uludag (2023); Juhi et al. (2023); Bubeck et al. (2023). Due to their success in NLP, there is growing interest in utilizing ChatGPT to aid scientific research Auer et al. (2023); Castro Nascimento and Pimentel (2023), such as drug discovery Sharma and Thakur (2023); Liang et al. (2023), computational materials science Hong (2023), and chemistry Mahjour et al. (2023). Nevertheless, the majority of these approaches focus on text data only, overlooking the possibility of using ChatGPT to advance the modeling of non-text data modalities in scientific domains. In contrast, we investigated how ChatGPT can improve retrosynthesis prediction by combining the textual descriptions generated by ChatGPT with molecular graphs.

7 Conclusion

In this paper, we have introduced T-Rex, a novel retrosynthesis prediction model that utilizes textual descriptions generated by large language models. Specifically, we have integrated the text embeddings generated by ChatGPT into the graph model. We have also introduced a re-ranking stage to further improve the performance. The experimental results show that incorporating text information and the re-ranking stage enables the model to outperform existing template-free models on both datasets.

To the best of our knowledge, our framework is the first attempt to leverage text descriptions for retrosynthesis prediction. Since large language models are trained on massive-scale corpora, they often possess powerful reasoning and analysis capabilities. We envision that our method will motivate future research on using text language models for other tasks in computational chemistry.

References

(1) Pubchempy: https://pubchempy.readthedocs.io/.
Auer et al. (2023) Sören Auer, Dante AC Barone, Cassiano Bartz, Eduardo G Cortes, Mohamad Yaser Jaradeh, Oliver Karras, Manolis Koubarakis, Dmitry Mouromtsev, Dmitrii Pliukhin, Daniil Radyush, et al. 2023. The sciqa scientific question answering benchmark for scholarly knowledge. Scientific Reports, 13(1):7240.
Baylon et al. (2019) Javier L Baylon, Nicholas A Cilfone, Jeffrey R Gulcher, and Thomas W Chittenden. 2019. Enhancing retrosynthetic reaction prediction with deep learning using multiscale reaction classification. Journal of chemical information and modeling, 59(2):673–688.
Bhargava et al. (2021) Prajjwal Bhargava, Aleksandr Drozd, and Anna Rogers. 2021. Generalization in nli: Ways (not) to go beyond simple heuristics.
Bordt and von Luxburg (2023) Sebastian Bordt and Ulrike von Luxburg. 2023. Chatgpt participates in a computer science exam. arXiv preprint arXiv:2303.09461.
Bubeck et al. (2023) Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. 2023. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712.
Castro Nascimento and Pimentel (2023) Cayque Monteiro Castro Nascimento and André Silva Pimentel. 2023. Do large language models understand chemistry? a conversation with chatgpt. Journal of Chemical Information and Modeling, 63(6):1649–1655.
Chen and Jung (2021) Shuan Chen and Yousung Jung. 2021. Deep retrosynthetic reaction prediction using local reactivity and global attention. JACS Au, 1(10):1612–1620.
Coley et al. (2017a) Connor W Coley, Regina Barzilay, Tommi S Jaakkola, William H Green, and Klavs F Jensen. 2017a. Prediction of organic reaction outcomes using machine learning. ACS central science, 3(5):434–443.
Coley et al. (2018) Connor W Coley, William H Green, and Klavs F Jensen. 2018. Machine learning in computer-aided synthesis planning. Accounts of chemical research, 51(5):1281–1289.
Coley et al. (2017b) Connor W Coley, Luke Rogers, William H Green, and Klavs F Jensen. 2017b. Computer-assisted retrosynthesis based on molecular similarity. ACS central science, 3(12):1237–1245.
Corey (1988) EJ Corey. 1988. Robert robinson lecture. retrosynthetic thinking—essentials and examples. Chemical society reviews, 17:111–133.
Dai et al. (2019) Hanjun Dai, Chengtao Li, Connor Coley, Bo Dai, and Le Song. 2019. Retrosynthesis prediction with conditional graph logic network. Advances in Neural Information Processing Systems, 32.
Desaire et al. (2023) Heather Desaire, Aleesa E Chua, Madeline Isom, Romana Jarosova, and David Hua. 2023. Chatgpt or academic scientist? distinguishing authorship with over 99% accuracy using off-the-shelf machine learning tools. arXiv preprint arXiv:2303.16352.
Edwards et al. (2022) Carl Edwards, Tuan Lai, Kevin Ros, Garrett Honke, Kyunghyun Cho, and Heng Ji. 2022. Translation between molecules and natural language. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 375–413, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Gu et al. (2020) Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. 2020. Domain-specific language model pretraining for biomedical natural language processing.
Hartenfeller et al. (2011) Markus Hartenfeller, Martin Eberle, Peter Meier, Cristina Nieto-Oberhuber, Karl-Heinz Altmann, Gisbert Schneider, Edgar Jacoby, and Steffen Renner. 2011. A collection of robust organic synthesis reactions for in silico molecule design. Journal of chemical information and modeling, 51(12):3093–3098.
Hong (2023) Zijian Hong. 2023. Chatgpt for computational materials science: A perspective. Energy Material Advances, 4:0026.
Jiang et al. (2022) Yinjie Jiang, Yemin Yu, Ming Kong, Yu Mei, Luotian Yuan, Zhengxing Huang, Kun Kuang, Zhihua Wang, Huaxiu Yao, James Zou, et al. 2022. Artificial intelligence for retrosynthesis prediction. Engineering.
Jin et al. (2017) Wengong Jin, Connor Coley, Regina Barzilay, and Tommi Jaakkola. 2017. Predicting organic reaction outcomes with weisfeiler-lehman network. Advances in neural information processing systems, 30.
Juhi et al. (2023) Ayesha Juhi, Neha Pipil, Soumya Santra, Shaikat Mondal, Joshil Kumar Behera, Himel Mondal, Neha Pipil Sr, Soumya Santra Sr, and Joshil K Behera IV. 2023. The capability of chatgpt in predicting and explaining common drug-drug interactions. Cureus, 15(3).
Karpov et al. (2019) Pavel Karpov, Guillaume Godin, and Igor V Tetko. 2019. A transformer model for retrosynthesis. In Artificial Neural Networks and Machine Learning–ICANN 2019: Workshop and Special Sessions: 28th International Conference on Artificial Neural Networks, Munich, Germany, September 17–19, 2019, Proceedings, pages 817–830. Springer.
Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Landrum et al. (2016) Greg Landrum et al. 2016. Rdkit: Open-source cheminformatics software.
Law et al. (2009) James Law, Zsolt Zsoldos, Aniko Simon, Darryl Reid, Yang Liu, Sing Yoong Khew, A Peter Johnson, Sarah Major, Robert A Wade, and Howard Y Ando. 2009. Route designer: a retrosynthetic analysis tool utilizing automated retrosynthetic rule generation. Journal of chemical information and modeling, 49(3):593–602.
Leiter et al. (2023) Christoph Leiter, Ran Zhang, Yanran Chen, Jonas Belouadi, Daniil Larionov, Vivian Fresen, and Steffen Eger. 2023. Chatgpt: A meta-analysis after 2.5 months. arXiv preprint arXiv:2302.13795.
Liang et al. (2023) Youwei Liang, Ruiyi Zhang, Li Zhang, and Pengtao Xie. 2023. Drugchat: Towards enabling chatgpt-like capabilities on drug molecule graphs.
Liu et al. (2017) Bowen Liu, Bharath Ramsundar, Prasad Kawthekar, Jade Shi, Joseph Gomes, Quang Luu Nguyen, Stephen Ho, Jack Sloane, Paul Wender, and Vijay Pande. 2017. Retrosynthetic reaction prediction using neural sequence-to-sequence models. ACS central science, 3(10):1103–1113.
Mahjour et al. (2023) Babak Mahjour, Jillian Hoffstadt, and Tim Cernak. 2023. Designing chemical reaction arrays using phactor and chatgpt.
Sacha et al. (2021a) Mikołaj Sacha, Mikołaj Błaz, Piotr Byrski, Paweł Dabrowski-Tumanski, Mikołaj Chrominski, Rafał Loska, Paweł Włodarczyk-Pruszynski, and Stanisław Jastrzebski. 2021a. Molecule edit graph attention network: modeling chemical reactions as sequences of graph edits. Journal of Chemical Information and Modeling, 61(7):3273–3284.
Sacha et al. (2021b) Mikołaj Sacha, Mikołaj Błaz, Piotr Byrski, Paweł Dabrowski-Tumanski, Mikołaj Chrominski, Rafał Loska, Paweł Włodarczyk-Pruszynski, and Stanisław Jastrzebski. 2021b. Molecule edit graph attention network: modeling chemical reactions as sequences of graph edits. Journal of Chemical Information and Modeling, 61(7):3273–3284.
Schlichtkrull et al. (2018) Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne Van Den Berg, Ivan Titov, and Max Welling. 2018. Modeling relational data with graph convolutional networks. In The Semantic Web: 15th International Conference, ESWC 2018, Heraklion, Crete, Greece, June 3–7, 2018, Proceedings 15, pages 593–607. Springer.
Segler et al. (2018) Marwin HS Segler, Mike Preuss, and Mark P Waller. 2018. Planning chemical syntheses with deep neural networks and symbolic ai. Nature, 555(7698):604–610.
Segler and Waller (2017) Marwin HS Segler and Mark P Waller. 2017. Neural-symbolic machine learning for retrosynthesis and reaction prediction. Chemistry–A European Journal, 23(25):5966–5971.
Sharma and Thakur (2023) Gaurav Sharma and Abhishek Thakur. 2023. Chatgpt in drug discovery.
Shi et al. (2020) Chence Shi, Minkai Xu, Hongyu Guo, Ming Zhang, and Jian Tang. 2020. A graph to graphs framework for retrosynthesis prediction. In International conference on machine learning, pages 8818–8827. PMLR.
Siu (2023) Sai Cheong Siu. 2023. Chatgpt and gpt-4 for professional translators: Exploring the potential of large language models in translation. Available at SSRN 4448091.
Somnath et al. (2020) Vignesh Ram Somnath, Charlotte Bunne, Connor W Coley, Andreas Krause, and Regina Barzilay. 2020. Learning graph models for template-free retrosynthesis. arXiv preprint arXiv:2006.07038.
Strieth-Kalthoff et al. (2020) Felix Strieth-Kalthoff, Frederik Sandfort, Marwin HS Segler, and Frank Glorius. 2020. Machine learning the ropes: principles, applications and directions in synthetic chemistry. Chemical Society Reviews, 49(17):6154–6168.
Sun et al. (2020) Ruoxi Sun, Hanjun Dai, Li Li, Steven Kearnes, and Bo Dai. 2020. Energy-based view of retrosynthesis. arXiv preprint arXiv:2007.13437.
Szymkuć et al. (2016) Sara Szymkuć, Ewa P Gajewska, Tomasz Klucznik, Karol Molga, Piotr Dittwald, Michał Startek, Michał Bajczyk, and Bartosz A Grzybowski. 2016. Computer-assisted synthetic planning: the end of the beginning. Angewandte Chemie International Edition, 55(20):5904–5937.
Tu and Coley (2022) Zhengkai Tu and Connor W Coley. 2022. Permutation invariant graph-to-sequence model for template-free retrosynthesis and reaction prediction. Journal of chemical information and modeling, 62(15):3503–3513.
Turc et al. (2019) Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Well-read students learn better: The impact of student initialization on knowledge distillation. CoRR, abs/1908.08962.
Uludag (2023) Kadir Uludag. 2023. The use of ai-supported chatbot in psychology. Available at SSRN 4331367.
Weininger (1988) David Weininger. 1988. Smiles, a chemical language and information system. 1. introduction to methodology and encoding rules. Journal of chemical information and computer sciences, 28(1):31–36.
Yan et al. (2020) Chaochao Yan, Qianggang Ding, Peilin Zhao, Shuangjia Zheng, Jinyu Yang, Yang Yu, and Junzhou Huang. 2020. Retroxpert: Decompose retrosynthesis prediction like a chemist. Advances in Neural Information Processing Systems, 33:11248–11258.
Zhu et al. (2023) Jinhua Zhu, Kehan Wu, Bohan Wang, Yingce Xia, Shufang Xie, Qi Meng, Lijun Wu, Tao Qin, Wengang Zhou, Houqiang Li, et al. 2023. $\mathcal{O}$ -gnn: incorporating ring priors into molecular modeling. In The Eleventh International Conference on Learning Representations.

Appendix A Detailed Results

We display the detailed results of the compared models on the USPTO-50k and filtered USPTO-MIT datasets in Tables 6, 7 and 8. The results of the G2Gs model are taken from https://torchdrug.ai/docs/benchmark/retrosynthesis.html and differ slightly from those in Shi et al. (2020).
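For reference, the top-k exact match accuracy reported in these tables is assumed here to follow the usual definition, under which a prediction counts as correct when the ground-truth reactant set appears among the first $k$ ranked candidates. The symbols $N$ (number of test reactions), $R_i$ (ground-truth reactants of the $i$-th product) and $\hat{R}_i^{(j)}$ (the $j$-th ranked candidate) are introduced only for this statement and do not appear elsewhere in the paper.

```latex
\mathrm{Acc@}k \;=\; \frac{1}{N} \sum_{i=1}^{N}
  \mathbb{1}\!\left[ R_i \in \left\{ \hat{R}_i^{(1)}, \dots, \hat{R}_i^{(k)} \right\} \right]
```

Under this definition the accuracy cannot decrease as $k$ grows, which matches the monotone rows in Tables 6, 7 and 8.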

USPTO-50k	Accuracy
USPTO-50k	Top-1	Top-3	Top-5	Top-10
Template-free
Transformer	0.379	0.573	0.627	$\backslash$
RetroXpert	0.502	0.613	0.623	0.635
MEGAN	0.479	0.712	0.785	0.857
G2Gs	0.437	0.676	0.769	0.831
G2Gs+Text	0.441	0.705	0.784	0.852
T-Rex	0.523	0.735	0.785	0.852
Template-based
GraphRetro	0.537	0.683	0.722	0.755
LocalRetro	0.534	0.775	0.859	0.924
$\mathcal{O}$ -GNN	0.541	0.777	0.860	0.925

Table 6: Top-k exact match accuracy for USPTO-50k dataset when reaction class is not given for template-free models including Transformer Karpov et al. (2019), RetroXpert Yan et al. (2020), MEGAN Sacha et al. (2021b), G2Gs Shi et al. (2020) as well as T-Rex. We also include some state-of-the-art template-based models including GraphRetro Somnath et al. (2020), LocalRetro Chen and Jung (2021) and $\mathcal{O}$-GNN Zhu et al. (2023) for comparison.

USPTO-MIT	Accuracy
USPTO-MIT	Top-1	Top-3	Top-5	Top-10
G2Gs	0.417	0.630	0.698	0.750
G2Gs+Text	0.420	0.631	0.697	0.753
T-Rex	0.538	0.675	0.694	0.753

Table 7: Top-k exact match accuracy for USPTO-MIT dataset.

USPTO-50k	Accuracy
USPTO-50k	Top-1	Top-3	Top-5	Top-10
Template-free
Seq2seq	0.374	0.524	0.570	0.617
MEGAN	0.616	0.831	0.881	0.925
G2Gs	0.625	0.849	0.904	0.935
T-Rex	0.630	0.858	0.912	0.959
Template-based
GraphRetro	0.639	0.815	0.852	0.881
LocalRetro	0.639	0.868	0.924	0.963
$\mathcal{O}$ -GNN	0.657	0.877	0.934	0.969

Table 8: Top-k exact match accuracy for USPTO-50k dataset when reaction class is given for template-free models including Seq2Seq Liu et al. (2017), MEGAN Sacha et al. (2021b), G2Gs Shi et al. (2020) as well as T-Rex. Although the performance of G2Gs reported in the paper and on the website differ, G2Gs outperforms MEGAN on top-k accuracy for USPTO-50k when reaction class is given, so we still regard G2Gs as one of the state-of-the-art models. We also include some state-of-the-art template-based models including GraphRetro Somnath et al. (2020), LocalRetro Chen and Jung (2021) and $\mathcal{O}$-GNN Zhu et al. (2023) for comparison.

Appendix B Discussion about the Potential Data Leakage Issue

To investigate potential data leakage from ChatGPT, we conducted an in-depth analysis of the text it generated. Only 3% of the generated texts mention one of the reactant names, and only 0.1% mention both names. Notably, none of the texts mention the SMILES of any reactant. Furthermore, deleting these specific items led to only a marginal decrease of 0.2% in top-1 accuracy on the test dataset. We also employed MolT5, an alternative model for generating molecule descriptions; the texts generated by MolT5 do not mention any reactants. The findings presented in Table 2 underscore the significance of text information quality. In summary, the improvement of our method comes primarily from the text information rather than from potential data leakage of ChatGPT.
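A leakage check of this kind can be approximated by simple string matching between each generated text and the names and SMILES of the ground-truth reactants. The sketch below illustrates the idea under stated assumptions and is not the analysis script used for the numbers above: the function `leakage_stats`, the dictionary keys, and the toy samples are hypothetical. Note that `any_name` counts texts mentioning at least one reactant name, so it includes texts that mention all of them.

```python
def leakage_stats(samples):
    """Fraction of texts that name any reactant, all reactants, or any reactant SMILES.

    Each sample is a dict with keys "text", "reactant_names" and
    "reactant_smiles" (a hypothetical layout chosen for this sketch).
    """
    if not samples:
        raise ValueError("samples must not be empty")
    n = len(samples)
    any_name = all_names = any_smiles = 0
    for s in samples:
        text = s["text"].lower()
        # Names are matched case-insensitively as plain substrings.
        named = [name.lower() in text for name in s["reactant_names"]]
        any_name += any(named)
        all_names += bool(named) and all(named)
        # SMILES are case-sensitive, so match against the raw text.
        any_smiles += any(smi in s["text"] for smi in s["reactant_smiles"])
    return {
        "any_name": any_name / n,
        "all_names": all_names / n,
        "any_smiles": any_smiles / n,
    }


# Toy samples for illustration; the texts are invented, not ChatGPT output.
samples = [
    {
        "text": "It can be made from bromomethylbenzene.",
        "reactant_names": ["bromomethylbenzene", "2-bromobenzene-1,4-diol"],
        "reactant_smiles": ["BrCc1ccccc1", "Oc1ccc(O)c(Br)c1"],
    },
    {
        "text": "A white crystalline solid.",
        "reactant_names": ["pyridin-4-ylboronic acid"],
        "reactant_smiles": ["OB(O)c1ccncc1"],
    },
]
stats = leakage_stats(samples)
```

Substring matching is a deliberate simplification. Very short SMILES such as `O` would match almost any text, so a practical check would need a minimum length or token-level matching, and synonyms of a reactant name would be missed.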

Appendix C Stereochemistry Issue

In many prior retrosynthesis prediction studies, results have commonly been reported while ignoring stereochemistry, so that all stereoisomers of the ground truth are deemed correct. In the main body of this paper, we adopt the same assumption to ensure a meaningful comparison with earlier works, including G2Gs. However, for a fair comparison with other models such as LocalRetro Chen and Jung (2021), we also report results on USPTO-50k with stereochemistry taken into account in Table 9.

USPTO-50k	Accuracy
USPTO-50k	Top-1	Top-3	Top-5	Top-10
G2Gs	0.425	0.661	0.749	0.813
G2Gs+Text	0.429	0.688	0.764	0.831
T-Rex	0.510	0.717	0.766	0.831

Table 9: Top-k exact match accuracy for USPTO-50k dataset when reaction class is not given and stereochemistry is taken into account.

The table shows that, with stereochemistry taken into account, T-Rex still outperforms G2Gs on top-1, top-3, top-5, and top-10 accuracy.
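The difference between the two evaluation settings can be illustrated with a crude, string-level sketch. It assumes that reactants are given as `.`-separated SMILES written by the same toolkit, and it approximates "ignoring stereochemistry" by deleting the SMILES stereo markers `@`, `/` and `\` before comparison. The helper names are hypothetical, and a cheminformatics toolkit such as RDKit would be the more robust choice in practice; this is not the procedure used to produce Table 9.

```python
def strip_stereo(smiles):
    """Remove SMILES stereo markers: '@' (chirality) and '/', '\\' (double-bond geometry)."""
    return "".join(ch for ch in smiles if ch not in "@/\\")


def same_ignoring_stereo(pred, truth):
    """Compare two '.'-separated reactant sets, ignoring stereo markers and order."""
    def key(s):
        return sorted(strip_stereo(part) for part in s.split("."))
    return key(pred) == key(truth)


def same_with_stereo(pred, truth):
    """Compare two '.'-separated reactant sets, keeping stereo markers."""
    return sorted(pred.split(".")) == sorted(truth.split("."))


# Two enantiomers of alanine plus methanol, listed in different order.
pred = "N[C@@H](C)C(=O)O.CO"
truth = "CO.N[C@H](C)C(=O)O"
relaxed = same_ignoring_stereo(pred, truth)
strict = same_with_stereo(pred, truth)
```

In this example `relaxed` is true while `strict` is false, so the prediction counts as correct only in the stereo-agnostic setting. This is why accuracies in Table 9 are no higher than the corresponding ones in Table 6.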

Appendix D Limitations of T-Rex

Our T-Rex model currently has the following limitations. First, using the ChatGPT API and the MolT5 model to generate auxiliary text information requires a large number of ChatGPT API calls, resulting in substantial costs. Meanwhile, it takes 9 hours for the MolT5 model to generate the texts for the classifier stage on a single NVIDIA RTX A4000, and this stage achieves only a moderate accuracy of 70% for reaction type prediction. Therefore, we plan to exploit active learning to reduce the number of large language model calls. Moreover, our approach can only be applied to one-step retrosynthesis prediction for single products. It would be interesting to extend our T-Rex framework to support multi-step retrosynthesis and explore how ChatGPT can be used in different steps.

Appendix E Examples of Generated Texts

In this section, we present several examples of text generated by ChatGPT along with accompanying analyses. Additionally, we juxtapose MolT5-generated texts with those generated by ChatGPT to highlight both the remarkable capabilities and limitations of ChatGPT.

Item	Value
Product SMILES	Oc1ccc(OCc2ccccc2)c(Br)c1
Product IUPAC name	3-bromo-4-phenylmethoxyphenol (4-(Benzyloxy)-2-bromophenol)
Reactant SMILES	BrCc1ccccc1 + Oc1ccc(O)c(Br)c1
Reactant IUPAC names	bromomethylbenzene + 2-bromobenzene-1,4-diol
MolT5-generated text	The molecule is a cyclic ketone that is then2(2-bromo)-derivative of cyclopropane. It has a role as a metabolite and a mouse metabolite. It is a cyclic ketone and a spiro compound. It derives from a cyclopropane. Problem with MolT5-generated text: it is not a cyclic ketone and has nothing to do with a spiro compound. Evidently, it is not a derivative of cyclopropane either.
ChatGPT-generated text	3-Bromo-4-phenylmethoxyphenol is an organic compound with a molecular formula of C10H9BrO. It is a white crystalline solid with a melting point of 212-214°C. Its structure consists of a benzene ring with an -OH group attached to the 4th carbon atom, a bromine atom bonded to the 3rd carbon atom, and a -OCH3 group attached to the 4th carbon atom. The most likely reaction to synthesize 3-bromo-4-phenylmethoxyphenol would involve the following steps: 1. Start with benzene and treat it with a halogenating agent such as bromine to add a bromine atom to the 3rd carbon of the benzene ring. This will form 1,3-dibromobenzene. 2. Reduce the 1,3-dibromobenzene with a reducing agent such as zinc to form 3-bromophenol. 3. Treat the 3-bromophenol with an alcohol in presence of an acid catalyst such as sulfuric acid to form 3-bromo-4-phenylmethoxyphenol. Analysis of the ChatGPT-generated text: ChatGPT identified some of the physical features, but the melting point is wrong (it is actually 72-73°C according to www.chemicalbook.com). Moreover, its analysis of the molecular structure is not entirely correct. For predicting the synthetic route, it identified a possible reaction precursor: 3-bromophenol. However, in practice, it is hard to directly add a large organic group to the benzene ring, so the route does not match the ground-truth reaction. Therefore, we think that ChatGPT can provide useful information for local structures, but it is challenging for ChatGPT to identify more detailed information about the product.

Table 10: Example of 3-bromo-4-phenylmethoxyphenol (4-(Benzyloxy)-2-bromophenol).

Item	Value
Product SMILES	O=C1CN(c2ccncc2)CCN1c1ccc(OCc2ccccc2)cc1
Product IUPAC name	1-(4-phenylmethoxyphenyl)-4-pyridin-4-ylpiperazin-2-one
Reactant SMILES	Brc1ccc(OCc2ccccc2)cc1 + O=C1CN(c2ccncc2)CCN1
Reactant IUPAC names	1-bromo-4-phenylmethoxybenzene + 4-pyridin-4-ylpiperazin-2-one
MolT5-generated text	The molecule is a tertiary amino compound that is up toc2c(naCc3cccc3) with an additional cinchoyl group. It has a role as a vasodilator agent, an anti-ulcer drug and a metabolite. It is a tertiary amino compound and an epoxide. It derives from a hydride of a nipecoside. Problem with MolT5-generated text: it is neither a tertiary amino compound nor an epoxide. Moreover, it contains meaningless words such as nipecoside and cinchoyl.
ChatGPT-generated text	1-(4-Phenylmethoxyphenyl)-4-pyridin-4-ylpiperazin-2-one is an organic compound composed of several functional groups. It has a phenyl group attached to a methoxy group, which is attached to a pyridin-4-yl group, which is attached to a piperazin-2-one group. The most possible reaction to synthesize this compound step by step would be a sequence of nucleophilic substitution and cyclization reactions. Firstly, a nucleophilic substitution of a halogenated phenol with a pyridinium chloride would form a pyridinium salt. This salt can then be reacted with a primary amine in a cyclization reaction to form a pyridine intermediate. Finally, this pyridine intermediate can be reacted with a methoxybenzene in a nucleophilic substitution reaction to form the target compound. Analysis of the ChatGPT-generated text: ChatGPT identified the phenylmethoxyphenyl group and the pyridin-4-yl group, which attach to the piperazin-2-one group. However, its description contains minor errors about the group structure. Moreover, it predicted a pyridine intermediate as a potential reaction precursor and correctly indicated that the final step is a nucleophilic substitution reaction. However, it mistakenly regards methoxybenzene as one of the reactants, while 1-bromo-4-phenylmethoxybenzene is the true reactant.

Table 11: Example of 1-(4-phenylmethoxyphenyl)-4-pyridin-4-ylpiperazin-2-one.

Item	Value
Product SMILES	Cn1c(-c2cc(C(F)(F)F)cc(C(F)(F)F)c2)nc(-c2ccncc2)c1C(=O)N1CCC(N2CCCC2)CC1
Product IUPAC name	[2-[3,5-bis(trifluoromethyl)phenyl]-3-methyl-5-pyridin-4-ylimidazol-4-yl]-(4-pyrrolidin-1-ylpiperidin-1-yl)methanone
Reactant SMILES	Cn1c(-c2cc(C(F)(F)F)cc(C(F)(F)F)c2)nc(I)c1C(=O)N1CCC(N2CCCC2)CC1 + OB(O)c1ccncc1
Reactant IUPAC names	[2-[3,5-bis(trifluoromethyl)phenyl]-5-iodo-3-methylimidazol-4-yl]-(4-pyrrolidin-1-ylpiperidin-1-yl)methanone + pyridin-4-ylboronic acid
MolT5-generated text	The molecule is a cyclic ketone that is nitrilo substituted at position 2 by a methyl group. It is a cyclic ketone, an organofluorine compound, a pyrrolidinone and a monocarboxylic acid amide. It derives from a nitrilol. Problem with MolT5-generated text: it is not a cyclic ketone and has nothing to do with nitrilo substitution or nitrilol.
ChatGPT-generated text	The structure of 2-[3,5-bis(trifluoromethyl)phenyl]-3-methyl-5-pyridin-4-ylimidazol-4-yl]-(4-pyrrolidin-1-ylpiperidin-1-yl)methanone is a heterocyclic compound composed of two rings, an imidazole ring and a pyridine ring, both of which contain nitrogen atoms. The two rings are connected by a methylene bridge. The imidazole ring is further substituted with a 3,5-bis(trifluoromethyl)phenyl group and the pyridine ring is substituted with a 4-pyrrolidin-1-ylpiperidin-1-yl group. The most likely reaction to synthesize this compound would be an Ullmann-type coupling reaction. This would involve the reaction of an aryl halide (3,5-bis(trifluoromethyl)phenyl bromide) with a primary alkyl halide (4-pyrrolidin-1-ylpiperidin-1-yl bromide) in the presence of a copper catalyst, giving the alkyl aryl ether as a product. This alkyl aryl ether can then be reacted with an imidazole-4-carbaldehyde to produce the desired 2-[3,5-bis(trifluoromethyl)phenyl]-3-methyl-5-pyridin-4-ylimidazol-4-yl]-(4-pyrrolidin-1-ylpiperidin-1-yl)methanone. Analysis of the ChatGPT-generated text: ChatGPT identifies two of the heterocycles: an imidazole ring and a pyridine ring. It also points out the accurate positions of 3,5-bis(trifluoromethyl)phenyl group and 4-pyrrolidin-1-ylpiperidin-1-yl group. It demonstrated a possible synthetic route using Ullmann-type coupling reaction. However, this route does not match the ground truth which is a Suzuki reaction. Therefore, we think that ChatGPT can provide useful information for local structures, but cannot identify information related to global structures, such as reaction type and synthetic route.

Table 12: Example of [2-[3,5-bis(trifluoromethyl)phenyl]-3-methyl-5-pyridin-4-ylimidazol-4-yl]-(4-pyrrolidin-1-ylpiperidin-1-yl)methanone.