
A Regressor-Guided Graph Diffusion Model for Predicting Enzyme Mutations to Enhance Turnover Number

Xiaozhu Yu1,2,§, Kai Yi1,3, Yu Guang Wang1,4, Yiqing Shen1,5,*
1Toursun Synbio, Shanghai, China
2Pratt School of Engineering, Duke University, Durham, USA
3MRC Laboratory of Molecular Biology, University of Cambridge, Cambridge, UK
4Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai, China
5Department of Computer Science, Johns Hopkins University, Baltimore, USA
§Work done during an internship at Toursun Synbio. *Corresponding author.
[email protected]  [email protected]
Abstract

Enzymes are biological catalysts that accelerate chemical reactions relative to the uncatalyzed reactions in aqueous environments. Their catalytic efficiency is quantified by the turnover number ($k_{cat}$), a key parameter in enzyme kinetics. Enhancing enzyme activity is important for optimizing slow chemical reactions, with far-reaching implications for both research and industrial applications. However, traditional wet-lab methods for measuring and optimizing enzyme activity are often resource-intensive and time-consuming. To address these limitations, we introduce $k_{cat}$Diffuser, a novel regressor-guided diffusion model designed to predict and improve enzyme turnover numbers. Our approach innovatively reformulates enzyme mutation prediction as a protein inverse folding task, thereby establishing a direct link between structural prediction and functional optimization. $k_{cat}$Diffuser is a graph diffusion model guided by a regressor, enabling the prediction of amino acid mutations at multiple random positions simultaneously. Evaluations on the BRENDA dataset show that $k_{cat}$Diffuser achieves a $\Delta\log k_{cat}$ of 0.209, outperforming state-of-the-art methods such as ProteinMPNN, PiFold, and GraDe-IF in improving enzyme turnover numbers. Additionally, $k_{cat}$Diffuser maintains high structural fidelity with a recovery rate of 0.716, a pLDDT score of 92.515, an RMSD of 3.764, and a TM-score of 0.934, demonstrating its ability to generate enzyme variants with enhanced activity while preserving essential structural properties. Overall, $k_{cat}$Diffuser represents a more efficient and targeted approach to enhancing enzyme activity. The code is available at https://github.com/xz32yu/KcatDiffuser.

Index Terms:
Enzyme Engineering, Turnover Number, Diffusion Models, Graph Neural Networks, Protein Inverse Folding.

I Introduction

Enzymes are biological catalysts that accelerate chemical reactions and maintain the cellular metabolic processes essential for life [1]. These protein molecules lower the activation energy of biochemical reactions, enabling them to occur at rates compatible with cellular function. The efficiency of enzyme catalysis is often quantified by the turnover number ($k_{cat}$), a key parameter in enzyme kinetics that provides insights into cellular metabolism, proteome allocation, and physiological diversity [2]. $k_{cat}$ represents the maximum number of substrate molecules converted to product per enzyme molecule per unit time under saturating substrate conditions. However, experimental determination of $k_{cat}$ is both time-consuming and resource-intensive, requiring purified enzymes and specialized equipment [3]. This limitation has led to a scarcity of experimentally measured $k_{cat}$ values: less than 1% of the enzymes listed in the UniProt database have experimentally determined $k_{cat}$ values [5].

Recent advancements in artificial intelligence, particularly deep learning, have led to models capable of predicting enzyme activity from various inputs. For instance, DLKcat predicts metabolic enzyme activity from substrate structure and enzyme sequence [4], using a convolutional neural network to process protein sequences and a graph neural network to process substrate structures. Building upon this work, DeepEnzyme incorporates enzyme protein structure as an additional input to enhance prediction accuracy [5]. By leveraging integrated features from both sequences and 3D structures, DeepEnzyme demonstrates improved robustness on enzymes with low sequence similarity to the training dataset [5]. While these models have shown promise in predicting $k_{cat}$ values, they primarily estimate existing enzyme activities rather than addressing the crucial challenge of improving enzyme activity.

In the field of protein design, models such as Evolutionary Scale Modeling-1v (ESM-1v) have been developed to predict the effects of protein variants on a wide range of properties [6]. These models leverage large-scale protein sequence data to learn evolutionary patterns and make zero-shot predictions of mutational effects across diverse proteins with different functions [6]. While ESM-1v and similar models have shown promise in predicting variant effects, they are not specifically designed to optimize enzyme kinetic parameters like the turnover number ($k_{cat}$) or to suggest mutations for enhancing enzyme activity. Traditional methods that rely on single or double amino acid substitutions often fail to achieve significant improvements in enzyme activity due to the complex interdependencies within protein structures. To address these limitations, we propose a model capable of modifying amino acids at multiple random positions simultaneously, a task well-suited for diffusion models [7]. Diffusion models have shown promise in protein design. Notably, GraDe-IF [7] has emerged as a powerful model for inverse protein folding, which aims to preserve a given protein backbone while generating new amino acid sequences. GraDe-IF's ability to produce diverse sequences while preserving structural integrity makes it an ideal starting point for our work on enzyme mutation. By adapting and extending the principles of GraDe-IF, we develop a diffusion-based model specifically tailored to enhance enzyme turnover numbers.

The major contributions of this work are threefold. Firstly, we reformulate enzyme mutation prediction for optimizing the turnover number as a protein inverse folding task, thereby establishing a direct link between structural prediction and functional optimization. Secondly, we introduce a regressor-guided graph diffusion model, named $k_{cat}$Diffuser, designed to enhance the turnover number ($k_{cat}$), and we integrate it with an efficient DDIM-based sampling scheme that allows larger step sizes and faster generation of enzyme variants. $k_{cat}$Diffuser enables modification of amino acids at multiple random positions simultaneously, overcoming the limitations of traditional site-directed mutation prediction methods. Finally, we train our model on the BRENDA enzyme dataset, ensuring its applicability to a wide range of enzymatic systems and demonstrating its potential for generalizable enzyme optimization.

Figure 1: Overview of $k_{cat}$Diffuser. The framework combines an inverse protein folding diffusion model with a $k_{cat}$ regressor for guided sampling. The input consists of a substrate, a protein sequence, and a protein structure. The inverse folding component uses a graph-based diffusion model to generate new amino acid sequences. Concurrently, the $k_{cat}$ predictor (regressor) estimates the turnover number, providing a guidance signal that steers the denoising process towards sequences with potentially higher $k_{cat}$ values. The regressor guidance is implemented through a gradient-based approach, pushing the sampling towards regions of higher predicted $k_{cat}$.

II Methods

II-A Problem Formulation

This work aims to develop a method for generating enzyme variants that enhance the turnover number ($k_{cat}$). We formulate it as an inverse protein folding problem with an additional optimization objective, imposed during the sampling stage through a regressor guidance signal. Given a protein structure represented by its backbone coordinates $\boldsymbol{X}^{pos}=\{x^{pos}_{1},\ldots,x^{pos}_{n}\}$, where $n$ is the number of amino acids, our task is to predict a set of feasible amino acid sequences $\boldsymbol{X}^{aa}=\{x^{aa}_{1},\ldots,x^{aa}_{n}\}$ that can fold into the given structure, while simultaneously identifying sequences likely to exhibit improved $k_{cat}$ values compared to the wild-type enzyme. We approach this problem by modeling the conditional probability distribution $p(\boldsymbol{X}^{aa}|\boldsymbol{X}^{pos})$, which represents the likelihood of amino acid sequences given the backbone structure. To incorporate the optimization of $k_{cat}$, we introduce a regressor function $g_{\eta}(\boldsymbol{X}^{aa},\boldsymbol{X}^{pos})\rightarrow k_{cat}$ that predicts the turnover number for a given sequence and structure. To solve this problem, we develop $k_{cat}$Diffuser, a regressor-guided graph diffusion model, as depicted in Fig. 1. The model combines a protein inverse folding diffusion model $p_{\theta}(\boldsymbol{X}^{aa}|\boldsymbol{X}^{pos})$, which generates diverse amino acid sequences compatible with the given backbone structure, with a regressor $g_{\eta}$ that guides the sampling process towards sequences with potentially higher $k_{cat}$ values. In our framework, we represent the protein structure as a graph $G=\{\boldsymbol{X},\boldsymbol{A},\boldsymbol{E}\}$, where $\boldsymbol{X}$ includes positional, amino acid type, and physicochemical property information, $\boldsymbol{A}$ is the adjacency matrix, and $\boldsymbol{E}$ captures edge features. This graph representation captures the complex spatial relationships within the protein structure. By integrating these components, $k_{cat}$Diffuser aims to generate enzyme variants that not only maintain the desired protein structure but also exhibit enhanced catalytic efficiency as measured by $k_{cat}$.

II-B Protein Graph Construction and Feature Encoding

To implement $k_{cat}$Diffuser, we represent proteins as graphs $G=\{\boldsymbol{X},\boldsymbol{A},\boldsymbol{E}\}$ by converting PDB files into this format [7]. The node features $\boldsymbol{X}$ are defined as:

\boldsymbol{X}=[\boldsymbol{X}^{pos},\boldsymbol{X}^{aa},\boldsymbol{X}^{prop}],   (1)

where $\boldsymbol{X}^{pos}$ denotes the 3D coordinates of the $\alpha$-carbon, $\boldsymbol{X}^{aa}$ is a one-hot encoded vector of the amino acid type, and $\boldsymbol{X}^{prop}$ represents physicochemical properties. Edge attributes $\boldsymbol{E}$ capture spatial and chemical relationships between connected residues, i.e.,

\boldsymbol{E}=[d_{ij},\Delta pos_{ij},\phi_{ij}],   (2)

where $d_{ij}$ is the distance between residues $i$ and $j$, $\Delta pos_{ij}$ is their relative position, and $\phi_{ij}$ encodes dihedral angles. The graph construction uses a k-nearest neighbor (kNN) method, establishing connections between amino acids within a 30 Å radius to preserve the protein's tertiary structure while creating a computationally tractable representation. The corresponding adjacency matrix $\boldsymbol{A}$ is defined by

A_{ij}=\begin{cases}1 & \text{if } |\boldsymbol{X}^{pos}_{i}-\boldsymbol{X}^{pos}_{j}|<30\,\text{\AA} \text{ and } j\in\text{kNN}(i)\\ 0 & \text{otherwise}\end{cases}.   (3)

For a richer graph representation, we further incorporate protein backbone information, including dihedral angles ($\psi$, $\phi$) and secondary structure elements (ss), encoded as:

\boldsymbol{X}_{backbone}=[\cos(\psi),\sin(\psi),\cos(\phi),\sin(\phi),\text{one\_hot}(\text{ss})].   (4)
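As a concrete illustration of Eqs. (1)-(3), the following minimal Python sketch assembles one-hot node features and the kNN adjacency matrix from α-carbon coordinates. Only the kNN scheme and the 30 Å cutoff come from the text; the neighbor count k, the omission of the physicochemical and backbone features of Eqs. (1) and (4), and all function names are illustrative assumptions.

```python
import numpy as np

def build_protein_graph(ca_coords, aa_indices, k=30, cutoff=30.0):
    """Sketch of the graph construction in Eqs. (1)-(3).

    ca_coords: (n, 3) array of alpha-carbon coordinates (X^pos).
    aa_indices: length-n array of integer amino acid types (0..19).
    k and cutoff: assumed kNN size and the 30 Angstrom radius from Eq. (3).
    """
    n = len(ca_coords)
    # Pairwise alpha-carbon distances d_ij.
    dist = np.linalg.norm(ca_coords[:, None, :] - ca_coords[None, :, :], axis=-1)

    # One-hot amino acid types X^aa, part of the node features in Eq. (1).
    x_aa = np.eye(20)[aa_indices]

    # Adjacency (Eq. 3): j is among the k nearest neighbors of i and within 30 A.
    adj = np.zeros((n, n), dtype=np.int8)
    for i in range(n):
        order = np.argsort(dist[i])[1:k + 1]          # skip self at index 0
        neighbors = [j for j in order if dist[i, j] < cutoff]
        adj[i, neighbors] = 1

    # Edge features (Eq. 2), with the dihedral term omitted for brevity:
    # distance d_ij and relative displacement Delta pos_ij.
    src, dst = np.nonzero(adj)
    edge_attr = np.concatenate(
        [dist[src, dst, None], ca_coords[dst] - ca_coords[src]], axis=-1)
    return x_aa, adj, edge_attr
```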

II-C Protein Inverse Folding Diffusion Model

In the protein inverse folding diffusion model $p_{\theta}$, the diffusion process gradually adds noise to the amino acid types $\boldsymbol{X}^{aa}$ over $T$ timesteps, transforming the original sequence into a uniform distribution. This process is defined by a forward transition probability $q(\boldsymbol{x}_{t}|\boldsymbol{x}_{t-1})$, where $\boldsymbol{x}_{t}$ represents the noisy amino acid types at timestep $t$. The reverse denoising process, parameterized by $\theta$, aims to recover the original sequence $\boldsymbol{X}^{aa}$ through iterative refinement:

p_{\theta}(\boldsymbol{x}_{t-1}|\boldsymbol{x}_{t})\propto\sum_{\boldsymbol{x}^{aa}}q(\boldsymbol{x}_{t-1}|\boldsymbol{x}_{t},\boldsymbol{x}^{aa})\cdot p_{\theta}(\boldsymbol{x}^{aa}|\boldsymbol{x}_{t}),   (5)

where $p_{\theta}(\boldsymbol{x}^{aa}|\boldsymbol{x}_{t})$ is predicted by an equivariant graph neural network (EGNN) serving as the denoising network, which takes as input the noisy protein graph together with additional structural information, including backbone dihedral angles and secondary structure elements. To accelerate sampling, we employ Denoising Diffusion Implicit Models (DDIM) [34], which allow larger step sizes in the sampling process:

p_{\theta}(\boldsymbol{x}_{t-k}|\boldsymbol{x}_{t})\propto\Big(\sum_{\boldsymbol{x}^{aa}}q(\boldsymbol{x}_{t-k}|\boldsymbol{x}_{t},\boldsymbol{x}^{aa})\,\hat{p}(\boldsymbol{x}^{aa}|\boldsymbol{x}_{t})\Big)^{T},   (6)

where $T$ controls the time step. The multi-step posterior $q(\boldsymbol{x}_{t-k}|\boldsymbol{x}_{t},\boldsymbol{x}^{aa})$ is computed using the cumulative transition matrices:

q(\boldsymbol{x}_{t-k}|\boldsymbol{x}_{t},\boldsymbol{x}^{aa})=\text{Cat}\Big(\boldsymbol{x}_{t-k}\,\Big|\,\frac{\boldsymbol{x}_{t}Q^{T}_{t}\cdots Q^{T}_{t-k}\odot\boldsymbol{x}^{aa}\bar{Q}_{t-k}}{\boldsymbol{x}^{aa}\bar{Q}_{t}\boldsymbol{x}^{T}_{t}}\Big),   (7)

where $Q_{t}$ is the transition matrix at time $t$, and $\bar{Q}_{t}$ is the cumulative transition matrix up to time $t$. In this equation, $\text{Cat}(\cdot)$ denotes the categorical distribution, while $\odot$ represents element-wise multiplication (Hadamard product). The fraction inside $\text{Cat}(\cdot)$ gives the unnormalized probabilities of the categorical distribution, with $Q^{T}_{t}$ being the transpose of the transition matrix at time $t$.
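For intuition, the posterior in Eq. (7) is a Bayes-rule computation over 20-way categorical variables that can be evaluated per residue from precomputed transition matrices. The Python sketch below assumes `Q_list[s]` holds the single-step matrix $Q_s$ and `Q_bar[s]` the cumulative product up to step $s$; these names and the exact index bookkeeping are illustrative assumptions.

```python
import numpy as np

def multi_step_posterior(x_t, x_aa, Q_list, Q_bar, t, k):
    """Evaluate q(x_{t-k} | x_t, x^aa) from Eq. (7) for one residue.

    x_t, x_aa: one-hot vectors of length 20 (noisy and original types).
    Q_list[s]: single-step transition matrix Q_s; Q_bar[s]: Q_1 ... Q_s.
    """
    # Backward factor x_t Q_t^T ... accumulated over the k reverse steps.
    back = x_t.copy()
    for s in range(t, t - k, -1):
        back = back @ Q_list[s].T
    # Forward factor: the original sequence diffused to step t-k.
    fwd = x_aa @ Q_bar[t - k]
    # Elementwise (Hadamard) product, normalized by the scalar denominator
    # x^aa Q_bar_t x_t^T, yields the categorical probabilities.
    return (back * fwd) / (x_aa @ Q_bar[t] @ x_t)
```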

II-D Regressor-Guided Graph Diffusion Sampling for $k_{cat}$ Optimization

We introduce a regressor-guided diffusion sampling scheme that combines a $k_{cat}$ regressor with the protein inverse folding diffusion model to efficiently generate enzyme variants with potentially improved $k_{cat}$ values.

Regressor

The regressor $g_{\eta}$ combines a Transformer [32] and a GCN [33] to extract features from protein 1D sequences and 3D structures, respectively. The Transformer branch processes overlapping n-grams of amino acids through an embedding layer and a Transformer encoder, while the GCN branch processes the 3D structure inputs of the protein graph and substrate. The regressor $g_{\eta}$ then fuses these representations to predict the $k_{cat}$ value. We train the learnable weights $\eta$ on the BRENDA dataset of enzyme kinetic data.
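A minimal PyTorch sketch of this two-branch regressor is given below. The hidden size (64), head count (4), and layer count (3) follow Sec. III-A; the n-gram vocabulary size, the single GCN layer, the mean pooling, and the fusion head are illustrative assumptions not specified in the text.

```python
import torch
import torch.nn as nn

class KcatRegressor(nn.Module):
    """Sketch of g_eta: Transformer over amino acid n-grams fused with a
    simple GCN over the protein graph, regressing log k_cat."""

    def __init__(self, vocab_size=8000, hidden=64, heads=4, layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)  # n-gram vocabulary (assumed size)
        enc_layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.node_proj = nn.Linear(20, hidden)         # one-hot residue -> hidden
        self.gcn_weight = nn.Linear(hidden, hidden)    # one GCN layer: A_hat X W
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, ngram_ids, node_feats, adj):
        # Sequence branch: mean-pooled Transformer features over n-gram tokens.
        seq = self.encoder(self.embed(ngram_ids)).mean(dim=1)
        # Structure branch: one GCN layer with self-loops and row normalization.
        a_hat = adj + torch.eye(adj.size(-1), device=adj.device)
        a_hat = a_hat / a_hat.sum(-1, keepdim=True)
        h = torch.relu(self.gcn_weight(a_hat @ self.node_proj(node_feats)))
        struct = h.mean(dim=1)
        # Fuse both views and regress the turnover number.
        return self.head(torch.cat([seq, struct], dim=-1)).squeeze(-1)
```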

Figure 2: Case study comparing proteins generated by different models. Each row represents a distinct enzyme (EC numbers shown on the left). Columns show results from ProteinMPNN, PiFold, GraDe-IF, and $k_{cat}$Diffuser (without and with regressor guidance). Green structures represent the original proteins, while cyan structures are the generated variants. Performance metrics are provided for each case, including $\Delta\log k_{cat}$, recovery rate, pLDDT, RMSD, and TM-score.

Regressor-guided Sampling

We use the regressor $g_{\eta}$ to guide the unconditional protein inverse folding diffusion model $p_{\theta}$. The prediction $\hat{\boldsymbol{y}}=g_{\eta}(G_{t})$ for the target property $\boldsymbol{y}_{G}$ of a protein graph $G$ is obtained from a noisy version $G_{t}$ of that graph. We assume that the conditional probability of the noisy sequence given the target property equals the unconditional probability. This simplification allows us to factorize the joint probability and incorporate the regressor's guidance effectively.

Under this assumption, we have $\dot{q}(\boldsymbol{x}_{t-k}|\boldsymbol{x}_{t},\boldsymbol{x}^{aa},\boldsymbol{y}_{G})=\dot{q}(\boldsymbol{x}_{t-k}|\boldsymbol{x}_{t},\boldsymbol{x}^{aa})$, which yields

\dot{q}(\boldsymbol{x}_{t-k}|\boldsymbol{x}_{t},\boldsymbol{x}^{aa},\boldsymbol{y}_{G})\propto q(\boldsymbol{x}_{t-k}|\boldsymbol{x}_{t},\boldsymbol{x}^{aa})\,\dot{q}(\boldsymbol{y}_{G}|\boldsymbol{x}_{t-k}),   (8)

where $\boldsymbol{y}_{G}$ indicates the target properties of a protein graph $G$, $\dot{q}$ denotes the noising process conditioned on $\boldsymbol{y}_{G}$, and $q$ denotes the unconditional noising process. This equation expresses the probability of a less noisy sequence $\boldsymbol{x}_{t-k}$ given the current noisy sequence $\boldsymbol{x}_{t}$, the original sequence $\boldsymbol{x}^{aa}$, and the target property $\boldsymbol{y}_{G}$: it combines the unconditional noising process $q$ with the conditional probability of the target property given the less noisy sequence. To make this formulation computationally tractable, we employ a first-order approximation that linearizes the log-probability around the current point:

\begin{split}\log\dot{q}(\boldsymbol{y}_{G}|\boldsymbol{x}_{t-k},\boldsymbol{x}^{aa})&\approx\log\dot{q}(\boldsymbol{y}_{G}|\boldsymbol{x}_{t},\boldsymbol{x}^{aa})+\lambda\big\langle\nabla_{\boldsymbol{x}}\log\dot{q}(\boldsymbol{y}_{G}|\boldsymbol{x}_{t-k},\boldsymbol{x}^{aa}),\,\boldsymbol{x}_{t-k}-\boldsymbol{x}_{t}\big\rangle\\&\approx c(\boldsymbol{x}_{t},\boldsymbol{x}^{aa})+\lambda\sum_{1\leq i\leq n}\big\langle\nabla_{x_{i}}\log\dot{q}(\boldsymbol{y}_{G}|\boldsymbol{x}_{t-k},\boldsymbol{x}^{aa}),\,\boldsymbol{x}_{i,t-k}\big\rangle\end{split}

where $\lambda$ controls the extent to which the regressor guidance influences the outcome, and $c$ is a function independent of $\boldsymbol{x}_{t-k}$. Finally, assuming that the conditional probability of the target property follows a Gaussian distribution centered at the regressor's prediction, i.e., $\dot{q}(\boldsymbol{y}_{G}|\boldsymbol{x}_{t},\boldsymbol{x}^{aa})=\mathcal{N}(g(\boldsymbol{x}_{t}),\sigma_{y}\boldsymbol{I})$, we can express the gradient of the log-probability with respect to the protein graph as:

\nabla_{G_{t}}\log\dot{q}_{\eta}(\boldsymbol{y}|G_{t})\propto-\nabla_{G_{t}}\|\hat{\boldsymbol{y}}-\boldsymbol{y}_{G}\|^{2},   (9)

where $g$ is estimated by $g_{\eta}$. This gradient guides the sampling process towards protein sequences that are more likely to exhibit a higher $k_{cat}$ value during the inverse folding process.
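In practice, the gradient in Eq. (9) can be injected into each denoising step by shifting the denoiser's per-residue logits along the direction that reduces the regressor's squared error. The PyTorch sketch below assumes a differentiable relaxation of the noisy one-hot sequence and a regressor returning a predicted $\log k_{cat}$; this additive-logit coupling is one simple realization of the guidance, and the interface is an assumption.

```python
import torch

def guided_logits(denoiser_logits, x_t_relaxed, regressor, y_target, lam=5.0):
    """Bias per-residue logits toward higher predicted k_cat (Eqs. 8-9).

    denoiser_logits: (n, 20) logits for p_theta(x^aa | x_t).
    x_t_relaxed: (n, 20) differentiable relaxation of the noisy sequence.
    lam: guidance strength lambda; 5.0 performed best in the ablation (Table III).
    """
    x = x_t_relaxed.detach().requires_grad_(True)
    loss = (regressor(x) - y_target).pow(2).sum()   # |y_hat - y_G|^2 from Eq. (9)
    grad, = torch.autograd.grad(loss, x)            # gradient w.r.t. the noisy input
    # Move against the gradient of the squared error, scaled by lambda.
    return denoiser_logits - lam * grad
```

At each DDIM step, the guided logits would replace the unguided ones before sampling $\boldsymbol{x}_{t-k}$.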

TABLE I: Performance comparison in terms of $\Delta\log k_{cat}$ (improvement in enzyme turnover number), recovery rate, pLDDT, RMSD, and TM-score. Arrows indicate whether higher ($\uparrow$) or lower ($\downarrow$) values are better. $k_{cat}$Diffuser demonstrates superior performance across most metrics, particularly in enzyme activity improvement ($\Delta\log k_{cat}$) and sequence recovery, while maintaining high structural quality. The best performance for each metric is marked in bold.
Methods | $\Delta\log k_{cat}$ ($\uparrow$) | Recovery Rate ($\uparrow$) | pLDDT ($\uparrow$) | RMSD ($\downarrow$) | TM-score ($\uparrow$)
ProteinMPNN [20] | 0.117 | 0.342 | 92.038 | 5.444 | 0.892
PiFold [21] | 0.087 | 0.473 | 92.968 | 4.430 | 0.922
GraDe-IF [7] | -0.057 | 0.406 | 89.165 | 7.533 | 0.810
$k_{cat}$Diffuser | 0.209 | 0.716 | 92.515 | 3.764 | 0.934

III Experiments

III-A Implementation Details

The regressor employs a Transformer-based architecture comprising 3 output layers and 3 Transformer encoder layers. It uses an input dimension of 20, a hidden dimension of 64, and 4 attention heads to capture diverse input features effectively. For $k_{cat}$Diffuser, we adopted a learning rate of 0.0005 and a dropout rate of 0.1 to mitigate overfitting. To ensure training stability, gradient clipping was applied with a threshold of 1.0. The denoising network in $k_{cat}$Diffuser is an EGNN with 6 layers, each with a hidden size of 128 units, and incorporates embedding layers with a dimension of 128. The training process uses a diffusion sequence of 500 time steps. To enhance the model's robustness and account for sequence variability, we introduced BLOSUM-based noise to the input data [7]. This noise injection simulates natural amino acid substitutions, potentially improving the model's generalization for enzyme mutation prediction [7].
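A sketch of a single training step under these settings follows; the learning rate of 0.0005 and the clipping threshold of 1.0 come from the text, while the Adam optimizer and the `model(batch)`-returns-loss interface are assumptions.

```python
import torch

def train_step(model, batch, optimizer, max_norm=1.0):
    """One optimization step with gradient clipping at 1.0 (Sec. III-A)."""
    optimizer.zero_grad()
    loss = model(batch)            # assumed: denoising loss at a sampled timestep
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return loss.item()

# Usage (learning rate from Sec. III-A; Adam is an assumed choice):
# optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
```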

III-B Dataset Preparation

To train $k_{cat}$Diffuser, in addition to the CATH dataset, we leverage the BRENDA enzyme database, which includes EC numbers, organisms, enzyme sequences, simplified molecular-input line-entry system (SMILES) representations of substrates, and $k_{cat}$ values [26]. We focus on enzyme-substrate pairs to align with our model's objective of optimizing $k_{cat}$. While $k_{cat}$Diffuser requires protein structures as input, BRENDA primarily provides enzyme sequences. To bridge this gap, we employ ESMFold [27] to predict 3D structures from these sequences, resulting in 15,603 enzyme structures. We divide this dataset into 12,482 enzymes for training, 1,560 for validation, and 1,561 for testing. These structures are then converted into graph representations $G=\{\boldsymbol{X},\boldsymbol{A},\boldsymbol{E}\}$ using our pre-processing pipeline. To ensure data quality and computational feasibility, we filter out empty files and structures larger than 10 MB from both the CATH and BRENDA datasets before pre-processing. In the experiment section, we investigate the impact of incorporating the BRENDA dataset by training $k_{cat}$Diffuser in two configurations: on the CATH dataset only, and on the combined CATH and BRENDA datasets. While the training configurations differ, we evaluate both on the BRENDA test set.
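The structure-preparation step can be sketched as follows, using the public ESMFold interface from the facebookresearch/esm package (`esm.pretrained.esmfold_v1()` and `infer_pdb`); the filtering mirrors the empty-file and 10 MB checks described above, while the function name and control flow are assumptions.

```python
import os
import torch
import esm  # facebookresearch/esm with the esmfold extras installed

def fold_and_filter(sequences, out_dir, max_mb=10):
    """Fold BRENDA sequences with ESMFold and drop empty or >10 MB PDBs."""
    model = esm.pretrained.esmfold_v1().eval()
    if torch.cuda.is_available():
        model = model.cuda()
    os.makedirs(out_dir, exist_ok=True)
    for name, seq in sequences.items():
        path = os.path.join(out_dir, f"{name}.pdb")
        with torch.no_grad():
            pdb_str = model.infer_pdb(seq)        # predicted 3D structure
        with open(path, "w") as f:
            f.write(pdb_str)
        size_mb = os.path.getsize(path) / 1e6
        if size_mb == 0 or size_mb > max_mb:      # quality / size filter
            os.remove(path)
```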

III-C Evaluation Metrics

The first evaluation metric is $\Delta\log k_{cat}$, which quantifies the improvement in the enzyme's turnover number; a higher value indicates a greater enhancement in catalytic efficiency. We then assess the model's ability to generate sequences similar to the native protein using the recovery rate. To assess the structural quality of generated sequences, we use ESMFold to predict their 3D structures and compare them to the original crystal structures. We evaluate foldability using three metrics: pLDDT, a confidence measure of per-residue structural accuracy [28]; RMSD (root mean square deviation), which measures atomic-level differences between the model and native structures [29]; and TM-score (template modeling score), which assesses global structural similarity and correlates strongly with overall model quality [30, 31]. Together, these metrics provide a comprehensive evaluation of the generated sequences' ability to maintain the desired protein structure while potentially exhibiting improved $k_{cat}$ values, aligning with the core objectives of $k_{cat}$Diffuser.
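Two of these metrics are straightforward to compute directly; the sketch below gives the recovery rate and a Kabsch-superposition RMSD [29] over α-carbon coordinates (pLDDT is read from the ESMFold output, and TM-score is typically computed with external tools such as TM-align). The function names are illustrative.

```python
import numpy as np

def recovery_rate(native_seq, generated_seq):
    """Fraction of positions where the generated sequence matches the native."""
    assert len(native_seq) == len(generated_seq)
    return sum(a == b for a, b in zip(native_seq, generated_seq)) / len(native_seq)

def kabsch_rmsd(P, Q):
    """RMSD after optimal rigid superposition (Kabsch); P, Q are (n, 3)
    alpha-carbon coordinates of the predicted and native structures."""
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)
    d = np.sign(np.linalg.det(U @ Vt))       # avoid improper rotations
    R = U @ np.diag([1.0, 1.0, d]) @ Vt      # optimal rotation for P
    return float(np.sqrt(np.mean(np.sum((P @ R - Q) ** 2, axis=-1))))
```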

III-D Results

The results, summarized in Table I, demonstrate the effectiveness of our approach across multiple metrics. Specifically, $k_{cat}$Diffuser achieved the highest improvement in enzyme turnover number, with a $\Delta\log k_{cat}$ of 0.209. This represents an enhancement over ProteinMPNN (0.117) and PiFold (0.087), while GraDe-IF showed a slight decrease (-0.057), demonstrating the effectiveness of our regressor-guided diffusion approach in optimizing enzyme activity. $k_{cat}$Diffuser also outperformed all baselines in terms of recovery rate, achieving 0.716, higher than PiFold (0.473), GraDe-IF (0.406), and ProteinMPNN (0.342). The high recovery rate indicates that $k_{cat}$Diffuser generates sequences that closely resemble the native protein while still introducing beneficial mutations. In terms of structural quality, $k_{cat}$Diffuser maintained high fidelity while improving enzyme activity. Our model achieved a pLDDT score of 92.515, slightly lower than PiFold (92.968) but higher than ProteinMPNN (92.038) and GraDe-IF (89.165), indicating high confidence in the local structural accuracy of the generated sequences. $k_{cat}$Diffuser achieved the lowest RMSD of 3.764, better than all baselines, suggesting that the generated structures closely align with the native structures at the atomic level. Furthermore, our model attained the highest TM-score of 0.934, indicating excellent global structural similarity to the native proteins. These results demonstrate that $k_{cat}$Diffuser can balance enzyme activity improvement with structural integrity. To further illustrate the performance of $k_{cat}$Diffuser, we conducted a case study across five diverse enzyme classes (Fig. 2). The visual comparison and accompanying metrics demonstrate the superiority of our approach, particularly when using regressor guidance. Across all cases, $k_{cat}$Diffuser consistently achieved higher $\Delta\log k_{cat}$ values, indicating greater improvements in enzyme activity. For instance, in the EC 2.5.1.31 case, $k_{cat}$Diffuser with regressor guidance achieved a $\Delta\log k_{cat}$ of 0.486, outperforming the other methods. Importantly, these activity improvements were achieved while maintaining high structural fidelity, as evidenced by consistently low RMSD values and high TM-scores. The generated structures (cyan) closely align with the original proteins (green), demonstrating $k_{cat}$Diffuser's ability to optimize enzyme activity without compromising structural integrity.

TABLE II: Comparison of model complexity. The table shows the number of parameters in millions, memory usage in megabytes, and inference time in seconds for each compared model.
Methods | # Param. (M) | Memory (MB) | Time (s)
ProteinMPNN | 1.66 | 237.1 | 0.60
PiFold | 6.61 | 108.0 | 0.26
GraDe-IF | 7.64 | 140.9 | 0.39
$k_{cat}$Diffuser | 8.85 | 170.0 | 5.23

III-E Model Complexity Analysis

We conducted a comprehensive analysis of model complexity, comparing $k_{cat}$Diffuser with ProteinMPNN, PiFold, and GraDe-IF. Table II summarizes the results in terms of number of parameters, memory usage, and inference time. $k_{cat}$Diffuser has the highest number of parameters (8.85M) among the compared models; this increased capacity allows it to capture relationships between protein structure and enzyme activity. In terms of memory usage, $k_{cat}$Diffuser (170.0 MB) sits between the memory-efficient PiFold (108.0 MB) and the more memory-intensive ProteinMPNN (237.1 MB). This moderate memory footprint makes $k_{cat}$Diffuser suitable for deployment on a wide range of hardware configurations, balancing performance with resource requirements. The inference time of $k_{cat}$Diffuser (5.23 s) is notably higher than that of the other models, which ranges from 0.26 s to 0.60 s. This increased computational cost is primarily due to the iterative nature of the diffusion process and the additional computations required for regressor-guided sampling. However, this trade-off in speed enables $k_{cat}$Diffuser to perform multi-site mutations and optimize for enzyme activity, capabilities not present in the faster models.

TABLE III: Ablation study on the influence of the regressor guidance strength ($\lambda$) on $k_{cat}$Diffuser performance.
$\lambda$ | $\Delta\log k_{cat}$ ($\uparrow$) | Recovery Rate ($\uparrow$) | pLDDT ($\uparrow$) | RMSD ($\downarrow$) | TM-score ($\uparrow$)
0.1 | 0.194 | 0.730 | 92.343 | 3.641 | 0.937
0.5 | 0.167 | 0.732 | 92.403 | 3.626 | 0.937
1.0 | 0.184 | 0.733 | 92.408 | 3.638 | 0.937
5.0 | 0.209 | 0.716 | 92.515 | 3.764 | 0.934
10.0 | 0.124 | 0.643 | 91.408 | 6.253 | 0.867
20.0 | 0.083 | 0.537 | 85.920 | 16.390 | 0.613

III-F Ablation Study

To explore the impact of regressor guidance on $k_{cat}$Diffuser's performance, we conducted an ablation study varying the regressor guidance strength parameter $\lambda$. Table III presents the results, demonstrating the trade-off between enzyme activity improvement and structural integrity. For lower values of $\lambda$ (0.1 to 1.0), we observe relatively stable performance across all metrics. The model maintains high structural fidelity, as evidenced by consistent pLDDT scores around 92.4, low RMSD values (3.63-3.64), and high TM-scores (0.937). The recovery rates are also highest in this range (0.730-0.733), indicating that the generated sequences closely resemble the native proteins. As $\lambda$ increases to 5.0, we see an improvement in $\Delta\log k_{cat}$ (0.209), suggesting enhanced enzyme activity. This comes with a slight decrease in recovery rate (0.716) and marginal changes in the structural metrics, indicating a good balance between activity improvement and structural preservation. However, further increases in $\lambda$ (10.0 and 20.0) lead to a decline across all metrics: $\Delta\log k_{cat}$ decreases, and we observe a marked deterioration in structural integrity, particularly at $\lambda=20.0$ (RMSD of 16.390 and TM-score of 0.613). This ablation study reveals that moderate regressor guidance ($\lambda=5.0$) yields the best results, optimizing enzyme activity while maintaining structural stability.

IV Conclusion

In this work, we propose $k_{cat}$Diffuser, a novel regressor-guided graph diffusion model designed to enhance enzyme turnover numbers while maintaining protein structural integrity. By reformulating enzyme mutation prediction as a protein inverse folding task, our approach establishes a direct link between structural prediction and functional optimization. Our evaluation demonstrates that $k_{cat}$Diffuser outperforms state-of-the-art methods across multiple metrics. While $k_{cat}$Diffuser exhibits higher computational complexity than some baseline models, this trade-off enables multi-site mutation and activity optimization capabilities not present in faster approaches. The model's moderate memory footprint also ensures practical deployability across various hardware configurations. Thus, $k_{cat}$Diffuser represents an advancement in computational enzyme engineering, offering an efficient and targeted approach to enhancing enzyme activity. By enabling the prediction of beneficial multi-site mutations, our model addresses key challenges in enzyme optimization and opens new avenues for the rational design of improved biocatalysts. Future work could focus on further optimizing the model's efficiency and exploring its applicability to a broader range of enzyme classes and reaction types.

References

  • [1] K. Chen and F. H. Arnold, “Engineering new catalytic activities in enzymes,” Nature Catalysis, vol. 3, no. 3, pp. 203–213, 2020.
  • [2] P. Wendering, M. Arend, Z. Razaghi-Moghadam, and Z. Nikoloski, “Data integration across conditions improves turnover number estimates and metabolic predictions,” Nature Communications, vol. 14, no. 1, p. 1485, 2023.
  • [3] S. Qiu, S. Zhao, and A. Yang, “Dltkcat: deep learning-based prediction of temperature-dependent enzyme turnover rates,” Briefings in Bioinformatics, vol. 25, no. 1, p. bbad506, 2024.
  • [4] F. Li, L. Yuan, H. Lu, G. Li, Y. Chen, M. K. Engqvist, E. J. Kerkhoven, and J. Nielsen, “Deep learning-based k cat prediction enables improved enzyme-constrained model reconstruction,” Nature Catalysis, vol. 5, no. 8, pp. 662–672, 2022.
  • [5] T. Wang, G. Xiang, S. He, L. Su, X. Yan, and H. Lu, “Deepenzyme: a robust deep learning model for improved enzyme turnover number prediction by utilizing features of protein 3d structures,” bioRxiv, pp. 2023–12, 2023.
  • [6] J. Meier, R. Rao, R. Verkuil, J. Liu, T. Sercu, and A. Rives, “Language models enable zero-shot prediction of the effects of mutations on protein function,” Advances in neural information processing systems, vol. 34, pp. 29287–29303, 2021.
  • [7] K. Yi, B. Zhou, Y. Shen, P. Liò, and Y. Wang, “Graph denoising diffusion for inverse protein folding,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [8] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep unsupervised learning using nonequilibrium thermodynamics,” in International conference on machine learning. PMLR, 2015, pp. 2256–2265.
  • [9] Y. Song and S. Ermon, “Generative modeling by estimating gradients of the data distribution,” Advances in neural information processing systems, vol. 32, 2019.
  • [10] P. Dhariwal and A. Nichol, “Diffusion models beat gans on image synthesis,” Advances in neural information processing systems, vol. 34, pp. 8780–8794, 2021.
  • [11] K. Black, M. Janner, Y. Du, I. Kostrikov, and S. Levine, “Training diffusion models with reinforcement learning,” arXiv preprint arXiv:2305.13301, 2023.
  • [12] H. He, C. Bai, K. Xu, Z. Yang, W. Zhang, D. Wang, B. Zhao, and X. Li, “Diffusion model is an effective planner and data synthesizer for multi-task reinforcement learning,” Advances in neural information processing systems, vol. 36, 2024.
  • [13] J. S. Lee, J. Kim, and P. M. Kim, “Score-based generative modeling for de novo protein design,” Nature Computational Science, vol. 3, no. 5, pp. 382–392, 2023.
  • [14] E. Hoogeboom, V. G. Satorras, C. Vignac, and M. Welling, “Equivariant diffusion for molecule generation in 3d,” in International conference on machine learning. PMLR, 2022, pp. 8867–8887.
  • [15] K. E. Wu, K. K. Yang, R. van den Berg, S. Alamdari, J. Y. Zou, A. X. Lu, and A. P. Amini, “Protein structure generation via folding diffusion,” Nature communications, vol. 15, no. 1, p. 1059, 2024.
  • [16] J. Ingraham, V. Garg, R. Barzilay, and T. Jaakkola, “Generative models for graph-based protein design,” Advances in neural information processing systems, vol. 32, 2019.
  • [17] B. Jing, S. Eismann, P. Suriana, R. J. L. Townshend, and R. Dror, “Learning from protein structure with geometric vector perceptrons,” in International Conference on Learning Representations, 2020.
  • [18] C. Tan, Z. Gao, J. Xia, B. Hu, and S. Z. Li, “Generative de novo protein design with global context,” arXiv preprint arXiv:2204.10673, 2022.
  • [19] A. Strokach, D. Becerra, C. Corbi-Verge, A. Perez-Riba, and P. M. Kim, “Fast and flexible protein design using deep graph neural networks,” Cell systems, vol. 11, no. 4, pp. 402–411, 2020.
  • [20] J. Dauparas, I. Anishchenko, N. Bennett, H. Bai, R. J. Ragotte, L. F. Milles, B. I. Wicky, A. Courbet, R. J. de Haas, N. Bethel et al., “Robust deep learning–based protein sequence design using proteinmpnn,” Science, vol. 378, no. 6615, pp. 49–56, 2022.
  • [21] Z. Gao, C. Tan, P. Chacón, and S. Z. Li, “Pifold: Toward effective and efficient protein inverse folding,” arXiv preprint arXiv:2209.12643, 2022.
  • [22] D. Heckmann, C. J. Lloyd, N. Mih, Y. Ha, D. C. Zielinski, Z. B. Haiman, A. A. Desouki, M. J. Lercher, and B. O. Palsson, “Machine learning applied to enzyme turnover numbers reveals protein structural correlates and improves metabolic models,” Nature communications, vol. 9, no. 1, p. 5252, 2018.
  • [23] A. Kroll, Y. Rousset, X.-P. Hu, N. A. Liebrand, and M. J. Lercher, “Turnover number predictions for kinetically uncharacterized enzymes using machine and deep learning,” Nature communications, vol. 14, no. 1, p. 4139, 2023.
  • [24] C. A. Orengo, A. D. Michie, S. Jones, D. T. Jones, M. B. Swindells, and J. M. Thornton, “Cath–a hierarchic classification of protein domain structures,” Structure, vol. 5, no. 8, pp. 1093–1109, 1997.
  • [25] H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne, “The protein data bank,” Nucleic acids research, vol. 28, no. 1, pp. 235–242, 2000.
  • [26] A. Chang, L. Jeske, S. Ulbrich, J. Hofmann, J. Koblitz, I. Schomburg, M. Neumann-Schaal, D. Jahn, and D. Schomburg, “Brenda, the elixir core data resource in 2021: new developments and updates,” Nucleic acids research, vol. 49, no. D1, pp. D498–D508, 2021.
  • [27] B. Hie, S. Candido, Z. Lin, O. Kabeli, R. Rao, N. Smetanin, T. Sercu, and A. Rives, “A high-level programming language for generative protein design,” bioRxiv, pp. 2022–12, 2022.
  • [28] J. Jumper, R. Evans, A. Pritzel, T. Green, M. Figurnov, O. Ronneberger, K. Tunyasuvunakool, R. Bates, A. Žídek, A. Potapenko et al., “Highly accurate protein structure prediction with alphafold,” Nature, vol. 596, no. 7873, pp. 583–589, 2021.
  • [29] W. Kabsch, “A solution for the best rotation to relate two sets of vectors,” Acta Crystallographica Section A: Crystal Physics, Diffraction, Theoretical and General Crystallography, vol. 32, no. 5, pp. 922–923, 1976.
  • [30] Y. Zhang and J. Skolnick, “Scoring function for automated assessment of protein structure template quality,” Proteins: Structure, Function, and Bioinformatics, vol. 57, no. 4, pp. 702–710, 2004.
  • [31] J. Xu and Y. Zhang, “How significant is a protein structure similarity with tm-score= 0.5?” Bioinformatics, vol. 26, no. 7, pp. 889–895, 2010.
  • [32] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [33] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
  • [34] J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” arXiv preprint arXiv:2010.02502, 2020.