
Improving protein optimization with smoothed fitness landscapes

Andrew Kirjner
Massachusetts Institute of Technology
[email protected]
&Jason Yim
Massachusetts Institute of Technology
[email protected]
&Raman Samusevich
IOCB Prague, Czech Academy of Sciences,
CIIRC, Czech Technical University in Prague,
University of Chemistry and Technology, Prague
[email protected]
&Shahar Bracha
Massachusetts Institute of Technology
[email protected]
&Tommi Jaakkola
Massachusetts Institute of Technology
[email protected]
&Regina Barzilay
Massachusetts Institute of Technology
[email protected]
\ANDIla Fiete
Massachusetts Institute of Technology
[email protected]
Contributed equally to this work. Authors agreed ordering can be changed for their respective interests. Advised equally to this work.
Abstract

The ability to engineer novel proteins with higher fitness for a desired property would be revolutionary for biotechnology and medicine. Modeling the combinatorially large space of sequences is infeasible; prior methods often constrain optimization to a small mutational radius, but this drastically limits the design space. Instead of heuristics, we propose smoothing the fitness landscape to facilitate protein optimization. First, we formulate protein fitness as a graph signal, then use Tikhonov regularization to smooth the fitness landscape. We find optimizing in this smoothed landscape leads to improved performance across multiple methods in the GFP and AAV benchmarks. Second, we achieve state-of-the-art results utilizing discrete energy-based models and MCMC in the smoothed landscape. Our method, called Gibbs sampling with Graph-based Smoothing (GGS), demonstrates a unique ability to achieve a 2.5-fold fitness improvement (with in-silico evaluation) over its training set. GGS demonstrates potential to optimize proteins in the limited data regime. Code: https://github.com/kirjner/GGS

1 Introduction

In protein engineering, fitness can be defined as performance on a desired property or function. Examples of fitness include catalytic activity for enzymes (Anderson et al., 2021) and fluorescence for biomarkers (Remington, 2011). Protein optimization seeks to improve protein fitness by altering the underlying sequences of amino acids. However, the number of possible proteins increases exponentially with sequence length, rendering it infeasible to perform brute-force search to engineer novel functions, which often require multiple mutations from the starting sequence (i.e. at least 3 (Ghafari & Weissman, 2019)). Directed evolution (Arnold, 1998) has been successful in improving protein fitness, but it requires substantial labor and time.

We aim to computationally generate high-fitness proteins by optimizing a learned model of the fitness landscape, but face several challenges. Proteins are notorious for highly non-smooth fitness landscapes (we use "landscape" to refer to the sequence-to-fitness mapping): fitness can change dramatically with single mutations, fitness measurements contain experimental noise, and most protein sequences have zero fitness (Brookes et al., 2022). Furthermore, protein fitness datasets are scarce and difficult to generate due to their high costs (Dallago et al., 2021). As a result, machine learning (ML) methods are susceptible to predicting false positives and getting stuck in local optima (Brookes et al., 2019). The 3D protein structure, if available, can provide information for navigating the noisy fitness landscape, such as identifying hot spot residues (Zerbe et al., 2012), but high quality structures are not available in many cases.

One way to deal with noisy and limited data is to regularize the fitness landscape model (in the sequel, we use "model" when referring to the fitness landscape model). Our work considers a smoothing regularizer in which similar sequences (based on a distance measure) are encouraged to have similar predicted fitness. While actual fitness landscapes are not smooth, smoothing can be an important tool in the context of optimization, allowing gradient-based methods to reach higher peaks by avoiding local optima, especially in discrete optimization (Zanella, 2020). A few works have studied properties of protein fitness landscapes (Section 2), but none have directly applied smoothing with a graph framework during optimization.

We propose a novel method for applying smoothing to protein sequence and fitness data together with an optimization technique that takes advantage of the smoothing. First, we formulate sequences as a graph with fitness values as node attributes and apply Tikhonov regularization to smooth the topological signal measured by the graph Laplacian. The smoothed data is then fitted with a neural network to be used as a model for discrete optimization (Figure 1 top). Second, we sample high-fitness sequences from the energy function by using the model's gradients in a Gibbs With Gradients (GWG) procedure (Grathwohl et al., 2021). In GWG, a discrete distribution is constructed based on the model's gradients where mutations with improved fitness correlate with higher probability. The process of taking gradients and sampling mutations is performed iteratively, so that subsequent mutations guide the sequence towards higher fitness (Figure 1 bottom).

Figure 1: Overview. (A) Protein optimization is challenging due to a noisy fitness landscape where the starting dataset (unblurred) is a fraction of the landscape with the highest fitness sequences hidden (blurred). (B) We develop Graph-based Smoothing (GS) to estimate a smoothed fitness landscape from the starting data. (C) A model is trained on the smoothed fitness landscape to infer the rest of the landscape. (D) Gradients from the model are used in Gibbs With Gradients (GWG) where on each step a new mutation is proposed. (E) The goal of sampling is for each trajectory to gradually head towards higher fitness.

Figure 1 shows an overview of the method. We refer to the procedure of smoothing then sampling as Gibbs sampling with Graph-based Smoothing (GGS). To evaluate our method, we introduce a set of tasks using the well-studied Green Fluorescent Protein (GFP) (Sarkisyan et al., 2016) and Adeno-Associated Virus (AAV) (Bryant et al., 2021) proteins. We chose GFP and AAV because of their real-world importance and the availability of large mutational datasets. We design a set of tasks that emulate starting with noisy and limited data and evaluate with a trained model (as done in most prior works). We evaluate GGS and prior works on our proposed benchmarks to show that GGS is state-of-the-art in GFP and AAV fitness optimization. Our contributions are summarized as follows:

  • We develop a novel sequence-based protein optimization algorithm, GGS, which uses graph-based smoothing to train a smoothed fitness model. The model is used as a discrete energy function to progressively sample mutations towards higher-fitness sequences with GWG (Section 3).

  • We develop a set of tasks that measure a method’s ability to extrapolate towards higher fitness. We use publicly available GFP and AAV datasets to emulate difficult optimization scenarios of starting with limited and noisy data (Section 4.1).

  • Our benchmark shows prior methods fail to extrapolate towards higher fitness. However, we show graph-based smoothing can drastically improve their performance; in one baseline, the fitness jumps from 18% to 39% in GFP and 4% to 44% in AAV after smoothing (Section 4.2).

  • Our method GGS directly exploits smoothness to achieve state-of-the-art results with 5 times higher fitness in GFP and 2 times higher in AAV compared to the next best method (Section 4.2).

2 Related work

Protein optimization and design. Approaches can broadly be categorized as using sequence, structure, or both. Sequence-based methods have been explored through the lens of reinforcement learning (Angermueller et al., 2020), latent space optimization (Stanton et al., 2022; Lee et al., ; Maus et al., 2022), generative models (Notin et al., 2022; Meier et al., 2021; Jain et al., 2022; Gruver et al., 2023), and model-based directed evolution (Sinai et al., 2020; Padmakumar et al., 2023; Ren et al., 2022). Together they face the issue of a noisy fitness landscape to optimize. We focus on sequence-based methods using Gibbs With Gradients (GWG) (Grathwohl et al., 2021), which achieves state-of-the-art performance in discrete optimization but requires a smooth energy function to perform well. Concurrently, Emami et al. (2023) used GWG for protein optimization with a product of experts distribution using a protein language model. However, they achieved subpar results.

Previous methods focused on developing new sampling and optimization techniques. Our work is complementary by addressing the need for improved regularization and smoothing. We show in our experiments that our smoothing technique can enhance the performance of prior methods.

Protein fitness regularization. The NK model was an early attempt to model smoothness of protein fitness through a statistical model of epistasis (Kauffman & Weinberger, 1989). Brookes et al. (2022) proposed a framework to approximate the sparsity of protein fitness using a generalized NK model (Buzas & Dinitz, 2013). Concurrently, dWJS (Frey et al., 2023) is most related to our work by utilizing Gaussian noise to regularize the discrete energy function during Langevin MCMC. dWJS trains by denoising to smooth an energy-based model, whereas we apply discrete regularization using graph-based smoothing techniques.

Finally, we distinguish our smoothing method from traditional regularizers applied during training such as dropout (Srivastava et al., 2014). Our goal is to smooth the fitness landscape in a way that is amenable to iterative optimization. We enforce that similar sequences have similar fitness, which is not guaranteed with dropout or similar regularizers applied in minibatch training. Our focus is not to evaluate multiple smoothing strategies, but rather to demonstrate their importance.

3 Method

The following describes our method. Section 3.1 details the problem formulation. Next, Section 3.2 describes the procedure for training a smoothed model. Lastly, Section 3.3 provides background on Gibbs With Gradients (GWG), which is adapted for protein optimization. The full algorithm, Gibbs sampling with Graph-based Smoothing (GGS), is presented in Algorithm 1.

3.1 Problem formulation

We denote the starting set of $N$ proteins as $\mathcal{D}=(X,Y)$ where $X=\{x_{1},\dots,x_{N}\}\subset\mathcal{V}^{M}$ are the sequences and $Y=\{y_{1},\dots,y_{N}\}$ are corresponding real-valued scalar fitness measurements. Each sequence $x_{i}\in\mathcal{V}^{M}$ is composed of $M$ residues from a vocabulary $\mathcal{V}$ of 20 amino acids. Subscripts refer to different sequences. Note our method can be extended to other modalities, e.g. nucleic acids.

For in-silico evaluation, we denote the set of all known sequences and fitness measurements as $\mathcal{D}^{*}=(X^{*},Y^{*})$. We assume there exists an unknown black-box function $g:\mathcal{V}^{M}\rightarrow\mathbb{R}$ such that $g(x^{*})=y^{*}$. In practice, $g$ needs to be approximated by an evaluator model, $g_{\phi}$, trained with weights $\phi$ to minimize prediction error on $\mathcal{D}^{*}$. $g_{\phi}$ poses a limitation to evaluation since the true fitness needs to be verified with biological experiments. Nevertheless, an in-silico approximation provides an accessible way to evaluate and is done in all prior works. The starting dataset is a strict subset of the known dataset, $\mathcal{D}\subset\mathcal{D}^{*}$, to simulate fitness optimization scenarios. Given $\mathcal{D}$, our task is to generate a set of sequences with higher fitness than the starting set.
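To make the notation concrete, here is a minimal sketch (not the authors' released code) of how sequences over the 20-amino-acid vocabulary $\mathcal{V}$ and a starting dataset $\mathcal{D}=(X,Y)$ might be represented; the VOCAB string and helper names are our own.

```python
# Illustrative sketch: sequences as integer arrays and one-hot matrices in {0,1}^{M x |V|}.
import numpy as np

VOCAB = "ACDEFGHIKLMNPQRSTVWY"          # 20 canonical amino acids
AA_TO_IDX = {aa: i for i, aa in enumerate(VOCAB)}

def encode(seq: str) -> np.ndarray:
    """Map a length-M sequence to integer indices in [0, 20)."""
    return np.array([AA_TO_IDX[aa] for aa in seq], dtype=np.int64)

def one_hot(seq: str) -> np.ndarray:
    """One-hot encode a sequence as an M x 20 binary matrix."""
    idx = encode(seq)
    x = np.zeros((len(seq), len(VOCAB)), dtype=np.float32)
    x[np.arange(len(seq)), idx] = 1.0
    return x

# Example: a toy starting dataset D = (X, Y) of sequences with scalar fitness labels.
X = ["MKTAYIAK", "MKTAYLAK"]
Y = np.array([0.31, 0.42], dtype=np.float32)
```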

3.2 Graph-based smoothing on proteins

Our goal is to develop a model of the sequence-to-fitness mapping that can be utilized when sampling higher fitness sequences. Unfortunately, the high-dimensional sequence space coupled with few data points and noisy labels can result in a noisy model that is prone to sampling false positives or getting stuck in local optima. To address this, we use smoothing techniques from graph signal processing.

The smoothing process is depicted in Figure 2. First, we train a noisy fitness model $f_{\tilde{\theta}}:\mathcal{V}^{M}\rightarrow\mathbb{R}$ with weights $\tilde{\theta}$ on the initial dataset $\mathcal{D}$ using Mean-Squared Error (MSE). $\mathcal{D}$ is usually very small in real-world scenarios. We augment the dataset by using $f_{\tilde{\theta}}$ to infer the fitness of neighboring sequences for which we do not have labels – known as transductive inference. Neighboring sequences are generated by randomly applying point mutations to each sequence in $X$. The augmented and original sequences become nodes, $V$, in our graph while their fitness labels are node attributes. Edges, $\mathcal{E}$, are constructed with a $k$-nearest neighbor (kNN) graph around each node based on the Levenshtein distance (the minimum number of mutations between two sequences). The graph construction algorithm can be found in Algorithm 4.
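A hedged sketch of this graph construction (Algorithms 4 and 5): sequences are augmented with random point mutations up to a node budget and connected by a kNN graph. Since augmentation applies only substitutions, all sequences share length $M$ and the Levenshtein distance coincides with the Hamming distance used here; the value of $k$ and the function names are illustrative assumptions.

```python
import random
import numpy as np

VOCAB = "ACDEFGHIKLMNPQRSTVWY"

def point_mutation(seq: str) -> str:
    """Substitute one uniformly chosen position with a uniformly chosen amino acid."""
    i = random.randrange(len(seq))
    return seq[:i] + random.choice(VOCAB) + seq[i + 1:]

def create_graph(train_seqs, n_nodes=250_000, k=10):
    nodes = list(train_seqs)
    while len(nodes) < n_nodes:                          # node augmentation
        nodes.append(point_mutation(random.choice(nodes)))
    idx = np.array([[VOCAB.index(a) for a in s] for s in nodes])
    edges = set()
    for u in range(len(nodes)):                          # kNN edges under Hamming distance
        dists = (idx != idx[u]).sum(axis=1)
        for v in np.argsort(dists)[: k + 1]:
            if v != u:                                   # skip self edges
                edges.add((min(u, int(v)), max(u, int(v))))
    return nodes, edges
```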

The following borrows techniques from Isufi et al. (2022). The smoothness of the fitness variability in our protein graph is defined as the sum over the square of all local variability,

\mathsf{TV}_{2}(Y)=\frac{1}{2}\sum_{i\in V}(\Delta y_{i})^{2},\quad\Delta y_{i}=\sqrt{\sum_{(i,j)\in\mathcal{E}}(y_{i}-y_{j})^{2}}. (1)

$\mathsf{TV}$ refers to Total Variation and $\Delta y_{i}$ is the local variability of node $i$ that measures local changes in fitness. Using $\mathsf{TV}_{2}$ as a regularizer, we solve the following optimization problem, known as Tikhonov regularization (Zhou & Schölkopf, 2004), for a new set of smoothed fitness labels,

\operatorname*{arg\,min}_{\hat{Y}\in\mathbb{R}^{|V|}}\|Y-\hat{Y}\|_{2}^{2}+\gamma\,\mathsf{TV}_{2}(\hat{Y}). (2)

With abuse of notation, we represent $Y$ as a vector containing each node's fitness. $\gamma$ is a hyperparameter that controls the smoothness; setting it too high can lead to underfitting. We experiment with different $\gamma$'s in Section 4. Since eq. 2 is a quadratic convex problem, it has the closed-form solution $\hat{Y}=(\mathds{I}+\gamma L)^{-1}Y$ where $L$ is the graph Laplacian and $\mathds{I}$ is the identity matrix. The final step is to retrain the model on the sequences in the graph and their smoothed fitness labels. The result is a model $f_{\theta}$ with lower $\mathsf{TV}_{2}$ than before and thus improved smoothness. The smoothing algorithm is in Algorithm 2.
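Below is a minimal sketch of the smoothing step, assuming `nodes` and `edges` come from a graph-construction routine like the one above and `f_noisy` is the model trained on the original labels. It relies on the standard identity $\mathsf{TV}_{2}(\hat{Y})=\hat{Y}^{\top}L\hat{Y}$ for the combinatorial Laplacian, so eq. 2 reduces to the linear solve stated above.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def smooth_labels(nodes, edges, f_noisy, gamma=1.0):
    """Solve eq. 2 in closed form: Y_hat = (I + gamma * L)^{-1} Y."""
    n = len(nodes)
    rows, cols = zip(*edges)
    # Symmetric adjacency, degrees, and combinatorial Laplacian L = D - A.
    A = sp.coo_matrix((np.ones(len(edges)), (rows, cols)), shape=(n, n))
    A = A + A.T
    L = sp.diags(np.asarray(A.sum(axis=1)).ravel()) - A
    # Transductive labels: the noisy model scores every node, including augmented ones.
    Y = np.array([f_noisy(s) for s in nodes])
    Y_hat = spsolve(sp.identity(n, format="csc") + gamma * L.tocsc(), Y)
    return Y_hat  # retrain f_theta on (nodes, Y_hat)
```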

Figure 2: Steps in graph-based smoothing on proteins, illustrated with fictitious data of length-2 sequences with vocabulary $\{A,B\}$. Above each node are the corresponding fitness values. Solid nodes are those in our training set while dashed nodes are augmented via point mutations to increase the smoothing effectiveness. See Section 3.2 for a description of each step.

3.3 Sampling improved fitness with Gibbs

Equipped with the model $f_{\theta}$ from Section 3.2, we apply it in a procedure to sample mutations that improve the starting sequences' fitness. $f_{\theta}$ can also be viewed as an energy-based model (EBM) that defines a Boltzmann distribution $\log p(x)=f_{\theta}(x)-\log Z$ where $Z$ is the normalization constant. Higher-fitness sequences are more likely under this distribution, while sampling induces diversity and novelty. To sample from $p(x)$, we use Gibbs With Gradients (GWG) (Grathwohl et al., 2021), which has attracted significant interest due to its simplicity and state-of-the-art performance in discrete optimization. In this section, we describe the GWG procedure for protein sequences. GWG uses Gibbs sampling with approximations of locally informed proposals (Zanella, 2020):

q(x^{\prime}|x)\propto\exp\left(\frac{1}{2}\sum_{i}(x^{\prime}_{i})^{\top}d_{\theta}(x)_{i}\right)\mathds{1}(x^{\prime}\in H(x)),\quad d_{\theta}(x)_{i}=\left[\nabla_{x}f_{\theta}(x)\right]_{i}-x_{i}\odot\left[\nabla_{x}f_{\theta}(x)\right]_{i}. (3)

With slight abuse of notation, we use the one-hot sequence representation $x\in\{0,1\}^{M\times|\mathcal{V}|}$ where $x_{i}\in\{0,1\}^{|\mathcal{V}|}$ represents the $i$th index of the sequence with 1 at its amino acid index and 0 elsewhere. $\odot$ is the element-wise product. $H(x)=\{y\in\mathcal{V}^{M}:d_{\mathrm{Hamming}}(x,y)\leq 1\}$ is the 1-ball around $x$ using Hamming distance. The core idea of GWG is to use $d_{\theta}(x)_{i}$ as a first-order approximation of the change in likelihood from mutating the $i$th index of $x$ to a different amino acid. The quality of the proposals in eq. 3 relies on the smoothness of the energy $f_{\theta}$ (Theorem 1 in Grathwohl et al. (2021)). If the gradients, $\nabla f_{\theta}$, are noisy, then the proposal distributions are ineffective in sampling better sequences. Hence, smoothing $f_{\theta}$ is desirable (see Section 4).

The choice of $H(\cdot)$ as the 1-Hamming ball limits $x^{\prime}$ to point mutations of $x$ and only requires $\mathcal{O}(M\times|\mathcal{V}|)$ compute to construct. Let the point mutation where $x$ and $x^{\prime}$ differ be defined by the residue location, $i^{\mathrm{loc}}\in\{1,\dots,M\}$, and amino acid substitution, $j^{\mathrm{sub}}\in\{1,\dots,|\mathcal{V}|\}$. By limiting $x^{\prime}$ to point mutants $(i^{\mathrm{loc}},j^{\mathrm{sub}})$, sampling $q(x^{\prime}|x)$ is equivalent to sampling the following,

(i^{\mathrm{loc}},j^{\mathrm{sub}})\sim q(\cdot|x)=\text{Cat}\left(\text{Softmax}\left(\left\{\frac{d_{\theta}(x)_{i,j}}{\tau}\right\}_{i=1,j=1}^{M,|\mathcal{V}|}\right)\right) (4)

where $\tau$ is the sampling temperature and $d_{\theta}(x)_{i,j}$ is the logit of mutating to $(i,j)$. The proposal sequence $x^{\prime}$ is constructed by setting its $i^{\mathrm{loc}}$ residue to $j^{\mathrm{sub}}$ and keeping it equal to $x$ elsewhere. Each proposed sequence is accepted or rejected using Metropolis-Hastings (MH),

\min\left(\exp(f_{\theta}(x^{\prime})-f_{\theta}(x))\frac{q(x|x^{\prime})}{q(x^{\prime}|x)},\ 1\right). (5)

We provide the GWG algorithm in Algorithm 3.
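For concreteness, here is a hedged sketch of a single GWG proposal and Metropolis-Hastings correction (eqs. 3-5) for a model `f_theta` that maps a one-hot $(M,|\mathcal{V}|)$ tensor to a scalar. We use the standard categorical form of $d_{\theta}$, subtracting the gradient at the current amino acid of each position; the function and variable names are ours, not the released implementation.

```python
import torch

def gwg_step(x, f_theta, tau=0.1):
    """One GWG proposal for a one-hot sequence x of shape (M, |V|)."""
    def proposal_logits(z):
        z = z.clone().requires_grad_(True)
        grad = torch.autograd.grad(f_theta(z), z)[0]
        # First-order estimate of the fitness change for every point mutation,
        # relative to the current amino acid at each position.
        d = grad - (z * grad).sum(dim=-1, keepdim=True)
        return (d / tau).detach().flatten()

    q_fwd = torch.distributions.Categorical(logits=proposal_logits(x))
    flat_idx = q_fwd.sample()
    i_loc, j_sub = divmod(flat_idx.item(), x.shape[1])   # residue location, substitution

    x_new = x.clone()
    x_new[i_loc] = 0.0
    x_new[i_loc, j_sub] = 1.0                            # apply the point mutation

    # Metropolis-Hastings correction (eq. 5) using the reverse proposal q(x | x').
    q_rev = torch.distributions.Categorical(logits=proposal_logits(x_new))
    rev_idx = torch.tensor(i_loc * x.shape[1] + int(x[i_loc].argmax()))
    log_alpha = (f_theta(x_new) - f_theta(x)
                 + q_rev.log_prob(rev_idx) - q_fwd.log_prob(flat_idx))
    return x_new if torch.rand(()).log() < log_alpha else x
```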

Clustered sampling.

GWG requires a starting sequence to mutate. A reasonable starting set is the set of sequences $X$ used to train the model. On each round $r$, we use eq. 4 to propose $N_{\text{prop}}$ mutations for each sequence. If a mutation is accepted via eq. 5, the mutated sequence is added to the next round. However, this procedure can lead to an intractable number of sequences to consider.

To control compute, we perform hierarchical clustering (Müllner, 2011) on all the sequences in a round and take the sequence in each cluster with the highest predicted fitness under $f_{\theta}$. Let $\mathcal{C}$ be the number of clusters, which we set based on the amount of available compute. This procedure, known as Reduce, is,

\texttt{Reduce}(X;\theta)=\bigcup_{c=1}^{\mathcal{C}}\{\operatorname*{arg\,max}_{x\in X^{c}}f_{\theta}(x)\}\ \text{ where }\ \{X^{c}\}_{c=1}^{\mathcal{C}}=\texttt{Cluster}(X;\mathcal{C}). (6)

Each round $r$ reduces the sequences from the previous round and performs GWG sampling.

\tilde{X}_{r}=\texttt{Reduce}(X_{r};\theta),\quad X_{r+1}=\texttt{GWG}(\tilde{X}_{r};\theta) (7)

To summarize, we adapted GWG for protein optimization by developing a smoothed model to satisfy GWG’s smoothness assumptions and use clustering during sampling to reduce redundancy and compute. An illustration of clustered sampling is provided in Figure 5.
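A small sketch of the Reduce step in eq. 6, assuming equal-length sequences and a callable `f_theta` that scores a single sequence; the clustering backend and `n_clusters` are illustrative choices rather than the authors' exact setup.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def reduce_pool(seqs, f_theta, n_clusters=500):
    """Hierarchically cluster sequences and keep the best-scoring member per cluster."""
    idx = np.array([[ord(c) for c in s] for s in seqs])
    dists = pdist(idx, metric="hamming")                 # condensed pairwise distances
    labels = fcluster(linkage(dists, method="average"), t=n_clusters, criterion="maxclust")
    scores = np.array([f_theta(s) for s in seqs])
    keep = []
    for c in np.unique(labels):                          # best-scoring member of each cluster
        members = np.where(labels == c)[0]
        keep.append(seqs[int(members[scores[members].argmax()])])
    return keep
```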

The full algorithm for smoothing and clustered sampling is provided in Algorithm 1.

Algorithm 1 GGS: Gibbs sampling with Graph-based Smoothing
1:Starting dataset: $\mathcal{D}=(X,Y)$
2:$\tilde{\theta}\leftarrow\operatorname*{arg\,min}_{\tilde{\theta}}\mathbb{E}_{(x,y)\sim\mathcal{D}}\left[(y-f_{\tilde{\theta}}(x))^{2}\right]$ \triangleright Initial training
3:$\theta\leftarrow\texttt{Smooth}(\mathcal{D};\tilde{\theta})$ \triangleright GS algorithm 2
4:for $r=0,\dots,R-1$ do
5:     $\tilde{X}_{r}\leftarrow\texttt{Reduce}(X_{r};\theta)$
6:     $X_{r+1}\leftarrow\texttt{GWG}(\tilde{X}_{r};\theta)$ \triangleright GWG algorithm 3
7:end for
8:Return $\texttt{TopK}(X_{R})$ \triangleright Return Top-K best sequences based on predicted fitness $f_{\theta}$
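The driver below sketches how the pieces of Algorithm 1 fit together; `train_model` and `gwg_round` stand in for the model-fitting and Algorithm 3 subroutines and, like the other helpers, are assumptions of this sketch rather than the released code.

```python
def ggs(train_seqs, train_fitness, rounds=15, top_k=128):
    f_noisy = train_model(train_seqs, train_fitness)           # initial noisy model
    nodes, edges = create_graph(train_seqs)                    # Algorithm 4
    y_hat = smooth_labels(nodes, edges, f_noisy, gamma=1.0)    # Algorithm 2 (GS)
    f_theta = train_model(nodes, y_hat)                        # retrain on smoothed labels

    pool = list(train_seqs)
    for _ in range(rounds):                                    # R rounds of sampling
        pool = reduce_pool(pool, f_theta)                      # eq. 6
        pool = gwg_round(pool, f_theta)                        # Algorithm 3
    return sorted(pool, key=f_theta, reverse=True)[:top_k]     # Top-K by predicted fitness
```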

4 Experiments

Our experiments demonstrate the benefits of smoothing in protein optimization. Section 4.1 presents a set of challenging tasks based on the GFP and AAV proteins that emulate starting with experimental noise and a sparsely sampled fitness landscape. Section 4.2 evaluates the performance of baselines and our method, GGS, on our benchmark. In addition, we find applying smoothing improves performance for two baselines. Section 4.3 provides sweeps over hyperparameters and analysis of GGS.

Baselines. We choose a representative set of prior works that evaluated on GFP and AAV: GFlowNets (GFN-AL) (Jain et al., 2022), model-based adaptive sampling (CbAS) (Brookes et al., 2019), greedy search (AdaLead) (Sinai et al., 2020), Bayesian optimization (BO-qei) (Wilson et al., 2017), conservative model-based optimization (CoMs) (Trabucco et al., 2021), and proximal exploration (PEX) (Ren et al., 2022). NOS (Gruver et al., 2023) performs protein optimization with diffusion models; however, their framework is tailored to antibody optimization and requires non-trivial modifications for general proteins. We were unable to evaluate Song & Li (2023) due to unrunnable public code.

GGS implementation. We use a 1D CNN (see Section B.1 for architecture and training) for the model $f_{\theta}$. To ensure a fair comparison, we use the same model architecture in baselines when possible. In graph-based smoothing (GS), we augment the graph until it has $N_{\text{nodes}}=250{,}000$ nodes; we found larger graphs did not give improvements. Similarly, we use $\tau=0.1$, $R=15$ rounds, and $N_{\text{prop}}=100$ proposals per round during GWG, at which point the sequences converge and more sampling did not give improvements. We choose the smoothing weight $\gamma=1.0$ through grid search. We study sensitivity to hyperparameters, especially $\gamma$, in Section 4.3.

4.1 Benchmark

We develop a set of tasks based on two well-studied protein systems: Green Fluorescent Protein (GFP) and Adeno-Associated Virus (AAV) (Sarkisyan et al., 2016; Bryant et al., 2021). These were chosen due to their relatively large number of measurements, 56,806 and 44,156 respectively, with sequence variability of up to 15 mutations from the wild-type. Other datasets are either too small or do not have enough sequence variability. GFP's fitness is its fluorescence as a biomarker while AAV's is its ability to package a DNA payload, e.g. for gene delivery. We found GFP and AAV to suffice in demonstrating how prior methods fail to extrapolate.

One measure of difficulty is the number of mutations required to achieve the highest known fitness; this assesses a method's exploration capability. We designate the set of optimal proteins, $X^{\text{99th}}$, as any sequence in the 99th fitness percentile of the entire dataset (this may differ from the true optimal protein found in nature; unfortunately, we must work with existing datasets since every possible protein cannot be experimentally measured). Quantitatively, we compute the minimum number of mutations required from the training set to achieve the optimal fitness:

\text{Gap}(X;X^{\text{99th}})=\texttt{min}(\{\texttt{dist}(x,\tilde{x}):x\in X,\tilde{x}\in X^{\text{99th}}\}). (8)

A high mutational gap requires the method to discover many mutations in a high-dimensional space. A second measure of difficulty is the fitness range of the starting set of sequences. Starting with a low range of fitness requires the method to learn from barely functional proteins and exploit limited knowledge to find mutations that confer higher fitness. Appendix A shows the mutational gap and starting range are necessary: we found the previous GFP benchmark (Trabucco et al., 2022) to be too "easy" since only one mutation is required to achieve the optimal fitness.
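As a concrete illustration, the mutational gap in eq. 8 can be computed as below; we use the Hamming distance, which matches the edit distance here because the variants considered are substitution-only and share the wild-type length.

```python
# Sketch of eq. 8: the minimum number of substitutions between any starting
# sequence and any sequence in the 99th fitness percentile.
def hamming(a: str, b: str) -> int:
    return sum(ca != cb for ca, cb in zip(a, b))

def mutational_gap(train_seqs, top_seqs) -> int:
    return min(hamming(x, x_top) for x in train_seqs for x_top in top_seqs)
```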

Recall the protein optimization task is to use the starting set $\mathcal{D}$ to propose a set of sequences with higher fitness. We design two difficulties, medium and hard, for GFP and AAV based on the properties of $\mathcal{D}$. We restrict the range and the mutational gap to modulate task difficulty. We found Gap $=7$ and Range $<30\%$ to suffice in finding where our baseline methods fail to discover better proteins. We use this setting as the hard difficulty and sought to develop GGS to solve it.

Table 1: GFP tasks
Difficulty Range (%) Gap $|\mathcal{D}|$
Medium 20th-40th 6 2828
Hard <30th 7 2426
Table 2: AAV tasks
Difficulty Range (%) Gap $|\mathcal{D}|$
Medium 20th-40th 6 2139
Hard <30th 7 3448
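To illustrate how splits of this kind can be carved out of the full dataset, the sketch below filters sequences to a fitness percentile and a minimum mutational gap to the 99th-percentile set; the thresholds mirror the hard setting, `hamming` is reused from the sketch above, and the helper names are our own.

```python
import numpy as np

def make_task(seqs, fitness, max_percentile=30, gap=7):
    """Return a starting dataset D restricted by fitness range and mutational gap."""
    fitness = np.asarray(fitness)
    top = [s for s, f in zip(seqs, fitness) if f >= np.percentile(fitness, 99)]
    cutoff = np.percentile(fitness, max_percentile)
    return [
        (s, f) for s, f in zip(seqs, fitness)
        if f < cutoff and min(hamming(s, t) for t in top) >= gap
    ]
```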
In-silico evaluation.

We follow prior works in using a trained evaluator model as a proxy for real-world experimental validation. A popular model choice is the TAPE transformer (Rao et al., 2019). However, we observed poorer performance from the transformer compared to a simpler CNN, matching the findings of Dallago et al. (2021). We use the CNN architecture for the evaluator due to its superior performance. Following Jain et al. (2022), each method generates 128 samples $\hat{X}=\{\hat{x}_{i}\}_{i=1}^{128}$ whose approximate fitness is predicted with the evaluator. We additionally report Diversity and Novelty, which are also used in Jain et al. (2022). Descriptions of these metrics can be found in Section B.2. We emphasize that higher diversity and novelty are not equivalent to better performance, but provide insight into the exploration and exploitation trade-offs of different methods. For instance, a random algorithm would achieve maximum diversity and novelty.

4.2 Results

We run 5 seeds and report the average metric across all seeds, with the standard deviation in parentheses. We evaluate GGS and the previously described baselines. To ensure a fair comparison, we use the same CNN architecture as the model across all methods – all our baselines (and GGS) perform model-based optimization. Since graph-based smoothing (GS) is a general technique, we sought to evaluate its effectiveness in each of our baselines. To incorporate GS, we used the smoothed predictor as a replacement in each baseline, denoted with "+ GS". Table 3 summarizes GFP results while Table 4 summarizes AAV results.

Table 3: GFP optimization results. Bold indicates improvement with smoothing.
Medium difficulty Hard difficulty
Method Fitness Diversity Novelty Fitness Diversity Novelty
GFN-AL 0.09 (0.1) 25.1 (0.5) 213 (2.2) 0.1 (0.2) 23.6 (1.0) 214 (4.2)
GFN-AL + GS 0.15 (0.1) 16.3 (1.6) 213 (2.7) 0.16 (0.2) 22.2 (0.8) 215 (4.6)
CbAS 0.14 (0.0) 9.7 (1.1) 7.2 (0.4) 0.18 (0.0) 9.6 (1.3) 7.8 (0.4)
CbAS + GS 0.66 (0.1) 3.8 (0.4) 5.0 (0.0) 0.57 (0.0) 4.2 (0.17) 6.3 (0.6)
AdaLead 0.56 (0.0) 3.5 (0.1) 2.0 (0.0) 0.18 (0.0) 5.6 (0.5) 2.8 (0.4)
AdaLead + GS 0.59 (0.0) 5.5 (0.3) 2.0 (0.0) 0.39 (0.0) 3.5 (0.1) 2.0 (0.0)
BOqei 0.20 (0.0) 19.3 (0.0) 0.0 (0.0) 0.0 (0.5) 94.6 (71) 54.1 (81)
BOqei + GS 0.08 (0.0) 19.3 (0.0) 0.0 (0.0) 0.01 (0.0) 13.4 (0.0) 0.0 (0.0)
CoMS 0.0 (0.1) 133 (25) 192 (12) 0.0 (0.1) 144 (7.5) 201 (3.0)
CoMS + GS 0.0 (0.5) 129 (25) 128 (84) 0.0 (0.1) 114 (36) 187 (5.7)
PEX 0.47 (0.0) 3.0 (0.0) 1.4 (0.2) 0.0 (0.0) 3.0 (0.0) 1.3 (0.3)
PEX + GS 0.45 (0.0) 2.9 (0.0) 1.2 (0.3) 0.0 (0.0) 2.9 (0.0) 1.2 (0.3)
GWG 0.1 (0.0) 33.0 (0.8) 12.8 (0.4) 0.0 (0.0) 4.2 (7.0) 7.6 (1.1)
GGS (ours) 0.76 (0.0) 3.7 (0.2) 5.0 (0.0) 0.74 (0.0) 3.6 (0.1) 8.0 (0.0)
Table 4: AAV optimization results. Bold indicates improvement with smoothing.
Medium difficulty Hard difficulty
Method Fitness Diversity Novelty Fitness Diversity Novelty
GFN-AL 0.2 (0.1) 9.6 (1.2) 19.4 (1.1) 0.1 (0.1) 11.6 (1.4) 19.6 (1.1)
GFN-AL + GS 0.18 (0.1) 9.0 (1.1) 20.6 (0.5) 0.1 (0.1) 9.5 (2.5) 19.4 (1.1)
CbAS 0.43 (0.0) 12.7 (0.7) 7.2 (0.4) 0.36 (0.0) 14.4 (0.7) 8.6 (0.5)
CbAS + GS 0.47 (0.1) 8.8 (0.9) 5.3 (0.6) 0.4 (0.0) 12.5 (0.4) 7.0 (0.0)
AdaLead 0.46 (0.0) 8.5 (0.8) 2.8 (0.4) 0.4 (0.0) 8.53 (0.1) 3.4 (0.5)
AdaLead + GS 0.43 (0.0) 3.77 (0.2) 2.0 (0.0) 0.44 (0.0) 2.9 (0.1) 2.0 (0.0)
BOqei 0.38 (0.0) 15.22 (0.8) 0.0 (0.0) 0.32 (0.0) 17.9 (0.3) 0.0 (0.0)
BOqei + GS 0.34 (0.0) 12.2 (0.3) 0.0 (0.0) 0.32 (0.0) 17.2 (0.7) 0.0 (0.0)
CoMS 0.37 (0.1) 10.1 (5.9) 8.2 (3.5) 0.26 (0.0) 10.7 (3.5) 10.0 (2.8)
CoMS + GS 0.37 (0.1) 9.0 (3.6) 8.6 (3.7) 0.22 (0.1) 13.2 (1.9) 12.6 (2.4)
PEX 0.4 (0.0) 2.8 (0.0) 1.4 (0.2) 0.3 (0.0) 2.8 (0.0) 1.3 (0.3)
PEX + GS 0.4 (0.0) 2.8 (0.0) 1.4 (0.2) 0.3 (0.0) 2.8 (0.0) 1.1 (0.2)
GWG 0.43 (0.1) 6.6 (6.3) 7.7 (0.8) 0.33 (0.0) 12.0 (0.4) 12.2 (0.4)
GGS (ours) 0.51 (0.0) 4.0 (0.2) 5.4 (0.5) 0.60 (0.0) 4.5 (0.5) 7.0 (0.0)

GGS substantially outperforms all unsmoothed baselines, consistently achieving an improvement in fitness over the starting range in each difficulty. Moreover, the smoothed baselines (rows with + GS) demonstrate up to a three-fold improvement for CbAS and AdaLead. We find larger improvements in GFP, where the sequence space is far larger than AAV's – suggesting the GFP fitness landscape is harder to optimize over.

The most difficult task is clearly the hard difficulty on GFP, where none of the baselines without smoothing can achieve fitness higher than the training set. With smoothing, GGS achieves the best fitness since its sampling procedure uses gradient-based proposals that benefit from a smooth model. Section C.2.1 presents results on additional difficulties to analyze GGS beyond hard.

We observe GGS achieves the highest fitness while exhibiting respectable diversity and novelty. Notably, GGS's novelty falls within the range of the mutational gap in each difficulty, suggesting it extrapolates an appropriate amount for each task. Our sampling procedure, GWG, fails to perform without smoothing, which agrees with its theoretical requirement of a smooth model for good performance. We conclude smoothing is a beneficial technique not only for GGS but also for some baselines. GGS achieves state-of-the-art results in our benchmark.

4.3 Analysis

We analyze the effect of varying the following hyperparameters: the number of nodes $N_{\text{nodes}}$ in the protein graph, the smoothness weight $\gamma$ in eq. 2, and the number of sampling rounds $R$ during GWG sampling. For space, we leave the analysis of the sampling temperature $\tau$ to Section C.1. Figure 3 presents the results of running GGS with different hyperparameters on the hard difficulty of GFP and AAV. Along the X-axis, we plot the median performance of the sequences during each round of GWG, where $r=0$ is initialization and $r=15$ are the sequences at the end of GWG. The Y-axis shows the predicted fitness of the smoothed model in blue while the fitness scored with our evaluator is shown in red. Interestingly, we find in the majority of cases the smoothed model's predictions are highly correlated with the evaluator along the sampling trajectory. This is despite the model being trained on 4% of the data under the hard filtering. Section C.2.2 reports the prediction error, where we find smoothing greatly improves prediction of the fitness of unseen sequences despite having higher train error.

Graph size.

We find $N_{\text{nodes}}=250{,}000$ nodes to have the best performance over a smaller graph with 100,000 nodes. Larger graphs allow for better approximation of the fitness landscape but require more compute. A future direction could be to determine the optimal graph size with node augmentation strategies other than random mutations.

Smoothing.

Too much smoothing ($\gamma=10.0$) can lead to worse performance in AAV while GFP is not sensitive. This suggests the optimal $\gamma$ is dependent on the particular fitness landscape. Since real protein landscapes are unknown, the biggest limitation of our method is determining the optimal $\gamma$. An important extension of GGS is to theoretically characterize landscapes (Buzas & Dinitz, 2013) and provide guidelines for selecting $\gamma$.

Sampling convergence.

We find a set number of rounds is required for GWG sampling to converge when the landscape is smooth enough (middle and right columns). We find additional rounds are unnecessary; in practice, more rounds can be run to ensure convergence. Results on sweeping the temperature are in Section C.1, where we see 0.1 clearly performs best for GFP and AAV.

Figure 3: GGS hyperparameter analysis on GFP and AAV hard difficulty. See Section 4.3.

5 Discussion

We present Gibbs sampling with Graph-based Smoothing (GGS) for protein optimization with a smoothed fitness landscape. Our main contribution and insight is a novel application of graph signal processing to protein optimization. We show smoothing is beneficial not only to our method but also to our baselines. To evaluate, we designed a suite of tasks around two measures of difficulty: the number of edits to achieve the 99th percentile (mutational gap) and the starting range of fitness. All baselines struggled to achieve good performance on our tasks. However, some baselines showed a three-fold improvement with smoothing. GGS performed best by combining Gibbs With Gradients with a smoothed model – demonstrating the synergy of gradient-based sampling with a smooth discrete energy-based model. Our results highlight the benefits of optimizing over a smooth landscape that may not be reflective of the true fitness landscape. We believe it is important to investigate how regularization can be used to transform protein fitness data to be compatible with modern optimization algorithms. Our goal is not to learn the excess biological noise, but to find the signal in the data to discover the best protein. We conclude with limitations.

Evaluation limitations.

The results demonstrate strong evidence for smoothing given its improvement across multiple methods. Despite this, our evaluations follow prior works by utilizing a trained model for evaluation. This can be unreliable compared to testing sequences with wet-lab validation. Unfortunately, wet-lab validation is costly and time-intensive. The ultimate test would be to use GGS in an active learning or experimental pipeline with wet-lab validation in the loop.

Method limitations.

Our method utilizes several hyperparameters such as the graph size and smoothing parameter $\gamma$. We demonstrated the effects of each hyperparameter in Section 4.3. Given the success of smoothing, it is desirable to find systematic ways to determine optimal hyperparameters based on an approximation of the underlying fitness landscape. We demonstrated our hyperparameter choices are not specific to either AAV or GFP, but this does not guarantee optimality for new landscapes. We believe the connections between spectral graph theory and protein optimization have more to offer in advancing the important problem of protein optimization.

Acknowledgments

The authors thank Hannes Stärk, Rachel Wu, Nathaniel Bennett, Sean Murphy, Jaedong Hwang, Josef Šivic, and Tomáš Pluskal for helpful discussion and feedback.

JY was supported in part by an NSF-GRFP. JY, RB, and TJ acknowledge support from NSF Expeditions grant (award 1918839: Collaborative Research: Understanding the World Through Code), Machine Learning for Pharmaceutical Discovery and Synthesis (MLPDS) consortium, the Abdul Latif Jameel Clinic for Machine Learning in Health, the DTRA Discovery of Medical Countermeasures Against New and Emerging (DOMANE) threats program, the DARPA Accelerated Molecular Discovery program and the Sanofi Computational Antibody Design grant. IF is supported by the Office of Naval Research, the Howard Hughes Medical Institute (HHMI), and NIH (NIMH-MH129046). RS was partly supported by the European Regional Development Fund under the project IMPACT (reg. no. CZ.02.1.01/0.0/0.0/15_00315\_003/0000468), the Ministry of Education, Youth and Sports of the Czech Republic through the e-INFRA CZ (ID:90254), and the MISTI Global Seed Funds under the MIT-Czech Republic Seed Fund.

References

  • Anderson et al. (2021) Dave W Anderson, Florian Baier, Gloria Yang, and Nobuhiko Tokuriki. The adaptive landscape of a metallo-enzyme is shaped by environment-dependent epistasis. Nature Communications, 12(1):3867, 2021.
  • Angermueller et al. (2020) Christof Angermueller, David Dohan, David Belanger, Ramya Deshpande, Kevin Murphy, and Lucy Colwell. Model-based reinforcement learning for biological sequence design. 2020.
  • Arnold (1998) Frances H Arnold. Design by directed evolution. Accounts of chemical research, 31(3):125–131, 1998.
  • Brookes et al. (2019) David Brookes, Hahnbeom Park, and Jennifer Listgarten. Conditioning by adaptive sampling for robust design. In International conference on machine learning, pp.  773–782. PMLR, 2019.
  • Brookes et al. (2022) David H Brookes, Amirali Aghazadeh, and Jennifer Listgarten. On the sparsity of fitness functions and implications for learning. Proceedings of the National Academy of Sciences, 119(1):e2109649118, 2022.
  • Bryant et al. (2021) Drew H Bryant, Ali Bashir, Sam Sinai, Nina K Jain, Pierce J Ogden, Patrick F Riley, George M Church, Lucy J Colwell, and Eric D Kelsic. Deep diversification of an aav capsid protein by machine learning. Nature Biotechnology, 39(6):691–696, 2021.
  • Buzas & Dinitz (2013) Jeffrey Buzas and Jeffrey Dinitz. An analysis of nk landscapes: Interaction structure, statistical properties, and expected number of local optima. IEEE Transactions on Evolutionary Computation, 18(6):807–818, 2013.
  • Dallago et al. (2021) Christian Dallago, Jody Mou, Kadina E Johnston, Bruce J Wittmann, Nicholas Bhattacharya, Samuel Goldman, Ali Madani, and Kevin K Yang. Flip: Benchmark tasks in fitness landscape inference for proteins. bioRxiv, pp.  2021–11, 2021.
  • Emami et al. (2023) Patrick Emami, Aidan Perreault, Jeffrey Law, David Biagioni, and Peter St John. Plug & play directed evolution of proteins with gradient-based discrete mcmc. Machine Learning: Science and Technology, 4(2):025014, 2023.
  • Frey et al. (2023) Nathan C Frey, Daniel Berenberg, Karina Zadorozhny, Joseph Kleinhenz, Julien Lafrance-Vanasse, Isidro Hotzel, Yan Wu, Stephen Ra, Richard Bonneau, Kyunghyun Cho, et al. Protein discovery with discrete walk-jump sampling. arXiv preprint arXiv:2306.12360, 2023.
  • Ghafari & Weissman (2019) Mahan Ghafari and Daniel B. Weissman. The expected time to cross extended fitness plateaus. Theoretical Population Biology, 129:54–67, 2019. ISSN 0040-5809. doi: https://doi.org/10.1016/j.tpb.2019.03.008. URL https://www.sciencedirect.com/science/article/pii/S0040580918301011. Special issue in honor of Marcus Feldman’s 75th birthday.
  • Grathwohl et al. (2021) Will Grathwohl, Kevin Swersky, Milad Hashemi, David Duvenaud, and Chris Maddison. Oops i took a gradient: Scalable sampling for discrete distributions. In International Conference on Machine Learning, pp.  3831–3841. PMLR, 2021.
  • Gruver et al. (2023) Nate Gruver, Samuel Stanton, Nathan C Frey, Tim GJ Rudner, Isidro Hotzel, Julien Lafrance-Vanasse, Arvind Rajpal, Kyunghyun Cho, and Andrew Gordon Wilson. Protein design with guided discrete diffusion. arXiv preprint arXiv:2305.20009, 2023.
  • Isufi et al. (2022) Elvin Isufi, Fernando Gama, David I Shuman, and Santiago Segarra. Graph filters for signal processing and machine learning on graphs. arXiv preprint arXiv:2211.08854, 2022.
  • Jain et al. (2022) Moksh Jain, Emmanuel Bengio, Alex Hernandez-Garcia, Jarrid Rector-Brooks, Bonaventure F. P. Dossou, Chanakya Ajit Ekbote, Jie Fu, Tianyu Zhang, Michael Kilgour, Dinghuai Zhang, Lena Simine, Payel Das, and Yoshua Bengio. Biological sequence design with GFlowNets. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp.  9786–9801. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/jain22a.html.
  • Kauffman & Weinberger (1989) Stuart A Kauffman and Edward D Weinberger. The nk model of rugged fitness landscapes and its application to maturation of the immune response. Journal of theoretical biology, 141(2):211–245, 1989.
  • Kingma & Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • (18) Minji Lee, Luiz Felipe Vecchietti, Hyunkyu Jung, Hyunjoo Ro, Meeyoung Cha, and Ho Min Kim. Protein sequence design in a latent space via model-based reinforcement learning.
  • Maus et al. (2022) Natalie Maus, Haydn Jones, Juston Moore, Matt J Kusner, John Bradshaw, and Jacob Gardner. Local latent space bayesian optimization over structured inputs. Advances in Neural Information Processing Systems, 35:34505–34518, 2022.
  • Meier et al. (2021) Joshua Meier, Roshan Rao, Robert Verkuil, Jason Liu, Tom Sercu, and Alex Rives. Language models enable zero-shot prediction of the effects of mutations on protein function. Advances in Neural Information Processing Systems, 34:29287–29303, 2021.
  • Müllner (2011) Daniel Müllner. Modern hierarchical, agglomerative clustering algorithms. arXiv preprint arXiv:1109.2378, 2011.
  • Notin et al. (2022) Pascal Notin, Mafalda Dias, Jonathan Frazer, Javier Marchena Hurtado, Aidan N Gomez, Debora Marks, and Yarin Gal. Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval. In International Conference on Machine Learning, pp.  16990–17017. PMLR, 2022.
  • Padmakumar et al. (2023) Vishakh Padmakumar, Richard Yuanzhe Pang, He He, and Ankur P Parikh. Extrapolative controlled sequence generation via iterative refinement. arXiv preprint arXiv:2303.04562, 2023.
  • Rao et al. (2019) Roshan Rao, Nicholas Bhattacharya, Neil Thomas, Yan Duan, Xi Chen, John Canny, Pieter Abbeel, and Yun S Song. Evaluating protein transfer learning with tape. In Advances in Neural Information Processing Systems, 2019.
  • Remington (2011) S James Remington. Green fluorescent protein: a perspective. Protein Science, 20(9):1509–1519, 2011.
  • Ren et al. (2022) Zhizhou Ren, Jiahan Li, Fan Ding, Yuan Zhou, Jianzhu Ma, and Jian Peng. Proximal exploration for model-guided protein sequence design. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp.  18520–18536. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/ren22a.html.
  • Sarkisyan et al. (2016) Karen S Sarkisyan, Dmitry A Bolotin, Margarita V Meer, Dinara R Usmanova, Alexander S Mishin, George V Sharonov, Dmitry N Ivankov, Nina G Bozhanova, Mikhail S Baranov, Onuralp Soylemez, et al. Local fitness landscape of the green fluorescent protein. Nature, 533(7603):397–401, 2016.
  • Sinai et al. (2020) Sam Sinai, Richard Wang, Alexander Whatley, Stewart Slocum, Elina Locane, and Eric D Kelsic. Adalead: A simple and robust adaptive greedy search algorithm for sequence design. arXiv preprint arXiv:2010.02141, 2020.
  • Song & Li (2023) Zhenqiao Song and Lei Li. Importance weighted expectation-maximization for protein sequence design. arXiv preprint arXiv:2305.00386, 2023.
  • Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014.
  • Stanton et al. (2022) Samuel Stanton, Wesley Maddox, Nate Gruver, Phillip Maffettone, Emily Delaney, Peyton Greenside, and Andrew Gordon Wilson. Accelerating bayesian optimization for biological sequence design with denoising autoencoders. arXiv preprint arXiv:2203.12742, 2022.
  • Trabucco et al. (2021) Brandon Trabucco, Aviral Kumar, Xinyang Geng, and Sergey Levine. Conservative objective models for effective offline model-based optimization. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp.  10358–10368. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/trabucco21a.html.
  • Trabucco et al. (2022) Brandon Trabucco, Xinyang Geng, Aviral Kumar, and Sergey Levine. Design-bench: Benchmarks for data-driven offline model-based optimization. CoRR, abs/2202.08450, 2022. URL https://arxiv.org/abs/2202.08450.
  • Wilson et al. (2017) James T Wilson, Riccardo Moriconi, Frank Hutter, and Marc Peter Deisenroth. The reparameterization trick for acquisition functions. arXiv preprint arXiv:1712.00424, 2017.
  • Zanella (2020) Giacomo Zanella. Informed proposals for local mcmc in discrete spaces. Journal of the American Statistical Association, 115(530):852–865, 2020.
  • Zerbe et al. (2012) Brandon S Zerbe, David R Hall, Sandor Vajda, Adrian Whitty, and Dima Kozakov. Relationship between hot spot residues and ligand binding hot spots in protein–protein interfaces. Journal of chemical information and modeling, 52(8):2236–2244, 2012.
  • Zhou & Schölkopf (2004) Dengyong Zhou and Bernhard Schölkopf. A regularization framework for learning from graph data. In ICML 2004 Workshop on Statistical Relational Learning and Its Connections to Other Fields (SRL 2004), pp.  132–137, 2004.

Appendix A Additional GFP analysis

Design-bench difficulty. Prior works have used the GFP task introduced by design-bench (DB), a suite of model-based reinforcement learning tasks (Trabucco et al., 2022), which samples a starting set of 5,000 sequences from the 50-60th percentile fitness range. However, we found this task to be too easy in the sense that only one mutation is required from sequences in the training set to achieve the 99th percentile. We quantify this difficulty using the mutational gap described in eq. 8. Our proposed medium and hard difficulties (Tables 1 and 2) require many more mutations to reach the top fitness percentile, see Figure 4. Similar issues may be present in other benchmarks.

Figure 4: Easy is taken from design-bench where sequences between the 50-60th percentile are used in training regardless of edit distance to sequences in the 99th percentile. Data leakage is present due to multiple measurements that allows the wild-type and other top sequences to be included during training. Medium filters the training dataset to have sequences in the 20-40th percentile and be 6 or more mutations away from anything in the top 99th percentile. Hard similarly filters for sequences in at most the 30th percentile and 7 or more mutations away.

Appendix B Additional methods

B.1 CNN architecture

We utilize a 1D convolutional neural network (CNN) architecture for both our fitness model and the evaluation oracle. The CNN takes a one-hot encoded sequence as input, then applies a 1D convolution with kernel width 5 followed by max-pooling and a dense layer to a single node that outputs a scalar value. It uses 256 channels throughout for a total of 157,000 parameters. Despite its simplicity, we find the CNN to outperform Transformers; this corroborates the results in Dallago et al. (2021) that a simple CNN can be effective in low data regimes.

Training is performed with batch size 1024, the Adam optimizer (Kingma & Ba, 2014) (with $\beta_{1}=0.9,\beta_{2}=0.999$), learning rate 0.0001, and 50 epochs, using a single Nvidia A6000 GPU.
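A sketch of this architecture and training setup in PyTorch is given below; layer names are ours and the exact parameter count of the released model may differ from this reconstruction.

```python
import torch
import torch.nn as nn

class FitnessCNN(nn.Module):
    def __init__(self, vocab_size: int = 20, channels: int = 256):
        super().__init__()
        self.conv = nn.Conv1d(vocab_size, channels, kernel_size=5, padding=2)
        self.head = nn.Linear(channels, 1)

    def forward(self, x):                              # x: (batch, M, vocab_size) one-hot
        h = torch.relu(self.conv(x.transpose(1, 2)))   # -> (batch, channels, M)
        h = h.max(dim=-1).values                       # global max-pool over positions
        return self.head(h).squeeze(-1)                # scalar fitness per sequence

model = FitnessCNN()
opt = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
loss_fn = nn.MSELoss()                                 # 50 epochs, batch size 1024
```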

B.2 Metrics

We provide mathematical definitions of each metric. Note $g_{\phi}$ is the evaluator trained to predict the approximate fitness as a proxy for experimental validation.

  • (Normalized) Fitness = $\texttt{median}(\{\xi(\hat{x}_{i};Y^{*})\}_{i=1}^{N_{\text{samples}}})$ where $\xi(\hat{x};Y^{*})=\frac{g_{\phi}(\hat{x})-\texttt{min}(Y^{*})}{\texttt{max}(Y^{*})-\texttt{min}(Y^{*})}$ is the min-max normalized fitness based on the lowest and highest known fitness in $Y^{*}$.

  • Diversity = $\texttt{median}(\{\texttt{dist}(x,\tilde{x}):x,\tilde{x}\in\hat{X},x\neq\tilde{x}\})$ is the median pairwise distance among samples.

  • Novelty = $\texttt{median}(\{\eta(\hat{x}_{i};X)\}_{i=1}^{N_{\text{samples}}})$ where $\eta(x;X)=\texttt{min}(\{\texttt{dist}(x,\tilde{x}):\tilde{x}\in X,\tilde{x}\neq x\})$ is the minimum distance of sample $x$ to any of the starting sequences $X$.
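The three metrics can be computed as in the following sketch, reusing the `hamming` helper from the earlier sketches as the distance function; evaluator scores are min-max normalized against the lowest and highest known fitness values.

```python
import numpy as np

def normalized_fitness(samples, evaluator, y_star):
    lo, hi = min(y_star), max(y_star)
    return float(np.median([(evaluator(x) - lo) / (hi - lo) for x in samples]))

def diversity(samples):
    d = [hamming(a, b) for i, a in enumerate(samples) for b in samples[i + 1:]]
    return float(np.median(d))

def novelty(samples, train_seqs):
    return float(np.median([min(hamming(x, t) for t in train_seqs) for x in samples]))
```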

Algorithm 2 Smooth: Graph-based Smoothing
1:Sequences: $X$
2:Noisy model weights: $\tilde{\theta}$
3:$V,E\leftarrow\texttt{CreateGraph}(X)$ \triangleright Construct graph (Algorithm 4).
4:$L\leftarrow\texttt{GraphLaplacian}(V,E)$ \triangleright Compute graph Laplacian.
5:$Y\leftarrow[f_{\tilde{\theta}}(x_{1}),\dots,f_{\tilde{\theta}}(x_{N_{\text{nodes}}})]^{\top}$
6:$\hat{Y}\leftarrow(\mathds{I}+\gamma L)^{-1}Y$ \triangleright Compute smoothed fitness labels.
7:$\theta\leftarrow\operatorname*{arg\,min}_{\theta}\mathbb{E}_{(x,\hat{y})\sim(V,\hat{Y})}\left[(\hat{y}-f_{\theta}(x))^{2}\right]$ \triangleright Train on smoothed dataset.
8:Return $\theta$
Algorithm 3 GWG: Gibbs With Gradients
1:Parent sequences: $X$
2:Model weights: $\theta$
3:$X^{\prime}\leftarrow\emptyset$
4:for $x\in X$ do
5:     for $i=1,\dots,N_{\text{prop}}$ do \triangleright Number of proposals per sequence.
6:         $x^{\prime}\leftarrow x$
7:         $(i^{\mathrm{loc}},j^{\mathrm{sub}})\sim q(\cdot|x)$ \triangleright Sample index and token (eq. 4).
8:         $x^{\prime}_{i^{\mathrm{loc}}}\leftarrow\mathcal{V}_{j^{\mathrm{sub}}}$ \triangleright Apply mutation.
9:         if accept using eq. 5 then
10:              $X^{\prime}\leftarrow X^{\prime}\cup\{x^{\prime}\}$
11:         end if
12:     end for
13:end for
14:Return $X^{\prime}$ \triangleright Return accepted sequences.
Algorithm 4 CreateGraph
1:Sequences: $X$
2:$V\leftarrow X$ \triangleright Construct nodes.
3:while $|V|\leq N_{\text{nodes}}$ do
4:     $x\sim\mathcal{U}(V)$
5:     $x^{\prime}\leftarrow\texttt{PointMutation}(x)$ \triangleright Sample a point mutation uniformly at random.
6:     $V\leftarrow V\cup\{x^{\prime}\}$ \triangleright Add the mutated sequence as a new node.
7:end while
8:$E\leftarrow\bigcup_{x\in V}\texttt{kNN}(x;V)$ \triangleright Construct edges (Algorithm 5).
9:Return $(V,E)$
Algorithm 5 kNN
1:Current node: $x$
2:All nodes: $V$
3:$\mathcal{D}(x)\leftarrow\bigcup_{x^{\prime}\in V\setminus\{x\}}\mathrm{dist}(x^{\prime},x)$ \triangleright Levenshtein distance between $x$ and every other node.
4:$\mathcal{X}^{\prime}\leftarrow\texttt{TopK}(\mathcal{D}(x),V)$ \triangleright Compute the K closest sequences to $x$.
5:$\mathrm{E}(x)\leftarrow\bigcup_{x^{\prime}\in\mathcal{X}^{\prime}}(x,x^{\prime})$ \triangleright Construct neighborhood around $x$.
6:Return $\mathrm{E}(x)$
Figure 5: Illustration of clustered sampling. $\tilde{V}_{r}$ is the starting set of sequences for sampling in round $r$. GWG (Algorithm 3) is run to generate many sample sequences, $V_{r+1}$. To control computation, we hierarchically cluster all sampled sequences based on Levenshtein distance and take the top fitness sequence in each cluster, using our trained fitness prediction model $f_{\theta}$ to score each sequence – we refer to this subroutine as Reduce (eq. 6). The top sequences, $\tilde{V}_{r+1}$, are used for the next round.

Appendix C Additional results

C.1 Sampling temperature sweep

We determine the effect of different temperatures $\tau$ when running GGS on the hard difficulty for GFP and AAV. All other hyperparameters follow those used in the main results, see Section 4.2. Table 5 shows the results, where clearly $\tau=0.1$ performs best for both AAV and GFP.

Table 5: Temperature sweep.
GFP hard AAV hard
Temperature ($\tau$) Fitness Diversity Novelty Fitness Diversity Novelty
0.01 0.65 (0.0) 5.3 (0.8) 7.4 (0.5) 0.45 (0.0) 15.2 (1.1) 9.0 (0.0)
0.1 0.74 (0.0) 3.6 (0.1) 8.0 (0.0) 0.6 (0.0) 4.5 (0.2) 7.0 (0.0)
1.0 0.0 (0.1) 28.2 (0.8) 11.4 (0.5) 0.45 (0.0) 11.9 (0.5) 8.0 (0.0)
2.0 0.0 (0.1) 36.1 (1.0) 13.0 (0.0) 0.33 (0.0) 16.7 (0.9) 8.5 (0.5)

C.2 Smoothing analysis

In this section, we provide further analyses of the effect of smoothing on the performance of GGS, extrapolation to unseen data, and the acceptance rate of the GWG sampling procedure. Throughout, we use the same parameters $\tau=0.1$, $\gamma=1$, $R=15$, $N_{\text{nodes}}=250{,}000$ as in the main text.

C.2.1 Additional Benchmarks

We first define additional benchmarks, one easier, and three harder, for each protein dataset.

Table 6: GFP extra tasks
Difficulty Range (%) Gap $|\mathcal{D}|$
Easy 50th-60th 0 5609
Harder1 <30th 8 1129
Harder2 <20th 8 792
Harder3 <10th 8 397
Table 7: AAV extra tasks
Difficulty Range (%) Gap $|\mathcal{D}|$
Easy 50th-60th 0 4413
Harder1 <30th 13 1157
Harder2 <20th 13 920
Harder3 <10th 13 476

We note that the “easy” GFP task is equivalent to the design-bench baseline that is sometimes used as a benchmark in protein engineering tasks. Due to experimental noise, protein variants are assayed multiple times, and can be assigned multiple fitness values, which means the fitness values of one sequence may occupy a large percentile range. In the case of this task, multiple measurements of the wildtype GFP fitness are found in the 50th-60th percentile range. Because WT GFP is also a “top sequence,” this task necessarily has a mutational gap of 0. Due to this leakage, we develop our own benchmarks in the main text, and extend those to AAV.

C.2.2 How smoothing affects performance

The following two tables show how a smoothed model outperforms its unsmoothed counterpart according to our evaluator across all GFP/AAV benchmarks, except AAV Harder2 (see $(\ast)$) and GFP Harder3, where the smoothing was not sufficient to induce successful GWG sampling (see Table 10).

Table 8: Smoothing improves GGS performance on GFP tasks
Difficulty Smoothed Median Fitness Diversity Novelty
Easy No 0.05 24.83 13.36
Yes 0.84 5.45 3.51
Medium No 0.51 10.5 15.4
Yes 0.76 3.7 5.0
Hard No 0.10 23.02 16.8
Yes 0.74 3.6 8.0
Harder1 No 0.00 22.86 17.0
Yes 0.67 4.45 9.12
Harder2 No 0.00 22.22 16.5
Yes 0.60 5.42 9.82
Harder3 No 0.00 23.02 16.8
Yes 0.00 15.73 21.2

For the GFP task, our model fails (achieves 0 median fitness) when we restrict the data to the 10th percentile and mutational gap 8, where $|\mathcal{D}|=397$.

Table 9: Smoothing improves GGS performance on AAV tasks
Difficulty Smoothed Median Fitness Diversity Novelty
Easy No 0.47 2.69 7.81
Yes 0.49 9.18 7.99
Medium No 0.37 6.60 6.62
Yes 0.48 4.66 5.59
Hard No 0.33 12.32 13.8
Yes 0.60 4.5 7.0
Harder1 No 0.30 0.53 6.00
Yes 0.31 13.80 14.679
Harder2 No 0.28 4.46 11.93
Yes 0.27 15.98 19.41
Harder3 No 0.25 3.08 5.63
Yes 0.38 7.05 9.486

$(\ast)$: The unsmoothed model only outperforms its smoothed counterpart when applying GWG to the unsmoothed model generates only a few unique sequences near the starting set (as evidenced by the low novelty on this benchmark).

For AAV, we find the model is still able to find signal and achieve 0.384 evaluated fitness despite the data being limited to the 10th percentile and a mutational gap of 13, where $|\mathcal{D}|=476$. It is notable, though, that the performance improvements gained from smoothing are smaller than in the case of GFP. Presumably, this is due to the vastly reduced dimension of the AAV sequence space compared to GFP, which may allow a neural network to learn a smoother landscape without any regularization.

C.2.3 How smoothing affects extrapolation + sampling

The following tables show the benefits of smoothing on extrapolation to held out ground truth experimental data, up to a certain difficulty benchmark, as well as how smoothing vastly improves the acceptance rate for the GWG sampling procedure.

Table 10: Smoothing improves extrapolation and GWG sampling, up to GFP Harder3
Difficulty Smoothed Train MAE Holdout MAE Acc. Rate
Easy No 0.03 0.99 0.02
Yes 0.71 0.61 0.99
Medium No 0.10 1.29 0.61
Yes 0.20 0.88 0.62
Hard No 0.06 1.44 0.01
Yes 0.15 0.93 0.43
Harder1 No 0.07 1.39 0.01
Yes 0.15 0.94 0.43
Harder2 No 0.01 1.41 0.01
Yes 0.12 0.90 0.59
Harder3 No 0.01 1.41 0.01
Yes 0.01 1.42 0.01
Table 11: Smoothing improves extrapolation up to AAV Hard and GWG sampling on all AAV tasks
Difficulty Smoothed Train MAE Holdout MAE Acc. Rate
Easy No 0.28 2.82 0.01
Yes 1.76 2.28 0.99
Medium No 0.35 3.12 0.01
Yes 0.44 2.76 0.82
Hard No 0.48 3.70 0.30
Yes 0.55 3.09 0.78
Harder1 No 0.66 3.99 0.01
Yes 0.69 4.24 0.47
Harder2 No 0.56 4.13 0.01
Yes 0.58 4.37 0.55
Harder3 No 0.47 4.58 0.01
Yes 0.47 4.59 0.64

For each benchmark category, we evaluated the impact of smoothing on extrapolation abilities by analyzing the Mean Absolute Error (MAE) of the models on that benchmark’s training and holdout datasets from the experimental ground truth. The effectiveness of smoothing was indicated by reduced MAE values on the holdout set. We also find that the MAE on the training set is lower for the unsmoothed models, as expected. In line with the results of the previous section, the effect of smoothing is reduced for AAV. As task difficulty increases, for both proteins, the effectiveness of smoothing on extrapolation decreases, which we expect as any signal leading from the training set to the fitter sequences gets obscured as training set size decreases.

Finally, we note that in every case except two, smoothing dramatically increases the acceptance rate of the GWG sampling procedure, which aligns with the dependence of GWG's sampling efficiency on the smoothness of the energy function. In the case of the hardest GFP task, even the smoothed model had overfit to the training set. As for the GFP medium task, we suspect that this particular section of the experimental dataset allowed the unsmoothed model to learn a smooth landscape initially.