
Adaptive machine learning for protein engineering

Brian L. Hie and Kevin K. Yang ([email protected])
Abstract

Machine-learning models that learn from data to predict how protein sequence encodes function are emerging as a useful protein engineering tool. However, when using these models to suggest new protein designs, one must deal with the vast combinatorial complexity of protein sequences. Here, we review how to use a sequence-to-function machine-learning surrogate model to select sequences for experimental measurement. First, we discuss how to select sequences through a single round of machine-learning optimization. Then, we discuss sequential optimization, where the goal is to discover optimized sequences and improve the model across multiple rounds of training, optimization, and experimental measurement.

Keywords: machine learning, protein engineering, model-based optimization, adaptive sampling, Bayesian optimization, Gaussian process

Journal: Current Opinion in Structural Biology

Affiliations:
[inst1] Department of Biochemistry, Stanford University School of Medicine, Stanford, CA 94305, USA
[inst2] Stanford ChEM-H, Stanford University, Stanford, CA 94305, USA
[inst3] Microsoft Research New England, Cambridge, MA 02142, USA

Corresponding author.

1 Introduction

Protein engineering seeks to design or discover novel proteins with useful properties [1], but doing so is challenging because (i) evaluating a given design is costly (requiring a laboratory experiment to express and characterize each design) and (ii) the search space of all possible protein sequences is very large (with twenty natural amino acids, there are 20^100 possible sequences of length 100, a search space larger than the number of atoms in the universe) [2].

To reduce the number of evaluations required, researchers can leverage machine learning to train a surrogate model that predicts the property of interest from the protein sequence. Because making predictions from a machine-learning model is less expensive and faster than conducting a wet-lab experiment, the surrogate model can reduce the overall experimental burden [2, 3]. However, once a sequence-to-function model is available, a second consideration remains: how does one select new designs from the combinatorially large protein search space?

In this review, we address this question through what we call “adaptive machine learning,” which covers two problem settings. First, we consider the problem of using a trained machine-learning surrogate model to select one or more optimized sequences for laboratory measurement. Second, we consider the problem of finding optimized sequences over sequential rounds of experimental measurement, model training, and property prediction. We describe key principles, highlight useful examples from the literature, and discuss areas that would benefit from future methodological improvements.

2 Overview and key principles

There are four important components to consider when doing adaptive learning for protein optimization: the property to optimize, the sequence-to-function predictor, the acquisition function that prioritizes designs, and the generative model that proposes the designs (Figure 1A-D).

Figure 1: Overview of adaptive machine learning for protein engineering. Four key components are the optimization property, the surrogate model that predicts the property given a sequence, a generative model that proposes sequences, and an acquisition function that prioritizes sets of sequences to measure given sequence information and predictions from the surrogate model. Sequences can be acquired from an explicit design space or from sequences proposed by a generative model. Optionally, the surrogate model’s predictions can guide the generative model toward better designs through adaptive sampling. Acquired sequences can be measured in the laboratory and the resulting data can help guide future rounds of optimization.

2.1 The property

The optimization property is the phenotype of interest (Figure 1A). For example, one could maximize the fluorescence intensity at a given excitation wavelength. Multiple properties may be considered, such as maximizing the enzymatic activity while minimizing the immunogenicity of a protein drug.

2.2 The predictor

The surrogate model takes in protein sequence information and predicts the value of the optimization property (Figure 1B). The surrogate model requires a number of design considerations, including how to represent the protein sequence and what kind of machine-learning algorithm to use. Typical sequence representations range from a simple binary encoding of the raw sequence to more complex continuous neural encodings [4, 5]. Model architectures also range in complexity from linear regression to deep neural networks, and have been reviewed in-depth previously [2, 5]. Often, it is also desirable for the surrogate model to quantify the uncertainty in its predictions in order to make reliable decisions. Popular models that provide a notion of uncertainty include Gaussian processes and model ensembles, which we review in Section 4.
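As a concrete illustration of the simplest representation mentioned above, here is a minimal sketch of a binary (one-hot) sequence encoding; the alphabet ordering and function names are our own, not taken from any particular library.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the twenty natural amino acids
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(sequence: str) -> np.ndarray:
    """Encode a protein sequence as a flattened L x 20 binary matrix."""
    x = np.zeros((len(sequence), len(AMINO_ACIDS)))
    for pos, aa in enumerate(sequence):
        x[pos, AA_INDEX[aa]] = 1.0
    return x.ravel()

features = one_hot_encode("MKTAYIA")  # 7 positions -> 7 * 20 = 140 features
```

The resulting vector can be fed to any regressor, from linear regression to a neural network, as discussed above.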

2.3 The prioritizer

An acquisition function uses sequence information and the output of the surrogate model to prioritize designs for experimental measurements (Figure 1D). The simplest acquisition function greedily selects the top prediction (or the top few predictions) according to the surrogate model. Greedy acquisition is common in practice and can work well [3, 6, 7, 8, 9, 10], but can also limit search to narrower regions of the protein landscape (Figure 2A), which we discuss further in Section 4.
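Greedy acquisition can be sketched in a few lines; `surrogate_predict` below is a stand-in for any trained sequence-to-function model, and the toy scoring function in the usage line is purely illustrative.

```python
import numpy as np

def greedy_acquire(candidates, surrogate_predict, batch_size=3):
    """Return the batch_size candidates with the highest predictions."""
    scores = np.array([surrogate_predict(seq) for seq in candidates])
    top = np.argsort(scores)[::-1][:batch_size]  # indices of top predictions
    return [candidates[i] for i in top]

# Toy usage: score sequences by alanine count (illustrative only).
best = greedy_acquire(["AAA", "AAB", "ABB", "BBB"],
                      lambda s: s.count("A"), batch_size=2)
```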

Many standard acquisition functions are designed to select only a single example on each round. Often, however, obtaining many experimental measurements in parallel is desirable. While simply acquiring several of the top-ranked sequences is possible, this approach may be prone to acquiring many similar examples (Figure 2B). Special methods for batched acquisition are therefore designed to encourage acquiring more diverse sequences [11, 12, 13, 14, 15], but this remains an open area of methodological development.
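As one illustration of how a batch can be diversified, the sketch below selects sequences one at a time, penalizing similarity (here, fractional Hamming similarity) to sequences already in the batch. The penalty weight `lam` and all function names are illustrative, not a specific published method.

```python
def hamming_similarity(a, b):
    """Fraction of positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def diverse_batch(candidates, scores, batch_size=3, lam=1.0):
    """Greedily pick high-scoring candidates, penalizing batch similarity."""
    batch = []
    remaining = list(range(len(candidates)))
    while remaining and len(batch) < batch_size:
        def penalized(i):
            if not batch:
                return scores[i]
            sim = max(hamming_similarity(candidates[i], candidates[j])
                      for j in batch)
            return scores[i] - lam * sim
        best = max(remaining, key=penalized)
        batch.append(best)
        remaining.remove(best)
    return [candidates[i] for i in batch]

# The second pick trades a little score for much more diversity:
picked = diverse_batch(["AAAA", "AAAB", "CCCC"], [1.0, 0.9, 0.5], batch_size=2)
```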

Figure 2: Overview of acquisition functions. (A) Greedy acquisition can work well but may limit search to suboptimal regions of the sequence landscape, whereas incorporating uncertainty helps an acquisition function explore the landscape. (B) Greedy batched acquisition can acquire similar examples, whereas more diverse batches provide a better picture of the global sequence landscape.

2.4 The proposer

Although evaluating the surrogate model is faster and cheaper than actually obtaining laboratory measurements, it is nevertheless impossible to evaluate the surrogate on all possible protein sequences. One approach is to explicitly define a design space of proteins. Example design spaces include traditional library designs for protein engineering such as all single- or double-mutants, all the possible mutations at a small number of sites [6, 7, 16], or a recombination library defined by mixing and matching parts of homologous parent sequences [12, 17, 18, 19].
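An explicitly defined design space of the kind described above can be enumerated directly; the parent sequence, site positions, and allowed amino acids below are illustrative.

```python
from itertools import product

def combinatorial_library(parent, site_choices):
    """Yield every variant; site_choices maps position -> allowed residues."""
    positions = sorted(site_choices)
    for combo in product(*(site_choices[p] for p in positions)):
        seq = list(parent)
        for pos, aa in zip(positions, combo):
            seq[pos] = aa
        yield "".join(seq)

# Toy library: 2 choices at site 1 times 3 choices at site 4 = 6 variants
library = list(combinatorial_library("MKTAYIA", {1: "KR", 4: "YFW"}))
```

The same pattern scales to larger site-saturation libraries, although the library size grows exponentially in the number of mutated sites.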

Another approach is to use a generative model to implicitly define the design space by learning a probability distribution over sequences (Figure 1C). A generative model performs one or both of two fundamental tasks: (i) assigning every possible sequence a likelihood of being in the proposal distribution, and (ii) generating examples of sequences from a proposal distribution. The simplest models assume this distribution can be modeled by considering sites independently or relationships between pairs of sites [20, 21]. More recently, researchers have used deep neural network generative models to learn more complex sequence distributions and propose sequences for evaluation [22]; relevant neural architectures include variational autoencoders (VAEs) [23, 24, 25], generative adversarial networks (GANs) [26, 27], and autoregressive language models [28, 29, 30, 31].
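The simplest site-independent generative model mentioned above can be sketched as per-position amino-acid frequencies estimated from example sequences (with a pseudocount for unseen residues); the function names and pseudocount value are our own.

```python
import numpy as np

def fit_site_independent(sequences, alphabet="ACDEFGHIKLMNPQRSTVWY",
                         pseudocount=1.0):
    """Estimate per-site residue frequencies with a pseudocount."""
    L = len(sequences[0])
    index = {aa: i for i, aa in enumerate(alphabet)}
    counts = np.full((L, len(alphabet)), pseudocount)
    for seq in sequences:
        for pos, aa in enumerate(seq):
            counts[pos, index[aa]] += 1
    return counts / counts.sum(axis=1, keepdims=True), index

def log_likelihood(seq, freqs, index):
    """Task (i): score a sequence under the site-independent model."""
    return sum(np.log(freqs[pos, index[aa]]) for pos, aa in enumerate(seq))
```

Sampling each position independently from `freqs` covers task (ii); pairwise and neural models relax the independence assumption at the cost of harder inference.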

| Reference and date | Optimization property | Surrogate model | Acq. func. | Generative model / design space | Seq. opt.? | In vitro? |
| --- | --- | --- | --- | --- | --- | --- |
| Fox et al. (2007) [3] | Enzyme catalytic activity | Linear regression | Greedy | Sequence recombination | Yes | Yes |
| Romero et al. (2013) [12] | Protein thermostability | Gaussian process | UCB | Sequence recombination | Yes | Yes |
| Bedbrook et al. (2017) [32] | Protein localization | Gaussian process | UCB | Sequence recombination | Yes | Yes |
| Wu et al. (2019) [6] | Enzyme catalytic activity | Regressor ensemble | Greedy | Explicitly-defined design space | Yes | Yes |
| Brookes et al. (2019) [33] | Protein fluorescence | Neural network ensemble | Greedy | Neural network (VAE) | No | No |
| Kumar and Levine (2019) [34] | Protein fluorescence | Neural network ensemble | Greedy | Neural network (GAN) | No | No |
| Gupta and Zou (2020) [27] | Antimicrobial activity | Neural network (RNN) | Greedy | Neural network (GAN) | No | No |
| Liu et al. (2020) [35] | Antibody affinity | Neural network ensemble | Greedy | Activation maximization | No | Yes |
| Wittmann et al. (2020) [7] | Protein expression and binding | Regressor ensemble | Greedy | Explicitly-defined design space | Yes | No |
| Anishchenko et al. (2020) [36] | Valid folding | Neural network (CNN) | Greedy | Sequence mutation | No | Yes |
| Biswas et al. (2021) [8] | Protein fitness, fluorescence | Linear regression | Greedy | Sequence mutation | No | Yes |
| Bryant et al. (2021) [9] | Protein viability | Classifier ensemble | Greedy | Sequence mutation | No | Yes |
| Greenhalgh et al. (2021) [37] | Enzyme catalytic activity | Gaussian process | UCB | Sequence recombination | Yes | Yes |

Table 1: Examples of adaptive learning for protein engineering. Acq. func.: acquisition function. Seq. opt.: sequential optimization; indicates studies that performed multiple rounds of variant selection and surrogate-model training. In vitro: indicates studies that obtained in-vitro measurements of new protein sequences. RNN: recurrent neural network [28]. CNN: convolutional neural network [38].

3 Model-based protein optimization

After obtaining a surrogate model, model-based optimization can prioritize sequences of interest by finding the sequence or sequences in the design space that optimize the acquisition function (Figure 1E). Notably, in this setting we do not provide the surrogate model with new training data. The simplest approach is to first obtain a sequence design space (explicitly-defined or from a generative model) and then to select sequences from this design space based on a surrogate model and a corresponding acquisition function (for example, picking the best predicted sequences in the design space) [6, 12, 39, 10].

3.1 Adaptive sampling from a generative model

Rather than separating generation and evaluation steps, the surrogate model can also help shift the generative model so that it proposes more optimal sequences. In adaptive sampling, protein sequences are sampled from a generative model, the outputs of a surrogate model are used to re-estimate the parameters of the generative model, and the process iterates until convergence [25]. For example, Brookes and Listgarten use adaptive sampling with a VAE generative model and a neural network surrogate model to optimize DNA sequences for protein expression abundance [25]. Gupta and Zou use adaptive sampling with a GAN generative model and a recurrent neural network surrogate model to design antimicrobial peptides [27]. To perform de novo protein design [40], Anishchenko et al. use adaptive sampling with a mutation-based generative model and a neural network surrogate model meant to identify sequences with valid folds [36] (Table 1).
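A minimal sketch of the adaptive-sampling loop follows, using a site-independent generative model over a toy four-letter alphabet and a toy surrogate that rewards alanine content. All specifics (alphabet, surrogate, sample and elite sizes) are illustrative, not any particular published method.

```python
import numpy as np

rng = np.random.default_rng(0)
ALPHABET = "ACDE"  # toy alphabet for brevity; illustrative only
L = 6              # sequence length

def sample(freqs, n):
    """Draw n sequences, choosing each position independently."""
    seqs = []
    for _ in range(n):
        seqs.append("".join(ALPHABET[rng.choice(len(ALPHABET), p=freqs[pos])]
                            for pos in range(L)))
    return seqs

def refit(elite, pseudocount=0.1):
    """Re-estimate per-site frequencies from the top-scoring samples."""
    counts = np.full((L, len(ALPHABET)), pseudocount)
    for seq in elite:
        for pos, aa in enumerate(seq):
            counts[pos, ALPHABET.index(aa)] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def surrogate(seq):
    return seq.count("A")  # toy surrogate: reward alanine content

freqs = np.full((L, len(ALPHABET)), 1 / len(ALPHABET))  # start uniform
for _ in range(10):
    samples = sample(freqs, 50)
    elite = sorted(samples, key=surrogate, reverse=True)[:10]
    freqs = refit(elite)  # shift the generative model toward good samples
```

After a few iterations the generative model concentrates its probability mass on sequences the surrogate scores highly, which is the behavior the constrained variants cited above are designed to keep in check.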

By default, adaptive sampling assumes a trustworthy surrogate model; however, in practice, these models are imperfect and can be prone to poor predictions in many regions of the protein space. To address this problem, adaptive sampling can avoid degeneracies in the surrogate model by constraining sampling to be close to the training distribution for the surrogate model [33, 36]. In each iteration, it is also possible to retrain the surrogate model to avoid pathologies [41]. Sequence generation can also be improved by sampling from an ensemble of generative models [42].

A closely related approach to model-based optimization uses genetic algorithms, which heuristically balance mutation and recombination to produce new sequences [43] while adaptively querying a surrogate model to decide which designs to preserve [15]. Another approach inverts the surrogate model by finding the elements of the generative model’s distribution that are most likely to have a desirable value according to the surrogate model. An inverse of the surrogate model can be trained via an iterative procedure similar to adaptive sampling [34], or, for a differentiable surrogate model, computed using gradient-based methods [35, Linder2020a, Linder2020b].

4 Sequential optimization

In sequential optimization, the surrogate model has access to multiple rounds of experimental measurement, which provide new data that can be re-incorporated into subsequent rounds of model training [44] (Figure 1E). To train the initial surrogate model, the first batch can consist of random samples from the design space [7] or sequences with known measurements (for example, from publicly available data) [8]. Sequential optimization is guided by an objective function that specifies the overall goal of the optimization procedure. The objective most relevant to protein engineering is to find the protein sequence that maximizes the optimization property (for example, find the most fluorescent sequence).
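The sequential-optimization loop can be sketched end to end; here a noisy toy oracle stands in for the wet-lab measurement, a least-squares linear model is the surrogate (retrained every round), and acquisition is greedy. Every detail (design space, featurization, oracle) is illustrative.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
ALPHABET = "AC"  # toy two-letter alphabet
DESIGN_SPACE = ["".join(s) for s in product(ALPHABET, repeat=4)]  # 16 seqs

def featurize(seq):
    # per-position indicator of 'A', plus a bias term
    return np.array([1.0 if aa == "A" else 0.0 for aa in seq] + [1.0])

def oracle(seq):
    # stand-in for a wet-lab measurement: noisy alanine count
    return seq.count("A") + 0.1 * rng.standard_normal()

measured = {}
batch = list(rng.choice(DESIGN_SPACE, size=4, replace=False))  # random start
for _ in range(3):  # three rounds of measure -> retrain -> acquire
    for seq in batch:
        measured[seq] = oracle(seq)                     # "experiment"
    X = np.array([featurize(s) for s in measured])
    y = np.array([measured[s] for s in measured])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)           # retrain surrogate
    unmeasured = [s for s in DESIGN_SPACE if s not in measured]
    preds = {s: featurize(s) @ w for s in unmeasured}
    batch = sorted(preds, key=preds.get, reverse=True)[:4]  # greedy batch
```

Replacing the greedy final line with an uncertainty-aware score turns this skeleton into the Bayesian optimization procedure discussed next.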

Greedy acquisition across experimental rounds is the simplest implementation of sequential optimization and is used widely in practice [7] (Figure 2A). Going beyond greedy acquisition means tolerating more risk for potentially higher reward. This is often described as a tradeoff between “exploitation” (equivalent to greedy acquisition) and “exploration,” in which an algorithm acquires proteins with high surrogate-model uncertainty in order to improve future predictions [45, 46, 47, 48]. Here we focus predominantly on Bayesian optimization as the main alternative to greedy acquisition. Other objectives, such as training the best overall model through active learning [44] or reformulating the problem as reinforcement learning [49], are also possible but are less common in protein engineering applications.

4.1 Bayesian optimization

Bayesian optimization searches for a protein that maximizes the optimization property [47] by leveraging Bayesian uncertainty in the surrogate function to guide the exploration-exploitation tradeoff. It relies on the acquisition function to systematically weigh the surrogate model’s prediction against its associated uncertainty. The upper confidence bound (UCB) acquisition function adds a weighted uncertainty term to the predicted value, letting the user control the influence of uncertainty, with a larger weight encouraging more exploration [46, 50] (Figure 2A). UCB has good theoretical properties [47] and is used widely in practice [12, 32, 37, 51] (Table 1). Other notable acquisition functions select the example that, in expectation, has the largest improvement over the best example in the training set or over a randomly drawn example from the training set [47, 52, 53]. Uncertainty can also help improve exploration in batched acquisition [11, 12, 13, 14] (Figure 2B).
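A minimal sketch of UCB acquisition, assuming the surrogate supplies a predicted mean and standard deviation per candidate; the value of `beta` is illustrative.

```python
import numpy as np

def ucb_acquire(candidates, means, stds, beta=2.0, batch_size=1):
    """Score candidates by predicted mean plus beta times the uncertainty."""
    scores = np.asarray(means) + beta * np.asarray(stds)
    top = np.argsort(scores)[::-1][:batch_size]
    return [candidates[i] for i in top]

# With beta = 2, a slightly worse but much more uncertain candidate wins:
picked = ucb_acquire(["seq_a", "seq_b"], means=[1.0, 0.8], stds=[0.0, 0.3])
```

Setting `beta = 0` recovers greedy acquisition; larger values push the search toward unexplored regions of the landscape.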

4.1.1 Gaussian-process surrogate models

A Gaussian process is a popular surrogate model in Bayesian optimization [54]. Gaussian process regression assumes that the optimization-property values of any set of sequences are distributed according to a multivariate Gaussian. Often, the mean of the distribution is used as the prediction value and the marginal variance can be used to compute uncertainty. Probabilistic classification with Gaussian processes is also possible but requires approximation [55, 54] and multitask Gaussian processes can be used to predict multiple optimization properties [56]. Because of their theoretical elegance, flexibility, and good performance in practice, Gaussian processes have seen wide use in protein engineering applications [12, 32, 37, 51, 57, 58].
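Exact Gaussian-process regression with an RBF kernel can be sketched directly from the standard closed-form posterior; the hyperparameters (lengthscale, noise) and function names below are illustrative.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Squared-exponential kernel between two sets of continuous inputs."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale ** 2)

def gp_posterior(X_train, y_train, X_test, noise=1e-6):
    """Posterior mean and marginal variance at the test inputs."""
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_s = rbf_kernel(X_test, X_train)
    K_ss = rbf_kernel(X_test, X_test)
    alpha = np.linalg.solve(K, y_train)
    mean = K_s @ alpha
    v = np.linalg.solve(K, K_s.T)
    var = np.diag(K_ss - K_s @ v)  # uncertainty shrinks near training data
    return mean, var
```

The posterior mean and variance feed directly into acquisition functions such as UCB; far from the training data, the variance reverts to the prior, which is what drives exploration.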

Gaussian processes are defined by a mean function and a kernel function that together specify how training examples influence the prediction. The most widely used kernel functions [59] are defined on continuous inputs and can accommodate neural embeddings or binary sequence embeddings [4, 32, 51]. Kernels can also be defined on discrete inputs, including sequences [60, 61, 62].

The cost of exact inference with Gaussian processes grows cubically with the size of the training data, which may be prohibitive on extremely large protein sequence datasets. There is a wide literature on scaling Gaussian processes, including methods for exploiting sparsity in the structure of the training data or performing approximate inference, which applies to both continuous and discrete inputs [63, 64].

4.1.2 Beyond Gaussian process regression

Bayesian optimization can also leverage more bespoke Bayesian models and algorithms for exact or approximate inference [65]. The wide interest in neural network models over the last decade has also led to increased interest in uncertainty prediction through Bayesian neural networks, in which the parameters of the network are themselves random variables with associated prior distributions, though efficient and accurate inference in these models can be challenging [66].

Probabilistic surrogate models can also be implemented by model ensembles, which train multiple sequence-to-function models on the same data and rely on variance in model predictions, due to different model architectures or randomness in the training procedure, to estimate uncertainty [35, 67, 68]. Ensembles are not Bayesian by default, so incorporating prior information into these models can be challenging [69, 70].
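A sketch of ensemble-based uncertainty follows, using bootstrapped linear fits; the spread of the members’ predictions serves as the uncertainty estimate. The bootstrap-of-linear-models choice is illustrative — deep ensembles that vary architecture or initialization work analogously.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_linear(X, y):
    """Least-squares linear fit (one ensemble member)."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def ensemble_predict(X_train, y_train, X_test, n_models=10):
    """Mean and spread of predictions across bootstrapped linear models."""
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X_train), size=len(X_train))  # bootstrap
        preds.append(X_test @ fit_linear(X_train[idx], y_train[idx]))
    preds = np.array(preds)
    return preds.mean(axis=0), preds.std(axis=0)
```

Disagreement between members tends to grow away from the training data, giving a usable (if not strictly Bayesian) uncertainty signal for acquisition.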

5 Discussion

Protein engineering is a challenging task given its proven computational hardness [71] and general biological complexity. Here we have reviewed how machine learning can help protein engineers deal with the immense complexity of the protein sequence landscape by proposing sequence designs and guiding a researcher across multiple rounds of laboratory experimentation.

While the techniques we review are a good representation of the current state-of-the-art, there is much room for methodological development. When performing Bayesian optimization, obtaining well-calibrated uncertainty estimates can be more difficult when the inputs are discrete or high-dimensional (or both).

Protein engineering also typically involves far fewer rounds (since experimental measurements are often resource-intensive) and much larger batches (for example, using multiplexed experimental designs) than is typically assumed in the theoretical literature on Bayesian optimization. Another consideration, particularly in scientific applications, is designing proteins without a natural starting point or for properties that do not exist in natural proteins. Doing so with data-driven approaches may require modeling considerations beyond sequence data alone, such as biophysics, biochemistry, and protein structure.

Acknowledgements

We thank Nicholas Bhattacharya and Sam Sinai for helpful comments and discussion. B.L.H. acknowledges the support of the Stanford Science Fellows program.

References

  • [1] F. H. Arnold, Directed Evolution: Bringing New Chemistry to Life, Angewandte Chemie - International Edition 57 (16) (2018). doi:10.1002/anie.201708408.
  • [2] K. K. Yang, Z. Wu, F. H. Arnold, Machine-learning-guided directed evolution for protein engineering, Nature Methods 16 (8) (2019) 687–694. arXiv:1811.10775, doi:10.1038/s41592-019-0496-6.
  • [3] R. J. Fox, S. C. Davis, E. C. Mundorff, L. M. Newman, V. Gavrilovic, S. K. Ma, L. M. Chung, C. Ching, S. Tam, S. Muley, J. Grate, J. Gruber, J. C. Whitman, R. A. Sheldon, G. W. Huisman, Improving catalytic function by ProSAR-driven enzyme evolution, Nature Biotechnology 25 (3) (2007) 338–344. doi:10.1038/nbt1286.
  • [4] B. J. Wittmann, K. E. Johnston, Z. Wu, F. H. Arnold, Advances in machine learning for directed evolution, Current Opinion in Structural Biology 69 (2021) 11–18.
  • [5] V. Frappier, A. E. Keating, Data-driven computational protein design, Current Opinion in Structural Biology 69 (August) (2021) 63–69. doi:10.1016/j.sbi.2021.03.009.
  • [6] Z. Wu, S. B. Jennifer Kan, R. D. Lewis, B. J. Wittmann, F. H. Arnold, Machine learning-assisted directed protein evolution with combinatorial libraries, Proceedings of the National Academy of Sciences of the United States of America 116 (18) (2019) 8852–8858. doi:10.1073/pnas.1901979116.
  • [7] B. J. Wittmann, Y. Yue, F. H. Arnold, Machine learning-assisted directed evolution navigates a combinatorial epistatic fitness landscape with minimal screening burden (2020). doi:10.1101/2020.12.04.408955. *This paper explores how different choices of sequence representations, surrogate models, and greedy (batched) acquisition functions affect the ability of machine-learning-guided directed evolution to acquire the optimal sequence over rounds of sequential optimization.
  • [8] S. Biswas, G. Khimulya, E. C. Alley, K. M. Esvelt, G. M. Church, Low-N protein engineering with data-efficient deep learning, Nature Methods 18 (4) (2021). doi:10.1038/s41592-021-01100-y. **This paper demonstrates how neural sequence embeddings combined with linear-regression surrogate models can be used to greedily acquire sequences with desirable properties using a small amount of training data.
  • [9] D. H. Bryant, A. Bashir, S. Sinai, N. K. Jain, P. J. Ogden, P. F. Riley, G. M. Church, L. J. Colwell, E. D. Kelsic, Deep diversification of an AAV capsid protein by machine learning, Nature Biotechnology (2021). doi:10.1038/s41587-020-00793-4. *This paper evaluates the ability of a logistic regression model, a recurrent neural network, and a convolutional neural network to propose diverse, viable capsid proteins.
  • [10] J. M. Singer, S. Novotney, D. Strickland, H. K. Haddox, N. Leiby, G. J. Rocklin, C. M. Chow, A. Roy, A. K. Bera, F. C. Motta, L. Cao, E.-M. Strauch, T. M. Chidyausiku, A. Ford, E. Ho, C. O. Mackenzie, H. Eramian, F. DiMaio, G. Grigoryan, M. Vaughn, L. J. Stewart, D. Baker, E. Klavins, Large-scale design and refinement of stable proteins using sequence-only models, bioRxiv (2021). doi:10.1101/2021.03.12.435185.
  • [11] J. Azimi, A. Fern, X. Z. Fern, Batch Bayesian optimization via simulation matching, Advances in Neural Information Processing Systems 23 (2010) 109–117.
  • [12] P. A. Romero, A. Krause, F. H. Arnold, Navigating the protein fitness landscape with Gaussian processes, Proceedings of the National Academy of Sciences of the United States of America 110 (3) (2013). doi:10.1073/pnas.1215251110.
  • [13] J. González, Z. Dai, P. Hennig, N. Lawrence, Batch bayesian optimization via local penalization, in: Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, AISTATS 2016, 2016.
  • [14] K. K. Yang, Y. Chen, A. Lee, Y. Yue, Batched stochastic Bayesian optimization via combinatorial constraints design, International Conference on Artificial Intelligence and Statistics 22 (2020) 3410–3419.
  • [15] S. Sinai, S. Slocum, R. Wang, E. Locane, A. Whatley, E. D. Kelsic, AdaLead: A simple and robust adaptive greedy search algorithm for sequence design, arXiv cs.LG (2010.02141) (2020). *This paper demonstrates how a simple genetic algorithm and greedy acquisition is competitive with other strategies for model-based optimization.
  • [16] N. C. Wu, L. Dai, C. A. Olson, J. O. Lloyd-Smith, R. Sun, Adaptation in protein fitness landscapes is facilitated by indirect paths, eLife 5 (2016) e16965. doi:10.7554/eLife.16965.
  • [17] C. A. Voigt, C. Martinez, Z. G. Wang, S. L. Mayo, F. H. Arnold, Protein building blocks preserved by recombination, Nature Structural Biology 9 (7) (2002). doi:10.1038/nsb805.
  • [18] C. R. Otey, M. Landwehr, J. B. Endelman, K. Hiraga, J. D. Bloom, F. H. Arnold, Structure-guided recombination creates an artificial family of cytochromes P450, PLoS Biology 4 (5) (2006). doi:10.1371/journal.pbio.0040112.
  • [19] M. A. Smith, P. A. Romero, T. Wu, E. M. Brustad, F. H. Arnold, Chimeragenesis of distantly-related proteins by noncontiguous recombination, Protein Science 22 (2) (2013) 231–238. doi:10.1002/pro.2202.
  • [20] T. A. Hopf, J. B. Ingraham, F. J. Poelwijk, C. P. Schärfe, M. Springer, C. Sander, D. S. Marks, Mutation effects predicted from sequence co-variation, Nature Biotechnology 35 (2) (2017) 128–135. doi:10.1038/nbt.3769.
  • [21] W. P. Russ, M. Figliuzzi, C. Stocker, P. Barrat-Charlaix, M. Socolich, P. Kast, D. Hilvert, R. Monasson, S. Cocco, M. Weigt, R. Ranganathan, An evolution-based model for designing chorismate mutase enzymes, Science 369 (6502) (2020) 440–445. doi:10.1126/science.aba3304.
  • [22] Z. Wu, K. E. Johnston, F. H. Arnold, K. K. Yang, Protein sequence design with deep generative models, arXiv q-bio.QM (2104.04457) (2021).
  • [23] D. P. Kingma, M. Welling, Auto-Encoding Variational Bayes, 2nd International Conference on Learning Representations (2014). arXiv:1312.6114.
  • [24] A. J. Riesselman, J. B. Ingraham, D. S. Marks, Deep generative models of genetic variation capture the effects of mutations, Nature Methods 15 (10) (2018) 816–822. doi:10.1038/s41592-018-0138-4.
  • [25] D. H. Brookes, J. Listgarten, Design by adaptive sampling, arXiv cs.LG (1810.03714) (2018).
  • [26] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial networks, Communications of the ACM 63 (11) (2020) 139–144. doi:10.1145/3422622.
  • [27] A. Gupta, J. Zou, Feedback GAN for DNA optimizes protein functions, Nature Machine Intelligence 1 (2019) 105–111. doi:10.1038/s42256-019-0017-4. *This paper uses adaptive sampling with a GAN generative model and a neural network surrogate model to design antimicrobial peptides.
  • [28] S. Hochreiter, J. Schmidhuber, Long Short-Term Memory, Neural Computation 9 (8) (1997) 1735–1780. doi:10.1162/neco.1997.9.8.1735.
  • [29] T. Bepler, B. Berger, Learning protein sequence embeddings using information from structure, in: 7th International Conference on Learning Representations, Vol. arXiv, 2019, p. 1902.08661.
  • [30] J.-E. Shin, A. J. Riesselman, A. W. Kollasch, C. McMahon, E. Simon, C. Sander, A. Manglik, A. C. Kruse, D. S. Marks, Protein design and variant prediction using autoregressive generative models, Nature Communications 12 (1) (2021) Article number: 2403. *This paper uses an autoregressive language model as a sequence generator to propose diverse camelid nanobody designs.
  • [31] A. Madani, B. McCann, N. Naik, N. S. Keskar, N. Anand, R. R. Eguchi, P.-S. Huang, R. Socher, ProGen: Language Modeling for Protein Generation, bioRxiv (2021). doi:10.1101/2020.03.07.982272.
  • [32] C. N. Bedbrook, K. K. Yang, A. J. Rice, V. Gradinaru, F. H. Arnold, Machine learning to design integral membrane channelrhodopsins for efficient eukaryotic expression and plasma membrane localization, PLoS Computational Biology 13 (10) (2017) e1005786. doi:10.1371/journal.pcbi.1005786.
  • [33] D. H. Brookes, H. Park, J. Listgarten, Conditioning by adaptive sampling for robust design, International Conference on Machine Learning 36 (2019) 773–782. *This paper extends the design-by-adaptive-sampling framework to avoid pathologies in the surrogate model by constraining designs to be close to a prior distribution, which is applied to designing fluorescent proteins.
  • [34] A. Kumar, S. Levine, Model Inversion Networks for Model-Based Optimization, Advances in Neural Information Processing Systems 33 (2020).
  • [35] G. Liu, H. Zeng, J. Mueller, B. Carter, Z. Wang, J. Schilz, G. Horny, M. E. Birnbaum, S. Ewert, D. K. Gifford, Antibody complementarity determining region design using high-capacity machine learning, Bioinformatics 36 (7) (2020) 2126–2133. doi:10.1093/bioinformatics/btz895. *This paper uses activation maximization, a gradient-based model inversion method, to design high-affinity antibodies.
  • [36] I. Anishchenko, T. M. Chidyausiku, S. Ovchinnikov, S. J. Pellock, D. Baker, De novo protein design by deep network hallucination (2020). doi:10.1101/2020.07.22.211482. *This paper uses an adaptive sampling approach to perform de-novo protein design based on a surrogate model that predicts the three-dimensional fold of a given sequence.
  • [37] J. C. Greenhalgh, S. A. Fahlberg, B. F. Pfleger, P. A. Romero, Machine learning-guided acyl-ACP reductase engineering for improved in vivo fatty alcohol production, bioRxiv (2021). doi:10.1101/2021.05.21.445192. **This paper uses a Gaussian process regressor and UCB acquisition over multiple rounds of sequential optimization to engineer improved fatty acyl reductases.
  • [38] K. Fukushima, Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position, Biological Cybernetics 36 (4) (1980) 267–285. doi:10.1007/BF00344251.
  • [39] R. Gómez-Bombarelli, J. N. Wei, D. Duvenaud, J. M. Hernández-Lobato, B. Sánchez-Lengeling, D. Sheberla, J. Aguilera-Iparraguirre, T. D. Hirzel, R. P. Adams, A. Aspuru-Guzik, Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules, ACS Central Science 4 (2) (2018) 268–276. doi:10.1021/acscentsci.7b00572.
  • [40] P. S. Huang, S. E. Boyken, D. Baker, The coming of age of de novo protein design (2016). doi:10.1038/nature19946.
  • [41] C. Fannjiang, J. Listgarten, Autofocused oracles for model-based design, Advances in Neural Information Processing Systems 33 (2020). *This paper demonstrates that retraining a surrogate model during adaptive sampling can improve model-based optimization.
  • [42] C. Angermueller, D. Belanger, A. Gane, Z. Mariet, D. Dohan, K. Murphy, L. Colwell, D. Sculley, Population-Based Black-Box Optimization for Biological Sequence Design, International Conference on Machine Learning 37 (2020) 324–334. *This paper improves the robustness and diversity of sequence designs by using an ensemble of generative models that can be re-weighted based on the output of the surrogate model over multiple rounds of adaptive sampling.
  • [43] N. Hansen, The CMA evolution strategy: A comparing review, in: Towards a New Evolutionary Computation, 2006, pp. 75–102. doi:10.1007/11007937\_4.
  • [44] M. Eisenstein, Active machine learning helps drug hunters tackle biology, Nature Biotechnology 38 (5) (2020) 512. doi:10.1038/s41587-020-0521-4.
  • [45] H. Robbins, Some aspects of the sequential design of experiments, Bulletin of the American Mathematical Society 58 (5) (1952) 527–535.
  • [46] P. Auer, Using confidence bounds for exploitation-exploration trade-offs, Journal of Machine Learning Research (2003) 397–422. doi:10.1162/153244303321897663.
  • [47] J. Snoek, H. Larochelle, R. P. Adams, Practical Bayesian optimization of machine learning algorithms, Advances in Neural Information Processing Systems 4 (2012) 2951–2959.
  • [48] R. S. Sutton, A. G. Barto, Reinforcement learning: An introduction, MIT press, 2018.
  • [49] C. Angermueller, D. Dohan, D. Belanger, R. Deshpande, K. Murphy, L. J. Colwell, Model-based reinforcement learning for biological sequence design, International Conference on Learning Representations (2020).
  • [50] N. Srinivas, A. Krause, S. Kakade, M. Seeger, Gaussian process optimization in the bandit setting: No regret and experimental design, International Conference on Machine Learning 27 (2010). doi:10.1109/TIT.2011.2182033.
  • [51] B. Hie, B. D. Bryson, B. Berger, Leveraging Uncertainty in Machine Learning Accelerates Biological Discovery and Design, Cell Systems 11 (2020) 461–477. doi:10.1016/j.cels.2020.09.007. *This paper uses a Gaussian process surrogate model with neural sequence embeddings to optimize for compound-kinase binding affinity and protein fluorescence while incorporating uncertainty into the acquisition function.
  • [52] T. S. Frisby, C. J. Langmead, Fold family-regularized Bayesian optimization for directed protein evolution, in: Leibniz International Proceedings in Informatics, LIPIcs, Vol. 172, 2020. doi:10.4230/LIPIcs.WABI.2020.18.
  • [53] J. T. Wilson, F. Hutter, M. P. Deisenroth, Maximizing acquisition functions for Bayesian optimization, Advances in Neural Information Processing Systems 31 (2018) 9884–9895.
  • [54] C. E. Rasmussen, C. K. I. Williams, Gaussian processes for machine learning, MIT Press, 2005.
  • [55] M. Kuss, C. E. Rasmussen, Assessing approximations for Gaussian process classification, Advances in Neural Information Processing Systems (2006) 699–706.
  • [56] E. V. Bonilla, K. M. A. Chai, C. K. Williams, Multi-task Gaussian Process prediction, Advances in Neural Information Processing Systems 20 (2009) 153–160.
  • [57] C. N. Bedbrook, K. K. Yang, J. E. Robinson, E. D. Mackey, V. Gradinaru, F. H. Arnold, Machine learning-guided channelrhodopsin engineering enables minimally invasive optogenetics, Nature Methods 16 (11) (2019) 1176–1184. doi:10.1038/s41592-019-0583-8. *This paper uses Gaussian process classifiers and regressors as surrogate models to design channelrhodopsins based on multiple optimization properties.
  • [58] S. Voutilainen, M. Heinonen, M. Andberg, E. Jokinen, H. Maaheimo, J. Pääkkönen, N. Hakulinen, J. Rouvinen, H. Lähdesmäki, S. Kaski, J. Rousu, M. Penttilä, A. Koivula, Substrate specificity of 2-deoxy-D-ribose 5-phosphate aldolase (DERA) assessed by different protein engineering and machine learning methods, Applied Microbiology and Biotechnology 104 (24) (2020) 10515–10529. doi:10.1007/s00253-020-10960-x.
  • [59] C. A. Micchelli, Y. Xu, H. Zhang, Universal kernels, Journal of Machine Learning Research 7 (December) (2006) 2651–2667.
  • [60] C. Oh, J. M. Tomczak, E. Gavves, M. Welling, Combinatorial Bayesian optimization using the graph cartesian product, Advances in Neural Information Processing Systems 32 (2019).
  • [61] D. Beck, T. Cohn, Learning Kernels over Strings using Gaussian Processes, International Joint Conference on Natural Language Processing 2 (2017) 67–73.
  • [62] H. B. Moss, D. Beck, J. González, D. S. Leslie, P. Rayson, BOSS: Bayesian optimization over string spaces, Advances in Neural Information Processing Systems 33 (2020).
  • [63] H. Liu, Y. S. Ong, X. Shen, J. Cai, When Gaussian Process Meets Big Data: A Review of Scalable GPs, IEEE Transactions on Neural Networks and Learning Systems (2020). arXiv:1807.01065, doi:10.1109/TNNLS.2019.2957109.
  • [64] V. Fortuin, G. Dresdner, H. Strathmann, G. Rätsch, Scalable Gaussian Processes on Discrete Domains, arXiv stat.ML (1810.10368) (2019).
  • [65] D. Koller, N. Friedman, Probabilistic Graphical Models: Principles and Techniques, MIT Press, 2009.
  • [66] R. M. Neal, Bayesian learning for neural networks, Springer Science & Business Media, 2012.
  • [67] B. Lakshminarayanan, A. Pritzel, C. Blundell, Simple and scalable predictive uncertainty estimation using deep ensembles, Advances in Neural Information Processing Systems (2017) 6402–6413.
  • [68] H. Zeng, D. K. Gifford, Quantification of Uncertainty in Peptide-MHC Binding Prediction Improves High-Affinity Peptide Selection for Therapeutic Design, Cell Systems 9 (2) (2019) 159–166. doi:10.1016/j.cels.2019.05.004.
  • [69] A. Amini, W. Schwarting, A. Soleimany, D. Rus, Deep evidential regression, Advances in Neural Information Processing Systems 33 (2020) 14927–14937.
  • [70] P. Izmailov, S. Vikram, M. D. Hoffman, A. G. Wilson, What Are Bayesian Neural Network Posteriors Really Like?, arXiv cs.LG (2104.14421) (2021).
  • [71] N. A. Pierce, E. Winfree, Protein design is NP-hard, Protein Engineering 15 (10) (2003) 779–782. doi:10.1093/protein/15.10.779.