Stochastic Variational Methods in Generalized Hidden Semi-Markov Models to Characterize Functionality in Random Heteropolymers

Yun Zhou Division of Biostatistics, University of California Berkeley, Berkeley, CA, USA. Boying Gong Division of Biostatistics, University of California Berkeley, Berkeley, CA, USA. Tao Jiang Department of Chemistry, The MOE Key Laboratory of Spectrochemical Analysis and Instrumentation, Xiamen University, Xiamen, China. Ting Xu Department of Materials Science and Engineering, University of California Berkeley, Berkeley, CA, USA. Department of Chemistry, University of California Berkeley, Berkeley, CA, USA. Materials Science Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA. Haiyan Huang Corresponding author. Email: hyh0110@berkeley.edu Department of Statistics, University of California Berkeley, Berkeley, CA, USA. Center for Computational Biology, University of California Berkeley, Berkeley, CA, USA.

Abstract

Recent years have seen substantial advances in the development of biofunctional materials using synthetic polymers. The growing problem of elusive sequence-functionality relations for most biomaterials has driven researchers to seek more effective tools and analysis methods. In this study, statistical models are used to study sequence features of the recently reported random heteropolymers (RHP), which transport protons across lipid bilayers selectively and rapidly like natural proton channels. We utilized the probabilistic graphical model framework and developed a generalized hidden semi-Markov model (GHSMM-RHP) to extract the function-determining sequence features, including the transmembrane segments within a chain and the sequence heterogeneity among different chains. We developed stochastic variational methods for efficient inference on parameter estimation and predictions, and empirically studied their computational performance from a comparative perspective on Bayesian (i.e., stochastic variational Bayes) versus frequentist (i.e., stochastic variational expectation-maximization) frameworks that have been studied separately before. The real data results agree well with the laboratory experiments, and suggest GHSMM-RHP’s potential in predicting protein-like behavior at the polymer-chain level.

Keywords: random heteropolymers, probabilistic graphical model, stochastic variational Bayes, stochastic variational expectation-maximization

1 Introduction

Protein-mimicry with synthetic polymers has been actively pursued for decades to meet material demands within the environmental, energy and life sciences. Recently, Panganiban et al. (2018) and Jiang et al. (2020) reported a promising protein-mimic polymer system on the basis of random heterogeneous polymers (RHP) comprising monomers with distinct chemical properties. Through wet lab experiments, they found long-term screening for optimizing the monomer choices and monomer ratios can lead to some RHPs with protein-like functions. The RHP systems provide an attractive platform for developing polymer-based biofunctional materials.

In Jiang et al. (2020), several datasets of four-monomer based RHPs were synthesized, which behave like natural protein-based proton channels. The four methacrylate-based monomers were selected by chemical diversity, i.e., methyl methacrylate (MMA), 2-ethylhexyl methacrylate (EHMA), oligo (ethylene glycol) methacrylate (OEGMA), and 3-sulfopropyl methacrylate potassium salt (SPMA). MMA and EHMA are the hydrophobic monomers, while OEGMA and SPMA are the hydrophilic monomers. RAFT polymerization was used to control the ratio of four monomers in an RHP dataset, in order to tailor the overall hydrophobicity. Different RHP datasets with varying monomer ratios were generated, and some of them were found capable of being inserted into lipid bilayers and promoting transmembrane proton transport. However, there is a difference from natural proteins that make up the channels. A functional natural protein has certain specific amino acid sequence(s). But the functional performance of each RHP dataset is determined as a whole, and chains in an RHP dataset, which are synthesized under a fixed monomer ratio, are largely different from each other. The sequence heterogeneity in RHP brings challenges to directly linking individual monomer sequence to the overall performance of a polymer dataset. Study of the sequence heterogeneity under various monomer composition ratios would shed light on the understanding of RHPs’ functional adaptability, enabling their uniform behavior in various environments. More importantly, it offers promising potential for further RHP design in a predictable manner.

As RHPs’ local hydrophobicity levels relate to short-range interactions, each RHP chain can be studied as a concatenation of segments of various cumulative hydrophobicity. It is believed that varied segment lengths and ending positions substantially affect RHPs’ chemical properties. Therefore, segmentation can reveal valuable latent structure information on RHP heterogeneity and functionalities when individual chain experimentation is unavailable. The differences in cumulative hydrophobicity among segments introduce the segment heterogeneity. To analyze such heterogeneity, the segments are assigned to three states: the hydrophilic state, $S1$ , for the segments rich in hydrophilic monomer OEGMA and the negatively charged SPMA; the hydrophobic state, $S3$ , for those comprising mostly hydrophobic monomers MMA and EHMA; the intermediate state, $S2$ , for ones that cannot be simply assigned as hydrophobic or hydrophilic (amphiphilic segments). The $S3$ segments have the ability to span across the bilayer region and provide the proton transport pathway. The $S1$ segments prefer the water environment. The $S2$ segments are believed to have short residence time in both bilayer and water regions.

Figure 1 displays the schematic illustration of RHP chains in a lipid bilayer. Note however, information on segmentation and spatial structure is not available in practice, and we only have access to the linear sequence of four monomers in each RHP dataset.

Determining how RHPs’ performance relates to segment heterogeneity and sequence heterogeneity, rather than to a specific realization of polymer sequence, provides a promising strategy for developing biofunctional materials. The two types of heterogeneity correspond to two key factors in RHP design: (1) the segmentation of local hydrophobicity that regulates short-range interactions, and (2) the choice of overall monomer composition ratio that modulates the solubility. So far, domain knowledge is extremely limited in the field. There is also a lack of statistical study that could bridge the experimental restrictions and RHPs’ predictable performance. To exploit the statistical design of RHP and to aid its synthesis, we utilized the probabilistic graphical model framework and developed a generalized hidden semi-Markov model (GHSMM-RHP). Based on this model, we predicted the segments of varying hydrophobicity within RHP chains to address the first key factor above. To address the second factor, we statistically quantified the heterogeneity of chain distributions among different RHP datasets. GHSMM-RHP helps to extract the hidden phenomena that are otherwise hard to quantify, and utilizes efficient approximate inference in place of expensive exact computation. That is, we aim to address a joint estimation-prediction problem in RHP systems under the GHSMM-RHP model.

Specifically, GHSMM-RHP adopts the residential time design, which has fewer parameters and lower time complexity than generic hidden semi-Markov model (Yu, 2010). GHSMM-RHP incorporates additional layer of nodes to allow handling segment constraints such as the monomer positions, as outlined in Section 3. Based on the model, we developed stochastic variational methods in closed-form under both frequentist and Bayesian settings. To keep the monomer dependencies within an RHP chain, we derived message-passing formulas on the graphical model of conjugate exponential families (Maathuis et al., 2018). GHSMM-RHP differs from existing efforts in its novel design that incorporates specific constraints for large RHP systems, and provides a scalable graphical model with efficient algorithms that meets the inference demand in RHP applications.

In addition, we studied the GHSMM-RHP model through a simulation analysis, and compared estimates from the frequentist and Bayesian stochastic variational methods. In particular, we implemented stochastic variational expectation-maximization (SVEM) and stochastic variational Bayesian (SVB), and empirically addressed their estimation/prediction accuracy and computational (in)stability against hyper-parameters. We found that both methods achieved low statistical error, while SVEM exhibited greater stability against random initializations. Our work makes the first attempt to bring the two stochastic variational methods, which were studied separately before, to the same nonconvex regime. It also contributes to the current literature on the computational error of stochastic variational methods.

We further evaluated the GHSMM-RHP model using real datasets. Jiang et al. (2020) synthesized seven RHP datasets with varied compositional ratios and tested their performance. Among them, RHP with a compositional ratio of 50(MMA) : 20(EHMA) : 25(OEGMA) : 5(SPMA) was the most functionally active, which we refer to as RHP50/20. We used RHP50/20 as a reference and compared other RHPs to RHP50/20 based on the GHSMM-RHP model. The RHP datasets identified by our model as having high similarity to RHP50/20 all produced high proton transport efficiency in the lab experiment, which suggests that the GHSMM-RHP model has the potential to aid in predictable RHP performance. By bridging the gap between costly laboratory assessments and less expensive computational models, we hope to provide guidance for designing improved RHP syntheses.

The rest of the paper is organized as follows. In Section 2 we review the background of methods related to the GHSMM-RHP model. In Section 3 we describe the model that incorporates the constraints of RHP systems. In Section 4 different stochastic variational methods are derived. We implemented the algorithms and compared the results in Section 5. Section 6 concludes the work and discusses avenues for further research.

2 Method Background Review

2.1 Relevant Work on the GHSMM-RHP Model

The GHSMM-RHP model is inspired by previous attempts by Jiang et al. (2020) to use a Hidden Markov model (HMM) to analyze certain restricted RHP datasets. HMMs are widely used to describe a sequence of observable events which are generated by a sequence of internal hidden states. It has a long successful history in biological sequence analysis, such as gene prediction (Munch and Krogh, 2006) and modeling DNA sequencing errors (Lottaz et al., 2003).

We noticed two drawbacks of using HMM in RHP applications. First, the hidden state length distribution in HMM implicitly follows a geometric distribution. However, the segment length of RHP chains has a clear lower and upper cutoff, which needs to be configurable in the model to test key polymerization assumptions (Jiang et al., 2020). As suggested by Rabiner (1989), more flexible state durations need to be incorporated to improve parameter estimation and segment prediction. To overcome this issue we adopted a semi-Markov modeling of the underlying hidden states (Yu, 2010). Such a duration model has a more scalable graphical architecture to account for segmental heterogeneity. The other downside involves the time complexity for inference algorithms. The widely used Baum-Welch algorithm (Baum et al., 1970) takes $TNKS^{2}$ time under a classic HMM, and $TNKS(S+D)$ under residential time model (a semi-Markov model described in Yu and Kobayashi (2003)), where T is the number of iterations, N is the total number of sequences, K is the number of monomers in one RHP chain, S is the number of hidden states, and D is the span of the hidden state length distribution. The algorithms become time-consuming when training large libraries of RHP datasets. This computational issue can be alleviated by implementing stochastic variational methods, where stochastic optimizations are enabled through processing only a small portion of the entire dataset in each iteration (Robbins and Monro, 1951; Bottou, 2010).

The basic idea of variational methods used in GHSMM-RHP is to turn statistical estimation into a simpler lower bound optimization problem with tractable approximation densities. It defines a family of distributions $\mathcal{Q}$ over the latent variables (i.e., the hidden state sequences underlying the RHP chain), each considered as a candidate approximation to the true posterior or conditional probabilities. $\mathcal{Q}$ should be large enough to reduce the approximation error, but also simple enough to allow for efficient optimization algorithms. Among the related work, Ghahramani and Beal (2001) and Winn et al. (2005) connected the junction tree algorithms on graphical models with the variational Bayesian posterior distributions. MacKay (1997) and Beal (2003) were among the first efforts to implement variational Bayesian on classic HMM models. Hoffman et al. (2013) introduced stochastic optimization into variational Bayesian, and extensions to other Bayesian time series models were reported in Johnson and Willsky (2014). Foti et al. (2014) used stochastic variational Bayesian for HMM but in a different setting: it involved a single long sequence broken into mini-batches, rather than multiple relatively short sequences.

2.2 Relevant Work on Stochastic Variational Methods

Variational methods, such as variational expectation-maximization (VEM) and variational Bayes (VB, sometimes termed variational inference), provide alternative solutions to maximum likelihood estimation and Bayesian posterior approximation respectively (Wainwright and Jordan, 2008; Blei et al., 2017). Being faster than convex relaxations and Markov chain Monte Carlo as well as adaptable to different application models, variational methods have prospered in various large scale approximate inference problems such as computational biology (Carbonetto et al., 2012), natural language processing (Yogatama et al., 2014), and astronomy (Regier et al., 2015). The objective function by classic proposals (Neal and Hinton, 1998; Jordan et al., 1999) is:

\displaystyle\ln p_{\theta}(y)=\ln\int p_{\theta}(y,z)dz\geq\ln p_{\theta}(y)-\text{KL}(q(z)||p_{\theta}(z|y)).

(1)

Here, the likelihood function $p_{\theta}(y)$ can be replaced by marginal $p(y)$ in the Bayesian setting, where $z$ are latent variables that include parameters $\theta$ . One can choose $q(z)$ over some proper variational distribution family $\mathcal{Q}$ . Based on Equation 1, some variants include those that are scaled up by stochastic approximations (Cappé and Moulines, 2009; Hoffman et al., 2013), and those that are distinct in variational objective formulations (e.g., $\alpha$ -VB (Yang et al., 2020)) or distribution families $\mathcal{Q}$ (e.g., Auto-encoding VB (Kingma and Welling, 2013), and Boosting VB (Guo et al., 2016)).

Most theoretical justifications for variational methods, which are still in active development, focus on the statistical properties of the global optimizer (either point estimates or posterior approximations). For example, the consistency and asymptotic normality in stochastic block models (Bickel et al., 2013), mixture models (Westling and McCormick, 2019), or more general parametric models with mean-field assumptions (Wang and Blei, 2019); the non-asymptotic convergence rate for latent Gaussian models (Sheth and Khardon, 2017), approximating tempered posterior (Alquier et al., 2020), point estimates using mean-field assumptions (Pati et al., 2018), or high-dimensional and nonparametric inference (Zhang et al., 2020b).

Although the global optimum for variational methods can be achieved via some suitable settings (Awasthi and Risteski, 2015), the computational error involving the convergence behavior of iterative updates has been less studied. In fact, solutions to variational objective functions often suffer from multiple local optimums, making optimizations sensitive to initializations and other hyper-parameters. For example, unfavorable scenarios were constructed in the stochastic block model with futile random initializations (Mukherjee and Sarkar, 2018), and in the topic model with uninformative objective functions (Ghorbani et al., 2019). They point out cases that fail in practice due to either optimization algorithms or problem formulations.

A few studies have investigated scenarios in which the global optimum can not be attained feasibly. For example, Zhang et al. (2020a) analyzed the computational rate of convergence for mean-field variational Bayes in stochastic block models, and Balakrishnan et al. (2017) studied EM-type of iterates in broader frequentist settings. However, those results are either model-specific or conditional on proper hyper-parameters, and the additional tuning effort with pilot algorithms may not be warranted. Improvement of the computational error in variational methods with general initializations and model choice remains an open problem. Moreover, when applying variational methods in a broad range of model types, poor approximations can originate from the instability of stochastic optimization algorithms with many hyper-parameters, even if the variational family $\mathcal{Q}$ contains the true posterior (Dhaka et al., 2020).

Results on the accuracy and stability of stochastic variational methods remain inconclusive for a wide variety of models. Misperception of its properties will hinder reproducible scientific research and reliable decision making. In this paper, we make an attempt to empirically study the computational performance of stochastic variational methods under the GHSMM-RHP model, as well as to compare the performance of stochastic variation methods under Bayesian (i.e., SVB) and frequentist (i.e., SVEM) frameworks that were previously studied separately. The observations and conclusions developed in our study from a comparative perspective on SVB and SVEM not only provide insight into the choice of these two methods, but also contribute to the existing literature on the computational error of stochastic variational methods.

3 Generalized Hidden Semi-Markov Model for RHP

As described previously, each RHP dataset contains thousands of sequences randomly generated under a fixed monomer ratio. Note that we only have access to the one-dimensional primary structure, that is, the linear sequence of four monomers. We start by discussing a few sequence constraints, which are the same as those stated in Jiang et al. (2020). These constraints describe the underlying biology of functional RHPs: (1) We define $S3$ segments to be composed of hydrophobic monomers (i.e., MMA, EHMA) and no more than one OEGMA. If an OEGMA appears in an S3 segment, it must be at least two monomers away from both ends of the $S3$ segment. (2) $S2$ and $S3$ are separated by hydrophilic segments $S1$ . (3) The negatively charged hydrophilic monomer SPMA appears only in $S1$ .

The hydrophobic monomers in $S3$ segments enable the insertion of the chain into lipid bilayers, and the appearance of a single OEGMA doesn’t compromise the overall activity if it is in the middle of an $S3$ segment. The length of the $S3$ segments depends on various factors such as chain conformation and membrane flexibility, but it should be sufficiently long to span the core of the lipid bilayer. The $S1$ segments have a high preference for water (high polarity of the monomer), and appear in between $S2$ and $S3$ .

We adapted the right-censored hidden semi-Markov model in (Yu, 2010) to model the RHP chain while incorporating the above constraints. Here, the hidden state duration is a random variable following a categorical distribution over a predefined range. The unobserved states correspond to three types of segments: hydrophilic state $S1$ , intermediate state $S2$ , and hydrophobic state $S3$ . We denote a random RHP chain as $(y_{1},...,y_{K})$ . At each position $k\in 1,...,K$ , $y_{k}$ takes values from the four monomers, i.e., MMA, OEGMA, EHMA, SPMA. $(z_{k},\tau_{k},\iota_{k})$ are the latent variables. $z_{k}$ is the hidden segment state, $\tau_{k}$ is the remaining duration of the current hidden state, and $\iota_{k}$ is the location of an emitted OEGMA. Their distributions are described below.

$\bullet\hskip 2.84526ptz_{1}$ : The model starts with the initial state variable $z_{1}$ , taking value in the state space $\mathcal{S}_{z}=\{S1,S2,S3\}$ . $p(z_{1}|\pi)$ follows a categorical distribution with parameters $\pi=(\pi_{S1},\pi_{S2},\pi_{S3})$ :

\displaystyle\ln p(z_{1}|\pi)=\sum_{i\in\mathcal{S}_{z}}\ln(\pi_{i})\mathbbm{1}\{z_{1}=i\}.

(2)

$\bullet\hskip 2.84526pt(\tau_{1},\iota_{1})$ : Latent variables $(\tau_{1},\iota_{1})$ depend on $z_{1}$ . (1) If $z_{1}=S3$ , then the duration $\tau_{1}$ takes values between $\underline{D}$ and $\overline{D}$ , the predefined shortest and longest segment length. The OEGMA location $\iota_{1}$ takes values between 3 and $\tau_{1}-2$ , meaning the OEGMA shall be at least two monomers away from both ends of the $S3$ segment. $\iota_{1}$ may also be 0, meaning there is no OEGMA in the current $S3$ segment. This support is denoted as $\mathcal{S}^{S3}_{\tau,\iota}=\{(d,l)|(d,l)\in\{\underline{D},...,\overline{D}\}\times\{0,3,...,\overline{D}-2\},l\leq d-2\}$ . The conditional probability $p(\tau_{1},\iota_{1}|z_{1},\rho^{S3})$ follows a categorical distribution with parameters $\rho^{S3}=\{\rho^{S3}_{d,l}|(d,l)\in\mathcal{S}^{S3}_{\tau,\iota}\}$ . (2) If $z_{1}=S1$ or $z_{1}=S2$ , then the duration $\tau_{1}$ takes value between $1$ and $\overline{D}$ . There is no constraint on OEGMA location for $S1$ and $S2$ segments, so $\iota_{1}$ can take any value without affecting the likelihood function. We set it to 0 for convenience. This support is denoted as $\mathcal{S}^{S1}_{\tau,\iota}=\mathcal{S}^{S2}_{\tau,\iota}=\{(d,l)|(d,l)\in\{1,...,\overline{D}\}\times\{0\}\}$ . The conditional probability $p(\tau_{1},\iota_{1}|z_{1},\rho^{S1})$ follows a categorical distribution with parameters $\rho^{S1}=\{\rho^{S1}_{d,l}|(d,l)\in\mathcal{S}^{S1}_{\tau,\iota}\}$ . $p(\tau_{1},\iota_{1}|z_{1},\rho^{S2})$ and $\rho^{S2}$ is defined similarly.

In summary, the conditional probability $p(\tau_{1},\iota_{1}|z_{1},\rho)$ follows a categorical distribution with parameters $\rho=(\rho^{S1},\rho^{S2},\rho^{S3})$ . Three sub-vectors $\rho^{S1},\rho^{S2},\rho^{S3}$ correspond to three supports $\mathcal{S}^{S1}_{\tau,\iota},\mathcal{S}^{S2}_{\tau,\iota},\mathcal{S}^{S3}_{\tau,\iota}$ respectively. This gives:

\displaystyle\ln p(\tau_{1},\iota_{1}|z_{1},\rho)=\sum_{i\in\mathcal{S}_{z}}\sum_{(d,l)\in\mathcal{S}^{i}_{\tau,\iota}}\ln(\rho^{i}_{d,l})\mathbbm{1}\{(z_{1},\tau_{1},\iota_{1})=(i,d,l)\}.

(3)

$\bullet\hskip 2.84526pt(z_{k},\tau_{k},\iota_{k})$ : For positions $k=2,...,K$ , latent variables $(z_{k},\tau_{k},\iota_{k})$ evolve as a time-homogeneous Markov chain, taking values in the same joint space as the one described for $(z_{1},\tau_{1},\iota_{1})$ . The graphical model diagram is illustrated in Figure 2. The transition distribution with parameters $\alpha,\rho$ is defined as $\zeta_{(z_{k-1},\tau_{k-1},\iota_{k-1})\to(z_{k},\tau_{k},\iota_{k})}$ . When $(z_{k},\tau_{k},\iota_{k})=(j,d^{\prime},l^{\prime})$ and $(z_{k-1},\tau_{k-1},\iota_{k-1})=(i,d,l)$ , the transition kernel $\zeta$ is:

\displaystyle\zeta_{(i,d,l)\to(j,d^{\prime},l^{\prime})}=\begin{cases}1,&\text{if $d>1,$}\text{$(j,d^{\prime},l^{\prime})=(i,d-1,l)$}.\\ \alpha_{i,j}\rho^{j}_{d^{\prime},l^{\prime}},&\text{if $d=1,j\neq i$}.\\ 0,&\text{otherwise}.\end{cases}

As the position $k$ proceeds we note the following. (1) If $d>1$ , there is no transition on hidden segment state, i.e., $z_{k}=j=i$ . The remaining duration $\tau_{k-1}=d$ of hidden segment state $z_{k-1}=i$ will deterministically decrease by 1 until it reaches 1, i.e., $\tau_{k}=d-1$ , $\tau_{k+1}=d-2$ , et cetera. (2) If $d=1$ , the RHP will transit to a different hidden segment state $z_{k}=j\in\mathcal{S}_{z}\setminus\{i\}$ with probability $\alpha_{i,j}$ , followed by a new pair $(\tau_{k},\iota_{k})=(d^{\prime},l^{\prime})$ generated according to the categorical probability parameter $\rho^{j}$ . We disallow transitions between S2 and S3 due to biological constraints, i.e., $\alpha_{S2,S3}=\alpha_{S3,S2}=0$ . Thus, we have:

		$\displaystyle\ln[p(z_{k},\tau_{k},\iota_{k}\|z_{k-1},\tau_{k-1},\iota_{k-1},\alpha,\rho)$		(4)
	$\displaystyle=$	$\displaystyle\sum_{i,j\in\mathcal{S}_{z}}\sum_{(d,l)\in\mathcal{S}^{i}_{\tau,\iota}}\sum_{(d^{\prime},l^{\prime})\in\mathcal{S}^{j}_{\tau,\iota}}\Big{[}\ln(\zeta_{(i,d,l)\to(j,d^{\prime},l^{\prime})})\cdot\mathbbm{1}\{(z_{k-1},\tau_{k-1},\iota_{k-1})=(i,d,l),(z_{k},\tau_{k},\iota_{k})=(j,d^{\prime},l^{\prime})\}\Big{]}.$		(4)

$\bullet\hskip 2.84526pty_{k}$ : The observed random variable $y_{k}$ is conditionally independent of all the other variables given $(z_{k},\tau_{k},\iota_{k})$ . $y_{k}$ takes values in the space of four monomers $\mathcal{S}_{y}=\{MMA,OEGMA,EHMA,SPMA\}$ . The emission distribution $p(y_{k}|z_{k},\tau_{k},\iota_{k},\beta)$ follows a categorical distribution with parameters $\beta=(\beta_{S1},\beta_{S2},\beta_{S3},\beta_{S3^{\prime}})$ . The four sub-vectors of $\beta$ are used conditional on different values of $(z_{k},\tau_{k},\iota_{k})$ , as shown in Table 1.

Table 1: Parameters for emission distribution.

Emission distribution	Support of $y_{k}$
$p(y_{k}=i\|z_{k}=S1,\beta)=\beta_{S1,i}$	$i\in\mathcal{S}_{y}$
$p(y_{k}=i\|z_{k}=S2,\beta)=\beta_{S2,i}$	$i\in\mathcal{S}_{y}\setminus\{SPMA\}$
$p(y_{k}=i\|z_{k}=S3,\tau_{k}\neq\iota_{k}\text{ or }z_{k}=S3,\iota_{k}=0,\beta)=\beta_{S3,i}$	$i\in\mathcal{S}_{y}\setminus\{OEGMA,SPMA\}$
$p(y_{k}=i\|z_{k}=S3,\tau_{k}=\iota_{k},\beta)=\beta_{S3^{\prime},i}=1$	$i\in\{OEGMA\}$

Specifically, $\beta_{S1}$ is for categorical distribution over $\mathcal{S}_{y}$ when $z_{k}=S1$ . $\beta_{S2}$ is for categorical distribution over $\mathcal{S}_{y}\setminus\{SPMA\}$ when $z_{k}=S2$ , because we assume the SPMA monomer appears only in $S1$ segments. $\beta_{S3}$ is for categorical distribution over $\mathcal{S}_{y}\setminus\{OEGMA,SPMA\}$ if $z_{k}=S3,\tau_{k}\neq\iota_{k}$ or $z_{k}=S3,\iota_{k}=0$ . It is used when the current position $k$ is within an $S3$ segment but no OEGMA is emitted, either because $\tau_{k}$ hasn’t reached the designated OEGMA location $\iota_{k}$ or because the current $S3$ segment is made up of purely MMA and EHMA. $\beta_{S3^{\prime}}$ is for categorical distribution over $\{OEGMA\}$ if $z_{k}=S3,\tau_{k}=\iota_{k}$ . It degenerates to a deterministic value. It is used when the duration variable $\tau_{k}$ decreases to the designated OEGMA location $\iota_{k}$ , indicating that the only OEGMA should be emitted in the current $S3$ segment. Thus we have:

	$\displaystyle\ln[p(y_{k}\|z_{k},\tau_{k},\iota_{k},\beta)]$	(5)
$\displaystyle=$	$\displaystyle\sum_{j\in\mathcal{S}_{y}}\sum_{\text{m}\in\{S1,S2\}}\Big{[}\ln(\beta_{\text{m},j})\mathbbm{1}\{z_{k}=\text{m},y_{k}=j\}\Big{]}+\sum_{j\in\mathcal{S}_{y}}\sum_{\text{m}\in\{S3^{\prime}\}}\Big{[}\ln(\beta_{\text{m},j})\mathbbm{1}\{z_{k}=m,\tau_{k}=\iota_{k},y_{k}=j\}\Big{]}+$
	$\displaystyle\sum_{j\in\mathcal{S}_{y}}\sum_{\text{m}\in\{S3\}}\Big{[}\ln(\beta_{\text{m},j})\mathbbm{1}\{z_{k}=m,\tau_{k}\neq\iota_{k}\text{ or }z_{k}=m,\iota_{k}=0,y_{k}=j\}\Big{]}.$

This probabilistic model incorporates the aforementioned constraints on RHP systems, and allows for explicit modeling of the hidden states’ duration. The complete likelihood function for one RHP chain is:

	$\displaystyle p(y_{1:K},z_{1:K},\tau_{1:K},\iota_{1:K}\|\pi,\alpha,\beta,\rho)$	(6)
$\displaystyle=$	$\displaystyle p(z_{1}\|\pi)p(\tau_{1},\iota_{1}\|z_{1},\rho)\prod_{k=2}^{K}[p(z_{k},\tau_{k},\iota_{k}\|z_{k-1},\tau_{k-1},\iota_{k-1},\alpha,\rho)]\prod_{k=1}^{K}[p(y_{k}\|z_{k},\tau_{k},\iota_{k},\beta)]$
$\displaystyle=$	$\displaystyle\exp\{{\eta}({\theta})^{\top}\cdot{\phi}(\bm{z,\tau,\iota,y})\}.$

In the last line of Equation LABEL:eq:complete_likelihood, we plug in Equations 2 to LABEL:eq:emission and simplify Equation LABEL:eq:complete_likelihood further to an exponential family. Here, ${\theta}=(\pi,\alpha,\beta,\rho)$ . ${\eta}({\theta})=\ln\theta$ is the canonical parameter vector and $\ln$ is applied element-wise. We denote $\bm{y}=y_{1:K}=(y_{1},...,y_{K})$ , with $\bm{z,\tau,\iota}$ similarly defined. ${\phi}(\bm{z,\tau,\iota,y})$ is the vector of complete-data sufficient statistics. All sequences are i.i.d. given the parameter ${\theta}$ , but for clarity we have suppressed indices for i.i.d. replicas of RHP sequences so far. Denote $\bm{y}^{(s)}=y_{1:K}^{(s)}=(y_{1}^{(s)},...,y_{K}^{(s)})$ as observations of the $s$ -th sequence, $s=1,...,N$ . Denote $\bm{y}^{(\mathcal{I})}=\{\bm{y}^{(s)}\}_{s\in\mathcal{I}}$ as all RHP sequences in the index set $\mathcal{I}\subseteq\{1,...,N\}$ , i.e., $\bm{y}^{(1:N)}=\{\bm{y}^{(s)}\}_{s=1}^{N}$ . $\bm{z}^{(s)},\bm{\tau}^{(s)},\bm{\iota}^{(s)}$ and $\phi^{(s)}=\phi(\bm{z}^{(s)},\bm{\tau}^{(s)},\bm{\iota}^{(s)},\bm{y}^{(s)})$ are defined similarly.

4 Stochastic Variational Methods for the GHSMM-RHP

Variational methods have been used for approximate inference under different scenarios when exact solutions are intractable or too expensive to achieve (Bishop, 2006; Wainwright and Jordan, 2008). To develop more efficient algorithms in modern large-scale learning problems, stochastic optimizations are introduced to bypass the need to iterate through the entire datasets in each iteration (Robbins and Monro, 1951). Combining variational methods and stochastic optimizations helps alleviate the computational complexity of estimations in the GHSMM-RHP model.

In particular, in the Bayesian setting where parameters $\theta$ are viewed as random variables, we coupled stochastic natural gradient descent (Amari, 1998) with variational Bayes following the work of Hoffman et al. (2013). In the frequentist setting where parameters $\theta$ are viewed as fixed but unknown, we coupled the stochastic Frank-Wolfe algorithm (Reddi et al., 2016) with variational expectation-maximization.

4.1 Stochastic Variational Bayesian

In this section, we describe the stochastic variational Bayesian (SVB) algorithms for our GHSMM-RHP model. For the complete likelihood function $p(\bm{z}^{(1:N)},\bm{\tau}^{(1:N)},\bm{\iota}^{(1:N)},\bm{y}^{(1:N)}|{\theta})=\exp\{{\eta}({\theta})^{\top}\sum_{s=1}^{N}\phi^{(s)}\}$ , we consider the conditionally conjugate exponential family (Bernardo and Smith, 2009) with priors taking the form of $p({\theta}|\widetilde{\lambda}_{\theta})=\exp\{{\widetilde{\lambda}}_{\theta}^{\top}{\eta}({\theta})-B(\widetilde{\lambda}_{\theta})\}$ . Here, prior parameters $\widetilde{\lambda}_{\theta}=({\widetilde{\lambda}}_{\pi},{\widetilde{\lambda}}_{\alpha},{\widetilde{\lambda}}_{\beta},{\widetilde{\lambda}}_{{\rho}})$ are for ${\theta}=(\pi,\alpha,\beta,\rho)$ , respectively. These add to the full model with:

	$\displaystyle{\pi}\|{\widetilde{\lambda}}_{\pi}\sim\text{Dir}({\widetilde{\lambda}}_{\pi}+\bm{1}),$		(7)
	$\displaystyle\alpha_{i}\|{\widetilde{\lambda}}_{\alpha_{i}}\sim\text{Dir}({\widetilde{\lambda}}_{\alpha_{i}}+\bm{1}),\forall i\in\{S1,S2,S3\},$
	$\displaystyle{\rho}^{i}\|{\widetilde{\lambda}}_{{\rho}^{i}}\sim\text{Dir}({\widetilde{\lambda}}_{{\rho}^{i}}+\bm{1}),\forall i\in\{S1,S2,S3\},$
	$\displaystyle\beta_{\text{m}}\|{\widetilde{\lambda}}_{\beta_{\text{m}}}\sim\text{Dir}({\widetilde{\lambda}}_{\beta_{\text{m}}}+\bm{1}),\forall\text{m}\in\{S1,S2,S3,S3^{\prime}\}.$

$S3$ and $S3^{\prime}$ refer to the same segment as described in Table 1. The complete conditional distribution of local latent variables can be written from Equation LABEL:eq:complete_likelihood as:

	$\displaystyle p(\bm{z}^{(1:N)},\bm{\tau}^{(1:N)},\bm{\iota}^{(1:N)}\|\theta,\bm{y}^{(1:N)})=$	$\displaystyle\prod_{s=1}^{N}p(\bm{z}^{(s)},\bm{\tau}^{(s)},\bm{\iota}^{(s)}\|\theta,\bm{y}^{(s)})$		(8)
	$\displaystyle=$	$\displaystyle\prod_{s=1}^{N}\exp\{{\eta}({\theta})^{\top}\phi^{(s)}-A_{y^{(s)}}(\eta(\theta))\}.$		(8)

Here, the cumulant function $A_{y^{(s)}}(\eta(\theta))=\ln\sum_{\bm{z}^{(s)},\bm{\tau}^{(s)},\bm{\iota}^{(s)}}\exp\{{\eta}({\theta})^{\top}\phi^{(s)}\}$ . Define its conjugate function as $A^{*}_{y^{(s)}}(\mu)=\max_{\eta}(\mu^{\top}\eta-A_{y^{(s)}}(\eta))$ . Note the analogue term for all N sequences is $A_{y^{(1:N)}}(\eta(\theta))=\sum_{s=1}^{N}A_{y^{(s)}}(\eta(\theta))$ .

We then specify a family of variational distributions $q(\bm{z}^{(1:N)},\bm{\tau}^{(1:N)},\bm{\iota}^{(1:N)},\theta|\psi,\lambda_{\theta})\in\mathcal{Q}$ , aiming to approximate the true posterior distribution over all latent variables in terms of KL divergence. We make the mean-field assumption in the variational family $\mathcal{Q}$ : independence between $\theta$ and $(\bm{z}^{(1:N)},\bm{\tau}^{(1:N)},\bm{\iota}^{(1:N)})$ while structures are kept within $(z_{1:K}^{(s)},\tau_{1:K}^{(s)},\iota_{1:K}^{(s)})$ , namely $q(\bm{z}^{(1:N)},\bm{\tau}^{(1:N)},\bm{\iota}^{(1:N)},\theta)=q(\theta)\prod_{s=1}^{N}q(\bm{z}^{(s)},\bm{\tau}^{(s)},\bm{\iota}^{(s)})$ . It can be argued that the optimal variational distribution takes the same forms of exponential families as in Equations 7 and 8 (Bishop, 2006). That is, $q(\theta|\lambda_{\theta})=\exp\{\lambda_{\theta}^{\top}\eta(\theta)-B({\lambda}_{\theta})\}$ , $q(\bm{z}^{(s)},\bm{\tau}^{(s)},\bm{\iota}^{(s)}|\psi)=\exp\{\psi^{\top}\phi^{(s)}-A_{y^{(s)}}(\psi)\},\forall s\in 1,...,N$ . Here, different RHP sequences share the same local variational parameters $\psi$ . The sequence specific information is decoupled into sufficient statistics $\phi^{(s)}$ . Using $\mathbb{E}_{\lambda_{\theta}},\mathbb{E}_{\psi}$ to denote the corresponding expectations, the evidence lower bound function (ELBO) is:

$\displaystyle\ln p(\bm{y}^{(1:N)})\geq$	$\displaystyle ELBO_{SVB}$	(9)
$\displaystyle:=$	$\displaystyle\ln p(\bm{y}^{(1:N)})-\text{KL}(q(\bm{z}^{(1:N)},\bm{\tau}^{(1:N)},\bm{\iota}^{(1:N)},\theta\|\psi,\lambda_{\theta})\|\|p(\bm{z}^{(1:N)},\bm{\tau}^{(1:N)},\bm{\iota}^{(1:N)},\theta\|\bm{y}^{(1:N)})))$
$\displaystyle=$	$\displaystyle\Big{\langle}\sum_{s=1}^{N}\mathbb{E}_{\psi}{\phi}^{(s)},\mathbb{E}_{\lambda_{\theta}}\eta(\theta)\Big{\rangle}-A_{y^{(1:N)}}^{*}(\sum_{s=1}^{N}\mathbb{E}_{\psi}{\phi}^{(s)})-\text{KL}(q_{\lambda_{\theta}}\|\|p_{\widetilde{\lambda}_{\theta}})-\sum_{s=1}^{N}\mathbb{E}_{\lambda_{\theta}}A(\eta(\theta)),$

where $\langle\cdot,\cdot\rangle$ denotes inner product. The last line is based on specific forms of variational distributions. Note here the cumulant function $A(\eta({\theta}))=\ln\sum_{\bm{y}^{(s)},\bm{z}^{(s)},\bm{\tau}^{(s)},\bm{\iota}^{(s)}}\exp\{{\eta}({\theta})^{\top}\phi^{(s)}\}=0$ with a probability of one, if $\theta$ follows Dirichlet distributions. Maximizing the ELBO function is equivalent to finding the best approximation in $\mathcal{Q}$ to the true posterior distribution. One way to do this is through stochastic natural gradient descent (Hoffman et al., 2013) while iteratively updating the local variational distribution $q_{\psi^{(t)}}:=q(\bm{z}^{(1:N)},\bm{\tau}^{(1:N)},\bm{\iota}^{(1:N)}|\psi^{(t)})$ and the global variational density $q_{\lambda_{\theta}^{(t)}}:=q(\theta|\lambda_{\theta}^{(t)})$ . Here $(t)$ is the number of iterations.

E-step: solving $\max_{\psi}ELBO_{SVB}\big{(}q_{\psi},q_{\lambda_{\theta}^{(t)}}\big{)}=\max_{\psi}\big{(}\langle\sum_{s=1}^{N}\mathbb{E}_{\psi}{\phi}^{(s)},\mathbb{E}_{\lambda_{\theta}^{(t)}}\eta(\theta)\rangle-\linebreak A_{y^{(1:N)}}^{*}(\sum_{s=1}^{N}\mathbb{E}_{\psi}{\phi}^{(s)})\big{)}$ . Convexity in $\mathbb{E}_{\psi}{\phi}^{(s)}$ yields the update $\psi^{(t)}=\mathbb{E}_{\lambda_{\theta}^{(t)}}\eta(\theta)$ .

M-step: solving $\max_{\lambda_{\theta}}ELBO_{SVB}\big{(}q_{\psi^{(t)}},q_{\lambda_{\theta}}\big{)}=\max_{\lambda_{\theta}}\big{(}\langle\sum_{s=1}^{N}\mathbb{E}_{\psi^{(t)}}{\phi}^{(s)},\mathbb{E}_{\lambda_{\theta}}\eta(\theta)\rangle-\linebreak\text{KL}(q_{\lambda_{\theta}}||p_{\widetilde{\lambda}_{\theta}})\big{)}$ . Using stochastic natural gradient descent yields the update:

\displaystyle\lambda_{\theta}^{(t+1)}=(1-\gamma^{(t)})\lambda_{\theta}^{(t)}+\gamma^{(t)}(\frac{N}{|\mathcal{I}|}\sum_{s\in\mathcal{I}}\mathbb{E}_{\psi^{(t)}}{\phi}^{(s)}+\widetilde{\lambda}_{\theta}),

(10)

where $\mathcal{I}$ is the randomly chosen mini-batch and $\gamma^{(t)}$ is the step size at the $t$ -th iteration. To calculate $\mathbb{E}_{\psi^{(t)}}{\phi}^{(s)}$ , we implemented a message-passing recursion which is a special version of the junction tree algorithm (Lauritzen and Spiegelhalter, 1988). More details on update formulas can be found in Supplementary Materials section B.

4.2 Stochastic Variational Expectation Maximization

Under the frequentist setting, the complete likelihood $p(\bm{z}^{(1:N)},\bm{\tau}^{(1:N)},\bm{\iota}^{(1:N)},\bm{y}^{(1:N)}|{\theta})$ in Equation LABEL:eq:complete_likelihood, complete conditional distributions $p(\bm{z}^{(1:N)},\bm{\tau}^{(1:N)},\bm{\iota}^{(1:N)}|\bm{y}^{(1:N)},{\theta})$ in Equation 8, and local variational distributions $q(\bm{z}^{(s)},\bm{\tau}^{(s)},\bm{\iota}^{(s)}|\psi)$ remain the same as those in SVB. Thus, stochastic variational expectation-maximization (SVEM) for the GHSMM-RHP model carries similar ELBO and parameter updates to the Bayesian approach:

$\displaystyle\ln p(\bm{y}^{(1:N)}\|\theta)\geq$	$\displaystyle ELBO_{SEM}$	(11)
$\displaystyle:=$	$\displaystyle\ln p(\bm{y}^{(1:N)}\|\theta)-\text{KL}(q(\bm{z}^{(1:N)},\bm{\tau}^{(1:N)},\bm{\iota}^{(1:N)}\|\psi)\|\|p(\bm{z}^{(1:N)},\bm{\tau}^{(1:N)},\bm{\iota}^{(1:N)}\|\theta,\bm{y}^{(s)})))$
$\displaystyle=$	$\displaystyle\Big{\langle}\sum_{s=1}^{N}\mathbb{E}_{\psi}{\phi}^{(s)},\eta(\theta)\Big{\rangle}-A_{y^{(1:N)}}^{*}(\sum_{s=1}^{N}\mathbb{E}_{\psi}{\phi}^{(s)})-\sum_{s=1}^{N}A(\eta(\theta)).$

E-step: solving $\max_{\psi}ELBO_{SEM}\big{(}q_{\psi},\theta^{(t)}\big{)}=\max_{\psi}\big{(}\langle\sum_{s=1}^{N}\mathbb{E}_{\psi}{\phi}^{(s)},\eta(\theta^{(t)})\rangle-\linebreak A_{y^{(1:N)}}^{*}(\sum_{s=1}^{N}\mathbb{E}_{\psi}{\phi}^{(s)})\big{)}$ , yielding the update $\psi^{(t)}=\eta(\theta^{(t)})$ .

M-step: solving $\max_{\theta}ELBO_{SEM}\big{(}q_{\psi^{(t)}},\theta\big{)}=\max_{\theta}\big{(}\langle\sum_{s=1}^{N}\mathbb{E}_{\psi^{(t)}}{\phi}^{(s)},\eta(\theta)\rangle-\sum_{s=1}^{N}A(\eta(\theta))\big{)}$ . Using the stochastic Frank-Wolfe algorithm yields the update:

	$\displaystyle\theta^{(t+1)}$	$\displaystyle=(1-\gamma^{(t)})\theta^{(t)}+\gamma^{(t)}e_{j},$		(12)
	$\displaystyle j$	$\displaystyle=\arg\max\frac{\frac{N}{\|\mathcal{I}\|}\sum_{s\in\mathcal{I}}(\mathbb{E}_{\psi^{(t)}}{\phi}^{(s)}-\mathbb{E}_{y,z,\tau,\iota\|\eta(\theta^{(t)})}\phi^{(s)})}{\theta^{(t)}},$		(12)

where $e_{j}$ is the basis vector with 1 at the j-th element and 0 otherwise, $\mathcal{I}$ is the randomly chosen mini-batch, and $\gamma^{(t)}$ is the step size at the $t$ -th iteration.

There are some notable differences with the SVB algorithm. In SVB, the cumulant function $A(\eta({\theta}))=\ln\sum_{\bm{y}^{(s)},\bm{z}^{(s)},\bm{\tau}^{(s)},\bm{\iota}^{(s)}}\exp\{{\eta}({\theta})^{\top}\phi^{(s)}\}=0$ . The M-step involves evaluating $\mathbb{E}_{\psi^{(t)}}{\phi}^{(s)}$ with respect to the local variational distribution $q(\bm{z}^{(s)},\bm{\tau}^{(s)},\bm{\iota}^{(s)}|\psi^{(t)})$ . In SVEM, the cumulant function $A(\eta({\theta}))\neq 0$ for parameters $\theta$ that consist of sub-normalized probability vectors (e.g., the sum of elements in updated $\pi^{(t)}$ is not one). Besides the message-passing recursion used in SVB, the M-step in SVEM requires a new collecting-distributing pass to calculate the unconditional expectation $\mathbb{E}_{y,z,\tau,\iota|\eta(\theta^{(t)})}\phi^{(s)}$ regarding the complete likelihood $p(\bm{z}^{(s)},\bm{\tau}^{(s)},\bm{\iota}^{(s)},\bm{y}^{(s)}|{\theta^{(t)}})$ . More details on message-passing can be found in Supplementary Materials section A. Additionally, SVEM involves constrained optimizations on probability simplexes in order to guarantee the interpretability of parameter estimates (i.e., the sum of elements in $\pi^{(t)}$ remains one). In SVB, parameter constraints are automatically guaranteed after every update.

5 Results

We implemented SVB and SVEM, and compared their performance using a comprehensive simulation study and a real data study from Jiang et al. (2020).

5.1 Simulation Study

5.1.1 Setup

Data Generation This study contains 20 simulated heteropolymers (SHP) datasets labeled SHP1, …, SHP20. They were generated following Equation LABEL:eq:complete_likelihood. The study combines ten sets of predetermined parameters $(\pi,\alpha,\beta)$ and two different values of $\underline{D}$ , the shortest lengths allowed for $S3$ segments. Table 2 illustrates parameters for SHP1 and SHP2. In the simulation, each set of $(\pi,\alpha,\beta)$ was combined with two values of $\underline{D}$ to generate a pair of “twin” SHP datasets, e.g., SHP1 and SHP2 form a “twin” pair, SHP3 and SHP4 form a “twin” pair, et cetera. That is, each “Twin” pair of SHPs share the same initial probabilities $\pi$ , transition probabilities $\alpha$ , and emission probabilities $\beta$ , but differ in the duration probabilities $\rho$ since it relies on $\underline{D}$ . Thus, twin SHPs are more similar to each other than to the rest of data sets. This design ensures that the simulated SHP systems have a wide range of heterogeneity, with some being more similar and others being very different.

Table 2: Simulation for one pair of “twin” SHPs.

Simulated data	Shared parameters	$\underline{D}$	Duration parameters
SHP1	$(\pi,\alpha,\beta)\sim\text{Dir}(\bm{1})$	7	$\rho\sim\text{Dir}_{\underline{D}}(\bm{1})$
SHP2	$(\pi,\alpha,\beta)\sim\text{Dir}(\bm{1})$	9	$\rho\sim\text{Dir}_{\underline{D}}(\bm{1})$

In more detail, elements in every set of $\theta=(\pi,\alpha,\beta,\rho)$ were sampled from the $\text{Dir}(\bm{1})$ distribution. Recall that the supports for parameters $\rho=(\rho^{S1},\rho^{S2},\rho^{S3})$ , namely $\mathcal{S}^{S1}_{\tau,\iota},\mathcal{S}^{S2}_{\tau,\iota}$ or $\mathcal{S}^{S3}_{\tau,\iota}$ , rely on $\underline{D}$ and $\overline{D}$ (hence the notation $\text{Dir}_{\underline{D}}(\bm{1})$ in Table 2). $\underline{D}$ was set to 7 or 9, and $\overline{D}$ was fixed to 25. These values are based on the current knowledge of RHP systems. For each parameter setup ( $\underline{D}$ and $\theta$ ), we generated one SHP dataset comprised of 5000 chains, each chain at a length of 130 monomers. In each SHP dataset, 500 chains were held out and used as the test set, while the other 4500 chains were used as the training set.

Training Settings For each of the 20 simulated datasets, both SVB and SVEM algorithms were applied with various hyper-parameters. Specifically, we used 10 random initializations in each algorithm. For every initialization, we tried 15 learning rates: in SVB the step size was $\gamma^{(t)}=\frac{1}{(t+\kappa_{1}-1)^{\kappa_{2}}}$ where $\kappa_{1}\in\{10^{3},10^{5},10^{6},10^{7},10^{9}\}$ , $\kappa_{2}\in\{1,0.9,0.7\}$ ; in SVEM the step size was $\gamma^{(t)}=\frac{2}{(t+2+\kappa_{1}-1)^{\kappa_{2}}}$ where $\kappa_{1}\in\{10^{0},10^{1},10^{2},10^{4},10^{6}\}$ , $\kappa_{2}\in\{1,0.9,0.7\}$ . They become the vanilla stochastic (conditional) gradient method when the “decay” $\kappa_{1}$ equals $10^{0}$ and the “rate” $\kappa_{2}$ equals 1. During training we also explored varying values of batch size $|\mathcal{I}|\in\{12,48,120\}$ (in Equations 10 and 12) and shortest $S3$ segment length $\underline{D}\in\{5,7,10,15\}$ . However, we found that neither algorithm was sensitive to these two hyper-parameters, and we do not report them here for brevity. The choice of $|\mathcal{I}|$ and $\underline{D}$ were fixed to 48 and 7, respectively.

In summary, we had 150 training settings. Under each of the 10 initializations, we tried 15 learning rates and reported the one that gave the highest value in objective functions ELBO.

In the following subsections, we empirically studied the two stochastic variational methods and showcased the computational instability rooted in SVB. To our best knowledge, there has been little effort to compare their performance or theoretically characterize their convergence behavior in the literature. Our comparisons between SVB and SVEM in the RHP applications may shed light on an understanding of Bayesian versus frequentist stochastic variational methods.

5.1.2 Performance Evaluation

Point Estimation. Through the GHSMM-RHP model, we obtained two point estimates for $\theta$ . We implemented SVB and used the posterior mean $\mathbb{E}_{\lambda_{\theta}^{(t)}}\theta$ as estimates under the Bayesian framework. We implemented SVEM and used approximated maximum likelihood estimates $\theta^{(t)}$ under the frequentist framework. Figure 3 shows the statistical error of the two estimates in different SHPs. For simplicity only one out of every pair of “twin” SHPs is presented, because we observed that “twin” data sets, such as SHP1 and SHP2, had almost identical results. This indicates that the variational objectives of our model (i.e., Equations 9 and 11) are not sensitive to the parameter $\underline{D}$ .¹¹1This relates to the formulation of objective function itself (for more discussion, see for example (Giordano et al., 2018); (Wainwright and Jordan, 2008, Chapter 7)), and is beyond the scope of current work focusing on algorithmic issues.

Figure 3 shows that SVEM consistently produces low statistical error that is comparable to the best in SVB. However, the SVB algorithm appears more sensitive to the local optima issue and may require “better” initializations (e.g., to be closer to the truth) than the SVEM, even though the global optimizer of variational objectives present arguably good statistical properties (Wang and Blei, 2019). That is, to obtain a near-optimal solution in SVB, more tuning efforts may be needed, such as trying more random initializations or using a proper pilot algorithm. Additionally, we observed that it took fewer iterations for SVB to converge in this study. Note that the running time per iteration is of the same order for both methods, because roughly equal numbers of message passes are required in the E-step.

Segment Prediction. We followed the Viterbi algorithm (Viterbi, 1967) and predicted the hidden segment states on each test set of the 20 SHPs, using parameter estimates from SVB and SVEM. Figure 4 summarizes the findings. As in Figure 3, only one out of every pair of “twin” SHPs is presented because “twin” data sets have similar results.

Panel (a) measures the overall prediction accuracy for segment states $S1,S2$ , and $S3$ . Panel (b) measures the accuracy for $S3$ segments alone. Compared to $S1$ and $S2$ , $S3$ appears less frequently and is more functionally interesting because it has the potential to form a transportation tunnel. Figure 4 shows that SVB and SVEM produced comparable accuracy if trained with proper initializations. Note that SHP11 and SHP15 present lower Jaccard index for $S3$ segments (shown in panel (b)) when compared to the overall accuracy (shown in panel (a)). We observed that there are fewer hydrophobic segments $S3$ in SHP11 and SHP15 than in the other SHPs. So, even if $S3$ predictions are relatively poor in SHP11 and SHP15, the overall accuracy illustrated in panel (a) is not affected much due to the dominance of $S1$ and $S2$ in the chains.

Figure 4 again shows that predictions (using the Viterbi algorithm) by SVB estimates remain less robust to random initializations, while SVEM produces consistent better predictions assessed by overall accuracy and Jaccard index. However in some datasets (e.g., SHP3), the segment prediction based on SVB estimates remains stable (shown in Figure 4) despite SVB’s lack of robustness in terms of statistical error (shown in Figure 3). The black cross dots in Figure 4 are the prediction accuracy using the true parameters for each SHP, i.e., $\theta^{\ast}_{Shp1}$ , …, $\theta^{\ast}_{Shp19}$ . As can be seen, the accuracy based on SVB and SVEM estimates is comparable to the accuracy based on true parameters.

5.1.3 Applications

Designing bio-functional polymers relies on understanding the heterogeneity of RHP chains under various overall monomer distributions. Through GHSMM-RHP, we can statistically quantify the relationships among different RHP datasets using posterior predictive probability in the Bayesian scheme: $p(\bm{y}^{(new)}|\bm{y}^{(1:N)})=\int p(\bm{y}^{(new)}|\theta)p(\theta|\bm{y}^{(1:N)})d\theta$ , which is the average likelihood of the new sequences $\bm{y}^{(new)}$ based on distributions of $\theta$ learned from the observed data. Greater similarity indicates that new sequences may share more similar chemical properties with the training sequences. Several lines of empirical research have shown that the predictive inference based on variational Bayes is similar to the fully Bayes predictive inference based on Markov chain Monte Carlo (MCMC) (Blei et al., 2006; Braun and McAuliffe, 2010). Nott et al. (2012) also used variational approximations in the posterior predictive probability for a cross-validation based model selection.

There are several ways to approximate the intractable posterior predictive probability. One popular method is to replace $p(\theta|\bm{y}^{(1:N)})$ with the variational posterior $q(\theta|\lambda_{\theta}^{(t)})$ and construct a Monte Carlo approximation $p(\bm{y}^{(new)}|\bm{y}^{(1:N)})\approx\int q(\theta|\lambda_{\theta}^{(t)})p(\bm{y}^{(new)}|\theta)d\theta$ . The time complexity is high, however, since it requires the same number of message-passings as the Monte Carlo samples. Beal (2003) proposed a simple lower bound whose computation requests only one round of message-passing as shown in Supplementary Materials section A. However, it is based on a bound of the approximation to the posterior predictive probability, which may raise concern in some situations. The third method is to concentrate the local variational distribution on its posterior mean, a point mass $\mathbb{E}_{\lambda_{\theta}^{(t)}}\theta$ , such that $p(\bm{y}^{(new)}|\bm{y}^{(1:N)})\approx p(\bm{y}^{(new)}|\mathbb{E}_{\lambda_{\theta}^{(t)}}\theta)$ . The computation requires only one message-passing, the same as in the second method.

Our simulations showed the above three methods gave similar numerical values to the posterior predictive probability. We chose the third method $p(\bm{y}^{(new)}|\bm{y}^{(1:N)})\approx p(\bm{y}^{(new)}|\mathbb{E}_{\lambda_{\theta}^{(t)}}\theta)$ as it avoids the sample based time complexity in the first method, and serves as a reliable approximation. $\mathbb{E}_{\lambda_{\theta}^{(t)}}\theta$ is referred to as variational Bayes estimate in Wang and Blei (2019), and is closely related to the SVEM estimates $\theta^{(t)}$ discussed in Section 5.1.2. Thus under the frequentist framework, the approximated maximum likelihood $p(\bm{y}^{(new)}|\theta^{(t)})$ offers an analogous measure, and draws an easier comparison of likelihood between the two algorithms. That is, $\ln p(\bm{y}^{(new)}|\mathbb{E}_{\lambda_{\theta}^{(t)}}\theta)$ in SVB and $\ln p(\bm{y}^{(new)}|\theta^{(t)})$ in SVEM, which are both referred to as log predictive probability in the following discussions.

For each of the 20 SHPs, we used its test set (containing 500 sequences) as $\bm{y}^{(new)}$ (namely $\bm{y}^{(new)}_{Shp1}$ to $\bm{y}^{(new)}_{Shp20}$ ). We used its training set (containing 4500 sequences) to generate 10 SVB estimates and 10 SVEM estimates based on the 10 random initializations. Figure 5 illustrates the log predictive probability using $\bm{y}^{(new)}_{Shp1}$ (Figure 5(a)) and $\bm{y}^{(new)}_{Shp3}$ (Figure 5(b)). The average over 500 test sequences is reported. Results using $\bm{y}^{(new)}$ from other SHPs provide similar information and are not included. Here in both panel (a) and (b), the x-axis indexes training sets of the 20 SHPs. For each training set, 10 approximated log predictive probabilities are plotted in blue using the 10 SVB estimates, and the other 10 points are plotted in red using the 10 SVEM estimates.

In Figure 5(a) when setting test sequences as $\bm{y}^{(new)}_{Shp1}$ , the highest log predictive probability is obtained when the parameter estimate $\hat{\theta}$ (i.e., $\mathbb{E}_{\lambda_{\theta}^{(t)}}\theta$ for SVB, $\theta^{(t)}$ for SVEM) is learned from SHP1. That is, among all $\hat{\theta}$ obtained from 20 SHPs, log predictive probability $\ln p(\bm{y}^{(new)}_{Shp1}|\bm{y}^{(1:N)})\approx\ln p(\bm{y}^{(new)}_{Shp1}|\hat{\theta})$ is maximized when $\hat{\theta}$ best approaches SHP1’s true parameter $\theta^{\ast}_{Shp1}$ . The second highest log predictive probability is obtained when $\hat{\theta}$ is learned from SHP2, the “twin” of SHP1, as expected due to our simulation design. The lower values of $\ln p(\bm{y}^{(new)}_{Shp1}|\hat{\theta})$ , for example when $\hat{\theta}$ is learned from SHP11, suggest that SHP11 is more different from SHP1. However, we should be cautious to what extent we can quantify such difference using log predictive probability (see our discussions in Section 5.2). Similar to Figure 5(a), Figure 5(b) uses $\bm{y}^{(new)}_{Shp3}$ from the test set in SHP3, and shows that the two highest log predictive probabilities are obtained with $\hat{\theta}$ learned from SHP3 and its “twin” SHP4. In addition, Figure 5(a) shows that SHP1/2 have greater similarity to SHP19/20 than to other SHPs, and both SHP1/2 and SHP19/20 present the least similarity to SHP3/4 in Figure 5(b).

Figure 5 illustrates that log predictive probability can be indicative of relative similarity among different RHP datasets. We note that difference between the two log predictive probabilities can be approximated by the log likelihood ratio. In addition, Figure 5 echoes Figure 3 in that both SVB and SVEM provide good estimates to the likelihoods under $\theta^{\ast}_{Shp1}$ , …, $\theta^{\ast}_{Shp20}$ (labeled in black cross dots), whereas SVB exhibits less robustness to random initializations.

5.2 Real Data Study

The simulation study shows that through the GHSMM-RHP model, log predictive probability can be used to indicate the similarities among different RHP datasets. In the real data study, we aim to use the log predictive probability to understand the biochemical properties of new RHP datasets in a predictable way.

Data and Training Settings The real data in Jiang et al. (2020) consists of seven synthesized RHP datasets, with different MMA/EHMA ratios to explore their varying proton transport rates. Specifically, the datasets are RHP50/20 (i.e., MMA/EHMA ratio is 50% : 20%), RHP70/0, RHP60/10, RHP35/35, RHP30/40, RHP40/30, and RHP55/15. In each RHP dataset, 500 chains were held out and used as the test set, while the other 4500 chains were used as the training set. Each RHP chain is of length 130. We used the same training settings as described in Section 5.1.1.

Figure 6(a) shows the experimentally measured proton-transport performance of the RHPs with varying MMA/EHMA ratios (Jiang et al., 2020). The assay was conducted on liposomes (sphere-shaped vesicles with the membrane composed of lipid bilayers) with a lower initial pH inside and a higher initial pH outside. The inner pH change for the liposome is a measure of protons that transport across the RHP-lipid bilayers. Four RHP datasets that promote the proton transport rate were reported. Among them RHP50/20 has the best performance, followed by RHP55/15, RHP40/30, and RHP60/10. However, the 70/0, 35/35 ratios do not lead to distinct difference compared with the liposome sample, which suggests their lower proton transport efficiency. It was argued that with the MMA/EHMA ratio deviating from 50/20, the changed overall hydrophilic-lipophilic balance could affect the interplay between polymers and the bilayers, and obstruct the RHP insertion. Note that there was no experimental result for RHP30/40 so it is not shown in panel (a).

Predictive Inference Results We repeated the analysis in Section 5.1.3 and applied the two stochastic variational methods to the real data. By setting $\bm{y}^{(new)}$ to the test set of RHP50/20 which is the dataset of the highest inner pH change, and computing log predictive probabilities based on $\hat{\theta}$ learned from other RHPs, we hope to understand other RHPs’ proton transport efficiency in a predictable manner.

The approximated log predictive probabilities are plotted in Figure 6(b). The highest log predictive probability is obtained when $\hat{\theta}$ is learned from RHP50/20 itself. The second highest is obtained when $\hat{\theta}$ is learned from RHP55/15, followed by RHP40/30 and RHP60/10. Note that RHP70/0 is omitted from Figure 6(b) due to its extremely low value of log predictive probability (mean at -484.8859 for SVEM, -374.0407 for SVB).

Measuring protein-like behaviors in lab experiments can help to validate the findings in our model. For the four functional RHP datasets described in Jiang et al. (2020), the rankings by log predictive probability (shown in Figure 6(b)) are consistent with the rankings by inner pH change (shown in Figure 6(a)). That is, RHP50/20 followed by RHP55/15, RHP40/30 and RHP60/10. Here, RHP50/20 exhibits the best transport performance and the largest inner pH change reading. For RHPs showing high statistical similarity to RHP50/20, their log predictive probability serves as a good indicator of their relative proton transport performance. That is, those RHPs of higher rankings by GHSMM-RHP are datasets of experimental interest for their potential transport efficiency. However, as discussed in Section 5.1.3, the RHPs of the lower rankings, namely RHP 70/0, RHP35/35, and RHP30/40, are indicative of their relative inactive proton transport functionality, and we need to be cautious to use their exact rank. This is because there are many other biochemical properties that can not be measured by inner pH change. The greater dissimilarity indicated by predictive probability between RHP datasets may not necessarily lead to greater difference in inner pH change.

More importantly, Figure 6 shows that the GHSMM-RHP model can provide guidance to synthesizing RHPs with better potential functionalities. For example, sequences of RHPs are easy to generate in computer simulations, while their biochemical properties are expensive to analyze in the lab. To this end, we could use log predictive probability to select RHP datasets from a wide range of compositional ratios that are most similar to a target RHP of desired properties (such as RHP50/20). The RHPs selected by GHSMM-RHP then have the greatest potential to exhibit the targeted biochemical properties (such as high inner pH change), and are favorable to be further validated in costly wet lab experiments.

The GHSMM-RHP model can also be applied to other RHP studies with different monomer choices (other than MMA, EHMA, OEGMA, and SPMA) and composition ratios. Without testing each RHP variants in the wet lab, the current study presents a reliable approach to predict the performance of RHP systems of varying heterogeneity. Such a cost- and time-efficient endeavor holds great potential for facilitating the design of functional RHPs in other application fields.

6 Discussions

Our model constraints are derived from the result of the sequential analysis in Jiang et al. (2020), as well as our current empirical understanding of RHP systems. Despite the limited domain information, we believe the GHSMM-RHP model captures the critical connections between hidden segment structures and the RHP functionality to some extent. Moreover, we can adjust model settings such as segment length constraints whenever there is new supporting evidence.

The similarity measure we used is built on the predictive inference in Bayesian statistics. In particular, the approximation to predictive probability comes down to approximated likelihood. This relates to the log likelihood ratio statistics when comparing two RHP datasets using their log predictive probabilities. Other statistical tools related to likelihood ratio statistics may be explored for application in RHP study.

The GHSMM-RHP model not only holds potential for measuring the similarity of existing RHP datasets, but can also be exploited to create new groups of RHP sequences. This may be done by iteratively refining the current RHP dataset, removing dissimilar sequences and adding in new sequences of greater similarity, until some termination criterion is met. The newly generated RHP dataset may present novel monomer composition ratios, and offer insights into the target functionality. GHSMM-RHP provides new ways of exploiting the rich heterogeneity as a key factor in designing functional RHPs.

We also empirically studied the computational performance of stochastic variational methods under Bayesian and frequentist frameworks. Our work takes an initial effort to compare the two methods in the same non-convex estimation-prediction problem, in which we carried out an extensive simulation study and a real data application. There could be other hyper-parameters besides the four factors we tried (initializations, learning rates, batch size, minimum segment length) that affect the practical performance of algorithms, and more tuning efforts may be implemented.

We note that the two methods discussed in this paper are termed variational in the spirit of using optimization for density estimation (Wainwright and Jordan, 2008; Blei et al., 2017). It has also been argued that EM is itself a mean-field variational method (Neal and Hinton, 1998; Hoffman et al., 2013). As illustrated, we provided the objective functions of Bayesian and frequentist variational methods in Section 4 using the same variational principle based on KL divergence. Both methods involve stochastic optimizations in the M step. They justify the terminologies we used in this paper post hoc. Similar ideas have been proposed to use only partial information from the entire dataset in the hope of scaling up the inference algorithm. Such efforts include incremental EM (Neal and Hinton, 1998), first order EM (Salakhutdinov et al., 2003), online EM (Cappé and Moulines, 2009), as well as Hoffman et al. (2013) who introduced the first order optimization to variational methods under the Bayesian framework.

Supplementary Material

More implementation details on the message-passing algorithms, derivation of the alternate optimizations, code and RHP datasets are available in the supplementary material.

Acknowledgments

The study was supported by DOD HDTRA1-19-1-0011 and DA ARO W911NF2110128.

References

Alquier et al. [2020] Pierre Alquier, James Ridgway, et al. Concentration of tempered posteriors and of their variational approximations. Annals of Statistics, 48(3):1475–1497, 2020.
Amari [1998] Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural computation, 10(2):251–276, 1998.
Awasthi and Risteski [2015] Pranjal Awasthi and Andrej Risteski. On some provably correct cases of variational inference for topic models. In NIPS, 2015.
Balakrishnan et al. [2017] Sivaraman Balakrishnan, Martin J Wainwright, Bin Yu, et al. Statistical guarantees for the em algorithm: From population to sample-based analysis. Annals of Statistics, 45(1):77–120, 2017.
Baum et al. [1970] Leonard E Baum, Ted Petrie, George Soules, and Norman Weiss. A maximization technique occurring in the statistical analysis of probabilistic functions of markov chains. The annals of mathematical statistics, 41(1):164–171, 1970.
Beal [2003] Matthew J Beal. Variational algorithms for approximate Bayesian inference. PhD thesis, UCL (University College London), 2003.
Bernardo and Smith [2009] José M Bernardo and Adrian FM Smith. Bayesian theory, volume 405. John Wiley & Sons, 2009.
Bickel et al. [2013] Peter Bickel, David Choi, Xiangyu Chang, Hai Zhang, et al. Asymptotic normality of maximum likelihood and its variational approximation for stochastic blockmodels. Annals of Statistics, 41(4):1922–1943, 2013.
Bishop [2006] Christopher M Bishop. Pattern recognition and machine learning. springer, 2006.
Blei et al. [2006] David M Blei, Michael I Jordan, et al. Variational inference for dirichlet process mixtures. Bayesian analysis, 1(1):121–143, 2006.
Blei et al. [2017] David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American statistical Association, 112(518):859–877, 2017.
Bottou [2010] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pages 177–186. Springer, 2010.
Braun and McAuliffe [2010] Michael Braun and Jon McAuliffe. Variational inference for large-scale models of discrete choice. Journal of the American Statistical Association, 105(489):324–335, 2010.
Cappé and Moulines [2009] Olivier Cappé and Eric Moulines. On-line expectation–maximization algorithm for latent data models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(3):593–613, 2009.
Carbonetto et al. [2012] Peter Carbonetto, Matthew Stephens, et al. Scalable variational inference for bayesian variable selection in regression, and its accuracy in genetic association studies. Bayesian analysis, 7(1):73–108, 2012.
Dhaka et al. [2020] Akash Kumar Dhaka, Alejandro Catalina, Michael Riis Andersen, Måns Magnusson, Jonathan H Huggins, and Aki Vehtari. Robust, accurate stochastic optimization for variational inference. arXiv preprint arXiv:2009.00666, 2020.
Foti et al. [2014] Nick Foti, Jason Xu, Dillon Laird, and Emily Fox. Stochastic variational inference for hidden markov models. In Advances in neural information processing systems, pages 3599–3607, 2014.
Ghahramani and Beal [2001] Zoubin Ghahramani and Matthew J Beal. Propagation algorithms for variational bayesian learning. Advances in neural information processing systems, pages 507–513, 2001.
Ghorbani et al. [2019] Behrooz Ghorbani, Hamid Javadi, and Andrea Montanari. An instability in variational inference for topic models. In International conference on machine learning, pages 2221–2231. PMLR, 2019.
Giordano et al. [2018] Ryan Giordano, Tamara Broderick, and Michael I Jordan. Covariances, robustness and variational bayes. Journal of machine learning research, 19(51), 2018.
Guo et al. [2016] Fangjian Guo, Xiangyu Wang, Kai Fan, Tamara Broderick, and David B Dunson. Boosting variational inference. arXiv preprint arXiv:1611.05559, 2016.
Hoffman et al. [2013] Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.
Jiang et al. [2020] Tao Jiang, Aaron Hall, Marco Eres, Zahra Hemmatian, Baofu Qiao, Yun Zhou, Zhiyuan Ruan, Andrew D Couse, William T Heller, Haiyan Huang, et al. Single-chain heteropolymers transport protons selectively and rapidly. Nature, 577(7789):216–220, 2020.
Johnson and Willsky [2014] Matthew Johnson and Alan Willsky. Stochastic variational inference for bayesian time series models. In International Conference on Machine Learning, pages 1854–1862. PMLR, 2014.
Jordan et al. [1999] Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. An introduction to variational methods for graphical models. Machine learning, 37(2):183–233, 1999.
Kingma and Welling [2013] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
Lauritzen and Spiegelhalter [1988] Steffen L Lauritzen and David J Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society: Series B (Methodological), 50(2):157–194, 1988.
Lottaz et al. [2003] Claudio Lottaz, Christian Iseli, C Victor Jongeneel, and Philipp Bucher. Modeling sequencing errors by combining hidden markov models. Bioinformatics, 19(suppl_2):ii103–ii112, 2003.
Maathuis et al. [2018] Marloes Maathuis, Mathias Drton, Steffen Lauritzen, and Martin Wainwright. Handbook of graphical models. CRC Press, 2018.
MacKay [1997] David JC MacKay. Ensemble learning for hidden markov models. Technical report, Citeseer, 1997.
Mukherjee and Sarkar [2018] Soumendu Sundar Mukherjee and Purnamrita Sarkar. Mean field for the stochastic blockmodel: Optimization landscape and convergence issues. Advances in neural information processing systems, 2018.
Munch and Krogh [2006] Kasper Munch and Anders Krogh. Automatic generation of gene finders for eukaryotic species. BMC bioinformatics, 7(1):1–12, 2006.
Neal and Hinton [1998] Radford M Neal and Geoffrey E Hinton. A view of the em algorithm that justifies incremental, sparse, and other variants. In Learning in graphical models, pages 355–368. Springer, 1998.
Nott et al. [2012] David J Nott, Siew Li Tan, Mattias Villani, and Robert Kohn. Regression density estimation with variational methods and stochastic approximation. Journal of Computational and Graphical Statistics, 21(3):797–820, 2012.
Panganiban et al. [2018] Brian Panganiban, Baofu Qiao, Tao Jiang, Christopher DelRe, Mona M Obadia, Trung Dac Nguyen, Anton AA Smith, Aaron Hall, Izaac Sit, Marquise G Crosby, et al. Random heteropolymers preserve protein function in foreign environments. Science, 359(6381):1239–1243, 2018.
Pati et al. [2018] Debdeep Pati, Anirban Bhattacharya, and Yun Yang. On statistical optimality of variational bayes. In International Conference on Artificial Intelligence and Statistics, pages 1579–1588. PMLR, 2018.
Rabiner [1989] Lawrence R Rabiner. A tutorial on hidden markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
Reddi et al. [2016] Sashank J Reddi, Suvrit Sra, Barnabás Póczos, and Alex Smola. Stochastic frank-wolfe methods for nonconvex optimization. In 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1244–1251. IEEE, 2016.
Regier et al. [2015] Jeffrey Regier, Andrew Miller, Jon McAuliffe, Ryan Adams, Matt Hoffman, Dustin Lang, David Schlegel, and Mr Prabhat. Celeste: Variational inference for a generative model of astronomical images. In International Conference on Machine Learning, pages 2095–2103. PMLR, 2015.
Robbins and Monro [1951] Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.
Salakhutdinov et al. [2003] Ruslan Salakhutdinov, Sam T Roweis, and Zoubin Ghahramani. Optimization with em and expectation-conjugate-gradient. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 672–679, 2003.
Sheth and Khardon [2017] Rishit Sheth and Roni Khardon. Excess risk bounds for the bayes risk using variational inference in latent gaussian models. Advances in neural information processing systems, 30, 2017.
Viterbi [1967] Andrew Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE transactions on Information Theory, 13(2):260–269, 1967.
Wainwright and Jordan [2008] Martin J Wainwright and Michael I Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1-2):1–305, 2008.
Wang and Blei [2019] Yixin Wang and David M Blei. Frequentist consistency of variational bayes. Journal of the American Statistical Association, 114(527):1147–1161, 2019.
Westling and McCormick [2019] T Westling and TH McCormick. Beyond prediction: A framework for inference with variational approximations in mixture models. Journal of Computational and Graphical Statistics, 28(4):778–789, 2019.
Winn et al. [2005] John Winn, Christopher M Bishop, and Tommi Jaakkola. Variational message passing. Journal of Machine Learning Research, 6(4), 2005.
Yang et al. [2020] Yun Yang, Debdeep Pati, Anirban Bhattacharya, et al. $\alpha$ -variational inference with statistical guarantees. Annals of Statistics, 48(2):886–905, 2020.
Yogatama et al. [2014] Dani Yogatama, Chong Wang, Bryan R Routledge, Noah A Smith, and Eric P Xing. Dynamic language models for streaming text. Transactions of the Association for Computational Linguistics, 2:181–192, 2014.
Yu [2010] Shun-Zheng Yu. Hidden semi-markov models. Artificial intelligence, 174(2):215–243, 2010.
Yu and Kobayashi [2003] Shun-Zheng Yu and Hisashi Kobayashi. An efficient forward-backward algorithm for an explicit-duration hidden markov model. IEEE signal processing letters, 10(1):11–14, 2003.
Zhang et al. [2020a] Anderson Y Zhang, Harrison H Zhou, et al. Theoretical and computational guarantees of mean field variational inference for community detection. Annals of Statistics, 48(5):2575–2598, 2020a.
Zhang et al. [2020b] Fengshuo Zhang, Chao Gao, et al. Convergence rates of variational posterior distributions. Annals of Statistics, 48(4):2180–2207, 2020b.

	$\displaystyle{\pi}\|{\widetilde{\lambda}}_{\pi}\sim\text{Dir}({\widetilde{\lambda}}_{\pi}+\bm{1}),$		(7)
	$\displaystyle\alpha_{i}\|{\widetilde{\lambda}}_{\alpha_{i}}\sim\text{Dir}({\widetilde{\lambda}}_{\alpha_{i}}+\bm{1}),\forall i\in\{S1,S2,S3\},$
	$\displaystyle{\rho}^{i}\|{\widetilde{\lambda}}_{{\rho}^{i}}\sim\text{Dir}({\widetilde{\lambda}}_{{\rho}^{i}}+\bm{1}),\forall i\in\{S1,S2,S3\},$
	$\displaystyle\beta_{\text{m}}\|{\widetilde{\lambda}}_{\beta_{\text{m}}}\sim\text{Dir}({\widetilde{\lambda}}_{\beta_{\text{m}}}+\bm{1}),\forall\text{m}\in\{S1,S2,S3,S3^{\prime}\}.$