This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

A Phylogenetic Trees Analysis of SARS-CoV-2

Chen Shen, Vic Patrangenaru and Roland Moore
(June 13, 2021)
Abstract

One regards spaces of trees as stratified spaces, to study distributions of phylogenetic trees. Stratified spaces with may have cycles, however spaces of trees with a fixed number of leafs are contractible. Spaces of trees with three leafs, in particular, are spiders with three legs. One gives an elementary proof of the stickiness of intrinsic sample means on spiders. One also represents four leafs tree data in terms of an associated Petersen graph. One applies such ideas to analyze RNA sequences of SARS-CoV-2 from multiple sources, by building samples of trees and running nonparametric statistics for intrinsic means on tree spaces with three and four leafs. SARS-CoV-2 are also used to built trees with leaves consisting in addition to other related coronaviruses.

Keywords: phylogenetic tree, tree space, stratified space, Central Limit Theorem, sticky means, SARS-CoV-2

1 Introduction

We first introduce some basic notions of phylogenetics that are common knowledge in Biology. A phylogenetic tree is a directed tree (directed simply connected graph), having a distinguished vertex, that is an ancestor of all other vertices, called root; all vertices that have no descendants are called leaves of the phylogenetic tree. In a phylogenetic tree, each edge is called a branch. Each vertex where the two branches meet is called a branching point, or node. A node is the most recent common ancestor of all species on the other vertices of those branches.

Deoxyribonucleic acid (DNA) sequences can be used to draw a phylogenetic tree. A nucleotide consists of a sugar molecule (either ribose in ribonucleic acid (RNA) or DNA attached to a phosphate group and a nitrogen-containing base. The bases used in DNA are purines adenine (A) and guanine (G), pyrimidine cytosine (C) and pyrimidine thymine (T). In RNA, the base pyrimidine uracil (U) takes the place of T. To construct a tree, we will compare the RNA or DNA sequences of different species or individuals. Related individuals have a common ancestor. Before species split into separate descendants, they had the same RNA (or DNA). But as species evolve and diverge, they will accumulate changes in the DNA sequences. We can use these changes in the DNA to tell how closely related two individuals are. If there are not very many differences, they are probably closely related; if there are many changes, they might be distant relatives.
The DNA sequences are double-stranded and are made of letters A, G, C, and T, and RNA sequences are similar to those of the DNA and made of letters A, G, C, and U. Chromosomes are thread-like structures located inside the nucleus of animal and plant cells. Each chromosome is made of protein and a single molecule of DNA. Passed from parents to offspring, DNA contains the specific instructions that make each type of living creature unique. A gene is an ordered sequence of nucleotides located in a particular position on a particular chromosome that encodes a specific functional product (i.e., a protein or RNA molecule). Biologists commonly use one homogeneous sequence, which in the biologist’s language concerns the relationship between gene trees and genes that are made from one tree. The gene sequence might be about 200 base pairs long. One of the problems that has occurred in the last fifty years is that biologists believe that the way evolution works is that there would only be one species tree. Different genes have different histories, so you get different gene trees. Putting them together is a statistical problem that helps study of the evolutionary process.

A phylogenetic tree with pp leaves is an equivalence class based on a certain equivalence of a DNA-based connected directed graph of species with no loops, having an unobserved root (common ancestor) and pp observed leaves (currently observed species of a certain family of living creatures). The paper is organized as follows; in the first section one introduces tree spaces of trees with mm leafs, which are regarded as stratified spaces of dimension m2m-2. In particular the detailed description is fiven for the spaces T3T_{3} and T4T_{4}. It is shown that T3T_{3} is a 3-spider, and T4T_{4} can be described in terms of the so called Petersen graph. Section three is dedicated to a proof of the central limit theorem for intrinsic means of spiders. An elementary proof of the stickiness theorem (see Hotz et al (2013)[9] is given here. The fourth section is dedicated to the CLT for intrinsic means on T4T_{4} (see Bardem et al.(2013)[1]). Here the key result is the partial stickiness of the intrinsic means, that stick to the 1D stratum of the tree space. The last part of the paper is applied, classifying SARS-CoV-2 data, according to stickiness of their intrinsic sample means.

2 Tree Spaces

Our stratified data analysis requires certain basic concepts, that are fairly recent in Statistics (see eg Patrangenaru and Ellingson (2015)[10], p.475, and references there in).

Definition 1.

A stratified space (space with a manifold stratification) is a metric space \mathcal{M} that admits a filtration =F1F0F1Fn=M=iFi\emptyset=F_{-1}\subseteq F_{0}\subseteq F_{1}\dots\subseteq F_{n}\subset\dots=M=\cup_{i}F_{i}, By closed subspaces, such that for each i=1,,n,FiFi1i=1,\dots,n,F_{i}\setminus F_{i-1} is empty or is an ii-dimensional manifold, called the ii-th stratum.

An elementary example of a 2 dimensional stratified space is a cone 𝒞,\mathcal{C}, set of solutions in 3\mathbb{R}^{3} of the equation x2+y2z2=0;x^{2}+y^{2}-z^{2}=0; In this case F0={(0,0,0)},F2=𝒞,FiFi1=,i0,2.F_{0}=\{(0,0,0)\},F_{2}=\mathcal{C},F_{i}\setminus F_{i-1}=,\forall i\notin{0,2}.

A tree with pp leaves is a connected, simply connected graph , with a distinguished vertex, labeled oo, called the root, and pp vertices of degree 1, called leaves, that are labeled from 1 to pp. In addition, we assume that with all interior edges have positive lengths. (An edge of a p-tree is called interior if it is not connected to a leaf.)

Now consider a tree TT, with interior edges e1,,ere_{1},...,e_{r} of lengths l1,,lrl_{1},...,l_{r}, respectively. If TT is binary, then r=p2r=p-2, otherwise r<p2r<p-2. The vector(l1,,lr)T{(l_{1},...,l_{r})}^{T} specifies a point in the positive open orthant (0,)r{(0,\infty)}^{r}.

An pp-tree has the maximal possible number of interior edges p2p-2 and thus determines the largest possible dimensional orthant, when it is a binary tree; in this case the orthant is p2p-2 dimensional. The open orthants form the highest dimensional stratum Fp2Fp3F_{p-2}\setminus F_{p-3} of Tp,T_{p}, regarded as a p2p-2 dimensional stratified space. The orthant corresponding to each tree which is not binary appears as a boundary face of the orthants corresponding to at least three binary trees; in particular the origin of each orthant corresponds to the (unique) tree with no interior edges. We construct the space TpT_{p} by taking one p2p-2 dimensional orthant for each of the (2p3)!!(2p-3)!! possible binary trees, and gluing them together along their common faces. Due to such gluing operations, tree spaces are not manifolds. Singularities (points where the space does not have a tangent space of the same dimension as the space itself) are present in the tree space structure. For further details on the construction of the tree space TpT_{p}, see Billera et al.(2001)[6];

In this paper, we are considering phylogenetic trees with 3 or 4 leafs. The space of trees with 3 leafs is T3T_{3} can be identified with S3S_{3}, a 33-spider, which is the union of three line segments with a common end (see left hand side of Figure 1).

T4T_{4} as a stratified space has 15=(2×43)!!15=(2\times 4-3)!! 2D quadrants glued according to tree identification rules (see Billera et. al.(2001)[6]). Interior points of these quadrants are combinatorial binary trees with four leaves, the coordinates of an interior point being given by the two interior edges of a binary tree in one of these combinatorial binary trees. Points on the boundaries of the quadrants, are on the one dimensional stratum, and correspond to combinatorial trees with four leaves, which are obtained from a combinatorial binary tree by shrinking one of the interior edges to zero length (see right hand side of Figure 1). The zero dimensional stratum is made of one point only, the star tree.

Refer to caption
Refer to caption
Figure 1: Tree spaces T3T_{3} and T4T_{4}.

Therefore a representation of T4T_{4} as a surface with singularities can be obtained from the polyhedral surface obtained by the identifications of three semi-axes labeled with the same letter as shown in Figure 2. While this is a 3D pictorial representation only, in general TmT_{m} can be embedded in k\mathbb{R}{{}^{k}}, k=2mm2k=2^{m}-m-2 ( see Ellingson et al(2014) [7]), so, in fact, given that the number of edges that form the one dimensional stratum of T4T_{4} is 2442=10,2^{4}-4-2=10, this stratified space is embedded in 10\mathbb{R}^{10} having the star tree at the origin. In this representation, the intersection of a sphere in 10\mathbb{R}^{10} centered at the origin with T4T_{4} is the so called Petersen graph. The embedded surface can be thus regarded as a one sheet cone over the Petersen graph. An edge of the Petersen graph is the transverse intersection of one quadrant with the sphere; thus there are 1515 edges; a vertex of this graph is a point where one of the coordinate semi-axes pierces the sphere; therefore there are 1010 vertices ( see figure 3).

Refer to caption
Figure 2: A 2D stratified space - T4T_{4}, space of trees with four leaves
Refer to caption
Figure 3: Petersen graph

3 An elementary proof of the intrinsic CLT on spiders

The intrinsic mean on Tm,T_{m}, with the piecewise Euclidean metric is unique, since TmT_{m} with this metric is a CAT(0) space. A CLT for the intrinsic sample mean on a tree space TmT_{m}, for mm fixed, is however still ongoing research. The most recent results, to our knowledge, are due to Shen (2021)[23], Barden et al.(2018)[3], and Barden and Le(2018)[2] are are concerned only with the case when the intrinsic mean lies on the first and second highest dimensional strata of Tm.T_{m}. In this section we are concerned with the case m=3,m=3, when it is known that T3=S3.T_{3}=S_{3}. Therefore we study at no additional effort, that case of the CLT for the intrinsic sample mean on a spider Sp.S_{p}. Assume Xi,i=1,,nX_{i},i=1,\dots,n are i.i.d. random objects on a spider Sp,S_{p}, having legs La,a=1,p,L_{a},a=1,\dots p, and center C.C. Further, assume the intrinsic mean μX1,I\mu_{X_{1},I} exists and the intrinsic variance is finite. Any probability measure QQ on SpS_{p} decomposes uniquely as a weighted sum of probability measures QkQ_{k} on the legs LkL_{k} and an atom Q0Q_{0} at CC (Hotz et. al.(2013)[9]). More precisely, there are nonnegative real numbers {wk}k=0p\{w_{k}\}_{k=0}^{p} summing to 11 such that, for any Borel set ASpA\subseteq S_{p}, the measure QQ takes the value

Q(A)=w0Q0(AC)+k=1pwkQk(ALk).Q(A)=w_{0}Q_{0}(A\cap C)+\sum_{k=1}^{p}w_{k}Q_{k}(A\cap L_{k}). (1)

We will consider the nontrivial case when the moments νa=E(Qa),a=1,,p\nu_{a}=E(Q_{a}),a=1,\dots,p are all positive.

Theorem 2.

(Hotz et. al. (2013)). Assume w0=0.w_{0}=0. (i) If there exists a1,p¯a\in\overline{1,p} such that waνa>bawbνbw_{a}\nu_{a}>\sum_{b\neq a}w_{b}\nu_{b} then μX1,ILa\mu_{X_{1},I}\in L_{a} and, for nn large enough X¯n,ILa\bar{X}_{n,I}\in L_{a} and n(X¯n,IμX1,I)\sqrt{n}(\bar{X}_{n,I}-\mu_{X_{1},I}) has asymptotically a normal distribution. (ii) If there exists a1,p¯a\in\overline{1,p} such that waνa=bawbνb,w_{a}\nu_{a}=\sum_{b\neq a}w_{b}\nu_{b}, then, after folding the legs Lb,ba,L_{b},b\neq a, into one half line opposite to LaL_{a}, n(X¯n,I)\sqrt{n}(\bar{X}_{n,I}) has asymptotically the distribution of the absolute value of a normal distribution. (iii) If a1,p¯,waνa<bawbνb,\forall a\in\overline{1,p},w_{a}\nu_{a}<\sum_{b\neq a}w_{b}\nu_{b}, then μX1,I=C\mu_{X_{1},I}=C and there is n0n_{0} s.t. nn0,\forall n\geq n_{0}, then X¯n,I\bar{X}_{n,I} = CC a.s.

For a proof, recall that if xLax\in L_{a}, the Fréchet function FF is defined as follows,

F(x)\displaystyle F(x) =i=1,iaK0(x+u)2wiQi(du)+0(xu)2waQa(du)\displaystyle=\sum_{i=1,i\neq a}^{K}\int_{0}^{\infty}(x+u)^{2}w_{i}Q_{i}(du)+\int_{0}^{\infty}(x-u)^{2}w_{a}Q_{a}(du) (2)
=x2i=1K0wiQi(du)+2x[i=1,iaK0uwiQi(du)0uwaQa(du)]+i=1K0u2wiQi(du)\displaystyle=x^{2}\sum_{i=1}^{K}\int_{0}^{\infty}w_{i}Q_{i}(du)+2x[\sum_{i=1,i\neq a}^{K}\int_{0}^{\infty}uw_{i}Q_{i}(du)-\int_{0}^{\infty}uw_{a}Q_{a}(du)]+\sum_{i=1}^{K}\int_{0}^{\infty}u^{2}w_{i}Q_{i}(du) (3)
=x2+2[i=1,iaKviva]x+const.\displaystyle=x^{2}+2[\sum_{i=1,i\neq a}^{K}v_{i}-v_{a}]x+const. (4)

where vi=0uwiQi(du)=wiνi.v_{i}=\int_{0}^{\infty}uw_{i}Q_{i}(du)=w_{i}\nu_{i}.

Since there exists an unique minimizer for the Fréchet function FF in (2), this minimizer is intrinsic mean μI\mu_{I}. The minimizer of this quadratic function is x=vaiavix^{*}=v_{a}-\sum_{i\neq a}v_{i}, where xLa={(a,u):u[0,)}x\in L_{a}=\{(a,u):u\in[0,\infty)\}. Thus, we have three situations:(i)vaiavi>0v_{a}-\sum_{i\neq a}v_{i}>0 or va>iaviv_{a}>\sum_{i\neq a}v_{i}, (ii)vaiavi=0v_{a}-\sum_{i\neq a}v_{i}=0 or va=iaviv_{a}=\sum_{i\neq a}v_{i} and iii vaiavi<0v_{a}-\sum_{i\neq a}v_{i}<0 or va<iaviv_{a}<\sum_{i\neq a}v_{i}. In case(i), we have μI\mu_{I} well defined on LaL_{a}, then classical C.L.T is applied. In case (ii), we can fold other legs into that half line opposite to LaL_{a} then apply C.L.T. and since the negative part is undefined, so the result goes to a positive truncated normal distribution. In case (iii), for any a{1,,K}a\in\{1,\dots,K\}, we have μI=C\mu_{I}=C, which shows that intrinsic mean μI\mu_{I} sticks to the center CC;

Corollary 3.

Under the assumptions of Theorem 2 (iii), we say that the sample mean is sticky, a condition that can be readily verified for data on a spider.

Remark 4.

The first paper on the asymptotic behavior of the CLT on a stratified space, is due to Basrak(2010)[5], who gave the first stickiness example for intrinsic means of distributions on metric binary trees. According to Basrak(2010):“ Limit theorems on ( binary ) trees will need minor adjustments, since on a general tree, the barycenter can split the tree into more than three subtrees. Nevertheless, asymptotically, the inductive mean will have one of the three types of behavior described in Theorem 3 in Basrak(2010), meaning that the stickiness phenomenon is still present for distributions on trees. This comment from Basrak(2010) does not extend to arbitrary finite, connected graphs, since graphs usually have cycles (see Figure 4).

Refer to caption
Figure 4: Bicyclic graph

However, Basrak’s condition for trees can be extended to the case of graphs in some general cases even if the graph GG has cycles with positive mass on any arc of a cycle. Given a graph G,G, for each xGx\in G and leg LaL_{a}, d(x,ξ)d(x,\xi) is either increasing or decreasing for ξLa\xi\in L_{a}. We define εa(x)=+1\varepsilon_{a}(x)=+1 if increasing and εa(x)=1\varepsilon_{a}(x)=-1 if decreasing. Then μI\mu_{I} is sticky if for all legs LaL_{a}

E(d(X,μI)εa(X))>0.E(d(X,\mu_{I})\varepsilon_{a}(X))>0. (5)

This is easily identified as a condition that implies that μI\mu_{I} is a local minimum of the Fréchet function. The quantity ma=E(d(X,μI)εa(X))m_{a}=E(d(X,\mu_{I})\varepsilon_{a}(X)) is called the net moment in the direction of LaL_{a}.
As an aside, one can consider for any ξG\xi\in G its star neigborhood SξS_{\xi} and the “derivative” in the direction of leg LaL_{a} of Sξ,S_{\xi}, given by

DLaFd=E(2d(X,ξ)εa(X)).D_{L_{a}}F_{d}=E(2d(X,\xi)\varepsilon_{a}(X)). (6)

Returning to the Fréchet mean μI\mu_{I}, it follows that for large nn and samples X1,,XnX_{1},\ldots,X_{n} and leg La,L_{a}, then

1ni=1nd(Xi,μI)εa(Xi)dE(d(X,μI)εa(X))>0 almost surely.\frac{1}{n}\sum_{i=1}^{n}d(X_{i},\mu_{I})\varepsilon_{a}(X_{i})\to_{d}E(d(X,\mu_{I})\varepsilon_{a}(X))>0\hbox{ almost surely.} (7)

Therefore, as in Theorem2(iii), with high probability μI\mu_{I} is a local minimum for the Fréchet function corresponding to the sample. One should show that with high probability μI\mu_{I} is a point of absolute minimum for the Fréchet function corresponding to the empirical.

Remark 5.

(Hendriks(2014)) One may note that in the case of graphs having cycles, the condition of stickiness is similar with the condition in Theorem 2. Consider the situation where μI\mu_{I} is a vertex, and CaC_{a} is a connected component of G\{μI}G\backslash\{\mu_{I}\} that has a unique edge joined to μI.\mu_{I}. Then Ga=CaμIG_{a}=C_{a}\cup\mu_{I} is a subgraph of GG. For each xGax\in G_{a} there is the intrinsic distance from xx to μI\mu_{I}, and with respect to the probability QaQ_{a} conditional to be in CaC_{a} there is an expected intrinsic distance, called the moment (of QaQ_{a} or GaG_{a} with respect to μI\mu_{I}), which we label ma.m_{a}. The condition for stickiness is equivalent in this case to ma<bama,m_{a}<\sum_{b\neq a}m_{a}, which in the case of a star tree (spider), neighborhood of the Fréchet mean, is the right notion to study the empirical limit behavior, equivalent to the condition in Theorem 2 (iii).

Remark 6.

Certain complications arise in describing stickiness phenomena, in case when the graph includes a cycle of positive probability mass. Consider for example the case of a simple graph such as the circle, realizable as a one vertex, one edge graph. In this case, if μI\mu_{I} is a Fréchet mean, then the antipodal point, μI-\mu_{I}, must have probability 0.0. Even, if the density in a neighborhood of μI-\mu_{I} is continuous, then the density at μI-\mu_{I} cannot exceed (2π)1(2\pi)^{-1}. If the density is below (2π)1(2\pi)^{-1} Hotz and Huckemann (2015)[16] prove a central limit theorem to a distribution other than normal.

4 On the CLT on T4T_{4}

The space T4T_{4} consists of fifteen 2-dimensional faces that are quadrants of planes bounded by certain pairs of the out of the ten positive semi-axes in 10,\mathbb{R}^{1}0, that are dictated by the leaf tree structure, as shown on the Petersen graph in Figure 5 (see Barden et al(2013)[1]).

Refer to caption
Figure 5: Structure of T4T_{4} reflected on the Petersen graph.

There are a number of cases to consider, when it comes to CLT on T4T_{4}, depending on whether the population intrinsic mean lies in the 2D stratum, in the 1D stratum or at the origin (star tree). The limiting distribution for all cases are given in Barden et al (2013)[1]. In particular here we detail the case, when the data lies on three quadrants forming what is known as an open book with three leafs (see Hotz el al(2013)[9]), obtained by gluing these quadrants along a common boundary (spine). Such a particular type of open book could be defined as Li={(i,x1,x2):x1,x2[0,)}i=1,2,3L_{i}=\{(i,x_{1},x_{2}):x_{1},x_{2}\in[0,\infty)\}i=1,2,3, together at the common spine S=[0,)S=[0,\infty) which comprises the equivalence classes in i=13([0,)×[0,)×{i}\cup_{i=1}^{3}([0,\infty)\times[0,\infty)\times\{i\}. Thus, the open book O3O_{3} is the disjoint union

O3=SL1+L2+L3+O_{3}=S\cup L_{1}^{+}\cup L_{2}^{+}\cup L_{3}^{+}

of the spine SS and the interiors Li+=Li/SL_{i}^{+}=L_{i}/S of the leaves, i=1,2,3i=1,2,3. If the points x=(x1,x2),y=(y1,y2)\textbf{x}=(x_{1},x_{2}),\textbf{y}=(y_{1},y_{2}) are on the same flat convex leaf (as a subset of 2,\mathbb{R}^{2}, the distance between them d(x,y)=xyd(\textbf{x},\textbf{y})=\|\textbf{x}-\textbf{y}\|. If the points x,y\textbf{x},\textbf{y} are on different leafs (see figure 6), one could replace point x by x’ onto the half space opposite to LaL_{a}, then the distance is defined as d(x,y)=x’yd(\textbf{x},\textbf{y})=\|\textbf{x'}-\textbf{y}\|, where x’=(x1,x2)\textbf{x'}=(x_{1},-x_{2}).

Refer to caption
Figure 6: Distance between points on different leafs of an open book

Assume Xi,i=1,,nX_{i},i=1,\dots,n are i.i.d. random objects on O3O_{3}. Denote a weighted sum of probability measure Q3Q_{3} on the legs LiL_{i} and an atom Q0Q_{0} at CC. Further, note that the intrinsic mean μI\mu_{I} exists since space has and the intrinsic variance is finite. We assume w0=0w_{0}=0 and xLax\in L_{a}, the Fréchet function is defined as follows,

F(x)\displaystyle F(\textbf{x}) =i=1,ia3O3x’u2wiQi(du)+O3xu2waQa(du)\displaystyle=\sum_{i=1,i\neq a}^{3}\int_{O_{3}}\|\textbf{x'}-\textbf{u}\|^{2}w_{i}Q_{i}(d\textbf{u})+\int_{O_{3}}\|\textbf{x}-\textbf{u}\|^{2}w_{a}Q_{a}(d\textbf{u})
=x2ia2ox’,uwiQi(du)2ox,uwaQa(du)+i=1K0u2wiQi(du)\displaystyle=\|\textbf{x}\|^{2}-\sum_{i\neq a}2\int_{o}^{\infty}\langle\textbf{x'},\textbf{u}\rangle w_{i}Q_{i}(d\textbf{u})-2\int_{o}^{\infty}\langle\textbf{x},\textbf{u}\rangle w_{a}Q_{a}(d\textbf{u})+\sum_{i=1}^{K}\int_{0}^{\infty}\|\textbf{u}\|^{2}w_{i}Q_{i}(d\textbf{u})
=x122[i=1K0u1wiQi(1)(du1)]x1+x222[0u2waQa(2)(du2)ia0u2wiQi(2)(du2)]x2+const.\displaystyle=x_{1}^{2}-2[\sum_{i=1}^{K}\int_{0}^{\infty}u_{1}w_{i}Q_{i}^{(1)}(du_{1})]x_{1}+x_{2}^{2}-2[\int_{0}^{\infty}u_{2}w_{a}Q_{a}^{(2)}(du_{2})-\sum_{i\neq a}\int_{0}^{\infty}u_{2}w_{i}Q_{i}^{(2)}(du_{2})]x_{2}+const.
=x122i=1Kvi(1)x1+x222[va(2)iavi(2)]x2+const.\displaystyle=x_{1}^{2}-2\sum_{i=1}^{K}v_{i}^{(1)}x_{1}+x_{2}^{2}-2[v_{a}^{(2)}-\sum_{i\neq a}v_{i}^{(2)}]x_{2}+const.

where vi(j)=0uwiQi(j)(du)v_{i}^{(j)}=\int_{0}^{\infty}uw_{i}Q_{i}^{(j)}(du).

To minimize F(x)F(\textbf{x}) is to minimize the quadratic form for x1,x2x_{1},x_{2} and the solution is x1=i=13vi(1),x2=va(2)iavi(2)x_{1}^{*}=\sum_{i=1}^{3}v_{i}^{(1)},x_{2}^{*}=v_{a}^{(2)}-\sum_{i\neq a}v_{i}^{(2)}. Since vi(j)0,x1v_{i}^{(j)}\geq 0,x_{1}^{*} is always non-negative. Thus, one could apply C.L.T. on x1x_{1}^{*} . However, for x2x_{2}^{*}, one have to discuss the three situations as did for the spider spaces. Therefore, in this case, if one separate the space into two directions, the stickiness of intrinsic mean would only occur on one direction, which is in a lower dimensional space. If for all a, va(2)<iavi(2)v_{a}^{(2)}<\sum_{i\neq a}v_{i}^{(2)}, the intrinsic sample mean would stick to the spine S, but still has one degree of freedom. Thus for n,n\to\infty, the intrinsic sample mean would stick to the spine S, and if it’s coordinate on the spine is X¯I,n,\bar{X}_{I,n}, than n(X¯I,n,μI)dY,\sqrt{n}(\bar{X}_{I,n},-\mu_{I})\to_{d}Y, where Y𝒩(0,σ2).Y\sim\mathcal{N}(0,\sigma^{2}). Here μI\mu_{I} is the mean of the projection of YY on the spine, and σ2\sigma^{2} is its variance.

Another case of interest (see Barden et al(2013)[1]), is data on a Q5Q_{5}, as a union of five quadrants, that is isometric with a subset made of five quadrants on T4,T_{4}, whose trace on a the Petersen form a cycle ( see eg the cycle of quadrants from the edge xx to itself in Figure 2. In this case the asymptotic distribution according to Barden et al (2013)[1] (see also Barden and Le(2018)[2]), behaves, via the piecewise Riemannian Log map, like a bivariate normal.

5 Application to SARS-CoV-2

5.1 Background

The SARS-CoV-2 was first identified in Wuhan, China, in 2019. As of June 13, 2021, at least 177 million cases have been confirmed worldwide[17]. As we speak, the number of cases keeps increasing. More than 220 countries and territories have been affected, resulting in at least 3,827,000 deaths; more than 161 million people have recovered.

The virus primarily spreads via respiratory droplets produced during breathing, coughing, sneezing, singing or talking. It may also be transmitted via contaminated surfaces. The time between exposure and symptom onset is typically five days, but may range from two to fourteen days. Symptoms may include fever, cough, and shortness of breath. By mid-December 2020, National regulatory authorities have approved six vaccines for public use: two RNA based vaccines (tozinameran from Pfizer&BioNTech and mRNA-1273 from Moderna), three conventional inactivated vaccines (Jenssen from Johnson &\& Johnson, BBIBP-CorV from Sinopharm and CoronaVac from Sinovac), and two viral vector vaccines (Gam-COVID-Vac from the Gamaleya Research Institute and AZD1222 from the University of Oxford and AstraZeneca). Recommended preventive measures include hand washing, maintaining distance from people, given tat there are asymptomatic cases, and monitoring and self-isolation for fourteen days if an individual suspects being infected.

5.2 Data Processing

The data we used in this paper are from multiple resources including GenBank[14] for SARS-CoV[20], MERS-CoV[21],and SARS-CoV-bat[22] and GISAID[13] for SARS-CoV-2[19]. The data includes the virus name, host species, sample date and sample location. From the raw data green see figure 7, we first align the RNA sequences with each other via Clustal Omega method [12]. The Clustal Omega method uses the k-tuple distance to construct the distance matrix. The vector distances in the distance matrix are then used to cluster the sequences into subclusters using a k-means approach. From the aligned RNA sequences, see figure 8 , one could compute the fraction of mismatches at aligned positions, with gaps either ignored or counted as mismatches as distances from each RNA sequence. And then apply the neighbor-joining method [11] to build trees.

Refer to caption
Figure 7: Example of SARS-CoV-2 RNA subsequence data
Refer to caption
Figure 8: Example of portion of aligned RNA sequences

5.3 Comparison with other Coronaviruses

In this part, we choose SARS-CoV, MERS-CoV and SARS-CoV-bat to compare with SARS-CoV-2. SARS-CoV is the abbreviation for severe acute respiratory syndrome coronavirus, which was an outbreak in June, 2003 with a global total of 8098 reported cases and 774 deaths, and a case fatality rate of 9.7%9.7\%. The Middle East respiratory syndrome coronavirus (MERS-CoV) is another deadly coronavirus, which is currently not presenting a pandemic threat, and emerged in 2012, causing 2494 reported cases and 858 deaths in 27 countries and has a very high case fatality rate of 34%34\%. SARS-CoV-bat is SARS-CoV from bats.

We randomly pick 13 RNA sequences of SARS-CoV-2, 5 RNA sequences of SARS-CoV, 5 RNA sequences of SARS-CoV-bat and 5 RNA sequences of MERS-CoV. The tree containing all 28 samples could be built after alignment and distances computing, see figure 9. From the figure 9, one could find that MERS viruses are clustered as one group; SARS and batSARS are mixed with each other; and SARS-2 viruses are generally different from MERS, SARS, and batSARS. To further analyze the relationship of the four viruses, we want to apply the tree spaces methods we introduced in Section 2, to build trees with four leaves and continue to compute mean tree to conclude a result.

Refer to caption
Figure 9: Phylogenetic tree contains SARS-CoV, SARS-CoV-bat, MERS and SARS-CoV2 virus data

To compare the gene information from each coronaviruses, we randomly pick one virus’ RNA sequence from each group(SARS-CoV-2, SARS-CoV, MERS-CoV and SARS-CoV-bat) to build sample phylogenetic trees. The sample tree would be a tree with 4 leaves which contains SARS-CoV-2, SARS, MERS and batSARS.

Refer to caption
Figure 10: The sample phylogenetic tree example

Since all sample trees are in the same orthant (same book leaf of an open book), from CLT on tree spaces we discussed last section, we could apply multivariate normal distribution and compute the sample mean tree.

Refer to caption
Refer to caption
Figure 11: Left:Sample trees on T4T_{4} space, red dot: mean tree Right: The phylogenetic mean tree

From the result, one could determine that the SARS-CoV-2’s RNA sequencing is closer to SARS-CoV’s and batSARS’s. MERS’s RNA sequencing is mostly different to SARS-CoV’s and SARS-CoV-2’s. The inneredge 13.67 and 5.53 represent the expected genetic distance from a virus to a possible common ancestor. eg. the inneredge 13.67 represents the expected genetic distance between MERS and the common ancestor of SARS and SARS-2 is 13.67. Furthermore, based on the theorem introduced in the last section, one could compute a confidence region for the mean tree.

5.4 Comparison with different months and locations on T3T_{3} space

As an RNA virus, the SARS-CoV-2 has high mutation rates, and these high rates are correlated with enhanced virulence and evolvability. Although researchers often assume that natural selection has optimized the mutation rate of RNA viruses from common knowledge, some data show that selection for faster replication is stronger and has a high possibility to make more mistakes. Moreover, mutation rates are evolvable and can respond to selection [15].

Since its emergence in late 2019, SARS-CoV-2 has diversified into several different co-circulating variants. Researchers have grouped the variants into different clades based on specific signature mutations. One of the most significant mutations is the D614G substitution in the gene coding for the “spike” protein. This variant increases infectivity of SARS-CoV-2 in vitro and may have been selected by evolution or on purpose, for increased human-to-human transmission [18].

Before countries closed borders at the early part of the pandemic, one could expect well-mixed SARS-CoV-2 phylogenetic tree structures across different countries and continents. Through March and April, many countries imposed differing types of ’lockdown’ where the movement was restricted, and businesses and schools closed. Those restrictions decreased between-country transmission, making it more likely that different countries have different types of SARS-CoV-2 phylogenetic tree structures. However, with loose restrictions, and local and within-country transmission might make the tree structure more complicated.

Base on the information, we try to apply the phylogenetic tree space idea to analyze the mutations of SARS-CoV-2. We would apply whole RNA sequences to consider all beneficial, neutral, and deleterious mutations occurring in both the coding and non-coding regions. The data is obtained from GISAID[13], which contains 400 SARS-CoV-2 sequences from 10 countries. The data would be set to different groups based on the locations and months.

First, we set 3 groups for each country. The groups are divided by different months: Jan-Mar, Apr-Jun, Jul-Sep. We randomly pick one RNA sequence from the three groups for each location and then make an alignment. We repeat the step 10 times to form aligned sequences samples for comparing viruses for different countries and 20 times for comparing viruses for different continents. From those aligned sequences, we could build trees with 3 leaves. One of the sample trees for each continent are as followed 12,

Refer to caption
Refer to caption
Refer to caption
Figure 12: The sample tree examples from left to right are from Asia, Europe and America

Since our sample trees are on T3T_{3} space, we could visualize all sample trees on a 3-spider, and we could compute the mean tree via methods we introduced in the last section. The figure 13 showed the 20 sample trees for each continent and the corresponding mean trees.

Refer to caption
Figure 13: Sample trees on T3T_{3}, Orange: Asia, Green: Europe, Blue: America

The mean and standard deviation calculated are as followed. We use Newick format to present the tree type: ’a’ represents the sample from group Jan-Mar, ’b’: Apr-Jun, and ’c’: Jul-Sep. From the figure, means are all very close to the origin or at the origin. The results show that the mean phylogenetic trees are close to each other and also close to the star tree, which means the difference between the viruses from different months is tiny. Another reason for mean trees closing to the origin might be the stickiness of the mean introduced in last section.

Tree Type Intrisic Mean Intrinsic Standard Deviation
America (a,b,c) 0 0.259
Asia ((a,c),b) 0.008 0.256
Europe ((b,c),a) 0.02 0.302

We repeat the same steps for countries and have results showed as follows. From the figures below, one can find that mean trees for countries are not close to the origin except those from Asia. The mean tree stays on a particular ’leg’ of the T3T_{3} space could be expressed that there is a specific evolutionary or mutation pattern in that country. However, considering the standard deviations are not small enough to support a confident region locating on a particular ’leg’ of the T3T_{3} space. We still need more evidence to conclude significant results of a specific mutation pattern that occurred.

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 14: Sample trees on T3T_{3}
Tree Type Intrisic Mean Intrinsic Standard Deviation
China ((b,c),a) 0.01 0.022
India ((b,c),a) 0.036 0.343
Japan ((b,c),a) 0.01 0.018
Singapore ((a,c),b) 0.015 0.198
Tree Type Intrisic Mean Intrinsic Standard Deviation
Italy ((b,c),a) 0.066 0.156
Russia ((a,b),c) 0.088 0.274
UK ((b,c),a) 0.084 0.44
Tree Type Intrisic Mean Intrinsic Standard Deviation
Brazil ((a,c),b) 0.008 0.158
Canada ((b,c),a) 0.028 0.322
US ((a,b),c) 0.071 0.238

5.5 Investigating the existence of a mean star 3-Spider for a 3-Spider plot

For this application on the tree space T3T_{3} , we consider CLT for the intrinsic mean on this tree space; according to Theorem 2, if the probability moment on one of the three spider legs of 3-spider does not exceed the sum of the probability moments on the other two spider legs, then the star tree (center of the spider) is the intrinsic mean . Biologically, this can imply the common ancestor, or indicate lack of significant DNA/RNA mutations. In particular, we will apply . Therefore, our main aim here is to use the constructed T3T_{3} tree space on the 3-spider to find out whether there has been significant RNA/DNA sequences mutations among the specified groups by making sure the overall sample mean for each tree is either non-sticky or sticky one. The overall mean inner edge values for all the three legs of the 3-spider satisfying stickiness (star tree), implies no significant mutations among the sample groups being compared. So the 3-spider plot we got should either be a star spider or a non-star spider.

5.6 Application of stickiness Theorem on SARS-CoV-2 RNA Sequence Data From N. America, Asia, and EU, With Their Specified Countries

Table 1: Stickiness theorem on the three continents’ 3-spider
Legs waνabawbνb=θw_{a}\nu_{a}-\sum_{b\neq a}w_{b}\nu_{b}=\theta θ\theta value Verdict
1
wa^=25/59\hat{w_{a}}=25/59
νa^=2.3938\hat{\nu_{a}}=2.3938
wb1^=16/59\hat{w_{b_{1}}}=16/59
νb1^=2.134\hat{\nu_{b_{1}}}=2.134
wb2^=18/59\hat{w_{b_{2}}}=18/59
νb2^=2.8401\hat{\nu_{b_{2}}}=2.8401
0.4-0.4
2
wa^=16/59\hat{w_{a}}=16/59
νa^=2.1342\hat{\nu_{a}}=2.1342
wb1^=25/59\hat{w_{b_{1}}}=25/59
νb1^=2.3938\hat{\nu_{b_{1}}}=2.3938
wb2^=18/59\hat{w_{b_{2}}}=18/59
νb2^=2.8401\hat{\nu_{b_{2}}}=2.8401
1.3-1.3 Sticky
3
wa^=18/59\hat{w_{a}}=18/59
νa^=2.8401\hat{\nu_{a}}=2.8401
wb1^=25/59\hat{w_{b_{1}}}=25/59
wb1^=2.3938\hat{w_{b_{1}}}=2.3938
wb2^=16/59\hat{w_{b_{2}}}=16/59
νb2^=2.1342\hat{\nu_{b_{2}}}=2.1342
0.7-0.7

According to what the table 1 details in accordance with stickiness theorem 2, all three legs of the 3-spider satisfied the stickiness condition. Examination of the 3-spider plot in figure 13 also seems to confirm this: All the means are clustered close to the center. Thus, the overall intrinsic sample mean for the continental regions of N.America, Asia and EU is the star tree. Therefore, by the third part of Theorem 2, we can conclude that SARS-CoV-2 virus RNA sequences from these regions did not mutate significantly, and thus, are not significantly different.

Table 2: Stickiness theorem on N. American countries’ 3-spider
Legs waνabawbνb=θw_{a}\nu_{a}-\sum_{b\neq a}w_{b}\nu_{b}=\theta θ\theta value Verdict
1
wa^=16/30\hat{w_{a}}=16/30
νa^=1.2474\hat{\nu_{a}}=1.2474
wb1^=7/30\hat{w_{b_{1}}}=7/30
νb1^=0.9424\hat{\nu_{b_{1}}}=0.9424
wb2^=7/30\hat{w_{b_{2}}}=7/30
νb2^=0.9395\hat{\nu_{b_{2}}}=0.9395
0.20.2
2
wa^=7/30\hat{w_{a}}=7/30
νa^=0.9424\hat{\nu_{a}}=0.9424
wb1^=16/30\hat{w_{b_{1}}}=16/30
νb1^=1.2474\hat{\nu_{b_{1}}}=1.2474
wb2^=7/30\hat{w_{b_{2}}}=7/30
νb2^=0.9395\hat{\nu_{b_{2}}}=0.9395
0.7-0.7 Not Sticky
3
wa^=7/30\hat{w_{a}}=7/30
νa^=0.9395\hat{\nu_{a}}=0.9395
wb1^=16/30\hat{w_{b_{1}}}=16/30
wb1^=1.2474\hat{w_{b_{1}}}=1.2474
wb2^=7/30\hat{w_{b_{2}}}=7/30
νb2^=0.9424\hat{\nu_{b_{2}}}=0.9424
0.7-0.7

The figure 14 (bottom) of the 3-spider of the N. American countries suggest there could be some mutation among their SARS-CoV-2 viruses. Going by the theorem 2, the result detailed in the table 2 revealed accordingly that the overall intrinsic mean is non-sticky and there is a high chance it is situated somewhere on leg 1 of the 3-spider. The sample size of 30 is fairly large to say 30(X¯30,IμI)\sqrt{30}(\bar{X}_{30,I}-\mu_{I}) is approximately normally distributed. Obviously, the intrinsic sample mean is not the star tree in this case. So, the SARS-CoV-2 RNA sequences from the countries of Brazil, Canada, and the USA did have some significant mutations among them. Therefore, we can declare, based on the theorem 2, that the SARS-CoV-2 RNA sequences from these N. American countries have some significant difference from each other. Further analysis may throw more light on the scope of this difference.

Table 3: Stickiness theorem on Asian countries’ 3-spider
Legs waνabawbνb=θw_{a}\nu_{a}-\sum_{b\neq a}w_{b}\nu_{b}=\theta θ\theta value Verdict
1
wa^=12/30\hat{w_{a}}=12/30
νa^=0.4853\hat{\nu_{a}}=0.4853
wb1^=7/30\hat{w_{b_{1}}}=7/30
νb1^=1.0976\hat{\nu_{b_{1}}}=1.0976
wb2^=21/30\hat{w_{b_{2}}}=21/30
νb2^=1.5386\hat{\nu_{b_{2}}}=1.5386
1.1-1.1
2
wa^=7/30\hat{w_{a}}=7/30
νa^=1.0976\hat{\nu_{a}}=1.0976
wb1^=12/30\hat{w_{b_{1}}}=12/30
νb1^=0.4853\hat{\nu_{b_{1}}}=0.4853
wb2^=21/30\hat{w_{b_{2}}}=21/30
νb2^=1.5386\hat{\nu_{b_{2}}}=1.5386
1.0-1.0 Not Sticky
3
wa^=21/30\hat{w_{a}}=21/30
νa^=1.5386\hat{\nu_{a}}=1.5386
wb1^=7/30\hat{w_{b_{1}}}=7/30
wb1^=1.0976\hat{w_{b_{1}}}=1.0976
wb2^=12/30\hat{w_{b_{2}}}=12/30
νb2^=0.4853\hat{\nu_{b_{2}}}=0.4853
0.60.6

The 3-spider plot found in figure 14 (top left) suggest there might not be much mutation among the SARS-CoV-2 RNA sequences of the Asian countries of China, India, Japan, and Singapore. However, applying theorem 2, we clearly see a contradiction to this observation as the Table 3 results clearly show. The “net” verdict is that the overall intrinsic mean is not sticky and may possibly fall on leg 3. With a sample size of 30, 30(X¯30,IμI)\sqrt{30}(\bar{X}_{30,I}-\mu_{I}) may be approximated by a normally distribution with mean 0. Again, the intrinsic mean on the 3-spider here is unlikely to be the star tree. Thus, we can conclude by the theorem 2 that the SARS-CoV-2 RNA sequences from these four Asian countries have some significant mutations among them. Further analysis is needed to investigate the exact location and magnitude of these mutations.

Table 4: Stickiness theorem on EU countries’ 3-spider
Legs waνabawbνb=θw_{a}\nu_{a}-\sum_{b\neq a}w_{b}\nu_{b}=\theta θ\theta value Verdict
1
wa^=10/30\hat{w_{a}}=10/30
νa^=1.7743\hat{\nu_{a}}=1.7743
wb1^=6/30\hat{w_{b_{1}}}=6/30
νb1^=0.2151\hat{\nu_{b_{1}}}=0.2151
wb2^=14/30\hat{w_{b_{2}}}=14/30
νb2^=2.5628\hat{\nu_{b_{2}}}=2.5628
0.6-0.6
2
wa^=6/30\hat{w_{a}}=6/30
νa^=0.2151\hat{\nu_{a}}=0.2151
wb1^=10/30\hat{w_{b_{1}}}=10/30
νb1^=1.7743\hat{\nu_{b_{1}}}=1.7743
wb2^=14/30\hat{w_{b_{2}}}=14/30
νb2^=2.5628\hat{\nu_{b_{2}}}=2.5628
1.7-1.7 Not sticky
3
wa^=14/30\hat{w_{a}}=14/30
νa^=2.5628\hat{\nu_{a}}=2.5628
wb1^=10/30\hat{w_{b_{1}}}=10/30
wb1^=1.7743\hat{w_{b_{1}}}=1.7743
wb2^=6/30\hat{w_{b_{2}}}=6/30
νb2^=0.2151\hat{\nu_{b_{2}}}=0.2151
0.60.6

For the EU countries, it is interesting that the 3-spider plot, see figure 14 (top right), and the application of theorem 2, see table 4, all agree that there is some significant difference in the mutations among the SARS-CoV-2 RNA sequences from Italy, Russia, and the UK. Therefore the intrinsic mean tree here is not a star tree, and the overall intrinsic mean may likely fall on leg 3, with 30(X¯30,IμI)\sqrt{30}(\bar{X}_{30,I}-\mu_{I}) being approximately normally distributed. Again, further analysis is needed here to define the scope of this difference.

5.7 Comparison with different months and locations on T4T_{4} space

Next, we want to separate the virus sequences into four groups to form phylogenetic trees with 4-leaves. We use the same data as the last section and make groups: Jan- Mar, Apr-May, Jun-Jul, and Aug-Sep. We randomly pick one RNA sequence from the four groups for each location and then make an alignment. We repeat the step 10 times to form aligned sequences samples for comparing viruses for different countries and 20 times for comparing viruses for different continents. From those aligned sequences, we could build trees with four leaves.

Since the sample trees are in T4T_{4} space, we cannot visualize the whole T4T_{4} space in R3R^{3}. However we could centrally project the points onto the Petersen graph to briefly visualize the samples. The figure below showed the 20 sample trees for each continent and the corresponding sample intrinsic mean trees.

Refer to caption
Figure 15: Sample trees(dots) and mean trees(stars) of different continents on T4T_{4}

We could find that samples are separated in the T4T_{4} space. The mean trees of the EU and America are close to each other rather than the mean tree of Asia. The mean trees of the EU and America show a similar mutation pattern, and there is a big difference with Asia’s. However, we cannot draw any conclusion from the figure since the standard deviation and significant low value of inner edge. The mean tree’s inner edge is low can be explained as the stickiness of the mean, and the sample size is small.

We repeat the same steps for countries and have results showed as follows,

Refer to caption
Figure 16: Sample trees(dots) and mean trees(stars) of different countries on T4T_{4}

The result shows that most sample trees locate on several particular orthants. Some countries share the same tree type and are very close to each other. However, as we mentioned before, we probably need more samples to draw a significant conclusion. As more data introducing, either more samples for each country or more countries added, we would find trends for the virus’ evolutionary and mutation.

References

  • [1] Barden, Dennis; Le, Huiling; Owen, Megan (2013). Central limit theorems for Fréchet means in the space of phylogenetic trees. Electron. J. Probab. 18, no. 25, 25 pp.
  • [2] Barden, Dennis; Le, Huiling (2018a). The logarithm map, its limits and Fréchet means in orthant spaces. Proc. Lond. Math. Soc. 117 , 751-–789.
  • [3] Barden, Dennis; Le, Huiling; Owen, Megan (2018). Limiting behaviour of Fréchet means in the space of phylogenetic trees.Ann Inst Stat Math, 70, 99-–129
  • [4] Rabi N. Bhattacharya, Marius Buibas, Ian L. Dryden, Leif A. Ellingson, David Groisser, Harrie Hendriks, Stephan Huckemann, Huiling Le, Xiuwen Liu, James S. Marron, Daniel E. Osborne, Vic Patrângenaru, Armin Schwartzman, Hilary W. Thompson, and Andrew T. A.Wood. (2013) Extrinsic data analysis on sample spaces with a manifold stratification. Advances in Mathematics, Invited Contributions at the Seventh Congress of Romanian Mathematicians, Brasov, 2011, Publishing House of the Romanian Academy (Editors: Lucian Beznea, Vasile Brîzanescu, Marius Iosifescu, Gabriela Marinoschi, Radu Purice and Dan Timotin), pp. 241–252.
  • [5] Basrak, B.(2010) Limit theorems for the inductive mean on metric trees. J. Appl. Prob. 47, 1136-1149.
  • [6] Billera, L., Holmes, S., Vogtmann, K.(2001). Geometry of the space of phylogenetic trees. Adv. Appl. Math. 27, 733-767.
  • [7] Ellingson L., Hendriks H., Patrangenaru V., Valentin P.S. (2014) On the CLT on Low Dimensional Stratified Spaces. Topics in Nonparametric Statistics. Springer Proceedings in Mathematics & Statistics, vol 74. Springer, New York, NY
  • [8] Hendriks H.(2014). Intrinsic means on graphs. private communication.
  • [9] Hotz, T., Huckemann, S., Le, H., Marron, J. S., Mattingly, J.C., Miller, E., Nolen, J., Owen, M,. Patrangenaru, V., and Skwerer, S.(2013). Sticky central limit theorems on open books.Ann. of Appl. Prob. 23, no. 6, 2238-2258.
  • [10] Patrangenaru, V. and Ellingson, L. E. (2015). Nonparametric Statistics on Manifolds and their Applications to Object Data Analysis. CRC.
  • [11] Saitou, N. and Nei, M. (1987). The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 4(4):406-425.
  • [12] Sievers F., Wilm A., Dineen D., Gibson T.J., Karplus K., Li W., Lopez R., McWilliam H., Remmert M., So¨\ddot{o}ding J., Thompson J.D. and Higgins D.G. (2011) Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol. Syst. Biol. 7:539
  • [13] Elbe, S., and Buckland-Merrett, G. (2017) Data, disease and diplomacy: GISAID’s innovative contribution to global health. Global Challenges, 1:33-46. DOI:10.1002/gch2.1018 PMCID: 31565258
  • [14] Hatcher EL, Zhdanov SA, Bao Y, Blinkova O, Nawrocki EP, Ostapchuck Y, Scha¨\ddot{a}ffer AA, Brister JR. Nucleic Acids Res. 2017 Jan 4;45(D1):D482-D490. doi: 10.1093/nar/gkw1065. Epub 2016 Nov 28.
  • [15] Duffy, S. (2018) Why are RNA virus mutation rates so damn high? PLoS Biol 16(8): e3000003.
  • [16] Hotz, T.; Huckemann, S. (2015). Intrinsic means on the circle: uniqueness, locus and asymptotics. Ann. Inst. Statist. Math. 67, no. 1, 177–193.
  • [17] Worldometer (2021) Worldometer COVID-19 Data https://www.worldometers.info/coronavirus/
  • [18] B. Korber et al.(2020) Tracking Changes in SARS-CoV-2 Spike: Evidence that D614G Increases Infectivity of the COVID-19 Virus. Cell 192(4):812-827.e19
  • [19] Wu F., et al.(2020) A new coronavirus associated with human respiratory disease in China. Nature 579(7798):265-269.
  • [20] He R., et al.(2004) Analysis of multimerization of the SARS coronavirus nucleocapsid protein. Biochem Biophys Res Commun 2;316(2):476-83.
  • [21] Zaki AM, van Boheemen S, Bestebroer TM, Osterhaus AD, and Fouchier RA.(2012) Isolation of a novel coronavirus from a man with pneumonia in Saudi Arabia. N Engl J Med. 367(19):1814-20.
  • [22] Li W, et al.(2005) Bats are natural reservoirs of SARS-like coronaviruses. Science 310(5748):676-9.
  • [23] Shen, Chen (2021). Topological Data Analysis for Medical Imaging and RNA Data Analysis on Tree Spaces. PhD Dissertation, Florida State University.