
A Matrix Chernoff Bound for Markov Chains and Its Application to Co-occurrence Matrices

Jiezhong Qiu
Tsinghua University
[email protected]
Chi Wang
Microsoft Research, Redmond
[email protected]
Ben Liao
Tencent Quantum Lab
[email protected]
Richard Peng
Georgia Tech
[email protected]
Jie Tang
Tsinghua University
[email protected]
Abstract

We prove a Chernoff-type bound for sums of matrix-valued random variables sampled via a regular (aperiodic and irreducible) finite Markov chain. Specifically, consider a random walk on a regular Markov chain and a Hermitian matrix-valued function on its state space. Our result gives exponentially decreasing bounds on the tail distributions of the extreme eigenvalues of the sample mean matrix. Our proof is based on the matrix expander (regular undirected graph) Chernoff bound [Garg et al. STOC '18] and scalar Chernoff-Hoeffding bounds for Markov chains [Chung et al. STACS '12].

Our matrix Chernoff bound for Markov chains can be applied to analyze the behavior of co-occurrence statistics for sequential data, which have been common and important data signals in machine learning. We show that given a regular Markov chain with $n$ states and mixing time $\tau$, we need a trajectory of length $O(\tau(\log{n}+\log{\tau})/{\epsilon}^{2})$ to achieve an estimator of the co-occurrence matrix with error bound $\epsilon$. We conduct several experiments and the experimental results are consistent with the exponentially fast convergence rate from theoretical analysis. Our result gives the first bound on the convergence rate of the co-occurrence matrix and the first sample complexity analysis in graph representation learning.

1 Introduction

The Chernoff bound [5], which gives exponentially decreasing bounds on tail distributions of sums of independent scalar-valued random variables, is one of the most basic and versatile tools in theoretical computer science, with countless applications to practical problems [21, 35]. There are two notable limitations when applying the Chernoff bound to analyze sample complexity in real-world machine learning problems. First, in many cases the random variables are dependent, e.g., Markov dependence [20] in MCMC [18] and online learning [48]. Second, applications are often concerned with the concentration behavior of quantities beyond scalar-valued random variables, e.g., random features in kernel machines [40] and co-occurrence statistics, which are random matrices [38, 39].

Existing research has attempted to relax one of these two limitations of the original Chernoff bound [19, 11, 27, 24, 53, 14, 6, 41, 52, 42, 1, 50]. Wigderson and Xiao [53] conjectured that Chernoff bounds can be generalized to both matrix-valued random variables and Markov dependence, while restricting the Markov dependence to be a random walk on a regular undirected graph. This conjecture was recently proved by Garg et al. [10], based on a new multi-matrix extension of the Golden-Thompson inequality [45]. However, a regular undirected graph is a special case of Markov chains, one that is reversible and has a uniform stationary distribution, and the result does not apply to practical problems such as random walks on generic graphs. Establishing a Chernoff bound for matrix-valued random variables under more general Markov dependence has remained an open question.

In this work, we establish large deviation bounds for the tail probabilities of the extreme eigenvalues of sums of random matrices sampled via a regular Markov chain (a regular Markov chain is one that is aperiodic and irreducible; this should not be confused with an undirected regular graph, where each vertex has the same number of neighbors, so the term "regular" has different meanings depending on the context) starting from an arbitrary distribution (not necessarily the stationary distribution), which significantly improves the result of Garg et al. [10]. More formally, we prove the following theorem:

Theorem 1 (Markov Chain Matrix Chernoff Bound).

Let ${\bm{P}}$ be a regular Markov chain with state space $[N]$, stationary distribution ${\bm{\pi}}$ and spectral expansion $\lambda$. Let $f:[N]\rightarrow\mathbb{C}^{d\times d}$ be a function such that (1) $\forall v\in[N]$, $f(v)$ is Hermitian and $\left\lVert f(v)\right\rVert_{2}\leq 1$; (2) $\sum_{v\in[N]}\pi_{v}f(v)=0$. Let $(v_{1},\cdots,v_{k})$ denote a $k$-step random walk on ${\bm{P}}$ starting from a distribution ${\bm{\phi}}$. Given ${\epsilon}\in(0,1)$,

\begin{split}\mathbb{P}\left[\lambda_{\max}\left(\frac{1}{k}\sum_{j=1}^{k}f(v_{j})\right)\geq\epsilon\right]&\leq 4\left\lVert{\bm{\phi}}\right\rVert_{{\bm{\pi}}}d^{2}\exp{\left(-({\epsilon}^{2}(1-\lambda)k/72)\right)}\\ \mathbb{P}\left[\lambda_{\min}\left(\frac{1}{k}\sum_{j=1}^{k}f(v_{j})\right)\leq-\epsilon\right]&\leq 4\left\lVert{\bm{\phi}}\right\rVert_{{\bm{\pi}}}d^{2}\exp{\left(-({\epsilon}^{2}(1-\lambda)k/72)\right)}.\end{split}

In the above theorem, $\left\lVert\cdot\right\rVert_{{\bm{\pi}}}$ is the ${\bm{\pi}}$-norm (which we define formally later in Section 2) measuring the distance between the initial distribution ${\bm{\phi}}$ and the stationary distribution ${\bm{\pi}}$. Our strategy is to incorporate the concentration of matrix-valued functions from [10] into the study of general Markov chains from [6], which was originally for scalars.
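As a quick sanity check on the statement (not part of the original analysis), the following minimal simulation sketch uses a toy two-state chain and a matrix-valued function of our own choosing; the empirical tail probability of $\lambda_{\max}$ of the sample mean shrinks rapidly as the walk length $k$ grows, as the bound predicts.

```python
# A toy simulation of Theorem 1 (our own example): 2-state regular chain,
# mean-zero symmetric f with ||f(v)||_2 <= 1; the tail probability of
# lambda_max of the sample mean decays quickly in the walk length k.
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])                    # a regular 2-state chain
pi = np.array([2 / 3, 1 / 3])                 # its stationary distribution
F = [np.diag([0.5, -0.5]),                    # f(0), f(1): symmetric, norm <= 1,
     np.diag([-1.0, 1.0])]                    # chosen so the pi-weighted sum is 0

def lambda_max_of_sample_mean(k):
    v = rng.choice(2, p=pi)                   # start from stationarity for simplicity
    S = np.zeros((2, 2))
    for _ in range(k):
        S += F[v]
        v = rng.choice(2, p=P[v])
    return np.linalg.eigvalsh(S / k).max()

for k in [100, 1000, 10000]:
    trials = [lambda_max_of_sample_mean(k) for _ in range(200)]
    print(k, np.mean([t >= 0.2 for t in trials]))  # empirical P[lambda_max >= 0.2]
```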

1.1 Applications to Co-occurrence Matrices of Markov Chains

Algorithm 1: The Co-occurrence Matrix
  Input: sequence $(v_{1},\cdots,v_{L})$ with $v_{i}\in[n]$, $i\in[L]$; window size $T$
  Output: co-occurrence matrix ${\bm{C}}$
  ${\bm{C}}\leftarrow\bm{0}_{n\times n}$
  for $i=1,2,\ldots,L-T$ do
    for $r=1,\ldots,T$ do
      ${\bm{C}}_{v_{i},v_{i+r}}\leftarrow{\bm{C}}_{v_{i},v_{i+r}}+1/T$
      ${\bm{C}}_{v_{i+r},v_{i}}\leftarrow{\bm{C}}_{v_{i+r},v_{i}}+1/T$
  ${\bm{C}}\leftarrow\frac{1}{2(L-T)}{\bm{C}}$
  return ${\bm{C}}$

Co-occurrence statistics have recently emerged as common and important data signals in machine learning, providing rich correlation and clustering information about the underlying object space, such as word co-occurrence in natural language processing [32, 33, 34, 26, 37], vertex co-occurrence in graph learning [38, 46, 12, 13, 7, 39], item co-occurrence in recommender systems [44, 28, 2, 51, 29], action co-occurrence in reinforcement learning [49], and emission co-occurrence in hidden Markov models [23, 17, 30]. Given a sequence of objects $(v_{1},\cdots,v_{L})$, the co-occurrence statistics are computed by moving a sliding window of fixed size $T$ over the sequence and recording the frequency of objects' co-occurrence within the sliding window. Pseudocode of the above procedure is listed in Algorithm 1, which produces an $n$ by $n$ co-occurrence matrix, where $n$ is the size of the object space.
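For concreteness, here is a minimal NumPy sketch of Algorithm 1 with uniform step weights; the function name and interface are ours, and states are assumed to be 0-indexed integers in $\{0,\cdots,n-1\}$.

```python
# A direct translation of Algorithm 1 (uniform step weights alpha_r = 1/T).
import numpy as np

def cooccurrence_matrix(seq, n, T):
    """Estimate the n-by-n co-occurrence matrix from one trajectory `seq`."""
    L = len(seq)
    C = np.zeros((n, n))
    for i in range(L - T):                 # slide a window of size T
        for r in range(1, T + 1):
            C[seq[i], seq[i + r]] += 1.0 / T
            C[seq[i + r], seq[i]] += 1.0 / T
    return C / (2 * (L - T))               # symmetric; entries sum to 1
```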

A common assumption when building such co-occurrence matrices is that the sequential data are long enough to provide an accurate estimation. For instance, Mikolov et al. [33] use a news article dataset with one billion words in their Skip-gram model; Tennenholtz and Mannor [49] train their Act2vec model with action sequences from over a million StarCraft II game replays, which are equivalent to 100 years of consecutive gameplay; Perozzi et al. [38] sample large amounts of random walk sequences from graphs to capture the vertex co-occurrence. A recent work by Qiu et al. [39] studies the convergence of co-occurrence matrices of random walks on undirected graphs in the limit (i.e., when the length of the random walk goes to infinity), but leaves the convergence rate as an open problem. It remains unknown whether the co-occurrence statistics are sample efficient, and how efficient they are.

In this paper, we study the situation where the sequential data are sampled from a regular finite Markov chain (i.e., an aperiodic and irreducible finite Markov chain), and derive bounds on the sample efficiency of co-occurrence matrix estimation, specifically on the length of the trajectory needed in the sampling algorithm shown in Algorithm 1. To give a formal statement, we first translate Algorithm 1 to linear algebra language. Given a trajectory $(v_{1},\cdots,v_{L})$ from state space $[n]$ and step weight coefficients $(\alpha_{1},\cdots,\alpha_{T})$, the co-occurrence matrix is defined to be

{\bm{C}}\triangleq\frac{1}{L-T}\sum_{i=1}^{L-T}{\bm{C}}_{i},\text{ where }{\bm{C}}_{i}\triangleq\sum_{r=1}^{T}\frac{\alpha_{r}}{2}\left({\bm{e}}_{v_{i}}{\bm{e}}^{\top}_{v_{i+r}}+{\bm{e}}_{v_{i+r}}{\bm{e}}^{\top}_{v_{i}}\right).

Here ${\bm{C}}_{i}$ accounts for the co-occurrence within the sliding window $(v_{i},\cdots,v_{i+T})$, and ${\bm{e}}_{v_{i}}$ is a length-$n$ vector with a one in its $v_{i}$-th entry and zeros elsewhere. Thus ${\bm{e}}_{v_{i}}{\bm{e}}^{\top}_{v_{i+r}}$ is an $n$ by $n$ matrix whose $(v_{i},v_{i+r})$-th entry is one and all other entries are zero, which records the co-occurrence of $v_{i}$ and $v_{i+r}$. Note that Algorithm 1 is the special case with uniform step weight coefficients, i.e., $\alpha_{r}=1/T$, $r\in[T]$, and the co-occurrence statistics in all the applications mentioned above can be formalized in this way. When the trajectory $(v_{1},\cdots,v_{L})$ is a random walk on a regular Markov chain ${\bm{P}}$ with stationary distribution ${\bm{\pi}}$, the asymptotic expectation of the co-occurrence matrix within the sliding window $(v_{i},\cdots,v_{i+T})$ is

\mathbb{AE}[{\bm{C}}_{i}]\triangleq\lim_{i\to\infty}\mathbb{E}({\bm{C}}_{i})=\sum_{r=1}^{T}\frac{\alpha_{r}}{2}\left({\bm{\Pi}}{\bm{P}}^{r}+\left({\bm{\Pi}}{\bm{P}}^{r}\right)^{\top}\right),

where ${\bm{\Pi}}\triangleq\operatorname{diag}({\bm{\pi}})$. Thus the asymptotic expectation of the co-occurrence matrix is

\mathbb{AE}[{\bm{C}}]\triangleq\lim_{L\to\infty}\mathbb{E}\left[{\bm{C}}\right]=\lim_{L\to\infty}\frac{1}{L-T}\sum_{i=1}^{L-T}\mathbb{E}({\bm{C}}_{i})=\sum_{r=1}^{T}\frac{\alpha_{r}}{2}\left({\bm{\Pi}}{\bm{P}}^{r}+\left({\bm{\Pi}}{\bm{P}}^{r}\right)^{\top}\right). \quad (1)
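A short NumPy sketch of Equation (1), with helper names of our own; this is also how the limiting case in Section 5 can be computed for small chains.

```python
# Asymptotic expectation AE[C] = sum_r (alpha_r / 2) (Pi P^r + (Pi P^r)^T).
import numpy as np

def asymptotic_expectation(P, pi, alphas):
    Pi = np.diag(pi)
    AE = np.zeros_like(P)
    P_r = np.eye(P.shape[0])
    for alpha_r in alphas:                 # r = 1, ..., T
        P_r = P_r @ P                      # running power P^r
        M = Pi @ P_r
        AE += alpha_r / 2 * (M + M.T)      # symmetrize Pi P^r
    return AE
```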

Our main result regarding the estimation of the co-occurrence matrix is the following convergence bound related to the length of the walk sampled.

Theorem 2 (Convergence Rate of Co-occurrence Matrices).

Let ${\bm{P}}$ be a regular Markov chain with state space $[n]$, stationary distribution ${\bm{\pi}}$ and mixing time $\tau$. Let $(v_{1},\cdots,v_{L})$ denote an $L$-step random walk on ${\bm{P}}$ starting from a distribution ${\bm{\phi}}$ on $[n]$. Given step weight coefficients $(\alpha_{1},\cdots,\alpha_{T})$ s.t. $\sum_{r=1}^{T}\left\lvert\alpha_{r}\right\rvert=1$, and ${\epsilon}\in(0,1)$, the probability that the co-occurrence matrix ${\bm{C}}$ deviates from its asymptotic expectation $\mathbb{AE}[{\bm{C}}]$ (in 2-norm) is bounded by:

\mathbb{P}\left[\left\lVert{\bm{C}}-\mathbb{AE}[{\bm{C}}]\right\rVert_{2}\geq{\epsilon}\right]\leq 2\left(\tau+T\right)\left\lVert{\bm{\phi}}\right\rVert_{{\bm{\pi}}}n^{2}\exp{\left(-\frac{{\epsilon}^{2}(L-T)}{576\left(\tau+T\right)}\right)}.

In particular, there exists a trajectory length $L=O\left((\tau+T)(\log{n}+\log{(\tau+T)})/{\epsilon}^{2}+T\right)$ such that $\mathbb{P}\left[\left\lVert{\bm{C}}-\mathbb{AE}[{\bm{C}}]\right\rVert_{2}\geq\epsilon\right]\leq\frac{1}{n^{O(1)}}$. Assuming $T=O(1)$ gives $L=O\left(\tau(\log{n}+\log{\tau})/{\epsilon}^{2}\right)$.

Our result in Theorem 2 gives the first sample complexity analysis for many graph representation learning algorithms. Given a graph, these algorithms aim to learn a function from the vertices to a low-dimensional vector space. Most of them (e.g., DeepWalk [38], node2vec [12], metapath2vec [7], GraphSAGE [13]) consist of two steps. The first step is to draw random sequences from a stochastic process defined on the graph and then count co-occurrence statistics from the sampled sequences, where the stochastic process is usually defined to be a first-order or higher-order random walk on the graph. The second step is to train a model to fit the co-occurrence statistics. For example, DeepWalk can be viewed as factorizing a pointwise mutual information matrix [26, 39], which is a transformation of the co-occurrence matrix; GraphSAGE fits the co-occurrence statistics with a graph neural network [22]. The common assumption is that there are enough samples so that the co-occurrence statistics are accurately estimated. Ours is the first work to study the sample complexity of the aforementioned algorithms. Theorem 2 implies that these algorithms need $O(\tau(\log{n}+\log{\tau})/{\epsilon}^{2})$ samples to achieve a good estimator of the co-occurrence matrix.

Previous work Hsu et al. [16, 15] study a similar problem. They leverage the co-occurrence matrix with $T=1$ to estimate the mixing time of a reversible Markov chain from a single trajectory. Their main technique is a blocking technique [55], which parallels the Markov chain matrix Chernoff bound used in this work. Our work is also related to research on random-walk matrix polynomial sparsification when the Markov chain ${\bm{P}}$ is a random walk on an undirected graph. In this case, we can write ${\bm{P}}={\bm{D}}^{-1}{\bm{A}}$, where ${\bm{D}}$ and ${\bm{A}}$ are the degree matrix and adjacency matrix of an undirected graph with $n$ vertices and $m$ edges, and the expected co-occurrence matrix in Equation 1 simplifies to $\mathbb{AE}\left[{\bm{C}}\right]=\frac{1}{\operatorname{vol}{(G)}}\sum_{r=1}^{T}\alpha_{r}{\bm{D}}({\bm{D}}^{-1}{\bm{A}})^{r}$ (the volume of a graph $G$ is defined to be $\operatorname{vol}{(G)}\triangleq\sum_{i}\sum_{j}{\bm{A}}_{ij}$), which is known as a random-walk matrix polynomial [3, 4]. Cheng et al. [4] propose an algorithm which needs $O(T^{2}m\log{n}/{\epsilon}^{2})$ steps of random walk to construct an ${\epsilon}$-approximator for the random-walk matrix polynomial. Our bound in Theorem 2 is stronger than the bound of Cheng et al. [4] when the Markov chain ${\bm{P}}$ mixes fast. Moreover, Cheng et al. [4] require $\alpha_{r}$ to be non-negative, while our bound can handle negative step weight coefficients.

Organization The rest of the paper is organized as follows. In Section 2 we provide preliminaries, followed by the proof of matrix Chernoff bound in Section 3 and the proof of convergence rate of co-occurrence matrices in Section 4. In Section 5, we conduct experiments on both synthetic and real-world datasets. Finally, we conclude this work in Section 6.

2 Preliminaries

In this paper, we denote by ${\bm{P}}$ a finite Markov chain on $n$ states. ${\bm{P}}$ may refer to either the chain itself or the corresponding transition probability matrix, an $n$ by $n$ matrix whose entry ${\bm{P}}_{ij}$ indicates the probability that state $i$ moves to state $j$. A Markov chain is called ergodic if it is possible to eventually get from every state to every other state with positive probability. A Markov chain is regular if some power of its transition matrix has all strictly positive entries. A regular Markov chain must be ergodic, but not vice versa. An ergodic Markov chain has a unique stationary distribution, i.e., there exists a unique probability vector ${\bm{\pi}}$ such that ${\bm{\pi}}^{\top}={\bm{\pi}}^{\top}{\bm{P}}$. For convenience, we denote ${\bm{\Pi}}\triangleq\operatorname{diag}({\bm{\pi}})$.

The time that a regular Markov chain needs to become "close" to its stationary distribution is called the mixing time (we need the Markov chain to be regular for the mixing time to be well-defined; for an ergodic Markov chain which could be periodic, the mixing time may be ill-defined). Let ${\bm{x}}$ and ${\bm{y}}$ be two probability vectors. The total variation distance between them is $\left\lVert{\bm{x}}-{\bm{y}}\right\rVert_{TV}\triangleq\frac{1}{2}\left\lVert{\bm{x}}-{\bm{y}}\right\rVert_{1}$. For $\delta>0$, the $\delta$-mixing time of a regular Markov chain ${\bm{P}}$ is $\tau({\bm{P}})\triangleq\min\left\{t:\max_{{\bm{x}}}\left\lVert({\bm{x}}^{\top}{\bm{P}}^{t})^{\top}-{\bm{\pi}}\right\rVert_{TV}\leq\delta\right\}$, where ${\bm{x}}$ ranges over arbitrary probability vectors.

The stationary distribution ${\bm{\pi}}$ also defines an inner product space, where the inner product (under the ${\bm{\pi}}$-kernel) is defined as $\langle{\bm{x}},{\bm{y}}\rangle_{{\bm{\pi}}}\triangleq{\bm{y}}^{\ast}{\bm{\Pi}}^{-1}{\bm{x}}$ for all ${\bm{x}},{\bm{y}}\in\mathbb{C}^{N}$, where ${\bm{y}}^{\ast}$ is the conjugate transpose of ${\bm{y}}$. The naturally induced norm is $\left\lVert{\bm{x}}\right\rVert_{{\bm{\pi}}}\triangleq\sqrt{\langle{\bm{x}},{\bm{x}}\rangle_{{\bm{\pi}}}}$. We can then define the spectral expansion $\lambda({\bm{P}})$ of a Markov chain ${\bm{P}}$ [31, 9, 6] as $\lambda({\bm{P}})\triangleq\max_{\langle{\bm{x}},{\bm{\pi}}\rangle_{{\bm{\pi}}}=0,{\bm{x}}\neq 0}\frac{\left\lVert\left({\bm{x}}^{\ast}{\bm{P}}\right)^{\ast}\right\rVert_{{\bm{\pi}}}}{\left\lVert{\bm{x}}\right\rVert_{{\bm{\pi}}}}$. The spectral expansion $\lambda({\bm{P}})$ is a measure of the mixing time of a Markov chain: the smaller $\lambda({\bm{P}})$ is, the faster the Markov chain converges to its stationary distribution [54]. If ${\bm{P}}$ is reversible, $\lambda({\bm{P}})$ is simply the second largest absolute eigenvalue of ${\bm{P}}$ (the largest is always $1$). The irreversible case is more complicated, since ${\bm{P}}$ may have complex eigenvalues; in this case, $\lambda({\bm{P}})$ is the square root of the second largest absolute eigenvalue of the multiplicative reversiblization of ${\bm{P}}$ [9]. When ${\bm{P}}$ is clear from the context, we simply write $\tau$ and $\lambda$ for $\tau({\bm{P}})$ and $\lambda({\bm{P}})$, respectively. We also refer to $1-\lambda({\bm{P}})$ as the spectral gap of ${\bm{P}}$.
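Both quantities can be computed numerically for a small chain. The sketch below (helper names ours) uses the change of variables ${\bm{x}}\mapsto{\bm{\Pi}}^{-1/2}{\bm{x}}$, under which the spectral expansion becomes the spectral norm of ${\bm{\Pi}}^{1/2}{\bm{P}}{\bm{\Pi}}^{-1/2}$ restricted to the complement of $\sqrt{{\bm{\pi}}}$ (one standard way to evaluate the definition above), and evaluates the $\delta$-mixing time by brute force, using the fact that the maximum over starting distributions is attained at point masses.

```python
# Numerical sketches of the spectral expansion and the delta-mixing time.
import numpy as np

def spectral_expansion(P, pi):
    q = np.sqrt(pi)                            # invariant direction after rescaling
    A = np.diag(q) @ P @ np.diag(1.0 / q)      # Pi^{1/2} P Pi^{-1/2}
    proj = np.eye(len(pi)) - np.outer(q, q)    # project out the sqrt(pi) direction
    return np.linalg.norm(proj @ A @ proj, 2)  # largest remaining singular value

def mixing_time(P, pi, delta=0.125):
    t, Pt = 1, P.copy()                        # row i of Pt is e_i^T P^t
    while 0.5 * np.abs(Pt - pi).sum(axis=1).max() > delta:
        Pt, t = Pt @ P, t + 1
    return t
```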

3 Matrix Chernoff Bounds for Markov Chains

This section provides a brief overview of our proof of the Markov chain matrix Chernoff bound. We start from a simpler version which only considers real-valued symmetric matrices, as stated in Theorem 3 below. Then we extend it to complex-valued Hermitian matrices, as stated in Theorem 1.

Theorem 3 (A Real-Valued Version of Theorem 1).

Let ${\bm{P}}$ be a regular Markov chain with state space $[N]$, stationary distribution ${\bm{\pi}}$ and spectral expansion $\lambda$. Let $f:[N]\rightarrow\mathbb{R}^{d\times d}$ be a function such that (1) $\forall v\in[N]$, $f(v)$ is symmetric and $\left\lVert f(v)\right\rVert_{2}\leq 1$; (2) $\sum_{v\in[N]}\pi_{v}f(v)=0$. Let $(v_{1},\cdots,v_{k})$ denote a $k$-step random walk on ${\bm{P}}$ starting from a distribution ${\bm{\phi}}$ on $[N]$. Then given ${\epsilon}\in(0,1)$,

\begin{split}\mathbb{P}\left[\lambda_{\max}\left(\frac{1}{k}\sum_{j=1}^{k}f(v_{j})\right)\geq\epsilon\right]&\leq\left\lVert{\bm{\phi}}\right\rVert_{{\bm{\pi}}}d^{2}\exp{\left(-({\epsilon}^{2}(1-\lambda)k/72)\right)}\\ \mathbb{P}\left[\lambda_{\min}\left(\frac{1}{k}\sum_{j=1}^{k}f(v_{j})\right)\leq-\epsilon\right]&\leq\left\lVert{\bm{\phi}}\right\rVert_{{\bm{\pi}}}d^{2}\exp{\left(-({\epsilon}^{2}(1-\lambda)k/72)\right)}.\end{split}

Due to space constraints, we defer the full proof to Section B in the supplementary material and instead present a sketch here. By symmetry, we only discuss bounding $\lambda_{\max}$ here. Using the exponential method, the probability in Theorem 3 can be upper bounded for any $t>0$ by:

\begin{split}\mathbb{P}\left[\lambda_{\max}\left(\frac{1}{k}\sum_{j=1}^{k}f(v_{j})\right)\geq\epsilon\right]\leq\mathbb{P}\left[\operatorname{Tr}{\left[\exp{\left(t\sum_{j=1}^{k}f(v_{j})\right)}\right]}\geq\exp{(tk\epsilon)}\right]\leq\frac{\mathbb{E}\left[\operatorname{Tr}{\left[\exp{\left(t\sum_{j=1}^{k}f(v_{j})\right)}\right]}\right]}{\exp{(tk\epsilon)}},\end{split}

where the first inequality follows from the tail bounds for eigenvalues (see Proposition 3.2.1 in Tropp [50]), which control the tail probabilities of the extreme eigenvalues of a random matrix by a bound on the trace of the matrix moment generating function, and the second inequality follows from Markov's inequality. The RHS of the above equation is the expected trace of the exponential of a sum of matrices (i.e., the $tf(v_{j})$'s). When $f$ is a scalar-valued function, we can easily rewrite the exponential of a sum as a product of exponentials (since $\exp(a+b)=\exp(a)\exp(b)$ for scalars). However, this is not true for matrices (a small numerical illustration follows below). To bound the expectation term, we invoke the multi-matrix Golden-Thompson inequality from [10] stated next, by letting ${\bm{H}}_{j}=tf(v_{j})$, $j\in[k]$.
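The tiny example below (two non-commuting Hermitian matrices of our own choosing) shows concretely that the traces of $\exp(A+B)$ and $\exp(A)\exp(B)$ differ; the classical two-matrix Golden-Thompson inequality relates them, but it does not extend to products of three or more matrices, which is why the averaged, complex-rotated product form of the theorem below is needed for $k$ matrices.

```python
# exp(A + B) != exp(A) exp(B) for non-commuting Hermitian matrices (toy example).
import numpy as np
from scipy.linalg import expm

A = np.array([[0.0, 1.0], [1.0, 0.0]])   # Pauli-X
B = np.array([[1.0, 0.0], [0.0, -1.0]])  # Pauli-Z; A and B do not commute

print(np.trace(expm(A + B)))             # 2*cosh(sqrt(2)) ~ 4.36
print(np.trace(expm(A) @ expm(B)))       # ~ 4.76, a larger value here, consistent
                                         # with Golden-Thompson:
                                         # Tr exp(A+B) <= Tr[exp(A) exp(B)]
```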

Theorem 4 (Multi-matrix Golden-Thompson Inequality, Theorem 1.5 in [10]).

Let ${\bm{H}}_{1},\cdots,{\bm{H}}_{k}$ be $k$ Hermitian matrices. Then, for some probability distribution $\mu$ on $[-\frac{\pi}{2},\frac{\pi}{2}]$,

\log{\left(\operatorname{Tr}{\left[\exp{\left(\sum_{j=1}^{k}{\bm{H}}_{j}\right)}\right]}\right)}\leq\frac{4}{\pi}\int_{-\frac{\pi}{2}}^{\frac{\pi}{2}}\log{\left(\operatorname{Tr}{\left[\prod_{j=1}^{k}\exp{\left(\frac{e^{{\mathrm{i}}\phi}}{2}{\bm{H}}_{j}\right)}\prod_{j=k}^{1}\exp{\left(\frac{e^{-{\mathrm{i}}\phi}}{2}{\bm{H}}_{j}\right)}\right]}\right)}d\mu(\phi).

The key point of this theorem is to relate the exponential of a sum of matrices to a product of matrix exponentials and their adjoints, whose trace can be further bounded via the following lemma by letting $e^{{\mathrm{i}}\phi}=\gamma+{\mathrm{i}}b$.

Lemma 1 (Analogous to Lemma 4.3 in [10]).

Let ${\bm{P}}$ be a regular Markov chain with state space $[N]$ and spectral expansion $\lambda$. Let $f:[N]\rightarrow\mathbb{R}^{d\times d}$ be a function such that (1) $\sum_{v\in[N]}\pi_{v}f(v)=0$; (2) $\left\lVert f(v)\right\rVert_{2}\leq 1$ and $f(v)$ is symmetric, $v\in[N]$. Let $(v_{1},\cdots,v_{k})$ denote a $k$-step random walk on ${\bm{P}}$ starting from a distribution ${\bm{\phi}}$ on $[N]$. Then for any $t>0,\gamma\geq 0,b>0$ such that $t^{2}(\gamma^{2}+b^{2})\leq 1$ and $t\sqrt{\gamma^{2}+b^{2}}\leq\frac{1-\lambda}{4\lambda}$, we have

\mathbb{E}\left[\operatorname{Tr}\left[\prod_{j=1}^{k}\exp{\left(\frac{tf(v_{j})(\gamma+{\mathrm{i}}b)}{2}\right)}\prod_{j=k}^{1}\exp{\left(\frac{tf(v_{j})(\gamma-{\mathrm{i}}b)}{2}\right)}\right]\right]\leq\left\lVert{\bm{\phi}}\right\rVert_{{\bm{\pi}}}d\exp{\left(kt^{2}(\gamma^{2}+b^{2})\left(1+\frac{8}{1-\lambda}\right)\right)}.

Proving Lemma 1 is the technical core of our paper. The main idea is to write the expected trace expression on the LHS of Lemma 1 in terms of the transition probability matrix ${\bm{P}}$, which allows for a recursive analysis tracking how much the expected trace expression changes as a function of $k$. The analysis relies on incorporating the concentration of matrix-valued functions from [10] into the study of general Markov chains from [6], which was originally for scalars. Key to this extension is the definition of an inner product related to the stationary distribution ${\bm{\pi}}$ of ${\bm{P}}$, and of a spectral expansion with respect to such inner products. In contrast, the undirected regular graph case studied in [10] can be handled using the standard inner product, as well as the second largest eigenvalue of ${\bm{P}}$ instead of the spectral expansion. Detailed proofs of Theorem 3 and Lemma 1 can be found in Appendix B.2 and Appendix B.3 of the supplementary material, respectively.

Our result for real-valued matrices can be further generalized to complex-valued matrices, as stated in Theorem 1. Our main strategy is to adopt a complexification technique [8], which first relates the eigenvalues of a $d\times d$ complex Hermitian matrix to those of a $2d\times 2d$ real symmetric matrix, and then handles the real symmetric matrix using Theorem 3. The proof of Theorem 1 is deferred to Appendix B.4 in the supplementary material.

4 Convergence Rate of Co-occurrence Matrices of Markov Chains

In this section, we first apply the matrix Chernoff bound for regular Markov chains from Theorem 3 to obtain our main result on the convergence of co-occurrence matrix estimation, as stated in Theorem 2, and then discuss its generalization to hidden Markov models in Corollary 1. Informally, our result in Theorem 2 states that if the mixing time of the Markov chain ${\bm{P}}$ is $\tau$, then the length of a trajectory needed to guarantee an additive error (in 2-norm) of $\epsilon$ is roughly $O\left((\tau+T)(\log{n}+\log{(\tau+T)})/{\epsilon}^{2}+T\right)$, where $T$ is the co-occurrence window size. However, we cannot directly apply the matrix Chernoff bound, because the co-occurrence matrix is not a sum of matrix-valued functions sampled from the original Markov chain ${\bm{P}}$. The main difficulty is to construct the proper Markov chain and matrix-valued function required by Theorem 3. We formally give our proof as follows:

Proof.

(of Theorem 2) Our proof has three main steps: the first two construct a Markov chain ${\bm{Q}}$ from ${\bm{P}}$ and a matrix-valued function $f$ such that the sum of matrix-valued random variables sampled via ${\bm{Q}}$ is exactly the error matrix ${\bm{C}}-\mathbb{AE}[{\bm{C}}]$. Then we invoke Theorem 3 on the constructed Markov chain ${\bm{Q}}$ and function $f$ to bound the convergence rate. We give details below.

Step One Given a random walk $(v_{1},\cdots,v_{L})$ on the Markov chain ${\bm{P}}$, we construct a sequence $(X_{1},\cdots,X_{L-T})$ where $X_{i}\triangleq(v_{i},v_{i+1},\cdots,v_{i+T})$, i.e., each $X_{i}$ is a size-$T$ sliding window over $(v_{1},\cdots,v_{L})$. Meanwhile, letting $\mathcal{S}$ be the set of all $T$-step walks on the Markov chain ${\bm{P}}$, we define a new Markov chain ${\bm{Q}}$ on $\mathcal{S}$ such that $\forall(u_{0},\cdots,u_{T}),(w_{0},\cdots,w_{T})\in\mathcal{S}$:

{\bm{Q}}_{(u_{0},\cdots,u_{T}),(w_{0},\cdots,w_{T})}\triangleq\begin{cases}{\bm{P}}_{u_{T},w_{T}}&\text{if }(u_{1},\cdots,u_{T})=(w_{0},\cdots,w_{T-1});\\ 0&\text{otherwise}.\end{cases}
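The construction can be made concrete with the following sketch (helper names are ours), which enumerates the state space $\mathcal{S}$ explicitly; it is exponential in $T$ and is meant only to illustrate the definition of ${\bm{Q}}$.

```python
# Build the sliding-window chain Q on the set S of all T-step walks of P.
import itertools
import numpy as np

def build_Q(P, T):
    n = P.shape[0]
    # states (u_0, ..., u_T) with P[u_0,u_1] * ... * P[u_{T-1},u_T] > 0
    S = [s for s in itertools.product(range(n), repeat=T + 1)
         if all(P[s[i], s[i + 1]] > 0 for i in range(T))]
    index = {s: j for j, s in enumerate(S)}
    Q = np.zeros((len(S), len(S)))
    for s in S:
        for w_T in range(n):                   # only the last coordinate is new
            t = s[1:] + (w_T,)
            if t in index:
                Q[index[s], index[t]] = P[s[-1], w_T]
    return S, Q
```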

The following claim characterizes the properties of ${\bm{Q}}$; its proof is deferred to Appendix A.1 in the supplementary material.

Claim 1 (Properties of ${\bm{Q}}$).

If ${\bm{P}}$ is a regular Markov chain, then ${\bm{Q}}$ satisfies:

  1. ${\bm{Q}}$ is a regular Markov chain with stationary distribution $\sigma_{(u_{0},\cdots,u_{T})}=\pi_{u_{0}}{\bm{P}}_{u_{0},u_{1}}\cdots{\bm{P}}_{u_{T-1},u_{T}}$;

  2. The sequence $(X_{1},\cdots,X_{L-T})$ is a random walk on ${\bm{Q}}$ starting from a distribution ${\bm{\rho}}$ such that $\rho_{(u_{0},\cdots,u_{T})}=\phi_{u_{0}}{\bm{P}}_{u_{0},u_{1}}\cdots{\bm{P}}_{u_{T-1},u_{T}}$, and $\left\lVert{\bm{\rho}}\right\rVert_{{\bm{\sigma}}}=\left\lVert{\bm{\phi}}\right\rVert_{{\bm{\pi}}}$;

  3. $\forall\delta>0$, the $\delta$-mixing times of ${\bm{P}}$ and ${\bm{Q}}$ satisfy $\tau({\bm{Q}})<\tau({\bm{P}})+T$;

  4. $\exists{\bm{P}}$ with $\lambda({\bm{P}})<1$ s.t. the induced ${\bm{Q}}$ has $\lambda({\bm{Q}})=1$, i.e., ${\bm{Q}}$ may have zero spectral gap.

Parts 1 and 2 imply that the sliding windows (i.e., $X_{1},X_{2},\cdots$) correspond to state transitions in a regular Markov chain ${\bm{Q}}$, whose mixing time and spectral expansion are described in Parts 3 and 4. A special case of the above construction, when $T=1$, can be found in Lemma 6.1 of [54].

Step Two We define a matrix-valued function $f:\mathcal{S}\rightarrow\mathbb{R}^{n\times n}$ such that $\forall X=(u_{0},\cdots,u_{T})\in\mathcal{S}$:

f(X)\triangleq\frac{1}{2}\left(\sum_{r=1}^{T}\frac{\alpha_{r}}{2}\left({\bm{e}}_{u_{0}}{\bm{e}}^{\top}_{u_{r}}+{\bm{e}}_{u_{r}}{\bm{e}}^{\top}_{u_{0}}\right)-\sum_{r=1}^{T}\frac{\alpha_{r}}{2}\left({\bm{\Pi}}{\bm{P}}^{r}+\left({\bm{\Pi}}{\bm{P}}^{r}\right)^{\top}\right)\right). \quad (2)

With this definition of $f(X)$, the difference between the co-occurrence matrix ${\bm{C}}$ and its asymptotic expectation $\mathbb{AE}[{\bm{C}}]$ can be written as ${\bm{C}}-\mathbb{AE}[{\bm{C}}]=2\left(\frac{1}{L-T}\sum_{i=1}^{L-T}f(X_{i})\right)$. We can further show the following properties of this function $f$:

Claim 2 (Properties of $f$).

The function $f$ in Equation 2 satisfies (1) $\sum_{X\in\mathcal{S}}\sigma_{X}f(X)=0$; (2) $f(X)$ is symmetric and $\left\lVert f(X)\right\rVert_{2}\leq 1$, $\forall X\in\mathcal{S}$.

This claim verifies that $f$ in Equation 2 satisfies the two conditions on the matrix-valued function in Theorem 3. The proof of Claim 2 is deferred to Appendix A.2 of the supplementary material.
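For reference, a sketch of Equation (2) (reusing the asymptotic_expectation helper sketched after Equation (1), with 0-indexed states) is below; each window is mapped to a symmetric $n\times n$ matrix, and averaging $2f(X_{i})$ over the trajectory recovers ${\bm{C}}-\mathbb{AE}[{\bm{C}}]$.

```python
# The matrix-valued function f of Equation (2) for one sliding window X.
import numpy as np

def f_window(X, P, pi, alphas):
    n = P.shape[0]
    W = np.zeros((n, n))
    for r, alpha_r in enumerate(alphas, start=1):
        W[X[0], X[r]] += alpha_r / 2           # e_{u_0} e_{u_r}^T term
        W[X[r], X[0]] += alpha_r / 2           # its transpose
    # asymptotic_expectation is the helper sketched earlier in Section 1.1
    return 0.5 * (W - asymptotic_expectation(P, pi, alphas))
```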

Step Three The construction in Step Two reveals that the error matrix ${\bm{C}}-\mathbb{AE}[{\bm{C}}]$ can be written as the average of matrix-valued random variables (i.e., the $f(X_{i})$'s) sampled via a regular Markov chain ${\bm{Q}}$. This suggests directly applying Theorem 3. However, note that (1) the error probability in Theorem 3 contains a factor of the spectral gap $(1-\lambda)$; and (2) Part 4 of Claim 1 allows for the existence of a Markov chain ${\bm{P}}$ with $\lambda({\bm{P}})<1$ whose induced Markov chain ${\bm{Q}}$ has $\lambda({\bm{Q}})=1$. So we cannot directly apply Theorem 3 to ${\bm{Q}}$. To address this issue, we utilize the following tighter bound on sub-chains.

Claim 3.

(Claim 3.1 in Chung et al. [6]) Let ${\bm{Q}}$ be a regular Markov chain with $\delta$-mixing time $\tau({\bm{Q}})$, then $\lambda\left({\bm{Q}}^{\tau({\bm{Q}})}\right)\leq\sqrt{2\delta}$. In particular, setting $\delta=\frac{1}{8}$ implies $\lambda({\bm{Q}}^{\tau({\bm{Q}})})\leq\frac{1}{2}$.

The above claim shows that, even though ${\bm{Q}}$ could have zero spectral gap (Part 4 of Claim 1), we can still bound the spectral expansion of ${\bm{Q}}^{\tau({\bm{Q}})}$. We partition $(X_{1},\cdots,X_{L-T})$ into $\tau({\bm{Q}})$ groups (without loss of generality, we assume $L-T$ is a multiple of $\tau({\bm{Q}})$), such that the $i$-th group consists of a sub-chain $(X_{i},X_{i+\tau({\bm{Q}})},X_{i+2\tau({\bm{Q}})},\cdots)$ of length $k\triangleq(L-T)/\tau({\bm{Q}})$. Each sub-chain can be viewed as generated from the Markov chain ${\bm{Q}}^{\tau({\bm{Q}})}$. Applying Theorem 3 to the $i$-th sub-chain, whose starting distribution is ${\bm{\rho}}_{i}\triangleq\left({\bm{Q}}^{\top}\right)^{i-1}{\bm{\rho}}$, we have

\begin{split}\mathbb{P}\left[\lambda_{\max}\left(\frac{1}{k}\sum_{j=1}^{k}f(X_{i+(j-1)\tau({\bm{Q}})})\right)\geq{\epsilon}\right]&\leq\left\lVert{\bm{\rho}}_{i}\right\rVert_{{\bm{\sigma}}}n^{2}\exp{\left(-{\epsilon}^{2}\left(1-\lambda\left({\bm{Q}}^{\tau({\bm{Q}})}\right)\right)k/72\right)}\\ &\leq\left\lVert{\bm{\rho}}_{i}\right\rVert_{{\bm{\sigma}}}n^{2}\exp{\left(-{\epsilon}^{2}k/144\right)}\leq\left\lVert{\bm{\phi}}\right\rVert_{{\bm{\pi}}}n^{2}\exp{\left(-{\epsilon}^{2}k/144\right)},\end{split}

where the last step follows from $\left\lVert{\bm{\rho}}_{i}\right\rVert_{{\bm{\sigma}}}\leq\left\lVert{\bm{\rho}}_{i-1}\right\rVert_{{\bm{\sigma}}}\leq\cdots\leq\left\lVert{\bm{\rho}}_{1}\right\rVert_{{\bm{\sigma}}}=\left\lVert{\bm{\rho}}\right\rVert_{{\bm{\sigma}}}$ and $\left\lVert{\bm{\rho}}\right\rVert_{{\bm{\sigma}}}=\left\lVert{\bm{\phi}}\right\rVert_{{\bm{\pi}}}$ (Part 2 of Claim 1). Together with a union bound across the sub-chains, we obtain:

\begin{split}&\mathbb{P}\left[\lambda_{\max}\left({\bm{C}}-\mathbb{AE}[{\bm{C}}]\right)\geq\epsilon\right]=\mathbb{P}\left[\lambda_{\max}\left(\frac{1}{L-T}\sum_{j=1}^{L-T}f(X_{j})\right)\geq\frac{{\epsilon}}{2}\right]\\ =&\mathbb{P}\left[\lambda_{\max}\left(\frac{1}{\tau({\bm{Q}})}\sum_{i=1}^{\tau({\bm{Q}})}\frac{1}{k}\sum_{j=1}^{k}f(X_{i+(j-1)\tau({\bm{Q}})})\right)\geq\frac{{\epsilon}}{2}\right]\\ \leq&\sum_{i=1}^{\tau({\bm{Q}})}\mathbb{P}\left[\lambda_{\max}\left(\frac{1}{k}\sum_{j=1}^{k}f(X_{i+(j-1)\tau({\bm{Q}})})\right)\geq\frac{{\epsilon}}{2}\right]\leq\tau({\bm{Q}})\left\lVert{\bm{\phi}}\right\rVert_{{\bm{\pi}}}n^{2}\exp{\left(-{\epsilon}^{2}k/576\right)}.\end{split}

The bound on $\lambda_{\min}$ follows similarly. As ${\bm{C}}-\mathbb{AE}[{\bm{C}}]$ is a real symmetric matrix, its 2-norm is its maximum absolute eigenvalue. Therefore, we can use the eigenvalue bounds to bound the overall error probability in terms of the matrix 2-norm:

\begin{split}&\mathbb{P}\left[\left\lVert{\bm{C}}-\mathbb{AE}[{\bm{C}}]\right\rVert_{2}\geq{\epsilon}\right]=\mathbb{P}\left[\lambda_{\max}({\bm{C}}-\mathbb{AE}[{\bm{C}}])\geq{\epsilon}\lor\lambda_{\min}({\bm{C}}-\mathbb{AE}[{\bm{C}}])\leq-{\epsilon}\right]\\ \leq&2\tau({\bm{Q}})n^{2}\left\lVert{\bm{\phi}}\right\rVert_{{\bm{\pi}}}\exp{\left(-{\epsilon}^{2}k/576\right)}\leq 2\left(\tau({\bm{P}})+T\right)\left\lVert{\bm{\phi}}\right\rVert_{{\bm{\pi}}}n^{2}\exp{\left(-\frac{{\epsilon}^{2}(L-T)}{576\left(\tau({\bm{P}})+T\right)}\right)},\end{split}

where the first inequality follows from the union bound, and the second inequality is due to $\tau({\bm{Q}})<\tau({\bm{P}})+T$ (Part 3 of Claim 1). This bound implies that the probability that ${\bm{C}}$ deviates from $\mathbb{AE}[{\bm{C}}]$ can be made arbitrarily small by increasing the sampled trajectory length $L$. In particular, if we want the event $\left\lVert{\bm{C}}-\mathbb{AE}[{\bm{C}}]\right\rVert_{2}\geq{\epsilon}$ to happen with probability smaller than $1/n^{O(1)}$, we need $L=O\left(\left(\tau({\bm{P}})+T\right)\left(\log{n}+\log{\left(\tau({\bm{P}})+T\right)}\right)/{\epsilon}^{2}+T\right)$. If we assume $T=O(1)$, this becomes $L=O\left(\tau({\bm{P}})\left(\log{n}+\log{\tau({\bm{P}})}\right)/{\epsilon}^{2}\right)$. ∎

Our analysis can be extended to hidden Markov models (HMMs), as shown in Corollary 1, and has the potential to address problems raised in [17, 30]. Our strategy is to treat the HMM with observable state space $\mathcal{Y}$ and hidden state space $\mathcal{X}$ as a Markov chain with state space $\mathcal{Y}\times\mathcal{X}$. The detailed proof can be found in Appendix A.3 in the supplementary material.

Corollary 1 (Co-occurrence Matrices of HMMs).

For an HMM with observable states $y_{t}\in\mathcal{Y}$ and hidden states $x_{t}\in\mathcal{X}$, let $P(y_{t}|x_{t})$ be the emission probability and $P(x_{t+1}|x_{t})$ be the hidden state transition probability. Given an $L$-step trajectory of observations from the HMM, $(y_{1},\cdots,y_{L})$, one needs a trajectory of length $L=O(\tau(\log{|\mathcal{Y}|}+\log{\tau})/\epsilon^{2})$ to achieve a co-occurrence matrix within error bound $\epsilon$ with high probability, where $\tau$ is the mixing time of the Markov chain on hidden states.
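A minimal sketch of the joint-chain view behind Corollary 1, in our own notation (hidden transition matrix A with A[x, x'] = P(x'|x) and emission matrix B with B[x, y] = P(y|x)); the resulting chain on the joint space is the object the proof strategy described above works with.

```python
# View an HMM as a Markov chain on the joint space of (hidden, observed) pairs:
# from (x, y), move the hidden state x -> x2 with A[x, x2], then emit y2 with B[x2, y2].
import numpy as np

def hmm_as_markov_chain(A, B):
    nx, ny = B.shape
    P = np.zeros((nx * ny, nx * ny))
    for x in range(nx):
        for y in range(ny):                    # the previously emitted y does not affect the move
            for x2 in range(nx):
                for y2 in range(ny):
                    P[x * ny + y, x2 * ny + y2] = A[x, x2] * B[x2, y2]
    return P
```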

5 Experiments

In this section, we present experiments illustrating the exponentially fast convergence rate of estimating co-occurrence matrices of Markov chains. We conduct experiments on three synthetic Markov chains (Barbell graph, winning streak chain, and random graph) and one real-world Markov chain (BlogCatalog). For each Markov chain and each trajectory length $L$ from the set $\{10^{1},\cdots,10^{7}\}$, we measure the approximation error of the co-occurrence matrix constructed by Algorithm 1 from an $L$-step random walk sampled from the chain. We performed 64 trials for each experiment, and the results are aggregated as an error-bar plot. We set $T=2$ and $\alpha_{r}$ to be uniform unless otherwise mentioned. The relationship between trajectory length $L$ and approximation error $\left\lVert{\bm{C}}-\mathbb{AE}[{\bm{C}}]\right\rVert_{2}$ is shown in Figure 1 (in log-log scale). Across all four datasets, the observed exponentially fast convergence rates match what our bounds predict in Theorem 2. Below we discuss our observations for each of these datasets.

Figure 1: The convergence rate of co-occurrence matrices on (a) the Barbell graph, (b) the winning streak chain, (c) the BlogCatalog graph, and (d) the random graph (in log-log scale). The $x$-axis is the trajectory length $L$ and the $y$-axis is the error $\left\lVert{\bm{C}}-\mathbb{AE}[{\bm{C}}]\right\rVert_{2}$. Each experiment contains 64 trials, and error bars are shown. (Plots omitted.)

Barbell Graphs [43] The Barbell graph is an undirected graph with two cliques connected by a single path. Such graphs' mixing times vary greatly: two cliques of size $k$ connected by a single edge have mixing time $\Theta(k^{2})$, while two size-$k$ cliques connected by a length-$k$ path have mixing time about $\Theta(k^{3})$. We evaluate the convergence rate of co-occurrence matrices on the two graphs mentioned above, each with $100$ vertices. According to our bound, which requires $L=O(\tau(\log{n}+\log{\tau})/{\epsilon}^{2})$, we expect the approximate co-occurrence matrix to converge faster when the path bridging the two cliques is shorter. The experimental results are shown in Figure 1a, and indeed display faster convergence when the path is shorter (since we fix $n=100$, a Barbell graph with clique size 50 has a shorter path connecting the two cliques than the one with clique size 33).

Winning Streak Chains (Section 4.6 of [25]) A winning streak Markov chain has state space $[n]$ and can be viewed as tracking the number of consecutive 'tails' in a sequence of coin flips. Each state transits back to state $1$ with probability $0.5$, and to the next state with probability $0.5$. The $\delta$-mixing time of this chain satisfies $\tau\leq\lceil\log_{2}(1/\delta)\rceil$ and is independent of $n$. This prompted us to choose this chain, as we should expect similar rates of convergence for different values of $n$ according to our bound $L=O(\tau(\log{n}+\log{\tau})/{\epsilon}^{2})$. In our experiment, we compare $n=50$ and $n=100$ and illustrate the results in Figure 1b. As we can see, for each trajectory length $L$, the approximation errors for $n=50$ and $n=100$ are indeed very close.
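For concreteness, a sketch of the winning streak transition matrix following the description above (our construction; we let the top state stay put on a 'tail', which is one common convention):

```python
# Winning streak chain on n states: reset to state 1 w.p. 1/2, advance w.p. 1/2.
import numpy as np

def winning_streak(n):
    P = np.zeros((n, n))
    for i in range(n):
        P[i, 0] = 0.5                       # a 'head' resets the streak
        P[i, min(i + 1, n - 1)] += 0.5      # a 'tail' extends it (capped at the top)
    return P
```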

BlogCatalog Graph [47] is widely used to benchmark graph representation learning algorithms [38, 12, 39]. It is an undirected graph of social relationships of online bloggers with 10,312 vertices and 333,983 edges. The random walk on the BlogCatalog graph has spectral expansion $\lambda\approx 0.57$. Following Levin and Peres [25], we can upper bound its $\frac{1}{8}$-mixing time by $\tau\leq 36$. We choose $T$ from $\{2,4,8\}$ and illustrate the results in Figure 1c. The convergence rate is robust to different values of $T$. Moreover, the variance on BlogCatalog is much smaller than that on other datasets.

We further demonstrate how our result can be used to select parameters for a popular graph representation learning algorithm, DeepWalk [38]. We set the window size $T=10$, which is the default value in DeepWalk. Our bound on the trajectory length $L$ in Theorem 2 (with explicit constant) is $L\geq 576(\tau+T)(3\log{n}+\log{(\tau+T)})/\epsilon^{2}+T$. The error bound $\epsilon$ might be chosen between $0.1$ and $0.01$, which corresponds to $L$ in the range $[8.4\times 10^{7},8.4\times 10^{9}]$. To verify that this is a meaningful range for tuning $L$, we enumerate trajectory lengths $L$ from $\{10^{4},\cdots,10^{10}\}$, estimate the co-occurrence matrix with a single trajectory sampled from BlogCatalog, convert the co-occurrence matrix to the one implicitly factorized by DeepWalk [38, 39], and factorize it with SVD. For comparison, we also provide the result in the limiting case ($L\rightarrow+\infty$), where we directly compute the asymptotic expectation of the co-occurrence matrix according to Equation 1. The limiting case involves computing a matrix polynomial and can be very expensive. For the node classification task, the micro-F1 scores with a 50% training ratio are:

Length $L$ of DeepWalk: $10^{4}$, $10^{5}$, $10^{6}$, $10^{7}$, $10^{8}$, $10^{9}$, $10^{10}$, $+\infty$
Micro-F1 (%): 15.21, 18.31, 26.99, 33.85, 39.12, 41.28, 41.58, 41.82

As we can see, it is reasonable to choose LL in the predicted range.

Random Graph The small variance observed on BlogCatalog leads us to hypothesize that it shares some traits with random graphs. To gather further evidence for this, we estimate the co-occurrence matrices of an Erdős–Rényi random graph for comparison. Specifically, we take a random graph on $100$ vertices where each undirected edge is added independently with probability $0.1$, a.k.a. $G(100,0.1)$. The results in Figure 1d show very similar behavior to the BlogCatalog graph: small variance and robust convergence rates.

6 Conclusion and Future Work

In this paper, we analyze the convergence rate of estimating the co-occurrence matrix of a regular Markov chain. The main technical contribution of our work is to prove a Chernoff-type bound for sums of matrix-valued random variables sampled via a regular Markov chain, and we show that the problem of estimating co-occurrence matrices is a non-trivial application of this Chernoff-type bound. Our results show that, given a regular Markov chain with $n$ states and mixing time $\tau$, we need a trajectory of length $O(\tau(\log{n}+\log{\tau})/{\epsilon}^{2})$ to achieve an estimator of the co-occurrence matrix with error bound $\epsilon$. Our work leads to some natural future questions:

  • Is it a tight bound? Our analysis of the convergence rate of co-occurrence matrices relies on a union bound, which probably gives a loose bound. It would be interesting to shave off the leading factor $\tau$ in the bound, as the mixing time $\tau$ could be large for some Markov chains.

  • What if the construction of the co-occurrence matrix is coupled with a learning algorithm? For example, in word2vec [33], the co-occurrence in each sliding window yields a mini-batch for a logistic matrix factorization model. This problem can be formalized as the convergence of stochastic gradient descent with non-i.i.d. but Markovian random samples.

  • Can we find more applications of the Markov chain matrix Chernoff bound? We believe Theorem 3 could have further applications, e.g., in reinforcement learning [36].

Broader Impact

Our work contributes to the research literature on Chernoff-type bounds and co-occurrence statistics. Chernoff-type bounds have become some of the most important probabilistic results in computer science. Our result generalizes the Chernoff bound to Markov dependence and random matrices. Co-occurrence statistics have emerged as important tools in machine learning. Our work addresses the sample complexity of estimating the co-occurrence matrix. We believe such better theoretical understanding can further clarify the potential and limitations of graph representation learning and reinforcement learning.

Acknowledgments and Disclosure of Funding

We thank Jian Li (IIIS, Tsinghua) and Shengyu Zhang (Tencent Quantum Lab) for motivating this work. Funding in direct support of this work: Jiezhong Qiu and Jie Tang were supported by the National Key R&D Program of China (2018YFB1402600), NSFC for Distinguished Young Scholar (61825602), and NSFC (61836013). Richard Peng was partially supported by NSF grant CCF-1846218. There is no additional revenue related to this work.

References

  • Ahlswede and Winter [2002] Rudolf Ahlswede and Andreas Winter. Strong converse for identification via quantum channels. IEEE Transactions on Information Theory, 2002.
  • Barkan and Koenigstein [2016] Oren Barkan and Noam Koenigstein. Item2vec: neural item embedding for collaborative filtering. In 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP), 2016.
  • Cheng et al. [2015a] Dehua Cheng, Yu Cheng, Yan Liu, Richard Peng, and Shang-Hua Teng. Efficient sampling for gaussian graphical models via spectral sparsification. In COLT ’15, 2015a.
  • Cheng et al. [2015b] Dehua Cheng, Yu Cheng, Yan Liu, Richard Peng, and Shang-Hua Teng. Spectral sparsification of random-walk matrix polynomials. arXiv preprint arXiv:1502.03496, 2015b.
  • Chernoff et al. [1952] Herman Chernoff et al. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics, 1952.
  • Chung et al. [2012] Kai-Min Chung, Henry Lam, Zhenming Liu, and Michael Mitzenmacher. Chernoff-hoeffding bounds for markov chains: Generalized and simplified. In STACS’ 12, 2012.
  • Dong et al. [2017] Yuxiao Dong, Nitesh V Chawla, and Ananthram Swami. metapath2vec: Scalable representation learning for heterogeneous networks. In KDD ’17, 2017.
  • Dongarra et al. [1984] JJ Dongarra, JR Gabriel, DD Koelling, and JH Wilkinson. The eigenvalue problem for hermitian matrices with time reversal symmetry. Linear Algebra and its Applications, 1984.
  • Fill [1991] James Allen Fill. Eigenvalue bounds on convergence to stationarity for nonreversible markov chains, with an application to the exclusion process. The annals of applied probability, 1991.
  • Garg et al. [2018] Ankit Garg, Yin Tat Lee, Zhao Song, and Nikhil Srivastava. A matrix expander chernoff bound. In STOC ’18, 2018.
  • Gillman [1998] David Gillman. A chernoff bound for random walks on expander graphs. SIAM Journal on Computing, 1998.
  • Grover and Leskovec [2016] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In KDD ’16, 2016.
  • Hamilton et al. [2017] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In NeurIPS ’17, 2017.
  • Healy [2008] Alexander D Healy. Randomness-efficient sampling within nc. Computational Complexity, 2008.
  • Hsu et al. [2019] Daniel Hsu, Aryeh Kontorovich, David A Levin, Yuval Peres, Csaba Szepesvári, Geoffrey Wolfer, et al. Mixing time estimation in reversible markov chains from a single sample path. The Annals of Applied Probability, 2019.
  • Hsu et al. [2015] Daniel J Hsu, Aryeh Kontorovich, and Csaba Szepesvári. Mixing time estimation in reversible markov chains from a single sample path. In NeurIPS ’15, 2015.
  • Huang et al. [2018] Kejun Huang, Xiao Fu, and Nicholas Sidiropoulos. Learning hidden markov models from pairwise co-occurrences with application to topic modeling. In ICML ’18, 2018.
  • Jerrum and Sinclair [1996] Mark Jerrum and Alistair Sinclair. The markov chain monte carlo method: an approach to approximate counting and integration. Approximation Algorithms for NP-hard problems, PWS Publishing, 1996.
  • Kahale [1997] Nabil Kahale. Large deviation bounds for markov chains. Combinatorics, Probability and Computing, 1997.
  • Karlin [2014] Samuel Karlin. A first course in stochastic processes. Academic press, 2014.
  • Kearns et al. [1994] Michael J Kearns, Umesh Virkumar Vazirani, and Umesh Vazirani. An introduction to computational learning theory. MIT press, 1994.
  • Kipf and Welling [2017] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR ’17, 2017.
  • Kontorovich et al. [2013] Aryeh Kontorovich, Boaz Nadler, and Roi Weiss. On learning parametric-output hmms. In ICML ’13, 2013.
  • León et al. [2004] Carlos A León, François Perron, et al. Optimal hoeffding bounds for discrete reversible markov chains. The Annals of Applied Probability, 2004.
  • Levin and Peres [2017] David A Levin and Yuval Peres. Markov chains and mixing times, volume 107. American Mathematical Soc., 2017.
  • Levy and Goldberg [2014] Omer Levy and Yoav Goldberg. Neural Word Embedding as Implicit Matrix Factorization. In NeurIPS ’14. 2014.
  • Lezaud [1998] Pascal Lezaud. Chernoff-type bound for finite markov chains. Annals of Applied Probability, 1998.
  • Liang et al. [2016] Dawen Liang, Jaan Altosaar, Laurent Charlin, and David M Blei. Factorization meets the item embedding: Regularizing matrix factorization with item co-occurrence. In RecSys ’16, 2016.
  • Liu et al. [2017] David C Liu, Stephanie Rogers, Raymond Shiau, Dmitry Kislyuk, Kevin C Ma, Zhigang Zhong, Jenny Liu, and Yushi Jing. Related pins at pinterest: The evolution of a real-world recommender system. In WWW ’17, 2017.
  • Mattila et al. [2020] Robert Mattila, Cristian R Rojas, Eric Moulines, Vikram Krishnamurthy, and Bo Wahlberg. Fast and consistent learning of hidden markov models by incorporating non-consecutive correlations. In ICML ’20, 2020.
  • Mihail [1989] Milena Mihail. Conductance and convergence of markov chains-a combinatorial treatment of expanders. In FOCS ’89, 1989.
  • Mikolov et al. [2013a] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In ICLR Workshop ’13, 2013a.
  • Mikolov et al. [2013b] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In NeurIPS’ 13. 2013b.
  • Mikolov et al. [2013c] Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In NAACL ’13, 2013c.
  • Motwani and Raghavan [1995] Rajeev Motwani and Prabhakar Raghavan. Randomized algorithms. Cambridge university press, 1995.
  • Ortner [2020] Ronald Ortner. Regret bounds for reinforcement learning via markov chain concentration. Journal of Artificial Intelligence Research, 2020.
  • Pennington et al. [2014] Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In EMNLP ’14, 2014.
  • Perozzi et al. [2014] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. In KDD ’14, 2014.
  • Qiu et al. [2018] Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Kuansan Wang, and Jie Tang. Network embedding as matrix factorization: Unifying deepwalk, line, pte, and node2vec. In WSDM ’18, 2018.
  • Rahimi and Recht [2008] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In NeurIPS ’08, 2008.
  • Rao and Regev [2017] Shravas Rao and Oded Regev. A sharp tail bound for the expander random sampler. arXiv preprint arXiv:1703.10205, 2017.
  • Rudelson [1999] Mark Rudelson. Random vectors in the isotropic position. Journal of Functional Analysis, 1999.
  • Sauerwald and Zanetti [2019] Thomas Sauerwald and Luca Zanetti. Random walks on dynamic graphs: Mixing times, hitting times, and return probabilities. In ICALP 2019, 2019.
  • Shani et al. [2005] Guy Shani, David Heckerman, and Ronen I Brafman. An mdp-based recommender system. JMLR ’05, 2005.
  • Sutter et al. [2017] David Sutter, Mario Berta, and Marco Tomamichel. Multivariate trace inequalities. Communications in Mathematical Physics, 352(1):37–58, 2017.
  • Tang et al. [2015] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. Line: Large-scale information network embedding. In WWW ’15, 2015.
  • Tang and Liu [2009] Lei Tang and Huan Liu. Relational learning via latent social dimensions. In KDD ’09, 2009.
  • Tekin and Liu [2010] Cem Tekin and Mingyan Liu. Online algorithms for the multi-armed bandit problem with markovian rewards. In 2010 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1675–1682. IEEE, 2010.
  • Tennenholtz and Mannor [2019] Guy Tennenholtz and Shie Mannor. The natural language of actions. In ICML ’19, 2019.
  • Tropp [2015] Joel A Tropp. An introduction to matrix concentration inequalities. arXiv preprint arXiv:1501.01571, 2015.
  • Vasile et al. [2016] Flavian Vasile, Elena Smirnova, and Alexis Conneau. Meta-prod2vec: Product embeddings using side-information for recommendation. In RecSys ’16, 2016.
  • Wagner [2008] Roy Wagner. Tail estimates for sums of variables sampled by a random walk. Combinatorics, Probability and Computing, 2008.
  • Wigderson and Xiao [2005] Avi Wigderson and David Xiao. A randomness-efficient sampler for matrix-valued functions and applications. In FOCS’05, 2005.
  • Wolfer and Kontorovich [2019] Geoffrey Wolfer and Aryeh Kontorovich. Estimating the mixing time of ergodic markov chains. In COLT ’19, 2019.
  • Yu [1994] Bin Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, pages 94–116, 1994.

Supplementary Material of A Matrix Chernoff Bound for Markov Chains and Its Application to Co-occurrence Matrices

Appendix A Convergence Rate of Co-occurrence Matrices

A.1 Proof of Claim 1

Claim 1 (restated).

Proof.

We prove the four parts of this claim one by one.

Part 1 To prove that ${\bm{Q}}$ is regular, it is sufficient to show that $\exists N_{1}$ such that $\forall n_{1}>N_{1}$, $(v_{0},\cdots,v_{T})$ can reach $(u_{0},\cdots,u_{T})$ in $n_{1}$ steps. We know ${\bm{P}}$ is a regular Markov chain, so there exists $N_{2}\geq T$ s.t. for any $n_{2}\geq N_{2}$, $v_{T}$ can reach $u_{0}$ in exactly $n_{2}$ steps, i.e., there is an $n_{2}$-step walk $(v_{T},w_{1},\cdots,w_{n_{2}-1},u_{0})$ on ${\bm{P}}$. This induces an $n_{2}$-step walk on ${\bm{Q}}$ from $(v_{0},\cdots,v_{T})$ to $(w_{n_{2}-T+1},\cdots,w_{n_{2}-1},u_{0})$. Taking $T$ further steps, we can reach $(u_{0},\cdots,u_{T})$, so we have constructed an $(n_{1}=n_{2}+T)$-step walk from $(v_{0},\cdots,v_{T})$ to $(u_{0},\cdots,u_{T})$. Since this is true for any $n_{2}\geq N_{2}$, any state can be reached from any other state in any number of steps greater than or equal to $N_{1}=N_{2}+T$. Next, we verify that ${\bm{\sigma}}$ with $\sigma_{(u_{0},\cdots,u_{T})}=\pi_{u_{0}}{\bm{P}}_{u_{0},u_{1}}\cdots{\bm{P}}_{u_{T-1},u_{T}}$ is the stationary distribution of the Markov chain ${\bm{Q}}$:

\begin{split}&\sum_{(u_{0},\cdots,u_{T})\in\mathcal{S}}\sigma_{(u_{0},\cdots,u_{T})}{\bm{Q}}_{(u_{0},\cdots,u_{T}),(w_{0},\cdots,w_{T})}\\ =&\sum_{u_{0}:(u_{0},w_{0},\cdots,w_{T-1})\in\mathcal{S}}\pi_{u_{0}}{\bm{P}}_{u_{0},w_{0}}{\bm{P}}_{w_{0},w_{1}}\cdots{\bm{P}}_{w_{T-2},w_{T-1}}{\bm{P}}_{w_{T-1},w_{T}}\\ =&\left(\sum_{u_{0}}\pi_{u_{0}}{\bm{P}}_{u_{0},w_{0}}\right){\bm{P}}_{w_{0},w_{1}}\cdots{\bm{P}}_{w_{T-2},w_{T-1}}{\bm{P}}_{w_{T-1},w_{T}}\\ =&\pi_{w_{0}}{\bm{P}}_{w_{0},w_{1}}\cdots{\bm{P}}_{w_{T-2},w_{T-1}}{\bm{P}}_{w_{T-1},w_{T}}=\sigma_{(w_{0},\cdots,w_{T})}.\end{split}

Part 2 Recall that $(v_{1},\cdots,v_{L})$ is a random walk on ${\bm{P}}$ starting from distribution ${\bm{\phi}}$, so the probability that we observe $X_{1}=(v_{1},\cdots,v_{T+1})$ is $\phi_{v_{1}}{\bm{P}}_{v_{1},v_{2}}\cdots{\bm{P}}_{v_{T},v_{T+1}}=\rho_{(v_{1},\cdots,v_{T+1})}$, i.e., $X_{1}$ is sampled from the distribution ${\bm{\rho}}$. We then examine the transition probability from $X_{i}=(v_{i},\cdots,v_{i+T})$ to $X_{i+1}=(v_{i+1},\cdots,v_{i+T+1})$, which is ${\bm{P}}_{v_{i+T},v_{i+T+1}}={\bm{Q}}_{X_{i},X_{i+1}}$. Consequently, $(X_{1},\cdots,X_{L-T})$ is a random walk on ${\bm{Q}}$. Moreover,

\begin{split}\left\lVert{\bm{\rho}}\right\rVert_{{\bm{\sigma}}}^{2}&=\sum_{(u_{0},\cdots,u_{T})\in\mathcal{S}}\frac{\rho^{2}_{(u_{0},\cdots,u_{T})}}{\sigma_{(u_{0},\cdots,u_{T})}}=\sum_{(u_{0},\cdots,u_{T})\in\mathcal{S}}\frac{\left(\phi_{u_{0}}{\bm{P}}_{u_{0},u_{1}}\cdots{\bm{P}}_{u_{T-1},u_{T}}\right)^{2}}{\pi_{u_{0}}{\bm{P}}_{u_{0},u_{1}}\cdots{\bm{P}}_{u_{T-1},u_{T}}}\\ &=\sum_{u_{0}}\frac{\phi^{2}_{u_{0}}}{\pi_{u_{0}}}\sum_{(u_{0},u_{1},\cdots,u_{T})\in\mathcal{S}}{\bm{P}}_{u_{0},u_{1}}\cdots{\bm{P}}_{u_{T-1},u_{T}}=\sum_{u_{0}}\frac{\phi^{2}_{u_{0}}}{\pi_{u_{0}}}=\left\lVert{\bm{\phi}}\right\rVert_{{\bm{\pi}}}^{2},\end{split}

which implies $\left\lVert{\bm{\rho}}\right\rVert_{{\bm{\sigma}}}=\left\lVert{\bm{\phi}}\right\rVert_{{\bm{\pi}}}$.

Part 3 For any distribution {\bm{y}} on \mathcal{S}, define {\bm{x}}\in\mathbb{R}^{n} such that x_{i}=\sum_{(v_{1},\cdots,v_{T-1},i)\in\mathcal{S}}y_{v_{1},\cdots,v_{T-1},i}. It is easy to see that {\bm{x}} is a probability vector, since {\bm{x}} is the marginal distribution of {\bm{y}}. For convenience, we assume for the moment that {\bm{x}},{\bm{y}},{\bm{\sigma}},{\bm{\pi}} are row vectors. We can see that:

𝒚𝑸τ(𝑷)+T1𝝈TV=12𝒚𝑸τ(𝑷)+T1𝝈1=12(v1,,vT)𝒮|(𝒚𝑸τ(𝑷)+T1𝝈)v1,,vT|=12(v1,,vT)𝒮|(𝒙𝑷τ(𝑷))v1𝑷v1,v2𝑷vT1,vT𝝅v1𝑷v1,v2𝑷vT1,vT|=12(v1,,vT)𝒮|(𝒙𝑷τ(𝑷))v1πv1|𝑷v1,v2𝑷vT1,vT=12v1|(𝒙𝑷τ(𝑷))v1πv1|(v1,,vT)𝒮𝑷v1,v2𝑷vT1,vT=12v1|(𝒙𝑷τ(𝑷))v1πv1|=12𝒙𝑷τ(𝑷)𝝅1=𝒙𝑷τ(𝑷)𝝅TVδ.\begin{split}\left\lVert{\bm{y}}{\bm{Q}}^{\tau({\bm{P}})+T-1}-{\bm{\sigma}}\right\rVert_{TV}&=\frac{1}{2}\left\lVert{\bm{y}}{\bm{Q}}^{\tau({\bm{P}})+T-1}-{\bm{\sigma}}\right\rVert_{1}\\ &=\frac{1}{2}\sum_{(v_{1},\cdots,v_{T})\in\mathcal{S}}\left\lvert\left({\bm{y}}{\bm{Q}}^{\tau({\bm{P}})+T-1}-{\bm{\sigma}}\right)_{v_{1},\cdots,v_{T}}\right\rvert\\ &=\frac{1}{2}\sum_{(v_{1},\cdots,v_{T})\in\mathcal{S}}\left\lvert\left({\bm{x}}{\bm{P}}^{\tau({\bm{P}})}\right)_{v_{1}}{\bm{P}}_{v_{1},v_{2}}\cdots{\bm{P}}_{v_{T-1},v_{T}}-{\bm{\pi}}_{v_{1}}{\bm{P}}_{v_{1},v_{2}}\cdots{\bm{P}}_{v_{T-1},v_{T}}\right\rvert\\ &=\frac{1}{2}\sum_{(v_{1},\cdots,v_{T})\in\mathcal{S}}\left\lvert\left({\bm{x}}{\bm{P}}^{\tau({\bm{P}})}\right)_{v_{1}}-\pi_{v_{1}}\right\rvert{\bm{P}}_{v_{1},v_{2}}\cdots{\bm{P}}_{v_{T-1},v_{T}}\\ &=\frac{1}{2}\sum_{v_{1}}\left\lvert\left({\bm{x}}{\bm{P}}^{\tau({\bm{P}})}\right)_{v_{1}}-\pi_{v_{1}}\right\rvert\sum_{(v_{1},\cdots,v_{T})\in\mathcal{S}}{\bm{P}}_{v_{1},v_{2}}\cdots{\bm{P}}_{v_{T-1},v_{T}}\\ &=\frac{1}{2}\sum_{v_{1}}\left\lvert\left({\bm{x}}{\bm{P}}^{\tau({\bm{P}})}\right)_{v_{1}}-\pi_{v_{1}}\right\rvert=\frac{1}{2}\left\lVert{\bm{x}}{\bm{P}}^{\tau({\bm{P}})}-{\bm{\pi}}\right\rVert_{1}=\left\lVert{\bm{x}}{\bm{P}}^{\tau({\bm{P}})}-{\bm{\pi}}\right\rVert_{TV}\leq\delta.\end{split}

which indicates τ(𝑸)τ(𝑷)+T1<τ(𝑷)+T\tau({\bm{Q}})\leq\tau({\bm{P}})+T-1<\tau({\bm{P}})+T.

Part 4 This is an example showing that \lambda{({\bm{Q}})} cannot be bounded by \lambda{({\bm{P}})}: even though {\bm{P}} has \lambda{({\bm{P}})}<1, the induced {\bm{Q}} may have \lambda{({\bm{Q}})}=1. We consider the random walk on the unweighted undirected graph with vertex set \{0,1,2,3\} and edge set \{\{0,1\},\{0,2\},\{0,3\},\{1,2\},\{2,3\}\}, and take T=1. The transition probability matrix {\bm{P}} is:

{\bm{P}}=\begin{bmatrix}0&1/3&1/3&1/3\\ 1/2&0&1/2&0\\ 1/3&1/3&0&1/3\\ 1/2&0&1/2&0\end{bmatrix}

with stationary distribution 𝝅=[0.30.20.30.2]{\bm{\pi}}=\begin{bmatrix}0.3&0.2&0.3&0.2\end{bmatrix}^{\top} and λ(𝑷)=23\lambda({\bm{P}})=\frac{2}{3}. When T=1T=1, the induced Markov chain 𝑸{\bm{Q}} has stationary distribution σu,v=πu𝑷u,v=du2m1du=12m\sigma_{u,v}=\pi_{u}{\bm{P}}_{u,v}=\frac{d_{u}}{2m}\frac{1}{d_{u}}=\frac{1}{2m} where m=5m=5 is the number of edges in the graph. Construct 𝒚|𝒮|{\bm{y}}\in\mathbb{R}^{\left\lvert\mathcal{S}\right\rvert} such that

y_{(u,v)}=\begin{cases}1&v=0,\\ -1&v=2,\\ 0&\text{otherwise.}\end{cases}

The constructed vector 𝒚{\bm{y}} has norm

\left\lVert{\bm{y}}\right\rVert_{{\bm{\sigma}}}=\sqrt{\langle{\bm{y}},{\bm{y}}\rangle_{{\bm{\sigma}}}}=\sqrt{\sum_{(u,v)\in\mathcal{S}}\frac{y_{(u,v)}y_{(u,v)}}{\sigma_{(u,v)}}}=\sqrt{2m\left(d_{0}+d_{2}\right)}=2\sqrt{3m}.

And it is easy to check {\bm{y}}\perp{\bm{\sigma}}, since \langle{\bm{y}},{\bm{\sigma}}\rangle_{{\bm{\sigma}}}=\sum_{(u,v)\in\mathcal{S}}\frac{\sigma_{(u,v)}y_{(u,v)}}{\sigma_{(u,v)}}=\sum_{(u,v)\in\mathcal{S}}y_{(u,v)}=d_{0}-d_{2}=0. Let {\bm{x}}=\left({\bm{y}}^{\ast}{\bm{Q}}\right)^{\ast}. Since {\bm{x}}_{(u,v)}=\sum_{(a,b)\in\mathcal{S}}y_{(a,b)}{\bm{Q}}_{(a,b),(u,v)}=\big(\sum_{a:(a,u)\in\mathcal{S}}y_{(a,u)}\big){\bm{P}}_{u,v}, where \sum_{a:(a,u)\in\mathcal{S}}y_{(a,u)} equals d_{0}=3 for u=0, equals -d_{2}=-3 for u=2, and equals 0 otherwise, while {\bm{P}}_{0,v}={\bm{P}}_{2,v}=1/3 for every neighbor v, we have for (u,v)\in\mathcal{S}:

{\bm{x}}_{(u,v)}=\begin{cases}1&u=0,\\ -1&u=2,\\ 0&\text{otherwise.}\end{cases}

This vector has norm:

\left\lVert{\bm{x}}\right\rVert_{{\bm{\sigma}}}=\sqrt{\langle{\bm{x}},{\bm{x}}\rangle_{{\bm{\sigma}}}}=\sqrt{\sum_{(u,v)\in\mathcal{S}}\frac{x_{(u,v)}x_{(u,v)}}{\sigma_{(u,v)}}}=\sqrt{2m\left(d_{0}+d_{2}\right)}=2\sqrt{3m}

Thus we have \frac{\left\lVert\left({\bm{y}}^{\ast}{\bm{Q}}\right)^{\ast}\right\rVert_{{\bm{\sigma}}}}{\left\lVert{\bm{y}}\right\rVert_{{\bm{\sigma}}}}=1. Taking the maximum over all feasible {\bm{y}} gives \lambda({\bm{Q}})\geq 1. Also note the fact that \lambda({\bm{Q}})\leq 1, so \lambda({\bm{Q}})=1. ∎
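
The conclusion of Part 4 can also be checked numerically. The following is a minimal sketch (not part of the proof), assuming a Python environment with numpy; the routine spectral_expansion is an illustrative helper of ours that evaluates the spectral expansion directly from its definition via the similarity transform {\bm{\Pi}}^{-1/2}{\bm{P}}^{\top}{\bm{\Pi}}^{1/2} restricted to the orthogonal complement of \sqrt{{\bm{\pi}}}.

import numpy as np

A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)            # the 4-vertex graph of Part 4
deg = A.sum(axis=1)
P = A / deg[:, None]                                 # random-walk transition matrix
pi = deg / deg.sum()                                 # stationary distribution [0.3, 0.2, 0.3, 0.2]

def spectral_expansion(P, pi):
    """max over y kernel-orthogonal to pi of ||P^T y||_pi / ||y||_pi, computed via the
    similarity Pi^{-1/2} P^T Pi^{1/2} restricted to the orthogonal complement of sqrt(pi)."""
    s = np.sqrt(pi)
    M = np.diag(1 / s) @ P.T @ np.diag(s)
    proj = np.eye(len(pi)) - np.outer(s, s)          # projector onto sqrt(pi)^perp
    return np.linalg.norm(M @ proj, 2)

print(spectral_expansion(P, pi))                     # ~0.6667 = lambda(P)

# induced chain Q on directed edges (T = 1), with uniform stationary distribution 1/(2m)
S = [(u, v) for u in range(4) for v in range(4) if P[u, v] > 0]
Q = np.array([[P[v, w] if v == a else 0.0 for (a, w) in S] for (u, v) in S])
sigma = np.array([pi[u] * P[u, v] for (u, v) in S])
print(spectral_expansion(Q, sigma))                  # ~1.0 = lambda(Q)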

A.2 Proof of Claim 2

See 2

Proof.

Note that Equation 2 is a random variable minus its expectation, so it has zero mean, i.e., \sum_{X\in\mathcal{S}}\sigma_{X}f(X)=0. Moreover, \left\lVert f(X)\right\rVert_{2}\leq 1 because

f(X)212(r=1T|αr|2(𝒆v0𝒆vr2+𝒆vr𝒆v02)+r=1T|αr|2(𝚷2𝑷2r+𝑷2r𝚷2))12(r=1T|αr|+r=1T|αr|)=1.\begin{split}\left\lVert f(X)\right\rVert_{2}&\leq\frac{1}{2}\left(\sum_{r=1}^{T}\frac{\left\lvert\alpha_{r}\right\rvert}{2}\left(\left\lVert{\bm{e}}_{v_{0}}{\bm{e}}^{\top}_{v_{r}}\right\rVert_{2}+\left\lVert{\bm{e}}_{v_{r}}{\bm{e}}^{\top}_{v_{0}}\right\rVert_{2}\right)+\sum_{r=1}^{T}\frac{\left\lvert\alpha_{r}\right\rvert}{2}\left(\left\lVert{\bm{\Pi}}\right\rVert_{2}\left\lVert{\bm{P}}\right\rVert_{2}^{r}+\left\lVert{\bm{P}}^{\top}\right\rVert_{2}^{r}\left\lVert{\bm{\Pi}}\right\rVert_{2}\right)\right)\\ &\leq\frac{1}{2}\left(\sum_{r=1}^{T}\left\lvert\alpha_{r}\right\rvert+\sum_{r=1}^{T}\left\lvert\alpha_{r}\right\rvert\right)=1.\end{split}

where the first step follows from the triangle inequality and the submultiplicativity of the 2-norm, and the second step follows from (1) \left\lVert{\bm{e}}_{i}{\bm{e}}_{j}^{\top}\right\rVert_{2}=1; (2) \left\lVert{\bm{\Pi}}\right\rVert_{2}=\left\lVert\operatorname{diag}({\bm{\pi}})\right\rVert_{2}\leq 1 for a distribution {\bm{\pi}}; (3) \left\lVert{\bm{P}}\right\rVert_{2}=\left\lVert{\bm{P}}^{\top}\right\rVert_{2}=1. ∎

A.3 Proof of Corollary 1

See 1

Proof.

An HMM can be modeled by a Markov chain {\bm{P}} on \mathcal{Y}\times\mathcal{X} such that P(y_{t+1},x_{t+1}|y_{t},x_{t})=P(y_{t+1}|x_{t+1})P(x_{t+1}|x_{t}). For the co-occurrence matrix of observable states, a proof similar to that of Theorem 2 shows that a trajectory of length O(\tau({\bm{P}})(\log{|\mathcal{Y}|}+\log{\tau({\bm{P}})})/\epsilon^{2}) suffices to achieve error bound \epsilon with high probability. Moreover, the mixing time \tau({\bm{P}}) is bounded by the mixing time of the Markov chain on the hidden state space (i.e., P(x_{t+1}|x_{t})). ∎

Appendix B Matrix Chernoff Bounds for Markov Chains

B.1 Preliminaries

Kronecker Products If {\bm{A}} is an M_{1}\times N_{1} matrix and {\bm{B}} is an M_{2}\times N_{2} matrix, then the Kronecker product {\bm{A}}\otimes{\bm{B}} is the M_{1}M_{2}\times N_{1}N_{2} block matrix such that

{\bm{A}}\otimes{\bm{B}}=\begin{bmatrix}{\bm{A}}_{1,1}{\bm{B}}&\cdots&{\bm{A}}_{1,N_{1}}{\bm{B}}\\ \vdots&\ddots&\vdots\\ {\bm{A}}_{M_{1},1}{\bm{B}}&\cdots&{\bm{A}}_{M_{1},N_{1}}{\bm{B}}\\ \end{bmatrix}.

The Kronecker product has the mixed-product property: if {\bm{A}},{\bm{B}},{\bm{C}},{\bm{D}} are matrices of such sizes that one can form the matrix products {\bm{A}}{\bm{C}} and {\bm{B}}{\bm{D}}, then ({\bm{A}}\otimes{\bm{B}})({\bm{C}}\otimes{\bm{D}})=({\bm{A}}{\bm{C}})\otimes({\bm{B}}{\bm{D}}).

Vectorization For a matrix {\bm{X}}\in\mathbb{C}^{d\times d}, \operatorname{vec}({\bm{X}})\in\mathbb{C}^{d^{2}} denotes the vectorization of {\bm{X}}, i.e., \operatorname{vec}({\bm{X}})=\sum_{i\in[d]}\sum_{j\in[d]}{\bm{X}}_{i,j}{\bm{e}}_{i}\otimes{\bm{e}}_{j}, which stacks the rows of {\bm{X}}. Vectorization relates matrix multiplication to the Kronecker product via \operatorname{vec}({\bm{A}}{\bm{X}}{\bm{B}})=({\bm{A}}\otimes{\bm{B}}^{\top})\operatorname{vec}({\bm{X}}).
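
Both the mixed-product property and the vectorization identity above are easy to sanity-check numerically. The snippet below is a small sketch assuming a Python environment with numpy; note that numpy's row-major reshape matches the row-stacking convention used for \operatorname{vec} here.

import numpy as np

rng = np.random.default_rng(0)
A, B, C, D, X = (rng.standard_normal((3, 3)) for _ in range(5))

# mixed-product property: (A (x) B)(C (x) D) = (AC) (x) (BD)
assert np.allclose(np.kron(A, B) @ np.kron(C, D), np.kron(A @ C, B @ D))

# row-major vectorization (stack of rows) and vec(A X B) = (A (x) B^T) vec(X)
vec = lambda M: M.reshape(-1)
assert np.allclose(vec(A @ X @ B), np.kron(A, B.T) @ vec(X))
print("Kronecker-product identities verified")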

Matrices and Norms For a matrix 𝑨N×N{\bm{A}}\in{\mathbb{C}}^{N\times N}, we use 𝑨{\bm{A}}^{\top} to denote matrix transpose, use 𝑨¯\overline{{\bm{A}}} to denote entry-wise matrix conjugation, use 𝑨{\bm{A}}^{\ast} to denote matrix conjugate transpose (𝑨=𝑨¯=𝑨¯{\bm{A}}^{\ast}=\overline{{\bm{A}}^{\top}}=\overline{{\bm{A}}}^{\top}). The vector 2-norm is defined to be 𝒙2=𝒙𝒙\left\lVert{\bm{x}}\right\rVert_{2}=\sqrt{{\bm{x}}^{\ast}{\bm{x}}}, and the matrix 2-norm is defined to be 𝑨2=max𝒙N,𝒙0𝑨𝒙2𝒙2\left\lVert{\bm{A}}\right\rVert_{2}=\max_{{\bm{x}}\in\mathbb{C}^{N},{\bm{x}}\neq 0}\frac{\left\lVert{\bm{A}}{\bm{x}}\right\rVert_{2}}{\left\lVert{\bm{x}}\right\rVert_{2}}.

We then recall the definition of the inner product under the {\bm{\pi}}-kernel from Section 2. The inner product under the {\bm{\pi}}-kernel for \mathbb{C}^{N} is \left\langle{\bm{x}},{\bm{y}}\right\rangle_{\bm{\pi}}={\bm{y}}^{\ast}{\bm{\Pi}}^{-1}{\bm{x}} where {\bm{\Pi}}=\operatorname{diag}({\bm{\pi}}), and its induced {\bm{\pi}}-norm is \left\lVert{\bm{x}}\right\rVert_{{\bm{\pi}}}=\sqrt{\left\langle{\bm{x}},{\bm{x}}\right\rangle_{\bm{\pi}}}. The above definition allows us to define an inner product under the {\bm{\pi}}-kernel on {\mathbb{C}}^{Nd^{2}}:

Definition 1.

Define inner product on Nd2{\mathbb{C}}^{Nd^{2}} under 𝛑{\bm{\pi}}-kernel to be 𝐱,𝐲𝛑=𝐲(𝚷1𝐈d2)𝐱\left\langle{\bm{x}},{\bm{y}}\right\rangle_{\bm{\pi}}={\bm{y}}^{\ast}\left({\bm{\Pi}}^{-1}\otimes{\bm{I}}_{d^{2}}\right){\bm{x}}.

Remark 1.

For {\bm{x}},{\bm{y}}\in{\mathbb{C}}^{N} and {\bm{p}},{\bm{q}}\in{\mathbb{C}}^{d^{2}}, the inner product (under the {\bm{\pi}}-kernel) between {\bm{x}}\otimes{\bm{p}} and {\bm{y}}\otimes{\bm{q}} can be simplified as

𝒙𝒑,𝒚𝒒𝝅=(𝒚𝒒)(𝚷1𝑰d2)(𝒙𝒑)=(𝒚𝚷1𝒙)(𝒒𝒑)=𝒙,𝒚𝝅𝒑,𝒒.\langle{\bm{x}}\otimes{\bm{p}},{\bm{y}}\otimes{\bm{q}}\rangle_{{\bm{\pi}}}=({\bm{y}}\otimes{\bm{q}})^{\ast}\left({\bm{\Pi}}^{-1}\otimes{\bm{I}}_{d^{2}}\right)({\bm{x}}\otimes{\bm{p}})=({\bm{y}}^{\ast}{\bm{\Pi}}^{-1}{\bm{x}})\otimes({\bm{q}}^{\ast}{\bm{p}})=\langle{\bm{x}},{\bm{y}}\rangle_{{\bm{\pi}}}\langle{\bm{p}},{\bm{q}}\rangle.
Remark 2.

The induced 𝛑{\bm{\pi}}-norm is 𝐱𝛑=𝐱,𝐱𝛑\left\lVert{\bm{x}}\right\rVert_{{\bm{\pi}}}=\sqrt{\left\langle{\bm{x}},{\bm{x}}\right\rangle_{\bm{\pi}}}. When 𝐱=𝐲𝐰{\bm{x}}={\bm{y}}\otimes{\bm{w}}, the 𝛑{\bm{\pi}}-norm can be simplified to be: 𝐱𝛑=𝐲𝐰,𝐲𝐰𝛑=𝐲,𝐲𝛑𝐰,𝐰=𝐲𝛑𝐰2\left\lVert{\bm{x}}\right\rVert_{{\bm{\pi}}}=\sqrt{\left\langle{\bm{y}}\otimes{\bm{w}},{\bm{y}}\otimes{\bm{w}}\right\rangle_{\bm{\pi}}}=\sqrt{\langle{\bm{y}},{\bm{y}}\rangle_{{\bm{\pi}}}\langle{\bm{w}},{\bm{w}}\rangle}=\left\lVert{\bm{y}}\right\rVert_{{\bm{\pi}}}\left\lVert{\bm{w}}\right\rVert_{2}.
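
Remarks 1 and 2 can be verified numerically by building the kernel {\bm{\Pi}}^{-1}\otimes{\bm{I}}_{d^{2}} explicitly; the following is a small sketch assuming numpy, with randomly drawn data.

import numpy as np

rng = np.random.default_rng(2)
N, d2 = 5, 4                                         # d2 plays the role of d^2
pi = rng.random(N) + 0.1; pi /= pi.sum()             # a generic positive distribution
K = np.kron(np.diag(1.0 / pi), np.eye(d2))           # Pi^{-1} (x) I_{d^2}
inner = lambda a, b: b.conj() @ K @ a                # <a, b>_pi on C^{N d^2}

x, y = rng.standard_normal(N), rng.standard_normal(N)
p, q = rng.standard_normal(d2), rng.standard_normal(d2)

# Remark 1: <x (x) p, y (x) q>_pi = <x, y>_pi <p, q>
lhs = inner(np.kron(x, p), np.kron(y, q))
rhs = (y.conj() @ np.diag(1.0 / pi) @ x) * (q.conj() @ p)
assert np.allclose(lhs, rhs)

# Remark 2: ||y (x) w||_pi = ||y||_pi ||w||_2
w = rng.standard_normal(d2)
lhs = np.sqrt(inner(np.kron(y, w), np.kron(y, w)))
rhs = np.sqrt(y @ np.diag(1.0 / pi) @ y) * np.linalg.norm(w)
assert np.allclose(lhs, rhs)
print("Remarks 1 and 2 verified numerically")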

Matrix Exponential The matrix exponential of a matrix {\bm{A}}\in{\mathbb{C}}^{d\times d} is defined by the Taylor expansion \exp{({\bm{A}})}=\sum_{j=0}^{+\infty}\frac{{\bm{A}}^{j}}{j!}. We will use the fact that \exp({\bm{A}})\otimes\exp({\bm{B}})=\exp({\bm{A}}\otimes{\bm{I}}+{\bm{I}}\otimes{\bm{B}}), which holds because {\bm{A}}\otimes{\bm{I}} and {\bm{I}}\otimes{\bm{B}} commute.
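
A quick numerical check of this identity, as a sketch assuming numpy and scipy.linalg.expm are available:

import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)
A, B = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
I = np.eye(3)

# exp(A) (x) exp(B) = exp(A (x) I + I (x) B), since A (x) I and I (x) B commute
assert np.allclose(np.kron(expm(A), expm(B)), expm(np.kron(A, I) + np.kron(I, B)))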

Golden-Thompson Inequality We need the following multi-matrix Golden-Thompson inequality from Garg et al. [10]. See 4

B.2 Proof of Theorem 3

See 3

Proof.

Due to symmetry, it suffices to prove one of the statements. Let t>0t>0 be a parameter to be chosen later. Then

[λmax(1kj=1kf(vj))ϵ]=[λmax(j=1kf(vj))kϵ][Tr[exp(tj=1kf(vj))]exp(tkϵ)]𝔼v1,vk[Tr[exp(tj=1kf(vj))]]exp(tkϵ).\begin{split}\mathbb{P}\left[\lambda_{\max}\left(\frac{1}{k}\sum_{j=1}^{k}f(v_{j})\right)\geq\epsilon\right]&=\mathbb{P}\left[\lambda_{\max}\left(\sum_{j=1}^{k}f(v_{j})\right)\geq k\epsilon\right]\\ &\leq\mathbb{P}\left[\operatorname{Tr}{\left[\exp{\left(t\sum_{j=1}^{k}f(v_{j})\right)}\right]}\geq\exp{(tk\epsilon)}\right]\\ &\leq\frac{\mathbb{E}_{v_{1}\cdots,v_{k}}\left[\operatorname{Tr}{\left[\exp{\left(t\sum_{j=1}^{k}f(v_{j})\right)}\right]}\right]}{\exp{(tk\epsilon)}}.\end{split} (3)

The first inequality holds because, on the event \lambda_{\max}\left(\sum_{j=1}^{k}f(v_{j})\right)\geq k\epsilon, we have \exp{(tk\epsilon)}\leq\exp{\left(t\lambda_{\max}\left(\sum_{j=1}^{k}f(v_{j})\right)\right)}\leq\operatorname{Tr}{\left[\exp{\left(t\sum_{j=1}^{k}f(v_{j})\right)}\right]}; the second inequality follows from Markov's inequality.

We next bound \mathbb{E}_{v_{1},\cdots,v_{k}}\left[\operatorname{Tr}{\left[\exp{\left(t\sum_{j=1}^{k}f(v_{j})\right)}\right]}\right]. Using Theorem 4, we have:

log(Tr[exp(tj=1kf(vj))])4ππ2π2log(Tr[j=1kexp(eiϕ2tf(vj))j=k1exp(eiϕ2tf(vj))])𝑑μ(ϕ)4πlogπ2π2Tr[j=1kexp(eiϕ2tf(vj))j=k1exp(eiϕ2tf(vj))]𝑑μ(ϕ),\begin{split}\log{\left(\operatorname{Tr}{\left[\exp{\left(t\sum_{j=1}^{k}f(v_{j})\right)}\right]}\right)}&\leq\frac{4}{\pi}\int_{-\frac{\pi}{2}}^{\frac{\pi}{2}}\log{\left(\operatorname{Tr}{\left[\prod_{j=1}^{k}\exp{\left(\frac{e^{{\mathrm{i}}\phi}}{2}tf(v_{j})\right)}\prod_{j=k}^{1}\exp{\left(\frac{e^{-{\mathrm{i}}\phi}}{2}tf(v_{j})\right)}\right]}\right)}d\mu(\phi)\\ &\leq\frac{4}{\pi}\log\int_{-\frac{\pi}{2}}^{\frac{\pi}{2}}\operatorname{Tr}{\left[\prod_{j=1}^{k}\exp{\left(\frac{e^{{\mathrm{i}}\phi}}{2}tf(v_{j})\right)}\prod_{j=k}^{1}\exp{\left(\frac{e^{-{\mathrm{i}}\phi}}{2}tf(v_{j})\right)}\right]}d\mu(\phi),\end{split}

where the second step follows by concavity of log\log function and the fact that μ(ϕ)\mu(\phi) is a probability distribution on [π2,π2][-\frac{\pi}{2},\frac{\pi}{2}]. This implies

Tr[exp(tj=1kf(vj))](π2π2Tr[j=1kexp(eiϕ2tf(vj))j=k1exp(eiϕ2tf(vj))]𝑑μ(ϕ))4π.\operatorname{Tr}{\left[\exp{\left(t\sum_{j=1}^{k}f(v_{j})\right)}\right]}\leq\left(\int_{-\frac{\pi}{2}}^{\frac{\pi}{2}}\operatorname{Tr}{\left[\prod_{j=1}^{k}\exp{\left(\frac{e^{{\mathrm{i}}\phi}}{2}tf(v_{j})\right)}\prod_{j=k}^{1}\exp{\left(\frac{e^{-{\mathrm{i}}\phi}}{2}tf(v_{j})\right)}\right]}d\mu(\phi)\right)^{\frac{4}{\pi}}.

Note that \left\lVert{\bm{x}}\right\rVert_{p}\leq d^{1/p-1}\left\lVert{\bm{x}}\right\rVert_{1} for p\in(0,1). Choosing p=\pi/4, we have

(Tr[exp(π4tj=1kf(vj))])4πd4π1Tr[exp(tj=1kf(vj))].\left(\operatorname{Tr}{\left[\exp{\left(\frac{\pi}{4}t\sum_{j=1}^{k}f(v_{j})\right)}\right]}\right)^{\frac{4}{\pi}}\leq d^{\frac{4}{\pi}-1}\operatorname{Tr}{\left[\exp{\left(t\sum_{j=1}^{k}f(v_{j})\right)}\right]}.

Combining the above two equations together, we have

Tr[exp(π4tj=1kf(vj))]d1π4π2π2Tr[j=1kexp(eiϕ2tf(vj))j=k1exp(eiϕ2tf(vj))]𝑑μ(ϕ).\operatorname{Tr}{\left[\exp{\left(\frac{\pi}{4}t\sum_{j=1}^{k}f(v_{j})\right)}\right]}\leq d^{1-\frac{\pi}{4}}\int_{-\frac{\pi}{2}}^{\frac{\pi}{2}}\operatorname{Tr}{\left[\prod_{j=1}^{k}\exp{\left(\frac{e^{{\mathrm{i}}\phi}}{2}tf(v_{j})\right)}\prod_{j=k}^{1}\exp{\left(\frac{e^{-{\mathrm{i}}\phi}}{2}tf(v_{j})\right)}\right]}d\mu(\phi). (4)

Write e^{{\mathrm{i}}\phi}=\gamma+{\mathrm{i}}b with \gamma^{2}+b^{2}=\left\lvert\gamma+{\mathrm{i}}b\right\rvert^{2}=\left\lvert e^{{\mathrm{i}}\phi}\right\rvert^{2}=1. See 1 Assuming the above lemma, we can complete the proof of the theorem as follows:

𝔼v1,vk[Tr[exp(π4tj=1kf(vj))]]d1π4𝔼v1,vk[π2π2(Tr[j=1kexp(eiϕ2tf(vj))j=k1exp(eiϕ2tf(vj))])𝑑μ(ϕ)]=d1π4π2π2𝔼v1,vk[Tr[j=1kexp(eiϕ2tf(vj))j=k1exp(eiϕ2tf(vj))]]𝑑μ(ϕ)d1π4π2π2ϕ𝝅dexp(kt2|eiϕ|2(1+81λ))𝑑μ(ϕ)=ϕ𝝅d2π4exp(kt2(1+81λ))π2π2𝑑μ(ϕ)=ϕ𝝅d2π4exp(kt2(1+81λ))\begin{split}&\mathbb{E}_{v_{1}\cdots,v_{k}}\left[\operatorname{Tr}{\left[\exp{\left(\frac{\pi}{4}t\sum_{j=1}^{k}f(v_{j})\right)}\right]}\right]\\ \leq&d^{1-\frac{\pi}{4}}\mathbb{E}_{v_{1}\cdots,v_{k}}\left[\int_{-\frac{\pi}{2}}^{\frac{\pi}{2}}\left(\operatorname{Tr}{\left[\prod_{j=1}^{k}\exp{\left(\frac{e^{{\mathrm{i}}\phi}}{2}tf(v_{j})\right)}\prod_{j=k}^{1}\exp{\left(\frac{e^{-{\mathrm{i}}\phi}}{2}tf(v_{j})\right)}\right]}\right)d\mu(\phi)\right]\\ =&d^{1-\frac{\pi}{4}}\int_{-\frac{\pi}{2}}^{\frac{\pi}{2}}\mathbb{E}_{v_{1}\cdots,v_{k}}\left[\operatorname{Tr}{\left[\prod_{j=1}^{k}\exp{\left(\frac{e^{{\mathrm{i}}\phi}}{2}tf(v_{j})\right)}\prod_{j=k}^{1}\exp{\left(\frac{e^{-{\mathrm{i}}\phi}}{2}tf(v_{j})\right)}\right]}\right]d\mu(\phi)\\ \leq&d^{1-\frac{\pi}{4}}\int_{-\frac{\pi}{2}}^{\frac{\pi}{2}}\left\lVert{\bm{\phi}}\right\rVert_{{\bm{\pi}}}d\exp{\left(kt^{2}\left\lvert e^{{\mathrm{i}}\phi}\right\rvert^{2}\left(1+\frac{8}{1-\lambda}\right)\right)}d\mu(\phi)\\ =&\left\lVert{\bm{\phi}}\right\rVert_{{\bm{\pi}}}d^{2-\frac{\pi}{4}}\exp{\left(kt^{2}\left(1+\frac{8}{1-\lambda}\right)\right)}\int_{-\frac{\pi}{2}}^{\frac{\pi}{2}}d\mu(\phi)\\ =&\left\lVert{\bm{\phi}}\right\rVert_{{\bm{\pi}}}d^{2-\frac{\pi}{4}}\exp{\left(kt^{2}\left(1+\frac{8}{1-\lambda}\right)\right)}\end{split} (5)

where the first step follows from Equation 4, the second step follows by swapping \mathbb{E} and \int, the third step follows from Lemma 1, the fourth step follows from \left\lvert e^{{\mathrm{i}}\phi}\right\rvert=1, and the last step follows because \mu is a probability distribution on [-\frac{\pi}{2},\frac{\pi}{2}], so \int_{-\frac{\pi}{2}}^{\frac{\pi}{2}}d\mu(\phi)=1.

Finally, putting it all together:

\begin{split}\mathbb{P}\left[\lambda_{\max}\left(\frac{1}{k}\sum_{j=1}^{k}f(v_{j})\right)\geq\epsilon\right]&\leq\frac{\mathbb{E}\left[\operatorname{Tr}{\left[\exp{\left(t\sum_{j=1}^{k}f(v_{j})\right)}\right]}\right]}{\exp{(tk\epsilon)}}\\ &=\frac{\mathbb{E}\left[\operatorname{Tr}{\left[\exp{\left(\frac{\pi}{4}\left(\frac{4}{\pi}t\right)\sum_{j=1}^{k}f(v_{j})\right)}\right]}\right]}{\exp{(tk\epsilon)}}\\ &\leq\frac{\left\lVert{\bm{\phi}}\right\rVert_{{\bm{\pi}}}d^{2-\frac{\pi}{4}}\exp{\left(k\left(\frac{4}{\pi}t\right)^{2}\left(1+\frac{8}{1-\lambda}\right)\right)}}{\exp{(tk\epsilon)}}\\ &\leq\left\lVert{\bm{\phi}}\right\rVert_{{\bm{\pi}}}d^{2-\frac{\pi}{4}}\exp{\left(\left(\frac{4}{\pi}\right)^{2}k\epsilon^{2}(1-\lambda)^{2}\frac{1}{36^{2}}\frac{9}{1-\lambda}-k\frac{(1-\lambda)\epsilon}{36}\epsilon\right)}\\ &\leq\left\lVert{\bm{\phi}}\right\rVert_{{\bm{\pi}}}d^{2}\exp{(-k\epsilon^{2}(1-\lambda)/72)}.\end{split}

where the first inequality follows from Equation 3, the second inequality follows from Equation 5 (applied with \frac{4}{\pi}t in place of t), and the last two steps follow by choosing t=(1-\lambda)\epsilon/36 and using 1+\frac{8}{1-\lambda}\leq\frac{9}{1-\lambda}, \left(\frac{4}{\pi}\right)^{2}\leq 2, and d^{2-\frac{\pi}{4}}\leq d^{2}. The only thing left to check is that t=(1-\lambda)\epsilon/36 satisfies the conditions of Lemma 1, i.e., t\sqrt{\gamma^{2}+b^{2}}=t\leq 1 and t\leq\frac{1-\lambda}{4\lambda}. Recalling that \epsilon<1 and \lambda\leq 1, we have t=\frac{(1-\lambda)\epsilon}{36}\leq\frac{1}{36}\leq 1 and t\leq\frac{1-\lambda}{4}\leq\frac{1-\lambda}{4\lambda}. ∎

B.3 Proof of Lemma 1

See 1

Proof.

Note that for {\bm{A}},{\bm{B}}\in\mathbb{C}^{d\times d}, \left\langle({\bm{A}}\otimes{\bm{B}})\operatorname{vec}({\bm{I}}_{d}),\operatorname{vec}({\bm{I}}_{d})\right\rangle=\operatorname{Tr}\left[{\bm{A}}{\bm{B}}^{\top}\right]. Letting {\bm{A}}=\prod_{j=1}^{k}\exp{\left(\frac{tf(v_{j})(\gamma+{\mathrm{i}}b)}{2}\right)} and {\bm{B}}=\left(\prod_{j=k}^{1}\exp{\left(\frac{tf(v_{j})(\gamma-{\mathrm{i}}b)}{2}\right)}\right)^{\top}=\prod_{j=1}^{k}\exp{\left(\frac{tf(v_{j})(\gamma-{\mathrm{i}}b)}{2}\right)}, the trace term on the LHS of Lemma 1 becomes

Tr[j=1kexp(tf(vj)(γ+ib)2)j=k1exp(tf(vj)(γib)2)]=(j=1kexp(tf(vj)(γ+ib)2)j=1kexp(tf(vj)(γib)2))vec(𝑰d),vec(𝑰d).\begin{split}&\operatorname{Tr}\left[\prod_{j=1}^{k}\exp{\left(\frac{tf(v_{j})(\gamma+{\mathrm{i}}b)}{2}\right)}\prod_{j=k}^{1}\exp{\left(\frac{tf(v_{j})(\gamma-{\mathrm{i}}b)}{2}\right)}\right]\\ =&\left\langle\left(\prod_{j=1}^{k}\exp{\left(\frac{tf(v_{j})(\gamma+{\mathrm{i}}b)}{2}\right)}\otimes\prod_{j=1}^{k}\exp{\left(\frac{tf(v_{j})(\gamma-{\mathrm{i}}b)}{2}\right)}\right)\operatorname{vec}({\bm{I}}_{d}),\operatorname{vec}({\bm{I}}_{d})\right\rangle.\end{split} (6)

By iteratively applying (𝑨𝑩)(𝑪𝑫)=(𝑨𝑪)(𝑩𝑫)({\bm{A}}\otimes{\bm{B}})({\bm{C}}\otimes{\bm{D}})=({\bm{A}}{\bm{C}})\otimes({\bm{B}}{\bm{D}}), we have

j=1kexp(tf(vj)(γ+ib)2)j=1kexp(tf(vj)(γib)2)=j=1k(exp(tf(vj)(γ+ib)2)exp(tf(vj)(γib)2))j=1k𝑴vj,\begin{split}&\prod_{j=1}^{k}\exp{\left(\frac{tf(v_{j})(\gamma+{\mathrm{i}}b)}{2}\right)}\otimes\prod_{j=1}^{k}\exp{\left(\frac{tf(v_{j})(\gamma-{\mathrm{i}}b)}{2}\right)}\\ =&\prod_{j=1}^{k}\left(\exp{\left(\frac{tf(v_{j})(\gamma+{\mathrm{i}}b)}{2}\right)}\otimes\exp{\left(\frac{tf(v_{j})(\gamma-{\mathrm{i}}b)}{2}\right)}\right)\triangleq\prod_{j=1}^{k}{\bm{M}}_{v_{j}},\end{split}

where we define

𝑴vjexp(tf(vj)(γ+ib)2)exp(tf(vj)(γib)2).{\bm{M}}_{v_{j}}\triangleq\exp{\left(\frac{tf(v_{j})(\gamma+ib)}{2}\right)}\otimes\exp{\left(\frac{tf(v_{j})(\gamma-ib)}{2}\right)}. (7)

Plugging this into the trace term, we have

Tr[j=1kexp(tf(vj)(γ+ib)2)j=k1exp(tf(vj)(γib)2)]=(j=1k𝑴vj)vec(𝑰d),vec(𝑰d).\operatorname{Tr}\left[\prod_{j=1}^{k}\exp{\left(\frac{tf(v_{j})(\gamma+{\mathrm{i}}b)}{2}\right)}\prod_{j=k}^{1}\exp{\left(\frac{tf(v_{j})(\gamma-{\mathrm{i}}b)}{2}\right)}\right]=\left\langle\left(\prod_{j=1}^{k}{\bm{M}}_{v_{j}}\right)\operatorname{vec}({\bm{I}}_{d}),\operatorname{vec}({\bm{I}}_{d})\right\rangle.

Next, taking expectation on Equation 6 gives

𝔼v1,,vk[Tr[j=1kexp(tf(vj)(γ+ib)2)j=k1exp(tf(vj)(γib)2)]]=𝔼v1,,vk[(j=1k𝑴vj)vec(𝑰d),vec(𝑰d)]=𝔼v1,,vk[j=1k𝑴vj]vec(𝑰d),vec(𝑰d).\begin{split}&\mathbb{E}_{v_{1},\cdots,v_{k}}\left[\operatorname{Tr}\left[\prod_{j=1}^{k}\exp{\left(\frac{tf(v_{j})(\gamma+ib)}{2}\right)}\prod_{j=k}^{1}\exp{\left(\frac{tf(v_{j})(\gamma-ib)}{2}\right)}\right]\right]\\ =&\mathbb{E}_{v_{1},\cdots,v_{k}}\left[\left\langle\left(\prod_{j=1}^{k}{\bm{M}}_{v_{j}}\right)\operatorname{vec}({\bm{I}}_{d}),\operatorname{vec}({\bm{I}}_{d})\right\rangle\right]\\ =&\left\langle\mathbb{E}_{v_{1},\cdots,v_{k}}\left[\prod_{j=1}^{k}{\bm{M}}_{v_{j}}\right]\operatorname{vec}({\bm{I}}_{d}),\operatorname{vec}({\bm{I}}_{d})\right\rangle.\end{split} (8)

We turn to study 𝔼v1,,vk[j=1k𝑴vj]\mathbb{E}_{v_{1},\cdots,v_{k}}\left[\prod_{j=1}^{k}{\bm{M}}_{v_{j}}\right], which is characterized by the following lemma:

Lemma 2.

Let 𝐄diag(𝐌1,𝐌2,,𝐌N)Nd2×Nd2{\bm{E}}\triangleq\operatorname{diag}({\bm{M}}_{1},{\bm{M}}_{2},\cdots,{\bm{M}}_{N})\in{\mathbb{C}}^{Nd^{2}\times Nd^{2}} and 𝐏~𝐏𝐈d2Nd2×Nd2\widetilde{{\bm{P}}}\triangleq{\bm{P}}\otimes{\bm{I}}_{d^{2}}\in\mathbb{R}^{Nd^{2}\times Nd^{2}}. For a random walk (v1,,vk)(v_{1},\cdots,v_{k}) such that v1v_{1} is sampled from an arbitrary probability distribution ϕ{\bm{\phi}} on [N][N], 𝔼v1,,vk[j=1k𝐌vj]=(ϕ𝐈d2)((𝐄𝐏~)k1𝐄)(𝟏𝐈d2)\mathbb{E}_{v_{1},\cdots,v_{k}}\left[\prod_{j=1}^{k}{\bm{M}}_{v_{j}}\right]=\left({\bm{\phi}}\otimes{\bm{I}}_{d^{2}}\right)^{\top}\left(({\bm{E}}\widetilde{{\bm{P}}})^{k-1}{\bm{E}}\right)\left(\bm{1}\otimes{\bm{I}}_{d^{2}}\right), where 𝟏\bm{1} is the all-ones vector.

Proof.

(of Lemma 2) Throughout, we treat {\bm{E}}\widetilde{{\bm{P}}} as a block matrix:

𝑬𝑷~=[𝑴1𝑴N][𝑷1,1𝑰d2𝑷1,N𝑰d2𝑷N,1𝑰d2𝑷N,N𝑰d2]=[𝑷1,1𝑴1𝑷1,N𝑴1𝑷N,1𝑴N𝑷N,N𝑴N].{\bm{E}}\widetilde{{\bm{P}}}=\begin{bmatrix}{\bm{M}}_{1}&&\\ &\ddots&\\ &&{\bm{M}}_{N}\end{bmatrix}\begin{bmatrix}{\bm{P}}_{1,1}{\bm{I}}_{d^{2}}&\cdots&{\bm{P}}_{1,N}{\bm{I}}_{d^{2}}\\ \vdots&\ddots&\vdots\\ {\bm{P}}_{N,1}{\bm{I}}_{d^{2}}&\cdots&{\bm{P}}_{N,N}{\bm{I}}_{d^{2}}\end{bmatrix}=\begin{bmatrix}{\bm{P}}_{1,1}{\bm{M}}_{1}&\cdots&{\bm{P}}_{1,N}{\bm{M}}_{1}\\ \vdots&\ddots&\vdots\\ {\bm{P}}_{N,1}{\bm{M}}_{N}&\cdots&{\bm{P}}_{N,N}{\bm{M}}_{N}\end{bmatrix}.

I.e., the (u,v)(u,v)-th block of 𝑬𝑷~{\bm{E}}\widetilde{{\bm{P}}}, denoted by (𝑬𝑷~)u,v({\bm{E}}\widetilde{{\bm{P}}})_{u,v}, is 𝑷u,v𝑴u{\bm{P}}_{u,v}{\bm{M}}_{u}.

𝔼v1,,vk[j=1k𝑴vj]=v1,,vkϕv1𝑷v1,v2𝑷vk1,vkj=1k𝑴vj=v1ϕv1v2(𝑷v1,v2𝑴v1)vk(𝑷vk1,vk𝑴vk1)𝑴vk=v1ϕv1v2(𝑬𝑷~)v1,v2v3(𝑬𝑷~)v2,v3vk(𝑬𝑷~𝑬)vk1,vk=v1ϕv1vk((𝑬𝑷~)k1𝑬)v1,vk=(ϕ𝑰d2)((𝑬𝑷~)k1𝑬)(𝟏𝑰d2)\begin{split}\mathbb{E}_{v_{1},\cdots,v_{k}}\left[\prod_{j=1}^{k}{\bm{M}}_{v_{j}}\right]&=\sum_{v_{1},\cdots,v_{k}}{\bm{\phi}}_{v_{1}}{\bm{P}}_{v_{1},v_{2}}\cdots{\bm{P}}_{v_{k-1},v_{k}}\prod_{j=1}^{k}{\bm{M}}_{v_{j}}\\ &=\sum_{v_{1}}{\bm{\phi}}_{v_{1}}\sum_{v_{2}}\left({\bm{P}}_{v_{1},v_{2}}{\bm{M}}_{v_{1}}\right)\cdots\sum_{v_{k}}\left({\bm{P}}_{v_{k-1},v_{k}}{\bm{M}}_{v_{k-1}}\right){\bm{M}}_{v_{k}}\\ &=\sum_{v_{1}}{\bm{\phi}}_{v_{1}}\sum_{v_{2}}({\bm{E}}\widetilde{{\bm{P}}})_{v_{1},v_{2}}\sum_{v_{3}}({\bm{E}}\widetilde{{\bm{P}}})_{v_{2},v_{3}}\cdots\sum_{v_{k}}({\bm{E}}\widetilde{{\bm{P}}}{\bm{E}})_{v_{k-1},v_{k}}\\ &=\sum_{v_{1}}{\bm{\phi}}_{v_{1}}\sum_{v_{k}}\left(({\bm{E}}\widetilde{{\bm{P}}})^{k-1}{\bm{E}}\right)_{v_{1},v_{k}}=\left({\bm{\phi}}\otimes{\bm{I}}_{d^{2}}\right)^{\top}\left(({\bm{E}}\widetilde{{\bm{P}}})^{k-1}{\bm{E}}\right)\left(\bm{1}\otimes{\bm{I}}_{d^{2}}\right)\end{split}

Given Lemma 2, Equation 8 becomes:

\begin{split}&\mathbb{E}_{v_{1},\cdots,v_{k}}\left[\operatorname{Tr}\left[\prod_{j=1}^{k}\exp{\left(\frac{tf(v_{j})(\gamma+{\mathrm{i}}b)}{2}\right)}\prod_{j=k}^{1}\exp{\left(\frac{tf(v_{j})(\gamma-{\mathrm{i}}b)}{2}\right)}\right]\right]\\ =&\left\langle\mathbb{E}_{v_{1},\cdots,v_{k}}\left[\prod_{j=1}^{k}{\bm{M}}_{v_{j}}\right]\operatorname{vec}({\bm{I}}_{d}),\operatorname{vec}({\bm{I}}_{d})\right\rangle\\ =&\left\langle\left({\bm{\phi}}\otimes{\bm{I}}_{d^{2}}\right)^{\top}\left(({\bm{E}}\widetilde{{\bm{P}}})^{k-1}{\bm{E}}\right)\left(\bm{1}\otimes{\bm{I}}_{d^{2}}\right)\operatorname{vec}({\bm{I}}_{d}),\operatorname{vec}({\bm{I}}_{d})\right\rangle\\ =&\left\langle\left(({\bm{E}}\widetilde{{\bm{P}}})^{k-1}{\bm{E}}\right)\left(\bm{1}\otimes{\bm{I}}_{d^{2}}\right)\operatorname{vec}({\bm{I}}_{d}),\left({\bm{\phi}}\otimes{\bm{I}}_{d^{2}}\right)\operatorname{vec}({\bm{I}}_{d})\right\rangle\\ =&\left\langle\left(({\bm{E}}\widetilde{{\bm{P}}})^{k-1}{\bm{E}}\right)\left(\bm{1}\otimes\operatorname{vec}({\bm{I}}_{d})\right),{\bm{\phi}}\otimes\operatorname{vec}({\bm{I}}_{d})\right\rangle\end{split}

The third equality is due to \langle x,{\bm{A}}y\rangle=\langle{\bm{A}}^{\ast}x,y\rangle. The fourth equality follows by setting {\bm{C}}=1 (a scalar) in ({\bm{A}}\otimes{\bm{B}})({\bm{C}}\otimes{\bm{D}})=({\bm{A}}{\bm{C}})\otimes({\bm{B}}{\bm{D}}). Then

𝔼v1,,vk[Tr[j=1kexp(tf(vj)(γ+ib)2)j=k1exp(tf(vj)(γib)2)]]=((𝑬𝑷~)k1𝑬)(𝟏vec(𝑰d)),ϕvec(𝑰d)=(ϕvec(𝑰d))((𝑬𝑷~)k1𝑬)(𝟏vec(𝑰d))=(ϕvec(𝑰d))((𝑬𝑷~)k1𝑬)((𝑷𝚷1𝝅)(𝑰d2𝑰d2vec(𝑰d)))=(ϕvec(𝑰d))(𝑬𝑷~)k(𝚷1𝑰d2)(𝝅vec(𝑰d))𝝅vec(𝑰d),𝒛k𝝅,\begin{split}&\mathbb{E}_{v_{1},\cdots,v_{k}}\left[\operatorname{Tr}\left[\prod_{j=1}^{k}\exp{\left(\frac{tf(v_{j})(\gamma+{\mathrm{i}}b)}{2}\right)}\prod_{j=k}^{1}\exp{\left(\frac{tf(v_{j})(\gamma-{\mathrm{i}}b)}{2}\right)}\right]\right]\\ =&\left\langle\left(({\bm{E}}\widetilde{{\bm{P}}})^{k-1}{\bm{E}}\right)\left(\bm{1}\otimes\operatorname{vec}({\bm{I}}_{d})\right),{\bm{\phi}}\otimes\operatorname{vec}({\bm{I}}_{d})\right\rangle\\ =&({\bm{\phi}}\otimes\operatorname{vec}({\bm{I}}_{d}))^{\ast}\left(({\bm{E}}\widetilde{{\bm{P}}})^{k-1}{\bm{E}}\right)(\bm{1}\otimes\operatorname{vec}({\bm{I}}_{d}))\\ =&({\bm{\phi}}\otimes\operatorname{vec}({\bm{I}}_{d}))^{\ast}\left(({\bm{E}}\widetilde{{\bm{P}}})^{k-1}{\bm{E}}\right)\left(\left({\bm{P}}{\bm{\Pi}}^{-1}{\bm{\pi}}\right)\otimes\left({\bm{I}}_{d^{2}}{\bm{I}}_{d^{2}}\operatorname{vec}({\bm{I}}_{d})\right)\right)\\ =&({\bm{\phi}}\otimes\operatorname{vec}({\bm{I}}_{d}))^{\ast}\left({\bm{E}}\widetilde{{\bm{P}}}\right)^{k}\left({\bm{\Pi}}^{-1}\otimes{\bm{I}}_{d^{2}}\right)({\bm{\pi}}\otimes\operatorname{vec}({\bm{I}}_{d}))\triangleq\left\langle{\bm{\pi}}\otimes\operatorname{vec}({\bm{I}}_{d}),{\bm{z}}_{k}\right\rangle_{\bm{\pi}},\end{split}

where we define 𝒛0=ϕvec(𝑰d){\bm{z}}_{0}={\bm{\phi}}\otimes\operatorname{vec}({\bm{I}}_{d}) and 𝒛k=(𝒛0(𝑬𝑷~)k)=(𝒛k1𝑬𝑷~){\bm{z}}_{k}=\left({\bm{z}}_{0}^{\ast}\left({\bm{E}}\widetilde{{\bm{P}}}\right)^{k}\right)^{\ast}=\left({\bm{z}}_{k-1}^{\ast}{\bm{E}}\widetilde{{\bm{P}}}\right)^{\ast}. Moreover, by Remark 2, we have 𝝅vec(𝑰d)𝝅=𝝅𝝅vec(𝑰d)2=d\left\lVert{\bm{\pi}}\otimes\operatorname{vec}({\bm{I}}_{d})\right\rVert_{{\bm{\pi}}}=\left\lVert{\bm{\pi}}\right\rVert_{{\bm{\pi}}}\left\lVert\operatorname{vec}({\bm{I}}_{d})\right\rVert_{2}=\sqrt{d} and 𝒛0𝝅=ϕvec(𝑰d)𝝅=ϕ𝝅vec(𝑰d)2=ϕ𝝅d\left\lVert{\bm{z}}_{0}\right\rVert_{{\bm{\pi}}}=\left\lVert{\bm{\phi}}\otimes\operatorname{vec}({\bm{I}}_{d})\right\rVert_{{\bm{\pi}}}=\left\lVert{\bm{\phi}}\right\rVert_{{\bm{\pi}}}\left\lVert\operatorname{vec}({\bm{I}}_{d})\right\rVert_{2}=\left\lVert{\bm{\phi}}\right\rVert_{{\bm{\pi}}}\sqrt{d}

Definition 2.

Define linear subspace 𝒰={𝛑𝐰,𝐰d2}\mathcal{U}=\left\{{\bm{\pi}}\otimes{\bm{w}},{\bm{w}}\in{\mathbb{C}}^{d^{2}}\right\}.

Remark 3.

{𝝅𝒆i,i[d2]}\{{\bm{\pi}}\otimes{\bm{e}}_{i},i\in[d^{2}]\} is an orthonormal basis of 𝒰\mathcal{U}. This is because 𝛑𝐞i,𝛑𝐞j𝛑=𝛑,𝛑𝛑𝐞i,𝐞j=δij\langle{\bm{\pi}}\otimes{\bm{e}}_{i},{\bm{\pi}}\otimes{\bm{e}}_{j}\rangle_{\bm{\pi}}=\langle{\bm{\pi}},{\bm{\pi}}\rangle_{\bm{\pi}}\langle{\bm{e}}_{i},{\bm{e}}_{j}\rangle=\delta_{ij} by Remark 1, where δij\delta_{ij} is the Kronecker delta.

Remark 4.

Given {\bm{x}}={\bm{y}}\otimes{\bm{w}}, the projection of {\bm{x}} onto \mathcal{U} is {\bm{x}}^{\parallel}=(\bm{1}^{\ast}{\bm{y}})({\bm{\pi}}\otimes{\bm{w}}). This is because

𝒙=i=1d2𝒚𝒘,𝝅𝒆i𝝅(𝝅𝒆i)=i=1d2𝒚,𝝅𝝅𝒘,𝒆i(𝝅𝒆i)=(𝟏𝒚)(𝝅𝒘).\begin{split}{\bm{x}}^{\parallel}&=\sum_{i=1}^{d^{2}}\langle{\bm{y}}\otimes{\bm{w}},{\bm{\pi}}\otimes{\bm{e}}_{i}\rangle_{\bm{\pi}}({\bm{\pi}}\otimes{\bm{e}}_{i})=\sum_{i=1}^{d^{2}}\langle{\bm{y}},{\bm{\pi}}\rangle_{{\bm{\pi}}}\langle{\bm{w}},{\bm{e}}_{i}\rangle({\bm{\pi}}\otimes{\bm{e}}_{i})=(\bm{1}^{\ast}{\bm{y}})({\bm{\pi}}\otimes{\bm{w}}).\end{split}
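
The projection formula of Remark 4 can be checked against the orthonormal basis of Remark 3; the following is a small numerical sketch assuming numpy.

import numpy as np

rng = np.random.default_rng(4)
N, d2 = 5, 3
pi = rng.random(N) + 0.1; pi /= pi.sum()
K = np.kron(np.diag(1.0 / pi), np.eye(d2))           # Pi^{-1} (x) I_{d^2}
inner = lambda a, b: b @ K @ a                       # <a, b>_pi (real vectors here)

y, w = rng.standard_normal(N), rng.standard_normal(d2)
x = np.kron(y, w)

# orthonormal basis {pi (x) e_i} of U (Remark 3) and the projection of x onto U
basis = [np.kron(pi, e) for e in np.eye(d2)]
x_par = sum(inner(x, b) * b for b in basis)
assert np.allclose(x_par, (np.ones(N) @ y) * np.kron(pi, w))   # Remark 4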

We want to bound

𝝅vec(𝑰d),𝒛k𝝅=𝝅vec(𝑰d),𝒛k+𝒛k𝝅=𝝅vec(𝑰d),𝒛k𝝅𝝅vec(𝑰d)𝝅𝒛k𝝅=d𝒛k𝝅.\begin{split}\left\langle{\bm{\pi}}\otimes\operatorname{vec}({\bm{I}}_{d}),{\bm{z}}_{k}\right\rangle_{{\bm{\pi}}}&=\left\langle{\bm{\pi}}\otimes\operatorname{vec}({\bm{I}}_{d}),{\bm{z}}^{\perp}_{k}+{\bm{z}}^{\parallel}_{k}\right\rangle_{{\bm{\pi}}}=\left\langle{\bm{\pi}}\otimes\operatorname{vec}({\bm{I}}_{d}),{\bm{z}}^{\parallel}_{k}\right\rangle_{{\bm{\pi}}}\\ &\leq\left\lVert{\bm{\pi}}\otimes\operatorname{vec}({\bm{I}}_{d})\right\rVert_{{\bm{\pi}}}\left\lVert{\bm{z}}_{k}^{\parallel}\right\rVert_{{\bm{\pi}}}=\sqrt{d}\left\lVert{\bm{z}}_{k}^{\parallel}\right\rVert_{{\bm{\pi}}}.\end{split}

As 𝒛k{\bm{z}}_{k} can be expressed as recursively applying operator 𝑬{\bm{E}} and 𝑷~\widetilde{{\bm{P}}} on 𝒛0{\bm{z}}_{0}, we turn to analyze the effects of 𝑬{\bm{E}} and 𝑷~\widetilde{{\bm{P}}} operators.

Definition 3.

The spectral expansion of \widetilde{{\bm{P}}} is defined as \lambda(\widetilde{{\bm{P}}})\triangleq\max_{{\bm{x}}\perp\mathcal{U},{\bm{x}}\neq 0}\frac{\left\lVert\left({\bm{x}}^{\ast}\widetilde{{\bm{P}}}\right)^{\ast}\right\rVert_{{\bm{\pi}}}}{\left\lVert{\bm{x}}\right\rVert_{{\bm{\pi}}}}.

Lemma 3.

λ(𝑷)=λ(𝑷~)\lambda({\bm{P}})=\lambda(\widetilde{{\bm{P}}}).

Proof.

We first show \lambda{(\widetilde{{\bm{P}}})}\geq\lambda{({\bm{P}})}. Suppose the maximizer of \lambda({\bm{P}})\triangleq\max_{{\bm{y}}\perp{\bm{\pi}},{\bm{y}}\neq 0}\frac{\left\lVert\left({\bm{y}}^{\ast}{\bm{P}}\right)^{\ast}\right\rVert_{{\bm{\pi}}}}{\left\lVert{\bm{y}}\right\rVert_{{\bm{\pi}}}} is {\bm{y}}\in\mathbb{C}^{N}, i.e., \left\lVert\left({\bm{y}}^{\ast}{\bm{P}}\right)^{\ast}\right\rVert_{{\bm{\pi}}}=\lambda({\bm{P}})\left\lVert{\bm{y}}\right\rVert_{{\bm{\pi}}}. Construct {\bm{x}}={\bm{y}}\otimes{\bm{o}} for an arbitrary non-zero {\bm{o}}\in\mathbb{C}^{d^{2}}. It is easy to check that {\bm{x}}\perp\mathcal{U}, because \langle{\bm{x}},{\bm{\pi}}\otimes{\bm{w}}\rangle_{{\bm{\pi}}}=\langle{\bm{y}},{\bm{\pi}}\rangle_{{\bm{\pi}}}\langle{\bm{o}},{\bm{w}}\rangle=0, where the last equality is due to {\bm{y}}\perp{\bm{\pi}}. We can then bound \left\lVert\left({\bm{x}}^{\ast}\widetilde{{\bm{P}}}\right)^{\ast}\right\rVert_{{\bm{\pi}}} as follows:

(𝒙𝑷~)𝝅=𝑷~𝒙𝝅=(𝑷𝑰d2)(𝒚𝒐)𝝅=(𝑷𝒚)𝒐𝝅=(𝒚𝑷)𝝅𝒐2=λ(𝑷)𝒚𝝅𝒐2=λ(𝑷)𝒙𝝅,\begin{split}\left\lVert\left({\bm{x}}^{\ast}\widetilde{{\bm{P}}}\right)^{\ast}\right\rVert_{{\bm{\pi}}}&=\left\lVert\widetilde{{\bm{P}}}^{\ast}{\bm{x}}\right\rVert_{{\bm{\pi}}}=\left\lVert({\bm{P}}^{\ast}\otimes{\bm{I}}_{d^{2}})({\bm{y}}\otimes{\bm{o}})\right\rVert_{{\bm{\pi}}}=\left\lVert({\bm{P}}^{\ast}{\bm{y}})\otimes{\bm{o}}\right\rVert_{{\bm{\pi}}}\\ &=\left\lVert\left({\bm{y}}^{\ast}{\bm{P}}\right)^{\ast}\right\rVert_{{\bm{\pi}}}\left\lVert{\bm{o}}\right\rVert_{2}=\lambda({\bm{P}})\left\lVert{\bm{y}}\right\rVert_{{\bm{\pi}}}\left\lVert{\bm{o}}\right\rVert_{2}=\lambda({\bm{P}})\left\lVert{\bm{x}}\right\rVert_{{\bm{\pi}}},\end{split}

which indicates that for {\bm{x}}={\bm{y}}\otimes{\bm{o}}, \frac{\left\lVert\left({\bm{x}}^{\ast}\widetilde{{\bm{P}}}\right)^{\ast}\right\rVert_{{\bm{\pi}}}}{\left\lVert{\bm{x}}\right\rVert_{{\bm{\pi}}}}=\lambda{({\bm{P}})}. Taking the maximum over all admissible {\bm{x}} gives \lambda{(\widetilde{{\bm{P}}})}\geq\lambda{({\bm{P}})}.

Next, we show \lambda({\bm{P}})\geq\lambda(\widetilde{{\bm{P}}}). For any {\bm{x}}\in\mathbb{C}^{Nd^{2}} such that {\bm{x}}\perp\mathcal{U} and {\bm{x}}\neq 0, we can decompose it as

𝒙=[x1x2xNd2]=[x1xd2+1x(N1)d2+1]𝒆1+[x2xd2+2x(N1)d2+2]𝒆2++[xd2x2d2xNd2]𝒆d2i=1d2𝒙i𝒆i,\begin{split}{\bm{x}}&=\begin{bmatrix}x_{1}\\ x_{2}\\ \vdots\\ x_{Nd^{2}}\end{bmatrix}=\begin{bmatrix}x_{1}\\ x_{d^{2}+1}\\ \vdots\\ x_{(N-1)d^{2}+1}\end{bmatrix}\otimes{\bm{e}}_{1}+\begin{bmatrix}x_{2}\\ x_{d^{2}+2}\\ \vdots\\ x_{(N-1)d^{2}+2}\end{bmatrix}\otimes{\bm{e}}_{2}+\cdots+\begin{bmatrix}x_{d^{2}}\\ x_{2d^{2}}\\ \vdots\\ x_{Nd^{2}}\end{bmatrix}\otimes{\bm{e}}_{d^{2}}\triangleq\sum_{i=1}^{d^{2}}{\bm{x}}_{i}\otimes{\bm{e}}_{i},\end{split}

where we define {\bm{x}}_{i}\triangleq\begin{bmatrix}x_{i}&\cdots&x_{(N-1)d^{2}+i}\end{bmatrix}^{\top} for i\in[d^{2}]. We observe that {\bm{x}}_{i}\perp{\bm{\pi}} for every i\in[d^{2}], because for any j\in[d^{2}] we have

0=𝒙,𝝅𝒆j𝝅=i=1d2𝒙i𝒆i,𝝅𝒆j𝝅=i=1d2𝒙i𝒆i,𝝅𝒆j𝝅=i=1d2𝒙i,𝝅𝝅𝒆i,𝒆j=𝒙j,𝝅𝝅,0=\langle{\bm{x}},{\bm{\pi}}\otimes{\bm{e}}_{j}\rangle_{{\bm{\pi}}}=\left\langle\sum_{i=1}^{d^{2}}{\bm{x}}_{i}\otimes{\bm{e}}_{i},{\bm{\pi}}\otimes{\bm{e}}_{j}\right\rangle_{{\bm{\pi}}}=\sum_{i=1}^{d^{2}}\left\langle{\bm{x}}_{i}\otimes{\bm{e}}_{i},{\bm{\pi}}\otimes{\bm{e}}_{j}\right\rangle_{{\bm{\pi}}}=\sum_{i=1}^{d^{2}}\langle{\bm{x}}_{i},{\bm{\pi}}\rangle_{{\bm{\pi}}}\langle{\bm{e}}_{i},{\bm{e}}_{j}\rangle=\langle{\bm{x}}_{j},{\bm{\pi}}\rangle_{{\bm{\pi}}},

which indicates {\bm{x}}_{j}\perp{\bm{\pi}} for every j\in[d^{2}]. Furthermore, the vectors {\bm{x}}_{i}\otimes{\bm{e}}_{i}, i\in[d^{2}], are pairwise orthogonal, because for i\neq j, \langle{\bm{x}}_{i}\otimes{\bm{e}}_{i},{\bm{x}}_{j}\otimes{\bm{e}}_{j}\rangle_{{\bm{\pi}}}=\langle{\bm{x}}_{i},{\bm{x}}_{j}\rangle_{{\bm{\pi}}}\langle{\bm{e}}_{i},{\bm{e}}_{j}\rangle=0. This allows us to apply the Pythagorean theorem: \left\lVert{\bm{x}}\right\rVert_{{\bm{\pi}}}^{2}=\sum_{i=1}^{d^{2}}\left\lVert{\bm{x}}_{i}\otimes{\bm{e}}_{i}\right\rVert_{{\bm{\pi}}}^{2}=\sum_{i=1}^{d^{2}}\left\lVert{\bm{x}}_{i}\right\rVert_{{\bm{\pi}}}^{2}\left\lVert{\bm{e}}_{i}\right\rVert_{2}^{2}.

We can decompose and analyze \left({\bm{x}}^{\ast}\widetilde{{\bm{P}}}\right)^{\ast} in a similar way:

(𝒙𝑷~)=𝑷~𝒙=i=1d2(𝑷𝑰d2)(𝒙i𝒆i)=i=1d2(𝑷𝒙i)𝒆i.\left({\bm{x}}^{\ast}\widetilde{{\bm{P}}}\right)^{\ast}=\widetilde{{\bm{P}}}^{\ast}{\bm{x}}=\sum_{i=1}^{d^{2}}({\bm{P}}^{\ast}\otimes{\bm{I}}_{d^{2}})({\bm{x}}_{i}\otimes{\bm{e}}_{i})=\sum_{i=1}^{d^{2}}({\bm{P}}^{\ast}{\bm{x}}_{i})\otimes{\bm{e}}_{i}.

where the vectors ({\bm{P}}^{\ast}{\bm{x}}_{i})\otimes{\bm{e}}_{i}, i\in[d^{2}], are again pairwise orthogonal, because for i\neq j we have \langle({\bm{P}}^{\ast}{\bm{x}}_{i})\otimes{\bm{e}}_{i},({\bm{P}}^{\ast}{\bm{x}}_{j})\otimes{\bm{e}}_{j}\rangle_{{\bm{\pi}}}=\langle{\bm{P}}^{\ast}{\bm{x}}_{i},{\bm{P}}^{\ast}{\bm{x}}_{j}\rangle_{{\bm{\pi}}}\langle{\bm{e}}_{i},{\bm{e}}_{j}\rangle=0. Applying the Pythagorean theorem again gives:

(𝒙𝑷~)𝝅2=i=1d2(𝑷𝒙i)𝒆i𝝅2=i=1d2(𝒙i𝑷)𝝅2𝒆i22i=1d2λ(𝑷)2𝒙i𝝅2𝒆i22=λ(𝑷)2(i=1d2𝒙i𝝅2𝒆i22)=λ(𝑷)2𝒙𝝅2,\begin{split}\left\lVert\left({\bm{x}}^{\ast}\widetilde{{\bm{P}}}\right)^{\ast}\right\rVert_{{\bm{\pi}}}^{2}&=\sum_{i=1}^{d^{2}}\left\lVert({\bm{P}}^{\ast}{\bm{x}}_{i})\otimes{\bm{e}}_{i}\right\rVert_{{\bm{\pi}}}^{2}=\sum_{i=1}^{d^{2}}\left\lVert\left({\bm{x}}_{i}^{\ast}{\bm{P}}\right)^{\ast}\right\rVert_{{\bm{\pi}}}^{2}\left\lVert{\bm{e}}_{i}\right\rVert_{2}^{2}\\ &\leq\sum_{i=1}^{d^{2}}\lambda{({\bm{P}})}^{2}\left\lVert{\bm{x}}_{i}\right\rVert_{{\bm{\pi}}}^{2}\left\lVert{\bm{e}}_{i}\right\rVert_{2}^{2}=\lambda{({\bm{P}})}^{2}\left(\sum_{i=1}^{d^{2}}\left\lVert{\bm{x}}_{i}\right\rVert_{{\bm{\pi}}}^{2}\left\lVert{\bm{e}}_{i}\right\rVert_{2}^{2}\right)=\lambda{({\bm{P}})}^{2}\left\lVert{\bm{x}}\right\rVert_{{\bm{\pi}}}^{2},\end{split}

which indicates that for any {\bm{x}} such that {\bm{x}}\perp\mathcal{U} and {\bm{x}}\neq 0, we have \frac{\left\lVert\left({\bm{x}}^{\ast}\widetilde{{\bm{P}}}\right)^{\ast}\right\rVert_{{\bm{\pi}}}}{\left\lVert{\bm{x}}\right\rVert_{{\bm{\pi}}}}\leq\lambda{({\bm{P}})}, or equivalently \lambda{(\widetilde{{\bm{P}}})}\leq\lambda{({\bm{P}})}.

Overall, we have shown both λ(𝑷~)λ(𝑷)\lambda{(\widetilde{{\bm{P}}})}\geq\lambda{({\bm{P}})} and λ(𝑷~)λ(𝑷)\lambda{(\widetilde{{\bm{P}}})}\leq\lambda{({\bm{P}})}. We conclude λ(𝑷~)=λ(𝑷)\lambda{(\widetilde{{\bm{P}}})}=\lambda{({\bm{P}})}. ∎
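
Lemma 3 can be checked numerically on a random regular chain. The sketch below (assuming numpy) evaluates the spectral expansion from Definition 3, once for {\bm{P}} and once for \widetilde{{\bm{P}}}={\bm{P}}\otimes{\bm{I}}_{d^{2}}, using the change of variables {\bm{u}}=({\bm{\Pi}}^{-1/2}\otimes{\bm{I}}){\bm{x}}, which turns the {\bm{\pi}}-norm into the 2-norm; the routine expansion is an illustrative helper of ours.

import numpy as np

rng = np.random.default_rng(3)
N, d2 = 4, 3                                         # d2 plays the role of d^2

P = rng.random((N, N)) + 0.1                         # a random regular chain (all entries positive)
P /= P.sum(axis=1, keepdims=True)
evals, evecs = np.linalg.eig(P.T)
pi = np.abs(np.real(evecs[:, np.argmax(np.real(evals))]))
pi /= pi.sum()                                       # stationary distribution of P

def expansion(A, w, U):
    """sup over x kernel-orthogonal to col(U) of ||A^T x||_w / ||x||_w,
    where ||x||_w^2 = x^T diag(w)^{-1} x."""
    s = np.sqrt(w)
    M = np.diag(1 / s) @ A.T @ np.diag(s)            # change of variables u = diag(1/s) x
    B = np.linalg.qr(np.diag(1 / s) @ U)[0]          # orthonormal basis of the transformed subspace
    return np.linalg.norm(M @ (np.eye(len(w)) - B @ B.T), 2)

lam_P = expansion(P, pi, pi.reshape(-1, 1))          # lambda(P), subspace spanned by pi
U_big = np.stack([np.kron(pi, e) for e in np.eye(d2)], axis=1)
lam_Pt = expansion(np.kron(P, np.eye(d2)), np.kron(pi, np.ones(d2)), U_big)
assert np.isclose(lam_P, lam_Pt)                     # Lemma 3: lambda(P~) = lambda(P)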

Lemma 4.

(The effect of the \widetilde{{\bm{P}}} operator.) This lemma is a generalization of Lemma 3.3 in [6].

  1. 1.

    𝒚𝒰\forall{\bm{y}}\in\mathcal{U}, then (𝒚𝑷~)=𝒚\left({\bm{y}}^{\ast}\widetilde{{\bm{P}}}\right)^{\ast}={\bm{y}}.

  2. 2.

    𝒚𝒰\forall{\bm{y}}\perp\mathcal{U}, then (𝒚𝑷~)𝒰\left({\bm{y}}^{\ast}\widetilde{{\bm{P}}}\right)^{\ast}\perp\mathcal{U}, and (𝒚𝑷~)𝝅λ𝒚𝝅\left\lVert\left({\bm{y}}^{\ast}\widetilde{{\bm{P}}}\right)^{\ast}\right\rVert_{{\bm{\pi}}}\leq\lambda\left\lVert{\bm{y}}\right\rVert_{{\bm{\pi}}}.

Proof.

We first prove Part 1 of Lemma 4. For any {\bm{y}}={\bm{\pi}}\otimes{\bm{w}}\in\mathcal{U}:

𝒚𝑷~=(𝝅𝒘)(𝑷𝑰d2)=(𝝅𝑷)(𝒘𝑰d2)=𝝅𝒘=𝒚,{\bm{y}}^{\ast}\widetilde{{\bm{P}}}=\left({\bm{\pi}}^{\ast}\otimes{\bm{w}}^{\ast}\right)({\bm{P}}\otimes{\bm{I}}_{d^{2}})=({\bm{\pi}}^{\ast}{\bm{P}})\otimes\left({\bm{w}}^{\ast}{\bm{I}}_{d^{2}}\right)={\bm{\pi}}^{\ast}\otimes{\bm{w}}^{\ast}={\bm{y}}^{\ast},

where the third equality is because {\bm{\pi}} is the stationary distribution. Next, we prove Part 2 of Lemma 4. Given {\bm{y}}\perp\mathcal{U}, we want to show ({\bm{y}}^{\ast}\widetilde{{\bm{P}}})^{\ast}\perp{\bm{\pi}}\otimes{\bm{w}} for every {\bm{w}}\in{\mathbb{C}}^{d^{2}}. This is true because

𝝅𝒘,(𝒚𝑷~)𝝅=𝒚𝑷~(𝚷1𝑰d2)(𝝅𝒘)=𝒚((𝑷𝚷1𝝅)𝒘)=𝒚((𝚷1𝝅)𝒘)=𝒚(𝚷1𝑰d2)(𝝅𝒘)=𝝅𝒘,𝒚𝝅=0.\begin{split}\left\langle{\bm{\pi}}\otimes{\bm{w}},({\bm{y}}^{\ast}\widetilde{{\bm{P}}})^{\ast}\right\rangle_{\bm{\pi}}=&{\bm{y}}^{\ast}\widetilde{{\bm{P}}}\left({\bm{\Pi}}^{-1}\otimes{\bm{I}}_{d^{2}}\right)({\bm{\pi}}\otimes{\bm{w}})={\bm{y}}^{\ast}\left(({\bm{P}}{\bm{\Pi}}^{-1}{\bm{\pi}})\otimes{\bm{w}}\right)={\bm{y}}^{\ast}\left(({\bm{\Pi}}^{-1}{\bm{\pi}})\otimes{\bm{w}}\right)\\ =&{\bm{y}}^{\ast}\left({\bm{\Pi}}^{-1}\otimes{\bm{I}}_{d^{2}}\right)({\bm{\pi}}\otimes{\bm{w}})=\langle{\bm{\pi}}\otimes{\bm{w}},{\bm{y}}\rangle_{\bm{\pi}}=0.\end{split}

The third equality is due to {\bm{P}}{\bm{\Pi}}^{-1}{\bm{\pi}}={\bm{P}}\bm{1}=\bm{1}={\bm{\Pi}}^{-1}{\bm{\pi}}. Moreover, \left\lVert\left({\bm{y}}^{\ast}\widetilde{{\bm{P}}}\right)^{\ast}\right\rVert_{{\bm{\pi}}}\leq\lambda\left\lVert{\bm{y}}\right\rVert_{{\bm{\pi}}} is simply a restatement of Definition 3 combined with Lemma 3 (\lambda(\widetilde{{\bm{P}}})=\lambda({\bm{P}})=\lambda). ∎

Remark 5.

Lemma 4 implies that for any {\bm{y}}\in{\mathbb{C}}^{Nd^{2}}:

  1. 1.

    ((𝒚𝑷~))=((𝒚𝑷~))+((𝒚𝑷~))=𝒚+𝟎=𝒚\left(\left({\bm{y}}^{\ast}\widetilde{{\bm{P}}}\right)^{\ast}\right)^{\parallel}=\left(\left({\bm{y}}^{\parallel\ast}\widetilde{{\bm{P}}}\right)^{\ast}\right)^{\parallel}+\left(\left({\bm{y}}^{\perp\ast}\widetilde{{\bm{P}}}\right)^{\ast}\right)^{\parallel}={\bm{y}}^{\parallel}+\bm{0}={\bm{y}}^{\parallel}

  2. 2.

    ((𝒚𝑷~))=((𝒚𝑷~))+((𝒚𝑷~))=𝟎+(𝒚𝑷~)=(𝒚𝑷~)\left(\left({\bm{y}}^{\ast}\widetilde{{\bm{P}}}\right)^{\ast}\right)^{\perp}=\left(\left({\bm{y}}^{\parallel\ast}\widetilde{{\bm{P}}}\right)^{\ast}\right)^{\perp}+\left(\left({\bm{y}}^{\perp\ast}\widetilde{{\bm{P}}}\right)^{\ast}\right)^{\perp}=\bm{0}+\left({\bm{y}}^{\perp\ast}\widetilde{{\bm{P}}}\right)^{\ast}=\left({\bm{y}}^{\perp\ast}\widetilde{{\bm{P}}}\right)^{\ast}.

Lemma 5.

(The effect of the {\bm{E}} operator.) Given three parameters \lambda\in[0,1], \ell\geq 0, and t>0, let {\bm{P}} be a regular Markov chain on state space [N], with stationary distribution {\bm{\pi}} and spectral expansion \lambda. Suppose each state i\in[N] is assigned a matrix {\bm{H}}_{i}\in{\mathbb{C}}^{d^{2}\times d^{2}} such that \left\lVert{\bm{H}}_{i}\right\rVert_{2}\leq\ell and \sum_{i\in[N]}\pi_{i}{\bm{H}}_{i}=0. Let \widetilde{{\bm{P}}}={\bm{P}}\otimes{\bm{I}}_{d^{2}} and let {\bm{E}} denote the Nd^{2}\times Nd^{2} block-diagonal matrix whose i-th diagonal block is \exp{(t{\bm{H}}_{i})}, i.e., {\bm{E}}=\operatorname{diag}(\exp{(t{\bm{H}}_{1})},\cdots,\exp{(t{\bm{H}}_{N})}). Then for any {\bm{z}}\in{\mathbb{C}}^{Nd^{2}}, we have:

  1. 1.

    ((𝒛𝑬𝑷~))𝝅α1𝒛𝝅\left\lVert\left(\left({\bm{z}}^{\parallel\ast}{\bm{E}}\widetilde{{\bm{P}}}\right)^{\ast}\right)^{\parallel}\right\rVert_{{\bm{\pi}}}\leq\alpha_{1}\left\lVert{\bm{z}}^{\parallel}\right\rVert_{{\bm{\pi}}}, where α1=exp(t)t\alpha_{1}=\exp{(t\ell)}-t\ell.

  2. 2.

    ((𝒛𝑬𝑷~))𝝅α2𝒛𝝅\left\lVert\left(\left({\bm{z}}^{\parallel\ast}{\bm{E}}\widetilde{{\bm{P}}}\right)^{\ast}\right)^{\perp}\right\rVert_{{\bm{\pi}}}\leq\alpha_{2}\left\lVert{\bm{z}}^{\parallel}\right\rVert_{{\bm{\pi}}}, where α2=λ(exp(t)1)\alpha_{2}=\lambda(\exp{(t\ell)}-1).

  3. 3.

    ((𝒛𝑬𝑷~))𝝅α3𝒛𝝅\left\lVert\left(\left({\bm{z}}^{\perp\ast}{\bm{E}}\widetilde{{\bm{P}}}\right)^{\ast}\right)^{\parallel}\right\rVert_{{\bm{\pi}}}\leq\alpha_{3}\left\lVert{\bm{z}}^{\perp}\right\rVert_{{\bm{\pi}}}, where α3=exp(t)1\alpha_{3}=\exp{(t\ell)}-1.

  4. 4.

    ((𝒛𝑬𝑷~))𝝅α4𝒛𝝅\left\lVert\left(\left({\bm{z}}^{\perp\ast}{\bm{E}}\widetilde{{\bm{P}}}\right)^{\ast}\right)^{\perp}\right\rVert_{{\bm{\pi}}}\leq\alpha_{4}\left\lVert{\bm{z}}^{\perp}\right\rVert_{{\bm{\pi}}}, where α4=λexp(t)\alpha_{4}=\lambda\exp{(t\ell)}.

Proof.

(of Lemma 5) We first show that, for 𝒛=𝒚𝒘{\bm{z}}={\bm{y}}\otimes{\bm{w}},

(𝒛𝑬)=𝑬𝒛=[exp(t𝑯1)exp(t𝑯N)][y1𝒘yN𝒘]=[y1exp(t𝑯1)𝒘yNexp(t𝑯N)𝒘]=[y1exp(t𝑯1)𝒘0]++[0yNexp(t𝑯N)𝒘]=i=1Nyi(𝒆i(exp(t𝑯i)𝒘)).\begin{split}\left({\bm{z}}^{\ast}{\bm{E}}\right)^{\ast}={\bm{E}}^{\ast}{\bm{z}}&=\begin{bmatrix}\exp(t{\bm{H}}^{\ast}_{1})&&\\ &\ddots&\\ &&\exp(t{\bm{H}}^{\ast}_{N})\end{bmatrix}\begin{bmatrix}y_{1}{\bm{w}}\\ \vdots\\ y_{N}{\bm{w}}\end{bmatrix}=\begin{bmatrix}y_{1}\exp(t{\bm{H}}^{\ast}_{1}){\bm{w}}\\ \vdots\\ y_{N}\exp(t{\bm{H}}^{\ast}_{N}){\bm{w}}\end{bmatrix}\\ &=\begin{bmatrix}y_{1}\exp(t{\bm{H}}^{\ast}_{1}){\bm{w}}\\ \vdots\\ 0\end{bmatrix}+\cdots+\begin{bmatrix}0\\ \vdots\\ y_{N}\exp(t{\bm{H}}^{\ast}_{N}){\bm{w}}\end{bmatrix}=\sum_{i=1}^{N}y_{i}\left({\bm{e}}_{i}\otimes(\exp(t{\bm{H}}^{\ast}_{i}){\bm{w}})\right).\end{split}

Due to the linearity of projection,

((𝒛𝑬))=i=1Nyi(𝒆i(exp(t𝑯i)𝒘))=i=1Nyi(𝟏𝒆i)(𝝅(exp(t𝑯i)𝒘))=𝝅(i=1Nyiexp(t𝑯i)𝒘),\begin{split}\left(\left({\bm{z}}^{\ast}{\bm{E}}\right)^{\ast}\right)^{\parallel}&=\sum_{i=1}^{N}y_{i}\left({\bm{e}}_{i}\otimes(\exp(t{\bm{H}}^{\ast}_{i}){\bm{w}})\right)^{\parallel}=\sum_{i=1}^{N}y_{i}(\bm{1}^{\ast}{\bm{e}}_{i})\left({\bm{\pi}}\otimes(\exp(t{\bm{H}}^{\ast}_{i}){\bm{w}})\right)={\bm{\pi}}\otimes\left(\sum_{i=1}^{N}y_{i}\exp(t{\bm{H}}^{\ast}_{i}){\bm{w}}\right),\end{split} (9)

where the second equality follows from Remark 4.

Proof of Lemma 5, Part 1 First, we can bound \left\lVert\sum_{i=1}^{N}\pi_{i}\exp(t{\bm{H}}^{\ast}_{i})\right\rVert_{2} by

\begin{split}\left\lVert\sum_{i=1}^{N}\pi_{i}\exp(t{\bm{H}}^{\ast}_{i})\right\rVert_{2}&=\left\lVert\sum_{i=1}^{N}\pi_{i}\exp(t{\bm{H}}_{i})\right\rVert_{2}=\left\lVert\sum_{i=1}^{N}\pi_{i}\sum_{j=0}^{+\infty}\frac{t^{j}{\bm{H}}_{i}^{j}}{j!}\right\rVert_{2}=\left\lVert{\bm{I}}+\sum_{i=1}^{N}\pi_{i}\sum_{j=2}^{+\infty}\frac{t^{j}{\bm{H}}_{i}^{j}}{j!}\right\rVert_{2}\\ &\leq 1+\sum_{i=1}^{N}\pi_{i}\sum_{j=2}^{+\infty}\frac{t^{j}\left\lVert{\bm{H}}_{i}\right\rVert_{2}^{j}}{j!}\leq 1+\sum_{i=1}^{N}\pi_{i}\sum_{j=2}^{+\infty}\frac{(t\ell)^{j}}{j!}=\exp{(t\ell)}-t\ell,\end{split}

where the first step follows from \left\lVert{\bm{A}}\right\rVert_{2}=\left\lVert{\bm{A}}^{\ast}\right\rVert_{2}, the second step follows from the definition of the matrix exponential, the third step follows from \sum_{i\in[N]}\pi_{i}{\bm{H}}_{i}=0, and the fourth step follows from the triangle inequality. Given the above bound, for any {\bm{z}}^{\parallel}, which can be written as {\bm{z}}^{\parallel}={\bm{\pi}}\otimes{\bm{w}} for some {\bm{w}}\in{\mathbb{C}}^{d^{2}}, we have

((𝒛𝑬𝑷~))𝝅=((𝒛𝑬))𝝅=𝝅(i=1Nπiexp(t𝑯i)𝒘)𝝅=𝝅𝝅i=1Nπiexp(t𝑯i)𝒘2𝝅𝝅i=1Nπiexp(t𝑯i)2𝒘2=i=1Nπiexp(t𝑯i)2𝒛𝝅(exp(t)t)𝒛𝝅,\begin{split}\left\lVert\left(\left({\bm{z}}^{\parallel\ast}{\bm{E}}\widetilde{{\bm{P}}}\right)^{\ast}\right)^{\parallel}\right\rVert_{{\bm{\pi}}}&=\left\lVert\left(\left({\bm{z}}^{\parallel\ast}{\bm{E}}\right)^{\ast}\right)^{\parallel}\right\rVert_{{\bm{\pi}}}=\left\lVert{\bm{\pi}}\otimes\left(\sum_{i=1}^{N}\pi_{i}\exp(t{\bm{H}}^{\ast}_{i}){\bm{w}}\right)\right\rVert_{{\bm{\pi}}}=\left\lVert{\bm{\pi}}\right\rVert_{{\bm{\pi}}}\left\lVert\sum_{i=1}^{N}\pi_{i}\exp(t{\bm{H}}^{\ast}_{i}){\bm{w}}\right\rVert_{2}\\ &\leq\left\lVert{\bm{\pi}}\right\rVert_{{\bm{\pi}}}\left\lVert\sum_{i=1}^{N}\pi_{i}\exp(t{\bm{H}}^{\ast}_{i})\right\rVert_{2}\left\lVert{\bm{w}}\right\rVert_{2}=\left\lVert\sum_{i=1}^{N}\pi_{i}\exp(t{\bm{H}}^{\ast}_{i})\right\rVert_{2}\left\lVert{\bm{z}}^{\parallel}\right\rVert_{{\bm{\pi}}}\\ &\leq\left(\exp{(t\ell)}-t\ell\right)\left\lVert{\bm{z}}^{\parallel}\right\rVert_{{\bm{\pi}}},\end{split}

where step one follows by Part 1 of Remark 5 and step two follows by Equation 9.

Proof of Lemma 5, Part 2 For any {\bm{z}}\in\mathbb{C}^{Nd^{2}}, we can write it as a block vector:

𝒛=[𝒛1𝒛N]=[𝒛10]++[0𝒛N]=i=1N𝒆i𝒛i,{\bm{z}}=\begin{bmatrix}{\bm{z}}_{1}\\ \vdots\\ {\bm{z}}_{N}\end{bmatrix}=\begin{bmatrix}{\bm{z}}_{1}\\ \vdots\\ 0\end{bmatrix}+\cdots+\begin{bmatrix}0\\ \vdots\\ {\bm{z}}_{N}\end{bmatrix}=\sum_{i=1}^{N}{\bm{e}}_{i}\otimes{\bm{z}}_{i},

where each {\bm{z}}_{i}\in{\mathbb{C}}^{d^{2}}. Note that the above decomposition is pairwise orthogonal. Applying the Pythagorean theorem gives \left\lVert{\bm{z}}\right\rVert_{{\bm{\pi}}}^{2}=\sum_{i=1}^{N}\left\lVert{\bm{e}}_{i}\otimes{\bm{z}}_{i}\right\rVert_{{\bm{\pi}}}^{2}=\sum_{i=1}^{N}\left\lVert{\bm{e}}_{i}\right\rVert_{{\bm{\pi}}}^{2}\left\lVert{\bm{z}}_{i}\right\rVert_{2}^{2}. Similarly, we can decompose ({\bm{E}}^{\ast}-{\bm{I}}_{Nd^{2}}){\bm{z}} as follows:

(𝑬𝑰Nd2)𝒛=[exp(t𝑯1)𝑰d2exp(t𝑯N)𝑰d2][𝒛1𝒛N]=[(exp(t𝑯1)𝑰d2)𝒛1(exp(t𝑯N)𝑰d2)𝒛N]=[(exp(t𝑯1)𝑰d2)𝒛10]++[0(exp(t𝑯N)𝑰d2)𝒛N]=i=1N𝒆i((exp(t𝑯i)𝑰d2)𝒛i).\begin{split}({\bm{E}}^{\ast}-{\bm{I}}_{Nd^{2}}){\bm{z}}&=\begin{bmatrix}\exp(t{\bm{H}}^{\ast}_{1})-{\bm{I}}_{d^{2}}&&\\ &\ddots&\\ &&\exp(t{\bm{H}}^{\ast}_{N})-{\bm{I}}_{d^{2}}\end{bmatrix}\begin{bmatrix}{\bm{z}}_{1}\\ \vdots\\ {\bm{z}}_{N}\end{bmatrix}=\begin{bmatrix}(\exp(t{\bm{H}}^{\ast}_{1})-{\bm{I}}_{d^{2}}){\bm{z}}_{1}\\ \vdots\\ (\exp(t{\bm{H}}^{\ast}_{N})-{\bm{I}}_{d^{2}}){\bm{z}}_{N}\end{bmatrix}\\ &=\begin{bmatrix}(\exp(t{\bm{H}}^{\ast}_{1})-{\bm{I}}_{d^{2}}){\bm{z}}_{1}\\ \vdots\\ 0\end{bmatrix}+\cdots+\begin{bmatrix}0\\ \vdots\\ (\exp(t{\bm{H}}^{\ast}_{N})-{\bm{I}}_{d^{2}}){\bm{z}}_{N}\end{bmatrix}\\ &=\sum_{i=1}^{N}{\bm{e}}_{i}\otimes\left((\exp(t{\bm{H}}^{\ast}_{i})-{\bm{I}}_{d^{2}}){\bm{z}}_{i}\right).\end{split} (10)

Note that the above decomposition is pairwise orthogonal, too. Applying the Pythagorean theorem gives

(𝑬𝑰Nd2)𝒛𝝅2=i=1N𝒆i((exp(t𝑯i)𝑰d2)𝒛i)𝝅2=i=1N𝒆i𝝅2(exp(t𝑯i)𝑰d2)𝒛i22i=1N𝒆i𝝅2exp(t𝑯i)𝑰d222𝒛i22maxi[N]exp(t𝑯i)𝑰d222i=1N𝒆i𝝅2𝒛i22=maxi[N]exp(t𝑯i)𝑰d222𝒛𝝅2=maxi[N]exp(t𝑯i)𝑰d222𝒛𝝅2,\begin{split}\left\lVert({\bm{E}}^{\ast}-{\bm{I}}_{Nd^{2}}){\bm{z}}\right\rVert_{{\bm{\pi}}}^{2}&=\sum_{i=1}^{N}\left\lVert{\bm{e}}_{i}\otimes\left((\exp(t{\bm{H}}^{\ast}_{i})-{\bm{I}}_{d^{2}}){\bm{z}}_{i}\right)\right\rVert_{{\bm{\pi}}}^{2}=\sum_{i=1}^{N}\left\lVert{\bm{e}}_{i}\right\rVert_{{\bm{\pi}}}^{2}\left\lVert(\exp(t{\bm{H}}^{\ast}_{i})-{\bm{I}}_{d^{2}}){\bm{z}}_{i}\right\rVert_{2}^{2}\\ &\leq\sum_{i=1}^{N}\left\lVert{\bm{e}}_{i}\right\rVert_{{\bm{\pi}}}^{2}\left\lVert\exp(t{\bm{H}}^{\ast}_{i})-{\bm{I}}_{d^{2}}\right\rVert_{2}^{2}\left\lVert{\bm{z}}_{i}\right\rVert_{2}^{2}\leq\max_{i\in[N]}\left\lVert\exp(t{\bm{H}}^{\ast}_{i})-{\bm{I}}_{d^{2}}\right\rVert_{2}^{2}\sum_{i=1}^{N}\left\lVert{\bm{e}}_{i}\right\rVert_{{\bm{\pi}}}^{2}\left\lVert{\bm{z}}_{i}\right\rVert_{2}^{2}\\ &=\max_{i\in[N]}\left\lVert\exp(t{\bm{H}}^{\ast}_{i})-{\bm{I}}_{d^{2}}\right\rVert_{2}^{2}\left\lVert{\bm{z}}\right\rVert_{{\bm{\pi}}}^{2}=\max_{i\in[N]}\left\lVert\exp(t{\bm{H}}_{i})-{\bm{I}}_{d^{2}}\right\rVert_{2}^{2}\left\lVert{\bm{z}}\right\rVert_{{\bm{\pi}}}^{2},\end{split}

which indicates

\begin{split}\left\lVert({\bm{E}}^{\ast}-{\bm{I}}_{Nd^{2}}){\bm{z}}\right\rVert_{{\bm{\pi}}}&\leq\max_{i\in[N]}\left\lVert\exp(t{\bm{H}}_{i})-{\bm{I}}_{d^{2}}\right\rVert_{2}\left\lVert{\bm{z}}\right\rVert_{{\bm{\pi}}}=\max_{i\in[N]}\left\lVert\sum_{j=1}^{+\infty}\frac{t^{j}{\bm{H}}_{i}^{j}}{j!}\right\rVert_{2}\left\lVert{\bm{z}}\right\rVert_{{\bm{\pi}}}\\ &\leq\left(\sum_{j=1}^{+\infty}\frac{t^{j}\ell^{j}}{j!}\right)\left\lVert{\bm{z}}\right\rVert_{{\bm{\pi}}}=(\exp{(t\ell)}-1)\left\lVert{\bm{z}}\right\rVert_{{\bm{\pi}}}.\end{split}

Now we can formally prove Part 2 of Lemma 5 by:

((𝒛𝑬𝑷~))𝝅=((𝑬𝒛)𝑷~)𝝅λ(𝑬𝒛)𝝅=λ(𝑬𝒛𝒛+𝒛)𝝅=λ((𝑬𝑰Nd2)𝒛)𝝅λ(𝑬𝑰Nd2)𝒛𝝅λ(exp(t)1)𝒛𝝅.\begin{split}\left\lVert\left(\left({\bm{z}}^{\parallel\ast}{\bm{E}}\widetilde{{\bm{P}}}\right)^{\ast}\right)^{\perp}\right\rVert_{{\bm{\pi}}}&=\left\lVert\left(\left({\bm{E}}^{\ast}{\bm{z}}^{\parallel}\right)^{\perp\ast}\widetilde{{\bm{P}}}\right)^{\ast}\right\rVert_{{\bm{\pi}}}\leq\lambda\left\lVert\left({\bm{E}}^{\ast}{\bm{z}}^{\parallel}\right)^{\perp}\right\rVert_{{\bm{\pi}}}=\lambda\left\lVert\left({\bm{E}}^{\ast}{\bm{z}}^{\parallel}-{\bm{z}}^{\parallel}+{\bm{z}}^{\parallel}\right)^{\perp}\right\rVert_{{\bm{\pi}}}\\ &=\lambda\left\lVert\left(\left({\bm{E}}^{\ast}-{\bm{I}}_{Nd^{2}}\right){\bm{z}}^{\parallel}\right)^{\perp}\right\rVert_{{\bm{\pi}}}\leq\lambda\left\lVert\left({\bm{E}}^{\ast}-{\bm{I}}_{Nd^{2}}\right){\bm{z}}^{\parallel}\right\rVert_{{\bm{\pi}}}\leq\lambda(\exp{(t\ell)}-1)\left\lVert{\bm{z}}^{\parallel}\right\rVert_{{\bm{\pi}}}.\end{split}

The first step follows from Part 2 of Remark 5, the second step follows from Part 2 of Lemma 4, the fourth step is due to \left({\bm{z}}^{\parallel}\right)^{\perp}=\bm{0}, and the last step follows from the bound on \left\lVert({\bm{E}}^{\ast}-{\bm{I}}_{Nd^{2}}){\bm{z}}\right\rVert_{{\bm{\pi}}} derived above.

Proof of Lemma 5, Part 3 Note that

((𝒛𝑬𝑷~))𝝅=(𝑬𝒛)𝝅=(𝑬𝒛𝒛+𝒛)𝝅=((𝑬𝑰Nd2)𝒛)𝝅(𝑬𝑰Nd2)𝒛𝝅(exp(t)1)𝒛𝝅,\begin{split}\left\lVert\left(\left({\bm{z}}^{\perp\ast}{\bm{E}}\widetilde{{\bm{P}}}\right)^{\ast}\right)^{\parallel}\right\rVert_{{\bm{\pi}}}&=\left\lVert\left({\bm{E}}^{\ast}{\bm{z}}^{\perp}\right)^{\parallel}\right\rVert_{{\bm{\pi}}}=\left\lVert\left({\bm{E}}^{\ast}{\bm{z}}^{\perp}-{\bm{z}}^{\perp}+{\bm{z}}^{\perp}\right)^{\parallel}\right\rVert_{{\bm{\pi}}}=\left\lVert\left(({\bm{E}}^{\ast}-{\bm{I}}_{Nd^{2}}){\bm{z}}^{\perp}\right)^{\parallel}\right\rVert_{{\bm{\pi}}}\\ &\leq\left\lVert({\bm{E}}^{\ast}-{\bm{I}}_{Nd^{2}}){\bm{z}}^{\perp}\right\rVert_{{\bm{\pi}}}\leq(\exp{(t\ell)}-1)\left\lVert{\bm{z}}^{\perp}\right\rVert_{{\bm{\pi}}},\end{split}

where the first step follows from Part 1 of Remark 5, the third step follows from \left({\bm{z}}^{\perp}\right)^{\parallel}=\bm{0}, and the last step follows from the bound on \left\lVert({\bm{E}}^{\ast}-{\bm{I}}_{Nd^{2}}){\bm{z}}\right\rVert_{{\bm{\pi}}} established in the proof of Part 2.

Proof of Lemma 5, Part 4 Similarly to Equation 10, for any {\bm{z}}\in\mathbb{C}^{Nd^{2}}, we can decompose {\bm{E}}^{\ast}{\bm{z}} as {\bm{E}}^{\ast}{\bm{z}}=\sum_{i=1}^{N}{\bm{e}}_{i}\otimes(\exp(t{\bm{H}}^{\ast}_{i}){\bm{z}}_{i}). This decomposition is pairwise orthogonal, too. Applying the Pythagorean theorem gives

𝑬𝒛𝝅2=i=1N𝒆i(exp(t𝑯i)𝒛i)𝝅2=i=1N𝒆i𝝅2exp(t𝑯i)𝒛i22i=1N𝒆i𝝅2exp(t𝑯i)22𝒛i22maxi[N]exp(t𝑯i)22i=1N𝒆i𝝅2𝒛i22maxi[N]exp(t𝑯i2)2𝒛𝝅2exp(t)2𝒛𝝅2\begin{split}\left\lVert{\bm{E}}^{\ast}{\bm{z}}\right\rVert_{{\bm{\pi}}}^{2}&=\sum_{i=1}^{N}\left\lVert{\bm{e}}_{i}\otimes\left(\exp(t{\bm{H}}^{\ast}_{i}){\bm{z}}_{i}\right)\right\rVert_{{\bm{\pi}}}^{2}=\sum_{i=1}^{N}\left\lVert{\bm{e}}_{i}\right\rVert_{{\bm{\pi}}}^{2}\left\lVert\exp(t{\bm{H}}^{\ast}_{i}){\bm{z}}_{i}\right\rVert_{2}^{2}\leq\sum_{i=1}^{N}\left\lVert{\bm{e}}_{i}\right\rVert_{{\bm{\pi}}}^{2}\left\lVert\exp(t{\bm{H}}^{\ast}_{i})\right\rVert_{2}^{2}\left\lVert{\bm{z}}_{i}\right\rVert_{2}^{2}\\ &\leq\max_{i\in[N]}\left\lVert\exp(t{\bm{H}}^{\ast}_{i})\right\rVert_{2}^{2}\sum_{i=1}^{N}\left\lVert{\bm{e}}_{i}\right\rVert_{{\bm{\pi}}}^{2}\left\lVert{\bm{z}}_{i}\right\rVert_{2}^{2}\leq\max_{i\in[N]}\exp{\left(\left\lVert t{\bm{H}}^{\ast}_{i}\right\rVert_{2}\right)}^{2}\left\lVert{\bm{z}}\right\rVert_{{\bm{\pi}}}^{2}\leq\exp{(t\ell)}^{2}\left\lVert{\bm{z}}\right\rVert_{{\bm{\pi}}}^{2}\end{split}

which indicates 𝑬𝒛𝝅exp(t)𝒛𝝅\left\lVert{\bm{E}}^{\ast}{\bm{z}}\right\rVert_{{\bm{\pi}}}\leq\exp{(t\ell)}\left\lVert{\bm{z}}\right\rVert_{{\bm{\pi}}}. Now we can prove Part 4 of Lemma 5: Note that

((𝒛𝑬𝑷~))𝝅=((𝑬𝒛)𝑷~)𝝅λ(𝑬𝒛)𝝅λ𝑬𝒛𝝅λexp(t)𝒛𝝅.\begin{split}\left\lVert\left(\left({\bm{z}}^{\perp\ast}{\bm{E}}\widetilde{{\bm{P}}}\right)^{\ast}\right)^{\perp}\right\rVert_{{\bm{\pi}}}&=\left\lVert\left(\left({\bm{E}}^{\ast}{\bm{z}}^{\perp}\right)^{\perp\ast}\widetilde{{\bm{P}}}\right)^{\ast}\right\rVert_{{\bm{\pi}}}\leq\lambda\left\lVert\left({\bm{E}}^{\ast}{\bm{z}}^{\perp}\right)^{\perp}\right\rVert_{{\bm{\pi}}}\leq\lambda\left\lVert{\bm{E}}^{\ast}{\bm{z}}^{\perp}\right\rVert_{{\bm{\pi}}}\leq\lambda\exp{(t\ell)}\left\lVert{\bm{z}}^{\perp}\right\rVert_{{\bm{\pi}}}.\end{split}

Recursive Analysis We now use Lemma 5 to analyze the evolution of {\bm{z}}_{i}^{\parallel} and {\bm{z}}_{i}^{\perp}. Let {\bm{H}}_{v}\triangleq\frac{f(v)(\gamma+{\mathrm{i}}b)}{2}\otimes{\bm{I}}_{d}+{\bm{I}}_{d}\otimes\frac{f(v)(\gamma-{\mathrm{i}}b)}{2} in Lemma 5. We verify the following three facts: (1) \exp(t{\bm{H}}_{v})={\bm{M}}_{v}; (2) \left\lVert{\bm{H}}_{v}\right\rVert_{2} is bounded; (3) \sum_{v\in[N]}\pi_{v}{\bm{H}}_{v}=0.

First, it is easy to see that

\begin{split}\exp{(t{\bm{H}}_{v})}&=\exp{\left(\frac{tf(v)(\gamma+{\mathrm{i}}b)}{2}\otimes{\bm{I}}_{d}+{\bm{I}}_{d}\otimes\frac{tf(v)(\gamma-{\mathrm{i}}b)}{2}\right)}\\ &=\exp{\left(\frac{tf(v)(\gamma+{\mathrm{i}}b)}{2}\right)}\otimes\exp{\left(\frac{tf(v)(\gamma-{\mathrm{i}}b)}{2}\right)}={\bm{M}}_{v},\end{split}

where the first step follows from the definition of {\bm{H}}_{v}, the second step follows from the fact that \exp({\bm{A}}\otimes{\bm{I}}_{d}+{\bm{I}}_{d}\otimes{\bm{B}})=\exp({\bm{A}})\otimes\exp({\bm{B}}), and the last step follows from Equation 7.

Secondly, we can bound 𝑯v2\left\lVert{\bm{H}}_{v}\right\rVert_{2} by:

\begin{split}\left\lVert{\bm{H}}_{v}\right\rVert_{2}&\leq\left\lVert\frac{f(v)(\gamma+{\mathrm{i}}b)}{2}\otimes{\bm{I}}_{d}\right\rVert_{2}+\left\lVert{\bm{I}}_{d}\otimes\frac{f(v)(\gamma-{\mathrm{i}}b)}{2}\right\rVert_{2}\\ &=\left\lVert\frac{f(v)(\gamma+{\mathrm{i}}b)}{2}\right\rVert_{2}\left\lVert{\bm{I}}_{d}\right\rVert_{2}+\left\lVert{\bm{I}}_{d}\right\rVert_{2}\left\lVert\frac{f(v)(\gamma-{\mathrm{i}}b)}{2}\right\rVert_{2}\leq\sqrt{\gamma^{2}+b^{2}},\end{split}

where the first step follows by the triangle inequality, the second step follows by the fact that $\left\lVert{\bm{A}}\otimes{\bm{B}}\right\rVert_{2}=\left\lVert{\bm{A}}\right\rVert_{2}\left\lVert{\bm{B}}\right\rVert_{2}$, and the third step follows by $\left\lVert{\bm{I}}_{d}\right\rVert_{2}=1$ and $\left\lVert f(v)\right\rVert_{2}\leq 1$. We set $\ell=\sqrt{\gamma^{2}+b^{2}}$ to satisfy the assumption in Lemma 5 that $\left\lVert{\bm{H}}_{v}\right\rVert_{2}\leq\ell$. According to the conditions in Lemma 1, we know that $t\ell\leq 1$ and $t\ell\leq\frac{1-\lambda}{4\lambda}$.
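The spectral-norm bound on ${\bm{H}}_{v}$ can be checked numerically in the same spirit (again not part of the proof). The sketch below draws a random symmetric $f(v)$ normalized to spectral norm one; the values of $\gamma$ and $b$ are arbitrary.

\begin{verbatim}
# Sanity check: ||H_v||_2 <= sqrt(gamma^2 + b^2) when ||f(v)||_2 <= 1.
import numpy as np

rng = np.random.default_rng(1)
d, gamma, b = 4, 0.7, 0.3
F = rng.standard_normal((d, d))
F = (F + F.T) / 2
F = F / np.linalg.norm(F, 2)   # symmetric, spectral norm exactly 1
I = np.eye(d)

H = np.kron(F * (gamma + 1j * b) / 2, I) + np.kron(I, F * (gamma - 1j * b) / 2)
assert np.linalg.norm(H, 2) <= np.sqrt(gamma**2 + b**2) + 1e-12
\end{verbatim}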

Finally, we show that $\sum_{v\in[N]}\pi_{v}{\bm{H}}_{v}=0$, because

\[\begin{split}\sum_{v\in[N]}\pi_{v}{\bm{H}}_{v}&=\sum_{v\in[N]}\pi_{v}\left(\frac{f(v)(\gamma+{\mathrm{i}}b)}{2}\otimes{\bm{I}}_{d}+{\bm{I}}_{d}\otimes\frac{f(v)(\gamma-{\mathrm{i}}b)}{2}\right)\\ &=\frac{\gamma+{\mathrm{i}}b}{2}\left(\sum_{v\in[N]}\pi_{v}f(v)\right)\otimes{\bm{I}}_{d}+\frac{\gamma-{\mathrm{i}}b}{2}{\bm{I}}_{d}\otimes\left(\sum_{v\in[N]}\pi_{v}f(v)\right)=0,\end{split}\]

where the last step follows by $\sum_{v\in[N]}\pi_{v}f(v)=0$.
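A small numerical illustration of the centering argument (not part of the proof): after subtracting the $\pi$-weighted mean from arbitrary symmetric matrices, the corresponding ${\bm{H}}_{v}$ have zero $\pi$-weighted sum by linearity. The choices of $N$, $d$, $\gamma$, and $b$ below are arbitrary.

\begin{verbatim}
# Sanity check: sum_v pi_v H_v = 0 whenever sum_v pi_v f(v) = 0.
import numpy as np

rng = np.random.default_rng(2)
N, d, gamma, b = 5, 3, 0.7, 0.3
pi = rng.random(N); pi /= pi.sum()                  # a probability vector
F = [(M + M.T) / 2 for M in rng.standard_normal((N, d, d))]
mean = sum(p * M for p, M in zip(pi, F))
F = [M - mean for M in F]                           # enforce sum_v pi_v f(v) = 0
I = np.eye(d)

def H(M):  # the map f(v) -> H_v used above
    return np.kron(M * (gamma + 1j * b) / 2, I) + np.kron(I, M * (gamma - 1j * b) / 2)

S = sum(p * H(M) for p, M in zip(pi, F))
assert np.allclose(S, 0)
\end{verbatim}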

Claim 4.

\[\left\lVert{\bm{z}}_{i}^{\perp}\right\rVert_{{\bm{\pi}}}\leq\frac{\alpha_{2}}{1-\alpha_{4}}\max_{0\leq j<i}\left\lVert{\bm{z}}_{j}^{\parallel}\right\rVert_{{\bm{\pi}}}.\]

Proof.

Using Part 2 and Part 4 of Lemma 5, we have

\[\begin{split}\left\lVert{\bm{z}}_{i}^{\perp}\right\rVert_{{\bm{\pi}}}&=\left\lVert\left(\left({\bm{z}}_{i-1}^{\ast}{\bm{E}}\widetilde{{\bm{P}}}\right)^{\ast}\right)^{\perp}\right\rVert_{{\bm{\pi}}}\\ &\leq\left\lVert\left(\left({\bm{z}}_{i-1}^{\parallel\ast}{\bm{E}}\widetilde{{\bm{P}}}\right)^{\ast}\right)^{\perp}\right\rVert_{{\bm{\pi}}}+\left\lVert\left(\left({\bm{z}}_{i-1}^{\perp\ast}{\bm{E}}\widetilde{{\bm{P}}}\right)^{\ast}\right)^{\perp}\right\rVert_{{\bm{\pi}}}\\ &\leq\alpha_{2}\left\lVert{\bm{z}}_{i-1}^{\parallel}\right\rVert_{{\bm{\pi}}}+\alpha_{4}\left\lVert{\bm{z}}_{i-1}^{\perp}\right\rVert_{{\bm{\pi}}}\\ &\leq(\alpha_{2}+\alpha_{2}\alpha_{4}+\alpha_{2}\alpha_{4}^{2}+\cdots)\max_{0\leq j<i}\left\lVert{\bm{z}}_{j}^{\parallel}\right\rVert_{{\bm{\pi}}}\\ &\leq\frac{\alpha_{2}}{1-\alpha_{4}}\max_{0\leq j<i}\left\lVert{\bm{z}}_{j}^{\parallel}\right\rVert_{{\bm{\pi}}},\end{split}\]
where the fourth step unrolls the recursion (using ${\bm{z}}_{0}^{\perp}=0$), and the last step sums the geometric series, since $\alpha_{4}<1$. ∎
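The last two steps can be illustrated with a toy scalar recursion (not part of the proof): iterating $a_{i}=\alpha_{2}b+\alpha_{4}a_{i-1}$ from $a_{0}=0$ with a constant $b$ never exceeds $\alpha_{2}b/(1-\alpha_{4})$ when $\alpha_{4}<1$. The constants below are arbitrary.

\begin{verbatim}
# Toy illustration of the geometric-series bound in Claim 4.
alpha2, alpha4, b = 0.3, 0.6, 1.0
a = 0.0                                   # corresponds to ||z_0^perp|| = 0
for _ in range(1000):
    a = alpha2 * b + alpha4 * a           # worst case of the recursion
    assert a <= alpha2 * b / (1 - alpha4) + 1e-12
\end{verbatim}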

Claim 5.

\[\left\lVert{\bm{z}}_{i}^{\parallel}\right\rVert_{{\bm{\pi}}}\leq\left(\alpha_{1}+\frac{\alpha_{2}\alpha_{3}}{1-\alpha_{4}}\right)\max_{0\leq j<i}\left\lVert{\bm{z}}_{j}^{\parallel}\right\rVert_{{\bm{\pi}}}.\]

Proof.

Using Part 1 and Part 3 of Lemma 5 as well as Claim 4, we have

\[\begin{split}\left\lVert{\bm{z}}_{i}^{\parallel}\right\rVert_{{\bm{\pi}}}&=\left\lVert\left(\left({\bm{z}}_{i-1}^{\ast}{\bm{E}}\widetilde{{\bm{P}}}\right)^{\ast}\right)^{\parallel}\right\rVert_{{\bm{\pi}}}\\ &\leq\left\lVert\left(\left({\bm{z}}_{i-1}^{\parallel\ast}{\bm{E}}\widetilde{{\bm{P}}}\right)^{\ast}\right)^{\parallel}\right\rVert_{{\bm{\pi}}}+\left\lVert\left(\left({\bm{z}}_{i-1}^{\perp\ast}{\bm{E}}\widetilde{{\bm{P}}}\right)^{\ast}\right)^{\parallel}\right\rVert_{{\bm{\pi}}}\\ &\leq\alpha_{1}\left\lVert{\bm{z}}_{i-1}^{\parallel}\right\rVert_{{\bm{\pi}}}+\alpha_{3}\left\lVert{\bm{z}}_{i-1}^{\perp}\right\rVert_{{\bm{\pi}}}\\ &\leq\alpha_{1}\left\lVert{\bm{z}}_{i-1}^{\parallel}\right\rVert_{{\bm{\pi}}}+\alpha_{3}\frac{\alpha_{2}}{1-\alpha_{4}}\max_{0\leq j<i-1}\left\lVert{\bm{z}}_{j}^{\parallel}\right\rVert_{{\bm{\pi}}}\\ &\leq\left(\alpha_{1}+\frac{\alpha_{2}\alpha_{3}}{1-\alpha_{4}}\right)\max_{0\leq j<i}\left\lVert{\bm{z}}_{j}^{\parallel}\right\rVert_{{\bm{\pi}}}.\end{split}\]
∎

Combining Claim 4 and Claim 5 gives

\[\begin{split}\left\lVert{\bm{z}}_{k}^{\parallel}\right\rVert_{{\bm{\pi}}}&\leq\left(\alpha_{1}+\frac{\alpha_{2}\alpha_{3}}{1-\alpha_{4}}\right)\max_{0\leq j<k}\left\lVert{\bm{z}}_{j}^{\parallel}\right\rVert_{{\bm{\pi}}}\\ \text{(because $\alpha_{1}+\alpha_{2}\alpha_{3}/(1-\alpha_{4})\geq\alpha_{1}\geq 1$)}\quad&\leq\left(\alpha_{1}+\frac{\alpha_{2}\alpha_{3}}{1-\alpha_{4}}\right)^{k}\left\lVert{\bm{z}}_{0}^{\parallel}\right\rVert_{{\bm{\pi}}}\\ &=\left\lVert{\bm{\phi}}\right\rVert_{{\bm{\pi}}}\sqrt{d}\left(\alpha_{1}+\frac{\alpha_{2}\alpha_{3}}{1-\alpha_{4}}\right)^{k},\end{split}\]

which, by the Cauchy–Schwarz inequality and the fact that $\left\lVert{\bm{\pi}}\otimes\operatorname{vec}({\bm{I}}_{d})\right\rVert_{{\bm{\pi}}}=\sqrt{d}$, implies

\[\left\langle{\bm{\pi}}\otimes\operatorname{vec}({\bm{I}}_{d}),{\bm{z}}_{k}\right\rangle_{{\bm{\pi}}}\leq\left\lVert{\bm{\phi}}\right\rVert_{{\bm{\pi}}}d\left(\alpha_{1}+\frac{\alpha_{2}\alpha_{3}}{1-\alpha_{4}}\right)^{k}.\]
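As a sketch only (not part of the proof), and assuming the weighted inner product $\left\langle{\bm{x}},{\bm{y}}\right\rangle_{{\bm{\pi}}}=\sum_{v}\left\langle{\bm{x}}_{v},{\bm{y}}_{v}\right\rangle/\pi_{v}$ on block vectors, the norm computation $\left\lVert{\bm{\pi}}\otimes\operatorname{vec}({\bm{I}}_{d})\right\rVert_{{\bm{\pi}}}=\sqrt{d}$ and the Cauchy–Schwarz step can be checked numerically as follows; the block sizes are arbitrary.

\begin{verbatim}
# Sketch under the assumed definition <x, y>_pi = sum_v <x_v, y_v> / pi_v.
import numpy as np

rng = np.random.default_rng(4)
N, d = 5, 3
pi = rng.random(N); pi /= pi.sum()

def inner(x, y):  # blockwise pi-weighted inner product (assumed definition)
    return sum(np.vdot(xv, yv) / p for xv, yv, p in zip(x, y, pi))

vecI = np.eye(d).reshape(-1)
u = [p * vecI for p in pi]                        # the vector pi (x) vec(I_d)
z = [rng.standard_normal(d * d) for _ in range(N)]

assert np.isclose(np.sqrt(inner(u, u)), np.sqrt(d))
assert inner(u, z) <= np.sqrt(d) * np.sqrt(inner(z, z)) + 1e-12
\end{verbatim}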

Finally, we bound $\left(\alpha_{1}+\frac{\alpha_{2}\alpha_{3}}{1-\alpha_{4}}\right)^{k}$. Following [10], we can bound $\alpha_{1}$, $\alpha_{2}\alpha_{3}$, and $\alpha_{4}$ by:

\[\alpha_{1}=\exp{(t\ell)}-t\ell\leq 1+t^{2}\ell^{2}=1+t^{2}(\gamma^{2}+b^{2}),\]

and

\[\alpha_{2}\alpha_{3}=\lambda(\exp{(t\ell)}-1)^{2}\leq\lambda(2t\ell)^{2}=4\lambda t^{2}(\gamma^{2}+b^{2}),\]

where the second step is because $\exp{(x)}\leq 1+2x$ for all $x\in[0,1]$ and $t\ell\leq 1$. Similarly,

\[\alpha_{4}=\lambda\exp{(t\ell)}\leq\lambda(1+2t\ell)\leq\frac{1}{2}+\frac{1}{2}\lambda,\]

where the second step is because $t\ell\leq 1$, and the third step follows by $t\ell\leq\frac{1-\lambda}{4\lambda}$.
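The elementary scalar inequalities behind these three bounds can be verified numerically as well (not part of the proof); the sweep below uses arbitrary grid sizes and checks the largest admissible value of $t\ell$ for each $\lambda$.

\begin{verbatim}
# Sanity checks: exp(x) - x <= 1 + x^2 and exp(x) - 1 <= 2x on [0, 1], and
# lambda * exp(t*l) <= 1/2 + lambda/2 when t*l <= min(1, (1-lambda)/(4*lambda)).
import numpy as np

x = np.linspace(0, 1, 10001)
assert np.all(np.exp(x) - x <= 1 + x**2 + 1e-12)
assert np.all(np.exp(x) - 1 <= 2 * x + 1e-12)

for lam in np.linspace(0.05, 0.95, 19):
    tl = min(1.0, (1 - lam) / (4 * lam))   # largest t*l allowed by Lemma 1
    assert lam * np.exp(tl) <= 0.5 + 0.5 * lam + 1e-12
\end{verbatim}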

Overall, we have

\[\begin{split}\left(\alpha_{1}+\frac{\alpha_{2}\alpha_{3}}{1-\alpha_{4}}\right)^{k}&\leq\left(1+t^{2}(\gamma^{2}+b^{2})+\frac{4\lambda t^{2}(\gamma^{2}+b^{2})}{\frac{1}{2}-\frac{1}{2}\lambda}\right)^{k}\\ &\leq\exp{\left(kt^{2}(\gamma^{2}+b^{2})\left(1+\frac{8}{1-\lambda}\right)\right)}.\end{split}\]

This completes our proof of Lemma 1. ∎

B.4 Proof of Theorem 1


Proof.

(of Theorem 1) Our strategy is to adopt the complexification technique [8]. For any $d\times d$ complex Hermitian matrix ${\bm{X}}$, we may write ${\bm{X}}={\bm{Y}}+{\mathrm{i}}{\bm{Z}}$, where ${\bm{Y}}$ and ${\bm{Z}}$ are real $d\times d$ matrices. The Hermitian property of ${\bm{X}}$ (i.e., ${\bm{X}}^{\ast}={\bm{X}}$) implies that (1) ${\bm{Y}}$ is symmetric (i.e., ${\bm{Y}}^{\top}={\bm{Y}}$); (2) ${\bm{Z}}$ is skew-symmetric (i.e., ${\bm{Z}}^{\top}=-{\bm{Z}}$). The eigenvalues of ${\bm{X}}$ can be recovered from the $2d\times 2d$ real symmetric matrix ${\bm{H}}\triangleq\begin{bmatrix}{\bm{Y}}&{\bm{Z}}\\ -{\bm{Z}}&{\bm{Y}}\end{bmatrix}$, whose symmetry follows from the symmetry of ${\bm{Y}}$ and the skew-symmetry of ${\bm{Z}}$. If the (real) eigenvalues of ${\bm{X}}$ are $\lambda_{1},\lambda_{2},\cdots,\lambda_{d}$, then those of ${\bm{H}}$ are $\lambda_{1},\lambda_{1},\lambda_{2},\lambda_{2},\cdots,\lambda_{d},\lambda_{d}$; that is, ${\bm{X}}$ and ${\bm{H}}$ have the same eigenvalues, each with doubled multiplicity.
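As a numerical sanity check of this dilation (not part of the proof), the sketch below builds ${\bm{H}}$ from a random Hermitian ${\bm{X}}$ and verifies that its spectrum is that of ${\bm{X}}$ with each eigenvalue doubled; the dimension and random seed are arbitrary.

\begin{verbatim}
# Sanity check: H = [[Y, Z], [-Z, Y]] has the eigenvalues of X = Y + iZ,
# each with multiplicity doubled.
import numpy as np

rng = np.random.default_rng(3)
d = 4
A = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
X = (A + A.conj().T) / 2                 # random Hermitian matrix
Y, Z = X.real, X.imag                    # Y symmetric, Z skew-symmetric
H = np.block([[Y, Z], [-Z, Y]])

eig_X = np.sort(np.linalg.eigvalsh(X))
eig_H = np.sort(np.linalg.eigvalsh(H))
assert np.allclose(np.repeat(eig_X, 2), eig_H)
\end{verbatim}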

Using the above technique, we can formally prove Theorem 1. For any complex matrix function $f:[N]\rightarrow\mathbb{C}^{d\times d}$ in Theorem 1, we can separate its real and imaginary parts by $f=f_{1}+{\mathrm{i}}f_{2}$. Then we construct a real-valued matrix function $g:[N]\rightarrow\mathbb{R}^{2d\times 2d}$ s.t. $\forall v\in[N]$, $g(v)=\begin{bmatrix}f_{1}(v)&f_{2}(v)\\ -f_{2}(v)&f_{1}(v)\end{bmatrix}$. According to the complexification technique, we know that (1) $\forall v\in[N]$, $g(v)$ is real symmetric and $\left\lVert g(v)\right\rVert_{2}=\left\lVert f(v)\right\rVert_{2}\leq 1$; (2) $\sum_{v\in[N]}\pi_{v}g(v)=0$. Then

\[\mathbb{P}\left[\lambda_{\max}\left(\frac{1}{k}\sum_{j=1}^{k}f(v_{j})\right)\geq\epsilon\right]=\mathbb{P}\left[\lambda_{\max}\left(\frac{1}{k}\sum_{j=1}^{k}g(v_{j})\right)\geq\epsilon\right]\leq 4\left\lVert{\bm{\phi}}\right\rVert_{{\bm{\pi}}}d^{2}\exp{\left(-\epsilon^{2}(1-\lambda)k/72\right)},\]

where the first step follows by the fact that $\frac{1}{k}\sum_{j=1}^{k}f(v_{j})$ and $\frac{1}{k}\sum_{j=1}^{k}g(v_{j})$ have the same eigenvalues (with different multiplicity), and the second step follows by Theorem 3; the additional factor of $4$ appears because the constructed $g(v)$ has shape $2d\times 2d$. The bound on $\lambda_{\min}$ follows similarly. ∎