A Matrix Chernoff Bound for Markov Chains and Its Application to Co-occurrence Matrices
Abstract
We prove a Chernoff-type bound for sums of matrix-valued random variables sampled via a regular (aperiodic and irreducible) finite Markov chain. Specifically, consider a random walk on a regular Markov chain and a Hermitian matrix-valued function on its state space. Our result gives exponentially decreasing bounds on the tail distributions of the extreme eigenvalues of the sample mean matrix. Our proof is based on the matrix expander (regular undirected graph) Chernoff bound [Garg et al. STOC ’18] and scalar Chernoff-Hoeffding bounds for Markov chains [Chung et al. STACS ’12].
Our matrix Chernoff bound for Markov chains can be applied to analyze the behavior of co-occurrence statistics for sequential data, which have been common and important data signals in machine learning. We show that given a regular Markov chain with $n$ states and mixing time $\tau$, we need a trajectory of length $O\left(\tau(\log n+\log\tau)\epsilon^{-2}\right)$ to achieve an estimator of the co-occurrence matrix with error bound $\epsilon$. We conduct several experiments and the experimental results are consistent with the exponentially fast convergence rate from the theoretical analysis. Our result gives the first bound on the convergence rate of the co-occurrence matrix and the first sample complexity analysis in graph representation learning.
1 Introduction
The Chernoff bound [5], which gives exponentially decreasing bounds on tail distributions of sums of independent scalar-valued random variables, is one of the most basic and versatile tools in theoretical computer science, with countless applications to practical problems [21, 35]. There are two notable limitations when applying the Chernoff bound to analyze sample complexity in real-world machine learning problems. First, in many cases the random variables have dependence, e.g., Markov dependence [20] in MCMC [18] and online learning [48]. Second, applications are often concerned with the concentration behavior of quantities beyond scalar-valued random variables, e.g., random features in kernel machines [40] and co-occurrence statistics, which are random matrices [38, 39].
Existing research has attempted to extend the original Chernoff bound along one of these two directions [19, 11, 27, 24, 53, 14, 6, 41, 52, 42, 1, 50]. Wigderson and Xiao [53] conjectured that Chernoff bounds can be generalized to both matrix-valued random variables and Markov dependence, while restricting the Markov dependence to be a random walk on a regular undirected graph. The conjecture was recently proved by Garg et al. [10], based on a new multi-matrix extension of the Golden-Thompson inequality [45]. However, the regular undirected graph is a special case of Markov chains which are reversible and have a uniform stationary distribution, and does not cover practical settings such as random walks on generic graphs. It remains an open question to establish a Chernoff bound for matrix-valued random variables under more general Markov dependence.
In this work, we establish large deviation bounds for the tail probabilities of the extreme eigenvalues of sums of random matrices sampled via a regular Markov chain (note that regular Markov chains are Markov chains which are aperiodic and irreducible, while an undirected regular graph is an undirected graph where each vertex has the same number of neighbors; the term "regular" may thus have different meanings depending on the context) starting from an arbitrary distribution (not necessarily the stationary distribution), which significantly improves the result of Garg et al. [10]. More formally, we prove the following theorem:
Theorem 1 (Markov Chain Matrix Chernoff Bound).
Let $P$ be a regular Markov chain with state space $[N]$, stationary distribution $\pi$ and spectral expansion $\lambda$. Let $f:[N]\to\mathbb{C}^{d\times d}$ be a function such that (1) $\forall v\in[N]$, $f(v)$ is Hermitian and $\left\|f(v)\right\|_{2}\le1$; (2) $\sum_{v\in[N]}\pi_{v}f(v)=0$. Let $(v_{1},\cdots,v_{k})$ denote a $k$-step random walk on $P$ starting from a distribution $\phi$ on $[N]$. Given $\epsilon\in(0,1)$,
$$\Pr\left[\lambda_{\max}\left(\frac{1}{k}\sum_{j=1}^{k}f(v_{j})\right)\ge\epsilon\right]\le4\left\|\phi\right\|_{\pi}\,d^{2}\exp\left(-\Omega\left(\epsilon^{2}(1-\lambda)k\right)\right),$$
and the same bound holds for $\Pr\left[\lambda_{\min}\left(\frac{1}{k}\sum_{j=1}^{k}f(v_{j})\right)\le-\epsilon\right]$.
In the above theorem, $\left\|\phi\right\|_{\pi}$ is the $\pi$-norm (which we define formally later in Section 2) measuring the distance between the initial distribution $\phi$ and the stationary distribution $\pi$. Our strategy is to incorporate the concentration of matrix-valued functions from [10] into the study of general Markov chains from [6], which was originally for scalars.
1.1 Applications to Co-occurrence Matrices of Markov Chains
The co-occurrence statistics have recently emerged as common and important data signals in machine learning, providing rich correlation and clustering information about the underlying object space, such as word co-occurrence in natural language processing [32, 33, 34, 26, 37], vertex co-occurrence in graph learning [38, 46, 12, 13, 7, 39], item co-occurrence in recommender systems [44, 28, 2, 51, 29], action co-occurrence in reinforcement learning [49], and emission co-occurrence of hidden Markov models [23, 17, 30]. Given a sequence of objects $(v_{1},\cdots,v_{L})$, the co-occurrence statistics are computed by moving a sliding window of fixed size $T$ over the sequence and recording the frequency of objects' co-occurrence within the sliding window. Pseudocode of the above procedure is listed in Algorithm 1, which produces an $n$ by $n$ co-occurrence matrix $C$, where $n$ is the size of the object space.
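To make the procedure concrete, the following is a minimal NumPy sketch of Algorithm 1 with uniform step weights $w_{r}=1/T$ (the function name and interface are ours, not part of the original pseudocode):

```python
import numpy as np

def cooccurrence(seq, n, T):
    """Sliding-window co-occurrence counting as in Algorithm 1.

    seq: object ids in [0, n); n: size of the object space; T: window size.
    Returns the n-by-n co-occurrence matrix with uniform weights w_r = 1/T.
    """
    C = np.zeros((n, n))
    L = len(seq)
    for i in range(L - T):           # slide the window over the sequence
        for r in range(1, T + 1):    # pairs (v_i, v_{i+r}) inside the window
            u, v = seq[i], seq[i + r]
            C[u, v] += 0.5           # half weight to each ordered pair,
            C[v, u] += 0.5           # which symmetrizes the matrix
    return C / ((L - T) * T)         # average over windows; w_r = 1/T
```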
A common assumption when building such co-occurrence matrices is that the sequential data are long enough to provide an accurate estimation. For instance, Mikolov et al. [33] use a news article dataset with one billion words in their Skip-gram model; Tennenholtz and Mannor [49] train their Act2vec model with action sequences from over a million StarCraft II game replays, which are equivalent to 100 years of consecutive gameplay; Perozzi et al. [38] sample large amounts of random walk sequences from graphs to capture vertex co-occurrence. A recent work by Qiu et al. [39] studies the convergence of co-occurrence matrices of random walks on undirected graphs in the limit (i.e., when the length of the random walk goes to infinity), but leaves the convergence rate as an open problem. It remains unknown whether the co-occurrence statistics are sample efficient, and how efficient they are.
In this paper, we study the situation where the sequential data are sampled from a regular finite Markov chain (i.e., an aperiodic and irreducible finite Markov chain), and derive bounds on the sample efficiency of co-occurrence matrix estimation, specifically on the length of the trajectory needed in the sampling algorithm shown in Algorithm 1. To give a formal statement, we first translate Algorithm 1 into linear algebra language. Given a trajectory $(v_{1},\cdots,v_{L})$ from state space $[n]$ and step weight coefficients $(w_{1},\cdots,w_{T})$, the co-occurrence matrix $C$ is defined to be
$$C=\frac{1}{L-T}\sum_{i=1}^{L-T}C_{i},\qquad\text{where}\quad C_{i}=\sum_{r=1}^{T}\frac{w_{r}}{2}\left(e_{v_{i}}e_{v_{i+r}}^{\top}+e_{v_{i+r}}e_{v_{i}}^{\top}\right).$$
Here $C_{i}$ accounts for the co-occurrence within sliding window $(v_{i},\cdots,v_{i+T})$, and $e_{v}$ is a length-$n$ vector with a one in its $v$-th entry and zeros elsewhere. Thus $e_{v_{i}}e_{v_{i+r}}^{\top}$ is an $n$ by $n$ matrix whose $(v_{i},v_{i+r})$-th entry is one and all other entries are zero, which records the co-occurrence of $v_{i}$ and $v_{i+r}$. Note that Algorithm 1 is a special case where the step weight coefficients are uniform, i.e., $w_{r}=\frac{1}{T}$ for $r=1,\cdots,T$, and the co-occurrence statistics in all the applications mentioned above can be formalized in this way. When the trajectory $(v_{1},\cdots,v_{L})$ is a random walk on a regular Markov chain $P$ with stationary distribution $\pi$, the asymptotic expectation of the co-occurrence matrix within sliding window $(v_{i},\cdots,v_{i+T})$ is
$$\lim_{i\to\infty}\mathbb{E}\left[C_{i}\right]=\sum_{r=1}^{T}\frac{w_{r}}{2}\left(\Pi P^{r}+\left(P^{r}\right)^{\top}\Pi\right),$$
where $\Pi\triangleq\operatorname{diag}(\pi_{1},\cdots,\pi_{n})$. Thus the asymptotic expectation of the co-occurrence matrix is
$$C_{\infty}\triangleq\lim_{L\to\infty}\mathbb{E}\left[C\right]=\sum_{r=1}^{T}\frac{w_{r}}{2}\left(\Pi P^{r}+\left(P^{r}\right)^{\top}\Pi\right).\qquad(1)$$
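Equation 1 can be evaluated directly from the transition matrix; a minimal NumPy sketch (function name ours), useful later for comparing the empirical estimate against its limit:

```python
import numpy as np

def asymptotic_cooccurrence(P, pi, w):
    """Asymptotic expectation of the co-occurrence matrix (Equation 1):
    C_inf = sum_r (w_r / 2) * (Pi @ P^r + (P^r).T @ Pi), Pi = diag(pi)."""
    n = P.shape[0]
    Pi = np.diag(pi)
    C_inf = np.zeros((n, n))
    Pr = np.eye(n)
    for w_r in w:                    # w = (w_1, ..., w_T)
        Pr = Pr @ P                  # now Pr = P^r
        C_inf += (w_r / 2) * (Pi @ Pr + Pr.T @ Pi)
    return C_inf
```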
Our main result regarding the estimation of the co-occurrence matrix is the following convergence bound related to the length of the walk sampled.
Theorem 2 (Convergence Rate of Co-occurrence Matrices).
Let $P$ be a regular Markov chain with state space $[n]$, stationary distribution $\pi$ and mixing time $\tau$. Let $(v_{1},\cdots,v_{L})$ denote an $L$-step random walk on $P$ starting from a distribution $\phi$ on $[n]$. Given step weight coefficients $(w_{1},\cdots,w_{T})$ s.t. $\sum_{r=1}^{T}\left|w_{r}\right|=1$, and $\epsilon\in(0,1)$, the probability that the co-occurrence matrix $C$ deviates from its asymptotic expectation $C_{\infty}$ (in 2-norm) is bounded by:
$$\Pr\left[\left\|C-C_{\infty}\right\|_{2}\ge\epsilon\right]\le O\left(\tau\left\|\phi\right\|_{\pi}n^{2}\right)\exp\left(-\Omega\left(\frac{\epsilon^{2}L}{\tau}\right)\right).$$
Specifically, there exists a trajectory length $L=O\left(\frac{\tau}{\epsilon^{2}}\left(\log n+\log\tau+\log\frac{\left\|\phi\right\|_{\pi}}{\delta}\right)\right)$ such that $\Pr\left[\left\|C-C_{\infty}\right\|_{2}\ge\epsilon\right]\le\delta$. Assuming $\delta=\frac{1}{\operatorname{poly}(n)}$ and $\left\|\phi\right\|_{\pi}=\operatorname{poly}(n)$ gives $L=O\left(\tau\left(\log n+\log\tau\right)\epsilon^{-2}\right)$.
Our result in Theorem 2 gives the first sample complexity analysis for many graph representation learning algorithms. Given a graph, these algorithms aim to learn a function from the vertices to a low-dimensional vector space. Most of them (e.g., DeepWalk [38], node2vec [12], metapath2vec [7], GraphSAGE [13]) consist of two steps. The first step is to draw random sequences from a stochastic process defined on the graph and then count co-occurrence statistics from the sampled sequences, where the stochastic process is usually defined to be a first-order or higher-order random walk on the graph. The second step is to train a model to fit the co-occurrence statistics. For example, DeepWalk can be viewed as factorizing a point-wise mutual information matrix [26, 39], which is a transformation of the co-occurrence matrix; GraphSAGE fits the co-occurrence statistics with a graph neural network [22]. The common assumption is that there are enough samples so that the co-occurrence statistics are accurately estimated. Ours is the first work to study the sample complexity of the aforementioned algorithms. Theorem 2 implies that these algorithms need $O\left(\tau\left(\log n+\log\tau\right)\epsilon^{-2}\right)$ samples to achieve a good estimator of the co-occurrence matrix.
Previous work by Hsu et al. [16, 15] studies a similar problem. They leverage the co-occurrence matrix with $T=1$ to estimate the mixing time of reversible Markov chains from a single trajectory. Their main technique is a blocking technique [55], which parallels the Markov chain matrix Chernoff bound used in this work. Our work is also related to research on random-walk matrix polynomial sparsification when the Markov chain is a random walk on an undirected graph. In this case, we can rewrite $P=D^{-1}A$, where $D$ and $A$ are the degree matrix and adjacency matrix of an undirected graph $G$ with $n$ vertices and $m$ edges, and the expected co-occurrence matrix in Equation 1 can be simplified as $C_{\infty}=\frac{1}{\operatorname{vol}(G)}\sum_{r=1}^{T}w_{r}D\left(D^{-1}A\right)^{r}$ (the volume of a graph is defined to be $\operatorname{vol}(G)=\sum_{i}D_{ii}$), which is known as a random-walk matrix polynomial [3, 4]. Cheng et al. [4] propose an algorithm which needs $\tilde{O}\left(T^{2}m\epsilon^{-2}\right)$ steps of random walk to construct an $\epsilon$-approximator for the random-walk matrix polynomial. Our bound in Theorem 2 is stronger than the bound of Cheng et al. [4] when the Markov chain mixes fast. Moreover, Cheng et al. [4] require the step weight coefficients to be non-negative, while our bound can handle negative step weight coefficients.
Organization The rest of the paper is organized as follows. In Section 2 we provide preliminaries, followed by the proof of matrix Chernoff bound in Section 3 and the proof of convergence rate of co-occurrence matrices in Section 4. In Section 5, we conduct experiments on both synthetic and real-world datasets. Finally, we conclude this work in Section 6.
2 Preliminaries
In this paper, we denote by $P$ a finite Markov chain on $n$ states. $P$ could refer to either the chain itself or the corresponding transition probability matrix — an $n$ by $n$ matrix such that its entry $P_{i,j}$ indicates the probability that state $i$ moves to state $j$. A Markov chain is called an ergodic Markov chain if it is possible to eventually get from every state to every other state with positive probability. A Markov chain is regular if some power of its transition matrix has all strictly positive entries. A regular Markov chain must be an ergodic Markov chain, but not vice versa. An ergodic Markov chain has a unique stationary distribution, i.e., there exists a unique probability vector $\pi$ such that $\pi P=\pi$. For convenience, we write $[n]\triangleq\{1,\cdots,n\}$ for the state space.
The time that a regular Markov chain needs to be "close" to its stationary distribution is called the mixing time. (We need the Markov chain to be regular for the mixing time to be well-defined; for an ergodic Markov chain, which could be periodic, the mixing time may be ill-defined.) Let $x$ and $y$ be two probability vectors. The total variation distance between them is $\left\|x-y\right\|_{TV}=\frac{1}{2}\sum_{v}\left|x_{v}-y_{v}\right|$. For $\epsilon>0$, the $\epsilon$-mixing time of a regular Markov chain $P$ is $\tau(\epsilon)=\min\left\{t:\max_{x}\left\|xP^{t}-\pi\right\|_{TV}\le\epsilon\right\}$, where $x$ is an arbitrary probability vector.
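For small chains, the $\epsilon$-mixing time defined above can be computed by brute force; a minimal sketch (the maximum over initial distributions is attained at point masses, i.e., the rows of $P^{t}$, by convexity of the total variation distance):

```python
import numpy as np

def mixing_time(P, pi, eps=0.25, max_t=10**6):
    """Smallest t such that max_x TV(x P^t, pi) <= eps."""
    Pt = P.copy()
    for t in range(1, max_t + 1):
        tv = 0.5 * np.abs(Pt - pi).sum(axis=1).max()  # worst-case TV distance
        if tv <= eps:
            return t
        Pt = Pt @ P
    return None  # chain did not mix within max_t steps
```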
The stationary distribution $\pi$ also defines an inner product space, where the inner product (under the $\pi$-kernel) is defined as $\langle x,y\rangle_{\pi}=\sum_{v\in[n]}\frac{x_{v}\overline{y_{v}}}{\pi_{v}}$ for $x,y\in\mathbb{C}^{n}$, where $\overline{y_{v}}$ is the complex conjugate of $y_{v}$. A naturally defined norm based on the above inner product is $\left\|x\right\|_{\pi}=\sqrt{\langle x,x\rangle_{\pi}}$. Then we can define the spectral expansion of a Markov chain [31, 9, 6] as
$$\lambda(P)=\max_{\langle x,\pi\rangle_{\pi}=0,\;x\neq0}\frac{\left\|xP\right\|_{\pi}}{\left\|x\right\|_{\pi}}.$$
The spectral expansion is known to be a measure of the mixing time of a Markov chain: the smaller $\lambda(P)$ is, the faster the Markov chain converges to its stationary distribution [54]. If $P$ is reversible, $\lambda(P)$ is simply the second largest absolute eigenvalue of $P$ (the largest is always $1$). The irreversible case is more complicated, since $P$ may have complex eigenvalues. In this case, $\lambda(P)$ is actually the square root of the second largest absolute eigenvalue of the multiplicative reversiblization of $P$ [9]. When $P$ is clear from the context, we will simply write $\lambda$ and $\tau(\epsilon)$ for $\lambda(P)$ and $\tau_{P}(\epsilon)$, respectively. We shall also refer to $1-\lambda$ as the spectral gap of $P$.
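Numerically, the spectral expansion can be obtained from an equivalent singular-value formulation: under the row-vector convention above, $\lambda(P)$ equals the largest singular value of $D_{\pi}^{-1/2}P^{\top}D_{\pi}^{1/2}$ restricted to the orthogonal complement of $\sqrt{\pi}$. A minimal sketch:

```python
import numpy as np

def spectral_expansion(P, pi):
    """lambda(P): operator norm of P in the pi-inner-product space,
    restricted to vectors orthogonal to the stationary direction."""
    q = np.sqrt(pi)                          # note ||q||_2 = 1
    A = (P.T * q) / q[:, None]               # D^{-1/2} P^T D^{1/2}; A q = q
    proj = np.eye(len(pi)) - np.outer(q, q)  # project out the fixed vector q
    return np.linalg.norm(A @ proj, 2)       # largest remaining singular value
```

For a reversible chain this returns the second largest absolute eigenvalue of $P$, matching the discussion above.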
3 Matrix Chernoff Bounds for Markov Chains
This section provides a brief overview of our proof of the Markov chain matrix Chernoff bound. We start from a simpler version which only considers real-valued symmetric matrices, as stated in Theorem 3 below. Then we extend it to complex-valued Hermitian matrices, as stated in Theorem 1.
Theorem 3 (A Real-Valued Version of Theorem 1).
Let $P$ be a regular Markov chain with state space $[N]$, stationary distribution $\pi$ and spectral expansion $\lambda$. Let $f:[N]\to\mathbb{R}^{d\times d}$ be a function such that (1) $\forall v\in[N]$, $f(v)$ is symmetric and $\left\|f(v)\right\|_{2}\le1$; (2) $\sum_{v\in[N]}\pi_{v}f(v)=0$. Let $(v_{1},\cdots,v_{k})$ denote a $k$-step random walk on $P$ starting from a distribution $\phi$ on $[N]$. Then given $\epsilon\in(0,1)$,
$$\Pr\left[\lambda_{\max}\left(\frac{1}{k}\sum_{j=1}^{k}f(v_{j})\right)\ge\epsilon\right]\le\left\|\phi\right\|_{\pi}\,d^{2}\exp\left(-\Omega\left(\epsilon^{2}(1-\lambda)k\right)\right),$$
and the same bound holds for $\Pr\left[\lambda_{\min}\left(\frac{1}{k}\sum_{j=1}^{k}f(v_{j})\right)\le-\epsilon\right]$.
Due to space constraints, we defer the full proof to Section B in the supplementary material and instead present a sketch here. By symmetry, we only discuss bounding $\Pr\left[\lambda_{\max}\left(\frac{1}{k}\sum_{j=1}^{k}f(v_{j})\right)\ge\epsilon\right]$ here. Using the exponential method, the probability in Theorem 3 can be upper bounded for any $t>0$ by:
$$\Pr\left[\lambda_{\max}\left(\sum_{j=1}^{k}f(v_{j})\right)\ge k\epsilon\right]\le\Pr\left[\operatorname{tr}\exp\left(t\sum_{j=1}^{k}f(v_{j})\right)\ge e^{tk\epsilon}\right]\le e^{-tk\epsilon}\,\mathbb{E}\left[\operatorname{tr}\exp\left(t\sum_{j=1}^{k}f(v_{j})\right)\right],$$
where the first inequality follows by the tail bounds for eigenvalues (see Proposition 3.2.1 in Tropp [50]), which control the tail probabilities of the extreme eigenvalues of a random matrix by producing a bound on the trace of the matrix moment generating function, and the second inequality follows by Markov's inequality. The RHS of the above equation is the expected trace of the exponential of a sum of matrices (i.e., the $tf(v_{j})$'s). When $f$ is a scalar-valued function, we can easily write the exponential of a sum as a product of exponentials (since $e^{a+b}=e^{a}e^{b}$ for scalars). However, this is not true for matrices. To bound the expectation term, we invoke the following multi-matrix Golden-Thompson inequality from [10], by letting $H_{j}=tf(v_{j})$.
Theorem 4 (Multi-matrix Golden-Thompson Inequality, Theorem 1.5 in [10]).
Let $H_{1},\cdots,H_{k}$ be $d\times d$ Hermitian matrices. Then
$$\log\operatorname{tr}\exp\left(\sum_{j=1}^{k}H_{j}\right)\le\frac{4}{\pi}\int_{-\pi/2}^{\pi/2}\log\operatorname{tr}\left[\prod_{j=1}^{k}\exp\left(\frac{(1+ib)H_{j}}{2}\right)\prod_{j=k}^{1}\exp\left(\frac{(1-ib)H_{j}}{2}\right)\right]\mathrm{d}\mu(b)$$
for some probability distribution $\mu$ on $\left[-\frac{\pi}{2},\frac{\pi}{2}\right]$.
The key point of this theorem is to relate the exponential of a sum of matrices to a product of matrix exponentials and their adjoints, whose trace can be further bounded via the following lemma.
Lemma 1 (Analogous to Lemma 4.3 in [10]).
Let $P$ be a regular Markov chain with state space $[N]$, stationary distribution $\pi$ and spectral expansion $\lambda$. Let $f:[N]\to\mathbb{R}^{d\times d}$ be a function such that (1) $\sum_{v\in[N]}\pi_{v}f(v)=0$; (2) $f(v)$ is symmetric and $\left\|f(v)\right\|_{2}\le1$, $\forall v\in[N]$. Let $(v_{1},\cdots,v_{k})$ denote a $k$-step random walk on $P$ starting from a distribution $\phi$ on $[N]$. Then for any $t>0$ and $b\in\mathbb{R}$ such that $t\sqrt{1+b^{2}}\le c(1-\lambda)$ for a sufficiently small absolute constant $c$, we have
$$\mathbb{E}\left[\operatorname{tr}\prod_{j=1}^{k}\exp\left(\frac{t(1+ib)}{2}f(v_{j})\right)\prod_{j=k}^{1}\exp\left(\frac{t(1-ib)}{2}f(v_{j})\right)\right]\le\left\|\phi\right\|_{\pi}\,d\,\exp\left(O\left(\frac{kt^{2}(1+b^{2})}{1-\lambda}\right)\right).$$
Proving Lemma 1 is the technical core of our paper. The main idea is to write the expected trace expression in the LHS of Lemma 1 in terms of the transition probability matrix $P$, which allows for a recursive analysis that tracks how much the expected trace expression changes as a function of the number of steps $k$. The analysis relies on incorporating the concentration of matrix-valued functions from [10] into the study of general Markov chains from [6], which was originally for scalars. Key to this extension is the definition of an inner product related to the stationary distribution of the Markov chain, and a spectral expansion based on such inner products. In contrast, the undirected regular graph case studied in [10] can be handled using the standard inner product, as well as the second largest absolute eigenvalue of the transition matrix instead of the spectral expansion. Detailed proofs of Theorem 3 and Lemma 1 can be found in Appendix B.2 and Appendix B.3 of the supplementary material, respectively.
Our result about real-valued matrices can be further generalized to complex-valued matrices, as stated in Theorem 1. Our main strategy is to adopt the complexification technique [8], which first relates the eigenvalues of a complex Hermitian matrix to those of a real symmetric matrix, and then deals with the real symmetric matrix using Theorem 3. The proof of Theorem 1 is deferred to Appendix B.4 in the supplementary material.
4 Convergence Rate of Co-occurrence Matrices of Markov Chains
In this section, we first apply the matrix Chernoff bound for regular Markov chains from Theorem 3 to obtain our main result on the convergence of co-occurrence matrix estimation, as stated in Theorem 2, and then discuss its generalization to hidden Markov models in Corollary 1. Informally, our result in Theorem 2 states that if the mixing time of the Markov chain is $\tau$, then the length of a trajectory needed to guarantee an additive error (in 2-norm) of $\epsilon$ is roughly $O\left(\tau\left(\log n+\log\tau\right)\epsilon^{-2}\right)$ when the co-occurrence window size $T$ is a constant. However, we cannot directly apply the matrix Chernoff bound, because the co-occurrence matrix is not a sum of matrix-valued functions sampled from the original Markov chain $P$. The main difficulty is to construct a proper Markov chain and matrix-valued function as required by Theorem 3. We formally give our proof as follows:
Proof.
(of Theorem 2) Our proof has three main steps: the first two construct a Markov chain $M$ from $P$, and a matrix-valued function $F$ on its state space, such that the average of the matrix-valued random variables sampled via $M$ is exactly the error matrix $C-C_{\infty}$. Then we invoke Theorem 3 on the constructed Markov chain and function to bound the convergence rate. We give details below.
Step One Given a random walk $(v_{1},\cdots,v_{L})$ on Markov chain $P$, we construct a sequence $(s_{1},\cdots,s_{L-T})$ where $s_{i}=(v_{i},\cdots,v_{i+T})$, i.e., each $s_{i}$ is a size-$(T+1)$ sliding window over $(v_{1},\cdots,v_{L})$. Meanwhile, letting $\mathcal{S}$ be the set of all $T$-step walks on Markov chain $P$, we define a new Markov chain $M$ on $\mathcal{S}$ such that, for $u=(u_{0},\cdots,u_{T})\in\mathcal{S}$ and $w=(w_{0},\cdots,w_{T})\in\mathcal{S}$:
$$M(u,w)=\begin{cases}P(u_{T},w_{T})&\text{if }(w_{0},\cdots,w_{T-1})=(u_{1},\cdots,u_{T}),\\0&\text{otherwise.}\end{cases}$$
The following claim characterizes the properties of , whose proof is deferred to Appendix A.1 in the supplementary material.
Claim 1 (Properties of $M$).
If $P$ is a regular Markov chain, then $M$ satisfies:

1. $M$ is a regular Markov chain with stationary distribution $\sigma(u)=\pi_{u_{0}}\prod_{i=1}^{T}P(u_{i-1},u_{i})$ for $u=(u_{0},\cdots,u_{T})\in\mathcal{S}$;

2. The sequence $(s_{1},\cdots,s_{L-T})$ is a random walk on $M$ starting from the distribution $\rho$ such that $\rho(u)=\phi_{u_{0}}\prod_{i=1}^{T}P(u_{i-1},u_{i})$, and $\left\|\rho\right\|_{\sigma}=\left\|\phi\right\|_{\pi}$;

3. $\forall\epsilon>0$, the $\epsilon$-mixing times of $P$ and $M$ satisfy $\tau_{M}(\epsilon)\le\tau_{P}(\epsilon)+T$;

4. There exists $P$ with $\lambda(P)<1$ s.t. the induced $M$ has $\lambda(M)=1$, i.e., $M$ may have zero spectral gap.
Parts 1 and 2 imply that the sliding windows (i.e., the $s_{i}$'s) correspond to state transitions in a regular Markov chain $M$, whose mixing time and spectral expansion are described in Parts 3 and 4. A special case of the above construction, when $T=1$, can be found in Lemma 6.1 of [54].
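For intuition, a small NumPy sketch of the lifted chain $M$ (it enumerates all $(T+1)$-tuples for simplicity, so tuples that are not valid walks appear as states with zero stationary mass and could be pruned; the state space is exponential in $T$, so this is only for toy chains):

```python
import itertools
import numpy as np

def lifted_chain(P, T):
    """Markov chain M on (T+1)-grams: state (u_0, ..., u_T) moves to
    (u_1, ..., u_T, u') with probability P[u_T, u']."""
    n = P.shape[0]
    states = list(itertools.product(range(n), repeat=T + 1))
    index = {s: i for i, s in enumerate(states)}
    M = np.zeros((len(states), len(states)))
    for s in states:
        for nxt in range(n):
            t = s[1:] + (nxt,)
            M[index[s], index[t]] = P[s[-1], nxt]
    return M, states
```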
Step Two We define a matrix-valued function $F:\mathcal{S}\to\mathbb{R}^{n\times n}$ such that, for $u=(u_{0},\cdots,u_{T})\in\mathcal{S}$:
$$F(u)=\sum_{r=1}^{T}\frac{w_{r}}{2}\left(e_{u_{0}}e_{u_{r}}^{\top}+e_{u_{r}}e_{u_{0}}^{\top}\right)-C_{\infty}.\qquad(2)$$
With this definition of $F$, the difference between the co-occurrence matrix and its asymptotic expectation can be written as $C-C_{\infty}=\frac{1}{L-T}\sum_{i=1}^{L-T}F(s_{i})$. We can further show the following properties of this function $F$:
Claim 2 (Properties of $F$).
The function $F$ in Equation 2 satisfies (1) $\sum_{u\in\mathcal{S}}\sigma(u)F(u)=0$; (2) $F(u)$ is symmetric and $\left\|F(u)\right\|_{2}\le2$, $\forall u\in\mathcal{S}$.
This claim verifies that (after rescaling by a factor of 2) the function $F$ in Equation 2 satisfies the two conditions on the matrix-valued function in Theorem 3. The proof of Claim 2 is deferred to Appendix A.2 of the supplementary material.
Step Three The construction in Step Two reveals the fact that the error matrix $C-C_{\infty}$ can be written as the average of matrix-valued random variables (i.e., the $F(s_{i})$'s), which are sampled via a regular Markov chain $M$. This encourages us to directly apply Theorem 3. However, note that (1) the error probability in Theorem 3 contains a factor of the spectral gap $1-\lambda$; and (2) Part 4 of Claim 1 allows for the existence of a Markov chain $P$ with $\lambda(P)<1$ while the induced Markov chain $M$ has $\lambda(M)=1$. So we cannot directly apply Theorem 3 to $M$. To address this issue, we utilize the following tighter bound on sub-chains.
Claim 3.
(Claim 3.1 in Chung et al. [6]) Let $M$ be a regular Markov chain with $\epsilon$-mixing time $\tau(\epsilon)$; then the skipped chain $M^{\tau(\epsilon)}$ has spectral expansion bounded away from one. In particular, for a suitably small constant $\epsilon_{0}$, we have $\lambda\left(M^{\tau(\epsilon_{0})}\right)\le\frac{1}{2}$.
The above claim reveals the fact that, even though $M$ could have zero spectral gap (Part 4 of Claim 1), we can bound the spectral expansion of the skipped chain $M^{\tau}$, where $\tau\triangleq\tau_{M}(\epsilon_{0})$. We partition $(s_{1},\cdots,s_{L-T})$ into $\tau$ groups (without loss of generality, we assume $L-T$ is a multiple of $\tau$), such that the $i$-th group consists of the sub-chain $(s_{i},s_{i+\tau},s_{i+2\tau},\cdots)$ of length $\frac{L-T}{\tau}$. Each sub-chain can be viewed as generated from the Markov chain $M^{\tau}$. Applying Theorem 3 to the $i$-th sub-chain, whose starting distribution $\rho_{i}$ is the distribution of $s_{i}$, we have
$$\Pr\left[\lambda_{\max}\left(\frac{\tau}{L-T}\sum_{j\ge0}F\left(s_{i+j\tau}\right)\right)\ge\epsilon\right]\le\left\|\rho_{i}\right\|_{\sigma}n^{2}\exp\left(-\Omega\left(\frac{\epsilon^{2}(L-T)}{\tau}\right)\right)\le\left\|\phi\right\|_{\pi}n^{2}\exp\left(-\Omega\left(\frac{\epsilon^{2}(L-T)}{\tau}\right)\right),$$
where the last step follows by $\left\|\rho_{i}\right\|_{\sigma}\le\left\|\rho\right\|_{\sigma}$ and $\left\|\rho\right\|_{\sigma}=\left\|\phi\right\|_{\pi}$ (Part 2 of Claim 1). Together with a union bound across the $\tau$ sub-chains, we can obtain:
$$\Pr\left[\lambda_{\max}\left(\frac{1}{L-T}\sum_{i=1}^{L-T}F\left(s_{i}\right)\right)\ge\epsilon\right]\le\tau\left\|\phi\right\|_{\pi}n^{2}\exp\left(-\Omega\left(\frac{\epsilon^{2}(L-T)}{\tau}\right)\right).$$
The bound on $\lambda_{\min}$ also follows similarly. As $C-C_{\infty}$ is a real symmetric matrix, its 2-norm is its maximum absolute eigenvalue. Therefore, we can use the two eigenvalue bounds to bound the overall error probability in terms of the matrix 2-norm:
$$\Pr\left[\left\|C-C_{\infty}\right\|_{2}\ge\epsilon\right]\le2\tau\left\|\phi\right\|_{\pi}n^{2}\exp\left(-\Omega\left(\frac{\epsilon^{2}(L-T)}{\tau}\right)\right)\le2\left(\tau_{P}(\epsilon_{0})+T\right)\left\|\phi\right\|_{\pi}n^{2}\exp\left(-\Omega\left(\frac{\epsilon^{2}(L-T)}{\tau_{P}(\epsilon_{0})+T}\right)\right),$$
where the first inequality follows by a union bound, and the second inequality is due to $\tau=\tau_{M}(\epsilon_{0})\le\tau_{P}(\epsilon_{0})+T$ (Part 3 of Claim 1). This bound implies that the probability that $C$ deviates from $C_{\infty}$ can be made arbitrarily small by increasing the sampled trajectory length $L$. Specifically, if we want the event $\left\|C-C_{\infty}\right\|_{2}\ge\epsilon$ to happen with probability smaller than $\delta$, we need $L=O\left(\frac{\tau}{\epsilon^{2}}\left(\log n+\log\tau+\log\frac{\left\|\phi\right\|_{\pi}}{\delta}\right)\right)$. If we assume $\delta=\frac{1}{\operatorname{poly}(n)}$ and $\left\|\phi\right\|_{\pi}=\operatorname{poly}(n)$, we can achieve $L=O\left(\tau\left(\log n+\log\tau\right)\epsilon^{-2}\right)$. ∎
Our analysis can be extended to hidden Markov models (HMMs), as shown in Corollary 1, and has the potential to address problems raised in [17, 30]. Our strategy is to treat an HMM with observable state space $\mathcal{O}$ and hidden state space $\mathcal{H}$ as a Markov chain with state space $\mathcal{H}\times\mathcal{O}$. The detailed proof can be found in Appendix A.3 in the supplementary material.
Corollary 1 (Co-occurrence Matrices of HMMs).
For an HMM with observable state space $\mathcal{O}$ and hidden state space $\mathcal{H}$, let $\Pr[o\mid h]$ be the emission probability and $\Pr[h'\mid h]$ be the hidden state transition probability. Given an $L$-step trajectory of observations from the HMM, $(o_{1},\cdots,o_{L})$, one needs a trajectory of length $L=O\left(\tau\left(\log\left(|\mathcal{H}||\mathcal{O}|\right)+\log\tau\right)\epsilon^{-2}\right)$ to achieve a co-occurrence matrix of observations within error bound $\epsilon$ with high probability, where $\tau$ is the mixing time of the Markov chain on the hidden states.
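A minimal sketch of the joint-chain construction behind Corollary 1 (see Appendix A.3), under the convention that each observation is emitted by the current hidden state; `trans` and `emit` are our names for the two probability matrices:

```python
import numpy as np

def hmm_joint_chain(trans, emit):
    """Joint Markov chain on (hidden, observed) pairs:
    P[(h, o) -> (h2, o2)] = trans[h, h2] * emit[h2, o2].
    trans: |H| x |H| hidden transitions; emit: |H| x |O| emissions."""
    H, W = emit.shape
    P = np.zeros((H * W, H * W))
    for h in range(H):
        for o in range(W):
            for h2 in range(H):
                for o2 in range(W):
                    P[h * W + o, h2 * W + o2] = trans[h, h2] * emit[h2, o2]
    return P
```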
5 Experiments
In this section, we show experiments to illustrate the exponentially fast convergence rate of estimating co-occurrence matrices of Markov chains. We conduct experiments on three synthetic Markov chains (Barbell graph, winning streak chain, and random graph) and one real-world Markov chain (BlogCatalog). For each Markov chain and each trajectory length $L$ on a geometric grid, we measure the approximation error $\left\|C-C_{\infty}\right\|_{2}$ of the co-occurrence matrix constructed by Algorithm 1 from an $L$-step random walk sampled from the chain. We performed 64 trials for each experiment, and the results are aggregated as an error-bar plot. We set the window size $T$ to a small constant and the step weight coefficients to be uniform unless otherwise mentioned. The relationship between trajectory length and approximation error is shown in Figure 1 (in log-log scale). Across all four datasets, the observed exponentially fast convergence rates match what our bounds predict in Theorem 2. Below we discuss our observations for each of these datasets.
[Figure 1: Approximation error versus trajectory length $L$ (log-log scale): (a) Barbell graphs, (b) winning streak chains, (c) BlogCatalog, (d) Erdős–Rényi random graph.]
Barbell Graphs [43] The Barbell graph is an undirected graph with two cliques connected by a single path. Such graphs' mixing times vary greatly: two cliques of size $\frac{n}{2}$ connected by a single edge have mixing time $\Theta(n^{2})$, while two size-$\frac{n}{3}$ cliques connected by a length-$\frac{n}{3}$ path have mixing time about $\Theta(n^{3})$. We evaluate the convergence rate of co-occurrence matrices on the two graphs mentioned above, each with $n=100$ vertices. According to our bound of $L=O\left(\tau\left(\log n+\log\tau\right)\epsilon^{-2}\right)$, we expect the approximate co-occurrence matrix to converge faster when the path bridging the two cliques is shorter. The experimental results are shown in Figure 1a, and indeed display faster convergence when the path is shorter (since we fix $n=100$, a Barbell graph with clique size 50 has a shorter path connecting the two cliques than the one with clique size 33).
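A sketch of the Barbell construction used in this experiment (parameters are ours; with `path_len=0` the two cliques are joined by a single edge):

```python
import numpy as np

def barbell_walk(clique_size, path_len):
    """Transition matrix of the random walk on a barbell graph:
    two cliques of `clique_size` vertices joined by a path."""
    n = 2 * clique_size + path_len
    A = np.zeros((n, n))
    cliques = (range(clique_size), range(clique_size + path_len, n))
    for c in cliques:                        # two cliques
        for i in c:
            for j in c:
                if i != j:
                    A[i, j] = 1
    chain = list(range(clique_size - 1, clique_size + path_len + 1))
    for a, b in zip(chain, chain[1:]):       # bridging path
        A[a, b] = A[b, a] = 1
    return A / A.sum(axis=1, keepdims=True)  # row-normalize adjacency
```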
Winning Streak Chains (Section 4.6 of [25]) A winning streak Markov chain has state space $\{0,1,\cdots,n-1\}$, and can be viewed as tracking the number of consecutive 'tails' in a sequence of coin flips. Each state transits back to state $0$ with probability $\frac{1}{2}$, and to the next state with probability $\frac{1}{2}$. The $\epsilon$-mixing time of this chain satisfies $\tau(\epsilon)\le\lceil\log_{2}\frac{1}{\epsilon}\rceil$ and is independent of the number of states $n$. This prompted us to choose this chain, as we should expect similar rates of convergence for different values of $n$ according to our bound of $L=O\left(\tau\left(\log n+\log\tau\right)\epsilon^{-2}\right)$. In our experiment, we compare two winning streak chains of different sizes $n$ and illustrate the results in Figure 1b. As we can see, for each trajectory length $L$, the approximation errors of the two chains are indeed very close.
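A sketch of the winning streak transition matrix (the self-loop at the top state is one common convention for the boundary):

```python
import numpy as np

def winning_streak(n):
    """Winning streak chain on {0, ..., n-1}: from state i, return to 0
    with probability 1/2, otherwise advance to min(i + 1, n - 1)."""
    P = np.zeros((n, n))
    for i in range(n):
        P[i, 0] += 0.5
        P[i, min(i + 1, n - 1)] += 0.5
    return P
```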
BlogCatalog Graph [47] is widely used to benchmark graph representation learning algorithms [38, 12, 39]. It is an undirected graph of social relationships of online bloggers with 10,312 vertices and 333,983 edges. The random walk on the BlogCatalog graph has a small spectral expansion $\lambda$. Following Levin and Peres [25], we can upper bound its $\epsilon$-mixing time by $\tau(\epsilon)\le\frac{1}{1-\lambda}\log\frac{1}{\epsilon\,\pi_{\min}}$, where $\pi_{\min}=\min_{v}\pi_{v}$. We evaluate several window sizes $T$ and illustrate the results in Figure 1c. The convergence rate is robust to different values of $T$. Moreover, the variance in BlogCatalog is much smaller than that in the other datasets.
We further demonstrate how our result could be used to select parameters for a popular graph representation learning algorithm, DeepWalk [38]. We set the window size $T=10$, which is the default value of DeepWalk. Our bound on trajectory length in Theorem 2 then predicts a range of sufficient trajectory lengths $L$ for moderate error bounds $\epsilon$. To verify that this is a meaningful range for tuning $L$, we enumerate trajectory lengths on a geometric grid, estimate the co-occurrence matrix from the single trajectory sampled from BlogCatalog, convert the co-occurrence matrix to the one implicitly factorized by DeepWalk [38, 39], and factorize it with SVD. For comparison, we also provide the result in the limiting case ($L\to\infty$), where we directly compute the asymptotic expectation of the co-occurrence matrix according to Equation 1. The limiting case involves computing a matrix polynomial and can be very expensive. For the node classification task, the micro-F1 with a training ratio of 50% is:
Length $L$ of DeepWalk | (increasing $\to$) | | | | | | | $\infty$ (Equation 1)
---|---|---|---|---|---|---|---|---
Micro-F1 (%) | 15.21 | 18.31 | 26.99 | 33.85 | 39.12 | 41.28 | 41.58 | 41.82
As we can see, it is reasonable to choose $L$ in the range predicted by our bound.
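For completeness, a hedged sketch of the downstream step: a generic PMI-style transformation followed by SVD, in the spirit of [26, 39]. The exact matrix implicitly factorized by DeepWalk is derived in [39]; the function below is a simplified stand-in, not the paper's exact pipeline, and assumes every object occurs at least once in $C$:

```python
import numpy as np

def ppmi_svd_embedding(C, dim):
    """Embed objects by factorizing a truncated PMI transform of the
    co-occurrence matrix C with a rank-`dim` SVD."""
    row = C.sum(axis=1, keepdims=True)
    col = C.sum(axis=0, keepdims=True)
    total = C.sum()
    ppmi = np.log(np.maximum(C * total / (row @ col), 1.0))  # truncated PMI
    U, S, _ = np.linalg.svd(ppmi)
    return U[:, :dim] * np.sqrt(S[:dim])
```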
Random Graph The small variance observed on BlogCatalog leads us to hypothesize that it shares some traits with random graphs. To gather further evidence for this, we estimate the co-occurrence matrices of an Erdős–Rényi random graph for comparison. Specifically, we take a random graph on $n$ vertices where each undirected edge is added independently with probability $p$, a.k.a. the $G(n,p)$ model. The results in Figure 1d show very similar behavior to the BlogCatalog graph: small variance and robust convergence rates.
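Putting the sketches above together, a hypothetical driver that reproduces the shape of these experiments; it reuses `cooccurrence`, `asymptotic_cooccurrence`, and `barbell_walk` defined earlier (all names are ours):

```python
import numpy as np

def sample_walk(P, length, rng):
    """Sample a random walk on P, starting from a uniformly random state."""
    n = P.shape[0]
    v = rng.integers(n)
    walk = [v]
    for _ in range(length - 1):
        v = rng.choice(n, p=P[v])
        walk.append(v)
    return walk

rng = np.random.default_rng(0)
P = barbell_walk(clique_size=20, path_len=1)
n = P.shape[0]
pi = np.ones(n) @ np.linalg.matrix_power(P, 10**4)  # approximate pi
pi = pi / pi.sum()
T = 2
C_inf = asymptotic_cooccurrence(P, pi, w=[1 / T] * T)
for L in (10**3, 10**4, 10**5):                     # error should shrink with L
    C = cooccurrence(sample_walk(P, L, rng), n, T)
    print(L, np.linalg.norm(C - C_inf, 2))
```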
6 Conclusion and Future Work
In this paper, we analyze the convergence rate of estimating the co-occurrence matrix of a regular Markov chain. The main technical contribution of our work is to prove a Chernoff-type bound for sums of matrix-valued random variables sampled via a regular Markov chain, and we show that the problem of estimating co-occurrence matrices is a non-trivial application of this Chernoff-type bound. Our results show that, given a regular Markov chain with $n$ states and mixing time $\tau$, we need a trajectory of length $O\left(\tau\left(\log n+\log\tau\right)\epsilon^{-2}\right)$ to achieve an estimator of the co-occurrence matrix with error bound $\epsilon$. Our work leads to some natural future questions:
• Is it a tight bound? Our analysis of the convergence rate of co-occurrence matrices relies on a union bound, which probably gives a loose bound. It would be interesting to shave off the leading factor of $\tau$ in the bound, as the mixing time could be large for some Markov chains.
• What if the construction of the co-occurrence matrix is coupled with a learning algorithm? For example, in word2vec [33], the co-occurrence in each sliding window outputs a mini-batch to a logistic matrix factorization model. This problem can be formalized as the convergence of stochastic gradient descent with non-i.i.d. but Markovian random samples.
Broader Impact
Our work contributes to the research literature on Chernoff-type bounds and co-occurrence statistics. Chernoff-type bounds have become some of the most important probabilistic results in computer science. Our result generalizes the Chernoff bound to Markov dependence and random matrices. Co-occurrence statistics have emerged as important tools in machine learning. Our work addresses the sample complexity of estimating co-occurrence matrices. We believe such better theoretical understanding can further the understanding of the potential and limitations of graph representation learning and reinforcement learning.
Acknowledgments and Disclosure of Funding
We thank Jian Li (IIIS, Tsinghua) and Shengyu Zhang (Tencent Quantum Lab) for motivating this work. Funding in direct support of this work: Jiezhong Qiu and Jie Tang were supported by the National Key R&D Program of China (2018YFB1402600), NSFC for Distinguished Young Scholar (61825602), and NSFC (61836013). Richard Peng was partially supported by NSF grant CCF-1846218. There is no additional revenue related to this work.
References
- Ahlswede and Winter [2002] Rudolf Ahlswede and Andreas Winter. Strong converse for identification via quantum channels. IEEE Transactions on Information Theory, 2002.
- Barkan and Koenigstein [2016] Oren Barkan and Noam Koenigstein. Item2vec: neural item embedding for collaborative filtering. In 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP), 2016.
- Cheng et al. [2015a] Dehua Cheng, Yu Cheng, Yan Liu, Richard Peng, and Shang-Hua Teng. Efficient sampling for gaussian graphical models via spectral sparsification. In COLT ’15, 2015a.
- Cheng et al. [2015b] Dehua Cheng, Yu Cheng, Yan Liu, Richard Peng, and Shang-Hua Teng. Spectral sparsification of random-walk matrix polynomials. arXiv preprint arXiv:1502.03496, 2015b.
- Chernoff et al. [1952] Herman Chernoff et al. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics, 1952.
- Chung et al. [2012] Kai-Min Chung, Henry Lam, Zhenming Liu, and Michael Mitzenmacher. Chernoff-hoeffding bounds for markov chains: Generalized and simplified. In STACS’ 12, 2012.
- Dong et al. [2017] Yuxiao Dong, Nitesh V Chawla, and Ananthram Swami. metapath2vec: Scalable representation learning for heterogeneous networks. In KDD ’17, 2017.
- Dongarra et al. [1984] JJ Dongarra, JR Gabriel, DD Koelling, and JH Wilkinson. The eigenvalue problem for hermitian matrices with time reversal symmetry. Linear Algebra and its Applications, 1984.
- Fill [1991] James Allen Fill. Eigenvalue bounds on convergence to stationarity for nonreversible markov chains, with an application to the exclusion process. The annals of applied probability, 1991.
- Garg et al. [2018] Ankit Garg, Yin Tat Lee, Zhao Song, and Nikhil Srivastava. A matrix expander chernoff bound. In STOC ’18, 2018.
- Gillman [1998] David Gillman. A chernoff bound for random walks on expander graphs. SIAM Journal on Computing, 1998.
- Grover and Leskovec [2016] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In KDD ’16, 2016.
- Hamilton et al. [2017] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In NeurIPS ’17, 2017.
- Healy [2008] Alexander D Healy. Randomness-efficient sampling within nc. Computational Complexity, 2008.
- Hsu et al. [2019] Daniel Hsu, Aryeh Kontorovich, David A Levin, Yuval Peres, Csaba Szepesvári, Geoffrey Wolfer, et al. Mixing time estimation in reversible markov chains from a single sample path. The Annals of Applied Probability, 2019.
- Hsu et al. [2015] Daniel J Hsu, Aryeh Kontorovich, and Csaba Szepesvári. Mixing time estimation in reversible markov chains from a single sample path. In NeurIPS ’15, 2015.
- Huang et al. [2018] Kejun Huang, Xiao Fu, and Nicholas Sidiropoulos. Learning hidden markov models from pairwise co-occurrences with application to topic modeling. In ICML ’18, 2018.
- Jerrum and Sinclair [1996] Mark Jerrum and Alistair Sinclair. The markov chain monte carlo method: an approach to approximate counting and integration. Approximation Algorithms for NP-hard problems, PWS Publishing, 1996.
- Kahale [1997] Nabil Kahale. Large deviation bounds for markov chains. Combinatorics, Probability and Computing, 1997.
- Karlin [2014] Samuel Karlin. A first course in stochastic processes. Academic press, 2014.
- Kearns et al. [1994] Michael J Kearns, Umesh Virkumar Vazirani, and Umesh Vazirani. An introduction to computational learning theory. MIT press, 1994.
- Kipf and Welling [2017] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR ’17, 2017.
- Kontorovich et al. [2013] Aryeh Kontorovich, Boaz Nadler, and Roi Weiss. On learning parametric-output hmms. In ICML ’13, 2013.
- León et al. [2004] Carlos A León, François Perron, et al. Optimal hoeffding bounds for discrete reversible markov chains. The Annals of Applied Probability, 2004.
- Levin and Peres [2017] David A Levin and Yuval Peres. Markov chains and mixing times, volume 107. American Mathematical Soc., 2017.
- Levy and Goldberg [2014] Omer Levy and Yoav Goldberg. Neural Word Embedding as Implicit Matrix Factorization. In NeurIPS ’14. 2014.
- Lezaud [1998] Pascal Lezaud. Chernoff-type bound for finite markov chains. Annals of Applied Probability, 1998.
- Liang et al. [2016] Dawen Liang, Jaan Altosaar, Laurent Charlin, and David M Blei. Factorization meets the item embedding: Regularizing matrix factorization with item co-occurrence. In RecSys ’16, 2016.
- Liu et al. [2017] David C Liu, Stephanie Rogers, Raymond Shiau, Dmitry Kislyuk, Kevin C Ma, Zhigang Zhong, Jenny Liu, and Yushi Jing. Related pins at pinterest: The evolution of a real-world recommender system. In WWW ’17, 2017.
- Mattila et al. [2020] Robert Mattila, Cristian R Rojas, Eric Moulines, Vikram Krishnamurthy, and Bo Wahlberg. Fast and consistent learning of hidden markov models by incorporating non-consecutive correlations. In ICML ’20, 2020.
- Mihail [1989] Milena Mihail. Conductance and convergence of markov chains-a combinatorial treatment of expanders. In FOCS ’89, 1989.
- Mikolov et al. [2013a] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In ICLR Workshop ’13, 2013a.
- Mikolov et al. [2013b] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In NeurIPS’ 13. 2013b.
- Mikolov et al. [2013c] Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In NAACL ’13, 2013c.
- Motwani and Raghavan [1995] Rajeev Motwani and Prabhakar Raghavan. Randomized algorithms. Cambridge university press, 1995.
- Ortner [2020] Ronald Ortner. Regret bounds for reinforcement learning via markov chain concentration. Journal of Artificial Intelligence Research, 2020.
- Pennington et al. [2014] Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In EMNLP ’14, 2014.
- Perozzi et al. [2014] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. In KDD ’14, 2014.
- Qiu et al. [2018] Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Kuansan Wang, and Jie Tang. Network embedding as matrix factorization: Unifying deepwalk, line, pte, and node2vec. In WSDM ’18, 2018.
- Rahimi and Recht [2008] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In NeurIPS ’08, 2008.
- Rao and Regev [2017] Shravas Rao and Oded Regev. A sharp tail bound for the expander random sampler. arXiv preprint arXiv:1703.10205, 2017.
- Rudelson [1999] Mark Rudelson. Random vectors in the isotropic position. Journal of Functional Analysis, 1999.
- Sauerwald and Zanetti [2019] Thomas Sauerwald and Luca Zanetti. Random walks on dynamic graphs: Mixing times, hitting times, and return probabilities. In ICALP 2019, 2019.
- Shani et al. [2005] Guy Shani, David Heckerman, and Ronen I Brafman. An mdp-based recommender system. JMLR ’05, 2005.
- Sutter et al. [2017] David Sutter, Mario Berta, and Marco Tomamichel. Multivariate trace inequalities. Communications in Mathematical Physics, 352(1):37–58, 2017.
- Tang et al. [2015] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. Line: Large-scale information network embedding. In WWW ’15, 2015.
- Tang and Liu [2009] Lei Tang and Huan Liu. Relational learning via latent social dimensions. In KDD ’09, 2009.
- Tekin and Liu [2010] Cem Tekin and Mingyan Liu. Online algorithms for the multi-armed bandit problem with markovian rewards. In 2010 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1675–1682. IEEE, 2010.
- Tennenholtz and Mannor [2019] Guy Tennenholtz and Shie Mannor. The natural language of actions. In ICML ’19, 2019.
- Tropp [2015] Joel A Tropp. An introduction to matrix concentration inequalities. arXiv preprint arXiv:1501.01571, 2015.
- Vasile et al. [2016] Flavian Vasile, Elena Smirnova, and Alexis Conneau. Meta-prod2vec: Product embeddings using side-information for recommendation. In RecSys ’16, 2016.
- Wagner [2008] Roy Wagner. Tail estimates for sums of variables sampled by a random walk. Combinatorics, Probability and Computing, 2008.
- Wigderson and Xiao [2005] Avi Wigderson and David Xiao. A randomness-efficient sampler for matrix-valued functions and applications. In FOCS’05, 2005.
- Wolfer and Kontorovich [2019] Geoffrey Wolfer and Aryeh Kontorovich. Estimating the mixing time of ergodic markov chains. In COLT ’19, 2019.
- Yu [1994] Bin Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, pages 94–116, 1994.
Supplementary Material of A Matrix Chernoff Bound for Markov Chains and Its Application to Co-occurrence Matrices
Appendix A Convergence Rate of Co-occurrence Matrices
A.1 Proof of Claim 1
Claim 1 (restated).
Proof.
We prove the four parts of this claim one by one.
Part 1 To prove $M$ is regular, it is sufficient to show that every state of $M$ can reach every other state in any sufficiently large number of steps. We know $P$ is a regular Markov chain, so there exists $t_{0}$ s.t., for any $t\ge t_{0}$, any state of $P$ can reach any other state in exactly $t$ steps; i.e., for $u,w\in\mathcal{S}$, there is a $t$-step walk from $u_{T}$ to $w_{0}$ on $P$. This induces a walk from $u$ toward $w$ on $M$; taking $T$ further steps through $w_{1},\cdots,w_{T}$, we construct a $(t+T)$-step walk from $u$ to $w$. Since this is true for any $t\ge t_{0}$, we can claim that any state of $M$ can be reached from any other state in any number of steps greater than or equal to $t_{0}+T$. Next we verify that $\sigma$ is the stationary distribution of Markov chain $M$:
Part 2 Recall that $(s_{1},\cdots,s_{L-T})$ is derived from a random walk $(v_{1},\cdots,v_{L})$ on $P$ starting from distribution $\phi$, so the probability that we observe $s_{1}=(v_{1},\cdots,v_{T+1})$ is $\phi_{v_{1}}P(v_{1},v_{2})\cdots P(v_{T},v_{T+1})$, i.e., $s_{1}$ is sampled from the distribution $\rho$. We then study the transition probability from $s_{i}$ to $s_{i+1}$, which matches the definition of $M$. Consequently, we can claim that $(s_{1},\cdots,s_{L-T})$ is a random walk on $M$. Moreover,
which implies $\left\|\rho\right\|_{\sigma}=\left\|\phi\right\|_{\pi}$.
Part 3 For any distribution $x$ on $\mathcal{S}$, define the distribution $y$ on $[n]$ such that $y_{v}=\sum_{u\in\mathcal{S}:u_{T}=v}x_{u}$. It is easy to see that $y$ is a probability vector, since $y_{v}$ is the marginal probability of the last state of the window. For convenience, we assume for the moment that distributions are row vectors. We can see that:
which indicates $\tau_{M}(\epsilon)\le\tau_{P}(\epsilon)+T$.
Part 4 This is an example showing that $\lambda(M)$ cannot be bounded in terms of $\lambda(P)$: even though $P$ has $\lambda(P)<1$, the induced $M$ may have $\lambda(M)=1$. We consider the random walk on a small unweighted, undirected graph.
The transition probability matrix is:
with stationary distribution $\pi$ proportional to the vertex degrees, and $\lambda(P)<1$. When $T=1$, the induced Markov chain $M$ has stationary distribution $\sigma(u,v)=\frac{1}{2m}$ for each directed edge $(u,v)$, where $m$ is the number of edges in the graph. Construct $x$ such that
The constructed vector has norm
It is easy to check that $x$ is orthogonal to $\sigma$ under the $\sigma$-kernel. Letting $y=xM$, we have, entrywise:
This vector has norm:
Thus we have $\left\|xM\right\|_{\sigma}=\left\|x\right\|_{\sigma}$. Taking the maximum over all possible $x$ gives $\lambda(M)\ge1$. Also note the fact that $\lambda(M)\le1$ always holds, so $\lambda(M)=1$. ∎
A.2 Proof of Claim 2
Claim 2 (restated).
A.3 Proof of Corollary 1
Corollary 1 (restated).
Proof.
An HMM can be modeled by a Markov chain on $\mathcal{H}\times\mathcal{O}$ such that the transition probability from $(h,o)$ to $(h',o')$ is $\Pr[h'\mid h]\Pr[o'\mid h']$. For the co-occurrence matrix of observable states, applying a proof similar to that of our Theorem 2 shows that one needs a trajectory of length $O\left(\tau\left(\log\left(|\mathcal{H}||\mathcal{O}|\right)+\log\tau\right)\epsilon^{-2}\right)$ to achieve error bound $\epsilon$ with high probability. Moreover, the mixing time of this Markov chain is bounded by the mixing time of the Markov chain on the hidden state space (i.e., $\tau$). ∎
Appendix B Matrix Chernoff Bounds for Markov Chains
B.1 Preliminaries
Kronecker Products If $A$ is an $m\times n$ matrix and $B$ is a $p\times q$ matrix, then the Kronecker product $A\otimes B$ is the $mp\times nq$ block matrix such that
$$A\otimes B=\begin{pmatrix}A_{1,1}B&\cdots&A_{1,n}B\\\vdots&\ddots&\vdots\\A_{m,1}B&\cdots&A_{m,n}B\end{pmatrix}.$$
The Kronecker product has the mixed-product property: if $A,B,C,D$ are matrices of such size that one can form the matrix products $AC$ and $BD$, then $(A\otimes B)(C\otimes D)=(AC)\otimes(BD)$.
Vectorization For a matrix $A\in\mathbb{C}^{m\times n}$, we denote by $\operatorname{vec}(A)$ the vectorization of the matrix $A$, s.t. $\operatorname{vec}(A)=\left(A_{1,1},\cdots,A_{1,n},A_{2,1},\cdots,A_{m,n}\right)^{\top}$, which is the stack of rows of $A$. There is a relationship between matrix multiplication and the Kronecker product, s.t. $\operatorname{vec}(AXB)=\left(A\otimes B^{\top}\right)\operatorname{vec}(X)$.
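A quick NumPy check of this row-stacking identity (`flatten` uses row-major order, matching the convention above):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((3, 4))
X = rng.random((4, 5))
B = rng.random((5, 2))
# vec(A X B) = (A kron B^T) vec(X) under row-stacking vectorization
lhs = (A @ X @ B).flatten()
rhs = np.kron(A, B.T) @ X.flatten()
assert np.allclose(lhs, rhs)
```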
Matrices and Norms For a matrix $A$, we use $A^{\top}$ to denote the matrix transpose, $\overline{A}$ to denote entry-wise matrix conjugation, and $A^{*}$ to denote the matrix conjugate transpose ($A^{*}=\overline{A}^{\top}$). The vector 2-norm is defined to be $\left\|x\right\|_{2}=\sqrt{x^{*}x}$, and the matrix 2-norm is defined to be $\left\|A\right\|_{2}=\max_{x\neq0}\frac{\left\|Ax\right\|_{2}}{\left\|x\right\|_{2}}$.
We then recall the definition of the inner product under the $\pi$-kernel from Section 2. The inner product under the $\pi$-kernel for $x,y\in\mathbb{C}^{N}$ is $\langle x,y\rangle_{\pi}=\sum_{v\in[N]}\frac{x_{v}\overline{y_{v}}}{\pi_{v}}$, and its induced $\pi$-norm is $\left\|x\right\|_{\pi}=\sqrt{\langle x,x\rangle_{\pi}}$. The above definition allows us to define an inner product under the $\pi$-kernel on a larger space:
Definition 1.
Define inner product on under -kernel to be .
Remark 1.
For and , then inner product (under -kernel) between and can be simplified as
Remark 2.
The induced -norm is . When , the -norm can be simplified to be: .
Matrix Exponential The matrix exponential of a matrix $A$ is defined by the Taylor expansion $\exp(A)=\sum_{i=0}^{\infty}\frac{A^{i}}{i!}$. We will also use the fact that $\exp\left(A\otimes I+I\otimes B\right)=\exp(A)\otimes\exp(B)$, since $A\otimes I$ and $I\otimes B$ commute.
B.2 Proof of Theorem 3
Theorem 3 (restated).
Proof.
Due to symmetry, it suffices to prove one of the two statements. Let $t>0$ be a parameter to be chosen later. Then
$$\Pr\left[\lambda_{\max}\left(\frac{1}{k}\sum_{j=1}^{k}f(v_{j})\right)\ge\epsilon\right]\le\Pr\left[\operatorname{tr}\exp\left(t\sum_{j=1}^{k}f(v_{j})\right)\ge e^{tk\epsilon}\right]\le e^{-tk\epsilon}\,\mathbb{E}\left[\operatorname{tr}\exp\left(t\sum_{j=1}^{k}f(v_{j})\right)\right].\qquad(3)$$
The second inequality follows from Markov's inequality.
We next bound the expected trace term. Using Theorem 4, we have:
where the second step follows by the concavity of the $\log$ function and the fact that $\mu$ is a probability distribution on $\left[-\frac{\pi}{2},\frac{\pi}{2}\right]$. This implies
Note that, choosing the parameter $t$ appropriately, we have
Combining the above two equations together, we have
(4)
We now recall Lemma 1. Assuming the above lemma, we can complete the proof of the theorem as:
(5)
where the first step follows by Equation 4, the second step follows by swapping $\mathbb{E}$ and the integral, the third step follows by Lemma 1, the fourth step follows by our choice of parameters, and the last step follows since $\mu$ is a probability distribution on $\left[-\frac{\pi}{2},\frac{\pi}{2}\right]$.
B.3 Proof of Lemma 1
Lemma 1 (restated).
Proof.
After rewriting the product using the vectorization and Kronecker-product identities from Section B.1, the trace term in the LHS of Lemma 1 becomes
(6)
By iteratively applying the mixed-product property of the Kronecker product, we have
where we define
(7)
Plugging this into the trace term, we have
Next, taking the expectation of Equation 6 gives
(8) |
We turn to studying this expectation, which is characterized by the following lemma:
Lemma 2.
For a random walk $(v_{1},\cdots,v_{k})$ such that $v_{1}$ is sampled from an arbitrary probability distribution $\phi$ on $[N]$, the following identity holds, where $\mathbf{1}$ is the all-ones vector.
Proof.
Given Lemma 2, Equation 8 becomes:
The third equality is due to . The fourth equality is by setting (scalar) in . Then
where we define and . Moreover, by Remark 2, we have and
Definition 2.
Define linear subspace .
Remark 3.
is an orthonormal basis of . This is because by Remark 1, where is the Kronecker delta.
Remark 4.
Given . The projection of onto is . This is because
We want to bound
As it can be expressed by recursively applying the two operators, we turn to analyzing the effects of both operators.
Definition 3.
The spectral expansion of is defined as
Lemma 3.
.
Proof.
First we show one direction. Suppose the maximizer of is , i.e., . Construct for arbitrary non-zero . It is easy to check that , because , where the last equality is due to . Then we can bound such that
which indicates that for , . Taking the maximum over all gives .
Next we show the other direction. For such that and , we can decompose it as
where we define for . We can observe that , because for , we have
which indicates . Furthermore, we can also observe that the components are pairwise orthogonal. This is because for , , which allows us to apply the Pythagorean theorem such that .
We can use a similar decomposition to analyze :
where we can observe that the components are pairwise orthogonal. This is because for , we have . Again, applying the Pythagorean theorem gives:
which indicates that for such that and , we have , or equivalently .
Overall, we have shown both inequalities. We conclude that . ∎
Lemma 4.
(The effect of the operator) This lemma is a generalization of Lemma 3.3 in [6].
1. , then .
2. , then , and .
Proof.
Remark 5.
Lemma 4 implies that
1. ;
2. .
Lemma 5.
(The effect of the operator) Given three parameters and . Let be a regular Markov chain on state space , with stationary distribution and spectral expansion . Suppose each state is assigned a matrix s.t. and . Let and denote the block matrix whose -th diagonal block is the matrix , i.e., . Then for any , we have:
1. , where .
2. , where .
3. , where .
4. , where .
Proof.
(of Lemma 5) We first show that, for ,
Due to the linearity of projection,
(9)
where the second inequality follows by Remark 4.
Proof of Lemma 5, Part 1 Firstly, we can bound by
where the first step follows by , the second step follows by the definition of the matrix exponential, the third step follows by , and the fourth step follows by the triangle inequality. Given the above bound, for any which can be written as for some , we have
where step one follows by Part 1 of Remark 5 and step two follows by Equation 9.
Proof of Lemma 5, Part 2 For , we can write it as block matrix such that:
where each . Please note that above decomposition is pairwise orthogonal. Applying Pythagorean theorem gives . Similarly, we can decompose such that
(10)
Note that the above decomposition is pairwise orthogonal, too. Applying the Pythagorean theorem gives
which indicates
Now we can formally prove Part 2 of Lemma 5 by:
The first step follows by Part 2 of Remark 5, the second step follows by Part 1 of Lemma 4, and the fourth step is due to .
Recursive Analysis We now use Lemma 5 to analyze the evolution of and . Let in Lemma 5. We can verify the following three facts: (1) ; (2) is bounded; (3) .
Firstly, it is easy to see that
where the first step follows by definition of and the second step follows by the fact that , and the last step follows by Equation 7.
Secondly, we can bound by:
where the first step follows by triangle inequality, the second step follows by the fact that , the third step follows by and . We set to satisfy the assumption in Lemma 5 that . According to the conditions in Lemma 1, we know that and .
Finally, we show that , because
where the last step follows by .
Claim 4.
.
Proof.
Claim 5.
.
B.4 Proof of Theorem 1
See 1
Proof.
(of Theorem 1) Our strategy is to adopt the complexification technique [8]. For any complex Hermitian matrix $H$, we may write $H=A+iB$, where $A$ and $B$ are the real and imaginary parts of $H$, respectively. Moreover, the Hermitian property of $H$ (i.e., $H=H^{*}$) implies that (1) $A$ is real and symmetric (i.e., $A=A^{\top}$); (2) $B$ is real and skew-symmetric (i.e., $B=-B^{\top}$). The eigenvalues of $H$ can be found via the real symmetric matrix
$$\hat{H}=\begin{pmatrix}A&-B\\B&A\end{pmatrix},$$
where the symmetry of $\hat{H}$ follows by the symmetry of $A$ and the skew-symmetry of $B$. Note the fact that, if the (real) eigenvalues of $H$ are $\lambda_{1},\cdots,\lambda_{d}$, then those of $\hat{H}$ are $\lambda_{1},\lambda_{1},\cdots,\lambda_{d},\lambda_{d}$. I.e., $H$ and $\hat{H}$ have the same eigenvalues, but with different multiplicities.
Using the above technique, we can formally prove Theorem 1. For the complex matrix-valued function $f$ in Theorem 1, we can separate its real and imaginary parts by $f(v)=A(v)+iB(v)$. Then we construct a real-valued matrix function $\hat{f}$ s.t. $\hat{f}(v)=\begin{pmatrix}A(v)&-B(v)\\B(v)&A(v)\end{pmatrix}$, $\forall v\in[N]$. According to the complexification technique, we know that (1) $\hat{f}(v)$ is real symmetric and $\|\hat{f}(v)\|_{2}\le1$; (2) $\sum_{v\in[N]}\pi_{v}\hat{f}(v)=0$. Then
where the first step follows by the fact that $f(v_{j})$ and $\hat{f}(v_{j})$ have the same eigenvalues (with different multiplicities), and the second step follows by Theorem 3 (the additional factor of 4 arises because the constructed $\hat{f}$ has shape $2d\times2d$). The bound on $\lambda_{\min}$ also follows similarly. ∎