A Matrix Chernoff Bound for Markov Chains and Its Application to Co-occurrence Matrices
Abstract
We prove a Chernoff-type bound for sums of matrix-valued random variables sampled via a regular (aperiodic and irreducible) finite Markov chain. Specifically, consider a random walk on a regular Markov chain and a Hermitian matrix-valued function on its state space. Our result gives exponentially decreasing bounds on the tail distributions of the extreme eigenvalues of the sample mean matrix. Our proof is based on the matrix expander (regular undirected graph) Chernoff bound [Garg et al. STOC ’18] and scalar Chernoff-Hoeffding bounds for Markov chains [Chung et al. STACS ’12].
Our matrix Chernoff bound for Markov chains can be applied to analyze the behavior of co-occurrence statistics for sequential data, which have been common and important data signals in machine learning. We show that given a regular Markov chain with $n$ states and mixing time $\tau$, we need a trajectory of length $O\left(\tau(\log n+\log\tau)\epsilon^{-2}\right)$ to achieve an estimator of the co-occurrence matrix with error bound $\epsilon$. We conduct several experiments and the experimental results are consistent with the exponentially fast convergence rate from the theoretical analysis. Our result gives the first bound on the convergence rate of the co-occurrence matrix and the first sample complexity analysis in graph representation learning.
1 Introduction
The Chernoff bound [5], which gives exponentially decreasing bounds on tail distributions of sums of independent scalar-valued random variables, is one of the most basic and versatile tools in theoretical computer science, with countless applications to practical problems [21, 35]. There are two notable limitations when applying the Chernoff bound to analyze sample complexity in real-world machine learning problems. First, in many cases the random variables have dependence, e.g., Markov dependence [20] in MCMC [18] and online learning [48]. Second, applications are often concerned with the concentration behavior of quantities beyond scalar-valued random variables, e.g., random features in kernel machines [40] and co-occurrence statistics, which are random matrices [38, 39].
Existing research has attempted to extend the original Chernoff bound along one of these two directions [19, 11, 27, 24, 53, 14, 6, 41, 52, 42, 1, 50]. Wigderson and Xiao [53] conjectured that Chernoff bounds can be generalized to both matrix-valued random variables and Markov dependence, while restricting the Markov dependence to be a random walk on a regular undirected graph. The conjecture was recently proved by Garg et al. [10], based on a new multi-matrix extension of the Golden-Thompson inequality [45]. However, the regular undirected graph is a special case of Markov chains which are reversible and have a uniform stationary distribution, and does not cover practical settings such as random walks on generic graphs. It remains an open question to establish a Chernoff bound for matrix-valued random variables under more general Markov dependence.
In this work, we establish large deviation bounds for the tail probabilities of the extreme eigenvalues of sums of random matrices sampled via a regular Markov chain (note that regular Markov chains are Markov chains which are aperiodic and irreducible, while an undirected regular graph is an undirected graph where each vertex has the same number of neighbors; the term "regular" may thus have different meanings depending on the context) starting from an arbitrary distribution (not necessarily the stationary distribution), which significantly improves the result of Garg et al. [10]. More formally, we prove the following theorem:
Theorem 1 (Markov Chain Matrix Chernoff Bound).
Let $P$ be a regular Markov chain with state space $[N]$, stationary distribution $\pi$ and spectral expansion $\lambda$. Let $f:[N]\to\mathbb{C}^{d\times d}$ be a function such that (1) $\forall v\in[N]$, $f(v)$ is Hermitian and $\left\|f(v)\right\|_{2}\le1$; (2) $\sum_{v\in[N]}\pi_{v}f(v)=0$. Let $(v_{1},\cdots,v_{k})$ denote a $k$-step random walk on $P$ starting from a distribution $\phi$ on $[N]$. Given $\epsilon\in(0,1)$,
$$\Pr\left[\lambda_{\max}\left(\frac{1}{k}\sum_{j=1}^{k}f(v_{j})\right)\ge\epsilon\right]\le4\left\|\phi\right\|_{\pi}\,d^{2}\exp\left(-\Omega\left(\epsilon^{2}(1-\lambda)k\right)\right),$$
and the same bound holds for $\Pr\left[\lambda_{\min}\left(\frac{1}{k}\sum_{j=1}^{k}f(v_{j})\right)\le-\epsilon\right]$.
In the above theorem, $\left\|\phi\right\|_{\pi}$ is the $\pi$-norm (which we define formally later in Section 2) measuring the distance between the initial distribution $\phi$ and the stationary distribution $\pi$. Our strategy is to incorporate the concentration of matrix-valued functions from [10] into the study of general Markov chains from [6], which was originally for scalars.
1.1 Applications to Co-occurrence Matrices of Markov Chains
The co-occurrence statistics have recently emerged as common and important data signals in machine learning, providing rich correlation and clustering information about the underlying object space, such as word co-occurrence in natural language processing [32, 33, 34, 26, 37], vertex co-occurrence in graph learning [38, 46, 12, 13, 7, 39], item co-occurrence in recommender systems [44, 28, 2, 51, 29], action co-occurrence in reinforcement learning [49], and emission co-occurrence of hidden Markov models [23, 17, 30]. Given a sequence of objects $(v_{1},\cdots,v_{L})$, the co-occurrence statistics are computed by moving a sliding window of fixed size $T$ over the sequence and recording the frequency of objects' co-occurrence within the sliding window. Pseudocode of the above procedure is listed in Algorithm 1, which produces an $n$ by $n$ co-occurrence matrix $C$, where $n$ is the size of the object space.
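To make the procedure concrete, the following is a minimal NumPy sketch of Algorithm 1 with uniform step weights $w_{r}=1/T$ (the function name and interface are ours, not part of the original pseudocode):

```python
import numpy as np

def cooccurrence(seq, n, T):
    """Sliding-window co-occurrence counting as in Algorithm 1.

    seq: object ids in [0, n); n: size of the object space; T: window size.
    Returns the n-by-n co-occurrence matrix with uniform weights w_r = 1/T.
    """
    C = np.zeros((n, n))
    L = len(seq)
    for i in range(L - T):           # slide the window over the sequence
        for r in range(1, T + 1):    # pairs (v_i, v_{i+r}) inside the window
            u, v = seq[i], seq[i + r]
            C[u, v] += 0.5           # half weight to each ordered pair,
            C[v, u] += 0.5           # which symmetrizes the matrix
    return C / ((L - T) * T)         # average over windows; w_r = 1/T
```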
A common assumption when building such co-occurrence matrices is that the sequential data are long enough to provide an accurate estimation. For instance, Mikolov et al. [33] use a news article dataset with one billion words in their Skip-gram model; Tennenholtz and Mannor [49] train their Act2vec model with action sequences from over a million StarCraft II game replays, which are equivalent to 100 years of consecutive gameplay; Perozzi et al. [38] sample large amounts of random walk sequences from graphs to capture vertex co-occurrence. A recent work by Qiu et al. [39] studies the convergence of co-occurrence matrices of random walks on undirected graphs in the limit (i.e., when the length of the random walk goes to infinity), but leaves the convergence rate as an open problem. It remains unknown whether the co-occurrence statistics are sample efficient, and how efficient they are.
In this paper, we study the situation where the sequential data are sampled from a regular finite Markov chain (i.e., an aperiodic and irreducible finite Markov chain), and derive bounds on the sample efficiency of co-occurrence matrix estimation, specifically on the length of the trajectory needed in the sampling algorithm shown in Algorithm 1. To give a formal statement, we first translate Algorithm 1 into linear algebra language. Given a trajectory $(v_{1},\cdots,v_{L})$ from state space $[n]$ and step weight coefficients $(w_{1},\cdots,w_{T})$, the co-occurrence matrix $C$ is defined to be
$$C=\frac{1}{L-T}\sum_{i=1}^{L-T}C_{i},\qquad\text{where}\quad C_{i}=\sum_{r=1}^{T}\frac{w_{r}}{2}\left(e_{v_{i}}e_{v_{i+r}}^{\top}+e_{v_{i+r}}e_{v_{i}}^{\top}\right).$$
Here $C_{i}$ accounts for the co-occurrence within sliding window $(v_{i},\cdots,v_{i+T})$, and $e_{v}$ is a length-$n$ vector with a one in its $v$-th entry and zeros elsewhere. Thus $e_{v_{i}}e_{v_{i+r}}^{\top}$ is an $n$ by $n$ matrix whose $(v_{i},v_{i+r})$-th entry is one and all other entries are zero, which records the co-occurrence of $v_{i}$ and $v_{i+r}$. Note that Algorithm 1 is a special case where the step weight coefficients are uniform, i.e., $w_{r}=\frac{1}{T}$ for $r=1,\cdots,T$, and the co-occurrence statistics in all the applications mentioned above can be formalized in this way. When the trajectory $(v_{1},\cdots,v_{L})$ is a random walk on a regular Markov chain $P$ with stationary distribution $\pi$, the asymptotic expectation of the co-occurrence matrix within sliding window $(v_{i},\cdots,v_{i+T})$ is
$$\lim_{i\to\infty}\mathbb{E}\left[C_{i}\right]=\sum_{r=1}^{T}\frac{w_{r}}{2}\left(\Pi P^{r}+\left(P^{r}\right)^{\top}\Pi\right),$$
where $\Pi\triangleq\operatorname{diag}(\pi_{1},\cdots,\pi_{n})$. Thus the asymptotic expectation of the co-occurrence matrix is
$$C_{\infty}\triangleq\lim_{L\to\infty}\mathbb{E}\left[C\right]=\sum_{r=1}^{T}\frac{w_{r}}{2}\left(\Pi P^{r}+\left(P^{r}\right)^{\top}\Pi\right).\qquad(1)$$
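Equation 1 can be evaluated directly from the transition matrix; a minimal NumPy sketch (function name ours), useful later for comparing the empirical estimate against its limit:

```python
import numpy as np

def asymptotic_cooccurrence(P, pi, w):
    """Asymptotic expectation of the co-occurrence matrix (Equation 1):
    C_inf = sum_r (w_r / 2) * (Pi @ P^r + (P^r).T @ Pi), Pi = diag(pi)."""
    n = P.shape[0]
    Pi = np.diag(pi)
    C_inf = np.zeros((n, n))
    Pr = np.eye(n)
    for w_r in w:                    # w = (w_1, ..., w_T)
        Pr = Pr @ P                  # now Pr = P^r
        C_inf += (w_r / 2) * (Pi @ Pr + Pr.T @ Pi)
    return C_inf
```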
Our main result regarding the estimation of the co-occurrence matrix is the following convergence bound related to the length of the walk sampled.
Theorem 2 (Convergence Rate of Co-occurrence Matrices).
Let $P$ be a regular Markov chain with state space $[n]$, stationary distribution $\pi$ and mixing time $\tau$. Let $(v_{1},\cdots,v_{L})$ denote an $L$-step random walk on $P$ starting from a distribution $\phi$ on $[n]$. Given step weight coefficients $(w_{1},\cdots,w_{T})$ s.t. $\sum_{r=1}^{T}\left|w_{r}\right|=1$, and $\epsilon\in(0,1)$, the probability that the co-occurrence matrix $C$ deviates from its asymptotic expectation $C_{\infty}$ (in 2-norm) is bounded by:
$$\Pr\left[\left\|C-C_{\infty}\right\|_{2}\ge\epsilon\right]\le O\left(\tau\left\|\phi\right\|_{\pi}n^{2}\right)\exp\left(-\Omega\left(\frac{\epsilon^{2}L}{\tau}\right)\right).$$
Specifically, there exists a trajectory length $L=O\left(\frac{\tau}{\epsilon^{2}}\left(\log n+\log\tau+\log\frac{\left\|\phi\right\|_{\pi}}{\delta}\right)\right)$ such that $\Pr\left[\left\|C-C_{\infty}\right\|_{2}\ge\epsilon\right]\le\delta$. Assuming $\delta=\frac{1}{\operatorname{poly}(n)}$ and $\left\|\phi\right\|_{\pi}=\operatorname{poly}(n)$ gives $L=O\left(\tau\left(\log n+\log\tau\right)\epsilon^{-2}\right)$.
Our result in Theorem 2 gives the first sample complexity analysis for many graph representation learning algorithms. Given a graph, these algorithms aim to learn a function from the vertices to a low-dimensional vector space. Most of them (e.g., DeepWalk [38], node2vec [12], metapath2vec [7], GraphSAGE [13]) consist of two steps. The first step is to draw random sequences from a stochastic process defined on the graph and then count co-occurrence statistics from the sampled sequences, where the stochastic process is usually defined to be a first-order or higher-order random walk on the graph. The second step is to train a model to fit the co-occurrence statistics. For example, DeepWalk can be viewed as factorizing a point-wise mutual information matrix [26, 39], which is a transformation of the co-occurrence matrix; GraphSAGE fits the co-occurrence statistics with a graph neural network [22]. The common assumption is that there are enough samples so that the co-occurrence statistics are accurately estimated. Ours is the first work to study the sample complexity of the aforementioned algorithms. Theorem 2 implies that these algorithms need $O\left(\tau\left(\log n+\log\tau\right)\epsilon^{-2}\right)$ samples to achieve a good estimator of the co-occurrence matrix.
Previous work by Hsu et al. [16, 15] studies a similar problem. They leverage the co-occurrence matrix with $T=1$ to estimate the mixing time of reversible Markov chains from a single trajectory. Their main technique is a blocking technique [55], which parallels the Markov chain matrix Chernoff bound used in this work. Our work is also related to research on random-walk matrix polynomial sparsification when the Markov chain is a random walk on an undirected graph. In this case, we can rewrite $P=D^{-1}A$, where $D$ and $A$ are the degree matrix and adjacency matrix of an undirected graph $G$ with $n$ vertices and $m$ edges, and the expected co-occurrence matrix in Equation 1 can be simplified as $C_{\infty}=\frac{1}{\operatorname{vol}(G)}\sum_{r=1}^{T}w_{r}D\left(D^{-1}A\right)^{r}$ (the volume of a graph is defined to be $\operatorname{vol}(G)=\sum_{i}D_{ii}$), which is known as a random-walk matrix polynomial [3, 4]. Cheng et al. [4] propose an algorithm which needs $\tilde{O}\left(T^{2}m\epsilon^{-2}\right)$ steps of random walk to construct an $\epsilon$-approximator for the random-walk matrix polynomial. Our bound in Theorem 2 is stronger than the bound of Cheng et al. [4] when the Markov chain mixes fast. Moreover, Cheng et al. [4] require the step weight coefficients to be non-negative, while our bound can handle negative step weight coefficients.
Organization The rest of the paper is organized as follows. In Section 2 we provide preliminaries, followed by the proof of matrix Chernoff bound in Section 3 and the proof of convergence rate of co-occurrence matrices in Section 4. In Section 5, we conduct experiments on both synthetic and real-world datasets. Finally, we conclude this work in Section 6.
2 Preliminaries
In this paper, we denote by $P$ a finite Markov chain on $n$ states. $P$ could refer to either the chain itself or the corresponding transition probability matrix — an $n$ by $n$ matrix such that its entry $P_{i,j}$ indicates the probability that state $i$ moves to state $j$. A Markov chain is called an ergodic Markov chain if it is possible to eventually get from every state to every other state with positive probability. A Markov chain is regular if some power of its transition matrix has all strictly positive entries. A regular Markov chain must be an ergodic Markov chain, but not vice versa. An ergodic Markov chain has a unique stationary distribution, i.e., there exists a unique probability vector $\pi$ such that $\pi P=\pi$. For convenience, we write $[n]\triangleq\{1,\cdots,n\}$ for the state space.
The time that a regular Markov chain needs to be "close" to its stationary distribution is called the mixing time. (We need the Markov chain to be regular for the mixing time to be well-defined; for an ergodic Markov chain, which could be periodic, the mixing time may be ill-defined.) Let $x$ and $y$ be two probability vectors. The total variation distance between them is $\left\|x-y\right\|_{TV}=\frac{1}{2}\sum_{v}\left|x_{v}-y_{v}\right|$. For $\epsilon>0$, the $\epsilon$-mixing time of a regular Markov chain $P$ is $\tau(\epsilon)=\min\left\{t:\max_{x}\left\|xP^{t}-\pi\right\|_{TV}\le\epsilon\right\}$, where $x$ is an arbitrary probability vector.
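For small chains, the $\epsilon$-mixing time defined above can be computed by brute force; a minimal sketch (the maximum over initial distributions is attained at point masses, i.e., the rows of $P^{t}$, by convexity of the total variation distance):

```python
import numpy as np

def mixing_time(P, pi, eps=0.25, max_t=10**6):
    """Smallest t such that max_x TV(x P^t, pi) <= eps."""
    Pt = P.copy()
    for t in range(1, max_t + 1):
        tv = 0.5 * np.abs(Pt - pi).sum(axis=1).max()  # worst-case TV distance
        if tv <= eps:
            return t
        Pt = Pt @ P
    return None  # chain did not mix within max_t steps
```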
The stationary distribution $\pi$ also defines an inner product space, where the inner product (under the $\pi$-kernel) is defined as $\langle x,y\rangle_{\pi}=\sum_{v\in[n]}\frac{x_{v}\overline{y_{v}}}{\pi_{v}}$ for $x,y\in\mathbb{C}^{n}$, where $\overline{y_{v}}$ is the complex conjugate of $y_{v}$. A naturally defined norm based on the above inner product is $\left\|x\right\|_{\pi}=\sqrt{\langle x,x\rangle_{\pi}}$. Then we can define the spectral expansion of a Markov chain [31, 9, 6] as
$$\lambda(P)=\max_{\langle x,\pi\rangle_{\pi}=0,\;x\neq0}\frac{\left\|xP\right\|_{\pi}}{\left\|x\right\|_{\pi}}.$$
The spectral expansion is known to be a measure of the mixing time of a Markov chain: the smaller $\lambda(P)$ is, the faster the Markov chain converges to its stationary distribution [54]. If $P$ is reversible, $\lambda(P)$ is simply the second largest absolute eigenvalue of $P$ (the largest is always $1$). The irreversible case is more complicated, since $P$ may have complex eigenvalues. In this case, $\lambda(P)$ is actually the square root of the second largest absolute eigenvalue of the multiplicative reversiblization of $P$ [9]. When $P$ is clear from the context, we will simply write $\lambda$ and $\tau(\epsilon)$ for $\lambda(P)$ and $\tau_{P}(\epsilon)$, respectively. We shall also refer to $1-\lambda$ as the spectral gap of $P$.
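Numerically, the spectral expansion can be obtained from an equivalent singular-value formulation: under the row-vector convention above, $\lambda(P)$ equals the largest singular value of $D_{\pi}^{-1/2}P^{\top}D_{\pi}^{1/2}$ restricted to the orthogonal complement of $\sqrt{\pi}$. A minimal sketch:

```python
import numpy as np

def spectral_expansion(P, pi):
    """lambda(P): operator norm of P in the pi-inner-product space,
    restricted to vectors orthogonal to the stationary direction."""
    q = np.sqrt(pi)                          # note ||q||_2 = 1
    A = (P.T * q) / q[:, None]               # D^{-1/2} P^T D^{1/2}; A q = q
    proj = np.eye(len(pi)) - np.outer(q, q)  # project out the fixed vector q
    return np.linalg.norm(A @ proj, 2)       # largest remaining singular value
```

For a reversible chain this returns the second largest absolute eigenvalue of $P$, matching the discussion above.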
3 Matrix Chernoff Bounds for Markov Chains
This section provides a brief overview of our proof of the Markov chain matrix Chernoff bound. We start from a simpler version which only considers real-valued symmetric matrices, as stated in Theorem 3 below. Then we extend it to complex-valued Hermitian matrices, as stated in Theorem 1.
Theorem 3 (A Real-Valued Version of Theorem 1).
Let $P$ be a regular Markov chain with state space $[N]$, stationary distribution $\pi$ and spectral expansion $\lambda$. Let $f:[N]\to\mathbb{R}^{d\times d}$ be a function such that (1) $\forall v\in[N]$, $f(v)$ is symmetric and $\left\|f(v)\right\|_{2}\le1$; (2) $\sum_{v\in[N]}\pi_{v}f(v)=0$. Let $(v_{1},\cdots,v_{k})$ denote a $k$-step random walk on $P$ starting from a distribution $\phi$ on $[N]$. Then given $\epsilon\in(0,1)$,
$$\Pr\left[\lambda_{\max}\left(\frac{1}{k}\sum_{j=1}^{k}f(v_{j})\right)\ge\epsilon\right]\le\left\|\phi\right\|_{\pi}\,d^{2}\exp\left(-\Omega\left(\epsilon^{2}(1-\lambda)k\right)\right),$$
and the same bound holds for $\Pr\left[\lambda_{\min}\left(\frac{1}{k}\sum_{j=1}^{k}f(v_{j})\right)\le-\epsilon\right]$.
Due to space constraints, we defer the full proof to Section B in the supplementary material and instead present a sketch here. By symmetry, we only discuss bounding $\Pr\left[\lambda_{\max}\left(\frac{1}{k}\sum_{j=1}^{k}f(v_{j})\right)\ge\epsilon\right]$ here. Using the exponential method, the probability in Theorem 3 can be upper bounded for any $t>0$ by:
$$\Pr\left[\lambda_{\max}\left(\sum_{j=1}^{k}f(v_{j})\right)\ge k\epsilon\right]\le\Pr\left[\operatorname{tr}\exp\left(t\sum_{j=1}^{k}f(v_{j})\right)\ge e^{tk\epsilon}\right]\le e^{-tk\epsilon}\,\mathbb{E}\left[\operatorname{tr}\exp\left(t\sum_{j=1}^{k}f(v_{j})\right)\right],$$
where the first inequality follows by the tail bounds for eigenvalues (see Proposition 3.2.1 in Tropp [50]), which control the tail probabilities of the extreme eigenvalues of a random matrix by producing a bound on the trace of the matrix moment generating function, and the second inequality follows by Markov's inequality. The RHS of the above equation is the expected trace of the exponential of a sum of matrices (i.e., the $tf(v_{j})$'s). When $f$ is a scalar-valued function, we can easily write the exponential of a sum as a product of exponentials (since $e^{a+b}=e^{a}e^{b}$ for scalars). However, this is not true for matrices. To bound the expectation term, we invoke the following multi-matrix Golden-Thompson inequality from [10], by letting $H_{j}=tf(v_{j})$.
Theorem 4 (Multi-matrix Golden-Thompson Inequality, Theorem 1.5 in [10]).
Let $H_{1},\cdots,H_{k}$ be $d\times d$ Hermitian matrices. Then
$$\log\operatorname{tr}\exp\left(\sum_{j=1}^{k}H_{j}\right)\le\frac{4}{\pi}\int_{-\pi/2}^{\pi/2}\log\operatorname{tr}\left[\prod_{j=1}^{k}\exp\left(\frac{(1+ib)H_{j}}{2}\right)\prod_{j=k}^{1}\exp\left(\frac{(1-ib)H_{j}}{2}\right)\right]\mathrm{d}\mu(b)$$
for some probability distribution $\mu$ on $\left[-\frac{\pi}{2},\frac{\pi}{2}\right]$.
The key point of this theorem is to relate the exponential of a sum of matrices to a product of matrix exponentials and their adjoints, whose trace can be further bounded via the following lemma.
Lemma 1 (Analogous to Lemma 4.3 in [10]).
Let $P$ be a regular Markov chain with state space $[N]$, stationary distribution $\pi$ and spectral expansion $\lambda$. Let $f:[N]\to\mathbb{R}^{d\times d}$ be a function such that (1) $\sum_{v\in[N]}\pi_{v}f(v)=0$; (2) $f(v)$ is symmetric and $\left\|f(v)\right\|_{2}\le1$, $\forall v\in[N]$. Let $(v_{1},\cdots,v_{k})$ denote a $k$-step random walk on $P$ starting from a distribution $\phi$ on $[N]$. Then for any $t>0$ and $b\in\mathbb{R}$ such that $t\sqrt{1+b^{2}}\le c(1-\lambda)$ for a sufficiently small absolute constant $c$, we have
$$\mathbb{E}\left[\operatorname{tr}\prod_{j=1}^{k}\exp\left(\frac{t(1+ib)}{2}f(v_{j})\right)\prod_{j=k}^{1}\exp\left(\frac{t(1-ib)}{2}f(v_{j})\right)\right]\le\left\|\phi\right\|_{\pi}\,d\,\exp\left(O\left(\frac{kt^{2}(1+b^{2})}{1-\lambda}\right)\right).$$
Proving Lemma 1 is the technical core of our paper. The main idea is to write the expected trace expression in the LHS of Lemma 1 in terms of the transition probability matrix $P$, which allows for a recursive analysis that tracks how much the expected trace expression changes as a function of the number of steps $k$. The analysis relies on incorporating the concentration of matrix-valued functions from [10] into the study of general Markov chains from [6], which was originally for scalars. Key to this extension is the definition of an inner product related to the stationary distribution of the Markov chain, and a spectral expansion based on such inner products. In contrast, the undirected regular graph case studied in [10] can be handled using the standard inner product, as well as the second largest absolute eigenvalue of the transition matrix instead of the spectral expansion. Detailed proofs of Theorem 3 and Lemma 1 can be found in Appendix B.2 and Appendix B.3 of the supplementary material, respectively.
Our result about real-valued matrices can be further generalized to complex-valued matrices, as stated in Theorem 1. Our main strategy is to adopt the complexification technique [8], which first relates the eigenvalues of a complex Hermitian matrix to those of a real symmetric matrix, and then deals with the real symmetric matrix using Theorem 3. The proof of Theorem 1 is deferred to Appendix B.4 in the supplementary material.
4 Convergence Rate of Co-occurrence Matrices of Markov Chains
In this section, we first apply the matrix Chernoff bound for regular Markov chains from Theorem 3 to obtain our main result on the convergence of co-occurrence matrix estimation, as stated in Theorem 2, and then discuss its generalization to hidden Markov models in Corollary 1. Informally, our result in Theorem 2 states that if the mixing time of the Markov chain is $\tau$, then the length of a trajectory needed to guarantee an additive error (in 2-norm) of $\epsilon$ is roughly $O\left(\tau\left(\log n+\log\tau\right)\epsilon^{-2}\right)$ when the co-occurrence window size $T$ is a constant. However, we cannot directly apply the matrix Chernoff bound, because the co-occurrence matrix is not a sum of matrix-valued functions sampled from the original Markov chain $P$. The main difficulty is to construct a proper Markov chain and matrix-valued function as required by Theorem 3. We formally give our proof as follows:
Proof.
(of Theorem 2) Our proof has three main steps: the first two construct a Markov chain $M$ from $P$, and a matrix-valued function $F$ on its state space, such that the average of the matrix-valued random variables sampled via $M$ is exactly the error matrix $C-C_{\infty}$. Then we invoke Theorem 3 on the constructed Markov chain and function to bound the convergence rate. We give details below.
Step One Given a random walk $(v_{1},\cdots,v_{L})$ on Markov chain $P$, we construct a sequence $(s_{1},\cdots,s_{L-T})$ where $s_{i}=(v_{i},\cdots,v_{i+T})$, i.e., each $s_{i}$ is a size-$(T+1)$ sliding window over $(v_{1},\cdots,v_{L})$. Meanwhile, letting $\mathcal{S}$ be the set of all $T$-step walks on Markov chain $P$, we define a new Markov chain $M$ on $\mathcal{S}$ such that, for $u=(u_{0},\cdots,u_{T})\in\mathcal{S}$ and $w=(w_{0},\cdots,w_{T})\in\mathcal{S}$:
$$M(u,w)=\begin{cases}P(u_{T},w_{T})&\text{if }(w_{0},\cdots,w_{T-1})=(u_{1},\cdots,u_{T}),\\0&\text{otherwise.}\end{cases}$$
The following claim characterizes the properties of , whose proof is deferred to Appendix A.1 in the supplementary material.
Claim 1 (Properties of $M$).
If $P$ is a regular Markov chain, then $M$ satisfies:

1. $M$ is a regular Markov chain with stationary distribution $\sigma(u)=\pi_{u_{0}}\prod_{i=1}^{T}P(u_{i-1},u_{i})$ for $u=(u_{0},\cdots,u_{T})\in\mathcal{S}$;

2. The sequence $(s_{1},\cdots,s_{L-T})$ is a random walk on $M$ starting from the distribution $\rho$ such that $\rho(u)=\phi_{u_{0}}\prod_{i=1}^{T}P(u_{i-1},u_{i})$, and $\left\|\rho\right\|_{\sigma}=\left\|\phi\right\|_{\pi}$;

3. $\forall\epsilon>0$, the $\epsilon$-mixing times of $P$ and $M$ satisfy $\tau_{M}(\epsilon)\le\tau_{P}(\epsilon)+T$;

4. There exists $P$ with $\lambda(P)<1$ s.t. the induced $M$ has $\lambda(M)=1$, i.e., $M$ may have zero spectral gap.
Parts 1 and 2 imply that the sliding windows (i.e., the $s_{i}$'s) correspond to state transitions in a regular Markov chain $M$, whose mixing time and spectral expansion are described in Parts 3 and 4. A special case of the above construction, when $T=1$, can be found in Lemma 6.1 of [54].
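For intuition, a small NumPy sketch of the lifted chain $M$ (it enumerates all $(T+1)$-tuples for simplicity, so tuples that are not valid walks appear as states with zero stationary mass and could be pruned; the state space is exponential in $T$, so this is only for toy chains):

```python
import itertools
import numpy as np

def lifted_chain(P, T):
    """Markov chain M on (T+1)-grams: state (u_0, ..., u_T) moves to
    (u_1, ..., u_T, u') with probability P[u_T, u']."""
    n = P.shape[0]
    states = list(itertools.product(range(n), repeat=T + 1))
    index = {s: i for i, s in enumerate(states)}
    M = np.zeros((len(states), len(states)))
    for s in states:
        for nxt in range(n):
            t = s[1:] + (nxt,)
            M[index[s], index[t]] = P[s[-1], nxt]
    return M, states
```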
Step Two We define a matrix-valued function $F:\mathcal{S}\to\mathbb{R}^{n\times n}$ such that, for $u=(u_{0},\cdots,u_{T})\in\mathcal{S}$:
$$F(u)=\sum_{r=1}^{T}\frac{w_{r}}{2}\left(e_{u_{0}}e_{u_{r}}^{\top}+e_{u_{r}}e_{u_{0}}^{\top}\right)-C_{\infty}.\qquad(2)$$
With this definition of $F$, the difference between the co-occurrence matrix and its asymptotic expectation can be written as $C-C_{\infty}=\frac{1}{L-T}\sum_{i=1}^{L-T}F(s_{i})$. We can further show the following properties of this function $F$:
Claim 2 (Properties of $F$).
The function $F$ in Equation 2 satisfies (1) $\sum_{u\in\mathcal{S}}\sigma(u)F(u)=0$; (2) $F(u)$ is symmetric and $\left\|F(u)\right\|_{2}\le2$, $\forall u\in\mathcal{S}$.
This claim verifies that (after rescaling by a factor of 2) the function $F$ in Equation 2 satisfies the two conditions on the matrix-valued function in Theorem 3. The proof of Claim 2 is deferred to Appendix A.2 of the supplementary material.
Step Three The construction in Step Two reveals the fact that the error matrix $C-C_{\infty}$ can be written as the average of matrix-valued random variables (i.e., the $F(s_{i})$'s), which are sampled via a regular Markov chain $M$. This encourages us to directly apply Theorem 3. However, note that (1) the error probability in Theorem 3 contains a factor of the spectral gap $1-\lambda$; and (2) Part 4 of Claim 1 allows for the existence of a Markov chain $P$ with $\lambda(P)<1$ while the induced Markov chain $M$ has $\lambda(M)=1$. So we cannot directly apply Theorem 3 to $M$. To address this issue, we utilize the following tighter bound on sub-chains.
Claim 3.
(Claim 3.1 in Chung et al. [6]) Let $M$ be a regular Markov chain with $\epsilon$-mixing time $\tau(\epsilon)$; then the skipped chain $M^{\tau(\epsilon)}$ has spectral expansion bounded away from one. In particular, for a suitably small constant $\epsilon_{0}$, we have $\lambda\left(M^{\tau(\epsilon_{0})}\right)\le\frac{1}{2}$.
The above claim reveals the fact that, even though $M$ could have zero spectral gap (Part 4 of Claim 1), we can bound the spectral expansion of the skipped chain $M^{\tau}$, where $\tau\triangleq\tau_{M}(\epsilon_{0})$. We partition $(s_{1},\cdots,s_{L-T})$ into $\tau$ groups (without loss of generality, we assume $L-T$ is a multiple of $\tau$), such that the $i$-th group consists of the sub-chain $(s_{i},s_{i+\tau},s_{i+2\tau},\cdots)$ of length $\frac{L-T}{\tau}$. Each sub-chain can be viewed as generated from the Markov chain $M^{\tau}$. Applying Theorem 3 to the $i$-th sub-chain, whose starting distribution $\rho_{i}$ is the distribution of $s_{i}$, we have
$$\Pr\left[\lambda_{\max}\left(\frac{\tau}{L-T}\sum_{j\ge0}F\left(s_{i+j\tau}\right)\right)\ge\epsilon\right]\le\left\|\rho_{i}\right\|_{\sigma}n^{2}\exp\left(-\Omega\left(\frac{\epsilon^{2}(L-T)}{\tau}\right)\right)\le\left\|\phi\right\|_{\pi}n^{2}\exp\left(-\Omega\left(\frac{\epsilon^{2}(L-T)}{\tau}\right)\right),$$
where the last step follows by $\left\|\rho_{i}\right\|_{\sigma}\le\left\|\rho\right\|_{\sigma}$ and $\left\|\rho\right\|_{\sigma}=\left\|\phi\right\|_{\pi}$ (Part 2 of Claim 1). Together with a union bound across the $\tau$ sub-chains, we can obtain:
$$\Pr\left[\lambda_{\max}\left(\frac{1}{L-T}\sum_{i=1}^{L-T}F\left(s_{i}\right)\right)\ge\epsilon\right]\le\tau\left\|\phi\right\|_{\pi}n^{2}\exp\left(-\Omega\left(\frac{\epsilon^{2}(L-T)}{\tau}\right)\right).$$
The bound on $\lambda_{\min}$ also follows similarly. As $C-C_{\infty}$ is a real symmetric matrix, its 2-norm is its maximum absolute eigenvalue. Therefore, we can use the two eigenvalue bounds to bound the overall error probability in terms of the matrix 2-norm:
$$\Pr\left[\left\|C-C_{\infty}\right\|_{2}\ge\epsilon\right]\le2\tau\left\|\phi\right\|_{\pi}n^{2}\exp\left(-\Omega\left(\frac{\epsilon^{2}(L-T)}{\tau}\right)\right)\le2\left(\tau_{P}(\epsilon_{0})+T\right)\left\|\phi\right\|_{\pi}n^{2}\exp\left(-\Omega\left(\frac{\epsilon^{2}(L-T)}{\tau_{P}(\epsilon_{0})+T}\right)\right),$$
where the first inequality follows by a union bound, and the second inequality is due to $\tau=\tau_{M}(\epsilon_{0})\le\tau_{P}(\epsilon_{0})+T$ (Part 3 of Claim 1). This bound implies that the probability that $C$ deviates from $C_{\infty}$ can be made arbitrarily small by increasing the sampled trajectory length $L$. Specifically, if we want the event $\left\|C-C_{\infty}\right\|_{2}\ge\epsilon$ to happen with probability smaller than $\delta$, we need $L=O\left(\frac{\tau}{\epsilon^{2}}\left(\log n+\log\tau+\log\frac{\left\|\phi\right\|_{\pi}}{\delta}\right)\right)$. If we assume $\delta=\frac{1}{\operatorname{poly}(n)}$ and $\left\|\phi\right\|_{\pi}=\operatorname{poly}(n)$, we can achieve $L=O\left(\tau\left(\log n+\log\tau\right)\epsilon^{-2}\right)$. ∎
Our analysis can be extended to hidden Markov models (HMMs), as shown in Corollary 1, and has the potential to address problems raised in [17, 30]. Our strategy is to treat an HMM with observable state space $\mathcal{O}$ and hidden state space $\mathcal{H}$ as a Markov chain with state space $\mathcal{H}\times\mathcal{O}$. The detailed proof can be found in Appendix A.3 in the supplementary material.
Corollary 1 (Co-occurrence Matrices of HMMs).
For an HMM with observable state space $\mathcal{O}$ and hidden state space $\mathcal{H}$, let $\Pr[o\mid h]$ be the emission probability and $\Pr[h'\mid h]$ be the hidden state transition probability. Given an $L$-step trajectory of observations from the HMM, $(o_{1},\cdots,o_{L})$, one needs a trajectory of length $L=O\left(\tau\left(\log\left(|\mathcal{H}||\mathcal{O}|\right)+\log\tau\right)\epsilon^{-2}\right)$ to achieve a co-occurrence matrix of observations within error bound $\epsilon$ with high probability, where $\tau$ is the mixing time of the Markov chain on the hidden states.
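A minimal sketch of the joint-chain construction behind Corollary 1 (see Appendix A.3), under the convention that each observation is emitted by the current hidden state; `trans` and `emit` are our names for the two probability matrices:

```python
import numpy as np

def hmm_joint_chain(trans, emit):
    """Joint Markov chain on (hidden, observed) pairs:
    P[(h, o) -> (h2, o2)] = trans[h, h2] * emit[h2, o2].
    trans: |H| x |H| hidden transitions; emit: |H| x |O| emissions."""
    H, W = emit.shape
    P = np.zeros((H * W, H * W))
    for h in range(H):
        for o in range(W):
            for h2 in range(H):
                for o2 in range(W):
                    P[h * W + o, h2 * W + o2] = trans[h, h2] * emit[h2, o2]
    return P
```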
5 Experiments
In this section, we show experiments to illustrate the exponentially fast convergence rate of estimating co-occurrence matrices of Markov chains. We conduct experiments on three synthetic Markov chains (Barbell graph, winning streak chain, and random graph) and one real-world Markov chain (BlogCatalog). For each Markov chain and each trajectory length $L$ on a geometric grid, we measure the approximation error $\left\|C-C_{\infty}\right\|_{2}$ of the co-occurrence matrix constructed by Algorithm 1 from an $L$-step random walk sampled from the chain. We performed 64 trials for each experiment, and the results are aggregated as an error-bar plot. We set the window size $T$ to a small constant and the step weight coefficients to be uniform unless otherwise mentioned. The relationship between trajectory length and approximation error is shown in Figure 1 (in log-log scale). Across all four datasets, the observed exponentially fast convergence rates match what our bounds predict in Theorem 2. Below we discuss our observations for each of these datasets.
[Figure 1: Approximation error versus trajectory length $L$ (log-log scale): (a) Barbell graphs, (b) winning streak chains, (c) BlogCatalog, (d) Erdős–Rényi random graph.]
Barbell Graphs [43] The Barbell graph is an undirected graph with two cliques connected by a single path. Such graphs' mixing times vary greatly: two cliques of size $\frac{n}{2}$ connected by a single edge have mixing time $\Theta(n^{2})$, while two size-$\frac{n}{3}$ cliques connected by a length-$\frac{n}{3}$ path have mixing time about $\Theta(n^{3})$. We evaluate the convergence rate of co-occurrence matrices on the two graphs mentioned above, each with $n=100$ vertices. According to our bound of $L=O\left(\tau\left(\log n+\log\tau\right)\epsilon^{-2}\right)$, we expect the approximate co-occurrence matrix to converge faster when the path bridging the two cliques is shorter. The experimental results are shown in Figure 1a, and indeed display faster convergence when the path is shorter (since we fix $n=100$, a Barbell graph with clique size 50 has a shorter path connecting the two cliques than the one with clique size 33).
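A sketch of the Barbell construction used in this experiment (parameters are ours; with `path_len=0` the two cliques are joined by a single edge):

```python
import numpy as np

def barbell_walk(clique_size, path_len):
    """Transition matrix of the random walk on a barbell graph:
    two cliques of `clique_size` vertices joined by a path."""
    n = 2 * clique_size + path_len
    A = np.zeros((n, n))
    cliques = (range(clique_size), range(clique_size + path_len, n))
    for c in cliques:                        # two cliques
        for i in c:
            for j in c:
                if i != j:
                    A[i, j] = 1
    chain = list(range(clique_size - 1, clique_size + path_len + 1))
    for a, b in zip(chain, chain[1:]):       # bridging path
        A[a, b] = A[b, a] = 1
    return A / A.sum(axis=1, keepdims=True)  # row-normalize adjacency
```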
Winning Streak Chains (Section 4.6 of [25]) A winning streak Markov chain has state space $\{0,1,\cdots,n-1\}$, and can be viewed as tracking the number of consecutive 'tails' in a sequence of coin flips. Each state transits back to state $0$ with probability $\frac{1}{2}$, and to the next state with probability $\frac{1}{2}$. The $\epsilon$-mixing time of this chain satisfies $\tau(\epsilon)\le\lceil\log_{2}\frac{1}{\epsilon}\rceil$ and is independent of the number of states $n$. This prompted us to choose this chain, as we should expect similar rates of convergence for different values of $n$ according to our bound of $L=O\left(\tau\left(\log n+\log\tau\right)\epsilon^{-2}\right)$. In our experiment, we compare two winning streak chains of different sizes $n$ and illustrate the results in Figure 1b. As we can see, for each trajectory length $L$, the approximation errors of the two chains are indeed very close.
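A sketch of the winning streak transition matrix (the self-loop at the top state is one common convention for the boundary):

```python
import numpy as np

def winning_streak(n):
    """Winning streak chain on {0, ..., n-1}: from state i, return to 0
    with probability 1/2, otherwise advance to min(i + 1, n - 1)."""
    P = np.zeros((n, n))
    for i in range(n):
        P[i, 0] += 0.5
        P[i, min(i + 1, n - 1)] += 0.5
    return P
```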
BlogCatalog Graph [47] is widely used to benchmark graph representation learning algorithms [38, 12, 39]. It is an undirected graph of social relationships of online bloggers with 10,312 vertices and 333,983 edges. The random walk on the BlogCatalog graph has a small spectral expansion $\lambda$. Following Levin and Peres [25], we can upper bound its $\epsilon$-mixing time by $\tau(\epsilon)\le\frac{1}{1-\lambda}\log\frac{1}{\epsilon\,\pi_{\min}}$, where $\pi_{\min}=\min_{v}\pi_{v}$. We evaluate several window sizes $T$ and illustrate the results in Figure 1c. The convergence rate is robust to different values of $T$. Moreover, the variance in BlogCatalog is much smaller than that in the other datasets.
We further demonstrate how our result could be used to select parameters for a popular graph representation learning algorithm, DeepWalk [38]. We set the window size $T=10$, which is the default value of DeepWalk. Our bound on trajectory length in Theorem 2 then predicts a range of sufficient trajectory lengths $L$ for moderate error bounds $\epsilon$. To verify that this is a meaningful range for tuning $L$, we enumerate trajectory lengths on a geometric grid, estimate the co-occurrence matrix from the single trajectory sampled from BlogCatalog, convert the co-occurrence matrix to the one implicitly factorized by DeepWalk [38, 39], and factorize it with SVD. For comparison, we also provide the result in the limiting case ($L\to\infty$), where we directly compute the asymptotic expectation of the co-occurrence matrix according to Equation 1. The limiting case involves computing a matrix polynomial and can be very expensive. For the node classification task, the micro-F1 with a training ratio of 50% is:
Length $L$ of DeepWalk | (increasing $\to$) | | | | | | | $\infty$ (Equation 1)
---|---|---|---|---|---|---|---|---
Micro-F1 (%) | 15.21 | 18.31 | 26.99 | 33.85 | 39.12 | 41.28 | 41.58 | 41.82
As we can see, it is reasonable to choose $L$ in the range predicted by our bound.
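For completeness, a hedged sketch of the downstream step: a generic PMI-style transformation followed by SVD, in the spirit of [26, 39]. The exact matrix implicitly factorized by DeepWalk is derived in [39]; the function below is a simplified stand-in, not the paper's exact pipeline, and assumes every object occurs at least once in $C$:

```python
import numpy as np

def ppmi_svd_embedding(C, dim):
    """Embed objects by factorizing a truncated PMI transform of the
    co-occurrence matrix C with a rank-`dim` SVD."""
    row = C.sum(axis=1, keepdims=True)
    col = C.sum(axis=0, keepdims=True)
    total = C.sum()
    ppmi = np.log(np.maximum(C * total / (row @ col), 1.0))  # truncated PMI
    U, S, _ = np.linalg.svd(ppmi)
    return U[:, :dim] * np.sqrt(S[:dim])
```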
Random Graph The small variance observed on BlogCatalog leads us to hypothesize that it shares some traits with random graphs. To gather further evidence for this, we estimate the co-occurrence matrices of an Erdős–Rényi random graph for comparison. Specifically, we take a random graph on $n$ vertices where each undirected edge is added independently with probability $p$, a.k.a. the $G(n,p)$ model. The results in Figure 1d show very similar behavior to the BlogCatalog graph: small variance and robust convergence rates.
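Putting the sketches above together, a hypothetical driver that reproduces the shape of these experiments; it reuses `cooccurrence`, `asymptotic_cooccurrence`, and `barbell_walk` defined earlier (all names are ours):

```python
import numpy as np

def sample_walk(P, length, rng):
    """Sample a random walk on P, starting from a uniformly random state."""
    n = P.shape[0]
    v = rng.integers(n)
    walk = [v]
    for _ in range(length - 1):
        v = rng.choice(n, p=P[v])
        walk.append(v)
    return walk

rng = np.random.default_rng(0)
P = barbell_walk(clique_size=20, path_len=1)
n = P.shape[0]
pi = np.ones(n) @ np.linalg.matrix_power(P, 10**4)  # approximate pi
pi = pi / pi.sum()
T = 2
C_inf = asymptotic_cooccurrence(P, pi, w=[1 / T] * T)
for L in (10**3, 10**4, 10**5):                     # error should shrink with L
    C = cooccurrence(sample_walk(P, L, rng), n, T)
    print(L, np.linalg.norm(C - C_inf, 2))
```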
6 Conclusion and Future Work
In this paper, we analyze the convergence rate of estimating the co-occurrence matrix of a regular Markov chain. The main technical contribution of our work is to prove a Chernoff-type bound for sums of matrix-valued random variables sampled via a regular Markov chain, and we show that the problem of estimating co-occurrence matrices is a non-trivial application of this Chernoff-type bound. Our results show that, given a regular Markov chain with $n$ states and mixing time $\tau$, we need a trajectory of length $O\left(\tau\left(\log n+\log\tau\right)\epsilon^{-2}\right)$ to achieve an estimator of the co-occurrence matrix with error bound $\epsilon$. Our work leads to some natural future questions:
• Is it a tight bound? Our analysis of the convergence rate of co-occurrence matrices relies on a union bound, which probably gives a loose bound. It would be interesting to shave off the leading factor of $\tau$ in the bound, as the mixing time could be large for some Markov chains.
• What if the construction of the co-occurrence matrix is coupled with a learning algorithm? For example, in word2vec [33], the co-occurrence in each sliding window outputs a mini-batch to a logistic matrix factorization model. This problem can be formalized as the convergence of stochastic gradient descent with non-i.i.d. but Markovian random samples.
Broader Impact
Our work contributes to the research literature on Chernoff-type bounds and co-occurrence statistics. Chernoff-type bounds have become some of the most important probabilistic results in computer science. Our result generalizes the Chernoff bound to Markov dependence and random matrices. Co-occurrence statistics have emerged as important tools in machine learning. Our work addresses the sample complexity of estimating co-occurrence matrices. We believe such better theoretical understanding can further the understanding of the potential and limitations of graph representation learning and reinforcement learning.
Acknowledgments and Disclosure of Funding
We thank Jian Li (IIIS, Tsinghua) and Shengyu Zhang (Tencent Quantum Lab) for motivating this work. Funding in direct support of this work: Jiezhong Qiu and Jie Tang were supported by the National Key R&D Program of China (2018YFB1402600), NSFC for Distinguished Young Scholar (61825602), and NSFC (61836013). Richard Peng was partially supported by NSF grant CCF-1846218. There is no additional revenue related to this work.
References
- Ahlswede and Winter [2002] Rudolf Ahlswede and Andreas Winter. Strong converse for identification via quantum channels. IEEE Transactions on Information Theory, 2002.
- Barkan and Koenigstein [2016] Oren Barkan and Noam Koenigstein. Item2vec: neural item embedding for collaborative filtering. In 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP), 2016.
- Cheng et al. [2015a] Dehua Cheng, Yu Cheng, Yan Liu, Richard Peng, and Shang-Hua Teng. Efficient sampling for gaussian graphical models via spectral sparsification. In COLT ’15, 2015a.
- Cheng et al. [2015b] Dehua Cheng, Yu Cheng, Yan Liu, Richard Peng, and Shang-Hua Teng. Spectral sparsification of random-walk matrix polynomials. arXiv preprint arXiv:1502.03496, 2015b.
- Chernoff et al. [1952] Herman Chernoff et al. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics, 1952.
- Chung et al. [2012] Kai-Min Chung, Henry Lam, Zhenming Liu, and Michael Mitzenmacher. Chernoff-hoeffding bounds for markov chains: Generalized and simplified. In STACS’ 12, 2012.
- Dong et al. [2017] Yuxiao Dong, Nitesh V Chawla, and Ananthram Swami. metapath2vec: Scalable representation learning for heterogeneous networks. In KDD ’17, 2017.
- Dongarra et al. [1984] JJ Dongarra, JR Gabriel, DD Koelling, and JH Wilkinson. The eigenvalue problem for hermitian matrices with time reversal symmetry. Linear Algebra and its Applications, 1984.
- Fill [1991] James Allen Fill. Eigenvalue bounds on convergence to stationarity for nonreversible markov chains, with an application to the exclusion process. The annals of applied probability, 1991.
- Garg et al. [2018] Ankit Garg, Yin Tat Lee, Zhao Song, and Nikhil Srivastava. A matrix expander chernoff bound. In STOC ’18, 2018.
- Gillman [1998] David Gillman. A chernoff bound for random walks on expander graphs. SIAM Journal on Computing, 1998.
- Grover and Leskovec [2016] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In KDD ’16, 2016.
- Hamilton et al. [2017] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In NeurIPS ’17, 2017.
- Healy [2008] Alexander D Healy. Randomness-efficient sampling within nc. Computational Complexity, 2008.
- Hsu et al. [2019] Daniel Hsu, Aryeh Kontorovich, David A Levin, Yuval Peres, Csaba Szepesvári, Geoffrey Wolfer, et al. Mixing time estimation in reversible markov chains from a single sample path. The Annals of Applied Probability, 2019.
- Hsu et al. [2015] Daniel J Hsu, Aryeh Kontorovich, and Csaba Szepesvári. Mixing time estimation in reversible markov chains from a single sample path. In NeurIPS ’15, 2015.
- Huang et al. [2018] Kejun Huang, Xiao Fu, and Nicholas Sidiropoulos. Learning hidden markov models from pairwise co-occurrences with application to topic modeling. In ICML ’18, 2018.
- Jerrum and Sinclair [1996] Mark Jerrum and Alistair Sinclair. The markov chain monte carlo method: an approach to approximate counting and integration. Approximation Algorithms for NP-hard problems, PWS Publishing, 1996.
- Kahale [1997] Nabil Kahale. Large deviation bounds for markov chains. Combinatorics, Probability and Computing, 1997.
- Karlin [2014] Samuel Karlin. A first course in stochastic processes. Academic press, 2014.
- Kearns et al. [1994] Michael J Kearns, Umesh Virkumar Vazirani, and Umesh Vazirani. An introduction to computational learning theory. MIT press, 1994.
- Kipf and Welling [2017] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR ’17, 2017.
- Kontorovich et al. [2013] Aryeh Kontorovich, Boaz Nadler, and Roi Weiss. On learning parametric-output hmms. In ICML ’13, 2013.
- León et al. [2004] Carlos A León, François Perron, et al. Optimal hoeffding bounds for discrete reversible markov chains. The Annals of Applied Probability, 2004.
- Levin and Peres [2017] David A Levin and Yuval Peres. Markov chains and mixing times, volume 107. American Mathematical Soc., 2017.
- Levy and Goldberg [2014] Omer Levy and Yoav Goldberg. Neural Word Embedding as Implicit Matrix Factorization. In NeurIPS ’14. 2014.
- Lezaud [1998] Pascal Lezaud. Chernoff-type bound for finite markov chains. Annals of Applied Probability, 1998.
- Liang et al. [2016] Dawen Liang, Jaan Altosaar, Laurent Charlin, and David M Blei. Factorization meets the item embedding: Regularizing matrix factorization with item co-occurrence. In RecSys ’16, 2016.
- Liu et al. [2017] David C Liu, Stephanie Rogers, Raymond Shiau, Dmitry Kislyuk, Kevin C Ma, Zhigang Zhong, Jenny Liu, and Yushi Jing. Related pins at pinterest: The evolution of a real-world recommender system. In WWW ’17, 2017.
- Mattila et al. [2020] Robert Mattila, Cristian R Rojas, Eric Moulines, Vikram Krishnamurthy, and Bo Wahlberg. Fast and consistent learning of hidden markov models by incorporating non-consecutive correlations. In ICML ’20, 2020.
- Mihail [1989] Milena Mihail. Conductance and convergence of markov chains-a combinatorial treatment of expanders. In FOCS ’89, 1989.
- Mikolov et al. [2013a] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In ICLR Workshop ’13, 2013a.
- Mikolov et al. [2013b] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In NeurIPS’ 13. 2013b.
- Mikolov et al. [2013c] Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In NAACL ’13, 2013c.
- Motwani and Raghavan [1995] Rajeev Motwani and Prabhakar Raghavan. Randomized algorithms. Cambridge university press, 1995.
- Ortner [2020] Ronald Ortner. Regret bounds for reinforcement learning via markov chain concentration. Journal of Artificial Intelligence Research, 2020.
- Pennington et al. [2014] Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In EMNLP ’14, 2014.
- Perozzi et al. [2014] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. In KDD ’14, 2014.
- Qiu et al. [2018] Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Kuansan Wang, and Jie Tang. Network embedding as matrix factorization: Unifying deepwalk, line, pte, and node2vec. In WSDM ’18, 2018.
- Rahimi and Recht [2008] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In NeurIPS ’08, 2008.
- Rao and Regev [2017] Shravas Rao and Oded Regev. A sharp tail bound for the expander random sampler. arXiv preprint arXiv:1703.10205, 2017.
- Rudelson [1999] Mark Rudelson. Random vectors in the isotropic position. Journal of Functional Analysis, 1999.
- Sauerwald and Zanetti [2019] Thomas Sauerwald and Luca Zanetti. Random walks on dynamic graphs: Mixing times, hitting times, and return probabilities. In ICALP 2019, 2019.
- Shani et al. [2005] Guy Shani, David Heckerman, and Ronen I Brafman. An mdp-based recommender system. JMLR ’05, 2005.
- Sutter et al. [2017] David Sutter, Mario Berta, and Marco Tomamichel. Multivariate trace inequalities. Communications in Mathematical Physics, 352(1):37–58, 2017.
- Tang et al. [2015] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. Line: Large-scale information network embedding. In WWW ’15, 2015.
- Tang and Liu [2009] Lei Tang and Huan Liu. Relational learning via latent social dimensions. In KDD ’09, 2009.
- Tekin and Liu [2010] Cem Tekin and Mingyan Liu. Online algorithms for the multi-armed bandit problem with markovian rewards. In 2010 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1675–1682. IEEE, 2010.
- Tennenholtz and Mannor [2019] Guy Tennenholtz and Shie Mannor. The natural language of actions. In ICML ’19, 2019.
- Tropp [2015] Joel A Tropp. An introduction to matrix concentration inequalities. arXiv preprint arXiv:1501.01571, 2015.
- Vasile et al. [2016] Flavian Vasile, Elena Smirnova, and Alexis Conneau. Meta-prod2vec: Product embeddings using side-information for recommendation. In RecSys ’16, 2016.
- Wagner [2008] Roy Wagner. Tail estimates for sums of variables sampled by a random walk. Combinatorics, Probability and Computing, 2008.
- Wigderson and Xiao [2005] Avi Wigderson and David Xiao. A randomness-efficient sampler for matrix-valued functions and applications. In FOCS’05, 2005.
- Wolfer and Kontorovich [2019] Geoffrey Wolfer and Aryeh Kontorovich. Estimating the mixing time of ergodic markov chains. In COLT ’19, 2019.
- Yu [1994] Bin Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, pages 94–116, 1994.
Supplementary Material of A Matrix Chernoff Bound for Markov Chains and Its Application to Co-occurrence Matrices
Appendix A Convergence Rate of Co-occurrence Matrices
A.1 Proof of Claim 1
Claim 1 (restated).
Proof.
We prove the four parts of this claim one by one.
Part 1 To prove $M$ is regular, it is sufficient to show that every state of $M$ can reach every other state in any sufficiently large number of steps. We know $P$ is a regular Markov chain, so there exists $t_{0}$ s.t., for any $t\ge t_{0}$, any state of $P$ can reach any other state in exactly $t$ steps; i.e., for $u,w\in\mathcal{S}$, there is a $t$-step walk from $u_{T}$ to $w_{0}$ on $P$. This induces a walk from $u$ toward $w$ on $M$; taking $T$ further steps through $w_{1},\cdots,w_{T}$, we construct a $(t+T)$-step walk from $u$ to $w$. Since this is true for any $t\ge t_{0}$, we can claim that any state of $M$ can be reached from any other state in any number of steps greater than or equal to $t_{0}+T$. Next we verify that $\sigma$ is the stationary distribution of Markov chain $M$:
Part 2 Recall that $(s_{1},\cdots,s_{L-T})$ is derived from a random walk $(v_{1},\cdots,v_{L})$ on $P$ starting from distribution $\phi$, so the probability that we observe $s_{1}=(v_{1},\cdots,v_{T+1})$ is $\phi_{v_{1}}P(v_{1},v_{2})\cdots P(v_{T},v_{T+1})$, i.e., $s_{1}$ is sampled from the distribution $\rho$. We then study the transition probability from $s_{i}$ to $s_{i+1}$, which matches the definition of $M$. Consequently, we can claim that $(s_{1},\cdots,s_{L-T})$ is a random walk on $M$. Moreover,
which implies $\left\|\rho\right\|_{\sigma}=\left\|\phi\right\|_{\pi}$.
Part 3 For any distribution $x$ on $\mathcal{S}$, define the distribution $y$ on $[n]$ such that $y_{v}=\sum_{u\in\mathcal{S}:u_{T}=v}x_{u}$. It is easy to see that $y$ is a probability vector, since $y_{v}$ is the marginal probability of the last state of the window. For convenience, we assume for the moment that distributions are row vectors. We can see that:
which indicates $\tau_{M}(\epsilon)\le\tau_{P}(\epsilon)+T$.
Part 4 This is an example showing that $\lambda(M)$ cannot be bounded in terms of $\lambda(P)$: even though $P$ has $\lambda(P)<1$, the induced $M$ may have $\lambda(M)=1$. We consider the random walk on a small unweighted, undirected graph.
The transition probability matrix is:
with stationary distribution $\pi$ proportional to the vertex degrees, and $\lambda(P)<1$. When $T=1$, the induced Markov chain $M$ has stationary distribution $\sigma(u,v)=\frac{1}{2m}$ for each directed edge $(u,v)$, where $m$ is the number of edges in the graph. Construct $x$ such that
The constructed vector has norm
It is easy to check that $x$ is orthogonal to $\sigma$ under the $\sigma$-kernel. Letting $y=xM$, we have, entrywise:
This vector has norm:
Thus we have $\left\|xM\right\|_{\sigma}=\left\|x\right\|_{\sigma}$. Taking the maximum over all possible $x$ gives $\lambda(M)\ge1$. Also note the fact that $\lambda(M)\le1$ always holds, so $\lambda(M)=1$. ∎
A.2 Proof of Claim 2
Claim 2 (restated).
A.3 Proof of Corollary 1
Corollary 1 (restated).
Proof.
An HMM can be modeled by a Markov chain on $\mathcal{H}\times\mathcal{O}$ such that the transition probability from $(h,o)$ to $(h',o')$ is $\Pr[h'\mid h]\Pr[o'\mid h']$. For the co-occurrence matrix of observable states, applying a proof similar to that of our Theorem 2 shows that one needs a trajectory of length $O\left(\tau\left(\log\left(|\mathcal{H}||\mathcal{O}|\right)+\log\tau\right)\epsilon^{-2}\right)$ to achieve error bound $\epsilon$ with high probability. Moreover, the mixing time of this Markov chain is bounded by the mixing time of the Markov chain on the hidden state space (i.e., $\tau$). ∎
Appendix B Matrix Chernoff Bounds for Markov Chains
B.1 Preliminaries
Kronecker Products If $A$ is an $m\times n$ matrix and $B$ is a $p\times q$ matrix, then the Kronecker product $A\otimes B$ is the $mp\times nq$ block matrix such that
$$A\otimes B=\begin{pmatrix}A_{1,1}B&\cdots&A_{1,n}B\\\vdots&\ddots&\vdots\\A_{m,1}B&\cdots&A_{m,n}B\end{pmatrix}.$$
The Kronecker product has the mixed-product property: if $A,B,C,D$ are matrices of such size that one can form the matrix products $AC$ and $BD$, then $(A\otimes B)(C\otimes D)=(AC)\otimes(BD)$.
Vectorization For a matrix $A\in\mathbb{C}^{m\times n}$, we denote by $\operatorname{vec}(A)$ the vectorization of the matrix $A$, s.t. $\operatorname{vec}(A)=\left(A_{1,1},\cdots,A_{1,n},A_{2,1},\cdots,A_{m,n}\right)^{\top}$, which is the stack of rows of $A$. There is a relationship between matrix multiplication and the Kronecker product, s.t. $\operatorname{vec}(AXB)=\left(A\otimes B^{\top}\right)\operatorname{vec}(X)$.
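A quick NumPy check of this row-stacking identity (`flatten` uses row-major order, matching the convention above):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((3, 4))
X = rng.random((4, 5))
B = rng.random((5, 2))
# vec(A X B) = (A kron B^T) vec(X) under row-stacking vectorization
lhs = (A @ X @ B).flatten()
rhs = np.kron(A, B.T) @ X.flatten()
assert np.allclose(lhs, rhs)
```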
Matrices and Norms For a matrix $A$, we use $A^{\top}$ to denote the matrix transpose, $\overline{A}$ to denote entry-wise matrix conjugation, and $A^{*}$ to denote the matrix conjugate transpose ($A^{*}=\overline{A}^{\top}$). The vector 2-norm is defined to be $\left\|x\right\|_{2}=\sqrt{x^{*}x}$, and the matrix 2-norm is defined to be $\left\|A\right\|_{2}=\max_{x\neq0}\frac{\left\|Ax\right\|_{2}}{\left\|x\right\|_{2}}$.
We then recall the definition of the inner product under the $\pi$-kernel from Section 2. The inner product under the $\pi$-kernel for $x,y\in\mathbb{C}^{N}$ is $\langle x,y\rangle_{\pi}=\sum_{v\in[N]}\frac{x_{v}\overline{y_{v}}}{\pi_{v}}$, and its induced $\pi$-norm is $\left\|x\right\|_{\pi}=\sqrt{\langle x,x\rangle_{\pi}}$. The above definition allows us to define an inner product under the $\pi$-kernel on a larger space:
Definition 1.
Define inner product on under -kernel to be .
Remark 1.
For and , then inner product (under -kernel) between and can be simplified as
Remark 2.
The induced -norm is . When , the -norm can be simplified to be: .
Matrix Exponential The matrix exponential of a matrix $A$ is defined by the Taylor expansion $\exp(A)=\sum_{i=0}^{\infty}\frac{A^{i}}{i!}$. We will also use the fact that $\exp\left(A\otimes I+I\otimes B\right)=\exp(A)\otimes\exp(B)$, since $A\otimes I$ and $I\otimes B$ commute.
B.2 Proof of Theorem 3
Theorem 3 (restated).
Proof.
Due to symmetry, it suffices to prove one of the two statements. Let $t>0$ be a parameter to be chosen later. Then
$$\Pr\left[\lambda_{\max}\left(\frac{1}{k}\sum_{j=1}^{k}f(v_{j})\right)\ge\epsilon\right]\le\Pr\left[\operatorname{tr}\exp\left(t\sum_{j=1}^{k}f(v_{j})\right)\ge e^{tk\epsilon}\right]\le e^{-tk\epsilon}\,\mathbb{E}\left[\operatorname{tr}\exp\left(t\sum_{j=1}^{k}f(v_{j})\right)\right].\qquad(3)$$
The second inequality follows from Markov's inequality.
We next bound the expected trace term. Using Theorem 4, we have:
where the second step follows by the concavity of the $\log$ function and the fact that $\mu$ is a probability distribution on $\left[-\frac{\pi}{2},\frac{\pi}{2}\right]$. This implies
Note that, choosing the parameter $t$ appropriately, we have
Combining the above two equations together, we have
(4)
We now recall Lemma 1. Assuming the above lemma, we can complete the proof of the theorem as:
(5)
where the first step follows by Equation 4, the second step follows by swapping $\mathbb{E}$ and the integral, the third step follows by Lemma 1, the fourth step follows by our choice of parameters, and the last step follows since $\mu$ is a probability distribution on $\left[-\frac{\pi}{2},\frac{\pi}{2}\right]$.
B.3 Proof of Lemma 1
Lemma 1 (restated).
Proof.
After rewriting the product using the vectorization and Kronecker-product identities from Section B.1, the trace term in the LHS of Lemma 1 becomes
(6)
By iteratively applying the mixed-product property of the Kronecker product, we have
where we define
(7)
Plugging this into the trace term, we have
Next, taking the expectation of Equation 6 gives
(8) |
We turn to studying this expectation, which is characterized by the following lemma:
Lemma 2.
For a random walk $(v_{1},\cdots,v_{k})$ such that $v_{1}$ is sampled from an arbitrary probability distribution $\phi$ on $[N]$, the following identity holds, where $\mathbf{1}$ is the all-ones vector.
Proof.
Given Lemma 2, Equation 8 becomes:
The third equality is due to . The fourth equality is by setting (scalar) in . Then
where we define and . Moreover, by Remark 2, we have and
Definition 2.
Define linear subspace .
Remark 3.
is an orthonormal basis of . This is because by Remark 1, where is the Kronecker delta.
Remark 4.
Given . The projection of onto is . This is because
We want to bound
As it can be expressed by recursively applying the two operators, we turn to analyzing the effects of both operators.
Definition 3.
The spectral expansion of is defined as
Lemma 3.
.
Proof.
First we show one direction. Suppose the maximizer of is , i.e., . Construct for arbitrary non-zero . It is easy to check that , because , where the last equality is due to . Then we can bound such that
which indicates that for , . Taking the maximum over all gives .
Next we show the other direction. For such that and , we can decompose it as
where we define for . We can observe that , because for , we have
which indicates . Furthermore, we can also observe that the components are pairwise orthogonal. This is because for , , which allows us to apply the Pythagorean theorem such that .
We can use a similar decomposition to analyze :
where we can observe that the components are pairwise orthogonal. This is because for , we have . Again, applying the Pythagorean theorem gives:
which indicates that for such that and , we have , or equivalently .
Overall, we have shown both inequalities. We conclude that . ∎
Lemma 4.
(The effect of the operator) This lemma is a generalization of Lemma 3.3 in [6].
1. , then .
2. , then , and .
Proof.
Remark 5.
Lemma 4 implies that
1. ;
2. .
Lemma 5.
(The effect of the operator) Given three parameters and . Let be a regular Markov chain on state space , with stationary distribution and spectral expansion . Suppose each state is assigned a matrix s.t. and . Let and denote the block matrix whose -th diagonal block is the matrix , i.e., . Then for any , we have:
1. , where .
2. , where .
3. , where .
4. , where .
Proof.
(of Lemma 5) We first show that, for ,
Due to the linearity of projection,
(9)
where the second inequality follows by Remark 4.
Proof of Lemma 5, Part 1 Firstly, we can bound by
where the first step follows by , the second step follows by the definition of the matrix exponential, the third step follows by , and the fourth step follows by the triangle inequality. Given the above bound, for any which can be written as for some , we have
where step one follows by Part 1 of Remark 5 and step two follows by Equation 9.
Proof of Lemma 5, Part 2 For , we can write it as block matrix such that:
where each . Please note that above decomposition is pairwise orthogonal. Applying Pythagorean theorem gives . Similarly, we can decompose such that
(10)
Note that the above decomposition is pairwise orthogonal, too. Applying the Pythagorean theorem gives
which indicates
Now we can formally prove Part 2 of Lemma 5 by:
The first step follows by Part 2 of Remark 5, the second step follows by Part 1 of Lemma 4, and the fourth step is due to .
Recursive Analysis We now use Lemma 5 to analyze the evolution of and . Let in Lemma 5. We can verify the following three facts: (1) ; (2) is bounded; (3) .
Firstly, it is easy to see that
where the first step follows by definition of and the second step follows by the fact that , and the last step follows by Equation 7.
Secondly, we can bound by:
where the first step follows by triangle inequality, the second step follows by the fact that , the third step follows by and . We set to satisfy the assumption in Lemma 5 that . According to the conditions in Lemma 1, we know that and .
Finally, we show that , because
where the last step follows by .
Claim 4.
.
Proof.
Claim 5.
.
B.4 Proof of Theorem 1
See 1
Proof.
(of Theorem 1) Our strategy is to adopt the complexification technique [8]. For any complex Hermitian matrix $H$, we may write $H=A+iB$, where $A$ and $B$ are the real and imaginary parts of $H$, respectively. Moreover, the Hermitian property of $H$ (i.e., $H=H^{*}$) implies that (1) $A$ is real and symmetric (i.e., $A=A^{\top}$); (2) $B$ is real and skew-symmetric (i.e., $B=-B^{\top}$). The eigenvalues of $H$ can be found via the real symmetric matrix
$$\hat{H}=\begin{pmatrix}A&-B\\B&A\end{pmatrix},$$
where the symmetry of $\hat{H}$ follows by the symmetry of $A$ and the skew-symmetry of $B$. Note the fact that, if the (real) eigenvalues of $H$ are $\lambda_{1},\cdots,\lambda_{d}$, then those of $\hat{H}$ are $\lambda_{1},\lambda_{1},\cdots,\lambda_{d},\lambda_{d}$. I.e., $H$ and $\hat{H}$ have the same eigenvalues, but with different multiplicities.
Using the above technique, we can formally prove Theorem 1. For the complex matrix-valued function $f$ in Theorem 1, we can separate its real and imaginary parts by $f(v)=A(v)+iB(v)$. Then we construct a real-valued matrix function $\hat{f}$ s.t. $\hat{f}(v)=\begin{pmatrix}A(v)&-B(v)\\B(v)&A(v)\end{pmatrix}$, $\forall v\in[N]$. According to the complexification technique, we know that (1) $\hat{f}(v)$ is real symmetric and $\|\hat{f}(v)\|_{2}\le1$; (2) $\sum_{v\in[N]}\pi_{v}\hat{f}(v)=0$. Then
where the first step follows by the fact that $f(v_{j})$ and $\hat{f}(v_{j})$ have the same eigenvalues (with different multiplicities), and the second step follows by Theorem 3 (the additional factor of 4 arises because the constructed $\hat{f}$ has shape $2d\times2d$). The bound on $\lambda_{\min}$ also follows similarly. ∎