
Random Transpositions on Contingency Tables

Mackenzie Simper [email protected] Stanford University
Abstract

Contingency tables are useful objects in statistics for representing 2-way data. With fixed row and column sums, and a total of $n$ entries, contingency tables correspond to parabolic double cosets of $S_n$. The uniform distribution on $S_n$ induces the Fisher-Yates distribution, classical for its use in the chi-squared test for independence. A Markov chain on $S_n$ can then induce a random process on the space of contingency tables through the double cosets correspondence. The random transpositions Markov chain on $S_n$ induces a natural 'swap' Markov chain on the space of contingency tables; the stationary distribution of this Markov chain is the Fisher-Yates distribution. This paper describes the Markov chain and shows that its eigenfunctions are orthogonal polynomials of the Fisher-Yates distribution. Results for the mixing time are discussed, as well as connections with sampling from the uniform distribution on contingency tables, and data analysis.

1 Introduction

Contingency tables are a classical object of study, useful in statistics for representing data with two categorical features. For example, rows could be hair colors and columns could be eye colors. The number in an entry in the table counts the number of individuals from a sample with a fixed hair, eye color combination. Various statistical tests exist to test the hypothesis that the row and column features are independent, and other more general models. See e.g. [36], [1] and Section 5 of [16] for many more references.

Given two partitions $\lambda=(\lambda_1,\dots,\lambda_I)$ and $\mu=(\mu_1,\dots,\mu_J)$ of $n$, the set of contingency tables $\mathcal{T}_{\lambda,\mu}$ is the set of $I\times J$ tables with non-negative integer entries, row sums $\lambda_1,\dots,\lambda_I$, and column sums $\mu_1,\dots,\mu_J$. As explained below, the space $\mathcal{T}_{\lambda,\mu}$ is in bijection with the space of double cosets $S_\lambda\backslash S_n/S_\mu$, where $S_n$ is the symmetric group on $n$ elements and $S_\lambda,S_\mu$ are the Young subgroups defined by $\lambda,\mu$. Though this is a classical fact [29], recently it was studied in [16] with probabilistic motivation. The general question is: given a finite group $G$ and two subgroups $H,K$, what is the distribution on $H\backslash G/K$ induced by picking an element uniformly from $G$ and mapping it to its double coset? For $S_\lambda\backslash S_n/S_\mu$, the induced distribution on $\mathcal{T}_{\lambda,\mu}$ is the Fisher-Yates distribution:

\pi_{\lambda,\mu}(T)=\frac{\prod_{i}\lambda_{i}!\,\prod_{j}\mu_{j}!}{n!\,\prod_{i,j}T_{ij}!},\quad T=\{T_{ij}\}_{1\leq i\leq I,\,1\leq j\leq J}\in\mathcal{T}_{\lambda,\mu}. (1)

The Fisher-Yates distribution is classically defined as the conditional distribution of a contingency table sampled with cell probabilities $p_{ij}$ which satisfy $p_{ij}=\theta_i\eta_j$ (the assumption that row and column features are independent). It is the distribution of the table conditioned on the sufficient statistics (the row and column sums). It is a pleasant surprise that this same distribution is induced by the double coset decomposition.
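For concreteness, here is a minimal Python sketch (ours, not from the paper; the function name fisher_yates_prob is hypothetical) evaluating the Fisher-Yates probability (1) of a table given as a list of rows.

from math import factorial, prod

def fisher_yates_prob(T):
    # row and column sums are read off the table itself
    lam = [sum(row) for row in T]                     # row sums lambda_i
    mu = [sum(col) for col in zip(*T)]                # column sums mu_j
    n = sum(lam)
    num = prod(factorial(a) for a in lam) * prod(factorial(b) for b in mu)
    den = factorial(n) * prod(factorial(x) for row in T for x in row)
    return num / den

# The first table of Example 1.3 below has double coset size 24, so probability 24/5! = 0.2
print(fisher_yates_prob([[2, 1, 0], [0, 1, 1]]))      # 0.2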

Whenever there is a Markov chain on some state space, one can consider the lumped process on any partitioning of this space. In [12] the question is studied for double cosets: Given a Markov chain on GG, is the lumped process on H\G/KH\backslash G/K also a Markov chain? An affirmative answer is proven when the Markov chain on GG is induced by a probability measure on GG which is constant on conjugacy classes.

Many Markov chains on $S_n$ have been studied, and perhaps the most famous and easiest to describe is the random transpositions Markov chain [14]. This is generated by the class function $Q$ which gives equal probability to transpositions. The Markov chain on $\mathcal{T}_{\lambda,\mu}$ induced by random transpositions turns out to be novel and easy to describe. One move involves adding a sub-matrix of the form $\begin{pmatrix}+1&-1\\-1&+1\end{pmatrix}$ to the current table; the probability of selecting the rows and columns for the sub-matrix depends on the current configuration. Since the stationary distribution of the random transpositions chain on $S_n$ is uniform, the induced chain on $\mathcal{T}_{\lambda,\mu}$ has the Fisher-Yates distribution as its stationary distribution. These statements are proven in Section 1.1 below.

This same move has previously been studied for a Markov chain on contingency tables with rows and columns for the sub-matrix chosen uniformly at random (and moves which would give a negative entry in the table are rejected). This 'uniform swap' chain was proposed in [11] to sample contingency tables from the uniform distribution, as a way of approximately counting the number of tables, which is a #P-complete problem.

The purpose of this paper is to describe the random transposition chain on 𝒯λ,μ\mathcal{T}_{\lambda,\mu}, characterize its eigenvalues and eigenvectors, and analyze the mixing time in special cases. This chain has polynomial eigenvectors, which can be used to give orthogonal polynomials for the Fisher-Yates distribution. In the special case of a two-rowed table, the Fisher-Yates distribution is the multivariate hypergeometric distribution and the orthogonal polynomials are the multivariate Hahn polynomials. More generally, this Markov chain gives a characterization of the natural analog of Hahn polynomials for the Fisher-Yates distribution.

The remainder of this section reviews the correspondence between contingency tables and double cosets, defines the Markov chain explicitly, states the main results of this paper, and reviews related work.

1.1 Contingency Tables as Double Cosets

This section describes the relationship between contingency tables and double cosets of SnS_{n}. Let nn be a positive integer and suppose λ=(λ1,λ2,,λI)\lambda=(\lambda_{1},\lambda_{2},\dots,\lambda_{I}) is a partition of nn. That is, λi\lambda_{i} are all positive integers, λ1λ2λI>0\lambda_{1}\geq\lambda_{2}\geq\ldots\geq\lambda_{I}>0, and λ1+λ2++λI=n\lambda_{1}+\lambda_{2}+\ldots+\lambda_{I}=n.

Definition 1.1 (Young Subgroup).

The parabolic subgroup, or Young subgroup, SλS_{\lambda} is the set of all permutations in SnS_{n} which permute only {1,2,,λ1}\{1,2,\dots,\lambda_{1}\} among themselves, only {λ1+1,,λ1+λ2}\{\lambda_{1}+1,\dots,\lambda_{1}+\lambda_{2}\} among themselves, and so on. Thus,

SλSλ1×Sλ2××SλI.S_{\lambda}\cong S_{\lambda_{1}}\times S_{\lambda_{2}}\times\ldots\times S_{\lambda_{I}}.

If GG is an arbitrary finite group and H,KH,K are two sub-groups of GG, then the HKH-K double-cosets are the equivalence classes defined by the equivalence relation

sts=htkfors,tG,hH,kK.s\sim t\iff s=htk\,\,\,\,\,\,\,\,\text{for}\,\,\,\,\,\,\,\,s,t\in G,\,\,\,h\in H,\,\,\,k\in K.

Let μ=(μ1,,μJ)\mu=(\mu_{1},\dots,\mu_{J}) be a second partition of nn, let 𝒯λ,μ\mathcal{T}_{\lambda,\mu} denote the space of contingency tables with row sums given by λ\lambda and column sums μ\mu. That is, an element of 𝒯λ,μ\mathcal{T}_{\lambda,\mu} is an I×JI\times J table with non-negative integer entries, row sums λ1,,λI\lambda_{1},\dots,\lambda_{I} and column sums μ1,,μJ\mu_{1},\dots,\mu_{J}. The following bijection is described in further detail in [29], and the induced distribution, along with many more references, in [16].

Lemma 1.2.

The set of double cosets $S_\lambda\backslash S_n/S_\mu$ is in bijection with $\mathcal{T}_{\lambda,\mu}$. The distribution on $\mathcal{T}_{\lambda,\mu}$ induced by sampling a uniform $\sigma\in S_n$ and mapping it to its corresponding double coset is the Fisher-Yates distribution

\pi_{\lambda,\mu}(T)=\frac{\prod_{i}\lambda_{i}!\,\prod_{j}\mu_{j}!}{n!\,\prod_{i,j}T_{ij}!},\quad T\in\mathcal{T}_{\lambda,\mu}. (2)
Remark 1.

Without loss of generality, we assume that the rows and columns of the contingency tables in $\mathcal{T}_{\lambda,\mu}$ are ordered, i.e. the first row/column has the largest sum and the row/column sums are decreasing. This ordering does not affect any of the results in this paper.

The mapping from $S_n$ to tables is easy to describe: Fix $\sigma\in S_n$. Inspect the first $\lambda_1$ positions in $\sigma$. Let $T_{11}$ be the number of elements from $\{1,2,\dots,\mu_1\}$ occurring in these positions, $T_{12}$ the number of elements from $\{\mu_1+1,\dots,\mu_1+\mu_2\}$, ..., and $T_{1J}$ the number of elements from $\{n-\mu_J+1,\dots,n\}$. In general, $T_{ij}$ is the number of elements from $\{\mu_1+\ldots+\mu_{j-1}+1,\dots,\mu_1+\ldots+\mu_j\}$ which occur in positions $\lambda_1+\ldots+\lambda_{i-1}+1$ up to $\lambda_1+\ldots+\lambda_i$.
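The following short Python sketch (ours, not from the paper; perm_to_table is a hypothetical name) implements this map from a permutation in one-line notation to its contingency table.

def perm_to_table(sigma, lam, mu):
    # block index (0-based) of a value v under a composition (p_1, p_2, ...)
    def block(v, parts):
        total = 0
        for b, p in enumerate(parts):
            total += p
            if v <= total:
                return b
    T = [[0] * len(mu) for _ in lam]
    for pos, val in enumerate(sigma, start=1):
        T[block(pos, lam)][block(val, mu)] += 1
    return T

print(perm_to_table([1, 2, 5, 3, 4], (3, 2), (2, 2, 1)))   # [[2, 0, 1], [0, 2, 0]]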

Example 1.3.

When n=5n=5, λ=(3,2)\lambda=(3,2), μ=(2,2,1)\mu=(2,2,1) there are five possible tables:

\begin{pmatrix}2&1&0\\0&1&1\end{pmatrix}\qquad\begin{pmatrix}2&0&1\\0&2&0\end{pmatrix}\qquad\begin{pmatrix}1&2&0\\1&0&1\end{pmatrix}\qquad\begin{pmatrix}1&1&1\\1&1&0\end{pmatrix}\qquad\begin{pmatrix}0&2&1\\2&0&0\end{pmatrix}
\sigma=12345\qquad\ \ \sigma=12534\qquad\ \ \sigma=13425\qquad\ \ \sigma=13524\qquad\ \ \sigma=34512
\ \ 24\qquad\qquad\ \ 12\qquad\qquad\ \ 24\qquad\qquad\ \ 48\qquad\qquad\ \ 12

Listed below each table is a permutation in the corresponding double coset, and the total size of the double coset. The double coset representatives are chosen to be the ones of minimal length, i.e. requiring the fewest adjacent transpositions to change the permutation to the identity. It is always possible to find a unique shortest double coset representative [4]. This is easy to identify: Given $T$, build $\sigma$ sequentially, left to right, by putting down $1,2,\dots,T_{11}$, then $\mu_1+1,\mu_1+2,\dots,\mu_1+T_{12}$, ..., each time putting down the smallest available numbers in the $\mu_j$ block, in order. In Example 1.3, the shortest double coset representatives are shown.
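As a sketch of this construction (ours; the helper name min_representative is not from the paper), the minimal-length representative can be built directly from the table:

def min_representative(T, mu):
    # next_val[j] is the smallest unused value in the j-th mu block
    next_val, s = [], 1
    for m in mu:
        next_val.append(s)
        s += m
    sigma = []
    for row in T:
        for j, count in enumerate(row):
            sigma.extend(range(next_val[j], next_val[j] + count))
            next_val[j] += count
    return sigma

# Reproduces the representatives of Example 1.3, e.g. the last table gives 34512:
print(min_representative([[0, 2, 1], [2, 0, 0]], (2, 2, 1)))   # [3, 4, 5, 1, 2]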

Thanks to Zhihan Li, another way to describe the mapping from $S_n$ to contingency tables is to consider the $n\times n$ permutation matrix defined by $\sigma$. The partition $\mu$ divides the columns into $J$ blocks and $\lambda$ divides the rows into $I$ blocks. The table $T$ corresponding to $\sigma$ is defined by letting $T_{ij}$ be the number of $1$s in the $i$-th block of rows and $j$-th block of columns. For example, $\sigma=12543$ gives the permutation matrix

μ1\mu_{1} μ2\mu_{2} μ3\mu_{3}
λ1\lambda_{1} 1 0 0 0 0
0 1 0 0 0
0 0 0 0 1
λ2\lambda_{2} 0 0 0 1 0
0 0 1 0 0

which defines the table $\begin{pmatrix}2&0&1\\0&2&0\end{pmatrix}$.

1.2 Markov Chain and Main Results

The random transpositions chain on SnS_{n} is easy to describe. If a permutation σSn\sigma\in S_{n} is viewed as an arrangement of a deck of nn unique cards, the random transpositions chain can be described: Pick one card with your left hand and one card with your right hand (allowing for the possibility of picking the same card), and swap the two cards. The transition probabilities are generated by the probability measure QQ on SnS_{n} defined

Q(\sigma)=\begin{cases}\frac{2}{n^{2}}&\text{if }\sigma=(i,j),\ i<j\\ \frac{1}{n}&\text{if }\sigma=\mathrm{id}\\ 0&\text{otherwise},\end{cases} (3)

where (i,j)(i,j) denotes the transposition of elements ii and jj. The transition matrix for the random transposition Markov chain is then

P(x,y)=Q(yx1),x,ySn.P(x,y)=Q(y\cdot x^{-1}),\,\,\,\,\,\,\,\,x,y\in S_{n}.

Note that $Q$ is constant on conjugacy classes: if $i<j$, $k<\ell$ and $x=(ij)$, $y=(k\ell)$ are two transpositions, then

yxy^{-1}=(k\ell)\cdot(ij)\cdot(k\ell)=\begin{cases}(ij)&\text{if }\{i,j\}\cap\{k,\ell\}=\emptyset\ \text{or}\ (i,j)=(k,\ell)\\ (j\ell)&\text{if }i=k,\ j\neq\ell\\ (ki)&\text{if }j=\ell,\ i\neq k\\ (jk)&\text{if }i=\ell\\ (i\ell)&\text{if }j=k.\end{cases}

In every case $yxy^{-1}$ is a transposition. Since transpositions generate $S_n$, it follows that $yxy^{-1}$ is a transposition for any $y\in S_n$ whenever $x$ is a transposition, and thus $Q(yxy^{-1})=Q(x)$.

The random transpositions chain on SnS_{n} induces a Markov chain on 𝒯λ,μ\mathcal{T}_{\lambda,\mu} as a ‘lumping’. From a table T𝒯λ,μT\in\mathcal{T}_{\lambda,\mu}, choose a permutation xSnx\in S_{n} that maps to the double coset corresponding to TT. From xx, sample yy from P(x,)P(x,\cdot) and move to the contingency table corresponding to yy. More details of this lumping are discussed in Section 2.1, and this perspective will be useful for relating the well-studied eigenvalues of the chain on SnS_{n} to the chain on 𝒯λ,μ\mathcal{T}_{\lambda,\mu}. It is also possible to write down the transition probabilities independently of the chain on SnS_{n}.

Definition 1.4 (Random Transpositions Markov chain on 𝒯λ,μ\mathcal{T}_{\lambda,\mu}).

Let $\lambda=(\lambda_1,\dots,\lambda_I)$, $\mu=(\mu_1,\dots,\mu_J)$ be two partitions of $n$. For $T\in\mathcal{T}_{\lambda,\mu}$ and indices $1\leq i_1\neq i_2\leq I$ and $1\leq j_1\neq j_2\leq J$, let $F_{(i_1,j_1),(i_2,j_2)}(T)=T^{\prime}$ denote the table with

Ti1,j1=Ti1,j11,\displaystyle T_{i_{1},j_{1}}^{\prime}=T_{i_{1},j_{1}}-1,
Ti2,j2=Ti2,j21,\displaystyle T_{i_{2},j_{2}}^{\prime}=T_{i_{2},j_{2}}-1,
Ti1,j2=Ti1,j2+1,\displaystyle T_{i_{1},j_{2}}^{\prime}=T_{i_{1},j_{2}}+1,
Ti2,j1=Ti2,j1+1,\displaystyle T_{i_{2},j_{1}}^{\prime}=T_{i_{2},j_{1}}+1,

and all other entries of TT^{\prime} the same as TT. The random transpositions Markov chain on 𝒯λ,μ\mathcal{T}_{\lambda,\mu} is defined by the transition matrix PP with:

P(T,T^{\prime})=\begin{cases}\frac{2\,T_{i_1,j_1}T_{i_2,j_2}}{n^{2}}&\text{if }T^{\prime}=F_{(i_1,j_1),(i_2,j_2)}(T)\\ 1-\sum_{i_1<i_2}\sum_{j_1\neq j_2}\frac{2\,T_{i_1,j_1}T_{i_2,j_2}}{n^{2}}&\text{if }T^{\prime}=T\\ 0&\text{otherwise}.\end{cases}

Note if Ti1,j1=0T_{i_{1},j_{1}}=0 or Ti2,j2=0T_{i_{2},j_{2}}=0, then F(i1,j1),(i2,j2)(T)F_{(i_{1},j_{1}),(i_{2},j_{2})}(T) has negative values and is not an element of 𝒯λ,μ\mathcal{T}_{\lambda,\mu}. Definition 1.4 does not allow selecting this possibility, since only pairs with Ti1,j1Ti2,j21T_{i_{1},j_{1}}T_{i_{2},j_{2}}\geq 1 will be chosen.

There are other ways of thinking about the Markov chain. Suppose the contingency table is created from nn data pairs of the type (i,j)(i,j), where ii indicates the row feature and jj indicates the column feature. One step of the chain is to pick two pairs uniformly (possibly picking the same pair) and swapping the column values (or equivalently, swapping the row values). This is exactly the swap Markov chain on bi-partite graphs, which has been studied for various special cases, e.g. [2].
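A minimal simulation sketch (ours, in Python) of one step of Definition 1.4, using exactly this data-pairs description; chain_step is a hypothetical name and modifies the table in place.

import random

def chain_step(T):
    # list each of the n items once, labelled by its (row, column) cell
    items = [(i, j) for i, row in enumerate(T) for j, c in enumerate(row) for _ in range(c)]
    i1, j1 = random.choice(items)
    i2, j2 = random.choice(items)            # with replacement, matching Q in (3)
    if i1 != i2 and j1 != j2:                # otherwise the table is unchanged
        T[i1][j1] -= 1; T[i2][j2] -= 1
        T[i1][j2] += 1; T[i2][j1] += 1
    return T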

Example 1.5.

$\lambda=(3,2)$, $\mu=(2,2,1)$:

T^{x}=\begin{pmatrix}2&1&0\\0&1&1\end{pmatrix}\qquad\longrightarrow\qquad T^{y}=\begin{pmatrix}1&2&0\\1&0&1\end{pmatrix}
x=12345\qquad\qquad\longrightarrow\qquad y=1\mathbf{4}3\mathbf{2}5

The probability of this transition for the tables is $4/5^{2}$: there are two possible transpositions in the permutation $x$ that would result in the table $T^{y}$, each chosen with probability $2/5^{2}$. Representing the table as a set of $n$ data points, this transition is:

T^{x}:(1,1),(1,1),(1,2),(2,2),(2,3)\longrightarrow T^{y}:(1,1),(1,\mathbf{2}),(1,2),(2,\mathbf{1}),(2,3).
Remark 2.

If I=J=2I=J=2, then this is related to the Bernoulli-Laplace chain: Suppose λ=(nk,k),μ=(nj,j)\lambda=(n-k,k),\mu=(n-j,j). Consider two urns. The first contains nkn-k total balls and the second has kk balls. Of these nn total balls, njn-j are green and jj are red. In the contingency table, the rows represent urns and the columns colors. In the traditional Bernoulli-Laplace urn, one ball is picked uniformly from each urn and swapped. The random transpositions chain Definition 1.4 can be described by instead picking 22 balls uniformly from all nn balls and swapping (which creates the possibility both are selected from the same urn).

Using the spherical Fourier transform of the Gelfand pair (S2n,Sn×Sn)(S_{2n},S_{n}\times S_{n}), Diaconis and Shahshahani proved mixing time of order (n/4)log(n)(n/4)\log(n) steps for the Bernoulli-Laplace chain in a special case (for 2n2n total balls in the system); for thorough analyses of the Bernoulli-Laplace chain, see [15], [21], [6]. An extension of the Bernoulli-Laplace chain with mm urns is studied in [46]; this state space corresponds to 𝒯λ,μ\mathcal{T}_{\lambda,\mu} for λ,μ\lambda,\mu partitions of mnmn with λ=1mn,μ=(n,n,,n)\lambda=1^{mn},\mu=(n,n,\dots,n).

Remark 3.

The random transpositions chain is sometimes defined with the probability measure:

\widetilde{Q}(\sigma)=\begin{cases}\frac{1}{\binom{n}{2}}&\text{if }\sigma=(i,j),\ i<j\\ 0&\text{else}.\end{cases}

That is, the random walk on SnS_{n} is described: pick a card with your left hand and pick a different card with your right hand. This means the permutation will always change at each step, but then the process is periodic. Using QQ from (3) (i.e. allowing the possibility of picking the same card) the chain is aperiodic, which allows for easier analysis of the chain on SnS_{n}.

Note that as long as $\lambda,\mu$ are not both equal to $(1,1,\dots,1)$ (in which case $S_\lambda\backslash S_n/S_\mu\cong S_n$), the chain using probabilities $1/\binom{n}{2}$ is not periodic: if some row or column contains at least two of the $n$ items, say in cells $(i,j)$ and $(i,l)$, then picking these two items to swap does not change the table. Thus, using $\widetilde{Q}$ to define the induced process on $\mathcal{T}_{\lambda,\mu}$ is also possible. The analysis and mixing times are the same for both $Q$ and $\widetilde{Q}$, except the chains have slightly different eigenvalues.

The interpretation of the chain as a lumping of the random transpositions chain on SnS_{n} is advantageous because the chain on contingency tables inherits its eigenvalues from the chain on SnS_{n}, and these eigenvalues are well-known. Representation theory also gives an expression for the multiplicities of the eigenvalues in terms of Kostka numbers corresponding to partitions. Section 2.1 reviews this background in more detail, and thoroughly explains part (b) of the following theorem, which are the results for eigenvalues and multiplicities.

Theorem 1.6.

Let λ=(λ1,,λI),μ=(μ1,,μJ)\lambda=(\lambda_{1},\dots,\lambda_{I}),\mu=(\mu_{1},\dots,\mu_{J}) be partitions of nn and PP the random transpositions Markov chain on the space of contingency tables 𝒯λ,μ\mathcal{T}_{\lambda,\mu}.

  1. (a)

    The eigenvalues βρ\beta_{\rho} are of the form

    βρ=1n+1n2j=1k[(ρjj)(ρjj+1)j(j1)],\beta_{\rho}=\frac{1}{n}+\frac{1}{n^{2}}\sum_{j=1}^{k}\left[(\rho_{j}-j)(\rho_{j}-j+1)-j(j-1)\right],

    for ρ=(ρ1,,ρk)\rho=(\rho_{1},\dots,\rho_{k}) a partition of nn.

  2. (b)

    The multiplicity of βρ\beta_{\rho} for PP is mρλmρμm_{\rho}^{\lambda}\cdot m_{\rho}^{\mu}, where mρλm_{\rho}^{\lambda} is the Kostka number (defined below).

  3. (c)

    Eigenfunctions of PP are the orthogonal polynomials for the Fisher-Yates distribution.

Theorem 1.6 (a) and (b) arise from the general theory for Markov chains on double cosets developed in [12] and [18] and summarized in Section 2.1. Section 3 contains a proof of Theorem 1.6(c) as well as a description of the multiplicities. It seems unlikely that there would be a simple formula for the number and multiplicity of the eigenvalues, as this would give a way of enumerating the size of the state space $\mathcal{T}_{\lambda,\mu}$, which is a #P-complete problem [11]. The orthogonal polynomials of the Fisher-Yates distribution have not been explicitly computed. Section 3.2 computes linear and quadratic eigenfunctions of $P$, for any size table, which can be used as a starting point for finding orthogonal polynomials.

For 2×J2\times J tables, the stationary distribution is multi-variate hypergeometric, and the computation of the eigenvalues simplifies to give a more complete picture.

Theorem 1.7.

Let λ=(nk,k)\lambda=(n-k,k) for some kn/2k\leq\lfloor n/2\rfloor, μ=(μ1,,μJ)\mu=(\mu_{1},\dots,\mu_{J}), and PP be the random swap Markov chain on the space of contingency tables 𝒯λ,μ\mathcal{T}_{\lambda,\mu}. Then the eigenvalues of PP are

βm=12m(n+1m)n2,      0mk,\beta_{m}=1-\frac{2m(n+1-m)}{n^{2}},\,\,\,\,\,\,0\leq m\leq k,

with eigenbasis the orthogonal polynomials of degree mm for the multivariate hypergeometric distribution (defined below). The multiplicity of βm\beta_{m} is the size of the set

{(x1,,xJ1)J1:j=1J1xj=m,xj<μJj+1}.\left\{(x_{1},\dots,x_{J-1})\in\mathbb{N}^{J-1}:\sum_{j=1}^{J-1}x_{j}=m,x_{j}<\mu_{J-j+1}\right\}.

The eigenvalues and eigenvectors allow an analysis of the convergence rates of this Markov chain. The measure that we study for convergence rate is the mixing time. That is, for a Markov chain on a space Ω\Omega with transition probability PP and stationary distribution π\pi, the mixing time is defined as

tmix(ϵ)=supx0Ωinf{t>0:Pt(x0,)π()TV<ϵ},ϵ>0,t_{mix}(\epsilon)=\sup_{x_{0}\in\Omega}\inf\{t>0:\|P^{t}(x_{0},\cdot)-\pi(\cdot)\|_{TV}<\epsilon\},\quad\epsilon>0,

where μπTV=supAΩ|μ(A)π(A)|\|\mu-\pi\|_{TV}=\sup_{A\subset\Omega}|\mu(A)-\pi(A)| is the total-variation distance between probability measures.
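In computations it is convenient to use the equivalent formula $\|\mu-\pi\|_{TV}=\frac{1}{2}\sum_x|\mu(x)-\pi(x)|$; a small Python sketch (ours, with the hypothetical name tv_distance):

def tv_distance(p, q):
    # p, q: dictionaries mapping states to probabilities
    states = set(p) | set(q)
    return 0.5 * sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in states)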

It is challenging to analyze the mixing time in full generality, but understanding the linear and quadratic eigenfunctions of the chain give the following results:

  • For λ=(nk,k),μ=(n,)\lambda=(n-k,k),\mu=(n-\ell,\ell), the time to stationarity, averaged over starting positions sampled from πλ,μ\pi_{\lambda,\mu}, is bounded above by (n/4)log(min(k,))(n/4)\log(\min(k,\ell)).

  • For λ=(nk,k)\lambda=(n-k,k) and any μ\mu, the multivariate Hahn polynomials are used to compute an upper bound for the time to stationarity starting from extreme states in which the second row has a single non-zero entry.

  • For any λ,μ\lambda,\mu, the mixing time is bounded below by ((n/4)1)log(Cλ,μn)((n/4)-1)\log(C_{\lambda,\mu}n), where Cλ,μC_{\lambda,\mu} is a computable constant.

The mixing time of the random transpositions chain on SnS_{n} is (1/2)nlog(n)+cn(1/2)n\log(n)+cn, which is an upper bound for the mixing time of the lumped chain on 𝒯λ,μ\mathcal{T}_{\lambda,\mu}. The results listed above suggest that there is not much of a speed-up.

1.3 Related Work

Note that for two way contingency tables it is easy to sample directly from the Fisher-Yates distribution (by generating a random permutation and using the argument of Lemma 1.2). One reason for interest in the Markov chain mixing time is that, for three and higher way tables, the Markov chain method is the only available approach, so it is important to see how it works in the two way case.

The uniform swap Markov chain on contingency tables was first proposed in [11], with the motivation of studying the uniform distribution [10]. The mixing time of this chain has been analyzed in special cases, e.g.  if n=λi=μjn=\sum\lambda_{i}=\sum\mu_{j} and the number of rows and columns is fixed, then the mixing time is of order n2n^{2}. When the table has two rows, the chain mixes in time polynomial in the number of columns and nn [25]. Chung, Graham, and Yau [7] indicate that a modified version of the chain converges in time polynomial in the number of rows and columns.

A contingency table can be thought of as a bi-partite graph with prescribed degree sequences, and a similar swap Markov chain defined for any graph with fixed degree sequences to sample from the uniform distribution, e.g.  [2]. There is a large literature on methods for sampling graphs with given degree sequences (not necessarily bi-partite); see [20] for a recent review. Recently the chain was studied for contingency tables with entries in /q\mathbb{Z}/q\mathbb{Z} and mixing time with cut-off proven at cqn2log(n)c_{q}n^{2}\log(n) [41].

A Markov chain on $\mathcal{T}_{\lambda,\mu}$ for which the stationary distribution is Fisher-Yates was studied in [17], motivated by studying conditional distributions. The Markov chain is created by taking the Gibbs sampler of the uniform swap chain defined in [11]. This gives the chain: pick a pair of rows $i_1,i_2$ and columns $j_1,j_2$ uniformly at random. This determines a $2\times2$ sub-table. Replace it with a $2\times2$ sub-table with the same margins, chosen from the hypergeometric distribution. As noted in Section 2 of [17], while it is straightforward to sample directly from the Fisher-Yates distribution for $2$-way tables, there is not a simple algorithm for sampling from the analogous distribution on $3$-way or higher-dimensional tables. Thus, Markov chains with the Fisher-Yates distribution as stationary distribution are valuable for this application; multiway tables are discussed further in Section 5.2.

1.4 Outline

Section 2 contains an overview of basic definitions and facts for Markov chains, double cosets, and orthogonal polynomials. Section 3 establishes the eigenvalues and multiplicities, as well as the eigenfunctions of the Markov chain as polynomials, and gives formulas for the linear and quadratic polynomial eigenfunctions. These are then used in Section 4 to find explicit upper and lower bounds on the mixing time of the chain, in specific cases. Section 5 discusses some future directions.

Acknowledgments

The author thanks Persi Diaconis for helpful discussion and suggestions. This work was supported by a National Defense Science & Engineering Graduate Fellowship and a Lieberman Fellowship at Stanford University.

2 Background

This section reviews necessary facts and background on Markov chains, orthogonal polynomials, and the random transpositions chain on SnS_{n}.

2.1 Double Coset Markov Chains

The results of this section use basic tools of representation theory; in particular, induced representations and Frobenius reciprocity. For background on these topics, see [27], [30], [47]. In the specific case in hand – the symmetric group – additional tools are available; Young’s rule and the hook-length formula. These are clearly developed in [31]. See also, [29].

Let $H,K$ be subgroups of the finite group $G$, and $Q$ a probability on $G$. If $\mathrm{supp}(Q)$ is not contained in a coset of a proper subgroup, then the random walk on $G$ induced by $Q$ is ergodic with the uniform stationary distribution. This random walk on $G$ is defined by picking a random element from $Q$ and multiplying it by the current state. That is, transitions are

P(x,y)=Q(yx1).P(x,y)=Q(yx^{-1}).

See [8] for an introduction to random walks on groups.

The double cosets of H\G/KH\backslash G/K partition the space GG and any Markov chain on GG defines a random process on the set of double cosets by keeping track of which double coset the process on GG is in, giving a ‘lumped’ process. Proposition 2.1 gives a condition for when the random walk induced by QQ on GG is also a Markov chain on the double cosets. For the statement we fix double coset representatives and write xx for the double coset HxKHxK. The result is proven in [12].

Proposition 2.1.

Let $Q$ be a probability on $G$ which is $H$-conjugacy invariant ($Q(s)=Q(h^{-1}sh)$ for $h\in H$, $s\in G$). Then, the image of the random walk driven by $Q$ on $G$ maps to a Markov chain on $H\backslash G/K$ with transition kernel

P(x,y)=Q(HyKx1).P(x,y)=Q(HyKx^{-1}).

The stationary distribution is π(x)=|HxK|/|G|\pi(x)=|HxK|/|G|. If Q(s)=Q(s1)Q(s)=Q(s^{-1}), then (P,π)(P,\pi) is reversible.

Remark 4.

It is special for a function of a Markov chain to also be Markov. In this setting, many of the famous Markov chains on $S_n$ are not Markov when they are lumped to the double cosets $S_\lambda\backslash S_n/S_\mu$. For example, the adjacent transpositions random walk on $S_n$ is induced by $Q((i,i+1))=1/(n-1)$, which does not satisfy the conditions of Proposition 2.1. See [43] for a survey on lumped Markov chains.

Since transpositions form a conjugacy class in $S_n$, the probability $Q$ from (3) satisfies the conditions of Proposition 2.1, so the process lumped to contingency tables is a Markov chain. The following lemma also directly proves that the lumping of the random transpositions chain to contingency tables is Markov, and equivalent to Definition 1.4.

Lemma 2.2.

The random transpositions chain on $S_n$ induced by $Q$ from (3), when lumped to the double cosets $S_\lambda\backslash S_n/S_\mu$, is equivalent to the Markov chain defined in Definition 1.4.

Proof.

Suppose $\lambda=(\lambda_1,\dots,\lambda_I)$, $\mu=(\mu_1,\dots,\mu_J)$. Define $L_1=\{1,\dots,\lambda_1\}$, $L_2=\{\lambda_1+1,\dots,\lambda_1+\lambda_2\}$, ..., $L_I=\{n-\lambda_I+1,\dots,n\}$ and similarly use $\mu$ to define $M_1,\dots,M_J$. Then the contingency table corresponding to the double coset of $x\in S_n$ is defined by

Tx={Tijx},Tijx=#{Li:x()Mj}.T^{x}=\{T_{ij}^{x}\},\quad T_{ij}^{x}=\#\{\ell\in L_{i}:x(\ell)\in M_{j}\}.

Let (ab)(ab) be a transposition in SnS_{n}. If y=(ab)xy=(ab)x then y(a)=x(b),y(b)=x(a)y(a)=x(b),y(b)=x(a) and y(c)=x(c),ca,by(c)=x(c),c\neq a,b. Suppose aLi1,bLi2,x(a)Mj1,x(b)Mj2a\in L_{i_{1}},b\in L_{i_{2}},x(a)\in M_{j_{1}},x(b)\in M_{j_{2}}. If i1=i2i_{1}=i_{2} or j1=j2j_{1}=j_{2} then ySλxSμy\in S_{\lambda}xS_{\mu}. Otherwise, Ty=F(i1,j1),(i2,j2)(Tx)T^{y}=F_{(i_{1},j_{1}),(i_{2},j_{2})}(T^{x}), as defined in Definition 1.4.

Note that if x,xSnx,x^{\prime}\in S_{n} are contained in the same double coset, i.e. Tx=TxT^{x}=T^{x^{\prime}}, then for any ySny\in S_{n}

ySλySμQ(yx1)=ySλySμQ(y(x)1).\sum_{y^{\prime}\in S_{\lambda}yS_{\mu}}Q(y^{\prime}x^{-1})=\sum_{y^{\prime}\in S_{\lambda}yS_{\mu}}Q(y^{\prime}(x^{\prime})^{-1}).

In words, for the random transpositions chain on SnS_{n}, the probability of transitioning from the double coset SλxSμS_{\lambda}xS_{\mu} to another double coset SλySμS_{\lambda}yS_{\mu} doesn’t depend on the choice of double coset representative. This is Dynkin’s condition for lumped Markov chains ([37], [43]).

Eigenvalues

Let 1=β0>β1β|Ω|111=\beta_{0}>\beta_{1}\geq\ldots\geq\beta_{|\Omega|-1}\geq-1 be the eigenvalues of PP with corresponding eigenfunctions f0,f1,,f|Ω|1f_{0},f_{1},\dots,f_{|\Omega|-1}. That is, fi:Ωf_{i}:\Omega\to\mathbb{R} and

𝔼[fi(X1)X0=x]=βifi(x),xΩ.\operatorname{{\mathbb{E}}}[f_{i}(X_{1})\mid X_{0}=x]=\beta_{i}\cdot f_{i}(x),\,\,\,\,\,x\in\Omega.

For reversible Markov chains, the eigenfunctions {fi}i0\{f_{i}\}_{i\geq 0} can be chosen to be orthonormal with respect to the stationary distribution:

xΩfi(x)fj(x)π(x)=1(i=j).\sum_{x\in\Omega}f_{i}(x)\cdot f_{j}(x)\cdot\pi(x)=\textbf{1}(i=j).

The eigenvalues and eigenfunctions give an exact formula for the chi-squared distance between the chain and the stationary distribution, defined

χx2(t):=yΩ|Pt(x,y)π(y)|2π(y).\chi^{2}_{x}(t):=\sum_{y\in\Omega}\frac{\left|P^{t}(x,y)-\pi(y)\right|^{2}}{\pi(y)}.

The chi-squared distance then gives an upper bound for the total variation distance. This information is summarized in the following lemma, see [37] Chapter 12.

Lemma 2.3.

For any xΩx\in\Omega and t>0t>0,

Pt(x,)π()TV214χx2(t)=14i=1|Ω|1βi2tfi2(x).\|P^{t}(x,\cdot)-\pi(\cdot)\|_{TV}^{2}\leq\frac{1}{4}\chi^{2}_{x}(t)=\frac{1}{4}\cdot\sum_{i=1}^{|\Omega|-1}\beta_{i}^{2t}\cdot f_{i}^{2}(x).
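A quick numerical sketch (ours, assuming NumPy is available) illustrating the bound of Lemma 2.3 on a tiny reversible chain, computing both sides directly from matrix powers:

import numpy as np

P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
pi = np.array([0.25, 0.50, 0.25])        # stationary; detailed balance holds
t, x = 5, 0
Pt = np.linalg.matrix_power(P, t)
chi2 = float(sum((Pt[x, y] - pi[y]) ** 2 / pi[y] for y in range(3)))
tv = 0.5 * float(np.abs(Pt[x] - pi).sum())
print(tv ** 2 <= chi2 / 4, tv, chi2)     # the total variation bound of Lemma 2.3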

Note that the worst-case mixing time is defined by looking at the total variation distance from the worst-case starting point. Thus, to bound the mixing time one needs to be able to analyze $f_i^2(x)$ for any $x\in\Omega$. Even when the eigenfunctions are explicitly known, this can be challenging. For chains $P(x,y)$ on $\Omega$ where a group $G$ acts transitively (for all $x,y\in\Omega$, $g\in G$, $P(x,y)=P(gx,gy)$), the distance from all starting states is the same and the right hand side of the bound in Lemma 2.3 is $\sum_i\beta_i^{2t}$. Another bound, for any starting state $x$, is

4PxπTV21π(x)β2,4\|P_{x}^{\ell}-\pi\|_{TV}^{2}\leq\frac{1}{\pi(x)}\beta_{*}^{2\ell}, (4)

with β=maxλ1|βλ|\beta_{*}=\max_{\lambda\neq 1}|\beta_{\lambda}|. If π(x)\pi(x) is not uniform this bound can vary widely, as in the following example.

Example 2.4.

The paper [12] analyzes the random transvections chain on GLn(𝔽q)GL_{n}(\mathbb{F}_{q}) lumped to double cosets \GLn(𝔽q)/\mathcal{B}\backslash GL_{n}(\mathbb{F}_{q})/\mathcal{B}, with \mathcal{B} the Borel subgroup of upper triangular matrices. Here double cosets are indexed by permutations in SnS_{n}, and the uniform distribution on GLn(𝔽q)GL_{n}(\mathbb{F}_{q}) induces the Mallows distribution on SnS_{n} with parameter qq:

pq(ω)=qI(ω)[n]q!,[n]q!:=(1+q)(1+q+q2)(1+q++qn1),p_{q}(\omega)=\frac{q^{I(\omega)}}{[n]_{q}!},\quad[n]_{q}!:=(1+q)(1+q+q^{2})\ldots(1+q+\ldots+q^{n-1}),

where I(ω)I(\omega) is the number of inversions in ω\omega. In the setting q>1q>1, the reversal permutation has largest probability and the identity has smallest. Transvections form a conjugacy class in GLn(𝔽q)GL_{n}(\mathbb{F}_{q}), and so character theory gives the eigenvalues and the mixing time from special starting states can be analyzed. Starting from the identity the mixing time is order nn (the same order as the random transvections walk on GLn(𝔽q)GL_{n}(\mathbb{F}_{q})), but starting from the reversal permutation the mixing time is order log(n)\log(n).

The analysis in Example 2.4 is amenable because, in the setting of Proposition 2.1, the measure $Q$ was a class function, i.e. $Q(s)=Q(t^{-1}st)$ for all $s,t\in G$. In this case, the eigenvalues of the walk on $G$ are expressed using the characters of the irreducible complex representations of $G$. Let $\widehat{G}$ be an index set for these representations, $\lambda\in\widehat{G}$, and $\chi_\lambda$ the character of $\lambda$. The restriction of $\chi_\lambda$ to $H$ is written $\chi_\lambda|_H$ and $\left\langle\chi_\lambda|_H,1\right\rangle$ is the number of times the trivial representation of $H$ appears in $\chi_\lambda|_H$. By Frobenius reciprocity, this is $\left\langle\chi_\lambda,\mathrm{Ind}_H^G(1)\right\rangle$.

The following theorem is proven in [12].

Proposition 2.5.

Let QQ be a class function on GG and let H,KH,K be subgroups of GG. The induced chain P(x,y)P(x,y) of Proposition 2.1 on H\G/KH\backslash G/K has eigenvalues

\beta_{\lambda}=\frac{1}{\chi_{\lambda}(1)}\sum_{s\in G}Q(s)\chi_{\lambda}(s),\quad\lambda\in\widehat{G} (5)

with multiplicity

mλ=χλ|H,1χλ|K,1.m_{\lambda}=\left\langle\chi_{\lambda}|_{H},1\right\rangle\cdot\left\langle\chi_{\lambda}|_{K},1\right\rangle. (6)

Further, the average chi-squared distance to stationarity satisfies

\sum_{x\in\Omega}\pi(x)\chi^{2}_{x}(\ell)=\sum_{\lambda\in\widehat{G},\,\lambda\neq 1}m_{\lambda}\beta_{\lambda}^{2\ell}. (7)

2.2 Random Transpositions on SnS_{n}

The driving Markov chain on SnS_{n} for contingency tables is the random transpositions Markov chain studied in [14]. This uses the tools of representation theory and the character theory of the symmetric group. An expository account, aimed at probabilists and statisticians is in Chapter 3D in [8]. One of the main results used below is an explicit determination of the eigenvalues of this chain.

Recall the probability measure which defines the random walk on SnS_{n}:

Q(\sigma)=\begin{cases}\frac{2}{n^{2}}&\text{if }\sigma=(i,j),\ i<j\\ \frac{1}{n}&\text{if }\sigma=\mathrm{id}\\ 0&\text{otherwise}.\end{cases}

This measure QQ is not concentrated on a single conjugacy class (the identity is not in the conjugacy class of transpositions). However, we can write

Q=1nI+n1nQ~,Q=\frac{1}{n}I+\frac{n-1}{n}\widetilde{Q},

where $I(\sigma)=\mathbf{1}(\sigma=\mathrm{id})$ and $\widetilde{Q}(\sigma)=\frac{1}{\binom{n}{2}}\mathbf{1}(\sigma\in\mathcal{C})$, where $\mathcal{C}\subset S_n$ denotes the conjugacy class of transpositions. (This $\widetilde{Q}$ is the same as the one discussed in Remark 3.) Since $\widetilde{Q}$ is concentrated on the single conjugacy class $\mathcal{C}$, the eigenvalues of its random walk are equal to the character ratios,

\widetilde{\beta}_{\rho}=\frac{1}{\chi_{\rho}(1)}\sum_{s\in G}\widetilde{Q}(s)\chi_{\rho}(s)=\frac{\chi_{\rho}(\mathcal{C})}{\chi_{\rho}(1)},

for each partition $\rho$ of $n$. The formula for this character ratio was determined by Frobenius; see [38] for a modern exposition, and [44] for a proof of this special case:

β~ρ=1n(n1)j=1k[ρj2(2j1)ρj].\widetilde{\beta}_{\rho}=\frac{1}{n(n-1)}\sum_{j=1}^{k}\left[\rho_{j}^{2}-(2j-1)\rho_{j}\right].

The eigenvalues for the walk driven by QQ are then related:

βρ=1n+n1nβ~ρ.\beta_{\rho}=\frac{1}{n}+\frac{n-1}{n}\widetilde{\beta}_{\rho}.

This information is summarized in the following lemma; see [14] for more details.

Lemma 2.6 ([14], Corollary 1 & Lemma 7).

If PP is the random transpositions chain on SnS_{n} driven by QQ, then PP has an eigenvalue βρ\beta_{\rho} for each partition ρ=(ρ1,ρ2,,ρk)\rho=(\rho_{1},\rho_{2},\dots,\rho_{k}) of nn. The eigenvalue corresponding to ρ\rho is

βρ=1n+1n2j=1k[ρj2(2j1)ρj].\beta_{\rho}=\frac{1}{n}+\frac{1}{n^{2}}\sum_{j=1}^{k}\left[\rho_{j}^{2}-(2j-1)\rho_{j}\right].

In [14], the chain is proven to mix in total variation distance, with cut-off, after $(n/2)\log(n)$ steps. This gives an initial upper bound on the mixing time of the random transpositions chain on contingency tables.
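As a quick numerical check of the eigenvalue formula in Lemma 2.6 (a Python sketch, ours, not from the paper):

def beta(rho):
    # eigenvalue of the lazy random transpositions chain for a partition rho of n
    n = sum(rho)
    return 1 / n + sum(r * r - (2 * j - 1) * r for j, r in enumerate(rho, start=1)) / n ** 2

print(beta((5,)))               # trivial partition: 1.0
print(beta((4, 1)))             # 1 - 2*1*5/25 = 0.6
print(beta((1, 1, 1, 1, 1)))    # sign representation: (2 - n)/n = -0.6 for n = 5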

2.3 Orthogonal Polynomials

This section reviews orthogonal polynomials and records the formula for the multivariate Hahn polynomials. See [19] for a thorough exposition on multivariate orthogonal polynomials; especially Section 3 for sufficient conditions for an orthonormal basis to exist for a probability distribution.

Let π\pi be a probability distribution on a finite space Ωd\Omega\subset\mathbb{N}^{d}. Let 2(π)\ell^{2}(\pi) be the space of functions f:Ωf:\Omega\to\mathbb{R} with the inner product

f,gπ=𝔼π[f(X)g(X)]=xΩf(x)g(x)π(x).\left\langle f,g\right\rangle_{\pi}=\operatorname{{\mathbb{E}}}_{\pi}[f(X)g(X)]=\sum_{x\in\Omega}f(x)g(x)\pi(x).

A set of functions {qm}0m<|Ω|\{q_{m}\}_{0\leq m<|\Omega|} are orthogonal in 2(π)\ell^{2}(\pi) if

qm,qπ=dm21(m=).\left\langle q_{m},q_{\ell}\right\rangle_{\pi}=d_{m}^{2}\textbf{1}(m=\ell).

For the following lemma, 𝐦=(m1,,mN)\mathbf{m}=(m_{1},\dots,m_{N}) denotes an index vector and |𝐦|=mi|\mathbf{m}|=\sum m_{i} is the total degree of the polynomial defined by the vector.

Lemma 2.7 ([35], Lemma 3.2).

Suppose π\pi is a distribution on the space ΩN\Omega\subset\mathbb{N}^{N}, for some NN, and {q𝐦}\{q_{\mathbf{m}}\} is an orthogonal basis of 2(π)\ell^{2}(\pi), where q𝐦q_{\mathbf{m}} is a polynomial of exact degree |𝐦||\mathbf{m}|. Let (Xt)t0(X_{t})_{t\geq 0} be a reversible Markov chain with transition matrix PP and stationary distribution π\pi. Suppose that,

𝔼[X1𝐦X0=𝐱]=β|𝐦|x𝐦+(terms in x of degree<|𝐦|).\operatorname{{\mathbb{E}}}\left[X_{1}^{\mathbf{m}}\mid X_{0}=\mathbf{x}\right]=\beta_{|\mathbf{m}|}x^{\mathbf{m}}+\left(\text{terms in x of degree}\,\,\,<|\mathbf{m}|\right).

Then PP has eigenvalue β|𝐦|\beta_{|\mathbf{m}|} with eigenbasis {q𝐦}|𝐦|=m\{q_{\mathbf{m}}\}_{|\mathbf{m}|=m}.

Write βm=β|𝐦|\beta_{m}=\beta_{|\mathbf{m}|} for |𝐦|=m|\mathbf{m}|=m. The chi-square distance between Pt(𝐱,)P^{t}(\mathbf{x},\cdot) and π\pi is

χ𝐱2(t)=m1βm2thm(𝐱,𝐱),\chi^{2}_{\mathbf{x}}(t)=\sum_{m\geq 1}\beta_{m}^{2t}\cdot h_{m}(\mathbf{x},\mathbf{x}),

where

hm(𝐱,𝐲):=|𝐦|=mq𝐦(𝐱)q𝐦(𝐲)q𝐦,q𝐦π1h_{m}(\mathbf{x},\mathbf{y}):=\sum_{|\mathbf{m}|=m}q_{\mathbf{m}}(\mathbf{x})q_{\mathbf{m}}(\mathbf{y})\left\langle q_{\mathbf{m}},q_{\mathbf{m}}\right\rangle_{\pi}^{-1}

is called the kernel polynomial of degree mm.

Multivariate Hahn Polynomials

For contingency tables with only two rows, the Fisher-Yates distribution is simply the multivariate hypergeometric distribution. That is, if $\lambda=(n-k,k)$, $k\leq\lfloor n/2\rfloor$, and $\mu=(\mu_1,\dots,\mu_J)$, then a contingency table in $\mathcal{T}_{\lambda,\mu}$ is determined by its second row, which can be represented by a vector in the space

𝐗k,μJ={𝐱=(x1,,xJ)0J:|𝐱|=k,xjμj},\mathbf{X}_{k,\mu}^{J}=\{\mathbf{x}=(x_{1},\dots,x_{J})\in\mathbb{N}_{0}^{J}:|\mathbf{x}|=k,x_{j}\leq\mu_{j}\},

where |𝐱|=ixi|\mathbf{x}|=\sum_{i}x_{i}. The multivariate hypergeometric distribution is

Hk,μ(𝐱)=j=1J(μjxj)(nk),H_{k,\mu}(\mathbf{x})=\frac{\prod_{j=1}^{J}\binom{\mu_{j}}{x_{j}}}{\binom{n}{k}},

where $n=\sum_j\mu_j$. For example, this distribution arises from sampling without replacement: an urn contains $n$ balls of $J$ different colors, $\mu_j$ of color $j$. If $\mathbf{X}$ is the vector which records the number of each color drawn in a sample (without replacement) of size $k$, then $\mathbf{X}$ has the multivariate hypergeometric distribution. The orthogonal polynomials for this distribution are called the multivariate Hahn polynomials. An overview of univariate Hahn polynomials is in [28]. Following the notation from [35] (Section 2.2.1) and [26], define

a(k)=a(a+1)(a+k1),\displaystyle a_{(k)}=a(a+1)\ldots(a+k-1),
a[k]=a(a1)(ak+1),\displaystyle a_{[k]}=a(a-1)\ldots(a-k+1),
a(0)=a[0]=1.\displaystyle a_{(0)}=a_{[0]}=1.

For a vector $\mathbf{x}=(x_1,\dots,x_d)$,

|𝐱|=i=1dxi,|𝐱i|=j=1ixj,|𝐱i|=j=idxj.\displaystyle|\mathbf{x}|=\sum_{i=1}^{d}x_{i},\quad|\mathbf{x}_{i}|=\sum_{j=1}^{i}x_{j},\quad|\mathbf{x}^{i}|=\sum_{j=i}^{d}x_{j}.

The multivariate Hahn polynomials, defined in Section 2.2.1 of [35], are

Q𝐦(𝐱;k,μ)=(1)|𝐦|(k)[|𝐦|]j=1J1(k+|𝐱j1|+|𝐦j+1|)(mj)Qmj(xj;k|𝐱j1||𝐦j+1|,μj,|μj+1|+2|𝐦j+1|),Q_{\mathbf{m}}(\mathbf{x};k,\mu)=\frac{(-1)^{|\mathbf{m}|}}{(k)_{[|\mathbf{m}|]}}\prod_{j=1}^{J-1}\left(-k+|\mathbf{x}_{j-1}|+|\mathbf{m}^{j+1}|\right)_{(m_{j})}\cdot Q_{m_{j}}\left(x_{j};k-|\mathbf{x}_{j-1}|-|\mathbf{m}^{j+1}|,-\mu_{j},|\mu^{j+1}|+2|\mathbf{m}^{j+1}|\right), (8)

where

Q_{m}(x;k,\alpha,\beta)={}_{3}F_{2}\!\left(\begin{matrix}-m,\ m+\alpha+\beta-1,\ -x\\ \alpha,\ -k\end{matrix}\ ;\ 1\right)

is the classical univariate Hahn polynomial.
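A direct way to evaluate such a terminating ${}_3F_2(\,\cdot\,;1)$ series is to sum it term by term; the Python sketch below (ours) follows the parameter convention displayed above, which should be checked against [28] and [35] before serious use.

def rising(a, s):
    # rising factorial a_(s)
    out = 1.0
    for t in range(s):
        out *= a + t
    return out

def hahn_Q(m, x, k, alpha, beta):
    # terminating 3F2(-m, m+alpha+beta-1, -x; alpha, -k; 1); assumes no denominator
    # factor vanishes before the series terminates (e.g. m <= k)
    total = 0.0
    for s in range(m + 1):
        total += (rising(-m, s) * rising(m + alpha + beta - 1, s) * rising(-x, s)
                  / (rising(alpha, s) * rising(-k, s) * rising(1, s)))
    return total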

For calculating the expression for chi-squared distance, if the orthogonal polynomials are the eigenfunctions, then it is the kernel polynomials which need to be solved for. These were constructed by Griffiths [23], [24]. It is an open problem to find a useful equation for the kernel polynomials evaluated at more general states.

Proposition 2.8 (Proposition 2.6 from [35]).

Suppose μjk\mu_{j}\geq k for all jj. Let 𝐞j\mathbf{e}_{j} be the vector with 11 in the jjth index and 0 elsewhere. Then,

hm(k𝐞j,k𝐞j)=(km)(n2m+1)n[m1](nμj)[m](nk)[m](μj)[m].h_{m}(k\mathbf{e}_{j},k\mathbf{e}_{j})=\binom{k}{m}\frac{(n-2m+1)n_{[m-1]}(n-\mu_{j})_{[m]}}{(n-k)_{[m]}(\mu_{j})_{[m]}}. (9)

2.4 Fisher-Yates Distribution

The measure induced on contingency tables by the uniform distribution on SnS_{n} is

\pi_{\lambda,\mu}(T)=\frac{\prod_{i}\lambda_{i}!\,\prod_{j}\mu_{j}!}{n!\,\prod_{i,j}T_{ij}!}. (10)

This is the Fisher-Yates distribution on contingency tables, a mainstay of applied statistical work in chi-squared tests of independence. In applications it is natural to test the assumption that row and column features are independent and that the observed table is a sample with cell probabilities pij=θiηjp_{ij}=\theta_{i}\eta_{j}. For this model, the row and column sums are sufficient statistics; conditioning on the row and column sums of the table gives the Fisher-Yates distribution.

A useful way of describing a sample from $\pi_{\lambda,\mu}$ is by sampling without replacement: suppose that an urn contains $n$ total balls of $J$ different colors, $\mu_j$ of color $j$. To empty the urn, make $I$ successive sets of draws. First draw $\lambda_1$ balls, next $\lambda_2$, and so on until there are $\lambda_I=n-\sum_{i=1}^{I-1}\lambda_i$ balls left. Create a contingency table by setting $T_{ij}$ to be the number of balls of color $j$ in the $i$th set of draws.
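This urn description translates directly into a sampler; a Python sketch (ours, with the hypothetical name sample_fisher_yates):

import random

def sample_fisher_yates(lam, mu):
    balls = [j for j, m in enumerate(mu) for _ in range(m)]   # one ball per unit of mu_j
    random.shuffle(balls)                                     # a uniform random ordering
    T, pos = [], 0
    for size in lam:
        draw = balls[pos:pos + size]
        T.append([draw.count(j) for j in range(len(mu))])
        pos += size
    return T

print(sample_fisher_yates((3, 2), (2, 2, 1)))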

This description is useful for calculating moments of entries in a table, by utilizing the fact that the draws are exchangeable: let $X_s^{(i,j)}$, $1\leq s\leq\lambda_i$, be $1$ if in the $i$th round of drawings the $s$th draw was color $j$, and $0$ otherwise. That is,

Tij=s=1λiXs(i,j).T_{ij}=\sum_{s=1}^{\lambda_{i}}X_{s}^{(i,j)}.

The expectation of one entry in the table is

𝔼πλ,μ[Tij]=s=1λi(Xs(i,j)=1)=λiμjn,\displaystyle\operatorname{{\mathbb{E}}}_{\pi_{\lambda,\mu}}[T_{ij}]=\sum_{s=1}^{\lambda_{i}}\operatorname{\mathbb{P}}(X_{s}^{(i,j)}=1)=\frac{\lambda_{i}\mu_{j}}{n},

using that (Xs(i,j)=1)=μj/n\operatorname{\mathbb{P}}(X_{s}^{(i,j)}=1)=\mu_{j}/n for any ss since the variables are exchangeable in ss. These calculations for the moments of Fisher-Yates tables are needed in the following section to normalize eigenfunctions.

The usual test of the independence model uses the chi-squared statistic:

χ2(T)=i,j(Tijλiμj/n)2λiμj/n.\chi^{2}(T)=\sum_{i,j}\frac{(T_{ij}-\lambda_{i}\mu_{j}/n)^{2}}{\lambda_{i}\mu_{j}/n}.

Under general conditions, see e.g. [33], the chi-squared statistic has a limiting chi-squared distribution with $(I-1)\cdot(J-1)$ degrees of freedom. This result says that, if the null hypothesis is true, most tables will be close to a table $T^{*}$ with entries $T^{*}_{ij}=\lambda_i\mu_j/n$. Another feature to investigate for the independence model is the number of zero entries in a contingency table. Since most tables will be close to the table $T^{*}$, which has no zeros, zeros are a pointer to the breakdown of the independence model. Section 5.2 of [16] proves that the number of zeros is asymptotically Poisson, under natural hypotheses on the row and column sums. In [42], limit theorems for fixed points, descents, and inversions of permutations chosen uniformly from fixed double cosets are studied.

Majorization Order

Let $T$ and $T^{\prime}$ be tables with the same row and column sums. Say that $T\prec T^{\prime}$ ('$T^{\prime}$ majorizes $T$') if the largest element of $T^{\prime}$ is at least as large as the largest element of $T$, the sum of the two largest elements of $T^{\prime}$ is at least the sum of the two largest elements of $T$, and so on. Of course the sum of all elements of $T^{\prime}$ equals the sum of all elements of $T$.

Example 2.9.

For tables with n=8,λ1=λ2=μ1=μ2=4n=8,\lambda_{1}=\lambda_{2}=\mu_{1}=\mu_{2}=4, there is the following ordering

\begin{pmatrix}2&2\\2&2\end{pmatrix}\prec\begin{pmatrix}3&1\\1&3\end{pmatrix}\prec\begin{pmatrix}4&0\\0&4\end{pmatrix}.
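A sketch (ours) of checking the majorization order by comparing partial sums of the sorted entries, verifying the chain of Example 2.9:

def majorizes(T_big, T_small):
    a = sorted((x for row in T_big for x in row), reverse=True)
    b = sorted((x for row in T_small for x in row), reverse=True)
    pa = [sum(a[:i + 1]) for i in range(len(a))]
    pb = [sum(b[:i + 1]) for i in range(len(b))]
    return all(u >= v for u, v in zip(pa, pb))

A, B, C = [[2, 2], [2, 2]], [[3, 1], [1, 3]], [[4, 0], [0, 4]]
print(majorizes(B, A), majorizes(C, B))   # True True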

Majorization is a standard partial order on vectors [40] and Harry Joe [32] has shown it is useful for contingency tables.

Proposition 2.10.

Let TT and TT^{\prime} be tables with the same row and column sums given by λ,μ\lambda,\mu and πλ,μ\pi_{\lambda,\mu} the Fisher-Yates distribution. If TTT\prec T^{\prime}, then

πλ,μ(T)>πλ,μ(T).\pi_{\lambda,\mu}(T)>\pi_{\lambda,\mu}(T^{\prime}).
Proof.

From the definition, we have $\log(\pi_{\lambda,\mu}(T))=C-\sum_{i,j}\log(T_{ij}!)$ for a constant $C$ depending only on $\lambda,\mu$. This form makes it clear the right hand side is a symmetric function of the $IJ$ numbers $\{T_{ij}\}$. The log-convexity of the Gamma function shows that each term $-\log(T_{ij}!)$ is concave, so the sum is concave. A symmetric concave function is Schur concave: that is, order-reversing for the majorization order [40]. ∎

Remark 5.

Joe [32] shows that, among the real-valued tables with given row and column sums, the independence table $T^{*}$ is the unique smallest table in majorization order. He further shows that if an integer-valued table $T$ is, entry-wise, within $1$ of the real independence table, then $T$ is the unique smallest table with integer entries. In this case, the corresponding double coset is the largest, i.e. $\pi_{\lambda,\mu}(T)$ is maximal.

Example 2.11.

Fix a positive integer aa and consider an I×JI\times J table TT with all entries equal to aa. This has constant row sums JaJ\cdot a and column sums IaI\cdot a. It is the unique smallest table with these row and column sums, and so corresponds to the largest double coset. For a=2,I=2,J=3a=2,I=2,J=3, this table is

T=(222222).T=\begin{pmatrix}2&2&2\cr 2&2&2\end{pmatrix}.

Contingency tables with fixed row and column sums form a graph with edges between tables that can be obtained by one move of the following: pick two rows i,ii,i^{\prime} and two columns j,jj,j^{\prime}. Add +1+1 to the (i,j)(i,j) entry, 1-1 to the (i,j)(i^{\prime},j) entry, +1+1 to the (i,j)(i^{\prime},j^{\prime}) entry, and 1-1 to the (i,j)(i,j^{\prime}) entry (one move of the Markov chain Definition 1.4). This graph is connected and moves up or down in the majorization order as the 2×22\times 2 table with rows i,ii,i^{\prime} and columns j,jj,j^{\prime} moves up or down. See Example 2.9 above.

3 Eigenvalues and Eigenfunctions

From Proposition 2.5, the multiplicity of the eigenvalue βρ\beta_{\rho} in the contingency table chain is

mρ=mρλmρμ\displaystyle m_{\rho}=m_{\rho}^{\lambda}\cdot m_{\rho}^{\mu}
mρλ:=χρ,IndSλSn(1)\displaystyle m_{\rho}^{\lambda}:=\left\langle\chi_{\rho},\mathrm{Ind}_{S_{\lambda}}^{S_{n}}(1)\right\rangle

These multiplicities can be determined by Young’s Rule [9], [31]: Given λ\lambda a partition of nn, take λ1\lambda_{1} of the symbol ‘11’, λ2\lambda_{2} of the symbol ‘2’, and so on. The value χρ,IndSλSn(1)\left\langle\chi_{\rho},\mathrm{Ind}_{S_{\lambda}}^{S_{n}}(1)\right\rangle is equal to the number of ways of arranging these symbols into a tableau of shape ρ\rho with strictly increasing columns and weakly increasing rows. In other words, mρλm_{\rho}^{\lambda} is the number of semistandard Young tableau of shape ρ\rho and weight λ\lambda. These are also called Kostka numbers, and have a long enumerative history in connection with symmetric functions (see Chapter 7 of [48]).

Example 3.1.

Consider n=5n=5 and λ=(3,1,1),μ=(2,2,1)\lambda=(3,1,1),\mu=(2,2,1). The possible eigenvalues and their multiplicities are in Table 1.

ρ         β_ρ     m_ρ^λ   m_ρ^μ   m_ρ
(5)       1.0     1       1       1
(4,1)     0.6     2       2       4
(3,2)     0.36    1       2       2
(3,1,1)   0.20    1       1       1
Table 1: Eigenvalues and multiplicities for λ=(3,1,1), μ=(2,2,1).

For example, for μ=(2,2,1)\mu=(2,2,1) the symbols are 1122311223 and there are 22 ways of arranging them into a tableau of shape (3,2)(3,2) with increasing rows and strictly increasing columns:

\begin{matrix}1&1&2\\2&3&\end{matrix},\qquad\begin{matrix}1&1&3\\2&2&\end{matrix}.

Thus for $\rho=(3,2)$, the contribution to the multiplicity is $m_\rho^\mu=2$, as shown in Table 1. While there is not an explicit formula for the result of Young's rule (indeed, this would provide a formula for the number of contingency tables with fixed row and column sums), we can say more in special cases. See the exercises in Chapter 6.4 of [38] for more properties.
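For small cases, Young's rule can be checked by brute force; the Python sketch below (ours) counts semistandard tableaux of shape ρ and weight λ directly, and reproduces Table 1.

def kostka(rho, weight):
    # count fillings of the diagram of rho, in reading order, using weight[s] copies
    # of symbol s, with rows weakly increasing and columns strictly increasing
    cells = [(r, c) for r, length in enumerate(rho) for c in range(length)]

    def fill(idx, tab, remaining):
        if idx == len(cells):
            return 1
        r, c = cells[idx]
        count = 0
        for sym in range(len(weight)):
            if remaining[sym] == 0:
                continue
            if c > 0 and tab[(r, c - 1)] > sym:      # rows weakly increase
                continue
            if r > 0 and tab[(r - 1, c)] >= sym:     # columns strictly increase
                continue
            tab[(r, c)] = sym
            remaining[sym] -= 1
            count += fill(idx + 1, tab, remaining)
            remaining[sym] += 1
            del tab[(r, c)]
        return count

    return fill(0, {}, list(weight))

lam, mu = (3, 1, 1), (2, 2, 1)
for rho in [(5,), (4, 1), (3, 2), (3, 1, 1)]:
    print(rho, kostka(rho, lam), kostka(rho, mu))    # matches Table 1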

Corollary 3.2.

Let ρ,λ,μ\rho,\lambda,\mu be partitions of nn and |ρ||\rho| denote the number of parts of the partition, with |λ|=I,|μ|=J|\lambda|=I,|\mu|=J. Let βρ\beta_{\rho} denote the eigenvalue in the random transpositions chain corresponding to ρ\rho and mρm_{\rho} the multiplicity of βρ\beta_{\rho} in the chain lumped to 𝒯λ,μ\mathcal{T}_{\lambda,\mu}.

  1. (a)

    The multiplicity of the second-largest eigenvalue β(n1,1)=12/n\beta_{(n-1,1)}=1-2/n is m(n1,1)=(I1)(J1)m_{(n-1,1)}=(I-1)\cdot(J-1).

  2. (b)

    If mρ>0m_{\rho}>0, then |ρ|min(I,J)|\rho|\leq\min(I,J) and ρ1max(λ1,μ1)\rho_{1}\geq\max(\lambda_{1},\mu_{1}).

  3. (c)

    If ρ=(nk,k)\rho=(n-k,k) for 1kn/21\leq k\leq\lfloor n/2\rfloor and λ=(nj,j)\lambda=(n-j,j) then

    mρλ={0ifk>j1else.m_{\rho}^{\lambda}=\begin{cases}0&\text{if}\,\,\,k>j\\ 1&\text{else}\end{cases}.
  4. (d)

    If ρ=(nk,k)\rho=(n-k,k), μ=(μ1,,μJ)\mu=(\mu_{1},\dots,\mu_{J}) and μ1k\mu_{1}\geq k, then

m_{\rho}^{\mu}=\#\left\{(x_{1},\dots,x_{J-1})\in\mathbb{N}^{J-1}:\sum_{j=1}^{J-1}x_{j}=k,\ x_{j}\leq\mu_{j+1}\right\}=|\mathcal{T}_{(k,n-\mu_{1}-k),(\mu_{2},\mu_{3},\dots,\mu_{J})}|.
Remark 6.

Corollary 3.2(b) shows that a table with only two rows or columns will only have eigenvalues $\beta_\rho$ with $\rho=(n-m,m)$. The eigenvalue defined by $\rho=(1,1,\dots,1)$ is $\beta_\rho=(2-n)/n$ (it equals $-1$ for the non-lazy chain driven by $\widetilde{Q}$). This eigenvalue has multiplicity $0$ unless $\lambda=\mu=(1,1,\dots,1)$, for which $\mathcal{T}_{\lambda,\mu}\simeq S_n$.

Proof.

(a): An arrangement of symbols determined by λ\lambda into a tableau of shape (n1,1)(n-1,1) is determined by the choice of the symbol to be in the second row. To fit the constraint the columns in the tableau are strictly increasing, any symbol except ‘1’ could be placed in the second row. Thus, there are |λ|1|\lambda|-1 possibilities. For example, if λ=(3,2,1)\lambda=(3,2,1), there are 2 possibilities:

\begin{matrix}1&1&1&2&2\\3&&&&\end{matrix},\qquad\begin{matrix}1&1&1&2&3\\2&&&&\end{matrix}.

(b): Consider the first column of an arrangement of the symbols determined by λ\lambda into a tableau of shape ρ\rho. For the column to be strictly increasing, there must be at least |ρ||\rho| symbols from λ\lambda, to give the column 1,2,,|ρ|1,2,\dots,|\rho|. Thus, if |λ|<|ρ||\lambda|<|\rho| then mρλ=0m_{\rho}^{\lambda}=0. All of the ‘1’ symbols must be contained in the first row of the tableau, which gives the constraint λ1ρ1\lambda_{1}\leq\rho_{1}.

(c): If $|\lambda|=2$ and $|\rho|=2$ then there is at most one way of arranging the symbols of $\lambda$ into a tableau of shape $\rho$. For example, $\lambda=(3,2)$, $\rho=(4,1)$:

\begin{matrix}1&1&1&2\\2&&&\end{matrix}.

To satisfy the constraint that columns are strictly increasing it is necessary that the second row only contains symbols ‘2’. If λ1>ρ1\lambda_{1}>\rho_{1}, this is not possible. Note that the assumption λ2λ1\lambda_{2}\leq\lambda_{1} ensures that there would never need to be a column with only the symbol ‘2’.

(d): Now suppose $\mu=(\mu_1,\dots,\mu_J)$ with $J>2$. The assumption $\mu_1\geq k$ ensures that any assignment of the second row will obey the strictly increasing column constraint. That is, any selection of $k$ symbols from the $n-\mu_1$ symbols which are greater than $1$ gives a valid assignment for the second row of the tableau. There are $J-1$ possible symbols, with $\mu_{j+1}$ copies of symbol $j+1$. If $x_j$ denotes the number of copies of symbol $j+1$ in row $2$, then the second row is determined by $(x_1,\dots,x_{J-1})$ with $x_j\leq\mu_{j+1}$ and $\sum_{j=1}^{J-1}x_j=k$. The number of possibilities is exactly the number of $2\times(J-1)$ contingency tables with row sums $k$, $n-\mu_1-k$ and column sums $\mu_2,\dots,\mu_J$. ∎

The multiplicities also behave well with respect to the majorization order on partitions: if $\lambda,\lambda^{\prime}$ are partitions of $n$, then $\lambda\prec\lambda^{\prime}$ if $\lambda_1+\lambda_2+\ldots+\lambda_k\leq\lambda_1^{\prime}+\lambda_2^{\prime}+\ldots+\lambda_k^{\prime}$ for all $1\leq k\leq|\lambda|$. In other words, $\lambda^{\prime}$ can be obtained from $\lambda$ by successively 'moving up boxes' ([38] Chapter 1). For example, $\lambda=(3,2,1)\prec\lambda^{\prime}=(4,2)$.
Remark 7.

The eigenvalues βρ\beta_{\rho} are monotonic with respect to this ordering: If ρρ\rho\prec\rho^{\prime}, then βρβρ\beta_{\rho}\leq\beta_{\rho^{\prime}}; see Chapter 3 of [8] for more properties. This tells us that ρ=(n1,1)\rho=(n-1,1) gives the largest eigenvalue not equal to 11.

The following lemma is well-known in the literature, e.g. [22].

Lemma 3.3.

Let λ,λ,ρ\lambda,\lambda^{\prime},\rho be partitions of nn. Then,

  1. (a)

    mρλ0m_{\rho}^{\lambda}\neq 0 if and only if λρ\lambda\prec\rho.

  2. (b)

    If λλρ\lambda\prec\lambda^{\prime}\prec\rho, then mρλmρλm_{\rho}^{\lambda}\geq m_{\rho}^{\lambda^{\prime}}.

Remark 8.

No such monotonicity exists in ρ\rho. For example, with n=4n=4, λ=(1,1,1,1)\lambda=(1,1,1,1), we have (1,1,1,1)(2,1,1)(2,2)(1,1,1,1)\prec(2,1,1)\prec(2,2) yet m(1,1,1,1)λ=1m_{(1,1,1,1)}^{\lambda}=1, m(2,1,1)λ=3m_{(2,1,1)}^{\lambda}=3, m(2,2)λ=2m_{(2,2)}^{\lambda}=2.

3.1 Average Case Mixing Time

For 2×22\times 2 contingency tables, Corollary 3.2 suffices to determine the multiplicities of all eigenvalues.

Lemma 3.4.

Let λ=(nk,k),μ=(n,)\lambda=(n-k,k),\mu=(n-\ell,\ell) with kn/2k\leq\ell\leq\lfloor n/2\rfloor. If t=(n/4)(log(k)+c)t=(n/4)(\log(k)+c) for c>0c>0, then

𝐱𝒯λ,μπλ,μ(𝐱)P𝐱tπλ,μTV2ec.\sum_{\mathbf{x}\in\mathcal{T}_{\lambda,\mu}}\pi_{\lambda,\mu}(\mathbf{x})\|P^{t}_{\mathbf{x}}-\pi_{\lambda,\mu}\|_{TV}^{2}\leq e^{-c}.

If t=cn/4t=cn/4, then

𝐱𝒯λ,μπλ,μ(𝐱)χ𝐱2(t)1c.\sum_{\mathbf{x}\in\mathcal{T}_{\lambda,\mu}}\pi_{\lambda,\mu}(\mathbf{x})\chi^{2}_{\mathbf{x}}(t)\geq 1-c.
Proof.

Suppose ρ=(nm,m)\rho=(n-m,m) with mkm\leq k\leq\ell. By Corollary 3.2(c) mρλ=mρμ=1m_{\rho}^{\lambda}=m_{\rho}^{\mu}=1, so the multiplicity of βρ=12m(n+1m)/n2\beta_{\rho}=1-2m(n+1-m)/n^{2} in the transpositions chain on 𝒯λ,μ\mathcal{T}_{\lambda,\mu} is 11. For any other partition ρ\rho, the multiplicity is 0. Thus,

𝐱𝒯λ,μπ(𝐱)χ𝐱2(t)\displaystyle\sum_{\mathbf{x}\in\mathcal{T}_{\lambda,\mu}}\pi(\mathbf{x})\chi^{2}_{\mathbf{x}}(t) =ρ(n)βρ2tmρ2\displaystyle=\sum_{\rho\neq(n)}\beta_{\rho}^{2t}m_{\rho}^{2}
=m=1k(12m(n+1m)n2)2t\displaystyle=\sum_{m=1}^{k}\left(1-\frac{2m(n+1-m)}{n^{2}}\right)^{2t}
m=1kexp(2t2m(n+1m)n2)exp(4tn+log(k)),\displaystyle\leq\sum_{m=1}^{k}\exp\left(-2t\frac{2m(n+1-m)}{n^{2}}\right)\leq\exp\left(-\frac{4t}{n}+\log(k)\right),

which is ec\leq e^{-c} for t=(n/4)(log(k)+c)t=(n/4)(\log(k)+c).

The lower bound comes from using the term in the sum with ρ=(n1,1)\rho=(n-1,1) (which gives the largest eigenvalue, with multiplicity 11):

𝐱𝒯λ,μπ(𝐱)χ𝐱2(t)\displaystyle\sum_{\mathbf{x}\in\mathcal{T}_{\lambda,\mu}}\pi(\mathbf{x})\chi^{2}_{\mathbf{x}}(t) β(n1,1)2tm(n1,1)2\displaystyle\geq\beta_{(n-1,1)}^{2t}m_{(n-1,1)}^{2}
=(12n)2t\displaystyle=\left(1-\frac{2}{n}\right)^{2t}
14tn=1c,\displaystyle\geq 1-\frac{4t}{n}=1-c,

if t=cn/4t=cn/4. The last inequality is Bernoulli’s inequality. ∎

Remark 9.

A table in 𝒯(nk,k),(n,)\mathcal{T}_{(n-k,k),(n-\ell,\ell)} is determined by one entry, say the (2,2)(2,2) entry XX, so the table is

(nk+XXkXX).\begin{pmatrix}n-k-\ell+X&\ell-X\cr k-X&X\end{pmatrix}.

Then X{0,1,,k}X\in\{0,1,\dots,k\} has the univariate hypergeometric distribution with parameters n,k,n,k,\ell. The transitions for XX are

P(a,a+1)=2(ka)(a)n2,a<k\displaystyle P(a,a+1)=\frac{2(k-a)(\ell-a)}{n^{2}},\quad a<k
P(a,a1)=2a(nk+a)n2,a>0.\displaystyle P(a,a-1)=\frac{2a(n-k-\ell+a)}{n^{2}},\quad a>0.
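
To make Remark 9 concrete, the transitions above can be simulated directly. The following sketch (with arbitrarily chosen values of n, k, and the second column sum, not taken from the paper) runs the birth-death chain for the (2,2) entry and compares its occupation frequencies with the hypergeometric stationary distribution.

```python
# Minimal sketch (parameters chosen for illustration): simulate the (2,2)-entry
# birth-death chain from Remark 9 and compare its long-run occupation frequencies
# with the hypergeometric stationary distribution.
import random
from math import comb

n, k, l = 30, 8, 12          # margins lambda = (n-k, k), mu = (n-l, l)
steps = 200_000

x = 0                        # starting value of the (2,2) entry
counts = [0] * (k + 1)
for _ in range(steps):
    up = 2 * (k - x) * (l - x) / n**2        # P(x, x+1)
    down = 2 * x * (n - k - l + x) / n**2    # P(x, x-1)
    u = random.random()
    if u < up:
        x += 1
    elif u < up + down:
        x -= 1
    counts[x] += 1

# Hypergeometric law: pi(x) = C(k, x) C(n-k, l-x) / C(n, l)
pi = [comb(k, x) * comb(n - k, l - x) / comb(n, l) for x in range(k + 1)]
for x in range(k + 1):
    print(f"x={x:2d}  empirical={counts[x] / steps:.4f}  stationary={pi[x]:.4f}")
```
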
Remark 10.

In contrast to the previous remark, the Bernoulli-Laplace urn has transitions

P(a,a+1)=2(ka)(a)k(nk),a<k\displaystyle P(a,a+1)=\frac{2(k-a)(\ell-a)}{k(n-k)},\quad a<k
P(a,a1)=2a(nk+a)k(nk),a>0.\displaystyle P(a,a-1)=\frac{2a(n-k-\ell+a)}{k(n-k)},\quad a>0.

In [14], for the special case k==n/2k=\ell=n/2 (which corresponds to the state space 𝒯(n/2,n/2),(n/2,n/2)\mathcal{T}_{(n/2,n/2),(n/2,n/2)}), the Bernoulli-Laplace chain is proven to mix in (n/8)log(n)(n/8)\log(n) steps. The paper [21] studies a more general model in which kk balls are swapped between the two urns; the chain is shown to have mixing time (n/4k)log(n)(n/4k)\log(n). These are special cases of the random walk on a distance-regular graph studied in [3].

Remark 11.

Note that the random transpositions Markov chain on SnS_{n} is transitive, and so the behavior of the chain, especially the mixing time, does not depend on the starting state. The chain lumped to 𝒯λ,μ\mathcal{T}_{\lambda,\mu} is not transitive, and the mixing time can depend heavily on the starting state. It is expected that the mixing time is fastest starting from states with high probability (large double cosets) and slowest from the states with low probability (small double cosets). From Remark 5, the highest probability table is the one closest to the ‘independence table’; the tables with low probability are the sparse tables with many zeros.

Remark 12.

For any λ,μ\lambda,\mu, suppose T𝒯λ,μT\in\mathcal{T}_{\lambda,\mu}. From the table TT we can create a 2×22\times 2 table by ‘collapsing’ columns 2:J2:J and rows 2:I2:I. That is, set

T~(1,1)=T(1,1)\displaystyle\widetilde{T}(1,1)=T(1,1)
T~(1,2)=j=2JT(1,j)\displaystyle\widetilde{T}(1,2)=\sum_{j=2}^{J}T(1,j)
T~(2,1)=i=2IT(i,1)\displaystyle\widetilde{T}(2,1)=\sum_{i=2}^{I}T(i,1)
\displaystyle\widetilde{T}(2,2)=(n-\lambda_{1})-\widetilde{T}(2,1)=(n-\mu_{1})-\widetilde{T}(1,2).

Then T~𝒯(λ1,nλ1),(μ1,nμ1)\widetilde{T}\in\mathcal{T}_{(\lambda_{1},n-\lambda_{1}),(\mu_{1},n-\mu_{1})}. Furthermore, T(1,1)T(1,1) has the univariate hypergeometric distribution with parameters n,μ1,λ1n,\mu_{1},\lambda_{1}. The random transpositions chain on 𝒯λ,μ\mathcal{T}_{\lambda,\mu} lumps to the univariate Markov chain on 𝒯(λ1,nλ1),(μ1,nμ1)\mathcal{T}_{(\lambda_{1},n-\lambda_{1}),(\mu_{1},n-\mu_{1})}. Thus the mixing time for the (1,1)(1,1)-entry Markov chain is a lower bound for the mixing time of the full process.
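
A small sketch of the collapsing map in Remark 12 may be helpful (the table below is an arbitrary example, not one of the datasets discussed later):

```python
# Sketch of the collapsing map of Remark 12: an I x J table with margins
# (lambda, mu) is reduced to a 2 x 2 table with margins (lambda_1, n - lambda_1)
# and (mu_1, n - mu_1).  The example table is arbitrary.
def collapse(T):
    n = sum(sum(row) for row in T)
    t11 = T[0][0]
    t12 = sum(T[0][1:])                   # rest of the first row
    t21 = sum(row[0] for row in T[1:])    # rest of the first column
    t22 = n - t11 - t12 - t21
    return [[t11, t12], [t21, t22]]

T = [[4, 2, 1],
     [1, 3, 2],
     [0, 1, 4]]       # row sums (7, 6, 5), column sums (5, 6, 7), n = 18
print(collapse(T))    # [[4, 3], [1, 10]]: margins (7, 11) and (5, 13)
```
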

3.2 Orthogonal Polynomials

This section proves part (3) of Theorem 1.6, that the eigenfunctions of the random transpositions chain on contingency tables are orthogonal polynomials. While it is difficult to find an explicit formula for all of these polynomials, analysis of the chain provides a way to calculate the linear and quadratic polynomials.

Theorem 3.5.

Let λ=(λ1,,λI),μ=(μ1,,μJ)\lambda=(\lambda_{1},\dots,\lambda_{I}),\mu=(\mu_{1},\dots,\mu_{J}) be partitions of nn and {Tt}t0\{T_{t}\}_{t\geq 0} the random transpositions Markov chain on the space of contingency tables 𝒯λ,μ\mathcal{T}_{\lambda,\mu} with transition matrix PP. For any mm and 𝐱𝒯λ,μ\mathbf{x}\in\mathcal{T}_{\lambda,\mu},

𝔼[Tt+1(i,j)mTt=𝐱]\displaystyle\operatorname{{\mathbb{E}}}[T_{t+1}(i,j)^{m}\mid T_{t}=\mathbf{x}] =xijm(12m(n+1m)n2)+(terms in x of degree<m).\displaystyle=x_{ij}^{m}\left(1-\frac{2m(n+1-m)}{n^{2}}\right)+(\text{terms in x of degree}\,\,\,<m).

The eigenfunctions for PP are polynomials.

Proof.

Let {Tt}t0\{T_{t}\}_{t\geq 0} represent the Markov chain on 𝒯λ,μ\mathcal{T}_{\lambda,\mu}, with Tt(i,j)T_{t}(i,j) denoting the value in the (i,j)(i,j) cell. Let 𝐱𝒯λ,μ\mathbf{x}\in\mathcal{T}_{\lambda,\mu}. Marginally, each entry of the table is a birth-death process: For any 1iI,1jJ1\leq i\leq I,1\leq j\leq J,

(Tt+1(i,j)=xij+1Tt=𝐱)=jki2xixkjn2=2(λixij)(μjxij)n2\displaystyle\operatorname{\mathbb{P}}\left(T_{t+1}(i,j)=x_{ij}+1\mid T_{t}=\mathbf{x}\right)=\sum_{\ell\neq j}\sum_{k\neq i}\frac{2\cdot x_{i\ell}x_{kj}}{n^{2}}=\frac{2(\lambda_{i}-x_{ij})(\mu_{j}-x_{ij})}{n^{2}} (11)
(Tt+1(i,j)=xij1Tt=𝐱)=jki2xijxkln2=2xij(nλiμj+xij)n2.\displaystyle\operatorname{\mathbb{P}}\left(T_{t+1}(i,j)=x_{ij}-1\mid T_{t}=\mathbf{x}\right)=\sum_{\ell\neq j}\sum_{k\neq i}\frac{2x_{ij}x_{kl}}{n^{2}}=\frac{2x_{ij}(n-\lambda_{i}-\mu_{j}+x_{ij})}{n^{2}}. (12)

These transitions allow for the calculation of 𝔼[Tt+1(i,j)mTt=𝐱]\operatorname{{\mathbb{E}}}[T_{t+1}(i,j)^{m}\mid T_{t}=\mathbf{x}], for any integer power mm. That is,

𝔼[Tt+1(i,j)mTt=𝐱]\displaystyle\operatorname{{\mathbb{E}}}[T_{t+1}(i,j)^{m}\mid T_{t}=\mathbf{x}] =(xij+1)m2(λixij)(μjxij)n2+(xij1)m2xij(nλiμj+xij)n2\displaystyle=(x_{ij}+1)^{m}\cdot\frac{2(\lambda_{i}-x_{ij})(\mu_{j}-x_{ij})}{n^{2}}+(x_{ij}-1)^{m}\cdot\frac{2x_{ij}(n-\lambda_{i}-\mu_{j}+x_{ij})}{n^{2}}
+xijm(12xij(nλiμj+xij)n22(λixij)(μjxij)n2)\displaystyle\,\,\,\,\,\,\,\,\,\,+x_{ij}^{m}\left(1-\frac{2x_{ij}(n-\lambda_{i}-\mu_{j}+x_{ij})}{n^{2}}-\frac{2(\lambda_{i}-x_{ij})(\mu_{j}-x_{ij})}{n^{2}}\right)
=xijm+(mxijm1+m(m1)2xijm2+=3m(m)xijm)2(λixij)(μjxij)n2\displaystyle=x_{ij}^{m}+\left(mx_{ij}^{m-1}+\frac{m(m-1)}{2}x_{ij}^{m-2}+\sum_{\ell=3}^{m}\binom{m}{\ell}x_{ij}^{m-\ell}\right)\cdot\frac{2(\lambda_{i}-x_{ij})(\mu_{j}-x_{ij})}{n^{2}}
+(mxijm1+m(m1)2xijm2+=3m(m)xijm(1))2xij(nλiμj+xij)n2\displaystyle\,\,\,\,\,\,\,\,\,\,+\left(-mx_{ij}^{m-1}+\frac{m(m-1)}{2}x_{ij}^{m-2}+\sum_{\ell=3}^{m}\binom{m}{\ell}x_{ij}^{m-\ell}(-1)^{\ell}\right)\cdot\frac{2x_{ij}(n-\lambda_{i}-\mu_{j}+x_{ij})}{n^{2}}
=xijm(12m(n+1m)n2)+(terms in x of degree<m).\displaystyle=x_{ij}^{m}\left(1-\frac{2m(n+1-m)}{n^{2}}\right)+(\text{terms in x of degree}\,\,\,<m).

By Lemma 2.7, this condition means the eigenfunctions for PP are polynomials. ∎

Straightforward calculation of the transition probabilities gives the linear eigenfunctions:

Lemma 3.6.

For any λ,μ\lambda,\mu and 1iI,1jJ1\leq i\leq I,1\leq j\leq J, the functions

fij(𝐱):=xijλiμjnf_{ij}(\mathbf{x}):=x_{ij}-\frac{\lambda_{i}\mu_{j}}{n}

are eigenvectors with eigenvalue 12/n1-2/n.

Proof.

For 𝐱𝒯λ,μ\mathbf{x}\in\mathcal{T}_{\lambda,\mu}, the expected value of one entry of the table after one step of the chain can be computed using (11) and (12):

𝔼[T1(i,j)T0=𝐱]\displaystyle\operatorname{{\mathbb{E}}}[T_{1}(i,j)\mid T_{0}=\mathbf{x}] =(xij+1)2(λixij)(μjxij)n2+(xij1)2xij(nλiμj+xij)n2\displaystyle=(x_{ij}+1)\cdot\frac{2(\lambda_{i}-x_{ij})(\mu_{j}-x_{ij})}{n^{2}}+(x_{ij}-1)\cdot\frac{2x_{ij}(n-\lambda_{i}-\mu_{j}+x_{ij})}{n^{2}}
+xij(12xij(nλiμj+xij)n22(λixij)(μjxij)n2)\displaystyle\,\,\,\,\,\,\,\,\,\,+x_{ij}\left(1-\frac{2x_{ij}(n-\lambda_{i}-\mu_{j}+x_{ij})}{n^{2}}-\frac{2(\lambda_{i}-x_{ij})(\mu_{j}-x_{ij})}{n^{2}}\right)
=2(λixij)(μjxij)n22xij(nλiμj+xij)n2+xij\displaystyle=\frac{2(\lambda_{i}-x_{ij})(\mu_{j}-x_{ij})}{n^{2}}-\frac{2x_{ij}(n-\lambda_{i}-\mu_{j}+x_{ij})}{n^{2}}+x_{ij}
=xij(12n)+2λiμjn2.\displaystyle=x_{ij}\left(1-\frac{2}{n}\right)+\frac{2\lambda_{i}\mu_{j}}{n^{2}}. (13)

Thus,

𝔼[fij(T1)T0=𝐱]\displaystyle\operatorname{{\mathbb{E}}}[f_{ij}(T_{1})\mid T_{0}=\mathbf{x}] =xij(12n)+2λiμjn2λiμjn\displaystyle=x_{ij}\left(1-\frac{2}{n}\right)+\frac{2\lambda_{i}\mu_{j}}{n^{2}}-\frac{\lambda_{i}\mu_{j}}{n}
=(12n)(xijλiμjn)=(12n)fij(𝐱).\displaystyle=\left(1-\frac{2}{n}\right)\left(x_{ij}-\frac{\lambda_{i}\mu_{j}}{n}\right)=\left(1-\frac{2}{n}\right)\cdot f_{ij}(\mathbf{x}).
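
Lemma 3.6 can also be checked numerically. The sketch below (with arbitrarily chosen margins) enumerates the tables with the given row and column sums, builds the transition matrix of the swap chain by summing the move probabilities 2 x_{i_1 j_1} x_{i_2 j_2} / n^2, and verifies that f_{11} is an eigenvector with eigenvalue 1 - 2/n.

```python
# Numerical check of Lemma 3.6 (a sketch; margins chosen arbitrarily).  Enumerate
# all tables with row sums lam and column sums mu, build the transition matrix of
# the swap chain induced by random transpositions, and verify that
# f_11(x) = x_11 - lam_1 * mu_1 / n satisfies P f = (1 - 2/n) f.
import itertools
import numpy as np

lam, mu = (4, 3), (3, 2, 2)       # margins, n = 7
n, I, J = sum(lam), len(lam), len(mu)

def tables(rows, cols):
    """All tables with the given row sums and column sums."""
    out = []
    for first in itertools.product(*(range(min(rows[0], c) + 1) for c in cols)):
        if sum(first) != rows[0]:
            continue
        if len(rows) == 1:
            out.append((first,))
        else:
            rest_cols = tuple(c - f for c, f in zip(cols, first))
            out.extend((first,) + rest for rest in tables(rows[1:], rest_cols))
    return out

states = tables(lam, mu)
index = {T: s for s, T in enumerate(states)}
P = np.zeros((len(states), len(states)))
cells = list(itertools.product(range(I), range(J)))
for s, T in enumerate(states):
    for (i1, j1), (i2, j2) in itertools.permutations(cells, 2):
        if i1 == i2 or j1 == j2 or T[i1][j1] * T[i2][j2] == 0:
            continue
        y = [list(row) for row in T]
        y[i1][j1] -= 1; y[i2][j2] -= 1; y[i1][j2] += 1; y[i2][j1] += 1
        P[s, index[tuple(map(tuple, y))]] += T[i1][j1] * T[i2][j2] / n**2
    P[s, s] = 1.0 - P[s].sum()            # holding probability
f = np.array([T[0][0] - lam[0] * mu[0] / n for T in states])
print(np.allclose(P @ f, (1 - 2 / n) * f))   # True
```
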

Remark 13.

The set {fij}1iI1,1jJ1\{f_{ij}\}_{1\leq i\leq I-1,1\leq j\leq J-1} of eigenvectors from Lemma 3.6 is a basis for the eigenspace with eigenvalue 12/n1-2/n (as Corollary 3.2 (a) showed the multiplicity of 12/n1-2/n is (I1)(J1)(I-1)(J-1)). The current version of the functions fijf_{ij} are not orthogonal with respect to πλ,μ\pi_{\lambda,\mu}, because

\operatorname{{\mathbb{E}}}_{\pi_{\lambda,\mu}}[T_{ij}T_{kl}]=\begin{cases}\frac{\lambda_{i}\mu_{j}\lambda_{k}\mu_{l}}{n(n-1)}&\text{if}\quad i\neq k,\,j\neq l\\ \frac{\lambda_{i}\mu_{j}\lambda_{k}(\mu_{j}-1)}{n(n-1)}&\text{if}\quad i\neq k,\,j=l\\ \frac{\lambda_{i}^{2}\mu_{j}^{2}}{n^{2}}+\frac{\lambda_{i}\mu_{j}(n-\lambda_{i})(n-\mu_{j})}{n^{2}(n-1)}&\text{if}\quad i=k,\,j=l\end{cases}. (14)

If ik,jli\neq k,j\neq l, then

𝔼πλ,μ[fi,j(T)fk,l(T)]\displaystyle\operatorname{{\mathbb{E}}}_{\pi_{\lambda,\mu}}[f_{i,j}(T)f_{k,l}(T)] =λiμjλkμln(n1)2λiμjλkμln2+λiμjλkμln2\displaystyle=\frac{\lambda_{i}\mu_{j}\lambda_{k}\mu_{l}}{n(n-1)}-2\frac{\lambda_{i}\mu_{j}\lambda_{k}\mu_{l}}{n^{2}}+\frac{\lambda_{i}\mu_{j}\lambda_{k}\mu_{l}}{n^{2}}
=λiμjλkμln(1n(n1)).\displaystyle=\frac{\lambda_{i}\mu_{j}\lambda_{k}\mu_{l}}{n}\left(\frac{1}{n(n-1)}\right).

To find the quadratic eigenvectors, the following computations are needed.

Lemma 3.7.

For any λ,μ\lambda,\mu, 1i,kI,1j,lJ1\leq i,k\leq I,1\leq j,l\leq J and 𝐱𝒯λ,μ\mathbf{x}\in\mathcal{T}_{\lambda,\mu},

𝔼[T1(i,j)T1(k,l)T0=𝐱]={xijxkl(14n+2n2)+2n2(xklλiμj+xijλkμl+xilxkj)ik,jlxijxkj(14n+4n2)+2n2(xkj(λiμjλi)+xij(λkμjλk))ik,j=lxij2(14n+4n2)+2n2(xij(2λiμj2λi2μj+n)+λiμj)i=k,j=l.\operatorname{{\mathbb{E}}}\left[T_{1}(i,j)T_{1}(k,l)\mid T_{0}=\mathbf{x}\right]=\begin{cases}x_{ij}x_{kl}\left(1-\frac{4}{n}+\frac{2}{n^{2}}\right)+\frac{2}{n^{2}}\left(x_{kl}\lambda_{i}\mu_{j}+x_{ij}\lambda_{k}\mu_{l}+x_{il}x_{kj}\right)&i\neq k,j\neq l\\ x_{ij}x_{kj}\left(1-\frac{4}{n}+\frac{4}{n^{2}}\right)+\frac{2}{n^{2}}\left(x_{kj}(\lambda_{i}\mu_{j}-\lambda_{i})+x_{ij}(\lambda_{k}\mu_{j}-\lambda_{k})\right)&i\neq k,j=l\\ x_{ij}^{2}\left(1-\frac{4}{n}+\frac{4}{n^{2}}\right)+\frac{2}{n^{2}}\left(x_{ij}(2\lambda_{i}\mu_{j}-2\lambda_{i}-2\mu_{j}+n)+\lambda_{i}\mu_{j}\right)&i=k,j=l\end{cases}. (15)
Proof.

For the first case, if iki\neq k and jlj\neq l then the only situation in which both the (i,j)(i,j) and (k,l)(k,l) entries are chosen is if they are opposite corners of the box chosen for the swap move. In the following calculation, (16) is the case where both the (i,j)(i,j) and (k,l)(k,l) entries change, (17) is the case where only the (i,j)(i,j) entry changes, and (18) is the case where only the (k,l)(k,l) entry changes.

𝔼[T1(i,j)T1(k,l)\displaystyle\operatorname{{\mathbb{E}}}\left[T_{1}(i,j)T_{1}(k,l)\right. xijxklT0=𝐱]=(xijxkl+1)2xijxkln2+(xij+xkl+1)2xilxkjn2\displaystyle\left.-x_{ij}x_{kl}\mid T_{0}=\mathbf{x}\right]=(-x_{ij}-x_{kl}+1)\frac{2x_{ij}x_{kl}}{n^{2}}+(x_{ij}+x_{kl}+1)\frac{2x_{il}x_{kj}}{n^{2}} (16)
xkl2xij(nλiμj+xijxkl)n2+xkl2((λixij)(μjxij)xilxkj)n2\displaystyle-x_{kl}\frac{2x_{ij}(n-\lambda_{i}-\mu_{j}+x_{ij}-x_{kl})}{n^{2}}+x_{kl}\frac{2((\lambda_{i}-x_{ij})(\mu_{j}-x_{ij})-x_{il}x_{kj})}{n^{2}} (17)
xij2xkl(nλkμl+xklxij)n2+xij2((λkxkl)(μlxkl)xilxkj)n2\displaystyle-x_{ij}\frac{2x_{kl}(n-\lambda_{k}-\mu_{l}+x_{kl}-x_{ij})}{n^{2}}+x_{ij}\frac{2((\lambda_{k}-x_{kl})(\mu_{l}-x_{kl})-x_{il}x_{kj})}{n^{2}} (18)
=2n2(xijxkl2nxijxkl+xijλkμl+xklλiμj+xilxkj)\displaystyle=\frac{2}{n^{2}}\left(x_{ij}x_{kl}-2nx_{ij}x_{kl}+x_{ij}\lambda_{k}\mu_{l}+x_{kl}\lambda_{i}\mu_{j}+x_{il}x_{kj}\right)

Now suppose iki\neq k and j=lj=l. In this case the (i,j)(i,j) and (k,l)(k,l) entries of the table could only both change if one increases and the other decreases. This gives

𝔼[T1(i,j)T1(k,j)\displaystyle\operatorname{{\mathbb{E}}}\left[T_{1}(i,j)T_{1}(k,j)\right. xijxkjT0=𝐱]=(xijxkj1)2xij(λkxkj)n2+(xij+xkj1)2xkj(λixij)n2\displaystyle\left.-x_{ij}x_{kj}\mid T_{0}=\mathbf{x}\right]=(x_{ij}-x_{kj}-1)\frac{2x_{ij}(\lambda_{k}-x_{kj})}{n^{2}}+(-x_{ij}+x_{kj}-1)\frac{2x_{kj}(\lambda_{i}-x_{ij})}{n^{2}} (19)
xkj2xij(nλiλkμj+xij+xkj)n2+xkj2(λixij)(μjxijxkj)n2\displaystyle-x_{kj}\frac{2x_{ij}(n-\lambda_{i}-\lambda_{k}-\mu_{j}+x_{ij}+x_{kj})}{n^{2}}+x_{kj}\frac{2(\lambda_{i}-x_{ij})(\mu_{j}-x_{ij}-x_{kj})}{n^{2}} (20)
xij2xkj(nλkλiμj+xkj+xij)n2+xij2(λkxkj)(μjxkjxij)n2\displaystyle-x_{ij}\frac{2x_{kj}(n-\lambda_{k}-\lambda_{i}-\mu_{j}+x_{kj}+x_{ij})}{n^{2}}+x_{ij}\frac{2(\lambda_{k}-x_{kj})(\mu_{j}-x_{kj}-x_{ij})}{n^{2}} (21)
=2n2(2xijxkj2nxijxkjλkxijλixkj+xkjλiμj+xijλkμj).\displaystyle=\frac{2}{n^{2}}\left(2x_{ij}x_{kj}-2nx_{ij}x_{kj}-\lambda_{k}x_{ij}-\lambda_{i}x_{kj}+x_{kj}\lambda_{i}\mu_{j}+x_{ij}\lambda_{k}\mu_{j}\right).

Finally for the case i=k,j=li=k,j=l, this is the second moment of a birth death process. Using the transitions (11) and (12), the calculation is

𝔼[T1(i,j)2T0=𝐱]\displaystyle\operatorname{{\mathbb{E}}}[T_{1}(i,j)^{2}\mid T_{0}=\mathbf{x}] =(xij+1)2(2(λixij)(μjxij)n2)+(xij1)2(2xij(nλiμj+xij)n2)\displaystyle=(x_{ij}+1)^{2}\left(\frac{2(\lambda_{i}-x_{ij})(\mu_{j}-x_{ij})}{n^{2}}\right)+(x_{ij}-1)^{2}\left(\frac{2x_{ij}(n-\lambda_{i}-\mu_{j}+x_{ij})}{n^{2}}\right)
+xij2(T1(i,j)=xijT0=𝐱)\displaystyle\,\,\,\,\,\,\,\,\,\,\,+x_{ij}^{2}\operatorname{\mathbb{P}}(T_{1}(i,j)=x_{ij}\mid T_{0}=\mathbf{x})
=xij2(14n+4n2)+2n2(xij(2λiμj2λi2μj+n)+λiμj).\displaystyle=x_{ij}^{2}\left(1-\frac{4}{n}+\frac{4}{n^{2}}\right)+\frac{2}{n^{2}}\left(x_{ij}(2\lambda_{i}\mu_{j}-2\lambda_{i}-2\mu_{j}+n)+\lambda_{i}\mu_{j}\right).

Lemma 3.8 (Quadratic Eigenfunctions).

Let λ,μ\lambda,\mu be partitions of nn with |λ|=I,|μ|=J|\lambda|=I,|\mu|=J, and let 1i,kI,1j,lJ1\leq i,k\leq I,1\leq j,l\leq J. For the Markov chain on 𝒯λ,μ\mathcal{T}_{\lambda,\mu}, the following functions, defined for 𝐱𝒯λ,μ\mathbf{x}\in\mathcal{T}_{\lambda,\mu}, are eigenfunctions with eigenvalue 14/n+4/n21-4/n+4/n^{2} (part (a) assumes ik,jli\neq k,j\neq l, and part (b) assumes iki\neq k):

  1. (a)
    f(i,j),(k,l)(𝐱)\displaystyle f_{(i,j),(k,l)}(\mathbf{x}) :=xijxklxijλkμln2xklλiμjn2\displaystyle:=x_{ij}x_{kl}-x_{ij}\frac{\lambda_{k}\mu_{l}}{n-2}-x_{kl}\frac{\lambda_{i}\mu_{j}}{n-2}
    +xilxkjxilλkμjn2xkjλiμln2+2λkμlλiμj(n1)(n2).\displaystyle\,\,\,\,\,\,\,+x_{il}x_{kj}-x_{il}\frac{\lambda_{k}\mu_{j}}{n-2}-x_{kj}\frac{\lambda_{i}\mu_{l}}{n-2}+\frac{2\lambda_{k}\mu_{l}\lambda_{i}\mu_{j}}{(n-1)(n-2)}.
  2. (b)
    \displaystyle f_{(i,j),(k,j)}(\mathbf{x})=x_{ij}x_{kj}-x_{ij}\frac{\lambda_{k}(\mu_{j}-1)}{n-2}-x_{kj}\frac{\lambda_{i}(\mu_{j}-1)}{n-2}+\frac{\lambda_{i}\lambda_{k}\mu_{j}(\mu_{j}-1)}{(n-1)(n-2)}.
  3. (c)
    f(i,j),(i,j)(𝐱)\displaystyle f_{(i,j),(i,j)}(\mathbf{x}) =xij2xij2λiμj2λi2μj+nn2+λiμj(1+λiμjλiμj)(n1)(n2).\displaystyle=x_{ij}^{2}-x_{ij}\frac{2\lambda_{i}\mu_{j}-2\lambda_{i}-2\mu_{j}+n}{n-2}+\frac{\lambda_{i}\mu_{j}(1+\lambda_{i}\mu_{j}-\lambda_{i}-\mu_{j})}{(n-1)(n-2)}.
Remark 14.

Using Equation 14, one can check that 𝔼πλ,μ[f(i,j),(k,l)(T)]=0\operatorname{{\mathbb{E}}}_{\pi_{\lambda,\mu}}[f_{(i,j),(k,l)}(T)]=0, for any i,j,k,li,j,k,l.

Proof.

The results follow from straightforward calculations using Lemma 3.7 and Equation 13. As an illustration, (a) is computed: In 𝔼[f(i,j),(k,l)(T1)T0=𝐱]\operatorname{{\mathbb{E}}}[f_{(i,j),(k,l)}(T_{1})\mid T_{0}=\mathbf{x}], the degree 22 terms will be

\displaystyle x_{ij}x_{kl}\left(1-\frac{4}{n}+\frac{2}{n^{2}}\right)+\frac{2}{n^{2}}x_{il}x_{kj}+x_{il}x_{kj}\left(1-\frac{4}{n}+\frac{2}{n^{2}}\right)+\frac{2}{n^{2}}x_{ij}x_{kl}=\left(1-\frac{4}{n}+\frac{4}{n^{2}}\right)\left(x_{ij}x_{kl}+x_{il}x_{kj}\right).

Degree 11 terms arise from the expectation of six of the terms in f(i,j),(k,l)(T1)f_{(i,j),(k,l)}(T_{1}):

𝔼[T1(i,j)T1(k,l)T0=𝐱]2n2(xklλiμj+xijλkμl)\displaystyle\operatorname{{\mathbb{E}}}[T_{1}(i,j)T_{1}(k,l)\mid T_{0}=\mathbf{x}]\to\frac{2}{n^{2}}\left(x_{kl}\lambda_{i}\mu_{j}+x_{ij}\lambda_{k}\mu_{l}\right)
𝔼[T1(i,l)T1(k,j)T0=𝐱]2n2(xkjλiμl+xilλkμj)\displaystyle\operatorname{{\mathbb{E}}}[T_{1}(i,l)T_{1}(k,j)\mid T_{0}=\mathbf{x}]\to\frac{2}{n^{2}}\left(x_{kj}\lambda_{i}\mu_{l}+x_{il}\lambda_{k}\mu_{j}\right)
𝔼[T1(i,j)λkμln2T0=𝐱](12n)xijλkμln2\displaystyle\operatorname{{\mathbb{E}}}\left[-T_{1}(i,j)\frac{\lambda_{k}\mu_{l}}{n-2}\mid T_{0}=\mathbf{x}\right]\to-\left(1-\frac{2}{n}\right)x_{ij}\frac{\lambda_{k}\mu_{l}}{n-2}
𝔼[T1(k,l)λiμjn2T0=𝐱](12n)xklλiμjn2\displaystyle\operatorname{{\mathbb{E}}}\left[-T_{1}(k,l)\frac{\lambda_{i}\mu_{j}}{n-2}\mid T_{0}=\mathbf{x}\right]\to-\left(1-\frac{2}{n}\right)x_{kl}\frac{\lambda_{i}\mu_{j}}{n-2}
𝔼[T1(i,l)λkμjn2T0=𝐱](12n)xilλkμjn2\displaystyle\operatorname{{\mathbb{E}}}\left[-T_{1}(i,l)\frac{\lambda_{k}\mu_{j}}{n-2}\mid T_{0}=\mathbf{x}\right]\to-\left(1-\frac{2}{n}\right)x_{il}\frac{\lambda_{k}\mu_{j}}{n-2}
𝔼[T1(k,j)λiμln2T0=𝐱](12n)xkjλiμln2\displaystyle\operatorname{{\mathbb{E}}}\left[-T_{1}(k,j)\frac{\lambda_{i}\mu_{l}}{n-2}\mid T_{0}=\mathbf{x}\right]\to-\left(1-\frac{2}{n}\right)x_{kj}\frac{\lambda_{i}\mu_{l}}{n-2}

Collecting the xijx_{ij} terms gives

xij(2n2λkμl1nλkμl)\displaystyle x_{ij}\left(\frac{2}{n^{2}}\lambda_{k}\mu_{l}-\frac{1}{n}\lambda_{k}\mu_{l}\right) =(n2)(2n21n)xijλkμln2\displaystyle=(n-2)\left(\frac{2}{n^{2}}-\frac{1}{n}\right)x_{ij}\frac{\lambda_{k}\mu_{l}}{n-2}
=(14n+4n2)xijλkμln2,\displaystyle=-\left(1-\frac{4}{n}+\frac{4}{n^{2}}\right)x_{ij}\frac{\lambda_{k}\mu_{l}}{n-2},

and the computation is the same for the other linear terms. Finally, the constant terms arise from:

𝔼[T1(i,j)λkμln2T0=𝐱]2λiμjn2λkμln2\displaystyle\operatorname{{\mathbb{E}}}\left[-T_{1}(i,j)\frac{\lambda_{k}\mu_{l}}{n-2}\mid T_{0}=\mathbf{x}\right]\to-\frac{2\lambda_{i}\mu_{j}}{n^{2}}\cdot\frac{\lambda_{k}\mu_{l}}{n-2}
𝔼[T1(k,l)λiμjn2T0=𝐱]2λkμln2λiμjn2\displaystyle\operatorname{{\mathbb{E}}}\left[-T_{1}(k,l)\frac{\lambda_{i}\mu_{j}}{n-2}\mid T_{0}=\mathbf{x}\right]\to-\frac{2\lambda_{k}\mu_{l}}{n^{2}}\cdot\frac{\lambda_{i}\mu_{j}}{n-2}
𝔼[T1(i,l)λkμjn2T0=𝐱]2λiμln2λkμjn2\displaystyle\operatorname{{\mathbb{E}}}\left[-T_{1}(i,l)\frac{\lambda_{k}\mu_{j}}{n-2}\mid T_{0}=\mathbf{x}\right]\to-\frac{2\lambda_{i}\mu_{l}}{n^{2}}\cdot\frac{\lambda_{k}\mu_{j}}{n-2}
𝔼[T1(k,j)λiμln2T0=𝐱]2λkμjn2λiμln2.\displaystyle\operatorname{{\mathbb{E}}}\left[-T_{1}(k,j)\frac{\lambda_{i}\mu_{l}}{n-2}\mid T_{0}=\mathbf{x}\right]\to-\frac{2\lambda_{k}\mu_{j}}{n^{2}}\cdot\frac{\lambda_{i}\mu_{l}}{n-2}.

This gives

2λkμlλiμj(n1)(n2)8λkμlλiμjn2(n2)=(14n+4n2)2λkμlλiμj(n1)(n2).\displaystyle\frac{2\lambda_{k}\mu_{l}\lambda_{i}\mu_{j}}{(n-1)(n-2)}-\frac{8\lambda_{k}\mu_{l}\lambda_{i}\mu_{j}}{n^{2}(n-2)}=\left(1-\frac{4}{n}+\frac{4}{n^{2}}\right)\frac{2\lambda_{k}\mu_{l}\lambda_{i}\mu_{j}}{(n-1)(n-2)}.

Further details are omitted. ∎
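
As a further numerical sanity check (a sketch only, with margins chosen arbitrarily), Remark 14 can be verified directly: enumerating the tables with fixed margins and computing the Fisher-Yates probabilities shows that the diagonal quadratic eigenfunction of Lemma 3.8(c) has mean zero.

```python
# Sketch verifying Remark 14 numerically: the quadratic eigenfunction of
# Lemma 3.8(c) has mean zero under the Fisher-Yates distribution.  The margins
# below are chosen arbitrarily for the check.
import itertools
from math import factorial

lam, mu = (4, 3, 2), (3, 3, 3)
n = sum(lam)

def tables(rows, cols):
    out = []
    for first in itertools.product(*(range(min(rows[0], c) + 1) for c in cols)):
        if sum(first) != rows[0]:
            continue
        if len(rows) == 1:
            out.append((first,))
        else:
            rest_cols = tuple(c - f for c, f in zip(cols, first))
            out.extend((first,) + rest for rest in tables(rows[1:], rest_cols))
    return out

def fisher_yates(T):
    """pi(T) = (1/n!) * prod_i lam_i! * prod_j mu_j! / prod_{i,j} T_ij!"""
    p = 1.0 / factorial(n)
    for li in lam:
        p *= factorial(li)
    for mj in mu:
        p *= factorial(mj)
    for row in T:
        for t in row:
            p /= factorial(t)
    return p

def f_diag(T, i, j):      # quadratic eigenfunction f_{(i,j),(i,j)} of Lemma 3.8(c)
    x = T[i][j]
    K = (2 * lam[i] * mu[j] - 2 * lam[i] - 2 * mu[j] + n) / (n - 2)
    L = lam[i] * mu[j] * (1 + lam[i] * mu[j] - lam[i] - mu[j]) / ((n - 1) * (n - 2))
    return x * x - K * x + L

ts = tables(lam, mu)
print(abs(sum(fisher_yates(T) for T in ts) - 1.0) < 1e-9)              # True
print(abs(sum(fisher_yates(T) * f_diag(T, 0, 0) for T in ts)) < 1e-9)  # True
```
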

4 Mixing Time

This section contains results on the mixing time for special cases. Throughout, keep in mind that random transpositions on SnS_{n} has mixing time (n/2)log(n)(n/2)\log(n), which is an upper bound for the mixing time on 𝒯λ,μ\mathcal{T}_{\lambda,\mu}. The question then is: Does the lumping speed up mixing, and if so by what order?

4.1 Upper Bound for 2×J2\times J tables

Due to the complicated nature of the orthogonal polynomials, the upper bound can only be analyzed for the specific starting states for which the kernel polynomials simplify. Suppose λ=(nk,k),μ=(μ1,,μJ)\lambda=(n-k,k),\mu=(\mu_{1},\dots,\mu_{J}), with kn/2k\leq\lfloor n/2\rfloor, and suppose there is at least one index 1jJ1\leq j\leq J such that μj>k\mu_{j}>k.

Let k𝐞jk\mathbf{e}_{j} be the table with the second row all 0 except kk in the jjth column. For example, when n=10,μ=(4,3,3),λ=(8,2)n=10,\mu=(4,3,3),\lambda=(8,2), we have the following table

2𝐞1=(233200).2\mathbf{e}_{1}=\begin{pmatrix}2&3&3\\ 2&0&0\end{pmatrix}.

Note that the assumption μj>k\mu_{j}>k ensures that this table exists.

Recall Proposition 2.8 which says, evaluated at these states, the kernel polynomials for the multivariate orthogonal polynomials have a simple closed-form expression:

hm(k𝐞j,k𝐞j)=(km)(n2m+1)n[m1](nμj)[m](nk)[m](μj)[m].h_{m}(k\mathbf{e}_{j},k\mathbf{e}_{j})=\binom{k}{m}\frac{(n-2m+1)n_{[m-1]}(n-\mu_{j})_{[m]}}{(n-k)_{[m]}(\mu_{j})_{[m]}}. (22)

Recall the notation a[m]=a(a1)(am+1)a_{[m]}=a(a-1)\ldots(a-m+1) for the decreasing factorial, with a[0]:=1a_{[0]}:=1.

Theorem 4.1.

Let PP be the transition matrix for the swap Markov chain 𝒯(nk,k),μ\mathcal{T}_{(n-k,k),\mu}, with kn/2k\leq\lfloor n/2\rfloor and μ=(μ1,,μJ)\mu=(\mu_{1},\dots,\mu_{J}) with μj>k\mu_{j}>k for at least one index 1jJ1\leq j\leq J. For any c>0c>0 and 1jJ1\leq j\leq J such that μj>k\mu_{j}>k,

  1. (a)

    If

    t=(n4+k(k1)2(n2k))(log(kn(nμj)(n2k)(μjk))+c),t=\left(\frac{n}{4}+\frac{k(k-1)}{2(n-2k)}\right)\cdot\left(\log\left(\frac{k\cdot n\cdot(n-\mu_{j})}{(n-2k)\cdot(\mu_{j}-k)}\right)+c\right),

    then χk𝐞j2(t)ec\chi^{2}_{k\mathbf{e}_{j}}(t)\leq e^{-c}.

  2. (b)

    If

    t=n8(log(k(n1)(nμj)(nk)μj)c),t=\frac{n}{8}\cdot\left(\log\left(\frac{k\cdot(n-1)\cdot(n-\mu_{j})}{(n-k)\cdot\mu_{j}}\right)-c\right),

    then χk𝐞j2(t)ec\chi^{2}_{k\mathbf{e}_{j}}(t)\geq e^{c}.

Remark 15.

In [35] the kernel polynomials for the multivariate hypergeometric distribution are similarly used to analyze three classes of Bernoulli-Laplace type Markov chains, which have multivariate hypergeometric stationary distribution. Theorem 4.1 can be compared to Proposition 4.2.1 of [35]. The random transpositions chain on 𝒯(nk,k),μ\mathcal{T}_{(n-k,k),\mu} is essentially a special case of the Bernoulli-Laplace level model.

Proof.

Using the expression (22) for the kernel polynomials, the chi-squared distance is

χk𝐞j2(t)\displaystyle\chi^{2}_{k\mathbf{e}_{j}}(t) =m=1k(12m(n+1m)n2)2t(km)(n2m+1)n[m1](nμj)[m](nk)[m](μj)[m].\displaystyle=\sum_{m=1}^{k}\left(1-\frac{2m(n+1-m)}{n^{2}}\right)^{2t}\cdot\binom{k}{m}\frac{(n-2m+1)n_{[m-1]}(n-\mu_{j})_{[m]}}{(n-k)_{[m]}(\mu_{j})_{[m]}}.

To help bound this sum, let SmS_{m} be the mmth term in the summand, i.e.

Sm=(12m(n+1m)n2)2t(km)(n2m+1)n[m1](nμj)[m](nk)[m](μj)[m].S_{m}=\left(1-\frac{2m(n+1-m)}{n^{2}}\right)^{2t}\cdot\binom{k}{m}\frac{(n-2m+1)n_{[m-1]}(n-\mu_{j})_{[m]}}{(n-k)_{[m]}(\mu_{j})_{[m]}}.

Then the ratio of consecutive terms can be bounded:

Sm+1Sm\displaystyle\frac{S_{m+1}}{S_{m}}\leq (12(n2m)n22m(n+1m))2tkmm+1n2m1n2m+1(nm+1)(nμjm)(nkm)(μjm)\displaystyle\left(1-\frac{2(n-2m)}{n^{2}-2m(n+1-m)}\right)^{2t}\cdot\frac{k-m}{m+1}\cdot\frac{n-2m-1}{n-2m+1}\cdot\frac{(n-m+1)(n-\mu_{j}-m)}{(n-k-m)(\mu_{j}-m)}
(12(n2k)n22k(n+1k))2tk2nn2knμjμjk\displaystyle\,\,\,\,\,\,\,\,\,\,\leq\left(1-\frac{2(n-2k)}{n^{2}-2k(n+1-k)}\right)^{2t}\cdot\frac{k}{2}\cdot\frac{n}{n-2k}\cdot\frac{n-\mu_{j}}{\mu_{j}-k}
k2nn2knμjμjkexp(2t(2(n2k)n22k(n+1k))).\displaystyle\,\,\,\,\,\,\,\,\,\,\leq\frac{k}{2}\cdot\frac{n}{n-2k}\cdot\frac{n-\mu_{j}}{\mu_{j}-k}\exp\left(-2t\left(\frac{2(n-2k)}{n^{2}-2k(n+1-k)}\right)\right).

If c>0c>0 and

t=(n22k(n+1k)4(n2k))(log(knn2knμjμjk)+c),t=\left(\frac{n^{2}-2k(n+1-k)}{4(n-2k)}\right)\cdot\left(\log\left(k\cdot\frac{n}{n-2k}\cdot\frac{n-\mu_{j}}{\mu_{j}-k}\right)+c\right),

then the ratio Sm+1/SmS_{m+1}/S_{m} is less than ec/21/2e^{-c}/2\leq 1/2. Also the first term S1S_{1} is the largest term, and

S1=(12n)2tk(n1)(nμj)(nk)(μj)ec.S_{1}=\left(1-\frac{2}{n}\right)^{2t}\cdot\frac{k(n-1)(n-\mu_{j})}{(n-k)(\mu_{j})}\leq e^{-c}.

Thus,

m=1kSmS1m=012mec.\displaystyle\sum_{m=1}^{k}S_{m}\leq S_{1}\sum_{m=0}^{\infty}\frac{1}{2^{m}}\leq e^{-c}.

The bound for part (b) comes from considering the contribution to the sum from the largest eigenvalues, which is for m=1m=1. This gives

χk𝐞j2(t)\displaystyle\chi^{2}_{k\mathbf{e}_{j}}(t) (12n)2tk(n1)n[0](nμj)[1](nk)[1](μj)[1]=(12n)2tk(n1)(nμj)(nk)μj\displaystyle\geq\left(1-\frac{2}{n}\right)^{2t}\cdot k\cdot\frac{(n-1)n_{[0]}(n-\mu_{j})_{[1]}}{(n-k)_{[1]}(\mu_{j})_{[1]}}=\left(1-\frac{2}{n}\right)^{2t}\cdot k\cdot\frac{(n-1)\cdot(n-\mu_{j})}{(n-k)\cdot\mu_{j}}
=exp(2tlog(12n)+log(k(n1)(nμj)(nk)μj))\displaystyle=\exp\left(2t\log\left(1-\frac{2}{n}\right)+\log\left(k\cdot\frac{(n-1)\cdot(n-\mu_{j})}{(n-k)\cdot\mu_{j}}\right)\right)
exp(2t(4n)+log(k(n1)(nμj)(nk)μj)),\displaystyle\geq\exp\left(2t\left(-\frac{4}{n}\right)+\log\left(k\cdot\frac{(n-1)\cdot(n-\mu_{j})}{(n-k)\cdot\mu_{j}}\right)\right),

using that log(1x)xx2>2x\log(1-x)\geq-x-x^{2}>-2x for 0x1/20\leq x\leq 1/2. The sum is then ec\geq e^{c} for t=(n/8)(log(k(n1)(nμj)/((nk)μj))c)t=(n/8)\left(\log\left(k(n-1)(n-\mu_{j})/((n-k)\mu_{j})\right)-c\right). ∎

Example 4.2.

Suppose λ=(nk,k),μ=(n,)\lambda=(n-k,k),\mu=(n-\ell,\ell) with >k\ell>k. Then Theorem 4.1 gives χk𝐞22(t)ec\chi^{2}_{k\mathbf{e}_{2}}(t)\leq e^{-c} for

t=(n4+k(k1)2(n2k))(log(kn(n)(n2k)(k))+c).t=\left(\frac{n}{4}+\frac{k(k-1)}{2(n-2k)}\right)\cdot\left(\log\left(\frac{k\cdot n\cdot(n-\ell)}{(n-2k)\cdot(\ell-k)}\right)+c\right).

In particular if nn is even, k=n/21k=n/2-1, =n/2\ell=n/2, then

t=\left(\frac{n}{4}+\frac{(n/2-1)(n/2-2)}{4}\right)\cdot\left(\log\left(\frac{(n/2-1)\cdot n\cdot(n/2)}{2}\right)+c\right)\sim\left(\frac{n}{4}+\frac{n^{2}}{16}\right)\left(\log(n^{3})+c\right).

Note that this is a factor of nn larger than the bound in Lemma 3.4 for the distance averaged over all starting states. This could be due to slower mixing from the extreme starting state:

k\mathbf{e}_{2}=\begin{pmatrix}n-\ell&\ell-k\\ 0&k\end{pmatrix},

which has small probability under π(nk,k),(n,)\pi_{(n-k,k),(n-\ell,\ell)}.
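
The bound in Theorem 4.1 can also be explored numerically; the sketch below (parameters chosen arbitrarily, not from the paper) evaluates the exact chi-squared distance from the starting state k𝐞j using the kernel formula (22), so that the decay in t can be examined directly.

```python
# Sketch: evaluate the exact chi-squared distance from the starting state k*e_j
# using the kernel-polynomial formula (22).  Parameters n, k, mu_j are chosen
# arbitrarily for illustration.
from math import comb

def falling(a, m):
    """Decreasing factorial a_[m] = a (a-1) ... (a-m+1), with a_[0] = 1."""
    out = 1
    for i in range(m):
        out *= a - i
    return out

def chi_sq(t, n, k, mu_j):
    total = 0.0
    for m in range(1, k + 1):
        beta = 1 - 2 * m * (n + 1 - m) / n**2
        mult = comb(k, m) * (n - 2 * m + 1) * falling(n, m - 1) * falling(n - mu_j, m) \
               / (falling(n - k, m) * falling(mu_j, m))
        total += beta ** (2 * t) * mult
    return total

n, k, mu_j = 100, 10, 40     # lambda = (90, 10), one column sum mu_j = 40 > k
for t in [10, 50, 100, 200, 400]:
    print(t, chi_sq(t, n, k, mu_j))
```
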

4.2 Lower Bound

Because the linear eigenfunctions are known for any size contingency table, Wilson’s method works to give a general lower bound.

Theorem 4.3 (Wilson’s Method, Theorem 13.5 in [37]).

Let (Xt)t0(X_{t})_{t\geq 0} be an irreducible aperiodic Markov chain with state space Ω\Omega and transition matrix PP. Let ϕ\phi be an eigenfunction of PP with eigenvalue λ\lambda with 1/2<λ<11/2<\lambda<1. Fix 0<ϵ<10<\epsilon<1 and let R>0R>0 satisfy

𝔼[|ϕ(X1)ϕ(x)|2X0=x]R,xΩ.\operatorname{{\mathbb{E}}}\left[|\phi(X_{1})-\phi(x)|^{2}\mid X_{0}=x\right]\leq R,\,\,\,\,\,\,\,\,\forall x\in\Omega.

Then for any xΩx\in\Omega,

tmix(ϵ)12log(1/λ)[log((1λ)ϕ(x)22R)+log(1ϵϵ)].t_{mix}(\epsilon)\geq\frac{1}{2\log(1/\lambda)}\left[\log\left(\frac{(1-\lambda)\phi(x)^{2}}{2R}\right)+\log\left(\frac{1-\epsilon}{\epsilon}\right)\right].

For the contingency table Markov chain on 𝒯λ,μ\mathcal{T}_{\lambda,\mu} the linear functions

fij(𝐱)=xijλiμjn,𝐱𝒯λ,μf_{ij}(\mathbf{x})=x_{ij}-\frac{\lambda_{i}\mu_{j}}{n},\quad\mathbf{x}\in\mathcal{T}_{\lambda,\mu}

for any i,ji,j, are eigenfunctions with eigenvalue 12/n1-2/n. These will be used in Wilson’s method to get the following result.

Lemma 4.4.

Let λ=(λ1,,λI),μ=(μ1,,μJ)\lambda=(\lambda_{1},\dots,\lambda_{I}),\mu=(\mu_{1},\dots,\mu_{J}) be any partitions of nn. For any i,ji,j and c>0c>0,

tmix{(n412)(log(mijλiμjn)c)ifn2(λi+μj)(n412)(log(12(nmijλiμj)2n(n+2)λiμj)c)ifn<2(λi+μj),t_{mix}\geq\begin{cases}\left(\frac{n}{4}-\frac{1}{2}\right)\left(\log\left(m_{ij}-\frac{\lambda_{i}\mu_{j}}{n}\right)-c\right)&\text{if}\,\,\,n\geq 2(\lambda_{i}+\mu_{j})\\ \left(\frac{n}{4}-\frac{1}{2}\right)\left(\log\left(\frac{1}{2}\frac{(nm_{ij}-\lambda_{i}\mu_{j})^{2}}{n(n+2)\lambda_{i}\mu_{j}}\right)-c\right)&\text{if}\,\,\,n<2(\lambda_{i}+\mu_{j})\end{cases},

where mij=min(λi,μj)m_{ij}=\min(\lambda_{i},\mu_{j}).

Proof.

From Lemmas 3.6 and 3.7,

𝔼[Tt+1(i,j)Tt=𝐱]=xij(12n)+2λiμjn2\displaystyle\operatorname{{\mathbb{E}}}[T_{t+1}(i,j)\mid T_{t}=\mathbf{x}]=x_{ij}\left(1-\frac{2}{n}\right)+\frac{2\lambda_{i}\mu_{j}}{n^{2}}
𝔼[Tt+1(i,j)2Tt=𝐱]=xij2(14n+4n2)+2n2(xij(2λiμj2λi2μj+n)+λiμj)).\displaystyle\operatorname{{\mathbb{E}}}[T_{t+1}(i,j)^{2}\mid T_{t}=\mathbf{x}]=x_{ij}^{2}\left(1-\frac{4}{n}+\frac{4}{n^{2}}\right)+\frac{2}{n^{2}}\left(x_{ij}(2\lambda_{i}\mu_{j}-2\lambda_{i}-2\mu_{j}+n)+\lambda_{i}\mu_{j})\right).

This allows the calculation of 𝔼[|ϕ(T1)ϕ(𝐱)|2T0=𝐱]\operatorname{{\mathbb{E}}}[|\phi(T_{1})-\phi(\mathbf{x})|^{2}\mid T_{0}=\mathbf{x}] for ϕ=fij\phi=f_{ij}:

𝔼[|fij(T1)fij(𝐱)|2T0=𝐱]\displaystyle\operatorname{{\mathbb{E}}}[|f_{ij}(T_{1})-f_{ij}(\mathbf{x})|^{2}\mid T_{0}=\mathbf{x}] =𝔼[|T1(i,j)xij|2T0=𝐱]\displaystyle=\operatorname{{\mathbb{E}}}[|T_{1}(i,j)-x_{ij}|^{2}\mid T_{0}=\mathbf{x}]
=xij2(14n+4n2)+2n2(xij(2λiμj2λi2μj+n)+λiμj))\displaystyle=x_{ij}^{2}\left(1-\frac{4}{n}+\frac{4}{n^{2}}\right)+\frac{2}{n^{2}}\left(x_{ij}(2\lambda_{i}\mu_{j}-2\lambda_{i}-2\mu_{j}+n)+\lambda_{i}\mu_{j})\right)
2xij(xij(12n)+2λiμjn2)+xij2\displaystyle\,\,\,\,\,\,\,\,\,\,-2x_{ij}\left(x_{ij}\left(1-\frac{2}{n}\right)+\frac{2\lambda_{i}\mu_{j}}{n^{2}}\right)+x_{ij}^{2}
=xij24n2+xij2n2(n2λi2μj)+2λiμjn2=:A\displaystyle=x_{ij}^{2}\cdot\frac{4}{n^{2}}+x_{ij}\cdot\frac{2}{n^{2}}\left(n-2\lambda_{i}-2\mu_{j}\right)+\frac{2\lambda_{i}\mu_{j}}{n^{2}}=:A

First suppose that i,ji,j are such that n2(λi+μj)n\geq 2(\lambda_{i}+\mu_{j}), and note that xijmin(λi,μj)=:mijx_{ij}\leq\min(\lambda_{i},\mu_{j})=:m_{ij} for 𝐱𝒯λ,μ\mathbf{x}\in\mathcal{T}_{\lambda,\mu}. A bound then is

A\displaystyle A 4mij2n2+mij2n2(n2λi2μj)+2λiμjn2\displaystyle\leq\frac{4m_{ij}^{2}}{n^{2}}+m_{ij}\frac{2}{n^{2}}\left(n-2\lambda_{i}-2\mu_{j}\right)+\frac{2\lambda_{i}\mu_{j}}{n^{2}}
=2mijn2λiμjn2.\displaystyle=\frac{2m_{ij}}{n}-\frac{2\lambda_{i}\mu_{j}}{n^{2}}.

This is the constant RR that can be applied to Wilson’s method. Using 𝐱𝒯λ,μ\mathbf{x}\in\mathcal{T}_{\lambda,\mu} such that xij=mij=min(λi,μj)x_{ij}=m_{ij}=\min(\lambda_{i},\mu_{j}):

(1λ)fij(𝐱)22R\displaystyle\frac{(1-\lambda)f_{ij}(\mathbf{x})^{2}}{2R} =2n(xijλiμjn)2/2(2mijn2λiμjn2)\displaystyle=\left.\frac{2}{n}\left(x_{ij}-\frac{\lambda_{i}\mu_{j}}{n}\right)^{2}\middle/2\left(\frac{2m_{ij}}{n}-\frac{2\lambda_{i}\mu_{j}}{n^{2}}\right)\right.
=2n(mijλiμjn)2/2(2mijn2λiμjn2)=12(mijλiμjn)\displaystyle=\left.\frac{2}{n}\left(m_{ij}-\frac{\lambda_{i}\mu_{j}}{n}\right)^{2}\middle/2\left(\frac{2m_{ij}}{n}-\frac{2\lambda_{i}\mu_{j}}{n^{2}}\right)\right.=\frac{1}{2}\left(m_{ij}-\frac{\lambda_{i}\mu_{j}}{n}\right)

If n<2(λi+μj)n<2(\lambda_{i}+\mu_{j}), then

A4mij2n2+2λiμjn2λiμjn(1+2n),A\leq\frac{4m_{ij}^{2}}{n^{2}}+\frac{2\lambda_{i}\mu_{j}}{n}\leq\frac{2\lambda_{i}\mu_{j}}{n}\left(1+\frac{2}{n}\right),

and in this case

(1λ)fij(𝐱)22R=2n(mijλiμjn)2/4λiμjn(1+2n)=12(nmijλiμj)2n(n+2)λiμj\displaystyle\frac{(1-\lambda)f_{ij}(\mathbf{x})^{2}}{2R}=\left.\frac{2}{n}\left(m_{ij}-\frac{\lambda_{i}\mu_{j}}{n}\right)^{2}\middle/\frac{4\lambda_{i}\mu_{j}}{n}\left(1+\frac{2}{n}\right)\right.=\frac{1}{2}\frac{(nm_{ij}-\lambda_{i}\mu_{j})^{2}}{n(n+2)\lambda_{i}\mu_{j}}

Finally, note that, with λ=12/n\lambda=1-2/n and using the bound 1/log(1γ)1/γ11/-\log(1-\gamma)\geq 1/\gamma-1,

12log(1/λ)=12log(12/n)n412.\frac{1}{2\log(1/\lambda)}=\frac{1}{-2\log(1-2/n)}\geq\frac{n}{4}-\frac{1}{2}.

Example 4.5.

Suppose λ=(nk,k),μ=(n,)\lambda=(n-k,k),\mu=(n-\ell,\ell) with k<n/2k<\ell\leq n/2 (as in Example 4.2). Then 2nk>n2n-\ell-k>n, so the second case in Lemma 4.4 always applies to the (1,1)(1,1) entry of the table. Since m11=min(n,nk)=nm_{11}=\min(n-\ell,n-k)=n-\ell, a lower bound is

tmix\displaystyle\mathrm{t_{mix}} (n412)(log(12(n(n)(n)(nk))2n(n+2)(n)(nk))c)\displaystyle\geq\left(\frac{n}{4}-\frac{1}{2}\right)\left(\log\left(\frac{1}{2}\frac{(n(n-\ell)-(n-\ell)(n-k))^{2}}{n(n+2)(n-\ell)(n-k)}\right)-c\right)
=(n412)(log(12(n)k2n(n+2)(nk))c).\displaystyle=\left(\frac{n}{4}-\frac{1}{2}\right)\left(\log\left(\frac{1}{2}\frac{(n-\ell)k^{2}}{n(n+2)(n-k)}\right)-c\right).

For example, if k==n/2k=\ell=n/2, then the expression inside the log\log is equal to (n/2)2/(2n(n+2))1/8(n/2)^{2}/(2n(n+2))\sim 1/8.

If ,k\ell,k are small enough so that k+n/2k+\ell\leq n/2, then the first case of Lemma 4.4 applies to the (2,2)(2,2) entry, with m22=min(k,)=km_{22}=\min(k,\ell)=k. The lower bound is then

\displaystyle\mathrm{t_{mix}}\geq\left(\frac{n}{4}-\frac{1}{2}\right)\left(\log\left(k-\frac{k\ell}{n}\right)-c\right).

For example, if =k=n/4\ell=k=n/4, then the expression inside the log\log is equal to (n2/4n2/16)/n=(3/16)n(n^{2}/4-n^{2}/16)/n=(3/16)n.
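
For a quick numerical illustration (margins chosen arbitrarily, not from the paper), the lower bound of Lemma 4.4 can be evaluated directly for a chosen cell:

```python
# Sketch: evaluate the lower bound of Lemma 4.4 for given margins and a given
# cell (i, j).  The margins below are arbitrary; c is the slack parameter.
from math import log

def wilson_lower_bound(lam, mu, i, j, c=1.0):
    n = sum(lam)
    m_ij = min(lam[i], mu[j])
    if n >= 2 * (lam[i] + mu[j]):
        inside = m_ij - lam[i] * mu[j] / n
    else:
        inside = 0.5 * (n * m_ij - lam[i] * mu[j]) ** 2 / (n * (n + 2) * lam[i] * mu[j])
    return (n / 4 - 0.5) * (log(inside) - c)

lam = (60, 20, 20)
mu = (50, 30, 20)
print(wilson_lower_bound(lam, mu, 1, 1))   # bound obtained from the (2,2) entry
```
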

5 Further Directions

This section discusses some future directions and applications of the random transpositions chain on contingency tables. Section 5.1 explores how the linear and quadratic eigenfunctions of the Fisher-Yates distribution could be used in statistical applications, to decompose the chi-squared statistic. Section 5.2 notes how the random transpositions chain can be extended to multi-way tables (coming from data with more than 22 categorical features). Section 5.3 considers the random transpositions chain on 𝒯λ,μ\mathcal{T}_{\lambda,\mu} transformed via the Metropolis-Hastings algorithm to a new Markov chain which has uniform stationary distribution (and vice-versa, the symmetric swap chain can be transformed to have Fisher-Yates stationary distribution). The relaxation times of the chains can be compared. The constants in the comparison depend on λ,μ\lambda,\mu and the size of the table, but the conclusion is that for sparse tables the metropolized version of random transpositions has a significantly smaller relaxation time than the symmetric swap chain. Section 5.4 notes that the qq-analog of the Fisher-Yates distribution arises via double cosets.

5.1 Data Analysis

This section discusses potential statistical applications of the eigenfunctions for the random transpositions chain on contingency tables. First, some classic datasets are described.

Datasets.

Tables 2, 3, 4 are classical real data tables with large χ2\chi^{2} statistic. Figure 1 shows histograms of the quadratic eigenfunctions evaluated on each of these tables, as well as plots of the Pearson residuals. We have not succeeded in finding any extra structure from these displays but believe that they may sometimes be informative.

Well Mild Moderate Impaired Total
A 64 94 58 46 262
B 57 94 54 40 245
C 57 105 65 60 287
D 72 141 77 94 384
E 36 97 54 78 265
F 21 71 54 71 217
Total 307 602 362 389 1660
Table 2: Midtown Manhattan Mental Health Study data

Table 2 shows data from an epidemiological survey known as the Midtown Manhattan Mental Health Study [39]. Rows record parent’s socioeconomic status (ranging from A = high, to F = low) and columns record the severity of mental illness. The χ2\chi^{2} statistic is 45.9845.98 on 1515 degrees of freedom.

Jan Feb March April May June July Aug Sep Oct Nov Dec Total
Jan 1 0 0 0 1 2 0 0 1 0 1 0 6
Feb 1 0 0 1 0 0 0 0 0 1 0 2 5
March 1 0 0 0 2 1 0 0 0 0 0 1 5
April 3 0 2 0 0 0 1 0 1 3 1 1 12
May 2 1 1 1 1 1 1 1 1 1 1 0 12
June 2 0 0 0 1 0 0 0 0 0 0 0 3
July 2 0 2 1 0 0 0 0 1 1 1 2 10
Aug 0 0 0 3 0 0 1 0 0 1 0 2 7
Sep 0 0 0 1 1 0 0 0 0 0 1 0 3
Oct 1 1 0 2 0 0 1 0 0 1 1 0 7
Nov 0 1 1 1 2 0 0 2 0 1 1 0 9
Dec 0 1 1 0 0 0 1 0 0 0 0 0 3
Total 13 4 7 10 8 4 5 3 4 9 7 8 82
Table 3: Birth and deathday for Queen Victoria’s descendants.

Table 3 records the month of birth and death for 8282 descendants of Queen Victoria; it appears as an example in [17]. The χ2\chi^{2} statistic is 115.6115.6 with 121121 degrees of freedom, which gives pp-value 0.6210.621, suggesting we do not reject the null hypothesis of independence. The classical rule of thumb for validity of the chi-squared approximation is a minimum of 55 expected counts per cell; this assumption is badly violated in Table 3, and there are too many tables with these margins for exact enumeration.

Black Brown Red Blond Total
Brown 68 119 26 7 220
Blue 20 84 17 94 215
Hazel 15 54 14 10 93
Green 5 29 14 16 64
Total 108 286 71 127 592
Table 4: Eye color vs. hair color for n=592n=592 individuals.

Table 4 was analyzed in [10]; the χ2\chi^{2} statistic is 138.29138.29 with 99 degrees of freedom.

Figure 1: The figures in the left column are the Pearson residuals, which are the normalized linear eigenfunctions; the xx-axis indicates an arbitrary ordering of the residuals and the yy-axis gives the values of the residuals. The right column shows histograms of the normalized quadratic eigenfunctions.

Residuals.

The linear eigenfunctions derived in Lemma 3.6 are well known as ‘Pearson residuals’ in the classical analysis of contingency tables [1]. The naturally scaled eigenfunctions

f^ij(T)=Tijλiμj/nλiμj/n\widehat{f}_{ij}(T)=\frac{T_{ij}-\lambda_{i}\mu_{j}/n}{\sqrt{\lambda_{i}\mu_{j}/n}}

measure the departure of the table from the null model. Standard practice displays all of these in a two way array. Another scaling is inspired by the inner product space with respect to the Fisher-Yates distribution, so that the functions have norm 11:

f~ij(T)=Tijλiμj/ncij,cij:=λiμj(nλi)(nμj)n2(n1)=𝔼πλ,μ[(Tijλiμj/n)2].\widetilde{f}_{ij}(T)=\frac{T_{ij}-\lambda_{i}\mu_{j}/n}{\sqrt{c_{ij}}},\quad c_{ij}:=\frac{\lambda_{i}\mu_{j}(n-\lambda_{i})(n-\mu_{j})}{n^{2}(n-1)}=\operatorname{{\mathbb{E}}}_{\pi_{\lambda,\mu}}\left[(T_{ij}-\lambda_{i}\mu_{j}/n)^{2}\right].

Table 7 compares f^ij\widehat{f}_{ij} and f~ij\widetilde{f}_{ij} for the Midtown Manhattan Mental Health dataset from Table 2.

Well Mild Moderate Impaired
A 2.233 -0.104 0.114 -1.965
B 1.737 0.546 0.078 -2.298
C 0.538 0.090 0.305 -0.885
D 0.117 0.148 -0.737 0.423
E -1.858 0.092 -0.498 2.018
F -3.020 -0.867 0.971 2.826
Table 5: The standard Pearson residuals, f^ij\widehat{f}_{ij}.
Well Mild Moderate Impaired
A 2.695 -0.142 0.141 -2.446
B 2.083 0.741 0.096 -2.844
C 0.656 0.124 0.379 -1.111
D 0.147 0.211 -0.950 0.551
E -2.245 0.125 -0.615 2.515
F -3.587 -1.165 1.177 3.462
Table 6: The linear eigenfunctions standardized to have norm 11, f~ij\widetilde{f}_{ij}.
Table 7: Comparison of the different normalizations f^ij\widehat{f}_{ij} and f~ij\widetilde{f}_{ij} of the linear eigenfunctions, for the dataset from Table 2.
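
Both normalizations are straightforward to compute; the following sketch recomputes Tables 5 and 6 from the counts in Table 2.

```python
# Sketch comparing the two normalizations of the linear eigenfunctions:
# the Pearson residuals and the Fisher-Yates-normalized residuals.
# The table below is the Midtown Manhattan data of Table 2 (without the totals).
T = [[64, 94, 58, 46],
     [57, 94, 54, 40],
     [57, 105, 65, 60],
     [72, 141, 77, 94],
     [36, 97, 54, 78],
     [21, 71, 54, 71]]

lam = [sum(row) for row in T]                       # row sums
mu = [sum(col) for col in zip(*T)]                  # column sums
n = sum(lam)

pearson, normed = [], []
for i, row in enumerate(T):
    pr, nr = [], []
    for j, t in enumerate(row):
        e = lam[i] * mu[j] / n                      # expected count under independence
        c = lam[i] * mu[j] * (n - lam[i]) * (n - mu[j]) / (n**2 * (n - 1))
        pr.append((t - e) / e**0.5)                 # hat f_ij
        nr.append((t - e) / c**0.5)                 # tilde f_ij
    pearson.append(pr)
    normed.append(nr)

print(pearson[0])   # first row of Table 5 (approximately)
print(normed[0])    # first row of Table 6 (approximately)
```
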

Chi-squared Decomposition.

It is natural to try to use the quadratic eigenfunctions in a similar way. They do not have an interpretation as (observed - expected) for some observed statistic, but they do have expectation 0 with respect to πλ,μ\pi_{\lambda,\mu}.

A potential use for the polynomial eigenfunctions is in a decomposition of the chi-squared statistic

χ2(T)=i,j(Tijλiμj/n)2λiμj/n.\displaystyle\chi^{2}(T)=\sum_{i,j}\frac{(T_{ij}-\lambda_{i}\mu_{j}/n)^{2}}{\lambda_{i}\mu_{j}/n}. (23)

Under the null hypothesis that row and column features are independent, χ2(T)\chi^{2}(T) has a limiting chi-squared distribution with (I1)(J1)(I-1)\cdot(J-1) degrees of freedom. A drawback of using χ2(T)\chi^{2}(T) is that when it is large enough to reject the null hypothesis, no additional information is given about why the sample fails the test. This motivates the subject of partitioning the statistic into terms which could give insight into what parts of the table fail the independence hypothesis. The thesis [45] contains a thorough review of this problem, and a theory for decomposing χ2(T)\chi^{2}(T) using Markov chains on the space {1,,I}×{1,,J}\{1,\dots,I\}\times\{1,\dots,J\}. Presented below is a natural decomposition using the linear and quadratic eigenfunctions fijf_{ij} and f(i,j),(i,j)f_{(i,j),(i,j)}.

Let Mij=λiμj/nM_{ij}=\lambda_{i}\mu_{j}/n, Kij,LijK_{ij},L_{ij} be such that

Kij=2λiμj2λi2μj+nn2\displaystyle K_{ij}=\frac{2\lambda_{i}\mu_{j}-2\lambda_{i}-2\mu_{j}+n}{n-2}
Lij=λiμj(1+λiμjλiμj)(n1)(n2),\displaystyle L_{ij}=\frac{\lambda_{i}\mu_{j}(1+\lambda_{i}\mu_{j}-\lambda_{i}-\mu_{j})}{(n-1)(n-2)},

so that f(i,j),(i,j)(T)=Tij2KijTij+Lijf_{(i,j),(i,j)}(T)=T_{ij}^{2}-K_{ij}T_{ij}+L_{ij}. Then,

\displaystyle\chi^{2}(T)=\sum_{i,j}\frac{1}{M_{ij}}(T_{ij}-M_{ij})^{2}=\sum_{i,j}\frac{1}{M_{ij}}(T_{ij}^{2}-2M_{ij}T_{ij}+M_{ij}^{2})
\displaystyle=\sum_{i,j}\frac{1}{M_{ij}}\left(T_{ij}^{2}-K_{ij}T_{ij}+L_{ij}+(K_{ij}-2M_{ij})T_{ij}+M_{ij}^{2}-L_{ij}\right)
\displaystyle=\sum_{i,j}\frac{1}{M_{ij}}f_{(i,j),(i,j)}(T)+\sum_{i,j}\frac{K_{ij}-2M_{ij}}{M_{ij}}f_{ij}(T)+\sum_{i,j}\frac{K_{ij}M_{ij}-M_{ij}^{2}-L_{ij}}{M_{ij}},
where the last equality uses T_{ij}=f_{ij}(T)+M_{ij}.
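
This decomposition is easy to check numerically; the sketch below evaluates the three sums for the hair and eye color counts of Table 4 and confirms that they add up to the chi-squared statistic.

```python
# Sketch: verify the chi-squared decomposition above on the data of Table 4,
# checking that the three sums reproduce the chi-squared statistic.
T = [[68, 119, 26, 7],
     [20, 84, 17, 94],
     [15, 54, 14, 10],
     [5, 29, 14, 16]]

lam = [sum(row) for row in T]
mu = [sum(col) for col in zip(*T)]
n = sum(lam)

chi_sq = quad = lin = const = 0.0
for i, row in enumerate(T):
    for j, t in enumerate(row):
        M = lam[i] * mu[j] / n
        K = (2 * lam[i] * mu[j] - 2 * lam[i] - 2 * mu[j] + n) / (n - 2)
        L = lam[i] * mu[j] * (1 + lam[i] * mu[j] - lam[i] - mu[j]) / ((n - 1) * (n - 2))
        f_lin = t - M                              # linear eigenfunction f_ij
        f_quad = t * t - K * t + L                 # quadratic eigenfunction f_(i,j),(i,j)
        chi_sq += (t - M) ** 2 / M
        quad += f_quad / M
        lin += (K - 2 * M) / M * f_lin
        const += (K * M - M * M - L) / M

print(round(chi_sq, 2))                     # 138.29, as reported for Table 4
print(round(quad + lin + const, 2))         # equals the chi-squared statistic
```
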

Figure 1 shows histograms of the normalized quadratic eigenfunctions f_{(i,j),(k,l)}/\sqrt{\operatorname{{\mathbb{E}}}_{\pi}[f_{(i,j),(k,l)}^{2}]}.

5.2 Higher Dimensional Tables

The random swap Markov chain can be used as inspiration for Markov chains on mm-way contingency tables with fixed margins for m>2m>2. This section describes how that could go for m=3m=3.

Let λ,μ,ρ\lambda,\mu,\rho be three partitions of nn with |λ|=I,|μ|=J,|ρ|=K|\lambda|=I,|\mu|=J,|\rho|=K. Let 𝒯λ,μ,ρ\mathcal{T}_{\lambda,\mu,\rho} be the set of three-way tables with fixed margins determined by λ,μ,ρ\lambda,\mu,\rho. That is, T={Tijk:1iI,1jJ,1kK}𝒯λ,μ,ρT=\{T_{ijk}:1\leq i\leq I,1\leq j\leq J,1\leq k\leq K\}\in\mathcal{T}_{\lambda,\mu,\rho} if

jkTijk=λi,ikTijk=μj,ijTijk=ρk,for all   1iI,1jJ,1kK.\displaystyle\sum_{jk}T_{ijk}=\lambda_{i},\quad\sum_{ik}T_{ijk}=\mu_{j},\quad\sum_{ij}T_{ijk}=\rho_{k},\quad\text{for all}\,\,\,1\leq i\leq I,1\leq j\leq J,1\leq k\leq K.

The partitions λ,μ,ρ\lambda,\mu,\rho are the sufficient statistics for the complete independence model: The probability of an entry being (i,j,k)(i,j,k) is pijk=θiηjγkp_{ijk}=\theta_{i}\eta_{j}\gamma_{k}. A table can be thought of as a tri-partite hypergraph, where each edge connects exactly three vertices.

Representing a table by a set of nn tuples (i,j,k)(i,j,k), we can describe a similar adjacent swap Markov chain as: Pick 22 tuples (i1,j1,k1),(i2,j2,k2)(i_{1},j_{1},k_{1}),(i_{2},j_{2},k_{2}), pick r{1,2,3}r\in\{1,2,3\} uniformly, and swap the rrth entry in the two tuples. For example,

(i1,j1,k1),(i2,j2,k2){(i2,j1,k1),(i1,j2,k2)(i1,j2,k1),(i2,j1,k2)(i1,j1,k2),(i2,j2,k1).(i_{1},j_{1},k_{1}),(i_{2},j_{2},k_{2})\to\begin{cases}(i_{2},j_{1},k_{1}),(i_{1},j_{2},k_{2})\\ (i_{1},j_{2},k_{1}),(i_{2},j_{1},k_{2})\\ (i_{1},j_{1},k_{2}),(i_{2},j_{2},k_{1})\end{cases}.

This move still corresponds to adding (1111)\begin{pmatrix}-1&1\cr 1&-1\end{pmatrix} to a 2×22\times 2 submatrix of the contingency table. The probability of picking the two tuples is 2Ti1j1k1Ti2j2k2/n22T_{i_{1}j_{1}k_{1}}\cdot T_{i_{2}j_{2}k_{2}}/n^{2}. To be precise about the Markov chain, let F(i1,j1,k1),(i2,j2,k2),r(T)F_{(i_{1},j_{1},k_{1}),(i_{2},j_{2},k_{2}),r}(T) denote the table where the swap was made in the indicated entries of the table at index rr, with r{1,2,3}r\in\{1,2,3\}. Then,

P(T,F(i1,j1,k1),(i2,j2,k2),r(T))=132Ti1j1k1Ti2j2k2n2.\displaystyle P(T,F_{(i_{1},j_{1},k_{1}),(i_{2},j_{2},k_{2}),r}(T))=\frac{1}{3}\cdot\frac{2T_{i_{1}j_{1}k_{1}}T_{i_{2}j_{2}k_{2}}}{n^{2}}.

These same moves were proposed in [17] as input for a Metropolis Markov chain for log-linear models on multi-way tables (Section 4.2); for that Markov chain the 22 tuples to be swapped are chosen uniformly from all entries (and if the swap move results in a negative value in the table, it is rejected), thus the chain has uniform stationary distribution.
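
A small sketch of one move of this chain follows (the function name and the example table are illustrative choices, not from the paper); tables are stored sparsely as a dictionary of tuple counts.

```python
# Sketch of one move of the swap chain on 3-way tables: pick two of the n tuples
# with probability proportional to the entries, pick a coordinate r uniformly,
# and swap that coordinate between the two tuples.  The example table is arbitrary.
import random

def step(T):
    """One move; T is a dict {(i, j, k): count} with counts summing to n."""
    cells = list(T)
    weights = [T[c] for c in cells]
    a = random.choices(cells, weights)[0]      # first tuple, probability T[a]/n
    b = random.choices(cells, weights)[0]      # second tuple, probability T[b]/n
    r = random.randrange(3)                    # coordinate to swap
    a2 = tuple(b[r] if s == r else a[s] for s in range(3))
    b2 = tuple(a[r] if s == r else b[s] for s in range(3))
    if a2 == a:                                # the swap changes nothing
        return T
    T = dict(T)
    T[a] -= 1
    T[b] -= 1
    T[a2] = T.get(a2, 0) + 1
    T[b2] = T.get(b2, 0) + 1
    return {c: v for c, v in T.items() if v > 0}

T = {(0, 0, 0): 3, (0, 1, 1): 2, (1, 0, 1): 2, (1, 1, 0): 1}   # n = 8
for _ in range(5):
    T = step(T)
print(T)
```
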

Theorem 5.1.

The Markov chain PP on 𝒯λ,μ,ρ\mathcal{T}_{\lambda,\mu,\rho} is connected and reversible with respect to the natural analog of Fisher-Yates for 3-way tables:

πλ,μ,ρ(T):=1n!i,j,kλi!μj!ρk!Tijk!.\pi_{\lambda,\mu,\rho}(T):=\frac{1}{n!}\prod_{i,j,k}\frac{\lambda_{i}!\mu_{j}!\rho_{k}!}{T_{ijk}!}.
Proof.

To see that the Markov chain is connected, we can define a partial order on the space 𝒯λ,μ,ρ\mathcal{T}_{\lambda,\mu,\rho}, analogous to the majorization order on 𝒯λ,μ\mathcal{T}_{\lambda,\mu} used in Proposition 2.10: Write TTT\prec T^{\prime} if

\sum_{i=1}^{R}\sum_{j=1}^{S}\sum_{k=1}^{U}T_{ijk}\leq\sum_{i=1}^{R}\sum_{j=1}^{S}\sum_{k=1}^{U}T^{\prime}_{ijk},\quad\text{for all}\,\,\,1\leq R\leq I,\ 1\leq S\leq J,\ 1\leq U\leq K.

One move of the Markov chain corresponds to one edge in the lattice defined by this partial order (i.e. if P(T,T)>0P(T,T^{\prime})>0 then TTT\prec T^{\prime} or TTT^{\prime}\prec T). Thus, every table can transition to the largest element and so the space is connected under PP.

To see that the chain is reversible with the correct stationary distribution, given T𝒯λ,μ,ρT\in\mathcal{T}_{\lambda,\mu,\rho} suppose T=F(i1,j1,k1),(i2,j2,k2),r(T)T^{\prime}=F_{(i_{1},j_{1},k_{1}),(i_{2},j_{2},k_{2}),r}(T) with r=1r=1 so that P(T,T)>0P(T,T^{\prime})>0. Then,

πλ,μ,ρ(T)=πλ,μ,ρ(T)Ti1j1k1Ti2j2k2(Ti2j1k1+1)(Ti1j2k2+1).\displaystyle\pi_{\lambda,\mu,\rho}(T^{\prime})=\pi_{\lambda,\mu,\rho}(T)\cdot\frac{T_{i_{1}j_{1}k_{1}}T_{i_{2}j_{2}k_{2}}}{(T_{i_{2}j_{1}k_{1}}+1)(T_{i1j_{2}k_{2}}+1)}.

Then,

πλ,μ,ρ(T)P(T,T)\displaystyle\pi_{\lambda,\mu,\rho}(T^{\prime})P(T^{\prime},T) =πλ,μ,ρ(T)Ti1j1k1Ti2j2k2(Ti2j1k1+1)(Ti1j2k2+1)2(Ti2j1k1+1)(Ti1j2k2+1)3n2\displaystyle=\pi_{\lambda,\mu,\rho}(T)\cdot\frac{T_{i_{1}j_{1}k_{1}}T_{i_{2}j_{2}k_{2}}}{(T_{i_{2}j_{1}k_{1}}+1)(T_{i1j_{2}k_{2}}+1)}\cdot\frac{2(T_{i_{2}j_{1}k_{1}}+1)(T_{i1j_{2}k_{2}}+1)}{3n^{2}}
=πλ,μ,ρ(T)2Ti1j1k1Ti2j2k23n2=πλ,μ,ρ(T)P(T,T).\displaystyle=\pi_{\lambda,\mu,\rho}(T)\cdot\frac{2T_{i_{1}j_{1}k_{1}}T_{i_{2}j_{2}k_{2}}}{3n^{2}}=\pi_{\lambda,\mu,\rho}(T)P(T,T^{\prime}).

The calculation is analogous for r=2,3r=2,3. ∎

Remark 16.

It would be interesting to have a double-coset representation for 𝒯λ,μ,ρ\mathcal{T}_{\lambda,\mu,\rho}, but we have not discovered one.

5.3 Comparison of Markov Chains

This new chain on contingency tables can be compared to other Markov chains on contingency tables using simple spectral comparison techniques, as developed in [13].

The comparison technique relies on the variational characterization of the spectral gap of a Markov chain: If PP is a reversible Markov chain on Ω\Omega with stationary distribution π\pi and γ=1λ2\gamma=1-\lambda_{2} the spectral gap, then

\gamma=\min_{f:\Omega\to\mathbb{R},\,\text{Var}_{\pi}(f)\neq 0}\frac{\mathcal{E}(f)}{\text{Var}_{\pi}(f)},

where

(f):=12x,yΩ(f(x)f(y))2π(x)P(x,y).\mathcal{E}(f):=\frac{1}{2}\sum_{x,y\in\Omega}\left(f(x)-f(y)\right)^{2}\pi(x)P(x,y).

The relaxation time τ\tau is defined as the inverse of the absolute spectral gap τ=1/γ\tau=1/\gamma, and can be used as one measure of mixing.

Lemma 5.2 (Lemma 13.22 in [37]).

Let PP and P~\widetilde{P} be reversible transition matrices on Ω\Omega with stationary distributions π\pi and π~\widetilde{\pi}, respectively. If ~(f)α(f)\widetilde{\mathcal{E}}(f)\leq\alpha\mathcal{E}(f) for all ff, then

γ~(maxxΩπ(x)π~(x))αγ.\widetilde{\gamma}\leq\left(\max_{x\in\Omega}\frac{\pi(x)}{\widetilde{\pi}(x)}\right)\alpha\gamma. (24)

The term maxxΩπ(x)/π~(x)\max_{x\in\Omega}\pi(x)/\widetilde{\pi}(x) makes it difficult to compare Markov chains with stationary distributions which are very different. The uniform distribution and the Fisher-Yates distribution on 𝒯λ,μ\mathcal{T}_{\lambda,\mu} are exponentially different and the ratio of the two distributions cannot be bounded by a constant.

Example 5.3.

With λ=(nk,k),μ=(n,)\lambda=(n-k,k),\mu=(n-\ell,\ell), with kn/2k\leq\ell\leq n/2, there are k+1k+1 tables. The table with the smallest Fisher-Yates probability is

T=(nkk0),T=\begin{pmatrix}n-\ell-k&\ell\cr k&0\end{pmatrix},

with πFY(T)=(nk)!(n)!/((nk)!n!)\pi_{FY}(T)=(n-k)!(n-\ell)!/((n-k-\ell)!n!). Taking k==n/2k=\ell=n/2 for example gives

πFY(T)=(n/2)!(n/2)!n!(n/2)n/2(n/2)n/2nn=12n,\pi_{FY}(T)=\frac{(n/2)!(n/2)!}{n!}\sim\frac{(n/2)^{n/2}(n/2)^{n/2}}{n^{n}}=\frac{1}{2^{n}},

by Stirling’s approximation. This is in sharp contrast with the uniform probability πU(T)=1/(k+1)2/n\pi_{U}(T)=1/(k+1)\sim 2/n.

Instead, we can use the Metropolis algorithm on either the symmetric swap Markov chain or the random transpositions Markov chain to obtain two chains with the same stationary distribution, which can then be compared. This is done in both cases in the following sections.

Let PFYP_{FY} be the random transpositions Markov chain on 𝒯λ,μ\mathcal{T}_{\lambda,\mu} from Definition 1.4. Let PUP_{U} be the symmetric swap Markov chain which has stationary uniform distribution. That is,

P_{U}(\mathbf{x},\mathbf{y})=\begin{cases}\frac{2}{(IJ)^{2}}&\text{if}\quad\mathbf{y}=F_{(i_{1},j_{1}),(i_{2},j_{2})}(\mathbf{x})\\ 1-\sum_{i_{1}<i_{2}}\sum_{j_{1}<j_{2}}\frac{2}{(IJ)^{2}}&\text{if}\quad\mathbf{y}=\mathbf{x}\\ 0&\text{otherwise}\end{cases}.

Let PUMP_{U}^{M} be the chain from Metropolizing PFYP_{FY} to have uniform stationary distribution and PFYMP_{FY}^{M} be the result of Metropolizing PUP_{U} to have stationary distribution Fisher-Yates.

Theorem 5.4.

Let τFY,τFYM,τU,τUM\tau_{FY},\tau_{FY}^{M},\tau_{U},\tau_{U}^{M} be the associated relaxation times for the Markov chains PFY,PFYM,PU,PUMP_{FY},P_{FY}^{M},P_{U},P_{U}^{M}. Define

\displaystyle m_{\lambda,\mu}=\min\{x_{ij}>0:\mathbf{x}\in\mathcal{T}_{\lambda,\mu}\}
\displaystyle M_{\lambda,\mu}=\max\{x_{ij}>0:\mathbf{x}\in\mathcal{T}_{\lambda,\mu}\}.

If |λ|=I,|μ|=J|\lambda|=I,|\mu|=J, then

  1. (a)
    mλ,μ2(IJ)2n2τUMτUMλ,μ2(IJ)2n2τUM,\frac{m_{\lambda,\mu}^{2}(IJ)^{2}}{n^{2}}\tau_{U}^{M}\leq\tau_{U}\leq\frac{M_{\lambda,\mu}^{2}(IJ)^{2}}{n^{2}}\tau_{U}^{M},
  2. (b)
    n2(IJ)2Mλ,μ4τFYMτFYn2(IJ)2mλ,μ4τFYM.\frac{n^{2}}{(IJ)^{2}M_{\lambda,\mu}^{4}}\tau_{FY}^{M}\leq\tau_{FY}\leq\frac{n^{2}}{(IJ)^{2}m_{\lambda,\mu}^{4}}\tau_{FY}^{M}.

For any λ,μ\lambda,\mu, note that mλ,μ1m_{\lambda,\mu}\geq 1 and Mλ,μmax(λi,μj)M_{\lambda,\mu}\leq\max(\lambda_{i},\mu_{j}).

Example 5.5.

If λ=μ=(c,c,,c)\lambda=\mu=(c,c,\dots,c), I=J=n/cI=J=n/c, then mλ,μ=1m_{\lambda,\mu}=1 and Mλ,μ=cM_{\lambda,\mu}=c. Then Theorem 5.4 shows

n2c4τUMτUn2c2τUM.\frac{n^{2}}{c^{4}}\tau_{U}^{M}\leq\tau_{U}\leq\frac{n^{2}}{c^{2}}\tau_{U}^{M}.

The intuition is that if nn is growing and cc is fixed then the tables are relatively sparse, and so the Metropolis chain is much more likely than the uniform swap chain to propose moves which are accepted. Accepting more moves means exploring the state space more quickly, and thus achieving stationarity. Similarly, for the chains with Fisher-Yates stationary distribution in this setting,

1n2τFYMτFYc4n2τFYM.\frac{1}{n^{2}}\tau_{FY}^{M}\leq\tau_{FY}\leq\frac{c^{4}}{n^{2}}\tau_{FY}^{M}.

Again, if cc is fixed and nn is larger the Markov chain utilizing the random transpositions has significantly smaller relaxation time than the Markov chain created by using the uniform swap moves with the Metropolis algorithm. Note that if c=1c=1, then the Fisher-Yates distribution is exactly the uniform distribution.

The remainder of this section reviews the Metropolis-Hastings Algorithm, finds the transition probabilities PUMP_{U}^{M} and PFYMP_{FY}^{M}, and then proves Theorem 5.4.

Metropolis-Hastings Algorithm

The random transpositions chain could be used as the proposal Markov chain in the Metropolis-Hastings algorithm to sample from the uniform distribution. Suppose PP is a Markov chain on Ω\Omega and π\pi is a target distribution. A new Markov chain on Ω\Omega which is reversible with respect to π\pi is defined as follows: From xx,

  1. 1.

    Generate a random candidate yy according to one step of the PP chain, i.e. yP(x,)y\sim P(x,\cdot).

  2. 2.

    Compute the acceptance probability

    A(x,y)=min(1,π(y)P(y,x)π(x)P(x,y)).A(x,y)=\min\left(1,\frac{\pi(y)P(y,x)}{\pi(x)P(x,y)}\right).
  3. 3.

    With probability A(x,y)A(x,y), move to yy. Otherwise, stay at the current state xx.

The transitions for the new Markov chain are defined

K(x,y)=P(x,y)A(x,y).K(x,y)=P(x,y)\cdot A(x,y).
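
As an illustration (a sketch only; the function name and starting table are arbitrary), the three steps above can be implemented with the random transpositions proposal and the uniform target; this corresponds to the chain PUMP_{U}^{M} discussed in the next subsection.

```python
# Sketch: one Metropolis step with the random transpositions proposal and the
# uniform target distribution on tables with fixed margins.  The starting table
# is arbitrary.
import random

def metropolis_uniform_step(T):
    I, J = len(T), len(T[0])
    cells = [(i, j) for i in range(I) for j in range(J)]
    w = [T[i][j] for i, j in cells]
    # Step 1: propose a move by picking two symbols and swapping their columns.
    (i1, j1), (i2, j2) = random.choices(cells, w)[0], random.choices(cells, w)[0]
    if i1 == i2 or j1 == j2:
        return T                               # proposal leaves the table unchanged
    # Step 2: acceptance probability min(1, P(y, x) / P(x, y)) for the uniform target.
    accept = (T[i1][j2] + 1) * (T[i2][j1] + 1) / (T[i1][j1] * T[i2][j2])
    # Step 3: accept or stay at the current table.
    if random.random() < min(1.0, accept):
        T = [list(row) for row in T]
        T[i1][j1] -= 1; T[i2][j2] -= 1; T[i1][j2] += 1; T[i2][j1] += 1
    return T

T = [[4, 2, 1], [1, 3, 2], [0, 1, 4]]
for _ in range(10):
    T = metropolis_uniform_step(T)
print(T)
```
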

Uniform Distribution Comparison

For two states 𝐱,𝐲𝒯λ,μ\mathbf{x},\mathbf{y}\in\mathcal{T}_{\lambda,\mu} such that P(𝐱,𝐲)>0P(\mathbf{x},\mathbf{y})>0, the acceptance probability can be computed explicitly: since PP is reversible with respect to the Fisher-Yates distribution πλ,μ\pi_{\lambda,\mu}, the ratio is

P(𝐲,𝐱)P(𝐱,𝐲)=πλ,μ(𝐱)πλ,μ(𝐲)=(xi1,j2+1)(xi2,j1+1)xi1,j1xi2,j2,\frac{P(\mathbf{y},\mathbf{x})}{P(\mathbf{x},\mathbf{y})}=\frac{\pi_{\lambda,\mu}(\mathbf{x})}{\pi_{\lambda,\mu}(\mathbf{y})}=\frac{(x_{i_{1},j_{2}}+1)(x_{i_{2},j_{1}}+1)}{x_{i_{1},j_{1}}\cdot x_{i_{2},j_{2}}},

supposing that 𝐲=F(i1,j1),(i2,j2)(𝐱)\mathbf{y}=F_{(i_{1},j_{1}),(i_{2},j_{2})}(\mathbf{x}).

In conclusion, the Metropolis chain has transitions

\displaystyle P_{U}^{M}(\mathbf{x},\mathbf{y})=\frac{2x_{i_{1},j_{1}}x_{i_{2},j_{2}}}{n^{2}}\cdot\min\left(1,\frac{(x_{i_{1},j_{2}}+1)(x_{i_{2},j_{1}}+1)}{x_{i_{1},j_{1}}\cdot x_{i_{2},j_{2}}}\right)=\min\left(\frac{2x_{i_{1},j_{1}}x_{i_{2},j_{2}}}{n^{2}},\frac{2(x_{i_{1},j_{2}}+1)(x_{i_{2},j_{1}}+1)}{n^{2}}\right)
=min(PFY(𝐱,𝐲),PFY(𝐲,𝐱)).\displaystyle=\min\left(P_{FY}(\mathbf{x},\mathbf{y}),P_{FY}(\mathbf{y},\mathbf{x})\right).

This is clearly a symmetric Markov chain and has uniform stationary distribution. Note that, in general, any chain P(x,y)P(x,y), when transformed via Metropolis to have the uniform stationary distribution, is of the form min(P(x,y),P(y,x))\min(P(x,y),P(y,x)) for xyx\neq y.

To compare the Dirichlet forms, since the stationary distributions are the same it is enough to compare the transition probabilities. This can be done as follows:

PU(𝐱,𝐲)=n2mλ,μ2(IJ)22mλ,μ2n2n2mλ,μ2(IJ)2PUM(𝐱,𝐲),P_{U}(\mathbf{x},\mathbf{y})=\frac{n^{2}}{m_{\lambda,\mu}^{2}(IJ)^{2}}\cdot\frac{2m_{\lambda,\mu}^{2}}{n^{2}}\leq\frac{n^{2}}{m_{\lambda,\mu}^{2}(IJ)^{2}}P_{U}^{M}(\mathbf{x},\mathbf{y}),

and similarly

PFY(𝐱,𝐲)=n2Mλ,μ2(IJ)22Mλ,μ2n2n2Mλ,μ2(IJ)2PUM(𝐱,𝐲).P_{FY}(\mathbf{x},\mathbf{y})=\frac{n^{2}}{M_{\lambda,\mu}^{2}(IJ)^{2}}\cdot\frac{2M_{\lambda,\mu}^{2}}{n^{2}}\geq\frac{n^{2}}{M_{\lambda,\mu}^{2}(IJ)^{2}}P_{U}^{M}(\mathbf{x},\mathbf{y}).

This gives the result, since

α1PUM(𝐱,𝐲)PU(𝐱,𝐲)α2PUM(𝐱,𝐲)1α2τUMτU1α1τUM.\displaystyle\alpha_{1}P_{U}^{M}(\mathbf{x},\mathbf{y})\leq P_{U}(\mathbf{x},\mathbf{y})\leq\alpha_{2}P_{U}^{M}(\mathbf{x},\mathbf{y})\implies\frac{1}{\alpha_{2}}\tau_{U}^{M}\leq\tau_{U}\leq\frac{1}{\alpha_{1}}\tau_{U}^{M}.

Fisher-Yates Distribution Comparison

In the other direction, we could take the symmetric swap Markov chain and Metropolize it to have stationary distribution Fisher-Yates. The acceptance probability in this direction is

A(𝐱,𝐲)\displaystyle A(\mathbf{x},\mathbf{y}) =min(1,πFY(𝐲)PU(𝐲,𝐱)πFY(𝐱)PU(𝐱,𝐲))\displaystyle=\min\left(1,\frac{\pi_{FY}(\mathbf{y})P_{U}(\mathbf{y},\mathbf{x})}{\pi_{FY}(\mathbf{x})P_{U}(\mathbf{x},\mathbf{y})}\right)
=min(1,πFY(𝐲)πFY(𝐱))=min(1,xi1,j1xi2,j2(xi1,j2+1)(xi2,j1+1)).\displaystyle=\min\left(1,\frac{\pi_{FY}(\mathbf{y})}{\pi_{FY}(\mathbf{x})}\right)=\min\left(1,\frac{x_{i_{1},j_{1}}\cdot x_{i_{2},j_{2}}}{(x_{i_{1},j_{2}}+1)(x_{i_{2},j_{1}}+1)}\right).

The transition probabilities, for 𝐲=F(i1,j1),(i2,j2)(𝐱)\mathbf{y}=F_{(i_{1},j_{1}),(i_{2},j_{2})}(\mathbf{x}) are then

PFYM(𝐱,𝐲)=2(IJ)2min(1,xi1,j1xi2,j2(xi1,j2+1)(xi2,j1+1)).\displaystyle P_{FY}^{M}(\mathbf{x},\mathbf{y})=\frac{2}{(IJ)^{2}}\min\left(1,\frac{x_{i_{1},j_{1}}\cdot x_{i_{2},j_{2}}}{(x_{i_{1},j_{2}}+1)(x_{i_{2},j_{1}}+1)}\right).

To compare this with the random transpositions chain P_{FY},

P_{FY}(\mathbf{x},\mathbf{y})=\frac{2x_{i_{1},j_{1}}x_{i_{2},j_{2}}}{n^{2}}\leq\frac{2M_{\lambda,\mu}^{2}}{n^{2}}
=\frac{(IJ)^{2}M_{\lambda,\mu}^{4}}{n^{2}m_{\lambda,\mu}^{2}}\cdot\frac{2m_{\lambda,\mu}^{2}}{M_{\lambda,\mu}^{2}(IJ)^{2}}\leq\frac{(IJ)^{2}M_{\lambda,\mu}^{4}}{n^{2}m_{\lambda,\mu}^{2}}\cdot P_{FY}^{M}(\mathbf{x},\mathbf{y}),

and similarly

P_{FY}(\mathbf{x},\mathbf{y})=\frac{2x_{i_{1},j_{1}}x_{i_{2},j_{2}}}{n^{2}}\geq\frac{2m_{\lambda,\mu}^{2}}{n^{2}}
=\frac{(IJ)^{2}m_{\lambda,\mu}^{4}}{n^{2}M_{\lambda,\mu}^{2}}\cdot\frac{2M_{\lambda,\mu}^{2}}{m_{\lambda,\mu}^{2}(IJ)^{2}}\geq\frac{(IJ)^{2}m_{\lambda,\mu}^{4}}{n^{2}M_{\lambda,\mu}^{2}}\cdot P_{FY}^{M}(\mathbf{x},\mathbf{y}).

This proves part (b) of Theorem 5.4.

5.4 Parabolic Subgroups of GL_{n}

Contingency tables also arise from double cosets of the parabolic subgroups of GL_{n}(q), as described in [34].

Definition 5.6.

Let \alpha=(\alpha_{1},\dots,\alpha_{k}) be a partition of n. The parabolic subgroup P_{\alpha}\subset GL_{n}(q) consists of all invertible block upper-triangular matrices with diagonal block sizes \alpha_{1},\dots,\alpha_{k}.
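As an illustration of Definition 5.6, the following Python sketch tests whether a matrix lies in P_{\alpha}, taking q to be a prime so that entries can be reduced mod q; the function name in_parabolic is illustrative, and the floating-point determinant test limits it to small matrices.

import numpy as np
from itertools import accumulate

def in_parabolic(M, alpha, q):
    """Check that M is invertible over F_q (q prime) and block upper-triangular
    with diagonal block sizes alpha, i.e. that M represents an element of P_alpha."""
    M = np.asarray(M) % q
    n = M.shape[0]
    assert M.shape == (n, n) and sum(alpha) == n
    cuts = list(accumulate(alpha))
    # block[i] = index of the diagonal block containing row/column i
    block = [next(b for b, c in enumerate(cuts) if i < c) for i in range(n)]
    # Entries strictly below the block diagonal must vanish.
    for i in range(n):
        for j in range(n):
            if block[i] > block[j] and M[i, j] != 0:
                return False
    # Invertibility over F_q for prime q: determinant nonzero mod q.
    return round(np.linalg.det(M)) % q != 0

For example, with alpha = (2, 1) the condition forces the bottom-left 1 x 2 block to be zero, which is exactly the block upper-triangular shape in the definition.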

Section 4 of [34] shows that if \alpha,\beta are two partitions of n, then the double cosets P_{\alpha}\backslash GL_{n}(q)/P_{\beta} are indexed by contingency tables with row sums \alpha and column sums \beta. Proposition 4.37 of [34] gives the size of the double coset corresponding to a table T; in terms of \theta=1/q, and with I, J denoting the numbers of parts of \alpha, \beta, it is

\theta^{-n^{2}+\sum_{1\leq i<i^{\prime}\leq I,\,1\leq j<j^{\prime}\leq J}T_{ij}T_{i^{\prime}j^{\prime}}}(1-\theta)^{n}\cdot\frac{\prod_{i=1}^{I}[\alpha_{i}]_{\theta}!\prod_{j=1}^{J}[\beta_{j}]_{\theta}!}{\prod_{i,j}[T_{ij}]_{\theta}!}. (25)

Dividing (25) by (1-\theta)^{n} and setting \theta=1 recovers the usual Fisher-Yates distribution for the partitions \alpha,\beta. It remains an open problem to investigate these probability distributions, and the associated Markov chains, on the double cosets of GL_{n}(q) coming from parabolic subgroups.
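As a small computational aid, the following Python sketch evaluates the expression in (25) exactly as displayed, using the \theta-factorial [k]_{\theta}!=\prod_{i=1}^{k}(1+\theta+\dots+\theta^{i-1}); the function names are illustrative and the code is only a direct transcription of the displayed formula.

import numpy as np

def theta_factorial(k, theta):
    """[k]_theta! = [1]_theta [2]_theta ... [k]_theta, with [i]_theta = 1 + theta + ... + theta^(i-1)."""
    out = 1.0
    for i in range(1, k + 1):
        out *= sum(theta**m for m in range(i))
    return out

def weight_25(T, alpha, beta, theta):
    """Evaluate the quantity in (25) for a table T with row sums alpha and column sums beta."""
    T = np.asarray(T)
    I, J = T.shape
    n = int(T.sum())
    # Exponent of theta: -n^2 plus the sum of T_{ij} T_{i'j'} over pairs i < i', j < j'.
    expo = -n**2 + sum(
        T[i, j] * T[ip, jp]
        for i in range(I) for ip in range(i + 1, I)
        for j in range(J) for jp in range(j + 1, J)
    )
    num = np.prod([theta_factorial(a, theta) for a in alpha]) * \
          np.prod([theta_factorial(b, theta) for b in beta])
    den = np.prod([theta_factorial(T[i, j], theta) for i in range(I) for j in range(J)])
    return theta**float(expo) * (1 - theta)**n * num / den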

References

  • [1] A. Agresti. A survey of exact inference for contingency tables. Statist. Sci., 7(1):131–177, 1992. With comments and a rejoinder by the author.
  • [2] G. Amanatidis and P. Kleer. Rapid mixing of the switch Markov chain for strongly stable degree sequences. Random Structures & Algorithms, 57(3):637–657, 2020.
  • [3] E. D. Belsley. Rates of convergence of random walk on distance regular graphs. Probability theory and related fields, 112(4):493–533, 1998.
  • [4] S. C. Billey, M. Konvalinka, T. K. Petersen, W. Slofstra, and B. E. Tenner. Parabolic double cosets in Coxeter groups, 2017.
  • [5] O. Bormashenko. A coupling argument for the random transposition walk. arXiv preprint arXiv:1109.3915, 2011.
  • [6] T. Ceccherini-Silberstein, F. Scarabotti, and F. Tolli. Harmonic analysis on finite groups, volume 108. Cambridge University Press, 2008.
  • [7] F. R. K. Chung, R. L. Graham, and S.-T. Yau. On sampling with Markov chains. Random Structures & Algorithms, 9(1-2):55–77, 1996.
  • [8] P. Diaconis. Group representations in probability and statistics, volume 11 of Institute of Mathematical Statistics Lecture Notes—Monograph Series. Institute of Mathematical Statistics, Hayward, CA, 1988.
  • [9] P. Diaconis. A generalization of spectral analysis with application to ranked data. The Annals of Statistics, pages 949–979, 1989.
  • [10] P. Diaconis and B. Efron. Testing for independence in a two-way table: new interpretations of the chi-square statistic. Ann. Statist., 13(3):845–913, 1985. With discussions and with a reply by the authors.
  • [11] P. Diaconis and A. Gangolli. Rectangular arrays with fixed margins. In Discrete probability and algorithms, pages 15–41. Springer, 1995.
  • [12] P. Diaconis, A. Ram, and M. Simper. Double coset Markov chains and random transvections, 2022.
  • [13] P. Diaconis and L. Saloff-Coste. Comparison theorems for reversible Markov chains. The Annals of Applied Probability, 3(3):696–730, 1993.
  • [14] P. Diaconis and M. Shahshahani. Generating a random permutation with random transpositions. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 57(2):159–179, 1981.
  • [15] P. Diaconis and M. Shahshahani. Time to reach stationarity in the Bernoulli-Laplace diffusion model. SIAM J. Math. Anal., 18(1):208–218, 1987.
  • [16] P. Diaconis and M. Simper. Statistical enumeration of groups by double cosets. Journal of Algebra, 607:214–246, 2022. Special Issue dedicated to J. Saxl.
  • [17] P. Diaconis and B. Sturmfels. Algebraic algorithms for sampling from conditional distributions. Annals of statistics, 26(1):363–397, 1998.
  • [18] P. Diaconis and P. M. Wood. Random doubly stochastic tridiagonal matrices. Random Structures & Algorithms, 42(4):403–437, 2013.
  • [19] C. F. Dunkl and Y. Xu. Orthogonal Polynomials of Several Variables. Encyclopedia of Mathematics and its Applications. Cambridge University Press, 2 edition, 2014.
  • [20] U. Dutta, B. K. Fosdick, and A. Clauset. Sampling random graphs with specified degree sequences. arXiv e-prints, pages arXiv–2105, 2021.
  • [21] A. Eskenazis and E. Nestoridi. Cutoff for the Bernoulli–Laplace urn model with o(n) swaps. In Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, volume 56, pages 2621–2639. Institut Henri Poincaré, 2020.
  • [22] M. Fayers. A note on Kostka numbers. arXiv preprint arXiv:1903.12499, 2019.
  • [23] B. Griffiths. Orthogonal polynomials on the multinomial distribution. Australian Journal of Statistics, 13(1):27–35, 1971.
  • [24] R. C. Griffiths and D. Spano. Multivariate Jacobi and Laguerre polynomials, infinite-dimensional extensions, and their probabilistic connections with multivariate Hahn and Meixner polynomials. Bernoulli, 17(3):1095–1125, 2011.
  • [25] D. Hernek. Random generation of 2×n contingency tables. Random Structures & Algorithms, 13(1):71–79, 1998.
  • [26] P. Iliev and Y. Xu. Discrete orthogonal polynomials and difference equations of several variables. Advances in Mathematics, 212(1):1–36, 2007.
  • [27] I. M. Isaacs. Character theory of finite groups, volume 69. Courier Corporation, 1994.
  • [28] M. E. H. Ismail. Classical and Quantum Orthogonal Polynomials in One Variable. Encyclopedia of Mathematics and its Applications. Cambridge University Press, 2005.
  • [29] G. James and A. Kerber. The Representation Theory of the Symmetric Group. Encyclopedia of Mathematics and its Applications. Cambridge University Press, 1984.
  • [30] G. James, M. W. Liebeck, and M. Liebeck. Representations and characters of groups. Cambridge University Press, 2001.
  • [31] G. D. James. The representation theory of the symmetric groups. In Proc. Symposia in Pure Math, volume 47, pages 111–126, 1987.
  • [32] H. Joe. An ordering of dependence for contingency tables. Linear algebra and its applications, 70:89–103, 1985.
  • [33] S.-h. Kang and J. Klotz. Limiting conditional distribution for tests of independence in the two way table. Comm. Statist. Theory Methods, 27(8):2075–2082, 1998.
  • [34] S. N. Karp and H. Thomas. q-Whittaker functions, finite fields, and Jordan forms. arXiv preprint arXiv:2207.12590, 2022.
  • [35] K. Khare and H. Zhou. Rates of convergence of some multivariate Markov chains with polynomial eigenfunctions. The Annals of Applied Probability, 19(2):737–777, 2009.
  • [36] H. O. Lancaster. The chi-squared distribution. John Wiley & Sons, Inc., New York-London-Sydney, 1969.
  • [37] D. A. Levin and Y. Peres. Markov chains and mixing times, volume 107. American Mathematical Soc., 2017.
  • [38] I. G. Macdonald. Symmetric functions and Hall polynomials. Oxford university press, 1998.
  • [39] B. MacMahon. Mental health in the metropolis: The midtown Manhattan study, 1963.
  • [40] A. W. Marshall, I. Olkin, and B. C. Arnold. Inequalities: theory of majorization and its applications, volume 143. Springer, 1979.
  • [41] E. Nestoridi and O. Nguyen. On the mixing time of the Diaconis–Gangolli random walk on contingency tables over \mathbb{Z}/q\mathbb{Z}. In Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, volume 56, pages 983–1001. Institut Henri Poincaré, 2020.
  • [42] J. Paguyo. Fixed points, descents, and inversions in parabolic double cosets of the symmetric group. arXiv preprint arXiv:2112.07728, 2021.
  • [43] C. Pang. Lumpings of algebraic Markov chains arise from subquotients. Journal of Theoretical Probability, 32(4):1804–1844, 2019.
  • [44] R. E. Ingram. Some characters of the symmetric group. Proceedings of the American Mathematical Society, pages 358–369, 1950.
  • [45] J. Salzman. Spectral analysis with Markov chains. PhD thesis, Stanford University, 2007. Copyright - Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works; Last updated - 2021-09-28.
  • [46] F. Scarabotti. Time to reach stationarity in the Bernoulli–Laplace diffusion model with many urns. Advances in Applied Mathematics, 18(3):351–371, 1997.
  • [47] J.-P. Serre. Linear representations of finite groups, volume 42. Springer, 1977.
  • [48] R. P. Stanley. Enumerative Combinatorics, volume 1 of Cambridge Studies in Advanced Mathematics. Cambridge University Press, 2 edition, 2011.