Geometric Mean Type of Proportional Reduction in Variation Measure for Two-Way Contingency Tables

Wataru Urasaki Department of Information Sciences, Tokyo University of Science Yuki Wada Department of Information Sciences, Tokyo University of Science Tomoyuki Nakagawa School of Data Science, Meisei University Kouji Tahata Department of Information Sciences, Tokyo University of Science Sadao Tomizawa School of Data Science, Meisei University Department of Information Science, Meisei University

Abstract

In a two-way contingency table analysis with explanatory and response variables, the analyst is first interested in whether the two variables are independent. If the test of independence rejects independence, or the variables are clearly related, the interest shifts to the degree of their association. Various measures have been proposed to quantify this degree, one of which is the proportional reduction in variation (PRV) measure, which describes the proportional reduction in the variation of the response when moving from its marginal distribution to its conditional distributions. The conventional PRV measures can assess the association of the entire contingency table, but they cannot accurately assess the association for each category of the explanatory variable. In this paper, we propose a geometric mean type of PRV (geoPRV) measure that aims to sensitively capture the association of each category of the explanatory variable with the response variable by using a geometric mean, which enables analysis without underestimation when the cells of the contingency table are partially biased. Furthermore, the geoPRV measure is constructed from any function that satisfies specific conditions, which gives it flexibility in applications and makes it possible to express conventional PRV measures as geometric mean types in special cases.

Keywords: Contingency table, Diversity index, Geometric mean, Independence, Measure of association, Proportional Reduction in Variation

Mathematics Subject Classification: 62H17, 62H20

1 Introduction

Categorical variables consist of categories and are employed in various fields such as medicine, psychology, education, and social science. Consider two categorical variables, one consisting of $R$ categories and the other consisting of $C$ categories. These two variables have $R\times C$ combinations, which can be represented in a table with $R$ rows and $C$ columns. This is called a two-way contingency table, where each $(i,j)$ cell ($i=1,2,\ldots,R;\ j=1,2,\ldots,C$) displays the observed frequency. Typically, the two-way contingency table is used to evaluate whether the two variables are statistically independent. If independence is rejected, for example by Pearson’s chi-square test, or the variables are clearly related, we are interested in the strength of their association.

As a method to investigate the association structure of the contingency table, association models have been proposed by Gilula and Haberman (1986), Goodman (1981, 1985), and Rom and Sarkar (1992). This approach can determine whether there is a relationship between the row and column variables through goodness-of-fit tests of the models. However, it focuses only on whether a relationship exists and cannot quantify the degree of association.

Instead of goodness-of-fit tests with models, a variety of measures that express the degree of association within the interval from 0 to 1 have been proposed by Agresti (2003), Bishop et al. (2007), Cramér (1999), Everitt (1992), Tomizawa et al. (2004), and Tschuprow (1925, 1939). These measures calculate the degree of deviation from independence for each $(i,j)$ cell in the contingency table and derive the degree of association from the sum over all cells. Because of this construction, they can be applied to most contingency tables without distinguishing whether the row and column variables are explanatory or response variables. However, in actual contingency table analysis, the row and column variables are often defined as explanatory and response variables. In such cases, it is not appropriate to analyze the table while ignoring these roles.

Alternative measures have been proposed by Goodman and Kruskal (1954) and Theil (1970), which are interpreted as the proportional reduction in variation (PRV) from the marginal distribution to the conditional distributions of the response. A measure constructed in this way is called a PRV measure. The PRV measure is an important tool for summarizing the strength of association of the entire contingency table because its construction makes the values easy to interpret. However, we sometimes want to focus on the association of some categories of the explanatory variable, and conventional PRV measures underestimate this strength and thus may not accurately reflect the partial association numerically. In the study of models and measures for evaluating the symmetry of the contingency table, Nakagawa et al. (2020), Saigusa et al. (2016), and Saigusa et al. (2019) proposed evaluating partial symmetry by using the geometric mean. On the other hand, little research has been done on partial association.

In this paper, we propose a geometric mean type of PRV (geoPRV) measure constructed from a geometric mean and functions satisfying certain conditions. Because of this construction, the geoPRV measure is flexible in applications and makes it possible to express previously proposed PRV measures as geometric mean types in special cases. By using the geometric mean to sensitively capture the association of each category of the explanatory variable, the analysis can be performed without underestimating the degree of association when cells in the contingency table are partially biased. In addition, the geoPRV measure enables us to identify local association structures. Furthermore, the geoPRV measure can be applied regardless of whether the categorical variables are nominal or ordinal because its value does not change when row or column categories are permuted. The rest of this paper is organized as follows. Section 2 introduces previous research on an extension of the generalized PRV (eGPRV) measure and proposes the geoPRV measure. Section 3 presents approximate confidence intervals for the proposed measure. Section 4 examines the values and confidence intervals of the proposed measure using several artificial and actual data sets and compares them with the eGPRV measure. Section 5 presents our conclusions.

2 PRV Measure

In this section, we introduce measures using a function $f(x)$ that satisfies the following conditions: (i) $f(x)$ is a convex function; (ii) $0\cdot f(0/0)=0$; (iii) $\lim_{x\to+0}f(x)=0$; (iv) $f(1)=0$. Examples of such functions, and models and measures using them, have been proposed by Kateri and Papaioannou (1994), Momozaki et al. (2022), and Tahata (2022). These proposals are intended to generalize existing models and measures; they make it easy to construct new ones and allow adjustment through tuning parameters to fit the analysis. Section 2.1 reviews some conventional PRV measures, including that of Momozaki et al. (2022). In Section 2.2, we propose a geometric mean type of PRV measure and describe its characteristics.

2.1 Conventional PRV Measure

Consider an $R\times C$ contingency table with nominal categories of the explanatory variable $X$ and the response variable $Y$. Let $p_{ij}$ denote the probability that an observation falls in the $i$th row and $j$th column of the table ($i=1,\ldots,R;\ j=1,\ldots,C$). In addition, let $p_{i\cdot}=\sum_{l=1}^{C}p_{il}$ and $p_{\cdot j}=\sum_{k=1}^{R}p_{kj}$ denote the marginal probabilities. The conventional PRV measure has the form

\Phi=\frac{V(Y)-E[V(Y|X)]}{V(Y)}=\frac{\displaystyle V(Y)-\sum_{i=1}^{R}p_{i\cdot}V(Y|X=i)}{V(Y)},

where $V(Y)$ is a measure of variation for the marginal distribution of $Y$, and $E[V(Y|X)]$ is the expectation of the conditional variation of $Y$ given $X$, taken over the distribution of $X$ (see Agresti, 2003). $\Phi$ uses the weighted arithmetic mean of $V(Y|X=i)$, i.e., $\sum_{i=1}^{R}p_{i\cdot}V(Y|X=i)$. By changing the variation measure, various PRV measures can be expressed, such as the uncertainty coefficient $U$ for the variation measure $V(Y)=-\sum_{j=1}^{C}p_{\cdot j}\log p_{\cdot j}$, called the Shannon entropy, and the concentration coefficient $\tau$ for the variation measure $V(Y)=1-\sum_{j=1}^{C}p_{\cdot j}^{2}$, called the Gini concentration (see Agresti, 2003). Tomizawa et al. (1997) proposed a generalized PRV measure $T^{(\lambda)}$ that includes $U$ and $\tau$ by using $V(Y)=\left(1-\sum_{j=1}^{C}p_{\cdot j}^{\lambda+1}\right)/\lambda$ as the variation measure, which is the diversity index of degree $\lambda$ of Patil and Taillie (1982) for the marginal distribution $p_{\cdot j}$. Furthermore, Momozaki et al. (2022) proposed an extension of the generalized PRV (eGPRV) measure that includes $U$, $\tau$, and $T^{(\lambda)}$:

\Phi_{f}=\frac{\displaystyle-\sum_{j=1}^{C}f(p_{\cdot j})-\sum_{i=1}^{R}p_{i\cdot}\left[-\sum_{j=1}^{C}f\left(p_{ij}/p_{i\cdot}\right)\right]}{\displaystyle-\sum_{j=1}^{C}f(p_{\cdot j})}.

The variation measure used in the eGPRV measure $\Phi_{f}$ is $V(Y)=-\sum_{j=1}^{C}f(p_{\cdot j})$.
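As an illustration, the eGPRV measure can be computed directly from a table of counts. The following is a minimal sketch, assuming the function family $f(x)=(x^{\lambda+1}-x)/\lambda$ with the limit $x\log x$ at $\lambda=0$; the names `f_lambda`, `variation`, and `egprv` are hypothetical helpers introduced here, and the counts are those of the cannabis survey reported in Table 5.

```python
import math

def f_lambda(x, lam):
    """f(x) = (x^(lam+1) - x) / lam, with the limit x*log(x) at lam = 0."""
    if x == 0.0:
        return 0.0  # f(x) -> 0 as x -> +0, so empty cells contribute nothing
    if lam == 0:
        return x * math.log(x)
    return (x ** (lam + 1) - x) / lam

def variation(probs, lam):
    """V = -sum_j f(p_j) for one probability vector."""
    return -sum(f_lambda(p, lam) for p in probs)

def egprv(table, lam):
    """Phi_f: reduction from V(Y) to the weighted arithmetic mean of V(Y|X=i)."""
    n = sum(sum(row) for row in table)
    col = [sum(row[j] for row in table) / n for j in range(len(table[0]))]
    v_y = variation(col, lam)
    expected = 0.0
    for row in table:
        r = sum(row)
        if r > 0:
            expected += (r / n) * variation([c / r for c in row], lam)
    return (v_y - expected) / v_y

# Counts from Table 5 (rows: alcohol consumption, columns: cannabis use).
cannabis = [[204, 6, 1], [211, 13, 5], [357, 44, 38], [92, 34, 49]]
print(round(egprv(cannabis, 0), 4), round(egprv(cannabis, 1), 4))
```

With $\lambda=0$ and $\lambda=1$ the sketch corresponds to the uncertainty coefficient $U$ and the concentration coefficient $\tau$, and its output should agree with the corresponding rows of Table 6a up to rounding.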

2.2 Geometric Mean Type of PRV Measure

We propose a new PRV measure that uses the weighted geometric mean of $V(Y|X=i)$ and aims to sensitively capture the association of each category of the explanatory variable with the response variable. Assume that $p_{\cdot j}>0$ and that $V(Y|X=i)$ is a real number greater than or equal to 0 ($i=1,\ldots,R;\ j=1,\ldots,C$). We propose a geometric mean type of PRV (geoPRV) measure for $R\times C$ contingency tables defined as

\Phi_{G}=\frac{\displaystyle V(Y)-\prod_{i=1}^{R}\left[V(Y|X=i)\right]^{p_{i\cdot}}}{V(Y)},

where $V(Y)$ is a measure of variation for the marginal distribution of $Y$ . The geoPRV measure can use the same variation as the conventional PRV measure, for example,

\Phi_{Gf}=\frac{\displaystyle-\sum_{j=1}^{C}f(p_{\cdot j})-\prod_{i=1}^{R}\left[-\sum_{j=1}^{C}f\left(\frac{p_{ij}}{p_{i\cdot}}\right)\right]^{p_{i\cdot}}}{\displaystyle-\sum_{j=1}^{C}f(p_{\cdot j})},

where the variation measure is $V(Y)=-\sum_{j=1}^{C}f(p_{\cdot j})$. In addition, the following theorems hold for $\Phi_{Gf}$.

Theorem 1.

The measure $\Phi_{Gf}$ satisfies the following conditions:

(i) $\Phi_{f}\leq\Phi_{Gf}$.

(ii) $\Phi_{Gf}$ must lie between 0 and 1.

(iii) $\Phi_{Gf}=0$ is equivalent to independence of $X$ and $Y$.

(iv) $\Phi_{Gf}=1$ is equivalent to $\prod_{i=1}^{R}\left[V(Y|X=i)\right]^{p_{i\cdot}}=0$, i.e., for at least one $s$, there exists $t$ such that $p_{st}\neq 0$ and $p_{sj}=0$ for every $j$ with $j\neq t$.

Theorem 2.

The value of $\Phi_{Gf}$ is invariant to permutations of row and column categories.

For the proofs of Theorem 1 and Theorem 2, see Appendix A and Appendix B, respectively. The geoPRV measure differs from the conventional PRV measure in that $\Phi_{Gf}=1$ whenever there exist $i$ and $j$ such that $p_{ij}=p_{i\cdot}\neq 0$, that is, whenever a single row is concentrated in one column. Another important feature of the geoPRV measure is that it takes values greater than or equal to those of the conventional PRV measure, allowing for a stronger representation of the relationship between rows and columns.
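To make the difference concrete, consider a hypothetical $2\times 2$ table (not taken from the data analyzed in this paper) with $p_{11}=0.5$, $p_{12}=0$, $p_{21}=p_{22}=0.25$, and take $f(x)=x^{2}-x$, so that the variation is the Gini concentration $1-\sum_{j}p_{j}^{2}$. The first row is concentrated in one column, and the two measures work out as follows:

```latex
\begin{aligned}
V(Y) &= 1-\left(0.75^{2}+0.25^{2}\right)=0.375,\\
V(Y|X=1) &= 1-\left(1^{2}+0^{2}\right)=0,\qquad
V(Y|X=2) = 1-\left(0.5^{2}+0.5^{2}\right)=0.5,\\
\Phi_{f} &= \frac{0.375-\left(0.5\cdot 0+0.5\cdot 0.5\right)}{0.375}=\frac{1}{3},\\
\Phi_{Gf} &= \frac{0.375-0^{0.5}\cdot 0.5^{0.5}}{0.375}=1.
\end{aligned}
```

The arithmetic mean averages the zero variation of the first row with the variation of the second row, whereas the geometric mean is driven to zero by the single degenerate row.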

A property of the geoPRV measure is that the larger the value of $\Phi_{G}$ , the stronger the association between the response variable $Y$ and the explanatory variable $X$ . In other words, the larger the value of $\Phi_{G}$ , the more accurately you can predict the $Y$ category if you know the $X$ category than if you do not. In contrast, if the value of $\Phi_{G}$ is 0, the $Y$ category is not affected by the $X$ category at all.
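The definition of $\Phi_{Gf}$ translates into a few lines of code. The sketch below is one possible implementation, assuming the family $f(x)=(x^{\lambda+1}-x)/\lambda$ (with $x\log x$ at $\lambda=0$); `f_lambda`, `variation`, and `geoprv` are hypothetical names, the product is accumulated on the log scale, and the counts are those of Table 5. The last lines reverse the order of the rows and columns to illustrate Theorem 2.

```python
import math

def f_lambda(x, lam):
    """f(x) = (x^(lam+1) - x) / lam, with the limit x*log(x) at lam = 0."""
    if x == 0.0:
        return 0.0
    return x * math.log(x) if lam == 0 else (x ** (lam + 1) - x) / lam

def variation(probs, lam):
    """V = -sum_j f(p_j) for one probability vector."""
    return -sum(f_lambda(p, lam) for p in probs)

def geoprv(table, lam):
    """Phi_Gf: reduction from V(Y) to the weighted geometric mean of V(Y|X=i)."""
    n = sum(sum(row) for row in table)
    col = [sum(row[j] for row in table) / n for j in range(len(table[0]))]
    v_y = variation(col, lam)
    log_geo = 0.0
    for row in table:
        r = sum(row)
        if r == 0:
            continue  # an empty row has weight p_i. = 0 and drops out of the product
        v_i = variation([c / r for c in row], lam)
        if v_i <= 0.0:
            return 1.0  # one row without conditional variation makes the product zero
        log_geo += (r / n) * math.log(v_i)
    return (v_y - math.exp(log_geo)) / v_y

# Counts from Table 5, and the same table with rows and columns reversed.
cannabis = [[204, 6, 1], [211, 13, 5], [357, 44, 38], [92, 34, 49]]
shuffled = [row[::-1] for row in reversed(cannabis)]
print(geoprv(cannabis, 1), geoprv(shuffled, 1))
```

Both printed values should coincide up to floating-point error and agree with the $\lambda=1$ row of Table 6b up to rounding.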

3 Approximate Confidence Interval for the Measure

Since the population value of $\Phi_{G}$ is unknown, we derive a confidence interval for $\Phi_{G}$. Let $n_{ij}$ denote the observed frequency in cell $(i,j)$ ($i=1,2,\ldots,R;\ j=1,2,\ldots,C$), and let $n=\sum_{i=1}^{R}\sum_{j=1}^{C}n_{ij}$. Assuming that the observed frequencies $\{n_{ij}\}$ have a multinomial distribution, we consider an approximate standard error and a large-sample confidence interval for $\Phi_{G}$ using the delta method (Bishop et al., 2007; Appendix C in Agresti, 2010).

Theorem 3.

Let $\widehat{\Phi}_{Gf}$ denote the plug-in estimator of $\Phi_{Gf}$. Then $\sqrt{n}(\widehat{\Phi}_{Gf}-\Phi_{Gf})$ converges in distribution to a normal distribution with mean zero and variance $\sigma^{2}[\Phi_{Gf}]$, where

\sigma^{2}[\Phi_{Gf}]=\left(\delta^{(f)}\right)^{2}\left[\sum_{i=1}^{R}\sum_{j=1}^{C}p_{ij}(\Delta_{ij}^{(f)})^{2}-\left(\sum_{i=1}^{R}\sum_{j=1}^{C}p_{ij}\Delta_{ij}^{(f)}\right)^{2}\right],

with

\delta^{(f)}=\frac{\displaystyle\prod_{s=1}^{R}\left[-\sum_{t=1}^{C}f\left(\frac{p_{st}}{p_{s\cdot}}\right)\right]^{p_{s\cdot}}}{\displaystyle\left(\sum_{t=1}^{C}f(p_{\cdot t})\right)^{2}},

\Delta_{ij}^{(f)}=f^{\prime}(p_{\cdot j})-\varepsilon_{ij}^{(f)}\sum_{t=1}^{C}f(p_{\cdot t}),

\varepsilon_{ij}^{(f)}=\log\left[-\sum_{t=1}^{C}f\left(\frac{p_{it}}{p_{i\cdot}}\right)\right]+\frac{\displaystyle\sum_{t=1}^{C}\left\{-\frac{p_{it}}{p_{i\cdot}}f^{\prime}\left(\frac{p_{it}}{p_{i\cdot}}\right)\right\}+f^{\prime}\left(\frac{p_{ij}}{p_{i\cdot}}\right)}{\displaystyle\sum_{t=1}^{C}f\left(\frac{p_{it}}{p_{i\cdot}}\right)},

and $f^{\prime}(x)$ is the derivative of the function $f(x)$ with respect to $x$.

The proof of Theorem 3 is given in Appendix C.

Let $\widehat{\sigma}^{2}\left[\Phi_{Gf}\right]$ denote the plug-in estimator of $\sigma^{2}\left[\Phi_{Gf}\right]$. By Theorem 3, since $\widehat{\sigma}\left[\Phi_{Gf}\right]$ is a consistent estimator of $\sigma\left[\Phi_{Gf}\right]$, $\widehat{\sigma}\left[\Phi_{Gf}\right]/\sqrt{n}$ is an estimated standard error for $\widehat{\Phi}_{Gf}$, and $\widehat{\Phi}_{Gf}\pm z_{\alpha/2}\widehat{\sigma}\left[\Phi_{Gf}\right]/\sqrt{n}$ is an approximate $100(1-\alpha)\%$ confidence interval for $\Phi_{Gf}$, where $z_{\alpha/2}$ is the upper $\alpha/2$ percentage point of the standard normal distribution.
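The interval can be sketched in code as follows. This is an illustrative implementation rather than the authors' program: it assumes $f(x)=(x^{\lambda+1}-x)/\lambda$ (with $x\log x$ at $\lambda=0$), obtains the gradient by differentiating $\Phi_{Gf}$ with respect to each $p_{ij}$ (so that the quantity dividing the derivative terms in `eps` is $\sum_{t}f(p_{it}/p_{i\cdot})=-V(Y|X=i)$), and requires every row to have positive conditional variation. The names `f`, `f_prime`, and `geoprv_ci` are hypothetical, and the counts are those of Table 5.

```python
import math
from statistics import NormalDist

def f(x, lam):
    if x == 0.0:
        return 0.0
    return x * math.log(x) if lam == 0 else (x ** (lam + 1) - x) / lam

def f_prime(x, lam):
    # Derivative of f; only evaluated at x > 0.
    return math.log(x) + 1.0 if lam == 0 else ((lam + 1) * x ** lam - 1.0) / lam

def geoprv_ci(table, lam, level=0.95):
    """Plug-in estimate, delta-method standard error and confidence interval."""
    n = sum(map(sum, table))
    n_rows, n_cols = len(table), len(table[0])
    p = [[c / n for c in row] for row in table]
    row = [sum(r) for r in p]
    col = [sum(p[i][j] for i in range(n_rows)) for j in range(n_cols)]
    sum_f_col = sum(f(c, lam) for c in col)            # equals -V(Y)
    sum_f_row, mean_qfp = [], []
    for i in range(n_rows):
        q = [p[i][j] / row[i] for j in range(n_cols)]
        sum_f_row.append(sum(f(x, lam) for x in q))    # equals -V(Y|X=i)
        # x * f'(x) tends to 0 as x -> +0, so empty cells are skipped.
        mean_qfp.append(sum(x * f_prime(x, lam) for x in q if x > 0.0))
    log_v = [math.log(-s) for s in sum_f_row]          # needs V(Y|X=i) > 0
    geo = math.exp(sum(w * lv for w, lv in zip(row, log_v)))
    phi = 1.0 - geo / (-sum_f_col)
    delta = geo / sum_f_col ** 2
    s1 = s2 = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            if p[i][j] == 0.0:
                continue  # zero-probability cells carry no weight in the sums
            q = p[i][j] / row[i]
            eps = log_v[i] + (f_prime(q, lam) - mean_qfp[i]) / sum_f_row[i]
            big_delta = f_prime(col[j], lam) - eps * sum_f_col
            s1 += p[i][j] * big_delta ** 2
            s2 += p[i][j] * big_delta
    se = delta * math.sqrt(s1 - s2 ** 2) / math.sqrt(n)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return phi, se, (phi - z * se, phi + z * se)

cannabis = [[204, 6, 1], [211, 13, 5], [357, 44, 38], [92, 34, 49]]
print(geoprv_ci(cannabis, 1))
```

For the Table 5 counts with $\lambda=1$, the output should be close to the estimate, standard error, and interval reported in the last row of Table 6b.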

4 Numerical Experiments

In this section, we examine the performance of the geoPRV measure $\Phi_{Gf}$ and the difference between $\Phi_{Gf}$ and the conventional PRV measure $\Phi_{f}$ proposed by Momozaki et al. (2022). We use $\Phi_{f}$ and $\Phi_{Gf}$ with the variation measure $V(Y)=-\sum_{j=1}^{C}f(p_{\cdot j})$. We apply $f(x)=\left(x^{\lambda+1}-x\right)/\lambda$ for $\lambda>-1$, with the value at $\lambda=0$ taken as the limit $x\log x$, and $g(x)=(x-1)^{2}/(\omega x+1-\omega)+(x-1)/(1-\omega)$ for $0\leq\omega<1$ (see Ichimori, 2013), which reduces to $x^{2}-x$ at $\omega=0$; the measures based on the former are written as $\Phi_{f}^{(\lambda)}$ and $\Phi_{Gf}^{(\lambda)}$, and those based on the latter as $\Phi_{g}^{(\omega)}$ and $\Phi_{Gg}^{(\omega)}$. For the tuning parameters, we set $\lambda=0,\ 0.5,\ 1.0$ and $\omega=0,\ 0.5,\ 0.9$.
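Before using the two function families, it may help to check numerically that they satisfy the conditions of Section 2. The sketch below is only a spot check on a grid, not a proof. It assumes that $g$ has a plus sign before its second term, $g(x)=(x-1)^{2}/(\omega x+1-\omega)+(x-1)/(1-\omega)$, which is the form for which $g(x)\to 0$ as $x\to+0$ and $g(x)=x^{2}-x$ at $\omega=0$. The names `f_lambda`, `g_omega`, and `check_conditions` are hypothetical.

```python
import math

def f_lambda(x, lam):
    """f(x) = (x^(lam+1) - x) / lam, with the limit x*log(x) at lam = 0."""
    if x == 0.0:
        return 0.0
    return x * math.log(x) if lam == 0 else (x ** (lam + 1) - x) / lam

def g_omega(x, omega):
    """g(x) = (x-1)^2 / (omega*x + 1 - omega) + (x-1) / (1 - omega), 0 <= omega < 1."""
    return (x - 1.0) ** 2 / (omega * x + 1.0 - omega) + (x - 1.0) / (1.0 - omega)

def check_conditions(func, param, grid=200):
    """Spot-check f(1) = 0, f(x) -> 0 as x -> +0, and midpoint convexity on (0, 1]."""
    at_one = abs(func(1.0, param)) < 1e-12
    near_zero = abs(func(1e-12, param)) < 1e-6
    xs = [(k + 1) / grid for k in range(grid)]
    convex = all(
        func((a + b) / 2, param) <= (func(a, param) + func(b, param)) / 2 + 1e-12
        for a, b in zip(xs, xs[1:])
    )
    return at_one and near_zero and convex

for lam in (0, 0.5, 1.0):
    print("lambda", lam, check_conditions(f_lambda, lam))
for omega in (0, 0.5, 0.9):
    print("omega", omega, check_conditions(g_omega, omega))
```

At $\omega=0$ the function `g_omega` coincides with `f_lambda` at $\lambda=1$, so the corresponding measures coincide as well.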

Artificial data 1

Consider the artificial data in Table 1. These are data to clearly show the difference in characteristics between conventional PRV measures and the geoPRV measure. Table 1c shows the case where the explanatory variable in the first row has a complete association structure with the response variable in the third column. On the other hand, Table 1a and Table 1b show the case where the explanatory variable in the first row has a weak or slightly strong association structure to the response variable, respectively.

Table 1: The $3\times 3$ probability tables, which have (a) a weak, (b) a slightly strong, and (c) a complete association structure in the first row.

	(1)	(2)	(3)	Total
(a)
(1)	0.005	0.125	0.370	0.500
(2)	0.030	0.050	0.120	0.200
(3)	0.045	0.075	0.180	0.300
Total	0.080	0.250	0.670	1.000
(b)
(1)	0.005	0.025	0.470	0.500
(2)	0.030	0.050	0.120	0.200
(3)	0.045	0.075	0.180	0.300
Total	0.080	0.150	0.770	1.000
(c)
(1)	0.000	0.000	0.500	0.500
(2)	0.030	0.050	0.120	0.200
(3)	0.045	0.075	0.180	0.300
Total	0.075	0.125	0.800	1.000

The values of $\Phi_{f}^{(\lambda)}$ and $\Phi_{Gf}^{(\lambda)}$ are provided in Table 2a and Table 2b, respectively. For instance, Table 2a shows that when Table 1c is analyzed, the measure takes the values $\Phi_{f}^{(\lambda)}=0.2628,\ 0.1990,\ 0.1784$ for $\lambda=0,\ 0.5,\ 1.0$ and does not capture the complete association structure of the first row. In contrast, $\Phi_{Gf}^{(\lambda)}=1$ for all $\lambda$, allowing us to identify the local complete association structure. Similarly, consider how $\Phi_{Gf}^{(\lambda)}$ and $\Phi_{f}^{(\lambda)}$ change from Table 1a to Table 1c for any $\lambda$. These results show that $\Phi_{Gf}^{(\lambda)}$ changes more markedly than $\Phi_{f}^{(\lambda)}$ because it captures partial association structures.

Table 2: The values of $\Phi_{f}^{(\lambda)}$ and $\Phi_{Gf}^{(\lambda)}$, applied to Table 1

(a) The values of $\Phi_{f}^{(\lambda)}$
$\lambda$	Table 1a	Table 1b	Table 1c
0.0	0.0495	0.1285	0.2628
0.5	0.0302	0.1156	0.1990
1.0	0.0203	0.1105	0.1784

(b) The values of $\Phi_{Gf}^{(\lambda)}$
$\lambda$	Table 1a	Table 1b	Table 1c
0.0	0.0701	0.2765	1.0000
0.5	0.0487	0.3126	1.0000
1.0	0.0354	0.3221	1.0000
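The role of the first row can be seen by listing the conditional variations $V(Y|X=i)$ row by row before they are averaged. The following sketch does this for Table 1a and Table 1c with $\lambda=1$, again assuming $f(x)=(x^{\lambda+1}-x)/\lambda$; the helper names `row_variations` and `both_measures` are hypothetical.

```python
import math

def f_lambda(x, lam):
    if x == 0.0:
        return 0.0
    return x * math.log(x) if lam == 0 else (x ** (lam + 1) - x) / lam

def row_variations(table, lam):
    """Return V(Y), the row weights p_i. and the conditional variations V(Y|X=i)."""
    total = sum(map(sum, table))
    col = [sum(row[j] for row in table) / total for j in range(len(table[0]))]
    v_y = -sum(f_lambda(p, lam) for p in col)
    weights = [sum(row) / total for row in table]
    # max(0.0, ...) guards against a negative zero or tiny negative rounding error.
    v_rows = [max(0.0, -sum(f_lambda(c / sum(row), lam) for c in row)) for row in table]
    return v_y, weights, v_rows

def both_measures(table, lam):
    """Return (Phi_f, Phi_Gf) from the arithmetic and geometric means of V(Y|X=i)."""
    v_y, w, v = row_variations(table, lam)
    arithmetic = sum(wi * vi for wi, vi in zip(w, v))
    if min(v) <= 0.0:
        geometric = 0.0  # a row with zero variation makes the weighted product zero
    else:
        geometric = math.exp(sum(wi * math.log(vi) for wi, vi in zip(w, v)))
    return 1.0 - arithmetic / v_y, 1.0 - geometric / v_y

table_1a = [[0.005, 0.125, 0.370], [0.030, 0.050, 0.120], [0.045, 0.075, 0.180]]
table_1c = [[0.000, 0.000, 0.500], [0.030, 0.050, 0.120], [0.045, 0.075, 0.180]]

for name, table in (("1a", table_1a), ("1c", table_1c)):
    _, _, v_rows = row_variations(table, 1.0)
    print(name, [round(v, 4) for v in v_rows], both_measures(table, 1.0))
```

For Table 1c the first conditional variation is zero, so the geometric mean vanishes and the second returned value is exactly 1, whereas the arithmetic mean still averages in the two remaining rows.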

Artificial data 2

Consider the artificial data in Table 3. These data are intended to examine the value of the geoPRV measure $\Phi_{Gf}$ as the association of the entire contingency table changes. To obtain suitable data, we converted bivariate normal distributions with means $\mu_{1}=\mu_{2}=0$ and variances $\sigma^{2}_{1}=\sigma^{2}_{2}=1$, in which the correlation coefficient $\rho$ was varied from $0$ to $1$ in steps of $0.2$, into $4\times 4$ contingency tables with equal marginal probabilities. By Theorem 2 and the properties of the PRV measures, tables whose correlation coefficients have the same absolute value differ only by a permutation of rows and therefore give equal values, so the results for negative correlation coefficients are omitted.

Table 3: The $4\times 4$ probability tables, formed by using three cutpoints for each variable at $z_{0.25},z_{0.50},z_{0.75}$ from a bivariate normal distribution with the conditions $\mu_{1}=\mu_{2}=0$, $\sigma^{2}_{1}=\sigma^{2}_{2}=1$, and $\rho$ increasing by 0.2 from $0$ to $1$

	(1)	(2)	(3)	(4)	Total		(1)	(2)	(3)	(4)	Total
		$\rho=1.0$						$\rho=0.4$
(1)	0.2500	0.0000	0.0000	0.0000	0.2500	(1)	0.1072	0.0692	0.0477	0.0258	0.2500
(2)	0.0000	0.2500	0.0000	0.0000	0.2500	(2)	0.0692	0.0698	0.0632	0.0477	0.2500
(3)	0.0000	0.0000	0.2500	0.0000	0.2500	(3)	0.0477	0.0632	0.0698	0.0692	0.2500
(4)	0.0000	0.0000	0.0000	0.2500	0.2500	(4)	0.0258	0.0477	0.0692	0.1072	0.2500
Total	0.2500	0.2500	0.2500	0.2500	1.0000	Total	0.2500	0.2500	0.2500	0.2500	1.0000
		$\rho=0.8$						$\rho=0.2$
(1)	0.1691	0.0629	0.0164	0.0016	0.2500	(1)	0.0837	0.0668	0.0563	0.0432	0.2500
(2)	0.0629	0.1027	0.0680	0.0164	0.2500	(2)	0.0668	0.0648	0.0621	0.0563	0.2500
(3)	0.0164	0.0680	0.1027	0.0629	0.2500	(3)	0.0563	0.0621	0.0648	0.0668	0.2500
(4)	0.0016	0.0164	0.0629	0.1691	0.2500	(4)	0.0432	0.0563	0.0668	0.0837	0.2500
Total	0.2500	0.2500	0.2500	0.2500	1.0000	Total	0.2500	0.2500	0.2500	0.2500	1.0000
		$\rho=0.6$						$\rho=0$
(1)	0.1345	0.0691	0.0353	0.0111	0.2500	(1)	0.0625	0.0625	0.0625	0.0625	0.2500
(2)	0.0691	0.0797	0.0659	0.0353	0.2500	(2)	0.0625	0.0625	0.0625	0.0625	0.2500
(3)	0.0353	0.0659	0.0797	0.0691	0.2500	(3)	0.0625	0.0625	0.0625	0.0625	0.2500
(4)	0.0111	0.0353	0.0691	0.1345	0.2500	(4)	0.0625	0.0625	0.0625	0.0625	0.2500
Total	0.2500	0.2500	0.2500	0.2500	1.0000	Total	0.2500	0.2500	0.2500	0.2500	1.0000
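The construction of Table 3 can be sketched with the standard library alone. The code below is an assumption about one way to obtain such tables, not the authors' procedure: it evaluates the bivariate normal distribution function by Simpson's rule applied to $\int_{-\infty}^{a}\phi(x)\,\Phi\!\left((b-\rho x)/\sqrt{1-\rho^{2}}\right)dx$, replaces infinite limits by $\pm 8$, and is restricted to $|\rho|<1$ (for $\rho=1$ the table is the diagonal one shown above). The names `bvn_cdf` and `quartile_table` are hypothetical.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution

def bvn_cdf(a, b, rho, steps=2000, lower=-8.0):
    """P(X <= a, Y <= b) for a standard bivariate normal with |rho| < 1."""
    if abs(rho) >= 1.0:
        raise ValueError("this sketch requires |rho| < 1")
    if a <= lower:
        return 0.0
    s = math.sqrt(1.0 - rho * rho)
    h = (a - lower) / steps  # steps must be even for Simpson's rule
    total = 0.0
    for k in range(steps + 1):
        x = lower + k * h
        weight = 1 if k in (0, steps) else (4 if k % 2 else 2)
        total += weight * STD.pdf(x) * STD.cdf((b - rho * x) / s)
    return total * h / 3.0

def quartile_table(rho):
    """4x4 cell probabilities with cutpoints at the quartiles of each margin."""
    cuts = [-8.0] + [STD.inv_cdf(q) for q in (0.25, 0.50, 0.75)] + [8.0]
    F = [[bvn_cdf(a, b, rho) for b in cuts] for a in cuts]
    return [
        [F[i + 1][j + 1] - F[i][j + 1] - F[i + 1][j] + F[i][j] for j in range(4)]
        for i in range(4)
    ]

for row in quartile_table(0.8):
    print([round(c, 4) for c in row])
```

The rows printed for $\rho=0.8$ should be close to the corresponding block of Table 3, and every row and column should sum to approximately 0.25.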

Table 4 shows the values of $\Phi_{f}^{(\lambda)}$ and $\Phi_{Gf}^{(\lambda)}$ for each value of $\rho$. We observe that both increase as the absolute value of $\rho$ increases. In addition, the measures equal 0, indicating independence, if and only if $\rho=0$, and equal 1, indicating a structure of complete (or partially complete) association, if and only if $\rho=1.0$. Also, when the association is spread over the entire contingency table rather than concentrated in particular rows, the values of $\Phi_{Gf}^{(\lambda)}$ are larger than or equal to those of $\Phi_{f}^{(\lambda)}$ by Theorem 1, but the differences are small.

Table 4: The values of $\Phi_{f}^{(\lambda)}$ and $\Phi_{Gf}^{(\lambda)}$ for each $\rho$

(a) The values of $\Phi_{f}^{(\lambda)}$
$\lambda$	$\rho=0.0$	$\rho=0.2$	$\rho=0.4$	$\rho=0.6$	$\rho=0.8$	$\rho=1.0$
0.0	0.0000	0.0109	0.0461	0.1159	0.2541	1.0000
0.5	0.0000	0.0113	0.0471	0.1161	0.2479	1.0000
1.0	0.0000	0.0100	0.0419	0.1035	0.2236	1.0000

(b) The values of $\Phi_{Gf}^{(\lambda)}$
$\lambda$	$\rho=0.0$	$\rho=0.2$	$\rho=0.4$	$\rho=0.6$	$\rho=0.8$	$\rho=1.0$
0.0	0.0000	0.0109	0.0469	0.1203	0.2699	1.0000
0.5	0.0000	0.0113	0.0479	0.1205	0.2634	1.0000
1.0	0.0000	0.0100	0.0425	0.1071	0.2369	1.0000

Actual data 1

Consider applying the PRV measures to the data in Table 5, taken from a survey of cannabis use among students conducted at the University of Ioannina (Greece) in 1995 and published in Marselos et al. (1997). The students’ frequency of alcohol consumption is measured on a four-level scale ranging from at most once per month up to more frequently than twice per week, while their trial of cannabis is measured through a three-level variable (never tried, tried once or twice, more often). The frequencies in the first and second rows are concentrated in the first column, which is a partial bias in the data.

Table 5: Students’ survey about cannabis use at the University of Ioannina

	I tried cannabis $\dots$
Alcohol consumption	Never	Once or twice	More often	Total
At most once/month	204	6	1	211
Twice/month	211	13	5	229
Twice/week	357	44	38	439
More often	92	34	49	175
Total	864	97	93	1054

The estimates of $\Phi_{f}^{(\lambda)}$ and $\Phi_{Gf}^{(\lambda)}$ are provided in Table 6a and Table 6b, respectively. For instance, when $\lambda=1$, $\widehat{\Phi}_{f}^{(1)}=0.1034$ in Table 6a and $\widehat{\Phi}_{Gf}^{(1)}=0.2992$ in Table 6b. $\widehat{\Phi}_{f}^{(1)}$ shows that the weighted arithmetic mean of the conditional variation of trying cannabis is $10.34\%$ smaller than the marginal variation, and similarly $\widehat{\Phi}_{Gf}^{(1)}$ shows that the weighted geometric mean of the conditional variation is $29.92\%$ smaller. Based on these values, the following can be interpreted from Table 5:

(1) There is a strong association overall between alcohol consumption and cannabis use experience.

(2) There are fairly strong associations between some categories of alcohol consumption and cannabis use experience.

These interpretations seem to be intuitive when looking at Table 5. However, by analyzing using the measures, we have been able to present an objective interpretation numerically and to show how strongly associated structures are in the contingency table.

Table 6: Estimate of $\Phi_{f}^{(\lambda)}$ and $\Phi_{Gf}^{(\lambda)}$, estimated approximate standard error for $\widehat{\Phi}_{f}^{(\lambda)}$ and $\widehat{\Phi}_{Gf}^{(\lambda)}$, approximate $95\%$ confidence interval for $\Phi_{f}^{(\lambda)}$ and $\Phi_{Gf}^{(\lambda)}$

(a) $\Phi_{f}^{(\lambda)}$ for Table 5
$\lambda$	Estimated measure	Standard error	Confidence interval
0.0	0.1215	0.0175	(0.0872, 0.1557)
0.5	0.1090	0.0172	(0.0752, 0.1428)
1.0	0.1034	0.0174	(0.0693, 0.1376)

(b) $\Phi_{Gf}^{(\lambda)}$ for Table 5
$\lambda$	Estimated measure	Standard error	Confidence interval
0.0	0.2601	0.0439	(0.1741, 0.3461)
0.5	0.2922	0.0488	(0.1965, 0.3879)
1.0	0.2992	0.0502	(0.2007, 0.3976)

Actual data 2

By analyzing multiple contingency tables using the measures, it is possible to determine numerically how much the associations of the contingency tables differ. To illustrate this, consider the data in Table 7, taken from Hashimoto (1999). These data describe the cross-classifications of the father’s and son’s occupational status categories in Japan, which were examined in 1975 and 1985. We can consider the father’s status as an explanatory variable and the son’s status as a response variable, since the father’s occupational status category seems to influence the son’s. The analysis of Table 7 aims to show how the associations of occupational status categories for fathers and sons differ between 1975 and 1985.

Table 7: Occupational status for Japanese father-son pairs

(a) Examined in 1975
	Son’s status
Father’s status	Capitalist	New middle	Working	Old middle	Total
Capitalist	29	43	25	35	132
New middle	23	159	89	52	323
Working	11	69	184	44	308
Old middle	84	323	525	613	1545
Total	147	594	823	744	2308

(b) Examined in 1985
	Son’s status
Father’s status	Capitalist	New middle	Working	Old middle	Total
Capitalist	46	59	34	42	181
New middle	20	193	79	31	323
Working	9	122	202	48	381
Old middle	47	270	412	380	1109
Total	122	644	727	501	1994

Table 8 and Table 9 give the estimates of $\Phi_{g}^{(\omega)}$ and $\Phi_{Gg}^{(\omega)}$, respectively. Comparing the estimates for each $\omega$ in Table 8 and Table 9, we can see that the values of the two measures are almost the same. In addition, comparing Table 8a and Table 8b, the estimates are slightly larger in Table 8b, which suggests that the association is somewhat stronger in Table 7b, but the difference is small because the confidence intervals overlap for every $\omega$. Similarly, comparing Table 9a and Table 9b, the estimates are slightly larger in Table 9b. However, the confidence intervals do not overlap at $\omega=0.9$. From these results, the following can be interpreted for Table 7a and Table 7b:

(1) The occupational status categories of fathers and sons in 1975 and 1985 both have weak associations overall, and no individual category of the explanatory variable has a remarkable association.

(2) Although the association of Table 7b is slightly larger than that of Table 7a, the confidence intervals indicate that there is no statistical difference.

(3) The partial association in Table 7b is slightly larger than that in Table 7a, and the confidence intervals indicate that there may be a statistical difference.

When the confidence intervals indicate a statistical difference only for some tuning parameters, as in (3), the result is affected by the way the characteristics of the variation change with the tuning parameter. In this case, it is difficult to give an interpretation in terms of the variation because there was no difference in the special cases (e.g., $\omega=0$). However, when differences appear in the special cases, further interpretation can be given by focusing on the characteristics of the corresponding variation.

Table 8: Estimate $\widehat{\Phi}_{g}^{(\omega)}$ of $\Phi_{g}^{(\omega)}$ (first column), estimated approximate standard error for $\widehat{\Phi}_{g}^{(\omega)}$ (second column), and approximate $95\%$ confidence interval for $\Phi_{g}^{(\omega)}$ (final column)

(a) For Table 7a
$\omega$	Estimated measure	Standard error	Confidence interval
0.0	0.0480	0.0061	(0.0361, 0.0600)
0.5	0.0547	0.0067	(0.0416, 0.0678)
0.9	0.0401	0.0054	(0.0294, 0.0507)

(b) For Table 7b
$\omega$	Estimated measure	Standard error	Confidence interval
0.0	0.0598	0.0071	(0.0459, 0.0736)
0.5	0.0709	0.0079	(0.0553, 0.0864)
0.9	0.0665	0.0081	(0.0506, 0.0823)

Table 9: Estimate of $\Phi_{Gg}^{(\omega)}$, estimated approximate standard error for $\widehat{\Phi}_{Gg}^{(\omega)}$, approximate $95\%$ confidence interval for $\Phi_{Gg}^{(\omega)}$

(a) For Table 7a
$\omega$	Estimated measure	Standard error	Confidence interval
0.0	0.0499	0.0066	(0.0371, 0.0628)
0.5	0.0571	0.0072	(0.0431, 0.0712)
0.9	0.0416	0.0057	(0.0304, 0.0528)

(b) For Table 7b
$\omega$	Estimated measure	Standard error	Confidence interval
0.0	0.0630	0.0077	(0.0478, 0.0782)
0.5	0.0752	0.0086	(0.0583, 0.0922)
0.9	0.0695	0.0084	(0.0530, 0.0860)
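The comparison of the 1975 and 1985 intervals can be made explicit with a small check. The sketch below simply tests whether the reported intervals in Table 8 and Table 9 share a point; the function name `overlaps` and the dictionary layout are hypothetical, and non-overlap of two intervals is used here only as the informal criterion adopted in the text, not as a formal test.

```python
def overlaps(a, b):
    """True if the closed intervals a = (lo, hi) and b = (lo, hi) share a point."""
    return a[0] <= b[1] and b[0] <= a[1]

# (1975 interval, 1985 interval) for each omega, copied from Tables 8 and 9.
table8 = {
    0.0: ((0.0361, 0.0600), (0.0459, 0.0736)),
    0.5: ((0.0416, 0.0678), (0.0553, 0.0864)),
    0.9: ((0.0294, 0.0507), (0.0506, 0.0823)),
}
table9 = {
    0.0: ((0.0371, 0.0628), (0.0478, 0.0782)),
    0.5: ((0.0431, 0.0712), (0.0583, 0.0922)),
    0.9: ((0.0304, 0.0528), (0.0530, 0.0860)),
}

for name, table in (("Table 8", table8), ("Table 9", table9)):
    for omega, (ci_1975, ci_1985) in table.items():
        print(name, omega, overlaps(ci_1975, ci_1985))
```

Only the $\omega=0.9$ pair in Table 9 fails to overlap, and the corresponding pair in Table 8 overlaps by a very narrow margin.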

5 Conclusion

In this paper, we proposed a geometric mean type of PRV (geoPRV) measure that combines a geometric mean with variation measures built from arbitrary functions satisfying certain conditions. We showed that the proposed measure has the following three properties, which are suitable for examining the degree of association and are shared by the conventional measures: (i) the measure increases monotonically as the degree of association increases; (ii) the value is 0 when there is a structure of null association; and (iii) the value is 1 when there is a structure of complete association. Furthermore, by using the geometric mean, the geoPRV measure can capture the association of individual categories of the explanatory variable with the response variable, which could not be investigated by the existing PRV measures. Using the existing PRV measures and the geoPRV measure together makes it possible to examine both the association of the entire contingency table and the partial association. Also, like the measure $\Phi_{f}$, the geoPRV measure can be computed with variations of various characteristics by choosing functions and tuning parameters that satisfy the conditions. Therefore, including the geoPRV measure in the analysis can lead to a deeper understanding of the data and provide further interpretation. While various measures for contingency tables have been proposed, several studies in recent years have conducted analyses using Goodman-Kruskal’s PRV measure (e.g., Gea-Izquierdo, 2023; Iordache et al., 2022). We believe that the new PRV measure in this paper, when examined and compared together with the existing Goodman-Kruskal PRV measure, may provide a new perspective that pays attention to the association of individual categories of the explanatory variable in addition to the association of the entire contingency table.

Appendix A Proof of Theorem 1

Proof.

(i) Let $\phi$ denote the numerator of the fraction

\Phi_{Gf}-\Phi_{f}=\frac{\displaystyle\sum_{i=1}^{R}p_{i\cdot}V(Y|X=i)-\prod_{i=1}^{R}\left[V(Y|X=i)\right]^{p_{i\cdot}}}{\displaystyle-\sum_{j=1}^{C}f(p_{\cdot j})}.

If there exists $i$ such that $V(Y|X=i)=0$, the product in $\phi$ vanishes and $\phi\geq 0$ is easily verified, i.e., $\Phi_{f}\leq\Phi_{Gf}$ holds. Next, consider the case where $V(Y|X=i)>0$ for all $i$. The function $h(x)=-\log x$ is convex since $h^{\prime\prime}(x)=1/x^{2}>0$, where $h^{\prime\prime}(x)$ is the second derivative of $h(x)$ with respect to $x$. From Jensen’s inequality,

\begin{aligned}
&\sum_{i=1}^{R}p_{i\cdot}[-\log V(Y|X=i)]\geq-\log\left[\sum_{i=1}^{R}p_{i\cdot}V(Y|X=i)\right]\\
\Longleftrightarrow\quad&\sum_{i=1}^{R}\log\left[V(Y|X=i)\right]^{p_{i\cdot}}\leq\log\left[\sum_{i=1}^{R}p_{i\cdot}V(Y|X=i)\right]\\
\Longleftrightarrow\quad&\log\prod_{i=1}^{R}\left[V(Y|X=i)\right]^{p_{i\cdot}}\leq\log\left[\sum_{i=1}^{R}p_{i\cdot}V(Y|X=i)\right]\\
\Longleftrightarrow\quad&\prod_{i=1}^{R}\left[V(Y|X=i)\right]^{p_{i\cdot}}\leq\sum_{i=1}^{R}p_{i\cdot}V(Y|X=i),
\end{aligned}

where $p_{i\cdot}\geq 0$ and $\sum_{i=1}^{R}p_{i\cdot}=1$. Therefore, $\phi\geq 0$, i.e., $\Phi_{f}\leq\Phi_{Gf}$ holds.

(ii) The inequality $0\leq\Phi_{f}\leq 1$ has already been proven by Momozaki et al. (2022), and $\Phi_{f}\leq\Phi_{Gf}$ holds as proved above. Hence, $\Phi_{Gf}\geq 0$ holds since $0\leq\Phi_{f}\leq\Phi_{Gf}$. In addition, since $\prod_{i=1}^{R}\left[V(Y|X=i)\right]^{p_{i\cdot}}\geq 0$, we obtain $\Phi_{Gf}\leq 1$. Thus, $0\leq\Phi_{Gf}\leq 1$ holds.

(iii) Since $0\leq\Phi_{f}\leq\Phi_{Gf}$, if $\Phi_{Gf}=0$ then $\Phi_{f}=0$. Since $\Phi_{f}=0$ implies $p_{ij}=p_{i\cdot}p_{\cdot j}$ (Momozaki et al., 2022), $\Phi_{Gf}=0\Longrightarrow p_{ij}=p_{i\cdot}p_{\cdot j}$ holds. Conversely, if $p_{ij}=p_{i\cdot}p_{\cdot j}$, then $V(Y|X=i)=V(Y)$ for every $i$, so the weighted geometric mean equals $V(Y)$ and $\Phi_{Gf}=0$.

(iv) If $\Phi_{Gf}=1$ then $\prod_{i=1}^{R}\left[V(Y|X=i)\right]^{p_{i\cdot}}=0$, i.e., $V(Y|X=s)=-\sum_{j=1}^{C}f\left(\frac{p_{sj}}{p_{s\cdot}}\right)=0$ for some $s$ ($s=1,2,\ldots,R$). Thus, there exists $t$ such that $p_{st}\neq 0$ and $p_{sj}=0$ for every $j\neq t$. Conversely, for such a row $s$, $V(Y|X=s)=0$ because $f(1)=0$ and $f(x)\to 0$ as $x\to+0$, so the product vanishes and $\Phi_{Gf}=1$.

∎
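The two boundary cases in (iii) and (iv) can be checked numerically. The sketch below is only an illustration under assumptions: it takes $f(x)=x\log x$, uses the arithmetic-mean and geometric-mean forms of the measures as they appear in this proof, and evaluates two hypothetical tables, one exactly independent and one whose first row is concentrated in a single column.

```python
import math

def f(x):
    # Assumed choice f(x) = x log x, with f(0) = 0 by continuity.
    return 0.0 if x == 0 else x * math.log(x)

def measures(p):
    """Return (Phi_f, Phi_Gf) for a joint probability table p with positive row sums."""
    row = [sum(r) for r in p]
    col = [sum(c) for c in zip(*p)]
    v_y = -sum(f(q) for q in col)
    # max(0.0, ...) only guards against a signed zero or tiny negative rounding error.
    v_cond = [max(0.0, -sum(f(x / ri) for x in r)) for r, ri in zip(p, row)]
    arith = sum(ri * vi for ri, vi in zip(row, v_cond))
    geo = math.prod(vi ** ri for ri, vi in zip(row, v_cond))
    return 1.0 - arith / v_y, 1.0 - geo / v_y

# Case (iii): p_ij = p_i. * p_.j, built from hypothetical marginals.
r, c = [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]
independent = [[ri * cj for cj in c] for ri in r]

# Case (iv): the first row puts all of its mass in one column.
degenerate = [[0.40, 0.00, 0.00],
              [0.10, 0.20, 0.10],
              [0.05, 0.05, 0.10]]

print("independent:", measures(independent))
print("degenerate :", measures(degenerate))
```

In the second table a single degenerate row drives the geometric mean to zero, so the geometric-mean measure reaches its upper bound while the arithmetic-mean measure stays strictly below it.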

Appendix B Proof of Theorem 2

Proof.

Since the first terms in the denominator and numerator of $\Phi_{Gf}$ do not depend on the row category, we focus on the second term in the numerator. This term is

\prod_{i=1}^{R}\left[-\sum_{j=1}^{C}f\left(\frac{p_{ij}}{p_{i\cdot}}\right)\right]^{p_{i\cdot}}=\prod_{i=1}^{R}\left[-f\left(\frac{p_{i1}}{p_{i\cdot}}\right)-\cdots-f\left(\frac{p_{iC}}{p_{i\cdot}}\right)\right]^{p_{i\cdot}},

and its value is unchanged by reordering the factors of the product over $i$. Namely, the value of $\Phi_{Gf}$ is invariant with respect to the permutation of row categories. Similarly, since $\sum_{j=1}^{C}f(p_{\cdot j})$ and each inner sum over $j$ are unchanged by reordering their terms, the value of $\Phi_{Gf}$ is also invariant with respect to the permutation of column categories. ∎
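The invariance can also be seen numerically. The following sketch, again assuming $f(x)=x\log x$ and a hypothetical table, evaluates the geometric-mean measure before and after permuting both row and column categories; the helper `permute` is an illustrative name, not part of the paper.

```python
import math

def f(x):
    # Assumed choice f(x) = x log x, with f(0) = 0 by continuity.
    return 0.0 if x == 0 else x * math.log(x)

def phi_gf(p):
    """Geometric-mean measure for a joint probability table p with positive row sums."""
    row = [sum(r) for r in p]
    col = [sum(c) for c in zip(*p)]
    v_y = -sum(f(q) for q in col)
    geo = math.prod((-sum(f(x / ri) for x in r)) ** ri for r, ri in zip(p, row))
    return 1.0 - geo / v_y

def permute(p, row_order, col_order):
    """Reorder the rows and columns of p according to the given index lists."""
    return [[p[i][j] for j in col_order] for i in row_order]

# Hypothetical joint probabilities (all cells positive, total mass 1).
table = [[0.30, 0.05, 0.05],
         [0.05, 0.20, 0.05],
         [0.10, 0.10, 0.10]]

shuffled = permute(table, row_order=[2, 0, 1], col_order=[1, 2, 0])

print(phi_gf(table), phi_gf(shuffled))
```

The two printed values should agree up to floating-point rounding, since only the order of summation and multiplication changes.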

Appendix C Proof of Theorem 3

Proof.

Let

\bm{n}=(n_{11},n_{12},\ldots,n_{1C},n_{21},\ldots,n_{RC})^{\top},

\bm{p}=(p_{11},p_{12},\ldots,p_{1C},p_{21},\ldots,p_{RC})^{\top},

and $\widehat{\bm{p}}=\bm{n}/n$, where $\bm{a}^{\top}$ denotes the transpose of $\bm{a}$. Then $\sqrt{n}\left(\widehat{\bm{p}}-\bm{p}\right)$ converges in distribution to a normal distribution with mean zero and covariance matrix ${\rm diag}(\bm{p})-\bm{pp}^{\top}$, where ${\rm diag}(\bm{p})$ is a diagonal matrix with the elements of $\bm{p}$ on the main diagonal (Bishop et al., 2007).
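To make the limiting covariance matrix concrete, the short sketch below builds ${\rm diag}(\bm{p})-\bm{pp}^{\top}$ for a hypothetical $2\times 2$ table flattened row by row. The function name and the probabilities are assumptions made for illustration.

```python
def multinomial_cov(p):
    """Return diag(p) - p p^T as a nested list, for a flattened probability vector p."""
    k = len(p)
    return [[(p[a] if a == b else 0.0) - p[a] * p[b] for b in range(k)]
            for a in range(k)]

# Hypothetical cell probabilities (p11, p12, p21, p22), summing to 1.
p = [0.4, 0.1, 0.2, 0.3]
cov = multinomial_cov(p)

for row in cov:
    print(["%+.3f" % v for v in row])
```

Each row of this matrix sums to zero because the cell probabilities sum to one, which reflects the linear constraint on $\widehat{\bm{p}}$.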

The Taylor expansion of the function $\widehat{\Phi}_{Gf}$ around $\bm{p}$ is given by

\widehat{\Phi}_{Gf}=\Phi_{Gf}+\left(\frac{\partial\Phi_{Gf}}{\partial\bm{p}^{\top}}\right)(\widehat{\bm{p}}-\bm{p})+o_{p}(n^{-1/2}).

Therefore, since

\sqrt{n}(\widehat{\Phi}_{Gf}-\Phi_{Gf})=\sqrt{n}\left(\frac{\partial\Phi_{Gf}}{\partial\bm{p}^{\top}}\right)(\widehat{\bm{p}}-\bm{p})+o_{p}(1),

we obtain

\sqrt{n}(\widehat{\Phi}_{Gf}-\Phi_{Gf})\overset{d}{\to}N(0,\sigma^{2}[\Phi_{Gf}]),

where

\sigma^{2}[\Phi_{Gf}]=\left(\delta^{(f)}\right)^{2}\left[\sum_{i=1}^{R}\sum_{j=1}^{C}p_{ij}(\Delta_{ij}^{(f)})^{2}-\left(\sum_{i=1}^{R}\sum_{j=1}^{C}p_{ij}\Delta_{ij}^{(f)}\right)^{2}\right],

with

\begin{align*}
\delta^{(f)}&=\frac{\displaystyle\prod_{s=1}^{R}\left[-\sum_{t=1}^{C}f\left(\frac{p_{st}}{p_{s\cdot}}\right)\right]^{p_{s\cdot}}}{\displaystyle\left(\sum_{t=1}^{C}f(p_{\cdot t})\right)^{2}},\\
\Delta_{ij}^{(f)}&=f^{\prime}(p_{\cdot j})-\varepsilon_{ij}^{(f)}\sum_{t=1}^{C}f(p_{\cdot t}),\\
\varepsilon_{ij}^{(f)}&=\log\left[-\sum_{t=1}^{C}f\left(\frac{p_{it}}{p_{i\cdot}}\right)\right]+\frac{\displaystyle\sum_{t=1}^{C}\left\{-\frac{p_{it}}{p_{i\cdot}}f^{\prime}\left(\frac{p_{it}}{p_{i\cdot}}\right)\right\}+f^{\prime}\left(\frac{p_{ij}}{p_{i\cdot}}\right)}{\displaystyle\sum_{t=1}^{C}f\left(\frac{p_{it}}{p_{i\cdot}}\right)},
\end{align*}

and $f^{\prime}(x)$ is the derivative of $f(x)$ with respect to $x$. Here $\partial\Phi_{Gf}/\partial p_{ij}=-\delta^{(f)}\Delta_{ij}^{(f)}$, and $\varepsilon_{ij}^{(f)}$ is the partial derivative of $\sum_{s=1}^{R}p_{s\cdot}\log\left[-\sum_{t=1}^{C}f\left(p_{st}/p_{s\cdot}\right)\right]$ with respect to $p_{ij}$. ∎
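The variance expression can be evaluated directly from a table of cell probabilities. The sketch below is a hedged illustration, not the authors' implementation: it assumes $f(x)=x\log x$ with $f^{\prime}(x)=\log x+1$, a hypothetical table with all cells positive, and a hypothetical sample size $n=500$. It takes the denominator of the second term of $\varepsilon_{ij}^{(f)}$ to be $\sum_{t}f(p_{it}/p_{i\cdot})$, which is what differentiating $\sum_{s}p_{s\cdot}\log[-\sum_{t}f(p_{st}/p_{s\cdot})]$ with respect to $p_{ij}$ yields.

```python
import math

def f(x):
    # Assumed choice f(x) = x log x, with f(0) = 0 by continuity.
    return 0.0 if x == 0 else x * math.log(x)

def f_prime(x):
    # Derivative of x log x; requires x > 0.
    return math.log(x) + 1.0

def phi_gf(p):
    """Phi_Gf = 1 + G / S with G the weighted geometric mean and S = sum_t f(p_.t)."""
    row = [sum(r) for r in p]
    col = [sum(c) for c in zip(*p)]
    s = sum(f(q) for q in col)
    g = math.prod((-sum(f(x / ri) for x in r)) ** ri for r, ri in zip(p, row))
    return 1.0 + g / s

def asymptotic_variance(p):
    """Return (sigma^2, delta, Delta) following the delta-method formula."""
    row = [sum(r) for r in p]
    col = [sum(c) for c in zip(*p)]
    s = sum(f(q) for q in col)
    g = math.prod((-sum(f(x / ri) for x in r)) ** ri for r, ri in zip(p, row))
    delta = g / s ** 2
    big_delta = []
    for r, ri in zip(p, row):
        cond = [x / ri for x in r]                   # conditional probabilities p_it / p_i.
        sum_f = sum(f(c) for c in cond)              # equals -V(Y|X=i), negative
        mean_fp = sum(c * f_prime(c) for c in cond)
        eps = [math.log(-sum_f) + (f_prime(c) - mean_fp) / sum_f for c in cond]
        big_delta.append([f_prime(col[j]) - e * s for j, e in enumerate(eps)])
    m1 = sum(x * d for r, dr in zip(p, big_delta) for x, d in zip(r, dr))
    m2 = sum(x * d * d for r, dr in zip(p, big_delta) for x, d in zip(r, dr))
    return delta ** 2 * (m2 - m1 ** 2), delta, big_delta

# Hypothetical joint probabilities (all cells positive, total mass 1).
table = [[0.30, 0.05, 0.05],
         [0.05, 0.20, 0.05],
         [0.10, 0.10, 0.10]]

var, delta, big_delta = asymptotic_variance(table)
n = 500  # hypothetical sample size
print("Phi_Gf =", phi_gf(table))
print("approximate standard error =", math.sqrt(var / n))
```

In practice the unknown $\bm{p}$ would be replaced by $\widehat{\bm{p}}$, giving a plug-in standard error from which an approximate confidence interval for $\Phi_{Gf}$ could be formed.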

References

Agresti, A. (2003). Categorical data analysis. John Wiley & Sons.
Agresti, A. (2010). Analysis of ordinal categorical data, volume 656. John Wiley & Sons.
Bishop, Y. M., Fienberg, S. E., and Holland, P. W. (2007). Discrete multivariate analysis: theory and practice. Springer Science & Business Media.
Cramér, H. (1999). Mathematical methods of statistics, volume 43. Princeton University Press.
Everitt, B. S. (1992). The analysis of contingency tables. CRC Press.
Gea-Izquierdo, E. (2023). Biological risk of Legionella pneumophila in irrigation systems. Revista de Salud Pública, 22:434–439.
Gilula, Z. and Haberman, S. J. (1986). Canonical analysis of contingency tables by maximum likelihood. Journal of the American Statistical Association, 81(395):780–788.
Goodman, L. A. (1981). Association models and canonical correlation in the analysis of cross-classifications having ordered categories. Journal of the American Statistical Association, 76(374):320–334.
Goodman, L. A. (1985). The analysis of cross-classified data having ordered and/or unordered categories: Association models, correlation models, and asymmetry models for contingency tables with or without missing entries. The Annals of Statistics, 13:10–69.
Goodman, L. A. and Kruskal, W. H. (1954). Measures of association for cross classifications. Journal of the American Statistical Association, 49(268):732–764.
Hashimoto, K. (1999). Gendai nihon no kaikyuu kouzou (Class structure in modern Japan: theory, method and quantitative analysis). Toshindo, Tokyo (in Japanese).
Ichimori, T. (2013). On inequalities between $f$-divergence. Technical Note, IPSJ Journal, 54(11):2344–2348.
Iordache, A. M., Nechita, C., Voica, C., Pluháček, T., and Schug, K. A. (2022). Climate change extreme and seasonal toxic metal occurrence in Romanian freshwaters in the last two decades—case study and critical review. NPJ Clean Water, 5(1):2.
Kateri, M. and Papaioannou, T. (1994). f-divergence Association Models. University of Ioannina.
Marselos, M., Boutsouris, K., Liapi, H., Malamas, M., Kateri, M., and Papaioannou, T. (1997). Epidemiological aspects of the use of cannabis among university students in Greece. European Addiction Research, 3(4):184–191.
Momozaki, T., Wada, Y., Nakagawa, T., and Tomizawa, S. (2022). Extension of generalized proportional reduction in variation measure for two-way contingency tables. Behaviormetrika, pages 1–14.
Nakagawa, T., Takei, T., Ishii, A., and Tomizawa, S. (2020). Geometric mean type measure of marginal homogeneity for square contingency tables with ordered categories. Journal of Mathematics and Statistics, 16(1):170–175.
Patil, G. and Taillie, C. (1982). Diversity as a concept and its measurement. Journal of the American Statistical Association, 77(379):548–561.
Rom, D. and Sarkar, S. K. (1992). A generalized model for the analysis of association in ordinal contingency tables. Journal of Statistical Planning and Inference, 33(2):205–212.
Saigusa, Y., Tahata, K., and Tomizawa, S. (2016). Measure of departure from partial symmetry for square contingency tables. Journal of Mathematics and Statistics, 12(3):152–156.
Saigusa, Y., Takami, M., Ishii, A., Nakagawa, T., and Tomizawa, S. (2019). Measure for departure from cumulative partial symmetry for square contingency tables with ordered categories. Journal of Statistics: Advances in Theory and Applications, 21:53–70.
Tahata, K. (2022). Advances in quasi-symmetry for square contingency tables. Symmetry, 14(5):1051.
Theil, H. (1970). On the estimation of relationships involving qualitative variables. American Journal of Sociology, 76(1):103–154.
Tomizawa, S., Miyamoto, N., and Houya, H. (2004). Generalization of Cramér’s coefficient of association for contingency tables: theory and methods. South African Statistical Journal, 38(1):1–24.
Tomizawa, S., Seo, T., and Ebi, M. (1997). Generalized proportional reduction in variation measure for two-way contingency tables. Behaviormetrika, 24(2):193–201.
Tschuprow, A. (1925). Grundbegriffe und Grundprobleme der Korrelationstheorie. Leipzig: B.G. Teubner.
Tschuprow, A. (1939). Principles of the mathematical theory of correlation. W. Hodge & Co.
