Nonparametric tests for interaction in two-way ANOVA with balanced replications

Bao Khue Tran
Department of Mathematics and Statistics
Kenyon College
Gambier, OH
Amy S. Wagaman
Department of Mathematics and Statistics
Amherst College
Amherst, MA
Andrew Nguyen
Department of Statistics and Operations Research
University of North Carolina at Chapel Hill
Chapel Hill, NC
David Jacobson
Department of Mathematics and Statistics
Amherst College
Amherst, MA
Bradley Hartlaub
Department of Mathematics and Statistics
Kenyon College
Gambier, OH

(October 6, 2024)

Abstract

Nonparametric procedures can be more powerful than their parametric counterparts for detecting interaction in two-way ANOVA when the data are non-normal. In this paper, we compute null critical values for the aligned rank-based tests ( $APCSSA/APCSSM$ ) where the number of levels of each factor is between 2 and 6. We compare the performance of these new procedures with the ANOVA $F$ -test for interaction, the aligned rank transform test ( $ART$ ), Conover’s rank transform procedure ( $RT$ ), and a rank-based ANOVA test (raov) using Monte Carlo simulations. The new procedures $APCSSA/APCSSM$ are comparable with existing competitors in all settings. Even though there is no single dominant test for detecting interaction effects in non-normal data, the nonparametric procedure $APCSSM$ is the one we recommend most highly for settings with Cauchy errors.

Keywords Aligned rank-based tests $\cdot$ Hypothesis testing $\cdot$ Non-normal errors $\cdot$ Nonparametric methods

1 Introduction

Factorial designs allow researchers to explore main effects and interactions. In particular, detecting interaction is crucial in data analysis because the presence of interaction can influence whether a factor appears significant. In a parametric setting, the usual test for interaction is an $F$ -test (Montgomery, 2020). However, for data collected in various fields and industries, errors often do not follow the normal distribution. The use of tests with known issues provides continued motivation for developing appropriate nonparametric tests for interaction in the two-way layout with multiple replications per cell. We consider the model for the general two-way layout and some different types of interaction. We focus on a balanced design with an equal number of replications per cell.

In the case of a two-way layout with balanced replications per cell, the general model is

Y_{ijk}=\alpha_{i}+\beta_{j}+\gamma_{ij}+\varepsilon_{ijk},\;i=1,...,I,\;j=1,...,J,\mbox{and}\;k=1,...,K,

(1)

where two factors $U$ and $V$ , having $I$ and $J$ levels respectively, are being investigated and $K$ is the number of replications per cell. The $Y_{ijk}$ ’s are $IJK$ observations which are mutually independent random variables. The error terms, denoted by the $\varepsilon_{ijk}$ ’s, are assumed to have a common median $\theta$ . For notation, $\alpha_{i}$ is the effect of the $i^{\text{th}}$ level of factor $U$ , $\beta_{j}$ is the effect of the $j^{\text{th}}$ level of factor $V$ , and $\gamma_{ij}$ is the effect of the interaction between the $i^{\text{th}}$ level of factor $U$ and the $j^{\text{th}}$ level of factor $V$ .
The hypotheses for the typical test of interaction, with no restrictions on $\alpha_{i}$ and $\beta_{j}$ , are

	$\displaystyle H_{0}$	$\displaystyle:\gamma_{ij}=0,\;i=1,\dots,I,\mbox{ and }j=1,\dots,J;$
	$\displaystyle H_{A}$	$\displaystyle:\;\gamma_{ij}\mbox{'s not all zero.}$

Stating that not all $\gamma_{ij}$ values are zero is only one way of stating that interaction is present. Multiple definitions of interaction have been developed; some are applicable only to the location-family model, while others are more general.

In this paper, we propose new aligned rank-based tests ( $APCSSA/APCSSM$ ) for interaction in the general two-way layout with balanced replications per cell. We start with a review of existing test procedures for interaction. Then, we set forth our procedures. Through Monte Carlo simulation power studies with seven competing tests for interaction, we demonstrate the consistent performance of $APCSSA/APCSSM$ in various settings. We conclude with a discussion of our findings and thoughts on future work in this area.

2 Review of test procedures for interaction

Researchers have developed a variety of procedures to test for interaction in the general two-way layout. The classical test procedure for determining the presence of interaction (as defined by non-zero $\gamma_{ij}$ ’s) in a parametric setting is the standard ANOVA $F$ -test (Montgomery, 2020). When the error terms in (1) are normally distributed, the two-way analysis of variance sum of squares identity partitions the variability and leads to an $F$ -test for our null hypothesis of no interaction. However, the $F$ -test assumptions of normality and common variance may not be met in all situations. When the assumptions necessary for performing an $F$ -test are not met, a nonparametric test should be used.

Multiple nonparametric tests for interaction exist in the literature. Conover and Iman (1976) suggest using the rank transform approach ( $RT$ ) in factorial settings. $RT$ replaces the observations with their joint ranks (using average ranks for ties) and then applies the parametric $F$ -test to the ranks. Another nonparametric test for interaction based on the rank transform approach is the aligned rank transform ( $ART$ ). With the $ART$ procedure, an alignment is performed in the rows and columns before the joint ranking of the response variables. Wobbrock, Findlater, Gergle, and Higgins (2011) align the data before performing an $RT$ -like analysis to study individual effects. Alignment is performed by computing the row and column averages and subtracting each from all corresponding entries in the individual rows and columns. After obtaining the joint ranks on the aligned data, the $F$ -statistic is calculated on the joint ranks.
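The alignment and joint-ranking steps behind $RT$ and $ART$ can be sketched in a few lines. The Python code below is an illustration only and is not the ARTool implementation: the function names, the nested-list layout `y[i][j][k]`, and the small data set are assumptions made for this example, and the alignment shown (subtract the row and column averages, add back the grand average) is one common way to align for the interaction effect.

```python
from statistics import mean


def midranks(values):
    """Rank a flat list from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1  # mean of the 1-based positions
        for pos in range(start, end + 1):
            ranks[order[pos]] = average_rank
        start = end + 1
    return ranks


def align_for_interaction(y):
    """Remove estimated row and column effects from y[i][j][k] (hypothetical layout)."""
    n_rows, n_cols = len(y), len(y[0])
    row_mean = [mean([v for cell in y[i] for v in cell]) for i in range(n_rows)]
    col_mean = [mean([v for i in range(n_rows) for v in y[i][j]])
                for j in range(n_cols)]
    grand = mean([v for row in y for cell in row for v in cell])
    return [[[v - row_mean[i] - col_mean[j] + grand for v in y[i][j]]
             for j in range(n_cols)] for i in range(n_rows)]


def joint_ranks(y):
    """Rank all IJK values together and return them in the same nested shape."""
    flat = [v for row in y for cell in row for v in cell]
    ranks = iter(midranks(flat))
    return [[[next(ranks) for _ in cell] for cell in row] for row in y]


# Made-up 2 x 2 layout with 2 replications per cell.
y = [[[1.0, 2.0], [4.0, 6.0]],
     [[3.0, 3.5], [9.0, 12.0]]]
rt_ranks = joint_ranks(y)                          # RT: rank the raw responses
art_ranks = joint_ranks(align_for_interaction(y))  # ART: rank the aligned responses
```

Either set of ranks would then be passed to the usual two-way ANOVA $F$ computation for the interaction term.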

The aligned rank transform has been found to perform better than the rank transform, but it still has problems with elevated Type I error rates and with nonnormal error terms (Richter, 1999; Luepsen, 2017). Mansouri and Chang (1995) performed a comparative analysis between the $F$ -test, rank transform, and two versions of the aligned rank transform. They conclude that the aligned rank tests are preferred over the other test procedures. They also note that all the test procedures they considered performed poorly in the Monte Carlo power study with errors drawn from a Cauchy distribution. In our power studies, we use the art procedure from the $ARTool$ package in $\mathsf{R}$ (Kay et al., 2021).

Salazar-Alvarez, Tercero-Gómez, Cordero-Franco, and Conover (2014) conduct a literature review and conclude that most of the techniques were based on $RT$ . In order to provide a more complete comparison of nonparametric tests for interaction, we will consider methods that do not stem from Conover and Iman’s $RT$ .

A different testing approach was proposed by De Kroon and Van Der Laan (1981), based on their definition of rank interaction. Their test statistic was developed to detect the presence of rank interaction in the two-way layout with multiple replications per cell. As noted by De Kroon and Van Der Laan (1981) and Hartlaub et al. (1999), the proposed statistic only works well in detecting a rank interaction of type $U^{*}(V)$ if no main effect for $U$ is present (and vice versa for detecting type $V^{*}(U)$ , which works well where no main effect for $V$ is present).

Another way to detect interaction in factorial designs is through the lens of linear regression, specifically rank-based linear models. Kloke and McKean (2012) proposed a rank-based analysis for all three hypotheses, covering the main effects and the interaction, based on a reduction of dispersion from the reduced to the full model. For the computations in this paper, we have utilized the raov function in the Rfit package to compute their test statistic for interaction (Kloke and McKean, 2012).

Lastly, Hartlaub, Dean, and Wolfe (1999) developed procedures to test for interaction in the two-way layout with one observation per cell. An invariance problem with their statistics was solved by Lehmann, Wolfe, Dean, and Hartlaub (Lehmann et al., 2001), who proposed symmetrized procedures. The symmetrized procedures, $S\text{-}SA$ (symmetrized statistics aligned by averages) and $S\text{-}SM$ (symmetrized statistics aligned by medians), are based on the statistics $CRA$ and $RCA$ , and $CRM$ and $RCM$ , respectively. In short, these procedures align within the rows or columns using averages or medians, and then rank within the other dimension. Reversing the aligning and ranking to create two statistics is similar to checking for both row and column concordance. The ranks are then combined in a cross-comparison framework to form appropriate statistics to test for interaction. A challenge for this procedure is that null means and variances for the statistics must be computed before the significance of the test statistic can be determined.

3 Nonparametric test for interaction proposal

Salazar-Alvarez, Tercero-Gómez, Cordero-Franco, and Conover (2014) recommend developing new nonparametric methods that are not based on $RT$ to detect interaction. The test statistics proposed by Hartlaub, Dean, and Wolfe (1999) were found to perform well in the two-way layout with a single replication per cell. Thus, we propose an extension of their technique for the case of balanced replications per cell. Our proposed statistics use the technique of crossed comparisons (Tukey, 1991) to detect the interactions. Initial investigations summarizing cell information into a single statistic (such as a median or mean) and then applying the methods from Hartlaub, Dean, and Wolfe (1999) did not perform as well as these extended methods where all possible comparisons were examined.

Our proposed statistics generalize the comparison idea with all possible comparisons. We eliminate nuisance effects by aligning with averages or medians to remove one of the nuisance effects (row or column) and ranking within the columns or rows to remove the other. We call the proposed statistics $APCSSA$ and $APCSSM$ . The names of the statistics come from the idea that they are extensions of $S\text{-}SA$ and $S\text{-}SM$ from Hartlaub, Dean, and Wolfe (1999), where we have added APC to stand for all possible comparisons.

Next, we describe how our proposed statistics, $APCSSA$ and $APCSSM$ , are computed. We begin with $APCSSA$ , the all possible comparisons extension of $S\text{-}SA$ . $APCSSA$ is the maximum of two standardized statistics, so we begin by outlining their calculation.

Step 1. Calculate $APCCRA$ , which stands for all possible comparisons (APC), column alignment (C), row ranking (R), using averages for the alignment (A). That is, we align within the columns using column averages, rank the aligned observations within the rows, and then use all possible comparisons to create the statistic. Writing $r_{ijk}$ for the resulting rank of the $k^{\text{th}}$ replicate in cell $(i,j)$ , we compute the $J(J-1)/2$ crossed comparisons $V_{jj'}$ , $1\leq j<j'\leq J$ , given by

V_{jj'}=\sum_{1\leq i<i'\leq I}\;\sum^{K}_{k_{1}=1}\sum^{K}_{k_{2}=1}\sum^{K}_{k_{3}=1}\sum^{K}_{k_{4}=1}\left\{\left(r_{ijk_{1}}+r_{i'j'k_{2}}\right)-\left(r_{i'jk_{3}}+r_{ij'k_{4}}\right)\right\}^{2}.

(2)

$APCCRA$ is the maximum of the $V_{jj'}$ ’s.

Step 2. Calculate $APCCRAD$ , which is just a scaled version of $APCCRA$ . Divide $APCCRA$ by $K^{4}I(I-1)/2$ , the number of summands in $APCCRA$ , to obtain this scaled version of the crossed comparisons for the maximum column comparison. That is,

APCCRAD=\frac{APCCRA}{K^{4}I(I-1)/2}.

(3)

Step 3. Calculate $APCRCA$ , which stands for all possible comparisons (APC), row alignment (R), column ranking (C), using averages for the alignment (A). Repeat Step 1, with alignment in the rows and ranking in the columns. $APCRCA$ is computed by taking the maximum of $I(I-1)/2$ possible row comparisons.

Step 4. Calculate $APCRCAD$ . $APCRCAD$ is computed by dividing $APCRCA$ by $K^{4}J(J-1)/2$ .

Step 5. Standardization. $APCCRAD$ and $APCRCAD$ are further standardized by subtracting the appropriate null mean and dividing by the appropriate null standard deviation. In order to do this, one must find

APCCRAD^{*}=\frac{APCCRAD-E_{0}(APCCRAD)}{\sqrt{V_{0}(APCCRAD)}},

(4)

\mbox{and}\;APCRCAD^{*}=\frac{APCRCAD-E_{0}(APCRCAD)}{\sqrt{V_{0}(APCRCAD)}},

(5)

where $E_{0}(APCCRAD)$ and $E_{0}(APCRCAD)$ are the null means of $APCCRAD$ and $APCRCAD$ respectively, and $V_{0}(APCCRAD)$ and $V_{0}(APCRCAD)$ are the null variances of $APCCRAD$ and $APCRCAD$ respectively. These null means and variances are computed via simulation and their values are available on GitHub at https://github.com/tranbaokhue/Rank-based-InteractionTest. Note that if $I=J$ , then by symmetry, the null means and null variances are equal and only one set needs to be computed.

Step 6. Calculate $APCSSA$ . $APCSSA$ is the maximum of $APCCRAD^{*}$ and $APCRCAD^{*}$ .
Alternatively, using medians to align the data instead of averages in our procedure yields $APCSSM$ . The common ties correction of using average ranks for ties should be used during the ranking process.
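Steps 1 through 6 can be sketched as follows. This Python code is a minimal illustration, not the authors' implementation: the function names and the nested-list layout `y[i][j][k]` are hypothetical, the ranking is assumed to run over all $JK$ aligned values within a row, and the null means and variances are supplied by the caller (for example, from the tabulated values) rather than computed here.

```python
from itertools import combinations, product
from math import sqrt
from statistics import mean, median


def midranks(values):
    """Rank a flat list from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def scaled_max_comparison(y, center=mean):
    """Steps 1-2 for y[i][j][k]: align within columns, rank within rows,
    take the largest crossed comparison and divide by K^4 I(I-1)/2."""
    n_rows, n_cols, n_reps = len(y), len(y[0]), len(y[0][0])
    col_center = [center([v for i in range(n_rows) for v in y[i][j]])
                  for j in range(n_cols)]
    ranks = []
    for i in range(n_rows):
        aligned_row = [v - col_center[j] for j in range(n_cols) for v in y[i][j]]
        row_ranks = midranks(aligned_row)  # assumption: ranks over all JK values in row i
        ranks.append([row_ranks[j * n_reps:(j + 1) * n_reps] for j in range(n_cols)])
    largest = 0.0
    for j, jp in combinations(range(n_cols), 2):
        v_jjp = 0.0
        for i, ip in combinations(range(n_rows), 2):
            # All K^4 choices of replicates (k1, k2, k3, k4) in equation (2).
            for r1, r2, r3, r4 in product(ranks[i][j], ranks[ip][jp],
                                          ranks[ip][j], ranks[i][jp]):
                v_jjp += ((r1 + r2) - (r3 + r4)) ** 2
        largest = max(largest, v_jjp)
    return largest / (n_reps ** 4 * n_rows * (n_rows - 1) / 2)


def apcss(y, null_col, null_row, center=mean):
    """Steps 3-6. null_col and null_row are (null mean, null variance) pairs
    supplied by the caller; center=median gives the median-aligned analogue."""
    transposed = [list(col) for col in zip(*y)]  # swap the roles of rows and columns
    col_stat = scaled_max_comparison(y, center)
    row_stat = scaled_max_comparison(transposed, center)
    z_col = (col_stat - null_col[0]) / sqrt(null_col[1])
    z_row = (row_stat - null_row[0]) / sqrt(null_row[1])
    return max(z_col, z_row)
```

Transposing the layout lets one function serve both the column-aligned and row-aligned statistics, which mirrors the instruction in Step 3 to repeat Step 1 with the roles of rows and columns exchanged. The quadruple sum costs on the order of $K^{4}$ operations per pair of rows and pair of columns, which is one reason the statistics are computationally intensive.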

4 Simulations

4.1 Simulation settings and notes

With the increase in $\mathsf{R}$ packages used to analyze two-way ANOVA models, Feys (2016) cautions researchers against choosing a test based on the $p$ -value it produces for a given dataset. In order to compare the proposed statistics with existing competitors, we performed a simulation power study with the $F$ , $RT$ , $ART$ , $DEKR$ (the rank-interaction statistic of De Kroon and Van Der Laan, 1981), and raov statistics, and our two proposed statistics $APCSSA$ and $APCSSM$ . Multiple settings, with factor levels from 2 to 6 and replications between 2 and 9, were investigated and in this section we show selected results from four settings: $3\times 2\times 3$ , $3\times 3\times 3$ , $3\times 4\times 2$ , and $4\times 6\times 2$ , where the format $I\times J\times K$ gives the number of levels for each factor and the number of replications per cell.

The null distributions for $APCSSA/APCSSM$ were derived for all settings with 2 to 6 levels in each main factor and the number of replications per cell ranging from 1 to 5 using Monte Carlo simulation. The null distributions were based on 100,000 computations of the statistics with no interaction or main effects present and using normal error terms. In cases where symmetrization was needed, e.g. $APCSSA$ and $APCSSM$ , two sets of 100,000 computations were used. The first set of 100,000 was used to determine expected values and variances for use in the symmetrizations, and the second set was used to determine the null distribution of the symmetrized statistics. During the ranking process, the common ties correction of using average ranks was employed for our new proposed procedures, $APCSSA$ and $APCSSM$ . Expected values and variances as well as critical values for our proposed statistics are available through GitHub (https://github.com/tranbaokhue/Rank-based-InteractionTest).
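The two-stage Monte Carlo scheme described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the code used for the published tables: the function names are hypothetical, the sketch standardizes a single component statistic (the symmetrized statistics would take the maximum of two standardized components before the quantile is computed), and `cell_mean_range` is only a stand-in statistic so that the example runs on its own.

```python
import math
import random
from statistics import mean, variance


def simulate_null_layout(n_rows, n_cols, n_reps, rng):
    """One I x J x K layout with no main effects, no interaction, N(0,1) errors."""
    return [[[rng.gauss(0.0, 1.0) for _ in range(n_reps)]
             for _ in range(n_cols)] for _ in range(n_rows)]


def null_draws(statistic, dims, reps, rng):
    """Evaluate the statistic on `reps` independent null layouts."""
    return [statistic(simulate_null_layout(*dims, rng)) for _ in range(reps)]


def upper_critical_value(draws, alpha=0.05):
    """Empirical (1 - alpha) quantile of the simulated null distribution."""
    ordered = sorted(draws)
    index = math.ceil((1 - alpha) * len(ordered)) - 1
    return ordered[min(max(index, 0), len(ordered) - 1)]


def two_stage_null(statistic, dims, reps, seed=1):
    """Stage 1 estimates the null mean and variance; stage 2 uses fresh draws,
    standardized with the stage-1 moments, to estimate the critical value."""
    rng = random.Random(seed)
    first = null_draws(statistic, dims, reps, rng)
    null_mean, null_var = mean(first), variance(first)
    second = [(d - null_mean) / math.sqrt(null_var)
              for d in null_draws(statistic, dims, reps, rng)]
    return null_mean, null_var, upper_critical_value(second)


def cell_mean_range(y):
    """Stand-in statistic (NOT one of the proposed statistics)."""
    cell_means = [mean(cell) for row in y for cell in row]
    return max(cell_means) - min(cell_means)


# Small demonstration; the paper uses 100,000 computations per stage.
null_mean, null_var, critical = two_stage_null(cell_mean_range, (3, 2, 3), reps=500)
```

Using separate draws for the moments and for the quantile keeps the critical value from being estimated on the same sample that was used to standardize it.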

We used Monte Carlo simulation to perform the power comparisons for all statistics at the $0.05$ significance level. Multiple main effects were chosen for each factor, and two types of interaction were studied: product interaction and specific interaction. Product interaction is interaction where the $\gamma_{ij}$ ’s are related to the main effects through $\gamma_{ij}=\lambda\alpha_{i}\beta_{j}$ , where $\lambda$ is a general scaling factor (Tukey, 1949). In specific interaction, $2\times 2$ matrices

\begin{bmatrix}c&-c\\ -c&c\end{bmatrix}

where $c\in\mathbb{R}$ , are embedded in the rows or columns. Details on the main effects and interaction effects examined are available through GitHub at https://github.com/tranbaokhue/Rank-based-InteractionTest. We note that main effects are absent for each factor at factor level 1.
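To make the two interaction types concrete, the following Python sketch builds the interaction matrix in each case. The function names are hypothetical, the values $\lambda=2$ and $c=1.5$ are arbitrary choices for illustration, and placing a single $2\times 2$ block in the first two rows and first two columns is an assumption about how the block is embedded.

```python
def product_interaction(alpha, beta, lam):
    """Product interaction: gamma[i][j] = lam * alpha[i] * beta[j]."""
    return [[lam * a * b for b in beta] for a in alpha]


def specific_interaction(n_rows, n_cols, c, top=0, left=0):
    """Embed the block [[c, -c], [-c, c]] at (top, left); all other cells are zero.
    The default position (first two rows and columns) is an assumption."""
    gamma = [[0.0] * n_cols for _ in range(n_rows)]
    gamma[top][left] = c
    gamma[top][left + 1] = -c
    gamma[top + 1][left] = -c
    gamma[top + 1][left + 1] = c
    return gamma


# Main effects taken from the 3 x 4 x 2 example setting; lam and c are made up.
alpha = (-1.0, 0.0, 1.0)
beta = (-1.0, -0.5, 0.5, 1.0)
gamma_product = product_interaction(alpha, beta, lam=2.0)
gamma_specific = specific_interaction(4, 6, c=1.5)
```

In the specific-interaction matrix every row and every column sums to zero, so the block changes the cell means without changing the row or column averages.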

When generating the data, error terms were drawn from the $\text{Normal}(0,1)$ , $\text{Uniform}(-2,2)$ , $\text{Exponential}(1)$ , $\text{Double Exponential}\left(0,\frac{1}{\sqrt{2}}\right)$ , or $\text{Cauchy}(0,1)$ distributions. For each combination of main effects, interaction effects, and error terms, we conducted 10,000 simulations for power comparisons. We used varying interaction effects, including no interaction, to obtain the full power curves for each procedure under each setting. All simulations were computed with R Statistical Software (v.4.3.1; R Core Team 2023).
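A data-generation and rejection-rate loop of the kind described above might look like the Python sketch below. It is not the authors' R code: the function names are hypothetical, the second parameter of the double exponential distribution is assumed to be a scale parameter (giving unit variance), and `rejects` stands for any user-supplied function that returns True when a given test rejects at the chosen level.

```python
import math
import random


def draw_error(name, rng):
    """Draw one error term from the named distribution."""
    if name == "normal":
        return rng.gauss(0.0, 1.0)
    if name == "uniform":
        return rng.uniform(-2.0, 2.0)
    if name == "exponential":
        return rng.expovariate(1.0)
    if name == "double_exponential":
        # Difference of two Exp(1) draws is Laplace(0, 1); rescale to scale 1/sqrt(2).
        scale = 1.0 / math.sqrt(2.0)
        return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))
    if name == "cauchy":
        # Inverse-CDF draw from the standard Cauchy distribution.
        return math.tan(math.pi * (rng.random() - 0.5))
    raise ValueError(f"unknown error distribution: {name}")


def simulate_layout(alpha, beta, gamma, n_reps, error, rng):
    """Generate y[i][j][k] = alpha[i] + beta[j] + gamma[i][j] + error."""
    return [[[alpha[i] + beta[j] + gamma[i][j] + draw_error(error, rng)
              for _ in range(n_reps)]
             for j in range(len(beta))]
            for i in range(len(alpha))]


def rejection_rate(rejects, alpha, beta, gamma, n_reps, error, n_sims, seed=1):
    """Fraction of simulated layouts on which `rejects` (hypothetical) returns True.
    With gamma all zero this estimates the Type I error rate; otherwise, power."""
    rng = random.Random(seed)
    hits = sum(bool(rejects(simulate_layout(alpha, beta, gamma, n_reps, error, rng)))
               for _ in range(n_sims))
    return hits / n_sims
```

Evaluating `rejection_rate` over a grid of interaction strengths, starting from no interaction, traces out one power curve for one test and one error distribution.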

4.2 Simulation results

Our selected power comparison results are shown via figures. Figures 1 and 4 are from the $3\times 4\times 2$ setting, while Figures 3 and 5 are from the $3\times 2\times 3$ and $3\times 3\times 3$ settings, respectively. Finally, Figures 2 and 6 are from the $4\times 6\times 2$ setting. While Figures 1, 3, 4, and 5 include power curves from settings with product interaction, Figures 2 and 6 present the powers of the tests when the $2\times 2$ matrices of specific interaction are embedded in the first two columns of the simulated data. Both interaction types as well as all five distributions of errors are shown in our selected results.

In Figure 1, with normal errors, even though $ART$ has slightly higher power than the $F$ -test, $ART$ suffers from inflated Type I error. Thus, we still recommend the $F$ -test for data with normal errors over the others (in order from second-best to worst, $APCSSA$ , raov, $APCSSM$ , $RT$ , and $DEKR$ ). Our new proposed procedure $APCSSA$ is comparable to the $F$ -test in terms of power. We also quickly see that the $DEKR$ statistic does not do well when both factors have main effects.

In Figure 2, with uniform errors, we can see that the main effects clearly nullify the ability of $DEKR$ to detect interaction. We can also see that $APCSSA$ performs the best, followed closely by the $F$ -test and the others.

Figure 1: Power curves in the $3\times 4\times 2$ setting with product interaction ( $\gamma_{ij}=\lambda\alpha_{i}\beta_{j}$ ), $\bm{\alpha}=(-1,0,1)$ , $\bm{\beta}=(-1,-0.5,0.5,1)$ , and standard normal errors.

In Figure 3, when the errors follow an exponential distribution, raov is the most powerful but it has an extremely elevated Type I error rate. Following closely behind are the $ART$ and the proposed $APCSSA$ . These two are powerful, with only slightly inflated significance levels. Even though $APCSSA$ is comparable, we would generally recommend comparing with the $F$ -test results. Facing double exponential errors, in Figure 4, we can see that $ART$ is the most powerful, but $APCSSA$ , raov, and the $F$ -test all have roughly the same empirical powers. From both of these figures, it is clear that $RT$ and $DEKR$ are easily influenced by main effects.

Finally, as seen in Figure 5 and Figure 6, when faced with the challenges of Cauchy errors, our proposed test $APCSSM$ performs the best in various settings (both product interaction and specific interaction). Kloke and McKean’s raov might be more powerful, but when there is no interaction, its rejection rate remains well above 0.1. We can also see how the $F$ -test, $DEKR$ , $APCSSA$ , and $ART$ perform when the data have many outliers.

Our results for other combinations of settings, main effects, interactions, and error terms are consistent with these selected results. In summary, the new proposed statistics perform very well in situations with Cauchy or double exponential errors, so we advocate their use to detect interaction in these settings.

5 Discussion and Future Work

Our simulation studies have verified previous work showing that $DEKR$ suffers from the introduction of row main effects and is not recommended for interaction detection in the balanced replications per cell setting. With Conover’s $RT$ approach, the resulting statistic does not compete well with the $F$ -test and suffers elevated Type I error rates when the error terms come from nonnormal distributions. Although Kloke and McKean’s raov and $ART$ are much more powerful than $DEKR$ and $RT$ , their rejection rates in various settings can be twice the significance level of 0.05. Our proposed statistics $APCSSA$ and $APCSSM$ perform well in settings with exponential and Cauchy error terms, respectively. Additionally, when using $APCSSA$ , not much power is lost compared to the $F$ -test when error terms are from the normal or uniform distributions, and $APCSSM$ was the best-performing procedure in our simulations for detecting interactions in data with Cauchy errors. In conclusion, we recommend using $APCSSA$ and $APCSSM$ to detect interaction in the two-way layout with balanced replications per cell.

While we have demonstrated that our statistics work well to detect interaction in these settings, future work in this area remains. Further validation of these results with the power comparisons from settings with more replications per cell may be enlightening. Additionally, we have work in progress to develop $\mathsf{R}$ code so that the new statistics, which are computationally intensive, may be easily accessed by interested researchers. It is also natural to consider extending these techniques to the general two-way layout with an unequal number of replications per cell.

Acknowledgments

We are grateful to Jessica Jeong for her meticulous verification of $\mathsf{R}$ code and scripts. We would like to thank Amherst College and Kenyon College for supporting our summer research projects.

References

Conover and Iman [1976] W. Conover and R. Iman. On some alternative procedures using ranks for the analysis of experimental designs. Communications in Statistics - Theory and Methods, 5(14):1349–1368, Jan. 1976. ISSN 0361-0926, 1532-415X. doi:10.1080/03610927608827447. URL http://dx.doi.org/10.1080/03610927608827447.
De Kroon and Van Der Laan [1981] J. De Kroon and P. Van Der Laan. Distribution-free test procedures in two-way layouts; a concept of rank-interaction. Statistica Neerlandica, 35(4):189–213, Dec. 1981. ISSN 0039-0402, 1467-9574. doi:10.1111/j.1467-9574.1981.tb00730.x. URL http://dx.doi.org/10.1111/j.1467-9574.1981.tb00730.x.
Feys [2016] J. Feys. Nonparametric Tests for the Interaction in Two-way Factorial Designs Using R. The R Journal, 8(1):367, 2016. ISSN 2073-4859. doi:10.32614/RJ-2016-027.
Hartlaub et al. [1999] B. A. Hartlaub, A. M. Dean, and D. A. Wolfe. Rank-based test procedures for interaction in the two-way layout with one observation per cell. Canadian Journal of Statistics, 27(4):863–874, Dec. 1999. ISSN 0319-5724, 1708-945X. doi:10.2307/3316137.
Kay et al. [2021] M. Kay, L. A. Elkin, J. J. Higgins, and J. O. Wobbrock. Mjskay/ARTool: ARTool 0.11.0. Zenodo, Apr. 2021.
Kloke and McKean [2012] J. D. Kloke and J. W. McKean. Rfit: Rank-based Estimation for Linear Models. The R Journal, 4(2):57, 2012. ISSN 2073-4859. doi:10.32614/RJ-2012-014.
Lehmann et al. [2001] J. Lehmann, D. Wolfe, and B. A. Hartlaub. Rank-based procedures for analysis of factorial effects. Recent Advances in Experimental Designs and Related Topics, 2001.
Luepsen [2017] H. Luepsen. The aligned rank transform and discrete variables: A warning. Communications in Statistics - Simulation and Computation, 46(9):6923–6936, Oct. 2017. ISSN 0361-0918, 1532-4141. doi:10.1080/03610918.2016.1217014.
Mansouri and Chang [1995] H. Mansouri and G.-H. Chang. A comparative study of some rank tests for interaction. Computational Statistics & Data Analysis, 19(1):85–96, Jan. 1995. ISSN 01679473. doi:10.1016/0167-9473(93)E0045-6.
Montgomery [2020] D. C. Montgomery. Design and Analysis of Experiments. Wiley, Hoboken, NJ, tenth edition, 2020. ISBN 978-1-119-49247-4 978-1-119-49244-3.
Richter [1999] S. J. Richter. Nearly exact tests in factorial experiments using the aligned rank transform. Journal of Applied Statistics, 26(2):203–217, Feb. 1999. ISSN 0266-4763. doi:10.1080/02664769922548.
Salazar-Alvarez et al. [2014] M. I. Salazar-Alvarez, V. G. Tercero-Gómez, A. E. Cordero-Franco, and W. J. Conover. Nonparametric analysis of interactions: A review and gap analysis. IIE Annual Conference and Expo 2014, pages 2910–2917, 01 2014.
Tukey [1991] J. W. Tukey. The philosophy of multiple comparisons. Statistical Science, 6:100–116, 1991.
Wobbrock et al. [2011] J. O. Wobbrock, L. Findlater, D. Gergle, and J. J. Higgins. The aligned rank transform for nonparametric factorial analyses using only anova procedures. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 143–146, Vancouver BC Canada, May 2011. ACM. ISBN 978-1-4503-0228-9. doi:10.1145/1978942.1978963.