These authors contributed equally to this work.
These authors contributed equally to this work.
[2]\fnmYang Young \surLu
1]\orgdivComputer Science and Engineering Division, \orgnameUniversity of Michigan, \cityAnn Arbor, \stateMichigan, \countryUSA
2]\orgdivCheriton School of Computer Science, \orgnameUniversity of Waterloo, \cityWaterloo, \stateOntario, \countryCanada
3]\orgdivDepartment of Genome Sciences and Paul G. Allen School of Computer Science and Engineering, \orgnameUniversity of Washington, \citySeattle, \stateWashington, \countryUSA
Error-controlled non-additive interaction discovery in machine learning models
Abstract
Machine learning (ML) models are powerful tools for detecting complex patterns within data, yet their “black box” nature limits their interpretability, hindering their use in critical domains like healthcare and finance. To address this challenge, interpretable ML methods have been developed to explain how features influence model predictions. However, these methods often focus on univariate feature importance, overlooking the complex interactions between features that ML models are capable of capturing. Recognizing this limitation, recent efforts have aimed to extend these methods to discover feature interactions, but existing approaches struggle with robustness and error control, especially under data perturbations.
In this study, we introduce Diamond, a novel method for trustworthy feature interaction discovery. Diamond uniquely integrates the model-X knockoffs framework to control the false discovery rate (FDR), ensuring that the proportion of falsely discovered interactions remains low. We further address the challenges of using off-the-shelf interaction importance measures by proposing a calibration procedure that refines these measures to maintain the desired FDR. Diamond’s applicability spans a wide range of ML models, including deep neural networks, tree-based models, and factorization-based models. Our empirical evaluations on both simulated and real datasets across various biomedical studies demonstrate Diamond’s utility in enabling more reliable data-driven scientific discoveries. This method represents a significant step forward in the deployment of ML models for scientific innovation and hypothesis generation.
keywords:
Explainable AI, Interpretable AI, Interaction Detection, False Discovery Rate, Knockoffs1 Introduction
Machine learning (ML) has emerged as a critical tool in many application domains, largely due to its ability to detect subtle relationships and patterns within complex data [40]. While the complexity of ML models contributes to their power, it also makes them challenging to interpret, leaving users with few clues about the mechanisms underlying a given model’s outputs. Consequently, this “black box” nature of ML models, particularly deep neural networks, has hindered their applicability in error-intolerant domains like healthcare and finance. Stakeholders such as clinicians need to understand why and how the models make predictions before making important decisions, such as disease diagnosis [29]. Importantly, without providing insight into their internal mechanisms, ML models cannot be effectively used for making data-driven scientific discoveries, which are crucial for gaining human-understandable insights and driving successful innovation [1].
To enhance the interpretability of ML models for better data-driven scientific discoveries, interpretable ML methods have been developed to elucidate the internal mechanisms of these models [43]. These methods help to elucidate how individual features influence prediction outcomes by assigning an importance score to each feature so that higher scores indicate greater relevance to the prediction [48, 47, 36, 50, 34]. However, these univariate interpretations overlook a primary advantage of ML models: their ability to model complex interactions between features in a data-driven way. In fact, input features usually do not work individually within an ML model but cooperate with other features to make inferences jointly [52]. For example, it is well established in biology that genes do not operate in isolation but work together in co-regulated pathways with additive, cooperative, or competitive interactions [32]. Additionally, gene-gene, gene-disease, gene-drug, and gene-environment interactions are critical in explaining genetic mechanisms, diseases, and drug effects [56].
Recognizing the limitations of univariate interpretations, efforts have been made to extend interpretable ML methods to discover interactions among features. These methods attribute the prediction influence to feature pairs and then rank candidate feature pairs from a trained ML model, with highly ranked pairs indicating higher importance [52, 53, 11, 37, 22, 51, 7, 26, 58]. However, it is important to note that these approaches characterize feature pairs where both features are individually important for a model’s prediction, rather than capturing the synergistic or interactive effects between the two features [54] (see the detailed explanation in Sec.4.2). Furthermore, the induced ranked list of feature pairs must be cut off at a certain confidence level for use in scientific discovery and hypothesis validation [52]. However, selecting this threshold is typically under user control and must be set arbitrarily. Worse still, existing methods are sensitive to perturbations, in the sense that even imperceivable, random perturbations of the input data may lead to dramatic changes in the importance ranking [19, 25, 35].
From a practitioner’s perspective, a given set of discovered feature interactions is scientifically valuable only if a systematic strategy exists to prioritize and select relevant interactions in a robust and error-controlled manner, even in the presence of noise. Although many methods have been developed for feature interaction discovery, we are not aware of any previous attempts to conduct discovery while explicitly estimating and controlling the discovery error. Without this, accurate and reliable findings cannot be achieved. In this study, we introduce an error-controlled interaction discovery method named Diamond (Discovering interactions in machine learning models with a controlled error rate). Here, the error is quantified by the false discovery rate (FDR) [4], which informally represents the expected proportion of falsely discovered interactions among all discovered interactions. A false discovery is a feature interaction that is discovered but not truly relevant.
Three components of Diamond are novel (Fig. 1). First, Diamond achieves FDR control by leveraging the model-X knockoffs framework [2, 6]. The core idea of this framework is to generate dummy features that perfectly mimic the empirical dependence structure among the original features while being conditionally independent of the response given the original features. Second, we discover that naively using off-the-shelf feature interaction importance measures cannot correctly control the FDR. To address this issue, we propose a calibration procedure to distill non-additive or interactive effects from the reported interaction importance measures from existing methods, thereby maintaining FDR control at the target level. Third, Diamond is applicable to a wide range of ML models, including deep neural networks (DNNs), tree-based models, and factorization-based models. We have applied Diamond to various simulated and real datasets to demonstrate its empirical utility. Practically speaking, Diamond paves the way for the wider deployment of machine learning models in scientific discovery and hypothesis generation.

2 Results
2.1 Diamond overview
Consider a prediction task where we have independent and identically distributed (i.i.d.) samples and , denoting the data matrix with -dimensional features and the corresponding response, respectively. The response of the prediction task can be categorical labels such as disease status or numerical measurements such as body mass index, and the features could include gene expression data, microbial taxa abundance profiles, and more. The prediction task can be described using an ML model that maps from the input to the response . When modeling the task, the function learns the dependence structure of on , enabling effective prediction with the fitted ML model.
In this study, we focus on interpreting the fitted ML model for data-driven scientific discovery by identifying the non-additive feature interactions that contribute most to the model’s predictions. We say that is a non-additive interaction of function if and only if cannot be decomposed into an addition of subfunctions , each of which excludes a corresponding interaction feature [49, 52], i.e., . For example, the multiplication between two features and is a non-additive interaction because it cannot be decomposed into a sum of univariate functions, i.e., . Assume that there exists a group of interactions such that conditional on interactions , the response is independent of interactions in the complement . In this setting, our goal is to accurately discover feature interactions in without erroneously reporting too many incorrect interactions in .
Diamond is designed to discover feature interactions from fitted ML models while maintaining a controlled FDR [4]. For a set of feature interactions discovered by some interaction detection method, the FDR is defined as:
Commonly used procedures, such as the Benjamini–Hochberg procedure [4], achieve FDR control by working with p-values computed against some null hypothesis. In the interaction discovery setting, for each feature interaction, one tests the significance of the statistical association between the specific interaction and the response, either jointly or marginally, and obtains a p-value under the null hypothesis that the interaction is irrelevant. These p-values are then used to rank the features for FDR control. However, controlling FDR in generic machine learning models, especially deep learning models, is challenging because, to our knowledge, the field lacks a method for producing meaningful p-values that reflect interaction importance. To bypass the use of p-values but still achieve FDR control, we draw inspiration from the model-X knockoffs framework [2, 6]. The core idea of knockoffs is to generate dummy features that perfectly mimic the empirical dependence structure among the original features but are conditionally independent of the response given the original features. These knockoff features can then be used as a control by comparing the feature importance between the original features and their knockoff counterparts to estimate FDR (Fig. 1a).
Diamond trains a generic ML model that takes as input an augmented data matrix , created by concatenating the original data matrix with its knockoffs . After training, Diamond quantifies feature interactions by interpreting the trained model and produces a ranked list of these interactions. Diamond subsequently uses a calibration procedure to distill non-additive or interactive effects from the reported interaction importance measures from existing methods, thereby maintaining FDR control at the target level (Fig. 1b). The calibration process is designed to remove both prediction-dependent marginal effects and prediction-independent feature biases from the reported feature interactions, leaving only the prediction-dependent non-additive interactive effects (Fig. 1c).
2.2 Diamond discovers FDR-controlled interactions on simulated datasets

We started by evaluating the performance of Diamond on simulated datasets, assessing its ability to identify important non-additive interactions while controlling the FDR. We benchmarked Diamond on a test suite of 10 simulated datasets generated by different simulation functions proposed by [52]. These datasets contain a mixture of univariate functions and multivariate interactions, exhibiting varied order, strength, and nonlinearity (Fig. 2a). Since our goal is to detect pairwise interactions, high-order interaction functions (e.g., ) are decomposed into pairwise interactions (e.g., , , and ) to serve as the ground truth.
Following the settings used in [52], we employed a sample size of , equally divided into training and test sets. In addition, the number of features is set at , and all features are sampled randomly from a continuous uniform distribution, . Only the first out of features contribute to the corresponding response, while the remaining features serve as noise to increase the task’s complexity. For robustness, we repeated the experiment 20 times for each simulated dataset using different random seeds. Each repetition involved data generation, knockoff generation using KnockoffsDiagnostics [5], ML model training, and interaction-wise FDR estimation. For all simulation settings, we reported the mean performance with 95% confidence intervals, fixing the target FDR level at .
Our analysis shows that Diamond consistently identifies important non-additive interactions with controlled FDR across various ML models (Fig. 2b). Under controlled FDR, the multi-layer perceptron (MLP) model demonstrates better performance in identifying important interactions than other ML models, as measured by the statistical power and the area under the receiver operating characteristic curve (AUROC). This superior performance is attributed to the inclusion of a knockoff-tailored plugin pairwise-coupling layer design [33], which has been shown to maximize statistical power (See details in Sec. S.1). We discovered that the proposed calibration procedure (Sec 4.2) is critical; without it, the FDR cannot be controlled by naively using reported interaction importance values from existing methods (Fig. 2b and Fig. S.2).
To gain insight into the FDR control failure in the absence of calibration, we conducted a qualitative comparison assessing interaction importance before and after calibration using the simulation function (Fig. 2c and Fig. S.3). The primary cause of the FDR control failure appears to lie in the distributional disparity between original-only interactions (i.e., interactions involving two features from the original feature set rather than knockoffs) and interactions involving knockoffs. This observation suggests a violation of the knockoff filter’s assumption in controlling the FDR (See discussion in Sec 4.2). The proposed calibration procedure mitigates the disparity by extracting non-additive interactive effects from the reported interaction importance measures, thereby enhancing the utility of knockoff-involving interactions as a negative control for FDR estimation.
Finally, we verified whether alternative baseline methods can accurately identify important non-additive interactions with controlled FDR. We compared two baseline methods for FDR estimation: one based on permutation-based interaction-wise p-values coupled with the Benjamini-Hochberg procedure, and the other representing interaction-wise FDR as the aggregation of feature-wise FDR (See details in Sec. 4.6). Our analysis shows that neither method correctly controls the FDR. This greatly reduces the utility of these methods, despite their reported high power and AUROC.
2.3 Diamond is robust to various knockoffs and importance measures

The design of Diamond incorporates several key components, including knockoff generation and interaction importance measures. To demonstrate the robustness of Diamond in FDR control, we conducted a control study where we modified Diamond by replacing these two components with alternative solutions. Specifically, we considered two variants of Diamond using the MLP model. In the first setting, we replaced KnockoffsDiagnostics with three alternatives: KnockoffGAN [23], Deep knockoffs [42] or VAE knockoffs [31], while keeping the MLP model and the interaction importance measure (i.e., Expected Hessian [14]) unchanged. In the second setting, we replaced the interaction importance measure, Expected Hessian, with two alternatives: Integrated Hessian [22] and a model-specific interaction importance measure derived from the MLP weights (See details in Sec. S.2), while keeping the MLP model and the knockoff generation unchanged. For each variant, we applied Diamond to the test suite of 10 simulated datasets using the same settings.
The results indicate that Diamond is robust to various knockoff designs and importance measures. In the first study, our analysis shows that Diamond consistently discovers important non-additive interactions with controlled FDR across various knockoff designs (Fig. 3a). Despite the methodological differences in knockoff generation between KnockoffsDiagnostics, KnockoffGAN, Deep knockoffs, and VAE knockoffs, Diamond achieves comparable statistical power and AUROC in each case. This result demonstrates the Diamond’s stability in identifying important interactions in knockoff-based FDR estimation.
In the second study, our analysis shows that Diamond discovers important non-additive interactions with controlled FDR across various interaction importance measures (Fig. 3b). Despite the methodological differences between Expected Hessian, Integrated Hessian, and model-specific measures, which involve mechanistic disparities in elucidating the relationships between features and responses, Diamond achieves valid FDR control while maintaining reasonably good statistical power and AUROC. This result demonstrates Diamond’s flexibility in identifying important interactions based on interaction-wise importance ranking.
Finally, it is worth mentioning that regardless of the knockoff generation and interaction importance measures we used, there is room for improvement in achieving better statistical power while maintaining controlled FDR. Diamond tends to be conservative by overestimating the FDR, suggesting that an improved FDR control procedure could potentially boost statistical power.
2.4 Diamond identifies interactions to explain disease progression

We then evaluated the performance of Diamond on a real dataset [13], assessing its ability to identify important interactions in the progression of disease. We used a quantitative study with diabetes patients, each characterized by standardized baseline features, including age, sex, body mass index, average blood pressure, and six blood serum measurements (Fig. 4b). The task is to construct an ML model to predict the response of interest, which is a quantitative measure of disease progression one year after the baseline. We considered four different types of ML models: MLP, LightGBM, XGBoost, and factorization machines (FM). For robustness, we repeated each experiment 20 times using different random seeds. Each repetition involved knockoff generation, ML model training, and interaction-wise FDR estimation.
Our analysis shows that Diamond consistently identifies the important interaction between body mass index (BMI) and serum triglycerides level (STL) across three of the four different ML models (Fig. 4a). The only exception is the FM method, for which Diamond fails to identify any important interactions. We investigated the identified interaction via literature search and found that high BMI and high STL are associated with an increased risk of diabetes, as reported by multiple studies [17, 59, 20]. The literature evidence is further supported by the qualitative evaluation (Fig. 4c). Specifically, we examined the identified interaction in four ways: the contribution of feature values to response prediction, the marginal importance measure, the interaction importance measure, and the contribution of the marginal importance measures to the interaction importance measure. This analysis showed that high BMI and high STL contribute to more severe progression of diabetes. Meanwhile, higher STL further reinforces the contribution of BMI to the progression of diabetes synergistically.
Finally, we find that Diamond does not necessarily report a lower FDR for interactions composed of two marginally important features. All baseline features from diabetes patients are effective in predicting disease progression, as measured by the Expected Gradient scores in the MLP model (Fig. 4b). Among these features, in addition to BMI and STL, blood pressure, blood sugar level, and high-density lipoproteins are more predictive than others. It is important to note that Diamond still associates most interactions composed of these highly predictive features with a high FDR threshold.
2.5 Diamond investigates enhancer activity in Drosophila embryos

We then evaluated the performance of Diamond on a Drosophila enhancer dataset [3] to investigate the relationship between enhancer activity and DNA occupancy for transcription factor (TF) binding and histone modifications in Drosophila embryos. We used a quantitative study of DNA occupancy for TFs and histone modifications with the enhancer status labeled for genomic sequence samples from blastoderm Drosophila embryos (Fig. S.4). The enhancer status for each sequence is binarized as the response, depending on whether the sequence drives patterned expression in blastoderm embryos. To predict the enhancer status, the maximum value of normalized fold-enrichment [27] of ChIP-seq and ChIP-chip assays for each TF or histone modification served as our features. For robustness, we repeated the experiment 20 times using different random seeds.
We started by comparing the identified interactions against a list of well-characterized interactions in the early Drosophila embryos, summarized by [3]. If Diamond is successful in identifying important interactions while controlling the FDR, then the identified interactions should considerably overlap with the ground truth list. Our results show that three of the top five interactions reported by Diamond are present in the ground truth list for MLP and tree-based models, while four out of five interactions are present in the ground truth list for FM (Fig. 5a). In comparison, the baseline methods exhibit significantly different behaviors. One method, which is based on permutation-based interaction-wise p-values and the Benjamini-Hochberg procedure, reported no ground truth interactions among its top five identifications. Another method, which aggregates feature-wise FDR, reported both ground truth along with a large number of non-ground truth interactions.
Next, we investigated the identified interactions that are not included in the ground truth list. For example, we investigated the interaction between the TFs Snail (UniProt ID: P08044) and Twist (UniProt ID: P10627) identified by Diamond+MLP through databases containing assorted experimentally verified interactions [30] and literature [21]. Research indicates that Snail represses Twist targets because their target sequences are very similar, and their binding is mutually exclusive. This literature evidence is further supported by a qualitative evaluation (Fig. 5b-d). Specifically, we examined the Snail–Twist interaction in terms of the contribution of feature values to the enhancer status prediction, observing that high Twist expression suppresses the contribution of Snail to enhancer activation.
Furthermore, we investigated the remaining identifications that lack supporting ground truth or literature evidence. We discovered that these interactions can be logically explained through transitive effects. Specifically, even without a direct interaction between TF1 and TF3, we might still expect a strong interaction between them if solid interactions exist between TF1 and TF2 and between TF2 and TF3. On this basis, for example, the interaction between the TFs Twist (UniProt ID: P10627) and Zeste (UniProt ID: P09956), identified by Diamond+MLP, can be classified as a transitive interaction, supported by experimentally validated interactions [57]: (1) the interaction between Twist and Ultrabithorax (UniProt ID: P83949), Decapentaplegic (UniProt ID: P07713), Pho (UniProt ID: Q8ST83), Raf (UniProt ID: P11346), Psc (UniProt ID: P35820), Trr (UniProt ID: Q8IRW8) and (2) the interaction between these proteins and Zeste. Analogously, the interaction between the TFs Twist (UniProt ID: P10627) and Dichaete (UniProt ID: Q24533), identified by Diamond together with two tree-based models, can also be supported by transitive interactions [57]: (1) the interaction between Twist and Pho, Raf, Psc, Trr, Myc (UniProt ID: Q9W4S7), Trx (UniProt ID: P20659 ) and (2) the interaction between these proteins and Dichaete.
Finally, Diamond is able to identify non-additive interactions with various interactive patterns (Fig. 5b-d). Among the top five interactions reported by Diamond with MLP, the Zelda–Twist interaction, the Bicoid–Twist interaction, and the Krueppel–Twist interaction exhibit synergistic effects. The high expression of both TFs in these interactions reinforces the contribution of each individual factor to enhancer activation. In contrast, the Snail–Twist interaction, which was mentioned earlier, exhibits repressive effects, where high expression of one TF suppresses the contribution of the other to enhancer activation. Lastly, the Twist–Zeste interaction exhibits alternative effects, where high expression of either TF enhances the contribution of each to enhancer activation.
2.6 Diamond understands mortality risk factors in health outcomes

We lastly evaluated the performance of Diamond on a real mortality risk dataset [10] to investigate the relationship between mortality risk factors and long-term health outcomes within the US population. We used a mortality dataset from the National Health and Nutrition Examination Survey (NHANES I) and NHANES I Epidemiologic Follow-up Study (NHEFS). The dataset incorporates clinical and laboratory measurements from US participants between 1971 and 1974 (Fig. 6b). Note that some of the features are presented categorically, which makes model’s trained from this data challenging to interpret. We addressed this problem by converting each categorical feature into a list of binary features using one-hot encoding. The dataset also reports the mortality status of participants as of 1992, with individuals known to have passed away before 1992. The task is to construct an MLP-based model to predict mortality status using a survival loss function. For robustness, we repeated the experiment 20 times using different random seeds.
Given the absence of ground truth interactions for this task, we began by evaluating the interactions identified by Diamond through literature support. Our analysis shows that three of the top ten selected interactions are directly supported by existing literature (Fig. 6a). For example, we investigated the interaction between sex and sedimentation rate (SR) and found that, although a high SR is strongly associated with a higher risk of overall mortality, this association can be attenuated by sex [15]. Additionally, we investigated the interaction between creatinine and the blood urea nitrogen (BUN) identified by Diamond. Research indicates that the BUN/creatinine ratio is recognized to have a nonlinear association with all-cause mortality and a linear association with cancer mortality [46]. The literature evidence is further supported by qualitative evaluation (Fig. 5c-e). Specifically, we examined the BUN-creatinine interaction in terms of the contribution of feature values to mortality status prediction, observing that when the creatinine level exceeds a certain level, a high BUN level further increases the mortality risk.
Finally, we probed into the remaining identifications that lack supporting literature evidence. As in the Drosophila analysis, we discovered that these interactions can be logically explained through transitive effects. For example, the interaction between BUN and potassium can be justified by combining the following two established facts: (1) BUN level is indicative of chronic kidney disease development [9], and (2) patients with chronic kidney diseases show a progressively increasing mortality rate with abnormal potassium levels [44]. Additionally, the interaction between BUN and SR reflects the fact that (1) the creatinine level and SR are strongly associated to clinical pathology types and prognosis of patients [28], and (2) creatinine and BUN have a nonlinear association [46].
3 Discussion
In this study, we aim to enable rigorous data-driven scientific discoveries, which are crucial in the present of massive data sets. For this purpose, we propose Diamond, an error-controlled interaction discovery method designed to work with a variety of ML models. The key novelties of Diamond are threefold. First, Diamond achieves FDR control using carefully designed knockoffs, without relying on p-values, which are often difficult to obtain in generic ML models. Second, Diamond includes a calibration procedure to distill non-additive or interactive effects, thereby maintaining FDR control at the target level. This calibration step is critical because naive application of off-the-shelf feature interaction importance measures fail to control the FDR in this setting. Third, Diamond is versatile in discovering important non-additive interactions across a wide range of ML models, different knockoff designs, and various interaction importance measures, all while guaranteeing FDR control. We have applied Diamond to various simulated and real datasets to demonstrate its empirical utility, proving it to be the only valid solution for error-controlled data-driven scientific discovery compared to other alternative methods.
Methodologically, our approach provides a path toward enhancing transparency in complex ML models. The complexity of ML models has long posed a tradeoff between predictability and interpretability. Specifically, sophisticated ML models such as DNNs excel at detecting subtle relationships and patterns within complex data. However, their “black-box” nature poses challenges in mission-critical domains where error tolerance is minimal. By providing FDR control in this setting, we can better understand why and how models make predictions, thereby enabling more effective use of ML models to gain human-understandable insights. Diamond can help diagnose potential biases in ML models, contributing positively to the integrity, accountability, and fairness required when employing AI technologies.
Lastly, this study highlights several promising directions for future research. First, we observe that Diamond tends to be conservative. This suggests a future direction for improvement in enhancing statistical power through further refinement of the FDR estimation process. Previous efforts, like the knockoff-tailored plugin pairwise-coupling layer in MLP models, have shown superior performance in statistical power [33]. However, these previously described methods are specific to certain types of ML models and may not be generally applicable to all generic ML models, indicating the need for improvement to broaden their applicability. Second, we observe that Diamond cannot distinguish between direct interactions and interactions caused by transitive effects. Specifically, even without a direct interaction between two features, we might still expect a strong interaction between them if solid interactions exist individually between each of these features and a third feature. Though transitive interactions are non-additive, for the purpose of scientific discovery, a potential research direction is to identify FDR-controlled direct interactions exclusively, using a causal inference framework [38]. Finally, this study primarily focuses on discovering pairwise interactions. While detecting pairwise interactions holds practical significance in various biological contexts, the discovery of higher-order interactions can naturally provide deeper insights for explaining genetic mechanisms, diseases, and drug effects in healthcare domains. This problem is highly challenging in practice due to the exponentially large search space. A potential research direction for future studies is to effectively reduce the search space to generalize Diamond for practical discovery of higher-order interactions.
In conclusion, Diamond enables error-controlled detection of feature interactions in the context of a variety of ML models. The versatility and flexibility of Diamond make it widely applicable in high-stakes and error-intolerant domains where interpretability and statistical rigor are needed. We believe that this powerful tool will pave the way for the broader deployment of machine learning models in scientific discovery and hypothesis validation.
4 Methods
4.1 FDR control with the knockoffs
Diamond achieves FDR control by leveraging the model-X knockoffs framework [2, 6], which was proposed in the setting of error-controlled feature selection. The core idea of knockoffs is to generate dummy features that perfectly mimic the empirical dependence structure among the original features but are conditionally independent of the response given the original features. Briefly speaking, the knockoff filter achieves FDR control in two steps: (1) construction of knockoff features and (2) filtering using knockoff statistics.
For the first step, the knockoff features are defined as follows:
Definition 1 (Model-X knockoff [6]).
The model-X knockoff features for the family of random features are a new family of random features that satisfy two properties:
-
1.
for any subset , where means swapping and for each and denotes equal in distribution, and
-
2.
, i.e., is independent of response given feature .
According to Definition 1, the construction of the knockoffs must be independent of the response . Thus, if we can construct a set of model-X knockoff features properly, then by comparing the original features with these control features, FDR can be controlled at target level . In the Gaussian setting, i.e., with covariance matrix , the model-X knockoff features can be constructed easily:
(1) |
where is a diagonal matrix with all components of being positive such that the conditional covariance matrix in Equation 1 is positive definite. As a result, the original features and the model-X knockoff features constructed by Equation 1 have the following joint distribution:
(2) |
With the constructed knockoff , feature importances are quantified by computing the knockoff statistics for , where and represent feature importance measures for the -th feature and its knockoff counterpart , respectively, and is an antisymmetric function satisfying . The knockoff statistics should satisfy a coin-flip property such that swapping an arbitrary pair and its knockoff counterpart only changes the sign of but keeps the signs of other () unchanged [6]. A desirable property for knockoff statistics ’s is that important features are expected to have large absolute values, whereas unimportant ones should have small symmetric values around 0.
Finally, the absolute values of the knockoff statistics ’s are sorted in decreasing order, and FDR-controlled features are selected whose ’s exceed some threshold . In particular, the choice of threshold follows where is the set of unique nonzero values from ’s and is the desired FDR level specified by the user.
4.2 Measuring non-additive interactive effect
As a key precursor to FDR estimation, Diamond quantifies feature interactions from trained ML models and produces a ranked list of these interactions, with higher-ranked interactions indicating greater importance. For notational simplicity, we use indices for both original features and knockoffs as , with and corresponding to the original features and their respective knockoff counterparts. Here, we define as a reported interaction importance measure from existing methods. There are many feature interaction importance measures available for , each attributing the prediction influence to feature pairs in different ways. However, it is important to note that such measures favor pairs where both features are simultaneously important for a model’s prediction, rather than capturing the true non-additive or interactive effects between the two features [54]. Further supported by simulation studies (Fig. 2c), we observed that many feature interaction importance measures tend to assign higher interaction scores to two marginally important but non-interacting features compared to two random ones, even though neither pair has a real interaction, leading to the failure of FDR control.
The direct reason for the interaction importance measure falling short in controlling FDR is its violation of the knockoff filter’s assumption. Specifically, the knockoff filter requires that the importance scores of knockoff-involving interactions and false interactions have a similar distribution. To resolve this issue, we introduce a calibration procedure to be applied on top of existing interaction importance measures. Specifically, we consider that a reported interaction importance measure from existing methods comprises a mixture of several factors: prediction-dependent marginal effects for individual features, prediction-independent feature biases, independent random noise, and potential non-additive interactive effects between feature pairs. Thus, the reported interaction between features and is represented as:
(3) |
where is random noise independent of both features and predictions, and are reported pairwise and univariate feature importance measures that are dependent on the model’s predictions, respectively. The functions adapt univariate feature importance to be compatible with feature interaction importance. The function models the feature-specific biases that are independent of the model’s predictions, where indicates the presence of feature and . Our goal is to identify , the potential non-additive interactive effects between features and .
We formulate the identification of interactive effects as the residuals of a regression task:
(4) |
where is the conditional probability of being either original-only (i.e., ) or knockoff-involving (i.e., or ) feature pairs given two univariate feature importance measures and , estimated by a logistic regression model [16]. The rationale is based on the important observation that most feature pairs do not exhibit non-additive interactions, especially those involving knockoff features. Therefore, we want to focus more on potential non-additive interactions that have large univariate feature importance, as important interactions naturally consist of significant marginal features. In this study, we parameterize the functions and using generalized additive models and optimize Eq. 4 using the pyGAM library [45].
4.3 FDR control for interactions
After calculating the non-additive interactive effects using Eq. 4, we denote the resultant set of interactive effects as . We arrange in decreasing order and select interactions for which the interactive effect, , exceeds some threshold, . This selection ensures that the chosen interactions adhere to a desired FDR level .
However, a point of complexity is introduced due to the heterogeneous interactions, which include original-only and knockoff-involving interactions. The latter further comprises original-knockoff, knockoff-original, and knockoff-knockoff interactions. Following the strategy outlined by [55], the threshold is determined by:
(5) |
where and respectively denote the sets of interactions that include at least one knockoff feature and both knockoff features, while refers to the set of unique nonzero values present in .
4.4 ML models and interpretation
Diamond is designed to be compatible with a wide range of ML models. To demonstrate the broad applicability of Diamond, we use it in conjunction with machine learning models from methodologically distinct categories, including DNN models, tree-based models, and factorization-based models.
For DNNs, we configured a pyramid-shaped multi-layer perceptron (MLP) model with the exponential linear unit activation and four hidden layers, each having neuron sizes of 2p, p, p/2, and p/4 respectively, where p denotes the input dimensionality. The model follows the guidance of [33] and includes a plugin pairwise-coupling layer, connecting each original feature with its knockoff counterpart in a pairwise fashion to maximize statistical power (See details in Sec. S.1). Using an MLP as the ML model, we employed three representative interaction importance measures to clarify the relationship between features and responses. We began with a model-specific interaction importance measure derived from the model weights, specifically designed for the MLP architecture (See details in Sec. S.2). To demonstrate Diamond’s flexibility, we also used Integrated Hessian [22] and Expected Hessian [14], two state-of-the-art, model-agnostic interaction importance measures, to elucidate the relationships between features and responses without making assumptions about any specific model architecture (See details in Sec. S.3). Both model-agnostic interaction importance measures are calculated using the Path-Explain library [22].
For tree-based models, we employed two widely used gradient-boosted decision tree representatives—XGBoost [8] and LightGBM [24]. We used the implementations provided by the two libraries with default settings. Using either XGBoost or LightGBM as the ML model, we utilized TreeSHAP [37], a Shapley value-based interpretation method tailored for tree-based ML models, to elucidate the relationships between features and responses. We used the implementation provided by the SHAP library [36] to calculate both univariate and pairwise feature importance.
For factorization-based models, we used a widely adopted representative method—Factorization machines (FM) [41]. We used 2-Way FM implemented in the xLearn library [39] with default settings, which models up to second-order feature interactions. Since 2-Way FM is explicitly modelled as the weighted sum of univariate and pairwise feature interactions, we used the learned coefficients of these interactions as the corresponding importance measures.
4.5 Knockoff generation
Diamond relies on knockoffs to control the FDR. It is worth mentioning that conventional knockoffs are limited to Gaussian settings, which may not be applicable in many practical scenarios. In this study, we focus on the state-of-the-art knockoff design without assuming any specific feature distribution. In particular, we consider commonly-used non-Gaussian knockoff generation methods such as KnockoffsDiagnostics [5], KnockoffGAN [23], Deep knockoffs [42], and VAE knockoffs [31].
4.6 Alternative FDR estimation methods
We evaluate the performance of Diamond in comparison to other baseline methods. For the first baseline method, we employ a permutation-based approach to calculate the interaction-wise FDR. Specifically, this involves using a previously described permutation procedure tailored for neural networks to assess the significance of interactions and calculate permutation p-values [12], followed by the Benjamini–Hochberg procedure [4] to estimate the FDR.
For the second baseline method, we consider an ensemble-based approach that represents interaction-wise FDR as the aggregation of feature-wise FDR. This approach follows the intuition that an important feature interaction is composed of important univariate features. Specifically, we use a previously described knockoff-based procedure tailored for neural networks to estimate the feature-wise FDR of each univariate feature [33]. We then approximate the interaction-wise FDR as the maximum of the two comprising univariate feature-wise FDRs.
4.7 Code and data availability
The software implementation and all models described in this study are available at https://github.com/batmen-lab/diamond. All public datasets used for evaluating the models are available at https://github.com/batmen-lab/diamond/data.
Declarations
-
•
The research is supported by the Canadian NSERC Discovery Grant RGPIN-03270-2023.
-
•
The authors declare that they have no conflict of interest.
-
•
There is no ethics approval and consent to participate involved in this study.
-
•
There is no consent for publication involved in this study.
-
•
Author contribution: W.C. and Y.J. implemented the code and did the analysis. W.C. and Y.J. set up and preprocessed the datasets. W.C., Y.J. and Y.L. prepared the figures. Y.L. wrote the manuscript. All authors participated the discussion. All authors reviewed the manuscript.
Appendix S.1 Knockoff-tailored multi-layer perceptron model

By following the guidance of [33], Diamond utilizes a knockoff-tailored multi-layer perceptron (MLP) model that includes a plugin pairwise-coupling layer, connecting each original feature with its knockoff counterpart in a pairwise fashion to maximize statistical power. Specifically, the model takes as input an augmented data matrix , created by concatenating the original data matrix with its knockoffs . The augmented data matrix is fed to an off-the-shelf MLP model through a plugin pairwise-coupling layer composed of filters, encapsulated by , where each -th filter connects feature and its knockoff counterpart , as shown in Fig. S.1.
The filter weights, and are initialized identically and engage in a competitive dynamic via pairwise connections during the DNN training. Additionally, we employ a linear activation function in the pairwise-coupling layer to stimulate competition between different features. The outputs of the filters are subsequently channeled into an MLP model that learns to map to the response . In this study, we chose an MLP architecture with the exponential linear unit (ELU) activation function and four hidden layers. Letting be the number of hidden layers and denote the number of neurons in the -th layer of the MLP — where — we accordingly define the weight matrices of the input layer, hidden layers, and the output layer in the MLP as , , , respectively. With these notations, the response is expressed as follows:
(6) | ||||
where refers the ELU function, and signifies the bias vector in the -th layer. While we use this specific model, Diamond’s overall process is versatile and fully applicable to any off-the-shelf DNN architecture.
Appendix S.2 Model-specific interaction importance measure
The model-specific interaction importance measure is based on the knockoff-tailored MLP model design and notation described in Sec. S.1, Given the pairwise-coupling layer weights: and as well as the MLP weights in the -th layer: , the model-specific interaction importance measure decomposes into two factors: (1) the relative importance between the original feature and its knockoff counterpart, encoded by concatenated filter weights , and (2) the relative importance among all features, encoded by the weight matrix and the aggregated weights . (See [18] for theoretical insights regarding .)
Inspired by [52], we define the model-specific interaction importance as:
(7) |
where and denotes the -th row of .
Following [33], we define the model-specific univariate feature importance as:
(8) |
where and denotes entry-wise matrix multiplication.
Appendix S.3 Model-agnostic interaction importance measure
The model-agnostic interaction importance measure aims to elucidate the relationships between feature pairs and responses without making assumptions about any specific model architecture. Specifically, we employ two state-of-the-art model-agnostic interaction importance measures—Integrated Hessian [22] and Expected Hessian [14].
The Integrated Hessian interaction importance [22] is defined as:
(9) |
where calculates the second derivative of the response with respect to the -th and -th feature of sample to be explained. The corresponding univariate feature importance measure, which is compatible with the Integrated Hessian, is Integrated Gradient [50]:
(10) |
where calculates the first-order derivative of with respect to the -th feature of input .
Appendix S.4 Simulated dataset
Following [52], we use simulation datasets to evaluate Diamond. 10 simulation functions (described in Tab. S.1) are used to generate the simulation datasets.
ID | Simulation Function |
---|---|
Following the settings used in [52], we employed a sample size of , equally divided into training and test sets. In addition, the number of features is set at , and all features are sampled randomly from a continuous uniform distribution, . Only the first out of features contribute to the corresponding response, while the remaining features serve as noise to increase the task’s complexity. For robustness, we repeated the experiment 20 times for each simulated dataset using different random seeds. Each repetition involved data generation, knockoff generation using KnockoffsDiagnostics [5], ML model training, and interaction-wise FDR estimation. For all simulation settings, we reported the mean performance with 95% confidence intervals, fixing the target FDR level at .
We discovered that the proposed calibration procedure; without it, the FDR cannot be controlled by naively using reported interaction importance from existing methods (Fig. S.2). To gain insight into the FDR control failure, we conducted a qualitative comparison assessing interaction importance before and after calibration using the simulation function across different ML models (Fig. S.3). The primary cause of the FDR control failure lies in the distribution disparity between original-only interactions and those involving knockoffs. This suggests a violation of the knockoff filter’s assumption in controlling FDR. The proposed calibration procedure mitigates the disparity by extracting non-additive interactive effects from the reported interaction importance measures, thereby enhancing the utility of knockoff-involving interactions as a negative control for FDR estimation.





References
- \bibcommenthead
- Agrawal et al [2024] Agrawal A, McHale J, Oettl A (2024) Artificial intelligence and scientific discovery: A model of prioritized search. Research Policy 53(5):104989
- Barber and Candès [2015] Barber RF, Candès EJ (2015) Controlling the false discovery rate via knockoffs. The Annals of Statistics 43(5):2055–2085
- Basu et al [2018] Basu S, Kumbier K, Brown JB, et al (2018) Iterative random forests to discover predictive and stable high-order interactions. Proceedings of the National Academy of Sciences 115(8):1943–1948
- Benjamini and Hochberg [1995] Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B 57:289–300
- Blain et al [2024] Blain A, Thirion B, Linhart J, et al (2024) When knockoffs fail: diagnosing and fixing non-exchangeability of knockoffs. arXiv preprint arXiv:240706892
- Candès et al [2018] Candès EJ, Fan Y, Janson L, et al (2018) Panning for gold: Model-X knockoffs for high-dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80(3):551–577
- Chang et al [2022] Chang C, Caruana R, Goldenberg A (2022) NODE-GAM: Neural generalized additive model for interpretable deep learning. International Conference on Learning Representations
- Chen and Guestrin [2016] Chen T, Guestrin C (2016) XGBoost: A scalable tree boosting system. In: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York, NY, USA, KDD ’16, pp 785–794
- Collins et al [2017] Collins AJ, Pitt B, Reaven N, et al (2017) Association of serum potassium with all-cause mortality in patients with and without heart failure, chronic kidney disease, and/or diabetes. American Journal of Nephrology 46(3):213–221
- Cox et al [1997] Cox CS, Feldman JJ, Golden CD, et al (1997) Plan and operation of the NHANES I Epidemiologic Followup Study, 1992
- Cui et al [2019] Cui T, Marttinen P, Kaski S (2019) Recovering pairwise interactions using neural networks. arXiv preprint arXiv:190108361
- Cui et al [2022] Cui T, El Mekkaoui K, Reinvall J, et al (2022) Gene–gene interaction detection with deep learning. Communications Biology 5(1):1238
- Efron et al [2004] Efron B, Hastie T, Johnstone I, et al (2004) Least angle regression. The Annals of Statistics 32(2):407–499
- Erion et al [2021] Erion G, Janizek JD, Sturmfels P, et al (2021) Improving performance of deep learning models with axiomatic attribution priors and expected gradients. Nature Machine Intelligence 3(7):620–631
- Fest et al [2019] Fest J, Ruiter R, Mooijaart S, et al (2019) Erythrocyte sedimentation rate as an independent prognostic marker for mortality: a prospective population-based cohort study. Journal of Internal Medicine 285(3):341–348
- Freedman and Berk [2008] Freedman DA, Berk RA (2008) Weighting regressions by propensity scores. Evaluation Review 32(4):392–409
- Garbuzova et al [2023] Garbuzova EV, Shcherbakova LV, Rymar OD, et al (2023) Triglycerides, obesity and education status are associated with the risk of developing Type 2 Diabetes in young adults, cohort study. Journal of Personalized Medicine 13(9):1403
- Garson [1991] Garson GD (1991) Interpreting neural-network connection weights. AI Expert 6(4):46–51
- Ghorbani et al [2019] Ghorbani A, Abid A, Zou J (2019) Interpretation of neural networks is fragile. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 3681–3688
- Huang et al [2021] Huang X, Li G, Xu B, et al (2021) Lower baseline serum triglyceride levels are associated with higher decrease in body mass index after laparoscopy sleeve gastrectomy among obese patients. Frontiers in Endocrinology 12:633856
- Ip et al [1992] Ip YT, Park RE, Kosman D, et al (1992) The dorsal gradient morphogen regulates stripes of rhomboid expression in the presumptive neuroectoderm of the drosophila embryo. Genes & Development 6(9):1728–1739
- Janizek et al [2021] Janizek JD, Sturmfels P, Lee SI (2021) Explaining explanations: Axiomatic feature interactions for deep networks. Journal of Machine Learning Research 22:104:1–104:54
- Jordon et al [2018] Jordon J, Yoon J, van der Schaar M (2018) KnockoffGAN: Generating knockoffs for feature selection using generative adversarial networks. In: International Conference on Learning Representations
- Ke et al [2017] Ke G, Meng Q, Finley T, et al (2017) LightGBM: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems 30
- Kindermans et al [2019] Kindermans PJ, Hooker S, Adebayo J, et al (2019) The (un)reliability of saliency methods. Explainable AI: Interpreting, explaining and visualizing deep learning pp 267–280
- Lerman et al [2021] Lerman S, Venuto C, Kautz H, et al (2021) Explaining local, global, and higher-order interactions in deep learning. In: International Conference on Computer Vision, pp 1224–1233
- Li et al [2008] Li XY, MacArthur S, Bourgon R, et al (2008) Transcription factors bind thousands of active and inactive regions in the Drosophila blastoderm. PLoS Biology 6(2):e27
- Liang et al [2017] Liang H, Xin M, Zhao L, et al (2017) Serum creatinine level and ESR values associated to clinical pathology types and prognosis of patients with renal injury caused by ANCA-associated vasculitis. Experimental and Therapeutic Medicine 14(6):6059–6063
- Lipton [2018] Lipton ZC (2018) The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery. Queue 16(3):31–57
- Liska et al [2022] Liska O, Bohár B, Hidas A, et al (2022) TFLink: an integrated gateway to access transcription factor–target gene interactions for multiple species. Database 2022:baac083
- Liu and Zheng [2018] Liu Y, Zheng C (2018) Auto-encoding knockoff generator for FDR controlled variable selection. arXiv preprint arXiv:180910765
- Lu and Noble [2021] Lu YY, Noble WS (2021) A wider field of view to predict expression. Nature Methods 18(10):1155–1156
- Lu et al [2018] Lu YY, Fan Y, Lv J, et al (2018) DeepPINK: reproducible feature selection in deep neural networks. In: Advances in Neural Information Processing Systems
- Lu et al [2021a] Lu YY, Guo W, Xing X, et al (2021a) DANCE: Enhancing saliency maps using decoys. In: International Conference on Machine Learning
- Lu et al [2021b] Lu YY, Yu TC, Bonora G, et al (2021b) ACE: Explaining cluster from an adversarial perspective. In: International Conference on Machine Learning
- Lundberg and Lee [2017] Lundberg SM, Lee SI (2017) A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems
- Lundberg et al [2020] Lundberg SM, Erion G, Chen H, et al (2020) From local explanations to global understanding with explainable AI for trees. Nature Machine Intelligence 2(1):56–67
- Luo et al [2020] Luo Y, Peng J, Ma J (2020) When causal inference meets deep learning. Nature Machine Intelligence 2(8):426–427
- Ma [2019] Ma C (2019) xLearn python library. https://xlearn-doc.readthedocs.io/
- Obermeyer and Emanuel [2016] Obermeyer Z, Emanuel EJ (2016) Predicting the future–big data, machine learning, and clinical medicine. The New England Journal of Medicine 375(13):1216
- Rendle [2010] Rendle S (2010) Factorization machines. In: IEEE International Conference on Data Mining (ICDM), IEEE, pp 995–1000
- Romano et al [2020] Romano Y, Sesia M, Candès E (2020) Deep knockoffs. Journal of the American Statistical Association 115(532):1861–1872
- Samek et al [2021] Samek W, Montavon G, Lapuschkin S, et al (2021) Explaining deep neural networks and beyond: A review of methods and applications. Proceedings of the IEEE 109(3):247–278
- Seki et al [2019] Seki M, Nakayama M, Sakoh T, et al (2019) Blood urea nitrogen is independently associated with renal outcomes in Japanese patients with stage 3–5 chronic kidney disease: a prospective observational study. BMC Nephrology 20:1–10
- Serven and Brummitt [2018] Serven D, Brummitt C (2018) pyGAM: Generalized additive models in python. 10.5281/zenodo.1208723, URL https://doi.org/10.5281/zenodo.1208723
- Shen et al [2022] Shen S, Yan X, Xu B (2022) The blood urea nitrogen/creatinine (BUN/cre) ratio was U-shaped associated with all-cause mortality in general population. Renal Failure 44(1):184–190
- Shrikumar et al [2017] Shrikumar A, Greenside P, Shcherbina A, et al (2017) Learning important features through propagating activation differences. In: International Conference on Machine Learning
- Simonyan et al [2013] Simonyan K, Vedaldi A, Zisserman A (2013) Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:13126034
- Sorokina et al [2008] Sorokina D, Caruana R, Riedewald M, et al (2008) Detecting statistical interactions with additive groves of trees. In: International Conference on Machine learning, pp 1000–1007
- Sundararajan et al [2017] Sundararajan M, Taly A, Yan Q (2017) Axiomatic attribution for deep networks. In: International Conference on Machine Learning
- Sundararajan et al [2020] Sundararajan M, Dhamdhere K, Agarwal A (2020) The shapley taylor interaction index. In: International Conference on Machine Learning, PMLR, pp 9259–9268
- Tsang et al [2018a] Tsang M, Cheng D, Liu Y (2018a) Detecting statistical interactions from neural network weights. International Conference on Learning Representations
- Tsang et al [2018b] Tsang M, Liu H, Purushotham S, et al (2018b) Neural interaction transparency (NIT): Disentangling learned interactions for improved interpretability. Advances in Neural Information Processing Systems 31
- Tsang et al [2021] Tsang M, Enouen J, Liu Y (2021) Interpretable artificial intelligence through the lens of feature interaction. arXiv preprint arXiv:210303103
- Walzthoeni et al [2012] Walzthoeni T, Claassen M, Leitner A, et al (2012) False discovery rate estimation for cross-linked peptides identified by mass spectrometry. Nature Methods 9(9):901–903
- Watson [2022] Watson DS (2022) Interpretable machine learning for genomics. Human Genetics 141(9):1499–1513
- Yevshin et al [2016] Yevshin I, Sharipov R, Valeev T, et al (2016) GTRD: a database of transcription factor binding sites identified by ChIP-seq experiments. Nucleic Acids Research p gkw951
- Zhang et al [2021] Zhang H, Xie Y, Zheng L, et al (2021) Interpreting multivariate shapley interactions in DNNs. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 10877–10886
- Zhao et al [2020] Zhao H, Zheng C, Gan K, et al (2020) High body mass index and triglycerides help protect against osteoporosis in patients with type 2 diabetes mellitus. Journal of Diabetes Research 2020(1):1517879