
BayesNAM: Leveraging Inconsistency for Reliable Explanations

Hoki Kim, Jinseong Park, Yujin Choi, Seungyun Lee, and Jaewook Lee. Hoki Kim is with the Department of Industrial Security, Chung-Ang University, South Korea. E-mail: [email protected]. Jinseong Park, Yujin Choi, Seungyun Lee, and Jaewook Lee are with the Department of Industrial Engineering, Seoul National University, South Korea. The corresponding author is Jaewook Lee. E-mail: [email protected]
Abstract

The neural additive model (NAM) is a recently proposed explainable artificial intelligence (XAI) method that utilizes neural network-based architectures. Given the advantages of neural networks, NAMs provide intuitive explanations for their predictions with high model performance. In this paper, we analyze a critical yet overlooked phenomenon: NAMs often produce inconsistent explanations, even when using the same architecture and dataset. Traditionally, such inconsistencies have been viewed as issues to be resolved. However, we argue instead that these inconsistencies can provide valuable explanations within the given data model. Through a simple theoretical framework, we demonstrate that these inconsistencies are not mere artifacts but emerge naturally in datasets with multiple important features. To effectively leverage this information, we introduce a novel framework, the Bayesian Neural Additive Model (BayesNAM), which integrates Bayesian neural networks and feature dropout, with a theoretical proof demonstrating that feature dropout effectively captures model inconsistencies. Our experiments demonstrate that BayesNAM effectively reveals potential problems, such as insufficient data or structural limitations of the model, providing more reliable explanations and potential remedies.

1 Introduction

Explainable artificial intelligence (XAI) has become a significant field of research as machine learning models are increasingly applied in real-world systems including finance and healthcare. To provide insight into the underlying decision-making process behind the predictions made by these models, numerous researchers have developed various techniques to assist human decision-makers.

Recently, Agarwal et al.[1] proposed a neural additive model (NAM) that utilizes neural networks to achieve both high performance and explainability. NAM is a type of generalized additive model (GAM) that involves the linear or non-linear transformation of each input and yields the final prediction through an additive operation. Previous studies have demonstrated that NAM not only learns complex relationships between inputs and outputs but also provides a high level of explainability based on neural network architectures and training techniques.

Refer to caption
Figure 1: Inconsistency of NAM, where two independent NAMs trained with the same dataset and architecture output different explanations solely due to different random seeds.

In this paper, we analyze a critical yet overlooked phenomenon: the inconsistency phenomenon of NAM. Fig. 1 illustrates this issue, where two independent NAMs, trained on the same dataset and architecture, produce different explanations due solely to variations in random seeds. Such inconsistency has traditionally been viewed as a problem to be solved [2].

However, we argue that these inconsistencies are not merely obstacles but can offer valuable insights to uncover external explanations within the data model. Through a simple theoretical model, we show that NAMs naturally exhibit the inconsistency phenomenon even when trained on typical datasets that contain multiple important features. Building on this insight, we propose the Bayesian Neural Additive Model (BayesNAM), a novel framework that combines Bayesian neural networks with feature dropout to harness these inconsistencies for more reliable explainability. We also provide a theoretical proof that feature dropout effectively leverages inconsistency. Our real-world experiments demonstrate that BayesNAM not only provides more reliable and interpretable explanations but also highlights potential issues in the data model, such as insufficient data and structural limitations within the model.

The main contributions can be summarized as follows:

  • We investigate the inconsistency phenomenon of NAMs and analyze this phenomenon through a simple theoretical model.

  • We propose a new framework, BayesNAM, which utilizes Bayesian neural networks and feature dropout. We also establish a theoretical analysis of the efficacy of feature dropout in leveraging inconsistency information.

  • We empirically demonstrate that BayesNAM is particularly effective in identifying data insufficiencies or structural limitations, offering more reliable explanations and insights for decision-making.

2 Related Work

2.1 Neural Additive Model

Because numerous machine learning and deep learning models are black boxes, a line of work has attempted to explain the decisions made by such models. These methods are called post-hoc methods, since they are applied after the model has been trained. While post-hoc methods offer some interpretability, recent work [3, 4] has argued that they can produce unreliable explanations, which ultimately undermines their usefulness for explainability.

In contrast to post-hoc methods, intrinsic methods aim to develop an inherently explainable model without additional techniques [4]. Agarwal et al.[1] proposed a neural additive model (NAM), which combines a generalized additive model [5] and neural networks. To be specific, given d features x_{1},x_{2},\cdots,x_{d} and a target y, NAM constructs d mapping functions as follows:

y=f_{1}(x_{1})+f_{2}(x_{2})+\cdots+f_{d}(x_{d})+\beta, (1)

where \beta is a bias term and each mapping function f_{i} is parameterized by a neural network. In Fig. 2, we illustrate an example of NAM. By utilizing neural networks, NAMs capture non-linear relationships and achieve high performance while maintaining clarity through a straightforward plot.
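To make Eq. (1) concrete, the following is a minimal sketch of a NAM in PyTorch. It is not the authors' reference implementation; the name TinyNAM and the layer sizes are our own illustrative choices. Each feature gets its own small network f_{i}, and the outputs are summed with a bias.

```python
import torch
import torch.nn as nn

class TinyNAM(nn.Module):
    """Minimal NAM sketch: one small MLP per feature, summed with a bias term."""
    def __init__(self, num_features: int, hidden: int = 32):
        super().__init__()
        # One independent mapping function f_i per input feature.
        self.feature_nets = nn.ModuleList([
            nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))
            for _ in range(num_features)
        ])
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x):  # x: (batch, num_features)
        # Keep the per-feature outputs f_i(x_i) separate so they can be plotted later.
        contributions = [net(x[:, i:i + 1]) for i, net in enumerate(self.feature_nets)]
        return torch.cat(contributions, dim=1).sum(dim=1, keepdim=True) + self.bias
```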

Despite their strengths, NAMs frequently exhibit inconsistent explanations even when trained on identical datasets with the same architectures, as illustrated in Fig. 1. These inconsistencies can also be observed in the original work [1], where the mapping functions produced by different NAMs within an ensemble show substantial variation, despite being trained under the same experimental conditions.

Although this inconsistency across NAMs can harm their explainability, given that they are intended to be XAI models, the phenomenon has received limited attention in the literature. To the best of our knowledge, only one study has explicitly addressed this issue. Radenovic et al. [2] introduced the neural basis model (NBM), which uses shared basis functions across features rather than assigning independent mapping functions to each feature. They argued that NBM reduces divergence between models, offering more consistent shape functions compared to NAM and thus mitigating the inconsistency problem.

In contrast, this paper presents a novel view on the inconsistency phenomenon. Rather than treating it as a problem to be solved, we argue that these inconsistencies provide valuable information about the data model.

Refer to caption
Figure 2: Example of a mapping function f_{i} of NAM. Blue regions correspond to regions with high data density. NAM enables us to capture non-linear relationships between inputs and outputs and further provides a clear understanding.

2.2 Bayesian Neural Network

Although the use of a single model is a fundamental approach, numerous studies have found that a point estimate is often vulnerable to overfitting and high variance due to its limited representation [6]. To overcome this limitation, Bayesian neural networks estimate a distribution over models instead of calculating a fixed model. Given the data ({\bm{x}},y)\sim\mathcal{D} and the prior p({\bm{w}}), we aim to approximate the posterior p({\bm{w}}|{\bm{x}},y). Specifically, rather than using a fixed weight vector {\bm{w}}_{i}, we aim to find a distribution of weight vectors \mathcal{N}(\bm{\mu}_{i},\texttt{diag}({\bm{s}}_{i})^{2}) and learn the mean vector \bm{\mu}_{i} and the standard deviation vector {\bm{s}}_{i}.

Since the distribution p({\bm{x}},y) is generally intractable, several methods have been developed to approximate the posterior, including Markov Chain Monte Carlo (MCMC) [7] and variational inference approaches [8, 9]. While MCMC methods can provide more accurate estimates, their high computational cost [10] has led to the use of variational inference methods across diverse domains [11, 12].

During optimization in variational inference methods, a weight vector {\bm{w}}_{i}=\bm{\mu}_{i}+{\bm{s}}_{i}\odot\bm{\epsilon} is sampled for each forward step, where \bm{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I}). The prior distribution can simply be chosen as the isotropic Gaussian prior \mathcal{N}(\mathbf{0},s_{0}^{2}\mathbf{I}), where s_{0} is a predefined standard deviation, so that the KL-divergence can be calculated explicitly [11].
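As a concrete illustration of this sampling scheme, a mean-field Bayesian linear layer with the reparameterization trick and an isotropic Gaussian prior \mathcal{N}(\mathbf{0},s_{0}^{2}\mathbf{I}) might look as follows. This is only a sketch: the name BayesLinear, the softplus parameterization of the standard deviation, and the initialization constants are our own choices, not details from the paper.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesLinear(nn.Module):
    """Mean-field Bayesian linear layer: w = mu + softplus(rho) * eps (sketch)."""
    def __init__(self, in_dim: int, out_dim: int, s0: float = 0.1):
        super().__init__()
        self.mu = nn.Parameter(torch.randn(out_dim, in_dim) * 0.01)   # mean of the weights
        self.rho = nn.Parameter(torch.full((out_dim, in_dim), -5.0))  # softplus(rho) = std
        self.bias = nn.Parameter(torch.zeros(out_dim))
        self.s0 = s0                                                  # prior std (assumed value)

    def forward(self, x):
        std = F.softplus(self.rho)
        eps = torch.randn_like(std)
        w = self.mu + std * eps            # a new weight sample on every forward pass
        return x @ w.t() + self.bias

    def kl(self):
        # KL( N(mu, std^2) || N(0, s0^2) ), summed over all weight entries.
        std = F.softplus(self.rho)
        return (math.log(self.s0) - torch.log(std)
                + (std ** 2 + self.mu ** 2) / (2 * self.s0 ** 2) - 0.5).sum()
```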

A promising direction in the field of Bayesian neural networks is their integration with other domains to enhance model explainability. Bayesian neural networks provide weight distributions that enable the identification of high-density regions or confidence intervals, which can be used for uncertainty estimation. Researchers and practitioners in several domains that require reliable explanations, such as medicine [13] and finance [14], have also explored the utilization of Bayesian models to measure the confidence of prediction for trustworthy decision-making.

3 Methodology

In Section 3.1, we first investigate the inconsistency phenomenon of NAMs with a simple theoretical model. Our empirical findings show that this inconsistency can easily occur, even when datasets contain more than one important feature. Subsequently, in Section 3.2, we propose a new framework called BayesNAM, which combines Bayesian neural networks with feature dropout, to leverage the inconsistency information as a valuable indicator. This framework is supported by a theoretical analysis demonstrating the effectiveness of feature dropout in capturing diverse explanations. Finally, we provide a detailed explanation of the proposed framework.

3.1 Rethinking Inconsistency of Neural Additive Model

Refer to caption
(a) Random Seed 1
Refer to caption
(b) Random Seed 2
Figure 3: (Case-I. \lambda=0) Mapping functions of two NAMs trained with different random seeds show similar shapes. Blue regions correspond to regions with high data density.

We begin the analysis by identifying and investigating the inconsistent explanations of NAM. To this end, we construct a simple theoretical model. Here, we consider a binary classification task where the target y can take a value in \{-1,1\}. Inspired by [15], we construct the input-target pairs ({\bm{x}},y)=(x_{1},x_{2},\cdots,x_{d},y) from a distribution \mathcal{D} as follows:

x_{1}=\begin{cases}+y&\text{ with probability }p\\ -y&\text{ with probability }1-p\end{cases}, (2)
x_{2},\cdots,x_{d}\stackrel{i.i.d.}{\sim}\mathcal{N}(\lambda y,\sigma^{2}), (3)

where x_{2},\cdots,x_{d} are independently and identically sampled from a normal distribution \mathcal{N} with mean \lambda y and variance \sigma^{2} for positive \lambda and \sigma. It is important to note that the features x_{2},\cdots,x_{d} are uncorrelated, as they are drawn independently and identically distributed. By adjusting the values of p and \lambda, we can control the significance of x_{1} and x_{2},\cdots,x_{d} in predicting y, as stated in the following lemma:

Lemma 1.

(Derived from [15]) Consider a linear classifier h,

h(x_{2},\cdots,x_{d})=\operatorname{sign}(w_{2}x_{2}+w_{3}x_{3}+\cdots+w_{d}x_{d}). (4)

Then, even a natural linear classifier h(\cdot) with w_{i}=\frac{1}{d-1} can easily achieve a higher classification accuracy than p, which is the natural accuracy of the model that only uses x_{1}, if the following statement is satisfied:

\Phi_{X\sim\mathcal{N}\left(0,\sigma^{2}/(d-1)\right)}(\lambda)>p, (5)

where \Phi_{X}(\cdot) is the cumulative distribution function of X. (Detailed proof is presented in the Appendix.)

Let p be a sufficiently large positive number. When \lambda=0, only x_{1} is useful for predicting y and the other features x_{2},\cdots,x_{d} are not correlated with y. As \lambda increases, x_{2},\cdots,x_{d} become correlated with y. By Lemma 1, if (5) is satisfied, a model that only considers x_{2},\cdots,x_{d} can achieve a classification accuracy higher than p. In summary, if \lambda=0, x_{1} is the only feature with high importance in predicting y, whereas for a large \lambda>0, x_{2},\cdots,x_{d} alone are enough to achieve significant performance in predicting y.

Now, we consider the following two cases with d=3:

  • Case-I. Single important feature exists (\lambda=0). In this case, only x_{1} is effective in predicting y, while x_{2} and x_{3} are not useful.

  • Case-II. Multiple important features exist (\lambda=3). In this case, all the features x_{1}, x_{2}, and x_{3} are highly correlated with y. A model that uses x_{2} and x_{3} can perform better than a model that solely depends on x_{1}, since \Phi_{Z}(\lambda=3)=0.999.

Given this theoretical model, we generated two sets of data containing 50,000 training examples and 10,000 test examples and trained two different NAMs on each dataset with different random seeds. For simplicity, we fixed the feature dimension to d=3, the probability to p=0.95, and \sigma^{2}=d-1, so that (5) becomes \Phi_{Z}(\lambda)>p, where Z is drawn from the standard normal distribution \mathcal{N}(0,1). For each mapping function f_{i} of NAM, we constructed a simple neural network with two linear layers containing 10 hidden neurons and used ReLU as the activation function. The models were trained by SGD with a learning rate of 0.01. One epoch was sufficient to achieve high training accuracy.
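The data model and the Case-II setting used here (d=3, p=0.95, \sigma^{2}=d-1, \lambda=3) can be reproduced with the short sketch below; sample_dataset and the random seed are our own illustrative choices. The last line checks the Lemma 1 condition \Phi_{Z}(\lambda)>p.

```python
import numpy as np
from scipy.stats import norm

def sample_dataset(n, d=3, p=0.95, lam=3.0, rng=np.random.default_rng(0)):
    y = rng.choice([-1, 1], size=n)
    x1 = np.where(rng.random(n) < p, y, -y)                                # Eq. (2)
    rest = rng.normal(lam * y[:, None], np.sqrt(d - 1), size=(n, d - 1))   # Eq. (3) with sigma^2 = d - 1
    return np.column_stack([x1, rest]).astype(np.float32), y

X_train, y_train = sample_dataset(50_000)
X_test, y_test = sample_dataset(10_000)

# With sigma^2 = d - 1, condition (5) reduces to Phi_Z(lambda) > p: about 0.9987 > 0.95 in Case-II.
print(norm.cdf(3.0))
```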

Refer to caption
(a) Random Seed 1
Refer to caption
(b) Random Seed 2
Figure 4: (Case-II. \lambda=3) Mapping functions of two NAMs trained with different random seeds are extremely different. This yields inconsistent feature contributions in Fig. 5.

Fig. 3 (Case-I) and Fig. 4 (Case-II) illustrate the mapping functions of the trained NAMs for each case. Specifically, for Case-I, we observed that the two NAM models trained with different random seeds exhibited similar test accuracy and explanations (94.9% and 95.0%, respectively). As shown in Fig. 3, the mapping functions for each x_{i} have similar shapes, with f_{1} being the only increasing one and the others being almost constant. Therefore, in this case, NAM successfully captures the true importance of features and provides reliable explanations.

Refer to caption
Figure 5: Corresponding feature contributions of a sample {\bm{x}}=[-1,3,3] with y=1 for the NAMs in Fig. 4. This inconsistent explanation corresponds to Fig. 1.

In contrast, for Case-II (Fig. 4), the mapping functions f_{i} of the trained NAMs have extremely different shapes. Although both NAMs achieve a test accuracy exceeding 99.99%, f_{3} is much steeper than f_{2} for the first random seed, whereas the relationship is reversed for the second random seed.

Such inconsistency results in inconsistent feature contributions. Figure 5 shows the feature contributions of a sample {\bm{x}}=[x_{1},x_{2},x_{3}]=[-1,3,3]. Following [1], we calculate the feature contribution by subtracting the average value of a mapping function across the entire training dataset. Although we use the same example, the feature contribution calculated by the NAM with random seed 1 implies that x_{2} is more important than x_{3}, while the NAM with random seed 2 outputs the opposite result, namely that x_{3} is more significant than x_{2}. In summary, NAMs can produce inconsistent explanations when multiple important features are present, a common condition in real-world datasets. Indeed, as discussed later in Figures 10 and 12, this inconsistency is readily observable in widely used datasets.
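Following the description above, the feature contributions can be computed with a few lines. The sketch below builds on the TinyNAM sketch from Section 2.1 (feature_nets is an assumed attribute) rather than the authors' code.

```python
import torch

@torch.no_grad()
def feature_contributions(model, x, X_train):
    # Raw outputs f_i(x_i) for the query sample(s).
    raw = torch.cat([net(x[:, i:i + 1])
                     for i, net in enumerate(model.feature_nets)], dim=1)
    # Average of each f_i over the training data, used as the per-feature baseline.
    baseline = torch.cat([net(X_train[:, i:i + 1]).mean(0, keepdim=True)
                          for i, net in enumerate(model.feature_nets)], dim=1)
    return raw - baseline  # positive values push the prediction up, negative pull it down
```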

At first glance, the observed inconsistency may appear problematic; however, neither explanation is inherently incorrect. Specifically, given the theoretical model, both x_{2} and x_{3} are important features under Case-II, as using only one of them can achieve high performance. Therefore, the distinct mapping functions reflect the different perspectives of the trained models, and each explanation is a valid interpretation of the data model, in which relying solely on either x_{2} or x_{3} is sufficient for high performance.

Refer to caption
(a) Learning Rate
Refer to caption
(b) Batch size
Figure 6: Mapping functions of NAMs trained with different learning rates and batch sizes. Given that all models achieve more than 99% test accuracy, this inconsistency tells us that high-performing models can have diverse perspectives on the given data model.

In Fig. 6, we vary the learning rates (\eta) and batch sizes (B) during training within the same theoretical model. We linearly increase the learning rate \eta from 0.005 to 0.01, and the batch size B from 5 to 50. In total, we trained 50 models for each experiment, where each NAM exhibits inconsistent mapping functions. However, it is important to note that all models achieved over 99% test accuracy on the dataset. This indicates that the diverse explanations are not incorrect; rather, they offer valuable external insights into the existence of diverse perspectives among high-performing models, complementing the internal explanations of individual models. Therefore, we posit that inconsistency can be a useful indicator of potential external explanatory factors. Based on these findings, we propose a new framework to leverage inconsistency and provide additional explanations within the data model.

3.2 Bayesian Neural Additive Model

In the previous subsection, we explored the inconsistency phenomenon in NAMs and suggested that rather than being a flaw, this inconsistency can serve as a valuable source of additional information, shedding light on underlying external explanations in the data model. In this section, we introduce BayesNAM, a novel framework designed to leverage this inconsistency. BayesNAM incorporates two key approaches: (1) a modeling approach based on Bayesian structure and (2) an optimization approach utilizing feature dropout. Each of these approaches will be detailed in the following paragraphs.

1) Modeling Approach: Bayesian Structure for Inconsistency Exploration. A naive approach to exploring possible inconsistencies in NAMs is to train multiple independent models. Indeed, Agarwal et al.[1] trained several NAMs and visualized the learned shape functions f_{k}(x_{k}). However, this requires training n independent models, leading to a computational burden proportional to n and making it impractical for large-scale applications.

To address this limitation, we propose using Bayesian neural networks [8, 9], which inherently allow for efficient exploration of model uncertainty without the need to train multiple independent models. Under variational inference and Bayes by Backprop [8, 9], a Bayesian neural network trains the mean parameter \bm{\mu}_{i} and the standard deviation parameter {\bm{s}}_{i} instead of a weight vector {\bm{w}}_{i}. Then, during the training and inference phases, it samples a weight {\bm{w}}_{i}=\bm{\mu}_{i}+{\bm{s}}_{i}\odot\bm{\epsilon} for a random vector \bm{\epsilon} drawn from a predefined distribution. Following prior works [11, 12], we adopt the reparameterization trick [16] for efficient training. This results in the following training objective:

\underset{\bm{\mu}_{i},{\bm{s}}_{i}}{\texttt{min}}\mathcal{L}\left(\sum_{i=1}^{d}f_{i}(x_{i}|\bm{\mu}_{i},{\bm{s}}_{i})+\beta,y\right)+\sum_{i=1}^{d}\texttt{KL}\left(q_{\bm{\mu}_{i},{\bm{s}}_{i}}({\bm{w}}_{i})\lVert p({\bm{w}}_{i})\right), (6)

where \mathcal{L}(\cdot) is a given loss function and \texttt{KL}(\cdot\lVert\cdot) is the KL-divergence. For further details, we refer the readers to [9].
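Assuming each feature network is built from layers that expose a kl() method, as in the BayesLinear sketch from Section 2.2, the objective in (6) can be sketched as below; bayesnam_loss and kl_weight are our own illustrative names (Eq. (6) itself uses no extra weighting factor).

```python
def bayesnam_loss(model, x, y, criterion, kl_weight=1.0):
    # Task loss on the additive prediction sum_i f_i(x_i | mu_i, s_i) + beta ...
    pred = model(x)
    # ... plus the summed KL terms of all Bayesian layers (kl() is an assumed method).
    kl = sum(layer.kl() for layer in model.modules() if hasattr(layer, "kl"))
    return criterion(pred, y) + kl_weight * kl
```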

Refer to caption
Figure 7: Structural framework of BayesNAM. BayesNAM trains the distribution of parameters through updating \bm{\mu} and {\bm{s}}. During the inference phase, based on the weights {\bm{w}}^{(j)} drawn from the trained distribution, it can provide rich explanations for its prediction by leveraging inconsistency, such as the confidence interval (upper) of the feature contribution (lower).

In Fig. 7, we present the structural framework that integrates a Bayesian neural network with a neural additive model. For each sampled weight {\bm{w}}^{(j)}, we compute the corresponding predictions f_{i}(x_{i}|{\bm{w}}^{(j)}_{i}). This sampling approach enables the model to efficiently explore a diverse range of model spaces without needing to train multiple models. By incorporating a Bayesian neural network, the model provides high-density regions of the mapping functions and confidence intervals for feature contributions, offering richer interpretability.
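A sketch of this Monte-Carlo inference step is given below, assuming the feature networks re-sample their weights on every forward pass (as in the BayesLinear sketch); the mean and a two-sigma band over the samples give the kind of intervals shown in Fig. 7.

```python
import torch

@torch.no_grad()
def mc_feature_outputs(model, x, num_samples=30):
    draws = []
    for _ in range(num_samples):  # each pass samples w^(j) = mu + s * eps internally
        draws.append(torch.cat([net(x[:, i:i + 1])
                                for i, net in enumerate(model.feature_nets)], dim=1))
    draws = torch.stack(draws)                    # (num_samples, batch, d)
    mean, std = draws.mean(0), draws.std(0)
    return mean, mean - 2 * std, mean + 2 * std   # mean and two-sigma band per feature
```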

Refer to caption
Refer to caption
(a) Naive Bayesian
Refer to caption
Refer to caption
(b) w/ Feature Dropout
Figure 8: Effectiveness of feature dropout. The same setting is used as in Fig. 4. Each plot shows the mapping functions of x_{2} (left) and x_{3} (right). Both models use the structural framework depicted in Fig. 7. Without feature dropout (Fig. 8a), the model tends to focus on a single feature, similar to training a single NAM. In contrast, feature dropout (Fig. 8b) enables the model to explore diverse explanations.

2) Optimization Approach: Feature Dropout for Encouraging Diverse Explanations. Although Bayesian neural networks provide an efficient mechanism for exploration, they do not inherently guarantee the exploration of diverse explanations. Indeed, as shown in Fig. 8a, a naive Bayesian neural network alone tends to focus on a single feature, similar to training a single NAM, rather than adequately exploring diverse explanations. As previously noted in related works [11, 12], we observe that increasing the standard deviation hyper-parameter s_{0} of the Bayesian neural network tends to degrade model performance and fails to address this issue effectively. Therefore, given the presence of diverse valid explanations shown in Figures 4 and 5, a method is needed to encourage the exploration of diverse explanations.

As a potential solution, we propose the use of feature dropout during optimization. Feature dropout, initially introduced by Agarwal et al.[1], extends traditional dropout by selectively omitting individual feature networks during training. The hyperparameter \tau determines the probability of dropping each feature. While the original work focused on improving model performance with feature dropout, here we provide a theoretical analysis showing that feature dropout implicitly encourages diverse explanations, preventing over-reliance on any single feature.
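A minimal sketch of feature dropout in the additive forward pass is shown below; this is our own illustrative implementation (whether the mask is drawn per example or per mini-batch is a design choice the text does not fix), not the authors' code.

```python
import torch

def forward_with_feature_dropout(model, x, tau=0.2, training=True):
    # Per-feature outputs f_i(x_i), kept separate so individual networks can be dropped.
    outs = torch.cat([net(x[:, i:i + 1])
                      for i, net in enumerate(model.feature_nets)], dim=1)
    if training and tau > 0:
        # Zero each feature network's output independently with probability tau.
        keep = (torch.rand_like(outs) > tau).float()
        outs = outs * keep
    return outs.sum(dim=1, keepdim=True) + model.bias
```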

Given the theoretical model in Section 3.1, we establish the following theorem.

Theorem 1.

(Feature Dropout Implicitly Encourages Exploring Diverse Explanations) Given the dataset ({\bm{x}},y) in (2), the linear classifier h in (4), and the feature dropout rate \tau, without loss of generality, the maximal training accuracy of h that only uses k features becomes

\mathcal{P}(k,\tau)=\mathbb{P}_{x_{i}\sim\mathcal{N}(\lambda y,\sigma^{2}),u_{i}\sim\mathcal{B}(1-\tau)}\left[\frac{y}{k}\sum_{i=2}^{k+1}u_{i}x_{i}>0\right]. (7)

Then, for k\geq 3 and \tau\in[0,\frac{1}{2}], the gap \Delta\mathcal{P}:=\mathcal{P}(k,\tau)-\mathcal{P}(1,\tau) is always positive and increases as \tau increases. Thus, the model leverages multiple features to achieve high performance, implicitly encouraging the exploration of diverse explanations.

Sketch of proof.

Let q_{j}=\Phi_{Z}(\lambda\sqrt{j}/\sigma). Then, \mathcal{P}(k,\tau) can be formalized as follows:

\mathcal{P}(k,\tau)=\sum^{k}_{j=1}q_{j}\binom{k}{j}(1-\tau)^{j}\tau^{k-j}.

To show that \Delta\mathcal{P}(k,\tau)>0 and \frac{\partial}{\partial\tau}\Delta\mathcal{P}(k,\tau)>0 for k\geq 3 and \tau\in[0,\frac{1}{2}], we use mathematical induction.

Applying Pascal’s identity and strong induction, we find

\frac{\partial}{\partial\tau}\Delta\mathcal{P}(k+1,\tau)=\frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)+q_{1}\tau^{2}(3-(k+1)\tau^{k-2})
+(q_{2}-q_{1})\tau(1-\tau)(6-(k+1)k\tau^{k-2})+3(q_{3}-q_{2})(1-\tau)^{2}
-\sum^{k}_{j=2}(j+1)(q_{j+1}-q_{j})\binom{k+1}{j+1}(1-\tau)^{j}\tau^{k-j}

We now consider two cases for \tau: (1) \tau\in\left[0,\frac{3}{8}\right] and (2) \tau\in\left(\frac{3}{8},\frac{1}{2}\right]. In each case, we prove that \frac{\partial}{\partial\tau}\Delta\mathcal{P}(k+1,\tau)>0 by finding the value of \lambda/\sigma that minimizes each term. With these lower bounds, we can conclude that the overall expression is positive. (Detailed proof is presented in the Appendix.)

Refer to caption
(a) \lambda=3
Refer to caption
(b) \lambda=0.01
Figure 9: Empirical verification of Theorem 1. As \tau increases in the range [0,\frac{1}{2}], \Delta\mathcal{P} increases as well. Moreover, as k increases, the acceptable range of \tau in Theorem 1 expands.

Fig. 9 empirically verifies Theorem 1. We plot \Delta\mathcal{P}(k,\tau) with varying k. Other settings are the same as Case-II in Section 3.1. The increasing trend of \Delta\mathcal{P}(k,\tau) for \tau\in[0,\frac{1}{2}] aligns with the implications of Theorem 1. Furthermore, the acceptable range of \tau in Theorem 1 expands as k increases. When k=100, \Delta\mathcal{P}(k,\tau) is increasing for \tau\in[0,0.9]. Even for k=2, we observe that \Delta\mathcal{P}(k,\tau) increases until \tau=0.4. Since Theorem 1 holds regardless of the value of \lambda, we observe similar results for a very small value of \lambda=0.01.
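Theorem 1 can also be checked numerically with a few lines. The sketch below evaluates \mathcal{P}(k,\tau) from the proof for the Case-II setting (\lambda=3, \sigma^{2}=2) and prints \Delta\mathcal{P}(3,\tau), which grows with \tau on [0,\frac{1}{2}] as in Fig. 9.

```python
import numpy as np
from math import comb
from scipy.stats import norm

def P(k, tau, lam=3.0, sigma=np.sqrt(2.0)):
    # P(k, tau) = sum_j q_j * C(k, j) * (1 - tau)^j * tau^(k - j), q_j = Phi(lam * sqrt(j) / sigma)
    return sum(norm.cdf(lam * np.sqrt(j) / sigma) * comb(k, j)
               * (1 - tau) ** j * tau ** (k - j) for j in range(1, k + 1))

for tau in (0.0, 0.25, 0.5):
    print(tau, P(3, tau) - P(1, tau))  # Delta P increases with tau for k >= 3
```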

In summary, we theoretically and empirically verify that feature dropout encourages the model to explore diverse explanations by using multiple features in the dataset. As shown in Fig. 8b, incorporating feature dropout enables the model to explore explanations across a range of features. Therefore, we introduce the framework that combines Bayesian neural networks with feature dropout as Bayesian Neural Additive Model (BayesNAM).

4 Experiments

In this section, we present empirical findings comparing our proposed BayesNAM against traditional models, such as Logistic/Linear Regression, Classification and Regression Trees (CART), and Gradient Boosted Trees (XGBoost) [17], as well as recent explainable models, including the Explainable Boosting Machine (EBM) [18], NAM, and NAM with an ensemble method (NAM+Ens). For Logistic/Linear Regression, CART, XGBoost, and EBM, we conducted a grid search for hyperparameter tuning, following the settings outlined in [1]. We found that using ResNet blocks, comprising two group convolution layers with BatchNorm and ReLU activation, yields better performance for NAM and BayesNAM compared to the ExU units or ReLU-n suggested in [1]. For NAM+Ens, we trained five independent NAMs, and both NAM+Ens and BayesNAM utilized soft voting for model aggregation during evaluation. Detailed settings are provided in the Appendix.
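The exact layout of these blocks is not specified here, so the sketch below is only one possible reading of "two group convolution layers with BatchNorm and ReLU activation": grouped 1x1 convolutions with groups equal to the number of features keep the features from mixing, and a skip connection completes the block. GroupedResBlock and the width parameter are our own assumptions.

```python
import torch
import torch.nn as nn

class GroupedResBlock(nn.Module):
    """One possible ResNet-style block for NAM feature networks (assumed layout)."""
    def __init__(self, num_features: int, width: int = 16):
        super().__init__()
        ch = num_features * width  # `width` channels per feature, never mixed across features
        self.body = nn.Sequential(
            nn.Conv1d(ch, ch, kernel_size=1, groups=num_features),
            nn.BatchNorm1d(ch), nn.ReLU(),
            nn.Conv1d(ch, ch, kernel_size=1, groups=num_features),
            nn.BatchNorm1d(ch),
        )
        self.act = nn.ReLU()

    def forward(self, h):  # h: (batch, num_features * width, 1)
        return self.act(h + self.body(h))
```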

We evaluated all models on five different datasets: Credit Fraud [19], FICO [20], and COMPAS [21] for classification tasks, and California Housing (CA Housing) [22] and Boston [23] for regression tasks. As shown in Table I, BayesNAM demonstrates comparable performance to other benchmarks across datasets, with particularly strong results in classification tasks such as COMPAS, Credit Fraud, and FICO. For regression tasks, BayesNAM tends to be less accurate, which we discuss further in the Appendix.

TABLE I: Performance comparison between models on 5 different random seeds. Higher AUC is better (\uparrow) and lower RMSE is better (\downarrow).
Model | COMPAS (AUC↑) | Credit (AUC↑) | FICO (AUC↑) | Boston (RMSE↓) | CA Housing (RMSE↓)
Log./Lin. Reg. | 0.699±0.005 | 0.977±0.004 | 0.706±0.005 | 5.517±0.009 | 0.731±0.010
CART | 0.776±0.005 | 0.956±0.005 | 0.784±0.002 | 4.133±0.004 | 0.712±0.007
XGBoost | 0.743±0.012 | 0.980±0.005 | 0.795±0.001 | 3.155±0.009 | 0.531±0.011
EBM | 0.764±0.009 | 0.978±0.007 | 0.793±0.005 | 3.301±0.005 | 0.558±0.012
NAM | 0.769±0.011 | 0.989±0.007 | 0.804±0.003 | 3.567±0.012 | 0.556±0.009
NAM+Ens | 0.771±0.005 | 0.990±0.004 | 0.804±0.002 | 3.555±0.006 | 0.554±0.003
BayesNAM | 0.784±0.009 | 0.991±0.003 | 0.804±0.001 | 3.620±0.011 | 0.556±0.007

4.1 Identifying Data Insufficiency

The capability of BayesNAM to explore diverse explanations further allows us to obtain confidence information about feature contributions. In the left plot of Figure 10, we plot the feature contributions of two randomly drawn offenders from each target value, ‘reoffended’ (y=1) or ‘not’ (y=0).

Refer to caption
Refer to caption
Figure 10: Empirical results on COMPAS. (Left) Feature contribution of two randomly drawn offenders obtained from BayesNAM. Error bars correspond to the standard deviation of feature contribution. juv_other_count shows an extremely high variance. (Right) Mapping function of juv_other_count. Gray lines represent the mapping functions from five different NAMs, while the red line and orange shaded area indicate the average mapping function and its two-sigma interval for BayesNAM, respectively. The mapping functions begin to diverge, i.e., inconsistency occurs, when juv_other_count \geq 4.
Refer to caption
(a) Data distribution within juv_other_count
Refer to caption
(b) Class imbalance in juv_other_count \geq 9.
Figure 11: Empirical findings on COMPAS. (Top) The high-variance area corresponds to a lack of data for juv_other_count \geq 4. (Bottom) This range also shows skewed proportions, particularly for juv_other_count \geq 9, where all labels are either ’reoffended’ or ’not.’

Among the features, juv_other_count (which represents the number of non-felony juvenile offenses a person has been convicted of) exhibits high variance in its contributions. This high variance indicates that, with a single NAM, the contribution of juv_other_count can appear either extremely negative or positive, potentially leading to misinterpretation. BayesNAM reveals substantial variability among models, suggesting that juv_other_count can have both positive and negative effects on predictions within certain ranges.

What can we infer from this high variation? In the right plot of Figure 10, we analyze the mapping function of juv_other_count. NAMs (gray) show inconsistent explanations for juv_other_count. The two-sigma interval of the mapping functions of BayesNAM (orange) also starts to diverge significantly when juv_other_count \geq 4, indicating increased inconsistency in this range. As shown in Figure 11a, the range juv_other_count \geq 4 suffers from a lack of sufficient data. Moreover, this range also shows skewed proportions, especially for juv_other_count \geq 9, where all labels are either ’reoffended’ or ’not.’ In summary, we verify that the high inconsistency highlights the need for caution when interpreting examples involving this feature, and suggests potential issues such as a lack of data.

In addition to identifying data insufficiencies, our model can also be used for feature selection. Features with high absolute contributions and small standard deviations, such as priors_count (which represents the total number of prior offenses a person has been convicted of), consistently demonstrate significant impact across different models.

4.2 Capturing Structural Limitation

Refer to caption
(a) Model prediction
Refer to caption
(b) Price visualization
Figure 12: Empirical findings on CA. (Top) Mapping functions for Longitude, similar to the right plot of Fig. 10. (Bottom) Housing prices are represented with colors in thousand dollars.
Refer to caption
Refer to caption
Figure 13: High variance observed in Fig. 12a suggests the potential failure of model assumption. (Left) Based on our analysis in Fig. 12, we construct and train a new model that contains the interaction term between Latitude and Longitude. As a result, the variance of the prediction is significantly decreased. (Right) Feature importance gained from a newly constructed BayesNAM for each location. The interaction term is highly important when distinguishing Yosemite from San Francisco and Los Angeles.

In addition to data insufficiencies, a high level of inconsistency can reveal structural limitations within the model. In Fig. 12a, we plot the results of NAMs and BayesNAM for longitude. These functions show higher housing prices in San Francisco (around -122.5) and Los Angeles (around -118.5), consistent with previous findings [1]. However, BayesNAM finds that there exist inconsistent explanations between these two cities, particularly within the longitude range of -120 to -119.

We hypothesize that this inconsistency is due to significant variations in housing prices (target variable) across different latitudes. Fig. 12b illustrates the distribution of housing prices in California, with red circles indicating higher prices and larger circles representing higher volumes of houses. As shown in Fig. 12b, while Santa Barbara, Yosemite National Park, and Fresno are on similar longitudes, Santa Barbara exhibits a substantial price gap compared to the others. Additionally, since Yosemite National Park and Fresno are near the same latitude as San Francisco, NAM might struggle to accurately predict housing prices without the interaction term between Latitude and Longitude.

Based on this observation, we train a NAM with an interaction term between Latitude and Longitude. This model achieved much better performance (RMSE: 0.506\pm 0.005) than the model without the interaction term (RMSE: 0.556\pm 0.009). In addition, the significance of the interaction term is particularly evident near Yosemite. When BayesNAM is trained with this interaction term, it also shows reduced variance in the longitude range of -120 to -119. This suggests that high variance can highlight potential structural limitations within models. Moreover, considering that existing methods such as NA2M [1], which incorporate all possible interaction terms, incur heavy computational costs and diminished explainability, BayesNAM offers a promising alternative by effectively selecting the most important interaction terms.
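One way to add such a pairwise term is a small two-input network whose output is added to the additive prediction; this is a hypothetical sketch (InteractionTerm and the layer sizes are our own choices), not the exact model used in the experiment above.

```python
import torch
import torch.nn as nn

class InteractionTerm(nn.Module):
    """f_{lat,lon}(latitude, longitude): a single learned pairwise interaction (sketch)."""
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, lat, lon):  # lat, lon: (batch, 1)
        return self.net(torch.cat([lat, lon], dim=1))

# Prediction with the extra term: sum_i f_i(x_i) + f_{lat,lon}(x_lat, x_lon) + beta.
```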

5 Conclusion

In this work, we identified and analyzed the inconsistent explanations of NAMs. We highlighted the importance of acknowledging these inconsistencies and introduced a new framework, BayesNAM, which leverages inconsistency to provide more reliable explanations. Through empirical validation, we demonstrated that BayesNAM effectively explores diverse explanations and provides external explanations such as insufficient data or model limitations within the data model. We hope our research contributes to the development of trustworthy models.

References

  • [1] R. Agarwal, L. Melnick, N. Frosst, X. Zhang, B. Lengerich, R. Caruana, and G. E. Hinton, “Neural additive models: Interpretable machine learning with neural nets,” Advances in Neural Information Processing Systems, vol. 34, pp. 4699–4711, 2021.
  • [2] F. Radenovic, A. Dubey, and D. Mahajan, “Neural basis models for interpretability,” Advances in Neural Information Processing Systems, vol. 35, pp. 8414–8426, 2022.
  • [3] I. E. Kumar, S. Venkatasubramanian, C. Scheidegger, and S. Friedler, “Problems with shapley-value-based explanations as feature importance measures,” in International Conference on Machine Learning.   PMLR, 2020, pp. 5491–5500.
  • [4] C. Rudin, “Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead,” Nature Machine Intelligence, vol. 1, no. 5, pp. 206–215, 2019.
  • [5] T. Hastie and R. Tibshirani, “Generalized additive models,” Monographs on Statistics and Applied Probability, Chapman & Hall, vol. 43, p. 335, 1990.
  • [6] T. G. Dietterich, “Ensemble methods in machine learning,” in Multiple Classifier Systems: First International Workshop, MCS 2000 Cagliari, Italy, June 21–23, 2000 Proceedings 1.   Springer, 2000, pp. 1–15.
  • [7] M. Welling and Y. W. Teh, “Bayesian learning via stochastic gradient langevin dynamics,” in Proceedings of the 28th international conference on machine learning (ICML-11), 2011, pp. 681–688.
  • [8] A. Graves, “Practical variational inference for neural networks,” Advances in neural information processing systems, vol. 24, 2011.
  • [9] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra, “Weight uncertainty in neural network,” in International conference on machine learning.   PMLR, 2015, pp. 1613–1622.
  • [10] C. Li, C. Chen, D. Carlson, and L. Carin, “Preconditioned stochastic gradient langevin dynamics for deep neural networks,” in Proceedings of the AAAI conference on artificial intelligence, 2016.
  • [11] X. Liu, Y. Li, C. Wu, and C.-J. Hsieh, “Adv-bnn: Improved adversarial defense through robust bayesian neural network,” arXiv preprint arXiv:1810.01279, 2018.
  • [12] S. Lee, H. Kim, and J. Lee, “Graddiv: Adversarial robustness of randomized neural networks via gradient diversity regularization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
  • [13] A. Singh, S. Sengupta, M. A. Rasheed, V. Jayakumar, and V. Lakshminarayanan, “Uncertainty aware and explainable diagnosis of retinal disease,” in Medical Imaging 2021: Imaging Informatics for Healthcare, Research, and Applications, vol. 11601.   SPIE, 2021, pp. 116–125.
  • [14] H. Jang and J. Lee, “Generative bayesian neural network model for risk-neutral pricing of american index options,” Quantitative Finance, vol. 19, no. 4, pp. 587–603, 2019.
  • [15] D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry, “Robustness may be at odds with accuracy,” in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=SyxAb30cY7
  • [16] D. P. Kingma, T. Salimans, and M. Welling, “Variational dropout and the local reparameterization trick,” Advances in neural information processing systems, vol. 28, 2015.
  • [17] T. Chen and C. Guestrin, “Xgboost: A scalable tree boosting system,” in Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, 2016, pp. 785–794.
  • [18] H. Nori, S. Jenkins, P. Koch, and R. Caruana, “Interpretml: A unified framework for machine learning interpretability,” arXiv preprint arXiv:1909.09223, 2019.
  • [19] A. Dal Pozzolo, “Adaptive machine learning for credit card fraud detection,” Université libre de Bruxelles, 2015.
  • [20] FICO, “Fico explainable machine learning challenge,” https://community.fico.com/s/explainable-machine-learning-challenge, 2018.
  • [21] ProPublica, “Compas data and analysis for ‘machine bias’,” https://github.com/propublica/compas-analysis, 2016.
  • [22] R. K. Pace and R. Barry, “Sparse spatial autoregressions,” Statistics & Probability Letters, vol. 33, no. 3, pp. 291–297, 1997.
  • [23] D. Harrison Jr and D. L. Rubinfeld, “Hedonic housing prices and the demand for clean air,” Journal of environmental economics and management, vol. 5, no. 1, pp. 81–102, 1978.

Supplements to BayesNAM: Leveraging Inconsistency for Reliable Explanations

Proofs

Proof for Lemma 1.

Proof.

Let h(x_{2},\cdots,x_{d})=\operatorname{sign}(\sum_{i=2}^{d}x_{i}/(d-1)). Then, the accuracy of h(x_{2},\cdots,x_{d}) becomes

\mathbb{P}[h(x_{2},\cdots,x_{d})=y]=\mathbb{P}\left[\frac{y}{d-1}\sum_{i=2}^{d}\mathcal{N}(\lambda y,\sigma^{2})>0\right]=\Phi_{X\sim\mathcal{N}\left(0,\sigma^{2}/(d-1)\right)}(\lambda).

Preliminary Lemma for Proof for Theorem 1.

Lemma 2.

Let q_{j}=\Phi_{Z}(\lambda\sqrt{j}/\sigma). Then, the following inequality holds:

(q_{j+1}-q_{j})\leq\frac{1}{\sqrt{2\pi}}\int_{\sqrt{j\text{ln}((j+1)/j)}}^{\sqrt{(j+1)\text{ln}((j+1)/j)}}e^{-\frac{1}{2}t^{2}}dt,

Moreover,

(j+1)\frac{1}{\sqrt{2\pi}}\int_{\sqrt{j\text{ln}((j+1)/j)}}^{\sqrt{(j+1)\text{ln}((j+1)/j)}}e^{-\frac{1}{2}t^{2}}dt

decreases with respect to j, so the upper bound for j\geq 2 is \frac{3}{\sqrt{2\pi}}\int_{\sqrt{2\text{ln}(3/2)}}^{\sqrt{3\text{ln}(3/2)}}e^{-\frac{1}{2}t^{2}}dt.

Proof.

To find the upper bound of q_{j+1}-q_{j}, we take the derivative with respect to a:=\lambda/\sigma (so that q_{j}=\Phi_{Z}(a\sqrt{j})) as follows:

\frac{\partial}{\partial a}(q_{j+1}-q_{j})=\frac{1}{\sqrt{2\pi}}(\sqrt{j+1}e^{-\frac{j+1}{2}a^{2}}-\sqrt{j}e^{-\frac{j}{2}a^{2}})=0\quad\text{at}\quad a=\sqrt{\text{ln}((j+1)/j)}.

Thus, the first inequality holds. Next, we take the derivative with respect to j to verify that the left-hand side decreases.

\frac{\partial}{\partial j}(j+1)\int_{\sqrt{j\text{ln}((j+1)/j)}}^{\sqrt{(j+1)\text{ln}((j+1)/j)}} e^{-\frac{1}{2}t^{2}}dt=\int_{\sqrt{j\text{ln}((j+1)/j)}}^{\sqrt{(j+1)\text{ln}((j+1)/j)}}e^{-\frac{1}{2}t^{2}}dt
+(j+1)\left[\frac{(\frac{j}{j+1})^{\frac{j+1}{2}}}{2\sqrt{(j+1)\ln((j+1)/j)}}\left\{\ln\frac{j+1}{j}-\frac{1}{j}\right\}-\frac{(\frac{j}{j+1})^{\frac{j}{2}}}{2\sqrt{j\ln((j+1)/j)}}\left\{\ln\frac{j+1}{j}-\frac{1}{j+1}\right\}\right]

Simplifying this equation:

(j+1)\left[\frac{(\frac{j}{j+1})^{\frac{j+1}{2}}}{2\sqrt{(j+1)\ln((j+1)/j)}}\left\{\ln\frac{j+1}{j}-\frac{1}{j}\right\}-\frac{(\frac{j}{j+1})^{\frac{j}{2}}}{2\sqrt{j\ln((j+1)/j)}}\left\{\ln\frac{j+1}{j}-\frac{1}{j+1}\right\}\right]
=\frac{j+1}{2\sqrt{\ln((j+1)/j)}}\left(\frac{j}{j+1}\right)^{\frac{j}{2}}\times\left[\frac{\sqrt{j}}{{j+1}}\left(\ln\frac{j+1}{j}-\frac{1}{j}\right)-\frac{1}{\sqrt{j}}\left(\ln\frac{j+1}{j}-\frac{1}{j+1}\right)\right]
=-\frac{1}{2}\left(\frac{j}{j+1}\right)^{\frac{j}{2}}\sqrt{\ln\frac{j+1}{j}}\cdot\frac{1}{\sqrt{j}}

Since

\int_{\sqrt{j\text{ln}((j+1)/j)}}^{\sqrt{(j+1)\text{ln}((j+1)/j)}}e^{-\frac{1}{2}t^{2}}dt\leq\sqrt{\ln\frac{j+1}{j}}(\sqrt{j+1}-\sqrt{j})\left(\frac{j}{j+1}\right)^{\frac{j}{2}},

we have

\int_{\sqrt{j\text{ln}((j+1)/j)}}^{\sqrt{(j+1)\text{ln}((j+1)/j)}}e^{-\frac{1}{2}t^{2}}dt-\frac{1}{2}\left(\frac{j}{j+1}\right)^{\frac{j}{2}}\sqrt{\ln\frac{j+1}{j}}\cdot\frac{1}{\sqrt{j}}\leq\sqrt{\ln\frac{j+1}{j}}\left(\frac{j}{j+1}\right)^{\frac{j}{2}}(\sqrt{j+1}-\sqrt{j}-\frac{1}{2\sqrt{j}})\leq 0,

for j\geq 2. ∎

Proof for Theorem 1.

Proof.

Let q_{j}=\Phi_{Z}(\lambda\sqrt{j}/\sigma). Then, \mathcal{P}(k,\tau) can be formalized as follows:

\mathcal{P}(k,\tau)=\sum^{k}_{j=1}q_{j}\binom{k}{j}(1-\tau)^{j}\tau^{k-j}. (8)

Additionally, define the difference between \mathcal{P}(k,\tau) and \mathcal{P}(1,\tau) as follows:

\Delta\mathcal{P}(k,\tau):=\mathcal{P}(k,\tau)-\mathcal{P}(1,\tau)=\sum^{k}_{j=1}q_{j}\binom{k}{j}(1-\tau)^{j}\tau^{k-j}-q_{1}(1-\tau).

It is trivial that \Delta\mathcal{P}(k,0)=q_{k}-q_{1}>0 for all k\geq 3.

To show that \frac{\partial}{\partial\tau}\Delta\mathcal{P}(k,\tau)>0 for \tau\in[0,\frac{1}{2}], we first prove that \frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)>0 for \tau\in[0,\frac{1}{2}]. For k=3, we have

\Delta\mathcal{P}(3,\tau)=\sum^{3}_{j=1}q_{j}\binom{3}{j}(1-\tau)^{j}\tau^{3-j}-q_{1}(1-\tau)
=(-3q_{1}+3q_{2}-q_{3})\tau^{3}+(3q_{1}-6q_{2}+3q_{3})\tau^{2}+(q_{1}+3q_{2}-3q_{3})\tau+(-q_{1}+q_{3}),

and

\frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)=(-9q_{1}+9q_{2}-3q_{3})\tau^{2}+(6q_{1}-12q_{2}+6q_{3})\tau+(q_{1}+3q_{2}-3q_{3}). (9)

For simplicity, let us denote a=\lambda/\sigma so that q_{j}=\Phi_{Z}(a\sqrt{j}). Since \frac{\partial}{\partial a}\Phi_{Z}(a\sqrt{j})=\sqrt{j}\cdot\phi(a\sqrt{j}),

\frac{\partial}{\partial a}(q_{2}-q_{1})=\frac{\partial}{\partial a}(\Phi_{Z}(\sqrt{2}a)-\Phi_{Z}(a))=\sqrt{2}\phi(a\sqrt{2})-\phi(a), (10)

where \phi(x)=\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}x^{2}}. (10) becomes 0 when

\frac{\partial}{\partial a}(q_{2}-q_{1})=\frac{1}{\sqrt{2\pi}}(\sqrt{2}e^{-a^{2}}-e^{-\frac{1}{2}a^{2}})=0\quad\text{at}\quad a=\sqrt{\text{ln}2}.

Since q_{j}>\frac{1}{2} for a>0 and j>0, we have

q_{2}-q_{1}\leq\frac{1}{\sqrt{2\pi}}\int_{\sqrt{\text{ln}2}}^{\sqrt{2\text{ln}2}}e^{-\frac{1}{2}t^{2}}dt\approx 0.08302<\frac{1}{6}<\frac{1}{3}q_{3},

so that -9q_{1}+9q_{2}-3q_{3}<0 for any a.

Similarly, we can easily show that

\frac{\partial}{\partial a}(q_{1}+3q_{2}-3q_{3})=\frac{1}{\sqrt{2\pi}}(e^{-\frac{1}{2}a^{2}}+3\sqrt{2}e^{-a^{2}}-3\sqrt{3}e^{-\frac{3}{2}a^{2}})=\frac{1}{\sqrt{2\pi}}(x+3\sqrt{2}x^{2}-3\sqrt{3}x^{3}), (11)

where x=e^{-\frac{1}{2}a^{2}}\in(0,1] for a^{2}\in[0,\infty). The solutions that make (11) equal to zero can be explicitly calculated as x=0,\frac{\sqrt{6}-\sqrt{6+4\sqrt{3}}}{6}<0 and \frac{\sqrt{6}+\sqrt{6+4\sqrt{3}}}{6}>1. The minimum value of q_{1}+3q_{2}-3q_{3} is attained when a lies on the boundary. Thus, \frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)|_{\tau=0}\geq 0.5.

We will now demonstrate that \frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)|_{\tau=\frac{1}{2}}>0 by showing that 4\cdot\frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)|_{\tau=\frac{1}{2}}=7q_{1}-3q_{2}-3q_{3}>0. Note that

7q_{1}-3q_{2}-3q_{3}=\frac{7}{\sqrt{2\pi}}\int_{-\infty}^{a}e^{-\frac{1}{2}t^{2}}dt-\frac{3}{\sqrt{2\pi}}\int_{-\infty}^{a\sqrt{2}}e^{-\frac{1}{2}t^{2}}dt-\frac{3}{\sqrt{2\pi}}\int_{-\infty}^{a\sqrt{3}}e^{-\frac{1}{2}t^{2}}dt. (12)

By taking the derivative with respect to a, we have

\frac{\partial}{\partial a}(7q_{1}-3q_{2}-3q_{3})=\frac{1}{\sqrt{2\pi}}(7e^{-\frac{1}{2}a^{2}}-3\sqrt{2}e^{-a^{2}}-3\sqrt{3}e^{-\frac{3}{2}a^{2}})=\frac{1}{\sqrt{2\pi}}(7x-3\sqrt{2}x^{2}-3\sqrt{3}x^{3}), (13)

where x=e^{-\frac{1}{2}a^{2}}. The solutions that make (13) equal to zero can be explicitly calculated as x=0,\frac{\pm\sqrt{6+28\sqrt{3}}-\sqrt{6}}{6}. Since e^{-\frac{1}{2}a^{2}}\in(0,1] and strictly decreases, the minimum occurs at e^{-\frac{1}{2}\tilde{a}^{2}}=\frac{\sqrt{6+28\sqrt{3}}-\sqrt{6}}{6}, where \tilde{a}=\sqrt{2\ln{\frac{6}{\sqrt{6+28\sqrt{3}}-\sqrt{6}}}}. Thus, we obtain the following inequality:

7q_{1}-3q_{2}-3q_{3}\geq\frac{7}{\sqrt{2\pi}}\int_{-\infty}^{\tilde{a}}e^{-\frac{1}{2}t^{2}}dt-\frac{3}{\sqrt{2\pi}}\int_{-\infty}^{\tilde{a}\sqrt{2}}e^{-\frac{1}{2}t^{2}}dt-\frac{3}{\sqrt{2\pi}}\int_{-\infty}^{\tilde{a}\sqrt{3}}e^{-\frac{1}{2}t^{2}}dt>0.1217>0.

This implies that \frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)\lvert_{\tau=\frac{1}{2}}>0.0304>0. Since \frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau) is a quadratic function of \tau with a negative leading coefficient, it follows that \frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)>0 for \tau\in[0,\frac{1}{2}].

Next, we demonstrate that \Delta\mathcal{P}(k,\tau) increases as \tau increases for \tau\in[0,\frac{1}{2}] and k\geq 3. To begin, we decompose \Delta\mathcal{P}(k+1,\tau) as follows:

\Delta\mathcal{P}(k+1,\tau)=\sum^{k+1}_{j=1}q_{j}\binom{k+1}{j}(1-\tau)^{j}\tau^{k+1-j}-q_{1}(1-\tau)
=\sum^{k}_{j=1}q_{j}\left(\binom{k}{j}+\binom{k}{j-1}\right)(1-\tau)^{j}\tau^{k+1-j}+q_{k+1}(1-\tau)^{k+1}-q_{1}(1-\tau)
=\Delta\mathcal{P}(k,\tau)+\sum^{k}_{j=1}(q_{j+1}-q_{j})\binom{k}{j}(1-\tau)^{j+1}\tau^{k-j}+q_{1}(1-\tau)\tau^{k}
=\Delta\mathcal{P}(3,\tau)+\sum^{k}_{l=3}\left[\sum^{l}_{j=1}(q_{j+1}-q_{j})\binom{l}{j}(1-\tau)^{j+1}\tau^{l-j}+q_{1}(1-\tau)\tau^{l}\right]
=\Delta\mathcal{P}(3,\tau)+q_{1}(1-\tau)\sum^{k}_{l=3}\tau^{l}+(q_{2}-q_{1})(1-\tau)^{2}\sum^{k}_{l=3}\binom{l}{1}\tau^{l-1}+(q_{3}-q_{2})(1-\tau)^{3}\sum^{k}_{l=3}\binom{l}{2}\tau^{l-2}
\quad+\sum^{k}_{j=3}(q_{j+1}-q_{j})(1-\tau)^{j+1}\sum^{k}_{l=j}\binom{l}{j}\tau^{l-j}.

By taking the derivative with respect to \tau, we have

\frac{\partial}{\partial\tau}\Delta\mathcal{P}(k+1,\tau)=\frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)+q_{1}\sum^{k}_{l=3}\tau^{l-1}\Big\{\binom{l}{1}-\binom{l+1}{1}\tau\Big\}+2(q_{2}-q_{1})(1-\tau)\sum^{k}_{l=3}\tau^{l-2}\Big\{\binom{l}{2}-\binom{l+1}{2}\tau\Big\}
\quad+3(q_{3}-q_{2})(1-\tau)^{2}\sum^{k}_{l=3}\tau^{l-3}\Big\{\binom{l}{3}-\binom{l+1}{3}\tau\Big\}
\quad+\sum^{k}_{j=3}(j+1)(q_{j+1}-q_{j})(1-\tau)^{j}\Big[-1+\sum^{k}_{l=j+1}\tau^{l-j-1}\Big\{\binom{l}{j+1}-\binom{l+1}{j+1}\tau\Big\}\Big]
=\frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)+q_{1}(3\tau^{2}-\binom{k+1}{1}\tau^{k})+2(q_{2}-q_{1})(1-\tau)(\binom{3}{2}\tau-\binom{k+1}{2}\tau^{k-1})
\quad+3(q_{3}-q_{2})(1-\tau)^{2}(1-\binom{k+1}{3}\tau^{k-2})-\sum^{k}_{j=3}(j+1)(q_{j+1}-q_{j})\binom{k+1}{j+1}(1-\tau)^{j}\tau^{k-j}
=\frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)+3q_{1}\tau^{2}+6(q_{2}-q_{1})\tau(1-\tau)+3(q_{3}-q_{2})(1-\tau)^{2}
\quad-(k+1)\sum^{k}_{j=0}(q_{j+1}-q_{j})\binom{k}{j}(1-\tau)^{j}\tau^{k-j}-(k+1)q_{0}\tau^{k}
=\frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)+q_{1}\tau^{2}(3-(k+1)\tau^{k-2})+(q_{2}-q_{1})\tau(1-\tau)(6-(k+1)k\tau^{k-2})
\quad+3(q_{3}-q_{2})(1-\tau)^{2}-\sum^{k}_{j=2}(j+1)(q_{j+1}-q_{j})\binom{k+1}{j+1}(1-\tau)^{j}\tau^{k-j}.

By Lemma 2, for j\geq 2,

(j+1)(q_{j+1}-q_{j})\leq(j+1)\frac{1}{\sqrt{2\pi}}\int_{\sqrt{j\text{ln}((j+1)/j)}}^{\sqrt{(j+1)\text{ln}((j+1)/j)}}e^{-\frac{1}{2}t^{2}}dt\leq\frac{3}{\sqrt{2\pi}}\int_{\sqrt{2\text{ln}(3/2)}}^{\sqrt{3\text{ln}(3/2)}}e^{-\frac{1}{2}t^{2}}dt\leq 0.147.

Additionally, since \frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau) is a quadratic function, we have

\frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)\geq(1-2\tau)\frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)|_{\tau=0}+2\tau\frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)|_{\tau=\frac{1}{2}}\geq 0.5-0.94\tau.

Moreover, since (k+1)\tau^{k-2} decreases for k\geq 3, we have

\frac{\partial}{\partial\tau}\Delta\mathcal{P}(k+1,\tau)\geq\frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)+q_{1}\tau^{2}(3-(k+1)\tau^{k-2})-\sum^{k}_{j=2}(j+1)(q_{j+1}-q_{j})\binom{k+1}{j+1}(1-\tau)^{j}\tau^{k-j}
\geq 0.5-0.94\tau+\frac{1}{2}\tau^{2}(3-4\tau)-\frac{0.147}{1-\tau}.

Since 0.5-0.94\tau+\frac{1}{2}\tau^{2}(3-4\tau)-\frac{0.147}{1-\tau}>0 for \tau\in[0,\frac{3}{8}], it follows that \frac{\partial}{\partial\tau}\Delta\mathcal{P}(k+1,\tau)\geq 0 for \tau\in[0,\frac{3}{8}] and k\geq 3.

Now, we will show that \frac{\partial}{\partial\tau}\Delta\mathcal{P}(4,\tau)\geq 0 and \frac{\partial}{\partial\tau}\Delta\mathcal{P}(5,\tau)\geq 0 for \tau\in[\frac{3}{8},\frac{1}{2}]. First,

\frac{\partial}{\partial\tau}\Delta\mathcal{P}(4,\tau)=\frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)+3q_{1}\tau^{2}+6(q_{2}-q_{1})\tau(1-\tau)+3(q_{3}-q_{2})(1-\tau)^{2}-4\sum^{3}_{j=0}(q_{j+1}-q_{j})\binom{3}{j}(1-\tau)^{j}\tau^{3-j}-4q_{0}\tau^{3}
=\frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)+q_{1}\tau^{2}(3-4\tau)+6(q_{2}-q_{1})\tau(1-\tau)(1-2\tau)+3(q_{3}-q_{2})(1-\tau)^{2}(1-4\tau)-4(q_{4}-q_{3})(1-\tau)^{3}.

By using Lemma 2, we have the following inequalities:

q_{3}-q_{2}\leq\frac{1}{\sqrt{2\pi}}\int_{\sqrt{2\text{ln}(3/2)}}^{\sqrt{3\text{ln}(3/2)}}e^{-\frac{1}{2}t^{2}}dt\leq 0.049,

and

q_{4}-q_{3}\leq\frac{1}{\sqrt{2\pi}}\int_{\sqrt{3\text{ln}(4/3)}}^{\sqrt{4\text{ln}(4/3)}}e^{-\frac{1}{2}t^{2}}dt\leq 0.035.

Since q_{3}-q_{2}\geq q_{4}-q_{3}, we also have the following inequalities:

3(q_{3}-q_{2})(1-\tau)^{2}(4\tau-1)+4(q_{4}-q_{3})(1-\tau)^{3}\leq(q_{3}-q_{2})(1-\tau)^{2}(1+8\tau)\leq\frac{25}{16}(q_{3}-q_{2})\leq 0.077,

and

\frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)+q_{1}\tau^{2}(3-4\tau)\geq\frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)|_{\tau=\frac{1}{2}}+\frac{27}{128}q_{1}\geq 0.03+0.1054=0.1354.

Therefore, \frac{\partial}{\partial\tau}\Delta\mathcal{P}(4,\tau)\geq\frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)+q_{1}\tau^{2}(3-4\tau)-(3(q_{3}-q_{2})(1-\tau)^{2}(4\tau-1)+4(q_{4}-q_{3})(1-\tau)^{3})\geq 0.1354-0.077>0.

Similarly, we have

\frac{\partial}{\partial\tau}\Delta\mathcal{P}(5,\tau)=\frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)+3q_{1}\tau^{2}+6(q_{2}-q_{1})\tau(1-\tau)+3(q_{3}-q_{2})(1-\tau)^{2}
\quad-5\sum^{4}_{j=0}(q_{j+1}-q_{j})\binom{4}{j}(1-\tau)^{j}\tau^{4-j}-5q_{0}\tau^{4}
=\frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)+q_{1}\tau^{2}(3-5\tau^{2})+(q_{2}-q_{1})\tau(1-\tau)(6-20\tau^{2})+3(q_{3}-q_{2})(1-\tau)^{2}(1-10\tau^{2})
\quad-20(q_{4}-q_{3})(1-\tau)^{3}\tau-5(q_{5}-q_{4})(1-\tau)^{4}
\geq\frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)+q_{1}\tau^{2}(3-5\tau^{2})-(q_{3}-q_{2})(1-\tau)^{2}(15\tau^{2}+10\tau+2)
\geq 0.03+\frac{1323}{4096}q_{1}-\frac{12575}{4096}(q_{3}-q_{2})\geq 0.03+0.1623-0.1505>0.

For k\geq 6, we can easily show that

\displaystyle\frac{\partial}{\partial\tau}\Delta\mathcal{P}(k,\tau)\geq\frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)+q_{1}\tau^{2}(3-k\tau^{k-3})-\sum^{k-1}_{j=2}(j+1)(q_{j+1}-q_{j})\binom{k}{j+1}(1-\tau)^{j}\tau^{k-1-j}
\geq 0.5-0.94\tau+\frac{1}{2}\tau^{2}(3-6\tau^{3})-\frac{0.147}{1-\tau}.

Since 0.5-0.94\tau+\frac{1}{2}\tau^{2}(3-6\tau^{3})-\frac{0.147}{1-\tau}>0 for \tau\in[\frac{3}{8},\frac{1}{2}], the inequality holds.

Therefore, \frac{\partial}{\partial\tau}\Delta\mathcal{P}(k,\tau)\geq 0 for \tau\in[0,\frac{1}{2}] and k\geq 3. ∎

Training Details

In Section 4, we conducted experiments on five different datasets: Credit Fraud [19], FICO [20], COMPAS [21], California Housing (CA Housing) [22], and Boston [23]. Here, we summarize their characteristics in Table II and provide brief descriptions to aid interpretation of the experimental results.

TABLE II: List of datasets and their characteristics.
Data # Train # Test # Features Task type
Credit Fraud 227,845 56,962 30 Classification
FICO 8,367 2,092 23 Classification
COMPAS 13,315 3,329 17 Classification
CA Housing 16,512 4,128 8 Regression
Boston 404 102 13 Regression
  • Credit Fraud: This dataset focuses on predicting fraudulent credit card transactions and is highly imbalanced. Due to confidentiality concerns, the features are represented as principal components obtained through PCA.

  • FICO: This dataset aims to predict the risk performance of consumers, categorizing them as either “Bad” or “Good” based on their credit.

  • COMPAS: This dataset aims to predict recidivism, determining whether an individual will reoffend or not.

  • CA Housing: This dataset aims to predict the median house value for districts in California based on data derived from the 1990 U.S. Census.

  • Boston: This dataset focuses on predicting the median value of owner-occupied homes in the Boston area, using data collected by the U.S. Census Service.

For each dataset, we conducted experiments using five different random seeds. To ensure a fair comparison with the prior work [1], we adopted 5-fold cross-validation for datasets for which a train-test split was not provided.
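As an illustrative sketch of this evaluation protocol (our own example; it assumes scikit-learn, and the synthetic arrays below are stand-ins rather than code from our pipeline):

import numpy as np
from sklearn.model_selection import KFold

# Synthetic stand-in data purely to illustrate the splitting protocol.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 8)), rng.normal(size=1000)

fold_indices = []
for seed in range(5):                          # five different random seeds
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    for train_idx, test_idx in kf.split(X):    # used when no train-test split is provided
        fold_indices.append((seed, train_idx, test_idx))
        # a model would be trained on (X[train_idx], y[train_idx]) and
        # evaluated on (X[test_idx], y[test_idx]) here
print(len(fold_indices), "train/test splits in total")  # 5 seeds x 5 folds = 25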

TABLE III: Comparison between the structures of the previous NAM [1] and the proposed one.
Data NAM in [1] AUC(\uparrow) NAM (ours) AUC(\uparrow)
Credit Fraud ExU+ReLU-1 0.980±0.00 ResBlocks+ReLU 0.990±0.00
COMPAS Linear+ReLU 0.737±0.01 ResBlocks+ReLU 0.771±0.05

In the original NAM paper [1], the authors proposed exp-centered (ExU) units, which can be formalized as follows:

\text{ExU}(x)=h(e^{w}(x-b)), (14)

where w and b are weight and bias parameters, and h(\cdot) is an activation function. ExU units were proposed to model jagged functions, enhancing the expressiveness of NAM. The authors also explored the ReLU-n activation, which caps the ReLU activation at n, and found it beneficial for specific datasets. However, we find that the convergence of these units is relatively unstable compared to fully connected networks.
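As an illustration, below is a minimal PyTorch-style sketch of an ExU hidden layer following Eq. (14); the module name, weight initialization, and the default cap value are our own assumptions for this example, not the exact implementation of [1].

import torch
import torch.nn as nn

class ExU(nn.Module):
    """Exp-centered (ExU) hidden layer: h(e^w (x - b)), cf. Eq. (14)."""
    def __init__(self, in_features: int, out_features: int, n: float = 1.0):
        super().__init__()
        # Initialization is illustrative only.
        self.weight = nn.Parameter(torch.randn(in_features, out_features) * 0.5)
        self.bias = nn.Parameter(torch.zeros(in_features))
        self.n = n  # cap used by the ReLU-n activation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_features); weights are exponentiated before the linear map
        out = (x - self.bias) @ torch.exp(self.weight)
        # ReLU-n: clip the ReLU output at n
        return torch.clamp(torch.relu(out), max=self.n)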

To address this issue, we employ ResNet blocks in our approach. Specifically, we utilize a ResNet block with group convolution layers, where each layer is followed by BatchNorm and ReLU activation. This modification enables stable training across various datasets and further improves performance significantly. Following the standard ResNet structure, our model consists of one input layer with BatchNorm, ReLU, and Dropout, three ResNet blocks, and one output layer. We find that a dimension of 32 for each layer is sufficient to achieve superior performance. In Table III, we compare the structure and performance of the NAM in [1] with our proposed structure on common datasets under the same setting as in [1]. Our proposed structure shows improved performance while requiring lower computational costs. By incorporating the group convolution proposed in [2], the basic structure of our neural additive model is illustrated in Fig. 14. Given the feature dimension d=3, the inputs are reshaped accordingly within the channel dimension. Subsequently, we employ group convolution with d groups, where each kernel is applied individually to its corresponding channel. Then, residual connections are employed for each function f_{i}.

Refer to caption
Figure 14: Illustration of the use of group convolution and its effect on reducing computational costs.
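For concreteness, here is a simplified PyTorch sketch of this grouped-convolution structure; it is a rough reimplementation under our own assumptions (1x1 group convolutions, hidden dimension 32, three blocks), and details such as the exact layer ordering may differ from the architecture used in our experiments.

import torch
import torch.nn as nn

class GroupResBlock(nn.Module):
    """Residual block built from 1x1 group convolutions; each group serves one feature."""
    def __init__(self, d: int, hidden: int = 32):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv1d(d * hidden, d * hidden, kernel_size=1, groups=d),
            nn.BatchNorm1d(d * hidden),
            nn.ReLU(),
            nn.Conv1d(d * hidden, d * hidden, kernel_size=1, groups=d),
            nn.BatchNorm1d(d * hidden),
        )
        self.act = nn.ReLU()

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.act(z + self.block(z))  # residual connection, applied per feature group

class GroupedNAM(nn.Module):
    """All d shape functions f_i computed in parallel through group convolution."""
    def __init__(self, d: int, hidden: int = 32, dropout: float = 0.1):
        super().__init__()
        self.inp = nn.Sequential(
            nn.Conv1d(d, d * hidden, kernel_size=1, groups=d),
            nn.BatchNorm1d(d * hidden),
            nn.ReLU(),
            nn.Dropout(dropout),
        )
        self.blocks = nn.Sequential(*[GroupResBlock(d, hidden) for _ in range(3)])
        self.out = nn.Conv1d(d * hidden, d, kernel_size=1, groups=d)

    def forward(self, x: torch.Tensor):
        # x: (batch, d) -> (batch, d, 1): features moved into the channel dimension
        z = self.blocks(self.inp(x.unsqueeze(-1)))
        f = self.out(z).squeeze(-1)            # (batch, d): one output per shape function f_i
        return f.sum(dim=1), f                 # additive prediction and per-feature contributions

Because every kernel is restricted to its own group, the d feature networks are evaluated in a single forward pass instead of d separate ones, which is the source of the reduced computational cost illustrated in Fig. 14.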
TABLE IV: Selected hyper-parameters by grid search.
Params COMPAS Credit FICO Boston CA Housing
Learning rate \eta 0.01 0.01 0.01 0.001 0.01
Dropout rate \psi 0.1 0.3 0.0 0.0 0.0
Batch size B 1,024 1,024 2,048 128 2,048
Feature dropout \tau 0.2 0.1 0.4 0.1 0.1

We performed grid searches to determine the best settings for achieving high performance. First, for NAM, we explored the learning rate \eta\in\{0.1,0.01,0.001,0.0001\}, the dropout rate in the input layer \psi\in\{0.0,0.1,0.2,0.3,0.4,0.5\}, and the batch size B\in\{128,256,512,1024,2048\}. We then explored the additional hyper-parameters for BayesNAM while keeping the other variables the same as for NAM. Following [11], we searched for the initial standard deviation vector s_{0} and found that s_{0}=10^{-4} provided the most stable performance. We explored the feature dropout probability \tau\in\{0.1,0.2,0.3,0.4,0.5\}. In all experiments, we used SGD with cosine learning rate decay, a momentum of 0.9, and a weight decay of 5\times 10^{-4} over 100 epochs. The selected hyper-parameter settings are provided in Table IV.
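A minimal sketch of this training setup (SGD with momentum 0.9, weight decay 5e-4, and cosine learning-rate decay over 100 epochs) is given below; the synthetic data and the GroupedNAM sketch from above are placeholders standing in for the actual pipeline.

import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in data; d = 8 as in CA Housing, batch size chosen for the example only.
data = TensorDataset(torch.randn(1024, 8), torch.randn(1024))
train_loader = DataLoader(data, batch_size=128, shuffle=True)
model = GroupedNAM(d=8)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    for x, y in train_loader:
        optimizer.zero_grad()
        pred, _ = model(x)                               # additive prediction, per-feature terms
        loss = torch.nn.functional.mse_loss(pred, y)     # regression; use BCE for classification
        loss.backward()
        optimizer.step()
    scheduler.step()                                      # cosine decay, stepped once per epoch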

Ablation Study

Refer to caption
Figure 15: Consistency of BayesNAM for five different random seeds. Mapping functions f_{1}(x_{1}) (top row) and f_{2}(x_{2}) (bottom row) obtained from BayesNAM with \tau=0.1 for five distinct random seeds. The same setting is used as in Fig. 4. The average function is plotted in red, with the min-max range indicated in orange. Each column corresponds to a different random seed. In contrast to NAM in Fig. 4, BayesNAM consistently exhibits similar mapping function distributions.
TABLE V: Performance comparison between models on 5 different random seeds. Higher AUC is better (\uparrow) and lower RMSE is better (\downarrow).
Model COMPAS (AUC\uparrow) Credit (AUC\uparrow) FICO (AUC\uparrow) Boston (RMSE\downarrow) CA Housing (RMSE\downarrow)
w/o FD 0.782±0.006 0.990±0.005 0.805±0.003 3.618±0.007 0.554±0.010
w/ FD 0.784±0.009 0.991±0.003 0.804±0.001 3.620±0.011 0.556±0.007

Fig. 15 and Table V present the ablation study on feature dropout. Fig. 15 shows the consistent explanations produced by BayesNAM under the setting of Fig. 4. Compared to the results of NAM in Fig. 4, BayesNAM explores diverse explanations while yielding consistent results across different random seeds. In Table V, we compare the performance of BayesNAM without feature dropout (denoted as 'naive Bayesian' in Fig. 8) and with feature dropout. While feature dropout encourages greater exploration of diverse explanations, it does not always lead to improved performance across all datasets. Specifically, for regression tasks, feature dropout often results in a slight performance degradation. We leave it as future work to develop a framework that enhances both performance and explainability across all tasks and datasets.