
BayesNAM: Leveraging Inconsistency for Reliable Explanations

Hoki Kim, Jinseong Park, Yujin Choi, Seungyun Lee, and Jaewook Lee. Hoki Kim is with the Department of Industrial Security, Chung-Ang University, South Korea. E-mail: [email protected]. Jinseong Park, Yujin Choi, Seungyun Lee, and Jaewook Lee are with the Department of Industrial Engineering, Seoul National University, South Korea. The corresponding author is Jaewook Lee. E-mail: [email protected]
Abstract

The neural additive model (NAM) is a recently proposed explainable artificial intelligence (XAI) method that utilizes neural network-based architectures. Given the advantages of neural networks, NAMs provide intuitive explanations for their predictions with high model performance. In this paper, we analyze a critical yet overlooked phenomenon: NAMs often produce inconsistent explanations, even when using the same architecture and dataset. Traditionally, such inconsistencies have been viewed as issues to be resolved. However, we argue instead that these inconsistencies can provide valuable explanations within the given data model. Through a simple theoretical framework, we demonstrate that these inconsistencies are not mere artifacts but emerge naturally in datasets with multiple important features. To effectively leverage this information, we introduce a novel framework, the Bayesian Neural Additive Model (BayesNAM), which integrates Bayesian neural networks and feature dropout, with a theoretical proof demonstrating that feature dropout effectively captures model inconsistencies. Our experiments demonstrate that BayesNAM effectively reveals potential problems, such as insufficient data or structural limitations of the model, providing more reliable explanations and potential remedies.

1 Introduction

Explainable artificial intelligence (XAI) has become a significant field of research as machine learning models are increasingly applied in real-world systems including finance and healthcare. To provide insight into the underlying decision-making process behind the predictions made by these models, numerous researchers have developed various techniques to assist human decision-makers.

Recently, Agarwal et al.[1] proposed a neural additive model (NAM) that utilizes neural networks to achieve both high performance and explainability. NAM is a type of generalized additive model (GAM) that involves the linear or non-linear transformation of each input and yields the final prediction through an additive operation. Previous studies have demonstrated that NAM not only learns complex relationships between inputs and outputs but also provides a high level of explainability based on neural network architectures and training techniques.

Refer to caption
Figure 1: Inconsistency of NAM, where two independent NAMs trained with the same dataset and architecture output different explanations solely due to different random seeds.

In this paper, we analyze a critical yet overlooked phenomenon: the inconsistency phenomenon of NAM. Fig. 1 illustrates this issue, where two independent NAMs, trained on the same dataset and architecture, produce different explanations due solely to variations in random seeds. Such inconsistency has traditionally been viewed as a problem to be solved [2].

However, we argue that these inconsistencies are not merely obstacles but can offer valuable insights to uncover external explanations within the data model. Through a simple theoretical model, we show that NAMs naturally exhibit the inconsistency phenomenon even when trained on typical datasets that contain multiple important features. Building on this insight, we propose the Bayesian Neural Additive Model (BayesNAM), a novel framework that combines Bayesian neural networks with feature dropout to harness these inconsistencies for more reliable explainability. We also provide a theoretical proof that feature dropout effectively leverages inconsistency. Our real-world experiments demonstrate that BayesNAM not only provides more reliable and interpretable explanations but also highlights potential issues in the data model, such as insufficient data and structural limitations within the model.

The main contributions can be summarized as follows:

  • We investigate the inconsistency phenomenon of NAMs and analyze this phenomenon through a simple theoretical model.

  • We propose a new framework, BayesNAM, which utilizes Bayesian neural networks and feature dropout. We also establish a theoretical analysis of the efficacy of feature dropout in leveraging inconsistency information.

  • We empirically demonstrate that BayesNAM is particularly effective in identifying data insufficiencies or structural limitations, offering more reliable explanations and insights for decision-making.

2 Related Work

2.1 Neural Additive Model

Because numerous machine learning and deep learning models are black boxes, a line of work has attempted to explain the decisions made by such models. These methods are called post-hoc methods, since they are applied after the model has been trained. While post-hoc methods offer some interpretability, recent work [3, 4] has argued that they can produce unreliable explanations, which ultimately undermines their usefulness for explainability.

In contrast to post-hoc methods, intrinsic methods aim to develop an inherently explainable model without additional techniques [4]. Agarwal et al.[1] proposed a neural additive model (NAM), which combines a generalized additive model [5] and neural networks. To be specific, given d features x_{1},x_{2},\cdots,x_{d} and a target y, NAM constructs d mapping functions as follows:

y=f_{1}(x_{1})+f_{2}(x_{2})+\cdots+f_{d}(x_{d})+\beta, (1)

where \beta is a bias term and each mapping function f_{i} is parameterized by a neural network. In Fig. 2, we illustrate an example of NAM. By utilizing neural networks, NAMs capture non-linear relationships and achieve high performance while maintaining clarity through a straightforward plot.
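To make Eq. (1) concrete, the following is a minimal sketch of a NAM in PyTorch. It is not the authors' reference implementation; the name TinyNAM and the layer sizes are our own illustrative choices. Each feature gets its own small network f_{i}, and the outputs are summed with a bias.

```python
import torch
import torch.nn as nn

class TinyNAM(nn.Module):
    """Minimal NAM sketch: one small MLP per feature, summed with a bias term."""
    def __init__(self, num_features: int, hidden: int = 32):
        super().__init__()
        # One independent mapping function f_i per input feature.
        self.feature_nets = nn.ModuleList([
            nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))
            for _ in range(num_features)
        ])
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x):  # x: (batch, num_features)
        # Keep the per-feature outputs f_i(x_i) separate so they can be plotted later.
        contributions = [net(x[:, i:i + 1]) for i, net in enumerate(self.feature_nets)]
        return torch.cat(contributions, dim=1).sum(dim=1, keepdim=True) + self.bias
```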

Despite their strengths, NAMs frequently exhibit inconsistent explanations even when trained on identical datasets with the same architectures, as illustrated in Fig. 1. These inconsistencies can also be observed in the original work [1], where the mapping functions produced by different NAMs within an ensemble show substantial variation, despite being trained under the same experimental conditions.

Although this inconsistency across NAMs can harm their explainability, given that they are intended to be XAI models, the phenomenon has received limited attention in the literature. To the best of our knowledge, only one study has explicitly addressed this issue. Radenovic et al. [2] introduced the neural basis model (NBM), which uses shared basis functions across features rather than assigning independent mapping functions to each feature. They argued that NBM reduces divergence between models, offering more consistent shape functions compared to NAM and thus mitigating the inconsistency problem.

In contrast, this paper presents a novel view on the inconsistency phenomenon. Rather than treating it as a problem to be solved, we argue that these inconsistencies provide valuable information about the data model.

Refer to caption
Figure 2: Example of a mapping function f_{i} of NAM. Blue regions correspond to regions with high data density. NAM enables us to capture non-linear relationships between inputs and outputs and further provides a clear understanding.

2.2 Bayesian Neural Network

Although the use of a single model is a fundamental approach, numerous studies have found that a point estimate is often vulnerable to overfitting and high variance due to its limited representation [6]. To overcome this limitation, Bayesian neural networks estimate a distribution over models instead of calculating a fixed model. Given the data ({\bm{x}},y)\sim\mathcal{D} and the prior p({\bm{w}}), we aim to approximate the posterior p({\bm{w}}|{\bm{x}},y). Specifically, rather than using a fixed weight vector {\bm{w}}_{i}, we aim to find a distribution of weight vectors \mathcal{N}(\bm{\mu}_{i},\texttt{diag}({\bm{s}}_{i})^{2}) and learn the mean vector \bm{\mu}_{i} and the standard deviation vector {\bm{s}}_{i}.

Since the distribution p({\bm{x}},y) is generally intractable, several methods have been developed to approximate the posterior, including Markov Chain Monte Carlo (MCMC) [7] and variational inference approaches [8, 9]. While MCMC methods can provide more accurate estimates, their high computational cost [10] has led to the use of variational inference methods across diverse domains [11, 12].

During optimization in variational inference methods, a weight vector {\bm{w}}_{i}=\bm{\mu}_{i}+{\bm{s}}_{i}\odot\bm{\epsilon} is sampled for each forward step, where \bm{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I}). The prior distribution can simply be chosen as the isotropic Gaussian prior \mathcal{N}(\mathbf{0},s_{0}^{2}\mathbf{I}), where s_{0} is a predefined standard deviation, so that the KL-divergence can be calculated explicitly [11].
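As a concrete illustration of this sampling scheme, a mean-field Bayesian linear layer with the reparameterization trick and an isotropic Gaussian prior \mathcal{N}(\mathbf{0},s_{0}^{2}\mathbf{I}) might look as follows. This is only a sketch: the name BayesLinear, the softplus parameterization of the standard deviation, and the initialization constants are our own choices, not details from the paper.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesLinear(nn.Module):
    """Mean-field Bayesian linear layer: w = mu + softplus(rho) * eps (sketch)."""
    def __init__(self, in_dim: int, out_dim: int, s0: float = 0.1):
        super().__init__()
        self.mu = nn.Parameter(torch.randn(out_dim, in_dim) * 0.01)   # mean of the weights
        self.rho = nn.Parameter(torch.full((out_dim, in_dim), -5.0))  # softplus(rho) = std
        self.bias = nn.Parameter(torch.zeros(out_dim))
        self.s0 = s0                                                  # prior std (assumed value)

    def forward(self, x):
        std = F.softplus(self.rho)
        eps = torch.randn_like(std)
        w = self.mu + std * eps            # a new weight sample on every forward pass
        return x @ w.t() + self.bias

    def kl(self):
        # KL( N(mu, std^2) || N(0, s0^2) ), summed over all weight entries.
        std = F.softplus(self.rho)
        return (math.log(self.s0) - torch.log(std)
                + (std ** 2 + self.mu ** 2) / (2 * self.s0 ** 2) - 0.5).sum()
```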

A promising direction in the field of Bayesian neural networks is their integration with other domains to enhance model explainability. Bayesian neural networks provide weight distributions that enable the identification of high-density regions or confidence intervals, which can be used for uncertainty estimation. Researchers and practitioners in several domains that require reliable explanations, such as medicine [13] and finance [14], have also explored the utilization of Bayesian models to measure the confidence of prediction for trustworthy decision-making.

3 Methodology

In Section 3.1, we first investigate the inconsistency phenomenon of NAMs with a simple theoretical model. Our empirical findings show that this inconsistency can easily occur, even when datasets contain more than one important feature. Subsequently, in Section 3.2, we propose a new framework called BayesNAM, which combines Bayesian neural networks with feature dropout, to leverage the inconsistency information as a valuable indicator. This framework is supported by a theoretical analysis demonstrating the effectiveness of feature dropout in capturing diverse explanations. Finally, we provide a detailed explanation of the proposed framework.

3.1 Rethinking Inconsistency of Neural Additive Model

Refer to caption
(a) Random Seed 1
Refer to caption
(b) Random Seed 2
Figure 3: (Case-I. \lambda=0) Mapping functions of two NAMs trained with different random seeds show similar shapes. Blue regions correspond to regions with high data density.

We begin the analysis by identifying and investigating the inconsistent explanations of NAM. To this end, we construct a simple theoretical model. Here, we consider a binary classification task where the target y can take a value in \{-1,1\}. Inspired by [15], we construct the input-target pairs ({\bm{x}},y)=(x_{1},x_{2},\cdots,x_{d},y) from a distribution \mathcal{D} as follows:

x_{1}=\begin{cases}+y&\text{ with probability }p\\ -y&\text{ with probability }1-p\end{cases}, (2)
x_{2},\cdots,x_{d}\stackrel{i.i.d.}{\sim}\mathcal{N}(\lambda y,\sigma^{2}), (3)

where x_{2},\cdots,x_{d} are independently and identically sampled from a normal distribution \mathcal{N} with mean \lambda y and variance \sigma^{2} for positive \lambda and \sigma. It is important to note that the features x_{2},\cdots,x_{d} are uncorrelated, as they are drawn independently and identically distributed. By adjusting the values of p and \lambda, we can control the significance of x_{1} and x_{2},\cdots,x_{d} in predicting y, as stated in the following lemma:

Lemma 1.

(Derived from [15]) Consider a linear classifier h,

h(x_{2},\cdots,x_{d})=\operatorname{sign}(w_{2}x_{2}+w_{3}x_{3}+\cdots+w_{d}x_{d}). (4)

Then, even a natural linear classifier h(\cdot) with w_{i}=\frac{1}{d-1} can easily achieve a higher classification accuracy than p, which is the natural accuracy of the model that only uses x_{1}, if the following statement is satisfied:

\Phi_{X\sim\mathcal{N}\left(0,\sigma^{2}/(d-1)\right)}(\lambda)>p, (5)

where \Phi_{X}(\cdot) is the cumulative distribution function of X. (Detailed proof is presented in the Appendix.)

Let p be a sufficiently large positive number. When \lambda=0, only x_{1} is useful for predicting y and the other features x_{2},\cdots,x_{d} are not correlated with y. As \lambda increases, x_{2},\cdots,x_{d} become correlated with y. By Lemma 1, if (5) is satisfied, a model that only considers x_{2},\cdots,x_{d} can achieve a classification accuracy higher than p. In summary, if \lambda=0, x_{1} is the only feature with high importance in predicting y, whereas for a large \lambda>0, x_{2},\cdots,x_{d} alone are enough to achieve significant performance in predicting y.

Now, we consider the following two cases with d=3:

  • Case-I. Single important feature exists (\lambda=0). In this case, only x_{1} is effective in predicting y, while x_{2} and x_{3} are not useful.

  • Case-II. Multiple important features exist (\lambda=3). In this case, all the features x_{1}, x_{2}, and x_{3} are highly correlated with y. A model that uses x_{2} and x_{3} can perform better than a model that solely depends on x_{1}, since \Phi_{Z}(\lambda=3)=0.999.

Given this theoretical model, we generated two sets of data containing 50,000 training examples and 10,000 test examples and trained two different NAMs on each dataset with different random seeds. For simplicity, we fixed the feature dimension to d=3, the probability to p=0.95, and \sigma^{2}=d-1, so that (5) becomes \Phi_{Z}(\lambda)>p, where Z is drawn from the standard normal distribution \mathcal{N}(0,1). For each mapping function f_{i} of NAM, we constructed a simple neural network with two linear layers containing 10 hidden neurons and used ReLU as the activation function. The models were trained by SGD with a learning rate of 0.01. One epoch was sufficient to achieve high training accuracy.
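The data model and the Case-II setting used here (d=3, p=0.95, \sigma^{2}=d-1, \lambda=3) can be reproduced with the short sketch below; sample_dataset and the random seed are our own illustrative choices. The last line checks the Lemma 1 condition \Phi_{Z}(\lambda)>p.

```python
import numpy as np
from scipy.stats import norm

def sample_dataset(n, d=3, p=0.95, lam=3.0, rng=np.random.default_rng(0)):
    y = rng.choice([-1, 1], size=n)
    x1 = np.where(rng.random(n) < p, y, -y)                                # Eq. (2)
    rest = rng.normal(lam * y[:, None], np.sqrt(d - 1), size=(n, d - 1))   # Eq. (3) with sigma^2 = d - 1
    return np.column_stack([x1, rest]).astype(np.float32), y

X_train, y_train = sample_dataset(50_000)
X_test, y_test = sample_dataset(10_000)

# With sigma^2 = d - 1, condition (5) reduces to Phi_Z(lambda) > p: about 0.9987 > 0.95 in Case-II.
print(norm.cdf(3.0))
```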

Refer to caption
(a) Random Seed 1
Refer to caption
(b) Random Seed 2
Figure 4: (Case-II. \lambda=3) Mapping functions of two NAMs trained with different random seeds are extremely different. This yields inconsistent feature contributions in Fig. 5.

Fig. 3 (Case-I) and Fig. 4 (Case-II) illustrate the mapping functions of the trained NAMs for each case. Specifically, for Case-I, we observed that the two NAM models trained with different random seeds exhibited similar test accuracy and explanations (94.9% and 95.0%, respectively). As shown in Fig. 3, the mapping functions for each x_{i} have similar shapes, with f_{1} being the only increasing one and the others being almost constant. Therefore, in this case, NAM successfully captures the true importance of features and provides reliable explanations.

Refer to caption
Figure 5: Corresponding feature contributions of a sample {\bm{x}}=[-1,3,3] with y=1 for the NAMs in Fig. 4. This inconsistent explanation corresponds to Fig. 1.

In contrast, for Case-II (Fig. 4), the mapping functions f_{i} of the trained NAMs have extremely different shapes. Although both NAMs achieve a test accuracy exceeding 99.99%, f_{3} is much steeper than f_{2} for the first random seed, whereas the relationship is reversed for the second random seed.

Such inconsistency results in inconsistent feature contributions. Figure 5 shows the feature contributions of a sample {\bm{x}}=[x_{1},x_{2},x_{3}]=[-1,3,3]. Following [1], we calculate the feature contribution by subtracting the average value of a mapping function across the entire training dataset. Although we use the same example, the feature contribution calculated by the NAM with random seed 1 implies that x_{2} is more important than x_{3}, while the NAM with random seed 2 outputs the opposite result, namely that x_{3} is more significant than x_{2}. In summary, NAMs can produce inconsistent explanations when multiple important features are present, a common condition in real-world datasets. Indeed, as discussed later in Figures 10 and 12, this inconsistency is readily observable in widely used datasets.
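Following the description above, the feature contributions can be computed with a few lines. The sketch below builds on the TinyNAM sketch from Section 2.1 (feature_nets is an assumed attribute) rather than the authors' code.

```python
import torch

@torch.no_grad()
def feature_contributions(model, x, X_train):
    # Raw outputs f_i(x_i) for the query sample(s).
    raw = torch.cat([net(x[:, i:i + 1])
                     for i, net in enumerate(model.feature_nets)], dim=1)
    # Average of each f_i over the training data, used as the per-feature baseline.
    baseline = torch.cat([net(X_train[:, i:i + 1]).mean(0, keepdim=True)
                          for i, net in enumerate(model.feature_nets)], dim=1)
    return raw - baseline  # positive values push the prediction up, negative pull it down
```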

At first glance, the observed inconsistency may appear problematic; however, neither explanation is inherently incorrect. Specifically, given the theoretical model, both x_{2} and x_{3} are important features under Case-II, as using only one of them can achieve high performance. Therefore, the distinct mapping functions reflect the different perspectives of the trained models, and each explanation is a valid interpretation of the data model, in which relying solely on either x_{2} or x_{3} is sufficient for high performance.

Refer to caption
(a) Learning Rate
Refer to caption
(b) Batch size
Figure 6: Mapping functions of NAMs trained with different learning rates and batch sizes. Given that all models achieve more than 99% test accuracy, this inconsistency tells us that high-performing models can have diverse perspectives on the given data model.

In Fig. 6, we vary the learning rates (\eta) and batch sizes (B) during training within the same theoretical model. We linearly increase the learning rate \eta from 0.005 to 0.01, and the batch size B from 5 to 50. In total, we trained 50 models for each experiment, where each NAM exhibits inconsistent mapping functions. However, it is important to note that all models achieved over 99% test accuracy on the dataset. This indicates that the diverse explanations are not incorrect; rather, they offer valuable external insights into the existence of diverse perspectives among high-performing models, complementing the internal explanations of individual models. Therefore, we posit that inconsistency can be a useful indicator of potential external explanatory factors. Based on these findings, we propose a new framework to leverage inconsistency and provide additional explanations within the data model.

3.2 Bayesian Neural Additive Model

In the previous subsection, we explored the inconsistency phenomenon in NAMs and suggested that rather than being a flaw, this inconsistency can serve as a valuable source of additional information, shedding light on underlying external explanations in the data model. In this section, we introduce BayesNAM, a novel framework designed to leverage this inconsistency. BayesNAM incorporates two key approaches: (1) a modeling approach based on Bayesian structure and (2) an optimization approach utilizing feature dropout. Each of these approaches will be detailed in the following paragraphs.

1) Modeling Approach: Bayesian Structure for Inconsistency Exploration. A naive approach to exploring possible inconsistencies in NAMs is to train multiple independent models. Indeed, Agarwal et al.[1] trained several NAMs and visualized the learned shape functions f_{k}(x_{k}). However, this requires training n independent models, leading to a computational burden proportional to n and making it impractical for large-scale applications.

To address this limitation, we propose using Bayesian neural networks [8, 9], which inherently allow for efficient exploration of model uncertainty without the need to train multiple independent models. Under variational inference and Bayes by Backprop [8, 9], a Bayesian neural network trains the mean parameter \bm{\mu}_{i} and the standard deviation parameter {\bm{s}}_{i} instead of a weight vector {\bm{w}}_{i}. Then, during the training and inference phases, it samples a weight {\bm{w}}_{i}=\bm{\mu}_{i}+{\bm{s}}_{i}\odot\bm{\epsilon} for a random vector \bm{\epsilon} drawn from a predefined distribution. Following prior works [11, 12], we adopt the reparameterization trick [16] for efficient training. This results in the following training objective:

\underset{\bm{\mu}_{i},{\bm{s}}_{i}}{\texttt{min}}\mathcal{L}\left(\sum_{i=1}^{d}f_{i}(x_{i}|\bm{\mu}_{i},{\bm{s}}_{i})+\beta,y\right)+\sum_{i=1}^{d}\texttt{KL}\left(q_{\bm{\mu}_{i},{\bm{s}}_{i}}({\bm{w}}_{i})\lVert p({\bm{w}}_{i})\right), (6)

where \mathcal{L}(\cdot) is a given loss function and \texttt{KL}(\cdot\lVert\cdot) is the KL-divergence. For further details, we refer the readers to [9].
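Assuming each feature network is built from layers that expose a kl() method, as in the BayesLinear sketch from Section 2.2, the objective in (6) can be sketched as below; bayesnam_loss and kl_weight are our own illustrative names (Eq. (6) itself uses no extra weighting factor).

```python
def bayesnam_loss(model, x, y, criterion, kl_weight=1.0):
    # Task loss on the additive prediction sum_i f_i(x_i | mu_i, s_i) + beta ...
    pred = model(x)
    # ... plus the summed KL terms of all Bayesian layers (kl() is an assumed method).
    kl = sum(layer.kl() for layer in model.modules() if hasattr(layer, "kl"))
    return criterion(pred, y) + kl_weight * kl
```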

Refer to caption
Figure 7: Structural framework of BayesNAM. BayesNAM trains the distribution of parameters through updating \bm{\mu} and {\bm{s}}. During the inference phase, based on the weights {\bm{w}}^{(j)} drawn from the trained distribution, it can provide rich explanations for its prediction by leveraging inconsistency, such as the confidence interval (upper) of the feature contribution (lower).

In Fig. 7, we present the structural framework that integrates a Bayesian neural network with a neural additive model. For each sampled weight {\bm{w}}^{(j)}, we compute the corresponding predictions f_{i}(x_{i}|{\bm{w}}^{(j)}_{i}). This sampling approach enables the model to efficiently explore a diverse range of model spaces without needing to train multiple models. By incorporating a Bayesian neural network, the model provides high-density regions of the mapping functions and confidence intervals for feature contributions, offering richer interpretability.
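A sketch of this Monte-Carlo inference step is given below, assuming the feature networks re-sample their weights on every forward pass (as in the BayesLinear sketch); the mean and a two-sigma band over the samples give the kind of intervals shown in Fig. 7.

```python
import torch

@torch.no_grad()
def mc_feature_outputs(model, x, num_samples=30):
    draws = []
    for _ in range(num_samples):  # each pass samples w^(j) = mu + s * eps internally
        draws.append(torch.cat([net(x[:, i:i + 1])
                                for i, net in enumerate(model.feature_nets)], dim=1))
    draws = torch.stack(draws)                    # (num_samples, batch, d)
    mean, std = draws.mean(0), draws.std(0)
    return mean, mean - 2 * std, mean + 2 * std   # mean and two-sigma band per feature
```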

Refer to caption
Refer to caption
(a) Naive Bayesian
Refer to caption
Refer to caption
(b) w/ Feature Dropout
Figure 8: Effectiveness of feature dropout. The same setting is used as in Fig. 4. Each plot shows the mapping functions of x_{2} (left) and x_{3} (right). Both models use the structural framework depicted in Fig. 7. Without feature dropout (Fig. 8a), the model tends to focus on a single feature, similar to training a single NAM. In contrast, feature dropout (Fig. 8b) enables the model to explore diverse explanations.

2) Optimization Approach: Feature Dropout for Encouraging Diverse Explanations. Although Bayesian neural networks provide an efficient mechanism for exploration, they do not inherently guarantee the exploration of diverse explanations. Indeed, as shown in Fig. 8a, a naive Bayesian neural network alone tends to focus on a single feature, similar to training a single NAM, rather than adequately exploring diverse explanations. As previously noted in related works [11, 12], we observe that increasing the standard deviation hyper-parameter s_{0} of the Bayesian neural network tends to degrade model performance and fails to address this issue effectively. Therefore, given the presence of diverse valid explanations shown in Figures 4 and 5, a method is needed to encourage the exploration of diverse explanations.

As a potential solution, we propose the use of feature dropout during optimization. Feature dropout, initially introduced by Agarwal et al.[1], extends traditional dropout by selectively omitting individual feature networks during training. The hyperparameter \tau determines the probability of dropping each feature. While the original work focused on improving model performance with feature dropout, here we provide a theoretical analysis showing that feature dropout implicitly encourages diverse explanations, preventing over-reliance on any single feature.
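A minimal sketch of feature dropout in the additive forward pass is shown below; this is our own illustrative implementation (whether the mask is drawn per example or per mini-batch is a design choice the text does not fix), not the authors' code.

```python
import torch

def forward_with_feature_dropout(model, x, tau=0.2, training=True):
    # Per-feature outputs f_i(x_i), kept separate so individual networks can be dropped.
    outs = torch.cat([net(x[:, i:i + 1])
                      for i, net in enumerate(model.feature_nets)], dim=1)
    if training and tau > 0:
        # Zero each feature network's output independently with probability tau.
        keep = (torch.rand_like(outs) > tau).float()
        outs = outs * keep
    return outs.sum(dim=1, keepdim=True) + model.bias
```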

Given the theoretical model in Section 3.1, we establish the following theorem.

Theorem 1.

(Feature Dropout Implicitly Encourages Exploring Diverse Explanations) Given the dataset ({\bm{x}},y) in (2), the linear classifier h in (4), and the feature dropout rate \tau, without loss of generality, the maximal training accuracy of h that only uses k features becomes

\mathcal{P}(k,\tau)=\mathbb{P}_{x_{i}\sim\mathcal{N}(\lambda y,\sigma^{2}),u_{i}\sim\mathcal{B}(1-\tau)}\left[\frac{y}{k}\sum_{i=2}^{k+1}u_{i}x_{i}>0\right]. (7)

Then, for k\geq 3 and \tau\in[0,\frac{1}{2}], the gap \Delta\mathcal{P}:=\mathcal{P}(k,\tau)-\mathcal{P}(1,\tau) is always positive and increases as \tau increases. Thus, the model leverages multiple features to achieve high performance, implicitly encouraging the exploration of diverse explanations.

Sketch of proof.

Let q_{j}=\Phi_{Z}(\lambda\sqrt{j}/\sigma). Then, \mathcal{P}(k,\tau) can be formalized as follows:

\mathcal{P}(k,\tau)=\sum^{k}_{j=1}q_{j}\binom{k}{j}(1-\tau)^{j}\tau^{k-j}.

To show that \Delta\mathcal{P}(k,\tau)>0 and \frac{\partial}{\partial\tau}\Delta\mathcal{P}(k,\tau)>0 for k\geq 3 and \tau\in[0,\frac{1}{2}], we use mathematical induction.

Applying Pascal’s identity and strong induction, we find

\frac{\partial}{\partial\tau}\Delta\mathcal{P}(k+1,\tau)=\frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)+q_{1}\tau^{2}(3-(k+1)\tau^{k-2})
+(q_{2}-q_{1})\tau(1-\tau)(6-(k+1)k\tau^{k-2})+3(q_{3}-q_{2})(1-\tau)^{2}
-\sum^{k}_{j=2}(j+1)(q_{j+1}-q_{j})\binom{k+1}{j+1}(1-\tau)^{j}\tau^{k-j}

We now consider two cases for \tau: (1) \tau\in\left[0,\frac{3}{8}\right] and (2) \tau\in\left(\frac{3}{8},\frac{1}{2}\right]. In each case, we prove that \frac{\partial}{\partial\tau}\Delta\mathcal{P}(k+1,\tau)>0 by finding the value of \lambda/\sigma that minimizes each term. With these lower bounds, we can conclude that the overall expression is positive. (Detailed proof is presented in the Appendix.)

Refer to caption
(a) \lambda=3
Refer to caption
(b) \lambda=0.01
Figure 9: Empirical verification of Theorem 1. As \tau increases in the range [0,\frac{1}{2}], \Delta\mathcal{P} increases as well. Moreover, as k increases, the acceptable range of \tau in Theorem 1 expands.

Fig. 9 empirically verifies Theorem 1. We plot \Delta\mathcal{P}(k,\tau) with varying k. Other settings are the same as Case-II in Section 3.1. The increasing trend of \Delta\mathcal{P}(k,\tau) for \tau\in[0,\frac{1}{2}] aligns with the implications of Theorem 1. Furthermore, the acceptable range of \tau in Theorem 1 expands as k increases. When k=100, \Delta\mathcal{P}(k,\tau) is increasing for \tau\in[0,0.9]. Even for k=2, we observe that \Delta\mathcal{P}(k,\tau) increases until \tau=0.4. Since Theorem 1 holds regardless of the value of \lambda, we observe similar results for a very small value of \lambda=0.01.
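Theorem 1 can also be checked numerically with a few lines. The sketch below evaluates \mathcal{P}(k,\tau) from the proof for the Case-II setting (\lambda=3, \sigma^{2}=2) and prints \Delta\mathcal{P}(3,\tau), which grows with \tau on [0,\frac{1}{2}] as in Fig. 9.

```python
import numpy as np
from math import comb
from scipy.stats import norm

def P(k, tau, lam=3.0, sigma=np.sqrt(2.0)):
    # P(k, tau) = sum_j q_j * C(k, j) * (1 - tau)^j * tau^(k - j), q_j = Phi(lam * sqrt(j) / sigma)
    return sum(norm.cdf(lam * np.sqrt(j) / sigma) * comb(k, j)
               * (1 - tau) ** j * tau ** (k - j) for j in range(1, k + 1))

for tau in (0.0, 0.25, 0.5):
    print(tau, P(3, tau) - P(1, tau))  # Delta P increases with tau for k >= 3
```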

In summary, we theoretically and empirically verify that feature dropout encourages the model to explore diverse explanations by using multiple features in the dataset. As shown in Fig. 8b, incorporating feature dropout enables the model to explore explanations across a range of features. Therefore, we introduce the framework that combines Bayesian neural networks with feature dropout as Bayesian Neural Additive Model (BayesNAM).

4 Experiments

In this section, we present empirical findings comparing our proposed BayesNAM against traditional models, such as Logistic/Linear Regression, Classification and Regression Trees (CART), and Gradient Boosted Trees (XGBoost) [17], as well as recent explainable models, including the Explainable Boosting Machine (EBM) [18], NAM, and NAM with an ensemble method (NAM+Ens). For Logistic/Linear Regression, CART, XGBoost, and EBM, we conducted a grid search for hyperparameter tuning, following the settings outlined in [1]. We found that using ResNet blocks, comprising two group convolution layers with BatchNorm and ReLU activation, yields better performance for NAM and BayesNAM compared to the ExU units or ReLU-n suggested in [1]. For NAM+Ens, we trained five independent NAMs, and both NAM+Ens and BayesNAM utilized soft voting for model aggregation during evaluation. Detailed settings are provided in the Appendix.
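The exact layout of these blocks is not specified here, so the sketch below is only one possible reading of "two group convolution layers with BatchNorm and ReLU activation": grouped 1x1 convolutions with groups equal to the number of features keep the features from mixing, and a skip connection completes the block. GroupedResBlock and the width parameter are our own assumptions.

```python
import torch
import torch.nn as nn

class GroupedResBlock(nn.Module):
    """One possible ResNet-style block for NAM feature networks (assumed layout)."""
    def __init__(self, num_features: int, width: int = 16):
        super().__init__()
        ch = num_features * width  # `width` channels per feature, never mixed across features
        self.body = nn.Sequential(
            nn.Conv1d(ch, ch, kernel_size=1, groups=num_features),
            nn.BatchNorm1d(ch), nn.ReLU(),
            nn.Conv1d(ch, ch, kernel_size=1, groups=num_features),
            nn.BatchNorm1d(ch),
        )
        self.act = nn.ReLU()

    def forward(self, h):  # h: (batch, num_features * width, 1)
        return self.act(h + self.body(h))
```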

We evaluated all models on five different datasets: Credit Fraud [19], FICO [20], and COMPAS [21] for classification tasks, and California Housing (CA Housing) [22] and Boston [23] for regression tasks. As shown in Table I, BayesNAM demonstrates comparable performance to other benchmarks across datasets, with particularly strong results in classification tasks such as COMPAS, Credit Fraud, and FICO. For regression tasks, BayesNAM tends to be less accurate, which we discuss further in the Appendix.

TABLE I: Performance comparison between models on 5 different random seeds. Higher AUC is better (\uparrow) and lower RMSE is better (\downarrow).
Model | COMPAS (AUC↑) | Credit (AUC↑) | FICO (AUC↑) | Boston (RMSE↓) | CA Housing (RMSE↓)
Log./Lin. Reg. | 0.699±0.005 | 0.977±0.004 | 0.706±0.005 | 5.517±0.009 | 0.731±0.010
CART | 0.776±0.005 | 0.956±0.005 | 0.784±0.002 | 4.133±0.004 | 0.712±0.007
XGBoost | 0.743±0.012 | 0.980±0.005 | 0.795±0.001 | 3.155±0.009 | 0.531±0.011
EBM | 0.764±0.009 | 0.978±0.007 | 0.793±0.005 | 3.301±0.005 | 0.558±0.012
NAM | 0.769±0.011 | 0.989±0.007 | 0.804±0.003 | 3.567±0.012 | 0.556±0.009
NAM+Ens | 0.771±0.005 | 0.990±0.004 | 0.804±0.002 | 3.555±0.006 | 0.554±0.003
BayesNAM | 0.784±0.009 | 0.991±0.003 | 0.804±0.001 | 3.620±0.011 | 0.556±0.007

4.1 Identifying Data Insufficiency

The capability of BayesNAM to explore diverse explanations further allows us to obtain confidence information about feature contributions. In the left plot of Figure 10, we plot the feature contributions of two randomly drawn offenders from each target value, ‘reoffended’ (y=1) or ‘not’ (y=0).

Refer to caption
Refer to caption
Figure 10: Empirical results on COMPAS. (Left) Feature contribution of two randomly drawn offenders obtained from BayesNAM. Error bars correspond to the standard deviation of feature contribution. juv_other_count shows an extremely high variance. (Right) Mapping function of juv_other_count. Gray lines represent the mapping functions from five different NAMs, while the red line and orange shaded area indicate the average mapping function and its two-sigma interval for BayesNAM, respectively. The mapping functions begin to diverge, i.e., inconsistency occurs, when juv_other_count \geq 4.
Refer to caption
(a) Data distribution within juv_other_count
Refer to caption
(b) Class imbalance in juv_other_count \geq 9.
Figure 11: Empirical findings on COMPAS. (Top) The high-variance area corresponds to a lack of data for juv_other_count \geq 4. (Bottom) This range also shows skewed proportions, particularly for juv_other_count \geq 9, where all labels are either ’reoffended’ or ’not.’

Among the features, juv_other_count (which represents the number of non-felony juvenile offenses a person has been convicted of) exhibits high variance in its contributions. This high variance indicates that, with a single NAM, the contribution of juv_other_count can appear either extremely negative or positive, potentially leading to misinterpretation. BayesNAM reveals substantial variability among models, suggesting that juv_other_count can have both positive and negative effects on predictions within certain ranges.

What can we infer from this high variation? In the right plot of Figure 10, we analyze the mapping function of juv_other_count. NAMs (gray) show inconsistent explanations for juv_other_count. The two-sigma interval of the mapping functions of BayesNAM (orange) also starts to diverge significantly when juv_other_count \geq 4, indicating increased inconsistency in this range. As shown in Figure 11a, the range juv_other_count \geq 4 suffers from a lack of sufficient data. Moreover, this range also shows skewed proportions, especially for juv_other_count \geq 9, where all labels are either ’reoffended’ or ’not.’ In summary, we verify that the high inconsistency highlights the need for caution when interpreting examples involving this feature, and suggests potential issues such as a lack of data.

In addition to identifying data insufficiencies, our model can also be used for feature selection. Features with high absolute contributions and small standard deviations, such as priors_count (which represents the total number of prior offenses a person has been convicted of), consistently demonstrate significant impact across different models.

4.2 Capturing Structural Limitation

Refer to caption
(a) Model prediction
Refer to caption
(b) Price visualization
Figure 12: Empirical findings on CA. (Top) Mapping functions for Longitude, similar to the right plot of Fig. 10. (Bottom) Housing prices are represented with colors in thousand dollars.
Refer to caption
Refer to caption
Figure 13: High variance observed in Fig. 12a suggests the potential failure of model assumption. (Left) Based on our analysis in Fig. 12, we construct and train a new model that contains the interaction term between Latitude and Longitude. As a result, the variance of the prediction is significantly decreased. (Right) Feature importance gained from a newly constructed BayesNAM for each location. The interaction term is highly important when distinguishing Yosemite from San Francisco and Los Angeles.

In addition to data insufficiencies, a high level of inconsistency can reveal structural limitations within the model. In Fig. 12a, we plot the results of NAMs and BayesNAM for longitude. These functions show higher housing prices in San Francisco (around -122.5) and Los Angeles (around -118.5), consistent with previous findings [1]. However, BayesNAM finds that there exist inconsistent explanations between these two cities, particularly within the longitude range of -120 to -119.

We hypothesize that this inconsistency is due to significant variations in housing prices (target variable) across different latitudes. Fig. 12b illustrates the distribution of housing prices in California, with red circles indicating higher prices and larger circles representing higher volumes of houses. As shown in Fig. 12b, while Santa Barbara, Yosemite National Park, and Fresno are on similar longitudes, Santa Barbara exhibits a substantial price gap compared to the others. Additionally, since Yosemite National Park and Fresno are near the same latitude as San Francisco, NAM might struggle to accurately predict housing prices without the interaction term between Latitude and Longitude.

Based on this observation, we train a NAM with an interaction term between Latitude and Longitude. This model achieved much better performance (RMSE: 0.506\pm 0.005) than the model without the interaction term (RMSE: 0.556\pm 0.009). In addition, the significance of the interaction term is particularly evident near Yosemite. When BayesNAM is trained with this interaction term, it also shows reduced variance in the longitude range of -120 to -119. This suggests that high variance can highlight potential structural limitations within models. Moreover, considering that existing methods such as NA2M [1], which incorporate all possible interaction terms, incur heavy computational costs and diminished explainability, BayesNAM offers a promising alternative by effectively selecting the most important interaction terms.
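One way to add such a pairwise term is a small two-input network whose output is added to the additive prediction; this is a hypothetical sketch (InteractionTerm and the layer sizes are our own choices), not the exact model used in the experiment above.

```python
import torch
import torch.nn as nn

class InteractionTerm(nn.Module):
    """f_{lat,lon}(latitude, longitude): a single learned pairwise interaction (sketch)."""
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, lat, lon):  # lat, lon: (batch, 1)
        return self.net(torch.cat([lat, lon], dim=1))

# Prediction with the extra term: sum_i f_i(x_i) + f_{lat,lon}(x_lat, x_lon) + beta.
```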

5 Conclusion

In this work, we identified and analyzed the inconsistent explanations of NAMs. We highlighted the importance of acknowledging these inconsistencies and introduced a new framework, BayesNAM, which leverages inconsistency to provide more reliable explanations. Through empirical validation, we demonstrated that BayesNAM effectively explores diverse explanations and provides external explanations such as insufficient data or model limitations within the data model. We hope our research contributes to the development of trustworthy models.

References

  • [1] R. Agarwal, L. Melnick, N. Frosst, X. Zhang, B. Lengerich, R. Caruana, and G. E. Hinton, “Neural additive models: Interpretable machine learning with neural nets,” Advances in Neural Information Processing Systems, vol. 34, pp. 4699–4711, 2021.
  • [2] F. Radenovic, A. Dubey, and D. Mahajan, “Neural basis models for interpretability,” Advances in Neural Information Processing Systems, vol. 35, pp. 8414–8426, 2022.
  • [3] I. E. Kumar, S. Venkatasubramanian, C. Scheidegger, and S. Friedler, “Problems with shapley-value-based explanations as feature importance measures,” in International Conference on Machine Learning.   PMLR, 2020, pp. 5491–5500.
  • [4] C. Rudin, “Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead,” Nature Machine Intelligence, vol. 1, no. 5, pp. 206–215, 2019.
  • [5] T. Hastie and R. Tibshirani, “Generalized additive models,” Monographs on Statistics and Applied Probability, Chapman & Hall, vol. 43, p. 335, 1990.
  • [6] T. G. Dietterich, “Ensemble methods in machine learning,” in Multiple Classifier Systems: First International Workshop, MCS 2000 Cagliari, Italy, June 21–23, 2000 Proceedings 1.   Springer, 2000, pp. 1–15.
  • [7] M. Welling and Y. W. Teh, “Bayesian learning via stochastic gradient langevin dynamics,” in Proceedings of the 28th international conference on machine learning (ICML-11), 2011, pp. 681–688.
  • [8] A. Graves, “Practical variational inference for neural networks,” Advances in neural information processing systems, vol. 24, 2011.
  • [9] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra, “Weight uncertainty in neural network,” in International conference on machine learning.   PMLR, 2015, pp. 1613–1622.
  • [10] C. Li, C. Chen, D. Carlson, and L. Carin, “Preconditioned stochastic gradient langevin dynamics for deep neural networks,” in Proceedings of the AAAI conference on artificial intelligence, 2016.
  • [11] X. Liu, Y. Li, C. Wu, and C.-J. Hsieh, “Adv-bnn: Improved adversarial defense through robust bayesian neural network,” arXiv preprint arXiv:1810.01279, 2018.
  • [12] S. Lee, H. Kim, and J. Lee, “Graddiv: Adversarial robustness of randomized neural networks via gradient diversity regularization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
  • [13] A. Singh, S. Sengupta, M. A. Rasheed, V. Jayakumar, and V. Lakshminarayanan, “Uncertainty aware and explainable diagnosis of retinal disease,” in Medical Imaging 2021: Imaging Informatics for Healthcare, Research, and Applications, vol. 11601.   SPIE, 2021, pp. 116–125.
  • [14] H. Jang and J. Lee, “Generative bayesian neural network model for risk-neutral pricing of american index options,” Quantitative Finance, vol. 19, no. 4, pp. 587–603, 2019.
  • [15] D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry, “Robustness may be at odds with accuracy,” in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=SyxAb30cY7
  • [16] D. P. Kingma, T. Salimans, and M. Welling, “Variational dropout and the local reparameterization trick,” Advances in neural information processing systems, vol. 28, 2015.
  • [17] T. Chen and C. Guestrin, “Xgboost: A scalable tree boosting system,” in Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, 2016, pp. 785–794.
  • [18] H. Nori, S. Jenkins, P. Koch, and R. Caruana, “Interpretml: A unified framework for machine learning interpretability,” arXiv preprint arXiv:1909.09223, 2019.
  • [19] A. Dal Pozzolo, “Adaptive machine learning for credit card fraud detection,” Université libre de Bruxelles, 2015.
  • [20] FICO, “Fico explainable machine learning challenge,” https://community.fico.com/s/explainable-machine-learning-challenge, 2018.
  • [21] ProPublica, “Compas data and analysis for ‘machine bias’,” https://github.com/propublica/compas-analysis, 2016.
  • [22] R. K. Pace and R. Barry, “Sparse spatial autoregressions,” Statistics & Probability Letters, vol. 33, no. 3, pp. 291–297, 1997.
  • [23] D. Harrison Jr and D. L. Rubinfeld, “Hedonic housing prices and the demand for clean air,” Journal of environmental economics and management, vol. 5, no. 1, pp. 81–102, 1978.

Supplements to BayesNAM: Leveraging Inconsistency for Reliable Explanations

Proofs

Proof for Lemma 1.

Proof.

Let h(x_{2},\cdots,x_{d})=\operatorname{sign}(\sum_{i=2}^{d}x_{i}/(d-1)). Then, the accuracy of h(x_{2},\cdots,x_{d}) becomes

\mathbb{P}[h(x_{2},\cdots,x_{d})=y]=\mathbb{P}\left[\frac{y}{d-1}\sum_{i=2}^{d}\mathcal{N}(\lambda y,\sigma^{2})>0\right]=\Phi_{X\sim\mathcal{N}\left(0,\sigma^{2}/(d-1)\right)}(\lambda).

Preliminary Lemma for Proof for Theorem 1.

Lemma 2.

Let q_{j}=\Phi_{Z}(\lambda\sqrt{j}/\sigma). Then, the following inequality holds:

(q_{j+1}-q_{j})\leq\frac{1}{\sqrt{2\pi}}\int_{\sqrt{j\text{ln}((j+1)/j)}}^{\sqrt{(j+1)\text{ln}((j+1)/j)}}e^{-\frac{1}{2}t^{2}}dt,

Moreover,

(j+1)\frac{1}{\sqrt{2\pi}}\int_{\sqrt{j\text{ln}((j+1)/j)}}^{\sqrt{(j+1)\text{ln}((j+1)/j)}}e^{-\frac{1}{2}t^{2}}dt

decreases with respect to j, so the upper bound for j\geq 2 is \frac{3}{\sqrt{2\pi}}\int_{\sqrt{2\text{ln}(3/2)}}^{\sqrt{3\text{ln}(3/2)}}e^{-\frac{1}{2}t^{2}}dt.

Proof.

To find the upper bound of q_{j+1}-q_{j}, we take the derivative with respect to a:=\lambda/\sigma (so that q_{j}=\Phi_{Z}(a\sqrt{j})) as follows:

\frac{\partial}{\partial a}(q_{j+1}-q_{j})=\frac{1}{\sqrt{2\pi}}(\sqrt{j+1}e^{-\frac{j+1}{2}a^{2}}-\sqrt{j}e^{-\frac{j}{2}a^{2}})=0\quad\text{at}\quad a=\sqrt{\text{ln}((j+1)/j)}.

Thus, the first inequality holds. Next, we take the derivative with respect to j to verify that the left-hand side decreases.

\frac{\partial}{\partial j}(j+1)\int_{\sqrt{j\text{ln}((j+1)/j)}}^{\sqrt{(j+1)\text{ln}((j+1)/j)}} e^{-\frac{1}{2}t^{2}}dt=\int_{\sqrt{j\text{ln}((j+1)/j)}}^{\sqrt{(j+1)\text{ln}((j+1)/j)}}e^{-\frac{1}{2}t^{2}}dt
+(j+1)\left[\frac{(\frac{j}{j+1})^{\frac{j+1}{2}}}{2\sqrt{(j+1)\ln((j+1)/j)}}\left\{\ln\frac{j+1}{j}-\frac{1}{j}\right\}-\frac{(\frac{j}{j+1})^{\frac{j}{2}}}{2\sqrt{j\ln((j+1)/j)}}\left\{\ln\frac{j+1}{j}-\frac{1}{j+1}\right\}\right]

Simplifying this equation:

(j+1)\left[\frac{(\frac{j}{j+1})^{\frac{j+1}{2}}}{2\sqrt{(j+1)\ln((j+1)/j)}}\left\{\ln\frac{j+1}{j}-\frac{1}{j}\right\}-\frac{(\frac{j}{j+1})^{\frac{j}{2}}}{2\sqrt{j\ln((j+1)/j)}}\left\{\ln\frac{j+1}{j}-\frac{1}{j+1}\right\}\right]
=\frac{j+1}{2\sqrt{\ln((j+1)/j)}}\left(\frac{j}{j+1}\right)^{\frac{j}{2}}\times\left[\frac{\sqrt{j}}{{j+1}}\left(\ln\frac{j+1}{j}-\frac{1}{j}\right)-\frac{1}{\sqrt{j}}\left(\ln\frac{j+1}{j}-\frac{1}{j+1}\right)\right]
=-\frac{1}{2}\left(\frac{j}{j+1}\right)^{\frac{j}{2}}\sqrt{\ln\frac{j+1}{j}}\cdot\frac{1}{\sqrt{j}}

Since

\int_{\sqrt{j\text{ln}((j+1)/j)}}^{\sqrt{(j+1)\text{ln}((j+1)/j)}}e^{-\frac{1}{2}t^{2}}dt\leq\sqrt{\ln\frac{j+1}{j}}(\sqrt{j+1}-\sqrt{j})\left(\frac{j}{j+1}\right)^{\frac{j}{2}},

we have

\int_{\sqrt{j\text{ln}((j+1)/j)}}^{\sqrt{(j+1)\text{ln}((j+1)/j)}}e^{-\frac{1}{2}t^{2}}dt-\frac{1}{2}\left(\frac{j}{j+1}\right)^{\frac{j}{2}}\sqrt{\ln\frac{j+1}{j}}\cdot\frac{1}{\sqrt{j}}\leq\sqrt{\ln\frac{j+1}{j}}\left(\frac{j}{j+1}\right)^{\frac{j}{2}}(\sqrt{j+1}-\sqrt{j}-\frac{1}{2\sqrt{j}})\leq 0,

for j\geq 2. ∎

Proof for Theorem 1.

Proof.

Let q_{j}=\Phi_{Z}(\lambda\sqrt{j}/\sigma). Then, \mathcal{P}(k,\tau) can be formalized as follows:

\mathcal{P}(k,\tau)=\sum^{k}_{j=1}q_{j}\binom{k}{j}(1-\tau)^{j}\tau^{k-j}. (8)

Additionally, define the difference between \mathcal{P}(k,\tau) and \mathcal{P}(1,\tau) as follows:

\Delta\mathcal{P}(k,\tau):=\mathcal{P}(k,\tau)-\mathcal{P}(1,\tau)=\sum^{k}_{j=1}q_{j}\binom{k}{j}(1-\tau)^{j}\tau^{k-j}-q_{1}(1-\tau).

It is trivial that \Delta\mathcal{P}(k,0)=q_{k}-q_{1}>0 for all k\geq 3.

To show that \frac{\partial}{\partial\tau}\Delta\mathcal{P}(k,\tau)>0 for \tau\in[0,\frac{1}{2}], we first prove that \frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)>0 for \tau\in[0,\frac{1}{2}]. For k=3, we have

\Delta\mathcal{P}(3,\tau)=\sum^{3}_{j=1}q_{j}\binom{3}{j}(1-\tau)^{j}\tau^{3-j}-q_{1}(1-\tau)
=(-3q_{1}+3q_{2}-q_{3})\tau^{3}+(3q_{1}-6q_{2}+3q_{3})\tau^{2}+(q_{1}+3q_{2}-3q_{3})\tau+(-q_{1}+q_{3}),

and

\frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)=(-9q_{1}+9q_{2}-3q_{3})\tau^{2}+(6q_{1}-12q_{2}+6q_{3})\tau+(q_{1}+3q_{2}-3q_{3}). (9)

For simplicity, let us denote a=\lambda/\sigma so that q_{j}=\Phi_{Z}(a\sqrt{j}). Since \frac{\partial}{\partial a}\Phi_{Z}(a\sqrt{j})=\sqrt{j}\cdot\phi(a\sqrt{j}),

\frac{\partial}{\partial a}(q_{2}-q_{1})=\frac{\partial}{\partial a}(\Phi_{Z}(\sqrt{2}a)-\Phi_{Z}(a))=\sqrt{2}\phi(a\sqrt{2})-\phi(a), (10)

where \phi(x)=\frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}x^{2}}. (10) becomes 0 when

\frac{\partial}{\partial a}(q_{2}-q_{1})=\frac{1}{\sqrt{2\pi}}(\sqrt{2}e^{-a^{2}}-e^{-\frac{1}{2}a^{2}})=0\quad\text{at}\quad a=\sqrt{\text{ln}2}.

Since q_{j}>\frac{1}{2} for a>0 and j>0, we have

q_{2}-q_{1}\leq\frac{1}{\sqrt{2\pi}}\int_{\sqrt{\text{ln}2}}^{\sqrt{2\text{ln}2}}e^{-\frac{1}{2}t^{2}}dt\approx 0.08302<\frac{1}{6}<\frac{1}{3}q_{3},

so that -9q_{1}+9q_{2}-3q_{3}<0 for any a.

Similarly, we can easily show that

\frac{\partial}{\partial a}(q_{1}+3q_{2}-3q_{3})=\frac{1}{\sqrt{2\pi}}(e^{-\frac{1}{2}a^{2}}+3\sqrt{2}e^{-a^{2}}-3\sqrt{3}e^{-\frac{3}{2}a^{2}})=\frac{1}{\sqrt{2\pi}}(x+3\sqrt{2}x^{2}-3\sqrt{3}x^{3}), (11)

where x=e^{-\frac{1}{2}a^{2}}\in(0,1] for a^{2}\in[0,\infty). The solutions that make (11) equal to zero can be explicitly calculated as x=0,\frac{\sqrt{6}-\sqrt{6+4\sqrt{3}}}{6}<0 and \frac{\sqrt{6}+\sqrt{6+4\sqrt{3}}}{6}>1. The minimum value of q_{1}+3q_{2}-3q_{3} is attained when a lies on the boundary. Thus, \frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)|_{\tau=0}\geq 0.5.

We will now demonstrate that \frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)|_{\tau=\frac{1}{2}}>0 by showing that 4\cdot\frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)|_{\tau=\frac{1}{2}}=7q_{1}-3q_{2}-3q_{3}>0. Note that

7q_{1}-3q_{2}-3q_{3}=\frac{7}{\sqrt{2\pi}}\int_{-\infty}^{a}e^{-\frac{1}{2}t^{2}}dt-\frac{3}{\sqrt{2\pi}}\int_{-\infty}^{a\sqrt{2}}e^{-\frac{1}{2}t^{2}}dt-\frac{3}{\sqrt{2\pi}}\int_{-\infty}^{a\sqrt{3}}e^{-\frac{1}{2}t^{2}}dt. (12)

By taking the derivative with respect to a, we have

\frac{\partial}{\partial a}(7q_{1}-3q_{2}-3q_{3})=\frac{1}{\sqrt{2\pi}}(7e^{-\frac{1}{2}a^{2}}-3\sqrt{2}e^{-a^{2}}-3\sqrt{3}e^{-\frac{3}{2}a^{2}})=\frac{1}{\sqrt{2\pi}}(7x-3\sqrt{2}x^{2}-3\sqrt{3}x^{3}), (13)

where x=e^{-\frac{1}{2}a^{2}}. The solutions that make (13) equal to zero can be explicitly calculated as x=0,\frac{\pm\sqrt{6+28\sqrt{3}}-\sqrt{6}}{6}. Since e^{-\frac{1}{2}a^{2}}\in(0,1] and strictly decreases, the minimum occurs at e^{-\frac{1}{2}\tilde{a}^{2}}=\frac{\sqrt{6+28\sqrt{3}}-\sqrt{6}}{6}, where \tilde{a}=\sqrt{2\ln{\frac{6}{\sqrt{6+28\sqrt{3}}-\sqrt{6}}}}. Thus, we obtain the following inequality:

7q_{1}-3q_{2}-3q_{3}\geq\frac{7}{\sqrt{2\pi}}\int_{-\infty}^{\tilde{a}}e^{-\frac{1}{2}t^{2}}dt-\frac{3}{\sqrt{2\pi}}\int_{-\infty}^{\tilde{a}\sqrt{2}}e^{-\frac{1}{2}t^{2}}dt-\frac{3}{\sqrt{2\pi}}\int_{-\infty}^{\tilde{a}\sqrt{3}}e^{-\frac{1}{2}t^{2}}dt>0.1217>0.

This implies that \frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)\lvert_{\tau=\frac{1}{2}}>0.0304>0. Since \frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau) is a quadratic function of \tau with a negative leading coefficient, it follows that \frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)>0 for \tau\in[0,\frac{1}{2}].

Next, we demonstrate that \Delta\mathcal{P}(k,\tau) increases as \tau increases for \tau\in[0,\frac{1}{2}] and k\geq 3. To begin, we decompose \Delta\mathcal{P}(k+1,\tau) as follows:

\Delta\mathcal{P}(k+1,\tau)=\sum^{k+1}_{j=1}q_{j}\binom{k+1}{j}(1-\tau)^{j}\tau^{k+1-j}-q_{1}(1-\tau)
=\sum^{k}_{j=1}q_{j}\left(\binom{k}{j}+\binom{k}{j-1}\right)(1-\tau)^{j}\tau^{k+1-j}+q_{k+1}(1-\tau)^{k+1}-q_{1}(1-\tau)
=\Delta\mathcal{P}(k,\tau)+\sum^{k}_{j=1}(q_{j+1}-q_{j})\binom{k}{j}(1-\tau)^{j+1}\tau^{k-j}+q_{1}(1-\tau)\tau^{k}
=\Delta\mathcal{P}(3,\tau)+\sum^{k}_{l=3}\left[\sum^{l}_{j=1}(q_{j+1}-q_{j})\binom{l}{j}(1-\tau)^{j+1}\tau^{l-j}+q_{1}(1-\tau)\tau^{l}\right]
=\Delta\mathcal{P}(3,\tau)+q_{1}(1-\tau)\sum^{k}_{l=3}\tau^{l}+(q_{2}-q_{1})(1-\tau)^{2}\sum^{k}_{l=3}\binom{l}{1}\tau^{l-1}+(q_{3}-q_{2})(1-\tau)^{3}\sum^{k}_{l=3}\binom{l}{2}\tau^{l-2}
\quad+\sum^{k}_{j=3}(q_{j+1}-q_{j})(1-\tau)^{j+1}\sum^{k}_{l=j}\binom{l}{j}\tau^{l-j}.

By taking the derivative with respect to \tau, we have

\frac{\partial}{\partial\tau}\Delta\mathcal{P}(k+1,\tau)=\frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)+q_{1}\sum^{k}_{l=3}\tau^{l-1}\Big\{\binom{l}{1}-\binom{l+1}{1}\tau\Big\}+2(q_{2}-q_{1})(1-\tau)\sum^{k}_{l=3}\tau^{l-2}\Big\{\binom{l}{2}-\binom{l+1}{2}\tau\Big\}
\quad+3(q_{3}-q_{2})(1-\tau)^{2}\sum^{k}_{l=3}\tau^{l-3}\Big\{\binom{l}{3}-\binom{l+1}{3}\tau\Big\}
\quad+\sum^{k}_{j=3}(j+1)(q_{j+1}-q_{j})(1-\tau)^{j}\Big[-1+\sum^{k}_{l=j+1}\tau^{l-j-1}\Big\{\binom{l}{j+1}-\binom{l+1}{j+1}\tau\Big\}\Big]
=\frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)+q_{1}(3\tau^{2}-\binom{k+1}{1}\tau^{k})+2(q_{2}-q_{1})(1-\tau)(\binom{3}{2}\tau-\binom{k+1}{2}\tau^{k-1})
\quad+3(q_{3}-q_{2})(1-\tau)^{2}(1-\binom{k+1}{3}\tau^{k-2})-\sum^{k}_{j=3}(j+1)(q_{j+1}-q_{j})\binom{k+1}{j+1}(1-\tau)^{j}\tau^{k-j}
=\frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)+3q_{1}\tau^{2}+6(q_{2}-q_{1})\tau(1-\tau)+3(q_{3}-q_{2})(1-\tau)^{2}
\quad-(k+1)\sum^{k}_{j=0}(q_{j+1}-q_{j})\binom{k}{j}(1-\tau)^{j}\tau^{k-j}-(k+1)q_{0}\tau^{k}
=\frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)+q_{1}\tau^{2}(3-(k+1)\tau^{k-2})+(q_{2}-q_{1})\tau(1-\tau)(6-(k+1)k\tau^{k-2})
\quad+3(q_{3}-q_{2})(1-\tau)^{2}-\sum^{k}_{j=2}(j+1)(q_{j+1}-q_{j})\binom{k+1}{j+1}(1-\tau)^{j}\tau^{k-j}.

By Lemma 2, for j\geq 2,

(j+1)(q_{j+1}-q_{j})\leq(j+1)\frac{1}{\sqrt{2\pi}}\int_{\sqrt{j\text{ln}((j+1)/j)}}^{\sqrt{(j+1)\text{ln}((j+1)/j)}}e^{-\frac{1}{2}t^{2}}dt\leq\frac{3}{\sqrt{2\pi}}\int_{\sqrt{2\text{ln}(3/2)}}^{\sqrt{3\text{ln}(3/2)}}e^{-\frac{1}{2}t^{2}}dt\leq 0.147.

Additionally, since \frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau) is a quadratic function, we have

\frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)\geq(1-2\tau)\frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)|_{\tau=0}+2\tau\frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)|_{\tau=\frac{1}{2}}\geq 0.5-0.94\tau.

Moreover, since (k+1)\tau^{k-2} decreases for k\geq 3, we have

\frac{\partial}{\partial\tau}\Delta\mathcal{P}(k+1,\tau)\geq\frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)+q_{1}\tau^{2}(3-(k+1)\tau^{k-2})-\sum^{k}_{j=2}(j+1)(q_{j+1}-q_{j})\binom{k+1}{j+1}(1-\tau)^{j}\tau^{k-j}
\geq 0.5-0.94\tau+\frac{1}{2}\tau^{2}(3-4\tau)-\frac{0.147}{1-\tau}.

Since 0.5-0.94\tau+\frac{1}{2}\tau^{2}(3-4\tau)-\frac{0.147}{1-\tau}>0 for \tau\in[0,\frac{3}{8}], it follows that \frac{\partial}{\partial\tau}\Delta\mathcal{P}(k+1,\tau)\geq 0 for \tau\in[0,\frac{3}{8}] and k\geq 3.

Now, we will show that \frac{\partial}{\partial\tau}\Delta\mathcal{P}(4,\tau)\geq 0 and \frac{\partial}{\partial\tau}\Delta\mathcal{P}(5,\tau)\geq 0 for \tau\in[\frac{3}{8},\frac{1}{2}]. First,

\frac{\partial}{\partial\tau}\Delta\mathcal{P}(4,\tau)=\frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)+3q_{1}\tau^{2}+6(q_{2}-q_{1})\tau(1-\tau)+3(q_{3}-q_{2})(1-\tau)^{2}-4\sum^{3}_{j=0}(q_{j+1}-q_{j})\binom{3}{j}(1-\tau)^{j}\tau^{3-j}-4q_{0}\tau^{3}
=\frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)+q_{1}\tau^{2}(3-4\tau)+6(q_{2}-q_{1})\tau(1-\tau)(1-2\tau)+3(q_{3}-q_{2})(1-\tau)^{2}(1-4\tau)-4(q_{4}-q_{3})(1-\tau)^{3}.

By using Lemma 2, we have the following inequalities:

q_{3}-q_{2}\leq\frac{1}{\sqrt{2\pi}}\int_{\sqrt{2\text{ln}(3/2)}}^{\sqrt{3\text{ln}(3/2)}}e^{-\frac{1}{2}t^{2}}dt\leq 0.049,

and

q_{4}-q_{3}\leq\frac{1}{\sqrt{2\pi}}\int_{\sqrt{3\text{ln}(4/3)}}^{\sqrt{4\text{ln}(4/3)}}e^{-\frac{1}{2}t^{2}}dt\leq 0.035.

Since q_{3}-q_{2}\geq q_{4}-q_{3}, we also have the following inequalities:

3(q_{3}-q_{2})(1-\tau)^{2}(4\tau-1)+4(q_{4}-q_{3})(1-\tau)^{3}\leq(q_{3}-q_{2})(1-\tau)^{2}(1+8\tau)\leq\frac{25}{16}(q_{3}-q_{2})\leq 0.077,

and

\frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)+q_{1}\tau^{2}(3-4\tau)\geq\frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)|_{\tau=\frac{1}{2}}+\frac{27}{128}q_{1}\geq 0.03+0.1054=0.1354.

Therefore, \frac{\partial}{\partial\tau}\Delta\mathcal{P}(4,\tau)\geq\frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)+q_{1}\tau^{2}(3-4\tau)-(3(q_{3}-q_{2})(1-\tau)^{2}(4\tau-1)+4(q_{4}-q_{3})(1-\tau)^{3})\geq 0.1354-0.077>0.

Similarly, we have

\frac{\partial}{\partial\tau}\Delta\mathcal{P}(5,\tau)=\frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)+3q_{1}\tau^{2}+6(q_{2}-q_{1})\tau(1-\tau)+3(q_{3}-q_{2})(1-\tau)^{2}
\quad-5\sum^{4}_{j=0}(q_{j+1}-q_{j})\binom{4}{j}(1-\tau)^{j}\tau^{4-j}-5q_{0}\tau^{4}
=\frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)+q_{1}\tau^{2}(3-5\tau^{2})+(q_{2}-q_{1})\tau(1-\tau)(6-20\tau^{2})+3(q_{3}-q_{2})(1-\tau)^{2}(1-10\tau^{2})
\quad-20(q_{4}-q_{3})(1-\tau)^{3}\tau-5(q_{5}-q_{4})(1-\tau)^{4}
\geq\frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)+q_{1}\tau^{2}(3-5\tau^{2})-(q_{3}-q_{2})(1-\tau)^{2}(15\tau^{2}+10\tau+2)
\geq 0.03+\frac{1323}{4096}q_{1}-\frac{12575}{4096}(q_{3}-q_{2})\geq 0.03+0.1623-0.1505>0.

For k\geq 6, we can easily show that

\displaystyle\frac{\partial}{\partial\tau}\Delta\mathcal{P}(k,\tau)\geq\frac{\partial}{\partial\tau}\Delta\mathcal{P}(3,\tau)+q_{1}\tau^{2}(3-k\tau^{k-3})-\sum^{k-1}_{j=2}(j+1)(q_{j+1}-q_{j})\binom{k}{j+1}(1-\tau)^{j}\tau^{k-1-j}
\geq 0.5-0.94\tau+\frac{1}{2}\tau^{2}(3-6\tau^{3})-\frac{0.147}{1-\tau}.

Since 0.5-0.94\tau+\frac{1}{2}\tau^{2}(3-6\tau^{3})-\frac{0.147}{1-\tau}>0 for \tau\in[\frac{3}{8},\frac{1}{2}], the inequality holds.

Therefore, \frac{\partial}{\partial\tau}\Delta\mathcal{P}(k,\tau)\geq 0 for \tau\in[0,\frac{1}{2}] and k\geq 3. ∎

Training Details

In Section 4, we conducted experiments on five different datasets: Credit Fraud [19], FICO [20], COMPAS [21], California Housing (CA Housing) [22], and Boston [23]. Here, we summarize their characteristics in Table II and provide brief descriptions to aid interpretation of the experimental results.

TABLE II: List of datasets and their characteristics.
Data # Train # Test # Features Task type
Credit Fraud 227,845 56,962 30 Classification
FICO 8,367 2,092 23 Classification
COMPAS 13,315 3,329 17 Classification
CA Housing 16,512 4,128 8 Regression
Boston 404 102 13 Regression
  • Credit Fraud: This dataset focuses on predicting fraudulent credit card transactions and is highly imbalanced. Due to confidentiality concerns, the features are represented as principal components obtained through PCA.

  • FICO: This dataset aims to predict the risk performance of consumers, categorizing them as either “Bad” or “Good” based on their credit.

  • COMPAS: This dataset aims to predict recidivism, determining whether an individual will reoffend or not.

  • CA Housing: This dataset aims to predict the median house value for districts in California based on data derived from the 1990 U.S. Census.

  • Boston: This dataset focuses on predicting the median value of owner-occupied homes in the Boston area, using data collected by the U.S. Census Service.

For each dataset, we conducted experiments using five different random seeds. To ensure a fair comparison with the prior work [1], we adopted 5-fold cross-validation for datasets for which a train-test split was not provided.
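As an illustrative sketch of this evaluation protocol (our own example; it assumes scikit-learn, and the synthetic arrays below are stand-ins rather than code from our pipeline):

import numpy as np
from sklearn.model_selection import KFold

# Synthetic stand-in data purely to illustrate the splitting protocol.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 8)), rng.normal(size=1000)

fold_indices = []
for seed in range(5):                          # five different random seeds
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    for train_idx, test_idx in kf.split(X):    # used when no train-test split is provided
        fold_indices.append((seed, train_idx, test_idx))
        # a model would be trained on (X[train_idx], y[train_idx]) and
        # evaluated on (X[test_idx], y[test_idx]) here
print(len(fold_indices), "train/test splits in total")  # 5 seeds x 5 folds = 25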

TABLE III: Comparison between the structures of the previous NAM [1] and the proposed one.
Data NAM in [1] AUC(\uparrow) NAM (ours) AUC(\uparrow)
Credit Fraud ExU+ReLU-1 0.980±0.00 ResBlocks+ReLU 0.990±0.00
COMPAS Linear+ReLU 0.737±0.01 ResBlocks+ReLU 0.771±0.05

In the original NAM paper [1], the authors proposed exp-centered (ExU) units, which can be formalized as follows:

\text{ExU}(x)=h(e^{w}(x-b)), (14)

where w and b are weight and bias parameters, and h(\cdot) is an activation function. ExU units were proposed to model jagged functions, enhancing the expressiveness of NAM. The authors also explored the ReLU-n activation, which caps the ReLU activation at n, and found it beneficial for specific datasets. However, we find that the convergence of these units is relatively unstable compared to fully connected networks.
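As an illustration, below is a minimal PyTorch-style sketch of an ExU hidden layer following Eq. (14); the module name, weight initialization, and the default cap value are our own assumptions for this example, not the exact implementation of [1].

import torch
import torch.nn as nn

class ExU(nn.Module):
    """Exp-centered (ExU) hidden layer: h(e^w (x - b)), cf. Eq. (14)."""
    def __init__(self, in_features: int, out_features: int, n: float = 1.0):
        super().__init__()
        # Initialization is illustrative only.
        self.weight = nn.Parameter(torch.randn(in_features, out_features) * 0.5)
        self.bias = nn.Parameter(torch.zeros(in_features))
        self.n = n  # cap used by the ReLU-n activation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_features); weights are exponentiated before the linear map
        out = (x - self.bias) @ torch.exp(self.weight)
        # ReLU-n: clip the ReLU output at n
        return torch.clamp(torch.relu(out), max=self.n)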

To address this issue, we employ ResNet blocks in our approach. Specifically, we utilize a ResNet block with group convolution layers, where each layer is followed by BatchNorm and ReLU activation. This modification enables stable training across various datasets and further improves performance significantly. Following the standard ResNet structure, our model consists of one input layer with BatchNorm, ReLU, and Dropout, three ResNet blocks, and one output layer. We find that a dimension of 32 for each layer is sufficient to achieve superior performance. In Table III, we compare the structure and performance of the NAM in [1] with our proposed structure on common datasets under the same setting as in [1]. Our proposed structure shows improved performance while requiring lower computational costs. By incorporating the group convolution proposed in [2], the basic structure of our neural additive model is illustrated in Fig. 14. Given the feature dimension d=3, the inputs are reshaped accordingly within the channel dimension. Subsequently, we employ group convolution with d groups, where each kernel is applied individually to its corresponding channel. Then, residual connections are employed for each function f_{i}.

Refer to caption
Figure 14: Illustration of the use of group convolution and its effect on reducing computational costs.
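For concreteness, here is a simplified PyTorch sketch of this grouped-convolution structure; it is a rough reimplementation under our own assumptions (1x1 group convolutions, hidden dimension 32, three blocks), and details such as the exact layer ordering may differ from the architecture used in our experiments.

import torch
import torch.nn as nn

class GroupResBlock(nn.Module):
    """Residual block built from 1x1 group convolutions; each group serves one feature."""
    def __init__(self, d: int, hidden: int = 32):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv1d(d * hidden, d * hidden, kernel_size=1, groups=d),
            nn.BatchNorm1d(d * hidden),
            nn.ReLU(),
            nn.Conv1d(d * hidden, d * hidden, kernel_size=1, groups=d),
            nn.BatchNorm1d(d * hidden),
        )
        self.act = nn.ReLU()

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.act(z + self.block(z))  # residual connection, applied per feature group

class GroupedNAM(nn.Module):
    """All d shape functions f_i computed in parallel through group convolution."""
    def __init__(self, d: int, hidden: int = 32, dropout: float = 0.1):
        super().__init__()
        self.inp = nn.Sequential(
            nn.Conv1d(d, d * hidden, kernel_size=1, groups=d),
            nn.BatchNorm1d(d * hidden),
            nn.ReLU(),
            nn.Dropout(dropout),
        )
        self.blocks = nn.Sequential(*[GroupResBlock(d, hidden) for _ in range(3)])
        self.out = nn.Conv1d(d * hidden, d, kernel_size=1, groups=d)

    def forward(self, x: torch.Tensor):
        # x: (batch, d) -> (batch, d, 1): features moved into the channel dimension
        z = self.blocks(self.inp(x.unsqueeze(-1)))
        f = self.out(z).squeeze(-1)            # (batch, d): one output per shape function f_i
        return f.sum(dim=1), f                 # additive prediction and per-feature contributions

Because every kernel is restricted to its own group, the d feature networks are evaluated in a single forward pass instead of d separate ones, which is the source of the reduced computational cost illustrated in Fig. 14.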
TABLE IV: Selected hyper-parameters by grid search.
Params COMPAS Credit FICO Boston CA Housing
Learning rate \eta 0.01 0.01 0.01 0.001 0.01
Dropout rate \psi 0.1 0.3 0.0 0.0 0.0
Batch size B 1,024 1,024 2,048 128 2,048
Feature dropout \tau 0.2 0.1 0.4 0.1 0.1

We performed grid searches to determine the best settings for achieving high performance. First, for NAM, we explored the learning rate \eta\in\{0.1,0.01,0.001,0.0001\}, the dropout rate in the input layer \psi\in\{0.0,0.1,0.2,0.3,0.4,0.5\}, and the batch size B\in\{128,256,512,1024,2048\}. We then explored the additional hyper-parameters for BayesNAM while keeping the other variables the same as for NAM. Following [11], we searched for the initial standard deviation vector s_{0} and found that s_{0}=10^{-4} provided the most stable performance. We explored the feature dropout probability \tau\in\{0.1,0.2,0.3,0.4,0.5\}. In all experiments, we used SGD with cosine learning rate decay, a momentum of 0.9, and a weight decay of 5\times 10^{-4} over 100 epochs. The selected hyper-parameter settings are provided in Table IV.
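A minimal sketch of this training setup (SGD with momentum 0.9, weight decay 5e-4, and cosine learning-rate decay over 100 epochs) is given below; the synthetic data and the GroupedNAM sketch from above are placeholders standing in for the actual pipeline.

import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in data; d = 8 as in CA Housing, batch size chosen for the example only.
data = TensorDataset(torch.randn(1024, 8), torch.randn(1024))
train_loader = DataLoader(data, batch_size=128, shuffle=True)
model = GroupedNAM(d=8)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    for x, y in train_loader:
        optimizer.zero_grad()
        pred, _ = model(x)                               # additive prediction, per-feature terms
        loss = torch.nn.functional.mse_loss(pred, y)     # regression; use BCE for classification
        loss.backward()
        optimizer.step()
    scheduler.step()                                      # cosine decay, stepped once per epoch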

Ablation Study

Refer to caption
Figure 15: Consistency of BayesNAM for five different random seeds. Mapping functions f_{1}(x_{1}) (top row) and f_{2}(x_{2}) (bottom row) obtained from BayesNAM with \tau=0.1 for five distinct random seeds. The same setting is used as in Fig. 4. The average function is plotted in red, with the min-max range indicated in orange. Each column corresponds to a different random seed. In contrast to NAM in Fig. 4, BayesNAM consistently exhibits similar mapping function distributions.
TABLE V: Performance comparison between models on 5 different random seeds. Higher AUC is better (\uparrow) and lower RMSE is better (\downarrow).
Model COMPAS (AUC\uparrow) Credit (AUC\uparrow) FICO (AUC\uparrow) Boston (RMSE\downarrow) CA Housing (RMSE\downarrow)
w/o FD 0.782±0.006 0.990±0.005 0.805±0.003 3.618±0.007 0.554±0.010
w/ FD 0.784±0.009 0.991±0.003 0.804±0.001 3.620±0.011 0.556±0.007

Fig. 15 and Table V present the ablation study on feature dropout. Fig. 15 shows the consistent explanations produced by BayesNAM under the setting of Fig. 4. Compared to the results of NAM in Fig. 4, BayesNAM explores diverse explanations while yielding consistent results across different random seeds. In Table V, we compare the performance of BayesNAM without feature dropout (denoted as 'naive Bayesian' in Fig. 8) and with feature dropout. While feature dropout encourages greater exploration of diverse explanations, it does not always lead to improved performance across all datasets. Specifically, for regression tasks, feature dropout often results in a slight performance degradation. We leave it as future work to develop a framework that enhances both performance and explainability across all tasks and datasets.