
On the Tractability of SHAP Explanations

Guy Van den Broeck,1 Anton Lykov,1 Maximilian Schleich,2 Dan Suciu2
Abstract

Shap explanations are a popular feature-attribution mechanism for explainable AI. They use game-theoretic notions to measure the influence of individual features on the prediction of a machine learning model. Despite a lot of recent interest from both academia and industry, it is not known whether Shap explanations of common machine learning models can be computed efficiently. In this paper, we establish the complexity of computing the Shap explanation in three important settings. First, we consider fully-factorized data distributions, and show that the complexity of computing the Shap explanation is the same as the complexity of computing the expected value of the model. This fully-factorized setting is often used to simplify the Shap computation, yet our results show that the computation can be intractable for commonly used models such as logistic regression. Going beyond fully-factorized distributions, we show that computing Shap explanations is already intractable for a very simple setting: computing Shap explanations of trivial classifiers over naive Bayes distributions. Finally, we show that even computing Shap over the empirical distribution is #P-hard.

1 Introduction

Machine learning is increasingly applied in high stakes decision making. As a consequence, there is growing demand for the ability to explain the prediction of machine learning models. One popular explanation technique is to compute feature-attribution scores, in particular using the Shapley values from cooperative game theory (Roth 1988) as a principled aggregation measure to determine the influence of individual features on the prediction of the collective model. Shapley-value-based explanations have several desirable properties (Datta, Sen, and Zick 2016), which is why they have attracted a lot of interest in academia as well as industry in recent years (see e.g., Gade et al. (2019)).

Štrumbelj and Kononenko (2014) show that Shapley values can be used to explain arbitrary machine learning models. Datta, Sen, and Zick (2016) use Shapley-value-based explanations as part of a broader framework for algorithmic transparency. Lundberg and Lee (2017) use Shapley values in a framework that unifies various explanation techniques, and they coined the term Shap explanation. They show that the Shap explanation is effective in explaining predictions in the medical domain; see Lundberg et al. (2020). More recently there has been a lot of work on the tradeoffs of variants of the original Shap explanations, e.g., Sundararajan and Najmi (2020), Kumar et al. (2020), Janzing, Minorics, and Bloebaum (2020), Merrick and Taly (2020), and Aas, Jullum, and Løland (2019).

Despite all of this interest, there is considerable confusion about the tractability of computing Shap explanations. The Shap explanations determine the influence of a given feature by systematically computing the expected value of the model given subsets of the features. As a consequence, the complexity of computing Shap explanations depends on the predictive model as well as assumptions on the underlying data distribution. Lundberg et al. (2020) describe a polynomial-time algorithm for computing the Shap explanation over decision trees, but online discussions have pointed out that this algorithm is not correct as stated. We present a concrete example of this shortcoming in the supplementary material, in Appendix A. In contrast, for fully-factorized distributions, Bertossi et al. (2020) prove that there are models for which computing the Shap explanation is #P-hard. A contemporaneous paper by Arenas et al. (2020) shows that computing the Shap explanation for tractable logical circuits over uniform and fully factorized binary data distributions is tractable. In general, the complexity of the Shap explanation is open.

In this paper we consider the original formulation of the Shap explanation by Lundberg and Lee (2017) and analyze its computational complexity under the following data distributions and model classes:

  1.

    First, we consider fully-factorized distributions, which are the simplest possible data distribution. Fully-factorized distributions capture the assumption that the model’s features are independent, which is a commonly used assumption to simplify the computation of the Shap explanations, see for example Lundberg and Lee (2017).

    For fully-factorized distributions and any prediction model, we show that the complexity of computing the Shap explanation is the same as the complexity of computing the expected value of the model.

    It follows that there are classes of models for which the computation is tractable (e.g., linear regression, decision trees, tractable circuits), while for other models, including commonly used ones such as logistic regression and neural nets with sigmoid activation functions, it is #P-hard.

  2.

    Going beyond fully-factorized distributions, we show that computing the Shap explanation becomes intractable already for the simplest probabilistic model that does not assume feature independence: naive Bayes. As a consequence, the complexity of computing Shap explanations on such data distributions is also intractable for many classes of models, including linear and logistic regression.

  3.

    Finally, we consider the empirical distribution, and prove that computing Shap explanations is #P-hard for this class of distributions. This result implies that the algorithm by Lundberg et al. (2020) cannot be fixed to compute the exact Shap explanations over decision trees in polynomial time.

2 Background and Problem Statement

Suppose our data is described by $n$ indexed features $\mathbf{X}=\{X_{1},\ldots,X_{n}\}$. Each feature variable $X$ takes a value from a finite domain $\operatorname{dom}(X)$. A data instance $\mathbf{x}=(x_{1},\dots,x_{n})$ consists of a value $x\in\operatorname{dom}(X)$ for every feature $X$. The instance space is $\mathcal{X}=\operatorname{dom}(X_{1})\times\dots\times\operatorname{dom}(X_{n})$, so $\mathbf{x}\in\mathcal{X}$. We are also given a learned function $F:\mathcal{X}\to\mathbb{R}$ that computes a prediction $F(\mathbf{x})$ on each instance $\mathbf{x}$. Throughout this paper we assume that the prediction $F(\mathbf{x})$ can be computed in time polynomial in $n$.

For a particular prediction $F(\mathbf{x})$, the goal of a local explanation is to clarify why the function $F$ gave its prediction on instance $\mathbf{x}$, usually by attributing credit to the features. We will focus on local explanations that are inspired by game-theoretic Shapley values (Datta, Sen, and Zick 2016; Lundberg and Lee 2017). Specifically, we will work with the Shap explanations as defined by Lundberg and Lee (2017).

2.1 Shap Explanations

To produce Shap explanations, one needs an additional ingredient: a probability distribution $\Pr(\mathbf{X})$ over the features, which we call the data distribution. We will use this distribution to reason about partial instances. Concretely, for a set of indices $S\subseteq[n]=\{1,\dots,n\}$, we let $\mathbf{x}_{S}$ denote the restriction of the complete instance $\mathbf{x}$ to those features $\mathbf{X}_{S}$ with indices in $S$. Abusing notation, we will also use $\mathbf{x}_{S}$ to denote the probabilistic event $\mathbf{X}_{S}=\mathbf{x}_{S}$.

Under this data distribution, it now becomes possible to ask for the expected value of the predictive function $F$. Clearly, for a complete data instance $\mathbf{x}$ we have that $\mathbf{E}[F\mid\mathbf{x}]=F(\mathbf{x})$, as there is no uncertainty about the features. However, for a partial instance $\mathbf{x}_{S}$, which does not assign values to the features outside of $\mathbf{X}_{S}$, we appeal to the data distribution $\Pr$ to compute the expectation of $F$ as $\mathbf{E}_{\Pr}[F\mid\mathbf{x}_{S}]=\sum_{\mathbf{x}\in\mathcal{X}}F(\mathbf{x})\Pr(\mathbf{x}\mid\mathbf{x}_{S})$.

The Shap explanation framework draws from Shapley values in cooperative game theory. Given a particular instance $\mathbf{x}$, it considers the features $\mathbf{X}$ to be players in a coalition game: the game of making a prediction for $\mathbf{x}$. Shap explanations are defined in terms of a set function $\operatorname{v}_{F,\mathbf{x},\Pr}:2^{\mathbf{X}}\to\mathbb{R}$. Its purpose is to evaluate the "value" of each coalition of players/features $\mathbf{X}_{S}\subseteq\mathbf{X}$ in making the prediction $F(\mathbf{x})$ under data distribution $\Pr$. Concretely, following Lundberg and Lee (2017), this value function is the conditional expectation of $F$:

$\operatorname{v}_{F,\mathbf{x},\Pr}(\mathbf{X}_{S})\stackrel{\mathrm{def}}{=}\mathbf{E}_{\Pr}[F\mid\mathbf{x}_{S}].$ (1)

We will elide $F$, $\mathbf{x}$, and $\Pr$ when they are clear from context.

Our goal, however, is to assign credit to individual features. In the context of a coalition $\mathbf{X}_{S}$, the contribution of an individual feature $X\notin\mathbf{X}_{S}$ is given by

$\operatorname{c}(X,\mathbf{X}_{S})\stackrel{\mathrm{def}}{=}\operatorname{v}(\mathbf{X}_{S}\cup\{X\})-\operatorname{v}(\mathbf{X}_{S}),$ (2)

where each term is implicitly w.r.t. the same $F$, $\mathbf{x}$, and $\Pr$.

Finally, the Shap explanation computes a score for each feature $X\in\mathbf{X}$ averaged over all possible contexts, and thus measures the influence feature $X$ has on the outcome. Let $\pi$ be a permutation on the set of features $\mathbf{X}$, i.e., $\pi$ fixes a total order on all features. Let $\pi^{<X}$ be the set of features that come before $X$ in the order $\pi$. The Shap explanations are then defined as computing the following scores.

Definition 1 (Shap Score).

Fix an entity $\mathbf{x}$, a predictive function $F$, and a data distribution $\Pr$. The Shap explanation of a feature $X$ is the contribution of $X$ given the features $\pi^{<X}$, averaged over all permutations $\pi$:

$\textsc{Shap}(X)\stackrel{\mathrm{def}}{=}\frac{1}{n!}\sum_{\pi}\operatorname{c}(X,\pi^{<X}).$ (3)
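To make Definition 1 concrete, the following brute-force sketch (our own illustrative Python, exponential in $n$ and usable only for tiny examples; all names are ours) computes the score of Equation 3 literally, representing the data distribution as an explicit map from complete instances to probabilities:

from itertools import permutations
from math import factorial

def expectation(F, Pr, xS):
    """E[F | x_S]: conditional expectation of F under Pr, where Pr maps
    complete instances (tuples) to probabilities and xS fixes features."""
    match = [(x, p) for x, p in Pr.items()
             if all(x[i] == v for i, v in xS.items())]
    mass = sum(p for _, p in match)
    return sum(F(x) * p for x, p in match) / mass if mass else 0.0

def shap(F, Pr, x, i):
    """Shap score of feature i for instance x (Definition 1, Equation 3)."""
    n = len(x)
    total = 0.0
    for pi in permutations(range(n)):
        xS = {j: x[j] for j in pi[:pi.index(i)]}   # features before i in pi
        # contribution c(X_i, pi^{<X_i}) of Equation 2:
        total += expectation(F, Pr, {**xS, i: x[i]}) - expectation(F, Pr, xS)
    return total / factorial(n)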

We mention two simple properties of the Shap explanations here; for more discussion see Datta, Sen, and Zick (2016) and Lundberg et al. (2020). First, for a linear combination of functions $G(\cdot)=\sum_{k}\lambda_{k}F_{k}(\cdot)$, we have that

$\textsc{Shap}_{G}(X)=\sum_{k}\lambda_{k}\textsc{Shap}_{F_{k}}(X).$ (4)

Second, the sum of the Shap explanations of all features is related to the expected value of $F$:

$\sum_{i}\textsc{Shap}_{F}(X_{i})=F(\mathbf{x})-\mathbf{E}[F].$ (5)
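As a quick sanity check of Equation 5, the brute-force sketch above can verify that the Shap scores sum to $F(\mathbf{x})-\mathbf{E}[F]$ on a toy fully-factorized distribution (again purely illustrative):

# Toy distribution over two independent binary features with
# Pr(X1=1)=0.3 and Pr(X2=1)=0.6, and a toy linear model F.
Pr = {(a, b): (0.3 if a else 0.7) * (0.6 if b else 0.4)
      for a in (0, 1) for b in (0, 1)}
F = lambda x: 2.0 * x[0] + x[1]
x = (1, 1)
lhs = shap(F, Pr, x, 0) + shap(F, Pr, x, 1)
rhs = F(x) - sum(F(y) * p for y, p in Pr.items())   # F(x) - E[F]
assert abs(lhs - rhs) < 1e-9                        # Equation 5 holds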

2.2 Computational Problems

This paper studies the complexity of computing $\textsc{Shap}(X)$, a task we formally define next. We write F for a class of functions. We also write $\text{\tt PR}_{n}$ for a class of data distributions over $n$ features, and let $\text{\tt PR}=\bigcup_{n}\text{\tt PR}_{n}$. We assume that all parameters are rationals. Because Shap explanations are for an arbitrary fixed instance $\mathbf{x}$, we will simplify the notation throughout this paper by assuming it to be the instance $\mathbf{e}=(1,1,\ldots,1)$, and that each domain contains the value $1$, which is without loss of generality.

Definition 2 (Shap Computational Problems).

For each function class F and distribution class PR, consider the following computational problems.

  • The functional Shap problem $\text{\tt F-SHAP}(\text{\tt F},\text{\tt PR})$: given a data distribution $\Pr\in\text{\tt PR}$ and a function $F\in\text{\tt F}$, compute $\textsc{Shap}(X_{1}),\ldots,\textsc{Shap}(X_{n})$.

  • The decision Shap problem $\text{\tt D-SHAP}(\text{\tt F},\text{\tt PR})$: given a data distribution $\Pr\in\text{\tt PR}$, a function $F\in\text{\tt F}$, a feature $X\in\mathbf{X}$, and a threshold $t\in\mathbb{R}$, decide whether $\textsc{Shap}(X)>t$.

To establish the complexities of these problems, we use standard notions of reductions. A polynomial-time reduction from a problem A to a problem B, denoted $\text{\tt A}\leq^{P}\text{\tt B}$ and also called a Cook reduction, is a polynomial-time algorithm for problem A with access to an oracle for problem B. We write $\text{\tt A}\equiv^{P}\text{\tt B}$ when both $\text{\tt A}\leq^{P}\text{\tt B}$ and $\text{\tt B}\leq^{P}\text{\tt A}$.

In the remainder of this paper, we study the computational complexity of these problems for natural hypothesis classes F that are popular in machine learning, as well as common classes of data distributions PR, including those most often used to compute Shap explanations.

3 Shap over Fully-Factorized Distributions

We start our study of the complexity of Shap by considering the simplest probability distribution: a fully-factorized distribution, where all features are independent.

There are both practical and computational reasons why it makes sense to assume a fully-factorized data distribution when computing Shap explanations. First, functions $F$ are often the product of a supervised learning algorithm that does not have access to a generative model of the data – it is purely discriminative. Hence, it is convenient to make the practical assumption that the data distribution is fully factorized, and therefore easy to estimate. Second, fully-factorized distributions are highly tractable; for example, they make it easy to compute expectations of linear regression functions (Khosravi et al. 2019b) and to solve other inference tasks that are hard in general (Vergari et al. 2020).

Lundberg and Lee (2017) indeed observe that computing the Shap explanation on an arbitrary data distribution is challenging and consider using fully-factorized distributions (Sec. 4, Eq. 11). Other prior work on computing explanations also uses fully-factorized feature distributions, e.g., Datta, Sen, and Zick (2016); Štrumbelj and Kononenko (2014). As we will show, the Shap explanation can be computed efficiently for several popular classifiers when the distribution is fully factorized. Yet, such simple data distributions are no guarantee of tractability: computing Shap scores is intractable for some other common classifiers.

3.1 Equivalence to Computing Expectations

Before studying various function classes, we prove a key result that connects the complexity of Shap explanations to the complexity of computing expectations.

Let $\text{\tt IND}_{n}$ be the class of fully-factorized probability distributions over $n$ discrete and independent random variables $X_{1},\ldots,X_{n}$. That is, for every instance $(x_{1},\ldots,x_{n})\in\mathcal{X}$, we have that $\Pr(X_{1}=x_{1},\ldots,X_{n}=x_{n})=\prod_{i}\Pr(X_{i}=x_{i})$. Let $\text{\tt IND}\stackrel{\mathrm{def}}{=}\bigcup_{n\geq 0}\text{\tt IND}_{n}$. We show that for every function class F, the complexity of $\text{\tt F-SHAP}(\text{\tt F},\text{\tt IND})$ is the same as that of the fully-factorized expectation problem.

Definition 3 (Fully-Factorized Expectation Problem).

Let F be a class of real-valued functions with discrete inputs. The fully-factorized expectation problem for F, denoted $\text{\tt E}(\text{\tt F})$, is the following: given a function $F\in\text{\tt F}$ and a probability distribution $\Pr\in\text{\tt IND}$, compute $\mathbf{E}_{\Pr}[F]$.

We know from Equation 5 that for any function $F$ over $n$ features, $\text{\tt E}(\{F\})\leq^{P}\text{\tt F-SHAP}(\{F\},\text{\tt IND}_{n})$, because $\mathbf{E}[F]=F(\mathbf{x})-\sum_{i=1,n}\textsc{Shap}_{F}(X_{i})$. In this section we prove that the converse holds too:

Theorem 1.

For any function $F:\mathcal{X}\to\mathbb{R}$, we have that $\text{\tt F-SHAP}(\{F\},\text{\tt IND}_{n})\equiv^{P}\text{\tt E}(\{F\})$.

In other words, for any function $F$, the complexity of computing the Shap scores is the same as the complexity of computing the expected value $\mathbf{E}[F]$ under a fully-factorized data distribution. One direction of the proof is immediate: $\text{\tt E}(\{F\})\leq^{P}\text{\tt F-SHAP}(\{F\},\text{\tt IND}_{n})$ because, if we are given an oracle to compute $\textsc{Shap}_{F}(X_{i})$ for every feature $X_{i}$, then we can obtain $\mathbf{E}[F]$ from Equation 5 (recall that we assumed that $F(\mathbf{x})$ is computable in polynomial time). The hard part of the proof is the opposite direction: we will show in Sec. 3.2 how to compute $\textsc{Shap}_{F}(X_{i})$ given an oracle for computing $\mathbf{E}[F]$. Theorem 1 immediately extends to classes of functions F, and to any number of variables, and therefore implies that $\text{\tt F-SHAP}(\text{\tt F},\text{\tt IND})\equiv^{P}\text{\tt E}(\text{\tt F})$.

Sections 3.3 and 3.4 will discuss the consequences of this result, by delineating which function classes support tractable Shap explanations, and which do not. The next section is devoted to proving our main technical result.

3.2 Proof of Theorem 1

We start with the special case when all features $X$ are binary: $\operatorname{dom}(X)=\{0,1\}$. We denote by $\text{\tt INDB}_{n}$ the class of fully-factorized distributions over binary domains.

Theorem 2.

For any function $F:\{0,1\}^{n}\to\mathbb{R}$, we have that $\text{\tt F-SHAP}(\{F\},\text{\tt INDB}_{n})\equiv^{P}\text{\tt E}(\{F\})$.

Proof.

We prove only $\text{\tt F-SHAP}(\{F\},\text{\tt INDB}_{n})\leq^{P}\text{\tt E}(\{F\})$; the opposite direction follows immediately from Equation 5. We will assume w.l.o.g. that $F$ has $n+1$ binary features $\mathbf{X}^{\prime}=\{X_{0}\}\cup\mathbf{X}$ and show how to compute $\textsc{Shap}_{F}(X_{0})$ using repeated calls to an oracle for computing $\mathbf{E}[F]$, i.e., the expectation of the same function $F$, but over fully-factorized distributions with different probabilities. The probability distribution $\Pr$ is given to us by $n+1$ rational numbers, $p_{i}\stackrel{\mathrm{def}}{=}\Pr(X_{i}=1)$, $i=0,n$; obviously, $\Pr(X_{i}=0)=1-p_{i}$. Recall that the instance whose outcome we want to explain is $\mathbf{e}=(1,\ldots,1)$, and that for any set $S\subseteq[n]$ we write $\mathbf{e}_{S}$ for the event $\bigwedge_{i\in S}(X_{i}=1)$. Then, we have that

$\textsc{Shap}(X_{0})=\sum_{k=0,n}\frac{k!(n-k)!}{(n+1)!}\,D_{k},\quad\text{where}$ (6)
$D_{k}\stackrel{\mathrm{def}}{=}\sum_{S\subseteq[n]:|S|=k}\left(\mathbf{E}[F\mid\mathbf{e}_{S\cup\{0\}}]-\mathbf{E}[F\mid\mathbf{e}_{S}]\right).$

Let $F_{0}\stackrel{\mathrm{def}}{=}F[X_{0}:=0]$ and $F_{1}\stackrel{\mathrm{def}}{=}F[X_{0}:=1]$ (both are functions of the $n$ binary features $\mathbf{X}=\{X_{1},\ldots,X_{n}\}$). Then:

$\mathbf{E}[F\mid\mathbf{e}_{S\cup\{0\}}]=\mathbf{E}[F_{1}\mid\mathbf{e}_{S}]$
$\mathbf{E}[F\mid\mathbf{e}_{S}]=\mathbf{E}[F_{0}\mid\mathbf{e}_{S}]\cdot(1-p_{0})+\mathbf{E}[F_{1}\mid\mathbf{e}_{S}]\cdot p_{0}$

and therefore $D_{k}$ is given by:

$D_{k}=(1-p_{0})\sum_{S\subseteq[n]:|S|=k}\left(\mathbf{E}[F_{1}\mid\mathbf{e}_{S}]-\mathbf{E}[F_{0}\mid\mathbf{e}_{S}]\right)$

For any function $G$, Equation 1 defines the value $\operatorname{v}_{G,\mathbf{e},\Pr}(\mathbf{X}_{S})$ as $\mathbf{E}[G\mid\mathbf{e}_{S}]$. Abusing notation, we write $v_{G,k}$ for the sum of these quantities over all sets $S$ of cardinality $k$:

$v_{G,k}\stackrel{\mathrm{def}}{=}\sum_{S\subseteq[n],|S|=k}\mathbf{E}[G\mid\mathbf{e}_{S}].$ (7)

We will prove the following claim.

Claim 1.

Let $G$ be a function over $n$ binary variables. Then the $n+1$ quantities $v_{G,0},\ldots,v_{G,n}$ can be computed in polynomial time, using $n+1$ calls to an oracle for $\text{\tt E}(\{G\})$.

Note that an oracle for $\text{\tt E}(\{F\})$ is also an oracle for both $\text{\tt E}(\{F_{0}\})$ and $\text{\tt E}(\{F_{1}\})$, by simply setting $\Pr(X_{0}=1)=0$ or $\Pr(X_{0}=1)=1$ respectively. Therefore, Claim 1 proves Theorem 2: we apply it once to $F_{0}$ and once to $F_{1}$ to derive all the quantities $v_{F_{0},k}$ and $v_{F_{1},k}$, thereby computing $D_{k}$, and finally compute $\textsc{Shap}_{F}(X_{0})$ using Equation 6. It remains to prove Claim 1.

Fix a function $G$ over $n$ binary variables and let $v_{k}=v_{G,k}$. Let $p_{j}=\Pr(X_{j}=1)$, for $j=1,n$, define the distribution over which we need to compute $v_{0},\ldots,v_{n}$. We will prove the following additional claim.

Claim 2.

Given any real number $z>0$, consider the distribution $\Pr_{z}(X_{j})=p^{\prime}_{j}\stackrel{\mathrm{def}}{=}\frac{p_{j}+z}{1+z}$, for $j=1,n$. Let $\mathbf{E}_{z}[G]$ denote $\mathbf{E}[G]$ under the distribution $\Pr_{z}$. We then have that

$\sum_{k=0,n}z^{k}\cdot v_{k}=(1+z)^{n}\cdot\mathbf{E}_{z}[G].$ (8)

Assuming Claim 2 holds, we prove Claim 1. Choose any $n+1$ distinct values for $z$, use the oracle to compute the quantities $\mathbf{E}_{z_{0}}[G],\ldots,\mathbf{E}_{z_{n}}[G]$, and form a system of $n+1$ linear equations (8) with unknowns $v_{0},\ldots,v_{n}$. Next, observe that its matrix is a non-singular Vandermonde matrix, hence the system has a unique solution, which can be computed in polynomial time. It remains to prove Claim 2.

Because of independence, the probability of an instance $\mathbf{x}\in\{0,1\}^{n}$ is $\Pr(\mathbf{x})=\prod_{i:\mathbf{x}_{i}=1}p_{i}\cdot\prod_{i:\mathbf{x}_{i}=0}(1-p_{i})$, where $\mathbf{x}_{i}$ looks up the value of feature $X_{i}$ in instance $\mathbf{x}$. Similarly, $\Pr_{z}(\mathbf{x})=\prod_{i:\mathbf{x}_{i}=1}p_{i}^{\prime}\cdot\prod_{i:\mathbf{x}_{i}=0}(1-p_{i}^{\prime})$. By direct calculation we derive:

$\Pr(\mathbf{x})\prod_{i:\mathbf{x}_{i}=1}\left(1+\frac{z}{p_{i}}\right)=(1+z)^{n}\cdot\Pr_{z}(\mathbf{x})$ (9)

Separately, we also derive the following identity, using the fact that $\Pr(\mathbf{e}_{S})=\prod_{i\in S}p_{i}$ by independence:

$\mathbf{E}[G\mid\mathbf{e}_{S}]=\frac{1}{\prod_{i\in S}p_{i}}\sum_{\mathbf{x}:\mathbf{x}_{S}=\mathbf{e}_{S}}G(\mathbf{x})\cdot\Pr(\mathbf{x})$ (10)

We are now in a position to prove Claim 2:

$\sum_{k=0,n}z^{k}\cdot v_{k}=\sum_{k=0,n}z^{k}\sum_{S\subseteq[n]:|S|=k}\mathbf{E}[G\mid\mathbf{e}_{S}]$
$=\sum_{S\subseteq[n]}z^{|S|}\cdot\mathbf{E}[G\mid\mathbf{e}_{S}]$
$=\sum_{S\subseteq[n]}\frac{z^{|S|}}{\prod_{i\in S}p_{i}}\sum_{\mathbf{x}:\mathbf{x}_{S}=\mathbf{e}_{S}}G(\mathbf{x})\cdot\Pr(\mathbf{x})$

The last line follows from Equation 10. Next, we exchange the summations $\sum_{S}$ and $\sum_{\mathbf{x}}$, after which we apply the identity $\sum_{S\subseteq A}\prod_{i\in S}u_{i}=\prod_{i\in A}(1+u_{i})$:

$=\sum_{\mathbf{x}\in\{0,1\}^{n}}G(\mathbf{x})\cdot\Pr(\mathbf{x})\sum_{S:\mathbf{x}_{S}=\mathbf{e}_{S}}\frac{z^{|S|}}{\prod_{i\in S}p_{i}}$
$=\sum_{\mathbf{x}\in\{0,1\}^{n}}G(\mathbf{x})\cdot\Pr(\mathbf{x})\prod_{i:\mathbf{x}_{i}=1}\left(1+\frac{z}{p_{i}}\right)$
$=(1+z)^{n}\sum_{\mathbf{x}\in\{0,1\}^{n}}G(\mathbf{x})\cdot\Pr_{z}(\mathbf{x})=(1+z)^{n}\cdot\mathbf{E}_{z}[G].$

The final line uses Equation 9. This completes the proof of Claim 2 as well as Theorem 2. ∎
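To summarize the construction, here is a sketch (our own code, not part of the paper's artifacts) of the resulting algorithm for binary features. It assumes an oracle expect(probs) returning $\mathbf{E}[F]$ under the fully-factorized distribution with marginals $\Pr(X_{i}=1)=\text{probs}[i]$, evaluates Equation 8 at $n+1$ distinct values of $z$, solves the resulting Vandermonde system for the quantities of Equation 7, and combines them via Equation 6:

import numpy as np
from math import factorial

def shap_x0(expect, p):
    """Shap score of X_0 at the instance e = (1, ..., 1), following the
    proof of Theorem 2. expect(probs) is an oracle returning E[F] under
    the fully-factorized distribution with Pr(X_i=1) = probs[i];
    p = [p_0, ..., p_n] are the true marginals of the n+1 binary features."""
    n = len(p) - 1

    def v(fix_x0):
        # v_{G,0}, ..., v_{G,n} of Equation 7 for G = F[X_0 := fix_x0],
        # recovered by solving the Vandermonde system of Equation 8.
        zs = [1.0 + t for t in range(n + 1)]        # n+1 distinct z > 0
        A = np.vander(zs, n + 1, increasing=True)   # A[i, k] = zs[i] ** k
        b = [(1 + z) ** n *
             expect([float(fix_x0)] + [(pj + z) / (1 + z) for pj in p[1:]])
             for z in zs]
        return np.linalg.solve(A, np.array(b))

    v0, v1 = v(0), v(1)
    # D_k = (1 - p_0) * (v_{F_1,k} - v_{F_0,k}); combine via Equation 6.
    return sum(factorial(k) * factorial(n - k) / factorial(n + 1)
               * (1 - p[0]) * (v1[k] - v0[k]) for k in range(n + 1))

In exact rational arithmetic this runs in polynomial time; with floating point, the Vandermonde system becomes ill-conditioned as $n$ grows, so a practical implementation would solve it over the rationals.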

Next, we generalize this result from binary features to arbitrary discrete features. Fix a function with $n$ inputs, $F:\mathcal{X}\to\mathbb{R}$ where $\mathcal{X}\stackrel{\mathrm{def}}{=}\prod_{i}\operatorname{dom}(X_{i})$, and each domain is an arbitrary finite set $\operatorname{dom}(X_{i})=\{1,2,\ldots,m_{i}\}$; we assume w.l.o.g. that $m_{i}>1$. A fully-factorized probability space $\Pr\in\text{\tt IND}_{n}$ is defined by numbers $p_{ij}\in[0,1]$, $i=1,n$, $j=1,m_{i}$, such that, for all $i$, $\sum_{j}p_{ij}=1$. Given $F$ and $\Pr$ over the domain $\prod_{i}\operatorname{dom}(X_{i})$, we define their projections $F_{\pi},\Pr_{\pi}$ over the binary domain $\{0,1\}^{n}$ as follows. For any instance $\mathbf{x}\in\{0,1\}^{n}$, let $T(\mathbf{x})$ denote the event asserting that $X_{j}=1$ iff $\mathbf{x}_{j}=1$. Formally,

$T(\mathbf{x})\stackrel{\mathrm{def}}{=}\bigwedge_{j:\mathbf{x}_{j}=1}(X_{j}=1)\wedge\bigwedge_{j:\mathbf{x}_{j}=0}(X_{j}\neq 1).$

Then, the projections are defined as follows: for all $\mathbf{x}\in\{0,1\}^{n}$,

$\Pr_{\pi}(\mathbf{x})\stackrel{\mathrm{def}}{=}\Pr(T(\mathbf{x}))\qquad F_{\pi}(\mathbf{x})\stackrel{\mathrm{def}}{=}\mathbf{E}[F\mid T(\mathbf{x})]$ (11)

Notice that $F_{\pi}$ depends both on $F$ and on the probability distribution $\Pr$. Intuitively, the projections only distinguish between $X_{j}=1$ and $X_{j}\neq 1$; for example:

$F_{\pi}(1,0,0)=\mathbf{E}[F\mid(X_{1}=1,X_{2}\neq 1,X_{3}\neq 1)]$
$\Pr_{\pi}(1,0,0)=\Pr(X_{1}=1,X_{2}\neq 1,X_{3}\neq 1)$

We prove the following result in Appendix B:

Proposition 3.

Let $F:\mathcal{X}\to\mathbb{R}$ be a function with $n$ input features, and $\Pr\in\text{\tt IND}_{n}$ a fully-factorized distribution over $\mathcal{X}$. Then (1) for any feature $X_{j}$, $\textsc{Shap}_{F,\Pr}(X_{j})=\textsc{Shap}_{F_{\pi},\Pr_{\pi}}(X_{j})$, and (2) $\text{\tt E}(\{F_{\pi}\})\leq^{P}\text{\tt E}(\{F\})$.

Item (1) states that the Shap score of $F$ computed over the probability space $\Pr$ is the same as that of its projection $F_{\pi}$ (which depends on $\Pr$) over the projected probability space $\Pr_{\pi}$. Item (2) says that, for any probability space over $\{0,1\}^{n}$ (not necessarily $\Pr_{\pi}$), we can compute $\mathbf{E}[F_{\pi}]$ in polynomial time given access to an oracle for computing $\mathbf{E}[F]$. We can now complete the proof of Theorem 1, by showing that $\text{\tt F-SHAP}(\{F\},\text{\tt IND}_{n})\leq^{P}\text{\tt E}(\{F\})$. Given a function $F$ and a probability space $\Pr\in\text{\tt IND}_{n}$, in order to compute $\textsc{Shap}_{F,\Pr}(X_{j})$, by item (1) of Proposition 3 it suffices to compute $\textsc{Shap}_{F_{\pi},\Pr_{\pi}}(X_{j})$. By Theorem 2, we can compute the latter given access to an oracle for computing $\mathbf{E}[F_{\pi}]$. Finally, by item (2) of the proposition, we can compute $\mathbf{E}[F_{\pi}]$ given an oracle for computing $\mathbf{E}[F]$.

3.3 Tractable Function Classes

Given the polynomial-time equivalence between computing Shap explanations and computing expectations under fully-factorized distributions, a natural next question is: which real-world hypothesis classes in machine learning support efficient computation of Shap scores?

Corollary 4.

For the following function classes F, computing Shap scores $\text{\tt F-SHAP}(\text{\tt F},\text{\tt IND})$ is in polynomial time in the size of the representations of the function $F\in\text{\tt F}$ and the fully-factorized distribution $\Pr\in\text{\tt IND}$.

  1.

    Linear regression models

  2.

    Decision and regression trees

  3.

    Random forests or additive tree ensembles

  4.

    Factorization machines, regression circuits

  5.

    Boolean functions in d-DNNF, binary decision diagrams

  6.

    Bounded-treewidth Boolean functions in CNF

These are all consequences of Theorem 1 and the fact that computing fully-factorized expectations $\text{\tt E}(\text{\tt F})$ for these function classes F is in polynomial time. Concretely, we have the following observations about fully-factorized expectations:

  1.

    Expectations of linear regression functions are efficiently computed by mean imputation (Khosravi et al. 2019b). The tractability of Shap on linear regression models is well known. In fact, Štrumbelj and Kononenko (2014) provide a closed-form formula for this case.

  2.

    Paths from root to leaf in a decision or regression tree are mutually exclusive. Their expected value is therefore the sum of the expected values of each path, which are tractable to compute within IND; see Khosravi et al. (2020). A minimal sketch of this computation, together with the linear case above, follows this list.

  3.

    Additive mixtures of trees, as obtained through bagging or boosting, are tractable, by the linearity of expectation.

  4.

    Factorization machines extend linear regression models with feature interaction terms and factorize the parameters of the higher-order terms (Rendle 2010). Their expectations remain easy to compute. Regression circuits are a graph-based generalization of linear regression. Khosravi et al. (2019a) provide an algorithm to efficiently take their expectation w.r.t. a probabilistic circuit distribution, which is trivial to construct for the fully-factorized case.
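The first two items can be made concrete with a short sketch (our own code, with hypothetical helper names): expectations of linear models reduce to mean imputation, and expectations of trees are computed by one recursive pass that sums over root-to-leaf paths:

def expect_linear(w0, w, means):
    """E[w0 + sum_i w_i X_i] under independence: mean imputation."""
    return w0 + sum(wi * mi for wi, mi in zip(w, means))

def expect_tree(node, probs):
    """E[tree] under independent binary features. A node is either a leaf
    value (float) or a tuple (feature, low, high), where low/high are the
    subtrees for X_feature = 0 / 1; probs[i] = Pr(X_i = 1)."""
    if not isinstance(node, tuple):       # leaf: constant prediction
        return node
    i, low, high = node
    # paths are mutually exclusive, so branch expectations simply add up
    return ((1 - probs[i]) * expect_tree(low, probs)
            + probs[i] * expect_tree(high, probs))

# Example: expect_linear(0.0, [2.0, 1.0], [0.3, 0.6]) == 1.2, and a stump
# (0, 0.0, 1.0) splitting on X_0 has expectation probs[0].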

The remaining tractable cases are Boolean functions. Computing fully-factorized expectations of Boolean functions is widely known as the weighted model counting (WMC) task (Sang, Beame, and Kautz 2005; Chavira and Darwiche 2008). WMC has been extensively studied in both the theory and AI communities, and the precise complexity of $\text{\tt E}(\text{\tt F})$ is known for many families of Boolean functions F. These results immediately carry over to the $\text{\tt F-SHAP}(\text{\tt F},\text{\tt IND})$ problem through Theorem 1:

  5.

    Expectations can be computed in time linear in the size of various circuit representations, called d-DNNF, which include binary decision diagrams (OBDD, FBDD) and SDDs (Bryant 1986; Darwiche and Marquis 2002). (In contemporaneous work, Arenas et al. (2020) also show that the Shap explanation is tractable for d-DNNFs, but for the more restricted class of uniform data distributions.)

  6.

    Bounded-treewidth CNFs are efficiently compiled into OBDD circuits (Ferrara, Pan, and Vardi 2005), and thus enjoy tractable expectations.

To conclude this section, the reader may wonder about the algorithmic complexity of solving $\text{\tt F-SHAP}(\text{\tt F},\text{\tt IND})$ with an oracle for $\text{\tt E}(\text{\tt F})$ under the reduction in Section 3.2. Briefly, we require a linear number of calls to the oracle, as well as time in $O(n^{3})$ for solving a system of linear equations. Hence, for those classes, such as d-DNNF circuits, where expectations are linear in the size of the (circuit) representation of $F$, computing $\text{\tt F-SHAP}(\text{\tt F},\text{\tt IND})$ is also linear in the representation size and polynomial in $n$.

3.4 Intractable Function Classes

The polynomial-time equivalence of Theorem 1 also implies that computing Shap scores must be intractable whenever computing fully-factorized expectations is intractable. This section reviews some such function classes F, including some for which the computational hardness of $\text{\tt E}(\text{\tt F})$ is well known. We begin, however, with a more surprising result.

Logistic regression is one of the simplest and most widely used machine learning models, yet it is conspicuously absent from Corollary 4. We prove that computing the expectation of a logistic regression model is #P-hard, even under a uniform data distribution; this result is of independent interest.

A logistic regression model is a parameterized function $F(\bm{x})\stackrel{\mathrm{def}}{=}\sigma(\bm{w}\cdot\bm{x})$, where $\bm{w}=(w_{0},w_{1},\ldots,w_{n})$ is a vector of weights, $\sigma(z)=1/(1+e^{-z})$ is the logistic function, $\bm{x}\stackrel{\mathrm{def}}{=}(1,x_{1},x_{2},\ldots,x_{n})$, and $\bm{w}\cdot\bm{x}\stackrel{\mathrm{def}}{=}\sum_{i=0,n}w_{i}x_{i}$ is the dot product. Note that we define the logistic regression function to output probabilities, not data labels. Let $\text{\tt LOGIT}_{n}$ denote the class of logistic regression functions with $n$ variables, and $\text{\tt LOGIT}=\bigcup_{n}\text{\tt LOGIT}_{n}$. We prove the following:

Theorem 5.

Computing the expectation of a logistic regression model w.r.t. a uniform data distribution is #P-hard.

The full proof in Appendix C is by reduction from counting solutions to the number partitioning problem.
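For intuition, the obvious way to compute this expectation enumerates all $2^{n}$ instances, because the sigmoid does not factorize over independent features; Theorem 5 says that, in general, one cannot do substantially better. A brute-force sketch (our own illustrative code, exponential time):

from itertools import product
from math import exp

def sigma(z):
    return 1.0 / (1.0 + exp(-z))

def expect_logit(w, probs):
    """Brute-force E[sigma(w . x)] over independent binary features;
    w = (w0, w1, ..., wn), probs[i] = Pr(X_{i+1} = 1)."""
    w0, ws = w[0], w[1:]
    total = 0.0
    for x in product((0, 1), repeat=len(ws)):
        px = 1.0
        for xi, pi in zip(x, probs):
            px *= pi if xi else 1 - pi
        total += px * sigma(w0 + sum(wi * xi for wi, xi in zip(ws, x)))
    return total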

Because the uniform distribution is contained in IND, and following Theorem 1, we immediately obtain:

Corollary 6.

The computational problems $\text{\tt E}(\text{\tt LOGIT})$ and $\text{\tt F-SHAP}(\text{\tt LOGIT},\text{\tt IND})$ are both #P-hard.

We are now ready to list general function classes for which computing the Shap explanation is #P-hard.

Corollary 7.

For the following function classes F, computing Shap scores $\text{\tt F-SHAP}(\text{\tt F},\text{\tt IND})$ is #P-hard in the size of the representations of the function $F\in\text{\tt F}$ and the fully-factorized distribution $\Pr\in\text{\tt IND}$.

  1.

    Logistic regression models (Corollary 6)

  2.

    Neural networks with sigmoid activation functions

  3.

    Naive Bayes classifiers, logistic circuits

  4.

    Boolean functions in CNF or DNF

Our intractability results stem from these observations:

  2.

    Each neuron is a logistic regression model, and therefore this class subsumes LOGIT.

  3.

    The conditional distribution used by a naive Bayes classifier is known to be equivalent to a logistic regression model (Ng and Jordan 2002). Logistic circuits are a graph-based classification model that subsumes logistic regression (Liang and Van den Broeck 2019).

  4.

    For general CNFs and DNFs, weighted model counting, and therefore $\text{\tt E}(\text{\tt F})$, is #P-hard. This is true even for very restricted classes, such as monotone 2CNF and 2DNF functions, and Horn clause logic (Wei and Selman 2005).

4 Beyond Fully-Factorized Distributions

Features in real-world data distributions are not independent. In order to capture more realistic assumptions about the data when computing Shap scores, one needs a more intricate probabilistic model. In this section we prove that computing the Shap explanation quickly becomes intractable, even over the simplest probabilistic models, namely naive Bayes models. To make computing the Shap explanation as easy as possible, we assume that the function $F$ simply outputs the value of one feature. We show that even in this case, and even though such functions belong to classes that are tractable under fully-factorized distributions, computing Shap explanations becomes computationally hard.

Let $\text{\tt NBN}_{n}$ denote the family of naive Bayes networks over $n+1$ variables $\mathbf{X}=\{X_{0},X_{1},\ldots,X_{n}\}$, with binary domains, where $X_{0}$ is a parent of all features:

$\Pr(\mathbf{X})=\Pr(X_{0})\cdot\prod_{i=1,n}\Pr(X_{i}\mid X_{0}).$

As usual, the class $\text{\tt NBN}\stackrel{\mathrm{def}}{=}\bigcup_{n\geq 0}\text{\tt NBN}_{n}$. We write $X_{0}$ for the function $F$ that returns the value of feature $X_{0}$; that is, $F(\mathbf{x})=\mathbf{x}_{0}$. We prove the following.

Theorem 8.

The decision problem $\text{\tt D-SHAP}(\{X_{0}\},\text{\tt NBN})$ is NP-hard.

The proof in Appendix D is by reduction from the number partitioning problem, similar to the proof of Corollary 6. We note that the subset sum problem has also been used to prove related hardness results, e.g., the hardness of the Shapley value in network games (Elkind et al. 2008).

This result is in sharp contrast with the complexity of the Shap score over fully-factorized distributions in Section 3. There, the complexity was dictated by the choice of function class F. Here, the function is as simple as possible, yet computing Shap is hard. This ruins any hope of achieving tractability by restricting the function, and it motivates us to restrict the probability distribution in the next section. The result is also surprising because it is efficient to compute marginal probabilities (such as the expectation of $X_{0}$) and conditional probabilities in naive Bayes distributions.
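To make the hard quantity concrete, the following sketch (ours, exponential time, reusing the brute-force shap and expectation helpers from Section 2.1) materializes a small naive Bayes distribution as an explicit table and computes $\textsc{Shap}(X_{0})$ for $F(\mathbf{x})=\mathbf{x}_{0}$:

from itertools import product

def naive_bayes_table(p0, cond):
    """Explicit joint distribution of a naive Bayes network over
    X_0, ..., X_n: p0 = Pr(X_0=1), and cond[i] is the pair
    (Pr(X_{i+1}=1 | X_0=0), Pr(X_{i+1}=1 | X_0=1))."""
    Pr = {}
    for x in product((0, 1), repeat=len(cond) + 1):
        p = p0 if x[0] else 1 - p0
        for xi, ci in zip(x[1:], cond):
            q = ci[x[0]]
            p *= q if xi else 1 - q
        Pr[x] = p
    return Pr

Pr = naive_bayes_table(0.5, [(0.2, 0.9), (0.4, 0.7)])
F = lambda x: float(x[0])            # F returns the value of X_0
print(shap(F, Pr, (1, 1, 1), 0))     # Shap(X_0) at the instance e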

Theorem 8 immediately extends to a large class of probability distributions and functions. We say that $F$ depends only on $X_{i}$ if there exist two constants $c_{0}\neq c_{1}$ such that $F(\mathbf{x})=c_{0}$ when $\mathbf{x}_{i}=0$ and $F(\mathbf{x})=c_{1}$ when $\mathbf{x}_{i}=1$. In other words, $F$ ignores all variables other than $X_{i}$, and does depend on $X_{i}$. We then have the following.

Corollary 9.

The problem $\text{\tt D-SHAP}(\text{\tt F},\text{\tt PR})$ is NP-hard when PR is any of the following classes of distributions:

  1.

    Naive Bayes, bounded-treewidth Bayesian networks

  2.

    Bayesian networks, Markov networks, factor graphs

  3.

    Decomposable probabilistic circuits

and when F is any class that contains some function $F$ that depends only on $X_{0}$, including the class of linear regression models and all the classes listed in Corollaries 4 and 7.

This corollary follows from two simple observations. First, each of the classes of probability distributions listed in the corollary can represent a naive Bayes network over binary variables $\mathbf{X}$. For example, a Markov network will consist of $n$ factors $f_{1}(X_{0},X_{1}),f_{2}(X_{0},X_{2}),\ldots,f_{n}(X_{0},X_{n})$; similar simple arguments prove that all the other classes can represent naive Bayes, including tractable probabilistic circuits such as sum-product networks (Vergari et al. 2020).

Second, for each function that depends only on $X_{0}$, there exist two distinct constants $c_{0}\neq c_{1}\in\mathbb{R}$ such that $F(\mathbf{x})=c_{0}$ when $\mathbf{x}_{0}=0$ and $F(\mathbf{x})=c_{1}$ when $\mathbf{x}_{0}=1$. For example, if we consider the class of logistic regression functions $F(\mathbf{x})=\sigma(\sum_{i}w_{i}\mathbf{x}_{i})$, then we choose the weights $w_{0}=1$, $w_{1}=\ldots=w_{n}=0$ and obtain $F(\mathbf{x})=1/2$ when $\mathbf{x}_{0}=0$ and $F(\mathbf{x})=1/(1+e^{-1})$ when $\mathbf{x}_{0}=1$. Over the binary domain $\{0,1\}$, such a function is equivalent to $F(\mathbf{x})=(c_{1}-c_{0})\mathbf{x}_{0}+c_{0}$ and, therefore, by the linearity of the Shap explanation (Equation 4), we have $\textsc{Shap}_{F}(X_{0})=(c_{1}-c_{0})\cdot\textsc{Shap}_{X_{0}}(X_{0})$ (because the Shap explanation of the constant function $c_{0}$ is 0), for which, by Theorem 8, the decision problem is NP-hard.

We end this section by proving that Theorem 8 continues to hold even if the prediction function $F$ returns the value of some leaf node of a (bounded-treewidth) Bayesian network. In other words, the hardness of the Shap explanation is not tied to the function returning the root of the network, and applies to more general functions.

Corollary 10.

The Shap decision problem for Bayesian networks with latent variables is NP-hard, even if the function FF returns a single leaf variable of the network.

The full proof is given in Appendix E.

5 Shap on Empirical Distributions

In supervised learning one does not require a generative model of the data; instead, the model is trained on a concrete data set: the training data. When a probabilistic model is needed, the training data itself is conveniently used as a probability model, called the empirical distribution. This distribution captures dependencies between features, while its set of possible worlds is limited to those in the data set. For example, the intent of the KernelSHAP algorithm by Lundberg and Lee (2017) is to compute the Shap explanation on the empirical distribution. As another example, Aas, Jullum, and Løland (2019) extend KernelSHAP to work with dependent features by estimating the conditional probabilities from the empirical distribution.

Compared to the data distributions considered in the previous sections, the empirical distribution has one key advantage: it has far fewer possible worlds with positive probability, which suggests increased tractability. Unfortunately, in this section we prove that computing the Shap explanation over the empirical distribution is #P-hard in general.

To simplify the presentation, this section assumes that all features are binary: $\operatorname{dom}(X_{j})=\{0,1\}$. The probability distribution is given by a 0/1-matrix $\bm{d}=(x_{ij})_{i\in[m],j\in[n]}$, where each row $(x_{i1},\ldots,x_{in})$ is an outcome with probability $1/m$. One can think of $\bm{d}$ as a dataset with $n$ features and $m$ data instances, where each row is one data instance. Repeated rows are possible: if a row occurs $k$ times, then its probability is $k/m$. We denote by X the class of empirical distributions. The predictive function can be any function $F:\{0,1\}^{n}\to\mathbb{R}$. As our data distribution is no longer strictly positive, we adopt the standard convention that $\mathbf{E}[F\mid\mathbf{X}_{S}=1]=0$ when $\Pr(\mathbf{X}_{S}=1)=0$.
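The empirical distribution plugs directly into Definition 1. The sketch below (our code) converts a data matrix into the explicit-table format used by the brute-force shap of Section 2.1; repeated rows accumulate probability $k/m$, and the convention above is already handled there, since expectation returns 0 on zero-mass events:

def empirical_distribution(data):
    """Map an m x n 0/1 matrix (list of row tuples) to a distribution:
    each row gets probability 1/m; duplicate rows accumulate to k/m."""
    m = len(data)
    Pr = {}
    for row in data:
        Pr[tuple(row)] = Pr.get(tuple(row), 0.0) + 1.0 / m
    return Pr

d = [(1, 0, 1), (1, 1, 0), (1, 0, 1), (0, 1, 1)]
Pr = empirical_distribution(d)       # row (1,0,1) has probability 2/4
print(shap(lambda x: float(x[2]), Pr, (1, 1, 1), 2))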

Recall from Section 2.2 that, by convention, we compute the Shap explanation w.r.t. the instance $\mathbf{e}=(1,1,\ldots,1)$, which is without loss of generality. Somewhat surprisingly, the complexity of computing the Shap explanation of a function $F$ over the empirical distribution given by a matrix $\bm{d}$ is related to the problem of computing the expectation of a certain CNF formula associated to $\bm{d}$.

Definition 4.

The positive, partitioned 2CNF formula (PP2CNF) associated to a matrix $\bm{d}\in\{0,1\}^{m\times n}$ is:

$\Phi_{\bm{d}}\stackrel{\mathrm{def}}{=}\bigwedge_{(i,j):x_{ij}=0}(U_{i}\vee V_{j}).$

Thus, a PP2CNF formula is over $m+n$ variables $U_{1},\ldots,U_{m},V_{1},\ldots,V_{n}$, and has only positive clauses. The matrix $\bm{d}$ dictates which clauses are present. A quasi-symmetric probability distribution is a fully-factorized distribution over the $m+n$ variables for which there exist two numbers $p,q\in[0,1]$ such that for every $i=1,m$, $\Pr(U_{i}=1)=p$ or $\Pr(U_{i}=1)=1$, and for every $j=1,n$, $\Pr(V_{j}=1)=q$ or $\Pr(V_{j}=1)=1$. In other words, all variables $U_{1},\ldots,U_{m}$ have the same probability $p$, or have probability $1$, and similarly for the variables $V_{1},\ldots,V_{n}$. We denote by $\text{\tt EQS}(\text{\tt PP2CNF})$ the expectation computation problem for PP2CNF over quasi-symmetric probability distributions. $\text{\tt EQS}(\text{\tt PP2CNF})$ is #P-hard, because computing $\mathbf{E}[\Phi_{\bm{d}}]$ under the uniform distribution (i.e., $\Pr(U_{1}=1)=\cdots=\Pr(V_{n}=1)=1/2$) is #P-hard (Provan and Ball 1983). We prove:

Theorem 11.

Let X be the class of empirical distributions, and let F be any class of functions such that, for each $i$, it includes some function that depends only on $X_{i}$. Then we have that $\text{\tt F-SHAP}(\text{\tt F},\text{\tt X})\equiv^{P}\text{\tt EQS}(\text{\tt PP2CNF})$.

As a consequence, the problem $\text{\tt F-SHAP}(\text{\tt F},\text{\tt X})$ is #P-hard in the size of the empirical distribution.

The theorem is surprising because the set of possible outcomes of an empirical distribution is small. This is unlike all the distributions discussed earlier, for example those mentioned in Corollary 9, which have $2^{n}$ possible outcomes, where $n$ is the number of features. In particular, given an empirical distribution $\bm{d}$, one can compute the expectation $\mathbf{E}[F]$ in polynomial time for any function $F$, by doing just one iteration over the data. Yet computing the Shap explanation of $F$ is #P-hard.
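The correspondence between a matrix and its PP2CNF is mechanical. The sketch below (ours, exponential-time enumeration, for illustration only) builds $\Phi_{\bm{d}}$ and computes $\mathbf{E}[\Phi_{\bm{d}}]$ under a given fully-factorized distribution, of which the quasi-symmetric distributions are the special case used above:

from itertools import product

def pp2cnf_clauses(d):
    """One clause U_i v V_j for every zero entry x_ij of the matrix d."""
    return [(i, j) for i, row in enumerate(d)
            for j, x in enumerate(row) if x == 0]

def expect_pp2cnf(d, pu, pv):
    """Brute-force E[Phi_d] with Pr(U_i=1)=pu[i] and Pr(V_j=1)=pv[j];
    quasi-symmetric means each pu[i] is p or 1 and each pv[j] is q or 1."""
    clauses = pp2cnf_clauses(d)
    total = 0.0
    for u in product((0, 1), repeat=len(d)):
        for v in product((0, 1), repeat=len(d[0])):
            if all(u[i] or v[j] for i, j in clauses):
                p = 1.0
                for ui, pi in zip(u, pu):
                    p *= pi if ui else 1 - pi
                for vj, qj in zip(v, pv):
                    p *= qj if vj else 1 - qj
                total += p
    return total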

Theorem 11 implies hardness of Shap explanations on the empirical distribution for a large class of functions.

Corollary 12.

The problem $\text{\tt F-SHAP}(\text{\tt F},\text{\tt X})$ is #P-hard when X is the class of empirical distributions and F is any class such that, for each feature $X_{i}$, the class contains some function that depends only on $X_{i}$. This includes all the function classes listed in Corollaries 4 and 7.

For instance, any class of Boolean functions that contains the $n$ single-variable functions $F\stackrel{\mathrm{def}}{=}X_{i}$, for $i=1,n$, falls under this corollary. Section 4 showed how the class of logistic regression functions falls under this corollary as well.

The proof of Theorem 11 follows from the following technical lemma, which is of independent interest:

Lemma 13.

We have that:

  1.

    For every matrix $\bm{d}$, $\text{\tt F-SHAP}(\text{\tt F},\bm{d})\leq^{P}\text{\tt EQS}(\{\Phi_{\bm{d}}\})$.

  2.

    $\text{\tt EQS}(\text{\tt PP2CNF})\leq^{P}\text{\tt F-SHAP}(\text{\tt F},\text{\tt X})$.

The proof of the lemma is given in Appendices F and G. The first item says that we can compute the Shap explanation in polynomial time using an oracle for computing $\mathbf{E}[\Phi_{\bm{d}}]$ over quasi-symmetric distributions. The oracle is called only on the PP2CNF $\Phi_{\bm{d}}$ associated to the data $\bm{d}$, but may be called repeatedly, with different probabilities for the Boolean variables. This is somewhat surprising because the Shap explanation is over an empirical distribution, while $\mathbf{E}[\Phi_{\bm{d}}]$ is taken over a fully-factorized distribution; there is no connection between these two distributions. This item immediately implies $\text{\tt F-SHAP}(\text{\tt F},\text{\tt X})\leq^{P}\text{\tt EQS}(\text{\tt PP2CNF})$, where X is the class of empirical distributions $\bm{d}$, since the formula $\Phi_{\bm{d}}$ is in the class PP2CNF.

The second item says that a weak form of the converse also holds: we can compute in polynomial time the expectation $\mathbf{E}[\Phi]$ over a quasi-symmetric probability distribution by using an oracle for computing Shap explanations over several matrices $\bm{d}$, not necessarily restricted to the matrix associated to $\Phi$. Together, the two items of the lemma prove Theorem 11.

We end this section with a comment on the TreeSHAP algorithm of Lundberg et al. (2020), which is computed over a distribution defined by a tree-based model. Our result implies that the problem that TreeSHAP tries to solve is #P-hard. This follows immediately by observing that every empirical distribution $\bm{d}$ can be represented by a binary tree of size polynomial in the size of $\bm{d}$. The tree examines the attributes in the order $X_{1},X_{2},\ldots,X_{n}$, and each decision node for $X_{i}$ has two branches: $X_{i}=0$ and $X_{i}=1$. A branch that does not exist in the matrix $\bm{d}$ ends in a leaf with label 0. A complete branch that corresponds to a row $x_{i1},x_{i2},\ldots,x_{in}$ in $\bm{d}$ ends in a leaf with label $1/m$ (or $k/m$ if that row occurs $k$ times in $\bm{d}$). The size of this tree is no larger than twice the size of the matrix (because of the extra dead-end branches).
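A sketch of this encoding (ours, reusing the (feature, low, high) node format of the tree sketch in Section 3.3, with probabilities $k/m$ stored at the leaves):

def empirical_to_tree(d):
    """Encode an m x n 0/1 matrix as a binary tree that branches on
    X_1, ..., X_n in order; absent branches end in a 0 leaf, and a row
    of multiplicity k ends in a leaf k/m."""
    m, n = len(d), len(d[0])
    def build(rows, depth):
        if depth == n:
            return len(rows) / m             # leaf: k/m
        zeros = [r for r in rows if r[depth] == 0]
        ones = [r for r in rows if r[depth] == 1]
        return (depth,
                build(zeros, depth + 1) if zeros else 0.0,
                build(ones, depth + 1) if ones else 0.0)
    return build(list(d), 0)

This concludes our study of Shap explanations on the empirical distribution.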

6 Perspectives and Conclusions

We establish the complexity of computing the Shap explanation in three important settings. First, we consider fully-factorized data distributions and show that for any prediction model, the complexity of computing the Shap explanation is the same as the complexity of computing the expected value of the model. It follows that there are commonly used models, such as logistic regression, for which computing Shap explanations is intractable. Going beyond fully-factorized distributions, we show that computing Shap explanations is also intractable for simple functions and simple distributions – naive Bayes and empirical distributions.

The recent literature on Shap explanations predominantly studies tradeoffs of variants of the original Shap formulation, and relies on approximation algorithms to compute the explanations. These approximation algorithms, however, tend to make simplifying assumptions which can lead to counter-intuitive explanations, see e.g., Slack et al. (2020). We believe that more focus should be given to the computational complexity of Shap explanations. In particular, which classes of machine learning models can be explained efficiently using the Shap scores? Our results show that, under the assumption of fully-factorized data distributions, there are classes of models for which the Shap explanations can be computed in polynomial time. In future work, we plan to explore if there are classes of models for which the complexity of the Shap explanations is tractable under more complex data distributions, such as the ones defined by tractable probabilistic circuits (Vergari et al. 2020) or tractable symmetric probability spaces (Van den Broeck, Meert, and Darwiche 2014; Beame et al. 2015).

Acknowledgements

This work is partially supported by NSF grants IIS-1907997, IIS-1954222, IIS-1943641, IIS-1956441, CCF-1837129, DARPA grant N66001-17-2-4032, a Sloan Fellowship, and gifts from Intel and Facebook research. Schleich is supported by a RelationalAI fellowship. The authors would like to thank YooJung Choi for valuable discussions on the proof of Theorem 5.

References

  • Aas, Jullum, and Løland (2019) Aas, K.; Jullum, M.; and Løland, A. 2019. Explaining individual predictions when features are dependent: More accurate approximations to Shapley values. arXiv preprint arXiv:1903.10464.
  • Arenas et al. (2020) Arenas, M.; Barceló, P.; Bertossi, L.; and Monet, M. 2020. The Tractability of SHAP-Score-Based Explanations over Deterministic and Decomposable Boolean Circuits. arXiv preprint arXiv:2007.14045.
  • Beame et al. (2015) Beame, P.; Van den Broeck, G.; Gribkoff, E.; and Suciu, D. 2015. Symmetric Weighted First-Order Model Counting. In Proceedings of the 34th ACM Symposium on Principles of Database Systems, PODS 2015, Melbourne, Victoria, Australia, May 31 - June 4, 2015, 313–328.
  • Bertossi et al. (2020) Bertossi, L.; Li, J.; Schleich, M.; Suciu, D.; and Vagena, Z. 2020. Causality-Based Explanation of Classification Outcomes. In Proceedings of the Fourth International Workshop on Data Management for End-to-End Machine Learning, DEEM’20. New York, NY, USA: Association for Computing Machinery.
  • Bryant (1986) Bryant, R. E. 1986. Graph-based algorithms for boolean function manipulation. Computers, IEEE Transactions on 100(8): 677–691.
  • Chavira and Darwiche (2008) Chavira, M.; and Darwiche, A. 2008. On probabilistic inference by weighted model counting. Artificial Intelligence 172(6-7): 772–799.
  • Darwiche and Marquis (2002) Darwiche, A.; and Marquis, P. 2002. A knowledge compilation map. Journal of Artificial Intelligence Research 17: 229–264.
  • Datta, Sen, and Zick (2016) Datta, A.; Sen, S.; and Zick, Y. 2016. Algorithmic Transparency via Quantitative Input Influence: Theory and Experiments with Learning Systems. In IEEE Symposium on Security and Privacy, SP 2016, San Jose, CA, USA, May 22-26, 2016, 598–617.
  • Elkind et al. (2008) Elkind, E.; Goldberg, L. A.; Goldberg, P. W.; and Wooldridge, M. J. 2008. A tractable and expressive class of marginal contribution nets and its applications. In 7th International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS 2008), Estoril, Portugal, May 12-16, 2008, Volume 2, 1007–1014.
  • Ferrara, Pan, and Vardi (2005) Ferrara, A.; Pan, G.; and Vardi, M. Y. 2005. Treewidth in verification: Local vs. global. In International Conference on Logic for Programming Artificial Intelligence and Reasoning, 489–503. Springer.
  • Gade et al. (2019) Gade, K.; Geyik, S. C.; Kenthapadi, K.; Mithal, V.; and Taly, A. 2019. Explainable AI in Industry. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’19, 3203–3204. New York, NY, USA: Association for Computing Machinery.
  • Janzing, Minorics, and Bloebaum (2020) Janzing, D.; Minorics, L.; and Bloebaum, P. 2020. Feature relevance quantification in explainable AI: A causal problem. volume 108 of Proceedings of Machine Learning Research, 2907–2916. PMLR.
  • Khosravi et al. (2019a) Khosravi, P.; Choi, Y.; Liang, Y.; Vergari, A.; and Van den Broeck, G. 2019a. On Tractable Computation of Expected Predictions. In Advances in Neural Information Processing Systems 32 (NeurIPS).
  • Khosravi et al. (2019b) Khosravi, P.; Liang, Y.; Choi, Y.; and den Broeck, G. V. 2019b. What to Expect of Classifiers? Reasoning about Logistic Regression with Missing Features. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019, 2716–2724.
  • Khosravi et al. (2020) Khosravi, P.; Vergari, A.; Choi, Y.; Liang, Y.; and Van den Broeck, G. 2020. Handling Missing Data in Decision Trees: A Probabilistic Approach. In The Art of Learning with Missing Values Workshop at ICML (Artemiss).
  • Kumar et al. (2020) Kumar, I. E.; Venkatasubramanian, S.; Scheidegger, C.; and Friedler, S. 2020. Problems with Shapley-value-based explanations as feature importance measures. In Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020.
  • Liang and Van den Broeck (2019) Liang, Y.; and Van den Broeck, G. 2019. Learning Logistic Circuits. In Proceedings of the 33rd Conference on Artificial Intelligence (AAAI).
  • Lundberg et al. (2020) Lundberg, S. M.; Erion, G.; Chen, H.; DeGrave, A.; Prutkin, J. M.; Nair, B.; Katz, R.; Himmelfarb, J.; Bansal, N.; and Lee, S. 2020. From Local Explanations to Global Understanding with Explainable AI for Trees. Nature Machine Intelligence 2: 56–67.
  • Lundberg, Erion, and Lee (2018) Lundberg, S. M.; Erion, G. G.; and Lee, S.-I. 2018. Consistent individualized feature attribution for tree ensembles. arXiv preprint arXiv:1802.03888.
  • Lundberg and Lee (2017) Lundberg, S. M.; and Lee, S. 2017. A Unified Approach to Interpreting Model Predictions. In Advances in neural information processing systems (NIPS), 4765–4774.
  • Merrick and Taly (2020) Merrick, L.; and Taly, A. 2020. The Explanation Game: Explaining Machine Learning Models Using Shapley Values. In International Cross-Domain Conference for Machine Learning and Knowledge Extraction, 17–38. Springer.
  • Ng and Jordan (2002) Ng, A. Y.; and Jordan, M. I. 2002. On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes. In Advances in neural information processing systems, 841–848.
  • Provan and Ball (1983) Provan, J. S.; and Ball, M. O. 1983. The Complexity of Counting Cuts and of Computing the Probability that a Graph is Connected. SIAM J. Comput. 12(4): 777–788.
  • Rendle (2010) Rendle, S. 2010. Factorization machines. In 2010 IEEE International Conference on Data Mining, 995–1000. IEEE.
  • Roth (1988) Roth, A. e. 1988. The Shapley Value: Essays in Honor of Lloyd S. Shapley. Cambridge Univ. Press.
  • Sang, Beame, and Kautz (2005) Sang, T.; Beame, P.; and Kautz, H. A. 2005. Performing Bayesian inference by weighted model counting. In AAAI, volume 5, 475–481.
  • Slack et al. (2020) Slack, D.; Hilgard, S.; Jia, E.; Singh, S.; and Lakkaraju, H. 2020. Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods. In AAAI/ACM Conference on AI, Ethics, and Society (AIES).
  • Štrumbelj and Kononenko (2014) Štrumbelj, E.; and Kononenko, I. 2014. Explaining prediction models and individual predictions with feature contributions. Knowledge and information systems 41(3): 647–665.
  • Sundararajan and Najmi (2020) Sundararajan, M.; and Najmi, A. 2020. The many Shapley values for model explanation. In Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020.
  • Van den Broeck, Meert, and Darwiche (2014) Van den Broeck, G.; Meert, W.; and Darwiche, A. 2014. Skolemization for Weighted First-Order Model Counting. In Principles of Knowledge Representation and Reasoning: Proceedings of the Fourteenth International Conference, KR 2014, Vienna, Austria, July 20-24, 2014.
  • Vergari et al. (2020) Vergari, A.; Choi, Y.; Peharz, R.; and Van den Broeck, G. 2020. Probabilistic Circuits: Representations, Inference, Learning and Applications. AAAI Tutorial.
  • Wei and Selman (2005) Wei, W.; and Selman, B. 2005. A new approach to model counting. In International Conference on Theory and Applications of Satisfiability Testing, 324–339. Springer.

Appendix A Discussion on the TreeSHAP algorithm

Algorithm 1: Algorithm to compute the value function \operatorname{v} from (Lundberg, Erion, and Lee 2018)

procedure EXPVALUE(x, S, tree = \{v, a, b, t, r, d\})
   procedure G(j)
      if v_j \neq internal then
         return v_j
      else
         if d_j \in S then
            return G(a_j) if x_{d_j} \leq t_j else G(b_j)
         else
            return (G(a_j) \cdot r_{a_j} + G(b_j) \cdot r_{b_j}) / r_j
   return G(1)

Lundberg, Erion, and Lee (2018) propose TreeSHAP, a variant of Shap explanations for tree-based machine learning models such as decision trees, random forests and gradient boosted trees. The authors claim that, for the case when both the model and probability distribution are defined by a tree-based model, the algorithm can compute the exact Shap explanations in polynomial time. However, it has been pointed out in Github discussions (e.g., https://github.com/christophM/interpretable-ml-book/issues/142) that the TreeSHAP algorithm does not compute the Shap explanation as defined in Section 2. In this section, we provide a concrete example of this shortcoming.

The main shortcoming of the TreeSHAP algorithm is captured by Algorithm 1. The authors claim that Algorithm 1 computes the conditional expectation \mathbf{E}[F \mid \bm{x}_S], for a given set of features S and tree-based model F. We first describe the algorithm and then show by example that this algorithm does not accurately compute the conditional expectation.

Algorithm 1 takes as input a feature vector \bm{x}, a set of features S, and a binary tree, which represents the tree-based model. The tree is defined by the following vectors: v is a vector of node values; internal nodes are assigned the value internal. The vectors a and b represent the left and right child indexes for each internal node. The vector t contains the thresholds for each internal node, and d is a vector of indexes of the features used for splitting in internal nodes. The vector r represents the cover of each node (i.e., how many data samples fall in that sub-tree).

The algorithm proceeds recursively in a top-down traversal of the tree. For inner nodes, the algorithm follows the decision path for \bm{x} if the split feature is in S, and takes the weighted average of both branches if the split feature is not in S. For leaf nodes, it returns the value of the node, which corresponds to the prediction of the model.

The algorithm does not accurately compute the conditional expectation \mathbf{E}[F \mid \bm{x}_S], because it does not normalize the expectation by the probability of the condition. The following simple example shows that the value returned by Algorithm 1 does not represent the conditional expectation.

Example 14.

We consider the following dataset and decision tree model. The dataset has two binary variables X_1 and X_2, and each instance (x_1, x_2) is weighted by its occurrence count (i.e., the instance (0,0) occurs twice in the dataset). We want to compute \mathbf{E}[F(X_1,X_2) \mid X_2 = 0], where F(X_1,X_2) is the outcome of the decision tree.

X_1   X_2   #
 0     0    2
 0     1    1
 1     0    1
 1     1    2

[Figure: the decision tree for Example 14. The root tests X_1 (branch 0 to the left, branch 1 to the right); each child tests X_2 in the same way; the four leaves return F(0,0), F(0,1), F(1,0), and F(1,1).]

The correct value is:

\mathbf{E}[F(X_1,X_2) \mid X_2 = 0] = 2/3 \cdot F(0,0) + 1/3 \cdot F(1,0)

This is because there are three items with X_2 = 0, and their probabilities are 2/3 and 1/3.

Algorithm 1, however, returns:

G(1) = 1/2 \cdot F(0,0) + 1/2 \cdot F(1,0),

and thus does not compute \mathbf{E}[F(X_1,X_2) \mid X_2 = 0].
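To make the discrepancy concrete, here is a minimal Python sketch (not the authors' code; the leaf values 1.0, 2.0, 3.0, 4.0 are illustrative stand-ins for F(0,0), F(0,1), F(1,0), F(1,1)) that runs Algorithm 1 on the tree of Example 14 and compares it against the true conditional expectation from the table:

# Tree of Example 14 in Algorithm 1's vectors: node 1 splits on X1,
# nodes 2 and 3 split on X2; nodes 4-7 are leaves with illustrative values.
v = [None, 'internal', 'internal', 'internal', 1.0, 2.0, 3.0, 4.0]
a = [None, 2, 4, 6] + [None] * 4      # left child (feature value 0)
b = [None, 3, 5, 7] + [None] * 4      # right child (feature value 1)
d = [None, 1, 2, 2] + [None] * 4      # split feature of internal nodes
t = [None, 0.5, 0.5, 0.5] + [None] * 4
r = [None, 6, 3, 3, 2, 1, 1, 2]       # covers, from the dataset counts

def expvalue(x, S, j=1):
    # The recursive procedure G(j) from Algorithm 1.
    if v[j] != 'internal':
        return v[j]
    if d[j] in S:
        return expvalue(x, S, a[j]) if x[d[j]] <= t[j] else expvalue(x, S, b[j])
    return (expvalue(x, S, a[j]) * r[a[j]] + expvalue(x, S, b[j]) * r[b[j]]) / r[j]

print(expvalue({1: 0, 2: 0}, {2}))    # 1/2*F(0,0) + 1/2*F(1,0) = 2.0

# True conditional expectation over the weighted dataset.
counts = {(0, 0): 2, (0, 1): 1, (1, 0): 1, (1, 1): 2}
F = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0}
rows = {k: c for k, c in counts.items() if k[1] == 0}
print(sum(F[k] * c for k, c in rows.items()) / sum(rows.values()))  # 5/3

With these leaf values the two prints disagree (2.0 versus 5/3), matching the calculation above.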

Appendix B Proof of Proposition 3

We start with item (1). Recall that \operatorname{dom}(X_i) = \{1, 2, 3, \ldots, m_i\}. We denote by p_{i1}, p_{i2}, \ldots, p_{im_i} the corresponding probabilities, thus \sum_{j=1,m_i} p_{ij} = 1. By definition, the projected distribution is \Pr_\pi(X_i = 1) \stackrel{\mathrm{def}}{=} p_{i1}, and \Pr_\pi(X_i = 0) = 1 - p_{i1}. We denote by \mathbf{E}_\pi the corresponding expectation. Our goal is to prove \textsc{Shap}_{F,\Pr}(X_j) = \textsc{Shap}_{F_\pi,\Pr_\pi}(X_j).

Let \mathbf{e}_S again denote the event \bigwedge_{i \in S}(X_i = 1). Note that, by construction, for any set S, \Pr(\mathbf{e}_S) = \Pr_\pi(\mathbf{e}_S). Recall that for any instance \mathbf{x} \in \{0,1\}^n, we let T(\mathbf{x}) denote the event asserting that X_j = 1 iff \mathbf{x}_j = 1; formally,

T(\mathbf{x}) \stackrel{\mathrm{def}}{=} \bigwedge_{j: \mathbf{x}_j = 1}(X_j = 1) \wedge \bigwedge_{j: \mathbf{x}_j = 0}(X_j \neq 1).

Then, for any instance \mathbf{x} \in \{0,1\}^n,

\Pr(T(\mathbf{x})) = \prod_{i: \mathbf{x}_i = 1} p_{i1} \cdot \prod_{i: \mathbf{x}_i = 0}(p_{i2} + p_{i3} + \cdots) = \Pr_\pi(\mathbf{x}).

Thus, there are 2^n disjoint events T(\mathbf{x}), which partition the space \mathcal{X}. Therefore, for every set S:

\mathbf{E}[F \wedge \mathbf{e}_S] = \sum_{\mathbf{x}: \forall i \in S, \mathbf{x}_i = 1} \mathbf{E}[F \mid T(\mathbf{x})] \, \Pr(T(\mathbf{x}))
 = \sum_{\mathbf{x}: \forall i \in S, \mathbf{x}_i = 1} F_\pi(\mathbf{x}) \, \Pr_\pi(\mathbf{x})
 = \mathbf{E}_\pi[F_\pi \wedge \mathbf{e}_S]

This implies that \mathbf{E}[F \mid \mathbf{e}_S] = \mathbf{E}_\pi[F_\pi \mid \mathbf{e}_S] for any set S, and \textsc{Shap}_{F,\Pr}(X_j) = \textsc{Shap}_{F_\pi,\Pr_\pi}(X_j) for all j follows from Equation 6.

We now prove item (2): we show how to compute \mathbf{E}[F_\pi] given an oracle for computing \mathbf{E}[F]. Recall that we want to compute \mathbf{E}[F_\pi] on some arbitrary distribution \Pr'_\pi on \{0,1\}^n; this should not be confused with the probability \Pr_\pi defined in Eq. 11 and used to define the function F_\pi. Denote q_i \stackrel{\mathrm{def}}{=} \Pr'_\pi(X_i = 1), thus \Pr'_\pi(X_i = 0) = 1 - q_i. To compute \mathbf{E}[F_\pi] we will use the oracle for computing \mathbf{E}[F] on the probability space defined by the numbers p'_{ij}, i = 1,n, j = 1,m_i, defined as follows:

w_i \stackrel{\mathrm{def}}{=} \frac{1-q_i}{q_i} \qquad \text{and} \qquad Z \stackrel{\mathrm{def}}{=} \prod_{i=1,n} q_i
W_i \stackrel{\mathrm{def}}{=} 1 + \sum_{j=2,m_i} \frac{p_{ij} w_i}{1-p_{i1}} \qquad \text{and} \qquad W \stackrel{\mathrm{def}}{=} \prod_{i=1,n} W_i
p'_{i1} \stackrel{\mathrm{def}}{=} \frac{1}{W_i} \qquad i = 1,n
p'_{ij} \stackrel{\mathrm{def}}{=} \frac{p_{ij} w_i}{W_i (1-p_{i1})} \qquad i = 1,n; \; j = 2,m_i

One can check that the numbers p'_{ij} indeed define a probability space on \mathcal{X}; in other words, p'_{ij} \in [0,1] and, for all i = 1,n: \sum_{j=1,m_i} p'_{ij} = 1. We denote by \Pr' the probability space that they define, and by \mathbf{E}'[F] the expectation of F in this space. We prove:

Claim 3.

\mathbf{E}_\pi[F_\pi] = Z \cdot W \cdot \mathbf{E}'[F]

The claim immediately proves item (2) of Proposition 3: we simply invoke the oracle to compute \mathbf{E}'[F], then multiply by the quantities Z and W, both of which are computable in polynomial time. It remains to prove the claim.

We start with some observations and notations. Recall that the projection F_\pi depends on both F and on \Pr; see Equation 11. We express it here in terms of the probabilities p_{ij}:

F_\pi[\mathbf{x}] = \mathbf{E}[F \mid T(\mathbf{x})] = \frac{\mathbf{E}[F \cdot T(\mathbf{x})]}{\Pr(T(\mathbf{x}))}
 = \frac{\sum_{\tau \in \mathcal{X}: \mathbf{x}^{-1}(1) = \tau^{-1}(1)} F(\tau) \cdot \prod_i p_{i\tau_i}}{\prod_{i: \mathbf{x}_i = 1} p_{i1} \cdot \prod_{i: \mathbf{x}_i \neq 1}(1 - p_{i1})}
 = \sum_{\tau \in \mathcal{X}: \mathbf{x}^{-1}(1) = \tau^{-1}(1)} F(\tau) \cdot \prod_{i: \tau_i \neq 1} \frac{p_{i\tau_i}}{1 - p_{i1}}.

We used the fact that, for every instance \mathbf{x} \in \mathcal{X}, \Pr(\mathbf{x}) = \prod_i p_{i\mathbf{x}_i}, and denoted by \mathbf{x}^{-1}(1) the set of feature indices for which example \mathbf{x} has value 1. We now prove the claim by applying directly the definition of \mathbf{E}_\pi[F_\pi]:

\mathbf{E}_\pi[F_\pi] = \sum_{\theta \in \{0,1\}^n} F_\pi(\theta) \prod_{i: \theta_i = 1} q_i \prod_{i: \theta_i = 0}(1 - q_i)
 = Z \cdot \sum_{\theta \in \{0,1\}^n} F_\pi(\theta) \prod_{i: \theta_i = 0} w_i
 = Z \cdot \sum_{\theta \in \{0,1\}^n, \, \tau \in \mathcal{X}, \, \theta^{-1}(1) = \tau^{-1}(1)} F(\tau) \prod_{i: \tau_i \neq 1} \frac{p_{i\tau_i}}{1 - p_{i1}} \prod_{i: \theta_i = 0} w_i
 = Z \cdot \sum_{\tau \in \mathcal{X}} F(\tau) \prod_{i: \tau_i \neq 1} \frac{p_{i\tau_i} w_i}{1 - p_{i1}}
 = Z \cdot W \cdot \sum_{\tau \in \mathcal{X}} F(\tau) \prod_i p'_{i\tau_i}
 = Z \cdot W \cdot \mathbf{E}'[F]

In line 4 we noticed that the conditions \tau_i \neq 1 and \theta_i = 0 are equivalent, because \theta^{-1}(1) = \tau^{-1}(1), and that the assignment \tau \in \mathcal{X} uniquely determines \theta, hence \theta can be dropped from the summation. This completes the proof of the claim, and of Proposition 3.
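As a numeric sanity check on Claim 3, the following Python sketch (an assumption-laden illustration, not part of the proof: it uses the per-coordinate normalizers W_i spelled out above, random probabilities, and an arbitrary lookup table for F) verifies the identity \mathbf{E}_\pi[F_\pi] = Z \cdot W \cdot \mathbf{E}'[F] by brute-force enumeration:

import itertools, math, random

random.seed(0)
n, dom = 3, [2, 3, 4]                                   # m_i = |dom(X_i)|
p = [[random.random() for _ in range(m)] for m in dom]
p = [[x / sum(row) for x in row] for row in p]          # p[i][j-1] = Pr(X_i = j)
q = [random.uniform(0.1, 0.9) for _ in range(n)]        # Pr'_pi(X_i = 1)
F = {tau: random.random()
     for tau in itertools.product(*[range(1, m + 1) for m in dom])}

w = [(1 - qi) / qi for qi in q]
Z = math.prod(q)
Wi = [1 + sum(p[i][j] * w[i] / (1 - p[i][0]) for j in range(1, dom[i]))
      for i in range(n)]
W = math.prod(Wi)
pp = [[1 / Wi[i]] + [p[i][j] * w[i] / (Wi[i] * (1 - p[i][0]))
                     for j in range(1, dom[i])] for i in range(n)]

def F_pi(theta):  # the projection F_pi, written out over all tau
    return sum(f * math.prod(p[i][tau[i] - 1] / (1 - p[i][0])
                             for i in range(n) if tau[i] != 1)
               for tau, f in F.items()
               if all((tau[i] == 1) == (theta[i] == 1) for i in range(n)))

lhs = sum(F_pi(th) * math.prod(q[i] if th[i] else 1 - q[i] for i in range(n))
          for th in itertools.product([0, 1], repeat=n))
rhs = Z * W * sum(f * math.prod(pp[i][tau[i] - 1] for i in range(n))
                  for tau, f in F.items())
print(lhs, rhs)  # the two values agree up to floating-point error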

Appendix C Proof of Theorem 5

The number partitioning problem, NUMPAR, is the following: given n natural numbers k_1, \ldots, k_n, decide whether there exists a subset S \subseteq [n] that partitions the numbers into two sets with equal sums: \sum_{i \in S} k_i = \sum_{i \notin S} k_i. NUMPAR is known to be NP-complete. The corresponding counting problem, in notation \#NUMPAR, asks for the number of sets S such that \sum_{i \in S} k_i = \sum_{i \notin S} k_i, and is \#P-hard.

We show that we can solve the \#NUMPAR problem using an oracle for \mathbf{E}_\mathbf{U}[F], where F is a logistic regression function and \mathbf{U} is the uniform probability distribution. This implies that computing the expectation of a logistic regression function is \#P-hard.

Fix an instance of NUMPAR, k_1, \ldots, k_n, and assume w.l.o.g. that the sum of the numbers k_i is even, \sum_i k_i = 2c for some natural number c. Let

P \stackrel{\mathrm{def}}{=} \{S \mid S \subseteq [n], \sum_{i \in S} k_i = c\} \qquad (12)

For each set S \subseteq [n], denote by \bar{S} its complement. Obviously, S \in P iff \bar{S} \in P, therefore |P| is an even number.

We next describe an algorithm that computes |P| using an oracle for \mathbf{E}_\mathbf{U}[F], where F is a logistic regression function and \mathbf{U} is the uniform probability distribution. Let m be a natural number, large enough, to be chosen later, and define the following weights:

w_0 \stackrel{\mathrm{def}}{=} -\frac{m}{2} - mc \qquad w_i \stackrel{\mathrm{def}}{=} m k_i \quad \forall i = 1,n

Let \bm{w} = (w_1, \ldots, w_n); then F(x_1, \ldots, x_n) \stackrel{\mathrm{def}}{=} \sigma(w_0 + \bm{w} \cdot \bm{x}) is the logistic regression function defined by the weights w_0, \ldots, w_n.

Claim 4.

Let \varepsilon \stackrel{\mathrm{def}}{=} 1/2^{n+3}. If m satisfies both 2\sigma(-m/2) \leq \varepsilon and 1 - \sigma(m/2) \leq \varepsilon, then:

|P| = \left\lceil 2^n - \frac{2^{n+1} \mathbf{E}[F]}{1 - \varepsilon} \right\rceil

The claim immediately proves the theorem: in order to solve the \#NUMPAR problem, compute \mathbf{E}[F] and then use the formula above to derive |P|. To prove the claim, for each set S \subseteq [n] define:

\text{weight}(S) \stackrel{\mathrm{def}}{=} -\frac{m}{2} - mc + m \left(\sum_{i \in S} k_i\right)

Let \mathbf{U} denote the uniform probability distribution over the domain \{0,1\}^n. Then,

\mathbf{E}_\mathbf{U}[F] = \frac{1}{2^n} \sum_\mathbf{x} \sigma(w_0 + \bm{w} \cdot \mathbf{x})
 = \frac{1}{2^n} \sum_\mathbf{x} \sigma\left(-\frac{m}{2} - mc + m \sum_{i \in [n]} k_i \mathbf{x}_i\right)
 = \frac{1}{2^n} \sum_\mathbf{x} \sigma\left(-\frac{m}{2} - mc + m \sum_{i: \mathbf{x}_i = 1} k_i\right)
 = \frac{1}{2^n} \sum_{S \subseteq [n]} \sigma(\text{weight}(S))
 = \frac{1}{2^{n+1}} \sum_{S \subseteq [n]} \left(\sigma(\text{weight}(S)) + \sigma(\text{weight}(\bar{S}))\right)

If S is a solution to the number partitioning problem (S \in P), then:

\sigma(\text{weight}(S)) + \sigma(\text{weight}(\bar{S})) = 2\sigma(-m/2)

Otherwise, one of \text{weight}(S), \text{weight}(\bar{S}) is \geq m/2 and the other is \leq -3m/2, and therefore:

\sigma(m/2) \leq \sigma(\text{weight}(S)) + \sigma(\text{weight}(\bar{S})) \leq 1 + \sigma(-3m/2)

Since \varepsilon = 1/2^{n+3}, and m satisfies both 2\sigma(-m/2) \leq \varepsilon and 1 - \sigma(m/2) \leq \varepsilon, we have:

S \in P: \quad 0 \leq \sigma(\text{weight}(S)) + \sigma(\text{weight}(\bar{S})) \leq \varepsilon
S \notin P: \quad 1 - \varepsilon \leq \sigma(\text{weight}(S)) + \sigma(\text{weight}(\bar{S})) \leq 1 + \varepsilon

This implies:

\frac{2^n - |P|}{2^{n+1}}(1 - \varepsilon) \leq \mathbf{E}[F] \leq \frac{|P|}{2^{n+1}} \varepsilon + \frac{2^n - |P|}{2^{n+1}}(1 + \varepsilon)
|P| \geq 2^n - \frac{2^{n+1} \mathbf{E}[F]}{1 - \varepsilon}
|P| \leq 2^n (1 + \varepsilon) - 2^{n+1} \mathbf{E}[F]

Thus, we have a lower and an upper bound for |P|. Since \mathbf{E}[F] \leq 1, the difference between the two bounds is < 1, so there is at most one integer between them; hence |P| is equal to the ceiling of the lower bound (and also to the floor of the upper bound), proving the claim.
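A small sketch of this reduction (the instance [2, 3, 5, 4] and the choice m = 18 are assumptions for illustration) computes \mathbf{E}_\mathbf{U}[F] by brute-force enumeration, applies the formula of Claim 4, and compares the result against a direct count of P:

import itertools, math

ks = [2, 3, 5, 4]                    # toy NUMPAR instance; sum = 14, c = 7
n, c = len(ks), sum(ks) // 2
eps = 1 / 2 ** (n + 3)
m = 18                               # large enough here: 2*sigma(-9) < eps

sigma = lambda z: 1 / (1 + math.exp(-z))
w0 = -m / 2 - m * c
E = sum(sigma(w0 + m * sum(k * x for k, x in zip(ks, xs)))
        for xs in itertools.product([0, 1], repeat=n)) / 2 ** n

print(math.ceil(2 ** n - 2 ** (n + 1) * E / (1 - eps)))   # |P| from E[F]
print(sum(1 for xs in itertools.product([0, 1], repeat=n)
          if sum(k * x for k, x in zip(ks, xs)) == c))    # |P| by brute force

Both prints give 2, corresponding to the solutions \{k_1, k_3\} and \{k_2, k_4\}.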

Appendix D Proof of Theorem 8

We use a reduction from the decision version of the number partitioning problem, NUMPAR, which is NP-complete; see Sec. C.

As before, we assume w.l.o.g. that the sum of the numbers k_i is even, \sum_i k_i = 2c for some natural number c. Let m be a large natural number to be defined later. We reduce the NUMPAR problem to the \text{D-SHAP}(\{X_0\}, \text{NBN}) problem. The naive Bayes network NBN consists of n+1 binary random variables X_0, \ldots, X_n. Let X_i, \bar{X}_i denote the events X_i = 1 and X_i = 0, respectively. We define the following probabilities of the NBN:

\frac{\Pr(X_0)}{\Pr(\bar{X}_0)} = e^{-\frac{m}{2} - mc} \qquad \frac{\Pr(X_i \mid X_0)}{\Pr(X_i \mid \bar{X}_0)} = e^{m k_i}

The probabilities \Pr(\bar{X}_0) and \Pr(X_i \mid \bar{X}_0) can be chosen arbitrarily (with the obvious constraints \Pr(\bar{X}_0) \leq e^{\frac{m}{2} + mc} and \Pr(X_i \mid \bar{X}_0) \leq e^{-m k_i}). As required, our classifier is F(X_0, \ldots, X_n) \stackrel{\mathrm{def}}{=} X_0. Let a_k \stackrel{\mathrm{def}}{=} \frac{k!(n-k)!}{(n+1)!} and let \varepsilon > 0 be any number such that \varepsilon \leq a_k for all k = 0, 1, \ldots, n. We prove:

Claim 5.

Let \varepsilon be the value defined above. If m satisfies both 2\sigma(-m/2) \leq \varepsilon and 1 - \sigma(m/2) \leq \varepsilon, then NUMPAR has a solution iff \textsc{Shap}_F(X_0) > 1/2(1+\varepsilon).

The claim implies Theorem 8. To prove the claim, we express the Shap explanation using Eq. (6). Let \mathbf{X}_S denote the event \bigwedge_{i \in S}(X_i = 1). Then, we can write the Shap explanation as:

\textsc{Shap}_F(X_0) = \sum_{S \subseteq [n]} a_{|S|} \left(\mathbf{E}[F \mid \mathbf{X}_{S \cup \{0\}}] - \mathbf{E}[F \mid \mathbf{X}_S]\right)

Obviously, \mathbf{E}[X_0 \mid \mathbf{X}_{S \cup \{0\}}] = 1. In addition, we have \sum_{S \subseteq [n]} a_{|S|} = 1, because there are \binom{n}{k} sets of size k, hence \sum_{S \subseteq [n]} a_{|S|} = \sum_{k=0,n} \binom{n}{k} \cdot \frac{k!(n-k)!}{(n+1)!} = 1. Therefore \textsc{Shap}_F(X_0) = 1 - D, where:

D \stackrel{\mathrm{def}}{=} \sum_{S \subseteq [n]} a_{|S|} \cdot \mathbf{E}[X_0 \mid \mathbf{X}_S] \qquad (13)

To compute D, we expand:

\mathbf{E}[X_0 \mid \mathbf{X}_S] = \Pr(X_0 \mid \mathbf{X}_S) = \frac{\Pr(X_0, \mathbf{X}_S)}{\Pr(\mathbf{X}_S)}
 = \frac{\prod_{i \in S} \Pr(X_i \mid X_0) \Pr(X_0)}{\prod_{i \in S} \Pr(X_i \mid X_0) \Pr(X_0) + \prod_{i \in S} \Pr(X_i \mid \bar{X}_0) \Pr(\bar{X}_0)}
 = \frac{1}{1 + \Pr(\bar{X}_0)/\Pr(X_0) \cdot \prod_{i \in S} \left(\Pr(X_i \mid \bar{X}_0)/\Pr(X_i \mid X_0)\right)}
 = \sigma(\text{weight}(S))

where:

\sigma(x) \stackrel{\mathrm{def}}{=} \frac{1}{1 + e^{-x}} \qquad \text{weight}(S) \stackrel{\mathrm{def}}{=} -\frac{m}{2} - mc + m \left(\sum_{i \in S} k_i\right)

We compute D in Eq. (13) by grouping each set S with its complement \bar{S} \stackrel{\mathrm{def}}{=} [n] - S (using the fact that a_{|S|} = a_{|\bar{S}|}):

D = \frac{1}{2} \sum_{S \subseteq [n]} a_{|S|} \left(\sigma(\text{weight}(S)) + \sigma(\text{weight}(\bar{S}))\right) \qquad (14)

If S is a solution to the number partitioning problem, then:

\sigma(\text{weight}(S)) + \sigma(\text{weight}(\bar{S})) = 2\sigma(-m/2)

Otherwise, one of \text{weight}(S), \text{weight}(\bar{S}) is \geq m/2 and the other is \leq -3m/2, and therefore:

\sigma(m/2) \leq \sigma(\text{weight}(S)) + \sigma(\text{weight}(\bar{S})) \leq 1 + \sigma(-3m/2)

As in Sec. C, we obtain:

S \in P: \quad 0 \leq \sigma(\text{weight}(S)) + \sigma(\text{weight}(\bar{S})) \leq \varepsilon
S \notin P: \quad 1 - \varepsilon \leq \sigma(\text{weight}(S)) + \sigma(\text{weight}(\bar{S})) \leq 1 + \varepsilon

Therefore, using the fact that \sum_{S \subseteq [n]} a_{|S|} = 1, we derive the following bounds on the expression (14) for D:

  • If the number partitioning problem has no solution, then D \geq 1/2(1-\varepsilon), and \textsc{Shap}_F(X_0) \leq 1/2(1+\varepsilon).

  • Otherwise, let S be any solution to the NUMPAR problem, and let k = |S|; then:

    D \leq \left(\frac{1}{2}(1+\varepsilon) - a_k(1+\varepsilon)\right) + a_k \varepsilon \leq \frac{1}{2} - \left(a_k - \frac{\varepsilon}{2}\right) < \frac{1}{2} - \frac{\varepsilon}{2}

    and therefore \textsc{Shap}_F(X_0) > 1/2(1+\varepsilon).
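The following sketch mirrors this argument numerically (the toy instances and the choice m = 20 are assumptions for illustration; m is chosen to satisfy the hypothesis of Claim 5): it computes \textsc{Shap}_F(X_0) = 1 - D via Eq. (14) and the closed form \mathbf{E}[X_0 \mid \mathbf{X}_S] = \sigma(\text{weight}(S)), then tests the threshold:

import itertools, math

def sigma(z):  # numerically safe logistic function
    return 1 / (1 + math.exp(-z)) if z >= 0 else math.exp(z) / (1 + math.exp(z))

def shap_x0(ks, m):
    n, c = len(ks), sum(ks) // 2
    a = [math.factorial(k) * math.factorial(n - k) / math.factorial(n + 1)
         for k in range(n + 1)]
    D = sum(a[sum(xs)] * sigma(-m / 2 - m * c + m * sum(k * x for k, x in zip(ks, xs)))
            for xs in itertools.product([0, 1], repeat=n))
    return 1 - D

for ks in ([2, 3, 5, 4], [2, 3, 5, 8]):   # partition exists / does not exist
    n = len(ks)
    eps = min(math.factorial(k) * math.factorial(n - k) / math.factorial(n + 1)
              for k in range(n + 1))
    print(shap_x0(ks, m=20) > (1 + eps) / 2)   # prints True, then False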

Appendix E Proof of Corollary 10

Proof.

(Sketch) We use a reduction from the NUMPAR problem, as in the proof of Theorem 8. We start by constructing the NBN with variables X_0, X_1, \ldots, X_n (as for Theorem 8), then add two more variables X_{n+1}, X_{n+2}, and edges X_0 \rightarrow X_{n+1} \rightarrow X_{n+2}, and define the random variables X_{n+1}, X_{n+2} to be identical to X_0, i.e., X_0 = X_{n+1} = X_{n+2}. The prediction function is F = X_{n+2}, i.e., it returns the feature X_{n+2}, and the variables X_0, X_{n+1} are latent. Thus, the new BN is identical to the NBN, and, since both models have exactly the same number of non-latent variables, the Shap explanation is the same. ∎

Appendix F Proof of Lemma 13 (1)

Fix a PP2CNF \Phi = \bigwedge (U_i \vee V_j). A symmetric probability space is defined by two numbers p, q \in [0,1] and consists of the fully-factorized distribution where \Pr(U_1) = \Pr(U_2) = \cdots = p and \Pr(V_1) = \Pr(V_2) = \cdots = q. A quasi-symmetric probability space consists of two sets of indices I, J and two numbers p, q such that:

\Pr(U_i) = \begin{cases} p & \text{when } i \notin I \\ 1 & \text{when } i \in I \end{cases} \qquad \Pr(V_j) = \begin{cases} q & \text{when } j \notin J \\ 1 & \text{when } j \in J \end{cases}

In this and the following section we prove Lemma 13: computing the Shap explanation over an empirical distribution is polynomial-time equivalent to computing the expectation of PP2CNF formulas over a (quasi-)symmetric distribution. Provan and Ball (1983) proved that computing the expectation of a PP2CNF over uniform distributions is #P-hard in general. Since uniform distributions are symmetric (namely p = q = 1/2), it follows that computing Shap explanations is #P-hard in general.

In this section we prove item (1) of Lemma 13. Fix a 0/1-matrix \bm{x} defining an empirical distribution, and let F be a real-valued prediction function over these features. Let \Phi_{\bm{x}} be the PP2CNF associated to \bm{x} (see Definition 4). We will assume w.l.o.g. that \bm{x} has n+1 features (columns), denoted X_0, X_1, \ldots, X_n. The prediction function F is any function F: \{0,1\}^{n+1} \rightarrow \mathbb{R}. We prove:

Proposition 15.

One can compute \textsc{Shap}_F(X_0) in polynomial time using an oracle for computing \mathbf{E}[\Phi_{\bm{x}}] over quasi-symmetric distributions.

Denote by y_i \stackrel{\mathrm{def}}{=} F(x_{i0}, x_{i1}, \ldots, x_{in}), i = 1,m, the value of F on the i'th row of the matrix \bm{x}. Since the only possible outcomes of the probability space are the m rows of the matrix, the quantity \textsc{Shap}_F(X_0) depends only on the vector \bm{y} \stackrel{\mathrm{def}}{=} (y_1, \ldots, y_m). Furthermore, by the linearity of the Shap explanation (Eq. (4)), it suffices to compute the Shap explanation in the case when \bm{y} has a single value = 1 and all others are = 0. By permuting the rows of the matrix, we will assume w.l.o.g. that y_1 = 1 and y_2 = y_3 = \cdots = y_m = 0. In summary, denoting by F_1 the function that is 1 on the first row of the matrix \bm{x} and 0 on all other rows, our task is to compute \textsc{Shap}_{F_1}(X_0).

For that we use the following expression for Shap (see also Sec. 3):

\textsc{Shap}_{F_1}(X_0) = \sum_{k=0,n} \frac{k!(n-k)!}{(n+1)!} \Big( \sum_{S \subseteq [n]: |S| = k} \left(\mathbf{E}[F_1 \mid \mathbf{X}_{S \cup \{0\}} = 1] - \mathbf{E}[F_1 \mid \mathbf{X}_S = 1]\right) \Big) \qquad (15)

We will only show how to compute the quantity

v_{F_1,k} = \sum_{S \subseteq [n]: |S| = k} \mathbf{E}[F_1 \mid \mathbf{X}_S = 1] \qquad (16)

using an oracle for \mathbf{E}[\Phi_{\bm{x}}], because the quantity \sum_{S: |S| = k} \mathbf{E}[F_1 \mid \mathbf{X}_{S \cup \{0\}} = 1] is computed similarly, by restricting the matrix \bm{x} to the rows i where x_{i0} = 1. The PP2CNF \Phi associated to this restricted matrix is obtained from \Phi_{\bm{x}} as follows. Let I \stackrel{\mathrm{def}}{=} \{i \mid x_{i0} = 1\} be the set of rows of the matrix where the feature X_0 is 1. Then, we need to remove all clauses of the form (U_i \vee V_j) for i \notin I. This is equivalent to setting U_i := 1 in \Phi_{\bm{x}} for all i \notin I. Therefore, we can compute the expectation of the restricted \Phi by using our oracle for \mathbf{E}[\Phi_{\bm{x}}], running it over the probability space where we define \Pr(U_i) \stackrel{\mathrm{def}}{=} 1 for all i \notin I. Hence, it suffices to show only how to compute the expression (16). Notice that the quantity v_{F_1,k} is the same as what we defined earlier in Eq. (7).

Column X_0 of the matrix is not used in expression (16), because the set S ranges over subsets of [n]. Hence w.l.o.g. we can drop feature X_0 and denote by \bm{x} (with some abuse) the matrix that only has the features X_1, \ldots, X_n. In other words, \bm{x} \in \{0,1\}^{m \times n}. The PP2CNF formula for the modified matrix is obtained from \Phi_{\bm{x}} by setting V_0 := 1, hence we can compute its expectation by using our oracle for \mathbf{E}[\Phi_{\bm{x}}].

We introduce the following quantities associated to the matrix \bm{x} \in \{0,1\}^{m \times n}:

  • For all S \subseteq [n], \ell \leq m, k \leq n, we define:

    g(S) \stackrel{\mathrm{def}}{=} \{i \mid \forall j \in S, x_{ij} = 1\} \qquad (17)
    a_{\ell k} \stackrel{\mathrm{def}}{=} |\{S \mid |S| = k, |g(S)| = \ell\}| \qquad (18)

  • We define the sequence v_k, k = 0, 1, \ldots, n:

    v_k \stackrel{\mathrm{def}}{=} \sum_{\ell=1,m} \frac{a_{\ell k}}{\ell} \qquad (19)

  • We define the value V:

    V \stackrel{\mathrm{def}}{=} \sum_{k=0,n} \frac{k!(n-k)!}{(n+1)!} v_k \qquad (20)

We prove that, under a certain condition, the value v_k in Eq. (19) is equal to Eq. (16); this justifies the notation v_k, since it turns out to be the same as v_{F_1,k} from Eq. (7).

Definition 5.

Call the matrix \bm{x} "good" if \forall i, j: x_{1j} \geq x_{ij}.

In other words, the matrix is "good" if the first row dominates all others. In general the matrix \bm{x} need not be "good"; however, we can make it "good" by removing all columns where row 1 has value 0. More precisely, let J^{(1)} \stackrel{\mathrm{def}}{=} \{j \mid x_{1j} = 1\} denote the non-zero positions of the first row, and let \bm{x}^{(1)} denote the sub-matrix of \bm{x} consisting of the columns J^{(1)}. Obviously, \bm{x}^{(1)} is "good", because its first row is (1, 1, \ldots, 1). The following hold:

If S \subseteq J^{(1)}: \quad \mathbf{E}_{\bm{x}}[F_1 \mid \mathbf{X}_S = 1] = \mathbf{E}_{\bm{x}^{(1)}}[F_1 \mid \mathbf{X}_S = 1]
If S \not\subseteq J^{(1)}: \quad \mathbf{E}_{\bm{x}}[F_1 \mid \mathbf{X}_S = 1] = 0

(When S \not\subseteq J^{(1)}, the quantity \mathbf{E}_{\bm{x}^{(1)}}[F_1 \mid \mathbf{X}_S = 1] is undefined.) Therefore:

\sum_{S \subseteq [n]: |S| = k} \mathbf{E}_{\bm{x}}[F_1 \mid \mathbf{X}_S = 1] = \sum_{S \subseteq J^{(1)}: |S| = k} \mathbf{E}_{\bm{x}^{(1)}}[F_1 \mid \mathbf{X}_S = 1]

It follows that, in order to compute the values in Eq. (16), we can consider the matrix \bm{x}^{(1)} instead of \bm{x}; its associated PP2CNF is obtained from \Phi_{\bm{x}} by setting V_j := 1 for all j \in [n] - J^{(1)}, hence we can compute its expectation over a quasi-symmetric space by using our oracle for computing \mathbf{E}[\Phi_{\bm{x}}] over quasi-symmetric spaces. To simplify the notation, we will still use the name \bm{x} for the matrix instead of \bm{x}^{(1)}, and assume w.l.o.g. that the first row of the matrix \bm{x} is (1, 1, \ldots, 1).

We prove that, when \bm{x} is "good", v_k is indeed the quantity in Eq. (16) that we want to compute. This holds for any "good" matrix, not just matrices with (1, 1, \ldots, 1) in the first row, and we need this more general result later in Sec. G.

Claim 6.

If the matrix \bm{x} is "good", then, for any k = 0,n:

v_k = \sum_{S: |S| = k} \mathbf{E}[F_1 \mid \mathbf{X}_S = 1]
Proof.

Recall that J^{(1)} \stackrel{\mathrm{def}}{=} \{j \mid x_{1j} = 1\}. Let S \subseteq [n] be any set of columns. We consider two cases, depending on whether S is a subset of J^{(1)} or not:

S \subseteq J^{(1)}: \quad |g(S)| > 0 \quad \mathbf{E}[F_1 \mid \mathbf{X}_S = 1] = \frac{1}{|g(S)|}
S \not\subseteq J^{(1)}: \quad |g(S)| = 0 \quad \mathbf{E}[F_1 \mid \mathbf{X}_S = 1] = 0

Therefore:

\sum_{S \subseteq [n]: |S| = k} \mathbf{E}[F_1 \mid \mathbf{X}_S = 1] = \sum_{S \subseteq J^{(1)}: |S| = k} \mathbf{E}[F_1 \mid \mathbf{X}_S = 1] = \sum_{S \subseteq J^{(1)}: |S| = k} \frac{1}{|g(S)|} = \sum_{S: |S| = k, |g(S)| > 0} \frac{1}{|g(S)|} = \sum_{\ell > 0} \frac{a_{\ell k}}{\ell}

At this point we introduce two polynomials, P and Q.

Definition 6.

Fix an m \times n matrix \bm{x} with 0,1-entries. The polynomials P(u,v) and Q(u,v) in real variables u, v associated to the matrix \bm{x} are the following:

P(u,v) \stackrel{\mathrm{def}}{=} \sum_{S \subseteq [n]} u^{|g(S)|} v^{|S|}
Q(u,v) \stackrel{\mathrm{def}}{=} \sum_{T \subseteq [m], \, S \subseteq [n]: \, \forall (i,j) \in T \times S: \, x_{ij} = 1} u^{|T|} v^{|S|}

The polynomials are defined by summing over exponentially many sets S \subseteq [n], or pairs of sets S \subseteq [n], T \subseteq [m]. In the definition of P, we use the function g(S) associated to the matrix \bm{x}; see Eq. (17). In the definition of Q(u,v) we sum only over those pairs T, S where \forall i \in T, \forall j \in S: x_{ij} = 1. While their definitions involve exponentially many terms, these polynomials have only (m+1)(n+1) monomials, because the degrees of the variables u, v are at most m and n respectively. We claim that these monomials are as follows:

Claim 7.

The following identities hold:

P(u,v) = \sum_{\ell=0,m; \, k=0,n} a_{\ell k} u^\ell v^k
Q(u,v) = P(1+u, v)
Proof.

The identity for P(u,v) follows immediately from the definition of a_{\ell k}. We prove the identity for Q. From the definition of g(S) in Eq. (17) we derive the following equivalence:

(\forall i \in T, \forall j \in S: x_{ij} = 1) \Leftrightarrow T \subseteq g(S)

which implies:

Q(u,v) = \sum_{S \subseteq [n], \, T \subseteq g(S)} u^{|T|} v^{|S|}

and the claim follows from \sum_{T \subseteq g(S)} u^{|T|} = (1+u)^{|g(S)|}. ∎
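Claim 7 is easy to check by direct enumeration. The following sketch (the random small matrix and the evaluation point are assumptions for illustration) evaluates both polynomials from their definitions and confirms Q(u, v) = P(1 + u, v):

import itertools, random

random.seed(1)
m_rows, n_cols = 3, 4
x = [[random.randint(0, 1) for _ in range(n_cols)] for _ in range(m_rows)]

def g(S):  # rows having a 1 in every column of S, Eq. (17)
    return [i for i in range(m_rows) if all(x[i][j] for j in S)]

def subsets(universe):
    for r in range(len(universe) + 1):
        yield from itertools.combinations(universe, r)

def P(u, v):
    return sum(u ** len(g(S)) * v ** len(S) for S in subsets(range(n_cols)))

def Q(u, v):
    return sum(u ** len(T) * v ** len(S)
               for S in subsets(range(n_cols)) for T in subsets(range(m_rows))
               if all(x[i][j] for i in T for j in S))

u, v = 0.7, 1.3
print(P(1 + u, v), Q(u, v))  # equal up to floating-point error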

Thus, in order to compute the quantities v_k for k = 0, 1, \ldots, n it suffices to compute the coefficients a_{\ell k} of the polynomial P(u,v), and, for that, it suffices to compute the coefficients of the polynomial Q(u,v). To this end, we establish the following important connection between \mathbf{E}[\Phi_{\bm{x}}] and the polynomial Q(u,v). Fix any two positive real values u, v > 0, and let p \stackrel{\mathrm{def}}{=} 1/(1+u), q \stackrel{\mathrm{def}}{=} 1/(1+v); notice that p, q \in (0,1). Consider the probability space over independent Boolean variables U_1, \ldots, U_m, V_1, \ldots, V_n where \forall i \in [m]: \Pr(U_i) = p, and \forall j \in [n]: \Pr(V_j) = q. Then:

Claim 8.

Given the notations above, the following identity holds:

\mathbf{E}[\Phi_{\bm{x}}] = \frac{1}{(1+u)^m (1+v)^n} Q(u,v) \qquad (21)
Proof.

A truth assignment for \Phi_{\bm{x}} consists of two assignments, \theta \in \{0,1\}^m for the variables U_i and \tau \in \{0,1\}^n for the variables V_j. Defining T \stackrel{\mathrm{def}}{=} \{i \mid \theta(U_i) = 0\} and S \stackrel{\mathrm{def}}{=} \{j \mid \tau(V_j) = 0\}, we observe that \Phi_{\bm{x}}[\theta,\tau] = \texttt{true} iff \forall i \in T, \forall j \in S: x_{ij} = 1, and therefore:

\Pr(\Phi_{\bm{x}}) = \sum_{\theta,\tau: \Phi[\theta,\tau]=1} \Pr(\theta) \Pr(\tau)
 = \sum_{T \subseteq [m], \, S \subseteq [n]: \, \forall (i,j) \in T \times S: \, x_{ij}=1} p^{m-|T|} (1-p)^{|T|} q^{n-|S|} (1-q)^{|S|}
 = p^m q^n Q((1-p)/p, (1-q)/q)

Finally, to prove Lemma 13 (1), it suffices to show how to use an oracle for \mathbf{E}[\Phi_{\bm{x}}] to compute the coefficients of the polynomial Q(u,v). We denote by b_{\ell k} these coefficients; in other words:

Q(u,v) = \sum_{\ell=0,m; \, k=0,n} b_{\ell k} u^\ell v^k \qquad (22)

To compute the coefficients b_{\ell k}, we proceed as follows. Choose m+1 distinct values u_0, u_1, \ldots, u_m > 0 and n+1 distinct values v_0, v_1, \ldots, v_n > 0, and for all i = 0,m and j = 0,n, use the oracle for \mathbf{E}[\Phi_{\bm{x}}] to compute Q(u_i, v_j) as per identity (21). This leads to a system of (m+1)(n+1) equations whose unknowns are the coefficients b_{\ell k} (see Eq. (22)) and whose coefficients are u_i^\ell v_j^k. The matrix \bm{A} of this system of equations is an [(m+1)(n+1)] \times [(m+1)(n+1)] matrix, whose rows are indexed by pairs (i,j) and whose columns are indexed by pairs (\ell,k):

A_{(ij),(\ell k)} = u_i^\ell v_j^k

We prove that this matrix is non-singular, and for that we observe that it is the Kronecker product of two Vandermonde matrices. Recall that the t \times t Vandermonde matrix defined by t numbers z_1, \ldots, z_t is:

V(z_1, \ldots, z_t) = \begin{bmatrix} 1 & 1 & \ldots & 1 \\ z_1 & z_2 & \ldots & z_t \\ z_1^2 & z_2^2 & \ldots & z_t^2 \\ & & \ldots & \\ z_1^{t-1} & z_2^{t-1} & \ldots & z_t^{t-1} \end{bmatrix}

It is known that \det(V(z_1, \ldots, z_t)) = \prod_{1 \leq i < j \leq t} (z_j - z_i), which is \neq 0 iff the values z_1, \ldots, z_t are distinct. We observe that the matrix \bm{A} is the Kronecker product of two Vandermonde matrices:

\bm{A} = V(u_0, u_1, \ldots, u_m) \otimes V(v_0, v_1, \ldots, v_n)

Since we have chosen u_0, \ldots, u_m to be distinct, and similarly for v_0, \ldots, v_n, it follows that both Vandermonde matrices are non-singular, hence \det(\bm{A}) \neq 0. Thus, we can solve this linear system of equations in time O(((m+1)(n+1))^3) and compute all coefficients b_{\ell k}.
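The interpolation step looks as follows in a short numpy sketch (here Q is a small known polynomial standing in for the oracle-evaluated one, so the recovered coefficients can be checked; all concrete values are assumptions):

import numpy as np

m, n = 2, 2
b_true = np.arange((m + 1) * (n + 1), dtype=float)   # b_{lk}, row-major in (l, k)

def Q(u, v):  # stand-in for the oracle-evaluated polynomial, Eq. (22)
    return sum(b_true[l * (n + 1) + k] * u ** l * v ** k
               for l in range(m + 1) for k in range(n + 1))

us = np.array([0.5, 1.0, 2.0])       # m+1 distinct positive values
vs = np.array([0.25, 0.75, 1.5])     # n+1 distinct positive values
# A = V(u_0..u_m) (Kronecker) V(v_0..v_n), so A[(i,j),(l,k)] = u_i^l v_j^k
A = np.kron(np.vander(us, m + 1, increasing=True),
            np.vander(vs, n + 1, increasing=True))
rhs = np.array([Q(u, v) for u in us for v in vs])
print(np.allclose(np.linalg.solve(A, rhs), b_true))  # True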

Putting It Together. We now prove Proposition 15. We are given a 0/1 matrix \bm{x} with n+1 features X_0, \ldots, X_n and m rows. To compute \textsc{Shap}_F(X_0) we proceed as follows:

  1. For each i = 1,m, compute \textsc{Shap}_{F_i}(X_0), where F_i is the function defined as = 1 on row i of the matrix and = 0 on all other rows. Return \textsc{Shap}_F(X_0) = \sum_{i=1,m} y_i \textsc{Shap}_{F_i}(X_0), where y_i \stackrel{\mathrm{def}}{=} F(x_{i0}, x_{i1}, \ldots, x_{in}) is the value of F on the i'th row of the matrix.

  2. To compute \textsc{Shap}_{F_i}(X_0), switch rows 1 and i of the matrix, and compute \textsc{Shap}_{F_1}(X_0) on the modified matrix.

  3. To compute \textsc{Shap}_{F_1}(X_0), compute both sums in Eq. (15).

  4. To compute \sum_{S \subseteq [n]: |S| = k} \mathbf{E}[F_1 \mid \mathbf{X}_S = 1], perform steps (5) to (8) below.

  5. Let J^{(1)} = \{j \mid j \in [n], x_{1j} = 1\}; notice that 0 \notin J^{(1)}. Let n^{(1)} = |J^{(1)}|. Let \Phi' denote the PP2CNF obtained from \Phi_{\bm{x}} by setting V_j := 1 for all j \notin J^{(1)}. Thus, \Phi' has m + n^{(1)} variables: U_i for i \in [m], and V_j for j \in J^{(1)}.

  6. Choose distinct values u_0, u_1, \ldots, u_m \in (0,1) and distinct values v_0, v_1, \ldots, v_{n^{(1)}} \in (0,1). For each fixed combination u_\alpha, v_\beta, compute Q(u_\alpha, v_\beta) = (1+u_\alpha)^m (1+v_\beta)^{n^{(1)}} \mathbf{E}[\Phi'] (see Claim 8), where \mathbf{E}[\Phi'] is taken over the probability space with \Pr(U_i) = 1/(1+u_\alpha) and \Pr(V_j) = 1/(1+v_\beta) for all i, j; this can be done by computing \mathbf{E}[\Phi_{\bm{x}}] over a quasi-symmetric space.

  7. Using the (m+1)(n^{(1)}+1) results from the previous step, form a system of equations whose unknowns are the coefficients b_{\ell k}, \ell = 0,m, k = 0,n^{(1)}, of the polynomial Q(u,v); see (22). Solve for the coefficients b_{\ell k}.

  8. Compute the coefficients a_{\ell k} of the polynomial P(u,v) = Q(u-1, v) (see Claim 7), then compute v_k = \sum_{\ell=1,m} a_{\ell k}/\ell. By Claim 6, v_k = \sum_{S: |S| = k} \mathbf{E}[F_1 \mid \mathbf{X}_S = 1], completing step (4).

  9. To compute \sum_{S \subseteq [n]: |S| = k} \mathbf{E}[F_1 \mid \mathbf{X}_{S \cup \{0\}} = 1], first set U_i := 1 (i.e., \Pr(U_i) := 1) for all rows i where x_{i0} = 0, which removes those rows, and then repeat steps (5) to (8).

  10. This completes step (3), and we obtain \textsc{Shap}_{F_1}(X_0). A brute-force evaluation of Eq. (15), useful for cross-checking small instances, is sketched below.
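The following sketch (exponential in n, for cross-checking only; the toy data matrix with distinct rows is an assumption for illustration) evaluates Eq. (15) directly over an empirical distribution, treating a conditioning event of probability 0 as contributing 0, as the proof does for S \not\subseteq J^{(1)}:

import itertools, math

x = [(1, 1, 0, 1),   # row 1: F1 = 1 here; column 0 is feature X0
     (0, 1, 1, 1),
     (1, 0, 0, 1)]
n = len(x[0]) - 1    # features X1..Xn besides X0

def cond_exp_F1(S):  # E[F1 | X_S = 1]; None if the event has probability 0
    rows = [r for r in x if all(r[j] for j in S)]
    return None if not rows else sum(r == x[0] for r in rows) / len(rows)

shap = 0.0
for k in range(n + 1):
    coef = math.factorial(k) * math.factorial(n - k) / math.factorial(n + 1)
    for S in itertools.combinations(range(1, n + 1), k):
        e1 = cond_exp_F1(S + (0,)) or 0.0   # events of probability 0 count as 0
        e0 = cond_exp_F1(S) or 0.0
        shap += coef * (e1 - e0)
print(shap)          # Shap_{F1}(X0) on this toy matrix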

Appendix G Proof of Lemma 13 (2)

Here we prove item (2) of Lemma 13: one can compute \mathbf{E}[\Phi] over a quasi-symmetric probability space in polynomial time, given an oracle for Shap on empirical distributions. If the probability space sets \Pr(U_i) = 1 for some variable, then we can simply replace \Phi with \Phi[U_i := 1], and similarly if \Pr(V_j) = 1. Hence, w.l.o.g., we can assume that the probability space is symmetric.

More precisely, we fix a PP2CNF formula \Phi = \bigwedge (U_i \vee V_j), and let p = \Pr(U_1) = \cdots = \Pr(U_m) and q = \Pr(V_1) = \cdots = \Pr(V_n) define a symmetric probability space. Our task is to compute \mathbf{E}[\Phi] over this space, given an oracle for computing Shap explanations on empirical distributions. Throughout this section we will use the notations introduced in Sec. F.

Let \bm{x} be the matrix associated to \Phi: x_{ij} = 0 iff \Phi contains a clause U_i \vee V_j. We describe our algorithm for computing \mathbf{E}[\Phi] in three steps.

Step 1: \mathbf{E}[\Phi] \leq^P (v_0, v_1, \ldots, v_n). More precisely, we claim that we can compute \Pr(\Phi) using an oracle for computing the quantities v_0, v_1, \ldots, v_n defined in Eq. (19). We have seen in Eq. (21) that \mathbf{E}[\Phi] = \frac{1}{(1+u)^m (1+v)^n} Q(u,v), where u = (1-p)/p and v = (1-q)/q. From Claim 7 we know that Q(u,v) = P(1+u, v), and the coefficients of P(u,v) are the quantities a_{\ell k} defined in Eq. (18). To complete Step 1, we will describe a polynomial-time algorithm that computes the quantities a_{\ell k} associated to our matrix \bm{x}, with access to an oracle for computing the quantities v_0, \ldots, v_n associated to any matrix \bm{x}'.

Starting from the matrix \bm{x}, construct m+1 new matrices, denoted \bm{x}^{(1)}, \bm{x}^{(2)}, \ldots, \bm{x}^{(m+1)}, where, for each \Gamma = 1,m+1, \bm{x}^{(\Gamma)} consists of the matrix \bm{x} extended with \Gamma rows consisting of (1, 1, \ldots, 1). That is, the matrix \bm{x}^{(\Gamma)} has \Gamma + m rows: the first \Gamma rows are (1, \ldots, 1), and the remaining m rows are those of \bm{x}. We run our oracle to compute the quantities v_k on each matrix \bm{x}^{(\Gamma)}. We continue to use the notations g(S), a_{\ell k}, v_k introduced in Equations (17), (18), (19) for the matrix \bm{x}, and add the superscript (\Gamma) for the same quantities associated to the matrix \bm{x}^{(\Gamma)}. We observe:

g^{(\Gamma)}(S) = g(S) \cup \{\text{the } \Gamma \text{ new rows}\}
a^{(\Gamma)}_{\ell+\Gamma, k} = a_{\ell k}

and therefore:

v^{(1)}_k = \frac{1}{1} a_{0k} + \frac{1}{2} a_{1k} + \cdots + \frac{1}{m+1} a_{mk}
v^{(2)}_k = \frac{1}{2} a_{0k} + \frac{1}{3} a_{1k} + \cdots + \frac{1}{m+2} a_{mk}
\cdots
v^{(m+1)}_k = \frac{1}{m+1} a_{0k} + \frac{1}{m+2} a_{1k} + \cdots + \frac{1}{2m+1} a_{mk}

By solving this system of equations, we compute the quantities a_{\ell k} for \ell = 0,m. The matrix of this system is a special case of Cauchy's double alternant determinant:

\det\left[\frac{1}{x_i + y_j}\right] = \frac{\prod_{1 \leq i < j \leq n} (x_i - x_j)(y_i - y_j)}{\prod_{i,j} (x_i + y_j)}

with x_i = i and y_j = j - 1; therefore the matrix of the system is non-singular.

We observe that all matrices \bm{x}^{(1)}, \ldots, \bm{x}^{(m+1)} are "good" (see Definition 5), because their first row is (1, \ldots, 1).

Step 2: Let \bm{x} be a "good" matrix (Definition 5). Then (v_0, v_1, \ldots, v_n) \leq^P V (with V defined in Eq. (20)). In other words, given a matrix \bm{x}, we claim that we can compute the quantities v_0, v_1, \ldots, v_n associated to \bm{x} by Eq. (19) in polynomial time, given access to an oracle for computing the quantity V associated to any matrix \bm{x}'. The algorithm proceeds as follows. For each \Delta = 0, 1, \ldots, n, construct a new m \times (2n) matrix \bm{x}^{(\Delta)} by extending \bm{x} with \Delta new columns set to 1 and n - \Delta new columns set to 0. Thus, \bm{x}^{(\Delta)} is:

\left(\begin{array}{ccccccccccc} x_{11} & x_{12} & \ldots & x_{1n} & 1 & 1 & \ldots & 1 & 0 & \ldots & 0 \\ x_{21} & x_{22} & \ldots & x_{2n} & 1 & 1 & \ldots & 1 & 0 & \ldots & 0 \\ & & \ldots & & & & & & & & \\ x_{m1} & x_{m2} & \ldots & x_{mn} & 1 & 1 & \ldots & 1 & 0 & \ldots & 0 \end{array}\right)

Notice that \bm{x}^{(\Delta)} is "good" for any \Delta. We run the oracle on each matrix \bm{x}^{(\Delta)} to compute the quantity V^{(\Delta)}. We start by observing the following relationships between the parameters of the matrix \bm{x} and those of the matrix \bm{x}^{(\Delta)}:

g^{(\Delta)}(S) = g(S \cap [n])
a^{(\Delta)}_{\ell p} = \sum_{k=0,\min(p,n)} \binom{\Delta}{p-k} a_{\ell k}
v^{(\Delta)}_p = \sum_{k=0,\min(p,n)} \binom{\Delta}{p-k} v_k

Notice that, when p > n + \Delta, then v^{(\Delta)}_p = 0. We use the oracle to compute the quantity V^{(\Delta)}, which is:

V^{(\Delta)} = \sum_{p=0,2n} \frac{p!(2n-p)!}{(2n+1)!} v^{(\Delta)}_p
 = \frac{1}{2n+1} \sum_{p=0,n+\Delta} \frac{1}{\binom{2n}{p}} v^{(\Delta)}_p
 = \frac{1}{2n+1} \sum_{p=0,n+\Delta} \sum_{k=0,\min(p,n)} \frac{\binom{\Delta}{p-k}}{\binom{2n}{p}} v_k
 = \frac{1}{2n+1} \sum_{k=0,n} \sum_{p=k,k+\Delta} \frac{\binom{\Delta}{p-k}}{\binom{2n}{p}} v_k
 = \frac{1}{2n+1} \sum_{k=0,n} \left(\sum_{q=0,\Delta} \frac{\binom{\Delta}{q}}{\binom{2n}{k+q}}\right) v_k
 \stackrel{\mathrm{def}}{=} \frac{1}{2n+1} \sum_{k=0,n} A_{\Delta,k} \cdot v_k

Thus, after running the oracle on all matrices \bm{x}^{(0)}, \ldots, \bm{x}^{(n)}, we obtain a system of n+1 equations with n+1 unknowns v_0, v_1, \ldots, v_n. It remains to prove that the system's matrix, A_{\Delta,k}, is non-singular. Let us define the following matrices:

A_{\Delta,k} \stackrel{\mathrm{def}}{=} \sum_{q=0,\Delta} \frac{\binom{\Delta}{q}}{\binom{2n}{k+q}} \qquad \Delta = 0,n; \; k = 0,n
B_{\Delta,q} \stackrel{\mathrm{def}}{=} \binom{\Delta}{q} \qquad \Delta = 0,n; \; q = 0,n
C_{q,k} \stackrel{\mathrm{def}}{=} \frac{1}{\binom{2n}{k+q}} \qquad q = 0,n; \; k = 0,n

It is immediate to verify that \bm{A} = \bm{B} \cdot \bm{C}, so it suffices to prove \det(\bm{B}) \neq 0 and \det(\bm{C}) \neq 0. We start with \bm{B}, and for that consider the Vandermonde matrix \bm{X} \stackrel{\mathrm{def}}{=} V(x_0, x_1, \ldots, x_n), X_{qt} \stackrel{\mathrm{def}}{=} x_t^q. Denoting \bm{Y} \stackrel{\mathrm{def}}{=} \bm{B} \cdot \bm{X}, we have that

Y_{\Delta t} = \sum_{q=0,n} B_{\Delta,q} X_{q,t} = \sum_{q=0,n} \binom{\Delta}{q} x_t^q = (1+x_t)^\Delta

is also a Vandermonde matrix: \bm{Y} = V(1+x_0, 1+x_1, \ldots, 1+x_n). We have \det(\bm{Y}) \neq 0 when x_0, x_1, \ldots, x_n are distinct, proving that \det(\bm{B}) \neq 0.

Finally, we prove \det(\bm{C}) \neq 0. For that, we prove a slightly more general result. For any N \geq 2n, denote by \bm{C}^{(n,N)} the following (n+1) \times (n+1) matrix:

\bm{C}^{(n,N)} \stackrel{\mathrm{def}}{=} \left(\begin{array}{cccc} \frac{1}{\binom{N}{0}} & \frac{1}{\binom{N}{1}} & \ldots & \frac{1}{\binom{N}{n}} \\ \frac{1}{\binom{N}{1}} & \frac{1}{\binom{N}{2}} & \ldots & \frac{1}{\binom{N}{n+1}} \\ & & \ldots & \\ \frac{1}{\binom{N}{n}} & \frac{1}{\binom{N}{n+1}} & \ldots & \frac{1}{\binom{N}{2n}} \end{array}\right)

We will prove that \det(\bm{C}^{(n,N)}) \neq 0; our claim follows from the special case N = 2n. For the base case, n = 0, \bm{C}^{(0,N)} is a 1 \times 1 matrix equal to 1/\binom{N}{0}, hence \det(\bm{C}^{(0,N)}) = 1. For the induction step, we perform elementary column operations (which preserve the determinant) to make the last row of the resulting matrix consist of zeros, except for the last entry.

Consider an arbitrary row i, and two adjacent columns j, j+1 in that row:

\ldots \quad \frac{1}{\binom{N}{i+j}} \quad \frac{1}{\binom{N}{i+j+1}} \quad \ldots

We use the fact that \binom{N}{i+j} = \binom{N}{i+j+1} \frac{i+j+1}{N-i-j} and rewrite the two adjacent elements as:

\ldots \quad \left(\frac{1}{\binom{N}{i+j+1}} \times \frac{N-i-j}{i+j+1}\right) \quad \frac{1}{\binom{N}{i+j+1}} \quad \ldots

Now, for each j = 0, 1, 2, \ldots, n-1, we subtract column j+1, multiplied by \frac{N-(n+j)}{(n+j)+1}, from column j. The last row becomes (0, 0, \ldots, 0, \frac{1}{\binom{N}{2n}}), which means that \det(\bm{C}^{(n,N)}) is equal to \frac{1}{\binom{N}{2n}} times the upper-left (n \times n) minor.

Now, we check what happens with the element at position (i,j). After the subtraction, it becomes

\frac{1}{\binom{N}{i+j+1}} \times \left(\frac{N-(i+j)}{(i+j)+1} - \frac{N-(n+j)}{(n+j)+1}\right)

This expression can be rewritten as:

\frac{1}{\binom{N}{(i+j)+1}} \times \left(\frac{N-(i+j)}{(i+j)+1} - \frac{N-(n+j)}{(n+j)+1}\right)
 = \frac{(N-i-j-1)!(i+j+1)!}{N!} \cdot \frac{(N+1)(n-i)}{(i+j+1)(n+j+1)}
 = \frac{(N-i-j-1)!(i+j)!}{(N-1)! \, N} \cdot \frac{(N+1)(n-i)}{n+j+1}
 = \frac{1}{\binom{N-1}{i+j}} \cdot \frac{(N+1)(n-i)}{N(n+j+1)}

Note that this rewriting applies to the whole (n \times n) upper-left minor of \bm{C}^{(n,N)}; the element in the lower-right corner of the matrix remains 1/\binom{N}{2n}. Observe that the (i,j)-th entry of this minor is precisely the (i,j)-entry of \bm{C}^{(n-1,N-1)}, multiplied by \frac{(N+1)(n-i)}{N(n+j+1)}. Here \frac{N+1}{N} is a global constant, n-i is the same constant in the entire row i, and \frac{1}{n+j+1} is the same constant in the entire column j. We factor out the global constant \frac{N+1}{N}, factor out n-i from each row i, and factor out \frac{1}{n+j+1} from each column j, and obtain the following recurrence:

\det(\bm{C}^{(n,N)}) = \frac{1}{\binom{N}{2n}} \left(\frac{N+1}{N}\right)^n \times \frac{\prod_{i=0}^{n-1}(n-i)}{\prod_{j=0}^{n-1}(n+j+1)} \times \det(\bm{C}^{(n-1,N-1)})

It follows by induction on n that \det(\bm{C}^{(n,N)}) \neq 0.
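A quick numeric spot-check of this induction (a sketch only; exact rational arithmetic via fractions avoids any floating-point doubt at these sizes):

from fractions import Fraction
from math import comb

def det(M):  # exact Gaussian elimination over the rationals
    M = [row[:] for row in M]
    d = Fraction(1)
    for i in range(len(M)):
        piv = next(r for r in range(i, len(M)) if M[r][i] != 0)
        if piv != i:
            M[i], M[piv] = M[piv], M[i]
            d = -d
        d *= M[i][i]
        for r in range(i + 1, len(M)):
            f = M[r][i] / M[i][i]
            M[r] = [M[r][c] - f * M[i][c] for c in range(len(M))]
    return d

for n in range(1, 6):
    N = 2 * n
    C = [[Fraction(1, comb(N, i + j)) for j in range(n + 1)]
         for i in range(n + 1)]
    print(n, det(C) != 0)   # prints True for each n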

Step 3: Let \bm{x} be a "good" matrix (Definition 5). Then V \leq^P \textsc{Shap}. More precisely, we claim that we can compute the quantity V associated to a matrix \bm{x}, as defined in Eq. (20), in polynomial time, by using an oracle for computing \textsc{Shap}_{F_1}(X_j) over any matrix \bm{x}'.

We modify the matrix \bm{x} as follows. We add a new attribute X_0 whose value is 1 only in the first row, and let F_1 = X_0 denote the function that returns the value of feature X_0. We show here the new matrix \bm{x}', augmented with the values of the function F_1:

\left(\begin{array}{ccccc|c} X_0 & X_1 & X_2 & \ldots & X_n & F_1 \\ \hline 1 & x_{11} & x_{12} & \ldots & x_{1n} & 1 \\ 0 & x_{21} & x_{22} & \ldots & x_{2n} & 0 \\ & \ldots & \ldots & \ldots & \ldots & \\ 0 & x_{m1} & x_{m2} & \ldots & x_{mn} & 0 \end{array}\right)

We run our oracle to compute \textsc{Shap}_{F_1}(X_0) over the matrix \bm{x}'. The value \textsc{Shap}_{F_1}(X_0) is given by Eq. (15); note that the matrix \bm{x}' has n+1 columns, while Eq. (15) is stated for a matrix with n columns. Therefore, since \mathbf{E}[F_1 \mid \mathbf{X}_{S \cup \{0\}} = 1] = 1 for any set S, we have:

\textsc{Shap}_{F_1}(X_0) = 1 - \sum_{k=0,n} \frac{k!(n-k)!}{(n+1)!} \sum_{S \subseteq [n]: |S| = k} \mathbf{E}[F_1 \mid \mathbf{X}_S = 1]

Since \bm{x} is "good", so is the new matrix \bm{x}', and, by Claim 6, for any k = 0,n:

\sum_{S: |S| = k} \mathbf{E}[F_1 \mid \mathbf{X}_S = 1] = v_k

This implies that we can use the value \textsc{Shap}_{F_1}(X_0) returned by the oracle to compute the quantity:

\sum_{k=0,n} \frac{k!(n-k)!}{(n+1)!} \sum_{S: |S| = k} \mathbf{E}[F_1 \mid \mathbf{X}_S = 1] = \sum_{k=0,n} \frac{k!(n-k)!}{(n+1)!} v_k = V

which completes Step 3.

Putting It Together. Given a PP2CNF formula \Phi = \bigwedge (U_i \vee V_j) and two probability values p = \Pr(U_1) = \cdots = \Pr(U_m) and q = \Pr(V_1) = \cdots = \Pr(V_n), to compute \mathbf{E}[\Phi] we proceed as follows:

  • Construct the 0/1-matrix associated to \Phi; denote it \bm{x}.

  • Construct m+1 matrices \bm{x}^{(\Gamma)}, \Gamma = 1,m+1, by adding \Gamma rows (1, 1, \ldots, 1) at the beginning of the matrix.

  • For each matrix \bm{x}^{(\Gamma)}, construct n+1 matrices \bm{x}^{(\Gamma,\Delta)}, \Delta = 0,n, by adding n columns, of which the first \Delta columns are 1 and the others are 0.

  • For each \bm{x}^{(\Gamma,\Delta)}, construct one new matrix (\bm{x}^{(\Gamma,\Delta)})' by adding a column (1, 0, 0, \ldots, 0). Call this new column X_0.

  • Use the oracle to compute \textsc{Shap}_{F_1}(X_0). From it, compute the value V^{(\Gamma,\Delta)} associated with the matrix \bm{x}^{(\Gamma,\Delta)} (Step 3).

  • Using the values V^{(\Gamma,0)}, V^{(\Gamma,1)}, \ldots, V^{(\Gamma,n)}, compute the values v^{(\Gamma)}_0, v^{(\Gamma)}_1, \ldots, v^{(\Gamma)}_n associated to the matrix \bm{x}^{(\Gamma)} (Step 2).

  • For each k = 0,n, use the values v^{(1)}_k, v^{(2)}_k, \ldots, v^{(m+1)}_k to compute the coefficients a_{0k}, a_{1k}, \ldots, a_{mk} associated to the matrix \bm{x} (Step 1).

  • At this point we have all coefficients a_{\ell k} of the polynomial P(u,v).

  • Compute the coefficients b_{\ell k} of the polynomial Q(u,v) = P(1+u, v).

  • Finally, return \mathbf{E}[\Phi] = p^m q^n \, Q\left(\frac{1-p}{p}, \frac{1-q}{q}\right), as in Claim 8.

This concludes the entire proof of Lemma 13.