GEM: Generalised Expectation-Maximisation

Making Neural Networks Interpretable with Attribution: Application to Implicit Signals Prediction

Darius Afchar and Romain Hennequin Deezer ResearchParisFrance [email protected]

(2020)

Abstract.

Explaining recommendations enables users to understand whether recommended items are relevant to their needs and has been shown to increase their trust in the system. More generally, if designing explainable machine learning models is key to check the sanity and robustness of a decision process and improve their efficiency, it however remains a challenge for complex architectures, especially deep neural networks that are often deemed ”black-box”. In this paper, we propose a novel formulation of interpretable deep neural networks for the attribution task. Differently to popular post-hoc methods, our approach is interpretable by design. Using masked weights, hidden features can be deeply attributed, split into several input-restricted sub-networks and trained as a boosted mixture of experts. Experimental results on synthetic data and real-world recommendation tasks demonstrate that our method enables to build models achieving close predictive performances to their non-interpretable counterparts, while providing informative attribution interpretations.

Interpretable machine learning, Implicit Recommender System

^†^†journalyear: 2020^†^†copyright: acmlicensed^†^†conference: Fourteenth ACM Conference on Recommender Systems; September 22–26, 2020; Virtual Event, Brazil^†^†booktitle: Fourteenth ACM Conference on Recommender Systems (RecSys ’20), September 22–26, 2020, Virtual Event, Brazil^†^†price: 15.00^†^†doi: 10.1145/3383313.3412253^†^†isbn: 978-1-4503-7583-2/20/09^†^†ccs: Information systems Recommender systems^†^†ccs: Mathematics of computing Computing most probable explanation^†^†ccs: Information systems Personalization

1. Introduction

In recent years, deep neural networks have been successfully used in a wide range of fields to predict, classify, recommend and generate content, often achieving state-of-the-art results. However, understanding the behaviour of such modern machine learning models remains a challenge compared to simpler methods with transparent computation processes (e.g. linear regressions or decision trees).

Yet, explainability is needed in many fields where the sanity, robustness and fairness of a decision has to be checked, for instance in the medical field, autonomous driving or business analytics, hampering the adoption of those highly-performing models. In the field of recommender systems, although providing explanations is not a necessity, it has been widely shown to improve users satisfaction and trust in the system (Herlocker et al., 2000; Sinha and Swearingen, 2002; Tintarev and Masthoff, 2007). The use of neural networks does not consistently lead to an improvement in recommendation (Dacrema et al., 2019), but their ubiquity across various recent methods (Shenbin et al., 2020; Hidasi et al., 2016; Li and She, 2017; He et al., 2017; Liang et al., 2018; He et al., 2018; Van den Oord et al., 2013) brings us to the study of neural networks interpretability.

In particular, real-world applications often come with the hardship of taking into account implicit signals (Hu et al., 2008) from users, namely signals with a broad semantic, that indirectly reflect inaccessible, high-level or conceptual data, as personal tastes of users or specific contexts. Neural networks can provide a way to model the complex, sometimes multimodal, nature of such implicit signals. As an example, for the implicit sequential skip prediction challenge in music streaming sessions¹¹1 See WSDM Skip Prediction Challenge: https://aicrowd-design.netlify.app/template-challenge-overview, traditional models have been outmatched in favour of deep-network-based approaches (Zhu and Chen, 2019; Hansen et al., 2019; Chang et al., 2019). Interpretation was not particularly studied is the latter works, however, we argue that producing interpretations is beneficial to our understanding of the studied implicit signals, which may allow to wittily leverage implicit users feedback in recommender systems by making them more explicit.

In this paper, we study the interpretation of implicit signals through the lens of feature attribution: the behaviour of a model is simplified to the knowledge of the input dimensions that are primarily used to make a prediction. Indeed, attribution is relevant for implicit signals to unveil their underlying nature. In the example of skip prediction, it enables to discriminate between the case of a user disliking a song because of its music content, and another exploring the catalog and thus quickly skipping through content, which should be interpreted as distinct feedbacks by a recommender system. This simple dichotomy is crucial for music streaming services to refine user music profiles and is currently underexploited due to the implicitness of skips. If attribution is straightforward with linear models, it is more difficult to trace the origin of a prediction through usually multi-layered neural networks.

We propose the formulation of a novel class of deep neural networks that are intrinsically interpretable in term of feature attribution. In detail, by using mask constraints in linear layers, we define a deep structured neural network that allows to trace for every neuron what input data its computation is based on. This allows to emulate several expert sub-networks, each based on a specific restriction of the input. Experts are constrained to be residuals of simpler available experts in order to enforce sparsity. This mixture of expert is then jointly trained using a Generalised Expectation-Maximisation (GEM) algorithm. For inference, our network both produces a prediction and an attribution estimation. This method can be applied to make many modern deep architectures interpretable (e.g. Transformer).

Our contributions are the following:

•

We formulate a way to make deep neural networks architecture interpretable while achieving almost similar performances as their non-interpretable counterpart for several recommendation tasks ;
•

we derive a fast joint training algorithm for this novel architecture inspired by boosting ;
•

we demonstrate the effectiveness of our model on the prediction and interpretation of implicit signals and its application for the real-world task of sequential skip prediction.

If our method was designed with implicit signals and recommendation in mind, it is not limited to them and could be applied to a broad class of attribution tasks. It should also be noted that our intrinsically interpretable method does not preclude the use of popular post-hoc methods and both approaches can complement each other.

2. Related work

2.1. Interpretability for deep neural networks

Interpretability in machine learning is an expanding research field that encompasses many different methods. A popular branch of interpretability aims at providing a post-hoc analysis on how the output of a model is related to its input data. This is the case of the LIME method (Ribeiro et al., 2016), that locally computes a linear model of a trained black-box model, thus providing a simplified explanation of how each input dimension influences the predicted target label for a given input space region. With the same idea, DeepRED (Zilke et al., 2016) uses decision trees as the simplified proxy model, allowing to interpret a deep model as a composition of extracted rules. Going further, methods such as DeepLIFT (Shrikumar et al., 2017), LRP (Bach et al., 2015) and other saliency methods (Simonyan et al., 2014; Zhou et al., 2016; Selvaraju et al., 2017; Smilkov et al., 2017) or game-theory-based SHAP (Lundberg and Lee, 2017) can interestingly propagate feature importance values throughout the layers of a deep model, yielding interpretability up to a neuron-wise granularity. Another branch of interpretability is focused on the elaboration of explanation-producing models. This can be done in a supervised manner as in (Hendricks et al., 2016; Kim et al., 2018) when data about the desired output explanation are available and well-posed, allowing to produce high-level explanations that are more human understandable, or in an unsupervised manner using intrinsically interpretable model. Intrinsic interpretability is a desirable property for high stakes decision models (Rudin, 2019), but also for researchers to inspect, understand and improve how neural network components manipulate data. This is for instance the case of the attention mechanism and its extended multi-head attention module (Vaswani et al., 2017), widely used in natural language processing tasks, that reveals the specialisation of heads into different classes of reading mechanisms for words in a sentence (Voita et al., 2019), or atoms in a molecules to form chemical patterns (Maziarka et al., 2020). Using information theory, InfoGAN (Chen et al., 2016) learns to disentangle latent representations during training, making them interpretable and manipulable. The information bottleneck principle (Tishby et al., 1999) is also a promising concept for interpretability that was successfully used in (Schulz et al., 2019) for feature attribution. Using this taxonomy, our method is an intrinsic interpretation method focused on the attribution problem. We do not assume to have access to target explanations, making the interpretability task unsupervised. Differently from information-theoretic and variational methods, we do not require priors on the attributions, allowing to solve a broader class of attribution problems (see section 3.1). We additionally leverage the natural interpretation power of multi-head attention modules in our chosen deep models, though not being limited to it.

2.2. Selection

Our attribution method is related to generalised additive models (Hastie and Tibshirani, 1990) that model a function as a sum of univariate sub-functions. This formulation is intrinsically interpretable as the contribution of each input feature can be assessed by inspecting corresponding univariate functions. Going further, pairwise interactions can be added, as in (Lou et al., 2013): freezing trained univariate functions, the authors add bivariate functions that are trained on residual points in a boosting-like manner (Freund et al., 1999). In this spirit, our method extends (Lou et al., 2013) to any multivariate functions. However, residuality is replaced by general gating functions on classification confidence of child functions with fewer input variables. This formulation is closely related to the mixture of experts (Jacobs et al., 1991; Jordan and Jacobs, 1994), allowing us to train our model jointly instead of iteratively as in (Lou et al., 2013). Our formulation of ensemble learning is also reminiscent of subset selection (Friedman et al., 2001). To avoid the high combinatorial number of best subset candidates, we restrict their space to a reasonable cardinality using human knowledge.

2.3. Structured networks

We use a judiciously structured deep neural network to emulate several deep sub-networks acting as our different multivariate experts. Our original inspiration comes from YOLO (Redmon et al., 2016), a paramount model in the object detection literature. Interestingly, the network outputs several candidates bounding boxes and self-confidence scores, only one predictor is then selected, leading to a specialisation of predictors to specific classes of objects, as reported by the authors. In (Voita et al., 2019), the authors also report a natural specialisation of different components of a multi-attention module. Our method aims at inducing this specialisation to predefined input subsets of interest for interpretability. We manipulate and route neurons by blocks, which can be related to capsule networks (Hinton et al., 2011). Cunningly structuring a neural network has indeed been demonstrated to produce intrinsic interpretability, as in the recent RPGAN (Voynov and Babenko, 2019). However, the routing process is fixed in our method as interpretation subsets are hyper-parameters in this work. We have explored the use of dynamic subsets, which draws our structured network architecture closer to the latter methods, but leave it out of the scope of this paper.

3. Proposed Method

In this section, we introduce the different building blocks of our method. An overview is given in figure 1.

Refer to caption — Figure 1. Method overview An input $x$ is partitioned into coherent groups of features $\{X_{1},...X_{N}\}$ , combined into subsets of interest for interpretation $\{S_{1},...S_{H}\}$ , then fed to a structured neural network that preserves subset dependence, the output of which is combined in a mixture-of-expert manner to provide a prediction $y$ . Additionally, sub-networks specialised on each interpretation subset can be accessed to provide an attribution vector to interpret the origin of the prediction.

3.1. Problem formulation

We consider the supervised classification setting where $x\sim X\in\mathbb{R}^{n}$ is the input random variable and $y\sim Y\in\{0,1\}$ its target label, our model $f_{\theta}:x\mapsto y$ maps the sample of $X$ to $Y$ and is parametrised by $\theta$ . In this paper, we restrict our study to the binary case for $Y$ , but our method can be extended to the multi-class and continuous case with little effort, but because the latter settings need additional discussion and experiments, we leave them for future work. Parallel to the classification task, we want our model to solve an attribution task by yielding interpretation masks for its inputs that highlight the features that were the most relevant to make the prediction. We introduce a random mask variable $m\sim M$ that takes values in a finite space $\mathcal{M}\subset\{0,1\}^{n}$ and depends on $X$ such that: we want (a) the model to be able to accurately predict $Y$ from $X\circ M$ , where $\circ$ denotes the Hadamard product, while (b) having $M$ as sparse as possible. We do not assume $M$ to be observable, making it a latent variable for our model.

The way we choose to solve the attribution problem is by considering different restrictions of the input $X\circ M$ , and feeding them into several sub-models $f^{m}_{\theta}$ (experts). We then average expert predictions by introducing several associated selection functions $\alpha^{m}_{\theta}$ that activate different experts depending on the input:

(1)

f_{\theta}(x)=\sum_{m\in\mathcal{M}}\alpha^{m}_{\theta}(x\circ m)f^{m}_{\theta}(x\circ m)/\sum_{m\in\mathcal{M}}\alpha^{m}_{\theta}(x\circ m)

This latter ensemble technique is closely related to the mixture of experts (Jacobs et al., 1991; Jordan and Jacobs, 1994), but with input restrictions.

Our objective is to (a) find a maximum solution for the likelihood $p_{\theta}(y|x)=\sum_{m\in\mathcal{M}}p_{\theta}(y|x,m)p_{\theta}(m|x)$ , and (b) ensure the sparsity of $m$ by design. As detailed in section 3.3.4, we will maximise the likelihood using GEM (Bishop, 2006), which involves the incremental update of the following conditional expectation $\mathcal{Q}$ :

(2)

\mathcal{Q}(\theta,\theta_{\textrm{old}})=\mathbb{E}_{x\sim X,y\sim Y}[\sum_{m\in\mathcal{M}}p_{\theta_{\textrm{old}}}(m|x)\ln p_{\theta}(y|x,m)]

The posterior distribution $p_{\theta}(M|X)$ is modelled by the selection functions $\alpha^{m}_{\theta}$ . Depending on the problem, the marginal likelihood $p_{\theta}(y|x,m)$ has to be modelled in different ways, we derive $\mathcal{Q}$ in the binary case with standard assumptions (Bishop, 2006):

(3)		$\displaystyle p_{\theta}(y\|x,m)$	$\displaystyle\sim\mathcal{B}(f^{m}_{\theta}(x\circ m))$
(4)		$\displaystyle\arg\max_{\theta}\mathcal{Q}$	$\displaystyle=\arg\min_{\theta}\mathbb{E}\sum_{m\in\mathcal{M}}\alpha^{m}_{\theta_{\textrm{old}}}(x\circ m)\textrm{BCE}(y,f^{m}_{\theta}(x\circ m))$

where BCE stands for Binary Cross-Entropy. In the following sections we develop the different terms of equation (4): in section 3.2 we prune the mask candidates space $\mathcal{M}$ , we then study the computation of $\alpha^{m}_{\theta}$ to enforce the sparsity of $M$ by design in section 3.3 and finally we detail the architecture of $f^{m}_{\theta}$ in section 3.4.

3.2. Latent space reduction

Our formulation requires to sum likelihoods conditioned on the space $\mathcal{M}$ of all candidate interpretation masks. However, the number of mask is exponential with the $n$ dimensions of $X$ , making the computation intractable for realistic values of $n$ . The working hypothesis of this paper is that we do not need to consider all possible masks for $M$ .

Our first approximation consists in considering that the masks are group-sparse (Huang et al., 2010). Feature attribution can indeed lack robustness by yielding noisy or incoherent subset of features that act as adversarial solutions to the interpretability task and make them less human-understandable (Alvarez-Melis and Jaakkola, 2018). Only allowing group-sparsity of coherent subsets of features mitigates the effect by regularising the allowed solutions. We thus partition $X$ into $N$ disjoint subsets $\mathcal{X}=\{X_{1},\dots X_{N}\}$ . In the example skip prediction attribution task we have mentioned in the introduction, we could partition the input space into interaction features ( $X_{1}$ ) versus musical features ( $X_{2}$ ) to understand the origin of a skip feedback.

In practical applications, we often know the structure of $M$ and the sparsity patterns we can obtain (Huang et al., 2011; Zhao et al., 2009). In such cases, we can further prune the set of mask candidates to only match consistent patterns. We denote $\mathcal{S}$ , the resulting subset of size $H$ of all possible combinations of subsets $X_{i}$ : $\mathcal{S}=\{S_{1},\dots S_{H}\}\subset\mathcal{P}(\mathcal{X})$ , with $\mathcal{P}$ the powerset. With the previous example of skip prediction, if we had further split musical features into genre ( $X_{2}$ ) and mood estimation ( $X_{3}$ ), it would be coherent to consider the subsets $\{X_{2}\}$ , $\{X_{3}\}$ , and the aggregated $\{X_{2},X_{3}\}$ musical features.

Doing so, instead of summing on a space of size $2^{n}$ for $p(m|x)$ , we assume we can work with a reasonable number $H\ll 2^{n}$ of masks-by-block candidates for interpretations. Of course, $\mathcal{X}$ and $S$ can be manually tuned to obtain coarser or finer level of interpretability. In the following sections, we denote $\mathcal{M}=\{m_{1},\dots m_{H}\}$ the interpretation masks we assume given and fixed and isomorphic to a given $\mathcal{S}$ through the relation $X\circ m_{i}=S_{i}$ .

3.3. Selection functions

3.3.1. Toy examples

To fix ideas, a bidimensional toy example with four input clusters is given in figure 2. We have illustrated a solution with two univariate experts, i.e. $\mathcal{S}=\{S_{1},S_{2}\}=\{\{X_{1}\},\{X_{2}\}\}$ : each predicts two separable clusters and their respective selection function have low values where the remaining clusters are mixed. A bivariate expert ( $S_{3}=\{X_{1},X_{2}\}$ ) can also solve the task (fig. 4), but in order to have the sparsest $M$ , we would favour the first solution and have a zero-selection on the bivariate expert. In figure 4, we add four outer clusters to example (a) that are not separable when projected on $X_{1}$ nor $X_{2}$ . In this case, the new clusters have to be attributed to the bivariate expert. A single bivariate expert could solve the whole task alone, but in order to get the sparsest mask $M$ , denoting the minimal required input features to make a correct prediction in a specific part of the input space, using univariate experts is sufficient for the central clusters.

3.3.2. Boosting

In the general case, $\mathcal{S}$ is composed of many potentially overlapping subsets. We can represent subsets and experts with a directed acyclic graph (DAG) defined as the Hasse diagram of the subsets partially ordered by inclusions (Bach, 2009): simple experts with few variables are children of parent experts of growing input support. The same way we select univariate experts over the bivariate expert in the toy example, the sparsity constraint means that the selection of a child subset should induce the deselection of parent subset it is included in. We ensure this property by design in a boosting-like manner: for each sample, we try to select an expert restricted to the smallest set of features, then, if it is not selected, we move to a parent expert.

To allow training with a gradient-descent method, we consider a stochastic relaxation of the selection of the expert. We introduce the parametric functions $g^{i}_{\theta}:S_{i}\rightarrow[0,1]$ . We then define the selection functions recursively:

(1)

Atomic subsets: For all $S_{i}\in\mathcal{S}$ such that $\forall S_{j}\in\mathcal{S}\setminus S_{i},S_{j}\not\subset S_{i}$ , then $\alpha^{i}_{\theta}(x)=g^{i}_{\theta}(x\circ m_{i})$ ;
(2)

Mixed subsets: For the remaining subsets, we introduce the notation $\omega(i)=\{j|S_{j}\subsetneq S_{i}\}$ for the set of strictly included subsets and $\alpha^{i}_{\theta}(x)=g^{i}_{\theta}(x\circ m_{i})\prod_{j\in\omega(i)}(1-g^{j}_{\theta}(x\circ m_{j}))$

By induction, the functions $\alpha^{i}$ we design satisfy the following properties:

(1)

Probability: $\alpha^{i}\in[0,1]$
(2)

Input restriction dependence: $\alpha^{i}$ is entirely conditioned on $S_{i}$ , and is thus blind to eventual parent subsets
(3)

Deselection induced by children: $\alpha^{i}\leq\prod_{j\in\omega(i)}(1-g^{j}_{\theta})$

3.3.3. Parametrisation using a neural network

There are many possible choices for $f^{i}_{\theta}$ and $g^{i}_{\theta}$ . In the rest of the paper, we study the use of deep neural networks, that have good generalisation capabilities, and their specific adaptation to our interpretation framework. Let us denote $F_{\theta}^{i}$ a deep neural network function restricted on $S_{i}$ .

For the binary problems, we propose to use a tanh function on the output layer. Then, we use the joint definition:

(5)

f^{i}_{\theta}(x)=(F_{\theta}^{i}(x)+1)/2

(6)

g^{i}_{\theta}(x)=|F_{\theta}^{i}(x)|

The neural network output simultaneously makes a prediction for $Y$ and its absolute value indicates a confidence value for selection as an expert. We experimentally found that for inference, using $g^{i}_{\theta}(x)=F_{\theta}^{i}(x)^{2}$ or $g^{i}_{\theta}(x)=\frac{2|F_{\theta}^{i}(x)|^{p}}{1+|F_{\theta}^{i}(x)|^{p-1}}$ also worked well to dampen noisy values around 0 then smoothly increasing selection importance for stronger predictions.

3.3.4. Training

We could sequentially maximise the likelihood for each expert with subsets of increasing cardinality, or even group independent subsets for fewer training phases, as in (Lou et al., 2013) where all univariate functions are trained in parallel before bivariate functions. This approach can be time-consuming, especially with neural networks as experts.

Instead, we train all experts in parallel using EM. However, with neural networks, we do not have a tractable solution for $\arg\max_{\theta}\mathcal{Q}(\theta,\theta_{\textrm{old}})$ . This issue is addressed by GEM by substituting the maximisation with an incremental update of $\mathcal{Q}$ . The training of our models follows two alternating steps:

(1)

E-step Evaluate $p_{\theta^{\textrm{old}}}(m_{i}|x)=\alpha^{i}_{\theta^{\textrm{old}}}(x\circ m_{i})=|F^{i}_{\theta^{\textrm{old}}}(x\circ m_{i})|\prod_{j\in\omega(i)}(1-|F^{j}_{\theta^{\textrm{old}}}(x\circ m_{j})|)$ , which weights the sample $x$ differently for each expert with a deselection for parents ;
(2)

Generalised M-step Perform a gradient-step update: $\theta_{\textrm{new}}=\theta_{\textrm{old}}+\eta\frac{\partial\mathcal{Q}}{\partial\theta}(\theta,\theta_{\textrm{old}})$ , with $\eta$ the learning rate.

Following the derivation of $\mathcal{Q}$ in equation (4), and using equation (5), we have:

(7)

\frac{\partial\mathcal{Q}}{\partial\theta}(\theta,\theta_{\textrm{old}})=-\mathbb{E}_{x\sim X,y\sim Y}\left[\sum_{m\in\mathcal{M}}\alpha^{m}_{\theta_{\textrm{old}}}(x\circ m)\frac{\partial\textrm{BCE}}{\partial\theta}\left(y,\frac{F^{m}_{\theta}(x\circ m)+1}{2}\right)\right]

The M-step can be easily implemented in modern deep learning libraries to propagate the updates through the layers of the experts $F_{\theta}^{m}$ . In the next section, we detail their architecture.

3.4. Making neural networks interpretable

So far, we have only considered the $H$ experts as being distinct entities. We show that assuming that the collections $(f^{i}_{\theta})$ and $(g^{i}_{\theta})$ are based on a neural network functions $(F^{i}_{\theta})$ , everything can be grouped into a single deep neural network. We formulate such neural network by induction, assuming the conventional multi-layered architecture that alternates linear layers and activation functions. We then extend our method to a broader class of deep models.

3.4.1. One-layer neural network

We assume that functions $(F^{i}_{\theta})$ have a single linear layer with an activation function $\sigma$ :

(8)

\displaystyle F^{i}_{\theta}(x)=\sigma(W^{i}_{0}(x\circ m_{i}))=\sigma(\tilde{W}^{i}_{0}x)

We suppose we have added a scalar $1$ to the input $x$ to account for bias when multiplying by matrix $(W^{i}_{0})$ to simplify notations. $\tilde{W}^{i}_{0}$ corresponds to matrix $W^{i}_{0}$ with null columns at the indices where $m_{i}$ is null, i.e. to $W^{i}_{0}\circ m_{i}$ with the Hadamard product applied row by row. Then, we can stack the matrices and define $F_{\theta}(x)$ :

(9)

\displaystyle F_{\theta}(x)=\begin{bmatrix}F^{1}_{\theta}(x)...&F^{H}_{\theta}(x)\end{bmatrix}^{T}=\sigma\left(\begin{bmatrix}\tilde{W}_{0}^{1}...&\tilde{W}_{0}^{H}\end{bmatrix}^{T}x\right)=\sigma(\tilde{W}_{0}x)

We identify $F_{\theta}$ as an overarching single-layer neural network with activation $\sigma$ and matrix parameter $\tilde{W}_{0}$ . The latter matrix is typically sparse because of the masks $(m_{i})$ successively applied on each row and can be efficiently implemented using a weight constraint in standard deep learning libraries. We have shown the base case: using masks, we can create one-layer networks for which the output dependencies to interpretation subsets can be traced. We must now prove by induction that it can be extended to several stacked layers.

3.4.2. Multi-layer neural network

When functions $(F^{i}_{\theta})$ are multi-layered neural networks with $K$ layers, activation functions $\sigma_{k}$ , matrices parameters $W_{k}^{i}$ and hidden layer output $h^{i}_{k}$ on layer $k$ , where $h^{i}_{K}\triangleq F^{i}_{\theta}$ , we have by definition:

(10)		$\displaystyle h^{i}_{0}(x)=$	$\displaystyle\sigma_{0}(W_{0}^{i}(x\circ m_{i}))=\sigma_{0}(\tilde{W}^{i}_{0}x)$
(11)		$\displaystyle h^{i}_{k+1}(x)=$	$\displaystyle\sigma_{k}(W_{k}^{i}h^{i}_{k}(x))\quad\forall k\in[0..K-1]$

Our goal is to define overarching hidden layers $h_{k}$ that are conditioned on the corresponding restriction $S_{i}$ for $k>0$ :

(12)

\displaystyle h_{k}=\begin{bmatrix}x\circ m_{1}\rightarrow h^{1}_{k}(x)&\dots&x\circ m_{H}\rightarrow h^{H}_{k}(x)\end{bmatrix}^{T}

Let us assume we have already built interpretable hidden layers up to layer $k$ . As in the case of the first layer, we would like to define a masked matrix $W_{k}$ such that $h_{k+1}(x)=\sigma_{k}(W_{k}h_{k}(x))$ , while preserving the correct input dependencies.

A first approach is to define $W_{k}$ as a diagonal by blocks with submatrices $W_{k}^{i}$ :

(13)

\displaystyle h_{k+1}(x)=\begin{bmatrix}\sigma_{k}(W_{k}^{1}h^{1}_{k}(x))\\ \vdots\\ \sigma_{k}(W_{k}^{H}h^{H}_{k}(x))\end{bmatrix}=\sigma_{k}\left(\begin{bmatrix}W^{1}_{k}&0&...&0\\ \vdots&&&\vdots\\ 0&...&0&W^{H}_{k}\end{bmatrix}\begin{bmatrix}h^{1}_{k}(x)\\ \vdots\\ h^{H}_{k}(x)\end{bmatrix}\right)=\sigma(W_{k}h_{k}(x))

We would again use a masked matrix with sparse parameters and it would be equivalent to having $H$ neural networks trained in parallel but yet remaining independent from one another because the hidden features of each expert is only used in the computation of its corresponding upper hidden layer features.

However, we do not change the desired property of dependencies by also allowing to compute the hidden layer $h_{k+1}^{i}$ using $h_{k}^{j}$ for all $j\in\omega(i)$ . We thus rather define $W_{k}$ with non-null blocks $W_{k}^{i,j}$ everywhere $S_{j}\subset S_{i}$ :

(14)

\displaystyle W_{k}=\begin{bmatrix}\textbf{1}_{S_{1}\subset S_{1}}W^{1,1}_{k}&\dots&\textbf{1}_{S_{j}\subset S_{1}}W^{1,j}_{k}&\dots\\ \vdots&&\vdots&\\ \textbf{1}_{S_{1}\subset S_{H}}W^{H,1}_{k}&\dots&\textbf{1}_{S_{j}\subset S_{H}}W^{H,j}_{k}&\dots\\ \end{bmatrix}

An example of such added links is given in the overview figure 1.

The last step to be able to train this network using a gradient-descent-based algorithm is to prevent back-propagation before matrix blocks $W^{i,j}_{k}$ for $i\neq j$ . Otherwise, the dependency is not preserved since child classifiers indirectly depend from their parents during training. This can be easily implemented in Tensorflow using a copy function as stop_gradient. We have then recursively defined an interpretable multi-layered neural network.

3.4.3. Extension

In previous sections, we have formulated a simple way to create interpretable linear layers using masked matrices. Then, applying additional activations or element-wise operations (e.g. skip connection, normalisation, …) does not change the dependency of each sub-network $F^{i}_{\theta}$ on its restriction $S_{i}$ . We can also apply functions along a time dimension to extend our method to sequences of inputs: as long as we process together hidden features computed using the same expert ( $i$ ) or its child experts ( $j\in\omega(i)$ ), we do not change the restricted input dependency.

An interesting case is the Transformer model (Vaswani et al., 2017) or its variants (Devlin et al., 2019), that have been recently popularised across many fields, often achieving state-of-the-art results. Those models leverage multi-head attention modules taking as input a query ( $q$ ), key ( $k$ ) and value ( $v$ ):

(15)		$\displaystyle\textrm{head}_{h}(q,k,v)=$	$\displaystyle\quad\textrm{DotProductAtt}(qW_{h}^{q},kW_{h}^{k},vW_{h}^{v})$
(16)		$\displaystyle\textrm{MultiHead}(q,k,v)=$	$\displaystyle\quad\textrm{Concat}_{h=0}^{P}(\textrm{head}_{h})W^{o}$

Defining $P$ as a multiple of $H$ , we partition the heads into several groups that act as experts. With the same procedure as before, we constraint parameters $W^{q},W^{k},W^{v},W^{o}$ to be masked matrices to obtain an interpretable multi-head attention module. The remaining building blocks of the Transformer are element-wise functions or can be made interpretable, allowing to formulate an interpretable Transformer model.

Other deep architectures can be made interpretable using the same principle. We have derived an interpretable gated recurrent units network during our experiments. A bit of thinking can be required to correctly link each sub-function correctly in complex cases, for instance when using multiple heterogeneous inputs. Several implementation examples can be found on this paper code repository for more details ²²2 Our code repository: https://github.com/deezer/interpretable_nn_attribution.

4. Experimental Results

We evaluate our method with the following research questions:

•

RQ1 Do our deep interpretable models perform as well as their non-interpretable counterpart? (completeness)
•

RQ2 Are the provided interpretation relevant? (interpretability)

Interpretation is task-dependent, we thus study several implicit signals prediction tasks to see how our method fares in various settings: on toy example (b) for which target attributions are available, on a collaborative filtering task using the MovieLens dataset, and on the sequential skip prediction task using user log data from Spotify and Deezer.

4.1. Synthetic data

4.1.1. Setting

We simulate a mixture of eight Gaussian distributions according to toy task (b) we have introduced in section 3.3.1. With $X=(X_{1},X_{2})$ and $Y\in\{-1,1\}$ , we define three interpretation subsets $S_{1},S_{2},S_{3}=\{X_{1}\},\{X_{2}\},\{X_{1},X_{2}\}$ , i.e. two univariate experts, $f^{1}$ and $f^{2}$ , and one residual bivariate expert, $f^{3}$ . We instantiate an interpretable two-layers feed-forward network with $3\times 16$ neurons on each hidden layer and ReLU activations and train it for a few minutes until convergence using Adam (Kingma and Ba, 2014) with default parameters².

4.1.2. Results

Except for a few misclassified edges points, this task is simple enough to be almost perfectly solved by the network, as shown in figure 5 (RQ1). We see that the expected attribution (fig. 4) is obtained, the central clusters are attributed to each univariate expert instead of using the expert with all input features (RQ2).

4.2. Collaborative filtering on implicit signals

4.2.1. Dataset

We evaluate our method on a more realistic collaborative filtering task (CF). We reproduce the setup of NCF (He et al., 2017) and compare their method with an equivalent interpretable network. Specifically, we use the MovieLens 1M dataset ³³3https://grouplens.org/datasets/movielens/1m/, containing one million movie ratings from around six thousands users and four thousands items. All ratings are binarised as implicit feedbacks to mark a positive user-item interaction, while non-interacted items are considered negative feedbacks. The performance is evaluated with the leave-one-out procedure, and judged by Hit rate (HR) and Normalised Discounted Cumulative Gain (NDCG). Validation is done by isolating a random training item for each user. More details can be found in the original paper (He et al., 2017) and (Dacrema et al., 2019).

4.2.2. Interpretation setting

In CF, a user $u$ (resp. item $i$ ) is typically embedded into a latent vector $p_{u}$ (resp. $q_{i}$ ), and the observed interactions are estimated via a similarity function. In NCF, the authors propose to replace the traditional inner product by a neural network to compute similarities. Because of the projection on a latent space, CF is more difficult to interpret than a content-based method that would only leverage the provided descriptive features for users - age range, gender, occupation ( $c_{u}$ ) - and movies - year, genres ( $c_{i}$ ). A model merely treating user and item using generic ranges (i.e. clusters) instead of personalised embeddings is however too coarse and underperforms.

Here, our method can be used to mix content-based and CF experts to discriminate interactions that can be predicted based on content, from the one that need an additional CF treatment to model users and items particularities. This way, we can trace if an item is recommended because of its similarity among a generic item range (eg. similar to horror-movies from the 90’s), a user range (eg. also liked by male viewers in their twenties), the combination of both, or beyond using CF. To this end, we define four experts with $\mathcal{S}=S_{1},\dots S_{4}=\{c_{u},c_{i}\},\{c_{u},p_{u},c_{i}\},\{c_{u},c_{i},q_{i}\},\{c_{u},p_{u},c_{i},q_{i}\}$ .

We use the multi-layered network version of NCF named MLP in (He et al., 2017), and parametrise $p_{u},q_{i}\in\mathbb{R}^{64}$ , $c_{u},c_{i}\in\mathbb{R}^{16}$ , and four hidden layer of sizes $[512,256,128,64]$ . The interpretable counterpart we dub Intrp-MLP, is build with the same architecture but with masked weights², which is equivalent to having four experts with hidden layer sizes $[128,64,32,16]$ .

4.2.3. Results

As presented in table 1, our interpretable version of NCF-MLP achieves close performances to the control non-interpretable model, with a 4% difference in HR (RQ1). This latter control model has a better HR and NDCG than reported in (He et al., 2017), which can be explained by the addition of contextual features $c_{u},c_{i}$ and bigger hidden layers.

Contrary to section 4.1, we do not have access to ground-truth attributions to check our model interpretability. As a simple proxy, we can check the attribution distribution on the test set for RQ2 (fig. 6). A first sanity check is that attributions do not collapse to an unique expert and have a relative diversity. We also see that the pure content-based expert ( $S_{1}$ ) is hardly selected, which is coherent with the underperformance of content-based models on this task.

Overall, we must underline that 66% of the items are predicted using the three first experts, for which an interpretation can be provided as either or both item and user will be described by generic features instead of a CF embedding: e.g. the selection of $S_{2}$ (resp. $S_{3}$ ) indicates a similarity to the item cluster (resp. user), as movies from a specific year and genre. The residual 34% are left for the CF expert ( $S_{4}$ ) when further personalisation is needed.

Model	HR@10	NDCG@10	#Params
content-based	0.386	0.218	190K
NCF-MLP	0.715	0.438	890K
Intrp-MLP	0.678	0.406	782K

4.3. Sequential skip prediction

4.3.1. Dataset

We study the task of skip prediction with two music sessions datasets. First, the Music Streaming Sessions Dataset (Brost et al., 2019). This public dataset contains anonymised listening logs of users from the Spotify streaming service over an 8 week period. Listening logs are sequenced together for each user, forming roughly $150$ million listening sessions of length ranging from size 10 to 20. Sessions including unpopular tracks were excluded, limiting the overall track set to approximately 3.7 millions tracks.

For the sequential skip prediction task¹, a session $(i)$ of length $2l^{(i)}$ is cut in half, with the first half (referred to as $A^{(i)}=A_{1}^{(i)}\dots A_{l^{(i)}}^{(i)}$ $\in\mathbb{R}^{l^{(i)}\times f}$ ) containing sessions logs, user interactions logs and track metadata, while the second half ( $B^{(i)}\in\mathbb{R}^{l^{(i)}\times g}$ ) contains only the track metadata. The goal is to predict the boolean value labelled as skip_2 ( $Y^{(i)}\in\{-1,1\}^{l^{(i)}}$ ) among the missing interaction features of $B^{(i)}$ . We omit the indices $(i)$ to simplify notations.

In addition to this dataset, we also use a private streaming sessions dataset provided by the music streaming service Deezer. This dataset contains $10$ million listening sessions of range 20 to 50 from a week of streaming logs. To have a similar setup to Spotify, we extract random session slices of size 20 at each epoch, enabling to have virtually more listening sessions. Without the anonymity constraint, this private dataset enables to use more features and to better evaluate interpretations with tangible data that can be streamed and manually checked. This dataset notably includes user metadata, including favourite and banned tracks, recently listened tracks, mean skip rate, user embedding, …

The difficulty of the challenge lies in the multitude of origins a skip can have. For instance, we can check that users counterintuitively skip their favourite tracks with the almost same ratio as other tracks. Skips do not just signal a music a user does not like, they also happen when a track is liked by a user but streamed at the wrong time, or when a user quickly browses the catalog, looking for a specific song, or just fresh content, or with connection errors or misclicks. Conversely, non-skip are also implicit, sometimes the user is simply not there to change an unwanted track.

4.3.2. Metrics

We compute the accuracy (Acc) of correctly predicted skip interaction in each half-session $B$ . We also make use of the evaluation metrics introduced during the challenge, the Mean Average Accuracy (MAA), to allow for comparison. Average accuracy is defined by $AA=\sum_{j=1}^{T}\textrm{acc}(j)L(j)/T$ , where $T$ is the number of track to be predicted in $B$ , $\textrm{acc}(j)$ the accuracy at position $j$ , and $L(j)$ a boolean indicating if the prediction at position $j$ was correct. This metric puts more weights on the first track of $B$ than the last ones.

It is argued that in the context of a session-based recommender system, this unbalance is due to the fact that it is more important to know if the next immediate track to be streamed will be skipped or not given preceding interactions and prevent a bad recommendation in the nearest future. However, as we will see, this latter argument can be flawed as first track prediction strongly depends on the blind continuation of interactions more than interesting underlying mechanisms that combine multiple features, hence not always providing much information to improve recommendation. We use the accuracy on the first immediate track (Acc@1) to highlight this effect. This question of relative relevance of a skip to a context needs to be addressed to allow retroactive improvements of a recommender system, which could be provided by the task we are trying to solve, skip interpretation.

4.3.3. Baseline

As we suggested, skips strongly depend on the persistence of skip behaviours. This can be interpreted as users being active by blocks: once a skip is performed, there is a higher chance that a user will also skip the next track while still on its app, and the other round, while not on the app, users may be more likely to tolerate an unwanted song. Because of this effect, it is relevant to use a persistence model as baseline, returning the last known interaction in $A$ for all the elements of $B$ . We additionally use a mean skip measure from $A$ . We thus have two baseline predictors:

(17)

f_{\textrm{last}}(A,B)=A_{l}[\texttt{skip}]

(18)

f_{\textrm{mean}}(A,B)=\frac{1}{l}\sum_{j=1}^{l}A_{j}[\texttt{skip}]

4.3.4. Experimental setup

We use a standard Transformer architecture (Vaswani et al., 2017), with 3 stacked identical self-attentive layer blocks for the encoder architecture with key-values input $A$ , as in the original paper, and 3 cross-attentive layer blocks for the decoder with query input $B$ .

For interpretation, we define $8$ experts for the Spotify dataset (fig. 8), and $10$ for Deezer (fig. 10). It must be underlined that $A$ and $B$ do not have the same interpretation subsets on the first layer as some features are missing for $B$ , the session to be predicted without logged interactions. The method however remains the same in this heterogeneous case to preserve dependencies: a link can be added between features based on $S_{x}$ and $S_{y}$ if and only if $S_{y}\subset S_{x}$ .

We train all models using Adam with default parameters and learning rate set to $10^{-4}$ . The learning rate is automatically reduced on plateau up to $10^{-6}$ . The two datasets are split in a 80-10-10% fashion for training/validation/test. Because of the huge number of sessions, the models reach convergence on the training loss in around roughly two epochs. We did not observe any overfitting effect, making it easy to control the optimisation.

4.3.5. Prediction performance (RQ1)

Results for both datasets are given in table 2 and 3. Baselines models, though parameter-free, are performing strikingly well on both datasets, especially on the first track prediction where the continuation effect is the strongest. In both cases, our interpretable models have close performances to their non-interpretable architecture counterparts, though losing around a point of accuracy.

In the WSDM Challenge¹, the evaluation was performed on a private and still inaccessible test set. However, the results of our baselines and control Transformer model seem to be coherent with the reported results, our control model would have been ranked to the fifth place of the challenge leaderboard.

Model	Acc (%)	Acc@1 (%)	MAA (%)
random	50.0	50.0	33.1
last baseline	63.0	74.2	54.3
mean baseline	61.7	66.3	51.7
Transformer (128)	72.2 $\pm$ 0.2	80.0 $\pm$ 0.2	62.8 $\pm$ 0.2
WSDM leader-board*	-	81.2	65.1
Interpretable Transformer (128)	70.9 $\pm$ 0.2	78.8 $\pm$ 0.2	61.1 $\pm$ 0.2

Table 2. Spotify skip prediction test results Numbers in parenthesis indicate the size of the hidden layer.

Model	Acc (%)	Acc@1 (%)	MAA (%)
random	50.0	50.0	32.3
last baseline	69.0	77.9	60.8
mean baseline	70.1	73.3	60.9
Transformer (256)	78.9	83.4	70.2
Interpretable Transformer (128)	77.7	82.4	68.8
Interpretable Transformer (64)	77.4	82.3	68.4

Table 3. Deezer skip prediction test results Numbers in parenthesis indicate the size of the hidden layer.

4.3.6. Attribution distribution (RQ2)

As in 4.2, we inspect the attribution distribution on the test sessions of Spotify (fig. 8) and Deezer (fig. 10). In both cases, there is a strong unbalance toward the simplest expert containing the interaction data of $A$ . This results allows to confirms our initial intuition that most skips result from pure interaction patterns and do not depend on other data. Those skips are not interesting for a recommender system as they do not tell much about user preferences. For the Spotify dataset, subsets $S_{3}$ , $S_{4}$ and $S_{6}$ reveal that 25% of skips can be predicted from the overall track metadata coherence, while being agnostic to the given skips in $A$ , which hints at simple ways to filter tracks in a candidate recommended session continuation $B$ . For the Deezer dataset, the second most attributed expert ( $S_{4}$ ) leverages a discounted skip rate measure that indicates a recent user-track affinity. We can conclude from the attribution levels that this relative measure is a stronger indicator than a favourite track signal ( $S_{2}$ ) or overall popularity ( $S_{3}$ ) to predict skips.

To illustrate the kind of interpretation we can get, an example of predicted session from Deezer is given in figure 11. Quite typically, the $S_{1}$ expert that only contains the given skip in $A$ will have the strongest attribution for the first track of $B$ , corresponding to the continuation of the last two non-skips in $A$ , but will vanish rapidly for the next tracks in favour of more complex experts. In the middle of $B$ , there is a Malaysian pop music, this rupture from the other rock songs can be observed in the sudden attribution to $S_{9}$ , an expert based on musical data. Beyond simple cases of interaction continuation or favourite tracks, our method can handle this kind of multifactorial session and provide an insight on their nature.

5. Conclusion

We introduce a novel attribution method that provides intrinsic interpretability by formulating a mixture of restricted experts, where simple experts are prioritised over more complex ones. We evaluate our method on synthetic problems, for which a ground-truth local attribution is available for comparison, and real-data tasks, aiming at predicting and interpreting binary implicit signals. Our experiments demonstrate that not only our interpretable networks achieve similar performances as their non-interpretable counterparts, but also help produce coherent interpretations that can be used to better understand implicit data, and may be leveraged by recommender systems.

As mentioned, our main future direction is the extension of our method to learnable interpretation subsets, which are currently fixed as hyper-parameters. Prior to this subject, a deeper discussion regarding the expected properties and geometry of attribution solutions will be needed in the unsupervised case for local attribution methods.

References

(1)
Alvarez-Melis and Jaakkola (2018) David Alvarez-Melis and Tommi S Jaakkola. 2018. On the robustness of interpretability methods. Workshop on Human Interpretability in Machine Learning, ICML (2018), 66–71.
Bach (2009) Francis Bach. 2009. High-dimensional non-linear variable selection through hierarchical kernel learning. arXiv preprint arXiv:0909.0844 (2009).
Bach et al. (2015) Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. 2015. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one 10, 7 (2015).
Bishop (2006) Christopher M Bishop. 2006. Pattern recognition and machine learning. springer.
Brost et al. (2019) Brian Brost, Rishabh Mehrotra, and Tristan Jehan. 2019. The music streaming sessions dataset. In The World Wide Web Conference. 2594–2600.
Chang et al. (2019) Sungkyun Chang, Seungjin Lee, and Kyogu Lee. 2019. Sequential Skip Prediction with Few-shot in Streamed Music Contents. arXiv preprint arXiv:1901.08203 (2019).
Chen et al. (2016) Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. 2016. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems. 2172–2180.
Dacrema et al. (2019) Maurizio Ferrari Dacrema, Paolo Cremonesi, and Dietmar Jannach. 2019. Are we really making much progress? A worrying analysis of recent neural recommendation approaches. In Proceedings of the 13th ACM Conference on Recommender Systems. 101–109.
Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 4171–4186.
Freund et al. (1999) Yoav Freund, Robert Schapire, and Naoki Abe. 1999. A short introduction to boosting. Journal-Japanese Society For Artificial Intelligence 14, 771-780 (1999), 1612.
Friedman et al. (2001) Jerome Friedman, Trevor Hastie, and Robert Tibshirani. 2001. The elements of statistical learning. Vol. 1. Springer series in statistics New York.
Hansen et al. (2019) Christian Hansen, Casper Hansen, Stephen Alstrup, Jakob Grue Simonsen, and Christina Lioma. 2019. Modelling Sequential Music Track Skips using a Multi-RNN Approach. In ACM International Conference on Web Search and Data Mining. Association for Computing Machinery.
Hastie and Tibshirani (1990) Trevor J Hastie and Robert J Tibshirani. 1990. Generalized additive models. Vol. 43. CRC press.
He et al. (2018) Xiangnan He, Zhankui He, Jingkuan Song, Zhenguang Liu, Yu-Gang Jiang, and Tat-Seng Chua. 2018. Nais: Neural attentive item similarity model for recommendation. IEEE Transactions on Knowledge and Data Engineering 30, 12 (2018), 2354–2366.
He et al. (2017) Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In Proceedings of the 26th international conference on world wide web. 173–182.
Hendricks et al. (2016) Lisa Anne Hendricks, Zeynep Akata, Marcus Rohrbach, Jeff Donahue, Bernt Schiele, and Trevor Darrell. 2016. Generating visual explanations. In European Conference on Computer Vision. Springer, 3–19.
Herlocker et al. (2000) Jonathan L Herlocker, Joseph A Konstan, and John Riedl. 2000. Explaining collaborative filtering recommendations. In Proceedings of the 2000 ACM conference on Computer supported cooperative work. 241–250.
Hidasi et al. (2016) Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and D Tikk. 2016. Session-based recommendations with recurrent neural networks. In 4th International Conference on Learning Representations, ICLR 2016.
Hinton et al. (2011) Geoffrey E Hinton, Alex Krizhevsky, and Sida D Wang. 2011. Transforming auto-encoders. In International conference on artificial neural networks. Springer, 44–51.
Hu et al. (2008) Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative filtering for implicit feedback datasets. In 2008 Eighth IEEE International Conference on Data Mining. Ieee, 263–272.
Huang et al. (2010) Junzhou Huang, Tong Zhang, et al. 2010. The benefit of group sparsity. The Annals of Statistics 38, 4 (2010), 1978–2004.
Huang et al. (2011) Junzhou Huang, Tong Zhang, and Dimitris Metaxas. 2011. Learning with structured sparsity. Journal of Machine Learning Research 12, Nov (2011), 3371–3412.
Jacobs et al. (1991) Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. 1991. Adaptive mixtures of local experts. Neural computation 3, 1 (1991), 79–87.
Jordan and Jacobs (1994) Michael I Jordan and Robert A Jacobs. 1994. Hierarchical mixtures of experts and the EM algorithm. Neural computation 6, 2 (1994), 181–214.
Kim et al. (2018) Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, et al. 2018. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav). In International conference on machine learning. 2668–2677.
Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. Proceedings of the 3rd International Conference on Learning Representations (2014).
Li and She (2017) Xiaopeng Li and James She. 2017. Collaborative variational autoencoder for recommender systems. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining. 305–314.
Liang et al. (2018) Dawen Liang, Rahul G Krishnan, Matthew D Hoffman, and Tony Jebara. 2018. Variational autoencoders for collaborative filtering. In Proceedings of the 2018 World Wide Web Conference. 689–698.
Lou et al. (2013) Yin Lou, Rich Caruana, Johannes Gehrke, and Giles Hooker. 2013. Accurate intelligible models with pairwise interactions. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. 623–631.
Lundberg and Lee (2017) Scott M Lundberg and Su-In Lee. 2017. A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.). Curran Associates, Inc., 4765–4774. http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf
Maziarka et al. (2020) Łukasz Maziarka, Tomasz Danel, Sławomir Mucha, Krzysztof Rataj, Jacek Tabor, and Stanisław Jastrzębski. 2020. Molecule Attention Transformer. arXiv preprint arXiv:2002.08264 (2020).
Redmon et al. (2016) Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2016. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition. 779–788.
Ribeiro et al. (2016) Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. ” Why should i trust you?” Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. 1135–1144.
Rudin (2019) Cynthia Rudin. 2019. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 1, 5 (2019), 206–215.
Schulz et al. (2019) Karl Schulz, Leon Sixt, Federico Tombari, and Tim Landgraf. 2019. Restricting the Flow: Information Bottlenecks for Attribution. In International Conference on Learning Representations.
Selvaraju et al. (2017) Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. 2017. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision. 618–626.
Shenbin et al. (2020) Ilya Shenbin, Anton Alekseev, Elena Tutubalina, Valentin Malykh, and Sergey I Nikolenko. 2020. RecVAE: A New Variational Autoencoder for Top-N Recommendations with Implicit Feedback. In Proceedings of the 13th International Conference on Web Search and Data Mining. 528–536.
Shrikumar et al. (2017) Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. 2017. Learning Important Features Through Propagating Activation Differences. In International Conference on Machine Learning. 3145–3153.
Simonyan et al. (2014) Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2014. Deep inside convolutional networks: Visualising image classification models and saliency maps. Workshop, ICLR (2014).
Sinha and Swearingen (2002) Rashmi Sinha and Kirsten Swearingen. 2002. The role of transparency in recommender systems. In CHI’02 extended abstracts on Human factors in computing systems. 830–831.
Smilkov et al. (2017) Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda Viégas, and Martin Wattenberg. 2017. Smoothgrad: removing noise by adding noise. Workshop on Visualization for Deep Learning, ICML (2017).
Tintarev and Masthoff (2007) Nava Tintarev and Judith Masthoff. 2007. A survey of explanations in recommender systems. In 2007 IEEE 23rd international conference on data engineering workshop. IEEE, 801–810.
Tishby et al. (1999) Naftali Tishby, Fernando C Pereira, and William Bialek. 1999. The information bottleneck method. The 37th annual Allerton Conf. on Communication, Control, and Computing (1999), 368–377.
Van den Oord et al. (2013) Aaron Van den Oord, Sander Dieleman, and Benjamin Schrauwen. 2013. Deep content-based music recommendation. In Advances in neural information processing systems. 2643–2651.
Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems. 5998–6008.
Voita et al. (2019) Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 5797–5808.
Voynov and Babenko (2019) Andrey Voynov and Artem Babenko. 2019. RPGAN: GANs Interpretability via Random Routing. arXiv preprint arXiv:1912.10920 (2019).
Zhao et al. (2009) Peng Zhao, Guilherme Rocha, Bin Yu, et al. 2009. The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics 37, 6A (2009), 3468–3497.
Zhou et al. (2016) Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. 2016. Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2921–2929.
Zhu and Chen (2019) Lin Zhu and Yihong Chen. 2019. Session-based Sequential Skip Prediction via Recurrent Neural Networks. arXiv preprint arXiv:1902.04743 (2019).
Zilke et al. (2016) Jan Ruben Zilke, Eneldo Loza Mencía, and Frederik Janssen. 2016. Deepred–rule extraction from deep neural networks. In International Conference on Discovery Science. Springer, 457–473.