
Y. Geng and L. Lan
Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR, China
Email: {yugeng, lanliang}@comp.hkbu.edu.hk

Memory-Efficient Factorization Machines via Binarizing both Data and Model Coefficients

Yu Geng    Liang Lan
Abstract

Factorization Machines (FM), a general predictor that can efficiently model feature interactions in linear time, was originally proposed for collaborative recommendation and has been broadly used for regression, classification and ranking tasks. Subspace Encoding Factorization Machine (SEFM) has been proposed recently to overcome the expressiveness limitation of FM by applying explicit nonlinear feature mapping to both individual features and feature interactions through one-hot encoding of each input feature. Despite the effectiveness of SEFM, it increases the memory cost of FM by $b$ times, where $b$ is the number of bins used when applying one-hot encoding to each input feature. To reduce the memory cost of SEFM, we propose a new method called Binarized FM, which constrains the model parameters to be binary values (i.e., $1$ or $-1$), so that each parameter can be stored in a single bit. Our proposed method can therefore significantly reduce the memory cost of the SEFM model. In addition, we propose a new algorithm to effectively and efficiently learn the proposed FM with binary constraints using the Straight Through Estimator (STE) with Adaptive Gradient Descent (Adagrad). Finally, we evaluate the performance of our proposed method on eight different classification datasets. Our experimental results demonstrate that our proposed method achieves comparable accuracy with SEFM but with much less memory cost.

Keywords:
Factorization Machines · Binary Models

1 Introduction

Factorization Machines (FM) (Rendle, 2010) is a popular method that can efficiently model feature interactions in linear time with respect to the number of features. While FM has been successfully applied to different predictive tasks, it has also been noted that FM fails to capture highly nonlinear patterns in the data because it only uses polynomial feature expansion for nonlinear relationship modeling (He and Chua, 2017; Liu et al., 2017; Lan and Geng, 2019; Chen et al., 2019). To address the expressiveness limitation of the FM model, different methods have been proposed: (1) Embedding neural networks into FM models (He and Chua, 2017; Xiao et al., 2017; Guo et al., 2017): they use neural-network based approaches to model feature interactions in a highly nonlinear way; (2) Locally Linear FM (LLFM) (Liu et al., 2017) and its extension (Chen et al., 2019): they use a local encoding technique to capture nonlinear concepts based on the idea that a nonlinear manifold behaves linearly in local neighborhoods; (3) Subspace Encoding FM (SEFM) (Lan and Geng, 2019): it captures the nonlinear relationships of individual features and feature interactions to the class label by applying explicit nonlinear feature mapping to each input feature.

Although these FM extensions have shown improved accuracy on different predictive tasks, their space and computational complexities are much higher than those of the original FM. These high complexities limit their applications in resource-constrained scenarios, such as model deployment on mobile phones or IoT devices. We seek to develop a novel memory- and computation-efficient FM model that can capture highly nonlinear patterns in the data. In this paper, we focus on reducing the memory cost of SEFM. The core idea of SEFM is to apply one-hot encoding to each input feature, which can effectively model the relationship of individual features and feature interactions to the class label in a highly nonlinear way. SEFM achieves the same computational complexity as FM for model inference. However, with respect to the space complexity of the model parameters, SEFM increases the memory cost of the FM model by $b$ times, where $b$ is the number of bins used when applying one-hot encoding to each input feature.

Motivated by recent works that reduce the memory cost of classification models by binarizing the model parameters (Hubara et al., 2016; Rastegari et al., 2016; Alizadeh et al., 2018; Shen et al., 2017), we propose to reduce the memory cost of SEFM by binarizing its model parameters. By constraining the model parameters of SEFM to be binary values (i.e., $1$ or $-1$), only one bit is required to store each parameter value. Therefore, our proposed method can significantly reduce the memory cost of SEFM. Note that our proposed method is different from Discrete FM (DFM) (Liu et al., 2018), which only binarizes the model parameters for pairwise feature interactions on the original feature space. Our method, based on the mapped feature space, has a higher expressiveness capacity than DFM. In addition, DFM focuses on the regression loss, and its training algorithm cannot be directly applied to the hinge loss or log loss, which are popular for classification problems. In this paper, we propose to train our binarized FM model using the Straight Through Estimator (STE) (Bengio et al., 2013) with adaptive gradient descent (Adagrad) (Duchi et al., 2011), which can be applied to any convex loss function.

The contributions of this paper can be summarized as follows. First, we propose a novel method that reduces the memory cost of SEFM by binarizing the model parameters. Second, since directly learning the binary model parameters is an NP-hard problem, we propose to use STE with Adagrad to effectively and efficiently obtain the binary model parameters. We provide a detailed analysis of our method's space and computational complexities for model inference and compare it with other competing methods. Finally, we evaluate the performance of our proposed method on eight datasets. Our experimental results clearly show that our proposed method achieves much better accuracy than FM while having a similar memory cost. The results also show that our proposed method achieves comparable accuracy with SEFM but with much less memory cost.

2 Preliminaries

Factorization Machines. Given a training data set $D=\{\mathbf{x}_{i},y_{i}\}_{i=1}^{n}$, where $\mathbf{x}_{i}$ is a $d$-dimensional feature vector representing the $i$-th training sample and $y_{i}\in\{-1,1\}$ is the class label of the $i$-th training sample, $x_{ij}$ denotes the $j$-th feature value of the $i$-th sample. The predicted value for a data sample $\mathbf{x}_{i}$ based on a degree-2 FM is

\[
f(\mathbf{x}_{i})=\sum_{j=1}^{d}w_{j}x_{ij}+\sum_{j=1}^{d}\sum_{k=j+1}^{d}\tilde{w}_{jk}x_{ij}x_{ik},
\tag{1}
\]

where $w_{j}$ models the relationship between the $j$-th feature and the class label, and $\tilde{w}_{jk}$ models the relationship between the pairwise interaction of feature $j$ and feature $k$ and the class label. In FM, the model coefficient for the feature interaction, $\tilde{w}_{jk}$, is factorized as $\mathbf{v}_{j}^{T}\mathbf{v}_{k}$, where $\mathbf{v}_{j}$ is an $m$-dimensional vector. By using this factorization, the second term in (1) can be computed in $O(md)$ time, which is linear with respect to the number of features $d$, according to Lemma 3.1 in Rendle (2010),

\[
\sum_{j=1}^{d}\sum_{k=j+1}^{d}\tilde{w}_{jk}x_{ij}x_{ik}=\sum_{j=1}^{d}\sum_{k=j+1}^{d}\mathbf{v}_{j}^{T}\mathbf{v}_{k}\,x_{ij}x_{ik}
=\frac{1}{2}\sum_{f=1}^{m}\Big(\big(\sum_{j=1}^{d}v_{j,f}x_{ij}\big)^{2}-\sum_{j=1}^{d}v_{j,f}^{2}x_{ij}^{2}\Big).
\tag{2}
\]

The model parameters of FM are $\mathbf{w}\in\mathbb{R}^{d}$ and $\mathbf{V}\in\mathbb{R}^{d\times m}$. These model parameters can be efficiently learnt from data using Stochastic Gradient Descent (SGD). The gradient of the FM prediction for one sample $\mathbf{x}_{i}$ can be computed as

\[
\frac{\partial f(\mathbf{x}_{i})}{\partial w_{j}}=x_{ij},\qquad
\frac{\partial f(\mathbf{x}_{i})}{\partial v_{jf}}=x_{ij}\sum_{k=1}^{d}v_{kf}x_{ik}-v_{jf}x_{ij}^{2}.
\tag{3}
\]

By using the low-rank factorization $\tilde{w}_{jk}=\mathbf{v}_{j}^{T}\mathbf{v}_{k}$, FM can model all pairwise feature interactions in $O(md)$ time, which is much more efficient than an SVM with a degree-2 polynomial kernel (Chang et al., 2010).
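To make the linear-time evaluation of (1)-(2) concrete, here is a minimal NumPy sketch of the degree-2 FM prediction; the function and variable names are ours, and a practical implementation would operate on sparse inputs.

```python
import numpy as np

def fm_predict(x, w, V):
    """Degree-2 FM prediction f(x) in O(md) time, following Eqs. (1)-(2).

    x: (d,) feature vector; w: (d,) linear coefficients;
    V: (d, m) factor matrix, so that the interaction weight is <V[j], V[k]>.
    """
    linear = w @ x                              # sum_j w_j x_j
    xv = x @ V                                  # (m,): sum_j v_{j,f} x_j
    x2v2 = (x ** 2) @ (V ** 2)                  # (m,): sum_j v_{j,f}^2 x_j^2
    pairwise = 0.5 * np.sum(xv ** 2 - x2v2)     # Lemma 3.1 of Rendle (2010)
    return linear + pairwise
```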

Figure 1: Comparison of the decision boundaries of FM, SEFM and our method on the circles data. Panels: (a) FM, (b) SEFM, (c) Our Method.

Subspace Encoding Factorization Machines. As shown in (1), the relationship of individual features and feature interactions to the class label in FM is linear, so FM fails to capture highly nonlinear patterns in the data. Recently, SEFM (Lan and Geng, 2019) was proposed to address this expressiveness limitation of FM. The basic idea of SEFM is to apply one-hot encoding to each individual feature. This simple idea provides an efficient non-parametric nonlinear feature mapping for both individual features and feature interactions and therefore improves the model expressiveness capacity. For a given data sample $\mathbf{x}_{i}$, SEFM transforms it into a very sparse vector $\mathbf{z}_{i}$ of length $d\times b$ by one-hot encoding, where $b$ is the number of bins that each feature is binned into. That is,

\[
z_{ijh}=\begin{cases}1, & \text{if } x_{ij}\in B_{h}^{j}\\ 0, & \text{otherwise},\end{cases}
\tag{4}
\]

where $B_{h}^{j}$ denotes the interval of the $h$-th bin of the $j$-th feature in the original space. In other words, the original feature matrix $\mathbf{X}\in\mathbb{R}^{n\times d}$ is transformed into a binary feature matrix $\mathbf{Z}\in\{0,1\}^{n\times p}$, where $p=d\times b$. Then an FM model is trained on the new data $\{\mathbf{z}_{i},y_{i}\}_{i=1}^{n}$. The model parameters of SEFM are $\mathbf{w}\in\mathbb{R}^{p}$ and $\mathbf{V}\in\mathbb{R}^{p\times m}$, where $p$ is $b$ times larger than $d$, and the predicted value of SEFM is

\[
f(\mathbf{z}_{i})=\sum_{j=1}^{p}w_{j}z_{ij}+\sum_{j=1}^{p}\sum_{k=j+1}^{p}(\mathbf{v}_{j}^{T}\mathbf{v}_{k})z_{ij}z_{ik}.
\tag{5}
\]
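As an illustration of the feature mapping in (4), the following sketch builds $\mathbf{Z}$ from $\mathbf{X}$ using equal-width bins per feature; the binning strategy and names are our assumptions rather than the exact SEFM implementation.

```python
import numpy as np

def sefm_encode(X, b):
    """One-hot bin encoding of Eq. (4): X of shape (n, d) -> Z of shape (n, d*b)."""
    n, d = X.shape
    Z = np.zeros((n, d * b))
    for j in range(d):
        lo, hi = X[:, j].min(), X[:, j].max()
        # bin index h in {0, ..., b-1} for each sample (equal-width bins assumed)
        h = np.clip(((X[:, j] - lo) / (hi - lo + 1e-12) * b).astype(int), 0, b - 1)
        Z[np.arange(n), j * b + h] = 1.0
    return Z
```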

Figure 1 compares the decision boundaries of FM, SEFM and our method on the circles data, a nonlinear synthetic classification dataset in two-dimensional space. The blue points on the large outer circle belong to one class and the red points on the inner circle belong to the other class. As shown in sub-figure (a), the nonlinear decision boundary based on polynomial feature expansion produced by FM cannot capture the highly nonlinear pattern. SEFM (sub-figure (b)) produces a piecewise axis-perpendicular decision boundary that works well on this data. Sub-figure (c) shows the decision boundary of our proposed method, which also works very well on this nonlinear data but with much less memory cost than SEFM.

3 Methodology

3.1 Binarized Factorization Machines

Even though SEFM improves the model expressiveness capacity of FM and therefore achieves higher accuracy than FM on highly nonlinear datasets, it increases the memory cost of FM by $b$ times, since its model parameters are $\mathbf{w}\in\mathbb{R}^{p}$ and $\mathbf{V}\in\mathbb{R}^{p\times m}$, where $p$ is $b$ times larger than the original data dimensionality $d$. The high space complexity of SEFM limits its applications in resource-constrained scenarios, such as model deployment on mobile phones or IoT devices.

To reduce the memory cost of SEFM, we propose to binarize the model parameters $\mathbf{w}$ and $\mathbf{V}$. We approximate the full-precision $\mathbf{w}$ and $\mathbf{V}$ using binarized $\mathbf{w}^{b}\in\{-1,1\}^{p}$ and $\mathbf{V}^{b}\in\{-1,1\}^{p\times m}$ together with scaling parameters $\alpha$ and $\beta$. That is, $\mathbf{w}\approx\alpha\mathbf{w}^{b}$ and $\mathbf{V}\approx\beta\mathbf{V}^{b}$. The predicted value of our model is

\[
f(\mathbf{z}_{i})=\alpha\sum_{j=1}^{p}w^{b}_{j}z_{ij}+\beta^{2}\sum_{j=1}^{p}\sum_{k=j+1}^{p}\big((\mathbf{v}^{b}_{j})^{T}\mathbf{v}^{b}_{k}\big)z_{ij}z_{ik}.
\tag{6}
\]

In our experiments, we show that using the scaling parameters $\alpha$ and $\beta$ yields better accuracy than not using them. The objective of our proposed model can be formulated as

\[
\begin{aligned}
\min_{\alpha,\beta,\mathbf{w}^{b},\mathbf{V}^{b}}\ &\frac{1}{n}\sum_{i=1}^{n}\ell(y_{i},f(\mathbf{z}_{i}))+\frac{\lambda_{1}}{2}\|\alpha\mathbf{w}^{b}\|_{2}^{2}+\frac{\lambda_{2}}{2}\|\beta\mathbf{V}^{b}\|_{F}^{2}\\
\text{s.t.}\ &\mathbf{w}^{b}\in\{-1,1\}^{p},\quad\mathbf{V}^{b}\in\{-1,1\}^{p\times m},\quad\alpha,\beta>0,
\end{aligned}
\tag{7}
\]

where $\ell(y_{i},f(\mathbf{z}_{i}))$ in (7) denotes a convex loss function. Due to the non-smooth binary constraints, optimizing (7) is an NP-hard problem and requires $O(2^{p+mp})$ time to obtain the global optimal solution by exhaustive search. Next, we show how to optimize (7) efficiently using STE with Adagrad.

3.2 Learning Binary Model Parameters by Straight Through Estimator (STE)

In this section, we describe our proposed algorithm for solving (7) based on the Straight Through Estimator (STE) (Bengio et al., 2013). STE is widely used for training binary convolutional neural networks (Alizadeh et al., 2018). The idea of STE is to learn a full-precision variable as a proxy of the binary variable during the training process. In our proposed model, we introduce a full-precision $\widetilde{\mathbf{w}}$ as the proxy of the binary $\mathbf{w}^{b}$ in (7) and a full-precision $\widetilde{\mathbf{V}}$ as the proxy of the binary $\mathbf{V}^{b}$ in (7). Instead of directly learning the binary variables $\mathbf{w}^{b}$ and $\mathbf{V}^{b}$, we learn their full-precision proxy variables $\widetilde{\mathbf{w}}$ and $\widetilde{\mathbf{V}}$. To estimate the binary variables $\mathbf{w}^{b}$ and $\mathbf{V}^{b}$ and the scaling parameters $\alpha$ and $\beta$, we minimize $\|\widetilde{\mathbf{w}}-\alpha\mathbf{w}^{b}\|^{2}$ and $\|\widetilde{\mathbf{V}}-\beta\mathbf{V}^{b}\|^{2}$.

Here we present the detailed derivation for estimating $\mathbf{w}^{b}$, $\mathbf{V}^{b}$ and the scaling parameters $\alpha$, $\beta$ from the full-precision proxy variables $\widetilde{\mathbf{w}}$ and $\widetilde{\mathbf{V}}$. The derivation follows a procedure similar to that used for optimizing binary convolutional neural networks (Rastegari et al., 2016). In the STE framework, we use SGD to update the full-precision proxy variables $\widetilde{\mathbf{w}}$ and $\widetilde{\mathbf{V}}$ instead of $\mathbf{w}^{b}$ and $\mathbf{V}^{b}$, and obtain the optimal approximations $\mathbf{w}\approx\alpha\mathbf{w}^{b}$ and $\mathbf{V}\approx\beta\mathbf{V}^{b}$ by minimizing $\|\widetilde{\mathbf{w}}-\alpha\mathbf{w}^{b}\|^{2}$ and $\|\widetilde{\mathbf{V}}-\beta\mathbf{V}^{b}\|^{2}$.

Let us first consider minimizing $\|\widetilde{\mathbf{w}}-\alpha\mathbf{w}^{b}\|^{2}$. Expanding this term, it can be rewritten as

\[
\|\widetilde{\mathbf{w}}-\alpha\mathbf{w}^{b}\|^{2}=\widetilde{\mathbf{w}}^{T}\widetilde{\mathbf{w}}-2\alpha\widetilde{\mathbf{w}}^{T}\mathbf{w}^{b}+\alpha^{2}(\mathbf{w}^{b})^{T}\mathbf{w}^{b}.
\tag{8}
\]

Since $\mathbf{w}^{b}\in\{-1,1\}^{p}$, $(\mathbf{w}^{b})^{T}\mathbf{w}^{b}$ equals the constant $p$. The scaling parameter satisfies $\alpha>0$ and $\widetilde{\mathbf{w}}$ is a known variable in (8). Therefore, minimizing (8) is equivalent to the following optimization problem,

\[
\max_{\mathbf{w}^{b}}\ \widetilde{\mathbf{w}}^{T}\mathbf{w}^{b}\quad\text{s.t.}\ \mathbf{w}^{b}\in\{-1,1\}^{p}.
\tag{9}
\]

It is straightforward to see that the optimal solution of (9) is $\mathbf{w}^{b}=\text{sign}(\widetilde{\mathbf{w}})$. To obtain the optimal scaling parameter $\alpha$, we take the derivative of (8) with respect to $\alpha$, which is $-2\widetilde{\mathbf{w}}^{T}\mathbf{w}^{b}+2\alpha p$. Setting it to $0$ gives $\alpha=\frac{\widetilde{\mathbf{w}}^{T}\mathbf{w}^{b}}{p}$. Since $\mathbf{w}^{b}=\text{sign}(\widetilde{\mathbf{w}})$, the scaling parameter $\alpha$ is

\[
\alpha=\frac{\sum_{j}|\widetilde{w}_{j}|}{p}.
\tag{10}
\]

By following the same procedure, we also obtain the optimal $\mathbf{V}^{b}=\text{sign}(\widetilde{\mathbf{V}})$ and the optimal $\beta$ as

\[
\beta=\frac{\sum_{i,j}|\widetilde{v}_{ij}|}{p\times m}.
\tag{11}
\]

The optimal binary $\mathbf{w}^{b}$ is obtained as $\mathbf{w}^{b}=\text{sign}(\widetilde{\mathbf{w}})$, where $\text{sign}(\cdot)$ is the element-wise sign function that returns $1$ if the element is greater than or equal to zero and $-1$ otherwise. Similarly, $\mathbf{V}^{b}$ is obtained as $\mathbf{V}^{b}=\text{sign}(\widetilde{\mathbf{V}})$. Since the $\text{sign}(\cdot)$ function is not differentiable, STE estimates the gradients with respect to $\widetilde{\mathbf{w}}$ and $\widetilde{\mathbf{V}}$ as if the non-differentiable $\text{sign}(\cdot)$ function were not present. In other words, STE simply estimates $\frac{\partial\,\text{sign}(\tilde{w}_{j})}{\partial\tilde{w}_{j}}$ as $\frac{\partial\tilde{w}_{j}}{\partial\tilde{w}_{j}}=1$. In practice, we also employ gradient clipping as in Hubara et al. (2016). The gradient of the non-differentiable sign function is then estimated as

\[
\frac{\partial\,\text{sign}(\tilde{w}_{j})}{\partial\tilde{w}_{j}}=1_{|\tilde{w}_{j}|\leq 1}.
\tag{12}
\]

Therefore, the gradient of (7) with respect to $\tilde{w}_{j}$ (i.e., the $j$-th element of the proxy variable $\widetilde{\mathbf{w}}$) based on the $i$-th data sample can be estimated as

\[
g_{\tilde{w}_{j}}=\frac{\partial\ell(y_{i},f(\mathbf{z}_{i}))}{\partial\tilde{w}_{j}}=\Big(\frac{\partial\ell(y_{i},f(\mathbf{z}_{i}))}{\partial w^{b}_{j}}+\lambda_{1}\alpha w^{b}_{j}\Big)\frac{\partial w^{b}_{j}}{\partial\tilde{w}_{j}}
=\Big(\frac{\partial\ell(y_{i},f(\mathbf{z}_{i}))}{\partial f(\mathbf{z}_{i})}\alpha z_{ij}+\lambda_{1}\alpha w^{b}_{j}\Big)1_{|\tilde{w}_{j}|\leq 1}.
\tag{13}
\]

Any convex loss function can be used in (13), and the gradient $\frac{\partial\ell(y_{i},f(\mathbf{z}_{i}))}{\partial f(\mathbf{z}_{i})}$ can be easily obtained for different loss functions.
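The STE building blocks above reduce to a few lines of code. The sketch below (our notation, dense NumPy for clarity) binarizes a proxy variable, computes the scaling parameters of (10)/(11), and applies the clipped straight-through gradient of (12).

```python
import numpy as np

def binarize(t):
    """sign(.) with sign(0) mapped to +1, as described in the text."""
    return np.where(t >= 0, 1.0, -1.0)

def scaling(t):
    """alpha (or beta): the mean absolute value of the proxy, Eqs. (10)-(11)."""
    return np.abs(t).mean()

def ste_grad(grad_wrt_binary, proxy):
    """Straight-through gradient w.r.t. the proxy variable, Eq. (12):
    pass the gradient through, zeroed where |proxy| > 1 (gradient clipping)."""
    return grad_wrt_binary * (np.abs(proxy) <= 1.0)
```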

Similarly, the gradient of (7) with respect to $\tilde{v}_{jf}$ (i.e., the $(j,f)$-th element of $\widetilde{\mathbf{V}}$) based on the $i$-th data sample can be estimated as

\[
g_{\tilde{v}_{jf}}=\Big(\frac{\partial\ell(y_{i},f(\mathbf{z}_{i}))}{\partial f(\mathbf{z}_{i})}\frac{\partial f(\mathbf{z}_{i})}{\partial v^{b}_{jf}}+\lambda_{2}\beta v^{b}_{jf}\Big)1_{|\tilde{v}_{jf}|\leq 1},
\tag{14}
\]

where $\frac{\partial f(\mathbf{z}_{i})}{\partial v^{b}_{jf}}$ can be computed in a form similar to (3). That is,

\[
\frac{\partial f(\mathbf{z}_{i})}{\partial v^{b}_{jf}}=\beta^{2}\Big(z_{ij}\sum_{k=1}^{p}v^{b}_{kf}z_{ik}-v^{b}_{jf}z_{ij}^{2}\Big).
\tag{15}
\]

Note that (15) can be computed in $O(1)$ time if $\sum_{k=1}^{p}v^{b}_{kf}z_{ik}$ is precomputed for each $f$. Based on the stochastic gradients in (13) and (14), we can update $\widetilde{\mathbf{w}}$ and $\widetilde{\mathbf{V}}$ using the SGD method.
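The $O(1)$ cost per entry in (15) comes from precomputing the $m$ sums $\sum_{k}v^{b}_{kf}z_{ik}$ once per sample. A small sketch (our naming), exploiting that the nonzero entries of $\mathbf{z}_{i}$ are all $1$:

```python
import numpy as np

def pairwise_grads(nz, Vb, beta):
    """d f(z_i) / d v^b_{jf} of Eq. (15) for all nonzero j and all factors f.

    nz: indices of the d nonzero (value 1) entries of z_i; Vb: (p, m) in {-1, +1}.
    """
    s = Vb[nz].sum(axis=0)              # (m,): precomputed sums over nonzeros
    # with z_{ij} = 1, Eq. (15) reduces to beta^2 * (s_f - v^b_{jf})
    return beta ** 2 * (s[None, :] - Vb[nz])
```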

3.3 Updating the Proxy Variables by Adagrad

In our optimization problem (7), the feature representation $\mathbf{Z}$ after binning is very sparse, since it is generated by applying one-hot encoding to each original feature. Moreover, the sparsity rates of different features in $\mathbf{Z}$ can differ substantially, as shown in Figure 2. Therefore, parameters associated with features with a high sparsity rate are updated less frequently in SGD than parameters associated with features with a low sparsity rate. If we directly applied standard SGD, where the same learning rate $\eta$ is used for all parameters, it could be either too small for the parameters associated with high-sparsity features or too large for the parameters associated with low-sparsity features.

To address this issue, we propose to use Adagrad (Duchi et al., 2011) to update the proxy variables $\widetilde{\mathbf{w}}$ and $\widetilde{\mathbf{V}}$. Adagrad uses different learning rates for different model parameters, and the learning rates are adaptively adjusted during the training process. It has been shown that Adagrad can greatly improve the robustness of SGD (Ruder, 2016). The basic intuition of Adagrad is to give low learning rates to parameters associated with features with a low sparsity rate and high learning rates to parameters associated with features with a high sparsity rate.

For a parameter (e.g., $\tilde{w}_{j}$ or $\tilde{v}_{jf}$) in our model, following Adagrad, we set the learning rate at the $t$-th update to $\eta_{t}=\frac{\eta}{\sqrt{s_{t}+\epsilon}}$, where $s_{t}$ is the sum of squares of previously observed gradients up to time $t$ and $\epsilon$ is a small constant that prevents division by zero.

Therefore, for updating the $j$-th element of $\widetilde{\mathbf{w}}$, we first obtain the gradient $g_{\tilde{w}_{j}}$ as in (13). Then, the update rule for $\tilde{w}_{j}$ at the $t$-th step is

\[
\tilde{w}_{j}^{t+1}=\tilde{w}_{j}^{t}-\frac{\eta}{\sqrt{s_{j}^{t}+\epsilon}}g_{\tilde{w}_{j}},
\tag{16}
\]

where $s_{j}^{t}$ aggregates the sum of squares of previously observed gradients up to time $t$, that is, $s_{j}^{t}=s_{j}^{t-1}+(g_{\tilde{w}_{j}})^{2}$.

Similarly, for updating the $(j,f)$-th parameter of $\widetilde{\mathbf{V}}$, we first obtain the gradient $g_{\tilde{v}_{jf}}$ as in (14). Then, the update rule for $\tilde{v}_{jf}$ at the $t$-th step is

\[
\tilde{v}_{jf}^{t+1}=\tilde{v}_{jf}^{t}-\frac{\eta}{\sqrt{\tilde{s}_{jf}^{t}+\epsilon}}g_{\tilde{v}_{jf}},
\tag{17}
\]

where $\tilde{s}_{jf}^{t}$ aggregates the sum of squares of previously observed gradients with respect to $\tilde{v}_{jf}$ up to time $t$, that is, $\tilde{s}_{jf}^{t}=\tilde{s}_{jf}^{t-1}+(g_{\tilde{v}_{jf}})^{2}$.
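The per-coordinate updates (16) and (17) share the same form; a minimal sketch (names and the default step size are ours):

```python
import numpy as np

def adagrad_step(param, grad, accum, eta=0.1, eps=1e-8):
    """One Adagrad update of Eq. (16)/(17); accum is the running sum of
    squared gradients for this coordinate (works elementwise for arrays)."""
    accum = accum + grad ** 2
    param = param - eta / np.sqrt(accum + eps) * grad
    return param, accum
```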

Figure 2: Feature sparsity after binning on different datasets. Panels: (a) banana, (b) pendigits, (c) ijcnn.

As shown in (16) and (17), Adagrad adaptively adjusts the learning rate $\eta$ at each time step $t$ for each parameter in $\widetilde{\mathbf{w}}$ and $\widetilde{\mathbf{V}}$ based on its past gradients. In our experiments, we show that Adagrad achieves lower training loss and is more robust than SGD.

Algorithm 1 Learning Binary Model Parameters by Straight Through Estimator (STE) with Adagrad
Training
Input: training data set $D=\{\mathbf{x}_{i},y_{i}\}_{i=1}^{n}$, low-rank parameter $m$, number of bins $b$, step size $\eta$, regularization parameters $\lambda_{1}$, $\lambda_{2}$ and a small constant $\epsilon$;
Output: Binarized FM model $\mathbf{w}^{b}\in\{-1,1\}^{p}$, $\mathbf{V}^{b}\in\{-1,1\}^{p\times m}$ on the mapped feature space;
1: Generate the new feature representation $\mathbf{Z}$
2: Initialize the proxy variables $\widetilde{\mathbf{w}}=\mathbf{0}$, $\widetilde{\mathbf{V}}=\mathbf{0}$ and the accumulators $\mathbf{s}\in\mathbb{R}^{p}=\mathbf{0}$, $\tilde{\mathbf{S}}\in\mathbb{R}^{p\times m}=\mathbf{0}$, which are used for adjusting the learning rates in Adagrad. Use (10) and (11) to set the scaling parameters $\alpha$, $\beta$.
3: repeat
4:     for all $i\in[1,n]$ do
5:         for each nonzero $z_{ij}$ do
6:             compute the gradient $g_{\tilde{w}_{j}}$ as in (13)
7:             update $s_{j}$ as $s_{j}\mathrel{+}=(g_{\tilde{w}_{j}})^{2}$
8:             update $\tilde{w}_{j}$ as $\tilde{w}_{j}=\tilde{w}_{j}-\frac{\eta}{\sqrt{s_{j}+\epsilon}}g_{\tilde{w}_{j}}$
9:             $w^{b}_{j}=\text{sign}(\tilde{w}_{j})$
10:         end for
11:         for all $f\in[1,m]$ do
12:             for each nonzero $z_{ij}$ do
13:                 compute the gradient $g_{\tilde{v}_{jf}}$ as in (14)
14:                 update $\tilde{s}_{jf}$ as $\tilde{s}_{jf}\mathrel{+}=(g_{\tilde{v}_{jf}})^{2}$
15:                 $\tilde{v}_{jf}=\tilde{v}_{jf}-\frac{\eta}{\sqrt{\tilde{s}_{jf}+\epsilon}}g_{\tilde{v}_{jf}}$
16:                 $v^{b}_{jf}=\text{sign}(\tilde{v}_{jf})$
17:             end for
18:         end for
19:     end for
20:     update the scaling parameters $\alpha$, $\beta$ using (10) and (11)
21: until the stopping criterion is met
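For concreteness, the following is a compact NumPy sketch of Algorithm 1 for the logistic loss with labels in $\{-1,1\}$. It uses dense arrays, recomputes $\alpha$ and $\beta$ per sample rather than per epoch, and uses placeholder hyperparameter values; it is a reading aid for the algorithm rather than the implementation used in the experiments.

```python
import numpy as np

def binarize(t):
    return np.where(t >= 0, 1.0, -1.0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def train_binarized_fm(Z, y, m=16, eta=0.1, lam1=1e-4, lam2=1e-4, epochs=10, eps=1e-8, seed=0):
    """STE + Adagrad training of Binarized FM (Algorithm 1) with the logistic loss."""
    rng = np.random.default_rng(seed)
    n, p = Z.shape
    w_t = rng.normal(0.0, 0.01, p)              # full-precision proxy of w^b
    V_t = rng.normal(0.0, 0.01, (p, m))         # full-precision proxy of V^b
    s_w, s_V = np.zeros(p), np.zeros((p, m))    # Adagrad accumulators

    for _ in range(epochs):
        for i in rng.permutation(n):
            z, yi = Z[i], y[i]
            wb, Vb = binarize(w_t), binarize(V_t)
            alpha, beta = np.abs(w_t).mean(), np.abs(V_t).mean()   # Eqs. (10)-(11)
            sf = z @ Vb                                            # (m,) precomputed sums
            fz = alpha * (wb @ z) + 0.5 * beta**2 * np.sum(sf**2 - (z**2) @ (Vb**2))
            dldf = -yi * sigmoid(-yi * fz)                         # d logloss / d f(z_i)

            nz = np.flatnonzero(z)                                 # the d nonzero entries
            # proxy w update, Eq. (13) with Adagrad step (16)
            g_w = (dldf * alpha * z[nz] + lam1 * alpha * wb[nz]) * (np.abs(w_t[nz]) <= 1)
            s_w[nz] += g_w ** 2
            w_t[nz] -= eta / np.sqrt(s_w[nz] + eps) * g_w
            # proxy V update, Eqs. (14)-(15) with Adagrad step (17)
            dfdV = beta**2 * (sf[None, :] - Vb[nz])                # z_{ij} = 1 on nonzeros
            g_V = (dldf * dfdV + lam2 * beta * Vb[nz]) * (np.abs(V_t[nz]) <= 1)
            s_V[nz] += g_V ** 2
            V_t[nz] -= eta / np.sqrt(s_V[nz] + eps) * g_V

    return binarize(w_t), binarize(V_t), np.abs(w_t).mean(), np.abs(V_t).mean()
```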
Table 1: Memory costs for different FM models

Method: Memory Cost (bits)
FM: 32 × (d + m × d)
SEFM: 32 × (d × b + m × d × b)
DFM: 32 × d + m × d
Binarized FM: d × b + m × d × b
Table 2: The accuracy, prediction time and memory cost of different methods

Dataset (n/d/#class) | Performance | Liblinear | FM | DFM | NFM | SEFM | Binarized FM
circles (5,000/2/2) | accuracy (%) | 47.88±1.75 | 49.20±1.29 | 51.03±2.94 | 99.95±0.04 | 99.91±0.08 | 99.95±0.06
 | prediction time (ms) | 0.3 | 2.0 | 2.2 | 39.3 | 1.6 | 1.6
 | memory cost | 0.06× | 1× | 0.12× | 893× | 30× | 0.94×
moons (5,000/2/2) | accuracy (%) | 89.68±0.42 | 88.57±0.37 | 88.71±0.96 | 99.93±0.07 | 99.99±0.00 | 99.99±0.00
 | prediction time (ms) | 0.3 | 2.7 | 2.8 | 92.7 | 1.8 | 1.2
 | memory cost | 0.01× | 1× | 0.02× | 130× | 3.95× | 0.12×
banana (5,300/2/2) | accuracy (%) | 56.10±0.58 | 54.82±3.65 | 56.74±1.12 | 68.49±1.88 | 88.98±0.47 | 89.25±0.48
 | prediction time (ms) | 0.3 | 1.7 | 2.0 | 132.2 | 1.3 | 1.8
 | memory cost | 0.06× | 1× | 0.09× | 1849× | 30× | 3.66×
breast-cancer (683/10/2) | accuracy (%) | 95.02±0.94 | 94.73±0.64 | 95.22±1.26 | 96.91±0.74 | 95.51±0.94 | 96.20±0.64
 | prediction time (ms) | 0.2 | 0.9 | 0.8 | 13.0 | 0.6 | 0.6
 | memory cost | 0.06× | 1× | 0.12× | 300× | 30× | 3.58×
segment (2,310/19/7) | accuracy (%) | 91.52±1.03 | 91.98±1.60 | 90.51±1.09 | 88.17±0.87 | 94.23±0.82 | 94.75±0.82
 | prediction time (ms) | 0.3 | 2.6 | 4.1 | 136.5 | 30.1 | 16.4
 | memory cost | 0.06× | 1× | 0.18× | 484× | 19.4× | 2.37×
pendigits (10,992/16/10) | accuracy (%) | 93.12±0.36 | 94.26±0.61 | 95.74±0.45 | 94.29±0.15 | 97.38±0.12 | 97.56±0.16
 | prediction time (ms) | 0.9 | 168.2 | 36.6 | 2,045.1 | 1,268.8 | 1,700.3
 | memory cost | 0.01× | 1× | 0.04× | 53× | 10× | 0.31×
ijcnn (49,990/22/2) | accuracy (%) | 92.08±0.20 | 94.75±0.26 | 94.59±0.35 | 95.61±0.08 | 96.29±0.18 | 96.45±0.09
 | prediction time (ms) | 8.9 | 158.3 | 193.7 | 651.2 | 125.7 | 229.2
 | memory cost | 0.01× | 1× | 0.04× | 45× | 7.67× | 0.94×
webspam (350,000/254/2) | accuracy (%) | 92.71±0.04 | 95.49±0.72 | 95.21±0.15 | 97.53±0.12 | 98.11±0.13 | 97.79±0.08
 | prediction time (ms) | 67.4 | 5,025.2 | 4,997.9 | 54,605.7 | 7,792.2 | 7,004.4
 | memory cost | 0.01× | 1× | 0.04× | 2.06× | 30× | 0.94×

3.4 Algorithm Implementation and Analysis

We summarize our proposed method in Algorithm 1. Step 1 obtains the new feature representation $\mathbf{Z}$ by applying one-hot encoding to each feature of the input data $\mathbf{X}$, which can be done in $O(nd)$ time (Lan and Geng, 2019). Step 2 initializes the proxy variables $\widetilde{\mathbf{w}}$, $\widetilde{\mathbf{V}}$ and the accumulators $\mathbf{s}\in\mathbb{R}^{p}=\mathbf{0}$, $\tilde{\mathbf{S}}\in\mathbb{R}^{p\times m}=\mathbf{0}$, which are used for adjusting the learning rates in Adagrad. From step 3 to step 21, we update the proxy parameters and the binary parameters using STE with Adagrad. The time complexity of learning the binarized parameters is $O(nnz(\mathbf{Z})\,m\,l)$, where $nnz(\mathbf{Z})$ denotes the number of non-zero values in $\mathbf{Z}$, $m$ is the low-rank parameter and $l$ is the number of passes over the dataset $\mathbf{Z}$. Empirically, the algorithm converges very fast and $l$ is a small number in our experiments.

3.5 Model Inference of Our Proposed Model

Similar to Lemma 3.1 in Rendle (2010), the prediction score of our model for a given input $\mathbf{z}_{i}$ can be written as

\[
\begin{aligned}
f(\mathbf{z}_{i})&=\alpha\sum_{j=1}^{p}w^{b}_{j}z_{ij}+\beta^{2}\sum_{j=1}^{p}\sum_{k=j+1}^{p}(\mathbf{v}^{b}_{j})^{T}\mathbf{v}^{b}_{k}\,z_{ij}z_{ik}\\
&=\alpha\sum_{j=1}^{p}w^{b}_{j}z_{ij}+\frac{1}{2}\beta^{2}\sum_{f=1}^{m}\Big(\big(\sum_{j=1}^{p}v^{b}_{jf}z_{ij}\big)^{2}-\sum_{j=1}^{p}(v^{b}_{jf})^{2}z_{ij}^{2}\Big).
\end{aligned}
\tag{18}
\]

Since each element of $\mathbf{V}^{b}$ is either $1$ or $-1$, $(v^{b}_{jf})^{2}=1$ and $\sum_{j=1}^{p}(v^{b}_{jf})^{2}z_{ij}^{2}=nnz(\mathbf{z}_{i})$, where $nnz(\mathbf{z}_{i})$ is the number of non-zero values in $\mathbf{z}_{i}$, which equals $d$ because of the one-hot encoding. Also note that $\mathbf{w}^{b}$ and $\mathbf{V}^{b}$ are binary weights and $\mathbf{z}_{i}$ is a sparse vector with $d$ nonzero values, each equal to $1$. Therefore, the model inference in (18) can be computed very efficiently using only addition and subtraction operations (without any multiplications). The arithmetic operations in (18) can be further replaced by cheap XNOR and POPCOUNT bitwise operations (Rastegari et al., 2016).
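A minimal sketch of this inference path (our naming), which touches only the $d$ nonzero positions of $\mathbf{z}_{i}$ and sums $\pm 1$ entries; a bit-packed XNOR/POPCOUNT variant in the spirit of Rastegari et al. (2016) would go one step further.

```python
import numpy as np

def binarized_fm_score(nz, wb, Vb, alpha, beta):
    """Prediction score of Eq. (18) for one sample, given the indices `nz`
    of its d nonzero (value 1) entries; wb is (p,), Vb is (p, m), both in {-1, +1}."""
    d, m = len(nz), Vb.shape[1]
    linear = alpha * wb[nz].sum()                         # sum of +/-1 entries
    s = Vb[nz].sum(axis=0)                                # (m,) column sums over nonzeros
    pairwise = 0.5 * beta**2 * (np.sum(s ** 2) - d * m)   # since (v^b_{jf})^2 = 1
    return linear + pairwise
```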

During model inference, we do not need to keep the proxy variables $\widetilde{\mathbf{w}}$ and $\widetilde{\mathbf{V}}$. We only need the binary parameters $\mathbf{w}^{b}$, $\mathbf{V}^{b}$ and the scaling parameters $\alpha$, $\beta$ to compute the prediction score of a test sample based on (18). We summarize the memory costs of different FM models in Table 1.
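The entries of Table 1 can be reproduced with a small helper (assuming 32-bit floats for full-precision coefficients, one bit per binary coefficient, and ignoring the negligible cost of $\alpha$ and $\beta$):

```python
def parameter_memory_bits(d, m, b):
    """Parameter memory in bits for the models of Table 1 (our helper)."""
    return {
        "FM": 32 * (d + m * d),
        "SEFM": 32 * (d * b + m * d * b),
        "DFM": 32 * d + m * d,              # full-precision w, binary V
        "Binarized FM": d * b + m * d * b,  # binary w and V on the mapped space
    }
```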

4 Experiments

In this section, we compare our proposed method with other competing algorithms on eight datasets. Summary information for each dataset is given in the first column of Table 2. Among them, there are three synthetic datasets and five real-life datasets. The three synthetic datasets (i.e., circles, moons and banana) are widely used for evaluating nonlinear models. Among the real-life datasets, three are binary classification and two are multi-class classification. The one-vs-all strategy is used for multi-class classification.

In our experiments, we evaluate the performance of the following six different classification methods:

  • Liblinear: an efficient solver for linear support vector machines (Fan et al., 2008).

  • FM: the original Factorization Machines (Rendle, 2010). We use the fastFM implementation (Bayer, 2016).

  • DFM: Discrete Factorization Machines (Liu et al., 2018), which binarize the feature-interaction parameters of FM. The DFM model focuses on the regression loss, and we extend it to classification losses.

  • NFM: Neural Factorization Machines (He and Chua, 2017), which combine FM with neural networks.

  • SEFM: it addresses the expressiveness limitation of FM by applying one-hot encoding to each input feature (Lan and Geng, 2019).

  • Binarized FM: our proposed method.

4.1 Experimental Setting.

For each dataset, we randomly select 70% of the data as the training set and use the remaining 30% as the test set. This procedure is repeated 10 times and we report the average accuracy on the test data. The low-rank parameter $m$ for FM, SEFM, DFM and Binarized FM is chosen from {16, 32, 64, 128}. The parameter $b$ (i.e., the number of bins) of SEFM and Binarized FM is chosen from {10, 20, 30, 40, 50}. The regularization parameter in Liblinear, FM, DFM and SEFM is chosen from $\{10^{-2},10^{-1},\dots,10^{2}\}$. The optimal parameter combination is selected by 5-fold cross-validation on the training data. The suggested settings are used for NFM, except that for webspam the hidden factor size and the number of layers are both decreased so that it can run in our computing environment.
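For reference, the search space described above can be written as a small grid (a sketch of the experimental configuration; the cross-validation itself can be run with any standard tooling):

```python
param_grid = {
    "m": [16, 32, 64, 128],                    # low-rank parameter (FM, SEFM, DFM, Binarized FM)
    "b": [10, 20, 30, 40, 50],                 # number of bins (SEFM, Binarized FM)
    "reg": [10.0 ** k for k in range(-2, 3)],  # regularization (Liblinear, FM, DFM, SEFM)
}
```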

Figure 3: Memory cost vs. test accuracy of different FM models. Panels: (a) circles, (b) moons, (c) segment, (d) pendigits, (e) ijcnn, (f) webspam.

4.2 Experimental Results.

The accuracy, prediction time and memory cost of all six algorithms are reported in Table 2. Note that the memory cost of the model parameters is normalized by the memory cost of standard FM; for example, 10× means the memory cost is 10 times that of standard FM. The best accuracy for each dataset is in bold. The second best accuracy is in italic.

Accuracy. As shown in Table 2, both SEFM and our proposed Binarized FM achieve higher accuracy than Liblinear, FM and DFM on all eight datasets. This is because the expressiveness capacities of Liblinear, FM and DFM are smaller than those of SEFM and our proposed Binarized FM. Overall, the linear classifier (i.e., Liblinear) obtains the lowest accuracy, while FM and DFM obtain higher accuracy than the linear classifier. DFM is less accurate than FM because it binarizes the feature-interaction model parameters. Our proposed Binarized FM achieves results comparable to SEFM with much lower memory cost.

Memory cost. With regard to memory cost, Liblinear and DFM have much lower memory cost than FM, as Liblinear only stores the linear coefficients and DFM binarizes the feature-interaction parameter $\mathbf{V}$ of FM. Our proposed Binarized FM, which binarizes both $\mathbf{w}$ and $\mathbf{V}$ and transforms the features by one-hot encoding, has lower memory cost than FM on circles, moons, pendigits, ijcnn and webspam, and slightly higher memory cost on the other datasets. The memory costs of SEFM and NFM are much higher than those of FM and our proposed method on all datasets.

Figure 3 visualizes the trade-off between memory cost and accuracy for four methods (FM, SEFM, DFM and Binarized FM) on six datasets. It clearly shows that our proposed Binarized FM achieves higher accuracy than FM and DFM; at the same memory cost, Binarized FM is consistently more accurate than FM. SEFM has high accuracy but the highest memory cost, while the memory cost of our proposed Binarized FM is markedly lower and its accuracy approaches that of SEFM.

The effect of the scaling parameters $\alpha$ and $\beta$. The accuracy and prediction time of our algorithm with and without the scaling parameters are reported in Table 3. With the scaling parameters $\alpha$ and $\beta$, our method achieves higher accuracy on most datasets than without them, especially on banana, segment, pendigits, ijcnn and webspam.

Table 3: The effect of the scaling parameters α and β

Dataset (n/d/#class) | Performance | Without scaling parameters | With scaling parameters
circles (5,000/2/2) | accuracy (%) | 99.97±0.06 | 99.95±0.06
 | prediction time | 1.5 ms | 1.6 ms
moons (5,000/2/2) | accuracy (%) | 99.99±0.00 | 99.99±0.00
 | prediction time | 1.2 ms | 1.2 ms
banana (5,300/2/2) | accuracy (%) | 87.01±0.60 | 89.25±0.48
 | prediction time | 1.5 ms | 1.8 ms
breast-cancer (683/10/2) | accuracy (%) | 96.00±1.06 | 96.20±0.64
 | prediction time | 0.8 ms | 0.6 ms
segment (2,310/19/7) | accuracy (%) | 92.29±0.51 | 94.75±0.82
 | prediction time | 17.5 ms | 16.4 ms
pendigits (10,992/16/10) | accuracy (%) | 95.90±0.39 | 97.56±0.16
 | prediction time | 1.5273 s | 1.7003 s
ijcnn (49,990/22/2) | accuracy (%) | 95.13±0.20 | 96.45±0.09
 | prediction time | 0.2037 s | 0.2292 s
webspam (350,000/254/2) | accuracy (%) | 96.71±0.08 | 97.79±0.08
 | prediction time | 7.1381 s | 7.0044 s

Illustration of Decision Boundaries on Synthetic Datasets. We first illustrate that our proposed method can produce highly nonlinear decision boundaries, like SEFM, on the synthetic datasets. We show the decision boundaries of FM, SEFM, DFM and our proposed Binarized FM on the moons and banana datasets in Figure 4. As shown in this figure, neither FM nor DFM can capture the highly nonlinear patterns on these two datasets, since they only use polynomial expansion on the original data to model the nonlinear relationship. SEFM and our proposed Binarized FM produce complex piecewise axis-perpendicular decision boundaries.

Figure 4: Decision boundaries of FM, SEFM, DFM and our method on the moons and banana datasets. Panels: (a) FM on moons, (b) SEFM on moons, (c) DFM on moons, (d) Binarized FM on moons, (e) FM on banana, (f) SEFM on banana, (g) DFM on banana, (h) Binarized FM on banana.

Comparison of the Convergence of Adagrad and SGD in Binarized FM. In addition to demonstrating the predictive performance of our proposed method, we examine the training process by plotting the training loss over iterations to show that Adagrad is more effective than SGD for our optimization problem. As shown in Figure 5, Adagrad reaches a lower training loss faster and converges more smoothly.

Figure 5: Comparison of the convergence of Adagrad and SGD. Panels: (a) banana, (b) ijcnn.

5 Conclusion

In this paper, we propose a novel method, Binarized FM, that reduces the memory cost of SEFM by binarizing the model parameters while maintaining accuracy comparable to SEFM. Since directly learning the binary model parameters is an NP-hard problem, we propose to use STE with Adagrad to efficiently learn the binary model parameters. We also provide a detailed analysis of our proposed method's space and computational complexities for model inference and compare it with other competing methods. Our experimental results on eight datasets clearly show that our proposed method achieves much better accuracy than FM with a similar memory cost, and achieves accuracy comparable to SEFM with much less memory cost. We also demonstrate that learning the model parameters by STE with Adagrad converges very fast during training.

References

  • Alizadeh et al. (2018) Alizadeh, M., Fernández-Marqués, J., Lane, N. D., and Gal, Y. (2018). An empirical study of binary neural networks' optimisation.
  • Bayer (2016) Bayer, I. (2016). fastFM: A library for factorization machines. Journal of Machine Learning Research, 17(184):1–5.
  • Bengio et al. (2013) Bengio, Y., Léonard, N., and Courville, A. (2013). Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.
  • Chang et al. (2010) Chang, Y.-W., Hsieh, C.-J., Chang, K.-W., Ringgaard, M., and Lin, C.-J. (2010). Training and testing low-degree polynomial data mappings via linear SVM. Journal of Machine Learning Research, 11(4).
  • Chen et al. (2019) Chen, X., Zheng, Y., Zhao, P., Jiang, Z., Ma, W., and Huang, J. (2019). A generalized locally linear factorization machine with supervised variational encoding. IEEE Transactions on Knowledge and Data Engineering, 32(6):1036–1049.
  • Duchi et al. (2011) Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7).
  • Fan et al. (2008) Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., and Lin, C.-J. (2008). LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9(Aug):1871–1874.
  • Guo et al. (2017) Guo, H., Tang, R., Ye, Y., Li, Z., and He, X. (2017). DeepFM: a factorization-machine based neural network for CTR prediction. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 1725–1731.
  • He and Chua (2017) He, X. and Chua, T.-S. (2017). Neural factorization machines for sparse predictive analytics. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 355–364.
  • Hubara et al. (2016) Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., and Bengio, Y. (2016). Binarized neural networks. In Advances in Neural Information Processing Systems, pages 4107–4115.
  • Lan and Geng (2019) Lan, L. and Geng, Y. (2019). Accurate and interpretable factorization machines. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI), pages 4139–4146.
  • Liu et al. (2017) Liu, C., Zhang, T., Zhao, P., Zhou, J., and Sun, J. (2017). Locally linear factorization machines. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 2294–2300. AAAI Press.
  • Liu et al. (2018) Liu, H., He, X., Feng, F., Nie, L., Liu, R., and Zhang, H. (2018). Discrete factorization machines for fast feature-based recommendation. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 3449–3455.
  • Rastegari et al. (2016) Rastegari, M., Ordonez, V., Redmon, J., and Farhadi, A. (2016). XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer.
  • Rendle (2010) Rendle, S. (2010). Factorization machines. In IEEE 10th International Conference on Data Mining (ICDM), pages 995–1000.
  • Ruder (2016) Ruder, S. (2016). An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747.
  • Shen et al. (2017) Shen, F., Mu, Y., Yang, Y., Liu, W., Liu, L., Song, J., and Shen, H. T. (2017). Classification by retrieval: Binarizing data and classifiers. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 595–604.
  • Xiao et al. (2017) Xiao, J., Ye, H., He, X., Zhang, H., Wu, F., and Chua, T.-S. (2017). Attentional factorization machines: learning the weight of feature interactions via attention networks. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 3119–3125.