
Wide Boosting

Michael T. Horrell, PhD
https://github.com/mthorrell
Abstract

Gradient Boosting (GB) is a popular methodology used to solve prediction problems by minimizing a differentiable loss function, $L$. GB performs very well on tabular machine learning (ML) problems; however, as a pure ML solver it lacks the ability to fit models with probabilistic but correlated multi-dimensional outputs, for example, multiple correlated Bernoulli outputs. GB also does not form intermediate abstract data embeddings, one property of Deep Learning that gives greater flexibility and performance on other types of problems. This paper presents a simple adjustment to GB motivated in part by artificial neural networks. Specifically, our adjustment inserts a matrix multiplication between the output of a GB model and the loss, $L$. This allows the output of a GB model to have increased dimension prior to being fed into the loss and is thus “wider” than standard GB implementations. We call our method Wide Boosting (WB) and show that WB outperforms GB on multi-dimensional output tasks and that the embeddings generated by WB are more useful in downstream prediction tasks than GB output predictions alone.

1 Introduction

Gradient Boosting (GB), first discussed in general in [6], is a popular methodology to build prediction models. Specifically, GB finds a function, $f$, restricted to a class of functions, $\mathcal{F}$, that attempts to minimize

$L(Y, f(X))$  (1)

for $L$ a differentiable loss function and $Y\in\mathbb{R}^{n\times d}$, $X\in\mathbb{R}^{n\times p}$, and $f(X)\in\mathbb{R}^{n\times q}$. For the most common $L$, $q=d$. Gradient Boosting accomplishes this by composing $f$ out of additive, iterative adjustments, $a_{i}$. $f_{0}$ can be initialized to be a constant function. Then

$f_{i+1} = f_{i} + \eta_{i} a_{i}$  (2)

where $\eta_{i}$ is a scalar and $a_{i}$ approximates or is otherwise related to $-\frac{\partial L(Y,f_{i}(X))}{\partial f_{i}(X)}$; thus, GB can be viewed as a kind of gradient descent in function space.
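As a concrete illustration of the update in (2), the following is a minimal sketch of gradient boosting for squared-error loss, where each round fits a small regression tree to the negative gradient. The sketch is ours (scikit-learn trees and synthetic data); it is not the implementation used by XGBoost, LightGBM, or wideboost.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
Y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=500)

f = np.full(500, Y.mean())           # f_0: constant initialization
eta = 0.1                            # step size eta_i, held fixed here
for i in range(100):
    neg_grad = Y - f                 # -dL/df for L = 1/2 * ||Y - f||^2
    a_i = DecisionTreeRegressor(max_depth=3).fit(X, neg_grad)
    f = f + eta * a_i.predict(X)     # f_{i+1} = f_i + eta_i * a_i

print(np.mean((Y - f) ** 2))         # training MSE after boosting
```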

[3] tabulated results from Kaggle, a prediction competition website, and found around 60% of winning solutions posted in 2015 used a particular software package implementing GB, XGBoost [3]. In a review of the 2015 KDD Cup, [2] further remarks on the effective and popular use of XGBoost in that competition. Since 2015, other software packages relying on GB have been developed and open-sourced by researchers at Microsoft (LightGBM [9]) and Yandex (CatBoost [4]). LightGBM has also been effectively used in several Kaggle prediction competitions [14].

In this paper, we develop a method that generalizes standard GB frameworks, which we call Wide Boosting (WB). This method introduces a matrix, $\beta$, multiplied with $f(X)$ so that the output dimension of $f(X)$, $q$, can be arbitrary. The output of the usual GB model is therefore an intermediate embedding of the input data. Writing WB in terms of the overall loss, WB estimates $f$ and $\beta$ in an attempt to minimize $L(Y, f(X)\beta)$.

We have implemented Wide Boosting in a Python package, wideboost (available via pypi.org or at https://github.com/mthorrell/wideboost). Notably, wideboost is able to use existing GB packages as a backend; thus we can take advantage of these highly optimized packages. Currently the XGBoost and LightGBM backends are implemented for wideboost. The XGBoost backend is currently the only backend supporting multi-dimensional responses (apart from the multinomial response, which is supported natively by XGBoost and LightGBM).

As a broader interpretation, Wide Boosting is exactly analogous to treating $f(X)$ as the output of the hidden layer in a dense, one-hidden-layer neural network. From this perspective, $f(X)$ embeds the rows of $X$ in a $q$-dimensional space prior to being processed for prediction. Other works combine the powerful neural network and tree fitting methodologies in different ways ([11], [10], [1], [15]). However, these works adjust the base methodologies and introduce further complexities to merge the approaches. Wide Boosting leverages both boosting and neural network architecture methodologies without significant adjustment and hence is able to take direct advantage of improvements in world-leading implementations of boosting while gaining some properties of neural networks. As one new method combining two popular prediction paradigms, Wide Boosting potentially points to more research areas where nodes of computational networks have new structures and methods to fit them. As an additional example, [8] indicates that the weights in a feed-forward neural network can be usefully fit using boosting.

It should also be noted that the simple $\beta$ multiplication isn’t the only way to merge GB and neural networks. Indeed, GB only requires a differentiable function relating inputs to a response (preferably twice differentiable to use XGBoost and LightGBM as a backend); thus, for example, gradient boosted decision trees can be put as the first layer in any feed-forward neural network.

The rest of this paper is structured as follows. Section 2 details Wide Boosting, its parameters and necessary calculations. Section 3 reviews numerical experiments using wideboost.

2 Wide boosting

Let $f(X)\in\mathbb{R}^{n\times q}$ and let $\beta\in\mathbb{R}^{q\times d}$. We fit $f$ and $\beta$ in an attempt to minimize

$L(Y, f(X)\beta)$.  (3)

With this formulation, $q$ is the output dimension of $f(X)$, and $\beta$, using matrix multiplication, gives $f(X)\beta$ a dimension that matches $Y$.

If we think of $\beta$ as part of the loss function, wide boosting is a form of gradient boosting and can make use of existing theory and methods to find $f$. Namely, $f$ can be additive decision trees, the usual method of estimating $f$ in GB models. If $L(Y,\cdot)$ is convex in its second argument, then $L(Y,\cdot\,\beta)$ is also convex. This property further preserves some convergence properties noted in [7]. For major existing implementations of gradient boosting (XGBoost [3], LightGBM [9], CatBoost [4]), wide boosting can be implemented by simply providing these frameworks with the right gradient and Hessian calculations. We give general calculations of gradients and Hessians in Appendix 5.
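As a sketch of the quantities a backend needs, the helper below (our own illustration, not code from XGBoost, LightGBM, or CatBoost) applies the chain-rule formulas of Appendix 5: given the per-row gradient $G$ and Hessian $H$ of $L$ with respect to $f(X)\beta$, it returns the gradient and Hessian with respect to $f(X)$.

```python
import numpy as np

def wide_grad_hess(G, H, beta):
    """Chain rule for Wide Boosting.

    G    : (n, d) array of gradient rows G_[i] of L w.r.t. (f(X) beta)_[i]
    H    : (n, d, d) array of Hessian blocks H_[[i]]
    beta : (q, d)
    Returns grad with shape (n, q) and hess with shape (n, q, q).
    """
    grad = G @ beta.T                                   # G_[i] beta^T
    hess = np.einsum('qd,nde,pe->nqp', beta, H, beta)   # beta H_[[i]] beta^T
    return grad, hess
```

In practice, backends such as XGBoost and LightGBM consume elementwise gradients and Hessians, so a wrapper would typically pass only the diagonal of $\beta H_{[[i]]}\beta^{T}$.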

2.1 Constructing and estimating $\beta$

We tried a few methods of initializing $\beta$ and have left these methods as hyperparameters in the wideboost package. Our base $\beta$ initializations are either an identity matrix appended to a uniform random matrix or a purely uniform random matrix. A further adjustment can be made to normalize the columns of these matrices in order to decouple the “wideness” of the network from the usual step parameter, $\eta$. Note that when $\beta=I$ and $q=d$, WB is equivalent to GB.
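A minimal sketch of these two initializations follows (our own rendering; the exact scheme and defaults in wideboost may differ):

```python
import numpy as np

def init_beta(q, d, kind="identity_uniform", normalize=True, seed=None):
    """Initialize a (q, d) beta: identity stacked on uniform noise, or purely uniform."""
    rng = np.random.default_rng(seed)
    if kind == "identity_uniform":
        beta = np.vstack([np.eye(d), rng.uniform(size=(q - d, d))])
    else:
        beta = rng.uniform(size=(q, d))
    if normalize:
        # normalize columns so the "wideness" q does not interact with the step size eta
        beta = beta / np.linalg.norm(beta, axis=0, keepdims=True)
    return beta

beta = init_beta(q=14, d=10)   # q > d gives a wide model; q = d with beta = I recovers GB
```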

Like $f$, $\beta$ can also be estimated. $\beta$ can be estimated in the same manner that one can use boosting to estimate linear regression coefficients. $\hat{\beta}$ is sequentially updated each round by fitting adjustment coefficients to $-\frac{\partial L(Y,f_{i}(X))}{\partial f_{i}(X)}$. Many other methods could be used to estimate $\beta$, but fitting $\beta$ also via boosting is convenient because the same intermediate gradients are already calculated when estimating $f(X)$. As a note of caution, initial experiments showed that fitting $\beta$ can be fairly unstable. This might be improvable through more careful conditioning. Currently in wideboost, using a very small learning rate for $\beta$ best handles these stability issues. Moreover, not fitting $\beta$ (setting its learning rate to 0) does not seem to cause a large performance drop.
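One way to read this update is sketched below under our own assumptions (a least-squares fit of adjustment coefficients followed by a small step; wideboost's exact update may differ):

```python
import numpy as np

def update_beta(beta, FX, G, lr_beta=1e-3):
    """One boosting-style adjustment of beta.

    FX : (n, q) current embedding f_i(X)
    G  : (n, d) gradient of L w.r.t. f_i(X) @ beta
    Regresses the negative gradient on the embedding to get adjustment
    coefficients, then takes a small step (small lr_beta for stability).
    """
    delta, *_ = np.linalg.lstsq(FX, -G, rcond=None)   # (q, d) adjustment coefficients
    return beta + lr_beta * delta
```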

2.2 Computational Scaling and Efficiency

Given that our main implementation of WB makes use of the XGBoost or LightGBM software packages as computational backends, we inherit both the optimizations employed in those packages and their limitations. Specifically, both XGBoost and LightGBM fit additional, independent trees when the output of $f$ has additional dimensions. Thus, WB computational complexity scales linearly with the number of added dimensions. For example, if $q=2d$, a single round of WB fits twice as many trees as a round fitting a GB model on the same dataset. Multi-dimension trees exist and can be used to fit WB models [5]; however, initial attempts at using multi-dimension-output trees for WB were more time-efficient but gave worse overall performance.

On computation timing, it’s worth noting that wideboost is a Python wrapper around either XGBoost or LightGBM, and thus generally has more overhead per round compared to vanilla GB trained using either base package. In one trial from Section 3.1 where wideboost was used to fit a vanilla GB model (parameters $\beta=I$, $q=d$, and $\beta$ not estimated during training), wideboost took 64% longer to train than the same model fit using XGBoost directly (indeed the output models were the same in this trial). Timing can be improved with more code optimization. Currently, at least for medium-sized problems, timing is not so long that the benefits of WB shown in Section 3 become impractical.

3 Numerical experiments

We repurpose the MNIST dataset [12] to investigate three properties of WB:

  1. Can WB improve on multi-dimension output problems where the dimensions are not independent? We treat the usual multinomial outcome as separate Bernoulli outcomes. WB, by not treating the multiple outcomes as independent, significantly outperforms GB, which, in current implementations, treats the outcomes as independent.

  2. Can WB make useful intermediate embeddings? To look at this we use a transfer learning setup similar to the experimental setup in [13]. An “upstream” model is trained to create embeddings. The quality of those embeddings is evaluated on a “downstream” dataset. Embeddings from WB outperform GB predictions used as embeddings (a technique better known as stacking) and are shown to contain more information than GB output for the modified MNIST task we consider.

  3. Does WB add value as an additional hyperparameter on usual GB problems? Paired with other hyperparameter tuning, WB was able to reach lower loss values faster than GB. This does not appear to always be the case, however; unsurprisingly, vanilla GB is sometimes best.

Note that all comparisons in this section compare WB to GB where both WB and GB are fit using wideboost. For GB models, we simply restrict the parameter space ($\beta=I$, $q=d$, $\beta$ is fixed). This gives the most comparable results.

3.1 Multi-dimension output learning

We repeatedly (n = 30) fit WB and GB models on samples of the MNIST dataset, using hyperopt to find appropriate hyperparameter values for each WB and GB model. To shorten computation, each iteration samples new training and validation sets of size 3,000 and 2,000 respectively from the MNIST training set. The test set is a sample of size 1,000 from the usual MNIST test set. Features are a sample of size 200 from the usual 28×28 = 784 features. hyperopt is given 25 chances per trial to find good hyperparameters. XGBoost as a backend was given 20 early stopping rounds (based on validation performance) with 500 maximum boost rounds.

The loss calculated is the average log-loss across the 10-dimensional one-hot response vector. We treat the dimensions as separate binary outcomes (as opposed to the usual multinomial outcome) in order to simulate a dataset with a multivariate response and unknown dependence across the responses.
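Concretely, the metric is the mean, over rows and over the 10 binary columns, of the Bernoulli log-loss; a short sketch of the computation (variable names are ours):

```python
import numpy as np

def average_binary_log_loss(Y, P, eps=1e-12):
    """Y: (n, 10) one-hot labels treated as 10 separate binary outcomes.
       P: (n, 10) predicted probabilities, one per outcome."""
    P = np.clip(P, eps, 1 - eps)
    return float(-np.mean(Y * np.log(P) + (1 - Y) * np.log(1 - P)))
```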

Across the 30 trials, WB makes use of the dependence across the response dimensions to achieve a statistically significantly lower average log-loss (see Table 1).

| Model | N Trial Wins | Avg test log-loss | t-stat | Avg n trees | t-stat |
|-------|--------------|-------------------|----------|-------------|--------|
| WB    | 28           | 0.0541 (0.0007)   | -3.9**** | 1245 (206)  | -2.1*  |
| GB    | 2            | 0.0582 (0.0008)   |          | 1785 (145)  |        |

Table 1: * t-stat is significant at the 0.05 level. **** t-stat is significant at the 0.001 level.

As part of the hyperparameter optimization, WB could have chosen plain GB (where $q=d$ and $\beta=I$). However, the hyperparameter optimization consistently chose $q>d$. For all ten trials $q>d$, and the average value for $q$ was 14.

Also of note is that WB seems to be relatively efficient, fitting fewer trees on average. This is not necessarily an intuitive result because a single round of WB where $q>d$ will fit more trees than a single round of GB (for example, if $q=14$, 40% more trees are fit in WB compared to GB). This can be explained by WB using fewer boost rounds than GB for several trials. The median number of boost rounds used for WB was 47, while the median for GB was 153. Roughly restating, while WB fits 140% of the GB number of trees per round in this exercise, vanilla GB tends to use over three times as many boost rounds.

3.2 Embedding Quality

As in the experiment in Section 3.1, we treat the MNIST response as separate binary outcomes. Following a similar experiment format to [13], we run several trials (n = 30), each consisting of an “upstream” dataset used to fit an initial model on half of the binary outputs. This initial model is then used on a “downstream” dataset to generate embeddings of the downstream data for use in predictions on the other half of the binary outputs. The quality of the embedding is then judged by evaluating prediction performance on the downstream dataset using only the embeddings generated from the upstream model on the downstream data.

Each trial generates new upstream and downstream datasets by sampling. Upstream datasets are of size [3,000, 2,500, 2,000] for the train, validation and test sets respectively. Downstream datasets are of size [500, 1,500, 1,000] for the train, validation and test sets respectively. Each trial also samples 5 of the 10 response dimensions for the upstream dataset and leaves the remaining 5 as the response of the downstream dataset. We again use hyperopt to optimize the models at both the upstream and downstream stages. The hyperparameter optimization is given 15 tries for each trial.

To ensure a fair comparison between the WB and GB embeddings, the downstream model is simply a set of GB models (one for each output dimension) using either the WB or GB embeddings as inputs. To emphasize, even though the WB trials use WB for the upstream model, the downstream model is always a GB model.
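A sketch of the downstream evaluation under these assumptions, using scikit-learn's GradientBoostingClassifier as a stand-in for the per-dimension GB models (the experiments themselves use wideboost-wrapped backends):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss

def downstream_log_loss(emb_train, Y_train, emb_test, Y_test):
    """emb_* : (n, q) embeddings f(X) produced by the upstream WB or GB model.
       Y_*   : (n, 5) held-out binary outcomes; one GB model per outcome."""
    losses = []
    for j in range(Y_train.shape[1]):
        clf = GradientBoostingClassifier().fit(emb_train, Y_train[:, j])
        p = clf.predict_proba(emb_test)[:, 1]
        losses.append(log_loss(Y_test[:, j], p, labels=[0, 1]))
    return float(np.mean(losses))
```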

Based on downstream test set performance, the embeddings made by WB outperform GB logits used as embeddings in a downstream model.

| Model | N Trial Wins | Avg downstream test log-loss | t-stat    |
|-------|--------------|------------------------------|-----------|
| WB    | 30           | 0.180 (0.004)                | -7.5***** |
| GB    | 0            | 0.220 (0.003)                |           |

Table 2: ***** t-stat is significant at less than the 0.001 level.

3.3 WB as an additional Hyperparameter

If we employ WB head-to-head against GB on the classical MNIST prediction problem (this time treating the response properly as a multinomial), the advantages of WB are less clear.

Specifying $q>d$ and leaving all other parameters the same gives better models for the same number of boosting rounds (Figure 1(a)), but if we control for the extra number of trees, the benefit of WB disappears (Figure 1(b)).

Figure 1: Test error rates for differing $q$ on the MNIST dataset. Larger $q$ shows better performance for earlier boosting rounds (subfigure (a)). However, when controlling for the number of trees this difference disappears (subfigure (b)).

Nonetheless it is interesting that the two lines in Figure 1(b) are so close. On this dataset at least, the repeated gradient corrections employed in GB do not seem to have any special advantage over fitting more trees simultaneously and adding them together.

More generally, although the advantage of WB on the classic MNIST task is negligible or nonexistent, specific trials in Section 3.1 showed that WB employed with hyperparameter tuning can give better performance using less compute.

4 Conclusion

Wide Boosting is a relatively straightforward generalization of standard Gradient Boosting and introduces greater flexibility to the GB methodology. Specifically, WB outperformed GB on a prediction problem where the response is multi-dimensional and not statistically independent across the dimensions. WB also introduces the concept of abstract embeddings to GB and outperformed GB on a transfer learning task. As in deep learning, these embeddings can be used for transfer learning or may be of direct interest themselves.

The Python package wideboost makes WB publicly available. Users of the popular packages XGBoost and LightGBM in particular may find benefit in using wideboost because wideboost currently wraps both packages, supplying them with the gradient and Hessian calculations.

Given that WB can also be thought of in a computational network or Deep Learning context, there may be similar, useful network architectures that can take advantage of Gradient Boosting or other powerful prediction methodologies not traditionally related to Deep Learning. For example, future work may use gradient boosted decision trees as the first layer in deep neural networks.

References

  • Balestriero, [2017] Balestriero, R. (2017). Neural decision trees. arXiv preprint arXiv:1702.07360.
  • Bekkerman, [2015] Bekkerman, R. (2015). The present and the future of the KDD Cup competition: an outsider’s perspective. https://www.linkedin.com/pulse/present-future-kdd-cup-competition-outsiders-ron-bekkerman.
  • Chen and Guestrin, [2016] Chen, T. and Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794.
  • Dorogush et al., [2018] Dorogush, A. V., Ershov, V., and Gulin, A. (2018). CatBoost: Gradient boosting with categorical features support. arXiv preprint arXiv:1810.11363.
  • Dumont et al., [2009] Dumont, M., Marée, R., Wehenkel, L., and Geurts, P. (2009). Fast multi-class image annotation with random subwindows and multiple output randomized trees. volume 2, pages 196–203.
  • Friedman, [2001] Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of statistics, pages 1189–1232.
  • Grubb and Bagnell, [2011] Grubb, A. and Bagnell, J. A. (2011). Generalized boosting algorithms for convex optimization. In Proceedings of the 28th International Conference on International Conference on Machine Learning, pages 1209–1216.
  • Horrell, [2019] Horrell, M. (2019). Gradient fitting for deep learning. https://mthorrell.github.io/horrellblog/2019/04/28/gradient-fitting-for-deep-learning/. Accessed: 2020-07-19.
  • Ke et al., [2017] Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., and Liu, T.-Y. (2017). LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, pages 3146–3154.
  • Ke et al., [2019] Ke, G., Xu, Z., Zhang, J., Bian, J., and Liu, T.-Y. (2019). DeepGBM: A deep learning framework distilled by GBDT for online prediction tasks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 384–394.
  • Kontschieder et al., [2015] Kontschieder, P., Fiterau, M., Criminisi, A., and Rota Bulo, S. (2015). Deep neural decision forests. In Proceedings of the IEEE international conference on computer vision, pages 1467–1475.
  • LeCun and Cortes, [2010] LeCun, Y. and Cortes, C. (2010). MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/.
  • Levin et al., [2022] Levin, R., Cherepanova, V., Schwarzschild, A., Bansal, A., Bruss, C. B., Goldstein, T., Wilson, A. G., and Goldblum, M. (2022). Transfer learning with deep tabular models. arXiv preprint arXiv:2206.15306.
  • Microsoft, [2020] Microsoft (2020). LightGBM: Machine learning challenge winning solutions. https://github.com/microsoft/LightGBM/tree/master/examples#machine-learning-challenge-winning-solutions.
  • Popov et al., [2019] Popov, S., Morozov, S., and Babenko, A. (2019). Neural oblivious decision ensembles for deep learning on tabular data. arXiv preprint arXiv:1909.06312.

5 Appendix: Gradient and Hessian calculations

Consider the $i$-th row of $f(X)$, denoted $f_{[i]}$. We calculate $\frac{\partial L}{\partial f_{[i]}}$ and $\frac{\partial^{2} L}{\partial f_{[i]}\partial f_{[i]}^{T}}$, where $L$ is defined in (3). If we denote $G_{[i]}$ and $H_{[[i]]}$ as the gradient and Hessian of $L$ with respect to the $i$-th row of $f(X)$ when $L$ is defined by (1), the gradient and Hessian for $L$ with respect to $f_{[i]}$ when $L$ is defined by (3) can be computed using the chain rule:

$\frac{\partial L}{\partial f_{[i]}} = G_{[i]}\beta^{T}$
$\frac{\partial^{2}L}{\partial f_{[i]}\partial f_{[i]}^{T}} = \beta H_{[[i]]}\beta^{T}$

Applying these formulas to common loss functions, we find the following example calculations of gradients and Hessians for Wide Boosting. Note that we will use the subscript $[i]$ to denote the $i$-th row of a matrix.

5.0.1 Regression – Squared Error

$Y\in\mathbb{R}^{n\times 1}$, $f(X)\in\mathbb{R}^{n\times q}$, and $\beta\in\mathbb{R}^{q\times 1}$.

$L(Y, f(X)\beta) = \frac{1}{2}\|Y - f(X)\beta\|^{2}$
$\frac{\partial L}{\partial f_{[i]}} = (f(X)\beta - Y)_{[i]}\,\beta^{T}$
$\frac{\partial^{2}L}{\partial f_{[i]}\partial f_{[i]}^{T}} = \beta\beta^{T}$
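As a quick, illustrative check of the squared-error formulas (not part of the paper's experiments), the analytic gradient can be compared to a finite-difference approximation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, q, d = 4, 3, 1
FX = rng.normal(size=(n, q))       # stands in for f(X)
beta = rng.normal(size=(q, d))
Y = rng.normal(size=(n, d))

def loss(FX_):
    return 0.5 * np.sum((Y - FX_ @ beta) ** 2)

i, eps = 0, 1e-6
analytic = (FX @ beta - Y)[i] @ beta.T        # (f(X) beta - Y)_[i] beta^T
numeric = np.zeros(q)
for k in range(q):
    FX_plus = FX.copy()
    FX_plus[i, k] += eps
    numeric[k] = (loss(FX_plus) - loss(FX)) / eps

print(np.allclose(analytic, numeric, atol=1e-4))   # expected: True
```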

5.0.2 Binary Classification – Log-loss

$Y\in\{0,1\}^{n\times 1}$, $f(X)\in\mathbb{R}^{n\times q}$, and $\beta\in\mathbb{R}^{q\times 1}$.

Let $P\in\mathbb{R}^{n\times 1}$ where $P_{[i]} = \frac{\exp[f(X)\beta]_{[i]}}{1+\exp[f(X)\beta]_{[i]}}$
$L(Y, f(X)\beta) = -\left[Y^{T}\log(P)+(\mathbf{1}-Y)^{T}\log(\mathbf{1}-P)\right]$
$\frac{\partial L}{\partial f_{[i]}} = (P-Y)_{[i]}\,\beta^{T}$
$\frac{\partial^{2}L}{\partial f_{[i]}\partial f_{[i]}^{T}} = \left(P_{[i]}-P_{[i]}^{2}\right)\beta\beta^{T}$

where the $\exp[\cdot]$ and $\log[\cdot]$ functions are applied elementwise to the input matrix and $\mathbf{1}\in\mathbb{R}^{n\times 1}$ is a vector of all ones.

5.0.3 Multi-class Classification – Log-loss

$Y\in\{0,1\}^{n\times d}$, $f(X)\in\mathbb{R}^{n\times q}$, and $\beta\in\mathbb{R}^{q\times d}$.

Let $P\in\mathbb{R}^{n\times d}$ where $P_{[i]} = \frac{\exp[f(X)\beta]_{[i]}}{\exp[f(X)\beta]_{[i]}\mathbf{1}}$
$L(Y, f(X)\beta) = -\,\mbox{tr}(Y\log(P)^{T})$
$\frac{\partial L}{\partial f_{[i]}} = (P-Y)_{[i]}\,\beta^{T}$
$\frac{\partial^{2}L}{\partial f_{[i]}\partial f_{[i]}^{T}} = \beta\left(\mbox{diag}[P_{[i]}]-P_{[i]}^{T}P_{[i]}\right)\beta^{T}$

where the $\exp[\cdot]$ and $\log[\cdot]$ functions are applied elementwise to the input matrix and $\mathbf{1}\in\mathbb{R}^{d\times 1}$ is a vector of all ones. The $\mbox{diag}[\cdot]$ function takes a row vector and returns a square, diagonal matrix with diagonal elements corresponding to the elements of the vector.