Faster Algorithms for Learning Convex Functions
Abstract
The task of approximating an arbitrary convex function arises in several learning problems such as convex regression, learning with a difference of convex (DC) functions, and learning Bregman or f-divergences. In this paper, we develop and analyze an approach for solving a broad range of convex function learning problems that is faster than state-of-the-art approaches. Our approach is based on a 2-block ADMM method where each block can be computed in closed form. For the task of convex Lipschitz regression, we establish that our proposed algorithm converges with iteration complexity of for a dataset and . Combined with per-iteration computation complexity, our method converges with the rate . This new rate improves the state-of-the-art rate of if . Further, we provide similar solvers for DC regression and Bregman divergence learning. Unlike previous approaches, our method is amenable to the use of GPUs. We demonstrate on regression and metric learning experiments that our approach is over 100 times faster than existing approaches on some datasets, and produces results that are comparable to the state of the art.
1 Introduction
The importance of convex functions in machine learning is undisputed. However, while most applications of convexity in machine learning involve fixed convex functions, an emerging trend in the field has focused on the problem of learning convex functions for a particular task. In this setting, data or supervision is used to tailor a convex function for some end goal. Recent machine learning examples include learning with a difference of convex (DC) functions (Siahkamari et al., 2020), learning input convex neural networks for tasks in reinforcement learning or vision (Amos et al., 2017), and learning divergences between points or distributions via learning the underlying convex function in a Bregman divergence (Siahkamari et al., 2019) or f-divergence (Zhang et al., 2020) for problems such as data generation, metric learning, and others.
In fact, work in convex function learning dates back at least to methods developed for the problem of convex regression, an important learning problem commonly used in econometrics and engineering for modeling demand, utility, and production curves (Afriat, 1967; Varian, 1982). Convex regression is the problem of estimating a convex function from a dataset , where are predictors and are continuous responses. Consider
(1)
as the class of all convex functions over . Then an estimator is proposed by minimizing the squared error between observations and responses ,
(2)
where is some penalty term. A key insight about this problem is that although Eq. (2) is an infinite-dimensional minimization problem, Boyd et al. (2004) show that it can be solved as a finite-dimensional convex optimization problem, since the optimal solution can be shown to be piecewise linear. Furthermore, generalization bounds for this problem have recently been developed in (Balázs, 2016; Siahkamari et al., 2020) using certain classes for .
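To make the reduction concrete, a standard finite-dimensional reformulation (a sketch in generic notation; the exact penalized program used in this paper is given later in Eq. (3)) fits one affine piece per data point and couples them through convexity constraints:

```latex
\min_{\hat{y}_1,\dots,\hat{y}_n,\ \mathbf{a}_1,\dots,\mathbf{a}_n}\ \sum_{i=1}^{n}\big(y_i-\hat{y}_i\big)^2
\quad\text{s.t.}\quad \hat{y}_j \ \ge\ \hat{y}_i+\mathbf{a}_i^{\top}(\mathbf{x}_j-\mathbf{x}_i)\quad \forall\, i,j\in[n],
```

with the fitted convex function then recovered as the max-affine interpolant $\hat{f}(\mathbf{x})=\max_{i}\{\hat{y}_i+\mathbf{a}_i^{\top}(\mathbf{x}-\mathbf{x}_i)\}$.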
Many other convex function learning problems share a similar structure and also yield optimization problems that can be tractably solved. However, a key downside is that the computational complexity is often prohibitive for real-world applications, particularly due to the non-parametric nature of the resulting optimization problems. For instance, the state-of-the-art solver for convex regression is based on interior point methods and has a complexity of for a given accuracy .
Thus, our goal in this paper is to study a class of methods that yields provably faster convergence for a number of different convex function learning problems. Our main insight is that many convex function learning problems may be expressed as optimization problems that can be solved with a 2-block ADMM algorithm such that each of the two blocks can be computed in closed form. For convex regression, the resulting method (which we call FCR, for Faster Convex Regression) is guaranteed to converge with computational complexity of for a dataset and . This new rate improves the state of the art in (Balázs, 2016), available via interior point methods, when . Beyond the improved convergence rate, FCR makes several further contributions. First, FCR is stand-alone and does not rely on any external optimization software. Second, FCR is based on tensor computations and can easily be implemented on GPUs. Third, FCR can include regularization terms in the context of non-parametric convex regression.
We extend the FCR solver to problems beyond convex regression. In particular, we provide 2-block ADMM solvers and present empirical comparisons for DC regression (Siahkamari et al., 2020) and Bregman divergence learning (Siahkamari et al., 2019). Finally, we demonstrate empirical results showing that our approach yields significantly more scalable convex function learning methods, with speedups of over 100 times on some problems.
Notation.
We generally denote scalars as lower case letters, vectors as lower case bold letters, and matrices as upper case bold letters.
For a dataset , represents the number of observations and is the number of covariates.
We define and as an outer product. We define to be the set of integers . By and we denote the largest subgradient and the largest partial sub-derivative.
1.1 Connections to Existing Methodologies
Convex regression has been extensively studied over the last two decades. Boyd et al. (2004) introduce the non-parametric convex regression problem in Eq. (2). Balázs (2016) shows that convex Lipschitz regression requires regularization in order to have generalization guarantees. In particular, it was shown that the penalty term yields a generalization error of .
Balázs (2016) and Mazumder et al. (2019) provide ADMM solvers for the convex regression problem; however, their formulations have more than two blocks and are not guaranteed to converge in all circumstances. More recently, Chen & Mazumder (2020) provide an active-set type solver and Bertsimas & Mundru (2021) provide a delayed constraint generation algorithm, both of which have favorable scalability. They use a penalty of , which makes the loss function in Eq. (2) strongly convex and easier to solve. However, no theoretical generalization bound is known when using this new penalty term. Siahkamari et al. (2020) use for learning a difference of convex functions, for which they provide a multi-block ADMM solver.
On the other hand, Ghosh et al. (2019) and Kim et al. (2021) study the parametric class of max-linear functions for . They provide convex programming and alternating algorithms. The downside of these methods is the assumption that the predictors are i.i.d. samples from a normal distribution and, furthermore, that the features are independent. These assumptions are needed for their algorithms to converge in theory. We note that this parametric class, if , is the same as when used in the convex regression minimization problem Eq. (2).
Our methodology is most similar to Balázs (2016) and is set in the context of learning theory, where we do not make any distributional assumption on the predictors and only require them to be i.i.d. and bounded. We further need to know a bound on but not on .
2 Convex Regression with Lasso Penalty
Suppose we are given a dataset where are predictors, assumed drawn i.i.d. from a distribution supported on a compact domain with , and are responses such that for centered, independent random noise . Assume that and are both bounded by . Furthermore, is convex and Lipschitz. We propose to solve the penalized convex regression problem in Eq. (2) with the penalty term . This results in the following convex optimization problem:
(3)
Then, we estimate via
(4)
This model is similar to Balázs (2016), with the minor modification of using a different penalty term in the loss function. Our penalty term depends on the sum of partial derivatives rather than . This new penalty term acts like a Lasso-type regularizer and encourages feature sparsity. It is easy to show that the estimator is bounded, i.e., . Also , for . Hence, similarly to Siahkamari et al. (2020), our penalty term is a valid seminorm, which allows us to use their theorem here.
Proposition 1.
With the appropriate choice of which requires knowledge of the bound on and , it holds that with probability at least over the data, the estimator of has excess risk upper bounded by
2.1 Optimization
Our method utilizes the well-known ADMM algorithm (Gabay & Mercier, 1976). ADMM is a standard tool for solving convex problems that consist of multiple variable blocks coupled by linear constraints (Eckstein & Yao, 2012). It is an iterative procedure for updating the problem variables with provable convergence guarantees (He & Yuan, 2012).
We solve program (3) using ADMM; the main insight is that we can solve a 2-block ADMM formulation where each block can be computed in closed form. We first consider an equivalent form of the optimization problem as:
(5)
with the augmented Lagrangian
where , and are dual variables. We divide the parameters into two blocks, and , where . We find closed-form solutions for each block given the solution to the other block.
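Before deriving the closed-form updates, the overall iteration can be summarized by the following sketch (the three callables are hypothetical stand-ins for the block and dual updates derived below, not the exact variable layout of Algorithm 2):

```python
import numpy as np

def two_block_admm(update_block1, update_block2, update_duals,
                   state, rho=1.0, num_iters=1000):
    """Generic 2-block ADMM loop: exactly minimize the augmented Lagrangian
    over each primal block in turn, then take a dual ascent step.
    `state` is a dict of NumPy arrays; the update callables are
    problem-specific closed-form solves (hypothetical interfaces)."""
    avg = {k: np.zeros_like(v) for k, v in state.items()}
    for t in range(1, num_iters + 1):
        state = update_block1(state, rho)   # closed-form solve for block 1
        state = update_block2(state, rho)   # closed-form solve for block 2
        state = update_duals(state, rho)    # multiplier (dual) updates
        # Theorem 1 bounds the objective at the running average of iterates.
        for k in avg:
            avg[k] += (state[k] - avg[k]) / t
    return avg
```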
2.1.1 First block
We first note that we always normalize the dataset such that and . This will result in and simplify the solutions.
By setting we can solve for as:
(6)
where
Similarly, by setting and substituting Eq. (6) for , we can solve for simultaneously as a system of linear equations,
(7)
where , , and
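The coefficient matrices are not reproduced above, but solving many such linear systems at once is exactly the kind of batched operation that runs well on a GPU. A generic sketch with placeholder matrices M and right-hand sides r (whether Eq. (7) decouples into independent per-point systems depends on details not shown here):

```python
import torch

# n independent d x d systems M[i] a[i] = r[i], solved in one batched call
n, d = 1000, 8
M = torch.eye(d).expand(n, d, d) + 0.1 * torch.randn(n, d, d)  # well-conditioned toy systems
r = torch.randn(n, d)
a = torch.linalg.solve(M, r.unsqueeze(-1)).squeeze(-1)         # shape (n, d)
```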
2.1.2 Second block
Set for . Hence
(8)
Lastly, set and plug in the solutions for , and . Denoting , after rearranging the terms we have:
Note that the right-hand side is a monotonic, piecewise linear function of . It is easy to solve for using a sort followed by a simple scan that takes flops; see the L_update routine in Algorithm 1. Observe that it is possible to add monotonicity constraints for by projecting either or to zero.
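The exact equation solved by L_update is not reproduced above, but the sort-and-scan idea can be illustrated on a representative problem of the same type: find lambda with sum_i max(b_i - lambda, 0) = c for given breakpoints b_i and c > 0 (the names b and c here are illustrative, not the paper's variables).

```python
import numpy as np

def solve_monotone_pwl(b, c):
    """Solve sum_i max(b_i - lam, 0) = c for lam, assuming
    0 < c <= sum_i max(b_i, 0).  The left-hand side is continuous,
    non-increasing, and piecewise linear in lam, so after sorting the
    breakpoints we can locate the active segment using prefix sums."""
    b = np.sort(np.asarray(b, dtype=float))[::-1]   # breakpoints, descending
    csum = np.cumsum(b)                             # sums of the k largest b_i
    n = len(b)
    for k in range(1, n + 1):
        # On the segment where exactly the k largest b_i are active,
        # the equation reads csum[k-1] - k * lam = c.
        lam = (csum[k - 1] - c) / k
        lo = b[k] if k < n else -np.inf             # lower end of this segment
        if lo <= lam <= b[k - 1]:
            return lam
    return 0.0  # not reached under the stated assumption

print(solve_monotone_pwl([3.0, 1.0, 0.5], c=2.0))   # 1.0, since (3-1)+(1-1)+0 = 2
```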
2.1.3 Dual variables
The update for dual variables follows from the standard ADMM algorithm updates:
(9)
2.1.4 Algorithms via ADMM
Algorithm 2 provides the full steps of a parallel ADMM optimizer for the convex regression problem. In each line of Algorithm 2, we use subscripts such as , and on the left-hand side; these updates may be run in parallel over , , or . We initially set all block variables to zero and normalize the dataset such that and . We have implemented our algorithm using PyTorch (Paszke et al., 2019), which benefits from this parallel structure when using a GPU. Our code, along with a built-in tuner for our hyperparameters and , is available on our GitHub repository: https://github.com/Siahkamari/Piecewise-linear-regression.
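To illustrate why these updates map well onto a GPU, the following sketch evaluates a max-affine model (the piecewise-linear form taken by estimators of this type) at a batch of query points with a single broadcasted tensor expression. The variable names are illustrative, not the exact quantities stored by Algorithm 2.

```python
import torch

def max_affine_predict(X_train, y_hat, A, X_query):
    """Evaluate f(x) = max_i { y_hat_i + a_i^T (x - x_i) } for a batch of
    query points.  Broadcasting forms all n x m affine terms in one tensor
    operation, which runs on a GPU if the inputs are CUDA tensors."""
    # diffs[i, j, :] = X_query[j] - X_train[i],  shape (n, m, d)
    diffs = X_query.unsqueeze(0) - X_train.unsqueeze(1)
    planes = y_hat.unsqueeze(1) + (A.unsqueeze(1) * diffs).sum(dim=-1)  # (n, m)
    return planes.max(dim=0).values                                     # (m,)

# usage sketch with random placeholders
n, m, d = 1000, 64, 8
X_train, X_query = torch.randn(n, d), torch.randn(m, d)
y_hat, A = torch.randn(n), torch.randn(n, d)
preds = max_affine_predict(X_train, y_hat, A, X_query)
```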
2.2 Analysis
Our method has two sources of error: the error due to the ADMM procedure (Eq. 5) and the error due to estimating the ground-truth convex function (Eq. 4). We characterize both errors based on the ADMM convergence analysis of He & Yuan (2012).
Theorem 1.
[from He & Yuan (2012)] Consider the separable convex optimization problem,
where (, ) are the block variables, (, ) are convex sets that include the zero vector, (, ) are the coefficient matrices and is a constant vector. Let and be the solutions at iteration of a two-block ADMM procedure with learning rate , where are all zero vectors. Denote the average of the iterates as . For all we have,
where are the optimal solutions.
We have arranged Theorem 1 so that it explicitly depends on the problem-dependent constants such as , the number of iterations , as well as the learning rate . A proof is provided in Appendix B.1.
Next, we present a convergence analysis in terms of the regularized MSE of the output of the ADMM algorithm (Algorithm 2). This has the further benefit of identifying an appropriate learning rate and a range for the regularization coefficient that minimize the computational complexity.
Theorem 2.
Let be the output of Algorithm 2 at the iteration, and . Denote . Assume and . If we choose , for and we have:
Corollary 1.
Our method needs iterations to achieve error. Each iteration requires flops. Preprocessing costs . Therefore, the total computational complexity is .
We note that the assumptions of Theorem 2 are satisfied simply by normalizing the training dataset. The main difficulty in the derivation of Theorem 2 is that might violate the constraints in (5). This could result in and . Therefore, the main step of the proof of Theorem 2 is to characterize and bound the effect of such constraint violations on our objective function. We then set in Theorem 1 so as to account for the effect of the constraint violations. For the full proof, we refer to Appendix B.2.
3 Approximating a Bregman Divergence
We next consider the application of learning a Bregman divergence from supervision. This is a type of metric learning problem: we are given supervision (pairs labeled as similar or dissimilar, or triplets encoding relative comparisons among data points) and we aim to learn a task-specific distance or divergence measure from the data. Classical metric learning methods typically learn a linear transformation of the data, corresponding to a Mahalanobis-type distance function,
where is a positive semi-definite matrix (note that this generalizes the squared Euclidean distance, where would be the identity matrix). More generally, Mahalanobis distances are examples of Bregman divergences, which include other divergences such as the KL-divergence as special cases.
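For reference, the standard definitions (written here in generic notation) are

```latex
d_{\mathbf{M}}(\mathbf{x},\mathbf{x}') = (\mathbf{x}-\mathbf{x}')^{\top}\mathbf{M}\,(\mathbf{x}-\mathbf{x}'),\qquad \mathbf{M}\succeq 0,
\qquad\text{and}\qquad
D_{\phi}(\mathbf{x},\mathbf{x}') = \phi(\mathbf{x})-\phi(\mathbf{x}')-\langle\nabla\phi(\mathbf{x}'),\,\mathbf{x}-\mathbf{x}'\rangle,
```

where $\phi$ is a strictly convex function; taking $\phi(\mathbf{x})=\mathbf{x}^{\top}\mathbf{M}\mathbf{x}$ recovers the Mahalanobis distance above, and on the probability simplex taking $\phi$ to be the negative entropy recovers the KL-divergence.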
Bregman divergences are parameterized by an underlying strictly convex function, sometimes called the convex generating function of the divergence. Recently, Siahkamari et al. (2019) formulated learning a Bregman divergence as learning the underlying convex function parameterizing the divergence. The resulting optimization problem they study is similar to the convex regression problem in Eq. (5). Here we discuss an improvement to their approach with a slightly different loss function and our Lasso penalty term. We then derive an algorithm using 2-block ADMM; the existing solver for this problem uses standard linear programming.
In particular, suppose we observe a classification dataset , where and is an integer. Let
be a Bregman divergence with convex generating function . Let the pairwise similarity loss be
We estimate the underlying convex function of the Bregman divergence by,
We note that this approach can easily be generalized to other possible metric learning losses.
3.1 Optimization
Now we provide an algorithm for Bregman divergence learning based on 2-block ADMM, in which we solve for each block in closed form. We first reformulate the optimization problem into standard form; introducing some auxiliary variables, we have:
(10)
where .
Now we write the augmented Lagrangian:
where and are dual variables. Next we divide the variables into two blocks and solve for each in closed form.
3.1.1 First block
Setting gives:
(11)
Collecting the terms containing in Eq. (10) and comparing them to those in Eq. (5), the solution for follows:
(12)
3.1.2 Second block
3.1.3 Dual variables
Compared with Eq. (5), only one new constraint appears, with dual multiplier . Its update is
(15)
Algorithm 3 in the appendix provides the full steps for solving the Bregman divergence learning problem.
4 Difference of Convex (DC) Regression
As a further example, we extend our convex regression solver to difference of convex regression, as studied in Siahkamari et al. (2020). DC functions are the set of functions that can be represented as a difference of two convex functions. They form a very rich class; for instance, on a compact domain they contain all twice continuously differentiable functions. DC regression has been studied in Cui et al. (2018); Siahkamari et al. (2020); Bagirov et al. (2020). We provide a 2-block ADMM solver for the difference of convex estimator proposed in Siahkamari et al. (2020), with the minor change of choosing a different penalty
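As a simple illustration of this richness (a generic observation, not the estimator studied below): any twice-differentiable f whose Hessian is bounded below by $-\rho\mathbf{I}$ on the domain can be written as

```latex
f(\mathbf{x}) \;=\; \underbrace{\Big(f(\mathbf{x})+\tfrac{\rho}{2}\|\mathbf{x}\|_2^2\Big)}_{\text{convex}}\;-\;\underbrace{\tfrac{\rho}{2}\|\mathbf{x}\|_2^2}_{\text{convex}},
```

since adding $\tfrac{\rho}{2}\|\mathbf{x}\|_2^2$ shifts the Hessian of the first term to $\nabla^2 f+\rho\mathbf{I}\succeq 0$.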
In particular, consider a regression dataset with and . Denote the class of difference of convex functions
In order to estimate , we minimize the penalized least squares over this class:
This results in a convex program:
(16)
and the estimator for is given as
We note that the linear constraint sets for and have no variables in common in the linear program (4). Fixing allows us to solve for using Eq. (7), and vice versa. From there, we can decouple the and solutions by solving a system of linear equations,
(17)
Algorithm 4 in the appendix provides the full steps to solve the difference of convex regression problem.
5 Experiments

In this section, we provide experiments on real datasets from the UCI machine learning repository as well as on synthetic datasets. We compare running times between our proposed approach and the baseline interior point method for all three problems (convex regression, DC regression, and Bregman divergence learning). We also compare our DC regression algorithm and our Bregman divergence learning algorithm to state-of-the-art regression and classification methods, and show that our approach is close to the state of the art in terms of accuracy.
For the ADMM-based methods, we use an Nvidia V100 GPU with gigabytes of GPU memory and 4 CPU cores. For all other methods we use a 16-core CPU.
Run time in seconds:

n | d | Baseline (Siahkamari et al., 2020) | This Paper
---|---|---|---
1000 | 2 | 32.3 | 3.63 |
1000 | 4 | 30.8 | 3.75 |
1000 | 8 | 54.5 | 4.03 |
1000 | 16 | 112.3 | 4.12 |
1000 | 32 | 189.3 | 4.37 |
Run time in seconds:

n | d | Baseline (Siahkamari et al., 2020) | This Paper
---|---|---|---
1000 | 2 | 140.9 | 3.67 |
1000 | 4 | 113.9 | 3.79 |
1000 | 8 | 210.3 | 4.05 |
1000 | 16 | 351.6 | 4.15 |
1000 | 32 | 823.2 | 4.42 |
5.1 Timing Results for Convex and DC Regression
To demonstrate the effectiveness of our proposed approach, we first compare on synthetic data the existing interior point method to our 2-block ADMM method on standard convex regression and DC regression. Here, we fixed the number of data points at 1000 and tested data of dimensionality .
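For concreteness, a minimal sketch of one way to produce such synthetic data (the specific generating function used in our experiments is not implied here): draw standard normal predictors and noisy responses from a known convex target such as the squared norm.

```python
import torch

def make_synthetic_convex_data(n=1000, d=8, noise_std=0.1, seed=0):
    """Toy convex-regression data: x ~ N(0, I_d), y = ||x||^2 + noise.
    Any known convex target could be substituted; this is only an
    illustrative choice, not the paper's exact experimental setup."""
    g = torch.Generator().manual_seed(seed)
    X = torch.randn(n, d, generator=g)
    y = (X ** 2).sum(dim=1) + noise_std * torch.randn(n, generator=g)
    return X, y

X, y = make_synthetic_convex_data(n=1000, d=32)
```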
5.2 Results on Real Data
We now experiment with our DC regression algorithm and our Bregman divergence learning algorithm (PBDL) on real datasets, and report accuracy and timing results. For PBDL, we compare against its predecessor (PBDL_0) as well as state-of-the-art classification models. All results are reported either on the pre-specified train/test split or on a 5-fold cross-validation set, following the instructions for each dataset.
For the purpose of comparison, we compare to the previous (DC)-regression and PBDL algorithms. Furthermore, we pick two popular tree-based algorithms, XGBoost and Random Forest. We also consider Lasso as a baseline. According to Fernández-Delgado et al. (2014, 2019), Random Forest achieves the state of the art on most regression and classification tasks. More recently, XGBoost has been shown to achieve the state of the art on many real datasets and Kaggle.com competitions (Chen & Guestrin, 2016). However, we stress that achieving state-of-the-art results is not the purpose of our experiments.
We provide the details of hyper-parameters for each method.
DC-regression: We choose from a grid by 5-fold cross-validation. We then do at most 2 more rounds of grid search around the optimum from the first round. We fix and choose by early stopping, i.e., we stop whenever the validation error improvement is less than after iterations of ADMM.
PBDL: We predict the classes using a nearest-neighbour scheme , where is the learned Bregman divergence (see the sketch after this list). Tuning is similar to DC-regression.
PBDL_0: We use the existing code from the authors for learning a Bregman divergence, which is based on Gurobi solvers. We use their built-in parameter tuner for , which is based on 5-fold cross-validation.
XGBoost Regressor/Classifier: and are found by 5-fold cross-validation over . Other parameters are set to their defaults. We use the xgboost 1.6.0 package (Chen & Guestrin, 2016).
Random Forest Regressor/Classifier: is found by 5-fold cross-validation over . We use the sklearn package (Pedregosa et al., 2011).
Lasso: We use the sklearn package function LassoCV.
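As referenced in the PBDL item above, the following is a minimal sketch of nearest-neighbour prediction under a learned Bregman divergence. The `phi` interface returning values and gradients of the learned convex generating function is hypothetical, and the argument order of the divergence is one common convention, not necessarily the one used in our implementation.

```python
import torch

def bregman_nn_predict(phi, X_train, y_train, X_query):
    """1-nearest-neighbour classification under a learned Bregman divergence
    D_phi(x, x') = phi(x) - phi(x') - <grad phi(x'), x - x'>."""
    phi_q, _ = phi(X_query)             # values at query points, shape (m,)
    phi_t, grad_t = phi(X_train)        # values (n,) and gradients (n, d)
    # D[j, i] = phi(x_q_j) - phi(x_t_i) - <grad phi(x_t_i), x_q_j - x_t_i>
    diffs = X_query.unsqueeze(1) - X_train.unsqueeze(0)          # (m, n, d)
    D = phi_q.unsqueeze(1) - phi_t.unsqueeze(0) \
        - (grad_t.unsqueeze(0) * diffs).sum(dim=-1)              # (m, n)
    return y_train[D.argmin(dim=1)]     # label of the closest training point
```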
5.2.1 Datasets
We chose all regression datasets from UCI that have number of instances . These amounted to 30 datasets at the time of submission of this paper. We discard of these datasets for various reasons, such as the dataset not being available. Some of these datasets have multiple target variables; we therefore report the out-of-sample results with a suffix for each target variable.
For classification, we chose all datasets originally used in Siahkamari et al. (2019). Further, we include the Abalone dataset, which was too large for the previous PBDL algorithm. We use these datasets as multi-class classification problems. We report accuracy as well as the training run time.
5.2.2 Results
Regression: For our regression experiments, we present the out-of-sample on datasets in Figure 1. We observe that our method has performance similar to that of XGBoost and Random Forest despite having parameters. On the Solar Flare and Parkinson datasets, other methods overfit, whereas our algorithm avoids overfitting and is more robust. The maximum number of instances we were able to experiment on in reasonable time is about , which is x larger than what Siahkamari et al. (2020) experimented on. The detailed results are in the Appendix.
Classification: For our classification experiments, we compare the logarithm of the run times of all methods in Figure 2. In terms of speed, we are on average x faster than the original PBDL_0 algorithm. Moreover, PBDL_0 fails to handle the Abalone dataset with , whereas the new PBDL takes less than a minute to finish one fit. We are slower than XGBoost and Random Forest; however, we rely only on a Python script rather than optimized compiled code. We note that the divergence function learned by PBDL can further be used for other tasks such as ranking and clustering. We present the accuracy in Figure 3 in the Appendix. We observe that our Bregman divergence learning algorithm (PBDL), as well as the original (PBDL_0), has performance similar to that of XGBoost and Random Forest.

6 Conclusion
In this paper, we studied the nonparametric convex regression problem with a regularization penalty. We provided a solver for this problem and proved its iteration complexity to be . The total computational complexity is , which improves the rate of already known for this problem. We also extended our solver to the problem of difference of convex (DC) regression and the problem of learning an arbitrary Bregman divergence. We provided comparisons to state-of-the-art regression and classification models.
Acknowledgements
This research was supported by the Secretary of Defense for Research and Engineering under Air Force Contract No. FA8702-15-D-0001, Army Research Office Grant W911NF2110246, the National Science Foundation grants CCF-2007350 and CCF-1955981, and the Hariri Data Science Faculty Fellowship Grants, and a gift from the ARM corporation.
References
- Afriat (1967) Afriat, S. N. The construction of utility functions from expenditure data. International economic review, 8(1):67–77, 1967.
- Amos et al. (2017) Amos, B., Xu, L., and Kolter, J. Z. Input convex neural networks. In International Conference on Machine Learning, pp. 146–155. PMLR, 2017.
- Bagirov et al. (2020) Bagirov, A. M., Taheri, S., Karmitsa, N., Sultanova, N., and Asadi, S. Robust piecewise linear L1-regression via nonsmooth DC optimization. Optimization Methods and Software, pp. 1–21, 2020.
- Balázs (2016) Balázs, G. Convex regression: theory, practice, and applications. PhD thesis, University of Alberta, 2016.
- Bertsimas & Mundru (2021) Bertsimas, D. and Mundru, N. Sparse convex regression. INFORMS Journal on Computing, 33(1):262–279, 2021.
- Boyd et al. (2004) Boyd, S., Boyd, S. P., and Vandenberghe, L. Convex optimization. Cambridge university press, 2004.
- Chen & Guestrin (2016) Chen, T. and Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794, 2016.
- Chen & Mazumder (2020) Chen, W. and Mazumder, R. Multivariate convex regression at scale. arXiv preprint arXiv:2005.11588, 2020.
- Cui et al. (2018) Cui, Y., Pang, J.-S., and Sen, B. Composite difference-max programs for modern statistical estimation problems. SIAM Journal on Optimization, 28(4):3344–3374, 2018.
- Eckstein & Yao (2012) Eckstein, J. and Yao, W. Augmented lagrangian and alternating direction methods for convex optimization: A tutorial and some illustrative computational results. RUTCOR Research Reports, 32(3), 2012.
- Fernández-Delgado et al. (2014) Fernández-Delgado, M., Cernadas, E., Barro, S., and Amorim, D. Do we need hundreds of classifiers to solve real world classification problems? The journal of machine learning research, 15(1):3133–3181, 2014.
- Fernández-Delgado et al. (2019) Fernández-Delgado, M., Sirsat, M. S., Cernadas, E., Alawadi, S., Barro, S., and Febrero-Bande, M. An extensive experimental survey of regression methods. Neural Networks, 111:11–34, 2019.
- Gabay & Mercier (1976) Gabay, D. and Mercier, B. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & mathematics with applications, 2(1):17–40, 1976.
- Ghosh et al. (2019) Ghosh, A., Pananjady, A., Guntuboyina, A., and Ramchandran, K. Max-affine regression: Provable, tractable, and near-optimal statistical estimation. arXiv preprint arXiv:1906.09255, 2019.
- He & Yuan (2012) He, B. and Yuan, X. On the O(1/n) convergence rate of the Douglas–Rachford alternating direction method. SIAM Journal on Numerical Analysis, 50(2):700–709, 2012.
- Kim et al. (2021) Kim, S., Bahmani, S., and Lee, K. Max-linear regression by scalable and guaranteed convex programming. arXiv preprint arXiv:2103.07020, 2021.
- Mazumder et al. (2019) Mazumder, R., Choudhury, A., Iyengar, G., and Sen, B. A computational framework for multivariate convex regression and its variants. Journal of the American Statistical Association, 114(525):318–331, 2019.
- Nagurney (1998) Nagurney, A. Network economics: A variational inequality approach, volume 10. Springer Science & Business Media, 1998.
- Paszke et al. (2019) Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32:8026–8037, 2019.
- Pedregosa et al. (2011) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
- Siahkamari et al. (2019) Siahkamari, A., Saligrama, V., Castanon, D., and Kulis, B. Learning bregman divergences. arXiv e-prints, pp. arXiv–1905, 2019.
- Siahkamari et al. (2020) Siahkamari, A., Gangrade, A., Kulis, B., and Saligrama, V. Piecewise linear regression via a difference of convex functions. In International Conference on Machine Learning, pp. 8895–8904. PMLR, 2020.
- Varian (1982) Varian, H. R. The nonparametric approach to demand analysis. Econometrica: Journal of the Econometric Society, pp. 945–973, 1982.
- Zhang et al. (2020) Zhang, X., Li, Y., Zhang, Z., and Zhang, Z.-L. f-GAIL: Learning f-divergence for generative adversarial imitation learning. arXiv preprint arXiv:2010.01207, 2020.
Appendix to ‘Faster Convex Lipschitz Regression via 2 blocks ADMM’
Appendix A Algorithms
Appendix B Derivations
B.1 ADMM Convergence
Theorem 1 gives the convergence of the ADMM procedure. We follow an analysis similar to He & Yuan (2012) and give all steps for the sake of completeness. We first restate the result as follows:
Theorem 1.
Consider the separable convex optimization problem:
where (, ) are the block variables, (, ) are convex sets that include the zero vector, (, ) are the coefficient matrices and is a constant vector. Let and be the solutions at iteration of a two-block ADMM procedure with learning rate , i.e.:
where .
Denote the average of iterates as . For all we have
where are the optimal solutions.
Before presenting the proof, we define a dual-variable-like sequence with the update relation
The only difference between and is due to and . We have the following relation:
(18)
Lemma 1.
[from Nagurney (1998)] Let where is continuously differentiable and is closed and convex. Then satisfies,
Applying Lemma 1 to optimization problems of ADMM gives,
(19)
(20)
Theorem 1 is a direct conclusion of the following Lemma.
Lemma 2.
For all we have,
where and .
Let be . By definition we have which cancels an inner product. Rearranging and averaging over time gives
The RHS telescopes. Since , we have . We lower bound the LHS with Jensen's inequality as . Combining the LHS and RHS we get
which is the statement in Theorem 1, assuming the initial variables are zero.
We now prove Lemma 2. We start with the definition of convexity for both and :
Summing these relations and plugging in Eq. (21) gives,
(22)
The update rule of gives . Then for all , we have . Adding such a term to does not change its value. After adding the zero inner product, we rearrange as
Plugging Eq. 18 in gives .
We use the following norm-square relation, which we prove at the end:
Lemma 3.
For a symmetric we have,
Using Lemma 3, we get , since and are symmetric matrices.
We combine and as
Plugging it back to ,
(23)
Let for . We know that achieves its minimum at . Since is convex and closed, we have,
Proof of Lemma 3.
Expand RHS as
We reach the LHS, which completes the proof.
B.2 Proof for Theorem 2 (computational complexity)
Theorem 2.
Let be the output of Algorithm 2 at the iteration, and . Denote . Assume and . If we choose , for and we have:
For the proof, we make extensive use of Theorem 1, which provides the convergence rate of a general 2-block ADMM for a separable convex program with linear constraints:
(24)
It guarantees that for all , the average solutions of a 2-block ADMM satisfy
(25)
where are the optimal solutions of program (24).
To match the notation between (26) and (24), denote the optimal/average ADMM solutions to program (26) as
(27)
Therefore the separable losses for the average iterates of ADMM for program (26) are
(28)
Note that a convergence rate for is available by setting in (25). However, this is not sufficient for convergence of our objective . The reason is that might violate the constraints in (26). This could result in and . Therefore, the main step of the proof of Theorem 2 is to characterize and bound the effect of such constraint violations on our objective function. Let us specify how much each linear constraint in program (26) is violated,
(29)
We break the proof into smaller parts by first providing some intermediate lemmas.
Lemma 4.
.
Proof.
Note that is a feasible solution and is the optimal solution. Also, Algorithm 2 normalizes the dataset such that ; therefore
∎
Lemma 5.
.
Proof.
Since is a feasible solution we have . We also have:
(by the constraint definitions)
(by convexity)
Lemma 6.
.
Proof.
(Definition)
( definition)
()
∎
Lemma 7.
Proof.
(Lemma 6)
On the other hand for regularization term we have:
Lemma 8.
.
Proof.
(definition)
( definition)
( definition)
∎
Next, we combine the previous lemmas to prove Theorem 2. Before proceeding, let us incorporate the definitions of the constraint violations in Eq. (29) into to get:
(30)
where
Appendix C Experimental results
This section provides the regression and classification experimental results in tabular format. Table 3 compares the regression performance of our method against baseline methods on benchmark datasets. Table 4 compares the classification accuracy of our method against baseline methods. We present the run time of our method on selected UCI datasets in Table 5. Here, we omit the baseline method because it did not run to completion on most of the datasets. Hyperparameters were fixed for all timing results.
dataset | n | d | dc regression | lasso | random forest | xgboost
---|---|---|---|---|---|---|
Parkinson Speech Dataset | 702 | 52 | -3.7 | -4 | -14.8 | -20 |
Garment Productivity | 905 | 37 | 25.5 | 15.2 | 45.9 | 44.2 |
Concrete Compressive Strength | 1030 | 8 | 91.7 | 59.9 | 91.8 | 92.5 |
Geographical Original of Music-1 | 1059 | 68 | 21.9 | 16.6 | 25.1 | 26 |
Geographical Original of Music-2 | 1059 | 68 | 32.6 | 23.8 | 31.3 | 30.8 |
Solar Flare-1 | 1066 | 23 | 3.3 | 6.1 | -28 | 6.2 |
Solar Flare-2 | 1066 | 23 | -6.1 | -2.1 | -17 | 2.1 |
Airfoil Self-Noise | 1503 | 5 | 95.2 | 51.1 | 93.3 | 95 |
Communities and Crime | 1994 | 122 | 62.1 | 64.5 | 65.4 | 65.8 |
SML2010 | 3000 | 24 | 90.8 | 96 | 93.4 | 92.1 |
SML2010 | 3000 | 24 | 77.3 | 94 | 92.4 | 90 |
Parkinson’s Telemonitoring | 4406 | 25 | 93.8 | 98.4 | 92.3 | 98.2 |
Parkinson’s Telemonitoring | 4406 | 25 | 95.5 | 98.5 | 92 | 98 |
Wine Quality | 4898 | 11 | 51.8 | 27.2 | 52.6 | 51 |
Bias Correction of Temperature Forecast-1 | 6200 | 52 | 62.9 | 60.8 | 64.2 | 65.8 |
Bias Correction of Temperature Forecast-2 | 6200 | 52 | 75.2 | 76.5 | 76.3 | 76.7 |
Seoul Bike Sharing Demand | 6570 | 19 | 89.7 | 80.7 | 93.1 | 92.3 |
Air Quality-1 | 7110 | 21 | 87 | 89.5 | 88.2 | 89.9 |
Air Quality-2 | 7110 | 21 | 100 | 94.7 | 99.8 | 100 |
Air Quality-3 | 7110 | 21 | 86.6 | 86.4 | 87.7 | 88.5 |
Air Quality-4 | 7110 | 21 | 81.7 | 84 | 78.1 | 80.7 |
Combined Cycle Power Plant | 9568 | 4 | 95.5 | 92.9 | 96.5 | 96.7 |
dataset | n | d | pbdl accuracy (%) | pbdl0 accuracy (%) | rand forest accuracy (%) | xgboost accuracy (%) | pbdl run time (sec) | pbdl0 run time (sec)
---|---|---|---|---|---|---|---|---
Iris | 149 | 4 | 96.0 | 98.7 | 96 | 96.6 | 22.1 | 46.2 |
Wine | 178 | 13 | 96.0 | 96.6 | 96.7 | 95.5 | 23.6 | 496.2 |
Blood Transfusion | 748 | 4 | 72.6 | 74.8 | 72.9 | 78.4 | 46.3 | 21514.0 |
Breast Cancer Wisconsin | 569 | 30 | 94.4 | 96.5 | 96.1 | 96.0 | 114.0 | 224407 |
Balance Scale | 625 | 4 | 84.6 | 93.0 | 83 | 92.2 | 150.3 | 617.0 |
Abalone | 4177 | 10 | 22.6 | N/A | 24.9 | 26.2 | 3688.0 | N/A |
dataset | n | d | This Paper (seconds)
---|---|---|---
Parkinson Speech Dataset | 702 | 52 | 3.4 |
Garment Productivity | 905 | 37 | 4.12 |
Concrete Compressive Strength | 1030 | 8 | 3.28 |
Geographical Original of Music | 1059 | 68 | 4.44 |
Solar Flare | 1066 | 23 | 3.64 |
Airfoil Self-Noise | 1503 | 5 | 4.92 |
Communities and Crime | 1994 | 122 | 12.82 |
SML2010 | 3000 | 24 | 21.78 |
