
Faster Algorithms for Learning Convex Functions

Ali Siahkamari    Durmus Alp Emre Acar    Christopher Liao    Kelly Geyer    Venkatesh Saligrama    Brian Kulis
Abstract

The task of approximating an arbitrary convex function arises in several learning problems such as convex regression, learning with a difference of convex (DC) functions, and learning Bregman or $f$-divergences. In this paper, we develop and analyze an approach for solving a broad range of convex function learning problems that is faster than state-of-the-art approaches. Our approach is based on a 2-block ADMM method where each block can be computed in closed form. For the task of convex Lipschitz regression, we establish that our proposed algorithm converges with iteration complexity of $O(n\sqrt{d}/\epsilon)$ for a dataset $\bm{X}\in\mathbb{R}^{n\times d}$ and $\epsilon>0$. Combined with per-iteration computation complexity, our method converges with the rate $O(n^{3}d^{1.5}/\epsilon+n^{2}d^{2.5}/\epsilon+nd^{3}/\epsilon)$. This new rate improves on the state-of-the-art rate of $O(n^{5}d^{2}/\epsilon)$ whenever $d=o(n^{4})$. Further, we provide similar solvers for DC regression and Bregman divergence learning. Unlike previous approaches, our method is amenable to the use of GPUs. We demonstrate on regression and metric learning experiments that our approach is over 100 times faster than existing approaches on some data sets, and produces results that are comparable to the state of the art.


1 Introduction

The importance of convex functions in machine learning is undisputed. However, while most applications of convexity in machine learning involve fixed convex functions, an emerging trend in the field has focused on the problem of learning convex functions for a particular task. In this setting, data or supervision is used to tailor a convex function for some end goal. Recent machine learning examples include learning with a difference of convex (DC) functions (Siahkamari et al., 2020), learning input convex neural networks for tasks in reinforcement learning or vision (Amos et al., 2017), and learning divergences between points or distributions via learning the underlying convex function in a Bregman divergence (Siahkamari et al., 2019) or $f$-divergence (Zhang et al., 2020) for problems such as data generation, metric learning, and others.

In fact, work in convex function learning dates back at least to methods developed for the problem of convex regression, an important learning problem that is used commonly in econometrics and engineering for modeling demand, utility and production curves (Afriat, 1967; Varian, 1982). Convex regression is the problem of estimating a convex function from a dataset $\{(\bm{x}_{i},y_{i})\}_{i=1}^{n}$, where $\bm{x}_{i}\in\mathbb{R}^{d}$ are predictors and $y_{i}$ are continuous responses. Consider

\mathcal{F}\triangleq\{f:\mathbb{R}^{d}\to\mathbb{R}\ |\ f\text{ is convex}\}, \qquad (1)

the class of all convex functions over $\mathbb{R}^{d}$. An estimator is then obtained by minimizing the squared error between the responses $y_{i}$ and the fitted values $f(\bm{x}_{i})$,

\hat{f}\triangleq\arg\min_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}(y_{i}-f(\bm{x}_{i}))^{2}+\lambda\|f\|, \qquad (2)

where $\|f\|$ is some penalty term. A key insight about this problem is that although Eq. (2) is an infinite-dimensional minimization problem, Boyd et al. (2004) show it can be reduced to a finite-dimensional convex optimization problem, since the optimal solution can be shown to be piecewise linear. Furthermore, generalization bounds for this problem have recently been developed in (Balázs, 2016; Siahkamari et al., 2020) using certain classes of penalty $\|f\|$.

Many other convex function learning problems have a similar structure, and also yield optimization problems that can be tractably solved. However, a key downside is that the computational complexity is often prohibitive for real-world applications, particularly due to the non-parametric nature of the resulting optimization problems. For instance, the state-of-the-art solver for convex regression is based on interior point methods and has a complexity of $O(n^{5}d^{2}/\epsilon)$ for a given accuracy $\epsilon$.

Thus, our goal in this paper is to study a class of methods that yields provably faster convergence for a number of different convex function learning problems. Our main insight is that many convex function learning problems may be expressed as optimization problems that can be solved with a 2-block ADMM algorithm such that each of the two blocks can be computed in closed form. For convex regression, the resulting method (which we call FCR, for Faster Convex Regression) is guaranteed to converge with computational complexity of $O(n^{3}d^{1.5}/\epsilon+n^{2}d^{2.5}/\epsilon+nd^{3}/\epsilon)$ for a dataset $\bm{X}\in\mathbb{R}^{n\times d}$ and $\epsilon>0$. This new rate improves on the state-of-the-art rate of $O(n^{5}d^{2}/\epsilon)$ in (Balázs, 2016), available via interior point methods, whenever $d=o(n^{4})$. In addition to an improved convergence rate, FCR offers several further advantages. Firstly, FCR is stand-alone and does not need any external optimization software. FCR is based on tensor computations and can easily be implemented on GPUs. Furthermore, FCR can include regularization terms in the context of non-parametric convex regression.

We extend the FCR solver to problems beyond convex regression. In particular, we provide 2-block ADMM solvers and present empirical comparisons for DC regression (Siahkamari et al., 2020) and Bregman divergence learning (Siahkamari et al., 2019). Finally, we demonstrate empirical results showing that our approach yields significantly more scalable convex learning methods, resulting in speedups of over 100 times on some problems.

Notation. We generally denote scalars as lower case letters, vectors as lower case bold letters, and matrices as upper case bold letters. For a dataset $\bm{X}\in\mathbb{R}^{n\times d}$, $n$ represents the number of observations and $d$ is the number of covariates. We define $x^{+}=\max\{x,0\}$, $x^{-}=\max\{-x,0\}$, and $\times$ denotes the outer product. We define $[n]$ to be the set of integers $\{1,\dots,n\}$. By $\|\nabla^{*}f(\bm{x})\|$ and $|\partial^{*}_{x_{l}}f(\bm{x})|$ we denote the largest subgradient and the largest partial sub-derivative, respectively.

1.1 Connections to Existing Methodologies

Convex regression has been extensively studied over the last two decades. Boyd et al. (2004) introduce the non-parametric convex regression problem of Eq. (2). Balázs (2016) shows that convex Lipschitz regression requires regularization in order to have generalization guarantees. In particular, it was shown that the penalty term $\|f\|=\sup_{\bm{x}}\|\nabla^{*}f(\bm{x})\|$ yields a generalization error of $O(n^{-2/d}\log n)$.

Balázs (2016) and Mazumder et al. (2019) provide ADMM solvers for the convex regression problem; however, their formulations have more than two blocks and are not guaranteed to converge in all circumstances. More recently, Chen & Mazumder (2020) provide an active-set type solver and Bertsimas & Mundru (2021) provide a delayed constraint generation algorithm, both of which have favorable scalability. They use a penalty of $\|f\|=\sum_{i}\|\nabla^{*}f(\bm{x}_{i})\|_{2}^{2}$, which makes the loss function in Eq. (2) strongly convex and easier to solve. However, no theoretical generalization bound is known when using this new penalty term. Siahkamari et al. (2020) use $\|f\|=\sup_{\bm{x}}\|\nabla^{*}f(\bm{x})\|_{1}$ for learning a difference of convex functions, and provide a multi-block ADMM solver for this problem.

On the other hand, Ghosh et al. (2019) and Kim et al. (2021) study the parametric class of max-linear functions $\mathcal{F}_{k}=\{f:\mathbb{R}^{d}\to\mathbb{R}\ |\ f(\bm{x})=\max_{j=1}^{k}\langle\bm{a}_{j},\bm{x}\rangle+b_{j}\}$ for $k<n$. They provide convex programming and alternating algorithms. The downside of these methods is the assumption that the $\bm{x}_{i}$ are i.i.d. samples from a normal distribution and, furthermore, that each feature $x_{i,l}$ is independent. These assumptions are needed for their algorithms to converge in theory. We note that this parametric class $\mathcal{F}_{k}$, for $k\geq n$, coincides with $\mathcal{F}$ when used in the convex regression problem of Eq. (2).

Our methodology is most similar to Balázs (2016) and is set in the context of learning theory, where we do not require any distributional assumption on the $\bm{x}_{i}$ and only require them to be i.i.d. and bounded. We further need to know a bound on $f$ but not on $\|f\|$.

2 Convex Regression with Lasso Penalty

Suppose we are given a dataset $\{(y_{i},\bm{x}_{i})\}_{i=1}^{n}$, where the predictors $\bm{x}_{i}\in\mathbb{R}^{d}$ are drawn i.i.d. from a distribution $P_{X}$ supported on a compact domain $\Omega\subset\{\|\bm{x}\|_{\infty}\leq R\}$, with $n\geq d$, and the responses satisfy $y_{i}=f(\bm{x}_{i})+\varepsilon_{i}$ for centered, independent random noise $\varepsilon_{i}$. Assume $|\varepsilon_{i}|$ and $|f(\cdot)|$ are both bounded by $M$. Furthermore, $f$ is convex and Lipschitz. We propose to solve the penalized convex regression problem in Eq. (2) with the penalty term $\|f\|\triangleq\sum_{l=1}^{d}\sup_{\bm{x}}|\partial_{x_{l}}^{*}f(\bm{x})|$. This results in the following convex optimization problem:

\min_{\hat{y}_{i},\bm{a}_{i}}\ \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_{i}-y_{i})^{2}+\lambda\sum_{l=1}^{d}\max_{i=1}^{n}|a_{i,l}| \qquad (3)
\textrm{s.t. }\ \hat{y}_{i}-\hat{y}_{j}-\langle\bm{a}_{i},\bm{x}_{i}-\bm{x}_{j}\rangle\leq 0,\quad i,j\in[n]\times[n].

Then, we estimate $f(\bm{x})$ via

\hat{f}(\bm{x})\triangleq\max_{i}\big(\langle\bm{a}_{i},\bm{x}-\bm{x}_{i}\rangle+\hat{y}_{i}\big). \qquad (4)

This is a model similar to Balázs (2016), with the minor modification of using a different penalty term in the loss function. Our penalty term depends on the sum of partial derivatives $\sum_{l=1}^{d}\sup_{\bm{x}}|\partial_{x_{l}}f(\bm{x})|$ rather than $\sup_{\bm{x}}\|\nabla f(\bm{x})\|^{2}_{2}$. This new penalty term acts like an $L_1$ regularizer and encourages feature sparsity. It is easy to show that the estimator is bounded, i.e., $\sup_{\bm{x}\in\Omega}|\hat{f}(\bm{x})|\leq M+4\|\hat{f}\|R$. Also, $\|f+g\|\leq\|f\|+\|g\|$ and $\|fc\|=c\|f\|$ for $c\geq 0$. Hence, similar to Siahkamari et al. (2020), our penalty term is a valid seminorm, which allows us to apply their theorem here.
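To make the optimization problem concrete, the following is a minimal reference formulation of problem (3) and the estimator (4), written with the cvxpy modeling package purely for illustration. It is not the paper's FCR/ADMM solver, the function names are ours, and a generic solver of this form scales poorly because of the $n^{2}$ constraints.

```python
import cvxpy as cp
import numpy as np

def convex_regression_reference(X, y, lam=0.1):
    # Problem (3): least squares with the Lasso-style penalty sum_l max_i |a_{i,l}|,
    # subject to the n^2 first-order convexity constraints.
    n, d = X.shape
    y_hat = cp.Variable(n)
    A = cp.Variable((n, d))                      # row i is the subgradient a_i at x_i
    penalty = cp.sum(cp.max(cp.abs(A), axis=0))  # sum_l max_i |a_{i,l}|
    objective = cp.Minimize(cp.sum_squares(y_hat - y) / n + lam * penalty)
    constraints = [y_hat[i] - y_hat[j] - A[i] @ (X[i] - X[j]) <= 0
                   for i in range(n) for j in range(n) if i != j]
    cp.Problem(objective, constraints).solve()
    return y_hat.value, A.value

def predict(x, X, y_hat, A):
    # Estimator (4): the max-affine function f_hat(x) = max_i <a_i, x - x_i> + y_hat_i
    return np.max(np.sum(A * (x - X), axis=1) + y_hat)
```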

Proposition 1.

With an appropriate choice of $\lambda$ (which requires knowledge of the bound $M$ on $f$) and $n\geq d$, it holds with probability at least $1-\delta$ over the data that the estimator $\hat{f}$ of Eq. (4) has excess risk upper bounded by

\mathbb{E}[|f(\bm{x})-\hat{f}(\bm{x})|^{2}]\leq O\bigg(\Big(\frac{n}{d}\Big)^{\frac{-2}{d+4}}\log\Big(\frac{n}{d}\Big)+\sqrt{\frac{\log(1/\delta)}{n}}\bigg).

2.1 Optimization

Our method utilizes the well-known ADMM algorithm (Gabay & Mercier, 1976). ADMM is a standard tool for solving convex problems that consist of blocks of variables coupled by linear constraints (Eckstein & Yao, 2012). It iteratively updates the problem variables and comes with provable convergence guarantees (He & Yuan, 2012).

We solve program (3) using ADMM; the main insight is that we can form a 2-block ADMM formulation in which each block can be computed in closed form. We first consider an equivalent form of the optimization problem:

\min_{\substack{\hat{y}_{i},\bm{a}_{i},L_{l}\geq 0,\\ \bm{p}_{i}^{+}\geq 0,\bm{p}_{i}^{-}\geq 0,\bm{u}_{i}\geq 0,s_{i,j}\geq 0}}\ \sum_{i=1}^{n}(\hat{y}_{i}-y_{i})^{2}+\lambda\sum_{l=1}^{d}L_{l} \qquad (5)
\textrm{s.t.}\ \begin{cases}s_{i,j}+\hat{y}_{i}-\hat{y}_{j}-\langle\bm{a}_{i},\bm{x}_{i}-\bm{x}_{j}\rangle=0 & i,j\in[n]\times[n]\\ u_{i,l}+p_{i,l}^{+}+p_{i,l}^{-}-L_{l}=0 & i,l\in[n]\times[d]\\ a_{i,l}-p_{i,l}^{+}+p_{i,l}^{-}=0 & i,l\in[n]\times[d],\end{cases}

with the augmented Lagrangian

\ell(\hat{y}_{i},\bm{a}_{i},L_{l},\bm{p}_{i}^{+},\bm{p}_{i}^{-},\bm{u}_{i},s_{i,j})=\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_{i}-y_{i})^{2}+\lambda\sum_{l=1}^{d}L_{l}
+\sum_{i}\sum_{j}\frac{\rho}{2}(s_{i,j}+\hat{y}_{i}-\hat{y}_{j}-\langle\bm{a}_{i},\bm{x}_{i}-\bm{x}_{j}\rangle+\alpha_{i,j})^{2}
+\sum_{i}\sum_{l}\frac{\rho}{2}(u_{i,l}+p_{i,l}^{+}+p_{i,l}^{-}-L_{l}+\gamma_{i,l})^{2}
+\sum_{i}\sum_{l}\frac{\rho}{2}(a_{i,l}-p_{i,l}^{+}+p_{i,l}^{-}+\eta_{i,l})^{2},

where $\alpha_{i,j}$, $\gamma_{i,l}$ and $\eta_{i,l}$ are dual variables. We divide the parameters into two blocks as ${\bm{b}}^{1}=\{\hat{y}_{i},\bm{a}_{i}\}$ and ${\bm{b}}^{2}=\{L_{l},\bm{p}_{i}^{+},\bm{p}_{i}^{-},\bm{u}_{i},s_{i,j}\}$, where $i,j\in[n]\times[n]$ and $l\in[d]$. We find closed-form solutions for each block given the solution to the other block.

2.1.1 First block ${\bm{b}}^{1}=\{\hat{y}_{i},\bm{a}_{i}\}$

We first note that we always normalize the dataset such that $\sum_{i}\bm{x}_{i}=\bm{0}$ and $\sum_{i}y_{i}=0$. This will result in $\sum_{i}\hat{y}_{i}=0$ and simplify the solutions.

By setting $\nabla_{\bm{a}_{i}}\ell=\bm{0}$ we can solve for $\bm{a}_{i}$ as:

\bm{a}_{i}=\bm{\Lambda}_{i}\Big(\bm{\theta}_{i}+\hat{y}_{i}\bm{x}_{i}+\frac{1}{n}\sum_{k}\hat{y}_{k}\bm{x}_{k}\Big), \qquad (6)

where

\bm{\Lambda}_{i}\triangleq\Big(\bm{x}_{i}\bm{x}_{i}^{T}+\frac{1}{n}I+\frac{1}{n}\sum_{j}\bm{x}_{j}\bm{x}_{j}^{T}\Big)^{-1},
\bm{\theta}_{i}\triangleq\frac{1}{n}\bigg(\bm{p}_{i}^{+}-\bm{p}_{i}^{-}-\bm{\eta}_{i}+\sum_{j}(\alpha_{i,j}+s_{i,j})(\bm{x}_{i}-\bm{x}_{j})\bigg).

Similarly, by setting $\partial_{\hat{y}_{i}}\ell=0$ and substituting Eq. (6) for $\bm{a}_{i}$, we can solve for $\hat{y}_{1},\dots,\hat{y}_{n}$ simultaneously as a system of linear equations,

\hat{\bm{y}}=\bm{\Omega}^{-1}\bigg(\frac{2\bm{y}}{n^{2}\rho}+\bm{v}-\bm{\beta}\bigg), \qquad (7)

where $\bm{y}=[y_{1},\dots,y_{n}]^{T}$, $\hat{\bm{y}}=[\hat{y}_{1},\dots,\hat{y}_{n}]^{T}$, and

\beta_{i}\triangleq\frac{1}{n}\sum_{j}\big(\alpha_{i,j}-\alpha_{j,i}+s_{i,j}-s_{j,i}\big),
v_{i}\triangleq\bm{x}_{i}^{T}\bm{\Lambda}_{i}\bm{\theta}_{i}+\bm{x}_{i}^{T}\frac{1}{n}\sum_{j}\bm{\Lambda}_{j}\bm{\theta}_{j}-\frac{1}{n}\sum_{j}\bm{x}_{j}^{T}\bm{\Lambda}_{j}\bm{\theta}_{j},
\Omega_{i,j}\triangleq\bigg(\frac{2}{n^{2}\rho}+2-\bm{x}_{i}^{T}\bm{\Lambda}_{i}\bm{x}_{i}\bigg)\mathbbm{1}(i=j)-\frac{1}{n}D_{i,j},
D_{i,j}\triangleq\bm{x}_{i}^{T}\bigg(\bm{\Lambda}_{i}+\bm{\Lambda}_{j}+\frac{1}{n}\sum_{k}\bm{\Lambda}_{k}\bigg)\bm{x}_{j}-\bm{x}_{j}^{T}\bm{\Lambda}_{j}\bm{x}_{j}-\frac{1}{n}\sum_{k}\bm{x}_{k}^{T}\bm{\Lambda}_{k}\bm{x}_{j}.

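As an illustration of how these closed-form updates map onto batched tensor operations, the following sketch forms all $\bm{\Lambda}_{i}$ of Eq. (6) at once and applies the $\bm{a}_{i}$ update. It assumes PyTorch, a centered data matrix as in Sec. 2.1.1, and variable names of our own choosing rather than the released implementation.

```python
import torch

def precompute_Lambda(X):
    # Lambda_i = (x_i x_i^T + I/n + (1/n) sum_j x_j x_j^T)^{-1}, batched over i.
    # This is the one-time preprocessing behind Eqs. (6)-(7), costing O(n d^3).
    n, d = X.shape
    outer = X.unsqueeze(2) * X.unsqueeze(1)          # (n, d, d): outer[i] = x_i x_i^T
    shared = torch.eye(d) / n + outer.mean(dim=0)    # I/n + (1/n) sum_j x_j x_j^T
    return torch.linalg.inv(outer + shared)          # (n, d, d)

def update_a(Lambda, theta, y_hat, X):
    # Eq. (6): a_i = Lambda_i (theta_i + y_hat_i x_i + (1/n) sum_k y_hat_k x_k)
    weighted = y_hat[:, None] * X                    # (n, d): rows y_hat_i x_i
    rhs = theta + weighted + weighted.mean(dim=0, keepdim=True)
    return torch.einsum('nij,nj->ni', Lambda, rhs)   # (n, d): one a_i per row
```
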
2.1.2 Second block ${\bm{b}}^{2}=\{L_{l},\bm{p}_{i}^{+},\bm{p}_{i}^{-},\bm{u}_{i},s_{i,j}\}$

Set $\partial_{s}\ell=0$ for $s\in\{s_{i,j},\bm{p}_{i}^{+},\bm{p}_{i}^{-},\bm{u}_{i}\}$. Hence

s_{i,j}=(-\alpha_{i,j}-\hat{y}_{i}+\hat{y}_{j}+\langle\bm{a}_{i},\bm{x}_{i}-\bm{x}_{j}\rangle)^{+}, \qquad (8)
u_{i,l}=(L_{l}-\gamma_{i,l}-|\eta_{i,l}+a_{i,l}|)^{+},
p_{i,l}^{+}=\frac{1}{2}(L_{l}-\gamma_{i,l}-u_{i,l}+\eta_{i,l}+a_{i,l})^{+},
p_{i,l}^{-}=\frac{1}{2}(L_{l}-\gamma_{i,l}-u_{i,l}-\eta_{i,l}-a_{i,l})^{+}.

Lastly, set $\partial_{L_{l}}\ell=0$ and plug in the solutions for $u_{i,l}$, $p_{i,l}^{+}$ and $p_{i,l}^{-}$. Denoting $c_{i,l}\triangleq|\eta_{i,l}+a_{i,l}|$, after rearranging terms we have:

\frac{\lambda}{\rho}=\sum_{i}\begin{cases}0 & \mbox{if }L_{l}-\gamma_{i,l}\geq c_{i,l}\\ \frac{1}{2}(\gamma_{i,l}+c_{i,l}-L_{l}) & \mbox{if }|L_{l}-\gamma_{i,l}|\leq c_{i,l}\\ \gamma_{i,l}-L_{l} & \mbox{if }L_{l}-\gamma_{i,l}\leq-c_{i,l}.\end{cases}

Note that the right-hand side is a monotone, piecewise-linear function of $L_{l}$. It is therefore easy to solve for $L_{l}$ using a sort followed by a simple scan, which takes $O(n\log n)$ flops; see L_update in Algorithm 1. Observe that it is also possible to add monotonicity constraints on $\hat{f}$ by projecting either $p^{+}_{i,l}$ or $p^{-}_{i,l}$ to zero.
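For concreteness, a minimal Python sketch of this sort-and-scan root finding (mirroring Algorithm 1) follows; it assumes NumPy, takes the values $\gamma_{\cdot,l}$ and $c_{\cdot,l}$ for a single coordinate $l$, and uses variable names of our own.

```python
import numpy as np

def L_update(gamma, c, lam_over_rho):
    # Solve lambda/rho = sum_i h_i(L) for L >= 0, where each h_i is the
    # non-increasing piecewise-linear function of Sec. 2.1.2 with knots
    # gamma_i - c_i and gamma_i + c_i.
    knots = np.sort(np.concatenate([gamma + c, gamma - c]))[::-1]  # descending
    f = lam_over_rho   # residual lambda/rho - RHS at the current knot
    fp = 0.0           # slope of the residual between consecutive knots
    for j in range(1, len(knots)):
        fp += 0.5
        f += fp * (knots[j] - knots[j - 1])     # knots decrease, so f decreases
        if f <= 0:
            return max(knots[j] - f / fp, 0.0)  # interpolate back to the root
    # no sign change: fall back as in Algorithm 1 (there f/n is used; fp ~ n)
    return max(knots[-1] - f / fp, 0.0)
```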

2.1.3 Dual variables

The update for dual variables follows from the standard ADMM algorithm updates:

\alpha_{i,j}=\alpha_{i,j}+s_{i,j}+\hat{y}_{i}-\hat{y}_{j}-\langle\bm{a}_{i},\bm{x}_{i}-\bm{x}_{j}\rangle, \quad i,j\in[n]\times[n] \qquad (9)
\gamma_{i,l}=\gamma_{i,l}+u_{i,l}+p_{i,l}^{+}+p_{i,l}^{-}-L_{l}, \quad i,l\in[n]\times[d]
\eta_{i,l}=\eta_{i,l}+a_{i,l}-p_{i,l}^{+}+p_{i,l}^{-}, \quad i,l\in[n]\times[d].

2.1.4 Algorithms via ADMM

Algorithm 2 provides the full steps of a parallel ADMM optimizer for the convex regression problem. In each line of Algorithm 2, we use subscripts such as $i$, $j$ and $l$ on the left-hand side. These updates may be run in parallel for $i\in[n]$, $i,j\in[n]\times[n]$, $i,l\in[n]\times[d]$ or $q\in[2]$. We initially set all block variables to zero and normalize the dataset such that $\sum_{i}y_{i}=0$ and $\sum_{i}\bm{x}_{i}=\bm{0}$. We have implemented our algorithm using PyTorch (Paszke et al., 2019), which benefits from this parallel structure when using a GPU. Our code, along with a built-in tuner for the hyperparameters $\lambda$ and $T$, is available on our GitHub repository: https://github.com/Siahkamari/Piecewise-linear-regression
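To illustrate how a single line of Algorithm 2 vectorizes on a GPU, the sketch below performs the $\alpha_{i,j}$ dual update of Eq. (9) for all $(i,j)$ pairs at once via broadcasting; this is a hedged illustration with tensor names of our own, not an excerpt from the released code.

```python
import torch

def update_alpha(alpha, s, y_hat, a, X):
    # <a_i, x_i - x_j> for every pair (i, j): a_i . x_i broadcast against a_i . x_j
    inner = (a * X).sum(dim=1, keepdim=True) - a @ X.T        # (n, n)
    residual = s + y_hat[:, None] - y_hat[None, :] - inner    # constraint residual
    return alpha + residual                                   # Eq. (9), all pairs at once
```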

Algorithm 1 L-update
0:  $\{\gamma_{i},c_{i}\}_{i=1}^{n}$, and $\lambda/\rho$
1:   $knot_{2n},\dots,knot_{1}\leftarrow\mathrm{sort}\{\gamma_{i}+c_{i},\gamma_{i}-c_{i}\}_{i=1}^{n}$
2:   $f\leftarrow\lambda/\rho$
3:   $f^{\prime}\leftarrow 0$
4:  for $j=2$ to $2n$ do
5:      $f^{\prime}\leftarrow f^{\prime}+\frac{1}{2}$
6:      $f\leftarrow f+f^{\prime}\cdot(knot_{j}-knot_{j-1})$
7:     if $f\leq 0$ then
8:        return   $\big(knot_{j}-\frac{f}{f^{\prime}}\big)^{+}$
9:     end if
10:  end for
11:  return   $\big(knot_{2n}-\frac{f}{n}\big)^{+}$
Algorithm 2 Convex regression
0:  $\{(\bm{x}_{i},y_{i})\}_{i=1}^{n}$, $\rho$, $\lambda$, and $T$
1:   $\hat{y}_{i}=s_{i,j}=\alpha_{i,j}\leftarrow 0$
2:   $\bm{L}=\bm{a}_{i}=\bm{p}_{i}=\bm{u}_{i}=\bm{\eta}_{i}=\bm{\gamma}_{i}\leftarrow\bm{0}_{d\times 1}$
3:  for $t=1$ to $T$ do
4:     Update $\hat{\bm{y}}$ by Eq. (7)
5:     Update $\bm{a}_{i}$ by Eq. (6)
6:     $L_{l}\leftarrow\textbf{L\_update}(\{\gamma_{i,l},|\eta_{i,l}+a_{i,l}|\}_{i\in[n]},\lambda/\rho)$
7:     Update $u_{i,l},p_{i,l}^{+},p_{i,l}^{-},s_{i,j}$ by Eq. (8)
8:     Update $\alpha_{i,j},\gamma_{i,l},\eta_{i,l}$ by Eq. (9)
9:  end for
10:  return   $f(\cdot)\triangleq\max_{i=1}^{n}(\langle\bm{a}_{i},\cdot-\bm{x}_{i}\rangle+\hat{y}_{i})$

2.2 Analysis

Our method has two sources of error: the error due to the ADMM procedure (Eq. 5) and the error due to estimating the ground-truth convex function (Eq. 4). We characterize both errors based on ADMM convergence (He & Yuan, 2012).

Theorem 1.

[from He & Yuan (2012)] Consider the separable convex optimization problem,

\min_{{\bf b}^{1}\in\mathcal{S}_{1},{\bf b}^{2}\in\mathcal{S}_{2}}\left[\psi({\bf b}^{1},{\bf b}^{2})=\psi_{1}({\bf b}^{1})+\psi_{2}({\bf b}^{2})\right]
\text{s.t.: }{\bf A}{\bf b}^{1}+{\bf B}{\bf b}^{2}+\bm{b}=\bm{0},

where $({\bf b}^{1},{\bf b}^{2})$ are the block variables, $(\mathcal{S}_{1},\mathcal{S}_{2})$ are convex sets that include the zero vector, $({\bf A},{\bf B})$ are the coefficient matrices and ${\bf b}$ is a constant vector. Let ${\bf b}^{1}_{t}$ and ${\bf b}^{2}_{t}$ be the solutions at iteration $t$ of a two-block ADMM procedure with learning rate $\rho$, where $({\bf b}^{1}_{0},{\bf b}^{2}_{0})$ are all-zero vectors. Denote the average of the iterates as $(\tilde{\bf b}^{1}_{T},\tilde{\bf b}^{2}_{T})=\left(\frac{1}{T}\sum_{t=1}^{T}{\bf b}^{1}_{t},\frac{1}{T}\sum_{t=1}^{T}{\bf b}^{2}_{t}\right)$. For all ${\bm{\kappa}}$ we have

\psi(\tilde{\bf b}^{1}_{T},\tilde{\bf b}^{2}_{T})-\psi({\bf b}^{1}_{*},{\bf b}^{2}_{*})-{\bm{\kappa}}^{T}({\bf A}\tilde{\bf b}^{1}_{T}+{\bf B}\tilde{\bf b}^{2}_{T}+\bm{b})\leq\frac{1}{T}\left(\frac{\rho}{2}\|{\bf B}{\bf b}^{2}_{*}\|^{2}+\frac{1}{2\rho}\|{\bm{\kappa}}\|^{2}\right),

where $({\bf b}^{1}_{*},{\bf b}^{2}_{*})$ are the optimal solutions.

We have arranged Theorem 1 so that it explicitly depends on problem-dependent constants such as $\|{\bf B}{\bf b}^{2}_{*}\|^{2}$, the number of iterations $T$, and the learning rate $\rho$. A proof is provided in Appendix B.1.

Next, we present a convergence analysis in terms of the regularized MSE of the output of Algorithm 2. This has the further benefit of identifying an appropriate learning rate $\rho$ and a range for the regularization coefficient $\lambda$ that minimize the computational complexity.

Theorem 2.

Let $\{\hat{y}_{i}^{t},\bm{a}_{i}^{t}\}_{i=1}^{n}$ be the output of Algorithm 2 at the $t^{\text{th}}$ iteration, $\tilde{y_{i}}\triangleq\frac{1}{T}\sum_{t=1}^{T}\hat{y}_{i}^{t}$ and $\tilde{\bm{a}_{i}}\triangleq\frac{1}{T}\sum_{t=1}^{T}\bm{a}_{i}^{t}$. Denote $\tilde{f}_{T}(\bm{x})\triangleq\max_{i}\langle\tilde{\bm{a}_{i}},\bm{x}-\bm{x}_{i}\rangle+\tilde{y_{i}}$. Assume $\max_{i,l}|x_{i,l}|\leq 1$ and $\mathbbm{Var}(\{y_{i}\}_{i=1}^{n})\leq 1$. If we choose $\rho=\frac{\sqrt{d}\lambda^{2}}{n}$, for $\lambda\geq\frac{3}{\sqrt{2nd}}$ and $T\geq n\sqrt{d}$ we have:

\frac{1}{n}\sum_{i=1}^{n}(\tilde{f}_{T}(\bm{x}_{i})-y_{i})^{2}+\lambda\|\tilde{f}_{T}\|\leq\min_{\hat{f}\in\mathcal{F}}\bigg(\frac{1}{n}\sum_{i=1}^{n}\big(\hat{f}(\bm{x}_{i})-y_{i}\big)^{2}+\lambda\|\hat{f}\|\bigg)+\frac{6n\sqrt{d}}{T+1}.
Corollary 1.

Our method needs $T=\frac{6n\sqrt{d}}{\epsilon}$ iterations to achieve $\epsilon$ error. Each iteration requires $\mathcal{O}(n^{2}d+nd^{2})$ flops. Preprocessing costs $\mathcal{O}(nd^{3})$. Therefore the total computational complexity is $\mathcal{O}\big(\frac{n^{3}d^{1.5}+n^{2}d^{2.5}+nd^{3}}{\epsilon}\big)$.
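As a back-of-the-envelope illustration of the schedule implied by Theorem 2 and Corollary 1, the short helper below computes the learning rate, iteration budget, and per-iteration flop count from $(n,d,\lambda,\epsilon)$; it is only a sketch of the stated constants, not part of the released solver.

```python
import math

def fcr_schedule(n, d, lam, eps):
    rho = math.sqrt(d) * lam ** 2 / n           # learning rate choice from Theorem 2
    T = math.ceil(6 * n * math.sqrt(d) / eps)   # iterations from Corollary 1
    flops_per_iter = n ** 2 * d + n * d ** 2    # per-iteration cost, up to constants
    preprocessing = n * d ** 3                  # batched Lambda_i inversion
    return rho, T, flops_per_iter, preprocessing
```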

We note that the assumptions of Theorem 2 are easily satisfied by normalizing the training dataset. The main difficulty in the derivation of Theorem 2 is that $(\tilde{\bm{a}_{i}},\tilde{y_{i}})$ may violate the constraints in (5). This can result in $\tilde{f}_{T}(\bm{x}_{i})\neq\tilde{y}_{i}$ and $\|\tilde{f}_{T}\|\neq\sum_{l=1}^{d}\tilde{L}_{l}$. Therefore, the main step of the proof of Theorem 2 is to characterize and bound the effect of such constraint violations on our objective function. We then choose $\bm{\kappa}$ in Theorem 1 so as to account for the effect of the constraint violations. For a full proof, we refer to Appendix B.2.

3 Approximating a Bregman Divergence

We next consider the application of learning a Bregman divergence from supervision. This problem is a type of metric learning problem, where we are given supervision (pairs of similar/dissimilar pairs, or triplets consisting of relative comparisons amongst data points) and we aim to learn a task-specific distance or divergence measure from the data. Classical metric learning methods typically learn a linear transformation of the data, corresponding to a Mahalanobis-type distance function,

(\bm{x}-\bm{y})^{T}M(\bm{x}-\bm{y}),

where $M$ is a positive semi-definite matrix (note that this generalizes the squared Euclidean distance, where $M$ would be the identity matrix). More generally, Mahalanobis distances are examples of Bregman divergences, which include other divergences such as the KL-divergence as special cases.

Bregman divergences are parameterized by an underlying strictly convex function, sometimes called the convex generating function of the divergence. Recently, Siahkamari et al. (2019) formulate learning a Bregman divergence as learning the underlying convex function parameterizing the divergence. The resulting optimization problem they study is similar to the convex regression problem in Eq. (5). Here we discuss an improvement to their approach with a slightly different loss function and our Lasso penalty term. We then derive an algorithm using 2-block ADMM; the existing solver for this problem uses standard linear programming.

In particular, suppose we observe a classification dataset $S_{n}=\{(\bm{x}_{i},y_{i})\}_{i=1}^{n}$, where $\bm{x}_{i}\in\mathbb{R}^{d}$ and $y_{i}$ is an integer class label. Let

D_{f}(\bm{x}_{i},\bm{x}_{j})=f(\bm{x}_{i})-f(\bm{x}_{j})-\langle\nabla^{*}f(\bm{x}_{j}),\bm{x}_{i}-\bm{x}_{j}\rangle,

be a Bregman divergence with convex generating function $f(\bm{x})\in\mathcal{F}$. Let the pairwise similarity loss be

\ell(D_{f}(\bm{x}_{i},\bm{x}_{j}),y_{i},y_{j})=\mathbbm{1}\big[\mathbbm{1}[y_{i}=y_{j}]=\mathbbm{1}(D_{f}(\bm{x}_{i},\bm{x}_{j})\geq 1)\big].

We estimate $f$, the underlying convex function of the Bregman divergence $D_{f}$, by

\hat{f}\triangleq\arg\min_{f\in\mathcal{F}}\sum_{i=1}^{n}\sum_{j=1}^{n}\ell(D_{f}(\bm{x}_{i},\bm{x}_{j}),y_{i},y_{j})+\lambda\|f\|.

We note that this approach can easily be generalized to other possible metric learning losses.
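To make the learned object concrete, the sketch below evaluates the divergence induced by a max-affine generating function and the 0/1 pairwise similarity loss above. It assumes the pieces $(\bm{a}_{i}, z_{i})$ returned by Algorithm 3 and the training points $\bm{x}_{i}$; the function names are ours and the snippet is illustrative rather than the paper's released API.

```python
import numpy as np

def f_hat_and_subgrad(x, A, z, X):
    # f_hat(x) = max_i <a_i, x - x_i> + z_i; a maximizing a_i is a subgradient at x
    vals = A @ x - np.sum(A * X, axis=1) + z
    i_star = int(np.argmax(vals))
    return vals[i_star], A[i_star]

def bregman_divergence(x, x_prime, A, z, X):
    # D_f(x, x') = f(x) - f(x') - <grad f(x'), x - x'>
    fx, _ = f_hat_and_subgrad(x, A, z, X)
    fxp, g = f_hat_and_subgrad(x_prime, A, z, X)
    return fx - fxp - g @ (x - x_prime)

def pairwise_loss(x, x_prime, same_class, A, z, X):
    # loss is 1 exactly when "same class" coincides with "divergence >= 1",
    # i.e., when the learned divergence disagrees with the labels (Sec. 3)
    return float(same_class == (bregman_divergence(x, x_prime, A, z, X) >= 1.0))
```

The same bregman_divergence routine can be reused at test time, e.g. for the nearest-neighbour prediction rule described in Sec. 5.2.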

3.1 Optimization

Now we provide an algorithm for Bregman divergence learning based on 2-block ADMM, solving for each block in closed form. We first reformulate the optimization problem in standard form; introducing some auxiliary variables, we have:

\min_{\substack{z_{i},\bm{a}_{i},\zeta_{i,j}\geq 0,\\ L_{l}\geq 0,\bm{p}_{i}^{+}\geq 0,\bm{p}_{i}^{-}\geq 0,\\ \bm{u}_{i}\geq 0,s_{i,j}\geq 0,t_{i,j}\geq 0}}\ \sum_{i=1}^{n}\sum_{j=1}^{n}\zeta_{i,j}+\lambda\sum_{l=1}^{d}L_{l} \qquad (10)
\textrm{s.t.}\ \begin{cases}\iota_{i,j}s_{i,j}-\iota_{i,j}+t_{i,j}+1-\zeta_{i,j}=0 & i,j\in[n]\times[n]\\ s_{i,j}+z_{i}-z_{j}-\langle\bm{a}_{i},\bm{x}_{i}-\bm{x}_{j}\rangle=0 & i,j\in[n]\times[n]\\ p_{i,l}^{+}+p_{i,l}^{-}+u_{i,l}-L_{l}=0 & i,l\in[n]\times[d]\\ a_{i,l}=p_{i,l}^{+}-p_{i,l}^{-} & i,l\in[n]\times[d],\end{cases}

where $\iota_{i,j}=2(\mathbbm{1}[y_{i}=y_{j}]-\frac{1}{2})$.

Now we write the augmented Lagrangian:

\ell(z_{i},\bm{a}_{i},\zeta_{i,j},L_{l},\bm{p}_{i}^{+},\bm{p}_{i}^{-},\bm{u}_{i},s_{i,j},t_{i,j})=\sum_{i=1}^{n}\sum_{j=1}^{n}\zeta_{i,j}+\lambda\sum_{l=1}^{d}L_{l}
+\sum_{i}\sum_{j}\frac{\rho}{2}(s_{i,j}+z_{i}-z_{j}-\langle\bm{a}_{i},\bm{x}_{i}-\bm{x}_{j}\rangle+\alpha_{i,j})^{2}
+\sum_{i}\sum_{l}\frac{\rho}{2}(u_{i,l}+p_{i,l}^{+}+p_{i,l}^{-}-L_{l}+\gamma_{i,l})^{2}
+\sum_{i}\sum_{l}\frac{\rho}{2}(a_{i,l}-p_{i,l}^{+}+p_{i,l}^{-}+\eta_{i,l})^{2}
+\sum_{i}\sum_{j}\frac{\rho}{2}(\iota_{i,j}s_{i,j}-\iota_{i,j}+t_{i,j}+1-\zeta_{i,j}+\tau_{i,j})^{2},

where $\alpha_{i,j}$, $\gamma_{i,l}$, $\eta_{i,l}$ and $\tau_{i,j}$ are dual variables. Next we divide the variables into two blocks and solve for each in closed form.

3.1.1 First block ${\bm{b}}^{1}=\{z_{i},\bm{a}_{i},\zeta_{i,j}\}$

Setting $\partial_{\zeta_{i,j}}\ell=0$ gives:

\zeta_{i,j}=\Big(\frac{-1}{n\rho}+\tau_{i,j}+\iota_{i,j}s_{i,j}-\iota_{i,j}+t_{i,j}+1\Big)^{+}. \qquad (11)

Collecting the terms containing $\bm{a}_{i}$ in Eq. (10) and comparing to those in Eq. (5), the solution for $\bm{a}_{i}$ follows:

\bm{a}_{i}=\bm{\Lambda}_{i}\Big(\bm{\theta}_{i}+z_{i}\bm{x}_{i}+\frac{1}{n}\sum_{k}z_{k}\bm{x}_{k}\Big). \qquad (12)

Set $\partial_{z_{i}}\ell=0$, $\sum_{i}z_{i}=0$ and $\sum_{i}\bm{x}_{i}=\bm{0}$. Using Eq. (12) we can solve for $z_{1},\dots,z_{n}$ as

\bm{z}=\bm{\Omega}_{\text{breg}}^{-1}(\bm{\nu}-\bm{\beta}), \qquad (13)

where $\Omega_{\text{breg},i,j}=(2-\bm{x}_{i}^{T}\bm{\Lambda}_{i}\bm{x}_{i})\mathbbm{1}[i=j]-\frac{1}{n}D_{i,j}$.

3.1.2 Second block ${\bm{b}}^{2}=\{L_{l},\bm{p}_{i},\bm{u}_{i},s_{i,j},t_{i,j}\}$

Comparing Eq. (10) and Eq. (5), we find that $L_{l}$, $\bm{p}_{i}$ and $\bm{u}_{i}$ have the same solutions as in Sec. 2.1.2. Hence we only need to solve for $s_{i,j}$ and $t_{i,j}$. Setting $\partial_{s_{i,j}}\ell=\partial_{t_{i,j}}\ell=0$, some algebra gives:

s_{i,j}=\frac{1}{2}\big(\pi^{2}_{i,j}+\iota_{i,j}\pi^{1}_{i,j}-\iota_{i,j}(\pi^{1}_{i,j}-\iota_{i,j}\pi^{2}_{i,j})^{+}\big)^{+}, \qquad (14)
t_{i,j}=(\pi^{1}_{i,j}-\iota_{i,j}s_{i,j})^{+},

where

\pi^{1}_{i,j}\triangleq-\tau_{i,j}+\iota_{i,j}-1+\zeta_{i,j},
\pi^{2}_{i,j}\triangleq-\alpha_{i,j}-z_{i}+z_{j}+\langle\bm{a}_{i},\bm{x}_{i}-\bm{x}_{j}\rangle.

3.1.3 Dual variables

The only constraint not already present in Eq. (5) is the one with dual multiplier $\tau_{i,j}$. Its update is

\tau_{i,j}=\tau_{i,j}+\iota_{i,j}s_{i,j}-\iota_{i,j}+t_{i,j}+1-\zeta_{i,j}. \qquad (15)

Algorithm 3 in the appendix provides the full steps for solving the Bregman divergence learning problem.

4 Difference of Convex (DC) Regression

As a further example, we extend our convex regression solver to difference of convex regression as studied in Siahkamari et al. (2020). DC functions are the set of functions $f$ that can be represented as $f=\phi^{1}-\phi^{2}$ for some choice of two convex functions. DC functions are a very rich class; for instance, they are known to contain all $\mathcal{C}^{2}$ functions. DC regression has been studied in Cui et al. (2018); Siahkamari et al. (2020); Bagirov et al. (2020). We provide a 2-block ADMM solver for the difference of convex estimator proposed in Siahkamari et al. (2020), with the minor change of choosing a different penalty

\|f\|\triangleq\inf_{\phi^{1},\phi^{2}}\ \sum_{q=1}^{2}\sum_{l=1}^{d}\sup_{\bm{x}}|\partial_{x_{l}}^{*}\phi^{q}(\bm{x})|
\textrm{s.t. }\phi^{1},\phi^{2}\text{ are convex, }\ \phi^{1}-\phi^{2}=f.

In particular, consider a regression dataset $\{(y_{i},\bm{x}_{i})\}_{i=1}^{n}$ with $\bm{x}_{i}\in\mathbb{R}^{d}$ and $y_{i}\in\mathbb{R}$. Denote the class of difference of convex functions by

\mathcal{DC}\triangleq\{f:\mathbb{R}^{d}\to\mathbb{R}\ |\ f=\phi^{1}-\phi^{2},\ (\phi^{1},\phi^{2})\text{ are convex}\}.

In order to estimate $f$, we minimize the penalized least squares over the $\mathcal{DC}$ class:

\hat{f}\triangleq\arg\min_{f\in\mathcal{DC}}\frac{1}{n}\sum_{i=1}^{n}(y_{i}-f(\bm{x}_{i}))^{2}+\lambda\|f\|.

This results in a convex program:

\min_{\hat{y}_{i}^{q},\bm{a}_{i}^{q}}\ \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_{i}^{1}-\hat{y}_{i}^{2}-y_{i})^{2}+\lambda\sum_{q=1}^{2}\sum_{l=1}^{d}\max_{i=1}^{n}|a_{i,l}^{q}| \qquad (16)
\textrm{s.t. }\begin{cases}\hat{y}_{i}^{1}-\hat{y}_{j}^{1}-\langle\bm{a}_{i}^{1},\bm{x}_{i}-\bm{x}_{j}\rangle\leq 0 & i,j\in[n]\times[n],\\ \hat{y}_{i}^{2}-\hat{y}_{j}^{2}-\langle\bm{a}_{i}^{2},\bm{x}_{i}-\bm{x}_{j}\rangle\leq 0 & i,j\in[n]\times[n],\end{cases}

and the estimator for $f$ is given as

\hat{f}(\bm{x})\triangleq\max_{i}\big(\langle\bm{a}_{i}^{1},\bm{x}-\bm{x}_{i}\rangle+\hat{y}_{i}^{1}\big)-\max_{i}\big(\langle\bm{a}_{i}^{2},\bm{x}-\bm{x}_{i}\rangle+\hat{y}_{i}^{2}\big).

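A short sketch of evaluating this estimator as a difference of two max-affine functions follows; it assumes the pieces $(\bm{a}_{i}^{q},\hat{y}_{i}^{q})$ returned by Algorithm 4 and is illustrative only, with names of our own choosing.

```python
import numpy as np

def max_affine(x, A, yh, X):
    # phi^q(x) = max_i <a_i^q, x - x_i> + yh_i^q
    return np.max(A @ x - np.sum(A * X, axis=1) + yh)

def dc_predict(x, A1, yh1, A2, yh2, X):
    # f_hat(x) = phi^1(x) - phi^2(x)
    return max_affine(x, A1, yh1, X) - max_affine(x, A2, yh2, X)
```
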
We note that the linear constraint sets for $\{\hat{y}_{i}^{1},\bm{a}_{i}^{1}\}$ and $\{\hat{y}_{i}^{2},\bm{a}_{i}^{2}\}$ have no variables in common in program (16). Fixing $\hat{\bm{y}}^{2}$ allows us to solve for $\hat{\bm{y}}^{1}$ using Eq. (7), and vice-versa. From there we can decouple the $\hat{\bm{y}}^{1}$ and $\hat{\bm{y}}^{2}$ solutions by solving a system of linear equations,

\hat{\bm{y}}^{q}=\frac{(-1)^{q+1}}{2}\bigg(\bm{\Omega}+\frac{2\bm{I}}{n^{2}\rho}\bigg)^{-1}\bigg(\frac{4\bm{y}}{n^{2}\rho}+\bm{v}^{1}-\bm{\beta}^{1}-\bm{v}^{2}+\bm{\beta}^{2}\bigg)+\frac{1}{2}\bigg(\bm{\Omega}-\frac{2\bm{I}}{n^{2}\rho}\bigg)^{-1}(\bm{v}^{1}-\bm{\beta}^{1}+\bm{v}^{2}-\bm{\beta}^{2}). \qquad (17)

Algorithm 4 in the Appendix provides the full steps for solving the difference of convex regression problem.

5 Experiments

Figure 1: Regression results on UCI datasets

In this section we provide experiments on real datasets from the UCI machine learning repository as well as on synthetic datasets. We compare running times between our proposed approach and the baseline interior point method for all three problems (convex regression, DC regression, and Bregman divergence learning). We also compare both our DC regression algorithm and our Bregman divergence learning algorithm to state-of-the-art regression and classification methods, and show that our approach is close to the state of the art in terms of accuracy.

For the ADMM-based methods, we use a V100 Nvidia GPU with 11 gigabytes of GPU memory and 4 CPU cores. For all other methods we use a 16-core CPU.

Table 1: Comparison of convex regression run time (seconds) against the baseline on synthetic data
n      d    Baseline (Siahkamari et al., 2020)    This Paper
1000   2    32.3                                  3.63
1000   4    30.8                                  3.75
1000   8    54.5                                  4.03
1000   16   112.3                                 4.12
1000   32   189.3                                 4.37

Table 2: Comparison of DC regression run time (seconds) against the baseline on synthetic data
n      d    Baseline (Siahkamari et al., 2020)    This Paper
1000   2    140.9                                 3.67
1000   4    113.9                                 3.79
1000   8    210.3                                 4.05
1000   16   351.6                                 4.15
1000   32   823.2                                 4.42

5.1 Timing Results for Convex and DC Regression

To demonstrate the effectiveness of our proposed approach, we first compare, on synthetic data, the existing interior point method to our 2-block ADMM method on standard convex regression and DC regression. Here, we fixed the number of data points at 1000 and tested data of dimensionality $d\in\{2,4,8,16,32\}$.

Results are shown in Tables 1 and 2. We see that our method ranges from 8.8x to 186x faster than the interior point method. Note that the interior point method fails on large data sets, and cannot be run for many of the UCI data sets tested later.

5.2 Results on Real Data

Now we experiment with our DC regression algorithm and the Bregman divergence learning algorithm (PBDL) on real data sets, and report accuracy and timing results. For PBDL, we compare against its predecessor (PBDL_0) as well as state-of-the-art classification models. All results are reported either on the pre-specified train/test split or on a 5-fold cross-validation set, following the instructions for each dataset.

For comparison, we include the previous (DC)-regression and PBDL algorithms. Furthermore, we pick two popular tree-based algorithms, XGBoost and Random Forest, and also consider Lasso as a baseline. According to Fernández-Delgado et al. (2014, 2019), Random Forest achieves the state of the art on most regression and classification tasks. More recently, XGBoost has been shown to achieve the state of the art on many real datasets and Kaggle.com competitions (Chen & Guestrin, 2016). However, we stress that achieving state-of-the-art results is not the purpose of our experiments.

We provide the details of hyper-parameters for each method.

DC-regression: We choose $\lambda$ from the grid $10^{-3:3}$ by 5-fold cross-validation. We then do at most 2 more rounds of grid search around the optimal $\lambda$ from the first round. We fix $\rho=0.01$ and choose $T$ by early stopping, i.e., whenever the validation error improvement is less than $10^{-3}$ after $n$ iterations of ADMM.

PBDL: We predict the classes using a 5-nearest-neighbour scheme, $i_{\text{nearest}}(\bm{x})=\arg\min_{i}D_{\phi}(\bm{x},\bm{x}_{i})$, where $D_{\phi}$ is the learned Bregman divergence. Tuning of $\lambda$ is similar to DC-regression.

PBDL_0: We use the authors' existing code for learning a Bregman divergence, which is based on Gurobi solvers. We use their built-in parameter tuner for $\lambda$, which is based on 5-fold cross-validation.

XGBoost Regressor/Classifier: max_depth and learning_rate are found by 5-fold cross-validation over $\{1,\dots,10\}\times\{0.05,0.1,0.5\}$. Other parameters are set to their defaults. We use the xgboost 1.6.0 package (Chen & Guestrin, 2016).

Random Forest Regressor/Classifier: n_features is found by 5-fold cross-validation over $[d^{0.25},d^{0.75}]$. We use the sklearn package (Pedregosa et al., 2011).

Lasso: We use the sklearn package function LassoCV.

5.2.1 Datasets

We chose all regression datasets from UCI that have a number of instances $10^{3}\leq n\leq 10^{4}$. These were 30 datasets at the time of submission of this paper. We discard 16 of these datasets for various reasons, such as the dataset not being available. Some of the remaining datasets have multiple target variables, so we report the out-of-sample $R^{2}$ results with a suffix for each target variable.

For classification, we chose all datasets originally used in Siahkamari et al. (2019). Further, we include the Abalone dataset, which was too large for the previous PBDL algorithm. We use these datasets as multi-class classification problems. We report accuracy as well as the training run-time.

5.2.2 Results

Regression: For our regression experiments we present the out-of-sample $R^{2}$ on 18 datasets in Figure 1. We observe that our method has performance similar to that of XGBoost and Random Forest despite having $n\times d$ parameters. On the solar-flare and Parkinson datasets, other methods overfit whereas our algorithm avoids overfitting and is more robust. The maximum number of instances we were able to experiment with in reasonable time is about $10^{4}$, which is 10x larger than what Siahkamari et al. (2020) experimented on. Detailed results are in the Appendix.

Classification: For our classification experiments we compare the logarithm of the runtimes of all methods in Figure 2. In terms of speed, we are on average 30x faster than the original PBDL_0 algorithm. PBDL_0 also fails to handle the Abalone dataset with $n=4177$, whereas the new PBDL takes less than a minute to finish one fit. We are slower than XGBoost and Random Forest; however, we rely only on a Python script versus optimized compiled code. We note that the divergence function learned by PBDL can further be used for other tasks such as ranking and clustering. We present the accuracy in Figure 3 in the Appendix. We observe that our Bregman divergence learning algorithm (PBDL), as well as the original (PBDL_0), has performance similar to that of XGBoost and Random Forest.

Figure 2: Classification log(runtime) for all fits on UCI datasets

6 Conclusion

In this paper, we studied the nonparametric convex regression problem with an $L_1$ regularization penalty. We provided a solver for this problem and proved an iteration complexity of $6n\sqrt{d}/\epsilon$. The total computational complexity is $O(n^{3}d^{1.5}/\epsilon+n^{2}d^{2.5}/\epsilon+nd^{3}/\epsilon)$, which improves on the $O(n^{5}d^{2}/\epsilon)$ previously known for this problem. We also extended our solver to the problem of difference of convex (DC) regression and the problem of learning an arbitrary Bregman divergence. We provided comparisons to state-of-the-art regression and classification models.

Acknowledgements

This research was supported by the Secretary of Defense for Research and Engineering under Air Force Contract No. FA8702-15-D-0001, Army Research Office Grant W911NF2110246, the National Science Foundation grants CCF-2007350 and CCF-1955981, the Hariri Data Science Faculty Fellowship Grants, and a gift from the ARM corporation.

References

  • Afriat (1967) Afriat, S. N. The construction of utility functions from expenditure data. International economic review, 8(1):67–77, 1967.
  • Amos et al. (2017) Amos, B., Xu, L., and Kolter, J. Z. Input convex neural networks. In International Conference on Machine Learning, pp. 146–155. PMLR, 2017.
  • Bagirov et al. (2020) Bagirov, A. M., Taheri, S., Karmitsa, N., Sultanova, N., and Asadi, S. Robust piecewise linear l 1-regression via nonsmooth dc optimization. Optimization Methods and Software, pp.  1–21, 2020.
  • Balázs (2016) Balázs, G. Convex regression: theory, practice, and applications. 2016.
  • Bertsimas & Mundru (2021) Bertsimas, D. and Mundru, N. Sparse convex regression. INFORMS Journal on Computing, 33(1):262–279, 2021.
  • Boyd et al. (2004) Boyd, S., Boyd, S. P., and Vandenberghe, L. Convex optimization. Cambridge university press, 2004.
  • Chen & Guestrin (2016) Chen, T. and Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pp.  785–794, 2016.
  • Chen & Mazumder (2020) Chen, W. and Mazumder, R. Multivariate convex regression at scale. arXiv preprint arXiv:2005.11588, 2020.
  • Cui et al. (2018) Cui, Y., Pang, J.-S., and Sen, B. Composite difference-max programs for modern statistical estimation problems. SIAM Journal on Optimization, 28(4):3344–3374, 2018.
  • Eckstein & Yao (2012) Eckstein, J. and Yao, W. Augmented lagrangian and alternating direction methods for convex optimization: A tutorial and some illustrative computational results. RUTCOR Research Reports, 32(3), 2012.
  • Fernández-Delgado et al. (2014) Fernández-Delgado, M., Cernadas, E., Barro, S., and Amorim, D. Do we need hundreds of classifiers to solve real world classification problems? The journal of machine learning research, 15(1):3133–3181, 2014.
  • Fernández-Delgado et al. (2019) Fernández-Delgado, M., Sirsat, M. S., Cernadas, E., Alawadi, S., Barro, S., and Febrero-Bande, M. An extensive experimental survey of regression methods. Neural Networks, 111:11–34, 2019.
  • Gabay & Mercier (1976) Gabay, D. and Mercier, B. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & mathematics with applications, 2(1):17–40, 1976.
  • Ghosh et al. (2019) Ghosh, A., Pananjady, A., Guntuboyina, A., and Ramchandran, K. Max-affine regression: Provable, tractable, and near-optimal statistical estimation. arXiv preprint arXiv:1906.09255, 2019.
  • He & Yuan (2012) He, B. and Yuan, X. On the o(1/n) convergence rate of the douglas–rachford alternating direction method. SIAM Journal on Numerical Analysis, 50(2):700–709, 2012.
  • Kim et al. (2021) Kim, S., Bahmani, S., and Lee, K. Max-linear regression by scalable and guaranteed convex programming. arXiv preprint arXiv:2103.07020, 2021.
  • Mazumder et al. (2019) Mazumder, R., Choudhury, A., Iyengar, G., and Sen, B. A computational framework for multivariate convex regression and its variants. Journal of the American Statistical Association, 114(525):318–331, 2019.
  • Nagurney (1998) Nagurney, A. Network economics: A variational inequality approach, volume 10. Springer Science & Business Media, 1998.
  • Paszke et al. (2019) Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32:8026–8037, 2019.
  • Pedregosa et al. (2011) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  • Siahkamari et al. (2019) Siahkamari, A., Saligrama, V., Castanon, D., and Kulis, B. Learning bregman divergences. arXiv e-prints, pp.  arXiv–1905, 2019.
  • Siahkamari et al. (2020) Siahkamari, A., Gangrade, A., Kulis, B., and Saligrama, V. Piecewise linear regression via a difference of convex functions. In International Conference on Machine Learning, pp. 8895–8904. PMLR, 2020.
  • Varian (1982) Varian, H. R. The nonparametric approach to demand analysis. Econometrica: Journal of the Econometric Society, pp. 945–973, 1982.
  • Zhang et al. (2020) Zhang, X., Li, Y., Zhang, Z., and Zhang, Z.-L. f-GAIL: Learning f-divergence for generative adversarial imitation learning. arXiv preprint arXiv:2010.01207, 2020.

Appendix to ‘Faster Convex Lipschitz Regression via 2 blocks ADMM’

Appendix A Algorithms

Algorithm 3 Learning a Bregman divergence
1:  Require: $\{(\bm{x}_{i},y_{i})\,|\,\bm{x}_{i}\in\mathbb{R}^{d},y_{i}\in\mathbb{N}\}_{i\in[n]}$ and $\{\rho,\,\lambda,\,T\}$
2:   $z_{i}=s_{i,j}=t_{i,j}=\alpha_{i,j}=\tau_{i,j}\leftarrow 0$
3:   $\bm{L}=\bm{a}_{i}=\bm{p}_{i}=\bm{u}_{i}=\bm{\eta}_{i}=\bm{\gamma}_{i}\leftarrow\bm{0}_{d\times 1}$
4:  for $t=1$ to $T$ do
5:     Update $\zeta_{i,j}$ by Eq. (11)
6:     Update $\bm{z}$ by Eq. (13)
7:     Update $\bm{a}_{i}$ by Eq. (12)
8:     $L_{l}\leftarrow\textbf{L\_update}(\{\gamma_{i,l},|\eta_{i,l}+a_{i,l}|\}_{i\in[n]},\lambda/\rho)$
9:     Update $s_{i,j},t_{i,j}$ by Eq. (14)
10:     Update $u_{i,l},p_{i,l}^{+},p_{i,l}^{-}$ by Eq. (8)
11:     Update $\alpha_{i,j},\gamma_{i,l},\eta_{i,l}$ by Eq. (9)
12:     Update $\tau_{i,j}$ by Eq. (15)
13:  end for
14:  return: $f(\cdot)\triangleq\max_{i=1}^{n}(\langle\bm{a}_{i},\cdot-\bm{x}_{i}\rangle+z_{i})$

Algorithm 4 Difference of convex regression
0:  $\{(\bm{x}_{i},y_{i})\}_{i=1}^{n}$, $\rho$, $\lambda$, and $T$
1:   $\hat{y}_{i}^{q}=s_{i,j}^{q}=\alpha_{i,j}^{q}\leftarrow 0$
2:   $\bm{L}^{q}=\bm{a}_{i}^{q}=\bm{p}_{i}^{q}=\bm{u}_{i}^{q}=\bm{\eta}_{i}^{q}=\bm{\gamma}_{i}^{q}\leftarrow\bm{0}_{d\times 1}$
3:  for $t=1$ to $T$ do
4:     Update $\hat{\bm{y}}^{q}$ by Eq. (17)
5:     Update $\bm{a}_{i}^{q}$ by Eq. (6)
6:     $L_{l}^{q}\leftarrow\textbf{L\_update}(\{\gamma_{i,l}^{q},|\eta_{i,l}^{q}+a_{i,l}^{q}|\}_{i\in[n]},\lambda/\rho)$
7:     Update $u_{i,l}^{q},p_{i,l}^{q+},p_{i,l}^{q-},s_{i,j}^{q}$ by Eq. (8)
8:     Update $\alpha_{i,j}^{q},\gamma_{i,l}^{q},\eta_{i,l}^{q}$ by Eq. (9)
9:  end for
10:  return   $\phi^{q}(\cdot)\triangleq\max_{i=1}^{n}(\langle\bm{a}_{i}^{q},\cdot-\bm{x}_{i}\rangle+\hat{y}_{i}^{q})$

Appendix B Derivations

B.1 ADMM Convergence

Theorem 1 gives the convergence of the ADMM procedure. We follow an analysis similar to He & Yuan (2012) and give all steps for the sake of completeness. We first restate the result as follows:

Theorem 1.

Consider the separable convex optimization problem:

\min_{{\bf b}^{1}\in\mathcal{S}_{1},{\bf b}^{2}\in\mathcal{S}_{2}}\left[\psi({\bf b}^{1},{\bf b}^{2})=\psi_{1}({\bf b}^{1})+\psi_{2}({\bf b}^{2})\right]
\text{s.t.: }{\bf A}{\bf b}^{1}+{\bf B}{\bf b}^{2}+\bm{b}=\bm{0},

where $({\bf b}^{1},{\bf b}^{2})$ are the block variables, $(\mathcal{S}_{1},\mathcal{S}_{2})$ are convex sets that include the zero vector, $({\bf A},{\bf B})$ are the coefficient matrices and ${\bf b}$ is a constant vector. Let ${\bf b}^{1}_{t}$ and ${\bf b}^{2}_{t}$ be the solutions at iteration $t$ of a two-block ADMM procedure with learning rate $\rho$, i.e.:

{\bf b}^{1}_{t+1}=\arg\min_{{\bf b}^{1}\in\mathcal{S}_{1}}\psi_{1}({\bf b}^{1})+\frac{\rho}{2}\big\|{\bf A}{\bf b}^{1}+{\bf B}{\bf b}^{2}_{t}+{\bf b}-\frac{1}{\rho}{\bf d}_{t}\big\|^{2}
{\bf b}^{2}_{t+1}=\arg\min_{{\bf b}^{2}\in\mathcal{S}_{2}}\psi_{2}({\bf b}^{2})+\frac{\rho}{2}\big\|{\bf A}{\bf b}^{1}_{t+1}+{\bf B}{\bf b}^{2}+{\bf b}-\frac{1}{\rho}{\bf d}_{t}\big\|^{2}
{\bf d}_{t+1}={\bf d}_{t}-\rho\left({\bf A}{\bf b}^{1}_{t+1}+{\bf B}{\bf b}^{2}_{t+1}+{\bf b}\right),

where ${\bf b}^{1}_{0}={\bf b}^{2}_{0}={\bf d}_{0}=0$.

Denote the average of the iterates as $(\tilde{\bf b}^{1}_{T},\tilde{\bf b}^{2}_{T})=\left(\frac{1}{T}\sum_{t=1}^{T}{\bf b}^{1}_{t},\frac{1}{T}\sum_{t=1}^{T}{\bf b}^{2}_{t}\right)$. For all ${\bm{\kappa}}$ we have

\psi(\tilde{\bf b}^{1}_{T},\tilde{\bf b}^{2}_{T})-\psi({\bf b}^{1}_{*},{\bf b}^{2}_{*})-{\bm{\kappa}}^{T}({\bf A}\tilde{\bf b}^{1}_{T}+{\bf B}\tilde{\bf b}^{2}_{T}+\bm{b})\leq\frac{1}{T}\left(\frac{\rho}{2}\|{\bf B}{\bf b}^{2}_{*}\|^{2}+\frac{1}{2\rho}\|{\bm{\kappa}}\|^{2}\right),

where $({\bf b}^{1}_{*},{\bf b}^{2}_{*})$ are the optimal solutions.

Before presenting the proof, we define a dual-variable-like sequence $\{\overline{\bf d}_{t}\}_{t}$ with the update relation

\overline{\bf d}_{t+1}={\bf d}_{t}-\rho({\bf A}{\bf b}^{1}_{t+1}+{\bf B}{\bf b}^{2}_{t}+{\bf b}).

The only difference between $\overline{\bf d}_{t+1}$ and ${\bf d}_{t+1}$ is due to ${\bf b}^{2}_{t}$ versus ${\bf b}^{2}_{t+1}$. We have the following relation:

\overline{\bf d}_{t+1}-{\bf d}_{t+1}=\rho{\bf B}({\bf b}^{2}_{t+1}-{\bf b}^{2}_{t}). \qquad (18)
Lemma 1.

[from Nagurney (1998)] Let ${\bf b}^{*}=\arg\min_{{\bf b}\in\mathcal{S}}\psi({\bf b})$, where $\psi$ is continuously differentiable and $\mathcal{S}$ is closed and convex. Then ${\bf b}^{*}$ satisfies

\left\langle\nabla\psi\left({\bf b}^{*}\right),{\bf b}-{\bf b}^{*}\right\rangle\geq 0,\quad\quad\forall{\bf b}\in\mathcal{S}.

Applying Lemma 1 to the optimization problems of ADMM gives,

\left\langle\nabla\psi_{1}\left({\bf b}^{1}_{t+1}\right)+\rho{\bf A}^{T}\left({\bf A}{\bf b}^{1}_{t+1}+{\bf B}{\bf b}^{2}_{t}+{\bf b}-\frac{1}{\rho}{\bf d}_{t}\right),{\bf b}^{1}-{\bf b}^{1}_{t+1}\right\rangle\geq 0,\quad\forall{\bf b}^{1}\in\mathcal{S}_{1} \qquad (19)
\left\langle\nabla\psi_{2}\left({\bf b}^{2}_{t+1}\right)+\rho{\bf B}^{T}\left({\bf A}{\bf b}^{1}_{t+1}+{\bf B}{\bf b}^{2}_{t+1}+{\bf b}-\frac{1}{\rho}{\bf d}_{t}\right),{\bf b}^{2}-{\bf b}^{2}_{t+1}\right\rangle\geq 0,\quad\forall{\bf b}^{2}\in\mathcal{S}_{2}. \qquad (20)

Combining Eqs. 19 and 20 with the $\{\overline{\bf d}_{t}\}_{t}$ update rule, we get, $\forall{\bf b}^{1}\in\mathcal{S}_{1},\forall{\bf b}^{2}\in\mathcal{S}_{2}$,

\left\langle\nabla\psi_{1}\left({\bf b}^{1}_{t+1}\right)-{\bf A}^{T}\overline{\bf d}_{t+1},{\bf b}^{1}-{\bf b}^{1}_{t+1}\right\rangle\geq 0,\quad\left\langle\nabla\psi_{2}\left({\bf b}^{2}_{t+1}\right)-{\bf B}^{T}\overline{\bf d}_{t+1}-\rho{\bf B}^{T}{\bf B}({\bf b}^{2}_{t}-{\bf b}^{2}_{t+1}),{\bf b}^{2}-{\bf b}^{2}_{t+1}\right\rangle\geq 0. \qquad (21)

Theorem 1 is a direct consequence of the following lemma.

Lemma 2.

For all ${\bf b}^{1},{\bf b}^{2},{\bm{\kappa}}$ we have,

\psi({\bf b}^{1},{\bf b}^{2})-\psi({\bf b}^{1}_{t+1},{\bf b}^{2}_{t+1})+{\bm{\kappa}}^{T}({\bf A}{\bf b}^{1}_{t+1}+{\bf B}{\bf b}^{2}_{t+1}+\bm{b})-\overline{\bf d}_{t+1}^{T}({\bf A}{\bf b}^{1}+{\bf B}{\bf b}^{2}+{\bf b})+Z_{t}-Z_{t+1}\geq 0,

where $Z_{t}=\frac{\rho}{2}\|{\bf b}^{2}-{\bf b}^{2}_{t}\|_{{\bf B}^{T}{\bf B}}^{2}+\frac{1}{2\rho}\|{\bm{\kappa}}-{\bf d}_{t}\|^{2}$ and $\|{\bf x}\|_{\bf M}^{2}={\bf x}^{T}{\bf M}{\bf x}$.

Let $({\bf b}^{1},{\bf b}^{2})$ be $({\bf b}^{1}_{*},{\bf b}^{2}_{*})$. By definition we have ${\bf A}{\bf b}^{1}_{*}+{\bf B}{\bf b}^{2}_{*}+{\bf b}={\bf 0}$, which cancels an inner product. Rearranging and averaging over time gives

\psi({\bf b}^{1}_{t+1},{\bf b}^{2}_{t+1})-\psi({\bf b}^{1}_{*},{\bf b}^{2}_{*})-{\bm{\kappa}}^{T}({\bf A}{\bf b}^{1}_{t+1}+{\bf B}{\bf b}^{2}_{t+1}+\bm{b})\leq Z_{t}-Z_{t+1}
\frac{1}{T}\left(\sum_{t=0}^{T-1}\psi({\bf b}^{1}_{t+1},{\bf b}^{2}_{t+1})\right)-\psi({\bf b}^{1}_{*},{\bf b}^{2}_{*})-{\bm{\kappa}}^{T}({\bf A}\tilde{\bf b}^{1}_{T}+{\bf B}\tilde{\bf b}^{2}_{T}+\bm{b})\leq\frac{1}{T}\left(\sum_{t=0}^{T-1}Z_{t}-Z_{t+1}\right).

The sum on the RHS telescopes to $Z_{0}-Z_{T}\leq Z_{0}$ since $Z_{t}\geq 0$, so the RHS is at most $\frac{1}{T}Z_{0}$. We lower bound the LHS via Jensen's inequality, $\psi(\tilde{\bf b}^{1}_{T},\tilde{\bf b}^{2}_{T})\leq\frac{1}{T}\sum_{t=0}^{T-1}\psi({\bf b}^{1}_{t+1},{\bf b}^{2}_{t+1})$. Combining the two bounds we get

\psi(\tilde{\bf b}^{1}_{T},\tilde{\bf b}^{2}_{T})-\psi({\bf b}^{1}_{*},{\bf b}^{2}_{*})-{\bm{\kappa}}^{T}({\bf A}\tilde{\bf b}^{1}_{T}+{\bf B}\tilde{\bf b}^{2}_{T}+\bm{b})\leq\frac{1}{T}Z_{0},

which is the statement in Theorem 1 assuming the initial variables are 0s.

We continue with the proof of Lemma 2. We start with the definition of convexity for both $\psi_{1}$ and $\psi_{2}$:

\psi_{1}({\bf b}^{1})-\psi_{1}({\bf b}^{1}_{t+1})-({\bf b}^{1}-{\bf b}^{1}_{t+1})^{T}\nabla\psi_{1}({\bf b}^{1}_{t+1})\geq 0;\quad\psi_{2}({\bf b}^{2})-\psi_{2}({\bf b}^{2}_{t+1})-({\bf b}^{2}-{\bf b}^{2}_{t+1})^{T}\nabla\psi_{2}({\bf b}^{2}_{t+1})\geq 0.

Summing these relations and plugging in Eq. 21 gives,

ψ(𝐛1,𝐛2)ψ(𝐛t+11,𝐛t+12)(𝐛1𝐛t+11)T𝐀T𝐝¯t+1(𝐛2𝐛t+12)T𝐁T𝐝¯t+1ρ(𝐛2𝐛t+12)T𝐁T𝐁(𝐛t2𝐛t+12)T10.\displaystyle\psi({\bf b}^{1},{\bf b}^{2})-\psi({\bf b}^{1}_{t+1},{\bf b}^{2}_{t+1})\underbrace{-({\bf b}^{1}-{\bf b}^{1}_{t+1})^{T}{\bf A}^{T}\overline{\bf d}_{t+1}-({\bf b}^{2}-{\bf b}^{2}_{t+1})^{T}{\bf B}^{T}\overline{\bf d}_{t+1}-\rho({\bf b}^{2}-{\bf b}^{2}_{t+1})^{T}{\bf B}^{T}{\bf B}({\bf b}^{2}_{t}-{\bf b}^{2}_{t+1})}_{T_{1}}\geq 0. (22)

The update rule of 𝐝¯\overline{\bf d} gives 𝐀𝐛t+11+𝐁𝐛t2+𝐛+1ρ(𝐝¯t+1𝐝t)=𝟎{\bf A}{\bf b}^{1}_{t+1}+{\bf B}{\bf b}^{2}_{t}+{\bf b}+\frac{1}{\rho}(\overline{\bf d}_{t+1}-{\bf d}_{t})={\bf 0}. Then for all 𝜿{\bm{\kappa}}, we have (𝜿𝐝¯t+1)T(𝐀𝐛t+11+𝐁𝐛t2+𝐛+1ρ(𝐝¯t+1𝐝t))=0({\bm{\kappa}}-\overline{\bf d}_{t+1})^{T}({\bf A}{\bf b}^{1}_{t+1}+{\bf B}{\bf b}^{2}_{t}+{\bf b}+\frac{1}{\rho}(\overline{\bf d}_{t+1}-{\bf d}_{t}))=0. Adding such a term to T1T_{1} does not change its value. After adding the zero inner product, we rearrange T1T_{1} as

T1=\displaystyle T_{1}= 𝜿T(𝐀𝐛t+11+𝐁𝐛t+12+𝒃)𝐝¯t+1T(𝐀𝐛1+𝐁𝐛2+𝐛)Y1+ρ(𝐛2𝐛t+12)T𝐁T𝐁(𝐛t+12𝐛t2)Y2\displaystyle\underbrace{{\bm{\kappa}}^{T}({\bf A}{\bf b}^{1}_{t+1}+{\bf B}{\bf b}^{2}_{t+1}+\bm{b})-\overline{\bf d}_{t+1}^{T}({\bf A}{\bf b}^{1}+{\bf B}{\bf b}^{2}+{\bf b})}_{Y_{1}}+\rho\underbrace{({\bf b}^{2}-{\bf b}^{2}_{t+1})^{T}{\bf B}^{T}{\bf B}({\bf b}^{2}_{t+1}-{\bf b}^{2}_{t})}_{Y_{2}}
+(𝜿𝐝¯t+1)T(𝐁(𝐛t2𝐛t+12)+1ρ(𝐝¯t+1𝐝t))Y3.\displaystyle+\underbrace{({\bm{\kappa}-\overline{\bf d}_{t+1}})^{T}({\bf B}({\bf b}^{2}_{t}-{\bf b}^{2}_{t+1})+\frac{1}{\rho}(\overline{\bf d}_{t+1}-{\bf d}_{t}))}_{Y_{3}}.

Plugging Eq. 18 in Y3Y_{3} gives Y3=1ρ(𝜿𝐝¯t+1)T(𝐝t+1𝐝t)Y_{3}=\frac{1}{\rho}({\bm{\kappa}-\overline{\bf d}_{t+1}})^{T}({\bf d}_{t+1}-{\bf d}_{t}).

We use the following norm-square identity, which we prove at the end of this subsection:

Lemma 3.

For a symmetric 𝐌=𝐌T{\bf M}={\bf M}^{T} we have,

(𝐱𝐲)T𝐌(𝐳𝐭)=12(𝐱𝐭𝐌2𝐱𝐳𝐌2)+12(𝐲𝐳𝐌2𝐲𝐭𝐌2).({\bf x}-{\bf y})^{T}{\bf M}({\bf z}-{\bf t})=\frac{1}{2}\left(\|{\bf x}-{\bf t}\|_{{\bf M}}^{2}-\|{\bf x}-{\bf z}\|_{{\bf M}}^{2}\right)+\frac{1}{2}\left(\|{\bf y}-{\bf z}\|_{{\bf M}}^{2}-\|{\bf y}-{\bf t}\|_{{\bf M}}^{2}\right).

Using Lemma 3, we get

Y2=12(𝐛2𝐛t2𝐁𝐓𝐁2𝐛2𝐛t+12𝐁𝐓𝐁2)12𝐛t+12𝐛t2𝐁𝐓𝐁2Y_{2}=\frac{1}{2}\left(\|{\bf b}^{2}-{\bf b}^{2}_{t}\|_{{\bf{\bf B}^{T}{\bf B}}}^{2}-\|{\bf b}^{2}-{\bf b}^{2}_{t+1}\|_{{\bf{\bf B}^{T}{\bf B}}}^{2}\right)-\frac{1}{2}\|{\bf b}^{2}_{t+1}-{\bf b}^{2}_{t}\|_{{\bf{\bf B}^{T}{\bf B}}}^{2}

Y3=1ρ12(𝜿𝐝t𝐈2𝜿𝐝t+1𝐈2)+1ρ12(𝐝¯t+1𝐝t+1𝐈2𝐝¯t+1𝐝t𝐈2)Y_{3}=\frac{1}{\rho}\frac{1}{2}\left(\|{\bm{\kappa}}-{\bf d}_{t}\|_{{\bf I}}^{2}-\|{\bm{\kappa}}-{\bf d}_{t+1}\|_{{\bf I}}^{2}\right)+\frac{1}{\rho}\frac{1}{2}\left(\|\overline{\bf d}_{t+1}-{\bf d}_{t+1}\|_{{\bf I}}^{2}-\|\overline{\bf d}_{t+1}-{\bf d}_{t}\|_{{\bf I}}^{2}\right),

since 𝐁𝐓𝐁{\bf{\bf B}^{T}{\bf B}} and 𝐈{\bf I} are symmetric matrices.

We combine Y2Y_{2} and Y3Y_{3} as

\displaystyle\rho Y_{2}+Y_{3}= \frac{\rho}{2}\left(\|{\bf b}^{2}-{\bf b}^{2}_{t}\|_{{\bf B}^{T}{\bf B}}^{2}-\|{\bf b}^{2}-{\bf b}^{2}_{t+1}\|_{{\bf B}^{T}{\bf B}}^{2}\right)+\frac{1}{2\rho}\left(\|{\bm{\kappa}}-{\bf d}_{t}\|_{{\bf I}}^{2}-\|{\bm{\kappa}}-{\bf d}_{t+1}\|_{{\bf I}}^{2}\right)
\displaystyle\underbrace{-\frac{1}{2\rho}\|\overline{\bf d}_{t+1}-{\bf d}_{t}\|_{{\bf I}}^{2}}_{\leq 0}+\underbrace{\frac{1}{2\rho}\left(\|\overline{\bf d}_{t+1}-{\bf d}_{t+1}\|_{{\bf I}}^{2}-\rho^{2}\|{\bf b}^{2}_{t+1}-{\bf b}^{2}_{t}\|_{{\bf B}^{T}{\bf B}}^{2}\right)}_{=0\text{ due to Eq. \ref{eq.tilde}}}
\displaystyle\leq \frac{\rho}{2}\left(\|{\bf b}^{2}-{\bf b}^{2}_{t}\|_{{\bf B}^{T}{\bf B}}^{2}-\|{\bf b}^{2}-{\bf b}^{2}_{t+1}\|_{{\bf B}^{T}{\bf B}}^{2}\right)+\frac{1}{2\rho}\left(\|{\bm{\kappa}}-{\bf d}_{t}\|_{{\bf I}}^{2}-\|{\bm{\kappa}}-{\bf d}_{t+1}\|_{{\bf I}}^{2}\right).

Plugging it back to T1T_{1},

\displaystyle T_{1}=Y_{1}+\rho Y_{2}+Y_{3}\leq Y_{1}+\frac{\rho}{2}\left(\|{\bf b}^{2}-{\bf b}^{2}_{t}\|_{{\bf B}^{T}{\bf B}}^{2}-\|{\bf b}^{2}-{\bf b}^{2}_{t+1}\|_{{\bf B}^{T}{\bf B}}^{2}\right)+\frac{1}{2\rho}\left(\|{\bm{\kappa}}-{\bf d}_{t}\|_{{\bf I}}^{2}-\|{\bm{\kappa}}-{\bf d}_{t+1}\|_{{\bf I}}^{2}\right)\leq Y_{1}+{Z}_{t}-{Z}_{t+1}. (23)

Upper bounding the $T_{1}$ term in Eq. 22 using Eq. 23 gives

ψ(𝐛1,𝐛2)ψ(𝐛t+11,𝐛t+12)+𝜿T(𝐀𝐛t+11+𝐁𝐛t+12+𝒃)𝐝¯t+1T(𝐀𝐛1+𝐁𝐛2+𝐛)+ZtZt+10,\psi({\bf b}^{1},{\bf b}^{2})-\psi({\bf b}^{1}_{t+1},{\bf b}^{2}_{t+1})+{\bm{\kappa}}^{T}({\bf A}{\bf b}^{1}_{t+1}+{\bf B}{\bf b}^{2}_{t+1}+\bm{b})-\overline{\bf d}_{t+1}^{T}({\bf A}{\bf b}^{1}+{\bf B}{\bf b}^{2}+{\bf b})+{Z}_{t}-{Z}_{t+1}\geq{0},

which is the statement in Lemma 2.\square

Proof of Lemma 1 [from Nagurney (1998)].

Let $\phi(t)=\psi({\bf b}^{*}+t({\bf b}-{\bf b}^{*}))$ for $t\in[0,1]$ and ${\bf b}\in\mathcal{S}$. Since $\mathcal{S}$ is convex and closed, ${\bf b}^{*}+t({\bf b}-{\bf b}^{*})\in\mathcal{S}$, and $\phi$ achieves its minimum at $t=0$. Hence $0\leq\phi^{\prime}(0)=\left\langle\nabla\psi\left({\bf b}^{*}\right),{\bf b}-{\bf b}^{*}\right\rangle.\square$

Proof of Lemma 3.

Expand RHS as

RHS\displaystyle RHS =12(𝐭T𝐌𝐭2𝐱T𝐌𝐭𝐳T𝐌𝐳+2𝐱T𝐌𝐳)+12(𝐳T𝐌𝐳2𝐲T𝐌𝐳𝐭T𝐌𝐭+2𝐭T𝐌𝐲)\displaystyle=\frac{1}{2}\left({\bf t}^{T}{\bf M}{\bf t}-2{\bf x}^{T}{\bf M}{\bf t}-{\bf z}^{T}{\bf M}{\bf z}+2{\bf x}^{T}{\bf M}{\bf z}\right)+\frac{1}{2}\left({\bf z}^{T}{\bf M}{\bf z}-2{\bf y}^{T}{\bf M}{\bf z}-{\bf t}^{T}{\bf M}{\bf t}+2{\bf t}^{T}{\bf M}{\bf y}\right)
=𝐱T𝐌𝐭+𝐱T𝐌𝐳𝐲T𝐌𝐳+𝐭T𝐌𝐲\displaystyle=-{\bf x}^{T}{\bf M}{\bf t}+{\bf x}^{T}{\bf M}{\bf z}-{\bf y}^{T}{\bf M}{\bf z}+{\bf t}^{T}{\bf M}{\bf y}
=(𝐱𝐲)T𝐌(𝐳𝐭).\displaystyle=({\bf x}-{\bf y})^{T}{\bf M}({\bf z}-{\bf t}).

We reach the LHSLHS. \square
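As a quick numerical sanity check of Lemma 3 (outside the formal argument), the identity can be verified for random vectors and a random symmetric matrix. The NumPy sketch below is illustrative only; all names are ours.

import numpy as np

rng = np.random.default_rng(0)
m = 5
M = rng.standard_normal((m, m))
M = M + M.T                          # symmetric: M = M^T

def sqnorm(v, M):
    # ||v||_M^2 = v^T M v
    return v @ M @ v

x, y, z, t = rng.standard_normal((4, m))
lhs = (x - y) @ M @ (z - t)
rhs = 0.5 * (sqnorm(x - t, M) - sqnorm(x - z, M)) + 0.5 * (sqnorm(y - z, M) - sqnorm(y - t, M))
assert np.isclose(lhs, rhs)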

B.2 Proof for Theorem 2 (computational complexity)

Theorem 2.

Let $\{\hat{y}_{i}^{t},\bm{a}_{i}^{t}\}_{i=1}^{n}$ be the output of Algorithm 2 at the $t$-th iteration, $\tilde{y_{i}}\triangleq\frac{1}{T}\sum_{t=1}^{T}\hat{y}_{i}^{t}$ and $\tilde{\bm{a}_{i}}\triangleq\frac{1}{T}\sum_{t=1}^{T}\bm{a}_{i}^{t}$. Denote $\tilde{f}_{T}(\bm{x})\triangleq\max_{i}\langle\tilde{\bm{a}_{i}},\bm{x}-\bm{x}_{i}\rangle+\tilde{y_{i}}$. Assume $\max_{i,l}|x_{i,l}|\leq 1$ and $\mathbbm{Var}(\{y_{i}\}_{i=1}^{n})\leq 1$. If we choose $\rho=\frac{\sqrt{d}\lambda^{2}}{n}$, for $\lambda\geq\frac{3}{\sqrt{2nd}}$ and $T\geq n\sqrt{d}$, we have:

1ni(f~T(𝒙i)yi)2+λf~Tminf^(1ni(f^(𝒙i)yi)2+λf^)+6ndT+1.\displaystyle\frac{1}{n}\sum_{i}(\tilde{f}_{T}(\bm{x}_{i}){-}y_{i})^{2}{+}\lambda\|\tilde{f}_{T}\|\leq\min_{\hat{f}}\bigg{(}\frac{1}{n}\sum_{i}\big{(}\hat{f}(\bm{x}_{i})-y_{i}\big{)}^{2}+\lambda\|\hat{f}\|\bigg{)}+\frac{6n\sqrt{d}}{T+1}.
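For concreteness, the averaged estimator $\tilde{f}_{T}$ defined above can be evaluated directly from the averaged iterates. The NumPy sketch below assumes arrays `X` ($n\times d$), `a_tilde` ($n\times d$) and `y_tilde` (length $n$) holding $\bm{x}_{i}$, $\tilde{\bm{a}_{i}}$ and $\tilde{y_{i}}$; the array names are ours, not the paper's code.

import numpy as np

def f_tilde(x, X, a_tilde, y_tilde):
    """Evaluate f_tilde_T(x) = max_i <a_tilde_i, x - x_i> + y_tilde_i."""
    vals = a_tilde @ x - np.sum(a_tilde * X, axis=1) + y_tilde   # one value per linear piece i
    return np.max(vals)

# predictions on query points Q (m x d):
# preds = np.array([f_tilde(q, X, a_tilde, y_tilde) for q in Q])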

For the proof, we make extensive use of Theorem 1, which provides the convergence rate of a general 2-block ADMM for a separable convex program with linear constraints:

min𝐛1𝒮1,𝐛2𝒮2[ψ(𝐛1,𝐛2)=ψ1(𝐛1)+ψ2(𝐛2)]\displaystyle\min_{{\bf b}^{1}\in\mathcal{S}_{1},{\bf b}^{2}\in\mathcal{S}_{2}}\left[\psi({\bf b}^{1},{\bf b}^{2})=\psi_{1}({\bf b}^{1})+\psi_{2}({\bf b}^{2})\right] (24)
s.t:𝐀𝐛1+𝐁𝐛2+𝒃=𝟎.\displaystyle\quad s.t:{\bf A}{\bf b}^{1}+{\bf B}{\bf b}^{2}+\bm{b}=\bm{0}.

It guarantees that, for all ${\bm{\kappa}}$, the averaged iterates $(\tilde{\bf b}^{1}_{T},\tilde{\bf b}^{2}_{T})$ of a 2-block ADMM satisfy

ψ(𝐛~T1,𝐛~T2)ψ(𝐛1,𝐛2)𝜿T(𝐀𝐛~T1+𝐁𝐛~T2+𝒃)1T(ρ2𝐁𝐛22+12ρ𝜿2),\displaystyle\psi(\tilde{\bf b}^{1}_{T},\tilde{\bf b}^{2}_{T})-\psi({\bf b}^{1}_{*},{\bf b}^{2}_{*})-{\bm{\kappa}}^{T}({\bf A}\tilde{\bf b}^{1}_{T}+{\bf B}\tilde{\bf b}^{2}_{T}+\bm{b})\leq\frac{1}{T}\left(\frac{\rho}{2}\|{\bf B}{\bf b}^{2}_{*}\|^{2}+\frac{1}{2\rho}\|{\bm{\kappa}}\|^{2}\right), (25)

where $({\bf b}^{1}_{*},{\bf b}^{2}_{*})$ are the optimal solutions of program (24).
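To make the iteration analyzed by Theorem 1 concrete, the sketch below spells out a generic 2-block ADMM for program (24), in the sign convention used in the proof above (dual update ${\bf d}_{t+1}={\bf d}_{t}-\rho({\bf A}{\bf b}^{1}_{t+1}+{\bf B}{\bf b}^{2}_{t+1}+\bm{b})$). The block minimizers `argmin_b1` and `argmin_b2` are abstract placeholders for the closed-form updates; this is a schematic illustration, not the paper's implementation.

import numpy as np

def admm_2block(A, B, b, argmin_b1, argmin_b2, rho, T, n1, n2):
    """Generic 2-block ADMM; returns the averaged iterates appearing in Theorem 1."""
    b1, b2 = np.zeros(n1), np.zeros(n2)
    d = np.zeros(b.shape[0])                      # one dual variable per constraint row
    b1_avg, b2_avg = np.zeros(n1), np.zeros(n2)
    for _ in range(T):
        # b1-step: minimize psi_1(b1) + (rho/2)||A b1 + B b2 + b - d/rho||^2 over S_1
        b1 = argmin_b1(b2, d, rho)
        # b2-step: minimize psi_2(b2) + (rho/2)||A b1 + B b2 + b - d/rho||^2 over S_2
        b2 = argmin_b2(b1, d, rho)
        # dual step
        d = d - rho * (A @ b1 + B @ b2 + b)
        b1_avg += b1 / T
        b2_avg += b2 / T
    return b1_avg, b2_avg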

However, in Eq. (5) we are solving a specific instance of program (24) of the form

miny^i,𝒂i,𝒑i,Ll0,𝒖i0,si,j0[ψ(y^i,𝒂i,𝒑i,Ll,𝒖i,si,j)=1ni=1n(y^iyi)2+λl=1dLl]\displaystyle\min_{\hat{y}_{i},\bm{a}_{i},\bm{p}_{i},L_{l}\geq 0,\bm{u}_{i}\geq 0,s_{i,j}\geq 0}\left[\psi(\hat{y}_{i},\bm{a}_{i},\bm{p}_{i},L_{l},\bm{u}_{i},s_{i,j}){=}\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_{i}-y_{i})^{2}{+}\lambda\sum_{l=1}^{d}L_{l}\right] (26)
s.t.{si,j+y^iy^j𝒂j,𝒙i𝒙j=0i,j[n]×[n]ui,l+|pi,l|Ll=0i,l[n]×[d]ai,lpi,l=0i,l[n]×[d].\displaystyle\textrm{s.t.}\begin{cases}-s_{i,j}+\hat{y}_{i}-\hat{y}_{j}-\langle\bm{a}_{j},\bm{x}_{i}-\bm{x}_{j}\rangle=0&i,j\in[n]\times[n]\\ u_{i,l}+|p_{i,l}|-L_{l}=0&\,i,l\in[n]\times[d]\\ a_{i,l}-p_{i,l}=0&\,i,l\in[n]\times[d].\end{cases}
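After eliminating the slack variables ($s_{i,j}\geq 0$, $u_{i,l}\geq 0$, $p_{i,l}=a_{i,l}$), program (26) is the convex regression problem $\min\frac{1}{n}\sum_{i}(\hat{y}_{i}-y_{i})^{2}+\lambda\sum_{l}L_{l}$ subject to $\hat{y}_{i}\geq\hat{y}_{j}+\langle\bm{a}_{j},\bm{x}_{i}-\bm{x}_{j}\rangle$ and $|a_{i,l}|\leq L_{l}$. On small instances its optimal value can be cross-checked against the ADMM output with a generic solver; the CVXPY sketch below is only such a reference check under these assumptions, not the paper's ADMM solver.

import cvxpy as cp
import numpy as np

def convex_regression_reference(X, y, lam):
    """Solve the slack-eliminated form of program (26) with a generic solver (small n only)."""
    n, d = X.shape
    yhat = cp.Variable(n)
    a = cp.Variable((n, d))
    L = cp.Variable(d, nonneg=True)
    cons = [cp.max(cp.abs(a), axis=0) <= L]          # |a_{i,l}| <= L_l for all i, l
    for j in range(n):
        # yhat_i >= yhat_j + <a_j, x_i - x_j> for all i
        cons.append(yhat >= yhat[j] + (X - X[j]) @ a[j])
    obj = cp.Minimize(cp.sum_squares(yhat - y) / n + lam * cp.sum(L))
    prob = cp.Problem(obj, cons)
    prob.solve()
    return yhat.value, a.value, L.value, prob.value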

To match notation between (26) and (24), denote the optimal and averaged ADMM solutions of program (26) as

𝐛1\displaystyle{\bf b}^{1}_{*} =[y^1,,y^n,a1,1,a1,2,,an,d]T\displaystyle=[\hat{y}_{1},\dots,\hat{y}_{n},a_{1,1},a_{1,2},\dots,a_{n,d}]^{T}
𝐛2\displaystyle{\bf b}^{2}_{*} =[L1,,Ld,p1,1,,pn,d,u1,1,,un,d,s1,1,,sn,n]T\displaystyle=[L_{1},\dots,L_{d},p_{1,1},\dots,p_{n,d},u_{1,1},\dots,u_{n,d},s_{1,1},\dots,s_{n,n}]^{T}
𝐛~T1\displaystyle\tilde{\bf b}^{1}_{T} =[y~1,,y~n,a~1,1,a~1,2,,a~n,d]T\displaystyle=[\tilde{y}_{1},\dots,\tilde{y}_{n},\tilde{a}_{1,1},\tilde{a}_{1,2},\dots,\tilde{a}_{n,d}]^{T}
𝐛~T2\displaystyle\tilde{\bf b}^{2}_{T} =[L~1,,L~d,p~1,1,,p~n,d,u~1,1,,u~n,d,s~1,1,,s~n,n]T.\displaystyle=[\tilde{L}_{1},\dots,\tilde{L}_{d},\tilde{p}_{1,1},\dots,\tilde{p}_{n,d},\tilde{u}_{1,1},\dots,\tilde{u}_{n,d},\tilde{s}_{1,1},\dots,\tilde{s}_{n,n}]^{T}. (27)

Therefore, the separable losses evaluated at the averaged ADMM iterates of program (26) are

ψ1(𝐛~T1)=1ni=1n(y~iyi)2,\displaystyle\psi_{1}(\tilde{\bf b}^{1}_{T})=\frac{1}{n}\sum_{i=1}^{n}(\tilde{y}_{i}-y_{i})^{2}, (28)
ψ2(𝐛~T2)=λl=1dL~l.\displaystyle\psi_{2}(\tilde{\bf b}^{2}_{T})=\lambda\sum_{l=1}^{d}\tilde{L}_{l}.

Note that a convergence rate for $\psi_{1}(\tilde{\bf b}^{1}_{T})+\psi_{2}(\tilde{\bf b}^{2}_{T})$ is available by setting $\bm{\kappa}=0$ in (25). However, this is not sufficient for convergence of our objective $\frac{1}{n}\sum_{i}(\tilde{f}_{T}(\bm{x}_{i})-y_{i})^{2}+\lambda\|\tilde{f}_{T}\|$, because $(\tilde{\bf b}^{1}_{T},\tilde{\bf b}^{2}_{T})$ may violate the constraints in (26), which could result in $\tilde{f}_{T}(\bm{x}_{i})\neq\tilde{y}_{i}$ and $\|\tilde{f}_{T}\|\neq\sum_{l=1}^{d}\tilde{L}_{l}$. Therefore, the main step of the proof of Theorem 2 is to characterize and bound the effect of such constraint violations on our objective. Let us specify how much each linear constraint in program (26) is violated:

εi,j1\displaystyle\varepsilon^{1}_{i,j} s~i,j+y~iy~j𝒂j~,𝒙i𝒙j,\displaystyle\triangleq-\tilde{s}_{i,j}+\tilde{y}_{i}-\tilde{y}_{j}-\langle\tilde{\bm{a}_{j}},\bm{x}_{i}-\bm{x}_{j}\rangle, (29)
εi,l2\displaystyle\varepsilon^{2}_{i,l} u~i,l+|p~i,l|L~l,\displaystyle\triangleq\tilde{u}_{i,l}+|\tilde{p}_{i,l}|-\tilde{L}_{l},
εi,l3\displaystyle\varepsilon^{3}_{i,l} a~i,lp~i,l.\displaystyle\triangleq\tilde{a}_{i,l}-\tilde{p}_{i,l}.
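These residuals are easy to compute from the averaged iterates. The NumPy sketch below does so with broadcasting, assuming arrays `X`, `y_tilde`, `a_tilde`, `p_tilde`, `u_tilde`, `s_tilde` and `L_tilde` holding the averaged quantities; the names are illustrative, not taken from the paper's code.

import numpy as np

def constraint_violations(X, y_tilde, a_tilde, p_tilde, u_tilde, s_tilde, L_tilde):
    """Compute eps1 (n x n), eps2 (n x d), eps3 (n x d) as defined in Eq. (29)."""
    # inner[i, j] = <a_tilde_j, x_i - x_j>
    inner = X @ a_tilde.T - np.sum(a_tilde * X, axis=1)[None, :]
    eps1 = -s_tilde + y_tilde[:, None] - y_tilde[None, :] - inner
    eps2 = u_tilde + np.abs(p_tilde) - L_tilde[None, :]
    eps3 = a_tilde - p_tilde
    return eps1, eps2, eps3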

We break the proof into smaller parts by first providing some intermediate lemmas.

Lemma 4.

ψ(𝐛1,𝐛2)=1ni(y^iyi)2+λlLl1\psi({\bf b}^{1}_{*},{\bf b}^{2}_{*})=\frac{1}{n}\sum_{i}(\hat{y}_{i}-y_{i})^{2}+\lambda\sum_{l}L_{l}\leq 1.

Proof.

Note that ${\bf b}^{1}={\bf b}^{2}=\bm{0}$ is a feasible solution and $(\hat{y}_{i},L_{l})$ is the optimal solution. Also, Algorithm 2 normalizes the dataset so that $\sum_{i=1}^{n}y_{i}=0$; therefore

ψ(𝐛1,𝐛2)\displaystyle\psi({\bf b}^{1}_{*},{\bf b}^{2}_{*}) ψ(𝟎,𝟎)\displaystyle\leq\psi(\bm{0},\bm{0})
=1ni=1nyi2=𝕍𝕒𝕣({yi}i=1n)1.\displaystyle=\frac{1}{n}\sum_{i=1}^{n}y_{i}^{2}=\mathbbm{Var}(\{y_{i}\}_{i=1}^{n})\leq 1.
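For reference, one simple way to bring a dataset into the regime assumed by Theorem 2 ($\sum_{i}y_{i}=0$, $\mathbbm{Var}(\{y_{i}\})\leq 1$, $\max_{i,l}|x_{i,l}|\leq 1$) is sketched below. This is only an illustration of the assumptions and not necessarily the exact normalization performed by Algorithm 2; note that rescaling $y$ also rescales the effective $\lambda$.

import numpy as np

def normalize(X, y):
    """Center y, bound its variance by 1, and scale each feature into [-1, 1]."""
    y = y - y.mean()                                   # sum_i y_i = 0
    y = y / max(1.0, y.std())                          # Var(y) <= 1
    scale = np.maximum(1.0, np.abs(X).max(axis=0))     # per-feature max |x_{i,l}|
    X = X / scale                                      # max_{i,l} |x_{i,l}| <= 1
    return X, y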

Lemma 5.

𝐁𝐛22218n2λ2\|{\bf B}{\bf b}^{2}_{*}\|_{2}^{2}\leq 18\frac{n^{2}}{\lambda^{2}}.

Proof.

Since $(p_{i,l})$ is part of a feasible solution, the constraints $u_{i,l}+|p_{i,l}|-L_{l}=0$ and $u_{i,l}\geq 0$ give $|p_{i,l}|\leq L_{l}$. We also have:

0si,j\displaystyle 0\leq s_{i,j} =y^iy^j𝒂j,𝒙i𝒙j\displaystyle=\hat{y}_{i}-\hat{y}_{j}-\langle\bm{a}_{j},\bm{x}_{i}-\bm{x}_{j}\rangle (by constraints definitions)
𝒂i𝒂j,𝒙i𝒙j\displaystyle\leq\langle\bm{a}_{i}-\bm{a}_{j},\bm{x}_{i}-\bm{x}_{j}\rangle (convexity)
\displaystyle\leq 2\sum_{l}L_{l}|x_{i,l}-x_{j,l}| \qquad(\max_{i}|a_{i,l}|\leq L_{l})
4lLl\displaystyle\leq 4\sum_{l}L_{l} (maxi|xi,l|1).\displaystyle\text{($\max_{i}|x_{i,l}|\leq 1)$}.

Arranging the constraints in (26) into $(n^{2}+nd+nd)$ rows and separating the coefficients of ${\bf b}^{1}_{*}$ and ${\bf b}^{2}_{*}$ into ${\bf A}$ and ${\bf B}$, in that order, we have:

𝐀𝐛1+𝐁𝐛2=0\displaystyle{\bf A}{\bf b}^{1}_{*}+{\bf B}{\bf b}^{2}_{*}=0
𝐁𝐛2=[s1,1,,sn,n,01,1,,0n,d,p1,1,,pn,d]T.\displaystyle{\bf B}{\bf b}^{2}_{*}=[-s_{1,1},\dots,-s_{n,n},0_{1,1},\dots,0_{n,d},-p_{1,1},\dots,-p_{n,d}]^{T}.

Taking the norm

𝐁𝐛222\displaystyle\|{\bf B}{\bf b}^{2}_{*}\|_{2}^{2} =i,jsi,j2+i,lpi,l2\displaystyle=\sum_{i,j}s_{i,j}^{2}+\sum_{i,l}p_{i,l}^{2}
n2(4l=1dLl)2+nl=1dLl2\displaystyle\leq n^{2}(4\sum_{l=1}^{d}L_{l})^{2}+n\sum_{l=1}^{d}L_{l}^{2}
18n2λ2.\displaystyle\leq 18\frac{n^{2}}{\lambda^{2}}. (Lemma 4)

Lemma 6.

$0\leq\tilde{f}_{T}(\bm{x}_{i})-\tilde{y}_{i}\leq\max_{j}(-\varepsilon^{1}_{i,j})\leq\max_{j}|\varepsilon^{1}_{i,j}|$.

Proof.
\displaystyle\tilde{y}_{i}\leq\tilde{f}_{T}(\bm{x}_{i})=\max_{j}\langle\tilde{\bm{a}}_{j},\bm{x}_{i}-\bm{x}_{j}\rangle+\tilde{y}_{j} \qquad\text{(definition)}
\displaystyle=\max_{j}(\tilde{y}_{i}-\tilde{s}_{i,j}-\varepsilon^{1}_{i,j}) \qquad\text{($\varepsilon^{1}_{i,j}$ definition)}
\displaystyle\leq\tilde{y}_{i}+\max_{j}(-\varepsilon^{1}_{i,j})\leq\tilde{y}_{i}+\max_{j}|\varepsilon^{1}_{i,j}|. \qquad\text{($\tilde{s}_{i,j}\geq 0$)}

Lemma 7.
1ni(f~T(𝒙i)yi)21ni(y~iyi)2+1nimaxj(εi,j1)2+21nimaxj(εi,j1)21+1T9ρn2λ2.\displaystyle\frac{1}{n}\sum_{i}\big{(}\tilde{f}_{T}(\bm{x}_{i})-y_{i}\big{)}^{2}\leq\frac{1}{n}\sum_{i}(\tilde{y}_{i}-y_{i})^{2}+\frac{1}{n}\sum_{i}\max_{j}(\varepsilon^{1}_{i,j})^{2}+2\sqrt{\frac{1}{n}\sum_{i}\max_{j}(\varepsilon^{1}_{i,j})^{2}}\sqrt{1+\frac{1}{T}\frac{9\rho n^{2}}{\lambda^{2}}}.
Proof.
1ni(f~T(𝒙i)yi)2\displaystyle\frac{1}{n}\sum_{i}\big{(}\tilde{f}_{T}(\bm{x}_{i})-y_{i}\big{)}^{2} =1ni(f~T(𝒙i)y~i+y~iyi)2\displaystyle=\frac{1}{n}\sum_{i}\big{(}\tilde{f}_{T}(\bm{x}_{i})-\tilde{y}_{i}+\tilde{y}_{i}-y_{i}\big{)}^{2}
\displaystyle=\frac{1}{n}\sum_{i}(\tilde{y}_{i}-y_{i})^{2}+\frac{1}{n}\sum_{i}\big{(}\tilde{f}_{T}(\bm{x}_{i})-\tilde{y}_{i}\big{)}^{2}+\frac{2}{n}\sum_{i}\big{(}\tilde{f}_{T}(\bm{x}_{i})-\tilde{y}_{i}\big{)}\big{(}\tilde{y}_{i}-y_{i}\big{)}
\displaystyle\leq\frac{1}{n}\sum_{i}(\tilde{y}_{i}-y_{i})^{2}+\frac{1}{n}\sum_{i}\max_{j}(\varepsilon^{1}_{i,j})^{2}+\frac{2}{n}\sum_{i}\max_{j}|\varepsilon^{1}_{i,j}|\,\big{|}\tilde{y}_{i}-y_{i}\big{|}. \qquad\text{(Lemma 6)}

Let us focus on the last term on the RHS.

\displaystyle\frac{\left(\frac{1}{n}\sum_{i}\max_{j}|\varepsilon^{1}_{i,j}|\,|\tilde{y}_{i}-y_{i}|\right)^{2}}{\frac{1}{n}\sum_{i}\max_{j}(\varepsilon^{1}_{i,j})^{2}}\leq\frac{1}{n}\sum_{i}(\tilde{y}_{i}-y_{i})^{2} \qquad\text{(Cauchy--Schwarz)}
1ni(y~iyi)2+λlL~l\displaystyle\leq\frac{1}{n}\sum_{i}(\tilde{y}_{i}-y_{i})^{2}+\lambda\sum_{l}\tilde{L}_{l} (L~l0\tilde{L}_{l}\geq 0)
1ni(y^iyi)2+λlLl+1Tρ2𝐁𝐛22\displaystyle\leq\frac{1}{n}\sum_{i}(\hat{y}_{i}-y_{i})^{2}+\lambda\sum_{l}L_{l}+\frac{1}{T}\frac{\rho}{2}\|{\bf B}{\bf b}^{2}_{*}\|^{2} (Use Theorem 1 with 𝜿=0\bm{\kappa}=0)
1+1Tρ2𝐁𝐛22\displaystyle\leq 1+\frac{1}{T}\frac{\rho}{2}\|{\bf B}{\bf b}^{2}_{*}\|^{2} (Lemma 4)
1+1T9ρn2λ2.\displaystyle\leq 1+\frac{1}{T}\frac{9\rho n^{2}}{\lambda^{2}}. (Lemma 5)

On the other hand, for the regularization term we have:

Lemma 8.

$\|\tilde{f}_{T}\|\leq\sum_{l}\left(|\tilde{L}_{l}|+\max_{i}|\varepsilon^{2}_{i,l}|+\max_{i}|\varepsilon^{3}_{i,l}|\right)$.

Proof.
f~T\displaystyle\|\tilde{f}_{T}\| =lmaxi|𝒂~i,l|\displaystyle=\sum_{l}\max_{i}|\tilde{\bm{a}}_{i,l}| (definition)
=lmaxi|𝒑~i,l+εi,l3|\displaystyle=\sum_{l}\max_{i}|\tilde{\bm{p}}_{i,l}+\varepsilon^{3}_{i,l}| (εi,l3\varepsilon^{3}_{i,l} definition)
\displaystyle\leq\sum_{l}\left(|\tilde{L}_{l}|+\max_{i}|\varepsilon^{2}_{i,l}|+\max_{i}|\varepsilon^{3}_{i,l}|\right). \qquad\text{($\varepsilon^{2}_{i,l}$ definition and $\tilde{u}_{i,l}\geq 0$)}
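Lemma 8 can also be verified numerically for a given run: using the residuals of Eq. (29) and the averaged iterates, the sketch below compares $\|\tilde{f}_{T}\|=\sum_{l}\max_{i}|\tilde{a}_{i,l}|$ with the bound of Lemma 8 (array names as in the earlier sketches; illustrative only).

import numpy as np

def check_lemma8(a_tilde, L_tilde, eps2, eps3):
    penalty = np.sum(np.max(np.abs(a_tilde), axis=0))        # ||f_tilde_T||
    bound = np.sum(np.abs(L_tilde)
                   + np.max(np.abs(eps2), axis=0)
                   + np.max(np.abs(eps3), axis=0))            # RHS of Lemma 8
    return penalty, bound, penalty <= bound + 1e-9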

Next, we combine the previous lemmas to prove Theorem 2. Before proceeding, let us incorporate the definitions of the constraint violations in Eq. (29) into ${\bm{\kappa}}^{T}({\bf A}\tilde{\bf b}^{1}_{T}+{\bf B}\tilde{\bf b}^{2}_{T}+\bm{b})$ to get:

𝜿T(𝐀𝐛~T1+𝐁𝐛~T2+𝒃)=𝜿1T𝜺1+𝜿2T𝜺2+𝜿3T𝜺3,\displaystyle\bm{\kappa}^{T}({\bf A}\tilde{\bf b}^{1}_{T}+{\bf B}\tilde{\bf b}^{2}_{T}+\bm{b})={\bm{\kappa}^{1}}^{T}\bm{\varepsilon}^{1}+{\bm{\kappa}^{2}}^{T}\bm{\varepsilon}^{2}+{\bm{\kappa}^{3}}^{T}\bm{\varepsilon}^{3}, (30)

where

\displaystyle\bm{\kappa}^{1}\triangleq[\kappa^{1}_{1,1},\dots,\kappa^{1}_{n,n}]^{T},\qquad\bm{\kappa}^{2}\triangleq[\kappa^{2}_{1,1},\dots,\kappa^{2}_{n,d}]^{T},\qquad\bm{\kappa}^{3}\triangleq[\kappa^{3}_{1,1},\dots,\kappa^{3}_{n,d}]^{T}
\displaystyle\bm{\varepsilon}^{1}\triangleq[\varepsilon^{1}_{1,1},\dots,\varepsilon^{1}_{n,n}]^{T},\qquad\bm{\varepsilon}^{2}\triangleq[\varepsilon^{2}_{1,1},\dots,\varepsilon^{2}_{n,d}]^{T},\qquad\bm{\varepsilon}^{3}\triangleq[\varepsilon^{3}_{1,1},\dots,\varepsilon^{3}_{n,d}]^{T}.

Substituting (30) in (25) we have:

Δ\displaystyle\Delta ψ(𝐛~T1,𝐛~T2)ψ(𝐛1,𝐛2)𝜿1T𝜺1𝜿2T𝜺2𝜿3T𝜺3\displaystyle\triangleq\psi(\tilde{\bf b}^{1}_{T},\tilde{\bf b}^{2}_{T})-\psi({\bf b}^{1}_{*},{\bf b}^{2}_{*})-{\bm{\kappa}^{1}}^{T}\bm{\varepsilon}^{1}-{\bm{\kappa}^{2}}^{T}\bm{\varepsilon}^{2}-{\bm{\kappa}^{3}}^{T}\bm{\varepsilon}^{3}
1T(ρ2𝐁𝐛22+12ρ𝜿12+12ρ𝜿22+12ρ𝜿32).\displaystyle\leq\frac{1}{T}\left(\frac{\rho}{2}\|{\bf B}{\bf b}^{2}_{*}\|^{2}+\frac{1}{2\rho}\|{\bm{\kappa}^{1}}\|^{2}+\frac{1}{2\rho}\|{\bm{\kappa}^{2}}\|^{2}+\frac{1}{2\rho}\|{\bm{\kappa}^{3}}\|^{2}\right). (31)
Proof.

Multiply both sides of Lemma 8 by $\lambda$, add the result to Lemma 7, and subtract $\psi({\bf b}^{1}_{*},{\bf b}^{2}_{*})$, the optimal value of program (26), i.e.\ $\min_{\hat{f}}\big{(}\frac{1}{n}\sum_{i}(\hat{f}(\bm{x}_{i})-y_{i})^{2}+\lambda\|\hat{f}\|\big{)}$. The total approximation error due to the ADMM optimization scheme is then the $LHS$:

LHS\displaystyle LHS 1ni(f~T(𝒙i)yi)2+λf~Tminf^(1ni(f^(𝒙i)yi)2+λf^)\displaystyle\triangleq\frac{1}{n}\sum_{i}\big{(}\tilde{f}_{T}(\bm{x}_{i})-y_{i}\big{)}^{2}+\lambda\|\tilde{f}_{T}\|-\min_{\hat{f}}\bigg{(}\frac{1}{n}\sum_{i}\big{(}\hat{f}(\bm{x}_{i})-y_{i}\big{)}^{2}+\lambda\|\hat{f}\|\bigg{)}
ψ(𝐛~T1,𝐛~T2)ψ(𝐛1,𝐛2)\displaystyle\leq\psi(\tilde{\bf b}^{1}_{T},\tilde{\bf b}^{2}_{T})-\psi({\bf b}^{1}_{*},{\bf b}^{2}_{*})
\displaystyle+\frac{1}{n}\sum_{i}\max_{j}(\varepsilon^{1}_{i,j})^{2}+2\underbrace{\sqrt{\frac{1}{n}\sum_{i}\max_{j}(\varepsilon^{1}_{i,j})^{2}}}_{\epsilon_{1}}\underbrace{\sqrt{1+\frac{1}{T}\frac{9\rho n^{2}}{\lambda^{2}}}}_{\epsilon_{2}}+\lambda\sum_{l}\big{(}\max_{i}|\varepsilon^{2}_{i,l}|+\max_{i}|\varepsilon^{3}_{i,l}|\big{)}
RHS.\displaystyle\triangleq RHS.

First assume $\epsilon_{1}\leq\epsilon_{2}$. Then,

\displaystyle RHS\leq\psi(\tilde{\bf b}^{1}_{T},\tilde{\bf b}^{2}_{T})-\psi({\bf b}^{1}_{*},{\bf b}^{2}_{*})+3\sqrt{\frac{1}{n}\sum_{i}\max_{j}(\varepsilon^{1}_{i,j})^{2}}\sqrt{1+\frac{1}{T}\frac{9\rho n^{2}}{\lambda^{2}}}+\lambda\sum_{l}\big{(}\max_{i}|\varepsilon^{2}_{i,l}|+\max_{i}|\varepsilon^{3}_{i,l}|\big{)}.

In Eq. (31) set

κi,j1\displaystyle\kappa^{1}_{i,j} ={31+1T9ρn2λ2εi,j1ni(maxkεi,k1)2if j=argmaxkεi,k10otherwise\displaystyle=\begin{cases}-\frac{3\sqrt{1+\frac{1}{T}\frac{9\rho n^{2}}{\lambda^{2}}}\varepsilon^{1}_{i,j}}{\sqrt{n\sum_{i}(\max_{k}\varepsilon^{1}_{i,k})^{2}}}&\text{if }j=\arg\max_{k}\varepsilon^{1}_{i,k}\\ 0&\text{otherwise }\end{cases}
κi,l2\displaystyle\kappa^{2}_{i,l} ={λsignεi,l2if i=argmaxk|εk,l2|0otherwise\displaystyle=\begin{cases}-\lambda\operatorname{sign}\varepsilon^{2}_{i,l}&\text{if }i=\arg\max_{k}|\varepsilon^{2}_{k,l}|\\ 0&\text{otherwise }\end{cases}
κi,l3\displaystyle\kappa^{3}_{i,l} ={λsignεi,l3if i=argmaxk|εk,l3|0otherwise.\displaystyle=\begin{cases}-\lambda\operatorname{sign}\varepsilon^{3}_{i,l}&\text{if }i=\arg\max_{k}|\varepsilon^{3}_{k,l}|\\ 0&\text{otherwise.}\end{cases}

We have

\displaystyle RHS\leq\Delta \leq\frac{1}{T}\left(\frac{\rho}{2}\|{\bf B}{\bf b}^{2}_{*}\|^{2}+\frac{9}{2n\rho}\Big(1+\frac{1}{T}\frac{9\rho n^{2}}{\lambda^{2}}\Big)+\frac{d}{\rho}\lambda^{2}\right)
1T(9ρn2λ2+92nρ(1+1T9ρn2λ2)+dρλ2).\displaystyle\leq\frac{1}{T}\left(\frac{9\rho n^{2}}{\lambda^{2}}+\frac{9}{2n\rho}(1+\frac{1}{T}\frac{9\rho n^{2}}{\lambda^{2}})+\frac{d}{\rho}\lambda^{2}\right).

Setting $\rho=\frac{\sqrt{d}\lambda^{2}}{n}$, we get $LHS\leq RHS\leq\Delta\leq\frac{6n\sqrt{d}}{T}$, provided $\lambda\geq\frac{3}{\sqrt{2nd}}$ and $n\rho T\geq\frac{9}{2}$.

Now assume ϵ1ϵ2\epsilon_{1}\geq\epsilon_{2}. We get:

RHS\displaystyle RHS ψ(𝐛~T1,𝐛~T2)ψ(𝐛1,𝐛2)+31ni(maxjεi,j1)2+λl(maxi|εi,l2|+maxi|εi,l3|).\displaystyle\leq\psi(\tilde{\bf b}^{1}_{T},\tilde{\bf b}^{2}_{T})-\psi({\bf b}^{1}_{*},{\bf b}^{2}_{*})+3\frac{1}{n}\sum_{i}(\max_{j}\varepsilon^{1}_{i,j})^{2}+\lambda\sum_{l}\big{(}\max_{i}|\varepsilon^{2}_{i,l}|+\max_{i}|\varepsilon^{3}_{i,l}|\big{)}.

In Eq. (31) set

κi,j1\displaystyle\kappa^{1}_{i,j} ={4nεi,j1if j=argmaxkεi,k10otherwise\displaystyle=\begin{cases}-\frac{4}{n}\varepsilon^{1}_{i,j}&\text{if }j=\arg\max_{k}\varepsilon^{1}_{i,k}\\ 0&\text{otherwise }\end{cases}
κi,l2\displaystyle\kappa^{2}_{i,l} ={λsignεi,l2if i=argmaxk|εk,l2|0otherwise\displaystyle=\begin{cases}-\lambda\operatorname{sign}\varepsilon^{2}_{i,l}&\text{if }i=\arg\max_{k}|\varepsilon^{2}_{k,l}|\\ 0&\text{otherwise }\end{cases}
κi,l3\displaystyle\kappa^{3}_{i,l} ={λsignεi,l3if i=argmaxk|εk,l3|0otherwise.\displaystyle=\begin{cases}-\lambda\operatorname{sign}\varepsilon^{3}_{i,l}&\text{if }i=\arg\max_{k}|\varepsilon^{3}_{k,l}|\\ 0&\text{otherwise.}\end{cases}

We get

Δ\displaystyle\Delta =ψ(𝐛~T1,𝐛~T2)ψ(𝐛1,𝐛2)+41ni(maxjεi,j1)2+λl(maxi|εi,l2|+maxi|εi,l3|)\displaystyle=\psi(\tilde{\bf b}^{1}_{T},\tilde{\bf b}^{2}_{T})-\psi({\bf b}^{1}_{*},{\bf b}^{2}_{*})+4\frac{1}{n}\sum_{i}(\max_{j}\varepsilon^{1}_{i,j})^{2}+\lambda\sum_{l}\big{(}\max_{i}|\varepsilon^{2}_{i,l}|+\max_{i}|\varepsilon^{3}_{i,l}|\big{)}
1T(9ρn2λ2+162ρn2i(maxjεi,j1)2+dρλ2).\displaystyle\leq\frac{1}{T}\left(\frac{9\rho n^{2}}{\lambda^{2}}+\frac{16}{2\rho n^{2}}\sum_{i}(\max_{j}\varepsilon^{1}_{i,j})^{2}+\frac{d}{\rho}\lambda^{2}\right).

It is straightforward to see that if $n\rho T\geq 8$, we have

LHSRHSΔ162ρn2Ti(maxjεi,j1)2=O(1T(ρn2λ2+dρλ2)).\displaystyle LHS\leq RHS\leq\Delta-\frac{16}{2\rho n^{2}T}\sum_{i}(\max_{j}\varepsilon^{1}_{i,j})^{2}=O(\frac{1}{T}(\frac{\rho n^{2}}{\lambda^{2}}+\frac{d}{\rho}\lambda^{2})).

Setting $\rho=\frac{\sqrt{d}\lambda^{2}}{n}$, we get $LHS\leq\frac{3n\sqrt{d}}{T}$, provided $T\geq\frac{8}{\sqrt{d}\lambda^{2}}$. ∎

Appendix C Experimental results

This section provides the regression and classification experimental results in tabular format. Table 3 compares the regression performance of our method against baseline methods on benchmark datasets. Table 4 compares the classification accuracy of our method against baseline methods. We present the run time of our method on selected UCI datasets in Table 5. Here, we omit the baseline method because it did not run to completion on most of the datasets. Hyperparameters were fixed for all timing results.

Table 3: Comparison of regression performance on UCI datasets, reporting $R^{2}\times 100$ for each method.
dataset, $n$, $d$, dc regression (this paper), lasso, random forest, xgboost
Parkinson Speech Dataset 702 52 -3.7 -4 -14.8 -20
Garment Productivity 905 37 25.5 15.2 45.9 44.2
Concrete Compressive Strength 1030 8 91.7 59.9 91.8 92.5
Geographical Original of Music-1 1059 68 21.9 16.6 25.1 26
Geographical Original of Music-2 1059 68 32.6 23.8 31.3 30.8
Solar Flare-1 1066 23 3.3 6.1 -28 6.2
Solar Flare-2 1066 23 -6.1 -2.1 -17 2.1
Airfoil Self-Noise 1503 5 95.2 51.1 93.3 95
Communities and Crime 1994 122 62.1 64.5 65.4 65.8
SML2010 3000 24 90.8 96 93.4 92.1
SML2010 3000 24 77.3 94 92.4 90
Parkinson’s Telemonitoring 4406 25 93.8 98.4 92.3 98.2
Parkinson’s Telemonitoring 4406 25 95.5 98.5 92 98
Wine Quality 4898 11 51.8 27.2 52.6 51
Bias Correction of Temperature Forecast-1 6200 52 62.9 60.8 64.2 65.8
Bias Correction of Temperature Forecast-2 6200 52 75.2 76.5 76.3 76.7
Seoul Bike Sharing Demand 6570 19 89.7 80.7 93.1 92.3
Air Quality-1 7110 21 87 89.5 88.2 89.9
Air Quality-2 7110 21 100 94.7 99.8 100
Air Quality-3 7110 21 86.6 86.4 87.7 88.5
Air Quality-4 7110 21 81.7 84 78.1 80.7
Combined Cycle Power Plant 9568 4 95.5 92.9 96.5 96.7
Table 4: Comparison of classification performance on UCI datasets. The first four method columns report accuracy (%) for pbdl, pbdl0, random forest, and xgboost; the last two columns report the run time of all fits (sec) for pbdl and pbdl0.
dataset, $n$, $d$, pbdl acc., pbdl0 acc., random forest acc., xgboost acc., pbdl time, pbdl0 time
Iris 149 4 96.0 98.7 96 96.6 22.1 46.2
Wine 178 13 96.0 96.6 96.7 95.5 23.6 496.2
Blood Transfusion 748 4 72.6 74.8 72.9 78.4 46.3 21514.0
Breast Cancer Wisconsin 569 30 94.4 96.5 96.1 96.0 114.0 224407
Balance Scale 625 4 84.6 93.0 83 92.2 150.3 617.0
Abalone 4177 10 22.6 N/A 24.9 26.2 3688.0 N/A
Table 5: DC Regression run time on UCI datasets. (The baseline is omitted because it did not run to completion on most datasets.)
dataset, $n$, $d$, this paper (seconds)
Parkinson Speech Dataset 702 52 3.4
Garment Productivity 905 37 4.12
Concrete Compressive Strength 1030 8 3.28
Geographical Original of Music 1059 68 4.44
Solar Flare 1066 23 3.64
Airfoil Self-Noise 1503 5 4.92
Communities and Crime 1994 122 12.82
SML2010 3000 24 21.78
Figure 3: Classification accuracy for all fits on UCI datasets