
Faster Algorithms for Learning Convex Functions

Ali Siahkamari    Durmus Alp Emre Acar    Christopher Liao    Kelly Geyer    Venkatesh Saligrama    Brian Kulis
Abstract

The task of approximating an arbitrary convex function arises in several learning problems such as convex regression, learning with a difference of convex (DC) functions, and learning Bregman or $f$-divergences. In this paper, we develop and analyze an approach for solving a broad range of convex function learning problems that is faster than state-of-the-art approaches. Our approach is based on a 2-block ADMM method where each block can be computed in closed form. For the task of convex Lipschitz regression, we establish that our proposed algorithm converges with iteration complexity of $O(n\sqrt{d}/\epsilon)$ for a dataset $\bm{X}\in\mathbb{R}^{n\times d}$ and $\epsilon>0$. Combined with per-iteration computation complexity, our method converges with the rate $O(n^{3}d^{1.5}/\epsilon+n^{2}d^{2.5}/\epsilon+nd^{3}/\epsilon)$. This new rate improves on the state-of-the-art rate of $O(n^{5}d^{2}/\epsilon)$ whenever $d=o(n^{4})$. Further, we provide similar solvers for DC regression and Bregman divergence learning. Unlike previous approaches, our method is amenable to the use of GPUs. We demonstrate on regression and metric learning experiments that our approach is over 100 times faster than existing approaches on some data sets, and produces results that are comparable to the state of the art.


1 Introduction

The importance of convex functions in machine learning is undisputed. However, while most applications of convexity in machine learning involve fixed convex functions, an emerging trend in the field has focused on the problem of learning convex functions for a particular task. In this setting, data or supervision is used to tailor a convex function for some end goal. Recent machine learning examples include learning with a difference of convex (DC) functions (Siahkamari et al., 2020), learning input convex neural networks for tasks in reinforcement learning or vision (Amos et al., 2017), and learning divergences between points or distributions via learning the underlying convex function in a Bregman divergence (Siahkamari et al., 2019) or $f$-divergence (Zhang et al., 2020) for problems such as data generation, metric learning, and others.

In fact, work in convex function learning dates back at least to methods developed for the problem of convex regression, an important learning problem that is used commonly in econometrics and engineering for modeling demand, utility and production curves (Afriat, 1967; Varian, 1982). Convex regression is the problem of estimating a convex function from a dataset $\{(\bm{x}_{i},y_{i})\}_{i=1}^{n}$, where $\bm{x}_{i}\in\mathbb{R}^{d}$ are predictors and $y_{i}$ are continuous responses. Consider

\mathcal{F}\triangleq\{f:\mathbb{R}^{d}\to\mathbb{R}\ |\ f\text{ is convex}\}, \qquad (1)

the class of all convex functions over $\mathbb{R}^{d}$. An estimator is then obtained by minimizing the squared error between the responses $y_{i}$ and the fitted values $f(\bm{x}_{i})$,

\hat{f}\triangleq\arg\min_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}(y_{i}-f(\bm{x}_{i}))^{2}+\lambda\|f\|, \qquad (2)

where $\|f\|$ is some penalty term. A key insight about this problem is that although Eq. (2) is an infinite-dimensional minimization problem, Boyd et al. (2004) show it can be reduced to a finite-dimensional convex optimization problem, since the optimal solution can be shown to be piecewise linear. Furthermore, generalization bounds for this problem have recently been developed in (Balázs, 2016; Siahkamari et al., 2020) using certain classes of penalty $\|f\|$.

Many other convex function learning problems have a similar structure, and also yield optimization problems that can be tractably solved. However, a key downside is that the computational complexity is often prohibitive for real-world applications, particularly due to the non-parametric nature of the resulting optimization problems. For instance, the state-of-the-art solver for convex regression is based on interior point methods and has a complexity of $O(n^{5}d^{2}/\epsilon)$ for a given accuracy $\epsilon$.

Thus, our goal in this paper is to study a class of methods that yields provably faster convergence for a number of different convex function learning problems. Our main insight is that many convex function learning problems may be expressed as optimization problems that can be solved with a 2-block ADMM algorithm such that each of the two blocks can be computed in closed form. For convex regression, the resulting method (which we call FCR, for Faster Convex Regression) is guaranteed to converge with computational complexity of $O(n^{3}d^{1.5}/\epsilon+n^{2}d^{2.5}/\epsilon+nd^{3}/\epsilon)$ for a dataset $\bm{X}\in\mathbb{R}^{n\times d}$ and $\epsilon>0$. This new rate improves on the state-of-the-art rate of $O(n^{5}d^{2}/\epsilon)$ in (Balázs, 2016), available via interior point methods, whenever $d=o(n^{4})$. In addition to an improved convergence rate, FCR offers several further advantages. Firstly, FCR is stand-alone and does not need any external optimization software. FCR is based on tensor computations and can easily be implemented on GPUs. Furthermore, FCR can include regularization terms in the context of non-parametric convex regression.

We extend the FCR solver to problems beyond convex regression. In particular, we provide 2-block ADMM solvers and present empirical comparisons for DC regression (Siahkamari et al., 2020) and Bregman divergence learning (Siahkamari et al., 2019). Finally, we demonstrate empirical results showing that our approach yields significantly more scalable convex learning methods, resulting in speedups of over 100 times on some problems.

Notation. We generally denote scalars as lower case letters, vectors as lower case bold letters, and matrices as upper case bold letters. For a dataset $\bm{X}\in\mathbb{R}^{n\times d}$, $n$ represents the number of observations and $d$ is the number of covariates. We define $x^{+}=\max\{x,0\}$, $x^{-}=\max\{-x,0\}$, and $\times$ denotes the outer product. We define $[n]$ to be the set of integers $\{1,\dots,n\}$. By $\|\nabla^{*}f(\bm{x})\|$ and $|\partial^{*}_{x_{l}}f(\bm{x})|$ we denote the largest subgradient and the largest partial sub-derivative, respectively.

1.1 Connections to Existing Methodologies

Convex regression has been extensively studied over the last two decades. Boyd et al. (2004) introduce the non-parametric convex regression problem of Eq. (2). Balázs (2016) shows that convex Lipschitz regression requires regularization in order to have generalization guarantees. In particular, it was shown that the penalty term $\|f\|=\sup_{\bm{x}}\|\nabla^{*}f(\bm{x})\|$ yields a generalization error of $O(n^{-2/d}\log n)$.

Balázs (2016) and Mazumder et al. (2019) provide ADMM solvers for the convex regression problem; however, their formulations have more than two blocks and are not guaranteed to converge in all circumstances. More recently, Chen & Mazumder (2020) provide an active-set type solver and Bertsimas & Mundru (2021) provide a delayed constraint generation algorithm, both of which have favorable scalability. They use a penalty of $\|f\|=\sum_{i}\|\nabla^{*}f(\bm{x}_{i})\|_{2}^{2}$, which makes the loss function in Eq. (2) strongly convex and easier to solve. However, no theoretical generalization bound is known when using this new penalty term. Siahkamari et al. (2020) use $\|f\|=\sup_{\bm{x}}\|\nabla^{*}f(\bm{x})\|_{1}$ for learning a difference of convex functions, and provide a multi-block ADMM solver for this problem.

On the other hand, Ghosh et al. (2019) and Kim et al. (2021) study the parametric class of max-linear functions $\mathcal{F}_{k}=\{f:\mathbb{R}^{d}\to\mathbb{R}\ |\ f(\bm{x})=\max_{j=1}^{k}\langle\bm{a}_{j},\bm{x}\rangle+b_{j}\}$ for $k<n$. They provide convex programming and alternating algorithms. The downside of these methods is the assumption that the $\bm{x}_{i}$ are i.i.d. samples from a normal distribution and, furthermore, that each feature $x_{i,l}$ is independent. These assumptions are needed for their algorithms to converge in theory. We note that this parametric class $\mathcal{F}_{k}$, for $k\geq n$, coincides with $\mathcal{F}$ when used in the convex regression problem of Eq. (2).

Our methodology is most similar to Balázs (2016) and is set in the context of learning theory, where we do not require any distributional assumption on the $\bm{x}_{i}$ and only require them to be i.i.d. and bounded. We further need to know a bound on $f$ but not on $\|f\|$.

2 Convex Regression with Lasso Penalty

Suppose we are given a dataset $\{(y_{i},\bm{x}_{i})\}_{i=1}^{n}$, where the predictors $\bm{x}_{i}\in\mathbb{R}^{d}$ are drawn i.i.d. from a distribution $P_{X}$ supported on a compact domain $\Omega\subset\{\|\bm{x}\|_{\infty}\leq R\}$, with $n\geq d$, and the responses satisfy $y_{i}=f(\bm{x}_{i})+\varepsilon_{i}$ for centered, independent random noise $\varepsilon_{i}$. Assume $|\varepsilon_{i}|$ and $|f(\cdot)|$ are both bounded by $M$. Furthermore, $f$ is convex and Lipschitz. We propose to solve the penalized convex regression problem in Eq. (2) with the penalty term $\|f\|\triangleq\sum_{l=1}^{d}\sup_{\bm{x}}|\partial_{x_{l}}^{*}f(\bm{x})|$. This results in the following convex optimization problem:

\min_{\hat{y}_{i},\bm{a}_{i}}\ \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_{i}-y_{i})^{2}+\lambda\sum_{l=1}^{d}\max_{i=1}^{n}|a_{i,l}| \qquad (3)
\textrm{s.t. }\ \hat{y}_{i}-\hat{y}_{j}-\langle\bm{a}_{i},\bm{x}_{i}-\bm{x}_{j}\rangle\leq 0,\quad i,j\in[n]\times[n].

Then, we estimate $f(\bm{x})$ via

\hat{f}(\bm{x})\triangleq\max_{i}\big(\langle\bm{a}_{i},\bm{x}-\bm{x}_{i}\rangle+\hat{y}_{i}\big). \qquad (4)

This is a model similar to Balázs (2016), with the minor modification of using a different penalty term in the loss function. Our penalty term depends on the sum of partial derivatives $\sum_{l=1}^{d}\sup_{\bm{x}}|\partial_{x_{l}}f(\bm{x})|$ rather than $\sup_{\bm{x}}\|\nabla f(\bm{x})\|^{2}_{2}$. This new penalty term acts like an $L_1$ regularizer and encourages feature sparsity. It is easy to show that the estimator is bounded, i.e., $\sup_{\bm{x}\in\Omega}|\hat{f}(\bm{x})|\leq M+4\|\hat{f}\|R$. Also, $\|f+g\|\leq\|f\|+\|g\|$ and $\|fc\|=c\|f\|$ for $c\geq 0$. Hence, similar to Siahkamari et al. (2020), our penalty term is a valid seminorm, which allows us to apply their theorem here.
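To make the optimization problem concrete, the following is a minimal reference formulation of problem (3) and the estimator (4), written with the cvxpy modeling package purely for illustration. It is not the paper's FCR/ADMM solver, the function names are ours, and a generic solver of this form scales poorly because of the $n^{2}$ constraints.

```python
import cvxpy as cp
import numpy as np

def convex_regression_reference(X, y, lam=0.1):
    # Problem (3): least squares with the Lasso-style penalty sum_l max_i |a_{i,l}|,
    # subject to the n^2 first-order convexity constraints.
    n, d = X.shape
    y_hat = cp.Variable(n)
    A = cp.Variable((n, d))                      # row i is the subgradient a_i at x_i
    penalty = cp.sum(cp.max(cp.abs(A), axis=0))  # sum_l max_i |a_{i,l}|
    objective = cp.Minimize(cp.sum_squares(y_hat - y) / n + lam * penalty)
    constraints = [y_hat[i] - y_hat[j] - A[i] @ (X[i] - X[j]) <= 0
                   for i in range(n) for j in range(n) if i != j]
    cp.Problem(objective, constraints).solve()
    return y_hat.value, A.value

def predict(x, X, y_hat, A):
    # Estimator (4): the max-affine function f_hat(x) = max_i <a_i, x - x_i> + y_hat_i
    return np.max(np.sum(A * (x - X), axis=1) + y_hat)
```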

Proposition 1.

With an appropriate choice of $\lambda$ (which requires knowledge of the bound $M$ on $f$) and $n\geq d$, it holds with probability at least $1-\delta$ over the data that the estimator $\hat{f}$ of Eq. (4) has excess risk upper bounded by

\mathbb{E}[|f(\bm{x})-\hat{f}(\bm{x})|^{2}]\leq O\bigg(\Big(\frac{n}{d}\Big)^{\frac{-2}{d+4}}\log\Big(\frac{n}{d}\Big)+\sqrt{\frac{\log(1/\delta)}{n}}\bigg).

2.1 Optimization

Our method utilizes the well-known ADMM algorithm (Gabay & Mercier, 1976). ADMM is a standard tool for solving convex problems that consist of blocks of variables coupled by linear constraints (Eckstein & Yao, 2012). It iteratively updates the problem variables and comes with provable convergence guarantees (He & Yuan, 2012).

We solve program (3) using ADMM; the main insight is that we can form a 2-block ADMM formulation in which each block can be computed in closed form. We first consider an equivalent form of the optimization problem:

\min_{\substack{\hat{y}_{i},\bm{a}_{i},L_{l}\geq 0,\\ \bm{p}_{i}^{+}\geq 0,\bm{p}_{i}^{-}\geq 0,\bm{u}_{i}\geq 0,s_{i,j}\geq 0}}\ \sum_{i=1}^{n}(\hat{y}_{i}-y_{i})^{2}+\lambda\sum_{l=1}^{d}L_{l} \qquad (5)
\textrm{s.t.}\ \begin{cases}s_{i,j}+\hat{y}_{i}-\hat{y}_{j}-\langle\bm{a}_{i},\bm{x}_{i}-\bm{x}_{j}\rangle=0 & i,j\in[n]\times[n]\\ u_{i,l}+p_{i,l}^{+}+p_{i,l}^{-}-L_{l}=0 & i,l\in[n]\times[d]\\ a_{i,l}-p_{i,l}^{+}+p_{i,l}^{-}=0 & i,l\in[n]\times[d],\end{cases}

with the augmented Lagrangian

\ell(\hat{y}_{i},\bm{a}_{i},L_{l},\bm{p}_{i}^{+},\bm{p}_{i}^{-},\bm{u}_{i},s_{i,j})=\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_{i}-y_{i})^{2}+\lambda\sum_{l=1}^{d}L_{l}
+\sum_{i}\sum_{j}\frac{\rho}{2}(s_{i,j}+\hat{y}_{i}-\hat{y}_{j}-\langle\bm{a}_{i},\bm{x}_{i}-\bm{x}_{j}\rangle+\alpha_{i,j})^{2}
+\sum_{i}\sum_{l}\frac{\rho}{2}(u_{i,l}+p_{i,l}^{+}+p_{i,l}^{-}-L_{l}+\gamma_{i,l})^{2}
+\sum_{i}\sum_{l}\frac{\rho}{2}(a_{i,l}-p_{i,l}^{+}+p_{i,l}^{-}+\eta_{i,l})^{2},

where $\alpha_{i,j}$, $\gamma_{i,l}$ and $\eta_{i,l}$ are dual variables. We divide the parameters into two blocks as ${\bm{b}}^{1}=\{\hat{y}_{i},\bm{a}_{i}\}$ and ${\bm{b}}^{2}=\{L_{l},\bm{p}_{i}^{+},\bm{p}_{i}^{-},\bm{u}_{i},s_{i,j}\}$, where $i,j\in[n]\times[n]$ and $l\in[d]$. We find closed-form solutions for each block given the solution to the other block.

2.1.1 First block ${\bm{b}}^{1}=\{\hat{y}_{i},\bm{a}_{i}\}$

We first note that we always normalize the dataset such that $\sum_{i}\bm{x}_{i}=\bm{0}$ and $\sum_{i}y_{i}=0$. This will result in $\sum_{i}\hat{y}_{i}=0$ and simplify the solutions.

By setting $\nabla_{\bm{a}_{i}}\ell=\bm{0}$ we can solve for $\bm{a}_{i}$ as:

\bm{a}_{i}=\bm{\Lambda}_{i}\Big(\bm{\theta}_{i}+\hat{y}_{i}\bm{x}_{i}+\frac{1}{n}\sum_{k}\hat{y}_{k}\bm{x}_{k}\Big), \qquad (6)

where

\bm{\Lambda}_{i}\triangleq\Big(\bm{x}_{i}\bm{x}_{i}^{T}+\frac{1}{n}I+\frac{1}{n}\sum_{j}\bm{x}_{j}\bm{x}_{j}^{T}\Big)^{-1},
\bm{\theta}_{i}\triangleq\frac{1}{n}\bigg(\bm{p}_{i}^{+}-\bm{p}_{i}^{-}-\bm{\eta}_{i}+\sum_{j}(\alpha_{i,j}+s_{i,j})(\bm{x}_{i}-\bm{x}_{j})\bigg).

Similarly, by setting $\partial_{\hat{y}_{i}}\ell=0$ and substituting Eq. (6) for $\bm{a}_{i}$, we can solve for $\hat{y}_{1},\dots,\hat{y}_{n}$ simultaneously as a system of linear equations,

\hat{\bm{y}}=\bm{\Omega}^{-1}\bigg(\frac{2\bm{y}}{n^{2}\rho}+\bm{v}-\bm{\beta}\bigg), \qquad (7)

where $\bm{y}=[y_{1},\dots,y_{n}]^{T}$, $\hat{\bm{y}}=[\hat{y}_{1},\dots,\hat{y}_{n}]^{T}$, and

\beta_{i}\triangleq\frac{1}{n}\sum_{j}\big(\alpha_{i,j}-\alpha_{j,i}+s_{i,j}-s_{j,i}\big),
v_{i}\triangleq\bm{x}_{i}^{T}\bm{\Lambda}_{i}\bm{\theta}_{i}+\bm{x}_{i}^{T}\frac{1}{n}\sum_{j}\bm{\Lambda}_{j}\bm{\theta}_{j}-\frac{1}{n}\sum_{j}\bm{x}_{j}^{T}\bm{\Lambda}_{j}\bm{\theta}_{j},
\Omega_{i,j}\triangleq\bigg(\frac{2}{n^{2}\rho}+2-\bm{x}_{i}^{T}\bm{\Lambda}_{i}\bm{x}_{i}\bigg)\mathbbm{1}(i=j)-\frac{1}{n}D_{i,j},
D_{i,j}\triangleq\bm{x}_{i}^{T}\bigg(\bm{\Lambda}_{i}+\bm{\Lambda}_{j}+\frac{1}{n}\sum_{k}\bm{\Lambda}_{k}\bigg)\bm{x}_{j}-\bm{x}_{j}^{T}\bm{\Lambda}_{j}\bm{x}_{j}-\frac{1}{n}\sum_{k}\bm{x}_{k}^{T}\bm{\Lambda}_{k}\bm{x}_{j}.

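As an illustration of how these closed-form updates map onto batched tensor operations, the following sketch forms all $\bm{\Lambda}_{i}$ of Eq. (6) at once and applies the $\bm{a}_{i}$ update. It assumes PyTorch, a centered data matrix as in Sec. 2.1.1, and variable names of our own choosing rather than the released implementation.

```python
import torch

def precompute_Lambda(X):
    # Lambda_i = (x_i x_i^T + I/n + (1/n) sum_j x_j x_j^T)^{-1}, batched over i.
    # This is the one-time preprocessing behind Eqs. (6)-(7), costing O(n d^3).
    n, d = X.shape
    outer = X.unsqueeze(2) * X.unsqueeze(1)          # (n, d, d): outer[i] = x_i x_i^T
    shared = torch.eye(d) / n + outer.mean(dim=0)    # I/n + (1/n) sum_j x_j x_j^T
    return torch.linalg.inv(outer + shared)          # (n, d, d)

def update_a(Lambda, theta, y_hat, X):
    # Eq. (6): a_i = Lambda_i (theta_i + y_hat_i x_i + (1/n) sum_k y_hat_k x_k)
    weighted = y_hat[:, None] * X                    # (n, d): rows y_hat_i x_i
    rhs = theta + weighted + weighted.mean(dim=0, keepdim=True)
    return torch.einsum('nij,nj->ni', Lambda, rhs)   # (n, d): one a_i per row
```
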
2.1.2 Second block ${\bm{b}}^{2}=\{L_{l},\bm{p}_{i}^{+},\bm{p}_{i}^{-},\bm{u}_{i},s_{i,j}\}$

Set $\partial_{s}\ell=0$ for $s\in\{s_{i,j},\bm{p}_{i}^{+},\bm{p}_{i}^{-},\bm{u}_{i}\}$. Hence

s_{i,j}=(-\alpha_{i,j}-\hat{y}_{i}+\hat{y}_{j}+\langle\bm{a}_{i},\bm{x}_{i}-\bm{x}_{j}\rangle)^{+}, \qquad (8)
u_{i,l}=(L_{l}-\gamma_{i,l}-|\eta_{i,l}+a_{i,l}|)^{+},
p_{i,l}^{+}=\frac{1}{2}(L_{l}-\gamma_{i,l}-u_{i,l}+\eta_{i,l}+a_{i,l})^{+},
p_{i,l}^{-}=\frac{1}{2}(L_{l}-\gamma_{i,l}-u_{i,l}-\eta_{i,l}-a_{i,l})^{+}.

Lastly, set $\partial_{L_{l}}\ell=0$ and plug in the solutions for $u_{i,l}$, $p_{i,l}^{+}$ and $p_{i,l}^{-}$. Denoting $c_{i,l}\triangleq|\eta_{i,l}+a_{i,l}|$, after rearranging terms we have:

\frac{\lambda}{\rho}=\sum_{i}\begin{cases}0 & \mbox{if }L_{l}-\gamma_{i,l}\geq c_{i,l}\\ \frac{1}{2}(\gamma_{i,l}+c_{i,l}-L_{l}) & \mbox{if }|L_{l}-\gamma_{i,l}|\leq c_{i,l}\\ \gamma_{i,l}-L_{l} & \mbox{if }L_{l}-\gamma_{i,l}\leq-c_{i,l}.\end{cases}

Note that the right-hand side is a monotone, piecewise-linear function of $L_{l}$. It is therefore easy to solve for $L_{l}$ using a sort followed by a simple scan, which takes $O(n\log n)$ flops; see L_update in Algorithm 1. Observe that it is also possible to add monotonicity constraints on $\hat{f}$ by projecting either $p^{+}_{i,l}$ or $p^{-}_{i,l}$ to zero.
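For concreteness, a minimal Python sketch of this sort-and-scan root finding (mirroring Algorithm 1) follows; it assumes NumPy, takes the values $\gamma_{\cdot,l}$ and $c_{\cdot,l}$ for a single coordinate $l$, and uses variable names of our own.

```python
import numpy as np

def L_update(gamma, c, lam_over_rho):
    # Solve lambda/rho = sum_i h_i(L) for L >= 0, where each h_i is the
    # non-increasing piecewise-linear function of Sec. 2.1.2 with knots
    # gamma_i - c_i and gamma_i + c_i.
    knots = np.sort(np.concatenate([gamma + c, gamma - c]))[::-1]  # descending
    f = lam_over_rho   # residual lambda/rho - RHS at the current knot
    fp = 0.0           # slope of the residual between consecutive knots
    for j in range(1, len(knots)):
        fp += 0.5
        f += fp * (knots[j] - knots[j - 1])     # knots decrease, so f decreases
        if f <= 0:
            return max(knots[j] - f / fp, 0.0)  # interpolate back to the root
    # no sign change: fall back as in Algorithm 1 (there f/n is used; fp ~ n)
    return max(knots[-1] - f / fp, 0.0)
```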

2.1.3 Dual variables

The update for dual variables follows from the standard ADMM algorithm updates:

\alpha_{i,j}=\alpha_{i,j}+s_{i,j}+\hat{y}_{i}-\hat{y}_{j}-\langle\bm{a}_{i},\bm{x}_{i}-\bm{x}_{j}\rangle, \quad i,j\in[n]\times[n] \qquad (9)
\gamma_{i,l}=\gamma_{i,l}+u_{i,l}+p_{i,l}^{+}+p_{i,l}^{-}-L_{l}, \quad i,l\in[n]\times[d]
\eta_{i,l}=\eta_{i,l}+a_{i,l}-p_{i,l}^{+}+p_{i,l}^{-}, \quad i,l\in[n]\times[d].

2.1.4 Algorithms via ADMM

Algorithm 2 provides the full steps of a parallel ADMM optimizer for the convex regression problem. In each line of Algorithm 2, we use subscripts such as $i$, $j$ and $l$ on the left-hand side. These updates may be run in parallel for $i\in[n]$, $i,j\in[n]\times[n]$, $i,l\in[n]\times[d]$ or $q\in[2]$. We initially set all block variables to zero and normalize the dataset such that $\sum_{i}y_{i}=0$ and $\sum_{i}\bm{x}_{i}=\bm{0}$. We have implemented our algorithm using PyTorch (Paszke et al., 2019), which benefits from this parallel structure when using a GPU. Our code, along with a built-in tuner for the hyperparameters $\lambda$ and $T$, is available on our GitHub repository: https://github.com/Siahkamari/Piecewise-linear-regression
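To illustrate how a single line of Algorithm 2 vectorizes on a GPU, the sketch below performs the $\alpha_{i,j}$ dual update of Eq. (9) for all $(i,j)$ pairs at once via broadcasting; this is a hedged illustration with tensor names of our own, not an excerpt from the released code.

```python
import torch

def update_alpha(alpha, s, y_hat, a, X):
    # <a_i, x_i - x_j> for every pair (i, j): a_i . x_i broadcast against a_i . x_j
    inner = (a * X).sum(dim=1, keepdim=True) - a @ X.T        # (n, n)
    residual = s + y_hat[:, None] - y_hat[None, :] - inner    # constraint residual
    return alpha + residual                                   # Eq. (9), all pairs at once
```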

Algorithm 1 L-update
0:  $\{\gamma_{i},c_{i}\}_{i=1}^{n}$, and $\lambda/\rho$
1:   $knot_{2n},\dots,knot_{1}\leftarrow\mathrm{sort}\{\gamma_{i}+c_{i},\gamma_{i}-c_{i}\}_{i=1}^{n}$
2:   $f\leftarrow\lambda/\rho$
3:   $f^{\prime}\leftarrow 0$
4:  for $j=2$ to $2n$ do
5:      $f^{\prime}\leftarrow f^{\prime}+\frac{1}{2}$
6:      $f\leftarrow f+f^{\prime}\cdot(knot_{j}-knot_{j-1})$
7:     if $f\leq 0$ then
8:        return   $\big(knot_{j}-\frac{f}{f^{\prime}}\big)^{+}$
9:     end if
10:  end for
11:  return   $\big(knot_{2n}-\frac{f}{n}\big)^{+}$
Algorithm 2 Convex regression
0:  $\{(\bm{x}_{i},y_{i})\}_{i=1}^{n}$, $\rho$, $\lambda$, and $T$
1:   $\hat{y}_{i}=s_{i,j}=\alpha_{i,j}\leftarrow 0$
2:   $\bm{L}=\bm{a}_{i}=\bm{p}_{i}=\bm{u}_{i}=\bm{\eta}_{i}=\bm{\gamma}_{i}\leftarrow\bm{0}_{d\times 1}$
3:  for $t=1$ to $T$ do
4:     Update $\hat{\bm{y}}$ by Eq. (7)
5:     Update $\bm{a}_{i}$ by Eq. (6)
6:     $L_{l}\leftarrow\textbf{L\_update}(\{\gamma_{i,l},|\eta_{i,l}+a_{i,l}|\}_{i\in[n]},\lambda/\rho)$
7:     Update $u_{i,l},p_{i,l}^{+},p_{i,l}^{-},s_{i,j}$ by Eq. (8)
8:     Update $\alpha_{i,j},\gamma_{i,l},\eta_{i,l}$ by Eq. (9)
9:  end for
10:  return   $f(\cdot)\triangleq\max_{i=1}^{n}(\langle\bm{a}_{i},\cdot-\bm{x}_{i}\rangle+\hat{y}_{i})$

2.2 Analysis

Our method has two sources of error: the error due to the ADMM procedure (Eq. 5) and the error due to estimating the ground-truth convex function (Eq. 4). We characterize both errors based on ADMM convergence (He & Yuan, 2012).

Theorem 1.

[from He & Yuan (2012)] Consider the separable convex optimization problem,

\min_{{\bf b}^{1}\in\mathcal{S}_{1},{\bf b}^{2}\in\mathcal{S}_{2}}\left[\psi({\bf b}^{1},{\bf b}^{2})=\psi_{1}({\bf b}^{1})+\psi_{2}({\bf b}^{2})\right]
\text{s.t.: }{\bf A}{\bf b}^{1}+{\bf B}{\bf b}^{2}+\bm{b}=\bm{0},

where $({\bf b}^{1},{\bf b}^{2})$ are the block variables, $(\mathcal{S}_{1},\mathcal{S}_{2})$ are convex sets that include the zero vector, $({\bf A},{\bf B})$ are the coefficient matrices and ${\bf b}$ is a constant vector. Let ${\bf b}^{1}_{t}$ and ${\bf b}^{2}_{t}$ be the solutions at iteration $t$ of a two-block ADMM procedure with learning rate $\rho$, where $({\bf b}^{1}_{0},{\bf b}^{2}_{0})$ are all-zero vectors. Denote the average of the iterates as $(\tilde{\bf b}^{1}_{T},\tilde{\bf b}^{2}_{T})=\left(\frac{1}{T}\sum_{t=1}^{T}{\bf b}^{1}_{t},\frac{1}{T}\sum_{t=1}^{T}{\bf b}^{2}_{t}\right)$. For all ${\bm{\kappa}}$ we have

\psi(\tilde{\bf b}^{1}_{T},\tilde{\bf b}^{2}_{T})-\psi({\bf b}^{1}_{*},{\bf b}^{2}_{*})-{\bm{\kappa}}^{T}({\bf A}\tilde{\bf b}^{1}_{T}+{\bf B}\tilde{\bf b}^{2}_{T}+\bm{b})\leq\frac{1}{T}\left(\frac{\rho}{2}\|{\bf B}{\bf b}^{2}_{*}\|^{2}+\frac{1}{2\rho}\|{\bm{\kappa}}\|^{2}\right),

where $({\bf b}^{1}_{*},{\bf b}^{2}_{*})$ are the optimal solutions.

We have arranged Theorem 1 so that it explicitly depends on problem-dependent constants such as $\|{\bf B}{\bf b}^{2}_{*}\|^{2}$, the number of iterations $T$, and the learning rate $\rho$. A proof is provided in Appendix B.1.

Next, we present a convergence analysis in terms of the regularized MSE of the output of Algorithm 2. This has the further benefit of identifying an appropriate learning rate $\rho$ and a range for the regularization coefficient $\lambda$ that minimize the computational complexity.

Theorem 2.

Let $\{\hat{y}_{i}^{t},\bm{a}_{i}^{t}\}_{i=1}^{n}$ be the output of Algorithm 2 at the $t^{\text{th}}$ iteration, $\tilde{y_{i}}\triangleq\frac{1}{T}\sum_{t=1}^{T}\hat{y}_{i}^{t}$ and $\tilde{\bm{a}_{i}}\triangleq\frac{1}{T}\sum_{t=1}^{T}\bm{a}_{i}^{t}$. Denote $\tilde{f}_{T}(\bm{x})\triangleq\max_{i}\langle\tilde{\bm{a}_{i}},\bm{x}-\bm{x}_{i}\rangle+\tilde{y_{i}}$. Assume $\max_{i,l}|x_{i,l}|\leq 1$ and $\mathbbm{Var}(\{y_{i}\}_{i=1}^{n})\leq 1$. If we choose $\rho=\frac{\sqrt{d}\lambda^{2}}{n}$, for $\lambda\geq\frac{3}{\sqrt{2nd}}$ and $T\geq n\sqrt{d}$ we have:

\frac{1}{n}\sum_{i=1}^{n}(\tilde{f}_{T}(\bm{x}_{i})-y_{i})^{2}+\lambda\|\tilde{f}_{T}\|\leq\min_{\hat{f}\in\mathcal{F}}\bigg(\frac{1}{n}\sum_{i=1}^{n}\big(\hat{f}(\bm{x}_{i})-y_{i}\big)^{2}+\lambda\|\hat{f}\|\bigg)+\frac{6n\sqrt{d}}{T+1}.
Corollary 1.

Our method needs $T=\frac{6n\sqrt{d}}{\epsilon}$ iterations to achieve $\epsilon$ error. Each iteration requires $\mathcal{O}(n^{2}d+nd^{2})$ flops. Preprocessing costs $\mathcal{O}(nd^{3})$. Therefore the total computational complexity is $\mathcal{O}\big(\frac{n^{3}d^{1.5}+n^{2}d^{2.5}+nd^{3}}{\epsilon}\big)$.
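As a back-of-the-envelope illustration of the schedule implied by Theorem 2 and Corollary 1, the short helper below computes the learning rate, iteration budget, and per-iteration flop count from $(n,d,\lambda,\epsilon)$; it is only a sketch of the stated constants, not part of the released solver.

```python
import math

def fcr_schedule(n, d, lam, eps):
    rho = math.sqrt(d) * lam ** 2 / n           # learning rate choice from Theorem 2
    T = math.ceil(6 * n * math.sqrt(d) / eps)   # iterations from Corollary 1
    flops_per_iter = n ** 2 * d + n * d ** 2    # per-iteration cost, up to constants
    preprocessing = n * d ** 3                  # batched Lambda_i inversion
    return rho, T, flops_per_iter, preprocessing
```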

We note that the assumptions of Theorem 2 are easily satisfied by normalizing the training dataset. The main difficulty in the derivation of Theorem 2 is that $(\tilde{\bm{a}_{i}},\tilde{y_{i}})$ may violate the constraints in (5). This can result in $\tilde{f}_{T}(\bm{x}_{i})\neq\tilde{y}_{i}$ and $\|\tilde{f}_{T}\|\neq\sum_{l=1}^{d}\tilde{L}_{l}$. Therefore, the main step of the proof of Theorem 2 is to characterize and bound the effect of such constraint violations on our objective function. We then choose $\bm{\kappa}$ in Theorem 1 so as to account for the effect of the constraint violations. For a full proof, we refer to Appendix B.2.

3 Approximating a Bregman Divergence

We next consider the application of learning a Bregman divergence from supervision. This problem is a type of metric learning problem, where we are given supervision (pairs of similar/dissimilar pairs, or triplets consisting of relative comparisons amongst data points) and we aim to learn a task-specific distance or divergence measure from the data. Classical metric learning methods typically learn a linear transformation of the data, corresponding to a Mahalanobis-type distance function,

(\bm{x}-\bm{y})^{T}M(\bm{x}-\bm{y}),

where $M$ is a positive semi-definite matrix (note that this generalizes the squared Euclidean distance, where $M$ would be the identity matrix). More generally, Mahalanobis distances are examples of Bregman divergences, which include other divergences such as the KL-divergence as special cases.

Bregman divergences are parameterized by an underlying strictly convex function, sometimes called the convex generating function of the divergence. Recently, Siahkamari et al. (2019) formulate learning a Bregman divergence as learning the underlying convex function parameterizing the divergence. The resulting optimization problem they study is similar to the convex regression problem in Eq. (5). Here we discuss an improvement to their approach with a slightly different loss function and our Lasso penalty term. We then derive an algorithm using 2-block ADMM; the existing solver for this problem uses standard linear programming.

In particular, suppose we observe a classification dataset $S_{n}=\{(\bm{x}_{i},y_{i})\}_{i=1}^{n}$, where $\bm{x}_{i}\in\mathbb{R}^{d}$ and $y_{i}$ is an integer class label. Let

D_{f}(\bm{x}_{i},\bm{x}_{j})=f(\bm{x}_{i})-f(\bm{x}_{j})-\langle\nabla^{*}f(\bm{x}_{j}),\bm{x}_{i}-\bm{x}_{j}\rangle,

be a Bregman divergence with convex generating function $f(\bm{x})\in\mathcal{F}$. Let the pairwise similarity loss be

\ell(D_{f}(\bm{x}_{i},\bm{x}_{j}),y_{i},y_{j})=\mathbbm{1}\big[\mathbbm{1}[y_{i}=y_{j}]=\mathbbm{1}(D_{f}(\bm{x}_{i},\bm{x}_{j})\geq 1)\big].

We estimate $f$, the underlying convex function of the Bregman divergence $D_{f}$, by

\hat{f}\triangleq\arg\min_{f\in\mathcal{F}}\sum_{i=1}^{n}\sum_{j=1}^{n}\ell(D_{f}(\bm{x}_{i},\bm{x}_{j}),y_{i},y_{j})+\lambda\|f\|.

We note that this approach can easily be generalized to other possible metric learning losses.
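To make the learned object concrete, the sketch below evaluates the divergence induced by a max-affine generating function and the 0/1 pairwise similarity loss above. It assumes the pieces $(\bm{a}_{i}, z_{i})$ returned by Algorithm 3 and the training points $\bm{x}_{i}$; the function names are ours and the snippet is illustrative rather than the paper's released API.

```python
import numpy as np

def f_hat_and_subgrad(x, A, z, X):
    # f_hat(x) = max_i <a_i, x - x_i> + z_i; a maximizing a_i is a subgradient at x
    vals = A @ x - np.sum(A * X, axis=1) + z
    i_star = int(np.argmax(vals))
    return vals[i_star], A[i_star]

def bregman_divergence(x, x_prime, A, z, X):
    # D_f(x, x') = f(x) - f(x') - <grad f(x'), x - x'>
    fx, _ = f_hat_and_subgrad(x, A, z, X)
    fxp, g = f_hat_and_subgrad(x_prime, A, z, X)
    return fx - fxp - g @ (x - x_prime)

def pairwise_loss(x, x_prime, same_class, A, z, X):
    # loss is 1 exactly when "same class" coincides with "divergence >= 1",
    # i.e., when the learned divergence disagrees with the labels (Sec. 3)
    return float(same_class == (bregman_divergence(x, x_prime, A, z, X) >= 1.0))
```

The same bregman_divergence routine can be reused at test time, e.g. for the nearest-neighbour prediction rule described in Sec. 5.2.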

3.1 Optimization

Now we provide an algorithm for Bregman divergence learning based on 2-block ADMM, solving for each block in closed form. We first reformulate the optimization problem in standard form; introducing some auxiliary variables, we have:

\min_{\substack{z_{i},\bm{a}_{i},\zeta_{i,j}\geq 0,\\ L_{l}\geq 0,\bm{p}_{i}^{+}\geq 0,\bm{p}_{i}^{-}\geq 0,\\ \bm{u}_{i}\geq 0,s_{i,j}\geq 0,t_{i,j}\geq 0}}\ \sum_{i=1}^{n}\sum_{j=1}^{n}\zeta_{i,j}+\lambda\sum_{l=1}^{d}L_{l} \qquad (10)
\textrm{s.t.}\ \begin{cases}\iota_{i,j}s_{i,j}-\iota_{i,j}+t_{i,j}+1-\zeta_{i,j}=0 & i,j\in[n]\times[n]\\ s_{i,j}+z_{i}-z_{j}-\langle\bm{a}_{i},\bm{x}_{i}-\bm{x}_{j}\rangle=0 & i,j\in[n]\times[n]\\ p_{i,l}^{+}+p_{i,l}^{-}+u_{i,l}-L_{l}=0 & i,l\in[n]\times[d]\\ a_{i,l}=p_{i,l}^{+}-p_{i,l}^{-} & i,l\in[n]\times[d],\end{cases}

where $\iota_{i,j}=2(\mathbbm{1}[y_{i}=y_{j}]-\frac{1}{2})$.

Now we write the augmented Lagrangian:

\ell(z_{i},\bm{a}_{i},\zeta_{i,j},L_{l},\bm{p}_{i}^{+},\bm{p}_{i}^{-},\bm{u}_{i},s_{i,j},t_{i,j})=\sum_{i=1}^{n}\sum_{j=1}^{n}\zeta_{i,j}+\lambda\sum_{l=1}^{d}L_{l}
+\sum_{i}\sum_{j}\frac{\rho}{2}(s_{i,j}+z_{i}-z_{j}-\langle\bm{a}_{i},\bm{x}_{i}-\bm{x}_{j}\rangle+\alpha_{i,j})^{2}
+\sum_{i}\sum_{l}\frac{\rho}{2}(u_{i,l}+p_{i,l}^{+}+p_{i,l}^{-}-L_{l}+\gamma_{i,l})^{2}
+\sum_{i}\sum_{l}\frac{\rho}{2}(a_{i,l}-p_{i,l}^{+}+p_{i,l}^{-}+\eta_{i,l})^{2}
+\sum_{i}\sum_{j}\frac{\rho}{2}(\iota_{i,j}s_{i,j}-\iota_{i,j}+t_{i,j}+1-\zeta_{i,j}+\tau_{i,j})^{2},

where $\alpha_{i,j}$, $\gamma_{i,l}$, $\eta_{i,l}$ and $\tau_{i,j}$ are dual variables. Next we divide the variables into two blocks and solve for each in closed form.

3.1.1 First block ${\bm{b}}^{1}=\{z_{i},\bm{a}_{i},\zeta_{i,j}\}$

Setting $\partial_{\zeta_{i,j}}\ell=0$ gives:

\zeta_{i,j}=\Big(\frac{-1}{n\rho}+\tau_{i,j}+\iota_{i,j}s_{i,j}-\iota_{i,j}+t_{i,j}+1\Big)^{+}. \qquad (11)

Collecting the terms containing $\bm{a}_{i}$ in Eq. (10) and comparing to those in Eq. (5), the solution for $\bm{a}_{i}$ follows:

\bm{a}_{i}=\bm{\Lambda}_{i}\Big(\bm{\theta}_{i}+z_{i}\bm{x}_{i}+\frac{1}{n}\sum_{k}z_{k}\bm{x}_{k}\Big). \qquad (12)

Set $\partial_{z_{i}}\ell=0$, $\sum_{i}z_{i}=0$ and $\sum_{i}\bm{x}_{i}=\bm{0}$. Using Eq. (12) we can solve for $z_{1},\dots,z_{n}$ as

\bm{z}=\bm{\Omega}_{\text{breg}}^{-1}(\bm{\nu}-\bm{\beta}), \qquad (13)

where $\Omega_{\text{breg},i,j}=(2-\bm{x}_{i}^{T}\bm{\Lambda}_{i}\bm{x}_{i})\mathbbm{1}[i=j]-\frac{1}{n}D_{i,j}$.

3.1.2 Second block ${\bm{b}}^{2}=\{L_{l},\bm{p}_{i},\bm{u}_{i},s_{i,j},t_{i,j}\}$

Comparing Eq. (10) and Eq. (5), we find that $L_{l}$, $\bm{p}_{i}$ and $\bm{u}_{i}$ have the same solutions as in Sec. 2.1.2. Hence we only need to solve for $s_{i,j}$ and $t_{i,j}$. Setting $\partial_{s_{i,j}}\ell=\partial_{t_{i,j}}\ell=0$, some algebra gives:

s_{i,j}=\frac{1}{2}\big(\pi^{2}_{i,j}+\iota_{i,j}\pi^{1}_{i,j}-\iota_{i,j}(\pi^{1}_{i,j}-\iota_{i,j}\pi^{2}_{i,j})^{+}\big)^{+}, \qquad (14)
t_{i,j}=(\pi^{1}_{i,j}-\iota_{i,j}s_{i,j})^{+},

where

\pi^{1}_{i,j}\triangleq-\tau_{i,j}+\iota_{i,j}-1+\zeta_{i,j},
\pi^{2}_{i,j}\triangleq-\alpha_{i,j}-z_{i}+z_{j}+\langle\bm{a}_{i},\bm{x}_{i}-\bm{x}_{j}\rangle.

3.1.3 Dual variables

The only constraint not already present in Eq. (5) is the one with dual multiplier $\tau_{i,j}$. Its update is

\tau_{i,j}=\tau_{i,j}+\iota_{i,j}s_{i,j}-\iota_{i,j}+t_{i,j}+1-\zeta_{i,j}. \qquad (15)

Algorithm 3 in the appendix provides the full steps for solving the Bregman divergence learning problem.

4 Difference of Convex (DC) Regression

As a further example, we extend our convex regression solver to difference of convex regression as studied in Siahkamari et al. (2020). DC functions are the set of functions $f$ that can be represented as $f=\phi^{1}-\phi^{2}$ for some choice of two convex functions. DC functions are a very rich class; for instance, they are known to contain all $\mathcal{C}^{2}$ functions. DC regression has been studied in Cui et al. (2018); Siahkamari et al. (2020); Bagirov et al. (2020). We provide a 2-block ADMM solver for the difference of convex estimator proposed in Siahkamari et al. (2020), with the minor change of choosing a different penalty

\|f\|\triangleq\inf_{\phi^{1},\phi^{2}}\ \sum_{q=1}^{2}\sum_{l=1}^{d}\sup_{\bm{x}}|\partial_{x_{l}}^{*}\phi^{q}(\bm{x})|
\textrm{s.t. }\phi^{1},\phi^{2}\text{ are convex, }\ \phi^{1}-\phi^{2}=f.

In particular, consider a regression dataset $\{(y_{i},\bm{x}_{i})\}_{i=1}^{n}$ with $\bm{x}_{i}\in\mathbb{R}^{d}$ and $y_{i}\in\mathbb{R}$. Denote the class of difference of convex functions by

\mathcal{DC}\triangleq\{f:\mathbb{R}^{d}\to\mathbb{R}\ |\ f=\phi^{1}-\phi^{2},\ (\phi^{1},\phi^{2})\text{ are convex}\}.

In order to estimate $f$, we minimize the penalized least squares over the $\mathcal{DC}$ class:

\hat{f}\triangleq\arg\min_{f\in\mathcal{DC}}\frac{1}{n}\sum_{i=1}^{n}(y_{i}-f(\bm{x}_{i}))^{2}+\lambda\|f\|.

This results in a convex program:

\min_{\hat{y}_{i}^{q},\bm{a}_{i}^{q}}\ \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_{i}^{1}-\hat{y}_{i}^{2}-y_{i})^{2}+\lambda\sum_{q=1}^{2}\sum_{l=1}^{d}\max_{i=1}^{n}|a_{i,l}^{q}| \qquad (16)
\textrm{s.t. }\begin{cases}\hat{y}_{i}^{1}-\hat{y}_{j}^{1}-\langle\bm{a}_{i}^{1},\bm{x}_{i}-\bm{x}_{j}\rangle\leq 0 & i,j\in[n]\times[n],\\ \hat{y}_{i}^{2}-\hat{y}_{j}^{2}-\langle\bm{a}_{i}^{2},\bm{x}_{i}-\bm{x}_{j}\rangle\leq 0 & i,j\in[n]\times[n],\end{cases}

and the estimator for $f$ is given as

\hat{f}(\bm{x})\triangleq\max_{i}\big(\langle\bm{a}_{i}^{1},\bm{x}-\bm{x}_{i}\rangle+\hat{y}_{i}^{1}\big)-\max_{i}\big(\langle\bm{a}_{i}^{2},\bm{x}-\bm{x}_{i}\rangle+\hat{y}_{i}^{2}\big).

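A short sketch of evaluating this estimator as a difference of two max-affine functions follows; it assumes the pieces $(\bm{a}_{i}^{q},\hat{y}_{i}^{q})$ returned by Algorithm 4 and is illustrative only, with names of our own choosing.

```python
import numpy as np

def max_affine(x, A, yh, X):
    # phi^q(x) = max_i <a_i^q, x - x_i> + yh_i^q
    return np.max(A @ x - np.sum(A * X, axis=1) + yh)

def dc_predict(x, A1, yh1, A2, yh2, X):
    # f_hat(x) = phi^1(x) - phi^2(x)
    return max_affine(x, A1, yh1, X) - max_affine(x, A2, yh2, X)
```
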
We note that the linear constraint sets for $\{\hat{y}_{i}^{1},\bm{a}_{i}^{1}\}$ and $\{\hat{y}_{i}^{2},\bm{a}_{i}^{2}\}$ have no variables in common in program (16). Fixing $\hat{\bm{y}}^{2}$ allows us to solve for $\hat{\bm{y}}^{1}$ using Eq. (7), and vice-versa. From there we can decouple the $\hat{\bm{y}}^{1}$ and $\hat{\bm{y}}^{2}$ solutions by solving a system of linear equations,

\hat{\bm{y}}^{q}=\frac{(-1)^{q+1}}{2}\bigg(\bm{\Omega}+\frac{2\bm{I}}{n^{2}\rho}\bigg)^{-1}\bigg(\frac{4\bm{y}}{n^{2}\rho}+\bm{v}^{1}-\bm{\beta}^{1}-\bm{v}^{2}+\bm{\beta}^{2}\bigg)+\frac{1}{2}\bigg(\bm{\Omega}-\frac{2\bm{I}}{n^{2}\rho}\bigg)^{-1}(\bm{v}^{1}-\bm{\beta}^{1}+\bm{v}^{2}-\bm{\beta}^{2}). \qquad (17)

Algorithm 4 in the Appendix provides the full steps for solving the difference of convex regression problem.

5 Experiments

Figure 1: Regression results on UCI datasets

In this section we provide experiments on real datasets from the UCI machine learning repository as well as on synthetic datasets. We compare running times between our proposed approach and the baseline interior point method for all three problems (convex regression, DC regression, and Bregman divergence learning). We also compare both our DC regression algorithm and our Bregman divergence learning algorithm to state-of-the-art regression and classification methods, and show that our approach is close to the state of the art in terms of accuracy.

For the ADMM-based methods, we use a V100 Nvidia GPU with 11 gigabytes of GPU memory and 4 CPU cores. For all other methods we use a 16-core CPU.

Table 1: Comparison of convex regression run time (seconds) against the baseline on synthetic data
n      d    Baseline (Siahkamari et al., 2020)    This Paper
1000   2    32.3                                  3.63
1000   4    30.8                                  3.75
1000   8    54.5                                  4.03
1000   16   112.3                                 4.12
1000   32   189.3                                 4.37

Table 2: Comparison of DC regression run time (seconds) against the baseline on synthetic data
n      d    Baseline (Siahkamari et al., 2020)    This Paper
1000   2    140.9                                 3.67
1000   4    113.9                                 3.79
1000   8    210.3                                 4.05
1000   16   351.6                                 4.15
1000   32   823.2                                 4.42

5.1 Timing Results for Convex and DC Regression

To demonstrate the effectiveness of our proposed approach, we first compare, on synthetic data, the existing interior point method to our 2-block ADMM method on standard convex regression and DC regression. Here, we fixed the number of data points at 1000 and tested data of dimensionality $d\in\{2,4,8,16,32\}$.

Results are shown in Tables 1 and 2. We see that our method ranges from 8.8x to 186x faster than the interior point method. Note that the interior point method fails on large data sets, and cannot be run for many of the UCI data sets tested later.

5.2 Results on Real Data

Now we experiment with our DC regression algorithm and the Bregman divergence learning algorithm (PBDL) on real data sets, and report accuracy and timing results. For PBDL, we compare against its predecessor (PBDL_0) as well as state-of-the-art classification models. All results are reported either on the pre-specified train/test split or on a 5-fold cross-validation set, following the instructions for each dataset.

For comparison, we include the previous (DC)-regression and PBDL algorithms. Furthermore, we pick two popular tree-based algorithms, XGBoost and Random Forest, and also consider Lasso as a baseline. According to Fernández-Delgado et al. (2014, 2019), Random Forest achieves the state of the art on most regression and classification tasks. More recently, XGBoost has been shown to achieve the state of the art on many real datasets and Kaggle.com competitions (Chen & Guestrin, 2016). However, we stress that achieving state-of-the-art results is not the purpose of our experiments.

We provide the details of hyper-parameters for each method.

DC-regression: We choose $\lambda$ from the grid $10^{-3:3}$ by 5-fold cross-validation. We then do at most 2 more rounds of grid search around the optimal $\lambda$ from the first round. We fix $\rho=0.01$ and choose $T$ by early stopping, i.e., whenever the validation error improvement is less than $10^{-3}$ after $n$ iterations of ADMM.

PBDL: We predict the classes using a 5-nearest-neighbour scheme, $i_{\text{nearest}}(\bm{x})=\arg\min_{i}D_{\phi}(\bm{x},\bm{x}_{i})$, where $D_{\phi}$ is the learned Bregman divergence. Tuning of $\lambda$ is similar to DC-regression.

PBDL_0: We use the authors' existing code for learning a Bregman divergence, which is based on Gurobi solvers. We use their built-in parameter tuner for $\lambda$, which is based on 5-fold cross-validation.

XGBoost Regressor/Classifier: max_depth and learning_rate are found by 5-fold cross-validation over $\{1,\dots,10\}\times\{0.05,0.1,0.5\}$. Other parameters are set to their defaults. We use the xgboost 1.6.0 package (Chen & Guestrin, 2016).

Random Forest Regressor/Classifier: n_features is found by 5-fold cross-validation over $[d^{0.25},d^{0.75}]$. We use the sklearn package (Pedregosa et al., 2011).

Lasso: We use the sklearn package function LassoCV.

5.2.1 Datasets

We chose all regression datasets from UCI that have a number of instances $10^{3}\leq n\leq 10^{4}$. These were 30 datasets at the time of submission of this paper. We discard 16 of these datasets for various reasons, such as the dataset not being available. Some of the remaining datasets have multiple target variables, so we report the out-of-sample $R^{2}$ results with a suffix for each target variable.

For classification, we chose all datasets originally used in Siahkamari et al. (2019). Further, we include the Abalone dataset, which was too large for the previous PBDL algorithm. We use these datasets as multi-class classification problems. We report accuracy as well as the training run-time.

5.2.2 Results

Regression: For our regression experiments we present the out-of-sample $R^{2}$ on 18 datasets in Figure 1. We observe that our method has performance similar to that of XGBoost and Random Forest despite having $n\times d$ parameters. On the solar-flare and Parkinson datasets, other methods overfit whereas our algorithm avoids overfitting and is more robust. The maximum number of instances we were able to experiment with in reasonable time is about $10^{4}$, which is 10x larger than what Siahkamari et al. (2020) experimented on. Detailed results are in the Appendix.

Classification: For our classification experiments we compare the logarithm of the runtimes of all methods in Figure 2. In terms of speed, we are on average 30x faster than the original PBDL_0 algorithm. PBDL_0 also fails to handle the Abalone dataset with $n=4177$, whereas the new PBDL takes less than a minute to finish one fit. We are slower than XGBoost and Random Forest; however, we rely only on a Python script versus optimized compiled code. We note that the divergence function learned by PBDL can further be used for other tasks such as ranking and clustering. We present the accuracy in Figure 3 in the Appendix. We observe that our Bregman divergence learning algorithm (PBDL), as well as the original (PBDL_0), has performance similar to that of XGBoost and Random Forest.

Figure 2: Classification log(runtime) for all fits on UCI datasets

6 Conclusion

In this paper, we studied the nonparametric convex regression problem with an $L_1$ regularization penalty. We provided a solver for this problem and proved an iteration complexity of $6n\sqrt{d}/\epsilon$. The total computational complexity is $O(n^{3}d^{1.5}/\epsilon+n^{2}d^{2.5}/\epsilon+nd^{3}/\epsilon)$, which improves on the $O(n^{5}d^{2}/\epsilon)$ previously known for this problem. We also extended our solver to the problem of difference of convex (DC) regression and the problem of learning an arbitrary Bregman divergence. We provided comparisons to state-of-the-art regression and classification models.

Acknowledgements

This research was supported by the Secretary of Defense for Research and Engineering under Air Force Contract No. FA8702-15-D-0001, Army Research Office Grant W911NF2110246, the National Science Foundation grants CCF-2007350 and CCF-1955981, the Hariri Data Science Faculty Fellowship Grants, and a gift from the ARM corporation.

References

  • Afriat (1967) Afriat, S. N. The construction of utility functions from expenditure data. International economic review, 8(1):67–77, 1967.
  • Amos et al. (2017) Amos, B., Xu, L., and Kolter, J. Z. Input convex neural networks. In International Conference on Machine Learning, pp. 146–155. PMLR, 2017.
  • Bagirov et al. (2020) Bagirov, A. M., Taheri, S., Karmitsa, N., Sultanova, N., and Asadi, S. Robust piecewise linear l 1-regression via nonsmooth dc optimization. Optimization Methods and Software, pp.  1–21, 2020.
  • Balázs (2016) Balázs, G. Convex regression: theory, practice, and applications. 2016.
  • Bertsimas & Mundru (2021) Bertsimas, D. and Mundru, N. Sparse convex regression. INFORMS Journal on Computing, 33(1):262–279, 2021.
  • Boyd et al. (2004) Boyd, S., Boyd, S. P., and Vandenberghe, L. Convex optimization. Cambridge university press, 2004.
  • Chen & Guestrin (2016) Chen, T. and Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pp.  785–794, 2016.
  • Chen & Mazumder (2020) Chen, W. and Mazumder, R. Multivariate convex regression at scale. arXiv preprint arXiv:2005.11588, 2020.
  • Cui et al. (2018) Cui, Y., Pang, J.-S., and Sen, B. Composite difference-max programs for modern statistical estimation problems. SIAM Journal on Optimization, 28(4):3344–3374, 2018.
  • Eckstein & Yao (2012) Eckstein, J. and Yao, W. Augmented lagrangian and alternating direction methods for convex optimization: A tutorial and some illustrative computational results. RUTCOR Research Reports, 32(3), 2012.
  • Fernández-Delgado et al. (2014) Fernández-Delgado, M., Cernadas, E., Barro, S., and Amorim, D. Do we need hundreds of classifiers to solve real world classification problems? The journal of machine learning research, 15(1):3133–3181, 2014.
  • Fernández-Delgado et al. (2019) Fernández-Delgado, M., Sirsat, M. S., Cernadas, E., Alawadi, S., Barro, S., and Febrero-Bande, M. An extensive experimental survey of regression methods. Neural Networks, 111:11–34, 2019.
  • Gabay & Mercier (1976) Gabay, D. and Mercier, B. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & mathematics with applications, 2(1):17–40, 1976.
  • Ghosh et al. (2019) Ghosh, A., Pananjady, A., Guntuboyina, A., and Ramchandran, K. Max-affine regression: Provable, tractable, and near-optimal statistical estimation. arXiv preprint arXiv:1906.09255, 2019.
  • He & Yuan (2012) He, B. and Yuan, X. On the o(1/n) convergence rate of the douglas–rachford alternating direction method. SIAM Journal on Numerical Analysis, 50(2):700–709, 2012.
  • Kim et al. (2021) Kim, S., Bahmani, S., and Lee, K. Max-linear regression by scalable and guaranteed convex programming. arXiv preprint arXiv:2103.07020, 2021.
  • Mazumder et al. (2019) Mazumder, R., Choudhury, A., Iyengar, G., and Sen, B. A computational framework for multivariate convex regression and its variants. Journal of the American Statistical Association, 114(525):318–331, 2019.
  • Nagurney (1998) Nagurney, A. Network economics: A variational inequality approach, volume 10. Springer Science & Business Media, 1998.
  • Paszke et al. (2019) Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32:8026–8037, 2019.
  • Pedregosa et al. (2011) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  • Siahkamari et al. (2019) Siahkamari, A., Saligrama, V., Castanon, D., and Kulis, B. Learning bregman divergences. arXiv e-prints, pp.  arXiv–1905, 2019.
  • Siahkamari et al. (2020) Siahkamari, A., Gangrade, A., Kulis, B., and Saligrama, V. Piecewise linear regression via a difference of convex functions. In International Conference on Machine Learning, pp. 8895–8904. PMLR, 2020.
  • Varian (1982) Varian, H. R. The nonparametric approach to demand analysis. Econometrica: Journal of the Econometric Society, pp. 945–973, 1982.
  • Zhang et al. (2020) Zhang, X., Li, Y., Zhang, Z., and Zhang, Z.-L. f-GAIL: Learning f-divergence for generative adversarial imitation learning. arXiv preprint arXiv:2010.01207, 2020.

Appendix to ‘Faster Convex Lipschitz Regression via 2 blocks ADMM’

Appendix A Algorithms

Algorithm 3 Learning a Bregman divergence
1:  Require: $\{(\bm{x}_{i},y_{i})\,|\,\bm{x}_{i}\in\mathbb{R}^{d},y_{i}\in\mathbb{N}\}_{i\in[n]}$ and $\{\rho,\,\lambda,\,T\}$
2:   $z_{i}=s_{i,j}=t_{i,j}=\alpha_{i,j}=\tau_{i,j}\leftarrow 0$
3:   $\bm{L}=\bm{a}_{i}=\bm{p}_{i}=\bm{u}_{i}=\bm{\eta}_{i}=\bm{\gamma}_{i}\leftarrow\bm{0}_{d\times 1}$
4:  for $t=1$ to $T$ do
5:     Update $\zeta_{i,j}$ by Eq. (11)
6:     Update $\bm{z}$ by Eq. (13)
7:     Update $\bm{a}_{i}$ by Eq. (12)
8:     $L_{l}\leftarrow\textbf{L\_update}(\{\gamma_{i,l},|\eta_{i,l}+a_{i,l}|\}_{i\in[n]},\lambda/\rho)$
9:     Update $s_{i,j},t_{i,j}$ by Eq. (14)
10:     Update $u_{i,l},p_{i,l}^{+},p_{i,l}^{-}$ by Eq. (8)
11:     Update $\alpha_{i,j},\gamma_{i,l},\eta_{i,l}$ by Eq. (9)
12:     Update $\tau_{i,j}$ by Eq. (15)
13:  end for
14:  return: $f(\cdot)\triangleq\max_{i=1}^{n}(\langle\bm{a}_{i},\cdot-\bm{x}_{i}\rangle+z_{i})$

Algorithm 4 Difference of convex regression
0:  $\{(\bm{x}_{i},y_{i})\}_{i=1}^{n}$, $\rho$, $\lambda$, and $T$
1:   $\hat{y}_{i}^{q}=s_{i,j}^{q}=\alpha_{i,j}^{q}\leftarrow 0$
2:   $\bm{L}^{q}=\bm{a}_{i}^{q}=\bm{p}_{i}^{q}=\bm{u}_{i}^{q}=\bm{\eta}_{i}^{q}=\bm{\gamma}_{i}^{q}\leftarrow\bm{0}_{d\times 1}$
3:  for $t=1$ to $T$ do
4:     Update $\hat{\bm{y}}^{q}$ by Eq. (17)
5:     Update $\bm{a}_{i}^{q}$ by Eq. (6)
6:     $L_{l}^{q}\leftarrow\textbf{L\_update}(\{\gamma_{i,l}^{q},|\eta_{i,l}^{q}+a_{i,l}^{q}|\}_{i\in[n]},\lambda/\rho)$
7:     Update $u_{i,l}^{q},p_{i,l}^{q+},p_{i,l}^{q-},s_{i,j}^{q}$ by Eq. (8)
8:     Update $\alpha_{i,j}^{q},\gamma_{i,l}^{q},\eta_{i,l}^{q}$ by Eq. (9)
9:  end for
10:  return   $\phi^{q}(\cdot)\triangleq\max_{i=1}^{n}(\langle\bm{a}_{i}^{q},\cdot-\bm{x}_{i}\rangle+\hat{y}_{i}^{q})$

Appendix B Derivations

B.1 ADMM Convergence

Theorem 1 gives the convergence of the ADMM procedure. We follow an analysis similar to He & Yuan (2012) and give all steps for the sake of completeness. We first restate the result as follows:

Theorem 1.

Consider the separable convex optimization problem:

\min_{{\bf b}^{1}\in\mathcal{S}_{1},{\bf b}^{2}\in\mathcal{S}_{2}}\left[\psi({\bf b}^{1},{\bf b}^{2})=\psi_{1}({\bf b}^{1})+\psi_{2}({\bf b}^{2})\right]
\text{s.t.: }{\bf A}{\bf b}^{1}+{\bf B}{\bf b}^{2}+\bm{b}=\bm{0},

where $({\bf b}^{1},{\bf b}^{2})$ are the block variables, $(\mathcal{S}_{1},\mathcal{S}_{2})$ are convex sets that include the zero vector, $({\bf A},{\bf B})$ are the coefficient matrices and ${\bf b}$ is a constant vector. Let ${\bf b}^{1}_{t}$ and ${\bf b}^{2}_{t}$ be the solutions at iteration $t$ of a two-block ADMM procedure with learning rate $\rho$, i.e.:

{\bf b}^{1}_{t+1}=\arg\min_{{\bf b}^{1}\in\mathcal{S}_{1}}\psi_{1}({\bf b}^{1})+\frac{\rho}{2}\big\|{\bf A}{\bf b}^{1}+{\bf B}{\bf b}^{2}_{t}+{\bf b}-\frac{1}{\rho}{\bf d}_{t}\big\|^{2}
{\bf b}^{2}_{t+1}=\arg\min_{{\bf b}^{2}\in\mathcal{S}_{2}}\psi_{2}({\bf b}^{2})+\frac{\rho}{2}\big\|{\bf A}{\bf b}^{1}_{t+1}+{\bf B}{\bf b}^{2}+{\bf b}-\frac{1}{\rho}{\bf d}_{t}\big\|^{2}
{\bf d}_{t+1}={\bf d}_{t}-\rho\left({\bf A}{\bf b}^{1}_{t+1}+{\bf B}{\bf b}^{2}_{t+1}+{\bf b}\right),

where ${\bf b}^{1}_{0}={\bf b}^{2}_{0}={\bf d}_{0}=0$.

Denote the average of the iterates as $(\tilde{\bf b}^{1}_{T},\tilde{\bf b}^{2}_{T})=\left(\frac{1}{T}\sum_{t=1}^{T}{\bf b}^{1}_{t},\frac{1}{T}\sum_{t=1}^{T}{\bf b}^{2}_{t}\right)$. For all ${\bm{\kappa}}$ we have

\psi(\tilde{\bf b}^{1}_{T},\tilde{\bf b}^{2}_{T})-\psi({\bf b}^{1}_{*},{\bf b}^{2}_{*})-{\bm{\kappa}}^{T}({\bf A}\tilde{\bf b}^{1}_{T}+{\bf B}\tilde{\bf b}^{2}_{T}+\bm{b})\leq\frac{1}{T}\left(\frac{\rho}{2}\|{\bf B}{\bf b}^{2}_{*}\|^{2}+\frac{1}{2\rho}\|{\bm{\kappa}}\|^{2}\right),

where $({\bf b}^{1}_{*},{\bf b}^{2}_{*})$ are the optimal solutions.

Before presenting the proof, we define a dual-variable-like sequence $\{\overline{\bf d}_{t}\}_{t}$ with the update relation

\overline{\bf d}_{t+1}={\bf d}_{t}-\rho({\bf A}{\bf b}^{1}_{t+1}+{\bf B}{\bf b}^{2}_{t}+{\bf b}).

The only difference between $\overline{\bf d}_{t+1}$ and ${\bf d}_{t+1}$ is due to ${\bf b}^{2}_{t}$ versus ${\bf b}^{2}_{t+1}$. We have the following relation:

\overline{\bf d}_{t+1}-{\bf d}_{t+1}=\rho{\bf B}({\bf b}^{2}_{t+1}-{\bf b}^{2}_{t}). \qquad (18)
Lemma 1.

[from Nagurney (1998)] Let ${\bf b}^{*}=\arg\min_{{\bf b}\in\mathcal{S}}\psi({\bf b})$, where $\psi$ is continuously differentiable and $\mathcal{S}$ is closed and convex. Then ${\bf b}^{*}$ satisfies

\left\langle\nabla\psi\left({\bf b}^{*}\right),{\bf b}-{\bf b}^{*}\right\rangle\geq 0,\quad\quad\forall{\bf b}\in\mathcal{S}.

Applying Lemma 1 to the optimization problems of ADMM gives,

\left\langle\nabla\psi_{1}\left({\bf b}^{1}_{t+1}\right)+\rho{\bf A}^{T}\left({\bf A}{\bf b}^{1}_{t+1}+{\bf B}{\bf b}^{2}_{t}+{\bf b}-\frac{1}{\rho}{\bf d}_{t}\right),{\bf b}^{1}-{\bf b}^{1}_{t+1}\right\rangle\geq 0,\quad\forall{\bf b}^{1}\in\mathcal{S}_{1} \qquad (19)
\left\langle\nabla\psi_{2}\left({\bf b}^{2}_{t+1}\right)+\rho{\bf B}^{T}\left({\bf A}{\bf b}^{1}_{t+1}+{\bf B}{\bf b}^{2}_{t+1}+{\bf b}-\frac{1}{\rho}{\bf d}_{t}\right),{\bf b}^{2}-{\bf b}^{2}_{t+1}\right\rangle\geq 0,\quad\forall{\bf b}^{2}\in\mathcal{S}_{2}. \qquad (20)

Combining Eqs. 19 and 20 with the $\{\overline{\bf d}_{t}\}_{t}$ update rule, we get, $\forall{\bf b}^{1}\in\mathcal{S}_{1},\forall{\bf b}^{2}\in\mathcal{S}_{2}$,

\left\langle\nabla\psi_{1}\left({\bf b}^{1}_{t+1}\right)-{\bf A}^{T}\overline{\bf d}_{t+1},{\bf b}^{1}-{\bf b}^{1}_{t+1}\right\rangle\geq 0,\quad\left\langle\nabla\psi_{2}\left({\bf b}^{2}_{t+1}\right)-{\bf B}^{T}\overline{\bf d}_{t+1}-\rho{\bf B}^{T}{\bf B}({\bf b}^{2}_{t}-{\bf b}^{2}_{t+1}),{\bf b}^{2}-{\bf b}^{2}_{t+1}\right\rangle\geq 0. \qquad (21)

Theorem 1 is a direct consequence of the following lemma.

Lemma 2.

For all ${\bf b}^{1},{\bf b}^{2},{\bm{\kappa}}$ we have,

\psi({\bf b}^{1},{\bf b}^{2})-\psi({\bf b}^{1}_{t+1},{\bf b}^{2}_{t+1})+{\bm{\kappa}}^{T}({\bf A}{\bf b}^{1}_{t+1}+{\bf B}{\bf b}^{2}_{t+1}+\bm{b})-\overline{\bf d}_{t+1}^{T}({\bf A}{\bf b}^{1}+{\bf B}{\bf b}^{2}+{\bf b})+Z_{t}-Z_{t+1}\geq 0,

where $Z_{t}=\frac{\rho}{2}\|{\bf b}^{2}-{\bf b}^{2}_{t}\|_{{\bf B}^{T}{\bf B}}^{2}+\frac{1}{2\rho}\|{\bm{\kappa}}-{\bf d}_{t}\|^{2}$ and $\|{\bf x}\|_{\bf M}^{2}={\bf x}^{T}{\bf M}{\bf x}$.

Let $({\bf b}^{1},{\bf b}^{2})$ be $({\bf b}^{1}_{*},{\bf b}^{2}_{*})$. By definition we have ${\bf A}{\bf b}^{1}_{*}+{\bf B}{\bf b}^{2}_{*}+{\bf b}={\bf 0}$, which cancels an inner product. Rearranging and averaging over time gives

\psi({\bf b}^{1}_{t+1},{\bf b}^{2}_{t+1})-\psi({\bf b}^{1}_{*},{\bf b}^{2}_{*})-{\bm{\kappa}}^{T}({\bf A}{\bf b}^{1}_{t+1}+{\bf B}{\bf b}^{2}_{t+1}+\bm{b})\leq Z_{t}-Z_{t+1}
\frac{1}{T}\left(\sum_{t=0}^{T-1}\psi({\bf b}^{1}_{t+1},{\bf b}^{2}_{t+1})\right)-\psi({\bf b}^{1}_{*},{\bf b}^{2}_{*})-{\bm{\kappa}}^{T}({\bf A}\tilde{\bf b}^{1}_{T}+{\bf B}\tilde{\bf b}^{2}_{T}+\bm{b})\leq\frac{1}{T}\left(\sum_{t=0}^{T-1}Z_{t}-Z_{t+1}\right).

The sum on the RHS telescopes to $Z_{0}-Z_{T}\leq Z_{0}$ since $Z_{t}\geq 0$, so the RHS is at most $\frac{1}{T}Z_{0}$. We lower bound the LHS via Jensen's inequality, $\psi(\tilde{\bf b}^{1}_{T},\tilde{\bf b}^{2}_{T})\leq\frac{1}{T}\sum_{t=0}^{T-1}\psi({\bf b}^{1}_{t+1},{\bf b}^{2}_{t+1})$. Combining the two bounds we get

\psi(\tilde{\bf b}^{1}_{T},\tilde{\bf b}^{2}_{T})-\psi({\bf b}^{1}_{*},{\bf b}^{2}_{*})-{\bm{\kappa}}^{T}({\bf A}\tilde{\bf b}^{1}_{T}+{\bf B}\tilde{\bf b}^{2}_{T}+\bm{b})\leq\frac{1}{T}Z_{0},

which is the statement in Theorem 1 assuming the initial variables are 0s.

We continue with the proof of Lemma 2. We start with the definition of convexity for both $\psi_{1}$ and $\psi_{2}$:

\psi_{1}({\bf b}^{1})-\psi_{1}({\bf b}^{1}_{t+1})-({\bf b}^{1}-{\bf b}^{1}_{t+1})^{T}\nabla\psi_{1}({\bf b}^{1}_{t+1})\geq 0;\quad\psi_{2}({\bf b}^{2})-\psi_{2}({\bf b}^{2}_{t+1})-({\bf b}^{2}-{\bf b}^{2}_{t+1})^{T}\nabla\psi_{2}({\bf b}^{2}_{t+1})\geq 0.

Summing these relations and plugging in Eq. 21 gives,

ψ(𝐛1,𝐛2)ψ(𝐛t+11,𝐛t+12)(𝐛1𝐛t+11)T𝐀T𝐝¯t+1(𝐛2𝐛t+12)T𝐁T𝐝¯t+1ρ(𝐛2𝐛t+12)T𝐁T𝐁(𝐛t2𝐛t+12)T10.\displaystyle\psi({\bf b}^{1},{\bf b}^{2})-\psi({\bf b}^{1}_{t+1},{\bf b}^{2}_{t+1})\underbrace{-({\bf b}^{1}-{\bf b}^{1}_{t+1})^{T}{\bf A}^{T}\overline{\bf d}_{t+1}-({\bf b}^{2}-{\bf b}^{2}_{t+1})^{T}{\bf B}^{T}\overline{\bf d}_{t+1}-\rho({\bf b}^{2}-{\bf b}^{2}_{t+1})^{T}{\bf B}^{T}{\bf B}({\bf b}^{2}_{t}-{\bf b}^{2}_{t+1})}_{T_{1}}\geq 0. (22)

The update rule of 𝐝¯\overline{\bf d} gives 𝐀𝐛t+11+𝐁𝐛t2+𝐛+1ρ(𝐝¯t+1𝐝t)=𝟎{\bf A}{\bf b}^{1}_{t+1}+{\bf B}{\bf b}^{2}_{t}+{\bf b}+\frac{1}{\rho}(\overline{\bf d}_{t+1}-{\bf d}_{t})={\bf 0}. Then for all 𝜿{\bm{\kappa}}, we have (𝜿𝐝¯t+1)T(𝐀𝐛t+11+𝐁𝐛t2+𝐛+1ρ(𝐝¯t+1𝐝t))=0({\bm{\kappa}}-\overline{\bf d}_{t+1})^{T}({\bf A}{\bf b}^{1}_{t+1}+{\bf B}{\bf b}^{2}_{t}+{\bf b}+\frac{1}{\rho}(\overline{\bf d}_{t+1}-{\bf d}_{t}))=0. Adding such a term to T1T_{1} does not change its value. After adding the zero inner product, we rearrange T1T_{1} as

T1=\displaystyle T_{1}= 𝜿T(𝐀𝐛t+11+𝐁𝐛t+12+𝒃)𝐝¯t+1T(𝐀𝐛1+𝐁𝐛2+𝐛)Y1+ρ(𝐛2𝐛t+12)T𝐁T𝐁(𝐛t+12𝐛t2)Y2\displaystyle\underbrace{{\bm{\kappa}}^{T}({\bf A}{\bf b}^{1}_{t+1}+{\bf B}{\bf b}^{2}_{t+1}+\bm{b})-\overline{\bf d}_{t+1}^{T}({\bf A}{\bf b}^{1}+{\bf B}{\bf b}^{2}+{\bf b})}_{Y_{1}}+\rho\underbrace{({\bf b}^{2}-{\bf b}^{2}_{t+1})^{T}{\bf B}^{T}{\bf B}({\bf b}^{2}_{t+1}-{\bf b}^{2}_{t})}_{Y_{2}}
+(𝜿𝐝¯t+1)T(𝐁(𝐛t2𝐛t+12)+1ρ(𝐝¯t+1𝐝t))Y3.\displaystyle+\underbrace{({\bm{\kappa}-\overline{\bf d}_{t+1}})^{T}({\bf B}({\bf b}^{2}_{t}-{\bf b}^{2}_{t+1})+\frac{1}{\rho}(\overline{\bf d}_{t+1}-{\bf d}_{t}))}_{Y_{3}}.

Plugging Eq. 18 in Y3Y_{3} gives Y3=1ρ(𝜿𝐝¯t+1)T(𝐝t+1𝐝t)Y_{3}=\frac{1}{\rho}({\bm{\kappa}-\overline{\bf d}_{t+1}})^{T}({\bf d}_{t+1}-{\bf d}_{t}).

We use the following norm-square identity, which we prove at the end of this subsection:

Lemma 3.

For a symmetric 𝐌=𝐌T{\bf M}={\bf M}^{T} we have,

(𝐱𝐲)T𝐌(𝐳𝐭)=12(𝐱𝐭𝐌2𝐱𝐳𝐌2)+12(𝐲𝐳𝐌2𝐲𝐭𝐌2).({\bf x}-{\bf y})^{T}{\bf M}({\bf z}-{\bf t})=\frac{1}{2}\left(\|{\bf x}-{\bf t}\|_{{\bf M}}^{2}-\|{\bf x}-{\bf z}\|_{{\bf M}}^{2}\right)+\frac{1}{2}\left(\|{\bf y}-{\bf z}\|_{{\bf M}}^{2}-\|{\bf y}-{\bf t}\|_{{\bf M}}^{2}\right).

Using Lemma 3, we get

Y2=12(𝐛2𝐛t2𝐁𝐓𝐁2𝐛2𝐛t+12𝐁𝐓𝐁2)12𝐛t+12𝐛t2𝐁𝐓𝐁2Y_{2}=\frac{1}{2}\left(\|{\bf b}^{2}-{\bf b}^{2}_{t}\|_{{\bf{\bf B}^{T}{\bf B}}}^{2}-\|{\bf b}^{2}-{\bf b}^{2}_{t+1}\|_{{\bf{\bf B}^{T}{\bf B}}}^{2}\right)-\frac{1}{2}\|{\bf b}^{2}_{t+1}-{\bf b}^{2}_{t}\|_{{\bf{\bf B}^{T}{\bf B}}}^{2}

Y3=1ρ12(𝜿𝐝t𝐈2𝜿𝐝t+1𝐈2)+1ρ12(𝐝¯t+1𝐝t+1𝐈2𝐝¯t+1𝐝t𝐈2)Y_{3}=\frac{1}{\rho}\frac{1}{2}\left(\|{\bm{\kappa}}-{\bf d}_{t}\|_{{\bf I}}^{2}-\|{\bm{\kappa}}-{\bf d}_{t+1}\|_{{\bf I}}^{2}\right)+\frac{1}{\rho}\frac{1}{2}\left(\|\overline{\bf d}_{t+1}-{\bf d}_{t+1}\|_{{\bf I}}^{2}-\|\overline{\bf d}_{t+1}-{\bf d}_{t}\|_{{\bf I}}^{2}\right),

since 𝐁𝐓𝐁{\bf{\bf B}^{T}{\bf B}} and 𝐈{\bf I} are symmetric matrices.

We combine Y2Y_{2} and Y3Y_{3} as

\displaystyle\rho Y_{2}+Y_{3}= \frac{\rho}{2}\left(\|{\bf b}^{2}-{\bf b}^{2}_{t}\|_{{\bf B}^{T}{\bf B}}^{2}-\|{\bf b}^{2}-{\bf b}^{2}_{t+1}\|_{{\bf B}^{T}{\bf B}}^{2}\right)+\frac{1}{2\rho}\left(\|{\bm{\kappa}}-{\bf d}_{t}\|_{{\bf I}}^{2}-\|{\bm{\kappa}}-{\bf d}_{t+1}\|_{{\bf I}}^{2}\right)
\displaystyle\underbrace{-\frac{1}{2\rho}\|\overline{\bf d}_{t+1}-{\bf d}_{t}\|_{{\bf I}}^{2}}_{\leq 0}+\underbrace{\frac{1}{2\rho}\left(\|\overline{\bf d}_{t+1}-{\bf d}_{t+1}\|_{{\bf I}}^{2}-\rho^{2}\|{\bf b}^{2}_{t+1}-{\bf b}^{2}_{t}\|_{{\bf B}^{T}{\bf B}}^{2}\right)}_{=0\text{ due to Eq. \ref{eq.tilde}}}
\displaystyle\leq \frac{\rho}{2}\left(\|{\bf b}^{2}-{\bf b}^{2}_{t}\|_{{\bf B}^{T}{\bf B}}^{2}-\|{\bf b}^{2}-{\bf b}^{2}_{t+1}\|_{{\bf B}^{T}{\bf B}}^{2}\right)+\frac{1}{2\rho}\left(\|{\bm{\kappa}}-{\bf d}_{t}\|_{{\bf I}}^{2}-\|{\bm{\kappa}}-{\bf d}_{t+1}\|_{{\bf I}}^{2}\right).

Plugging it back to T1T_{1},

\displaystyle T_{1}=Y_{1}+\rho Y_{2}+Y_{3}\leq Y_{1}+\frac{\rho}{2}\left(\|{\bf b}^{2}-{\bf b}^{2}_{t}\|_{{\bf B}^{T}{\bf B}}^{2}-\|{\bf b}^{2}-{\bf b}^{2}_{t+1}\|_{{\bf B}^{T}{\bf B}}^{2}\right)+\frac{1}{2\rho}\left(\|{\bm{\kappa}}-{\bf d}_{t}\|_{{\bf I}}^{2}-\|{\bm{\kappa}}-{\bf d}_{t+1}\|_{{\bf I}}^{2}\right)\leq Y_{1}+{Z}_{t}-{Z}_{t+1}. (23)

Upper bounding the $T_{1}$ term in Eq. 22 using Eq. 23 gives

ψ(𝐛1,𝐛2)ψ(𝐛t+11,𝐛t+12)+𝜿T(𝐀𝐛t+11+𝐁𝐛t+12+𝒃)𝐝¯t+1T(𝐀𝐛1+𝐁𝐛2+𝐛)+ZtZt+10,\psi({\bf b}^{1},{\bf b}^{2})-\psi({\bf b}^{1}_{t+1},{\bf b}^{2}_{t+1})+{\bm{\kappa}}^{T}({\bf A}{\bf b}^{1}_{t+1}+{\bf B}{\bf b}^{2}_{t+1}+\bm{b})-\overline{\bf d}_{t+1}^{T}({\bf A}{\bf b}^{1}+{\bf B}{\bf b}^{2}+{\bf b})+{Z}_{t}-{Z}_{t+1}\geq{0},

which is the statement in Lemma 2.\square

Proof of Lemma 1 [from Nagurney (1998)].

Let $\phi(t)=\psi({\bf b}^{*}+t({\bf b}-{\bf b}^{*}))$ for $t\in[0,1]$ and ${\bf b}\in\mathcal{S}$. Since $\mathcal{S}$ is convex and closed, ${\bf b}^{*}+t({\bf b}-{\bf b}^{*})\in\mathcal{S}$, and $\phi$ achieves its minimum at $t=0$. Hence $0\leq\phi^{\prime}(0)=\left\langle\nabla\psi\left({\bf b}^{*}\right),{\bf b}-{\bf b}^{*}\right\rangle.\square$

Proof of Lemma 3.

Expand RHS as

RHS\displaystyle RHS =12(𝐭T𝐌𝐭2𝐱T𝐌𝐭𝐳T𝐌𝐳+2𝐱T𝐌𝐳)+12(𝐳T𝐌𝐳2𝐲T𝐌𝐳𝐭T𝐌𝐭+2𝐭T𝐌𝐲)\displaystyle=\frac{1}{2}\left({\bf t}^{T}{\bf M}{\bf t}-2{\bf x}^{T}{\bf M}{\bf t}-{\bf z}^{T}{\bf M}{\bf z}+2{\bf x}^{T}{\bf M}{\bf z}\right)+\frac{1}{2}\left({\bf z}^{T}{\bf M}{\bf z}-2{\bf y}^{T}{\bf M}{\bf z}-{\bf t}^{T}{\bf M}{\bf t}+2{\bf t}^{T}{\bf M}{\bf y}\right)
=𝐱T𝐌𝐭+𝐱T𝐌𝐳𝐲T𝐌𝐳+𝐭T𝐌𝐲\displaystyle=-{\bf x}^{T}{\bf M}{\bf t}+{\bf x}^{T}{\bf M}{\bf z}-{\bf y}^{T}{\bf M}{\bf z}+{\bf t}^{T}{\bf M}{\bf y}
=(𝐱𝐲)T𝐌(𝐳𝐭).\displaystyle=({\bf x}-{\bf y})^{T}{\bf M}({\bf z}-{\bf t}).

We reach the LHSLHS. \square
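As a quick numerical sanity check of Lemma 3 (outside the formal argument), the identity can be verified for random vectors and a random symmetric matrix. The NumPy sketch below is illustrative only; all names are ours.

import numpy as np

rng = np.random.default_rng(0)
m = 5
M = rng.standard_normal((m, m))
M = M + M.T                          # symmetric: M = M^T

def sqnorm(v, M):
    # ||v||_M^2 = v^T M v
    return v @ M @ v

x, y, z, t = rng.standard_normal((4, m))
lhs = (x - y) @ M @ (z - t)
rhs = 0.5 * (sqnorm(x - t, M) - sqnorm(x - z, M)) + 0.5 * (sqnorm(y - z, M) - sqnorm(y - t, M))
assert np.isclose(lhs, rhs)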

B.2 Proof for Theorem 2 (computational complexity)

Theorem 2.

Let $\{\hat{y}_{i}^{t},\bm{a}_{i}^{t}\}_{i=1}^{n}$ be the output of Algorithm 2 at the $t$-th iteration, $\tilde{y_{i}}\triangleq\frac{1}{T}\sum_{t=1}^{T}\hat{y}_{i}^{t}$ and $\tilde{\bm{a}_{i}}\triangleq\frac{1}{T}\sum_{t=1}^{T}\bm{a}_{i}^{t}$. Denote $\tilde{f}_{T}(\bm{x})\triangleq\max_{i}\langle\tilde{\bm{a}_{i}},\bm{x}-\bm{x}_{i}\rangle+\tilde{y_{i}}$. Assume $\max_{i,l}|x_{i,l}|\leq 1$ and $\mathbbm{Var}(\{y_{i}\}_{i=1}^{n})\leq 1$. If we choose $\rho=\frac{\sqrt{d}\lambda^{2}}{n}$, for $\lambda\geq\frac{3}{\sqrt{2nd}}$ and $T\geq n\sqrt{d}$, we have:

1ni(f~T(𝒙i)yi)2+λf~Tminf^(1ni(f^(𝒙i)yi)2+λf^)+6ndT+1.\displaystyle\frac{1}{n}\sum_{i}(\tilde{f}_{T}(\bm{x}_{i}){-}y_{i})^{2}{+}\lambda\|\tilde{f}_{T}\|\leq\min_{\hat{f}}\bigg{(}\frac{1}{n}\sum_{i}\big{(}\hat{f}(\bm{x}_{i})-y_{i}\big{)}^{2}+\lambda\|\hat{f}\|\bigg{)}+\frac{6n\sqrt{d}}{T+1}.
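For concreteness, the averaged estimator $\tilde{f}_{T}$ defined above can be evaluated directly from the averaged iterates. The NumPy sketch below assumes arrays `X` ($n\times d$), `a_tilde` ($n\times d$) and `y_tilde` (length $n$) holding $\bm{x}_{i}$, $\tilde{\bm{a}_{i}}$ and $\tilde{y_{i}}$; the array names are ours, not the paper's code.

import numpy as np

def f_tilde(x, X, a_tilde, y_tilde):
    """Evaluate f_tilde_T(x) = max_i <a_tilde_i, x - x_i> + y_tilde_i."""
    vals = a_tilde @ x - np.sum(a_tilde * X, axis=1) + y_tilde   # one value per linear piece i
    return np.max(vals)

# predictions on query points Q (m x d):
# preds = np.array([f_tilde(q, X, a_tilde, y_tilde) for q in Q])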

For the proof, we make extensive use of Theorem 1, which provides the convergence rate of a general 2-block ADMM for a separable convex program with linear constraints:

min𝐛1𝒮1,𝐛2𝒮2[ψ(𝐛1,𝐛2)=ψ1(𝐛1)+ψ2(𝐛2)]\displaystyle\min_{{\bf b}^{1}\in\mathcal{S}_{1},{\bf b}^{2}\in\mathcal{S}_{2}}\left[\psi({\bf b}^{1},{\bf b}^{2})=\psi_{1}({\bf b}^{1})+\psi_{2}({\bf b}^{2})\right] (24)
s.t:𝐀𝐛1+𝐁𝐛2+𝒃=𝟎.\displaystyle\quad s.t:{\bf A}{\bf b}^{1}+{\bf B}{\bf b}^{2}+\bm{b}=\bm{0}.

It guarantees that, for all ${\bm{\kappa}}$, the averaged iterates $(\tilde{\bf b}^{1}_{T},\tilde{\bf b}^{2}_{T})$ of a 2-block ADMM satisfy

ψ(𝐛~T1,𝐛~T2)ψ(𝐛1,𝐛2)𝜿T(𝐀𝐛~T1+𝐁𝐛~T2+𝒃)1T(ρ2𝐁𝐛22+12ρ𝜿2),\displaystyle\psi(\tilde{\bf b}^{1}_{T},\tilde{\bf b}^{2}_{T})-\psi({\bf b}^{1}_{*},{\bf b}^{2}_{*})-{\bm{\kappa}}^{T}({\bf A}\tilde{\bf b}^{1}_{T}+{\bf B}\tilde{\bf b}^{2}_{T}+\bm{b})\leq\frac{1}{T}\left(\frac{\rho}{2}\|{\bf B}{\bf b}^{2}_{*}\|^{2}+\frac{1}{2\rho}\|{\bm{\kappa}}\|^{2}\right), (25)

where $({\bf b}^{1}_{*},{\bf b}^{2}_{*})$ are the optimal solutions of program (24).
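To make the iteration analyzed by Theorem 1 concrete, the sketch below spells out a generic 2-block ADMM for program (24), in the sign convention used in the proof above (dual update ${\bf d}_{t+1}={\bf d}_{t}-\rho({\bf A}{\bf b}^{1}_{t+1}+{\bf B}{\bf b}^{2}_{t+1}+\bm{b})$). The block minimizers `argmin_b1` and `argmin_b2` are abstract placeholders for the closed-form updates; this is a schematic illustration, not the paper's implementation.

import numpy as np

def admm_2block(A, B, b, argmin_b1, argmin_b2, rho, T, n1, n2):
    """Generic 2-block ADMM; returns the averaged iterates appearing in Theorem 1."""
    b1, b2 = np.zeros(n1), np.zeros(n2)
    d = np.zeros(b.shape[0])                      # one dual variable per constraint row
    b1_avg, b2_avg = np.zeros(n1), np.zeros(n2)
    for _ in range(T):
        # b1-step: minimize psi_1(b1) + (rho/2)||A b1 + B b2 + b - d/rho||^2 over S_1
        b1 = argmin_b1(b2, d, rho)
        # b2-step: minimize psi_2(b2) + (rho/2)||A b1 + B b2 + b - d/rho||^2 over S_2
        b2 = argmin_b2(b1, d, rho)
        # dual step
        d = d - rho * (A @ b1 + B @ b2 + b)
        b1_avg += b1 / T
        b2_avg += b2 / T
    return b1_avg, b2_avg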

However, in Eq. (5) we are solving a specific instance of program (24) of the form

miny^i,𝒂i,𝒑i,Ll0,𝒖i0,si,j0[ψ(y^i,𝒂i,𝒑i,Ll,𝒖i,si,j)=1ni=1n(y^iyi)2+λl=1dLl]\displaystyle\min_{\hat{y}_{i},\bm{a}_{i},\bm{p}_{i},L_{l}\geq 0,\bm{u}_{i}\geq 0,s_{i,j}\geq 0}\left[\psi(\hat{y}_{i},\bm{a}_{i},\bm{p}_{i},L_{l},\bm{u}_{i},s_{i,j}){=}\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_{i}-y_{i})^{2}{+}\lambda\sum_{l=1}^{d}L_{l}\right] (26)
s.t.{si,j+y^iy^j𝒂j,𝒙i𝒙j=0i,j[n]×[n]ui,l+|pi,l|Ll=0i,l[n]×[d]ai,lpi,l=0i,l[n]×[d].\displaystyle\textrm{s.t.}\begin{cases}-s_{i,j}+\hat{y}_{i}-\hat{y}_{j}-\langle\bm{a}_{j},\bm{x}_{i}-\bm{x}_{j}\rangle=0&i,j\in[n]\times[n]\\ u_{i,l}+|p_{i,l}|-L_{l}=0&\,i,l\in[n]\times[d]\\ a_{i,l}-p_{i,l}=0&\,i,l\in[n]\times[d].\end{cases}
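After eliminating the slack variables ($s_{i,j}\geq 0$, $u_{i,l}\geq 0$, $p_{i,l}=a_{i,l}$), program (26) is the convex regression problem $\min\frac{1}{n}\sum_{i}(\hat{y}_{i}-y_{i})^{2}+\lambda\sum_{l}L_{l}$ subject to $\hat{y}_{i}\geq\hat{y}_{j}+\langle\bm{a}_{j},\bm{x}_{i}-\bm{x}_{j}\rangle$ and $|a_{i,l}|\leq L_{l}$. On small instances its optimal value can be cross-checked against the ADMM output with a generic solver; the CVXPY sketch below is only such a reference check under these assumptions, not the paper's ADMM solver.

import cvxpy as cp
import numpy as np

def convex_regression_reference(X, y, lam):
    """Solve the slack-eliminated form of program (26) with a generic solver (small n only)."""
    n, d = X.shape
    yhat = cp.Variable(n)
    a = cp.Variable((n, d))
    L = cp.Variable(d, nonneg=True)
    cons = [cp.max(cp.abs(a), axis=0) <= L]          # |a_{i,l}| <= L_l for all i, l
    for j in range(n):
        # yhat_i >= yhat_j + <a_j, x_i - x_j> for all i
        cons.append(yhat >= yhat[j] + (X - X[j]) @ a[j])
    obj = cp.Minimize(cp.sum_squares(yhat - y) / n + lam * cp.sum(L))
    prob = cp.Problem(obj, cons)
    prob.solve()
    return yhat.value, a.value, L.value, prob.value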

To match notation between (26) and (24), denote the optimal and averaged ADMM solutions of program (26) as

𝐛1\displaystyle{\bf b}^{1}_{*} =[y^1,,y^n,a1,1,a1,2,,an,d]T\displaystyle=[\hat{y}_{1},\dots,\hat{y}_{n},a_{1,1},a_{1,2},\dots,a_{n,d}]^{T}
𝐛2\displaystyle{\bf b}^{2}_{*} =[L1,,Ld,p1,1,,pn,d,u1,1,,un,d,s1,1,,sn,n]T\displaystyle=[L_{1},\dots,L_{d},p_{1,1},\dots,p_{n,d},u_{1,1},\dots,u_{n,d},s_{1,1},\dots,s_{n,n}]^{T}
𝐛~T1\displaystyle\tilde{\bf b}^{1}_{T} =[y~1,,y~n,a~1,1,a~1,2,,a~n,d]T\displaystyle=[\tilde{y}_{1},\dots,\tilde{y}_{n},\tilde{a}_{1,1},\tilde{a}_{1,2},\dots,\tilde{a}_{n,d}]^{T}
𝐛~T2\displaystyle\tilde{\bf b}^{2}_{T} =[L~1,,L~d,p~1,1,,p~n,d,u~1,1,,u~n,d,s~1,1,,s~n,n]T.\displaystyle=[\tilde{L}_{1},\dots,\tilde{L}_{d},\tilde{p}_{1,1},\dots,\tilde{p}_{n,d},\tilde{u}_{1,1},\dots,\tilde{u}_{n,d},\tilde{s}_{1,1},\dots,\tilde{s}_{n,n}]^{T}. (27)

Therefore, the separable losses evaluated at the averaged ADMM iterates of program (26) are

ψ1(𝐛~T1)=1ni=1n(y~iyi)2,\displaystyle\psi_{1}(\tilde{\bf b}^{1}_{T})=\frac{1}{n}\sum_{i=1}^{n}(\tilde{y}_{i}-y_{i})^{2}, (28)
ψ2(𝐛~T2)=λl=1dL~l.\displaystyle\psi_{2}(\tilde{\bf b}^{2}_{T})=\lambda\sum_{l=1}^{d}\tilde{L}_{l}.

Note that a convergence rate for $\psi_{1}(\tilde{\bf b}^{1}_{T})+\psi_{2}(\tilde{\bf b}^{2}_{T})$ is available by setting $\bm{\kappa}=0$ in (25). However, this is not sufficient for convergence of our objective $\frac{1}{n}\sum_{i}(\tilde{f}_{T}(\bm{x}_{i})-y_{i})^{2}+\lambda\|\tilde{f}_{T}\|$, because $(\tilde{\bf b}^{1}_{T},\tilde{\bf b}^{2}_{T})$ may violate the constraints in (26), which could result in $\tilde{f}_{T}(\bm{x}_{i})\neq\tilde{y}_{i}$ and $\|\tilde{f}_{T}\|\neq\sum_{l=1}^{d}\tilde{L}_{l}$. Therefore, the main step of the proof of Theorem 2 is to characterize and bound the effect of such constraint violations on our objective. Let us specify how much each linear constraint in program (26) is violated:

εi,j1\displaystyle\varepsilon^{1}_{i,j} s~i,j+y~iy~j𝒂j~,𝒙i𝒙j,\displaystyle\triangleq-\tilde{s}_{i,j}+\tilde{y}_{i}-\tilde{y}_{j}-\langle\tilde{\bm{a}_{j}},\bm{x}_{i}-\bm{x}_{j}\rangle, (29)
εi,l2\displaystyle\varepsilon^{2}_{i,l} u~i,l+|p~i,l|L~l,\displaystyle\triangleq\tilde{u}_{i,l}+|\tilde{p}_{i,l}|-\tilde{L}_{l},
εi,l3\displaystyle\varepsilon^{3}_{i,l} a~i,lp~i,l.\displaystyle\triangleq\tilde{a}_{i,l}-\tilde{p}_{i,l}.
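These residuals are easy to compute from the averaged iterates. The NumPy sketch below does so with broadcasting, assuming arrays `X`, `y_tilde`, `a_tilde`, `p_tilde`, `u_tilde`, `s_tilde` and `L_tilde` holding the averaged quantities; the names are illustrative, not taken from the paper's code.

import numpy as np

def constraint_violations(X, y_tilde, a_tilde, p_tilde, u_tilde, s_tilde, L_tilde):
    """Compute eps1 (n x n), eps2 (n x d), eps3 (n x d) as defined in Eq. (29)."""
    # inner[i, j] = <a_tilde_j, x_i - x_j>
    inner = X @ a_tilde.T - np.sum(a_tilde * X, axis=1)[None, :]
    eps1 = -s_tilde + y_tilde[:, None] - y_tilde[None, :] - inner
    eps2 = u_tilde + np.abs(p_tilde) - L_tilde[None, :]
    eps3 = a_tilde - p_tilde
    return eps1, eps2, eps3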

We break the proof into smaller parts by first providing some intermediate lemmas.

Lemma 4.

ψ(𝐛1,𝐛2)=1ni(y^iyi)2+λlLl1\psi({\bf b}^{1}_{*},{\bf b}^{2}_{*})=\frac{1}{n}\sum_{i}(\hat{y}_{i}-y_{i})^{2}+\lambda\sum_{l}L_{l}\leq 1.

Proof.

Note that ${\bf b}^{1}={\bf b}^{2}=\bm{0}$ is a feasible solution and $(\hat{y}_{i},L_{l})$ is the optimal solution. Also, Algorithm 2 normalizes the dataset so that $\sum_{i=1}^{n}y_{i}=0$; therefore

ψ(𝐛1,𝐛2)\displaystyle\psi({\bf b}^{1}_{*},{\bf b}^{2}_{*}) ψ(𝟎,𝟎)\displaystyle\leq\psi(\bm{0},\bm{0})
=1ni=1nyi2=𝕍𝕒𝕣({yi}i=1n)1.\displaystyle=\frac{1}{n}\sum_{i=1}^{n}y_{i}^{2}=\mathbbm{Var}(\{y_{i}\}_{i=1}^{n})\leq 1.
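For reference, one simple way to bring a dataset into the regime assumed by Theorem 2 ($\sum_{i}y_{i}=0$, $\mathbbm{Var}(\{y_{i}\})\leq 1$, $\max_{i,l}|x_{i,l}|\leq 1$) is sketched below. This is only an illustration of the assumptions and not necessarily the exact normalization performed by Algorithm 2; note that rescaling $y$ also rescales the effective $\lambda$.

import numpy as np

def normalize(X, y):
    """Center y, bound its variance by 1, and scale each feature into [-1, 1]."""
    y = y - y.mean()                                   # sum_i y_i = 0
    y = y / max(1.0, y.std())                          # Var(y) <= 1
    scale = np.maximum(1.0, np.abs(X).max(axis=0))     # per-feature max |x_{i,l}|
    X = X / scale                                      # max_{i,l} |x_{i,l}| <= 1
    return X, y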

Lemma 5.

𝐁𝐛22218n2λ2\|{\bf B}{\bf b}^{2}_{*}\|_{2}^{2}\leq 18\frac{n^{2}}{\lambda^{2}}.

Proof.

Since $(p_{i,l})$ is part of a feasible solution, the constraints $u_{i,l}+|p_{i,l}|-L_{l}=0$ and $u_{i,l}\geq 0$ give $|p_{i,l}|\leq L_{l}$. We also have:

0si,j\displaystyle 0\leq s_{i,j} =y^iy^j𝒂j,𝒙i𝒙j\displaystyle=\hat{y}_{i}-\hat{y}_{j}-\langle\bm{a}_{j},\bm{x}_{i}-\bm{x}_{j}\rangle (by constraints definitions)
𝒂i𝒂j,𝒙i𝒙j\displaystyle\leq\langle\bm{a}_{i}-\bm{a}_{j},\bm{x}_{i}-\bm{x}_{j}\rangle (convexity)
\displaystyle\leq 2\sum_{l}L_{l}|x_{i,l}-x_{j,l}| \qquad(\max_{i}|a_{i,l}|\leq L_{l})
4lLl\displaystyle\leq 4\sum_{l}L_{l} (maxi|xi,l|1).\displaystyle\text{($\max_{i}|x_{i,l}|\leq 1)$}.

Arranging the constraints in (26) into $(n^{2}+nd+nd)$ rows and separating the coefficients of ${\bf b}^{1}_{*}$ and ${\bf b}^{2}_{*}$ into ${\bf A}$ and ${\bf B}$, in that order, we have:

𝐀𝐛1+𝐁𝐛2=0\displaystyle{\bf A}{\bf b}^{1}_{*}+{\bf B}{\bf b}^{2}_{*}=0
𝐁𝐛2=[s1,1,,sn,n,01,1,,0n,d,p1,1,,pn,d]T.\displaystyle{\bf B}{\bf b}^{2}_{*}=[-s_{1,1},\dots,-s_{n,n},0_{1,1},\dots,0_{n,d},-p_{1,1},\dots,-p_{n,d}]^{T}.

Taking the norm

𝐁𝐛222\displaystyle\|{\bf B}{\bf b}^{2}_{*}\|_{2}^{2} =i,jsi,j2+i,lpi,l2\displaystyle=\sum_{i,j}s_{i,j}^{2}+\sum_{i,l}p_{i,l}^{2}
n2(4l=1dLl)2+nl=1dLl2\displaystyle\leq n^{2}(4\sum_{l=1}^{d}L_{l})^{2}+n\sum_{l=1}^{d}L_{l}^{2}
18n2λ2.\displaystyle\leq 18\frac{n^{2}}{\lambda^{2}}. (Lemma 4)

Lemma 6.

$0\leq\tilde{f}_{T}(\bm{x}_{i})-\tilde{y}_{i}\leq\max_{j}(-\varepsilon^{1}_{i,j})\leq\max_{j}|\varepsilon^{1}_{i,j}|$.

Proof.
\displaystyle\tilde{y}_{i}\leq\tilde{f}_{T}(\bm{x}_{i})=\max_{j}\langle\tilde{\bm{a}}_{j},\bm{x}_{i}-\bm{x}_{j}\rangle+\tilde{y}_{j} \qquad\text{(definition)}
\displaystyle=\max_{j}(\tilde{y}_{i}-\tilde{s}_{i,j}-\varepsilon^{1}_{i,j}) \qquad\text{($\varepsilon^{1}_{i,j}$ definition)}
\displaystyle\leq\tilde{y}_{i}+\max_{j}(-\varepsilon^{1}_{i,j})\leq\tilde{y}_{i}+\max_{j}|\varepsilon^{1}_{i,j}|. \qquad\text{($\tilde{s}_{i,j}\geq 0$)}

Lemma 7.
1ni(f~T(𝒙i)yi)21ni(y~iyi)2+1nimaxj(εi,j1)2+21nimaxj(εi,j1)21+1T9ρn2λ2.\displaystyle\frac{1}{n}\sum_{i}\big{(}\tilde{f}_{T}(\bm{x}_{i})-y_{i}\big{)}^{2}\leq\frac{1}{n}\sum_{i}(\tilde{y}_{i}-y_{i})^{2}+\frac{1}{n}\sum_{i}\max_{j}(\varepsilon^{1}_{i,j})^{2}+2\sqrt{\frac{1}{n}\sum_{i}\max_{j}(\varepsilon^{1}_{i,j})^{2}}\sqrt{1+\frac{1}{T}\frac{9\rho n^{2}}{\lambda^{2}}}.
Proof.
1ni(f~T(𝒙i)yi)2\displaystyle\frac{1}{n}\sum_{i}\big{(}\tilde{f}_{T}(\bm{x}_{i})-y_{i}\big{)}^{2} =1ni(f~T(𝒙i)y~i+y~iyi)2\displaystyle=\frac{1}{n}\sum_{i}\big{(}\tilde{f}_{T}(\bm{x}_{i})-\tilde{y}_{i}+\tilde{y}_{i}-y_{i}\big{)}^{2}
\displaystyle=\frac{1}{n}\sum_{i}(\tilde{y}_{i}-y_{i})^{2}+\frac{1}{n}\sum_{i}\big{(}\tilde{f}_{T}(\bm{x}_{i})-\tilde{y}_{i}\big{)}^{2}+\frac{2}{n}\sum_{i}\big{(}\tilde{f}_{T}(\bm{x}_{i})-\tilde{y}_{i}\big{)}\big{(}\tilde{y}_{i}-y_{i}\big{)}
\displaystyle\leq\frac{1}{n}\sum_{i}(\tilde{y}_{i}-y_{i})^{2}+\frac{1}{n}\sum_{i}\max_{j}(\varepsilon^{1}_{i,j})^{2}+\frac{2}{n}\sum_{i}\max_{j}|\varepsilon^{1}_{i,j}|\,\big{|}\tilde{y}_{i}-y_{i}\big{|}. \qquad\text{(Lemma 6)}

Let us focus on the last term on the RHS.

\displaystyle\frac{\left(\frac{1}{n}\sum_{i}\max_{j}|\varepsilon^{1}_{i,j}|\,|\tilde{y}_{i}-y_{i}|\right)^{2}}{\frac{1}{n}\sum_{i}\max_{j}(\varepsilon^{1}_{i,j})^{2}}\leq\frac{1}{n}\sum_{i}(\tilde{y}_{i}-y_{i})^{2} \qquad\text{(Cauchy--Schwarz)}
1ni(y~iyi)2+λlL~l\displaystyle\leq\frac{1}{n}\sum_{i}(\tilde{y}_{i}-y_{i})^{2}+\lambda\sum_{l}\tilde{L}_{l} (L~l0\tilde{L}_{l}\geq 0)
1ni(y^iyi)2+λlLl+1Tρ2𝐁𝐛22\displaystyle\leq\frac{1}{n}\sum_{i}(\hat{y}_{i}-y_{i})^{2}+\lambda\sum_{l}L_{l}+\frac{1}{T}\frac{\rho}{2}\|{\bf B}{\bf b}^{2}_{*}\|^{2} (Use Theorem 1 with 𝜿=0\bm{\kappa}=0)
1+1Tρ2𝐁𝐛22\displaystyle\leq 1+\frac{1}{T}\frac{\rho}{2}\|{\bf B}{\bf b}^{2}_{*}\|^{2} (Lemma 4)
1+1T9ρn2λ2.\displaystyle\leq 1+\frac{1}{T}\frac{9\rho n^{2}}{\lambda^{2}}. (Lemma 5)

On the other hand, for the regularization term we have:

Lemma 8.

$\|\tilde{f}_{T}\|\leq\sum_{l}\left(|\tilde{L}_{l}|+\max_{i}|\varepsilon^{2}_{i,l}|+\max_{i}|\varepsilon^{3}_{i,l}|\right)$.

Proof.
f~T\displaystyle\|\tilde{f}_{T}\| =lmaxi|𝒂~i,l|\displaystyle=\sum_{l}\max_{i}|\tilde{\bm{a}}_{i,l}| (definition)
=lmaxi|𝒑~i,l+εi,l3|\displaystyle=\sum_{l}\max_{i}|\tilde{\bm{p}}_{i,l}+\varepsilon^{3}_{i,l}| (εi,l3\varepsilon^{3}_{i,l} definition)
\displaystyle\leq\sum_{l}\left(|\tilde{L}_{l}|+\max_{i}|\varepsilon^{2}_{i,l}|+\max_{i}|\varepsilon^{3}_{i,l}|\right). \qquad\text{($\varepsilon^{2}_{i,l}$ definition and $\tilde{u}_{i,l}\geq 0$)}
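Lemma 8 can also be verified numerically for a given run: using the residuals of Eq. (29) and the averaged iterates, the sketch below compares $\|\tilde{f}_{T}\|=\sum_{l}\max_{i}|\tilde{a}_{i,l}|$ with the bound of Lemma 8 (array names as in the earlier sketches; illustrative only).

import numpy as np

def check_lemma8(a_tilde, L_tilde, eps2, eps3):
    penalty = np.sum(np.max(np.abs(a_tilde), axis=0))        # ||f_tilde_T||
    bound = np.sum(np.abs(L_tilde)
                   + np.max(np.abs(eps2), axis=0)
                   + np.max(np.abs(eps3), axis=0))            # RHS of Lemma 8
    return penalty, bound, penalty <= bound + 1e-9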

Next, we combine the previous lemmas to prove Theorem 2. Before proceeding, let us incorporate the definitions of the constraint violations in Eq. (29) into ${\bm{\kappa}}^{T}({\bf A}\tilde{\bf b}^{1}_{T}+{\bf B}\tilde{\bf b}^{2}_{T}+\bm{b})$ to get:

𝜿T(𝐀𝐛~T1+𝐁𝐛~T2+𝒃)=𝜿1T𝜺1+𝜿2T𝜺2+𝜿3T𝜺3,\displaystyle\bm{\kappa}^{T}({\bf A}\tilde{\bf b}^{1}_{T}+{\bf B}\tilde{\bf b}^{2}_{T}+\bm{b})={\bm{\kappa}^{1}}^{T}\bm{\varepsilon}^{1}+{\bm{\kappa}^{2}}^{T}\bm{\varepsilon}^{2}+{\bm{\kappa}^{3}}^{T}\bm{\varepsilon}^{3}, (30)

where

\displaystyle\bm{\kappa}^{1}\triangleq[\kappa^{1}_{1,1},\dots,\kappa^{1}_{n,n}]^{T},\qquad\bm{\kappa}^{2}\triangleq[\kappa^{2}_{1,1},\dots,\kappa^{2}_{n,d}]^{T},\qquad\bm{\kappa}^{3}\triangleq[\kappa^{3}_{1,1},\dots,\kappa^{3}_{n,d}]^{T}
\displaystyle\bm{\varepsilon}^{1}\triangleq[\varepsilon^{1}_{1,1},\dots,\varepsilon^{1}_{n,n}]^{T},\qquad\bm{\varepsilon}^{2}\triangleq[\varepsilon^{2}_{1,1},\dots,\varepsilon^{2}_{n,d}]^{T},\qquad\bm{\varepsilon}^{3}\triangleq[\varepsilon^{3}_{1,1},\dots,\varepsilon^{3}_{n,d}]^{T}.

Substituting (30) in (25) we have:

Δ\displaystyle\Delta ψ(𝐛~T1,𝐛~T2)ψ(𝐛1,𝐛2)𝜿1T𝜺1𝜿2T𝜺2𝜿3T𝜺3\displaystyle\triangleq\psi(\tilde{\bf b}^{1}_{T},\tilde{\bf b}^{2}_{T})-\psi({\bf b}^{1}_{*},{\bf b}^{2}_{*})-{\bm{\kappa}^{1}}^{T}\bm{\varepsilon}^{1}-{\bm{\kappa}^{2}}^{T}\bm{\varepsilon}^{2}-{\bm{\kappa}^{3}}^{T}\bm{\varepsilon}^{3}
1T(ρ2𝐁𝐛22+12ρ𝜿12+12ρ𝜿22+12ρ𝜿32).\displaystyle\leq\frac{1}{T}\left(\frac{\rho}{2}\|{\bf B}{\bf b}^{2}_{*}\|^{2}+\frac{1}{2\rho}\|{\bm{\kappa}^{1}}\|^{2}+\frac{1}{2\rho}\|{\bm{\kappa}^{2}}\|^{2}+\frac{1}{2\rho}\|{\bm{\kappa}^{3}}\|^{2}\right). (31)
Proof.

Multiply both sides of Lemma 8 by $\lambda$, add the result to Lemma 7, and subtract $\psi({\bf b}^{1}_{*},{\bf b}^{2}_{*})$, the optimal value of program (26), i.e.\ $\min_{\hat{f}}\big{(}\frac{1}{n}\sum_{i}(\hat{f}(\bm{x}_{i})-y_{i})^{2}+\lambda\|\hat{f}\|\big{)}$. The total approximation error due to the ADMM optimization scheme is then the $LHS$:

LHS\displaystyle LHS 1ni(f~T(𝒙i)yi)2+λf~Tminf^(1ni(f^(𝒙i)yi)2+λf^)\displaystyle\triangleq\frac{1}{n}\sum_{i}\big{(}\tilde{f}_{T}(\bm{x}_{i})-y_{i}\big{)}^{2}+\lambda\|\tilde{f}_{T}\|-\min_{\hat{f}}\bigg{(}\frac{1}{n}\sum_{i}\big{(}\hat{f}(\bm{x}_{i})-y_{i}\big{)}^{2}+\lambda\|\hat{f}\|\bigg{)}
ψ(𝐛~T1,𝐛~T2)ψ(𝐛1,𝐛2)\displaystyle\leq\psi(\tilde{\bf b}^{1}_{T},\tilde{\bf b}^{2}_{T})-\psi({\bf b}^{1}_{*},{\bf b}^{2}_{*})
\displaystyle+\frac{1}{n}\sum_{i}\max_{j}(\varepsilon^{1}_{i,j})^{2}+2\underbrace{\sqrt{\frac{1}{n}\sum_{i}\max_{j}(\varepsilon^{1}_{i,j})^{2}}}_{\epsilon_{1}}\underbrace{\sqrt{1+\frac{1}{T}\frac{9\rho n^{2}}{\lambda^{2}}}}_{\epsilon_{2}}+\lambda\sum_{l}\big{(}\max_{i}|\varepsilon^{2}_{i,l}|+\max_{i}|\varepsilon^{3}_{i,l}|\big{)}
RHS.\displaystyle\triangleq RHS.

First assume $\epsilon_{1}\leq\epsilon_{2}$. Then,

\displaystyle RHS\leq\psi(\tilde{\bf b}^{1}_{T},\tilde{\bf b}^{2}_{T})-\psi({\bf b}^{1}_{*},{\bf b}^{2}_{*})+3\sqrt{\frac{1}{n}\sum_{i}\max_{j}(\varepsilon^{1}_{i,j})^{2}}\sqrt{1+\frac{1}{T}\frac{9\rho n^{2}}{\lambda^{2}}}+\lambda\sum_{l}\big{(}\max_{i}|\varepsilon^{2}_{i,l}|+\max_{i}|\varepsilon^{3}_{i,l}|\big{)}.

In Eq. (31) set

κi,j1\displaystyle\kappa^{1}_{i,j} ={31+1T9ρn2λ2εi,j1ni(maxkεi,k1)2if j=argmaxkεi,k10otherwise\displaystyle=\begin{cases}-\frac{3\sqrt{1+\frac{1}{T}\frac{9\rho n^{2}}{\lambda^{2}}}\varepsilon^{1}_{i,j}}{\sqrt{n\sum_{i}(\max_{k}\varepsilon^{1}_{i,k})^{2}}}&\text{if }j=\arg\max_{k}\varepsilon^{1}_{i,k}\\ 0&\text{otherwise }\end{cases}
κi,l2\displaystyle\kappa^{2}_{i,l} ={λsignεi,l2if i=argmaxk|εk,l2|0otherwise\displaystyle=\begin{cases}-\lambda\operatorname{sign}\varepsilon^{2}_{i,l}&\text{if }i=\arg\max_{k}|\varepsilon^{2}_{k,l}|\\ 0&\text{otherwise }\end{cases}
κi,l3\displaystyle\kappa^{3}_{i,l} ={λsignεi,l3if i=argmaxk|εk,l3|0otherwise.\displaystyle=\begin{cases}-\lambda\operatorname{sign}\varepsilon^{3}_{i,l}&\text{if }i=\arg\max_{k}|\varepsilon^{3}_{k,l}|\\ 0&\text{otherwise.}\end{cases}

We have

\displaystyle RHS\leq\Delta \leq\frac{1}{T}\left(\frac{\rho}{2}\|{\bf B}{\bf b}^{2}_{*}\|^{2}+\frac{9}{2n\rho}\Big(1+\frac{1}{T}\frac{9\rho n^{2}}{\lambda^{2}}\Big)+\frac{d}{\rho}\lambda^{2}\right)
1T(9ρn2λ2+92nρ(1+1T9ρn2λ2)+dρλ2).\displaystyle\leq\frac{1}{T}\left(\frac{9\rho n^{2}}{\lambda^{2}}+\frac{9}{2n\rho}(1+\frac{1}{T}\frac{9\rho n^{2}}{\lambda^{2}})+\frac{d}{\rho}\lambda^{2}\right).

Setting $\rho=\frac{\sqrt{d}\lambda^{2}}{n}$, we get $LHS\leq RHS\leq\Delta\leq\frac{6n\sqrt{d}}{T}$, provided $\lambda\geq\frac{3}{\sqrt{2nd}}$ and $n\rho T\geq\frac{9}{2}$.

Now assume ϵ1ϵ2\epsilon_{1}\geq\epsilon_{2}. We get:

RHS\displaystyle RHS ψ(𝐛~T1,𝐛~T2)ψ(𝐛1,𝐛2)+31ni(maxjεi,j1)2+λl(maxi|εi,l2|+maxi|εi,l3|).\displaystyle\leq\psi(\tilde{\bf b}^{1}_{T},\tilde{\bf b}^{2}_{T})-\psi({\bf b}^{1}_{*},{\bf b}^{2}_{*})+3\frac{1}{n}\sum_{i}(\max_{j}\varepsilon^{1}_{i,j})^{2}+\lambda\sum_{l}\big{(}\max_{i}|\varepsilon^{2}_{i,l}|+\max_{i}|\varepsilon^{3}_{i,l}|\big{)}.

In Eq. (31) set

κi,j1\displaystyle\kappa^{1}_{i,j} ={4nεi,j1if j=argmaxkεi,k10otherwise\displaystyle=\begin{cases}-\frac{4}{n}\varepsilon^{1}_{i,j}&\text{if }j=\arg\max_{k}\varepsilon^{1}_{i,k}\\ 0&\text{otherwise }\end{cases}
κi,l2\displaystyle\kappa^{2}_{i,l} ={λsignεi,l2if i=argmaxk|εk,l2|0otherwise\displaystyle=\begin{cases}-\lambda\operatorname{sign}\varepsilon^{2}_{i,l}&\text{if }i=\arg\max_{k}|\varepsilon^{2}_{k,l}|\\ 0&\text{otherwise }\end{cases}
κi,l3\displaystyle\kappa^{3}_{i,l} ={λsignεi,l3if i=argmaxk|εk,l3|0otherwise.\displaystyle=\begin{cases}-\lambda\operatorname{sign}\varepsilon^{3}_{i,l}&\text{if }i=\arg\max_{k}|\varepsilon^{3}_{k,l}|\\ 0&\text{otherwise.}\end{cases}

We get

Δ\displaystyle\Delta =ψ(𝐛~T1,𝐛~T2)ψ(𝐛1,𝐛2)+41ni(maxjεi,j1)2+λl(maxi|εi,l2|+maxi|εi,l3|)\displaystyle=\psi(\tilde{\bf b}^{1}_{T},\tilde{\bf b}^{2}_{T})-\psi({\bf b}^{1}_{*},{\bf b}^{2}_{*})+4\frac{1}{n}\sum_{i}(\max_{j}\varepsilon^{1}_{i,j})^{2}+\lambda\sum_{l}\big{(}\max_{i}|\varepsilon^{2}_{i,l}|+\max_{i}|\varepsilon^{3}_{i,l}|\big{)}
1T(9ρn2λ2+162ρn2i(maxjεi,j1)2+dρλ2).\displaystyle\leq\frac{1}{T}\left(\frac{9\rho n^{2}}{\lambda^{2}}+\frac{16}{2\rho n^{2}}\sum_{i}(\max_{j}\varepsilon^{1}_{i,j})^{2}+\frac{d}{\rho}\lambda^{2}\right).

It is straightforward to see that if $n\rho T\geq 8$, we have

LHSRHSΔ162ρn2Ti(maxjεi,j1)2=O(1T(ρn2λ2+dρλ2)).\displaystyle LHS\leq RHS\leq\Delta-\frac{16}{2\rho n^{2}T}\sum_{i}(\max_{j}\varepsilon^{1}_{i,j})^{2}=O(\frac{1}{T}(\frac{\rho n^{2}}{\lambda^{2}}+\frac{d}{\rho}\lambda^{2})).

Setting $\rho=\frac{\sqrt{d}\lambda^{2}}{n}$, we get $LHS\leq\frac{3n\sqrt{d}}{T}$, provided $T\geq\frac{8}{\sqrt{d}\lambda^{2}}$. ∎

Appendix C Experimental results

This section provides the regression and classification experimental results in tabular format. Table 3 compares the regression performance of our method against baseline methods on benchmark datasets. Table 4 compares the classification accuracy of our method against baseline methods. We present the run time of our method on selected UCI datasets in Table 5. Here, we omit the baseline method because it did not run to completion on most of the datasets. Hyperparameters were fixed for all timing results.

Table 3: Comparison of regression performance on UCI datasets, reporting $R^{2}\times 100$ for each method.
dataset, $n$, $d$, dc regression (this paper), lasso, random forest, xgboost
Parkinson Speech Dataset 702 52 -3.7 -4 -14.8 -20
Garment Productivity 905 37 25.5 15.2 45.9 44.2
Concrete Compressive Strength 1030 8 91.7 59.9 91.8 92.5
Geographical Original of Music-1 1059 68 21.9 16.6 25.1 26
Geographical Original of Music-2 1059 68 32.6 23.8 31.3 30.8
Solar Flare-1 1066 23 3.3 6.1 -28 6.2
Solar Flare-2 1066 23 -6.1 -2.1 -17 2.1
Airfoil Self-Noise 1503 5 95.2 51.1 93.3 95
Communities and Crime 1994 122 62.1 64.5 65.4 65.8
SML2010 3000 24 90.8 96 93.4 92.1
SML2010 3000 24 77.3 94 92.4 90
Parkinson’s Telemonitoring 4406 25 93.8 98.4 92.3 98.2
Parkinson’s Telemonitoring 4406 25 95.5 98.5 92 98
Wine Quality 4898 11 51.8 27.2 52.6 51
Bias Correction of Temperature Forecast-1 6200 52 62.9 60.8 64.2 65.8
Bias Correction of Temperature Forecast-2 6200 52 75.2 76.5 76.3 76.7
Seoul Bike Sharing Demand 6570 19 89.7 80.7 93.1 92.3
Air Quality-1 7110 21 87 89.5 88.2 89.9
Air Quality-2 7110 21 100 94.7 99.8 100
Air Quality-3 7110 21 86.6 86.4 87.7 88.5
Air Quality-4 7110 21 81.7 84 78.1 80.7
Combined Cycle Power Plant 9568 4 95.5 92.9 96.5 96.7
Table 4: Comparison of classification performance on UCI datasets. The first four method columns report accuracy (%) for pbdl, pbdl0, random forest, and xgboost; the last two columns report the run time of all fits (sec) for pbdl and pbdl0.
dataset, $n$, $d$, pbdl acc., pbdl0 acc., random forest acc., xgboost acc., pbdl time, pbdl0 time
Iris 149 4 96.0 98.7 96 96.6 22.1 46.2
Wine 178 13 96.0 96.6 96.7 95.5 23.6 496.2
Blood Transfusion 748 4 72.6 74.8 72.9 78.4 46.3 21514.0
Breast Cancer Wisconsin 569 30 94.4 96.5 96.1 96.0 114.0 224407
Balance Scale 625 4 84.6 93.0 83 92.2 150.3 617.0
Abalone 4177 10 22.6 N/A 24.9 26.2 3688.0 N/A
Table 5: DC Regression run time on UCI datasets. (The baseline is omitted because it did not run to completion on most datasets.)
dataset, $n$, $d$, this paper (seconds)
Parkinson Speech Dataset 702 52 3.4
Garment Productivity 905 37 4.12
Concrete Compressive Strength 1030 8 3.28
Geographical Original of Music 1059 68 4.44
Solar Flare 1066 23 3.64
Airfoil Self-Noise 1503 5 4.92
Communities and Crime 1994 122 12.82
SML2010 3000 24 21.78
Figure 3: Classification accuracy for all fits on UCI datasets