
Fused Lasso Nearly Isotonic Signal Approximation in General Dimensions

Vladimir Pastukhov
Department of Computer Science and Engineering,
Chalmers, Sweden
Email: [email protected]
Abstract

In this paper we introduce and study the fused lasso nearly-isotonic signal approximator, which is a combination of the fused lasso and generalized nearly-isotonic regression. We show how these estimators relate to each other, derive the solution to the general problem, show that it is computationally feasible, and show that it provides a trade-off between piecewise monotonicity, sparsity and goodness-of-fit. We also derive an unbiased estimator of the degrees of freedom of the approximator.


Keywords: Constrained inference; Isotonic regression; Nearly-isotonic Regression; Fused lasso

1 Introduction

This work is motivated by recent papers on nearly-constrained estimation and on penalized least squares regression. The subject of penalized estimators starts with L_{1}-penalisation (Tibshirani, 1996), i.e.

\hat{\bm{\beta}}^{L}(\bm{y},\lambda_{L})=\underset{\bm{\beta}\in\mathbb{R}^{n}}{\arg\min}\,\frac{1}{2}||\bm{y}-\bm{\beta}||_{2}^{2}+\lambda_{L}||\bm{\beta}||_{1},

which is called the lasso signal approximation. The L_{2}-penalisation is usually referred to as ridge regression (Hoerl & Kennard, 1970) or sometimes as Tikhonov-Phillips regularization (Phillips, 1962; Tikhonov et al., 1995).

First, to set up the ideas and for simplicity of notation we consider the one-dimensional versions of the penalized estimators. In the next subsection we generalise these estimators to the case of isotonic constraints with respect to a general partial order.

For a given sequence of data points \bm{y}\in\mathbb{R}^{n} the fusion approximator (cf. Rinaldo (2009)) is given by

\hat{\bm{\beta}}^{F}(\bm{y},\lambda_{F})=\underset{\bm{\beta}\in\mathbb{R}^{n}}{\arg\min}\,\frac{1}{2}||\bm{y}-\bm{\beta}||_{2}^{2}+\lambda_{F}\sum_{i=1}^{n-1}|\beta_{i}-\beta_{i+1}|. (1)

Further, the combination of the fusion estimator and the lasso is called the fused lasso estimator and is given by

\hat{\bm{\beta}}^{FL}(\bm{y},\lambda_{F},\lambda_{L})=\underset{\bm{\beta}\in\mathbb{R}^{n}}{\arg\min}\,\frac{1}{2}||\bm{y}-\bm{\beta}||_{2}^{2}+\lambda_{F}\sum_{i=1}^{n-1}|\beta_{i}-\beta_{i+1}|+\lambda_{L}||\bm{\beta}||_{1}. (2)

The fused lasso was introduced in Tibshirani et al. (2005) and its asymptotic properties were studied in detail in Rinaldo (2009).

Remark 1

In Tibshirani & Taylor (2011) the estimator in (1) is called the fused lasso, while the estimator in (2) is referred to as the sparse fused lasso.

In the area of constrained inference the most basic problem is isotonic regression in one dimension. For a given sequence of data points \bm{y}\in\mathbb{R}^{n} the isotonic regression is the following approximation

\hat{\bm{\beta}}^{I}=\underset{\bm{\beta}\in\mathbb{R}^{n}}{\arg\min}||\bm{y}-\bm{\beta}||_{2}^{2},\quad\text{subject to}\quad\beta_{1}\leq\beta_{2}\leq\dots\leq\beta_{n}, (3)

i.e. the isotonic regression is the \ell^{2}-projection of the vector \bm{y} onto the set of non-decreasing vectors in \mathbb{R}^{n}. The notion of isotonic "regression" in this context might be confusing. Nevertheless, it is standard terminology in this subject, cf., for example, the papers Best & Chakravarti (1990); Stout (2013), where the term "isotonic regression" is used for the isotonic projection of a general vector. Also, in this paper we use the terms "regression", "estimator" and "approximator" interchangeably.

The nearly-isotonic regression, introduced in Tibshirani et al. (2011) and studied in detail in Minami (2020), is a less restrictive version of the isotonic regression and is given by the following optimization problem

\hat{\bm{\beta}}^{NI}(\bm{y},\lambda_{NI})=\underset{\bm{\beta}\in\mathbb{R}^{n}}{\arg\min}\,\frac{1}{2}||\bm{y}-\bm{\beta}||_{2}^{2}+\lambda_{NI}\sum_{i=1}^{n-1}|\beta_{i}-\beta_{i+1}|_{+}, (4)

where x_{+}=x\cdot 1\{x>0\}.

In this paper we combine the fused lasso estimator with nearly-isotonic regression and call the resulting estimator the fused lasso nearly-isotonic signal approximator, i.e. for a given sequence of data points \bm{y}\in\mathbb{R}^{n} the problem in the one-dimensional case is the following optimization

\hat{\bm{\beta}}^{FLNI}(\bm{y},\lambda_{F},\lambda_{L},\lambda_{NI})=\underset{\bm{\beta}\in\mathbb{R}^{n}}{\arg\min}\,\frac{1}{2}||\bm{y}-\bm{\beta}||_{2}^{2}+\lambda_{F}\sum_{i=1}^{n-1}|\beta_{i}-\beta_{i+1}|+\lambda_{L}||\bm{\beta}||_{1}+\lambda_{NI}\sum_{i=1}^{n-1}|\beta_{i}-\beta_{i+1}|_{+}. (5)

Also, in the case of \lambda_{F}\neq 0 and \lambda_{NI}\neq 0, with \lambda_{L}=0, we call the estimator the fused nearly-isotonic regression, i.e.

\hat{\bm{\beta}}^{FNI}(\bm{y},\lambda_{F},\lambda_{NI})\equiv\hat{\bm{\beta}}^{FLNI}(\bm{y},\lambda_{F},0,\lambda_{NI})=\underset{\bm{\beta}\in\mathbb{R}^{n}}{\arg\min}\,\frac{1}{2}||\bm{y}-\bm{\beta}||_{2}^{2}+\lambda_{F}\sum_{i=1}^{n-1}|\beta_{i}-\beta_{i+1}|+\lambda_{NI}\sum_{i=1}^{n-1}|\beta_{i}-\beta_{i+1}|_{+}. (6)

The generalisation of nearly-isotonic regression in (6) was proposed in the conclusion of the paper Tibshirani et al. (2011).
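To make the one-dimensional problem (5) concrete, the following minimal sketch solves it numerically with the generic convex solver cvxpy (the library, the helper name flni_1d and the toy data are assumptions of this illustration, not part of the paper):

import cvxpy as cp
import numpy as np

def flni_1d(y, lam_f, lam_l, lam_ni):
    # Fused lasso nearly-isotonic approximation (5) of a 1-d signal y.
    n = len(y)
    beta = cp.Variable(n)
    diff = beta[:-1] - beta[1:]                 # beta_i - beta_{i+1}
    obj = (0.5 * cp.sum_squares(y - beta)
           + lam_f * cp.norm1(diff)             # fusion penalty
           + lam_l * cp.norm1(beta)             # lasso penalty
           + lam_ni * cp.sum(cp.pos(diff)))     # nearly-isotonic penalty
    cp.Problem(cp.Minimize(obj)).solve()
    return beta.value

# Example: a noisy, roughly increasing signal.
rng = np.random.default_rng(0)
y = np.sort(rng.normal(size=20)) + 0.3 * rng.normal(size=20)
print(flni_1d(y, lam_f=0.5, lam_l=0.1, lam_ni=1.0))

Setting lam_l=0 gives the fused nearly-isotonic regression (6), and setting lam_f=lam_l=0 recovers the nearly-isotonic regression (4).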

We will state the problem in (5) for the case of isotonic constraints with respect to a general partial order, but first we have to introduce some notation.

1.1 Notation

We start with basic definitions of partial orders and isotonic regression. Let \mathcal{I}=\{\bm{i}_{1},\dots,\bm{i}_{n}\} be some index set. Next, we define the following binary relation \preceq on \mathcal{I}.

A binary relation \preceq on \mathcal{I} is called a partial order if

  • it is reflexive, i.e. \bm{j}\preceq\bm{j} for all \bm{j}\in\mathcal{I};

  • it is transitive, i.e. \bm{j}_{1},\bm{j}_{2},\bm{j}_{3}\in\mathcal{I}, \bm{j}_{1}\preceq\bm{j}_{2} and \bm{j}_{2}\preceq\bm{j}_{3} imply \bm{j}_{1}\preceq\bm{j}_{3};

  • it is antisymmetric, i.e. \bm{j}_{1},\bm{j}_{2}\in\mathcal{I}, \bm{j}_{1}\preceq\bm{j}_{2} and \bm{j}_{2}\preceq\bm{j}_{1} imply \bm{j}_{1}=\bm{j}_{2}.

Further, a vector \bm{\beta}\in\mathbb{R}^{n} indexed by \mathcal{I} is called isotonic with respect to the partial order \preceq on \mathcal{I} if \bm{j}_{1}\preceq\bm{j}_{2} implies \beta_{\bm{j}_{1}}\leq\beta_{\bm{j}_{2}}. We denote the set of all isotonic vectors in \mathbb{R}^{n} with respect to the partial order \preceq on \mathcal{I} by \bm{\mathcal{B}}^{is}, which is also called the isotonic cone. Next, a vector \bm{\beta}^{I}\in\mathbb{R}^{n} is the isotonic regression of an arbitrary vector \bm{y}\in\mathbb{R}^{n} over the partially ordered index set \mathcal{I} if

\bm{\beta}^{I}=\underset{\bm{\beta}\in\bm{\mathcal{B}}^{is}}{\arg\min}\sum_{\bm{j}\in\mathcal{I}}(\beta_{\bm{j}}-y_{\bm{j}})^{2}. (7)

For any partial order relation \preceq on \mathcal{I} there exists a directed graph G=(V,E), with V=\mathcal{I} and edge set

E\subseteq\{(\bm{j}_{1},\bm{j}_{2}):\,\bm{j}_{1},\bm{j}_{2}\in\mathcal{I}\}, (8)

such that an arbitrary vector \bm{\beta}\in\mathbb{R}^{n} is isotonic with respect to \preceq iff \beta_{\bm{j}_{1}}\leq\beta_{\bm{j}_{2}} for any (\bm{j}_{1},\bm{j}_{2})\in E. Therefore, equivalently to the definition in (7), a vector \bm{\beta}^{I}\in\mathbb{R}^{n} is the isotonic regression of an arbitrary vector \bm{y}\in\mathbb{R}^{n} indexed by the partially ordered index set \mathcal{I} if

\bm{\beta}^{I}=\underset{\bm{\beta}}{\arg\min}\sum_{\bm{j}\in\mathcal{I}}(\beta_{\bm{j}}-y_{\bm{j}})^{2},\quad\text{subject to}\quad\beta_{\bm{j}_{1}}\leq\beta_{\bm{j}_{2}},\quad\text{whenever}\quad(\bm{j}_{1},\bm{j}_{2})\in E. (9)

Further, for the directed graph G=(V,E) which corresponds to the partial order \preceq on \mathcal{I}, the nearly-isotonic regression of \bm{y}\in\mathbb{R}^{n} indexed by \mathcal{I} is given by

\hat{\bm{\beta}}^{NI}(\bm{y},\lambda_{NI})=\underset{\bm{\beta}\in\mathbb{R}^{n}}{\arg\min}\,\frac{1}{2}||\bm{y}-\bm{\beta}||_{2}^{2}+\lambda_{NI}\sum_{(\bm{i},\bm{j})\in E}|\beta_{\bm{i}}-\beta_{\bm{j}}|_{+}. (10)

This generalisation of nearly-isotonic regression was introduced and studied in Minami (2020).

Next, the fusion and fused lasso approximators for a general directed graph G=(V,E) are given by

\hat{\bm{\beta}}^{F}(\bm{y},\lambda_{F})=\underset{\bm{\beta}\in\mathbb{R}^{n}}{\arg\min}\,\frac{1}{2}||\bm{y}-\bm{\beta}||_{2}^{2}+\lambda_{F}\sum_{(\bm{i},\bm{j})\in E}|\beta_{\bm{i}}-\beta_{\bm{j}}|, (11)

and

\hat{\bm{\beta}}^{FL}(\bm{y},\lambda_{F},\lambda_{L})=\underset{\bm{\beta}\in\mathbb{R}^{n}}{\arg\min}\,\frac{1}{2}||\bm{y}-\bm{\beta}||_{2}^{2}+\lambda_{F}\sum_{(\bm{i},\bm{j})\in E}|\beta_{\bm{i}}-\beta_{\bm{j}}|+\lambda_{L}||\bm{\beta}||_{1}. (12)

These optimization problems were introduced and solved for a general directed graph in Tibshirani & Taylor (2011).

Further, let D denote the oriented incidence matrix of the directed graph G=(V,E) corresponding to \preceq on \mathcal{I}. We choose the orientation of D in the following way. Assume that the graph G with n vertices has m edges, and that the vertices are labelled by \{1,\dots,n\} and the edges by \{1,\dots,m\}. Then D is the m\times n matrix with

D_{i,j}=\begin{cases}1,&\quad\text{if vertex $j$ is the source of edge $i$},\\ -1,&\quad\text{if vertex $j$ is the target of edge $i$},\\ 0,&\quad\text{otherwise}.\end{cases}
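As a small illustration (an assumption of this note, not code from the paper), the oriented incidence matrix with this sign convention can be built directly from an edge list:

import numpy as np

def incidence_matrix(n_vertices, edges):
    # One row per edge: +1 at the source vertex, -1 at the target vertex.
    D = np.zeros((len(edges), n_vertices))
    for row, (src, tgt) in enumerate(edges):
        D[row, src] = 1.0
        D[row, tgt] = -1.0
    return D

# Chain graph of the one-dimensional monotonic order on 5 points (cf. Figure 1).
edges_1d = [(i, i + 1) for i in range(4)]
print(incidence_matrix(5, edges_1d))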

In order to clarify the notation we consider the following examples of partial order relations. First, let us consider the monotonic order relation in the one-dimensional case. Let \mathcal{I}=\{1,\dots,n\}, and for j_{1}\in\mathcal{I} and j_{2}\in\mathcal{I} we naturally define j_{1}\preceq j_{2} if j_{1}\leq j_{2}. Further, if we let V=\mathcal{I} and E=\{(i,i+1):i=1,\dots,n-1\}, then G=(V,E) is the directed graph which corresponds to the one-dimensional order relation on \mathcal{I}. Figure 1 displays the graph and its incidence matrix.

(a) Graph G=(V,E)

D=\begin{pmatrix}1&-1&0&0&0\\ 0&1&-1&0&0\\ 0&0&1&-1&0\\ 0&0&0&1&-1\end{pmatrix}

(b) Oriented incidence matrix D

Figure 1: Graph for monotonic constraints and its oriented incidence matrix

Next, we consider the two-dimensional case with bimonotonic constraints. The notion of bimonotonicity was first introduced in Beran & Dümbgen (2010) and means the following. Let us consider the index set

\mathcal{I}=\{\bm{i}=(i^{(1)},i^{(2)}):\,i^{(1)}=1,2,\dots,n_{1},\,i^{(2)}=1,2,\dots,n_{2}\}

with the following order relation \preceq on it: for \bm{j}_{1},\bm{j}_{2}\in\mathcal{I} we have \bm{j}_{1}\preceq\bm{j}_{2} iff j^{(1)}_{1}\leq j^{(1)}_{2} and j^{(2)}_{1}\leq j^{(2)}_{2}. Then, a vector \bm{\beta}\in\mathbb{R}^{n}, with n=n_{1}n_{2}, indexed by \mathcal{I} is called bimonotone if it is isotonic with respect to the bimonotone order \preceq defined on its index set \mathcal{I}. Further, we define the directed graph G=(V,E) with vertices V=\mathcal{I} and edges

E=\{((l,k),(l,k+1)):\,1\leq l\leq n_{1},1\leq k\leq n_{2}-1\}\cup\{((l,k),(l+1,k)):\,1\leq l\leq n_{1}-1,1\leq k\leq n_{2}\}.

The labeled graph and its incidence matrix are displayed in Figure 2.

(a) Graph G=(V,E)

D=\begin{pmatrix}1&-1&0&0&0&\dots&0&0\\ 0&1&-1&0&0&\dots&0&0\\ \vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots\\ 0&0&0&0&0&\dots&1&-1\\ 1&0&0&0&-1&\dots&0&0\\ 0&1&0&0&0&\dots&0&0\\ \vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots\\ 0&0&0&0&0&\dots&0&-1\end{pmatrix}

(b) Oriented incidence matrix D\in\mathbb{R}^{17\times 12}

Figure 2: Graph for bimonotonic constraints and its oriented incidence matrix
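For completeness, here is a short sketch (a hypothetical helper, not from the paper) that enumerates the edge set E of the bimonotone grid graph; a 3-by-4 grid gives 17 edges on 12 vertices, consistent with D\in\mathbb{R}^{17\times 12} in Figure 2:

def bimonotone_edges(n1, n2):
    idx = lambda l, k: l * n2 + k                       # row-major vertex label of (l, k)
    edges = []
    for l in range(n1):
        for k in range(n2 - 1):
            edges.append((idx(l, k), idx(l, k + 1)))    # edge within a row
    for l in range(n1 - 1):
        for k in range(n2):
            edges.append((idx(l, k), idx(l + 1, k)))    # edge within a column
    return edges

print(len(bimonotone_edges(3, 4)))   # 3*3 + 2*4 = 17 edges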

1.2 General statement of the problem

Now we can state the general problem studied in this paper. Let \bm{y}\in\mathbb{R}^{n} be a signal indexed by the index set \mathcal{I} with the partial order relation \preceq defined on \mathcal{I}. Next, let G=(V,E) be the directed graph corresponding to \preceq on \mathcal{I}. The fused lasso nearly-isotonic signal approximation with respect to \preceq on \mathcal{I} (or, equivalently, to the directed graph G=(V,E) corresponding to \preceq) is given by

\hat{\bm{\beta}}^{FLNI}(\bm{y},\lambda_{F},\lambda_{L},\lambda_{NI})=\underset{\bm{\beta}\in\mathbb{R}^{n}}{\arg\min}\,\frac{1}{2}||\bm{y}-\bm{\beta}||_{2}^{2}+\lambda_{F}\sum_{(\bm{i},\bm{j})\in E}|\beta_{\bm{i}}-\beta_{\bm{j}}|+\lambda_{L}||\bm{\beta}||_{1}+\lambda_{NI}\sum_{(\bm{i},\bm{j})\in E}|\beta_{\bm{i}}-\beta_{\bm{j}}|_{+}. (13)

Therefore, the estimator in (13) is a combination of the estimators in (10) and (12).

Equivalently, we can rewrite the problem in the following way:

\hat{\bm{\beta}}^{FLNI}(\bm{y},\lambda_{F},\lambda_{L},\lambda_{NI})=\underset{\bm{\beta}\in\mathbb{R}^{n}}{\arg\min}\,\frac{1}{2}||\bm{y}-\bm{\beta}||_{2}^{2}+\lambda_{F}||D\bm{\beta}||_{1}+\lambda_{L}||\bm{\beta}||_{1}+\lambda_{NI}||D\bm{\beta}||_{+}, (14)

where D is the oriented incidence matrix of the graph G=(V,E) and ||\bm{x}||_{+}=\sum_{i}(x_{i})_{+} denotes the sum of the positive parts of the components of \bm{x}.

Analogously to the definition in the one-dimensional case, if \lambda_{L}=0 we call the estimator the fused nearly-isotonic approximator and denote it by \hat{\bm{\beta}}^{FNI}(\bm{y},\lambda_{F},\lambda_{NI}).
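The general problem (14) can be prototyped in the same way as the one-dimensional sketch above, now parametrised by the oriented incidence matrix D of an arbitrary graph (again a cvxpy-based illustration under the same assumptions, not software from the paper):

import cvxpy as cp

def flni_graph(y, D, lam_f, lam_l, lam_ni):
    beta = cp.Variable(len(y))
    Dbeta = D @ beta                              # differences beta_i - beta_j over edges (i, j)
    obj = (0.5 * cp.sum_squares(y - beta)
           + lam_f * cp.norm1(Dbeta)              # lambda_F ||D beta||_1
           + lam_l * cp.norm1(beta)               # lambda_L ||beta||_1
           + lam_ni * cp.sum(cp.pos(Dbeta)))      # lambda_NI ||D beta||_+
    cp.Problem(cp.Minimize(obj)).solve()
    return beta.value

Any D built as in the incidence_matrix sketch of Section 1.1 (chain or bimonotone grid) can be passed in directly.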

2 The solution to the fused lasso nearly-isotonic signal approximator

First, we consider the fused nearly-isotonic regression, i.e. in (14) we assume that \lambda_{L}=0.

Theorem 1

For a fixed data vector \bm{y}\in\mathbb{R}^{n} indexed by the index set \mathcal{I} with the partial order relation \preceq defined on \mathcal{I}, the solution to the fused nearly-isotonic problem in (14) is given by

\hat{\bm{\beta}}^{FNI}(\bm{y},\lambda_{F},\lambda_{NI})=\bm{y}-D^{T}\hat{\bm{\nu}}(\lambda_{F},\lambda_{NI}) (15)

with

\hat{\bm{\nu}}(\bm{y},\lambda_{F},\lambda_{NI})=\underset{\bm{\nu}\in\mathbb{R}^{m}}{\arg\min}\,\frac{1}{2}||\bm{y}-D^{T}\bm{\nu}||_{2}^{2}\quad\text{subject to}\quad-\lambda_{F}\bm{1}\leq\bm{\nu}\leq(\lambda_{F}+\lambda_{NI})\bm{1}, (16)

where D is the m\times n incidence matrix of the directed graph G=(V,E) corresponding to \preceq on \mathcal{I}, with m the number of edges, \bm{1}\in\mathbb{R}^{m} is the vector whose elements are all equal to 1, and the notation \bm{a}\leq\bm{b} for vectors \bm{a},\bm{b} means a_{i}\leq b_{i} for all i.

Proof. First, following the derivations of \ell_{1} trend filtering and the generalised lasso in Kim et al. (2009) and Tibshirani & Taylor (2011), respectively, we can write the optimization problem in (14) with \lambda_{L}=0 in the following way

\underset{\bm{\beta},\bm{z}}{\text{minimize}}\,\frac{1}{2}||\bm{y}-\bm{\beta}||_{2}^{2}+\lambda_{F}||\bm{z}||_{1}+\lambda_{NI}||\bm{z}||_{+}\quad\text{subject to}\quad D\bm{\beta}=\bm{z}\in\mathbb{R}^{m}.

Further, the Lagrangian is given by

L(\bm{\beta},\bm{z},\bm{\nu})=\frac{1}{2}||\bm{y}-\bm{\beta}||_{2}^{2}+\lambda_{F}||\bm{z}||_{1}+\lambda_{NI}||\bm{z}||_{+}+\bm{\nu}^{T}(D\bm{\beta}-\bm{z}), (17)

where \bm{\nu}\in\mathbb{R}^{m} is a dual variable.

Note that

\underset{\bm{z}}{\min}\big{(}\lambda_{F}||\bm{z}||_{1}+\lambda_{NI}||\bm{z}||_{+}-\bm{\nu}^{T}\bm{z}\big{)}=\begin{cases}0,&\quad\text{if}\quad-\lambda_{F}\bm{1}\leq\bm{\nu}\leq(\lambda_{F}+\lambda_{NI})\bm{1},\\ -\infty,&\quad\text{otherwise},\end{cases}

and

\underset{\bm{\beta}}{\min}\big{(}\frac{1}{2}||\bm{y}-\bm{\beta}||_{2}^{2}+\bm{\nu}^{T}D\bm{\beta}\big{)}=-\frac{1}{2}\bm{\nu}^{T}DD^{T}\bm{\nu}+\bm{y}^{T}D^{T}\bm{\nu}=-\frac{1}{2}||\bm{y}-D^{T}\bm{\nu}||_{2}^{2}+\frac{1}{2}\bm{y}^{T}\bm{y}.

Next, the dual function is given by

g(\bm{\nu})=\underset{\bm{\beta},\bm{z}}{\min}\,L(\bm{\beta},\bm{z},\bm{\nu})=\begin{cases}-\frac{1}{2}||\bm{y}-D^{T}\bm{\nu}||_{2}^{2}+\frac{1}{2}\bm{y}^{T}\bm{y},&\quad\text{if}\quad-\lambda_{F}\bm{1}\leq\bm{\nu}\leq(\lambda_{F}+\lambda_{NI})\bm{1},\\ -\infty,&\quad\text{otherwise},\end{cases}

and, therefore, the dual problem is

\hat{\bm{\nu}}(\bm{y},\lambda_{F},\lambda_{NI})=\underset{\bm{\nu}}{\arg\max}\,g(\bm{\nu})\quad\text{subject to}\quad-\lambda_{F}\bm{1}\leq\bm{\nu}\leq(\lambda_{F}+\lambda_{NI})\bm{1},

which is equivalent to

\hat{\bm{\nu}}(\bm{y},\lambda_{F},\lambda_{NI})=\underset{\bm{\nu}}{\arg\min}\,\frac{1}{2}||\bm{y}-D^{T}\bm{\nu}||_{2}^{2}\quad\text{subject to}\quad-\lambda_{F}\bm{1}\leq\bm{\nu}\leq(\lambda_{F}+\lambda_{NI})\bm{1}.

Lastly, taking the first derivative of the Lagrangian L(\bm{\beta},\bm{z},\bm{\nu}) with respect to \bm{\beta} we get the following relation between \hat{\bm{\beta}}^{FNI}(\lambda_{F},\lambda_{NI}) and \hat{\bm{\nu}}(\bm{y},\lambda_{F},\lambda_{NI}):

\hat{\bm{\beta}}^{FNI}(\bm{y},\lambda_{F},\lambda_{NI})=\bm{y}-D^{T}\hat{\bm{\nu}}(\bm{y},\lambda_{F},\lambda_{NI}).

\Box
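Theorem 1 also suggests a simple computational route (a sketch under stated assumptions: scipy's bounded least-squares solver and the chain graph of Figure 1; this is not the paper's own implementation): solve the box-constrained dual (16) and recover the primal solution via (15).

import numpy as np
from scipy.optimize import lsq_linear

def fni_dual(y, D, lam_f, lam_ni):
    # Dual (16): minimize 0.5*||y - D^T nu||^2  s.t.  -lam_f <= nu <= lam_f + lam_ni.
    res = lsq_linear(D.T, y, bounds=(-lam_f, lam_f + lam_ni))
    nu_hat = res.x
    return y - D.T @ nu_hat                       # primal solution via (15)

D = np.diff(np.eye(5), axis=0) * -1               # chain-graph incidence matrix (+1 source, -1 target)
y = np.array([3.0, 2.5, 2.7, 1.0, 1.2])
print(fni_dual(y, D, lam_f=0.2, lam_ni=1.0))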

Next, we provide the solution to the fused lasso nearly-isotonic regression.

Theorem 2

For a given vector \bm{y} indexed by \mathcal{I} the fused lasso nearly-isotonic signal approximator \hat{\bm{\beta}}^{FLNI}(\bm{y},\lambda_{F},\lambda_{L},\lambda_{NI}) is given by soft thresholding the fused nearly-isotonic regression \hat{\bm{\beta}}^{FNI}(\bm{y},\lambda_{F},\lambda_{NI}), i.e.

\hat{\beta}^{FLNI}_{\bm{i}}(\bm{y},\lambda_{F},\lambda_{L},\lambda_{NI})=\begin{cases}\hat{\beta}^{FNI}_{\bm{i}}(\bm{y},\lambda_{F},\lambda_{NI})-\lambda_{L},&\quad\text{if}\quad\hat{\beta}^{FNI}_{\bm{i}}(\bm{y},\lambda_{F},\lambda_{NI})\geq\lambda_{L},\\ 0,&\quad\text{if}\quad|\hat{\beta}^{FNI}_{\bm{i}}(\bm{y},\lambda_{F},\lambda_{NI})|\leq\lambda_{L},\\ \hat{\beta}^{FNI}_{\bm{i}}(\bm{y},\lambda_{F},\lambda_{NI})+\lambda_{L},&\quad\text{if}\quad\hat{\beta}^{FNI}_{\bm{i}}(\bm{y},\lambda_{F},\lambda_{NI})\leq-\lambda_{L},\end{cases} (18)

for \bm{i}\in\mathcal{I}.
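In code, Theorem 2 says that once \hat{\bm{\beta}}^{FNI} is available, \hat{\bm{\beta}}^{FLNI} is obtained by elementwise soft thresholding at level \lambda_{L}; a minimal sketch (hypothetical helper names, not from the paper):

import numpy as np

def soft_threshold(beta_fni, lam_l):
    # Componentwise soft thresholding, i.e. formula (18).
    return np.sign(beta_fni) * np.maximum(np.abs(beta_fni) - lam_l, 0.0)

# beta_flni = soft_threshold(beta_fni, lam_l)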

Proof. The proof is similar to the derivation of the solution of the fused lasso in Friedman et al. (2007). Nevertheless, for completeness of the paper we provide the proof for \hat{\bm{\beta}}^{FLNI}(\bm{y},\lambda_{F},\lambda_{L},\lambda_{NI}).

The subgradient equations (which are necessary and sufficient conditions for the solution of (13)) for \beta_{\bm{i}}, with \bm{i}\in\mathcal{I}, are

g_{\bm{i}}(\lambda_{L})=-(y_{\bm{i}}-\beta_{\bm{i}})+\lambda_{NI}(\sum_{\bm{j}:(\bm{i},\bm{j})\in E}q_{\bm{i},\bm{j}}-\sum_{\bm{j}:(\bm{j},\bm{i})\in E}q_{\bm{j},\bm{i}})+\lambda_{F}(\sum_{\bm{j}:(\bm{i},\bm{j})\in E}t_{\bm{i},\bm{j}}-\sum_{\bm{j}:(\bm{j},\bm{i})\in E}t_{\bm{j},\bm{i}})+\lambda_{L}s_{\bm{i}}=0, (19)

where

q_{\bm{i},\bm{j}}:\begin{cases}=1,&\text{if }\,\beta_{\bm{i}}-\beta_{\bm{j}}>0,\\ =0,&\text{if }\,\beta_{\bm{i}}-\beta_{\bm{j}}<0,\\ \in[0,1],&\text{if }\,\beta_{\bm{i}}=\beta_{\bm{j}},\end{cases}\quad\quad t_{\bm{i},\bm{j}}:\begin{cases}=1,&\text{if }\,\beta_{\bm{i}}-\beta_{\bm{j}}>0,\\ =-1,&\text{if }\,\beta_{\bm{i}}-\beta_{\bm{j}}<0,\\ \in[-1,1],&\text{if }\,\beta_{\bm{i}}=\beta_{\bm{j}},\end{cases} (20)

s_{\bm{i}}:\begin{cases}=1,&\text{if }\,\beta_{\bm{i}}>0,\\ =-1,&\text{if }\,\beta_{\bm{i}}<0,\\ \in[-1,1],&\text{if }\,\beta_{\bm{i}}=0.\end{cases}

Next, let q_{\bm{i},\bm{j}}(\lambda_{L}), t_{\bm{i},\bm{j}}(\lambda_{L}) and s_{\bm{i}}(\lambda_{L}) denote the values of the parameters defined above at some value of \lambda_{L}; the values of \lambda_{NI} and \lambda_{F} are fixed. Therefore, if \hat{\beta}_{\bm{i}}^{FLNI}(\bm{y},\lambda_{F},0,\lambda_{NI})\neq 0, for s_{\bm{i}}(0) we have

s_{\bm{i}}(0)=\begin{cases}1,&\text{if }\,\hat{\beta}_{\bm{i}}^{FLNI}(\bm{y},\lambda_{F},0,\lambda_{NI})>0,\\ -1,&\text{if }\,\hat{\beta}_{\bm{i}}^{FLNI}(\bm{y},\lambda_{F},0,\lambda_{NI})<0,\end{cases}

and for \hat{\beta}_{\bm{i}}^{FLNI}(\bm{y},\lambda_{F},0,\lambda_{NI})=0 we can set s_{\bm{i}}(0)=0.

Next, let \hat{\bm{\beta}}^{ST}(\lambda_{L}) denote the soft thresholding of \hat{\bm{\beta}}^{FLNI}(\bm{y},\lambda_{F},0,\lambda_{NI}), i.e.

\hat{\beta}_{\bm{i}}^{ST}(\lambda_{L})=\begin{cases}\hat{\beta}^{FLNI}_{\bm{i}}(\bm{y},\lambda_{F},0,\lambda_{NI})-\lambda_{L},&\quad\text{if}\quad\hat{\beta}_{\bm{i}}^{FLNI}(\bm{y},\lambda_{F},0,\lambda_{NI})\geq\lambda_{L},\\ 0,&\quad\text{if}\quad|\hat{\beta}_{\bm{i}}^{FLNI}(\bm{y},\lambda_{F},0,\lambda_{NI})|\leq\lambda_{L},\\ \hat{\beta}_{\bm{i}}^{FLNI}(\bm{y},\lambda_{F},0,\lambda_{NI})+\lambda_{L},&\quad\text{if}\quad\hat{\beta}_{\bm{i}}^{FLNI}(\bm{y},\lambda_{F},0,\lambda_{NI})\leq-\lambda_{L}.\end{cases}

The goal is to prove that \hat{\bm{\beta}}^{ST}(\lambda_{L}) provides the solution to (13).

Note, analogously to the proof for the fused lasso estimator in Lemma A.1 of Friedman et al. (2007), if either \hat{\beta}_{\bm{i}}^{ST}(\lambda_{L})\neq 0 or \hat{\beta}_{\bm{j}}^{ST}(\lambda_{L})\neq 0, and \hat{\beta}_{\bm{i}}^{ST}(\lambda_{L})<\hat{\beta}_{\bm{j}}^{ST}(\lambda_{L}) or \hat{\beta}_{\bm{i}}^{ST}(\lambda_{L})>\hat{\beta}_{\bm{j}}^{ST}(\lambda_{L}), then we also have \hat{\beta}_{\bm{i}}^{ST}(0)<\hat{\beta}_{\bm{j}}^{ST}(0) or \hat{\beta}_{\bm{i}}^{ST}(0)>\hat{\beta}_{\bm{j}}^{ST}(0), respectively. Therefore, soft thresholding of \hat{\bm{\beta}}^{FLNI}(\bm{y},\lambda_{F},0,\lambda_{NI}) does not change the ordering of these pairs and we have q_{\bm{i},\bm{j}}(\lambda_{L})=q_{\bm{i},\bm{j}}(0) and t_{\bm{i},\bm{j}}(\lambda_{L})=t_{\bm{i},\bm{j}}(0). Next, if for some (\bm{i},\bm{j})\in E we have \hat{\beta}_{\bm{i}}^{ST}(\lambda_{L})=\hat{\beta}_{\bm{j}}^{ST}(\lambda_{L})=0, then q_{\bm{i},\bm{j}}\in[0,1] and t_{\bm{i},\bm{j}}\in[-1,1], and, again, we can set t_{\bm{i},\bm{j}}(\lambda_{L})=t_{\bm{i},\bm{j}}(0) and q_{\bm{i},\bm{j}}(\lambda_{L})=q_{\bm{i},\bm{j}}(0).

Now let us insert \hat{\beta}_{\bm{i}}^{ST}(\lambda_{L}) into the subgradient equations (19) and show that we can find s_{\bm{i}}(\lambda_{L})\in[-1,1] for all \bm{i}\in\mathcal{I}.

First, assume that for some \bm{i} we have \hat{\beta}_{\bm{i}}^{FLNI}(\bm{y},\lambda_{F},0,\lambda_{NI})\geq\lambda_{L}. Then

g_{\bm{i}}(\lambda_{L})=-(y_{\bm{i}}-\hat{\beta}_{\bm{i}}^{FLNI}(\bm{y},\lambda_{F},0,\lambda_{NI}))-\lambda_{L}+\lambda_{NI}(\sum_{\bm{j}:(\bm{i},\bm{j})\in E}q_{\bm{i},\bm{j}}(\lambda_{L})-\sum_{\bm{j}:(\bm{j},\bm{i})\in E}q_{\bm{j},\bm{i}}(\lambda_{L}))+\lambda_{F}(\sum_{\bm{j}:(\bm{i},\bm{j})\in E}t_{\bm{i},\bm{j}}(\lambda_{L})-\sum_{\bm{j}:(\bm{j},\bm{i})\in E}t_{\bm{j},\bm{i}}(\lambda_{L}))+\lambda_{L}s_{\bm{i}}(\lambda_{L})=-(y_{\bm{i}}-\hat{\beta}_{\bm{i}}^{FLNI}(\bm{y},\lambda_{F},0,\lambda_{NI}))+\lambda_{NI}(\sum_{\bm{j}:(\bm{i},\bm{j})\in E}q_{\bm{i},\bm{j}}(0)-\sum_{\bm{j}:(\bm{j},\bm{i})\in E}q_{\bm{j},\bm{i}}(0))+\lambda_{F}(\sum_{\bm{j}:(\bm{i},\bm{j})\in E}t_{\bm{i},\bm{j}}(0)-\sum_{\bm{j}:(\bm{j},\bm{i})\in E}t_{\bm{j},\bm{i}}(0))+\lambda_{L}s_{\bm{i}}(\lambda_{L})-\lambda_{L}=0.

Note that

-(y_{\bm{i}}-\hat{\beta}_{\bm{i}}^{FLNI}(\bm{y},\lambda_{F},0,\lambda_{NI}))+\lambda_{NI}(\sum_{\bm{j}:(\bm{i},\bm{j})\in E}q_{\bm{i},\bm{j}}(0)-\sum_{\bm{j}:(\bm{j},\bm{i})\in E}q_{\bm{j},\bm{i}}(0))+\lambda_{F}(\sum_{\bm{j}:(\bm{i},\bm{j})\in E}t_{\bm{i},\bm{j}}(0)-\sum_{\bm{j}:(\bm{j},\bm{i})\in E}t_{\bm{j},\bm{i}}(0))=0,

because \hat{\bm{\beta}}^{FNI}(\bm{y},\lambda_{F},\lambda_{NI})\equiv\hat{\bm{\beta}}^{FLNI}(\bm{y},\lambda_{F},0,\lambda_{NI}). Therefore, if s_{\bm{i}}(\lambda_{L})=\operatorname{sign}\hat{\beta}_{\bm{i}}^{ST}(\lambda_{L})=1, then g_{\bm{i}}(\lambda_{L})=0.

The proof for the case when \hat{\beta}_{\bm{i}}^{FLNI}(\bm{y},\lambda_{F},0,\lambda_{NI})\leq-\lambda_{L} is similar, and one can show that g_{\bm{i}}(\lambda_{L})=0 if s_{\bm{i}}(\lambda_{L})=\operatorname{sign}\hat{\beta}_{\bm{i}}^{ST}(\lambda_{L})=-1.

Second, assume that |\hat{\beta}_{\bm{i}}^{FLNI}(\bm{y},\lambda_{F},0,\lambda_{NI})|<\lambda_{L}. Then \hat{\beta}_{\bm{i}}^{ST}(\lambda_{L})=0, and

g_{\bm{i}}(\lambda_{L})=-y_{\bm{i}}+\lambda_{NI}(\sum_{\bm{j}:(\bm{i},\bm{j})\in E}q_{\bm{i},\bm{j}}(\lambda_{L})-\sum_{\bm{j}:(\bm{j},\bm{i})\in E}q_{\bm{j},\bm{i}}(\lambda_{L}))+\lambda_{F}(\sum_{\bm{j}:(\bm{i},\bm{j})\in E}t_{\bm{i},\bm{j}}(\lambda_{L})-\sum_{\bm{j}:(\bm{j},\bm{i})\in E}t_{\bm{j},\bm{i}}(\lambda_{L}))+\lambda_{L}s_{\bm{i}}(\lambda_{L})=-y_{\bm{i}}+\lambda_{NI}(\sum_{\bm{j}:(\bm{i},\bm{j})\in E}q_{\bm{i},\bm{j}}(0)-\sum_{\bm{j}:(\bm{j},\bm{i})\in E}q_{\bm{j},\bm{i}}(0))+\lambda_{F}(\sum_{\bm{j}:(\bm{i},\bm{j})\in E}t_{\bm{i},\bm{j}}(0)-\sum_{\bm{j}:(\bm{j},\bm{i})\in E}t_{\bm{j},\bm{i}}(0))+\lambda_{L}s_{\bm{i}}(\lambda_{L})=0.

Next, if we let s_{\bm{i}}(\lambda_{L})=\hat{\beta}_{\bm{i}}^{FLNI}(\bm{y},\lambda_{F},0,\lambda_{NI})/\lambda_{L}, then, again, we have

g_{\bm{i}}(\lambda_{L})=-(y_{\bm{i}}-\hat{\beta}_{\bm{i}}^{FLNI}(\bm{y},\lambda_{F},0,\lambda_{NI}))+\lambda_{NI}(\sum_{\bm{j}:(\bm{i},\bm{j})\in E}q_{\bm{i},\bm{j}}(0)-\sum_{\bm{j}:(\bm{j},\bm{i})\in E}q_{\bm{j},\bm{i}}(0))+\lambda_{F}(\sum_{\bm{j}:(\bm{i},\bm{j})\in E}t_{\bm{i},\bm{j}}(0)-\sum_{\bm{j}:(\bm{j},\bm{i})\in E}t_{\bm{j},\bm{i}}(0))=0.

Therefore, we have proved that \hat{\bm{\beta}}^{FLNI}(\bm{y},\lambda_{F},\lambda_{L},\lambda_{NI})=\hat{\bm{\beta}}^{ST}(\lambda_{L}). \Box

3 Properties of the fused lasso nearly-isotonic signal approximator

We start with a proposition which shows how the solutions to the optimization problems (11), (10) and (14) are related to each other. This result will be used in the next section to derive the degrees of freedom of the fused lasso nearly-isotonic signal approximator.

Proposition 1

For a fixed data vector \bm{y} indexed by \mathcal{I} and penalisation parameters \lambda_{NI} and \lambda_{F} the following relations between the estimators \hat{\bm{\beta}}^{F}, \hat{\bm{\beta}}^{NI} and \hat{\bm{\beta}}^{FNI} hold:

\hat{\bm{\beta}}^{NI}(\bm{y},\lambda_{NI})=\hat{\bm{\beta}}^{F}(\bm{y}-\frac{\lambda_{NI}}{2}D^{T}\bm{1},\frac{1}{2}\lambda_{NI}), (21)

\hat{\bm{\beta}}^{FNI}(\bm{y},\lambda_{F},\lambda_{NI})=\hat{\bm{\beta}}^{NI}(\bm{y}+\lambda_{F}D^{T}\bm{1},\lambda_{NI}+2\lambda_{F})=\hat{\bm{\beta}}^{F}(\bm{y}-\frac{\lambda_{NI}}{2}D^{T}\bm{1},\frac{1}{2}\lambda_{NI}+\lambda_{F}), (22)

and

\hat{\bm{\beta}}^{FLNI}(\bm{y},\lambda_{F},\lambda_{L},\lambda_{NI})=\hat{\bm{\beta}}^{FL}(\bm{y}-\frac{\lambda_{NI}}{2}D^{T}\bm{1},\frac{1}{2}\lambda_{NI}+\lambda_{F},\lambda_{L}), (23)

where D is the oriented incidence matrix of the graph G=(V,E) corresponding to the partial order relation \preceq on \mathcal{I}.

Proof. First, from Tibshirani et al. (2011) the solution to the nearly-isotonic problem is given by

\hat{\bm{\beta}}^{NI}(\bm{y},\lambda_{NI})=\bm{y}-D^{T}\hat{\bm{v}}(\bm{y},\lambda_{NI}),

with

\hat{\bm{v}}(\bm{y},\lambda_{NI})=\underset{\bm{v}\in\mathbb{R}^{m}}{\arg\min}\,\frac{1}{2}||\bm{y}-D^{T}\bm{v}||_{2}^{2}\quad\text{subject to}\quad\bm{0}\leq\bm{v}\leq\lambda_{NI}\bm{1},

and from Tibshirani & Taylor (2011) it follows that

\hat{\bm{\beta}}^{F}(\bm{y},\lambda_{F})=\bm{y}-D^{T}\hat{\bm{w}}(\bm{y},\lambda_{F}),

with

\hat{\bm{w}}(\bm{y},\lambda_{F})=\underset{\bm{w}\in\mathbb{R}^{m}}{\arg\min}\,\frac{1}{2}||\bm{y}-D^{T}\bm{w}||_{2}^{2}\quad\text{subject to}\quad-\lambda_{F}\bm{1}\leq\bm{w}\leq\lambda_{F}\bm{1}.

Second, let us introduce a new variable \bm{v}^{*}=\bm{v}-\frac{\lambda_{NI}}{2}\bm{1}. Then

\hat{\bm{\beta}}^{NI}(\bm{y},\lambda_{NI})=\bm{y}-D^{T}\frac{\lambda_{NI}}{2}\bm{1}-D^{T}\hat{\bm{v}}^{*}(\bm{y},\lambda_{NI}),

where

\hat{\bm{v}}^{*}(\bm{y},\lambda_{NI})=\underset{\bm{v}^{*}\in\mathbb{R}^{m}}{\arg\min}\,\frac{1}{2}||\bm{y}-D^{T}\frac{\lambda_{NI}}{2}\bm{1}-D^{T}\bm{v}^{*}||_{2}^{2}\quad\text{subject to}\quad-\frac{\lambda_{NI}}{2}\bm{1}\leq\bm{v}^{*}\leq\frac{\lambda_{NI}}{2}\bm{1}.

Therefore, we have proved that \hat{\bm{\beta}}^{NI}(\bm{y},\lambda_{NI})=\hat{\bm{\beta}}^{F}(\bm{y}-\frac{\lambda_{NI}}{2}D^{T}\bm{1},\frac{1}{2}\lambda_{NI}).

The proof for the fused nearly-isotonic estimator is the same, with the change of variable \bm{u}^{*}=\bm{u}+\lambda_{F}\bm{1} in (15) and (16) for the proof of the first equality in (22), and with \bm{u}^{*}=\bm{u}-\frac{\lambda_{NI}}{2}\bm{1} for the second equality.

Next, we prove the result for the case of the fused lasso nearly-isotonic approximator. From Theorem 2 we have

\hat{\beta}^{FLNI}_{\bm{i}}(\bm{y},\lambda_{F},\lambda_{L},\lambda_{NI})=\begin{cases}\hat{\beta}^{FNI}_{\bm{i}}(\bm{y},\lambda_{F},\lambda_{NI})-\lambda_{L},&\quad\text{if}\quad\hat{\beta}^{FNI}_{\bm{i}}(\bm{y},\lambda_{F},\lambda_{NI})\geq\lambda_{L},\\ 0,&\quad\text{if}\quad|\hat{\beta}^{FNI}_{\bm{i}}(\bm{y},\lambda_{F},\lambda_{NI})|\leq\lambda_{L},\\ \hat{\beta}^{FNI}_{\bm{i}}(\bm{y},\lambda_{F},\lambda_{NI})+\lambda_{L},&\quad\text{if}\quad\hat{\beta}^{FNI}_{\bm{i}}(\bm{y},\lambda_{F},\lambda_{NI})\leq-\lambda_{L},\end{cases}

for \bm{i}\in\mathcal{I}.

Further, using (22) we have

\hat{\beta}^{FLNI}_{\bm{i}}(\bm{y},\lambda_{F},\lambda_{L},\lambda_{NI})=\hat{\beta}^{F}_{\bm{i}}(\bm{y}-\frac{\lambda_{NI}}{2}D^{T}\bm{1},\frac{1}{2}\lambda_{NI}+\lambda_{F})-\lambda_{L},

if \hat{\beta}^{F}_{\bm{i}}(\bm{y}-\frac{\lambda_{NI}}{2}D^{T}\bm{1},\frac{1}{2}\lambda_{NI}+\lambda_{F})\geq\lambda_{L},

\hat{\beta}^{FLNI}_{\bm{i}}(\bm{y},\lambda_{F},\lambda_{L},\lambda_{NI})=0,

if |\hat{\beta}^{F}_{\bm{i}}(\bm{y}-\frac{\lambda_{NI}}{2}D^{T}\bm{1},\frac{1}{2}\lambda_{NI}+\lambda_{F})|\leq\lambda_{L}, and

\hat{\beta}^{FLNI}_{\bm{i}}(\bm{y},\lambda_{F},\lambda_{L},\lambda_{NI})=\hat{\beta}^{F}_{\bm{i}}(\bm{y}-\frac{\lambda_{NI}}{2}D^{T}\bm{1},\frac{1}{2}\lambda_{NI}+\lambda_{F})+\lambda_{L},

if \hat{\beta}^{F}_{\bm{i}}(\bm{y}-\frac{\lambda_{NI}}{2}D^{T}\bm{1},\frac{1}{2}\lambda_{NI}+\lambda_{F})\leq-\lambda_{L}.

Therefore, we obtain

\hat{\bm{\beta}}^{FLNI}(\bm{y},\lambda_{F},\lambda_{L},\lambda_{NI})=\underset{\bm{\beta}\in\mathbb{R}^{n}}{\arg\min}\,\frac{1}{2}||\bm{y}-\frac{\lambda_{NI}}{2}D^{T}\bm{1}-\bm{\beta}||_{2}^{2}+(\frac{1}{2}\lambda_{NI}+\lambda_{F})||D\bm{\beta}||_{1}+\lambda_{L}||\bm{\beta}||_{1}\equiv\hat{\bm{\beta}}^{FL}(\bm{y}-\frac{\lambda_{NI}}{2}D^{T}\bm{1},\frac{1}{2}\lambda_{NI}+\lambda_{F},\lambda_{L}).

\Box
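Proposition 1 is easy to sanity-check numerically; the following sketch verifies the second equality in (22) on the chain graph of Figure 1, reusing the flni_graph helper sketched in Section 1.2 (all helper names and data are assumptions of this illustration, not the paper's code):

import numpy as np

D = np.diff(np.eye(5), axis=0) * -1
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
lam_f, lam_ni = 0.3, 0.8

lhs = flni_graph(y, D, lam_f=lam_f, lam_l=0.0, lam_ni=lam_ni)       # beta^FNI(y, lam_f, lam_ni)
y_shift = y - 0.5 * lam_ni * D.T @ np.ones(D.shape[0])
rhs = flni_graph(y_shift, D, lam_f=lam_f + 0.5 * lam_ni, lam_l=0.0, lam_ni=0.0)  # fusion estimator
print(np.allclose(lhs, rhs, atol=1e-4))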

Next, we prove that, analogously to the fused lasso and nearly-isotonic regression, as one of the penalisation parameters increases the constant regions in the solution \hat{\bm{\beta}}^{FLNI} can only be joined together and not split apart. We prove this result only for the one-dimensional monotonic order, and the general case is an open question. This result could potentially be useful in future research on the solution path of the fused lasso nearly-isotonic approximator.

Proposition 2

Let \mathcal{I}=\{1,\dots,n\} with the natural order defined on it. Next, let \bm{\lambda}=(\lambda_{F},\lambda_{L},\lambda_{NI}) and \bm{\lambda}^{*}=(\lambda_{F}^{*},\lambda_{L}^{*},\lambda_{NI}^{*}) be triples of penalisation parameters such that one of the elements of \bm{\lambda}^{*} is greater than the corresponding element of \bm{\lambda}, while the other two are the same. Next, assume that for some i the solution \hat{\bm{\beta}}^{FLNI}(\bm{y},\bm{\lambda}) satisfies

\hat{\beta}_{i}^{FLNI}(\bm{y},\bm{\lambda})=\hat{\beta}_{i+1}^{FLNI}(\bm{y},\bm{\lambda}).

Then for \bm{\lambda}^{*} we have

\hat{\beta}_{i}^{FLNI}(\bm{y},\bm{\lambda}^{*})=\hat{\beta}_{i+1}^{FLNI}(\bm{y},\bm{\lambda}^{*}).

Proof.

Case 1: \lambda_{NI} and \lambda_{F} are fixed and \lambda_{L}^{*}>\lambda_{L}. The result of the proposition for this case follows directly from Theorem 2.

Case 2: \lambda_{F} and \lambda_{L} are fixed and \lambda_{NI}^{*}>\lambda_{NI}. Let us consider the fused nearly-isotonic regression and write the subgradient equations

g_{i}(\lambda_{NI})=-(y_{i}-\beta_{i})+\lambda_{NI}(q_{i}(\lambda_{NI})-q_{i-1}(\lambda_{NI}))+\lambda_{F}(t_{i}(\lambda_{NI})-t_{i-1}(\lambda_{NI}))=0,

where q_{i} and t_{i}, with i=1,\dots,n, are defined in (20), and, analogously to the proof of Theorem 2, q(\lambda_{NI}) and t(\lambda_{NI}) denote the values of these parameters at some value of \lambda_{NI}.

Assume that for \lambda_{NI} the solution \hat{\bm{\beta}}^{FNI}(\bm{y},\lambda_{F},\lambda_{NI}) has the following constant region

\hat{\beta}_{j-1}^{FNI}(\bm{y},\lambda_{F},\lambda_{NI})\neq\hat{\beta}_{j}^{FNI}(\bm{y},\lambda_{F},\lambda_{NI})=\dots=\hat{\beta}_{j+k}^{FNI}(\bm{y},\lambda_{F},\lambda_{NI})\neq\hat{\beta}_{j+k+1}^{FNI}(\bm{y},\lambda_{F},\lambda_{NI}), (24)

and, in a similar way as in Tibshirani et al. (2011), for \lambda_{NI}^{*} we consider the subset of the subgradient equations

g_{i}(\lambda_{NI}^{*})=-(y_{i}-\beta_{i})+\lambda_{NI}^{*}(q_{i}(\lambda_{NI}^{*})-q_{i-1}(\lambda_{NI}^{*}))+\lambda_{F}(t_{i}(\lambda_{NI}^{*})-t_{i-1}(\lambda_{NI}^{*}))=0, (25)

with i=j,\dots,j+k, and show that there exists a solution for which (24) holds, q_{i}\in[0,1] and t_{i}\in[-1,1].

Note first that as \lambda_{NI} increases, (24) holds until a merge with other groups happens, which means that q_{j-1},q_{j+k}\in\{0,1\} and t_{j-1},t_{j+k}\in\{-1,1\} will not change their values until the merge of this constant region. Also, as it follows from (20), for i\in[j,j+k] the value of t_{i} is in [-1,1]. Therefore, without any violation of the restrictions on t_{i} we can assume that t_{i}(\lambda_{NI}^{*})=t_{i}(\lambda_{NI}) for any i\in[j,j+k-1].

Next, taking pairwise differences between the subgradient equations for \lambda_{NI} we have

\lambda_{NI}A\tilde{\bm{q}}(\lambda_{NI})+\lambda_{F}A\tilde{\bm{t}}(\lambda_{NI})=D\tilde{\bm{y}}+\lambda_{NI}\bm{c}(\lambda_{NI})+\lambda_{F}\bm{d}(\lambda_{NI}),

where D is the incidence matrix displayed in Figure 1,

A=\begin{bmatrix}2&-1&0&\dots&0&0&0\\ -1&2&-1&\dots&0&0&0\\ \vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots\\ 0&0&0&\dots&-1&2&-1\\ 0&0&0&\dots&0&-1&2\end{bmatrix}, (26)

and \tilde{\bm{q}}(\lambda_{NI})=(q_{j}(\lambda_{NI}),\dots,q_{j+k-1}(\lambda_{NI})), \tilde{\bm{t}}(\lambda_{NI})=(t_{j}(\lambda_{NI}),\dots,t_{j+k-1}(\lambda_{NI})), \tilde{\bm{y}}=(y_{j},\dots,y_{j+k}), \bm{c}(\lambda_{NI})=(q_{j-1}(\lambda_{NI}),0,\dots,0,q_{j+k}(\lambda_{NI})), and \bm{d}(\lambda_{NI})=(t_{j-1}(\lambda_{NI}),0,\dots,0,t_{j+k}(\lambda_{NI})).

Since A is invertible we have

\lambda_{NI}\tilde{\bm{q}}(\lambda_{NI})+\lambda_{F}\tilde{\bm{t}}(\lambda_{NI})=A^{-1}D\tilde{\bm{y}}+\lambda_{NI}A^{-1}\bm{c}(\lambda_{NI})+\lambda_{F}A^{-1}\bm{d}(\lambda_{NI}),

and, since \tilde{\bm{q}}(\lambda_{NI}) and \tilde{\bm{t}}(\lambda_{NI}) provide the solution to the subgradient equations (25), then

-\lambda_{F}\leq\lambda_{NI}\tilde{\bm{q}}(\lambda_{NI})+\lambda_{F}\tilde{\bm{t}}(\lambda_{NI})\leq\lambda_{NI}+\lambda_{F}. (27)

Next, as pointed out in Friedman et al. (2007) and Tibshirani et al. (2011),

(A^{-1})_{i,1}=(n-i+1)/(n+1)\quad\text{and}\quad(A^{-1})_{i,n}=i/(n+1),

and one can show that

-\lambda_{F}\bm{1}\preceq\lambda_{NI}A^{-1}\bm{c}(\lambda_{NI})+\lambda_{F}A^{-1}\bm{d}(\lambda_{NI})\preceq\lambda_{NI}\bm{1}+\lambda_{F}\bm{1}. (28)

Further, let us consider the case of \lambda_{NI}^{*}>\lambda_{NI}. Then we have

\lambda_{NI}^{*}\tilde{\bm{q}}(\lambda_{NI}^{*})+\lambda_{F}\tilde{\bm{t}}(\lambda_{NI}^{*})=A^{-1}D\tilde{\bm{y}}+\lambda_{NI}^{*}A^{-1}\bm{c}(\lambda_{NI}^{*})+\lambda_{F}A^{-1}\bm{d}(\lambda_{NI}^{*}).

Recall that above we set \tilde{\bm{t}}(\lambda_{NI}^{*})=\tilde{\bm{t}}(\lambda_{NI}), and q_{j-1},q_{j+k},t_{j-1} and t_{j+k} do not change their values until the merge, which means that \bm{c}(\lambda_{NI}^{*})=\bm{c}(\lambda_{NI}) and \bm{d}(\lambda_{NI}^{*})=\bm{d}(\lambda_{NI}).

Therefore, the subgradient equations for \lambda_{NI}^{*} can be written as

\lambda_{NI}^{*}\tilde{\bm{q}}(\lambda_{NI}^{*})+\lambda_{F}\tilde{\bm{t}}(\lambda_{NI})=A^{-1}D\tilde{\bm{y}}+\lambda_{NI}^{*}A^{-1}\bm{c}(\lambda_{NI})+\lambda_{F}A^{-1}\bm{d}(\lambda_{NI}).

Next, since the term A^{-1}D\tilde{\bm{y}} is not changed, -\lambda_{F}\leq\lambda_{F}\tilde{\bm{t}}(\lambda_{NI})\leq\lambda_{F}, and

-\lambda_{F}\bm{1}\preceq\lambda_{NI}^{*}A^{-1}\bm{c}(\lambda_{NI})+\lambda_{F}A^{-1}\bm{d}(\lambda_{NI})\preceq\lambda_{NI}^{*}\bm{1}+\lambda_{F}\bm{1},

we have

\bm{0}\preceq\tilde{\bm{q}}(\lambda_{NI}^{*})\preceq\bm{1}.

Therefore we have proved that \hat{\beta}_{i}^{FNI}(\bm{y},\bm{\lambda}^{*})=\hat{\beta}_{i+1}^{FNI}(\bm{y},\bm{\lambda}^{*}). Since \hat{\beta}_{i}^{FLNI}(\bm{y},\bm{\lambda}^{*}) is given by soft thresholding of \hat{\beta}_{i}^{FNI}(\bm{y},\bm{\lambda}^{*}), it follows that \hat{\beta}_{i}^{FLNI}(\bm{y},\bm{\lambda}^{*})=\hat{\beta}_{i+1}^{FLNI}(\bm{y},\bm{\lambda}^{*}) for i\in[j,j+k-1].

Case 3: \lambda_{NI} and \lambda_{L} are fixed and \lambda_{F}^{*}>\lambda_{F}. The proof for this case is virtually identical to the proof for Case 2. In this case we assume that q_{i}(\lambda_{F}^{*})=q_{i}(\lambda_{F}) for any i\in[j,j+k-1]. Next, q_{j-1},q_{j+k},t_{j-1} and t_{j+k} do not change their values until the merge, which, again, means that \bm{c}(\lambda_{F}^{*})=\bm{c}(\lambda_{F}) and \bm{d}(\lambda_{F}^{*})=\bm{d}(\lambda_{F}). Finally, we can show that

-\bm{1}\preceq\tilde{\bm{t}}(\lambda_{F}^{*})\preceq\bm{1}.

\Box

4 Degrees of freedom

In this section we discuss the estimation of the degrees of freedom for the fused nearly-isotonic regression and the fused lasso nearly-isotonic signal approximator. Let us consider the following nonparametric model

\bm{Y}=\bm{\mathring{\beta}}+\bm{\varepsilon},

where \bm{\mathring{\beta}}\in\mathbb{R}^{n} is the unknown signal, and the error term \bm{\varepsilon}\sim\mathcal{N}(\bm{0},\sigma^{2}\bm{I}), with \sigma<\infty.

The degrees of freedom is a measure of the complexity of an estimator, and following Efron (1986), for fixed values of \lambda_{F}, \lambda_{L} and \lambda_{NI} the degrees of freedom of \hat{\bm{\beta}}^{FNI} and \hat{\bm{\beta}}^{FLNI} are given by

df(\hat{\bm{\beta}}^{FNI}(\bm{Y},\lambda_{F},\lambda_{NI}))=\frac{1}{\sigma^{2}}\sum_{i=1}^{n}\mathrm{Cov}[\hat{\beta}^{FNI}_{i}(\bm{Y},\lambda_{F},\lambda_{NI}),Y_{i}] (29)

and

df(\hat{\bm{\beta}}^{FLNI}(\bm{Y},\lambda_{F},\lambda_{L},\lambda_{NI}))=\frac{1}{\sigma^{2}}\sum_{i=1}^{n}\mathrm{Cov}[\hat{\beta}^{FLNI}_{i}(\bm{Y},\lambda_{F},\lambda_{L},\lambda_{NI}),Y_{i}]. (30)

The next theorem provides unbiased estimators of the degrees of freedom df(\hat{\bm{\beta}}^{FNI}) and df(\hat{\bm{\beta}}^{FLNI}).

Theorem 3

For fixed values of \lambda_{F}, \lambda_{L} and \lambda_{NI} let

K^{FNI}(\bm{y},\lambda_{F},\lambda_{NI})=\#\{\text{fused groups in }\hat{\bm{\beta}}^{FNI}(\bm{y},\lambda_{F},\lambda_{NI})\},

and

K^{FLNI}(\bm{y},\lambda_{F},\lambda_{L},\lambda_{NI})=\#\{\text{non-zero fused groups in }\hat{\bm{\beta}}^{FLNI}(\bm{y},\lambda_{F},\lambda_{L},\lambda_{NI})\}.

Then we have

\mathbb{E}[K^{FNI}(\bm{Y},\lambda_{F},\lambda_{NI})]=df(\hat{\bm{\beta}}^{FNI}(\bm{Y},\lambda_{F},\lambda_{NI})),

and

\mathbb{E}[K^{FLNI}(\bm{Y},\lambda_{F},\lambda_{L},\lambda_{NI})]=df(\hat{\bm{\beta}}^{FLNI}(\bm{Y},\lambda_{F},\lambda_{L},\lambda_{NI})).

Proof. First, for the fusion estimator \hat{\bm{\beta}}^{F}(\bm{y},\lambda_{F}) let

K^{F}(\bm{y},\lambda_{F})=\#\{\text{fused groups in }\hat{\bm{\beta}}^{F}(\bm{y},\lambda_{F})\}.

Then, as it follows from Tibshirani & Taylor (2011), for \hat{\bm{\beta}}^{F}(\bm{y},\lambda_{F}) we have

\mathbb{E}[K^{F}(\bm{Y},\lambda_{F})]=df(\hat{\bm{\beta}}^{F}(\bm{Y},\lambda_{F})).

Next, from Proposition 1 it follows that

\hat{\bm{\beta}}^{FNI}(\bm{y},\lambda_{F},\lambda_{NI})=\hat{\bm{\beta}}^{F}(\bm{y}-\frac{\lambda_{NI}}{2}D^{T}\bm{1},\frac{1}{2}\lambda_{NI}+\lambda_{F}).

Therefore, using the properties of covariance we have

df(\hat{\bm{\beta}}^{FNI}(\bm{Y},\lambda_{F},\lambda_{NI}))=\frac{1}{\sigma^{2}}\sum_{i=1}^{n}\mathrm{Cov}[\hat{\beta}^{FNI}_{i}(\bm{Y},\lambda_{F},\lambda_{NI}),Y_{i}]=\frac{1}{\sigma^{2}}\sum_{i=1}^{n}\mathrm{Cov}[\hat{\beta}^{F}_{i}(\bm{Y}-\frac{\lambda_{NI}}{2}D^{T}\bm{1},\frac{1}{2}\lambda_{NI}+\lambda_{F}),Y_{i}]=\frac{1}{\sigma^{2}}\sum_{i=1}^{n}\mathrm{Cov}[\hat{\beta}^{F}_{i}(\bm{Y}-\frac{\lambda_{NI}}{2}D^{T}\bm{1},\frac{1}{2}\lambda_{NI}+\lambda_{F}),Y_{i}-\frac{\lambda_{NI}}{2}[D^{T}\bm{1}]_{i}]=\mathbb{E}[K^{F}(\bm{Y}-\frac{\lambda_{NI}}{2}D^{T}\bm{1},\frac{1}{2}\lambda_{NI}+\lambda_{F})]\equiv\mathbb{E}[K^{FNI}(\bm{Y},\lambda_{F},\lambda_{NI})],

where [\bm{a}]_{i} denotes the i-th element of the vector \bm{a}\in\mathbb{R}^{n}.

Next, we prove the result for the fused lasso nearly-isotonic approximator. From Proposition 1 we have

\hat{\bm{\beta}}^{FLNI}(\bm{y},\lambda_{F},\lambda_{L},\lambda_{NI})=\hat{\bm{\beta}}^{FL}(\bm{y}-\frac{\lambda_{NI}}{2}D^{T}\bm{1},\frac{1}{2}\lambda_{NI}+\lambda_{F},\lambda_{L}).

Next, for the fused lasso \hat{\bm{\beta}}^{FL}(\bm{y},\lambda_{F},\lambda_{L}) defined in (12) let

K^{FL}(\bm{y},\lambda_{F},\lambda_{L})=\#\{\text{non-zero fused groups in }\hat{\bm{\beta}}^{FL}(\bm{y},\lambda_{F},\lambda_{L})\},

and from Tibshirani & Taylor (2011) it follows that

\mathbb{E}[K^{FL}(\bm{Y},\lambda_{F},\lambda_{L})]=df(\hat{\bm{\beta}}^{FL}(\bm{Y},\lambda_{F},\lambda_{L})).

Further, again using the properties of covariance, we have

df(\hat{\bm{\beta}}^{FLNI}(\bm{Y},\lambda_{F},\lambda_{L},\lambda_{NI}))=\frac{1}{\sigma^{2}}\sum_{i=1}^{n}\mathrm{Cov}[\hat{\beta}^{FLNI}_{i}(\bm{Y},\lambda_{F},\lambda_{L},\lambda_{NI}),Y_{i}]=\frac{1}{\sigma^{2}}\sum_{i=1}^{n}\mathrm{Cov}[\hat{\beta}^{FL}_{i}(\bm{Y}-\frac{\lambda_{NI}}{2}D^{T}\bm{1},\frac{1}{2}\lambda_{NI}+\lambda_{F},\lambda_{L}),Y_{i}]=\frac{1}{\sigma^{2}}\sum_{i=1}^{n}\mathrm{Cov}[\hat{\beta}^{FL}_{i}(\bm{Y}-\frac{\lambda_{NI}}{2}D^{T}\bm{1},\frac{1}{2}\lambda_{NI}+\lambda_{F},\lambda_{L}),Y_{i}-\frac{\lambda_{NI}}{2}[D^{T}\bm{1}]_{i}]=\mathbb{E}[K^{FL}(\bm{Y}-\frac{\lambda_{NI}}{2}D^{T}\bm{1},\frac{1}{2}\lambda_{NI}+\lambda_{F},\lambda_{L})]\equiv\mathbb{E}[K^{FLNI}(\bm{Y},\lambda_{F},\lambda_{L},\lambda_{NI})].

Lastly, we note that the proof for the unbiased estimator of the degrees of freedom for nearly-isotonic regression, given in Tibshirani et al. (2011), can be obtained in the same way as in the current proof, using the relation (21) and, again, the result of Tibshirani & Taylor (2011) for the fusion estimator \hat{\bm{\beta}}^{F}(\bm{Y},\lambda_{F}). \Box

We can use the estimate of the degrees of freedom for unbiased estimation of the true risk \mathbb{E}[\sum_{i=1}^{n}(\mathring{\beta}_{i}-\hat{\beta}^{FLNI}_{i}(\bm{Y},\lambda_{F},\lambda_{L},\lambda_{NI}))^{2}], which is given by the \hat{C}_{p} statistic

\hat{C}_{p}(\lambda_{F},\lambda_{L},\lambda_{NI})=\sum_{i=1}^{n}(y_{i}-\hat{\beta}^{FLNI}_{i}(\bm{y},\lambda_{F},\lambda_{L},\lambda_{NI}))^{2}-n\sigma^{2}+2\sigma^{2}K^{FLNI}(\bm{y},\lambda_{F},\lambda_{L},\lambda_{NI}).
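A sketch of how K^{FLNI} and \hat{C}_{p} could be computed for the one-dimensional chain (counting maximal runs of equal, non-zero values; for a general graph one would instead count connected components of equal non-zero values; this is an illustration, not the paper's code):

import numpy as np

def n_nonzero_fused_groups(beta_hat, tol=1e-8):
    groups, prev = 0, None
    for b in beta_hat:
        if abs(b) > tol and (prev is None or abs(b - prev) > tol):
            groups += 1                      # a new non-zero fused group starts here
        prev = b
    return groups

def cp_statistic(y, beta_hat, sigma2):
    k = n_nonzero_fused_groups(beta_hat)
    return np.sum((y - beta_hat) ** 2) - len(y) * sigma2 + 2 * sigma2 * k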

5 Acknowledgments

This work was partially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation.



References

  • Beran & Dümbgen (2010) Beran, R. & Dümbgen, L. (2010), ‘Least squares and shrinkage estimation under bimonotonicity constraints’, Statistics and Computing 20(2), 177–189.
  • Best & Chakravarti (1990) Best, M. J. & Chakravarti, N. (1990), ‘Active set algorithms for isotonic regression; a unifying framework’, Mathematical Programming 47(1), 425–439.
  • Efron (1986) Efron, B. (1986), ‘How biased is the apparent error rate of a prediction rule?’, Journal of the American Statistical Association 81(394), 461–470.
  • Friedman et al. (2007) Friedman, J., Hastie, T., Höfling, H. & Tibshirani, R. (2007), ‘Pathwise coordinate optimization’, The Annals of Applied Statistics 1(2), 302–332.
  • Hoerl & Kennard (1970) Hoerl, A. E. & Kennard, R. W. (1970), ‘Ridge regression: Biased estimation for nonorthogonal problems’, Technometrics 12(1), 55–67.
  • Kim et al. (2009) Kim, S.-J., Koh, K., Boyd, S. & Gorinevsky, D. (2009), ‘\ell_{1} trend filtering’, SIAM Review 51(2), 339–360.
  • Minami (2020) Minami, K. (2020), ‘Estimating piecewise monotone signals’, Electronic Journal of Statistics 14(1), 1508–1576.
  • Phillips (1962) Phillips, D. L. (1962), ‘A technique for the numerical solution of certain integral equations of the first kind’, Journal of the ACM (JACM) 9(1), 84–97.
  • Rinaldo (2009) Rinaldo, A. (2009), ‘Properties and refinements of the fused lasso’, The Annals of Statistics 37(5B), 2922–2952.
  • Stout (2013) Stout, Q. F. (2013), ‘Isotonic regression via partitioning’, Algorithmica 66(1), 93–112.
  • Tibshirani (1996) Tibshirani, R. (1996), ‘Regression shrinkage and selection via the lasso’, Journal of the Royal Statistical Society: Series B (Methodological) 58(1), 267–288.
  • Tibshirani et al. (2011) Tibshirani, R. J., Hoefling, H. & Tibshirani, R. (2011), ‘Nearly-isotonic regression’, Technometrics 53(1), 54–61.
  • Tibshirani & Taylor (2011) Tibshirani, R. J. & Taylor, J. (2011), ‘The solution path of the generalized lasso’, The Annals of Statistics 39(3), 1335–1371.
  • Tibshirani et al. (2005) Tibshirani, R., Saunders, M., Rosset, S., Zhu, J. & Knight, K. (2005), ‘Sparsity and smoothness via the fused lasso’, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67(1), 91–108.
  • Tikhonov et al. (1995) Tikhonov, A. N., Goncharsky, A., Stepanov, V. & Yagola, A. G. (1995), Numerical methods for the solution of ill-posed problems, Vol. 328, Springer Science & Business Media.