
Fused Lasso Nearly Isotonic Signal Approximation in General Dimensions

Vladimir Pastukhov
Department of Computer Science and Engineering,
Chalmers, Sweden
Email: [email protected]
Abstract

In this paper we introduce and study the fused lasso nearly-isotonic signal approximator, which is a combination of the fused lasso and generalized nearly-isotonic regression. We show how these estimators relate to each other, derive the solution to the general problem, show that it is computationally feasible, and show that it provides a trade-off between piecewise monotonicity, sparsity and goodness-of-fit. We also derive an unbiased estimator of the degrees of freedom of the approximator.


Keywords: Constrained inference; Isotonic regression; Nearly-isotonic Regression; Fused lasso

1 Introduction

This work is motivated by recent papers on nearly-constrained estimation and on penalized least squares regression. The subject of penalized estimators starts with L_{1}-penalisation (Tibshirani, 1996), i.e.

\hat{\bm{\beta}}^{L}(\bm{y},\lambda_{L})=\underset{\bm{\beta}\in\mathbb{R}^{n}}{\arg\min}\,\frac{1}{2}||\bm{y}-\bm{\beta}||_{2}^{2}+\lambda_{L}||\bm{\beta}||_{1},

which is called the lasso signal approximation. The L_{2}-penalisation is usually referred to as ridge regression (Hoerl & Kennard, 1970) or sometimes as Tikhonov-Phillips regularization (Phillips, 1962; Tikhonov et al., 1995).

First, to set up the ideas and for simplicity of notation we consider the one-dimensional versions of the penalized estimators. In the next subsection we generalise these estimators to the case of isotonic constraints with respect to a general partial order.

For a given sequence of data points \bm{y}\in\mathbb{R}^{n} the fusion approximator (cf. Rinaldo (2009)) is given by

\hat{\bm{\beta}}^{F}(\bm{y},\lambda_{F})=\underset{\bm{\beta}\in\mathbb{R}^{n}}{\arg\min}\,\frac{1}{2}||\bm{y}-\bm{\beta}||_{2}^{2}+\lambda_{F}\sum_{i=1}^{n-1}|\beta_{i}-\beta_{i+1}|. (1)

Further, the combination of the fusion estimator and the lasso is called the fused lasso estimator and is given by

\hat{\bm{\beta}}^{FL}(\bm{y},\lambda_{F},\lambda_{L})=\underset{\bm{\beta}\in\mathbb{R}^{n}}{\arg\min}\,\frac{1}{2}||\bm{y}-\bm{\beta}||_{2}^{2}+\lambda_{F}\sum_{i=1}^{n-1}|\beta_{i}-\beta_{i+1}|+\lambda_{L}||\bm{\beta}||_{1}. (2)

The fused lasso was introduced in Tibshirani et al. (2005) and its asymptotic properties were studied in detail in Rinaldo (2009).

Remark 1

In Tibshirani & Taylor (2011) the estimator in (1) is called the fused lasso, while the estimator in (2) is referred to as the sparse fused lasso.

In the area of constrained inference the most basic problem is isotonic regression in one dimension. For a given sequence of data points \bm{y}\in\mathbb{R}^{n} the isotonic regression is the following approximation

\hat{\bm{\beta}}^{I}=\underset{\bm{\beta}\in\mathbb{R}^{n}}{\arg\min}||\bm{y}-\bm{\beta}||_{2}^{2},\quad\text{subject to}\quad\beta_{1}\leq\beta_{2}\leq\dots\leq\beta_{n}, (3)

i.e. the isotonic regression is the \ell^{2}-projection of the vector \bm{y} onto the set of non-decreasing vectors in \mathbb{R}^{n}. The notion of isotonic "regression" in this context might be confusing. Nevertheless, it is standard terminology in this subject, cf., for example, the papers Best & Chakravarti (1990); Stout (2013), where the term "isotonic regression" is used for the isotonic projection of a general vector. Also, in this paper we use the terms "regression", "estimator" and "approximator" interchangeably.

The nearly-isotonic regression, introduced in Tibshirani et al. (2011) and studied in detail in Minami (2020), is a less restrictive version of the isotonic regression and is given by the following optimization problem

\hat{\bm{\beta}}^{NI}(\bm{y},\lambda_{NI})=\underset{\bm{\beta}\in\mathbb{R}^{n}}{\arg\min}\,\frac{1}{2}||\bm{y}-\bm{\beta}||_{2}^{2}+\lambda_{NI}\sum_{i=1}^{n-1}|\beta_{i}-\beta_{i+1}|_{+}, (4)

where x_{+}=x\cdot 1\{x>0\}.

In this paper we combine the fused lasso estimator with nearly-isotonic regression and call the resulting estimator the fused lasso nearly-isotonic signal approximator, i.e. for a given sequence of data points \bm{y}\in\mathbb{R}^{n} the problem in the one-dimensional case is the following optimization

\hat{\bm{\beta}}^{FLNI}(\bm{y},\lambda_{F},\lambda_{L},\lambda_{NI})=\underset{\bm{\beta}\in\mathbb{R}^{n}}{\arg\min}\,\frac{1}{2}||\bm{y}-\bm{\beta}||_{2}^{2}+\lambda_{F}\sum_{i=1}^{n-1}|\beta_{i}-\beta_{i+1}|+\lambda_{L}||\bm{\beta}||_{1}+\lambda_{NI}\sum_{i=1}^{n-1}|\beta_{i}-\beta_{i+1}|_{+}. (5)

Also, in the case of \lambda_{F}\neq 0 and \lambda_{NI}\neq 0, with \lambda_{L}=0, we call the estimator the fused nearly-isotonic regression, i.e.

\hat{\bm{\beta}}^{FNI}(\bm{y},\lambda_{F},\lambda_{NI})\equiv\hat{\bm{\beta}}^{FLNI}(\bm{y},\lambda_{F},0,\lambda_{NI})=\underset{\bm{\beta}\in\mathbb{R}^{n}}{\arg\min}\,\frac{1}{2}||\bm{y}-\bm{\beta}||_{2}^{2}+\lambda_{F}\sum_{i=1}^{n-1}|\beta_{i}-\beta_{i+1}|+\lambda_{NI}\sum_{i=1}^{n-1}|\beta_{i}-\beta_{i+1}|_{+}. (6)

The generalisation of nearly-isotonic regression in (6) was proposed in the conclusion of the paper Tibshirani et al. (2011).
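To make the one-dimensional problem (5) concrete, the following minimal sketch solves it numerically with the generic convex solver cvxpy (the library, the helper name flni_1d and the toy data are assumptions of this illustration, not part of the paper):

import cvxpy as cp
import numpy as np

def flni_1d(y, lam_f, lam_l, lam_ni):
    # Fused lasso nearly-isotonic approximation (5) of a 1-d signal y.
    n = len(y)
    beta = cp.Variable(n)
    diff = beta[:-1] - beta[1:]                 # beta_i - beta_{i+1}
    obj = (0.5 * cp.sum_squares(y - beta)
           + lam_f * cp.norm1(diff)             # fusion penalty
           + lam_l * cp.norm1(beta)             # lasso penalty
           + lam_ni * cp.sum(cp.pos(diff)))     # nearly-isotonic penalty
    cp.Problem(cp.Minimize(obj)).solve()
    return beta.value

# Example: a noisy, roughly increasing signal.
rng = np.random.default_rng(0)
y = np.sort(rng.normal(size=20)) + 0.3 * rng.normal(size=20)
print(flni_1d(y, lam_f=0.5, lam_l=0.1, lam_ni=1.0))

Setting lam_l=0 gives the fused nearly-isotonic regression (6), and setting lam_f=lam_l=0 recovers the nearly-isotonic regression (4).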

We will state the problem in (5) for the case of isotonic constraints with respect to a general partial order, but first we have to introduce some notation.

1.1 Notation

We start with basic definitions of partial orders and isotonic regression. Let \mathcal{I}=\{\bm{i}_{1},\dots,\bm{i}_{n}\} be some index set. Next, we define the following binary relation \preceq on \mathcal{I}.

A binary relation \preceq on \mathcal{I} is called a partial order if

  • it is reflexive, i.e. \bm{j}\preceq\bm{j} for all \bm{j}\in\mathcal{I};

  • it is transitive, i.e. \bm{j}_{1},\bm{j}_{2},\bm{j}_{3}\in\mathcal{I}, \bm{j}_{1}\preceq\bm{j}_{2} and \bm{j}_{2}\preceq\bm{j}_{3} imply \bm{j}_{1}\preceq\bm{j}_{3};

  • it is antisymmetric, i.e. \bm{j}_{1},\bm{j}_{2}\in\mathcal{I}, \bm{j}_{1}\preceq\bm{j}_{2} and \bm{j}_{2}\preceq\bm{j}_{1} imply \bm{j}_{1}=\bm{j}_{2}.

Further, a vector \bm{\beta}\in\mathbb{R}^{n} indexed by \mathcal{I} is called isotonic with respect to the partial order \preceq on \mathcal{I} if \bm{j}_{1}\preceq\bm{j}_{2} implies \beta_{\bm{j}_{1}}\leq\beta_{\bm{j}_{2}}. We denote the set of all isotonic vectors in \mathbb{R}^{n} with respect to the partial order \preceq on \mathcal{I} by \bm{\mathcal{B}}^{is}, which is also called the isotonic cone. Next, a vector \bm{\beta}^{I}\in\mathbb{R}^{n} is the isotonic regression of an arbitrary vector \bm{y}\in\mathbb{R}^{n} over the partially ordered index set \mathcal{I} if

\bm{\beta}^{I}=\underset{\bm{\beta}\in\bm{\mathcal{B}}^{is}}{\arg\min}\sum_{\bm{j}\in\mathcal{I}}(\beta_{\bm{j}}-y_{\bm{j}})^{2}. (7)

For any partial order relation \preceq on \mathcal{I} there exists a directed graph G=(V,E), with V=\mathcal{I} and edge set

E\subseteq\{(\bm{j}_{1},\bm{j}_{2}):\,\bm{j}_{1},\bm{j}_{2}\in\mathcal{I}\}, (8)

such that an arbitrary vector \bm{\beta}\in\mathbb{R}^{n} is isotonic with respect to \preceq iff \beta_{\bm{j}_{1}}\leq\beta_{\bm{j}_{2}} for any (\bm{j}_{1},\bm{j}_{2})\in E. Therefore, equivalently to the definition in (7), a vector \bm{\beta}^{I}\in\mathbb{R}^{n} is the isotonic regression of an arbitrary vector \bm{y}\in\mathbb{R}^{n} indexed by the partially ordered index set \mathcal{I} if

\bm{\beta}^{I}=\underset{\bm{\beta}}{\arg\min}\sum_{\bm{j}\in\mathcal{I}}(\beta_{\bm{j}}-y_{\bm{j}})^{2},\quad\text{subject to}\quad\beta_{\bm{j}_{1}}\leq\beta_{\bm{j}_{2}},\quad\text{whenever}\quad(\bm{j}_{1},\bm{j}_{2})\in E. (9)

Further, for the directed graph G=(V,E) which corresponds to the partial order \preceq on \mathcal{I}, the nearly-isotonic regression of \bm{y}\in\mathbb{R}^{n} indexed by \mathcal{I} is given by

\hat{\bm{\beta}}^{NI}(\bm{y},\lambda_{NI})=\underset{\bm{\beta}\in\mathbb{R}^{n}}{\arg\min}\,\frac{1}{2}||\bm{y}-\bm{\beta}||_{2}^{2}+\lambda_{NI}\sum_{(\bm{i},\bm{j})\in E}|\beta_{\bm{i}}-\beta_{\bm{j}}|_{+}. (10)

This generalisation of nearly-isotonic regression was introduced and studied in Minami (2020).

Next, the fusion and fused lasso approximators for a general directed graph G=(V,E) are given by

\hat{\bm{\beta}}^{F}(\bm{y},\lambda_{F})=\underset{\bm{\beta}\in\mathbb{R}^{n}}{\arg\min}\,\frac{1}{2}||\bm{y}-\bm{\beta}||_{2}^{2}+\lambda_{F}\sum_{(\bm{i},\bm{j})\in E}|\beta_{\bm{i}}-\beta_{\bm{j}}|, (11)

and

\hat{\bm{\beta}}^{FL}(\bm{y},\lambda_{F},\lambda_{L})=\underset{\bm{\beta}\in\mathbb{R}^{n}}{\arg\min}\,\frac{1}{2}||\bm{y}-\bm{\beta}||_{2}^{2}+\lambda_{F}\sum_{(\bm{i},\bm{j})\in E}|\beta_{\bm{i}}-\beta_{\bm{j}}|+\lambda_{L}||\bm{\beta}||_{1}. (12)

These optimization problems were introduced and solved for a general directed graph in Tibshirani & Taylor (2011).

Further, let D denote the oriented incidence matrix of the directed graph G=(V,E) corresponding to \preceq on \mathcal{I}. We choose the orientation of D in the following way. Assume that the graph G with n vertices has m edges, and that the vertices are labelled by \{1,\dots,n\} and the edges by \{1,\dots,m\}. Then D is the m\times n matrix with

D_{i,j}=\begin{cases}1,&\quad\text{if vertex $j$ is the source of edge $i$},\\ -1,&\quad\text{if vertex $j$ is the target of edge $i$},\\ 0,&\quad\text{otherwise}.\end{cases}
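As a small illustration (an assumption of this note, not code from the paper), the oriented incidence matrix with this sign convention can be built directly from an edge list:

import numpy as np

def incidence_matrix(n_vertices, edges):
    # One row per edge: +1 at the source vertex, -1 at the target vertex.
    D = np.zeros((len(edges), n_vertices))
    for row, (src, tgt) in enumerate(edges):
        D[row, src] = 1.0
        D[row, tgt] = -1.0
    return D

# Chain graph of the one-dimensional monotonic order on 5 points (cf. Figure 1).
edges_1d = [(i, i + 1) for i in range(4)]
print(incidence_matrix(5, edges_1d))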

In order to clarify the notation we consider the following examples of partial order relations. First, let us consider the monotonic order relation in the one-dimensional case. Let \mathcal{I}=\{1,\dots,n\}, and for j_{1}\in\mathcal{I} and j_{2}\in\mathcal{I} we naturally define j_{1}\preceq j_{2} if j_{1}\leq j_{2}. Further, if we let V=\mathcal{I} and E=\{(i,i+1):i=1,\dots,n-1\}, then G=(V,E) is the directed graph which corresponds to the one-dimensional order relation on \mathcal{I}. Figure 1 displays the graph and its incidence matrix.

(a) Graph G=(V,E)

D=\begin{pmatrix}1&-1&0&0&0\\ 0&1&-1&0&0\\ 0&0&1&-1&0\\ 0&0&0&1&-1\end{pmatrix}

(b) Oriented incidence matrix D

Figure 1: Graph for monotonic constraints and its oriented incidence matrix

Next, we consider the two-dimensional case with bimonotonic constraints. The notion of bimonotonicity was first introduced in Beran & Dümbgen (2010) and means the following. Let us consider the index set

\mathcal{I}=\{\bm{i}=(i^{(1)},i^{(2)}):\,i^{(1)}=1,2,\dots,n_{1},\,i^{(2)}=1,2,\dots,n_{2}\}

with the following order relation \preceq on it: for \bm{j}_{1},\bm{j}_{2}\in\mathcal{I} we have \bm{j}_{1}\preceq\bm{j}_{2} iff j^{(1)}_{1}\leq j^{(1)}_{2} and j^{(2)}_{1}\leq j^{(2)}_{2}. Then, a vector \bm{\beta}\in\mathbb{R}^{n}, with n=n_{1}n_{2}, indexed by \mathcal{I} is called bimonotone if it is isotonic with respect to the bimonotone order \preceq defined on its index set \mathcal{I}. Further, we define the directed graph G=(V,E) with vertices V=\mathcal{I} and edges

E=\{((l,k),(l,k+1)):\,1\leq l\leq n_{1},1\leq k\leq n_{2}-1\}\cup\{((l,k),(l+1,k)):\,1\leq l\leq n_{1}-1,1\leq k\leq n_{2}\}.

The labeled graph and its incidence matrix are displayed in Figure 2.

(a) Graph G=(V,E)

D=\begin{pmatrix}1&-1&0&0&0&\dots&0&0\\ 0&1&-1&0&0&\dots&0&0\\ \vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots\\ 0&0&0&0&0&\dots&1&-1\\ 1&0&0&0&-1&\dots&0&0\\ 0&1&0&0&0&\dots&0&0\\ \vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots\\ 0&0&0&0&0&\dots&0&-1\end{pmatrix}

(b) Oriented incidence matrix D\in\mathbb{R}^{17\times 12}

Figure 2: Graph for bimonotonic constraints and its oriented incidence matrix
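For completeness, here is a short sketch (a hypothetical helper, not from the paper) that enumerates the edge set E of the bimonotone grid graph; a 3-by-4 grid gives 17 edges on 12 vertices, consistent with D\in\mathbb{R}^{17\times 12} in Figure 2:

def bimonotone_edges(n1, n2):
    idx = lambda l, k: l * n2 + k                       # row-major vertex label of (l, k)
    edges = []
    for l in range(n1):
        for k in range(n2 - 1):
            edges.append((idx(l, k), idx(l, k + 1)))    # edge within a row
    for l in range(n1 - 1):
        for k in range(n2):
            edges.append((idx(l, k), idx(l + 1, k)))    # edge within a column
    return edges

print(len(bimonotone_edges(3, 4)))   # 3*3 + 2*4 = 17 edges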

1.2 General statement of the problem

Now we can state the general problem studied in this paper. Let \bm{y}\in\mathbb{R}^{n} be a signal indexed by the index set \mathcal{I} with the partial order relation \preceq defined on \mathcal{I}. Next, let G=(V,E) be the directed graph corresponding to \preceq on \mathcal{I}. The fused lasso nearly-isotonic signal approximation with respect to \preceq on \mathcal{I} (or, equivalently, to the directed graph G=(V,E) corresponding to \preceq) is given by

\hat{\bm{\beta}}^{FLNI}(\bm{y},\lambda_{F},\lambda_{L},\lambda_{NI})=\underset{\bm{\beta}\in\mathbb{R}^{n}}{\arg\min}\,\frac{1}{2}||\bm{y}-\bm{\beta}||_{2}^{2}+\lambda_{F}\sum_{(\bm{i},\bm{j})\in E}|\beta_{\bm{i}}-\beta_{\bm{j}}|+\lambda_{L}||\bm{\beta}||_{1}+\lambda_{NI}\sum_{(\bm{i},\bm{j})\in E}|\beta_{\bm{i}}-\beta_{\bm{j}}|_{+}. (13)

Therefore, the estimator in (13) is a combination of the estimators in (10) and (12).

Equivalently, we can rewrite the problem in the following way:

\hat{\bm{\beta}}^{FLNI}(\bm{y},\lambda_{F},\lambda_{L},\lambda_{NI})=\underset{\bm{\beta}\in\mathbb{R}^{n}}{\arg\min}\,\frac{1}{2}||\bm{y}-\bm{\beta}||_{2}^{2}+\lambda_{F}||D\bm{\beta}||_{1}+\lambda_{L}||\bm{\beta}||_{1}+\lambda_{NI}||D\bm{\beta}||_{+}, (14)

where D is the oriented incidence matrix of the graph G=(V,E) and ||\bm{x}||_{+}=\sum_{i}(x_{i})_{+} denotes the sum of the positive parts of the components of \bm{x}.

Analogously to the definition in the one-dimensional case, if \lambda_{L}=0 we call the estimator the fused nearly-isotonic approximator and denote it by \hat{\bm{\beta}}^{FNI}(\bm{y},\lambda_{F},\lambda_{NI}).
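The general problem (14) can be prototyped in the same way as the one-dimensional sketch above, now parametrised by the oriented incidence matrix D of an arbitrary graph (again a cvxpy-based illustration under the same assumptions, not software from the paper):

import cvxpy as cp

def flni_graph(y, D, lam_f, lam_l, lam_ni):
    beta = cp.Variable(len(y))
    Dbeta = D @ beta                              # differences beta_i - beta_j over edges (i, j)
    obj = (0.5 * cp.sum_squares(y - beta)
           + lam_f * cp.norm1(Dbeta)              # lambda_F ||D beta||_1
           + lam_l * cp.norm1(beta)               # lambda_L ||beta||_1
           + lam_ni * cp.sum(cp.pos(Dbeta)))      # lambda_NI ||D beta||_+
    cp.Problem(cp.Minimize(obj)).solve()
    return beta.value

Any D built as in the incidence_matrix sketch of Section 1.1 (chain or bimonotone grid) can be passed in directly.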

2 The solution to the fused lasso nearly-isotonic signal approximator

First, we consider the fused nearly-isotonic regression, i.e. in (14) we assume that \lambda_{L}=0.

Theorem 1

For a fixed data vector \bm{y}\in\mathbb{R}^{n} indexed by the index set \mathcal{I} with the partial order relation \preceq defined on \mathcal{I}, the solution to the fused nearly-isotonic problem in (14) is given by

\hat{\bm{\beta}}^{FNI}(\bm{y},\lambda_{F},\lambda_{NI})=\bm{y}-D^{T}\hat{\bm{\nu}}(\lambda_{F},\lambda_{NI}) (15)

with

\hat{\bm{\nu}}(\bm{y},\lambda_{F},\lambda_{NI})=\underset{\bm{\nu}\in\mathbb{R}^{m}}{\arg\min}\,\frac{1}{2}||\bm{y}-D^{T}\bm{\nu}||_{2}^{2}\quad\text{subject to}\quad-\lambda_{F}\bm{1}\leq\bm{\nu}\leq(\lambda_{F}+\lambda_{NI})\bm{1}, (16)

where D is the m\times n incidence matrix of the directed graph G=(V,E) corresponding to \preceq on \mathcal{I}, with m the number of edges, \bm{1}\in\mathbb{R}^{m} is the vector whose elements are all equal to 1, and the notation \bm{a}\leq\bm{b} for vectors \bm{a},\bm{b} means a_{i}\leq b_{i} for all i.

Proof. First, following the derivations of \ell_{1} trend filtering and the generalised lasso in Kim et al. (2009) and Tibshirani & Taylor (2011), respectively, we can write the optimization problem in (14) with \lambda_{L}=0 in the following way

\underset{\bm{\beta},\bm{z}}{\text{minimize}}\,\frac{1}{2}||\bm{y}-\bm{\beta}||_{2}^{2}+\lambda_{F}||\bm{z}||_{1}+\lambda_{NI}||\bm{z}||_{+}\quad\text{subject to}\quad D\bm{\beta}=\bm{z}\in\mathbb{R}^{m}.

Further, the Lagrangian is given by

L(\bm{\beta},\bm{z},\bm{\nu})=\frac{1}{2}||\bm{y}-\bm{\beta}||_{2}^{2}+\lambda_{F}||\bm{z}||_{1}+\lambda_{NI}||\bm{z}||_{+}+\bm{\nu}^{T}(D\bm{\beta}-\bm{z}), (17)

where \bm{\nu}\in\mathbb{R}^{m} is a dual variable.

Note that

\underset{\bm{z}}{\min}\big{(}\lambda_{F}||\bm{z}||_{1}+\lambda_{NI}||\bm{z}||_{+}-\bm{\nu}^{T}\bm{z}\big{)}=\begin{cases}0,&\quad\text{if}\quad-\lambda_{F}\bm{1}\leq\bm{\nu}\leq(\lambda_{F}+\lambda_{NI})\bm{1},\\ -\infty,&\quad\text{otherwise},\end{cases}

and

\underset{\bm{\beta}}{\min}\big{(}\frac{1}{2}||\bm{y}-\bm{\beta}||_{2}^{2}+\bm{\nu}^{T}D\bm{\beta}\big{)}=-\frac{1}{2}\bm{\nu}^{T}DD^{T}\bm{\nu}+\bm{y}^{T}D^{T}\bm{\nu}=-\frac{1}{2}||\bm{y}-D^{T}\bm{\nu}||_{2}^{2}+\frac{1}{2}\bm{y}^{T}\bm{y}.

Next, the dual function is given by

g(\bm{\nu})=\underset{\bm{\beta},\bm{z}}{\min}\,L(\bm{\beta},\bm{z},\bm{\nu})=\begin{cases}-\frac{1}{2}||\bm{y}-D^{T}\bm{\nu}||_{2}^{2}+\frac{1}{2}\bm{y}^{T}\bm{y},&\quad\text{if}\quad-\lambda_{F}\bm{1}\leq\bm{\nu}\leq(\lambda_{F}+\lambda_{NI})\bm{1},\\ -\infty,&\quad\text{otherwise},\end{cases}

and, therefore, the dual problem is

\hat{\bm{\nu}}(\bm{y},\lambda_{F},\lambda_{NI})=\underset{\bm{\nu}}{\arg\max}\,g(\bm{\nu})\quad\text{subject to}\quad-\lambda_{F}\bm{1}\leq\bm{\nu}\leq(\lambda_{F}+\lambda_{NI})\bm{1},

which is equivalent to

\hat{\bm{\nu}}(\bm{y},\lambda_{F},\lambda_{NI})=\underset{\bm{\nu}}{\arg\min}\,\frac{1}{2}||\bm{y}-D^{T}\bm{\nu}||_{2}^{2}\quad\text{subject to}\quad-\lambda_{F}\bm{1}\leq\bm{\nu}\leq(\lambda_{F}+\lambda_{NI})\bm{1}.

Lastly, taking the first derivative of the Lagrangian L(\bm{\beta},\bm{z},\bm{\nu}) with respect to \bm{\beta} we get the following relation between \hat{\bm{\beta}}^{FNI}(\lambda_{F},\lambda_{NI}) and \hat{\bm{\nu}}(\bm{y},\lambda_{F},\lambda_{NI}):

\hat{\bm{\beta}}^{FNI}(\bm{y},\lambda_{F},\lambda_{NI})=\bm{y}-D^{T}\hat{\bm{\nu}}(\bm{y},\lambda_{F},\lambda_{NI}).

\Box
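Theorem 1 also suggests a simple computational route (a sketch under stated assumptions: scipy's bounded least-squares solver and the chain graph of Figure 1; this is not the paper's own implementation): solve the box-constrained dual (16) and recover the primal solution via (15).

import numpy as np
from scipy.optimize import lsq_linear

def fni_dual(y, D, lam_f, lam_ni):
    # Dual (16): minimize 0.5*||y - D^T nu||^2  s.t.  -lam_f <= nu <= lam_f + lam_ni.
    res = lsq_linear(D.T, y, bounds=(-lam_f, lam_f + lam_ni))
    nu_hat = res.x
    return y - D.T @ nu_hat                       # primal solution via (15)

D = np.diff(np.eye(5), axis=0) * -1               # chain-graph incidence matrix (+1 source, -1 target)
y = np.array([3.0, 2.5, 2.7, 1.0, 1.2])
print(fni_dual(y, D, lam_f=0.2, lam_ni=1.0))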

Next, we provide the solution to the fused lasso nearly-isotonic regression.

Theorem 2

For a given vector \bm{y} indexed by \mathcal{I} the fused lasso nearly-isotonic signal approximator \hat{\bm{\beta}}^{FLNI}(\bm{y},\lambda_{F},\lambda_{L},\lambda_{NI}) is given by soft thresholding the fused nearly-isotonic regression \hat{\bm{\beta}}^{FNI}(\bm{y},\lambda_{F},\lambda_{NI}), i.e.

\hat{\beta}^{FLNI}_{\bm{i}}(\bm{y},\lambda_{F},\lambda_{L},\lambda_{NI})=\begin{cases}\hat{\beta}^{FNI}_{\bm{i}}(\bm{y},\lambda_{F},\lambda_{NI})-\lambda_{L},&\quad\text{if}\quad\hat{\beta}^{FNI}_{\bm{i}}(\bm{y},\lambda_{F},\lambda_{NI})\geq\lambda_{L},\\ 0,&\quad\text{if}\quad|\hat{\beta}^{FNI}_{\bm{i}}(\bm{y},\lambda_{F},\lambda_{NI})|\leq\lambda_{L},\\ \hat{\beta}^{FNI}_{\bm{i}}(\bm{y},\lambda_{F},\lambda_{NI})+\lambda_{L},&\quad\text{if}\quad\hat{\beta}^{FNI}_{\bm{i}}(\bm{y},\lambda_{F},\lambda_{NI})\leq-\lambda_{L},\end{cases} (18)

for \bm{i}\in\mathcal{I}.
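In code, Theorem 2 says that once \hat{\bm{\beta}}^{FNI} is available, \hat{\bm{\beta}}^{FLNI} is obtained by elementwise soft thresholding at level \lambda_{L}; a minimal sketch (hypothetical helper names, not from the paper):

import numpy as np

def soft_threshold(beta_fni, lam_l):
    # Componentwise soft thresholding, i.e. formula (18).
    return np.sign(beta_fni) * np.maximum(np.abs(beta_fni) - lam_l, 0.0)

# beta_flni = soft_threshold(beta_fni, lam_l)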

Proof. The proof is similar to the derivation of the solution of the fused lasso in Friedman et al. (2007). Nevertheless, for completeness of the paper we provide the proof for \hat{\bm{\beta}}^{FLNI}(\bm{y},\lambda_{F},\lambda_{L},\lambda_{NI}).

The subgradient equations (which are necessary and sufficient conditions for the solution of (13)) for \beta_{\bm{i}}, with \bm{i}\in\mathcal{I}, are

g_{\bm{i}}(\lambda_{L})=-(y_{\bm{i}}-\beta_{\bm{i}})+\lambda_{NI}(\sum_{\bm{j}:(\bm{i},\bm{j})\in E}q_{\bm{i},\bm{j}}-\sum_{\bm{j}:(\bm{j},\bm{i})\in E}q_{\bm{j},\bm{i}})+\lambda_{F}(\sum_{\bm{j}:(\bm{i},\bm{j})\in E}t_{\bm{i},\bm{j}}-\sum_{\bm{j}:(\bm{j},\bm{i})\in E}t_{\bm{j},\bm{i}})+\lambda_{L}s_{\bm{i}}=0, (19)

where

q_{\bm{i},\bm{j}}:\begin{cases}=1,&\text{if }\,\beta_{\bm{i}}-\beta_{\bm{j}}>0,\\ =0,&\text{if }\,\beta_{\bm{i}}-\beta_{\bm{j}}<0,\\ \in[0,1],&\text{if }\,\beta_{\bm{i}}=\beta_{\bm{j}},\end{cases}\quad\quad t_{\bm{i},\bm{j}}:\begin{cases}=1,&\text{if }\,\beta_{\bm{i}}-\beta_{\bm{j}}>0,\\ =-1,&\text{if }\,\beta_{\bm{i}}-\beta_{\bm{j}}<0,\\ \in[-1,1],&\text{if }\,\beta_{\bm{i}}=\beta_{\bm{j}},\end{cases} (20)

s_{\bm{i}}:\begin{cases}=1,&\text{if }\,\beta_{\bm{i}}>0,\\ =-1,&\text{if }\,\beta_{\bm{i}}<0,\\ \in[-1,1],&\text{if }\,\beta_{\bm{i}}=0.\end{cases}

Next, let q_{\bm{i},\bm{j}}(\lambda_{L}), t_{\bm{i},\bm{j}}(\lambda_{L}) and s_{\bm{i}}(\lambda_{L}) denote the values of the parameters defined above at some value of \lambda_{L}; the values of \lambda_{NI} and \lambda_{F} are fixed. Therefore, if \hat{\beta}_{\bm{i}}^{FLNI}(\bm{y},\lambda_{F},0,\lambda_{NI})\neq 0, for s_{\bm{i}}(0) we have

s_{\bm{i}}(0)=\begin{cases}1,&\text{if }\,\hat{\beta}_{\bm{i}}^{FLNI}(\bm{y},\lambda_{F},0,\lambda_{NI})>0,\\ -1,&\text{if }\,\hat{\beta}_{\bm{i}}^{FLNI}(\bm{y},\lambda_{F},0,\lambda_{NI})<0,\end{cases}

and for \hat{\beta}_{\bm{i}}^{FLNI}(\bm{y},\lambda_{F},0,\lambda_{NI})=0 we can set s_{\bm{i}}(0)=0.

Next, let \hat{\bm{\beta}}^{ST}(\lambda_{L}) denote the soft thresholding of \hat{\bm{\beta}}^{FLNI}(\bm{y},\lambda_{F},0,\lambda_{NI}), i.e.

\hat{\beta}_{\bm{i}}^{ST}(\lambda_{L})=\begin{cases}\hat{\beta}^{FLNI}_{\bm{i}}(\bm{y},\lambda_{F},0,\lambda_{NI})-\lambda_{L},&\quad\text{if}\quad\hat{\beta}_{\bm{i}}^{FLNI}(\bm{y},\lambda_{F},0,\lambda_{NI})\geq\lambda_{L},\\ 0,&\quad\text{if}\quad|\hat{\beta}_{\bm{i}}^{FLNI}(\bm{y},\lambda_{F},0,\lambda_{NI})|\leq\lambda_{L},\\ \hat{\beta}_{\bm{i}}^{FLNI}(\bm{y},\lambda_{F},0,\lambda_{NI})+\lambda_{L},&\quad\text{if}\quad\hat{\beta}_{\bm{i}}^{FLNI}(\bm{y},\lambda_{F},0,\lambda_{NI})\leq-\lambda_{L}.\end{cases}

The goal is to prove that \hat{\bm{\beta}}^{ST}(\lambda_{L}) provides the solution to (13).

Note, analogously to the proof for the fused lasso estimator in Lemma A.1 of Friedman et al. (2007), if either \hat{\beta}_{\bm{i}}^{ST}(\lambda_{L})\neq 0 or \hat{\beta}_{\bm{j}}^{ST}(\lambda_{L})\neq 0, and \hat{\beta}_{\bm{i}}^{ST}(\lambda_{L})<\hat{\beta}_{\bm{j}}^{ST}(\lambda_{L}) or \hat{\beta}_{\bm{i}}^{ST}(\lambda_{L})>\hat{\beta}_{\bm{j}}^{ST}(\lambda_{L}), then we also have \hat{\beta}_{\bm{i}}^{ST}(0)<\hat{\beta}_{\bm{j}}^{ST}(0) or \hat{\beta}_{\bm{i}}^{ST}(0)>\hat{\beta}_{\bm{j}}^{ST}(0), respectively. Therefore, soft thresholding of \hat{\bm{\beta}}^{FLNI}(\bm{y},\lambda_{F},0,\lambda_{NI}) does not change the ordering of these pairs and we have q_{\bm{i},\bm{j}}(\lambda_{L})=q_{\bm{i},\bm{j}}(0) and t_{\bm{i},\bm{j}}(\lambda_{L})=t_{\bm{i},\bm{j}}(0). Next, if for some (\bm{i},\bm{j})\in E we have \hat{\beta}_{\bm{i}}^{ST}(\lambda_{L})=\hat{\beta}_{\bm{j}}^{ST}(\lambda_{L})=0, then q_{\bm{i},\bm{j}}\in[0,1] and t_{\bm{i},\bm{j}}\in[-1,1], and, again, we can set t_{\bm{i},\bm{j}}(\lambda_{L})=t_{\bm{i},\bm{j}}(0) and q_{\bm{i},\bm{j}}(\lambda_{L})=q_{\bm{i},\bm{j}}(0).

Now let us insert \hat{\beta}_{\bm{i}}^{ST}(\lambda_{L}) into the subgradient equations (19) and show that we can find s_{\bm{i}}(\lambda_{L})\in[-1,1] for all \bm{i}\in\mathcal{I}.

First, assume that for some \bm{i} we have \hat{\beta}_{\bm{i}}^{FLNI}(\bm{y},\lambda_{F},0,\lambda_{NI})\geq\lambda_{L}. Then

g_{\bm{i}}(\lambda_{L})=-(y_{\bm{i}}-\hat{\beta}_{\bm{i}}^{FLNI}(\bm{y},\lambda_{F},0,\lambda_{NI}))-\lambda_{L}+\lambda_{NI}(\sum_{\bm{j}:(\bm{i},\bm{j})\in E}q_{\bm{i},\bm{j}}(\lambda_{L})-\sum_{\bm{j}:(\bm{j},\bm{i})\in E}q_{\bm{j},\bm{i}}(\lambda_{L}))+\lambda_{F}(\sum_{\bm{j}:(\bm{i},\bm{j})\in E}t_{\bm{i},\bm{j}}(\lambda_{L})-\sum_{\bm{j}:(\bm{j},\bm{i})\in E}t_{\bm{j},\bm{i}}(\lambda_{L}))+\lambda_{L}s_{\bm{i}}(\lambda_{L})=-(y_{\bm{i}}-\hat{\beta}_{\bm{i}}^{FLNI}(\bm{y},\lambda_{F},0,\lambda_{NI}))+\lambda_{NI}(\sum_{\bm{j}:(\bm{i},\bm{j})\in E}q_{\bm{i},\bm{j}}(0)-\sum_{\bm{j}:(\bm{j},\bm{i})\in E}q_{\bm{j},\bm{i}}(0))+\lambda_{F}(\sum_{\bm{j}:(\bm{i},\bm{j})\in E}t_{\bm{i},\bm{j}}(0)-\sum_{\bm{j}:(\bm{j},\bm{i})\in E}t_{\bm{j},\bm{i}}(0))+\lambda_{L}s_{\bm{i}}(\lambda_{L})-\lambda_{L}=0.

Note that

-(y_{\bm{i}}-\hat{\beta}_{\bm{i}}^{FLNI}(\bm{y},\lambda_{F},0,\lambda_{NI}))+\lambda_{NI}(\sum_{\bm{j}:(\bm{i},\bm{j})\in E}q_{\bm{i},\bm{j}}(0)-\sum_{\bm{j}:(\bm{j},\bm{i})\in E}q_{\bm{j},\bm{i}}(0))+\lambda_{F}(\sum_{\bm{j}:(\bm{i},\bm{j})\in E}t_{\bm{i},\bm{j}}(0)-\sum_{\bm{j}:(\bm{j},\bm{i})\in E}t_{\bm{j},\bm{i}}(0))=0,

because \hat{\bm{\beta}}^{FNI}(\bm{y},\lambda_{F},\lambda_{NI})\equiv\hat{\bm{\beta}}^{FLNI}(\bm{y},\lambda_{F},0,\lambda_{NI}). Therefore, if s_{\bm{i}}(\lambda_{L})=\operatorname{sign}\hat{\beta}_{\bm{i}}^{ST}(\lambda_{L})=1, then g_{\bm{i}}(\lambda_{L})=0.

The proof for the case when \hat{\beta}_{\bm{i}}^{FLNI}(\bm{y},\lambda_{F},0,\lambda_{NI})\leq-\lambda_{L} is similar, and one can show that g_{\bm{i}}(\lambda_{L})=0 if s_{\bm{i}}(\lambda_{L})=\operatorname{sign}\hat{\beta}_{\bm{i}}^{ST}(\lambda_{L})=-1.

Second, assume that |\hat{\beta}_{\bm{i}}^{FLNI}(\bm{y},\lambda_{F},0,\lambda_{NI})|<\lambda_{L}. Then \hat{\beta}_{\bm{i}}^{ST}(\lambda_{L})=0, and

g_{\bm{i}}(\lambda_{L})=-y_{\bm{i}}+\lambda_{NI}(\sum_{\bm{j}:(\bm{i},\bm{j})\in E}q_{\bm{i},\bm{j}}(\lambda_{L})-\sum_{\bm{j}:(\bm{j},\bm{i})\in E}q_{\bm{j},\bm{i}}(\lambda_{L}))+\lambda_{F}(\sum_{\bm{j}:(\bm{i},\bm{j})\in E}t_{\bm{i},\bm{j}}(\lambda_{L})-\sum_{\bm{j}:(\bm{j},\bm{i})\in E}t_{\bm{j},\bm{i}}(\lambda_{L}))+\lambda_{L}s_{\bm{i}}(\lambda_{L})=-y_{\bm{i}}+\lambda_{NI}(\sum_{\bm{j}:(\bm{i},\bm{j})\in E}q_{\bm{i},\bm{j}}(0)-\sum_{\bm{j}:(\bm{j},\bm{i})\in E}q_{\bm{j},\bm{i}}(0))+\lambda_{F}(\sum_{\bm{j}:(\bm{i},\bm{j})\in E}t_{\bm{i},\bm{j}}(0)-\sum_{\bm{j}:(\bm{j},\bm{i})\in E}t_{\bm{j},\bm{i}}(0))+\lambda_{L}s_{\bm{i}}(\lambda_{L})=0.

Next, if we let s_{\bm{i}}(\lambda_{L})=\hat{\beta}_{\bm{i}}^{FLNI}(\bm{y},\lambda_{F},0,\lambda_{NI})/\lambda_{L}, then, again, we have

g_{\bm{i}}(\lambda_{L})=-(y_{\bm{i}}-\hat{\beta}_{\bm{i}}^{FLNI}(\bm{y},\lambda_{F},0,\lambda_{NI}))+\lambda_{NI}(\sum_{\bm{j}:(\bm{i},\bm{j})\in E}q_{\bm{i},\bm{j}}(0)-\sum_{\bm{j}:(\bm{j},\bm{i})\in E}q_{\bm{j},\bm{i}}(0))+\lambda_{F}(\sum_{\bm{j}:(\bm{i},\bm{j})\in E}t_{\bm{i},\bm{j}}(0)-\sum_{\bm{j}:(\bm{j},\bm{i})\in E}t_{\bm{j},\bm{i}}(0))=0.

Therefore, we have proved that \hat{\bm{\beta}}^{FLNI}(\bm{y},\lambda_{F},\lambda_{L},\lambda_{NI})=\hat{\bm{\beta}}^{ST}(\lambda_{L}). \Box

3 Properties of the fused lasso nearly-isotonic signal approximator

We start with a proposition which shows how the solutions to the optimization problems (11), (10) and (14) are related to each other. This result will be used in the next section to derive the degrees of freedom of the fused lasso nearly-isotonic signal approximator.

Proposition 1

For a fixed data vector \bm{y} indexed by \mathcal{I} and penalisation parameters \lambda_{NI} and \lambda_{F} the following relations between the estimators \hat{\bm{\beta}}^{F}, \hat{\bm{\beta}}^{NI} and \hat{\bm{\beta}}^{FNI} hold:

\hat{\bm{\beta}}^{NI}(\bm{y},\lambda_{NI})=\hat{\bm{\beta}}^{F}(\bm{y}-\frac{\lambda_{NI}}{2}D^{T}\bm{1},\frac{1}{2}\lambda_{NI}), (21)

\hat{\bm{\beta}}^{FNI}(\bm{y},\lambda_{F},\lambda_{NI})=\hat{\bm{\beta}}^{NI}(\bm{y}+\lambda_{F}D^{T}\bm{1},\lambda_{NI}+2\lambda_{F})=\hat{\bm{\beta}}^{F}(\bm{y}-\frac{\lambda_{NI}}{2}D^{T}\bm{1},\frac{1}{2}\lambda_{NI}+\lambda_{F}), (22)

and

\hat{\bm{\beta}}^{FLNI}(\bm{y},\lambda_{F},\lambda_{L},\lambda_{NI})=\hat{\bm{\beta}}^{FL}(\bm{y}-\frac{\lambda_{NI}}{2}D^{T}\bm{1},\frac{1}{2}\lambda_{NI}+\lambda_{F},\lambda_{L}), (23)

where D is the oriented incidence matrix of the graph G=(V,E) corresponding to the partial order relation \preceq on \mathcal{I}.

Proof. First, from Tibshirani et al. (2011) the solution to the nearly-isotonic problem is given by

\hat{\bm{\beta}}^{NI}(\bm{y},\lambda_{NI})=\bm{y}-D^{T}\hat{\bm{v}}(\bm{y},\lambda_{NI}),

with

\hat{\bm{v}}(\bm{y},\lambda_{NI})=\underset{\bm{v}\in\mathbb{R}^{m}}{\arg\min}\,\frac{1}{2}||\bm{y}-D^{T}\bm{v}||_{2}^{2}\quad\text{subject to}\quad\bm{0}\leq\bm{v}\leq\lambda_{NI}\bm{1},

and from Tibshirani & Taylor (2011) it follows that

\hat{\bm{\beta}}^{F}(\bm{y},\lambda_{F})=\bm{y}-D^{T}\hat{\bm{w}}(\bm{y},\lambda_{F}),

with

\hat{\bm{w}}(\bm{y},\lambda_{F})=\underset{\bm{w}\in\mathbb{R}^{m}}{\arg\min}\,\frac{1}{2}||\bm{y}-D^{T}\bm{w}||_{2}^{2}\quad\text{subject to}\quad-\lambda_{F}\bm{1}\leq\bm{w}\leq\lambda_{F}\bm{1}.

Second, let us introduce a new variable \bm{v}^{*}=\bm{v}-\frac{\lambda_{NI}}{2}\bm{1}. Then

\hat{\bm{\beta}}^{NI}(\bm{y},\lambda_{NI})=\bm{y}-D^{T}\frac{\lambda_{NI}}{2}\bm{1}-D^{T}\hat{\bm{v}}^{*}(\bm{y},\lambda_{NI}),

where

\hat{\bm{v}}^{*}(\bm{y},\lambda_{NI})=\underset{\bm{v}^{*}\in\mathbb{R}^{m}}{\arg\min}\,\frac{1}{2}||\bm{y}-D^{T}\frac{\lambda_{NI}}{2}\bm{1}-D^{T}\bm{v}^{*}||_{2}^{2}\quad\text{subject to}\quad-\frac{\lambda_{NI}}{2}\bm{1}\leq\bm{v}^{*}\leq\frac{\lambda_{NI}}{2}\bm{1}.

Therefore, we have proved that \hat{\bm{\beta}}^{NI}(\bm{y},\lambda_{NI})=\hat{\bm{\beta}}^{F}(\bm{y}-\frac{\lambda_{NI}}{2}D^{T}\bm{1},\frac{1}{2}\lambda_{NI}).

The proof for the fused nearly-isotonic estimator is the same, with the change of variable \bm{u}^{*}=\bm{u}+\lambda_{F}\bm{1} in (15) and (16) for the proof of the first equality in (22), and with \bm{u}^{*}=\bm{u}-\frac{\lambda_{NI}}{2}\bm{1} for the second equality.

Next, we prove the result for the case of the fused lasso nearly-isotonic approximator. From Theorem 2 we have

\hat{\beta}^{FLNI}_{\bm{i}}(\bm{y},\lambda_{F},\lambda_{L},\lambda_{NI})=\begin{cases}\hat{\beta}^{FNI}_{\bm{i}}(\bm{y},\lambda_{F},\lambda_{NI})-\lambda_{L},&\quad\text{if}\quad\hat{\beta}^{FNI}_{\bm{i}}(\bm{y},\lambda_{F},\lambda_{NI})\geq\lambda_{L},\\ 0,&\quad\text{if}\quad|\hat{\beta}^{FNI}_{\bm{i}}(\bm{y},\lambda_{F},\lambda_{NI})|\leq\lambda_{L},\\ \hat{\beta}^{FNI}_{\bm{i}}(\bm{y},\lambda_{F},\lambda_{NI})+\lambda_{L},&\quad\text{if}\quad\hat{\beta}^{FNI}_{\bm{i}}(\bm{y},\lambda_{F},\lambda_{NI})\leq-\lambda_{L},\end{cases}

for \bm{i}\in\mathcal{I}.

Further, using (22) we have

\hat{\beta}^{FLNI}_{\bm{i}}(\bm{y},\lambda_{F},\lambda_{L},\lambda_{NI})=\hat{\beta}^{F}_{\bm{i}}(\bm{y}-\frac{\lambda_{NI}}{2}D^{T}\bm{1},\frac{1}{2}\lambda_{NI}+\lambda_{F})-\lambda_{L},

if \hat{\beta}^{F}_{\bm{i}}(\bm{y}-\frac{\lambda_{NI}}{2}D^{T}\bm{1},\frac{1}{2}\lambda_{NI}+\lambda_{F})\geq\lambda_{L},

\hat{\beta}^{FLNI}_{\bm{i}}(\bm{y},\lambda_{F},\lambda_{L},\lambda_{NI})=0,

if |\hat{\beta}^{F}_{\bm{i}}(\bm{y}-\frac{\lambda_{NI}}{2}D^{T}\bm{1},\frac{1}{2}\lambda_{NI}+\lambda_{F})|\leq\lambda_{L}, and

\hat{\beta}^{FLNI}_{\bm{i}}(\bm{y},\lambda_{F},\lambda_{L},\lambda_{NI})=\hat{\beta}^{F}_{\bm{i}}(\bm{y}-\frac{\lambda_{NI}}{2}D^{T}\bm{1},\frac{1}{2}\lambda_{NI}+\lambda_{F})+\lambda_{L},

if \hat{\beta}^{F}_{\bm{i}}(\bm{y}-\frac{\lambda_{NI}}{2}D^{T}\bm{1},\frac{1}{2}\lambda_{NI}+\lambda_{F})\leq-\lambda_{L}.

Therefore, we obtain

\hat{\bm{\beta}}^{FLNI}(\bm{y},\lambda_{F},\lambda_{L},\lambda_{NI})=\underset{\bm{\beta}\in\mathbb{R}^{n}}{\arg\min}\,\frac{1}{2}||\bm{y}-\frac{\lambda_{NI}}{2}D^{T}\bm{1}-\bm{\beta}||_{2}^{2}+(\frac{1}{2}\lambda_{NI}+\lambda_{F})||D\bm{\beta}||_{1}+\lambda_{L}||\bm{\beta}||_{1}\equiv\hat{\bm{\beta}}^{FL}(\bm{y}-\frac{\lambda_{NI}}{2}D^{T}\bm{1},\frac{1}{2}\lambda_{NI}+\lambda_{F},\lambda_{L}).

\Box
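Proposition 1 is easy to sanity-check numerically; the following sketch verifies the second equality in (22) on the chain graph of Figure 1, reusing the flni_graph helper sketched in Section 1.2 (all helper names and data are assumptions of this illustration, not the paper's code):

import numpy as np

D = np.diff(np.eye(5), axis=0) * -1
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
lam_f, lam_ni = 0.3, 0.8

lhs = flni_graph(y, D, lam_f=lam_f, lam_l=0.0, lam_ni=lam_ni)       # beta^FNI(y, lam_f, lam_ni)
y_shift = y - 0.5 * lam_ni * D.T @ np.ones(D.shape[0])
rhs = flni_graph(y_shift, D, lam_f=lam_f + 0.5 * lam_ni, lam_l=0.0, lam_ni=0.0)  # fusion estimator
print(np.allclose(lhs, rhs, atol=1e-4))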

Next, we prove that, analogously to the fused lasso and nearly-isotonic regression, as one of the penalisation parameters increases the constant regions in the solution \hat{\bm{\beta}}^{FLNI} can only be joined together and not split apart. We prove this result only for the one-dimensional monotonic order, and the general case is an open question. This result could potentially be useful in future research on the solution path of the fused lasso nearly-isotonic approximator.

Proposition 2

Let \mathcal{I}=\{1,\dots,n\} with the natural order defined on it. Next, let \bm{\lambda}=(\lambda_{F},\lambda_{L},\lambda_{NI}) and \bm{\lambda}^{*}=(\lambda_{F}^{*},\lambda_{L}^{*},\lambda_{NI}^{*}) be triples of penalisation parameters such that one of the elements of \bm{\lambda}^{*} is greater than the corresponding element of \bm{\lambda}, while the other two are the same. Next, assume that for some i the solution \hat{\bm{\beta}}^{FLNI}(\bm{y},\bm{\lambda}) satisfies

\hat{\beta}_{i}^{FLNI}(\bm{y},\bm{\lambda})=\hat{\beta}_{i+1}^{FLNI}(\bm{y},\bm{\lambda}).

Then for \bm{\lambda}^{*} we have

\hat{\beta}_{i}^{FLNI}(\bm{y},\bm{\lambda}^{*})=\hat{\beta}_{i+1}^{FLNI}(\bm{y},\bm{\lambda}^{*}).

Proof.

Case 1: \lambda_{NI} and \lambda_{F} are fixed and \lambda_{L}^{*}>\lambda_{L}. The result of the proposition for this case follows directly from Theorem 2.

Case 2: \lambda_{F} and \lambda_{L} are fixed and \lambda_{NI}^{*}>\lambda_{NI}. Let us consider the fused nearly-isotonic regression and write the subgradient equations

g_{i}(\lambda_{NI})=-(y_{i}-\beta_{i})+\lambda_{NI}(q_{i}(\lambda_{NI})-q_{i-1}(\lambda_{NI}))+\lambda_{F}(t_{i}(\lambda_{NI})-t_{i-1}(\lambda_{NI}))=0,

where q_{i} and t_{i}, with i=1,\dots,n, are defined in (20), and, analogously to the proof of Theorem 2, q(\lambda_{NI}) and t(\lambda_{NI}) denote the values of these parameters at some value of \lambda_{NI}.

Assume that for \lambda_{NI} the solution \hat{\bm{\beta}}^{FNI}(\bm{y},\lambda_{F},\lambda_{NI}) has the following constant region

\hat{\beta}_{j-1}^{FNI}(\bm{y},\lambda_{F},\lambda_{NI})\neq\hat{\beta}_{j}^{FNI}(\bm{y},\lambda_{F},\lambda_{NI})=\dots=\hat{\beta}_{j+k}^{FNI}(\bm{y},\lambda_{F},\lambda_{NI})\neq\hat{\beta}_{j+k+1}^{FNI}(\bm{y},\lambda_{F},\lambda_{NI}), (24)

and, in a similar way as in Tibshirani et al. (2011), for \lambda_{NI}^{*} we consider the subset of the subgradient equations

g_{i}(\lambda_{NI}^{*})=-(y_{i}-\beta_{i})+\lambda_{NI}^{*}(q_{i}(\lambda_{NI}^{*})-q_{i-1}(\lambda_{NI}^{*}))+\lambda_{F}(t_{i}(\lambda_{NI}^{*})-t_{i-1}(\lambda_{NI}^{*}))=0, (25)

with i=j,\dots,j+k, and show that there exists a solution for which (24) holds, q_{i}\in[0,1] and t_{i}\in[-1,1].

Note first that as \lambda_{NI} increases, (24) holds until a merge with other groups happens, which means that q_{j-1},q_{j+k}\in\{0,1\} and t_{j-1},t_{j+k}\in\{-1,1\} will not change their values until the merge of this constant region. Also, as it follows from (20), for i\in[j,j+k] the value of t_{i} is in [-1,1]. Therefore, without any violation of the restrictions on t_{i} we can assume that t_{i}(\lambda_{NI}^{*})=t_{i}(\lambda_{NI}) for any i\in[j,j+k-1].

Next, taking pairwise differences between the subgradient equations for \lambda_{NI} we have

\lambda_{NI}A\tilde{\bm{q}}(\lambda_{NI})+\lambda_{F}A\tilde{\bm{t}}(\lambda_{NI})=D\tilde{\bm{y}}+\lambda_{NI}\bm{c}(\lambda_{NI})+\lambda_{F}\bm{d}(\lambda_{NI}),

where D is the incidence matrix displayed in Figure 1,

A=\begin{bmatrix}2&-1&0&\dots&0&0&0\\ -1&2&-1&\dots&0&0&0\\ \vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots\\ 0&0&0&\dots&-1&2&-1\\ 0&0&0&\dots&0&-1&2\end{bmatrix}, (26)

and \tilde{\bm{q}}(\lambda_{NI})=(q_{j}(\lambda_{NI}),\dots,q_{j+k-1}(\lambda_{NI})), \tilde{\bm{t}}(\lambda_{NI})=(t_{j}(\lambda_{NI}),\dots,t_{j+k-1}(\lambda_{NI})), \tilde{\bm{y}}=(y_{j},\dots,y_{j+k}), \bm{c}(\lambda_{NI})=(q_{j-1}(\lambda_{NI}),0,\dots,0,q_{j+k}(\lambda_{NI})), and \bm{d}(\lambda_{NI})=(t_{j-1}(\lambda_{NI}),0,\dots,0,t_{j+k}(\lambda_{NI})).

Since A is invertible we have

\lambda_{NI}\tilde{\bm{q}}(\lambda_{NI})+\lambda_{F}\tilde{\bm{t}}(\lambda_{NI})=A^{-1}D\tilde{\bm{y}}+\lambda_{NI}A^{-1}\bm{c}(\lambda_{NI})+\lambda_{F}A^{-1}\bm{d}(\lambda_{NI}),

and, since \tilde{\bm{q}}(\lambda_{NI}) and \tilde{\bm{t}}(\lambda_{NI}) provide the solution to the subgradient equations (25), then

-\lambda_{F}\leq\lambda_{NI}\tilde{\bm{q}}(\lambda_{NI})+\lambda_{F}\tilde{\bm{t}}(\lambda_{NI})\leq\lambda_{NI}+\lambda_{F}. (27)

Next, as pointed out in Friedman et al. (2007) and Tibshirani et al. (2011),

(A^{-1})_{i,1}=(n-i+1)/(n+1)\quad\text{and}\quad(A^{-1})_{i,n}=i/(n+1),

and one can show that

-\lambda_{F}\bm{1}\preceq\lambda_{NI}A^{-1}\bm{c}(\lambda_{NI})+\lambda_{F}A^{-1}\bm{d}(\lambda_{NI})\preceq\lambda_{NI}\bm{1}+\lambda_{F}\bm{1}. (28)

Further, let us consider the case of \lambda_{NI}^{*}>\lambda_{NI}. Then we have

\lambda_{NI}^{*}\tilde{\bm{q}}(\lambda_{NI}^{*})+\lambda_{F}\tilde{\bm{t}}(\lambda_{NI}^{*})=A^{-1}D\tilde{\bm{y}}+\lambda_{NI}^{*}A^{-1}\bm{c}(\lambda_{NI}^{*})+\lambda_{F}A^{-1}\bm{d}(\lambda_{NI}^{*}).

Recall that above we set \tilde{\bm{t}}(\lambda_{NI}^{*})=\tilde{\bm{t}}(\lambda_{NI}), and q_{j-1},q_{j+k},t_{j-1} and t_{j+k} do not change their values until the merge, which means that \bm{c}(\lambda_{NI}^{*})=\bm{c}(\lambda_{NI}) and \bm{d}(\lambda_{NI}^{*})=\bm{d}(\lambda_{NI}).

Therefore, the subgradient equations for \lambda_{NI}^{*} can be written as

\lambda_{NI}^{*}\tilde{\bm{q}}(\lambda_{NI}^{*})+\lambda_{F}\tilde{\bm{t}}(\lambda_{NI})=A^{-1}D\tilde{\bm{y}}+\lambda_{NI}^{*}A^{-1}\bm{c}(\lambda_{NI})+\lambda_{F}A^{-1}\bm{d}(\lambda_{NI}).

Next, since the term A^{-1}D\tilde{\bm{y}} is not changed, -\lambda_{F}\leq\lambda_{F}\tilde{\bm{t}}(\lambda_{NI})\leq\lambda_{F}, and

-\lambda_{F}\bm{1}\preceq\lambda_{NI}^{*}A^{-1}\bm{c}(\lambda_{NI})+\lambda_{F}A^{-1}\bm{d}(\lambda_{NI})\preceq\lambda_{NI}^{*}\bm{1}+\lambda_{F}\bm{1},

we have

\bm{0}\preceq\tilde{\bm{q}}(\lambda_{NI}^{*})\preceq\bm{1}.

Therefore we have proved that \hat{\beta}_{i}^{FNI}(\bm{y},\bm{\lambda}^{*})=\hat{\beta}_{i+1}^{FNI}(\bm{y},\bm{\lambda}^{*}). Since \hat{\beta}_{i}^{FLNI}(\bm{y},\bm{\lambda}^{*}) is given by soft thresholding of \hat{\beta}_{i}^{FNI}(\bm{y},\bm{\lambda}^{*}), it follows that \hat{\beta}_{i}^{FLNI}(\bm{y},\bm{\lambda}^{*})=\hat{\beta}_{i+1}^{FLNI}(\bm{y},\bm{\lambda}^{*}) for i\in[j,j+k-1].

Case 3: \lambda_{NI} and \lambda_{L} are fixed and \lambda_{F}^{*}>\lambda_{F}. The proof for this case is virtually identical to the proof for Case 2. In this case we assume that q_{i}(\lambda_{F}^{*})=q_{i}(\lambda_{F}) for any i\in[j,j+k-1]. Next, q_{j-1},q_{j+k},t_{j-1} and t_{j+k} do not change their values until the merge, which, again, means that \bm{c}(\lambda_{F}^{*})=\bm{c}(\lambda_{F}) and \bm{d}(\lambda_{F}^{*})=\bm{d}(\lambda_{F}). Finally, we can show that

-\bm{1}\preceq\tilde{\bm{t}}(\lambda_{F}^{*})\preceq\bm{1}.

\Box

4 Degrees of freedom

In this section we discuss the estimation of the degrees of freedom for the fused nearly-isotonic regression and the fused lasso nearly-isotonic signal approximator. Let us consider the following nonparametric model

\bm{Y}=\bm{\mathring{\beta}}+\bm{\varepsilon},

where \bm{\mathring{\beta}}\in\mathbb{R}^{n} is the unknown signal, and the error term \bm{\varepsilon}\sim\mathcal{N}(\bm{0},\sigma^{2}\bm{I}), with \sigma<\infty.

The degrees of freedom is a measure of the complexity of an estimator, and following Efron (1986), for fixed values of \lambda_{F}, \lambda_{L} and \lambda_{NI} the degrees of freedom of \hat{\bm{\beta}}^{FNI} and \hat{\bm{\beta}}^{FLNI} are given by

df(\hat{\bm{\beta}}^{FNI}(\bm{Y},\lambda_{F},\lambda_{NI}))=\frac{1}{\sigma^{2}}\sum_{i=1}^{n}\mathrm{Cov}[\hat{\beta}^{FNI}_{i}(\bm{Y},\lambda_{F},\lambda_{NI}),Y_{i}] (29)

and

df(\hat{\bm{\beta}}^{FLNI}(\bm{Y},\lambda_{F},\lambda_{L},\lambda_{NI}))=\frac{1}{\sigma^{2}}\sum_{i=1}^{n}\mathrm{Cov}[\hat{\beta}^{FLNI}_{i}(\bm{Y},\lambda_{F},\lambda_{L},\lambda_{NI}),Y_{i}]. (30)

The next theorem provides unbiased estimators of the degrees of freedom df(\hat{\bm{\beta}}^{FNI}) and df(\hat{\bm{\beta}}^{FLNI}).

Theorem 3

For fixed values of \lambda_{F}, \lambda_{L} and \lambda_{NI} let

K^{FNI}(\bm{y},\lambda_{F},\lambda_{NI})=\#\{\text{fused groups in }\hat{\bm{\beta}}^{FNI}(\bm{y},\lambda_{F},\lambda_{NI})\},

and

K^{FLNI}(\bm{y},\lambda_{F},\lambda_{L},\lambda_{NI})=\#\{\text{non-zero fused groups in }\hat{\bm{\beta}}^{FLNI}(\bm{y},\lambda_{F},\lambda_{L},\lambda_{NI})\}.

Then we have

\mathbb{E}[K^{FNI}(\bm{Y},\lambda_{F},\lambda_{NI})]=df(\hat{\bm{\beta}}^{FNI}(\bm{Y},\lambda_{F},\lambda_{NI})),

and

\mathbb{E}[K^{FLNI}(\bm{Y},\lambda_{F},\lambda_{L},\lambda_{NI})]=df(\hat{\bm{\beta}}^{FLNI}(\bm{Y},\lambda_{F},\lambda_{L},\lambda_{NI})).

Proof. First, for the fusion estimator \hat{\bm{\beta}}^{F}(\bm{y},\lambda_{F}) let

K^{F}(\bm{y},\lambda_{F})=\#\{\text{fused groups in }\hat{\bm{\beta}}^{F}(\bm{y},\lambda_{F})\}.

Then, as it follows from Tibshirani & Taylor (2011), for \hat{\bm{\beta}}^{F}(\bm{y},\lambda_{F}) we have

\mathbb{E}[K^{F}(\bm{Y},\lambda_{F})]=df(\hat{\bm{\beta}}^{F}(\bm{Y},\lambda_{F})).

Next, from Proposition 1 it follows that

\hat{\bm{\beta}}^{FNI}(\bm{y},\lambda_{F},\lambda_{NI})=\hat{\bm{\beta}}^{F}(\bm{y}-\frac{\lambda_{NI}}{2}D^{T}\bm{1},\frac{1}{2}\lambda_{NI}+\lambda_{F}).

Therefore, using the properties of covariance we have

df(\hat{\bm{\beta}}^{FNI}(\bm{Y},\lambda_{F},\lambda_{NI}))=\frac{1}{\sigma^{2}}\sum_{i=1}^{n}\mathrm{Cov}[\hat{\beta}^{FNI}_{i}(\bm{Y},\lambda_{F},\lambda_{NI}),Y_{i}]=\frac{1}{\sigma^{2}}\sum_{i=1}^{n}\mathrm{Cov}[\hat{\beta}^{F}_{i}(\bm{Y}-\frac{\lambda_{NI}}{2}D^{T}\bm{1},\frac{1}{2}\lambda_{NI}+\lambda_{F}),Y_{i}]=\frac{1}{\sigma^{2}}\sum_{i=1}^{n}\mathrm{Cov}[\hat{\beta}^{F}_{i}(\bm{Y}-\frac{\lambda_{NI}}{2}D^{T}\bm{1},\frac{1}{2}\lambda_{NI}+\lambda_{F}),Y_{i}-\frac{\lambda_{NI}}{2}[D^{T}\bm{1}]_{i}]=\mathbb{E}[K^{F}(\bm{Y}-\frac{\lambda_{NI}}{2}D^{T}\bm{1},\frac{1}{2}\lambda_{NI}+\lambda_{F})]\equiv\mathbb{E}[K^{FNI}(\bm{Y},\lambda_{F},\lambda_{NI})],

where [\bm{a}]_{i} denotes the i-th element of the vector \bm{a}\in\mathbb{R}^{n}.

Next, we prove the result for the fused lasso nearly-isotonic approximator. From Proposition 1 we have

\hat{\bm{\beta}}^{FLNI}(\bm{y},\lambda_{F},\lambda_{L},\lambda_{NI})=\hat{\bm{\beta}}^{FL}(\bm{y}-\frac{\lambda_{NI}}{2}D^{T}\bm{1},\frac{1}{2}\lambda_{NI}+\lambda_{F},\lambda_{L}).

Next, for the fused lasso \hat{\bm{\beta}}^{FL}(\bm{y},\lambda_{F},\lambda_{L}) defined in (12) let

K^{FL}(\bm{y},\lambda_{F},\lambda_{L})=\#\{\text{non-zero fused groups in }\hat{\bm{\beta}}^{FL}(\bm{y},\lambda_{F},\lambda_{L})\},

and from Tibshirani & Taylor (2011) it follows that

\mathbb{E}[K^{FL}(\bm{Y},\lambda_{F},\lambda_{L})]=df(\hat{\bm{\beta}}^{FL}(\bm{Y},\lambda_{F},\lambda_{L})).

Further, again using the properties of covariance, we have

df(\hat{\bm{\beta}}^{FLNI}(\bm{Y},\lambda_{F},\lambda_{L},\lambda_{NI}))=\frac{1}{\sigma^{2}}\sum_{i=1}^{n}\mathrm{Cov}[\hat{\beta}^{FLNI}_{i}(\bm{Y},\lambda_{F},\lambda_{L},\lambda_{NI}),Y_{i}]=\frac{1}{\sigma^{2}}\sum_{i=1}^{n}\mathrm{Cov}[\hat{\beta}^{FL}_{i}(\bm{Y}-\frac{\lambda_{NI}}{2}D^{T}\bm{1},\frac{1}{2}\lambda_{NI}+\lambda_{F},\lambda_{L}),Y_{i}]=\frac{1}{\sigma^{2}}\sum_{i=1}^{n}\mathrm{Cov}[\hat{\beta}^{FL}_{i}(\bm{Y}-\frac{\lambda_{NI}}{2}D^{T}\bm{1},\frac{1}{2}\lambda_{NI}+\lambda_{F},\lambda_{L}),Y_{i}-\frac{\lambda_{NI}}{2}[D^{T}\bm{1}]_{i}]=\mathbb{E}[K^{FL}(\bm{Y}-\frac{\lambda_{NI}}{2}D^{T}\bm{1},\frac{1}{2}\lambda_{NI}+\lambda_{F},\lambda_{L})]\equiv\mathbb{E}[K^{FLNI}(\bm{Y},\lambda_{F},\lambda_{L},\lambda_{NI})].

Lastly, we note that the proof for the unbiased estimator of the degrees of freedom for nearly-isotonic regression, given in Tibshirani et al. (2011), can be obtained in the same way as in the current proof, using the relation (21) and, again, the result of Tibshirani & Taylor (2011) for the fusion estimator \hat{\bm{\beta}}^{F}(\bm{Y},\lambda_{F}). \Box

We can use the estimate of the degrees of freedom for unbiased estimation of the true risk \mathbb{E}[\sum_{i=1}^{n}(\mathring{\beta}_{i}-\hat{\beta}^{FLNI}_{i}(\bm{Y},\lambda_{F},\lambda_{L},\lambda_{NI}))^{2}], which is given by the \hat{C}_{p} statistic

\hat{C}_{p}(\lambda_{F},\lambda_{L},\lambda_{NI})=\sum_{i=1}^{n}(y_{i}-\hat{\beta}^{FLNI}_{i}(\bm{y},\lambda_{F},\lambda_{L},\lambda_{NI}))^{2}-n\sigma^{2}+2\sigma^{2}K^{FLNI}(\bm{y},\lambda_{F},\lambda_{L},\lambda_{NI}).
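A sketch of how K^{FLNI} and \hat{C}_{p} could be computed for the one-dimensional chain (counting maximal runs of equal, non-zero values; for a general graph one would instead count connected components of equal non-zero values; this is an illustration, not the paper's code):

import numpy as np

def n_nonzero_fused_groups(beta_hat, tol=1e-8):
    groups, prev = 0, None
    for b in beta_hat:
        if abs(b) > tol and (prev is None or abs(b - prev) > tol):
            groups += 1                      # a new non-zero fused group starts here
        prev = b
    return groups

def cp_statistic(y, beta_hat, sigma2):
    k = n_nonzero_fused_groups(beta_hat)
    return np.sum((y - beta_hat) ** 2) - len(y) * sigma2 + 2 * sigma2 * k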

5 Acknowledgments

This work was partially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation.



References

  • Beran & Dümbgen (2010) Beran, R. & Dümbgen, L. (2010), ‘Least squares and shrinkage estimation under bimonotonicity constraints’, Statistics and Computing 20(2), 177–189.
  • Best & Chakravarti (1990) Best, M. J. & Chakravarti, N. (1990), ‘Active set algorithms for isotonic regression; a unifying framework’, Mathematical Programming 47(1), 425–439.
  • Efron (1986) Efron, B. (1986), ‘How biased is the apparent error rate of a prediction rule?’, Journal of the American Statistical Association 81(394), 461–470.
  • Friedman et al. (2007) Friedman, J., Hastie, T., Höfling, H. & Tibshirani, R. (2007), ‘Pathwise coordinate optimization’, The Annals of Applied Statistics 1(2), 302–332.
  • Hoerl & Kennard (1970) Hoerl, A. E. & Kennard, R. W. (1970), ‘Ridge regression: Biased estimation for nonorthogonal problems’, Technometrics 12(1), 55–67.
  • Kim et al. (2009) Kim, S.-J., Koh, K., Boyd, S. & Gorinevsky, D. (2009), ‘\ell_{1} trend filtering’, SIAM Review 51(2), 339–360.
  • Minami (2020) Minami, K. (2020), ‘Estimating piecewise monotone signals’, Electronic Journal of Statistics 14(1), 1508–1576.
  • Phillips (1962) Phillips, D. L. (1962), ‘A technique for the numerical solution of certain integral equations of the first kind’, Journal of the ACM (JACM) 9(1), 84–97.
  • Rinaldo (2009) Rinaldo, A. (2009), ‘Properties and refinements of the fused lasso’, The Annals of Statistics 37(5B), 2922–2952.
  • Stout (2013) Stout, Q. F. (2013), ‘Isotonic regression via partitioning’, Algorithmica 66(1), 93–112.
  • Tibshirani (1996) Tibshirani, R. (1996), ‘Regression shrinkage and selection via the lasso’, Journal of the Royal Statistical Society: Series B (Methodological) 58(1), 267–288.
  • Tibshirani et al. (2011) Tibshirani, R. J., Hoefling, H. & Tibshirani, R. (2011), ‘Nearly-isotonic regression’, Technometrics 53(1), 54–61.
  • Tibshirani & Taylor (2011) Tibshirani, R. J. & Taylor, J. (2011), ‘The solution path of the generalized lasso’, The Annals of Statistics 39(3), 1335–1371.
  • Tibshirani et al. (2005) Tibshirani, R., Saunders, M., Rosset, S., Zhu, J. & Knight, K. (2005), ‘Sparsity and smoothness via the fused lasso’, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67(1), 91–108.
  • Tikhonov et al. (1995) Tikhonov, A. N., Goncharsky, A., Stepanov, V. & Yagola, A. G. (1995), Numerical methods for the solution of ill-posed problems, Vol. 328, Springer Science & Business Media.