
DANR: Discrepancy-aware Network Regularization

Hongyuan You, Furkan Kocayusufoglu, Ambuj K. Singh
Department of Computer Science, University of California, Santa Barbara ([email protected], [email protected], [email protected])
Abstract

Network regularization is an effective tool for incorporating structural prior knowledge to learn coherent models over networks, and has yielded provably accurate estimates in applications ranging from spatial economics to neuroimaging studies. Recently, there has been an increasing interest in extending network regularization to the spatio-temporal case to accommodate the evolution of networks. However, in both static and spatio-temporal cases, missing or corrupted edge weights can compromise the ability of network regularization to discover desired solutions. To address these gaps, we propose a novel approach—discrepancy-aware network regularization (DANR)—that is robust to inadequate regularizations and effectively captures model evolution and structural changes over spatio-temporal networks. We develop a distributed and scalable algorithm based on alternating direction method of multipliers (ADMM) to solve the proposed problem with guaranteed convergence to global optimum solutions. Experimental results on both synthetic and real-world networks demonstrate that our approach achieves improved performance on various tasks, and enables interpretation of model changes in evolving networks.

1 INTRODUCTION

Network regularization is a fundamental approach to encode and incorporate general relationships among variables, which are represented as nodes and linked together by weighted edges that describe their local proximity. A significant amount of effort has been devoted to developing successful regularizations [27, 25, 11, 30] that take advantage of prior knowledge on network structures to enhance estimation performance for a variety of application settings, including image denoising [22] and genomic data analysis [18].

We consider the general setting of network regularization, posed as a convex optimization problem defined on an undirected graph \mathcal{G}(\mathcal{V},\mathcal{E}) with node set \mathcal{V} and edge set \mathcal{E}:

(1.1) \min \sum_{i\in\mathcal{V}} f_{i}(\bm{x}_{i}) + \lambda \sum_{(j,k)\in\mathcal{E}} \omega_{jk}\cdot g(\bm{x}_{j},\bm{x}_{k}),

where on each node i, we intend to learn a local model \bm{x}_{i}\in\mathbb{R}^{d} which, in addition to minimizing a predefined convex loss function f_{i}:\mathbb{R}^{d}\to\mathbb{R}, carries certain relations to its neighbors, defined by the function g:\mathbb{R}^{d}\times\mathbb{R}^{d}\to\mathbb{R}.

In this paper, we focus on the particular setting where g(\bm{x}_{i},\bm{x}_{j}) = \|\bm{x}_{i}-\bm{x}_{j}\|_{2}, also known as a sum-of-norms (SON) regularizer. It assumes that the network is composed of multiple clusters, suggesting that all nodes within a cluster share the same consensus model (\bm{x}_{i}=\cdots=\bm{x}_{j}). As the observed data on each node may be sparse, the SON regularizer allows nodes to "borrow" observations from their neighbors to improve their own models, as well as to determine the network cluster to which they belong. The SON objective was first introduced in [16] for convex clustering problems and solved with off-the-shelf sequential convex solvers. Chi & Lange [5] adapt SON clustering to incorporate similarity weights with an arbitrary norm and solve the problem by either the alternating direction method of multipliers (ADMM) or the alternating minimization algorithm (AMA) in a parallel manner. More recently, Hallac et al. [11] point out that weighted SON (named Network Lasso) allows for simultaneous clustering and optimization on graphs, and is highly suitable for a broad class of large-scale network problems.
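To make the setup concrete, the following minimal sketch (our own illustration, not code from the cited works) evaluates the weighted SON objective of Eq. (1.1) on a toy graph, assuming squared local losses f_i(\bm{x}_i)=\|\bm{x}_i-\bm{a}_i\|_2^2; the edge set, weights, and targets are hypothetical.

import numpy as np

def son_objective(x, a, edges, weights, lam):
    # sum_i f_i(x_i) with assumed squared losses, plus the weighted SON term
    loss = np.sum((x - a) ** 2)
    reg = sum(weights[(j, k)] * np.linalg.norm(x[j] - x[k]) for (j, k) in edges)
    return loss + lam * reg

edges = [(0, 1), (1, 2), (2, 3)]                     # hypothetical edge set E
weights = {(0, 1): 1.0, (1, 2): 0.5, (2, 3): 2.0}    # hypothetical omega_jk
a = np.random.randn(4, 3)                            # per-node targets for f_i
x = np.random.randn(4, 3)                            # local models x_i, d = 3
print(son_objective(x, a, edges, weights, lam=0.1))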

In network regularization, a static weight assigned to an edge determines how strictly the difference between models on the corresponding nodes is penalized, relative to the other node pairs in the network. However, apart from a few scenarios in which network information is explicitly given, edge weights are usually unavailable or even infeasible to obtain in most real-world networks (e.g., gene regulatory networks [20]). Moreover, due to possible measurement errors and inaccurate prior knowledge, the assigned edge weights do not necessarily align with the underlying clustering structure of networks, as shown in [1, 13]. As a result, relying heavily on the edge weights to determine how much penalty should be applied to neighboring models may contaminate the discovered solutions.

Similar intuition applies when we extend to the temporal setting. Models on nodes of temporal networks usually change at the level of groups over time. However, some groups exhibit different evolution patterns than others [17]. Moreover, the grouping structures themselves may evolve with time [15]. Since static regularization cannot capture such temporal evolution, a direct solution is to induce a time-varying local consensus by employing the SON objective in both spatial and temporal directions, but this raises an even more drastic problem: how to assign weights between snapshots.

Consequently, a framework that is robust to missing or corrupted edge weights in network-based regularization and adapts to spatio-temporal settings is desired. Note that without assigned network weights, the setting in Eq. (1.1) degrades to applying the same isotropic regularization to all edges. In this work, we propose a generic formulation, called discrepancy-aware network regularization (DANR), which deploys a suitable amount of anisotropic network regularization in both spatial and temporal aspects. DANR infers a model for each node at each timestamp and can learn the evolution of models and transitions of network structures over time. We develop an ADMM-based algorithm that adopts an efficient and distributed iterative scheme to solve problems formulated with DANR, and show that the proposed solution obtains guaranteed convergence to globally optimal solutions. By applying DANR to both synthetic and real-world datasets, we demonstrate its effectiveness on various network problems.

2 Problem Setting

2.1 Discrepancy-aware Network Regularization

Since existing network-based regularizers rely on known edge weights, the corresponding solutions are misled when the edge weights are erroneous or unknown. Directly learning the unknown edge weights with the same regularizer as Eq. (1.1) leads to the optimization problem:

(2.2) \min_{\bm{x},\bm{\omega}} \sum_{i\in\mathcal{V}} f_{i}(\bm{x}_{i}) + \lambda \sum_{(j,k)\in\mathcal{E}} \omega_{jk} \left\|\bm{x}_{j}-\bm{x}_{k}\right\|_{2}

which yields the trivial all-zero solution for \bm{\omega} and is thus unsatisfactory. Instead of imposing more sophisticated model-based or problem-dependent regularization as suggested in [33], we consider an alternative formulation that explicitly accounts for discrepancies between models on adjacent nodes:

(2.3) \min_{\bm{x},\bm{\alpha}} \sum_{i\in\mathcal{V}} f_{i}(\bm{x}_{i}) + \lambda\cdot\mathcal{R}_{S}(\bm{x},\bm{\alpha})
\mathcal{R}_{S}(\bm{x},\bm{\alpha}) = \mu \sum_{(j,k)\in\mathcal{E}} \omega_{jk} \left\|\bm{x}_{j}+\bm{\alpha}_{jk}-\bm{x}_{k}\right\|_{2} + (1-\mu)\|\bm{\alpha}\|_{1,p}

where we denote \|\bm{\alpha}\|_{1,p}=\sum_{(j,k)\in\mathcal{E}}\|\bm{\alpha}_{jk}\|_{p} as the sum of p-norms of all \bm{\alpha}_{jk}'s. We set a penalty parameter \lambda\geq 0 to control the overall strength of network regularization, and a portion parameter 0<\mu<1 to control the emphasis between the two terms in \mathcal{R}_{S}(\bm{x},\bm{\alpha}). We name the formulation in Eq. (2.3) discrepancy-aware network regularization (DANR).

In the first term of \mathcal{R}_{S}, we define a discrepancy-buffering (DB) variable \bm{\alpha}_{jk}\in\mathbb{R}^{d} to denote the preserved discrepancy between local model parameters \bm{x}_{j} and \bm{x}_{k}. More specifically, when an edge weight \omega_{jk} is given but potentially imprecise or otherwise corrupted, \bm{\alpha}_{jk} compensates for abnormally large differences between models \bm{x}_{j} and \bm{x}_{k}, reducing the magnitude of the \omega_{jk}-weighted edge penalty term in \mathcal{R}_{S}. The DB variable provides additional flexibility to its associated models, allowing them to stay adequately close to their own local solutions w.r.t. minimizing local loss functions, and to avoid over-penalized consensus. When the \omega_{jk}'s are not given, the \bm{\alpha}_{jk}'s enable solving Eq. (2.3) under anisotropic regularization even with homogeneous weights.

The second term of \mathcal{R}_{S} is the \ell_{1,p}-regularizer (with 1<p<\infty) [28], which ensures sparsity at the group level. Note that we regard each variable vector \bm{\alpha}_{jk} as a group (|\mathcal{E}| groups in total). The first key intuition here is that the regularization on \bm{\alpha} helps to exclude trivial solutions, namely \bm{\alpha}_{jk}=\bm{x}_{k}-\bm{x}_{j}. Second, it allows us to identify a succinct set of non-zero vectors \bm{\alpha}_{jk}, compensating for possible intrinsic discrepancies between two adjacent nodes.

To sum up, the discrepancy-buffering variables \bm{\alpha}_{jk} are designed to adaptively adjust the network regularization strength on each edge, and thus reduce the negative effects of the unsquared norm regularizer. We show in later sections that this modified formulation remains convex and tractable via parallel optimization algorithms on large-scale networks.
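As a concrete reference point, the following minimal sketch (our own illustration under assumed toy inputs, not the paper's code) evaluates the regularizer \mathcal{R}_{S}(\bm{x},\bm{\alpha}) of Eq. (2.3) with p = 3; the edges, weights, models, and DB variables are hypothetical.

import numpy as np

def danr_regularizer(x, alpha, edges, weights, mu, p=3):
    # mu * sum_jk w_jk ||x_j + alpha_jk - x_k||_2  +  (1 - mu) * sum_jk ||alpha_jk||_p
    edge_term = sum(weights[(j, k)] * np.linalg.norm(x[j] + alpha[(j, k)] - x[k])
                    for (j, k) in edges)
    group_term = sum(np.linalg.norm(alpha[(j, k)], ord=p) for (j, k) in edges)
    return mu * edge_term + (1.0 - mu) * group_term

edges = [(0, 1), (1, 2)]
weights = {(0, 1): 1.0, (1, 2): 0.7}
x = np.random.randn(3, 4)                    # models x_i in R^4
alpha = {e: np.zeros(4) for e in edges}      # discrepancy-buffering variables
print(danr_regularizer(x, alpha, edges, weights, mu=0.8))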

2.2 Distributed ADMM-based Solution

In this section, we propose an ADMM-based algorithm for solving the DANR problem in Eq. (2.3) and present the convergence and complexity of the proposed algorithm. We keep the subscripted parameters \lambda_{1}, \mu_{1} (and penalty \rho_{1}) so that the derivation carries over directly to the spatio-temporal ST-DANR problem in Eq. (3.12), which adds temporal-related updates (see Appendix E).

The ADMM method was originally derived in [8] and has been studied in many contexts, including optimal control and image processing [23]. It blends the decomposability of dual ascent with the superior convergence properties of the method of multipliers [4], and solves optimization problems with two-block separable convex objectives of the following form:

(2.4) \min\ f(\bm{x})+g(\bm{z}) \qquad \text{s.t.}\ A\bm{x}+B\bm{z}=\bm{c}

Observe that our proposed objective in Eq. (2.3) is separable: it can be reorganized into two-block separable convex objectives as in Eq. (2.4), for which ADMM guarantees convergence to globally optimal solutions [7].

More precisely, to fit our problem into the ADMM framework, we first define \bm{x}=\{\bm{x}_{j}\}_{j\in\mathcal{V}} for the model parameter vectors on the given undirected network \mathcal{G}, and define consensus variables \bm{u}=[\{\bm{u}_{jk},\bm{u}_{kj}\}_{(j,k)\in\mathcal{E}}] as copies of \bm{x} in the spatial penalty term \mathcal{R}_{S}(\bm{x},\bm{\alpha}). Then we rewrite Eq. (2.3) as follows:

(2.5) \min_{\bm{x},\bm{u},\bm{\alpha}} \sum_{j\in\mathcal{V}} f_{j}(\bm{x}_{j}) + \lambda_{1}(1-\mu_{1})\|\bm{\alpha}\|_{1,p} + \lambda_{1}\mu_{1} \sum_{(j,k)\in\mathcal{E}} \omega_{jk} \left\|\bm{u}_{jk}+\bm{\alpha}_{jk}-\bm{u}_{kj}\right\|_{2}
\text{s.t.}\quad \left[\bm{u}_{jk},\bm{u}_{kj}\right] = \left[\bm{x}_{j},\bm{x}_{k}\right], \quad (j,k)\in\mathcal{E}

in which the first term corresponds to the f block in Eq. (2.4), and the remaining terms correspond to the g block. The equality constraints in Eq. (2.5) force consensus between the variables \bm{x} and \bm{u}. Next, we derive the augmented Lagrangian of Eq. (2.5):

(2.6) \mathcal{L}_{\rho_{1}}(\bm{x},\bm{u},\bm{\alpha},\bm{\delta}) = \sum_{j\in\mathcal{V}} f_{j}(\bm{x}_{j}) + \lambda_{1}(1-\mu_{1})\|\bm{\alpha}\|_{1,p}
+ \sum_{(j,k)\in\mathcal{E}} \Big( \lambda_{1}\mu_{1}\omega_{jk} \|\bm{u}_{jk}+\bm{\alpha}_{jk}-\bm{u}_{kj}\|_{2}
+ \frac{\rho_{1}}{2}\|\bm{x}_{j}-\bm{u}_{jk}+\bm{\delta}^{u}_{jk}\|_{2}^{2} - \frac{\rho_{1}}{2}\|\bm{\delta}^{u}_{jk}\|_{2}^{2}
+ \frac{\rho_{1}}{2}\|\bm{x}_{k}-\bm{u}_{kj}+\bm{\delta}^{u}_{kj}\|_{2}^{2} - \frac{\rho_{1}}{2}\|\bm{\delta}^{u}_{kj}\|_{2}^{2} \Big)

where \bm{\delta}=[\{\bm{\delta}^{u}_{jk},\bm{\delta}^{u}_{kj}\}_{(j,k)\in\mathcal{E}}] are scaled dual variables for each equality constraint on the elements of \bm{u}. The parameter \rho_{1}>0 penalizes violations of the equality constraints [21]. The iterative scheme of ADMM under the above setting can be written as follows, with l denoting the iteration index:

(2.7) \bm{x}^{(l+1)} = \operatorname{argmin}_{\bm{x}}\ \mathcal{L}_{\rho_{1}}(\bm{x},(\bm{u},\bm{\alpha},\bm{\delta})^{(l)})
(\bm{u},\bm{\alpha})^{(l+1)} = \operatorname{argmin}_{\bm{u},\bm{\alpha}}\ \mathcal{L}_{\rho_{1}}(\bm{x}^{(l+1)},\bm{u},\bm{\alpha},\bm{\delta}^{(l)})
\bm{\delta}^{(l+1)} = \bm{\delta}^{(l)} + (\tilde{\bm{x}}^{(l+1)}-\bm{u}^{(l+1)})

where \tilde{\bm{x}}=[\{\bm{x}_{j},\bm{x}_{k}\}_{(j,k)\in\mathcal{E}}] is composed of replicated elements of \bm{x}, and thus has a one-to-one correspondence with the elements of \bm{u}.

Because the augmented Lagrangian in Eq. (2.6) also has a separable structure, we can further split the optimization above over each element of \bm{x} and \bm{u}. Next, we provide details for each ADMM update step.

\bm{x}-Update. In the first step of our ADMM updating scheme, we can decompose the minimization over \bm{x} into separate minimizations over \bm{x}_{j} for each node j:

\bm{x}_{j}^{(l+1)} = \operatorname{argmin}_{\bm{x}_{j}} \Big( f_{j}(\bm{x}_{j}) + \frac{\rho_{1}}{2} \sum_{k\in N_{j}} \|\bm{x}_{j}-\bm{u}_{jk}^{(l)}+\bm{\delta}_{jk}^{u(l)}\|_{2}^{2} \Big)

As the above equation suggests, \bm{x}_{j} on node j first receives local information from the corresponding consensus variables of all its neighbors k\in N_{j}, then updates its value to minimize the loss function while remaining close to the neighboring consensus variables. Since all remaining regularization terms are quadratic, the \bm{x}-update can be solved efficiently whenever the loss function f_{j} has certain properties, such as strong convexity.
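As an illustration, a minimal sketch of this per-node update for an assumed ridge-regression loss f_{j}(\bm{x})=\|W_{j}\bm{x}-\bm{y}_{j}\|_{2}^{2}+c\|\bm{x}\|_{2}^{2} is given below (our example, not the paper's code); in this case the subproblem has a closed form obtained by setting the gradient to zero.

import numpy as np

def x_update(W_j, y_j, u_neighbors, delta_neighbors, rho1, c=1e-2):
    # Solve argmin_x ||W_j x - y_j||^2 + c||x||^2
    #               + (rho1/2) * sum_k ||x - u_jk + delta_jk||^2  in closed form.
    d = W_j.shape[1]
    n_j = len(u_neighbors)
    A = 2.0 * W_j.T @ W_j + (2.0 * c + rho1 * n_j) * np.eye(d)
    b = 2.0 * W_j.T @ y_j + rho1 * sum(u - dl for u, dl in zip(u_neighbors, delta_neighbors))
    return np.linalg.solve(A, b)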

(\bm{u},\bm{\alpha})-Update. We can further decompose the (\bm{u},\bm{\alpha})-update step in Eq. (2.7) into subproblems on each edge (and its related variables \bm{u}_{jk},\bm{u}_{kj},\bm{\alpha}_{jk}) as follows, which can be solved in parallel:

(2.8) (\bm{u}_{jk}^{(l+1)},\bm{u}_{kj}^{(l+1)},\bm{\alpha}_{jk}^{(l+1)}) = \operatorname{argmin} \Big( \lambda_{1}(1-\mu_{1})\|\bm{\alpha}_{jk}\|_{p} + \lambda_{1}\mu_{1}\omega_{jk}\|\bm{u}_{jk}+\bm{\alpha}_{jk}-\bm{u}_{kj}\|_{2} + \frac{\rho_{1}}{2}\|\bm{x}_{j}^{(l+1)}-\bm{u}_{jk}+\bm{\delta}^{u(l)}_{jk}\|_{2}^{2} + \frac{\rho_{1}}{2}\|\bm{x}_{k}^{(l+1)}-\bm{u}_{kj}+\bm{\delta}^{u(l)}_{kj}\|_{2}^{2} \Big)

Notice that a sum of convex functions defined on different sets of variables remains convex. To minimize the objective over \bm{u}_{jk},\bm{u}_{kj} and \bm{\alpha}_{jk} simultaneously, the convexity of Eq. (2.8) motivates us to adopt an alternating descent algorithm, which iteratively minimizes the objective with respect to (\bm{u}_{jk},\bm{u}_{kj}) and \bm{\alpha}_{jk} while holding the other fixed. In detail, we solve Eq. (2.9) and Eq. (2.10) iteratively until convergence is achieved:

(2.9) (\bm{u}_{jk}^{(l'+1)},\bm{u}_{kj}^{(l'+1)}) = \operatorname{argmin} \Big( \lambda_{1}\mu_{1}\omega_{jk}\|\bm{u}_{jk}+\bm{\alpha}_{jk}^{(l')}-\bm{u}_{kj}\|_{2} + \frac{\rho_{1}}{2}\|\bm{x}_{j}^{(l+1)}-\bm{u}_{jk}+\bm{\delta}^{u(l)}_{jk}\|_{2}^{2} + \frac{\rho_{1}}{2}\|\bm{x}_{k}^{(l+1)}-\bm{u}_{kj}+\bm{\delta}^{u(l)}_{kj}\|_{2}^{2} \Big)
(2.10) \bm{\alpha}_{jk}^{(l'+1)} = \operatorname{argmin} \Big( (1-\mu_{1})\|\bm{\alpha}_{jk}\|_{p} + \mu_{1}\omega_{jk}\|\bm{u}_{jk}^{(l'+1)}+\bm{\alpha}_{jk}-\bm{u}_{kj}^{(l'+1)}\|_{2} \Big)

where l' denotes the iteration index of the alternating descent. Further, we can use an analytic solution to speed up the calculation of \bm{u}_{jk} and \bm{u}_{kj}:

Lemma 2.1

Problem (2.9) has a closed-form solution:

\bm{u}_{jk}^{(l'+1)} = (1-\theta)\bm{a} + \theta\bm{b} - \theta\bm{\alpha}_{jk}^{(l')}
\bm{u}_{kj}^{(l'+1)} = \theta\bm{a} + (1-\theta)\bm{b} + \theta\bm{\alpha}_{jk}^{(l')}

where we denote \bm{a}=\bm{x}_{j}^{(l+1)}+{\bm{\delta}_{jk}^{u}}^{(l)}, \bm{b}=\bm{x}_{k}^{(l+1)}+{\bm{\delta}_{kj}^{u}}^{(l)}, c=\lambda_{1}\mu_{1}\omega_{jk}, and \theta=c/(\rho_{1}\|\bm{a}-\bm{b}+\bm{\alpha}_{jk}^{(l')}\|_{2}) for simplification. Details in Appendix B.
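The per-edge alternating descent can be sketched as follows (our reading of Lemma 2.1 and Eq. (2.10), not the authors' implementation): the u-step uses the closed form above, with \theta capped at 1/2 to cover the non-differentiable case of Appendix B, and the \alpha-step is handed to a generic numerical solver as a stand-in for a dedicated proximal routine. All names and default values are hypothetical.

import numpy as np
from scipy.optimize import minimize

def u_update(x_j, x_k, delta_jk, delta_kj, alpha_jk, lam1, mu1, w_jk, rho1):
    # Closed-form solution of Eq. (2.9) from Lemma 2.1.
    a = x_j + delta_jk
    b = x_k + delta_kj
    c = lam1 * mu1 * w_jk
    theta = min(c / (rho1 * np.linalg.norm(a - b + alpha_jk) + 1e-12), 0.5)
    u_jk = (1 - theta) * a + theta * b - theta * alpha_jk
    u_kj = theta * a + (1 - theta) * b + theta * alpha_jk
    return u_jk, u_kj

def alpha_update(u_jk, u_kj, mu1, w_jk, alpha_init, p=3):
    # Numerical surrogate for Eq. (2.10):
    # (1 - mu1) * ||alpha||_p + mu1 * w_jk * ||u_jk + alpha - u_kj||_2.
    obj = lambda al: ((1 - mu1) * np.linalg.norm(al, ord=p)
                      + mu1 * w_jk * np.linalg.norm(u_jk + al - u_kj))
    return minimize(obj, alpha_init, method="Nelder-Mead").x

def edge_update(x_j, x_k, delta_jk, delta_kj, alpha_jk,
                lam1, mu1, w_jk, rho1, inner_iters=20):
    # Alternate between the two subproblems until (approximate) convergence.
    for _ in range(inner_iters):
        u_jk, u_kj = u_update(x_j, x_k, delta_jk, delta_kj, alpha_jk,
                              lam1, mu1, w_jk, rho1)
        alpha_jk = alpha_update(u_jk, u_kj, mu1, w_jk, alpha_jk)
    return u_jk, u_kj, alpha_jk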

\bm{\delta}-Update. In the last step of our ADMM updating scheme, we have fully independent update rules for each scaled dual variable \bm{\delta}_{jk}^{u}:

(2.11) {\bm{\delta}_{jk}^{u}}^{(l+1)} = {\bm{\delta}_{jk}^{u}}^{(l)} + (\bm{x}_{j}^{(l+1)}-\bm{u}_{jk}^{(l+1)})

Finally, we present the pseudo-code of our DANR approach in Appendix D.

Stopping Criterion and Global Convergence. For our ADMM iterative scheme in Eq. (2.7), we use the norms of the primal residual \bm{r}^{(l)}=\tilde{\bm{x}}^{(l)}-\bm{u}^{(l)} and the dual residual \bm{s}^{(l)}=\rho_{1}(\bm{u}^{(l)}-\bm{u}^{(l+1)}) as the termination measure. The optimality conditions of ADMM [4] show that if both residuals are small then the objective suboptimality must be small, which suggests \|\bm{r}^{(l)}\|_{2}\leq\epsilon_{1}\wedge\|\bm{s}^{(l)}\|_{2}\leq\epsilon_{2} as a reasonable stopping criterion. The convex subproblems in the \bm{x}-update and (\bm{u},\bm{\alpha})-update also require iterative methods; their natural stopping criterion is to keep iteration differences below thresholds, e.g., in the (\bm{u},\bm{\alpha})-update we require \|\Delta\bm{u}^{(l')}\|_{2}\leq\epsilon' and \|\Delta\bm{\alpha}^{(l')}\|_{2}\leq\epsilon'.
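A minimal sketch of this residual-based test is given below (a hypothetical helper, not from the paper); x_tilde stacks the node models replicated per edge endpoint so that it aligns entry-wise with u.

import numpy as np

def should_stop(x_tilde, u, u_prev, rho1, eps1=1e-4, eps2=1e-4):
    r = x_tilde - u              # primal residual r^(l)
    s = rho1 * (u_prev - u)      # dual residual s^(l)
    return np.linalg.norm(r) <= eps1 and np.linalg.norm(s) <= eps2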

Lemma 2.2

Our ADMM approach to solve the DANR problem is guaranteed to converge to the global optimum. Details in Appendix C.

Computational Complexity. Let N_{c} denote the number of iterations ADMM takes to reach an approximate solution \hat{\bm{x}} with accuracy \epsilon_{x}. Based on the convergence analysis in [14], the number of iterations scales as O(1/\epsilon_{x}), which in our problem depends mostly on the properties of the cost functions f_{j}. Assuming that all convex subproblems in the \bm{x}-update and (\bm{u},\bm{\alpha})-update are solved by general first-order gradient descent methods that take N_{x} and N_{\alpha} iterations to converge, respectively, the overall complexity of the algorithm is O\big(N_{c}\big(N_{x}|\mathcal{V}|+N_{\alpha}|\mathcal{E}|+|\mathcal{E}|\big)\big).

3 Extension to Spatio-temporal Setting

The aforementioned shortcomings of pre-defined edge weights are accentuated when we consider temporal networks, since acquiring explicit temporal edge weights is usually not feasible. Thus, discrepancy-buffering variables are also crucial in the temporal setting.

Consider a temporal undirected network \mathcal{G} consisting of M sequential network snapshots \{\mathcal{G}_{1},\mathcal{G}_{2},\ldots,\mathcal{G}_{M}\}, where each snapshot \mathcal{G}_{t} contains a node set \mathcal{V}_{t} and an edge set \mathcal{E}_{t}. We denote by \omega_{j,k}^{t} the weight of the spatial edge (j_{t},k_{t}) between nodes j_{t} and k_{t} within snapshot \mathcal{G}_{t}, and by \omega_{j}^{t,t+1} the weight of the temporal edge (j_{t},j_{t+1}) that links node j_{t} with j_{t+1} across snapshots \mathcal{G}_{t} and \mathcal{G}_{t+1} (shown in Fig. 1). On each node j_{t}, a convex loss function f_{j,t}:\mathbb{R}^{d}\to\mathbb{R} is given to measure the fitness of a local model parameterized by \bm{x}_{j,t}\in\mathbb{R}^{d}. With sparse observations for all snapshots, a straightforward task is to find jointly optimal models for every node and every timestamp under regularization of both the network topology and the temporal evolution. We propose an extension of the DANR formulation to spatio-temporal networks, referred to as ST-DANR:

(3.12) \min_{\bm{x},\bm{\alpha},\bm{\beta}} \sum_{j\in\mathcal{V}} \sum_{t=1}^{M} f_{j,t}(\bm{x}_{j,t}) + \lambda_{1}\cdot\mathcal{R}_{S}(\bm{x},\bm{\alpha}) + \lambda_{2}\cdot\mathcal{R}_{T}(\bm{x},\bm{\beta})

where we define the discrepancy-aware regularizers as:

(3.13) \mathcal{R}_{S}(\bm{x},\bm{\alpha}) = \mu_{1} \sum_{t=1}^{M} \sum_{(j_{t},k_{t})\in\mathcal{E}_{t}} \omega_{jk}^{t} \left\|\bm{x}_{j,t}+\bm{\alpha}_{jk,t}-\bm{x}_{k,t}\right\|_{2} + (1-\mu_{1})\|\bm{\alpha}\|_{1,p} \qquad \text{(spatial penalty term)}
(3.14) \mathcal{R}_{T}(\bm{x},\bm{\beta}) = \mu_{2} \sum_{j\in\mathcal{V}} \sum_{t=1}^{M-1} \omega_{j}^{t,t+1} \left\|\bm{x}_{j,t}+\bm{\beta}_{j,t}-\bm{x}_{j,t+1}\right\|_{2} + (1-\mu_{2})\|\bm{\beta}\|_{1,p} \qquad \text{(temporal penalty term)}

Since we are interested in the heterogeneous evolution of nodal models across time, the unsquared norms in the spatial and temporal regularization terms \mathcal{R}_{S}(\bm{x},\bm{\alpha}) and \mathcal{R}_{T}(\bm{x},\bm{\beta}) enforce piecewise consensus in both the spatial and temporal aspects, which indicates abrupt changes of regional models or sudden transitions in network structures at particular timestamps, and implies valid persistent models in the remaining segments of time [12]. Besides the spatial DB variables \bm{\alpha}_{jk,t} in \mathcal{R}_{S}(\bm{x},\bm{\alpha}), we define another set of temporal DB variables \bm{\beta}_{j,t}\in\mathbb{R}^{d} in \mathcal{R}_{T}(\bm{x},\bm{\beta}) to compensate for inadequate regularization caused by temporal fluctuations. For example, when modeling housing prices in a city, \bm{\beta}_{j,t} would tolerate anomalous short-term fluctuations in prices, while large values of \bm{\alpha}_{jk,t} might be found near the boundaries of disparate neighborhoods.

Figure 1: Overview of our problem setting in temporal networks.

Adaptation to Streaming Data. In many applications, network snapshots arrive in a streaming fashion. Instead of recomputing all models on past snapshots whenever a new snapshot is observed, it is more efficient to incorporate existing models when fitting the new incoming snapshot. Assuming that we have learned models \{\hat{\bm{x}}_{j,\tau}\}_{j\in\mathcal{V}} for the \tau-th snapshot, we then read observations of the next m snapshots indexed by \tau+1,\cdots,\tau+m and attempt to learn models for them. To this end, we deploy the ST-DANR formulation in Eq. (3.12) on the limited time interval t=\tau,\cdots,\tau+m, and integrate the established models at t=\tau by fixing \{\bm{x}_{j,\tau}\}_{j\in\mathcal{V}}=\{\hat{\bm{x}}_{j,\tau}\}_{j\in\mathcal{V}} and solving for the remaining variables \{\bm{x}_{j,t}\}_{j\in\mathcal{V}}^{t=\tau+1,\cdots,\tau+m} in the corresponding problem. Note that in our setting, temporal messages are only transferable through consecutive snapshots, so fixing the \tau-th snapshot is equivalent to fixing all past snapshots.
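The control flow of this streaming variant can be sketched as follows (our own outline; solve_st_danr is a hypothetical stand-in for the batch ST-DANR solver of Eq. (3.12) restricted to a time window, and the one-snapshot window m = 1 is an assumption for brevity).

def streaming_st_danr(snapshots, solve_st_danr):
    # Solve the first snapshot with spatial regularization only.
    models = solve_st_danr(snapshots[0:1], fixed={}, t_offset=0)
    for tau in range(1, len(snapshots)):
        # Fix the models of snapshot tau - 1 and solve snapshot tau; temporal
        # messages only flow between consecutive snapshots, so fixing tau - 1
        # is equivalent to fixing the whole past.
        fixed = {(j, t): x for (j, t), x in models.items() if t == tau - 1}
        models.update(solve_st_danr(snapshots[tau - 1:tau + 1],
                                    fixed=fixed, t_offset=tau - 1))
    return models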

Lastly, we point out that the distributed ADMM-based solution for DANR (§2.2) can be adapted to ST-DANR, for either the batch formulation (Eq. (3.12)) or the streaming formulation; details are given in Appendix E.

Temporal Evolutionary Patterns. In this work, we use the unsquared edge penalty \|\bm{x}_{j,t}+\bm{\beta}_{j,t}-\bm{x}_{j,t+1}\|_{2} in \mathcal{R}_{T} since it encourages piecewise-constant temporal evolution, indicating sharp transitions or change points at particular timestamps. As in the spatial case, alternative penalties can be used to enforce different types of evolutionary patterns [12]. For example, replacing the unsquared norm with the squared form \|\bm{x}_{j,t}+\bm{\beta}_{j,t}-\bm{x}_{j,t+1}\|_{2}^{2} yields smoothly varying models along consecutive snapshots, which has been well studied in [29, 6].

4 EXPERIMENTS

In this section, we present an experimental analysis of the proposed method (DANR) and evaluate its performance on classification and regression tasks. First, we employ a synthetic dataset (§4.1) and two housing price datasets (BAY and SAC in §4.2) to demonstrate the set of scenarios and applications that are a particularly good fit for DANR. In addition to improving estimation, DANR learns fine-grained neighborhood boundaries and reveals interesting insights into the network's heterogeneous structure. Lastly, to validate the efficacy of ST-DANR in the temporal setting, we conduct experiments with a geospatial and temporal database of Northeastern US lakes (§4.3), for which we estimate the water quality of lakes over 10 years.

Baseline Algorithms. In the spatial network scenarios (§4.1 & §4.2), we compare DANR against three baselines: network lasso (NL) [11], robust multi-task feature learning (rMTFL) [9], and factorized multi-task learning (FORMULA) [32]. In the spatio-temporal network scenarios (§4.3), we compare ST-DANR with two widely applied temporal regularizers from the literature, detailed in §4.3.

Parameter Settings. For both NL and DANR, we tune \lambda from 10^{-3} to 10^{2} with \lambda^{n+1}=1.3\lambda^{n}. For DANR, for each value of \lambda we further tune \mu from 0.3 to 1 with \mu^{n+1}=\mu^{n}+0.02. Lastly, we set p=3. We follow the same strategy for both the spatial (DANR) and spatio-temporal (ST-DANR) variants of the proposed method. All penalty parameters in rMTFL and FORMULA are tuned in the same way as \lambda in DANR. In addition, we vary the number of factorized base models in FORMULA from 1 to 50 to achieve its best performance. For each dataset, we standardize all features and response variables to zero mean and unit variance.
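For reference, the hyperparameter grids described above can be generated as follows (a small sketch of the sweep; only the stated ranges and step rules are taken from the text).

lambdas, lam = [], 1e-3
while lam <= 1e2:
    lambdas.append(lam)      # lambda^{n+1} = 1.3 * lambda^{n}, from 1e-3 up to 1e2
    lam *= 1.3
mus = [round(0.3 + 0.02 * n, 2) for n in range(35)]   # 0.30, 0.32, ..., 0.98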

4.1 Node Classification with Limited Data

Following previous work [11], we first experiment with a synthetic network in which each node attempts to solve its own support vector machine (SVM) classifier, but only has a limited amount of data to learn from. We further split the nodes into clusters where each cluster has an underlying model, i.e., nodes in the same cluster share the same underlying model. Our aim is to learn accurate models for each node by leveraging their network connections and limited observations. Note that the neighbors with different underlying models provide misleading information to each other, which potentially harms the overall performance. Therefore, we later investigate the robustness of the DANR with a varying ratio of such malicious edges.

Synthetic Network Generation. We create a network of 100 nodes, split into 5 ground-truth communities C^{1}, C^{2}, \cdots, C^{5}, each consisting of 20 nodes. We denote the community of a node i by C_{i}. Next, we form the network connections by adding edges between nodes based on the following criterion: the probability of adding an edge between any node pair (i,j) is 0.5 if C_{i}=C_{j}, i.e., the edge connects nodes from the same community (intra-community edge); otherwise, the probability is 0.02 if C_{i}\neq C_{j}, i.e., the edge connects nodes from different communities (inter-community edge). The resulting synthetic network G_{0}(\mathcal{V},\mathcal{E}) has 100 nodes and 574 edges, 82% of them intra-community edges. Each ground-truth community C^{k} has an underlying classifier model \bm{X}^{k}\in\mathbb{R}^{10}, drawn independently from a normal distribution with zero mean and unit variance, where k\in[1,\cdots,5]. Next, we generate 5 random training example pairs (\bm{w},y) per node, where \bm{w}\in\mathbb{R}^{10} denotes an observation and y\in\{-1,1\} denotes the ground-truth value to estimate. The ground-truth value y for each observation \bm{w} is then computed using the underlying classifier of the community that the node belongs to:

(4.15) y = \operatorname{sign}(\bm{w}^{T}\bm{X}^{k}+\eta)

where \bm{X}^{k}\in\mathbb{R}^{10} represents the corresponding weights of the classifier w.r.t. the observations, and \eta is additive noise. All observations and noise terms are drawn independently from a normal distribution for each data point. Note that 5 observations are insufficient to accurately estimate a model \bm{x}_{i}\in\mathbb{R}^{10}. Hence, in this setting the network connections play an important role, as nodes can "borrow" training examples from their neighbors to improve their own models.
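A minimal sketch of this generator is given below (our assumed reimplementation of the description above, not the authors' script; the random seed is arbitrary).

import numpy as np

rng = np.random.default_rng(0)
n_nodes, n_comm, d, n_train = 100, 5, 10, 5
community = np.repeat(np.arange(n_comm), n_nodes // n_comm)        # C_i for each node

# Edges: p = 0.5 within a community, p = 0.02 across communities.
edges = [(i, j) for i in range(n_nodes) for j in range(i + 1, n_nodes)
         if rng.random() < (0.5 if community[i] == community[j] else 0.02)]

# One ground-truth classifier per community, then 5 noisy examples per node.
X_true = rng.standard_normal((n_comm, d))
W = rng.standard_normal((n_nodes, n_train, d))                     # observations w
noise = rng.standard_normal((n_nodes, n_train))
y = np.sign(np.einsum("ntd,nd->nt", W, X_true[community]) + noise) # Eq. (4.15)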

Optimization Problem. The objective function f_{i} on each node is formulated as the soft-margin SVM objective, where node i estimates its model (i.e., separating hyperplane) \bm{x}_{i}\in\mathbb{R}^{10} using its five training examples (\bm{w},y) as follows:

\min \Big( \frac{1}{2}\bm{x}_{i}^{T}\bm{x}_{i} + C\sum_{l=1}^{5}\xi_{l} \Big) \quad \text{s.t. } y^{(l)}(\bm{x}_{i}^{T}\bm{w}^{(l)}) \geq 1-\xi_{l},~\xi_{l}\geq 0

where the \xi_{l} are slack variables that account for non-separable data. The constant C controls the trade-off between minimizing the training error and maximizing the margin. We set C to 0.75, which we empirically find to perform well.
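For reference, the per-node subproblem can be written in its equivalent unconstrained hinge-loss form and handed to a generic solver, as in the minimal sketch below (our own reformulation, not the paper's solver).

import numpy as np
from scipy.optimize import minimize

def local_svm(W_i, y_i, C=0.75):
    # Equivalent hinge-loss form: 0.5 * ||x||^2 + C * sum_l max(0, 1 - y_l * x^T w_l).
    d = W_i.shape[1]
    hinge = lambda x: 0.5 * x @ x + C * np.sum(np.maximum(0.0, 1.0 - y_i * (W_i @ x)))
    return minimize(hinge, np.zeros(d), method="Powell").x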

Test Results. In the following subsections, we first thoroughly inspect DANR and the closely related NL method on the synthetic network G_{0}, in order to show the benefits of introducing the discrepancy-buffering variables \bm{\alpha}. Later, in §4.1.2, we evaluate the performance and robustness of DANR and all three baseline methods in noisier scenarios. To assess performance, prediction accuracies are computed over a separate set of 1000 test pairs (10 per node). The test pairs are again randomly generated using Eq. (4.15).

4.1.1 Effect of Discrepancy-Buffering Variables.

We compare DANR with NL and report the prediction accuracy for varying \lambda values in Figure 2. Recall that the proposed method introduces the \mu parameter, which is coupled with the \lambda parameter. Therefore, in order to have an adequate comparison between DANR and NL, we modify our \lambda parameter to \tilde{\lambda}=\lambda/\mu. Doing so ensures that both methods apply the same amount of penalty to their network regularization terms (see Eqs. (2.2) and (2.3)). For each value of \lambda, we vary \mu from 0.3 to 1 and report the highest accuracy achieved.

Figure 2: Prediction accuracy comparison with varying \lambda values.

Figure 3: Effect of \mu on the prediction accuracy, with a fixed \lambda.

We first briefly mention the desired behaviors of the NL method and later accentuate its drawbacks. As shown in Figure 2, for small \lambda values, the problem reduces to solving the local optimization problem, where each node estimates its model solely from its limited observations. This achieves 61.3% accuracy on the test set. As \lambda increases, the NL penalty enforces nodes to form clusters, where nodes in the same cluster share the same model. This behavior firmly improves the test performance, with prediction accuracy rising to 75.3%. Right after the peak performance is reached, however, a slight increase in \lambda causes a sudden drop in performance, and the problem reduces to solving the global optimization problem over the entire network. Therefore, when \lambda>\lambda_{critical}, the problem converges to a common model for all nodes in the network, i.e., \bm{x}_{i}=\bm{x}_{j} for all i,j\in\mathcal{V}. This further drops the prediction accuracy to 56.4% on the test set.

The observed rapid drop in performance implies that the NL method is highly sensitive to the \lambda parameter. There is only a narrow window of \lambda values that result in a proper clustering of the network. As a result, one needs to excessively tune the \lambda parameter around \lambda_{critical} to find its optimal value. In addition, such clustering performance is also highly affected by the volume of "malicious" (inter-cluster) edges in the network, as shown in §4.1.2.

From Figure 2, we observe that the DANR approach exhibits improved performance thanks to its discrepancy-buffering variables \bm{\alpha}, which allow our regularization term to better exploit the good edges while providing additional tolerance towards bad edges. As a result, DANR achieves 79.1% accuracy on the test set, outperforming the best baseline method by 3.8%. Another important observation is that DANR provides a much wider window for selecting a near-optimal \lambda than the baseline approach. We now analyze how the \mu parameter couples with the \lambda parameter, and further present its effect on DANR's performance. Figure 3 displays the prediction accuracy vs. \mu, with \lambda fixed to 10^{0} and 10^{1}. Intuitively, as \lambda increases, the optimal value of \mu also increases. The reason is that as \lambda increases, more non-zero discrepancy-buffering variables (weighted by 1-\mu) are needed to preserve the accurate clustering formation, since high values of \lambda force global consensus over the network.

Figure 4: Prediction accuracy comparison with varying noise.

4.1.2 Robustness to Malicious Edges.

Note that the ratio of inter-community edges in the earlier synthetic network G_{0} is 18%. Here we use the term noise for the ratio of such malicious edges in the network. Since the amount of noise is critical for learning accurate models on a network, we now investigate the efficacy of DANR under varying noise. First, we remove all inter-community edges, and then iteratively add malicious edges back to the network. We solve the same problem with the noise varying from 0 to 0.6 and report the prediction accuracies of all methods in Figure 4. As shown, the performance gap between DANR, NL and FORMULA is negligible in the absence of noise. Meanwhile, the pure multi-task method rMTFL maintains the same low accuracy, since it does not utilize the network information. However, as we increase the noise, DANR starts to outperform NL and FORMULA, and the gap between the models' performances becomes larger at higher ratios of malicious edges. This confirms that, compared to the baseline methods, DANR exhibits more robust performance in noisy settings thanks to its discrepancy-buffering variables.

4.2 Spatial: Housing Rental Price Estimation

In this section, our goal is to jointly (i) estimate the rental prices of houses by leveraging their geographic location and features, and (ii) discover boundaries between neighborhoods. The intuition is that houses in the same neighborhood often tend to follow similar pricing models. However, such neighborhoods can have complex underlying structures (as described later in this section), which imposes additional challenges in learning accurate models.

Dataset. We experiment with two largely populated areas in Northern California: the Greater Sacramento Area (SAC) and the Bay Area (BAY). The anonymized data is provided by the property management software company AppFolio Inc., from which we further sample houses with at least one signed rental agreement during the year 2017. The resulting dataset covers 3849 houses in SAC and 1498 houses in BAY. For houses with more than one rental agreement signed during 2017, we average the rental prices listed in all of their agreements. Each house holds information about its location (latitude/longitude), number of bedrooms, number of bathrooms, square footage, and rental price. We regard these areas (SAC and BAY) as two separate datasets for the remainder of this section. We also randomly hold out 20% of the houses in each dataset for testing.

Network Construction. After excluding test houses from both datasets, we construct two networks (one per dataset) based on the houses' locations. An undirected edge exists between nodes i and j if at least one of them is among the 10 nearest neighbors of the other. (Note that node j being one of the 10 nearest neighbors of i does not necessarily imply that node i is also among the nearest neighbors of j.) In the resulting networks, the average degree of a node is 12.16 for SAC and 12.04 for BAY. Additionally, we construct two versions of these networks, weighted and unweighted. While the weighted network has edge weights inversely proportional to the distances between houses, the unweighted network ignores the proximity between houses, and thus all edge weights are 1.
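A minimal sketch of this construction is shown below (our assumption of the procedure, not the released pipeline; Euclidean distance on the raw coordinates is used for simplicity).

import numpy as np
from sklearn.neighbors import NearestNeighbors

def build_knn_graph(coords, k=10, weighted=True):
    # coords: (n, 2) array of latitude/longitude for the training houses.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(coords)   # +1 because the query point itself is returned
    dist, idx = nn.kneighbors(coords)
    edges = {}
    for i in range(len(coords)):
        for d_ij, j in zip(dist[i, 1:], idx[i, 1:]):        # skip self at position 0
            e = (min(i, j), max(i, j))
            edges[e] = 1.0 / (d_ij + 1e-9) if weighted else 1.0
    return edges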

Optimization Problem. The model at each house estimates its rental price by solving a linear regression problem. More specifically, at each node i we learn a 4-dimensional model \bm{x}_{i}=[a_{i},b_{i},c_{i},d_{i}], which collects the coefficients of each feature and is used to estimate the rental price p_{i} as follows:

\overline{p_{i}} = a_{i}\cdot(\#\text{bedrooms}) + b_{i}\cdot(\#\text{bathrooms}) + c_{i}\cdot(\text{square footage}) + d_{i},

where d_{i} is the bias term. The training objective is

\min_{\bm{x},\bm{\alpha}} \sum_{i\in\mathcal{V}} \|\overline{p_{i}}-p_{i}\|_{2}^{2} + c\cdot\|\bm{x}_{i}\|_{2}^{2} + \lambda\cdot\mathcal{R}_{S}(\bm{x},\bm{\alpha})

where \mathcal{R}_{S} is the proposed network regularization term (see Eq. (2.3)) and c is the regularization weight that prevents over-fitting.

Figure 5: Examples of complex neighborhood structures captured by the DANR. Left: the house shown in yellow resides at the border of three different area codes. Right: the creek-side house (colored in blue) differs from all of its neighbors, possibly due to its appealing location.

Figure 6: Evolution of clustering captured by ST-DANR over the years 2000 (a), 2002 (b), and 2004 (c).

Test Results. Once training converges, we predict the rental prices on the test set. To do so, we connect each house in the test set to its 10 nearest neighbors in the training set. We then infer the new model \bm{x}_{j} by averaging the models of its neighbors: \bm{x}_{j}=(1/|N(j)|)\sum_{k\in N(j)}\bm{x}_{k}. The model \bm{x}_{j} is further used to estimate the rental price of the corresponding house. Alternatively, one can infer \bm{x}_{j} by solving \min_{\bm{x}_{j}}\sum_{k\in N(j)}\|\bm{x}_{j}-\bm{x}_{k}^{*}\|_{2} while keeping the models on the neighbors fixed. However, we empirically find that simply averaging the neighbors' models performs better for both methods in this particular setting. We compute the mean squared error (MSE) over test nodes to evaluate performance.
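The test-time inference described above amounts to the following short sketch (our illustration only; the feature ordering follows the price model defined earlier).

import numpy as np

def predict_test_house(n_bedrooms, n_bathrooms, sqft, neighbor_models):
    # neighbor_models: list of x_k = [a_k, b_k, c_k, d_k] from the 10 nearest training houses.
    a, b, c, d = np.mean(neighbor_models, axis=0)
    return a * n_bedrooms + b * n_bathrooms + c * sqft + d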

Method BAY SAC
Local estimation (\lambda=0) 0.5984 0.6250
Global estimation (\lambda>\lambda_{critical}) 0.4951 0.5403
rMTFL 0.4774 0.4115
FORMULA (unweighted) 0.4446 0.3503
FORMULA (weighted) 0.4181 0.3379
Network Lasso (unweighted) 0.4392 0.3273
Network Lasso (weighted) 0.4173 0.3022
DANR (unweighted) 0.4106 0.2978
Table 1: MSE for housing rental price prediction on test set.

Table 1 displays the test results of eight different settings on both datasets. As shown, the local and global estimations and the rMTFL method produce high errors on both datasets. We further apply the FORMULA and Network Lasso (NL) methods to both the weighted and unweighted versions of the networks, while the proposed method is only applied to the unweighted version. Note that the network weights are irrelevant for the local and global estimation settings and for the rMTFL method. Intuitively, both FORMULA and NL perform better in the weighted setting than in the unweighted setting on both datasets. The rental price estimation errors achieved by NL are 0.4173 (weighted) vs. 0.4392 (unweighted) for BAY and 0.3022 (weighted) vs. 0.3273 (unweighted) for SAC. These results suggest that the pre-defined weights on these networks help to learn more accurate models on houses.

However, we further argue that although such pre-defined weights improve the overall clustering performance, they do not account for more complex clustering scenarios, e.g., two nearby houses falling into different school districts, or some houses having higher values than their neighbors due to geography, e.g., having a view of the city. Indeed, DANR outperforms all the other baselines and achieves the smallest errors on both datasets: 0.4106 for BAY and 0.2978 for SAC. Notably, DANR (unweighted) even outperforms the weighted versions of the baseline approaches by a notable margin. This reveals that DANR indeed accounts for such heterogeneities in the data and provides an enhanced clustering of the network.

Figure 5 shows examples of two complex scenarios captured by DANR in the SAC network. We use a color code to represent the clusters in the network, where houses (i, j) of the same color are in consensus on their models (\bm{x}_{i}=\bm{x}_{j}). In the left subfigure, the house shown in yellow uses a different model than all of its neighbors. Interestingly, it resides at the border of three different area codes: the area code for this house is 95864, while the area code on its west is 95825 and on its south-east is 95826. Additionally, we observe similar heterogeneous behaviors in some houses that are near creeks, lakes, and rivers. As an example, Figure 5 (right) displays a creek-side house (colored in blue), which again uses a different underlying model from its neighbors.

4.3 Spatio-Temporal: Water Quality Estimation of Northeastern US Lakes

We now evaluate our method in the spatio-temporal setting, where the aim is to dynamically estimate the water quality of Northeastern US lakes over the years. We follow the same procedure as in §4.2, but add a temporal regularization term to our objective that allows nodes to obtain signals from past snapshots of the network, along with their neighbors in the current snapshot.

Dataset and Network. We experiment with the geospatial and temporal dataset of lakes in 17 states in the Northeastern United States, called LAGOS [26]. The dataset holds extensive information about the physical characteristics, various nutrient concentration measurements (water quality), ecological context (surrounding land use/cover, climate, etc.), and location of lakes, from which we select the following available set of features: {max depth, surface area, elevation, annual mean temperature, % of agricultural land, % of urban land, % of forest land, % of wetland}. The water quality metric to estimate is the summer mean chlorophyll concentration of the lakes. We represent the feature vector by \bm{w}\in\mathbb{R}^{8} and the water quality score by y\in\mathbb{R}.

During our experiments, we consider the 10-year period between 2000 and 2010. Due to sparsity in the data, we allow a 2-year gap between consecutive snapshots of the network. This results in a total of 1039 lakes with all the aforementioned features and water quality measurements available for each of the selected years. After randomly holding out 20% of the lakes for testing, we build our network using the latitude/longitude information of the lakes, where each lake (node) is connected to its 10 nearest lakes.

Optimization Problem. After constructing the network, we aim to dynamically estimate the water quality y_{i,t} of lakes from their feature vectors \bm{w}_{i,t}. More formally, for each year t\in[2000,2002,\cdots,2010], we learn a model \bm{x}_{i,t}\in\mathbb{R}^{8} per node by solving the following spatio-temporal problem in a streaming fashion:

(4.16) \min_{\bm{x},\bm{\alpha},\bm{\beta}} \sum_{i\in\mathcal{V}} f(\bm{x}_{i,t},\bm{w}_{i,t},y_{i,t}) + \lambda_{1}\cdot\mathcal{R}_{S}^{t}(\bm{x},\bm{\alpha}) + \lambda_{2}\cdot\mathcal{R}_{T}^{t}(\bm{x},\bm{\beta})

where f is the linear regression objective and \mathcal{R}_{S} is the DANR term applied to the spatial edges. \mathcal{R}_{T} is the temporal regularization term, for which we consider the formulations described next.

Test Results. For each year, we first solve the spatial problem (Eq. (4.16) without the temporal term \mathcal{R}_{T}) and report the mean squared errors in Table 2. Note that the test nodes are again inferred by averaging the neighbors' models in the current snapshot. Then, in order to assess the improvements gained by the temporal discrepancy-buffering variables \bm{\beta}, we solve the above spatio-temporal problem with three different temporal regularizers:

DANR+T-SON: DANR with temporal sum-of-norms regularizer, where \mathcal{R}_{T}^{t}(\bm{x},\cdot)=\sum_{i\in\mathcal{V}}\|\bm{x}_{i,t}-\hat{\bm{x}}_{i,t-1}\|_{2}

DANR+T-SOS: DANR with temporal sum-of-squares regularizer, where \mathcal{R}_{T}^{t}(\bm{x},\cdot)=\sum_{i\in\mathcal{V}}\|\bm{x}_{i,t}-\hat{\bm{x}}_{i,t-1}\|_{2}^{2}

ST-DANR: Spatio-temporal discrepancy-aware network regularizer, where \mathcal{R}_{T}^{t}(\bm{x},\bm{\beta})=\mu_{2}\sum_{i\in\mathcal{V}}\|\bm{x}_{i,t}-\hat{\bm{x}}_{i,t-1}+\bm{\beta}_{i,t}\|_{2}+(1-\mu_{2})\|\bm{\beta}\|_{1,p}

DANR+T-SON and DANR+T-SOS apply two widely adopted temporal regularizers to the temporal edges (the sum-of-norms regularizer [19] and the sum-of-squares regularizer [6, 29], respectively), while ST-DANR includes discrepancy-buffering variables on both spatial and temporal edges. Recall that we solve Eq. (4.16) in a streaming fashion for simplicity, i.e., each snapshot of the network solves the problem while holding the models learned on the previous snapshot (if available) fixed. The performance could potentially be improved further by allowing updates on previous snapshots.
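For clarity, the three temporal penalties compared above can be written as plain functions of the current models, the fixed previous models, and (for ST-DANR) the temporal DB variables, as in the minimal sketch below (our illustration only).

import numpy as np

def t_son(x_t, x_prev):                      # sum-of-norms penalty
    return sum(np.linalg.norm(x_t[i] - x_prev[i]) for i in range(len(x_t)))

def t_sos(x_t, x_prev):                      # sum-of-squares penalty
    return sum(np.linalg.norm(x_t[i] - x_prev[i]) ** 2 for i in range(len(x_t)))

def t_st_danr(x_t, x_prev, beta, mu2, p=3):  # discrepancy-aware temporal penalty
    fit = sum(np.linalg.norm(x_t[i] - x_prev[i] + beta[i]) for i in range(len(x_t)))
    group = sum(np.linalg.norm(beta[i], ord=p) for i in range(len(x_t)))
    return mu2 * fit + (1.0 - mu2) * group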

As we can see from Table 2, leveraging the temporal connections between consecutive snapshots significantly improves performance. DANR+T-SON outperforms DANR by 8%-14%, while DANR+T-SOS outperforms DANR by 10%-15%. Moreover, ST-DANR consistently outperforms both baselines for all years. This confirms that the proposed formulation accounts for the heterogeneous nature of temporal networks, where some groups of nodes exhibit different evolution patterns than others.

Method 2000 2002 2004 2006 2008 2010
DANR 0.8479 0.9033 0.6384 0.6722 0.4556 0.4113
+ T-SON N/A 0.8308 0.5744 0.6061 0.4109 0.3517
+ T-SOS N/A 0.8045 0.5618 0.6045 0.3978 0.3476
ST-DANR N/A 0.8037 0.5604 0.5866 0.3844 0.3311
Table 2: MSE for water quality estimation over years.

Figure 6 further displays the models (color-coded) learned on three consecutive snapshots, corresponding to the years 2000, 2002, and 2004. In 2000, the formed clusters do not go much beyond state boundaries, resulting in a coarse clustering of the network. Yet some states, such as Missouri and Iowa, are grouped into one cluster (shown in green). This suggests that accurately estimating the chlorophyll concentration of lakes from the selected set of features is indeed challenging. In particular, the estimation problem depends on several other external factors that potentially affect the volume of plants and algae in lakes, e.g., cultural eutrophication and nutrient inputs from human activities [26]. However, as can be seen from the models learned in 2002 and 2004, the clustering improves once we allow temporal regularization to synchronize models between consecutive snapshots, which is consistent with the reduction in errors over the years reported in Table 2. Overall, we observe a split of clusters over time, e.g., in Wisconsin, Minnesota, Iowa, and New England. Additionally, a dark brown cluster in southern Missouri begins to appear in 2002 and further spreads north in 2004. This indicates that, by leveraging the temporal connections, ST-DANR learns improved models and clusterings while allowing for heterogeneity in the group-level evolution of nodes.

5 CONCLUSIONS

We propose the discrepancy-aware network regularization (DANR) approach for spatio-temporal networks. By introducing discrepancy-buffering variables, DANR automatically compensates for inaccurate prior knowledge encoded in edge weights, and enables modeling heterogeneous temporal patterns of network clusters. We develop a scalable algorithm based on the alternating direction method of multipliers (ADMM) to solve the proposed problem, which enjoys analytic solutions for the decomposed subproblems and employs a distributed update scheme on nodes and edges. Our experimental results show that DANR successfully captures structural changes over spatio-temporal networks and yields enhanced models by providing robustness towards missing or inaccurate edge weights.

6 ACKNOWLEDGMENTS

This project has received funding from the National Science Foundation under grant IIS-1817046 and the Defense Threat Reduction Agency under award HDTRA1-19-1-0017.

References

  • [1] L. Anselin, Under the hood: Issues in the specification and interpretation of spatial regression models, Agricultural economics, 27 (2002).
  • [2] M. Belkin et al., Regularization and semi-supervised learning on large graphs, in COLT, 2004.
  • [3] M. Belkin, P. Niyogi, and V. Sindhwani, Manifold regularization: A geometric framework for learning from labeled and unlabeled examples, Journal of machine learning research, 7 (2006), pp. 2399–2434.
  • [4] S. Boyd et al., Distributed optimization and statistical learning via the alternating direction method of multipliers, Foundations and Trends® in Machine Learning, 3 (2011).
  • [5] E. C. Chi et al., Splitting methods for convex clustering, Journal of Computational and Graphical Statistics, 24 (2015).
  • [6] X.-H. Dang et al., Subnetwork mining with spatial and temporal smoothness, in SDM, 2017.
  • [7] J. Eckstein et al., On the douglas—rachford splitting method and the proximal point algorithm for maximal monotone operators, Mathematical Programming, 55 (1992).
  • [8] D. Gabay and B. Mercier, A dual algorithm for the solution of nonlinear variational problems via finite element approximation, Computers & Mathematics with Applications, (1976).
  • [9] P. Gong, J. Ye, and C. Zhang, Robust multi-task feature learning, in ACM SIGKDD, 2012.
  • [10] N. Guan, D. Tao, Z. Luo, and B. Yuan, Manifold regularized discriminative nonnegative matrix factorization with fast gradient descent, IEEE Transactions on Image Processing, 20 (2011), pp. 2030–2048.
  • [11] D. Hallac et al., Network lasso: Clustering and optimization in large graphs, in ACM SIGKDD, 2015.
  • [12] D. Hallac, Y. Park, S. Boyd, and J. Leskovec, Network inference via the time-varying graphical lasso, 2017.
  • [13] R. Harris, J. Moffat, and V. Kravtsova, In search of ’w’, Spatial Economic Analysis, 6 (2011), pp. 249–270.
  • [14] B. He and X. Yuan, On non-ergodic convergence rate of douglas–rachford alternating direction method of multipliers, Numerische Mathematik, 130 (2015), pp. 567–577.
  • [15] Q. Ho et al., Evolving cluster mixed-membership blockmodel for time-evolving networks, in AISTATS, 2011.
  • [16] T. D. Hocking et al., Clusterpath an algorithm for clustering using convex fusion penalties, in ICML, 2011.
  • [17] M. Kim and J. Leskovec, Nonparametric multi-group membership model for dynamic networks, in NIPS, 2013.
  • [18] C. Li and H. Li, Network-constrained regularization and variable selection for analysis of genomic data, Bioinformatics, 24 (2008), pp. 1175–1182.
  • [19] F. Lindsten et al., Clustering using sum-of-norms regularization, in Statistical Signal Processing Workshop, 2011.
  • [20] Y.-Y. Liu, J.-J. Slotine, and A.-L. Barabási, Controllability of complex networks, Nature, 473 (2011), p. 167.
  • [21] R. Nishihara et al., A general analysis of the convergence of admm, in ICML, 2015, pp. 343–352.
  • [22] J. Pang et al., Graph laplacian regularization for image denoising, IEEE Transactions on Image Processing, (2017).
  • [23] Y. Peng et al., Robust alignment by sparse and low-rank decomposition for linearly correlated images, PAMI, (2012).
  • [24] L. I. Rudin, S. Osher, and E. Fatemi, Nonlinear total variation based noise removal algorithms, Physica D: nonlinear phenomena, 60 (1992).
  • [25] J. Sharpnack, A. Singh, and A. Rinaldo, Sparsistency of the edge lasso over graphs, in AISTATS, 2012.
  • [26] P. A. Soranno et al., Lagos-ne, GigaScience, 6 (2017).
  • [27] R. Tibshirani et al., Sparsity and smoothness via the fused lasso, Journal of the Royal Statistical Society, (2005).
  • [28] J. E. Vogt and V. Roth, A complete analysis of the \ell_{1,p} group-lasso, in ICML, 2012.
  • [29] H. Wang, F. Nie, and H. Huang, Low-rank tensor completion with spatio-temporal consistency, in AAAI, 2014.
  • [30] Y.-X. Wang, J. Sharpnack, A. J. Smola, and R. J. Tibshirani, Trend filtering on graphs, in AISTATS, 2015.
  • [31] K. Q. Weinberger, F. Sha, Q. Zhu, and L. K. Saul, Graph laplacian regularization for large-scale semidefinite programming, in NIPS, 2007.
  • [32] J. Xu et al., Factorized multi-task learning for task discovery in personalized medical models, in SDM, 2015.
  • [33] H.-F. Yu et al., Temporal regularized matrix factorization for high-dimensional time series prediction, in NIPS, 2016.

A RELATED WORKS

Among existing network-based regularization approaches, two major types of edge objectives are most commonly used, the squared-norm objective and the unsquared-norm objective, which encourage different styles of expected solutions. The squared-norm edge objective (i.e., \sum_{jk}\omega_{jk}\|\bm{x}_{j}-\bm{x}_{k}\|_{2}^{2}), well known as graph Laplacian regularization [3], assumes that the underlying models on nodes vary smoothly as one traverses edges in the network, and accordingly enforces similar but not identical models on linked nodes. Owing to its simple computation and good performance, the graph Laplacian regularizer has gained popularity for problems such as image deblurring and denoising [22], genomic data analysis [18], semi-supervised regression [2], non-negative matrix factorization [10], and semidefinite programming [31]. However, the squared-norm objective usually induces very dense models. The unsquared-norm objective, known as the sum-of-norms regularizer [16, 19], bears some similarity to the scalar-valued fused lasso signal approximator [27] used in signal processing applications, for example the total-variation (TV) regularizer for image denoising and the graph trend filtering (GTF) regularizer for nonparametric regression. The TV regularizer [24] penalizes both the horizontal and vertical differences between neighboring pixels, and can be thought of as a special scalar-valued network lasso on a 2D grid network. GTF [30] generalizes the successful idea of trend filtering to graphs by directly penalizing higher-order differences across nodes. All of the regularizers above require correct edge weights so that network structures can be properly used to improve joint estimation. In contrast, our proposed approach allows ambiguous input of network edge weights by introducing the discrepancy-buffering variables. Also, as a variation of SON regularization, our formulation enjoys small model complexity, as local consensus is imposed across large-scale networks.

B Proof of Lemma 2.1

We present the analytic solution for updating (\bm{u}_{jk},\bm{u}_{kj}), introduced in Eq. (2.9). For simplicity, we omit the iteration index l^{\prime} and denote \bm{a}=\bm{x}_{j}+\bm{\delta}_{jk}^{u}, \bm{b}=\bm{x}_{k}+\bm{\delta}_{kj}^{u}, and c=\lambda_{1}\mu_{1}\omega_{jk}.

\operatorname{min}~g(\bm{u}_{jk},\bm{u}_{kj})=c\left\|\bm{u}_{jk}+\bm{\alpha}_{jk}-\bm{u}_{kj}\right\|_{2}
(B.1) \qquad +\frac{\rho_{1}}{2}\left(\left\|\bm{a}-\bm{u}_{jk}\right\|_{2}^{2}+\left\|\bm{b}-\bm{u}_{kj}\right\|_{2}^{2}\right)

We discuss the problem in the following two cases:

Case (1): When the objective function in Eq. (B.1) is differentiable, i.e., \left\|\bm{u}_{jk}+\bm{\alpha}_{jk}-\bm{u}_{kj}\right\|_{2}\neq 0, the first-order optimality condition \nabla g=0 gives:

c\,\frac{\bm{u}_{jk}+\bm{\alpha}_{jk}-\bm{u}_{kj}}{\left\|\bm{u}_{jk}+\bm{\alpha}_{jk}-\bm{u}_{kj}\right\|_{2}}-\rho_{1}(\bm{a}-\bm{u}_{jk})=0
-c\,\frac{\bm{u}_{jk}+\bm{\alpha}_{jk}-\bm{u}_{kj}}{\left\|\bm{u}_{jk}+\bm{\alpha}_{jk}-\bm{u}_{kj}\right\|_{2}}-\rho_{1}(\bm{b}-\bm{u}_{kj})=0

Solving the above linear system, we obtain:

\bm{u}_{jk}=\frac{(c+\tau\rho_{1})\bm{a}+c\bm{b}-c\bm{\alpha}_{jk}}{2c+\tau\rho_{1}},\qquad\bm{u}_{kj}=\frac{c\bm{a}+(c+\tau\rho_{1})\bm{b}+c\bm{\alpha}_{jk}}{2c+\tau\rho_{1}}

where \tau\coloneqq\left\|\bm{u}_{jk}+\bm{\alpha}_{jk}-\bm{u}_{kj}\right\|_{2} still depends on \bm{u}_{jk} and \bm{u}_{kj}. Plugging the expressions for \bm{u}_{jk} and \bm{u}_{kj} back into this definition gives \tau=\|\bm{a}-\bm{b}+\bm{\alpha}_{jk}\|_{2}-2c/\rho_{1}, which involves only fixed quantities. Hence, the solution is:

\bm{u}_{jk}^{\ast}=(1-\theta)\bm{a}+\theta\bm{b}-\theta\bm{\alpha}_{jk}
(B.2) \bm{u}_{kj}^{\ast}=\theta\bm{a}+(1-\theta)\bm{b}+\theta\bm{\alpha}_{jk}

where \theta\coloneqq c/(\rho_{1}\|\bm{a}-\bm{b}+\bm{\alpha}_{jk}\|_{2})\geq 0.

Case (2): When the objective function in Eq. (B.1) is not differentiable, we solve the quadratic problem \operatorname{min}\left\|\bm{a}-\bm{u}_{jk}\right\|_{2}^{2}+\left\|\bm{b}-\bm{u}_{kj}\right\|_{2}^{2} subject to the constraint \|\bm{u}_{jk}+\bm{\alpha}_{jk}-\bm{u}_{kj}\|_{2}=0. This yields \bm{u}_{jk}^{\ast}=(\bm{a}+\bm{b}-\bm{\alpha}_{jk})/2 and \bm{u}_{kj}^{\ast}=(\bm{a}+\bm{b}+\bm{\alpha}_{jk})/2, which coincides with Eq. (B.2) when \theta=1/2.

To determine which case applies, we compare the resulting objective values. Denoting by F_{1}^{\ast} and F_{2}^{\ast} the optimal objective values in the two cases above, their difference satisfies:

\frac{F_{1}^{\ast}-F_{2}^{\ast}}{\|\bm{a}-\bm{b}+\bm{\alpha}_{jk}\|_{2}}=\left(\theta^{2}-\frac{1}{4}\right)\frac{c}{\theta}+\frac{c\tau\rho_{1}}{2c+\tau\rho_{1}}

From the definitions of \tau and \theta, we have \tau\rho_{1}=c(1/\theta-2). Combining the above, we obtain:

\frac{F_{1}^{\ast}-F_{2}^{\ast}}{c\|\bm{a}-\bm{b}+\bm{\alpha}_{jk}\|_{2}}=-\frac{(\theta-\frac{1}{2})^{2}}{\theta}

which shows that F_{1}^{\ast}\leq F_{2}^{\ast}, with equality only when \theta=1/2. Therefore, Eq. (B.2) provides a unified solution for both cases.
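For concreteness, the unified solution in Eq. (B.2) can be implemented directly. The NumPy sketch below is our illustration (not the authors' code); it takes \bm{a}, \bm{b}, \bm{\alpha}_{jk}, the coefficient c=\lambda_{1}\mu_{1}\omega_{jk}, and \rho_{1} as inputs, and clamps \theta at 1/2, which corresponds to Case (2).

import numpy as np

def update_u_pair(a, b, alpha_jk, c, rho1):
    # closed-form minimizer of Eq. (B.1), following Eq. (B.2)
    r = a - b + alpha_jk
    norm_r = np.linalg.norm(r)
    # theta = c / (rho1 * ||a - b + alpha_jk||); theta >= 1/2 falls into Case (2)
    theta = 0.5 if norm_r <= 2.0 * c / rho1 else c / (rho1 * norm_r)
    u_jk = (1.0 - theta) * a + theta * b - theta * alpha_jk
    u_kj = theta * a + (1.0 - theta) * b + theta * alpha_jk
    return u_jk, u_kj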

C Proof of Lemma 2.2

Based on the analyses in [4] and [7], convergence to the global optimum is guaranteed for ADMM algorithms applied to problems of the following form:

(C.3) \operatorname{min}~h(\bm{x}_{0})+g(\bm{z}_{0})\qquad\text{s.t.}~A\bm{x}_{0}+B\bm{z}_{0}=\bm{c}_{0}

where both h and g are convex functions and the constraints are linear. Note that our problem in Eq. (2.2) is equivalent to Eq. (2.5). To invoke the convergence guarantee, we need to fit Eq. (2.5) into the form of Eq. (C.3).

We let \bm{x}_{0}=\bm{x} and \bm{z}_{0}=[\bm{u},\bm{\alpha}], and further define h(\bm{x}_{0})=\sum_{j\in\mathcal{V}}f_{j}(\bm{x}_{j}) and g(\bm{z}_{0})=\lambda_{1}(1-\mu_{1})\|\bm{\alpha}\|_{1,p}+\lambda_{1}\mu_{1}\sum_{(j,k)\in\mathcal{E}}\omega_{jk}\|\bm{u}_{jk}+\bm{\alpha}_{jk}-\bm{u}_{kj}\|_{2}. First, h(\bm{x}_{0}) is convex because each component f_{j}(\bm{x}_{j}) is required to be convex in our problem setting. Second, g(\bm{z}_{0}) is convex in the joint variable [\bm{u},\bm{\alpha}], by the definition of convexity and the triangle inequality. Lastly, the constraint \{[\bm{u}_{jk},\bm{u}_{kj}]=[\bm{x}_{j},\bm{x}_{k}],\ \forall(j,k)\in\mathcal{E}\} is linear in the entries of \bm{u} and \bm{x}. Summing up, Eq. (2.5) fits the form of Eq. (C.3). Then, by the theorems in [4] and [7], our ADMM approach for solving the problem in Eq. (2.2) is guaranteed to converge to the global optimum.
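For reference, the scaled-form ADMM iteration for the generic problem in Eq. (C.3), written with a penalty parameter \rho and a scaled dual variable \bm{\delta}_{0} (notation introduced here only for this illustration; see [4]), follows the same three-step pattern as Eq. (E.5):

\bm{x}_{0}^{(l+1)}=\underset{\bm{x}_{0}}{\operatorname{argmin}}~h(\bm{x}_{0})+\frac{\rho}{2}\|A\bm{x}_{0}+B\bm{z}_{0}^{(l)}-\bm{c}_{0}+\bm{\delta}_{0}^{(l)}\|_{2}^{2}
\bm{z}_{0}^{(l+1)}=\underset{\bm{z}_{0}}{\operatorname{argmin}}~g(\bm{z}_{0})+\frac{\rho}{2}\|A\bm{x}_{0}^{(l+1)}+B\bm{z}_{0}-\bm{c}_{0}+\bm{\delta}_{0}^{(l)}\|_{2}^{2}
\bm{\delta}_{0}^{(l+1)}=\bm{\delta}_{0}^{(l)}+A\bm{x}_{0}^{(l+1)}+B\bm{z}_{0}^{(l+1)}-\bm{c}_{0}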

D Pseudo-code for Solving DANR Problem

[Pseudo-code figure omitted in conversion.]

E Solution of ST-DANR

E.1 Batch Formulation

Similar to the spatial-only case in §2, we define \bm{x}=\{\bm{x}_{j,t}\}_{t=1\ldots M}^{j\in\mathcal{V}} for the model parameter vectors on the temporal undirected network \mathcal{G}, and define the consensus variables

\bm{z}=[\bm{u},\bm{v}]=[\{\bm{u}_{jk,t,0},\bm{u}_{kj,t,0}\}_{t=1\ldots M}^{(j,k)\in\mathcal{E}_{t}},\ \{\bm{v}_{j,t,1}\}_{t=1\ldots M-1}^{j\in\mathcal{V}_{t}},\ \{\bm{v}_{j,t,2}\}_{t=2\ldots M}^{j\in\mathcal{V}_{t}}]

as copies of \bm{x} in the corresponding spatial penalty term \mathcal{R}_{S}(\bm{x},\bm{\alpha}) and temporal penalty term \mathcal{R}_{T}(\bm{x},\bm{\beta}). Then we rewrite the problem as follows:

\underset{\bm{x},\bm{z},\bm{\alpha},\bm{\beta}}{\operatorname{min}}\quad\sum_{t=1}^{M}\sum_{j\in\mathcal{V}}f_{j,t}(\bm{x}_{j,t})
+\lambda_{1}\mu_{1}\sum_{t=1}^{M}\sum_{(j,k)\in\mathcal{E}}\omega_{jk}^{t}\left\|\bm{u}_{jk,t}+\bm{\alpha}_{jk,t}-\bm{u}_{kj,t}\right\|_{2}
+\lambda_{2}\mu_{2}\sum_{j\in\mathcal{V}}\sum_{t=1}^{M-1}\omega_{j}^{t,t+1}\left\|\bm{v}_{j,t,1}+\bm{\beta}_{j,t}-\bm{v}_{j,t+1,2}\right\|_{2}
(E.4) +\lambda_{1}(1-\mu_{1})\|\bm{\alpha}\|_{1,p}+\lambda_{2}(1-\mu_{2})\|\bm{\beta}\|_{1,p}
\text{s.t.}\quad[\bm{u}_{jk,t},\bm{u}_{kj,t}]=[\bm{x}_{j,t},\bm{x}_{k,t}],\quad t=1,\ldots,M
[\bm{v}_{j,t,1},\bm{v}_{j,t+1,2}]=[\bm{x}_{j,t},\bm{x}_{j,t+1}],\quad t=1,\ldots,M-1

The iterative scheme of ADMM under the above setting is modified as follows, with l denoting the iteration index:

(E.5) \bm{x}^{(l+1)}=\underset{\bm{x}}{\operatorname{argmin}}~\mathcal{L}_{\rho_{1},\rho_{2}}(\bm{x},(\bm{z},\bm{\alpha},\bm{\beta},\bm{\delta})^{(l)})
(\bm{z},\bm{\alpha},\bm{\beta})^{(l+1)}=\underset{\bm{z},\bm{\alpha},\bm{\beta}}{\operatorname{argmin}}~\mathcal{L}_{\rho_{1},\rho_{2}}(\bm{x}^{(l+1)},\bm{z},\bm{\alpha},\bm{\beta},\bm{\delta}^{(l)})
\bm{\delta}^{(l+1)}=\bm{\delta}^{(l)}+(\tilde{\bm{x}}^{(l+1)}-\bm{z}^{(l+1)})

where \tilde{\bm{x}}=[\{\bm{x}_{j,t},\bm{x}_{k,t}\}_{t=1\ldots M}^{(j,k)\in\mathcal{E}_{t}},\ \{\bm{x}_{j,t}\}_{t=1\ldots M-1}^{j\in\mathcal{V}_{t}},\ \{\bm{x}_{j,t}\}_{t=2\ldots M}^{j\in\mathcal{V}_{t}}] is composed of replicated elements of \bm{x}.

\bm{x}-Update. We decompose the minimization over \bm{x} into separate minimizations over \bm{x}_{j,t} for each node j at each snapshot t:

\bm{x}_{j,t}^{(l+1)}=\underset{\bm{x}_{j,t}}{\operatorname{argmin}}\Big(f_{j,t}(\bm{x}_{j,t})+\frac{\rho_{1}}{2}\sum_{k\in N_{j,t}}\|\bm{x}_{j,t}-\bm{u}_{jk,t}^{(l)}+\bm{\delta}_{jk,t}^{u(l)}\|_{2}^{2}
(E.6) +\frac{\rho_{2}}{2}\|\bm{x}_{j,t}-\bm{v}_{j,t,1}^{(l)}+\bm{\delta}^{v(l)}_{j,t,1}\|_{2}^{2}+\frac{\rho_{2}}{2}\|\bm{x}_{j,t}-\bm{v}_{j,t,2}^{(l)}+\bm{\delta}^{v(l)}_{j,t,2}\|_{2}^{2}\Big)
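As an illustration, if the node loss is quadratic, e.g. f_{j,t}(\bm{x}_{j,t})=\frac{1}{2}\|\bm{x}_{j,t}-\bm{y}_{j,t}\|_{2}^{2} for a hypothetical local observation \bm{y}_{j,t}, then Eq. (E.6) has a closed form: a weighted average of the observation and the (shifted) spatial and temporal copies. The NumPy sketch below is ours, written under that quadratic-loss assumption.

import numpy as np

def x_update_quadratic(y, u_copies, delta_u, v1, delta_v1, v2, delta_v2, rho1, rho2):
    # closed-form solution of Eq. (E.6) when f_{j,t}(x) = 0.5 * ||x - y||_2^2
    num = np.array(y, dtype=float)
    den = 1.0
    for u, d in zip(u_copies, delta_u):            # spatial copies, one per neighbor k in N_{j,t}
        num += rho1 * (u - d)
        den += rho1
    for v, d in ((v1, delta_v1), (v2, delta_v2)):  # the two temporal copies v_{j,t,1}, v_{j,t,2}
        num += rho2 * (v - d)
        den += rho2
    return num / den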

(\bm{z},\bm{\alpha},\bm{\beta})-Update. The (\bm{z},\bm{\alpha},\bm{\beta})-update step in Eq. (E.5) decouples into a spatial (\bm{u},\bm{\alpha})-update and a temporal (\bm{v},\bm{\beta})-update. For the spatial (\bm{u},\bm{\alpha})-update, we have:

(\bm{u}_{jk,t}^{(l^{\prime}+1)},\bm{u}_{kj,t}^{(l^{\prime}+1)})=\operatorname{argmin}\Big(\lambda_{1}\mu_{1}\omega_{jk}^{t}\|\bm{u}_{jk,t}+\bm{\alpha}_{jk,t}^{(l^{\prime})}-\bm{u}_{kj,t}\|_{2}
+\frac{\rho_{1}}{2}\|\bm{x}_{j,t}^{(l+1)}-\bm{u}_{jk,t}+\bm{\delta}^{u(l)}_{jk,t}\|_{2}^{2}
+\frac{\rho_{1}}{2}\|\bm{x}_{k,t}^{(l+1)}-\bm{u}_{kj,t}+\bm{\delta}^{u(l)}_{kj,t}\|_{2}^{2}\Big)
\bm{\alpha}_{jk,t}^{(l^{\prime}+1)}=\operatorname{argmin}\Big(\mu_{1}\omega_{jk}^{t}\|\bm{u}_{jk,t}^{(l^{\prime}+1)}+\bm{\alpha}_{jk,t}-\bm{u}_{kj,t}^{(l^{\prime}+1)}\|_{2}
+(1-\mu_{1})\|\bm{\alpha}_{jk,t}\|_{p}\Big)

We can follow the same path to solve the temporal (\bm{v},\bm{\beta})-update, which is omitted here.

\bm{\delta}-Update. In the last step of our ADMM updating scheme, we have fully independent update rules for each scaled dual variable \bm{\delta}_{jk,t}^{u}, \bm{\delta}_{j,t,1}^{v}, and \bm{\delta}_{j,t,2}^{v}, as follows:

(E.7) {\bm{\delta}_{jk,t}^{u}}^{(l+1)}={\bm{\delta}_{jk,t}^{u}}^{(l)}+(\bm{x}_{j,t}^{(l+1)}-\bm{u}_{jk,t}^{(l+1)})
{\bm{\delta}_{j,t,1}^{v}}^{(l+1)}={\bm{\delta}_{j,t,1}^{v}}^{(l)}+(\bm{x}_{j,t}^{(l+1)}-\bm{v}_{j,t,1}^{(l+1)})
{\bm{\delta}_{j,t+1,2}^{v}}^{(l+1)}={\bm{\delta}_{j,t+1,2}^{v}}^{(l)}+(\bm{x}_{j,t+1}^{(l+1)}-\bm{v}_{j,t+1,2}^{(l+1)})

E.2 Streaming Case Formulation

Following the same notation as in the batch formulation of ST-DANR, we formally present the streaming-case formulation:

\underset{\bm{x},\bm{\alpha},\bm{\beta}}{\operatorname{min}}\quad\sum_{t=\tau+1}^{\tau+m}\sum_{j\in\mathcal{V}}f_{j,t}(\bm{x}_{j,t})
+\lambda_{1}(1-\mu_{1})\|\bm{\alpha}\|_{1,p}+\lambda_{2}(1-\mu_{2})\|\bm{\beta}\|_{1,p}
+\lambda_{1}\mu_{1}\sum_{t=\tau+1}^{\tau+m}\sum_{(j,k)\in\mathcal{E}}\omega_{jk}^{t}\left\|\bm{x}_{j,t}+\bm{\alpha}_{jk,t}-\bm{x}_{k,t}\right\|_{2}
+\lambda_{2}\mu_{2}\sum_{j\in\mathcal{V}}\Big(\omega_{j}^{\tau,\tau+1}\left\|\hat{\bm{x}}_{j,\tau}+\bm{\beta}_{j,\tau}-\bm{x}_{j,\tau+1}\right\|_{2}
+\sum_{t=\tau+1}^{\tau+m-1}\omega_{j}^{t,t+1}\left\|\bm{x}_{j,t}+\bm{\beta}_{j,t}-\bm{x}_{j,t+1}\right\|_{2}\Big)

where \|\bm{\alpha}\|_{1,p}=\sum_{(j,k)\in\mathcal{E}}\sum_{t=\tau+1}^{\tau+m}\|\bm{\alpha}_{jk,t}\|_{p} and \|\bm{\beta}\|_{1,p}=\sum_{j\in\mathcal{V}}\sum_{t=\tau}^{\tau+m-1}\|\bm{\beta}_{j,t}\|_{p}. We then fit the problem into the classic two-block ADMM framework, as we did for the batch formulation of ST-DANR:

(E.8) \underset{\bm{x},\bm{u},\bm{v},\bm{\alpha},\bm{\beta}}{\operatorname{min}}~\sum_{t=\tau+1}^{\tau+m}\sum_{j\in\mathcal{V}}f_{j,t}(\bm{x}_{j,t})
+\lambda_{1}(1-\mu_{1})\|\bm{\alpha}\|_{1,p}+\lambda_{2}(1-\mu_{2})\|\bm{\beta}\|_{1,p}
+\lambda_{1}\mu_{1}\sum_{t=\tau+1}^{\tau+m}\sum_{(j,k)\in\mathcal{E}}\omega_{jk}^{t}\left\|\bm{u}_{jk,t}+\bm{\alpha}_{jk,t}-\bm{u}_{kj,t}\right\|_{2}
+\lambda_{2}\mu_{2}\sum_{j\in\mathcal{V}}\Big(\omega_{j}^{\tau,\tau+1}\left\|\hat{\bm{x}}_{j,\tau}+\bm{\beta}_{j,\tau}-\bm{v}_{j,\tau+1,2}\right\|_{2}
+\sum_{t=\tau+1}^{\tau+m-1}\omega_{j}^{t,t+1}\left\|\bm{v}_{j,t,1}+\bm{\beta}_{j,t}-\bm{v}_{j,t+1,2}\right\|_{2}\Big)
\text{s.t.}\quad[\bm{u}_{jk,t},\bm{u}_{kj,t}]=[\bm{x}_{j,t},\bm{x}_{k,t}],\quad t=\tau+1,\ldots,\tau+m
[\bm{v}_{j,t,1},\bm{v}_{j,t+1,2}]=[\bm{x}_{j,t},\bm{x}_{j,t+1}],\quad t=\tau+1,\ldots,\tau+m-1

where \hat{\bm{x}}_{\tau} denotes the estimated models at the \tau-th snapshot. Note that the presence of \bm{\beta}_{j,\tau} and \bm{v}_{j,\tau+1,2} makes Eq. (E.8) differ from the batch formulation in Eq. (E.4), which requires modifying the (\bm{v},\bm{\beta})-updates with respect to these two variables:

(\bm{v}_{j,\tau+1,2}^{(l+1)},\bm{\beta}_{j,\tau}^{(l+1)})=\operatorname{argmin}\Big(\lambda_{2}\mu_{2}\omega_{j}^{\tau,\tau+1}\left\|\hat{\bm{x}}_{j,\tau}+\bm{\beta}_{j,\tau}-\bm{v}_{j,\tau+1,2}\right\|_{2}
+\frac{\rho_{2}}{2}\|\bm{x}_{j,\tau+1}^{(l+1)}-\bm{v}_{j,\tau+1,2}+\bm{\delta}^{v(l)}_{j,\tau+1,2}\|_{2}^{2}+\lambda_{2}(1-\mu_{2})\|\bm{\beta}_{j,\tau}\|_{p}\Big)

This subproblem can still be solved by alternating minimization with the following iterative scheme:

\bm{v}_{j,\tau+1,2}^{(l^{\prime}+1)}=\operatorname{argmin}\Big(\lambda_{2}\mu_{2}\omega_{j}^{\tau,\tau+1}\left\|\hat{\bm{x}}_{j,\tau}+\bm{\beta}_{j,\tau}^{(l^{\prime})}-\bm{v}_{j,\tau+1,2}\right\|_{2}
+\frac{\rho_{2}}{2}\|\bm{x}_{j,\tau+1}^{(l+1)}-\bm{v}_{j,\tau+1,2}+\bm{\delta}^{v(l)}_{j,\tau+1,2}\|_{2}^{2}\Big)
\bm{\beta}_{j,\tau}^{(l^{\prime}+1)}=\operatorname{argmin}\Big(\mu_{2}\omega_{j}^{\tau,\tau+1}\left\|\hat{\bm{x}}_{j,\tau}+\bm{\beta}_{j,\tau}-\bm{v}_{j,\tau+1,2}^{(l^{\prime}+1)}\right\|_{2}
+(1-\mu_{2})\|\bm{\beta}_{j,\tau}\|_{p}\Big)

Each step above is a simple optimization problem; a sketch of both steps is given below. The updating steps for all remaining variables are the same as in the batch formulation.
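As an illustration (ours, not the authors' implementation), the \bm{v}-update above is a proximal step of the \ell_{2}-norm (block soft-thresholding), and, under the additional assumption p=2, the \bm{\beta}-update reduces to picking either \bm{\beta}=\bm{0} or \bm{\beta}=\bm{v}-\hat{\bm{x}}, whichever gives the smaller objective:

import numpy as np

def v_boundary_update(x_hat, beta, x_next, delta_v, omega, lam2, mu2, rho2):
    # argmin_v  c * ||(x_hat + beta) - v||_2 + (rho2 / 2) * ||x_next - v + delta_v||_2^2
    c = lam2 * mu2 * omega
    q = x_hat + beta
    r = (x_next + delta_v) - q
    nr = np.linalg.norm(r)
    shrink = max(0.0, 1.0 - c / (rho2 * nr)) if nr > 0 else 0.0
    return q + shrink * r                      # block soft-thresholding toward q

def beta_boundary_update(x_hat, v, omega, mu2):
    # argmin_b  mu2 * omega * ||x_hat + b - v||_2 + (1 - mu2) * ||b||_2   (assumes p = 2)
    d = x_hat - v
    # the optimum lies at an endpoint of the segment between 0 and -d
    return -d if mu2 * omega > (1.0 - mu2) else np.zeros_like(d)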

F Running Time and Scalability

# nodes   DANR       NL         rMTFL      FORMULA
100       4.18 s     1.41 s     3.32 s     7.38 s
500       15.10 s    5.75 s     7.88 s     24.62 s
1000      48.85 s    12.36 s    53.91 s    62.68 s
5000      146.78 s   37.99 s    221.52 s   446.21 s
10000     484.68 s   123.81 s   -          -

Table 3: Running time on synthetic datasets as the number of nodes in the network varies ('-' means the method did not finish).

To test the scalability of our method, we create synthetic datasets in the same way as in §4.1, varying the number of nodes in the graph from 100 to 10000. Moreover, we tune the probability of adding edges within and between clusters so that the degree of each node is maintained at 20 for all graph sizes. We compare the running times of all methods in the table above. rMTFL and FORMULA fail to produce a result when the graph has 10000 nodes. DANR and NL are both distributed methods, so their running times increase almost linearly with the number of nodes. Compared with NL, DANR takes longer to produce more robust results, which suggests a trade-off between complexity and robustness.