
DANR: Discrepancy-aware Network Regularization

Hongyuan You, Furkan Kocayusufoglu, Ambuj K. Singh
Department of Computer Science, University of California, Santa Barbara ([email protected], [email protected], [email protected])
Abstract

Network regularization is an effective tool for incorporating structural prior knowledge to learn coherent models over networks, and has yielded provably accurate estimates in applications ranging from spatial economics to neuroimaging studies. Recently, there has been an increasing interest in extending network regularization to the spatio-temporal case to accommodate the evolution of networks. However, in both static and spatio-temporal cases, missing or corrupted edge weights can compromise the ability of network regularization to discover desired solutions. To address these gaps, we propose a novel approach—discrepancy-aware network regularization (DANR)—that is robust to inadequate regularizations and effectively captures model evolution and structural changes over spatio-temporal networks. We develop a distributed and scalable algorithm based on alternating direction method of multipliers (ADMM) to solve the proposed problem with guaranteed convergence to global optimum solutions. Experimental results on both synthetic and real-world networks demonstrate that our approach achieves improved performance on various tasks, and enables interpretation of model changes in evolving networks.

1 INTRODUCTION

Network regularization is a fundamental approach to encode and incorporate general relationships among variables, which are represented as nodes and linked together by weighted edges that describe their local proximity. A significant amount of effort has been devoted to developing successful regularizations [27, 25, 11, 30] that take advantage of prior knowledge on network structures to enhance estimation performance for a variety of application settings, including image denoising [22] and genomic data analysis [18].

We consider the general setting of network regularization, posed as a convex optimization problem defined on an undirected graph \mathcal{G}(\mathcal{V},\mathcal{E}) with node set \mathcal{V} and edge set \mathcal{E}:

(1.1) \min \sum_{i\in\mathcal{V}} f_{i}(\bm{x}_{i}) + \lambda \sum_{(j,k)\in\mathcal{E}} \omega_{jk}\cdot g(\bm{x}_{j},\bm{x}_{k}),

where on each node i, we intend to learn a local model \bm{x}_{i}\in\mathbb{R}^{d} which, in addition to minimizing a predefined convex loss function f_{i}:\mathbb{R}^{d}\to\mathbb{R}, carries certain relations to its neighbors, defined by the function g:\mathbb{R}^{d}\times\mathbb{R}^{d}\to\mathbb{R}.

In this paper, we focus on the particular setting where g(\bm{x}_{i},\bm{x}_{j}) = \|\bm{x}_{i}-\bm{x}_{j}\|_{2}, also known as a sum-of-norms (SON) regularizer. It assumes that the network is composed of multiple clusters, suggesting that all nodes within a cluster share the same consensus model (\bm{x}_{i}=\cdots=\bm{x}_{j}). As the observed data on each node may be sparse, the SON regularizer allows nodes to "borrow" observations from their neighbors to improve their own models, as well as to determine the network cluster to which they belong. The SON objective was first introduced in [16] for convex clustering problems and solved with off-the-shelf sequential convex solvers. Chi & Lange [5] adapt SON clustering to incorporate similarity weights with an arbitrary norm and solve the problem by either the alternating direction method of multipliers (ADMM) or the alternating minimization algorithm (AMA) in a parallel manner. More recently, Hallac et al. [11] point out that weighted SON (named Network Lasso) allows for simultaneous clustering and optimization on graphs, and is highly suitable for a broad class of large-scale network problems.
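To make the setup concrete, the following minimal sketch (our own illustration, not code from the cited works) evaluates the weighted SON objective of Eq. (1.1) on a toy graph, assuming squared local losses f_i(\bm{x}_i)=\|\bm{x}_i-\bm{a}_i\|_2^2; the edge set, weights, and targets are hypothetical.

import numpy as np

def son_objective(x, a, edges, weights, lam):
    # sum_i f_i(x_i) with assumed squared losses, plus the weighted SON term
    loss = np.sum((x - a) ** 2)
    reg = sum(weights[(j, k)] * np.linalg.norm(x[j] - x[k]) for (j, k) in edges)
    return loss + lam * reg

edges = [(0, 1), (1, 2), (2, 3)]                     # hypothetical edge set E
weights = {(0, 1): 1.0, (1, 2): 0.5, (2, 3): 2.0}    # hypothetical omega_jk
a = np.random.randn(4, 3)                            # per-node targets for f_i
x = np.random.randn(4, 3)                            # local models x_i, d = 3
print(son_objective(x, a, edges, weights, lam=0.1))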

In network regularization, a static weight assigned to an edge determines how strictly the difference between models on the corresponding nodes is penalized, relative to the other node pairs in the network. However, apart from a few scenarios in which network information is explicitly given, edge weights are usually unavailable or even infeasible to obtain in most real-world networks (e.g., gene regulatory networks [20]). Moreover, due to possible measurement errors and inaccurate prior knowledge, the assigned edge weights do not necessarily align with the underlying clustering structure of networks, as shown in [1, 13]. As a result, relying heavily on the edge weights to determine how much penalty should be applied to neighboring models may contaminate the discovered solutions.

Similar intuition applies when we extend to the temporal setting. Models on nodes of temporal networks usually change at the level of groups over time. However, some groups exhibit different evolution patterns than others [17]. Moreover, the grouping structures themselves may evolve with time [15]. Since static regularization cannot capture such temporal evolution, a direct solution is to induce a time-varying local consensus by employing the SON objective in both spatial and temporal directions, but this raises an even more drastic problem: how to assign weights between snapshots.

Consequently, a framework that is robust to missing or corrupted edge weights in network-based regularization and adapts to spatio-temporal settings is desired. Note that without assigned network weights, the setting in Eq. (1.1) degrades to applying the same isotropic regularization to all edges. In this work, we propose a generic formulation, called discrepancy-aware network regularization (DANR), which deploys a suitable amount of anisotropic network regularization in both spatial and temporal aspects. DANR infers a model for each node at each timestamp and can learn the evolution of models and transitions of network structures over time. We develop an ADMM-based algorithm that adopts an efficient and distributed iterative scheme to solve problems formulated with DANR, and show that the proposed solution obtains guaranteed convergence to globally optimal solutions. By applying DANR to both synthetic and real-world datasets, we demonstrate its effectiveness on various network problems.

2 Problem Setting

2.1 Discrepancy-aware Network Regularization

Since existing network-based regularizers rely on known edge weights, the corresponding solutions are misled when the edge weights are erroneous or unknown. Directly learning the unknown edge weights with the same regularizer as Eq. (1.1) leads to the optimization problem:

(2.2) \min_{\bm{x},\bm{\omega}} \sum_{i\in\mathcal{V}} f_{i}(\bm{x}_{i}) + \lambda \sum_{(j,k)\in\mathcal{E}} \omega_{jk} \left\|\bm{x}_{j}-\bm{x}_{k}\right\|_{2}

which yields the trivial all-zero solution for \bm{\omega} and is thus unsatisfactory. Instead of imposing more sophisticated model-based or problem-dependent regularization as suggested in [33], we consider an alternative formulation that explicitly accounts for discrepancies between models on adjacent nodes:

(2.3) \min_{\bm{x},\bm{\alpha}} \sum_{i\in\mathcal{V}} f_{i}(\bm{x}_{i}) + \lambda\cdot\mathcal{R}_{S}(\bm{x},\bm{\alpha})
\mathcal{R}_{S}(\bm{x},\bm{\alpha}) = \mu \sum_{(j,k)\in\mathcal{E}} \omega_{jk} \left\|\bm{x}_{j}+\bm{\alpha}_{jk}-\bm{x}_{k}\right\|_{2} + (1-\mu)\|\bm{\alpha}\|_{1,p}

where we denote \|\bm{\alpha}\|_{1,p}=\sum_{(j,k)\in\mathcal{E}}\|\bm{\alpha}_{jk}\|_{p} as the sum of p-norms of all \bm{\alpha}_{jk}'s. We set a penalty parameter \lambda\geq 0 to control the overall strength of network regularization, and a portion parameter 0<\mu<1 to control the emphasis between the two terms in \mathcal{R}_{S}(\bm{x},\bm{\alpha}). We name the formulation in Eq. (2.3) discrepancy-aware network regularization (DANR).

In the first term of \mathcal{R}_{S}, we define a discrepancy-buffering (DB) variable \bm{\alpha}_{jk}\in\mathbb{R}^{d} to denote the preserved discrepancy between local model parameters \bm{x}_{j} and \bm{x}_{k}. More specifically, when an edge weight \omega_{jk} is given but potentially imprecise or otherwise corrupted, \bm{\alpha}_{jk} compensates for abnormally large differences between models \bm{x}_{j} and \bm{x}_{k}, reducing the magnitude of the \omega_{jk}-weighted edge penalty term in \mathcal{R}_{S}. The DB variable provides additional flexibility to its associated models, allowing them to stay adequately close to their own local solutions w.r.t. minimizing local loss functions, and to avoid over-penalized consensus. When the \omega_{jk}'s are not given, the \bm{\alpha}_{jk}'s enable solving Eq. (2.3) under anisotropic regularization even with homogeneous weights.

The second term of \mathcal{R}_{S} is the \ell_{1,p}-regularizer (with 1<p<\infty) [28], which ensures sparsity at the group level. Note that we regard each variable vector \bm{\alpha}_{jk} as a group (|\mathcal{E}| groups in total). The first key intuition here is that the regularization on \bm{\alpha} helps to exclude trivial solutions, namely \bm{\alpha}_{jk}=\bm{x}_{k}-\bm{x}_{j}. Second, it allows us to identify a succinct set of non-zero vectors \bm{\alpha}_{jk}, compensating for possible intrinsic discrepancies between two adjacent nodes.

To sum up, the discrepancy-buffering variables \bm{\alpha}_{jk} are designed to adaptively adjust the network regularization strength on each edge, and thus reduce the negative effects of the unsquared norm regularizer. We show in later sections that this modified formulation remains convex and tractable via parallel optimization algorithms on large-scale networks.
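As a concrete reference point, the following minimal sketch (our own illustration under assumed toy inputs, not the paper's code) evaluates the regularizer \mathcal{R}_{S}(\bm{x},\bm{\alpha}) of Eq. (2.3) with p = 3; the edges, weights, models, and DB variables are hypothetical.

import numpy as np

def danr_regularizer(x, alpha, edges, weights, mu, p=3):
    # mu * sum_jk w_jk ||x_j + alpha_jk - x_k||_2  +  (1 - mu) * sum_jk ||alpha_jk||_p
    edge_term = sum(weights[(j, k)] * np.linalg.norm(x[j] + alpha[(j, k)] - x[k])
                    for (j, k) in edges)
    group_term = sum(np.linalg.norm(alpha[(j, k)], ord=p) for (j, k) in edges)
    return mu * edge_term + (1.0 - mu) * group_term

edges = [(0, 1), (1, 2)]
weights = {(0, 1): 1.0, (1, 2): 0.7}
x = np.random.randn(3, 4)                    # models x_i in R^4
alpha = {e: np.zeros(4) for e in edges}      # discrepancy-buffering variables
print(danr_regularizer(x, alpha, edges, weights, mu=0.8))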

2.2 Distributed ADMM-based Solution

In this section, we propose an ADMM-based algorithm for solving the DANR problem in Eq. (2.3) and present the convergence and complexity of the proposed algorithm. We keep the subscripted parameters \lambda_{1}, \mu_{1} (and penalty \rho_{1}) so that the derivation carries over directly to the spatio-temporal ST-DANR problem in Eq. (3.12), which adds temporal-related updates (see Appendix E).

The ADMM method was originally derived in [8] and has been studied in many contexts, including optimal control and image processing [23]. It blends the decomposability of dual ascent with the superior convergence properties of the method of multipliers [4], and solves optimization problems with two-block separable convex objectives of the following form:

(2.4) \min\ f(\bm{x})+g(\bm{z}) \qquad \text{s.t.}\ A\bm{x}+B\bm{z}=\bm{c}

Observe that our proposed objective in Eq. (2.3) is separable: it can be reorganized into two-block separable convex objectives as in Eq. (2.4), for which ADMM guarantees convergence to globally optimal solutions [7].

More precisely, to fit our problem into the ADMM framework, we first define \bm{x}=\{\bm{x}_{j}\}_{j\in\mathcal{V}} for the model parameter vectors on the given undirected network \mathcal{G}, and define consensus variables \bm{u}=[\{\bm{u}_{jk},\bm{u}_{kj}\}_{(j,k)\in\mathcal{E}}] as copies of \bm{x} in the spatial penalty term \mathcal{R}_{S}(\bm{x},\bm{\alpha}). Then we rewrite Eq. (2.3) as follows:

(2.5) \min_{\bm{x},\bm{u},\bm{\alpha}} \sum_{j\in\mathcal{V}} f_{j}(\bm{x}_{j}) + \lambda_{1}(1-\mu_{1})\|\bm{\alpha}\|_{1,p} + \lambda_{1}\mu_{1} \sum_{(j,k)\in\mathcal{E}} \omega_{jk} \left\|\bm{u}_{jk}+\bm{\alpha}_{jk}-\bm{u}_{kj}\right\|_{2}
\text{s.t.}\quad \left[\bm{u}_{jk},\bm{u}_{kj}\right] = \left[\bm{x}_{j},\bm{x}_{k}\right], \quad (j,k)\in\mathcal{E}

in which the first term corresponds to the f block in Eq. (2.4), and the remaining terms correspond to the g block. The equality constraints in Eq. (2.5) force consensus between the variables \bm{x} and \bm{u}. Next, we derive the augmented Lagrangian of Eq. (2.5):

(2.6) \mathcal{L}_{\rho_{1}}(\bm{x},\bm{u},\bm{\alpha},\bm{\delta}) = \sum_{j\in\mathcal{V}} f_{j}(\bm{x}_{j}) + \lambda_{1}(1-\mu_{1})\|\bm{\alpha}\|_{1,p}
+ \sum_{(j,k)\in\mathcal{E}} \Big( \lambda_{1}\mu_{1}\omega_{jk} \|\bm{u}_{jk}+\bm{\alpha}_{jk}-\bm{u}_{kj}\|_{2}
+ \frac{\rho_{1}}{2}\|\bm{x}_{j}-\bm{u}_{jk}+\bm{\delta}^{u}_{jk}\|_{2}^{2} - \frac{\rho_{1}}{2}\|\bm{\delta}^{u}_{jk}\|_{2}^{2}
+ \frac{\rho_{1}}{2}\|\bm{x}_{k}-\bm{u}_{kj}+\bm{\delta}^{u}_{kj}\|_{2}^{2} - \frac{\rho_{1}}{2}\|\bm{\delta}^{u}_{kj}\|_{2}^{2} \Big)

where \bm{\delta}=[\{\bm{\delta}^{u}_{jk},\bm{\delta}^{u}_{kj}\}_{(j,k)\in\mathcal{E}}] are scaled dual variables for each equality constraint on the elements of \bm{u}. The parameter \rho_{1}>0 penalizes violations of the equality constraints [21]. The iterative scheme of ADMM under the above setting can be written as follows, with l denoting the iteration index:

(2.7) \bm{x}^{(l+1)} = \operatorname{argmin}_{\bm{x}}\ \mathcal{L}_{\rho_{1}}(\bm{x},(\bm{u},\bm{\alpha},\bm{\delta})^{(l)})
(\bm{u},\bm{\alpha})^{(l+1)} = \operatorname{argmin}_{\bm{u},\bm{\alpha}}\ \mathcal{L}_{\rho_{1}}(\bm{x}^{(l+1)},\bm{u},\bm{\alpha},\bm{\delta}^{(l)})
\bm{\delta}^{(l+1)} = \bm{\delta}^{(l)} + (\tilde{\bm{x}}^{(l+1)}-\bm{u}^{(l+1)})

where \tilde{\bm{x}}=[\{\bm{x}_{j},\bm{x}_{k}\}_{(j,k)\in\mathcal{E}}] is composed of replicated elements of \bm{x}, and thus has a one-to-one correspondence with the elements of \bm{u}.

Because the augmented Lagrangian in Eq. (2.6) also has a separable structure, we can further split the optimization above over each element of \bm{x} and \bm{u}. Next, we provide details for each ADMM update step.

\bm{x}-Update. In the first step of our ADMM updating scheme, we can decompose the minimization over \bm{x} into separate minimizations over \bm{x}_{j} for each node j:

\bm{x}_{j}^{(l+1)} = \operatorname{argmin}_{\bm{x}_{j}} \Big( f_{j}(\bm{x}_{j}) + \frac{\rho_{1}}{2} \sum_{k\in N_{j}} \|\bm{x}_{j}-\bm{u}_{jk}^{(l)}+\bm{\delta}_{jk}^{u(l)}\|_{2}^{2} \Big)

As the above equation suggests, \bm{x}_{j} on node j first receives local information from the corresponding consensus variables of all its neighbors k\in N_{j}, then updates its value to minimize the loss function while remaining close to the neighboring consensus variables. Since all remaining regularization terms are quadratic, the \bm{x}-update can be solved efficiently whenever the loss function f_{j} has certain properties, such as strong convexity.
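As an illustration, a minimal sketch of this per-node update for an assumed ridge-regression loss f_{j}(\bm{x})=\|W_{j}\bm{x}-\bm{y}_{j}\|_{2}^{2}+c\|\bm{x}\|_{2}^{2} is given below (our example, not the paper's code); in this case the subproblem has a closed form obtained by setting the gradient to zero.

import numpy as np

def x_update(W_j, y_j, u_neighbors, delta_neighbors, rho1, c=1e-2):
    # Solve argmin_x ||W_j x - y_j||^2 + c||x||^2
    #               + (rho1/2) * sum_k ||x - u_jk + delta_jk||^2  in closed form.
    d = W_j.shape[1]
    n_j = len(u_neighbors)
    A = 2.0 * W_j.T @ W_j + (2.0 * c + rho1 * n_j) * np.eye(d)
    b = 2.0 * W_j.T @ y_j + rho1 * sum(u - dl for u, dl in zip(u_neighbors, delta_neighbors))
    return np.linalg.solve(A, b)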

(\bm{u},\bm{\alpha})-Update. We can further decompose the (\bm{u},\bm{\alpha})-update step in Eq. (2.7) into subproblems on each edge (and its related variables \bm{u}_{jk},\bm{u}_{kj},\bm{\alpha}_{jk}) as follows, which can be solved in parallel:

(2.8) (\bm{u}_{jk}^{(l+1)},\bm{u}_{kj}^{(l+1)},\bm{\alpha}_{jk}^{(l+1)}) = \operatorname{argmin} \Big( \lambda_{1}(1-\mu_{1})\|\bm{\alpha}_{jk}\|_{p} + \lambda_{1}\mu_{1}\omega_{jk}\|\bm{u}_{jk}+\bm{\alpha}_{jk}-\bm{u}_{kj}\|_{2} + \frac{\rho_{1}}{2}\|\bm{x}_{j}^{(l+1)}-\bm{u}_{jk}+\bm{\delta}^{u(l)}_{jk}\|_{2}^{2} + \frac{\rho_{1}}{2}\|\bm{x}_{k}^{(l+1)}-\bm{u}_{kj}+\bm{\delta}^{u(l)}_{kj}\|_{2}^{2} \Big)

Notice that a sum of convex functions defined on different sets of variables remains convex. To minimize the objective over \bm{u}_{jk},\bm{u}_{kj} and \bm{\alpha}_{jk} simultaneously, the convexity of Eq. (2.8) motivates us to adopt an alternating descent algorithm, which iteratively minimizes the objective with respect to (\bm{u}_{jk},\bm{u}_{kj}) and \bm{\alpha}_{jk} while holding the other fixed. In detail, we solve Eq. (2.9) and Eq. (2.10) iteratively until convergence is achieved:

(2.9) (\bm{u}_{jk}^{(l'+1)},\bm{u}_{kj}^{(l'+1)}) = \operatorname{argmin} \Big( \lambda_{1}\mu_{1}\omega_{jk}\|\bm{u}_{jk}+\bm{\alpha}_{jk}^{(l')}-\bm{u}_{kj}\|_{2} + \frac{\rho_{1}}{2}\|\bm{x}_{j}^{(l+1)}-\bm{u}_{jk}+\bm{\delta}^{u(l)}_{jk}\|_{2}^{2} + \frac{\rho_{1}}{2}\|\bm{x}_{k}^{(l+1)}-\bm{u}_{kj}+\bm{\delta}^{u(l)}_{kj}\|_{2}^{2} \Big)
(2.10) \bm{\alpha}_{jk}^{(l'+1)} = \operatorname{argmin} \Big( (1-\mu_{1})\|\bm{\alpha}_{jk}\|_{p} + \mu_{1}\omega_{jk}\|\bm{u}_{jk}^{(l'+1)}+\bm{\alpha}_{jk}-\bm{u}_{kj}^{(l'+1)}\|_{2} \Big)

where l' denotes the iteration index of the alternating descent. Further, we can use an analytic solution to speed up the calculation of \bm{u}_{jk} and \bm{u}_{kj}:

Lemma 2.1

Problem (2.9) has a closed-form solution:

\bm{u}_{jk}^{(l'+1)} = (1-\theta)\bm{a} + \theta\bm{b} - \theta\bm{\alpha}_{jk}^{(l')}
\bm{u}_{kj}^{(l'+1)} = \theta\bm{a} + (1-\theta)\bm{b} + \theta\bm{\alpha}_{jk}^{(l')}

where we denote \bm{a}=\bm{x}_{j}^{(l+1)}+{\bm{\delta}_{jk}^{u}}^{(l)}, \bm{b}=\bm{x}_{k}^{(l+1)}+{\bm{\delta}_{kj}^{u}}^{(l)}, c=\lambda_{1}\mu_{1}\omega_{jk}, and \theta=c/(\rho_{1}\|\bm{a}-\bm{b}+\bm{\alpha}_{jk}^{(l')}\|_{2}) for simplification. Details in Appendix B.
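The per-edge alternating descent can be sketched as follows (our reading of Lemma 2.1 and Eq. (2.10), not the authors' implementation): the u-step uses the closed form above, with \theta capped at 1/2 to cover the non-differentiable case of Appendix B, and the \alpha-step is handed to a generic numerical solver as a stand-in for a dedicated proximal routine. All names and default values are hypothetical.

import numpy as np
from scipy.optimize import minimize

def u_update(x_j, x_k, delta_jk, delta_kj, alpha_jk, lam1, mu1, w_jk, rho1):
    # Closed-form solution of Eq. (2.9) from Lemma 2.1.
    a = x_j + delta_jk
    b = x_k + delta_kj
    c = lam1 * mu1 * w_jk
    theta = min(c / (rho1 * np.linalg.norm(a - b + alpha_jk) + 1e-12), 0.5)
    u_jk = (1 - theta) * a + theta * b - theta * alpha_jk
    u_kj = theta * a + (1 - theta) * b + theta * alpha_jk
    return u_jk, u_kj

def alpha_update(u_jk, u_kj, mu1, w_jk, alpha_init, p=3):
    # Numerical surrogate for Eq. (2.10):
    # (1 - mu1) * ||alpha||_p + mu1 * w_jk * ||u_jk + alpha - u_kj||_2.
    obj = lambda al: ((1 - mu1) * np.linalg.norm(al, ord=p)
                      + mu1 * w_jk * np.linalg.norm(u_jk + al - u_kj))
    return minimize(obj, alpha_init, method="Nelder-Mead").x

def edge_update(x_j, x_k, delta_jk, delta_kj, alpha_jk,
                lam1, mu1, w_jk, rho1, inner_iters=20):
    # Alternate between the two subproblems until (approximate) convergence.
    for _ in range(inner_iters):
        u_jk, u_kj = u_update(x_j, x_k, delta_jk, delta_kj, alpha_jk,
                              lam1, mu1, w_jk, rho1)
        alpha_jk = alpha_update(u_jk, u_kj, mu1, w_jk, alpha_jk)
    return u_jk, u_kj, alpha_jk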

\bm{\delta}-Update. In the last step of our ADMM updating scheme, we have fully independent update rules for each scaled dual variable \bm{\delta}_{jk}^{u}:

(2.11) {\bm{\delta}_{jk}^{u}}^{(l+1)} = {\bm{\delta}_{jk}^{u}}^{(l)} + (\bm{x}_{j}^{(l+1)}-\bm{u}_{jk}^{(l+1)})

Finally, we present the pseudo-code of our DANR approach in Appendix D.

Stopping Criterion and Global Convergence. For our ADMM iterative scheme in Eq. (2.7), we use the norms of the primal residual \bm{r}^{(l)}=\tilde{\bm{x}}^{(l)}-\bm{u}^{(l)} and the dual residual \bm{s}^{(l)}=\rho_{1}(\bm{u}^{(l)}-\bm{u}^{(l+1)}) as the termination measure. The optimality conditions of ADMM [4] show that if both residuals are small then the objective suboptimality must be small, which suggests \|\bm{r}^{(l)}\|_{2}\leq\epsilon_{1}\wedge\|\bm{s}^{(l)}\|_{2}\leq\epsilon_{2} as a reasonable stopping criterion. The convex subproblems in the \bm{x}-update and (\bm{u},\bm{\alpha})-update also require iterative methods; their natural stopping criterion is to keep iteration differences below thresholds, e.g., in the (\bm{u},\bm{\alpha})-update we require \|\Delta\bm{u}^{(l')}\|_{2}\leq\epsilon' and \|\Delta\bm{\alpha}^{(l')}\|_{2}\leq\epsilon'.
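A minimal sketch of this residual-based test is given below (a hypothetical helper, not from the paper); x_tilde stacks the node models replicated per edge endpoint so that it aligns entry-wise with u.

import numpy as np

def should_stop(x_tilde, u, u_prev, rho1, eps1=1e-4, eps2=1e-4):
    r = x_tilde - u              # primal residual r^(l)
    s = rho1 * (u_prev - u)      # dual residual s^(l)
    return np.linalg.norm(r) <= eps1 and np.linalg.norm(s) <= eps2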

Lemma 2.2

Our ADMM approach to solve the DANR problem is guaranteed to converge to the global optimum. Details in Appendix C.

Computational Complexity. Let N_{c} denote the number of iterations ADMM takes to reach an approximate solution \hat{\bm{x}} with accuracy \epsilon_{x}. Based on the convergence analysis in [14], the number of iterations scales as O(1/\epsilon_{x}), which in our problem depends mostly on the properties of the cost functions f_{j}. Assuming that all convex subproblems in the \bm{x}-update and (\bm{u},\bm{\alpha})-update are solved by general first-order gradient descent methods that take N_{x} and N_{\alpha} iterations to converge, respectively, the overall complexity of the algorithm is O\big(N_{c}\big(N_{x}|\mathcal{V}|+N_{\alpha}|\mathcal{E}|+|\mathcal{E}|\big)\big).

3 Extension to Spatio-temporal Setting

The aforementioned shortcomings of pre-defined edge weights are accentuated when we consider temporal networks, since acquiring explicit temporal edge weights is usually not feasible. Thus, discrepancy-buffering variables are also crucial in the temporal setting.

Consider a temporal undirected network \mathcal{G} consisting of M sequential network snapshots \{\mathcal{G}_{1},\mathcal{G}_{2},\ldots,\mathcal{G}_{M}\}, where each snapshot \mathcal{G}_{t} contains a node set \mathcal{V}_{t} and an edge set \mathcal{E}_{t}. We denote by \omega_{j,k}^{t} the weight of the spatial edge (j_{t},k_{t}) between nodes j_{t} and k_{t} within snapshot \mathcal{G}_{t}, and by \omega_{j}^{t,t+1} the weight of the temporal edge (j_{t},j_{t+1}) that links node j_{t} with j_{t+1} across snapshots \mathcal{G}_{t} and \mathcal{G}_{t+1} (shown in Fig. 1). On each node j_{t}, a convex loss function f_{j,t}:\mathbb{R}^{d}\to\mathbb{R} is given to measure the fitness of a local model parameterized by \bm{x}_{j,t}\in\mathbb{R}^{d}. With sparse observations for all snapshots, a straightforward task is to find jointly optimal models for every node and every timestamp under regularization of both the network topology and the temporal evolution. We propose an extension of the DANR formulation to spatio-temporal networks, referred to as ST-DANR:

(3.12) \min_{\bm{x},\bm{\alpha},\bm{\beta}} \sum_{j\in\mathcal{V}} \sum_{t=1}^{M} f_{j,t}(\bm{x}_{j,t}) + \lambda_{1}\cdot\mathcal{R}_{S}(\bm{x},\bm{\alpha}) + \lambda_{2}\cdot\mathcal{R}_{T}(\bm{x},\bm{\beta})

where we define the discrepancy-aware regularizers as:

(3.13) \mathcal{R}_{S}(\bm{x},\bm{\alpha}) = \mu_{1} \sum_{t=1}^{M} \sum_{(j_{t},k_{t})\in\mathcal{E}_{t}} \omega_{jk}^{t} \left\|\bm{x}_{j,t}+\bm{\alpha}_{jk,t}-\bm{x}_{k,t}\right\|_{2} + (1-\mu_{1})\|\bm{\alpha}\|_{1,p} \qquad \text{(spatial penalty term)}
(3.14) \mathcal{R}_{T}(\bm{x},\bm{\beta}) = \mu_{2} \sum_{j\in\mathcal{V}} \sum_{t=1}^{M-1} \omega_{j}^{t,t+1} \left\|\bm{x}_{j,t}+\bm{\beta}_{j,t}-\bm{x}_{j,t+1}\right\|_{2} + (1-\mu_{2})\|\bm{\beta}\|_{1,p} \qquad \text{(temporal penalty term)}

Since we are interested in the heterogeneous evolution of nodal models across time, the unsquared norms in the spatial and temporal regularization terms \mathcal{R}_{S}(\bm{x},\bm{\alpha}) and \mathcal{R}_{T}(\bm{x},\bm{\beta}) enforce piecewise consensus in both the spatial and temporal aspects, which indicates abrupt changes of regional models or sudden transitions in network structures at particular timestamps, and implies valid persistent models in the remaining segments of time [12]. Besides the spatial DB variables \bm{\alpha}_{jk,t} in \mathcal{R}_{S}(\bm{x},\bm{\alpha}), we define another set of temporal DB variables \bm{\beta}_{j,t}\in\mathbb{R}^{d} in \mathcal{R}_{T}(\bm{x},\bm{\beta}) to compensate for inadequate regularization caused by temporal fluctuations. For example, when modeling housing prices in a city, \bm{\beta}_{j,t} would tolerate anomalous short-term fluctuations in prices, while large values of \bm{\alpha}_{jk,t} might be found near the boundaries of disparate neighborhoods.

Figure 1: Overview of our problem setting in temporal networks.

Adaptation to Streaming Data. In many applications, network snapshots arrive in a streaming fashion. Instead of recomputing all models on past snapshots whenever a new snapshot is observed, it is more efficient to incorporate existing models when fitting the new incoming snapshot. Assuming that we have learned models \{\hat{\bm{x}}_{j,\tau}\}_{j\in\mathcal{V}} for the \tau-th snapshot, we then read observations of the next m snapshots indexed by \tau+1,\cdots,\tau+m and attempt to learn models for them. To this end, we deploy the ST-DANR formulation in Eq. (3.12) on the limited time interval t=\tau,\cdots,\tau+m, and integrate the established models at t=\tau by fixing \{\bm{x}_{j,\tau}\}_{j\in\mathcal{V}}=\{\hat{\bm{x}}_{j,\tau}\}_{j\in\mathcal{V}} and solving for the remaining variables \{\bm{x}_{j,t}\}_{j\in\mathcal{V}}^{t=\tau+1,\cdots,\tau+m} in the corresponding problem. Note that in our setting, temporal messages are only transferable through consecutive snapshots, so fixing the \tau-th snapshot is equivalent to fixing all past snapshots.
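The control flow of this streaming variant can be sketched as follows (our own outline; solve_st_danr is a hypothetical stand-in for the batch ST-DANR solver of Eq. (3.12) restricted to a time window, and the one-snapshot window m = 1 is an assumption for brevity).

def streaming_st_danr(snapshots, solve_st_danr):
    # Solve the first snapshot with spatial regularization only.
    models = solve_st_danr(snapshots[0:1], fixed={}, t_offset=0)
    for tau in range(1, len(snapshots)):
        # Fix the models of snapshot tau - 1 and solve snapshot tau; temporal
        # messages only flow between consecutive snapshots, so fixing tau - 1
        # is equivalent to fixing the whole past.
        fixed = {(j, t): x for (j, t), x in models.items() if t == tau - 1}
        models.update(solve_st_danr(snapshots[tau - 1:tau + 1],
                                    fixed=fixed, t_offset=tau - 1))
    return models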

Lastly, we point out that the distributed ADMM-based solution for DANR (§2.2) can be adapted to ST-DANR, for either the batch formulation (Eq. (3.12)) or the streaming formulation; details are given in Appendix E.

Temporal Evolutionary Patterns. In this work, we use the unsquared edge penalty \|\bm{x}_{j,t}+\bm{\beta}_{j,t}-\bm{x}_{j,t+1}\|_{2} in \mathcal{R}_{T} since it encourages piecewise-constant temporal evolution, indicating sharp transitions or change points at particular timestamps. As in the spatial case, alternative penalties can be used to enforce different types of evolutionary patterns [12]. For example, replacing the unsquared norm with the squared form \|\bm{x}_{j,t}+\bm{\beta}_{j,t}-\bm{x}_{j,t+1}\|_{2}^{2} yields smoothly varying models along consecutive snapshots, which has been well studied in [29, 6].

4 EXPERIMENTS

In this section, we present an experimental analysis of the proposed method (DANR) and evaluate its performance on classification and regression tasks. First, we employ a synthetic dataset (§4.1) and two housing price datasets (BAY and SAC in §4.2) to demonstrate the set of scenarios and applications that are a particularly good fit for DANR. In addition to improving estimation, DANR learns fine-grained neighborhood boundaries and reveals interesting insights into the network's heterogeneous structure. Lastly, to validate the efficacy of ST-DANR in the temporal setting, we conduct experiments with a geospatial and temporal database of Northeastern US lakes (§4.3), for which we estimate the water quality of lakes over 10 years.

Baseline Algorithms. In the spatial network scenarios (§4.1 & §4.2), we compare DANR against three baselines: network lasso (NL) [11], robust multi-task feature learning (rMTFL) [9], and factorized multi-task learning (FORMULA) [32]. In the spatio-temporal network scenarios (§4.3), we compare ST-DANR with two widely applied temporal regularizers from the literature, detailed in §4.3.

Parameter Settings. For both NL and DANR, we tune \lambda from 10^{-3} to 10^{2} with \lambda^{n+1}=1.3\lambda^{n}. For DANR, for each value of \lambda we further tune \mu from 0.3 to 1 with \mu^{n+1}=\mu^{n}+0.02. Lastly, we set p=3. We follow the same strategy for both the spatial (DANR) and spatio-temporal (ST-DANR) variants of the proposed method. All penalty parameters in rMTFL and FORMULA are tuned in the same way as \lambda in DANR. In addition, we vary the number of factorized base models in FORMULA from 1 to 50 to achieve its best performance. For each dataset, we standardize all features and response variables to zero mean and unit variance.
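For reference, the hyperparameter grids described above can be generated as follows (a small sketch of the sweep; only the stated ranges and step rules are taken from the text).

lambdas, lam = [], 1e-3
while lam <= 1e2:
    lambdas.append(lam)      # lambda^{n+1} = 1.3 * lambda^{n}, from 1e-3 up to 1e2
    lam *= 1.3
mus = [round(0.3 + 0.02 * n, 2) for n in range(35)]   # 0.30, 0.32, ..., 0.98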

4.1 Node Classification with Limited Data

Following previous work [11], we first experiment with a synthetic network in which each node attempts to solve its own support vector machine (SVM) classifier, but only has a limited amount of data to learn from. We further split the nodes into clusters where each cluster has an underlying model, i.e., nodes in the same cluster share the same underlying model. Our aim is to learn accurate models for each node by leveraging their network connections and limited observations. Note that the neighbors with different underlying models provide misleading information to each other, which potentially harms the overall performance. Therefore, we later investigate the robustness of the DANR with a varying ratio of such malicious edges.

Synthetic Network Generation. We create a network of 100 nodes, split into 5 ground-truth communities C^{1}, C^{2}, \cdots, C^{5}, each consisting of 20 nodes. We denote the community of a node i by C_{i}. Next, we form the network connections by adding edges between nodes based on the following criterion: the probability of adding an edge between any node pair (i,j) is 0.5 if C_{i}=C_{j}, i.e., the edge connects nodes from the same community (intra-community edge); otherwise, the probability is 0.02 if C_{i}\neq C_{j}, i.e., the edge connects nodes from different communities (inter-community edge). The resulting synthetic network G_{0}(\mathcal{V},\mathcal{E}) has 100 nodes and 574 edges, 82% of them intra-community edges. Each ground-truth community C^{k} has an underlying classifier model \bm{X}^{k}\in\mathbb{R}^{10}, drawn independently from a normal distribution with zero mean and unit variance, where k\in[1,\cdots,5]. Next, we generate 5 random training example pairs (\bm{w},y) per node, where \bm{w}\in\mathbb{R}^{10} denotes an observation and y\in\{-1,1\} denotes the ground-truth value to estimate. The ground-truth value y for each observation \bm{w} is then computed using the underlying classifier of the community that the node belongs to:

(4.15) y = \operatorname{sign}(\bm{w}^{T}\bm{X}^{k}+\eta)

where \bm{X}^{k}\in\mathbb{R}^{10} represents the corresponding weights of the classifier w.r.t. the observations, and \eta is additive noise. All observations and noise terms are drawn independently from a normal distribution for each data point. Note that 5 observations are insufficient to accurately estimate a model \bm{x}_{i}\in\mathbb{R}^{10}. Hence, in this setting the network connections play an important role, as nodes can "borrow" training examples from their neighbors to improve their own models.
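A minimal sketch of this generator is given below (our assumed reimplementation of the description above, not the authors' script; the random seed is arbitrary).

import numpy as np

rng = np.random.default_rng(0)
n_nodes, n_comm, d, n_train = 100, 5, 10, 5
community = np.repeat(np.arange(n_comm), n_nodes // n_comm)        # C_i for each node

# Edges: p = 0.5 within a community, p = 0.02 across communities.
edges = [(i, j) for i in range(n_nodes) for j in range(i + 1, n_nodes)
         if rng.random() < (0.5 if community[i] == community[j] else 0.02)]

# One ground-truth classifier per community, then 5 noisy examples per node.
X_true = rng.standard_normal((n_comm, d))
W = rng.standard_normal((n_nodes, n_train, d))                     # observations w
noise = rng.standard_normal((n_nodes, n_train))
y = np.sign(np.einsum("ntd,nd->nt", W, X_true[community]) + noise) # Eq. (4.15)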

Optimization Problem. The objective function f_{i} on each node is formulated as the soft-margin SVM objective, where node i estimates its model (i.e., separating hyperplane) \bm{x}_{i}\in\mathbb{R}^{10} using its five training examples (\bm{w},y) as follows:

\min \Big( \frac{1}{2}\bm{x}_{i}^{T}\bm{x}_{i} + C\sum_{l=1}^{5}\xi_{l} \Big) \quad \text{s.t. } y^{(l)}(\bm{x}_{i}^{T}\bm{w}^{(l)}) \geq 1-\xi_{l},~\xi_{l}\geq 0

where the \xi_{l} are slack variables that account for non-separable data. The constant C controls the trade-off between minimizing the training error and maximizing the margin. We set C to 0.75, which we empirically find to perform well.
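For reference, the per-node subproblem can be written in its equivalent unconstrained hinge-loss form and handed to a generic solver, as in the minimal sketch below (our own reformulation, not the paper's solver).

import numpy as np
from scipy.optimize import minimize

def local_svm(W_i, y_i, C=0.75):
    # Equivalent hinge-loss form: 0.5 * ||x||^2 + C * sum_l max(0, 1 - y_l * x^T w_l).
    d = W_i.shape[1]
    hinge = lambda x: 0.5 * x @ x + C * np.sum(np.maximum(0.0, 1.0 - y_i * (W_i @ x)))
    return minimize(hinge, np.zeros(d), method="Powell").x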

Test Results. In the following subsections, we first thoroughly inspect DANR and the closely related NL method on the synthetic network G_{0}, in order to show the benefits of introducing the discrepancy-buffering variables \bm{\alpha}. Later, in §4.1.2, we evaluate the performance and robustness of DANR and all three baseline methods in noisier scenarios. To assess performance, prediction accuracies are computed over a separate set of 1000 test pairs (10 per node). The test pairs are again randomly generated using Eq. (4.15).

4.1.1 Effect of Discrepancy-Buffering Variables.

We compare DANR with NL and report the prediction accuracy for varying \lambda values in Figure 2. Recall that the proposed method introduces the \mu parameter, which is coupled with the \lambda parameter. Therefore, in order to have an adequate comparison between DANR and NL, we modify our \lambda parameter to \tilde{\lambda}=\lambda/\mu. Doing so ensures that both methods apply the same amount of penalty to their network regularization terms (see Eqs. (2.2) and (2.3)). For each value of \lambda, we vary \mu from 0.3 to 1 and report the highest accuracy achieved.

Figure 2: Prediction accuracy comparison with varying \lambda values.

Figure 3: Effect of \mu on the prediction accuracy, with a fixed \lambda.

We first briefly mention the desired behaviors of the NL method and later accentuate its drawbacks. As shown in Figure 2, for small \lambda values, the problem reduces to solving the local optimization problem, where each node estimates its model solely from its limited observations. This achieves 61.3% accuracy on the test set. As \lambda increases, the NL penalty enforces nodes to form clusters, where nodes in the same cluster share the same model. This behavior firmly improves the test performance, with prediction accuracy rising to 75.3%. Right after the peak performance is reached, however, a slight increase in \lambda causes a sudden drop in performance, and the problem reduces to solving the global optimization problem over the entire network. Therefore, when \lambda>\lambda_{critical}, the problem converges to a common model for all nodes in the network, i.e., \bm{x}_{i}=\bm{x}_{j} for all i,j\in\mathcal{V}. This further drops the prediction accuracy to 56.4% on the test set.

The observed rapid drop in performance implies that the NL method is highly sensitive to the \lambda parameter. There is only a narrow window of \lambda values that result in a proper clustering of the network. As a result, one needs to excessively tune the \lambda parameter around \lambda_{critical} to find its optimal value. In addition, such clustering performance is also highly affected by the volume of "malicious" (inter-cluster) edges in the network, as shown in §4.1.2.

From Figure 2, we observe that the DANR approach exhibits improved performance thanks to its discrepancy-buffering variables \bm{\alpha}, which allow our regularization term to better exploit the good edges while providing additional tolerance towards bad edges. As a result, DANR achieves 79.1% accuracy on the test set, outperforming the best baseline method by 3.8%. Another important observation is that DANR provides a much wider window for selecting a near-optimal \lambda than the baseline approach. We now analyze how the \mu parameter couples with the \lambda parameter, and further present its effect on DANR's performance. Figure 3 displays the prediction accuracy vs. \mu, with \lambda fixed to 10^{0} and 10^{1}. Intuitively, as \lambda increases, the optimal value of \mu also increases. The reason is that as \lambda increases, more non-zero discrepancy-buffering variables (weighted by 1-\mu) are needed to preserve the accurate clustering formation, since high values of \lambda force global consensus over the network.

Figure 4: Prediction accuracy comparison with varying noise.

4.1.2 Robustness to Malicious Edges.

Note that the ratio of inter-community edges in the earlier synthetic network G_{0} is 18%. Here we use the term noise for the ratio of such malicious edges in the network. Since the amount of noise is critical for learning accurate models on a network, we now investigate the efficacy of DANR under varying noise. First, we remove all inter-community edges, and then iteratively add malicious edges back to the network. We solve the same problem with the noise varying from 0 to 0.6 and report the prediction accuracies of all methods in Figure 4. As shown, the performance gap between DANR, NL and FORMULA is negligible in the absence of noise. Meanwhile, the pure multi-task method rMTFL maintains the same low accuracy, since it does not utilize the network information. However, as we increase the noise, DANR starts to outperform NL and FORMULA, and the gap between the models' performances becomes larger at higher ratios of malicious edges. This confirms that, compared to the baseline methods, DANR exhibits more robust performance in noisy settings thanks to its discrepancy-buffering variables.

4.2 Spatial: Housing Rental Price Estimation

In this section, our goal is to jointly (i) estimate the rental prices of houses by leveraging their geographic location and features, and (ii) discover boundaries between neighborhoods. The intuition is that houses in the same neighborhood often tend to follow similar pricing models. However, such neighborhoods can have complex underlying structures (as described later in this section), which imposes additional challenges in learning accurate models.

Dataset. We experiment with two largely populated areas in Northern California: the Greater Sacramento Area (SAC) and the Bay Area (BAY). The anonymized data is provided by the property management software company AppFolio Inc., from which we further sample houses with at least one signed rental agreement during the year 2017. The resulting dataset covers 3849 houses in SAC and 1498 houses in BAY. For houses with more than one rental agreement signed during 2017, we average the rental prices listed in all of their agreements. Each house holds information about its location (latitude/longitude), number of bedrooms, number of bathrooms, square footage, and rental price. We regard these areas (SAC and BAY) as two separate datasets for the remainder of this section. We also randomly hold out 20% of the houses in each dataset for testing.

Network Construction. After excluding test houses from both datasets, we construct two networks (one per dataset) based on the houses' locations. An undirected edge exists between nodes i and j if at least one of them is among the 10 nearest neighbors of the other. (Note that node j being one of the 10 nearest neighbors of i does not necessarily imply that node i is also among the nearest neighbors of j.) In the resulting networks, the average degree of a node is 12.16 for SAC and 12.04 for BAY. Additionally, we construct two versions of these networks, weighted and unweighted. While the weighted network has edge weights inversely proportional to the distances between houses, the unweighted network ignores the proximity between houses, and thus all edge weights are 1.
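A minimal sketch of this construction is shown below (our assumption of the procedure, not the released pipeline; Euclidean distance on the raw coordinates is used for simplicity).

import numpy as np
from sklearn.neighbors import NearestNeighbors

def build_knn_graph(coords, k=10, weighted=True):
    # coords: (n, 2) array of latitude/longitude for the training houses.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(coords)   # +1 because the query point itself is returned
    dist, idx = nn.kneighbors(coords)
    edges = {}
    for i in range(len(coords)):
        for d_ij, j in zip(dist[i, 1:], idx[i, 1:]):        # skip self at position 0
            e = (min(i, j), max(i, j))
            edges[e] = 1.0 / (d_ij + 1e-9) if weighted else 1.0
    return edges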

Optimization Problem. The model at each house estimates its rental price by solving a linear regression problem. More specifically, at each node i we learn a 4-dimensional model \bm{x}_{i}=[a_{i},b_{i},c_{i},d_{i}], which collects the coefficients of each feature and is used to estimate the rental price p_{i} as follows:

\overline{p_{i}} = a_{i}\cdot(\#\text{bedrooms}) + b_{i}\cdot(\#\text{bathrooms}) + c_{i}\cdot(\text{square footage}) + d_{i},

where d_{i} is the bias term. The training objective is

\min_{\bm{x},\bm{\alpha}} \sum_{i\in\mathcal{V}} \|\overline{p_{i}}-p_{i}\|_{2}^{2} + c\cdot\|\bm{x}_{i}\|_{2}^{2} + \lambda\cdot\mathcal{R}_{S}(\bm{x},\bm{\alpha})

where \mathcal{R}_{S} is the proposed network regularization term (see Eq. (2.3)) and c is the regularization weight that prevents over-fitting.

Figure 5: Examples of complex neighborhood structures captured by the DANR. Left: the house shown in yellow resides at the border of three different area codes. Right: the creek-side house (colored in blue) differs from all of its neighbors, possibly due to its appealing location.

Figure 6: Evolution of clustering captured by ST-DANR over the years 2000 (a), 2002 (b), and 2004 (c).

Test Results. Once training converges, we predict the rental prices on the test set. To do so, we connect each house in the test set to its 10 nearest neighbors in the training set. We then infer the new model \bm{x}_{j} by averaging the models of its neighbors: \bm{x}_{j}=(1/|N(j)|)\sum_{k\in N(j)}\bm{x}_{k}. The model \bm{x}_{j} is further used to estimate the rental price of the corresponding house. Alternatively, one can infer \bm{x}_{j} by solving \min_{\bm{x}_{j}}\sum_{k\in N(j)}\|\bm{x}_{j}-\bm{x}_{k}^{*}\|_{2} while keeping the models on the neighbors fixed. However, we empirically find that simply averaging the neighbors' models performs better for both methods in this particular setting. We compute the mean squared error (MSE) over test nodes to evaluate performance.
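The test-time inference described above amounts to the following short sketch (our illustration only; the feature ordering follows the price model defined earlier).

import numpy as np

def predict_test_house(n_bedrooms, n_bathrooms, sqft, neighbor_models):
    # neighbor_models: list of x_k = [a_k, b_k, c_k, d_k] from the 10 nearest training houses.
    a, b, c, d = np.mean(neighbor_models, axis=0)
    return a * n_bedrooms + b * n_bathrooms + c * sqft + d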

Method BAY SAC
Local estimation (\lambda=0) 0.5984 0.6250
Global estimation (\lambda>\lambda_{critical}) 0.4951 0.5403
rMTFL 0.4774 0.4115
FORMULA (unweighted) 0.4446 0.3503
FORMULA (weighted) 0.4181 0.3379
Network Lasso (unweighted) 0.4392 0.3273
Network Lasso (weighted) 0.4173 0.3022
DANR (unweighted) 0.4106 0.2978
Table 1: MSE for housing rental price prediction on test set.

Table 1 displays the test results of eight different settings on both datasets. As shown, the local and global estimations and the rMTFL method produce high errors on both datasets. We further apply the FORMULA and Network Lasso (NL) methods to both the weighted and unweighted versions of the networks, while the proposed method is only applied to the unweighted version. Note that the network weights are irrelevant for the local and global estimation settings and for the rMTFL method. Intuitively, both FORMULA and NL perform better in the weighted setting than in the unweighted setting on both datasets. The rental price estimation errors achieved by NL are 0.4173 (weighted) vs. 0.4392 (unweighted) for BAY and 0.3022 (weighted) vs. 0.3273 (unweighted) for SAC. These results suggest that the pre-defined weights on these networks help to learn more accurate models on houses.

However, we further argue that although such pre-defined weights improve the overall clustering performance, they do not account for more complex clustering scenarios, e.g., two nearby houses falling into different school districts, or some houses having higher values than their neighbors due to geography, e.g., having a view of the city. Indeed, DANR outperforms all the other baselines and achieves the smallest errors on both datasets: 0.4106 for BAY and 0.2978 for SAC. Notably, DANR (unweighted) even outperforms the weighted versions of the baseline approaches by a notable margin. This reveals that DANR indeed accounts for such heterogeneities in the data and provides an enhanced clustering of the network.

Figure 5 shows examples of two complex scenarios captured by DANR in the SAC network. We use a color code to represent the clusters in the network, where houses (i, j) of the same color are in consensus on their models (\bm{x}_{i}=\bm{x}_{j}). In the left subfigure, the house shown in yellow uses a different model than all of its neighbors. Interestingly, it resides at the border of three different area codes: the area code for this house is 95864, while the area code on its west is 95825 and on its south-east is 95826. Additionally, we observe similar heterogeneous behaviors in some houses that are near creeks, lakes, and rivers. As an example, Figure 5 (right) displays a creek-side house (colored in blue), which again uses a different underlying model from its neighbors.

4.3 Spatio-Temporal: Water Quality Estimation of Northeastern US Lakes

We now evaluate our method in the spatio-temporal setting, where the aim is to dynamically estimate the water quality of Northeastern US lakes over the years. We follow the same procedure as in §4.2, but add a temporal regularization term to our objective that allows nodes to obtain signals from past snapshots of the network, along with their neighbors in the current snapshot.

Dataset and Network. We experiment with the geospatial and temporal dataset of lakes in 17 states in the Northeastern United States, called LAGOS [26]. The dataset holds extensive information about the physical characteristics, various nutrient concentration measurements (water quality), ecological context (surrounding land use/cover, climate, etc.), and location of lakes, from which we select the following available set of features: {max depth, surface area, elevation, annual mean temperature, % of agricultural land, % of urban land, % of forest land, % of wetland}. The water quality metric to estimate is the summer mean chlorophyll concentration of the lakes. We represent the feature vector by \bm{w}\in\mathbb{R}^{8} and the water quality score by y\in\mathbb{R}.

During our experiments, we consider the 10-year period between 2000 and 2010. Due to sparsity in the data, we allow a 2-year gap between consecutive snapshots of the network. This results in a total of 1039 lakes with all the aforementioned features and water quality measurements available for each of the selected years. After randomly holding out 20% of the lakes for testing, we build our network using the latitude/longitude information of the lakes, where each lake (node) is connected to its 10 nearest lakes.

Optimization Problem. After constructing the network, we aim to dynamically estimate the water quality y_{i,t} of lakes from their feature vectors \bm{w}_{i,t}. More formally, for each year t\in[2000,2002,\cdots,2010], we learn a model \bm{x}_{i,t}\in\mathbb{R}^{8} per node by solving the following spatio-temporal problem in a streaming fashion:

(4.16) \min_{\bm{x},\bm{\alpha},\bm{\beta}} \sum_{i\in\mathcal{V}} f(\bm{x}_{i,t},\bm{w}_{i,t},y_{i,t}) + \lambda_{1}\cdot\mathcal{R}_{S}^{t}(\bm{x},\bm{\alpha}) + \lambda_{2}\cdot\mathcal{R}_{T}^{t}(\bm{x},\bm{\beta})

where f is the linear regression objective and \mathcal{R}_{S} is the DANR term applied to the spatial edges. \mathcal{R}_{T} is the temporal regularization term, for which we consider the formulations described next.

Test Results. For each year, we first solve the spatial problem (Eq. (4.16) without the temporal term \mathcal{R}_{T}) and report the mean squared errors in Table 2. Note that the test nodes are again inferred by averaging the neighbors' models in the current snapshot. Then, in order to assess the improvements gained by the temporal discrepancy-buffering variables \bm{\beta}, we solve the above spatio-temporal problem with three different temporal regularizers:

DANR+T-SON: DANR with temporal sum-of-norms regularizer, where \mathcal{R}_{T}^{t}(\bm{x},\cdot)=\sum_{i\in\mathcal{V}}\|\bm{x}_{i,t}-\hat{\bm{x}}_{i,t-1}\|_{2}

DANR+T-SOS: DANR with temporal sum-of-squares regularizer, where \mathcal{R}_{T}^{t}(\bm{x},\cdot)=\sum_{i\in\mathcal{V}}\|\bm{x}_{i,t}-\hat{\bm{x}}_{i,t-1}\|_{2}^{2}

ST-DANR: Spatio-temporal discrepancy-aware network regularizer, where \mathcal{R}_{T}^{t}(\bm{x},\bm{\beta})=\mu_{2}\sum_{i\in\mathcal{V}}\|\bm{x}_{i,t}-\hat{\bm{x}}_{i,t-1}+\bm{\beta}_{i,t}\|_{2}+(1-\mu_{2})\|\bm{\beta}\|_{1,p}

DANR+T-SON and DANR+T-SOS apply two widely adopted temporal regularizers to the temporal edges (the sum-of-norms regularizer [19] and the sum-of-squares regularizer [6, 29], respectively), while ST-DANR includes discrepancy-buffering variables on both spatial and temporal edges. Recall that we solve Eq. (4.16) in a streaming fashion for simplicity, i.e., each snapshot of the network solves the problem while holding the models learned on the previous snapshot (if available) fixed. The performance could potentially be improved further by allowing updates on previous snapshots.
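For clarity, the three temporal penalties compared above can be written as plain functions of the current models, the fixed previous models, and (for ST-DANR) the temporal DB variables, as in the minimal sketch below (our illustration only).

import numpy as np

def t_son(x_t, x_prev):                      # sum-of-norms penalty
    return sum(np.linalg.norm(x_t[i] - x_prev[i]) for i in range(len(x_t)))

def t_sos(x_t, x_prev):                      # sum-of-squares penalty
    return sum(np.linalg.norm(x_t[i] - x_prev[i]) ** 2 for i in range(len(x_t)))

def t_st_danr(x_t, x_prev, beta, mu2, p=3):  # discrepancy-aware temporal penalty
    fit = sum(np.linalg.norm(x_t[i] - x_prev[i] + beta[i]) for i in range(len(x_t)))
    group = sum(np.linalg.norm(beta[i], ord=p) for i in range(len(x_t)))
    return mu2 * fit + (1.0 - mu2) * group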

As we can see from Table 2, leveraging the temporal connections between consecutive snapshots significantly improves performance. DANR+T-SON outperforms DANR by 8%-14%, while DANR+T-SOS outperforms DANR by 10%-15%. Moreover, ST-DANR consistently outperforms both baselines for all years. This confirms that the proposed formulation accounts for the heterogeneous nature of temporal networks, where some groups of nodes exhibit different evolution patterns than others.

Method 2000 2002 2004 2006 2008 2010
DANR 0.8479 0.9033 0.6384 0.6722 0.4556 0.4113
+ T-SON N/A 0.8308 0.5744 0.6061 0.4109 0.3517
+ T-SOS N/A 0.8045 0.5618 0.6045 0.3978 0.3476
ST-DANR N/A 0.8037 0.5604 0.5866 0.3844 0.3311
Table 2: MSE for water quality estimation over years.

Figure 6 further displays the models (color-coded) learned on three consecutive snapshots, corresponding to the years 2000, 2002, and 2004. In 2000, the formed clusters do not go much beyond state boundaries, resulting in a coarse clustering of the network. Yet some states, such as Missouri and Iowa, are grouped into one cluster (shown in green). This suggests that accurately estimating the chlorophyll concentration of lakes from the selected set of features is indeed challenging. In particular, the estimation problem depends on several other external factors that potentially affect the volume of plants and algae in lakes, e.g., cultural eutrophication and nutrient inputs from human activities [26]. However, as can be seen from the models learned in 2002 and 2004, the clustering improves once we allow temporal regularization to synchronize models between consecutive snapshots, which is consistent with the reduction in errors over the years reported in Table 2. Overall, we observe a split of clusters over time, e.g., in Wisconsin, Minnesota, Iowa, and New England. Additionally, a dark brown cluster in southern Missouri begins to appear in 2002 and further spreads north in 2004. This indicates that, by leveraging the temporal connections, ST-DANR learns improved models and clusterings while allowing for heterogeneity in the group-level evolution of nodes.

5 CONCLUSIONS

We propose the discrepancy-aware network regularization (DANR) approach for spatio-temporal networks. By introducing discrepancy-buffering variables, DANR automatically compensates for inaccurate prior knowledge encoded in edge weights, and enables modeling heterogeneous temporal patterns of network clusters. We develop a scalable algorithm based on the alternating direction method of multipliers (ADMM) to solve the proposed problem, which enjoys analytic solutions for the decomposed subproblems and employs a distributed update scheme on nodes and edges. Our experimental results show that DANR successfully captures structural changes over spatio-temporal networks and yields enhanced models by providing robustness towards missing or inaccurate edge weights.

6 ACKNOWLEDGMENTS

This project has received funding from the National Science Foundation under grant IIS-1817046 and the Defense Threat Reduction Agency under award HDTRA1-19-1-0017.

References

  • [1] L. Anselin, Under the hood: Issues in the specification and interpretation of spatial regression models, Agricultural economics, 27 (2002).
  • [2] M. Belkin et al., Regularization and semi-supervised learning on large graphs, in COLT, 2004.
  • [3] M. Belkin, P. Niyogi, and V. Sindhwani, Manifold regularization: A geometric framework for learning from labeled and unlabeled examples, Journal of machine learning research, 7 (2006), pp. 2399–2434.
  • [4] S. Boyd et al., Distributed optimization and statistical learning via the alternating direction method of multipliers, Foundations and Trends® in Machine Learning, 3 (2011).
  • [5] E. C. Chi et al., Splitting methods for convex clustering, Journal of Computational and Graphical Statistics, 24 (2015).
  • [6] X.-H. Dang et al., Subnetwork mining with spatial and temporal smoothness, in SDM, 2017.
  • [7] J. Eckstein et al., On the douglas—rachford splitting method and the proximal point algorithm for maximal monotone operators, Mathematical Programming, 55 (1992).
  • [8] D. Gabay and B. Mercier, A dual algorithm for the solution of nonlinear variational problems via finite element approximation, Computers & Mathematics with Applications, (1976).
  • [9] P. Gong, J. Ye, and C. Zhang, Robust multi-task feature learning, in ACM SIGKDD, 2012.
  • [10] N. Guan, D. Tao, Z. Luo, and B. Yuan, Manifold regularized discriminative nonnegative matrix factorization with fast gradient descent, IEEE Transactions on Image Processing, 20 (2011), pp. 2030–2048.
  • [11] D. Hallac et al., Network lasso: Clustering and optimization in large graphs, in ACM SIGKDD, 2015.
  • [12] D. Hallac, Y. Park, S. Boyd, and J. Leskovec, Network inference via the time-varying graphical lasso, 2017.
  • [13] R. Harris, J. Moffat, and V. Kravtsova, In search of ’w’, Spatial Economic Analysis, 6 (2011), pp. 249–270.
  • [14] B. He and X. Yuan, On non-ergodic convergence rate of douglas–rachford alternating direction method of multipliers, Numerische Mathematik, 130 (2015), pp. 567–577.
  • [15] Q. Ho et al., Evolving cluster mixed-membership blockmodel for time-evolving networks, in AISTATS, 2011.
  • [16] T. D. Hocking et al., Clusterpath an algorithm for clustering using convex fusion penalties, in ICML, 2011.
  • [17] M. Kim and J. Leskovec, Nonparametric multi-group membership model for dynamic networks, in NIPS, 2013.
  • [18] C. Li and H. Li, Network-constrained regularization and variable selection for analysis of genomic data, Bioinformatics, 24 (2008), pp. 1175–1182.
  • [19] F. Lindsten et al., Clustering using sum-of-norms regularization, in Statistical Signal Processing Workshop, 2011.
  • [20] Y.-Y. Liu, J.-J. Slotine, and A.-L. Barabási, Controllability of complex networks, Nature, 473 (2011), p. 167.
  • [21] R. Nishihara et al., A general analysis of the convergence of admm, in ICML, 2015, pp. 343–352.
  • [22] J. Pang et al., Graph laplacian regularization for image denoising, IEEE Transactions on Image Processing, (2017).
  • [23] Y. Peng et al., Robust alignment by sparse and low-rank decomposition for linearly correlated images, PAMI, (2012).
  • [24] L. I. Rudin, S. Osher, and E. Fatemi, Nonlinear total variation based noise removal algorithms, Physica D: nonlinear phenomena, 60 (1992).
  • [25] J. Sharpnack, A. Singh, and A. Rinaldo, Sparsistency of the edge lasso over graphs, in AISTATS, 2012.
  • [26] P. A. Soranno et al., Lagos-ne, GigaScience, 6 (2017).
  • [27] R. Tibshirani et al., Sparsity and smoothness via the fused lasso, Journal of the Royal Statistical Society, (2005).
  • [28] J. E. Vogt and V. Roth, A complete analysis of the \ell_{1,p} group-lasso, in ICML, 2012.
  • [29] H. Wang, F. Nie, and H. Huang, Low-rank tensor completion with spatio-temporal consistency, in AAAI, 2014.
  • [30] Y.-X. Wang, J. Sharpnack, A. J. Smola, and R. J. Tibshirani, Trend filtering on graphs, in AISTATS, 2015.
  • [31] K. Q. Weinberger, F. Sha, Q. Zhu, and L. K. Saul, Graph laplacian regularization for large-scale semidefinite programming, in NIPS, 2007.
  • [32] J. Xu et al., Factorized multi-task learning for task discovery in personalized medical models, in SDM, 2015.
  • [33] H.-F. Yu et al., Temporal regularized matrix factorization for high-dimensional time series prediction, in NIPS, 2016.

A RELATED WORKS

Among existing network-based regularization approaches, two major types of edge objectives are most commonly used, the squared-norm objective and the unsquared-norm objective, which encourage different styles of expected solutions. The squared-norm edge objective (i.e., \sum_{jk}\omega_{jk}\|\bm{x}_{j}-\bm{x}_{k}\|_{2}^{2}), well known as graph Laplacian regularization [3], assumes that the underlying models on nodes vary smoothly as one traverses edges in the network, and accordingly enforces similar but not identical models on linked nodes. Owing to its simple computation and good performance, the graph Laplacian regularizer has gained popularity for problems such as image deblurring and denoising [22], genomic data analysis [18], semi-supervised regression [2], non-negative matrix factorization [10], and semidefinite programming [31]. However, the squared-norm objective usually induces very dense models. The unsquared-norm objective, known as the sum-of-norms regularizer [16, 19], bears some similarity to the scalar-valued fused lasso signal approximator [27] used in signal processing applications, for example the total-variation (TV) regularizer for image denoising and the graph trend filtering (GTF) regularizer for nonparametric regression. The TV regularizer [24] penalizes both the horizontal and vertical differences between neighboring pixels, and can be thought of as a special scalar-valued network lasso on a 2D grid network. GTF [30] generalizes the successful idea of trend filtering to graphs by directly penalizing higher-order differences across nodes. All of the regularizers above require correct edge weights so that network structures can be properly used to improve joint estimation. In contrast, our proposed approach allows ambiguous input of network edge weights by introducing the discrepancy-buffering variables. Also, as a variation of SON regularization, our formulation enjoys small model complexity, as local consensus is imposed across large-scale networks.

B Proof of Lemma 2.1

We present the analytic solution for updating (\bm{u}_{jk},\bm{u}_{kj}), introduced in Eq. (2.9). For simplicity, we omit the iteration index l^{\prime} and denote \bm{a}=\bm{x}_{j}+\bm{\delta}_{jk}^{u}, \bm{b}=\bm{x}_{k}+\bm{\delta}_{kj}^{u}, and c=\lambda_{1}\mu_{1}\omega_{jk}.

\operatorname{min}~g(\bm{u}_{jk},\bm{u}_{kj})=c\left\|\bm{u}_{jk}+\bm{\alpha}_{jk}-\bm{u}_{kj}\right\|_{2}
(B.1) \qquad +\frac{\rho_{1}}{2}\left(\left\|\bm{a}-\bm{u}_{jk}\right\|_{2}^{2}+\left\|\bm{b}-\bm{u}_{kj}\right\|_{2}^{2}\right)

We discuss the problem in the following two cases:

Case (1): When the objective function in Eq. (B.1) is differentiable, i.e., \left\|\bm{u}_{jk}+\bm{\alpha}_{jk}-\bm{u}_{kj}\right\|_{2}\neq 0, the first-order optimality condition \nabla g=0 gives:

c\,\frac{\bm{u}_{jk}+\bm{\alpha}_{jk}-\bm{u}_{kj}}{\left\|\bm{u}_{jk}+\bm{\alpha}_{jk}-\bm{u}_{kj}\right\|_{2}}-\rho_{1}(\bm{a}-\bm{u}_{jk})=0
-c\,\frac{\bm{u}_{jk}+\bm{\alpha}_{jk}-\bm{u}_{kj}}{\left\|\bm{u}_{jk}+\bm{\alpha}_{jk}-\bm{u}_{kj}\right\|_{2}}-\rho_{1}(\bm{b}-\bm{u}_{kj})=0

Solving the above linear system, we obtain:

\bm{u}_{jk}=\frac{(c+\tau\rho_{1})\bm{a}+c\bm{b}-c\bm{\alpha}_{jk}}{2c+\tau\rho_{1}},\qquad\bm{u}_{kj}=\frac{c\bm{a}+(c+\tau\rho_{1})\bm{b}+c\bm{\alpha}_{jk}}{2c+\tau\rho_{1}}

where \tau\coloneqq\left\|\bm{u}_{jk}+\bm{\alpha}_{jk}-\bm{u}_{kj}\right\|_{2} still depends on \bm{u}_{jk} and \bm{u}_{kj}. Plugging the expressions for \bm{u}_{jk} and \bm{u}_{kj} back into this definition gives \tau=\|\bm{a}-\bm{b}+\bm{\alpha}_{jk}\|_{2}-2c/\rho_{1}, which involves only fixed quantities. Hence, the solution is:

\bm{u}_{jk}^{\ast}=(1-\theta)\bm{a}+\theta\bm{b}-\theta\bm{\alpha}_{jk}
(B.2) \bm{u}_{kj}^{\ast}=\theta\bm{a}+(1-\theta)\bm{b}+\theta\bm{\alpha}_{jk}

where \theta\coloneqq c/(\rho_{1}\|\bm{a}-\bm{b}+\bm{\alpha}_{jk}\|_{2})\geq 0.

Case (2): When the objective function in Eq. (B.1) is not differentiable, we solve the quadratic problem \operatorname{min}\left\|\bm{a}-\bm{u}_{jk}\right\|_{2}^{2}+\left\|\bm{b}-\bm{u}_{kj}\right\|_{2}^{2} subject to the constraint \|\bm{u}_{jk}+\bm{\alpha}_{jk}-\bm{u}_{kj}\|_{2}=0. This yields \bm{u}_{jk}^{\ast}=(\bm{a}+\bm{b}-\bm{\alpha}_{jk})/2 and \bm{u}_{kj}^{\ast}=(\bm{a}+\bm{b}+\bm{\alpha}_{jk})/2, which coincides with Eq. (B.2) when \theta=1/2.

To determine which case applies, we compare the resulting objective values. Denoting by F_{1}^{\ast} and F_{2}^{\ast} the optimal objective values in the two cases above, their difference satisfies:

\frac{F_{1}^{\ast}-F_{2}^{\ast}}{\|\bm{a}-\bm{b}+\bm{\alpha}_{jk}\|_{2}}=\left(\theta^{2}-\frac{1}{4}\right)\frac{c}{\theta}+\frac{c\tau\rho_{1}}{2c+\tau\rho_{1}}

From the definitions of \tau and \theta, we have \tau\rho_{1}=c(1/\theta-2). Combining the above, we obtain:

\frac{F_{1}^{\ast}-F_{2}^{\ast}}{c\|\bm{a}-\bm{b}+\bm{\alpha}_{jk}\|_{2}}=-\frac{(\theta-\frac{1}{2})^{2}}{\theta}

which shows that F_{1}^{\ast}\leq F_{2}^{\ast}, with equality only when \theta=1/2. Therefore, Eq. (B.2) provides a unified solution for both cases.
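For concreteness, the unified solution in Eq. (B.2) can be implemented directly. The NumPy sketch below is our illustration (not the authors' code); it takes \bm{a}, \bm{b}, \bm{\alpha}_{jk}, the coefficient c=\lambda_{1}\mu_{1}\omega_{jk}, and \rho_{1} as inputs, and clamps \theta at 1/2, which corresponds to Case (2).

import numpy as np

def update_u_pair(a, b, alpha_jk, c, rho1):
    # closed-form minimizer of Eq. (B.1), following Eq. (B.2)
    r = a - b + alpha_jk
    norm_r = np.linalg.norm(r)
    # theta = c / (rho1 * ||a - b + alpha_jk||); theta >= 1/2 falls into Case (2)
    theta = 0.5 if norm_r <= 2.0 * c / rho1 else c / (rho1 * norm_r)
    u_jk = (1.0 - theta) * a + theta * b - theta * alpha_jk
    u_kj = theta * a + (1.0 - theta) * b + theta * alpha_jk
    return u_jk, u_kj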

C Proof of Lemma 2.2

Based on the analyses in [4] and [7], convergence to the global optimum is guaranteed for ADMM algorithms applied to problems of the following form:

(C.3) \operatorname{min}~h(\bm{x}_{0})+g(\bm{z}_{0})\qquad\text{s.t.}~A\bm{x}_{0}+B\bm{z}_{0}=\bm{c}_{0}

where both h and g are convex functions and the constraints are linear. Note that our problem in Eq. (2.2) is equivalent to Eq. (2.5). To invoke the convergence guarantee, we need to fit Eq. (2.5) into the form of Eq. (C.3).

We let \bm{x}_{0}=\bm{x} and \bm{z}_{0}=[\bm{u},\bm{\alpha}], and further define h(\bm{x}_{0})=\sum_{j\in\mathcal{V}}f_{j}(\bm{x}_{j}) and g(\bm{z}_{0})=\lambda_{1}(1-\mu_{1})\|\bm{\alpha}\|_{1,p}+\lambda_{1}\mu_{1}\sum_{(j,k)\in\mathcal{E}}\omega_{jk}\|\bm{u}_{jk}+\bm{\alpha}_{jk}-\bm{u}_{kj}\|_{2}. First, h(\bm{x}_{0}) is convex because each component f_{j}(\bm{x}_{j}) is required to be convex in our problem setting. Second, g(\bm{z}_{0}) is convex in the joint variable [\bm{u},\bm{\alpha}], by the definition of convexity and the triangle inequality. Lastly, the constraint \{[\bm{u}_{jk},\bm{u}_{kj}]=[\bm{x}_{j},\bm{x}_{k}],\ \forall(j,k)\in\mathcal{E}\} is linear in the entries of \bm{u} and \bm{x}. Summing up, Eq. (2.5) fits the form of Eq. (C.3). Then, by the theorems in [4] and [7], our ADMM approach for solving the problem in Eq. (2.2) is guaranteed to converge to the global optimum.
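For reference, the scaled-form ADMM iteration for the generic problem in Eq. (C.3), written with a penalty parameter \rho and a scaled dual variable \bm{\delta}_{0} (notation introduced here only for this illustration; see [4]), follows the same three-step pattern as Eq. (E.5):

\bm{x}_{0}^{(l+1)}=\underset{\bm{x}_{0}}{\operatorname{argmin}}~h(\bm{x}_{0})+\frac{\rho}{2}\|A\bm{x}_{0}+B\bm{z}_{0}^{(l)}-\bm{c}_{0}+\bm{\delta}_{0}^{(l)}\|_{2}^{2}
\bm{z}_{0}^{(l+1)}=\underset{\bm{z}_{0}}{\operatorname{argmin}}~g(\bm{z}_{0})+\frac{\rho}{2}\|A\bm{x}_{0}^{(l+1)}+B\bm{z}_{0}-\bm{c}_{0}+\bm{\delta}_{0}^{(l)}\|_{2}^{2}
\bm{\delta}_{0}^{(l+1)}=\bm{\delta}_{0}^{(l)}+A\bm{x}_{0}^{(l+1)}+B\bm{z}_{0}^{(l+1)}-\bm{c}_{0}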

D Pseudo-code for Solving DANR Problem

[Pseudo-code figure omitted in conversion.]

E Solution of ST-DANR

E.1 Batch Formulation

Similar to the spatial-only case in §2, we define \bm{x}=\{\bm{x}_{j,t}\}_{t=1\ldots M}^{j\in\mathcal{V}} for the model parameter vectors on the temporal undirected network \mathcal{G}, and define the consensus variables

\bm{z}=[\bm{u},\bm{v}]=[\{\bm{u}_{jk,t,0},\bm{u}_{kj,t,0}\}_{t=1\ldots M}^{(j,k)\in\mathcal{E}_{t}},\ \{\bm{v}_{j,t,1}\}_{t=1\ldots M-1}^{j\in\mathcal{V}_{t}},\ \{\bm{v}_{j,t,2}\}_{t=2\ldots M}^{j\in\mathcal{V}_{t}}]

as copies of \bm{x} in the corresponding spatial penalty term \mathcal{R}_{S}(\bm{x},\bm{\alpha}) and temporal penalty term \mathcal{R}_{T}(\bm{x},\bm{\beta}). Then we rewrite the problem as follows:

\underset{\bm{x},\bm{z},\bm{\alpha},\bm{\beta}}{\operatorname{min}}\quad\sum_{t=1}^{M}\sum_{j\in\mathcal{V}}f_{j,t}(\bm{x}_{j,t})
+\lambda_{1}\mu_{1}\sum_{t=1}^{M}\sum_{(j,k)\in\mathcal{E}}\omega_{jk}^{t}\left\|\bm{u}_{jk,t}+\bm{\alpha}_{jk,t}-\bm{u}_{kj,t}\right\|_{2}
+\lambda_{2}\mu_{2}\sum_{j\in\mathcal{V}}\sum_{t=1}^{M-1}\omega_{j}^{t,t+1}\left\|\bm{v}_{j,t,1}+\bm{\beta}_{j,t}-\bm{v}_{j,t+1,2}\right\|_{2}
(E.4) +\lambda_{1}(1-\mu_{1})\|\bm{\alpha}\|_{1,p}+\lambda_{2}(1-\mu_{2})\|\bm{\beta}\|_{1,p}
\text{s.t.}\quad[\bm{u}_{jk,t},\bm{u}_{kj,t}]=[\bm{x}_{j,t},\bm{x}_{k,t}],\quad t=1,\ldots,M
[\bm{v}_{j,t,1},\bm{v}_{j,t+1,2}]=[\bm{x}_{j,t},\bm{x}_{j,t+1}],\quad t=1,\ldots,M-1

The iterative scheme of ADMM under the above setting is modified as follows, with l denoting the iteration index:

(E.5) \bm{x}^{(l+1)}=\underset{\bm{x}}{\operatorname{argmin}}~\mathcal{L}_{\rho_{1},\rho_{2}}(\bm{x},(\bm{z},\bm{\alpha},\bm{\beta},\bm{\delta})^{(l)})
(\bm{z},\bm{\alpha},\bm{\beta})^{(l+1)}=\underset{\bm{z},\bm{\alpha},\bm{\beta}}{\operatorname{argmin}}~\mathcal{L}_{\rho_{1},\rho_{2}}(\bm{x}^{(l+1)},\bm{z},\bm{\alpha},\bm{\beta},\bm{\delta}^{(l)})
\bm{\delta}^{(l+1)}=\bm{\delta}^{(l)}+(\tilde{\bm{x}}^{(l+1)}-\bm{z}^{(l+1)})

where \tilde{\bm{x}}=[\{\bm{x}_{j,t},\bm{x}_{k,t}\}_{t=1\ldots M}^{(j,k)\in\mathcal{E}_{t}},\ \{\bm{x}_{j,t}\}_{t=1\ldots M-1}^{j\in\mathcal{V}_{t}},\ \{\bm{x}_{j,t}\}_{t=2\ldots M}^{j\in\mathcal{V}_{t}}] is composed of replicated elements of \bm{x}.

\bm{x}-Update. We decompose the minimization over \bm{x} into separate minimizations over \bm{x}_{j,t} for each node j at each snapshot t:

\bm{x}_{j,t}^{(l+1)}=\underset{\bm{x}_{j,t}}{\operatorname{argmin}}\Big(f_{j,t}(\bm{x}_{j,t})+\frac{\rho_{1}}{2}\sum_{k\in N_{j,t}}\|\bm{x}_{j,t}-\bm{u}_{jk,t}^{(l)}+\bm{\delta}_{jk,t}^{u(l)}\|_{2}^{2}
(E.6) +\frac{\rho_{2}}{2}\|\bm{x}_{j,t}-\bm{v}_{j,t,1}^{(l)}+\bm{\delta}^{v(l)}_{j,t,1}\|_{2}^{2}+\frac{\rho_{2}}{2}\|\bm{x}_{j,t}-\bm{v}_{j,t,2}^{(l)}+\bm{\delta}^{v(l)}_{j,t,2}\|_{2}^{2}\Big)
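As an illustration, if the node loss is quadratic, e.g. f_{j,t}(\bm{x}_{j,t})=\frac{1}{2}\|\bm{x}_{j,t}-\bm{y}_{j,t}\|_{2}^{2} for a hypothetical local observation \bm{y}_{j,t}, then Eq. (E.6) has a closed form: a weighted average of the observation and the (shifted) spatial and temporal copies. The NumPy sketch below is ours, written under that quadratic-loss assumption.

import numpy as np

def x_update_quadratic(y, u_copies, delta_u, v1, delta_v1, v2, delta_v2, rho1, rho2):
    # closed-form solution of Eq. (E.6) when f_{j,t}(x) = 0.5 * ||x - y||_2^2
    num = np.array(y, dtype=float)
    den = 1.0
    for u, d in zip(u_copies, delta_u):            # spatial copies, one per neighbor k in N_{j,t}
        num += rho1 * (u - d)
        den += rho1
    for v, d in ((v1, delta_v1), (v2, delta_v2)):  # the two temporal copies v_{j,t,1}, v_{j,t,2}
        num += rho2 * (v - d)
        den += rho2
    return num / den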

(\bm{z},\bm{\alpha},\bm{\beta})-Update. The (\bm{z},\bm{\alpha},\bm{\beta})-update step in Eq. (E.5) decouples into a spatial (\bm{u},\bm{\alpha})-update and a temporal (\bm{v},\bm{\beta})-update. For the spatial (\bm{u},\bm{\alpha})-update, we have:

(\bm{u}_{jk,t}^{(l^{\prime}+1)},\bm{u}_{kj,t}^{(l^{\prime}+1)})=\operatorname{argmin}\Big(\lambda_{1}\mu_{1}\omega_{jk}^{t}\|\bm{u}_{jk,t}+\bm{\alpha}_{jk,t}^{(l^{\prime})}-\bm{u}_{kj,t}\|_{2}
+\frac{\rho_{1}}{2}\|\bm{x}_{j,t}^{(l+1)}-\bm{u}_{jk,t}+\bm{\delta}^{u(l)}_{jk,t}\|_{2}^{2}
+\frac{\rho_{1}}{2}\|\bm{x}_{k,t}^{(l+1)}-\bm{u}_{kj,t}+\bm{\delta}^{u(l)}_{kj,t}\|_{2}^{2}\Big)
\bm{\alpha}_{jk,t}^{(l^{\prime}+1)}=\operatorname{argmin}\Big(\mu_{1}\omega_{jk}^{t}\|\bm{u}_{jk,t}^{(l^{\prime}+1)}+\bm{\alpha}_{jk,t}-\bm{u}_{kj,t}^{(l^{\prime}+1)}\|_{2}
+(1-\mu_{1})\|\bm{\alpha}_{jk,t}\|_{p}\Big)

We can follow the same path to solve the temporal (\bm{v},\bm{\beta})-update, which is omitted here.

\bm{\delta}-Update. In the last step of our ADMM updating scheme, we have fully independent update rules for each scaled dual variable \bm{\delta}_{jk,t}^{u}, \bm{\delta}_{j,t,1}^{v}, and \bm{\delta}_{j,t,2}^{v}, as follows:

(E.7) {\bm{\delta}_{jk,t}^{u}}^{(l+1)}={\bm{\delta}_{jk,t}^{u}}^{(l)}+(\bm{x}_{j,t}^{(l+1)}-\bm{u}_{jk,t}^{(l+1)})
{\bm{\delta}_{j,t,1}^{v}}^{(l+1)}={\bm{\delta}_{j,t,1}^{v}}^{(l)}+(\bm{x}_{j,t}^{(l+1)}-\bm{v}_{j,t,1}^{(l+1)})
{\bm{\delta}_{j,t+1,2}^{v}}^{(l+1)}={\bm{\delta}_{j,t+1,2}^{v}}^{(l)}+(\bm{x}_{j,t+1}^{(l+1)}-\bm{v}_{j,t+1,2}^{(l+1)})

E.2 Streaming Case Formulation

Following the same notation as in the batch formulation of ST-DANR, we formally present the streaming-case formulation:

\underset{\bm{x},\bm{\alpha},\bm{\beta}}{\operatorname{min}}\quad\sum_{t=\tau+1}^{\tau+m}\sum_{j\in\mathcal{V}}f_{j,t}(\bm{x}_{j,t})
+\lambda_{1}(1-\mu_{1})\|\bm{\alpha}\|_{1,p}+\lambda_{2}(1-\mu_{2})\|\bm{\beta}\|_{1,p}
+\lambda_{1}\mu_{1}\sum_{t=\tau+1}^{\tau+m}\sum_{(j,k)\in\mathcal{E}}\omega_{jk}^{t}\left\|\bm{x}_{j,t}+\bm{\alpha}_{jk,t}-\bm{x}_{k,t}\right\|_{2}
+\lambda_{2}\mu_{2}\sum_{j\in\mathcal{V}}\Big(\omega_{j}^{\tau,\tau+1}\left\|\hat{\bm{x}}_{j,\tau}+\bm{\beta}_{j,\tau}-\bm{x}_{j,\tau+1}\right\|_{2}
+\sum_{t=\tau+1}^{\tau+m-1}\omega_{j}^{t,t+1}\left\|\bm{x}_{j,t}+\bm{\beta}_{j,t}-\bm{x}_{j,t+1}\right\|_{2}\Big)

where \|\bm{\alpha}\|_{1,p}=\sum_{(j,k)\in\mathcal{E}}\sum_{t=\tau+1}^{\tau+m}\|\bm{\alpha}_{jk,t}\|_{p} and \|\bm{\beta}\|_{1,p}=\sum_{j\in\mathcal{V}}\sum_{t=\tau}^{\tau+m-1}\|\bm{\beta}_{j,t}\|_{p}. We then fit the problem into the classic two-block ADMM framework, as we did for the batch formulation of ST-DANR:

(E.8) \underset{\bm{x},\bm{u},\bm{v},\bm{\alpha},\bm{\beta}}{\operatorname{min}}~\sum_{t=\tau+1}^{\tau+m}\sum_{j\in\mathcal{V}}f_{j,t}(\bm{x}_{j,t})
+\lambda_{1}(1-\mu_{1})\|\bm{\alpha}\|_{1,p}+\lambda_{2}(1-\mu_{2})\|\bm{\beta}\|_{1,p}
+\lambda_{1}\mu_{1}\sum_{t=\tau+1}^{\tau+m}\sum_{(j,k)\in\mathcal{E}}\omega_{jk}^{t}\left\|\bm{u}_{jk,t}+\bm{\alpha}_{jk,t}-\bm{u}_{kj,t}\right\|_{2}
+\lambda_{2}\mu_{2}\sum_{j\in\mathcal{V}}\Big(\omega_{j}^{\tau,\tau+1}\left\|\hat{\bm{x}}_{j,\tau}+\bm{\beta}_{j,\tau}-\bm{v}_{j,\tau+1,2}\right\|_{2}
+\sum_{t=\tau+1}^{\tau+m-1}\omega_{j}^{t,t+1}\left\|\bm{v}_{j,t,1}+\bm{\beta}_{j,t}-\bm{v}_{j,t+1,2}\right\|_{2}\Big)
\text{s.t.}\quad[\bm{u}_{jk,t},\bm{u}_{kj,t}]=[\bm{x}_{j,t},\bm{x}_{k,t}],\quad t=\tau+1,\ldots,\tau+m
[\bm{v}_{j,t,1},\bm{v}_{j,t+1,2}]=[\bm{x}_{j,t},\bm{x}_{j,t+1}],\quad t=\tau+1,\ldots,\tau+m-1

where \hat{\bm{x}}_{\tau} denotes the estimated models at the \tau-th snapshot. Note that the presence of \bm{\beta}_{j,\tau} and \bm{v}_{j,\tau+1,2} makes Eq. (E.8) differ from the batch formulation in Eq. (E.4), which requires modifying the (\bm{v},\bm{\beta})-updates with respect to these two variables:

(\bm{v}_{j,\tau+1,2}^{(l+1)},\bm{\beta}_{j,\tau}^{(l+1)})=\operatorname{argmin}\Big(\lambda_{2}\mu_{2}\omega_{j}^{\tau,\tau+1}\left\|\hat{\bm{x}}_{j,\tau}+\bm{\beta}_{j,\tau}-\bm{v}_{j,\tau+1,2}\right\|_{2}
+\frac{\rho_{2}}{2}\|\bm{x}_{j,\tau+1}^{(l+1)}-\bm{v}_{j,\tau+1,2}+\bm{\delta}^{v(l)}_{j,\tau+1,2}\|_{2}^{2}+\lambda_{2}(1-\mu_{2})\|\bm{\beta}_{j,\tau}\|_{p}\Big)

This subproblem can still be solved by alternating minimization with the following iterative scheme:

\bm{v}_{j,\tau+1,2}^{(l^{\prime}+1)}=\operatorname{argmin}\Big(\lambda_{2}\mu_{2}\omega_{j}^{\tau,\tau+1}\left\|\hat{\bm{x}}_{j,\tau}+\bm{\beta}_{j,\tau}^{(l^{\prime})}-\bm{v}_{j,\tau+1,2}\right\|_{2}
+\frac{\rho_{2}}{2}\|\bm{x}_{j,\tau+1}^{(l+1)}-\bm{v}_{j,\tau+1,2}+\bm{\delta}^{v(l)}_{j,\tau+1,2}\|_{2}^{2}\Big)
\bm{\beta}_{j,\tau}^{(l^{\prime}+1)}=\operatorname{argmin}\Big(\mu_{2}\omega_{j}^{\tau,\tau+1}\left\|\hat{\bm{x}}_{j,\tau}+\bm{\beta}_{j,\tau}-\bm{v}_{j,\tau+1,2}^{(l^{\prime}+1)}\right\|_{2}
+(1-\mu_{2})\|\bm{\beta}_{j,\tau}\|_{p}\Big)

Each step above is a simple optimization problem; a sketch of both steps is given below. The updating steps for all remaining variables are the same as in the batch formulation.
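As an illustration (ours, not the authors' implementation), the \bm{v}-update above is a proximal step of the \ell_{2}-norm (block soft-thresholding), and, under the additional assumption p=2, the \bm{\beta}-update reduces to picking either \bm{\beta}=\bm{0} or \bm{\beta}=\bm{v}-\hat{\bm{x}}, whichever gives the smaller objective:

import numpy as np

def v_boundary_update(x_hat, beta, x_next, delta_v, omega, lam2, mu2, rho2):
    # argmin_v  c * ||(x_hat + beta) - v||_2 + (rho2 / 2) * ||x_next - v + delta_v||_2^2
    c = lam2 * mu2 * omega
    q = x_hat + beta
    r = (x_next + delta_v) - q
    nr = np.linalg.norm(r)
    shrink = max(0.0, 1.0 - c / (rho2 * nr)) if nr > 0 else 0.0
    return q + shrink * r                      # block soft-thresholding toward q

def beta_boundary_update(x_hat, v, omega, mu2):
    # argmin_b  mu2 * omega * ||x_hat + b - v||_2 + (1 - mu2) * ||b||_2   (assumes p = 2)
    d = x_hat - v
    # the optimum lies at an endpoint of the segment between 0 and -d
    return -d if mu2 * omega > (1.0 - mu2) else np.zeros_like(d)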

F Running Time and Scalability

# nodes   DANR       NL         rMTFL      FORMULA
100       4.18 s     1.41 s     3.32 s     7.38 s
500       15.10 s    5.75 s     7.88 s     24.62 s
1000      48.85 s    12.36 s    53.91 s    62.68 s
5000      146.78 s   37.99 s    221.52 s   446.21 s
10000     484.68 s   123.81 s   -          -

Table 3: Running time on synthetic datasets as the number of nodes in the network varies ('-' means the method did not finish).

To test the scalability of our method, we create synthetic datasets in the same way as in §4.1, varying the number of nodes in the graph from 100 to 10000. Moreover, we tune the probability of adding edges within and between clusters so that the degree of each node is maintained at 20 for all graph sizes. We compare the running times of all methods in the table above. rMTFL and FORMULA fail to produce a result when the graph has 10000 nodes. DANR and NL are both distributed methods, so their running times increase almost linearly with the number of nodes. Compared with NL, DANR takes longer to produce more robust results, which suggests a trade-off between complexity and robustness.