
Learning Product Graphs Underlying
Smooth Graph Signals

Muhammad Asad Lodhi and Waheed U. Bajwa

MAL and WUB are with the Department of Electrical and Computer Engineering, Rutgers, The State University of New Jersey, Piscataway, NJ 08854 (Emails: {masad.lodhi, waheed.bajwa}@rutgers.edu). This research is supported in part by the National Science Foundation under grants CCF-1453073 and CCF-1910110, and by the Army Research Office under grant W911NF-17-1-0546.
Abstract

Real-world data is oftentimes associated with irregular structures that can analytically be represented as graphs. Having access to this graph, which is sometimes trivially evident from domain knowledge, provides a better representation of the data and facilitates various information processing tasks. In cases where the underlying graph is unavailable, however, it needs to be learned from the data itself for data representation, data processing, and inference purposes. Existing literature on learning graphs from data has mostly considered arbitrary graphs, whereas the graphs generating real-world data tend to have additional structure that can be incorporated into the graph learning procedure. Structure-aware graph learning methods require learning fewer parameters and have the potential to reduce computational, memory, and sample complexities. In light of this, the focus of this paper is to devise a method to learn structured graphs from data that are given in the form of product graphs. Product graphs arise naturally in many real-world datasets and provide an efficient and compact representation of large-scale graphs through several smaller factor graphs. To this end, the graph learning problem is first posed as a linear program, which (on average) outperforms the state-of-the-art graph learning algorithms. This formulation is of independent interest in itself, as it shows that graph learning is possible through a simple linear program. Afterwards, an alternating minimization-based algorithm aimed at learning various types of product graphs is proposed, and local convergence guarantees to the true solution are established for this algorithm. Finally, the performance gains, reduced sample complexity, and inference capabilities of the proposed algorithm over existing methods are validated through numerical simulations on synthetic and real datasets.

Index Terms:
Cartesian product, graph Laplacians, graph signals, Kronecker product graphs, product graphs, smooth graph signals, strong product graphs, tensor

I Introduction

Graph signal processing (GSP) is an emerging field in data science and machine learning that aims to generalize existing information processing methods to data that live on an irregular domain. This underlying irregular domain can be represented as a graph, and analysis of signals on the vertices of this graph, aptly named graph signals, is enabled by the graph shift operator (GSO). Recent developments in GSP have already established that GSO-based data processing outperforms classical signal processing for several common tasks such as noise removal, signal filtering, wavelet representations, etc. [1, 2, 3, 4, 5]. The GSO is at the core of graph signal processing and can refer to either the adjacency matrix or one of the many types of Laplacian matrices associated with a graph. The exact choice of the GSO depends on the signal domain and the application of interest. The eigenvectors of the GSO provide bases for the spectral analysis of graph signals and generalize the concept of bandlimited signals to the graph domain [1, 2, 3, 4, 5]. The GSO also facilitates the synthesis of graph-based filters [6, 7] and plays a pivotal role in the description of the notion of smoothness for graph signals [1, 2, 3, 4, 5].

The underlying graph (and hence the GSO) for some real-world datasets is either known a priori, or can trivially be constructed through domain knowledge. As an example, consider weather data collected over a region. In this example, different weather stations would act as nodes, their observations as graph signals, and one (possible) way to construct the graph would be to connect physically adjacent nodes. For most real-world data, however, such a trivially constructed graph is either suboptimal or cannot be constructed in the first place due to lack of precise knowledge about the data generation process. This presents the need to learn the true underlying graph from the data itself. In this regard, the problem of graph learning from the observed data (i.e., graph signals) has gained a lot of attention in recent years [6, 8, 7, 9, 10, 11, 12, 1].

Graph learning refers to the problem of learning an unknown underlying graph from observed graph signals by exploiting some property of the graph signals. Traditional approaches for graph learning have proposed algorithms whose best-case complexity scales quadratically with the number of nodes in the graph [12, 10, 13, 9, 11]. These approaches might be suitable for learning small graphs, but even for moderately sized graphs the learning cost would be prohibitive. Moreover, for learning an arbitrary graph (Laplacian), the number of parameters one needs to learn also scales quadratically with the number of nodes. Both of these problems hinder the amenability of traditional graph learning approaches to large-scale real-world datasets. Our work on graph learning, in contrast, hinges on the fact that real-world data is often generated over graphs that have an additional inherent structure. This inherent structure is dictated by the way the data is acquired, by the arrangement of the sensors, or by the inherent relation of the variables being observed [3]. Moreover, this inherent structure of the graph also presents itself in the associated GSO, which can incidentally be represented in terms of the product of several smaller factor GSOs. In this paper, we focus on three such structures that can be described in terms of three different products, termed the Kronecker, Cartesian, and strong products. Although aware of the presence of these product structures in real-world graphs (and the associated GSOs) [3], the research community has yet to propose algorithms that incorporate the graph product structure into the graph learning procedure.*

*During the course of revising this paper, we became aware of a recent work [14] that aims to learn a Cartesian-structured graph from data. That work targets a different objective function than ours, and our method additionally addresses the problems of learning Kronecker and strong product graphs, both of which fall outside the scope of the method proposed in [14].

Additionally, as the number of free parameters scales quadratically with the number of nodes in the graph, and given the massive nature of the datasets available today, it has become imperative to devise methods that fully utilize the product structure of graphs to reduce the number of parameters to be learned. Moreover, posing the problem in terms of smaller factor graphs instead of the graph itself can enable efficient data representation [3], and can result in reduced sample complexity as one has to learn fewer parameters. To this end, the main objective of this work is to investigate the problem of learning product graphs from data in an efficient and scalable manner.

I-A Prior work

The existing works in graph signal processing can mainly be divided into four chronological categories. The first set of works in GSP introduced the idea of information processing over graphs [4, 2, 5]. These works highlight the advantages and superior performance of the graph-based signal processing approach (with a known underlying graph) over classical signal processing. The second wave of research in this area built upon the first one to exploit knowledge of the underlying graph for graph signal recovery from samples obtained over a subset of graph nodes or from noisy observations of all nodes [15, 16]. Through these works, the idea of bandlimitedness was extended to graph signals and the concept of smooth graph signals was introduced. Since the underlying graph is not always available beforehand, the third wave of GSP analyzed the problem of recovering the underlying graph from observations over the graph [6, 8, 7, 9, 10, 11, 12]. Finally, the fourth wave of research in GSP has focused on joint identification of the underlying graph and graph signals from samples/noisy observations using interrelated properties of graph signals and graphs [17, 18, 19, 12].

Within the third set of papers in GSP, our work falls in the category of combinatorial graph Laplacian estimation [9, 10, 11, 12] from graph signals. The combinatorial graph Laplacian refers to the unnormalized Laplacian of an unstructured graph with no self-loops [11]. The earliest work in this category [12] aims to jointly denoise noisy graph signals observed at the nodes of the graph and learn the graph from these denoised graph signals. The authors pose this problem as a multiconvex problem in the graph Laplacian and the denoised graph signals, and then solve it via an alternating minimization approach that solves a convex problem in each unknown. The authors in [8] examine the problem of graph Laplacian learning when the eigenvectors of the graph Laplacian are known beforehand. They achieve this by formulating a convex program to learn a valid graph Laplacian (from a feasible set) that is diagonalized by the noiseless and noisy Laplacian eigenvectors.

The work in [9] takes a slightly different route and learns a sparse unweighted graph Laplacian matrix from noisy graph signals through an alternating minimization approach that restricts the number of edges in the graph. In contrast to earlier work, [6] focuses on learning a graph diffusion process from observations of stationary signals on graphs through convex formulations. In this regard, the authors also consider different criteria in addition to searching for a valid Laplacian, and devise specialized algorithms to infer the diffusion processes under these criteria. In [13, 10] the graph learning problem is addressed by posing it in terms of learning a sparse weighted adjacency matrix from the observed graph signals. Finally, the authors in [11] provide a comprehensive unifying framework for inferring several types of graph Laplacians from graph signals. They also make connections with the state of the art and describe where past works fit in light of their proposed framework.

It should be mentioned here that since Laplacian matrices (and thus adjacency matrices) are related to precision matrices, defined as the (pseudo-)inverses of covariance matrices, imposing a structure on the graph adjacency matrix amounts to imposing a structure on the covariance of data. Earlier works in the field have already made comparisons of Laplacian learning approaches with those for learning precision matrices from data [12, 11], and established the superior graph recovery performance of Laplacian-based learning. Some recent works have also investigated learning structured covariance and precision matrices, and their usefulness in efficiently representing real-world datasets [20, 21, 22]. While these models work well in practice, we will demonstrate through our experiments that there are scenarios where graph-based learning outperforms structured covariance-based learning (see Sec. VI for details).

I-B Our contributions

Our first contribution in this work is a novel formulation of the graph learning problem as a linear program. We show, both theoretically and empirically, that graph adjacency matrices can be learned through a simple and fast linear program. We then shift our attention towards learning structured graphs. Most of the prior works regarding graph learning have considered arbitrary graphs with either some connectivity constraints [10], or no constraints at all [7, 12]. In all cases, the complexity of the graph learning procedure and the number of free parameters scale quadratically with the number of nodes in the graph, which can be prohibitively large in real-world scenarios. In contrast, our work focuses on inferring the underlying graph from graph signals in the context of structured graphs. Specifically, we investigate graphs that can be represented as Kronecker, Cartesian, and strong products of several smaller graphs. We first show how, for these product graphs, the graph adjacency matrix, the graph Laplacian, the graph Fourier transform, and the graph smoothness measure can be represented with far fewer parameters than required for arbitrary graphs. This reduction in number of parameters to be learned results in reduced sample complexity and helps avoid overfitting. Afterwards, we outline an algorithm to learn these product graphs from the data and provide convergence guarantees for the proposed algorithm in terms of the estimation error of factor graphs. We validate the performance of our algorithm with numerical experiments on both synthetic and real data.

I-C Notation and organization

The following notational convention is used throughout the rest of this paper. We use bold lower-case and bold upper-case letters to represent vectors and matrices, respectively. Calligraphic letters are used to represent tensors, which are arrays of three or more dimensions. For a tensor $\mathcal{T}$, $\mathcal{T}_{(i)}$ represents its matricization (flattening) in the $i$-th mode and $\mathrm{vec}(\mathcal{T})$ represents its vectorization along the first mode [23]. Also, "$:$" represents the scalar product or double-dot product between two tensors, which results in a scalar [23]. The Hadamard (elementwise) product of two vectors or matrices is denoted by "$\circ$". For a matrix $\mathbf{A}$, $\|\mathbf{A}\|_F$ represents its Frobenius norm, $\|\mathbf{A}\|$ represents its spectral norm, and $\mathbf{A}^{\dagger}$ represents its Moore–Penrose pseudoinverse. Moreover, $\|\mathbf{A}\|_1$ represents the $\ell_1$-norm of the entries of $\mathbf{A}$, while $\|\mathbf{A}\|_{1,\text{off}}$ represents the $\ell_1$-norm of the off-diagonal entries of $\mathbf{A}$. The Kronecker and Cartesian products of two matrices $\mathbf{A}$ and $\mathbf{B}$ are denoted by $\mathbf{A}\otimes\mathbf{B}$ and $\mathbf{A}\oplus\mathbf{B}$, respectively [24]. The strong product of two matrices is denoted by $\mathbf{A}\boxtimes\mathbf{B}=\mathbf{A}\otimes\mathbf{B}+\mathbf{A}\oplus\mathbf{B}$, i.e., the sum of the Kronecker and Cartesian products of the respective matrices. Furthermore, $\underset{\mathbf{s}}{\otimes}$, $\underset{\mathbf{s}}{\oplus}$, and $\underset{\mathbf{s}}{\boxtimes}$ respectively denote the Kronecker, Cartesian, and strong products taken over the indices provided by the entries of the vector $\mathbf{s}$. Finally, $\times_i$ represents matrix multiplication in the $i$-th mode of a tensor and $\underset{\mathbf{s}}{\times}$ represents matrix multiplications in the modes of a tensor specified by the entries of the vector $\mathbf{s}$.
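The three matrix products above can be realized with ordinary Kronecker products; the following minimal NumPy sketch (toy path-graph adjacency matrices, illustrative only) shows the Cartesian product as a Kronecker sum and verifies that the strong product is the sum of the other two:

```python
import numpy as np

# Hypothetical small factor adjacency matrices (symmetric, zero diagonal).
A = np.array([[0., 1.],
              [1., 0.]])                    # 2-node path graph
B = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])                # 3-node path graph

I_A, I_B = np.eye(2), np.eye(3)

kron = np.kron(A, B)                        # Kronecker product A ⊗ B
cart = np.kron(A, I_B) + np.kron(I_A, B)    # Cartesian product A ⊕ B
strong = kron + cart                        # strong product A ⊠ B

# The 2- and 3-node factors yield a 6-node product graph.
assert strong.shape == (6, 6)
assert np.allclose(strong, kron + cart)
```

Note that all three products of an $n_1\times n_1$ and an $n_2\times n_2$ factor produce an $n_1 n_2 \times n_1 n_2$ adjacency matrix, which is why product structure compresses large graphs into small factors.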

We use $\mathrm{diag}(\mathbf{x})$ to denote a diagonal matrix with diagonal entries given by the entries of the vector $\mathbf{x}$, $\mathbf{1}$ to denote a vector of all ones of appropriate length, and $\mathds{1}$ to denote a tensor of all ones of appropriate size. We denote the set $\{1,2,\dots,K\}$ as $[K]$, and $[K]\setminus k$ represents the same set without the element $k$. The sets of valid Laplacian and weighted adjacency matrices are represented by $\mathcal{L}$ and $\mathcal{W}$, respectively. The set of weighted adjacency matrices with any product structure is denoted by $\mathcal{W}_p$.

The rest of this paper is organized as follows. In Sec. II we give a probabilistic formulation of the graph learning problem in line with the existing literature. Then, we propose our novel formulation of the graph learning problem as a linear program in Sec. III. Sec. IV describes the motivation for product graphs and formulates the graph learning problem in the context of product graphs. In Sec. V we propose an algorithm for learning product graphs from data and derive error bounds on the estimated factor graphs. We present our numerical experiments with synthetic and real datasets in Sec. VI, and the paper is concluded in Sec. VII.

II Probabilistic Problem formulation

In this section, we formulate the arbitrary graph learning problem from a probabilistic standpoint. Let us assume access to $M_0$ graph signals $\mathbf{x}_m\in\mathbb{R}^n$, $m=1,\dots,M_0$, observed on the $n$ nodes of an undirected graph $G=\{V,E\}$ without any self-loops, where $V$ and $E$ represent the nodes and edges of the graph. The weighted edges of this graph can be represented by a weighted adjacency matrix $\mathbf{W}\in\mathbb{R}^{n\times n}$, which has a zero diagonal owing to the absence of self-loops in the graph. Based on the adjacency matrix $\mathbf{W}$, one can define the degree matrix $\mathbf{D}=\mathrm{diag}(\mathbf{W}\mathbf{1})$, a diagonal matrix containing the weighted degree of each node at the respective diagonal entry. The associated unnormalized graph Laplacian can then be defined as $\mathbf{L}=\mathbf{D}-\mathbf{W}$. The adjacency matrix $\mathbf{W}$ of the graph can be decomposed as $\mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^T$, and its eigenvectors define the graph Fourier basis for the graph Fourier transform [3].
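The construction of the Laplacian from the adjacency matrix can be sketched in a few lines of NumPy (toy 3-node graph, illustrative only); the assertions check the defining properties of a valid Laplacian used throughout the paper:

```python
import numpy as np

# A hypothetical weighted adjacency matrix for a 3-node undirected graph
# (symmetric, zero diagonal since there are no self-loops).
W = np.array([[0., 2., 1.],
              [2., 0., 0.],
              [1., 0., 0.]])

D = np.diag(W @ np.ones(3))   # degree matrix D = diag(W 1)
L = D - W                     # unnormalized (combinatorial) graph Laplacian

# L is symmetric PSD with rows summing to zero and nonpositive off-diagonals.
assert np.allclose(L, L.T)
assert np.allclose(L @ np.ones(3), 0)
assert np.all(np.linalg.eigvalsh(L) >= -1e-10)
```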

The signals observed on the nodes of the graph are assumed to have a joint multivariate normal distribution, i.e., $\mathbf{x}_m\sim\mathcal{N}(\mathbf{0},\mathbf{L}^{\dagger})$, where $\mathbf{L}^{\dagger}$ is the pseudoinverse of the graph Laplacian $\mathbf{L}$. In words, signals generated over a graph can be seen as being generated over a Gaussian Markov random field (GMRF) whose precision matrix is the graph Laplacian [12]. Given independent observations $\{\mathbf{x}_m\}$, the maximum likelihood estimate (MLE) of $\mathbf{L}$ can be expressed as:

$$
\begin{aligned}
\widehat{\mathbf{L}}_{\text{MLE}} &= \underset{\mathbf{L}\in\mathcal{L}}{\operatorname{arg\,max}}~|\mathbf{L}|^{\frac{M_0}{2}}\exp\Big(-\frac{1}{2}\sum_{m=1}^{M_0}\mathbf{x}_m^T\mathbf{L}\mathbf{x}_m\Big) \\
&= \underset{\mathbf{L}\in\mathcal{L}}{\operatorname{arg\,min}}~-\log|\mathbf{L}|+\frac{1}{M_0}\sum_{m=1}^{M_0}\mathbf{x}_m^T\mathbf{L}\mathbf{x}_m,
\end{aligned}
\tag{1}
$$

where $\mathcal{L}$ represents the class of valid Laplacians, i.e., symmetric positive semidefinite matrices with rows that sum to zero and nonpositive off-diagonal entries. With the Laplacian constraints made explicit, the problem in (1) can be further expressed as:

$$
\begin{aligned}
\widehat{\mathbf{L}}_{\text{MLE}} = \underset{\mathbf{L}}{\operatorname{arg\,min}}&~-\log|\mathbf{L}|+\frac{1}{M_0}\sum_{m=1}^{M_0}\mathbf{x}_m^T\mathbf{L}\mathbf{x}_m \\
\text{s.t. }&~\mathbf{L}\mathbf{1}=\mathbf{0},~\mathrm{trace}(\mathbf{L})=n,~(\mathbf{L})_{ij}=(\mathbf{L})_{ji}\leq 0.
\end{aligned}
\tag{2}
$$
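As an illustration of this generative model, smooth graph signals can be drawn from the degenerate Gaussian $\mathcal{N}(\mathbf{0},\mathbf{L}^{\dagger})$ by filtering white noise through the nonzero part of the spectrum of $\mathbf{L}^{\dagger}$; the toy 4-node cycle graph below is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy Laplacian of a 4-node cycle graph (hypothetical example).
W = np.array([[0., 1., 0., 1.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [1., 0., 1., 0.]])
L = np.diag(W.sum(axis=1)) - W

# Square root of L^dagger from the nonzero eigenvalues of L.
lam, U = np.linalg.eigh(L)
nz = lam > 1e-10
half_cov = U[:, nz] * (1.0 / np.sqrt(lam[nz]))

# Columns of X are independent draws from N(0, L^dagger).
X = half_cov @ rng.standard_normal((int(nz.sum()), 1000))

# The GMRF is degenerate: every signal is orthogonal to the all-ones vector,
# the null space of L.
assert np.allclose(X.T @ np.ones(4), 0, atol=1e-8)
```

Signals drawn this way concentrate their energy on the low-frequency eigenvectors of $\mathbf{L}$, which is exactly the smoothness assumption underlying the MLE in (1).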

Interactions in the real world tend to be mostly local, and thus not all nodes in a graph are connected to each other in real-world datasets. To impose only local interactions, a sparsity term regularizing the off-diagonal entries of the Laplacian matrix is usually added to the graph learning objective to learn sparse graphs. Therefore, traditional graph learning approaches [9, 10, 11, 12, 13] take a form similar to the following:

$$
\begin{aligned}
\widehat{\mathbf{L}}_{\text{REG}} = \underset{\mathbf{L}}{\operatorname{arg\,min}}&~-\log|\mathbf{L}|+\frac{\alpha}{M_0}\sum_{m=1}^{M_0}\mathbf{x}_m^T\mathbf{L}\mathbf{x}_m+\beta\|\mathbf{L}\|_{1,\text{off}} \\
\text{s.t. }&~\mathbf{L}\mathbf{1}=\mathbf{0},~\mathrm{trace}(\mathbf{L})=n,~(\mathbf{L})_{ij}=(\mathbf{L})_{ji}\leq 0,
\end{aligned}
\tag{3}
$$

where $\|\mathbf{L}\|_{1,\text{off}}$ represents a sparsity penalty on the off-diagonal entries of $\mathbf{L}$, and the parameters $\alpha$ and $\beta$ control the penalty on the quadratic term and the density of the graph, respectively. In the following section, we show that the traditional graph learning problem can be significantly simplified and that graphs can actually be learned through a simple linear program.

III Graph learning as a linear program

Let us start by inspecting the traditional graph learning problem in (2). In particular, let us first focus on the term $\log|\mathbf{L}|$ in the objective function and the constraint $\mathrm{trace}(\mathbf{L})=n$. We can express the log-determinant term in the objective as $\log|\mathbf{L}|=\sum_{i=1}^{n}\log\lambda_i$, where $\lambda_i$ is the $i$-th largest eigenvalue of $\mathbf{L}$. Thus, through the $\log|\mathbf{L}|$ term, (2) constrains the spectrum of the Laplacian matrix to be estimated. However, the constraint $\mathrm{trace}(\mathbf{L})=\sum_{i=1}^{n}\lambda_i=n$ already puts a hard constraint on the sum of the eigenvalues of the Laplacian matrix. Moreover, the constraint $\mathbf{L}\mathbf{1}=\mathbf{0}$ forces the smallest eigenvalue of the Laplacian to be zero. In the presence of these constraints, the log-determinant regularization in the objective function is no longer necessary to arrive at a valid Laplacian matrix or to avoid trivial solutions. Another advantage of removing the log-determinant term is the massive savings in computational complexity, as this term forces one to employ a singular value decomposition at each step of the learning algorithm [21, 22].
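The two spectral facts used in this argument can be checked numerically on any valid Laplacian (the 3-node weighted graph below is an illustrative assumption):

```python
import numpy as np

# Any valid Laplacian: trace(L) equals the sum of eigenvalues, and L1 = 0
# pins the smallest eigenvalue at zero, so the constraints in (2) already
# control the spectrum without a log-determinant term.
W = np.array([[0., 1., 2.],
              [1., 0., 1.],
              [2., 1., 0.]])
L = np.diag(W.sum(axis=1)) - W

lam = np.linalg.eigvalsh(L)                 # ascending eigenvalues
assert np.isclose(np.trace(L), lam.sum())   # trace = sum of eigenvalues
assert np.isclose(lam[0], 0.0)              # L1 = 0 gives eigenvalue 0
```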

Let us also examine the term $\sum_{m=1}^{M_0}\mathbf{x}_m^T\mathbf{L}\mathbf{x}_m$ in the objective in (2). This term comes from the likelihood of the observed signals with the Laplacian as the precision matrix, and also represents the sum of the Dirichlet energy, or "smoothness," of the observed graph signals [13, 12, 11]. It has been shown in the existing literature [13] that this term can be expressed as a weighted sparsity regularization on the graph adjacency matrix: $\sum_{m=1}^{M_0}\mathbf{x}_m^T\mathbf{L}\mathbf{x}_m=\mathrm{trace}(\mathbf{X}^T\mathbf{L}\mathbf{X})=\frac{1}{2}\|\mathbf{W}\circ\mathbf{Z}\|_1$. Here $\mathbf{X}$ is the data matrix with $\mathbf{x}_m$ as its $m$-th column, and $\mathbf{Z}$ is the matrix of pairwise distances between the rows of $\mathbf{X}$ such that the $(i,j)$-th entry of $\mathbf{Z}$ is the squared Euclidean distance between the $i$-th and $j$-th rows of $\mathbf{X}$. This implies that the sum of the Dirichlet energy in the objective implicitly regularizes the sparsity of $\mathbf{W}$ and thus controls the density of edges in the graph. Therefore, the presence of this term in the objective eliminates the need to explicitly regularize the sparsity of the graph to be learned.
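The identity between the Dirichlet energy and the weighted sparsity of $\mathbf{W}$ can be verified numerically; the sketch below assumes the convention of [13], where $\mathbf{Z}$ holds pairwise squared Euclidean distances between the rows of $\mathbf{X}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, M = 5, 20
W = rng.random((n, n))
W = np.triu(W, 1)
W = W + W.T                                  # random symmetric adjacency
L = np.diag(W.sum(axis=1)) - W               # Laplacian
X = rng.standard_normal((n, M))              # columns are graph signals

# Z[i, j] = squared Euclidean distance between rows i and j of X; under this
# convention trace(X^T L X) = 0.5 * ||W ∘ Z||_1.
Z = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
assert np.isclose(np.trace(X.T @ L @ X), 0.5 * np.sum(W * Z))
```

Intuitively, a large weight $W_{ij}$ is penalized exactly when the signals at nodes $i$ and $j$ differ, which is what discourages dense graphs.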

In light of the preceding discussion, we propose to solve the following linear program [25] for learning graphs:

$$
\begin{aligned}
\widehat{\mathbf{L}} = \underset{\mathbf{L}}{\min}&~\frac{\alpha}{M_0}\sum_{m=1}^{M_0}\mathbf{x}_m^T\mathbf{L}\mathbf{x}_m \\
\text{s.t. }&~\mathbf{L}\mathbf{1}=\mathbf{0},~\mathrm{trace}(\mathbf{L})=n,~(\mathbf{L})_{ij}=(\mathbf{L})_{ji}\leq 0,
\end{aligned}
\tag{4}
$$

where $\alpha$ is a regularization parameter that controls the smoothness of the graph signals and thus the sparsity of edges in the graph.

III-A Fast solver for the graph learning linear program

We now present an algorithm, named Graph Learning with Linear Programming (GLP), for solving the linear graph learning problem (4). To proceed, note that the objective term in the graph learning problem can be reformulated as:

$$
\begin{aligned}
\frac{1}{M_0}\sum_{m=1}^{M_0}\mathbf{x}_m^T\mathbf{L}\mathbf{x}_m
&= \frac{1}{M_0}\sum_{m=1}^{M_0}\big(\mathbf{x}_m^T\mathbf{D}\mathbf{x}_m-\mathbf{x}_m^T\mathbf{W}\mathbf{x}_m\big) \\
&= \frac{1}{M_0}\sum_{m=1}^{M_0}\big(\mathbf{x}_m^T\mathrm{diag}(\mathbf{W}\mathbf{1})\mathbf{x}_m-\mathbf{x}_m^T\mathbf{W}\mathbf{x}_m\big) \\
&= \frac{1}{M_0}\sum_{m=1}^{M_0}\big(\mathbf{x}_m^T\mathrm{diag}(\mathbf{x}_m)\mathbf{W}\mathbf{1}-\mathbf{x}_m^T\mathbf{W}\mathbf{x}_m\big) \\
&= \frac{1}{M_0}\sum_{m=1}^{M_0}\big(\bar{\mathbf{x}}_m^T\mathbf{W}\mathbf{1}-\mathbf{x}_m^T\mathbf{W}\mathbf{x}_m\big) \\
&= \frac{1}{M_0}\sum_{m=1}^{M_0}\mathrm{trace}\big(\mathbf{W}\mathbf{1}\bar{\mathbf{x}}_m^T-\mathbf{W}\mathbf{x}_m\mathbf{x}_m^T\big) \\
&= \frac{1}{M_0}\mathrm{trace}\Big[\mathbf{W}\sum_{m=1}^{M_0}\mathbf{1}\bar{\mathbf{x}}_m^T-\mathbf{W}\sum_{m=1}^{M_0}\mathbf{x}_m\mathbf{x}_m^T\Big] \\
&= \mathrm{trace}\Big[\mathbf{W}\Big(\sum_{m=1}^{M_0}\mathbf{1}\bar{\mathbf{x}}_m^T-\sum_{m=1}^{M_0}\mathbf{x}_m\mathbf{x}_m^T\Big)\Big/M_0\Big] \\
&= \mathrm{trace}(\mathbf{W}\widetilde{\mathbf{S}})=\mathrm{vec}(\widetilde{\mathbf{S}})^T\mathrm{vec}(\mathbf{W})=\widetilde{\mathbf{s}}^T\mathbf{M}\mathbf{w},
\end{aligned}
\tag{5}
$$

where the matrix $\widetilde{\mathbf{S}}=\big(\sum_{m=1}^{M_0}\mathbf{1}\bar{\mathbf{x}}_m^T-\sum_{m=1}^{M_0}\mathbf{x}_m\mathbf{x}_m^T\big)/M_0$, the vector $\bar{\mathbf{x}}_m^T=\mathbf{x}_m^T\mathrm{diag}(\mathbf{x}_m)=\mathbf{x}_m^T\circ\mathbf{x}_m^T$, $\mathrm{vec}(\mathbf{W})=\mathbf{M}\mathbf{w}$, $\mathbf{w}$ is a vector of the distinct elements of the upper-triangular part of the symmetric matrix $\mathbf{W}$, and $\mathbf{M}$ is the duplication matrix that duplicates the elements of $\mathbf{w}$ to generate a vectorized version of $\mathbf{W}$. With this rearrangement of the objective and $\mathbf{W}$, our graph learning problem can be posed as:

$$
\widehat{\mathbf{w}} = \underset{\mathbf{w}}{\min}~\alpha\,\widetilde{\mathbf{s}}^T\mathbf{M}\mathbf{w}\quad\text{s.t. }\mathbf{A}\mathbf{w}=\mathbf{b},~\mathbf{w}_i\geq 0,~i\in F,
\tag{6}
$$

where $\mathbf{A}$ is a matrix that expresses the equality constraints from (4) as equality constraints on $\mathbf{w}$, $\mathbf{b}=[\mathbf{0}^T,n]^T$, and $F$ is the set containing the indices of the off-diagonal elements in $\mathbf{w}$. Once the solution $\widehat{\mathbf{w}}$ is obtained, it can be converted to the symmetric adjacency matrix $\widehat{\mathbf{W}}$, which can then be used to obtain $\widehat{\mathbf{L}}$.
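The chain of identities in (5) and the duplication-matrix encoding can be checked numerically. In the sketch below, the construction of $\mathbf{M}$ uses one illustrative indexing convention (column-major vectorization, upper-triangular stacking including the diagonal); the paper's exact convention may differ:

```python
import numpy as np

rng = np.random.default_rng(1)
n, M0 = 4, 30
W = rng.random((n, n))
W = np.triu(W, 1)
W = W + W.T                                  # symmetric adjacency, zero diag
L = np.diag(W.sum(axis=1)) - W               # Laplacian
X = rng.standard_normal((n, M0))             # x_m are the columns of X

# S~ and s~ = vec(S~) from the derivation in (5).
xbar = X * X                                 # columns are x_m ∘ x_m
S = (np.ones((n, 1)) @ xbar.sum(axis=1, keepdims=True).T - X @ X.T) / M0
s = S.flatten(order='F')                     # column-major vec(S~)

# A hypothetical duplication matrix M with vec(W) = M w, where w stacks the
# upper-triangular (including diagonal) entries of W row by row.
idx = [(i, j) for i in range(n) for j in range(i, n)]
M = np.zeros((n * n, len(idx)))
for k, (i, j) in enumerate(idx):
    M[j * n + i, k] = 1                      # entry (i, j) of vec(W)
    M[i * n + j, k] = 1                      # ...and its mirror (j, i)
w = np.array([W[i, j] for (i, j) in idx])

lhs = np.mean([X[:, m] @ L @ X[:, m] for m in range(M0)])
assert np.allclose(M @ w, W.flatten(order='F'))   # vec(W) = M w
assert np.isclose(lhs, np.trace(W @ S))           # trace step in (5)
assert np.isclose(lhs, s @ M @ w)                 # final linear form in (5)
```

This confirms that the smoothness objective is linear in $\mathbf{w}$, which is what makes the formulation in (6) a linear program.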

The standard way of solving a linear program with mixed (equality and inequality) constraints is through interior-point methods, whose complexity scales quadratically with the problem dimension [25]. A better alternative is to deploy a first-order method whose per-iteration complexity is linear in the number of nonzero entries of $\mathbf{A}$. However, a first-order method would exhibit slow convergence for a linear program because of the lack of smoothness and strong convexity in linear programs [26]. To overcome these issues, linear programs have been solved through the Alternating Direction Method of Multipliers (ADMM) [27, 26]. To solve our proposed linear formulation of graph learning, we follow a recent algorithm proposed in [26]. This ADMM-based algorithm for linear programs uses a new variable-splitting scheme that achieves a convergence rate of $\mathcal{O}(\|\mathbf{A}\|^2\log(1/\epsilon))$, where $\epsilon$ is the desired accuracy of the solution. To this end, we start by modifying the original graph learning problem through the introduction of an additional variable $\mathbf{y}$ as follows:

$$
\widehat{\mathbf{w}} = \underset{\mathbf{w}}{\min}~\mathbf{c}^T\mathbf{w}\quad\text{s.t. }\mathbf{A}\mathbf{w}=\mathbf{b},~\mathbf{w}=\mathbf{y},~\mathbf{y}_i\geq 0,~i\in F,
\tag{7}
$$

where $\mathbf{c}=\alpha\mathbf{M}^T\widetilde{\mathbf{s}}$. The corresponding augmented Lagrangian can then be expressed as follows:

$$
L(\mathbf{w},\mathbf{y},\mathbf{z}) = \mathbf{c}^T\mathbf{w} + h(\mathbf{y}) + \mathbf{z}^T\big(\mathbf{A}_{\mathbf{w}}\mathbf{w}+\mathbf{A}_{\mathbf{y}}\mathbf{y}-\widetilde{\mathbf{b}}\big) + \frac{\rho}{2}\big\|\mathbf{A}_{\mathbf{w}}\mathbf{w}+\mathbf{A}_{\mathbf{y}}\mathbf{y}-\widetilde{\mathbf{b}}\big\|_2^2,
\tag{8}
$$

where $h(\mathbf{y})$ encodes the nonnegativity constraint on the entries of $\mathbf{y}$ indexed by $F$, i.e., $h(\mathbf{y})=0$ when $y_i\geq 0$ for all $i\in F$, and $h(\mathbf{y})=\infty$ otherwise. Moreover, $\mathbf{z}=[\mathbf{z}_{\mathbf{w}}^T,\mathbf{z}_{\mathbf{y}}^T]^T$, where $\mathbf{z}_{\mathbf{w}}$ and $\mathbf{z}_{\mathbf{y}}$ are the Lagrange multipliers, $\mathbf{A}_{\mathbf{w}}=[\mathbf{A}^T,\mathbf{I}]^T$, $\mathbf{A}_{\mathbf{y}}=[\mathbf{0},-\mathbf{I}]^T$, and finally $\widetilde{\mathbf{b}}=[\mathbf{b}^T,\mathbf{0}^T]^T$. One can then use ADMM to iterate through the steps outlined in Algorithm 1 until convergence to obtain $\widehat{\mathbf{w}}$.

Algorithm 1 : GLP—ADMM for graph learning with linear programming

Input: Observations $\{\mathbf{x}_m\}_{m=1}^{M_0}$, maximum iterations $T_0$, and parameters $\alpha,\rho>0$
Initialize: $\mathbf{y}^{(1)}\leftarrow\mathbf{0}$, $\mathbf{z}^{(1)}\leftarrow\mathbf{1}$

  for $t=1$ to $T_0$
      $\mathbf{e}^{(t+1)}\leftarrow-\mathbf{A}_{\mathbf{w}}^T\big[\mathbf{z}^{(t)}+\rho\big(\mathbf{A}_{\mathbf{y}}\mathbf{y}^{(t)}-\widetilde{\mathbf{b}}\big)\big]-\mathbf{c}$
      $\mathbf{w}^{(t+1)}\leftarrow\rho^{-1}\big(\mathbf{I}+\mathbf{A}^T\mathbf{A}\big)^{-1}\mathbf{e}^{(t+1)}$
      $\mathbf{y}^{(t+1)}\leftarrow\big[\mathbf{w}^{(t+1)}+\mathbf{z}_{\mathbf{y}}^{(t)}/\rho\big]_{\underset{F}{\geqslant}\mathbf{0}}$
      $\mathbf{z}^{(t+1)}\leftarrow\mathbf{z}^{(t)}+\rho\big(\mathbf{A}_{\mathbf{w}}\mathbf{w}^{(t+1)}+\mathbf{A}_{\mathbf{y}}\mathbf{y}^{(t+1)}-\widetilde{\mathbf{b}}\big)$
  end

Output: Final adjacency estimate $\widehat{\mathbf{w}}\leftarrow\mathbf{w}^{(T_0+1)}$

In Algorithm 1, $[\cdot]_{\underset{F}{\geqslant}\mathbf{0}}$ is an entrywise thresholding operator that projects the entries with indices in $F$ onto the nonnegative orthant. As we can see from the algorithm, all updates have closed-form solutions, and the most computationally expensive step is the $\mathbf{w}^{(t+1)}$ update, which involves a matrix inversion. This matrix inversion, however, can be computed efficiently using the identity $(\mathbf{I}+\mathbf{A}^T\mathbf{A})^{-1}=\mathbf{I}-\mathbf{A}^T(\mathbf{I}+\mathbf{A}\mathbf{A}^T)^{-1}\mathbf{A}$. Since $\mathbf{A}$ is a fat (wide) matrix, $\mathbf{A}\mathbf{A}^T$ has smaller dimensions than $\mathbf{A}^T\mathbf{A}$. Moreover, one can easily see that $\mathbf{A}\mathbf{A}^T$ is a matrix of dimensions $(n+1)\times(n+1)$, with

𝐀𝐀T\displaystyle\mathbf{A}\mathbf{A}^{T} =[cn𝟏T𝟏𝐈].\displaystyle=\begin{bmatrix}c_{n}&\mathbf{1}^{T}\\ \mathbf{1}&\mathbf{I}\end{bmatrix}. (9)

where cn=2n2nc_{n}=2n^{2}-n. In addition, the inverse only needs to be computed once at the start of the algorithm since this matrix is deterministic and depends only on the size of the adjacency matrix being estimated.
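This inversion identity can be verified numerically; the NumPy sketch below uses a random fat matrix purely as a stand-in for the actual constraint matrix 𝐀 (the identity holds for any real matrix), and the small (m × m) inverse is the one that would be precomputed once.

```python
import numpy as np

# Sanity check of the push-through identity used to accelerate the w-update:
# (I + A^T A)^(-1) = I - A^T (I + A A^T)^(-1) A  for any real matrix A.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 30))   # fat matrix: far fewer rows than columns
m, N = A.shape

direct = np.linalg.inv(np.eye(N) + A.T @ A)   # N x N inverse (expensive)
small = np.linalg.inv(np.eye(m) + A @ A.T)    # m x m inverse (cheap)
via_identity = np.eye(N) - A.T @ small @ A

assert np.allclose(direct, via_identity)
```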

III-B Parameter and computational complexities

The number of parameters that one needs to estimate in order to learn a graph adjacency matrix is n(n+1)2\frac{n(n+1)}{2}. This implies that the number of unknown parameters scales quadratically with the number of nodes in the graph. Additionally, the per-iteration computational complexity of the proposed method also scales quadratically with the number of nodes [26]. The same computational and memory complexities also hold for the existing state-of-the-art graph learning algorithms [12, 10, 13, 9, 11]. However, while these complexities are manageable for small graphs, for real-world datasets with even a few hundred nodes current methods become prohibitively expensive. To overcome these issues, we next examine the problem of learning product graphs.

IV Product graphs: Problem formulation & advantages in data representation

In this section we briefly review product graphs and their implications towards graph learning. We investigate how product graphs provide a way to efficiently represent graphs with a huge number of nodes, and we revisit the notion of smoothness of signals over product graphs.

Let us consider K0K_{0} graphs Gk={Vk,Ek}G_{k}=\{V_{k},E_{k}\}, for k=1,,K0k=1,\dots,K_{0}, where VkV_{k} and EkE_{k} represent the vertices and edges of the kk-th graph. The product of these graphs would result in a product graph G={V,E}G=\{V,E\}, with VV and EE representing the vertices and edges of the resultant graph. The three most commonly investigated graph products and their respective adjacency matrices are discussed below. Note that graph adjacency matrices are considered in this work because each kind of product structure is directly reflected in adjacency matrix of the resultant graph.

IV-A Kronecker graphs

For the Kronecker product of graphs GkG_{k}, for k=1,,K0k=1,\dots,K_{0}, with adjacency matrices 𝐖k\mathbf{W}_{k}, the Kronecker product graph can be expressed as G=[K0]Gk=GK0GK01G1G=\underset{[K_{0}]}{\otimes}G_{k}=G_{K_{0}}\otimes G_{K_{0}-1}\otimes\dots\otimes G_{1}. The respective Kronecker-structured adjacency matrix of the resultant graph can be written in terms of the component/factor adjacency matrices as 𝐖=[K0]𝐖k\mathbf{W}=\underset{[K_{0}]}{\otimes}\mathbf{W}_{k}. Additionally, if each factor adjacency matrix 𝐖k\mathbf{W}_{k} is expressed via its eigenvalue decomposition (EVD) as 𝐖k=𝐔k𝚲k𝐔kT\mathbf{W}_{k}=\mathbf{U}_{k}\boldsymbol{\Lambda}_{k}\mathbf{U}_{k}^{T}, then the Kronecker adjacency matrix can be written as (using properties in [3]):

𝐖\displaystyle\mathbf{W} =(𝐔K0𝚲K0𝐔K0T)(𝐔1𝚲1𝐔1T)\displaystyle=(\mathbf{U}_{K_{0}}\boldsymbol{\Lambda}_{K_{0}}\mathbf{U}_{K_{0}}^{T})\otimes\dots\otimes(\mathbf{U}_{1}\boldsymbol{\Lambda}_{1}\mathbf{U}_{1}^{T})
=([K0]𝐔k)([K0]𝚲k)([K0]𝐔kT)=𝐔𝚲kron𝐔T.\displaystyle=(\underset{[K_{0}]}{\otimes}\mathbf{U}_{k})~{}(\underset{[K_{0}]}{\otimes}\boldsymbol{\Lambda}_{k})~{}(\underset{[K_{0}]}{\otimes}\mathbf{U}_{k}^{T})=\mathbf{U}\boldsymbol{\Lambda}_{\mathrm{kron}}\mathbf{U}^{T}. (10)

One can see that both the eigenmatrix and the eigenvalue matrix of the Kronecker adjacency matrix have a Kronecker structure in terms of the component eigenmatrices and component eigenvalue matrices, respectively. Given that the number of edges in the kk-th component graph is |Ek||E_{k}|, the number of edges in the Kronecker graph is |E|=2K01k=1K0|Ek||E|=2^{K_{0}-1}\overset{K_{0}}{\underset{k=1}{\prod}}|E_{k}|.

An example of a Kronecker product graph is the bipartite graph of a recommendation system like Netflix [28], where the graph between users and movies can be seen as a Kronecker product of two smaller factor graphs. In fact, the adjacency matrix of any bipartite graph can be represented in terms of a Kronecker product of appropriate factor matrices [29]. Since adjacency matrices are closely related to precision matrices (inverse covariance matrices), and the inverse of a Kronecker product is the Kronecker product of the inverses [3], imposing a Kronecker structure on the adjacency matrix also amounts to imposing a Kronecker structure on the covariance matrix of the data.
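The Kronecker eigenstructure in (10) and the edge-count formula above can be checked with a small NumPy example; the path-graph factors below are chosen purely for illustration.

```python
import numpy as np

def path_graph(n):
    """Adjacency matrix of an n-node path graph (illustrative factor graph)."""
    W = np.zeros((n, n))
    idx = np.arange(n - 1)
    W[idx, idx + 1] = W[idx + 1, idx] = 1.0
    return W

W1, W2 = path_graph(3), path_graph(4)
W = np.kron(W2, W1)   # Kronecker product adjacency (G2 x G1 ordering, as in the text)

# EVD of each symmetric factor adjacency matrix
l1, U1 = np.linalg.eigh(W1)
l2, U2 = np.linalg.eigh(W2)

# Eigenmatrix and eigenvalues of the product inherit the Kronecker structure
U = np.kron(U2, U1)
lam = np.kron(l2, l1)
assert np.allclose(U @ np.diag(lam) @ U.T, W)

# Edge count: |E| = 2^(K0-1) * prod_k |E_k| with K0 = 2 factors here
edges = lambda A: int(np.count_nonzero(A) // 2)
assert edges(W) == 2 * edges(W1) * edges(W2)
```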

The optimization problem in (III) can be specialized to the case of learning Kronecker graphs by explicitly imposing the Kronecker product structure on the adjacency matrix and posing the problems in terms of the individual factor adjacency matrices, rather than the bigger adjacency matrix produced after the product. This leads us to the following nonconvex problem for learning Kronecker graphs:

min{𝐖k𝒲}k=1K0\displaystyle\underset{\{\mathbf{W}_{k}\in\mathcal{W}\}_{k=1}^{K_{0}}}{\min} αM0trace[[[K0]𝐖k](m=1M0𝟏𝐱¯mTm=1M0𝐱m𝐱mT)].\displaystyle\frac{\alpha}{M_{0}}\mathrm{trace}\Bigg{[}\big{[}\underset{[K_{0}]}{\otimes}\mathbf{W}_{k}\big{]}\Big{(}\underset{m=1}{\overset{M_{0}}{\sum}}\mathbf{1}\bar{\mathbf{x}}_{m}^{T}-\underset{m=1}{\overset{M_{0}}{\sum}}\mathbf{x}_{m}\mathbf{x}_{m}^{T}\Big{)}\Bigg{]}. (11)

IV-B Cartesian graphs

The Cartesian product (also called the Kronecker sum) of graphs GkG_{k} is represented as G=[K0]Gk=GK0GK01G1G=\underset{[K_{0}]}{\oplus}G_{k}=G_{K_{0}}\oplus G_{K_{0}-1}\oplus\dots\oplus G_{1}. The corresponding Cartesian adjacency matrix can be written in terms of the component adjacency matrices as 𝐖=[K0]𝐖k\mathbf{W}=\underset{[K_{0}]}{\oplus}\mathbf{W}_{k}. Furthermore, with the EVD of the component adjacency matrices, the Cartesian adjacency matrix can be decomposed as [3]:

𝐖\displaystyle\mathbf{W} =(𝐔K0𝚲K0𝐔K0T)(𝐔1𝚲1𝐔1T)\displaystyle=(\mathbf{U}_{K_{0}}\boldsymbol{\Lambda}_{K_{0}}\mathbf{U}_{K_{0}}^{T})\oplus\dots\oplus(\mathbf{U}_{1}\boldsymbol{\Lambda}_{1}\mathbf{U}_{1}^{T})
=([K0]𝐔k)([K0]𝚲k)([K0]𝐔kT)=𝐔𝚲cart𝐔T.\displaystyle=(\underset{[K_{0}]}{\otimes}\mathbf{U}_{k})~{}(\underset{[K_{0}]}{\oplus}\boldsymbol{\Lambda}_{k})~{}(\underset{[K_{0}]}{\otimes}\mathbf{U}_{k}^{T})=\mathbf{U}\boldsymbol{\Lambda}_{\mathrm{cart}}\mathbf{U}^{T}. (12)

This means that the eigenmatrix and the eigenvalue matrix of the Cartesian adjacency matrix are represented, respectively, as Kronecker and Cartesian products of component eigenmatrices and eigenvalue matrices. The number of edges in the Cartesian graph can be found as |E|=∑k=1K0(∏i∈[K0]∖kni)|Ek||E|=\sum_{k=1}^{K_{0}}\big{(}\prod_{i\in[K_{0}]\setminus k}n_{i}\big{)}|E_{k}|, where |Ek||E_{k}| represents the number of edges and nkn_{k} represents the number of vertices in the kk-th component graph.

A typical example of a Cartesian product graph arises with images. Images reside on two-dimensional rectangular grids that can be represented as the Cartesian product of two line graphs pertaining to the rows and columns of the image [3]. A social network can also be approximated as a Cartesian product of an inter-community graph with an intra-community graph [3].
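The Kronecker-sum construction and the Cartesian edge-count formula can likewise be verified numerically, again with illustrative path-graph factors:

```python
import numpy as np

def path_graph(n):
    """Adjacency matrix of an n-node path graph (illustrative factor graph)."""
    W = np.zeros((n, n))
    idx = np.arange(n - 1)
    W[idx, idx + 1] = W[idx + 1, idx] = 1.0
    return W

W1, W2 = path_graph(3), path_graph(4)
n1, n2 = W1.shape[0], W2.shape[0]

# Cartesian product adjacency = Kronecker sum of the factor adjacencies
W = np.kron(W2, np.eye(n1)) + np.kron(np.eye(n2), W1)

l1, U1 = np.linalg.eigh(W1)
l2, U2 = np.linalg.eigh(W2)
U = np.kron(U2, U1)                                        # Kronecker eigenmatrix
lam = np.kron(l2, np.ones(n1)) + np.kron(np.ones(n2), l1)  # Kronecker sum of eigenvalues
assert np.allclose(U @ np.diag(lam) @ U.T, W)

# Edge count for two factors: |E| = n2*|E1| + n1*|E2|
edges = lambda A: int(np.count_nonzero(A) // 2)
assert edges(W) == n2 * edges(W1) + n1 * edges(W2)
```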

Similar to the previous discussion, the optimization problem in (III) can be specialized to learning Cartesian graphs by explicitly imposing the Cartesian structure and posing the problem in terms of the factor adjacency matrices as follows:

min{𝐖k𝒲}k=1K0\displaystyle\underset{\{\mathbf{W}_{k}\in\mathcal{W}\}_{k=1}^{K_{0}}}{\min} αM0trace[[[K0]𝐖k](m=1M0𝟏𝐱¯mTm=1M0𝐱m𝐱mT)].\displaystyle\frac{\alpha}{M_{0}}\mathrm{trace}\Bigg{[}\big{[}\underset{[K_{0}]}{\oplus}\mathbf{W}_{k}\big{]}\Big{(}\underset{m=1}{\overset{M_{0}}{\sum}}\mathbf{1}\bar{\mathbf{x}}_{m}^{T}-\underset{m=1}{\overset{M_{0}}{\sum}}\mathbf{x}_{m}\mathbf{x}_{m}^{T}\Big{)}\Bigg{]}. (13)

IV-C Strong graphs

The strong product of graphs GkG_{k} can be represented as G=[K0]Gk=GK0GK01G1G=\underset{[K_{0}]}{\boxtimes}G_{k}=G_{K_{0}}\boxtimes G_{K_{0}-1}\boxtimes\dots\boxtimes G_{1}. The respective strong adjacency matrix of the resultant strong graph is given in terms of the component adjacency matrices as 𝐖=[K0]𝐖k\mathbf{W}=\underset{[K_{0}]}{\boxtimes}\mathbf{W}_{k}, and can be further expressed as:

𝐖\displaystyle\mathbf{W} =(𝐔K0𝚲K0𝐔K0T)(𝐔1𝚲1𝐔1T)\displaystyle=(\mathbf{U}_{K_{0}}\boldsymbol{\Lambda}_{K_{0}}\mathbf{U}_{K_{0}}^{T})\boxtimes\dots\boxtimes(\mathbf{U}_{1}\boldsymbol{\Lambda}_{1}\mathbf{U}_{1}^{T})
=([K0]𝐔k)([K0]𝚲k)([K0]𝐔kT)=𝐔𝚲str𝐔T,\displaystyle=(\underset{[K_{0}]}{\otimes}\mathbf{U}_{k})~{}(\underset{[K_{0}]}{\boxtimes}\boldsymbol{\Lambda}_{k})~{}(\underset{[K_{0}]}{\otimes}\mathbf{U}_{k}^{T})=\mathbf{U}\boldsymbol{\Lambda}_{\mathrm{str}}\mathbf{U}^{T}, (14)

in terms of EVD of the component adjacency matrices.

The strong product graph can be seen as the sum of the Kronecker and Cartesian products of the factor adjacency matrices. An example of data conforming to a strong product graph is a spatiotemporal sensor network graph, which consists of a strong product of a spatial graph and a temporal graph (representing the temporal dependencies of the sensors). The spatial graph has as many nodes as the number of sensors in the sensor network and represents the spatial distribution of sensors. On the other hand, the temporal graph has as many nodes as the number of temporal observations of the whole sensor network and represents the overall temporal dynamics (changes in connectivity over time) of the network [3].
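A small numerical check that, for two factors, the strong product adjacency is the sum of the Kronecker and Cartesian product adjacencies; the closed form (W₂+I)⊗(W₁+I)−I is an equivalent way of writing the same matrix.

```python
import numpy as np

def path_graph(n):
    """Adjacency matrix of an n-node path graph (illustrative factor graph)."""
    W = np.zeros((n, n))
    idx = np.arange(n - 1)
    W[idx, idx + 1] = W[idx + 1, idx] = 1.0
    return W

W1, W2 = path_graph(3), path_graph(4)
n1, n2 = W1.shape[0], W2.shape[0]

kron_part = np.kron(W2, W1)                                    # Kronecker product part
cart_part = np.kron(W2, np.eye(n1)) + np.kron(np.eye(n2), W1)  # Cartesian (Kronecker sum) part
W_strong = kron_part + cart_part

# Equivalent closed form for two factors: (W2 + I) kron (W1 + I) - I
alt = np.kron(W2 + np.eye(n2), W1 + np.eye(n1)) - np.eye(n1 * n2)
assert np.allclose(W_strong, alt)
```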

By making the strong product structure explicit in terms of the factor adjacency matrices, the optimization problem for learning strong graphs can be expressed as the following nonconvex problem:

min{𝐖k𝒲}k=1K0\displaystyle\underset{\{\mathbf{W}_{k}\in\mathcal{W}\}_{k=1}^{K_{0}}}{\min} αM0trace[[[K0]𝐖k](m=1M0𝟏𝐱¯mTm=1M0𝐱m𝐱mT)].\displaystyle\frac{\alpha}{M_{0}}\mathrm{trace}\Bigg{[}\big{[}\underset{[K_{0}]}{\boxtimes}\mathbf{W}_{k}\big{]}\Big{(}\underset{m=1}{\overset{M_{0}}{\sum}}\mathbf{1}\bar{\mathbf{x}}_{m}^{T}-\underset{m=1}{\overset{M_{0}}{\sum}}\mathbf{x}_{m}\mathbf{x}_{m}^{T}\Big{)}\Bigg{]}. (15)

IV-D Product graph Fourier transform

One can see from (IV-A), (IV-B), and (IV-C) that the graph Fourier transform of a product graph (which is the eigenmatrix of the product adjacency matrix) has a Kronecker structure in terms of the eigenmatrices of the component graph adjacency matrices: 𝐔=[K0]𝐔k\mathbf{U}=\underset{[K_{0}]}{\otimes}\mathbf{U}_{k}. This structure provides an efficient implementation of the graph Fourier transform as (using the properties of Kronecker products and tensors [23]):

𝐔T𝐱=([K0]𝐔k)T𝐱=vec(𝒳×[K0]𝐔k),\displaystyle\mathbf{U}^{T}\mathbf{x}=(\underset{[K_{0}]}{\otimes}\mathbf{U}_{k})^{T}\mathbf{x}=\mathrm{vec}(\mathcal{X}\underset{[K_{0}]}{\times}\mathbf{U}_{k}), (16)

where 𝐱n1n2nK0\mathbf{x}\in\mathbb{R}^{n_{1}n_{2}\dots n_{K_{0}}} is an arbitrary graph signal on the product graph, and 𝒳n1×n2××nK0\mathcal{X}\in\mathbb{R}^{n_{1}\times n_{2}\times\dots\times n_{K_{0}}} represents the appropriately tensorized version of the signal 𝐱\mathbf{x}. Because of this, one does not need to form the huge Fourier matrix 𝐔\mathbf{U} and can avoid costly matrix multiplications by simply applying the component graph Fourier matrices to each respective mode of the tensorized observation 𝒳\mathcal{X} and then vectorizing the result.
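The equivalence in (16) can be verified for two factors, with random orthonormal matrices standing in for the factor eigenmatrices; note the column-major (Fortran-order) vec convention that matches the Kronecker identity (B⊗A)vec(X) = vec(AXBᵀ).

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2 = 3, 4
# Orthonormal stand-ins for the factor eigenmatrices U_1, U_2
U1, _ = np.linalg.qr(rng.standard_normal((n1, n1)))
U2, _ = np.linalg.qr(rng.standard_normal((n2, n2)))

x = rng.standard_normal(n1 * n2)        # signal on the product graph
X = x.reshape((n1, n2), order="F")      # tensorized signal (column-major vec)

# Direct (expensive) transform with the full Kronecker eigenmatrix
direct = np.kron(U2, U1).T @ x

# Mode-wise (cheap) transform: apply each factor eigenmatrix to its own mode
modewise = (U1.T @ X @ U2).flatten(order="F")

assert np.allclose(direct, modewise)
```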

IV-E Smoothness

Smoothness of a graph signal is one of the core concepts in graph signal processing [1, 2, 3, 4, 5], and product graph Laplacians provide an efficient representation for the notion of smoothness. The smoothness of a graph signal can be measured through the Dirichlet energy defined as 𝐱T𝐋𝐱\mathbf{x}^{T}\mathbf{L}\mathbf{x}. The Dirichlet energy can be reexpressed as: 𝐱T𝐋𝐱=𝐱T(𝐃𝐖)𝐱=𝐱T𝐃𝐱𝐱T𝐖𝐱.\mathbf{x}^{T}\mathbf{L}\mathbf{x}=\mathbf{x}^{T}(\mathbf{D}-\mathbf{W})\mathbf{x}=\mathbf{x}^{T}\mathbf{D}\mathbf{x}-\mathbf{x}^{T}\mathbf{W}\mathbf{x}. Let us now focus on each term separately in the context of product graphs. For the term involving 𝐖\mathbf{W} we have:

𝐱T𝐖𝐱\displaystyle\mathbf{x}^{T}\mathbf{W}\mathbf{x} =𝐱T𝐔𝚲𝐔T𝐱=(𝐔T𝐱)T𝚲(𝐔T𝐱)\displaystyle=\mathbf{x}^{T}\mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^{T}\mathbf{x}=(\mathbf{U}^{T}\mathbf{x})^{T}\boldsymbol{\Lambda}(\mathbf{U}^{T}\mathbf{x})
=vec(𝒳×[K0]𝐔k)T𝚲vec(𝒳×[K0]𝐔k).\displaystyle=\mathrm{vec}(\mathcal{X}\underset{[K_{0}]}{\times}\mathbf{U}_{k})^{T}\boldsymbol{\Lambda}\mathrm{vec}(\mathcal{X}\underset{[K_{0}]}{\times}\mathbf{U}_{k}). (17)

Similarly, the term involving 𝐃\mathbf{D} can be re-expressed as:

𝐱T𝐃𝐱\displaystyle\mathbf{x}^{T}\mathbf{D}\mathbf{x} =𝐱Tdiag(𝐖𝟏)𝐱=𝐱Tdiag(𝐱)𝐖𝟏\displaystyle=\mathbf{x}^{T}\mathrm{diag}(\mathbf{W}\mathbf{1})\mathbf{x}=\mathbf{x}^{T}\mathrm{diag}(\mathbf{x})\mathbf{W}\mathbf{1}
=(𝐱𝐱)T𝐖𝟏=𝐱¯T𝐖𝟏,\displaystyle=(\mathbf{x}\odot\mathbf{x})^{T}\mathbf{W}\mathbf{1}=\bar{\mathbf{x}}^{T}\mathbf{W}\mathbf{1}, (18)

which can be computed efficiently along the lines of (IV-E). With this reformulation, one circumvents the need to explicitly form the prohibitively large eigenmatrix 𝐔\mathbf{U} and can evaluate the Dirichlet energy much more efficiently with just mode-wise products with the smaller component eigenmatrices.
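A short numerical check of the decomposition used above, 𝐱ᵀ𝐋𝐱 = 𝐱̄ᵀ𝐖𝟏 − 𝐱ᵀ𝐖𝐱 with 𝐱̄ = 𝐱⊙𝐱, on a random weighted graph:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
# Random symmetric nonnegative adjacency matrix with zero diagonal
W = np.triu(rng.random((n, n)), 1)
W = W + W.T
D = np.diag(W @ np.ones(n))   # degree matrix
L = D - W                     # combinatorial Laplacian

x = rng.standard_normal(n)
xbar = x * x                  # elementwise square, i.e., x (Hadamard) x

energy_direct = x @ L @ x
energy_split = xbar @ W @ np.ones(n) - x @ W @ x

assert np.isclose(energy_direct, energy_split)
```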

IV-F Representation complexity

Let us consider an unknown graph GG with the number of nodes |V|=n=k=1K0nk|V|=n=\prod_{k=1}^{K_{0}}n_{k}, where nkn_{k} represents the number of nodes in each component graph and K0K_{0} is the total number of component graphs. If one were to learn this graph by means of an arbitrary adjacency matrix, the number of parameters that needs to be estimated would be n(n+1)2\frac{n(n+1)}{2} (since the graph adjacency matrix is a symmetric matrix). On the other hand, for the same graph, by utilizing the product model of the graph adjacency matrix, one would need to estimate only k=1K0nk(nk+1)2\sum_{k=1}^{K_{0}}\frac{n_{k}(n_{k}+1)}{2} parameters. In the special case of n1=n2==nK0=n¯n_{1}=n_{2}=\cdots=n_{K_{0}}=\bar{n}, imposing the product structure on the graph adjacency matrix reduces the number of parameters to be learned by a factor of roughly n¯2(K01)/K0\bar{n}^{2(K_{0}-1)}/K_{0}!
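As a worked example of these counts, consider the n = 12 × 12 × 12 = 1728 setting used later in the synthetic experiments of Sec. VI-B:

```python
# Parameter counts for a product graph with K0 = 3 factors of 12 nodes each
K0, nbar = 3, 12
n = nbar ** K0                                   # 1728 nodes in the product graph

full_params = n * (n + 1) // 2                   # arbitrary symmetric adjacency
product_params = K0 * nbar * (nbar + 1) // 2     # three 12-node factor adjacencies

assert full_params == 1493856
assert product_params == 234
```

The reduction factor here is about 6400, in line with the asymptotic factor of n̄^(2(K0−1))/K0 = 12⁴/3 = 6912.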

Algorithm 2 : B-PGL—BCD for product graph learning

Input: Observations {𝐱m}m=1M0\{\mathbf{x}_{m}\}_{m=1}^{M_{0}}, maximum iterations N0N_{0}, maximum inner iterations T0T_{0}, and parameters α\alpha, ρ\rho
Initialize: {𝐖^k}k=1K0\{\widehat{\mathbf{W}}_{k}\}_{k=1}^{K_{0}}

  for n=1n=1 to N0N_{0} (outer iterations)
     for k=1k=1 to K0K_{0}
        If stopping criteria not met
         Solve (11), (13), or (15) for 𝐖^k\widehat{\mathbf{W}}_{k} via Algorithm 1
        end
     end
  end

Output: Final adjacency matrix estimates 𝐖^k\widehat{\mathbf{W}}_{k}

V Algorithm for learning product graphs

In the previous section we highlighted some properties and advantages of product graphs and posed the optimization problems for learning these graphs. We now propose an algorithm for solving these product graph learning problems. To this end, we first recognize that even though the problems posed in (11), (13), and (15) are nonconvex, minimization of each objective with respect to any single factor adjacency matrix is still convex (while all the other factor matrices are held fixed). Moreover, these factor-wise minimization problems can be solved through Algorithm 1 proposed earlier. These observations lead us to propose a block coordinate descent (BCD) based algorithm, named B-PGL (BCD for product graph learning), that minimizes over each factor adjacency matrix in a cyclic fashion. The proposed algorithm is provided in Algorithm 2, and in the following discussion we present the factor-wise problems for each product graph.

V-A Kronecker graphs

Since Algorithm 2 utilizes factor-wise minimization, we can characterize the error for product graph learning in terms of the factor-wise errors of each factor adjacency matrix. The error bounds for learning Kronecker graphs are provided in the following theorem with the proof in Appendix A.

Theorem 1.

After each outer iteration of Algorithm 2, the error between the sample-based minimization (with M0M_{0} samples) and the population-based minimization (with infinitely many samples) of (11) with respect to the k{k}-th factor 𝐖k\mathbf{W}_{k} satisfies 𝒪(nk2log(nk)nM0(1+[K0]k𝐖^i[K0]k𝐖iF))\mathcal{O}\Big{(}\frac{n_{k}^{2}\log(n_{k})}{nM_{0}}\Big{(}1+\big{\|}\underset{[K_{0}]\setminus k}{\otimes}\widehat{\mathbf{W}}_{i}-\underset{[K_{0}]\setminus k}{\otimes}\mathbf{W}_{i}^{*}\big{\|}_{F}\Big{)}\Big{)}, with high probability as nMnk\frac{nM}{n_{k}}\rightarrow\infty, when αnklog(nk)nM0\alpha\leq\sqrt{\frac{n_{k}\log(n_{k})}{nM_{0}}}. Additionally, the error between the estimated factor 𝐖^k\widehat{\mathbf{W}}_{k} and the original generating factor 𝐖k\mathbf{W}_{k}^{*} also satisfies 𝐖^k𝐖kF=𝒪(nklog(nk)nM0)\|\widehat{\mathbf{W}}_{k}-\mathbf{W}_{k}^{*}\|_{F}=\mathcal{O}\Bigg{(}\sqrt{\frac{n_{k}\log(n_{k})}{nM_{0}}}\Bigg{)}, with high probability as nMnk\frac{nM}{n_{k}}\rightarrow\infty.

Remark 1.

The error between the estimated kk-th factor and the original generating kk-th factor for learning Kronecker graphs also depends on the estimation errors of the other factors. This dependence, however, is absorbed in big-𝒪\mathcal{O}.

V-B Cartesian graphs

The following theorem characterizes the error of the factor-wise minimization of the Cartesian graph learning problem.

Theorem 2.

The objective function (13) for the Cartesian graph learning problem can be represented as a sum of terms that are linear in each factor adjacency matrix, and is therefore convex. After each outer iteration of Algorithm 2, the error between the sample-based minimization (with M0M_{0} samples) and the population-based minimization (with infinitely many samples) of (13) with respect to the k{k}-th factor 𝐖k\mathbf{W}_{k} satisfies 𝒪(nk2log(nk)nM0)\mathcal{O}\Big{(}\frac{n_{k}^{2}\log(n_{k})}{nM_{0}}\Big{)}, with high probability as nMnk\frac{nM}{n_{k}}\rightarrow\infty, when αnklog(nk)nM0\alpha\leq\sqrt{\frac{n_{k}\log(n_{k})}{nM_{0}}}. Additionally, the error between the estimated factor 𝐖^k\widehat{\mathbf{W}}_{k} and the original generating factor 𝐖k\mathbf{W}_{k}^{*} also satisfies 𝐖^k𝐖kF=𝒪(nklog(nk)nM0)\|\widehat{\mathbf{W}}_{k}-\mathbf{W}_{k}^{*}\|_{F}=\mathcal{O}\Bigg{(}\sqrt{\frac{n_{k}\log(n_{k})}{nM_{0}}}\Bigg{)}, with high probability as nMnk\frac{nM}{n_{k}}\rightarrow\infty.

The proof of this theorem is given in Appendix B. The theorem states, remarkably, that the objective function for learning Cartesian product graphs is convex and separable in each factor adjacency matrix, i.e., the objective function can be represented as a sum of linear terms, each of which is dependent on only one factor adjacency matrix. As a consequence of this fact, for learning Cartesian graphs, one can optimize over all factor adjacency matrices in parallel!

Remark 2.

The error between the estimated kk-th factor and the original generating kk-th factor for learning Cartesian graphs is independent of the estimation errors of the other factors due to the convexity and separability of (13).

V-C Strong graphs

The following theorem, with its proof in Appendix C, characterizes the behavior of factor-wise minimization for learning strong product graphs.

Theorem 3.

After each outer iteration of Algorithm 2, the error between the sample-based minimization (with M0M_{0} samples) and the population-based minimization (with infinitely many samples) of (15) with respect to the k{k}-th factor 𝐖k\mathbf{W}_{k} satisfies 𝒪(nklog(n)M0(1+2[K0,k]𝐖^i[K0]𝐖iF))\mathcal{O}\Big{(}\frac{n_{k}\log(n)}{M_{0}}\Big{(}{1+2\big{\|}\underset{[K_{0},k]}{\boxtimes}\widehat{\mathbf{W}}_{i}-\underset{[K_{0}]}{\boxtimes}\mathbf{W}_{i}^{*}\big{\|}_{F}}\Big{)}\Big{)}, with high probability as nMnk\frac{nM}{n_{k}}\rightarrow\infty, when αnklog(nk)nM0\alpha\leq\sqrt{\frac{n_{k}\log(n_{k})}{nM_{0}}}. Additionally, the error between the estimated factor 𝐖^k\widehat{\mathbf{W}}_{k} and the original generating factor 𝐖k\mathbf{W}_{k}^{*} also satisfies 𝐖^k𝐖kF=𝒪(nklog(nk)nM0)\|\widehat{\mathbf{W}}_{k}-\mathbf{W}_{k}^{*}\|_{F}=\mathcal{O}\Bigg{(}\sqrt{\frac{n_{k}\log(n_{k})}{nM_{0}}}\Bigg{)}, with high probability as nMnk\frac{nM}{n_{k}}\rightarrow\infty.

Remark 3.

The error between the estimated kk-th factor and the original generating kk-th factor for learning strong graphs is dependent on the estimation errors of the other factors. This dependence on other factors, however, is absorbed in big-𝒪\mathcal{O}.

V-D Error bound for arbitrary graphs

As a byproduct of Theorem 1, we can also obtain an error bound for the arbitrary graph learning problem. To this end, see that the objective in (III) for learning arbitrary graphs can be expressed as α/M0m=1M0𝐱mT𝐋𝐱m=αtrace(𝐃𝐒¯)αtrace(𝐖𝐒)\alpha/M_{0}\sum_{m=1}^{M_{0}}\mathbf{x}_{m}^{T}\mathbf{L}\mathbf{x}_{m}=\alpha\mathrm{trace}(\mathbf{D}\bar{\mathbf{S}})-\alpha\mathrm{trace}(\mathbf{W}\mathbf{S}), where 𝐒\mathbf{S} and 𝐒¯\bar{\mathbf{S}} are similar to the definitions in Appendix A. Following along the lines of Theorem 1, by solving (III) via Algorithm 1 one is guaranteed to converge to the original generating adjacency matrix of the unstructured graph with the error 𝐖^𝐖F=𝒪P(log(n)/M0)\|\widehat{\mathbf{W}}-\mathbf{W}^{*}\|_{F}=\mathcal{O}_{P}\Big{(}\sqrt{\log(n)/M_{0}}\Big{)}, with high probability as MM\rightarrow\infty.

V-E Computational complexity

The computational complexity of solving each factor-wise problem scales quadratically with the number of nodes in the graph. This implies that when the product structure is imposed, one only has to solve K0K_{0} smaller problems each with computational complexity of O(n¯2)\mathrm{O}(\bar{n}^{2}), assuming the special case of n1=n2==nK0=n¯n_{1}=n_{2}=\dots=n_{K_{0}}=\bar{n}. In contrast, for learning unstructured graphs the computational complexity would scale as O(n2)=O(k=1K0nk2)O(n¯2K0)\mathrm{O}(n^{2})=\mathrm{O}(\prod_{k=1}^{K_{0}}n_{k}^{2})\sim\mathrm{O}(\bar{n}^{2K_{0}}). Thus, the computational gains are huge in comparison to the original problem for learning unstructured graphs!

This computational gain is only made possible due to the way that we pose the original graph learning problem as a linear program. This way of posing the original problem made the objective amenable to factor-wise minimization, which would not be possible if the original graph learning problem was posed in the traditional way from the existing literature.

V-F Convergence properties & sample complexity

The overall convergence of the algorithm to a stationary point of the problem can be established through the following theorem, whose proof is provided in Appendix D.

Theorem 4.

Since each factor-wise problem for each graph learning problem is convex, the product graph learning algorithm Algorithm 2 is guaranteed to converge to a stationary point at a linear rate.

We now provide insight into our theorem statements with regard to the required number of observations. Theorems 1, 2, and 3 claim that the estimated factor lies within a ball of radius proportional to √(log(nk)/((n/nk)M0))\sqrt{\frac{\log(n_{k})}{(n/n_{k})M_{0}}} around the true factor. The accuracy of the estimate increases with the number of available observations M0M_{0} and with the product of the dimensions of the other factors n/nk=jknjn/n_{k}=\prod_{j\neq k}n_{j}. Moreover, the accuracy decreases with increasing dimension of the factor being estimated.

Taking a closer look at the error bounds for learning arbitrary and structured graphs reveals an important point. The denominator in the error bound is the number of observations available to estimate the graph. For M0M_{0} observed graph signals, the number of observations available to arbitrary graph learning are (obviously) M0M_{0}; however, for estimating the kk-th factor adjacency when learning product graphs the effective number of observations are jknj×M0\prod_{j\neq k}n_{j}\times M_{0}. This means that imposing the product structure results in an increased number of effective observations to estimate each factor adjacency matrix. This combined with the reduced number of parameters required to learn these graphs makes product graphs very attractive for real world applications.

VI Numerical experiments

This section provides results for learning product graphs from synthetic and real datasets. We first present experiments for learning arbitrary graphs through our proposed linear program in Sec. III, and then the results for learning product graphs from synthetic data through Alg. 2. Afterwards we validate the performance of our proposed algorithm for product graphs on real-world datasets.

Figure 1: F-measure values for various graphs for our proposed graph learning algorithm (GLP), LOG, and CGL.

VI-A Synthetic data: Arbitrary graphs

To showcase the performance of our new formulation for graph learning, we run synthetic experiments on a graph with n=64n=64 nodes. We generate various different graph types such as: (i) a sparse random graph with Gaussian weights, (ii) an Erdos-Renyi random graph with edge probability 0.7, (iii) a scale-free graph with preferential attachment (6 edges at each step), (iv) a random regular graph where each node is connected to 0.7n0.7n other nodes, (v) a uniform grid graph, (vi) a spiral graph, (vii) a community graph, and (viii) a low stretch tree graph on a grid of points. The related details of how the graphs are simulated can be found in [30, 12]. For each kind of graph, we generate 20 different realizations, and for each realization we generate observations using a degenerate multivariate Gaussian distribution with the graph Laplacian as the precision matrix [12, 11, 13].

We compare the performance of our proposed method with two other state-of-the-art methods for arbitrary graph learning: (i) combinatorial graph learning from [11] (which we refer to as CGL), and (ii) graph learning method from [10] (which we refer to as LOG), which also aims to learn a combinatorial graph Laplacian through a slightly different optimization problem than [11]. We choose α\alpha for our algorithm in the range 0.75{0:14}×log(n)M00.75^{\{0:14\}}\times\sqrt{\frac{\log(n)}{M_{0}}}, as dictated by the error bounds for learning graphs in App. A and by the existing literature [12, 11, 13]. Furthermore, we choose ρ=0.75/log(M0)\rho=0.75/\log(M_{0}) as the value that works best for most cases. For each algorithm, in the prescribed range of the optimization parameters, we choose the parameters that produce the best results.

The results of our experiments are shown as F-measure values in Fig. 1. F-measure is the harmonic mean of precision and recall, and signifies the overall accuracy of the algorithm [11]. Precision here denotes the fraction of true graph edges among all the recovered edges, while recall is the fraction of the true graph's edges that are recovered. One can see that our algorithm (except for community graphs) performs just as well as or better than the existing state-of-the-art algorithms. Moreover, the average performance over all graphs in Fig. 2 shows that on average we outperform the existing algorithms. A runtime comparison of all algorithms in Fig. 2 also reveals a competitive run time for our proposed scheme. The LOG algorithm, despite showing the smallest runtime, incurs a substantial computational overhead from the construction of a matrix of pairwise distances between all rows of the data matrix 𝐗\mathbf{X}; in contrast, the other algorithms work with the graph signal observations directly and do not require such preprocessing steps.
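For reference, the F-measure used throughout this section can be computed from the true and recovered edge sets as follows; this is a generic sketch of the metric, not the authors' evaluation code.

```python
def f_measure(true_edges, recovered_edges):
    """F-measure: harmonic mean of precision and recall over edge sets."""
    true_edges, recovered_edges = set(true_edges), set(recovered_edges)
    tp = len(true_edges & recovered_edges)   # correctly recovered edges
    if tp == 0:
        return 0.0
    precision = tp / len(recovered_edges)    # fraction of recovered edges that are true
    recall = tp / len(true_edges)            # fraction of true edges that are recovered
    return 2 * precision * recall / (precision + recall)

# Toy check: 3 true edges, 4 recovered, 2 correct -> P = 1/2, R = 2/3, F = 4/7
f = f_measure({(0, 1), (1, 2), (2, 3)}, {(0, 1), (1, 2), (0, 2), (1, 3)})
assert abs(f - 4 / 7) < 1e-12
```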

Remark 4.

For some graphs in Fig. 1, the performance of GLP seems to worsen as the number of observations grows. This is likely due to the limited range that we have considered for searching the optimization parameter; with a larger search range, this downward trend should disappear. The range that we have prescribed is the one most commonly used in the literature and on average works well in most settings.

Figure 2: Average F-measure values over all graphs (left) from Fig. 1 for our proposed graph learning algorithm (GLP), LOG, and CGL. Average run times over 30 trials for each algorithm (right), with increasing number of nodes.

VI-B Synthetic data: Product graphs

We now present the results of our numerical experiments involving synthetic data for product graphs. We run experiments for random Erdos-Renyi factor graphs with n=n1n2n3=12×12×12n=n_{1}n_{2}n_{3}=12\times 12\times 12 nodes, having either Cartesian, Kronecker or strong structure. We then use our proposed algorithms to learn the generated graphs with a varying number of observations and compare the performance with LOG, as its performance was the second best in Fig. 1. The results for all three types of product graphs are shown in Fig. 3 (top). For a fixed number of observations, Cartesian product graphs can be learned with the highest F-measure score, followed by strong and then Kronecker graphs. The figure also shows that for each graph, imposing product structure on the learned graph drastically improves the performance of the learning algorithm. Fig. 3 (bottom) also shows the runtime comparison of our approach BPGL with the algorithm in [10]. Even for a graph of this size, with a total number of nodes n=1728n=1728, we can see a considerable reduction in runtimes. Thus, our learning algorithm that explicitly incorporates the product structure of the graph enjoys superior performance, reduced computational complexity and faster runtimes.

Figure 3: Precision, recall and F-measure values for various values of the β\beta parameter. The plots shown are for Cartesian (top), Kronecker (middle), and strong (bottom) graphs when using only 5 observations for learning.

VI-C United States wind speed data

The first real dataset we use for experimentation is the NCEP wind speed data. The NCEP wind speed data [31] represents wind conditions in the lower troposphere and contains daily averages of U (east-west) and V (north-south) wind components over the years 194820121948-2012. Similar to the experiments in [22] with preprocessed data, we use a grid of n1n2=10×10=100n_{1}n_{2}=10\times 10=100 stations to extract the data available for the United States region. From this extracted data, we choose the years 200320072003-2007 for training and keep the years 200820122008-2012 for testing purposes. Using a non-overlapping window of length n3=8n_{3}=8, which amounts to dividing the data into chunks of 8 days, we obtain M0=228M_{0}=228 samples, each of length n=n1n2n3n=n_{1}n_{2}n_{3}. Therefore, for graph learning we have 228 samples, where each sample contains spatiotemporal data for 100 stations over 8 days. The testing procedure consists of introducing missing values in each sample of the test data by omitting the data for the 8th day, and then using a linear minimum-mean-square-error (LMMSE) estimator [32, Chapter 4] to predict the missing values; as in training, 228 test samples are obtained through the same procedure. Our proposed method estimates the (structured) adjacency matrix of the graph (which is related to the precision matrix of the data), and we use 𝐖+𝐈\mathbf{W}+\mathbf{I} in place of the data covariance for the LMMSE estimator.
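The prediction step can be sketched as a Gaussian conditional-mean (LMMSE) estimate of the missing block. The code below is a hypothetical toy version: a random SPD matrix stands in for the 𝐖+𝐈 covariance surrogate, and the dimensions are much smaller than in the actual experiment.

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_missing = 20, 5                 # toy sizes; the paper's experiment has n = 800
obs = slice(0, n - n_missing)        # observed entries (analogue of days 1-7)
mis = slice(n - n_missing, n)        # missing entries (analogue of the 8th day)

# Random SPD stand-in for the covariance surrogate (W + I in the paper)
B = rng.standard_normal((n, n))
C = B @ B.T / n + np.eye(n)

x = rng.standard_normal(n)           # a zero-mean test sample

# LMMSE prediction of the missing block from the observed block:
# x_hat_mis = C_mo C_oo^{-1} x_obs
x_hat = C[mis, obs] @ np.linalg.solve(C[obs, obs], x[obs])

assert x_hat.shape == (n_missing,)
```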

We make a comparison of the following approaches: (1) sample covariance matrix (SCM) learning, (2) the permuted rank-penalized least squares (PRLS) approach [22] with r=6r=6 Kronecker components, (3) PRLS with r=2r=2 Kronecker components, (4) time-varying graph learning (TVGL) approach from [33] (which was shown to outperform the approach in [34]), (5) spatiotemporal strong graph with BPGL with a spatial component of size n1n2n_{1}n_{2} and a temporal component of size n3n_{3}, and (6) spatiotemporal Cartesian graph with BPGL of the same dimensions. The parameters for PRLS were chosen for optimal performance as given in [22]. The optimization parameters for TVGL and BPGL were manually tuned for best performance. It should be noted here that SCM and PRLS aim to learn a covariance matrix and a structured covariance matrix from the data, respectively.

The SCM estimates \frac{n(n+1)}{2} parameters, while PRLS estimates r\big(\frac{n_{1}n_{2}(n_{1}n_{2}+1)}{2}+\frac{n_{3}(n_{3}+1)}{2}\big) parameters. On the other hand, TVGL estimates n_{3}\big(\frac{n_{1}n_{2}(n_{1}n_{2}+1)}{2}\big) parameters, while BPGL needs to learn only \frac{n_{1}n_{2}(n_{1}n_{2}+1)}{2}+\frac{n_{3}(n_{3}+1)}{2} parameters for both strong and Cartesian graphs. The mean prediction root-mean-squared errors (RMSE) for all methods are shown in Table I. One can see that our proposed method outperforms PRLS and TVGL while estimating far fewer parameters than both. The table also shows that learning a strong product graph for this data yields a larger RMSE reduction over the baseline (SCM), and is thus better suited to this data than the Cartesian product graph.
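The parameter counts above can be tabulated directly from the stated formulas. The sketch below uses the formulas as written; the slightly different counts in Table I presumably reflect the authors' exact accounting of diagonal entries:

```python
def param_counts(n1n2, n3, r):
    """Free-parameter counts (symmetric matrices) per method, following the
    formulas stated in the text."""
    n = n1n2 * n3
    factor = n1n2 * (n1n2 + 1) // 2 + n3 * (n3 + 1) // 2
    return {
        "SCM": n * (n + 1) // 2,
        "PRLS": r * factor,
        "TVGL": n3 * (n1n2 * (n1n2 + 1) // 2),
        "BPGL": factor,
    }

# wind-speed setup: n1 n2 = 100 stations, n3 = 8 days, r = 6 components
counts = param_counts(100, 8, 6)
```

The ordering BPGL < PRLS < TVGL < SCM matches the savings reported in Table I.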

TABLE I: Comparison of prediction RMSE for US wind speed data
Method          | RMSE reduction over SCM (dB) | Parameters
SCM             | --                           | 320400
TVGL [33]       | 1.0461                       | 40656
PRLS [22] (r=6) | 1.7780                       | 30492
PRLS [22] (r=2) | -1.5473                      | 10164
BPGL strong     | 1.8640                       | 5082
BPGL Cartesian  | 1.3105                       | 5082

Comparison of our graph learning method with SCM, PRLS and TVGL. Our proposed method outperforms the existing methods for time-varying graph learning and for structured covariance estimation, while using considerably fewer parameters.

VI-D ABIDE fMRI data: Exploratory data analysis

The second real dataset that we use as an application for our proposed algorithm is a part of the ABIDE fMRI dataset [35, 36]. Our aim is to learn graphs over the fMRI data of control and autistic subjects and to use the learned graphs to highlight the differences between control and autistic brains. The data we obtain is already preprocessed to remove various fMRI artifacts and to normalize the obtained scans [37]. The final preprocessed data consists of measurements from n_{1}=111 brain regions scanned over 116 time instances for each subject. The data contains scans for both control and autistic subjects; to avoid class imbalance, we randomly choose 47 subjects for each class. Out of the 47 subjects for each class, we then randomly choose 30 subjects for training and keep the remaining 17 for testing purposes. We use a non-overlapping window length of n_{2}=29, which results in M_{0}=120 samples of length n=n_{1}\times n_{2}.
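The windowing step above can be sketched on synthetic data of the same dimensions; the array names are our own and the values are placeholders:

```python
import numpy as np

# Illustrative windowing of the training fMRI data: 30 subjects,
# n1 = 111 regions, 116 time instances per subject, and
# non-overlapping windows of length n2 = 29.
n1, T, n2, subjects = 111, 116, 29, 30
rng = np.random.default_rng(0)
data = rng.standard_normal((subjects, n1, T))

wins = T // n2                        # 4 full windows per subject
trimmed = data[:, :, :wins * n2]      # drop leftover time points, if any
# split the time axis into windows, then flatten each (regions x window) block
samples = trimmed.reshape(subjects, n1, wins, n2)
samples = samples.transpose(0, 2, 1, 3).reshape(subjects * wins, n1 * n2)
```

This yields the M_{0}=120 training samples of length n=111\times 29 used above.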

As before, we compare the performance of our proposed approach with SCM and PRLS. Table II shows the results of our experiments. One can see that our approach performs very similarly to PRLS for both Cartesian and strong product graphs, all while using far fewer parameters (five times fewer). We also see that strong product graphs are better suited to modeling brain activity. The work in [37] suggests that autistic brains exhibit hypoconnectivity in different regions of the brain as compared to control subjects. The results from our graph learning procedure go a step further and bring more insight into the spatiotemporal dynamics of the brain. First, as already suggested in [37], we see clear evidence of spatial hypoconnectivity (see Fig. 4). More importantly, our learned graphs in Fig. 5 reveal that, in addition to spatial hypoconnectivity, autistic brains also suffer from temporal hypoconnectivity.

TABLE II: Comparison of prediction RMSE for ABIDE fMRI data
Method               | RMSE reduction over SCM (dB) | Parameters
SCM                  | --                           | 5182590
PRLS Normal          | 2.1793                       | 33255
Cartesian GL Control | 2.0980                       | 6651
Strong GL Control    | 2.1753                       | 6651
PRLS Autism          | 2.375                        | 33255
Cartesian GL Autism  | 2.3400                       | 6651
Strong GL Autism     | 2.3563                       | 6651

Comparison of our graph learning method with SCM and PRLS.

Figure 4: This figure shows the adjacency matrices of the spatial components learned for control (left) and autistic (right) subjects with the strong graph learning algorithm. The images reveal, in line with the existing literature, that the control brain is much more connected than the autistic brain.
Figure 5: This figure shows the adjacency matrices of the temporal components learned for control (left) and autistic (right) subjects with the strong graph learning algorithm. The images reveal that control brains exhibit more temporal connections than autistic brains. This is a new finding, made possible only by considering the spatiotemporal dynamics of the brain rather than spatial connectivity alone.

VI-E Estrogen receptor data

The final dataset that we experiment on is the estrogen receptor data [38, 39], which consists of 157 samples of 693 probe sets related to the estrogen receptor pathway. We aim to learn a Kronecker structured graph on this data using 120 randomly selected samples for training and the remaining 37 for testing. We choose a Kronecker structured graph for this data because transposable models, i.e., models that learn a Kronecker structured data covariance, have been shown to work well for genomic data in the existing literature [28]. As pointed out in Sec. IV-A, a Kronecker structured adjacency matrix corresponds to a Kronecker structured data covariance. For testing purposes, we follow a procedure similar to the previous subsections.
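A quick sketch of the Kronecker-structured representation assumed here, with hypothetical random factors of the sizes used for this dataset, illustrating the storage savings of the factorization:

```python
import numpy as np

# Hypothetical factors of sizes n1 = 21 and n2 = 33, giving a
# 693-node Kronecker product graph (21 * 33 = 693 probe sets).
n1, n2 = 21, 33
rng = np.random.default_rng(1)

def random_adjacency(n, rng):
    """Random symmetric weighted adjacency matrix with a zero diagonal."""
    A = rng.random((n, n))
    A = (A + A.T) / 2
    np.fill_diagonal(A, 0.0)
    return A

W1 = random_adjacency(n1, rng)
W2 = random_adjacency(n2, rng)
W = np.kron(W1, W2)   # full 693 x 693 adjacency, stored via two small factors
```

The two factors require only n_{1}^{2}+n_{2}^{2} entries versus (n_{1}n_{2})^{2} for the full matrix, which is the compactness argument made for product graphs throughout the paper.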

We compare our graph learning approach with SCM, PRLS and the sparse covariance estimation (SEC) method from [40]. Optimization parameters are manually tuned for best results for each method. We learn a graph through our method (and a covariance through PRLS) with a Kronecker structure composed of two factor matrices of dimensions n_{1}=21 and n_{2}=33. We then use an LMMSE estimator to predict 33 probe set measurements removed from the test data. PRLS, SEC and BPGL result in improvements of \mathbf{0.91347} dB, \mathbf{0.93598} dB, and \mathbf{1.0242} dB over SCM, respectively. This demonstrates that our method outperforms the state-of-the-art unstructured and structured sparse covariance estimation techniques, and provides a better model for real datasets.

VII Conclusion

In this paper, we introduced a new linear formulation of the graph learning problem from graph signals. We demonstrated the performance of this new formulation with numerical experiments and derived bounds on its estimation performance. Based on the proposed formulation, we also posed the problem of learning product graphs from data. We devised a block coordinate descent-based algorithm for learning product graphs and derived the associated error bounds for various product structures. Finally, we validated the performance characteristics and superior learning capabilities of our proposed method through numerical simulations on synthetic and real datasets.

Appendix A Proof of Theorem 1

Note that the current form of the objective function (11) can be expressed as

\displaystyle\frac{\alpha}{M_{0}}~{}\underset{m=1}{\overset{M_{0}}{\sum}}\bar{\mathbf{x}}_{m}^{T}(\underset{[K_{0}]}{\otimes}\mathbf{W}_{i})\mathbf{1}-\mathbf{x}_{m}^{T}(\underset{[K_{0}]}{\otimes}\mathbf{W}_{i})\mathbf{x}_{m}
\displaystyle=\frac{\alpha}{M_{0}}~{}\underset{m=1}{\overset{M_{0}}{\sum}}\langle\bar{\mathcal{X}}_{m},\mathds{1}\underset{[K_{0}]}{\times}\mathbf{W}_{i}\rangle-\langle\mathcal{X}_{m},\mathcal{X}_{m}\underset{[K_{0}]}{\times}\mathbf{W}_{i}\rangle (19)

using the properties of the Kronecker product. Let us define the following: the tensor \mathcal{Y}_{m}=\mathcal{X}_{m}\underset{[K_{0}]\setminus k}{\times}\mathbf{W}_{i}^{1/2}\underset{k}{\times}\mathbf{I}_{k}, the matrix \mathbf{S}_{k}=\frac{1}{M_{0}}\sum_{m=1}^{M_{0}}\mathcal{Y}_{m(k)}\mathcal{Y}_{m(k)}^{T}, \bar{\mathbf{d}}_{k} as the vector of weighted degrees of the product adjacency matrix \underset{[K_{0}]\setminus k}{\otimes}\mathbf{W}_{i}, \mathbf{d}_{k} as the vector of weighted degrees of \mathbf{W}_{k}, \bar{\mathbf{y}}_{mj} as the j-th column of \bar{\mathcal{X}}_{m(k)}, \mathbf{y}_{mj} as the j-th column of \mathcal{X}_{m(k)}, and finally the matrix \bar{\mathbf{S}}_{k}=\frac{1}{M_{0}}\sum_{m=1}^{M_{0}}\sum_{j=1}^{n/n_{k}}(\bar{\mathbf{d}}_{k})_{j}\mathbf{y}_{mj}\mathbf{y}_{mj}^{T}. Then we can further express the terms in (19) as:

\displaystyle\frac{\alpha}{M_{0}}~{}\underset{m=1}{\overset{M_{0}}{\sum}}\langle\mathcal{X}_{m},\mathcal{X}_{m}\underset{[K_{0}]}{\times}\mathbf{W}_{i}\rangle
\displaystyle=\frac{\alpha}{M_{0}}~{}\underset{m=1}{\overset{M_{0}}{\sum}}\mathrm{trace}\big{(}\mathbf{W}_{k}\mathcal{X}_{m(k)}(\underset{[K_{0}]\setminus k}{\otimes}\mathbf{W}_{i})\mathcal{X}_{m(k)}^{T}\big{)}
\displaystyle=\frac{\alpha}{M_{0}}~{}\underset{m=1}{\overset{M_{0}}{\sum}}\mathrm{trace}\big{(}\mathbf{W}_{k}\mathcal{Y}_{m(k)}\mathcal{Y}_{m(k)}^{T}\big{)}=\alpha~{}\mathrm{trace}\big{(}\mathbf{W}_{k}\mathbf{S}_{k}\big{)}, (20)

and,

\displaystyle\frac{\alpha}{M_{0}}~{}\underset{m=1}{\overset{M_{0}}{\sum}}\langle\bar{\mathcal{X}}_{m},\mathds{1}\underset{[K_{0}]}{\times}\mathbf{W}_{i}\rangle
\displaystyle=\frac{\alpha}{M_{0}}~{}\underset{m=1}{\overset{M_{0}}{\sum}}\mathrm{trace}\big{(}\mathbf{W}_{k}\bar{\mathcal{X}}_{m(k)}(\underset{[K_{0}]\setminus k}{\otimes}\mathbf{W}_{i})\mathds{1}_{(k)}^{T}\big{)}
\displaystyle=\frac{\alpha}{M_{0}}~{}\underset{m=1}{\overset{M_{0}}{\sum}}\mathrm{trace}\big{(}\mathbf{W}_{k}\bar{\mathcal{X}}_{m(k)}\bar{\mathbf{d}}_{k}\mathbf{1}^{T}\big{)}=\frac{\alpha}{M_{0}}~{}\underset{m=1}{\overset{M_{0}}{\sum}}\mathrm{trace}\big{(}\mathbf{d}_{k}^{T}\bar{\mathcal{X}}_{m(k)}\bar{\mathbf{d}}_{k}\big{)}
\displaystyle=\frac{\alpha}{M_{0}}~{}\underset{m=1}{\overset{M_{0}}{\sum}}\mathrm{trace}\big{(}\mathbf{d}_{k}^{T}[\bar{\mathbf{y}}_{m1},\bar{\mathbf{y}}_{m2},\cdots,\bar{\mathbf{y}}_{mn/n_{k}}]\bar{\mathbf{d}}_{k}\big{)}
\displaystyle=\frac{\alpha}{M_{0}}~{}\underset{m=1}{\overset{M_{0}}{\sum}}~{}\underset{j=1}{\overset{n/n_{k}}{\sum}}\mathrm{trace}\big{(}(\bar{\mathbf{d}}_{k})_{j}\mathbf{y}_{mj}^{T}\mathbf{D}_{k}\mathbf{y}_{mj}\big{)}=\alpha~{}\mathrm{trace}(\mathbf{D}_{k}\bar{\mathbf{S}}_{k}) (21)
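In the two-factor case, the Kronecker-to-trace manipulations above reduce to the identity \mathbf{x}^{T}(\mathbf{W}_{1}\otimes\mathbf{W}_{2})\mathbf{x}=\mathrm{trace}(\mathbf{W}_{2}\mathbf{X}\mathbf{W}_{1}\mathbf{X}^{T}) for \mathbf{x}=\mathrm{vec}(\mathbf{X}) with column-major vectorization. A numerical sanity check with random symmetric factors:

```python
import numpy as np

rng = np.random.default_rng(2)
n1, n2 = 4, 5

def sym(n):
    """Random symmetric matrix (a stand-in for a factor adjacency matrix)."""
    A = rng.random((n, n))
    return (A + A.T) / 2

W1, W2 = sym(n1), sym(n2)
X = rng.random((n2, n1))      # matricization of the signal tensor
x = X.flatten(order="F")      # column-major vec, so x = vec(X)

lhs = x @ np.kron(W1, W2) @ x         # quadratic form on the product graph
rhs = np.trace(W2 @ X @ W1 @ X.T)     # factor-wise trace form
```

This is the mechanism by which the objective is rewritten so that only one factor appears linearly in each trace term.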

Moreover, for the terms in (20) and (21), the difference from their expected values takes the form of:

\displaystyle\alpha~{}\Big{|}\mathrm{trace}\big{(}\mathbf{W}_{k}\mathbf{S}_{k}\big{)}-\mathbb{E}_{\mathbf{x}}\big{[}\mathrm{trace}\big{(}\mathbf{W}_{k}\mathbf{S}_{k}\big{)}\big{]}\Big{|}
\displaystyle=\alpha~{}\Big{|}\mathrm{trace}\big{(}\mathbf{W}_{k}\mathbf{S}_{k}\big{)}-\mathrm{trace}\big{(}\mathbf{W}_{k}\mathbb{E}_{\mathbf{x}}[\mathbf{S}_{k}]\big{)}\Big{|}
\displaystyle=\alpha~{}\Big{|}\mathrm{trace}\big{(}\mathbf{W}_{k}(\mathbf{S}_{k}-\mathbb{E}_{\mathbf{x}}[\mathbf{S}_{k}])\big{)}\Big{|}
\displaystyle=\alpha~{}\Big{|}\underset{i,j}{\overset{}{\sum}}\big{(}\mathbf{W}_{k}(\mathbf{S}_{k}-\mathbb{E}_{\mathbf{x}}[\mathbf{S}_{k}])\big{)}_{i,j}\Big{|}
\displaystyle\leq\alpha~{}\underset{i,j}{\max}\Big{|}(\mathbf{S}_{k}-\mathbb{E}_{\mathbf{x}}[\mathbf{S}_{k}])_{i,j}\Big{|}~{}\underset{i,j}{\overset{}{\sum}}\big{(}\mathbf{W}_{k}\big{)}_{i,j}
\displaystyle=\alpha n_{k}~{}\underset{i,j}{\max}\Big{|}(\mathbf{S}_{k}-\mathbb{E}_{\mathbf{x}}[\mathbf{S}_{k}])_{i,j}\Big{|}
\displaystyle\leq C_{1}\frac{n_{k}^{2}\log(n_{k})}{nM_{0}}\Big{(}{1+\big{\|}\underset{[K_{0}]\setminus k}{\otimes}\widehat{\mathbf{W}}_{i}-\underset{[K_{0}]\setminus k}{\otimes}\mathbf{W}_{i}^{*}\big{\|}_{F}}\Big{)}, (22)

and

\displaystyle\alpha~{}\Big{|}\mathrm{trace}\big{(}\mathbf{D}_{k}\bar{\mathbf{S}}_{k}\big{)}-\mathbb{E}_{\mathbf{x}}\big{[}\mathrm{trace}\big{(}\mathbf{D}_{k}\bar{\mathbf{S}}_{k}\big{)}\big{]}\Big{|}
\displaystyle\leq\alpha~{}\underset{i,j}{\max}\Big{|}(\bar{\mathbf{S}}_{k}-\mathbb{E}_{\mathbf{x}}[\bar{\mathbf{S}}_{k}])_{i,j}\Big{|}~{}\underset{i,j}{\overset{}{\sum}}\big{(}\mathbf{D}_{k}\big{)}_{i,j}
\displaystyle=\alpha n_{k}~{}\underset{i,j}{\max}\Big{|}(\bar{\mathbf{S}}_{k}-\mathbb{E}_{\mathbf{x}}[\bar{\mathbf{S}}_{k}])_{i,j}\Big{|}
\displaystyle\leq C_{2}\frac{n_{k}^{2}\log(n_{k})}{nM_{0}}\Big{(}{1+\big{\|}\underset{[K_{0}]\setminus k}{\otimes}\widehat{\mathbf{W}}_{i}-\underset{[K_{0}]\setminus k}{\otimes}\mathbf{W}_{i}^{*}\big{\|}_{F}}\Big{)}, (23)

with probability exceeding 1-4n_{k}^{2}\Big{[}\exp\big{(}\frac{-nM_{0}}{2n_{k}}\big{)}+\exp\big{(}-(0.25+\sqrt{\log n_{k}})^{2}\big{)}\Big{]} for both inequalities; details of the last inequalities of (22) and (23) can be found in [41, Lemma B.1]. Here \widehat{\mathbf{W}}_{i} is the current estimate of the i-th adjacency matrix, while \mathbf{W}_{i}^{*} is the original i-th generating factor. The last inequalities in both expressions follow from [41] by the fact that the estimates \widehat{\mathbf{W}}_{i} of the other factors are used when estimating the k-th factor, and by choosing \alpha\leq\sqrt{\frac{n_{k}\log(n_{k})}{nM_{0}}}. With these bounds, the error between the sample-based objective and the population-based objective, which are given by \mathrm{trace}\big{(}\mathbf{W}_{k}\mathbf{S}_{k}\big{)}-\mathrm{trace}\big{(}\mathbf{D}_{k}\bar{\mathbf{S}}_{k}\big{)} with M_{0} samples and by \mathbb{E}_{\mathbf{x}}\big{[}\mathrm{trace}\big{(}\mathbf{W}_{k}\mathbf{S}_{k}\big{)}\big{]}-\mathbb{E}_{\mathbf{x}}\big{[}\mathrm{trace}\big{(}\mathbf{D}_{k}\bar{\mathbf{S}}_{k}\big{)}\big{]} with infinitely many samples, respectively, can be upper bounded as:

\displaystyle\alpha~{}\Big{|}\mathrm{trace}\big{(}\mathbf{W}_{k}\mathbf{S}_{k}\big{)}-\mathrm{trace}\big{(}\mathbf{D}_{k}\bar{\mathbf{S}}_{k}\big{)}
\displaystyle-\mathbb{E}_{\mathbf{x}}\big{[}\mathrm{trace}\big{(}\mathbf{W}_{k}\mathbf{S}_{k}\big{)}\big{]}+\mathbb{E}_{\mathbf{x}}\big{[}\mathrm{trace}\big{(}\mathbf{D}_{k}\bar{\mathbf{S}}_{k}\big{)}\big{]}\Big{|}
\displaystyle\leq\alpha~{}\Big{|}\mathrm{trace}\big{(}\mathbf{W}_{k}\mathbf{S}_{k}\big{)}-\mathbb{E}_{\mathbf{x}}\big{[}\mathrm{trace}\big{(}\mathbf{W}_{k}\mathbf{S}_{k}\big{)}\big{]}\Big{|}
\displaystyle+\alpha~{}\Big{|}\mathrm{trace}\big{(}\mathbf{D}_{k}\bar{\mathbf{S}}_{k}\big{)}-\mathbb{E}_{\mathbf{x}}\big{[}\mathrm{trace}\big{(}\mathbf{D}_{k}\bar{\mathbf{S}}_{k}\big{)}\big{]}\Big{|}
\displaystyle\leq C\frac{n_{k}^{2}\log(n_{k})}{nM_{0}}\Big{(}{1+\big{\|}\underset{[K_{0}]\setminus k}{\otimes}\widehat{\mathbf{W}}_{i}-\underset{[K_{0}]\setminus k}{\otimes}\mathbf{W}_{i}^{*}\big{\|}_{F}}\Big{)}. (24)

To derive the bound on the error between the factor estimate \widehat{\mathbf{W}}_{k} and the original generating factor \mathbf{W}_{k}^{*}, let us first define the following convex function of \boldsymbol{\Delta}:

\displaystyle F_{k}(\boldsymbol{\Delta})=\alpha~{}\mathrm{trace}\big{(}\bar{\mathbf{S}}_{k}\mathrm{diag}(\boldsymbol{\Delta}\mathbf{1})\big{)}-\alpha~{}\mathrm{trace}\big{(}\mathbf{S}_{k}\boldsymbol{\Delta}\big{)}, (25)

where \boldsymbol{\Delta}=\mathbf{W}_{k}-\mathbf{W}_{k}^{*}, and \mathrm{diag}(\boldsymbol{\Delta}\mathbf{1})=\mathbf{D}_{k}-\mathbf{D}_{k}^{*}. Now, we want to prove that F_{k}(\boldsymbol{\Delta})>0 for all \boldsymbol{\Delta}\in\mathbb{R}^{n_{k}\times n_{k}} with \|\boldsymbol{\Delta}\|_{F}=\|\mathbf{W}_{k}-\mathbf{W}_{k}^{*}\|_{F}=R\sqrt{\frac{n_{k}\log(n_{k})}{nM_{0}}} for a constant R>0. Consider F_{k}(\cdot) at \widehat{\boldsymbol{\Delta}}=\widehat{\mathbf{W}}_{k}-\mathbf{W}_{k}^{*}, which is the minimizer of F_{k}(\boldsymbol{\Delta}) because \widehat{\mathbf{W}}_{k} is the minimizer of our factor-wise minimization of (11). Then we have:

\displaystyle F_{k}(\widehat{\boldsymbol{\Delta}})=\alpha\mathrm{trace}\big{(}\widehat{\mathbf{D}}_{k}\bar{\mathbf{S}}_{k}-\widehat{\mathbf{W}}_{k}\mathbf{S}_{k}\big{)}-\alpha\mathrm{trace}\big{(}\mathbf{D}_{k}^{*}\bar{\mathbf{S}}_{k}-\mathbf{W}_{k}^{*}\mathbf{S}_{k}\big{)}
\displaystyle\leq F_{k}(\mathbf{0})=\alpha\mathrm{trace}\big{(}\mathbf{D}_{k}^{*}\bar{\mathbf{S}}_{k}-\mathbf{W}_{k}^{*}\mathbf{S}_{k}\big{)}-
\displaystyle\quad\quad\quad\quad\quad\quad\alpha\mathrm{trace}\big{(}\mathbf{D}_{k}^{*}\bar{\mathbf{S}}_{k}-\mathbf{W}_{k}^{*}\mathbf{S}_{k}\big{)}=0. (26)

Since F_{k}(\widehat{\boldsymbol{\Delta}})\leq 0, if F_{k}(\boldsymbol{\Delta})>0 for every \boldsymbol{\Delta}\in\mathbb{R}^{n_{k}\times n_{k}} with \|\boldsymbol{\Delta}\|_{F}=R\sqrt{\frac{n_{k}\log(n_{k})}{nM_{0}}}, then the convexity of F_{k} implies \|\widehat{\boldsymbol{\Delta}}\|_{F}<R\sqrt{\frac{n_{k}\log(n_{k})}{nM_{0}}}. To see that F_{k}(\boldsymbol{\Delta})>0 for \boldsymbol{\Delta} with the prescribed norm, first consider the following, which uses a property of the trace of a product of matrices [42]:

\displaystyle\mathrm{trace}\big{(}\bar{\mathbf{S}}_{k}\mathrm{diag}(\boldsymbol{\Delta}\mathbf{1})\big{)}\geq\lambda_{n_{k}}(\bar{\mathbf{S}}_{k})\mathrm{trace}\big{(}\mathrm{diag}(\boldsymbol{\Delta}\mathbf{1})\big{)}
\displaystyle\quad\quad\geq\lambda_{n_{k}}(\bar{\mathbf{S}}_{k})\|\mathrm{diag}(\boldsymbol{\Delta}\mathbf{1})\|_{F}=\lambda_{n_{k}}(\bar{\mathbf{S}}_{k})\|\boldsymbol{\Delta}\mathbf{1}\|_{F}>0 (27)

since \|\boldsymbol{\Delta}\|_{F}>0, and where \lambda_{n_{k}}(\bar{\mathbf{S}}_{k}) is the minimum eigenvalue of \bar{\mathbf{S}}_{k}. Secondly, one can also see that:

\displaystyle\mathrm{trace}\big{(}\mathbf{S}_{k}\boldsymbol{\Delta}\big{)}\leq\lambda_{1}(\mathbf{S}_{k})\mathrm{trace}\big{(}\boldsymbol{\Delta}\big{)}=0, (28)

where \lambda_{1}(\mathbf{S}_{k}) is the largest eigenvalue of \mathbf{S}_{k}, and \mathrm{trace}\big{(}\boldsymbol{\Delta}\big{)}=0 because of the adjacency constraints. Using the upper and lower bounds on the trace terms in (25), one can see that F_{k}(\boldsymbol{\Delta})>0, which completes the proof. ∎

Appendix B Proof of Theorem 2

Let us focus on the Cartesian objective function in (13):

\displaystyle\frac{\alpha}{M_{0}}~{}\underset{m=1}{\overset{M_{0}}{\sum}}\mathbf{x}_{m}^{T}(\underset{[K_{0}]}{\oplus}\mathbf{W}_{i})\mathbf{x}_{m}
\displaystyle=\frac{\alpha}{M_{0}}~{}\underset{m=1}{\overset{M_{0}}{\sum}}\underset{k=1}{\overset{K_{0}}{\sum}}\mathbf{x}_{m}^{T}((\underset{[k-1]}{\otimes}\mathbf{I}_{i})\otimes\mathbf{W}_{k}\otimes(\underset{[K_{0}]\setminus[k]}{\otimes}\mathbf{I}_{j}))\mathbf{x}_{m}
\displaystyle=\frac{\alpha}{M_{0}}~{}\underset{m=1}{\overset{M_{0}}{\sum}}\underset{k=1}{\overset{K_{0}}{\sum}}\langle\mathcal{X}_{m},\mathcal{X}_{m}\underset{k}{\times}\mathbf{W}_{k}\rangle
\displaystyle=\frac{\alpha}{M_{0}}\underset{m=1}{\overset{M_{0}}{\sum}}\underset{k=1}{\overset{K_{0}}{\sum}}\mathrm{trace}\big{(}\mathbf{W}_{k}\mathcal{X}_{m(k)}\mathcal{X}_{m(k)}^{T}\big{)}=\alpha\underset{k=1}{\overset{K_{0}}{\sum}}\mathrm{trace}\big{(}\mathbf{W}_{k}\mathbf{T}_{k}\big{)}, (29)

where the matrix \mathbf{T}_{k}=\frac{1}{M_{0}}\sum_{m=1}^{M_{0}}\mathcal{X}_{m(k)}\mathcal{X}_{m(k)}^{T}. Similar steps along the lines of (21) can be followed to arrive at \alpha/M_{0}\sum_{m=1}^{M_{0}}\mathbf{x}_{m}^{T}(\underset{[K_{0}]}{\oplus}\mathbf{D}_{i})\mathbf{x}_{m}=\alpha\sum_{k=1}^{K_{0}}\mathrm{trace}\big{(}\mathbf{D}_{k}\bar{\mathbf{T}}_{k}\big{)}. Thus, one can clearly see that the objective function can be expressed as a sum of terms, each of which depends on only one of the factor adjacency matrices \mathbf{W}_{k}. As in Appendix A, using [41, Lemma B.1] we can see that, with high probability:

\displaystyle\underset{i,j}{\max}\Big{|}(\mathbf{T}_{k}-\mathbb{E}_{\mathbf{x}}[\mathbf{T}_{k}])_{i,j}\Big{|}\leq C_{1}\frac{n_{k}^{2}\log(n_{k})}{nM_{0}}, (30)

since each term in the objective function (29) depends on only one factor adjacency matrix. After this, one can follow the steps in Appendix A to obtain the final error bounds. ∎
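In the two-factor case, the Cartesian quadratic form above splits into per-factor traces, \mathbf{x}^{T}(\mathbf{W}_{1}\oplus\mathbf{W}_{2})\mathbf{x}=\mathrm{trace}(\mathbf{W}_{2}\mathbf{X}\mathbf{X}^{T})+\mathrm{trace}(\mathbf{W}_{1}\mathbf{X}^{T}\mathbf{X}). A quick numerical check with random symmetric factors:

```python
import numpy as np

rng = np.random.default_rng(3)
n1, n2 = 4, 5

def sym(n):
    """Random symmetric matrix (a stand-in for a factor adjacency matrix)."""
    A = rng.random((n, n))
    return (A + A.T) / 2

W1, W2 = sym(n1), sym(n2)
X = rng.random((n2, n1))      # matricization of the signal tensor
x = X.flatten(order="F")      # column-major vec, x = vec(X)

# Cartesian product: W1 (+) W2 = W1 kron I + I kron W2
cart = np.kron(W1, np.eye(n2)) + np.kron(np.eye(n1), W2)
lhs = x @ cart @ x
rhs = np.trace(W2 @ X @ X.T) + np.trace(W1 @ X.T @ X)   # per-factor traces
```

This separability is why the Cartesian objective decouples completely across factors.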

Appendix C Proof of Theorem 3

Focusing again on the objective in (15), the terms involving only the k-th factor can be expressed as:

\displaystyle\frac{\alpha}{M_{0}}~{}\sum_{m=1}^{M_{0}}\sum_{j=0}^{K_{0}-1}\sum_{\mathbf{p}}^{\mathbf{P}(k,j)}\mathcal{X}_{m}{:}(\mathcal{X}_{m}\times_{\mathbf{p}}\mathbf{W}_{\mathbf{p}}\times_{k}\mathbf{W}_{k})
\displaystyle=\frac{\alpha}{M_{0}}~{}\sum_{m=1}^{M_{0}}\sum_{j=0}^{K_{0}-1}\sum_{\mathbf{p}}^{\mathbf{P}(k,j)}\mathrm{trace}\big{(}\mathbf{W}_{k}\mathcal{X}_{m(k)}\mathbf{M}_{\mathbf{p}}\mathcal{X}_{m(k)}^{T}\big{)}
\displaystyle=\frac{\alpha}{M_{0}}~{}\sum_{m=1}^{M_{0}}\mathrm{trace}\big{(}\mathbf{W}_{k}\mathcal{X}_{m(k)}\Big{[}\sum_{j=0}^{K_{0}-1}\sum_{\mathbf{p}\in\mathbf{P}(k,j)}\mathbf{M}_{\mathbf{p}}\Big{]}\mathcal{X}_{m(k)}^{T}\big{)}
\displaystyle=\frac{\alpha}{M_{0}}~{}\sum_{m=1}^{M_{0}}\mathrm{trace}\big{(}\mathbf{W}_{k}\mathcal{X}_{m(k)}\mathbf{Q}_{k}\mathcal{X}_{m(k)}^{T}\big{)}=\alpha~{}\mathrm{trace}\big{(}\mathbf{W}_{k}\mathbf{Z}_{k}\big{)}, (31)

where \mathbf{p} denotes a column of the matrix \mathbf{P}(k,j), \sum_{\mathbf{p}\in\mathbf{P}(k,j)} denotes the summation over the columns of \mathbf{P}(k,j), and the columns of \mathbf{P}(k,j) are the different combinations of indices given by C^{[1,\dots,K_{0}]\setminus[k]}_{j}. Additionally, \mathbf{M}_{\mathbf{p}} denotes a matrix of size (n/n_{k})\times(n/n_{k}) that contains an appropriate Kronecker product of identity matrices and factor adjacency matrices in accordance with the entries of the vector \mathbf{p}, and \mathbf{Z}_{k}=\frac{1}{M_{0}}\sum_{m=1}^{M_{0}}\mathcal{X}_{m(k)}\mathbf{Q}_{k}\mathcal{X}_{m(k)}^{T}. Expressing the objective as a sum of terms that all contain the k-th factor facilitates solving the factor-wise problems.
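For two factors, the sum over index combinations above collapses to the familiar strong product identity \mathbf{W}_{1}\boxtimes\mathbf{W}_{2}=\mathbf{W}_{1}\otimes\mathbf{I}+\mathbf{I}\otimes\mathbf{W}_{2}+\mathbf{W}_{1}\otimes\mathbf{W}_{2}=(\mathbf{W}_{1}+\mathbf{I})\otimes(\mathbf{W}_{2}+\mathbf{I})-\mathbf{I}, which can be checked numerically:

```python
import numpy as np

rng = np.random.default_rng(4)
n1, n2 = 3, 4

def adjacency(n):
    """Random symmetric adjacency matrix with a zero diagonal."""
    A = rng.random((n, n))
    A = (A + A.T) / 2
    np.fill_diagonal(A, 0.0)
    return A

W1, W2 = adjacency(n1), adjacency(n2)
I1, I2 = np.eye(n1), np.eye(n2)

# strong product adjacency = Cartesian part + Kronecker part
strong = np.kron(W1, I2) + np.kron(I1, W2) + np.kron(W1, W2)
# equivalent compact form
compact = np.kron(W1 + I1, W2 + I2) - np.eye(n1 * n2)
```

Each of the three summands corresponds to one choice of index combination in the general expansion over \mathbf{P}(k,j).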

To derive the error bounds, note that the objective function in (15) can be expressed as \frac{\alpha}{M_{0}}~{}\underset{m=1}{\overset{M_{0}}{\sum}}\bar{\mathbf{x}}_{m}^{T}(\underset{[K_{0}]}{\boxtimes}\mathbf{W}_{i})\mathbf{1}-\mathbf{x}_{m}^{T}(\underset{[K_{0}]}{\boxtimes}\mathbf{W}_{i})\mathbf{x}_{m}. The difference of the second term in this expression from its expected value can be further expressed and upper bounded as:

\displaystyle\frac{\alpha}{M_{0}}\Big{|}\underset{m=1}{\overset{M_{0}}{\sum}}\mathbf{x}_{m}^{T}(\underset{[K_{0}]}{\boxtimes}\mathbf{W}_{i})\mathbf{x}_{m}-\mathbb{E}_{\mathbf{x}}\big{[}\mathbf{x}_{m}^{T}(\underset{[K_{0}]}{\boxtimes}\mathbf{W}_{i})\mathbf{x}_{m}\big{]}\Big{|}
\displaystyle=\frac{\alpha}{M_{0}}\Big{|}\mathrm{trace}\Big{(}(\underset{[K_{0}]}{\boxtimes}\mathbf{W}_{i})\underset{m=1}{\overset{M_{0}}{\sum}}\mathbf{x}_{m}\mathbf{x}_{m}^{T}\Big{)}-
\displaystyle\quad\quad\quad\quad\quad\quad\mathrm{trace}\Big{(}(\underset{[K_{0}]}{\boxtimes}\mathbf{W}_{i})\mathbb{E}_{\mathbf{x}}\big{[}\underset{m=1}{\overset{M_{0}}{\sum}}\mathbf{x}_{m}\mathbf{x}_{m}^{T}\big{]}\Big{)}\Big{|}
\displaystyle\leq\frac{\alpha\lambda_{1}(\mathbf{W}_{k})}{M_{0}}\Big{|}\mathrm{trace}\Big{(}(\underset{[k-1]}{\boxtimes}\mathbf{W}_{i}\boxtimes\mathbf{I}_{k}\underset{[K_{0}]\setminus[k]}{\boxtimes}\mathbf{W}_{j})\underset{m=1}{\overset{M_{0}}{\sum}}\mathbf{x}_{m}\mathbf{x}_{m}^{T}\Big{)}
\displaystyle\quad\quad-\mathrm{trace}\Big{(}(\underset{[k-1]}{\boxtimes}\mathbf{W}_{i}\boxtimes\mathbf{I}_{k}\underset{[K_{0}]\setminus[k]}{\boxtimes}\mathbf{W}_{j})\mathbb{E}_{\mathbf{x}}\big{[}\underset{m=1}{\overset{M_{0}}{\sum}}\mathbf{x}_{m}\mathbf{x}_{m}^{T}\big{]}\Big{)}\Big{|}
\displaystyle\leq C_{1}\frac{n_{k}\log(n)}{M_{0}}\Big{(}{1+2\big{\|}\underset{[K_{0},k]}{\boxtimes}\widehat{\mathbf{W}}_{i}-\underset{[K_{0}]}{\boxtimes}\mathbf{W}_{i}^{*}\big{\|}_{F}}\Big{)}, (32)

where \underset{[K_{0},k]}{\boxtimes}\widehat{\mathbf{W}}_{i}=\underset{[k-1]}{\boxtimes}\widehat{\mathbf{W}}_{i}\boxtimes\mathbf{I}_{k}\underset{[K_{0}]\setminus[k]}{\boxtimes}\widehat{\mathbf{W}}_{j}, and the last inequality holds with high probability by [41, Lemma B.1]. With this bound, the remaining steps are similar to those in the proof of Theorem 2, and are thus omitted in the interest of space. ∎

Appendix D Proof of Theorem 4

To prove this theorem, we first note that since each factor-wise problem is convex, the update for each factor-wise problem is guaranteed to converge to its minimum. The rate of convergence can then be established through the work in [43], which provides convergence guarantees and rates of convergence for block coordinate descent on multiconvex objectives. Each factor-wise problem (III-A) for learning factor graphs is strongly convex. The strong convexity of the factor-wise problems, in conjunction with [43, Theorem 2.9], implies that Alg. 2 converges to its critical points at a linear rate. ∎
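The block coordinate descent behavior invoked here (monotone decrease when each block update exactly minimizes a convex subproblem) can be illustrated on a toy biconvex problem; this is a generic sketch, not Alg. 2 itself:

```python
import numpy as np

# Toy biconvex problem: min_{u, v} ||M - u v^T||_F^2. Fixing one block makes
# the other a least-squares problem, so each alternating update is an exact
# convex minimization and the objective never increases.
rng = np.random.default_rng(5)
M = np.outer(rng.random(6), rng.random(4)) + 0.01 * rng.random((6, 4))

u, v = rng.random(6), rng.random(4)
errs = []
for _ in range(50):
    u = M @ v / (v @ v)          # exact minimizer over u with v fixed
    v = M.T @ u / (u @ u)        # exact minimizer over v with u fixed
    errs.append(np.linalg.norm(M - np.outer(u, v)))
```

The recorded residuals are nonincreasing, mirroring the monotone convergence of the factor-wise updates; the linear rate for strongly convex blocks is the content of [43, Theorem 2.9].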

References

  • [1] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst, “Geometric deep learning: Going beyond Euclidean data,” IEEE Signal Processing Mag., vol. 34, no. 4, pp. 18–42, 2017.
  • [2] A. Sandryhaila and J. M. F. Moura, “Discrete signal processing on graphs,” IEEE Trans. Signal Processing, vol. 61, no. 7, pp. 1644–1656, 2013.
  • [3] ——, “Big data analysis with signal processing on graphs: Representation and processing of massive data sets with irregular structure,” IEEE Signal Processing Mag., vol. 31, no. 5, pp. 80–90, 2014.
  • [4] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst, “The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains,” IEEE Signal Processing Mag., vol. 30, no. 3, pp. 83–98, 2013.
  • [5] A. Ortega, P. Frossard, J. Kovačević, J. M. F. Moura, and P. Vandergheynst, “Graph signal processing: Overview, challenges, and applications,” Proceedings of the IEEE, vol. 106, no. 5, pp. 808–828, 2018.
  • [6] B. Pasdeloup, V. Gripon, G. Mercier, D. Pastor, and M. G. Rabbat, “Characterization and inference of graph diffusion processes from observations of stationary signals,” IEEE Trans. Signal and Inform. Processing over Netw., vol. 4, no. 3, pp. 481–496, 2017.
  • [7] H. E. Egilmez, E. Pavez, and A. Ortega, “Graph learning from filtered signals: Graph system and diffusion kernel identification,” IEEE Trans. Signal and Inform. Processing over Netw., vol. 5, no. 2, pp. 360–374, 2018.
  • [8] S. Segarra, A. G. Marques, G. Mateos, and A. Ribeiro, “Network topology inference from spectral templates,” IEEE Trans. Signal and Inform. Processing over Netw., vol. 3, no. 3, pp. 467–483, 2017.
  • [9] S. P. Chepuri, S. Liu, G. Leus, and A. O. Hero, “Learning sparse graphs under smoothness prior,” in Proc. IEEE Intl. Conf. Acoustics, Speech, and Signal Processing (ICASSP’17), 2017, pp. 6508–6512.
  • [10] V. Kalofolias and N. Perraudin, “Large scale graph learning from smooth signals,” in Proc. Intl. Conf. Learning Representations (ICLR’19), 2019.
  • [11] H. E. Egilmez, E. Pavez, and A. Ortega, “Graph learning from data under laplacian and structural constraints,” IEEE J. Select. Topics in Signal Processing, vol. 11, no. 6, pp. 825–841, 2017.
  • [12] X. Dong, D. Thanou, P. Frossard, and P. Vandergheynst, “Learning laplacian matrix in smooth graph signal representations,” IEEE Trans. Signal Processing, vol. 64, no. 23, pp. 6160–6173, 2016.
  • [13] V. Kalofolias, “How to learn a graph from smooth signals,” in Proc. Artificial Intelligence and Statistics (AISTATS’16), 2016, pp. 920–929.
  • [14] S. K. Kadambari and S. P. Chepuri, “Learning product graphs from multidomain signals,” in Proc. IEEE Intl. Conf. Acoustics, Speech and Signal Processing (ICASSP’20), 2020, pp. 5665–5669.
  • [15] S. Chen, A. Sandryhaila, J. M. F. Moura, and J. Kovačević, “Signal recovery on graphs: Variation minimization,” IEEE Trans. Signal Processing, vol. 63, no. 17, pp. 4609–4624, 2015.
  • [16] S. Chen, R. Varma, A. Singh, and J. Kovačević, “Signal recovery on graphs: Fundamental limits of sampling strategies,” IEEE Trans. Signal and Inform. Processing over Netw., vol. 2, no. 4, pp. 539–554, 2016.
  • [17] V. N. Ioannidis, Y. Shen, and G. B. Giannakis, “Semi-blind inference of topologies and dynamical processes over dynamic graphs,” IEEE Trans. Signal Processing, vol. 67, no. 9, pp. 2263–2274, 2019.
  • [18] E. Ceci, Y. Shen, G. B. Giannakis, and S. Barbarossa, “Signal and graph perturbations via total least-squares,” in Proc. Asilomar Conf. Signals, Systems, and Computers, 2018, pp. 747–751.
  • [19] S. Sardellitti, S. Barbarossa, and P. Di Lorenzo, “Graph topology inference based on sparsifying transform learning,” IEEE Trans. Signal Processing, vol. 67, no. 7, pp. 1712–1727, 2019.
  • [20] J. Friedman, T. Hastie, and R. Tibshirani, “Sparse inverse covariance estimation with the graphical lasso,” Biostatistics, vol. 9, no. 3, pp. 432–441, 2008.
  • [21] K. Greenewald, S. Zhou, and A. Hero III, “Tensor graphical lasso (TeraLasso),” R. Stat. Soc. B, vol. 81, pp. 901–931, 2019.
  • [22] T. Tsiligkaridis and A. O. Hero, “Covariance estimation in high dimensions via kronecker product expansions,” IEEE Trans. Signal Processing, vol. 61, no. 21, pp. 5347–5360, 2013.
  • [23] T. G. Kolda and B. W. Bader, “Tensor decompositions and applications,” SIAM Rev., vol. 51, no. 3, pp. 455–500, 2009.
  • [24] R. A. Horn, C. R. Johnson, and L. Elsner, Topics in Matrix Analysis.   Cambridge University Press, 1994.
  • [25] S. Boyd and L. Vandenberghe, Convex Optimization.   Cambridge University Press, 2004.
  • [26] S. Wang and N. Shroff, “A new alternating direction method for linear programming,” in Proc. Advances in Neural Information Processing Systems (NeurIPS’17), 2017, pp. 1480–1488.
  • [27] J. Eckstein and D. P. Bertsekas, “An alternating direction method for linear programming,” Massachusetts Institute of Technology, Tech. Rep. LIDS-P-1967, Apr. 1990.
  • [28] G. I. Allen and R. Tibshirani, “Transposable regularized covariance models with an application to missing data imputation,” Annals Applied Stat., vol. 4, no. 2, p. 764, 2010.
  • [29] J. Leskovec, D. Chakrabarti, J. Kleinberg, C. Faloutsos, and Z. Ghahramani, “Kronecker graphs: An approach to modeling networks,” J. Machine Learning Res., vol. 11, no. Feb, pp. 985–1042, 2010.
  • [30] N. Perraudin, J. Paratte, D. Shuman, L. Martin, V. Kalofolias, P. Vandergheynst, and D. K. Hammond, “GSPBOX: A toolbox for signal processing on graphs,” arXiv e-print arXiv:1408.5781, Aug. 2014.
  • [31] E. Kalnay, M. Kanamitsu, R. Kistler, W. Collins, D. Deaven, L. Gandin, M. Iredell, S. Saha, G. White, J. Woollen et al., “The NCEP/NCAR 40-year reanalysis project,” Bull. Amer. Meteor. Soc., vol. 77, no. 3, pp. 437–472, 1996.
  • [32] D. G. Luenberger, Optimization by Vector Space Methods.   John Wiley & Sons, 1997.
  • [33] K. Yamada, Y. Tanaka, and A. Ortega, “Time-varying graph learning based on sparseness of temporal variation,” in Proc. IEEE Intl. Conf. Acoustics, Speech and Signal Processing (ICASSP’19), 2019, pp. 5411–5415.
  • [34] V. Kalofolias, A. Loukas, D. Thanou, and P. Frossard, “Learning time varying graphs,” in IEEE Intl. Conf. Acoustics, Speech and Signal Processing (ICASSP’17), 2017, pp. 2826–2830.
  • [35] C. Craddock, Y. Benhajali, C. Chu, F. Chouinard, A. Evans, A. Jakab, B. S. Khundrakpam, J. D. Lewis, Q. Li, M. Milham et al., “The Neuro Bureau Preprocessing Initiative: Open sharing of preprocessed neuroimaging data and derivatives,” Neuroinformatics, 2013.
  • [36] M. Narayan, “Preprocessed ABIDE Dataset (UM and UCLA),” Dec 2015. [Online]. Available: https://figshare.com/articles/Preprocessed_ABIDE_Dataset/1533313/2
  • [37] M. Narayan and G. I. Allen, “Mixed effects models for resampled network statistics improves statistical power to find differences in multi-subject functional connectivity,” Frontiers in Neuroscience, vol. 10, p. 108, 2016.
  • [38] A. Dobra, “Variable selection and dependency networks for genomewide data,” Biostatistics, vol. 10, no. 4, pp. 621–639, 2009.
  • [39] L. Li and K.-C. Toh, “An inexact interior point method for l1-regularized sparse covariance selection,” Math. Programming Computation, vol. 2, no. 3-4, pp. 291–315, 2010.
  • [40] Y. Cui, C. Leng, and D. Sun, “Sparse estimation of high-dimensional correlation matrices,” Computational Statistics & Data Analysis, vol. 93, pp. 390–403, 2016.
  • [41] W. Sun, Z. Wang, H. Liu, and G. Cheng, “Non-convex statistical optimization for sparse tensor graphical model,” in Proc. Advances in Neural Information Processing Systems (NeurIPS’15), 2015, pp. 1081–1089.
  • [42] Y. Fang, K. A. Loparo, and X. Feng, “Inequalities for the trace of matrix product,” IEEE Trans. Automatic Control, vol. 39, no. 12, pp. 2489–2490, 1994.
  • [43] Y. Xu and W. Yin, “A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion,” SIAM J. Imaging Sci., vol. 6, no. 3, pp. 1758–1789, 2013.