
Random Projections of Sparse Adjacency Matrices

Frank Qiu, Statistics Department, University of California, Berkeley
Abstract

We analyze a random projection method for adjacency matrices, studying its utility in representing sparse graphs. We show that these random projections retain the functionality of their underlying adjacency matrices while having extra properties that make them attractive as dynamic graph representations. In particular, they can represent graphs of different sizes and vertex sets in the same space, allowing for the aggregation and manipulation of graphs in a unified manner. We also provide results on how the size of the projections needs to scale in order to preserve accurate graph operations, showing that it can scale linearly with the number of vertices while accurately retaining first-order graph information. We conclude by characterizing our random projection as a distance-preserving map of adjacency matrices analogous to the usual Johnson-Lindenstrauss map.

1 Introduction

The adjacency matrix is a popular and flexible graph representation, encoding a graph’s edgeset in an explicit and easily accessible manner. Furthermore, many natural graph operations correspond to linear algebraic operations on the adjacency matrix, such as edge addition/deletion, edge or vertex queries, and edge composition. The properties of the adjacency matrix and related quantities like the graph Laplacian also yield important insights about the underlying graph [19] [1].

However, a major defect of the adjacency matrix is that its size scales quadratically with the number of vertices. In many real-world applications where the underlying graph is extremely large, this presents a fundamental problem. For example, the Stanford Large Network Dataset Collection includes graphs derived from social, communication, and road networks that each have millions of vertices. The adjacency matrices of such large graphs would require trillions of parameters to store and work with, which is infeasible in practice. Hence, there has been much interest in graph compression techniques, with particular interest in sparse graph techniques, since a majority of real-world graphs tend to have sparsely connected vertices.

In a previous work [18], we introduced a graph embedding method that represents a graph’s edgeset as a random sum. This method can be viewed as a random projection technique for adjacency matrices, and in this paper we discuss its application to large sparse graphs. In particular, we show that these random projections have the same space and operational time complexity as popular sparse graph representations like the Compressed Sparse Row (CSR) format. Moreover, they retain all of the linear-algebraic graph operations of adjacency matrices, giving a flexible and computationally efficient representation. Interestingly, the random projection described in this paper is an inner-product preserving projection on the space of adjacency matrices, analogous to the Johnson-Lindenstrauss map for vectors [7].

2 Related Work

We proposed our random projection method in previous work [18], framing it as a graph embedding method in the spirit of hyperdimensional computing (HDC) [8][10][6]. HDC graph representations encode vertices as high-dimensional vectors, bind these vectors to generate edge embeddings, and sum these edge embeddings to represent the entire edgeset: a bind-and-sum approach [15][12][17][16][9]. Our method can be classified as a bind-and-sum method, since we assign a random vector to each vertex and represent an edge by binding the source and target vertices via the tensor (outer) product. However, our method is purely a random projection and does not seek to learn a good node or graph embedding, instead leveraging pseudo-orthogonality to maintain properties of the adjacency matrix.

Our randomized projections can also be viewed as a “reverse” adjacency matrix factorization. These techniques seek to find an informative and compressed factorization of the adjacency matrix (or adjacency tensor for typed graphs). For example, the RESCAL algorithm [14] applied to an $n\times n$ adjacency matrix $A$ learns a $d\times n$ entity matrix $E$ and a $d\times d$ relational matrix $R$ such that $A\approx ERE^{T}$. Here, the columns of $E$ are the entity (vertex) embeddings that encode information about each entity, and the matrix $R$ encodes information about entity-entity interactions. We work with the source-target factorization of an adjacency matrix $A=CD^{T}$, where $C$ and $D$ are $n\times k$ matrices whose $i^{th}$ columns are the respective coordinate vectors of the source and target vertex of the $i^{th}$ edge. Our method encodes each vertex as a random vector and then replaces the columns of $C$ and $D$ with their corresponding random vertex codes.

As a random projection of matrices, our method falls within the field of randomized numerical linear algebra [13]. These are randomized algorithms and techniques for solving problems in linear algebra, such as solving linear systems and finding maximum/minimum eigenvalues. These techniques aim to offer advantages in speed and memory over their deterministic counterparts, and our method aims to provide both in representing and working with a sparse adjacency matrix. Among these algorithms, the FastRP algorithm [4] is another example of a random projection method. It computes node embeddings by taking a weighted sum of powers of the adjacency matrix and randomly projecting them into a $d\times n$ matrix whose columns form the node embeddings. While the goal of FastRP is to generate node embeddings that capture multi-hop information, our method seeks to informatively embed the adjacency matrix itself.

3 Intuition Behind the Projection

Given a directed graph $G=(V,E)$ and an arbitrary ordering of its vertex set $V$, its adjacency matrix can be expressed as a sum of outer products of coordinate vectors:

A=\sum_{ij:A_{ij}=1}e_{i}e_{j}^{T}

In this form, it is easy to see that many graph operations correspond to linear operations on the adjacency matrix. For example, adding/deleting the edge $(v_{s},v_{t})$ corresponds to adding/subtracting the matrix $e_{s}e_{t}^{T}$ from $A$; querying if the graph contains the edge $(v_{s},v_{t})$ corresponds to the product $e_{s}^{T}Ae_{t}$; the $k^{th}$ matrix power encodes $k$-length paths. One property these fundamental graph operations share is that they only require orthonormality of the vertex vectors. If we swap the coordinate vectors with any set of orthonormal vectors, the transformed adjacency matrix retains all of its functionality.
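
As a concrete illustration of the discussion above, the following sketch (ours, using NumPy; the graph and edge choices are purely illustrative) builds a small adjacency matrix from outer products of coordinate vectors and performs the operations just described.

```python
import numpy as np

n = 5
E = np.eye(n)                      # E[:, i] is the coordinate vector e_i

def edge_matrix(s, t):
    # Rank-one matrix e_s e_t^T representing the directed edge (v_s, v_t).
    return np.outer(E[:, s], E[:, t])

# Build A as a sum of outer products for the edges (0,1), (1,2), (2,3).
A = edge_matrix(0, 1) + edge_matrix(1, 2) + edge_matrix(2, 3)

# Edge query: e_s^T A e_t equals 1 exactly when (v_s, v_t) is an edge.
assert E[:, 0] @ A @ E[:, 1] == 1.0
assert E[:, 0] @ A @ E[:, 2] == 0.0

# Edge deletion: subtract the corresponding outer product.
A -= edge_matrix(2, 3)

# Paths: e_0^T A^2 e_2 counts the 2-step paths from v_0 to v_2.
assert E[:, 0] @ (A @ A) @ E[:, 2] == 1.0
```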

For example, let $B=[b_{1}\cdots b_{n}]$ be any orthogonal matrix, whose columns form an orthonormal basis. Changing bases to $B$, our transformed adjacency matrix $A_{B}$ takes the form:

A_{B}=\sum_{ij:A_{ij}=1}b_{i}b_{j}^{T}=BAB^{T}

In this form, we can perform all the usual adjacency matrix operations by swapping coordinate vectors with their counterparts in $B$. For example, adding/deleting the edge $(v_{s},v_{t})$ now corresponds to adding/subtracting the matrix $b_{s}b_{t}^{T}$ from $A_{B}$; the edge query of $(v_{s},v_{t})$ becomes $b_{s}^{T}A_{B}b_{t}$; matrix powers still correspond to finite-length paths in the sense that $b_{i}^{T}A_{B}^{k}b_{j}$ is nonzero if and only if there is a $k$-length path from $v_{i}$ to $v_{j}$.

From the above discussion, we see that orthonormality is the key property required for exact graph functionality. Hence, relaxing to approximate orthonormality allows us to compress adjacency matrices while retaining graph functionality in a minimally noisy manner. We make use of random pseudo-orthonormal vectors, which are vectors whose dot products are negligible with high probability. While we can pack at most $d$ orthonormal vectors in a $d$-dimensional space, we can pack many more pseudo-orthonormal vectors in the same space. This allows us to compress adjacency matrices, and in the case of sparse matrices this compression effect is especially pronounced.

4 Random Projection of Adjacency Matrices

Consider a directed graph $G=(V,E)$ with $|V|=n$. An ordering of the vertex set $V$ induces the $n\times n$ adjacency matrix $A$:

A_{ij}=\begin{cases}1\quad\text{if $(v_{i},v_{j})\in E$}\\ 0\quad\text{otherwise}\end{cases}

We then construct a random $d\times n$ projection matrix $P_{V}$, where the columns of $P_{V}$ are sampled i.i.d. from the uniform distribution on the $d$-dimensional unit sphere $\mathbb{S}^{d-1}$. The random projection $\pi_{V}(A)$ is then given by the following equation:

\pi_{V}(A)=P_{V}AP_{V}^{T}

Expressing the adjacency matrix as a sum of outer products of coordinate vectors, $A=\sum e_{i}e_{j}^{T}$, the projection swaps the $i^{th}$ coordinate vector with the $i^{th}$ column $p_{i}$ of $P_{V}$: $\sum e_{i}e_{j}^{T}\mapsto\sum p_{i}p_{j}^{T}$. In light of the discussion in Section 3, this can be viewed as a pseudo-orthogonal change of basis.
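
The following sketch (ours; NumPy is assumed and the sizes are arbitrary) constructs such a projection by sampling unit-sphere columns for $P_{V}$ and checks that $P_{V}AP_{V}^{T}$ agrees with the sum of projected edge outer products.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_codes(n, d):
    # d x n matrix whose columns are i.i.d. uniform samples from the unit sphere S^{d-1}.
    P = rng.standard_normal((d, n))
    return P / np.linalg.norm(P, axis=0)

n, d = 1000, 256
P = random_codes(n, d)

# A small edgeset, stored densely here only for clarity.
edges = [(0, 1), (1, 2), (5, 7)]
A = np.zeros((n, n))
for s, t in edges:
    A[s, t] = 1.0

# pi_V(A) = P_V A P_V^T, equivalently the sum of outer products p_s p_t^T over the edges.
proj = P @ A @ P.T
proj_from_edges = sum(np.outer(P[:, s], P[:, t]) for s, t in edges)
assert np.allclose(proj, proj_from_edges)
```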

4.1 Graph Operations on Random Projections

We saw in Section 3 that operations in a new orthonormal basis require swapping coordinate vectors with the new basis vectors. Similarly, we can perform all the usual graph operations with the projection $\pi_{V}(A)$ by substituting the appropriate random code for each vertex. For example, to query if the edge $(v_{i},v_{j})$ is in the edgeset, we would usually compute the product $e_{i}^{T}Ae_{j}$. This returns a 1 if $(v_{i},v_{j})$ is an edge and 0 otherwise. The randomized analogue then takes the form $p_{i}^{T}\pi_{V}(A)p_{j}$. This quantity is close to 1 with high probability if $(v_{i},v_{j})$ is an edge and close to 0 with high probability otherwise.

In general, every graph operation has an analogue on the projected matrices by substituting in the random vertex codes. Rather than using the coordinate vector $e_{i}$ associated with vertex $v_{i}$, when working with the randomized projection $\pi_{V}(A)$ we use the $i^{th}$ column vector $p_{i}$ of the projection matrix $P_{V}$ instead.
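
A short sketch of the approximate edge query (ours; the graph, dimensions, and seed are illustrative) is given below: present edges score close to 1 and absent edges close to 0.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 2000, 512
P = rng.standard_normal((d, n))
P /= np.linalg.norm(P, axis=0)            # columns are random unit-sphere vertex codes

edges = [(0, 1), (3, 4), (10, 11)]
proj = sum(np.outer(P[:, s], P[:, t]) for s, t in edges)   # pi_V(A)

def edge_query(i, j):
    # Randomized analogue of e_i^T A e_j.
    return P[:, i] @ proj @ P[:, j]

print(edge_query(0, 1))    # close to 1: (v_0, v_1) is an edge
print(edge_query(0, 2))    # close to 0: (v_0, v_2) is not an edge
```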

4.2 Changing the Vertex Set and Graph Aggregation

One attractive property of our random projections is that they transform nicely under changes to the underlying vertex set. Suppose we wish to expand our graph by adding a new vertex $v_{n+1}$ and new edges $\{(v_{i},v_{n+1})\}_{i}$. With the usual $n\times n$ adjacency matrix, this would require expanding to an $(n+1)\times(n+1)$ matrix and filling in the appropriate entries of the added column and row. However, for the projected matrix we only need to generate a new random vector $p_{n+1}$ and add the appropriate edges to the projection: $\pi_{V}(A)+\sum_{i}p_{i}p_{n+1}^{T}$.
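
The sketch below (ours; names and sizes are illustrative) grows the vertex set without resizing the $d\times d$ projection: a new vertex only requires drawing one new random code.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 512

def random_code():
    # Uniform sample from the unit sphere S^{d-1}.
    v = rng.standard_normal(d)
    return v / np.linalg.norm(v)

# Existing projection: codes for v_0, v_1 and the single edge (v_0, v_1).
codes = [random_code(), random_code()]
proj = np.outer(codes[0], codes[1])

# Add vertex v_2 and the new edges (v_0, v_2), (v_1, v_2); the matrix stays d x d.
codes.append(random_code())
proj += np.outer(codes[0], codes[2]) + np.outer(codes[1], codes[2])

print(codes[0] @ proj @ codes[2])         # approx 1: the new edge (v_0, v_2) is present
```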

Our random matrix projection can be viewed as a projection method for the set of graphs whose vertices are in a fixed vertex set $V$. In light of the above discussion, this projection can be naturally extended to sets containing $V$ or restricted to subsets of $V$ by expanding or restricting the set of random vertex codes respectively. This property is particularly attractive for dynamic graph representations, where the edge and vertex sets of the graph change over time. The dimension of the projected matrices stays the same under changes to the vertex set, and the addition/deletion of edges involving new vertices is simple. We only need to keep in mind the total capacity of the projection space: how many edges we can add to a projected matrix before accuracy in graph operations begins to break down. We analyze this behavior in Section 5.

4.3 Translation between Different Projections

Once we assign a random code to each vertex and construct our projection matrix $P_{V}$, that vertex-vector codebook is fixed. However, we might desire to reassign a new random vector to each vertex. This translation procedure has a simple analogue for projected matrices. Suppose we have a vertex set $V$ with two different random projection matrices $P_{V}$ and $Q_{V}$. The $i^{th}$ columns of $P_{V}$ and $Q_{V}$ are the random vectors assigned to vertex $v_{i}$ under each projection respectively. In order to swap the vector $p_{i}$ with $q_{i}$, we construct the translation matrix $T_{(P,Q)}=Q_{V}P_{V}^{T}$. By pseudo-orthonormality, we see that $T_{(P,Q)}p_{i}\approx q_{i}$ for every $i$. Hence, if $A_{P}$ is the projection of $A$ under $P_{V}$ and $A_{Q}$ is the projection under $Q_{V}$, we have the following relation:

A_{Q}\approx T_{(P,Q)}A_{P}T_{(P,Q)}^{T}
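
The following sketch (ours; the codebooks and sizes are illustrative) builds the translation matrix and checks that vertex codes and edge queries survive the change of codebook.

```python
import numpy as np

n, d = 100, 1024

def random_codes(seed):
    g = np.random.default_rng(seed)
    M = g.standard_normal((d, n))
    return M / np.linalg.norm(M, axis=0)

P, Q = random_codes(10), random_codes(11)     # two codebooks for the same vertex set
T = Q @ P.T                                   # translation matrix T_(P,Q) = Q_V P_V^T

print(T @ P[:, 0] @ Q[:, 0])                  # approx 1: T maps p_0 onto q_0

edges = [(0, 1), (2, 3)]
A_P = sum(np.outer(P[:, s], P[:, t]) for s, t in edges)   # projection under P_V
A_translated = T @ A_P @ T.T                              # approx the projection under Q_V

print(Q[:, 0] @ A_translated @ Q[:, 1])       # approx 1: edge (v_0, v_1) survives translation
```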

4.4 Graph Subsets and Aggregation

Given a subset $S\subseteq V$, the subgraph generated by $S$, denoted $G_{S}\subset G$, is the subgraph generated by all edges involving vertices in $S$. Given the projected adjacency matrix $\pi(A_{G})$, we can extract the projection of the subgraph $\pi(A_{G_{S}})$ by the following procedure. Let $P_{S}$ be the $d\times|S|$ matrix whose columns are the random codes assigned to each vertex of $S$, and define $T_{S}=P_{S}P_{S}^{T}$. By pseudo-orthonormality, we have the following approximate relation:

\pi(A_{G_{S}})\approx T_{S}\pi(A_{G})T_{S}^{T}
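
A sketch of the extraction step (ours; the subset, edges, and sizes are illustrative): edges inside $S$ survive while edges outside $S$ are attenuated toward zero.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 100, 1024
P = rng.standard_normal((d, n))
P /= np.linalg.norm(P, axis=0)            # random vertex codes

edges = [(0, 1), (1, 2), (50, 60)]        # (50, 60) lies outside the subset S below
proj_G = sum(np.outer(P[:, s], P[:, t]) for s, t in edges)

S = [0, 1, 2, 3]
P_S = P[:, S]
T_S = P_S @ P_S.T                         # d x d restriction operator for S

proj_sub = T_S @ proj_G @ T_S.T           # approx projection of the subgraph on S

print(P[:, 0] @ proj_sub @ P[:, 1])       # approx 1: edge inside S is kept
print(P[:, 50] @ proj_sub @ P[:, 60])     # approx 0: edge outside S is suppressed
```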

This subset procedure, along with graph translation, can be applied to aggregate multiple graphs into a single large graph. For example, suppose we have two disjoint graphs $G$ and $H$. Their aggregate graph $G\cup H=(V_{G}\cup V_{H},E_{G}\cup E_{H})$ is the result of combining their vertex and edge sets. We saw earlier that edge addition corresponds to adding the projected edges to our projected matrices. Therefore, given projections $\pi(A_{G})$ and $\pi(A_{H})$, the projection of their aggregate is the sum of their projections:

\pi(A_{G\cup H})=\pi(A_{G})+\pi(A_{H})

This can be used in tandem with the subsetting procedure to combine selected subgraphs from a collection of graphs. Even when their vertex codes differ, the translation procedure mentioned previously allows us to translate them all into a single codebook before aggregation.

One interesting application of this suite of procedures is allowing a divide-and-conquer approach to storing a large sparse graph. Graph operations on the projected matrices depend on the size of the projection space (see Section 5), so breaking a large graph into subgraphs allows us to store their projections in a smaller projection space. Operations involving an individual subgraph also happen in a lower-dimensional space and have decreased time complexity. However, when the need arises to operate on information from a pool of these subgraphs, we can use the above subset and pooling procedures to easily generate a wide suite of projections that correspond to those generated by their graph aggregates.

The above examples demonstrate that we can represent graphs of varying size and different vertex sets in the same projection space. Translation and pooling operations are of fixed dimension, allowing for a unified way to manipulate and combine a broad range of graphs. This suggests that random projections are also ideal for applications where graph aggregation is an important operation, such as the divide-and-conquer situation described above.

5 Scaling Properties of Random Projections

We now study how the size of the random projection space needs to scale with the underlying graph. Importantly, we show that the size scales with the size of the edgeset rather than the vertex set. This makes our method particularly amenable to sparse graphs, where the size of the edgeset is proportional to the size of the vertex set.

5.1 $m$-Order Graph Operations

First, we need to define $m$-order graph operations. This is important because each order requires a different scaling of the projection space. We say a graph operation has order $\boldsymbol{m}$ if it can be expressed as an operation involving the $m^{th}$ power of the adjacency matrix. For example, the edge query is a first order graph operation because it is a function of the adjacency matrix: $q((v_{i},v_{j}),G)=e_{i}^{T}Ae_{j}$.

5.2 First and Second Order Scaling

We first characterize how the size of the projection needs to scale in order to retain accurate first and second order graph operations. Intuitively, these correspond to edge information of the 1-hop and 2-hop neighborhoods of the graph respectively. We give two informal statements on how the dimension $d$ of the projection space $\mathbb{R}^{d\times d}$ must scale with the number of edges, subject to a constraint on the vertex connectivity (the number of edges each vertex participates in). An account of the technical results justifying these statements is given in Appendix A.

Property 1 (First Order Scaling).

Let $G$ be a graph with $k$ edges and maximum node degree $O(\sqrt{k})$. A random projection into $\mathbb{R}^{d\times d}$ needs to have $d=\Omega(\sqrt{k})$ in order to retain accurate first order operations.

Property 2 (Second Order Scaling).

Let $G$ be a graph with $k$ edges and maximum node degree $O(k^{1/3})$. A random projection into $\mathbb{R}^{d\times d}$ needs to have $d=\Omega(k^{2/3})$ in order to retain accurate second order operations.

For sparse graphs, the node degree condition is satisfied for all or a majority of vertices. Indeed, a closer inspection of the proofs in Appendix A shows that as long as the vertices of the relevant edges satisfy this or a much looser bound on the node degree, the results still hold. This allows us to include sparse graphs that might have central vertices, which act like hubs and connect to many other vertices. If we are only interested in first order properties of the graph, the first result implies that we can compress a large sparse adjacency matrix using $n^{2}$ parameters into a smaller matrix using $\Omega(n)$ parameters, resulting in a drastic compression for large sparse graphs.
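
As a rough illustration (the numbers are ours and chosen only for concreteness), consider a sparse graph with $n=10^{6}$ vertices and $k\approx 10^{6}$ edges. Its adjacency matrix has $n^{2}=10^{12}$ entries, whereas a projection calibrated to first order operations needs only $d=\Omega(\sqrt{k})\approx 10^{3}$, i.e. a $d\times d$ matrix with on the order of $10^{6}$ parameters.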

5.3 $m$-Order Scaling

While first and possibly second order operations comprise the bulk of the interesting graph operations, for completeness we give the following informal scaling statement for $m$-order graph operations.

Property 3 ($m$-Order Scaling).

Let $G$ be a graph with $k$ edges and maximum node degree $O(k^{1/(m+1)})$. A random projection into $\mathbb{R}^{d\times d}$ needs to have $d=\Omega(k^{\frac{m}{m+1}})$ in order to retain accurate $m$-order operations.

6 Edge Representations of Sparse Matrices

In this section, we analyze, for various sparse matrix representations, both their space complexity and the time complexity of graph operations. We also consider the time complexity of numerical sparse matrix methods when appropriate. In this manner, we hope to contextualize both the strengths and weaknesses of our random projections relative to other sparse matrix representations. Table 1 summarizes the results of this section.

6.1 Alternate Sparse Matrix Representations

We first consider the coordinate list representation. This represents a sparse matrix as a list of 3-tuples $(r,c,v)$, one per non-zero entry, where $r$ and $c$ are the row and column indices and $v$ is the value. A related representation is the Dictionary of Keys (DoK) representation, which is similar to the coordinate list except that each row-column index $(r,c)$ serves as a key with value $v$. Finally, the CSR format represents a matrix using three arrays containing the non-zero values, their column indices, and the row pointers marking where each row’s entries begin, respectively. The CSR format can be easily computed from the previous two representations and vice versa.
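
For concreteness, the sketch below (ours; it assumes SciPy, which the paper does not use) builds the same tiny matrix in all three formats and prints the three CSR arrays.

```python
import numpy as np
from scipy.sparse import coo_matrix

rows, cols = np.array([0, 1, 2]), np.array([1, 2, 0])
vals = np.ones(3)

A_coo = coo_matrix((vals, (rows, cols)), shape=(4, 4))   # coordinate list
A_dok = A_coo.todok()                                    # dictionary of keys
A_csr = A_coo.tocsr()                                    # compressed sparse row

print(A_dok[0, 1])        # keyed lookup of the entry for edge (0, 1)
print(A_csr.data)         # non-zero values
print(A_csr.indices)      # their column indices
print(A_csr.indptr)       # row pointers: where each row's entries begin
```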

6.2 Space Complexity

A sparse graph with $n$ vertices has $O(n)$ edges. Hence, from their descriptions, all three of the alternate sparse matrix representations require $O(n)$ parameters. For random projections, if we are only concerned with representing first-order properties, Theorem A.1 establishes that the $d\times d$ projections must have $d=\Omega(\sqrt{n})$, meaning the projections are of size $O(n)$.

6.3 Graph Operation Complexity

We analyze some graph operations whose complexity differs based on the representation. For data structure operations, we use their complexity in Python as stated in the Python Wiki. Throughout this section, our random projections are $d\times d$ matrices.

6.3.1 Edge Query

To look up whether an edge $(v_{i},v_{j})$ exists in a coordinate list, we merely check if the list contains a tuple whose first two entries are $(i,j)$. Since the coordinate list has length $O(n)$, list lookup has complexity $O(n)$. As a three-list format, the CSR representation has the same complexity. However, the “in” operator for dictionaries is $O(1)$, meaning edge lookup in DoK is $O(1)$. The edge query for random projections is the product $p_{i}^{T}\pi_{V}(A)p_{j}$, which has complexity $O(d^{2})$. If our projection space is calibrated to preserve only first-order operations, then this edge query has complexity $O(n)$.

6.3.2 Edge Composition

Edge composition is a more complicated case. For each of the sparse matrix representations, edge composition requires a nested procedure where, for each edge $(i,j)$, one needs to identify all edges $(j,k)$ and return the edge $(i,k)$, which is naively $O(n^{2})$. Importantly, this naive algorithm is not parallelizable, making it especially costly to compute. Alternatively, edge composition naturally corresponds to the second power of the adjacency matrix. For a sparse adjacency matrix with $O(n)$ edges, numerical methods for sparse matrices have complexity at most $O(n^{2})$ [2].

As for random projections, the naive algorithm for the multiplication of two $d\times d$ matrices has complexity $O(d^{3})$ [5] [11]. In order to retain accuracy for second-order operations, Property 2 states that $d=\Omega(n^{2/3})$. Thus, computing the second matrix power of our projections has naive complexity $O(n^{2})$, while methods like Strassen’s algorithm [11] can speed it up even further.

6.4 Fast Numerical Linear Algebra

One important aspect of our random projections is that they benefit from fast numerical linear algebra methods, such as parallel computation, which the other sparse matrix representations do not enjoy, as we saw with the edge composition example. Hence, in practice the time complexity of graph operations with random projections can be significantly reduced, and the method stands to benefit from further advances in numerical linear algebra.

| | Random Projections | DoK | CL | CSR | Sparse NLA |
|---|---|---|---|---|---|
| Space Complexity | $O(n)$ | $O(n)$ | $O(n)$ | $O(n)$ | N/A |
| Edge Query | $O(n)$ | $O(1)$ | $O(n)$ | $O(n)$ | N/A |
| Edge Composition | $O(n^{2})^{*}$ | $O(n^{2})$ | $O(n^{2})$ | $O(n^{2})$ | N/A |
| Matrix Addition | $O(n)$ | N/A | N/A | N/A | $O(n)$ |
| Matrix Multiplication | $O(n^{2})^{*}$ | N/A | N/A | N/A | $O(n^{2})$ |

Table 1: Table of space/time complexity for various graph and matrix operations. The underlying sparse graph has $n$ vertices and $O(n)$ edges. From left to right, the methods considered are random projections, dictionary of keys (DoK), coordinate list (CL), compressed sparse row (CSR), and sparse numerical linear algebra (Sparse NLA). For matrix multiplication, we use the time complexity of the naive algorithm rather than other advanced methods [11]. We assume our random projections are calibrated to first order operations, except for edge composition/matrix multiplication (*) where we assume calibration to second order operations.

7 Johnson-Lindenstrauss Analogue for Graphs

In the introduction, we stated that our random projection method is a norm-preserving map analogous to the one given by the Johnson-Lindenstrauss (JL) lemma for vectors. Here, we first discuss a natural inner product on the space of adjacency matrices. We then state results showing how our random projections preserve this inner product and its associated norm for a finite set of adjacency matrices, characterizing our method as a JL map for the space of graphs.

7.1 A Natural Inner Product for Adjacency Matrices

In our previous work [18], we noted that there is a natural inner product on the space of adjacency matrices that counts the number of shared edges. Consider two graphs $G$ and $H$ whose vertex sets are contained in some larger set $V$. Given an ordering of $V$, we have the induced adjacency matrices $A_{G}$ and $A_{H}$. Define the inner product between their adjacency matrices as:

\langle A_{G},A_{H}\rangle\coloneqq tr(A_{G}^{T}A_{H})

Expressing each adjacency matrix as the sum of coordinate outer products, we see that this function counts the number of edges common to both graphs:

tr(A_{G}^{T}A_{H})=tr\big(\big[\sum_{(i,j)\in E_{G}}e_{i}e_{j}^{T}\big]^{T}\big[\sum_{(k,l)\in E_{H}}e_{k}e_{l}^{T}\big]\big)
=tr\big(\sum_{(i,j)\in E_{G},\,(i,l)\in E_{H}}e_{j}e_{l}^{T}\big)
=\sum_{(i,j)\in E_{G}\cap E_{H}}1
=|E_{G}\cap E_{H}|

One can check that this is indeed an inner product, and in fact it generates the Frobenius norm: $tr(A^{T}A)=||A||_{F}^{2}$.
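
A quick numerical check of this identity (ours; the two small graphs are illustrative) is given below.

```python
import numpy as np

n = 6

def adjacency(edges):
    A = np.zeros((n, n))
    for s, t in edges:
        A[s, t] = 1.0
    return A

A_G = adjacency([(0, 1), (1, 2), (2, 3)])
A_H = adjacency([(1, 2), (2, 3), (4, 5)])

inner = np.trace(A_G.T @ A_H)
print(inner)              # 2.0: the graphs share the edges (1, 2) and (2, 3)
```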

Note that we assumed the vertex sets of both graphs were contained in some larger set $V$. This is not an issue for a finite set of graphs $G_{i}$ with finite vertex sets $V_{i}$, as we can define $V=\cup V_{i}$. The adjacency matrix of a graph with respect to this larger set $V$ is then a block matrix that contains the original adjacency matrix as its single non-zero block.

7.2 Random Projections are a JL Map

In light of the graph inner product defined in Section 7.1, we almost have a JL-type result, since the Frobenius norm of matrices coincides with the Euclidean norm if we regard matrices in $\mathbb{R}^{d\times d}$ as vectors in $\mathbb{R}^{d^{2}}$. We confirm this by proving that, with high probability, our random projection method is a map that satisfies the conditions of the JL lemma.

Theorem 7.1 (Random Projections are a JL map).

Consider a set of $N$ graphs with adjacency matrices $A_{1},\cdots,A_{N}$. For small $\epsilon>0$ and $d=\Omega(\frac{\log N}{\epsilon^{2}})$, with high probability our random projections satisfy:

(1-\epsilon)||A_{i}-A_{j}||^{2}\leq||\pi(A_{i})-\pi(A_{j})||^{2}\leq(1+\epsilon)||A_{i}-A_{j}||^{2}\quad\forall i,j \qquad (1)

8 Discussion

We presented a random projection method for sparse adjacency matrices. These random projections exploit the pseudo-orthonormality of random vectors to drastically compress the adjacency matrix while still retaining all of its functionality. While the exact scaling depends on the graph properties we wish to preserve, Theorem A.1 shows that we can compress $n\times n$ sparse matrices into $\Omega(\sqrt{n})\times\Omega(\sqrt{n})$ matrices while preserving first order graph operations. These random projections also enjoy properties that their underlying adjacency matrices do not. Sections 4.2 and 4.3 show that random projections allow us to represent graphs of varying size and different vertex sets in the same projection space. This common space is equipped with modification and aggregation operations that apply to all graphs in the space, and random projections provide a unified way of representing and working with graphs. The complexity analysis of Section 6 shows that these random projections are competitive with existing sparse matrix representations and numerical methods, and they can take advantage of numerical techniques for speeding up linear algebra operations. All these properties suggest that random adjacency matrix projections are a dynamic, flexible, and expressive graph compression technique well suited to a variety of applications where large sparse matrices occur.

One interesting application of our random compression technique would be to the field of graph neural networks [20]. Such networks use the adjacency matrix during the linear portion of their computations, and random projections could help extend such networks to both large sparse graphs and time-varying, dynamic graphs.

References

  • [1] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15:1373–1396, 2003.
  • [2] Keivan Borna and Sohrab Fard. A note on the multiplication of sparse matrices. Open Computer Science, 4(1):1–11, 2014.
  • [3] Stéphane Boucheron, Gábor Lugosi, and Olivier Bousquet. Concentration Inequalities, pages 208–240. Springer Berlin Heidelberg, Berlin, Heidelberg, 2004.
  • [4] Haochen Chen, Syed Fahad Sultan, Yingtao Tian, Muhao Chen, and Steven Skiena. Fast and accurate network embeddings via very sparse random projection, 2019.
  • [5] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Third Edition. The MIT Press, 3rd edition, 2009.
  • [6] Ross W. Gayler and Simon D. Levy. A distributed basis for analogical mapping. 2009.
  • [7] William B. Johnson and Joram Lindenstrauss. Extensions of lipschitz mappings into a hilbert space. Contemporary Mathematics, 26, 1984.
  • [8] Pentti Kanerva. Sparse Distributed Memory. MIT Press, Cambridge, MA, USA, 1988.
  • [9] Jaeyoung Kang, Minxuan Zhou, Abhinav Bhansali, Weihong Xu, Anthony Thomas, and Tajana Rosing. Relhd: A graph-based learning on fefet with hyperdimensional computing. 2022 IEEE 40th International Conference on Computer Design (ICCD), pages 553–560, 2022.
  • [10] Denis Kleyko, Dmitri A. Rachkovskij, Evgeny Osipov, and Abbas Jawdat Rahim. A survey on hyperdimensional computing aka vector symbolic architectures, part ii: Applications, cognitive models, and challenges. ArXiv, abs/2112.15424, 2021.
  • [11] Yan Li, Sheng-Long Hu, Jie Wang, and Zheng-Hai Huang. An introduction to the computational complexity of matrix multiplication. Journal of the Operations Research Society of China, 8(1):29–43, 2020.
  • [12] Yunpu Ma, Marcel Hildebrandt, Volker Tresp, and Stephan Baier. Holistic representations for memorization and inference. In UAI, 2018.
  • [13] Per-Gunnar Martinsson and Joel Tropp. Randomized numerical linear algebra: Foundations and algorithms, 2021.
  • [14] Maximilian Nickel, Xueyan Jiang, and Volker Tresp. Reducing the rank in relational factorization models by including observable patterns. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014.
  • [15] Maximilian Nickel, Lorenzo Rosasco, and Tomaso A. Poggio. Holographic embeddings of knowledge graphs. In AAAI, 2016.
  • [16] Igor O. Nunes, Mike Heddes, Tony Givargis, Alexandru Nicolau, and Alexander V. Veidenbaum. Graphhd: Efficient graph classification using hyperdimensional computing. 2022 Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 1485–1490, 2022.
  • [17] Prathyush Poduval, Haleh Alimohamadi, Ali Zakeri, Farhad Imani, M. Hassan Najafi, Tony Givargis, and Mohsen Imani. Graphd: Graph-based hyperdimensional memorization for brain-like cognitive learning. Frontiers in Neuroscience, 16, 2022.
  • [18] Frank Qiu. Graph embeddings via tensor products and approximately orthonormal codes, 2022.
  • [19] Ulrike von Luxburg. A tutorial on spectral clustering, 2007.
  • [20] Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. Graph neural networks: A review of methods and applications. AI Open, 1:57–81, 2020.

Appendix A Scaling Theorems

We provide theorems justifying the properties listed in Section 5. These are mainly adaptations of results found in our previous work [18]. Throughout this section, we abuse notation and use the same symbol $u$ to denote both a vertex and its random code. Similarly, we use the same symbol $G$ to denote both the graph and its random projection.

A.1 Auxiliary Lemma for Spherical Random Vectors

We will need the following lemma [18].

Lemma A.1.

Let $X$ be the dot product between two vectors sampled uniformly and independently from the $d$-dimensional unit sphere $\mathbb{S}^{d-1}$. Then $E(X)=0$ and $E(X^{2})=\frac{1}{d}$.
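
A Monte Carlo sketch of Lemma A.1 (ours; the dimension and sample size are illustrative) is given below.

```python
import numpy as np

rng = np.random.default_rng(5)
d, trials = 100, 50_000

u = rng.standard_normal((trials, d))
v = rng.standard_normal((trials, d))
u /= np.linalg.norm(u, axis=1, keepdims=True)     # rows are uniform points on S^{d-1}
v /= np.linalg.norm(v, axis=1, keepdims=True)

x = np.sum(u * v, axis=1)                         # dot products of independent samples
print(x.mean())                                   # approx 0
print((x ** 2).mean(), 1 / d)                     # approx 1/d = 0.01
```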

A.2 Property 1: First-Order Scaling

The following theorem and its proof are an abbreviated adaptation of Theorem 10.7 in our previous work [18]. Given an edge $(v_{i},v_{j})$ and adjacency matrix $A$, the edge query $e_{i}^{T}Ae_{j}$ returns 1 if $(v_{i},v_{j})$ is an edge of the graph and 0 otherwise. Our goal is to prove that the edge query analogue for our random projection returns the correct value with high probability, showing that our random projection accurately retains first-order edge information. An edge query is a true query if the queried edge exists in the graph and a false query if it does not.

Theorem A.1.

Let $G$ be a graph with $k+1$ edges and maximum node degree $\frac{l}{2}$. For a random projection into $\mathbb{R}^{d\times d}$, we have the following:

  1. If $T$ denotes the result of a true query, then:

     P(|T-1|>\epsilon)\leq 2\exp(\frac{-\epsilon^{2}}{2(\frac{l}{d}+\frac{k-l}{d^{2}}+\frac{\epsilon}{3})}) \qquad (2)

  2. If $F$ denotes the result of a false query, then:

     P(|F|>\epsilon)\leq 2\exp(\frac{-\epsilon^{2}}{2(\frac{l}{d}+\frac{k+1-l}{d^{2}}+\frac{\epsilon}{3})}) \qquad (3)
Proof.

Consider the random projection of a graph with $k+1$ edges:

G=uv^{T}+\sum_{i=1}^{k}p_{i}q_{i}^{T} \qquad (4)

where $u$, $v$, $p_{i}$, $q_{i}$ are random vectors drawn i.i.d. from the uniform distribution on the $d$-dimensional unit sphere.

We first prove equation 2 by considering the result of querying our random projection for the existence of the edge $(u,v)$, which is a valid edge. We assume $m$ of the other edges share one vertex with $(u,v)$, and WLOG we assume they all share the vertex $u$, so that $p_{i}=u$ for those edges. The result of the true query $T$ can be expressed as:

T=u^{T}Gv=1+\sum_{i=1}^{m}\langle q_{i},v\rangle+\sum_{j=m+1}^{k}\langle u,p_{j}\rangle\langle q_{j},v\rangle \qquad (5)

From Lemma A.1, both sums have mean 0, with the first sum having variance $\frac{m}{d}$ and the second having variance $\frac{k-m}{d^{2}}$. Hence, the total variance is $\frac{m}{d}+\frac{k-m}{d^{2}}$, and applying Bernstein’s inequality gives:

P(|T-1|>\epsilon)\leq 2\exp(\frac{-\epsilon^{2}}{2(\frac{m}{d}+\frac{k-m}{d^{2}}+\frac{\epsilon}{3})})

By assumption, we can bound $m\leq l$, and plugging in the worst case of $m=l$ gives equation 2.

Similarly, now suppose we query by a spurious edge $(s,t)$. Assuming $q$ of the other edges share one vertex with $(s,t)$, we use the same argument as above to derive the false query bound given by equation 3. ∎

Letting $Q$ denote a general edge query and $\sigma^{2}=Var(Q)$, Theorem A.1 can be summarized as:

P(|Q-EQ|>\epsilon)\leq 2\exp(\frac{-\epsilon^{2}}{2(\sigma^{2}+\frac{\epsilon}{3})})

Setting the deviation threshold to $\epsilon=C\sigma$ and rewriting the bound in terms of $\sigma$:

P(|Q-EQ|>C\sigma)\leq 2\exp(\frac{-C^{2}\sigma^{2}}{2(\sigma^{2}+\frac{C\sigma}{3})})=2\exp(\frac{-C^{2}}{2(1+\frac{C}{3\sigma})})

If $l=O(\sqrt{k})$, then $\sigma^{2}=\frac{O(\sqrt{k})}{d}+\frac{O(k)}{d^{2}}$. Hence, if $d=\Omega(\sqrt{k})$ then $Q$ is close to $EQ$ with high probability. Since $EQ=1$ for a true query and $EQ=0$ for a false query, $Q$ is close to the correct value with high probability. This justifies the statement of Property 1.

A.3 Property 2: Second-Order Scaling

The following theorem and its proof are an abbreviated adaptation of Theorem 11.7 from our previous work [18]. Two edges are composable if the target vertex of one is the source vertex of the other, and an edge $(u,v)$ is in the second power of the adjacency matrix if and only if it is the composition of two composable edges. We need to show that the second matrix power accurately represents this edge information. To this end, we prove that the edge query returns the correct result with high probability when applied to the second power of the random projection.

Theorem A.2.

Let $G$ be a graph with composable edges $(u,v)$ and $(v,w)$ along with $k-2$ nuisance edges, and assume $G$ has maximum node degree $\frac{l}{2}$. For a random projection into $\mathbb{R}^{d\times d}$, we have the following results for edge queries involving the projection’s second matrix power:

  1. If $T$ denotes the result of a true query of the second matrix power, then:

     P(|T-1|>\epsilon)\leq 2\exp(\frac{-\epsilon^{2}}{2(\frac{2l+1}{d}+\frac{kl-3l-3}{d^{2}}+\frac{k^{2}-k(l+2)+l+1}{d^{3}}+\frac{\epsilon}{3})}) \qquad (6)

  2. If $F$ denotes the result of a false query of the second matrix power, then:

     P(|F|>\epsilon)\leq 2\exp(\frac{-\epsilon^{2}}{2(\frac{kl}{d^{2}}+\frac{k^{2}-kl}{d^{3}}+\frac{\epsilon}{3})}) \qquad (7)
Proof.

For the nuisance edges, assume $m_{1}$ of them share the common vertex $v$ and $m_{2}$ share the common source $u$ or target $w$, and let $m=m_{1}+m_{2}$. WLOG we may assume all $m_{2}$ such edges have common source vertex $u$, and our projection can be expressed as:

G=uv^{T}+vw^{T}+\sum_{i=1}^{m_{1}}vp_{i}^{T}+\sum_{j=1}^{m_{2}}up_{j}^{T}+\sum_{l=1}^{k-m-2}q_{l}r_{l}^{T} \qquad (8)

From equation 8, we see that the second matrix power should only contain the edge $(u,w)$.

We first prove equation 6 and query the second matrix power $G^{2}$ for the presence of the edge $(u,w)$. Letting $\epsilon$ denote the dot product of two independent spherical vectors, the result of this edge query is a sum of terms that are products of i.i.d. $\epsilon$’s. Let $n_{1}=km_{2}-m_{2}-m-3$ and $n_{2}=k^{2}-k(m_{2}+2)+m_{2}+1$. Grouping terms by how many $\epsilon$’s they contain, we write our true query $T$ as:

T=1+\sum_{i=1}^{m+1}\epsilon_{i}+\sum_{j=1}^{n_{1}}\epsilon_{j_{1}}\epsilon_{j_{2}}+\sum_{h=1}^{n_{2}}\epsilon_{h_{1}}\epsilon_{h_{2}}\epsilon_{h_{3}}

Using Lemma A.1, we see that the variance $\sigma^{2}$ of the error terms is $\sigma^{2}=\frac{m+1}{d}+\frac{n_{1}}{d^{2}}+\frac{n_{2}}{d^{3}}$. An application of Bernstein’s inequality gives:

P(|T-1|>\epsilon)\leq 2\exp(\frac{-\epsilon^{2}}{2(\frac{m+1}{d}+\frac{n_{1}}{d^{2}}+\frac{n_{2}}{d^{3}}+\frac{\epsilon}{3})})

Since $m_{1},m_{2}\leq l$ and $m\leq 2l$, plugging in the worst case of $m_{1}=m_{2}=l$ and $m=2l$ gives equation 6.

Similarly, we now consider querying $G^{2}$ by a spurious edge $(s,t)$. We assume that $m$ of the edges have either common source vertex $s$ or target vertex $t$. Then a similar computation as above shows that we can express the false query $F$ as:

F=\sum_{i=1}^{km}\epsilon_{i_{1}}\epsilon_{i_{2}}+\sum_{j=1}^{k^{2}-km}\epsilon_{j_{1}}\epsilon_{j_{2}}\epsilon_{j_{3}}

An application of Bernstein’s inequality and a worst-case bound gives equation 7. ∎

As with Theorem A.1, the bounds of Theorem A.2 can be expressed in terms of the query variance $\sigma^{2}$, with deviation threshold $C\sigma$:

P(|Q-EQ|>C\sigma)\leq 2\exp(\frac{-C^{2}\sigma^{2}}{2(\sigma^{2}+\frac{C\sigma}{3})})=2\exp(\frac{-C^{2}}{2(1+\frac{C}{3\sigma})})

If $l=O(k^{1/3})$, then in both cases all terms in the variance can be expressed as powers of $O(\frac{k^{2/3}}{d})$. Hence, as long as $d=\Omega(k^{2/3})$, the edge query $Q$ is close to its correct value $EQ$ with high probability. This justifies the statement of Property 2.

A.4 Property 3: $m$-Order Scaling

We give a proof sketch adapted from Section 12.1 of our previous work [18]. We aim to analyze the accuracy of recovering edge information from the $m^{th}$ matrix power of our random projections and how the dimension of the projection space $\mathbb{R}^{d\times d}$ needs to scale with $m$.

As in the proof of Theorem A.2, let $\epsilon$ denote the dot product of distinct spherical vectors. From the proofs of Theorems A.1 and A.2, we need to control the edge query variance to ensure that true and false queries are close to their expected values with high probability. Intuitively, if the node degree is small, then performing an edge query on the $m^{th}$ matrix power will result in a majority of the $k^{m}$ error terms being products of $m+1$ independent $\epsilon$’s. Such terms have mean 0 and variance $\frac{1}{d^{m+1}}$. Since the true and false query scores are 1 and 0 respectively, for accurate recovery we need the edge query variance to be less than 1. As a sum of independent terms, the variance of the noise term will be approximately $\frac{k^{m}}{d^{m+1}}$, implying that we need $d=\Omega(k^{\frac{m}{m+1}})$ in order to retain accuracy. The node degree bound of $l=O(k^{1/(m+1)})$ ensures that the variance can be expressed as sums of powers of $O(\frac{k^{m/(m+1)}}{d})$.

Appendix B JL Lemma for Graphs

Here, we aim to prove Theorem 7.1 by establishing intermediate results. Importantly, along the way we show how our random projection method preserves the graph inner product of Section 7.1.

B.1 Random Projections Preserve Inner Products

Theorem B.1.

Consider two graphs $G$ and $H$ with $n_{1}$ and $n_{2}$ edges respectively. Suppose they have $k$ edges in common. Among all $n_{1}n_{2}$ pairs $(e,e^{\prime})\in E_{G}\times E_{H}$, suppose $q$ of these edge pairs share exactly one vertex. Let $\pi(G)$ and $\pi(H)$ denote the random projections of their adjacency matrices $A_{G}$ and $A_{H}$ into $\mathbb{R}^{d\times d}$. For any $\epsilon>0$, we have:

P(|\langle\pi(G),\pi(H)\rangle-\langle A_{G},A_{H}\rangle|>\epsilon)\leq 2\exp(\frac{-\epsilon^{2}}{\frac{q}{d}+\frac{n_{1}n_{2}-k-q}{d^{2}}+\frac{\epsilon}{3}}) \qquad (9)
Proof.

We can express their random projections as:

\pi(G)=\sum^{k}_{i=1}a_{i}b_{i}^{T}+\sum_{j=1}^{n_{1}-k}c_{j}d_{j}^{T}
\pi(H)=\sum^{k}_{i=1}a_{i}b_{i}^{T}+\sum_{l=1}^{n_{2}-k}e_{l}f_{l}^{T}

where $a,b,c,d,e,f$ are random spherical vectors. Computing their inner product:

tr(\pi(G)^{T}\pi(H))=k+tr(\pi(G)^{T}\sum_{l=1}^{n_{2}-k}e_{l}f_{l}^{T})+tr(\pi(H)^{T}\sum_{j=1}^{n_{1}-k}c_{j}d_{j}^{T})=k+E \qquad (10)

We proceed to bound the error term $E$ in equation 10. By assumption, $q$ of the terms in $E$ come from edge pairs sharing exactly one vertex, while the remaining terms come from pairs sharing no vertices:

E=\sum_{i=1}^{q}\epsilon_{i}+\sum_{j=1}^{n_{1}n_{2}-k-q}\epsilon_{j}\epsilon^{\prime}_{j}

where each $\epsilon$ is the dot product of independent spherical vectors. Using Lemma A.1 and Bernstein’s inequality [3], we get the following bound:

P(|E|\geq\epsilon)\leq 2\exp(\frac{-\epsilon^{2}}{\frac{q}{d}+\frac{n_{1}n_{2}-k-q}{d^{2}}+\frac{\epsilon}{3}})

B.2 Random Projections are a JL Map for Graphs

Theorem B.2.

Let $G$ be a graph with $k$ edges. For $d<k$ and small $\epsilon>0$, its random projection into $\mathbb{R}^{d\times d}$ satisfies:

P(|\lVert\pi(G)\rVert^{2}-k|>k\epsilon)\leq 2\exp(-d\epsilon^{2}) \qquad (11)
Proof.

Of the $k^{2}$ pairs $(e,e^{\prime})\in E_{G}\times E_{G}$, suppose that $q$ of them share exactly one vertex. Theorem B.1 states that:

P(|\lVert\pi(G)\rVert^{2}-k|>k\epsilon)\leq 2\exp(\frac{-k^{2}\epsilon^{2}}{\frac{q}{d}+\frac{k^{2}-k-q}{d^{2}}+\frac{k\epsilon}{3}})

As $q\leq k^{2}-k$, the worst-case bound occurs when $q=k^{2}-k$, which gives equation 11. Note that if $\epsilon$ is large, the third term $\frac{k\epsilon}{3}$ dominates and gives the trivial bound $2\exp(-k\epsilon)$. ∎

Theorem B.3.

Let $G$ and $H$ be two graphs with $\lVert A_{G}-A_{H}\rVert^{2}=m$. For $d<m$ and small $\epsilon>0$, the random projections of their adjacency matrices into $\mathbb{R}^{d\times d}$ satisfy the following:

P(|\lVert\pi(G)-\pi(H)\rVert^{2}-m|>m\epsilon)\leq 2\exp(-d\epsilon^{2}) \qquad (12)
Proof.

If we include signed edges, the graph $G-H$ has $m$ total edges. An application of Theorem B.2 gives the result. ∎

B.2.1 Proof of Theorem 7.1

Proof.

From Theorem B.3, our random projection satisfies the following inequalities with high probability:

(1-\epsilon)||A_{G}-A_{H}||^{2}\leq||\pi(G)-\pi(H)||^{2}\leq(1+\epsilon)||A_{G}-A_{H}||^{2} \qquad (13)

If we want this to hold for a set of $N$ sparse adjacency matrices, a union bound over all $\binom{N}{2}$ pairs shows that equation 13 holds with probability at least $1-2\binom{N}{2}\exp(-d\epsilon^{2})=1-N(N-1)\exp(-d\epsilon^{2})$. For a fixed probability threshold $T$ of violating the inequalities in 13, let us choose the optimal $d$ given $N$, denoted $d_{opt}$. That is, $d_{opt}$ is the smallest integer $d$ such that $N(N-1)\exp(-d\epsilon^{2})\leq T$. The optimal $d_{opt}$ satisfies the following inequalities:

N(N-1)\exp(-d_{opt}\epsilon^{2})\leq T
N(N-1)\exp(-(d_{opt}-1)\epsilon^{2})>T

Combining the two inequalities gives:

\frac{\log(N(N-1))-\log T}{\epsilon^{2}}\leq d_{opt}<\frac{\log(N(N-1))-\log T}{\epsilon^{2}}+1

Thus, we have $d_{opt}=\Omega(\frac{\log N}{\epsilon^{2}})$, which matches the scaling of the usual Johnson-Lindenstrauss lemma and establishes our random projections as a JL map for graphs. ∎