
Random Projections of Sparse Adjacency Matrices

Frank Qiu, Statistics Department, University of California, Berkeley
Abstract

We analyze a random projection method for adjacency matrices, studying its utility in representing sparse graphs. We show that these random projections retain the functionality of their underlying adjacency matrices while having extra properties that make them attractive as dynamic graph representations. In particular, they can represent graphs of different sizes and vertex sets in the same space, allowing for the aggregation and manipulation of graphs in a unified manner. We also provide results on how the size of the projections needs to scale in order to preserve accurate graph operations, showing that it can scale linearly with the number of vertices while accurately retaining first-order graph information. We conclude by characterizing our random projection as a distance-preserving map of adjacency matrices analogous to the usual Johnson-Lindenstrauss map.

1 Introduction

The adjacency matrix is a popular and flexible graph representation, encoding a graph’s edgeset in an explicit and easily accessible manner. Furthermore, many natural graph operations correspond to linear algebraic operations on the adjacency matrix, such as edge addition/deletion, edge or vertex queries, and edge composition. The properties of the adjacency matrix and related quantities like the graph Laplacian also yield important insights about the underlying graph [19] [1].

However, a major defect of the adjacency matrix is that its size scales quadratically with the number of vertices. In many real-world applications where the underlying graph is extremely large, this presents a fundamental problem. For example, the Stanford Large Network Dataset Collection includes graphs derived from social, communication, and road networks that each have millions of vertices. The adjacency matrices of such large graphs would require trillions of parameters to store and work with, which is infeasible in practice. Hence, there has been much interest in graph compression techniques, with particular interest in sparse graph techniques, since a majority of real-world graphs tend to have sparsely connected vertices.

In a previous work [18], we introduced a graph embedding method that represents a graph’s edgeset as a random sum. This method can be viewed as a random projection technique for adjacency matrices, and in this paper we discuss its application to large sparse graphs. In particular, we show that these random projections have the same space and operational time complexity as popular sparse graph representations like the Compressed Sparse Row (CSR) format. Moreover, they retain all of the linear-algebraic graph operations of adjacency matrices, giving a flexible and computationally efficient representation. Interestingly, the random projection described in this paper is an inner-product preserving projection on the space of adjacency matrices, analogous to the Johnson-Lindenstrauss map for vectors [7].

2 Related Work

We proposed our random projection method in previous work [18], framing it as a graph embedding method in the spirit of hyperdimensional computing (HDC) [8][10][6]. HDC graph representations encode vertices as high-dimensional vectors, bind these vectors to generate edge embeddings, and sum these edge embeddings to represent the entire edgeset: a bind-and-sum approach [15][12][17][16][9]. Our method can be classified as a bind-and-sum method, since we assign a random vector to each vertex and represent an edge by binding the source and target vertices via the tensor (outer) product. However, our method is purely a random projection and does not seek to learn a good node or graph embedding, instead leveraging pseudo-orthogonality to maintain properties of the adjacency matrix.

Our randomized projections can also be viewed as a “reverse” adjacency matrix factorization. These techniques seek to find an informative and compressed factorization of the adjacency matrix (or adjacency tensor for typed graphs). For example, the RESCAL algorithm [14] applied to an $n\times n$ adjacency matrix $A$ learns a $d\times n$ entity matrix $E$ and a $d\times d$ relational matrix $R$ such that $A\approx ERE^{T}$. Here, the columns of $E$ are the entity (vertex) embeddings that encode information about each entity, and the matrix $R$ encodes information about entity-entity interactions. We work with the source-target factorization of an adjacency matrix $A=CD^{T}$, where $C$ and $D$ are $n\times k$ matrices whose $i^{th}$ columns are the respective coordinate vectors of the source and target vertex of the $i^{th}$ edge. Our method encodes each vertex as a random vector and then replaces the columns of $C$ and $D$ with their corresponding random vertex codes.

As a random projection of matrices, our method falls within the field of randomized numerical linear algebra [13]. These are randomized algorithms and techniques for solving problems in linear algebra, such as solving linear systems and finding maximum/minimum eigenvalues. These techniques aim to offer advantages in speed and memory over their deterministic counterparts, and our method aims to provide both in representing and working with a sparse adjacency matrix. Among these algorithms, the FastRP algorithm [4] is another example of a random projection method. It computes node embeddings by taking a weighted sum of powers of the adjacency matrix and randomly projecting them into a $d\times n$ matrix whose columns form the node embeddings. While the goal of FastRP is to generate node embeddings that capture multi-hop information, our method seeks to informatively embed the adjacency matrix itself.

3 Intuition Behind the Projection

Given a directed graph $G=(V,E)$ and an arbitrary ordering of its vertex set $V$, its adjacency matrix can be expressed as a sum of outer products of coordinate vectors:

A=\sum_{ij:A_{ij}=1}e_{i}e_{j}^{T}

In this form, it is easy to see that many graph operations correspond to linear operations on the adjacency matrix. For example, adding/deleting the edge $(v_{s},v_{t})$ corresponds to adding/subtracting the matrix $e_{s}e_{t}^{T}$ from $A$; querying if the graph contains the edge $(v_{s},v_{t})$ corresponds to the product $e_{s}^{T}Ae_{t}$; the $k^{th}$ matrix power encodes $k$-length paths. One property these fundamental graph operations share is that they only require orthonormality of the vertex vectors. If we swap the coordinate vectors with any set of orthonormal vectors, the transformed adjacency matrix retains all of its functionality.
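
As a concrete illustration of the discussion above, the following sketch (ours, using NumPy; the graph and edge choices are purely illustrative) builds a small adjacency matrix from outer products of coordinate vectors and performs the operations just described.

```python
import numpy as np

n = 5
E = np.eye(n)                      # E[:, i] is the coordinate vector e_i

def edge_matrix(s, t):
    # Rank-one matrix e_s e_t^T representing the directed edge (v_s, v_t).
    return np.outer(E[:, s], E[:, t])

# Build A as a sum of outer products for the edges (0,1), (1,2), (2,3).
A = edge_matrix(0, 1) + edge_matrix(1, 2) + edge_matrix(2, 3)

# Edge query: e_s^T A e_t equals 1 exactly when (v_s, v_t) is an edge.
assert E[:, 0] @ A @ E[:, 1] == 1.0
assert E[:, 0] @ A @ E[:, 2] == 0.0

# Edge deletion: subtract the corresponding outer product.
A -= edge_matrix(2, 3)

# Paths: e_0^T A^2 e_2 counts the 2-step paths from v_0 to v_2.
assert E[:, 0] @ (A @ A) @ E[:, 2] == 1.0
```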

For example, let $B=[b_{1}\cdots b_{n}]$ be any orthogonal matrix, whose columns form an orthonormal basis. Changing bases to $B$, our transformed adjacency matrix $A_{B}$ takes the form:

A_{B}=\sum_{ij:A_{ij}=1}b_{i}b_{j}^{T}=BAB^{T}

In this form, we can perform all the usual adjacency matrix operations by swapping coordinate vectors with their counterparts in $B$. For example, adding/deleting the edge $(v_{s},v_{t})$ now corresponds to adding/subtracting the matrix $b_{s}b_{t}^{T}$ from $A_{B}$; the edge query of $(v_{s},v_{t})$ becomes $b_{s}^{T}A_{B}b_{t}$; matrix powers still correspond to finite-length paths in the sense that $b_{i}^{T}A_{B}^{k}b_{j}$ is nonzero if and only if there is a $k$-length path from $v_{i}$ to $v_{j}$.

From the above discussion, we see that orthonormality is the key property required for exact graph functionality. Hence, relaxing to approximate orthonormality allows us to compress adjacency matrices while retaining graph functionality in a minimally noisy manner. We make use of random pseudo-orthonormal vectors, which are vectors whose dot products are negligible with high probability. While we can pack at most $d$ orthonormal vectors in a $d$-dimensional space, we can pack many more pseudo-orthonormal vectors in the same space. This allows us to compress adjacency matrices, and in the case of sparse matrices this compression effect is especially pronounced.

4 Random Projection of Adjacency Matrices

Consider a directed graph $G=(V,E)$ with $|V|=n$. An ordering of the vertex set $V$ induces the $n\times n$ adjacency matrix $A$:

A_{ij}=\begin{cases}1\quad\text{if $(v_{i},v_{j})\in E$}\\ 0\quad\text{otherwise}\end{cases}

We then construct a random $d\times n$ projection matrix $P_{V}$, where the columns of $P_{V}$ are sampled i.i.d. from the uniform distribution on the $d$-dimensional unit sphere $\mathbb{S}^{d-1}$. The random projection $\pi_{V}(A)$ is then given by the following equation:

\pi_{V}(A)=P_{V}AP_{V}^{T}

Expressing the adjacency matrix as a sum of outer products of coordinate vectors, $A=\sum e_{i}e_{j}^{T}$, the projection swaps the $i^{th}$ coordinate vector with the $i^{th}$ column $p_{i}$ of $P_{V}$: $\sum e_{i}e_{j}^{T}\mapsto\sum p_{i}p_{j}^{T}$. In light of the discussion in Section 3, this can be viewed as a pseudo-orthogonal change of basis.
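
The following sketch (ours; NumPy is assumed and the sizes are arbitrary) constructs such a projection by sampling unit-sphere columns for $P_{V}$ and checks that $P_{V}AP_{V}^{T}$ agrees with the sum of projected edge outer products.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_codes(n, d):
    # d x n matrix whose columns are i.i.d. uniform samples from the unit sphere S^{d-1}.
    P = rng.standard_normal((d, n))
    return P / np.linalg.norm(P, axis=0)

n, d = 1000, 256
P = random_codes(n, d)

# A small edgeset, stored densely here only for clarity.
edges = [(0, 1), (1, 2), (5, 7)]
A = np.zeros((n, n))
for s, t in edges:
    A[s, t] = 1.0

# pi_V(A) = P_V A P_V^T, equivalently the sum of outer products p_s p_t^T over the edges.
proj = P @ A @ P.T
proj_from_edges = sum(np.outer(P[:, s], P[:, t]) for s, t in edges)
assert np.allclose(proj, proj_from_edges)
```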

4.1 Graph Operations on Random Projections

We saw in Section 3 that operations in a new orthonormal basis require swapping coordinate vectors with the new basis vectors. Similarly, we can perform all the usual graph operations with the projection $\pi_{V}(A)$ by substituting the appropriate random code for each vertex. For example, to query if the edge $(v_{i},v_{j})$ is in the edgeset, we would usually compute the product $e_{i}^{T}Ae_{j}$. This returns a 1 if $(v_{i},v_{j})$ is an edge and 0 otherwise. The randomized analogue then takes the form $p_{i}^{T}\pi_{V}(A)p_{j}$. This quantity is close to 1 with high probability if $(v_{i},v_{j})$ is an edge and close to 0 with high probability otherwise.

In general, every graph operation has an analogue on the projected matrices by substituting in the random vertex codes. Rather than using the coordinate vector $e_{i}$ associated with vertex $v_{i}$, when working with the randomized projection $\pi_{V}(A)$ we use the $i^{th}$ column vector $p_{i}$ of the projection matrix $P_{V}$ instead.
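
A short sketch of the approximate edge query (ours; the graph, dimensions, and seed are illustrative) is given below: present edges score close to 1 and absent edges close to 0.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 2000, 512
P = rng.standard_normal((d, n))
P /= np.linalg.norm(P, axis=0)            # columns are random unit-sphere vertex codes

edges = [(0, 1), (3, 4), (10, 11)]
proj = sum(np.outer(P[:, s], P[:, t]) for s, t in edges)   # pi_V(A)

def edge_query(i, j):
    # Randomized analogue of e_i^T A e_j.
    return P[:, i] @ proj @ P[:, j]

print(edge_query(0, 1))    # close to 1: (v_0, v_1) is an edge
print(edge_query(0, 2))    # close to 0: (v_0, v_2) is not an edge
```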

4.2 Changing the Vertex Set and Graph Aggregation

One attractive property of our random projections is that they transform nicely under changes to the underlying vertex set. Suppose we wish to expand our graph by adding a new vertex $v_{n+1}$ and new edges $\{(v_{i},v_{n+1})\}_{i}$. With the usual $n\times n$ adjacency matrix, this would require expanding to an $(n+1)\times(n+1)$ matrix and filling in the appropriate entries of the added column and row. However, for the projected matrix we only need to generate a new random vector $p_{n+1}$ and add the appropriate edges to the projection: $\pi_{V}(A)+\sum_{i}p_{i}p_{n+1}^{T}$.
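
The sketch below (ours; names and sizes are illustrative) grows the vertex set without resizing the $d\times d$ projection: a new vertex only requires drawing one new random code.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 512

def random_code():
    # Uniform sample from the unit sphere S^{d-1}.
    v = rng.standard_normal(d)
    return v / np.linalg.norm(v)

# Existing projection: codes for v_0, v_1 and the single edge (v_0, v_1).
codes = [random_code(), random_code()]
proj = np.outer(codes[0], codes[1])

# Add vertex v_2 and the new edges (v_0, v_2), (v_1, v_2); the matrix stays d x d.
codes.append(random_code())
proj += np.outer(codes[0], codes[2]) + np.outer(codes[1], codes[2])

print(codes[0] @ proj @ codes[2])         # approx 1: the new edge (v_0, v_2) is present
```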

Our random matrix projection can be viewed as a projection method for the set of graphs whose vertices are in a fixed vertex set $V$. In light of the above discussion, this projection can be naturally extended to sets containing $V$ or restricted to subsets of $V$ by expanding or restricting the set of random vertex codes respectively. This property is particularly attractive for dynamic graph representations, where the edge and vertex sets of the graph change over time. The dimension of the projected matrices stays the same under changes to the vertex set, and the addition/deletion of edges involving new vertices is simple. We only need to keep in mind the total capacity of the projection space: how many edges we can add to a projected matrix before accuracy in graph operations begins to break down. We analyze this behavior in Section 5.

4.3 Translation between Different Projections

Once we assign a random code to each vertex and construct our projection matrix $P_{V}$, that vertex-vector codebook is fixed. However, we might desire to reassign a new random vector to each vertex. This translation procedure has a simple analogue for projected matrices. Suppose we have a vertex set $V$ with two different random projection matrices $P_{V}$ and $Q_{V}$. The $i^{th}$ columns of $P_{V}$ and $Q_{V}$ are the random vectors assigned to vertex $v_{i}$ under each projection respectively. In order to swap the vector $p_{i}$ with $q_{i}$, we construct the translation matrix $T_{(P,Q)}=Q_{V}P_{V}^{T}$. By pseudo-orthonormality, we see that $T_{(P,Q)}p_{i}\approx q_{i}$ for every $i$. Hence, if $A_{P}$ is the projection of $A$ under $P_{V}$ and $A_{Q}$ is the projection under $Q_{V}$, we have the following relation:

A_{Q}\approx T_{(P,Q)}A_{P}T_{(P,Q)}^{T}
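
The following sketch (ours; the codebooks and sizes are illustrative) builds the translation matrix and checks that vertex codes and edge queries survive the change of codebook.

```python
import numpy as np

n, d = 100, 1024

def random_codes(seed):
    g = np.random.default_rng(seed)
    M = g.standard_normal((d, n))
    return M / np.linalg.norm(M, axis=0)

P, Q = random_codes(10), random_codes(11)     # two codebooks for the same vertex set
T = Q @ P.T                                   # translation matrix T_(P,Q) = Q_V P_V^T

print(T @ P[:, 0] @ Q[:, 0])                  # approx 1: T maps p_0 onto q_0

edges = [(0, 1), (2, 3)]
A_P = sum(np.outer(P[:, s], P[:, t]) for s, t in edges)   # projection under P_V
A_translated = T @ A_P @ T.T                              # approx the projection under Q_V

print(Q[:, 0] @ A_translated @ Q[:, 1])       # approx 1: edge (v_0, v_1) survives translation
```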

4.4 Graph Subsets and Aggregation

Given a subset $S\subseteq V$, the subgraph generated by $S$, denoted $G_{S}\subset G$, is the subgraph generated by all edges involving vertices in $S$. Given the projected adjacency matrix $\pi(A_{G})$, we can extract the projection of the subgraph $\pi(A_{G_{S}})$ by the following procedure. Let $P_{S}$ be the $d\times|S|$ matrix whose columns are the random codes assigned to each vertex of $S$, and define $T_{S}=P_{S}P_{S}^{T}$. By pseudo-orthonormality, we have the following approximate relation:

\pi(A_{G_{S}})\approx T_{S}\pi(A_{G})T_{S}^{T}
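
A sketch of the extraction step (ours; the subset, edges, and sizes are illustrative): edges inside $S$ survive while edges outside $S$ are attenuated toward zero.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 100, 1024
P = rng.standard_normal((d, n))
P /= np.linalg.norm(P, axis=0)            # random vertex codes

edges = [(0, 1), (1, 2), (50, 60)]        # (50, 60) lies outside the subset S below
proj_G = sum(np.outer(P[:, s], P[:, t]) for s, t in edges)

S = [0, 1, 2, 3]
P_S = P[:, S]
T_S = P_S @ P_S.T                         # d x d restriction operator for S

proj_sub = T_S @ proj_G @ T_S.T           # approx projection of the subgraph on S

print(P[:, 0] @ proj_sub @ P[:, 1])       # approx 1: edge inside S is kept
print(P[:, 50] @ proj_sub @ P[:, 60])     # approx 0: edge outside S is suppressed
```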

This subset procedure, along with graph translation, can be applied to aggregate multiple graphs into a single large graph. For example, suppose we have two disjoint graphs $G$ and $H$. Their aggregate graph $G\cup H=(V_{G}\cup V_{H},E_{G}\cup E_{H})$ is the result of combining their vertex and edge sets. We saw earlier that edge addition corresponds to adding the projected edges to our projected matrices. Therefore, given projections $\pi(A_{G})$ and $\pi(A_{H})$, the projection of their aggregate is the sum of their projections:

\pi(A_{G\cup H})=\pi(A_{G})+\pi(A_{H})

This can be used in tandem with the subsetting procedure to combine selected subgraphs from a collection of graphs. Even when their vertex codes differ, the translation procedure mentioned previously allows us to translate them all into a single codebook before aggregation.

One interesting application of this suite of procedures is allowing a divide-and-conquer approach to storing a large sparse graph. Graph operations on the projected matrices depend on the size of the projection space (see Section 5), so breaking a large graph into subgraphs allows us to store their projections in a smaller projection space. Operations involving an individual subgraph also happen in a lower-dimensional space and have decreased time complexity. However, when the need arises to operate on information from a pool of these subgraphs, we can use the above subset and pooling procedures to easily generate a wide suite of projections that correspond to those generated by their graph aggregates.

The above examples demonstrate that we can represent graphs of varying size and different vertex sets in the same projection space. Translation and pooling operations are of fixed dimension, allowing for a unified way to manipulate and combine a broad range of graphs. This suggests that random projections are also ideal for applications where graph aggregation is an important operation, such as the divide-and-conquer situation described above.

5 Scaling Properties of Random Projections

We now study how the size of the random projection space needs to scale with the underlying graph. Importantly, we show that the size scales with the size of the edgeset rather than the vertex set. This makes our method particularly amenable to sparse graphs, where the size of the edgeset is proportional to the size of the vertex set.

5.1 $m$-Order Graph Operations

First, we need to define $m$-order graph operations. This is important because each order requires a different scaling of the projection space. We say a graph operation has order $\boldsymbol{m}$ if it can be expressed as an operation involving the $m^{th}$ power of the adjacency matrix. For example, the edge query is a first order graph operation because it is a function of the adjacency matrix: $q((v_{i},v_{j}),G)=e_{i}^{T}Ae_{j}$.

5.2 First and Second Order Scaling

We first characterize how the size of the projection needs to scale in order to retain accurate first and second order graph operations. Intuitively, these correspond to edge information of the 1-hop and 2-hop neighborhoods of the graph respectively. We give two informal statements on how the dimension $d$ of the projection space $\mathbb{R}^{d\times d}$ must scale with the number of edges, subject to a constraint on the vertex connectivity (the number of edges each vertex participates in). An account of the technical results justifying these statements is given in Appendix A.

Property 1 (First Order Scaling).

Let $G$ be a graph with $k$ edges and maximum node degree $O(\sqrt{k})$. A random projection into $\mathbb{R}^{d\times d}$ needs to have $d=\Omega(\sqrt{k})$ in order to retain accurate first order operations.

Property 2 (Second Order Scaling).

Let $G$ be a graph with $k$ edges and maximum node degree $O(k^{1/3})$. A random projection into $\mathbb{R}^{d\times d}$ needs to have $d=\Omega(k^{2/3})$ in order to retain accurate second order operations.

For sparse graphs, the node degree condition is satisfied for all or a majority of vertices. Indeed, a closer inspection of the proofs in Appendix A shows that as long as the vertices of the relevant edges satisfy this or a much looser bound on the node degree, the results still hold. This allows us to include sparse graphs that might have central vertices, which act like hubs and connect to many other vertices. If we are only interested in first order properties of the graph, the first result implies that we can compress a large sparse adjacency matrix using $n^{2}$ parameters into a smaller matrix using $\Omega(n)$ parameters, resulting in a drastic compression for large sparse graphs.
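
As a rough illustration (the numbers are ours and chosen only for concreteness), consider a sparse graph with $n=10^{6}$ vertices and $k\approx 10^{6}$ edges. Its adjacency matrix has $n^{2}=10^{12}$ entries, whereas a projection calibrated to first order operations needs only $d=\Omega(\sqrt{k})\approx 10^{3}$, i.e. a $d\times d$ matrix with on the order of $10^{6}$ parameters.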

5.3 $m$-Order Scaling

While first and possibly second order operations comprise the bulk of the interesting graph operations, for completeness we give the following informal scaling statement for $m$-order graph operations.

Property 3 ($m$-Order Scaling).

Let $G$ be a graph with $k$ edges and maximum node degree $O(k^{1/(m+1)})$. A random projection into $\mathbb{R}^{d\times d}$ needs to have $d=\Omega(k^{\frac{m}{m+1}})$ in order to retain accurate $m$-order operations.

6 Edge Representations of Sparse Matrices

In this section, we analyze, for various sparse matrix representations, both their space complexity and the time complexity of graph operations. We also consider the time complexity of numerical sparse matrix methods when appropriate. In this manner, we hope to contextualize both the strengths and weaknesses of our random projections relative to other sparse matrix representations. Table 1 summarizes the results of this section.

6.1 Alternate Sparse Matrix Representations

We first consider the coordinate list representation. This represents a sparse matrix as a list of 3-tuples $(r,c,v)$, one per non-zero entry, where $r$ and $c$ are the row and column indices and $v$ is the value. A related representation is the Dictionary of Keys (DoK) representation, which is similar to the coordinate list except that each row-column index $(r,c)$ serves as a key with value $v$. Finally, the CSR format represents a matrix using three arrays containing the non-zero values, their column indices, and the row pointers marking where each row’s entries begin, respectively. The CSR format can be easily computed from the previous two representations and vice versa.
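
For concreteness, the sketch below (ours; it assumes SciPy, which the paper does not use) builds the same tiny matrix in all three formats and prints the three CSR arrays.

```python
import numpy as np
from scipy.sparse import coo_matrix

rows, cols = np.array([0, 1, 2]), np.array([1, 2, 0])
vals = np.ones(3)

A_coo = coo_matrix((vals, (rows, cols)), shape=(4, 4))   # coordinate list
A_dok = A_coo.todok()                                    # dictionary of keys
A_csr = A_coo.tocsr()                                    # compressed sparse row

print(A_dok[0, 1])        # keyed lookup of the entry for edge (0, 1)
print(A_csr.data)         # non-zero values
print(A_csr.indices)      # their column indices
print(A_csr.indptr)       # row pointers: where each row's entries begin
```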

6.2 Space Complexity

A sparse graph with $n$ vertices has $O(n)$ edges. Hence, from their descriptions, all three of the alternate sparse matrix representations require $O(n)$ parameters. For random projections, if we are only concerned with representing first-order properties, Theorem A.1 establishes that the $d\times d$ projections must have $d=\Omega(\sqrt{n})$, meaning the projections are of size $O(n)$.

6.3 Graph Operation Complexity

We analyze some graph operations whose complexity differs based on the representation. For data structure operations, we use their complexity in Python as stated in the Python Wiki. Throughout this section, our random projections are $d\times d$ matrices.

6.3.1 Edge Query

To look up whether an edge $(v_{i},v_{j})$ exists in a coordinate list, we merely check if the list contains a tuple whose first two entries are $(i,j)$. Since the coordinate list has length $O(n)$, list lookup has complexity $O(n)$. As a three-list format, the CSR representation has the same complexity. However, the “in” operator for dictionaries is $O(1)$, meaning edge lookup in DoK is $O(1)$. The edge query for random projections is the product $p_{i}^{T}\pi_{V}(A)p_{j}$, which has complexity $O(d^{2})$. If our projection space is calibrated to preserve only first-order operations, then this edge query has complexity $O(n)$.

6.3.2 Edge Composition

Edge composition is a more complicated case. For each of the sparse matrix representations, edge composition requires a nested procedure where, for each edge $(i,j)$, one needs to identify all edges $(j,k)$ and return the edge $(i,k)$, which is naively $O(n^{2})$. Importantly, this naive algorithm is not parallelizable, making it especially costly to compute. Alternatively, edge composition naturally corresponds to the second power of the adjacency matrix. For a sparse adjacency matrix with $O(n)$ edges, numerical methods for sparse matrices have complexity at most $O(n^{2})$ [2].

As for random projections, the naive algorithm for the multiplication of two $d\times d$ matrices has complexity $O(d^{3})$ [5] [11]. In order to retain accuracy for second-order operations, Property 2 states that $d=\Omega(n^{2/3})$. Thus, computing the second matrix power of our projections has naive complexity $O(n^{2})$, while methods like Strassen’s algorithm [11] can speed it up even further.

6.4 Fast Numerical Linear Algebra

One important aspect of our random projections is that they benefit from fast numerical linear algebra methods, such as parallel computation, which the other sparse matrix representations do not enjoy, as we saw with the edge composition example. Hence, in practice the time complexity of graph operations with random projections can be significantly reduced, and the method stands to benefit from further advances in numerical linear algebra.

| | Random Projections | DoK | CL | CSR | Sparse NLA |
|---|---|---|---|---|---|
| Space Complexity | $O(n)$ | $O(n)$ | $O(n)$ | $O(n)$ | N/A |
| Edge Query | $O(n)$ | $O(1)$ | $O(n)$ | $O(n)$ | N/A |
| Edge Composition | $O(n^{2})^{*}$ | $O(n^{2})$ | $O(n^{2})$ | $O(n^{2})$ | N/A |
| Matrix Addition | $O(n)$ | N/A | N/A | N/A | $O(n)$ |
| Matrix Multiplication | $O(n^{2})^{*}$ | N/A | N/A | N/A | $O(n^{2})$ |

Table 1: Table of space/time complexity for various graph and matrix operations. The underlying sparse graph has $n$ vertices and $O(n)$ edges. From left to right, the methods considered are random projections, dictionary of keys (DoK), coordinate list (CL), compressed sparse row (CSR), and sparse numerical linear algebra (Sparse NLA). For matrix multiplication, we use the time complexity of the naive algorithm rather than other advanced methods [11]. We assume our random projections are calibrated to first order operations, except for edge composition/matrix multiplication (*) where we assume calibration to second order operations.

7 Johnson-Lindenstrauss Analogue for Graphs

In the introduction, we stated that our random projection method is a norm-preserving map analogous to the one given by the Johnson-Lindenstrauss (JL) lemma for vectors. Here, we first discuss a natural inner product on the space of adjacency matrices. We then state results showing how our random projections preserve this inner product and its associated norm for a finite set of adjacency matrices, characterizing our method as a JL map for the space of graphs.

7.1 A Natural Inner Product for Adjacency Matrices

In our previous work [18], we noted that there is a natural inner product on the space of adjacency matrices that counts the number of shared edges. Consider two graphs $G$ and $H$ whose vertex sets are contained in some larger set $V$. Given an ordering of $V$, we have the induced adjacency matrices $A_{G}$ and $A_{H}$. Define the inner product between their adjacency matrices as:

\langle A_{G},A_{H}\rangle\coloneqq tr(A_{G}^{T}A_{H})

Expressing each adjacency matrix as the sum of coordinate outer products, we see that this function counts the number of edges common to both graphs:

tr(A_{G}^{T}A_{H})=tr\big(\big[\sum_{(i,j)\in E_{G}}e_{i}e_{j}^{T}\big]^{T}\big[\sum_{(k,l)\in E_{H}}e_{k}e_{l}^{T}\big]\big)
=tr\big(\sum_{(i,j)\in E_{G},\,(i,l)\in E_{H}}e_{j}e_{l}^{T}\big)
=\sum_{(i,j)\in E_{G}\cap E_{H}}1
=|E_{G}\cap E_{H}|

One can check that this is indeed an inner product, and in fact it generates the Frobenius norm: $tr(A^{T}A)=||A||_{F}^{2}$.
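
A quick numerical check of this identity (ours; the two small graphs are illustrative) is given below.

```python
import numpy as np

n = 6

def adjacency(edges):
    A = np.zeros((n, n))
    for s, t in edges:
        A[s, t] = 1.0
    return A

A_G = adjacency([(0, 1), (1, 2), (2, 3)])
A_H = adjacency([(1, 2), (2, 3), (4, 5)])

inner = np.trace(A_G.T @ A_H)
print(inner)              # 2.0: the graphs share the edges (1, 2) and (2, 3)
```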

Note that we assumed the vertex sets of both graphs were contained in some larger set $V$. This is not an issue for a finite set of graphs $G_{i}$ with finite vertex sets $V_{i}$, as we can define $V=\cup V_{i}$. The adjacency matrix of a graph with respect to this larger set $V$ is then a block matrix that contains the original adjacency matrix as its single non-zero block.

7.2 Random Projections are a JL Map

In light of the graph inner product defined in Section 7.1, we almost have a JL-type result, since the Frobenius norm of matrices coincides with the Euclidean norm if we regard matrices in $\mathbb{R}^{d\times d}$ as vectors in $\mathbb{R}^{d^{2}}$. We confirm this by proving that, with high probability, our random projection method is a map that satisfies the conditions of the JL lemma.

Theorem 7.1 (Random Projections are a JL map).

Consider a set of $N$ graphs with adjacency matrices $A_{1},\cdots,A_{N}$. For small $\epsilon>0$ and $d=\Omega(\frac{\log N}{\epsilon^{2}})$, with high probability our random projections satisfy:

(1-\epsilon)||A_{i}-A_{j}||^{2}\leq||\pi(A_{i})-\pi(A_{j})||^{2}\leq(1+\epsilon)||A_{i}-A_{j}||^{2}\quad\forall i,j \qquad (1)

8 Discussion

We presented a random projection method for sparse adjacency matrices. These random projections exploit the pseudo-orthonormality of random vectors to drastically compress the adjacency matrix while still retaining all of its functionality. While the exact scaling depends on the graph properties we wish to preserve, Theorem A.1 shows that we can compress $n\times n$ sparse matrices into $\Omega(\sqrt{n})\times\Omega(\sqrt{n})$ matrices while preserving first order graph operations. These random projections also enjoy properties that their underlying adjacency matrices do not. Sections 4.2 and 4.3 show that random projections allow us to represent graphs of varying size and different vertex sets in the same projection space. This common space is equipped with modification and aggregation operations that apply to all graphs in the space, and random projections provide a unified way of representing and working with graphs. The complexity analysis of Section 6 shows that these random projections are competitive with existing sparse matrix representations and numerical methods, and they can take advantage of numerical techniques for speeding up linear algebra operations. All these properties suggest that random adjacency matrix projections are a dynamic, flexible, and expressive graph compression technique well suited to a variety of applications where large sparse matrices occur.

One interesting application of our random compression technique would be to the field of graph neural networks [20]. Such networks use the adjacency matrix during the linear portion of their computations, and random projections could help extend such networks to both large sparse graphs and time-varying, dynamic graphs.

References

  • [1] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15:1373–1396, 2003.
  • [2] Keivan Borna and Sohrab Fard. A note on the multiplication of sparse matrices. Open Computer Science, 4(1):1–11, 2014.
  • [3] Stéphane Boucheron, Gábor Lugosi, and Olivier Bousquet. Concentration Inequalities, pages 208–240. Springer Berlin Heidelberg, Berlin, Heidelberg, 2004.
  • [4] Haochen Chen, Syed Fahad Sultan, Yingtao Tian, Muhao Chen, and Steven Skiena. Fast and accurate network embeddings via very sparse random projection, 2019.
  • [5] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Third Edition. The MIT Press, 3rd edition, 2009.
  • [6] Ross W. Gayler and Simon D. Levy. A distributed basis for analogical mapping. 2009.
  • [7] William B. Johnson and Joram Lindenstrauss. Extensions of lipschitz mappings into a hilbert space. Contemporary Mathematics, 26, 1984.
  • [8] Pentti Kanerva. Sparse Distributed Memory. MIT Press, Cambridge, MA, USA, 1988.
  • [9] Jaeyoung Kang, Minxuan Zhou, Abhinav Bhansali, Weihong Xu, Anthony Thomas, and Tajana Rosing. Relhd: A graph-based learning on fefet with hyperdimensional computing. 2022 IEEE 40th International Conference on Computer Design (ICCD), pages 553–560, 2022.
  • [10] Denis Kleyko, Dmitri A. Rachkovskij, Evgeny Osipov, and Abbas Jawdat Rahim. A survey on hyperdimensional computing aka vector symbolic architectures, part ii: Applications, cognitive models, and challenges. ArXiv, abs/2112.15424, 2021.
  • [11] Yan Li, Sheng-Long Hu, Jie Wang, and Zheng-Hai Huang. An introduction to the computational complexity of matrix multiplication. Journal of the Operations Research Society of China, 8(1):29–43, 2020.
  • [12] Yunpu Ma, Marcel Hildebrandt, Volker Tresp, and Stephan Baier. Holistic representations for memorization and inference. In UAI, 2018.
  • [13] Per-Gunnar Martinsson and Joel Tropp. Randomized numerical linear algebra: Foundations and algorithms, 2021.
  • [14] Maximilian Nickel, Xueyan Jiang, and Volker Tresp. Reducing the rank in relational factorization models by including observable patterns. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014.
  • [15] Maximilian Nickel, Lorenzo Rosasco, and Tomaso A. Poggio. Holographic embeddings of knowledge graphs. In AAAI, 2016.
  • [16] Igor O. Nunes, Mike Heddes, Tony Givargis, Alexandru Nicolau, and Alexander V. Veidenbaum. Graphhd: Efficient graph classification using hyperdimensional computing. 2022 Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 1485–1490, 2022.
  • [17] Prathyush Poduval, Haleh Alimohamadi, Ali Zakeri, Farhad Imani, M. Hassan Najafi, Tony Givargis, and Mohsen Imani. Graphd: Graph-based hyperdimensional memorization for brain-like cognitive learning. Frontiers in Neuroscience, 16, 2022.
  • [18] Frank Qiu. Graph embeddings via tensor products and approximately orthonormal codes, 2022.
  • [19] Ulrike von Luxburg. A tutorial on spectral clustering, 2007.
  • [20] Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. Graph neural networks: A review of methods and applications. AI Open, 1:57–81, 2020.

Appendix A Scaling Theorems

We provide theorems justifying the properties listed in Section 5. These are mainly adaptations of results found in our previous work [18]. Throughout this section, we abuse notation and use the same symbol $u$ to denote both a vertex and its random code. Similarly, we use the same symbol $G$ to denote both the graph and its random projection.

A.1 Auxiliary Lemma for Spherical Random Vectors

We will need the following lemma [18].

Lemma A.1.

Let $X$ be the dot product between two vectors sampled uniformly and independently from the $d$-dimensional unit sphere $\mathbb{S}^{d-1}$. Then $E(X)=0$ and $E(X^{2})=\frac{1}{d}$.
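
A Monte Carlo sketch of Lemma A.1 (ours; the dimension and sample size are illustrative) is given below.

```python
import numpy as np

rng = np.random.default_rng(5)
d, trials = 100, 50_000

u = rng.standard_normal((trials, d))
v = rng.standard_normal((trials, d))
u /= np.linalg.norm(u, axis=1, keepdims=True)     # rows are uniform points on S^{d-1}
v /= np.linalg.norm(v, axis=1, keepdims=True)

x = np.sum(u * v, axis=1)                         # dot products of independent samples
print(x.mean())                                   # approx 0
print((x ** 2).mean(), 1 / d)                     # approx 1/d = 0.01
```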

A.2 Property 1: First-Order Scaling

The following theorem and its proof are an abbreviated adaptation of Theorem 10.7 in our previous work [18]. Given an edge $(v_{i},v_{j})$ and adjacency matrix $A$, the edge query $e_{i}^{T}Ae_{j}$ returns 1 if $(v_{i},v_{j})$ is an edge of the graph and 0 otherwise. Our goal is to prove that the edge query analogue for our random projection returns the correct value with high probability, showing that our random projection accurately retains first-order edge information. An edge query is a true query if the queried edge exists in the graph and a false query if it does not.

Theorem A.1.

Let $G$ be a graph with $k+1$ edges and maximum node degree $\frac{l}{2}$. For a random projection into $\mathbb{R}^{d\times d}$, we have the following:

  1. If $T$ denotes the result of a true query, then:

     P(|T-1|>\epsilon)\leq 2\exp(\frac{-\epsilon^{2}}{2(\frac{l}{d}+\frac{k-l}{d^{2}}+\frac{\epsilon}{3})}) \qquad (2)

  2. If $F$ denotes the result of a false query, then:

     P(|F|>\epsilon)\leq 2\exp(\frac{-\epsilon^{2}}{2(\frac{l}{d}+\frac{k+1-l}{d^{2}}+\frac{\epsilon}{3})}) \qquad (3)
Proof.

Consider the random projection of a graph with $k+1$ edges:

G=uv^{T}+\sum_{i=1}^{k}p_{i}q_{i}^{T} \qquad (4)

where $u$, $v$, $p_{i}$, $q_{i}$ are random vectors drawn i.i.d. from the uniform distribution on the $d$-dimensional unit sphere.

We first prove equation 2 by considering the result of querying our random projection for the existence of the edge $(u,v)$, which is a valid edge. We assume $m$ of the other edges share one vertex with $(u,v)$, and WLOG we assume they all share the vertex $u$, so that $p_{i}=u$ for those edges. The result of the true query $T$ can be expressed as:

T=u^{T}Gv=1+\sum_{i=1}^{m}\langle q_{i},v\rangle+\sum_{j=m+1}^{k}\langle u,p_{j}\rangle\langle q_{j},v\rangle \qquad (5)

From Lemma A.1, both sums have mean 0, with the first sum having variance $\frac{m}{d}$ and the second having variance $\frac{k-m}{d^{2}}$. Hence, the total variance is $\frac{m}{d}+\frac{k-m}{d^{2}}$, and applying Bernstein’s inequality gives:

P(|T-1|>\epsilon)\leq 2\exp(\frac{-\epsilon^{2}}{2(\frac{m}{d}+\frac{k-m}{d^{2}}+\frac{\epsilon}{3})})

By assumption, we can bound $m\leq l$, and plugging in the worst case of $m=l$ gives equation 2.

Similarly, now suppose we query by a spurious edge $(s,t)$. Assuming $q$ of the other edges share one vertex with $(s,t)$, we use the same argument as above to derive the false query bound given by equation 3. ∎

Letting $Q$ denote a general edge query and $\sigma^{2}=Var(Q)$, Theorem A.1 can be summarized as:

P(|Q-EQ|>\epsilon)\leq 2\exp(\frac{-\epsilon^{2}}{2(\sigma^{2}+\frac{\epsilon}{3})})

Setting the deviation threshold to $\epsilon=C\sigma$ and rewriting the bound in terms of $\sigma$:

P(|Q-EQ|>C\sigma)\leq 2\exp(\frac{-C^{2}\sigma^{2}}{2(\sigma^{2}+\frac{C\sigma}{3})})=2\exp(\frac{-C^{2}}{2(1+\frac{C}{3\sigma})})

If $l=O(\sqrt{k})$, then $\sigma^{2}=\frac{O(\sqrt{k})}{d}+\frac{O(k)}{d^{2}}$. Hence, if $d=\Omega(\sqrt{k})$ then $Q$ is close to $EQ$ with high probability. Since $EQ=1$ for a true query and $EQ=0$ for a false query, $Q$ is close to the correct value with high probability. This justifies the statement of Property 1.

A.3 Property 2: Second-Order Scaling

The following theorem and its proof are an abbreviated adaptation of Theorem 11.7 from our previous work [18]. Two edges are composable if the target vertex of one is the source vertex of the other, and an edge $(u,v)$ is in the second power of the adjacency matrix if and only if it is the composition of two composable edges. We need to show that the second matrix power accurately represents this edge information. To this end, we prove that the edge query returns the correct result with high probability when applied to the second power of the random projection.

Theorem A.2.

Let $G$ be a graph with composable edges $(u,v)$ and $(v,w)$ along with $k-2$ nuisance edges, and assume $G$ has maximum node degree $\frac{l}{2}$. For a random projection into $\mathbb{R}^{d\times d}$, we have the following results for edge queries involving the projection’s second matrix power:

  1. If $T$ denotes the result of a true query of the second matrix power, then:

     P(|T-1|>\epsilon)\leq 2\exp(\frac{-\epsilon^{2}}{2(\frac{2l+1}{d}+\frac{kl-3l-3}{d^{2}}+\frac{k^{2}-k(l+2)+l+1}{d^{3}}+\frac{\epsilon}{3})}) \qquad (6)

  2. If $F$ denotes the result of a false query of the second matrix power, then:

     P(|F|>\epsilon)\leq 2\exp(\frac{-\epsilon^{2}}{2(\frac{kl}{d^{2}}+\frac{k^{2}-kl}{d^{3}}+\frac{\epsilon}{3})}) \qquad (7)
Proof.

For the nuisance edges, assume $m_{1}$ of them share the common vertex $v$ and $m_{2}$ share the common source $u$ or target $w$, and let $m=m_{1}+m_{2}$. WLOG we may assume all $m_{2}$ such edges have common source vertex $u$, and our projection can be expressed as:

G=uv^{T}+vw^{T}+\sum_{i=1}^{m_{1}}vp_{i}^{T}+\sum_{j=1}^{m_{2}}up_{j}^{T}+\sum_{l=1}^{k-m-2}q_{l}r_{l}^{T} \qquad (8)

From equation 8, we see that the second matrix power should only contain the edge $(u,w)$.

We first prove equation 6 and query the second matrix power $G^{2}$ for the presence of the edge $(u,w)$. Letting $\epsilon$ denote the dot product of two independent spherical vectors, the result of this edge query is a sum of terms that are products of i.i.d. $\epsilon$’s. Let $n_{1}=km_{2}-m_{2}-m-3$ and $n_{2}=k^{2}-k(m_{2}+2)+m_{2}+1$. Grouping terms by how many $\epsilon$’s they contain, we write our true query $T$ as:

T=1+\sum_{i=1}^{m+1}\epsilon_{i}+\sum_{j=1}^{n_{1}}\epsilon_{j_{1}}\epsilon_{j_{2}}+\sum_{h=1}^{n_{2}}\epsilon_{h_{1}}\epsilon_{h_{2}}\epsilon_{h_{3}}

Using Lemma A.1, we see that the variance $\sigma^{2}$ of the error terms is $\sigma^{2}=\frac{m+1}{d}+\frac{n_{1}}{d^{2}}+\frac{n_{2}}{d^{3}}$. An application of Bernstein’s inequality gives:

P(|T-1|>\epsilon)\leq 2\exp(\frac{-\epsilon^{2}}{2(\frac{m+1}{d}+\frac{n_{1}}{d^{2}}+\frac{n_{2}}{d^{3}}+\frac{\epsilon}{3})})

Since $m_{1},m_{2}\leq l$ and $m\leq 2l$, plugging in the worst case of $m_{1}=m_{2}=l$ and $m=2l$ gives equation 6.

Similarly, we now consider querying $G^{2}$ by a spurious edge $(s,t)$. We assume that $m$ of the edges have either common source vertex $s$ or target vertex $t$. Then a similar computation as above shows that we can express the false query $F$ as:

F=\sum_{i=1}^{km}\epsilon_{i_{1}}\epsilon_{i_{2}}+\sum_{j=1}^{k^{2}-km}\epsilon_{j_{1}}\epsilon_{j_{2}}\epsilon_{j_{3}}

An application of Bernstein’s inequality and a worst-case bound gives equation 7. ∎

As with Theorem A.1, the bounds of Theorem A.2 can be expressed in terms of the query variance $\sigma^{2}$, with deviation threshold $C\sigma$:

P(|Q-EQ|>C\sigma)\leq 2\exp(\frac{-C^{2}\sigma^{2}}{2(\sigma^{2}+\frac{C\sigma}{3})})=2\exp(\frac{-C^{2}}{2(1+\frac{C}{3\sigma})})

If $l=O(k^{1/3})$, then in both cases all terms in the variance can be expressed as powers of $O(\frac{k^{2/3}}{d})$. Hence, as long as $d=\Omega(k^{2/3})$, the edge query $Q$ is close to its correct value $EQ$ with high probability. This justifies the statement of Property 2.

A.4 Property 3: $m$-Order Scaling

We give a proof sketch adapted from Section 12.1 of our previous work [18]. We aim to analyze the accuracy of recovering edge information from the $m^{th}$ matrix power of our random projections and how the dimension of the projection space $\mathbb{R}^{d\times d}$ needs to scale with $m$.

As in the proof of Theorem A.2, let $\epsilon$ denote the dot product of distinct spherical vectors. From the proofs of Theorems A.1 and A.2, we need to control the edge query variance to ensure that true and false queries are close to their expected values with high probability. Intuitively, if the node degree is small, then performing an edge query on the $m^{th}$ matrix power will result in a majority of the $k^{m}$ error terms being products of $m+1$ independent $\epsilon$’s. Such terms have mean 0 and variance $\frac{1}{d^{m+1}}$. Since the true and false query scores are 1 and 0 respectively, for accurate recovery we need the edge query variance to be less than 1. As a sum of independent terms, the variance of the noise term will be approximately $\frac{k^{m}}{d^{m+1}}$, implying that we need $d=\Omega(k^{\frac{m}{m+1}})$ in order to retain accuracy. The node degree bound of $l=O(k^{1/(m+1)})$ ensures that the variance can be expressed as sums of powers of $O(\frac{k^{m/(m+1)}}{d})$.

Appendix B JL Lemma for Graphs

Here, we aim to prove Theorem 7.1 by establishing intermediate results. Importantly, along the way we show how our random projection method preserves the graph inner product of Section 7.1.

B.1 Random Projections Preserve Inner Products

Theorem B.1.

Consider two graphs $G$ and $H$ with $n_{1}$ and $n_{2}$ edges respectively. Suppose they have $k$ edges in common. Among all $n_{1}n_{2}$ pairs $(e,e^{\prime})\in E_{G}\times E_{H}$, suppose $q$ of these edge pairs share exactly one vertex. Let $\pi(G)$ and $\pi(H)$ denote the random projections of their adjacency matrices $A_{G}$ and $A_{H}$ into $\mathbb{R}^{d\times d}$. For any $\epsilon>0$, we have:

P(|\langle\pi(G),\pi(H)\rangle-\langle A_{G},A_{H}\rangle|>\epsilon)\leq 2\exp(\frac{-\epsilon^{2}}{\frac{q}{d}+\frac{n_{1}n_{2}-k-q}{d^{2}}+\frac{\epsilon}{3}}) \qquad (9)
Proof.

We can express their random projections as:

\pi(G)=\sum^{k}_{i=1}a_{i}b_{i}^{T}+\sum_{j=1}^{n_{1}-k}c_{j}d_{j}^{T}
\pi(H)=\sum^{k}_{i=1}a_{i}b_{i}^{T}+\sum_{l=1}^{n_{2}-k}e_{l}f_{l}^{T}

where $a,b,c,d,e,f$ are random spherical vectors. Computing their inner product:

tr(\pi(G)^{T}\pi(H))=k+tr(\pi(G)^{T}\sum_{l=1}^{n_{2}-k}e_{l}f_{l}^{T})+tr(\pi(H)^{T}\sum_{j=1}^{n_{1}-k}c_{j}d_{j}^{T})=k+E \qquad (10)

We proceed to bound the error term $E$ in equation 10. By assumption, $q$ of the terms in $E$ come from edge pairs sharing exactly one vertex, while the remaining terms come from pairs sharing no vertices:

E=\sum_{i=1}^{q}\epsilon_{i}+\sum_{j=1}^{n_{1}n_{2}-k-q}\epsilon_{j}\epsilon^{\prime}_{j}

where each $\epsilon$ is the dot product of independent spherical vectors. Using Lemma A.1 and Bernstein’s inequality [3], we get the following bound:

P(|E|\geq\epsilon)\leq 2\exp(\frac{-\epsilon^{2}}{\frac{q}{d}+\frac{n_{1}n_{2}-k-q}{d^{2}}+\frac{\epsilon}{3}})

B.2 Random Projections are a JL Map for Graphs

Theorem B.2.

Let $G$ be a graph with $k$ edges. For $d<k$ and small $\epsilon>0$, its random projection into $\mathbb{R}^{d\times d}$ satisfies:

P(|\lVert\pi(G)\rVert^{2}-k|>k\epsilon)\leq 2\exp(-d\epsilon^{2}) \qquad (11)
Proof.

Of the $k^{2}$ pairs $(e,e^{\prime})\in E_{G}\times E_{G}$, suppose that $q$ of them share exactly one vertex. Theorem B.1 states that:

P(|\lVert\pi(G)\rVert^{2}-k|>k\epsilon)\leq 2\exp(\frac{-k^{2}\epsilon^{2}}{\frac{q}{d}+\frac{k^{2}-k-q}{d^{2}}+\frac{k\epsilon}{3}})

As $q\leq k^{2}-k$, the worst-case bound occurs when $q=k^{2}-k$, which gives equation 11. Note that if $\epsilon$ is large, the third term $\frac{k\epsilon}{3}$ dominates and gives the trivial bound $2\exp(-k\epsilon)$. ∎

Theorem B.3.

Let $G$ and $H$ be two graphs with $\lVert A_{G}-A_{H}\rVert^{2}=m$. For $d<m$ and small $\epsilon>0$, the random projections of their adjacency matrices into $\mathbb{R}^{d\times d}$ satisfy the following:

P(|\lVert\pi(G)-\pi(H)\rVert^{2}-m|>m\epsilon)\leq 2\exp(-d\epsilon^{2}) \qquad (12)
Proof.

If we include signed edges, the graph $G-H$ has $m$ total edges. An application of Theorem B.2 gives the result. ∎

B.2.1 Proof of Theorem 7.1

Proof.

From Theorem B.3, our random projection satisfies the following inequalities with high probability:

(1-\epsilon)||A_{G}-A_{H}||^{2}\leq||\pi(G)-\pi(H)||^{2}\leq(1+\epsilon)||A_{G}-A_{H}||^{2} \qquad (13)

If we want this to hold for a set of $N$ sparse adjacency matrices, a union bound over all $\binom{N}{2}$ pairs shows that equation 13 holds with probability at least $1-2\binom{N}{2}\exp(-d\epsilon^{2})=1-N(N-1)\exp(-d\epsilon^{2})$. For a fixed probability threshold $T$ of violating the inequalities in 13, let us choose the optimal $d$ given $N$, denoted $d_{opt}$. That is, $d_{opt}$ is the smallest integer $d$ such that $N(N-1)\exp(-d\epsilon^{2})\leq T$. The optimal $d_{opt}$ satisfies the following inequalities:

N(N-1)\exp(-d_{opt}\epsilon^{2})\leq T
N(N-1)\exp(-(d_{opt}-1)\epsilon^{2})>T

Combining the two inequalities gives:

\frac{\log(N(N-1))-\log T}{\epsilon^{2}}\leq d_{opt}<\frac{\log(N(N-1))-\log T}{\epsilon^{2}}+1

Thus, we have $d_{opt}=\Omega(\frac{\log N}{\epsilon^{2}})$, which matches the scaling of the usual Johnson-Lindenstrauss lemma and establishes our random projections as a JL map for graphs. ∎