Scalable Relational Query Processing on Big Matrix Data

Yongyang Yu Facebook1 Hacker WayMenlo ParkCA94025USA [email protected] , Mingjie Tang Chinese Academy of ScienceZhongguan districtBeijingBeijing10010China [email protected] and Walid G. Aref Purdue University305 N. University StreetWest LafayetteIN47907USA [email protected]

Abstract.

The use of large-scale machine learning methods is becoming ubiquitous in many applications ranging from business intelligence to self-driving cars. These methods require a complex computation pipeline consisting of various types of operations, e.g., relational operations for pre-processing or post-processing the dataset, and matrix operations for core model computations. Many existing systems focus on efficiently processing matrix-only operations, and assume that the inputs to the relational operators are already pre-computed and are materialized as intermediate matrices. However, the input to a relational operator may be complex in machine learning pipelines, and may involve various combinations of matrix operators. Hence, it is critical to realize scalable and efficient relational query processors that directly operate on big matrix data. This paper presents new efficient and scalable relational query processing techniques on big matrix data for in-memory distributed clusters. The proposed techniques leverage algebraic transformation rules to rewrite query execution plans into ones with lower computation costs. A distributed query plan optimizer exploits the sparsity-inducing property of merge functions as well as Bloom join strategies for efficiently evaluating various flavors of the join operation. Furthermore, optimized partitioning schemes for the input matrices are developed to facilitate the performance of join operations based on a cost model that minimizes the communication overhead. The proposed relational query processing techniques are prototyped in Apache Spark. Experiments on both real and synthetic data demonstrate that the proposed techniques achieve up to two orders of magnitude in performance improvement over state-of-the-art systems on a wide range of applications.

query optimization, matrix data, distributed computing

^†^†copyright: acmcopyright^†^†doi: 0000001.0000001^†^†ccs: Information systems Database query processing^†^†ccs: Information systems Computing platforms^†^†ccs: Information systems MapReduce languages^†^†ccs: Computing methodologies MapReduce algorithms^†^†ccs: Computing methodologies

1. Introduction

Data analytics, including machine learning (ML, for short) and scientific research, often need to analyze large volumes of matrix data in various applications, e.g., business intelligence (BI), natural language processing (Manning and Klein, 2003), social network analysis, recommender systems (Beel et al., 2016), and genomic diagnostics (Qi et al., 2014). As ML models increase in complexity, and data volume accumulates drastically, traditional single-node solutions are incapable of processing this data efficiently. As a result, a number of distributed systems have been built to optimize data processing pipelines in a unified ecosystem of big matrix data for various kinds of applications.

In addition, queries to ML and scientific applications exhibit quite different characteristics from traditional business data processing applications. Complex analytics, e.g., linear regression (LR, for short) for classification, and principle component analysis (PCA) for dimension reduction, are prevalent in these query workloads. These complex analytics replace the traditional SQL aggregations that are commonly used in BI applications. These analytical tasks rely on large-scale matrix computations, and are more CPU-intensive than typical relational database computations.

The advent of MapReduce (Dean and Ghemawat, 2004) and Apache Spark (Zaharia et al., 2012) has spurred a number of distributed matrix computation systems, e.g., Mahout (mah, 2017), SystemML (Ghoting et al., 2011), DMac (Yu et al., 2015), MLlib (mll, 2018), and MatFast (Yu et al., 2017). In contrast to popular scientific platforms, e.g., R (r-l, 2018), the above systems provide better scalability and fault-tolerance in addition to efficiency in computations. However, they only focus on the efficient execution of pure matrix computations, where the computation consists of only matrix operations, e.g., matrix-matrix multiplications.

Data analytics on real-world applications need an entire data processing pipeline that consists of multiple stages and various data operations, e.g., relational selections and aggregations on matrix data. A typical pipeline of ML and scientific applications consists of several components, e.g., data pre-processing, core computation, and post-processing.

Refer to caption — Figure 1. Collaborative filtering with side information.

Example 0 (Collaborative Filtering).

Consider collaborative filtering with side information (Zhao et al., 2016). In Figure 1, an $n\times t$ item-user recommendation matrix $\boldsymbol{\mathrm{Y}}$ contains the observed recommendations of $t$ users over $n$ items. Entry $Y_{ij}\neq 0$ indicates that Item $i$ is recommended by User $j$ . An $n\times d$ item-feature Matrix $\boldsymbol{\mathrm{X}}$ contains side information, where each row is a feature vector for the corresponding item. The goal is to identify the most-promising top- $k$ items for each target user from her non-recommended items based on $\boldsymbol{\mathrm{Y}}$ and $\boldsymbol{\mathrm{X}}$ . However, Matrix $\boldsymbol{\mathrm{X}}$ needs to be constructed, e.g., by crawling each item’s webpage. Certain features may contain inconsistent or empty values. For successful feature extraction, all the empty columns are removed from $\boldsymbol{\mathrm{X}}$ and any inconsistent data is corrected by a data cleaning toolkit, i.e., matrix column and entry selection. Cross-validation (Hastie et al., 2001) is a common technique to estimate the regularizers to prevent over-fitting. The $k$ -fold cross-validation divides the training dataset into $k$ disjoint partitions, selects $(k-1)$ partitions as the training set, and keeps the $k$ -th partition for testing. The generation of $k$ disjoint partitions can be achieved by relational selections on the row dimension of Matrix $\boldsymbol{\mathrm{Y}}$ . For the post-processing, a max aggregation is conducted on the predicated matrix $\tilde{\boldsymbol{\mathrm{Y}}}$ that is computed from the collaborative filtering model using $\boldsymbol{\mathrm{Y}}\Join\tilde{\boldsymbol{\mathrm{X}}}$ , where $\boldsymbol{\mathrm{Y}}$ joins $\tilde{\boldsymbol{\mathrm{X}}}$ on the row dimension, to find the entries with the largest predicted values to recommend them to potential users. The matrix-only operations cannot evaluate the entire ML pipeline as certain relational operations cannot be computed, e.g., matrix-matrix join. The example suggests that the successful training and prediction of a complex ML model requires a seamless execution of high-performance relational as well as linear-algebra operations.

Although several systems provide efficient matrix computations, they do not offer relational operations over matrix data, or only provide limited optimizations. SciDB (Brown, 2010) is an array database system that supports rich relational operations on array data. However, SciDB is not well-tuned for sparse matrix computations. For other systems built on general dataflow platforms, e.g., SystemML, MLlib, MatFast, the relational operations could become a bottleneck in the entire matrix data processing pipeline if they are not treated properly. An effective matrix query optimizer should be able to process both relational and matrix operations efficiently, and discover the potential to execute the mixed types of operations interchangeably, e.g., pushing a relational operation under a matrix operation in the logical query evaluation plan with correctness guarantees.

In this paper, we identify equivalent transformation rules for rewriting logical query plans that involve both relational and matrix operations. For example, a query optimizer that is not sensitive to matrix operations evaluates $\Gamma_{\texttt{sum},r}(\boldsymbol{\mathrm{A}}\times(\boldsymbol{\mathrm{B}}+\boldsymbol{\mathrm{C}}))$ as in Figure 2a. The improved logical plan in Figure 2b executes the aggregate function before matrix multiplication. This plan is cheaper as the dimensions of the summed results are much smaller, hence reducing the cost of the matrix-matrix multiplication. The query optimizer should be able to enumerate all the equivalent transformations and their corresponding plans, and pick the plan with the minimum computational cost.

Also, in a distributed setup, the query optimizer for plans with mixed relational and matrix operations should be aware of the communication overhead incurred by expensive relational operations over big matrices. Load-balanced matrix data partitioning schemes, e.g., hash-based schemes, may not be efficient for joins with data dependencies. Consider the collaborative filtering example. The predicted item-user recommendation matrix $\tilde{\boldsymbol{\mathrm{Y}}}$ could be computed from two factor-matrices $\boldsymbol{\mathrm{W}}$ and $\boldsymbol{\mathrm{H}}$ learned from the observed data $\boldsymbol{\mathrm{Y}}$ and $\tilde{\boldsymbol{\mathrm{X}}}$ , e.g., $\tilde{\boldsymbol{\mathrm{Y}}}=\boldsymbol{\mathrm{W}}\times\boldsymbol{\mathrm{H}}^{T}$ . Suppose that we have another user-item matrix $\boldsymbol{\mathrm{C}}$ with the same users as $\tilde{\boldsymbol{\mathrm{Y}}}$ on different items. Tensor decomposition (Kolda and Bader, 2009) allows to capture connections among the users and two different sets of items. As will be explained in this paper, this tensor is constructed by conducting a join operation on two matrices, i.e., $\tilde{\boldsymbol{\mathrm{Y}}}\Join\boldsymbol{\mathrm{C}}$ , or $(\boldsymbol{\mathrm{W}}\times\boldsymbol{\mathrm{H}}^{T})\Join\boldsymbol{\mathrm{C}}$ , on the column dimension of $\tilde{\boldsymbol{\mathrm{Y}}}$ and the row dimension of $\boldsymbol{\mathrm{C}}$ . An efficient execution plan would partition Matrix $\tilde{\boldsymbol{\mathrm{Y}}}$ in the column dimension and Matrix $\boldsymbol{\mathrm{C}}$ in the row dimension to satisfy the requirement of the join predicate. Partitioning Matrix $\tilde{\boldsymbol{\mathrm{Y}}}$ in the column dimension further requires evaluating the costs of different matrix data partitioning strategies on $\boldsymbol{\mathrm{W}}$ and $\boldsymbol{\mathrm{H}}$ , and choosing the one with the minimum communication overhead. Therefore, optimizing the data partitioning of the input and intermediate matrices is a critical step for efficiently evaluating relational operations over big matrix data.

In this paper, we introduce MatRel, an in-memory distributed relational query processing system on big matrix data. MatRel has (1) a query optimizer to identify and leverage transformation rules of the interleaved relational and matrix operations to reduce the computation cost and the memory footprint, and (2) a matrix data partitioner to reduce the communication cost for relational joins over matrix data. The optimizer utilizes rule-based query rewrite to generate an optimized logical plan. MatRel leverages a cost model to partition the input matrices for relational join operations to minimize communication overhead. MatRel extends Spark SQL by providing a series of matrix operations, and leverages Spark SQL’s relational query optimization techniques and dataflow operators.

The main contributions of this paper are as follows:

•

We develop MatRel, an in-memory distributed system for relational query processing over big matrix data.
•

We identify equivalence transformation rules to rewrite a logical plan involving both relational and matrix operations.
•

We define join operators over matrix data and introduce optimization techniques to enhance their runtime cost.
•

We introduce a cost model to distribute matrix data for communication-efficient relational join operations.
•

We evaluate MatRel against state-of-the-art distributed matrix computation systems using real and synthetic datasets. Experimental results illustrate MatRel’s up to two orders of magnitude enhancement in performance.

The rest of this paper proceeds as follows. Section 2 presents the supported matrix operations and MatRel’s architecture. Sections 3 and 4 present the formal definitions of the supported relational operations and their corresponding optimizations over matrix data. Section 5 discusses the local execution of operators and the system implementation. Section 6 presents experimental results. Section 7 presents the related work, and Section 8 concludes the paper.

2. Preliminaries

Notation:
We adopt the following conventions: (1) A bold, upper-case Roman letter, e.g., $\boldsymbol{\mathrm{A}}$ , denotes a matrix. (2) A regular Roman letter with subscripts, e.g., $A_{ij}$ , is an element in Matrix $A$ . (3) A bold, lower-case Roman letter, e.g., $\boldsymbol{\mathrm{x}}$ is a column vector. (4) A lower-case Greek letter, e.g., $\beta$ , is a scalar. (5) $\boldsymbol{\mathrm{A}}_{ij}$ is a block matrix, where $i$ and $j$ are the row- and column-block indexes. (6) $(X_{ij})$ is a matrix with Element $X_{ij}$ at the $i$ -th row and the $j$ -th column. (7) A bold, upper-case calligraphic Roman letter, e.g., $\boldsymbol{\mathcal{A}}$ , is a tensor. (8) A regular calligraphic Roman letter with subscripts, e.g., $\mathcal{A}_{ijk}$ or $\mathcal{A}_{ijk\ell}$ , is an element in a 3rd- or 4th-order tensor (Kolda and Bader, 2009). The order of a tensor is the number of dimensions.

Supported Matrix Operators:
MatRel extends MatFast (Yu et al., 2017). MatRel supports the unary matrix transpose operator $\boldsymbol{\mathrm{B}}=\boldsymbol{\mathrm{A}}^{T},B_{ij}=A_{ji},\forall i,j$ ; following binary matrix operators: matrix-scalar addition, $\boldsymbol{\mathrm{B}}=\boldsymbol{\mathrm{A}}+\beta,B_{ij}=A_{ij}+\beta$ ; matrix-scalar multiplication, $\boldsymbol{\mathrm{B}}=\boldsymbol{\mathrm{A}}*\beta,B_{ij}=A_{ij}*\beta$ ; matrix-matrix addition, matrix-matrix element-wise multiplication/division, $\boldsymbol{\mathrm{C}}=\boldsymbol{\mathrm{A}}\star\boldsymbol{\mathrm{B}},C_{ij}=A_{ij}\star B_{ij}$ , where $\star\in\{+,*,/\}$ ; and matrix-matrix multiplication, $\boldsymbol{\mathrm{C}}=\boldsymbol{\mathrm{A}}\times\boldsymbol{\mathrm{B}},C_{ij}=\sum_{k}A_{ik}*B_{kj}$ . By default, matrix transpose has the highest precedence, and matrix-matrix multiplication has a higher precedence over element-wise operators. Besides basic matrix operators, many advanced operations are also supported in terms of basic ones (Golub and Van Loan, 2013), e.g., matrix inverse and QR factorization. In addition, MatRel supports relational operations, e.g., selection, aggregation, and join.

Overview of MatRel:

Refer to Figure 3. The input to MatRel can be an ML algorithm or a graph analytical task that consists of various relational and matrix operators over the input matrices. MatRel organizes the operators into a logical evaluation plan that is fed to MatRel’s query optimizer. MatRel’s optimizer extends Spark SQL’s Catalyst optimizer (Armbrust et al., 2015) by introducing new relational data types for matrix data. The optimizer casts matrix data as relational tables with predefined schemas. Using rule-based query rewrite heuristics, the initial query plan is transformed into an optimized execution plan in a distributed environment. The optimizer invokes the matrix data partitioner to assign efficient partitioning schemes to the distributed matrix data based on a cost model. Internally, the matrix data is organized as Resilient Distributed Datasets (RDDs) (Zaharia et al., 2012) for distributed execution. For efficient and fault-tolerant storage of matrix data, MatRel uses Hadoop Distributed File System (HDFS) for external storage.

3. Relational operators on matrix data

Relational Algebra for Matrix Data:
Relational algebra (Codd, 1970) and its primitive operators operate over relations. All relational operators accept, as input, relations with properly defined schemas, and produce, as output, relations with certain schemas that are derived from the input schemas. To ensure expressive relational operators over matrix data, MatRel casts a general schema over all matrix data.

3.1. Relational Schema of a Matrix

A matrix, say $\boldsymbol{\mathrm{A}}$ , consists of two dimensional entries, $A_{ij}$ . Each entry in a matrix resembles a tuple in a table with schema: matrixA (RID, CID, val), where RID and CID are the row and column dimensions, respectively, and val is the value of the matrix entry. In addition, MatRel maintains the dimensions of the matrix in the system catalog, i.e., the number of rows and columns. This schema specifies the logical structure of a matrix. However, it does not reflect how the matrix is physically stored. For efficiency and storage considerations, a matrix is divided into “blocks”, i.e., it is partitioned into smaller matrices that can be moved to specific compute nodes for high-performance local computations. Detailed physical storage of matrices is presented in Section 5.1.

3.2. Relational Selection on Matrix Data

Relational select ( $\sigma$ ) selects matrix entries that qualify a certain predicate. We distinguish between selections on the dimensions and selections on the values of a matrix. A relational select is a unary operator of the form: $\sigma_{\theta(RID,CID,val)}(\boldsymbol{\mathrm{A}}),$ where $\theta$ is a propositional formula on the dimensions and entries of Matrix $\boldsymbol{\mathrm{A}}$ . An atom in $\theta$ is defined as $u\varphi c$ or $u\varphi v$ , where $u,v\in\{RID,CID,val\}$ , $u\neq v$ , and $c$ is a constant, $\varphi\in\{<,\leq,=,\neq,\geq,>\}$ ; and the logical expressions among atoms, i.e., $\land$ (and), $\lor$ (or), and $\neg$ (negation) expressions. A relational select on a matrix produces another matrix, perhaps with different dimensions. The dimensions are determined by the predicate, e.g., the predicate $``RID=i"$ produces a matrix of a single row while the predicate $``RID=CID"$ produces the diagonal vector of the matrix.

Selecting the Entries:
A select produces a matrix of the same dimension if the select predicate only involves Attribute $val$ . The matrix entries that qualify the select predicate are preserved, and the remaining entries are set to NULL. The relational select on matrix entries does not change the dimension of the input. If a series of select predicates apply over a matrix, the following transformation rule applies:

(1)

\sigma_{\theta_{1}}(\sigma_{\theta_{2}}(\cdots(\sigma_{\theta_{k}}(\boldsymbol{\mathrm{A}}))))=\sigma_{\theta_{1}\land\theta_{2}\land\cdots\land\theta_{k}}(\boldsymbol{\mathrm{A}}),

where each $\theta_{i}$ is a predicate on the matrix entries.

Relational Select on the Matrix Dimensions:
A select predicate can operate on the matrix dimensions to pick slices of a matrix according to the given row/column dimension value. For example, cross-validation (Hastie et al., 2001) is an ML technique to estimate how accurately a predicative model will perform in practice. Specifically, leave-one-out cross-validation can be realized by selecting a single row from a training matrix as the test set. The output of a select on the dimensions is a matrix with different dimensions than the input’s, e.g., if the input is an $m\times n$ Matrix $\boldsymbol{\mathrm{A}}$ , $\sigma_{RID=i}(\boldsymbol{\mathrm{A}})$ is a $1\times n$ matrix whose entries are the $i$ -th row of $\boldsymbol{\mathrm{A}}$ . Similarly, $\sigma_{CID=j}(\boldsymbol{\mathrm{A}})$ is an $m\times 1$ matrix whose entries are the $j$ -th column; $\sigma_{RID\geq i_{1}\land RID\leq i_{2}}(\boldsymbol{\mathrm{A}})$ produces a matrix slice of $\boldsymbol{\mathrm{A}}$ , whose dimension is $(i_{2}-i_{1}+1)$ -by- $n$ .

The select predicate can be composed of conditions on both dimensions and entries. For instance, selecting all the entries whose values equal 10 from the 5-th row is expressed by $\sigma_{val=10\land RID=5}(\boldsymbol{\mathrm{A}})$ . Furthermore, in real-world ML applications, an empty column suggests no metrics have been observed for some feature. Removing empty columns will benefit the feature extraction procedure for better efficiency. Thus, we introduce two special predicates, $\sigma_{rows\neq NULL}(\boldsymbol{\mathrm{A}})~{}\text{and}~{}\sigma_{cols\neq NULL}(\boldsymbol{\mathrm{A}}).$ The former excludes the rows that are all NULLs, and the latter excludes the columns that are all NULLs. These are important extensions to the select operator because empty rows/columns usually do not contribute to matrix computations.

Optimizations for Selects over Dimensions:
MatRel supports matrix operations including transpose, matrix-scalar operation, matrix element-wise operation, and matrix-matrix multiplications. For matrix transpose, the transformation rules: $\sigma_{RID=i}(\boldsymbol{\mathrm{A}}^{T})=(\sigma_{CID=i}(\boldsymbol{\mathrm{A}}))^{T}$ and $\sigma_{CID=j}(\boldsymbol{\mathrm{A}}^{T})=(\sigma_{RID=j}(\boldsymbol{\mathrm{A}}))^{T}$ reduce the computational costs significantly by avoiding the matrix transpose. It avoids $O(n^{2})$ operations to compute the matrix transpose by leveraging a select on a single dimension.

Similar rewrite rules apply to matrix-scalar and matrix element-wise operations. The following rules can be viewed as pushing the select below the matrix operations: $\sigma_{RID=i}(\boldsymbol{\mathrm{A}}\star\boldsymbol{\mathrm{B}})=\sigma_{RID=i}(\boldsymbol{\mathrm{A}})\star\sigma_{RID=i}(\boldsymbol{\mathrm{B}})$ , where $\star\in\{+,*,/\}$ . This rule also applies to matrix-scalar multiplications. We illustrate these transformation rules on the selects w.r.t. the row dimension. The rules apply to selections on the column dimension as well. The number of required operations drops from $O(n^{2})$ to $O(n)$ by circumventing the intermediate matrix. For matrix-matrix multiplications, we have $\sigma_{RID=i}(\boldsymbol{\mathrm{A}}\times\boldsymbol{\mathrm{B}})=\sigma_{RID=i}(\boldsymbol{\mathrm{A}})\times\boldsymbol{\mathrm{B}}$ .

Proof.

Let $\boldsymbol{\mathrm{C}}=\boldsymbol{\mathrm{A}}\times\boldsymbol{\mathrm{B}}$ , where $C_{ij}=\sum_{k}A_{ik}*B_{kj}$ . Let $C_{i:}$ denote the $i$ -th row of Matrix $\boldsymbol{\mathrm{C}}$ . Then, by definition, $C_{i:}=\sum_{k}A_{ik}*B_{k:}$ , where $A_{ik}$ is the $(i,k)$ -th entry from $\boldsymbol{\mathrm{A}}$ , and $B_{k:}$ is $k$ -th row from $\boldsymbol{\mathrm{B}}$ . This can be viewed as the summation of each entry in the $i$ -th row of $\boldsymbol{\mathrm{A}}$ multiplying with corresponding rows of $\boldsymbol{\mathrm{B}}$ , i.e., $\boldsymbol{\mathrm{C}}_{i:}=\boldsymbol{\mathrm{A}}_{i:}\times\boldsymbol{\mathrm{B}}$ . ∎

Analogously, the Rule: $\sigma_{CID=j}(\boldsymbol{\mathrm{A}}\times\boldsymbol{\mathrm{B}})=\boldsymbol{\mathrm{A}}\times\sigma_{CID=j}(\boldsymbol{\mathrm{B}})$ can be verified. It takes $O(n^{3})$ operations for matrix-matrix multiplications, while a matrix-vector multiplication costs $O(n^{2})$ operations. Furthermore, the rule: $\sigma_{RID=i\land CID=j}(\boldsymbol{\mathrm{A}}\times\boldsymbol{\mathrm{B}})=\sigma_{RID=i}(\boldsymbol{\mathrm{A}})\times\sigma_{CID=j}(\boldsymbol{\mathrm{B}})$ reduces the complexity from $O(n^{3})$ (matrix-matrix multiplication) to $O(n)$ (vector inner-product).

3.3. Aggregation on Matrix Data

Traditionally, an aggregation is defined on the columns of a relation with various aggregate functions, e.g., sum and count. The aggregation operator is also beneficial on matrix data. For example, assume that for a user-movie rating database, each row in the matrix represents a user, and each column represents a certain movie. Then, computing the average ratings of all the users can take place in the following way. The $(i,j)$ -th entry in the user-movie rating matrix denotes the rating of $user_{i}$ on $movie_{j}$ . This requires an average computation on the row dimension. Unlike relational data, aggregations on a matrix may occur in different dimensions. Formally, an aggregation is a unary operator, defined as

\Gamma_{\rho,dim}(\boldsymbol{\mathrm{A}}),

where $\rho$ is the name of the aggregate function, and $dim$ indicates the chosen dimension, i.e., $dim$ could be row (r), column (c), diagonal (d), or all (a). Specially, MatRel supports various aggregate functions, $\rho\in\{\texttt{sum},\texttt{nnz}$ , $\texttt{avg},\texttt{max},\texttt{min}\}$ . We introduce each aggregate function and corresponding optimizations in this section.

Sum() Aggregate:
MatRel supports four different variants of sum aggregations $\Gamma_{\texttt{sum},r/c/d/a}(\boldsymbol{\mathrm{A}})$ . We first introduce their definitions, then discuss optimization strategies. Given an $m$ -by- $n$ Matrix $\boldsymbol{\mathrm{A}}$ , $\Gamma_{\texttt{sum},r}(\boldsymbol{\mathrm{A}})$ produces an $m$ -by-1 matrix, or a column vector, whose $i$ -th entry computes the summation along the $i$ -th row of $\boldsymbol{\mathrm{A}}$ . Similarly, $\Gamma_{\texttt{sum},c}(\boldsymbol{\mathrm{A}})$ generates a 1-by- $n$ matrix, or a row vector that computes the summations of entries along the column direction. $\Gamma_{\texttt{sum},d}(\boldsymbol{\mathrm{A}})$ is defined only for square matrices that is also called the trace of a square matrix. The trace is simply a scalar value. Finally, $\Gamma_{\texttt{sum},a}(\boldsymbol{\mathrm{A}})$ computes the summation of all the entries.

We present transformation rules for optimizing sum aggregations when the input consists of various matrix operations. If the input is a matrix transpose, then we have:

(2)		$\displaystyle\Gamma_{\texttt{sum},r/c}(\boldsymbol{\mathrm{A}}^{T})$	$\displaystyle=(\Gamma_{\texttt{sum},c/r}(\boldsymbol{\mathrm{A}}))^{T},$
(3)		$\displaystyle\Gamma_{\texttt{sum},d/a}(\boldsymbol{\mathrm{A}}^{T})$	$\displaystyle=\Gamma_{\texttt{sum},d/a}(\boldsymbol{\mathrm{A}}),$

By pushing a sum aggregation before a matrix transpose, MatRel computes the transpose on a matrix of smaller dimensions. For trace and all-entry summation, the evaluation of the transpose can be avoided.

For matrix-scalar operations, we can derive the following transformation rules,

(4)	$\displaystyle\Gamma_{\texttt{sum},r/c/a}(\boldsymbol{\mathrm{A}}+\beta)$	$\displaystyle=\Gamma_{\texttt{sum},r/c/a}(\boldsymbol{\mathrm{A}})+\beta\boldsymbol{\mathrm{e}}_{n}/\beta\boldsymbol{\mathrm{e}}^{T}_{m}/\beta mn,$
(5)	$\displaystyle\Gamma_{\texttt{sum},d}(\boldsymbol{\mathrm{A}}+\beta)$	$\displaystyle=\Gamma_{\texttt{sum},d}(\boldsymbol{\mathrm{A}})+\beta n,$
(6)	$\displaystyle\Gamma_{\texttt{sum},\cdot}(\boldsymbol{\mathrm{A}}*\beta)$	$\displaystyle=\Gamma_{\texttt{sum},\cdot}(\boldsymbol{\mathrm{A}})*\beta,$

where we denote $\boldsymbol{\mathrm{e}}_{n}$ to be an all-one column vector of length $n$ and Rule 5 only holds for square matrices, i.e., $m=n$ . For matrix-scalar multiplications, similar rules apply when the aggregation is executed along the column direction, the diagonal direction, and the entire matrix.

For matrix element-wise operations, only the matrix element-wise addition has a compatible transformation rule for sum aggregations.

(7)

\displaystyle\Gamma_{\texttt{sum},\cdot}(\boldsymbol{\mathrm{A}}+\boldsymbol{\mathrm{B}})=\Gamma_{\texttt{sum},\cdot}(\boldsymbol{\mathrm{A}})+\Gamma_{\texttt{sum},\cdot}(\boldsymbol{\mathrm{B}}).

Though Rule 7 does not change the computation overhead, it circumvents storing the intermediate sum matrix $\boldsymbol{\mathrm{A}}+\boldsymbol{\mathrm{B}}$ , alleviating memory footprint.

For matrix-matrix multiplications, we derive efficient transformation rules to compute the sum aggregations,

(8)	$\displaystyle\Gamma_{\texttt{sum},r}(\boldsymbol{\mathrm{A}}\times\boldsymbol{\mathrm{B}})$	$\displaystyle=\boldsymbol{\mathrm{A}}\times\Gamma_{\texttt{sum},r}(\boldsymbol{\mathrm{B}}),$
(9)	$\displaystyle\Gamma_{\texttt{sum},c}(\boldsymbol{\mathrm{A}}\times\boldsymbol{\mathrm{B}})$	$\displaystyle=\Gamma_{\texttt{sum},c}(\boldsymbol{\mathrm{A}})\times\boldsymbol{\mathrm{B}},$
(10)	$\displaystyle\Gamma_{\texttt{sum},a}(\boldsymbol{\mathrm{A}}\times\boldsymbol{\mathrm{B}})$	$\displaystyle=\Gamma_{\texttt{sum},c}(\boldsymbol{\mathrm{A}})\times\Gamma_{\texttt{sum},r}(\boldsymbol{\mathrm{B}}),$
(11)	$\displaystyle\Gamma_{\texttt{sum},d}(\boldsymbol{\mathrm{A}}\times\boldsymbol{\mathrm{B}})$	$\displaystyle=\Gamma_{\texttt{sum},a}(\boldsymbol{\mathrm{A}}^{T}\boldsymbol{\mathrm{B}})=\Gamma_{\texttt{sum},a}(\boldsymbol{\mathrm{A}}\boldsymbol{\mathrm{B}}^{T}).$

Proof.

(Rule 8) Let $\boldsymbol{\mathrm{C}}=\boldsymbol{\mathrm{A}}\times\boldsymbol{\mathrm{B}}$ , where $C_{ij}=\sum_{k}A_{ik}*B_{kj}$ . Let the column Vector $\boldsymbol{\mathrm{y}}=\Gamma_{\texttt{sum},r}(\boldsymbol{\mathrm{C}})$ . Therefore,

y_{i}=\sum_{j}C_{ij}=\sum_{j}\sum_{k}A_{ik}*B_{kj}=\sum_{k}A_{ik}*b_{k},

where Vector $\boldsymbol{\mathrm{b}}=\Gamma_{\texttt{sum},r}(\boldsymbol{\mathrm{B}})$ . It is clear that the derivation in the matrix format, $\boldsymbol{\mathrm{y}}=\boldsymbol{\mathrm{A}}\times\boldsymbol{\mathrm{b}}$ that is exactly Rule 8. ∎

Similar derivations also apply to Rules 9 and 10. For Rule 11, both input matrices are required to be square.

Proof.

(Rule 11) Let us write Matrix $\boldsymbol{\mathrm{A}}$ in terms of rows, and Matrix $\boldsymbol{\mathrm{B}}$ in terms of columns, i.e.,

\boldsymbol{\mathrm{A}}=\begin{bmatrix}\boldsymbol{\mathrm{a}}_{1}^{T}\\ \vdots\\ \boldsymbol{\mathrm{a}}_{n}^{T}\end{bmatrix},\quad\boldsymbol{\mathrm{B}}=\begin{bmatrix}\boldsymbol{\mathrm{b}}_{1}&\cdots&\boldsymbol{\mathrm{b}}_{n}\end{bmatrix}.

Therefore, the matrix-matrix multiplication can be represented by the vectors $\boldsymbol{\mathrm{a}}_{i}$ ’s and $\boldsymbol{\mathrm{b}}_{i}$ ’s.

\boldsymbol{\mathrm{A}}\times\boldsymbol{\mathrm{B}}=\begin{bmatrix}\boldsymbol{\mathrm{a}}_{1}^{T}\boldsymbol{\mathrm{b}}_{1}&\dots&\boldsymbol{\mathrm{a}}_{1}^{T}\boldsymbol{\mathrm{b}}_{n}\\ \vdots&\ddots&\vdots\\ \boldsymbol{\mathrm{a}}_{n}^{T}\boldsymbol{\mathrm{b}}_{1}&\dots&\boldsymbol{\mathrm{a}}_{n}^{T}\boldsymbol{\mathrm{b}}_{n}\end{bmatrix}

The trace of $\boldsymbol{\mathrm{A}}\times\boldsymbol{\mathrm{B}}$ is computed as $\Gamma_{\texttt{sum},d}(\boldsymbol{\mathrm{A}}\times\boldsymbol{\mathrm{B}})=\sum_{i=1}^{n}\boldsymbol{\mathrm{a}}_{i}^{T}\times\boldsymbol{\mathrm{b}}_{i}$ . The transpose of $\boldsymbol{\mathrm{A}}$ is denoted as,

\boldsymbol{\mathrm{A}}^{T}=\begin{bmatrix}\boldsymbol{\mathrm{a}}_{1}&\cdots&\boldsymbol{\mathrm{a}}_{n}\end{bmatrix},

and the matrix element-wise multiplication is computed as:

\boldsymbol{\mathrm{A}}^{T}*\boldsymbol{\mathrm{B}}=\begin{bmatrix}\boldsymbol{\mathrm{a}}_{1}*\boldsymbol{\mathrm{b}}_{1}&\cdots&\boldsymbol{\mathrm{a}}_{n}*\boldsymbol{\mathrm{b}}_{n}\end{bmatrix}.

When conducting the all-entry sum aggregation, it is essential to compute the inner-product of corresponding vector pairs, i.e., $\Gamma_{\texttt{sum},a}(\boldsymbol{\mathrm{A}}^{T}*\boldsymbol{\mathrm{B}})=\sum_{i=1}^{n}\boldsymbol{\mathrm{a}}_{i}^{T}\times\boldsymbol{\mathrm{b}}_{i}$ . The second half of Rule 11 can be verified in a similar manner. ∎

The computation complexity is usually $O(n^{3})$ for square matrix-matrix multiplications. By using matrix element-wise multiplication and matrix transpose, the complexity is reduced to $O(n^{2})$ .

Unfortunately, equivalent transformation rules do not apply when the input involves a relational selection. For general selection condition $\theta$ ,

(12)

\displaystyle\Gamma_{\texttt{sum},\cdot}(\sigma_{\theta}(\boldsymbol{\mathrm{A}}))

\displaystyle\neq\sigma_{\theta}(\Gamma_{\texttt{sum},\cdot}(\boldsymbol{\mathrm{A}})).

However, there exist valid transformation rules if the sum dimension happens to match with the predicate of the selection, e.g., $\Gamma_{\texttt{sum},r}(\sigma_{RID=i}(\boldsymbol{\mathrm{A}}))=\sigma_{RID=i}(\Gamma_{\texttt{sum},r}(\boldsymbol{\mathrm{A}}))$ . Swapping the execution order of sum aggregation leads to the same result. Pushing a select relational operation below a matrix operation is beneficial as it could reduce the dimension of the matrix.

Count() Aggregate:
For relational data, the count aggregate function computes the number of tuples that satisfy a certain criteria. For matrix data, the count function is usually interpreted as computing the number of nonzeros (nnz). This function is quite useful for optimizing matrix computations, e.g., sparsity estimation for sparse matrix chain multiplications (Yu et al., 2017). Similar to sum aggregation, MatRel supports four count variants, $\Gamma_{\texttt{nnz},r/c/d/a}(\boldsymbol{\mathrm{A}})$ . The output of each function is defined as its counterpart of the sum function correspondingly.

We present equivalent transformation rules for optimizing count aggregations when the input is composed of various matrix operations. For matrix transpose, we have:

(13)		$\displaystyle\Gamma_{\texttt{nnz},r/c}(\boldsymbol{\mathrm{A}}^{T})$	$\displaystyle=(\Gamma_{\texttt{nnz},c/r}(\boldsymbol{\mathrm{A}}))^{T},$
(14)		$\displaystyle\Gamma_{\texttt{nnz},d/a}(\boldsymbol{\mathrm{A}}^{T})$	$\displaystyle=\Gamma_{\texttt{nnz},d/a}(\boldsymbol{\mathrm{A}}).$

These rules push a count aggregation below a matrix transpose. The postponed matrix transpose is executed on a smaller matrix, e.g., a vector, avoiding $O(n^{2})$ computations for the matrix transpose.

For matrix-scalar operations, let $\boldsymbol{\mathrm{A}}$ be an $m$ -by- $n$ matrix, and scalar $\beta\neq 0$ . The following rules hold for an input that involves matrix-scalar operations,

(15)	$\displaystyle\Gamma_{\texttt{nnz},r}(\boldsymbol{\mathrm{A}}+\beta)$	$\displaystyle=\boldsymbol{\mathrm{e}}_{m}*n,$
(16)	$\displaystyle\Gamma_{\texttt{nnz},c}(\boldsymbol{\mathrm{A}}+\beta)$	$\displaystyle=\boldsymbol{\mathrm{e}}_{n}^{T}*m,$
(17)	$\displaystyle\Gamma_{\texttt{nnz},d}(\boldsymbol{\mathrm{A}}+\beta)$	$\displaystyle=n,$
(18)	$\displaystyle\Gamma_{\texttt{nnz},a}(\boldsymbol{\mathrm{A}}+\beta)$	$\displaystyle=mn,$
(19)	$\displaystyle\Gamma_{\texttt{nnz},\cdot}(\boldsymbol{\mathrm{A}}*\beta)$	$\displaystyle=\Gamma_{\texttt{nnz},\cdot}(\boldsymbol{\mathrm{A}}).$

By assuming the constant $\beta\neq 0$ , the number of zeros is not altered for matrix-scalar multiplications. For the matrix-scalar addition, the output does not depend on the entries in $\boldsymbol{\mathrm{A}}$ , which only requires the dimensions to conduct the computation.

For matrix element-wise operations, only the division operator works properly with the count aggregation.

(20)

\displaystyle\Gamma_{\texttt{nnz},\cdot}(\boldsymbol{\mathrm{A}}/\boldsymbol{\mathrm{B}})=\Gamma_{\texttt{nnz},\cdot}(\boldsymbol{\mathrm{A}})

Rule 20 indicates that the matrix element-wise division preserves the number of nonzeros of the numerator matrix. Unfortunately, such convenient transformation rules do not apply to other matrix element-wise operators and matrix-matrix multiplications.

For an input consisting of relational selections, Rule 12 can be extended too. It is impossible to swap the execution order of a count aggregation and a selection for a correct output. It is only valid to swap the execution order of a count aggregation and a selection when both work on the same dimension, e.g., $\Gamma_{\texttt{nnz},c}(\sigma_{CID=j}(\boldsymbol{\mathrm{A}}))=\sigma_{CID=j}(\Gamma_{\texttt{nnz},c}(\boldsymbol{\mathrm{A}}))$ .

Avg() Aggregate:
The avg aggregate function computes the average statistic on a matrix in a certain dimension. MatRel supports four different variants, $\Gamma_{\texttt{avg},r/c/d/a}(\boldsymbol{\mathrm{A}})$ . The output of each avg aggregate function is defined as its counterpart of the sum aggregation correspondingly. The avg aggregation can be viewed as a compound operator from a sum and a count aggregations. Formally, any avg aggregation can be computed as:

\Gamma_{\texttt{avg},\cdot}(\boldsymbol{\mathrm{A}})=\Gamma_{\texttt{sum},\cdot}(\boldsymbol{\mathrm{A}})/\Gamma_{\texttt{nnz},\cdot}(\boldsymbol{\mathrm{A}}).

Therefore, optimizing an avg aggregation can leverage all the transformation rules we have explored for sum and count aggregations. MatRel’s optimizer automatically invokes query planning for sum and count aggregations when generating an execution plan for an avg aggregation.

Max()/Min() Aggregate:
The max (or min) aggregations are used to obtain the extreme values from a matrix, and serves as an important building block for anomaly detection (Chandola et al., 2009) in data mining applications. Naturally, MatRel supports four max (or min) variants, $\Gamma_{\texttt{max}/\texttt{min},r/c/d/a}(\boldsymbol{\mathrm{A}})$ . The output of each aggregation is defined as its counterpart of the sum aggregation correspondingly. The transformation rules for optimizing the max (or min) aggregations can be derived as follows:

(21)	$\displaystyle\Gamma_{\texttt{max}/\texttt{min},r/c}(\boldsymbol{\mathrm{A}}^{T})$	$\displaystyle=(\Gamma_{\texttt{max}/\texttt{min},c/r}(\boldsymbol{\mathrm{A}}))^{T},$
(22)	$\displaystyle\Gamma_{\texttt{max}/\texttt{min},d/a}(\boldsymbol{\mathrm{A}}^{T})$	$\displaystyle=\Gamma_{\texttt{max}/\texttt{min},d/a}(\boldsymbol{\mathrm{A}}),$
(23)	$\displaystyle\Gamma_{\texttt{max}/\texttt{min},\cdot}(\boldsymbol{\mathrm{A}}+\beta)$	$\displaystyle=\Gamma_{\texttt{max}/\texttt{min},\cdot}(\boldsymbol{\mathrm{A}})+\beta,$
(24)	$\displaystyle\Gamma_{\texttt{max}/\texttt{min},\cdot}(\boldsymbol{\mathrm{A}}*\beta)$	$\displaystyle=\Gamma_{\texttt{max}/\texttt{min},\cdot}(\boldsymbol{\mathrm{A}})*\beta\quad(\beta>0),$
(25)	$\displaystyle\Gamma_{\texttt{max}/\texttt{min},\cdot}(\boldsymbol{\mathrm{A}}*\beta)$	$\displaystyle=\Gamma_{\texttt{min}/\texttt{max},\cdot}(\boldsymbol{\mathrm{A}})*\beta\quad(\beta<0).$

We assume that the constant scalar $\beta\neq 0$ . When $\beta<0$ , the optimizer computes the min/max aggregation instead of avoiding the intermediate matrix $\boldsymbol{\mathrm{A}}*\beta$ . For matrix element-wise operations and matrix-matrix multiplications, no general rules apply. For example, $\Gamma_{\texttt{min},r}([1,1,2]+[3,0.5,0])=\Gamma_{\texttt{min},r}([4,1.5,2])=1.5$ , which could only be evaluated after computing the sum of the two matrices. For an input consisting of relational selections, it is only valid to swap the execution order of a max/min aggregation and selection when both work on the same dimensions.

4. Join on matrix data

The relational join ( $\Join$ ) is a useful operator for picking corresponding entries that satisfy the join predicates from two separate matrices. For example, raster data overlay analysis (Eldawy et al., 2016) requires a join on two matrices with matching row/column index, where a raster map can be interpreted as a matrix. Formally, a relational join is a binary operator,

\boldsymbol{\mathrm{A}}\Join_{\gamma,f}\boldsymbol{\mathrm{B}},

where $\gamma$ is the join predicate, $f$ is the user-defined merge function that takes two matching entries and outputs a merging result, e.g., $z=f(x,y)$ . We introduce various formats of $\gamma$ in different semantics in the subsequent sections. In general, the result of a relational join on two matrices is a tensor, or a multi-dimensional array, whose dimensions are determined by the predicate $\gamma$ . We first discuss how to infer the schema of a join output based on the join predicates. Next, we introduce all the valid formats of $\gamma$ . Finally, we investigate the optimization strategies for join executions.

4.1. Schema of a Join Result

Just like every matrix has a schema, MatRel automatically produces a schema for a join result. The schema of a join result consists of two parts: the index and the value. In this work, we only consider equality predicates as the join condition. Given two input matrices $A(RID_{A},CID_{A},val_{A})$ and $B(RID_{B},CID_{B},val_{B})$ , the cardinality of the dimension of $\boldsymbol{\mathrm{A}}\Join_{\gamma,f}\boldsymbol{\mathrm{B}}$ is

d=4-\delta_{dim},

where $\delta_{dim}$ is the number of equality predicates on the join dimensions. For example, if the join predicates $\gamma=``RID_{A}=RID_{B}"$ , the join output boasts the schema $(D_{1},D_{2},D_{3},val)$ , where $D_{1}$ takes matching dimension from $RID_{A}$ , $D_{2}$ takes relevant dimension from $CID_{A}$ , $D_{3}$ takes relevant dimension from $CID_{B}$ , and $val$ is the evaluation of the merge function on the matching entries from $\boldsymbol{\mathrm{A}}$ and $\boldsymbol{\mathrm{B}}$ . The join output may degrade to a normal matrix when $\gamma$ contains two or more predicates on the join dimensions.

4.2. Cross-product on Matrices

The cross-product operator is a special join operator between two matrices, where the join predicate $\gamma$ is empty, i.e., $\boldsymbol{\mathrm{A}}\otimes\boldsymbol{\mathrm{B}}=\boldsymbol{\mathrm{A}}\Join_{f}\boldsymbol{\mathrm{B}}$ . If both input matrices are viewed as relations, this operator is essentially the Cartesian product. The cross-product is widely used in tensor decomposition and applications, e.g., Kronecker product (Kolda and Bader, 2009). MatRel’s optimizer infers the output schema from the join predicate, which is empty for cross-product. Therefore, a cross-product produces a 4th-order tensor from two input matrices. For example, $C(D_{1},D_{2},D_{3},D_{4},val)$ is the schema of $\boldsymbol{\mathcal{C}}=\boldsymbol{\mathrm{A}}\otimes\boldsymbol{\mathrm{B}}$ . The first two dimensions $(D_{1},D_{2})$ inherit from $\boldsymbol{\mathrm{A}}$ , and the last two dimensions $(D_{3},D_{4})$ inherit from $\boldsymbol{\mathrm{B}}$ .

The execution of a cross-product is conducted in a fully parallel manner. MatRel stores a matrix in blocks of equal size, where the blocks are distributed among all the workers in a cluster. First, each block of the smaller matrix is duplicated $p$ times, where $p$ is the number of partitions from the larger matrix. Next, each duplicated block is sent to a corresponding partition of the larger matrix. Finally, each worker performs join operation locally without further communication. The 4th-order tensor is stored as a series of block matrices in a distributed manner. The detailed physical storage is described in Section 5.1.

4.3. Join on Two Dimensions

The dimension-based predicates for join on two dimensions can take two different formats:

\boldsymbol{\mathrm{A}}\Join_{RID_{A}=RID_{B}\land CID_{A}=CID_{B},f}\boldsymbol{\mathrm{B}},~{}\text{or}~{}\boldsymbol{\mathrm{A}}\Join_{RID_{A}=CID_{B}\land CID_{A}=RID_{B},f}\boldsymbol{\mathrm{B}}.

These two predicates are the only valid formats, as a propositional formula takes dimensions from both inputs. The former predicate can be viewed as an overlay on the two input matrices directly (direct overlay), while the latter can be viewed as an overlay on Matrix $\boldsymbol{\mathrm{A}}$ and the transpose of Matrix $\boldsymbol{\mathrm{B}}$ (transpose overlay). Figure 4 illustrates an example of a direct overlay, which can be viewed as a full-outer join in relational algebra. Conceptually, two input matrices can be regarded as two relations with the schema introduced in Section 3.1. The output of a join on two dimensions is a normal matrix, since two common dimensions are shared with the inputs.

To efficiently evaluate joins on two dimensions, MatRel adopts the hash join strategy to partition the matrix blocks from the inputs. By hashing the matched matrix blocks to the same worker, a worker conducts local join execution without further communications. For direct overlay, MatRel’s query planner partitions the two input matrices using the same partitioner. For transpose overlay, the planner makes sure the partitioning scheme on one input matches the partitioning scheme on the transpose of the other.

4.4. Join on a Single Dimension

The dimension-to-dimension (D2D) predicates can take four different formats:

\boldsymbol{\mathrm{A}}\Join_{ID_{A}=ID_{B},f}\boldsymbol{\mathrm{B}},

where $ID_{A/B}\in\{RID_{A/B},CID_{A/B}\}$ . Therefore, the join output is a 3rd-order tensor, and takes the schema $(D_{1},D_{2},D_{3},val)$ . Each entry from a matched row/column from $\boldsymbol{\mathrm{A}}$ is joined with the entries from the corresponding row/column from $\boldsymbol{\mathrm{B}}$ .

MatRel leverages a hash-based data partitioner to map matched row or column matrix blocks to the same worker. Suppose both input matrices are partitioned into $\ell$ -by- $\ell$ square blocks. Let us consider the D2D predicate to be $``RID_{A}=RID_{B}"$ . Two matrix blocks $\boldsymbol{\mathrm{A}}_{mp}$ and $\boldsymbol{\mathrm{B}}_{mq}$ are mapped to the same worker. For the $t$ -th row from $\boldsymbol{\mathrm{A}}_{mp}$ , each entry is joined with the same row from $\boldsymbol{\mathrm{B}}_{mq}$ . Thus, a $1$ -by- $\ell$ vector is produced by joining a single entry in the $t$ -th row from $\boldsymbol{\mathrm{A}}_{mp}$ and an entire row from $\boldsymbol{\mathrm{B}}_{mq}$ . After the join operation, the $t$ -th row from $\boldsymbol{\mathrm{A}}_{mp}$ leads to an $\ell$ -by- $\ell$ matrix block. Potentially, there are totally $\ell$ such matrix blocks for the join result from two input matrix blocks.

4.5. Join on the Entries

The entry-based (value-to-value, or V2V) join operation takes the following format:

\boldsymbol{\mathrm{A}}\Join_{val_{A}=val_{B},f}\boldsymbol{\mathrm{B}}.

The join output is a 4th-order tensor, where the first two dimensions come from the dimensions of $\boldsymbol{\mathrm{A}}$ , and the other two from the dimensions of $\boldsymbol{\mathrm{B}}$ .

The execution of a V2V join is conducted in a fully parallel manner, similar to the execution of cross-product. MatRel broadcasts each block of the smaller input to each partition of the larger one. Each worker in the cluster adopts a nested-loop join strategy to compute the join locally by taking each pair of matrix blocks. The four dimensions are copies of the dimension values from matched entries.

4.6. Join on a Single Dimension and an Entry

The join on a single dimension and an entry operation (D2V, or V2D) takes the following format:

\boldsymbol{\mathrm{A}}\Join_{ID_{A}=val_{B},f}\boldsymbol{\mathrm{B}}\text{, or }\boldsymbol{\mathrm{A}}\Join_{val_{A}=ID_{B},f}\boldsymbol{\mathrm{B}}

where $ID_{A/B}\in\{RID_{A/B},CID_{A/B}\}$ . The output is a 4th-order tensor, where the 4 dimensions derive from the entry locations of the matched rows/columns, and the matched entries from the other input. To evaluate this type of join, the matrix blocks containing the matched entries are mapped to the other matrix, where the row/column dimension matches. Each worker conducts local join computation based on the matched dimensions and entries.

4.7. Join Optimization

Relational joins over matrix data are more complex than other relational operations. They potentially have higher computation and communication overhead. We have identified several heuristics to mitigate the heavy memory footprint and computation burden. We leverage the following features: (1) sparsity-inducing merge functions, and (2) Bloom-join for join on entries. Furthermore, we also develop a cost model to capture the communication overhead among different types of joins. By utilizing the heuristics and the cost model, MatRel generates a computation- and communication-efficient execution plan for a join operation.

Identifying Sparsity-inducing Merge Functions. Given a general join operation, $\boldsymbol{\mathrm{A}}\Join_{\gamma,f}\boldsymbol{\mathrm{B}}$ , there are two important parameters, $\gamma$ and $f$ . The predicate $\gamma$ is utilized to locate the join entries, and $f$ is evaluated on the two entries for an output entry. In real-world applications, large matrices are usually sparse and structured¹¹1http://www.cise.ufl.edu/research/sparse. For example, Figure 4 illustrates a direct overlay on two sparse matrices, where the merge function is $f(x,y)=x+y$ . We say $f(x,y)$ is a sparsity-inducing function if $f(0,\cdot)=0$ or $f(\cdot,0)=0$ . Thus, the summation merge function is not sparsity-inducing since $C_{00}=A_{00}+B_{00}=0+2=2$ , which does not preserve sparsity from the left- or right-hand side. On the other hand, $f_{1}(x,y)=x*y$ , is sparsity-inducing on both sides.

The benefit of utilizing sparsity-inducing function is obvious, as it avoids evaluating the function completely if one of the inputs contains all zeros. Traditionally, a straw man execution plan takes two input matrix blocks. It at first converts the sparse matrix formats to the dense counterpart by explicitly filling in 0’s, then evaluates the merge function on the matched entries. Note that a sparse matrix format only records nonzeros and their locations. Compressed sparse representations reduce memory consumption and computation overhead for matrix manipulations. However, the memory consumption grows significantly when a sparse matrix is stored in the dense format.

Thus, it is critical to identify the sparsity-inducing merge functions, and avoid converting a sparse matrix format to dense format whenever possible. We consider a special family of merge functions: linear function and their linear combinations, i.e., $f(x,y)=g(x)*y+h(x)$ , or $f(x,y)=g(y)*x+h(y)$ , where both $g(\cdot)$ and $h(\cdot)$ are linear functions. For any merge function in this family, MatRel adopts a sampling approach to identify the sparsity-inducing property. Given a merge function $f(x,y)$ , we compute $t_{1}=f(0,s_{1})$ , and $t_{2}=f(0,s_{2})$ , where $s_{1}$ and $s_{2}$ are non-zero random numbers. If both $t_{1}$ and $t_{2}$ are 0, it can be easily confirmed that both $g(0)$ and $h(0)$ equal to 0. Thus, $f(x,y)$ is sparsity-inducing because of the linearity of function components. A similar sparsity-inducing test could be conducted on the $y$ value.

Once a sparsity-inducing function is identified, MatRel’s planner orchestrates the execution of join operations by leveraging the sparse matrix computations. For example, if the merge function $f(x,y)$ boasts the sparsity-inducing property on the $x$ component, MatRel only retrieves the nonzero entries from the left-hand side, and joins with entries from the right-hand side. All the computations are conducted in the sparse matrix format without converting to dense matrix format.

Bloom-join on Entries. For a join predicate that involves a dimension on the matrix, it is efficient for MatRel to locate the corresponding row matrix blocks or column matrix blocks. The irrelevant matrix blocks are not accessed during evaluation. However, the join operation becomes expensive when the join predicate contains a comparison on the entries. A straw man plan compares each pair of entries exhaustively, where a lot of computation is wasted on the mismatched entries.

MatRel adopts a Bloom-join strategy when a join predicate contains matrix entries. Each worker computes a Bloom filter on the entries. Due to the existence of sparse matrices, a worker needs to determine whether to store 0’s in the Bloom filter. Thanks to the heuristic of identifying zero-preserving merge functions, 0 values are not inserted into the Bloom filter if $f(x,y)$ is sparsity-inducing. After each matrix block creates its own Bloom filter, a worker picks an entry from one of the input matrix and consults the Bloom filter on the other input. If no match is detected from the Bloom filter, the join execution continues to the next entry; otherwise, a nested-loop join is conducted on the input matrices to generate the join output.

Cost Model for Communications. MatRel extends MatFast’s (Yu et al., 2017) matrix data partitioner, and it naturally supports three matrix data partitioning schemes: Row (“ $r$ ”), Column (“ $c$ ”), and Broadcast (“ $b$ ”). The Broadcast scheme is only used for sharing a matrix of low dimensions, e.g., a single vector. Different matrix data partitioning schemes lead to significantly different communication overhead for join executions. Therefore, we introduce a cost model to evaluate the communication costs of various partitioning schemes for join operations on matrix data.

We design a communication cost model based on the join predicates. For cross-product, we have

\displaystyle C_{comm}(\boldsymbol{\mathrm{A}}\otimes\boldsymbol{\mathrm{B}})=\begin{cases}\hfil 0,&\text{if }s_{A/B}=b.\\ (N-1)\min\{|\boldsymbol{\mathrm{A}}|,|\boldsymbol{\mathrm{B}}|\},&\text{otherwise}.\end{cases}

where $s_{A}$ and $s_{B}$ are the partitioning scheme of input matrix $\boldsymbol{\mathrm{A}}$ and $\boldsymbol{\mathrm{B}}$ , and $N$ is the number of workers in the cluster. $|\boldsymbol{\mathrm{A}}|$ refers to the size of the Matrix $\boldsymbol{\mathrm{A}}$ , i.e., $|\boldsymbol{\mathrm{A}}|=mn$ if $\boldsymbol{\mathrm{A}}$ is an $m$ -by- $n$ dense matrix; and it means nnz $(\boldsymbol{\mathrm{A}})$ if $\boldsymbol{\mathrm{A}}$ is sparse. The communication cost is 0 if one of the inputs has been broadcast to every worker in the cluster. This only applies to the case when either $\boldsymbol{\mathrm{A}}$ or $\boldsymbol{\mathrm{B}}$ is a matrix of tiny dimensions. When $\boldsymbol{\mathrm{A}}$ and $\boldsymbol{\mathrm{B}}$ are partitioned in Row/Column scheme, each partition of the smaller matrix has to be mapped to every worker. Therefore, it incurs a communication cost of $(N-1)\min\{|\boldsymbol{\mathrm{A}}|,|\boldsymbol{\mathrm{B}}|\}$ .

For direct overlay, we have the following cost function,

\displaystyle C_{comm}(\boldsymbol{\mathrm{A}}\Join_{\gamma,f}\boldsymbol{\mathrm{B}})=\begin{cases}\frac{N-1}{N}\min\{|\boldsymbol{\mathrm{A}}|,|\boldsymbol{\mathrm{B}}|\},&\text{if }(s_{A},s_{B})=(r,c)\text{ or }(c,r).\\ \hfil 0,&\text{otherwise}.\end{cases}

It is clear that the direct overlay operation induces 0 communication overhead if one of the inputs is broadcast to all the workers in the cluster. Furthermore, there is no communication when both inputs are partitioned using the same scheme. Each worker receives matrix blocks with the same row/column block IDs from both inputs, and the join predicates are naturally satisfied. The communication overhead is only incurred when one of operands is partitioned in Row scheme and the other is in Column scheme. To mitigate this partitioning scheme incompatibility, MatRel repartitions the smaller matrix using the same partitioning scheme from the larger matrix.

For transpose overlay, we have a similar cost function,

\displaystyle C_{comm}(\boldsymbol{\mathrm{A}}\Join_{\gamma,f}\boldsymbol{\mathrm{B}})=\begin{cases}\frac{N-1}{N}\min\{|\boldsymbol{\mathrm{A}}|,|\boldsymbol{\mathrm{B}}|\},&\text{if }(s_{A},s_{B})=(r,r)\text{ or }(c,c).\\ \hfil 0,&\text{otherwise}.\end{cases}

The cost model for a transpose overlay is very similar to that of the direct overlay case. The difference is that the communication overhead is introduced when two matrices are partitioned using the same scheme.

Table 1 summarizes the communication costs for joins on a single dimension. If any input matrix is broadcast to all the workers, there is no communication cost. We omit this case in Table 1. Let us focus on the case when $\gamma=``RID_{A}=RID_{B}"$ and $(s_{A},s_{B})=(r,c)$ . Figures 5 illustrates that Matrix $\boldsymbol{\mathrm{A}}$ is partitioned in Row scheme and Matrix $\boldsymbol{\mathrm{B}}$ is partitioned in Column scheme. Suppose we have 3 workers in the cluster, i.e., $N=3$ . Each square in the matrix is a block partition. The blocks with the same color are stored on the same worker. The join predicate requires that the entries sharing the same $RID$ ’s are joined together. There are two possible execution strategies based on the partitioning schemes of $\boldsymbol{\mathrm{A}}$ and $\boldsymbol{\mathrm{B}}$ . Strategy I sends the blocks with the same $RID$ from $\boldsymbol{\mathrm{A}}$ to the workers which hold the joining blocks from $\boldsymbol{\mathrm{B}}$ . For example, Worker $W_{1}$ sends all the blocks to Worker $W_{2}$ and $W_{3}$ , since both $W_{2}$ and $W_{3}$ hold some joining blocks from $\boldsymbol{\mathrm{B}}$ . Both $W_{2}$ and $W_{3}$ send the blocks in a similar manner. Each worker needs to send the blocks $(N\!-1)$ times to all the other workers. Thus, this strategy induces $(N\!-1)|\boldsymbol{\mathrm{A}}|$ communication cost. Strategy II adopts a different policy by sending different blocks from $\boldsymbol{\mathrm{B}}$ to the same worker that holds the blocks with the same $RID$ from $\boldsymbol{\mathrm{A}}$ . For example, the blocks with diagonal lines from $W_{2}$ and $W_{3}$ are sent to $W_{1}$ to satisfy the join predicate. As illustrated in Figure 5, all the blocks with diagonal lines introduce a communication cost of $\frac{N-1}{N}|\boldsymbol{\mathrm{B}}|$ . Thus, the best communication cost is $\min\{(N-1)|\boldsymbol{\mathrm{A}}|,\frac{N-1}{N}|\boldsymbol{\mathrm{B}}|\}$ when $\gamma=``RID_{A}=RID_{B}"$ and $(s_{A},s_{B})=(r,c)$ . The remaining entries in Table 1 can be computed with similar strategies. Notice, the diagonal of Table 1 contains all 0’s. This is because the partitioning schemes of both matrices happen to match the join predicate.

	$(s_{A},s_{B})$
$\gamma$	(r, r)	(r, c)	(c, r)	(c, c)
$RID_{A}=RID_{B}$	0	$\min\{(N\!-1)\|\boldsymbol{\mathrm{A}}\|,\frac{N-1}{N}\|\boldsymbol{\mathrm{B}}\|\}$	$\min\{\frac{N-1}{N}\|\boldsymbol{\mathrm{A}}\|,(N\!-1)\|\boldsymbol{\mathrm{B}}\|\}$	$(N\!-1)\min\{\|\boldsymbol{\mathrm{A}}\|,\|\boldsymbol{\mathrm{B}}\|\}$
$RID_{A}=CID_{B}$	$\min\{(N\!-1)\|\boldsymbol{\mathrm{A}}\|,\frac{N-1}{N}\|\boldsymbol{\mathrm{B}}\|\}$	0	$(N-1)\min\{\|\boldsymbol{\mathrm{A}}\|,\|\boldsymbol{\mathrm{B}}\|\}$	$\min\{\frac{N-1}{N}\|\boldsymbol{\mathrm{A}}\|,(N\!-1)\|\boldsymbol{\mathrm{B}}\|\}$
$CID_{A}=RID_{B}$	$\min\{\frac{N-1}{N}\|\boldsymbol{\mathrm{A}}\|,(N\!-1)\|\boldsymbol{\mathrm{B}}\|\}$	$(N-1)\min\{\|\boldsymbol{\mathrm{A}}\|,\|\boldsymbol{\mathrm{B}}\|\}$	0	$\min\{(N\!-1)\|\boldsymbol{\mathrm{A}}\|,\frac{N-1}{N}\|\boldsymbol{\mathrm{B}}\|\}$
$CID_{A}=CID_{B}$	$(N-1)\min\{\|\boldsymbol{\mathrm{A}}\|,\|\boldsymbol{\mathrm{B}}\|\}$	$\min\{\frac{N-1}{N}\|\boldsymbol{\mathrm{A}}\|,(N\!-1)\|\boldsymbol{\mathrm{B}}\|\}$	$\min\{(N\!-1)\|\boldsymbol{\mathrm{A}}\|,\frac{N-1}{N}\|\boldsymbol{\mathrm{B}}\|\}$	0

Table 1. Communication cost of different D2D join predicates.

A join on entries has an identical cost function as a cross-product. It is clear that the join execution incurs 0 communication cost if either $\boldsymbol{\mathrm{A}}$ or $\boldsymbol{\mathrm{B}}$ is partitioned in Broadcast scheme. For any other combination of partitioning schemes of $\boldsymbol{\mathrm{A}}$ and $\boldsymbol{\mathrm{B}}$ , a worker needs to broadcast the smaller of the inputs to all the other workers. Hence, the communication cost is $(N-1)\min\{|\boldsymbol{\mathrm{A}}|,|\boldsymbol{\mathrm{B}}|\}$ .

For a join on a single dimension and entries, Table 2 gives the communication costs for the various schemes. We use $\eta$ to denote the selectivity of entries that could match the row/column dimensions from the other matrix. Let us examine the cost function for $\gamma=``RID_{A}=val_{B}"$ . When $(s_{A},s_{B})=(r,r)$ , there exist two strategies to evaluate the join. Strategy I requires each worker to send its own matrix blocks of $\boldsymbol{\mathrm{A}}$ to the other $(N-1)$ workers, since any worker may hold a block from $\boldsymbol{\mathrm{B}}$ that has matched entries with the row dimension. Broadcasting $\boldsymbol{\mathrm{A}}$ incurs a communication cost of $(N-1)|\boldsymbol{\mathrm{A}}|$ . Strategy II sends the matched entries from $\boldsymbol{\mathrm{B}}$ to each corresponding worker that holds the blocks of $\boldsymbol{\mathrm{A}}$ . The selectivity indicates a total amount of $\eta_{B}|\boldsymbol{\mathrm{B}}|$ of matrix data is transferred. Thus, the communication cost would be $\min\{(N-1)|\boldsymbol{\mathrm{A}}|,\eta_{B}|\boldsymbol{\mathrm{B}}|\}$ . For $(s_{A},s_{B})=(c,r)$ , the only difference is that the selected matching entries from $\boldsymbol{\mathrm{B}}$ are sent to all the workers in the cluster, since $\boldsymbol{\mathrm{A}}$ is partitioned in Column scheme. The remaining entries in Table 2 can be computed in a similar manner.

	$(s_{A},s_{B})$
$\gamma$	(r, r)	(r, c)	(c, r)	(c, c)
$RID_{A}=val_{B}$	$\min\{(N\!-1)\|\boldsymbol{\mathrm{A}}\|,\eta_{B}\|\boldsymbol{\mathrm{B}}\|\}$	$\min\{(N\!-1)\|\boldsymbol{\mathrm{A}}\|,\eta_{B}\|\boldsymbol{\mathrm{B}}\|\}$	$\min\{(N\!-1)\|\boldsymbol{\mathrm{A}}\|,N\eta_{B}\|\boldsymbol{\mathrm{B}}\|\}$	$\min\{(N\!-1)\|\boldsymbol{\mathrm{A}}\|,N\eta_{B}\|\boldsymbol{\mathrm{B}}\|\}$
$CID_{A}=val_{B}$	$\min\{(N\!-1)\|\boldsymbol{\mathrm{A}}\|,N\eta_{B}\|\boldsymbol{\mathrm{B}}\|\}$	$\min\{(N\!-1)\|\boldsymbol{\mathrm{A}}\|,N\eta_{B}\|\boldsymbol{\mathrm{B}}\|\}$	$\min\{(N\!-1)\|\boldsymbol{\mathrm{A}}\|,\eta_{B}\|\boldsymbol{\mathrm{B}}\|\}$	$\min\{(N\!-1)\|\boldsymbol{\mathrm{A}}\|,\eta_{B}\|\boldsymbol{\mathrm{B}}\|\}$
$val_{A}=RID_{B}$	$\min\{\eta_{A}\|\boldsymbol{\mathrm{A}}\|,(N\!-1)\|\boldsymbol{\mathrm{B}}\|\}$	$\min\{N\eta_{A}\|\boldsymbol{\mathrm{A}}\|,(N\!-1)\|\boldsymbol{\mathrm{B}}\|\}$	$\min\{\eta_{A}\|\boldsymbol{\mathrm{A}}\|,(N\!-1)\|\boldsymbol{\mathrm{B}}\|\}$	$\min\{N\eta_{A}\|\boldsymbol{\mathrm{A}}\|,(N\!-1)\|\boldsymbol{\mathrm{B}}\|\}$
$val_{A}=CID_{B}$	$\min\{N\eta_{A}\|\boldsymbol{\mathrm{A}}\|,(N\!-1)\|\boldsymbol{\mathrm{B}}\|\}$	$\min\{\eta_{A}\|\boldsymbol{\mathrm{A}}\|,(N\!-1)\|\boldsymbol{\mathrm{B}}\|\}$	$\min\{N\eta_{A}\|\boldsymbol{\mathrm{A}}\|,(N\!-1)\|\boldsymbol{\mathrm{B}}\|\}$	$\min\{\eta_{A}\|\boldsymbol{\mathrm{A}}\|,(N\!-1)\|\boldsymbol{\mathrm{B}}\|\}$

Table 2. Communication cost of different D2V and V2D join predicates.

Algorithm for Partitioning Scheme Assignment of Joins. Before we delve into the algorithm of partitioning scheme assignment for joins, we first discuss the cost function for converting distributed matrix data from one partitioning scheme to another. Table 3 illustrates the conversion costs, where $s_{A}$ is the partition scheme of an input, and $s^{\prime}_{A}$ is the scheme for an output. $\xi$ denotes the case when the input data is randomly partitioned among all the workers in the cluster, e.g., round-robin. It introduces the communication cost of $|\boldsymbol{\mathrm{A}}|$ when re-partitioning Matrix $\boldsymbol{\mathrm{A}}$ from round-robin to Row/Column scheme. Broadcast scheme is more expensive and it costs $N|\boldsymbol{\mathrm{A}}|$ , where each worker has a copy of all the matrix data.

	$s^{\prime}_{A}$
$s_{A}$	$r$	$c$	$b$
$r$	0	$\frac{N-1}{N}$ \| $\boldsymbol{\mathrm{A}}$ \|	$(N\!-1)\|\boldsymbol{\mathrm{A}}\|$
$c$	$\frac{N-1}{N}\|\boldsymbol{\mathrm{A}}\|$	0	$(N\!-1)\|\boldsymbol{\mathrm{A}}\|$
$b$	0	0	0
$\xi$	$\|\boldsymbol{\mathrm{A}}\|$	$\|\boldsymbol{\mathrm{A}}\|$	$N\|\boldsymbol{\mathrm{A}}\|$

Table 3. Communication cost of converting partition schemes.

Given a join operation and input matrices $\boldsymbol{\mathrm{A}}$ and $\boldsymbol{\mathrm{B}}$ , MatRel’s data partitioner computes the best partition schemes for $\boldsymbol{\mathrm{A}}$ and $\boldsymbol{\mathrm{B}}$ by optimizing the following,

\displaystyle(s^{\prime}_{A},s^{\prime}_{B})\leftarrow\operatorname*{arg\,min}\limits_{(s^{\prime}_{A},s^{\prime}_{B})}\{C_{comm}(\boldsymbol{\mathrm{A}}\Join_{\gamma,f}\boldsymbol{\mathrm{B}},s^{\prime}_{A},s^{\prime}_{B})+C_{vt}(\boldsymbol{\mathrm{A}},s_{A}\rightarrow s^{\prime}_{A})+C_{vt}(\boldsymbol{\mathrm{B}},s_{B}\rightarrow s^{\prime}_{B})\}.

Function $C_{comm}(\boldsymbol{\mathrm{A}}\Join_{\gamma,f}\boldsymbol{\mathrm{B}},s^{\prime}_{A},s^{\prime}_{B})$ computes the communication cost of the join according to our cost model when $\boldsymbol{\mathrm{A}}$ is partitioned in Scheme $s_{A}$ and $\boldsymbol{\mathrm{B}}$ in Scheme $s_{B}$ . Function $C_{vt}(\boldsymbol{\mathrm{A}},s_{A}\rightarrow s^{\prime}_{A})$ computes the communication cost when converting $\boldsymbol{\mathrm{A}}$ from Scheme $s_{A}$ to $s^{\prime}_{A}$ . Essentially, the data partitioner adopts a grid-search strategy among all the possible combinations of partition schemes on the input matrices, and outputs the cheapest schemes for $\boldsymbol{\mathrm{A}}$ and $\boldsymbol{\mathrm{B}}$ .

5. System Implementation

After generating the query execution plan and the matrix data partitioning schemes, each worker conducts matrix and relational computations locally. MatRel leverages block matrices as a basic unit for storing and manipulating matrix data in the distributed memory. We discuss briefly MatRel’s implementation on top of Spark SQL.

5.1. Physical Storage of a Local Matrix/Tensor

MatRel partitions a matrix into smaller square blocks to store and manipulate matrix data in the distributed memory. The smaller blocks can better utilize the spatial locality of the nearby entries in a matrix. A matrix block is the basic unit for storage and computation. Figure 6(a) illustrates that Matrix $\boldsymbol{\mathrm{A}}$ is partitioned into blocks of size $3\times 3$ , where each block is stored on a worker locally. For simplicity, we only consider square blocks. To fully exploit modern CPU cache size, MatRel sets block size to be 1000.

Every matrix block consists of two parts; a block ID and matrix data. A block ID is an ordered pair, i.e., (row-block-ID, column-block-ID). The matrix data field is a quadruple, $\langle$ matrix format, number of rows, number of columns, data storage $\rangle$ . A local matrix block supports both dense and sparse matrix storage formats. For the sparse format, the nonzero entries are stored in Compressed Sparse Column (CSC), and Compressed Sparse Row (CSR) formats. More detail about local matrices is discussed in (Yu et al., 2017).

A join may produce a normal matrix or a higher-dimensional tensor, e.g., a 3rd-order tensor for joins on a single dimension. MatRel utilizes block matrices to manipulate higher-dimensional tensors as well. Given a 3rd-order tensor, the schema of the tensor is represented as (D1, D2, D3, val), where D1, D2, and D3 denote the three dimensions. To leverage the block matrix storage, MatRel extends the block ID component to higher dimensions, e.g., (D1, D2-block-ID, D3-block-ID), where D1 records the exact dimension value, and the rest record the matrix block IDs for the other dimensions. Figure 6(b) illustrates the storage layout of a 3rd-order tensor in terms of matrix blocks. There is freedom in choosing which dimension of the tensor to serve as D1, D2, or D3. Usually there exits an aggregation on a certain dimension following the join. A heuristic is to choose a non-aggregated dimension as D1 in a physical block storage. It is beneficial that each worker only needs to execute the aggregation locally without further communication. Figure 6(b) demonstrates two possible tensor layouts on dimension D1 and D3, respectively. If a subsequent aggregation is performed on Dimensions D2 or D3, MatRel stores the join result in block matrices with respect to Dimension D1.

5.2. System Design and Implementation

MatRel is implemented as a library in Apache Spark. It extends Spark SQL and its Catalyst optimizer to work seamlessly with matrix/tensor data. We provide a Scala API for conducting relational query processing on distributed matrix data. It uses a DataFrame (Armbrust et al., 2015) to represent distributed matrix blocks. Each row in a DataFrame has the schema (RowBlkID, ColBlkID, Matrix), where both IDs take a long integer type, and Matrix is a user-defined type that encodes the quadruple in Figure 6(a). For logical optimization, we extend the Catalyst optimizer with all the transformation rules for optimizing the relational operations on matrix data. Once an optimized logical plan is produced, MatRel generates one or more physical plans. By leveraging the cost model for communication, the matrix data partitioner selects the partitioning schemes for input matrices that incurs minimum communication overhead. Finally, the optimized physical plan is realized by RDD’s transformation operations, e.g., map, flatMap, zipPartitions, and reduceByKey, etc. The RDD partitioner class is extended with three partitioning schemes for distributed matrices, i.e., Row, Column, and Broadcast. Spark’s fault tolerance mechanism applies naturally to MatRel. In case of node failure, the lost node is evicted and a standby node is chosen to recover the lost one. An open-source version of MatRel is available on GitHub²²2https://github.com/purduedb/MatRel/tree/spark-2.1-dev.

6. Experimental Results

We study the performance of the optimized execution plans by conducting different relational operations on matrix data from various ML applications. The performance is measured by the average execution time.

6.1. Experiment Setup

The experiments are conducted on an HP DL360G9 cluster with Intel Xeon E5-2660 realized over 6 nodes. The cluster uses Cloudera 5.9 consisting of Spark 2.1 as a computational framework and Hadoop HDFS as a distributed file system. Each node has 16 cores, 32 GB of RAM, and 400 GB of local storage. The total HDFS size is 1 Terabyte. Spark was configured with 5 executors, 16 cores/executor, 16 GB driver memory, and 24 GB executor memory.

6.2. Comparison Across Different Platforms

Platforms Tested. The platforms we evaluated are:

(1)

MatRel. This is our implementation on Spark 2.1. All computations are written in Scala using the extended DataFrame API. The experiments are conducted on MatRel and MatRel(w/o-opt), where the optimizations are turned on and off respectively.
(2)

Spark MLlib. This is the built-in Spark library, mllib.linalg. This is run on Spark V2.1 in the cluster mode. All computations are written in Scala.
(3)

SystemML. This is SystemML V0.13.1, which provides the option to run on top of Apache Spark. All computations are written in SystemML’s DML programming language.
(4)

SciDB. This is SciDB V15.12.1. All computations are written in SciDB’s AQL language that is similar to SQL. Dense matrix computations are conducted with gemm() API, and sparse matrix computations are with spgemm() API.

6.3. Datasets

Our experiments are performed on both real-world and synthetic datasets. The three real-world datasets are: soc-pokec³³3https://snap.stanford.edu/data/, cit-Patents ${}^{\ref{snap}}$ , and LiveJournal ${}^{\ref{snap}}$ . All these datasets are sparse matrices and their statistics are shown in Table 4. Furthermore, we generate extra smaller datasets, e.g., u100k and d15k. The first letter denotes the matrix type, i.e., $``\texttt{u}"$ for a sparse matrix with a uniform distribution of the nonzero entries, and $``\texttt{d}"$ for a dense matrix. The trailing number indicates the dimension of the square matrix. For example, u100k is a 100K-by-100K sparse matrix, where nonzero entries are uniformly distributed. d15k is a 15K-by-15K dense matrix.

Graph	#nodes	#edges
soc-pokec	1,632,803	30,622,564
cit-Patents	3,774,768	16,518,978
LiveJournal	4,847,571	68,993,773

Table 4. Statistics of the social network datasets.

6.4. Performance Evaluation on Different Operators

In our experiments, we conduct several representative computations. We manually kill the job if it fails to finish within 1 hour, or 3600s.

Aggregation on a Gram Matrix. A Gram matrix is computed as the inner-products of a set of vectors. It is commonly used in ML applications to compute the kernel functions and covariance matrices. Given a matrix $\boldsymbol{\mathrm{X}}$ to store the input vectors, the Gram matrix can be computed as $\boldsymbol{\mathrm{G}}=\boldsymbol{\mathrm{X}}^{T}\times\boldsymbol{\mathrm{X}}$ . We consider two different aggregations on the Gram matrix, i.e., summation along the row dimension $\Gamma_{\texttt{sum},r}(\boldsymbol{\mathrm{G}})$ , and trace computation $\Gamma_{\texttt{sum},d}(\boldsymbol{\mathrm{G}})$ . Code 1 illustrates how to compute $\Gamma_{\texttt{sum},d}(\boldsymbol{\mathrm{G}})$ in MatRel.

Figure 7(a) illustrates the execution time for different systems when performing $\Gamma_{\texttt{sum},r}(\boldsymbol{\mathrm{X}}^{T}\times\boldsymbol{\mathrm{X}})$ . For all the sparse matrices, Spark MLlib throws out-of-memory (OOM) exceptions due to the fact that MLlib converts a sparse matrix to the dense format for matrix-matrix multiplications. Even for the smallest 100K-by-100K sparse matrix with the sparsity of $10^{-5}$ , it requires about 160 GB memory to store the dense matrices, which exceeds the hardware limit of our cluster. Without pushing the aggregation below the matrix-matrix multiplications, SciDB runs over 1 hour to compute the product of two matrices before the aggregation. On the other hand, both MatRel and SystemML adopt the similar rewrite rule to push sum aggregation below the matrix-matrix multiplications. MatRel and SystemML spend 120s and 680s for the computation on LiveJournal dataset, respectively. The extra performance gain for MatRel comes from the fact that MatRel is implemented on Spark Dataset API while SystemML is run on RDDs directly. A Dataset makes extra effort for efficient data compression, serialization, and de-serialization. For u100k dataset, SciDB spends about 800s to compute the sparse matrix multiplication, since SciDB is not optimized for sparse matrix computations. For d15k dataset, all systems finish the job within the time budget.

Figure 7(b) shows the execution time for various systems when performing $\Gamma_{\texttt{sum},d}(\boldsymbol{\mathrm{X}}^{T}\times\boldsymbol{\mathrm{X}})$ . The trace computation leverages the rule to rewrite matrix-matrix multiplications in terms of matrix element-wise multiplications. MatRel’s optimizer detects the two inputs are table scans on the same matrix after query rewrite. It generates the code to compute the matrix element-wise multiplication on a single matrix without duplicate table scans. For the LiveJournal dataset, it takes 90s for MatRel and 525s for SystemML to complete the query execution. Other systems spend similar amount of time as the previous query since they fail to leverage the more efficient query rewrite rule.

Selection over Matrix Operations. Least squares linear regression (LR) is a popular ML model for classification (Hastie et al., 2001). The input is a feature matrix $\boldsymbol{\mathrm{X}}$ and a label vector $\boldsymbol{\mathrm{y}}$ . Each label $y_{i}$ can be viewed as a linear combination of the feature vector $\boldsymbol{\mathrm{x}}_{i}^{T}$ , i.e., $y_{i}\approx\boldsymbol{\mathrm{x}}_{i}^{T}\times\boldsymbol{\mathrm{b}}+\varepsilon_{i}$ , where $\boldsymbol{\mathrm{b}}$ is the vector of regression coefficients, and $\varepsilon_{i}$ is the error term. The most common estimator for $\boldsymbol{\mathrm{b}}$ is the least squares estimator $\hat{\boldsymbol{\mathrm{b}}}=(\boldsymbol{\mathrm{X}}^{T}\times\boldsymbol{\mathrm{X}})^{-1}\times\boldsymbol{\mathrm{X}}^{T}\times\boldsymbol{\mathrm{y}}$ . Code 2 demonstrates how to conduct selection on a Gram matrix and least squares linear regression in MatRel.

Figure 8(a) illustrates the execution time for selecting a row of $\hat{\boldsymbol{\mathrm{b}}}$ . For an efficient evaluation of the coefficient vector $\hat{\boldsymbol{\mathrm{b}}}$ , a good execution plan should compute $\boldsymbol{\mathrm{X}}^{T}\times\boldsymbol{\mathrm{y}}$ before multiplying with $(\boldsymbol{\mathrm{X}}^{T}\times\boldsymbol{\mathrm{X}})^{-1}$ , since multiplying $\boldsymbol{\mathrm{X}}^{T}$ with $\boldsymbol{\mathrm{y}}$ results in a lower dimension matrix. MatRel adopts this optimized evaluation plan. It also selects a single row on $(\boldsymbol{\mathrm{X}}^{T}\times\boldsymbol{\mathrm{X}})^{-1}$ . Spark MLlib throws OOM exceptions for all the sparse matrices due to the inefficient implementation on matrix multiplications. SystemML adopts a similar strategy for selection pushdown. However, the implementation on RDD prohibits SystemML from obtaining the better performance achieved by MatRel that implements the operations on Datasets. SciDB does not generate an efficient execution plan for LR computation, and performs the selection only after the complete evaluation of $\hat{\boldsymbol{\mathrm{b}}}$ . Figure 8(b) gives the different effects of optimizations that MatRel has adopted. Especially, we manually turn off the optimizations of order selection on matrix chain multiplications, and relational selection pushdown below matrix multiplications, e.g., MatRel (matrix order) means turning on the optimization of order selection on matrix chain multiplications only.

Figure 9(a) gives the execution time for evaluating a selection on a matrix entry of the Gram matrix. All the systems conduct Gram matrix computation in the same manner. With selection pushdown, MatRel and SystemML are able to finish the computation in 39s and 254s for the cit-Patents dataset. When selecting $(i,j)$ -th entry on $\boldsymbol{\mathrm{X}}^{T}\times\boldsymbol{\mathrm{X}}$ , MatRel’s optimizer rewrites the query plan to $(\sigma_{CID=i}(\boldsymbol{\mathrm{X}}))^{T}\times\sigma_{CID=j}(\boldsymbol{\mathrm{X}})$ . The rewritten query plan avoids evaluation of a matrix transpose by a vector transpose. SciDB cannot finish the evaluation of matrix multiplication $\boldsymbol{\mathrm{X}}^{T}\times\boldsymbol{\mathrm{X}}$ in 3600s. Figure 9(b) illustrates the effect of selection pushdown below matrix multiplications on MatRel.

Cross-product. The Kronecker product is a generalization of outer-product from vectors to matrices. It is widely used in tensor decomposition and applications (Kolda and Bader, 2009). Formally, given an $m$ -by- $n$ matrix $\boldsymbol{\mathrm{A}}$ and a $p$ -by- $q$ matrix $\boldsymbol{\mathrm{B}}$ , the Kronecker product $\boldsymbol{\mathrm{A}}\otimes\boldsymbol{\mathrm{B}}$ is the $mp$ -by- $nq$ block matrix,

\boldsymbol{\mathrm{A}}\otimes\boldsymbol{\mathrm{B}}=\begin{bmatrix}a_{11}*\boldsymbol{\mathrm{B}}&\dots&a_{1n}*\boldsymbol{\mathrm{B}}\\ \vdots&\ddots&\vdots\\ a_{m1}*\boldsymbol{\mathrm{B}}&\dots&a_{mn}*\boldsymbol{\mathrm{B}}\end{bmatrix}.

The Kronecker product is essentially the cross-product between matrix $\boldsymbol{\mathrm{A}}$ and $\boldsymbol{\mathrm{B}}$ with the merge function of $f(x,y)=x*y$ . We compare MatRel and SciDB for the performance of the Kronecker product computation, since all the other systems do not support joins. Code 3 illustrates how to compute Kronecker product in MatRel. The Kronecker product is an expensive operation that consumes lots of resources. Therefore, we generate another 25K-by-25K sparse matrix u25k, where nonzero entries are uniformly distributed with sparsity of $10^{-6}$ . Table 6.4 demonstrates the execution time for different systems on various cross-product tasks. Both MatRel(w/o-opt) and SciDB first conduct the cross-product computation on each pair of matching entries. Then, they evaluate the merge function on the matched entries. On the other hand, MatRel’s optimizer identifies that merge function $f(x,y)$ has the zero-preserving property. Thus, MatRel only computes the cross-product using the nonzero entries from the sparse input. MatRel spends about 294s to finish the evaluation of the Kronecker product between a dense matrix and a sparse one, while MatRel(w/o-opt) and SciDB cannot finish the job within 1 hour. MatRel spends only 44s for computing the Kronecker product between two sparse matrices. When both inputs are dense matrices, no optimizations can be applied, and no system can finish the job successfully. There are totally $15^{4}\times 10^{12}$ entries when computing the Kronecker product between two 15K-by-15K dense matrices, which costs about $4\times 10^{5}$ TB. This is far beyond the total amount of memory of our cluster, or even the disk space. Thus, MatRel throws OOM exceptions after 5 minutes, and SciDB throws no space left on device (NSLOD) errors after 3 hours.

Dataset	MatRel	MatRel(w/o)	SciDB
u25k $\otimes$ d15k	294s	> 1h	> 1h
u25k $\otimes$ u25k	44s	> 1h	> 1h
d15k $\otimes$ u25k	312s	> 1h	> 1h
d15k $\otimes$ d15k	OOM	OOM	NSLOD

Table 5. The Kronecker product on different systems

Join on Dimensions. Many real-world applications rely on joins on dimensions of big matrices, e.g., raster data overlay analysis. We examine two different join predicates: Equi-join on two dimensions, and equi-join on a single dimension. Among all the systems we compare with, only SciDB provides similar functionality with join() and cross_join(). The join() operator combines the entires of two input matrices at matching dimension values. The cross_join() operator computes the cross-product of the two input matrices, and applies equality predicates to pairs of dimensions.

We conduct experiments on two different types of joins: Direct overlay and transpose overlay. The join() operator implements direct overlay exactly. The cross_join() operator is leveraged to evaluate the transpose overlay. Code 4 demonstrates how to compute direct overlay in MatRel. Figure 10(a) illustrates the execution time of MatRel and SciDB on various combinations of inputs. d30k and d50k are two dense square matrices with dimensions 30K and 50K, respectively. Both u100k and u200k are sparse matrices with the sparsity of $10^{-5}$ . Thanks to the Multidimensional Array Clustering (MAC) technique, SciDB stores matrix partitions in a way that the matrix chunks are aligned along the dimensions. This data layout is preferred for evaluating a direct overlay as the join can be performed locally on the matched chunks. On the other hand, MatRel does not make any assumptions on data locations when loading data from HDFS. MatRel needs to shuffle the matrix blocks to fulfill the requirement of Row/Column partitioners for join executions. Figure 10(a) shows a detailed decomposition of the execution time of MatRel. The execution time is divided into two parts; the matrix data loading time and query execution time. For SciDB, there is only execution time since the data is already partitioned. The actual execution time of MatRel is similar to that of SciDB. The MatRel(w/o-opt) uses the default hash partitioner of Spark that has a similar performance to that of MatRel for direct overlay queries.

Figure 10(b) gives the performance comparison for the transpose overlay queries. For this type of query, SciDB no longer benefits from the MAC technique because the matrix chunks have to be transferred to different workers to facilitate the query execution. For example, SciDB spends about 51min to conduct the transpose overlay query when joining u100k and d30k. On the other hand, it takes about 44s for MatRel to finish the same query that shows a consistent performance as that of the direct overlay query. When testing on larger datasets, say u100k and d50k, SciDB spends about 172min to complete, while MatRel only spends 276s.

We further perform experiments for joins on a single dimension. Figure 11(a) illustrates the performance comparison for $\boldsymbol{\mathrm{A}}\Join_{RID_{A}=RID_{B},f}\boldsymbol{\mathrm{B}}$ , where $f(x,y)=x*y$ . We use different combinations of input matrices, e.g., $(\texttt{u25k},\texttt{d15k})$ . MatRel leverages the sparsity-inducing property of the merge function $f(x,y)$ . MatRel spends about 51s, and SciDB spends about 45min when computing the join on $\boldsymbol{\mathrm{A}}(\texttt{u25k})$ and $\boldsymbol{\mathrm{B}}(\texttt{d15k})$ . MatRel(w/o-opt) converts a sparse matrix to the dense format, and conducts the join on the dense block, spending about 60s. SciDB’s cross_join() implementation is sensitive to the order of arguments, and it incurs huge overhead when the inner join matrix is a dense one. When swapping matrix $\boldsymbol{\mathrm{A}}$ and $\boldsymbol{\mathrm{B}}$ , SciDB’s performance improves significantly, and it only takes about 279s to complete the join. MatRel(w/o-opt) spends about 1690s to finish the join, since it has to duplicate the outer dense matrix multiple times and converting a sparse matrix block to a dense one for query evaluation. MatRel’s optimizer detects both the dimensions and the sparsity of the input matrices, and shuffles the matrix with fewer nonzero entries to meet the join predicate with the other matrix for better performance.

When both inputs are sparse, MatRel and SciDB exhibit similar performance. It takes longer time for MatRel(w/o-opt) to evaluate the query as it does not leverage the sparsity of the matrices. When both inputs are dense matrices, no optimizations can be applied, and no system can finish the job successfully. For d25k dataset, each block size is 1K-by-1K and there are 225 blocks in each input. The join generates 1K new blocks for each input block. That results in about 1800 GB, which exceeds the total amount of available memory of our cluster, or even the free disk space. However, MatRel throws OOM exceptions after about 5 minutes, while SciDB throws NSLOD errors after 3 hours.

Figure 11(b) illustrates the performance comparison for the join predicate of $``CID_{A}=RID_{B}"$ . All the systems show similar trends as Figure 11(a). Figure 11(c) depicts the amounts of data shuffle for MatRel and MatRel(w/o-opt) when the join predicate is $``RID_{A}=RID_{B}"$ . According to our communication cost model, MatRel’s optimizer partitions both $\boldsymbol{\mathrm{A}}$ and $\boldsymbol{\mathrm{B}}$ using Row scheme. MatRel(w/o-opt) relies on Spark’s default hash partitioner to distribute the matrix among all the workers. For a load-balanced execution, we need to repartition the matrices to satisfy the join predicate. Figure 11(c) verifies our cost model since MatRel(w/o-opt) shuffles twice as much data as MatRel.

Join on Entries. For all the systems we’ve compared with, only MatRel supports the joins when the predicate involves matrix entries. Code 5 shows how to compute join on entries between two matrices in MatRel. We leverage the sparsity-inducing merge function $f(x,y)=x*y$ for this query. MatRel takes advantages of two optimizations for this kind of queries, i.e., identifying sparsity-inducing merge functions, and leveraging a Bloom-join on matrix blocks. Figure 11(d) demonstrates the performance of joins on matrix entries of MatRel when turning on and off the optimizations. MatRel(Bloom) leverages a Bloom filter for probing the entries of the inner matrix blocks without exploiting the sparsity of a sparse input. On the other hand, MatRel(sparsity) examines non-empty blocks in the inner matrix when conducting the join. In general, MatRel(sparsity) achieves a better performance than MatRel(Bloom) when there exists a sparse input. When both inputs are dense matrices, the optimization of sparsity-inducing merge function cannot apply. All the three variants of MatRel cannot finish the query within 1 hour.

Poisson Non-negative Matrix Factorization (PNMF). PNMF (Liu et al., 2010) is a popular model for dimension reduction, and it tries to approximate a sparse input matrix $\boldsymbol{\mathrm{A}}$ with two factor matrices $\boldsymbol{\mathrm{W}}$ and $\boldsymbol{\mathrm{H}}$ of low rank $k$ , typically 10 – 200. The computation steps are

	$\displaystyle\boldsymbol{\mathrm{W}}$	$\displaystyle\leftarrow\boldsymbol{\mathrm{W}}*((\boldsymbol{\mathrm{A}}/(\boldsymbol{\mathrm{W}}\times\boldsymbol{\mathrm{H}}))\times\boldsymbol{\mathrm{H}}^{T})/(\boldsymbol{\mathrm{E}}\times\boldsymbol{\mathrm{H}}^{T}),$
	$\displaystyle\boldsymbol{\mathrm{H}}$	$\displaystyle\leftarrow\boldsymbol{\mathrm{H}}*(\boldsymbol{\mathrm{W}}^{T}\times(\boldsymbol{\mathrm{A}}/(\boldsymbol{\mathrm{W}}\times\boldsymbol{\mathrm{H}})))/(\boldsymbol{\mathrm{W}}^{T}\times\boldsymbol{\mathrm{E}}),$

where $\boldsymbol{\mathrm{A}}_{m\times n}$ is the input sparse matrix, $\boldsymbol{\mathrm{W}}_{m\times k}$ and $\boldsymbol{\mathrm{H}}_{k\times n}$ are two factor matrices, and $\boldsymbol{\mathrm{E}}$ is an $m$ -by- $n$ matrix, where $E_{ij}=1$ . The updates for $\boldsymbol{\mathrm{W}}$ and $\boldsymbol{\mathrm{H}}$ continue until convergence. The objective function of PNMF is a combination of Euclidean distance and Kullback-Leibler divergence, i.e.,

f(\boldsymbol{\mathrm{W}},\boldsymbol{\mathrm{H}})=\sum_{i,j}(\boldsymbol{\mathrm{W}}\times\boldsymbol{\mathrm{H}})_{ij}-\sum_{i,j}(\boldsymbol{\mathrm{A}}*\log(\boldsymbol{\mathrm{W}}\times\boldsymbol{\mathrm{H}}))_{ij}

where $\log(\boldsymbol{\mathrm{X}})$ computes the logarithm of each entry in $\boldsymbol{\mathrm{X}}$ .

The sparsity of each generated sparse matrix is $10^{-3}$ . Table 6 shows the execution time per iteration of PNMF on different systems. MatRel performs the best, followed by SystemML. MLlib has a better performance than SciDB for dataset of small sizes. However, MLlib does not scale for larger datasets, since it does not adopts native matrix multiplications for sparse matrices.

Dataset	MatRel	MLlib	SystemML	SciDB
u25k	13s	61s	23s	62s
u50k	28s	372s	37s	260s
u100k	52s	2041s	63s	1092s
u200k	148s	OOM	160s	> 1h

Table 6. PNMF on different systems

Notice that there are lots of opportunities when the sparsity-inducing property could be leveraged, e.g., $\boldsymbol{\mathrm{A}}/(\boldsymbol{\mathrm{W}}\times\boldsymbol{\mathrm{H}})$ and $\boldsymbol{\mathrm{A}}*\log(\boldsymbol{\mathrm{W}}\times\boldsymbol{\mathrm{H}})$ . Matrix $\boldsymbol{\mathrm{A}}$ is sparse, and thus, it is unnecessary to evaluate the dense intermediate matrix of $\boldsymbol{\mathrm{W}}\times\boldsymbol{\mathrm{H}}$ . MatRel only computes the blocks of $\boldsymbol{\mathrm{W}}\times\boldsymbol{\mathrm{H}}$ that map to the corresponding locations of sparse blocks from $\boldsymbol{\mathrm{A}}$ . A similar argument also applies to $\boldsymbol{\mathrm{A}}*\log(\boldsymbol{\mathrm{W}}\times\boldsymbol{\mathrm{H}})$ . Furthermore, MatRel adopts the rule of aggregation pushdown below matrix multiplications when evaluating $\sum_{i,j}(\boldsymbol{\mathrm{W}}\times\boldsymbol{\mathrm{H}})_{ij}$ of the objective function. MatRel also interprets $\boldsymbol{\mathrm{E}}\times\boldsymbol{\mathrm{H}}^{T}$ into $\Gamma_{\texttt{sum},r}(\boldsymbol{\mathrm{H}})$ . Therefore, MatRel involves no matrix multiplications for evaluating the PNMF pipeline.

7. Related Work

We review related work of relational and matrix query processing and optimizations on matrix data.

Matrix query processing on matrices: There has been a long history of research on conducting efficient matrix computations in the high-performance computing community (HPC). Existing libraries provide efficient matrix operators, e.g., BLAS (bla, 2017) and LAPACK (lap, 2017) (and its distributed variant ScaLAPACK (Choi et al., 1992)). However, they lack support for sparse matrices, and are prone to machine failures.

Recently, many systems have been proposed to support efficient matrix computations using Hadoop (had, 2018) and Spark (Zaharia et al., 2012). These systems provide an easy-to-use interface on multiple matrix operations, e.g., HAMA (Seo et al., 2010), Mahout (mah, 2017), MLI (Sparks et al., 2013), MadLINQ (Qian et al., 2012), Cumulon (Huang et al., 2013), SystemML (Ghoting et al., 2011; Boehm et al., 2014; Elgohary et al., 2016; Boehm et al., 2016; Elgamal et al., 2017), DMac (Yu et al., 2015), and MatFast (Yu et al., 2017). Each system provides various optimization strategies for reducing memory consumption, computation, and communication overheads. For example, SystemML (Elgohary et al., 2016) adopts column encoding schemes and operations over compressed matrices to mitigate memory overhead when the original matrices are too large to fit in the available compute resources. MatFast (Yu et al., 2017) adopts a sampling-based approach to estimate the sparsity of matrix chain multiplications, and organizes distributed matrix partitions in a communication-efficient manner by leveraging matrix data dependencies. However, these systems focus on matrix-only operations on big matrix data. None of these systems provide full-fledged relational query processing on matrix data. MatRel is built on MatFast, and provides optimized relational query processing on the distributed matrix data. MATLANG (Brijder et al., 2018) examines matrix manipulation from the view of expressive power of database query languages. It confirms that the matrix operators provided by these systems are adequate for a wide range of applications.

Matrix query processing on relations: A lot of work consider the problem of providing ML over relational data for specific ML algorithms. The assumption is that the input dataset can be viewed as a join result on multiple tables. The so-called “factorized machine learning” tries to learn a model based on the de-normalized tables without performing the join operation explicitly. For instance, (Schleich et al., 2016) aims at optimizing the linear regression model over factorized join tables. (Kumar et al., 2015) shows how to learn a generalized linear model over factorized join tables. Furthermore, (Kumar et al., 2016) discusses the cases when it is safe to conduct feature selection by avoiding key-foreign key joins to obtain features from all base tables. More recently, Morpheus (Chen et al., 2017) is proposed to conduct general linear algebra operations over factorized tables for popular ML algorithms. However, these systems rely on the assumption that the input matrix is obtained from joins on multiple tables that is not always the case for real-world applications, e.g., recommender systems.

Relational query processing on matrices: The idea of building a universal array DBMS on multidimensional arrays for scientific and numerical applications has been explored for a long time. One of the most notable efforts is Rasdaman (Baumann et al., 1998).

Recently, there are several systems built from scratch with native support for linear algebra, e.g., SciDB (Brown, 2010; Soroush et al., 2011; Duggan et al., 2015) and TensorFlow (Abadi et al., 2016). TensorFlow has limited support for distributed computation. The user has to manually map computation and data to each worker, since Tensorflow does not offer automatic work assignment (Mehta et al., 2017). While Tensorflow is mainly designed for neural network models, it lacks well-defined relational operations on a matrix (tensor). SciDB supports the array data model, and adopts a share-nothing massively parallel processing (MPP) architecture. SciDB provides a rich set of relational operations on array data. However, it treats each array operator individually without tuning for a series of array operators.

The MADlib project (Hellerstein et al., 2012) supports analytics, including linear algebra functionality, on top of a database system. MADlib shows the potential for high-performance linear algebra on top of a relational database. Recent extensions on SimSQL (Luo et al., 2017) investigate the possibility of making a small set of changes to SQL for enabling a distributed, relational database engine to become a high-performance platform for distributed linear algebra. However, the extension does not cover the optimization on a series of mixed relational and matrix operations. MatRel explores the potential to optimize the query execution pipeline by pushing relational operators below matrix operators, and computes costs for the different plans. MatRel’s query optimizer also takes into account the optimized data layout of join operands for communication-efficient executions.

8. Conclusions

This paper presents MatRel, an in-memory system that enables scalable relational query processing on big matrix data in a distributed setup. MatRel supports common relational operations on big matrix data, e.g., relational selection, aggregation, join. MatRel’s query optimizer leverages rule-based heuristics to rewrite a query into an equivalent execution plan with lower computation costs. For relational joins, MatRel can leverage the sparsity-inducing property of the merge function and Bloom-join strategies for efficient executions. Furthermore, MatRel adopts a cost model to generate communication-efficient matrix data partitioning schemes for input matrices on various join predicates. The experimental study on various applications demonstrates that MatRel achieves up to two orders of magnitude performance gain compared to state-of-the-art systems.

9. Acknowledgements

Walid G. Aref acknowledges the support of the U.S. National Science Foundation under Grant Numbers III-1815796 and IIS-1910216.

References

(1)
bla (2017) 2017. BLAS. http://www.netlib.org/blas/.
lap (2017) 2017. LAPACK. http://www.netlib.org/lapack/.
mah (2017) 2017. Mahout. http://mahout.apache.org/.
had (2018) 2018. Hadoop. http://hadoop.apache.org/.
mll (2018) 2018. MLlib. http://spark.apache.org/mllib/.
r-l (2018) 2018. R. https://www.r-project.org/.
Abadi et al. (2016) Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A System for Large-scale Machine Learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (Savannah, GA, USA) (OSDI’16). USENIX Association, Berkeley, CA, USA, 265–283.
Armbrust et al. (2015) Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, and Matei Zaharia. 2015. Spark SQL: Relational Data Processing in Spark. In SIGMOD ’15 (Melbourne, Victoria, Australia). 1383–1394.
Baumann et al. (1998) P. Baumann, A. Dehmel, P. Furtado, R. Ritsch, and N. Widmann. 1998. The Multidimensional Database System RasDaMan. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data (Seattle, Washington, USA) (SIGMOD ’98). ACM, New York, NY, USA, 575–577.
Beel et al. (2016) Joeran Beel, Bela Gipp, Stefan Langer, and Corinna Breitinger. 2016. Research-paper Recommender Systems: A Literature Survey. Int. J. Digit. Libr. 17, 4 (Nov. 2016), 305–338.
Boehm et al. (2016) Matthias Boehm, Michael W. Dusenberry, Deron Eriksson, Alexandre V. Evfimievski, Faraz Makari Manshadi, Niketan Pansare, Berthold Reinwald, Frederick R. Reiss, Prithviraj Sen, Arvind C. Surve, and Shirish Tatikonda. 2016. SystemML: Declarative Machine Learning on Spark. Proc. VLDB Endow. 9, 13 (Sept. 2016), 1425–1436.
Boehm et al. (2014) Matthias Boehm, Shirish Tatikonda, Berthold Reinwald, Prithviraj Sen, Yuanyuan Tian, Douglas R. Burdick, and Shivakumar Vaithyanathan. 2014. Hybrid Parallelization Strategies for Large-scale Machine Learning in SystemML. Proc. VLDB Endow. 7, 7 (March 2014), 553–564.
Brijder et al. (2018) Robert Brijder, Floris Geerts, Jan Van den Bussche, and Timmy Weerwag. 2018. On the Expressive Power of Query Languages for Matrices. In 21st International Conference on Database Theory, ICDT 2018, March 26-29, 2018, Vienna, Austria. 10:1–10:17.
Brown (2010) Paul G. Brown. 2010. Overview of sciDB: Large Scale Array Storage, Processing and Analysis. In SIGMOD’10 (Indianapolis, Indiana, USA). ACM, New York, NY, USA, 963–968.
Chandola et al. (2009) Varun Chandola, Arindam Banerjee, and Vipin Kumar. 2009. Anomaly Detection: A Survey. ACM Comput. Surv. 41, 3, Article 15 (July 2009), 58 pages.
Chen et al. (2017) Lingjiao Chen, Arun Kumar, Jeffrey Naughton, and Jignesh M. Patel. 2017. Towards Linear Algebra over Normalized Data. Proc. VLDB Endow. 10, 11 (Aug. 2017), 1214–1225.
Choi et al. (1992) Jaeyoung Choi, J. J. Dongarra, R. Pozo, and D. W. Walker. 1992. ScaLAPACK: a scalable linear algebra library for distributed memory concurrent computers In Frontiers of Massively Parallel Computation. In Symposium on the Frontiers of Massively Parallel Computation.
Codd (1970) E. F. Codd. 1970. A Relational Model of Data for Large Shared Data Banks. Commun. ACM 13, 6 (June 1970), 377–387.
Dean and Ghemawat (2004) Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: simplified data processing on large clusters. In OSDI’04. USENIX Association.
Duggan et al. (2015) Jennie Duggan, Olga Papaemmanouil, Leilani Battle, and Michael Stonebraker. 2015. Skew-Aware Join Optimization for Array Databases. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (Melbourne, Victoria, Australia) (SIGMOD ’15). ACM, 123–135.
Eldawy et al. (2016) Ahmed Eldawy, Mohamed F. Mokbel, and Christopher Jonathan. 2016. HadoopViz: A MapReduce framework for extensible visualization of big spatial data. ICDE’16 (2016), 601–612.
Elgamal et al. (2017) Tarek Elgamal, Shangyu Luo, Matthias Boehm, Alexandre V. Evfimievski, Shirish Tatikonda, Berthold Reinwald, and Prithviraj Sen. 2017. SPOOF: Sum-Product Optimization and Operator Fusion for Large-Scale Machine Learning. In CIDR’17, 8th Biennial Conference on Innovative Data Systems Research.
Elgohary et al. (2016) Ahmed Elgohary, Matthias Boehm, Peter J. Haas, Frederick R. Reiss, and Berthold Reinwald. 2016. Compressed Linear Algebra for Large-scale Machine Learning. Proc. VLDB Endow. 9, 12 (Aug. 2016), 960–971. https://doi.org/10.14778/2994509.2994515
Ghoting et al. (2011) Amol Ghoting, Rajasekar Krishnamurthy, Edwin Pednault, Berthold Reinwald, Vikas Sindhwani, Shirish Tatikonda, Yuanyuan Tian, and Shivakumar Vaithyanathan. 2011. SystemML: Declarative Machine Learning on MapReduce. In ICDE’11. IEEE Computer Society, Washington, DC, USA, 231–242.
Golub and Van Loan (2013) Gene H. Golub and Charles F. Van Loan. 2013. Matrix Computations (4th Ed.). Johns Hopkins University Press, Baltimore, MD, USA.
Hastie et al. (2001) Trevor Hastie, Robert Tibshirani, and Jerome Friedman. 2001. The Elements of Statistical Learning. Springer New York Inc., New York, NY, USA.
Hellerstein et al. (2012) Joseph M. Hellerstein, Christoper Ré, Florian Schoppmann, Daisy Zhe Wang, Eugene Fratkin, Aleksander Gorajek, Kee Siong Ng, Caleb Welton, Xixuan Feng, Kun Li, and Arun Kumar. 2012. The MADlib Analytics Library: Or MAD Skills, the SQL. Proc. VLDB Endow. 5, 12 (2012), 1700–1711. https://doi.org/10.14778/2367502.2367510
Huang et al. (2013) Botong Huang, Shivnath Babu, and Jun Yang. 2013. Cumulon: Optimizing Statistical Data Analysis in the Cloud. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (New York, New York, USA) (SIGMOD ’13). ACM, New York, NY, USA, 1–12.
Kolda and Bader (2009) Tamara G. Kolda and Brett W. Bader. 2009. Tensor Decompositions and Applications. SIAM Rev. 51, 3 (Aug. 2009), 455–500.
Kumar et al. (2015) Arun Kumar, Jeffrey Naughton, and Jignesh M. Patel. 2015. Learning Generalized Linear Models Over Normalized Data. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (Melbourne, Victoria, Australia) (SIGMOD ’15). ACM, New York, NY, USA, 1969–1984.
Kumar et al. (2016) Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu. 2016. To Join or Not to Join?: Thinking Twice About Joins Before Feature Selection. In Proceedings of the 2016 International Conference on Management of Data (San Francisco, California, USA) (SIGMOD ’16). ACM, New York, NY, USA, 19–34.
Liu et al. (2010) Chao Liu, Hung-chih Yang, Jinliang Fan, Li-Wei He, and Yi-Min Wang. 2010. Distributed Nonnegative Matrix Factorization for Web-scale Dyadic Data Analysis on Mapreduce. In Proceedings of the 19th International Conference on World Wide Web (Raleigh, North Carolina, USA) (WWW ’10). ACM, New York, NY, USA, 681–690.
Luo et al. (2017) Shangyu Luo, Zekai Gao, Michael Gubanov, Luis Perez, and Christopher Jermaine. 2017. Scalable Linear Algebra on a Relational Database System. In ICDE’17. IEEE Computer Society, Washington, DC, USA, 523–534.
Manning and Klein (2003) Christopher Manning and Dan Klein. 2003. Optimization, Maxent Models, and Conditional Estimation Without Magic. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: Tutorials - Volume 5 (Edmonton, Canada) (NAACL-Tutorials ’03). Association for Computational Linguistics, Stroudsburg, PA, USA, 8–8.
Mehta et al. (2017) Parmita Mehta, Sven Dorkenwald, Dongfang Zhao, Tomer Kaftan, Alvin Cheung, Magdalena Balazinska, Ariel Rokem, Andrew Connolly, Jacob Vanderplas, and Yusra AlSayyad. 2017. Comparative Evaluation of Big-data Systems on Scientific Image Analytics Workloads. Proc. VLDB Endow. 10, 11 (Aug. 2017), 1226–1237.
Qi et al. (2014) Jianlong Qi, Hassan Asl, Johan Björkegren, and Tom Michoel. 2014. kruX: matrix-based non-parametric eQTL discovery. BMC Bioinformatics 15 (2014), 11.
Qian et al. (2012) Zhengping Qian, Xiuwei Chen, Nanxi Kang, Mingcheng Chen, Yuan Yu, Thomas Moscibroda, and Zheng Zhang. 2012. MadLINQ: Large-scale Distributed Matrix Computation for the Cloud. In EuroSys’12 (Bern, Switzerland). ACM, New York, NY, USA, 197–210. https://doi.org/10.1145/2168836.2168857
Schleich et al. (2016) Maximilian Schleich, Dan Olteanu, and Radu Ciucanu. 2016. Learning Linear Regression Models over Factorized Joins. In Proceedings of the 2016 International Conference on Management of Data (San Francisco, California, USA) (SIGMOD ’16). ACM, New York, NY, USA, 3–18.
Seo et al. (2010) Sangwon Seo, Edward J. Yoon, Jaehong Kim, Seongwook Jin, Jin-Soo Kim, and Seungryoul Maeng. 2010. HAMA: An Efficient Matrix Computation with the MapReduce Framework. In CLOUDCOM’10. IEEE Computer Society, Washington, DC, USA, 721–726.
Soroush et al. (2011) Emad Soroush, Magdalena Balazinska, and Daniel Wang. 2011. ArrayStore: A Storage Manager for Complex Parallel Array Processing. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data (Athens, Greece) (SIGMOD ’11). ACM, New York, NY, USA, 253–264.
Sparks et al. (2013) Evan R. Sparks, Ameet Talwalkar, Virginia Smith, Jey Kottalam, Xinghao Pan, Joseph E. Gonzalez, Michael J. Franklin, Michael I. Jordan, and Tim Kraska. 2013. MLI: An API for Distributed Machine Learning. In ICDM. 1187–1192.
Yu et al. (2015) Lele Yu, Yingxia Shao, and Bin Cui. 2015. Exploiting Matrix Dependency for Efficient Distributed Matrix Computation. In SIGMOD’15 (Melbourne, Victoria, Australia). ACM, New York, NY, USA, 93–105.
Yu et al. (2017) Yongyang Yu, Mingjie Tang, Walid Aref, Qutaibah Malluhi, Mostafa Abbas, and Mourad Ouzzani. 2017. In-memory Distributed Matrix Computation Processing and Optimization. In ICDE’17. IEEE Computer Society, Washington, DC, USA, 1047–1058.
Zaharia et al. (2012) Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauly, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. In NSDI’12. USENIX, San Jose, CA, 15–28.
Zhao et al. (2016) Feipeng Zhao, Min Xiao, and Yuhong Guo. 2016. Predictive Collaborative Filtering with Side Information. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (New York, New York, USA) (IJCAI’16). AAAI Press, 2385–2390.

Appendix A Code snippets

⬇

1val X = loadMatrix("in/X") // read matrix X

2val tr = X.t().multiply(X).trace()

3println(s"Trace of the Gram matrix is $tr")

Code 1: Aggregation of a Gram matrix along diagonal in Scala

⬇

1val X = loadMatrix("in/X") // read matrix X

2val y = loadMatrix("in/y") // read vector y

3val pred = SelectPredicate.parse("RID=1 AND CID=1")

4val g11 = X.t().multiple(X).select(pred)

5println(s"(1, 1) entry of Gram matrix is $g11")

6val b = X.t().multiply(X).inverse().multiply(X).multiply(y)

7b.saveAsTextFile("out/b") // save vector b to disk

Code 2: Selection and linear regression in Scala

⬇

1val A = loadMatrix("in/A") // read matrix A

2val B = loadMatrix("in/B") // read matrix B

3val func = (x: Double, y: Double) => {x * y} // merge function

4val kron = A.cross_prod(B, func)

5kron.saveAsTextFile("out/Kronecker") // save matrix kron to disk

Code 3: Kronecker product in Scala

⬇

1val A = loadMatrix("in/A") // read matrix A

2val B = loadMatrix("in/B") // read matrix B

3val join_type = JoinType.parse("RID=RID AND CID=CID") // parse the join condition

4val func = (x: Double, y: Double) => {x * y} // merge function

5val C = A.join(B, join_type, func)

6C.saveAsTextFile("out/C") // save matrix C to disk

Code 4: Direct overlay in Scala

⬇

1val A = loadMatrix("in/A") // read matrix A

2val B = loadMatrix("in/B") // read matrix B

3val join_type = JoinType.parse("VAL=VAL") // parse the join condition

4val func = (x: Double, y: Double) => {x * y} // merge function

5val C = A.join(B, join_type, func)

6C.saveAsTextFile("out/C") // save matrix C to disk

Code 5: Join on entries in Scala

	$(s_{A},s_{B})$
$\gamma$	(r, r)	(r, c)	(c, r)	(c, c)
$RID_{A}=val_{B}$	$\min\{(N\!-1)\|\boldsymbol{\mathrm{A}}\|,\eta_{B}\|\boldsymbol{\mathrm{B}}\|\}$	$\min\{(N\!-1)\|\boldsymbol{\mathrm{A}}\|,\eta_{B}\|\boldsymbol{\mathrm{B}}\|\}$	$\min\{(N\!-1)\|\boldsymbol{\mathrm{A}}\|,N\eta_{B}\|\boldsymbol{\mathrm{B}}\|\}$	$\min\{(N\!-1)\|\boldsymbol{\mathrm{A}}\|,N\eta_{B}\|\boldsymbol{\mathrm{B}}\|\}$
$CID_{A}=val_{B}$	$\min\{(N\!-1)\|\boldsymbol{\mathrm{A}}\|,N\eta_{B}\|\boldsymbol{\mathrm{B}}\|\}$	$\min\{(N\!-1)\|\boldsymbol{\mathrm{A}}\|,N\eta_{B}\|\boldsymbol{\mathrm{B}}\|\}$	$\min\{(N\!-1)\|\boldsymbol{\mathrm{A}}\|,\eta_{B}\|\boldsymbol{\mathrm{B}}\|\}$	$\min\{(N\!-1)\|\boldsymbol{\mathrm{A}}\|,\eta_{B}\|\boldsymbol{\mathrm{B}}\|\}$
$val_{A}=RID_{B}$	$\min\{\eta_{A}\|\boldsymbol{\mathrm{A}}\|,(N\!-1)\|\boldsymbol{\mathrm{B}}\|\}$	$\min\{N\eta_{A}\|\boldsymbol{\mathrm{A}}\|,(N\!-1)\|\boldsymbol{\mathrm{B}}\|\}$	$\min\{\eta_{A}\|\boldsymbol{\mathrm{A}}\|,(N\!-1)\|\boldsymbol{\mathrm{B}}\|\}$	$\min\{N\eta_{A}\|\boldsymbol{\mathrm{A}}\|,(N\!-1)\|\boldsymbol{\mathrm{B}}\|\}$
$val_{A}=CID_{B}$	$\min\{N\eta_{A}\|\boldsymbol{\mathrm{A}}\|,(N\!-1)\|\boldsymbol{\mathrm{B}}\|\}$	$\min\{\eta_{A}\|\boldsymbol{\mathrm{A}}\|,(N\!-1)\|\boldsymbol{\mathrm{B}}\|\}$	$\min\{N\eta_{A}\|\boldsymbol{\mathrm{A}}\|,(N\!-1)\|\boldsymbol{\mathrm{B}}\|\}$	$\min\{\eta_{A}\|\boldsymbol{\mathrm{A}}\|,(N\!-1)\|\boldsymbol{\mathrm{B}}\|\}$