An Algebraic Approach for High-level Text Analytics
Abstract.
Text analytical tasks like word embedding, phrase mining and topic modeling place increasing demands on, and pose new challenges to, existing database management systems. In this paper, we provide a novel algebraic approach based on associative arrays. Our data model and algebra bring together relational operators and text operators, which enables interesting optimization opportunities for hybrid data sources that contain both relational and textual data. We demonstrate its expressive power in text analytics using several real-world tasks.
1. Introduction
A significant part of today’s analytical tasks involve text operations. A data scientist who has to manipulate and analyze text data today typically uses a set of text analysis software libraries (e.g., NLTK, Stanford CoreNLP, GenSim) for tasks like word embedding, phrase extraction, named entity recognition and topic modeling. In addition, most DBMSs today have built-in support for full-text search. PostgreSQL, for instance, provides a text-vector type (tsvector) that extracts and stores term and positional indices to enable efficient text queries (tsquery). Yet, some common and seemingly simple text analysis tasks cannot be performed within the boundaries of a single information system.
Example 1. Consider a relational table News(newsID, date, newspaper, title, content) where title and content are text-valued attributes, and two sets $O$ and $P$ that represent a collection of organization names and person names respectively. Now, consider the following analysis:
• Select a subset $N_1$ of news articles from date $d_1$ through $d_2$.
• Identify all news articles $N_2$ in $N_1$ that have at least $k_1$ organization names from $O$ and $k_2$ persons from $P$.
• Create a document-term matrix $M$ on $N_2$.
• Remove rows and columns of the matrix if either of their row or column marginal sums is below thresholds $t_r$ and $t_c$ respectively.
• Compute a topic model using $M$.
The intention of the analysis is to find the topic distribution of those news items that cover, for example, any two members of the senate (list $P$) and any one government organization (list $O$). The analysis itself is straightforward and can be performed with a combination of SQL queries and Python scripts.
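To make the pipeline concrete, the following is a minimal sketch in plain Python. The corpus, the entity lists, the thresholds, and the whitespace tokenization are all hypothetical stand-ins; the document-term matrix is represented as a sparse dict keyed by (document, term) pairs, and the final topic-modeling call is left as a stub.

```python
from collections import Counter

# Hypothetical corpus: (newsID, date, newspaper, title, content)
news = [
    (1, "2020-01-02", "Post", "Senate vote", "senator smith met the fda today"),
    (2, "2020-01-05", "Post", "Weather", "sunny day expected in the bay"),
    (3, "2020-02-01", "Herald", "Hearing", "senator jones questioned the fda"),
]
orgs, persons = {"fda"}, {"smith", "jones"}

# Steps 1-2: relational selections, by date range and then by entity counts
# (requiring at least 1 organization and 1 person here; both thresholds are ours).
selected = [r for r in news if "2020-01-01" <= r[1] <= "2020-01-31"]
selected = [r for r in selected
            if sum(t in orgs for t in set(r[4].split())) >= 1
            and sum(t in persons for t in set(r[4].split())) >= 1]

# Step 3: document-term matrix as a sparse associative array {(doc, term): count}.
dtm = {(r[0], t): c for r in selected for t, c in Counter(r[4].split()).items()}

# Step 4: drop rows/columns whose marginal sums fall below (hypothetical) thresholds.
row_sum, col_sum = Counter(), Counter()
for (d, t), c in dtm.items():
    row_sum[d] += c
    col_sum[t] += c
dtm = {(d, t): c for (d, t), c in dtm.items()
       if row_sum[d] >= 2 and col_sum[t] >= 1}

# Step 5 would pass `dtm` to a topic-modeling routine (e.g., LDA).
```

Only article 1 survives both selections here: article 3 falls outside the date range and article 2 mentions no listed entity.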
Our goal in this short paper is to present the idea that a novel relation-flanked associative array data model has the potential of serving as the underlying framework for the management and analysis of text-centric data. We develop the theoretical elements of the model and illustrate its utility through examples.
2. The Data Model
2.1. Text Associative Arrays
A number of current data systems, typically in the domain of polystore data systems, use associative arrays (Jananthan et al., 2017; Kepner et al., 2020) or variants of them like associative tables (Barceló et al., 2019) and the tensor data model (Leclercq et al., 2019). Many of these data models are used to support analytical (e.g., machine learning) tasks. In our setting, we specialize the essential associative model for text analytics. At our level of abstraction, our model reuses relational operations for all metadata of the associative arrays. While it has been shown (Barceló et al., 2019) that associative arrays can express relational operations, we believe that using the relational abstraction along with our text-centric algebraic operations makes the system easier to program and interpret. At a more basic level, since most text processing operations include sorting (e.g., by TF-IDF scores), our model is based on partially ordered semirings.
Definition 2.1 (Semiring).
A semiring is a set $R$ with two binary operations, addition $+$ and multiplication $\cdot$, such that: 1) $+$ is associative and commutative and has an identity element $0$; 2) $\cdot$ is associative with an identity element $1$; 3) $\cdot$ distributes over $+$; and 4) multiplication by $0$ annihilates $R$, i.e., $0 \cdot a = a \cdot 0 = 0$ for all $a \in R$.
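The axioms can be made concrete with a small sketch. The `Semiring` container and the axiom checker below are our own illustrative helpers, not part of the paper's model; the two instances are the ordinary counting semiring and the max-plus ("tropical") semiring, both standard examples.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Semiring:
    add: Callable   # the + operation
    mul: Callable   # the . operation
    zero: object    # identity of +, annihilator of .
    one: object     # identity of .

counting = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1)
# max-plus semiring: "add" is max, "mul" is +, zero is -inf, one is 0
tropical = Semiring(max, lambda a, b: a + b, float("-inf"), 0)

def check_axioms(S, a, b, c):
    assert S.add(a, S.add(b, c)) == S.add(S.add(a, b), c)            # + associative
    assert S.add(a, b) == S.add(b, a)                                # + commutative
    assert S.add(a, S.zero) == a                                     # additive identity
    assert S.mul(a, S.one) == a                                      # multiplicative identity
    assert S.mul(a, S.add(b, c)) == S.add(S.mul(a, b), S.mul(a, c))  # distributivity
    assert S.mul(a, S.zero) == S.zero                                # 0 annihilates

check_axioms(counting, 2, 3, 5)
check_axioms(tropical, 2.0, 3.0, 5.0)
```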
Definition 2.2 (Partially-Ordered Semiring).
(Golan, 2013) A semiring $(R, +, \cdot)$ is partially ordered if and only if there exists a partial order relation $\leq$ on $R$ satisfying the following conditions for all $a, b, c \in R$:
• If $a \leq b$, then $a + c \leq b + c$;
• If $a \leq b$ and $0 \leq c$, then $a \cdot c \leq b \cdot c$ and $c \cdot a \leq c \cdot b$.
Definition 2.3 (Text Associative Array).
The Text Associative Array (TAA) is defined as a mapping:
$$\mathbf{A}: K_1 \times K_2 \to V,$$
where $K_1$ and $K_2$ are two key sets (named row key set and column key set respectively), and $V$ is a partially-ordered semiring (Definition 2.2). We call $|K_1| \times |K_2|$ “the dimension of $\mathbf{A}$”, and denote $K_1$, $K_2$, and $K_1 \times K_2$ as the row key set, column key set, and set of key pairs of $\mathbf{A}$, respectively.
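Concretely, a sparse TAA can be sketched as a Python dict keyed by (row key, column key) pairs, with absent pairs mapping to the semiring zero. The representation and helper names below are ours, not the paper's.

```python
# A small document-term TAA: rows are documents, columns are terms.
doc_term = {
    ("d1", "senate"): 3, ("d1", "budget"): 1,
    ("d2", "senate"): 1, ("d2", "weather"): 2,
}

def row_keys(A):
    return {k1 for k1, _ in A}

def col_keys(A):
    return {k2 for _, k2 in A}

def dimension(A):
    # |K1| x |K2|, the "dimension" of the array
    return (len(row_keys(A)), len(col_keys(A)))
```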
Next, we define the basic operations on text associative arrays, to be used by our primary text operations (Sec. 2.2).
Definition 2.4 (Addition).
Given two TAAs $\mathbf{A}, \mathbf{B}: K_1 \times K_2 \to V$, the addition operation is defined as
$$(\mathbf{A} \oplus \mathbf{B})(k_1, k_2) = \mathbf{A}(k_1, k_2) + \mathbf{B}(k_1, k_2).$$
Define $\mathbf{0}: K_1 \times K_2 \to V$ as a TAA where $\mathbf{0}(k_1, k_2) = 0$ for all $(k_1, k_2) \in K_1 \times K_2$. $\mathbf{0}$ serves as an identity for the addition operation on key set $K_1 \times K_2$.
Definition 2.5 (Hadamard Product).
Given two TAAs $\mathbf{A}, \mathbf{B}: K_1 \times K_2 \to V$, the Hadamard product operation is defined as
$$(\mathbf{A} \otimes \mathbf{B})(k_1, k_2) = \mathbf{A}(k_1, k_2) \cdot \mathbf{B}(k_1, k_2).$$
Define $\mathbf{1}: K_1 \times K_2 \to V$ as a TAA where $\mathbf{1}(k_1, k_2) = 1$ for all $(k_1, k_2) \in K_1 \times K_2$. $\mathbf{1}$ serves as an identity for the Hadamard product on key set $K_1 \times K_2$.
Definition 2.6 (Array Multiplication).
Given two TAAs $\mathbf{A}: K_1 \times K_2 \to V$ and $\mathbf{B}: K_2 \times K_3 \to V$, the array multiplication operation is defined as
$$(\mathbf{A} \odot \mathbf{B})(k_1, k_3) = \sum_{k_2 \in K_2} \mathbf{A}(k_1, k_2) \cdot \mathbf{B}(k_2, k_3).$$
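Over the sparse-dict representation sketched earlier, the three operations can be written in a few lines each, with missing keys treated as the semiring zero. Ordinary integer arithmetic stands in for a general semiring, and the function names are ours.

```python
def add(A, B):
    # elementwise addition: union of key pairs, missing entries count as 0
    return {k: A.get(k, 0) + B.get(k, 0) for k in A.keys() | B.keys()}

def hadamard(A, B):
    # elementwise product: only key pairs present in both can be nonzero
    return {k: A[k] * B[k] for k in A.keys() & B.keys()}

def matmul(A, B):
    # (A . B)(k1, k3) = sum over the shared middle key k2
    out = {}
    for (k1, k2), a in A.items():
        for (k2b, k3), b in B.items():
            if k2 == k2b:
                out[(k1, k3)] = out.get((k1, k3), 0) + a * b
    return out

A = {("d1", "t1"): 2, ("d1", "t2"): 1}
B = {("t1", "x"): 3, ("t2", "x"): 4}
```

The quadratic loop in `matmul` is for clarity only; an indexed join over the middle key would be used in practice.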
Definition 2.7 (Array Identity).
Given two key sets $K_1$ and $K_2$, and a partial function $f: K_1 \to K_2$, the array identity $\mathbf{I}_f$ is defined as a TAA such that
$$\mathbf{I}_f(k_1, k_2) = \begin{cases} 1 & \text{if } f(k_1) = k_2, \\ 0 & \text{otherwise.} \end{cases}$$
Specifically, if $K_1 = K_2$ and $f(k) = k$ for all $k \in K_1$, $\mathbf{I}_f$ is abbreviated to $\mathbf{I}$.
In general, $\mathbf{I}_f$ is not an identity for general array multiplication. However, $\mathbf{I}$ is an identity element for array multiplication on associative arrays $K_1 \times K_1 \to V$.
Definition 2.8 (Kronecker Product).
Given two TAAs $\mathbf{A}: K_1 \times K_2 \to V$ and $\mathbf{B}: K_3 \times K_4 \to V$, their Kronecker product is defined by
$$(\mathbf{A} \mathbin{\hat\otimes} \mathbf{B})((k_1, k_3), (k_2, k_4)) = \mathbf{A}(k_1, k_2) \cdot \mathbf{B}(k_3, k_4).$$
Definition 2.9 (Transpose).
Given a TAA $\mathbf{A}: K_1 \times K_2 \to V$, its transpose, denoted by $\mathbf{A}^\top$, is defined by $\mathbf{A}^\top: K_2 \times K_1 \to V$ where $\mathbf{A}^\top(k_2, k_1) = \mathbf{A}(k_1, k_2)$ for $k_1 \in K_1$ and $k_2 \in K_2$.
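Both structural operators are one-liners over the dict representation. In the Kronecker sketch below, the keys of the product are themselves pairs of keys, mirroring the definition; the helper names are ours.

```python
def transpose(A):
    # swap the roles of row and column keys
    return {(k2, k1): v for (k1, k2), v in A.items()}

def kronecker(A, B):
    # product keys are ((k1, k3), (k2, k4)); values multiply
    return {((k1, k3), (k2, k4)): a * b
            for (k1, k2), a in A.items()
            for (k3, k4), b in B.items()}

A = {("r", "c"): 2}
B = {("s", "d"): 5}
```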
2.2. Text Operations
We can express a number of fundamental text operations using the proposed TAA algebra. We first define three basic TAAs specifically for text analytics, and then define a series of text operations on general TAAs or on these basic structures.
Definition 2.10 (Document-Term Matrix).
Given a text corpus, a document-term matrix is defined as a TAA $\mathbf{M}: D \times T \to V$ where $D$ and $T$ are the document set and term set of the corpus.
The term set $T$ in the document-term matrix can be the vocabulary or the bigrams of the corpus, or an application-specific user-defined set of interesting terms. The matrix value can also take different semantics: in one application it can be the occurrence count of term $t$ in document $d$, while in another it can be the term frequency-inverse document frequency (tf-idf). Typically, elements of $D$ and $T$ will have additional relational metadata; a document may have a date, and a term may have an annotation like a part-of-speech (POS) tag.
Definition 2.11 (Term-Index Matrix).
Given a document $d$, the term-index matrix is defined as a TAA $\mathbf{T}_d: T_d \times I \to V$, where $T_d$ is the set of terms in document $d$ and $I = \{1, \dots, n\}$ is the index set ($n$ is the size of $d$). Specifically, for $t \in T_d$ and $i \in I$,
$$\mathbf{T}_d(t, i) = \begin{cases} 1 & \text{if the } i\text{-th word of } d \text{ is } t, \\ 0 & \text{otherwise.} \end{cases}$$
Example 2. For a document $d$ = “Today is a sunny day”, let its term-index matrix be $\mathbf{T}_d$; then we have $\mathbf{T}_d(\text{Today}, 1) = 1$, $\mathbf{T}_d(\text{is}, 2) = 1$, $\mathbf{T}_d(\text{a}, 3) = 1$, $\mathbf{T}_d(\text{sunny}, 4) = 1$, $\mathbf{T}_d(\text{day}, 5) = 1$, and for all other pairs $(t, i)$, we have $\mathbf{T}_d(t, i) = 0$.
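The construction in Example 2 is easy to sketch, storing only the nonzero entries of the term-index matrix and assuming whitespace tokenization.

```python
def term_index_matrix(doc):
    # T(t, i) = 1 when the i-th word (1-based) of the document is t;
    # zeros are left implicit in the sparse dict
    words = doc.split()
    return {(t, i + 1): 1 for i, t in enumerate(words)}

T = term_index_matrix("Today is a sunny day")
```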
Definition 2.12 (Term Vector).
There are two types of term vectors. 1) Given the set of terms $T_d$ of a document $d$, the term vector of $d$ is defined as a TAA $\mathbf{v}_d: \{d\} \times T_d \to V$. 2) Given a set of terms $T$ for a collection of documents $D$, a TAA $\mathbf{v}_D: \{D\} \times T \to V$, whose single row key stands for the corpus itself, is a term vector for the corpus $D$.
The term vector represents some attribute of terms in the scope of one document or a corpus. For example, for a document $d$, the value of the term vector can be the occurrence count of each term in $d$. For a corpus $D$, the value of its term vector can be the idf value of each term over the whole corpus; such a value is not specific to a single document.
Based on these structures, we can define our unit text operators as follows. Some operators are defined for general TAAs, while some are defined for a specific type of TAAs.
Definition 2.13 (Extraction).
Given a TAA $\mathbf{A}: K_1 \times K_2 \to V$ and two projection sets $K_1' \subseteq K_1$, $K_2' \subseteq K_2$, we define the extraction operation as
$$\pi_{K_1', K_2'}(\mathbf{A}): K_1' \times K_2' \to V.$$
Let $\mathbf{A}' = \pi_{K_1', K_2'}(\mathbf{A})$; we have $\mathbf{A}'(k_1, k_2) = \mathbf{A}(k_1, k_2)$ for all $(k_1, k_2) \in K_1' \times K_2'$.
When only extracting row keys, the operation can be expressed as $\pi_{K_1', K_2}(\mathbf{A})$, and when extracting column keys, it is expressed as $\pi_{K_1, K_2'}(\mathbf{A})$.
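Extraction restricts the stored key pairs to the projection sets. A minimal sketch over the dict representation, with `None` standing for "keep all keys on that axis" (a convenience of ours, not part of the definition):

```python
def extract(A, rows=None, cols=None):
    # keep only entries whose keys fall in the projection sets
    return {(k1, k2): v for (k1, k2), v in A.items()
            if (rows is None or k1 in rows) and (cols is None or k2 in cols)}

A = {("d1", "t1"): 2, ("d1", "t2"): 1, ("d2", "t1"): 5}
```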
Definition 2.14 (Rename).
Given a TAA $\mathbf{A}: K_1 \times K_2 \to V$, suppose $K_2'$ is another column key set and there exists a bijection $f: K_2 \to K_2'$. The column rename operation is defined as
$$\rho^{c}_{f}(\mathbf{A}): K_1 \times K_2' \to V, \quad \rho^{c}_{f}(\mathbf{A})(k_1, f(k_2)) = \mathbf{A}(k_1, k_2).$$
Similarly, given another row key set $K_1'$ and a bijection $g: K_1 \to K_1'$, the row rename operation is defined as
$$\rho^{r}_{g}(\mathbf{A}): K_1' \times K_2 \to V, \quad \rho^{r}_{g}(\mathbf{A})(g(k_1), k_2) = \mathbf{A}(k_1, k_2).$$
The subscript can be omitted if the bijection is clear, e.g., $\rho^{c}(\mathbf{A})$. In addition, the row rename operation and column rename operation can be combined as $\rho^{r}_{g}(\rho^{c}_{f}(\mathbf{A}))$. Our rename operator is more general than the rename operation of relational algebra since it supports renaming of both the row key set and the column key set.
Definition 2.15 (Apply).
Given a TAA $\mathbf{A}: K_1 \times K_2 \to V$ and a function $g: V \to V'$, define the apply operator by $\mathrm{apply}_g(\mathbf{A}): K_1 \times K_2 \to V'$ where
$$\mathrm{apply}_g(\mathbf{A})(k_1, k_2) = g(\mathbf{A}(k_1, k_2)).$$
Definition 2.16 (Filter).
Given a TAA $\mathbf{A}: K_1 \times K_2 \to V$ and an indicator function $\theta: K_1 \times K_2 \to \{0, 1\}$, define the filter operation on $\mathbf{A}$ as
$$\sigma_{\theta}(\mathbf{A}): K_1' \times K_2' \to V,$$
where $K_1' = \{k_1 \mid \exists k_2 .\ \theta(k_1, k_2) = 1\}$ and $K_2' = \{k_2 \mid \exists k_1 .\ \theta(k_1, k_2) = 1\}$, and $\sigma_{\theta}(\mathbf{A})(k_1, k_2) = \mathbf{A}(k_1, k_2) \cdot \theta(k_1, k_2)$.
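Apply and filter can be sketched together; `apply` maps a function over every stored value, and the filter variant below keeps whole rows by a predicate on the row key, a common special case (both helper names are ours).

```python
import math

def apply(A, g):
    # map a function over every stored value
    return {k: g(v) for k, v in A.items()}

def filter_rows(A, theta):
    # keep only rows whose key satisfies the indicator theta
    return {(k1, k2): v for (k1, k2), v in A.items() if theta(k1)}

A = {("d1", "t1"): 4, ("d2", "t1"): 9}
logged = apply(A, math.log)
```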
Definition 2.17 (Sort).
Given a TAA $\mathbf{A}: K_1 \times K_2 \to V$, for any $k_2 \in K_2$, we extract a TAA $\pi_{K_1, \{k_2\}}(\mathbf{A})$ of dimension $|K_1| \times 1$. Since $V$ is a partially-ordered semiring (Definition 2.2), the value set $\{\mathbf{A}(k_1, k_2) \mid k_1 \in K_1\}$ inherits the partial order from $V$, which implies an order on $K_1$. Define $r: K_1 \to \{1, \dots, |K_1|\}$ as the rank function induced by this order; then the sort by column operation is defined as
$$\mathrm{sort}^{c}_{k_2}(\mathbf{A}) = \rho^{r}_{r}(\mathbf{A}),$$
where row keys are renamed to their ranks. Similarly, we have the sort by row operation defined as
$$\mathrm{sort}^{r}_{k_1}(\mathbf{A}) = \rho^{c}_{r'}(\mathbf{A}),$$
where $r'$ is the rank function on $K_2$ induced by the values in row $k_1$.
When the column key dimension or row key dimension is 1 (e.g., for a term vector), $\mathrm{sort}^{c}_{k_2}$ or $\mathrm{sort}^{r}_{k_1}$ is abbreviated to $\mathrm{sort}$.
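A simplified sketch of sorting by a column: instead of renaming keys to ranks, the helper below just returns the row keys in rank order, which is what downstream uses such as "top-k by tf-idf" need (the simplification and the function name are ours; missing entries count as zero).

```python
def sort_by_column(A, col, reverse=True):
    # rank row keys by their value in the chosen column (missing -> 0)
    rows = {k1 for k1, _ in A}
    return sorted(rows, key=lambda k1: A.get((k1, col), 0), reverse=reverse)

A = {("d1", "tfidf"): 0.2, ("d2", "tfidf"): 0.9, ("d3", "tfidf"): 0.5}
```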
Definition 2.18 (Merge).
Given two TAAs $\mathbf{A}: K_1 \times K_2 \to V$ and $\mathbf{B}: K_1 \times K_2' \to V$, if $K_2 \cap K_2' = \emptyset$, then the merge operation can be applied on them, and it is defined as
$$\mathrm{merge}(\mathbf{A}, \mathbf{B}): K_1 \times K_2'' \to V,$$
where $K_2'' = K_2 \cup K_2'$, and
$$\mathrm{merge}(\mathbf{A}, \mathbf{B})(k_1, k_2) = \begin{cases} \mathbf{A}(k_1, k_2) & \text{if } k_2 \in K_2, \\ \mathbf{B}(k_1, k_2) & \text{if } k_2 \in K_2'. \end{cases}$$
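Merge is column-wise concatenation of two arrays that share row keys. A minimal sketch over the dict representation (helper name is ours):

```python
def merge(A, B):
    # column-wise concatenation: column key sets must be disjoint
    assert {k2 for _, k2 in A}.isdisjoint({k2 for _, k2 in B})
    out = dict(A)
    out.update(B)
    return out

uni = {("d1", "sunny"): 1, ("d1", "day"): 1}   # unigram counts
bi = {("d1", "sunny day"): 1}                  # bigram counts
```

This is exactly the step used in Section 3.1 to combine a document's unigram and bigram vectors.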
Definition 2.19 (Expand).
Given an elementwise binary operator $\circ$ on associative arrays (e.g., $\oplus$ and $\otimes$), a term vector $\mathbf{v}: \{D\} \times T \to V$ and a document-term matrix $\mathbf{M}: D \times T \to V$, the expand operator is defined as
$$\mathrm{expand}_{\circ}(\mathbf{v}, \mathbf{M}) = \mathbf{v}' \circ \mathbf{M}.$$
This operator implicitly expands the term vector to generate another associative array $\mathbf{v}': D \times T \to V$ where $\mathbf{v}'(d, t) = \mathbf{v}(D, t)$ for every $d \in D$, and then applies $\circ$ on $\mathbf{v}'$ and $\mathbf{M}$.
Suppose that for a corpus $D$, there is a term vector $\mathbf{m}$ where $\mathbf{m}(D, t)$ is the mean occurrence of term $t$ in $D$ (i.e., $\mathbf{m}(D, t) = n_t / |D|$ where $n_t$ is the total occurrence of $t$ in $D$), and there is a document-term matrix $\mathbf{M}$; then
$$\mathrm{expand}_{\ominus}(\mathbf{m}, \mathbf{M})$$
will generate the difference of each term's occurrence in each document from its average occurrence, with $\ominus$ denoting elementwise subtraction.
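The broadcast-then-apply behavior can be sketched directly; the expansion of the vector over the rows is implicit in the lookup. The helper name is ours, and the example uses elementwise subtraction as in the mean-occurrence discussion above.

```python
import operator

def expand(op, vec, M):
    # broadcast the term vector over every row of the document-term matrix,
    # then apply op elementwise: result(d, t) = op(M(d, t), vec(t))
    return {(d, t): op(v, vec.get(t, 0)) for (d, t), v in M.items()}

mean_occ = {"senate": 2.0, "budget": 0.5}      # mean occurrence per term
M = {("d1", "senate"): 3, ("d1", "budget"): 1}  # document-term counts
diff = expand(operator.sub, mean_occ, M)
```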
Definition 2.20 (Flatten).
Given an associative array $\mathbf{A}: K_1 \times K_2 \to V$, the flatten operation is defined by $\mathrm{flatten}(\mathbf{A}): K_1 \times \{1\} \to V$ where
$$\mathrm{flatten}(\mathbf{A})(k_1, 1) = \sum_{k_2 \in K_2} \mathbf{A}(k_1, k_2).$$
Definition 2.21 (Left Shift).
Given a term-index matrix $\mathbf{T}_d: T_d \times I \to V$ and a non-negative integer $s$, define the left shift operator by $\mathrm{LS}_s(\mathbf{T}_d) = \mathbf{T}'$ where
$$\mathbf{T}'(t, i) = \mathbf{T}_d(t, i + s).$$
For a term-index matrix of document $d$, $\mathrm{LS}_1(\mathbf{T}_d)$ generates another term-index matrix $\mathbf{T}'$ where $\mathbf{T}'(t, i) = 1$ when $t$ is the $(i+1)$-th word in $d$.
Definition 2.22 (Union).
Suppose there are two term-index matrices with the same index set $I$, $\mathbf{T}_1: T_1 \times I \to V$ and $\mathbf{T}_2: T_2 \times I \to V$; the union operation on $\mathbf{T}_1$ and $\mathbf{T}_2$ is defined by
$$(\mathbf{T}_1 \uplus \mathbf{T}_2)(t_1 t_2, i) = \mathbf{T}_1(t_1, i) \cdot \mathbf{T}_2(t_2, i),$$
where $t_1 t_2$ denotes the concatenation of terms $t_1$ and $t_2$.
Suppose $\mathbf{T}' = \mathrm{LS}_1(\mathbf{T}_d)$; then $(\mathbf{T}_d \uplus \mathbf{T}')(t_1 t_2, i) = 1$ exactly when $t_1$ is the $i$-th word and $t_2$ is the $(i+1)$-th word of $d$.
The left shift and union operations can be composed to compute all bigrams of a document. Given a term-index matrix $\mathbf{T}_d$ of document $d$, let $\mathbf{B} = \mathbf{T}_d \uplus \mathrm{LS}_1(\mathbf{T}_d)$; then $\mathbf{B}(t_1 t_2, i) = 1$ when $t_1 t_2$ is the $i$-th bigram in document $d$.
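The bigram composition can be traced end to end on the running example. Sparse entries only; whitespace tokenization and the helper names are ours.

```python
def left_shift(T, s=1):
    # T'(t, i) = T(t, i + s): the word at position i+s moves to position i
    return {(t, i - s): 1 for (t, i) in T if i - s >= 1}

def union(T1, T2):
    # concatenate terms that share an index: one bigram per aligned position
    return {(f"{t1} {t2}", i): 1
            for (t1, i) in T1 for (t2, j) in T2 if i == j}

doc = "Today is a sunny day"
T = {(t, i + 1): 1 for i, t in enumerate(doc.split())}
bigrams = union(T, left_shift(T))
```

For the five-word document this yields exactly four bigrams, one per adjacent word pair.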
Definition 2.23 (Sum).
The sum operation takes a TAA $\mathbf{A}: K_1 \times K_2 \to V$ and an integer $axis \in \{0, 1, 2\}$ as inputs, and has different semantics based on the integer value: $\mathrm{sum}(\mathbf{A}, 0) = \sum_{k_1, k_2} \mathbf{A}(k_1, k_2)$ is the grand total over all key pairs; $\mathrm{sum}(\mathbf{A}, 1)$ returns the column marginals $\sum_{k_1 \in K_1} \mathbf{A}(k_1, k_2)$ for each $k_2 \in K_2$; and $\mathrm{sum}(\mathbf{A}, 2)$ returns the row marginals $\sum_{k_2 \in K_2} \mathbf{A}(k_1, k_2)$ for each $k_1 \in K_1$.
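A sketch of the three axis semantics over the dict representation. The exact axis numbering here is our assumption, since the original equations for this definition were lost; the shape of the operation (grand total, column marginals, row marginals) follows the surrounding text.

```python
def taa_sum(A, axis=0):
    # axis 0: grand total; axis 1: column marginals; axis 2: row marginals
    if axis == 0:
        return sum(A.values())
    totals = {}
    for (k1, k2), v in A.items():
        key = k2 if axis == 1 else k1
        totals[key] = totals.get(key, 0) + v
    return totals

A = {("d1", "t1"): 2, ("d1", "t2"): 1, ("d2", "t1"): 3}
```

These marginals are exactly what Step 4 of Example 1 thresholds when pruning the document-term matrix.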
3. Text Analytic Tasks
3.1. Constructing a Document Term Matrix
As we stated in Section 2.2, a document-term matrix is a common representation model for a collection of documents, where the terms can be a list of important terms, the whole vocabulary, or bigrams. The entry of the matrix can be either the occurrence count of each term or its tf-idf value.
Example 3. For a document collection $D$, build a document-term matrix where the terms are all unigrams and bigrams in $D$, and the values are the occurrence counts of each term.
Suppose there is a tokenization function that takes a document $d$ as input and generates a term-index matrix $\mathbf{T}_d$. The construction can be decomposed into two parts; the first part is to construct a term vector for one single document, containing all unigrams and bigrams together with their corresponding occurrence counts. Fig. 1 shows the construction process.
Step 1 generates the term-index matrix in which each term is a unigram. The operation in Step 2 generates the term vector whose column keys are the unigrams in document $d$. Steps 3–6 produce the term vector whose column key set is all bigrams in $d$. Step 7 concatenates the two term vectors to get the representation of $d$.
For each document in collection $D$, we get its term vector using the above steps, then apply the merge operation to get the document-term matrix whose column key set is the union of all unigrams and bigrams in the whole corpus.
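The whole construction collapses into a few lines over the dict representation. This sketch shortcuts the figure's step-by-step algebra (tokenize, flatten, shift, union, merge) into direct counting; the helper name and whitespace tokenization are ours.

```python
from collections import Counter

def doc_vector(doc_id, text):
    # unigram and bigram occurrence counts for one document, as a single row
    words = text.split()
    grams = words + [f"{a} {b}" for a, b in zip(words, words[1:])]
    return {(doc_id, g): c for g, c in Counter(grams).items()}

docs = {"d1": "sunny day today", "d2": "sunny sunny day"}
dtm = {}
for d, text in docs.items():
    dtm.update(doc_vector(d, text))   # merge: row key sets are disjoint
```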
Besides word occurrence as the values of the document-term matrix, one can also use a term's tf-idf value. If all terms are considered, the document-term matrix would be high-dimensional and sparse, which would be costly to manipulate. A simple and commonly adopted method to reduce the dimension is to select only informative words. The following presents the queries to get a document-term matrix with tf-idf values for only the informative terms, where informativeness is measured by the idf value.
Example 4. Given a collection of documents $D$, we have to generate a document-term matrix $\mathbf{M}$ where $\mathbf{M}(d, t)$ is the tf-idf value for term $t$ in document $d$, restricted to the top 1000 “informative words”. Suppose there is a term-document matrix which stores the occurrence count of every unigram in each document (the construction is similar to that of Example 3 and thus is skipped); $\mathbf{M}$ can be generated by the following steps. The function in Step 3 calculates the idf value, which is defined as $\mathrm{idf}(t) = \log(|D| / n_t)$, where $n_t$ is the number of documents that contain term $t$.
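A sketch of the same pipeline in plain Python, with a tiny hypothetical corpus and a top-4 cut standing in for the top-1000 cut of the example; the idf formula matches the definition above.

```python
import math
from collections import Counter

docs = {"d1": "senate budget vote", "d2": "sunny day", "d3": "senate hearing"}

# occurrence matrix {(doc, term): count}, as in Example 3 (unigrams only)
occ = {(d, t): c for d, text in docs.items()
       for t, c in Counter(text.split()).items()}

# idf(t) = log(|D| / n_t), where n_t = number of documents containing t
n_docs = len(docs)
df = Counter(t for _, t in occ)
idf = {t: math.log(n_docs / n) for t, n in df.items()}

# keep only the top-k terms by idf (k = 4 here; 1000 in the example)
top = set(sorted(idf, key=idf.get, reverse=True)[:4])
tfidf = {(d, t): c * idf[t] for (d, t), c in occ.items() if t in top}
```

In TAA terms, the idf computation is an apply over a corpus-level term vector, the top-k cut is a sort followed by an extraction, and the final step is an expand with elementwise multiplication.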
3.2. Using TAAs
For Example 1 introduced in Section 1, we express the analysis using relational algebra together with the associative array operations. Suppose that the maximum number of words for a term in $O \cup P$ is 3; the analysis can then be expressed as the following steps. Step 1 is expressed in relational algebra. The function in the last step takes a document-term matrix and produces a document-topic matrix and a topic-term matrix, which are the standard outputs of topic modeling, represented by two more TAAs. Sorting the document-topic TAA and extracting the top-$k$ entries per document will return all pairs $(d, z)$ where topic $z$ is one of the top-$k$ topics for document $d$.
References
- Barceló et al. (2019) Pablo Barceló, Nelson Higuera, Jorge Pérez, and Bernardo Subercaseaux. 2019. On the Expressiveness of LARA: A Unified Language for Linear and Relational Algebra. arXiv preprint arXiv:1909.11693 (2019).
- Golan (2013) Jonathan S Golan. 2013. Semirings and affine equations over them: theory and applications. Vol. 556. Springer Science & Business Media.
- Jananthan et al. (2017) Hayden Jananthan, Ziqi Zhou, Vijay Gadepally, Dylan Hutchison, Suna Kim, and Jeremy Kepner. 2017. Polystore mathematics of relational algebra. In Int. Conf. on Big Data. IEEE, 3180–3189.
- Kepner et al. (2020) Jeremy Kepner, Vijay Gadepally, Hayden Jananthan, Lauren Milechin, and Siddharth Samsi. 2020. AI Data Wrangling with Associative Arrays. arXiv preprint arXiv:2001.06731 (2020).
- Leclercq et al. (2019) Éric Leclercq, Annabelle Gillet, Thierry Grison, and Marinette Savonnet. 2019. Polystore and Tensor Data Model for Logical Data Independence and Impedance Mismatch in Big Data Analytics. In Trans. on Large-Scale Data-and Knowledge-Centered Systems XLII. Springer, 51–90.