
An Algebraic Approach for High-level Text Analytics

Xiuwen Zheng, [email protected], University of California San Diego, San Diego Supercomputer Center, La Jolla, California, USA 92130
Amarnath Gupta, [email protected], University of California San Diego, San Diego Supercomputer Center, La Jolla, California, USA 92130
Abstract.

Text analytical tasks like word embedding, phrase mining, and topic modeling place increasing demands and challenges on existing database management systems. In this paper, we present a novel algebraic approach based on associative arrays. Our data model and algebra bring together relational operators and text operators, which enables interesting optimization opportunities for hybrid data sources that contain both relational and textual data. We demonstrate the model's expressive power in text analytics using several real-world tasks.

associative array, text analytics, natural language processing
conference: SSDBM '20: International Conference on Scientific and Statistical Data Management, June 03–05, 2020, Amsterdam, Netherlands

1. Introduction

A significant part of today's analytical tasks involves text operations. A data scientist who has to manipulate and analyze text data today typically uses a set of text analysis software libraries (e.g., NLTK, Stanford CoreNLP, GenSim) for tasks like word embedding, phrase extraction, named entity recognition, and topic modeling. In addition, most DBMS systems today have built-in support for full-text search. PostgreSQL, for instance, admits a text vector type (called tsvector) that extracts and creates term and positional indices to enable efficient queries (called tsquery). Yet, some common and seemingly simple text analysis tasks cannot be performed within the boundaries of a single information system.

Example 1. Consider a relational table $R$(newsID, date, newspaper, title, content) where title and content are text-valued attributes, and two sets $L_o, L_p$ that represent collections of organization names and person names, respectively. Now, consider the following analysis:

  • $N_1$ = Select the subset of news articles from date $d_1$ through $d_2$

  • $N_2$ = Identify all news articles in $N_1$ that contain at least $c_1$ organization names from $L_o$ and $c_2$ person names from $L_p$

  • $T_1$ = Create a document-term matrix on $N_2$.text

  • $T_2$ = Remove a row or column of the matrix if its marginal sum is below $\theta_1$ or $\theta_2$, respectively

  • $M$ = Compute a topic model using $T_2$

The intention of the analysis is to find the topic distribution of those news items that cover, for example, any two members of the senate (list $L_p$) and any one government organization (list $L_o$). The analysis itself is straightforward and can be performed with a combination of SQL queries and Python scripts.

Our goal in this short paper is to present the idea that a novel relation-flanked associative array data model can serve as the underlying framework for the management and analysis of text-centric data. We develop the theoretical elements of the model and illustrate its utility through examples.

2. The Data Model

2.1. Text Associative Arrays

A number of current data systems, typically in the domain of polystore systems, use associative arrays (Jananthan et al., 2017; Kepner et al., 2020) or variants such as associative tables (Barceló et al., 2019) and the tensor data model (Leclercq et al., 2019). Many of these data models are used to support analytical (e.g., machine learning) tasks. In our setting, we specialize the essential associative model for text analytics. At our level of abstraction, the model reuses relational operations for all metadata of the associative arrays. While it has been shown (Barceló et al., 2019) that associative arrays can express relational operations, we believe that using the relational abstraction together with our text-centric algebraic operations makes the system easier to program and interpret. At a more basic level, since most text processing operations involve sorting (e.g., by TF-IDF scores), our model is based on partially ordered semirings.

Definition 2.1 (Semiring).

A semiring is a set $R$ with two binary operations, addition $\oplus$ and multiplication $\odot$, such that: 1) $\oplus$ is associative and commutative and has an identity element $0 \in R$; 2) $\odot$ is associative with an identity element $1 \in R$; 3) $\odot$ distributes over $\oplus$; and 4) multiplication by $0$ annihilates $R$.

Definition 2.2 (Partially-Ordered Semiring).

(Golan, 2013) A semiring $R$ is partially ordered if and only if there exists a partial order relation $\leq$ on $R$ satisfying the following conditions for all $a, b \in R$:

  • If $a \leq b$, then $a \oplus c \leq b \oplus c$;

  • If $a \leq b$ and $0 \leq c$, then $a \odot c \leq b \odot c$ and $c \odot a \leq c \odot b$.

Definition 2.3 (Text Associative Array).

The Text Associative Array (TAA) $\mathbf{A}$ is defined as a mapping:

$$\mathbf{A}: K_1 \times K_2 \to R$$

where $K_1$ and $K_2$ are two key sets (named the row key set and the column key set, respectively), and $R$ is a partially-ordered semiring (Definition 2.2). We call $K_1 \times K_2$ "the dimension of $\mathbf{A}$", and denote by $\mathbf{A}.K_1$, $\mathbf{A}.K_2$, and $\mathbf{A}.K$ the row key set, the column key set, and the set of key pairs of $\mathbf{A}$, respectively.
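To make the definition concrete, here is a minimal, illustrative Python encoding of a TAA (our own sketch, not part of the paper): a sparse dict keyed by (row key, column key) pairs, with the semiring fixed to the familiar (float, +, ×) and absent entries standing for the semiring zero. The helper names `taa`, `row_keys`, and `col_keys` are assumptions of this sketch.

```python
# Hypothetical sparse encoding of a TAA A : K1 x K2 -> R.
# Assumption: the semiring (R, ⊕, ⊙) is (float, +, *); entries equal to
# the semiring zero are simply not stored.

def taa(entries):
    """Build a sparse TAA from a dict of (k1, k2) -> value, dropping zeros."""
    return {k: v for k, v in entries.items() if v != 0}

def row_keys(A):
    """The row key set A.K1 (of the stored, nonzero entries)."""
    return {k1 for (k1, _) in A}

def col_keys(A):
    """The column key set A.K2 (of the stored, nonzero entries)."""
    return {k2 for (_, k2) in A}

# A tiny document-term matrix M : D x T -> R
M = taa({("d1", "sunny"): 2, ("d1", "day"): 1, ("d2", "day"): 0})
```

Later sketches in this section reuse this dict-of-key-pairs encoding.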

Next, we define the basic operations on text associative arrays, to be used by our primary text operations (Sec. 2.2).

Definition 2.4 (Addition).

Given two TAAs $\mathbf{A}, \mathbf{B}: K_1 \times K_2 \to R$, the addition operation $\mathbf{C} = (\mathbf{A} \oplus \mathbf{B}): K_1 \times K_2 \to R$ is defined as

$$\mathbf{C}(k_1, k_2) = (\mathbf{A} \oplus \mathbf{B})(k_1, k_2) = \mathbf{A}(k_1, k_2) \oplus \mathbf{B}(k_1, k_2).$$

Define $\mathbb{0}_{K_1,K_2}$ as the TAA where $\mathbb{0}_{K_1,K_2}(k_1, k_2) = 0$ for all $k_1 \in K_1, k_2 \in K_2$. $\mathbb{0}_{K_1,K_2}$ serves as the identity for the addition operation on key set $K_1 \times K_2$.

Definition 2.5 (Hadamard Product).

Given two TAAs $\mathbf{A}, \mathbf{B}: K_1 \times K_2 \to R$, the Hadamard product operation $\mathbf{C} = (\mathbf{A} \odot \mathbf{B}): K_1 \times K_2 \to R$ is defined as

$$\mathbf{C}(k_1, k_2) = (\mathbf{A} \odot \mathbf{B})(k_1, k_2) = \mathbf{A}(k_1, k_2) \odot \mathbf{B}(k_1, k_2).$$

Define $\mathbb{1}_{K_1,K_2}$ as the TAA where $\mathbb{1}_{K_1,K_2}(k_1, k_2) = 1$ for all $k_1 \in K_1, k_2 \in K_2$. $\mathbb{1}_{K_1,K_2}$ serves as the identity for the Hadamard product on key set $K_1 \times K_2$.

Definition 2.6 (Array Multiplication).

Given two TAAs $\mathbf{A}: K_1 \times K_2 \to R$ and $\mathbf{B}: K_2 \times K_3 \to R$, the array multiplication operation $\mathbf{C} = (\mathbf{A} \otimes \mathbf{B}): K_1 \times K_3 \to R$ is defined as

$$\mathbf{C}(k_1, k_3) = (\mathbf{A} \otimes \mathbf{B})(k_1, k_3) = \bigoplus_{k_2 \in K_2} \mathbf{A}(k_1, k_2) \odot \mathbf{B}(k_2, k_3).$$
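The three operations above can be sketched directly over the sparse-dict TAA encoding; this is an illustration assuming the semiring (float, +, ×), with missing key pairs standing for the semiring zero, not the paper's implementation.

```python
# Sketch of Definitions 2.4-2.6 over a sparse-dict TAA:
# a TAA is a dict {(k1, k2): value}; a missing key pair means zero.

def add(A, B):
    """Definition 2.4: elementwise ⊕ on a shared dimension K1 x K2."""
    C = dict(A)
    for k, v in B.items():
        C[k] = C.get(k, 0) + v
    return {k: v for k, v in C.items() if v != 0}

def hadamard(A, B):
    """Definition 2.5: elementwise ⊙; zero annihilates, so only shared keys survive."""
    return {k: A[k] * B[k] for k in A.keys() & B.keys()}

def matmul(A, B):
    """Definition 2.6: (A ⊗ B)(k1, k3) = ⊕ over k2 of A(k1, k2) ⊙ B(k2, k3)."""
    C = {}
    for (k1, k2), a in A.items():
        for (k2b, k3), b in B.items():
            if k2 == k2b:
                C[(k1, k3)] = C.get((k1, k3), 0) + a * b
    return {k: v for k, v in C.items() if v != 0}

A = {("x", "p"): 1, ("x", "q"): 2}
B = {("p", "y"): 3, ("q", "y"): 4}
# matmul(A, B) == {("x", "y"): 11}, since 1*3 + 2*4 = 11
```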
Definition 2.7 (Array Identity).

Given two key sets $K_1$ and $K_2$, and a partial function $f: K_1 \hookrightarrow K_2$, the array identity $\mathbb{E}_{K_1,K_2,f}: K_1 \times K_2 \to R$ is defined as the TAA such that

$$\mathbb{E}_{K_1,K_2,f}(k_1, k_2) = \begin{cases} 1, & \text{if } k_1 \in \operatorname{dom} f \text{ and } k_2 = f(k_1); \\ 0, & \text{otherwise}. \end{cases}$$

Specifically, if $\operatorname{dom} f = K_1 \cap K_2$ and $f(k_1) = k_1$ for all $k_1 \in \operatorname{dom} f$, then $\mathbb{E}_{K_1,K_2,f}$ is abbreviated to $\mathbb{E}_{K_1,K_2}$.

In general, $\mathbb{E}_{K_1,K_2,f}$ is not an identity for array multiplication. However, $\mathbb{E}_{K,K}$ is the identity element for array multiplication on associative arrays $K \times K \to R$.

Definition 2.8 (Kronecker Product).

Given two TAAs $\mathbf{A}: K_1 \times K_2 \to R$ and $\mathbf{B}: K_3 \times K_4 \to R$, their Kronecker product $\mathbf{C} = \mathbf{A} \circledast \mathbf{B}: (K_1 \times K_3) \times (K_2 \times K_4) \to R$ is defined by

$$\mathbf{C}((k_1, k_3), (k_2, k_4)) = \mathbf{A}(k_1, k_2) \odot \mathbf{B}(k_3, k_4).$$
Definition 2.9 (Transpose).

Given a TAA $\mathbf{A}: K_1 \times K_2 \to R$, its transpose, denoted $\mathbf{A}^{\mathsf{T}}$, is defined by $\mathbf{A}^{\mathsf{T}}: K_2 \times K_1 \to R$ where $\mathbf{A}^{\mathsf{T}}(k_2, k_1) = \mathbf{A}(k_1, k_2)$ for $k_1 \in K_1$ and $k_2 \in K_2$.
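The array identity and the transpose are also easy to state in the sparse-dict encoding; a partial function $f$ is represented here as a Python dict, so "$k_1 \in \operatorname{dom} f$" becomes `k1 in f`. A hedged sketch, not the paper's code:

```python
# Sketch of Definitions 2.7 and 2.9 in the sparse-dict TAA encoding.

def identity(f, K2):
    """E_{K1,K2,f}: value 1 exactly at (k1, f(k1)) for k1 in dom f, f(k1) in K2."""
    return {(k1, k2): 1 for k1, k2 in f.items() if k2 in K2}

def transpose(A):
    """Definition 2.9: swap row and column keys."""
    return {(k2, k1): v for (k1, k2), v in A.items()}

E = identity({"a": "x", "b": "y"}, {"x", "y", "z"})
A = {("x", 1): 5}
# transpose is an involution: transpose(transpose(A)) == A
```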

2.2. Text Operations

We can express a number of fundamental text operations using the proposed TAA algebra. We first define three basic TAAs specifically for text analytics; a series of text operations is then defined on general TAAs or on these basic structures.

Definition 2.10 (Document-Term Matrix).

Given a text corpus, a document-term matrix is defined as a TAA $\mathbf{M}: D \times T \to R$ where $D$ and $T$ are the document set and the term set of the corpus.

The term set of the document-term matrix can be the vocabulary or the bigrams of the corpus, or an application-specific, user-defined set of interesting terms. The matrix value $\mathbf{M}(d, t)$ can also take different semantics: in one application it can be the number of occurrences of $t$ in document $d$, while in another it can be the term frequency-inverse document frequency (tf-idf). Typically, elements of $D$ and $T$ will have additional relational metadata; a document may have a date, and a term may have an annotation like a part-of-speech (POS) tag.

Definition 2.11 (Term-Index Matrix).

Given a document $d$, the term-index matrix is defined as a TAA $\mathbf{N}: T_d \times I \to \{0, 1\}$ where $T_d = \{d\} \times T$ is the set of terms in document $d$ and $I = \{1, \cdots, I_d\}$ is the index set ($I_d$ is the size of $d$). Specifically, for $(d, t) \in T_d$ and $i \in I$,

$$\mathbf{N}((d, t), i) = \begin{cases} 1, & \text{if the } i\text{-th word of document } d \text{ is } t; \\ 0, & \text{otherwise}. \end{cases}$$

Example 2. For a document $d$ = "Today is a sunny day", let its term-index matrix be $\mathbf{N}: (\{d\} \times T) \times I \to \{0, 1\}$. Then we have $T$ = {"today", "is", "a", "sunny", "day"} and $I = \{1, 2, 3, 4, 5\}$, with $\mathbf{N}(\text{"today"}, 1) = 1$, $\mathbf{N}(\text{"is"}, 2) = 1$, $\mathbf{N}(\text{"a"}, 3) = 1$, $\mathbf{N}(\text{"sunny"}, 4) = 1$, $\mathbf{N}(\text{"day"}, 5) = 1$, and $\mathbf{N}(t, i) = 0$ for all other $(t, i) \in T \times I$.
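Example 2 can be reproduced with a few lines of Python over the sparse-dict encoding; whitespace tokenization and lowercasing are simplifying assumptions of this sketch, not part of the definition.

```python
# Sketch of Definition 2.11 / Example 2: the term-index matrix N of a
# single document, stored sparsely over ((d, t), i) pairs with
# 1-based positions; absent pairs are implicitly 0.

def term_index_matrix(d, text):
    return {((d, t), i): 1
            for i, t in enumerate(text.lower().split(), start=1)}

N = term_index_matrix("d", "Today is a sunny day")
# N[(("d", "sunny"), 4)] == 1
```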

Definition 2.12 (Term Vector).

There are two types of term vectors: 1) given the set of terms $T$ of a document $d$, the term vector is defined as a TAA $\mathbf{V}: \{d\} \times T \to R$; 2) given the set of terms $T$ of a collection of documents $D$, $\mathbf{V}: \{1\} \times T \to R$ is a term vector for the corpus $D$.

A term vector represents some attribute of terms in the scope of one document or a corpus. For example, for a document $d$, the value of the term vector $\mathbf{V}: \{d\} \times T \to R$ can be the number of occurrences of each term in that document. For a corpus $D$, the value of its term vector $\mathbf{V}: \{1\} \times T \to R$ can be the idf value of each term in the whole corpus; such a value is not specific to a single document.

Based on these structures, we can define our unit text operators as follows. Some operators are defined for general TAAs, while others are defined for a specific type of TAA.

Definition 2.13 (Extraction).

Given a TAA $\mathbf{A}: K_1 \times K_2 \to R$ and two projection sets $K_1' \subseteq K_1$, $K_2' \subseteq K_2$, we define the extraction operation as

$$\Pi_{K_1',K_2'}(\mathbf{A}) = \mathbb{E}_{K_1',K_1} \otimes \mathbf{A} \otimes \mathbb{E}_{K_2',K_2}^{\mathsf{T}}.$$

Let $\mathbf{B} = \Pi_{K_1',K_2'}(\mathbf{A})$; then $\mathbf{B}(k_1, k_2) = \mathbf{A}(k_1, k_2)$ for all $(k_1, k_2) \in K_1' \times K_2'$.

When only row keys are extracted, the operation is written $\Pi_{K_1',:}$; when only column keys are extracted, it is written $\Pi_{:,K_2'}$.
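In the sparse-dict encoding, the identity-array sandwich $\mathbb{E} \otimes \mathbf{A} \otimes \mathbb{E}^{\mathsf{T}}$ reduces to keeping the entries whose keys lie in the projection sets; a sketch under that encoding, with `None` standing for the $\Pi_{K_1',:}$ / $\Pi_{:,K_2'}$ "keep all" shorthands:

```python
# Sketch of Definition 2.13 (extraction) over a sparse-dict TAA.

def extract(A, K1p=None, K2p=None):
    """Keep entries whose row key is in K1p and column key is in K2p."""
    return {(k1, k2): v for (k1, k2), v in A.items()
            if (K1p is None or k1 in K1p) and (K2p is None or k2 in K2p)}

M = {("d1", "cat"): 2, ("d1", "dog"): 1, ("d2", "cat"): 3}
# extract(M, K2p={"cat"}) keeps only the "cat" column
```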

Definition 2.14 (Rename).

Given a TAA $\mathbf{A}: K_1 \times K_2 \to R$, suppose $K_2'$ is another column key set and there exists a bijection $f: K_2 \to K_2'$. The column rename operation is defined as

$$\rho_{K_1, K_2 \to K_2', f}(\mathbf{A}) = \mathbf{A} \otimes \mathbb{E}_{K_2,K_2',f}.$$

Similarly, given another row key set $K_1'$ and a bijection $f: K_1 \to K_1'$, the row rename operation is defined as

$$\rho_{K_1 \to K_1', K_2, f}(\mathbf{A}) = \mathbb{E}_{K_1',K_1,f^{-1}} \otimes \mathbf{A}.$$

The subscript $f$ can be omitted if the bijection is clear, e.g., when $|\operatorname{dom} f| = 1$. In addition, the row and column rename operations can be combined as $\rho_{K_1 \to K_1', K_2 \to K_2'}(\mathbf{A})$. Our rename operator is more general than the rename operation of relational algebra since it supports renaming both the row key set and the column key set.

Definition 2.15 (Apply).

Given a TAA $\mathbf{A}: K_1 \times K_2 \to R$ and a function $f: R \to R$, define the apply operator by $\mathbf{Apply}_f(\mathbf{A}): K_1 \times K_2 \to R$ where

$$\mathbf{Apply}_f(\mathbf{A})(k_1, k_2) = f(\mathbf{A}(k_1, k_2)), \quad \forall (k_1, k_2) \in K_1 \times K_2.$$
Definition 2.16 (Filter).

Given a TAA $\mathbf{A}: K_1 \times K_2 \to R$ and an indicator function $f: R \to \{0, 1\}$, define the filter operation on $\mathbf{A}$ as

$$\mathbf{B} = \mathbf{Filter}_f(\mathbf{A}) = \sigma_f(\mathbf{A}): K_{1f} \times K_{2f} \to R,$$

where $K_{1f} \times K_{2f} = \{(k_1, k_2) \mid (k_1, k_2) \in K_1 \times K_2 \text{ and } f(\mathbf{A}(k_1, k_2)) = 1\}$ and $\mathbf{B}(k_1, k_2) = \mathbf{A}(k_1, k_2)$.

Definition 2.17 (Sort).

Given a TAA $\mathbf{A}: K_1 \times K_2 \to R$, for any $k \in K_1$ we extract a TAA $\mathbf{V} = \Pi_{\{k\},:}(\mathbf{A})$ of dimension $\{k\} \times K_2$. Since $R$ is a partially-ordered semiring (Definition 2.2), the value set $\{\mathbf{V}(k, x) \mid x \in K_2\} \subseteq R$ inherits the partial order from $R$, which induces an ordering $\mathbf{V}(k, x_1) \leq \mathbf{V}(k, x_2) \leq \cdots \leq \mathbf{V}(k, x_{|K_2|})$. Define $\mathbf{Idx}(k, x_i) = i$; then the sort-by-column operation is defined as

$$\mathbf{Sort}_2(\mathbf{A}): K_1 \times K_2 \to \{1, \cdots, |K_2|\},$$

where $\mathbf{Sort}_2(\mathbf{A})(k, x) = \mathbf{Idx}(k, x)$. Similarly, the sort-by-row operation is defined as

$$\mathbf{Sort}_1(\mathbf{A}): K_1 \times K_2 \to \{1, \cdots, |K_1|\}.$$

When the column key dimension or the row key dimension is 1 (e.g., for a term vector), $\mathbf{Sort}_1$ or $\mathbf{Sort}_2$ is abbreviated to $\mathbf{Sort}$.
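A sketch of $\mathbf{Sort}_2$ over the sparse-dict encoding, under the simplifying assumptions that $R$ is totally ordered (ties broken by column key, so the ranks are deterministic) and that only the stored nonzero entries of each row are ranked:

```python
# Sketch of Definition 2.17 (Sort_2): replace each stored value in a
# row with its ascending rank Idx(k, x) within that row.

def sort2(A):
    out = {}
    for k1 in {r for (r, _) in A}:
        row = sorted((v, k2) for (r, k2), v in A.items() if r == k1)
        for i, (_, k2) in enumerate(row, start=1):
            out[(k1, k2)] = i
    return out

V = {("d", "a"): 0.9, ("d", "b"): 0.1, ("d", "c"): 0.5}
# sort2(V) == {("d", "b"): 1, ("d", "c"): 2, ("d", "a"): 3}
```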

Definition 2.18 (Merge).

Given two TAAs $\mathbf{A}: K_{A1} \times K_{A2} \to R$ and $\mathbf{B}: K_{B1} \times K_{B2} \to R$ with $(K_{A1} \times K_{A2}) \cap (K_{B1} \times K_{B2}) = \emptyset$, the merge operation is defined as

$$\mathbf{C} = \mathbf{Merge}(\mathbf{A}, \mathbf{B}): K_1 \times K_2 \to R$$

where $K_1 = K_{A1} \cup K_{B1}$, $K_2 = K_{A2} \cup K_{B2}$, and

$$\mathbf{C}(k_1, k_2) = \begin{cases} \mathbf{A}(k_1, k_2), & \text{if } (k_1, k_2) \in K_{A1} \times K_{A2}; \\ \mathbf{B}(k_1, k_2), & \text{if } (k_1, k_2) \in K_{B1} \times K_{B2}; \\ 0, & \text{otherwise.} \end{cases}$$
Definition 2.19 (Expand).

Given an elementwise binary operator $\mathbf{OP}$ on associative arrays (e.g., $\oplus$ or $\odot$), a term vector $\mathbf{V}: \{1\} \times T \to R$, and a document-term matrix $\mathbf{M}: D \times T \to R$, the expand operator is defined as

$$\mathbf{Expand}_{\mathbf{OP}}(\mathbf{V}, \mathbf{M}) = \rho_{\{1\} \times D \to D,\, T \times \{1\} \to T}\left(\mathbf{V} \circledast \mathbb{1}_{D,\{1\}}\right) \ \mathbf{OP} \ \mathbf{M}.$$

This operator implicitly expands the term vector $\mathbf{V}$ into another associative array $\mathbf{M'}: D \times T \to R$ where $\mathbf{M'}(d, t) = \mathbf{V}(1, t)$ for all $d \in D$ and $t \in T$, and then applies $\mathbf{OP}$ to $\mathbf{M'}$ and $\mathbf{M}$.

Suppose that for a corpus $D$ there is a term vector $\mathbf{V}: \{1\} \times T \to R$ where $\mathbf{V}(1, t)$ is the mean occurrence of term $t$ in $D$ (i.e., $\frac{Count_t}{|D|}$, where $Count_t$ is the total number of occurrences of $t$ in $D$), and a document-term matrix $\mathbf{M}: D \times T \to R$. Then

$$\mathbf{Expand}_{\oplus}(\mathbf{Apply}_{f(x)=-x}(\mathbf{V}), \mathbf{M})$$

generates, for each document, the difference of each term's occurrence count from its average occurrence.
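The mean-occurrence example can be sketched with sparse dicts and the semiring (float, +, ×); for brevity this sketch only produces entries where $\mathbf{M}$ is stored (nonzero), which suffices for the example.

```python
# Sketch of Definition 2.19 (Expand): broadcast V(1, t) over the
# documents, then apply the elementwise operator OP.

def expand(op, V, M):
    return {(d, t): op(V.get((1, t), 0), v) for (d, t), v in M.items()}

M = {("d1", "cat"): 3, ("d2", "cat"): 1}      # occurrence counts
V = {(1, "cat"): -2}                           # Apply_{f(x)=-x} of the mean (3+1)/2
diff = expand(lambda a, b: a + b, V, M)        # Expand with OP = ⊕
# diff == {("d1", "cat"): 1, ("d2", "cat"): -1}
```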

Definition 2.20 (Flatten).

Given an associative array $\mathbf{A}: K_1 \times K_2 \to R$, the flatten operation is defined by $\mathbf{Flatten}(\mathbf{A}): \{1\} \times (K_1 \times K_2) \to R$ where

$$\mathbf{Flatten}(\mathbf{A})(1, (k_1, k_2)) = \mathbf{A}(k_1, k_2) \text{ for all } (k_1, k_2) \in K_1 \times K_2.$$
Definition 2.21 (Left Shift).

Given a term-index matrix $\mathbf{N}: (\{d\} \times T) \times I \to R$ and a non-negative integer $n$, define the left shift operator by $\mathbf{LShift}_n(\mathbf{N}): (\{d\} \times T) \times I \to R$ where

$$\mathbf{LShift}_n(\mathbf{N}) = \mathbf{LShift}_1(\mathbf{LShift}_{n-1}(\mathbf{N}))$$

and

$$\mathbf{LShift}_1(\mathbf{N})((d, t), i) = \begin{cases} \mathbf{N}((d, t), i+1), & \text{if } i < |I|; \\ 0, & \text{if } i = |I|. \end{cases}$$

For a term-index matrix $\mathbf{N}$ of document $d$, $\mathbf{LShift}_1(\mathbf{N})$ generates another term-index matrix $\mathbf{N'}$ where $\mathbf{N'}((d, t), i) = 1$ when $t$ is the $(i+1)$-th word in $d$.

Definition 2.22 (Union).

Suppose there are two term-index matrices with the same index set $I$, $\mathbf{N}_1: (\{d\} \times T) \times I \to R$ and $\mathbf{N}_2: (\{d\} \times T) \times I \to R$. The union operation on $\mathbf{N}_1$ and $\mathbf{N}_2$ is defined by

$$\mathbf{Union}(\mathbf{N}_1, \mathbf{N}_2) = \rho_{(\{d\} \times T) \times (\{d\} \times T) \to \{d\} \times (T \times T),\, I \times I \to I}\left(\Pi_{:,\{(i,i) \mid i \in I\}}(\mathbf{N}_1 \circledast \mathbf{N}_2)\right).$$

Suppose $\mathbf{N} = \mathbf{Union}(\mathbf{N}_1, \mathbf{N}_2)$; then

$$\mathbf{N}((d, (t_1, t_2)), i) = \begin{cases} 1, & \text{if } \mathbf{N}_1((d, t_1), i) = 1 \text{ and } \mathbf{N}_2((d, t_2), i) = 1; \\ 0, & \text{otherwise}. \end{cases}$$

The left shift and union operations can be composed to compute all bigrams of a document. Given the term-index matrix $\mathbf{N}$ of document $d$, let $\mathbf{N'} = \mathbf{Union}(\mathbf{N}, \mathbf{LShift}_1(\mathbf{N}))$; then $\mathbf{N'}((d, (t_1, t_2)), i) = 1$ when $(t_1, t_2)$ is the $i$-th bigram in document $d$.
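The bigram composition above can be sketched over the sparse-dict encoding of term-index matrices; whitespace tokenization is a simplifying assumption of this sketch.

```python
# Sketch of Definitions 2.21-2.22 composed as in the text:
# N' = Union(N, LShift_1(N)) enumerates the bigrams of a document.

def term_index_matrix(d, text):
    return {((d, t), i): 1
            for i, t in enumerate(text.lower().split(), start=1)}

def lshift(N, n=1):
    """LShift_n: move every 1 at position i to position i - n."""
    return {((d, t), i - n): 1 for ((d, t), i) in N if i - n >= 1}

def union(N1, N2):
    """Pair up terms of N1 and N2 that sit at the same position."""
    out = {}
    for ((d, t1), i) in N1:
        for ((d2, t2), j) in N2:
            if d == d2 and i == j:
                out[((d, (t1, t2)), i)] = 1
    return out

N = term_index_matrix("d", "today is a sunny day")
bigrams = union(N, lshift(N))
# bigrams[(("d", ("sunny", "day")), 4)] == 1, and there are 4 bigrams
```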

Definition 2.23 (Sum).

The sum operation takes as input a TAA $\mathbf{A}: K_1 \times K_2 \to R$ and an integer that can take the value 1 or 2, with different semantics depending on that value:

$$\mathbf{B}: \{1\} \times K_2 \to R = \mathbf{Sum}_1(\mathbf{A}), \text{ where } \mathbf{B}(1, k_2) = \bigoplus_{k_1 \in K_1} \mathbf{A}(k_1, k_2);$$
$$\mathbf{B}: K_1 \times \{1\} \to R = \mathbf{Sum}_2(\mathbf{A}), \text{ where } \mathbf{B}(k_1, 1) = \bigoplus_{k_2 \in K_2} \mathbf{A}(k_1, k_2).$$
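Both variants can be sketched over the sparse-dict encoding with $\oplus$ taken as ordinary addition:

```python
# Sketch of Definition 2.23: Sum_1 collapses the row keys onto {1};
# Sum_2 collapses the column keys onto {1}.

def sum1(A):
    out = {}
    for (k1, k2), v in A.items():
        out[(1, k2)] = out.get((1, k2), 0) + v
    return out

def sum2(A):
    out = {}
    for (k1, k2), v in A.items():
        out[(k1, 1)] = out.get((k1, 1), 0) + v
    return out

M = {("d1", "cat"): 2, ("d2", "cat"): 1, ("d2", "dog"): 4}
# sum1(M) gives per-term totals; sum2(M) gives per-document totals
```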

3. Text Analytic Tasks

3.1. Constructing a Document Term Matrix

As stated in Section 2.2, a document-term matrix is a common representation model for a collection of documents, where the terms can be a list of important terms, the whole vocabulary, or bigrams. The entries of the matrix can be either the occurrence counts of each term or tf-idf values.

Example 3. For a document collection $C$, build a document-term matrix where the terms are all unigrams and bigrams in $C$, and the values are the occurrences of each term in the whole corpus.

Suppose there is a tokenization function $\mathbf{Tokenize}$ that takes a document $d$ as input and generates a term-index matrix $\mathbf{N}: (\{d\} \times T) \times I$. The construction can be decomposed into two parts; the first part constructs a term vector for a single document $d$ containing all unigrams and bigrams together with their corresponding occurrences. Fig. 1 shows the construction process.

(1) $\mathbf{N} = \mathbf{Tokenize}(d) \quad : (\{d\} \times T) \times I$
(2) $\mathbf{V_1} = \rho_{\{1\} \to \{d\},\, \{d\} \times T \to T}(\mathbf{Sum}_2(\mathbf{N})^{\mathsf{T}}) \quad : \{d\} \times T$
(3) $\mathbf{T} = \mathbf{N} \otimes \mathbf{LShift}_1(\mathbf{N})^{\mathsf{T}} \quad : (\{d\} \times T) \times (\{d\} \times T)$
(4) $\mathbf{V_2} = \mathbf{Flatten}(\mathbf{T}) \quad : \{1\} \times ((\{d\} \times T) \times (\{d\} \times T))$
(5) $\mathbf{V_2} = \rho_{\{1\} \to \{d\},\, (\{d\} \times T) \times (\{d\} \times T) \to (T \times T)}(\mathbf{V_2}) \quad : \{d\} \times (T \times T)$
(6) $\mathbf{V_2} = \sigma_{f: x \to \mathbb{1}(x > 0)}(\mathbf{V_2}) \quad : \{d\} \times (T \times T)$
(7) $\mathbf{V}_d = \mathbf{Merge}(\mathbf{V_1}, \mathbf{V_2}) \quad : \{d\} \times (T \cup (T \times T))$

Figure 1. Algebraic representation for the task in Example 3.

Step 1 generates the term-index matrix where each term is a unigram. The $\mathbf{Sum}_2$ operation in Step 2 generates the term vector where $\mathbf{V_1}(d, t)$ is the occurrence count of unigram $t$ in document $d$. Steps 3–6 produce the term vector $\mathbf{V_2}$ whose column key set is the set of all bigrams in $d$. Step 7 concatenates the two term vectors to get the representation of $d$.

For each document $d_i$ in the collection $D = \{d_1, \cdots, d_n\}$, we obtain its term vector $\mathbf{V}_{d_i}: \{d_i\} \times (T_i \cup (T_i \times T_i)) \to R$ using the above steps, and then apply the $\mathbf{Merge}$ operation to get the document-term matrix $\mathbf{M}: D \times T \to R$, where $T = (T_1 \cup \cdots \cup T_n) \cup ((T_1 \times T_1) \cup \cdots \cup (T_n \times T_n))$ is the union of all unigrams and bigrams in the whole corpus:

$$\mathbf{Merge}(\mathbf{V}_{d_1}, \mathbf{Merge}(\mathbf{V}_{d_2}, \cdots, \mathbf{Merge}(\mathbf{V}_{d_{n-1}}, \mathbf{V}_{d_n}))).$$

Besides word occurrences, the values of a document-term matrix can also be tf-idf values. If all terms are considered, the document-term matrix $\mathbf{M}$ would be high-dimensional and sparse, and thus costly to manipulate. A simple and commonly adopted method to reduce the dimension is to select informative words. The following example presents the queries that produce a document-term matrix $\mathbf{M}$ with tf-idf values for only the informative terms, where informativeness is measured by idf value.

Example 4. Given a collection of documents $D$, we have to generate a document-term matrix $\mathbf{M}$ for the top 1000 "informative words" where $\mathbf{M}(d, t)$ is the tf-idf value of term $t$ in document $d$. Suppose there is a document-term matrix $\mathbf{M_1}$ that stores the occurrence counts of all unigrams in each document (its construction is similar to that of Example 3 and is thus skipped); $\mathbf{M}$ can then be generated by the following steps. The function $idf$ in Step 4 calculates the idf value, defined as $idf(x) = -\log \frac{x}{|D|}$, where $x$ is the number of documents that contain a specific term.

(1) $\mathbf{M_2} = \mathbf{Apply}_{f: x \to \mathbb{1}(x > 0)}(\mathbf{M_1}) \quad : D \times T$
(2) $\mathbf{V} = \mathbf{Sum}_1(\mathbf{M_2}) \quad : \{1\} \times T$
(3) $\mathbf{I} = \sigma_{f: x \to \mathbb{1}(x \leq 1000)}(\mathbf{Sort}(\mathbf{V})) \quad : \{1\} \times T'$
(4) $\mathbf{V_1} = \mathbf{Apply}_{idf}(\Pi_{:,\mathbf{I}.K_2}(\mathbf{V})) \quad : \{1\} \times T'$
(5) $\mathbf{M_3} = \Pi_{:,\mathbf{I}.K_2}(\mathbf{M_1}) \quad : D \times T'$
(6) $\mathbf{M} = \mathbf{Expand}_{\odot}(\mathbf{V_1}, \mathbf{M_3}) \quad : D \times T'$

Figure 2. Algebraic representation for the task in Example 4.
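The pipeline of Figure 2 can be sketched end to end in plain Python over the sparse-dict encoding. This is an illustration with our own helper name `tfidf_matrix`, not the paper's implementation; the "top 1000" cut-off is made a parameter `top_k` so a toy corpus works.

```python
import math

def tfidf_matrix(M1, top_k):
    """Steps 1-6 of Figure 2 over a sparse dict M1 : D x T -> counts."""
    D = {d for (d, _) in M1}
    # Steps 1-2: document frequency of each term (Apply, then Sum_1)
    df = {}
    for (d, t), v in M1.items():
        if v > 0:
            df[t] = df.get(t, 0) + 1
    # Step 3: keep the top_k most informative terms (smallest df = largest idf)
    keep = set(sorted(df, key=lambda t: df[t])[:top_k])
    # Steps 4-6: idf(x) = -log(x / |D|), then tf ⊙ idf on the kept terms
    return {(d, t): v * -math.log(df[t] / len(D))
            for (d, t), v in M1.items() if t in keep}

M1 = {("d1", "cat"): 2, ("d2", "cat"): 1, ("d1", "rare"): 1}
M = tfidf_matrix(M1, top_k=2)
# "cat" appears in both documents, so its idf (and tf-idf) is 0;
# "rare" appears in one of two documents, so its tf-idf is log(2)
```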

3.2. Using TAAs

For Example 1 introduced in Section 1, we express the analysis using relational algebra and the associative array operations. Suppose that the maximum number of words of a term in $L_o \cup L_p$ is 3; the analysis can then be expressed as follows. Step 1 is expressed in relational algebra. $\mathbf{TopicModel}$ in the last step is a function that takes a document-term matrix and produces a document-topic matrix and a topic-term matrix, the standard outputs of topic modeling, represented by two more TAAs $\mathbf{DTM}$ and $\mathbf{TTM}$. Let $\mathbf{T} = \sigma_{f: x \to \mathbb{1}(x \geq |D| - k)}(\mathbf{Sort}_2(\mathbf{DTM}))$; then $\mathbf{T}.K$ returns all $(d, t)$ pairs where $t$ is one of the top-$k$ topics for $d$.

(1) $D = \pi_{content}(\sigma_{d_1 \leq date \leq d_2}(R))$
(2) $\mathbf{M}: \{\} \times \{\} \to R, \quad \mathbf{FV}: \{\} \times \{\} \to R$
(3) for $d \in D$:
(3.1) $\quad \mathbf{N_1} = \mathbf{Tokenize}(d)$
(3.2) $\quad \mathbf{V} = \rho_{\{1\} \to \{d\},\, \{d\} \times T \to T}(\mathbf{Sum}_2(\mathbf{N_1})^{\mathsf{T}})$
(3.3) $\quad \mathbf{N_2} = \mathbf{Union}(\mathbf{N_1}, \mathbf{LShift}_1(\mathbf{N_1}))$
(3.4) $\quad \mathbf{N_3} = \mathbf{Union}(\mathbf{N_2}, \mathbf{LShift}_2(\mathbf{N_1}))$
(3.5) $\quad \mathbf{N} = \mathbf{Merge}(\mathbf{N_1}, \mathbf{Merge}(\mathbf{N_2}, \mathbf{N_3}))$
(3.6) $\quad \mathbf{V_f} = \rho_{\{1\} \to \{d\},\, \{d\} \times T' \to T'}(\mathbf{Sum}_2(\mathbf{N})^{\mathsf{T}})$
(3.7) $\quad \mathbf{FV} = \mathbf{Merge}(\mathbf{FV}, \mathbf{V_f})$
(3.8) $\quad \mathbf{M} = \mathbf{Merge}(\mathbf{M}, \mathbf{V})$
(4) $\mathbf{FV_o} = \Pi_{:,L_o}(\mathbf{FV})$
(5) $\mathbf{FV_p} = \Pi_{:,L_p}(\mathbf{FV})$
(6) $\mathbf{I_o} = \sigma_{f: x \to \mathbb{1}(x > c_1)}(\mathbf{Sum}_2(\mathbf{FV_o}))$
(7) $\mathbf{I_p} = \sigma_{f: x \to \mathbb{1}(x > c_2)}(\mathbf{Sum}_2(\mathbf{FV_p}))$
(8) $\mathbf{M} = \Pi_{\mathbf{I_o}.K_1 \cap \mathbf{I_p}.K_1,\,:}(\mathbf{M})$
(9) $\mathbf{I_t} = \sigma_{f: x \to \mathbb{1}(x \geq \theta_2)}(\mathbf{Sum}_1(\mathbf{M}))$
(10) $\mathbf{I_d} = \sigma_{f: x \to \mathbb{1}(x \geq \theta_1)}(\mathbf{Sum}_2(\mathbf{M}))$
(11) $\mathbf{M} = \Pi_{\mathbf{I_d}.K_1,\, \mathbf{I_t}.K_2}(\mathbf{M})$
(12) $\mathbf{DTM}, \mathbf{TTM} = \mathbf{TopicModel}(\mathbf{M})$

Figure 3. Algebraic representation for the task in Example 1.

References

  • Barceló et al. (2019) Pablo Barceló, Nelson Higuera, Jorge Pérez, and Bernardo Subercaseaux. 2019. On the Expressiveness of LARA: A Unified Language for Linear and Relational Algebra. arXiv preprint arXiv:1909.11693 (2019).
  • Golan (2013) Jonathan S Golan. 2013. Semirings and affine equations over them: theory and applications. Vol. 556. Springer Science & Business Media.
  • Jananthan et al. (2017) Hayden Jananthan, Ziqi Zhou, Vijay Gadepally, Dylan Hutchison, Suna Kim, and Jeremy Kepner. 2017. Polystore mathematics of relational algebra. In Int. Conf. on Big Data. IEEE, 3180–3189.
  • Kepner et al. (2020) Jeremy Kepner, Vijay Gadepally, Hayden Jananthan, Lauren Milechin, and Siddharth Samsi. 2020. AI Data Wrangling with Associative Arrays. arXiv preprint arXiv:2001.06731 (2020).
  • Leclercq et al. (2019) Éric Leclercq, Annabelle Gillet, Thierry Grison, and Marinette Savonnet. 2019. Polystore and Tensor Data Model for Logical Data Independence and Impedance Mismatch in Big Data Analytics. In Trans. on Large-Scale Data-and Knowledge-Centered Systems XLII. Springer, 51–90.