An Algebraic Approach for High-level Text Analytics
Abstract.
Text analytical tasks like word embedding, phrase mining and topic modeling place increasing demands on, and pose new challenges to, existing database management systems. In this paper, we provide a novel algebraic approach based on associative arrays. Our data model and algebra bring together relational operators and text operators, which enables interesting optimization opportunities for hybrid data sources that contain both relational and textual data. We demonstrate its expressive power in text analytics using several real-world tasks.
1. Introduction
A significant part of today’s analytical tasks involve text operations. A data scientist who has to manipulate and analyze text data today typically uses a set of text analysis software libraries (e.g., NLTK, Stanford CoreNLP, GenSim) for tasks like word embedding, phrase extraction, named entity recognition and topic modeling. In addition, most DBMSs today have built-in support for full-text search. PostgreSQL, for instance, provides a text-vector type (tsvector) that extracts and stores term and positional indices to enable efficient text queries (tsquery). Yet, some common and seemingly simple text analysis tasks cannot be performed within the boundaries of a single information system.
Example 1. Consider a relational table News(newsID, date, newspaper, title, content) where title and content are text-valued attributes, and two sets $O$ and $P$ that represent a collection of organization names and person names respectively. Now, consider the following analysis:
• Select a subset $N_1$ of news articles from date $d_1$ through $d_2$.
• Identify all news articles $N_2$ in $N_1$ that have at least $k_1$ organization names from $O$ and $k_2$ persons from $P$.
• Create a document-term matrix $M$ on $N_2$.
• Remove rows and columns of the matrix if either of their row or column marginal sums is below thresholds $t_r$ and $t_c$ respectively.
• Compute a topic model using $M$.
The intention of the analysis is to find the topic distribution of those news items that cover, for example, any two members of the senate (list $P$) and any one government organization (list $O$). The analysis itself is straightforward and can be performed with a combination of SQL queries and Python scripts.
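To make the pipeline concrete, the following is a minimal sketch in plain Python. The corpus, the entity lists, the thresholds, and the whitespace tokenization are all hypothetical stand-ins; the document-term matrix is represented as a sparse dict keyed by (document, term) pairs, and the final topic-modeling call is left as a stub.

```python
from collections import Counter

# Hypothetical corpus: (newsID, date, newspaper, title, content)
news = [
    (1, "2020-01-02", "Post", "Senate vote", "senator smith met the fda today"),
    (2, "2020-01-05", "Post", "Weather", "sunny day expected in the bay"),
    (3, "2020-02-01", "Herald", "Hearing", "senator jones questioned the fda"),
]
orgs, persons = {"fda"}, {"smith", "jones"}

# Steps 1-2: relational selections, by date range and then by entity counts
# (requiring at least 1 organization and 1 person here; both thresholds are ours).
selected = [r for r in news if "2020-01-01" <= r[1] <= "2020-01-31"]
selected = [r for r in selected
            if sum(t in orgs for t in set(r[4].split())) >= 1
            and sum(t in persons for t in set(r[4].split())) >= 1]

# Step 3: document-term matrix as a sparse associative array {(doc, term): count}.
dtm = {(r[0], t): c for r in selected for t, c in Counter(r[4].split()).items()}

# Step 4: drop rows/columns whose marginal sums fall below (hypothetical) thresholds.
row_sum, col_sum = Counter(), Counter()
for (d, t), c in dtm.items():
    row_sum[d] += c
    col_sum[t] += c
dtm = {(d, t): c for (d, t), c in dtm.items()
       if row_sum[d] >= 2 and col_sum[t] >= 1}

# Step 5 would pass `dtm` to a topic-modeling routine (e.g., LDA).
```

Only article 1 survives both selections here: article 3 falls outside the date range and article 2 mentions no listed entity.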
Our goal in this short paper is to present the idea that a novel relation-flanked associative array data model has the potential of serving as the underlying framework for the management and analysis of text-centric data. We develop the theoretical elements of the model and illustrate its utility through examples.
2. The Data Model
2.1. Text Associative Arrays
A number of current data systems, typically in the domain of polystore data systems, use associative arrays (Jananthan et al., 2017; Kepner et al., 2020) or variants of them like associative tables (Barceló et al., 2019) and the tensor data model (Leclercq et al., 2019). Many of these data models are used to support analytical (e.g., machine learning) tasks. In our setting, we specialize the essential associative model for text analytics. At our level of abstraction, our model reuses relational operations for all metadata of the associative arrays. While it has been shown (Barceló et al., 2019) that associative arrays can express relational operations, we believe that using the relational abstraction along with our text-centric algebraic operations makes the system easier to program and interpret. At a more basic level, since most text processing operations include sorting (e.g., by TF-IDF scores), our model is based on partially ordered semirings.
Definition 2.1 (Semiring).
A semiring is a set $R$ with two binary operations, addition $+$ and multiplication $\cdot$, such that: 1) $+$ is associative and commutative and has an identity element $0$; 2) $\cdot$ is associative with an identity element $1$; 3) $\cdot$ distributes over $+$; and 4) multiplication by $0$ annihilates $R$, i.e., $0 \cdot a = a \cdot 0 = 0$ for all $a \in R$.
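The axioms can be made concrete with a small sketch. The `Semiring` container and the axiom checker below are our own illustrative helpers, not part of the paper's model; the two instances are the ordinary counting semiring and the max-plus ("tropical") semiring, both standard examples.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Semiring:
    add: Callable   # the + operation
    mul: Callable   # the . operation
    zero: object    # identity of +, annihilator of .
    one: object     # identity of .

counting = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1)
# max-plus semiring: "add" is max, "mul" is +, zero is -inf, one is 0
tropical = Semiring(max, lambda a, b: a + b, float("-inf"), 0)

def check_axioms(S, a, b, c):
    assert S.add(a, S.add(b, c)) == S.add(S.add(a, b), c)            # + associative
    assert S.add(a, b) == S.add(b, a)                                # + commutative
    assert S.add(a, S.zero) == a                                     # additive identity
    assert S.mul(a, S.one) == a                                      # multiplicative identity
    assert S.mul(a, S.add(b, c)) == S.add(S.mul(a, b), S.mul(a, c))  # distributivity
    assert S.mul(a, S.zero) == S.zero                                # 0 annihilates

check_axioms(counting, 2, 3, 5)
check_axioms(tropical, 2.0, 3.0, 5.0)
```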
Definition 2.2 (Partially-Ordered Semiring).
(Golan, 2013) A semiring $(R, +, \cdot)$ is partially ordered if and only if there exists a partial order relation $\leq$ on $R$ satisfying the following conditions for all $a, b, c \in R$:
• If $a \leq b$, then $a + c \leq b + c$;
• If $a \leq b$ and $0 \leq c$, then $a \cdot c \leq b \cdot c$ and $c \cdot a \leq c \cdot b$.
Definition 2.3 (Text Associative Array).
The Text Associative Array (TAA) is defined as a mapping:
$$\mathbf{A}: K_1 \times K_2 \to V,$$
where $K_1$ and $K_2$ are two key sets (named row key set and column key set respectively), and $V$ is a partially-ordered semiring (Definition 2.2). We call $|K_1| \times |K_2|$ “the dimension of $\mathbf{A}$”, and denote $K_1$, $K_2$, and $K_1 \times K_2$ as the row key set, column key set, and set of key pairs of $\mathbf{A}$, respectively.
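Concretely, a sparse TAA can be sketched as a Python dict keyed by (row key, column key) pairs, with absent pairs mapping to the semiring zero. The representation and helper names below are ours, not the paper's.

```python
# A small document-term TAA: rows are documents, columns are terms.
doc_term = {
    ("d1", "senate"): 3, ("d1", "budget"): 1,
    ("d2", "senate"): 1, ("d2", "weather"): 2,
}

def row_keys(A):
    return {k1 for k1, _ in A}

def col_keys(A):
    return {k2 for _, k2 in A}

def dimension(A):
    # |K1| x |K2|, the "dimension" of the array
    return (len(row_keys(A)), len(col_keys(A)))
```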
Next, we define the basic operations on text associative arrays, to be used by our primary text operations (Sec. 2.2).
Definition 2.4 (Addition).
Given two TAAs $\mathbf{A}, \mathbf{B}: K_1 \times K_2 \to V$, the addition operation is defined as
$$(\mathbf{A} \oplus \mathbf{B})(k_1, k_2) = \mathbf{A}(k_1, k_2) + \mathbf{B}(k_1, k_2).$$
Define $\mathbf{0}: K_1 \times K_2 \to V$ as a TAA where $\mathbf{0}(k_1, k_2) = 0$ for all $(k_1, k_2) \in K_1 \times K_2$. $\mathbf{0}$ serves as an identity for the addition operation on key set $K_1 \times K_2$.
Definition 2.5 (Hadamard Product).
Given two TAAs $\mathbf{A}, \mathbf{B}: K_1 \times K_2 \to V$, the Hadamard product operation is defined as
$$(\mathbf{A} \otimes \mathbf{B})(k_1, k_2) = \mathbf{A}(k_1, k_2) \cdot \mathbf{B}(k_1, k_2).$$
Define $\mathbf{1}: K_1 \times K_2 \to V$ as a TAA where $\mathbf{1}(k_1, k_2) = 1$ for all $(k_1, k_2) \in K_1 \times K_2$. $\mathbf{1}$ serves as an identity for the Hadamard product on key set $K_1 \times K_2$.
Definition 2.6 (Array Multiplication).
Given two TAAs $\mathbf{A}: K_1 \times K_2 \to V$ and $\mathbf{B}: K_2 \times K_3 \to V$, the array multiplication operation is defined as
$$(\mathbf{A} \odot \mathbf{B})(k_1, k_3) = \sum_{k_2 \in K_2} \mathbf{A}(k_1, k_2) \cdot \mathbf{B}(k_2, k_3).$$
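Over the sparse-dict representation sketched earlier, the three operations can be written in a few lines each, with missing keys treated as the semiring zero. Ordinary integer arithmetic stands in for a general semiring, and the function names are ours.

```python
def add(A, B):
    # elementwise addition: union of key pairs, missing entries count as 0
    return {k: A.get(k, 0) + B.get(k, 0) for k in A.keys() | B.keys()}

def hadamard(A, B):
    # elementwise product: only key pairs present in both can be nonzero
    return {k: A[k] * B[k] for k in A.keys() & B.keys()}

def matmul(A, B):
    # (A . B)(k1, k3) = sum over the shared middle key k2
    out = {}
    for (k1, k2), a in A.items():
        for (k2b, k3), b in B.items():
            if k2 == k2b:
                out[(k1, k3)] = out.get((k1, k3), 0) + a * b
    return out

A = {("d1", "t1"): 2, ("d1", "t2"): 1}
B = {("t1", "x"): 3, ("t2", "x"): 4}
```

The quadratic loop in `matmul` is for clarity only; an indexed join over the middle key would be used in practice.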
Definition 2.7 (Array Identity).
Given two key sets $K_1$ and $K_2$, and a partial function $f: K_1 \to K_2$, the array identity $\mathbf{I}_f$ is defined as a TAA such that
$$\mathbf{I}_f(k_1, k_2) = \begin{cases} 1 & \text{if } f(k_1) = k_2, \\ 0 & \text{otherwise.} \end{cases}$$
Specifically, if $K_1 = K_2$ and $f(k) = k$ for all $k \in K_1$, $\mathbf{I}_f$ is abbreviated to $\mathbf{I}$.
In general, $\mathbf{I}_f$ is not an identity for general array multiplication. However, $\mathbf{I}$ is an identity element for array multiplication on associative arrays $K_1 \times K_1 \to V$.
Definition 2.8 (Kronecker Product).
Given two TAAs $\mathbf{A}: K_1 \times K_2 \to V$ and $\mathbf{B}: K_3 \times K_4 \to V$, their Kronecker product is defined by
$$(\mathbf{A} \mathbin{\hat\otimes} \mathbf{B})((k_1, k_3), (k_2, k_4)) = \mathbf{A}(k_1, k_2) \cdot \mathbf{B}(k_3, k_4).$$
Definition 2.9 (Transpose).
Given a TAA $\mathbf{A}: K_1 \times K_2 \to V$, its transpose, denoted by $\mathbf{A}^\top$, is defined by $\mathbf{A}^\top: K_2 \times K_1 \to V$ where $\mathbf{A}^\top(k_2, k_1) = \mathbf{A}(k_1, k_2)$ for $k_1 \in K_1$ and $k_2 \in K_2$.
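Both structural operators are one-liners over the dict representation. In the Kronecker sketch below, the keys of the product are themselves pairs of keys, mirroring the definition; the helper names are ours.

```python
def transpose(A):
    # swap the roles of row and column keys
    return {(k2, k1): v for (k1, k2), v in A.items()}

def kronecker(A, B):
    # product keys are ((k1, k3), (k2, k4)); values multiply
    return {((k1, k3), (k2, k4)): a * b
            for (k1, k2), a in A.items()
            for (k3, k4), b in B.items()}

A = {("r", "c"): 2}
B = {("s", "d"): 5}
```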
2.2. Text Operations
We can express a number of fundamental text operations using the proposed TAA algebra. We first define three basic TAAs specifically for text analytics, and then define a series of text operations on general TAAs or on these basic structures.
Definition 2.10 (Document-Term Matrix).
Given a text corpus, a document-term matrix is defined as a TAA $\mathbf{M}: D \times T \to V$ where $D$ and $T$ are the document set and term set of the corpus.
The term set $T$ in the document-term matrix can be the vocabulary or the bigrams of the corpus, or an application-specific user-defined set of interesting terms. The matrix value can also take different semantics: in one application it can be the occurrence count of term $t$ in document $d$, while in another it can be the term frequency-inverse document frequency (tf-idf). Typically, elements of $D$ and $T$ will have additional relational metadata; a document may have a date, and a term may have an annotation like a part-of-speech (POS) tag.
Definition 2.11 (Term-Index Matrix).
Given a document $d$, the term-index matrix is defined as a TAA $\mathbf{T}_d: T_d \times I \to V$, where $T_d$ is the set of terms in document $d$ and $I = \{1, \dots, n\}$ is the index set ($n$ is the size of $d$). Specifically, for $t \in T_d$ and $i \in I$,
$$\mathbf{T}_d(t, i) = \begin{cases} 1 & \text{if the } i\text{-th word of } d \text{ is } t, \\ 0 & \text{otherwise.} \end{cases}$$
Example 2. For a document $d$ = “Today is a sunny day”, let its term-index matrix be $\mathbf{T}_d$; then we have $\mathbf{T}_d(\text{Today}, 1) = 1$, $\mathbf{T}_d(\text{is}, 2) = 1$, $\mathbf{T}_d(\text{a}, 3) = 1$, $\mathbf{T}_d(\text{sunny}, 4) = 1$, $\mathbf{T}_d(\text{day}, 5) = 1$, and for all other pairs $(t, i)$, we have $\mathbf{T}_d(t, i) = 0$.
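The construction in Example 2 is easy to sketch, storing only the nonzero entries of the term-index matrix and assuming whitespace tokenization.

```python
def term_index_matrix(doc):
    # T(t, i) = 1 when the i-th word (1-based) of the document is t;
    # zeros are left implicit in the sparse dict
    words = doc.split()
    return {(t, i + 1): 1 for i, t in enumerate(words)}

T = term_index_matrix("Today is a sunny day")
```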
Definition 2.12 (Term Vector).
There are two types of term vectors. 1) Given the set of terms $T_d$ of a document $d$, the term vector of $d$ is defined as a TAA $\mathbf{v}_d: \{d\} \times T_d \to V$. 2) Given a set of terms $T$ for a collection of documents $D$, a TAA $\mathbf{v}_D: \{D\} \times T \to V$, whose single row key stands for the corpus itself, is a term vector for the corpus $D$.
The term vector represents some attribute of terms in the scope of one document or a corpus. For example, for a document $d$, the value of the term vector can be the occurrence count of each term in $d$. For a corpus $D$, the value of its term vector can be the idf value of each term over the whole corpus; such a value is not specific to a single document.
Based on these structures, we can define our unit text operators as follows. Some operators are defined for general TAAs, while some are defined for a specific type of TAAs.
Definition 2.13 (Extraction).
Given a TAA $\mathbf{A}: K_1 \times K_2 \to V$ and two projection sets $K_1' \subseteq K_1$, $K_2' \subseteq K_2$, we define the extraction operation as
$$\pi_{K_1', K_2'}(\mathbf{A}): K_1' \times K_2' \to V.$$
Let $\mathbf{A}' = \pi_{K_1', K_2'}(\mathbf{A})$; we have $\mathbf{A}'(k_1, k_2) = \mathbf{A}(k_1, k_2)$ for all $(k_1, k_2) \in K_1' \times K_2'$.
When only extracting row keys, the operation can be expressed as $\pi_{K_1', K_2}(\mathbf{A})$, and when extracting column keys, it is expressed as $\pi_{K_1, K_2'}(\mathbf{A})$.
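Extraction restricts the stored key pairs to the projection sets. A minimal sketch over the dict representation, with `None` standing for "keep all keys on that axis" (a convenience of ours, not part of the definition):

```python
def extract(A, rows=None, cols=None):
    # keep only entries whose keys fall in the projection sets
    return {(k1, k2): v for (k1, k2), v in A.items()
            if (rows is None or k1 in rows) and (cols is None or k2 in cols)}

A = {("d1", "t1"): 2, ("d1", "t2"): 1, ("d2", "t1"): 5}
```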
Definition 2.14 (Rename).
Given a TAA $\mathbf{A}: K_1 \times K_2 \to V$, suppose $K_2'$ is another column key set and there exists a bijection $f: K_2 \to K_2'$. The column rename operation is defined as
$$\rho^{c}_{f}(\mathbf{A}): K_1 \times K_2' \to V, \quad \rho^{c}_{f}(\mathbf{A})(k_1, f(k_2)) = \mathbf{A}(k_1, k_2).$$
Similarly, given another row key set $K_1'$ and a bijection $g: K_1 \to K_1'$, the row rename operation is defined as
$$\rho^{r}_{g}(\mathbf{A}): K_1' \times K_2 \to V, \quad \rho^{r}_{g}(\mathbf{A})(g(k_1), k_2) = \mathbf{A}(k_1, k_2).$$
The subscript can be omitted if the bijection is clear, e.g., $\rho^{c}(\mathbf{A})$. In addition, the row rename operation and column rename operation can be combined as $\rho^{r}_{g}(\rho^{c}_{f}(\mathbf{A}))$. Our rename operator is more general than the rename operation of relational algebra since it supports renaming of both the row key set and the column key set.
Definition 2.15 (Apply).
Given a TAA $\mathbf{A}: K_1 \times K_2 \to V$ and a function $g: V \to V'$, define the apply operator by $\mathrm{apply}_g(\mathbf{A}): K_1 \times K_2 \to V'$ where
$$\mathrm{apply}_g(\mathbf{A})(k_1, k_2) = g(\mathbf{A}(k_1, k_2)).$$
Definition 2.16 (Filter).
Given a TAA $\mathbf{A}: K_1 \times K_2 \to V$ and an indicator function $\theta: K_1 \times K_2 \to \{0, 1\}$, define the filter operation on $\mathbf{A}$ as
$$\sigma_{\theta}(\mathbf{A}): K_1' \times K_2' \to V,$$
where $K_1' = \{k_1 \mid \exists k_2 .\ \theta(k_1, k_2) = 1\}$ and $K_2' = \{k_2 \mid \exists k_1 .\ \theta(k_1, k_2) = 1\}$, and $\sigma_{\theta}(\mathbf{A})(k_1, k_2) = \mathbf{A}(k_1, k_2) \cdot \theta(k_1, k_2)$.
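Apply and filter can be sketched together; `apply` maps a function over every stored value, and the filter variant below keeps whole rows by a predicate on the row key, a common special case (both helper names are ours).

```python
import math

def apply(A, g):
    # map a function over every stored value
    return {k: g(v) for k, v in A.items()}

def filter_rows(A, theta):
    # keep only rows whose key satisfies the indicator theta
    return {(k1, k2): v for (k1, k2), v in A.items() if theta(k1)}

A = {("d1", "t1"): 4, ("d2", "t1"): 9}
logged = apply(A, math.log)
```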
Definition 2.17 (Sort).
Given a TAA $\mathbf{A}: K_1 \times K_2 \to V$, for any $k_2 \in K_2$, we extract a TAA $\pi_{K_1, \{k_2\}}(\mathbf{A})$ of dimension $|K_1| \times 1$. Since $V$ is a partially-ordered semiring (Definition 2.2), the value set $\{\mathbf{A}(k_1, k_2) \mid k_1 \in K_1\}$ inherits the partial order from $V$, which implies an order on $K_1$. Define $r: K_1 \to \{1, \dots, |K_1|\}$ as the rank function induced by this order; then the sort by column operation is defined as
$$\mathrm{sort}^{c}_{k_2}(\mathbf{A}) = \rho^{r}_{r}(\mathbf{A}),$$
where row keys are renamed to their ranks. Similarly, we have the sort by row operation defined as
$$\mathrm{sort}^{r}_{k_1}(\mathbf{A}) = \rho^{c}_{r'}(\mathbf{A}),$$
where $r'$ is the rank function on $K_2$ induced by the values in row $k_1$.
When the column key dimension or row key dimension is 1 (e.g., for a term vector), $\mathrm{sort}^{c}_{k_2}$ or $\mathrm{sort}^{r}_{k_1}$ is abbreviated to $\mathrm{sort}$.
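A simplified sketch of sorting by a column: instead of renaming keys to ranks, the helper below just returns the row keys in rank order, which is what downstream uses such as "top-k by tf-idf" need (the simplification and the function name are ours; missing entries count as zero).

```python
def sort_by_column(A, col, reverse=True):
    # rank row keys by their value in the chosen column (missing -> 0)
    rows = {k1 for k1, _ in A}
    return sorted(rows, key=lambda k1: A.get((k1, col), 0), reverse=reverse)

A = {("d1", "tfidf"): 0.2, ("d2", "tfidf"): 0.9, ("d3", "tfidf"): 0.5}
```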
Definition 2.18 (Merge).
Given two TAAs $\mathbf{A}: K_1 \times K_2 \to V$ and $\mathbf{B}: K_1 \times K_2' \to V$, if $K_2 \cap K_2' = \emptyset$, then the merge operation can be applied on them, and it is defined as
$$\mathrm{merge}(\mathbf{A}, \mathbf{B}): K_1 \times K_2'' \to V,$$
where $K_2'' = K_2 \cup K_2'$, and
$$\mathrm{merge}(\mathbf{A}, \mathbf{B})(k_1, k_2) = \begin{cases} \mathbf{A}(k_1, k_2) & \text{if } k_2 \in K_2, \\ \mathbf{B}(k_1, k_2) & \text{if } k_2 \in K_2'. \end{cases}$$
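Merge is column-wise concatenation of two arrays that share row keys. A minimal sketch over the dict representation (helper name is ours):

```python
def merge(A, B):
    # column-wise concatenation: column key sets must be disjoint
    assert {k2 for _, k2 in A}.isdisjoint({k2 for _, k2 in B})
    out = dict(A)
    out.update(B)
    return out

uni = {("d1", "sunny"): 1, ("d1", "day"): 1}   # unigram counts
bi = {("d1", "sunny day"): 1}                  # bigram counts
```

This is exactly the step used in Section 3.1 to combine a document's unigram and bigram vectors.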
Definition 2.19 (Expand).
Given an elementwise binary operator $\circ$ on associative arrays (e.g., $\oplus$ and $\otimes$), a term vector $\mathbf{v}: \{D\} \times T \to V$ and a document-term matrix $\mathbf{M}: D \times T \to V$, the expand operator is defined as
$$\mathrm{expand}_{\circ}(\mathbf{v}, \mathbf{M}) = \mathbf{v}' \circ \mathbf{M}.$$
This operator implicitly expands the term vector to generate another associative array $\mathbf{v}': D \times T \to V$ where $\mathbf{v}'(d, t) = \mathbf{v}(D, t)$ for every $d \in D$, and then applies $\circ$ on $\mathbf{v}'$ and $\mathbf{M}$.
Suppose that for a corpus $D$, there is a term vector $\mathbf{m}$ where $\mathbf{m}(D, t)$ is the mean occurrence of term $t$ in $D$ (i.e., $\mathbf{m}(D, t) = n_t / |D|$ where $n_t$ is the total occurrence of $t$ in $D$), and there is a document-term matrix $\mathbf{M}$; then
$$\mathrm{expand}_{\ominus}(\mathbf{m}, \mathbf{M})$$
will generate the difference of each term's occurrence in each document from its average occurrence, with $\ominus$ denoting elementwise subtraction.
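The broadcast-then-apply behavior can be sketched directly; the expansion of the vector over the rows is implicit in the lookup. The helper name is ours, and the example uses elementwise subtraction as in the mean-occurrence discussion above.

```python
import operator

def expand(op, vec, M):
    # broadcast the term vector over every row of the document-term matrix,
    # then apply op elementwise: result(d, t) = op(M(d, t), vec(t))
    return {(d, t): op(v, vec.get(t, 0)) for (d, t), v in M.items()}

mean_occ = {"senate": 2.0, "budget": 0.5}      # mean occurrence per term
M = {("d1", "senate"): 3, ("d1", "budget"): 1}  # document-term counts
diff = expand(operator.sub, mean_occ, M)
```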
Definition 2.20 (Flatten).
Given an associative array $\mathbf{A}: K_1 \times K_2 \to V$, the flatten operation is defined by $\mathrm{flatten}(\mathbf{A}): K_1 \times \{1\} \to V$ where
$$\mathrm{flatten}(\mathbf{A})(k_1, 1) = \sum_{k_2 \in K_2} \mathbf{A}(k_1, k_2).$$
Definition 2.21 (Left Shift).
Given a term-index matrix $\mathbf{T}_d: T_d \times I \to V$ and a non-negative integer $s$, define the left shift operator by $\mathrm{LS}_s(\mathbf{T}_d) = \mathbf{T}'$ where
$$\mathbf{T}'(t, i) = \mathbf{T}_d(t, i + s).$$
For a term-index matrix of document $d$, $\mathrm{LS}_1(\mathbf{T}_d)$ generates another term-index matrix $\mathbf{T}'$ where $\mathbf{T}'(t, i) = 1$ when $t$ is the $(i+1)$-th word in $d$.
Definition 2.22 (Union).
Suppose there are two term-index matrices with the same index set $I$, $\mathbf{T}_1: T_1 \times I \to V$ and $\mathbf{T}_2: T_2 \times I \to V$; the union operation on $\mathbf{T}_1$ and $\mathbf{T}_2$ is defined by
$$(\mathbf{T}_1 \uplus \mathbf{T}_2)(t_1 t_2, i) = \mathbf{T}_1(t_1, i) \cdot \mathbf{T}_2(t_2, i),$$
where $t_1 t_2$ denotes the concatenation of terms $t_1$ and $t_2$.
Suppose $\mathbf{T}' = \mathrm{LS}_1(\mathbf{T}_d)$; then $(\mathbf{T}_d \uplus \mathbf{T}')(t_1 t_2, i) = 1$ exactly when $t_1$ is the $i$-th word and $t_2$ is the $(i+1)$-th word of $d$.
The left shift and union operations can be composed to compute all bigrams of a document. Given a term-index matrix $\mathbf{T}_d$ of document $d$, let $\mathbf{B} = \mathbf{T}_d \uplus \mathrm{LS}_1(\mathbf{T}_d)$; then $\mathbf{B}(t_1 t_2, i) = 1$ when $t_1 t_2$ is the $i$-th bigram in document $d$.
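The bigram composition can be traced end to end on the running example. Sparse entries only; whitespace tokenization and the helper names are ours.

```python
def left_shift(T, s=1):
    # T'(t, i) = T(t, i + s): the word at position i+s moves to position i
    return {(t, i - s): 1 for (t, i) in T if i - s >= 1}

def union(T1, T2):
    # concatenate terms that share an index: one bigram per aligned position
    return {(f"{t1} {t2}", i): 1
            for (t1, i) in T1 for (t2, j) in T2 if i == j}

doc = "Today is a sunny day"
T = {(t, i + 1): 1 for i, t in enumerate(doc.split())}
bigrams = union(T, left_shift(T))
```

For the five-word document this yields exactly four bigrams, one per adjacent word pair.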
Definition 2.23 (Sum).
The sum operation takes a TAA $\mathbf{A}: K_1 \times K_2 \to V$ and an integer $axis \in \{0, 1, 2\}$ as inputs, and has different semantics based on the integer value: $\mathrm{sum}(\mathbf{A}, 0) = \sum_{k_1, k_2} \mathbf{A}(k_1, k_2)$ is the grand total over all key pairs; $\mathrm{sum}(\mathbf{A}, 1)$ returns the column marginals $\sum_{k_1 \in K_1} \mathbf{A}(k_1, k_2)$ for each $k_2 \in K_2$; and $\mathrm{sum}(\mathbf{A}, 2)$ returns the row marginals $\sum_{k_2 \in K_2} \mathbf{A}(k_1, k_2)$ for each $k_1 \in K_1$.
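A sketch of the three axis semantics over the dict representation. The exact axis numbering here is our assumption, since the original equations for this definition were lost; the shape of the operation (grand total, column marginals, row marginals) follows the surrounding text.

```python
def taa_sum(A, axis=0):
    # axis 0: grand total; axis 1: column marginals; axis 2: row marginals
    if axis == 0:
        return sum(A.values())
    totals = {}
    for (k1, k2), v in A.items():
        key = k2 if axis == 1 else k1
        totals[key] = totals.get(key, 0) + v
    return totals

A = {("d1", "t1"): 2, ("d1", "t2"): 1, ("d2", "t1"): 3}
```

These marginals are exactly what Step 4 of Example 1 thresholds when pruning the document-term matrix.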
3. Text Analytic Tasks
3.1. Constructing a Document Term Matrix
As we stated in Section 2.2, a document-term matrix is a common representation model for a collection of documents, where the terms can be a list of important terms, the whole vocabulary, or bigrams. The entry of the matrix can be either the occurrence count of each term or its tf-idf value.
Example 3. For a document collection $D$, build a document-term matrix where the terms are all unigrams and bigrams in $D$, and the values are the occurrence counts of each term.
Suppose there is a tokenization function that takes a document $d$ as input and generates a term-index matrix $\mathbf{T}_d$. The construction can be decomposed into two parts; the first part is to construct a term vector for one single document, containing all unigrams and bigrams together with their corresponding occurrence counts. Fig. 1 shows the construction process.
Step 1 generates the term-index matrix in which each term is a unigram. The operation in Step 2 generates the term vector whose column keys are the unigrams in document $d$. Steps 3–6 produce the term vector whose column key set is all bigrams in $d$. Step 7 concatenates the two term vectors to get the representation of $d$.
For each document in collection $D$, we get its term vector using the above steps, then apply the merge operation to get the document-term matrix whose column key set is the union of all unigrams and bigrams in the whole corpus.
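The whole construction collapses into a few lines over the dict representation. This sketch shortcuts the figure's step-by-step algebra (tokenize, flatten, shift, union, merge) into direct counting; the helper name and whitespace tokenization are ours.

```python
from collections import Counter

def doc_vector(doc_id, text):
    # unigram and bigram occurrence counts for one document, as a single row
    words = text.split()
    grams = words + [f"{a} {b}" for a, b in zip(words, words[1:])]
    return {(doc_id, g): c for g, c in Counter(grams).items()}

docs = {"d1": "sunny day today", "d2": "sunny sunny day"}
dtm = {}
for d, text in docs.items():
    dtm.update(doc_vector(d, text))   # merge: row key sets are disjoint
```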
Besides word occurrence as the values of the document-term matrix, one can also use a term's tf-idf value. If all terms are considered, the document-term matrix would be high-dimensional and sparse, which would be costly to manipulate. A simple and commonly adopted method to reduce the dimension is to select only informative words. The following presents the queries to get a document-term matrix with tf-idf values for only the informative terms, where informativeness is measured by the idf value.
Example 4. Given a collection of documents $D$, we have to generate a document-term matrix $\mathbf{M}$ where $\mathbf{M}(d, t)$ is the tf-idf value for term $t$ in document $d$, restricted to the top 1000 “informative words”. Suppose there is a term-document matrix which stores the occurrence count of every unigram in each document (the construction is similar to that of Example 3 and thus is skipped); $\mathbf{M}$ can be generated by the following steps. The function in Step 3 calculates the idf value, which is defined as $\mathrm{idf}(t) = \log(|D| / n_t)$, where $n_t$ is the number of documents that contain term $t$.
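A sketch of the same pipeline in plain Python, with a tiny hypothetical corpus and a top-4 cut standing in for the top-1000 cut of the example; the idf formula matches the definition above.

```python
import math
from collections import Counter

docs = {"d1": "senate budget vote", "d2": "sunny day", "d3": "senate hearing"}

# occurrence matrix {(doc, term): count}, as in Example 3 (unigrams only)
occ = {(d, t): c for d, text in docs.items()
       for t, c in Counter(text.split()).items()}

# idf(t) = log(|D| / n_t), where n_t = number of documents containing t
n_docs = len(docs)
df = Counter(t for _, t in occ)
idf = {t: math.log(n_docs / n) for t, n in df.items()}

# keep only the top-k terms by idf (k = 4 here; 1000 in the example)
top = set(sorted(idf, key=idf.get, reverse=True)[:4])
tfidf = {(d, t): c * idf[t] for (d, t), c in occ.items() if t in top}
```

In TAA terms, the idf computation is an apply over a corpus-level term vector, the top-k cut is a sort followed by an extraction, and the final step is an expand with elementwise multiplication.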
3.2. Using TAAs
For Example 1 introduced in Section 1, we express the analysis using relational algebra together with the associative array operations. Suppose that the maximum number of words for a term in $O \cup P$ is 3; the analysis can then be expressed as the following steps. Step 1 is expressed in relational algebra. The function in the last step takes a document-term matrix and produces a document-topic matrix and a topic-term matrix, which are the standard outputs of topic modeling, represented by two more TAAs. Sorting the document-topic TAA and extracting the top-$k$ entries per document will return all pairs $(d, z)$ where topic $z$ is one of the top-$k$ topics for document $d$.
References
- Barceló et al. (2019) Pablo Barceló, Nelson Higuera, Jorge Pérez, and Bernardo Subercaseaux. 2019. On the Expressiveness of LARA: A Unified Language for Linear and Relational Algebra. arXiv preprint arXiv:1909.11693 (2019).
- Golan (2013) Jonathan S Golan. 2013. Semirings and affine equations over them: theory and applications. Vol. 556. Springer Science & Business Media.
- Jananthan et al. (2017) Hayden Jananthan, Ziqi Zhou, Vijay Gadepally, Dylan Hutchison, Suna Kim, and Jeremy Kepner. 2017. Polystore mathematics of relational algebra. In Int. Conf. on Big Data. IEEE, 3180–3189.
- Kepner et al. (2020) Jeremy Kepner, Vijay Gadepally, Hayden Jananthan, Lauren Milechin, and Siddharth Samsi. 2020. AI Data Wrangling with Associative Arrays. arXiv preprint arXiv:2001.06731 (2020).
- Leclercq et al. (2019) Éric Leclercq, Annabelle Gillet, Thierry Grison, and Marinette Savonnet. 2019. Polystore and Tensor Data Model for Logical Data Independence and Impedance Mismatch in Big Data Analytics. In Trans. on Large-Scale Data-and Knowledge-Centered Systems XLII. Springer, 51–90.