
Maximal co-occurrence nonoverlapping sequential rule mining

Yan Li, Chang Zhang, Jie Li, Wei Song, Zhenlian Qi, Youxi Wu, and Xindong Wu. Manuscript received August 22, 2022. (Corresponding author: Y. Wu.) Yan Li, Chang Zhang, and Jie Li are with the School of Economics and Management, Hebei University of Technology, Tianjin 300400, China (e-mail: [email protected]). Wei Song is with the North China University of Technology, Beijing, China (e-mail: [email protected]). Zhenlian Qi is with Guangdong Eco-Engineering Polytechnic, Guangzhou 510520, China (e-mail: [email protected]). Youxi Wu is with the School of Artificial Intelligence, Hebei University of Technology, Tianjin 300400, China (e-mail: [email protected]). Xindong Wu is with the Key Laboratory of Knowledge Engineering with Big Data (the Ministry of Education of China), Hefei University of Technology, Hefei 230009, China (e-mail: [email protected]).
Abstract

The aim of sequential pattern mining (SPM) is to discover potentially useful information from a given sequence. Although various SPM methods have been investigated, most of these focus on mining all of the patterns. However, users sometimes want to mine patterns with the same specific prefix pattern, called co-occurrence patterns. Since sequential rule mining can make better use of the results of SPM, and can obtain better recommendation performance, this paper addresses the issue of maximal co-occurrence nonoverlapping sequential rule (MCoR) mining and proposes the MCoR-Miner algorithm. To improve the efficiency of support calculation, MCoR-Miner employs depth-first search and backtracking strategies equipped with an indexing mechanism to avoid sequential searching. To obviate useless support calculations for some sequences, MCoR-Miner adopts a filtering strategy to prune the sequences without the prefix pattern. To reduce the number of candidate patterns, MCoR-Miner applies the frequent item and binomial enumeration tree strategies. To avoid searching for the maximal rules by brute force, MCoR-Miner uses a screening strategy. To validate the performance of MCoR-Miner, experiments with eleven competitive algorithms were conducted on eight datasets. Our experimental results showed that MCoR-Miner outperformed the other competitive algorithms, and yielded better recommendation performance than frequent co-occurrence pattern mining. All algorithms and datasets can be downloaded from https://github.com/wuc567/Pattern-Mining/tree/master/MCoR-Miner.

Index Terms:
Sequential pattern mining, sequential rule mining, rule-antecedent, co-occurrence pattern, maximal rule mining

I Introduction

As an important method of knowledge discovery [1], sequential pattern mining (SPM) [2] aims to mine sub-sequences (patterns) that meet certain conditions from sequence datasets [3]. A variety of SPM methods have been derived for different mining requirements, such as order-preserving SPM for time series [4], SPM for large-scale databases [5, 6], episode pattern mining [7], spatial co-location pattern mining [8, 9], contrast SPM [10, 11], negative SPM [12, 13], high utility SPM [14], high average-utility SPM [15, 16, 17], outlying SPM [18], three-way SPM [19, 20], co-occurrence SPM [21, 22], and SPM with gap constraints [23]. One of the disadvantages of traditional SPM is that it only considers whether a given pattern occurs within a sequence and ignores the repetition of the pattern in the sequence [24]. For example, the support (number of occurrences) of the pattern ad in the sequence adbdad is one according to traditional SPM, despite the pattern ad occurring more than once in the sequence. From this example, we see that repetition is ignored in traditional SPM, meaning that some interesting patterns will be lost [25].

To solve this issue, gap constraint SPM was proposed [26], in which a pattern can be expressed as p = p_1p_2...p_{m-1}p_m with gap = [a,b] (or p = p_1[a,b]p_2...[a,b]p_m), where a and b (0 ≤ a ≤ b) are integers indicating the minimum and maximum numbers of wildcards between p_j and p_{j+1}, respectively [27, 28]. For example, the pattern a[0,3]d means that there are zero to three wildcards between a and d. Repetitions of patterns can be useful in terms of capturing valuable information from sequences, and gap constraint SPM therefore has many applications, such as septic shock prediction for ICU patients [29], keyphrase extraction [30], missing behavior analysis [31], and pyramid scheme pattern mining [32]. However, gap constraint SPM is not only difficult to solve, but also has many forms, such as periodic gaps SPM [33], disjoint SPM [34], one-off SPM [35], and nonoverlapping SPM [23]. Previous research has shown that nonoverlapping SPM avoids the production of many redundant patterns [24]; for example, under the nonoverlapping condition, the pattern ad occurs twice in the sequence adbdad.

Unfortunately, current schemes based on gap constraint SPM can only mine all of the frequent patterns. Compared with rule mining, frequent pattern mining is not suitable for making predictions or recommendations. Rule mining [36] focuses on mining rules such as p → q, which means that if pattern p occurs in the sequence, pattern q is likely to appear afterward with a probability higher than or equal to a given confidence, i.e., conf(p → q) ≥ mincf. However, in cases where users only want to find the rules with the same antecedent p, mining all the strong rules is not only time-consuming and laborious, but also meaningless. For example, in a recommendation problem, researchers hope to discover potential rules from historical data and use recent events to predict future events. Obviously, it is valuable to discover rules with the same recent events instead of all rules. This kind of rule is called a co-occurrence rule and is more meaningful. For clarification, an illustrative example is as follows.

Example 1.

Suppose we have a sequence s = adbdadcdccabadcd, a gap constraint gap = [0,3], and a predefined support threshold minsup = 3. According to nonoverlapping SPM, there are 12 frequent patterns: {a, c, d, ad, cd, dc, dd, adc, add, dcd, ddc, adcd}. Moreover, if we continue to set the minimum confidence threshold mincf = 0.7, there are 11 rules: {a → d, a → dc, a → dd, a → dcd, d → d, d → c, ad → d, ad → c, ad → cd, dd → c, adc → d}. Obviously, it is difficult for users to apply these excessive numbers of frequent patterns and rules.

However, if we have a prefix pattern (or antecedent) p = ad, the number of frequent super-patterns and rules will be reduced. The details are as follows. We know that p = ad occurs four times in sequence s, since the subsequence s_1s_2 is ad, and the subsequences s_5s_6, s_{11}s_{14}, and s_{13}s_{16} are also ad. Thus, the support of p in s is four. Similarly, we know that the subsequences s_1s_4s_7, s_5s_6s_9, and s_{11}s_{14}s_{15} are all adc. Thus, the support of pattern adc in s is three, which is not less than minsup = 3. Hence, the pattern adc is a frequent co-occurrence pattern of p = ad. Furthermore, ad → c is a co-occurrence rule whose confidence is 3/4 = 0.75 ≥ mincf = 0.7. Similarly, we know that patterns add and adcd are frequent, and that ad → d and ad → cd are co-occurrence rules. Hence, there are only three co-occurrence patterns of pattern p (adc, add, and adcd) and three co-occurrence rules (ad → c, ad → d, and ad → cd), which means that the number of frequent patterns and rules is greatly reduced.

Inspired by maximal SPM, the concept of maximal co-occurrence rules (MCoRs) is developed to further decrease the number of co-occurrence rules. For instance, if a rule p → r is an MCoR, then we know that p → q is also a co-occurrence rule, where pattern q is a prefix pattern of r. Moreover, if the pattern w is a superpattern of r, then p → w is not a strong co-occurrence rule. From the above examples, we see that it is meaningful to investigate MCoR mining. The main contributions of this paper are as follows.

1. To avoid mining irrelevant patterns and obtain better recommendation performance, we develop MCoR mining, which can mine all MCoRs with the same rule-antecedent, and we propose the MCoR-Miner algorithm.

2. In MCoR mining, the user does not need to set the support threshold, since it can be automatically calculated based on the support of the rule-antecedent and the minimum confidence threshold.

3. To improve the mining efficiency, MCoR-Miner consists of three parts: a preparation stage, candidate pattern generation, and a screening strategy.

4. To validate the performance of MCoR-Miner, eleven competitive algorithms and eight datasets are selected. Our experimental results verify that MCoR-Miner outperforms the other competitive algorithms and yields better recommendation performance than frequent SPM.

The structure of this paper is as follows. Section II gives an overview of related work. Section III defines the problem. Section IV presents the MCoR-Miner algorithm. Section V reports the performance of MCoR-Miner. Section VI concludes this paper.

II Related work

The aim of SPM is to find subsequences (patterns) in a sequence database that meet the given requirements [37]. SPM methods are commonly used for data mining, since their results are intuitive and interpretable [38, 39]. Traditional SPM mainly focuses on frequent pattern mining, which means the mined patterns have high frequency. To reduce the number of patterns, top-k SPM [12], closed SPM [40], and maximal SPM [41] were developed, all of which require that the data are static rather than dynamic. To overcome this drawback, incremental SPM [42] and window SPM [43] methods were explored. However, the research community noticed that rare patterns were of great significance in the field, and rare pattern mining was also proposed [44]. In addition, to discover missing events, negative SPM methods such as e-NSP [45] and e-RNSP [46] were designed. An approach called high utility SPM was also designed [47, 48], which can discover patterns with low frequency but high utility [49, 50].

Most SPM methods can be seen as classical SPM [51], since these methods only consider whether a pattern occurs within a sequence, while repetitive SPM explicitly considers the number of occurrences of a pattern in the sequence [33]. For example, suppose we have two sequences: s1 = aabbaaba and s2 = abcbc. According to classical SPM, the support of pattern aba in the database {s1, s2} is one, since pattern aba occurs in sequence s1 = aabbaaba, but does not occur in sequence s2. Thus, classical SPM neglects the fact that pattern aba occurs in sequence s1 more than once, while repetitive SPM considers the number of occurrences. Another significant difference is that classical SPM focuses on mining patterns in sequences with itemsets, while repetitive SPM mainly aims to mine patterns in sequences with items [52]. For example, (ab)(bc)(a)(bd)(ad) is a sequence with itemsets, where each itemset may contain several ordered items. If all itemsets in a sequence have only one item, then the sequence is a sequence with items. Thus, a sequence with items can be seen as a special case of a sequence with itemsets. Sequences with items are used in many fields, such as DNA sequences, protein sequences, clickstreams, and commercial data.

Note that repetitive SPM is similar to episode mining [53]. The differences are three-fold. First, episode mining deals with one sequence, while repetitive SPM processes one or more sequences. Second, episode mining aims to mine patterns in an event sequence, which can be represented by <(e_1,t_1), (e_2,t_2), ..., (e_n,t_n)>, where e_i is an event set, and t_i is the occurrence time of e_i. In contrast, in repetitive SPM, the sequences do not have occurrence times. Third, in episode mining, e_i can be a set, while in repetitive SPM, the sequence consists of items, rather than sets.

Compared with classical SPM, repetitive SPM is not only more challenging, but also has many forms: the general form (no condition) [33], and the disjoint [52], one-off [54], and nonoverlapping forms [55]. Note that the disjoint form was called the nonoverlapping form in some studies [46, 34], which is quite different from the nonoverlapping form in this study. To clarify the difference between the disjoint and nonoverlapping forms, an illustrative example is shown as follows.

Example 2.

Suppose we have a sequence s=aabbaaba and a pattern p=a[0,1]b[0,1]a.

In the disjoint form [46, 34], the first position of an occurrence must be greater than the last position of its previous occurrence. Thus, there are two occurrences: <1,3,5> and <6,7,8>. Note that <1,3,5> and <2,4,6> do not satisfy the disjoint form, since the first position of occurrence <2,4,6>, i.e., 2, is less than 5, which is the last position of <1,3,5>.

In the nonoverlapping form [24], each item cannot be reused by the same p_j, but can be reused by different p_j. Thus, there are three occurrences: <1,3,5>, <2,4,6>, and <6,7,8>. Note that <2,4,6> and <6,7,8> satisfy the nonoverlapping form, since p_3 matches s_6 in <2,4,6>, while p_1 matches s_6 in <6,7,8>.
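To make the distinction concrete, the following minimal Python sketch (our illustration; the function names are ours, not from the cited works) checks the two conditions for a pair of occurrences:

def is_disjoint(l1, l2):
    # Disjoint form: the first position of one occurrence is greater
    # than the last position of the other.
    first, second = sorted((tuple(l1), tuple(l2)))
    return second[0] > first[-1]

def is_nonoverlapping(l1, l2):
    # Nonoverlapping form: no sequence position is reused by the same
    # pattern position p_j (position-wise inequality).
    return all(x != y for x, y in zip(l1, l2))

# Example 2: s = aabbaaba, p = a[0,1]b[0,1]a
print(is_disjoint((1, 3, 5), (2, 4, 6)))        # False
print(is_nonoverlapping((1, 3, 5), (2, 4, 6)))  # True
print(is_disjoint((1, 3, 5), (6, 7, 8)))        # True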

Recently, various applications of SPM with gap constraints have been investigated. For example, top-k contrast nonoverlapping SPM was proposed, in which the mined patterns can be used as features for a sequence classification task [10]. To discover low-frequency but high average utility patterns, high average utility one-off SPM [49] and nonoverlapping SPM [56] were developed. To discover missing events, one-off negative SPM was proposed, which can be used to predict future trends in traffic flow [13]. Inspired by three-way decisions [19, 57], nonoverlapping three-way SPM [20] was designed to mine the patterns to which users pay the most attention, and can effectively mine patterns composed of strong-interest and medium-interest items.

Most of these schemes aim to discover all of the patterns that satisfy the predefined constraints. However, in general, these methods will discover numerous patterns. Although top-k SPM [10], nonoverlapping closed SPM [40], and nonoverlapping maximal SPM [41] can reduce the number of patterns, users may not be interested in many of the patterns that are discovered. Moreover, these mining methods fail to reveal the relationships between patterns. Rule mining is an effective method for discovering the relationships between patterns [36]. There are many rule mining methods, such as association rule mining [58], sequential rule mining [36], episode rule mining [59], and order-preserving rule mining [60]. Similar to pattern mining, there are many methods to reduce the number of redundant rules, such as maximum consequent and minimum antecedent [61], top-k rule mining [58], closed rule mining, and maximal rule mining [60]. Nevertheless, in some cases, users know a prefix pattern in advance, and want to discover its super-patterns. This is called co-occurrence pattern mining [62]. Based on the co-occurrence patterns, co-occurrence rules can be further explored to reveal the relationships between the prefix patterns and their super-patterns. Although we could mine all patterns first, and then filter out useless patterns with different prefixes, this approach would increase the running time [63].

In summary, although the use of SPM with gap constraints can make the mining results more meaningful, the mining results of this scheme are not targeted. To make the mining results more suitable for recommendations, based on the special needs of users, we present a scheme inspired by co-occurrence SPM [64] and maximal nonoverlapping SPM [41], called MCoR mining.

III Problem definitions

Definition 1.

(Sequence) A sequence s with length n is denoted by s = s_1s_2...s_n, where s_i (1 ≤ i ≤ n) ∈ ∑, ∑ represents the set of items in sequence s, and the size of ∑ can be expressed as |∑|.

Definition 2.

(Sequence database) A sequence database D with length k is a set of sequences, denoted by D = {s_1, s_2, ..., s_k}.

Definition 3.

(Pattern) A pattern p with length m is denoted by p = p_1[a,b]p_2...[a,b]p_m (or abbreviated as p = p_1p_2...p_{m-1}p_m with gap = [a,b]), where a and b (0 ≤ a ≤ b) are integers indicating the minimum and maximum wildcards between p_j and p_{j+1}, respectively.

Definition 4.

(Occurrence and nonoverlapping occurrence) Suppose we have a sequence s = s_1s_2...s_n and a pattern p = p_1[a,b]p_2...[a,b]p_m. l = <l_1, l_2, ..., l_m> is an occurrence of pattern p in sequence s if and only if p_1 = s_{l_1}, p_2 = s_{l_2}, ..., p_m = s_{l_m} (1 ≤ l_1 < l_2 < ... < l_m ≤ n) and a ≤ l_i - l_{i-1} - 1 ≤ b. Suppose there is another occurrence l' = <l'_1, l'_2, ..., l'_m>. l and l' are two nonoverlapping occurrences if and only if for any j (1 ≤ j ≤ m), l_j ≠ l'_j.

Example 3.

Suppose we have a sequence s = adbdadcdccabadcd and a pattern p = a[0,3]d. According to the gap constraint [0,3], all occurrences of p in s are <1,2>, <1,4>, <5,6>, <5,8>, <11,14>, <13,14>, and <13,16>. <1,2> and <1,4> do not satisfy the nonoverlapping condition, since 1 appears in both occurrences at the same position. Thus, there are four nonoverlapping occurrences of p in s: <1,2>, <5,6>, <11,14>, and <13,16>.
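As an illustration (ours, not part of the original paper), the following Python sketch enumerates all occurrences of a pattern under a gap constraint by simple recursion; applied to Example 3, it returns the seven occurrences listed above.

def all_occurrences(s, p, a, b):
    # Enumerate every occurrence (1-based position tuple) of pattern p
    # in sequence s under the gap constraint [a, b].
    results = []
    def extend(occ):
        if len(occ) == len(p):
            results.append(tuple(occ))
            return
        lo = occ[-1] + a + 1 if occ else 1      # window for the next item
        hi = occ[-1] + b + 1 if occ else len(s)
        for i in range(lo, min(hi, len(s)) + 1):
            if s[i - 1] == p[len(occ)]:
                extend(occ + [i])
    extend([])
    return results

print(all_occurrences("adbdadcdccabadcd", "ad", 0, 3))
# [(1, 2), (1, 4), (5, 6), (5, 8), (11, 14), (13, 14), (13, 16)]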

Note that there are many different methods to calculate the nonoverlapping occurrences, such as NETLAP [65], NETGAP [24], Netback [41], and DFOM [56], and their results may be different. For example, besides the two nonoverlapping occurrences <1,2> and <5,6>, <1,4> and <5,8> are also two nonoverlapping occurrences. Although the results may be different, this problem has been theoretically proven to be solvable in polynomial time [65]. More importantly, it has been shown that there are four different ways to find the nonoverlapping occurrences: finding the maximal occurrences in the rightmost leaf-root way, finding the maximal occurrences in the rightmost root-leaf way, finding the minimal occurrences in the leftmost leaf-root way, and finding the minimal occurrences in the leftmost root-leaf way [66]. In this example, <1,2> and <5,6> are called minimal nonoverlapping occurrences [56], while <1,4> and <5,8> are called maximal nonoverlapping occurrences [65]. In this paper, we search for the minimal nonoverlapping occurrences.

Definition 5.

(Support) The support of pattern p in sequence s is the number of nonoverlapping occurrences, represented by sup(p,s). The support of pattern p in sequence database D is the sum of the supports in each sequence, i.e., sup(p,D) = ∑_{i=1}^{k} sup(p,s_i).

Definition 6.

(Frequent pattern) If the support of pattern p in sequence s or database D is no less than the minimum support threshold minsup, then pattern p is a frequent pattern.

Definition 7.

(Prefix pattern, subpattern, and superpattern) Given two patterns p = p_1p_2...p_{m-1}p_m and q = q_1q_2...q_{n-1}q_n (m < n) with gap = [a,b], if and only if p_1 = q_1, p_2 = q_2, ..., and p_m = q_m, then pattern p is the prefix pattern of pattern q. Moreover, pattern p is a subpattern of pattern q, and pattern q is a superpattern of pattern p.

Example 4.

Given two patterns p = ad and q = adbd, pattern p is the prefix pattern of pattern q.

Definition 8.

(Co-occurrence pattern, co-occurrence rule, rule antecedent, rule consequent, and confidence) Suppose we have a pattern q = p·r = p_1p_2...p_mq_{m+1}q_{m+2}...q_n, where p = p_1p_2...p_m and r = q_{m+1}q_{m+2}...q_n. The pattern r is a co-occurrence pattern of p, and p → r is a co-occurrence rule, where p and r are the rule antecedent and consequent, respectively. The ratio of the supports of patterns q and p is called the confidence of the sequential rule p → r, and is denoted by conf(p → r) = sup(p·r,D)/sup(p,D). The confidence indicates the probability of occurrence of pattern r when pattern p occurs.

Definition 9.

(Maximal co-occurrence pattern) Suppose we have a pattern q. If one of its superpatterns r is a frequent co-occurrence pattern, then pattern q is not a maximal co-occurrence pattern; otherwise, pattern q is a maximal co-occurrence pattern.

Definition 10.

(Strong co-occurrence rule and MCoR) If conf(p → r) is greater than or equal to the predefined threshold mincf, i.e., conf(p → r) ≥ mincf, then p → r is a strong co-occurrence rule. Suppose p → r is a strong co-occurrence rule. For any superpattern w of pattern r, if p → w is not a strong co-occurrence rule, then p → r is an MCoR.

MCoR mining: Given a sequence s or a sequence database D, mincf, and a prefix pattern p, the aim of MCoR mining is to discover all MCoRs.

Example 5.

From Example 1, we know that ad → c and ad → cd are two co-occurrence rules. Rule ad → c is not an MCoR, while rule ad → cd is an MCoR, since the pattern c is the prefix pattern of cd, and for any superpattern of cd, such as cdc, ad → cdc is not a strong co-occurrence rule.

The symbols used in this paper are shown in Table I.

TABLE I: Notations
Symbol Description
s A sequence with length n
D A sequence database with k sequences
p A pattern with length m
a, b The minimum and maximum wildcards, respectively
sup(p,s) The number of nonoverlapping occurrences of p in s
sup(p,D) The number of nonoverlapping occurrences of p in D
∑ The set of items in sequence database D
minsup The minimum support threshold
mincf The predefined confidence threshold
p → r A co-occurrence rule
conf(p → r) The confidence of the co-occurrence rule p → r

IV Proposed algorithm

In this section, we propose MCoR-Miner, an algorithm designed to discover all MCoRs. MCoR mining is based on nonoverlapping SPM, whose main issue is support calculation. Therefore, support calculation is a key aspect of MCoR mining, and is illustrated in Section IV-A. In addition to support calculation, MCoR-Miner has three parts: a preparation stage, candidate pattern generation, and a screening strategy. The framework of MCoR-Miner is shown in Fig. 1.

Figure 1: Framework of MCoR-Miner. In the preparation stage, shown in Section IV-B, a filtering strategy is proposed to prune the sequences in which the support of the prefix pattern is zero, thereby shrinking the database, and a minimum support strategy is then explored to obtain minsup based on the user-defined parameter mincf. In candidate pattern generation, shown in Section IV-C, a frequent item enumeration tree (FET) strategy and then a binomial enumeration tree (BET) strategy are developed to reduce the number of candidate patterns. Finally, a screening strategy, shown in Section IV-D, is proposed to discover the maximal co-occurrence patterns and avoid a brute-force search.

IV-A Support calculation

Given a sequence and a pattern with gap constraints, the calculation of the support is a pattern matching task [65]. From Definition 4, we know that all nonoverlapping occurrences are a subset of all occurrences, and these occurrences can be expressed using a Nettree structure. Wu et al. [65] first theoretically proved that calculating the nonoverlapping occurrences can be solved in polynomial time. Some state-of-the-art algorithms, such as NETLAP-Best [65] and NETGAP [24], initially create a Nettree and then iteratively prune useless nodes to find all nonoverlapping occurrences on the Nettree. The time complexities of NETLAP-Best and NETGAP are both O(m × m × n × w), where m and n are the lengths of the pattern and sequence, respectively, and w = b - a + 1. Since pruning these useless nodes consumes a lot of time, Netback [41] was proposed to improve the efficiency: a Nettree is first created, and a backtracking strategy is then employed to find all nonoverlapping occurrences on the Nettree. The time complexity of Netback is reduced to O(m × n × w). Although a Nettree can intuitively represent all occurrences, it contains a lot of useless information when representing all nonoverlapping occurrences, such as useless nodes and parent-child relationships. To further improve the efficiency, DFOM [56] was proposed; this algorithm does not need to create a whole Nettree, and employs depth-first search and backtracking strategies to find all nonoverlapping occurrences, which reduces the time complexity to O(m × n). One of the shortcomings of DFOM is that it employs a sequential searching strategy to find the feasible child nodes of each current node. To overcome this drawback, we propose a depth-first search and backtracking with indexes algorithm, called DBI, which employs an indexing mechanism to avoid sequential searching. Example 6 illustrates the principle of DFOM [56].

Example 6.

We use the same sequence s = adbdadcdccabadcd as in Example 3 and a pattern p = a[0,3]d[0,3]c. Fig. 2 illustrates the occurrence searching process of DFOM. We know that p_1 = a. Thus, DFOM sequentially searches for a in sequence s. Now, DFOM creates a root, labeled node 1, since s_1 = p_1 = a. According to the depth-first search strategy, DFOM sequentially searches for d in sequence s after node 1 with gap constraint [0,3], since p_2 = d. We know that s_2 = d. Thus, DFOM creates a child of node 1, labeled node 2. Since p_3 = c, DFOM sequentially searches for c in sequence s after node 2 with gap constraint [0,3]. Unfortunately, there is no c between s_3 and s_6. Hence, using a backtracking strategy, DFOM backtracks to node 1 to find a new child. We know that s_3 is b, which is not equal to p_2 = d, and DFOM therefore continues to search. We know that s_4 is d, which is equal to p_2 = d. Thus, DFOM finds a new child of node 1, labeled node 4. Following the depth-first search strategy, DFOM searches for c in sequence s after node 4 with gap constraint [0,3], since p_3 = c. It is easy to see that s_5 and s_6 are not equal to c. Since s_7 = c is equal to p_3 and the gap between positions 7 and 4 is 2, this satisfies the gap constraint [0,3]. Thus, DFOM finds a new child of node 4, labeled node 7. Since the length of p is three, DFOM finds an occurrence <1,4,7>. Now, DFOM continues to search for a new root. Since s_2, s_3, and s_4 are not equal to a, these characters are ignored. Since s_5 is a, DFOM finds a new root, labeled node 5. By iterating the above process, DFOM finds a new nonoverlapping occurrence <5,6,9>. Finally, the occurrence <11,14,15> is found.

Figure 2: Occurrence searching process of DFOM

From Example 6 and Fig. 2, we know that the main drawback of DFOM is its sequential search. To overcome this shortcoming, DBI uses index arrays to store the positions of each character. Example 7 illustrates the principle of DBI.

Figure 3: Index arrays for sequence s.
Example 7.

We use the same sequence s = adbdadcdccabadcd and pattern p = a[0,3]d[0,3]c as in Example 6. DBI uses index arrays to store the positions of each character, as shown in Fig. 3. Fig. 4 shows the occurrence searching process of DBI. Since p_1 = a, DBI gets the first element in the array a, which is position 1. Thus, DBI creates a root, labeled node 1. Following the depth-first search strategy, since p_2 = d, DBI gets the first element in the array d, which is position 2, and the gap between positions 2 and 1 is zero, which satisfies the gap constraint [0,3]. Thus, DBI creates a child of node 1, labeled node 2. According to the depth-first search strategy, since p_3 = c, DBI gets the first element in the array c, which is position 7. The gap between positions 7 and 2 is four, which does not satisfy the gap constraint [0,3]. According to the backtracking strategy, DBI backtracks to node 1 to find a new child. We know that the second element of the array d is position 4, which satisfies the gap constraint [0,3]. Thus, DBI creates a child of node 1, labeled node 4. Now, DBI selects the first element of the array c, which is position 7, and it satisfies the gap constraint [0,3]. Hence, DBI obtains the occurrence <1,4,7>. Then, DBI gets the second element of the array a, which is position 5. By iterating the above process, DBI obtains the occurrence <5,6,9>. Finally, <11,14,15> is found.

Figure 4: Occurrence searching process of DBI

From Example 7, we know that DBI first creates nodes on m levels according to the index arrays of pattern p. Then, based on the index arrays, DBI adopts the depth-first search and backtracking strategies to iteratively find the minimal nonoverlapping occurrences. The pseudocode of DBI is shown in Algorithm 1. The main steps of DBI are as follows.

Step 1: DBI selects the first element in the index array of p_1 as the current node, and sets the current level as the first level (Lines 1 to 3).

Step 2: DBI gets the first unused element in the index array of the next level as the child node of the current node, such that the distance between the current node and the child node satisfies the gap constraint gap = [a,b] (Lines 5 to 10).

Step 3: If DBI successfully finds the child node, then the child node is selected as the current node and the next level is set as the current level. Otherwise, DBI backtracks to the parent node of the current node and searches for the next child node (Lines 11 to 17).

Step 4: Iterate Step 3 until DBI reaches the m-th level or falls back to the first level. If DBI reaches the m-th level, then DBI finds an occurrence, i.e., sup(p,s)++ (Lines 19 to 21).

Step 5: DBI selects the next element in the index array of p_1 as the current node, and sets the current level as the first level.

Step 6: Iterate Steps 2 to 5 until all elements in the index array of p_1 are checked.

Algorithm 1 DBI
Input: Index arrays arr for sequence s, pattern p, and gap = [a,b]
Output: sup(p,s)
1:  for root ← each element in arr[p[1]] do
2:     CurNode ← occ[1] ← root;
3:     CurLevel ← 1;
4:     while CurLevel > 0 && CurLevel < len(p) do
5:        child ← next element in arr[p[CurLevel+1]];
6:        gap ← child - CurNode - 1;
7:        while gap < a do
8:           child ← next element in arr[p[CurLevel+1]];
9:           gap ← child - CurNode - 1;
10:        end while
11:        if a ≤ gap && gap ≤ b then
12:           CurLevel++;
13:           CurNode ← occ[CurLevel] ← child;
14:        else
15:           CurLevel--;
16:           CurNode ← occ[CurLevel];
17:        end if
18:     end while
19:     if CurLevel = len(p) then
20:        sup(p,s)++;
21:     end if
22:  end for
23:  return sup(p,s)
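The following Python sketch is our reading of Algorithm 1 (the function and variable names are ours): each level of the pattern keeps a forward-only cursor into its index array, so every element is visited at most once, matching the O(m × n/h) bound of Theorem 1.

from collections import defaultdict

def build_index_arrays(s):
    # Index arrays: for each item, the ordered list of its positions (1-based).
    arr = defaultdict(list)
    for i, c in enumerate(s, start=1):
        arr[c].append(i)
    return arr

def dbi_support(arr, p, a, b):
    # Support of pattern p under gap = [a, b] via minimal nonoverlapping
    # occurrences; ptr[j] is the forward-only cursor of level j, so a
    # position is consumed by a given pattern item at most once.
    m = len(p)
    ptr = [0] * m
    occ = [0] * m
    sup = 0
    for root in arr.get(p[0], []):
        occ[0], level = root, 0
        while 0 <= level < m - 1:
            j = level + 1
            nxt = arr.get(p[j], [])
            # discard positions that are too close (gap < a); later parents
            # at this level are even larger, so such positions stay unusable
            while ptr[j] < len(nxt) and nxt[ptr[j]] - occ[level] - 1 < a:
                ptr[j] += 1
            if ptr[j] < len(nxt) and nxt[ptr[j]] - occ[level] - 1 <= b:
                occ[j] = nxt[ptr[j]]
                ptr[j] += 1            # consume this position and descend
                level += 1
            else:
                level -= 1             # gap > b or array exhausted: backtrack
        if level == m - 1:
            sup += 1                   # the m-th level was reached
    return sup

# Example 7: the three occurrences <1,4,7>, <5,6,9>, and <11,14,15>
arr = build_index_arrays("adbdadcdccabadcd")
print(dbi_support(arr, "adc", 0, 3))   # 3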
Theorem 1.

The time and space complexities of DBI are both O(m × n/h), where m and n are the lengths of the pattern and sequence, respectively, and h is the size of ∑.

Proof.

It is easy to know that the size of the index array of each item is O(n/h), since the length of the sequence is n, and the number of items is h. To calculate the support, there are m index arrays. We know that each element in the index arrays can be used at most once, which means that DBI visits these elements one by one. Hence, the time and space complexities of DBI are both O(m × n/h).

IV-B Preparation stage

In this section, we propose two strategies, the filtering strategy and the minimum support strategy, to avoid redundant support calculations. We first present the filtering strategy.

Filtering strategy. If the support of pattern p in sequence s is zero, then sequence s can be pruned.

Theorem 2.

Filtering strategy is correct and complete.

Proof.

We know that nonoverlapping SPM satisfies anti-monotonicity [24], which means that the support of a superpattern is no greater than that of its subpattern. If sup(p,s) = 0, then we can safely say that sup(q,s) = 0, where pattern q is a superpattern of pattern p. Sequence s can therefore be pruned. Hence, the filtering strategy is correct and complete.

Moreover, based on this anti-monotonicity, we further propose a minimum support strategy.

Minimum support strategy. If the support of a pattern q in D is less than minsup, then pattern q and its superpatterns can be pruned, where minsup = sup(p,D) × mincf.

Theorem 3.

Minimum support strategy is correct and complete.

Proof.

Suppose a rule p → r is a strong co-occurrence rule, i.e., conf(p → r) = sup(p·r,D)/sup(p,D) ≥ mincf. Then, sup(p·r,D) ≥ sup(p,D) × mincf, which means that the support of pattern p·r is greater than or equal to sup(p,D) × mincf. Thus, for MCoR mining, users do not need to set minsup, since it can be automatically calculated based on mincf. Hence, the minimum support strategy is correct and complete.

In the preparation stage, we propose the SDB-Filt algorithm, whose pseudocode is shown in Algorithm 2. The main steps of SDB-Filt are as follows.

Step 1: Select a sequence s in D.

Step 2: SDB-Filt employs the DBI algorithm to calculate sup(p,s).

Step 3: If sup(p,s) is zero, then sequence s is pruned according to the filtering strategy. Otherwise, sequence s is stored in the shrunk database D1, and the support is updated, i.e., sup(p,D) ← sup(p,D) + sup(p,s) (Lines 4 to 7).

Step 4: Iterate Steps 1 to 3 until all sequences are calculated.

Step 5: After obtaining sup(p,D), SDB-Filt calculates minsup according to the minimum support strategy (Line 9).

Algorithm 2 SDB-Filt
Input: Sequence index arrays Arr, pattern p, gap = [a,b], and mincf
Output: minsup and shrunk index arrays Arr1
1:  sup(p,D) ← 0;
2:  for each sequence s in D do
3:     support ← DBI(arr_s, p, gap);
4:     if support ≠ 0 then
5:        Store the index arrays of sequence s in Arr1;
6:        sup(p,D) ← sup(p,D) + support;
7:     end if
8:  end for
9:  minsup ← sup(p,D) × mincf;
10:  return minsup, Arr1
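A minimal Python sketch of the preparation stage (our illustration, reusing build_index_arrays and dbi_support from the DBI sketch in Section IV-A; db is assumed to be a list of sequences):

def sdb_filt(db, p, a, b, mincf):
    # Algorithm 2 sketch: prune the sequences in which the prefix pattern p
    # does not occur, and derive minsup from sup(p, D) and mincf.
    arrs, sup_p = [], 0
    for s in db:
        arr = build_index_arrays(s)
        sup = dbi_support(arr, p, a, b)
        if sup != 0:                   # filtering strategy
            arrs.append(arr)           # keep only the shrunk index arrays
            sup_p += sup
    minsup = sup_p * mincf             # minimum support strategy
    return minsup, arrs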

IV-C Candidate pattern generation

There are two strategies that are commonly used for generating candidate patterns: the enumeration tree strategy and the pattern join strategy. Many nonoverlapping SPM methods have adopted the pattern join strategy to generate candidate patterns, such as NOSEP [24] and NTP-Miner [20], since it can effectively reduce the number of candidate patterns compared to the enumeration tree strategy. However, the pattern join strategy uses all frequent patterns with length m to generate candidate patterns with length m+1, which means that all of the frequent patterns would have to be mined; this amounts to a brute-force approach in which all frequent patterns are mined and pattern q and its superpatterns are then found among them. We therefore adopt the enumeration tree strategy. In the classical enumeration tree strategy, if pattern q is a frequent pattern, then all patterns q·y are candidate patterns, where y is any item in sequence s.

To reduce the number of candidate patterns, we first propose the FET strategy and then present the BET strategy.

The FET strategy: If pattern q is a frequent pattern and y is any frequent item in sequence s, then all patterns q·y are candidate patterns.

Example 8 demonstrates the advantage of the FET strategy.

Example 8.

Suppose we have a sequence s = s_1s_2...s_{16} = adbdadcdccabadcd, a pattern q = a[0,3]d[0,3]c, and minsup = 3. From sequence s, we know that all of the items are {a, b, c, d}. According to the classical enumeration tree strategy, there are four candidate patterns based on pattern q: a[0,3]d[0,3]c[0,3]a, a[0,3]d[0,3]c[0,3]b, a[0,3]d[0,3]c[0,3]c, and a[0,3]d[0,3]c[0,3]d. However, we know that the supports of patterns a, b, c, and d are four, two, four, and six, respectively. Thus, the frequent items are {a, c, d}, since minsup = 3. Hence, according to FET, there are only three candidate patterns based on pattern q: a[0,3]d[0,3]c[0,3]a, a[0,3]d[0,3]c[0,3]c, and a[0,3]d[0,3]c[0,3]d. This example shows that the FET strategy has better performance than the classical enumeration tree strategy.

To further reduce the number of candidate patterns, inspired by the pattern join strategy, we propose the BET strategy as follows.

The BET strategy: We discover all frequent patterns with length two and store them in set F_2. If x is the last item of q, and pattern xy is a frequent pattern with length two, i.e., xy ∈ F_2, then pattern r = q·y is a candidate pattern.

Example 9 illustrates the advantage of the BET strategy.

Example 9.

We use the same scenario as in Example 8. We know that the frequent items are {a, c, d}. We can then generate nine candidate patterns with length two, {aa, ac, ad, ca, cc, cd, da, dc, dd}, using the pattern join strategy. It is then easy to see that the frequent candidate patterns with length two are <ad>, <ca>, <cd>, <dc>, and <dd>. From Example 8, we see that there are only three candidate patterns based on pattern q using the FET strategy: a[0,3]d[0,3]c[0,3]a, a[0,3]d[0,3]c[0,3]c, and a[0,3]d[0,3]c[0,3]d. Pattern a[0,3]d[0,3]c[0,3]a cannot be pruned by the BET strategy, since pattern c[0,3]a is a frequent pattern with length two, while pattern a[0,3]d[0,3]c[0,3]c can be pruned by the BET strategy, since pattern c[0,3]c is not a frequent pattern with length two. Similarly, pattern a[0,3]d[0,3]c[0,3]d cannot be pruned. Thus, only two candidate patterns are generated by the BET strategy. This example demonstrates that the BET strategy outperforms the FET strategy. A comparison of the candidate patterns is given in Fig. 5.
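A minimal Python sketch of candidate generation under the FET and BET strategies (our illustration; F1 and F2 denote the frequent items and the frequent patterns with length two):

def bet_candidates(q, F1, F2):
    # Grow pattern q by one item: FET keeps only frequent items y, and BET
    # further requires the length-two pattern x+y to be frequent.
    x = q[-1]                          # last item of q
    return [q + y for y in F1 if x + y in F2]

# Example 9: q = adc, F1 = {a, c, d}, F2 = {ad, ca, cd, dc, dd}
print(bet_candidates("adc", ["a", "c", "d"], {"ad", "ca", "cd", "dc", "dd"}))
# ['adca', 'adcd']  (FET alone would also generate 'adcc')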

Figure 5: Comparison of candidate patterns generated by different strategies: (a) enumeration tree of all items; (b) enumeration tree of FET; (c) enumeration tree of BET. Eight candidate patterns are generated by using all items, while six and four candidate patterns are generated by using the FET and BET strategies, respectively.

Theorem 4.

Both FET and BET strategies are correct and complete.

Proof.

As mentioned above, nonoverlapping SPM satisfies anti-monotonicity [24]. Thus, the support of a superpattern is no greater than that of its subpattern. Therefore, if item y is infrequent, then the pattern q·y is infrequent. Hence, the FET strategy is correct and complete. Similarly, we know that the BET strategy is also correct and complete.

IV-D Screening strategy

Obviously, it is easy for us to obtain strong co-occurrence sequential rules based on the frequent co-occurrence patterns. Moreover, we can discover the maximal strong co-occurrence rules based on the maximal co-occurrence patterns. The simplest method is to mine all frequent co-occurrence patterns, and then discover the maximal co-occurrence patterns among them. This method can be seen as a brute-force method. To improve the performance, we propose a screening strategy to discover the maximal co-occurrence patterns. The principle is as follows.

Screening strategy: Suppose we have a pattern q. If one of its superpatterns r is a frequent co-occurrence pattern, then pattern q is not a maximal co-occurrence pattern; otherwise, q is a maximal co-occurrence pattern.

Theorem 5.

Screening strategy is correct and complete.

Proof.

According to the definition of the maximal co-occurrence pattern, for a pattern q, if one of its superpatterns r is a frequent co-occurrence pattern, then pattern q is not a maximal co-occurrence pattern; otherwise, pattern q is a maximal co-occurrence pattern. The screening strategy discovers the maximal co-occurrence patterns directly based on this definition. Hence, the screening strategy is correct and complete.

We store all frequent co-occurrence patterns in a stack. Suppose pattern q is the top element of the stack. We generate the candidate patterns based on pattern q using the BET strategy, and then use the DBI algorithm to calculate the supports of the candidate patterns. If a candidate pattern r is a frequent co-occurrence pattern, then we store pattern r in the stack; if none of the candidate patterns is a frequent co-occurrence pattern, then pattern q is a maximal co-occurrence pattern and is stored in the set M_p. We check all patterns in the stack until it is empty. Example 10 illustrates the principle of the screening strategy.

Example 10.

Suppose we have a sequence s = s_1s_2...s_{16} = adbdadcdccabadcd, p = a[0,3]d, and minsup = 3. Suppose the top element of the stack is pattern q = a[0,3]d[0,3]c. According to the BET strategy, there are two candidate patterns based on pattern q: a[0,3]d[0,3]c[0,3]a and a[0,3]d[0,3]c[0,3]d. We know that the support of pattern a[0,3]d[0,3]c[0,3]a is two. This means that a[0,3]d[0,3]c[0,3]a cannot be stored in the stack, since it is not a frequent co-occurrence pattern. Moreover, we know that the support of pattern a[0,3]d[0,3]c[0,3]d is three. The pattern a[0,3]d[0,3]c[0,3]d is therefore stored in the stack, since it is a frequent co-occurrence pattern. Meanwhile, pattern q is not a maximal co-occurrence pattern, since it has a frequent co-occurrence superpattern a[0,3]d[0,3]c[0,3]d. Now, the top element of the stack is pattern a[0,3]d[0,3]c[0,3]d, which has two candidate patterns according to the BET strategy: a[0,3]d[0,3]c[0,3]d[0,3]c and a[0,3]d[0,3]c[0,3]d[0,3]d. It is then easy to see that the supports of these two patterns are both one. Hence, the pattern a[0,3]d[0,3]c[0,3]d is a maximal co-occurrence pattern, since none of its superpatterns is a frequent co-occurrence pattern. The advantage of the screening strategy is that the maximal co-occurrence patterns can be discovered without a brute-force search process.

IV-E MCoR-Miner

We introduce the MCoR-Miner algorithm, whose pseudocode is shown in Algorithm 3. MCoR-Miner has the following main steps.

Step 1: Use SDB-Filt to calculate the support of pattern p, i.e., sup(p,D); prune the sequences whose support is zero; and calculate minsup = sup(p,D) × mincf (Lines 1 to 2).

Step 2: Traverse D1 to find all frequent items and store them in F_1 (Line 3).

Step 3: Generate the candidate pattern set C_2 using the pattern join strategy (Line 4).

Step 4: Calculate the support of each pattern in set C_2 and store the frequent patterns in F_2 (Line 5).

Step 5: Store pattern p in stack T (Line 6).

Step 6: Obtain the top element pattern q of stack T (Line 8).

Step 7: Generate all candidate co-occurrence patterns of pattern q using the frequent items in F_1 and store them in set C (Lines 10 to 13).

Step 8: Use the BET strategy to prune candidate patterns (Line 14).

Step 9: Use the DBI algorithm to calculate the support of each pattern r in C (Line 15).

Step 10: If pattern r is a frequent co-occurrence pattern, then pattern r is stored in stack T (Lines 16 to 19).

Step 11: If no pattern r is a frequent co-occurrence pattern, then pattern q is a maximal co-occurrence pattern and is stored in set M_p (Lines 22 to 24).

Step 12: Iterate Steps 6 to 11 until the stack is empty.

Step 13: Generate the maximal strong rules based on M_p (Line 26).

Algorithm 3 MCoR-Miner
Input: Sequence database D, minimum confidence mincf, gap = [a,b], prefix pattern p
Output: Maximal co-occurrence sequential rule set M_r
1:  Scan D to obtain index arrays Arr; // Use the filtering strategy to shrink the database, and the minimum support strategy to calculate minsup
2:  minsup, Arr1 ← SDB-Filt(Arr, p, gap, mincf);
3:  Scan Arr1 to find and store frequent items in F_1;
4:  Generate candidate patterns with length two and store them in C_2;
5:  Discover frequent patterns with length two and store them in F_2;
6:  Store pattern p in stack T;
7:  while T is not empty do
8:     q ← T.pop();
9:     flag ← True; // Use the FET strategy to prune candidate patterns
10:     for each item y in F_1 do
11:        C ← C ∪ q·y;
12:        x ← the last item of q;
13:        r ← q·y; // Use the BET strategy to prune candidate patterns
14:        if pattern x·y in F_2 then
15:           support ← DBI(Arr1, r, gap);
16:           if support ≥ minsup then
17:              T.push(r);
18:              flag ← False;
19:           end if
20:        end if
21:     end for // Use the screening strategy to identify the maximal patterns
22:     if flag = True then
23:        M_p ← M_p ∪ q;
24:     end if
25:  end while
26:  M_r ← Generate maximal strong rules based on M_p;
27:  return M_r;
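Putting the pieces together, the following Python sketch mirrors Algorithm 3 (our illustration, assuming single-character items and reusing build_index_arrays, dbi_support, sdb_filt, and bet_candidates from the sketches above; the prefix pattern p itself is skipped in the output, since a rule needs a nonempty consequent):

def mcor_miner(db, p, a, b, mincf):
    # MCoR-Miner sketch: returns the maximal co-occurrence rules (p, r).
    minsup, arrs = sdb_filt(db, p, a, b, mincf)   # preparation stage

    def support(q):
        return sum(dbi_support(arr, q, a, b) for arr in arrs)

    items = sorted({c for arr in arrs for c in arr})
    F1 = [y for y in items if support(y) >= minsup]   # frequent items
    F2 = {x + y for x in F1 for y in F1 if support(x + y) >= minsup}

    maximal, stack = [], [p]                      # screening with a stack
    while stack:
        q = stack.pop()
        flag = True                               # q is maximal until extended
        for r in bet_candidates(q, F1, F2):       # FET + BET pruning
            if support(r) >= minsup:
                stack.append(r)
                flag = False
        if flag and q != p:                       # p alone yields no rule
            maximal.append(q)
    return [(p, q[len(p):]) for q in maximal]     # rule p -> consequent

# Examples 1 and 5: the MCoRs of p = ad under gap = [0,3] and mincf = 0.7
print(mcor_miner(["adbdadcdccabadcd"], "ad", 0, 3, 0.7))
# [('ad', 'd'), ('ad', 'cd')]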
Theorem 6.

The space and time complexities of MCoR-Miner are O(N) and O(t × M × N/h), respectively, where t, M, N, and h are the number of candidate patterns, the maximal length of the candidate patterns, the total length of the shrunk sequence database, and the size of ∑, respectively.

Proof.

The time and space complexities of scanning the sequence database D to obtain its index arrays are O(N). It is easy to know that the space complexity of SDB-Filt is O(N). The time complexity of SDB-Filt is O(m × N/h), since the time complexity of DBI is O(m × n/h) and SDB-Filt checks all sequences, where N = n_1 + n_2 + ... + n_k. The space complexity of F_1 is O(h), since the number of frequent items is no greater than the number of all items. Thus, the time complexity of Line 3 is O(h). The time and space complexities of Line 4 are O(h^2). The time complexity of Line 5 is O(h^2 × 2 × N/h), since there are O(h^2) candidate patterns, and the length of each candidate pattern is two. The space complexity of storing the frequent patterns with length two is O(h^2 × 2). Suppose MCoR-Miner checks t candidate patterns and the maximal length of the candidate patterns is M. The space complexity of storing these candidate patterns is O(t × M), and the time complexity of calculating the supports of these candidate patterns is O(t × M × N/h). Hence, the space complexity of MCoR-Miner is O(N + N + h + h^2 + t × M) = O(N), since generally h, h^2, M, t, and t × M are much smaller than N. The time complexity of MCoR-Miner is O(N + m × N/h + h^2 + h^2 × 2 × N/h + t × M × N/h) = O(t × M × N/h), since m and 2 × h^2 are much smaller than M and t, respectively.

V Experimental results and analysis

To validate the performance of MCoR-Miner, we consider the following eight research questions (RQs).

RQ1: Does MCoR-Miner yield better performance than other state-of-the-art algorithms in terms of mining co-occurrence patterns?

RQ2: Does DBI give better performance than other state-of-the-art algorithms in terms of support calculation?

RQ3: Can the filtering strategy reduce the number of support calculations in MCoR-Miner?

RQ4: Can the FET and BET strategies reduce the number of candidate patterns, and when can BET achieve better performance?

RQ5: What is the effect of the screening strategy?

RQ6: Does MCoR-Miner have good scalability?

RQ7: How do different parameters affect the performance of MCoR-Miner?

RQ8: What is the performance of MCoR-Miner for recommendation tasks?

To answer RQ1, we select NOSEP, DFOM-All, and DBI-All to verify the performance of MCoR-Miner on the problem of co-occurrence pattern mining. In response to RQ2, we propose NETGAP-MCoR, Netback-MCoR, and DFOM-MCoR to explore the effect of DBI, as described in Section V-B. To address RQ3, we propose MCoR-NoFilt to investigate the effect of the filtering strategy, as presented in Section V-B. To answer RQ4, we apply two algorithms, MCoR-AET and MCoR-FET, to verify the effects of the FET and BET strategies, as described in Section V-B. To solve RQ5, we propose MCoR-NoScr to validate the effect of the screening strategy in Section V-B. To answer RQ6, we explore the scalability of MCoR-Miner, as presented in Section V-C. To address RQ7, we explore the effects of different gap constraints and minimum confidence thresholds on the running time of MCoR-Miner in Section V-D. To answer RQ8, we report the performance of MCoR-Miner in terms of confidence, recall, and F1-score in Section V-E. All experiments were conducted on a computer with an Intel(R) Core(TM) i7-10870H processor, 32 GB of memory, and the Windows 10 operating system, with VS2022 as the experimental environment. All algorithms and datasets can be downloaded from https://github.com/wuc567/Pattern-Mining/tree/master/MCoR-Miner.

V-A Benchmark datasets and baseline algorithms

We used the eight datasets shown in Table II as test sequences.

TABLE II: Description of datasets
Dataset Type |∑| Number of sequences Length
GAMESALE Commerce 12 39 16,446
BABYSALE Commerce 6 949 29,971
TRANSACTION Commerce 31 4,661 32,851
MOVIE Commerce 20 4,988 741,132
MSNBC Clickstream 17 200,000 948,674
SARS Bio-sequence 4 426 29,751
SARS-Cov-2 Bio-sequence 4 428 29,903
HIV Bio-sequence 20 1 10,000

To report the performance of the MCoR-Miner algorithm, four experiments were designed, and eleven competitive algorithms were selected.

1. NOSEP [24], DFOM-All [56], and DBI-All: To verify that it is necessary to use a special algorithm to discover frequent co-occurrence patterns, we selected two state-of-the-art algorithms: NOSEP [24] and DFOM [56], where NOSEP can mine all frequent patterns. DFOM-All and DBI-All can also discover all frequent patterns, employing the DFOM and DBI algorithms to calculate the support, respectively. We can then find the frequent co-occurrence patterns among all frequent patterns. Based on the above analysis, we know that the time complexities of NOSEP, DFOM-All, and DBI-All are O(t_a × M × M × N × w), O(t_a × M × N), and O(t_a × M × N/h), respectively, where t_a (t_a ≥ t) is the number of all frequent patterns and w = b - a + 1.

2. NETGAP-MCoR [24], Netback-MCoR [41], and DFOM-MCoR [56]: To explore the running performance of DBI, three competitive algorithms were proposed: NETGAP-MCoR, Netback-MCoR, and DFOM-MCoR, which employ the state-of-the-art algorithms NETGAP [24], Netback [41], and DFOM [56] to calculate the support, respectively. The other strategies, such as filtering, FET, BET, and screening, are the same as in MCoR-Miner. Based on the above analysis, we know that the time complexities of NETGAP-MCoR, Netback-MCoR, and DFOM-MCoR are O(t × M × M × N × w), O(t × M × N × w), and O(t × M × N), respectively.

3. MCoR-NoFilt: To demonstrate the performance of the filtering strategy, MCoR-NoFilt was proposed, which does not employ the filtering strategy but is otherwise the same as MCoR-Miner. The time complexity of MCoR-NoFilt is O(t × M × N_o/h), while that of MCoR-Miner is O(t × M × N/h), where N_o and N (N_o ≥ N) are the lengths of the original and shrunk sequence databases, respectively.

4. MCoR-AET and MCoR-FET: To investigate the performance of the BET strategy, we proposed MCoR-AET and MCoR-FET, which use all items and frequent items to generate candidate patterns, respectively. The time complexities of MCoR-AET and MCoR-FET are O(t_a × M × N/h) and O(t_f × M × N/h), respectively, where t_a (t_a ≥ t) and t_f are the numbers of candidate patterns. Note that t_f can be smaller than t, since MCoR-Miner needs to calculate O(h^2) candidate patterns with length two, and in some cases, the BET strategy cannot prune O(h^2) candidate patterns.

5. MCoR-NoScr: To assess the performance of the screening strategy, we used MCoR-NoScr, which does not include the screening strategy. The time complexity of MCoR-NoScr is almost the same as that of MCoR-Miner.

6. CoP-Miner: To verify the recommendation performance, CoP-Miner was proposed, which can mine all frequent patterns with the predefined prefix pattern, and employs the same strategies as MCoR-Miner. The time complexity of CoP-Miner is O(t_c × M × N/h), where t_c (t_c ≥ t) is the number of all frequent patterns.

V-B Efficiency

To determine the necessity of co-occurrence pattern mining, we selected NOSEP, DFOM-All, and DBI-All as the competitive algorithms. Moreover, to verify the performance of the DBI algorithm and the filtering, BET, and screening strategies, we employed NETGAP-MCoR, Netback-MCoR, DFOM-MCoR, MCoR-NoFilt, MCoR-AET, MCoR-FET, and MCoR-NoScr as the competitive algorithms. Eight datasets were selected. Since the characteristics of these datasets are significantly different, for example in terms of the numbers of items and sequences, we set different parameters, as shown in Table III. Comparisons of the running time and memory usage are shown in Tables IV and V, respectively. The main indicators of the mining results, such as the numbers of CoRs and MCoRs, are shown in Table VI.

TABLE III: Parameters
Dataset Prefix Gap constraint Minimum confidence
GAMESALE d [0,8] 0.3
BABYSALE a [0,8] 0.2
TRANSACTION z [0,3] 0.3
MOVIE d [0,3] 0.5
MSNBC e [0,8] 0.3
SARS C [0,3] 0.6
SARS-Cov-2 C [0,3] 0.6
HIV C [0,3] 0.3
TABLE IV: Comparison of running time (s)
GAMESALE BABYSALE TRANSACTION MOVIE MSNBC SARS SARS-Cov-2 HIV
NOSEP 47.95 152.55 10.20 77.66 - 95.47 133.12 211.98
DFOM-All 2.47 12.33 2.11 5.78 - 9.07 10.00 84.28
DBI-All 0.92 8.38 0.45 4.13 15.72 6.56 8.28 1.48
NETGAP-MCoR 16.17 113.00 2.80 50.16 587.91 2.80 2.75 7.74
Netback-MCoR 15.89 112.17 2.70 46.53 582.67 2.58 2.60 7.78
DFOM-MCoR 1.88 11.34 0.70 5.25 97.38 3.60 3.13 5.11
MCoR-NoFilt 0.66 5.53 1.00 4.08 215.83 2.81 2.97 0.13
MCoR-AET 0.70 7.14 3.88 16.22 25.92 1.56 1.42 0.02
MCoR-FET 0.59 5.94 0.47 5.25 24.17 1.55 1.41 0.02
MCoR-NoScr 0.63 5.80 0.48 4.11 16.52 3.13 2.97 0.13
MCoR-Miner 0.42 5.36 0.38 4.00 16.08 2.34 2.66 0.11
TABLE V: Comparison of memory usage (Mb)
GAMESALE BABYSALE TRANSACTION MOVIE MSNBC SARS SARS-Cov-2 HIV
NOSEP 32.1 203.3 908.1 1092.5 1183.2 106.2 106.5 19.4
DFOM-All 22.0 195.7 903.7 971.2 1188.0 95.1 95.4 20.2
DBI-All 14.4 16.9 21.6 38.5 230.7 17.6 17.6 14.1
NETGAP-MCoR 23.7 25.8 24.8 163.1 404.9 25.2 25.2 19.6
Netback-MCoR 26.7 18.8 35.7 187.2 263.9 27.8 27.4 20.3
DFOM-MCoR 15.9 18.7 22.8 40.0 226.0 15.8 15.9 15.9
MCoR-NoFilt 16.7 19.6 25.0 41.1 445.8 17.6 24.3 23.0
MCoR-AET 16.0 18.8 24.5 43.9 232.4 17.2 17.2 16.0
MCoR-FET 24.0 26.4 30.5 47.8 238.8 14.9 14.8 13.4
MCoR-NoScr 20.8 32.1 30.3 47.4 239.0 24.3 16.9 15.8
MCoR-Miner 13.6 16.5 20.8 38.0 228.8 17.4 26.6 21.5
TABLE VI: Main indicators of mining results
GAMESALE BABYSALE TRANSACTION MOVIE MSNBC SARS SARS-Cov-2 HIV
Number of CoRs 36 98 4 4 12 2 2 3
Number of MCoRs 20 47 2 1 1 2 2 3
Number of filtered sequences 642 1,350 104,787 288 49,163,184 28 28 0
Number of non-filtered sequences 11,877 425,700 21,060 79,520 1,236,816 11,900 11,956 472
Number of all items 12 6 30 20 17 4 4 20
Number of frequent items 10 5 4 3 15 4 4 20
Number of BET items 27 15 4 2 1 13 13 286
Number of BET items under non-filtered 27 15 6 2 24 13 13 286
Number of candidate patterns pruned by BET 149 70 9 8 168 0 0 8
Number of candidate patterns nonpruned by BET 221 425 11 7 27 12 12 72

These results give rise to the following observations.

  1.

    Maximal co-occurrence rule mining can effectively reduce the number of rules. From Table VI, on BABYSALE, there are 98 co-occurrence rules but only 47 maximal co-occurrence rules, and the same phenomenon can be found on all the other datasets. The reason is that, as shown in Example 5, both $ad \to c$ and $ad \to cd$ are rules, but only $ad \to cd$ is a maximal rule. MCoR mining can therefore effectively reduce the number of rules.

  2.

    It is necessary to explore co-occurrence pattern mining, since MCoR-Miner is significantly faster than NOSEP [24] and DFOM-All [56], and is even faster than DBI-All. For example, on GAMESALE, NOSEP and DFOM-All take 47.95s and 2.47s, respectively, while MCoR-Miner takes 0.42s, meaning that MCoR-Miner is about 100 times faster than NOSEP and about six times faster than DFOM-All. Moreover, DBI-All takes 0.92s, which is slower than MCoR-Miner. The same phenomena can be found on the other datasets. The reasons for this are twofold. Firstly, compared with NOSEP and DFOM-All, MCoR-Miner employs a more efficient support-calculation algorithm, called DBI. As mentioned in Section IV-A, NOSEP has to create a whole Nettree to calculate the support, and a Nettree contains many redundant nodes and parent-child relationships. Although DFOM overcomes this disadvantage, it uses a sequential searching strategy to find feasible child nodes of the current node, which is inefficient. In contrast, DBI is equipped with an indexing mechanism that avoids sequential searching (a generic illustration of this difference is sketched after these observations), meaning that MCoR-Miner is significantly faster than NOSEP and DFOM-All. Secondly, although both DBI-All and MCoR-Miner employ DBI to calculate the support, DBI-All runs slower, since it has to discover all frequent patterns, many of which are not targeted patterns, and must then filter out these useless patterns. The experimental results therefore indicate that it is necessary to explore co-occurrence pattern mining.

  3.

    MCoR-Miner runs faster than NETGAP-MCoR, Netback-MCoR, and DFOM-MCoR on all datasets, which indicates the efficiency of DBI. For example, from Table IV, on BABYSALE, MCoR-Miner takes 5.36s, while NETGAP-MCoR, Netback-MCoR, and DFOM-MCoR take 113.00s, 112.17s, and 11.34s, respectively. Thus, MCoR-Miner is obviously faster than NETGAP-MCoR and Netback-MCoR, and is about twice as fast as DFOM-MCoR. The reason is that the four algorithms adopt different methods to calculate the support. NETGAP-MCoR and Netback-MCoR have to create the whole Nettree, which contains many useless nodes and parent-child relationships, so their time complexities are higher than those of DFOM-MCoR and MCoR-Miner. Although DFOM-MCoR does not create the whole Nettree, DFOM adopts a sequential searching strategy to find feasible child nodes of the current node, which is inefficient. Hence, MCoR-Miner outperforms NETGAP-MCoR, Netback-MCoR, and DFOM-MCoR.

  4.

    MCoR-Miner runs faster than MCoR-NoFilt on all datasets, which indicates the effectiveness of the filtering strategy. According to Table IV, on BABYSALE, MCoR-Miner takes 5.36s, while MCoR-NoFilt takes 5.53s, and the same phenomenon can be found on all the other datasets. The reason is that some useless sequences are filtered out. For example, from Table VI, 1,350 sequences are filtered out when mining all rules on BABYSALE, which means that it is not necessary to calculate the support on these 1,350 sequences, and the running performance is improved accordingly. The filtering strategy reduces not only the number of sequences but also the number of frequent patterns with length two, which in turn reduces the number of candidate patterns with length greater than two. For example, on the TRANSACTION dataset, there are only four frequent patterns with length two under the filtering strategy, but six without it, meaning that more time would be needed to discover all the rules without filtering. All in all, the filtering strategy is effective, and MCoR-Miner outperforms MCoR-NoFilt.

  5.

    MCoR-Miner runs faster than MCoR-AET and MCoR-FET on GAMESALE, BABYSALE, TRANSACTION, MOVIE, and MSNBC, but slower than MCoR-AET and MCoR-FET on SARS, SARS-Cov-2, and HIV, which indicates that the BET strategy has certain advantages and limitations. We choose two typical datasets to illustrate the reason. 1) Taking GAMESALE as an example, there are 12 items and 10 frequent items. Thus, there are $10\times 10=100$ candidate patterns with length two. After checking these 100 candidate patterns, we obtain 27 frequent patterns with length two. According to Table VI, using these 27 patterns, the BET strategy prunes 149 candidate patterns, while 221 candidate patterns are not pruned. Thus, at the cost of 100 extra support calculations for candidate patterns with length two, MCoR-Miner avoids 149 support calculations for longer candidate patterns, which improves the running performance on GAMESALE. 2) Taking HIV as another example, there are 20 frequent items. Thus, there are $20\times 20=400$ candidate patterns with length two. After checking these 400 candidate patterns, we obtain 286 frequent patterns with length two. According to Table VI, using these 286 patterns, the BET strategy prunes only eight candidate patterns, while 72 are not pruned. Thus, at the cost of 400 extra support calculations for candidate patterns with length two, MCoR-Miner avoids only eight support calculations, and the running time therefore increases. The experimental results show that the BET strategy has some limitations on the bio-sequence datasets. Hence, one of our methods, MCoR-Miner or MCoR-FET, achieves the best performance in all cases.

  6.

    MCoR-Miner runs faster than MCoR-NoScr on all datasets, which indicates the effectiveness of the screening strategy. For example, on TRANSACTION, MCoR-Miner requires 0.38s, while MCoR-NoScr requires 0.48s. The same phenomenon can be found on all the other datasets. The only difference between the two algorithms is that MCoR-Miner is equipped with the screening strategy, while MCoR-NoScr is not. The results therefore indicate that the screening strategy is an effective way to reduce the running time, and hence MCoR-Miner outperforms MCoR-NoScr.

  7.

    From Table V, we notice that MCoR-Miner generally consumes less memory when mining patterns. For example, on GAMESALE, MCoR-Miner consumes 13.6Mb of memory, which is less than any of the other competitive algorithms, and this result can be found on most of the datasets. In particular, MCoR-Miner consumes less memory than the NOSEP and DFOM-All algorithms. This is because, to calculate the support, NOSEP needs to create a whole Nettree, which contains a great deal of useless information. Although DFOM overcomes this drawback, it uses a sequential searching strategy to find feasible child nodes of the current node, which is inefficient. In contrast, MCoR-Miner uses the DBI algorithm, which is equipped with an indexing mechanism to avoid sequential searching. Hence, MCoR-Miner consumes less memory.
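Observations 2 and 3 above attribute much of the speedup to replacing sequential searching with an indexing mechanism. The following generic Python sketch (our illustration of the idea only, not the actual DBI or Nettree data structures) contrasts scanning a gap window position by position with jumping to the first feasible occurrence via a precomputed per-item position index:

```python
from bisect import bisect_left
from collections import defaultdict

seq = "adbdadadbdad"

def next_occ_scan(seq, item, lo, hi):
    # Sequential search: test every position in the gap window one by one.
    for i in range(lo, min(hi, len(seq) - 1) + 1):
        if seq[i] == item:
            return i
    return -1

# Indexing: record the positions of each item once, in increasing order.
index = defaultdict(list)
for i, c in enumerate(seq):
    index[c].append(i)

def next_occ_index(item, lo, hi):
    # Index-based search: jump to the first feasible position by binary search.
    pos = index[item]
    j = bisect_left(pos, lo)
    return pos[j] if j < len(pos) and pos[j] <= hi else -1

print(next_occ_scan(seq, "d", 2, 5))  # 3
print(next_occ_index("d", 2, 5))      # 3
```

Both calls return the same position, but the index-based lookup touches only the stored occurrences of the item rather than every position in the window.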

In summary, MCoR-Miner achieves better running performance and memory consumption performance than other competitive algorithms.

V-C Scalability

To validate the scalability of MCoR-Miner, we selected MSNBC as the experimental dataset, since it is the largest dataset in Table II. Moreover, we created MSNBC1, MSNBC2, MSNBC3, MSNBC4, MSNBC5, and MSNBC6, which were one, two, three, four, five, and six times the size of MSNBC, respectively. We set $gap$ = [0,8] and $mincf$ = 0.3, and the prefix pattern was $e$. Comparisons of the running time and memory usage are shown in Figs. 6 and 7, respectively.

Figure 6: Comparison of running time with different dataset sizes (s)
Figure 7: Comparison of memory usage with different dataset sizes (Mb)

The results give rise to the following observations.

From Figs. 6 and 7, we know that the running time and the memory usage of MCoR-Miner both grow linearly with the size of the dataset. For example, the size of MSNBC6 is six times that of MSNBC1. MCoR-Miner takes 16.08s on MSNBC1 and 92.52s on MSNBC6, giving 92.52/16.08 = 5.75. The memory usage of MCoR-Miner is 1358.8Mb on MSNBC6 and 228.8Mb on MSNBC1, giving 1358.8/228.8 = 5.94. Thus, the growth rates of the running time and memory usage are slightly lower than the growth rate of the dataset size, and the same phenomenon can be found for the other dataset sizes. These results indicate that the time and space complexities of MCoR-Miner are positively correlated with the dataset size. More importantly, MCoR-Miner is more than 10 times faster than MCoR-NoFilt, and the memory usage of MCoR-NoFilt is more than twice that of MCoR-Miner. We therefore conclude that MCoR-Miner has strong scalability, since its mining performance does not degrade as the dataset size increases.

V-D Influence of parameters

We assessed the effects of different gap constraints and minimum confidence on the number of rules and running time.

V-D1 Influence of different gap constraints

To analyze the influence of different gap constraints on the number of rules and the running time of MCoR-Miner, we selected GAMESALE as the experimental dataset, and MCoR-NoFilt, MCoR-AET, MCoR-FET, and MCoR-NoScr as the competitive algorithms. The prefix pattern was $d$ and the minimum confidence was 0.3. The gap constraints were [0,5], [0,6], [0,7], [0,8], [0,9], and [0,10]. A comparison of the running time is shown in Fig. 8, and the numbers of CoRs and MCoRs mined are shown in Table VII.

Figure 8: Comparison of running time with different $gap$ (s)
TABLE VII: Comparison of numbers of CoRs and MCoRs with different gaps
gap=[0,5] gap=[0,6] gap=[0,7] gap=[0,8] gap=[0,9] gap=[0,10]
Number of CoRs 8 14 20 36 63 134
Number of MCoRs 6 10 12 20 34 63

These results give rise to the following observations.

With the increase of the gap, the running time and the numbers of CoRs and MCoRs increase. For example, when gap = [0,5], MCoR-Miner takes 0.17s and mines eight CoRs and six MCoRs, whereas when gap = [0,8], MCoR-Miner takes 0.42s and mines 36 CoRs and 20 MCoRs. The same phenomenon can be found for all the other algorithms. The reason is as follows: with the increase of the gap, the support of a pattern increases. Thus, more patterns can be found, and since each co-occurrence pattern corresponds to a CoR, more CoRs and MCoRs are discovered and the running time also increases. MCoR-Miner outperforms the other competitive algorithms under all gap constraints.

V-D2 Influence of different minimum confidence

To report the influence of different minimum confidences on the number of rules and the running time of MCoR-Miner, we selected GAMESALE as the experimental dataset, and MCoR-NoFilt, MCoR-AET, MCoR-FET, and MCoR-NoScr as the competitive algorithms. The prefix pattern was $d$ and the gap constraint was [0,8]. We set the minimum confidence to 0.16, 0.20, 0.24, 0.28, 0.32, and 0.36. A comparison of the running time is shown in Fig. 9, and the numbers of CoRs and MCoRs mined are shown in Table VIII.

Figure 9: Comparison of running time with different $mincf$ (s)
TABLE VIII: Comparison of numbers of CoRs and MCoRs with different $mincf$
mincf=0.16 mincf=0.20 mincf=0.24 mincf=0.28 mincf=0.32 mincf=0.36
Number of CoRs 796 246 100 49 27 17
Number of MCoRs 431 134 53 25 15 9

The results give rise to the following observations.

With the increase of the minimum confidence, the running time and the numbers of CoRs and MCoRs decrease. For example, when $mincf$ = 0.16, MCoR-Miner takes 18.45s and mines 796 CoRs and 431 MCoRs, whereas when $mincf$ = 0.36, MCoR-Miner takes 0.30s and mines 17 CoRs and nine MCoRs. The same phenomenon can be found for the other algorithms. The reason is as follows: for a given prefix pattern, the minimum support threshold increases with $mincf$. Thus, fewer patterns are frequent, and since each co-occurrence pattern corresponds to a CoR, fewer CoRs and MCoRs are found and the running time decreases. MCoR-Miner outperforms the other competitive algorithms under all minimum confidence values.

V-E MCoR-Miner for recommendation

The recommendation task is to provide recommendation schemes based on the discovered rules with the same given antecedent $p$. To investigate the recommendation performance of MCoR-Miner, we compared the recommendation performance of co-occurrence pattern mining and rule mining. We therefore employed CoP-Miner as the competitive algorithm and selected the commerce datasets GAMESALE, BABYSALE, TRANSACTION, and MOVIE. The first 80% of the sequences were used as the training set, and the remaining 20% as the testing set. The parameters used for training are shown in Table IX.

TABLE IX: Parameter settings
Dataset Prefix Gap constraint mincf for MCoR-Miner minsup for CoP-Miner
GAMESALE d [0,3] 0.39 400
BABYSALE fc [0,3] 0.40 300
TRANSACTION xz [0,3] 0.40 100
MOVIE dee [0,3] 0.40 150000

To evaluate the performance of recommendation based on the co-occurrence rules, in addition to confidence, there are three commonly used criteria: precision $Pr$ = $TP/(TP+FP)$, recall $Re$ = $TP/(TP+FN)$, and F1-score $F1$ = $2\times Pr\times Re/(Pr+Re)$, where $TP$ is the number of correct recommendations, $FP$ is the number of incorrect recommendations, and $FN$ is the number of relevant items that are not included in the recommendation list. Taking GAMESALE as an example, there are two maximal co-occurrence rules in the training set for MCoR-Miner, $\{d \to d, d \to e\}$, which means that if a user purchases item $d$, then MCoR-Miner will recommend items $d$ and $e$ to the user. In the testing set, the patterns of length two with the prefix pattern $d$ are $da$, $db$, $dd$, $de$, $dj$, and $dk$, with supports 110, 108, 139, 281, 126, and 83, respectively. Therefore, in this example, $TP$ is 420, since the sum of the supports of patterns $dd$ and $de$ is 139 + 281 = 420. $FP$ is zero, since MCoR-Miner recommends items $d$ and $e$, and in the testing set the next item can actually be $a$, $b$, $d$, $e$, $j$, or $k$, which means that items $d$ and $e$ both occur, i.e., MCoR-Miner makes no erroneous recommendation. $FN$ is 427, since the sum of the supports of patterns $da$, $db$, $dj$, and $dk$ is 110 + 108 + 126 + 83 = 427. Thus, the precision, recall, and F1-score of MCoR-Miner on the testing set are 420/(420+0) = 1, 420/(420+427) = 0.4959, and $2\times 1\times 0.4959/(1+0.4959)$ = 0.6630, respectively. Comparisons of the confidence, recall, and F1-score results are shown in Figs. 10, 11, and 12, respectively.
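The worked GAMESALE example above can be reproduced with a few lines of Python (a minimal sketch using the supports quoted in the text; the variable names are ours):

```python
# Supports of the length-two test-set patterns with prefix d (from the text).
support = {"da": 110, "db": 108, "dd": 139, "de": 281, "dj": 126, "dk": 83}
recommended = {"d", "e"}  # consequents of the two maximal rules d->d and d->e

tp = sum(s for p, s in support.items() if p[1] in recommended)      # 139+281 = 420
fn = sum(s for p, s in support.items() if p[1] not in recommended)  # 110+108+126+83 = 427
fp = 0  # both recommended items do occur as next items in the testing set

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(precision, round(recall, 4), round(f1, 4))  # 1.0 0.4959 0.663
```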

Figure 10: Comparison of confidence
Figure 11: Comparison of recall
Figure 12: Comparison of F1-score

The results indicate that our method has the following advantages.

  1.

    MCoR-Miner can effectively prune the co-occurrence rules with low confidence. For example, Fig. 10 shows that the confidence of each maximal co-occurrence rule mined by MCoR-Miner is larger than 0.39, while the confidence of many of the co-occurrence patterns mined by CoP-Miner is less than 0.39, the lowest being only 0.14. This is because MCoR-Miner discovers rules with both high confidence and high frequency, while CoP-Miner only discovers rules with high frequency. Thus, MCoR-Miner can effectively prune co-occurrence patterns with low confidence, which improves the recommendation confidence.

  2.

    More importantly, MCoR-Miner performs better on the recommendation task than CoP-Miner. For example, from Figs. 11 and 12, on GAMESALE, the recall and F1-score of MCoR-Miner are 0.4959 and 0.6630, respectively, which are higher than those of CoP-Miner. This phenomenon can also be found on the other datasets, meaning that MCoR-Miner yields better recommendation performance than CoP-Miner.

  3.

    MCoR-Miner is a sequential rule-based recommendation method that is more interpretable than learning-based recommendation methods. The reason is as follows. MCoR-Miner discovers rules with high confidence by calculating the supports of patterns in the training set. Taking GAMESALE as an example, the supports of patterns $d$, $dd$, and $de$ are 1917, 970, and 763, respectively. Thus, MCoR-Miner generates two rules, $\{d \to d, d \to e\}$, whose confidences are 0.51 and 0.40, respectively (a short sketch of this calculation follows this list). MCoR-Miner therefore has better interpretability. In contrast, for a learning-based recommendation method, the recommendation model is very complex and cannot be expressed intuitively by a mathematical formula.
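As referenced in observation 3 above, the confidence calculation behind these two rules can be sketched in a few lines (a minimal illustration; the supports are taken from the text):

```python
# Confidence of a rule p -> q is sup(pq) / sup(p); GAMESALE supports from the text.
sup = {"d": 1917, "dd": 970, "de": 763}
for pattern in ("dd", "de"):
    conf = sup[pattern] / sup["d"]
    print(f"d -> {pattern[1]}: confidence = {conf:.2f}")
# d -> d: confidence = 0.51
# d -> e: confidence = 0.40
```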

In summary, MCoR-Miner has better recommendation performance than CoP-Miner. More importantly, MCoR-Miner can form effective recommendation schemes, which can help decision-makers make correct decisions.

VI CONCLUSION

In nonoverlapping SPM or nonoverlapping sequential rule mining, if the prefix pattern or rule-antecedent is known, it is not necessary to discover all patterns or rules and then filter out irrelevant patterns or rules. This problem is called co-occurrence pattern mining or co-occurrence rule mining. To avoid discovering irrelevant patterns and to obtain better recommendation performance, and inspired by the concept of maximal pattern mining, we investigate MCoR mining. Unlike classical rule mining, which requires two parameters, minimum confidence and minimum support, MCoR mining does not need the minimum support, since it can be automatically calculated based on the support of the rule-antecedent and the minimum confidence. To effectively discover all MCoRs with the same rule-antecedent, we propose the MCoR-Miner algorithm. In addition to the support calculation, MCoR-Miner consists of three parts: preparation stage, candidate pattern generation, and screening strategy. To effectively calculate the support, we propose the DBI algorithm, which adopts depth-first search and backtracking strategies equipped with an indexing mechanism. MCoR-Miner uses the filtering strategy to prune the sequences without the rule-antecedent to avoid useless support calculation. It adopts the FET and BET strategies to reduce the number of candidate patterns, and employs the screening strategy to avoid finding the maximal rules by brute force. To evaluate the performance of MCoR-Miner, eleven competitive algorithms were implemented, and eight datasets were considered. Our experimental results showed that MCoR-Miner yielded better running performance and scalability than the other competitive algorithms. More importantly, compared with frequent co-occurrence pattern mining, MCoR-Miner had better recommendation performance.
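As a brief restatement of why no separate minimum support threshold is required (our derivation from the standard confidence definition, consistent with the confidence values reported in Section V-E):

```latex
\mathit{conf}(p \to q) = \frac{\mathit{sup}(pq)}{\mathit{sup}(p)} \ge \mathit{mincf}
\;\Longleftrightarrow\;
\mathit{sup}(pq) \ge \mathit{mincf} \times \mathit{sup}(p)
```

Thus, the support threshold for all candidate rules with antecedent $p$ follows directly from $\mathit{sup}(p)$ and the user-specified $\mathit{mincf}$.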

In this paper, we have focused on nonoverlapping maximal co-occurrence rule mining. Nonoverlapping SPM is a kind of repetitive SPM, and there are similar mining methods, such as one-off SPM and disjoint SPM; different mining methods lead to different mining results. One-off and disjoint maximal co-occurrence rule mining therefore deserve further exploration, and which mining method is best for a specific mining task is also worth further study. More importantly, the sequences studied in this paper are special sequences consisting of single items, whereas a general sequence consists of itemsets, each of which contains many ordered items. It would be valuable to investigate frequent repetitive pattern mining and repetitive rule mining in general sequence databases.

Acknowledgement

This work was partly supported by the National Natural Science Foundation of China (61976240, 62120106008), the National Key Research and Development Program of China (2016YFB1000901), and the Natural Science Foundation of Hebei Province, China (No. F2020202013).

References

  • [1] X. Wu, X. Zhu, and M. Wu, “The evolution of search: Three computing paradigms,” ACM Trans. Manag. Inf. Syst., vol. 13, no. 2, pp. 20, 2022.
  • [2] C. H. Mooney and J. F. Roddick, “Sequential pattern mining–approaches and algorithms,” ACM Computing Surveys, vol. 45, no. 2, pp. 1-39, 2013.
  • [3] W. Gan, J. C. W. Lin, P. Fournier-Viger, H. C. Chao, and P. S. Yu, “A survey of parallel sequential pattern mining,” ACM Trans. Knowl. Discov. Data, vol. 13, no. 3, pp. 25: 1-34, 2019.
  • [4] Y. Wu, Q. Hu, Y. Li, L. Guo, X. Zhu, and X. Wu, “OPP-Miner: Order-preserving sequential pattern mining for time series,” IEEE Trans. Cybern., DOI: 10.1109/TCYB.2022.3169327, 2022.
  • [5] J. C. W. Lin, Y. Djenouri, and G. Srivastava, “Efficient closed high-utility pattern fusion model in large-scale databases,” Information Fusion, vol. 76, pp.122-132, 2021.
  • [6] G. Srivastava, J. C. W. Lin, X. Zhang, and Y. Li, “Large-scale high-utility sequential pattern analytics in Internet of Things,” IEEE Internet of Things Journal, vol. 8, no.16, pp. 12669-12678, 2021.
  • [7] X. Ao, H. Shi, J. Wang, L. Zuo, H. Li, and Q. He, “Large-scale frequent episode mining from complex event sequences with hierarchies,” ACM Trans. Intell. Syst. Technol., vol. 10, no. 4, pp. 1-26, 2019.
  • [8] L. Wang, X. Bao, and L. Zhou, “Redundancy reduction for prevalent co-location patterns,” IEEE Trans. Knowl. Data Eng., vol. 30, no. 1, pp. 142-155, 2018.
  • [9] L. Wang, Y. Fang, and L. Zhou, “Preference-based spatial co-location pattern mining,” Series Title: Big Data Management, Springer Singapore, DOI: 10.1007/978-981-16-7566-9, 2022.
  • [10] Y. Wu, Y. Wang, Y. Li, X. Zhu, and X. Wu, “Top-k self-adaptive contrast sequential pattern mining,” IEEE Trans. Cybern., vol. 52, no. 11, pp. 11819-11833, 2022.
  • [11] Q. Li, L. Zhao, Y. Lee, and J. Lin. “Contrast pattern mining in paired multivariate time series of a controlled driving behavior experiment,” ACM Trans. Spatial Algorithms Syst., vol. 6, no. 4, pp. 1-28, 2020.
  • [12] X. Dong, P. Qiu, J. Lv, L. Cao, and T. Xu, “Mining top-k useful negative sequential patterns via learning,” IEEE Trans. Neural Networks Learn. Syst., vol. 30, no. 9, pp. 2764-2778, 2019.
  • [13] Y. Wu, M. Chen, Y. Li, J. Liu, Z. Li, J. Li, and X. Wu. “ONP-Miner: One-off negative sequential pattern mining,” ACM Trans. Knowl. Discov. Data, DOI: 10.1145/3549940, 2022.
  • [14] W. Gan, J. C. W. Lin, P. Fournier-Viger, H.-C. Chao, V. S. Tseng, P. S. Yu, “A survey of utility-oriented pattern mining,” IEEE Trans. Knowl. Data Eng., vol. 33, no. 4, pp. 1306-1327, 2021.
  • [15] T. C. Truong, H. V. Duong, B. Le, P. Fournier-Viger, “Efficient vertical mining of high average-utility itemsets based on novel upper-bounds,” IEEE Trans. Knowl. Data Eng., vol. 31, no. 2, pp. 301-314, 2019.
  • [16] W. Song, L. Liu, and C. Huang. “Generalized maximal utility for mining high average-utility itemsets,” Knowl. Inf. Syst., vol. 63, pp. 2947–2967, 2021.
  • [17] J. C. W. Lin, T. Li, M. Pirouz, J. Zhang, and P. Fournier-Viger, “High average-utility sequential pattern mining based on uncertain databases,” Knowl. Inf. Syst., vol. 62, no. 3, pp: 1199-1228, 2020.
  • [18] T. Wang, L. Duan, G. Dong, and Z. Bao, “Efficient mining of outlying sequence patterns for analyzing outlierness of sequence data,” ACM Trans. Knowl. Discov. Data, vol. 14, no. 5, pp. 62, 2020.
  • [19] F. Min, Z. Zhang, W. Zhai, and R. Shen, “Frequent pattern discovery with tri-partition alphabets,” Inf. Sci., vol. 507, pp. 715-732, 2020.
  • [20] Y. Wu, L. Luo, Y. Li, L. Guo, P. Fournier-Viger, X. Zhu, and X. Wu, “NTP-Miner: Nonoverlapping three-way sequential pattern mining,” ACM Trans. Knowl. Discov. Data, vol. 16, no. 3, pp. 51, 2022.
  • [21] D. Amagata and T. Hara, “Mining top-k co-occurrence patterns across multiple streams,” IEEE Trans. Knowl. Data Eng., vol. 29, no. 10, pp. 2249-2262, 2017.
  • [22] H. Duong, T. Truong, A. Tran, and B. Le, “Fast generation of sequential patterns with item constraints from concise representations,” Knowl. Inf. Syst., vol. 62, no. 6, pp. 2191-2223, 2020.
  • [23] Y. Wang, Y. Wu, Y. Li, F. Yao, P. Fournier-Viger, and X. Wu, “Self-adaptive nonoverlapping sequential pattern mining,” Appl. Intell., vol. 52, no. 6, pp. 6646-6661, 2021.
  • [24] Y. Wu, Y. Tong, X. Zhu, and X. Wu, “NOSEP: Nonoverlapping sequence pattern mining with gap constraints,” IEEE Trans. Cybern., vol. 48, no. 10, pp. 2809-2822, 2018.
  • [25] Y. H. Ke, J. W. Huang, W. C. Lin, and B. P. Jaysawal, “Finding possible promoter binding sites in DNA sequences by sequential patterns mining with specific numbers of gaps,” IEEE ACM Trans. Comput. Biol. Bioinform., vol. 18, no. 6, pp. 2459-2470, 2020.
  • [26] M. Zhang, B. Kao, D. Cheung, and K. Yip, “Mining periodic patterns with gap requirement from sequences,” ACM Trans. Knowl. Discov. Data, vol. 1, no. 2, pp. 7, 2007.
  • [27] Q. Shi, J. Shan, W. Yan, Y. Wu, and X. Wu, “NetNPG: Nonoverlapping pattern matching with general gap constraints,” Appl. Intell. , vol. 50, no. 6, pp. 1832-1845, 2020.
  • [28] C. Li, Q. Yang, J. Wang, and M. Li, “Efficient mining of gap-constrained subsequences and its various applications,” ACM Trans. Knowl. Discov. Data, vol. 6, no. 1, pp. 2:1-39, 2012.
  • [29] S. Ghosh, J. Li, L. Cao, and K. Ramamohanarao, “Septic shock prediction for ICU patients via coupled HMM walking on sequential contrast patterns,” Journal of Biomedical Informatics, no. 66, pp. 19-31, 2017.
  • [30] F. Xie, X. Wu, and X. Zhu, “Efficient sequential pattern mining with wildcards for keyphrase extraction,” Knowl. Based Syst., vol. 115, pp. 27-39, 2017.
  • [31] W. Wang and L. Cao, “VM-NSP: Vertical negative sequential pattern mining with loose negative element constraints,” ACM Trans. Inf. Syst., vol. 39, no. 2, pp. 1-27, 2021.
  • [32] W. Wang, J. Tian, F. Lv, G. Xin, Y. Ma, and B. Wang, “Mining frequent pyramid patterns from time series transaction data with custom constraints,” Computers & Security, vol. 100, pp. 102088, 2021.
  • [33] Y. Wu, L. Wang, J. Ren, W. Ding, and X. Wu, “Mining sequential patterns with periodic wildcard gaps,” Appl. Intell., vol. 41, no. 1, pp. 99-116, 2014.
  • [34] O. Ouarem, F. Nouioua, and P. Fournier-Viger, “Mining episode rules from event sequences under non-overlapping frequency,” Intern. Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems. Springer, Cham, pp. 73-85, 2021.
  • [35] Y. Wu, X. Wang, Y. Li, L. Guo, Z. Li, J. Zhang, and X. Wu, “OWSP-Miner: Self-adaptive one-off weak-gap strong pattern mining,” ACM Trans. Manag. Inf. Syst., vol. 13, no. 3, pp. 25, 2022.
  • [36] P. Fournier-Viger, C.-W. Wu, V.S. Tseng, L. Cao, R. Nkambou, “Mining partially-ordered sequential rules common to multiple sequences,” IEEE Trans. Knowl. Data Eng., vol. 27, no. 8, pp. 2203-2216, 2015.
  • [37] J. S. Okolica, G. L. Peterson, R. F. Mills, and M. R. Grimaila, “Sequence pattern mining with variables,” IEEE Trans. Knowl. Data Eng., vol. 32, no. 1, pp. 177-187, 2020.
  • [38] J. D. Smedt, G. Deeva, and J. D. Weerdt, “Mining behavioral sequence constraints for classification,” IEEE Trans. Knowl. Data Eng., vol. 32, no. 6, pp. 1130-1142, 2020.
  • [39] P. Ma, Y. Wu, Y. Li, L. Guo, H. Jiang, X. Zhu, and X. Wu, “HW-Forest: Deep forest with hashing screening and window screening,” ACM Trans. Knowl. Discov. Data, vol. 16, no. 6, pp. 123, 2022.
  • [40] Y. Wu, C. Zhu, Y. Li, L. Guo, and X Wu, “NetNCSP: Nonoverlapping closed sequential pattern mining,” Knowl. Based Syst., vol. 196, pp. 105812, 2020.
  • [41] Y. Li, S. Zhang, L. Guo, J. Liu, Y. Wu, and X. Wu, “NetNMSP: Nonoverlapping maximal sequential pattern mining,” Appl. Intell. , vol. 52, pp. 9861–9884, 2022.
  • [42] J. C. W. Lin, M. Pirouz, Y. Djenouri, C. F. Cheng, and U. Ahmed, “Incrementally updating the high average-utility patterns with pre-large concept,” Appl. Intell. , vol. 50, no. 11, pp. 3788-3807, 2020.
  • [43] Y. Baek, U. Yun, H. Kim, H. Nam, H. Kim, J. C. W. Lin, B. Vo, and W. Pedrycz, “RHUPS: Mining recent high utility patterns with sliding window-based arrival time control over data streams,” ACM Trans. Intell. Syst. Technol., vol. 12, no. 2, pp. 16:1-16:27, 2021.
  • [44] Y. S. Koh, and S. D. Ravana, “Unsupervised rare pattern mining: a survey,” ACM Trans. Knowl. Discov. Data, vol. 10, no. 4, pp. 1-29, 2016.
  • [45] L. Cao, X. Dong, and Z. Zheng, “e-NSP: Efficient negative sequential pattern mining,” Artificial Intelligence, vol. 235, pp. 156-182, 2016.
  • [46] X. Dong, Y. Gong, and L. Cao, “e-RNSP: An efficient method for mining repetition negative sequential patterns,” IEEE Trans. Cybern., vol. 50, no. 5, pp. 2084-2096, 2020.
  • [47] J. C. W. Lin, Y. Djenouri, G. Srivastava, Y. Li, and P. S. Yu, “Scalable mining of high-utility sequential patterns with three-tier MapReduce model,” ACM Trans. Knowl. Discov. Data, vol. 16, no. 3, pp. 1-26, 2022.
  • [48] W. Gan, J. C. W. Lin, J. Zhang, P. Fournier-Viger, H. C. Chao, and P. S. Yu, “Fast utility mining on sequence data,” IEEE Trans. Cybern., vol. 51, no. 2, pp. 487-500, 2021.
  • [49] Y. Wu, R. Lei, Y. Li, L. Guo, and X. Wu, “HAOP-Miner: Self-adaptive high-average utility one-off sequential pattern mining,” Expert Syst. Appl., vol. 184, pp. 115449, 2021.
  • [50] M. S. Nawaz, P. Fournier-Viger, U. Yun, Y. Wu, and W. Song, “Mining high utility itemsets with hill climbing and simulated annealing,” ACM Trans. Manag. Inf. Syst., vol. 13, no. 1, pp. 4, 2022.
  • [51] W. Gan, J. C. W. Lin, J. Zhang, H. Yin, P. Fournier-Viger, H. C. Chao, P. S. Yu, “Utility mining across multi-dimensional sequences,” ACM Trans. Knowl. Discov. Data, vol. 15, no. 5, pp. 82 ,2021.
  • [52] P. Fournier-Viger, W. Gan, Y. Wu, M. Nouioua, W. Song, T. Truong, and H. Duong, “Pattern mining: Current challenges and opportunities,” DASFAA 2022 International Workshops, pp 34–49, 2022.
  • [53] P. Fournier-Viger, M. S. Nawaz, Y. He, Y. Wu, F. Nouioua, and U. Yun, “MaxFEM: Mining maximal frequent episodes in complex event sequences,” International Conference on Multi-disciplinary Trends in Artificial Intelligence , pp 86–98, 2022.
  • [54] Y. Li, L. Yu, J. Liu, L. Guo, Y. Wu, and X. Wu, “NetDPO: (delta, gamma)-approximate pattern matching with gap constraints under one-off condition,” Appl. Intell., vol. 52, no. 11, pp. 12155-12174, 2022.
  • [55] Y. Wu, Z. Yuan, Y. Li, L. Guo, P. Fournier-Viger, and X. Wu, “NWP-Miner: Nonoverlapping weak-gap sequential pattern mining,” Inf. Sci., vol. 588, pp. 124-141, 2022.
  • [56] Y. Wu, M. Geng, Y. Li, L. Guo, Z. Li, P. Fournier-Viger, X. Zhu, and X. Wu, “HANP-Miner: High average utility nonoverlapping sequential pattern mining,” Knowl. Based Syst., vol. 229, pp. 107361, 2021.
  • [57] J. Zhan, J. Ye, W. Ding, and P. Liu, “A novel three-way decision model based on utility theory in incomplete fuzzy decision systems,” IEEE Trans. Fuzzy Syst., vol. 30, no. 7, pp. 2210 - 2226, 2022.
  • [58] X. Liu, X. Niu, and P. Fournier-Viger, “Fast Top-K association rule mining using rule generation property pruning,” Appl. Intell., vol. 51, no. 4, pp. 2077-2093, 2021.
  • [59] X. Ao, P. Luo, J. Wang, F. Zhuang, and Q. He, “Mining precise-positioning episode rules from event sequences,” IEEE Trans. Knowl. Data Eng., vol. 30, no. 3, pp. 530-543, 2017.
  • [60] Y. Wu, X. Zhao, Y. Li, L. Guo, X. Zhu, P. Fournier-Viger, and X. Wu, “OPR-Miner: Order-preserving rule mining for time series,” IEEE Trans. Knowl. Data Eng., DOI: 10.1109/TKDE.2022.3224963, 2022.
  • [61] R. Wang, W. Ji, M. Liu, X. Wang, J. Weng, S. Deng, S. Gao, and C. Yuan, “Review on mining data from multiple data sources,” Pattern Recognition Letters, vol. 109, pp 120-128, 2018.
  • [62] S. Saleti and R. B. V. Subramanyam, “A novel mapreduce algorithm for distributed mining of sequential patterns using co-occurrence information,” Appl. Intell. , vol. 49, no. 1, pp. 150-171, 2019.
  • [63] G. Huang, W. Gan, and P. S. Yu, “TaSPM: Targeted sequential pattern mining,” arXiv preprint arXiv:2202.13202, 2022.
  • [64] M. Celik, S. Shekhar, J. P. Rogers, and J. A. Shine, “Mixed-drove spatiotemporal co-occurrence pattern mining,” IEEE Trans. Knowl. Data Eng., vol. 20, no. 10, pp. 1322-1335, 2008.
  • [65] Y. Wu, C. Shen, H. Jiang, and X. Wu, “Strict pattern matching under non-overlapping condition,” Sci. China Inf. Sci., vol. 60, no. 1, pp. 012101, 2017.
  • [66] Y. Wu, X. Liu, W. Yan, L. Guo, and X. Wu, “Efficient algorithm for solving strict pattern matching under nonoverlapping condition,” Journal of Software, vol. 32, no. 11, pp. 3331-3350, 2021.
Yan Li received the Ph.D. degree in Management Science and Engineering from Tianjin University, Tianjin, China. She is an associate professor with the Hebei University of Technology. Her current research interests include data mining and supply chain management.
Chang Zhang received the master's degree from the School of Economics and Management, Hebei University of Technology, Tianjin, China. His current research interests include data mining and machine learning.
Jie Li received the Ph.D. degree in electrical engineering from the Hebei University of Technology, Tianjin, China, in 2002. She is currently a Professor with the School of Economics and Management, Hebei University of Technology. Her research interests include big data analytics in smart health care, business, and finance.
Wei Song received the Ph.D. degree in Computer Science from the University of Science and Technology Beijing, Beijing, China, in 2008. He is a professor with the School of Information Science and Technology at North China University of Technology. His research interests are in the areas of data mining and knowledge discovery. He has published more than 40 research papers in refereed journals and international conferences.
Zhenlian Qi received the Ph.D. degree in Environmental Science and Engineering from the Harbin Institute of Technology, Harbin, China, in 2020. She was a postdoc with the Guangdong University of Technology from 2020 to 2022. She is currently an Associate Professor with Guangdong Eco-Engineering Polytechnic, Guangzhou, China. Her research interests include environmental engineering, eco-big data, and data analytics.
Youxi Wu received the Ph.D. degree in Theory and New Technology of Electrical Engineering from the Hebei University of Technology, Tianjin, China. He is currently a Ph.D. supervisor and a Professor with the Hebei University of Technology. He has published more than 30 research papers in journals such as IEEE TKDE, IEEE TCYB, ACM TKDD, ACM TMIS, SCIS, INS, JCST, KBS, EWSA, JIS, Neurocomputing, and APIN. He is a senior member of the CCF and a member of the IEEE. His current research interests include data mining and machine learning.
Xindong Wu received the Ph.D. degree from the University of Edinburgh, Edinburgh, U.K. He is a Yangtze River Scholar with the Hefei University of Technology, Hefei, China. His current research interests include data mining, big data analytics, knowledge-based systems, and Web information exploration. Dr. Wu was the Editor-in-Chief of the IEEE Transactions on Knowledge and Data Engineering (2005-2008) and is the Steering Committee Chair of the IEEE International Conference on Data Mining and the Editor-in-Chief of Knowledge and Information Systems. He is a Fellow of the American Association for the Advancement of Science and of the IEEE.