
FOLD-TR: A Scalable and Efficient Inductive Learning Algorithm for Learning To Rank

Huaduo Wang    Gopal Gupta
Computer Science Department, The University of Texas at Dallas, Richardson, USA
{huaduo.wang,gupta}@utdallas.edu
Abstract

FOLD-R++ is a new inductive learning algorithm for binary classification tasks. It generates an (explainable) normal logic program for mixed-type (numerical and categorical) data. We present a customized FOLD-R++ algorithm embedded in a ranking framework, called FOLD-TR, that aims to rank new items following the ranking pattern in the training data. Like FOLD-R++, the FOLD-TR algorithm is able to handle mixed-type data directly and provides a native justification to explain the comparison between a pair of items.

1 Introduction

The dramatic success of machine learning has led to a torrent of Artificial Intelligence (AI) applications. However, the effectiveness of these systems is limited by the machines’ current inability to explain their decisions and actions to human users. This is mainly because statistical machine learning methods produce models that are complex algebraic solutions to optimization problems such as risk minimization or geometric margin maximization. The lack of intuitive descriptions makes it hard for users to understand and verify the underlying rules that govern the model. Also, these methods cannot produce a justification for a prediction they arrive at for a new data sample.

The Explainable AI program (Gunning 2015) aims to create a suite of machine learning techniques that: a) produce more explainable models, while maintaining a high level of prediction accuracy, and b) enable human users to understand, appropriately trust, and effectively manage the emerging generation of artificially intelligent partners. Inductive Logic Programming (ILP) (Muggleton 1991) is one machine learning technique where the learned model is in the form of logic programming rules (Horn clauses) that are comprehensible to humans. It allows the background knowledge to be incrementally extended without requiring the entire model to be re-learned. Meanwhile, the comprehensibility of symbolic rules makes it easier for users to understand and verify induced models and refine them.

The ILP learning problem can be regarded as a search problem for a set of clauses that deduce the training examples. The search is performed either top-down or bottom-up. A bottom-up approach builds most-specific clauses from the training examples and searches the hypothesis space by using generalization. This approach is not applicable to large-scale datasets, nor can it incorporate negation-as-failure into the hypotheses. A survey of bottom-up ILP systems and their shortcomings can be found in (?). In contrast, a top-down approach starts with the most general clause and then specializes it. A top-down algorithm guided by heuristics is better suited for large-scale and/or noisy datasets (?).

Learning To Rank (LTR), or Machine Learned Ranking (MLR), is a classical machine learning application that has been widely used in information retrieval systems, recommendation systems, and search engines. The training data for LTR is a set of items with an explicit or implicit partial order specified between items. The goal of training an LTR model is to rank new items following the pattern in the training data. In practice, an LTR model retrieves a list of documents based on a given query. Generally, there are three approaches to building an LTR model: 1) the pointwise approach, which assumes that each query-document pair has a numerical score, so that the LTR problem can be approximated by a regression problem; 2) the pairwise approach, which employs a binary classifier to predict which document in a given pair is better for the query; and 3) the listwise approach, which aims to optimize the result list directly based on the evaluation metrics.
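
As a concrete illustration of the pairwise approach, a ranked list can be reduced to binary classification examples over pairs of items. The sketch below is our own illustration (not code from any particular LTR system): a pair (a, b) is labeled positive when a is ranked above b.

    def pairwise_examples(ranked_items):
        """Reduce a ranked list (best item first) to binary classification
        examples: the pair (a, b) is positive iff a is ranked above b."""
        pos, neg = [], []
        for i, a in enumerate(ranked_items):
            for b in ranked_items[i + 1:]:
                pos.append((a, b))  # a is ranked above b
                neg.append((b, a))  # the swapped pair is a negative example
        return pos, neg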

The FOIL algorithm by Quinlan is a popular top-down inductive logic programming algorithm that learns a logic program. The FOLD algorithm by Shakerin et al. is a novel top-down algorithm that learns default rules along with exception(s) that closely model human thinking, as explained in (Shakerin, Salazar, and Gupta 2017). It first learns default predicates that cover positive examples while avoiding covering negative examples. Then it swaps the covered positive examples and negative examples and calls itself recursively to learn the exceptions to the default. It repeats this process to learn exceptions to exceptions, exceptions to exceptions to exceptions, and so on. The FOLD-R++ algorithm (Wang and Gupta 2022) is a new scalable ILP algorithm that builds upon the FOLD algorithm to deal with the efficiency and scalability issues of the FOLD and FOIL algorithms. It introduces the prefix sum computation and other optimizations to speed up the learning process while providing a human-friendly explanation for its predictions.
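
To give an intuition for the prefix-sum optimization, consider searching for the best numerical literal of the form feature ≤ t. After sorting the candidate thresholds, running (prefix) counts of covered positive and negative examples let every threshold be evaluated in a single pass, instead of rescanning the data for each candidate. The sketch below is illustrative only: info_gain is a simplified FOIL-style stand-in for FOLD-R++'s actual information-gain function, and the function names are ours.

    import math
    from collections import Counter

    def info_gain(tp, fn, tn, fp):
        """A FOIL-style stand-in heuristic, not FOLD-R++'s exact IG function."""
        if tp == 0:
            return float('-inf')
        return tp * math.log2(tp / (tp + fp))

    def best_numeric_literal(pos_vals, neg_vals):
        """Find the best 'feature <= t' literal in one pass over the sorted
        candidate thresholds, using running prefix counts."""
        thresholds = sorted(set(pos_vals) | set(neg_vals))
        pos_cnt, neg_cnt = Counter(pos_vals), Counter(neg_vals)
        total_p, total_n = len(pos_vals), len(neg_vals)
        tp = fp = 0  # prefix counts of examples with value <= t
        best_gain, best_t = float('-inf'), None
        for t in thresholds:
            tp += pos_cnt[t]
            fp += neg_cnt[t]
            g = info_gain(tp, total_p - tp, total_n - fp, fp)
            if g > best_gain:
                best_gain, best_t = g, t
        return best_t, best_gain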

To introduce explainability to LTR tasks, we customize the FOLD-R++ algorithm to serve as the binary classifier in the pairwise approach, building a comparison function from the partial order specified between training items. We call this customized FOLD-R++ algorithm, together with its ranking framework, the FOLD-TR algorithm. The new approach inherits the features of the FOLD-R++ algorithm: it is able to handle mixed-type data directly without any data encoding, and it generates normal logic program rule sets as explainable solutions, along with native explanations for its predictions.

2 Background

2.1 Inductive Logic Programming

Inductive Logic Programming (ILP) (Muggleton 1991) is a subfield of machine learning that learns models in the form of logic programming rules (Horn clauses) that are comprehensible to humans. This problem is formally defined as:
Given

  1. A background theory $B$, in the form of an extended logic program, i.e., clauses of the form $h \leftarrow l_1, \dots, l_m,\ not\ l_{m+1}, \dots,\ not\ l_n$, where $l_1, \dots, l_n$ are positive literals and $not$ denotes negation-as-failure (NAF) (Baral 2003; Gelfond and Kahl 2014). We require that $B$ has no loops through negation, i.e., it is stratified.

  2. Two disjoint sets of ground target predicates $E^{+}$ and $E^{-}$, known as positive and negative examples, respectively.

  3. A hypothesis language of function-free predicates $L$, and a refinement operator $\rho$ under $\theta$-subsumption (Plotkin 1971) that would disallow loops over negation.

Find a set of clauses $H$ such that:

  • $\forall e \in E^{+},\ B \cup H \models e$

  • $\forall e \in E^{-},\ B \cup H \not\models e$

  • $B \wedge H$ is consistent.

2.2 Default Rules

Default Logic, proposed by (Reiter 1980), is a non-monotonic logic to formalize commonsense reasoning. A default $D$ is an expression of the form

$$\frac{A : \textbf{M}B}{\Gamma}$$

which states that the conclusion $\Gamma$ can be inferred if the pre-requisite $A$ holds and $B$ is justified. $\textbf{M}B$ stands for “it is consistent to believe $B$,” as explained in the book by (Gelfond and Kahl 2014). Normal logic programs can encode a default quite elegantly. A default of the form:

$$\frac{\alpha_1 \land \alpha_2 \land \dots \land \alpha_n : \textbf{M}\lnot\beta_1, \textbf{M}\lnot\beta_2, \dots, \textbf{M}\lnot\beta_m}{\gamma}$$

can be formalized as the following normal logic program rule:

$$\gamma ~\texttt{:-}~ \alpha_1, \alpha_2, \dots, \alpha_n, \texttt{not}~\beta_1, \texttt{not}~\beta_2, \dots, \texttt{not}~\beta_m.$$

where the $\alpha$’s and $\beta$’s are positive predicates and $\texttt{not}$ represents negation-as-failure (under the stable model semantics as described in (?)). We call such rules default rules. Thus, the default $\frac{bird(X) : \textbf{M}\lnot penguin(X)}{fly(X)}$ will be represented as the following ASP-coded default rule:

fly(X) :- bird(X), not penguin(X).

We call bird(X), the condition that allows us to jump to the default conclusion that X can fly, the default part of the rule, and not penguin(X) the exception part of the rule.

Default rules closely represent the human thought process (commonsense reasoning). FOLD-R and FOLD-R++ learn default rules represented as answer set programs. Note that the programs currently generated are stratified normal logic programs; however, we eventually hope to learn non-stratified answer set programs too, as in the work of (Sakama 2005) and (Shakerin and Gupta 2018). Hence, we continue to use the term answer set program for a normal logic program in this paper. An advantage of learning default rules is that we can distinguish between exceptions and noise, as explained by (?) and (?). The introduction of (nested) exceptions, i.e., abnormal predicates, in a default rule increases the coverage of the data by that default rule. A single rule can now cover more examples, which results in a reduced number of generated rules. The equivalent program without the abnormal predicates would have many more rules if the abnormal predicate calls were fully expanded.

3 The FOLD-R++ Algorithm

The FOLD-R++ algorithm is summarized in Algorithm 1. The output of the FOLD-R++ algorithm is a set of default rules (Reiter 1980) coded as a normal logic program. An example implied by any rule in the set is classified as positive. Therefore, the FOLD-R++ algorithm rules out the already covered positive examples at line 9 after learning a new rule. To learn a particular rule, the best literal is repeatedly selected, based on information gain over the remaining training examples, and added to the default part of the rule’s body (line 17). Next, only the examples that can be covered by the learned default literals are used for further learning (specializing) of the current rule (lines 20–21). When the information gain becomes zero or the number of negative examples drops below the ratio threshold, the learning of the default part is done. FOLD-R++ next learns exceptions after first learning the default literals. This is done by swapping the residual positive and negative examples and calling itself recursively at line 26. The remaining positive and negative examples can be swapped again and exceptions to exceptions learned (and then swapped further to learn exceptions to exceptions of exceptions, and so on). The $ratio$ parameter in Algorithm 1 represents the ratio of training examples that are part of the exception to the examples implied by only the default conclusion part of the rule. It allows users to control the nesting level of exceptions.

Algorithm 1 FOLD-R++ Algorithm
Input: E+: positive examples, E−: negative examples
Output: R = {r1, ..., rn}: a set of default rules with exceptions
1:  function fold_rpp(E+, E−, L_used)
2:      R ← Ø    ▷ L_used: used literals, initially empty
3:      while |E+| > 0 do
4:          r ← learn_rule(E+, E−, L_used)
5:          E_FN ← covers(r, E+, false)
6:          if |E_FN| = |E+| then
7:              break
8:          end if
9:          E+ ← E_FN    ▷ rule out covered examples
10:         R ← R ∪ {r}
11:     end while
12:     return R
13: end function
14: function learn_rule(E+, E−, L_used)
15:     L ← Ø
16:     while true do
17:         l ← find_best_literal(E+, E−, L_used)
18:         L ← L ∪ {l}
19:         r ← set_default(r, L)
20:         E+ ← covers(r, E+, true)
21:         E− ← covers(r, E−, true)
22:         if l is invalid or |E−| ≤ |E+| * ratio then
23:             if l is invalid then
24:                 r ← set_default(r, L \ {l})
25:             else
26:                 AB ← fold_rpp(E−, E+, L_used + L)
27:                 r ← set_exception(r, AB)
28:             end if
29:             break
30:         end if
31:     end while
32:     return r
33: end function
Example 1

In the FOLD-R++ algorithm, the target is to learn rules for fly(X). $B$, $E^{+}$, and $E^{-}$ are the background knowledge, positive examples, and negative examples, respectively.

B:  bird(X) :- penguin(X).
    bird(tweety).   bird(et).
    cat(kitty).     penguin(polly).
E+: fly(tweety).    fly(et).
E-: fly(kitty).     fly(polly).

The target predicate {fly(X) :- true.} is specified when the learn_rule function is called at line 4. The function selects the literal bird(X) and adds it to the clause r = {fly(X) :- bird(X).} because bird(X) has the best information gain among {bird, penguin, cat}. Then, the training set gets updated to $E^{+} = \{tweety, et\}$, $E^{-} = \{polly\}$ at lines 20–21. The negative example polly is still implied by the generated clause and is thus a false positive. The default learning of the learn_rule function finishes because the best information gain of the candidate literals is zero. Therefore, the fold_rpp function is called recursively with swapped positive and negative examples, $E^{+} = \{polly\}$, $E^{-} = \{tweety, et\}$, to learn exceptions. In this case, an abnormal predicate {ab0(X) :- penguin(X).} is generated and returned as the only exception to the previously learned clause, which becomes r = {fly(X) :- bird(X), not ab0(X).} The abnormal rule {ab0(X) :- penguin(X).} is added to the final rule set, producing the program below:

fly(X) :- bird(X), not ab0(X).
ab0(X) :- penguin(X).

3.1 Sampling

A classical approach to building an LTR model is to train a binary classifier that predicts the better item in a given pair of examples. Given a ranked data list, FOLD-TR needs to prepare pairs of items as the training data for the customized FOLD-R++ algorithm inside it. Fully utilizing all pairs of items in the ranked data list for training would be inefficient. Since the learning target is a transitive relation, FOLD-TR assumes that the pattern learned from closely ranked item pairs also applies to item pairs that are far apart in the ranking. FOLD-TR therefore samples item pairs based on their ranking positions using a normal distribution; in other words, FOLD-TR focuses more on closely ranked item pairs.
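
A simple way to realize this biased sampling is to draw the rank distance of each pair from a normal distribution centered at zero. The following is a minimal sketch under our own assumptions; the function name and the sigma parameter are ours, not taken from FOLD-TR:

    import random

    def sample_pairs(ranked_items, n_pairs, sigma=3.0):
        """Sample item pairs biased toward closely ranked items: the rank
        distance of each pair is drawn from a normal distribution.
        `sigma` is an assumed tuning parameter, not from FOLD-TR."""
        n = len(ranked_items)
        assert n >= 2
        pairs = []
        while len(pairs) < n_pairs:
            i = random.randrange(n)
            d = max(1, round(abs(random.gauss(0, sigma))))  # rank distance
            if i + d < n:
                # ranked_items[i] is ranked above ranked_items[i + d]
                pairs.append((ranked_items[i], ranked_items[i + d]))
        return pairs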

3.2 Literal Selection

The literal selection process of the FOLD-TR algorithm is customized for learning the comparison function. The FOLD-R++ algorithm divides features into two categories, numerical features and categorical features, and is able to handle mixed-type data directly with its own comparison assumption. FOLD-TR employs the same comparison assumption: two numerical (resp. categorical) values are equal if they are identical, else they are unequal, and the numerical comparison ($\leq$ and $>$) between two numerical values is straightforward. However, a numerical comparison between a numerical value and a categorical value is assumed to always be false.
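
The following snippet captures our reading of this comparison assumption (illustrative only; literal_holds is not a function from the FOLD-R++ code):

    def literal_holds(op, x, y):
        """Mixed-type comparison semantics as we read them: equality holds
        only between identical values; <= and > require two numerical
        operands and are false otherwise."""
        def is_num(v):
            return isinstance(v, (int, float)) and not isinstance(v, bool)
        if op == '==':
            return x == y
        if op == '!=':
            return x != y
        if not (is_num(x) and is_num(y)):
            return False  # numerical comparison with a categorical value fails
        return x <= y if op == '<=' else x > y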

FOLD-TR compares the values of each feature of an item pair: the customized FOLD-R++ inside FOLD-TR generates numerical difference literals for numerical features and categorical equality literal pairs for categorical features. Therefore, the FOLD-TR algorithm chooses literals in pairs, which is quite different from the original FOLD-R++ algorithm. To accommodate the literal pair selection, Algorithm 2 expands the sampled data into pairwise comparison data.

Algorithm 2 Data Plotting
Input: E: sampled examples, idx_C: categorical feature indexes, idx_N: numerical feature indexes
Output: E_X: expanded examples
1:  function plot(E, idx_C, idx_N)
2:      E_X ← Ø
3:      for i ← 1 to |E| do
4:          for j ← i + 1 to |E| do
5:              e ← Ø
6:              for k ∈ idx_N do
7:                  e.append(E_i[k] − E_j[k])
8:              end for
9:              for k ∈ idx_C do
10:                 e.append(E_i[k])
11:                 e.append(E_j[k])
12:             end for
13:             E_X ← E_X ∪ e
14:         end for
15:     end for
16:     return E_X
17: end function
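
For concreteness, Algorithm 2 translates directly into Python. This is a sketch under our assumptions: examples are represented as lists of feature values, and the function name mirrors the pseudocode.

    def plot(examples, idx_c, idx_n):
        """A direct Python rendering of Algorithm 2: expand sampled examples
        into pairwise rows of numerical differences and categorical value
        pairs."""
        expanded = []
        for i in range(len(examples)):
            for j in range(i + 1, len(examples)):
                row = []
                for k in idx_n:              # numerical feature: difference
                    row.append(examples[i][k] - examples[j][k])
                for k in idx_c:              # categorical feature: both values
                    row.append(examples[i][k])
                    row.append(examples[j][k])
                expanded.append(row)
        return expanded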

With the plotted data, FOLD-TR can extract numerical literal pairs easily with one round of heuristic evaluation. For categorical features, the FOLD-TR algorithm evaluates the corresponding categorical literal pairs greedily: it first finds the optimal literal for one feature, then finds the other literal of the pair based on the selected one, as shown in Algorithm 3.

Algorithm 3 Best Literal Pair
Input: E+: positive examples, E−: negative examples, i, j: feature indexes, L_used: used literals
Output: ig: information gain, left, right: the selected literal pair
1:  function best_literal_pair(E+, E−, i, j, L_used)
2:      if |E+| = 0 and |E−| = 0 then
3:          return −∞, invalid, invalid
4:      end if
5:      left ← best_info_gain(E+, E−, i, L_used)
6:      if left is invalid then
7:          return −∞, invalid, invalid
8:      else
9:          rule ← set_default(rule, {left})
10:         E_tp ← covers(rule, E+, true)
11:         E_fp ← covers(rule, E−, true)
12:         right ← best_info_gain(E_tp, E_fp, j, L_used)
13:     end if
14:     rule ← set_default(rule, {left, right})
15:     E_tp, E_fn, E_tn, E_fp ← classify(rule, E+, E−)
16:     ig ← IG(E_tp, E_fn, E_tn, E_fp)
17:     return ig, left, right
18: end function

3.3 Justification

Explainability is very important for tasks like loan approval, credit card approval, and disease diagnosis. Compared to black-box models such as those based on neural networks, answer set programming provides explicit rules that show how a prediction is generated. To justify its predictions efficiently, FOLD-TR provides a native explanation for each prediction.

Example 2

The “Boston house prices” dataset is a classical regression task that contains 506 houses, used as training and testing examples, with prices based on features such as weighted distances to five Boston employment centers, average number of rooms per dwelling, and index of accessibility to radial highways. FOLD-TR generates the following program with only 7 rules:

(1) better(A,B) :- rm(A,NA5), rm(B,NB5), NA5-NB5>0.156,
    not ab5(A,B).
(2) better(A,B) :- rm(A,NA5), rm(B,NB5), NA5-NB5=<0.154,
    crim(A,NA0), crim(B,NB0), NA0-NB0=<-5.806.
(3) ab1(A,B) :- crim(A,NA0), crim(B,NB0), NA0-NB0=<4.994,
    indus(A,NA2), indus(B,NB2), NA2-NB2>10.72.
(4) ab2(A,B) :- age(A,NA6), age(B,NB6), NA6-NB6=<2.6,
    crim(A,NA0), crim(B,NB0), NA0-NB0=<3.992.
(5) ab3(A,B) :- age(A,NA6), age(B,NB6), NA6-NB6>-9.0,
    rm(A,NA5), rm(B,NB5), NA5-NB5=<0.363, not ab1(A,B),
    not ab2(A,B).
(6) ab4(A,B) :- b(A,NA11), b(B,NB11), NA11-NB11>-64.79,
    crim(A,NA0), crim(B,NB0), NA0-NB0=<6.595, not ab3(A,B).
(7) ab5(A,B) :- crim(A,NA0), crim(B,NB0), NA0-NB0>2.415,
    not ab4(A,B).

Note that better(A,B) means that house A is more expensive than house B. The above program represents the comparison function and achieves 0.82 accuracy, 0.86 precision, 0.75 recall, and 0.80 $F_1$ score. Given a pair of examples, the generated rules can easily predict which one is more expensive than the other. The ranking framework of FOLD-TR then gives a ranked list of items based on the above program.
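
The ranking step itself can be realized by sorting with the learned comparator. Below is a minimal sketch, assuming a predicate function better(a, b) that queries the induced better/2 relation; the handling of incomparable pairs is our own choice, not prescribed by FOLD-TR:

    from functools import cmp_to_key

    def rank(items, better):
        """Order items with a learned pairwise comparator better(a, b)."""
        def cmp(a, b):
            if better(a, b):
                return -1  # a should come before b
            if better(b, a):
                return 1
            return 0       # incomparable pair: leave relative order to the sort
        return sorted(items, key=cmp_to_key(cmp))

Since the induced relation is not guaranteed to be a total order, the final list can depend on the sort’s comparison order; a practical framework may instead aggregate pairwise wins before ordering.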

The FOLD-TR ranking system can also generate a proof of its prediction in a human-understandable form. For example, here is the justification tree generated for a pair of examples:

    Proof Tree for example number 8 :
    the item A is better than item B DOES HOLD because
        the rm value of A minus the rm value of B should
            be less equal to 0.154 (DOES HOLD)
        the crim value of A minus the crim value of B
            should be less equal to -5.806 (DOES HOLD)
    {rm(A, 6.575), rm(B, 5.887), crim(A, 0.00632),
    crim(B, 13.3598)}

This justification tree is also shown in another format: by showing which rules were involved in the proof/justification. For each call in each rule that was invoked, FOLD-TR shows whether it is true ([T]) or false ([F]). The head of each applicable rule is similarly annotated.

    [T]better(A,B) :- [T]rm(A,NA5), rm(B,NB5),
        NA5-NB5=<0.154, [T]crim(A,NA0), crim(B,NB0),
        NA0-NB0=<-5.806.
    {rm(A, 6.575), rm(B, 5.887), crim(A, 0.00632),
    crim(B, 13.3598)}
    

4 Experiments

In this section, we present our experiments on several datasets. All training was run on a laptop with an Intel i7-9750H CPU @ 2.6 GHz and 16 GB RAM. The ranking model is implemented with the pairwise method; we use accuracy, precision, recall, and F1 score to evaluate the results. We split each dataset into 80% training data and 20% test data. The training data is sampled with a normal distribution and plotted for each feature. We ran the experiment on each reported dataset 5 times with random splits, and the scores are averaged.

Dataset                      FOLD-TR
Name           #Row   #Col   Acc    Prec   Rec    F1     #Rule  #Pred
boston house   506    14     0.81   0.80   0.82   0.81   8      27
wine quality   6497   12     0.69   0.67   0.17   0.27   32     147
student perf   649    33     0.63   0.81   0.23   0.36   5      21
Table 1: FOLD-TR on various datasets

5 Conclusions and Future Work

In this paper, we presented an efficient and highly scalable algorithm, FOLD-TR, that induces default theories, represented as an answer set program, for learning-to-rank tasks. The resulting answer set program performs well with respect to prediction and provides justification for the ranking. In this new approach, unlike other methods, no encoding of the data is needed. We plan to perform comparison experiments with other well-known models of a similar kind: DirectRanker, PRM, XGBoost, and PolyRank. Results on mean average precision, NDCG, Precision@N, and mean reciprocal rank will be reported.

Acknowledgement

The authors gratefully acknowledge support from NSF grants IIS 1718945, IIS 1910131, and IIP 1916206, and from Amazon Corp., Atos Corp., and the US DoD. Thanks to Farhad Shakerin for discussions. We are grateful to Joaquin Arias and the s(CASP) team for their work on providing facilities for generating the justification tree and the English rendering of rules in s(CASP).

References

  • [Baral 2003] Baral, C. 2003. Knowledge representation, reasoning and declarative problem solving. Cambridge, New York, Melbourne: Cambridge University Press.
  • [Gelfond and Kahl 2014] Gelfond, M., and Kahl, Y. 2014. Knowledge representation, reasoning, and the design of intelligent agents: The answer-set programming approach. Cambridge University Press.
  • [Gunning 2015] Gunning, D. 2015. Explainable artificial intelligence (XAI).
  • [Muggleton 1991] Muggleton, S. 1991. Inductive logic programming. New Gen. Comput. 8(4).
  • [Plotkin 1971] Plotkin, G. D. 1971. A further note on inductive generalization. In Machine Intelligence, volume 6, 101–124.
  • [Reiter 1980] Reiter, R. 1980. A logic for default reasoning. Artificial Intelligence 13(1-2):81–132.
  • [Sakama 2005] Sakama, C. 2005. Induction from answer sets in nonmonotonic logic programs. ACM Trans. Comput. Log. 6(2):203–231.
  • [Shakerin and Gupta 2018] Shakerin, F., and Gupta, G. 2018. Heuristic based induction of answer set programs, from default theories to combinatorial problems. In 28th International Conference on Inductive Logic Programming (ILP Up and Coming Papers), volume 2206, 36–51.
  • [Shakerin, Salazar, and Gupta 2017] Shakerin, F.; Salazar, E.; and Gupta, G. 2017. A new algorithm to automate inductive learning of default theories. TPLP 17(5-6):1010–1026.
  • [Shakerin 2020] Shakerin, F. 2020. Logic Programming-based Approaches in Explainable AI and Natural Language Processing. Ph.D. Dissertation. Department of Computer Science, The University of Texas at Dallas.
  • [Wang and Gupta 2022] Wang, H., and Gupta, G. 2022. FOLD-R++: A toolset for automated inductive learning of default theories from mixed data. In Proceedings The First International Workshop on Combining Learning and Reasoning: Programming Languages, Formalisms, and Representations (CLeaR).
  • [Zeng, Patel, and Page 2014] Zeng, Q.; Patel, J. M.; and Page, D. 2014. Quickfoil: Scalable inductive logic programming. Proc. VLDB Endow. 8(3):197–208.