Mind Reasoning Manners: Enhancing Type Perception for Generalized Zero-shot Logical Reasoning over Text
Abstract
The logical reasoning task involves diverse types of complex reasoning over text, in the form of multiple-choice question answering. Given the context, question and a set of options as the input, previous methods achieve superior performances in the full-data setting. However, the current benchmark dataset makes the ideal assumption that the reasoning type distribution of the train split is close to that of the test split, which is inconsistent with many real application scenarios. To address this, two problems remain to be studied: (1) How strong is the zero-shot capability of the models (trained on seen types and tested on unseen types)? (2) How can we enhance the models' perception of reasoning types? For problem 1, we propose a new benchmark for generalized zero-shot logical reasoning, named ZsLR. It includes six splits based on three type sampling strategies. For problem 2, we propose a type-aware model, TaCo. It utilizes both heuristic input reconstruction and contrastive learning to improve the type perception in the global representation. Extensive experiments in both the zero-shot and full-data settings prove the superiority of TaCo over state-of-the-art methods. We also verify the generalization capability of TaCo on another logical reasoning dataset.
Index Terms:
Natural Language Processing, Logical Reasoning, Question Answering, Generalized Zero-shot.

1 Introduction
Logical reasoning over text has recently aroused wide interest in the areas of Machine Reading Comprehension (MRC) [1] and Natural Language Processing (NLP) [2][3]. Taking the form of traditional multiple-choice question answering (MCQA) [4][5][6], the logical reasoning task requires the model to perform complex reasoning and generalization. One of the main difficulties lies in addressing diverse reasoning types. Fig. 1 shows some examples of reasoning types in the logical reasoning task. Given questions of different reasoning types, humans tend to focus on different aspects of the interaction between the context and the option. For instance, for the type Identify the flaw (a), the option is strongly related to the detailed logical flaws within the global idea. For the type Necessary assumption (b), in contrast, the focus switches to the premises of the arguments, detecting the missing assumption reflected by the option. For the type Parallel reasoning (c), it is the corresponding logical structures of the context and the option that matter, rather than the specific entities or events. Therefore, modeling the specific reasoning type is intuitive and necessary for the logical reasoning task.

Recent works have witnessed improvements on logical reasoning tasks. Generally, they can be categorized into two families: graph-based and data-based. In the graph-based family, DAGN [7], FocalReasoner [8] and Logiformer [9] attempt to construct context graphs from different levels, such as causal and co-occurrence relations, and AdaLoGN [10] proposes a neural-symbolic system in an adaptive manner. In the data-based family, previous works explore various data augmentation strategies. For example, LReasoner [11] extracts symbols from the text and extends them with logical rules, and MERIt [12] designs several data generation methods to facilitate the training process. However, all of the methods in both families lack the modeling of reasoning type features. Although the above methods achieve superior results over strong baselines, current research is based on the ideal setting that the train and test splits share a similar distribution of reasoning types. Nevertheless, in real applications, models are mainly exposed to common types of questions and remain unfamiliar with reasoning over novel and uncommon types. In other words, there is an obvious gap between the ideal setting and real scenarios. To address this issue, two problems remain to be studied: 1) How do models perform on types unseen during training, i.e., what is their zero-shot capability? 2) How can we enhance the models' perception of reasoning types?
As for Problem 1, we propose a new benchmark for generalized Zero-shot Logical Reasoning, named ZsLR. Based on the ReClor dataset [13] with 17 reasoning types, we form 6 zero-shot data splits according to 3 strategies (i.e., amount, randomness, and difficulty). For comprehensive assessment, we introduce the generalized zero-shot setting [14], which tests on both seen and unseen types with two defined metrics. The necessity and value of ZsLR are verified by the pilot experiments, and the benchmark encourages future work to mind reasoning manners.
As for Problem 2, we propose a Type-aware reasoning network based on Contrastive learning, named TaCo. First, we design a keyword-based extractor to output the reasoning type. Then, through two designed heuristic strategies, we merge each question and option into a unified Q-A pair. Based on the co-occurrence node extraction algorithm proposed in Logiformer [9], we form the topology within the context part and the Q-A pair part respectively. To model the interaction of the context and Q-A pair under different reasoning types, we add a global node connecting all nodes of the two parts. Through self-attention aggregation, the final representation of the global node is obtained and utilized to predict the answer. Meanwhile, we employ Sentence-BERT [15] to obtain type embeddings based on the type descriptions. A margin loss is applied to model the reasoning type, where the global node serves as the anchor, the ground-truth type as the positive example, and the other types as negative examples. In this way, the global node representation contains both the context semantics and the reasoning type semantics, so the perception of reasoning types can facilitate the zero-shot capability.
In all, the main contributions of this paper are summarized as follows:
(1) We are the first to focus on the issue of type-oriented reasoning manners. To address it, we propose the first benchmark for this setting: generalized ZsLR (the ZsLR dataset and the implementation of TaCo are public at github.com/xufangzhi/TaCo). It can well reflect real scenarios and test the zero-shot capabilities of models.
(2) We propose to tackle the zero-shot task through heuristic input reconstruction and type-aware contrastive learning. The proposed model TaCo can function as a strong baseline for future work on the ZsLR task.
(3) Extensive experiments in both zero-shot and full-data settings prove the great potential of the proposed problem, as well as the superiority of TaCo. We also conduct additional experiments to further verify the generalization capability of TaCo in other settings and on another dataset.
2 Related Work
2.1 Logical Reasoning
The logical reasoning task, which aims to test the reasoning capability of models, has aroused wide interest. Several representative datasets have been proposed, such as ReClor [13] and LogiQA [16]. Current methods for the task can be divided into graph-based methods and data-based methods. The former focuses on constructing text graphs and leverages node connections to model logical relations. In this category, DAGN [7] is the first work to split the text into EDUs and perform reasoning with graph neural networks [17][18], but it only builds a chain-type graph and ignores long-distance interactions between nodes. FocalReasoner [8] attends to the fact triplets [19] within the text and constructs a supergraph from the extracted triplets, but it lacks the modeling of logical information in the text. To fully explore the logic, AdaLoGN [10] constructs an adaptive neural-symbolic system to improve performance, though its reasoning process is complex and costly. Addressing these drawbacks, Logiformer [9] emphasizes both causal relations and co-occurrence relations in a two-branch graph transformer network. Yet it still lacks the perception of different question types, which limits its zero-shot reasoning capability. The latter family consists of data-based methods, which aim to improve performance through data augmentation strategies. One representative is LReasoner [11], which extends symbolic expressions with logical rules and templates. In addition, MERIt [12] designs a meta-path-guided contrastive learning method that utilizes extra data to facilitate training. However, both fail to model the reasoning type of each question, which limits their application value in the zero-shot setting.
2.2 Current Machine Reading Comprehension Settings
Current machine reading comprehension (MRC) tasks [20] can be categorized into two settings: full-data setting and low-resource (i.e., few-shot, zero-shot) setting.
The full-data setting has aroused wide concern in the area of MRC in recent years. The popular benchmark SQuAD [21][22], sourced from Wikipedia articles, contains over 100,000 questions and forms an abundant database for model training. Similarly, HotpotQA [23] includes 113K questions with a variety of reasoning strategies, stressing the multi-hop reasoning abilities of models. The RACE dataset [24] is sourced from English examinations for middle and high school students and contains about 97K questions. Also, some domain-specific datasets, such as NewsQA [25], TextbookQA [26] and PubMedQA [27], target the areas of journalism, education and biology respectively. All of them have rich training examples to improve model capability. However, the full-data setting encourages models to depend on ideal scenes with abundant training data, which may not fit some real cases.
Motivated by this concern, some previous works focus on low-resource MRC, including few-shot and zero-shot settings. For example, to evaluate model robustness, [28] reconstructs data from knowledge sources for zero-shot commonsense QA. Also, [29] proposes a neuro-symbolic approach to boost the performance of zero-shot commonsense QA. Neither method is exposed to the training data, forming a zero-shot setting; however, both rely on extra knowledge bases, which limits their applications. There are also works on few-shot QA. In [30], a new pretraining strategy is proposed to explore a realistic few-shot setting, and [31] tackles the few-shot challenge in Visual Question Answering. However, all of the above methods obtain their few-shot splits simply by amount, treating all samples equally.
To sum up, current MRC benchmarks mainly focus on the common types of questions. However, in reality, it is more challenging when faced with some uncommon types of questions. To make up for these drawbacks, we attend to the reasoning types in the logical reasoning tasks and propose the first benchmark for zero-shot logical reasoning based on type attributes.
3 The Benchmark of Generalized ZsLR
In this section, we introduce the generalized ZsLR benchmark. We first obtain the statistical distribution of the number of reasoning types on the train, development and test splits, shown in Fig. 2 and arranged in descending order of count. The three distributions have similar shapes, which demonstrates that the full-data setting is built on this ideal assumption and is insufficient to verify the zero-shot generalization capability of models. Therefore, we propose the generalized ZsLR benchmark and conduct pilot experiments to verify its necessity.
3.1 Zero-shot Data Construction
The zero-shot logical reasoning datasets are split based on ReClor [13] without shuffling. Considering the real-world situation, it is easier to learn to reason on common types of samples, while models struggle with the rare ones. To this end, we design three sampling strategies, namely amount, randomness and difficulty.
Strategy | Split | Seen Types | # Types Seen (Unseen) | # Train Seen | # Test Seen (Unseen) |
Amount | v1 | {0,3,4,8,13} | 5 (12) | 2,190 | 475 (525) |
Amount | v2 | {0,1,2,3,8,9,14,16} | 8 (9) | 2,700 | 595 (405) |
Randomness | v3 | {0,2,3,13} | 4 (13) | 1,928 | 435 (565) |
Randomness | v4 | {0,2,3,5,7,8,13} | 7 (10) | 2,896 | 645 (355) |
Difficulty | v5 | {0,2,4,6,8,13,15} | 7 (10) | 2,175 | 473 (527) |
Difficulty | v6 | {1,3,5,7,9,10,11,12,14,16} | 10 (7) | 2,463 | 527 (473) |
For amount, we select the top-k reasoning types by sample amount as the seen types. It can be seen as a simple implementation that filters out the uncommon types of samples.
For randomness, we arrange the reasoning types in descending order of amount and select the seen ones based on the geometric distribution. The discrete form of the geometric distribution is,
$$P(k) = (1 - p)^{k-1}\, p \qquad (1)$$
where $k$ is the sorted index of the reasoning type and $p$ is the hyper-parameter, set to 0.1 in our implementation. In this random setting, a type with more training samples has a higher probability of being selected, which parallels the real situation.
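As an illustration, the following minimal sketch draws seen types under this geometric prior. The frequency-sorted indexing and $p = 0.1$ follow the text above; the sampling loop, the number of seen types and the seed are our assumptions for demonstration.

```python
import random

def sample_seen_types(num_types: int = 17, p: float = 0.1,
                      num_seen: int = 7, seed: int = 42):
    """Draw seen types under a geometric prior over frequency-sorted indices.

    Types are assumed pre-sorted in descending order of training amount,
    so index 0 is the most common type. The weight (1 - p)^k * p decays
    with k, giving frequent types a higher chance of being chosen as seen.
    """
    rng = random.Random(seed)
    weights = [(1 - p) ** k * p for k in range(num_types)]
    seen = set()
    while len(seen) < num_seen:
        seen.add(rng.choices(range(num_types), weights=weights, k=1)[0])
    return sorted(seen)

print(sample_seen_types())  # a frequency-biased subset of the 17 type ids
```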
For difficulty, we first rank the difficulty of the reasoning types based on the performance of a RoBERTa-Large single model. On one split, we select some of the most difficult reasoning types as the seen ones; on the other, we select a part of the easiest types as the seen ones.
In total, 6 zero-shot splits are obtained (2 for each strategy) for ReClor. The details of the splits are presented in Table I.
3.2 Generalized Zero-shot Setting


We define the zero-shot setting for the logical reasoning task as follows. Given the context $c$, the question $q$ and the option set $\mathcal{O}$ of the question, the model is required to predict the correct answer $a \in \mathcal{O}$. In the definition, $t \in \mathcal{T}$ denotes the reasoning type of the question and $\mathcal{T}$ is the type set of the zero-shot splits. For each zero-shot split, a part of the types are sampled as the seen types, while the others are viewed as unseen. Only questions of seen types exist during training, and they make up the training examples. The type set seen during the training stage and the set reserved for the test stage are $\mathcal{T}_s$ and $\mathcal{T}_u$ respectively; thus they satisfy $\mathcal{T}_s \cap \mathcal{T}_u = \emptyset$.
To be closer to real scenes, we consider the generalized zero-shot setting. That is, the test scope is not limited to the unseen types. To this end, we employ two metrics for the generalized zero-shot setting: Test-All and Test-Unseen. The former denotes the exact match results on the full test split, which contains both seen and unseen types, i.e., $\mathcal{T}_s \cup \mathcal{T}_u$. The latter denotes the exact match results only on the unseen types of the test split, ignoring the performance on seen types, i.e., $\mathcal{T}_u$. In this way, we expect to achieve comprehensive assessments of the model capabilities.
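For concreteness, the sketch below computes the two metrics as exact-match accuracy over the full test split and over its unseen-type subset. The record fields (type, label) are assumed names for illustration, not the benchmark's actual file format.

```python
def evaluate(predictions, examples, unseen_types):
    """Compute Test-All and Test-Unseen exact-match accuracy.

    `examples` are dicts carrying a reasoning-type id under 'type' and the
    gold answer index under 'label' (field names assumed for illustration).
    """
    hits_all = hits_unseen = n_unseen = 0
    for pred, ex in zip(predictions, examples):
        correct = int(pred == ex["label"])
        hits_all += correct
        if ex["type"] in unseen_types:  # restrict to types never seen in training
            hits_unseen += correct
            n_unseen += 1
    test_all = hits_all / len(examples)
    test_unseen = hits_unseen / max(n_unseen, 1)
    return test_all, test_unseen
```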
3.3 Pilot Experiments
We also verify the necessity of the zero-shot setting through pilot experiments. For comparison with each split, we randomly sample the same number of training examples over both the seen and unseen types, forming a comparison group. Taking the zero-shot split v1 as an example, it has 2,190 seen samples for training; its comparison training group therefore also includes 2,190 examples, but distributed over all types. The purpose of the pilot experiments is to observe the changes of the two metrics on the test split.
In the implementation, we utilize the RoBERTa single model [32] to conduct the reasoning and keep the same hyper-parameters for the zero-shot splits and the comparison groups. In particular, to avoid noise from the random sampling, we run the comparison experiments five times with different random seeds and report the average of the five runs.
We select the results on four of the split versions (i.e., v1-v4) for illustration, shown in Fig. 2. The first and third columns in blue show the performance of the comparison group on the full test split and on the unseen types respectively; the second and fourth columns show the performance in the zero-shot setting. With the same number of training examples, training only on seen types (the zero-shot setting) leads to obvious performance drops, especially on the unseen types of the test split.
These observations are consistent across all the zero-shot splits. The type-based zero-shot pilot experiments uncover an obvious drawback of the current full-data setting. In other words, they verify the necessity of a new benchmark for generalized ZsLR.
4 Methods
In this section, we introduce the proposed method. To tackle the zero-shot challenges, we propose a model named TaCo, which focuses on reasoning type perception in the logical reasoning task. The architecture of TaCo is shown in Fig. 3. It mainly consists of three parts: (a) heuristic reconstruction to acquire type-aware input sequences; (b) text graph construction and reasoning for the QA problem; (c) type-aware contrastive learning.

4.1 Heuristic Input Reconstruction
One common practice for MCQA problems is to concatenate three sequences as inputs: context, question and option. But this is insufficient for the zero-shot setting, with two main drawbacks: 1) the modeling of the reasoning type is only implicit in the sequence; 2) it is difficult to bridge the interaction between the context and the options via the question sequence in the middle, since this arrangement is not natural for the language model (LM). To this end, this paper introduces heuristic reconstruction of the inputs based on type extraction.
To address the first drawback, we need to label the reasoning type of each question based on the limited inputs. Inspired by LReasoner [11], we propose a simple but effective keyword-based type extractor. Its procedure is presented in the pseudo-code of Algorithm 1.
Since the ReClor dataset only makes the reasoning types of the test split public, we need to derive the classification ourselves. Before the extraction, we compile the keywords and phrases for each reasoning type, forming the keyword base $\mathcal{B}$. We also take the maximum window size $W$ and the question sentence $q$ as inputs. In Line 1, we first split the question into words to form the word sequence. Then we iterate in descending order of the window size. We slide the window over the question sequence to obtain a sub-sequence (Line 5) and match it against the keyword base of each reasoning type to derive the number of exact matches (Line 7). After each window-size iteration, we check the exit condition (Line 11): if there exists a unique type with the maximum number of matches, we label it as the ground truth. If we cannot extract a type by the last round of iteration, we label the instance as the Others type (Lines 16, 17).
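The following Python sketch mirrors the sliding-window matching described above. The contents of the keyword base and the exact tie-breaking behavior are our assumptions, since Algorithm 1 is only summarized here.

```python
def extract_type(question: str, keyword_base: dict, max_window: int = 5) -> str:
    """Keyword-based reasoning-type extractor (sketch of Algorithm 1).

    `keyword_base` maps a type name to a set of keywords/phrases, e.g.
    {"Weaken": {"most weakens", "casts doubt"}, ...} (contents assumed).
    """
    words = question.lower().split()
    for w in range(max_window, 0, -1):                  # descending window sizes
        matches = {t: 0 for t in keyword_base}
        for i in range(len(words) - w + 1):
            span = " ".join(words[i:i + w])             # sliding sub-sequence
            for t, keywords in keyword_base.items():
                if span in keywords:
                    matches[t] += 1                     # count exact matches
        best = max(matches.values())
        if best > 0 and list(matches.values()).count(best) == 1:
            return max(matches, key=matches.get)        # unique maximum: label it
    return "Others"                                     # no match at any window size
```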
Then, we convert the type index of each question into natural language (e.g., Implication, Conclusion). Meanwhile, to equip the LM with the type information, we add a type-related prefix at the beginning of the sequence:
$$S = [\text{R-Type}] \oplus S_{\mathrm{input}} \qquad (2)$$
where [R-Type] denotes the natural language label of the specific type, $\oplus$ denotes sequence concatenation, and $S_{\mathrm{input}}$ is the remaining input sequence. Thus, the type semantic is merged into the inputs in an explicit manner.
For the second drawback, one intuitive idea is to reconstruct the question and option sequences, transforming them into a single declarative sequence, defined as the Q-A pair. In other words, we fill the option into the proper position in the question. According to our observation, the question sentence is led by a trigger span. For example, in the question Which one of the following most weakens the arguments above?, the span Which one of the following serves as the trigger. We can then replace the trigger span with the option sequence to obtain the Q-A pair, formulated as [Option] most weakens the arguments above.
Therefore, the core of the heuristic strategy is to pinpoint the precise position of the triggers. Although the trigger span is similar to the basic form of Which of the following, it is not always the same. To further improve the diversity of the trigger base, we propose two heuristic strategies:
• Combination: We predefine a set of basic words (i.e., {which, one, of, the, following}) and randomly combine all or part of them to form new triggers. For instance, of which the following and which of following are two newly constructed triggers.
• Tolerance: Trigger spans are not limited to combinations of the predefined words. To this end, we propose a tolerance strategy that admits extra words within the trigger span. For example, in the trigger of the following claims, which, we include the extra word claims in the span.
In this way, the trigger base can be expanded to adapt to the complex situations in reality. So far, the declarative Q-A pair is obtained.
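A possible implementation of the trigger expansion and replacement is sketched below. The permutation-based Combination and the two-extra-word Tolerance bound are our reading of the strategies above, not the exact rules used in TaCo.

```python
import itertools
import re

BASIC_WORDS = ["which", "one", "of", "the", "following"]

def build_trigger_base():
    """Combination strategy: permute subsets of the basic words into triggers."""
    triggers = set()
    for r in range(2, len(BASIC_WORDS) + 1):
        triggers.update(" ".join(c) for c in itertools.permutations(BASIC_WORDS, r))
    return triggers

def to_qa_pair(question: str, option: str, triggers) -> str:
    """Replace the trigger span in the question with the option text."""
    q = question.rstrip(" ?")
    for trig in sorted(triggers, key=len, reverse=True):  # prefer longer triggers
        # Tolerance strategy: allow up to two extra words inside the span.
        sep = r"(?:\s+\w+){0,2}\s+"
        pattern = re.compile(sep.join(map(re.escape, trig.split())), re.IGNORECASE)
        if pattern.search(q):
            return pattern.sub(option, q, count=1) + "."
    return q + " " + option + "."  # fallback: plain concatenation

triggers = build_trigger_base()
print(to_qa_pair("Which one of the following most weakens the arguments above ?",
                 "[Option]", triggers))
# -> "[Option] most weakens the arguments above."
```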
We concatenate the context sequence and the Q-A pair sequence and send the result into the pre-trained LM. In this paper, we employ the RoBERTa-Large model [32] to obtain the token-level representations as follows,
$$\{h_1, h_2, \ldots, h_n\} = \mathrm{LM}(S) \qquad (3)$$
where the tokens belonging to the context make up $H_c$ and those belonging to the Q-A pair make up $H_{qa}$. The representations of the context, the Q-A pair and the whole sequence $S$ (the concatenation of prefix, context and Q-A pair) are obtained through the mean pooling strategy,
$$h_c = \mathrm{MeanPool}(H_c), \quad h_{qa} = \mathrm{MeanPool}(H_{qa}), \quad h_s = \mathrm{MeanPool}(\{h_1, \ldots, h_n\}) \qquad (4)$$
4.2 Text Graph Construction and Reasoning
According to the previous analysis, the context and the Q-A pair interact differently under different reasoning types. Therefore, we build the topology of the two parts respectively and finally learn their interactive semantics. For clearer illustration, we take the context part as an example and present the graph construction process in Fig. 4.

Firstly, we split the text sequence into units (e.g., U1-U6 in Fig. 4) based on the predefined explicit connectives or punctuation (marked in blue in Fig. 4). These text units function as the nodes during the graph reasoning process. Therefore, two node sets for the context and Q-A pair parts are obtained respectively.
To form the dual subgraphs of the input sequence, we also need to define the edge relations, including a) the order relation and b) the overlap relation. The former models the original text order, where nodes are connected in sequence, forming a chain-type structure that maintains position perception. The latter models the similarity between nodes, where nodes with more vocabulary overlap or similar semantics are connected. In the implementation, we transform the text units into word sets with stopwords excluded, compute the overlap ratio between sets, and connect the node pairs whose ratio exceeds a threshold.
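A minimal sketch of the edge construction is given below, assuming a Jaccard-style overlap ratio; the paper states only that node pairs exceeding a threshold are connected, so the exact ratio, stopword list and threshold are illustrative.

```python
from itertools import combinations

STOPWORDS = {"the", "a", "an", "of", "to", "is", "are", "in", "that", "it"}  # toy list

def build_edges(units, threshold: float = 0.5):
    """Return order edges (consecutive chain) plus overlap edges (word-set similarity)."""
    word_sets = [{w.lower() for w in u.split() if w.lower() not in STOPWORDS}
                 for u in units]
    # Order relation: chain consecutive units to keep position perception.
    edges = [(i, i + 1) for i in range(len(units) - 1)]
    # Overlap relation: connect pairs whose filtered word sets overlap enough.
    for i, j in combinations(range(len(units)), 2):
        union = word_sets[i] | word_sets[j]
        if union and len(word_sets[i] & word_sets[j]) / len(union) >= threshold:
            edges.append((i, j))
    return edges
```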
In this way, the independent topologies of the context part and the Q-A pair part are formed. To make the interaction of the two parts learnable, we add a global node linked to all the extracted nodes. The global node is initialized with the whole-sequence representation $h_s$ and aggregates the two-way information flow. Therefore, this node representation is expected to contain the type semantics and improve the zero-shot performance.
Further, to overcome over-smoothing in graph neural networks [33] and fully conduct node interactions, we employ the Graph Transformer Network [34, 9]. Let $\mathcal{V}$ be the node set in one graph, with $N$ nodes. We feed the node sequence into the Transformer architecture [35]. For each layer $l$, the updated representation of node $i$ is formulated as follows:
$$h_i^{(l+1)} = h_i^{(l)} + \sum_{j \in \mathcal{V} \setminus \{i\}} \alpha_{ij}\, h_j^{(l)} W_V \qquad (5)$$
where $\mathcal{V} \setminus \{i\}$ denotes the node set without node $i$.
The core of the equation is the computation of the weighted attention $\alpha_{ij}$. It is obtained from the query matrix $Q$ and the key matrix $K$:
$$\alpha_{ij} = \mathrm{softmax}_j\!\left(\frac{Q_i K_j^{\top}}{\sqrt{d}} + b_{ij}\right) \qquad (6)$$
where $b_{ij}$ is one of the elements in the matrix $B$ and $d$ is the dimension of the hidden states. $B$ is the attention bias matrix that encodes the structural semantics of the whole graph, i.e., the adjacency matrix of the graph.
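To make Eqs. (5)-(6) concrete, here is a single-head sketch of the biased self-attention update in PyTorch. The multi-head projections, layer normalization and learnable bias parameterization of the full Graph Transformer are omitted, and using the raw adjacency matrix as the bias is the simple choice described above.

```python
import torch
import torch.nn.functional as F

def biased_attention_layer(h, adj, w_q, w_k, w_v):
    """One self-attention update over graph nodes with a structural bias.

    h:   [N, d] node features; adj: [N, N] adjacency matrix used as bias b_ij.
    Implements the single-head form of Eqs. (5)-(6) with a residual update.
    """
    q, k, v = h @ w_q, h @ w_k, h @ w_v
    d = h.size(-1)
    scores = (q @ k.T) / d ** 0.5 + adj   # attention bias encodes the graph structure
    alpha = F.softmax(scores, dim=-1)     # weighted attention over all nodes
    return h + alpha @ v                  # residual node update, Eq. (5)

# toy usage: 4 nodes, hidden size 8
N, d = 4, 8
h, adj = torch.randn(N, d), torch.eye(N)
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
print(biased_attention_layer(h, adj, w_q, w_k, w_v).shape)  # torch.Size([4, 8])
```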
Thus, the final feature of the global node $h_g$ is updated through the graph reasoning network, containing the global semantics of the text sequence. Before predicting the correct answer, we concatenate $h_g$, $h_c$, $h_{qa}$ and $h_s$ to obtain the score of each option, followed by a softmax over the options:
$$p = \mathrm{softmax}\big(\mathrm{FFN}([h_g; h_c; h_{qa}; h_s])\big) \qquad (7)$$
where $\mathrm{FFN}(\cdot)$ denotes the linear projection. As in most MCQA methods, we adopt the cross-entropy loss function for the optimization:
$$\mathcal{L}_{ans} = \mathrm{CrossEntropy}(p, y) \qquad (8)$$
where $p$ denotes the stack of the option scores and $y$ is the ground-truth label of the example.
Id | Type | Descriptions |
0 | Necessary Assumptions | identify the claim that must be true or is required in order for the argument to work. |
1 | Sufficient Assumptions | identify a sufficient assumption, that is, an assumption that, if added to the argument, would make it logically valid. |
2 | Strengthen | identify information that would strengthen an argument. |
3 | Weaken | identify information that would weaken an argument. |
4 | Evaluation | identify information that would be useful to know to evaluate an argument. |
5 | Implication | identify something that follows logically from a set of premises. |
6 | Conclusion and Main Point | identify the conclusion/main point of a line of reasoning. |
7 | Most Strongly Supported | find the choice that is most strongly supported by a stimulus. |
8 | Explain or Resolve | identify information that would explain or resolve a situation. |
9 | Principle | identify the principle, or find a situation that conforms to a principle, or match the principles. |
10 | Dispute | identify or infer an issue in dispute. |
11 | Technique | identify the technique used in the reasoning of an argument. |
12 | Role | describe the individual role that a statement is playing in a larger argument. |
13 | Identify a Flaw | identify a flaw in an argument's reasoning. |
14 | Match Flaws | find a choice containing an argument that exhibits the same flaws as the passage's argument. |
15 | Match the Structure | match the structure of an argument in a choice to the structure of the argument in the passage. |
16 | Others | other types of questions which are not included by the above. |
4.3 Type-aware Contrastive Learning
In the upper part of the architecture, the interactive semantics of the context and the Q-A pair are learned through one global node. Beyond this implicit awareness of the type, we also expect the model to distinguish the ground-truth type from the negative ones for zero-shot logical reasoning.
Based on the heuristic extractor mentioned above, the reasoning type of each example can be derived. For each reasoning type, the detailed description in the form of natural language is utilized (listed in Table II).
Feeding all 17 reasoning type descriptions into the LM (i.e., Sentence-BERT [15]), we obtain the sentence-level embeddings, which are fixed during training.
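Assuming the sentence-transformers package, encoding the descriptions could look as follows. The specific checkpoint name is our choice for illustration; the paper states only that Sentence-BERT [15] is used.

```python
from sentence_transformers import SentenceTransformer

# Encode the 17 type descriptions once; the embeddings stay frozen during training.
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # checkpoint name assumed
type_descriptions = [
    "identify the claim that must be true or is required in order for the argument to work.",
    "identify a sufficient assumption, that is, an assumption that, if added to the argument, would make it logically valid.",
    # ... the remaining 15 descriptions from Table II
]
type_embeddings = encoder.encode(type_descriptions)  # shape: (num_types, dim)
```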
For each example, we propose to attend to the type information through contrastive learning. The final representation of the global node $h_g$ serves as the anchor. The ground-truth type functions as the positive sample and the other types as negative samples, with embeddings $e^{+}$ and $e_j^{-}$ ($j$ indexes the negative samples) respectively. Our purpose is to close the distance between the global node $h_g$ and the positive sample $e^{+}$, while distancing the negative samples.
For simplicity, we model the score of two vectors using the Hadamard product. That is:
$$s^{+} = \mathrm{sum}(h_g \odot e^{+}), \qquad s_j^{-} = \mathrm{sum}(h_g \odot e_j^{-}) \qquad (9)$$
where $s^{+}$ and $s_j^{-}$ represent the score of the positive sample and the scores of the negative ones respectively. We employ the margin loss as the auxiliary optimization function for each example:
$$\mathcal{L}_{type} = \max\big(0,\ \gamma - s^{+} + \max_j s_j^{-}\big) \qquad (10)$$
where $\gamma$ controls the difference between positive and negative scores. We select the maximum score among the negatives for the loss computation.
To maximize the joint optimization performance of the two loss functions, we set a trade-off coefficient $\lambda$ and form the final loss function:
$$\mathcal{L} = \mathcal{L}_{ans} + \lambda\, \mathcal{L}_{type} \qquad (11)$$
In this way, the global semantics for reasoning and the type-aware semantics reinforce each other, which benefits the zero-shot performance. In addition, it provides interpretability for the test process by distinguishing the correct reasoning type from the others.
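A compact sketch of the type-aware objective follows. The summed Hadamard product as the score head is our reading of Eq. (9), and the hard-negative selection and joint weighting follow Eqs. (10)-(11).

```python
import torch

def type_margin_loss(h_global, type_emb, gold_idx, margin=12.0):
    """Type-aware contrastive margin loss (sketch of Eqs. (9)-(10)).

    Scores are summed Hadamard products between the global-node anchor and
    each fixed type embedding; only the hardest negative enters the loss.
    """
    scores = (h_global.unsqueeze(0) * type_emb).sum(dim=-1)   # [num_types]
    pos = scores[gold_idx]
    neg = torch.cat([scores[:gold_idx], scores[gold_idx + 1:]])
    return torch.clamp(margin - pos + neg.max(), min=0.0)     # Eq. (10)

# Joint objective with trade-off lambda = 0.2, Eq. (11):
# loss = answer_ce_loss + 0.2 * type_margin_loss(h_g, type_embeddings, t_gold)
```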
5 Main Experiments
In this section, we introduce the details of the experimental setup and analyze extensive experiments in both the zero-shot and full-data settings.
5.1 Current Full-data Benchmark
Currently, there exist two datasets for the logical reasoning task, ReClor [13] and LogiQA [16]. Since LogiQA contains only 5 reasoning types, we base the zero-shot experiments on ReClor, and we take both datasets into account in the full-data setting to prove the generalization capability of TaCo. The detailed information of the two datasets is presented below.
ReClor is sourced from standardized graduate admission examinations. It consists of 6,138 examples in total, including 4,638 training samples, 500 validation samples, and 1,000 test samples. As listed in Table II, ReClor contains 17 reasoning types. Notably, its test split is divided into two parts, Test-E and Test-H: the former contains easy examples that can be answered without the context, while the latter contains the hard ones.
LogiQA is collected from National Civil Servants Examinations of China, including 8,678 samples in total. Among them, 7,376 are training samples while validation and test splits both contain 651 samples. Different from the diverse reasoning types in ReClor, LogiQA contains only 5 reasoning types.
5.2 Baselines
We include all the previous SOTA methods as baselines for comparison, as well as the results of some classical language models.
Random The results in this setting are based on the random predictions.
RoBERTa-Large [32] The results are obtained by simply utilizing the RoBERTa language model for prediction. The same applies to the BERT-Large [36] and XLNet-Large [37] baselines.
Human Performance [13] In ReClor, the average score of different graduate students on the test split is utilized as the human performance.
DAGN [7] It is the first work to propose the construction of the text graph based on the extracted EDUs. It mainly relies on the RoBERTa-Large [32] to encode the tokens and graph neural network [17] to update the features.
FocalReasoner [8] It constructs a supergraph for reasoning, which consists of the fact units extracted from the text.
LReasoner [11] It explores context extension based on defined logical rules (e.g., De Morgan's laws) and employs a data augmentation method to improve the performance. Since constructing more training data has been shown to greatly help the zero-shot performance of most current models, we do not apply the data augmentation strategy when reproducing it.
MERIt [12] It proposes a meta-path-guided contrastive learning method to reason over text, with supervised pretraining performed on abundant unlabeled text data. Due to the use of external data, we exclude this method from the zero-shot baselines for fair comparison.
AdaLoGN [10] It proposes a neuro-symbolic system to adaptively update the text graph. Additionally, a novel subgraph-to-node mechanism is utilized to aggregate the information.
Logiformer [9] It introduces two different strategies to construct the logical graph and the syntax graph respectively. Through the two-branch graph transformer network, the text features are updated to conduct the reasoning.
Name of Param. | Search Scope | Best |
# epoch | {10,15,20,25,30} | 30 |
# head in GT | {3,4,5,6} | 6 |
# layer in GT | {3,4,5,6} | 4 |
max sequence len | {128,256} | 256 |
learning rate | {4e-6, 5e-6, 6e-6} | 5e-6 |
margin $\gamma$ | {8,10,12,14} | 12 |
trade-off $\lambda$ | {0.1,0.2,0.5,1} | 0.2 |
Model | v1 | v2 | v3 | v4 | v5 | v6 | ||||||
Test-A | Test-U | Test-A | Test-U | Test-A | Test-U | Test-A | Test-U | Test-A | Test-U | Test-A | Test-U | |
BERT-Large | 38.00 | 34.36 | 42.00 | 33.39 | 37.50 | 31.61 | 38.00 | 33.26 | 29.60 | 28.02 | 28.80 | 32.24 |
RoBERTa-Large | 47.70 | 39.47 | 50.60 | 39.90 | 46.10 | 40.58 | 50.40 | 42.45 | 53.00 | 43.66 | 49.90 | 50.92 |
DAGN | 49.20 | 41.37 | 52.70 | 43.56 | 49.60 | 39.73 | 52.50 | 44.51 | 52.40 | 42.63 | 48.50 | 49.15 |
LReasoner | 46.90 | 40.60 | 50.20 | 43.49 | 48.40 | 42.76 | 49.20 | 44.12 | 51.90 | 42.02 | 46.30 | 44.93 |
Logiformer | 43.50 | 39.31 | 54.80 | 46.30 | 48.80 | 42.24 | 52.10 | 44.85 | 52.10 | 40.88 | 51.50 | 51.44 |
TaCo | 52.20 | 47.51 | 55.80 | 48.79 | 52.20 | 44.26 | 54.70 | 49.89 | 56.00 | 46.67 | 54.70 | 55.17 |
5.3 Implementation Details
In this paper, all experiments run on a single Tesla A100 GPU. We employ the RoBERTa-Large model [32] as the text encoder for all previous methods for fair comparison. For each split of the zero-shot setting, we train on the seen types and select the best epoch on the seen types of the development set for testing. We rerun the other baselines in the zero-shot setting with their original configurations and keep their reported results for the full-data setting. For our proposed model, the hyper-parameters are kept the same for both settings. The training epoch is set to 30 and the batch size is fixed to 1. For optimization, we use Adam [38] with a peak learning rate of 5e-6. The numbers of heads and layers in the graph transformer are set to 6 and 4 respectively. The margin $\gamma$ is tuned to 12, and the loss trade-off $\lambda$ is 0.2. The five random seeds selected for the comparison groups are {42, 12, 23, 234, 1234}. Additionally, some important hyper-parameters are tuned within a search scope; the details are included in Table III.
5.4 Main Results
The model TaCo is designed to bridge the gaps in the zero-shot logical reasoning setting, so we first evaluate its performance there. We present the results on the 6 zero-shot splits in Table IV. We rerun five strong baselines in the zero-shot setting: (1) BERT-Large, (2) RoBERTa-Large, (3) the graph-based representative DAGN [7], (4) the data-based representative LReasoner [11], and (5) the SOTA model Logiformer [9].
The results show that TaCo outperforms these strong baselines by large margins on all the zero-shot splits. Compared with the previous SOTA model Logiformer, the average improvements over all the splits are 3.80% and 4.54% on the two metrics Test-All and Test-Unseen respectively. Compared with the suboptimal results on the two metrics, TaCo still holds clear advantages of 2.50% and 3.65% respectively. Between the two metrics, Test-Unseen sees the greater improvements across all the splits, notably 6.14% and 5.04% over the sub-optimal methods on splits v1 and v4 respectively. This illustrates that TaCo handles unseen types of samples better than current methods, while maintaining superior performance on the seen ones. In addition, across the zero-shot splits, the performance of each method varies greatly. For example, the previous SOTA Logiformer performs well on v2 but struggles on v1. This shows that the benchmark provides diverse data distributions and rewards consistent performance. Considering the experimental results, TaCo can serve as a strong baseline for future work on the ZsLR benchmark.
Model | v1 | v2 | v3 | v4 | v5 | v6 | ||||||
Test-A | Test-U | Test-A | Test-U | Test-A | Test-U | Test-A | Test-U | Test-A | Test-U | Test-A | Test-U | |
TaCo | 52.20 | 47.51 | 55.80 | 48.79 | 52.20 | 44.26 | 54.70 | 49.89 | 56.00 | 46.67 | 54.70 | 55.17 |
a) Input Reconstruction | ||||||||||||
w/o type prefix | 51.10 | 42.30 | 55.30 | 44.01 | 49.90 | 41.08 | 53.80 | 45.60 | 53.60 | 42.04 | 53.50 | 53.67 |
-1.10 | -5.21 | -0.50 | -4.78 | -2.30 | -3.18 | -0.90 | -4.39 | -2.40 | -4.63 | -1.20 | -1.50 | |
w/o input reconstruction | 50.20 | 42.17 | 55.30 | 45.39 | 51.20 | 42.70 | 53.60 | 46.62 | 54.40 | 43.74 | 52.30 | 52.14 |
-2.00 | -5.34 | -0.30 | -3.40 | -1.00 | -1.56 | -1.10 | -3.27 | -1.60 | -2.93 | -2.40 | -3.03 | |
b) Graph Construction & Reasoning | ||||||||||||
w/o graph reasoning | 48.70 | 42.28 | 54.10 | 42.19 | 50.60 | 41.99 | 52.70 | 45.55 | 53.10 | 44.25 | 50.90 | 50.63 |
-3.50 | -5.23 | -1.70 | -6.60 | -1.60 | -2.27 | -2.00 | -4.34 | -2.90 | -2.42 | -3.80 | -4.54 | |
w/o global node | 51.50 | 46.96 | 53.70 | 44.78 | 51.40 | 43.51 | 53.50 | 46.16 | 55.10 | 46.57 | 53.50 | 52.53 |
-0.70 | -0.55 | -2.10 | -4.01 | -0.80 | -0.75 | -1.20 | -3.73 | -0.90 | -0.10 | -1.20 | -2.64 | |
c) Type-aware Contrastive Learning | ||||||||||||
w/o type contrast | 51.20 | 42.70 | 55.00 | 44.62 | 50.90 | 43.96 | 54.40 | 46.19 | 55.70 | 45.89 | 53.80 | 53.94 |
-1.00 | -4.81 | -0.80 | -4.17 | -1.30 | -0.30 | -0.30 | -3.70 | -0.30 | -0.78 | -0.90 | -1.23 |
5.5 Ablation Studies
To illustrate the effectiveness of each module of TaCo in the zero-shot setting, we conduct the following ablation studies. The results are listed in Table V.
In detail, in part a), w/o type prefix and w/o input reconstruction relate to the heuristic input reconstruction: the former ablates the type-oriented prefix, while the latter keeps the simple concatenation of the question and option. Overall, the heuristic type prefix and the input reconstruction bring greater improvements on the Test-Unseen metric, with averages of 3.95% and 3.26% respectively over all six splits. The type-oriented prefix is more effective than the input reconstruction in most cases (i.e., v2-v5).
In part b), w/o graph reasoning ablates the whole module of graph construction and reasoning and replaces the global node feature with the pooled output of the whole sequence. Since the reasoning process is the core of this task, it contributes most to the performance on both metrics, with average gains of 2.58% and 4.23%. In contrast, w/o global node simply replaces the global node feature with the pooled representation of all the other nodes, and can be regarded as a small part of graph construction and reasoning. From the results, it helps on splits v2, v4 and v6. As the training sets of these three splits are obviously larger, we conclude that the global node is especially effective with more training samples.
In part c), we ablate the type-aware contrastive learning module. The experiments show that modeling the reasoning types greatly helps on unseen types of samples. On the unseen types of v1, v2 and v4, the margin loss brings significant improvements, with gains of 4.81%, 4.17% and 3.70% respectively.
Overall, the three key modules (i.e., type prefix, input reconstruction, type contrast) proposed to enhance the type perception all have positive effects on the model performances, especially on the unseen parts of the test split.
5.6 Parameter Analysis
Further, we visualize the effect of different hyper-parameter selections. All the hyper-parameters are selected in the full-data setting, since the zero-shot splits are diverse and hard to unify. The first pair is the number of layers and the number of heads in the graph transformer network, shown in Fig. 5. To make a comprehensive comparison, we search both the head number and the layer number within {3,4,5,6}. In the heatmap, darker colors represent higher performance. TaCo reaches its optimum with 6 heads and 4 layers.


In addition, we present the model performances under different margins $\gamma$. The search scope of $\gamma$ is {8,10,12,14,16}, and we select the value with the optimal performance on the test split. The optimal value of $\gamma$ is 12.


6 Analysis of Model Generalization
In this section, we discuss the generalization capability of the proposed model. Since TaCo has achieved superior performance in the zero-shot setting, we further extend it to the full-data setting. Experiments are conducted on two mainstream logical reasoning datasets, ReClor and LogiQA. Table VI presents the comparison results.
6.1 Generalization on Full-data Setting of ReClor
Model | ReClor | |||
Dev | Test | Test-E | Test-H | |
Random | 25.00 | 25.00 | 25.00 | 25.00 |
Human Performance [13] | - | 63.00 | 57.10 | 67.20 |
BERT-Large [13] | 53.80 | 49.80 | 72.00 | 32.30 |
XLNet-Large [13] | 62.00 | 56.00 | 75.70 | 40.50 |
RoBERTa-Large [13] | 62.60 | 55.60 | 75.50 | 40.00 |
DAGN [7] | 65.80 | 58.30 | 75.91 | 44.46 |
FocalReasoner [8] | 66.80 | 58.90 | 77.05 | 44.64 |
LReasoner [11] | 66.20 | 62.40 | 81.40 | 47.50 |
MERIt [12] | 66.80 | 59.60 | 78.10 | 45.20 |
AdaLoGN [10] | 65.20 | 60.20 | 79.32 | 45.18 |
Logiformer [9] | 68.40 | 63.50 | 79.09 | 51.25 |
TaCo (Ours) | 69.00 | 63.30 | 81.36 | 49.11 |
Compared with all the current methods, TaCo achieves competitive results. In comparison with the SOTA model Logiformer, TaCo is better on the Dev and Test-E metrics, with gains of 0.60% and 2.27%, and is only 0.20% lower on the test split. We argue that in the full-data setting, the ideal type distribution weakens the benefit of type modeling. Combined with the previous zero-shot experiments, although Logiformer fits the full-data setting well, it loses some generalization capability. From this perspective, TaCo also shows great superiority in model generalization, performing consistently in both settings.
Model (ReClor) | Valid | Δ | Test | Δ | Test-E | Test-H |
TaCo (Ours) | 69.00 | - | 63.30 | - | 81.36 | 49.11 |
a) Heuristic Input Reconstruction | ||||||
w/o type prefix | 67.60 | -1.40 | 61.30 | -2.00 | 80.45 | 46.25 |
w/o input reconstruction | 67.80 | -1.20 | 61.10 | -2.20 | 79.09 | 51.25 |
b) Graph Construction & Reasoning | ||||||
w/o graph reasoning | 64.40 | -4.60 | 59.30 | -4.00 | 79.55 | 43.39 |
w/o global node | 65.60 | -3.40 | 61.90 | -1.40 | 81.82 | 46.25 |
c) Type-aware Contrastive Learning | ||||||
w/o type contrast | 67.80 | -1.20 | 62.90 | -0.40 | 79.55 | 49.82 |
Further, we conduct ablation studies to analyze the effectiveness of the proposed modules in the full-data setting. From the results, graph reasoning contributes most to the performance on the test split. Meanwhile, we observe only a 0.40% improvement from the type-aware margin loss. Since the train and test splits share the same distribution in the full-data setting, the effect of the reasoning types is reduced, so the slight gain from contrastive learning is reasonable. In all, the generalization capability of TaCo is well verified.
6.2 Generalization on Other Dataset
To further discuss the generalization capability of TaCo, we conduct experiments on another logical reasoning dataset, LogiQA [16]. Following the method used for ReClor, we label each example of LogiQA with one of the 17 reasoning types and consider the zero-shot setting. We apply the above-mentioned strategies to obtain three zero-shot splits (i.e., z1, z2 and z3) for LogiQA.
For comparison, we take several previous logical reasoning models into account, including two typical language models (BERT-Large and RoBERTa-Large), the graph-based method DAGN, and the previous SOTA model Logiformer. We rerun each baseline in the zero-shot setting.
Model | z1 | z2 | z3 | |||
Test-A | Test-U | Test-A | Test-U | Test-A | Test-U | |
BERT-L | 26.26 | 28.64 | 27.80 | 27.06 | 28.26 | 27.14 |
RoBERTa-L | 27.65 | 33.98 | 29.49 | 31.76 | 28.73 | 29.43 |
DAGN | 34.41 | 40.78 | 35.48 | 34.12 | 33.79 | 32.00 |
Logiformer | 35.33 | 36.41 | 33.18 | 35.29 | 34.56 | 33.43 |
TaCo (Ours) | 35.58 | 37.38 | 36.76 | 35.29 | 36.41 | 35.14 |
From the results shown in Table VIII, TaCo shows its superiority in most cases, except for the Test-Unseen metric on the z1 split. Specifically, compared with the previous SOTA Logiformer, TaCo achieves average gains of 1.89% and 0.89% on Test-All and Test-Unseen respectively. In general, the experiments on the LogiQA dataset confirm the good generalization capability of TaCo, which satisfies our expectations.
7 Case Study
In this section, we discuss the interpretability of TaCo regarding the perception of reasoning types. Fig. 7 shows a successful case and a failure case. For each case, we visualize the type perception via dimension reduction and compare with the SOTA model Logiformer. In the successful case, TaCo clearly separates the ground-truth type Implication from the others and thus makes the correct prediction, while Logiformer, which lacks type perception, fails. In the failure case, TaCo obviously struggles to distinguish Necessary Assumption from the other reasoning types, leading to a wrong prediction, the same as Logiformer's. This encourages us to explore deeper modeling of types in the graph construction. In all, the type-aware contrastive learning helps the answer prediction and provides interpretability for the model.


8 Conclusion and Future Work
To study the zero-shot capability of logical reasoning models, we propose the first benchmark for generalized zero-shot logical reasoning, named ZsLR. It includes six splits sampled with three strategies, and two metrics to comprehensively evaluate performance. We also propose the model TaCo to enhance reasoning type perception through heuristic input reconstruction and type-aware contrastive learning. Extensive experiments on the zero-shot splits, the full-data setting, and another dataset demonstrate the effectiveness and generalization capability of the proposed modules.
In the future, we encourage more works to mind reasoning manners in the logical reasoning task. We are also interested in exploring data augmentation methods based on logic.
References
- [1] S. Liu, X. Zhang, S. Zhang, H. Wang, and W. Zhang, “Neural machine reading comprehension: Methods and trends,” Applied Sciences, 2019.
- [2] J. Hirschberg and C. D. Manning, “Advances in natural language processing,” Science, 2015.
- [3] K. Chowdhary, “Natural language processing,” Fundamentals of artificial intelligence, 2020.
- [4] M. Richardson, C. J. Burges, and E. Renshaw, “MCTest: A challenge dataset for the open-domain machine comprehension of text,” in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2013.
- [5] G. Lai, Q. Xie, H. Liu, Y. Yang, and E. Hovy, “RACE: Large-scale ReAding comprehension dataset from examinations,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2017.
- [6] A. Talmor, J. Herzig, N. Lourie, and J. Berant, “CommonsenseQA: A question answering challenge targeting commonsense knowledge,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), 2019.
- [7] Y. Huang, M. Fang, Y. Cao, L. Wang, and X. Liang, “Dagn: Discourse-aware graph network for logical reasoning,” in Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2021.
- [8] S. Ouyang, Z. Zhang, and H. Zhao, “Fact-driven logical reasoning,” arXiv preprint arXiv:2105.10334, 2021.
- [9] F. Xu, Q. Lin, J. Liu, Y. Pan, and L. Zhang, “Logiformer: A two-branch graph transformer network for interpretable logical reasoning,” in International Conference on Research and Development in Information Retrieval (SIGIR), 2022.
- [10] X. Li, G. Cheng, Z. Chen, Y. Sun, and Y. Qu, “Adalogn: Adaptive logic graph network for reasoning-based machine reading comprehension,” in Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2022.
- [11] S. Wang, W. Zhong, D. Tang, Z. Wei, Z. Fan, D. Jiang, M. Zhou, and N. Duan, “Logic-driven context extension and data augmentation for logical reasoning of text,” in Findings of the Association for Computational Linguistics (ACL Findings), 2022.
- [12] F. Jiao, Y. Guo, X. Song, and L. Nie, “Merit: Meta-path guided contrastive learning for logical reasoning,” in Findings of the Association for Computational Linguistics (ACL Findings), 2022.
- [13] W. Yu, Z. Jiang, Y. Dong, and J. Feng, “Reclor: A reading comprehension dataset requiring logical reasoning,” in International Conference on Learning Representations (ICLR), 2019.
- [14] F. Pourpanah, M. Abdar, Y. Luo, X. Zhou, R. Wang, C. P. Lim, X.-Z. Wang, and Q. J. Wu, “A review of generalized zero-shot learning methods,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022.
- [15] N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings using siamese bert-networks,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019.
- [16] J. Liu, L. Cui, H. Liu, D. Huang, Y. Wang, and Y. Zhang, “Logiqa: a challenge dataset for machine reading comprehension with logical reasoning,” in Proceedings of the International Conference on International Joint Conferences on Artificial Intelligence (IJCAI), 2021.
- [17] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “The graph neural network model,” IEEE Transactions on Neural Networks, 2008.
- [18] M. Welling and T. N. Kipf, “Semi-supervised classification with graph convolutional networks,” in International Conference on Learning Representations (ICLR), 2016.
- [19] N. Nakashole and T. Mitchell, “Language-aware truth assessment of fact candidates,” in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL), 2014.
- [20] C. Zeng, S. Li, Q. Li, J. Hu, and J. Hu, “A survey on machine reading comprehension—tasks, evaluation metrics and benchmark datasets,” Applied Sciences, 2020.
- [21] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, “Squad: 100,000+ questions for machine comprehension of text,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016.
- [22] P. Rajpurkar, R. Jia, and P. Liang, “Know what you don’t know: Unanswerable questions for squad,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), 2018.
- [23] Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. Cohen, R. Salakhutdinov, and C. D. Manning, “Hotpotqa: A dataset for diverse, explainable multi-hop question answering,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018.
- [24] G. Lai, Q. Xie, H. Liu, Y. Yang, and E. Hovy, “Race: Large-scale reading comprehension dataset from examinations,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2017.
- [25] A. Trischler, T. Wang, X. Yuan, J. Harris, A. Sordoni, P. Bachman, and K. Suleman, “Newsqa: A machine comprehension dataset,” in Proceedings of the 2nd Workshop on Representation Learning for NLP, 2017.
- [26] A. Kembhavi, M. Seo, D. Schwenk, J. Choi, A. Farhadi, and H. Hajishirzi, “Are you smarter than a sixth grader? textbook question answering for multimodal machine comprehension,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- [27] Q. Jin, B. Dhingra, Z. Liu, W. Cohen, and X. Lu, “Pubmedqa: A dataset for biomedical research question answering,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019.
- [28] K. Ma, F. Ilievski, J. Francis, Y. Bisk, E. Nyberg, and A. Oltramari, “Knowledge-driven data construction for zero-shot evaluation in commonsense question answering,” in Proceedings of the Conference on Artificial Intelligence (AAAI), 2021.
- [29] A. Bosselut, R. Le Bras, and Y. Choi, “Dynamic neuro-symbolic knowledge graph construction for zero-shot commonsense question answering,” in Proceedings of the Conference on Artificial Intelligence (AAAI), 2021.
- [30] O. Ram, Y. Kirstain, J. Berant, A. Globerson, and O. Levy, “Few-shot question answering by pretraining span selection,” in Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2021.
- [31] D. Guo and D. Tao, “Learning compositional representation for few-shot visual question answering,” arXiv preprint arXiv:2102.10575, 2021.
- [32] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
- [33] Q. Li, Z. Han, and X.-M. Wu, “Deeper insights into graph convolutional networks for semi-supervised learning,” in Proceedings of the Conference on Artificial Intelligence (AAAI), 2018.
- [34] C. Ying, T. Cai, S. Luo, S. Zheng, G. Ke, D. He, Y. Shen, and T.-Y. Liu, “Do transformers really perform badly for graph representation?” Advances in Neural Information Processing Systems (NIPS), vol. 34, 2021.
- [35] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems (NIPS), 2017.
- [36] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
- [37] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le, “Xlnet: Generalized autoregressive pretraining for language understanding,” Advances in Neural Information Processing Systems (NIPS), vol. 32, 2019.
- [38] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.