ScopeIt: Scoping Task Relevant Sentences in Documents
Abstract
A prominent problem faced by conversational agents working with large documents (e.g., email-based assistants) is the frequent presence of information in the document that is irrelevant to the assistant. This in turn makes it harder for the agent to accurately detect intents, extract entities relevant to those intents, and perform the desired action. To address this issue, we present a neural model for scoping the information in a large document that is relevant to the agent. We show that when used as the first step in a popular email-based assistant that helps users schedule meetings (we use Hedwig in lieu of the agent's actual persona throughout this paper), our proposed model improves the performance of the intent detection and entity extraction tasks the agent requires for correctly scheduling meetings: across a suite of 6 downstream tasks, our proposed method yields an average gain of 35% in precision without any drop in recall. Additionally, we demonstrate that the same approach can be used for component-level analysis of large documents, such as signature block identification.
1 Introduction


Intelligent personal digital assistants (IPDA) such as Microsoft Cortana, Amazon Alexa, Apple Siri, and Google Assistant, are becoming increasingly popular. These assistants make use of natural language to communicate, which leads to faster task completion, improving the user’s productivity. A typical interaction with such a digital assistant requires a trigger (such as saying the assistant’s name), followed by a short phrase or sentence describing the user’s ask of the digital assistant. Some examples of these conversations are: “Cortana, what is the weather now?”, “Alexa, play next”, “Siri, turn off Bluetooth”, “Hey Google, take me home”.
While most such assistants are voice based and communicate synchronously with the user, working mostly with short, targeted directives, there also exist email-based assistants that communicate and provide assistance asynchronously and thus have to work with much larger textual queries. Notable examples from the scheduling space are assistants like Cortana from Microsoft Scheduler, Amy and Andrew from x.ai, and Clara from Clara Labs. These assistants require that the meeting organizer add them to the email thread with the attendees and delegate the scheduling task to the assistant. Fig 1a shows an example of an email that an organizer can send to their virtual assistant. After receiving the email, the assistant needs to identify the intents and entities of interest for scheduling the meeting correctly: for example, the duration of the meeting, where the meeting will take place (location), the required and optional attendees, the type of meeting being requested (e.g., lunch, coffee), etc. This intent detection and entity extraction from large documents (e.g., emails) can be challenging for two reasons:
• Information needed for scheduling the meeting can be spread across the document, while most of its content is irrelevant.
• Most generic open-source entity extraction models are recall-heavy: they are often context independent and consequently detect entities that are not relevant for scheduling.
Both issues can be mitigated by building models (feature-based and/or neural) trained on the task at hand. However, as we show in §6, these models can still get confused by the irrelevant information in the document, and their performance can be improved by first identifying the relevant sentences of the document.
We model this problem of finding relevant sentences in a large document as a sentence-level binary classification problem, where every sentence in the email is considered either relevant or irrelevant to the context of scheduling meetings. While we focus on scheduling as an example throughout this paper, we believe our approach would be useful in domains beyond scheduling. We show that when used as a preprocessing step (Fig 1b), our proposed model (ScopeIt) identifies the relevant sentences in an email well and thereby boosts the performance of the downstream intent classifiers and entity extractors. Additionally, we show the utility of the same model for signature block detection in component-level analyses of emails. We demonstrate that our method can identify signature blocks for signature removal, which is often required when preprocessing emails for text-to-speech systems or when anonymizing email corpora.
The main contributions of this work are:
• We propose a novel model (ScopeIt) for scoping out task relevant sentences from a large document that outperforms strong baseline methods.
• We illustrate the benefits of using ScopeIt as a preprocessing step and show that it improves the performance of a suite of downstream intent classifiers and entity detectors for the meeting scheduling task, improving precision by 35% on average without any drop in recall. To the best of our knowledge, this is the first work to explore the utility of scoping task relevant sentences as a preprocessing step for tackling problems involving large text corpora.
• We show that our proposed architecture also performs better than publicly available baselines on component-level tasks like signature detection and generalizes better to real world data.
We present our approach to the problem of scoping out relevant sentences in §2. In §3, we describe our experimental setup and introduce the baselines we compare our approach against. We discuss ScopeIt’s performance in §4. We analyze the embedding space induced by ScopeIt in §5 to understand why it performs well. In §6 we show the effectiveness of using ScopeIt as a preprocessing step on downstream intent classification and entity extraction tasks. We then show the performance of ScopeIt on the signature detection task (§7). In §8 we discuss the related work. Finally, we conclude in §9.
2 Proposed Method

In this section, we outline our approach to the problem of scoping out relevant sentences for a NLP-based scheduling assistant. Our approach consists of 2 parts: a preprocessing module and a neural model. An incoming email is first passed through the preprocessing module. The preprocessed email is then tokenized, indexed and passed through the neural model to generate a confidence score for each sentence. The model is trained end-to-end with human-labeled gold scores denoting the relevant sentences of the email. We also adopt some data augmentation methods to improve model generalization.
2.1 Preprocessing Module
The preprocessing step fixes issues due to improperly decoded text (mojibake characters). Furthermore, since we use the wordpiece tokenizer (https://github.com/google/sentencepiece) to split each word into its constituent wordpieces, raw URLs or email addresses often generate a large number of uninformative wordpieces (e.g., "https://coling2020.org/pages/call_for_papers.html" generates 20 wordpieces). To circumvent this issue, we replace all URLs and email addresses with special tokens (e.g., URLTOKEN, EMAILTOKEN). We keep track of the original URLs/emails and invert the token replacement after obtaining the confidence scores from the neural model.
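The snippet below is a minimal sketch of this masking step, assuming simple regular expressions for URLs and email addresses; the regex patterns and helper names are illustrative rather than the production implementation.

```python
import re

# Illustrative patterns; the production preprocessor may use different ones.
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def mask_entities(text):
    """Replace URLs/emails with placeholder tokens and remember the originals."""
    replacements = []

    def _mask(match, token):
        replacements.append((token, match.group(0)))
        return token

    text = URL_RE.sub(lambda m: _mask(m, "URLTOKEN"), text)
    text = EMAIL_RE.sub(lambda m: _mask(m, "EMAILTOKEN"), text)
    return text, replacements


def unmask_entities(text, replacements):
    """Invert the masking after scoring, restoring the originals in order."""
    for token, original in replacements:
        text = text.replace(token, original, 1)
    return text
```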
2.2 Neural Model
Our neural model consists of three modules: an intra-sentence aggregator that aggregates information within a sentence, an inter-sentence aggregator that shares information across sentences, and a classifier that predicts the final relevance score of each sentence (Fig 1). Given a document, we first split it into sentences using NLTK's sentence tokenizer and then apply the wordpiece tokenizer to each sentence. Let $D = (s_1, \ldots, s_n)$ be the tokenized document, where $s_i = (w_{i1}, \ldots, w_{im_i})$, $w_{ij}$ denotes the $j^{th}$ wordpiece of the $i^{th}$ sentence, and $m_i$ denotes the length of the $i^{th}$ sentence. We predict the relevance of each sentence using the following approach:
Intra-Sentence Aggregator: Let $s_i = (w_{i1}, \ldots, w_{im_i})$ be the $i^{th}$ sentence. We generate a contextual embedding $e_{ij}$ for each token $w_{ij}$ using BERT [Devlin et al., 2018]. Note that generating embeddings for each sentence independently, along with replacing URLs with special tokens, allows us to circumvent BERT's limit of 512 positional embeddings (i.e., we can now encode up to $512 \times n$ wordpieces per document). Since we want to avoid back-propagating through BERT due to compute constraints, using the [CLS] token (as is commonly done to generate sentence representations) does not work well. Consequently, we use a Seq2Seq encoder over the token embeddings to better adapt the contextual embeddings to the task, and concatenate its final forward and backward hidden states to obtain the sentence embedding $h_i$ for each sentence.
Inter-Sentence Aggregator: Given the sentence embeddings $(h_1, \ldots, h_n)$, we use a second Seq2Seq encoder to aggregate information across sentences. This allows the model to capture the context provided by the surrounding sentences, enabling us to capture document-level features. A final sigmoid output layer generates the probability $\hat{y}_i$ of each sentence being relevant.
The model is trained with a binary cross-entropy loss using the gold-annotated relevance scores, i.e., given annotations $y_1, \ldots, y_n$ for the $n$ sentences:

$$\mathcal{L} = -\frac{1}{n} \sum_{i=1}^{n} \big[\, y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \,\big] \qquad (1)$$
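Below is a minimal PyTorch sketch of this architecture, assuming the per-sentence BERT token embeddings have already been computed with a frozen encoder. The layer sizes mirror Appendix B (2-layer BiGRUs with hidden dimension 128), but the module and variable names are our own.

```python
import torch
import torch.nn as nn


class ScopeItSketch(nn.Module):
    """Hierarchical sentence-relevance scorer: intra- then inter-sentence BiGRUs."""

    def __init__(self, bert_dim=768, hidden=128):
        super().__init__()
        # Intra-sentence aggregator over BERT wordpiece embeddings.
        self.intra = nn.GRU(bert_dim, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        # Inter-sentence aggregator over sentence embeddings.
        self.inter = nn.GRU(2 * hidden, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, 1)

    def forward(self, token_embeddings):
        # token_embeddings: list of (n_tokens_i, bert_dim) tensors, one per sentence,
        # precomputed with a frozen BERT model.
        sentence_vecs = []
        for sent in token_embeddings:
            _, h_n = self.intra(sent.unsqueeze(0))           # (layers*2, 1, hidden)
            fwd, bwd = h_n[-2, 0], h_n[-1, 0]                # final fwd/bwd states
            sentence_vecs.append(torch.cat([fwd, bwd], dim=-1))
        sent_seq = torch.stack(sentence_vecs).unsqueeze(0)   # (1, n_sents, 2*hidden)
        contextual, _ = self.inter(sent_seq)
        logits = self.classifier(contextual).squeeze(-1)     # (1, n_sents)
        return torch.sigmoid(logits)


# Training minimizes the binary cross entropy of Eq. (1):
# loss = nn.functional.binary_cross_entropy(model(embs), gold_labels)
```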
2.3 Data Augmentation
Given that most emails received by the scheduling assistant contain some information pertinent to scheduling, we augment the training data with irrelevant emails (i.e., emails not related to scheduling), sampled from the Enron dataset [Cohen, 2015, Klimt and Yang, 2004]. Further, we observed that the model was confused by texts that did not resemble general email writing styles, so we augment the dataset with negative samples from the IMDb and Yelp datasets [Kotzias et al., 2015]. We also observed that the original dataset was biased towards relevant information appearing at the beginning of the email. To account for this bias, we shuffle passages of text within each email, keeping the salutation and signature in place so that the shuffled emails remain coherent, and augment our dataset with the shuffled emails. Finally, we augment the dataset by creating templates for emails that are representative of the emails the system would receive and then randomly replacing proper nouns in these templates. Additional details can be found in Appendix A.
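As an illustration of the passage-shuffling augmentation, the sketch below permutes interior passages while leaving the first and last blocks (salutation and signature) in place; splitting passages on blank lines is an assumption made for the example.

```python
import random


def shuffle_interior_passages(email_text, seed=None):
    """Shuffle the passages of an email, keeping the salutation and signature fixed."""
    blocks = [b for b in email_text.split("\n\n") if b.strip()]
    if len(blocks) <= 3:  # nothing to shuffle between salutation and signature
        return email_text
    rng = random.Random(seed)
    interior = blocks[1:-1]
    rng.shuffle(interior)
    return "\n\n".join([blocks[0]] + interior + [blocks[-1]])
```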
3 Experiments on Scoping Relevant Sentences
3.1 Dataset and Experimental Setup
We show the performance of ScopeIt on an internal dataset for identifying sentences relevant to the scheduling context in emails. The details pertaining to the dataset creation can be found in Appendix A. Table 1 shows the number of instances in the train, validation and test splits. During evaluation, any sentence whose predicted score exceeds the classification threshold is labeled relevant and the rest are labeled irrelevant. We use the F1 score of the sentence relevance prediction task as the evaluation metric. We report the hyperparameters and training details in Appendix B.
3.2 Baselines
Seq2Seq Encoder:
This model does not use BERT for generating contextual embeddings. Instead, a standard word-level BiGRU model is used as the sentence encoder to generate sentence embeddings, with the vocabulary set to the top 10,000 most frequently occurring words encountered in the training data. The sentence embeddings are then projected using a feed-forward layer to generate the relevance probabilities.
No Inter-Sentence Aggregator:
This model uses BERT for generating the contextual embeddings, and then a BiGRU encoder to generate the sentence embeddings. It however does not make use of any inter-sentence aggregator; instead a feed-forward layer directly generates the relevance probabilities.
BERT with [CLS] only:
This model just uses the [CLS] token of BERT for generating the sentence embedding vector. Note that we don’t fine-tune the BERT model.
ScopeIt Without Data Augmentation:
Our proposed model without any of the data augmentation described in §2.3.
BertSum:
A state of the art extractive summarization model leveraging BERT [Liu and Lapata, 2019]. We use the implementation provided by the authors.
Table 1: Number of documents and sentences in each split of the sentence relevance dataset.

Split | n_docs | n_internal | n_sent | n_pos | n_neg
train | 21875 | 10546 | 233307 | 24428 | 208879
validate | 2436 | 1176 | 25866 | 2699 | 23167
test | 1215 | 1015 | 12055 | 1716 | 10339

(The test set is augmented with 200 completely irrelevant emails to gauge the model's performance in that scenario.)
Table 2: F1 scores on the sentence relevance prediction task.

Model | F1 Score
BERT [CLS] | 0.81
Seq2Seq Encoder | 0.83
No Inter-Sentence Aggregator | 0.89
BertSum [Liu and Lapata, 2019] | 0.90
ScopeIt Without Data Augmentation | 0.93
ScopeIt | 0.94
4 Main Results
Table 2 shows the performance of ScopeIt compared to the baseline models. Since the BERT [CLS] model is not fine-tuned, it does not perform as well as any of the models in which the Seq2Seq encoders are trained. Unsurprisingly, the models with BERT-augmented embeddings substantially outperform the standard Seq2Seq encoder model. We observe that the inter-sentence aggregator also improves performance. Finally, the model with data augmentation outperforms all of the baselines. We believe this is for two reasons. First, most emails have a prior of being relevant, simply because the user cc'd the scheduling assistant; consequently, the model predicted some sentences as relevant even for emails that contained none. Augmenting the data with completely irrelevant emails helps overcome that bias. Second, for most emails, the relevant scope occurs at the beginning of the email, so the baseline models are biased towards scoring the beginning of the email higher than the end, even if the beginning is not particularly relevant. Training with the shuffling data augmentation mitigates this issue.
Our proposed model also performs better than BertSum. We hypothesize that this is because our dataset is orders of magnitude smaller than the CNN/Daily Mail dataset [Hermann et al., 2015] used by Liu and Lapata (2019). While fine-tuning BERT models for general tasks does not require as much data, BertSum uses a formulation very different from the original BERT model [Devlin et al., 2018]: specifically, it uses alternating type tokens for each sentence, separates sentences with [SEP] and [CLS] tokens, and uses each sentence's [CLS] token to generate its sentence embedding. Consequently, fine-tuning BERT to adapt to these modifications potentially requires more data. Moreover, BertSum still suffers from the 512-wordpiece restriction (§2.1), while ScopeIt does not.
5 Clustering in the Embedding Space
We next investigate if sentence embeddings generated by ScopeIt exhibit any clusters that make semantic sense. On preliminary analysis, some clusters that we observed in the data were salutations, signature blocks and sentences containing entities associated with scheduling meetings. We hypothesize that similar clusters should be observed in the sentence embedding space. To test this hypothesis, we propose the following experiment: given the embedding of a sentence belonging to a certain cluster (the query sentence), retrieve the top k nearest neighbor sentence embeddings from a set of sentence embeddings generated by ScopeIt. If similar clusters exist in the embedding space, then sentences associated with the retrieved embeddings should belong in the same cluster as the query sentence.
Due to space constraints, we describe the methodology of the embedding experiment in Appendix C, and report our main findings here. We observe that salutations and signatures are clustered together. We also observe sub-clusters wherein sentences containing similar entities or intents are clustered with sentences containing similar information. Moreover, we find that these sentence embeddings also capture the context in which the sentences occur: syntactically similar query sentences get mapped to different clusters based on the context in which they occur.
6 Improvements to Downstream Tasks
Our main motivation for developing ScopeIt was the hypothesis that using the relevant sentences in place of the entire document would improve the performance of downstream NLP tasks. In this section we highlight the impact ScopeIt has on 6 downstream tasks, each of which is associated either with detecting an intent related to scheduling a meeting or with extracting the necessary entities. The models for tackling these tasks use the scoped message generated by ScopeIt as input. We also consider the "Non-actionable Emails" task, which helps the scheduling assistant identify emails that it should ignore. The models used for each task vary: they can be context independent regex models or context aware neural models. For each task, we first describe the task and then the model(s) used for solving it. Finally, to show the impact of ScopeIt, we give the model(s) both the original unaltered email and the scoped version as input, and compare the performance difference. We summarize the results of these experiments in §6.1. Note that there is no overlap between the data used for the analysis presented in this section and the data used for training ScopeIt.
1. Meeting Type
When the scheduling assistant receives an email and has determined that it expresses an intent to schedule a meeting, the Meeting Type task classifies the meeting request into one of the broad meeting type classes defined by the system. Each category has special meeting properties that help the assistant populate the meeting details. Examples of these meeting types are Lunch (which constrains the candidate times), Conference Call (which requires a remote bridge), Phone Call, etc. The assistant uses an ensemble of different models to classify meeting requests into these classes. For this case study, we focus on the model responsible for detecting a call or a conference call intent, which map to the Phone Call and Conference Call meeting type classes, respectively.
Example Input: “Let us get together on a Team’s call.”
Expected Output: Conference Call Intent
This task is modeled as a multi-label classification task, and we use a context aware deep network to tackle it, similar to the model proposed in [Mullenbach et al., 2018]. Specifically, we generate a contextual embedding using BERT for each token in the email. Then an attention mechanism [Bahdanau et al., 2014], one per label, is used to aggregate the token embeddings into a document embedding, which is passed through a sigmoid layer to generate the probability of each label. The entire model is trained end to end by minimizing the negative log likelihood of the gold labels. When using ScopeIt, we select only the sentences scored above a particular threshold (0.01) and feed their concatenation as input to the model.
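The following is a hedged sketch of such a per-label attention classifier in the spirit of Mullenbach et al. (2018): one attention vector per label pools the BERT token embeddings into a label-specific document vector, which is then scored with a sigmoid. Dimensions and names are illustrative, not the production model.

```python
import torch
import torch.nn as nn


class LabelAttentionClassifier(nn.Module):
    """Multi-label classifier with one attention head per label."""

    def __init__(self, emb_dim=768, n_labels=2):  # e.g. Phone Call, Conference Call
        super().__init__()
        self.attention = nn.Linear(emb_dim, n_labels, bias=False)  # one query per label
        self.output = nn.Linear(emb_dim, n_labels)

    def forward(self, token_embeddings):
        # token_embeddings: (batch, n_tokens, emb_dim), e.g. from a frozen BERT encoder.
        scores = self.attention(token_embeddings)           # (batch, n_tokens, n_labels)
        alpha = torch.softmax(scores, dim=1)                # attention over tokens
        doc_per_label = torch.einsum("btl,bte->ble", alpha, token_embeddings)
        logits = (doc_per_label * self.output.weight).sum(-1) + self.output.bias
        return torch.sigmoid(logits)                        # (batch, n_labels)
```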
2. Meeting Duration
The scheduling assistant needs to extract the duration of the meeting from the meeting organizer's email. If no duration entity is detected, the system uses the default duration set by the organizer in their meeting preferences.
Example Input: “Hedwig, schedule a meeting for 30 minutes.”
Expected Output: 30 minutes.
We use LUIS (https://www.luis.ai/home) for extracting the duration of a meeting from meeting requests. LUIS is the Language Understanding Service in Microsoft Azure Cognitive Services that provides natural language intelligence for conversational AI applications [Williams et al., 2015]. To utilize LUIS' high recall duration extraction model in the context of scheduling meetings, we select the sentences scored above 0.01 by ScopeIt and feed their concatenation as the input to LUIS' duration extraction model.
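The same scoping pattern is reused by the other extraction tasks in this section; a sketch of the glue code is shown below, where scope_it and extract_duration are hypothetical names standing in for the ScopeIt scorer and the downstream (e.g., LUIS) extractor.

```python
def scoped_input(sentences, scores, threshold=0.01):
    """Keep the sentences ScopeIt scores above the threshold and concatenate them."""
    kept = [sentence for sentence, score in zip(sentences, scores) if score > threshold]
    return " ".join(kept)


# Example usage (scope_it / extract_duration are hypothetical wrappers):
# sentences = split_into_sentences(email)
# duration = extract_duration(scoped_input(sentences, scope_it(sentences)))
```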
3. Meeting Phone Number
When users schedule a phone call, the system needs to extract phone numbers from the organizer or attendee to add to the meeting invite.
Example Input: “Hedwig, please schedule a call with Albus. My phone number is +1 000-000-0000. Regards, Gellert Grindelwald”
Expected Output: +1 000-000-0000.
We use LUIS for extracting the phone numbers from an email. We extract sentences scored above a threshold of 0.01 by ScopeIt, concatenate the sentences, and feed that as an input to the high recall phone number extraction model.
4. Meeting Location
In order to schedule the meeting at the right location, the system needs to extract the intended location expressed by the organizer.
Example Input: “Hedwig, schedule a meeting. Hagrid, let’s meet at the 3 Broomsticks.”
Expected Output: the 3 Broomsticks
This is modeled as an entity extraction problem, and consequently we fine-tune BERT for tagging (similar to how BERT is used for NER [Devlin et al., 2018]). We concatenate the sentences scored above a certain threshold by ScopeIt and pass the result as input to the model.
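A hedged sketch of such a BERT tagging model, using the Hugging Face token classification head with BIO location labels, is shown below. The checkpoint name and label set are assumptions; the production model is fine-tuned on internal data.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = ["O", "B-LOC", "I-LOC"]
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(LABELS))


def extract_locations(scoped_text):
    """Tag each wordpiece of the scoped email and return the non-O predictions."""
    enc = tokenizer(scoped_text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        pred = model(**enc).logits.argmax(-1)[0]          # (n_wordpieces,)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    return [(tok, LABELS[i]) for tok, i in zip(tokens, pred.tolist())
            if LABELS[i] != "O"]
```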
5. Meeting Timezone
Users typically express time zones in two ways: explicitly, by mentioning time zone abbreviations like "EST", or implicitly, by indicating the city and sometimes the country where the meeting is going to be held.
Example Input: “Hedwig, schedule an online meeting with Ron Weasley next week. Ron is in EST, and I am going to be working from Dublin for that week.”
Expected Output: EST, Dublin
By using ScopeIt to filter out sentences irrelevant to scheduling the meeting, the system is able to leverage recall-heavy time zone entity extractors, and city and country extractors to find the right time zones. It utilizes LUIS for time zone entity extraction and LU (Location Understanding) from Bing to extract cities and countries from the input text. These utterances are subsequently resolved for their time zone offsets.
6. Non-actionable Emails
When the scheduling assistant processes a request, the system might receive emails from meeting participants which are irrelevant to scheduling. For example, after the meeting organizer has sent a request to the scheduling assistant (Hedwig, in the prior examples), one of the invitees might reply to the email thread with all meeting participants including the assistant saying, “Thanks for setting this up. Look forward to meeting you.” In these cases, there is no action required from the system’s point of view and the email can be safely ignored. Similar to the approach stated in the previous tasks, sentences in the email that are scored above a threshold are extracted and concatenated. If there are no sentences in the email above the relevance threshold, the email is considered irrelevant and is ignored by the system.
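The decision rule is simple; below is a sketch under the same illustrative threshold used above.

```python
def is_actionable(scores, threshold=0.01):
    """An email is actionable only if ScopeIt scores at least one sentence as relevant."""
    return any(score > threshold for score in scores)
```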
6.1 Results
Table 3: Impact of using ScopeIt as a preprocessing step on the downstream tasks (Δ: change after ScopeIt).

Task | Task Type | Model Type | Metric | Before ScopeIt | After ScopeIt | Δ
Meeting Type | Classification | Context Aware | Accuracy | 0.72 | 0.96 | +0.24
Non-actionable Emails | Classification | Context Aware | Accuracy | N/A | 0.96 | +0.96
Duration | Extraction | Context Independent | Accuracy | 0.88 | 0.92 | +0.04
Phone Number | Extraction | Context Independent | Precision | 0.46 | 0.98 | +0.52
 | | | Recall | 1 | 1 | 0
Location | Extraction | Context Aware | Precision | 0.73 | 0.96 | +0.23
 | | | Recall | 0.92 | 0.96 | +0.04
Timezone | Extraction | Context Independent | Precision | 0.37 | 0.67 | +0.30
 | | | Recall | 0.92 | 0.96 | +0.04
Average | | | Accuracy | | | +0.14
 | | | Precision | | | +0.35
 | | | Recall | | | +0.02
Table 3 summarizes the utility of using ScopeIt. For the intent classification and duration extraction tasks, we see an average increment of 0.14 in the accuracy. An interesting observation is that even the context aware neural model benefits strongly (+0.24 accuracy improvement).
For the entity extraction models, we observe a strong increase in precision, with an average increase of 0.35. The context independent models benefit strongly when we strip out the irrelevant parts of the document: as shown in the example in Figure 1a, phone numbers extracted by the context independent regex based model are often found in the signature block of the email. A similar behavior is also observed in the timezone extraction task, where locations in the signature often get picked up as timezones. As hypothesized, once the email is scoped to only the relevant parts, these models get a substantial boost in precision. A similar gain is also observed for the BERT Location extractor.
An interesting observation is that the recall for these extraction models also improves. On further investigation, we found that this can be attributed to an increase in the true positives. For the BERT Location extraction, this makes sense, since a simplified input allows the model to reason better about the location. For the timezone task, we hypothesize that the LU model has additional heuristics and that the heuristics perform better on the simplified inputs.
Using ScopeIt also offers the benefit of making regex models feasible to use. This is especially advantageous since regex-based models have faster inference times and require much less data to build than their neural counterparts.
Finally, ScopeIt also helps the scheduling assistant decide which emails to process and which to ignore, which plays a crucial role for an email-based agent. People often use reply-all while interacting with each other on an email thread, which leads to the agent receiving emails whose contents are not relevant to the task of scheduling the meeting. Using ScopeIt ensures that those emails are ignored by the agent.
7 Signature Block Detection
As described in §5, sentences with similar semantics are clustered close to each other in the sentence embedding space. We use this observation to apply our model to component detection in email, specifically signature block identification. We evaluate the model on a publicly available dataset and show that it outperforms the baseline model. We also hypothesize that publicly available systems for extracting signatures are not suitable for real-world use-cases: they are often trained on well structured emails using hand crafted features, and hence are not robust to the variety of writing styles that people employ in the real world. To validate this hypothesis, we test the effectiveness of the baseline on our use-case.
7.1 Dataset and Experimental Setup
We use the 20-Newsgroup dataset of emails annotated with signature blocks [Carvalho and Cohen, 2004], which is publicly available (https://www.cs.cmu.edu/~vitor/codeAndData.html). We use a standard split of 80%, 10% and 10% for training, validation, and testing. To validate our hypothesis about the efficacy of the publicly available baseline on our use-case, we annotate 625 emails with signature blocks and test the performance of the baseline as well as our model (trained on the 20-Newsgroup dataset) on this annotated set. The number of instances in each dataset can be found in Table 4.
7.2 Baseline
We compare against a publicly available signature detection tool, Jangada [Carvalho and Cohen, 2004]. Jangada uses a CRF model with handcrafted features and is trained on the 20-Newsgroup dataset.
Table 4: Number of documents and sentences in the signature block detection datasets.

Dataset | Split | n_docs | n_sent | n_pos | n_neg
20 Newsgroup | train | 465 | 20629 | 18076 | 2553
20 Newsgroup | val | 52 | 1797 | 1522 | 275
20 Newsgroup | test | 100 | 4547 | 4054 | 493
Manually Annotated | train | 501 | 6055 | 2043 | 4012
Manually Annotated | val | 62 | 670 | 227 | 443
Manually Annotated | test | 62 | 663 | 260 | 403
Table 5: Signature block detection results.

Dataset | Model | Precision | Recall | F-score
20 Newsgroup | Jangada | 0.98 | 0.971 | 0.975
20 Newsgroup | ScopeIt | 0.992 | 0.999 | 0.996
Manually Annotated | Jangada | 0.908 | 0.224 | 0.359
Manually Annotated | ScopeIt | 0.995 | 0.884 | 0.936
7.3 Results on Signature Block Detection
As seen in Table 5, our proposed neural model outperforms Jangada on the 20 Newsgroup dataset. We also validate our hypothesis: when we use Jangada to remove signatures in our real world use-case, we observe that while it has high precision, its recall drops drastically (0.224), making it impractical to use in production. On the other hand, ScopeIt, even when trained only on 20 Newsgroup, generalizes much better (recall: 0.884, F-score: 0.936).
8 Related Work
Our problem of relevance scoping in documents is similar to extractive summarization, which deals with selecting subsets (usually sentences) of a document that succinctly summarize it. For this scheduling assistant, scoping out the relevant part of an email is in essence selecting the subset of sentences that accurately summarizes the scheduling intent and specifies the parameters necessary to schedule a meeting correctly. Both traditional feature-based methods using word probability, TF-IDF weights, sentence position and sentence length features [Luhn, 1958, Eduard and Lin, 1998, Cao et al., 2015, Ren et al., 2016] and recent neural methods [Nallapati et al., 2017, Zhou et al., 2018, Narayan et al., 2018, Liu et al., 2019, Liu and Lapata, 2019] have been used for extractive summarization. Liu and Lapata (2019) show the benefit of using pretrained language models [Peters et al., 2018, Radford et al., 2018, Devlin et al., 2018, Dong et al., 2019, Zhang et al., 2019] for the same task. Their proposed model, BertSum, leverages interval segment embeddings to distinguish multiple sentences within a document. BertSum further fine-tunes the BERT embeddings to learn the segment embeddings during training, which potentially requires more data, and can only encode documents up to 512 wordpieces long. In contrast, we use hierarchical RNNs (similar to [Nallapati et al., 2017, Zhou et al., 2018]), with the pretrained embeddings forming the embedding layer (Fig. 1), thereby allowing us to encode emails much longer than 512 tokens.
Intent classification and entity extraction tasks in the context of conversational understanding have been studied both in academia and corporate research laboratories [de Mori et al., 2008]. There exists a rich body of research in user intent identification from targeted queries [Wang et al., 2014]. However, these methods don’t work as well when applied to large documents. We showed that scoping out the relevant parts in a document improves performance of classification and extraction tasks on large queries. To the best of our knowledge, this is the first work to explore the utility of extractive summarization as a preprocessing step for tackling problems involving large text corpora.
There has been extensive research on the topic of identifying signature blocks and reply lines from an email [Carvalho and Cohen, 2004, Minkov et al., 2005, Balog and de Rijke, 2006, Xiaoqin, 2015]. [Balog and de Rijke, 2006, Xiaoqin, 2015] present heuristic driven methods for unsupervised identification of the signature body, while [Carvalho and Cohen, 2004] present a CRF based approach for identifying and extracting signature and reply lines from Email. We showed in §7 that our proposed method also works well for removing signatures and also generalizes better.
9 Conclusion
In this paper, we proposed a simple method for scoping relevant information within emails and showed the impact the model has on a suite of tasks that are vital for the scheduling assistant. We also showed our model's applicability to the task of signature detection, where it performs better than existing publicly available baselines and generalizes better to real-world use-cases.
In this work we showed that our model works well with emails. For future work, we plan to investigate the impact of our proposed method on other tasks that process large textual inputs (e.g., document classification, sentiment analysis on long reviews). Furthermore, using BERT for inference poses latency challenges in a production system. A promising direction of future work that we plan on investigating is leveraging distilled versions of BERT [Sun et al., 2019, Wang et al., 2020] for the task.
References
- [Bahdanau et al., 2014] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
- [Balog and de Rijke, 2006] Krisztian Balog and Maarten de Rijke. 2006. Finding experts and their details in e-mail corpora. In Proceedings of the 15th international conference on World Wide Web, pages 1035–1036.
- [Cao et al., 2015] Ziqiang Cao, Furu Wei, Sujian Li, Wenjie Li, Ming Zhou, and Houfeng Wang. 2015. Learning summary prior representation for extractive summarization. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 829–833.
- [Carvalho and Cohen, 2004] Vitor R. Carvalho and William W. Cohen. 2004. Learning to extract signature and reply lines from email. In CEAS 2004 - First Conference on Email and Anti-Spam, Mountain View, CA.
- [Cho et al., 2014] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
- [Cohen, 2015] William W. Cohen. 2015. Enron email dataset.
- [de Mori et al., 2008] R de Mori, F Bechet, D Hakkani-Tur, M McTear, G Riccardi, and G Tur. 2008. Spoken language understanding. IEEE Signal Processing Magazine, 25(3):50–58, 9.
- [Devlin et al., 2018] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- [Dong et al., 2019] Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, pages 13042–13054.
- [Eduard and Lin, 1998] Hovy Eduard and Chin-Yew Lin. 1998. Automated text summarization and the summarist system. In Proceedings of a workshop on held at Baltimore.
- [Gardner et al., 2017] Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke S. Zettlemoyer. 2017. Allennlp: A deep semantic natural language processing platform.
- [Hermann et al., 2015] Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in neural information processing systems, pages 1693–1701.
- [Kingma and Ba, 2014] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- [Klimt and Yang, 2004] Bryan Klimt and Yiming Yang. 2004. Introducing the enron corpus. In CEAS.
- [Kotzias et al., 2015] Dimitrios Kotzias, Misha Denil, Nando de Freitas, and Padhraic Smyth. 2015. From group to individual labels using deep features. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’15, page 597–606, New York, NY, USA. Association for Computing Machinery.
- [Liu and Lapata, 2019] Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders.
- [Liu et al., 2019] Yang Liu, Ivan Titov, and Mirella Lapata. 2019. Single document summarization as tree induction. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1745–1755, Minneapolis, Minnesota, June. Association for Computational Linguistics.
- [Luhn, 1958] Hans Peter Luhn. 1958. The automatic creation of literature abstracts. IBM Journal of research and development, 2(2):159–165.
- [Minkov et al., 2005] Einat Minkov, Richard C. Wang, and William W. Cohen. 2005. Extracting personal names from email: Applying named entity recognition to informal text. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 443–450, Vancouver, British Columbia, Canada, October. Association for Computational Linguistics.
- [Mullenbach et al., 2018] James Mullenbach, Sarah Wiegreffe, Jon Duke, Jimeng Sun, and Jacob Eisenstein. 2018. Explainable prediction of medical codes from clinical text. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1101–1111, New Orleans, Louisiana, June. Association for Computational Linguistics.
- [Nallapati et al., 2017] Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017. Summarunner: A recurrent neural network based sequence model for extractive summarization of documents. In Thirty-First AAAI Conference on Artificial Intelligence.
- [Narayan et al., 2018] Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Ranking sentences for extractive summarization with reinforcement learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1747–1759, New Orleans, Louisiana, June. Association for Computational Linguistics.
- [Pedregosa et al., 2011] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.
- [Peters et al., 2018] Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana, June. Association for Computational Linguistics.
- [Radford et al., 2018] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL https://s3-us-west-2. amazonaws. com/openai-assets/researchcovers/languageunsupervised/language understanding paper. pdf.
- [Ren et al., 2016] Pengjie Ren, Furu Wei, Zhumin Chen, Jun Ma, and Ming Zhou. 2016. A redundancy-aware sentence regression framework for extractive summarization. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 33–43.
- [Sun et al., 2019] Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. 2019. Patient knowledge distillation for BERT model compression. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4323–4332, Hong Kong, China, November. Association for Computational Linguistics.
- [Wang et al., 2014] Zhuoran Wang, Hongliang Chen, Guanchun Wang, Hao Tian, Hua Wu, and Haifeng Wang. 2014. Policy learning for domain selection in an extensible multi-domain spoken dialogue system. In Proceedings of EMNLP 2014. Association for Computational Linguistics.
- [Wang et al., 2020] Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. 2020. Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. arXiv preprint arXiv:2002.10957.
- [Williams et al., 2015] Jason D Williams, Eslam Kamal, Mokhtar Ashour, Hani Amr, Jessica Miller, and Geoff Zweig. 2015. Fast and easy language understanding for dialog systems with microsoft language understanding intelligent service (luis). In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 159–161.
- [Xiaoqin, 2015] Yuan Xiaoqin. 2015. Unsupervised extraction of signatures and roles from large-scale mail archives. International Journal of Security and Its Applications, 9(4):229–238.
- [Zhang et al., 2019] Xingxing Zhang, Furu Wei, and Ming Zhou. 2019. HIBERT: Document level pre-training of hierarchical bidirectional transformers for document summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5059–5069, Florence, Italy, July. Association for Computational Linguistics.
- [Zhou et al., 2018] Qingyu Zhou, Nan Yang, Furu Wei, Shaohan Huang, Ming Zhou, and Tiejun Zhao. 2018. Neural document summarization by jointly learning to score and select sentences. arXiv preprint arXiv:1807.02305.
Appendix A Dataset Creation Details
The dataset is built by sampling 12737 emails from an internal dataset. These emails were first split into a train set and a test set at a 90%-10% ratio, and the train set was then split again at a 90%-10% ratio to form the training and validation sets. To measure the inter-annotator agreement for the dataset, we randomly sampled 200 emails, which were then annotated by another annotator. The Cohen's kappa (κ) measured was 0.89.
This dataset is augmented with negative samples from the Enron dataset, which account for emails from a professional setting that are not related to scheduling. This is done using a list of disqualification words that remove emails that could potentially refer to meetings. The disqualification words for Enron are as follows: "book a room", "let's meet", "meeting", "conference room", "meet", "invitation", "location", "half an hour", "30 mins", "30 minutes", "45 mins", "schedule" and "reserve". If any of these phrases is found in an email, the email is disqualified from being used as a negative sample. A total of 5429 emails were added from the Enron dataset.
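The filter itself is straightforward; the helper name in the sketch below is illustrative.

```python
DISQUALIFIERS = [
    "book a room", "let's meet", "meeting", "conference room", "meet",
    "invitation", "location", "half an hour", "30 mins", "30 minutes",
    "45 mins", "schedule", "reserve",
]


def is_negative_sample(email_text):
    """An Enron email is kept as a negative sample only if it contains no disqualifier."""
    lowered = email_text.lower()
    return not any(phrase in lowered for phrase in DISQUALIFIERS)
```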
To account for emails that do not conform to the language typically used in a professional setting, text data was sampled and added from the Yelp and IMDb subsets of the UCI Sentiment Labelled Sentences dataset. Specifically, 1000 documents and 748 documents were sampled from the Yelp and IMDb subsets respectively, after applying similar disqualification rules. These were added to the train and validation sets in the same proportions as before. This augmentation was necessary to ensure the model is not biased to believe that all documents contain something relevant. We also add 200 such examples to the test set to check whether the trained models learn to discard completely irrelevant documents.
Other data augmentation methods included random replacement and random swapping. For random replacement, proper nouns in templatized emails were replaced. Finally, the dataset was further augmented by randomly swapping passages within longer emails (those containing multiple passages). The first and last passages were not swapped, as these generally tend to be salutation or signature blocks. These augmentations account for the remainder of the data points in the dataset.
Appendix B Hyperparameters and Training Details
We use the BERT-Base, Multilingual Cased model (https://github.com/google-research/bert/blob/master/multilingual.md) for generating the contextual embeddings. We do not fine-tune the BERT model in any of our models due to compute constraints. We use a 2 layer BiGRU encoder [Cho et al., 2014] with a hidden dimension of 128 as the Seq2Seq encoder for the intra-sentence aggregator, and a 2 layer BiGRU (hidden dimension of 128) as the Seq2Seq encoder for the inter-sentence aggregator. The model was trained with gradient descent for 50 epochs. We used Adam [Kingma and Ba, 2014] as the optimizer, with the learning rate selected by grid search. The learning rate was annealed by a factor of 0.5 if the validation loss failed to improve over 5 epochs, and we also used early stopping with a patience of 8. All our models were developed using the AllenNLP framework [Gardner et al., 2017]. For tuning, we performed a grid search over learning rates and used the largest batch size that did not cause out-of-memory issues. All models were trained on a single K80 instance.
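A minimal PyTorch sketch of this optimization setup is shown below; the learning rate is a placeholder (the tuned value is not reproduced here) and the linear layer merely stands in for the full ScopeIt model.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)      # stand-in for the ScopeIt model
LEARNING_RATE = 1e-3          # placeholder; the actual value is chosen by grid search

optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
# Anneal the learning rate by 0.5 if the validation loss stalls for 5 epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=5)

# Per epoch: call scheduler.step(val_loss); stop early if the validation loss
# has not improved for 8 consecutive epochs (early stopping patience of 8).
```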
Appendix C Clustering in Embedding Space
Table 6: Nearest neighbors (NN) retrieved for query sentences taken from two generated emails.

Original Email | Query | Cluster Type | NN With Surrounding Context
Hey Harry I'm using Hedwig to schedule a meeting! @Hedwig, schedule a meeting for next week, in Hogsmade. Thanks, Ronald Weasley The Burrow Ottery St. Catchpole England | Hey Harry | Salutation | Hi Richard, Per my voicemail, are you available for w/Greg
 | | | Jim, Is there going to be a conference call or some other type of weekly meeting
 | | | Hi Shirley, Is this meeting still set for tomorrow?
 | @Hedwig, schedule a meeting for next week, in Hogsmade. | Date-time availability intent | call memo that we will forward on early next week. Chris Long will be in touch on Tuesday to help coordinate the recommended call.
 | | | Any possibility of rescheduling to another day? Sally is available Thursday, June 1.
 | | | Susan, please organize a meeting with Steve, Kim, and Tracey early next week, say Monday or Tuesday
 | Ronald Weasley | Signature | Thank you Mona ********
 | | | Thanks, Larry ********
 | | | Thanks, Patti
Hey Ron, Sounds good. Let's meet at the Three Broomsticks. @Hedwig, Ron will call me. My phone number is 000-000-0000. Thanks, Harry Potter Ph: 000-000-0000 | My phone number is 000-000-0000. | Phone availability intent | I should contact your assistant to schedule a meeting. If you need to contact me immediately, please call my cell phone at 000-000-0000.
 | | | he is available for a meeting (or conference call) to discuss the GE facility agreement sometime tomorrow - either am or after 300.
 | | | Hey Suz Is Sheila still planning on having the GE call tomorrow?
 | Ph: 000-000-0000 | Signature | Thanks. Rahul
 | | | Larry ******** (000) 000-0000
 | | | Director Government Affairs - The Americas
To generate the set of sentence embeddings, we use the aforementioned publicly available Enron dataset. We randomly sample 10000 of the 500000 emails in the dataset and generate sentence embeddings for every sentence in them. We then generate typical emails that a user might send to their scheduling assistant and use sentences from those emails as query sentences to probe the sentence embedding space. We use Scikit-learn's [Pedregosa et al., 2011] NearestNeighbors method (https://scikit-learn.org/stable/modules/neighbors.html) for the NN computation and retrieve the 3 NN sentences for each query. Specifically, we retrieve the sentence that generated each embedding as well as the email containing it; this provides context, since the sentence embeddings also take context into consideration. We redact personal information such as names and phone numbers in order to preserve the privacy of the users in the Enron dataset.
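An illustrative use of scikit-learn's NearestNeighbors for this probe is shown below: the index is fit on the Enron sentence embeddings and queried with a single query-sentence embedding. The embedding arrays here are random stand-ins for the actual ScopeIt embeddings.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

corpus_embeddings = np.random.rand(1000, 256)   # stand-in for ScopeIt sentence embeddings
query_embedding = np.random.rand(1, 256)        # embedding of one query sentence

nn_index = NearestNeighbors(n_neighbors=3).fit(corpus_embeddings)
distances, indices = nn_index.kneighbors(query_embedding)
# `indices` points back to the Enron sentences (and their containing emails) shown in Table 6.
```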
Table 6 shows the results of the NN analysis. We use Red to denote the final scoped out email as predicted by ScopeIt, and we use Blue to denote the actual NN of the query sentence. As shown in the table, the queried NNs belong to the same cluster as the query. We see that salutations and signatures get clustered together. We also observe sub-clusters wherein sentences containing date-time availability or phone call intents get mapped to sentences containing similar information. We also observe that contextual information is captured by these contextual embeddings. As shown by the second generated email, syntactically similar query sentences can get mapped to different clusters, based on the context in which they occur.