
FACOS: Finding API Relevant Contents on Stack Overflow with Semantic and Syntactic Analysis

Kien Luong1, Mohammad Hadi2, Ferdian Thung1, Fatemeh Fard2, and David Lo1 1School of Computing and Information Systems, Singapore Management University 2Irving K. Barber Faculty of Science, University of British Columbia {kiengialuong, ferdianthung, davidlo}@smu.edu.sg {mohammad.hadi, fatemeh.fard}@ubc.ca
Abstract

Collecting API examples, usages, and mentions relevant to a specific API method from discussions on venues such as Stack Overflow is not a trivial problem. It requires correctly recognizing whether a discussion refers to the API method that developers or tools are searching for. The content of a thread, which consists of both text paragraphs describing the involvement of the API method in the discussion and code snippets containing the API invocation, may refer to the given API method. Leveraging this observation, we develop FACOS, a context-specific algorithm to capture the semantic and syntactic information of the paragraphs and code snippets in a discussion. FACOS combines a syntactic word-based score with a score from a predictive model fine-tuned from CodeBERT. FACOS beats the state-of-the-art approach by 13.9% in terms of F1-score.

I Introduction

Developers typically use existing libraries or frameworks to implement certain common functionalities. Understanding which APIs to use, the methods they offer, their distinctive names, and how to use them is vital in this regard. There may be hundreds or even thousands of APIs in a large-scale software library such as the .NET framework and JDK. Microsoft conducted a survey in 2009 in which 67.6% of respondents said that inadequate or absent resources hindered learning APIs [1].

To gain a deeper understanding of APIs and their usage, developers manually inspect many web pages or rely on automated code search tools. Most code search tools are based on keyword matching and do not consider the semantics of natural language queries. Stack Overflow is the second most common place for developers to discover APIs, their simple method names, and their usage through crowd-sourced questions and answers. As many API methods share simple names but provide different functionality, it is difficult to find code snippets and APIs that correspond to the specific problem developers search for on these platforms. Moreover, API mentions in the informal text content of Stack Overflow are often ambiguous, which makes it difficult to track down APIs and learn their uses.

Developers frequently discuss and mention APIs in natural language on online discussion and question answering forums like Stack Overflow [2, 3, 4, 5]. When developers or automated tools look for a specific API, API methods sharing the same simple name can be ambiguous. Therefore, we require API disambiguation to support downstream tasks such as API recommendation and API mining. To properly index and link APIs to their related information in various sources (e.g., Stack Overflow, Javadoc, etc.), it is important to link ambiguous API mentions to their actual APIs correctly.

Luong et al. recently proposed DATYS [6], which uses type scoping to disambiguate API mentions in the informal text content of Stack Overflow. In type scoping, API methods whose types appear in more parts (i.e., scopes) of a Stack Overflow thread are considered more likely to be the searched API method. However, this matching is based on the literal appearance of words in a sentence rather than on the context in which a sequence of words is used and the connotations those words carry for readers.

On Stack Overflow, API disambiguation is thus crucial to finding APIs. It supports several downstream tasks, such as API recommendation [7, 8] and API mining [7, 8], which rely on interpreting API mentions correctly to index and link APIs to relevant information in various data sources, including Stack Overflow, Javadoc, etc.

To incorporate a deeper understanding of the underlying semantics of natural language text content on Stack Overflow, we introduce FACOS, a context-specific algorithm that captures the semantic and syntactic information of the paragraphs and code snippets in a crowd-sourced discussion. We call this task API resource retrieval because FACOS focuses on finding Stack Overflow threads mentioning a given API method. We also modified DATYS to search better over the code snippets in Stack Overflow discussion threads. The modified DATYS, denoted DATYS+, adds a metric that better captures the occurrence of the API method type in a Stack Overflow thread. By greedily matching the type name with the tokens in the code snippets, DATYS+ performs the syntactic search in FACOS.

However, both DATYS and DATYS+ search based only on the syntactic information provided by the fully qualified name of a target API method. They cannot capture the semantic meaning of the paragraphs and code snippets of threads on Stack Overflow, nor how similar these are to the target API method. Thus, to capture the semantics in addition to the weighted syntactic information provided by DATYS+, FACOS has a semantic search component that leverages a deep attention-based Transformer model, CodeBERT [9]. This component measures the similarity between the paragraphs and code snippets of a Stack Overflow thread and the target API method's comment and implementation code: the more similar they are, the more likely the thread is relevant to the target API method being searched for. To efficiently leverage both the semantic and syntactic knowledge of a Stack Overflow thread and the API method, FACOS joins the semantic and syntactic search components into a joint relevance score that determines whether a thread relates to the given API method. The contribution of each component to the joint relevance score is defined by a weighting factor.

In this paper, we answer the following research questions:

  1. RQ1: Can FACOS perform better than the baseline (DATYS)?

  2. RQ2: How well does each component of FACOS perform?

  3. RQ3: How does the weighting factor affect the F1-score of FACOS?

These research questions will help us understand the effectiveness of our approach FACOS and the internal mechanism through which it yields better results than the current baseline. Our work offers the following main contributions:

  1. To our knowledge, we are the first to adopt a transformer-based deep learning technique to incorporate semantic knowledge understanding for the API resource retrieval task.

  2. Compared to the state-of-the-art technique, our approach performs better when searching for contents related to the queried API. On a dataset of 380 Stack Overflow threads, FACOS beats the state of the art by 13.9%.

  3. We designed an ablation study to understand how the integrated components of our approach perform. We found that each component contributes to the effectiveness of FACOS.

  4. We have also open sourced our code and the additional artifacts required for recreating the results and re-purposing our approach for other tasks. The source code for FACOS is available at https://anonymous.4open.science/r/facos-E5C6/

The rest of the paper is structured as follows: Section II covers the preliminary knowledge about the components on top of which we build our method. Section III provides an overview of our proposed approach, while Section IV elaborates on its various components. We describe our experimental details and results in Sections V and VI, respectively. Related work and the threats to validity are presented in Sections VIII and VII-D. Finally, we conclude our work and present future work in Section IX.

II Preliminaries

II-A DATYS

Two steps are involved in finding APIs mentioned in informal text content: (1) API mention extraction and (2) API mention disambiguation. API mention extraction aims to identify common words that refer to APIs. API mention disambiguation, on the other hand, links API mentions to the APIs they reference. DATYS [6] specifically deals with API mention disambiguation via type scoping in the informal text of Stack Overflow, resolving ambiguous mentions of Java API methods once the mentions have been identified.

After extracting API method candidates from input Java libraries, DATYS scores the candidates based on how often their types (i.e., classes or interfaces) appear in different parts (i.e., scopes) of the Stack Overflow thread with identified API mentions. Having a type that appears in more scopes increases an API candidate's score. Here, DATYS considers three scopes: the Mention scope, which covers the mention itself; the Text scope, which covers the textual content of the thread, including the mentions; and the Code scope, which covers the code snippets in the thread. API candidates are ranked according to their scores for each API mention in the thread. DATYS takes the top API candidate with a non-zero score as the mentioned API. If the leading API candidate has a zero score, DATYS considers the mention as an unknown API. Luong et al. built a ground truth dataset containing 807 Java API mentions from 380 threads in Stack Overflow.
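As an illustration, this final selection step can be sketched as follows (a minimal sketch under our reading of [6]; the function and variable names are our own, not DATYS's released code):

def resolve_mention(candidates, scores):
    """Sketch of DATYS's selection step: rank API candidates by their
    type-scoping scores and take the top one, treating a zero top score
    as a mention of an unknown API."""
    best = max(candidates, key=lambda c: scores[c])
    return best if scores[best] > 0 else None  # None marks an unknown API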

II-B CodeBERT

CodeBERT [9] was developed on top of a multi-layered attention-based Transformer model, BERT [10]. Owing to its effectiveness in learning contextual representations from massive unlabeled text with self-supervised objectives, BERT has been widely adopted to develop large pre-trained models. Building on the multi-layer Transformer [11], the CodeBERT developers adopted two approaches that differ from BERT to learn semantic connections between Natural Language (NL) and Programming Language (PL) more effectively.

First, the CodeBERT developers make use of both bimodal instances of NL-PL pairs (i.e., code snippets and function-level comments or documentation) and a large amount of available unimodal code. In addition, they pre-trained CodeBERT with a hybrid objective function that includes masked language modeling [10] and replaced token detection [12]. Incorporating unimodal code helps the replaced token detection task, which in turn produces better natural language by detecting plausible alternatives sampled from generators.

CodeBERT was trained on GitHub code repositories in six programming languages, with a single pre-trained model learned for all six languages and no explicit indicator marking which of the six languages an instance belongs to. CodeBERT was evaluated on two downstream tasks: natural language code search and code documentation generation. The study found that fine-tuning the parameters of CodeBERT obtained state-of-the-art results on both tasks.

III Approach Overview

III-A Task Definition

Our goal is to find Stack Overflow threads that mention a given API method (in this paper, we use the terms API, method, and API method interchangeably). Specifically, given an API method, we strive to find Stack Overflow threads containing words matching the simple name of the given API method. In Java, the simple name of an API method is the name of the method without the class and package names. For example, m is the simple name of the API method com.example.Class.m. We want to classify whether a thread containing the simple name m is actually relevant to the API method com.example.Class.m. In summary, the task is defined as: “For each API method in a set of given API methods, identify Stack Overflow threads that refer to it.”
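For concreteness, the pre-filtering implied by this definition can be sketched as follows (a minimal illustration; the helper names and the thread representation are our assumptions, not FACOS's actual code):

def simple_name(fqn: str) -> str:
    """Return the method name without package and class names,
    e.g., 'com.example.Class.m' -> 'm'."""
    return fqn.rsplit(".", 1)[-1]

def potential_threads(fqn: str, threads):
    """Keep only threads containing a word matching the simple name; each
    thread is assumed to expose its word tokens as thread['tokens']."""
    name = simple_name(fqn)
    return [thread for thread in threads if name in thread["tokens"]]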

III-B Architecture


Figure 1: The architecture of FACOS (Finding API Relevant Contents on Stack Overflow with Semantic and Syntactic Analysis)

The pipeline of FACOS is presented in Figure 1. It is divided into two main steps:

(1) Collecting various API-related resources from a given API method name; and (2) Recommending relevant threads using the collected API-related resources.

In step (1), FACOS finds Potential Threads on Stack Overflow using the simple name of the given API method as the query. Potential Threads are threads that have at least one word matching the simple name of the given API. The API method comment and implementation code are obtained directly from the source code repository of the given API. Finally, the API Candidates are obtained from a database of API methods; they are the API methods that have the same simple name as the given API method.

The objective of step (2) is to identify whether each Stack Overflow thread among the Potential Threads actually refers to the given API. FACOS has two components: the API relevance classifier and DATYS+. The API relevance classifier is designed to measure the relevance between a thread and an API method by capturing the semantic similarity between (1) the paragraphs and code snippets in the thread and (2) the API method comment and implementation code. It outputs a semantic relevance score representing the relevance it measures. In contrast, DATYS+ outputs a syntactic relevance score based on the existence of the terms of the given API's fully qualified name in different scopes of a Stack Overflow thread. For example, the API “A.B.C” has the terms “A”, “B”, and “C”. The last term, “C”, is the simple name of the API method. The second-to-last term, “B”, is the type of the API method. Both DATYS and DATYS+ use type scoping [6] to give a score based on the existence of the type of the API (i.e., “B” in the example) in different scopes of the Stack Overflow thread (code scope, text scope, etc.). However, the type scoping of DATYS+ is modified to suit the search task, as we describe in Section IV-A. It outputs a score indicating the syntactic relevance between the given API and the thread, which we call the DATYS+ score.

After step (2), each thread has a score indicating whether the thread refers to the given API method. This score, called the joint relevance score, combines the semantic relevance score and the DATYS+ score. Threads predicted to refer to the given API method are then returned to the user. We describe the FACOS components (i.e., DATYS+ and the API relevance classifier) in detail in Section IV.

IV FACOS

FACOS consists of two main components: DATYS+ and the API relevance classifier. DATYS+ takes as inputs the Potential Threads and API Candidates and outputs scores indicating its confidence that the given API is referred to in the threads (Section IV-A). Given the Potential Threads and the API method comment and implementation code, FACOS first converts them into API relevance embeddings (Section IV-B). The API relevance embeddings are input to the API relevance classifier, which outputs confidence scores indicating the likelihood that the given threads refer to the API (Section IV-C). Finally, the scores from DATYS+ and the API relevance classifier are combined into a joint relevance score, and threads with scores larger than a threshold are returned as the relevant threads (Section IV-D).

IV-A DATYS+

DATYS+ is an extension of DATYS. DATYS used regular expressions to capture the types of API method invocations available in code snippets of the thread. However, these regular expressions are limited and thus DATYS may miss some mentions in code snippets. To capture more types, DATYS+ modifies the type scoping algorithm by adding a new score.

Algorithm 1 shows how the modified type scoping works. Compared to DATYS's, DATYS+'s type scoping algorithm receives CodeSnippets as an additional input. CodeSnippets represents the content available in the code snippets of the Stack Overflow thread. The inputs of the original type scoping algorithm are also kept: ApiMention, PTypesList, APIMethodCandidate, and ThreadContent stand for the simple name of the given API, the list of possible types extracted from code snippets following the algorithm used by DATYS, the API Candidates, and the thread's textual content (i.e., title, text, tags), respectively. The three scopes used by DATYS are also used in DATYS+. In the Mention Scope (Lines 3-8), DATYS+ increases an API's score if its type appears within the API mention. In the Text Scope (Lines 10-13), DATYS+ increases an API's score if its type appears within the textual content of the thread. In the Code Scope (Lines 17-21), DATYS+ increases an API's score if its type matches the type of a method invocation or an imported type in the code snippets. Additionally, DATYS+ also looks at the raw content of the code snippets and increases the score of the corresponding API candidate if there are tokens in the code snippets that match the API type (Lines 14-16). This score helps capture occurrences of types that would be missed by the more precise matching used in DATYS. Thus, we call the scope of this score the Extended Code Scope.

After executing type scoping, DATYS+ returns scores for the API Candidates. The scores are then normalized to the range [0, 1] based on the minimum and maximum scores among the API Candidates. DATYS+ then takes the normalized score of the given API method and passes it to the next step.

Algorithm 1 Scoring an API Candidate with Type Scoping in DATYS+
Input: ApiMention, PTypesList, APIMethodCandidate, ThreadContent, CodeSnippets
Output: CandScore
1:  CandScore = 0
2:  CandType = getType(APIMethodCandidate)
3:  if hasPrefix(ApiMention) then
4:     Prefix = getPrefix(ApiMention)
5:     if endsWith(Prefix, CandType) then
6:        CandScore = CandScore + 1
7:     end if
8:  end if
9:  TextualTokens = tokenize(ThreadContent)
10: CodeTokens = tokenize(CodeSnippets)
11: if CandType in TextualTokens then
12:    CandScore = CandScore + 1
13: end if
14: if CandType in CodeTokens then
15:    CandScore = CandScore + 1
16: end if
17: for PType in PTypesList do
18:    if isSameType(PType, CandType) then
19:       CandScore = CandScore + 1
20:    end if
21: end for
22: return CandScore
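For readers who prefer an executable form, the following Python sketch mirrors Algorithm 1 together with the subsequent min-max normalization; the helpers get_type, tokenize, and is_same_type are assumed to behave as in DATYS [6], and this is not the released implementation:

def score_candidate(api_mention, p_types, candidate, thread_content, code_snippets):
    """Mirror of Algorithm 1: score one API candidate with type scoping."""
    score = 0
    cand_type = get_type(candidate)  # e.g., 'Class' for 'com.example.Class.m'
    # Mention Scope (Lines 3-8): the mention has a prefix ending with the type.
    if "." in api_mention:
        prefix = api_mention.rsplit(".", 1)[0]
        if prefix.endswith(cand_type):
            score += 1
    # Text Scope (Lines 10-13): the type appears in the thread's textual content.
    if cand_type in tokenize(thread_content):
        score += 1
    # Extended Code Scope (Lines 14-16): the type appears as a raw code token.
    if cand_type in tokenize(code_snippets):
        score += 1
    # Code Scope (Lines 17-21): the type matches a possible type from the snippets.
    for p_type in p_types:
        if is_same_type(p_type, cand_type):
            score += 1
    return score

def normalize(scores):
    """Min-max normalize the candidate scores to [0, 1]."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]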

IV-B API relevance embedding

We follow the process described in Figure 2 to build the API relevance embeddings. First, each thread in the Potential Threads needs to be converted into an embedding. A thread may contain m paragraphs and n code snippets. A paragraph is a piece of textual content in a Stack Overflow thread that is separated from other content by a newline character. A code snippet is a piece of code content in a Stack Overflow thread; it is typically enclosed by a starting tag <pre><code> and an ending tag </code></pre>. Each paragraph is paired with each code snippet to create a thread content pair, so a Stack Overflow thread yields m×n thread content pairs. A natural-programming language model, CodeBERT (https://github.com/microsoft/CodeBERT), is used to extract the semantic meaning of each thread content pair: it encodes the m×n thread content pairs into m×n thread embeddings. A thread embedding is the representation vector of a thread content pair created by CodeBERT's encoder. By converting the pairs from textual form to numerical vector form with a pre-trained CodeBERT model, the semantic relationship between the paragraphs and code snippets is extracted. Before the thread content pairs are fed into the CodeBERT encoder, each pair is pre-processed into the following format:

⟨CLS⟩ paragraph ⟨SEP⟩ code snippet ⟨EOS⟩

⟨CLS⟩ is the token marking the start of the pair, following the design of the RoBERTa model [13] on which CodeBERT is based. ⟨SEP⟩ is the token separating the paragraph from the code snippet, and ⟨EOS⟩ indicates the end of the pair. The maximum number of tokens in a pair fed into the CodeBERT encoder is 512. We set the number of tokens for a paragraph and a code snippet to 254 and 255 tokens, respectively; the two numbers add up to 512 when the three special tokens ⟨CLS⟩, ⟨SEP⟩, and ⟨EOS⟩ are counted. If the number of tokens in the paragraph is less than 254, padding tokens are added to reach 254 tokens. On the other hand, if the number of tokens in the paragraph is more than 254, we truncate the paragraph and take the first 254 tokens. The same process is applied to the code snippet with 255 tokens. The CodeBERT encoder receives the thread content pairs in this format as inputs and outputs embedding vectors. For a thread with m×n thread content pairs, m×n thread embedding vectors are created, each of length 768.
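A minimal sketch of this encoding step with the Hugging Face transformers library is given below; the pooling of the encoder output into a single 768-dimensional thread embedding (here, the hidden state of the ⟨CLS⟩ token) and the trailing padding are our assumptions:

import torch
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
encoder = RobertaModel.from_pretrained("microsoft/codebert-base")

def thread_embedding(paragraph: str, code_snippet: str) -> torch.Tensor:
    """Encode one (paragraph, code snippet) pair into a 768-d vector."""
    p_ids = tokenizer.encode(paragraph, add_special_tokens=False)[:254]
    c_ids = tokenizer.encode(code_snippet, add_special_tokens=False)[:255]
    # <CLS> paragraph <SEP> code snippet <EOS>; RoBERTa realizes these
    # special tokens as <s> and </s>.
    ids = ([tokenizer.cls_token_id] + p_ids + [tokenizer.sep_token_id]
           + c_ids + [tokenizer.eos_token_id])
    mask = [1] * len(ids) + [0] * (512 - len(ids))
    ids = ids + [tokenizer.pad_token_id] * (512 - len(ids))
    with torch.no_grad():
        out = encoder(input_ids=torch.tensor([ids]),
                      attention_mask=torch.tensor([mask]))
    return out.last_hidden_state[0, 0]  # hidden state of the first token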


Figure 2: How API relevance embeddings are created

Second, to build the API relevance embedding, the API comment and implementation code also need to be converted into an embedding. The API method comment is a piece of textual content that describes the functionality of the API method and how to use it. The API implementation code is the code inside the API method body that implements the described functionality. The API comment and implementation code are extracted from the Javadoc and the JAR files, respectively, and are pre-processed into the following format:

⟨CLS⟩ comment ⟨SEP⟩ implementation code ⟨EOS⟩

They are then transformed into a numerical representation vector via the CodeBERT encoder.

Finally, each thread embedding vector is concatenated with the method embedding vector into a single vector. We call this concatenated vector the API relevance embedding. In total, m×n API relevance embedding vectors are created.

IV-C API relevance classifier

The API relevance classifier is a binary classifier that utilizes a neural network with two fully connected layers to predict whether the API relevance embedding comes from a Stack Overflow thread that refers to the given API method.

The API relevance classifier has two modes of operation: training and deployment. In training mode, the API relevance embeddings are used to train the API relevance classifier; when there is an imbalance between positive and negative labels, the classifier up-samples the minority label. Whenever a thread refers to the given API method, all API relevance embeddings created from the thread are labeled positive. Otherwise, when the given API method is not referred to by the thread, every API relevance embedding of the thread gets a negative label. In deployment mode, the API relevance classifier produces probability scores for the m×n API relevance embeddings. These scores are averaged and passed to the next step; the averaged score indicates the likelihood that the thread refers to the given API.
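A sketch of the classifier, including the concatenation from Section IV-B, is given below; the hidden width and activation are our assumptions, as the paper only fixes the two fully connected layers and the 768-dimensional inputs:

import torch
import torch.nn as nn

class APIRelevanceClassifier(nn.Module):
    """Two fully connected layers over the API relevance embedding, i.e.,
    the 768-d thread embedding concatenated with the 768-d method embedding."""

    def __init__(self, in_dim: int = 2 * 768, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden),
                                 nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, thread_emb, method_emb):
        x = torch.cat([thread_emb, method_emb], dim=-1)  # API relevance embedding
        return torch.sigmoid(self.net(x)).squeeze(-1)    # relevance probability

In deployment mode, the m×n per-pair probabilities of a thread are simply averaged (e.g., probs.mean()) to obtain the thread-level score.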

IV-D Computing joint relevance score


Figure 3: Computing joint relevance score

We follow the process in Figure 3 to compute the joint relevance score. DATYS+ and the API relevance classifier output scores A and B, respectively. Both represent their confidence that the given API method is mentioned in the thread. The two scores are then combined into a joint relevance score C following this formula:

C=x×A+(1x)×BC{}=x\times A{}+(1-x)\times B{} (1)

The weighting factor x decides the contributions of the DATYS+ score and the API relevance classifier score to the joint relevance score C. The higher the value of x, the more the DATYS+ score contributes to the final joint relevance score. A, B, and x all range from 0 to 1. A thread is considered to refer to the given API if the joint relevance score C is larger than a threshold t; otherwise, the thread is considered not to refer to the given API. By default, t is set to 0.5.

The value of x is estimated from the training data. In detail, we let x increase gradually from 0 to 1 in steps of 0.1, giving eleven possible values of x: {0, 0.1, 0.2, ..., 0.9, 1.0}. The value of x giving the highest performance on the training data is then chosen.
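The combination and the grid search over x can be sketched as follows (f1_on is an assumed helper that evaluates the average F1-score on the training data for a given x and threshold t):

def joint_relevance_score(a: float, b: float, x: float) -> float:
    """Formula (1): combine the DATYS+ score a and the classifier score b."""
    return x * a + (1 - x) * b

def pick_weighting_factor(train_data, t: float = 0.5) -> float:
    """Grid-search x over {0, 0.1, ..., 1.0} on the training data."""
    candidates = [i / 10 for i in range(11)]
    return max(candidates, key=lambda x: f1_on(train_data, x, t))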

V Experiment

V-A Dataset and Experimental Settings

We utilize the dataset provided in the DATYS work [6] to evaluate both FACOS and DATYS. We split the 380 Stack Overflow threads into 253 training threads and 127 testing threads, a ratio of 2:1. The training threads are used to train the API relevance classifier, while the testing threads are used to evaluate FACOS and DATYS. Next, as described in Section IV-B, for each Stack Overflow thread in the training threads, we extract its thread embeddings and group them into a training set; similarly, the thread embeddings of the testing threads are grouped into a testing set. The numbers of API relevance embeddings are shown in Table I: 57,690 embeddings in the training set and 26,212 in the testing set.

TABLE I: Number of API relevance embeddings in each set
Set            # API relevance embeddings
Training set   57,690
Testing set    26,212

To generate API relevance embeddings for training the API relevance classifier, for each thread, if the given API appears in the thread, we generate API relevance embeddings from the thread contents and method contents as described in Section IV-B. These embeddings receive a positive label because they are created from the API that is referred to by the thread. To generate embeddings with a negative label for a thread, we find APIs that have the same simple name as the given API but are not mentioned in the thread. We then create API relevance embeddings from these APIs and label them as negative.
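A sketch of this labeling procedure is shown below; relevance_embeddings, with_simple_name, and mentioned_in are assumed helpers standing in for the steps of Sections III-B and IV-B:

def make_training_examples(thread, referenced_api, api_db):
    """Label API relevance embeddings: 1 for the API the thread refers to,
    0 for same-simple-name APIs that are not mentioned in the thread."""
    examples = [(emb, 1) for emb in relevance_embeddings(thread, referenced_api)]
    for api in api_db.with_simple_name(simple_name(referenced_api)):
        if api != referenced_api and not mentioned_in(thread, api):
            examples += [(emb, 0) for emb in relevance_embeddings(thread, api)]
    return examples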

To train the API relevance classifier, there are 344 APIs; these APIs are used to generate the training method embeddings. In the testing set, there are 181 APIs, which are used to generate the testing method embeddings. Table II shows the numbers of positive and negative API relevance embeddings in the training and testing sets. The number of negative API relevance embeddings is approximately four times the number of positive ones in the same set. Due to this imbalance, positive API relevance embeddings are randomly up-sampled to balance the two classes during the training of the API relevance classifier.

TABLE II: Number of positive and negative API relevance embeddings in each set
Positive Embeddings in Training set 9,934
Negative Embeddings in Training set 47,756
Positive Embeddings in Testing set 5,607
Negative Embeddings in Testing set 20,605

The API relevance classifier is trained for 6 epochs on the training data; after the first 6 epochs, the value of the loss function has largely converged. The learning rate is set to 10^-3.
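Under these settings, the training loop can be sketched as follows; the optimizer choice, batch size, and loss function are our assumptions, since the paper only fixes the 6 epochs, the 10^-3 learning rate, and the up-sampling of the minority class:

import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

# train_dataset is assumed to yield (thread_emb, method_emb, label) triples.
labels = torch.tensor([train_dataset[i][2] for i in range(len(train_dataset))])
pos, neg = float((labels == 1).sum()), float((labels == 0).sum())
# Up-sample the minority (positive) class so both classes are balanced.
weights = torch.tensor([neg / pos if y == 1 else 1.0 for y in labels])
sampler = WeightedRandomSampler(weights, num_samples=len(labels))
loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)

model = APIRelevanceClassifier()  # classifier from the Section IV-C sketch
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.BCELoss()
for epoch in range(6):
    for thread_emb, method_emb, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(thread_emb, method_emb), y.float())
        loss.backward()
        optimizer.step()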

V-B Metrics

To evaluate the proposed approach in identifying threads relevant to an API, we use three metrics: Precision, Recall, and F1-score. To calculate these metrics, True Positive, False Positive, and False Negative must be defined first. Our task focuses on finding threads that actually refer to a given API. A True Positive is a thread deemed relevant by the approach that is indeed relevant. A False Positive is a thread deemed relevant by the approach that is actually irrelevant. A False Negative is a thread deemed irrelevant by the approach that is actually relevant. The metrics are calculated using the following formulas:

Precision = TruePositive / (TruePositive + FalsePositive) (2)

Recall = TruePositive / (TruePositive + FalseNegative) (3)

F1-score = (2 × Precision × Recall) / (Precision + Recall) (4)

We measure the above scores of all given APIs in the testing set and report the averages of the scores.
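Equivalently, in code (a direct transcription of formulas (2)-(4); the per-API counts are assumed to have been computed beforehand):

def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute Precision, Recall, and F1-score from per-API counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1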

V-C Research Questions

Research Question 1: Can FACOS perform better than the baseline (DATYS)?

The baseline, DATYS, was designed for the task of API mention disambiguation. We adapt it to our task of finding threads that are relevant to an API: if DATYS finds that an API is mentioned in a thread, the thread is considered relevant to the API. To evaluate the improvement of FACOS over DATYS, we evaluate both on the testing data and compare them in terms of F1-score. We also analyze some cases that FACOS can resolve but DATYS cannot in Section VII-A.

Research Question 2: How well does each component of FACOS perform?

There are three possible variants of FACOS depending on which components it includes: (1) FACOS with only the API relevance classifier; (2) FACOS with only DATYS+; and (3) FACOS with both DATYS+ and the API relevance classifier. The API relevance classifier is a semantic-based algorithm, while DATYS+ is a syntactic-based algorithm. In this study, we aim to analyze the contribution of each component of FACOS. From the analysis, we would like to answer whether combining a semantic-based algorithm and a syntactic-based algorithm leads to better results than running either individually.

Research Question 3: How does the weighting factor affect the F1-score of the relevant thread classification? Does our strategy work well?

The weighting factor affects how well FACOS performs. We select the weighting factor based on the best performance on the training data, and we analyze whether this strategy also leads to the best performance on the testing data. To do so, we vary the weighting factor over the values {0, 0.1, 0.2, ..., 0.9, 1.0} on both the training data and the testing data, and check whether the value that performs best on the training data also performs best on the testing data.

VI Result

VI-A RQ1: FACOS Effectiveness

TABLE III: FACOS vs DATYS in terms of F1-score in the testing set
Approach   Avg. Precision   Avg. Recall   Avg. F1-score
DATYS      0.7441           0.7703        0.7340
FACOS      0.8697           0.9016        0.8730

TABLE IV: Contribution of FACOS Components
Components                                 Avg. Precision   Avg. Recall   Avg. F1-score
FACOS                                      0.8697           0.9016        0.8730
FACOS with only API relevance classifier   0.3408           0.3658        0.3408
FACOS with only DATYS+                     0.8620           0.8723        0.8530

Table III shows the performance of DATYS and FACOS in finding threads that are relevant to the given API. FACOS, in general, outperforms DATYS: on average, FACOS achieves an F1-score of 0.873, an improvement of 13.9% over DATYS. FACOS also beats DATYS in terms of precision and recall.

VI-B RQ2: Ablation Study

TABLE V: Average Precision, average Recall, and average F1-score on the testing set when the weighting factor varies
x     Avg. Precision   Avg. Recall   Avg. F1-score
0     0.3420           0.3658        0.3408
0.1   0.4485           0.4653        0.4441
0.2   0.8650           0.8925        0.8641
0.3   0.8697           0.9016        0.8730
0.4   0.8684           0.8934        0.8689
0.5   0.8588           0.8723        0.8097
0.6   0.8588           0.8723        0.8510
0.7   0.8606           0.8723        0.8521
0.8   0.8606           0.8723        0.8521
0.9   0.8606           0.8723        0.8521
1.0   0.8620           0.8723        0.8530

Table IV shows how well each component of FACOS performs. Since the “API relevance classifier-only” version of FACOS gives the worst results, the API relevance classifier may not be able to solve the task well on its own. Partly, this might be due to the limited amount of training data (i.e., only 253 training threads). The “DATYS+-only” version of FACOS performs much better than the “API relevance classifier-only” version; however, FACOS is still better than both. This demonstrates that both components are useful and essential.

VI-C RQ3: Effect of the weighting factor

Table V shows the performance of FACOS on the testing set when we vary the value of the weighting factor. Similarly, Table VI shows the performance of FACOS on the training set when we vary the value of the weighting factor. The highest F1-score on both the training and the testing set is achieved when the weighting factor equals 0.3. This demonstrates that our strategy of picking the weighting factor value that gives the best performance on the training data works well.

TABLE VI: Average Precision, average Recall, and average F1-score on the training set when the weighting factor varies
x     Avg. Precision   Avg. Recall   Avg. F1-score
0     0.6565           0.6685        0.6473
0.1   0.7111           0.7272        0.7080
0.2   0.8261           0.8506        0.8269
0.3   0.8328           0.8498        0.8329
0.4   0.8265           0.8410        0.8254
0.5   0.8159           0.8234        0.8097
0.6   0.8132           0.8234        0.8079
0.7   0.8132           0.8234        0.8079
0.8   0.8132           0.8234        0.8079
0.9   0.8132           0.8234        0.8079
1.0   0.8180           0.8191        0.8073

VII Discussion

VII-A Cases where FACOS outperforms DATYS

(1) The relevant thread does not contain the type name of the given API method

Figure 4 shows an example where the content of a relevant thread does not contain the type name of the given API method. The figure presents a paragraph and a code snippet from the Stack Overflow thread with ID 56135373 (https://stackoverflow.com/questions/56135373/). org.mockito.stubbing.OngoingStubbing.thenReturn (https://javadoc.io/doc/org.mockito/mockito-all/2.0.2-beta/org/mockito/stubbing/OngoingStubbing.html) is the API method the thread refers to.

From the content of the thread, it would be difficult to find the relevance between the text written in the paragraphs and the given API method (i.e., org.mockito.stubbing.OngoingStubbing.thenReturn), since the type (i.e., OngoingStubbing) does not appear in the thread. The text only expresses the user's view of the code snippet without describing the application or usage of the observed API method invocation (e.g., thenReturn in the code snippet of Figure 4). Sentences such as “This works like charm!” do not provide much information for identifying whether the observed API method refers to the given API.

Therefore, we leverage the content of the thread that might be relevant to the content of the API method. For example, the thread's title, shown in Figure 6(b), “Optional cannot be returned by stream() in Mockito Test classes”, relates to the comment of the given API, “Sets a return value to be returned when the method is called”, shown in Figure 6(a). Thanks to this feature, FACOS successfully considers this thread as relevant, while DATYS misses it.


Figure 4: Thread 56135373 on Stack Overflow, where the API is referred to by a code snippet of the thread

(2) The irrelevant thread contains the type name of the given API method

An example of this case is shown in Figure 5. In the thread (https://stackoverflow.com/questions/16919751/), the given API method is com.google.common.base.CharMatcher.is, and there is a word matching the simple name of the API method, is, which we highlight. Since the type of the given API method (i.e., CharMatcher) appears in both the textual content and the code snippet, DATYS mistakenly accepts the thread as referring to the given API. By leveraging the semantic knowledge learned by the API relevance classifier, FACOS is able to detect the irrelevance between the textual content and code snippet around the word is, on the one hand, and the API comment and implementation code, on the other. FACOS can thus conclude that the thread is irrelevant to the given API com.google.common.base.CharMatcher.is.


Figure 5: Thread 16919751 on Stack Overflow, where API com.google.common.base.CharMatcher.is is not referred to by the content of the thread.
Figure 6: The similarity in semantic meaning between the API comment of method org.mockito.stubbing.OngoingStubbing.thenReturn in Figure 6(a) and the textual content (i.e., the title) of thread 56135373 in Figure 6(b)

VII-B Case where FACOS fails to exclude irrelevant threads

Figure 7 shows a case where FACOS fails to exclude the thread (https://stackoverflow.com/questions/30127057/) from the relevant results for the given API method org.mockito.Mockito.mock. The issue occurs when there is an API method with functionality highly similar to that of the given API method; the two methods usually have the same simple name and highly similar functionality descriptions.

In Figure 7, PowerMock and Mockito perform similar functions, such as mocking (i.e., creating a version of a service in order to quickly and reliably run tests on that service; see https://circleci.com/blog/how-to-test-software-part-i-mocking-stubbing-and-contract-testing/). Since both of them have an API method whose simple name is mock, and both mock methods have the same API signature (i.e., parameters and return type), it would be easy to mistakenly recognize one as the other, even for a human. Figure 8 shows the comment of the API method org.mockito.Mockito.mock (https://javadoc.io/static/org.mockito/mockito-all/2.0.2-beta/org/mockito/Mockito.html), which is “Creates mock object of given class or interface”. Because of the similarity between this comment of the API method from the Mockito library and the title of thread 30127057 in Figure 7, FACOS wrongly recognizes that the simple API name mock in the thread refers to the given API method org.mockito.Mockito.mock. In fact, the simple API name mock refers to the one from the PowerMock library.


Figure 7: Thread 30127057 on Stack Overflow that FACOS falsely recognizes as referring to the API method org.mockito.Mockito.mock.


Figure 8: The API comment of method org.mockito.Mockito.mock

VII-C Adding Semantic Information for API Content Search

Searching for an API method and its relevant information cannot always rely on syntactic information. For some libraries, method calls are chained (as in Figure 4), so the type is not displayed, and a type system is required to determine it. Moreover, a type system is not 100 percent reliable due to missing import information, variable declarations, etc. On the other hand, relying only on the similarity of word representations provided by language models is also ineffective for API content search. API content search differs from tasks such as code search: in code search, because the terms “transfer” and “convert” are both related to currency, a method named “transferSGDToUSD” might be the answer to the query “convert SGD to USD.” However, searching for an API method requires an explicit mention of the API name in the query, which can be difficult as different APIs may share the same API method names. To solve the problem, solely using a model that only learns the syntactic form of textual and code content is insufficient. Therefore, we incorporated semantic information into API content search, which generated better results, as demonstrated by our experiments.

VII-D Threats to validity

A threat to internal validity relates to experimental bias. We obtained our dataset from another work and ran the baseline using the code provided by its authors. We checked our code multiple times to ensure that we did not make mistakes. We therefore believe there are minimal threats to internal validity. We also release the dataset and the code of our experiments for all to use.

A threat to external validity concerns whether the approach is applicable to platforms other than Stack Overflow. This experiment focuses on Stack Overflow and the Java programming language; therefore, it is uncertain whether FACOS can be applied to other discussion venues that also talk about API issues. A potential platform is Reddit, which has subreddits (i.e., communities gathering Reddit threads that discuss a particular topic) about programming languages and frameworks. Reddit threads have a title, textual content, and code content, the same as Stack Overflow; this similarity suggests that we can potentially apply FACOS to Reddit too, which we leave for future work. Regarding the threat that changing the target programming language would affect the accuracy of FACOS, although we focus only on Java, the features required by the approach (e.g., API comment/documentation, API implementation code, fully qualified name, class/type name, etc.) can be found in other programming languages. We therefore also leave studying whether FACOS works well on other programming languages as future work.

There is also a threat to construct validity concerning whether precision, recall, and F1-score are suitable evaluation metrics for our task. Our task is a classification task, and many works in software engineering have used precision, recall, and F1-score as evaluation metrics for classification tasks [6, 14, 15, 16]. Thus, we believe the threats are minimal.

VIII Related Work

VIII-A API Disambiguation

Many past works [14, 17, 18, 19, 20, 21, 22, 23] deal with API disambiguation. There are two main groups: informal text disambiguation [18, 19, 20, 21, 14] and code snippet disambiguation [22, 23, 17]. As suggested by their names, the first aims to disambiguate API mentions in textual content, while the second deals with disambiguating API mentions in code snippets.

For informal text disambiguation, several works utilize classical information retrieval approaches such as the Vector Space Model and Latent Semantic Indexing to disambiguate API mentions [18, 19, 20], while others use heuristics [21]. Bacchelli et al. [20] combined string matching and information retrieval algorithms to link emails to source code entities. Dagenais and Robillard [21] identified Java APIs mentioned in support channels (e.g., mailing lists, forums), documents, and code snippets. Ye et al. [14] worked on API disambiguation in the textual content of Stack Overflow threads by utilizing mention-mention similarity, mention-entry similarity, and a scope filter. Luong et al. [6] used type scoping to disambiguate API mentions in Stack Overflow threads.

Work on API disambiguation in Stack Overflow threads can be viewed as the other side of the coin of the task of finding threads relevant to an API: when we disambiguate an API mention in a thread, the disambiguated API is relevant to the thread, as the thread is talking about that API.

VIII-B API Resource Retrieval

Several studies have explored how to search for code for APIs and retrieve related information. Lv et al. [24] proposed CodeHow to deal with the lack of query understanding ability in existing tools. By expanding a user query with APIs, CodeHow can identify potential APIs and perform a code search based on the Extended Boolean model, which considers the impact of APIs on code search. Gu et al. [25] proposed DeepAPI to search for API usage sequences. Instead of assuming a bag of words, it learns the sequence of words in a query and the sequence of APIs associated with it. DeepAPI encodes a user query into a fixed-length context vector to generate an API sequence.

Other studies have also exploited different aspects of APIs and natural language to better retrieve the APIs and their related information. The techniques include: using global and local contexts of the queries [26], leveraging usage similarity for effective retrieval of API examples [27], employing word embeddings to document similarities for improved API retrieval [28], exploiting user knowledge [29], and task-API knowledge gap [30] during retrieval of semantically annotated API operations.

Wang et al. [31] developed a transformer-based framework for unifying code summarization and code search. Shahbazi et al. [32] proposed API2Com to improve automatically generated code comments by fetching API documentation. Alhamzeh et al. [33] built DistilBERT-based argumentation retrieval for answering comparative questions. Dibia [34] and Vale and Maia [35] developed a usable library for question answering with contextual query expansion and a question-answering assistant for software development using a transformer-based language model, respectively. Ciniselli et al. [36] performed an empirical study on the usage of Transformer models for code search and completion.

Our study also works on API resource retrieval. Specifically, we retrieve Stack Overflow threads that are relevant to a target API that we search for.

VIII-C Contribution of Stack Overflow to API documentation

Treude and Robillard [37] studied augmenting API documentation with insights from Stack Overflow. Parnin et al. [4] explored crowd documentation by examining the dynamics of API discussions on Stack Overflow, whereas Squire [38] dubbed Stack Overflow the social media for developer support in terms of the utilities it provides. Ahasanuzzaman et al. [39] and Baltes et al. [40] worked on classifying Stack Overflow posts on API issues and on contextual documentation referencing on Stack Overflow, respectively. The dichotomy of these studies is notable: while some research, such as Uddin and Robillard [41] and Zhang et al. [42], studies how API documentation fails through API misuse on Stack Overflow, other studies [43, 44] lean heavily on the crowdsourced knowledge of Stack Overflow for automated API documentation with tutorials. Similarly, crowdsourced knowledge was hailed by Gómez et al. [45] and Li et al. [46], who explored innovation diffusion through link sharing on Stack Overflow and web resource recommendation for hyperlinks in Stack Overflow, respectively.

Our work supports this line of study. FACOS can automatically find threads about a particular API on Stack Overflow, which can then be used to augment the corresponding API documentation.

VIII-D Word Sense and Entity Disambiguation Study

Several works focus on disambiguation tasks [21, 22, 47, 23]. We have also found a variety of word sense and entity disambiguation methods employed for different objectives [48, 49, 50, 51, 52, 53, 54, 55, 56]. These studies have addressed a myriad of problems by solving lexical disambiguation. The task of word sense disambiguation is to identify a target word's intended meaning by examining its context. Researchers have used word sense disambiguation to predict election results through enhanced sentiment analysis on Twitter data, and have associated place-name mentions in unstructured text with their actual references in geographic space. Other research has proposed unsupervised, knowledge-free, and interpretable word sense disambiguation for various applications. Researchers have also used this approach to add meaning to social network posts in the context of named entity recognition and disambiguation. Different entity-fishing tools have been developed to facilitate recognition and disambiguation services. In recent years, tools that allow researchers to recognize and extract named entities have become increasingly popular.

IX Conclusion and Future Work

We presented FACOS, an approach to search for Stack Overflow threads that refer to an API whose usage users or tools may want to find. We utilize the semantic and syntactic features of the paragraphs and code snippets in a thread to determine whether the thread is related to a given API. Our evaluation shows that FACOS improves over DATYS when both approaches are adapted to the search task. We added a weighting factor to balance the use of syntactic and semantic information for retrieving API mentions and related threads, and demonstrated its utility through an ablation study. In the future, we plan to improve our approach with a larger dataset containing more threads and APIs. We also plan to make our approach robust across more programming languages so that it can be more useful to developers.

Replication Package. The source code for FACOS is available at https://anonymous.4open.science/r/facos-E5C6/.

References

  • [1] M. P. Robillard, “What makes apis hard to learn? answers from developers,” IEEE Software, vol. 26, no. 6, pp. 27–34, 2009.
  • [2] P. K. Venkatesh, S. Wang, F. Zhang, Y. Zou, and A. E. Hassan, “What do client developers concern when using web apis? an empirical study on developer forums and stack overflow,” in 2016 IEEE International Conference on Web Services (ICWS).   IEEE, 2016, pp. 131–138.
  • [3] M. Linares-Vásquez, G. Bavota, M. Di Penta, R. Oliveto, and D. Poshyvanyk, “How do api changes trigger stack overflow discussions? a study on the android sdk,” in proceedings of the 22nd International Conference on Program Comprehension, 2014, pp. 83–94.
  • [4] C. Parnin, C. Treude, L. Grammel, and M.-A. Storey, “Crowd documentation: Exploring the coverage and the dynamics of api discussions on stack overflow,” Georgia Institute of Technology, Tech. Rep, vol. 11, 2012.
  • [5] G. Uddin and F. Khomh, “Automatic mining of opinions expressed about apis in stack overflow,” IEEE Transactions on Software Engineering, 2019.
  • [6] K. Luong, F. Thung, and D. Lo, “Disambiguating mentions of api methods in stack overflow via type scoping,” in ICSME.   IEEE, 2021.
  • [7] Q. Huang, X. Xia, Z. Xing, D. Lo, and X. Wang, “Api method recommendation without worrying about the task-api knowledge gap,” in ASE.   IEEE, 2018.
  • [8] M. M. Rahman, C. K. Roy, and D. Lo, “Rack: Automatic api recommendation using crowdsourced knowledge,” in SANER 2016, vol. 1.   IEEE, 2016.
  • [9] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang et al., “Codebert: A pre-trained model for programming and natural languages,” arXiv preprint arXiv:2002.08155, 2020.
  • [10] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
  • [11] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
  • [12] K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning, “Electra: Pre-training text encoders as discriminators rather than generators,” arXiv preprint arXiv:2003.10555, 2020.
  • [13] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
  • [14] D. Ye, L. Bao, Z. Xing, and S.-W. Lin, “Apireal: an api recognition and linking approach for online developer forums,” Empirical Software Engineering, vol. 23, no. 6, pp. 3129–3160, 2018.
  • [15] Q. Huang, E. Shihab, X. Xia, D. Lo, and S. Li, “Identifying self-admitted technical debt in open source projects using text mining,” Empirical Software Engineering, vol. 23, no. 1, pp. 418–451, 2018.
  • [16] G. A. A. Prana, C. Treude, F. Thung, T. Atapattu, and D. Lo, “Categorizing the content of github readme files,” Empirical Software Engineering, vol. 24, no. 3, pp. 1296–1327, 2019.
  • [17] C. K. Saifullah, M. Asaduzzaman, and C. K. Roy, “Learning from examples to find fully qualified names of api elements in code snippets,” in ASE.   IEEE, 2019.
  • [18] G. Antoniol, G. Canfora, G. Casazza, A. De Lucia, and E. Merlo, “Recovering traceability links between code and documentation,” TSE, vol. 28, no. 10, 2002.
  • [19] A. Marcus and J. I. Maletic, “Recovering documentation-to-source-code traceability links using latent semantic indexing,” in ICSE.   IEEE, 2003.
  • [20] A. Bacchelli, M. Lanza, and R. Robbes, “Linking e-mails and source code artifacts,” in ICSE, 2010.
  • [21] B. Dagenais and M. P. Robillard, “Recovering traceability links between an api and its learning resources,” in 2012 34th international conference on software engineering (icse).   IEEE, 2012, pp. 47–57.
  • [22] S. Subramanian, L. Inozemtseva, and R. Holmes, “Live api documentation,” in Proceedings of the 36th International Conference on Software Engineering, 2014, pp. 643–652.
  • [23] H. Phan, H. A. Nguyen, N. M. Tran, L. H. Truong, A. T. Nguyen, and T. N. Nguyen, “Statistical learning of api fully qualified names in code snippets of online forums,” in 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE).   IEEE, 2018, pp. 632–642.
  • [24] F. Lv, H. Zhang, J.-g. Lou, S. Wang, D. Zhang, and J. Zhao, “Codehow: Effective code search based on api understanding and extended boolean model (e),” in 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2015, pp. 260–270.
  • [25] X. Gu, H. Zhang, D. Zhang, and S. Kim, “Deep api learning,” in Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, ser. FSE 2016.   New York, NY, USA: Association for Computing Machinery, 2016, p. 631–642. [Online]. Available: https://doi.org/10.1145/2950290.2950334
  • [26] T. Nguyen, N. Tran, H. Phan, T. Nguyen, L. Truong, A. T. Nguyen, H. A. Nguyen, and T. N. Nguyen, “Complementing global and local contexts in representing api descriptions to improve api retrieval tasks,” in Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ser. ESEC/FSE 2018.   New York, NY, USA: Association for Computing Machinery, 2018, p. 551–562. [Online]. Available: https://doi.org/10.1145/3236024.3236036
  • [27] S. K. Bajracharya, J. Ossher, and C. V. Lopes, “Leveraging usage similarity for effective retrieval of examples in code repositories,” in Proceedings of the Eighteenth ACM SIGSOFT International Symposium on Foundations of Software Engineering, ser. FSE ’10.   New York, NY, USA: Association for Computing Machinery, 2010, p. 157–166. [Online]. Available: https://doi.org/10.1145/1882291.1882316
  • [28] X. Ye, H. Shen, X. Ma, R. Bunescu, and C. Liu, “From word embeddings to document similarities for improved information retrieval in software engineering,” in Proceedings of the 38th International Conference on Software Engineering, ser. ICSE ’16.   New York, NY, USA: Association for Computing Machinery, 2016, p. 404–415. [Online]. Available: https://doi.org/10.1145/2884781.2884862
  • [29] M. Rój, “Exploiting user knowledge during retrieval of semantically annotated api operations,” in Proceedings of the Fourth Workshop on Exploiting Semantic Annotations in Information Retrieval, ser. ESAIR ’11.   New York, NY, USA: Association for Computing Machinery, 2011, p. 21–22. [Online]. Available: https://doi.org/10.1145/2064713.2064726
  • [30] Q. Huang, X. Xia, Z. Xing, D. Lo, and X. Wang, “Api method recommendation without worrying about the task-api knowledge gap,” in 2018 33rd IEEE/ACM International Conference on Automated Software Engineering (ASE), 2018, pp. 293–304.
  • [31] W. Wang, Y. Zhang, Z. Zeng, and G. Xu, “Trans^3: A transformer-based framework for unifying code summarization and code search,” arXiv preprint arXiv:2003.03238, 2020.
  • [32] R. Shahbazi, R. Sharma, and F. H. Fard, “Api2com: On the improvement of automatically generated code comments using api documentations,” arXiv preprint arXiv:2103.10668, 2021.
  • [33] A. Alhamzeh, M. Bouhaouel, E. Egyed-Zsigmond, and J. Mitrović, “Distilbert-based argumentation retrieval for answering comparative questions,” Working Notes of CLEF, 2021.
  • [34] V. Dibia, “Neuralqa: A usable library for question answering (contextual query expansion+ bert) on large datasets,” arXiv preprint arXiv:2007.15211, 2020.
  • [35] L. d. N. Vale and M. d. A. Maia, “Towards a question answering assistant for software development using a transformer-based language model,” arXiv preprint arXiv:2103.09423, 2021.
  • [36] M. Ciniselli, N. Cooper, L. Pascarella, A. Mastropaolo, E. Aghajani, D. Poshyvanyk, M. Di Penta, and G. Bavota, “An empirical study on the usage of transformer models for code completion,” arXiv preprint arXiv:2108.01585, 2021.
  • [37] C. Treude and M. P. Robillard, “Augmenting api documentation with insights from stack overflow,” in 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE).   IEEE, 2016, pp. 392–403.
  • [38] M. Squire, ““Should we move to stack overflow?” measuring the utility of social media for developer support,” in 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, vol. 2.   IEEE, 2015, pp. 219–228.
  • [39] M. Ahasanuzzaman, M. Asaduzzaman, C. K. Roy, and K. A. Schneider, “Classifying stack overflow posts on api issues,” in 2018 IEEE 25th international conference on software analysis, evolution and reengineering (SANER).   IEEE, 2018, pp. 244–254.
  • [40] S. Baltes, C. Treude, and M. P. Robillard, “Contextual documentation referencing on stack overflow,” IEEE Transactions on Software Engineering, 2020.
  • [41] G. Uddin and M. P. Robillard, “How api documentation fails,” Ieee software, vol. 32, no. 4, pp. 68–75, 2015.
  • [42] T. Zhang, G. Upadhyaya, A. Reinhardt, H. Rajan, and M. Kim, “Are code examples on an online q&a forum reliable?: a study of api misuse on stack overflow,” in 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE).   IEEE, 2018, pp. 886–896.
  • [43] S. Meldrum, S. A. Licorish, and B. T. R. Savarimuthu, “Crowdsourced knowledge on stack overflow: A systematic mapping study,” in Proceedings of the 21st International Conference on Evaluation and Assessment in Software Engineering, 2017, pp. 180–185.
  • [44] A. M. Rocha and M. A. Maia, “Automated api documentation with tutorials generated from stack overflow,” in Proceedings of the 30th Brazilian Symposium on Software Engineering, 2016, pp. 33–42.
  • [45] C. Gómez, B. Cleary, and L. Singer, “A study of innovation diffusion through link sharing on stack overflow,” in 2013 10th Working Conference on Mining Software Repositories (MSR).   IEEE, 2013, pp. 81–84.
  • [46] J. Li, Z. Xing, D. Ye, and X. Zhao, “From discussion to wisdom: web resource recommendation for hyperlinks in stack overflow,” in Proceedings of the 31st Annual ACM Symposium on Applied Computing, 2016, pp. 1127–1133.
  • [47] A. T. Nguyen, P. C. Rigby, T. Nguyen, D. Palani, M. Karanfil, and T. N. Nguyen, “Statistical translation of english texts to api code templates,” in 2018 IEEE International Conference on Software Maintenance and Evolution (ICSME).   IEEE, 2018, pp. 194–205.
  • [48] T. Steiner, R. Verborgh, J. Gabarró Vallés, and R. Van de Walle, “Adding meaning to social network microposts via multiple named entity disambiguation apis and tracking their data provenance,” International Journal of Computer Information Systems and Industrial Management, vol. 5, pp. 69–78, 2013.
  • [49] M. Karimzadeh, W. Huang, S. Banerjee, J. O. Wallgrün, F. Hardisty, S. Pezanowski, P. Mitra, and A. M. MacEachren, “Geotxt: A web api to leverage place references in text,” in Proceedings of the 7th Workshop on Geographic Information Retrieval, ser. GIR ’13.   New York, NY, USA: Association for Computing Machinery, 2013, p. 72–73. [Online]. Available: https://doi.org/10.1145/2533888.2533942
  • [50] S. Patwardhan, S. Banerjee, and T. Pedersen, “Senserelate:: Targetword-a generalized framework for word sense disambiguation,” in ACL, vol. 2005, 2005, pp. 73–76.
  • [51] R. Jose and V. S. Chooralil, “Prediction of election result by enhanced sentiment analysis on twitter data using word sense disambiguation,” in 2015 International Conference on Control Communication Computing India (ICCC), 2015, pp. 638–641.
  • [52] L. Foppiano and L. Romary, “entity-fishing: a dariah entity recognition and disambiguation service,” Journal of the Japanese Association for Digital Humanities, vol. 5, no. 1, pp. 22–60, 2020.
  • [53] S. Zwicklbauer, C. Seifert, and M. Granitzer, “Do we need entity-centric knowledge bases for entity disambiguation?” in Proceedings of the 13th International Conference on Knowledge Management and Knowledge Technologies, 2013, pp. 1–8.
  • [54] A. Mandalios, K. Tzamaloukas, A. Chortaras, and G. Stamou, “Geek: Incremental graph-based entity disambiguation,” in LDOW@ WWW, 2018.
  • [55] D. Klein, K. Toutanova, H. T. Ilhan, S. D. Kamvar, and C. D. Manning, “Combining heterogeneous classifiers for word sense disambiguation,” in Proceedings of the ACL-02 workshop on Word sense disambiguation: recent successes and future directions, 2002, pp. 74–80.
  • [56] P. Chen, W. Ding, C. Bowes, and D. Brown, “A fully unsupervised word sense disambiguation method using dependency knowledge,” in Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2009, pp. 28–36.