Towards Understanding the Impacts of Textual Dissimilarity on Duplicate Bug Report Detection
Abstract
About 40% of software bug reports are duplicates of one another, posing a major overhead during software maintenance. Traditional techniques often focus on detecting duplicate bug reports that are textually similar. However, in bug tracking systems, many duplicate bug reports might not be textually similar, and the traditional techniques might fall short for them. In this paper, we conduct a large-scale empirical study to better understand the impacts of textual dissimilarity on the detection of duplicate bug reports. First, we collect a total of 92,854 bug reports from three open-source systems and construct two datasets containing textually similar and textually dissimilar duplicate bug reports. Then we determine the performance of three existing techniques in detecting duplicate bug reports and show that their performance degrades significantly for textually dissimilar duplicate reports. Second, we analyze the two groups of bug reports using a combination of descriptive analysis, word embedding visualization, and manual analysis. We find that textually dissimilar duplicate bug reports often miss important components (e.g., expected behaviors and steps to reproduce), which could lead to their textual differences and the poor performance of the existing techniques. Finally, we apply domain-specific embedding to the duplicate bug report detection problem, which shows mixed results. All these findings warrant further investigation and more effective solutions for detecting textually dissimilar duplicate bug reports.
Index Terms:
Software bug, duplicate bug detection, textual dissimilarity, word embedding, t-SNE
I Introduction
Software bugs are human-made errors in the code that prevent software from working correctly. During software maintenance, software bugs are submitted to a bug-tracking system as bug reports [7]. Hundreds of bugs are reported every day in large software systems (e.g., Mozilla, Eclipse) [7]. Duplicate bug reports occur when multiple persons submit bug reports for the same bug. Due to the asynchronous nature of bug report submission, traditional bug tracking systems (e.g., Bugzilla) cannot prevent duplicate bug reports. Thus, on average, 35.8%–41.6% of bug reports remain duplicates of one another in the bug tracking systems [70]. These duplicate bug reports pose a major overhead during software maintenance since they often cost valuable development time and resources [34].
Manually examining hundreds of bug reports for duplicates is neither feasible nor practical. One of the major challenges in detecting duplicate bug reports is their unstructured and ambiguous nature. Bug reports are written in natural language texts and thus may contain different words describing the same issue. The probability of two persons using the same text to explain the same issue is very low (e.g., 10%–15%) [23]. Given all these inherent challenges, automated detection of duplicate bug reports, also known as bug deduplication [52], has become an active research topic over the last decade.
To automate the process of detecting duplicate bug reports, researchers employ various methodologies, including Natural Language Processing (NLP) [56, 63, 58], Information Retrieval (IR) [67, 1, 4, 25], and Machine Learning (ML) [59, 57, 37, 30, 11]. However, they are far from perfect due to the complexity and ambiguity of natural language texts. NLP-based techniques might be limited in detecting duplicate reports when there is a textual mismatch between the reports [27]. IR-based approaches suffer from the vocabulary mismatch problem [15, 23], a typical phenomenon that stems from two textual documents describing the same concept with different vocabularies. On the other hand, ML-based approaches suffer from data imbalance problems, and a lack of generalizability [27, 59, 32].
Duplicate bug reports can be divided into two different categories: those that describe the same issue with similar texts and those that describe two similar issues using different texts (e.g., Fig. 1) [56]. The second category refers to duplicate bug reports that have the same underlying root cause but completely different writing styles. There could be instances where two bugs have different observed behaviors (OB) and steps to reproduce (S2R), but they share the same underlying cause. We call these types of duplicate bug reports textually dissimilar duplicate bug reports in this paper. According to our investigation, 19%–23% of the duplicate bug reports could be textually dissimilar.
Most of the existing NLP and IR-based techniques focus on detecting duplicate bug reports that use similar texts. Unlike NLP and IR-based techniques, ML-based techniques can capture the non-linear relationships between two items [47, 5, 10], and thus have the potential to tackle the challenge of textually dissimilar duplicate bug reports. However, they also suffer from poor outlier handling, class imbalance problem, and a lack of monitoring [19]. Thus, automated detection of duplicate bug reports still remains a highly challenging problem that warrants further investigation [22].
Fig. 1: An example of duplicate bug reports that describe the same issue using different texts.
In this paper, we conduct a large-scale empirical study to better understand the impacts of textual dissimilarity on the detection of duplicate bug reports. First, we collect a total of 92,854 bug reports from three large-scale software systems (Eclipse, Firefox, and Mobile) and empirically show how the existing techniques (BM25 [67], LDA+GloVe [3], and Siamese CNN [20]) perform poorly in detecting textually dissimilar duplicate bug reports. Second, we compare textually similar and textually dissimilar duplicate bug reports using a combination of quantitative and qualitative analyses. We found that the textually dissimilar duplicate bug reports differ from textually similar duplicate bug reports in terms of their underlying semantics and structures. For instance, textually dissimilar duplicate bug reports often have missing components or components (e.g., observed behaviors) that are written differently, which could lead to their overall textual differences. Finally, inspired by the previous successes of domain-specific embedding [16, 46], we apply domain-specific embedding to counteract the impact of textual dissimilarity in duplicate bug report detection. We thus answer three important research questions in our study as follows.
(a) RQ1: Does the performance of existing techniques differ significantly in duplicate bug report detection between textually similar and textually dissimilar duplicate bug reports?
We conducted experiments on our dataset using three existing techniques in duplicate bug report detection that employ Information Retrieval (IR), Topic Modeling, and Machine Learning (ML), respectively. We found that the performance (e.g., Recall-rate@100) of the existing techniques is higher (by 10.20%–18.45% for BM25 and 2.00%–6.49% for LDA+GloVe) in detecting textually similar duplicate bug reports than textually dissimilar ones. Our statistical tests (e.g., Wilcoxon Signed-Rank test [64], Cliff’s delta) also show that their performance is significantly lower for textually dissimilar duplicate reports. Although our findings reinforce a common belief about existing techniques, they also substantiate it with solid empirical evidence of the performance gap between the two categories of duplicate bug reports.
(b) RQ2: How do textually similar and textually dissimilar duplicate bug reports differ in their semantics and structures?
To investigate the differences between textually similar and textually dissimilar duplicate bug reports, we use three different analyses: descriptive analysis, embedding analysis, and manual analysis. We found negative skewness in the similarity scores of textually dissimilar duplicate bug reports, which indicates a low textual similarity between each pair. We also visualize their embedding matrices using t-SNE [61], a non-linear dimensionality reduction technique. The visualization shows that textually dissimilar duplicate bug reports occupy a lower region of the embedding space than textually similar duplicate bug reports, which indicates a lower pairwise distance within the embedding space (Fig. 4). Finally, our manual analysis suggests that textually dissimilar duplicate bug reports often miss important components (e.g., steps to reproduce) or have components (e.g., observed behaviors) that are written differently, which could lead to their overall textual differences.
(c) RQ3: Does domain-specific embedding help improve the detection of textually dissimilar duplicate bug reports?
Our experiments in RQ1 use a pre-trained, generic embedding model, GloVe [48], which might not effectively overcome the challenges of textual dissimilarity in duplicate bug report detection [21, 42, 49]. We thus retrain our selected DL-based technique (i.e., Siamese CNN [20]) with a domain-specific embedding model [16, 54, 46] and repeat our experiments. In particular, we analyze 92,854 bug reports from the Eclipse, Firefox, and Mobile systems to capture domain-specific embeddings, which are then used to retrain the DL-based technique. We also use oversampling to deal with the imbalanced data problem during model training [68]. We found that domain-specific embedding shows mixed results: it improves the detection of textually dissimilar duplicate bug reports but worsens the detection of textually similar duplicate bug reports.
II Study Methodology
II-A Construction of dataset
Dataset collection. We collect duplicate bug reports from three large-scale, open-source software systems – Eclipse, Firefox, and Mobile – using a popular bug tracking system, namely Bugzilla. Existing studies [27, 4, 59, 37] have frequently used these systems, which makes them suitable for our research. Besides, these systems stem from diverse application domains. Eclipse is a popular open-source Integrated Development Environment (IDE) written in Java. Firefox is a popular open-source web browser. While the above two systems are desktop-based applications, the remaining ones are mobile-based (e.g., Firefox for iOS, Focus for iOS, and GeckoView for Android). For the sake of brevity, we combine these three small systems and call them Mobile in the rest of the paper. Almost all existing studies on duplicate bug report detection used datasets dated before 2017 [27], which might not be an ideal representation of recent software bugs and issues [69]. To avoid the issue of concept drift [43], we choose bug reports from the last five years. We thus collected a total of 92,854 bug reports from the three systems that were submitted within the last five years (01-01-2017 to 01-01-2022).
For many duplicate bug reports, the master reports were created before our five-year window (Eclipse: 766, Firefox: 3,514, Mobile: 214). To make a complete dataset containing all the duplicate bug reports along with their master bug reports, we also retrieved these master reports separately. Table I summarizes our study dataset. We see that about 6.60%, 20.53%, and 10.56% of the submitted bug reports were duplicates in the Eclipse, Firefox, and Mobile systems, respectively.
Table I: Summary of our study dataset

| Dataset (2017 – 2022) | Eclipse | Firefox | Mobile |
|---|---|---|---|
| Whole Dataset | 49,244 | 38,290 | 5,320 |
| Total Duplicate | 3,248 | 7,859 | 562 |
| Duplicate Ratio | 6.60% | 20.53% | 10.56% |
| Experimental Dataset (BM25, LDA+GloVe) | | | |
| Textually Similar Duplicate | 679 | 1,414 | 122 |
| Textually Dissimilar Duplicate | 662 | 1,455 | 131 |
| Experimental Dataset (Siamese CNN) | | | |
| Training Set | 39,395 | 30,632 | 4,256 |
| Testing Set | 9,848 | 7,658 | 1,064 |
| Textually Similar Duplicate | 504 | 610 | 117 |
| Textually Dissimilar Duplicate | 497 | 734 | 89 |
Data cleaning and preprocessing. We capture four key fields from each bug report for our study: (a) bug id, which is unique for each bug report, (b) duplicate bug id, which points to the duplicate bug report, (c) title and description of the bug report, and (d) resolution, which indicates the duplicate status of a bug report.
We collect title and description from each bug report since they capture pertinent information for detecting duplicate bug reports. We clean and preprocess the title and description from each bug report using several steps as follows.
We apply standard natural language preprocessing to the title and description texts. First, we remove stopwords using a standard stopword list since they have little to no significance in capturing semantics. Then we perform token splitting along with the removal of punctuation marks, non-alphanumeric characters, numbers, HTML meta tags, and URLs. We also replace any non-alphanumeric characters with spaces and transform the text into lowercase [52]. Lastly, to transform each term into its base form, we apply lemmatization using the NLTK library in Python. As performed by an earlier work [52], we discard any description with fewer than 50 characters since such descriptions do not contain enough information to be meaningful. On the other hand, bug reports might have long description texts containing source code, lengthy stack traces, and log files, which could be noisy [52]. Hence, several previous studies [20, 52] selected bug reports containing at most 350 to 500 tokens. We experimented with 350, 500, and 1000 tokens and found that the model delivers the best performance with 500 tokens. Thus, we chose 500 as our token limit for the bug reports.
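As an illustration, a minimal preprocessing sketch along these lines might look as follows. It assumes NLTK's stopword list, Punkt tokenizer, and WordNet lemmatizer are available; the regular expressions and the `preprocess` function name are our own and only approximate the steps described above.

```python
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Requires: nltk.download('stopwords'), nltk.download('punkt'), nltk.download('wordnet')
STOPWORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess(text, max_tokens=500):
    """Clean a bug report's title + description and return at most max_tokens lemmatized tokens."""
    text = re.sub(r"<[^>]+>", " ", text)        # drop HTML meta tags
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"[^a-zA-Z\s]", " ", text)    # drop punctuation, numbers, non-alphanumeric chars
    tokens = word_tokenize(text.lower())
    tokens = [LEMMATIZER.lemmatize(t) for t in tokens if t not in STOPWORDS]
    return tokens[:max_tokens]                  # enforce the 500-token limit
```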
Construction of triplets. After cleaning and preprocessing the dataset, we construct triplets (b, b+, b-), where the existing techniques are supposed to detect b+. Here, b denotes the query bug report, b+ a duplicate bug report, and b- a non-duplicate bug report. We created these triplets inspired by an existing work [20]. In that work, the duplicate pairs (b, b+) were determined based on the bug-tracking systems, whereas the non-duplicates (b, b-) were randomly selected from the dataset. We follow the same process in our ground truth construction. The (b, b+) pairs were then used as the ground truth for evaluating the existing techniques in duplicate bug report detection.
Dataset preprocessing for ML-based approach.
To design an ML-based model (e.g., Siamese CNN) for duplicate bug report detection, pairwise bug reports containing texts and ground truth are required. Hence, we extract (b, b+) and (b, b-) pairs from the triplets above to generate our positive and negative samples, respectively. A similar process was adopted by the existing literature [20]. For training and testing, we split the whole dataset into an 80:20 ratio with random shuffling. Table I (bottom section) shows our Machine Learning models’ training and testing datasets from all three systems.
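As a rough illustration (not the authors' code), the pair construction and the 80:20 split could be sketched as follows; the data structures are hypothetical, and the shuffling simply mirrors the description above.

```python
import random

def triplets_to_pairs(triplets, seed=42):
    """Convert (b, b_plus, b_minus) triplets into labeled pairs and split them 80:20."""
    pairs = []
    for b, b_plus, b_minus in triplets:
        pairs.append(((b, b_plus), 1))    # duplicate pair -> positive sample
        pairs.append(((b, b_minus), 0))   # non-duplicate pair -> negative sample
    random.Random(seed).shuffle(pairs)
    cut = int(0.8 * len(pairs))
    return pairs[:cut], pairs[cut:]       # training set, testing set
```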
Table II: N-gram similarity score statistics (median, lower quartile Q1, upper quartile Q3) used to construct the textually similar and dissimilar subsets, and the resulting numbers of duplicate bug report pairs

| Dataset | Unigram Median | Unigram Q1 | Unigram Q3 | Bigram Median | Bigram Q1 | Bigram Q3 | Trigram Median | Trigram Q1 | Trigram Q3 | Textually Similar | Textually Dissimilar |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Eclipse | 0.0502 | 0.0361 | 0.0675 | 0.0261 | 0.0189 | 0.0368 | 0.0212 | 0.0156* | 0.0297* | 679 | 662 |
| Firefox | 0.0633 | 0.0483 | 0.0776 | 0.0298 | 0.0233 | 0.0355 | 0.0232 | 0.0179* | 0.0274* | 1414 | 1455 |
| Mobile | 0.0679 | 0.0488 | 0.0944 | 0.0451 | 0.0316 | 0.0639 | 0.0403 | 0.0283* | 0.0565* | 122 | 131 |
Constructing subsets of study datasets based on textual similarity. We divide our dataset into textually similar and textually dissimilar duplicate bug reports, which are essential for answering our research questions.
First, we store all duplicate bug reports as pairs in a separate dataset. Then, we collect N-grams (n = 1, 2, and 3) to compute the textual similarity [13] of each duplicate bug report pair. We use the cosine similarity metric [56] to calculate the textual similarity between two bug reports. After getting the similarity score between the duplicate pairs, we analyze their descriptive statistics to determine their median similarity score, lower quartile (25th percentile) value, and upper quartile (75th percentile) value. For duplicate pairs that have a similarity score less than the lower quartile value, we denote them as textually dissimilar duplicate bug reports.
On the other hand, for duplicate pairs that have a similarity score more than the upper quartile value, we denote them as textually similar duplicate bug reports. We repeat all these steps using Unigram, Bigram, and Trigram to collect the common set of textually similar and textually dissimilar duplicate bug reports across three trials for our experiment. Table II shows similarity scores used for various N-grams to construct our textually similar and dissimilar duplicate bug reports. Finally, we got two subsets for textually similar and textually dissimilar duplicate bug reports, respectively (Eclipse: 679 & 662, Firefox: 1414 & 1455, Mobile: 122 & 131).
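A rough sketch of this subset construction, assuming scikit-learn for n-gram vectorization and cosine similarity, is shown below; the final subsets are then the intersection of the selections across n = 1, 2, and 3, as described above.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def pair_similarities(pairs, n):
    """Cosine similarity of each duplicate pair using n-gram term vectors."""
    sims = []
    for a, b in pairs:                      # a, b: preprocessed report texts
        vec = CountVectorizer(ngram_range=(n, n)).fit([a, b])
        m = vec.transform([a, b])
        sims.append(cosine_similarity(m[0], m[1])[0, 0])
    return np.array(sims)

def split_by_quartiles(pairs, sims):
    """Pairs below the lower quartile are 'dissimilar'; pairs above the upper quartile are 'similar'."""
    q1, q3 = np.percentile(sims, 25), np.percentile(sims, 75)
    dissimilar = {i for i, s in enumerate(sims) if s < q1}
    similar = {i for i, s in enumerate(sims) if s > q3}
    return similar, dissimilar              # sets of pair indices; intersect across n = 1, 2, 3
```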
II-B Replication of existing techniques for experiments
To answer our first research question, we needed to replicate existing techniques on duplicate bug report detection. We thus select suitable representatives from the frequently used methodologies in duplicate bug report detection. In particular, we choose baseline methods from three frequently used methodologies – Information Retrieval, Topic Modeling, and Machine Learning. We select BM25 [67] from Information Retrieval, LDA+GloVe [3] from Topic-Modeling, and Siamese CNN [20] from Deep Learning for our experiment. Most of the recent models are based on these three primary approaches with incremental improvements [27, 8, 44]. We chose these baseline methods to determine the impact of textual dissimilarity on duplicate bug report detection without the effect of compounding factors (e.g., severity, priority, components, products, multimedia attachments).
We used the authors’ replication package for the LDA+GloVe model [3]. On the other hand, the replication packages of BM25 [67], and Siamese CNN [20] were not publicly available, and thus those techniques were carefully re-implemented by us based on the corresponding papers.
Information Retrieval (IR) relies on keyword overlaps between any two documents. We select a representative IR technique, namely BM25, with default parameters (k1=1.5, b=0.75) for retrieving duplicate bug reports. Yang et al. [67] first used BM25 for duplicate bug report detection. While BM25 is an established technique, it suffers from the Vocabulary Mismatch Problem (VMP) [15, 23]. Several studies [3, 45, 31] adopt Topic Modeling in duplicate bug report detection to overcome this challenge. Latent Dirichlet Allocation (LDA) is a Topic Modeling approach that has the potential to overcome the vocabulary mismatch problem [28]. We call this approach LDA+GloVe in our research. As done by the original work [3], we use LDA for topic-based clustering (topic number = 20), GloVe for the pre-trained word embedding (embedding dimension = 100), and a unified text similarity measure (cosine similarity and Euclidean metrics) for ranking the topmost similar bug reports against a given bug report.
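To illustrate the IR baseline, a minimal BM25 retrieval sketch is given below; it assumes the third-party rank_bm25 package (our choice, not necessarily the original implementation) and the default parameters mentioned above.

```python
from rank_bm25 import BM25Okapi  # assumed: pip install rank-bm25

def rank_candidates(query_tokens, corpus_tokens, top_k=100):
    """Rank candidate bug reports against a query report with BM25 (k1=1.5, b=0.75)."""
    bm25 = BM25Okapi(corpus_tokens, k1=1.5, b=0.75)
    scores = bm25.get_scores(query_tokens)
    ranked = sorted(range(len(corpus_tokens)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]  # indices of the top-K most similar candidate reports
```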
Unlike the above two techniques, Machine Learning based approaches might be able to find non-linear relationships between dependent and independent variables [47, 5, 10]. We thus implement a Siamese Convolutional Neural Network (CNN) for duplicate bug report detection by adapting an earlier study [20]. We train our Siamese CNN model using K-fold cross-validation (K=10) and batch gradient descent, adopting the original authors' batch sizes (512 for Eclipse and Firefox, 256 for Mobile), learning rate (0.001), and epoch count (12) to avoid model overfitting. Since the Mobile system has a small number of bug reports, a smaller batch size was chosen for it.
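For concreteness, a minimal Siamese CNN sketch in Keras is shown below; the layer sizes and the absolute-difference comparison are illustrative assumptions rather than the exact architecture of [20].

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

MAX_TOKENS = 500    # per-report token limit (Section II-A)
VOCAB_SIZE = 20000  # assumed vocabulary size
EMBED_DIM = 100     # GloVe embedding dimension

def build_encoder():
    inp = layers.Input(shape=(MAX_TOKENS,))
    x = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inp)  # can be initialized with GloVe weights
    x = layers.Conv1D(128, 5, activation="relu")(x)
    x = layers.GlobalMaxPooling1D()(x)
    x = layers.Dense(64, activation="relu")(x)
    return Model(inp, x, name="shared_encoder")

encoder = build_encoder()
left = layers.Input(shape=(MAX_TOKENS,))
right = layers.Input(shape=(MAX_TOKENS,))

# Both reports share one encoder; their encodings are compared via absolute difference.
diff = layers.Lambda(lambda t: tf.abs(t[0] - t[1]))([encoder(left), encoder(right)])
output = layers.Dense(1, activation="sigmoid")(diff)  # duplicate vs. non-duplicate

siamese = Model([left, right], output)
siamese.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                loss="binary_crossentropy",
                metrics=[tf.keras.metrics.AUC()])
# siamese.fit([left_tokens, right_tokens], labels, batch_size=512, epochs=12)
```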
II-C Performance evaluation
As we detected duplicate bug reports using Information Retrieval, Topic Modeling, and Machine Learning techniques in our experiments, they were evaluated using appropriate performance metrics from these domains. In the case of BM25 and LDA+GloVe, we find all the duplicate bug reports for a given query bug report. Then we used Recall-rate@K [27, 3, 41, 63], one of the most popular performance metrics, to evaluate the IR and Topic Modeling approaches. On the other hand, for the deep learning-based model (Siamese CNN), we used traditional metrics such as F1 score, AUC, recall, and precision [27]. We used different evaluation metrics based on the original works [67, 3, 20]. It should be noted that our main goal was to contrast the performance of existing techniques between two sets of duplicate bug reports rather than to compare the techniques.
Recall-rate@K: Recall-rate@K determines the percentage of bug reports for each of which the duplicate bug report is found within the top K positions [57].
$$\text{Recall-rate@K} = \frac{N_{detected}}{N_{total}} \quad (1)$$
Here, $N_{detected}$ is the number of bug reports for which the duplicate reports have been correctly detected within the top K results, and $N_{total}$ is the total number of bug reports. We used nine different values of K (K = 1, 5, 10, 20, 25, 30, 50, 75, 100) to calculate the results of our IR-based and Topic Modeling techniques (the BM25 and LDA+GloVe models).
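A small helper for this metric might look as follows; the dictionary-based inputs are hypothetical conveniences.

```python
def recall_rate_at_k(ranked, ground_truth, k):
    """ranked: query id -> ranked candidate ids; ground_truth: query id -> id of the true duplicate."""
    detected = sum(1 for q, candidates in ranked.items() if ground_truth[q] in candidates[:k])
    return 100.0 * detected / len(ranked)   # reported as a percentage
```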
Precision: Precision determines the percentage of bug reports for which duplicate bug reports are correctly detected. We calculate the precision of a technique as follows:
$$\text{Precision} = \frac{TP}{TP + FP} \quad (2)$$
Here TP = true positive and FP = false positive.
Recall: Recall determines the percentage of all duplicate bug reports that are correctly detected by a technique. We calculate the metrics as follows:
$$\text{Recall} = \frac{TP}{TP + FN} \quad (3)$$
Here, TP = true positive and FN = false negative.
F1-measure: While both precision and recall focus on a specific aspect of a technique’s effectiveness, F1-measure is a more comprehensive and effective method for evaluation. We take the harmonic mean of precision and recall to compute the F1-measure as follows:
$$\text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \quad (4)$$
AUC: The Receiver Operating Characteristic (ROC) curve plots the true-positive rate against the false-positive rate at various classification thresholds [33]. The Area Under the Curve (AUC) measures the fraction of the area that falls under the ROC curve [33]. The AUC score ranges between 0 and 1, with 1 indicating that the model can perfectly classify observations into classes. The positive and negative samples are frequently imbalanced in real data, such as ours. This imbalance significantly impacts precision and recall, whereas AUC is robust against data imbalance.
After conducting the experiments, we evaluated our BM25 and LDA+GloVe models using Recall-rate@K, as used by existing work [67, 3]. On the other hand, we have evaluated the Siamese CNN model, as was done by the original work [20], using the remaining performance metrics.
Table III: Recall-rate@K (%) of BM25 and LDA+GloVe on the whole dataset

| Dataset | Method | k=1 | k=5 | k=10 | k=100 |
|---|---|---|---|---|---|
| Eclipse | BM25 | 22.64 | 36.60 | 42.03 | 57.31 |
| Eclipse | LDA+GloVe | 0 | 5.5 | 10.5 | 16.0 |
| Firefox | BM25 | 16.11 | 27.98 | 33.64 | 52.68 |
| Firefox | LDA+GloVe | 0 | 8.5 | 10.5 | 16.5 |
| Mobile | BM25 | 17.25 | 28.95 | 35.38 | 57.60 |
| Mobile | LDA+GloVe | 0 | 2.5 | 7.0 | 20.0 |
Table IV: Recall-rate@K (%) for textually similar (Sim) and textually dissimilar (Dis) duplicate bug reports

| Dataset | Method | Sim k=1 | Sim k=5 | Sim k=10 | Sim k=100 | Dis k=1 | Dis k=5 | Dis k=10 | Dis k=100 |
|---|---|---|---|---|---|---|---|---|---|
| Eclipse | BM25 | 24.15 | 37.63 | 43.26 | 62.78 | 21.83 | 37.50 | 41.27 | 52.58 |
| Eclipse | LDA+GloVe | 0.00 | 10.00 | 13.50 | 20.50 | 0.00 | 6.50 | 11.50 | 18.50 |
| Firefox | BM25 | 20.72 | 34.49 | 39.92 | 57.58 | 11.64 | 20.82 | 26.56 | 46.72 |
| Firefox | LDA+GloVe | 0.00 | 6.00 | 8.50 | 14.49 | 0.00 | 2.00 | 5.00 | 8.00 |
| Mobile | BM25 | 22.00 | 44.00 | 48.00 | 78.00 | 15.73 | 28.09 | 35.96 | 59.55 |
| Mobile | LDA+GloVe | 0.00 | 4.00 | 7.50 | 20.50 | 0.00 | 2.50 | 4.50 | 15.00 |
III Study Findings
III-A RQ1: Does the performance of existing techniques differ significantly in duplicate bug report detection between textually similar and textually dissimilar duplicate bug reports?
We first evaluate the existing techniques against our whole dataset. Tables III and VIII summarize their performances.
From Table III, we see that the BM25 approach, on average, performs better on the Eclipse system than on the Firefox and Mobile systems. On the other hand, the performance of the LDA+GloVe model is approximately 38.36% lower than that of BM25 for Recall-rate@100. LDA is limited in modeling topic correlations [39], which could be crucial to duplicate bug report detection. Table VIII (top section) shows that the performance of our ML-based technique in detecting duplicate bug reports ranges from 61.41% to 84.80% in terms of AUC. Precision, Recall, and F1-measure are slightly higher for Eclipse than for Firefox. Firefox has a better AUC score than Eclipse, as the AUC metric is robust to imbalanced data [68]. Precision, Recall, F1-measure, and AUC scores are slightly higher for the Mobile dataset than for the other two systems. Overall, the ML-based approach delivers the highest performance in duplicate bug report detection.
While the above analysis focuses on the whole dataset, we also determine the performance gap of existing techniques between textually similar and textually dissimilar duplicate bug reports. Table IV shows that BM25 delivers a higher Recall-rate@K for textually similar duplicate bug reports than for dissimilar ones across all three systems - Eclipse, Firefox, and Mobile. For instance, with K=100, the difference between these two sets ranges from 10.20% to 18.45%. Fig. 2 further shows the performance difference between these two sets of bug reports for various K values. We see that the difference is noticeably higher for Firefox and Mobile systems.
On the other hand, for the LDA+GloVe model, the performance differences between textually similar and textually dissimilar duplicate bug reports are 2.00% for Eclipse, 6.49% for Firefox, and 5.5% for the Mobile system with K=100 (Table IV). The performance gap is higher for Firefox and Mobile systems in the same manner as BM25. One possible reason behind this could be the higher duplicate ratios in Firefox and Mobile systems (see Table I).
Table VIII shows the performance of our Machine Learning model - Siamese CNN - for both sets of duplicate bug reports across three systems. Here, the performance difference between textually similar and textually dissimilar duplicates is less apparent than that of the traditional methods above. Machine Learning models, especially deep learning models, can capture more contextual information beyond the syntax [40], which might explain the phenomenon. From Table VIII, we see that the performance difference is smaller for Eclipse than for the other two datasets. For instance, the AUC differences between textually similar and textually dissimilar duplicate bug reports are 9.57% for Eclipse, 9.38% for Firefox, and 2.97% for the Mobile dataset. In the case of the F1-measure, the performance difference for the Eclipse dataset is 10.92%, whereas, for the Firefox dataset, it is 8.48%. For Mobile data, the performance difference is 10.00%. We also note that the performance of our ML-based technique is lower for the two subsets of bug reports than for the whole dataset. As shown in Table I, both subsets contain a smaller number of duplicate bugs than the whole dataset. That is, the two experiments use the same trained model but are tested with different numbers of test samples, which might explain the finding.
We also perform statistical tests to determine the significance of the performance gap between textually similar and dissimilar duplicate bug reports (Table V). For each of the three systems, we first evaluate BM25 and LDA+GloVe using Recall-rate@K against textually similar and dissimilar duplicate bug reports, considering various K values (K = 1, 5, 10, 20, 25, 30, 50, 75, 100). Then we performed the Shapiro-Wilk normality test [55] to determine the distribution of each set. We got two non-normal distribution pairs (LDA+GloVe for Eclipse and Firefox) out of six sets of pairs. We then used appropriate significance and effect size tests to compare the two sets of Recall-rate@K values from textually similar and dissimilar duplicate bug reports: the paired t-test [36] as the parametric test for normally distributed pairs, and the Wilcoxon Signed-Rank test [64] as the non-parametric test otherwise. In both types of significance tests, the p-values were less than the threshold (0.05) except for two sets (BM25 in the Eclipse system and LDA+GloVe in the Mobile system). Thus, the null hypothesis can be rejected for all comparisons except BM25 in Eclipse and LDA+GloVe in Mobile. In other words, the performances of the BM25 and LDA+GloVe techniques are significantly different between textually similar and dissimilar duplicate bug reports.
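The test selection described above could be sketched with SciPy as follows; the effect size helpers are our own straightforward implementations, and which effect size accompanies which test is our assumption rather than a detail stated in the paper.

```python
import numpy as np
from scipy import stats

def cliffs_delta(a, b):
    """Effect size: proportion of (x, y) pairs with x > y minus those with x < y."""
    greater = sum(x > y for x in a for y in b)
    less = sum(x < y for x in a for y in b)
    return (greater - less) / (len(a) * len(b))

def cohens_d(a, b):
    """Cohen's d using the pooled standard deviation of the two sets."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    pooled = np.sqrt((a.std(ddof=1) ** 2 + b.std(ddof=1) ** 2) / 2)
    return (a.mean() - b.mean()) / pooled

def compare_recall_rates(similar_rr, dissimilar_rr, alpha=0.05):
    """similar_rr, dissimilar_rr: Recall-rate@K values for K = 1, 5, ..., 100."""
    normal = (stats.shapiro(similar_rr).pvalue > alpha and
              stats.shapiro(dissimilar_rr).pvalue > alpha)
    if normal:  # parametric: paired t-test with Cohen's d
        return ("paired t-test",
                stats.ttest_rel(similar_rr, dissimilar_rr).pvalue,
                cohens_d(similar_rr, dissimilar_rr))
    # non-parametric: Wilcoxon signed-rank with Cliff's delta
    return ("Wilcoxon signed-rank",
            stats.wilcoxon(similar_rr, dissimilar_rr).pvalue,
            cliffs_delta(similar_rr, dissimilar_rr))
```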
While the significance of a result indicates how likely it is to have occurred by chance, the effect size indicates the extent of the difference [53]. Our experiments found effect sizes ranging from medium to large (Table V). We see that the effect size of BM25 is large for the Firefox and Mobile systems and medium for Eclipse. On the other hand, the LDA+GloVe model has a medium to large effect size across the three systems. Thus, our results from the effect size tests reinforce the above finding from the significance tests. In other words, the existing techniques perform significantly worse in detecting textually dissimilar duplicate bug reports. Even though our findings mostly match natural intuition, we performed extensive experiments on three different systems using three different methodologies, which resulted in strong empirical evidence. Thus, we not only reinforce the existing belief about existing techniques on duplicate bug detection but also substantiate it with solid empirical evidence.
Summary of RQ1: The performance of existing techniques (e.g., BM25, LDA+GloVe) is significantly lower in detecting textually dissimilar duplicate bug reports than in detecting textually similar ones. Our finding also substantiates a common belief about existing techniques on duplicate bug report detection with solid empirical evidence.
Table V: Normality, significance, and effect size tests comparing Recall-rate@K between textually similar and dissimilar duplicate bug reports

| Dataset | Method | PD | p-value | Effect Size |
|---|---|---|---|---|
| Eclipse | BM25 | N | 0.2865 | Medium (0.52) |
| Eclipse | LDA+GloVe | NN | 0.0113* | Large (0.80) |
| Firefox | BM25 | N | 0.0329** | Large (1.10) |
| Firefox | LDA+GloVe | NN | 0.0103* | Medium (0.31) |
| Mobile | BM25 | N | 0.0586** | Large (0.96) |
| Mobile | LDA+GloVe | N | 0.1471 | Medium (0.72) |

PD=Probability Distribution, NN=Non-normal, N=Normal, *=Significant, **=Strongly Significant
Fig. 2: Recall-rate@K differences of BM25 between textually similar and textually dissimilar duplicate bug reports for various K values.
Table VI: Descriptive statistics of cosine similarity scores between duplicate bug report pairs

| Dataset | Subset | Skew | Kurt | Mean | Median | Std |
|---|---|---|---|---|---|---|
| Eclipse | Textually Similar | 0.64 | -0.83 | 0.09 | 0.08 | 0.02 |
| Eclipse | Textually Dissimilar | -0.70 | 0.03 | 0.02 | 0.02 | 0.01 |
| Firefox | Textually Similar | 0.89 | 0.37 | 0.09 | 0.09 | 0.01 |
| Firefox | Textually Dissimilar | -0.50 | -0.67 | 0.03 | 0.03 | 0.01 |
| Mobile | Textually Similar | -0.88 | -0.53 | 0.16 | 0.17 | 0.03 |
| Mobile | Textually Dissimilar | -0.38 | -0.58 | 0.03 | 0.03 | 0.01 |
III-B RQ2: How do textually similar and textually dissimilar duplicate bug reports differ in their semantics and structures?
In this research question, we investigate how textually dissimilar duplicate bug reports might be different from textually similar duplicate bug reports. We answer this question using three different analyses – descriptive analysis, embedding analysis, and manual analysis – as follows:
Fig. 3: Distribution of cosine similarity scores between textually similar and textually dissimilar duplicate bug report pairs.
Descriptive analysis. Descriptive analysis examines the data to describe and summarize its key characteristics. It helps reveal patterns or outliers that could guide further statistical analyses. After cleaning and preprocessing the bug reports from each subject system, we calculate the cosine similarity score between each pair of duplicate bug reports using their TF-IDF measures from the textually similar and dissimilar subsets. Then we perform descriptive analysis on these similarity scores and capture five different statistics: Skewness, Kurtosis, Mean, Median, and Standard Deviation. Fig. 3 and Table VI summarize our descriptive analysis for the Eclipse, Firefox, and Mobile systems.
Skewness is a measure of symmetry [26]. Distribution is symmetric if it looks the same on the left and right of the center point [26]. From Fig. 3, we find the scores of textually similar duplicate bug reports to be positively skewed for Eclipse and Firefox. The positive skewness indicates that a significant number of duplicate pairs are highly similar [26]. On the other hand, for the textually dissimilar dataset, we found negative skewness for all three datasets, indicating that the cosine similarity measures are very low for most of the textually dissimilar duplicate pairs [26].
Kurtosis measures whether a distribution is heavy-tailed or light-tailed relative to a normal distribution [26]. Distributions with positive kurtosis tend to have heavy tails or outliers, whereas distributions with negative kurtosis tend to have light tails [18]; a uniform distribution would be the extreme case of the latter. From Fig. 3 and Table VI, we note that textually dissimilar duplicate pairs have a negative kurtosis for the Firefox and Mobile systems, and the kurtosis for the Eclipse system is also close to zero. That is, the similarity scores from textually dissimilar duplicate pairs have a lighter tail than the normal distribution. In other words, their similarity scores are mostly centered around the mean value, which is itself low.
On the other hand, a similar conclusion can be made for the textually similar duplicate pairs of the Eclipse and Mobile systems. However, it should be noted that the Firefox system contains several times more bug reports than the Eclipse and Mobile systems. Two of the remaining statistics, mean and median, are also several times lower for textually dissimilar duplicate bug reports than for their counterparts.
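These five statistics can be computed directly with SciPy/NumPy, as in the brief sketch below; note that Fisher's definition of kurtosis (zero for a normal distribution) matches the interpretation above.

```python
import numpy as np
from scipy.stats import kurtosis, skew

def describe_similarities(sims):
    """Summary statistics of the pairwise cosine similarity scores of one subset."""
    return {
        "skew": skew(sims),
        "kurtosis": kurtosis(sims),   # Fisher definition: 0 for a normal distribution
        "mean": np.mean(sims),
        "median": np.median(sims),
        "std": np.std(sims),
    }
```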
Fig. 4: t-SNE visualization of the embeddings of textually similar and textually dissimilar duplicate bug report pairs.
Embedding analysis. Word embedding is a frequently used mechanism for detecting duplicate bug reports, representing words as semantically relevant dense real-valued vectors [38]. While our descriptive analysis above focuses on text-level similarity, we now perform embedding analysis to visualize the semantic differences between textually similar and textually dissimilar duplicate bug reports. We employ t-SNE to visualize high-dimensional data representing embedding vectors in lower dimensions [61]. It illustrates high-dimensional data (e.g., bug report embeddings) by projecting them into a two-dimensional space. The intra-cluster detail can be observed by measuring the pairwise distances in the higher and lower dimensions spaces [61]. We collect 100 random bug report pairs from both textually similar and dissimilar datasets. We generated each pair’s GloVe embeddings and performed t-SNE to visualize the embedding of each pair. We use the optimal perplexity and iterations (e.g., perplexity = 40, iterations = 7000) and default similarity metric (i.e., cosine similarity) for our visualization.
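A minimal visualization sketch with scikit-learn is shown below; report_vectors and labels are assumed to be prepared upstream (e.g., averaged GloVe vectors per report pair), and the n_iter argument may be named max_iter in newer scikit-learn releases.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(report_vectors, labels):
    """report_vectors: (N, 100) embedding matrix; labels: 0 = textually similar pair, 1 = dissimilar pair."""
    tsne = TSNE(n_components=2, perplexity=40, n_iter=7000,
                metric="cosine", init="random", random_state=42)
    points = tsne.fit_transform(np.asarray(report_vectors))
    labels = np.asarray(labels)
    for label, color in [(0, "tab:blue"), (1, "tab:orange")]:
        mask = labels == label
        plt.scatter(points[mask, 0], points[mask, 1], c=color, s=10)
    plt.show()
```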
From Fig. 4, we see the difference in embedding visualization of textually similar and dissimilar bug reports for all three datasets. In embedding space, textually dissimilar duplicate pairs (orange dots) are clustered in a different dimensional area than the textually similar pairs (blue dots). In all three systems, the embedding positions of textually dissimilar bug reports are in lower coordinates than that of textually similar duplicate bug reports. In particular, the changes are noticeable across the vertical dimension between textually similar and dissimilar duplicate bug reports for Firefox and Mobile. Interestingly, for Eclipse, we see that textually dissimilar duplicate bug reports are at different locations, even across the horizontal dimension. All these differences above suggest that the word semantics of duplicate pairs within the same dataset could be similar but very different from that of other datasets. In other words, the semantics of textually dissimilar duplicate bug reports are noticeably different from that of textually similar bug reports.
Manual Analysis. We randomly selected 100 duplicate bug report pairs (50 textually similar + 50 textually dissimilar) from each subject system. Then we manually analyzed 150 pairs of textually similar and 150 pairs of textually dissimilar duplicate bug reports. Ideally, each bug report should have three components: Expected Behaviour (EB), Observed Behaviour (OB), and Steps to Reproduce (S2R) [9]. These components have been used by previous research to reformulate queries during duplicate bug report detection using Information Retrieval [14]. Similarly, we make use of these components from each duplicate pair to understand how textually similar and textually dissimilar duplicate bug reports might differ from each other.
First, we go through the title and description of each bug report and detect the presence of EB, OB, and S2R components in each bug report. Then we analyze the prevalence ratios of these components in both textually similar and textually dissimilar duplicate bug reports. We also calculate the textual similarity between the two bug reports from each pair for each of the three components separately. As a part of manual analysis, we also look for shared terms, keywords, technologies, and overall literary analogies between a duplicate bug report and a master bug report. We spent a total of 25 hours on our manual analysis.
Table VII shows the prevalence ratios of all three components from both textually similar and textually dissimilar duplicate bug report pairs. We see that textually dissimilar duplicate bug reports have a higher percentage of missing components. For example, as shown in Table VII, 44.61% and 13% of textually dissimilar pairs do not contain any steps to reproduce (S2R) and expected behaviors (EB) in their bug reports, whereas such statistics are 13% and 1% respectively for the textually similar duplicate bug reports. Such higher ratios of missing components might explain the lower textual similarity between each duplicate pair of their bug reports.
Table VII also shows the component-level similarity between two bug reports from each duplicate pair. We see a lower component-level similarity for textually dissimilar duplicate bug reports. For example, on average, two bug reports from each of their pairs are only 38% and 57% similar when OB and EB components are considered, whereas such statistics are 83% and 78%, respectively, for the textually similar duplicate pairs. Thus, even at the component level, textually dissimilar duplicate bug reports displayed lower similarity ratios.
In other words, missing components and component-level differences might have led to their overall textual dissimilarity. We also record several qualitative insights during our manual analysis of textually similar and textually dissimilar duplicate bug reports. They are outlined as follows.
(a) Shared phrases. Textually similar duplicate bug reports have more shared phrases (e.g., bigrams, trigrams) than unique words (e.g., unigrams). For example, rather than the word “scroll”, phrases such as “horizontal scroll” and “horizontal scroll installation” are more prevalent in these reports.
(b) Components’ prevalence. Observed behaviors (OB) and expected behaviors (EB) are more prevalent in textually similar duplicate bug reports, which could lead to their increased textual similarity. We found up to 90% overall similarity between two duplicate reports from this category.
(c) Missing components. We observed a higher percentage of missing components in textually dissimilar duplicate bug reports than in textually similar ones. Bug reports with missing components are likely to have lower similarity scores. Although we notice minimal keyword overlaps, the root cause was mostly similar for both bugs from the same duplicate pair.
(d) Unique phrases. Textually dissimilar duplicate bug reports often use completely unique phrases, which could lead to their dissimilarity. We found that although the EB was somewhat similar, the OB and S2R components were written differently for the two bug reports of the same duplicate pair.
Table VII: Prevalence and similarity ratios of bug report components in textually similar and dissimilar duplicate pairs

| Type | EB | OB | S2R | Overall |
|---|---|---|---|---|
| Prevalence Ratio | | | | |
| Textually Similar | 99.00% | 100.00% | 87.00% | N/A |
| Textually Dissimilar | 87.05% | 100.00% | 55.39% | N/A |
| Similarity Ratio | | | | |
| Textually Similar | 78.00% | 83.00% | 55.00% | 90.00% |
| Textually Dissimilar | 56.84% | 38.20% | 31.65% | 21.58% |

EB=Expected behaviour, OB=Observed behaviour, S2R=Steps to reproduce
Table VIII: Performance (%) of the Siamese CNN model under different embedding and sampling configurations

| Setting | Eclipse AUC | Eclipse Recall | Eclipse Precision | Eclipse F1 | Firefox AUC | Firefox Recall | Firefox Precision | Firefox F1 | Mobile AUC | Mobile Recall | Mobile Precision | Mobile F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Pre-trained word embedding only | | | | | | | | | | | | |
| Whole Dataset | 61.41 | 93.00 | 92.00 | 92.49 | 64.70 | 82.00 | 78.00 | 79.95 | 84.80 | 94.00 | 93.00 | 93.49 |
| Textually Similar | 56.31 | 55.00 | 63.00 | 58.73 | 66.44 | 52.00 | 75.00 | 61.41 | 63.96 | 72.00 | 80.00 | 75.79 |
| Textually Dissimilar | 46.74 | 45.00 | 51.00 | 47.81 | 57.06 | 48.00 | 59.00 | 52.93 | 60.99 | 58.00 | 76.00 | 65.79 |
| Pre-trained word embedding + Oversampling | | | | | | | | | | | | |
| Textually Similar | 56.51 | 55.00 | 63.00 | 58.73 | 66.51 | 52.00 | 75.00 | 61.41 | 63.96 | 72.00 | 80.00 | 75.79 |
| Textually Dissimilar | 46.98 | 47.00 | 53.00 | 49.82 | 57.35 | 49.00 | 60.00 | 53.94 | 60.99 | 58.00 | 76.00 | 65.79 |
| Domain-specific word embedding + Oversampling | | | | | | | | | | | | |
| Textually Similar | 56.31 | 51.00 | 52.00 | 51.49 | 56.03 | 61.00 | 63.00 | 61.98 | 64.98 | 70.00 | 73.00 | 71.47 |
| Textually Dissimilar | 53.54 | 51.00 | 51.00 | 51.00 | 55.16 | 60.00 | 61.00 | 60.50 | 62.25 | 69.00 | 69.00 | 69.00 |
Summary of RQ2: Textually similar and textually dissimilar duplicate bug reports are different in terms of their descriptive statistics (e.g., skewness), underlying semantics (e.g., t-SNE clusters), and prevalence of structural components. In particular, textually dissimilar duplicate bug reports often miss important components such as expected behaviors (EB) or steps to reproduce (S2R), which could lead to their textual dissimilarity within each pair.
III-C RQ3: Does domain-specific embedding help improve the detection of textually dissimilar duplicate bug reports?
From RQ1 and RQ2, we see that textually similar and dissimilar duplicate bug reports could be different in their lexicon, underlying semantics, and structures. Our analysis above also shows that the performance gap between these two sets is the smallest when the Machine Learning approach is used for duplicate bug report detection (RQ1). Machine Learning approaches might be able to capture more contextual information than the other approaches, which could be useful for the detection task [47, 5, 10].
In RQ1, we used pre-trained word embeddings from GloVe to train our Siamese CNN model [20] for duplicate bug report detection. Pre-trained word embeddings have proven invaluable for improving the performance of various natural language understanding tasks (e.g., text classification [6], sentiment analysis [49]). However, GloVe has been pre-trained on natural language texts (e.g., Wikipedia) [48], which might not be relevant to the texts from bug reports. Thus, we use domain-specific embedding to retrain our Siamese CNN model. We set the maximum token (vocabulary) size to 20,000 and the embedding dimension to 100, similar to the parameters used in RQ1. We generate the embedding matrix with the Skip-gram algorithm [46], use the whole dataset of 92,854 bug reports, and apply the same Siamese CNN architecture as used in RQ1 [20].
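A sketch of this domain-specific embedding step is given below, assuming gensim's Word2Vec (sg=1 selects Skip-gram); tokenized_reports and word_index are hypothetical names for the preprocessed corpus and the tokenizer's word-to-index mapping from RQ1.

```python
import numpy as np
from gensim.models import Word2Vec

# tokenized_reports: list of token lists from all 92,854 preprocessed bug reports (prepared upstream)
w2v = Word2Vec(sentences=tokenized_reports, vector_size=100, window=5,
               min_count=2, sg=1, workers=4, epochs=10)   # sg=1 -> Skip-gram

# Build the embedding matrix for the 20,000-token vocabulary used by the Siamese CNN.
VOCAB_SIZE, EMBED_DIM = 20000, 100
embedding_matrix = np.zeros((VOCAB_SIZE, EMBED_DIM))
for word, idx in word_index.items():          # word_index: word -> integer id (hypothetical)
    if idx < VOCAB_SIZE and word in w2v.wv:
        embedding_matrix[idx] = w2v.wv[word]  # rows for out-of-vocabulary words stay zero
```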
Datasets constructed from bug-tracking systems are often heavily imbalanced; the number of duplicate bug reports is considerably smaller than that of non-duplicate bug reports [1]. Hence, we use oversampling [68] to handle the data imbalance problem during model training. To the best of our knowledge, the original work [20] did not use sampling in their Siamese CNN model. However, for an in-depth investigation, we replicate another variant of the original DL-based model [20] that applies oversampling. Table VIII summarizes the experimental results of our DL-based model for three different scenarios: pre-trained embedding only (original work [20]), pre-trained embedding + oversampling, and domain-specific embedding + oversampling. When comparing the last two scenarios (pre-trained embedding + oversampling vs. domain-specific embedding + oversampling), we see the noticeable impact of domain-specific embeddings on duplicate bug report detection.
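The oversampling step could be as simple as the following sketch, which randomly duplicates minority-class (duplicate) pairs until the two classes are balanced; it stands in for whichever oversampling routine was actually used.

```python
import numpy as np

def oversample_pairs(pairs, labels, seed=42):
    """Randomly duplicate minority-class samples until both classes have equal size."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    minority = np.flatnonzero(labels == 1)    # duplicate pairs
    majority = np.flatnonzero(labels == 0)    # non-duplicate pairs
    extra = rng.choice(minority, size=max(0, len(majority) - len(minority)), replace=True)
    keep = np.concatenate([majority, minority, extra])
    rng.shuffle(keep)
    return [pairs[i] for i in keep], labels[keep]
```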
From Table VIII, we see that, in terms of F1-measure, the model's performance with textually dissimilar duplicate bug reports has increased by 1.18% for Eclipse, 6.56% for Firefox, and 3.21% for Mobile. Furthermore, the AUC improved by 6.56% for Eclipse and remained comparable for the other two systems. Thus, domain-specific embeddings have a positive impact on detecting textually dissimilar duplicate bug reports. However, they have mostly negative impacts, except in a few cases, on detecting textually similar bug reports. From Table VIII, we see that the model's F1-measure decreased by 7.24% for Eclipse and 4.32% for the Mobile system. Furthermore, the AUC decreased by 10.48% for the Firefox system. Thus, while domain-specific embeddings have the potential to tackle the challenge of textual dissimilarity, they have a mixed impact on detecting duplicate bug reports.
Summary of RQ3: The use of domain-specific embeddings (e.g., trained on bug reports) improves our model's performance for textually dissimilar duplicate bug reports (e.g., up to 6.56% in F1-measure for Firefox). However, these embeddings have either negligible or negative impacts on detecting textually similar duplicate bug reports.
IV Threats to Validity
We identify a few threats to the validity of our findings. In this section, we discuss these threats and the necessary steps taken to mitigate them as follows.
Threats to internal validity relate to experimental errors and human biases [60]. Traditional bug tracking systems (e.g., Bugzilla) have thousands of reports whose quality cannot be guaranteed, which could be a source of threat. Bug reports often contain poor, insufficient, missing, or even inaccurate information [27]. To address the issue, we apply standard natural language preprocessing and a token threshold to the bug reports and also check for missing features in each report. Another potential source of threat could be the replication and reproduction of existing work. The replication packages were unavailable for the BM25 and Siamese CNN models, and we had to re-implement them. However, we did so carefully using standard libraries and the corresponding papers, tuned the parameters, and reported their best results.
We use TF-IDF and cosine similarity to determine the textual similarity between any two duplicate bug reports. TF-IDF and cosine similarity have been frequently used to determine the textual similarity between two documents for the last 50 years [35]. Besides, we also used N-gram-based similarity and quartile analysis to systematically separate the textually similar and textually dissimilar duplicate bug reports (Section II), and we report the detailed similarity measures for replication (see Table II). Thus, the threats concerning similarity calculation and the construction of the two bug report groups might be mitigated.
Threats to conclusion validity. The observations from our study and the conclusions we drew from them could be a source of threat to conclusion validity [24]. In this research, we answer three research questions using 92,854 bug reports from three different subject systems and re-implement three existing techniques. We use appropriate statistical tests (e.g., Wilcoxon Signed-Rank) and report the test details (e.g., p-value, Cliff's delta) before drawing any conclusion. Thus, such threats might also be mitigated.
Threats to construct validity relate to the use of appropriate performance metrics. We evaluate BM25 and LDA+GloVE techniques with Recall-rate@K and Machine Learning model with AUC, precision, recall, and F1-measure, which have been used in their corresponding papers and the relevant literature [27]. Thus, such threats might also be mitigated.
V Related Work
Information Retrieval (IR). IR approaches rely on the textual overlap between query bug reports and candidate bug reports for duplicate detection. Runeson et al. [56] first use a simple approach namely Bag of Words (BOW), to tally the frequency of words and then use BOW-model to detect duplicate bug reports. They determine the similarity between two bug reports using cosine, Jaccard, and dice similarity measures. However, the BOW-based approach could be biased towards large documents and might not be able to capture the semantics of a bug report precisely [65]. Wang et al. [63] later improved this technique using TF-IDF [2] and quantify the similarity of two document vectors.
Later studies adopted BM25 [50], a traditional IR model, for duplicate bug report detection. Aggarwal et al. [1] demonstrate that BM25F, an improvement over BM25, is more suitable for weighting words in diverse domains. Like us, they also use domain-specific, categorical, and textual features. Another study [58] uses n-gram models for textual similarity calculation during duplicate bug report detection; in particular, it focuses on character-level language models rather than word-level ones. We also use n-gram-based similarity to separate textually similar and textually dissimilar duplicate bug reports.
Chaparro et al. [14] use three strategies to reformulate a query bug report and use Information Retrieval to detect duplicate bug reports. Later, Cooper et al. [17] propose a duplicate bug report detection for video-based bug reports where they make use of text retrieval and computer vision methods. Since we focus on textual bug reports, their work might not be a great fit. In our research, we thus use BM25, a popular IR baseline, to investigate the impacts of textual dissimilarity on duplicate bug report detection.
Topic Modeling. IR-based approaches might suffer from Vocabulary Mismatch Problems [23]. Topic Modeling has the potential to tackle such problems concerning textual similarity calculation [62]. Alipour et al. [4] employ the Latent Dirichlet allocation (LDA) model to capture contextual information from history (e.g., prior knowledge on software quality) and leverage the information in duplicate bug report detection. Aggarwal et al. [1] capture domain-specific contextual information to improve duplicate bug report detection.
Nguyen et al. [45] combine IR and topic-based features to improve duplicate bug report detection. Recently, Akilan et al. [3] proposed a hybrid model that combines Topic Modeling (e.g., LDA) with pre-trained word embedding (e.g., GloVe) for duplicate bug report detection. We replicate their technique carefully for our experiments, and detailed results can be found in Table III. In another study, Budhiraja and Shrivastava [12] combine Latent Dirichlet Allocation (LDA) and domain-specific word embeddings. Similarly, we leverage domain-specific embedding to counteract the impact of textual dissimilarity in duplicate bug report detection (RQ3).
Machine Learning and Deep Learning. Unlike the above two methodologies, Machine Learning can detect non-linear relationships between any two bug reports for duplicate detection [47, 5, 10]. Sun et al. [57] first used a Support Vector Machine (SVM) to design a discriminative model for detecting duplicate bug reports. However, their approach lacks rigorous validation. Klein et al. [37] design several models using K-NN, Linear SVM, RBF, Decision Tree, Random Forest, and Naive Bayes to classify duplicate bug reports.
Deshmukh et al. [20] used the Siamese variations of CNN and RNN to design a deep-learning model for duplicate bug report detection. We replicate their work for our experiments. Rocha and Carvalho [51] incorporate the attention mechanism into the Siamese network for semantic and context-based embedding. Xie et al. [66] propose an architecture, namely DBR-CNN, where CNN is used to encode the textual data and logistic regression to classify each pair of bug reports as either duplicate or non-duplicate. Haering et al. [29] employ DistilBERT, a context-sensitive embedding technique using BERT and deep matching for duplicate bug report detection.
To summarize, we replicate three existing techniques on duplicate bug report detection from Information Retrieval, Topic Modeling, and Deep Learning. Then we conduct experiments using a total of 92K bug reports to better understand the impacts of textual dissimilarity on duplicate bug report detection. To the best of our knowledge, this is the first attempt to comprehensively understand the impacts of textual dissimilarity on duplicate bug detection, which makes our work novel.
VI Conclusion & Future Work
Automated detection of duplicate bug reports has been an active research topic for over a decade. However, existing approaches might not be sufficient to detect textually dissimilar but duplicate bug reports. In this paper, we thus perform a large-scale empirical study using 92K bug reports from three open-source systems to better understand the challenges of textual dissimilarity in duplicate bug report detection. First, we empirically demonstrate that existing techniques perform poorly in detecting textually dissimilar duplicate bug reports. Second, we found that textually dissimilar duplicates often miss important components (e.g., steps to reproduce), which could lead to their textual dissimilarity within the same pair. Finally, inspired by the earlier findings, we apply domain-specific embedding to duplicate bug report detection, which provided mixed results. All these findings above warrant further investigation and more effective solutions for detecting textually dissimilar duplicate bug reports. Future work can focus on complementing these reports with relevant information from other sources (e.g., version control history).
References
- Aggarwal et al. [2017] K. Aggarwal, F. Timbers, T. Rutgers, A. Hindle, E. Stroulia, and R. Greiner. Detecting duplicate bug reports with software engineering domain knowledge. Journal of Software: Evolution and Process, 29(3):e1821, 2017.
- Aizawa [2003] A. Aizawa. An information-theoretic perspective of tf–idf measures. Information Processing and Management, 39(1):45–65, 2003.
- Akilan et al. [2020] T. Akilan, D. Shah, N. Patel, and R. Mehta. Fast detection of duplicate bug reports using lda-based topic modeling and classification. In Proc. SMC, pages 1622–1629, 2020.
- Alipour et al. [2013] A. Alipour, A. Hindle, and E. Stroulia. A contextual approach towards more accurate duplicate bug report detection. In MSR, pages 183–192, 2013.
- Almeida [2002] J. S. Almeida. Predictive non-linear modeling of complex data by artificial neural networks. Current opinion in biotechnology, 13(1):72–76, 2002.
- Alwehaibi and Roy [2018] A. Alwehaibi and K. Roy. Comparison of pre-trained word vectors for arabic text classification using deep learning approach. In ICMLA, pages 1471–1474, 2018.
- Anvik et al. [2006] J. Anvik, L. Hiew, and G. C. Murphy. Who should fix this bug? In Proc. ICSE, pages 361–370, 2006.
- Bansal and Rohil [2021] K. Bansal and H. Rohil. Literature review of finding duplicate bugs in open source systems. In CCICT, pages 389–396, 2021.
- Bettenburg et al. [2008] N. Bettenburg, S. Just, A. Schröter, C. Weiss, R. Premraj, and T. Zimmermann. What makes a good bug report? In Proc. SIGSOFT/FSE, pages 308–318, 2008.
- Bitvai and Cohn [2015] Z. Bitvai and T. Cohn. Non-linear text regression with a deep convolutional neural network. In Proc. ACL, pages 180–185, 2015.
- Budhiraja et al. [2018] A. Budhiraja, K. Dutta, R. Reddy, and M. Shrivastava. Dwen: deep word embedding network for duplicate bug report detection in software repositories. In Proc. ICSE, pages 193–194, 2018.
- Budhiraja and Shrivastava [2018] R. Reddy Budhiraja and M. Shrivastava. Lwe: Lda refined word embeddings for duplicate bug report detection. In Proc. ICSE, pages 165–166, 2018.
- Buscaldi et al. [2012] D. Buscaldi, R. Tournier, N. Aussenac-Gilles, and J. Mothe. Irit: Textual similarity combining conceptual similarity with an n-gram comparison method. In SemEval, pages 552–556, 2012.
- Chaparro et al. [2019] O. Chaparro, J. M. Florez, U. Singh, and A. Marcus. Reformulating queries for duplicate bug report detection. In SANER, pages 218–229, 2019.
- Chawla and Singh [2013] I. Chawla and S. K. Singh. Performance evaluation of vsm and lsi models to determine bug reports similarity. In Proc. IC3, pages 375–380, 2013.
- Chen et al. [2021] T. L. Chen, M. Emerling, G. R. Chaudhari, Y. R. Chillakuru, Y. Seo, T. H. Vu, and J. H. Sohn. Domain specific word embeddings for natural language processing in radiology. Journal of biomedical informatics, 113:103665, 2021.
- Cooper et al. [2021] N. Cooper, C. Bernal-Cárdenas, O. Chaparro, K. Moran, and D. Poshyvanyk. It takes two to tango: Combining visual and textual information for detecting duplicate video-based bug reports. In ICSE, pages 957–969, 2021.
- DeCarlo [1997] L. T. DeCarlo. On the meaning and use of kurtosis. Psychological methods, 2(3):292, 1997.
- Deng and Liu [2018] L. Deng and Y. Liu (Eds.). Deep learning in natural language processing. Springer, 2018.
- Deshmukh et al. [2017] J. Deshmukh, K. M. Annervaz, S. Podder, S. Sengupta, and N. Dubash. Towards accurate duplicate bug retrieval using deep learning techniques. In Proc. ICSME, pages 115–124, 2017.
- Efstathiou et al. [2018] V. Efstathiou, C. Chatzilenas, and D. Spinellis. Word embeddings for the software engineering domain. In Proc. MSR, pages 38–41, 2018.
- Feng et al. [2013] L. Feng, L. Song, C. Sha, and X. Gong. Practical duplicate bug reports detection in a large web-based development community. In Asia-Pacific Web Conference, pages 709–720, 2013.
- Furnas et al. [1987] G. W. Furnas, T. K. Landauer, L. M. Gomez, and S. T. Dumais. The vocabulary problem in human-system communication. Commun. ACM, 30(11):964–971, 1987.
- García-Pérez [2012] M. A. García-Pérez. Statistical conclusion validity: Some common threats and simple remedies. Frontiers in psychology, 3:325, 2012.
- Gopalan and Krishna [2014] R. P. Gopalan and A. Krishna. Duplicate bug report detection using clustering. In ASWEC, pages 104–109, 2014.
- Groeneveld and Meeden [1984] R. A. Groeneveld and G. Meeden. Measuring skewness and kurtosis. Journal of the Royal Statistical Society Series D (The Statistician), 33(4):391–399, 1984.
- Gupta and Gupta [2021] S. Gupta and S. K. Gupta. A systematic study of duplicate bug report detection. International Journal of Advanced Computer Science and Applications, 12(1), 2021.
- Gupta and Varma [2017] S. Gupta and V. Varma. Scientific article recommendation by using distributed representations of text and graph. In Proc. WWW, pages 1267–1268, 2017.
- Haering et al. [2021] M. Haering, C. Stanik, and W. Maalej. Automatically matching bug reports with related app reviews. In ICSE, pages 970–981, 2021.
- He et al. [2020] J. He, L. Xu, M. Yan, X. Xia, and Y. Lei. Duplicate bug report detection using dual-channel convolutional neural networks. In Proc. ICPC, pages 117–127, 2020.
- Hindle and Onuczko [2019] A. Hindle and C. Onuczko. Preventing duplicate bug reports by continuously querying bug reports. EMSE, 24(2):902–936, 2019.
- Ho et al. [2020] S. Y. Ho, K. Phua, L. Wong, and W. W. B. Goh. Extensions of the external validation for checking learned model interpretability and generalizability. Patterns, 1(8):100129, 2020.
- Hossin and Sulaiman [2015] M. Hossin and M. N. Sulaiman. A review on evaluation metrics for data classification evaluations. IJDKP, 5:01–11, 03 2015.
- Jalbert and Weimer [2008] N. Jalbert and W. Weimer. Automated duplicate detection for bug tracking systems. In Proc. DSN, pages 52–61, 2008.
- Jones [1972] K. S. Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of documentation, 1972.
- Kim [2015] T. K. Kim. T test as a parametric statistic. Korean journal of anesthesiology, 68(6):540, 2015.
- Klein et al. [2014] N. Klein, C. S. Corley, and N. A. Kraft. New features for duplicate bug detection. In Proc. MSR, page 324–327, 2014.
- Kusner et al. [2015] M. Kusner, Y. Sun, N. Kolkin, and K. Weinberger. From word embeddings to document distances. In ICML, pages 957–966, 2015.
- Lafferty and Blei [2006] J. Lafferty and D. Blei. Correlated topic models. Advances in neural information processing systems, 18:147, 2006.
- Lai et al. [2015] S. Lai, L. Xu, K. Liu, and J. Zhao. Recurrent convolutional neural networks for text classification. In Proc. AAAI, 2015.
- Lazar et al. [2014] A. Lazar, S. Ritchey, and B. Sharif. Improving the accuracy of duplicate bug report detection using textual similarity measures. In Proc. MSR, 2014.
- Li et al. [2020] B. S. Li, T. Estlander, R. Jolanki, and H. I. Maibach. Advantages and disadvantages of gloves. Kanerva’s Occupational Dermatology, pages 2547–2561, 2020.
- Lu et al. [2018] J. Lu, A. Liu, F. Dong, F. Gu, J. Gama, and G. Zhang. Learning under concept drift: A review. IEEE TKDE, 31(12):2346–2363, 2018.
- Neysiani and Babamir [2019] B. S. Neysiani and S. M. Babamir. Duplicate detection models for bug reports of software triage systems: A survey. Current Trends In Computer Sciences and Applications, 1(5):128–134, 2019.
- Nguyen et al. [2012] A. T. Nguyen, T. T. Nguyen, T. N. Nguyen, D. Lo, and C. Sun. Duplicate bug report detection with a combination of information retrieval and topic modeling. In Proc. ASE, pages 70–79, 2012.
- Nooralahzadeh et al. [2018] F. Nooralahzadeh, L. Øvrelid, and J. T. Lønning. Evaluation of domain-specific word embeddings using knowledge resources. In Proc. LREC, 2018.
- Obulesu et al. [2018] O. Obulesu, M. Mahendra, and M. ThrilokReddy. Machine learning techniques and tools: A survey. In ICIRCA, pages 605–611, 2018.
- Pennington et al. [2014] J. Pennington, R. Socher, and C. D. Manning. Glove: Global vectors for word representation. In Proc. EMNLP, pages 1532–1543, 2014.
- Rezaeinia et al. [2019] S. M. Rezaeinia, R. Rahmani, A. Ghodsi, and H. Veisi. Sentiment analysis based on improved pre-trained word embeddings. Expert Systems with Applications, 117:139–147, 2019.
- Robertson and Zaragoza [2009] S. Robertson and H. Zaragoza. The probabilistic relevance framework: Bm25 and beyond. FnTs in Information Retrieval, 3(4):333–389, 2009.
- Rocha and Carvalho [2021] T. M. Rocha and A. L. D. C. Carvalho. Siameseqat: A semantic context-based duplicate bug report detection using replicated cluster information. IEEE Access, 9:44610–44630, 2021.
- Rodrigues et al. [2020] I. M. Rodrigues, D. Aloise, E. R. Fernandes, and M. Dagenais. A soft alignment model for bug deduplication. In Proc. MSR, pages 43–53, 2020.
- Rosenthal et al. [1994] R. Rosenthal, H. Cooper, and L. Hedges. Parametric measures of effect size. The handbook of research synthesis, 621(2):231–244, 1994.
- Roy et al. [2017] A. Roy, Y. Park, and S. Pan. Learning domain-specific word embeddings from sparse cybersecurity texts. arXiv:1709.07470, 2017.
- Royston [1992] P. Royston. Approximating the shapiro-wilk w-test for non-normality. Statistics and computing, 2(3):117–119, 1992.
- Runeson et al. [2007] P. Runeson, M. Alexandersson, and O. Nyholm. Detection of duplicate defect reports using natural language processing. In Proc. ICSE, pages 499–510, 2007.
- Sun et al. [2010] C. Sun, D. Lo, X. Wang, J. Jiang, and S. C. Khoo. A discriminative model approach for accurate duplicate bug report retrieval. In Proc. ICSE, volume 1, pages 45–54, 2010.
- Sureka and Jalote [2010] A. Sureka and P. Jalote. Detecting duplicate bug report using character n-gram-based features. In APSEC, pages 366–374, 2010.
- Tian et al. [2012] Y. Tian, C. Sun, and D. Lo. Improved duplicate bug report identification. In Proc. CSMR, pages 385–390, 2012.
- Tian et al. [2014] Y. Tian, D. Lo, and J. Lawall. Automated construction of a software-specific word similarity database. In Proc. CSMR-WCRE, pages 44–53, 2014.
- Van der Maaten and Hinton [2008] L. Van der Maaten and G. Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008.
- Wallach [2006] H. M. Wallach. Topic modeling: beyond bag-of-words. In Proc. ICML, pages 977–984, 2006.
- Wang et al. [2008] X. Wang, L. Zhang, T. Xie, J. Anvik, and J. Sun. An approach to detecting duplicate bug reports using natural language and execution information. In Proc. ICSE, pages 461–470, 2008.
- Woolson [2007] R. F. Woolson. Wilcoxon signed-rank test. Wiley encyclopedia of clinical trials, pages 1–3, 2007.
- Wu et al. [2010] L. Wu, S. C. Hoi, and N. Yu. Semantics-preserving bag-of-words models and applications. IEEE TIP, 19(7):1908–1920, 2010.
- Xie et al. [2018] Q. Xie, Z. Wen, J. Zhu, C. Gao, and Z. Zheng. Detecting duplicate bug reports with convolutional neural networks. In Proc. APSEC, pages 416–425, 2018.
- Yang et al. [2012] C. Z. Yang, H. H. Du, S. S. Wu, and X. Chen. Duplication detection for software bug reports based on bm25 term weighting. In Proc. TAAI, pages 33–38, 2012.
- Yap et al. [2014] B. W. Yap, K. A. Rani, H. A. A. Rahman, S. Fong, Z. Khairudin, and N. N. Abdullah. An application of oversampling, undersampling, bagging and boosting in handling imbalanced datasets. In Proc. DaEng, pages 13–22, 2014.
- Žliobaitė [2010] I. Žliobaitė. Learning under concept drift: an overview. arXiv:1010.4784, 2010.
- Zou et al. [2018] W. Zou, D. Lo, Z. Chen, X. Xia, Y. Feng, and B. Xu. How practitioners perceive automated bug report management techniques. IEEE TSE, 46(8):836–862, 2018.