email: [email protected]; [email protected]
Yun Sing Koh, School of Computer Science, University of Auckland
Evaluating Software User Feedback Classifiers on Unseen Apps, Datasets, and Metadata
Abstract
Listening to users’ requirements is crucial to building and maintaining high-quality software. Online software user feedback has been shown to contain large amounts of information useful to requirements engineering (RE). Previous studies have created machine learning classifiers for parsing this feedback for development insight. While these classifiers report generally good performance when evaluated on a test set, questions remain as to how well they extend to unseen data in various forms.
This study evaluates the performance of machine learning classifiers on user feedback for two common classification tasks (classifying bug reports and feature requests). Using seven datasets from prior research studies, we investigate the performance of classifiers when evaluated on feedback from apps not represented in the training set, and when evaluated on completely different datasets (coming from different feedback platforms and/or labelled by different researchers). We also measure the difference in performance when platform-specific metadata is used as a feature in classification.
We demonstrate that classification performance is similar on feedback from unseen apps compared to seen apps in the majority of cases tested. However, the classifiers do not perform well on unseen datasets. We show that multi-dataset training or zero-shot classification approaches can somewhat mitigate this performance decrease. Finally, we find that using metadata as a feature in classifying bug reports and feature requests does not lead to a statistically significant improvement in the majority of datasets tested. We discuss the implications of these results for developing user feedback classification models to analyse and extract software requirements.
Keywords:
Software user feedback, User feedback classification, Unseen data domains, Machine learning, Requirements engineering, Software quality

Conflict of interest
None of the authors listed have a declared conflict of interest related to this work.
1 Introduction
Software product quality is deeply tied to user satisfaction and the extent to which the product meets the users’ needs gillies2011software . To that end, Requirements Engineering (RE) is considered key to developing high-quality software which meets users’ needs berki2004requirements ; nuseibeh2000requirements . Recent research has found online software user feedback to be a rich source of information for understanding users’ needs and the software requirements associated with those needs. For example, Pagano et al. showed that more than 30% of 1100 manually analysed reviews on mobile app stores contained requirements-relevant data that can be leveraged by developers to improve their product pagano2013user . Similarly, feedback platforms such as Twitter posts guzman2017little , forum posts tizard2019can , Reddit posts ali2020conceptualising , Facebook posts sultan2014back , and Steam reviews lin2019empirical have also been shown to contain helpful insights to guide the development and maintenance of software.
Studies have proposed classification methods to automate the ingestion and analysis of this feedback to help identify software requirements hadi2021evaluating ; henao2021transfer ; iacob2013online ; panichella2016ardoc . These methods are largely underpinned by machine learning models which require manually labelled example data to train on. Human annotators give labels such as “bug report” or “feature request” to each piece of feedback. These feedback-label pairs are then used as a training dataset to train a model to label feedback into one of these classes automatically.
The utility of these classifiers is multifaceted. Proposals have been made for using classifiers to help developers understand their users’ requirements by integrating user feedback into the development cycle. An example of this is MARA, which classifies app store reviews into “feature request” and “bug report”; these reviews are then used to inform software design and maintenance iacob2013online . Commercial solutions, such as MonkeyLearn (https://monkeylearn.com/), also exist. In addition to aiding software teams in identifying requirements, these classifiers can also be used in research studies. The classifiers can be trained on labelled training data and applied to large unlabelled datasets. This enables researchers to study the requirements-relevant characteristics of a large set of feedback, for example as was done by Nayebi et al. in analysing user feedback on Twitter nayebi2018app .
Despite classifiers’ widespread use throughout the literature, questions remain as to how effective these solutions are at classification out-of-the-box for software projects (i.e. without needing a representative sample of labelled feedback about the project as training data). Tizard et al. demonstrated that the popular ARDOC user feedback classifier, which was trained on user feedback from app store reviews, does not perform well when applied to forum post feedback tizard2019can . Another concern is that many of these classification techniques rely on using feedback platform metadata (e.g., app review rating) as input features for classifiers - metadata which may not be available when applying such a classifier to another platform. Applying classifiers to feedback from unseen apps (i.e. apps of which none of the feedback was included in the classifier’s training set) has also not been explicitly explored within the literature, which makes it unclear how they would perform on this unseen app feedback. If these classifiers are not able to correctly classify feedback from new apps and new platforms to a reasonably satisfactory degree, then they have only limited practical use in supporting the software development cycle.
We investigate the robustness of user feedback classifiers over separate apps, domains, and features. To do this, we focus on three aspects of user feedback classifier training: training and testing on separate apps, training and testing on separate datasets, and training and testing both with and without metadata-based features.
Firstly, many reported classification performance statistics are based on training and evaluating on user feedback from the same app, which leaves the expected performance of these classifiers on unseen apps unclear. Therefore, this study examines the difference in classification performance between models trained and tested on feedback from separate apps, and trained and tested on the same apps. This is done to investigate how well models can classify feedback from unseen apps.
Secondly, many public datasets of user feedback exist from prior studies. These datasets contain feedback from various feedback platforms (e.g. app store reviews, tweets, forum posts) and are labelled using various label sets. However, these datasets often contain labels that are similar to labels in other datasets (e.g. “bug”, “bug report”, and “error” labels from three separate datasets). One labeller’s definition of, for example, a bug report may differ from another’s. Therefore, we evaluate whether a classifier trained on the labels of one dataset transfers to labelling another dataset. This will provide an understanding of the ability of user feedback classification models to generalise to new domains or slightly different labelling schemas.
Thirdly, feedback metadata such as review rating and forum post position have also been used in many studies as features in classification. However, the effect that including metadata has on classification performance across multiple datasets is not fully known. Indeed, few studies isolate the effect of each type of metadata on overall performance. Therefore, we train classifiers both with and without metadata as features to determine the change in performance from their inclusion. Understanding the relative importance of metadata in feedback classification informs how applicable classifiers are to different sources, which may have different metadata available.
These aims resulted in the following three research questions:
• RQ1: How does training and testing on feedback from separate apps affect classification performance?
• RQ2: How does training and testing on feedback from separate datasets affect classification performance?
• RQ3: How does training and testing with metadata affect classification performance?
We answer these questions by evaluating the classification performance of state-of-the-art text classifiers under different data configurations using seven datasets from the literature. We find little difference in classification performance between feedback from seen and unseen apps, but a large drop in performance on unseen datasets. We also find that using metadata as a feature in a classifier tends not to improve classification of bug reports and feature requests in the majority of cases.
This paper first outlines the previous work related to user feedback classification in Section 2. The datasets used in this work are then detailed in Section 3, and the method used to train the classifiers is explained in Section 4. The results of the evaluation on these datasets are described in Section 5, and the implications of these results are discussed in Section 6. Finally, threats to validity are considered in Section 7.
The replication package for this work can be found online (https://doi.org/10.5281/zenodo.5733504).
2 Related work
This section first details the effect that Requirements Engineering (RE) has on software quality, before exploring the applications of machine learning in RE within the literature. Finally, it gives examples of previous studies regarding the evaluation of different machine learning techniques in RE.
2.1 Software quality and requirements engineering
The benefits gained from RE have been shown to be integral to software quality. Many models exist within the literature that include RE to improve software quality berki2004requirements ; broy2006requirements . The practice of RE has also been shown to be beneficial, with Damian and Chisan demonstrating that productivity, quality, and risk management were all improved when effective RE was done within a commercial software project damian2006empirical . Similarly, Radliński showed that multiple RE factors had a positive, statistically significant effect on software quality factors within a literature dataset of thousands of software projects radlinski2012empirical . A survey of developers also showed that teams using RE approaches were much more likely than those which did not to report that their product’s capabilities fit their customers’ needs well and that end users found their products easy to use kassab2014state . Other RE-adjacent concepts such as requirements traceability have also been shown to positively impact software quality rempel2016preventing .
2.2 Machine learning in requirements engineering
Machine learning has become a common tool within the requirements engineering literature for supporting the creation of requirements. Approaches such as the one proposed by Cleland-Huang et al. cleland2007automated integrate automated text classifiers into the requirements engineering process. The MARA model, developed by Iacob et al. iacob2013online , focuses on deriving requirements from the ingestion of online user feedback using such a text classifier. These classifiers have become prevalent throughout the literature, with a systematic literature review by Lim et al. showing that 38 out of 63 studies which analysed manually labelled user feedback used machine learning to classify this feedback lim2021data . One potential reason for this popularity is the reported high classification performance of some of these machine learning models.
Work from Maalej et al. maalej2016automatic , Panichella et al. panichella2016ardoc , and Stanik et al. stanik2019classifying all report bug report classification F1 scores higher than 0.75 on app reviews, with Maalej et al. reporting scores as high as 0.9. However, Stanik et al. also report a bug report classification F1 score of 0.59 on tweets, and Nayebi et al. report a lower feature request classification F1 score of 0.67 nayebi2018app . These diverse values highlight that the expected classification performance can vary dramatically depending on data source, classification method, and evaluation method. This underscores the need to rigorously compare and standardise techniques for both training and evaluating classification models.
2.3 Comparisons of classification techniques
There are several studies within the literature that compare techniques used for classifying user feedback. Work by Araujo et al. evaluated the performance of four classical machine learning classifiers on classifying user feedback from one dataset using both term-frequency-derived features and features from deep pre-trained language models, showing that deep pre-trained language models generate superior text embedding features compared to frequency-based features for classification araujo2020bag . Similarly, Henao et al. demonstrated the increase in user feedback classification performance when using pre-trained language models over both classical models and other deep models henao2021transfer . Hadi and Fard compared the classification accuracy of pre-trained language models against that of previously constructed classifiers from the literature, as well as exploring the effect of self-supervised pre-training, binary classification, multi-class classification, and zero-shot settings on classification performance hadi2021evaluating . Dhinakaran et al. showed that models trained on randomly chosen training data consistently underperform more sophisticated training data selection techniques, such as active learning dhinakaran2018app . Di Sorbo et al. investigated the correlation between app review rating and feedback type classifier prediction, finding that predictions of “problem discovery” from the ARDOC classifier were negatively correlated with the app rating, whereas predictions of “feature request” were uncorrelated di2021investigating . As can be seen, there has been extensive work evaluating which text-based features and machine learning models are best to use when classifying user feedback. Some work has also been done to improve the data-efficiency of training a classifier. What remains unclear is how different training and evaluation methods affect the evaluation results of these classifiers (particularly on out-of-domain data), and how features apart from text affect classification performance.
This study adds to the literature by exploring the effect of several machine learning techniques to highlight where and when user feedback classifiers can and cannot be used in the real world. Firstly, classifiers are evaluated on feedback from both seen and unseen apps, in order to determine how well user feedback classifiers perform in classifying feedback for an unseen app. Secondly, classifiers trained and tested on separate datasets are evaluated so as to determine how well classifiers can be applied to similar data. Finally, the use of metadata for classification across multiple domains is evaluated to determine its effect on classification performance.
3 Datasets
To compare different training and evaluation techniques, seven unique datasets from six studies were used in our evaluation. The datasets studied vary in size, feedback label set, and domain, coming from app reviews, Twitter, and forum posts. This variance among datasets was chosen partly to evaluate how well these classifiers perform across different domains and different labellers.
From these seven datasets, it was found that all seven shared a “bug report” or similar class, and six shared a “feature request” or similar class. Therefore, comparison of classification across datasets was done on a binary basis for these two classes. This section describes each dataset, with Table 1 summarizing and comparing the broad statistics of each dataset.
Table 1: Summary of the datasets used in this study.

ID | Domain | Size | Apps | Metadata | Reported bug report F1 | Reported feature request F1
---|---|---|---|---|---|---
A | Reviews | 1,565 | 27 | None | NR | NA
B | Reviews | 4,385 | 7 | Rating | 0.81 | 0.51
C | Reviews | 1,438 | 48 | Rating | 0.88 | 0.85
D | Reviews | 2,986 | 705 | … | 0.9 | 0.72
E | Reviews | 707 | 14 | None | NR | NR
F | Forums | 2,652 | 2 | … | … | …
G | Twitter | 3,907 | 10 | None | 0.78 | 0.66
The datasets included in this analysis are taken from replication packages linked in:
• Dataset A from Ciurumelea et al. ciurumelea2017analyzing (note 3)
• Dataset B from Guzman et al. guzman2015ensemble
• Dataset C from Maalej et al. maalej2016automatic
• Dataset D from Scalabrino et al. scalabrino2017listening (note 4)
• Dataset E from Scalabrino et al. scalabrino2017listening (note 4)
• Dataset F from Tizard et al. tizard2019can
• Dataset G from Williams et al. williams2017mining

Note 3: This dataset contains feedback that is labelled as “Error”. While the classification of this class of feedback is not reported on in the original paper, we use this class as our bug report class.

Note 4: The replication package contains two datasets referring to research questions 1 and 3 of that study, of which the latter is a pre-filtered set of feedback (filtered to contain only requirements-relevant feedback) used to measure clustering performance rather than classification. Therefore, while no classification metrics are reported for this RQ3 dataset (Dataset E), we still use it for training and testing models.
Each dataset consists of a set of publicly available user feedback which has been scraped from the internet, before being manually labelled by the researchers of their respective studies. The smallest of these datasets has 707 pieces of feedback, while the largest has 4,385. The datasets span three distinct user feedback domains: app store reviews, forum posts, and tweets. The maximum number of distinct apps within a dataset was 705, while the smallest was two.
4 Method
To answer our research questions, we created training and test sets applicable to each experiment, before using the training sets to train state-of-the-art text classifiers. Finally, we evaluated these models on the test sets to get performance scores for each experiment.
Table 2: Summary of the training and testing configurations used to answer each research question.

RQ | Configuration | Trained on | Tested on
---|---|---|---
RQ1 | Single mixed | Mixed-app train split of one dataset | Mixed-app test split of the same dataset
RQ1 | Single separated | App-separated train split of one dataset | Test split of the same dataset containing only unseen apps
RQ2 | Single out-of-dataset | Mixed-app train split of one dataset | Every other dataset
RQ2 | Leave one out | All datasets except one | The excluded dataset
RQ3 | Text only / Text and metadata | Mixed-app train split, using text only or text with prepended metadata tags | Mixed-app test split of the same dataset
(Fig. 1)
4.1 Data handling
Within this study, the performance of classifiers on feedback from apps that it had not been trained on is analysed, and as such, the information as to which app a piece of feedback came from is needed for each dataset. For this reason each dataset was cleaned so that any rows which did not have an app identifier (i.e. are null) were dropped. This only affected dataset C, where some rows did not contain app identification data, and accounted for 1,821 out of 3,259 (55.9%) rows within that dataset, leaving 1,438 pieces of feedback from that dataset to use in our experiments.
Two different configurations of generating train, validation and test data splits were then used for each dataset, and these were named “mixed” and “separated”. The “mixed” data splits were created by randomly splitting all feedback across the dataset into a 64:16:20 (train : validation : test) split. These ratios were chosen to be an initial 80:20 (train+validation : test) split before splitting the train + validation set with a further 80:20 split. This random splitting of data into train, validation, and test sets is current standard practice throughout the user feedback classification literature.
The “separated” data splits were generated by randomly sampling apps within a dataset and assigning them to either the train, validation, or test split. All feedback for a given app was put into the same data split. Again, we aimed for a 64:16:20 (train : validation : test) ratio while ensuring each of these splits contained data from different apps. “Separated” data splits were not created for Dataset F because it included only two apps, and so contained too few distinct apps to split into train, validation, and test sets. A visual depiction of how these two configurations were created can be seen in Fig. 1.
The “mixed” and “separated” configurations were created using scikit-learn’s ShuffleSplit and GroupShuffleSplit respectively. Cross-validation was used to generate 5 distinct data folds for each dataset, and the reported metrics in our results are the mean over these 5 folds.
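As a minimal sketch of this split procedure, assuming each dataset is held in a pandas DataFrame with hypothetical column names “text”, “label”, and “app_id” (the actual column names differ per dataset):

```python
from sklearn.model_selection import ShuffleSplit, GroupShuffleSplit

def make_splits(df, separated=False, seed=0):
    """Return (train, validation, test) frames in roughly a 64:16:20 ratio."""
    if separated:
        # "Separated": whole apps are assigned to one split, so test apps are unseen.
        outer = GroupShuffleSplit(n_splits=1, test_size=0.20, random_state=seed)
        inner = GroupShuffleSplit(n_splits=1, test_size=0.20, random_state=seed)
        outer_groups = df["app_id"]
    else:
        # "Mixed": feedback is split at random regardless of which app it came from.
        outer = ShuffleSplit(n_splits=1, test_size=0.20, random_state=seed)
        inner = ShuffleSplit(n_splits=1, test_size=0.20, random_state=seed)
        outer_groups = None

    train_val_idx, test_idx = next(outer.split(df, groups=outer_groups))
    train_val = df.iloc[train_val_idx]
    inner_groups = train_val["app_id"] if separated else None
    train_idx, val_idx = next(inner.split(train_val, groups=inner_groups))
    return train_val.iloc[train_idx], train_val.iloc[val_idx], df.iloc[test_idx]
```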
4.2 Model
4.2.1 Training machine learning models
We trained machine learning models based on state-of-the-art pre-trained language models, which have been shown to achieve higher classification performance than other models henao2021transfer . These models require text input to be tokenized before they can be trained.
Tokenization
To train and evaluate deep pre-trained language model based classifiers, we first tokenized all feedback text. This was done using Huggingface’s Tokenizers library in Python (https://huggingface.co/docs/tokenizers) using the “bert-base-uncased” version of the “BertTokenizer” tokenizer. This results in the creation of input IDs, an attention mask, and token type IDs for every piece of feedback. Each piece of feedback is also prepended with a [CLS] token and appended with a [SEP] token to denote the start and end of a piece of text. These values are then fed into the model for training or inference. We supply metadata tags to the model in the form of special tokens. All possible metadata tokens (i.e. those contained in the train, validation, and test splits) are passed to the tokenizer when it is initialized so that it does not split these tags and so that all metadata tags map to valid input IDs during training and inference.
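A minimal sketch of this step is shown below; the metadata tags listed are illustrative examples rather than the full set used in the study.

```python
from transformers import BertTokenizer

# Register the metadata tags as additional special tokens at initialisation so that
# they are never split into word pieces and each maps to a single input ID.
metadata_tags = [f"[METADATA_rating_{i}]" for i in range(1, 6)]  # illustrative tags only
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased",
                                          additional_special_tokens=metadata_tags)

# [CLS] and [SEP] are added automatically; the encoding contains the input IDs,
# attention mask, and token type IDs that are fed to the model.
encoded = tokenizer("[METADATA_rating_1] The app crashes whenever I open a photo",
                    truncation=True, padding="max_length", max_length=128)
print(encoded["input_ids"][:8])
```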
Model training
The inputs generated by the tokenizer are then passed to a BERT model connected to one layer of two linear output nodes, which creates a head from which binary classification can be determined. This is done using the “distilbert-base-cased” version of the “BertForSequenceClassification” model from Huggingface’s Transformers library (https://huggingface.co/transformers) in Python. This model variant was chosen due to its relatively high performance on general natural language tasks compared to larger language models sanh2019distilbert and because a smaller model allowed for more reasonable training times for the high number of models created within the constraints of this study. The model takes the output from the first position (the [CLS] token) of the BERT model and passes it to the output linear layer, as is the norm when training BERT models. Labels are generated from this model using a softmax layer on top of the output to get a probability distribution between the in and out class of each piece of feedback. The class with the highest probability is then set as the prediction of the model.
Each model is trained for 500 steps with a batch size of 128 (128 pieces of feedback training the model at each step). Every time all feedback has been used to train the model (i.e. each epoch), the model is evaluated on the validation set. The weights of the model at the epoch with the highest associated F1 score on the validation set were loaded after training and saved for use in evaluating on the test set. The choice of training for 500 steps was made as it was observed that both the smallest and largest datasets had safely peaked in validation set F1 score by that point.
A Trainer object from Huggingface’s Transformers was used to train the model, into which we set a training batch size of 32. All other hyperparameters were left as default for the Trainer (initial learning rate = 5e-05, weight decay = 0, adam beta 1 = 0.9, adam beta 2 = 0.999, adam epsilon = 1e-08) as there was little observed difference in performance when these were changed.
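A condensed sketch of this fine-tuning setup is given below. It assumes `train_ds` and `val_ds` are tokenized datasets with binary labels and that `tokenizer` is the one described above; loading the DistilBERT checkpoint via `AutoModelForSequenceClassification` is our simplification, and the argument names follow the Transformers Trainer API rather than our exact configuration.

```python
import numpy as np
from sklearn.metrics import f1_score
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-cased",
                                                           num_labels=2)
model.resize_token_embeddings(len(tokenizer))  # account for the added metadata tokens

def compute_metrics(eval_pred):
    # Taking the argmax over the two output logits is equivalent to applying a
    # softmax and picking the most probable class.
    preds = np.argmax(eval_pred.predictions, axis=-1)
    return {"f1": f1_score(eval_pred.label_ids, preds)}

args = TrainingArguments(
    output_dir="out",
    max_steps=500,                    # train for 500 steps in total
    per_device_train_batch_size=32,
    evaluation_strategy="epoch",      # evaluate on the validation set every epoch
    save_strategy="epoch",
    load_best_model_at_end=True,      # reload the weights with the best validation F1
    metric_for_best_model="f1",
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds,
                  eval_dataset=val_ds, compute_metrics=compute_metrics)
trainer.train()
```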
4.2.2 Zero shot classifier
A zero-shot classifier is a text classifier that does not require any training data before being used. In our work, we use the “bart-large-mnli” model developed by Facebook as our zero-shot model due to its performance and popularity on the HuggingFace model portal (https://huggingface.co/models?pipeline_tag=zero-shot-classification). This model was not explicitly trained to classify user feedback, but has been designed to classify text without training on class-labelled data (hence “zero-shot”) by leveraging the entailment prediction abilities of natural language inference models, as proposed by Yin et al. yin2019benchmarking . The classifier therefore relies on the textual content of the label as well as the text content of the feedback, and so requires candidate label names to categorise text into. We used “bug report” for classifying bug reports, and “feature request” for classifying feature requests. This classifier outputs a classification score between 0 and 1, rather than a simple label. We therefore take a class to have been detected if it has a score greater than 0.5.
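A minimal sketch of this setup using the Hugging Face zero-shot classification pipeline (the example feedback text and helper function are illustrative):

```python
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def is_bug_report(feedback_text, threshold=0.5):
    # With multi_label=True the label is scored independently between 0 and 1,
    # so the entailment score can be thresholded rather than softmaxed over labels.
    result = classifier(feedback_text, candidate_labels=["bug report"], multi_label=True)
    return result["scores"][0] > threshold

print(is_bug_report("The app crashes every time I try to upload a photo."))
```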
4.3 Evaluation metrics
Each dataset was evaluated using the F1 metric, defined in Equations 1, 2, and 3, where TP, FP, and FN denote the number of true positives, false positives, and false negatives respectively. This metric provides a good measure of how well a class is being correctly labelled due to it balancing recall and precision, and is less sensitive to class imbalances in the data compared to the accuracy metric.

Precision = TP / (TP + FP)    (1)

Recall = TP / (TP + FN)    (2)

F1 = (2 × Precision × Recall) / (Precision + Recall)    (3)
Statistical significance of the differences between the performances of different classifier types was determined using an independent two-sample t-test on the F1 metrics across all folds of cross-validation for a given test dataset.
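As a minimal sketch of this test (the per-fold F1 values below are illustrative, not taken from our results):

```python
from scipy.stats import ttest_ind

f1_separated = [0.68, 0.71, 0.66, 0.70, 0.69]  # illustrative per-fold F1 scores
f1_mixed = [0.73, 0.75, 0.72, 0.74, 0.76]      # illustrative per-fold F1 scores

t_stat, p_value = ttest_ind(f1_separated, f1_mixed)  # independent two-sample t-test
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```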
4.4 Training and evaluation
4.4.1 Unseen Apps (RQ1)
In order to determine the difference in performance between evaluating on feedback from unseen apps compared to seen apps, we trained models on the cross validation folds of each dataset’s training and validation sets for both “separated” and “mixed” configurations, for evaluation on their respective test sets. In our results, we denote these models as “Single dataset - separated” and “Single dataset - mixed”.
4.4.2 Unseen datasets (RQ2)
(Fig. 2)
To find the classification ability of classifiers trained on one dataset before being applied to another, we used the models trained on the “Single dataset - mixed” splits from RQ1, as the literature standard is to evaluate classifiers on mixed-app dataset splits. These were then evaluated on each dataset except the one they were trained on. In our results, “Train A” to “Train G” denote the models which have been trained on one of these datasets and are then evaluated on all others. In addition, we also trained a “leave-one-out” (denoted “LOO”) model for each data split, where all datasets except one were used to train a model, which was then evaluated on the excluded dataset. A visual representation of how this training was done can be found in Fig. 2.
For further context, a zero-shot text classification model (denoted “Zero shot”), as was proposed by Hadi and Fard hadi2021evaluating , was also evaluated on each dataset to provide a performance benchmark.
4.4.3 Metadata (RQ3)
(Fig. 3)
To determine the difference in performance of classifiers that use both metadata and text against those which use only text, we train two different models for every fold of every dataset, one which just receives feedback text as input, and one which also receives feedback metadata. As in RQ2, the “Single dataset - mixed” splits of data were used for these evaluations. A visual representation of this training and evaluation can be found in Fig. 3.
Metadata was added as a feature to the model by prepending the feedback text with metadata tags before it is passed to the model. These metadata tags are generated using the format specified in Equation 4, such that an app review with an associated rating of 3 stars would be prepended with the tag “[METADATA_rating_3]”.
[METADATA_<metadata column name>_<metadata value>]    (4)
Each metadata tag is added to the text tokenizer as a special token so as to prevent it from being broken up upon tokenization. We trained models using all metadata available to us from their datasets. This includes metadata that was used as features when making classifiers in the original studies associated with these datasets (e.g. follower count in dataset G). Full details of all metadata used with each dataset can be found in Table 3. After training on a given train and validation set, each model was evaluated on the respective test set. “Text only” denotes the evaluation results of the model which was trained and tested using only text features. “Text and metadata” denotes the evaluation results of the model which was trained and tested using both text and metadata features.
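As a minimal sketch of this tag construction (the column name and value in the example are illustrative):

```python
def metadata_tag(column_name, value):
    # Build a tag following the format in Equation 4, e.g. "[METADATA_rating_3]".
    return f"[METADATA_{column_name}_{value}]"

def prepend_metadata(text, metadata):
    """Prepend one tag per metadata field, e.g. metadata = {"rating": 3}."""
    tags = " ".join(metadata_tag(col, val) for col, val in metadata.items())
    return f"{tags} {text}"

print(prepend_metadata("The app crashes when I open a photo", {"rating": 3}))
# -> "[METADATA_rating_3] The app crashes when I open a photo"
```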
Table 3: Metadata used with each dataset.

Dataset | Metadata
---|---
A | App rating
B | App rating
C | App rating
D | …
E | App rating
F | …
G | …
A visual summary of how these three research questions were answered can be found in Table 2.
5 Results
This section first presents the results from training and testing on mixed and separated apps within datasets (RQ1). It then presents the results of training and testing on separate datasets (RQ2). Finally, it details the results of training and testing using metadata (RQ3).
5.1 Mixed vs. separated apps within data splits (RQ1)
Table 4 details the mean F1 scores of classifiers from RQ1 for classifying bug reports and feature requests.
Table 4: Mean F1 scores for classifiers trained and tested on app-separated (“Separated”) and randomly split (“Mixed”) data, with t-statistics for the difference (* p<0.05, ** p<0.01).

Dataset | Bug: Separated | Bug: Mixed | Bug: t | Bug: Sig. | Feature request: Separated | Feature request: Mixed | Feature request: t | Feature request: Sig.
---|---|---|---|---|---|---|---|---
A | 0.664 | 0.764 | -2.137 |  | - | - | - | -
B | 0.680 | 0.725 | -3.252 | * | 0.399 | 0.468 | -2.087 |
C | 0.339 | 0.357 | -0.325 |  | 0.501 | 0.544 | -0.639 |
D | 0.868 | 0.856 | 0.716 |  | 0.659 | 0.653 | 0.341 |
E | 0.683 | 0.875 | -3.504 | ** | 0.443 | 0.642 | -2.580 | *
G | 0.688 | 0.704 | -0.837 |  | 0.528 | 0.597 | -3.227 | *
For the models which were trained on only one dataset, it can be seen that app-separated splits have a lower F1 score than mixed-app splits for 5 out of the 6 bug report datasets and 4 out of the 5 feature request datasets. For bug reports, two of these differences (B and E) were statistically significant at p<0.05, while for feature requests, two (E and G) were significant. Moreover, only one of these 11 differences is significant at p<0.01.
Answer to RQ1 - How does training and testing on feedback from separate apps affect classification performance? In aggregate, a minority (4/11) of datasets exhibited a statistically significant difference between being split along app lines and being split randomly. Training and testing a user feedback classifier on feedback from the same apps does not result in meaningfully different performance compared to training and testing on feedback from separate apps in the majority of cases.
5.2 Testing on separate datasets (RQ2)
Table 5 and Table 6 detail the mean F1 scores of separate-dataset classifiers from RQ2 for classifying bug reports and feature requests, respectively.
Table 5: Bug report classification F1 scores on each test dataset (columns) for classifiers trained on different data (rows).

Trained on | A | B | C | D | E | F | G
---|---|---|---|---|---|---|---
Train A | - | 0.413 | 0.315 | 0.430 | 0.595 | 0.140 | 0.301
Train B | 0.628 | - | 0.311 | 0.716 | 0.690 | 0.267 | 0.425
Train C | 0.540 | 0.551 | - | 0.585 | 0.668 | 0.173 | 0.408
Train D | 0.330 | 0.465 | 0.278 | - | 0.541 | 0.158 | 0.391
Train E | 0.467 | 0.364 | 0.185 | 0.399 | - | 0.203 | 0.439
Train F | 0.191 | 0.177 | 0.108 | 0.282 | 0.161 | - | 0.212
Train G | 0.645 | 0.581 | 0.324 | 0.624 | 0.817 | 0.317 | -
Train LOO | 0.694 | 0.704 | 0.422 | 0.781 | 0.792 | 0.344 | 0.493
Zero shot | 0.653 | 0.645 | 0.360 | 0.712 | 0.743 | 0.394 | 0.692
Single dataset - mixed | 0.764 | 0.725 | 0.357 | 0.856 | 0.875 | 0.455 | 0.704
Table 6: Feature request classification F1 scores on each test dataset (columns) for classifiers trained on different data (rows).

Trained on | B | C | D | E | F | G
---|---|---|---|---|---|---
Train B | - | 0.147 | 0.450 | 0.439 | 0.156 | 0.213
Train C | 0.138 | - | 0.159 | 0.148 | 0.056 | 0.237
Train D | 0.314 | 0.154 | - | 0.211 | 0.119 | 0.176
Train E | 0.000 | 0.000 | 0.000 | - | 0.000 | 0.000
Train F | 0.055 | 0.007 | 0.028 | 0.036 | - | 0.021
Train G | 0.314 | 0.213 | 0.356 | 0.356 | 0.127 | -
Train LOO | 0.406 | 0.281 | 0.527 | 0.445 | 0.148 | 0.274
Zero shot | 0.385 | 0.296 | 0.365 | 0.479 | 0.153 | 0.522
Single dataset - mixed | 0.468 | 0.544 | 0.653 | 0.642 | 0.270 | 0.597
As can be seen for both bug report and feature request classification, the F1 score for any one given test dataset can vary wildly depending on the dataset of the training data. For bug report classifiers, we observe that the classifier trained only on dataset G training data (tweet user feedback) performs best on 4 of the 6 test datasets it is applied to. For feature request classifiers, the classifier trained only on dataset B (app review data) performs best on 4 of the 5 datasets it is applied to. For bug reports, the classifier trained on dataset F (forum data) performs worse than the other single-dataset classifiers on all test datasets it is applied to. For feature requests, the classifier trained on dataset E (app reviews) performs worst on all test datasets. Overall, every classifier trained on one dataset and evaluated on a separate dataset achieves lower classification performance compared to models trained and tested on the same dataset.
In comparison to the models trained on only one (different) dataset, the leave-one-out classifier performs best on 6 of the 7 bug report datasets and 5 out of 6 feature request datasets that it is applied to. Compared to the model trained and tested on the same dataset, the leave-one-out classifier performs slightly worse on all but one dataset.
The zero-shot classifier performs better than any of the single-dataset out-of-dataset classifiers on 5 out of the 7 bug report datasets and 4 out of the 6 feature request datasets. The zero-shot model exceeds the performance of the leave-one-out models on the test sets of Datasets F and G (forum posts and tweets) for bug reports, and on 4 out of 6 datasets (C, E, F, and G) for feature requests. Therefore, zero-shot models perform best relative to other models on datasets from distinct feedback platforms.
Answer to RQ2 - How does training and testing on data from separate datasets affect classification performance? Training and testing a user feedback classifier on feedback from separate datasets results in overall lower performance than training and testing on the same dataset. However, this lower performance can be improved upon by models trained on multiple datasets or by zero-shot text classification models.
5.3 Using metadata features to classify (RQ3)
Table 7 details the mean F1 scores of RQ3 classifiers both including and excluding metadata features to classify feedback into bug reports and feature requests.
Table 7: Mean F1 scores for classifiers trained and tested using text only and using text and metadata, with t-statistics for the difference (* p<0.05, ** p<0.01).

Dataset | Bug: Text only | Bug: Text and metadata | Bug: t | Bug: Sig. | Feature request: Text only | Feature request: Text and metadata | Feature request: t | Feature request: Sig.
---|---|---|---|---|---|---|---|---
A | 0.764 | 0.817 | -2.059 |  | - | - | - | -
B | 0.725 | 0.761 | -3.270 | * | 0.468 | 0.475 | -0.168 |
C | 0.357 | 0.432 | -1.481 |  | 0.544 | 0.535 | 0.171 |
D | 0.856 | 0.857 | -0.155 |  | 0.653 | 0.669 | -0.698 |
E | 0.875 | 0.878 | -0.172 |  | 0.642 | 0.687 | -0.791 |
F | 0.455 | 0.522 | -1.985 |  | 0.270 | 0.463 | -5.102 | **
G | 0.704 | 0.722 | -1.235 |  | 0.597 | 0.605 | -0.367 |
For bug reports, we find that all datasets have higher F1 scores when metadata and text are used to classify compared to when only text is used. However, only one of these increases (dataset B) was statistically significant with a p-value of <0.05.
Similarly for feature requests, we find that for 5 out of the 6 datasets studied, models which use metadata and text perform better than those using just text. Again, only one (dataset F) of these differences was statistically significant at a p-value of 0.05.
Overall, we see either a slight increase or no change in the performance of classifiers when metadata and text are used together compared to when text is used alone.
Answer to RQ3 - How does training and testing with metadata affect classification performance? Training with metadata does not result in a statistically significant increase in classification performance on the majority of datasets tested.
6 Discussion
This section discusses the results of this work and their implications. Firstly, the effect of training and testing on separate apps is discussed. Then the effect of training and testing on separate datasets is described. Finally, the effect of using metadata in user feedback classifiers is analysed.
6.1 RQ1 - Mixed vs. separated apps within data splits
From our results, we found little difference between evaluating a classification model on feedback from unseen apps and evaluating on feedback from the same apps that it was trained on. This finding holds for models classifying both bug reports and feature requests. This result suggests that model evaluation as currently carried out within the literature (i.e. not specifying that train, validation, and test splits must contain feedback from separate apps) can be seen as a good predictor of the performance of a classifier on unseen apps from within a dataset. This hints at the potential real-world applicability of these models, in that they could be used on feedback from unseen new apps (but crucially from the same platform and data-gathering process) without an expected drop in performance.
Another outcome of these experiments is that the classification F1 score can range from high (greater than 0.8) to low (less than 0.5). When a classifier has low absolute classification performance, its utility in finding requirements is limited. This finding highlights the fact that automatic classification using current technology is not universally useful across all feedback datasets. The fact that many of these values are slightly lower than their literature-quoted values could possibly be due to our decision against doing extensive hyperparameter tuning when training our classifiers. Reasoning and discussion of this is given in Section 7. Finally, the lower classification performance across most of the datasets for classifying feature requests compared to bug reports is a trend that can be broadly seen throughout the literature, and calls into question exactly why a bug report is so much easier to identify (from a machine learning perspective) than a feature request.
6.2 RQ2 - Testing on unseen datasets
In RQ2, we found that a model trained on one dataset and then applied to another dataset achieves worse performance than a model trained and tested on the same dataset. This is not a surprising finding, given that class balance and labelling methods vary slightly between datasets. However, it raises an important question: How informative are the predictions of these models when used in the real world? A dataset of user feedback for a given software project is not guaranteed to have a certain class balance, and a given researcher or developer is not guaranteed to consider a piece of feedback to contain a bug or feature request in the same way that the training data labellers did.
Our results with the leave-one-out models, in contrast, show better performance. While the leave-one-out models perform worse than models trained and tested on the same data, they perform better and are more consistent compared to models trained on one dataset. The leave-one-out models also perform better than the zero-shot classifier, except on feedback from platforms unseen at training time (tweets and forum posts). This indicates that while leave-one-out classifiers are useful, zero-shot classifiers are more appropriate for classifying feedback from unseen feedback platforms.
These results suggest that user feedback analysis tools will achieve the highest performance if, before use, they first require a sample of labelled user feedback from the developer who intends to use the tool (i.e. in-domain training data). This could be done through an active learning approach, such as the one explored by Magalhães et al. magal , in order to limit the amount of labelled data required. However, a tool that requires no further labelled data before being used can still achieve good performance if it is either trained using a labelled feedback dataset from the same feedback platform, or, if no such dataset exists, built on a zero-shot text classifier.
With these results, we recommend that future creators of user feedback analysis tools train a classification model using as much labelled user feedback as possible, especially using data from the same platform as its intended use-case. If such a dataset does not exist and is prohibitively expensive to create, then we recommend using zero-shot classification models instead.
In order to aid future user feedback analysis tools, we make bug report and feature request classifiers available for use on the Huggingface platform. We aim to make these available with a link upon publication.
6.3 RQ3 - Classifying with metadata
Our findings for RQ3 are that classification performance is modestly, but not significantly, improved when using metadata. This finding is in contrast to previous findings regarding the use of metadata in feedback classification, in which metadata was shown to have a positive impact on classification performance maalej2016automatic ; tizard2019can . We theorise that this may be due to the fact that the state-of-the-art classifiers that we used contain millions of parameters devlin-etal-2019-bert , compared to the very few parameters available in the classical machine learning models used in these earlier works. With this increased capacity, our model may be better able to infer metadata from the text itself (for example, low review ratings would also be associated with more negative sentiment in the text), which means that having this information explicitly provided would not have much of an effect on the final prediction. We therefore recommend that the use of metadata as a feature be re-examined for text-based software engineering machine learning tasks with the advent of new, high-capacity language models such as BERT. Without the use of what appears to be largely superfluous metadata, these models are better able to be applied to different feedback and to new feedback platforms, where the available metadata may differ.
We have shown that metadata does not affect performance significantly across a majority of datasets in the classification of bug reports and feature requests, but it is an open question as to how metadata would affect other classes of feedback. It is for future work to investigate a fuller picture as to which classes benefit most and least from use of metadata in their prediction.
7 Threats to validity
One threat to the validity of this work is that the results of this study may not generalise to the classification of other feedback classes. This study only examined the performance of classification models on classifying feedback into the binary labels of “Bug report” or “No bug report” and “Feature request” or “No feature request”. These two labels were chosen because they were the only two consistent labels across multiple datasets. Being able to automatically detect bug reports and feature requests from users is one of the key promises of utilizing online user feedback for requirements engineering iacob2013online . Furthermore, the abundance of these labels in various literature datasets highlights how useful these labels are considered to be. Therefore, focusing on the task of classifying bug reports and feature requests can still be seen to be valuable to those looking to engineer requirements using user feedback. It is for future work to replicate this research on other label types.
Another potential threat to the validity of this work is that we did not carry out any data balancing when creating our classifiers. Multiple studies within the literature, including those associated with datasets used in this work maalej2016automatic ; tizard2019can , carried out data balancing before training their classifiers. This is done to counteract the fact that user feedback may have classes of interest which are a small minority of overall feedback, such that a model is unable to learn the characteristics of a class if most of its training data is from other classes. However, studies on datasets outside the domain of user feedback classification have shown that classifiers can perform well even when trained on highly unbalanced data batista2004study . Moreover, Henao et al. demonstrated that undersampling when training a deep language model has no major impact on the F1 score of the classifier henao2021transfer . It is for this reason that we decided against balancing our data, and it is for future work to fully explore the impact of data balancing on user feedback classification.
A final threat to validity considered was the lack of hyperparameter tuning done for any one model, which may have led to lower absolute classification performance. While optimising the hyperparameters for any one app or dataset may have led to marginal performance gains, we found in early experiments that changing hyperparameters had little impact on overall classification performance. Our research questions also focused on the relative differences between machine learning treatments, rather than absolute values, and so we would expect that any performance improvements introduced by hyperparameter tuning would not affect our overall conclusions. Furthermore, one of the aims of this work was to investigate how well models apply to unseen data domains. Tuning hyperparameters for the model’s training domain may overfit it and disadvantage it when applied to out-of-domain data. It is for this reason that we decided against extensive hyperparameter tuning.
8 Conclusion
The technical quality of software is meaningless if it does not meet the needs of its intended users. Requirements engineering (RE) offers a way to gather the requirements of users, and has been shown to improve software quality generally. This work builds on the RE literature in understanding and automatically processing online user feedback for use in developing and maintaining software. Previous work has shown that it is possible to create text classifiers that can automatically detect bug reports, feature requests, and other requirements relevant information in user feedback for use in the software development cycle. This work contextualises these past results, and informs the future improvement of these classifiers. This has led to three broad contributions.
Firstly, we showed that there can be a small drop in classification performance when applying trained classifiers to feedback from unseen apps for some datasets. However, this trend was found to be statistically insignificant across the majority of datasets tested.
Secondly, this paper demonstrated the classification performance of models which had not been trained on the dataset that a given test set came from. We found that in the scenario where no data from a specific dataset is used to train a classifier, training a model on multiple other datasets achieves better performance than training on any one dataset alone. Moreover, we found that these multiple-dataset models are most applicable to datasets containing feedback from platforms which the model has been trained on (app reviews). We found that for other platforms (tweets and forums), which did not have another dataset to represent them in the training data, zero-shot classification models performed better.
Finally, we demonstrate that the classification of both bug reports and feature requests does not notably benefit from using metadata (app ratings, forum post position, etc.) as features.
Overall, these three results can inform the creation of better user feedback analysis tools so that, ultimately, developers will better understand the needs of their users and create higher quality software.
We have made the replication package for this study available online (https://doi.org/10.5281/zenodo.5733504).
References
- (1) Ali Khan, J., Liu, L., Wen, L., Ali, R.: Conceptualising, extracting and analysing requirements arguments in users’ forums: The crowdre-arg framework. Journal of Software: Evolution and Process 32(12), e2309 (2020)
- (2) Araujo, A., Golo, M., Viana, B., Sanches, F., Romero, R., Marcacini, R.: From bag-of-words to pre-trained neural language models: Improving automatic classification of app reviews for requirements engineering. In: Anais do XVII Encontro Nacional de Inteligência Artificial e Computacional, pp. 378–389. SBC (2020)
- (3) Batista, G.E., Prati, R.C., Monard, M.C.: A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD explorations newsletter 6(1), 20–29 (2004)
- (4) Berki, E., Georgiadou, E., Holcombe, M.: Requirements engineering and process modelling in software quality management—towards a generic process metamodel. Software Quality Journal 12(3), 265–283 (2004)
- (5) Broy, M.: Requirements engineering as a key to holistic software quality. In: International Symposium on Computer and Information Sciences, pp. 24–34. Springer (2006)
- (6) Ciurumelea, A., Schaufelbühl, A., Panichella, S., Gall, H.C.: Analyzing reviews and code of mobile apps for better release planning. In: 2017 IEEE 24th International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 91–102. IEEE (2017)
- (7) Cleland-Huang, J., Settimi, R., Zou, X., Solc, P.: Automated classification of non-functional requirements. Requirements engineering 12(2), 103–120 (2007)
- (8) Damian, D., Chisan, J.: An empirical study of the complex relationships between requirements engineering processes and other processes that lead to payoffs in productivity, quality, and risk management. IEEE Transactions on Software Engineering 32(7), 433–453 (2006)
- (9) Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics, Minneapolis, Minnesota (2019). DOI 10.18653/v1/N19-1423. URL https://aclanthology.org/N19-1423
- (10) Dhinakaran, V.T., Pulle, R., Ajmeri, N., Murukannaiah, P.K.: App review analysis via active learning: reducing supervision effort without compromising classification accuracy. In: 2018 IEEE 26th International Requirements Engineering Conference (RE), pp. 170–181. IEEE (2018)
- (11) Di Sorbo, A., Grano, G., Aaron Visaggio, C., Panichella, S.: Investigating the criticality of user-reported issues through their relations with app rating. Journal of Software: Evolution and Process 33(3), e2316 (2021)
- (12) Gillies, A.: Software quality: theory and management. Lulu. com (2011)
- (13) Guzman, E., El-Haliby, M., Bruegge, B.: Ensemble methods for app review classification: An approach for software evolution (n). In: 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 771–776. IEEE (2015)
- (14) Guzman, E., Ibrahim, M., Glinz, M.: A little bird told me: Mining tweets for requirements and software evolution. In: 2017 IEEE 25th International Requirements Engineering Conference (RE), pp. 11–20. IEEE (2017)
- (15) Hadi, M.A., Fard, F.H.: Evaluating pre-trained models for user feedback analysis in software engineering: A study on classification of app-reviews. arXiv preprint arXiv:2104.05861 (2021)
- (16) Henao, P.R., Fischbach, J., Spies, D., Frattini, J., Vogelsang, A.: Transfer learning for mining feature requests and bug reports from tweets and app store reviews. In: 2021 IEEE 29th International Requirements Engineering Conference Workshops (REW), pp. 80–86. IEEE (2021)
- (17) Iacob, C., Harrison, R., Faily, S.: Online reviews as first class artifacts in mobile app development. In: International Conference on Mobile Computing, Applications, and Services, pp. 47–53. Springer (2013)
- (18) Kassab, M., Neill, C., Laplante, P.: State of practice in requirements engineering: contemporary data. Innovations in Systems and Software Engineering 10(4), 235–241 (2014)
- (19) Lim, S., Henriksson, A., Zdravkovic, J.: Data-driven requirements elicitation: A systematic literature review. SN Computer Science 2(1), 1–35 (2021)
- (20) Lin, D., Bezemer, C.P., Zou, Y., Hassan, A.E.: An empirical study of game reviews on the steam platform. Empirical Software Engineering 24(1), 170–207 (2019)
- (21) Maalej, W., Kurtanović, Z., Nabil, H., Stanik, C.: On the automatic classification of app reviews. Requirements Engineering 21(3), 311–331 (2016)
- (22) Magalhães, C., Sardinha, A., Araújo, J.: Mare: an active learning approach for requirements classification. In: RE@Next! track of the 29th IEEE International Requirements Engineering Conference (2021)
- (23) Nayebi, M., Cho, H., Ruhe, G.: App store mining is not enough for app improvement. Empirical Software Engineering 23(5), 2764–2794 (2018)
- (24) Nuseibeh, B., Easterbrook, S.: Requirements engineering: a roadmap. In: Proceedings of the Conference on the Future of Software Engineering, pp. 35–46 (2000)
- (25) Pagano, D., Maalej, W.: User feedback in the appstore: An empirical study. In: 2013 21st IEEE international requirements engineering conference (RE), pp. 125–134. IEEE (2013)
- (26) Panichella, S., Di Sorbo, A., Guzman, E., Visaggio, C.A., Canfora, G., Gall, H.C.: Ardoc: App reviews development oriented classifier. In: Proceedings of the 2016 24th ACM SIGSOFT international symposium on foundations of software engineering, pp. 1023–1027 (2016)
- (27) Radliński, Ł.: Empirical analysis of the impact of requirements engineering on software quality. In: International Working Conference on Requirements Engineering: Foundation for Software Quality, pp. 232–238. Springer (2012)
- (28) Rempel, P., Mäder, P.: Preventing defects: The impact of requirements traceability completeness on software quality. IEEE Transactions on Software Engineering 43(8), 777–797 (2016)
- (29) Sanh, V., Debut, L., Chaumond, J., Wolf, T.: Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 (2019)
- (30) Scalabrino, S., Bavota, G., Russo, B., Di Penta, M., Oliveto, R.: Listening to the crowd for the release planning of mobile apps. IEEE Transactions on Software Engineering 45(1), 68–86 (2017)
- (31) Stanik, C., Haering, M., Maalej, W.: Classifying multilingual user feedback using traditional machine learning and deep learning. In: 2019 IEEE 27th International Requirements Engineering Conference Workshops (REW), pp. 220–226. IEEE (2019)
- (32) Sultan, M.A., Bethard, S., Sumner, T.: Back to basics for monolingual alignment: Exploiting word similarity and contextual evidence. Transactions of the Association for Computational Linguistics 2, 219–230 (2014)
- (33) Tizard, J., Wang, H., Yohannes, L., Blincoe, K.: Can a conversation paint a picture? mining requirements in software forums. In: 2019 IEEE 27th International Requirements Engineering Conference (RE), pp. 17–27. IEEE (2019)
- (34) Williams, G., Mahmoud, A.: Mining twitter feeds for software user requirements. In: 2017 IEEE 25th International Requirements Engineering Conference (RE), pp. 1–10. IEEE (2017)
- (35) Yin, W., Hay, J., Roth, D.: Benchmarking zero-shot text classification: Datasets, evaluation and entailment approach. arXiv preprint arXiv:1909.00161 (2019)