Statistically significant – but who cares?
Knowing how much time is needed to fix a bug, already when the bug report is submitted, could greatly support software maintenance – what a dream for project managers! A previous study suggests that clustering bug reports based on textual similarity could provide such a rough estimate. We replicate the results, but also show that statistical significance by no means guarantees actionable estimation support.
In this journal paper we explore unsupervised machine learning for predicting how much time is needed to fix a bug. There are never enough resources to deal with all incoming bug reports, so the maintenance effort must be prioritized in some way. Critical bugs should obviously always be taken care of, but for the large majority of issues the expected resolution time is fundamental input to the planning.
We present a replication of a study by Uzma Raja, but we also go beyond the original work. Raja argues that “all complaints are not created equal”, and shows that by semi-automatically creating clusters of bug reports, based entirely on the text of their descriptions, bugs end up in clusters with significantly different average resolution times. She then speculates that these differences could be used to make early predictions of the effort needed to fix incoming bugs.
A conceptual replication – sporting more automation
Our work is no exact replication; instead, we change the approach to fully automated k-means clustering – a really simple technique. Also, we study three large datasets (one proprietary and two open source) as opposed to Raja’s five medium-sized datasets from open source projects. Although we made considerable changes to the experimental setup, we still obtain clusters with statistically significant differences in resolution time. This indicates that Raja’s findings are quite stable – simply putting bugs in one of 4-6 clusters based on text appears to lead to differences not created by chance.
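To make the idea concrete, here is a minimal sketch of such a fully automated pipeline in Python, assuming scikit-learn and SciPy – this is illustrative only, not the exact tooling from the paper, and the column names (description, resolution_time) are hypothetical placeholders for whatever an issue tracker exports. The Kruskal-Wallis test is one common way to check whether resolution times differ significantly across clusters.

```python
# Illustrative sketch only – not the exact tooling from the paper.
# Assumes a DataFrame with hypothetical columns "description" and
# "resolution_time" (e.g., in days).
import pandas as pd
from scipy.stats import kruskal
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_bug_reports(reports: pd.DataFrame, n_clusters: int = 5) -> pd.DataFrame:
    """Assign each bug report to a cluster based only on its textual description."""
    tfidf = TfidfVectorizer(stop_words="english", max_features=5000)
    vectors = tfidf.fit_transform(reports["description"])
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    reports = reports.copy()
    reports["cluster"] = kmeans.fit_predict(vectors)
    return reports

def clusters_differ(reports: pd.DataFrame) -> float:
    """Return the p-value of a Kruskal-Wallis test on resolution times per cluster."""
    groups = [g["resolution_time"].to_numpy()
              for _, g in reports.groupby("cluster")]
    _, p_value = kruskal(*groups)
    return p_value
```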
Ok, we have significant differences… So far so good. But what can we use that for? The second part of the study evaluates how accurate the predictions actually get in a simulated setting. For each of the three datasets under study, we consider ten points in time. At each point, most bugs are closed, i.e., they have a known resolution time, and we calculate the clusters’ average resolution times. For all bugs that remain, i.e., the still open bugs, we use the average resolution time of their corresponding cluster as the prediction. Then we check the results – and notice that the prediction accuracy is really bad. It doesn’t work at all! The approach is evidently far too naïve; putting bugs in only 4-6 different clusters is too coarse-grained. Sure, the clusters are significantly different… But the differences are not actionable at all!
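The simulation idea can be sketched as follows, building on the hypothetical DataFrame from the snippet above and adding an assumed closed_at timestamp column. The mean absolute error below is just one illustrative accuracy measure; the paper reports a more thorough evaluation.

```python
import numpy as np
import pandas as pd

def simulate_prediction(reports: pd.DataFrame, cutoff: pd.Timestamp) -> float:
    """Predict the DRT of bugs still open at `cutoff` as the average
    resolution time of their cluster among already-closed bugs."""
    closed = reports[reports["closed_at"] <= cutoff]
    still_open = reports[reports["closed_at"] > cutoff]  # outcome known in hindsight

    cluster_means = closed.groupby("cluster")["resolution_time"].mean()
    predicted = still_open["cluster"].map(cluster_means)
    # Clusters without any closed bug yet fall back to the global mean.
    predicted = predicted.fillna(closed["resolution_time"].mean())

    actual = still_open["resolution_time"]
    return float(np.mean(np.abs(predicted - actual)))  # mean absolute error
```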
Implications for Research
- We confirm that clustering bug reports based on textual content leads to clusters of bugs with significantly different resolution times – also with a truly simple, fully automated approach.
- The results scale to large datasets – we study 20,000 proprietary bugs and 9,000+8,000 open source bugs.
- Our replication was successful, but our attempt at using the clusters for prediction led to negative results – we highlight the importance of going beyond statistical significance.
Implications for Practice
- Automatically clustering bug reports can be used to identify patterns, but more research is needed before it can be practically used to predict resolution times.
Saïd Assar, Markus Borg, and Dietmar Pfahl. Using Text Clustering to Predict Defect Resolution Time: A Conceptual Replication and an Evaluation of Prediction Accuracy, Empirical Software Engineering, 21(4), pp. 1437-1475, 2016. (link, preprint)
Abstract
Defect management is a central task in software maintenance. When a defect is reported, appropriate resources must be allocated to analyze and resolve the defect. An important issue in resource allocation is the estimation of Defect Resolution Time (DRT). Prior research has considered different approaches for DRT prediction exploiting information retrieval techniques and similarity in textual defect descriptions. In this article, we investigate the potential of text clustering for DRT prediction. We build on a study published by Raja (2013) which demonstrated that clusters of similar defect reports had statistically significant differences in DRT. Raja’s study also suggested that this difference between clusters could be used for DRT prediction. Our aims are twofold: First, to conceptually replicate Raja's study and to assess the repeatability of its results in different settings; Second, to investigate the potential of textual clustering of issue reports for DRT prediction with focus on accuracy. Using different data sets and a different text mining tool and clustering technique, we first conduct an independent replication of the original study. Then we design a fully automated prediction method based on clustering with a simulated test scenario to check the accuracy of our method. The results of our independent replication are comparable to those of the original study and we confirm the initial findings regarding significant differences in DRT between clusters of defect reports. However, the simulated test scenario used to assess our prediction method yields poor results in terms of DRT prediction accuracy. Although our replication confirms the main finding from the original study, our attempt to use text clustering as the basis for DRT prediction did not achieve practically useful levels of accuracy.