When tossing gets out of hand – automate!
In large development projects, the sheer volume of incoming bug reports poses a challenge. Each individual bug needs to be resolved, but who should do it? Whether a technical manager assigns bugs to teams (pushing) or the teams take ownership of bugs themselves (pulling), finding proper bug assignments can save much effort in software maintenance. Furthermore, in many development projects “bug tossing” is a major problem – bugs are reassigned multiple times before ending up with the developer who eventually resolves them.
In this paper we present an automated bug assignment solution based on machine learning. By analyzing large amounts of historical bugs, a classifier can find patterns in the set of bug reports. Maybe critical bugs reported from a site in China are always fixed by a specific team? What if bugs reported on Fridays tend to be closed by team leaders? Or could it be that bug reports containing the terms “memory”, “crash”, and “heap” are typically resolved by a specialist technical team in Germany? Machine learning can find such patterns for you.
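The kind of term-based pattern hinted at above can be illustrated with a tiny Naive Bayes sketch. Note that the reports, terms, and team names below are invented for illustration – the paper's actual feature set and learners differ:

```python
from collections import Counter, defaultdict
import math

# Toy training data: (terms in bug report, team that resolved it).
# All reports and team names are hypothetical.
reports = [
    (["memory", "crash", "heap"], "mem-team-de"),
    (["heap", "leak", "crash"], "mem-team-de"),
    (["ui", "button", "layout"], "ui-team"),
    (["layout", "render", "ui"], "ui-team"),
]

def train(data):
    """Multinomial Naive Bayes with add-one smoothing."""
    team_counts = Counter(team for _, team in data)
    term_counts = defaultdict(Counter)
    vocab = set()
    for terms, team in data:
        term_counts[team].update(terms)
        vocab.update(terms)
    return team_counts, term_counts, vocab, len(data)

def predict(model, terms):
    """Return the team with the highest posterior log-probability."""
    team_counts, term_counts, vocab, n = model
    best, best_lp = None, float("-inf")
    for team, count in team_counts.items():
        lp = math.log(count / n)  # prior: how often this team resolves bugs
        total = sum(term_counts[team].values())
        for t in terms:
            lp += math.log((term_counts[team][t] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = team, lp
    return best

model = train(reports)
print(predict(model, ["crash", "memory"]))  # -> mem-team-de
```

A new report mentioning “crash” and “memory” is routed to the (hypothetical) memory specialists, because those terms dominate the reports that team has historically resolved.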
This paper happened by chance. For me it started with traveling all the way to California for ICSE’13 – just to run into another Swede also doing machine learning on bug reports… I had never met Leif before, but we immediately realized we shared many interests. A great conference catch! Leif was at ICSE for the doctoral symposium, and his presentation was partly based on a paper describing preliminary results with ensemble-based machine learning. It didn’t take long before we decided to do a more thorough study together, and to analyze bug reports collected from both his and my industry partners – a great way to test generalizability.
Exploring everything Weka has to offer
From a technical perspective, the main contribution of this paper is that we evaluate a long list of classifiers – pretty much everything available in Weka. We continue by combining individual classifiers in ensembles using stacked generalization (a.k.a. stacking), a state-of-the-art approach to ensemble learning in machine learning. Our ensemble classifiers scale to industrial applications with large amounts of bug reports, and the ensembles outperform individual classifiers.
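The core idea of stacked generalization is to train base classifiers on cross-validation folds and feed their out-of-fold predictions, as features, to a meta-classifier. A minimal sketch of that structure follows – this is not the paper's Weka setup; the 1-NN base learners and the frequency-table meta-learner are toy stand-ins:

```python
from collections import Counter, defaultdict

def nn_on_feature(f):
    """Factory for a 1-nearest-neighbour learner that only looks at feature f."""
    def fit(X, y):
        def predict(x):
            i = min(range(len(X)), key=lambda j: abs(X[j][f] - x[f]))
            return y[i]
        return predict
    return fit

def table_meta(meta_X, meta_y):
    """Toy meta-learner: majority label for each tuple of base predictions."""
    table = defaultdict(Counter)
    for mx, my in zip(meta_X, meta_y):
        table[tuple(mx)][my] += 1
    fallback = Counter(meta_y).most_common(1)[0][0]
    def predict(mx):
        votes = table.get(tuple(mx))
        return votes.most_common(1)[0][0] if votes else fallback
    return predict

def stack_train(base_learners, meta_fit, X, y, k=3):
    """Stacked generalization: level-1 features are the base learners'
    out-of-fold predictions, so the meta-learner never sees predictions
    made on a base learner's own training data."""
    n = len(X)
    meta_X, meta_y = [], []
    for i in range(k):
        tr = [j for j in range(n) if j % k != i]
        te = [j for j in range(n) if j % k == i]
        fitted = [fit([X[j] for j in tr], [y[j] for j in tr])
                  for fit in base_learners]
        for j in te:
            meta_X.append([clf(X[j]) for clf in fitted])
            meta_y.append(y[j])
    fitted_all = [fit(X, y) for fit in base_learners]  # refit on all data
    meta = meta_fit(meta_X, meta_y)
    return lambda x: meta([clf(x) for clf in fitted_all])

# Toy data: feature 0 separates the classes, feature 1 is mostly noise.
X = [(0.1, 5.0), (0.2, 1.0), (0.9, 5.1), (0.8, 0.9), (0.15, 2.0), (0.85, 2.1)]
y = ["team-a", "team-a", "team-b", "team-b", "team-a", "team-b"]
clf = stack_train([nn_on_feature(0), nn_on_feature(1)], table_meta, X, y)
print(clf((0.12, 2.15)))  # -> team-a
```

The meta-learner effectively learns which base classifiers to trust in which situations – here, that the feature-0 learner is the reliable one.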
From the perspective of an empiricist, the paper is an example of how to run a series of controlled experiments with rigor. First, we study more than 50,000 bug reports from five projects collected from two companies in different domains. Second, the experiment setup goes beyond traditional 10-fold cross validation, as we complement it with setups that keep the order of the bug reports (i.e., the time dimension) – and we show that doing so is really important! For some projects, the classification accuracy deteriorates considerably as old bug reports are used for training. Our conclusions are that the accuracy of the classifiers must be constantly monitored when deployed in industry – when it drops, old training data should be removed – and that evaluations of automated bug assignment must analyze how the proposed classifiers perform over time.
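The difference between the two evaluation setups can be sketched as follows. Standard cross-validation shuffles away the time dimension, while a sliding-window split always trains on the past and tests on the future; the fold counts and window sizes here are invented, not the ones from the experiments:

```python
import random

def cross_validation_splits(n, k=10, seed=0):
    """Standard k-fold CV: shuffled folds ignore the time dimension."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [(idx[:i * n // k] + idx[(i + 1) * n // k:],  # train
             idx[i * n // k:(i + 1) * n // k])           # test
            for i in range(k)]

def sliding_window_splits(n, train_size, test_size):
    """Time-ordered evaluation, assuming indices 0..n-1 are sorted by
    submission date: train on a window of past reports, test on the
    reports that follow, then slide the window forward."""
    splits = []
    start = 0
    while start + train_size + test_size <= n:
        train = list(range(start, start + train_size))
        test = list(range(start + train_size, start + train_size + test_size))
        splits.append((train, test))
        start += test_size
    return splits

for train, test in sliding_window_splits(10, train_size=4, test_size=2):
    print(min(test) > max(train))  # training data always precedes test data
```

With the time-ordered splits, the accuracy curve over successive windows reveals exactly the deterioration that shuffled cross-validation hides.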
Implications for Research
- The most comprehensive study on bug assignment in proprietary contexts – more than 50,000 bug reports in the study, from two different companies.
- n-fold cross-validation might show unrealistically good results – the time dimension must also be considered.
- The related work section presents the best overview of empirical studies on automated bug assignment available.
Implications for Practice
- Automated bug assignment is feasible. Bugs can be assigned in an instant with roughly the same accuracy as the much slower manual process.
- While training might take days, actually using the classification model is very fast.
- At least 2,000 bug reports appear to be needed in the training set for machine learning to be feasible – for projects with fewer bugs, the potential is lower.
Leif Jonsson, Markus Borg, David Broman, Kristian Sandahl, Sigrid Eldh, and Per Runeson. Automated Bug Assignment: Ensemble-based Machine Learning in Large Scale Industrial Contexts, Empirical Software Engineering, 21(4), pp. 1533–1578, 2015. (link, preprint)
Abstract
Context: Bug report assignment is an important part of software maintenance. In particular, incorrect assignments of bug reports to development teams can be very expensive in large software development projects. Several studies propose automating bug assignment techniques using machine learning in open source software contexts, but no study exists for large-scale proprietary projects in industry. Objective: The goal of this study is to evaluate automated bug assignment techniques that are based on machine learning classification. In particular, we study the state-of-the-art ensemble learner Stacked Generalization (SG) that combines several classifiers. Method: We collect more than 50,000 bug reports from five development projects from two companies in different domains. We implement automated bug assignment and evaluate the performance in a set of controlled experiments. Results: We show that SG scales to large scale industrial application and that it outperforms the use of individual classifiers for bug assignment, reaching prediction accuracies from 50% to 90% when large training sets are used. In addition, we show how old training data can decrease the prediction accuracy of bug assignment. Conclusions: We advise industry to use SG for bug assignment in proprietary contexts, using at least 2,000 bug reports for training. Finally, we highlight the importance of not solely relying on results from cross-validation when evaluating automated bug assignment.