Change, ImpRec, Commit!
Analyzing how a code change impacts a large software-intensive system is hard. In safety-critical development, the analysis must be documented, though – what if all this documented knowledge could be reused to help other developers? We present an empirical evaluation of ImpRec, a recommendation system for change impact analysis.
Finally the bug fix is ready for commit! You have just taken care of that rare corner case bug – a critical bug, although it could only manifest if all celestial constellations are against you at the same time. The solution is only one commit away now, but hey – the source code is SIL2. Before making any changes you must do a formal change impact analysis. And specifying how the system is affected by the fix is not simple… Thousands and thousands of artifacts, two decades of legacy code, and a fair share of technical debt. Not easy at all!
This study wraps up a considerable part of my PhD – and the publication in TSE makes it kind of the crown jewel of the thesis. The paper addresses my overall PhD topic on aligning requirements engineering and testing by discussing an artifact-oriented solution in the context of change impact analysis. We describe a tool, ImpRec, whose implementation combines ideas from my previous publications on trace recovery using information retrieval (using Apache Lucene to identify similar bugs), network analysis in bug trackers, and recommendation systems. Moreover, the paper presents an evaluation based on both laboratory experiments and tool deployment in the field. A lot of work comes together in this capstone publication!
ImpRec – A recommendation system for impact analysis
The technical description of ImpRec has already been published in a book chapter on recommendation systems for issue management – in the form of a side-by-side comparison with Cubranic’s quite similar tool Hipikat. However, the user interface evolved considerably after the book chapter was published; see the figure below. In the upper left corner, there is a text field for entering queries, e.g., the title of the bug currently under investigation. On the left side are the most similar bugs, as identified by Apache Lucene. Finally, on the right side, ImpRec shows ranked recommendations of specific artifacts that should be investigated further in the change impact analysis – based on historical data, these are the artifacts most likely to be affected by the change. Note that while the ImpRec approach is general (i.e., harvesting previous trace links to recommend new impact), the tool is tailored for one case company.
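To make the retrieval step concrete, here is a minimal sketch of how a bug title can be matched against an index of historical issue reports with Apache Lucene, assuming a classic Lucene API (roughly versions 5–8). The index path, field names ("title", "issueId"), and the query text are assumptions for the example, not ImpRec’s actual schema.

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class SimilarBugSearch {
    public static void main(String[] args) throws Exception {
        // Open an existing index of historical issue reports.
        // Index location and field names are hypothetical for this sketch.
        try (DirectoryReader reader = DirectoryReader.open(
                FSDirectory.open(Paths.get("issue-index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);

            // Parse the title of the bug under investigation as a free-text query.
            QueryParser parser = new QueryParser("title", new StandardAnalyzer());
            Query query = parser.parse("watchdog timeout during controller switchover");

            // Retrieve the top-10 most similar historical bugs, ranked by score.
            TopDocs topDocs = searcher.search(query, 10);
            for (ScoreDoc hit : topDocs.scoreDocs) {
                Document doc = searcher.doc(hit.doc);
                System.out.printf("%.3f  %s  %s%n",
                        hit.score, doc.get("issueId"), doc.get("title"));
            }
        }
    }
}
```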
We also implemented a feedback feature and tracking of all the user’s clicks in ImpRec – obviously as a way to support the empirical evaluation of the tool. The feedback is based on the check-boxes listed before each item in the figure above. We encouraged the users to select the useful recommendations, which we later used to analyze ImpRec’s accuracy. This mechanism could also be part of a future machine learning scheme; ImpRec could definitely be made to learn from the feedback.
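As a rough illustration of how such checkbox feedback can feed into an accuracy analysis, the sketch below aggregates feedback records into a simple confirmation rate. The record structure and field names are hypothetical, not taken from ImpRec’s logging format.

```java
import java.util.List;

// Hypothetical feedback record: whether a user marked a recommended artifact as useful.
record Feedback(String issueId, String recommendedArtifact, boolean markedUseful) {}

public class FeedbackAnalysis {
    // Fraction of delivered recommendations that users confirmed as useful –
    // one simple signal that can be computed from ImpRec-style feedback logs.
    static double confirmationRate(List<Feedback> log) {
        if (log.isEmpty()) return 0.0;
        long useful = log.stream().filter(Feedback::markedUseful).count();
        return (double) useful / log.size();
    }

    public static void main(String[] args) {
        List<Feedback> log = List.of(
                new Feedback("BUG-1234", "REQ-17", true),
                new Feedback("BUG-1234", "TC-042", false),
                new Feedback("BUG-1240", "REQ-17", true));
        System.out.printf("Confirmation rate: %.2f%n", confirmationRate(log));
    }
}
```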
Evaluation as it should be done?
The major contribution of the paper is empirical – I really think we show a good example of how to run an evaluation of a software engineering tool. We focused on two of the Dimensions and Metrics for Evaluating Recommendation Systems proposed by Avazpour et al.: correctness and utility. Some aspects of the evaluation could certainly be improved even further, but in general I’m very happy with what we managed to do. First, we iteratively developed ImpRec in dialog with engineers at the case company. We never scaled down the number of artifacts; from the start we worked with all available data – thus ImpRec truly scales to realistic settings, as we’ve never tried anything smaller!
Our two-phase evaluation started with a study in a controlled lab environment. We had extracted more than 10 years’ worth of validated change impact – our approach was to investigate how good ImpRec was at recommending the impact we knew was correct. To evaluate the approach from several perspectives, we didn’t only look at set-based precision and recall measures; we also measured F1, F2, F5, F10, and Mean Average Precision – all reported at cut-off levels from 1 to 50 recommendations. While this part of the evaluation is already more realistic than what is normally reported for SE tools, we also wanted to add data from the field.
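For readers unfamiliar with the cut-off based measures, here is a minimal sketch, assuming a ranked list of recommended artifact IDs and a gold-standard set of artifacts known to be impacted. The class and method names are illustrative, not taken from ImpRec’s evaluation code.

```java
import java.util.List;
import java.util.Set;

public class RankedMetrics {

    // Precision@k: fraction of the top-k recommendations that are truly impacted.
    static double precisionAt(List<String> ranked, Set<String> relevant, int k) {
        int hits = 0;
        for (int i = 0; i < Math.min(k, ranked.size()); i++) {
            if (relevant.contains(ranked.get(i))) hits++;
        }
        return (double) hits / k;
    }

    // Recall@k: fraction of the truly impacted artifacts found among the top-k.
    static double recallAt(List<String> ranked, Set<String> relevant, int k) {
        int hits = 0;
        for (int i = 0; i < Math.min(k, ranked.size()); i++) {
            if (relevant.contains(ranked.get(i))) hits++;
        }
        return relevant.isEmpty() ? 0.0 : (double) hits / relevant.size();
    }

    // F_beta@k: weighted harmonic mean of precision and recall; beta > 1
    // (e.g., F2, F5, F10) puts increasing emphasis on recall.
    static double fBetaAt(List<String> ranked, Set<String> relevant, int k, double beta) {
        double p = precisionAt(ranked, relevant, k);
        double r = recallAt(ranked, relevant, k);
        double b2 = beta * beta;
        return (p + r == 0) ? 0.0 : (1 + b2) * p * r / (b2 * p + r);
    }

    // Average precision for one query: mean of precision@i over the ranks i
    // where a relevant artifact appears. Mean Average Precision (MAP) is this
    // value averaged over all issue reports in the evaluation set.
    static double averagePrecision(List<String> ranked, Set<String> relevant, int cutOff) {
        double sum = 0.0;
        int hits = 0;
        for (int i = 0; i < Math.min(cutOff, ranked.size()); i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
                sum += (double) hits / (i + 1);
            }
        }
        return relevant.isEmpty() ? 0.0 : sum / Math.min(relevant.size(), cutOff);
    }
}
```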
To truly get some insights into the utility of ImpRec, we deployed the tool within two development teams. First in Sweden – then, while waiting for results, we fine-tuned ImpRec’s parameters using our TuneR framework. I then traveled to India for a month to deploy the tool within a second team as well, i.e., we did a multi-unit case study. The field study combined quantitative data from the ImpRec feedback mechanism with a more important qualitative perspective – we interviewed the study participants before and after they used the tool. To give some structure to the utility discussions, we used the breakpoints and quality levels provided by the QUPER model – definitely a novel use case for that model, but I think it was helpful!
Unprecedented strength of evidence
We designed a rigorous study, and we claim that the strength of its evidence goes far beyond what is normally the case. Thus, we can actually back up our claims quite a lot this time. The question is… what did we find out? What can we now emphatically say? Regarding correctness, based on our evaluation in the lab, we claim that ImpRec correctly identified 40% of the previous change impact within the top-10 recommendations. Regarding utility, we say that this level of correctness has passed the utility breakpoint, i.e., the developers recognize the value of using the tool. Finally, we conclude that ImpRec is particularly appreciated by project newcomers – the less you know about the project’s information landscape, the more you benefit from “walking in the steps of previous engineers”.
Implications for Research
- An example of how to evaluate a tool both in the lab and through a field study – combining quantitative and qualitative research in the same paper.
- Assessing tool utility through interviews with users supported by the QUPER model.
- Introducing new tool support in a safety-critical context is not straightforward – more research is needed to understand how to adapt processes when new tools are deployed.
Implications for Practice
- There is significant potential to do actionable analytics in issue trackers – data mining historical bug reports can reveal much knowledge, e.g., for change impact analysis.
- As change impact analyses must be documented anyway, make sure that the format used allows subsequent analysis – an unstructured piece of text is hard to use (see the sketch after this list).
- An inadequate bug tracker impedes change impact analysis – use a system with proper search and navigation, i.e., help developers find relevant bugs from the past.
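To illustrate the second point, here is a minimal sketch of what a structured, machine-readable impact analysis entry could look like. The fields are loosely modeled on typical impact analysis template questions and are assumptions for the example, not the case company’s actual template.

```java
import java.util.List;

// Illustrative structure for one change impact analysis entry.
record ImpactAnalysis(
        String issueId,
        List<String> impactedRequirements,
        List<String> impactedDesignArtifacts,
        List<String> testCasesToReExecute,
        String rationale) {}

public class StructuredCia {
    public static void main(String[] args) {
        // With explicit artifact IDs instead of free text, past analyses can be
        // mined later – e.g., to build the trace network ImpRec recommends from.
        ImpactAnalysis cia = new ImpactAnalysis(
                "BUG-1234",
                List.of("REQ-17", "REQ-85"),
                List.of("DD-402"),
                List.of("TC-042", "TC-113"),
                "Corner case in switchover logic; no safety function affected.");
        System.out.println(cia);
    }
}
```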
Markus Borg, Krzysztof Wnuk, Björn Regnell, and Per Runeson. Supporting Change Impact Analysis Using a Recommendation System: An Industrial Case Study in a Safety-Critical Context, IEEE Transactions on Software Engineering, 43(7), pp. 675-700, 2017. (link, preprint, data, code)
Abstract
Change Impact Analysis (CIA) during software evolution of safety-critical systems is a labor-intensive task. Several authors have proposed tool support for CIA, but very few tools were evaluated in industry. We present a case study on ImpRec, a recommendation System for Software Engineering (RSSE), tailored for CIA at a process automation company. ImpRec builds on assisted tracing, using information retrieval solutions and mining software repositories to recommend development artifacts, potentially impacted when resolving incoming issue reports. In contrast to the majority of tools for automated CIA, ImpRec explicitly targets development artifacts that are not source code. We evaluate ImpRec in a two-phase study. First, we measure the correctness of ImpRec's recommendations by a simulation based on 12 years' worth of issue reports in the company. Second, we assess the utility of working with ImpRec by deploying the RSSE in two development teams on different continents. The results suggest that ImpRec presents about 40% of the true impact among the top-10 recommendations. Furthermore, user log analysis indicates that ImpRec can support CIA in industry, and developers acknowledge the value of ImpRec in interviews. In conclusion, our findings show the potential of reusing traceability associated with developers' past activities in an RSSE.