Up up and away – results from high above
Many researchers have spent hours improving the precision and recall of traceability recovery tools. Different information retrieval techniques have been tried, and loads of quantitative results have been published. If we take a step back – what has actually been achieved? In this study we aggregate all previous results and show just how much they depend on which dataset you decide to experiment with.
This paper is a direct spinoff from the systematic mapping of traceability recovery research later published in Empirical Software Engineering. It was, however, never intended to be a separate paper – we wanted the content to be part of the mapping study, but our EMSE reviewers did not approve of it. Too bad, I actually believe these results were what really made the literature survey interesting… And removing them “downgraded” our work from a systematic literature review (SLR) to a systematic mapping.
Our first idea with the literature survey was to 1) check which information retrieval (IR) approaches had been used for traceability recovery, and 2) report which IR approaches had performed best. Quite a normal idea for a literature survey, I would say – first a systematic mapping, then aggregating results in an SLR. The EMSE reviewers didn’t at all like our coarse-grained way of aggregating previous results – and I agree that we compare different animals… But I argue that sometimes interesting patterns emerge when you go for the bird’s-eye view. Our conclusions were quite critical of the previous research, which probably didn’t increase our chances of getting published. But the ESEM reviewers liked it better – at least as a short paper!
Aggregating previous conclusions
We present two perspectives on the body of previous empirical work – based on the comprehensive collection of papers found in our systematic review of the literature. The least controversial approach comes first: an aggregation of all previous comparative studies. We simply extracted previous authors’ conclusions – either they found that one IR approach outperformed another, or they got inconclusive results – like the 2010 ICPC paper by Oliveto et al., “On the Equivalence of Information Retrieval Methods for Automated Traceability Link Recovery”. Oliveto et al.’s paper is closely related, but they ran their own experiments, whereas we aggregated other researchers’ results – including theirs. The figure below shows our results – arrows point at the approach performing better, undirected edges represent inconclusive results, and edge weights denote the number of studies. It is obvious that we cannot simply announce a winner.
Mapping to previously published quality intervals
The second perspective provided by our paper is surely more debatable – but quite interesting, I would say. When running IR experiments you get a Precision-Recall (P-R) “footprint”, showing what precision you get at different recall levels. There is a trade-off: to get high recall you must accept lower precision – and several papers argue that recall is more important. I prefer to talk about “actionable” IR results… Low precision can be alright as long as the user doesn’t have to sift through pages of search hits. Huffman Hayes et al. proposed P-R “quality intervals” for IR-based traceability recovery, as an attempt to “draw a line in the sand”. I think it was a nice idea, and we decided to map previous P-R footprints against her levels. The figure below shows the intervals excellent, good, and acceptable in the upper right corner. We’ve also plotted all previous P-R@10 results, i.e., the P-R footprints when considering the first 10 results from the traceability recovery tool.
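To make the P-R@10 footprint concrete, here is a minimal sketch of how such a data point is computed and mapped to a quality level. The candidate links are toy data, and the interval thresholds are placeholders in the spirit of Huffman Hayes et al. – not the actual values from their paper or ours.

```python
def precision_recall_at_k(ranked_links, true_links, k=10):
    """Precision and recall over the top-k ranked candidate links."""
    top_k = set(ranked_links[:k])
    hits = len(top_k & set(true_links))
    return hits / k, hits / len(true_links)


# Hypothetical quality intervals as (level, min recall, min precision).
# Placeholder numbers, checked from best to worst.
INTERVALS = [("excellent", 0.8, 0.5), ("good", 0.7, 0.3), ("acceptable", 0.6, 0.2)]


def classify(precision, recall):
    """Map a P-R point to the best quality level it satisfies."""
    for level, r_min, p_min in INTERVALS:
        if recall >= r_min and precision >= p_min:
            return level
    return "below acceptable"


# Toy tool output: ranked candidate trace links and the true link set.
ranked = ["req1-code3", "req1-code7", "req1-code2", "req1-code9"] + [f"x{i}" for i in range(6)]
truth = ["req1-code3", "req1-code2", "req1-code5"]

p, r = precision_recall_at_k(ranked, truth, k=10)  # one P-R@10 footprint point
```

Each evaluated dataset/IR-approach pair yields one such point, and the scatter of these points against the intervals is what the figure shows.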
Data dependency and industrial relevance
There are two interesting aspects in the figure. First, no P-R@10 results have reached the acceptable interval. Second, the P-R footprint is really dependent on the choice of dataset. The red circles show experimental results of different IR approaches evaluated on the same dataset – it appears the choice of dataset influences the results much more than the choice of IR technique! And we know from previous studies that most experiments were conducted on really small datasets, often originating from student projects. It makes me wonder about the validity of previous traceability research…
One could argue that P-R@10 is not the right footprint to compare, thus we also show similar results for other suggested thresholds: P-R@5, P-R@100, P-R @ cosine similarity >= 0.7, and precision at fixed recall levels. What we also did, probably the least appreciated by the EMSE reviewers, was to put ALL previous results, regardless of IR approach and dataset, in one big plot – presented on top of this post. A lot of data points, no details, truly a bird’s-eye view. But quite interesting still! We observe that there are a few results also in the ‘excellent’ and ‘good’ quality intervals, but the majority are definitely much worse. We also show an exponential trendline, suggesting that the best chances of reaching ‘acceptable’ results are at recall levels 0.6-0.7… But then again, who cares about these numbers? What we really want to know is whether traceability recovery tools help developers in industry.
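The two families of cut-off strategies mentioned above – constant cut point (P-R@k) and similarity threshold (e.g. cosine similarity >= 0.7) – can be sketched as follows. The ranked list of (link, similarity) pairs is made up for illustration.

```python
def cut_constant(ranked, k):
    """Constant cut point: keep the top-k candidates (the P-R@k strategy)."""
    return [link for link, _ in ranked[:k]]


def cut_threshold(ranked, min_sim):
    """Similarity threshold: keep candidates with similarity >= min_sim."""
    return [link for link, sim in ranked if sim >= min_sim]


# Toy ranked output from an IR-based traceability recovery tool,
# as (candidate link, cosine similarity) pairs. Numbers are invented.
ranked = [("a", 0.91), ("b", 0.74), ("c", 0.69), ("d", 0.40), ("e", 0.12)]
```

Whichever cut-off is chosen, precision and recall are then computed over the kept candidates – which is why the paper reports footprints for several cut-off strategies rather than betting on one.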
Implications for Research
- Results from traceability recovery tools depend greatly on the dataset under study.
- Most studies on traceability recovery do not deliver acceptable P-R according to Huffman Hayes’ quality intervals.
Implications for Practice
- Whether or not traceability recovery tools would work for you really depends on your context.
Markus Borg and Per Runeson. In Proc. of the 7th International Symposium on Empirical Software Engineering and Measurement, pp. 243-246, 2013. (link, preprint)
Background. Several researchers have proposed creating after-the-fact structure among software artifacts using trace recovery based on Information Retrieval (IR). Due to significant variation points in previous studies, results are not easily aggregated. Aim. We aim at an overview picture of the outcome of previous evaluations. Method. Based on a systematic mapping study, we perform a synthesis of published research. Results. Our synthesis shows that there is no empirical evidence that any IR model consistently outperforms another. We also show a strong dependency between the Precision and Recall (P-R) values and the input datasets. Finally, our mapping of P-R values on the possible output space highlights the difficulty of recovering accurate trace links using naïve cut-off strategies. Conclusion. Based on our findings, we stress the need for empirical evaluations beyond the basic P-R 'race'.