Do Better IR Tools Improve the Accuracy of Engineers’ Traceability Recovery?

Tool output 5% better – but does it really matter?

Numerous papers report that their trace recovery implementations improve the state of the art by a few percent. But if you actually let humans interact with the improved tool output, do the improvements spill over to human performance? We designed an experiment to explore this question.

This paper was my first ever workshop publication, and kind of a milestone in my PhD studies. I was quite reluctant to publish during my first years; I didn’t really see anything worth sharing. By getting this submission accepted, my attitude toward publications changed – I’ve had more than enough paper ideas since then! No more reluctance.

The paper originates in a PhD course I took on experiments in software engineering, given in Lund by Martin Höst and Dietmar Pfahl. As part of the course, all participants designed an experiment matching their PhD studies. The most feasible experimental designs were selected, and we conducted them within our small class. Mine was chosen together with an experiment by Ghazi comparing exploratory testing vs. test-case based testing – I believe it eventually led to this publication.

The overall idea of this study was to question the value of limited tool improvement in trace recovery research. Does it really matter if a tool performs 5% better on a specific dataset in the lab? What happens when you add humans in the traceability loop, a question raised already in a 2005 TEFSE paper by Huffman Hayes? Since then, a number of experiments with student subjects have been published, but they have compared completing tasks with a trace recovery tool vs. having no tool support at all. I wanted to see whether subjects performed better when supported by a slightly more accurate tool – or whether, as I expected, all such effects would drown in human factors.

Another traceability experiment on the NASA CM-1 dataset

I decided to use two tools on the well-studied CM-1 NASA dataset, ReqSimile (developed in Lund by Natt och Dag) and the somewhat more accurate RETRO (developed by Huffman Hayes et al.). For practical reasons, I wanted to design a pen-and-paper setup that could be conducted in a classroom setting. Thus, I designed a change impact analysis task – specifying which detailed requirements would be impacted by changes to a handful of high-level requirements. I provided all subjects with the required requirements documents and also printed tool output from either ReqSimile or RETRO, i.e., the treatment was the quality level of the printed tool output. All printouts had the same look, just text on paper, and the subjects obviously didn’t know whether they got ‘better’ or ‘worse’ support.

Precision-Recall curves for RETRO and ReqSimile. Also results from human-oriented experiments supported by the tools.

Eight subjects completed the task: four were supported by ReqSimile output and four by RETRO output. With so few subjects, no definite conclusions can be drawn, of course; the study is a pilot paving the way for a larger experiment, but there is an interesting pattern. The figure above suggests that subjects who were supported by slightly better trace recovery support (RETRO, shown in dark gray) do perform the change impact analysis task a little bit better than the ReqSimile subjects (shown in light gray) – a pair-wise comparison shows that the RETRO subjects are closer to the top right corner. The figure also shows how RETRO performed compared to ReqSimile on this specific task. Note that we highlight the cut-off point 10 (P-R@10, as recommended in our EASE’12 paper) with a large X, representing precision and recall when recommending 10 detailed requirements per high-level requirement – a reasonable number of recommendations for the human to manage.
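
To make the P-R@10 cut-off concrete, here is a minimal Python sketch of how precision, recall, and F-measure at a cut-off could be computed for a single high-level requirement. The function names and the requirement IDs are hypothetical, made up for illustration; they are not taken from RETRO, ReqSimile, or the CM-1 dataset.

```python
def precision_recall_at_k(ranked, relevant, k=10):
    """Precision and recall when only the top-k candidates are shown.

    ranked   -- candidate detailed requirements, best match first
    relevant -- set of detailed requirements that are truly impacted
    """
    top_k = ranked[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical tool output: 20 ranked candidates for one high-level requirement.
ranked = ["DR%d" % i for i in range(1, 21)]
# Hypothetical answer key: two correct links are in the top 10, two are not.
relevant = {"DR2", "DR5", "DR15", "DR30"}

p, r = precision_recall_at_k(ranked, relevant, k=10)
print(p, r)  # 0.2 0.5
print(f_measure(p, r))
```

The same functions can score a subject's answer sheet instead of tool output: pass the requirements the subject marked as `ranked` and set `k` to the length of that list.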

First statistical equivalence testing in software engineering

An interesting contribution of the paper is the way we analyzed the experimental results. While I knew the number of subjects was not enough for any statistically significant results, I decided to analyze the data properly anyway – a later full-scale experiment could then easily complement the initial analysis. The standard procedure in inferential statistics is to show that two treatments lead to different results. I instead wanted to statistically show that two treatments are practically equivalent, i.e., that a human completes a task with the same accuracy even if the trace recovery tool is slightly better.

Meet statistical equivalence testing. As so many times before, empirical software engineering can learn from medical research. In medicine it is used to answer questions such as “Is it worthwhile to treat patients with the more expensive drug?” and “Should we do mass vaccination against this virus?”. The first step in equivalence testing is to define a “zone of indifference”, i.e., a range of results that you consider practically equivalent. There is no statistical support for this step; you have to rely on your prior understanding – not too different from picking significance levels… 0.05, anyone? In my experiment, I decided that precision-recall values within an absolute 0.1 range should be considered equivalent, see the figure below. The zone of indifference is certainly a target for any critical reviewer – motivate your choice well in the paper.


The figure shows 90% confidence intervals of the differences between treatment RETRO and ReqSimile – positive values mean better recall, precision, and F-measure for RETRO. The zone of indifference clearly does not cover any of the confidence intervals – the results are, as expected, statistically inconclusive. But as shown by the mean values, the subjects supported by RETRO performed a little better… Maybe slightly better tracing tools are beneficial to developers after all? More research on return on investment of improved tracing accuracy is clearly needed!
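
The confidence-interval reasoning above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the analysis script from the paper: it assumes a pooled-variance (Student's t) interval for the difference in means, reads the zone of indifference as ±0.1 around zero, and uses invented per-subject scores. The function names and the small table of t quantiles are introduced here for illustration only.

```python
from math import sqrt
from statistics import mean, variance

# 95th-percentile quantiles of Student's t distribution, indexed by degrees
# of freedom. A two-sided 90% confidence interval uses these values.
T_CRIT_95 = {2: 2.920, 3: 2.353, 4: 2.132, 5: 2.015, 6: 1.943,
             7: 1.895, 8: 1.860, 9: 1.833, 10: 1.812}


def diff_ci_90(a, b):
    """90% confidence interval for mean(a) - mean(b), pooled variance."""
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / df
    std_err = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    margin = T_CRIT_95[df] * std_err
    diff = mean(a) - mean(b)
    return diff - margin, diff + margin


def classify(ci, delta=0.1):
    """Compare a confidence interval with the zone of indifference.

    'equivalent'   -- the whole interval lies inside [-delta, +delta]
    'different'    -- the interval excludes zero (and is not inside the zone)
    'inconclusive' -- anything else
    """
    low, high = ci
    if -delta <= low and high <= delta:
        return "equivalent"
    if low > 0 or high < 0:
        return "different"
    return "inconclusive"


# Invented precision scores for four subjects per treatment (NOT real data).
retro_scores = [0.30, 0.40, 0.50, 0.60]
reqsimile_scores = [0.20, 0.30, 0.40, 0.50]

ci = diff_ci_90(retro_scores, reqsimile_scores)
print(ci, classify(ci))
```

With only four subjects per group the interval is wide, so it neither fits inside the zone of indifference nor excludes zero. More subjects shrink the standard error, which is what a full-scale replication would need to reach either conclusion.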

Implications for Research

  • The first use of statistical equivalence testing in software engineering – turning traditional difference testing around.
  • An experimental design for exploring whether small tool improvements spill over to humans.
  • Results strengthen Cuddeback’s work from RE’10 and Dekhtyar’s statistical extension from RE’11 – developers tend to balance precision and recall when doing traceability tasks.

Implications for Practice

  • Better tracing tools appear to help developers do more accurate change impact analysis.

Markus Borg and Dietmar Pfahl. In Proc. of the International Workshop on Machine Learning Technologies in Software Engineering, pp. 27-34, Lawrence, KS, 2011.


Large-scale software development generates an ever-growing amount of information. Multiple research groups have proposed using approaches from the domain of information retrieval (IR) to recover traceability. Several enhancement strategies have been initially explored using the laboratory model of IR evaluation for performance assessment. We conducted a pilot experiment using printed candidate lists from the tools RETRO and ReqSimile to investigate how different quality levels of tool output affect the tracing accuracy of engineers. Statistical testing of equivalence, commonly used in medicine, has been conducted to analyze the data. The low number of subjects in this pilot experiment resulted neither in statistically significant equivalence nor difference. While our results are not conclusive, there are indications that it is worthwhile to investigate further into the actual value of improving tool support for semi-automatic traceability recovery. For example, our pilot experiment showed that the effect size of using RETRO versus ReqSimile is of practical significance regarding precision and F-measure. The interpretation of the effect size regarding recall is less clear. The experiment needs to be replicated with more subjects and on varying tasks to draw firm conclusions.