Requirements on Technical Debt: Dare to Specify Them!

This is a personal copy of a column in IEEE Software (Mar/Apr 2023). Republished with permission.

I’ve just switched jobs. After many years in academia and research institutions, I’m now back in industry. I work with software that ships. Even better, CodeScene develops a software intelligence tool that processes data from projects worldwide. For anyone inclined to research, this is certainly an interesting spot! The theme of this issue is methods and tools for the hybrid, post-Covid world. The CodeScene tool is one example that can help in large-scale work-from-home contexts by creating a shared understanding of the source code. In this column, I reintroduce myself as a CodeScene researcher and discuss requirements for technical debt management.—Markus Borg


Sarah hesitates before pressing the return key to submit the pull request. There is immense time pressure to get this new feature out. All tests are green, but she knows the implementation was hasty. What if it adds more long-term confusion than it should? As she reclines in her chair, Sarah has a hunch that she cut a corner to get here. But what does “good enough” code quality even mean?

Does Sarah trust her pull request?

Technical debt (TD) is an established concept in the software industry. Gone are the days when this metaphor needed an explanation among developers. Developers know that “quick and dirty” will come back and haunt you later. For TD, the interest is the extra effort you must dedicate to future development because of previous shortcuts. However, TD awareness rarely leads to action, and the debt is not paid off [1].

Heavy build-up of technical debt will slow you down.

Research primarily explains the lack of action as a managerial issue. I’m sure managers also understand the general idea of TD—metaphors are intended to help convey difficult concepts. Of course, quick fixes can be costly down the line. But hyperbolic discounting, as discussed in behavioral economics, is at play. Humans are biased to prefer small benefits today over larger rewards later. And what if you cannot even evaluate the later rewards? Paying off that TD is a card that might linger on the backlog forever unless the costs are made explicit.

Casting Light on TD Interest Rates

Organizations rarely tackle TD through explicit requirements. Established best practices for quality assurance typically focus on coding conventions and on specifying processes for code reviews and testing. Indeed, research shows that detecting maintainability issues is one of the significant benefits of code reviews [2]. Both formal checklist-based inspections, popular before the agile movement, and modern lightweight pull request reviews have been found to support maintainability. Still, turning “there shall be no remaining maintainability issues from code reviews” into a sweeping requirement brings little beyond symbolic value.

Disregarding TD will cost you big time. “Everyone” knows that, but how bad is it? Putting numbers on TD costs is nontrivial. Many have tried, but few have succeeded. Last year, we contributed to the evidence base with our “Code Red” paper [3]. Using CodeScene, we analyzed 39 large proprietary codebases and their accompanying issue trackers and found that resolving bugs and implementing new features in low-quality code required, on average, 124% more active development time. The resolution times in unhealthy code also involved much more uncertainty. Many studies, including ours, have also identified a correlation between code smells and defects [4].

Empirical studies on the costs of TD motivate action. As the evidence base grows, it becomes clearer how TD affects quality requirements such as maintainability and evolvability. Both these system qualities are business-critical from a lifecycle perspective. Actions as vital as TD payment deserve systematic approaches throughout a product’s life. Explicit requirements, with rationales expressed in extra development time and increased bugginess, can help product managers prioritize code improvements on the backlog. But how can we turn expectations on code quality into requirements that can be shared within an organization? And how can we measure code quality to make such targets actionable?

Quality, ISO Standards, and Requirements on Source Code

The requirements engineering community knows how to discuss software quality in light of the ISO/International Electrotechnical Commission (IEC) 25010 quality model [5]. Most of the model’s eight qualities relate to the operational behavior of the system. Thus, the corresponding measures can only be collected after the fact, when issues have already manifested. In contrast, the (publicly available) standard ISO/IEC 5055 Automated Source Code Quality Measures [6] addresses software quality issues that can be quantitatively measured and addressed before deployment.

ISO 5055 is a recent ISO standard, approved and published in March 2021. It was originally developed by the Consortium for Information and Software Quality (CISQ). CISQ orchestrated discussions among senior software engineers around the globe to seek ways to automatically measure ISO 25010 quality characteristics. Four of the eight are currently covered: 1) maintainability, 2) reliability, 3) security, and 4) performance efficiency. Given the TD focus of this text, we will concentrate on the first one.

ISO 5055 specifies a set of measures that static analysis tools can calculate. The quality measures rely on counting occurrences of a controlled list of 138 weaknesses selected by the expert group. Looking at maintainability, the quality closest to TD, there are 29 measures. Some weaknesses address the module level, for example, excessive coupling (many outward calls from a module) and circular dependencies between modules. Other weaknesses apply at the function level, including algorithmic complexity and the presence of dead code. ISO 5055 provides default thresholds for several measures but encourages organizations to customize them for a better fit.

The weaknesses can be used to express requirements. Any weakness could be declared unacceptable through shall-not requirements. Typically, most measures should be normalized per 1,000 lines of code or similar, which turns the counts into quality targets for improvement activities. Such targets can also be used in a development organization’s maintainability requirements or communicated in requests for quotations and contracts for third-party suppliers. This would differ from including a general process maturity requirement for the external development unit, such as a specific Capability Maturity Model Integration level. ISO 5055 measures can instead be computed for individual source code deliverables.
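
As a rough illustration of how such a shall-not requirement could be checked in practice, the sketch below normalizes weakness counts per 1,000 lines of code and compares them against a target. The weakness example, counts, and threshold are hypothetical; ISO 5055 does not prescribe any particular tooling or API.

```python
# Sketch: turning ISO 5055-style weakness counts into a shall-not quality target.
# The example numbers and the 1.0-per-KLOC threshold are made up for illustration.

KLOC = 1_000  # normalization unit: 1,000 lines of code


def weaknesses_per_kloc(occurrences: int, total_loc: int) -> float:
    """Normalize raw weakness occurrences per 1,000 lines of code."""
    return occurrences / (total_loc / KLOC)


def meets_shall_not_target(occurrences: int, total_loc: int, max_per_kloc: float) -> bool:
    """'The delivered code shall not exceed max_per_kloc occurrences per KLOC.'"""
    return weaknesses_per_kloc(occurrences, total_loc) <= max_per_kloc


# Example: a 120,000-LOC deliverable with 95 detected circular dependencies,
# checked against a (hypothetical) contractual target of 1.0 per KLOC.
if __name__ == "__main__":
    ok = meets_shall_not_target(occurrences=95, total_loc=120_000, max_per_kloc=1.0)
    print("Quality target met" if ok else "Quality target violated")
```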

CodeScene’s Code Health

CodeScene has developed the code health metric. Code health aggregates many factors automatically extracted from source code, in line with the gist of ISO 5055. Currently, we implement more than 25 factors that indicate quality issues at the module, function, or implementation level.
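
CodeScene’s exact scoring model is proprietary, so the following is only a conceptual sketch of how many per-file findings could be aggregated into a single score. The factor names and penalty weights are invented for illustration; they are not CodeScene’s actual model.

```python
# Conceptual sketch only: aggregate per-file quality findings into a 1-10 score.
# Factor names and penalty weights are invented, not CodeScene's actual model.

HYPOTHETICAL_PENALTIES = {
    "brain_method": 2.0,         # very large, complex function
    "duplicated_logic": 1.5,     # copy-pasted blocks
    "deeply_nested_logic": 1.0,  # excessive nesting depth
    "low_cohesion": 1.5,         # module with too many responsibilities
}


def health_score(findings: dict[str, int]) -> float:
    """Start from a perfect 10 and subtract a penalty per detected issue."""
    penalty = sum(HYPOTHETICAL_PENALTIES.get(name, 0.0) * count
                  for name, count in findings.items())
    return max(1.0, 10.0 - penalty)


# Example: a file with one brain method and two duplicated blocks scores 5.0.
print(health_score({"brain_method": 1, "duplicated_logic": 2}))
```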

Compared to ISO 5055, code health targets maintainability issues on higher abstraction levels. The current version of ISO 5055 focuses on low-level code constructs and a set of measures restricted to object-oriented programming. Code health, on the other hand, is general across development paradigms. Still, there is an overlap regarding, for example, the use of redundant code (copy–paste detection) and complexity measurements. A major difference is that code health is primarily influenced by research on code smells and refactoring rather than the literature on static code analysis. Furthermore, code health also considers data from the version control system to identify maintainability issues related to developer coordination and knowledge distribution.

Panel (a) of the figure shows a visualization of the PyTorch codebase using three colors for code health to communicate an overview of code quality. Code health is designed to be intuitive and actionable. The color scheme highlights the main concerns and the magnitude of possible quality issues. Drilling deeper, the code health score goes from 10, easily maintainable code, down to 1, representing code with severe quality issues. Code health is an absolute score calibrated using the vast amount of code analyzed by CodeScene.

A visualization of code health for PyTorch, a popular open-source machine learning framework. Green: healthy code; yellow: problematic code; red: unhealthy code.

No single key performance indicator (KPI) can ever capture the holistic picture of a project’s code quality. Even when “simply” looking at code quality, software engineering is too complex to capture in a single number. Thus, a CodeScene analysis presents code health for three parts of the source code.

Most development activity is confined to a fraction of the files in the codebase. CodeScene refers to such code modules as Hotspots. This is where developers spend most of their time. Hotspot code health is the most important KPI.
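
Hotspot detection does not depend on any particular tool; as a rough approximation, change frequencies can be mined directly from the version control history. The sketch below counts how often each file appears in the git log; the 12-month window is an arbitrary choice for illustration.

```python
# Sketch: approximate hotspots by counting how often each file changed recently.
# Run inside a git repository; the 12-month window is an arbitrary example.
import subprocess
from collections import Counter


def change_frequencies(since: str = "12 months ago") -> Counter:
    """Count how many commits touched each file since the given date."""
    log = subprocess.run(
        ["git", "log", f"--since={since}", "--name-only", "--pretty=format:"],
        capture_output=True, text=True, check=True,
    ).stdout
    return Counter(line for line in log.splitlines() if line.strip())


if __name__ == "__main__":
    for path, n_changes in change_frequencies().most_common(10):
        print(f"{n_changes:5d}  {path}")
```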

Two additional KPIs complement the code quality picture. First, the average code health illustrates the overall quality. As files differ substantially in size, we calculate weighted averages accordingly. Second, CodeScene highlights the code health of the worst performer, the individual file with the lowest quality. A shared awareness of this bug-prone and hard-to-evolve part of the system can be beneficial.
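
A minimal sketch of how these two KPIs could be computed from per-file scores, assuming file sizes in lines of code serve as the weights; the file names and numbers below are made up.

```python
# Sketch: size-weighted average code health and the worst performer.
# Mapping: file -> (code health score, size in lines of code); values are made up.
files = {
    "core/engine.py":   (4.2, 3_800),
    "utils/helpers.py": (9.1, 450),
    "api/routes.py":    (7.3, 1_200),
}

total_loc = sum(loc for _, loc in files.values())
weighted_average = sum(score * loc for score, loc in files.values()) / total_loc
worst_file, (worst_score, _) = min(files.items(), key=lambda item: item[1][0])

print(f"Weighted average code health: {weighted_average:.1f}")
print(f"Worst performer: {worst_file} (score {worst_score})")
```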

The time dimension is critical for TD management [7]. Panel (b) of the figure shows how CodeScene’s dashboard presents the overall code health trend for PyTorch. It also shows details for a specific utility file used for testing. Our analysis suggests that the codebase has been building up TD during 2022, possibly sacrificing long-term maintainability to quickly meet the needs of its users. There is substantial market pressure for machine learning frameworks. We note that the main competitor, TensorFlow, shows a very similar trend.

The CodeScene dashboard is a control center for TD decision-making. It helps development organizations to understand when to safely move ahead and implement new features. Equally important, it alerts us when it is time to take a step back and improve what’s already there. With appropriate actions—and CodeScene does provide recommendations—the system remains maintainable throughout its lifecycle.

Quality Gates Where It Is Busy and “Don’t Make It Worse”

Source code quality should be treated as a first-class citizen. For software companies, code quality is nothing less than a vital business concern. With a valid measurement system in place, two fundamental questions for requirements engineers and business analysts remain. What are the appropriate quality targets for the source code? And to which parts of the system do these targets apply?

Many factors influence the specification of code quality targets. First, the cost of resolving issues detected by tools is typically not linear. Instead, the effort tends to become exponentially larger when closing in on the zero-defect ambition. Second, setting targets for existing systems obviously depends on the current state and context. Many systems evolve for years and must meet high internal quality requirements to accommodate changes without setbacks. Others are approaching the ends of their lifecycles. Third, as systems consist of heterogeneous parts, one-size-fits-all thinking for the quality of all constituents is too blunt and often suboptimal.

Prioritization is a core activity in TD management. The same is true for requirements engineering, which puts the profession in the right spot to support decision-making and tradeoffs. Intuitively, TD should first be paid off where the interest is high, i.e., frequently modified files. Hotspots with low code health are costly productivity bottlenecks. Any development project should prioritize improvements accordingly.
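
One simple way to operationalize this prioritization is to rank files by combining change frequency (the interest rate) with how far their code health is from the maximum. The heuristic below is only illustrative, not CodeScene’s algorithm; the inputs would come from the version control system and a code quality tool.

```python
# Sketch: rank refactoring candidates by change frequency x quality deficit.
# Illustrative heuristic only; file names and numbers are made up.


def td_priority(changes_last_year: int, code_health: float, max_health: float = 10.0) -> float:
    """Higher value = pay off this debt first (high interest, low quality)."""
    return changes_last_year * (max_health - code_health)


candidates = {
    "core/engine.py":   (210, 4.2),  # (changes last year, code health)
    "utils/helpers.py": (15, 9.1),
    "api/routes.py":    (90, 7.3),
}

for path, (changes, health) in sorted(
        candidates.items(), key=lambda kv: td_priority(*kv[1]), reverse=True):
    print(f"{td_priority(changes, health):7.1f}  {path}")
```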

What should the code health targets be for the Hotspots? As quality protagonists, we, of course, recommend targeting high scores. In practice, however, this might not be reached within a sprint or two. What we do advocate, in any situation, is to at least resist making things worse. “Pull requests targeting Hotspots shall maintain or increase the current code health” is a very reasonable maintainability requirement, actionable for every pull request. Violating that requirement will come with a cost. Be aware that the attractive hyperbolic discount might come back with a vengeance!
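
Expressed as an automated check in a continuous integration pipeline, the requirement could look roughly like the sketch below. The two lookup functions are stand-ins for calls to whatever analysis tool is in use; here they read made-up scores so the example runs on its own.

```python
# Sketch of a "don't make it worse" quality gate for pull requests.
# The two lookup functions are placeholders for an analysis tool; the scores
# are made up so the example is self-contained.
import sys

HEALTH_ON_MAIN = {"core/engine.py": 4.2, "api/routes.py": 7.3}  # baseline (made up)
HEALTH_IN_PR = {"core/engine.py": 3.9, "api/routes.py": 7.3}    # PR result (made up)


def get_health_on_main(path: str) -> float:
    return HEALTH_ON_MAIN[path]


def get_health_in_pr(path: str) -> float:
    return HEALTH_IN_PR[path]


def quality_gate(changed_hotspots: list[str]) -> bool:
    """Pass only if every changed hotspot keeps or improves its code health."""
    ok = True
    for path in changed_hotspots:
        before, after = get_health_on_main(path), get_health_in_pr(path)
        if after < before:
            print(f"Code health degraded in {path}: {before} -> {after}")
            ok = False
    return ok


if __name__ == "__main__":
    passed = quality_gate(["core/engine.py", "api/routes.py"])
    sys.exit(0 if passed else 1)
```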

ISO standards are regularly revisited. Future versions of ISO 5055 will incorporate more weaknesses. As members of the requirements engineering community, what would you like to see in the next revised version? I’m all ears if you would like to discuss that! Source code quality is way too fundamental in software engineering for requirements professionals to stand on the sidelines. We must be there and add our expertise!

And how did it go for Sarah? She did submit her pull request. Luckily, her organization recently introduced automatic code quality gates as a pull request integration. A quality gate was triggered. Sarah immediately learned that one of her changes pushed a Hotspot over the brink to low cohesion by adding yet another responsibility to the code. Ouch! Did the team address the decreased code quality before merging? Or did they agree to increase the debt? Our story doesn’t tell. We know, however, that the developers got actionable information at the right time to make an informed TD decision. This would be the right arena for maintainability requirements to guide them further.

References

  1. S. Freire et al., “Software practitioners’ point of view on technical debt payment,” J. Syst. Softw., vol. 196, Art. no. 111554, Feb. 2023.
  2. N. Davila and I. Nunes, “A systematic literature review and taxonomy of modern code review,” J. Syst. Softw., vol. 177, Art. no. 110951, Jul. 2021.
  3. A. Tornhill and M. Borg, “Code red: The business impact of code quality – A quantitative study of 39 proprietary production codebases,” in Proc. IEEE/ACM Int. Conf. Tech. Debt, 2022, pp. 11–20.
  4. P. Piotrowski and L. Madeyski, “Software defect prediction using bad code smells: A systematic literature review,” in Data-Centric Business and Applications, Lecture Notes on Data Engineering and Communications Technologies, vol. 40. Cham, Switzerland: Springer, 2020, pp. 77–99.
  5. Systems and Software Engineering — Systems and Software Quality Requirements and Evaluation (SQuaRE) — System and Software Quality Models, ISO/IEC 25010:2011, 2011.
  6. Information Technology — Software Measurement — Software Quality Measurement — Automated Source Code Quality Measures, ISO/IEC 5055:2021, 2021.
  7. N. A. Ernst et al., “Measure it? Manage it? Ignore it? Software practitioners and technical debt,” in Proc. 10th Joint Meeting Found. Softw. Eng., 2015, pp. 50–60.