
This is a personal copy of a column in IEEE Software (Jul/Aug 2025). Republished with permission.
The relationship between requirements engineering and testing has been a key interest throughout my research career. It cannot be that requirements engineers are from Omicron Persei 7 and testers from Omicron Persei 9 – we all live on the same planet. I’m happy to co-author this column with a former collaborator from a test automation EU project, now a manager at Ericsson. With this issue’s focus on AI-powered testing, we ask: How can we define corresponding tool and process requirements in practice? This column comes in two parts. First, Sahar shares her experience. Then, we connect her observations to findings from a longitudinal study on tool adoption. Together, the perspectives give a grounded view from the field.
Sahar Tahvili and Markus Borg
“Can we just use AI for that?” It’s the new reflex. In the growing war of AI adoption, every problem becomes a battlefield, and AI is the weapon of choice. There’s a rush to deploy machine learning (ML), deep learning, generative AI, or “something with TensorFlow” – no matter how simple the challenge. But while Excel-based workflows may look like relics of the past, they often reveal a deeper truth: the tools we choose aren’t always based on the problem’s needs. Too often, they’re based on who yells “AI!” the loudest.
This is where requirements engineering (RE) makes its quiet comeback. Because before picking an AI model, we should probably figure out what problem we’re solving. These days, the gap between a basic “if-else” rule and a “GenAI-powered LLM” is massive. But not every problem lives on the deep end of the complexity pool. Fit-for-purpose AI isn’t about choosing the fanciest model. It’s about aligning with the process, the data, and the actual need. Sometimes that means rethinking how we write, track, and trust our tests. Sometimes it means not throwing a transformer at every bug report.
And yes, LLMs are making waves in the test lab. They can generate test cases, simulate users, and write documentation in Shakespearean English. But without understanding the why behind the test, it’s just impressive guesswork. So no, the real question isn’t “Can we use AI here?” It’s: “Do we actually know what we’re doing?”
In real-world software testing, I’ve seen both ends of the spectrum. From tedious manual coordination in Excel to reinforcement learning models that adapt and optimize over time. But beyond the technical perspective, there’s a deeper, human one: resistance to change. Traditional workflows are deeply ingrained. People who’ve built systems manually, day after day, often worry about what AI means for their roles, their relevance, their value, and their long-term impact. We’ll return to this in the second part of the column.
As an industrial researcher, I’ve recently given myself a new, unofficial task: figuring out when AI actually makes sense, when it really doesn’t, and why “more intelligent” doesn’t always mean “better suited.” Sounds like RE to me!
Excel is a tool, not a process
In daily conversations with team members, a familiar debate often surfaces: the defense of Excel as the ultimate productivity tool. The argument goes something like this:
“Why change? Excel works. There’s no onboarding, no complicated setup, and certainly no massive AI infrastructure consuming power in the cloud.”
And they’re not entirely wrong – Excel does work. It’s fast to open, easy to share, and universally understood. But here’s the catch: when Excel stops being just a tool and starts becoming the entire process, that’s where innovation stalls. I often catch myself thinking: “Okay, sure, it works… but at what cost?”
There’s all that manual effort behind every step of the process. The same steps are repeated across teams, again and again. The tiny errors that quietly pile up until they become someone else’s weekend problem. It’s not just inefficiency. It’s a slow leak we’ve normalized.
Another cost is the carbon footprint of all that repetitive human-machine interaction. Thousands of hours on laptops, email servers, and shared drives. While AI has a footprint too, it’s built to optimize, reduce repetition, and scale efficiently. A well-trained AI model doesn’t need coffee breaks, doesn’t forget to save a file, and certainly doesn’t get tired by Friday afternoon. It just gets better. Excel should stay in the toolbox, but it shouldn’t become the blueprint for how we work. Smart tools don’t just replace effort; they redirect it to higher-value thinking.
If the USA is so great, why invent the USB?
Conversations with the younger generation of engineers, data scientists, and students are, let’s say, lively. The excitement around AI is contagious. But sometimes it turns into a belief that every problem requires a deep neural network, the way every meal, apparently, needs a splash of prickly pear seed oil.
I’ll challenge them with a simple question:
“Why are we using such a complex, compute-heavy model for this? A linear regression would’ve worked just fine.”
And I always get some version of the same response:
“Well, if linear regression was so efficient, why did they even invent neural networks?”
Well, just because we can build a TensorFlow model to decide who should run the test cases on the weekend shift doesn’t mean we should. Purposeful AI isn’t about showing off horsepower. It’s about choosing the right tool for the job. Deep learning is powerful, yes. But sometimes the solution isn’t a smarter model – it’s a smarter choice.
If a simple rule or basic availability check can solve the problem, there’s no reason to launch a TensorFlow pipeline. Complexity should be earned, not assumed.
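To make that point concrete, here is a minimal sketch of a weekend-shift assignment that a plain rule solves without any model. The engineer names and the round-robin rotation are hypothetical, purely for illustration – they are not from any real process.

```python
# Hypothetical sketch: a weekend-shift assignment that needs no ML model.
# The engineer names and the round-robin rule are illustrative only.
from datetime import date

ENGINEERS = ["alice", "bob", "chen"]  # on-call rotation, in order

def weekend_assignee(on_date, unavailable=frozenset()):
    """Pick the engineer for a weekend shift by rotating on the ISO week
    number, skipping anyone marked unavailable."""
    week = on_date.isocalendar()[1]
    for offset in range(len(ENGINEERS)):
        candidate = ENGINEERS[(week + offset) % len(ENGINEERS)]
        if candidate not in unavailable:
            return candidate
    raise RuntimeError("no engineer available this weekend")
```

Deterministic, auditable, and explainable in one sentence – qualities a compute-heavy model would first have to earn.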
So here I am, stuck between two generations. One swears by Excel, the other by TensorFlow. One trusts experience, the other trusts the model’s accuracy score. And I? I just want the right tool for the job – something that works, scales, and doesn’t schedule me to debug failed tests on a Sunday.
Maybe progress isn’t about choosing sides. It’s about knowing when to use a formula and when to train a model. Yes, USBs were invented after the USA. But that doesn’t mean USBs are better at governing nations…
From Sahar’s story to our findings
Sahar’s observations resonate with findings from a large study we recently conducted [2]. Together with many collaborators, we explored how ML could be used to automatically assign bug reports – an important part of the testing process – to development teams. We’ve been working on this challenge for over a decade and learned just how many convoluted challenges appear in practice.
Already in 2011, we started designing an ML solution to support the manual bug report assignment process. The task consumed significant time from highly senior engineers. By automating part of the process using a trained classifier, we hoped to free up their time – and assign bug reports more accurately and efficiently. A textbook ML use case, right?
We collected a large training set with ground truth data on which teams had successfully resolved past reports. The early results were promising, and we kept going. After a series of academic papers, a team in Hungary started implementing a production-ready version in 2017. Since 2019, the tool has been continuously auto-assigning bug reports in a live environment.
Has it delivered value? Yes. Around 30% of incoming bug reports are now automatically assigned with 75% accuracy. These reports are resolved, on average, 21% faster than those assigned manually. More importantly, the tool has saved seasoned engineers many hours of work. There were also indirect effects, such as increased process awareness and higher job satisfaction.
Was it easy? Not at all. Adopting AI in a large organization is a major undertaking. That’s why we designed a longitudinal study to track key insights. We faced several setbacks along the way, and there were negative consequences. Bypassing human analysis steps using AI comes at a cost. We continue by connecting our findings to Sahar’s reflections.
Fit-for-purpose starts with people
A core lesson from our work is the importance of stakeholder identification and analysis – classic RE activities that are essential when introducing new tools. Researchers can’t simply hand off an academic prototype and wish the development team good luck. Even for a seemingly narrow activity like bug assignment, many perspectives must be heard.
User groups that seem homogeneous at first often reveal surprising differences. We focused on two main roles: those whose task was being automated, and those receiving the AI-assigned reports. We interviewed both satisfied and dissatisfied users and paid particularly close attention to critical voices.
One aspect that really calls for curious RE questions is understanding the cost of incorrect AI assignments. Sure, humans make mistakes too, but AI can make them at scale! This concern was especially strong among the teams closest to the hardware. Addressing this was one of the biggest hurdles.
The tool we deployed targets bug assignment for a layered product architecture. In the manual process, bug reports were often routed to high-level teams for initial triage and then passed down the stack as the analysis progressed. The AI, however, more frequently assigned reports to lower-level teams directly. High-level teams applauded this – feeling they were no longer the default – but frustration grew downstream. Lower-level teams missed vital clues from pre-analysis and didn’t trust the auto-assignment. Manual effort moved down the stack.
Another drawback of AI automation was a loss of oversight. When tasks are automated, something human is inevitably lost – a classic irony of automation as Bainbridge pointed out decades ago [1]. Previously, senior analysts had eyes on every bug report. But with an ML classifier in the pipeline, a subset of reports had already been assigned before the morning sync meetings. Without compensating mechanisms, like new dashboards, the complete picture of the product status on the market was at risk.
AI spans everything from simple linear regression to state-of-the-art large language models. But as Sahar raised: what matters is fit-for-purpose. Simple and explainable often beats complex and accurate. Again, RE comes in with structured approaches to reason about quality trade-offs. Apart from accuracy, selecting which enabling AI technology to rely on depends on qualities such as scalability, maintainability, and simplicity. Trade-offs everywhere!
We acknowledge Sahar’s point about balancing conservative Excel loyalists and overly optimistic AI enthusiasts. Overcoming organizational inertia can be tougher than solving technical issues. In our case, we partly dodged the challenge as the tool operated quietly in the background. Users didn’t need to learn how to use it. They just had to live with the output – this simplified adoption.
Her Excel warning connects to another insight from our study. Tools shape processes. But that influence can also be used for good. Our tool deployment sparked conversations about the bug triage process. This first led to better process awareness and later catalyzed small but meaningful process improvements.
Enabling the change
If your careful RE work concludes that an AI-powered solution is indeed the way forward, how do you facilitate adoption? We wrap up with a few hands-on recommendations.
As Sahar pointed out, accuracy and trust are tightly connected. Low accuracy – especially worse than human performance – is often a showstopper. But what counts as good enough can vary. RE needed again! We used QUPER to structure conversations about quality levels [3] – for example, when AI assignments become merely useful, and when they start to truly differentiate. This helped us understand the accuracy expectations across different teams.
What is QUPER?
The QUPER model supports reasoning about perceived value as quality improves. A core aspect is its non-linear view of quality benefits. QUPER models this using three breakpoints: utility (users recognize value), differentiation (the solution starts standing out), and saturation (beyond which higher quality adds little value).
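As a rough illustration of the reasoning – the region labels are our own shorthand and the breakpoint values are placeholders, not prescribed by QUPER – a quality level can be classified against the three breakpoints like this:

```python
def quper_region(quality, utility, differentiation, saturation):
    """Place a quality level (e.g. assignment accuracy) relative to
    QUPER's three breakpoints. Breakpoint values are product-specific
    and must be elicited from stakeholders."""
    if quality < utility:
        return "useless"        # users do not yet recognize value
    if quality < differentiation:
        return "useful"         # valuable, but not a differentiator
    if quality < saturation:
        return "competitive"    # the solution stands out
    return "saturated"          # extra quality adds little value
```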
For the bug assignment tool, the practical solution was to set different confidence thresholds for different teams. Some were fine with a few wrong assignments. Others wanted the ML model to stay in decision-support mode unless it was really confident. Avoiding a one-size-fits-all approach was a key success factor.
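That success factor can be sketched in a few lines. The team names and threshold values below are hypothetical; the point is the gating logic – auto-assign only above the receiving team’s threshold, and fall back to decision support below it.

```python
# Hypothetical per-team confidence gating; names and numbers are illustrative.
THRESHOLDS = {
    "hardware": 0.95,  # closest to hardware: auto-assign only when very confident
    "platform": 0.80,
    "apps": 0.60,      # tolerates a few wrong auto-assignments
}
DEFAULT_THRESHOLD = 0.90  # for teams without a negotiated threshold

def route(bug_id, predicted_team, confidence):
    """Auto-assign a bug report if the classifier's confidence clears the
    receiving team's threshold; otherwise only suggest it to human triage."""
    if confidence >= THRESHOLDS.get(predicted_team, DEFAULT_THRESHOLD):
        return (bug_id, predicted_team, "auto-assigned")
    return (bug_id, predicted_team, "suggested")
```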
To overcome inertia, we found that people matter just as much as ML models. We succeeded by identifying three types of supporters:
- Champions advocating for the tool.
- Diplomats navigating various stakeholder concerns.
- Early adopters willing to experiment.
When these roles worked closely with the development team, early versions could be tested in safe environments. This enabled iterative tool development with gradually improving accuracy and new features – with integrated RE as part of the cycles, of course.
Finally, also related to inertia mitigation, internal training helped align expectations. AI triggers very different mental models. A shared understanding of what the tool does – and doesn’t – is essential for successful adoption.
Have you tried increasing automation in a software engineering task? Was AI an enabling technology? If it worked, chances are you had some solid RE activities behind it. If it didn’t – well, you know what we’d recommend.
References
- [1] L. Bainbridge. Ironies of Automation, Automatica, 19, 6 (1983).
- [2] M. Borg, L. Jonsson, E. Engström, B. Bartalos, and A. Szabó. Adopting Automated Bug Assignment in Practice — A Longitudinal Case Study at Ericsson, Empirical Software Engineering, 29, 126 (2024).
- [3] B. Regnell, R. Berntsson Svensson, and T. Olsson. Supporting Roadmapping of Quality Requirements, IEEE Software, 25, 2 (2008).