RE and Large Language Models: Insights from a Panel

This is a personal copy of a column in IEEE Software (Jan/Feb 2024). Republished with permission.

Large language models are inescapable these days, obviously making waves in RE too. The long-term extent of their impact? Well, that remains to be seen. At the previous International RE Conference in Hannover, Germany, these models were the center of many lively discussions. In this column, I’ll summarize a panel debate, reporting various viewpoints and thought-provoking audience questions. The panel stressed students’ early adoption of AI support, aligning well with this issue’s theme “Artificial Intelligence in Software Engineering Education & Training.”


Markus Borg

Large Language Models (LLMs): ready or not, here they come! Whether you’re already tired of the endless buzz or just tuning in… this transformative technology is here to stay. Much like web search revolutionized information access, LLMs will disrupt various engineering tasks. As I write in November 2023 – in the middle of the CEO turbulence at OpenAI – several milestones will likely have been passed by the time you read this.

However, is RE late to the game? A recently published survey of LLM applications states “Hitherto, the discipline of requirements engineering has received less attention from the emerging literature on LLM-based software engineering” [FAN23]. Are we late? Or are LLMs maybe not intelligent enough to do RE tasks?

At the major academic RE event of the year – the International Requirements Engineering Conference – LLMs were obviously one of the main discussion points. Apart from several research papers, there were no fewer than three well-attended panel discussions dedicated to the topic.

  1. Standardizing validation methods: empirical evaluation of LLM-based research – at the Workshop on Empirical RE (EmpiRE)
  2. AI and RE: past, present, and future – at the International Workshop on Artificial Intelligence and Requirements Engineering (AIRE)
  3. Requirements Engineering and Large Language Models: Best of Friends or Worst of Enemies? – in the main conference program

This column provides a summary of the third panel. The textbox introduces the five panelists. Names mentioned by the audience are included with their explicit consent. Do you agree with the panelists? Let’s continue the conversation – LLMs are clearly here to stay!

Henning Wachsmuth – Professor, Leibniz University Hannover, Germany. Leads research on natural language processing and is an LLM expert. Limited RE background.
Markus Borg – Principal Researcher, CodeScene, Sweden. Leads research on software engineering intelligence. His solution-oriented research includes information retrieval and machine learning.
Alessio Ferrari – Senior Researcher, CNR-ISTI, Italy. His research interests include applications of natural language processing techniques to RE tasks. Also on the AIRE panel.
Fatma Başak Aydemir – Assistant Professor, Boğaziçi University, Turkey. Her research focuses on applying AI methods to increase the efficiency of RE processes.
Walid Maalej – Professor, University of Hamburg, Germany. Current steering committee chair of the RE conference series. His research often involves natural language processing and machine learning. Also on the AIRE panel.

Panelists. From left to right in the photo: Neil Ernst (moderator), Henning Wachsmuth, Markus Borg, Alessio Ferrari, Fatma Başak Aydemir, and Walid Maalej. (Picture by Prof. Martin Glinz.)

Opening the Panel

Neil Ernst, the moderator, kicked things off with a three-question poll to get a feel for the room. First, we learned about the audience’s varied LLM usage: a roughly equal split between 1) occasional users, 2) regulars, and 3) daily users, while some respondents had never used them at all. Second, almost all respondents believed LLMs will positively affect RE practice. Third, the most positive effects were expected in specification, followed by validation. With a generally positive sentiment as a backdrop, Neil turned to the panelists for quick introductions and opening statements about LLMs and RE.

Henning opened the panel discussion. His expertise as an LLM researcher stood out in the panel, and he later got to explain several technical aspects. Henning expected that LLMs would help assess, validate, and consolidate textual requirements. Perhaps the technology can also support elicitation, but that comes with the risk of factual incorrectness – presented in ways that look highly plausible.

Next, Markus (yours truly) offered an industry viewpoint from CodeScene. I predicted that LLMs would primarily impact user expectations of our software engineering intelligence product. Soon, I will be doing RE for features that let users interact conversationally with the CodeScene tool through prompts.

Alessio shared his excitement about the new avenues LLMs open for RE. His early experience with checking requirements completeness and generating user stories had been remarkably positive. However, he also called for empirical studies to validate how effective LLMs really are. Finally, given how capable LLMs are at generating source code, he foresaw a future where requirements specifications are adapted to support subsequent prompting.

Başak discussed her use of LLMs to generate educational material. She was grateful that LLMs handle Turkish much better than previous generations of NLP techniques. Still, she offered the most cautious remark of the panel. She feared that her students now rely on LLMs for most tasks – future students might never develop enough expertise to assess the quality of LLM output.

Finally, Walid emphasized the responsibility of the RE community in developing responsible LLM technology. He advocated for a focus on how to do RE for LLMs, urging a step back from the hype to leverage the community’s analysis skills and understanding of technological complexities with a human-centric perspective.

Evaluating LLMs?

The moderator opened the floor to audience questions, and Dan Berry (Professor, University of Waterloo, Canada) initiated the conversation. He had been part of the EmpiRE panel and highlighted one of its questions about LLM evaluation – a fundamental issue for the RE research community.

Dan offered a critical observation and a thought-provoking hypothesis about speeding up engineering tasks. He argued that while LLMs might accelerate human work, they may also lead to more mistakes early in projects. The catch is that fixing these mistakes later, in code no human ever wrote, could demand more than the tenfold effort that fixing one’s own code has been shown to require. He hypothesized that LLMs might do a good job, yet organizations would spend more effort overall than if they had done things the old-fashioned way from the start. Prof. Berry urged us to go beyond assessing the quality of LLM outputs: the community needs to compare task completion times with and without LLM assistance.

Başak replied by sharing her experience using an LLM to generate exam questions. She agreed with Dan’s idea, noting that it took her longer to adapt a domain model drawn by ChatGPT for an RE course than to do it herself. Despite this, she enjoyed the process and acknowledged that it was her first trial.

In contrast, Alessio held a different stance. Although he acknowledged the need for scientific evaluation, he argued that working with LLM support is much faster and highly effective for quick prototyping.

Markus responded that the benefits of LLMs are primarily in providing fast and succinct access to information. He envisioned using A/B testing in CodeScene to measure the time efficiency of LLMs in users’ information-seeking compared to current visual analytics and dashboards available in the tool.

Henning, the panel’s LLM expert, clarified that the main problem in evaluating generative models is the absence of a single correct answer. Thus, no one can authoritatively judge whether one output is better than another. Moreover, if LLMs were to produce superhuman output, mere humans could not properly recognize it as such. He agreed that measuring quality on downstream tasks, for which comparisons can be made, would be better.

Walid concluded this part of the discussion by agreeing with Dan’s points and advocating for the RE community to focus on evaluating LLM-based systems within the context of user, system, and customer environments. LLM output evaluation should be left to other research communities, whereas RE researchers are better trained to evaluate the broader complexities in context.

Can LLMs replace developers?

The conversation took a provocative turn when the moderator asked, “Can RE replace developers if code could be generated directly from requirements?”

Alessio made a bold statement affirming this possibility, comparing the evolution of programming from assembly to high-level languages. He envisioned a future where software engineering is mostly about composing prompts. That is, prompt engineering would turn into the new RE.

Markus shared an anecdote about a Swedish project funded to explore whether useful “GPT developers” could be trained in just two months. The short program would teach prompt engineering and explore whether the trainees could be viable in the job market. He was skeptical, however.

Başak continued the education theme and expressed concerns. She emphasized the importance of senior-level expertise in understanding advanced system architecture and design. What would happen in the long run if we replaced junior developers with code generation?

Walid disagreed with the idea of replacing developers with LLMs. While efficiency might improve for simple tasks, complex systems and unique situations would still require skilled engineers. He pointed out the risks associated with LLMs, such as their potential to hallucinate or discriminate, underscoring the need for human oversight.

Paola Spoletini (Professor, Kennesaw State University, USA) connected to Başak’s points and raised concerns about the next generation of LLM-native developers. How will their reliance on generative AI affect their skills and development? She echoed Başak’s fear of a potential decline in the quality of future senior developers due to the diminishing role of junior positions.

Walid responded that adapting education to new technology is a complex matter. He argued against prohibiting LLMs in classrooms because students will use them regardless. He called for reflection and proposed a decade-long focus on studying LLMs’ impact on education and knowledge provision.

What can LLMs do for RE?

The moderator then changed the topic to what specifically LLMs can do for RE, linking back to the third question from the introductory audience poll.

Henning offered a viewpoint from outside RE expertise. He noted that LLMs, having processed vast amounts of text, can make connections and provide perspectives that humans might overlook. While he wouldn’t expect LLMs to elicit all requirements for a product, he saw their strength in brainstorming ideas. Additionally, he mentioned LLMs’ capabilities in processing input, such as summarizing requirements and identifying potential conflicts. However, he stressed that LLMs should have a supporting rather than a leading role.

Klaus Pohl (Professor, University of Duisburg-Essen, Germany) entered the discussion by claiming that RE nowadays revolves around innovation. Challenging the panel, he questioned the feasibility and logic of using LLMs trained on old data to elicit innovative requirements.

Alessio brought attention to research showing ChatGPT’s high performance in creativity tests [GUZ23]. The results suggest that large neural networks exhibit emergent phenomena, indicating a form of competition with human creativity. Building on this, Markus added that generative AI excels in combinatorial creativity [MIS18], capable of generating novel and potentially inspiring combinations.

Henning agreed but stressed LLMs’ limitations in formal logical reasoning. He said that LLMs can’t perform inference processes like humans, but their exposure to extensive data allows them to make complex connections between concepts. This tends to blur the line between actual reasoning and smart data combinations.

Alessio continued by expressing the need for theoretical frameworks and guidelines to effectively use LLMs in software engineering. He noted the limitations of current approaches that generate code through simple prompting. Future research must show how next-generation developers could build chains of prompts or complex prompt architectures (a minimal sketch follows below). Walid concurred and added that there will also be new trade-offs to master, perhaps including energy efficiency.
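
To make the idea of a prompt chain concrete, here is a minimal sketch of my own – not something any panelist proposed. It assumes a generic chat-completion API; the call_llm helper and both prompts are hypothetical placeholders.

    # A minimal sketch of a two-step prompt chain. The call_llm helper is a
    # hypothetical stand-in for any chat-completion API; replace it with
    # your provider's client.
    def call_llm(prompt: str) -> str:
        # Placeholder so the sketch runs end-to-end without an API key.
        return f"[LLM response to: {prompt[:50]}...]"

    def requirement_to_code(requirement: str) -> str:
        # Step 1: normalize the free-text requirement into testable criteria.
        criteria = call_llm(
            "Rewrite this requirement as numbered, testable acceptance "
            "criteria:\n" + requirement
        )
        # Step 2: generate code from the structured intermediate result,
        # not from the raw prose.
        return call_llm(
            "Implement a Python function that satisfies these acceptance "
            "criteria. Return only code:\n" + criteria
        )

The design point is that each step’s output becomes the next step’s input, so the intermediate representation – here, acceptance criteria – can be reviewed like any other requirements artifact before code is generated.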

Dan closed this part of the panel with a memorable reference to the aluminum-bullet phenomenon. When a new technology emerges, it initially seems like a silver bullet, offering novel, exciting possibilities that weren’t feasible before. A surge of innovative applications that leverage the technology rapidly follows. However, as the technology becomes commonplace and integrated into everyday use, it transitions from silver bullet to aluminum bullet. Given enough time, LLMs will no longer be seen as extraordinary but will become standard tools.

Trusting LLMs?

A comment from Martin Glinz (Professor, University of Zürich, Switzerland) shifted the discussion toward the problem of trust. He referred to a study comparing ChatGPT and Stack Overflow answers [KAB23], revealing that users preferred ChatGPT’s responses 39% of the time – despite them containing inaccuracies in 52% of cases. This shows a psychological tendency to trust polished answers more, regardless of their accuracy. Martin stressed the importance of considering this in the context of RE for AI systems.

Başak responded by acknowledging the trust issue. She emphasized the need for caution and testing, especially since LLMs can convincingly hallucinate. Her students had, for example, already submitted assignments with made-up references.

Walid added some nuance by suggesting that the importance of accuracy might depend on the use case. In some scenarios, like gaming or entertainment, the creativity or entertainment value of the AI’s response might be more important than its factual correctness. Such considerations, he said, are again tasks for RE to address.

Alessio emphasized the importance of keeping human expertise and experience in the loop. He suggested that while LLMs can assist, expert oversight is crucial. He pointed out that unlike Stack Overflow, which has a rating system for answers, ChatGPT still lacks such a mechanism. This could be a key feature in fields like coding, where expert-validated answers are beneficial.

Finally, Markus shared his personal preference for ChatGPT over Stack Overflow due to its quicker, more conversational interaction style. However, he advised using both ChatGPT and Stack Overflow when seeking solutions to combine the strengths of each platform.

Closing statements

After 100 minutes of panel discussions, the moderator invited the panelists to share their closing statements.

Henning drew a parallel between the advent of the Internet and LLMs, suggesting a similar trajectory of initial overtrust followed by a more nuanced understanding and use. Acknowledging the challenges in judging the trustworthiness of LLM outputs due to their polished articulation, he underscored the need to learn how to harness them for their potential benefits.

Markus highlighted the evolution in user interaction with textual prompts. Before ChatGPT, most users were reluctant to write textual prompts for information access. He expressed enthusiasm for the new RE possibilities in product development.

Alessio looked ahead, predicting a massive increase in natural language processing and LLM work in the software engineering research community. Given that requirements are typically expressed in natural language, he expected the number of RE-oriented research papers to increase in the coming years.

Başak anticipated the addition of a dedicated panel on education in software engineering with a focus on LLMs at next year’s RE conference.

Walid remarked that LLMs are not brand new – BERT has been around for a while [DEV18] – and they will keep evolving. LLMs will advance engineering in various ways. Echoing Başak’s perspective, he stressed education as the foremost challenge. Non-experts and novice developers must learn how to deal with the technology. Additionally, Walid encouraged research on AI engineering for LLM-based systems, such as adapting processes, system design, requirements elicitation, and software testing.

References

  • [DEV18] Devlin, J., Chang, M.W., Lee, K. and Toutanova, K., 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805.
  • [FAN23] Fan, A., Gokkaya, B., Harman, M., Lyubarskiy, M., Sengupta, S., Yoo, S. and Zhang, J.M., 2023. Large Language Models for Software Engineering: Survey and Open Problems. arXiv preprint arXiv:2310.03533.
  • [GUZ23] Guzik, E.E., Byrge, C. and Gilde, C., 2023. The Originality of Machines: AI Takes the Torrance Test. Journal of Creativity, 33, p.100065.
  • [KAB23] Kabir, S., Udo-Imeh, D.N., Kou, B. and Zhang, T., 2023. Who Answers It Better? An In-Depth Analysis of ChatGPT and Stack Overflow Answers to Software Engineering Questions. arXiv preprint arXiv:2308.02312.
  • [MIS18] Mishra, P. and Henriksen, D., 2018. Twisting Knobs and Connecting Things. In Creativity, Technology & Education: Exploring their Convergence, Springer, pp. 43-51.