This is a personal copy of a column in IEEE Software (May/Jun 2023). Republished with permission.
Join us as we explore the cutting-edge world of explainable artificial intelligence (AI) for software engineering from a requirements engineering perspective. With the increasing complexity of software systems and the use of opaque machine learning models, explainability has become an essential quality attribute in software engineering. This column dives into the trending topic of explainable AI-assisted code generation. We’re both advanced users and solution developers, and we live and breathe explainable AI every day. At KTH Royal Institute of Technology, we’re pioneering academic research on explainable automatic program repair. At CodeScene, we’re developing real-world solutions for explainable refactoring recommendations. We’re all dogfooding, and we’re also early adopters of other AI-infused solutions. Here is our take on explainability! – Markus Borg
Markus Borg, Emil Aasa, Khashayar Etemadi, and Martin Monperrus
Artificial Intelligence (AI)-assisted code generation is everywhere these days. Undoubtedly, AI will substantially help near-future developers by providing code suggestions and automation. In this application, explainability will be a key quality attribute. But what needs to be explained to whom? And how can the explanations be delivered non-intrusively?
Generative AI for developers is burgeoning. OpenAI Codex has been successfully driving GitHub Copilot since the summer of 2021. This winter, OpenAI’s ChatGPT made headlines and proved itself a competent code generator. ChatGPT has solved numerous computer science students’ assignments worldwide. As educators of the next generation of engineers, we’ve seen this firsthand! In December, DeepMind’s success with AlphaCode for competitive programming was published in Science. We firmly believe that a widespread increase in software development automation is just around the corner.
Explainability is a crucial aspect of AI-assisted code generation. It gives developers visibility into the code generation process, an idea of the tradeoffs with a particular approach, and an understanding of why the code works, not just whether it does. AI-generated code doesn’t have to be perfect to be valuable [1]. Nevertheless, a questionable output and a lack of explainability would hinder the adoption of these tools in developers’ workflows. However, explainability is a nontrivial quality attribute, and a one-size-fits-all approach may not be feasible.
Understanding the Users
Chazette and Schneider looked at explainability as a nonfunctional requirement alongside other “-ilities” [2]. They discussed its relationship to usability and warned that poor design and implementation of explanations could do more harm than good. Developers of AI-assisted tools must tread carefully, as ill-timed explanations from a smarty-pants AI assistant would be detrimental. This means that we need to adapt explanations to the receiver. But who exactly is that person?
Developers are a diverse crowd. For matters related to the development environment, most have strong preferences. Some are eager to try new tools. Others are surprisingly conservative, despite the rapid changes in software technology. Explainability must be designed with this diversity in mind. The same developers will also have shifting explainability needs throughout their careers. Novices may require more guidance, while power users who work seamlessly with AI assistants will have different requirements. Adaptable explanations will be essential for success.
In the academic human-computer interaction community, the topic of human-centered explainable AI is growing. Based on nine participatory design workshops with IBM developers, Sun et al. elicited user needs and expectations on explainability for three AI-assisted use cases [3]: 1) code translation, 2) code autocompletion, and 3) generating code from text. They followed an exemplary requirements-elicitation process, involving two-phase scenario-based design workshops with open-ended question elicitation and ideation sessions around low-fi prototypes. The authors distilled 11 categories of explainability questions from the participants. These categories, summarized in Table 1, serve as a useful starting point for a requirements specification.
Table 1. The 11 categories of explainability questions distilled by Sun et al. [3].

| Category | Description | Example questions |
| --- | --- | --- |
| Input | Inquiries about the kinds of input that the AI can take | What input does the AI understand? How do I prompt the AI to get good output? |
| Output | Inquiries about what the AI can produce | Will the output be idiomatic Python code? Will exception handling be covered? |
| How (global) | High-level inquiries about how the AI operates | How does the AI generate code? Does the AI know design patterns? |
| Performance | Inquiries about the quality of the AI-generated artifacts | How correct is the output guaranteed to be? How confident is the AI about the output? |
| How to | Inquiries about how to change the input to affect the quality or characteristics of the output | How can I get tailored output? How do I get more efficient code? |
| Control | Inquiries about customization of how the model should work | Can I set preferences for specific libraries? Can I customize internationalization and localization? |
| Why/Why not | Inquiries about why the model produced a given output | Why didn’t my prompt generate good output? Why did the AI believe this output satisfies my request? |
| Data | Inquiries about the underlying training data | What code was the AI trained on? Where does the training code come from? |
| System requirements | Inquiries about the requirements to use the AI | Can I use the AI in my closed source project? What is the energy consumption? |
| Limitations | Inquiries about the limitations of the AI | What are the limits of the AI? What type of prompts are not supported? |
| What if | Inquiries about what the output would be in hypothetical situations | What does the AI output if my prompt is invalid? What happens if I input contradictions? |
Next, we present insights on explainability from two distinct perspectives. Aasa reports his view as an early adopter and advanced user of Generative Pretrained Transformer (GPT) models. From the researchers’ horizon, Etemadi and Monperrus share their expertise in highly explainable automatic program repair (APR).
Aasa’s Story: Hands-on Work With GPT Models
My day job is building tools for other software developers. As an ostensible AI developer, I jump on any chance to try whatever intelligent tool claims to be disrupting software engineering practice. When GitHub announced the Copilot beta, I was obviously an early bird. And since the release of ChatGPT, I’ve been exploring ways to integrate it into my workflow. This is my story about working with such mysterious companions for a year and a half.
There was the initial honeymoon period. I was struck by my new AI assistant’s impressive ability to write tedious code. When implementing machine learning, there are monotonous data-wrangling Python tasks for which I regularly consult the documentation: reading compressed comma-separated value files, merging data, and removing duplicate entries with libraries such as Pandas and scikit-learn. This is where Copilot shines: providing relevant suggestions through code completion at various points throughout the work. Even with minimal input, such as only a file name and an introductory code comment, my AI assistant eagerly sprang into action.
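To make this concrete, here is a minimal sketch of the kind of data wrangling described above, the sort of snippet Copilot typically completes well. The file names and the user_id join column are placeholders of our own.

```python
import pandas as pd

# Read compressed comma-separated value files (hypothetical file names).
orders = pd.read_csv("orders.csv.gz", compression="gzip")
users = pd.read_csv("users.csv.gz", compression="gzip")

# Merge the data and remove duplicate entries.
merged = orders.merge(users, on="user_id", how="left")
merged = merged.drop_duplicates()

print(merged.head())
```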
Blinded by newfound AI love, I took Copilot for a stroll in our adorable Clojure codebase. I knew, of course, that Clojure is a marginal language in the Lisp family. Compared to Python, the amount of available training data must be orders of magnitude smaller. And yes, the limitations of GPT models became evident. Copilot’s suggestions consisted entirely of Common Lisp, which is not valid Clojure. To add insult to injury, I also got suggestions consisting almost entirely of closing parentheses. Although such output is easy to disregard, I sometimes received seemingly high-quality source code that used Clojure libraries that don’t exist, a phenomenon known as GPT hallucination. The need to cross-examine my AI assistant about her output became clear. She must explain her work.
Mental Models of Development
Inspired by established research on the cognitive process of writing [4], I’ve created mental models for how I develop source code with and without an AI assistant. The figure below shows four imagined centers of the brain responsible for various programming activities. I typically start with a task description and some existing source code. The description can be anything from a vague idea to a formal specification. The source code provides context and constraints, e.g., programming language and the style of error handling. My prompter initiates my programming task, which is still nothing but a natural language thought. The translator turns my ideas into programming language constructs, but I still haven’t typed any source code. Then, I enter a considerably more iterative process in which my evaluator and transcriber work in tandem. I evaluate my solution for correctness and style while I evolve source code to fulfill the task at hand. Finally, I contribute my work to the shared codebase.
Since integrating a GPT model into my work process, I find that I’ve streamlined the cognitive model. The prompter forwards the task and context to my AI assistant. This means that I’m bypassing my previous translation step from thought to programming language construct. Thanks to the GPT model, I’m going straight from prompt to source code. This generated source code then enters the transcriber-evaluator cycle. Sometimes I return to the prompter to refine the GPT input. Eventually, I contribute my latest addition to the codebase.
A major difference compared to the manual way of working, which we illustrate in the figure, is that I feel that my evaluator is less prominent in the flow. There is a risk that GPT-accelerated programming somewhat decouples fundamental evaluative steps. Hasty use could quickly spin a good project into trouble. However, sensible explanations from the GPT model, such as line-by-line confidence scores and warnings about performance bottlenecks or potential vulnerabilities, could mitigate the negative consequences.
The Researchers’ Story: Explainable Patches
Researchers have recently proposed various APR tools to fix bugs with no human intervention. An APR tool takes a buggy program and an oracle that identifies the bug, such as a unit test, and generates a source code patch.
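As a toy illustration of this setup (our own example, not the output of any specific APR tool), consider a buggy function, a unit test acting as the oracle, and the one-line patch an APR tool would aim to synthesize:

```python
# Buggy program: the upper bound of the interval is excluded by mistake.
def interval_contains(lo, hi, x):
    return lo <= x < hi  # bug: x == hi should also count as inside

# Oracle: a unit test that fails on the buggy version.
def test_interval_contains():
    assert interval_contains(0, 10, 10)

# A patch synthesized by an APR tool is a small source code change, e.g.:
# -    return lo <= x < hi
# +    return lo <= x <= hi
```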
Explainable APR patches are of particular importance in code reviews. Detailed descriptions of changes made in a patch facilitate the code review process. This usually happens in the form of pull request descriptions. However, in APR, there is no human developer to write them and engage in discussion with the code reviewer. Thus, APR tools must be equipped with techniques to augment the synthesized patches with clear explanations.
Recently, state-of-the-art APR has been disrupted by AI. The key idea is to use GPT models trained on code, such as OpenAI Codex, to fix a given bug. In particular, impressive progress has been achieved by using neural machine translation on source code to produce patches; we have recently seen an explosion of academic papers on that topic. Next, we discuss two perspectives on AI explainability in the APR context.
Explaining AI-Generated Patches
A major explainability problem is that neural networks are black boxes. Furthermore, the patches they produce are “dry,” i.e., without any explanation. Consequently, there is a dire need to devise techniques to explain AI-generated patches. How can we do that?
First, we can use feature-importance-ranking techniques to clarify how a code model generates a patch [5]. The core idea is to alter the input and study how it impacts the output. The main advantage of this approach is that it can be added on top of any system, no matter the machine learning model employed.
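As a minimal sketch of this perturbation idea, assume a black-box generate_patch function wrapping whatever model is in use; the token-dropping and scoring below are our illustration, not the exact method of [5]:

```python
import difflib
from typing import Callable, List, Tuple

def rank_token_importance(
    buggy_tokens: List[str],
    generate_patch: Callable[[str], str],
) -> List[Tuple[str, float]]:
    """Rank input tokens by how much removing them changes the generated patch."""
    baseline_patch = generate_patch(" ".join(buggy_tokens))
    scores = []
    for i, token in enumerate(buggy_tokens):
        perturbed = buggy_tokens[:i] + buggy_tokens[i + 1:]  # drop one token
        patch = generate_patch(" ".join(perturbed))
        # The more the patch changes, the more important the dropped token.
        similarity = difflib.SequenceMatcher(None, baseline_patch, patch).ratio()
        scores.append((token, 1.0 - similarity))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```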
Second, there is a line of research on chain-of-thought inference, in which the model is requested to explain its reasoning process. Zhang et al. contributed to this space by studying how code generation models can “think” step by step [6]. This requires underlying GPT models that have been trained on natural language and not only on code, as is the case with ChatGPT. The early results are promising.
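For intuition, such a request can be as simple as asking the model to spell out intermediate steps before emitting a patch. The prompt below is our illustrative wording, not the prompting scheme of Zhang et al. [6]:

```python
buggy_code = "def last_element(xs):\n    return xs[len(xs)]\n"
failing_test = "assert last_element([1, 2, 3]) == 3  # raises IndexError"

prompt = (
    "The following function is buggy.\n\n"
    f"{buggy_code}\n"
    f"Failing test: {failing_test}\n\n"
    "Think step by step: explain why the test fails, propose a patch, "
    "and explain why the patched function passes the test."
)
print(prompt)
```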
Using AI for Explanation Generation
Not all APR solutions are based on AI. Several tools instead build on program analysis. For such solutions, AI can generate an explanation for the non-AI patch after the fact. There are two popular means to deliver natural language explanations to developers.
A patch explanation is similar to a commit message for that patch. Thus, we can build on research for AI-based commit message generation to explain APR patches. Jiang et al. express this as a translation task and use neural machine translation methods to turn code changes into commit messages [7], just like we translate from French to English. More sophisticated AI-based commit message generation techniques, like the tool ATOM [8], use the source code’s abstract syntax tree information in the neural network training process. However, using translation techniques requires addressing a particular code-text translation challenge: out-of-vocabulary words. Here, the vocabulary of the explanation is different from that of the patch itself!
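To see why out-of-vocabulary words bite here, consider a word-level translation view in which the output vocabulary is learned from past commit messages. Project-specific identifiers in the diff are then unknown to the decoder. The toy example below is ours; tools such as ATOM [8] instead rely on learned vocabularies and abstract syntax trees:

```python
# Vocabulary learned from natural-language commit messages (toy example).
message_vocab = {"fix", "null", "check", "in", "equals", "comparison", "add", "use"}

# Tokenized diff of a hypothetical patch.
diff_tokens = [
    "-", "if", "(", "cs1", "==", "cs2", ")",
    "+", "if", "(", "cs1", ".", "regionMatches", "(", "cs2", ")", ")",
]

java_keywords = {"if", "else", "return"}
oov = {
    tok for tok in diff_tokens
    if tok.isidentifier() and tok not in java_keywords and tok not in message_vocab
}
print("Identifiers the decoder cannot produce:", sorted(oov))
# -> ['cs1', 'cs2', 'regionMatches']
```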
Finally, a conversational interface such as ChatGPT can deliver an explanation. The figure below shows how this works for an Apache Commons Lang project bug fix. We first give ChatGPT the fixing patch, the test that fails on the buggy version, and some prompt decoration, including the command, “Explain how the patch fixes the failing program to pass the test” [see (a)]. ChatGPT answers that the patch fixes the “equals” method by comparing the two given CharSequences, even if one of them is not of the String type. In that case, the “regionMatches” function is used for comparison [see (b)]. In a purely conversational manner, we further ask ChatGPT, “Why did the test fail before the patch?” [see (c)]. ChatGPT then explains that the test fails before the patch because it gives a CharSequence that is an instance of StringBuilder to the “equals” function being tested [see (d)]. We believe that there is a promising future for using conversational pretrained bots like ChatGPT to explain APR patches.
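A minimal sketch of this prompt construction follows, assuming the openai Python client as it looked in early 2023; the patch and test strings are placeholders for the Apache Commons Lang artifacts shown in the figure:

```python
import openai  # assumes the pre-1.0 client and an OPENAI_API_KEY in the environment

patch = "<the fixing patch, as a unified diff>"
failing_test = "<the test that fails on the buggy version>"

prompt = (
    f"Here is a patch:\n{patch}\n\n"
    f"Here is a test that fails before the patch:\n{failing_test}\n\n"
    "Explain how the patch fixes the failing program to pass the test."
)

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(response["choices"][0]["message"]["content"])

# The conversation can then continue with follow-ups such as
# "Why did the test fail before the patch?"
```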
Wrap-up
The results of Sun et al.’s structured requirements-elicitation process align with our personal experiences. When first encountering AI-assisted code generation, developers typically raise a host of how, why, and what-if questions. Finding ways to quench an inquisitive developer’s thirst for understanding the underlying process and reliability of the output will be fundamental for long-term success. Whether developers need to safeguard against GPT hallucinations, understand an automatic patch, assess a proposed refactoring, or evaluate the feasibility of a code snippet, explainability will be a prerequisite for building trust and confidence in the technology.
Finally, we highlight the practical solutions proposed in Sun et al.’s study. They suggest augmenting the AI’s output with uncertainty information, such as adding a wavy underline for lines of code where the AI is in doubt. Another proposal is to display the AI model’s attention distribution. A model could, for example, indicate which words in the user’s prompt were given the most importance through different levels of opacity. Which keywords in the prompt mattered the most?
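To make the first proposal concrete, here is a small sketch of our own (not the prototype from Sun et al. [3]) that flags generated lines whose confidence falls below a threshold, which an editor plugin could render as a wavy underline:

```python
from typing import List, Tuple

def flag_uncertain_lines(
    lines_with_confidence: List[Tuple[str, float]],
    threshold: float = 0.6,
) -> List[str]:
    """Prefix low-confidence lines so an editor can render a wavy underline."""
    return [
        ("~ " if confidence < threshold else "  ") + line
        for line, confidence in lines_with_confidence
    ]

suggestion = [
    ("df = pd.read_csv('data.csv.gz')", 0.93),
    ("df = df.drop_duplicates()", 0.88),
    ("df.to_parquet('clean.parquet')", 0.41),  # the AI is in doubt here
]
print("\n".join(flag_uncertain_lines(suggestion)))
```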
We find these times of AI advancements to be thrilling! Developers are in for a ride thanks to soon-to-be-disrupting AI assistants. We’ll keep working on researching and developing explainable AI-assisted code generation. Please reach out if you want to discuss what the AI must tell you!
References
[1] J. Weisz et al., “Perfection not required? Human-AI partnerships in code translation,” in Proc. 26th Int. Conf. Intell. User Interfaces, 2021, pp. 402–412.
[2] L. Chazette and K. Schneider, “Explainability as a non-functional requirement: Challenges and recommendations,” Requirements Eng., vol. 25, no. 4, pp. 493–514, 2020.
[3] J. Sun et al., “Investigating explainability of generative AI for code through scenario-based design,” in Proc. 27th Int. Conf. Intell. User Interfaces, 2022, pp. 212–228.
[4] L. Flower and J. Hayes, “A cognitive process theory of writing,” College Composition Commun., vol. 32, no. 4, pp. 365–387, 1981.
[5] Y. Wang et al., “WheaCha: A method for explaining the predictions of models of code,” arXiv:2102.04625, 2021.
[6] Z. Zhang et al., “Automatic chain of thought prompting in large language models,” arXiv:2210.03493, 2022.
[7] S. Jiang et al., “Automatically generating commit messages from diffs using neural machine translation,” in Proc. 32nd Int. Conf. Autom. Softw. Eng. (ASE), 2017, pp. 135–146.
[8] S. Liu et al., “ATOM: Commit message generation based on abstract syntax tree and hybrid ranking,” IEEE Trans. Softw. Eng., vol. 48, no. 5, pp. 1800–1817, 2022.