Rigorous Research When Models Shift Weekly

Our latest AI article just got published in Springer’s journal Empirical Software Engineering! It’s the journal known to publish the most rigorous studies in the field. Great news, what a milestone! A reasonable reaction would be to start planning for a dissemination tour to share the results with the world. So what are we waiting for? Well, we probably won’t even bother running an internal CodeScene meeting based on this. Why? Because the AI world and the academic publishing system operate on totally different time scales. This post is less about the result itself than about what it taught me about doing academic research in a field that refuses to stand still.

The Experimental Design

This is the story behind by far the most meticulous study we have conducted here at CodeScene. Adam introduced me to Dave Hewett at Equal Experts in January 2024, sharing that Dave wanted to conduct a two-phase experiment on the maintainability of code developed with AI assistance. As this is a topic we’re genuinely interested in at CodeScene – and one I care about personally – it immediately caught my attention. We met and started refining Dave’s ideas into an experimental design.

The core ideas were:

  • The phenomenon under study should be manual evolution of previously AI-assisted development – not the direct effect of using AI.
  • The task should represent a realistic and common development task, reasonable to complete in 2-4 hours.
  • The subjects should primarily be professional developers – not students as is commonly the case in controlled experiments.
  • The assignments should be completed remotely in the subjects’ preferred dev environments – with automated task distribution and non-intrusive instrumentation.

Anyone who has tried designing a study like this knows how hard it is. Every step contains numerous design decisions, large or small, that influence the experimental variables at play. Balancing control and realism is at the core. We decided to go for the most rigorous design, the one that allows causal claims based on the results: a randomized controlled trial (RCT). Randomly assigning subjects to either the treatment group or the control group mitigates the effects of variables that aren’t actively controlled by the design, e.g., current mood, energy levels, disturbing neighbors, keyboard failures and whatnot. Still, there are two typical showstoppers when designing an RCT for software engineering: 1) designing a realistic context and 2) recruiting trial participants.

Showstopper 1: Experimenting Within a Realistic Context

The experimental target system must be meaningful to study, or everything falls apart. Who cares about experimental results on a toy example developed by students, for example? But if the system is too large and complex, there is no way the participants can complete a task in it within a reasonable time frame. Striking that balance takes careful judgment. This is where our first unfair advantage came in!

A group of Equal Experts consultants joined the design of our study, and in a series of meetings with iterative development, we jointly refined the core ideas into detailed requirements on both the system and the corresponding development tasks. Special thanks go to Donald Graham and Neil Moore, who designed and implemented the system and tested it for realism. We went for a task in a down-to-earth domain – cooking – and a common Java stack that many developers should be able to onboard quickly. Given the consultants’ long experience working with various companies, the practical realism of the system and task was largely validated by design. Because we developed the system and the tasks from scratch, we could also carefully make sure that they couldn’t leak into LLMs’ training data.

In short, the system details were the following:

  • Java web application for a recipe service built on Spring Boot
  • Spring Data JPA for database access
  • Spring MVC for REST endpoints
  • Main features: a) list all recipes and b) filter recipes by a search term

Then we designed two incremental evolution tasks for participants to implement in the recipe service. Task 1: Extend the search feature to filter recipes by cooking time. Task 2: Extend the search feature to filter recipes by cost per serving. Subjects would either do Task 1 – with or without AI assistance – or do Task 2 manually. Task 2 was designed to necessitate building on what was done (by someone else) in Task 1.
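To make the two tasks concrete, here is a minimal plain-Java sketch of what layering such filters on top of a term search could look like. It is not the study’s code: the real system was a Spring Boot service, and every name below (RecipeSearch, Recipe, cooksWithin, costsAtMost) is hypothetical. It also assumes that both new filters are upper bounds, which the task descriptions above don’t specify.

```java
import java.math.BigDecimal;
import java.util.List;
import java.util.Locale;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Illustrative sketch only; all names are hypothetical and not taken from the study's system.
final class RecipeSearch {

    // Hypothetical domain object with the two attributes the tasks filter on.
    static final class Recipe {
        final String name;
        final int cookingTimeMinutes;
        final BigDecimal costPerServing;

        Recipe(String name, int cookingTimeMinutes, BigDecimal costPerServing) {
            this.name = name;
            this.cookingTimeMinutes = cookingTimeMinutes;
            this.costPerServing = costPerServing;
        }
    }

    // Baseline feature: case-insensitive match of the search term against the recipe name.
    static Predicate<Recipe> matchesTerm(String term) {
        String needle = term == null ? "" : term.toLowerCase(Locale.ROOT);
        return r -> r.name.toLowerCase(Locale.ROOT).contains(needle);
    }

    // Task 1 (assumed semantics): keep recipes that take at most maxMinutes to cook.
    static Predicate<Recipe> cooksWithin(int maxMinutes) {
        return r -> r.cookingTimeMinutes <= maxMinutes;
    }

    // Task 2 (assumed semantics): keep recipes that cost at most maxCost per serving.
    static Predicate<Recipe> costsAtMost(BigDecimal maxCost) {
        return r -> r.costPerServing.compareTo(maxCost) <= 0;
    }

    // Applies a (possibly combined) filter to the recipe list.
    static List<Recipe> search(List<Recipe> recipes, Predicate<Recipe> filter) {
        return recipes.stream().filter(filter).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Recipe> recipes = List.of(
            new Recipe("Tomato soup", 25, new BigDecimal("1.80")),
            new Recipe("Beef stew", 120, new BigDecimal("4.50")),
            new Recipe("Tomato pasta", 15, new BigDecimal("2.20")));

        // The Task 2 filter builds on the Task 1 filter, which builds on the term search.
        Predicate<Recipe> filter = matchesTerm("tomato")
            .and(cooksWithin(20))
            .and(costsAtMost(new BigDecimal("3.00")));

        search(recipes, filter).forEach(r -> System.out.println(r.name));
    }
}
```

In this sketch the Task 2 filter composes cleanly with the Task 1 filter. In the experiment, how easy that step was depended on the Task 1 solution a participant inherited.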

Showstopper 2: Recruiting Relevant Participants

One of the biggest challenges when conducting controlled experiments in software engineering is finding subjects willing to complete tasks. Professional developers are busy and hard to recruit. The typical remedy has been to recruit students as subjects. Whether valid conclusions can be drawn from such a population has been debated at length in software engineering.

Here comes our second unfair advantage! Early in this project, Dave Hewett invited Dave Farley, a well-known figure in software engineering, to join our study. Farley is mostly known for his work on continuous delivery and currently runs a popular YouTube channel. He made this study possible by recording an episode inviting his followers to take part in our experiment. Hundreds of developers signed up, and in the end, 151 of them submitted complete solutions to our tasks. That’s a lot, and pretty unique for this type of study! They were all compensated by getting a signed copy of Dave’s latest book.

We did a prescreening of the volunteers focusing on their development experience and preferences regarding working with AI assistants. Subjects who preferred working with AI (which happened to be about 25% of the volunteers – this was probably the last chance to find enough non-AI developers on the planet!) were assigned to solve Task 1 with AI support, and the rest were randomly assigned to either do Task 1 manually, or manually evolve a Task 1 solution – which means completing Task 2. The completion rates were about the same for all groups, roughly 30-35%. This is a pretty good score, given how much work we asked the participants to do.
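The assignment logic described above can be sketched in a few lines of Java. This is an illustration under stated assumptions, not our actual tooling: the names (GroupAssignment, Volunteer, assign) are hypothetical, and so are the fixed seed and the alternating equal split between the two manual groups, since the post doesn’t state how the randomization was implemented or what allocation ratio was used.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Illustrative sketch only; names, seed handling, and the equal split are assumptions.
final class GroupAssignment {

    enum Group { TASK1_WITH_AI, TASK1_MANUAL, TASK2_MANUAL }

    // Hypothetical prescreening record: an id plus the stated AI preference.
    static final class Volunteer {
        final String id;
        final boolean prefersAi;

        Volunteer(String id, boolean prefersAi) {
            this.id = id;
            this.prefersAi = prefersAi;
        }
    }

    static Map<String, Group> assign(List<Volunteer> volunteers, long seed) {
        Map<String, Group> groups = new LinkedHashMap<>();
        List<Volunteer> rest = new ArrayList<>();

        // Volunteers who prefer AI are assigned by preference, not by chance.
        for (Volunteer v : volunteers) {
            if (v.prefersAi) {
                groups.put(v.id, Group.TASK1_WITH_AI);
            } else {
                rest.add(v);
            }
        }

        // Everyone else is shuffled with a seeded generator (so the allocation can be
        // reproduced) and then split alternately between the two manual groups.
        Collections.shuffle(rest, new Random(seed));
        for (int i = 0; i < rest.size(); i++) {
            groups.put(rest.get(i).id, i % 2 == 0 ? Group.TASK1_MANUAL : Group.TASK2_MANUAL);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Volunteer> volunteers = List.of(
            new Volunteer("v1", true),
            new Volunteer("v2", false),
            new Volunteer("v3", false),
            new Volunteer("v4", false),
            new Volunteer("v5", false));

        assign(volunteers, 42L).forEach((id, group) -> System.out.println(id + " -> " + group));
    }
}
```

Only the split between the two manual groups is randomized in this sketch. Membership in the AI group follows from preference.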

What Else Deserves Special Mention?

We solved the two showstoppers: lacking a meaningful context for the study, and recruiting participants who truly represent the population we’re interested in. What else did we do? On top of that, we adhered to five other practices that represent best-in-class empirical research:

  • We preregistered the study design. This means we presented our hypotheses and study design upfront and submitted them for peer review. No fishing for significant results. Our two research questions were:
    • RQ1) Do developers manually evolve code that has been co-developed with AI assistants more efficiently?
    • RQ2) Does code co-developed with AI assistants result in higher quality upon manual evolution?
  • We used mixed-methods research, combining software instrumentation for quantitative analysis with questionnaires and free-text data collection for subjective comments. This ensured an analysis of rich data.
  • We relied on software engineering constructs that have been validated in previous research: CodeScene’s CodeHealth for maintainability and the SPACE framework for productivity.
  • Despite the safeguards in place for a pre-registered RCT, we complemented the frequentist statistics with Bayesian analysis. This gives us uncertainty estimates based on the causal model we transparently published.
  • We used a professional system, snapcode.review, for distributing take-home challenges to participants. The developer of the system, Uttam Kini, tailored it to our needs and joined the paper as a co-author.

In conclusion, we did our utmost to design a good study. The registered report was submitted to ICSME 2024 and received an “in-principle acceptance” for publication in Empirical Software Engineering. Contemporary empirical software engineering doesn’t get better than this.

What Did We Learn From the Research?

With all these pieces of scientific excellence in place, what did the results show? Well, we didn’t see much of an effect from using AI assistants. This is what the research community refers to as negative results. Oh, the beauty of science! Such results are actually harder to publish because reviewers tend to find them less interesting, which creates a publication bias toward papers with significant results. That is another reason to preregister studies, since they will be published no matter how the results come in… And here they are:

RQ1) Is AI-assisted code more efficient to evolve manually? Our RCT provides no reliable evidence that AI-assisted code differs in how efficiently it can be evolved. Both completion times and productivity in Task 2 were consistent with a null effect, with at most small and highly uncertain differences between treatment (building on AI-assisted code) and control (building on manually written code). In short, if there is an effect, it’s so small that our experiment didn’t involve enough participants to detect it. For Task 1, on the other hand, observational results showed that working with an AI assistant yielded a 30.7% median reduction in completion time – and habitual AI users showed an estimated 55.9% speedup.

RQ2) Does prior AI use improve quality after manual evolution? The results provide no reliable frequentist evidence that prior AI-assisted development affects CodeHealth after continued manual evolution. Bayesian analysis suggests a small CodeHealth improvement when the original Task 1 solution was co-developed with AI by a habitual AI user. This suggests that experienced AI-empowered developers use the tooling to remove more code smells than developers working manually. This is actually a result in AI’s favor, which wasn’t necessarily what we expected.

A perfectly valid question that every study should answer after presenting the results is: “So what?” We present three perspectives on practical implications for this work.

  1. AI assistants tend to help with file-level maintainability issues. From a maintainability perspective, our findings suggest that developers who prefer working with AI assistants should continue doing so. Habitual AI users appear to use these supporting tools in ways that benefit future manual evolution.
  2. Reckless use of AI assistants quickly bloats codebases (we saw the signs of that). With the cost of creating new code approaching zero, the effort to understand and maintain existing code remains. This asymmetry risks bloating codebases with redundant logic, abandoned attempts, and code with questionable purpose. Organizations should carefully maintain human oversight – I recommend software visualization to monitor where and by whom code is added.
  3. No one really knows how to do knowledge management in the AI era. What skills will be central for future generations of developers? Universities are scrambling to understand what to include in course curricula. For sure, some skills will be lost when using AI. Everyone should beware of cognitive and intent debt going forward. We’ve had long discussions about this with Prof. Margaret-Anne Storey, who was also involved in developing the SPACE framework. Reading her thoughts on this is time well spent!

What Did We Learn About Doing Research?

All good: a preregistered study was finally published in a top journal. So far, this all reads like a huge success. There is, however, one major downside. The participants in our study completed their tasks in late 2024, primarily using IDE-based code-completion assistants and prompt-based programming. The paper was published in June 2026. In between, in May 2025, Claude Code was released and agentic development took over, completely changing the picture. The ways of working have been disrupted again – and now a single prompt to a coding agent would be enough to solve Task 1 in a minute.

The figure shows a timeline of what we have done and how the world moved during this period. Under the timeline are six product releases that characterize the rapid movement of AI in software engineering:

  • A) GitHub Copilot: AI pair programming becomes mainstream.
  • B) ChatGPT: Conversational software work becomes normal.
  • C) Cursor: The rise of AI-native development environments.
  • D) Devin: The idea of the AI software engineer enters the mainstream.
  • E) Lovable: AI prototyping broadens who can participate in software creation.
  • F) Claude Code: Terminal-native coding agents enter developer workflows.

Above the timeline are the dates related to this study, with the milestones in bold font and the rest of the important dates in normal font.

  • Jan 25, 2024 – First emails with Dave Hewett
  • Jul 25, 2024 – Registered report submitted to ICSME 2024
  • Oct 11, 2024 – Study design presented at ICSME 2024 in Flagstaff, Arizona
  • Nov 13, 2024 – Dave Farley announced our study and called for volunteers
  • Jan 18, 2025 – Data collection concluded (see the blue rectangle)
  • Jul 3, 2025 – Manuscript submitted
  • Oct 20, 2025 – Editor decision: Major revision
  • Dec 18, 2025 – Major revision submitted
  • Jan 27, 2026 – Dave Farley published a summary video. 250k views in 4 weeks! (And oh boy, there are keyboard warriors in those comment fields…)
  • Feb 16, 2026 – Editor decision: Minor revision
  • Feb 26, 2026 – Minor revision submitted
  • May 11, 2026 – Manuscript accepted
  • Jun 9, 2026 – Manuscript published

What can be said about the above process? Academic publishing is slow, and the AI race is moving at full pace. We share the same interests, but live worlds apart. Actually, seeing your paper published in a top journal less than 12 months after submission is quite good. Sure, it could have been faster, for two reasons. First, the registered report got an “in-principle acceptance” at ICSME, and thus the initial review could have been faster. But the reviewers did a good job and correctly requested that we tone down the language and stay closer to what the data supported. We were overly AI-positive in the first version, and at the time the community was still very skeptical. Second, the major revision came back in two months with a minor revision decision. We submitted the minor revision within 10 days, and such revisions are typically handled quickly. But somewhere an editor became non-responsive and nothing happened until we reached out to the editors-in-chief in early May.
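For readers who want to check the arithmetic, the intervals can be computed directly from the dates in the list above. The snippet below is a small sketch using java.time; the class and method names (ReviewTimeline, daysBetween) are made up for illustration, and the dates are the ones listed in this post.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Illustrative sketch: interval arithmetic on the milestone dates listed in this post.
final class ReviewTimeline {

    // Number of whole days from one date (inclusive) to another (exclusive).
    static long daysBetween(LocalDate from, LocalDate to) {
        return ChronoUnit.DAYS.between(from, to);
    }

    public static void main(String[] args) {
        LocalDate submitted = LocalDate.of(2025, 7, 3);
        LocalDate minorRevisionSubmitted = LocalDate.of(2026, 2, 26);
        LocalDate accepted = LocalDate.of(2026, 5, 11);
        LocalDate published = LocalDate.of(2026, 6, 9);

        System.out.println("Submission to publication: "
            + daysBetween(submitted, published) + " days");
        System.out.println("Minor revision to acceptance: "
            + daysBetween(minorRevisionSubmitted, accepted) + " days");
        System.out.println("Published within 12 months of submission: "
            + published.isBefore(submitted.plusMonths(12)));
    }
}
```

Run as written, this reports 341 days from manuscript submission to publication, of which 74 days passed between the submitted minor revision and acceptance.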

But in June 2026, we published the most rigorous research article on downstream effects of AI assistants! Too bad it’s a great analysis of previous generations of AI assistance (code completion and chat-based programming). Now we’re all doing agentic AI…

Why Do We Still Do All This?

At CodeScene, we invest quite some effort in academic research. We conduct careful experiments and insightful case studies, do large-scale repository mining – and publish the results as openly as possible. We read what others publish and build on it, just like we hope others can build on our results. Software engineering is a very complex activity to get right, and we’re all in this together. We need to find more effective and efficient ways of building software to solve the biggest challenges of our time.

That belief only means something if you back it with real effort, so we also serve the community that makes this work possible. As we publish this post, we’ve just reviewed papers for the ASE 2026 research track and we’re finishing the paper selection as co-chairs of the ESEM 2026 software-engineering-in-practice track – two commitments that alone add up to about half a work month. And they’re only two of the conferences we support this year, alongside editorial work for Empirical Software Engineering and IEEE Software, and regular reviewing for other top journals. None of it moves at the speed of the AI race – and that’s exactly the point.

Because here’s what this study taught me beyond its results: we ran the most rigorous experiment we’ve ever attempted, published it in a top journal – and it was already a time capsule from a previous era by the time it appeared. That isn’t a failure. It’s what a real field looks like. There’s always a trade-off between rigor and relevance in software engineering, and I usually stand on the relevance side, doing applied research that helps engineers today. The RCT was a deliberate U-turn into rigor – studying a genuinely interesting phenomenon in a realistic context. For one sliver of time, it was absolutely a valid finding, and that sliver was worth every month it took. Science is not in a rush. Sometimes we shouldn’t be either, and should instead make time for deep reflection. Or some well-earned hammock time! I’m off on vacation now – let’s see what the world looks like on the other side of summer. I’ve never before gone on vacation with the sense that the field might shift this much before I return.