
Developer Productivity · April 2026

The METR Study and the Developer Productivity Paradox

In July 2025, METR published a study that changed how I think about AI tools and software development.

They measured how long experienced professional developers took to complete real software tasks, with each task randomly assigned to allow or disallow AI tools including Claude, GPT-4o, and Gemini. The result: tasks were completed 19% more slowly on average when AI tools were allowed than when they were not.

Not junior developers. Not people learning a new language. Experienced engineers, on real-world software work.

The finding was widely discussed, and widely contested. A follow-up study in February 2026 showed a smaller effect — closer to 4% — with a wider confidence interval. The exact number remains uncertain.

What is not uncertain is the direction.


Why This Matters More Than the Number

The immediate response to the METR finding was to dispute the methodology. The tasks were open-source GitHub issues. The AI tools might have already seen similar code in training data. The developers might have been adapting to new workflows.

All of these are reasonable objections. But they don't explain away the core observation: something about adding AI tools to an experienced developer's workflow produces friction, not speed.

The most plausible explanation isn't that AI tools are bad at writing code. They're not. The most plausible explanation is that AI tools are optimized for a different problem than the one that actually slows developers down.


The Planning Problem vs. the Execution Problem

There are two distinct phases where a software project can fail.

The planning problem is failing to figure out what to build, how to architect it, which tools to use. This is the problem where AI assistance is genuinely transformative. You can go from vague idea to detailed architecture in minutes. Database schema, API design, component structure, deployment plan — AI tools are extraordinarily good at this.

The execution problem is failing to build it, even when you know what it is. This is the problem of the Tuesday at 7pm when you have 45 minutes and somehow end up on YouTube. Of the project that has been "80% done" for three months. Of the feature you keep starting and not finishing.

AI tools attacked the planning problem. The execution problem remained untouched — and arguably got worse.

Here's why: when planning is fast and cheap, you plan more often. You can redo your architecture in an afternoon. Cheap replanning lowers the cost of changing your mind, which raises your tolerance for staying uncommitted. The execution problem feeds on this. Every "maybe I should reconsider the approach" is a productive-feeling delay.


The Cognitive Science Behind This

This isn't unique to AI tools. It's a known pattern in cognitive science.

Planning Fallacy (Kahneman and Tversky, 1979): Humans systematically underestimate how long tasks will take and overestimate their future motivation. We plan from a best-case scenario — full focus, no interruptions, all dependencies resolved. We execute in reality.

The Zeigarnik Effect: Unfinished tasks consume more cognitive resources than finished ones. A project that has been "almost done" for six months creates ongoing mental overhead that compounds with every passing week.

Context Switching Cost: For developers with full-time jobs, side projects don't get contiguous time. They get fragments. Each fragment requires reconstructing the mental model of where things stand. AI tools can generate code faster than ever — but they can't reconstruct the 20 minutes of context-loading that precedes each session.

None of these problems are solved by a better autocomplete. They're structural.


Why Solo Side Projects Are Particularly Vulnerable

All of the above applies to professional software work in teams. For solo side projects, the problem is amplified by a factor the METR study didn't measure: the absence of external accountability.

In a professional context, there are forcing functions. A deadline someone else cares about. A standup where you report progress. A colleague who will notice if you disappear on a task.

Solo side projects have none of these. The only accountability is self-generated — and self-generated accountability is the weakest kind. Research on behavior change consistently shows that external commitment devices outperform internal motivation, especially for tasks requiring sustained effort over weeks.

This is why Buildspace worked. It wasn't the curriculum. It was the cohort, the checkpoint, and the fact that someone would notice if you disappeared on Day 4. When Buildspace shut down in August 2024, it left a gap no AI tool has filled — because the gap isn't in the planning layer.


What the METR Update Actually Changes

The February 2026 METR follow-up narrowed the original finding. The effect on experienced developers may be smaller than 19%, and more dependent on task type than originally implied.

The more honest framing: for complex, novel software tasks, the productivity benefit of current AI tools is uncertain, and may be negative.

For simpler, routine coding tasks, AI tools are clearly faster. The uncertainty lives in the middle — in sustained, complex execution on work you own alone. That's exactly where most side projects live.


The Accountability Gap

What developers with side projects actually need isn't a better plan.

Most of them have a plan. Many have a very good plan, generated with AI tools that are genuinely excellent at architecture and design. The plan isn't the problem.

The problem is the space between the plan and the shipped product. That space is filled with context switching, motivation decay, scope creep, and the absence of any external forcing function. This is the accountability gap.

A structured accountability system closes this gap differently than a planning tool. It doesn't generate better architecture. It creates checkpoints that are external to the developer's own motivation. It makes the cost of stopping visible, rather than invisible.

MVP Builder is a structured 30-day sprint built around this premise. You apply with your project. You receive daily prompts tailored to your stack. At defined milestones, your progress is reviewed by a human before you advance.

Cohort #1 is free. 8 spots. No credit card.

Apply to Cohort #1 →

Frequently Asked Questions

What did the METR study actually measure?

The METR study (July 2025) measured the time experienced professional developers took to complete real GitHub issues on open-source repositories, with each issue randomly assigned to allow or disallow AI tools (Claude, GPT-4o, Gemini). Tasks where AI tools were allowed took on average 19% longer to complete. A follow-up study (February 2026) found a smaller effect of approximately 4%, with a wider confidence interval.

Why would AI tools make experienced developers slower?

The most widely discussed explanation is that AI tools introduce additional decision points and revision opportunities. When planning and architecture are fast and cheap, developers may revise their approach more often rather than committing to execution. The tools are optimized for code generation, not for maintaining forward momentum on a project.

Does this mean AI coding tools are a bad investment?

No. For specific tasks — boilerplate generation, refactoring, documentation, unfamiliar syntax — AI tools provide measurable speed improvements. The uncertainty in the METR finding applies specifically to complex, novel software work where the developer needs to maintain strategic coherence across multiple sessions.

Why do solo side projects fail at higher rates than professional work?

The primary difference is external accountability. Professional software work has built-in forcing functions: deadlines that affect others, standups, colleagues who notice absence. Solo side projects have none of these. Self-generated accountability is less durable than external commitment devices, especially for tasks requiring sustained effort over weeks.

What is MVP Builder?

MVP Builder is a structured 30-day sprint for developers with a full-time job who want to ship a side project. You apply with your project description, receive daily prompts tailored to your stack and sprint tier (Bronze 13 days / Silver 21 days / Gold 30 days), and complete milestone checkpoints reviewed by a human before advancing. Cohort #1 is free.

How is MVP Builder different from an AI accountability bot?

AI accountability tools simulate external accountability — they detect patterns and generate reminders. What they cannot do is read your specific project description, evaluate whether your milestone output matches what you committed to, and make a judgment call based on context. That requires a human review layer. MVP Builder is built around this distinction.


Sources

  • METR: Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity (July 2025) — arxiv.org/abs/2507.09089
  • METR: Update on AI Uplift Study Design (February 2026) — metr.org/blog/2026-02-24-uplift-update/
  • Kahneman, D. & Tversky, A.: Intuitive prediction: Biases and corrective procedures (1979) — Planning Fallacy framework
  • Stack Overflow Developer Survey 2025: Trust in AI accuracy at 29% (down from 40% in 2024)
  • Carta State of Private Markets 2025: 36% of new US startups are solo-founded