AI coding tools feel like they're working. The data says otherwise.

A sharp senior developer I know, with 15 years of experience, recently told me his team was "crushing it" with Claude Code.

"We're shipping twice as fast," he said.

I asked him how he knew.

Silence. Then: "It feels faster."

This conversation keeps happening. And the research tells us something uncomfortable: that feeling is probably wrong.

In July 2025, researchers at METR ran a randomized controlled trial. They recruited 16 experienced open-source developers, with an average of 5 years and 1,500 commits on their respective projects. The codebases averaged over a million lines of code. These weren't juniors learning the ropes. These were experts working on systems they knew intimately.

Each developer was assigned real tasks from their own repositories. Half the time, they could use AI tools (primarily Cursor Pro with Claude). Half the time, they couldn't.

Before starting, the developers predicted AI would make them 24% faster.

After finishing, they believed they'd been 20% faster.

The actual result? They were 19% slower when using AI.

Read that again.

A 39-percentage-point gap between perception and reality. The developers genuinely believed AI was helping them, even as it was slowing them down.

The METR study found that developers spent about 9% of their total task time just reviewing and cleaning AI-generated output. For a full-time developer, that's nearly four hours per week that could have been spent writing code they actually understood.

But those are just the measurable costs. There are costs that are harder to measure, and they are worse.

Cognitive load shifts from construction to evaluation. Writing code is a familiar, constructive task. Evaluating someone else's code, especially code that looks plausible but might be subtly wrong, is a different kind of work entirely. It's exhausting in a way that doesn't register as "work."

Context gets lost. AI tools can see your codebase, but they can't understand its history, its architectural decisions, or the reasoning behind that fragile piece of code that's been there since 2019. Developers reported "context rot": feeding the model more information sometimes made outputs worse, not better, as the AI pulled in irrelevant details.

Trust breaks down. Google's 2024 DORA report found that 39% of developers have little or no trust in AI-generated code. Yet 75% said they felt more productive with AI tools. The tools make people feel productive while simultaneously eroding their confidence in the output.

The Productivity Paradox

The Faros AI research team analyzed telemetry from over 10,000 developers across 1,255 teams. They found that developers using AI completed 21% more tasks and created 98% more pull requests.

Sounds great, right?

But their PR review times ballooned by 91%.

The AI helped developers throw code over the wall faster, but the work just piled up on the other side of the wall: code review, QA, integration testing. The bottleneck moved from developers to reviewers, and overall delivery velocity didn't improve.

This is Amdahl's Law at work in real time: the overall speedup is capped by the parts of the pipeline you didn't accelerate. Making one stage faster just exposes the next constraint.
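
To make that concrete, here's a rough back-of-the-envelope sketch. The stage times are invented for illustration (only the +91% review multiplier echoes the Faros figure), so treat it as a thought experiment, not a model of any real team:

```python
# Back-of-the-envelope: does faster coding actually shorten delivery?
# Stage times (hours per feature) are made up for illustration.

def lead_time(stages: dict) -> float:
    """Total time a feature spends moving through serial pipeline stages."""
    return sum(stages.values())

baseline = {"coding": 10.0, "review": 6.0, "qa_and_integration": 8.0}

# Hypothetical AI-assisted scenario: coding is twice as fast,
# but review grows ~91% because there's more code to check.
with_ai = {"coding": 5.0, "review": 6.0 * 1.91, "qa_and_integration": 8.0}

print(f"Baseline lead time: {lead_time(baseline):.1f} h per feature")
print(f"With AI:            {lead_time(with_ai):.1f} h per feature")
# Coding got 2x faster, yet end-to-end delivery didn't improve at all,
# because the constraint shifted from writing code to reviewing it.
```

Swap in your own numbers; the shape of the result is what matters. Unless the downstream stages speed up too, faster code generation mostly changes where the queue forms.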

If copilot-style tools have mixed results, what about fully autonomous coding agents? You know… the ones that promise to work like "a junior developer who never sleeps"?

The data is sobering.

Answer.AI, an AI research lab founded by Jeremy Howard and Eric Ries, spent a month testing Devin, Cognition's heavily funded autonomous coding agent. They gave it 20 real-world tasks.

The results: 14 failures, 3 inconclusive, 3 successes. A 15% success rate.

"More concerning was our inability to predict which tasks would succeed," the researchers wrote. "Even tasks similar to our early wins would fail in complex, time-consuming ways. The autonomous nature that seemed promising became a liability—Devin would spend days pursuing impossible solutions rather than recognizing fundamental blockers."

One example: when asked to deploy multiple applications to Railway, Devin spent over a day trying approaches that didn't work and hallucinating features that didn't exist, rather than recognizing that the task wasn't supported.

On the SWE-bench benchmark, which tests AI's ability to resolve real GitHub issues, even the best autonomous agents solve only about 23% of problems on the harder "SWE-bench Pro" test. When evaluated on private codebases the models haven't seen before, performance drops further, to around 15-18%.

Cognition's own 2025 performance review acknowledges the pattern: "Devin does best with clear requirements. Devin can't independently tackle an ambiguous coding project end-to-end like a senior engineer could." It excels at defined, contained tasks like vulnerability fixes, migrations, and test coverage, but not the messy, judgment-heavy work that fills most engineering days.

The marketing promised autonomous software engineers, but the reality is a tool that works well for specific, well-scoped tasks and struggles with everything else.

Why Experts Get Hit Hardest

The METR slowdown hit experienced developers hardest, which makes sense once you think about it.

Junior developers or people working on unfamiliar codebases often see productivity gains. The AI can help them scaffold solutions, fill knowledge gaps, and avoid obvious mistakes. When you don't know the right approach, having a plausible suggestion is valuable.

But when you already know what you're doing, the AI becomes overhead. You're waiting for it to generate something you could have typed faster. You're correcting its misunderstandings of your architecture. You're cleaning up code that almost works but doesn't quite fit.

One developer quoted in an MIT Technology Review piece put it bluntly: after tracking his own productivity for six weeks, flipping a coin on each task to decide whether to use AI, he found that AI slowed him down by 21%. He had estimated a 25% speedup.

What Actually Works: Best Practices That Deliver Real Value

The developers who get genuine productivity gains share specific habits.

Based on research from Google Cloud, DX, and practitioners who've measured their own output, here's what separates effective AI-assisted development from expensive theater:

  • Start with a plan, not a prompt. The biggest mistake is jumping straight to code generation. The developers who see real gains spend time upfront: documenting requirements, analyzing current code structure, and defining tests that will validate success. Google's engineering teams found that creating an explicit execution plan and saving it to a file (and having AI tools read and use it for context) dramatically improves output quality. The AI becomes a collaborator in planning, not just a code generator.

  • Match the tool to the task. Inline autocomplete for writing a function. Chat-based assistants for explaining unfamiliar code. Agentic tools for multi-file migrations with clear specifications. Don't use a sledgehammer to drive a finishing nail. One practitioner noted: "Autocomplete is convenient but doesn't improve productivity by much on its own. The cognitive load remains squarely on your shoulders."

  • Break work into small, verifiable chunks. Generate in increments, run tests after each integration, and commit frequently so you can roll back when AI suggestions take your project sideways (there's a minimal sketch of this loop after the list). Large chunks of AI-generated code pasted wholesale create hidden errors, broken dependencies, and missed edge cases that cost more time than they save.

  • Provide exhaustive context before asking for anything. AI tools lack business context, architectural history, and domain knowledge. The more you explain—constraints, edge cases, reasons behind past decisions—the better the output. Treat the prompt like a briefing for a new team member, not a search query.

  • Stay in control of architecture. AI struggles with complex interactions, strategic planning, and long-term thinking. Define the overall design yourself, then delegate specific, well-defined implementation tasks. As one engineering team put it: "Developers should remain firmly in control of designing complex interactions and software architecture."

  • Review everything. Trust nothing. AI-generated code looks more confident than it deserves to be. Run it, trace through it, ask: what am I missing? Studies consistently show AI introduces security vulnerabilities and logical flaws that look correct at first glance. You remain accountable for every line committed to the repository.

  • Know when not to use AI. Some tasks are faster without the overhead of prompting, waiting, reviewing, and correcting.
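
To make the "small, verifiable chunks" habit concrete, here's a minimal sketch of the loop, assuming a git repository with a test suite runnable via pytest. The helper names are mine, not from any of the studies cited above:

```python
import subprocess

def tests_pass() -> bool:
    """Run the project's test suite; a non-zero exit code means failure."""
    return subprocess.run(["python", "-m", "pytest", "-q"]).returncode == 0

def integrate_chunk(description: str) -> bool:
    """Keep a small AI-assisted change only if the tests still pass."""
    if tests_pass():
        subprocess.run(["git", "add", "-A"], check=True)
        subprocess.run(["git", "commit", "-m", f"chunk: {description}"], check=True)
        return True
    # Tests failed: discard the chunk rather than stacking fixes on top of it.
    # (This reverts tracked files only; brand-new files would need `git clean`.)
    subprocess.run(["git", "checkout", "--", "."], check=True)
    return False
```

The script itself is beside the point. The discipline is: every generated chunk either proves itself against the tests or gets thrown away before it can hide errors in the diff.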

The Code Quality Concern

There's a longer-term issue beyond immediate productivity.

GitClear analyzed 211 million changed lines of code from 2020 to 2024. They found that code associated with refactoring dropped from 25% of changed lines in 2021 to less than 10% in 2024. Meanwhile, copy-pasted code rose from 8.3% to 12.3%. "Code churn," which is code that gets discarded within two weeks of being written, is projected to double.

More code is being written but less of it is being maintained.

As one analysis noted: "AI-induced technical debt is a good way to describe the side effects of overinflated expectations of AI-generated code." The speed benefits during initial development could be overshadowed by increased complexity during deployment, operations, and future iterations.

What This Means for Leaders

If you're running a technology organization, the implications are serious.

  • Don't trust productivity metrics that measure output. Lines of code, pull requests merged, tasks completed: these are proxies, not outcomes. Your developers might be generating more artifacts while shipping less value.

  • Invest in AI readiness, not just AI tools. Well-tested, well-structured projects get better results from AI. One senior developer observed: "Investing in code quality is no longer just about maintainability. It's about AI-readiness." The teams seeing real gains have strong foundations first.

  • Watch your review pipeline. If your developers are generating more code but your reviewers are drowning, you haven't improved anything. The constraint just moved.

  • Be skeptical of self-reported productivity gains. The METR study's most important finding wasn't the 19% slowdown; it was that developers believed the opposite. Self-assessment is unreliable when it comes to AI tools.

  • Establish clear guidelines. Define when to use AI versus traditional approaches. Create feedback loops so teams can improve their techniques. Adopt metrics beyond lines of code changed to evaluate real productivity.

  • Hire for investigation skills, not just coding speed. The developers who get value from AI tools are the ones who know how to ask good questions, provide good context, and verify outputs critically. These are the same skills that make people effective at debugging, architecture, and code review.

AI coding tools are powerful. They're also easy to misuse. The difference between getting value and wasting time isn't the tool; it's the human operating it.

The best engineers I know treat AI the same way they treat any other tool: with appropriate skepticism, clear expectations, and a willingness to put in the work before asking for help.

They're still asking better questions, still providing a lot of context, and still thinking before they type.

The tool might have changed, but the job hasn’t.

This is the kind of challenge I help CEOs work through: figuring out which AI investments actually deliver value versus which ones just feel productive. If it's on your radar, I'd be happy to talk it through. You can find me at ericbrown.com or connect on LinkedIn.
