Is Test Blindness Another Second-Order Effect of AI Coding Agents?

Six months into widespread agent adoption, some productivity gains are clear. What's less obvious is what changed underneath.

Before agents became the default, I'd circulated an internal memo to several of the teams I work with, describing what I expected to happen once code production was effectively democratised: more output, faster, but with a growing burden on reviewers and on long-term code health.

At the time, the conversation centred on speed and productivity, but I made a conscious effort to also raise accountability and what happens after the code lands.

Now it's time to start looking for emerging patterns, beginning with something I've been noticing across different teams: PRs are getting larger, and much of the growth is coming from test code.

The observation

PRs across the organisation were arriving with substantial test suites. CI was green. But it was increasingly difficult to distinguish between tests that reflected careful thinking about what to validate and tests that an agent had produced as part of its default output.

I started calling this test blindness, and it operates on two levels.

First, on the authoring side, agents have made including tests so frictionless that they're often added without the deliberation they would have received previously. The cost of writing tests used to enforce a natural filter: if someone wrote one, they'd thought about why it was necessary and what it covered. Now, when an agent produces tests from the source code, the engineer who ships them doesn't develop the same depth of understanding of what those tests actually validate. That understanding was a byproduct of the effort, and the effort has largely disappeared.

Second, on the reviewing side, reviewers begin to treat test code as an artefact of the source change rather than something that warrants independent scrutiny. After all, it arrived with the implementation and CI is green, so the natural tendency is to move past it.

The person writing the code doesn't think as hard about which tests to include, and the person reviewing it doesn't look as closely at the tests that arrived.

Test code then accumulates, and nobody in the chain has applied the level of judgment they would have before.

Checking the data

I pulled 12 months of Git history and GitHub metadata from one of our busiest repositories and measured what had changed.
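
If you want to run the same check, the Git side is straightforward. Below is a minimal sketch of the approach, assuming a conventional layout where test files sit under test directories or carry a test_/_test naming pattern. The looks_like_test heuristic and the function names are illustrative, not the exact script used here; the heuristic is the part you'd adapt to your own repo.

```python
import subprocess
from collections import defaultdict

def looks_like_test(path: str) -> bool:
    # Path heuristic (an assumption): adjust to your repo's conventions.
    name = path.lower()
    return ("/test" in name or name.startswith("test")
            or "test_" in name or "_test." in name or ".spec." in name)

def monthly_line_counts(repo_dir: str = "."):
    """Added lines per month, split into test vs source code."""
    log = subprocess.run(
        ["git", "log", "--numstat", "--date=format:%Y-%m", "--pretty=%ad"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout
    counts = defaultdict(lambda: {"test": 0, "src": 0})
    month = None
    for line in log.splitlines():
        parts = line.split("\t")
        if len(parts) == 3:              # numstat row: added, deleted, path
            added, _, path = parts
            if added != "-":             # "-" marks binary files
                bucket = "test" if looks_like_test(path) else "src"
                counts[month][bucket] += int(added)
        elif line.strip():               # pretty row: the commit's month
            month = line.strip()
    return counts
```

The monthly test share of new code is then just test / (test + src) for each month.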

Test volume more than doubled. The amount of test code landing each month grew to more than twice the pre-agent baseline over the 12-month period. The sharpest increase coincided with the first wave of broad agent adoption; it partially self-corrected as the tooling matured and teams developed better habits, but the sustained level remained well above the baseline. The growth was concentrated in a subset of PRs carrying large test payloads.

Tests were co-generated, not written independently. The proportion of commits bundling test and source code together rose by about half. Commits where someone went back to add or improve tests separately didn't increase at all.
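
The co-generation numbers come out of the same history. A rough sketch of one way to classify commits, reusing the looks_like_test heuristic above; "bundled" and "test-only" are my working definitions, with test-only commits as the proxy for someone going back to tests separately:

```python
def commit_mix(repo_dir: str = "."):
    """Share of commits bundling tests with source vs touching tests alone."""
    # %x00 emits a NUL byte per commit header, a safe separator for any path.
    log = subprocess.run(
        ["git", "log", "--name-only", "--pretty=%x00"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout
    bundled = test_only = total = 0
    for block in log.split("\x00")[1:]:
        files = [f for f in block.splitlines() if f.strip()]
        if not files:                    # merge commits list no files
            continue
        total += 1
        has_tests = any(looks_like_test(f) for f in files)
        has_source = any(not looks_like_test(f) for f in files)
        if has_tests and has_source:
            bundled += 1
        elif has_tests:
            test_only += 1
    return bundled / total, test_only / total
```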

Human review comments per PR dropped by about 40%. This decline began before any automated review tooling was in place, and it coincided with PRs getting larger.

Merge times didn't change. Despite larger PRs with more test code, merge velocity stayed flat.
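
The review-comment and merge-time numbers come from GitHub's PR metadata rather than Git itself. A sketch using the REST API, with the owner, repo, and GITHUB_TOKEN environment variable as placeholders; note that the list endpoint omits comment counts, so each merged PR needs one extra call to its detail view:

```python
import os
from datetime import datetime
import requests

API = "https://api.github.com"
HEADERS = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}

def parse_ts(ts: str) -> datetime:
    # GitHub timestamps end in "Z"; normalise for fromisoformat.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def merged_pr_stats(owner: str, repo: str, pages: int = 10):
    """Review-comment count and hours to merge for recently merged PRs."""
    stats = []
    for page in range(1, pages + 1):
        prs = requests.get(
            f"{API}/repos/{owner}/{repo}/pulls",
            params={"state": "closed", "per_page": 100, "page": page},
            headers=HEADERS, timeout=30,
        ).json()
        if not prs:
            break
        for pr in prs:
            if not pr.get("merged_at"):
                continue                 # closed without merging
            # One extra call per PR: the list payload lacks comment counts.
            detail = requests.get(pr["url"], headers=HEADERS, timeout=30).json()
            delta = parse_ts(pr["merged_at"]) - parse_ts(pr["created_at"])
            stats.append({
                "number": pr["number"],
                # Inline code comments; detail["comments"] holds the
                # conversation-tab comments if you want those too.
                "review_comments": detail["review_comments"],
                "hours_to_merge": delta.total_seconds() / 3600,
            })
    return stats
```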

[Figure: Test code ratio and time to merge over 12 months, one repository, May 2025 – April 2026. The share of new code that is tests rises from 15% to 25-30% while median time to merge stays flat at 16-18 hours.]

The structural problem

The interesting thing about test blindness is not the tests themselves. There are plenty of things worth examining about how agents are changing engineering work, good and bad. This one caught my attention first because it showed up across different teams, different codebases, and different types of PRs. I don't expect it to be the only thing worth measuring.

When the cost of producing code drops dramatically, output rises, and everything downstream has to absorb that increase: review, maintenance, onboarding, documentation. The processes around the code were designed for a certain volume; they don't automatically scale just because the tools got faster. The question for engineering leadership is which downstream processes have adapted and which haven't.

The filter I described earlier, the effort of writing tests, has now vanished entirely, and the speed at which agents can generate tests is only increasing.

Unless we replace that filter with something deliberate, maintenance costs will accumulate without any corresponding increase in confidence.

What comes next

Remember the reviewer. Agents can produce the code, the tests, the PR description, even the comments. But none of that changes the author's responsibility to make the change reviewable. If a human reviewer can't tell why a change was made and what the tests are actually validating, the PR isn't ready, regardless of who or what wrote it.

Apply the scepticism we already learned. When ChatGPT first arrived, everyone quickly figured out that the first answer was often agreeable but shallow. People adapted. That same scepticism hasn't caught up with generated code and tests. If an agent produces 30 tests, the question isn't whether they pass. It's whether all 30 are needed and whether they validate the right things. The cost of producing text dropped to near zero. That's not a reason to accept it uncritically.

Correct for the tool's defaults. Agents write tests by reading the implementation, so they naturally produce the kind of tightly coupled tests that engineering has known for decades to be the most fragile. The lesson isn't new: favour higher-level tests that describe how users interact with the system, tests that can survive a refactor. Agents will tend to work against that pattern unless you deliberately correct for it.
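
To make the contrast concrete, here is an illustrative pytest example. Cart, Item, and the discount logic are hypothetical stand-ins, defined inline so the tests run; the first test is the shape agents tend to produce from reading the implementation, the second describes behaviour a user actually depends on.

```python
from dataclasses import dataclass
import pytest

@dataclass
class Item:
    name: str
    price: float

class Cart:
    def __init__(self, items):
        self.items = items
        self._discount = 0.0

    def _apply_percentage(self, pct):
        self._discount = pct / 100

    def apply_discount(self, code):
        if code == "SAVE10":             # toy lookup for the example
            self._apply_percentage(10)

    def total(self):
        return sum(i.price for i in self.items) * (1 - self._discount)

def test_discount_brittle(mocker):
    # Coupled to internals: asserts HOW the result is computed, so any
    # refactor of _apply_percentage breaks it with no behaviour change.
    # (mocker is the pytest-mock fixture.)
    cart = Cart([Item("book", 20.0)])
    spy = mocker.spy(cart, "_apply_percentage")
    cart.apply_discount("SAVE10")
    spy.assert_called_once_with(10)

def test_discount_behavioural():
    # Coupled to the contract: asserts WHAT the user observes,
    # and survives any refactor that keeps the discount correct.
    cart = Cart([Item("book", 20.0)])
    cart.apply_discount("SAVE10")
    assert cart.total() == pytest.approx(18.0)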

Measure trends. Nobody is going to mention, or even notice, a shift in the test-to-source ratio or a drop in review comments overnight, and it's unlikely to surface in your regular weekly engineering meetings. These trends take time to play out, but measured over time they can tell you whether your processes are keeping up. The data for this post required nothing more than repo access and GitHub's API.
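
As a concrete example, the trend line plotted above falls out of the monthly_line_counts sketch from earlier in a few lines:

```python
def print_test_ratio_trend(counts):
    """Monthly share of new code that is test code, from monthly_line_counts."""
    for month in sorted(counts):
        test, src = counts[month]["test"], counts[month]["src"]
        if test + src:
            print(f"{month}  {test / (test + src):.0%}")

print_test_ratio_trend(monthly_line_counts("."))
```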


Test blindness is just one effect: the tools have changed, and not everything downstream has caught up yet.

Figuring out where the gaps are is the work now.