
The Bias Built Into Every Test Suite Your Developers Write
There is a structural flaw in how developers test their own code. AI does not share it. Here is what that means in practice.
Most engineering teams believe their test suites are more comprehensive than they are. This isn’t a failure of effort or intention. It’s the result of a bias so consistent across developers, teams, and codebases that it functions less like a mistake and more like a structural feature of how humans write tests.
The bias is this: developers test what they think their code does.
When a developer writes an implementation and then writes tests for it, both the implementation and the tests are products of the same mental model. The assumptions embedded in the code are also embedded in the tests. The edge cases the developer didn’t think of when writing the implementation are also absent from the test suite. The tests confirm the code. They don’t challenge it.
The result is a test suite that provides genuine confidence about the scenarios the developer considered, and no information at all about the scenarios they didn’t. Which is precisely where the bugs live.
AI doesn’t share this bias. And that single characteristic has more practical implications for software quality than most teams have yet worked out.
Why the bias is structural, not a skills problem
The instinct is to treat this as a discipline issue. If developers just tried harder, wrote more tests, reviewed their own test coverage more carefully, the problem would diminish. In practice it doesn’t, and understanding why matters.
The bias is structural because writing code and writing tests for that code use the same cognitive resource: the developer’s mental model of what the code is supposed to do. When you write a function that processes a list of transactions, you have in mind a set of scenarios: normal transactions, empty lists, large lists, transactions with standard edge cases. You implement the function with those scenarios in mind. Then you write tests for those scenarios.
What you don’t test are the scenarios you weren’t thinking about when you wrote the function: transactions with null fields in unexpected combinations, concurrent modifications to the list during processing, inputs that are technically valid but semantically unusual in your business context. Not because you forgot to be thorough. Because those scenarios weren’t in the mental model that produced the code.
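A minimal sketch in Python makes the pattern concrete. The function name `total_amount` and the dictionary shape of a transaction are assumptions invented for illustration, not taken from any real codebase. The three tests mirror the scenarios the developer had in mind; the last helper probes one that was never part of that model.

```python
# Hypothetical example: a transaction is a dict with an "amount" key.
def total_amount(transactions):
    """Sum the amounts of all transactions (illustrative implementation)."""
    return sum(t["amount"] for t in transactions)


# Tests drawn from the same mental model that produced the implementation.
def test_normal_transactions():
    assert total_amount([{"amount": 10.0}, {"amount": 5.5}]) == 15.5


def test_empty_list():
    assert total_amount([]) == 0


def test_large_list():
    assert total_amount([{"amount": 1.0}] * 10_000) == 10_000.0


# A scenario outside that model: a record whose amount is null.
def null_amount_outcome():
    """Report what happens when one transaction has amount=None."""
    try:
        total_amount([{"amount": 10.0}, {"amount": None}])
    except TypeError:
        return "TypeError"
    return "no error"
```

All three tests pass, and none of them says anything about the null case, which fails with a `TypeError` in this sketch.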
This is why code review doesn’t solve the problem. A reviewer reading the code and the tests will evaluate whether the tests cover the code as written. In doing so, they adopt the mental model the code and tests present. The coverage gap is in the scenarios neither the developer nor the reviewer was thinking about.
It’s also why increasing test coverage metrics doesn’t solve the problem. A developer under pressure to hit a coverage target will write tests that bring the number up. Those tests will cover the code paths the developer was already thinking about. The bias doesn’t diminish with more tests. It just becomes less visible behind a higher coverage number.
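The gap between a coverage number and actual input coverage can be shown with a small Python sketch. The function `unit_price` and its bulk-discount rule are hypothetical, chosen only to give the code two branches.

```python
# Hypothetical pricing helper with exactly two code paths.
def unit_price(total_cents, quantity):
    """Price per unit in cents; orders of 10 or more get 10% off."""
    price = total_cents // quantity
    if quantity >= 10:
        price = price * 90 // 100
    return price


# Two tests: one per branch. Every line and both branches are executed.
def test_small_order():
    assert unit_price(1000, 5) == 200


def test_bulk_order():
    assert unit_price(1000, 10) == 90


# Inputs the tests never exercise, even though coverage is complete.
def zero_quantity_outcome():
    """Report what happens when quantity is zero."""
    try:
        unit_price(1000, 0)
    except ZeroDivisionError:
        return "ZeroDivisionError"
    return "no error"
```

A coverage tool would report these two tests as executing every line and both branches. Yet a quantity of zero raises `ZeroDivisionError`, and a negative quantity quietly returns a negative price. Coverage measures which code ran, not which inputs were considered.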
What AI changes about testing
AI tools approach test generation from a structurally different starting point. They don’t have the developer’s mental model of what the code is supposed to do. They have the code itself, and they generate tests by reasoning about what the code does, what inputs it accepts, what outputs it produces, and what edge cases the implementation needs to handle.
This produces a different kind of test suite. Not necessarily a better-written one. Not necessarily one with better test names or cleaner organisation. But one with a different coverage profile: broader in the edge cases it considers, more systematic about boundary conditions, less shaped by the assumptions that produced the implementation.
In concrete terms, AI-assisted test generation tends to surface:
Boundary conditions at data type limits.
Integer overflow, empty strings where non-empty strings were assumed, null values in fields that were assumed to always be populated, zero values in denominators, negative numbers in contexts where only positive inputs were considered. These are mechanical to enumerate once you’re looking for them. Developers under time pressure stop looking for them after the obvious ones.
Combinations of valid inputs that produce unexpected behaviour.
Individual inputs that are each handled correctly can combine in ways the developer didn’t anticipate. AI tools explore the combinatorial space more systematically than developers do, surfacing interaction effects that single-variable test cases miss.
Concurrency and ordering assumptions.
Code that works correctly in single-threaded sequential execution sometimes fails when operations arrive in unexpected orders or when shared state is modified concurrently. AI tools tend to flag these scenarios more reliably than developers who wrote the code with a specific execution sequence in mind.
Implicit assumptions made explicit.
One of the most useful outputs of AI-assisted test generation is the assumptions it surfaces. When an AI generates a test case for an input the developer considered invalid and therefore never tested, it makes visible an assumption that was implicit in the implementation. Whether the test reveals a bug or confirms the assumption holds, the assumption is now explicit and reviewable rather than hidden in the code.
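The first two categories above lend themselves to mechanical enumeration. The sketch below, in Python, is one way that might look; the pricing rule `final_price`, the chosen boundary values, and the "price must not go negative" property are all assumptions made for illustration. It takes a handful of boundary values per input and checks every combination with `itertools.product`.

```python
import itertools


# Hypothetical pricing rule: percentage discount first, then a fixed coupon.
def final_price(price_cents, percent_off, coupon_cents):
    discounted = price_cents * (100 - percent_off) // 100
    return discounted - coupon_cents


# Boundary values for each input, listed mechanically rather than from intuition.
PRICES = [0, 1, 1000]
PERCENTS = [0, 50, 100]
COUPONS = [0, 500]


def find_negative_prices():
    """Return every input combination that yields a negative final price."""
    failures = []
    for price, percent, coupon in itertools.product(PRICES, PERCENTS, COUPONS):
        if final_price(price, percent, coupon) < 0:
            failures.append((price, percent, coupon))
    return failures
```

For a price of 1000, a 100% discount alone and a 500 coupon alone both leave the result non-negative, but the two together do not: `(1000, 100, 500)` appears in the failure list. Whether a negative price is a bug or an acceptable state is a business decision, which is exactly the kind of implicit assumption this exercise brings into view.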
None of this is magic. AI tools generate test cases that need to be reviewed for relevance and quality, just like AI-generated code. Some generated tests will be irrelevant to the actual use case. Some will test scenarios that are impossible in the real system. Human judgement is required to evaluate which generated tests are worth keeping.
But the starting point is different. A developer reviewing AI-generated tests is asking “which of these edge cases actually matter in our context?” rather than “what edge cases have I missed?” The former is a more tractable question.
What AI cannot replace in testing
The distinction between test generation and test strategy is the one that matters most here, and it’s the one that gets collapsed in oversimplified accounts of AI in testing.
AI generates test cases well. It does not design test strategies well.
Test strategy is the set of decisions that determine what gets tested, how, and to what depth. Which components carry the most risk and deserve the most thorough coverage? Which testing approaches are appropriate for different parts of the system: unit tests, integration tests, contract tests, end-to-end tests? How should the test suite be structured to remain maintainable as the system evolves? What is the acceptable tradeoff between coverage depth and test run time?
These are architectural decisions. They require understanding the system’s risk profile, the team’s deployment practices, the business context that makes some failures more costly than others. AI tools don’t have this understanding, and the test suites they generate without human strategic input reflect that. More tests are not always better. Tests that cover the wrong things at the wrong level of the stack create maintenance overhead without proportionate risk reduction.
The highest-value use of AI in testing is when it operates within a human-designed test strategy: where the overall coverage approach, the testing levels, and the risk prioritisation have been decided by someone who understands the system, and AI is used to generate thorough coverage within that framework. That combination produces test suites that are both strategically sound and comprehensively executed.
The lowest-value use is generating tests without a strategy: producing high coverage numbers against an unclear risk model, creating a false sense of thoroughness while the important scenarios remain untested.
The compounding value in contract engagements
Testing is one of the areas where the difference between AI-native developers and those who have simply adopted AI tools is most practically visible in what gets delivered.
A developer who uses AI well in testing produces a codebase with genuinely broader edge case coverage than one who tests manually, even with the same time budget. The mechanical work of enumerating boundary conditions, thinking through input combinations, and generating test scaffolding happens faster. That time gets reinvested in the strategic layer: designing the test architecture, reviewing generated tests for relevance, and ensuring the suite covers the scenarios that actually matter.
The output is a codebase that’s more resistant to the class of bugs that come from untested assumptions. These are disproportionately the bugs that are expensive to find: they don’t show up in development, they don’t show up in standard QA, and they surface in production under conditions the team didn’t test because the conditions weren’t in anyone’s mental model.
For a client bringing in a contract developer, this matters beyond the immediate engagement. Test suites are assets that outlast the contract. A thoroughly tested codebase is easier to maintain, easier to extend, and less likely to produce costly production incidents. The testing discipline an AI-native developer brings to an engagement is one of the clearest ways that value compounds after the work is done.
Navigaite evaluates testing practice specifically when assessing developers for placement. Not whether a developer writes tests, which most developers do. But whether they have the judgement to combine AI-assisted coverage breadth with the strategic thinking that makes a test suite genuinely useful rather than just numerically comprehensive. That combination is rarer than the headline adoption numbers suggest, and it’s the combination that produces codebases clients are glad to own.
A practical note on adopting AI in your testing process
If your team is exploring AI-assisted testing, or you’re bringing in developers who use it, a few observations from what works:
Treat generated tests as a coverage audit, not a finished suite.
The most useful thing AI-generated tests do is show you where your current suite has gaps. Review them with that question in mind: what scenarios are being surfaced here that your existing tests don’t cover? The answer shapes which generated tests are worth keeping.
Invest in the strategy before the generation.
The teams that get the most from AI-assisted test generation are the ones that have a clear testing strategy before they apply AI to fill it out. What levels of the stack are being tested? What risk areas deserve the deepest coverage? What does a test suite that’s maintainable at scale look like for this codebase? These questions are worth answering first.
Use generated tests to challenge assumptions explicitly.
When an AI generates a test for a scenario you consider impossible or irrelevant, treat that as a prompt rather than a dismissal. Why is that scenario impossible? Is it actually impossible, or is it just not something the current implementation handles? Making the answer explicit improves the codebase regardless of whether the test gets kept.
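As a rough illustration of that last point, suppose a generated test passes a negative payment to a refund calculation the team considered impossible to reach with such input. One possible response, sketched in Python with a hypothetical `refund_amount` function and a hypothetical fixed-fee rule, is to turn the unstated assumption into a guard and keep a test that documents it.

```python
# Hypothetical refund rule: return what was paid minus a fixed fee, never below zero.
def refund_amount(paid_cents, fee_cents):
    # The previously implicit assumption, now stated and enforced.
    if paid_cents < 0 or fee_cents < 0:
        raise ValueError("amounts must be non-negative")
    return max(paid_cents - fee_cents, 0)


def rejects_negative_payment():
    """Return True if a negative payment is rejected with ValueError."""
    try:
        refund_amount(-100, 0)
    except ValueError:
        return True
    return False
```

The guard is only one option. A team might equally decide the scenario is ruled out upstream and record that decision elsewhere; the point is that the answer is now written down rather than assumed.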
“Navigaite places AI-native contract developers who bring genuine testing discipline to every engagement: broader coverage, sounder strategy, and codebases that hold up after the work is done.”
Navigaite