Who's Checking the AI's Homework?

Something shifted in the last eighteen months and most engineering leaders haven’t fully reckoned with it yet.

Your developers are writing code faster than they ever have. AI tools – Copilot, Cursor, Claude Code, whatever your team landed on – have genuinely changed the math. Features that used to take a week land in two days. Founders are shipping MVPs over a weekend. The McKinsey numbers floating around this year put it at roughly a 46% reduction in time spent on routine coding tasks. That’s not hype. That part is real.

But here’s the question almost nobody is asking out loud: if your team is producing four times as much code, who is checking four times as much code?

The quiet math problem

Let me lay it out plainly, because once you see it you can’t unsee it.

Studies this year have been brutal on AI-generated code quality. A Stanford and MIT analysis of over two million AI-generated snippets found that around 14% contained at least one security vulnerability, versus about 9% for human-written code doing the same job. Veracode tested over a hundred models and found that nearly half of AI-generated samples introduced an OWASP Top 10 vulnerability, and that failure rate hasn’t really budged across multiple testing cycles. Cross-site scripting, log injection, SQL injection, hardcoded secrets: not exotic edge cases. The boring, dangerous classics.

And the volume is the killer part. There’s research showing AI-assisted developers commit at three to four times the rate of their peers, while introducing security findings at roughly ten times the rate.

So stack those two facts. More code per developer. A higher defect rate per line. Multiply them together and you don’t get a small bump in risk. You get an explosion. Georgia Tech’s tracking project logged six AI-attributable CVEs in January of this year, fifteen in February, thirty-five in March. That’s not a trend line you want your product sitting on top of.
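To make the multiplication concrete, here is a back-of-the-envelope sketch using only figures quoted above: the 14% versus 9% vulnerable-snippet rates and the upper end of the three-to-four-times commit rate. Treating one commit as roughly one snippet is an assumption made purely for illustration, and the function name is hypothetical.

```python
def relative_vulnerable_output(volume_multiplier: float,
                               ai_vuln_rate: float,
                               human_vuln_rate: float) -> float:
    """Ratio of vulnerable snippets shipped by an AI-assisted developer
    to those shipped by a baseline developer over the same period."""
    return volume_multiplier * ai_vuln_rate / human_vuln_rate


# Assumption: 4x the volume, 14% vs 9% vulnerable-snippet rates.
ratio = relative_vulnerable_output(4.0, 0.14, 0.09)
print(f"{ratio:.1f}x")  # prints "6.2x"
```

Even this crude model puts vulnerable output at roughly six times the baseline. The ten-times figure quoted earlier comes from separate research and counts security findings rather than vulnerable snippets, so the two numbers are not directly comparable, but they point the same way.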

Why AI writes code that “works” but isn’t right

There’s a reason for this, and it’s worth understanding rather than just panicking about.

An AI model optimizes for code that works – code that compiles, runs, and passes the obvious happy-path check. It does not optimize for code that is secure, or maintainable, or correct under the weird edge cases your actual users will absolutely find. It learned from millions of public repositories, which means it learned all the bad patterns alongside the good ones. It will cheerfully hand you a function that passes every test you wrote while being wide open to an attack you didn’t think to test for.

This is the trap. The code looks done. It demos beautifully. Everyone moves on. And the bug is sitting there quietly, waiting for production traffic.
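Here is a minimal illustration of that trap, using SQL injection from the list of classics above. The table, function names, and data are all hypothetical, and the sketch uses Python's built-in sqlite3 module with an in-memory database. Both lookups pass the happy-path test; only one survives input nobody thought to test.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Build a throwaway in-memory database with two users."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "alice@example.com"), ("bob", "bob@example.com")],
    )
    return conn


def find_emails_unsafe(conn: sqlite3.Connection, name: str) -> list:
    # User input is pasted straight into the SQL text.
    query = f"SELECT email FROM users WHERE name = '{name}'"
    return [row[0] for row in conn.execute(query)]


def find_emails_safe(conn: sqlite3.Connection, name: str) -> list:
    # Parameter binding keeps the input as data, never as SQL.
    query = "SELECT email FROM users WHERE name = ?"
    return [row[0] for row in conn.execute(query, (name,))]


conn = make_db()

# The happy-path test: both versions look finished.
assert find_emails_unsafe(conn, "alice") == ["alice@example.com"]
assert find_emails_safe(conn, "alice") == ["alice@example.com"]

# The input nobody tested for.
payload = "nobody' OR '1'='1"
leaked = find_emails_unsafe(conn, payload)   # every email in the table
blocked = find_emails_safe(conn, payload)    # empty list
```

The unsafe version is the one that demos beautifully. Nothing in its output on ordinary input hints at the problem, which is why review has to look at how the query is built and not only at whether the tests are green.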

The developers feel productive – they are productive, in the narrow sense – but the verification step that used to happen naturally when a human slowly wrote each line by hand has been skipped. You can’t eyeball-review code at the speed an AI generates it. The human bottleneck didn’t disappear. It just moved downstream, and got bigger.

The reflex that makes it worse

Here’s what I see teams do when they notice this, and why it backfires.

The instinct is to throw more AI at the problem. The code was written by AI, so let’s have AI review it too. Plug in an AI code reviewer, let it scan the pull requests, call it covered.

I understand the appeal and I’m not against the tooling – automated scanning catches real things, fast. But there’s a circular logic here that should make you nervous. The same kind of system that optimizes for “looks correct” is now the thing judging whether something is correct. AI reviewing AI is great at catching syntax problems and the obvious stuff a linter would flag anyway. It’s much weaker at the thing that actually matters: judgment. Does this code do what the business needs? Is this edge case one a real user will hit? Is this “vulnerability” actually exploitable in our context, or noise? Should this even ship?

Those aren’t questions of pattern matching. They’re questions for someone who understands the product, owns the outcome, and is accountable when it breaks.

What good actually looks like now

So I’m not here to tell you to slow your developers down. That ship has sailed and honestly you shouldn’t want it back. The speed is a real competitive advantage. The answer isn’t less AI. The answer is matching your verification capacity to your new production capacity – and keeping a human firmly in charge of the part that requires judgment.

In practice that means a few things.

It means treating AI-generated code as exactly what it is: a fast first draft from a talented but careless junior. You wouldn’t ship a junior’s first PR straight to production without review. Same rule applies, except now the junior is producing ten PRs a day.

It means the verification layer has to scale with the generation layer. If your output quadrupled and your QA capacity is flat, you have already lost the thread – you just haven’t seen the incident yet. This is usually the moment a team realizes their two internal QA people cannot physically keep up with what fifteen AI-augmented developers are now producing.

And it means using AI on the testing side too, but with a person owning every decision. Use it to generate test cases from the user stories and surface the gaps. Use it to do intelligent regression selection so you’re not re-running three thousand tests on a one-line change. Use it to scan for the OWASP classics. But the engineer reads the output, makes the call, and signs their name to the release. AI is the co-pilot. It is never the pilot. The second you let it fly the plane is the second you’re in one of those CVE statistics.
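As a rough sketch of what regression selection means mechanically, the snippet below picks only the tests whose covered files overlap a change. It is plain set intersection, not AI: a smarter selector would build or refine the coverage map, which is hand-written here. All names and the map itself are hypothetical, and the fall-back-to-everything rule is one conservative design choice among several.

```python
def select_tests(coverage_map: dict, changed_files: set) -> list:
    """Return the tests worth re-running for a given set of changed files.

    coverage_map maps each test name to the set of source files it exercises.
    """
    known_files = set().union(*coverage_map.values())
    # Conservative fallback: if a changed file appears in no test's coverage,
    # the map cannot vouch for it, so run the whole suite.
    if not changed_files <= known_files:
        return sorted(coverage_map)
    return sorted(
        test for test, files in coverage_map.items() if files & changed_files
    )


# Hypothetical coverage map for illustration.
coverage_map = {
    "test_login": {"auth.py", "session.py"},
    "test_checkout": {"cart.py", "payment.py"},
    "test_profile": {"auth.py", "profile.py"},
}

targeted = select_tests(coverage_map, {"auth.py"})
# ['test_login', 'test_profile']

everything = select_tests(coverage_map, {"new_module.py"})
# ['test_checkout', 'test_login', 'test_profile']
```

The selection is only a proposal. The engineer still decides whether a narrowed run is enough for this particular change, and the fallback exists so that the tool fails toward running more tests, not fewer.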

The uncomfortable bottom line

The teams winning right now aren’t the ones that adopted AI fastest. Plenty of teams did that and are quietly drowning in tech debt and security findings they don’t have the people to triage.

The teams winning are the ones who understood that faster production creates a verification debt, and who actually paid it down – by scaling their quality capacity to match, and by keeping accountability with humans who own the outcome rather than outsourcing judgment to the same kind of system that created the problem.

Your developers ship 4x faster now. Good. The only question that matters is whether anyone is actually checking 4x faster too. If the honest answer is no, that gap is where your next outage lives – and it’s growing every single sprint.
