Your Tests Are Green. Your AI Just Invented a Refund Policy

A team I’ll keep anonymous shipped their first AI-powered feature last year. The test suite was fully green. Every assertion passed. It demoed flawlessly in front of the whole company.

Then a real user typed: “What’s your refund policy?”

The AI gave a confident, detailed answer – specific timelines, specific conditions, the works. There was just one problem. The company had no such policy. The model invented the entire thing, presented fiction as fact, and the test suite never said a word. Because the tests were built for normal software, and the thing they’d just shipped was anything but normal.

This is the wall almost every team hits the moment they put an LLM feature in front of customers. And if your product roadmap has the word “AI” anywhere on it – a chatbot, a copilot, a summarizer, an agent – you are heading straight for it, whether you’ve noticed yet or not.

Why your entire testing playbook quietly stopped working

For the last few decades, software testing rested on one assumption so basic nobody bothered to say it out loud: the same input gives you the same output, every time.

You send a request, you know what should come back, you assert that it matches. Pass or fail. Binary. Clean. Your whole QA apparatus – every framework, every CI gate, every regression suite – is built on that bedrock.

LLMs detonate it.

Ask an AI feature the same question twice and you can get two answers, phrased completely differently, that are both correct. Or one of them can be subtly, dangerously wrong in a way that looks just as fluent and confident as the right one. There’s no single expected value to assert against, because there are a hundred acceptable phrasings of the right answer and an infinite number of wrong ones that look exactly like them.

So the question your tests have asked since the beginning – “did the output match?” – becomes meaningless. There’s nothing to match against. You’re not testing a calculator anymore. You’re testing something closer to a very fast, very confident, occasionally lying intern.
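To make that concrete, here is a minimal sketch of the two checks a traditional suite tends to rely on. Everything in it is hypothetical and written for illustration: the expected string, the two sample responses, and the helper names `exact_match` and `smoke_check`. The point is only that an exact-match assertion rejects a perfectly good paraphrase, while a "did it return something sensible-looking" check happily passes an invented policy.

```python
# Hypothetical expected answer, written by a human for this illustration.
EXPECTED = "We do not currently offer refunds. Please contact support for help."


def exact_match(response: str, expected: str = EXPECTED) -> bool:
    """Classic assertion: the output must equal the expected string."""
    return response == expected


def smoke_check(response: str) -> bool:
    """The kind of check that stays green: a non-empty, reasonably short string."""
    return isinstance(response, str) and 0 < len(response) <= 500


# Two made-up model responses to the same question.
valid_paraphrase = "Refunds aren't something we offer right now; our support team can help."
hallucination = "You can request a full refund within 30 days of purchase."

# A correct answer fails the strict test...
print(exact_match(valid_paraphrase))
# ...and an invented policy passes the loose one.
print(smoke_check(hallucination))
```

Neither check is wrong for ordinary software. They simply measure the wrong thing here: one measures wording, the other measures shape, and neither measures truth.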

The failure modes nobody’s QA was designed to catch

Traditional bugs are things that crash, error, or return the wrong number. AI features fail in ways your existing process has no concept of.

There’s hallucination – the refund policy story. The model fills a gap with something invented, stated with total confidence. No exception is thrown. Nothing turns red. It just quietly makes things up.

There’s silent regression, and this one should genuinely scare any leader. Your AI feature works perfectly for months. Then your model provider – OpenAI, Anthropic, Google – ships a routine update. Or someone tweaks a prompt. Or adjusts a temperature setting. And the behavior of your product shifts underneath you, with no code change on your side to trace it back to. Users start getting worse answers. Support tickets climb. And your engineering team has no systematic way to connect the degradation to its cause, because nothing in your codebase changed. Without a way to test for this, you don’t find out from your monitoring. You find out from your angriest customers.

There’s bias and unsafe output. In regulated spaces – finance, healthcare, HR – an AI that produces subtly biased responses isn’t just an embarrassment; it can be a compliance violation and potentially a lawsuit. And there’s the whole adversarial surface: users trying to jailbreak the thing into saying something that ends up on social media with your logo attached.

None of these are caught by assert response === expected. Not one.

So how do you actually test something non-deterministic?

Here’s the shift, and it’s a genuine mental rewire: you stop asking “did it pass?” and start asking “did it behave within acceptable bounds?”

You’re no longer checking for one correct answer. You’re checking that the output lands inside an acceptable range – every time, across many runs. In practice that means a few concrete things that look nothing like a traditional test.

You build a golden dataset – a curated set of inputs where humans have defined what “good” actually looks like. This is the foundation, and notably, a human has to build it. You can’t automate your way to knowing what a good answer looks like for your customers.
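As a rough sketch of what one entry in such a dataset might look like, assume each case records the input, the facts a good answer must state, the claims it must never make, and the human who signed off on it. The `GoldenCase` structure, the `check_case` helper, and the refund example are all hypothetical, and plain phrase matching is a deliberately crude stand-in for whatever scoring a real team would choose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenCase:
    """One human-reviewed example of what "good" looks like (hypothetical schema)."""
    case_id: str
    user_input: str
    required_phrases: tuple[str, ...] = ()   # facts a good answer must state
    forbidden_phrases: tuple[str, ...] = ()  # claims a good answer must never make
    reviewer: str = "unassigned"             # the human accountable for this case


def check_case(case: GoldenCase, response: str) -> list[str]:
    """Return human-readable violations; an empty list means acceptable."""
    text = response.lower()
    problems = []
    for phrase in case.required_phrases:
        if phrase.lower() not in text:
            problems.append(f"missing required phrase: {phrase!r}")
    for phrase in case.forbidden_phrases:
        if phrase.lower() in text:
            problems.append(f"contains forbidden phrase: {phrase!r}")
    return problems


# Illustrative case: the company has no refund policy, so any invented
# timeline or guarantee is a failure.
REFUND_CASE = GoldenCase(
    case_id="refund-policy-001",
    user_input="What's your refund policy?",
    required_phrases=("contact support",),
    forbidden_phrases=("30 days", "money-back guarantee"),
    reviewer="support-lead",
)
```

The forbidden list is the part that would have caught the refund story. It exists only because a person who knows the business wrote it down.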

You test for consistency, not exact equality. Run the same input five or ten times, score how semantically similar the outputs are, and assert on the behavior staying stable rather than on a string matching. You’re measuring whether the thing is reliable, not whether it’s identical.
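A minimal sketch of that idea follows, under some loud assumptions. The `generate` callable stands in for your model call and is hypothetical. The similarity function here is plain word-set overlap (Jaccard), which is only a lexical placeholder; a real setup would more likely swap in an embedding-based score. The threshold is something a human picks, not something the code can tell you.

```python
from itertools import combinations
from typing import Callable, Sequence


def token_jaccard(a: str, b: str) -> float:
    """Crude lexical stand-in for semantic similarity: overlap of word sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def sample_outputs(generate: Callable[[str], str], prompt: str, runs: int = 5) -> list[str]:
    """Call the (hypothetical) model function several times with the same prompt."""
    return [generate(prompt) for _ in range(runs)]


def consistency_score(
    outputs: Sequence[str],
    similarity: Callable[[str, str], float] = token_jaccard,
) -> float:
    """Mean pairwise similarity across repeated runs of the same input."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        raise ValueError("need at least two outputs to measure consistency")
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


def is_stable(outputs: Sequence[str], threshold: float = 0.7) -> bool:
    """Assert on behavior staying stable, not on strings matching."""
    return consistency_score(outputs) >= threshold
```

In a test you would call `sample_outputs(your_model_fn, prompt, runs=10)` and assert `is_stable(...)`. Passing the similarity function in as a parameter keeps the door open to replacing the placeholder without touching the test.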

You use a judge – often another LLM, scored against a rubric – to evaluate whether responses follow instructions, stay relevant, avoid hallucinating, and respect policy. But a human curates that rubric and owns the error analysis, because somebody accountable has to decide what the bar is.
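Here is one way that split of responsibilities could be sketched. The rubric, its criteria, and the minimum scores are hypothetical and human-authored. The `judge` parameter is an assumption too: it represents whatever call you make to a judging model, reduced to a function that returns a score between 0 and 1 for one criterion. No real provider API is shown.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Criterion:
    name: str
    description: str   # what the judge is asked to assess
    min_score: float   # the bar, chosen by an accountable human (0.0 to 1.0)


# Hypothetical rubric; the criteria and thresholds are illustrative only.
RUBRIC = (
    Criterion("grounded", "Makes no claims absent from the provided policy text.", 1.0),
    Criterion("relevant", "Answers the question the user actually asked.", 0.8),
    Criterion("follows_instructions", "Obeys the constraints in the system prompt.", 0.8),
)

# A judge takes (question, response, criterion) and returns a score in [0, 1].
# In practice this would wrap a call to a second model; here it is injected.
Judge = Callable[[str, str, Criterion], float]


def evaluate(
    question: str,
    response: str,
    judge: Judge,
    rubric: tuple[Criterion, ...] = RUBRIC,
) -> dict[str, bool]:
    """Score a response against each criterion and report pass/fail per criterion."""
    results = {}
    for criterion in rubric:
        score = judge(question, response, criterion)
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"judge returned out-of-range score for {criterion.name}: {score}")
        results[criterion.name] = score >= criterion.min_score
    return results
```

Reporting per criterion rather than one blended number is a deliberate choice in this sketch: it keeps the error analysis readable for the human who owns it.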

And critically, you version-lock a baseline and regression-test against it on every prompt change and every model update – so when the provider ships a silent update, you catch the drift before your users do.
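A baseline check along those lines might look like the sketch below. It assumes you store, next to each set of scores, the model identifier and a hash of the prompt they were recorded against. The `Baseline` structure, the function names, and the 0.05 tolerance are all hypothetical choices made for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Baseline:
    model_id: str              # the pinned model identifier the scores were recorded against
    prompt_sha256: str         # fingerprint of the exact prompt text
    scores: dict[str, float]   # golden case_id -> evaluation score


def prompt_fingerprint(prompt: str) -> str:
    """Hash the prompt so any edit, however small, is detectable."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def config_changed(baseline: Baseline, model_id: str, prompt: str) -> bool:
    """True when the model or prompt differs from what the baseline was recorded with."""
    return baseline.model_id != model_id or baseline.prompt_sha256 != prompt_fingerprint(prompt)


def find_regressions(
    baseline: Baseline,
    current_scores: dict[str, float],
    tolerance: float = 0.05,
) -> dict[str, float]:
    """Return case_id -> score drop for every case that fell more than `tolerance`."""
    regressions = {}
    for case_id, old_score in baseline.scores.items():
        # A case missing from the current run is treated as scoring 0.0.
        new_score = current_scores.get(case_id, 0.0)
        drop = old_score - new_score
        if drop > tolerance:
            regressions[case_id] = drop
    return regressions
```

Run on a schedule as well as on every prompt change, a check like this can flag drift even when `config_changed` reports nothing different on your side, which is exactly the silent-regression case.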

The part that doesn’t get easier

I want to be straight about something, because there’s a lot of “AI tests AI, problem solved” noise out there. This is one of the areas where human judgment isn’t a nice-to-have, it’s load-bearing.

A survey of QA, data science, and product professionals this year found that nearly half rely on human sentiment and usability to decide whether an AI feature is even ready for production. Humans build the golden dataset. Humans set what “good” means. Humans map the behavioral boundaries and judge the things a rubric can’t capture – tone, intent, whether an answer is technically accurate but completely tone-deaf for the situation. You can automate the running of these evaluations. You cannot automate the judgment underneath them.

This is exactly why testing AI features tends to break in-house teams that are already stretched. It’s not that they lack the ability – it’s that this is a genuinely new discipline requiring new skills, new tooling, and a new mental model, layered on top of all the traditional testing that doesn’t go away. Most teams discover they need this expertise right around the time they’ve already shipped the feature and the support tickets have started.

The takeaway

If you’re putting AI into your product, the most dangerous moment isn’t when something obviously breaks. It’s when everything looks fine – green tests, clean demo, happy launch – and the failure is sitting quietly in the gap between “the code ran” and “the output was actually true, safe, and on-brand.”

Traditional QA can’t see into that gap. It was never built to. Testing non-deterministic systems is a different craft, and the teams that treat it as just-more-test-cases are the ones whose AI invents a refund policy in front of a paying customer.

The good news: this is a solved problem in the sense that the framework exists – golden datasets, semantic consistency, rubric-based evaluation, version-locked baselines, human judgment at the decision points. It’s just not the framework your team already has. The question worth asking before your next AI feature ships isn’t “are the tests passing?” It’s “do we even know what we’re supposed to be testing for?”
