
20 Silent Code Failures. Zero Warnings. One Tool Caught Them All.

A controlled benchmark: same model, three setups — a tight prompt, a deliberately vague prompt, and the vague prompt with Guard watching. Here's what the data showed.

Anton
April 14, 2026 · 3 min read

AI code generation looks clean until it ships broken code with zero warnings. We ran a controlled test to find out exactly what slips through — and what catches it.


The Setup

Same model (Claude Opus), two prompts — one tight, one deliberately vague. A tech stack chosen to expose hallucinations: SQLAlchemy 2.0 and Pydantic v2, both with breaking syntax changes that LLMs trained on older data get wrong.
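To make the version trap concrete, here is the kind of breaking change the stack was chosen around. This is an illustrative sketch, not code from the benchmark runs: Pydantic v2 renamed v1's `@validator` to `@field_validator` and `.dict()` to `.model_dump()`, so v1-era code learned from older training data breaks under v2.

```python
# Illustrative sketch of a Pydantic v1 -> v2 breaking change
# (not code from the benchmark runs).
from pydantic import BaseModel, field_validator  # v1 spelling: `validator`

class User(BaseModel):
    name: str

    @field_validator("name")       # v1 spelling: @validator("name")
    @classmethod
    def strip_name(cls, v: str) -> str:
        return v.strip()

print(User(name="  Ada  ").model_dump())  # v1 spelling: .dict()
```

An LLM that emits the v1 spellings gets no pushback at generation time; the failure only surfaces when the code runs against v2.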


What Bare Opus Did (tight spec)

Explicit version hints in the prompt. Best-case conditions for a bare LLM.

  • Got it right — explicit version hints helped
  • Zero audit trail. No way to verify it was correct.
  • If it had slipped, we’d have found out in production.

What Bare Opus Did (loose spec — no version hints)

Vague prompt. No library versions. No syntax guidance. Just requirements.

Left to its own choices, Opus defaulted to older training data:

  • Used deprecated Column() syntax — 17 times
  • Used deprecated relationship() pattern — 3 times
  • Total: 20 silent failures. Zero warnings during generation.

Neither triggered an error during generation. Both would cause runtime failures in a SQLAlchemy 2.0 environment.


What Guard Did

Same vague prompt. Guard watching.

  • Caught all 20 patterns — 100% catch rate, 0 slip-through
  • Protection overhead: +$0.0036
  • Full audit trail on every response
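We don't have Guard's internals to show, but the class of check is simple to sketch. A hypothetical, stdlib-only detector — the rules and names below are our assumptions for illustration, not Guard's actual implementation — that flags legacy SQLAlchemy patterns in generated source and emits a line-level audit trail:

```python
import re

# Hypothetical detector sketch: flag legacy SQLAlchemy 1.x patterns in
# generated source. Rules and names are illustrative assumptions, not
# Guard's actual implementation.
LEGACY_PATTERNS = {
    "legacy Column() mapping": re.compile(r"=\s*Column\("),
    "legacy relationship() backref": re.compile(r"relationship\([^)]*backref\s*="),
}

def audit(source: str) -> list[str]:
    """Return one finding per matching line, forming an audit trail."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for label, pattern in LEGACY_PATTERNS.items():
            if pattern.search(line):
                findings.append(f"line {lineno}: {label}")
    return findings

generated = '''\
class User(Base):
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)
    posts = relationship("Post", backref="author")
'''
print(audit(generated))
# ['line 3: legacy Column() mapping', 'line 4: legacy relationship() backref']
```

A real checker would parse the AST and resolve imports rather than pattern-match lines, but the principle is the same: check the output, not the prompt.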

The Numbers

                     Bare LLM (tight spec)   Bare LLM (loose spec)   + Guard
  Silent failures    0                       20                      0
  Caught by Guard    n/a                     n/a                     20/20
  Protection cost    n/a                     n/a                     +$0.0036
  Audit trail        No                      No                      Yes

Tight specs help. Vague specs expose everything. Guard catches both.


The Honest Part

Tight specs reduce hallucination risk. But real developers don’t always write tight specs, and real prompts aren’t always explicit about library versions. Guard doesn’t rely on the prompt being perfect — it checks the output regardless.

Leading LLMs hallucinate at rates between 2% and 8% under optimal conditions. In production, conditions aren’t always optimal.


Hallucination Guard is currently in private beta. Apply for early access →

Benchmark conducted April 14, 2026. Model: claude-opus-4-6 via OpenRouter. All costs estimated based on published token pricing. Results reflect a controlled test environment and may vary in production.

Ready to build AI that actually works?

CertainLogic builds deterministic AI tools for small businesses. Fixed price. No surprises.
