
20 Silent Code Failures. Zero Warnings. One Tool Caught Them All.

A controlled benchmark: same model, three setups — a tight prompt, a deliberately vague prompt, and the vague prompt with Guard watching. Here's what the data showed.

Anton
April 14, 2026 · 3 min read

AI code generation looks clean until it ships broken code with zero warnings. We ran a controlled test to find out exactly what slips through — and what catches it.


The Setup

Same model (Claude Opus), two prompts — one tight, one deliberately vague. A tech stack chosen to expose hallucinations: SQLAlchemy 2.0 and Pydantic v2, both with breaking syntax changes that LLMs trained on older data get wrong.
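To make the version trap concrete, here is the kind of breaking change the stack was chosen around. This is an illustrative sketch, not code from the benchmark runs: Pydantic v2 renamed v1's `@validator` to `@field_validator` and `.dict()` to `.model_dump()`, so v1-era code learned from older training data breaks under v2.

```python
# Illustrative sketch of a Pydantic v1 -> v2 breaking change
# (not code from the benchmark runs).
from pydantic import BaseModel, field_validator  # v1 spelling: `validator`

class User(BaseModel):
    name: str

    @field_validator("name")       # v1 spelling: @validator("name")
    @classmethod
    def strip_name(cls, v: str) -> str:
        return v.strip()

print(User(name="  Ada  ").model_dump())  # v1 spelling: .dict()
```

An LLM that emits the v1 spellings gets no pushback at generation time; the failure only surfaces when the code runs against v2.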


What Bare Opus Did (tight spec)

Explicit version hints in the prompt. Best-case conditions for a bare LLM.

  • Got it right — explicit version hints helped
  • Zero audit trail. No way to verify it was correct.
  • If it had slipped, we’d have found out in production.

What Bare Opus Did (loose spec — no version hints)

Vague prompt. No library versions. No syntax guidance. Just requirements.

Left to its own choices, Opus defaulted to older training data:

  • Used deprecated Column() syntax — 17 times
  • Used deprecated relationship() pattern — 3 times
  • Total: 20 silent failures. Zero warnings during generation.

Neither triggered an error during generation. Both would cause runtime failures in a SQLAlchemy 2.0 environment.


What Guard Did

Same vague prompt. Guard watching.

  • Caught all 20 patterns — 100% catch rate, 0 slip-through
  • Protection overhead: +$0.0036
  • Full audit trail on every response
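We don't have Guard's internals to show, but the class of check is simple to sketch. A hypothetical, stdlib-only detector — the rules and names below are our assumptions for illustration, not Guard's actual implementation — that flags legacy SQLAlchemy patterns in generated source and emits a line-level audit trail:

```python
import re

# Hypothetical detector sketch: flag legacy SQLAlchemy 1.x patterns in
# generated source. Rules and names are illustrative assumptions, not
# Guard's actual implementation.
LEGACY_PATTERNS = {
    "legacy Column() mapping": re.compile(r"=\s*Column\("),
    "legacy relationship() backref": re.compile(r"relationship\([^)]*backref\s*="),
}

def audit(source: str) -> list[str]:
    """Return one finding per matching line, forming an audit trail."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for label, pattern in LEGACY_PATTERNS.items():
            if pattern.search(line):
                findings.append(f"line {lineno}: {label}")
    return findings

generated = '''\
class User(Base):
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)
    posts = relationship("Post", backref="author")
'''
print(audit(generated))
# ['line 3: legacy Column() mapping', 'line 4: legacy relationship() backref']
```

A real checker would parse the AST and resolve imports rather than pattern-match lines, but the principle is the same: check the output, not the prompt.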

The Numbers

                     Bare LLM (tight spec)   Bare LLM (loose spec)   + Guard
  Silent failures    0                       20                      0
  Caught by Guard    n/a                     n/a                     20/20
  Protection cost    n/a                     n/a                     +$0.0036
  Audit trail        No                      No                      Yes

Tight specs help. Vague specs expose everything. Guard catches both.


The Honest Part

Tight specs reduce hallucination risk. But real developers don’t always write tight specs, and real prompts aren’t always explicit about library versions. Guard doesn’t rely on the prompt being perfect — it checks the output regardless.

Leading LLMs hallucinate at rates between 2% and 8% under optimal conditions. In production, conditions aren’t always optimal.


Hallucination Guard is currently in private beta. Apply for early access →

Benchmark conducted April 14, 2026. Model: claude-opus-4-6 via OpenRouter. All costs estimated based on published token pricing. Results reflect a controlled test environment and may vary in production.

Ready to build AI that actually works?

CertainLogic builds deterministic AI tools for small businesses. Fixed price. No surprises.
