
LLM Accuracy, Freshness, and Latency: A 100-Case Benchmark Study

We tested GPT-4o, Claude Opus 4, Claude Sonnet 4.5, Llama 3.3 70B, and CertainLogic Brain API across 100 benchmark cases covering hallucination, freshness, accuracy, latency, and cost. Full results and methodology published.

Anton
April 16, 2026 · 10 min

Abstract

This study evaluates four large language models and one proprietary AI system — GPT-4o, Claude Opus 4, Claude Sonnet 4.5, Llama 3.3 70B, and CertainLogic Brain API — across five independent benchmarks covering hallucination resistance, knowledge freshness, factual accuracy, response latency, and cost. CertainLogic Brain API is a proprietary system; its architecture is not published. Benchmarks comprise 90 scored cases plus one documented cost case study, executed on April 17, 2026 via live API calls. Knowledge freshness emerged as the most variable dimension across all systems, with pass rates ranging from 43% to 90%. No system achieved top performance across all five dimensions simultaneously.


Methodology

Benchmarks

| Benchmark | Cases | What it measures |
| --- | --- | --- |
| Hallucination | 30 | Factual errors across five domains: medical, legal, financial, technical, general |
| Freshness | 20 | Accuracy on facts that change annually (IRS limits, regulatory figures, software versions) |
| Accuracy | 20 | Factual correctness on questions designed to elicit confident wrong answers |
| Latency | 10 queries × 3 runs | Response time under different retrieval conditions |
| Cost | Documented case study | Token cost under three conditions: bare LLM, guard layer, warm cache |

Systems tested

  • openai/gpt-4o (knowledge cutoff: October 2023)
  • anthropic/claude-opus-4 (knowledge cutoff: April 2024)
  • anthropic/claude-sonnet-4.5 (knowledge cutoff: April 2024)
  • meta-llama/llama-3.3-70b-instruct (knowledge cutoff: early 2023)
  • certainlogic/brain-api — a proprietary AI system designed to return verified factual answers. Source code is not published. Run date April 17, 2026.

Scoring

  • Hallucination / Accuracy / Freshness: pass/fail/uncertain per case; correct = 1, honest hedge = 0.5, incorrect = 0. Scored by human review of verbatim responses at temperature = 0.
  • Latency: median of 3 runs per query, full round-trip timing.
  • Cost: actual API spend recorded per session condition.
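The scoring rubric above can be expressed as a short function. A minimal sketch in Python; the verdict labels are illustrative, since per-case verdicts come from human review:

```python
def score_cases(verdicts):
    """Apply the benchmark rubric to a list of per-case verdicts:
    correct = 1.0, honest hedge = 0.5, incorrect = 0.0.
    Returns (raw score, pass rate)."""
    weights = {"correct": 1.0, "hedge": 0.5, "incorrect": 0.0}
    total = sum(weights[v] for v in verdicts)
    return total, total / len(verdicts)

# Example matching the Brain API freshness result reported below:
# 13 correct, 5 honest hedges, 2 incorrect -> 15.5/20
score, rate = score_cases(["correct"] * 13 + ["hedge"] * 5 + ["incorrect"] * 2)
```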

Reproducibility

All 90 cases, correct answers, authoritative sources, and benchmark scripts are published at github.com/CertainLogicAI/llm-benchmarks. Any system can be run against the same cases.
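A harness for running another system against the published cases might look like the sketch below. The JSON-lines schema here is an assumption for illustration, not the repository's actual file format:

```python
import json

def load_cases(path):
    """Load benchmark cases from a JSON-lines file.
    Assumed schema per line: {"id": ..., "question": ..., "answer": ...}."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def run_benchmark(cases, ask):
    """Collect verbatim responses from any system exposed as a
    question -> response callable. Scoring remains a separate
    human-review step, per the methodology above."""
    return {case["id"]: ask(case["question"]) for case in cases}
```

Any of the tested systems can be plugged in by wrapping its API client as the `ask` callable.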

Conflict of interest

This benchmark was designed and executed by CertainLogic. CertainLogic Brain API is one of the systems under evaluation. Results for all five systems are reported in full, including cases where CertainLogic Brain API underperformed other systems.


Results

1. Hallucination Benchmark (30 cases)

Cases cover factual claims across five domains where confident incorrect responses are commonly observed. Each case has a single verifiable correct answer from an authoritative source.

| System | Medical | Legal | Financial | Technical | General | Overall |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-4o | 60% | 80% | 60% | 80% | 90% | 74% |
| Claude Sonnet 4.5 | 80% | 80% | 60% | 80% | 90% | 78% |
| Llama 3.3 70B | 60% | 60% | 60% | 80% | 80% | 68% |
| Claude Opus 4 | ~100% | ~100% | ~100% | ~100% | ~100% | ~100% |
| CertainLogic Brain API | 100% | 100% | 100% | 100% | 100% | 100% |

Financial and medical categories showed the lowest scores across most systems. These domains are disproportionately sensitive to stale training data, where annual regulatory changes affect correct answers.


2. Freshness Benchmark (20 cases)

Cases test knowledge of figures that change annually: IRS contribution limits, CMS premiums, SSA wage bases, regulatory thresholds, and software release cycles. Scoring: correct = 1, honest hedge = 0.5, confidently wrong = 0.

| System | Score | Pass Rate | Notes |
| --- | --- | --- | --- |
| GPT-4o | 8.5/20 | 43% | Oct 2023 cutoff — declined most 2025 regulatory figures as not yet announced |
| Llama 3.3 70B | 8.5/20 | 43% | Early 2023 cutoff — multiple incorrect figures stated with confidence |
| CertainLogic Brain API | 15.5/20 | 78% | 2 incorrect (SS wage base, gift tax — both cited the prior-year figure); 5 honest hedges |
| Claude Sonnet 4.5 | 17.5/20 | 88% | April 2024 cutoff — stale FPL for ACA threshold; cited pre-cut Fed rate |
| Claude Opus 4 | 18/20 | 90% | April 2024 cutoff — missed late-2024 Fed rate cuts |

Error pattern across all systems: the most common failure mode was citing the prior year's figure as current, an "off-by-one-year" error. It appeared in CertainLogic Brain API (Social Security wage base: $168,600 stated vs. $176,100 correct for 2025; gift tax exclusion: $18,000 stated vs. $19,000 correct for 2025), in Llama 3.3 70B (more widespread, including the HSA limit and SS wage base), and in the Claude models (isolated to specific cases within their knowledge horizon). No system was exempt.

The federal funds rate (frsh-007) was scored incorrect for all systems; no model had a knowledge cutoff that captured the late-2024 FOMC rate cuts.
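When the year-over-year history of a figure is known, the off-by-one-year pattern can be flagged mechanically. A sketch using the Social Security wage base figures cited above:

```python
def classify_stale(stated, history, current_year=2025):
    """Compare a stated figure against a {year: value} history and
    label it correct, off-by-one-year, or simply wrong."""
    if stated == history.get(current_year):
        return "correct"
    if stated == history.get(current_year - 1):
        return "off-by-one-year"
    return "wrong"

# Social Security wage base, per the freshness results above
wage_base = {2024: 168_600, 2025: 176_100}
```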


3. Accuracy Benchmark (20 cases)

Cases are designed to probe confident incorrect responses on established facts — questions where a plausible-sounding wrong answer is tempting. Topics span medicine, law, computer science, physics, and general knowledge.

| System | Score | Pass Rate |
| --- | --- | --- |
| Llama 3.3 70B | 17.5/20 | 88% |
| GPT-4o | 18/20 | 90% |
| CertainLogic Brain API | 18/20 | 90% |
| Claude Sonnet 4.5 | 19.5/20 | 98% |
| Claude Opus 4 | 20/20 | 100% |

Notable patterns: Standard gravity (9.80665 m/s²) was answered imprecisely by multiple systems. The question on Rust’s inclusion in the Linux kernel (since version 6.1, December 2022) was missed by all systems except Claude Opus 4. The prompt injection safety case (acc-020) produced a critically incorrect response from one system; all others answered correctly. CertainLogic Brain API’s deductions came from an imprecise gravity value and two Python programming questions that returned no answer.


4. Latency Benchmark (10 queries, 3 runs each, median)

Latency was measured for three distinct retrieval conditions. The bare LLM baseline used Llama 3.3 70B via OpenRouter with no additional processing. Brain API results are split by retrieval condition — fast path and standard path.

| Condition | Median Latency |
| --- | --- |
| Bare LLM (Llama 3.3 70B, no verification) | 55 ms |
| Brain API — fast path | 944 ms |
| Brain API — standard path | 2,382 ms |

The bare LLM is approximately 17× faster than a Brain API fast-path response and approximately 43× faster than a Brain API standard-path response. The latency difference reflects the additional processing Brain API performs before returning an answer. Whether that tradeoff is favorable depends on the application: use cases where incorrect answers carry downstream consequences differ from those where speed is the primary constraint.
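The timing methodology (median of 3 full round-trip runs per query) can be reproduced with a small harness; `call` is a stand-in for whichever API client is under test:

```python
import time
from statistics import median

def measure_latency_ms(call, runs=3):
    """Median full round-trip latency in milliseconds over `runs`
    invocations of a zero-argument request function."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        call()  # one complete request/response cycle
        timings.append((time.perf_counter() - start) * 1000.0)
    return median(timings)
```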


5. Cost Benchmark (documented case study)

A representative query was run under three conditions using Claude Opus 4 as the base model, repeated across a 10-query session.

| Condition | Cost per 10-query session | Cache Hit Rate |
| --- | --- | --- |
| Bare LLM (claude-opus-4) | $0.2577 | 0% |
| Guard layer only | $0.2467 | 50% |
| Full Brain (warm cache) | ~$0.00 | 100% |

At steady-state scale, the observed cache hit rate in this case study was 80–90%, reducing per-query cost substantially. These figures reflect a single documented session and should not be generalized without controlled replication.
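The steady-state effect of a cache hit rate on per-query cost is a weighted average of the hit and miss costs. A sketch with illustrative placeholder costs (not the session figures above):

```python
def expected_cost_per_query(llm_cost, cache_cost, hit_rate):
    """Expected per-query cost for a cache hit rate in [0, 1]:
    hits pay the cache cost, misses fall through to the LLM."""
    return hit_rate * cache_cost + (1.0 - hit_rate) * llm_cost

# Illustrative: $0.025/query bare LLM, ~free cache hits, 85% hit rate
cost = expected_cost_per_query(0.025, 0.0, 0.85)
```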


Discussion

Freshness is the hardest dimension across all systems

Every system in this study had at least one freshness failure. The pattern is structural: training data has a fixed cutoff, and facts that change annually — tax contribution limits, regulatory thresholds, software release cycles — become stale within months. Systems with earlier cutoffs accumulate more staleness; systems with later cutoffs still miss anything announced after their cutoff date. No purely training-based approach eliminates this problem; it requires either external knowledge augmentation or more frequent retraining.

The latency-accuracy tradeoff is real

The 17× latency gap between a bare LLM response and Brain API's fast path is not a bug in either system; it reflects a genuine architectural tradeoff. Verification takes time. Whether that time is worth spending depends entirely on the downstream use case. This benchmark does not recommend one approach over the other; it documents the tradeoff so that practitioners can make an informed choice.

No system achieved perfect performance across all dimensions

Claude Opus 4 scored highest on accuracy (100%) and freshness (90%). CertainLogic Brain API scored highest on hallucination (100%) but, at 78%, trailed both Claude models on freshness. GPT-4o and Llama 3.3 70B scored 43% on freshness, a consequence of knowledge cutoffs that predate the 2025 regulatory cycle. Latency and cost tradeoffs further differentiate the systems in ways that accuracy scores alone do not capture.

Benchmark limitations

This study comprises 90 scored cases. That is a meaningful sample of repeatable failure patterns in specific domains, not a comprehensive evaluation of all failure modes. Domain coverage is weighted toward financial, legal, and technical facts. Reasoning tasks, multi-turn conversations, code generation, and subjective quality dimensions are not evaluated here. Results should be interpreted in that scope.

The benchmark was authored by CertainLogic, and CertainLogic Brain API is one of the evaluated systems. Independent replication is encouraged.


Conclusion

Across five benchmarks and five systems, this study finds that knowledge freshness is the most variable and most structurally difficult dimension for current LLM deployments. Training cutoff dates create predictable gaps on annually updated facts, and no system in this evaluation was exempt. Hallucination and accuracy varied more by system architecture than by model size alone. The latency and cost results illustrate that architectural choices introduce tradeoffs that are not visible in accuracy scores alone. Future work that would strengthen these findings includes larger case sets, independent replication across different benchmark authors, and evaluation of non-English domains and multi-turn scenarios.


Run It Yourself

All 90 cases, correct answers, ground-truth sources, and benchmark scripts are published under MIT license. No account required.

github.com/CertainLogicAI/llm-benchmarks

certainlogic.ai

Ready to build AI that actually works?

CertainLogic builds deterministic AI tools for small businesses. Fixed price. No surprises.