An agent proposes. A verifier and a human decide.

The failure mode is not wrong answers. It is confident wrong answers.

Naive RAG demos beautifully. Embed your documents, retrieve the nearest chunks, let the model write a fluent answer. In the demo, everyone nods.

In production, a customer asks whether your hardware handles 600,000 units an hour. Retrieval surfaces a datasheet that mentions the product line, the model writes a confident yes, and the system reports 0.9 confidence. The right answer was no. Nobody finds out until the customer does.

Every team that puts retrieval in front of a language model discovers the same thing: the model does not know what it does not know. Retrieval similarity is not truth. A chunk that looks relevant is all the model needs to commit, and the prose it commits with reads exactly as trustworthy as a correct answer. That symmetry is the whole problem. In a high-stakes domain, the only real failure is being confidently, silently wrong.

Hallucination is an architecture problem, not a model problem.

You cannot prompt your way out of this, and a bigger model only lowers the rate. If one component both writes the answer and judges the answer, you do not have a system, you have an opinion. The fix is structural: separate proposing from deciding.

The shape I build is simple to state. An agent proposes. A verifier and, when it matters, a human decide.

The architecture.

The agent reads from a store of cited, versioned claims, not from raw chunks, and calls typed tools when an answer needs computation. Its output is a structured proposal where every factual statement carries a citation. Facts never live in model weights. No fine-tuning.

The verifier is a separate step, and the part most people get wrong: the decisive checks are recomputed in code, not delegated to a second model's opinion. Do the citations resolve to real, current claims? Was that number produced by a tool, and does a source actually back the result? Hard constraints are walls, not penalties: a violated constraint vetoes the answer, it does not lower a score.

Confidence is three numbers, never one. Did I understand the question (extraction)? Is the answer backed by sources (grounding)? Am I sure overall (answer)? Collapsing them into a single score is how systems learn to hide their doubt. A gate reads all three and routes each request to one of three outcomes: answer, redirect, or escalate to a person.

Everything is appended to an audit log: every retrieval, tool call, verdict, and human decision. Any answer the system ever gave can be replayed and explained with zero model calls. When a human corrects the knowledge, the correction forks a new claim version, so the system gets more right over time and you can see exactly when and why.

And you measure the thing that matters: not just accuracy, but correct deflection. An agent that refuses exactly when it should is worth more than one that answers everything.

What changes in practice.

Take the 600,000 units an hour question. The verified agent calls a throughput tool, computes that the request implies 6 milliseconds per unit against a 10 millisecond switching window, finds no source backing the result, and the verifier blocks the answer: it will not assert a rate the datasheet does not cover. The customer gets a person, with the computation attached. The naive version said yes.

The agent stops being a liability exactly where it used to be most dangerous: the edge of its knowledge. There is a live demo of this on my homepage. Flip the same question between a naive bot and the verified agent and watch what changes.

What it costs.

Honesty about the trade: verification adds a step to every answer, so it costs latency and tokens. The claim store is real curation work; someone owns the knowledge. And not every product needs this. If a wrong answer is cheap (internal search, say), or a human already reviews every output, naive RAG is fine and you should use it.

The line is simple: the moment an agent acts or answers in front of customers, where being wrong costs money, trust, or safety, the verifier pays for itself the first time it blocks one bad answer.

Questions I get.

How do you stop an LLM from hallucinating in production? You do not, at the model level. You make the system catch it: ground answers in cited claims, recompute the factual signals in code, and give the verifier a veto.

Does this need fine-tuning? No. Facts live in a claim store your team can read, edit, and version. Fine-tuning bakes facts into weights, where nobody can audit or correct them.

What does the verifier actually check? That citations resolve to current claims, that computed values are source-backed, that hard constraints hold, and that the proposal is structurally valid. The decisive checks run as code; the model's self-assessment is never trusted over a recomputation.

When is this overkill? Whenever a wrong answer costs you nothing. Verification is for the answers you cannot afford to get wrong.