The reviewer agent — the production answer to 15–23% domain hallucination
I benchmarked frontier models on domain-specific medical device queries through Q4 2025: GPT-4o, Claude Sonnet 4.5, Gemini 2.5. Fifty queries each, covering ISO clause citations, IEC safety class rationale, 510(k) predicate comparisons, and FDA guidance references. The hallucination rate ran 15% on the cleanest model and 23% on the weakest.
One in five claims, give or take, is wrong. Confidently wrong. Stylistically indistinguishable from the four out of five that are correct.
For a 50-page clinical evaluation report with two hundred substantive assertions, the expected hallucination count runs 30 to 46. Every one of them is a potential finding if it gets into the submission. You can't prompt your way out of this. You can't proofread your way out either — the errors are invisible by design, because a well-trained model produces plausible errors.
The production pattern that works is architectural, not stylistic. Two agents. The generator writes. The reviewer audits. They never swap roles.
Why single-agent systems fail predictably
The instinct is to ask the model to check its own work. Chain-of-thought verification. Self-consistency sampling. A second prompt asking "review this output for accuracy." These techniques work in some contexts. They don't reliably work here.
The reason is structural. A model's confidence is calibrated to its training distribution, not to the truth of any specific claim about a specific regulatory document. When the model generates an incorrect ISO clause number, it does so because that number pattern-matches what it learned. When you ask the same model to verify that clause number, it runs the same pattern-matching operation against the same prior. It confirms the error, because the error is self-consistent with the model's beliefs.
Self-review improves surface consistency. It doesn't catch factual errors the model believes are facts. For regulatory documentation — where the entire value depends on accuracy of specific claims against specific sources — that's not an edge case. It's the failure mode.
Separation of concerns fixes it. The agent that generates doesn't verify. The agent that verifies doesn't generate. And the verifier reads the source documents directly, not the generator's summary of them.
The hallucination modes that actually ship to submissions
Four hallucination patterns show up in almost every AI-drafted regulatory document I've audited in the last eighteen months.
Standard references. The model knows ISO 14971 exists and governs risk management. It does not reliably know whether a specific clause applies to a specific situation, and it often confuses the 2007 and 2019 revisions. The 2019 revision reworked Annex ZA and the benefit-risk acceptance framework. Citing 2007 language in a current submission creates a compliance gap. IEC 62304 has the same problem across the 2006 base and 2015 amendment.
Performance thresholds. "The device shall achieve ≥95% sensitivity" is exactly the claim FDA expects traced to a design input. A model generating a risk section will produce plausible thresholds from training-distribution averages. If those thresholds don't appear in your actual design inputs, the document fails traceability, and a reviewer will find it.
Predicate device details. Cleared indications, technological characteristics, K-numbers. All available to FDA reviewers in the public 510(k) database. Models hallucinate predicate details at surprisingly high rates because the database is sparse in training data and the surface patterns are easy to confuse. A substantial equivalence (SE) argument built on a wrong predicate characterization is a finding the reviewer doesn't even need to think hard to catch.
Clinical evidence citations. Study identifiers, sample sizes, primary endpoints, follow-up durations. The numeric claims in a clinical evaluation are the hardest to spot-check and the easiest to fabricate convincingly. I've seen an AI-drafted CER cite a pivotal study at n=872 when the actual study was n=317. Same title. Different number. Nobody caught it until a notified body did.
The reviewer-agent pattern, literally
Two agents. Defined roles. No role overlap.
Agent A, the generator. Produces document content from source materials: risk analysis sections, 510(k) summaries, clinical evidence summaries, software lifecycle documentation. Job is generation. Has access to your DHF, source documents, standards templates. Produces structured output: document text plus a list of every claim it made and the source it believes supports each claim.
Agent B, the reviewer. Receives the generator's output and the source documents simultaneously. Doesn't generate. Audits. For each claim, locates the cited source, extracts the actual supporting text, compares claim to source. Checks standard references for version accuracy. Checks performance thresholds against design inputs. Produces a verdict per claim: verified, flagged, or unresolvable (no source located).
The system output is two things: the document, and the audit trace. The trace shows every claim, its stated source, the extracted source text, and verification status. Flagged claims — unverified against source — surface explicitly for human review.
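The claim-plus-verdict structure above can be sketched as a minimal data model. This is an illustrative shape, not MANKAIND's actual implementation: the class names, the `locate_source` lookup, and the naive containment check are all assumptions. A production reviewer would use an LLM judge or entailment model for the comparison step, not substring matching.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Verdict(Enum):
    VERIFIED = "verified"          # source text supports the claim
    FLAGGED = "flagged"            # source found, but it does not support the claim
    UNRESOLVABLE = "unresolvable"  # no source could be located


@dataclass
class Claim:
    text: str                   # the assertion as it appears in the document
    cited_source: str           # source the generator believes supports it
    extracted_text: str = ""    # what the reviewer actually found in the source
    verdict: Optional[Verdict] = None


def audit(claims, locate_source):
    """Reviewer pass: resolve each claim against its cited source.

    `locate_source` is a hypothetical lookup returning the supporting
    passage, or None when the citation cannot be resolved.
    """
    for claim in claims:
        passage = locate_source(claim.cited_source)
        if passage is None:
            claim.verdict = Verdict.UNRESOLVABLE
        else:
            claim.extracted_text = passage
            # Naive check for illustration; real systems compare claim
            # and source with a model, not a substring test.
            if claim.text.lower() in passage.lower():
                claim.verdict = Verdict.VERIFIED
            else:
                claim.verdict = Verdict.FLAGGED
    return claims
```

The point of the structure is that every claim carries its own provenance and verdict, so the audit trace falls out of the data model rather than being reconstructed afterward.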
The architectural principle: the reviewer has no stake in the document being correct. It wasn't involved in generating it. It isn't trying to make the document sound good. Its only job is to check traceability and accuracy. That independence is what makes it work.
Orchestration choices that actually matter
The basic generator-reviewer pipeline is the starting point. Production systems layer on top of it depending on document type and stakes.
Linear pipeline. Generator produces a section. Reviewer audits. Output goes to a human with the audit trace. Minimum viable architecture. Appropriate for lower-stakes sections or teams new to multi-agent workflows. Limitation: the reviewer runs once. If the generator made a structural error rather than a citation error, the reviewer may flag symptoms without identifying the root cause.
Parallel review. Multiple reviewers, different dimensions. One checks standards compliance — every cited standard, version, clause. Another checks internal consistency — does the risk section contradict the design input section? A third checks completeness — for this document type, are required elements present? Coverage scales without proportional latency. Appropriate for 510(k) summaries, EU MDR technical files, any document where multiple failure modes each have real consequences.
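The parallel pattern is straightforward to orchestrate because the three reviewers are independent of one another. A sketch, with the individual checks stubbed out (the function names and flag formats are illustrative assumptions, not a real API):

```python
from concurrent.futures import ThreadPoolExecutor


def check_standards(doc):
    """Flag cited standards that lack a revision year (stub heuristic)."""
    return [f"unversioned standard: {s}"
            for s in ("ISO 14971", "IEC 62304")
            if s in doc and s + ":" not in doc]


def check_consistency(doc):
    """Cross-section contradiction check (stubbed for the sketch)."""
    return []


def check_completeness(doc):
    """Required elements for this document type (stub list)."""
    required = ("Intended Use", "Risk Analysis")
    return [f"missing section: {r}" for r in required if r not in doc]


def parallel_review(doc):
    """Run independent reviewers concurrently. Coverage scales without
    proportional latency because the checks don't depend on each other."""
    reviewers = (check_standards, check_consistency, check_completeness)
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda r: r(doc), reviewers)
    return [flag for flags in results for flag in flags]
```

In a real system each reviewer is its own agent call with its own prompt and source access; the orchestration shape stays the same.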
Iterative refinement. Reviewer flags a claim. Generator receives the flag and the reasoning, revises the claim. Reviewer re-checks. Loop until verdict is verified or cycle limit hits. Most powerful for reducing human review burden. Critical guardrail: cycle limit. Two or three iterations max. Unconstrained refinement doesn't converge. Claims unresolved by cycle three need human judgment, not more iterations.
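The refinement loop with its cycle cap reduces to a few lines. Again a hedged sketch: `generate_revision` and `review` stand in for the two agent calls, and the verdict strings are assumptions.

```python
MAX_CYCLES = 3  # guardrail: unconstrained refinement doesn't converge


def refine(claim, generate_revision, review):
    """Generator/reviewer loop for a single flagged claim.

    `review` returns a (verdict, reasoning) pair; `generate_revision`
    takes the claim and the reviewer's reasoning and returns a revision.
    """
    for _ in range(MAX_CYCLES):
        verdict, reasoning = review(claim)
        if verdict == "verified":
            return claim, "verified"
        # Revise using the flag's reasoning, then re-check next cycle.
        claim = generate_revision(claim, reasoning)
    # Unresolved after the cycle limit: escalate, don't keep iterating.
    return claim, "needs_human"
```

The escalation branch is the important part: a claim that survives three cycles unverified is routed to a human, matching the guardrail described above.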
Most production systems combine: linear at the document level, parallel at the claim level, iterative for high-confidence resolvable flags. Architecture should be proportional to the risk consequence of an error and the volume of documents produced.
The audit trail as compliance artefact
The trace the reviewer agent produces isn't a debugging tool. It's a compliance artefact in its own right.
A notified body performing technical file review or an FDA reviewer evaluating a 510(k) is doing essentially the same thing the reviewer agent does: checking that claims in the document are supported by the source record. The audit trace makes their job tractable. Without it, a human reviewer faces a 50-page document and has to reconstruct, claim by claim, which source supports which assertion. With a claim-level trace, they see the work already done.
For the trail to function as a compliance artefact, four elements matter.
Claim-level provenance. Each claim mapped to source document, section, and the actual extracted text supporting it. Not "this section draws from the risk analysis." Each claim, individually sourced.
Claim-level verification status. Three states: verified (reviewer confirmed the source supports the claim), flagged (reviewer could not confirm it, including claims with no locatable source), human-reviewed (a human reviewed the flag and made a determination). The third state is critical. It shows the trail wasn't just generated and ignored.
Model version and timestamp. Submissions exist in specific version states. If a submission is questioned 18 months later, you need to reproduce which model version generated which content and when the reviewer ran. Not optional metadata.
Standards version specificity. Every standard reference includes the year of the revision. ISO 14971:2019, not ISO 14971. IEC 62304:2006+AMD1:2015, not IEC 62304. Revision matters for compliance. The trace has to make it explicit wherever a standard is cited.
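Putting the four elements together, a single trace entry might look like the following. The field names and version identifiers are illustrative assumptions, not a fixed schema.

```python
import json
from datetime import datetime, timezone

# Illustrative trace entry; every field name here is an assumption.
trace_entry = {
    "claim": "Residual risk is acceptable per the benefit-risk analysis.",
    "provenance": {
        "source_document": "RA-004 Risk Analysis",       # placeholder ID
        "section": "6.2",
        "extracted_text": "All residual risks were judged acceptable...",
    },
    "status": "verified",   # verified | flagged | human-reviewed
    "reviewed_by": None,    # attribution recorded when a human resolves a flag
    "generator_model": "model-x-2025-06-01",  # placeholder version string
    "reviewer_model": "model-y-2025-06-01",
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "standards_cited": ["ISO 14971:2019"],    # always revision-specific
}

print(json.dumps(trace_entry, indent=2))
```

Serialized as JSON, the entry is queryable later: which model produced which claim, when it was verified, and by whom, without reconstructing anything from the document text.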
What changes for the human reviewer
A human reviewer facing a 50-page document without a trace has an impossible task. Nobody reads 50 pages with enough attention to verify every claim against every source. What actually happens: reviewers scan, check familiar sections, sign off. Not negligence. The predictable response to an impossible task.
The reviewer-agent pattern changes the task. The human receives a document where every claim has been verified or flagged by machine. Verified claims don't require re-verification. Attention goes to the flagged queue. The job shifts from "does this entire document look right" to "does my judgment agree with the reviewer's verdict on these specific items."
That's tractable. A 50-page document with 200 claims might produce 8 to 15 flags. A human can evaluate 15 specific, framed questions with focused attention. Expertise applied to the decisions requiring it, not distributed across the whole document surface.
The result isn't reduced human oversight. It's more effective human oversight. The human is still the final authority on every flagged item. Cognitive load is concentrated where it belongs.
The trail also creates a meaningful record of human contribution. Each flagged item captures the human reviewer's determination with attribution. Defensible to a regulator. Specifically, the kind of documented human oversight FDA's evolving AI/ML guidance expects for AI-generated content in submissions.
MANKAIND runs every document through a reviewer
Every document MANKAIND generates runs through a reviewer agent before it surfaces to the user. Not a feature to opt into. A requirement built into the architecture because unverified output undermines the value of generating it.
A claim that can't be traced to your engineering record doesn't appear in output. The reviewer checks every citation against source documents in your DHF. Standard references checked for version accuracy. Performance thresholds matched against design inputs. Predicate data verified against the source 510(k) records.
The trace your team receives is structured at the claim level, with provenance, verification status, model version, and timestamp. When a reviewer asks where a specific claim came from, you have a machine-generated, human-confirmed answer at the claim level, not a reference to a document you hope contains the relevant language.
Reviewer-agent architecture isn't a differentiator for regulated AI. It's a baseline. Any system generating regulatory documents without independent claim verification is shipping liability. The question isn't whether to implement it. It's how well tuned your reviewer is to the failure modes specific to medical device submissions, and how the trace integrates with the rest of your design history.
See how MANKAIND handles this
30-minute demo. Bring your hardest design controls question.