
RAG for regulated documentation — why citation is load-bearing, not decorative

Three fine-tuned models I've reviewed in medtech documentation workflows shared the same failure mode. All three were confident, fluent, and wrong. One was citing ISO 14971:2007 risk acceptability language that the 2019 version substantially revised. Another was generating IEC 62304 software lifecycle content from an edition that a later amendment had revised. The third was referencing FDA guidance that had been superseded. No error flag. No staleness warning. Just authoritative-sounding output referencing requirements that no longer applied.

That's the core problem with fine-tuning in regulated domains. The knowledge is baked into the weights. When the standard updates, the model doesn't know.

Why fine-tuning fails the traceability test

Fine-tuning trains a model on your regulatory corpus. The result is behavioral consistency: the model writes requirements in EARS notation without being told, structures risk entries correctly, uses the right register. Real advantages.

The problem is that the knowledge and the behavior become the same thing. You can't update one without touching the other. And in a regulated context, knowledge has a timestamp.

ISO 14971:2007 to 2019 wasn't an editorial update. Risk management principles changed. Annex content changed. The acceptable risk framing shifted. A model trained before that update is silently wrong about a standard that governs every medical device risk file. Silent is the critical word. The model doesn't know it's wrong. It generates confident prose that cites a standard whose requirements it no longer accurately reflects. No error signal. Just a document that looks correct until a regulatory reviewer reads it against the current text.

Auditors read against the current text. Notified body assessors read against the current text. FDA reviewers read against the current text.

There's a second problem: traceability. When an auditor asks "what is this based on?", a fine-tuned model can't answer. It produced the output from weight activations. There's no source to point to. That's not a minor limitation in a regulated submission — it's a compliance gap. Every technical claim needs a traceable evidence source. A model that can't cite its sources isn't usable in a submission workflow regardless of output quality.

What RAG actually solves

Retrieval-augmented generation separates reasoning from knowledge. The model's capabilities — how to structure a risk analysis, how to write a software requirement, how to trace a design input — stay in the model. What ISO 14971:2019 section 6.3 currently requires, what your specific risk management file contains, what your predicate device's cleared indications are — that lives in a corpus the model retrieves from at inference time.

Update the corpus when the standard updates. No retraining. No revalidation of the model.

The citation isn't a byproduct. It's a direct output. Every generated claim arrives with a reference to the retrieved source: which document, which version, which section. "Based on section 6.3.1 of ISO 14971:2019 and section 4.2 of your risk management plan (RMP-001, version 2.1, approved 2025-11-14)." That's the answer to the auditor's question. That's what makes human review tractable.
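A citation like that can be assembled mechanically from chunk metadata. A minimal sketch, assuming each retrieved chunk carries a document id, version, section, and approval date (the field names here are illustrative, not a real schema):

```python
from dataclasses import dataclass

@dataclass
class SourceChunk:
    # Metadata carried with every retrieved chunk (names are illustrative)
    document_id: str
    version: str
    section: str
    approved: str  # approval date, ISO format

def format_citation(chunk: SourceChunk) -> str:
    """Render a chunk's metadata as an audit-ready citation string."""
    return (f"{chunk.document_id} v{chunk.version}, section {chunk.section} "
            f"(approved {chunk.approved})")

rmp = SourceChunk("RMP-001", "2.1", "4.2", "2025-11-14")
print(format_citation(rmp))
# → "RMP-001 v2.1, section 4.2 (approved 2025-11-14)"
```

The point of the structure is that the citation is derived from the same metadata that drove retrieval, so it can't drift from the source it describes.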

Teams that have actually deployed AI in regulated document workflows — not piloted, deployed — converge on RAG. Not because fine-tuned models can't produce good output. Because good output without traceability isn't usable. The citation is load-bearing.

| | Fine-tuning | RAG |
|---|---|---|
| Knowledge storage | Baked into weights | Retrieved at inference time |
| Standards updates | Requires retraining | Update corpus, not model |
| Citations | None — cannot trace output | Source attribution on every claim |
| Device-specific context | Not possible without retraining | Retrieved from your DHF |
| Regulatory validation | Re-validate on every update | Validate retrieval + generation separately |
In regulated documentation, citation traceability is a compliance requirement — not a nice-to-have. This is the primary reason RAG dominates fine-tuning in medical device documentation workflows.

Four structural advantages of RAG in regulated documentation

Citation is the most important. Not the only one.

Standards change, and the changes are substantive. ISO 14971:2007 to 2019 shifted core risk management principles. IEC 62304 amendments changed software lifecycle requirements. FDA guidance documents update regularly. A fine-tuned model trained on 2023 data is silently wrong about requirements that have since changed. A RAG corpus updates when the source document updates. Immediately. For every subsequent generation.

Device-specificity is never in a base model. Your predicate device's cleared indications, your design inputs from last Tuesday's engineering review, your particular risk management rationale for a novel mechanism — none of this exists in any training corpus. RAG retrieves it from your corpus. Fine-tuning would require a new model per device, per version, per significant change. At that cadence, validation alone makes fine-tuning impractical.

Regulatory validation of AI tools is an emerging requirement. Validating a RAG system means validating two separable things: retrieval quality and generation quality. They can be tested and qualified independently. Validating a fine-tuned model means revalidating after every update. If the model updates frequently to track standard changes, the validation effort becomes continuous and impractical.

Behavioral consistency doesn't require baking knowledge into weights. Fine-tune for behavior: write in EARS notation, structure risk entries as hazard → hazardous situation → harm, reference the applicable standard clause. Those patterns are stable. They don't change when standards update. Fine-tuning for behavior is appropriate. Relying on fine-tuning for knowledge is the mistake.

Building a RAG corpus for regulatory content

Naive RAG fails on regulatory content in specific, predictable ways. The production architecture looks different from the tutorial implementation.

Corpus design is the first decision. In: design inputs and outputs, risk management file, verification and validation records, cleared 510(k) summaries for predicate devices, FDA guidance documents, relevant ISO/IEC standards. Out: general internet content, unverified secondary sources, anything without a clear document version and approval date. Garbage-in-garbage-out applies with regulatory consequences.
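That admission rule can be enforced as a gate rather than a convention. A sketch under stated assumptions: documents arrive as dicts, and the type names and field names below are hypothetical, not a real schema:

```python
def admit_to_corpus(doc: dict) -> bool:
    """Admission check: only versioned, approved documents of known
    regulatory types enter the corpus. Field names are illustrative."""
    allowed_types = {
        "design_input", "design_output", "risk_file", "v_and_v",
        "510k_summary", "fda_guidance", "standard",
    }
    return (
        doc.get("type") in allowed_types
        and bool(doc.get("version"))
        and bool(doc.get("approval_date"))
    )

assert admit_to_corpus({"type": "standard", "version": "2019",
                        "approval_date": "2019-12-10"})
assert not admit_to_corpus({"type": "blog_post", "version": "1"})  # wrong type
assert not admit_to_corpus({"type": "risk_file", "version": "3"})  # no approval date
```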

Chunking strategy for regulatory content differs from general documents. Sentence-level chunks fail — a single sentence from IEC 62304 section 6.3 may not be interpretable without surrounding clause context. Section-level chunks are the right unit for most regulatory content. Each chunk gets metadata: document type, version, approval date, device applicability, standard number. That metadata drives retrieval filtering and citation generation.
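Section-level chunking with attached metadata can be sketched as below. The heading regex is an assumption about how clause numbers appear in the source text; real standards need a parser tuned to their formatting:

```python
import re

def chunk_by_section(text: str, doc_meta: dict) -> list[dict]:
    """Split a document into section-level chunks, tagging each with the
    document's metadata plus its own clause number. The heading pattern
    (lines starting with '6.1 ', '6.2 ', ...) is an assumption."""
    chunks = []
    parts = re.split(r"(?m)^(?=\d+(?:\.\d+)*\s)", text)
    for part in parts:
        part = part.strip()
        if not part:
            continue
        clause = part.split()[0]
        chunks.append({**doc_meta, "clause": clause, "text": part})
    return chunks

doc_meta = {"standard": "ISO 14971", "version": "2019", "doc_type": "standard"}
sample = "6.1 General\nRisk control text...\n6.2 Option analysis\nMore text..."
for c in chunk_by_section(sample, doc_meta):
    print(c["clause"], "->", c["text"][:20])
```

Every chunk inherits the version and approval metadata of its parent document, which is what later makes filtering and citation generation mechanical rather than inferred.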

Hybrid retrieval handles the two types of regulatory queries that matter. Dense retrieval — embedding-based semantic search — handles paraphrase and conceptual similarity. Ask about "risk acceptability criteria" and it finds relevant content even when the exact phrase doesn't appear. Sparse retrieval — BM25 keyword matching — handles exact clause references. Ask for "IEC 62304 section 5.2.1" and it finds it reliably. Neither alone is sufficient on regulatory content. Both together work.
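The fusion of the two retrieval modes can be sketched with stand-ins: a token-overlap score in place of real BM25, and a caller-supplied dense score in place of an embedding model. Only the fusion shape is the point here:

```python
def keyword_score(query: str, doc: str) -> float:
    """Stand-in for BM25: exact-token overlap, which is what catches
    clause references like 'IEC 62304 section 5.2.1'."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def hybrid_score(query: str, doc: str, dense_score: float, alpha: float = 0.5) -> float:
    """Fuse sparse and dense scores. In production dense_score comes from
    an embedding model; here it is supplied by the caller."""
    return alpha * keyword_score(query, doc) + (1 - alpha) * dense_score

docs = {
    "A": "IEC 62304 section 5.2.1 software requirements analysis",
    "B": "criteria for acceptability of residual risk",
}
query = "IEC 62304 section 5.2.1"
# Suppose the embedding model scored both docs equally (hypothetical value):
ranked = sorted(docs, key=lambda k: hybrid_score(query, docs[k], dense_score=0.5),
                reverse=True)
print(ranked)  # → ['A', 'B']: the exact clause match wins on the sparse side
```

When the dense scores tie, the sparse side breaks the tie in favor of the exact clause reference, which is exactly the behavior the clause-lookup query type needs.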

Re-ranking matters. Embedding similarity has a known failure mode on technical documents: summary sections often score higher on semantic similarity than the normative clause that actually answers the query. A cross-encoder re-ranker — which scores each candidate passage against the full query jointly — substantially improves precision on the queries that matter most for regulatory generation. In production, re-ranking isn't optional overhead. It's the layer that separates acceptable retrieval from retrieval you can stake a submission on.
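The re-ranking stage is just a second, more expensive pass over the candidates. A sketch with the scorer injected, since a real cross-encoder is a trained model; the toy scorer below crudely mimics the summary-vs-normative-clause failure mode described above:

```python
def rerank(query: str, candidates: list[str], joint_score, top_n: int = 3) -> list[str]:
    """Re-rank retrieval candidates with a joint query-passage scorer.
    In production joint_score would be a cross-encoder model; here it is
    injected so the control flow is visible."""
    scored = [(joint_score(query, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_n]]

# Toy joint scorer: favors passages with normative language ("shall")
# over summary language -- the failure mode above, in miniature.
def toy_scorer(query: str, passage: str) -> float:
    return passage.count("shall") + (1 if "risk" in passage else 0)

candidates = [
    "Summary: this clause discusses risk acceptability in general terms.",
    "The manufacturer shall define criteria for risk acceptability.",
]
print(rerank("risk acceptability criteria", candidates, toy_scorer, top_n=1))
```

First-stage retrieval may rank the summary higher on raw similarity; the joint scorer pulls the normative clause to the top.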

Freshness weighting addresses the version problem. When both ISO 14971:2007 and ISO 14971:2019 are in the corpus — which they should be, for historical traceability — retrieval must prefer the current version for generation while preserving access to the historical version when needed. Time-weighted scoring combined with version metadata filtering ensures generated documents cite current requirements, not superseded ones.
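Version-aware scoring can combine a hard gate on superseded versions with a soft decay on age. A minimal sketch; the half-life and the `superseded` flag are illustrative choices, not a prescribed policy:

```python
from datetime import date

def freshness_weight(chunk: dict, today: date = date(2026, 1, 1),
                     half_life_days: int = 1460) -> float:
    """Exponential decay on document age (half-life ~4 years, illustrative)."""
    age = (today - chunk["approved"]).days
    return 0.5 ** (age / half_life_days)

def score_for_generation(chunk: dict, base_score: float) -> float:
    """Superseded versions stay in the corpus for historical traceability,
    but are hard-gated out of generation; the gate matters more than the decay."""
    if chunk.get("superseded"):
        return 0.0
    return base_score * freshness_weight(chunk)

old = {"doc": "ISO 14971:2007", "approved": date(2007, 3, 1), "superseded": True}
new = {"doc": "ISO 14971:2019", "approved": date(2019, 12, 1), "superseded": False}
assert score_for_generation(new, 0.9) > score_for_generation(old, 0.95)
```

Historical lookups ("what did the 2007 version require?") bypass the gate by querying with an explicit version filter instead of the generation-time scorer.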

[Pipeline diagram] Query ("generate 510(k) Section 12") → corpus (design inputs, standards, predicate data, risk file, V&V records, guidance docs) → hybrid retrieval (BM25 + dense vectors: keyword + semantic) → cross-encoder re-ranking (joint query scoring) → context window (retrieved chunks with metadata and version info) → generated output, every claim citing its source. Why hybrid? BM25 matches exact clause references ("ISO 14971:2019 §4.2"); dense retrieval matches semantic meaning.
The full RAG pipeline for regulatory document generation. Hybrid retrieval handles both exact standard references and semantic queries. Every output chunk cites its source.

What you fine-tune for, what you RAG for

Fine-tuning and RAG aren't competing. The production architecture uses both, for different purposes.

Fine-tune for behavioral alignment. Always write requirements in EARS notation. Always structure risk entries as hazard → hazardous situation → harm. Always output design inputs in the DHF template format. These patterns are stable. They don't change when standards update. Fine-tuning them in produces consistent output without requiring complex prompting.

RAG for knowledge retrieval. What does IEC 62304 section 5.4 currently require? What were the cleared indications of the predicate device your engineering team identified last month? What did your design review on March 14th conclude about the software architecture change? This knowledge changes. It's device-specific. It must be traceable.

The separation is clean: behavioral reasoning in the model, knowledge in the corpus. Update cycles are independent. That independence is the architectural advantage.

What evaluation looks like in production

"Does the output sound correct?" isn't a useful evaluation for regulatory RAG. You need continuous evaluation, not just a deployment gate.

Retrieval quality: given a set of regulatory queries with known correct sources, what fraction of the time does the system retrieve the right source in the top-k results? Track by query type — clause-level, document-level, device-specific — because they have different retrieval profiles and different failure modes.
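The core retrieval metric is recall@k over a labeled query set. A minimal sketch; the query strings and document ids below are hypothetical:

```python
def recall_at_k(results_by_query: dict, gold_source: dict, k: int = 5) -> float:
    """Fraction of queries whose known-correct source appears in the top-k.
    results_by_query maps query -> ranked list of retrieved doc ids;
    gold_source maps query -> the one known-correct doc id."""
    hits = sum(
        1 for q, ranked in results_by_query.items()
        if gold_source[q] in ranked[:k]
    )
    return hits / len(results_by_query)

results = {
    "what does 62304 5.2.1 require": ["62304-5.2.1", "62304-5.2.2"],
    "risk acceptability criteria": ["14971-intro", "14971-4.2", "14971-7.4"],
    "predicate cleared indications": ["rmp-001", "dhf-014"],
}
gold = {
    "what does 62304 5.2.1 require": "62304-5.2.1",
    "risk acceptability criteria": "14971-4.2",
    "predicate cleared indications": "510k-summary",  # miss: not retrieved
}
print(recall_at_k(results, gold, k=3))  # → 2 of 3 queries hit, ~0.67
```

Computing this separately per query type (clause-level, document-level, device-specific) is what exposes the different failure modes.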

Generation faithfulness: given the retrieved sources, does the generated output accurately represent what those sources say? A hallucination in a regulatory context isn't a minor quality issue — it's a document with claims that cannot be verified against any source. A separate model verifying that each generated claim is grounded in retrieved context catches the failure mode where retrieval is correct but generation introduces unsupported claims.
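The shape of that per-claim gating can be illustrated with a deliberately crude check: the share of claim tokens that appear anywhere in the retrieved context. A production verifier would be an NLI or LLM-based model; token overlap is only a sketch of the control flow, and the threshold is arbitrary:

```python
def grounded(claim: str, retrieved_chunks: list[str], threshold: float = 0.6) -> bool:
    """Crude groundedness check: fraction of claim tokens present in the
    retrieved context. Illustrates the per-claim gate, not a real verifier."""
    context = " ".join(retrieved_chunks).lower().split()
    claim_tokens = claim.lower().split()
    covered = sum(1 for t in claim_tokens if t in context)
    return covered / max(len(claim_tokens), 1) >= threshold

chunks = ["The manufacturer shall define criteria for risk acceptability "
          "in the risk management plan."]
print(grounded("criteria for risk acceptability shall be defined", chunks))  # True
print(grounded("residual risk must be disclosed in the IFU", chunks))        # False
```

The second claim may well be true, but it isn't supported by what was retrieved, and in this workflow unsupported means ungrounded.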

Citation accuracy: does the citation in the output correctly identify the retrieved source? Compare cited document, version, and section against the actual retrieved chunk. Citation errors are distinct from generation errors and need to be tracked separately.
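The citation-accuracy check is a field-by-field comparison between the emitted citation and the chunk the claim was generated from. A minimal sketch, reusing illustrative metadata field names:

```python
def citation_matches(cited: dict, retrieved_chunk: dict) -> bool:
    """A citation is accurate only if document, version, and section all
    agree with the chunk the claim was actually generated from."""
    fields = ("document_id", "version", "section")
    return all(cited.get(f) == retrieved_chunk.get(f) for f in fields)

chunk = {"document_id": "RMP-001", "version": "2.1", "section": "4.2"}
good  = {"document_id": "RMP-001", "version": "2.1", "section": "4.2"}
stale = {"document_id": "RMP-001", "version": "2.0", "section": "4.2"}  # wrong version
assert citation_matches(good, chunk)
assert not citation_matches(stale, chunk)
```

A version mismatch like the `stale` case is exactly the error class that generation-faithfulness checks won't catch, which is why the two metrics are tracked separately.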

End-to-end precision on standard clause references is the metric closest to what actually matters. Take known-correct regulatory claims with authoritative citations, run them through the system, measure whether the system retrieves the right source and generates an accurate claim. That's the test that tells you whether the system is deployable in a submission workflow.

MANKAIND

The corpus problem for medical device documentation isn't just loading the right standards. It's keeping the corpus current with your engineering record in real time. When your design input changes today, the document generated tomorrow needs to reflect it. When a risk item is updated after a design review, the risk management file generated next week needs to cite the updated item — not last month's version.

MANKAIND's retrieval corpus is your live engineering record. Every design decision, risk item, verification result, and regulatory artifact is a retrievable source. Generated documents cite your actual record — not a model's interpolation of what your record might say based on training data. The citation format isn't decorative. It's the audit trail that makes every generated document defensible.

See how MANKAIND handles this

30-minute demo. Bring your hardest design controls question.