
Prompt engineering is the wrong frame — context is the failure

Ran a diagnostic on a medtech team's AI-drafted 510(k) summary last month. Seventeen citations. Four were to standards that don't exist in the cited form. Two referenced a predicate the team didn't actually select. One quoted a guidance document released eight months after the submission's planned filing date.

The team had been iterating on the prompt for six weeks. Better prompts. More detailed prompts. Prompts with explicit instructions about what not to hallucinate. None of it helped, because the problem wasn't the prompt.

The problem was that the model had nothing correct to work from. No design inputs. No predicate document. No version-pinned standards. The prompt was a sophisticated request made to a model that had no grounding in the specific regulatory reality the request assumed. The output was always going to be a plausible-sounding fabrication.

Practitioners who've shipped production AI in regulated contexts have already made this shift. Not prompt engineering. Context engineering. What you feed the model matters more than how you ask.

Why the prompt-engineering frame quietly fails in medtech

Prompt engineering made sense when the task was generic. Rewrite this email. Summarize this paper. Generate test cases for this function. The model had the knowledge; you just needed the right instruction.

Medical device documentation doesn't appear in any public training corpus. Your design inputs aren't public. Your predicate selection isn't. Your risk rationale isn't. The standards that govern your submission are versioned documents that change faster than model training cycles and whose revision history matters for compliance. IEC 62304:2006 and IEC 62304:2006/AMD1:2015 are not interchangeable. ISO 14971:2007 and ISO 14971:2019 differ materially on residual risk and benefit-risk.

A model asked to generate risk controls without knowing which standard version applies, which patient population the device serves, and which hazards the 14971 file already identifies will produce a risk control list that looks reasonable to someone unfamiliar with the device and wrong to anyone who isn't. You cannot fix that with a better-worded prompt. The prompt is fine. The information around it is missing.

The four context failure modes that actually ship

Most medtech AI tools fail for one or more of these four reasons. The model is almost never the problem. Context is.

1. No device context. The system has no access to the design inputs, the intended use statement, the technology description, or prior regulatory history. Output is a plausible risk analysis for a generic device in your category. The hazards are real hazards in the literature. They're not your hazards. The document can't be submitted without a wholesale rewrite, and worse, the team starts treating AI output as a rough draft rather than a candidate deliverable — which is the opposite of what acceleration should produce.

2. No predicate context. Substantial equivalence is predicate-specific. The model needs the actual cleared 510(k) summary — indications for use, technological characteristics, performance data — to produce a credible SE argument. Without the predicate document, every comparison is anchored to the model's guess at what the predicate looks like. Reviewers have direct access to the 510(k) database. They will check. They will find the discrepancy.

3. No risk rationale context. Risk management under ISO 14971 is a reasoning chain: hazard, hazardous situation, harm, severity, probability, acceptability, control, residual risk. Each link depends on decisions the team made about the specific device. A model without that context produces a risk analysis that looks structured but whose reasoning is generic. Risk controls don't connect to your architecture. Verification references don't match your protocols. Everything compiles and nothing is defensible.
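To make the dependency concrete, here is a minimal sketch of that reasoning chain as linked records. The field names, scales, and IDs are illustrative assumptions, not taken from ISO 14971; your risk acceptability criteria define the real scales. The point is structural: a model can only fill these links correctly if the team's actual decisions are in its context.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RiskControl:
    control_id: str
    description: str
    verification_ref: str  # must point at a real protocol in the DHF

@dataclass
class RiskChainEntry:
    hazard: str
    hazardous_situation: str
    harm: str
    severity: int              # e.g. 1 (negligible) .. 5 (catastrophic) -- illustrative scale
    probability: int           # e.g. 1 (improbable) .. 5 (frequent) -- illustrative scale
    controls: list = field(default_factory=list)
    residual_severity: Optional[int] = None
    residual_probability: Optional[int] = None

    def is_complete(self) -> bool:
        """A chain is only reviewable if every link is filled in."""
        return bool(self.controls) and \
            self.residual_severity is not None and \
            self.residual_probability is not None

# Hypothetical entry -- every value here is a team decision, not model output.
entry = RiskChainEntry(
    hazard="Incorrect dose calculation",
    hazardous_situation="Clinician acts on wrong displayed dose",
    harm="Overdose",
    severity=5,
    probability=2,
)
entry.controls.append(RiskControl("RC-014", "Independent dose cross-check", "VTP-031"))
entry.residual_severity, entry.residual_probability = 5, 1
```

A generator with only the hazard name in context can invent everything downstream of it; a generator with the full entry can only restate it.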

4. No QMS structure context. Your documents reference each other. Software Development Plan references Configuration Management Procedure. Risk Management Plan references Risk Acceptability Criteria. CAPA Procedure references the Risk Update Process. Without knowledge of document numbers, versions, and cross-references, AI output references procedures that don't exist in your QMS. The internal inconsistencies are painful to clean up manually and trivial to avoid with a QMS document graph in context.
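A QMS document graph can be as simple as a metadata index that generation is constrained against. The sketch below uses made-up document IDs; the check it runs — does every emitted cross-reference resolve to a document that actually exists? — is the one that eliminates this failure mode.

```python
# Hypothetical QMS index: document numbers, versions, and the references
# each document makes. All IDs are invented for illustration.
qms_index = {
    "SDP-001": {"title": "Software Development Plan", "version": "C",
                "references": ["CMP-002", "RMP-003"]},
    "CMP-002": {"title": "Configuration Management Procedure", "version": "B",
                "references": []},
    "RMP-003": {"title": "Risk Management Plan", "version": "D",
                "references": ["RAC-004"]},
}

def dangling_references(index):
    """Return (source_doc, missing_ref) pairs for cross-references that
    point at documents not present in the QMS index."""
    return [(doc_id, ref)
            for doc_id, meta in index.items()
            for ref in meta["references"]
            if ref not in index]

print(dangling_references(qms_index))  # [('RMP-003', 'RAC-004')]
```

Run the same check on AI output before review: any generated cross-reference that isn't a node in the graph is a fabrication by definition.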

[Diagram: the four context failure modes — no device context (model doesn't know your device class, indications, or predicate); no standards context (model uses wrong standard version or hallucinates clause numbers); no design history (model can't reference your actual risk rationale or design decisions); no QMS structure (model ignores your specific document hierarchy and review workflows).]
Four context failure modes that cause AI documentation tools to produce outputs that look plausible but can't be verified against your actual engineering record.

Context architecture per document type

Different documents demand different context architectures. These are the minimums I've found sufficient across enough programs to call them production-viable.

Risk file (ISO 14971). Device description and intended use. Use environment and user profile. Current standard (ISO 14971:2019) with applicable annex. Risk acceptability criteria established by the team. Prior risk analyses for related devices. Software safety class if IEC 62304 applies. Useful extras: FDA guidance for the device type, MAUDE adverse event data for comparable devices. Without the first six, output is generic. With them, output is specific enough to be reviewed rather than rewritten.

510(k) summary. Your device description, intended use, indications for use. The selected predicate's actual 510(k) summary — not a description of it, the document itself. The technological comparison structure FDA expects for the device category. Applicable guidance. The predicate document is load-bearing. Without it, every comparison in the output is unanchored.

[Diagram: context inputs — device design inputs (current version), predicate 510(k) clearance summary, intended use statement, performance test data, risk management summary, relevant FDA guidance — flow through structured retrieval into a context window of version-pinned sources, then through generation with citations to the 510(k) section output. Citations: [1] Design Inputs v2.3, [2] Predicate K213456, [3] FDA Guidance 2023.]
Context engineering for a 510(k) Section 12 (substantial equivalence). Every input is version-pinned and source-attributed. The output citation traces back to the specific retrieval.

Clinical evaluation report (EU MDR). Intended purpose. Relevant GSPRs from Annex I. Current MEDDEV 2.7/1 Rev 4 guidance. Literature search protocol and results for equivalent devices. Clinical claims the team intends to make. Any clinical investigations on the device. CER context is bigger than most documents because the source literature is extensive. The retrieval strategy needs to handle structured citation alongside free-text analysis, or the output will drop primary sources in favor of whatever the model remembers from training.

IEC 62304 software documentation. Software safety classification rationale. System architecture at the level of software items and their interfaces. Development lifecycle the team actually runs. SOUP inventory with version, provenance, and risk level. The software development plan, software architecture document, and SOUP list form a triangle. Each references the others. All three need to be in context for any one to be generated consistently.
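The SOUP inventory is the easiest of these to validate mechanically. A minimal sketch, with illustrative field names (your procedure defines the actual required attributes): any entry missing a version, provenance, or risk level can't be cited as context, because there's nothing pinned to cite.

```python
# Illustrative required attributes for a SOUP inventory entry.
REQUIRED_FIELDS = ("name", "version", "provenance", "risk_level")

def incomplete_soup_entries(inventory):
    """SOUP entries missing a required attribute can't serve as
    version-pinned context for IEC 62304 documentation."""
    return [e.get("name", "<unnamed>")
            for e in inventory
            if any(not e.get(f) for f in REQUIRED_FIELDS)]

# Hypothetical inventory: one complete entry, one that would be rejected.
inventory = [
    {"name": "sqlite", "version": "3.45.1",
     "provenance": "sqlite.org amalgamation", "risk_level": "B"},
    {"name": "legacy parser", "version": "",
     "provenance": "unknown", "risk_level": "C"},
]
print(incomplete_soup_entries(inventory))  # ['legacy parser']
```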

Chunking for design history files

DHFs are large. A mature Class II SaMD DHF runs hundreds of documents. You can't fit the whole thing in a context window, and trying to is almost always the wrong approach anyway.

Three retrieval patterns cover most production use.

Full document context. Appropriate when the output is a direct derivative of a specific source. Generating a 510(k) summary from design inputs: load design inputs in full. Generating software architecture from system architecture: load it in full. Maximum fidelity, constrained by context size. Use when the source is primary.

Section-level chunks. Appropriate when relevant content is a subset of a larger document and the subset can be identified deterministically or via retrieval. Generating risk controls for a specific hazard: retrieve the risk file sections addressing that hazard and related hazardous situations. Preserves semantic coherence while managing context load. Key constraint: chunk boundaries must respect document structure. Splitting a risk control from its harm description destroys the reasoning chain.

Metadata-only retrieval. Appropriate for maintaining cross-reference integrity without loading full content. Generating a Software Development Plan needs to know which procedures exist and their document numbers — but not their full text. A structured metadata index enables consistent cross-references without saturating the window.
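The key constraint on section-level chunks — boundaries must respect document structure — can be sketched in a few lines. This assumes numbered headings like "2.1 Hazard Analysis", which is an assumption about the document's conventions, not a universal rule; the point is that splits happen only at headings, so a risk control is never severed from its harm description.

```python
import re

def section_chunks(document: str):
    """Split at numbered headings (e.g. '4.2 Hazard Analysis') so every
    chunk is a whole section. Never splits inside a section, even a long
    one -- semantic coherence beats uniform chunk size here."""
    heading = re.compile(r"^\d+(\.\d+)*\s+\S", re.MULTILINE)
    starts = [m.start() for m in heading.finditer(document)]
    if not starts:
        return [document]
    bounds = starts + [len(document)]
    return [document[a:b].strip() for a, b in zip(bounds, bounds[1:])]

# Toy risk-file fragment: the hazard and its harm stay in one chunk.
doc = """1 Scope
This file covers device X.
2 Hazard Analysis
2.1 Hazard: incorrect dose
Harm: overdose. Control RC-014 applies.
"""
chunks = section_chunks(doc)
```

A fixed-token splitter applied to the same text could land a boundary between the hazard line and its harm line, which is exactly the reasoning-chain break described above.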

Context window vs RAG — which to choose

This decision is not primarily about performance. It's about fidelity.

RAG works when the relevant information is a small, retrievable subset of a large corpus and the retrieval signal is strong. Pulling the relevant section of a standards document for a specific requirement topic: good RAG candidate. Pulling the relevant predicate from thousands of cleared devices given a device description: good RAG candidate.

RAG fails when the retrieval signal is weak or when semantic similarity is a poor proxy for regulatory relevance. A risk control semantically similar to an existing control isn't necessarily the right control for your specific hazard. A guidance section that matches on keywords may not be the normatively relevant section for the submission type. In regulated contexts, retrieval errors are expensive — a plausible-but-wrong chunk produces output that passes superficial review and fails substantive review.

Direct loading works when the document is small enough to fit, when the model needs to reason across the whole document rather than a retrieved section, and when retrieval errors are unacceptable. For most regulated documents the answer is hybrid: direct loading for primary sources the output is derived from, RAG for supplementary standards and guidance providing normative framing.
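The hybrid pattern reduces to a small amount of orchestration. A sketch under stated assumptions: `retrieve` stands in for whatever RAG layer you actually run (here it's naive keyword overlap, which no production system should use), and the document names are invented. Primary sources bypass retrieval entirely.

```python
def retrieve(query: str, corpus: dict, k: int = 2):
    """Toy retriever: rank chunks by shared lowercase tokens with the
    query. A real system would use embeddings plus metadata filters."""
    q = set(query.lower().split())
    ranked = sorted(corpus.items(),
                    key=lambda kv: len(q & set(kv[1].lower().split())),
                    reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:k]]

def build_context(primary_docs: dict, supplementary: dict, query: str):
    """Primary sources are loaded whole, always; supplementary standards
    and guidance enter only through retrieval."""
    context = dict(primary_docs)
    for chunk_id in retrieve(query, supplementary):
        context[chunk_id] = supplementary[chunk_id]
    return context

# Hypothetical sources.
primary = {"design_inputs_v2.3": "intended use and requirements, loaded verbatim"}
supplementary = {
    "iec62304_cls": "software safety classification clause text",
    "iso14971_resid": "residual risk evaluation clause text",
    "meddev_lit": "literature search guidance text",
}
ctx = build_context(primary, supplementary, "software safety classification")
```

Note the asymmetry: a retrieval miss on a supplementary chunk degrades the output, but a retrieval miss on a primary source is impossible because primary sources never go through retrieval.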

Production principles that most early builds violate

Every claim needs a source in context. If the model asserts that a specific risk control reduces severity from S3 to S2, the basis — the design feature, the verification test, the literature — must be in the context window. If it isn't, the model is generating the rationale rather than retrieving it. In submissions, generated rationale isn't acceptable. Retrieved rationale is.

Version-pin everything. Standards, guidance, predicate 510(k)s — every context document has a version identifier. Retrieval serves the version current for the jurisdiction at submission time, not whatever the system indexed last quarter. Version drift in context is invisible in output and catastrophic in review.

Exclude what you don't control. General web content, unverified literature, AI summaries of standards, informal forum discussions — none of these belong in context for regulated generation. Primary sources only. Secondary and tertiary sources introduce noise and dilute precision.

Make it auditable. For each generated document, you should be able to answer: what was in the context window? Which retrieved chunks were included and why? Which versions of which standards were referenced? This isn't academic. It's a design controls requirement. Your QMS applies to AI-assisted generation the way it applies to any other design activity.
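The principles above converge on one artefact: a manifest written at generation time recording exactly what was in the window, version-pinned and content-hashed. A minimal sketch with illustrative field names — the shape matters more than the schema:

```python
import hashlib
import json
from datetime import datetime, timezone

def context_manifest(sources):
    """Record what was in the context window for one generation.
    sources: list of (doc_id, version, content) tuples. Written before
    generation, not reconstructed after -- the hash pins the exact text."""
    return {
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "sources": [
            {"doc_id": doc_id, "version": version,
             "sha256": hashlib.sha256(content.encode()).hexdigest()}
            for doc_id, version, content in sources
        ],
    }

# Hypothetical generation: two version-pinned sources in the window.
manifest = context_manifest([
    ("ISO 14971", "2019", "clause text as retrieved"),
    ("Design Inputs", "v2.3", "full document text"),
])
print(json.dumps(manifest, indent=2))
```

With this record per document, the audit questions have mechanical answers: what was in the window is the source list, and the hashes prove which exact text the versions pointed at.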

MANKAIND as a context engineering system

The reason most medtech AI tools fail isn't that the underlying models aren't capable. The models are. The context architecture doesn't exist. That architecture requires: design history structured as retrievable, version-controlled artefacts; an authoritative standards library with version management; a predicate database structured for programmatic access; and a QMS document graph that knows how procedures relate to each other.

That's the work MANKAIND does. Every document drawn from a context window built from your actual design history — your design inputs, architecture decisions, test protocols — alongside the current version of applicable standards, the predicate's actual cleared 510(k) summary, and your QMS structure. Output isn't a template populated with your device name. It's a document derived from your engineering record, every claim traceable to a specific source in that record.

When FDA asks where a claim came from, the answer is always a specific document, a specific clause in a specific standard version, or a specific test result. Not "the model said so." Context is the product. The model just reads it.

See how MANKAIND handles this

30-minute demo. Bring your hardest design controls question.