Using ChatGPT and Claude for medical device requirements — the honest version
Engineers at Class II and Class III medtech companies are already using ChatGPT and Claude to draft requirements. It's happening. It's not a pilot. Most teams aren't saying so publicly because their compliance function hasn't figured out how to say yes to it yet.
So let's skip the debate about whether. The real question is where general-purpose LLMs help, where they fail predictably, and what has to be true before the output is defensible in a 510(k) or CE submission.
What ChatGPT and Claude genuinely do well
Start with what works. Otherwise the framing is wrong.
First-draft structure. Give a frontier model a device function description, and it will produce a plausibly structured first draft of EARS-format requirements in under a minute. The structure is usually sound — subject, verb, object, condition. Event-driven, state-driven, and unwanted-behaviour templates are well-represented in training data. Getting to a reviewed first draft in 20 minutes instead of two hours is real, and it's real even for senior engineers. Drafting fatigue is a legitimate cost of the job.
Gap-finding on an existing set. "Here are my current requirements for the alarm subsystem. What's missing?" is a prompt these models answer well. They pattern-match against common missing failure modes, unstated conditions, and inconsistent granularity. Not infallible. Useful as a second reviewer on a draft before a formal design review.
Rewriting for consistency. If your requirements were written by three engineers in three styles, a model will normalise them into EARS without changing intent, faster than any engineer will. Passive to active. Multi-sentence to single-statement. Compound statements flagged for splitting.
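The normalisation pass pairs well with a mechanical check before anything goes to the model. As a rough sketch — the regexes and flag strings here are illustrative, not any standard tooling — a few lines can flag statements that break the EARS event-driven template or compound the response clause:

```python
import re

# Words that usually signal a compound requirement that should be
# split into separate single-statement requirements.
COMPOUND_MARKERS = re.compile(r"\b(and|or|as well as|also)\b", re.IGNORECASE)

# EARS event-driven template: "When <trigger>, the <system> shall <response>."
EARS_EVENT = re.compile(r"^When .+, the .+ shall .+\.$")

def flag_for_review(requirement: str) -> list[str]:
    """Return review flags for a single requirement statement."""
    flags = []
    if not EARS_EVENT.match(requirement):
        flags.append("not in EARS event-driven form")
    # Only check the response clause for conjunctions; the trigger
    # clause may legitimately contain "and".
    _, _, response = requirement.partition(" shall ")
    if COMPOUND_MARKERS.search(response):
        flags.append("compound statement - consider splitting")
    return flags

print(flag_for_review(
    "When the probe is disconnected, the monitor shall raise an alarm "
    "and log the event."
))  # → ['compound statement - consider splitting']
```

A check like this doesn't replace the model's rewrite; it tells you which statements to hand it, and verifies the output came back in the template you asked for.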
Generating adversarial questions. What are the failure modes for this component? What environmental conditions should requirements cover? What happens when power is interrupted mid-operation? LLMs generate these questions quickly and cheaply, and the questions are mostly good ones.
If a model does nothing else in your workflow, those four uses earn it a seat. The failures start when you ask it to do the authoritative work.
Where general-purpose LLMs fail predictably
The failures aren't obvious bad outputs. They're plausible outputs missing something your regulatory framework requires — which is worse, because nobody flags them.
Safety-class blindness. The model doesn't know whether your device is IEC 62304 Class A, Class B, or Class C. It can't know unless you tell it, and even then it has no mechanism to calibrate the depth and specificity of its output to the class. A Class C software item needs unit verification with documented coverage. The model will happily write the same requirement set it would for a Class A item, because class is an external constraint the model has no way to enforce.
No risk file context. Good requirements trace to user needs, risk controls, or regulatory clauses. A model generating in isolation has no knowledge of your ISO 14971 file, your user need list, or your intended use statement. Requirements it produces are sourceless by construction. You can paste the risk file into the prompt. Most engineers won't, and even when they do, the traceability is a post-hoc reconstruction.
Invented performance thresholds. This is the one that scares me. "The alarm shall activate within 5 seconds" sounds fine. It's also completely made up, drawn from training-distribution averages that have no relationship to your specific risk analysis or clinical evidence. For a Class C alarm, 5 seconds might be 4.5 seconds too slow. The model won't know.
Hallucinated standard citations. Ask a frontier model what standards apply to your device. The answer will be confident, plausible, and partially wrong. Misquoted clause numbers. Standards that don't exist in the cited form. Standards that apply to a different device category. I've tested this on GPT-4, Claude Sonnet, and Gemini across fifty medtech queries in the last year. The citation error rate ran 15–23% depending on query complexity. In a submission, a wrong standard reference isn't a typo. It's a finding.
The regulatory hallucination patterns worth knowing
In the testing I ran across 2025, five hallucination patterns showed up often enough to name.
Phantom clause numbers. "IEC 62304 Clause 5.9.1 requires..." when Clause 5 of IEC 62304 stops at 5.8. The model is pattern-matching clause structure from ISO 14971 or 13485 and confabulating. Plausible. Wrong.
Cross-standard bleed. Attributes of one standard attributed to another. ISO 13485 clauses quoted as IEC 62366 clauses. 21 CFR 820 language attributed to EU MDR Annex II. Happens because the underlying text is related in training data and the model doesn't distinguish at high resolution.
Stale regulation. Citations to the 2007 edition of IEC 62366 when the model is asked about current requirements. Citations to MDD when the device needs MDR. Both look right at a glance. Both are wrong.
FDA guidance that doesn't exist. The model confabulates titles of guidance documents. "FDA Guidance on AI-Enabled Clinical Decision Support, May 2023" when no such document exists — only the September 2022 final guidance on CDS, which has a different title. Regulatory submissions die on this kind of citation.
Invented predicate devices. For 510(k) work, the model will suggest K-numbers that either don't exist or belong to a different device class. Always verify against the FDA 510(k) database. Always.
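Both the clause check and the K-number check are mechanisable, at least as a first gate. A minimal sketch — the clause sets below are illustrative placeholders, and a real controlled standards library would live under document control, not in source code:

```python
import re

# Illustrative, NOT authoritative: a hard-coded stand-in for a
# controlled standards library maintained under the QMS.
KNOWN_CLAUSES = {
    "IEC 62304": {"5.1", "5.2", "5.3", "5.4", "5.5", "5.6", "5.7", "5.8"},
}

K_NUMBER = re.compile(r"^K\d{6}$")  # 510(k) numbers: 'K' + six digits

def check_clause(standard: str, clause: str) -> bool:
    """True only if the cited clause appears in the controlled library."""
    return clause in KNOWN_CLAUSES.get(standard, set())

def plausible_k_number(k: str) -> bool:
    """Format check only. A real pipeline would then confirm the record
    actually exists in the FDA 510(k) database (openFDA exposes it at
    api.fda.gov/device/510k.json)."""
    return bool(K_NUMBER.match(k))

print(check_clause("IEC 62304", "5.9"))   # → False: clause not in library
print(plausible_k_number("K123456"))      # → True: format ok, existence unverified
```

The point of the gate is to convert "looks plausible" into "matches a record we control" — anything that fails it goes back to a human with the original source open.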
When AI requirements generation is defensible
The defensibility question comes up in audits. An auditor asks how a requirement was generated. If the answer is "our engineer used ChatGPT," the next question is what review and approval process was applied. The answer to that second question is the only one that matters.
The rule I've been using with clients: any AI-generated requirement is a draft until a qualified engineer reviews it against the risk file, assigns a source, and approves it under a documented procedure. The AI contribution is acceleration, not authorship. The documented procedure is the audit answer.
Three conditions make it work:
- The AI tool is used under a documented procedure inside your QMS. Not "engineer writes a prompt and pastes the result." A procedure that specifies what prompts are acceptable, what review is required, and what rejection criteria apply.
- Every generated requirement is linked to a source before acceptance. User need, risk control, or standard clause. If it can't be linked, it doesn't enter the SRS.
- Every standard citation, performance threshold, and regulatory reference is verified against the original source by a qualified engineer. No exceptions, no automation of this step.
Companies that run AI-assisted requirements generation this way have had no issues in the audits I'm aware of. Companies that let engineers paste model output into SRS documents without review have had several.
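The "no source, no entry" rule can be made mechanical rather than procedural. A minimal sketch of the gate — the field names and ID formats here are hypothetical, not any particular ALM schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DraftRequirement:
    text: str
    source: Optional[str] = None       # e.g. user need "UN-012" or risk control "RC-034"
    reviewed_by: Optional[str] = None  # qualified engineer who approved it

def accept_into_srs(req: DraftRequirement) -> bool:
    """Gate: a draft enters the SRS only with a source link and a named reviewer."""
    return req.source is not None and req.reviewed_by is not None

draft = DraftRequirement("When power is lost, the pump shall retain settings.")
print(accept_into_srs(draft))  # → False: no source, no reviewer yet
draft.source = "RC-034"
draft.reviewed_by = "J. Doe"
print(accept_into_srs(draft))  # → True
```

The value isn't the five lines of logic; it's that AI-generated text physically cannot reach the SRS without passing through the same fields a human-authored requirement would.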
Prompts that actually improve output
If you're using a general-purpose model today, four prompt patterns improve output substantially.
- Supply the context the model doesn't have. Device description, intended use, patient population, software safety class, regulatory pathway. Every prompt. Don't assume the model remembers.
- Constrain the format explicitly. "Write requirements in EARS event-driven format with explicit timing constraints" produces better output than "write requirements." State the template.
- Ask for challenge, not completeness. "What failure modes are missing from these requirements?" outperforms "write complete requirements." Use the model to attack your draft, not to author the authoritative version.
- Never use AI-generated numbers directly. Treat thresholds, tolerances, and timing values as placeholder text. Every number in a final requirement comes from your risk analysis or clinical evidence.
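The first, second, and fourth patterns can be enforced by construction rather than discipline: build every prompt through one function that injects the device context and the formatting constraints. A sketch, with illustrative context values:

```python
DEVICE_CONTEXT = {
    "device": "volumetric infusion pump",            # illustrative values
    "intended_use": "controlled IV drug delivery in acute care",
    "population": "adult inpatients",
    "safety_class": "IEC 62304 Class C",
    "pathway": "FDA 510(k)",
}

def build_prompt(task: str, context: dict[str, str]) -> str:
    """Prepend the full device context to every prompt; never rely on
    the model remembering it from an earlier turn."""
    header = "\n".join(f"{k}: {v}" for k, v in context.items())
    return (
        f"{header}\n\n"
        f"Task: {task}\n"
        "Format: EARS event-driven, one statement per requirement.\n"
        "Use the placeholder <TBD> for every numeric threshold; "
        "numbers will be supplied from the risk analysis."
    )

print(build_prompt(
    "List failure modes missing from the attached alarm requirements.",
    DEVICE_CONTEXT,
))
```

Forcing `<TBD>` placeholders is the cheap insurance against the invented-thresholds failure: the model can't smuggle a made-up "5 seconds" into a draft if the prompt forbids it from emitting numbers at all.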
What changes when AI is built for this context
The gap between general-purpose AI and purpose-built engineering AI isn't marginal. It's architectural.
A general-purpose model works from a prompt. A purpose-built platform works from your engineering record — your specific device description, your risk analysis, your identified hazards, your user needs, your regulatory pathway. When it helps you draft a requirement, it already knows which risk controls need implementing, which user needs aren't covered, and what performance thresholds your risk analysis supports.
The result is requirements that are traceable from creation. Not because someone adds traceability links later. Because the requirement was written against the design record it belongs to, and the record was the context, not the output.
That's how MANKAIND approaches requirements assistance. Not as a prompt interface on top of a language model. As engineering intelligence embedded in the platform where design decisions already live. Your team writes and approves every requirement. The platform ensures those requirements connect to everything the engineering record needs them to connect to, and that every citation is pulled from a controlled standards library rather than the model's memory of one.
See how MANKAIND handles this
30-minute demo. Bring your hardest design controls question.