ISO 14971 for AI/ML devices — where the standard strains and how to hold it
Reviewed a dermatology SaMD risk file in late 2025. Class IIa, CE-marked under MDR, FDA 510(k) in review. Hazard identification section listed one AI-specific entry: "incorrect classification output." That was it.
The model was trained on 38,000 images, 78% of which came from Fitzpatrick types I-III. Deployed population skewed considerably darker. The risk file said nothing about distribution shift, nothing about demographic performance stratification, nothing about confidence miscalibration on out-of-distribution inputs. The CDRH response took four months, and the first additional information (AI) request ran eleven pages.
ISO 14971 was written for deterministic devices. You can enumerate the failure modes of a valve because a valve obeys physical laws. For a trained model, the failure mode space has a different character — it's statistical, continuous, and conditional on a deployment environment you don't fully control. The standard still applies. The process still works. What fails is hazard identification when teams treat the AI component like a deterministic module.
The five AI failure modes that belong in hazard identification
Every AI-enabled device risk file should address these five explicitly. Not in a paragraph labeled "AI considerations." As separate hazards with severity, probability, and risk controls.
1. Distribution shift. A model trained on 2020–2023 patient data performs differently on 2026 data if imaging equipment has changed, clinical protocols have drifted, disease prevalence has moved, or the patient mix has shifted. Not a defect. A predictable consequence of deploying a static model in a dynamic environment. Hazard: clinical decisions made on degraded real-world performance. Risk control: post-market performance monitoring with pre-specified thresholds that trigger re-evaluation or field action.
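Pre-specified thresholds only work if they map observed degradation to a defined response before deployment. A minimal sketch, assuming AUC as the tracked metric; the function name and delta values are illustrative, not values from the standard or the risk file:

```python
# Illustrative drift-response tiers. The metric (AUC) and the delta
# values are hypothetical examples, not ISO 14971 requirements.

def drift_response(baseline_auc: float, live_auc: float,
                   reeval_delta: float = 0.03,
                   action_delta: float = 0.07) -> str:
    """Map observed degradation to the pre-specified response tier."""
    drop = baseline_auc - live_auc
    if drop >= action_delta:
        return "field-action"    # beyond the field-action bound
    if drop >= reeval_delta:
        return "re-evaluate"     # beyond the re-evaluation bound
    return "ok"                  # within validated bounds

print(drift_response(0.94, 0.90))  # re-evaluate
```

The point of the sketch: the tiers and deltas are written down in the risk file before launch, not decided when the dashboard turns red.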
2. Dataset bias. Under-representation of a demographic subgroup in training data produces systematically worse performance on that subgroup in deployment. Not random failure. Predictable, stratified failure. Hazard: inequitable clinical performance producing disparate patient outcomes. Risk control: stratified validation with minimum performance thresholds per safety-relevant subgroup, plus post-market stratified monitoring.
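A stratified acceptance check is mechanically simple once the floors exist. A sketch, with hypothetical subgroup names and an assumed sensitivity floor:

```python
# Hypothetical stratified acceptance check. Subgroup names and the
# floor value are illustrative, not from any real risk file.

def failing_subgroups(metric_by_subgroup: dict, floor: float) -> list:
    """Return subgroups whose metric falls below the minimum
    performance threshold set in the risk analysis."""
    return sorted(g for g, m in metric_by_subgroup.items() if m < floor)

sensitivity = {
    "fitzpatrick_I-II": 0.95,
    "fitzpatrick_III-IV": 0.91,
    "fitzpatrick_V-VI": 0.81,   # under-represented in training data
}
print(failing_subgroups(sensitivity, floor=0.90))  # ['fitzpatrick_V-VI']
```

The hard part is not the check. It's committing to the floors per subgroup before validation starts.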
3. Confidence miscalibration. Many models produce confidence scores that don't correspond to actual probability of correctness. A model confidently wrong is more dangerous than a model tentatively wrong, because high confidence gets accepted without scrutiny. Hazard: clinician over-reliance on high-confidence but incorrect output. Risk control: calibration testing during validation, explicit uncertainty quantification in the UI, and workflow design that routes high-stakes output to human review regardless of confidence.
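Calibration testing during validation usually reduces to comparing stated confidence against empirical accuracy. A pure-Python sketch of expected calibration error (ECE) with standard equal-width binning; the bin count is a choice, not a prescription:

```python
# Expected calibration error (ECE) sketch. Equal-width confidence
# bins; n_bins is a modeling choice, not a regulatory requirement.

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence, then weight each bin's gap
    between mean confidence and empirical accuracy."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / total) * abs(mean_conf - accuracy)
    return ece

# A model that says 0.95 but is right 70% of the time is the
# "confidently wrong" hazard in numeric form:
print(expected_calibration_error([0.95] * 10, [True] * 7 + [False] * 3))
```

A model with low ECE can still be wrong often; calibration testing complements, never replaces, accuracy validation.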
4. Out-of-distribution inputs. Models don't fail gracefully on inputs outside training distribution. An MRI classifier trained on scans from three vendors may produce confident garbage on scans from a fourth. Hazard: normal-appearing confident output on inputs the model wasn't designed for. Risk control: OOD detection that routes suspect inputs to human review, plus explicit scope-of-use labeling.
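The simplest OOD routing baseline thresholds the model's peak softmax score. A sketch, assuming a hypothetical threshold; in practice the threshold is set from validation data, and stronger detectors (energy scores, feature-space distance methods) exist:

```python
# Maximum-softmax-probability baseline for OOD routing. The 0.60
# threshold is hypothetical; a real one comes from validation data.

def route(softmax_scores: list, ood_threshold: float = 0.60) -> str:
    """Low peak confidence is treated as a possible out-of-distribution
    input and routed to human review instead of the automated path."""
    if max(softmax_scores) < ood_threshold:
        return "human-review"
    return "automated"

print(route([0.91, 0.06, 0.03]))   # automated
print(route([0.40, 0.35, 0.25]))   # human-review
```

Note the tension with hazard 3: a miscalibrated model can be confidently wrong on OOD inputs, which is exactly why max-softmax alone is a baseline, not a complete control.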
5. Adversarial or unexpected inputs. Generally lower priority for clinical devices than consumer AI, but worth explicit evaluation for devices with accessible input channels. Includes malicious inputs where relevant and — more common in practice — inputs that fall outside intended use in ways operators may not recognize (pediatric input into an adult-only device, for example).
A sixth category worth naming for LLM-based components: hallucination. Confident generation of false information. Not a bug. A structural property. Risk control options: retrieval-augmented generation with citation requirements, reviewer-agent architectures, and workflow design that requires clinician sign-off before clinical use of any LLM output.
Risk estimation — where ML and traditional components diverge
Severity estimation is unchanged. Clinical consequences of failure are what they are. Probability estimation is where ML requires a different approach.
For traditional components, probability comes from failure rate data, reliability analysis, and physical testing. For ML, it comes from statistical performance characterization — false positive rate, false negative rate, stratified performance across the intended use population.
Aggregate metrics mislead. A model with 96% overall accuracy that collapses to 78% on a 15% demographic subgroup isn't a model with 96% accuracy. It's a model with 96% accuracy for the majority and an unaddressed hazard for the minority. Your probability estimates have to be stratified along the dimensions that matter clinically — age, sex, disease prevalence, image acquisition characteristics, whatever the risk analysis identifies.
Practical rule I've been using: if the validation dataset doesn't let you compute stratified performance for every subgroup named in the hazard analysis, your probability estimates are not adequate. Go back. Get stratified evidence. Don't write the probability estimate from the aggregate metric.
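The rule is checkable before validation even runs: does the validation set contain enough samples in every subgroup the hazard analysis names? A sketch, with illustrative subgroup names and an assumed minimum sample size:

```python
# Sketch of the rule: every subgroup named in the hazard analysis must
# be computable from the validation set. Names and min_n are
# illustrative assumptions.

def uncovered_subgroups(hazard_subgroups, validation_counts, min_n=100):
    """Return hazard-analysis subgroups the validation set cannot
    support: absent, or below the minimum sample size for a
    stratified estimate."""
    return [g for g in hazard_subgroups
            if validation_counts.get(g, 0) < min_n]

named = ["adult", "pediatric", "fitzpatrick_V-VI"]
counts = {"adult": 4200, "fitzpatrick_V-VI": 37}
print(uncovered_subgroups(named, counts))
# ['pediatric', 'fitzpatrick_V-VI']
```

If this returns a non-empty list, the probability estimates for those hazards are placeholders, not evidence.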
Risk controls for AI — a workable hierarchy
ISO 14971's risk control hierarchy applies: inherent safety by design first, then protective measures, then information for safety. For AI/ML, each layer has specific levers.
Inherent safety by design. Scope the model's role to advisory rather than autonomous. Require clinician confirmation for high-consequence outputs. Reject or escalate inputs the model wasn't designed for. Choose model architecture and training objectives that make the failure modes less likely — not just more accurate.
Protective measures. OOD detection. Confidence thresholds that route low-confidence output to human review. Canary deployment with staged rollout. Monitored performance with automatic halt triggers if drift exceeds predefined bounds. These are the software-level controls that catch failures the design didn't prevent.
Information for safety. Labeling that names the validated population, documented performance stratified by subgroup, clear statements about what the device is not validated for. Usability engineering that surfaces uncertainty to the clinician in interpretable form. Training materials that communicate the failure modes operators need to watch for.
Most AI submissions I see rely too heavily on the third layer — labeling and training — to compensate for insufficient investment in the first two. That pattern doesn't survive review. Inherent safety design and protective measures have to carry the weight. Labeling is the last mile.
Post-market monitoring as structural risk control
For deterministic devices, residual risk is bounded at release by the controls implemented during design. For AI/ML devices, residual risk changes over time as the deployment environment shifts. Post-market performance monitoring isn't just a regulatory obligation. It's a structural risk control, and the risk management file has to treat it that way.
Your risk management file should specify, for each AI-specific hazard: which performance metrics are tracked, what thresholds trigger re-evaluation, what thresholds trigger field action, how the data is collected from deployment, how the feedback loop updates the hazard analysis.
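One way to keep that specification auditable is to hold it as structured data rather than prose. A minimal sketch; the field names, hazards, and threshold values are all hypothetical:

```python
# Hypothetical structure for a per-hazard monitoring spec. Field
# names and values are illustrative, not a prescribed schema.

from dataclasses import dataclass

@dataclass(frozen=True)
class HazardMonitoringSpec:
    hazard: str                    # AI-specific hazard from the risk file
    metric: str                    # tracked performance metric
    reeval_threshold: float        # below this: trigger re-evaluation
    field_action_threshold: float  # below this: trigger field action
    data_source: str               # how deployment data is collected

plan = [
    HazardMonitoringSpec(
        hazard="distribution_shift",
        metric="rolling_90d_sensitivity",
        reeval_threshold=0.90,
        field_action_threshold=0.85,
        data_source="deployment telemetry with clinical follow-up",
    ),
]
print(plan[0].hazard)  # distribution_shift
```

Structured specs like this are also what makes the single-source PCCP approach in the next paragraph mechanically possible: one record, referenced twice.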
The monitoring plan is also the foundation of your PCCP. The performance thresholds that trigger re-evaluation in your risk analysis are the same thresholds that bound your modification protocol. Writing them once, in a live record, is the only sustainable approach. Writing them twice — once in the risk file, once in the PCCP — creates the inconsistency that produces AI requests during review.
Validation for ML — what's different, concretely
Four things separate ML validation from traditional software validation.
- Test set independence. Not just held-out. Independently collected. Independently labeled. No information leakage from development. Sounds obvious. Gets violated routinely because dataset curation teams aren't structurally separated from model development teams.
- Clinical representativeness. Test population has to represent intended use population. A model validated on a population that doesn't match your intended users has unknown performance for those users. "Our test set was diverse" is not the answer to the question being asked.
- Failure-mode-specific analysis. Validation protocol includes explicit evaluation of each failure mode named in the risk analysis. If distribution shift is a named hazard, validation includes cross-site and temporal stratification. If dataset bias is named, validation includes demographic stratification. If OOD is named, validation includes deliberately OOD inputs.
- Human factors integration. For decision-support devices, validation includes how clinicians interact with output, not just whether the output is correct in isolation. This is where IEC 62366 and ISO 14971 intersect structurally for AI, and it's where most submissions underinvest.
Connecting risk analysis to the engineering record
The most common failure isn't in hazard identification. It's in the disconnect between the risk analysis and the design elements implementing the controls.
A risk analysis that names distribution shift as a hazard and specifies post-market monitoring as a control is only valuable if that monitoring system is actually built, its thresholds are documented as requirements, those requirements are implemented as design elements, and those design elements are verified against specific tests. If any link in the chain is missing, the risk control exists on paper only.
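That chain is mechanically checkable. A minimal sketch of a trace-completeness check, using hypothetical identifiers and a deliberately simplified flat mapping (real traceability tooling models richer, many-to-many links):

```python
# Minimal trace-chain completeness check. IDs and the flat mapping
# structure are hypothetical simplifications.

def incomplete_controls(controls, req_for, elem_for, test_for):
    """Return risk controls whose chain
    control -> requirement -> design element -> verification test
    is broken at any link."""
    broken = []
    for control in controls:
        req = req_for.get(control)
        elem = elem_for.get(req) if req else None
        test = test_for.get(elem) if elem else None
        if test is None:
            broken.append(control)
    return broken

controls = ["ood_detection", "drift_monitoring"]
req_for = {"ood_detection": "REQ-041"}   # drift_monitoring never linked
elem_for = {"REQ-041": "DE-107"}
test_for = {"DE-107": "VT-233"}
print(incomplete_controls(controls, req_for, elem_for, test_for))
# ['drift_monitoring']
```

Every name that comes back is a risk control that exists on paper only.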
MANKAIND connects risk controls to requirements and design decisions from the moment they're identified. When the risk analysis specifies OOD detection, that control becomes a requirement in the engineering record, linked to the design element implementing it and the verification tests confirming it works. The AI risk analysis isn't a separate document. It's a live layer of the design history, connected to the engineering decisions the team makes every day.
That's the difference between a risk file that passes review and a risk file that becomes the subject of an eleven-page AI request.
See how MANKAIND handles this
30-minute demo. Bring your hardest design controls question.