Module 5

Failure Cases: What Breaks in Real Devices

Study production failure patterns and the preventive controls teams should have implemented earlier.

Read: 4 min | Avg understanding: 10 min

Learning objectives

  • Diagnose common secure boot failures with root-cause thinking
  • Connect provisioning mistakes to field-level outages
  • Build prevention controls into manufacturing and OTA lifecycle

Failure cases from production programs

These failures are common not because teams are careless, but because secure boot spans manufacturing, firmware, DevOps, and field support.

Case 1: Signature mismatch in field

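  • What happened: The OTA update installs, but the device fails to boot the new image.
  • Root cause: The packaging step changed artifact bytes after signing, so the signature no longer matches.
  • Impact: The updated slot fails to boot; at fleet scale this becomes an incident.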
  • Prevention: Immutable post-sign artifacts + reproducible package checks.
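
A reproducible package check can be sketched as follows. This is a minimal illustration, assuming the OTA package is a ZIP archive and that the SHA-256 digest of the signed artifact was recorded at signing time; the function name `packaged_artifact_matches` and the member name `firmware.bin` are hypothetical.

```python
import hashlib
import io
import zipfile


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def packaged_artifact_matches(package_bytes: bytes, member: str, signed_digest: str) -> bool:
    """Return True only if the packaged member is byte-identical to what was signed."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as pkg:
        return sha256_hex(pkg.read(member)) == signed_digest


def build_package(payload: bytes) -> bytes:
    """Build an in-memory ZIP package (stand-in for a real packaging step)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as pkg:
        pkg.writestr("firmware.bin", payload)
    return buf.getvalue()


firmware = b"\x7fFIRMWARE-IMAGE-BYTES"
signed_digest = sha256_hex(firmware)  # recorded when the artifact was signed

good_package = build_package(firmware)
# Simulate a packaging step that appends a trailing newline after signing.
bad_package = build_package(firmware + b"\n")

print(packaged_artifact_matches(good_package, "firmware.bin", signed_digest))
print(packaged_artifact_matches(bad_package, "firmware.bin", signed_digest))
```

Running a check like this as a release gate, after packaging and before publishing, catches the byte drift before any device sees it.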

Case 2: Wrong key enrolled at manufacturing

  • What happened: Devices reject all release images.
  • Root cause: Provisioning script used staging key hash.
  • Impact: Bricked production batch unless recovery override exists.
  • Prevention: Two-person provisioning approval + device-side readback verification before shipment.
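
Device-side readback verification might look like the sketch below. It assumes the device stores a SHA-256 hash of the root public key and that the factory station can read that hash back; how the readback happens is hardware-specific and not shown. The key byte strings and the function names are placeholders.

```python
import hashlib
import hmac


def key_hash(public_key_bytes: bytes) -> bytes:
    """SHA-256 hash of a public key, as it might be stored in fuses or OTP."""
    return hashlib.sha256(public_key_bytes).digest()


def readback_ok(fused_hash: bytes, expected_production_hash: bytes) -> bool:
    """Compare the hash read back from the device with the approved production hash."""
    return hmac.compare_digest(fused_hash, expected_production_hash)


# Placeholder key material, for illustration only.
PROD_KEY = b"production-public-key-placeholder"
STAGING_KEY = b"staging-public-key-placeholder"

expected = key_hash(PROD_KEY)

# A correctly provisioned unit passes; a unit fused with the staging hash is held back.
print(readback_ok(key_hash(PROD_KEY), expected))
print(readback_ok(key_hash(STAGING_KEY), expected))
```

The point of the design is that the check reads the value from the device itself rather than trusting the provisioning script's log.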

Case 3: Developer key shipped to production

  • What happened: Attackers sign malicious image with leaked dev key.
  • Root cause: Boot policy accepted debug keyring in production mode.
  • Impact: Secure boot bypass by policy, not cryptography.
  • Prevention: Lifecycle-gated key policy and production key allowlist only.
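
One way to express a lifecycle-gated key policy is shown below. This is a sketch with made-up lifecycle states and key identifiers; a real boot ROM or bootloader would read the lifecycle state from hardware rather than take it as a parameter.

```python
from enum import Enum


class Lifecycle(Enum):
    DEVELOPMENT = "development"
    PRODUCTION = "production"


# Hypothetical key identifiers.
PRODUCTION_KEYS = frozenset({"prod-key-1"})
DEBUG_KEYS = frozenset({"dev-key"})


def trusted_keys(state: Lifecycle) -> frozenset:
    """Return the keyring allowed in the given lifecycle state."""
    if state is Lifecycle.PRODUCTION:
        # Production mode: allowlist only, the debug keyring is never merged in.
        return PRODUCTION_KEYS
    return PRODUCTION_KEYS | DEBUG_KEYS


def key_is_trusted(key_id: str, state: Lifecycle) -> bool:
    return key_id in trusted_keys(state)


print(key_is_trusted("dev-key", Lifecycle.DEVELOPMENT))
print(key_is_trusted("dev-key", Lifecycle.PRODUCTION))
```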

Case 4: Recovery image not protected

  • What happened: Forced recovery boots unsigned payload.
  • Root cause: Recovery path exempted from secure checks.
  • Impact: Full compromise with minimal sophistication.
  • Prevention: Recovery path must enforce same or stricter signature policy.
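
The idea of "no exemption for recovery" can be sketched as a single verification call shared by both boot paths. Everything here is illustrative: `image_to_boot` is a hypothetical name, and `toy_verify` compares against a SHA-256 digest purely as a stand-in. It is not a signature scheme; a real device would verify an asymmetric signature against a provisioned public key.

```python
import hashlib
import hmac
from typing import Callable, Dict

Verifier = Callable[[bytes, bytes], bool]


def image_to_boot(mode: str, images: Dict[str, bytes],
                  signatures: Dict[str, bytes], verify: Verifier) -> bytes:
    """Return the image for the requested mode, or raise if it fails verification."""
    if mode not in ("normal", "recovery"):
        raise ValueError(f"unknown boot mode: {mode}")
    image = images[mode]
    # No exemption branch: recovery goes through the same check as normal boot.
    if not verify(image, signatures.get(mode, b"")):
        raise PermissionError(f"{mode} image failed verification")
    return image


def toy_verify(image: bytes, signature: bytes) -> bool:
    """Stand-in verifier (digest comparison only, NOT a real signature check)."""
    return hmac.compare_digest(hashlib.sha256(image).digest(), signature)
```

With this shape, an unsigned recovery payload is rejected for the same reason an unsigned normal image would be.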

Case 5: Kernel verified, DTB not verified

  • What happened: Malicious DTB reconfigures boot behavior/security assumptions.
  • Root cause: Incomplete trust chain.
  • Impact: Kernel starts in unsafe hardware/policy state.
  • Prevention: Include the DTB and initramfs in the signed FIT (Flattened Image Tree) image and verify every component before boot.
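
A completeness check for the trust chain could look like this sketch. It is not the U-Boot FIT API; it assumes the per-component hashes in `signed_hashes` come from a configuration whose signature has already been verified, and the component names are illustrative.

```python
import hashlib
from typing import Dict, List

# Components that must all be covered before the kernel is started (assumed set).
REQUIRED = ("kernel", "fdt", "ramdisk")


def verify_components(loaded: Dict[str, bytes], signed_hashes: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means every required component matched."""
    problems = []
    for name in REQUIRED:
        if name not in signed_hashes:
            problems.append(f"{name}: not covered by the signed configuration")
        elif name not in loaded:
            problems.append(f"{name}: missing from the loaded image")
        elif hashlib.sha256(loaded[name]).hexdigest() != signed_hashes[name]:
            problems.append(f"{name}: hash mismatch")
    return problems


loaded = {"kernel": b"kernel-bytes", "fdt": b"dtb-bytes", "ramdisk": b"initramfs-bytes"}
full_cover = {name: hashlib.sha256(data).hexdigest() for name, data in loaded.items()}
# The failure case above: only the kernel is covered.
kernel_only = {"kernel": full_cover["kernel"]}

print(verify_components(loaded, full_cover))
print(verify_components(loaded, kernel_only))
```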

Case 6: OTA interrupted halfway

  • What happened: Power loss during update leaves unusable active slot.
  • Root cause: Non-atomic update strategy.
  • Impact: Boot loops or manual recovery operations.
  • Prevention: A/B slots + transactional state markers + rollback-safe boot selection.
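
Rollback-safe boot selection with state markers can be sketched as below. The field names (`bootable`, `successful`, `tries_left`) are assumptions chosen for illustration, not any particular bootloader's metadata format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Slot:
    name: str
    bootable: bool    # set only after the image is fully written and verified
    successful: bool  # set once the slot has booted and been confirmed healthy
    tries_left: int   # remaining boot attempts for an unconfirmed slot


def choose_slot(preferred: Slot, fallback: Slot) -> Optional[Slot]:
    """Pick the first usable slot, spending one try on slots not yet confirmed."""
    for slot in (preferred, fallback):
        if slot.bootable and (slot.successful or slot.tries_left > 0):
            if not slot.successful:
                slot.tries_left -= 1  # persisted before jumping to the image
            return slot
    return None  # nothing usable: enter recovery


known_good = Slot("A", bootable=True, successful=True, tries_left=0)
# Power was lost mid-write, so the bootable marker was never set on B.
interrupted = Slot("B", bootable=False, successful=False, tries_left=3)

print(choose_slot(interrupted, known_good).name)
```

Because the `bootable` marker is written last, an interrupted update leaves the new slot ineligible and the device keeps booting the known-good slot.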

Case 7: Secure boot enabled too late

  • What happened: Team adds secure boot near release and breaks factory/service flow.
  • Root cause: No early architecture alignment.
  • Impact: Delays, brittle exceptions, risky policy shortcuts.
  • Prevention: Define lifecycle states and provisioning plan at platform kickoff.

Case 8: Debug UART/JTAG left open

  • What happened: Boot interruption and memory inspection in production device.
  • Root cause: Factory debug policy not transitioned to production lock.
  • Impact: Confidentiality and integrity compromise.
  • Prevention: Hardware lock bits + tested secure lifecycle transition.

Case 9: Inconsistent key rotation strategy

  • What happened: Some devices trust new key, others don't.
  • Root cause: Non-versioned key manifest and uneven rollout.
  • Impact: Fleet split-brain and update failures.
  • Prevention: Versioned key hierarchy + staged rotation plan + backward compatibility window.
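
A versioned key manifest with a compatibility window might be checked like this. The rule that a new manifest must keep at least one currently trusted key is an assumption of this sketch, as are the type and key names; authenticating the manifest itself is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeyManifest:
    version: int
    trusted_key_ids: frozenset


def accept_manifest(current: KeyManifest, candidate: KeyManifest) -> bool:
    """Accept only a strictly newer manifest that overlaps with the current keyring."""
    if candidate.version <= current.version:
        return False  # stale or replayed manifest
    # Compatibility window: keep at least one key the device already trusts.
    return bool(current.trusted_key_ids & candidate.trusted_key_ids)


v1 = KeyManifest(1, frozenset({"key-2023"}))
v2 = KeyManifest(2, frozenset({"key-2023", "key-2024"}))  # staged: old and new
v3 = KeyManifest(3, frozenset({"key-2024"}))              # old key retired

print(accept_manifest(v1, v2))
print(accept_manifest(v2, v3))
print(accept_manifest(v1, v3))
```

A device still on version 1 rejects version 3 directly, which is why the staged rollout has to move the whole fleet through version 2 before the old key is retired.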

Practical takeaway

Most outages come from lifecycle and policy gaps, not signature algorithm weakness.

Misconception to correct

"Once secure boot is enabled, we are done."
Secure boot is an ongoing operational discipline across manufacturing, updates, and incident response.