Module 5

Failure Cases: What Breaks in Real Devices

Study production failure patterns and the preventive controls teams should have implemented earlier.

Learning objectives

Diagnose common secure boot failures with root-cause thinking
Connect provisioning mistakes to field-level outages
Build prevention controls into manufacturing and OTA lifecycle

Failure cases from production programs

These failures are common not because teams are careless, but because secure boot cuts across manufacturing, firmware, DevOps, and field support.

Case 1: Signature mismatch in field

What happened: An OTA update installed, but the device failed to boot the new image.
Root cause: The packaging step changed the artifact bytes after signing.
Impact: Boot failure in the updated slot and a fleet-wide incident.
Prevention: Keep artifacts immutable after signing and check that the packaged bytes reproduce the signed ones.
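A post-packaging check for Case 1 might look like the following minimal sketch. It assumes the signing step records a SHA-256 digest of the exact bytes it signed. The function names and the sample payload are hypothetical, and a real pipeline would also verify the signature itself.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def packaged_matches_signed(signed_digest: str, packaged_payload: bytes) -> bool:
    """True only if the packaged payload is byte-identical to what was signed."""
    return hmac.compare_digest(signed_digest, sha256_hex(packaged_payload))


# Hypothetical artifact and the digest recorded at signing time.
signed_image = b"firmware-v1.2.0"
digest_at_signing = sha256_hex(signed_image)

# A packaging step that appends padding changes the bytes after signing.
repacked_image = signed_image + b"\x00" * 16

print(packaged_matches_signed(digest_at_signing, signed_image))    # True
print(packaged_matches_signed(digest_at_signing, repacked_image))  # False
```

Running this check as a release gate, after packaging and before publishing, catches the mismatch in the build pipeline rather than in the field.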

Case 2: Wrong key enrolled at manufacturing

What happened: Devices reject all release images.
Root cause: Provisioning script used staging key hash.
Impact: Bricked production batch unless recovery override exists.
Prevention: Two-person provisioning approval + device-side readback verification before shipment.
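The readback verification for Case 2 can be sketched as below. This assumes the device stores a SHA-256 hash of the root public key and that the factory station can read that hash back. The key bytes and function names are placeholders, not a real provisioning API.

```python
import hashlib
import hmac


def key_hash(public_key_bytes: bytes) -> bytes:
    """Hash of the public key, in the form assumed to be fused into the device."""
    return hashlib.sha256(public_key_bytes).digest()


def readback_ok(device_readback: bytes, production_pubkey: bytes) -> bool:
    """Compare the hash read back from the device with the production key hash."""
    return hmac.compare_digest(device_readback, key_hash(production_pubkey))


# Placeholder key material for illustration only.
production_pubkey = b"production-root-public-key"
staging_pubkey = b"staging-root-public-key"

# A correctly provisioned unit passes; a unit fused with the staging hash fails.
print(readback_ok(key_hash(production_pubkey), production_pubkey))  # True
print(readback_ok(key_hash(staging_pubkey), production_pubkey))     # False
```

The point of the design is that the expected value comes from the release key, independently of the provisioning script, so a script pointed at the wrong key cannot also pass its own check.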

Case 3: Developer key shipped to production

What happened: Attackers signed a malicious image with a leaked developer key, and production devices accepted it.
Root cause: Boot policy accepted the debug keyring in production mode.
Impact: Secure boot was bypassed through policy, without breaking any cryptography.
Prevention: Gate the accepted keyring on lifecycle state and allow only production keys in production mode.
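A lifecycle-gated key policy for Case 3 could be modeled as in this sketch. The lifecycle states and key identifiers are hypothetical. On a real device the state would come from fuses or a secure element, not from a software variable.

```python
from enum import Enum


class Lifecycle(Enum):
    DEVELOPMENT = "development"
    PRODUCTION = "production"


# Hypothetical key IDs. Production trusts the production key and nothing else.
ALLOWED_KEYS = {
    Lifecycle.DEVELOPMENT: frozenset({"dev-key-1", "prod-key-1"}),
    Lifecycle.PRODUCTION: frozenset({"prod-key-1"}),
}


def key_is_trusted(state: Lifecycle, key_id: str) -> bool:
    """Accept a signing key only if the current lifecycle state allows it."""
    return key_id in ALLOWED_KEYS.get(state, frozenset())


print(key_is_trusted(Lifecycle.DEVELOPMENT, "dev-key-1"))  # True
print(key_is_trusted(Lifecycle.PRODUCTION, "dev-key-1"))   # False
print(key_is_trusted(Lifecycle.PRODUCTION, "prod-key-1"))  # True
```

With this shape, a leaked developer key is useless against a unit in the production state, because the policy lookup rejects it before any signature check runs.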

Case 4: Recovery image not protected

What happened: Forced recovery boots unsigned payload.
Root cause: Recovery path exempted from secure checks.
Impact: Full compromise with minimal sophistication.
Prevention: Recovery path must enforce same or stricter signature policy.

Case 5: Kernel verified, DTB not verified

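What happened: A malicious DTB reconfigured boot behavior and invalidated the kernel's security assumptions.
Root cause: Incomplete trust chain: the kernel was verified, but the DTB it depends on was not.
Impact: The kernel starts in an unsafe hardware or policy state.
Prevention: Include the DTB and initramfs in the signed FIT image and verify every component before boot.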
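The "verify all" rule from Case 5 can be illustrated with a simplified model. This is not FIT tooling: it assumes the signed configuration reduces to a table of expected SHA-256 digests per component, and the component names and payloads are hypothetical.

```python
import hashlib

# Every component that influences boot must be covered, not just the kernel.
REQUIRED_COMPONENTS = ("kernel", "dtb", "initramfs")


def unverified_components(loaded: dict[str, bytes],
                          signed_digests: dict[str, str]) -> list[str]:
    """Return components that are missing, unsigned, or do not match their digest."""
    failed = []
    for name in REQUIRED_COMPONENTS:
        data = loaded.get(name)
        expected = signed_digests.get(name)
        if data is None or expected is None:
            failed.append(name)
        elif hashlib.sha256(data).hexdigest() != expected:
            failed.append(name)
    return failed


# Placeholder payloads standing in for real boot components.
images = {"kernel": b"kernel-bytes", "dtb": b"dtb-bytes", "initramfs": b"initramfs-bytes"}
signed = {name: hashlib.sha256(data).hexdigest() for name, data in images.items()}

tampered = dict(images, dtb=b"malicious-dtb")

print(unverified_components(images, signed))    # []
print(unverified_components(tampered, signed))  # ['dtb']
```

Boot should proceed only when the returned list is empty. A component that has no signed digest at all is treated as a failure, which is the gap this case describes.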

Case 6: OTA interrupted halfway

What happened: Power loss during update leaves unusable active slot.
Root cause: Non-atomic update strategy.
Impact: Boot loops or manual recovery operations.
Prevention: A/B slots + transactional state markers + rollback-safe boot selection.
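The slot selection logic for Case 6 might be sketched as follows. The marker names (`complete`, `confirmed`, `tries_left`) are assumptions made for illustration. Real bootloaders keep equivalent state in persistent storage and update it atomically.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    name: str
    complete: bool    # update fully written and verified
    confirmed: bool   # slot has booted successfully and been marked good
    tries_left: int   # remaining boot attempts for an unconfirmed slot


def bootable(slot: Slot) -> bool:
    """A slot is bootable if fully written and either confirmed or still on trial."""
    return slot.complete and (slot.confirmed or slot.tries_left > 0)


def select_slot(preferred: Slot, fallback: Slot) -> Slot:
    """Pick the preferred slot if bootable, otherwise roll back to the fallback."""
    for slot in (preferred, fallback):
        if bootable(slot):
            if not slot.confirmed:
                slot.tries_left -= 1  # consume one trial boot
            return slot
    raise RuntimeError("no bootable slot; enter recovery")


# Power was lost while writing slot B, so its 'complete' marker was never set.
slot_a = Slot("A", complete=True, confirmed=True, tries_left=0)
slot_b = Slot("B", complete=False, confirmed=False, tries_left=3)

print(select_slot(slot_b, slot_a).name)  # A
```

Because the `complete` marker is written last, an interrupted update never leaves a half-written slot selectable, and the trial counter bounds boot loops on an image that was written correctly but fails to start.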

Case 7: Secure boot enabled too late

What happened: Team adds secure boot near release and breaks factory/service flow.
Root cause: No early architecture alignment.
Impact: Delays, brittle exceptions, risky policy shortcuts.
Prevention: Define lifecycle states and provisioning plan at platform kickoff.

Case 8: Debug UART/JTAG left open

What happened: An attacker interrupted boot and inspected memory on a production device through an open debug port.
Root cause: Factory debug policy not transitioned to production lock.
Impact: Confidentiality and integrity compromise.
Prevention: Hardware lock bits + tested secure lifecycle transition.

Case 9: Inconsistent key rotation strategy

What happened: Some devices trust new key, others don't.
Root cause: Non-versioned key manifest and uneven rollout.
Impact: Fleet split-brain and update failures.
Prevention: Versioned key hierarchy + staged rotation plan + backward compatibility window.
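A versioned manifest and a retirement gate for Case 9 could look like this sketch. The manifest fields, key identifiers, and device names are hypothetical. The idea is that devices only move forward in manifest version, and the old key is retired only after every device reports the new version.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeyManifest:
    version: int
    trusted_key_ids: frozenset


def accept_manifest(current: KeyManifest, candidate: KeyManifest) -> bool:
    """Devices accept only strictly newer manifests, so the trust set never rolls back."""
    return candidate.version > current.version


def devices_blocking_retirement(fleet_versions: dict[str, int],
                                new_version: int) -> list[str]:
    """Devices still below the new manifest version; the old key must stay valid for them."""
    return sorted(dev for dev, ver in fleet_versions.items() if ver < new_version)


v1 = KeyManifest(1, frozenset({"key-2023"}))
# Compatibility window: version 2 trusts both the old and the new key.
v2 = KeyManifest(2, frozenset({"key-2023", "key-2025"}))

fleet = {"dev-001": 2, "dev-002": 1, "dev-003": 2}

print(accept_manifest(v1, v2))                # True
print(accept_manifest(v2, v1))                # False
print(devices_blocking_retirement(fleet, 2))  # ['dev-002']
```

Signing releases with the old key, or dual-signing, until the blocking list is empty is what prevents the fleet from splitting into devices that can and cannot update.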

Practical takeaway

Most outages come from lifecycle and policy gaps, not signature algorithm weakness.

Misconception to correct

"Once secure boot is enabled, we are done."
Secure boot is an ongoing operational discipline across manufacturing, updates, and incident response.
