Designing an ESP32 OTA Firmware Update System for a Fleet

GizanTech Engineering | Industrial Firmware Team | Published June 16, 2026 | 13 min read

The difference between an OTA demo and an OTA system

Every ESP32 tutorial shows the same thing: a device downloads a .bin over Wi-Fi and reboots into it. That demo works on the bench and falls apart in the field, because the failures that matter at fleet scale never appear with one device on your desk. A flash that gets interrupted by a brownout. An image that boots but is subtly broken. A rollout that looks fine on device one and bricks devices four hundred through nine hundred. A signing key you can no longer revoke. A cellular plan that can't afford a full-image download per release across ten thousand units.

A production OTA system is the set of design choices that turn each of those failures from a recall into a non-event. This post walks the six decisions we make when we design OTA for a fleet, and ties every one to the specific field failure it prevents. We are opinionated where mistakes are irreversible — partition layout and signing — and neutral on vendor, because the same shape works on AWS IoT, Azure, or your own backend.

Decision 1 — Partition layout: A/B slots, never in-place

The first decision is where the new image lands, and it is the one that decides whether an interrupted update bricks a unit. The answer is an A/B (dual-slot) OTA partition scheme: two app partitions of equal size, an active slot running now and an inactive slot you write the new image into. The device boots from one, downloads into the other, and only switches the boot pointer once the new image is fully written and verified.

The failure this prevents is the half-flashed brick. If you overwrite the running application in place and power drops mid-write — a brownout, a yanked connector, a watchdog reset — you are left with a partial image and no application to boot. With A/B slots the running image is never touched during download, so an interrupted update simply leaves the old, working slot in charge; the device reboots into exactly what it ran before and tries again later. The cost is flash: you size the part for two app slots plus the bootloader, NVS, and factory data. That is the price of admission for safe updates, and it is non-negotiable on anything you can't physically reach.

On the ESP32 this maps directly onto the standard two-OTA partition table with an otadata partition that records which slot is active and whether the current image is still on probation.
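A layout like this can be sanity-checked before it is frozen. The following is a minimal sketch, not a replacement for the ESP-IDF partition tooling: the offsets, sizes, and 4 MiB flash size are illustrative assumptions, and `Partition` and `check_layout` are hypothetical names. It checks that partitions do not overlap, that the layout fits in flash, and that the two OTA slots are equal in size and 64 KiB aligned.

```python
from dataclasses import dataclass

APP_ALIGN = 0x10000  # ESP-IDF requires app partitions to start on 64 KiB boundaries


@dataclass(frozen=True)
class Partition:
    name: str
    offset: int
    size: int

    @property
    def end(self) -> int:
        return self.offset + self.size


# Illustrative layout for a 4 MiB part; every offset and size is an assumption.
LAYOUT = [
    Partition("nvs",      0x009000, 0x004000),
    Partition("otadata",  0x00D000, 0x002000),
    Partition("phy_init", 0x00F000, 0x001000),
    Partition("ota_0",    0x010000, 0x1E0000),
    Partition("ota_1",    0x1F0000, 0x1E0000),
]


def check_layout(parts, flash_size):
    """Return a list of problems; an empty list means the layout passes."""
    problems = []
    ordered = sorted(parts, key=lambda p: p.offset)
    for prev, cur in zip(ordered, ordered[1:]):
        if cur.offset < prev.end:
            problems.append(f"{cur.name} overlaps {prev.name}")
    if ordered[-1].end > flash_size:
        problems.append("layout exceeds flash size")
    slots = [p for p in parts if p.name.startswith("ota_")]
    if len(slots) < 2:
        problems.append("need two OTA app slots")
    elif len({p.size for p in slots}) != 1:
        problems.append("OTA slots differ in size")
    for p in slots:
        if p.offset % APP_ALIGN:
            problems.append(f"{p.name} is not 64 KiB aligned")
    return problems
```

Running `check_layout(LAYOUT, 4 * 1024 * 1024)` returns an empty list for this layout. The same layout against a 2 MiB part reports that it exceeds flash, which is the sizing cost described above made concrete.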

Decision 2 — Image signing: a verified boot chain, one revocable key

The second decision is what the device is willing to boot, and it is the one you cannot retrofit safely after units ship. Every image must be cryptographically signed, and the bootloader must verify that signature before it hands over control. The device boots your firmware and only your firmware.

Two failures hide here. The obvious one is a malicious or corrupted image: without signature verification, anyone who can reach the OTA endpoint — or who tampers with an image in transit or at rest — can push code that the device will faithfully execute. Verified boot makes the device reject anything not signed by your key, so a swapped or corrupted .bin fails verification and the old slot stays in charge.

The subtler failure is the un-revocable key. If you sign with a single key and it leaks, your only options are to re-key the whole fleet by hand or live with the compromise. Design for revocation from day one: provision multiple signing-key slots so a compromised key can be retired and a successor trusted without touching every device, and keep the private key in an HSM or a hardened CI signing step, never on a laptop or in the repo. The number of key slots and the revocation support differ between chips in the ESP32 family, so confirm what your exact part offers before you commit to the silicon. This decision is permanent because the bootloader enforces it: get it wrong and the fix requires the physical access you built OTA to avoid needing.
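The revocation logic can be illustrated with a small model. This is a sketch under loud assumptions: `TrustStore` and `sign` are hypothetical names, and because Python's standard library has no asymmetric signatures, HMAC-SHA256 stands in for the real public-key check. A real device would hold only public-key digests and never a secret capable of signing.

```python
import hashlib
import hmac

# Stand-in only: HMAC-SHA256 plays the role of the real public-key signature
# check. On real hardware the device never holds a key that can sign.


class TrustStore:
    """Hypothetical model of bootloader key slots with revocation."""

    def __init__(self, keys):
        self.slots = dict(keys)   # slot index -> verification key
        self.revoked = set()

    def revoke(self, slot):
        self.revoked.add(slot)

    def verify(self, image: bytes, slot: int, signature: bytes) -> bool:
        # Reject unknown or revoked slots before any crypto is attempted.
        if slot not in self.slots or slot in self.revoked:
            return False
        expected = hmac.new(self.slots[slot], image, hashlib.sha256).digest()
        return hmac.compare_digest(expected, signature)


def sign(key: bytes, image: bytes) -> bytes:
    """Signing step that would live in CI or an HSM, never on the device."""
    return hmac.new(key, image, hashlib.sha256).digest()
```

With two slots provisioned, revoking slot 0 makes every image signed by that key fail verification, while images signed by the slot 1 key still boot. That is the property that lets a leaked key be retired without visiting devices.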

Decision 3 — Rollout strategy: canary rings, never all-at-once

The third decision is who gets the update and when, and it is the difference between a regression that touches ten devices and one that touches ten thousand. The answer is a staged, canary rollout: ship the new version to a small representative ring first, watch it for a defined soak period, and only widen to larger rings if the canary stays healthy.

The failure this prevents is the fleet-wide regression from a build that passed QA. No lab reproduces every condition your fleet lives in — power quality, RF noise, peripheral board revisions, the one sensor that returns a value your parser never expected. A canary turns those unknowns into a bounded experiment. Pick the canary deliberately:

  1. Start with internal and friendly-customer devices you can physically reach if something goes wrong.
  2. Expand to a small percentage that spans your real hardware revisions and deployment environments, not just the easy ones.
  3. Hold each ring for a soak period long enough to cover at least one full duty cycle of the device, not five minutes.
  4. Gate each widening on health metrics — crash rate, reconnect rate, successful-update rate — staying within bounds, not on a calendar.
  5. Widen geometrically (canary, then a larger ring, then the remainder) so confidence compounds before the mass rollout.

Skip this and your first sign of a bad build is a wave of offline devices — exactly when remediation is hardest.
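The ring steps above can be sketched as two small functions. Everything named here is an assumption for illustration: the ring shares, the health bounds, the pinned device IDs, and the function names. Hashing covers only the percentage-based rings; the hand-picked internal and friendly-customer devices are pinned explicitly, and hashing alone does not guarantee coverage of every hardware revision.

```python
import hashlib

# Hypothetical ring plan: cumulative share of the fleet each ring reaches.
RINGS = [("canary", 0.01), ("early", 0.10), ("broad", 1.00)]

# Hypothetical hand-picked devices you can physically reach.
PINNED = {"lab-001": "canary", "lab-002": "canary"}

# Hypothetical health bounds; tune them against your own fleet baseline.
MAX_CRASH_RATE = 0.02
MIN_UPDATE_SUCCESS = 0.98


def ring_for(device_id: str) -> str:
    """Map a device to a ring: pinned first, else a stable hash bucket."""
    if device_id in PINNED:
        return PINNED[device_id]
    digest = hashlib.sha256(device_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64   # value in [0, 1)
    for name, cumulative in RINGS:
        if bucket < cumulative:
            return name
    return RINGS[-1][0]


def may_widen(crash_rate, update_success, soaked_hours, duty_cycle_hours):
    """Gate widening on health and a full duty cycle, not on the calendar."""
    return (
        soaked_hours >= duty_cycle_hours
        and crash_rate <= MAX_CRASH_RATE
        and update_success >= MIN_UPDATE_SUCCESS
    )
```

Because the bucket is derived from the device ID, a device stays in the same ring across releases, which keeps canary behavior comparable from one rollout to the next.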

Decision 4 — Rollback trigger: the device saves itself

Staging limits how many devices a bad image reaches; rollback decides whether each affected device recovers on its own. The fourth decision is the automatic rollback trigger: a newly flashed image boots on probation, runs a self-test, and must actively confirm it is healthy or it is reverted to the previous slot.

The failure this prevents is the boot-loop or dead-on-arrival image with no way back. An image can pass signature verification — it's authentic — and still be broken: a null-deref on boot, a config it can't parse, a network stack that never comes up. Without a rollback trigger that device is now an offline unit you must visit. With one, the new image gets a single chance to prove itself.

The mechanism on the ESP32 is the rollback-enabled OTA flow: a freshly switched image starts in a pending-verify state, and the application must call the "mark valid" API only after its self-test passes — network up, peripherals responding, critical task scheduled. If the image crashes before it confirms, or the watchdog fires, the bootloader marks the image invalid and boots the previous slot on the next reset. Design the self-test to check what actually matters for that device, and never mark valid from setup() before you've proven the firmware can do its job — a too-eager "mark valid" defeats the whole mechanism.
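The probation flow is easier to reason about as a state machine. The sketch below is a host-side toy model, not ESP-IDF code: `Device`, `ImgState`, and the method names are hypothetical. In ESP-IDF the confirming call is `esp_ota_mark_app_valid_cancel_rollback()`, and the behavior depends on the bootloader being built with `CONFIG_BOOTLOADER_APP_ROLLBACK_ENABLE`.

```python
from enum import Enum


class ImgState(Enum):
    NEW = "new"                        # written, never booted
    PENDING_VERIFY = "pending_verify"  # booted once, not yet confirmed
    VALID = "valid"
    ABORTED = "aborted"                # rebooted before confirming


class Device:
    """Toy model of two slots plus the state an otadata partition tracks."""

    def __init__(self):
        self.active, self.previous = "ota_0", None
        self.state = {"ota_0": ImgState.VALID}

    def switch_to(self, slot):
        """Point the boot selection at a freshly written slot."""
        self.previous, self.active = self.active, slot
        self.state[slot] = ImgState.NEW

    def boot(self):
        """Bootloader step: pick the slot to run on this reset."""
        st = self.state[self.active]
        if st is ImgState.NEW:
            self.state[self.active] = ImgState.PENDING_VERIFY
        elif st is ImgState.PENDING_VERIFY:
            # The previous boot never confirmed: abort it and fall back.
            self.state[self.active] = ImgState.ABORTED
            self.active, self.previous = self.previous, None
        return self.active

    def run_app(self, self_test) -> bool:
        """Application step: confirm only after the self-test passes."""
        if self.state[self.active] is ImgState.PENDING_VERIFY and self_test():
            self.state[self.active] = ImgState.VALID
        return self.state[self.active] is ImgState.VALID
```

In this model a new image whose self-test never passes is still in the pending state at the next reset, so `boot()` returns the previous slot. An image that confirms early, before proving anything, would be marked valid and lose that safety net.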

Decision 5 — Transport: delta updates, full image as fallback

The fifth decision is what travels over the wire, and it is the one that decides whether your release is affordable across the whole fleet. On Wi-Fi with a full-image OTA you may never think about it. On metered cellular or a marginal link, shipping the entire binary to every device on every point release is slow, expensive, and prone to failing mid-download. The answer is delta (differential) updates: ship only the binary patch between the version a device runs and the version it should run.

The failure this prevents is the rollout that stalls on bandwidth or cost. A small point release changes a small fraction of the binary, so the patch is a fraction of the full-image transfer — it downloads faster, costs less per device, and is far less likely to be interrupted on a flaky link, and that saving compounds across every device and release. The device reconstructs the target image from its current slot plus the patch, then verifies the result against the expected hash and signature before it touches the boot pointer. Delta is layered on top of A/B slots and signing, never a replacement: you still write into the inactive slot and still verify the final image, so a corrupt patch is caught and the working slot survives. Keep full-image OTA as the fallback for devices too far behind for a patch to apply.
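The reconstruct-then-verify step can be shown with a toy patch format. This is a sketch only: the copy/data op format and the names `make_patch` and `apply_patch` are invented for illustration, and `difflib` stands in for a production binary-diff tool, which would be far more compact and memory-aware. Signature verification is omitted here and would follow the hash check.

```python
import difflib
import hashlib


def make_patch(old: bytes, new: bytes):
    """Build a toy patch: copy ranges from the old image, or literal bytes."""
    ops = []
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            ops.append(("data", new[j1:j2]))
        # "delete" needs no op: those old bytes are simply not copied.
    return ops, hashlib.sha256(new).hexdigest()


def apply_patch(old: bytes, ops, expected_sha256: str) -> bytes:
    """Rebuild the target image and refuse it unless the hash matches."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += old[op[1]:op[2]]
        else:
            out += op[1]
    if hashlib.sha256(out).hexdigest() != expected_sha256:
        raise ValueError("reconstructed image failed hash check")
    return bytes(out)
```

If a device applies the patch against the wrong base version, the reconstructed bytes differ and `apply_patch` raises instead of returning an image. That is the point at which a real device would leave the boot pointer alone and request the full-image fallback.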

Decision 6 — The update backend: know what every device is running

The sixth decision is the system of record, and it is the one that turns a pile of devices into a manageable fleet. The backend must store signed firmware artifacts, know which version every device is actually running, assign devices to rollout rings, hand each device its update decision, and record the outcome of every attempt.

The failure this prevents is flying blind during a rollout. If you don't know which version each device runs and whether its last update succeeded or rolled back, you cannot tell a healthy canary from a failing one, nor halt a bad rollout before it widens. The backend closes the loop: the device reports its running version and update result, the backend compares that against the target for its ring, and it pauses the rollout automatically when failure rates cross a threshold. AWS IoT Device Management — jobs for staged rollouts, the device shadow for desired-versus-reported version, signed artifacts in storage — implements this cleanly, but the capabilities matter more than the brand: inventory, staged jobs, signed artifacts, and observable per-device outcomes. Without that telemetry, "we updated the fleet" is a hope, not a fact.
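The closed loop described above fits in a few lines when reduced to its essentials. The sketch below is a hypothetical in-memory model, not any vendor's API: the `Rollout` class, its thresholds, and the outcome labels are assumptions. It answers device check-ins, records each attempt, and pauses itself when the failure rate crosses a bound.

```python
from collections import Counter


class Rollout:
    """Hypothetical in-memory system of record for one rollout."""

    def __init__(self, target, active_rings, max_failure_rate=0.05, min_reports=20):
        self.target = target
        self.active_rings = set(active_rings)
        self.max_failure_rate = max_failure_rate
        self.min_reports = min_reports
        self.paused = False
        self.reported = {}          # device_id -> version it says it runs
        self.outcomes = Counter()   # "success" / "rollback" / "failure"

    def decision(self, device_id, ring, running_version):
        """Answer a device check-in: update to target, or stay put (None)."""
        self.reported[device_id] = running_version
        if self.paused or ring not in self.active_rings:
            return None
        return self.target if running_version != self.target else None

    def record(self, device_id, running_version, outcome):
        """Store an attempt result and pause if failures cross the bound."""
        self.reported[device_id] = running_version
        self.outcomes[outcome] += 1
        total = sum(self.outcomes.values())
        failed = total - self.outcomes["success"]
        if total >= self.min_reports and failed / total > self.max_failure_rate:
            self.paused = True
```

The `min_reports` floor is a design choice in this sketch: it keeps one early failure from pausing a rollout before there is enough data to compute a meaningful rate. A real backend would also persist this state and scope the counts per ring.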

The OTA design-decision table

Every decision above has more than one viable option and a default we reach for first. Here is the whole system in one view, with the field failure each default is chosen to prevent.

| Decision | Viable options | Recommended default | Field failure it prevents |
| --- | --- | --- | --- |
| Partition scheme | In-place single-slot; A/B dual-slot; A/B + recovery factory image | A/B dual-slot with otadata, sized for two app slots | A power-interrupted flash leaves a half-written image and bricks the unit |
| Image signing | Unsigned; single-key signing; signed + verified boot with revocable key slots | Signed images with verified boot and multiple revocable key slots | An attacker or a corrupt artifact gets arbitrary code onto the device, or a leaked key can't be revoked |
| Rollout strategy | All-at-once; manual batches; automated canary rings with health gates | Staged canary rings gated on crash/offline metrics | A regression that passed QA bricks the whole fleet instead of ten devices |
| Rollback trigger | None (manual reflash); watchdog-only; pending-verify self-test + auto-revert | Pending-verify self-test that must actively mark the image valid | A boot-looping or dead-on-arrival image with no automatic way back |
| Transport / bandwidth | Full-image over Wi-Fi; full-image over cellular; delta patch + full-image fallback | Delta updates on metered links, full-image as fallback | A rollout that stalls or overruns cost on slow or metered connections |
| Update backend | Hard-coded URL the device polls; bespoke server; managed device-management service | Managed fleet service (e.g. AWS IoT jobs + shadow) with signed artifacts | Flying blind: no per-device version truth, no way to halt a bad rollout |

How the decisions compose into one safe update

Read top to bottom and a single device's update lifecycle falls out, with a safety net at every step:

  1. The backend assigns the device to a rollout ring and offers it a target version only if its ring is active.
  2. The device downloads a delta patch (or full image) into the inactive A/B slot, never touching the running one.
  3. It reconstructs and verifies the image — signature and hash — and rejects anything that fails.
  4. It switches the boot pointer to the new slot in pending-verify state and reboots.
  5. The new image runs its self-test; on success it marks itself valid, on failure the bootloader rolls back to the previous slot.
  6. The device reports its running version and the result, and the backend widens or halts the rollout based on aggregate health.

If anything breaks at any step, the previous working image stays in charge and the fleet keeps running. That is the entire point of building it this way rather than pushing a .bin and hoping.

Where teams get this wrong, and where to get help

Three mistakes recur, and all three are the expensive, hard-to-reverse kind. Teams flash in place to save flash and brick units on the first brownout. They ship unsigned images or sign with one un-revocable key, then discover the cost the day an endpoint is reachable or a key leaks. And they push to the whole fleet at once with no rollback, turning a regression that survived QA into a field recall.

GizanTech designs the OTA system end to end — the partition table and bootloader configuration on the device, the signed-image and rollback flow in firmware, the delta pipeline, and the backend that stages rollouts and tracks every device's real version — as one system rather than a .bin and a download URL bolted on late. If you are about to put an ESP32 fleet in the field and want the partition, signing, and rollout model reviewed before firmware is committed, talk to our engineering team.

Frequently asked questions

How do you roll out an ESP32 firmware update to thousands of devices without bricking the fleet?

You never push one build to the whole fleet at once. Each device flashes into an inactive A/B partition, verifies the image signature before it boots, and runs a self-test on first boot; if the test fails it rolls back to the previous slot automatically. On top of that, the backend releases the update in stages — a small canary group first, then widening rings — and halts the rollout the moment crash or offline rates climb. A bad image then reaches a handful of devices, not all of them, and even those recover themselves.

What is a staged or canary OTA rollout and why does it matter at fleet scale?

A staged rollout ships a new firmware version to a small, representative subset of devices first — the canary — watches their telemetry for a defined soak period, and only widens to larger rings if the canary stays healthy. It matters because no lab test reproduces every real-world condition your fleet sees: power quality, RF environment, peripheral revisions, edge-case sensor data. The canary turns those unknowns into a bounded, observable experiment, so a regression that survived QA is caught at ten devices instead of after it has bricked the field.

How do delta (differential) OTA updates reduce bandwidth and cost?

A delta update ships only the binary difference between the version a device runs and the version it should run, instead of the whole image. For a typical point release that changes a small fraction of the binary, the patch is a fraction of the full-image size, which is decisive on metered cellular or low-bandwidth links where the full download is slow, expensive, or fails outright. The device reconstructs the new image from its current slot plus the patch, verifies the result against the expected signature and hash, and only then commits it.

What backend do you need to manage OTA updates for an ESP32 fleet?

You need a service that stores signed firmware artifacts, knows which version every device is actually running, assigns devices to rollout rings, and exposes per-device update decisions the device polls or is pushed. It must record the result of every attempt — success, rollback, failure — so you can halt a rollout on rising failure rates. AWS IoT Device Management with jobs and the device shadow covers this, but the capabilities matter more than the vendor: inventory, staged jobs, signed artifacts, and observable outcomes.
