
Edge AI & TinyML on the ESP32: On-Device Machine Learning

GizanTech Engineering | Embedded Systems Team | Published June 17, 2026 | 11 min read

Why "AI on a microcontroller" stopped being a stretch

For years the honest answer to "can we run machine learning on the ESP32?" was technically yes, practically no — the classic ESP32 could execute a tiny model, but anything resembling vision was so slow the device timed out before it finished one inference. That changed with the ESP32-S3. It added vector/SIMD instructions that the neural-network kernels actually use, paired them with octal PSRAM, and turned a chip that could pretend to do ML into one that ships real on-device inference. We know because we have shipped it: an ESP32 camera that detects objects locally, decides on the frame, and never sends a pixel to the cloud.

This is a senior-engineer field guide to what genuinely fits on the ESP32, written field-failure-first. The point is not that the ESP32 is a GPU — it is not, and any post that implies otherwise is selling something. The point is that a deliberately small, quantized model running on the device eliminates a cloud round-trip, and for a whole class of products that round-trip was the thing killing latency, privacy, bandwidth, or offline operation. The challenge is making the model small enough to fit without making it too dumb to be useful.

What actually fits on the device

The most common failure in edge-AI scoping is assuming the ESP32 can run whatever model trained well on a laptop. It cannot: a microcontroller has kilobytes-to-low-megabytes of RAM, a fixed flash budget, and a latency target set by a real-world event — a person walking past, a bearing starting to fail, a spoken wake word. The table below is the feasibility map we use when scoping.

| Edge-AI use case | Model approach on ESP32-S3 | Memory/latency reality | What it replaces |
| --- | --- | --- | --- |
| Person / object detection on a camera | Small quantized detector (int8) via ESP-DL or a tiny SSD/MobileNet-class net at low input resolution | Needs PSRAM for the frame buffer + tensors; inference in the tens-to-hundreds-of-ms range at small resolutions; fine for triggering, not for high-FPS tracking | A continuous video upload + cloud detection round-trip; only the event leaves the device |
| Vibration anomaly detection (predictive maintenance) | Tiny autoencoder or classifier on FFT/spectral features, int8, often via Edge Impulse | Model is small (kilobytes); the cost is the DSP feature extraction, not the inference; runs comfortably in real time | Streaming raw accelerometer data to a server and computing the anomaly score in the cloud |
| Keyword / wake-word spotting | Compact CNN or DS-CNN on audio spectrogram features, int8 | Fits in internal SRAM; low single-digit to tens-of-ms inference, always-on within power budget | An always-listening cloud speech service and the privacy/bandwidth cost of streaming audio |
| Visual classification / counting | MobileNet-class classifier at reduced resolution, int8, ESP-DL or TFLite Micro | PSRAM-friendly; latency acceptable for per-event classification, not for every frame at high FPS | A per-image cloud classification API call for every frame the camera captures |

Read the rightmost column carefully, because it is where the value is. In every row, on-device inference is not replacing the cloud's intelligence — it is replacing the round-trip. The local model is smaller and slightly less capable, an acceptable trade because the alternative is streaming raw video or audio off the device and waiting on a network you do not control.

Why the ESP32-S3 is the practical default

The classic ESP32 still has a place — for the smallest audio and sensor models it is adequate, cheaper, and lower power. But for anything vision-shaped the ESP32-S3 is the default, and the reason is concrete rather than marketing. The S3's vector/SIMD instructions let the convolution and matrix kernels in ESP-DL and TFLite Micro process multiple values per cycle — exactly the inner loop a neural network spends its time in. On the classic ESP32 those same kernels run scalar, and a detection model that is usable on the S3 becomes too slow to ship on the classic part.

The second half is memory. A camera frame, a model's intermediate tensors, and the connectivity stack all want RAM at once, and the S3's octal PSRAM is what gives a frame buffer and a detector room to coexist instead of competing for scarce internal SRAM. The field failure this prevents looks fine on the bench and dies in deployment: a model that fits at idle but collides with the frame and Wi-Fi buffers under load, surfacing as an allocation failure days into a deployment. Size the worst-case concurrent RAM (frame buffer plus tensor arena plus connectivity) before you commit to the part.
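A minimal sketch of that sizing exercise, run on a laptop during scoping rather than on the device. Every figure below (resolution, double buffering, arena size, connectivity reserve, the 8 MB PSRAM part, the 20% headroom) is an illustrative assumption, not a measurement from our product; replace them with numbers measured on your own build.

```python
from dataclasses import dataclass


@dataclass
class RamBudget:
    """Worst-case concurrent RAM demand in bytes. All values are placeholders."""
    frame_width: int
    frame_height: int
    bytes_per_pixel: int        # e.g. 2 for RGB565, 1 for grayscale
    frame_buffers: int          # camera pipelines are often double-buffered
    tensor_arena: int           # take this from the runtime, do not guess it
    connectivity: int           # Wi-Fi/TLS buffers under load (assumed figure)
    headroom_percent: int = 20  # margin for everything not listed here

    def frame_bytes(self) -> int:
        return (self.frame_width * self.frame_height
                * self.bytes_per_pixel * self.frame_buffers)

    def worst_case(self) -> int:
        # Everything that can be live at the same moment, plus margin.
        base = self.frame_bytes() + self.tensor_arena + self.connectivity
        return base * (100 + self.headroom_percent) // 100

    def fits(self, available: int) -> bool:
        return self.worst_case() <= available


budget = RamBudget(
    frame_width=320, frame_height=240, bytes_per_pixel=2, frame_buffers=2,
    tensor_arena=512 * 1024, connectivity=100 * 1024,
)
available = 8 * 1024 * 1024  # hypothetical 8 MB PSRAM module

print("frame buffers:", budget.frame_bytes(), "bytes")
print("worst case:   ", budget.worst_case(), "bytes")
print("fits:         ", budget.fits(available))
```

Because frame size scales with width times height, halving both dimensions cuts the frame-buffer term to a quarter, which is why input resolution belongs in the same budget as the arena.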

The engineering that makes a model fit a microcontroller

Getting a model onto the ESP32 is not "train it and flash it." It is a sequence of deliberate compressions, each with a cost you measure rather than assume:

  1. Quantize to int8. The single most important step. A 32-bit float model is shrunk to 8-bit integers, cutting the memory footprint by roughly 4x and letting the integer-optimized kernels run far faster. With proper post-training quantization or quantization-aware training the accuracy loss is usually small — but it is not zero, so validate the quantized model on real data rather than assume the float accuracy carries over.
  2. Choose an architecture built for the constraint. MobileNet-class networks and depthwise-separable convolutions exist precisely because they do more with fewer parameters. Shrinking a large server-grade backbone is the slow road; starting from an architecture designed for microcontrollers is the fast one.
  3. Budget memory and latency up front. The tensor arena, the model weights in flash, and the input buffer are three separate budgets, and the latency target is set by the real event you react to. Write these down before training — a model that is 5% more accurate but does not fit the arena is one you cannot ship.
  4. Pick the input resolution honestly. For vision, input size dominates both latency and memory. Dropping a camera detector to a smaller input is often the difference between real time and not — and at the distances most triggering applications care about, the smaller input is good enough.
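To make step 1 concrete, here is a small sketch of affine int8 quantization, the general scheme behind post-training quantization: a real value x is stored as q = round(x / scale) + zero_point. This illustrates the arithmetic only; it is not the code path of TFLite Micro, ESP-DL, or Edge Impulse, and the value range and parameter count are made-up examples.

```python
def quant_params(x_min, x_max):
    """Scale and zero point that map [x_min, x_max] onto int8 [-128, 127]."""
    # The representable range must include 0.0 so that zero maps exactly.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / 255.0 or 1.0  # guard against an all-zero tensor
    zero_point = round(-128 - x_min / scale)
    return scale, zero_point


def quantize(x, scale, zero_point):
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))  # saturate instead of wrapping


def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale


def weights_bytes(num_params, bytes_per_param):
    return num_params * bytes_per_param


scale, zero_point = quant_params(-0.5, 1.5)  # hypothetical tensor range
for x in (-0.5, -0.1, 0.0, 0.3, 1.0, 1.5):
    q = quantize(x, scale, zero_point)
    error = abs(dequantize(q, scale, zero_point) - x)
    print(f"x={x:+.2f} -> q={q:4d}, round-trip error={error:.5f}")

params = 250_000  # hypothetical model size
print("float32:", weights_bytes(params, 4), "bytes")
print("int8:   ", weights_bytes(params, 1), "bytes")
```

For in-range values the round-trip error is bounded by half the scale, which is small per weight but can still add up across layers. That accumulated error is the reason the quantized model needs its own validation pass.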

Every one of these is a trade-off you measure. The field failure we see most: a team quantizes, assumes the accuracy held, and discovers in deployment that the int8 model misses a class the float model caught. Validate after every compression, on data that looks like the field.
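One way to catch that failure before deployment is to compare per-class recall between the float and int8 models, since a single overall accuracy number can hide a class that collapsed. The sketch below assumes you already have both models' predictions on a field-like validation set; the labels, predictions, and the 0.05 drop threshold are invented for illustration.

```python
from collections import Counter


def per_class_recall(labels, predictions):
    """Recall per class: correct predictions divided by ground-truth count."""
    totals = Counter(labels)
    hits = Counter(l for l, p in zip(labels, predictions) if l == p)
    return {c: hits[c] / totals[c] for c in totals}


def recall_regressions(labels, float_preds, int8_preds, max_drop=0.05):
    """Classes whose recall fell by more than max_drop after quantization."""
    before = per_class_recall(labels, float_preds)
    after = per_class_recall(labels, int8_preds)
    return {c: (before[c], after[c])
            for c in before if before[c] - after[c] > max_drop}


# Toy data: the int8 model still finds people but mostly loses "bike".
labels      = ["person"] * 4 + ["bike"] * 4
float_preds = ["person"] * 4 + ["bike", "bike", "bike", "person"]
int8_preds  = ["person"] * 4 + ["bike", "person", "person", "person"]

for cls, (before, after) in recall_regressions(labels, float_preds, int8_preds).items():
    print(f"{cls}: recall {before:.2f} -> {after:.2f} after quantization")
```

Running the same check after each compression step (quantization, architecture change, resolution drop) shows which step cost which class.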

TFLite Micro, ESP-DL, or Edge Impulse: the toolchain reality

These three are not competitors you choose between once; they fit different jobs, and a real project often touches more than one.

  • Edge Impulse is the fastest path from raw sensor or audio data to a deployed model. It handles collection, feature extraction, training, and quantization, then exports a C++ library you compile into firmware. For vibration, audio, and sensor-classification work — where the hard part is the data pipeline, not the inference — it removes weeks of plumbing.
  • TensorFlow Lite Micro is the portable runtime when you already have a trained, quantized model and want full control of the inference loop and tensor arena. It is the lingua franca of TinyML and the right choice when you need to own the integration.
  • ESP-DL is Espressif's own library, tuned for the S3's SIMD instructions. For image detection where you want the vendor-optimized vision kernels and the best latency on Espressif silicon, it is the strongest option.

The decision is driven by the model and the latency budget, not loyalty. A vibration node might be pure Edge Impulse; an S3 camera detector might run ESP-DL for the vision kernels. Pin your toolchain and model versions — the runtime, quantization behavior, and kernels all evolve, and an unpinned upgrade can silently change your accuracy or memory footprint.

An honest on-device-vs-cloud decision framework

The most expensive edge-AI mistake is putting the model in the wrong place. Run this in order during scoping and stop at the first answer that decides it:

  1. Does latency or offline operation force the edge? If the device must react in milliseconds, or work on a site with no reliable connectivity, the model belongs on-device — the round-trip is what you cannot afford.
  2. Does privacy or bandwidth force the edge? If the raw image or audio should never leave the device, or streaming it continuously is too expensive, run inference locally and send only the result.
  3. Is the model too large for a microcontroller? If it genuinely needs the capacity, it stays in the cloud — and the edge's job becomes filtering, so only the interesting frames are uploaded.
  4. Do you need frequent retraining on fleet data? If the model must learn continuously from the whole fleet, training is a cloud job; the device runs the latest exported model.
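The four questions above can be written down as an ordered checklist that stops at the first deciding answer. This is only a sketch of the framework as stated; the field names and the wording of the returned strings are our own, and the final fallthrough case is an assumption, because the list does not say what to do when no question decides it.

```python
from dataclasses import dataclass


@dataclass
class Scope:
    """Answers gathered during scoping. Field names are illustrative."""
    needs_ms_reaction: bool = False
    unreliable_connectivity: bool = False
    raw_data_must_stay_local: bool = False
    streaming_too_expensive: bool = False
    model_too_large_for_mcu: bool = False
    needs_fleet_retraining: bool = False


def place_model(s: Scope) -> str:
    # Order matters: the first question that decides it wins.
    if s.needs_ms_reaction or s.unreliable_connectivity:
        return "edge: latency or offline operation forces it"
    if s.raw_data_must_stay_local or s.streaming_too_expensive:
        return "edge: privacy or bandwidth forces it"
    if s.model_too_large_for_mcu:
        return "cloud: the edge only filters which frames are uploaded"
    if s.needs_fleet_retraining:
        return "split: train in the cloud, run the exported model on-device"
    # Assumption: nothing forced the choice, so decide on cost and complexity.
    return "undecided: no constraint forces either placement"


print(place_model(Scope(unreliable_connectivity=True)))
print(place_model(Scope(model_too_large_for_mcu=True)))
print(place_model(Scope(needs_fleet_retraining=True)))
```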

The strongest architectures are not edge or cloud — they are both. A small model on the ESP32 decides in real time and filters out the 99% of frames that are nothing, and only the rare, interesting event is sent up for heavier analysis or human review. That split gives you the latency and privacy of the edge and the capability of the cloud, without paying the round-trip every frame.
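As a sketch of that split, the device-side logic reduces to a filter over per-frame results: most frames produce nothing worth sending, and only detections above a confidence threshold become events. The label names, the scores, and the 0.6 threshold below are hypothetical, and the real firmware would do this in C or C++ next to the inference call; Python is used here only to show the logic.

```python
def events_to_upload(detections, threshold=0.6, ignore=("background",)):
    """Yield (frame_index, label, score) for the few frames worth sending up."""
    for index, (label, score) in enumerate(detections):
        if label not in ignore and score >= threshold:
            yield index, label, score


# One (label, score) result per frame from the on-device model (toy data).
detections = [
    ("background", 0.98),
    ("person", 0.41),      # below threshold: stays on the device
    ("person", 0.87),      # confident detection: becomes an event
    ("background", 0.99),
]

events = list(events_to_upload(detections))
print(f"{len(events)} of {len(detections)} frames leave the device: {events}")
```

The threshold is a product decision as much as a model one: lower it and the cloud sees more false alarms, raise it and the device stays quiet through real events.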

What we learned shipping an ESP32 camera detector

This guide is field-failure-first because GizanTech has shipped an ESP32 camera object-detection system — a device that runs the detector on-board, decides on the frame locally, and never sends a pixel to a server unless something worth seeing happened. The lessons that mattered were system lessons, not the ones the tutorials warn about: the int8 model lost a class we only caught on field data; the frame buffer and the Wi-Fi stack fought over PSRAM until we sized the worst-case concurrency; the "fast enough" bench latency was measured against an idle CPU, not one also running the connectivity stack. None of these are model problems. They are embedded problems — exactly where an edge-AI project that treats the model as the whole job loses months.

That is the case for designing the model and the firmware together. The model decision — int8, architecture, input resolution — is inseparable from the memory budget, the camera pipeline, and the power envelope. GizanTech designs the ESP32 hardware, writes the firmware, and trains and quantizes the on-device model, so the trade-offs above are made once with the full system in view. If you are scoping an edge-AI product and want to know honestly what will fit before you commit, talk to our engineering team.

Frequently asked questions

Can the ESP32 actually run machine learning, or is that marketing?

It genuinely runs machine learning, but inference only, not training. The ESP32-S3 executes quantized neural networks on-device — wake-word spotting, vibration anomaly detection, and small-image person/object detection are all shipping today. What it cannot do is train a model or run a large multi-megabyte network; those stay in the cloud. The realistic framing is that you train and quantize a small model off-device, then the ESP32 runs that fixed model in real time with no network round-trip. So it is real, with the honest caveat that "ML on the ESP32" means on-device inference of a deliberately small, int8 model, not a general-purpose AI accelerator.

ESP32 vs ESP32-S3 for edge AI — does the chip choice matter?

It matters more than almost any other decision. The ESP32-S3 added vector/SIMD instructions that the neural-network kernels in ESP-DL and TFLite Micro use to run convolutions several times faster than on the classic ESP32, and it pairs with octal PSRAM so a camera frame buffer plus model tensors actually fit. The classic ESP32 can run tiny models (a wake-word classifier, a simple anomaly detector), but it has no SIMD acceleration and tighter RAM, so anything vision-shaped becomes too slow to be useful. For any new edge-AI design, treat the ESP32-S3 as the default and reach for the classic ESP32 only for the smallest audio or sensor models.

Do you use TensorFlow Lite Micro, ESP-DL, or Edge Impulse?

We use whichever fits the project, and they are not mutually exclusive. Edge Impulse is the fastest path from raw sensor or audio data to a deployed model — it handles data collection, training, and quantization and exports a library you drop into firmware, which is ideal for vibration, audio, and classification work. TensorFlow Lite Micro is the portable runtime when you already have a trained model and want full control of the inference loop. ESP-DL is Espressif's own library, tuned for the S3's SIMD instructions and the strongest choice for image detection where you need the vendor-optimized vision kernels. The decision is driven by the model and the latency budget, not loyalty to one toolchain.

When is running ML on the device better than sending data to the cloud?

Run it on-device when latency, privacy, bandwidth, or offline operation decide the product. On-device inference answers in milliseconds with no network dependency, never transmits the raw image or audio off the device, and works on a site with no reliable connectivity — which is exactly why a camera that must react instantly, or a sensor on a remote machine, belongs at the edge. The cloud wins when the model is too large to fit a microcontroller, when you need to retrain frequently on aggregated fleet data, or when a human reviews the result anyway. The strongest products do both: a small model on the ESP32 filters and decides in real time, and only the rare, interesting event is sent up for heavier analysis.
