RS485/Modbus to MQTT to AWS IoT Core: Reference Architecture

The architecture in one sentence

A field device speaks Modbus RTU over RS485; an ESP32 gateway polls its registers, buffers every reading to non-volatile storage, and publishes them over MQTT with per-device mutual TLS to AWS IoT Core, where a single IoT Rule fans the stream out to time-series storage and a dashboard. Each layer in that sentence exists to prevent a specific, repeatable field failure. This is the reference industrial IoT gateway architecture we ship, layer by layer, mapped to the failure each layer stops.

We are deliberately vendor-neutral on the cloud — the same shape works on any MQTT-broker-plus-rules-engine stack — and opinionated on the security model, because security mistakes are the ones that are expensive and irreversible after a fleet is in the field.

Layer 1 — Edge protocol: Modbus RTU over RS485

Modbus RTU over RS485 is the lingua franca of industrial sensors, energy meters, VFDs, and PLCs: a half-duplex, multi-drop, two-wire differential bus with a single master and addressed slaves. Two field failures dominate, and neither is a software bug.

The first is electrical noise and ground loops. RS485 runs are long, routed past motors and contactors, and reference different grounds at each end. An unisolated transceiver will see CRC errors, dropped frames, and — in the bad cases — a fried UART. The fix is an isolated RS485 transceiver plus correct 120 Ω termination at both ends of the trunk and proper biasing, so the idle line sits at a defined level instead of floating into garbage.

The second is bus contention. RS485 is half-duplex, so the gateway must drive the transmit-enable line precisely around each frame and never talk while a slave is responding. Get the turnaround timing wrong and you collide with the device you are polling. Treat the gateway as a disciplined Modbus master, not a UART you spray bytes at: one outstanding transaction at a time, a silent gap of at least 3.5 character times between frames, and a response timeout on every request.
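As a rough illustration of that master discipline, the sketch below builds a read-holding-registers request with its CRC and computes the silent interval between frames. It is written in Python for readability rather than as ESP32 firmware; the function names are hypothetical, and the 11-bit character (start, 8 data, parity, stop) is an assumption about the serial settings.

```python
import struct

def crc16_modbus(data: bytes) -> int:
    """CRC-16/MODBUS: initial value 0xFFFF, reflected polynomial 0xA001."""
    crc = 0xFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xA001
            else:
                crc >>= 1
    return crc

def read_holding_request(slave: int, start: int, count: int) -> bytes:
    """Build a function 0x03 request; the CRC goes on the wire low byte first."""
    body = struct.pack(">BBHH", slave, 0x03, start, count)
    return body + struct.pack("<H", crc16_modbus(body))

def frame_is_valid(frame: bytes) -> bool:
    """A frame with its CRC appended checks to zero."""
    return len(frame) >= 4 and crc16_modbus(frame) == 0

def inter_frame_gap_s(baud: int, bits_per_char: int = 11) -> float:
    """Silent interval of 3.5 character times; fixed at 1.75 ms above 19200 baud."""
    if baud > 19200:
        return 0.00175
    return 3.5 * bits_per_char / baud

frame = read_holding_request(slave=1, start=0, count=1)
print(frame.hex(" "))  # 01 03 00 00 00 01 84 0a
```

On real hardware the same CRC check would gate every response before its registers are trusted, and the gap would be enforced before the transmit-enable line is asserted again.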

Layer 2 — Gateway MCU: the ESP32 as Modbus master

The gateway is where the fieldbus ends and the authenticated, encrypted world begins, which makes the MCU choice consequential. We default to the ESP32 family because one part does three jobs at once: Modbus RTU master on a hardware UART, local buffer in flash, and a TLS MQTT session — all on a chip with enough SRAM for the TLS handshake and enough flash for a meaningful backlog.

The failure this layer prevents is bridging raw Modbus to the internet. Modbus has no authentication, no encryption, and no concept of identity. Tunnel it straight to a cloud endpoint and anyone on the path can read meter data or, worse, write a holding register on a VFD. The gateway is the trust boundary: it terminates Modbus on one side and re-emits only authenticated, TLS-protected MQTT on the other. Nothing speaks Modbus past the gateway, and nothing speaks unauthenticated MQTT into the cloud.

Layer 3 — Local buffering: the layer that saves your dataset

This is the layer teams skip and then regret. An industrial uplink — Wi-Fi in a steel building, a 4G modem at a remote pump, Starlink on a vessel — will drop. The question is not if but for how long.

The rule is simple: keep reading Modbus and keep appending timestamped samples to non-volatile storage whenever the uplink is down, then drain that backlog when MQTT reconnects. A flash-backed ring buffer, an SD card, or a LittleFS file all work; the storage medium matters less than the discipline. Two non-negotiables:

Stamp each reading with its own capture time, not the publish time. Backfilled data must land where it actually happened on the timeline. If everything reconnects with a "now" timestamp, an hour of meter readings collapses onto one instant and the dataset is worthless.
Size the buffer for a realistic worst-case outage at your sample rate, and decide the overflow policy on purpose — overwrite oldest, or stop and alarm. Silent data loss is the failure; an explicit, logged policy is the cure.
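The two non-negotiables above can be sketched as follows. This is an in-memory stand-in for a flash-backed ring buffer, not a storage driver: `Sample`, `SampleBuffer`, and `required_capacity` are hypothetical names, and the `publish` callback is assumed to return True only once the broker has acknowledged the message.

```python
from collections import deque
from dataclasses import dataclass

@dataclass(frozen=True)
class Sample:
    captured_at: float  # capture time in epoch seconds, never the publish time
    register: int
    value: float

class SampleBuffer:
    """In-memory stand-in for a non-volatile ring buffer (hypothetical API)."""

    def __init__(self, capacity: int, overwrite_oldest: bool = True):
        self.capacity = capacity
        self.overwrite_oldest = overwrite_oldest
        self.dropped = 0  # overflow is counted so it can be logged, never silent
        self._items = deque()

    def append(self, sample: Sample) -> bool:
        if len(self._items) >= self.capacity:
            self.dropped += 1
            if not self.overwrite_oldest:
                return False       # "stop and alarm": the caller raises the alarm
            self._items.popleft()  # "overwrite oldest"
        self._items.append(sample)
        return True

    def drain(self, publish) -> int:
        """Publish oldest first; a sample leaves the buffer only after publish succeeds."""
        sent = 0
        while self._items:
            if not publish(self._items[0]):
                break
            self._items.popleft()
            sent += 1
        return sent

def required_capacity(outage_hours: float, sample_period_s: float, registers: int) -> int:
    """Number of samples needed to ride out one outage."""
    return int(outage_hours * 3600 / sample_period_s) * registers

print(required_capacity(outage_hours=24, sample_period_s=10, registers=4))  # 34560
```

Multiplying that sample count by the on-flash record size gives the storage budget, which is the number to compare against the partition or card you actually have.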

The failure this layer prevents is a permanent hole in the record. For energy metering, compliance logging, or anything you bill against, "we were offline so that hour is gone" is not an acceptable sentence.

Layer 4 — Transport and QoS: MQTT done deliberately

MQTT is the right transport: lightweight, pub/sub, designed for flaky links, and natively supported by every major IoT cloud. But two defaults bite people.

The first is QoS. QoS 0 ("fire and forget") will silently drop a publish if the link hiccups mid-send. For telemetry that feeds billing or alarms, publish at QoS 1 ("at least once") so the broker acknowledges receipt and the client retries on failure — and make your downstream idempotent so the occasional duplicate is harmless. Reserve QoS 0 for high-rate, loss-tolerant signals where one missed sample does not matter.
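One way to make the downstream idempotent is to key each reading by what it is rather than by when it arrived. The sketch below assumes a JSON payload with `device_id`, `register`, `captured_at`, and `value` fields; those field names and the `IdempotentSink` class are illustrative, not a fixed schema.

```python
import json

class IdempotentSink:
    """Stores each reading at most once, keyed by identity rather than arrival."""

    def __init__(self):
        self.rows = {}

    def handle(self, payload: str) -> bool:
        msg = json.loads(payload)
        key = (msg["device_id"], msg["register"], msg["captured_at"])
        if key in self.rows:
            return False  # QoS 1 redelivery of a reading already stored
        self.rows[key] = msg["value"]
        return True

sink = IdempotentSink()
reading = json.dumps({"device_id": "gw-001", "register": 0,
                      "captured_at": 1700000000, "value": 230.5})
first = sink.handle(reading)   # stored
second = sink.handle(reading)  # duplicate, ignored
```

In a real store the same effect usually comes from a unique key or an upsert on those columns, so a duplicate delivery overwrites a row with identical data instead of adding a second one.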

The second is the Last Will and Testament. Configure an LWT message so that when a gateway drops off ungracefully, the broker publishes a "device offline" notice on its behalf. Without it, a dead gateway looks identical to a quiet one, and your dashboard shows the last good reading forever while the machine has actually gone dark.
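A minimal sketch of the presence pattern, assuming a hypothetical status topic `dt/<deviceId>/status` (the device's policy would also need to allow publishing to it). The will is registered at connect time and the gateway announces "online" itself once connected; the consumer side simply keeps the last state per device.

```python
import json

def presence_config(device_id: str) -> dict:
    """Build the will and the matching 'online' announcement for one gateway."""
    topic = f"dt/{device_id}/status"  # hypothetical topic, not from the article
    return {
        "will": {"topic": topic, "qos": 1,
                 "payload": json.dumps({"state": "offline", "cause": "lwt"})},
        "online": {"topic": topic, "qos": 1,
                   "payload": json.dumps({"state": "online"})},
    }

def apply_status(fleet: dict, topic: str, payload: str) -> None:
    """Dashboard side: record the last known state per device."""
    device_id = topic.split("/")[1]
    fleet[device_id] = json.loads(payload)["state"]

fleet = {}
cfg = presence_config("gw-001")
apply_status(fleet, cfg["online"]["topic"], cfg["online"]["payload"])
apply_status(fleet, cfg["will"]["topic"], cfg["will"]["payload"])
```

With a client library such as paho-mqtt, the `will` fields would be passed to `Client.will_set(topic, payload, qos, retain)` before connecting; the broker then publishes that message only if the session ends without a clean disconnect.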

Layer 5 — Authentication: one X.509 certificate per device

Here is where we stop being neutral. The single most important decision in the whole architecture is the per-device security model, and the right answer is one unique X.509 certificate and private key per physical device, with the private key generated on-device or in a secured provisioning step and never shared.

The failure this prevents is fleet-wide compromise from a single leaked secret. If every gateway authenticates with the same username/password or a shared key — a depressingly common shortcut — then one extracted device, one decompiled firmware image, one leaked credential, and the attacker can impersonate every unit you ever shipped. Your only remedy is rotating a secret across the entire fleet at once, which in practice means a recall.

With one certificate per device you get the opposite: a compromised gateway can be evicted by revoking exactly one certificate, and no other device is affected. AWS IoT Core authenticates each connection by its client certificate against the registered identity, so the device's identity is cryptographic, not a string in a config file.

Layer 6 — Cloud ingest: AWS IoT Core and least-privilege topic policy

Authentication says who the device is; the topic policy says what it may do. This is the second half of the security model and the one most teams under-build.

Attach each certificate to an IoT policy that grants least privilege:

The device may connect only with a client ID equal to its own identity (bind the MQTT client ID to the certificate, so a device cannot squat on another's session).
It may publish only to its own namespaced topic tree, e.g. dt/<deviceId>/telemetry.
It may subscribe/receive only on its own command topic, e.g. cmd/<deviceId>/#.
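Express all three with policy variables, which AWS IoT Core resolves from the connection's identity when it evaluates the policy (for example the client ID or the registered thing name), so a single policy template scopes every device to its own tree. You author the policy once, not once per device.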
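The policy template described above might look like the following, generated here as a Python dictionary so it can be serialized to JSON. It assumes each certificate is attached to a registered thing and that the gateway connects with its thing name as the MQTT client ID, which is what lets the `${iot:Connection.Thing.ThingName}` variable resolve; the region and account ID are placeholders.

```python
import json

# AWS IoT thing policy variable, substituted by the service per connection.
THING = "${iot:Connection.Thing.ThingName}"

def device_policy(region: str, account_id: str) -> dict:
    """One policy document that scopes every device to its own topic tree."""
    arn = f"arn:aws:iot:{region}:{account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            # Connect only with a client ID equal to the device's own thing name.
            {"Effect": "Allow", "Action": "iot:Connect",
             "Resource": f"{arn}:client/{THING}"},
            # Publish only to the device's own telemetry topic.
            {"Effect": "Allow", "Action": "iot:Publish",
             "Resource": f"{arn}:topic/dt/{THING}/telemetry"},
            # Subscribe and receive only under the device's own command tree.
            {"Effect": "Allow", "Action": "iot:Subscribe",
             "Resource": f"{arn}:topicfilter/cmd/{THING}/*"},
            {"Effect": "Allow", "Action": "iot:Receive",
             "Resource": f"{arn}:topic/cmd/{THING}/*"},
        ],
    }

policy = device_policy("us-east-1", "123456789012")  # placeholder region and account
print(json.dumps(policy, indent=2))
```

Note the split between `topicfilter/` for Subscribe and `topic/` for Publish and Receive, and that the policy wildcard is `*` rather than the MQTT `#`.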

The failure this prevents is lateral movement. With a wildcard topic policy, a compromised meter in building A can subscribe to every other device's telemetry and publish commands to a VFD in building B. With per-device namespacing enforced at the policy layer, a compromised gateway can only ever see and touch its own topics — the blast radius is one device.

Layer 7 — The IoT Rule: fan-out without coupling

Inside AWS IoT Core, an IoT Rule subscribes to the telemetry topic with a SQL-like statement and routes each message to one or more destinations. This is the seam that keeps the architecture decoupled: the device publishes to exactly one topic and knows nothing about what consumes it.

One rule fans a single inbound stream out to, typically, a time-series store for history and a hot path (a function or live channel) for the dashboard and alarms. The failure this prevents is tight coupling and rework: when you later add an analytics sink, an archive, or a second dashboard, you add a rule action — you do not touch firmware or re-provision a single device. Changing where data goes must never require flashing the fleet.
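In AWS IoT Core the rule itself is a SQL statement along the lines of `SELECT *, topic(2) AS deviceId FROM 'dt/+/telemetry'` with actions attached. The sketch below is only a local emulation of that behaviour, useful for reasoning about the decoupling; the `Rule` class and the two sinks are hypothetical.

```python
def topic_matches(topic_filter: str, topic: str) -> bool:
    """MQTT filter matching: '+' is one level, '#' is the rest of the tree."""
    f_parts = topic_filter.split("/")
    t_parts = topic.split("/")
    for i, f in enumerate(f_parts):
        if f == "#":
            return True
        if i >= len(t_parts):
            return False
        if f != "+" and f != t_parts[i]:
            return False
    return len(f_parts) == len(t_parts)

class Rule:
    """Local stand-in for one rule: a topic filter plus a list of actions."""

    def __init__(self, topic_filter: str):
        self.topic_filter = topic_filter
        self.actions = []

    def add_action(self, action) -> None:
        self.actions.append(action)

    def dispatch(self, topic: str, message: dict) -> int:
        if not topic_matches(self.topic_filter, topic):
            return 0
        # Equivalent of topic(2): take the device ID from the second topic level.
        enriched = dict(message, deviceId=topic.split("/")[1])
        for action in self.actions:
            action(enriched)
        return len(self.actions)

history, live = [], {}
rule = Rule("dt/+/telemetry")
rule.add_action(history.append)                                         # time-series store
rule.add_action(lambda m: live.__setitem__(m["deviceId"], m["value"]))  # hot path
rule.dispatch("dt/gw-001/telemetry", {"value": 230.5, "captured_at": 1700000000})
```

Adding an archive later is one more `add_action` call on the cloud side; the publisher's topic and payload stay exactly as they were.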

Layer 8 — Storage and dashboard: history plus a live view

Telemetry splits into two access patterns, and conflating them is the failure here.

Storage wants a time-series database (or partitioned object storage) optimized for "device X between time A and B" range scans, retention/downsampling, and aggregation. A general-purpose relational table will work until your fleet and history grow, then range queries crawl.
Dashboard wants low-latency reads and live updates — current values, trends, threshold alarms — served from the hot path, not by hammering the historical store on every page load.

Keep them as two destinations off the same IoT Rule. The dashboard reads the live channel for "now" and queries the time-series store for "the last 30 days," and neither workload starves the other.

The layer-by-layer decision table

Every layer has a viable alternative and a trade-off. Here is the whole stack with the option we recommend as the default for a small-to-mid industrial fleet, and what you are trading when you choose otherwise.

Layer	Viable options	Trade-off	Recommended default
Edge protocol	Modbus RTU/RS485; Modbus TCP; analog/pulse I/O	RTU is universal but needs isolation, termination, and turnaround discipline; TCP is easier but assumes the device already has Ethernet	Modbus RTU over isolated RS485, terminated and biased
Gateway MCU	ESP32 / ESP32-S3; Linux SBC; vendor gateway appliance	ESP32 is cheap, low-power, single-purpose; an SBC adds containers and multi-bus translation at higher cost and power	ESP32 for one bus / small fleet; SBC when you need multi-protocol or edge compute
Local buffering	RAM only; flash ring buffer / LittleFS; SD card	RAM is lost on reboot; flash is robust but finite; SD holds more but adds a connector and a failure point	Non-volatile buffer (flash ring buffer, SD for long outages), per-sample timestamps
Transport / QoS	MQTT QoS 0; MQTT QoS 1; MQTT QoS 2	QoS 0 can silently drop; QoS 1 can duplicate; QoS 2 adds round-trips and broker cost, and AWS IoT Core does not support it	MQTT QoS 1 for telemetry with idempotent consumers; QoS 0 only for loss-tolerant signals
Auth	Shared key/password; per-device X.509 mutual TLS	A shared secret means fleet-wide compromise from one leak; per-device certs add provisioning effort	Per-device X.509 mutual TLS, key generated on-device, one cert per unit
Cloud ingest	Wildcard topic policy; per-device least-privilege policy	Wildcard is easy but allows lateral movement; scoped policy needs policy variables	Least-privilege policy scoping each device to its own topic tree via policy variables
Storage	Relational table; time-series DB; partitioned object storage	Relational degrades on range scans at scale; TSDB adds an operational component	Time-series database with retention/downsampling for history
Dashboard	Poll the historical store; live channel + history query	Polling history is slow and costly; a live path adds a component but stays responsive	Live channel for "now" + time-series query for history, off the same IoT Rule

How the layers compose: the data's journey

Read the table top to bottom and one reading's lifecycle falls out:

The ESP32 polls a holding register over isolated RS485 and reads a raw 16-bit value.
It scales that value to engineering units and stamps it with the capture time.
The sample is appended to the non-volatile buffer — unconditionally, online or not.
When MQTT is connected, the buffer drains to AWS IoT Core at QoS 1 over mutual TLS, on the device's own namespaced topic.
The IoT Rule matches the topic and fans the message out to the time-series store and the dashboard's hot path.
The dashboard shows the live value immediately and serves history from the time-series store on demand.

If any link breaks, the buffer holds the line until it heals — which is the whole point of building it this way.
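The first two steps of that journey, scaling and stamping, can be sketched as below. The scale factor, the signed/unsigned choice, the JSON field names, and the helper names are assumptions for illustration; the real values come from the device's register map.

```python
import json

def to_signed16(raw: int) -> int:
    """Interpret a raw 16-bit register as two's complement."""
    return raw - 0x10000 if raw & 0x8000 else raw

def scale(raw: int, factor: float, signed: bool = False) -> float:
    """Convert a raw register value to engineering units."""
    value = to_signed16(raw) if signed else raw
    return value * factor

def telemetry_message(device_id: str, register: int, raw: int,
                      factor: float, captured_at: int, signed: bool = False):
    """Return (topic, payload) for one sample, stamped with its capture time."""
    topic = f"dt/{device_id}/telemetry"
    payload = json.dumps({
        "device_id": device_id,
        "register": register,
        "value": scale(raw, factor, signed),
        "captured_at": captured_at,  # set when the register was read
    })
    return topic, payload

# A meter reporting volts in tenths: raw 2305 is about 230.5 V.
topic, payload = telemetry_message("gw-001", 0, 2305, 0.1, captured_at=1700000000)
```

The same message is what gets appended to the buffer, so a sample drained an hour later still carries the time it was read.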

Where teams get this wrong, and where to get help

Three mistakes recur. Teams bridge raw Modbus to the cloud and create an unauthenticated control path into industrial equipment. They skip local buffering and discover the gap only when an auditor asks for the missing hours. And they share one credential across the fleet to ship faster, turning a single leaked device into a recall.

The architecture above is opinionated precisely because those three are the expensive, hard-to-reverse ones. GizanTech designs the gateway PCB (isolated RS485 and all), writes the ESP32 firmware (Modbus master, buffering, mutual-TLS MQTT), and stands up the cloud side (per-device certificates, least-privilege policies, the IoT Rule, storage, and dashboard) as one coherent industrial IoT gateway architecture rather than three teams meeting at the seams. If you are scoping a fleet and want the security model and buffering strategy validated before you commit firmware to the field, talk to our engineering team.

Frequently asked questions

How do I get Modbus RTU data from an RS485 device up to the cloud securely?

Put a gateway MCU on the RS485 bus as the Modbus RTU master, poll the holding/input registers you care about, scale the raw 16-bit values into engineering units, and publish them as JSON over MQTT to a cloud broker. The secure part is the transport: use TLS with a per-device X.509 client certificate so the broker authenticates each gateway and the link from gateway to broker is encrypted. Never bridge raw Modbus to the internet: Modbus RTU has no authentication or encryption, so the gateway must be the trust boundary that terminates the fieldbus and re-emits authenticated, TLS-protected MQTT.

Should the ESP32 gateway buffer data locally, and what happens to readings during an internet outage?

Yes — buffer locally, always. The gateway should keep reading Modbus registers and appending timestamped samples to non-volatile storage (a ring buffer in flash, an SD card, or LittleFS) whenever the uplink is down, then drain that backlog to the cloud once MQTT reconnects. Without a buffer, every minute offline is a permanent hole in your dataset, which is unacceptable for energy metering, compliance logging, or anything you bill against. Size the buffer for your realistic worst-case outage and sample rate, and stamp each reading with its own capture time so backfilled data lands at the right point on the timeline, not at the reconnect moment.

Do I need a separate industrial gateway appliance, or can an ESP32 be the gateway for a small fleet?

For a single machine or a small site, an ESP32 with an isolated RS485 transceiver is a perfectly capable gateway and is what we ship in many deployments — it handles the Modbus master role, local buffering, and mutual-TLS MQTT comfortably. You move to a heavier Linux-based gateway appliance when you need protocol translation across many buses, store-and-forward of large historical archives, container workloads at the edge, or a vendor-certified box for a regulated environment. The deciding factors are bus count, on-site compute needs, and certification — not raw connectivity, which the ESP32 does well.

What is the per-device security model for connecting many gateways to AWS IoT Core?

One unique X.509 certificate and private key per physical device, with the private key generated on-device (or in a secure provisioning step) and never shared. Each certificate is attached to an IoT policy that grants least privilege: the device may only connect with a client ID equal to its own identity and may only publish and subscribe on topics namespaced to itself, typically using policy variables so a single policy template scopes every device to its own topic tree. This means a compromised gateway cannot read or impersonate any other device, and you revoke exactly one certificate to evict it — no shared secret to rotate across the whole fleet.