IoT ingest: getting sensor data from the edge into typed Parquet
A building management system runs 10,000 sensors across 50 sites. Each sensor reports temperature, humidity, and pressure every 30 seconds. That's 20,000 events per minute, 24 hours a day, from devices scattered across three continents.
The data is JSON. The destination is S3. The requirement is typed Parquet, partitioned by date, queryable by any analytics tool. Between those two points sits every hard problem in distributed data engineering: geographic latency, burst traffic, credential management, schema drift, nested payloads, and the operational burden of keeping it all running.
Why IoT ingest is different
IoT data has characteristics that make it harder than typical application event streams.
Geographic distribution. Sensors live where they're deployed - factory floors in Shenzhen, wind farms in the North Sea, delivery vehicles across South America. If your ingest endpoint runs in us-east-1, every event from a sensor in Frankfurt pays 80-120ms of transatlantic latency before it even arrives. For battery-powered devices that need to POST and sleep, that round-trip cost matters.
Burst patterns. Sensors that report on intervals create synchronized traffic spikes. 10,000 sensors reporting every 30 seconds don't distribute evenly - they cluster around the interval boundary. At scale, this creates thundering herd problems that look like DDoS traffic to infrastructure that isn't designed for it.
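The clustering effect is easy to see with a quick simulation (illustrative numbers only, assuming devices sync their clocks and fire within a couple of seconds of the interval boundary):

```python
import random

# Sketch: 10,000 sensors report every 30 s. Devices that sync their clocks
# (NTP, shared boot schedules) fire near the interval boundary, offset only
# by a little jitter, so traffic clusters instead of spreading evenly.
random.seed(42)

SENSORS = 10_000
INTERVAL = 30  # seconds

# Each sensor fires at t = 0 mod 30, plus up to 2 s of clock/network jitter.
events_per_second = [0] * INTERVAL
for _ in range(SENSORS):
    jitter = random.uniform(0.0, 2.0)
    events_per_second[int(jitter) % INTERVAL] += 1

mean_rate = SENSORS / INTERVAL      # ~333 events/s if spread evenly
peak_rate = max(events_per_second)  # several thousand events/s in the burst
print(f"even spread: {mean_rate:.0f}/s, actual peak: {peak_rate}/s")
```

The per-second peak lands an order of magnitude above the average rate, which is why infrastructure sized for the mean falls over at the boundary.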
Firmware-driven schema drift. When a firmware update adds a new field or changes a type, the payload structure changes across the fleet as devices update at different rates. You end up with v1 and v2 payloads arriving on the same endpoint simultaneously, and your pipeline needs to handle both without losing data or halting ingestion.
Nested payloads. IoT sensors don't produce flat JSON. Readings are nested inside objects, GPS coordinates are grouped under a location key, device metadata is a separate sub-document. Extracting typed columns from nested structures adds complexity that flat event data doesn't have.
Security from untrusted networks. Sensors operate on factory Wi-Fi, cellular networks, and public internet connections. Credentials must be managed for thousands of devices. A compromised device key shouldn't expose the entire fleet's data.
The infrastructure you end up operating
Most teams building IoT data pipelines land on one of three architectures.
The AWS stack. The canonical AWS IoT-to-S3 path looks like this:
```
Devices ──► AWS IoT Core ──► Rules Engine ──► Kinesis Data Streams
                                                      │
                                              Kinesis Firehose
                                                      │
                                       Glue (schema) ──► S3
```

That's five services, each with its own IAM roles, configuration, monitoring, and failure modes. IoT Core handles MQTT and device certificates. Rules Engine routes messages. Kinesis Data Streams buffers them. Firehose converts to Parquet using a Glue Data Catalog schema. If a nested field in your sensor payload doesn't have an explicit STRUCT definition in Glue, it's silently dropped. If the Glue schema doesn't match the case of your JSON keys, fields vanish without an error.
The default Firehose delivery window is 60 seconds. Pricing spans five dimensions: IoT Core messaging, Rules Engine evaluations, Kinesis shard-hours, Firehose per-GB, and Glue catalog storage. Predicting your monthly bill requires a spreadsheet.
The Kafka path. Teams already running Kafka add an MQTT-to-Kafka bridge (or Confluent's MQTT Proxy), then configure an S3 sink connector with Parquet conversion. This works, but you're now operating a Kafka cluster, managing topics and partitions, tuning consumer groups, and maintaining connectors - plus the MQTT broker, plus Schema Registry for device schemas. Each component is another thing to deploy, monitor, and debug at 3 AM.
Custom gateways. A surprising number of IoT teams build their own HTTP gateway - a service that accepts sensor POST requests, buffers them in memory or a local queue, batches into Parquet, and uploads to S3. It works for the first month. Then a firmware update changes the payload structure and the Parquet writer crashes. Or the gateway restarts during a traffic spike and the memory buffer is lost. Or you need to add a second region and suddenly you're operating a distributed system with no durability guarantees.
The common thread: for the specific workload of "accept sensor JSON from globally distributed devices and write typed Parquet to S3," every option requires operating multiple services or building custom infrastructure.
What EdgeMQ does for IoT
EdgeMQ replaces this stack with a single HTTP endpoint per deployment:
```
Devices ──► EdgeMQ (edge) ──► your S3 bucket
                                   │
                                   ├── Typed Parquet (view definition)
                                   ├── DLQ Parquet (failed rows)
                                   └── Commit marker (delivery receipt)
```

No IoT Core, no Kinesis, no Glue, no Kafka, no custom gateway. The device POSTs JSON over HTTPS; EdgeMQ handles durability, schema transformation, and S3 delivery.
Edge deployment
EdgeMQ endpoints deploy to any of 15+ global regions across North America, Europe, and Asia-Pacific. If your sensors are in Frankfurt, your endpoint runs in Frankfurt. The p95 acknowledgment latency is under 10ms - the device POSTs, gets a 202 Accepted, and can sleep. The data is durable on NVMe-backed storage before the response is sent.
Handling burst traffic
When thousands of sensors fire simultaneously at an interval boundary, EdgeMQ absorbs the burst into its write-ahead log. The WAL is append-only - writes are batched with writev() and periodically fsynced. There's no parse step on the hot path, no schema validation at ingest time, no synchronous S3 upload. The ingest node returns a 202 before doing anything with the payload beyond writing it to the WAL.
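A minimal sketch of that hot path — illustrative only, not EdgeMQ's implementation — is an append-only writer that batches raw payloads into a single `writev()` call and fsyncs periodically, with no parsing in between:

```python
import os
import tempfile

# Illustrative sketch (not EdgeMQ's code): an append-only WAL that batches
# raw payloads with writev() and fsyncs periodically. Nothing parses or
# validates the payload on this path; durability comes first.
class WalWriter:
    def __init__(self, path, fsync_every=8):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self.pending = []          # buffered payloads, flushed in one writev()
        self.fsync_every = fsync_every
        self.appended = 0

    def append(self, payload: bytes):
        # Length-prefixed record; the body is stored verbatim, unparsed.
        self.pending.append(len(payload).to_bytes(4, "big"))
        self.pending.append(payload)
        self.appended += 1
        if self.appended % self.fsync_every == 0:
            self.flush(sync=True)

    def flush(self, sync=False):
        if self.pending:
            os.writev(self.fd, self.pending)   # one syscall for the batch
            self.pending.clear()
        if sync:
            os.fsync(self.fd)                  # durable before we ack

path = os.path.join(tempfile.mkdtemp(), "segment-000.wal")
wal = WalWriter(path)
for i in range(10):
    wal.append(b'{"sensor_id": "sensor-%03d"}' % i)
wal.flush(sync=True)
print(os.path.getsize(path))
```

The point of the batching is that the syscall and fsync costs are amortized across many records, so the per-event write stays cheap even during a burst.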
If traffic exceeds configured rate limits, EdgeMQ returns 429 with a Retry-After header. If the WAL writer is saturated, it returns 503 with backpressure signaling. Both are standard HTTP semantics that any device client handles with exponential backoff.
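On the device side, the retry policy reduces to a small decision function — a hypothetical client sketch, with `next_delay` and its parameters named here for illustration:

```python
import random

# Hypothetical client-side sketch of reacting to the response codes above:
# 202 means durable (sleep until the next reading); 429/503 mean back off,
# honoring Retry-After when the server provides one.
def next_delay(status, attempt, retry_after=None, base=1.0, cap=60.0):
    if status == 202:
        return None                      # acknowledged; no retry needed
    if retry_after is not None:
        return float(retry_after)        # server said exactly how long to wait
    # Exponential backoff with full jitter, capped at `cap` seconds.
    return random.uniform(0, min(cap, base * 2 ** attempt))

print(next_delay(202, 0))          # None - accepted, go back to sleep
print(next_delay(429, 0, "5"))     # 5.0 - honor the Retry-After header
print(next_delay(503, 3))          # jittered delay, at most 8 seconds here
```

Full jitter (rather than a fixed doubling) matters for IoT fleets: it breaks up the very synchronization that caused the burst in the first place.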
Device authentication
IoT devices authenticate with either API keys or short-lived JWTs:
- API keys (`X-API-Key: ing_live_...`) - simple, works out of the box. Keys are hashed with argon2id at rest; plaintext is never stored.
- JWT bearer tokens - the device exchanges an API key for a 15-minute JWT via `POST /v1/token/exchange`. Subsequent requests use the JWT, avoiding key validation overhead on every request. This is the recommended path for high-volume device fleets.
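A device-side token cache for the JWT path might look like this sketch. The 15-minute TTL and the exchange endpoint come from the docs above; the `TokenCache` class and injected `fetch` callable are illustrative names, wired so the sketch runs without a network:

```python
import time

# Sketch of a device-side JWT cache. The exchange call itself is injected
# (callable: api_key -> jwt string), standing in for POST /v1/token/exchange.
# The device re-exchanges only when the token is close to expiry.
class TokenCache:
    def __init__(self, fetch, ttl=15 * 60, refresh_margin=60):
        self.fetch = fetch
        self.ttl = ttl
        self.refresh_margin = refresh_margin
        self.token = None
        self.expires_at = 0.0

    def bearer(self, api_key):
        now = time.time()
        if self.token is None or now >= self.expires_at - self.refresh_margin:
            self.token = self.fetch(api_key)   # exchange API key for a JWT
            self.expires_at = now + self.ttl
        return {"Authorization": f"Bearer {self.token}"}

calls = []
def fake_exchange(key):
    calls.append(key)
    return f"jwt-for-{key}-{len(calls)}"

cache = TokenCache(fake_exchange)
h1 = cache.bearer("ing_live_example")
h2 = cache.bearer("ing_live_example")   # served from cache; no second exchange
print(h1 == h2, len(calls))
```

One exchange covers roughly 30 sensor readings at a 30-second interval, which is where the per-request validation savings come from.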
Both methods enforce per-tenant rate limits (RPS, bytes/day, bandwidth). A compromised device credential can be revoked with propagation within 5 minutes.
Data in transit
All ingest endpoints serve HTTPS only. Sensor data travels encrypted from the device to the edge endpoint. Once in the WAL, data is on encrypted NVMe storage. S3 uploads use your bucket's encryption configuration (SSE-S3 or SSE-KMS). EdgeMQ never stores your data in its own buckets - it assumes an IAM role you control, with least-privilege permissions scoped to your S3 prefix.
Schema transformation
Here's where IoT payloads get interesting. A typical environmental sensor produces something like this:
```json
{
  "sensor_id": "sensor-042",
  "device_type": "environmental",
  "location": {
    "lat": 37.7749,
    "lon": -122.4194,
    "building": "Building-A"
  },
  "readings": {
    "temperature": 25.3,
    "humidity": 65,
    "pressure": 1013.2
  },
  "battery_level": 87,
  "quality": "good",
  "is_online": true,
  "timestamp": "2026-03-18T12:30:00.000Z"
}
```
The nested location and readings objects need to be flattened into typed columns. With Firehose, you'd need explicit STRUCT definitions in a Glue table. With Spark, you'd write a StructType schema with nested StructFields. With EdgeMQ, you write a view definition:
```yaml
name: sensor_readings
version: 1
definition:
  columns:
    - name: sensor_id
      type: VARCHAR
      path: $.sensor_id
      required: true
    - name: device_type
      type: VARCHAR
      path: $.device_type
    - name: lat
      type: DOUBLE
      path: $.location.lat
    - name: lon
      type: DOUBLE
      path: $.location.lon
    - name: building
      type: VARCHAR
      path: $.location.building
    - name: temperature
      type: DOUBLE
      path: $.readings.temperature
    - name: humidity
      type: DOUBLE
      path: $.readings.humidity
    - name: pressure
      type: DOUBLE
      path: $.readings.pressure
    - name: battery_level
      type: INTEGER
      path: $.battery_level
    - name: quality
      type: VARCHAR
      path: $.quality
    - name: is_online
      type: BOOLEAN
      path: $.is_online
    - name: timestamp
      type: TIMESTAMP
      path: $.timestamp
      required: true
```
JSONPath expressions handle the nesting. `$.readings.temperature` extracts the temperature from inside the readings object; `$.location.lat` flattens the GPS coordinate. The output is a flat Parquet table with 12 typed columns, partitioned by date, with column statistics for predicate pushdown.
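The flattening idea can be shown in a few lines. This sketch handles only simple dotted `$.a.b` paths (EdgeMQ's JSONPath support is the real implementation; `extract` is a name invented here):

```python
import json

# Minimal sketch of dotted-path extraction like the view definition above.
# Only simple `$.a.b` paths are handled; a missing field maps to None (NULL).
def extract(payload, path):
    node = payload
    for key in path.lstrip("$.").split("."):
        if not isinstance(node, dict) or key not in node:
            return None                # missing field -> NULL column value
        node = node[key]
    return node

payload = json.loads("""
{"sensor_id": "sensor-042",
 "location": {"lat": 37.7749, "lon": -122.4194},
 "readings": {"temperature": 25.3, "humidity": 65}}
""")

row = {
    "sensor_id": extract(payload, "$.sensor_id"),
    "lat": extract(payload, "$.location.lat"),
    "temperature": extract(payload, "$.readings.temperature"),
    "building": extract(payload, "$.location.building"),  # absent -> None
}
print(row)
```

Each view column becomes one extraction; the nested structure disappears and the row is ready for a columnar writer.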
All type casts use safe coercion internally - a string in a DOUBLE column produces NULL, not a crash. Rows that fail validation (required field missing, cast failure) are captured in a separate DLQ Parquet file with the original payload and a reason column, so you can diagnose firmware issues without losing data.
Firmware schema drift
When a firmware update adds a firmware_version field or changes battery_level from an integer to a float, EdgeMQ doesn't break. Events are accepted regardless of schema - validation happens at the output layer, not at ingest. Old firmware and new firmware payloads land in the same WAL. The view definition extracts what it can:
- A field that exists in the view but not in the payload produces NULL
- A field that exists in the payload but not in the view is ignored (preserved in the WAL for future views)
- A type mismatch (string where DOUBLE is expected) produces NULL via safe casting
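The safe-cast behavior is simple to sketch — the rules below are illustrative, not EdgeMQ's exact semantics, and `safe_cast` is a name invented for this example:

```python
# Sketch of safe coercion as described above: a failed cast yields NULL
# (None) plus a DLQ-style reason string, instead of crashing the writer.
def safe_cast(value, target):
    casters = {"DOUBLE": float, "INTEGER": int, "VARCHAR": str, "BOOLEAN": bool}
    try:
        if value is None:
            return None, None
        if target in ("DOUBLE", "INTEGER") and isinstance(value, str):
            raise ValueError(f"string where {target} expected")
        return casters[target](value), None
    except (ValueError, TypeError) as reason:
        return None, str(reason)

print(safe_cast(87, "DOUBLE"))      # (87.0, None)
print(safe_cast("good", "DOUBLE"))  # (None, 'string where DOUBLE expected')
```

The second tuple element is what would land in the DLQ's reason column, next to the original payload.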
You can run multiple view definitions on the same endpoint simultaneously - one for v1 firmware payloads, another for v2 - producing separate typed Parquet outputs from the same raw stream.
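A hypothetical v2 view for the scenario above might look like this, following the same view-definition syntax (the `sensor_readings_v2` name and `firmware_version` field are assumptions for illustration):

```yaml
# Hypothetical v2 view for updated firmware: battery_level is now DOUBLE
# and a firmware_version column is added. The original sensor_readings
# view keeps running unchanged against the same raw stream.
name: sensor_readings_v2
version: 1
definition:
  columns:
    - name: sensor_id
      type: VARCHAR
      path: $.sensor_id
      required: true
    - name: battery_level
      type: DOUBLE
      path: $.battery_level
    - name: firmware_version
      type: VARCHAR
      path: $.firmware_version
```

Old-firmware payloads simply produce NULL in the `firmware_version` column, so the v2 output is usable during the rollout, not only after it completes.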
Querying the output
Once the data is in S3 as typed Parquet, any SQL engine reads it directly:
```sql
SELECT
  DATE_TRUNC('hour', timestamp) AS hour,
  building,
  AVG(temperature) AS avg_temp,
  MIN(battery_level) AS min_battery,
  COUNT(*) AS readings
FROM read_parquet('s3://bucket/views/sensor_readings/dt=*/seg=*/part-*.parquet')
WHERE dt >= '2026-03-01'
GROUP BY hour, building
ORDER BY hour DESC;
```
This works from DuckDB, Snowflake (external tables), Databricks (Auto Loader), AWS Athena, Trino, ClickHouse, or any other engine that reads Parquet from S3.
Limitations
EdgeMQ solves the ingest-to-Parquet path for IoT. It doesn't solve everything.
No MQTT support today. EdgeMQ accepts HTTP/HTTPS only. Most device runtimes support HTTP natively, but we know MQTT is common for constrained IoT devices. A lightweight bridge (Mosquitto, HiveMQ) can translate MQTT to HTTP POSTs in the meantime, and AWS IoT Core handles this at scale. If you have use cases where native MQTT support would be valuable, get in touch - we'd like to hear about them.
No device registry. EdgeMQ doesn't track individual devices, manage certificates per device, or handle device provisioning. Authentication is per-tenant (API key or JWT), not per-device. If you need per-device identity management, IoT Core or Azure IoT Hub provide this.
No real-time alerting. Data is available in S3 within minutes, not milliseconds. If you need real-time threshold alerts (temperature exceeded 80C, battery below 10%), you need a streaming layer such as Kafka with Flink.
S3-only output. Typed Parquet lands in your S3 bucket. If you need data in a time-series database like TimescaleDB or InfluxDB for real-time dashboards, you'll need a separate path from S3 or a parallel write from your devices.
No cross-segment aggregation. Each WAL segment is materialized independently. Rolling averages, sessionization, or cross-device joins happen in your query engine, not in EdgeMQ.
These are real constraints. EdgeMQ handles one part of the IoT data pipeline well: getting JSON from distributed devices into typed, query-ready Parquet in S3 without operating infrastructure between those two points.
Getting started
Schema-aware Parquet for IoT is available on the Pro plan ($49/month) with a 14-day free trial.
- Create an endpoint in the region closest to your devices
- Connect your S3 bucket
- Create a view definition matching your sensor payload (or use View Copilot to generate one from a sample)
- Point your devices at `https://your-endpoint.edge.mq/v1/ingest`
The full sensor payload format, view definition syntax, and query examples are in the Materialized Views documentation.