
Schema-aware Parquet: from JSON to typed columns, automatically

EdgeMQ Team

Most data pipelines that turn JSON into Parquet are more complex than the business logic they exist to support.

You have applications producing events - page views, sensor readings, transactions, API logs. You want those events in S3 as typed Parquet files that Snowflake, Databricks, or Athena can query directly. The gap between "I have JSON" and "I have typed columns" is where engineering time goes to die.

The problem

JSON has six types: string, number, boolean, null, array, object. Parquet has dozens: INT32, INT64, FLOAT, DOUBLE, DECIMAL, TIMESTAMP, DATE, VARCHAR, nested STRUCTs, and more. Every field in your payload requires a type decision, and every type decision is a potential failure point.

This sounds simple. It isn't.

A field that's 42 in one record and 42.5 in the next forces a widening from INT64 to DOUBLE. A timestamp arrives as an ISO string in some records and epoch milliseconds in others. A field that was always present starts arriving as null because an upstream team changed their API. A new nested object appears that nobody documented.

These aren't edge cases. They're the normal lifecycle of event data.
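To make the widening problem concrete, here is a minimal sketch of the kind of naive type inference a sampling-based pipeline performs. This is illustrative only, not EdgeMQ code: it picks the narrowest type that fits every observed value, widens INT64 to DOUBLE when it must, and gives up when types genuinely conflict.

```python
# Illustrative only: naive type inference over sampled records,
# showing how mixed values force type decisions.

def infer_type(values):
    """Pick the narrowest type that fits every observed value."""
    inferred = None
    for v in values:
        if isinstance(v, bool):
            t = "BOOLEAN"
        elif isinstance(v, int):
            t = "INT64"
        elif isinstance(v, float):
            t = "DOUBLE"
        elif v is None:
            continue  # nulls carry no type information
        else:
            t = "VARCHAR"
        if inferred is None or inferred == t:
            inferred = t
        elif {inferred, t} == {"INT64", "DOUBLE"}:
            inferred = "DOUBLE"  # widen: 42 then 42.5
        else:
            return "CONFLICT"  # e.g. epoch millis vs ISO string
    return inferred or "UNKNOWN"

print(infer_type([42, 42.5]))                               # DOUBLE
print(infer_type([1708000000000, "2026-02-15T14:30:00Z"]))  # CONFLICT
```

The CONFLICT branch is where sampling-based inference turns into silent data loss or a failed job: the right answer depends on intent, which no sampler can recover.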

How teams solve this today

The industry has converged on roughly four approaches, each with the same punchline: more infrastructure than the problem warrants.

Spark / Databricks jobs. The default at scale. Read JSON from S3, apply a schema, write Parquet. Schema inference is available but unreliable - it samples records, and if the sample misses a field or sees conflicting types, you get silent data loss or pipeline failures. Databricks Auto Loader adds schema evolution, but type conflicts still require manual intervention. Either way, you're running and maintaining a compute cluster.

AWS Glue + Firehose. Firehose offers "format conversion" that deserializes JSON against a Glue Data Catalog table and writes Parquet. The catch: you must pre-define every column and type in the Glue table. Nested JSON beyond the first level requires explicit STRUCT definitions or it's silently dropped. If your JSON keys use capital letters and your Glue table doesn't match the case, fields vanish without an error. Failed records go to an error prefix in S3 and require custom code to reprocess.

dbt + warehouse-native parsing. Land raw JSON into a VARIANT column in Snowflake or BigQuery, then use dbt models to extract and cast fields with SQL. The typing happens in the warehouse, not in the file. You're paying warehouse compute to do what could be done at ingest, and you're maintaining SQL models for every schema change.

Custom scripts. PyArrow, pandas, or DuckDB in a container. You manually define a schema, read JSON, coerce types, and write Parquet. It works until someone changes the payload structure and your script throws an exception at 3 AM.

All four approaches share the same problems: schema drift breaks pipelines, failed records require manual investigation, and you need separate infrastructure just to convert formats.

What EdgeMQ does differently

EdgeMQ handles the JSON-to-typed-Parquet conversion at ingest time, using a declarative YAML definition instead of a pipeline.

You write a view definition that maps JSON fields to typed columns:

name: purchases
version: 1

definition:
  columns:
    - name: order_id
      type: VARCHAR
      path: $.orderId
      required: true

    - name: user_id
      type: VARCHAR
      path: $.userId
      required: true

    - name: amount
      type: DOUBLE
      path: $.amount

    - name: currency
      type: VARCHAR
      path: $.currency
      default: "USD"

    - name: purchased_at
      type: TIMESTAMP
      path: $.timestamp
      required: true

  filters:
    - condition: "amount > 0"
      description: Exclude zero-value transactions

That's it. No Spark cluster, no Glue table, no dbt model, no custom script. EdgeMQ compiles this definition into a safe execution plan and runs it against every sealed WAL segment, producing typed Parquet files partitioned by date in your S3 bucket.

The dataflow looks like this:

POST JSON ──► EdgeMQ WAL ──► Segment sealed
                                    │
                             View Definition
                                    │
                          ┌─────────┴─────────┐
                          ▼                   ▼
                    Typed Parquet         DLQ Parquet
                    (good rows)       (failed rows + reason)
                          │                   │
                          ▼                   ▼
                         S3                  S3

How it works under the hood

When a WAL segment is sealed (typically ~128 MB), the ingest node's materializer:

  1. Loads the segment frames into a temporary table (two columns: ingest_ts_ms and payload)
  2. Executes the compiled view definition, extracting fields via JSONPath and casting to target types with safe coercion (invalid casts return NULL, not errors)
  3. Writes the result as Parquet to your S3 bucket with column statistics for predicate pushdown
  4. Captures failed rows in a separate DLQ Parquet file with the original payload and a __reason column
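The safe-coercion semantics in step 2 can be sketched in a few lines. This is a simplified model of TRY_CAST-style behaviour as described above, not EdgeMQ's implementation; the type names mirror the view definition's column types.

```python
# Sketch of TRY_CAST-style coercion: an invalid cast yields None (NULL)
# instead of raising and crashing the batch.
from datetime import datetime

def try_cast(value, target):
    """Return value coerced to target type, or None if the cast fails."""
    if value is None:
        return None
    try:
        if target == "DOUBLE":
            return float(value)
        if target == "INTEGER":
            return int(value)
        if target == "VARCHAR":
            return str(value)
        if target == "TIMESTAMP":
            # accept ISO-8601 strings; epoch handling would slot in here
            return datetime.fromisoformat(str(value).replace("Z", "+00:00"))
    except (TypeError, ValueError):
        return None
    return None

print(try_cast("42.5", "DOUBLE"))  # 42.5
print(try_cast("ABC", "DOUBLE"))   # None — the row survives, the cell is NULL
```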

All type casts use TRY_CAST, which returns NULL on failure instead of throwing an error. This means a string "ABC" in a DOUBLE column produces NULL - not a pipeline crash. You control what happens next via cast_failures:

  • null (default) - Cast failure produces NULL; row is kept
  • drop_row - Row is silently excluded
  • dlq - Row is sent to the Dead Letter Queue with the failure reason
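In a view definition this might look as follows. The placement and exact key names here are an assumption for illustration - check the Materialized Views documentation for the authoritative syntax.

```yaml
# Hypothetical syntax — assumes cast_failures is set per column.
columns:
  - name: amount
    type: DOUBLE
    path: $.amount
    cast_failures: dlq   # null | drop_row | dlq
```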

What the DLQ gives you

This is where EdgeMQ's approach differs most from the alternatives.

Firehose writes failed records to an error prefix with an error code and the raw payload. Reprocessing requires custom code. Spark jobs halt the batch or silently drop records depending on your error handling.

EdgeMQ writes every failed row as a Parquet file alongside the successful output. Each DLQ row contains the original JSON payload, the ingest timestamp, and a __reason column explaining exactly what went wrong: required_column_user_id_is_null, cast_failure_on_amount, filtered_by_where_clause.

You can query DLQ files directly:

SELECT __reason, COUNT(*) as count
FROM read_parquet('dlq_file.parquet')
GROUP BY __reason
ORDER BY count DESC;

This is possible because of a deliberate architectural choice: EdgeMQ validates at the output layer, not at ingest. Every event is accepted and durably stored in the WAL. During materialization, events that don't match the view definition are captured in the DLQ. Nothing is lost.

Systems that validate at the input - Snowplow rejects events that don't match a JSON Schema, Confluent Schema Registry rejects messages without valid schemas - give you synchronous feedback but lose the bad data. If your schema was wrong, the events are gone and you're depending on the producer to retry. EdgeMQ's approach means you can fix the view definition and the data is still there, waiting to be reprocessed.

The trade-off is that feedback is asynchronous. Producers don't get a 4xx when they send malformed data - they get a 202 Accepted. You monitor DLQ rates and error breakdowns in the Console's Data Quality dashboard, or via the API. For server-to-server ingestion, which is the majority of EdgeMQ's use cases, this is the right trade-off.

Pluggable output architecture

Under the hood, schema-aware Parquet is one of several output renderers in a pluggable architecture. Each sealed WAL segment is processed by whichever renderers are configured for the endpoint:

Sealed WAL segment
       │
       ├──► SegmentRenderer ──► .wal.zst (compressed archive)
       │
       ├──► ParquetRenderer ──► .parquet (opaque payload)
       │
       └──► ViewRenderer ──► .parquet (typed columns)
                                   + DLQ .parquet

All renderers produce deterministic S3 keys - the same input always generates the same output paths. Artifacts are uploaded to S3, and only after all configured renderers complete does the orchestrator write a commit manifest. The commit is the publish step; it's what tells downstream consumers that new data is safe to read.
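Determinism here means output keys are a pure function of their inputs, which makes uploads safe to retry. The sketch below illustrates the idea; the key layout, function name, and renderer names are assumptions for illustration, not EdgeMQ's actual scheme.

```python
# Illustrative: the same segment and renderer always map to the same
# S3 key, so a retried upload overwrites rather than duplicates.
def output_key(endpoint: str, segment_id: int, renderer: str, date: str) -> str:
    ext = {"segment": "wal.zst", "parquet": "parquet", "view": "parquet"}[renderer]
    return f"{endpoint}/dt={date}/{renderer}/segment-{segment_id:08d}.{ext}"

# Re-running the same segment yields the same key.
k1 = output_key("purchases", 42, "view", "2026-02-15")
k2 = output_key("purchases", 42, "view", "2026-02-15")
assert k1 == k2
print(k1)  # purchases/dt=2026-02-15/view/segment-00000042.parquet
```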

This architecture means adding new output formats - Iceberg metadata, Delta Lake, CSV - is a new renderer implementation, not a redesign. The ingest hot path (HTTP to WAL) never changes.

What this means in practice

A concrete comparison. You have an IoT sensor fleet producing events like this:

{
  "sensor_id": "SNS-4829",
  "device_type": "environmental",
  "location": { "lat": 47.6, "lon": -122.3, "building": "HQ-A" },
  "readings": { "temperature": 22.4, "humidity": 45.2, "pressure": 1013.2 },
  "battery_level": 87,
  "timestamp": "2026-02-15T14:30:00Z"
}

With Spark: Write a job that reads JSON from S3, defines a StructType schema with nested StructFields for location and readings, handles null fields, manages schema evolution when new fields appear, schedules the job, monitors it, and debugs failures. Estimated setup: days.

With Firehose: Create a Glue table with STRUCT columns for nested objects (Firehose silently drops nested fields without explicit STRUCT definitions). Define every column type. Configure format conversion. Write custom Lambda code to reprocess failed records. Estimated setup: hours, plus ongoing maintenance.

With EdgeMQ:

name: iot_sensor_readings
version: 1

definition:
  columns:
    - name: sensor_id
      type: VARCHAR
      path: $.sensor_id
      required: true

    - name: device_type
      type: VARCHAR
      path: $.device_type

    - name: lat
      type: DOUBLE
      path: $.location.lat

    - name: lon
      type: DOUBLE
      path: $.location.lon

    - name: building
      type: VARCHAR
      path: $.location.building

    - name: temperature
      type: DOUBLE
      path: $.readings.temperature

    - name: humidity
      type: DOUBLE
      path: $.readings.humidity

    - name: pressure
      type: DOUBLE
      path: $.readings.pressure

    - name: battery_level
      type: INTEGER
      path: $.battery_level

    - name: timestamp
      type: TIMESTAMP
      path: $.timestamp
      required: true

Nested structures are flattened by JSONPath. Types are declared explicitly and cast safely. Required fields enforce data quality. The output is a flat Parquet table with ten typed columns, partitioned by date, ready for SELECT * FROM read_parquet(...). Estimated setup: minutes.
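Once the files land, querying them is ordinary SQL over Parquet. The bucket and prefix below are placeholders - substitute your own output location.

```sql
-- Hypothetical path; adjust to your configured S3 output prefix.
SELECT building, AVG(temperature) AS avg_temp
FROM read_parquet('s3://your-bucket/iot_sensor_readings/*/*.parquet')
GROUP BY building
ORDER BY avg_temp DESC;
```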

Limitations

Schema-aware Parquet in EdgeMQ is not a general-purpose data transformation engine. Here's what it doesn't do:

No cross-segment aggregation. Each WAL segment is materialized independently. You can't compute rolling averages, sessionize events, or join data across segments. Use your query engine for that.

No external enrichment. View definitions work with what's in the JSON payload. You can't call a geo-IP API, look up a reference table, or enrich events with external data during materialization.

No arbitrary SQL. You define columns, types, paths, defaults, and filters. The system compiles this to SQL internally, but you don't write SQL yourself. This is by design - it constrains what the system can do, but it also means there's no risk of a user-authored query taking down the materializer.

S3-only output. Typed Parquet files land in your S3 bucket. If you query with Snowflake, Databricks, or Athena, you point them at S3 via external tables or Auto Loader. EdgeMQ doesn't load data into warehouses directly - S3 is the universal storage layer, and every analytics tool already knows how to read from it.

These are real constraints. If your use case requires joins, enrichment, or multi-destination delivery, tools like Confluent Cloud (Flink for stream processing) or RudderStack (200+ destination routing) may be a better fit. EdgeMQ is purpose-built for the specific problem of turning HTTP JSON into typed, query-ready Parquet in S3 - and doing that well.

Getting started

Schema-aware Parquet is available on the Pro plan ($49/month) with a 14-day free trial.

  1. Create a view definition in the Console (or use View Copilot to generate one from sample JSON)
  2. Attach it to your endpoint
  3. Start POSTing events
  4. Query typed Parquet directly from S3

The full view definition format, supported types, and examples are in the Materialized Views documentation.

Start your free trial or read the docs to see the full specification.