Schema-aware Parquet: from JSON to typed columns, automatically
Most data pipelines that turn JSON into Parquet are more complex than the business logic they exist to support.
You have applications producing events - page views, sensor readings, transactions, API logs. You want those events in S3 as typed Parquet files that Snowflake, Databricks, or Athena can query directly. The gap between "I have JSON" and "I have typed columns" is where engineering time goes to die.
The problem
JSON has six types: string, number, boolean, null, array, object. Parquet has dozens: INT32, INT64, FLOAT, DOUBLE, DECIMAL, TIMESTAMP, DATE, VARCHAR, nested STRUCTs, and more. Every field in your payload requires a type decision, and every type decision is a potential failure point.
This sounds simple. It isn't.
A field that's 42 in one record and 42.5 in the next forces a widening from INT64 to DOUBLE. A timestamp arrives as an ISO string in some records and epoch milliseconds in others. A field that was always present starts arriving as null because an upstream team changed their API. A new nested object appears that nobody documented.
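The int-versus-float case is easy to see with nothing but Python's standard json module - the parser faithfully preserves the distinction, so any typed column has to make a widening decision:

```python
import json

# The same field arrives as an integer in one record and a float in the next.
a = json.loads('{"amount": 42}')["amount"]
b = json.loads('{"amount": 42.5}')["amount"]

print(type(a).__name__)  # int
print(type(b).__name__)  # float

# The only type that holds both values without loss is the wider one,
# which is exactly the INT64 -> DOUBLE widening a Parquet writer must make.
widened = float if {type(a), type(b)} == {int, float} else type(a)
print(widened.__name__)  # float
```

A schema that was inferred from the first record alone would pick INT64 and then fail (or truncate) on the second.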
These aren't edge cases. They're the normal lifecycle of event data.
How teams solve this today
The industry has converged on roughly four approaches, each with the same punchline: more infrastructure than the problem warrants.
Spark / Databricks jobs. The default at scale. Read JSON from S3, apply a schema, write Parquet. Schema inference is available but unreliable - it samples records, and if the sample misses a field or sees conflicting types, you get silent data loss or pipeline failures. Databricks Auto Loader adds schema evolution, but type conflicts still require manual intervention. Either way, you're running and maintaining a compute cluster.
AWS Glue + Firehose. Firehose offers "format conversion" that deserializes JSON against a Glue Data Catalog table and writes Parquet. The catch: you must pre-define every column and type in the Glue table. Nested JSON beyond the first level requires explicit STRUCT definitions or it's silently dropped. If your JSON keys use capital letters and your Glue table doesn't match the case, fields vanish without an error. Failed records go to an error prefix in S3 and require custom code to reprocess.
dbt + warehouse-native parsing. Land raw JSON into a VARIANT column in Snowflake or BigQuery, then use dbt models to extract and cast fields with SQL. The typing happens in the warehouse, not in the file. You're paying warehouse compute to do what could be done at ingest, and you're maintaining SQL models for every schema change.
Custom scripts. PyArrow, pandas, or DuckDB in a container. You manually define a schema, read JSON, coerce types, and write Parquet. It works until someone changes the payload structure and your script throws an exception at 3 AM.
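A minimal sketch of that custom-script pattern, using only the standard library (the schema and field names here are hypothetical, not from any real pipeline):

```python
import json

# Hypothetical fixed schema: column name -> coercion function.
SCHEMA = {"order_id": str, "amount": float}

def coerce(record: dict) -> dict:
    # Works fine - until a key is renamed or removed upstream,
    # at which point record[col] raises and the whole batch dies.
    return {col: cast(record[col]) for col, cast in SCHEMA.items()}

ok = coerce(json.loads('{"order_id": "A-1", "amount": "19.99"}'))
print(ok)  # {'order_id': 'A-1', 'amount': 19.99}

try:
    coerce(json.loads('{"orderId": "A-2", "amount": 5}'))  # upstream renamed the key
except KeyError as e:
    print("pipeline failure:", e)  # this is the 3 AM exception
```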
All four approaches share the same problems: schema drift breaks pipelines, failed records require manual investigation, and you need separate infrastructure just to convert formats.
What EdgeMQ does differently
EdgeMQ handles the JSON-to-typed-Parquet conversion at ingest time, using a declarative YAML definition instead of a pipeline.
You write a view definition that maps JSON fields to typed columns:
```yaml
name: purchases
version: 1
definition:
  columns:
    - name: order_id
      type: VARCHAR
      path: $.orderId
      required: true
    - name: user_id
      type: VARCHAR
      path: $.userId
      required: true
    - name: amount
      type: DOUBLE
      path: $.amount
    - name: currency
      type: VARCHAR
      path: $.currency
      default: "USD"
    - name: purchased_at
      type: TIMESTAMP
      path: $.timestamp
      required: true
  filters:
    - condition: "amount > 0"
      description: Exclude zero-value transactions
```
That's it. No Spark cluster, no Glue table, no dbt model, no custom script. EdgeMQ compiles this definition into a safe execution plan and runs it against every sealed WAL segment, producing typed Parquet files partitioned by date in your S3 bucket.
The dataflow looks like this:
```
POST JSON ──► EdgeMQ WAL ──► Segment sealed
                                  │
                           View Definition
                                  │
                        ┌─────────┴─────────┐
                        ▼                   ▼
                 Typed Parquet         DLQ Parquet
                  (good rows)    (failed rows + reason)
                        │                   │
                        ▼                   ▼
                        S3                  S3
```
How it works under the hood
When a WAL segment is sealed (typically ~128 MB), the ingest node's materialiser:
- Loads the segment frames into a temporary table (two columns: ingest_ts_ms and payload)
- Executes the compiled view definition, extracting fields via JSONPath and casting to target types with safe coercion (invalid casts return NULL, not errors)
- Writes the result as Parquet to your S3 bucket with column statistics for predicate pushdown
- Captures failed rows in a separate DLQ Parquet file with the original payload and a __reason column
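The extract-cast-split flow can be paraphrased in plain Python (a sketch only - the actual engine compiles to SQL; the column-spec shape and reason strings here mirror the ones described in this post but the code is illustrative):

```python
import json

# Hypothetical column specs mirroring view definition entries.
COLUMNS = [
    {"name": "user_id", "path": ["userId"], "cast": str, "required": True},
    {"name": "amount", "path": ["amount"], "cast": float, "required": False},
]

def extract(payload, path):
    # Minimal JSONPath-style walk: $.a.b becomes ["a", "b"].
    for key in path:
        if not isinstance(payload, dict) or key not in payload:
            return None
        payload = payload[key]
    return payload

def materialize(raw: str):
    payload = json.loads(raw)
    row = {}
    for col in COLUMNS:
        value = extract(payload, col["path"])
        try:
            value = col["cast"](value) if value is not None else None
        except (TypeError, ValueError):
            value = None  # safe coercion: invalid casts become NULL
        if value is None and col["required"]:
            # A failed row goes to the DLQ with a reason, not into the output.
            return None, {"__reason": f"required_column_{col['name']}_is_null",
                          "payload": raw}
        row[col["name"]] = value
    return row, None

good, _ = materialize('{"userId": "u1", "amount": "12.5"}')
print(good)  # {'user_id': 'u1', 'amount': 12.5}

_, dlq = materialize('{"amount": 3}')
print(dlq["__reason"])  # required_column_user_id_is_null
```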
All type casts use TRY_CAST, which returns NULL on failure instead of throwing an error. This means a string "ABC" in a DOUBLE column produces NULL - not a pipeline crash. You control what happens next via cast_failures:
- null (default) - Cast failure produces NULL; row is kept
- drop_row - Row is silently excluded
- dlq - Row is sent to the Dead Letter Queue with the failure reason
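The three cast_failures modes reduce to a small decision table; this sketch shows the semantics (in Python rather than the SQL the system actually compiles to):

```python
def try_cast(value, target):
    # TRY_CAST semantics: return None instead of raising on a bad cast.
    try:
        return target(value)
    except (TypeError, ValueError):
        return None

def apply_policy(value, target, policy="null"):
    cast = try_cast(value, target)
    if cast is None and value is not None:  # a genuine cast failure
        if policy == "drop_row":
            return ("drop", None)   # row is silently excluded
        if policy == "dlq":
            return ("dlq", None)    # row goes to the Dead Letter Queue
    return ("keep", cast)           # default: NULL in the column, row kept

print(apply_policy("ABC", float))              # ('keep', None)
print(apply_policy("ABC", float, "drop_row"))  # ('drop', None)
print(apply_policy("3.5", float, "dlq"))       # ('keep', 3.5)
```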
What the DLQ gives you
This is where EdgeMQ's approach differs most from the alternatives.
Firehose writes failed records to an error prefix with an error code and the raw payload. Reprocessing requires custom code. Spark jobs halt the batch or silently drop records depending on your error handling.
EdgeMQ writes every failed row as a Parquet file alongside the successful output. Each DLQ row contains the original JSON payload, the ingest timestamp, and a __reason column explaining exactly what went wrong: required_column_user_id_is_null, cast_failure_on_amount, filtered_by_where_clause.
You can query DLQ files directly:
```sql
SELECT __reason, COUNT(*) AS count
FROM read_parquet('dlq_file.parquet')
GROUP BY __reason
ORDER BY count DESC;
```
This is possible because of a deliberate architectural choice: EdgeMQ validates at the output layer, not at ingest. Every event is accepted and durably stored in the WAL. During materialisation, events that don't match the view definition are captured in the DLQ. Nothing is lost.
Systems that validate at the input - Snowplow rejects events that don't match a JSON Schema, Confluent Schema Registry rejects messages without valid schemas - give you synchronous feedback but lose the bad data. If your schema was wrong, the events are gone and you're depending on the producer to retry. EdgeMQ's approach means you can fix the view definition and the data is still there, waiting to be reprocessed.
The trade-off is that feedback is asynchronous. Producers don't get a 4xx when they send malformed data - they get a 202 Accepted. You monitor DLQ rates and error breakdowns in the Console's Data Quality dashboard, or via the API. For server-to-server ingestion, which is the majority of EdgeMQ's use cases, this is the right trade-off.
Pluggable output architecture
Under the hood, schema-aware Parquet is one of several output renderers in a pluggable architecture. Each sealed WAL segment is processed by whichever renderers are configured for the endpoint:
```
Sealed WAL segment
        │
        ├──► SegmentRenderer ──► .wal.zst  (compressed archive)
        │
        ├──► ParquetRenderer ──► .parquet  (opaque payload)
        │
        └──► ViewRenderer    ──► .parquet  (typed columns)
                                 + DLQ .parquet
```
All renderers produce deterministic S3 keys - the same input always generates the same output paths. Artifacts are uploaded to S3, and only after all configured renderers complete does the orchestrator write a commit manifest. The commit is the publish step; it's what tells downstream consumers that new data is safe to read.
This architecture means adding new output formats - Iceberg metadata, Delta Lake, CSV - is a new renderer implementation, not a redesign. The ingest hot path (HTTP to WAL) never changes.
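Deterministic keys plus commit-last ordering is what makes retries safe. A sketch of the idea (the key layout here is hypothetical, not EdgeMQ's actual S3 naming scheme):

```python
import hashlib
import json

def artifact_key(endpoint: str, segment_id: str, renderer: str, ext: str) -> str:
    # The key is a pure function of its inputs, so re-running a renderer
    # overwrites the same object rather than creating duplicates.
    digest = hashlib.sha256(f"{endpoint}/{segment_id}".encode()).hexdigest()[:8]
    return f"{endpoint}/{renderer}/{segment_id}-{digest}.{ext}"

uploads = [
    artifact_key("purchases", "seg-000042", "segment", "wal.zst"),
    artifact_key("purchases", "seg-000042", "parquet", "parquet"),
    artifact_key("purchases", "seg-000042", "view", "parquet"),
]

# Same input, same output path - any renderer can be retried safely.
assert uploads[0] == artifact_key("purchases", "seg-000042", "segment", "wal.zst")

# Only after every upload succeeds is the manifest written; its presence
# is the signal to readers that the segment is published.
manifest = json.dumps({"segment": "seg-000042", "artifacts": uploads})
```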
What this means in practice
A concrete comparison. You have an IoT sensor fleet producing events like this:
```json
{
  "sensor_id": "SNS-4829",
  "device_type": "environmental",
  "location": { "lat": 47.6, "lon": -122.3, "building": "HQ-A" },
  "readings": { "temperature": 22.4, "humidity": 45.2, "pressure": 1013.2 },
  "battery_level": 87,
  "timestamp": "2026-02-15T14:30:00Z"
}
```
With Spark: Write a job that reads JSON from S3, defines a StructType schema with nested StructFields for location and readings, handles null sensors, manages schema evolution when new fields appear, schedules the job, monitors it, and debugs failures. Estimated setup: days.
With Firehose: Create a Glue table with STRUCT columns for nested objects (Firehose silently drops nested fields without explicit STRUCT definitions). Define every column type. Configure format conversion. Write custom Lambda code to reprocess failed records. Estimated setup: hours, plus ongoing maintenance.
With EdgeMQ:
```yaml
name: iot_sensor_readings
version: 1
definition:
  columns:
    - name: sensor_id
      type: VARCHAR
      path: $.sensor_id
      required: true
    - name: device_type
      type: VARCHAR
      path: $.device_type
    - name: lat
      type: DOUBLE
      path: $.location.lat
    - name: lon
      type: DOUBLE
      path: $.location.lon
    - name: building
      type: VARCHAR
      path: $.location.building
    - name: temperature
      type: DOUBLE
      path: $.readings.temperature
    - name: humidity
      type: DOUBLE
      path: $.readings.humidity
    - name: pressure
      type: DOUBLE
      path: $.readings.pressure
    - name: battery_level
      type: INTEGER
      path: $.battery_level
    - name: timestamp
      type: TIMESTAMP
      path: $.timestamp
      required: true
```
Nested structures are flattened by JSONPath. Types are declared explicitly and cast safely. Required fields enforce data quality. The output is a flat Parquet table with ten typed columns, partitioned by date, ready for SELECT * FROM read_parquet(...). Estimated setup: minutes.
Limitations
Schema-aware Parquet in EdgeMQ is not a general-purpose data transformation engine. Here's what it doesn't do:
No cross-segment aggregation. Each WAL segment is materialised independently. You can't compute rolling averages, sessionize events, or join data across segments. Use your query engine for that.
No external enrichment. View definitions work with what's in the JSON payload. You can't call a geo-IP API, look up a reference table, or enrich events with external data during materialisation.
No arbitrary SQL. You define columns, types, paths, defaults, and filters. The system compiles this to SQL internally, but you don't write SQL yourself. This is by design - it constrains what the system can do, but it also means there's no risk of a user-authored query taking down the materialiser.
S3-only output. Typed Parquet files land in your S3 bucket. If you query with Snowflake, Databricks, or Athena, you point them at S3 via external tables or Auto Loader. EdgeMQ doesn't load data into warehouses directly - S3 is the universal storage layer, and every analytics tool already knows how to read from it.
These are real constraints. If your use case requires joins, enrichment, or multi-destination delivery, tools like Confluent Cloud (Flink for stream processing) or RudderStack (200+ destination routing) may be a better fit. EdgeMQ is purpose-built for the specific problem of turning HTTP JSON into typed, query-ready Parquet in S3 - and doing that well.
Getting started
Schema-aware Parquet is available on the Pro plan ($49/month) with a 14-day free trial.
- Create a view definition in the Console (or use View Copilot to generate one from sample JSON)
- Attach it to your endpoint
- Start POSTing events
- Query typed Parquet directly from S3
The full view definition format, supported types, and examples are in the Materialized Views documentation.
Start your free trial or read the docs to see the full specification.