
Schema-aware Parquet: from JSON to typed columns, automatically

EdgeMQ Team

One of the most common patterns in data engineering is receiving JSON events over HTTP and transforming them into typed columnar files for analytics. This usually means a multi-step pipeline: a message queue, a stream processor, a schema registry, and a writer that produces Parquet.

With EdgeMQ's schema-aware Parquet output, you skip all of that. Define a view, and your data arrives in S3 as typed Parquet files.

How it works

When you enable a schema-aware view on an endpoint, EdgeMQ runs your view definition against each sealed WAL segment using DuckDB. The view definition is a YAML file that maps JSON paths to typed columns:

name: trades
source: payload
columns:
  - name: symbol
    type: VARCHAR
    path: $.symbol
  - name: price
    type: DOUBLE
    path: $.price
  - name: quantity
    type: INTEGER
    path: $.qty
  - name: timestamp
    type: TIMESTAMP
    path: $.ts
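
Conceptually, each column definition is a (JSON path, type) pair applied to every event. The following stdlib Python sketch illustrates that mapping for top-level paths; the helper names and cast table are ours for illustration, not EdgeMQ's internal API (in production the extraction and casting happen inside DuckDB):

```python
import json
from datetime import datetime

# Illustrative column list mirroring the YAML view definition above.
COLUMNS = [
    {"name": "symbol",    "type": "VARCHAR",   "path": "$.symbol"},
    {"name": "price",     "type": "DOUBLE",    "path": "$.price"},
    {"name": "quantity",  "type": "INTEGER",   "path": "$.qty"},
    {"name": "timestamp", "type": "TIMESTAMP", "path": "$.ts"},
]

# Hypothetical mapping from declared SQL types to Python casts.
CASTS = {
    "VARCHAR": str,
    "DOUBLE": float,
    "INTEGER": int,
    "TIMESTAMP": datetime.fromisoformat,
}

def project(event_json: str) -> dict:
    """Apply each (path, type) pair to one JSON event, producing a typed row."""
    event = json.loads(event_json)
    row = {}
    for col in COLUMNS:
        key = col["path"].removeprefix("$.")  # handles top-level paths only
        value = event.get(key)
        row[col["name"]] = CASTS[col["type"]](value) if value is not None else None
    return row

row = project('{"symbol": "AAPL", "price": "191.5", "qty": "10", "ts": "2024-05-01T12:00:00"}')
print(row)
```

Note that the JSON values arrive as strings but the row carries a float, an int, and a datetime; that typing is what later enables predicate pushdown in the Parquet output.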

Each sealed segment (~128 MB of raw data) is processed in about 8-10 seconds, producing a compact Parquet file with proper types, ready for predicate pushdown and efficient querying.

Performance characteristics

We've measured the following in production:

  • Processing rate: ~21,700 rows/second per view
  • Memory usage: ~300-400 MB per DuckDB process
  • Parquet compression: ~3.6 bytes/row (excellent ratio)
  • Concurrency: Max 2 concurrent DuckDB processes per node

Multiple views on the same endpoint execute serially (semaphore-controlled) to prevent memory exhaustion. Three views on a single segment take about 25-30 seconds total.
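A back-of-envelope check ties these figures together. The numbers come from the list above; the arithmetic (and the assumption that raw-vs-Parquet row sizes are directly comparable) is ours:

```python
# Rough model of per-segment view processing, using the published figures.
SEGMENT_MB = 128            # sealed WAL segment size
ROWS_PER_SEC = 21_700       # measured processing rate per view
SECONDS_PER_VIEW = 9        # midpoint of the 8-10 s range

rows_per_segment = ROWS_PER_SEC * SECONDS_PER_VIEW            # ~195k rows
raw_bytes_per_row = SEGMENT_MB * 1024 * 1024 / rows_per_segment  # ~690 bytes

# Views run serially under the semaphore, so N views cost roughly N * 9 s:
three_view_seconds = 3 * SECONDS_PER_VIEW                     # ~27 s, inside 25-30 s

print(f"{rows_per_segment:,} rows/segment, "
      f"~{raw_bytes_per_row:.0f} raw bytes/row, "
      f"3 views ~= {three_view_seconds}s")
```

Against the ~3.6 bytes/row Parquet output, ~690 raw bytes/row implies roughly two orders of magnitude of size reduction for this kind of repetitive JSON, which is consistent with typed columnar encoding plus compression.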

When to use it

Schema-aware Parquet is ideal when:

  • You're ingesting structured JSON events (analytics, IoT, trades)
  • You want to query data directly from S3 with tools like Athena, DuckDB, or Spark
  • You'd rather define a schema once than maintain an ETL pipeline
  • You need predicate pushdown for efficient date-range or column-filtered queries

If your payloads are opaque blobs or you need maximum ingest throughput without processing overhead, stick with Segments or Parquet (Raw) - both are available on all plans.

Available on Pro

Schema-aware views require the Pro plan ($49/month) because DuckDB processing is significantly more compute-intensive than raw segment shipping. The Pro plan includes 1 view per endpoint, with additional views at $0.02/GiB processed.

A 14-day Pro trial is available to test schema-aware outputs before committing.

Get started with a Pro trial or read the docs to learn more about view definitions.