Schema-aware Parquet: from JSON to typed columns, automatically
One of the most common patterns in data engineering is: receive JSON events over HTTP, then transform them into typed columnar files for analytics. This usually involves a multi-step pipeline - a message queue, a stream processor, a schema registry, and a writer that produces Parquet.
With EdgeMQ's schema-aware Parquet output, you skip all of that. Define a view, and your data arrives in S3 as typed Parquet files.
How it works
When you enable a schema-aware view on an endpoint, EdgeMQ runs your view definition against each sealed WAL segment using DuckDB. The view definition is a YAML file that maps JSON paths to typed columns:
```yaml
name: trades
source: payload
columns:
  - name: symbol
    type: VARCHAR
    path: $.symbol
  - name: price
    type: DOUBLE
    path: $.price
  - name: quantity
    type: INTEGER
    path: $.qty
  - name: timestamp
    type: TIMESTAMP
    path: $.ts
```
Each sealed segment (~128 MB of raw data) is processed in about 8-10 seconds, producing a compact Parquet file with proper types, ready for predicate pushdown and efficient querying.
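Conceptually, each view is just a typed projection over raw JSON: pick a path, cast it to the declared type. A minimal stdlib sketch of that mapping, using the columns from the trades view above (the millisecond-epoch encoding for `ts` is an assumption for illustration; EdgeMQ performs the real extraction with DuckDB, not this code):

```python
import json
from datetime import datetime, timezone

# Column names, paths, and types taken from the trades view definition.
VIEW_COLUMNS = [
    ("symbol",    "VARCHAR",   "symbol"),
    ("price",     "DOUBLE",    "price"),
    ("quantity",  "INTEGER",   "qty"),
    ("timestamp", "TIMESTAMP", "ts"),
]

# Assumption: TIMESTAMP fields arrive as epoch milliseconds.
CASTS = {
    "VARCHAR":   str,
    "DOUBLE":    float,
    "INTEGER":   int,
    "TIMESTAMP": lambda v: datetime.fromtimestamp(v / 1000, tz=timezone.utc),
}

def project_row(payload: str) -> dict:
    """Extract each JSON path and cast it to the column's declared type."""
    event = json.loads(payload)
    return {name: CASTS[typ](event[path]) for name, typ, path in VIEW_COLUMNS}

row = project_row('{"symbol": "AAPL", "price": "187.5", "qty": "100", "ts": 1700000000000}')
# row["price"] is now a float and row["quantity"] an int, even though the
# source JSON carried them as strings.
```

The point of the typed output is exactly this coercion happening once, at write time, so downstream queries never re-parse strings.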
Performance characteristics
We've measured the following in production:
- Processing rate: ~21,700 rows/second per view
- Memory usage: ~300-400 MB per DuckDB process
- Parquet compression: ~3.6 bytes/row (excellent ratio)
- Concurrency: Max 2 concurrent DuckDB processes per node
Multiple views on the same endpoint execute serially (semaphore-controlled) to prevent memory exhaustion. Three views on a single segment take about 25-30 seconds total.
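The concurrency model above can be sketched with a node-wide semaphore: at most two DuckDB slots per node, and views on one endpoint running one after another. All names here are illustrative, not EdgeMQ internals:

```python
import threading

# Node-wide cap: max 2 concurrent DuckDB processes (from the numbers above).
DUCKDB_SLOTS = threading.Semaphore(2)

PROCESSED = []  # records completed views, in execution order

def run_duckdb_view(view: str, segment_path: str) -> None:
    # Placeholder for the real work: spawn DuckDB, apply the view
    # definition to the sealed segment, upload the Parquet file to S3.
    PROCESSED.append(view)

def process_segment(endpoint_views: list, segment_path: str) -> None:
    # Views for one endpoint run serially to bound per-node memory usage;
    # the semaphore additionally blocks if both node-wide slots are taken.
    for view in endpoint_views:
        with DUCKDB_SLOTS:
            run_duckdb_view(view, segment_path)
```

Serial execution trades a little latency (25-30 s for three views instead of ~10 s each in parallel) for a predictable memory ceiling.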
When to use it
Schema-aware Parquet is ideal when:
- You're ingesting structured JSON events (analytics, IoT, trades)
- You want to query data directly from S3 with tools like Athena, DuckDB, or Spark
- You'd rather define a schema once than maintain an ETL pipeline
- You need predicate pushdown for efficient date-range or column-filtered queries
If your payloads are opaque blobs or you need maximum ingest throughput without processing overhead, stick with Segments or Parquet (Raw) - both are available on all plans.
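Once typed Parquet lands in S3, engines can push column and predicate filters down to the files. A sketch of the kind of date-range query this enables, built as a DuckDB SQL string (the bucket layout is a hypothetical example; actually running it requires DuckDB with S3 access configured):

```python
# Hypothetical S3 output path for the trades view's Parquet files.
BUCKET = "s3://my-bucket/edgemq/trades/*.parquet"

def trades_in_range(start: str, end: str) -> str:
    """Build a DuckDB query that benefits from the typed columns:
    row groups whose timestamp statistics fall outside [start, end)
    are skipped, and only symbol/price are decoded."""
    return (
        "SELECT symbol, avg(price) AS avg_price "
        f"FROM read_parquet('{BUCKET}') "
        f"WHERE timestamp >= TIMESTAMP '{start}' "
        f"AND timestamp < TIMESTAMP '{end}' "
        "GROUP BY symbol"
    )

sql = trades_in_range("2024-01-01 00:00:00", "2024-02-01 00:00:00")
```

With raw JSON, the same query would have to scan and parse every payload; with typed columns, the timestamp comparison and the column projection happen inside the Parquet reader.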
Available on Pro
Schema-aware views require the Pro plan ($49/month) because DuckDB processing is significantly more compute-intensive than raw segment shipping. The Pro plan includes 1 view per endpoint, with additional views at $0.02/GiB processed.
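For budgeting, a rough cost sketch, assuming each additional view bills on the volume it processes (the billing model beyond the $0.02/GiB rate is an assumption here):

```python
EXTRA_VIEW_RATE = 0.02  # $/GiB processed per additional view (Pro plan)

def monthly_extra_view_cost(gib_processed: float, extra_views: int) -> float:
    # Assumption: each extra view reprocesses the same segments, so cost
    # scales with both data volume and the number of extra views.
    return gib_processed * extra_views * EXTRA_VIEW_RATE

cost = monthly_extra_view_cost(1024, 2)  # 2 extra views over 1 TiB/month
```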
A 14-day Pro trial is available to test schema-aware outputs before committing.
Get started with a Pro trial or read the docs to learn more about view definitions.