Materialized Views
Materialized views enable schema-aware data transformation directly at ingest time, producing typed Parquet outputs optimized for analytics and warehouse consumption.
Materialized views are available on Pro and Enterprise plans. A 14-day free trial is available for all new accounts.
What are Materialized Views?
Materialized views transform raw JSON events into strongly-typed columnar data (Parquet or CSV) as data is ingested. Instead of writing opaque payloads that require downstream parsing, EdgeMQ can extract, cast, filter, and partition your data automatically.
Key benefits:
- Eliminate ETL jobs - Schema extraction happens at ingest time
- Type safety - Automatic casting with error handling
- Partitioning - Organize data by date, region, or other dimensions
- Query-ready - Direct consumption by Snowflake, Databricks, Trino, Athena, etc.
- Declarative - Define views as YAML, not SQL (no user SQL execution)
How It Works
```
JSON Events -> EdgeMQ Ingest -> WAL Segment -> View Materializer -> Typed Parquet -> S3
                                                     |
                                        View Definition (YAML)
                                        - Column extraction
                                        - Type casting
                                        - Filtering
                                        - Partitioning
```
Architecture
- View Definition - You define a declarative YAML mapping (columns, types, filters)
- Validation - EdgeMQ validates the definition and compiles it to safe SQL internally
- Per-segment execution - For each sealed WAL segment, the materializer:
  - Loads frames into a temporary `frames` table
  - Executes the compiled SELECT query
  - Writes typed Parquet/CSV to S3
- Commit markers - Output artifacts are listed in commit manifests for safe consumption
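The per-segment flow above can be sketched in miniature: load JSON frames, apply the compiled column mapping, and collect typed rows. All names here are illustrative, not EdgeMQ internals.

```python
import json

def materialize_segment(frames, columns):
    """Toy sketch of per-segment materialization: for each JSON frame,
    extract the configured top-level fields into a typed row.
    Hypothetical helper; real paths/types are handled by the materializer."""
    rows = []
    for frame in frames:
        event = json.loads(frame["payload"])
        # Only top-level $.field paths handled in this sketch
        row = {col["name"]: event.get(col["path"].removeprefix("$."))
               for col in columns}
        rows.append(row)
    return rows
```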
Security Model
- No user SQL - View definitions compile to safe, validated SQL internally
- Sandboxed execution - Each view runs in an isolated execution context
- Resource limits - Memory and timeout constraints prevent runaway queries
- Type allowlist - Only approved DuckDB types are permitted
Output Formats
Materialized views are powered by the Parquet (Schema-Aware) output format. To use materialized views, you must enable this format on your endpoint.
See Output Formats for comparison with Segments and Raw Parquet formats.
Quick Start
1. Enable Schema-Aware Parquet
In the EdgeMQ Console, navigate to your endpoint's Configuration tab and enable Parquet (Schema-Aware):
```
Output Formats
[x] Segments              (optional)
[x] Parquet (Raw)         (optional)
[x] Parquet (Schema-Aware) (required for views)
```
2. Create a View Definition
In the Console, create a new view using the visual editor or paste YAML directly:
```yaml
name: user_events
version: 1
format: json
definition:
  columns:
    - name: user_id
      type: VARCHAR
      path: $.userId
      required: true
      description: Unique user identifier
    - name: event_type
      type: VARCHAR
      path: $.eventType
      description: Type of event
    - name: amount
      type: DOUBLE
      path: $.amount
      description: Transaction amount
    - name: event_time
      type: TIMESTAMP
      path: $.timestamp
      required: true
      description: Event timestamp (ISO 8601)
```
3. Attach View to Endpoint
In the Console, go to Configuration -> Materialized Views and attach your view. You can specify:
- Required - Ingest fails if view extraction fails (data quality enforcement)
- Optional - Errors are logged but ingest continues (best-effort)
- Canary mode - Apply view only to segments sealed after a specific timestamp (gradual rollout)
4. Query Your Data
Once segments are sealed, query the typed Parquet directly:
```sql
-- Example query
SELECT user_id, event_type, amount, event_time
FROM read_parquet('s3://bucket/prefix/views/user_events/dt=2025-01-20/seg=*/part-*.parquet')
WHERE event_type = 'purchase'
ORDER BY event_time DESC
LIMIT 100;
```
Core Concepts
View Definition Format
Tip: Use View Copilot to generate a view definition from sample JSON instead of writing YAML by hand.
A view definition is a YAML document with these top-level fields:
| Field | Required | Description |
|---|---|---|
| `name` | Yes | Lowercase alphanumeric with underscores (max 64 chars) |
| `version` | No | Version number (auto-incremented by Console) |
| `format` | No | `yaml` or `json` (default: `yaml`) |
| `description` | No | Human-readable description (max 2000 chars) |
| `tags` | No | Array of tags for organization (max 10) |
| `definition` | Yes | Contains `columns` and optional `filters` |
Column Definition
Each column in definition.columns extracts and types a field from the JSON payload:
| Field | Required | Description |
|---|---|---|
| `name` | Yes | Output column name (valid SQL identifier) |
| `type` | Yes | DuckDB column type (see Type System below) |
| `path` | Yes | JSONPath expression to extract the value (e.g., `$.userId`) |
| `required` | No | If true, rows with NULL values for this column are rejected (default: false) |
| `default` | No | Fallback value if extraction returns NULL |
| `description` | No | Human-readable description of the column |
| `cast_failures` | No | How to handle cast failures: `null` (default), `drop_row`, or `dlq` |
The frames Table
Internally, the materializer creates a virtual frames table for each segment:
```sql
CREATE TEMP TABLE frames(
  ingest_ts_ms BIGINT,  -- Unix epoch milliseconds (ingestion time)
  payload      VARCHAR  -- JSON string (your ingested event)
);
```
Each row represents one ingested event. Your path expressions extract fields from the payload column using JSONPath syntax.
JSONPath Expressions
The path field uses JSONPath syntax to navigate your JSON payloads:
| Path | Extracts |
|---|---|
| `$.userId` | Top-level field `userId` |
| `$.data.amount` | Nested field `amount` inside `data` |
| `$.address.city` | Nested field `city` inside `address` |
| `$.items[0].name` | First element's `name` in an array |
| `$.location.lat` | Deeply nested latitude value |
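A minimal interpreter for the dotted-path flavor shown in the table can be written in a few lines. This is only an illustration of how such paths navigate parsed JSON, not EdgeMQ's actual JSONPath engine.

```python
import json
import re

def extract(payload, path):
    """Follow a simple $.a.b[0].c style path through parsed JSON.
    Returns None when any step is missing. Illustrative sketch only."""
    value = json.loads(payload)
    # Split "$.items[0].name" into ["items", "[0]", "name"]
    for part in re.findall(r"[^.\[\]$]+|\[\d+\]", path):
        if part.startswith("["):
            value = value[int(part[1:-1])]   # array index step
        else:
            value = value.get(part) if isinstance(value, dict) else None
        if value is None:
            return None
    return value
```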
Type System
All output columns must declare a supported type:
| Category | Types | Notes |
|---|---|---|
| String | VARCHAR, TEXT, STRING | Aliases for same type |
| Integer | BIGINT, INTEGER, INT, SMALLINT, TINYINT | Use BIGINT for IDs |
| Float | DOUBLE, REAL, FLOAT | Use DOUBLE for precision |
| Boolean | BOOLEAN, BOOL | Aliases for same type |
| Temporal | TIMESTAMP, TIMESTAMPTZ, DATE | For time-series data |
| JSON | JSON, JSONB | Preserve nested structures |
All type casts use TRY_CAST internally for safety - invalid casts return NULL instead of failing.
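The NULL-on-failure behavior of TRY_CAST can be emulated in Python to make the contract concrete. This is a sketch of the documented semantics, not the engine's implementation.

```python
def try_cast_double(raw):
    """Mimic TRY_CAST(... AS DOUBLE): return None instead of raising
    when the input is missing or not a valid number."""
    if raw is None:
        return None
    try:
        return float(raw)
    except (TypeError, ValueError):
        return None
```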
Required Fields and Defaults
Control data quality with required and default:
```yaml
- name: status
  type: VARCHAR
  path: $.status
  required: true
  default: "pending"
```
Semantics:
1. Extract field from JSON using the `path`
2. Cast to target type (using `TRY_CAST`)
3. Apply `default` if still NULL
4. If `required: true`, drop row if still NULL
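These four steps can be expressed as a small pipeline for a single column. The helper below is hypothetical and exists only to pin down the order of operations described above.

```python
def apply_column(extracted, cast, default=None, required=False):
    """Sketch of the column semantics: the value has already been
    extracted via `path`; cast it, apply the default, enforce `required`.
    Returns (value, keep_row)."""
    try:
        value = cast(extracted) if extracted is not None else None
    except (TypeError, ValueError):
        value = None              # TRY_CAST semantics: failure becomes NULL
    if value is None and default is not None:
        value = default           # fallback fills the NULL
    if required and value is None:
        return None, False        # required + still NULL: drop the row
    return value, True
```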
Filtering
Filter rows using definition.filters:
```yaml
definition:
  columns:
    - name: event_type
      type: VARCHAR
      path: $.eventType
    - name: amount
      type: DOUBLE
      path: $.amount
  filters:
    - condition: "event_type = 'purchase'"
      description: Only include purchase events
    - condition: "amount > 0"
      description: Exclude zero-value transactions
```
Filtering happens before required field checks. Only rows matching all filters are considered for output.
Partitioning
Partitioning is configured at the endpoint level in the Console when you attach a view. EdgeMQ supports Hive-style date partitioning, organizing output files for efficient querying.
Results in S3 layout:
```
s3://bucket/prefix/views/user_events/
  dt=2025-01-20/seg=00001/part-0.parquet
  dt=2025-01-21/seg=00002/part-0.parquet
```
Best practices:
- Use low-cardinality dimensions (date, region, status)
- Avoid high-cardinality fields (user_id, session_id, order_id)
- Limit to 2-3 partition columns max
- High cardinality creates millions of small files - slow writes + S3 overhead
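The Hive-style layout shown earlier can be produced with a simple key builder. The layout is inferred from the example in this page; exact EdgeMQ object naming may differ.

```python
def partition_path(base, view, partitions, segment, part=0):
    """Build a Hive-style S3 key: base/views/<view>/<k=v>/seg=NNNNN/part-N.parquet.
    `partitions` is an ordered dict of low-cardinality dimensions."""
    dims = "/".join(f"{k}={v}" for k, v in partitions.items())
    return f"{base}/views/{view}/{dims}/seg={segment:05d}/part-{part}.parquet"
```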
Common Patterns
Pattern 1: Basic JSON Extraction
Extract simple top-level fields from JSON:
```yaml
definition:
  columns:
    - name: user_id
      type: VARCHAR
      path: $.userId
    - name: amount
      type: DOUBLE
      path: $.amount
```
Pattern 2: Nested JSON Navigation
Flatten nested structures into top-level columns:
```yaml
definition:
  columns:
    - name: city
      type: VARCHAR
      path: $.address.city
    - name: country_code
      type: VARCHAR
      path: $.address.country.code
    - name: lat
      type: DOUBLE
      path: $.location.lat
    - name: lon
      type: DOUBLE
      path: $.location.lon
```
Pattern 3: Timestamp Handling
Extract timestamps from different source formats:
```yaml
definition:
  columns:
    # ISO 8601 string - EdgeMQ parses automatically based on the TIMESTAMP type
    - name: created_at
      type: TIMESTAMP
      path: $.createdAt
    # Unix epoch milliseconds stored as a number
    - name: updated_at
      type: BIGINT
      path: $.updatedAtMs
    # Date-only field
    - name: birth_date
      type: DATE
      path: $.birthDate
```
Pattern 4: Default Values
Provide fallback values for missing or NULL fields:
```yaml
definition:
  columns:
    - name: status
      type: VARCHAR
      path: $.status
      default: "pending"
    - name: priority
      type: VARCHAR
      path: $.priority
      default: "normal"
    - name: processed
      type: BOOLEAN
      path: $.processed
      default: "false"
```
Pattern 5: Required Fields
Enforce data quality by rejecting rows with missing required fields:
```yaml
definition:
  columns:
    - name: order_id
      type: VARCHAR
      path: $.orderId
      required: true
    - name: amount
      type: DOUBLE
      path: $.amount
      required: true
    # Optional field - can be NULL in output
    - name: coupon_code
      type: VARCHAR
      path: $.couponCode
```
Rows where order_id or amount is NULL (after casting) are sent to the Dead Letter Queue.
Pattern 6: Filtering by Event Type
Process only specific events:
```yaml
definition:
  columns:
    - name: user_id
      type: VARCHAR
      path: $.userId
      required: true
    - name: amount
      type: DOUBLE
      path: $.amount
      required: true
    - name: event_type
      type: VARCHAR
      path: $.eventType
  filters:
    - condition: "event_type = 'purchase'"
      description: Only include purchase events
    - condition: "amount > 0"
      description: Exclude zero-value transactions
```
Error Handling
Type Casting Errors
All type casts use TRY_CAST automatically. When you define a column:
```yaml
- name: amount
  type: DOUBLE
  path: $.amount
```
EdgeMQ generates safe SQL internally:
```sql
TRY_CAST(json_extract_string(payload, '$.amount') AS DOUBLE) AS amount
```
Behavior:
- Valid number: `"123.45"` -> `123.45`
- Invalid number: `"invalid"` -> `NULL`
- Missing field: `undefined` -> `NULL`
Required Field Enforcement
Use required: true to reject rows with NULL values:
```yaml
- name: user_id
  type: VARCHAR
  path: $.userId
  required: true
```
Rows with NULL user_id (after casting and defaults) are captured in a Dead Letter Queue (DLQ) file for inspection. See Rejected Rows (DLQ) below.
Cast Failure Handling
Control what happens when a type cast fails using cast_failures:
```yaml
- name: amount
  type: DOUBLE
  path: $.amount
  cast_failures: drop_row  # Options: null (default), drop_row, dlq
```
- `null` (default) - Cast failure produces NULL; row is kept
- `drop_row` - Row is silently dropped on cast failure
- `dlq` - Row is sent to the Dead Letter Queue with the failure reason
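The three policies amount to a simple dispatch when a cast fails. The sketch below pins down the documented outcomes; the function and reason string are hypothetical, not EdgeMQ code.

```python
def handle_cast_failure(row, column, policy, dlq):
    """Dispatch on the cast_failures policy for one failed column.
    Returns the (possibly modified) row, or None if the row is dropped."""
    if policy == "null":
        row[column] = None    # keep the row; the value becomes NULL
        return row
    if policy == "drop_row":
        return None           # silently discard the row
    if policy == "dlq":
        # capture payload plus a reason, mirroring the DLQ columns
        dlq.append({"payload": row, "__reason": f"cast_failure_{column}"})
        return None
    raise ValueError(f"unknown policy: {policy}")
```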
Rejected Rows (DLQ)
Tip: Use DLQ Copilot to automatically diagnose DLQ failures and propose fixes — no manual Parquet inspection needed.
When rows fail validation (required field is NULL, cast failure with dlq policy, or a filter excludes them), they are written to a separate DLQ Parquet file alongside the successful output. This enables you to investigate *why* rows were rejected rather than losing them silently.
DLQ files contain three columns:
| Column | Description |
|---|---|
| `ingest_ts_ms` | Original ingest timestamp (for correlating with source data) |
| `payload` | The full original JSON payload |
| `__reason` | Why the row was rejected (e.g., `required_column_user_id_is_null`, `filtered_by_where_clause`) |
What triggers DLQ capture:
- Required column NULL - A `required: true` column is NULL after casting and defaults
- Cast failure with DLQ policy - A column with `cast_failures: dlq` fails to cast
- Filter exclusion - Row doesn't match the filter conditions
Accessing DLQ files:
- Navigate to the view's Data Quality tab in the Console
- Review the REJECTED (DLQ) count alongside processed and accepted metrics
- Download DLQ Parquet files to inspect the rejected payloads and reasons
- Use the `__reason` column to identify which field or filter caused the rejection
Example: Inspecting a DLQ file with DuckDB
```sql
SELECT __reason, COUNT(*) AS count
FROM read_parquet('downloaded_dlq_file.parquet')
GROUP BY __reason
ORDER BY count DESC;
```
This might show:
| __reason | count |
|---|---|
| `required_column_user_id_is_null` | 523 |
| `required_column_amount_is_null` | 89 |
| `filtered_by_where_clause` | 12 |
Note: DLQ files are only generated when a view has required columns, cast failure DLQ policies, or filters. Views with no validation constraints produce no rejected rows.
View Execution Failures
Views can be configured as:
- Required - Ingest fails if view extraction fails (blocks commit)
- Optional - Errors logged but ingest continues (view output may be missing)
Configure this when attaching the view to your endpoint.
Performance Considerations
Memory Usage
The materializer loads entire segments into memory:
```
memory_needed ~ (segment_size_uncompressed x 2-3) + query_overhead
```
Example:
- 100 MB uncompressed segment
- 2x expansion during query -> 200 MB
- Intermediate operations -> +100 MB
- Total: ~300-400 MB
For large segments or complex views, ensure adequate memory allocation.
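The worked example above reduces to simple arithmetic; a back-of-envelope helper makes the formula easy to apply to your own segment sizes. The default overhead figure is the one from the example.

```python
def estimate_memory_mb(segment_mb, expansion=2.0, overhead_mb=100):
    """Back-of-envelope sizing: uncompressed segment size times a
    2-3x query expansion factor, plus fixed intermediate overhead."""
    return segment_mb * expansion + overhead_mb
```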
Query Complexity
- Keep views simple (projection + filter)
- Avoid expensive aggregations across large segments
- Use required fields for filtering instead of complex filter conditions
- Views cannot query across segments (each segment is independent)
Partitioning Strategy
Good partitions:
- Date/hour partitions: `dt=2025-01-20`
- Low-cardinality dimensions: `region=us-east-1`, `status=active`
- Typically creates 10-100 folders per dimension
Bad partitions:
- High-cardinality fields: `user_id`, `session_id`, `order_id`
- Can create millions of small files
- Slow S3 writes + list operations
- High S3 request costs
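The cardinality warning is just multiplication: the object count is roughly the product of each partition dimension's cardinality per day. A quick estimate shows why partitioning by `user_id` explodes the file count.

```python
def object_count(days, cardinalities, parts_per_partition=1):
    """Rough S3 object-count estimate for a partition scheme:
    days x product of each dimension's cardinality x parts per partition."""
    total = days * parts_per_partition
    for c in cardinalities:
        total *= c
    return total
```

For example, 30 days by 5 regions yields 150 objects, while 30 days by one million user IDs yields 30 million small files.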
Operational Model
View Lifecycle
- Draft - Create and edit view definition
- Publish - Validate and compile view
- Attach - Bind view to endpoint(s)
- Materialize - Execute per-segment as data arrives
- Query - Consume typed Parquet from S3
Versioning
Views are versioned independently:
```
view_name@v1 -> Initial version
view_name@v2 -> Updated schema
view_name@v3 -> New columns added
```
Multiple versions can coexist. Update endpoint bindings to migrate to new versions.
Canary Deployments
Test views on new data before full rollout by setting a canary start timestamp when attaching the view. The view only applies to segments sealed after this timestamp. Use this to:
- Test new view versions on recent data
- Gradually roll out schema changes
- Validate output before full migration
Monitoring and Data Quality
Tip: Drift Guard monitors field distributions over time and alerts you when event shapes change — before they cause DLQ spikes.
Track view execution metrics in the EdgeMQ Console:
- Rows processed - Total frames in segment
- Rows accepted - Rows passing all filters and required checks
- Rows rejected (DLQ) - Rows captured in DLQ files (failed required checks or filters)
- Success rate - Percentage of rows accepted vs. processed
- Execution time - Materialization duration per segment
Where to find metrics:
- Endpoint -> Data Quality tab - Per-view breakdown for a specific endpoint, with error rate badges and drill-down links
- View -> Data Quality tab - Cross-endpoint metrics for a specific view, with DLQ file download buttons
- Endpoints list - Health indicators showing data quality issues at a glance
- Views list - Error rate badges on view cards
Metrics update in real-time via live streaming and are aggregated over 24-hour rolling windows.
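The success-rate metric is the simple ratio of the counters listed above; a small helper clarifies the definition (the zero-division guard for empty segments is an assumption on our part).

```python
def success_rate(rows_processed, rows_accepted):
    """Success rate as described above: accepted / processed,
    guarding against empty segments."""
    return 0.0 if rows_processed == 0 else rows_accepted / rows_processed
```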
Limits and Quotas
View capabilities vary by plan:
| Plan | Max Views per Endpoint | Max Columns per View | Max Partition Dimensions |
|---|---|---|---|
| Pro | 3 | 50 | 3 |
| Enterprise | 10+ | 100+ | 5+ |
Other limits:
- Column name: max 255 characters
- JSONPath expression: any valid `$.path`
- Filter condition: max 1000 characters
- Memory per view: Configured at deployment
- Execution timeout: 5 minutes (configurable)
Examples
Example 1: E-commerce Purchases
Transform purchase events into an analytics-ready table:
```yaml
name: purchases
version: 1
format: json
definition:
  columns:
    - name: order_id
      type: VARCHAR
      path: $.orderId
      required: true
      description: Unique order identifier
    - name: user_id
      type: VARCHAR
      path: $.userId
      required: true
      description: Customer identifier
    - name: amount
      type: DOUBLE
      path: $.amount
      required: true
      description: Purchase amount
    - name: currency
      type: VARCHAR
      path: $.currency
      default: "USD"
      description: Currency code
    - name: product_id
      type: VARCHAR
      path: $.productId
      description: Product identifier
    - name: purchased_at
      type: TIMESTAMP
      path: $.timestamp
      required: true
      description: Purchase timestamp (ISO 8601)
  filters:
    - condition: "amount > 0"
      description: Exclude zero-value transactions
```
Query example:
```sql
SELECT
  DATE_TRUNC('day', purchased_at) AS day,
  currency,
  COUNT(*) AS orders,
  SUM(amount) AS revenue
FROM read_parquet('s3://bucket/prefix/views/purchases/dt=*/seg=*/part-*.parquet')
WHERE dt >= '2025-01-01'
GROUP BY day, currency
ORDER BY day DESC, revenue DESC;
```
Example 2: IoT Sensor Readings
Process time-series sensor data with nested structures:
```yaml
name: iot_sensor_readings
version: 1
format: json
definition:
  columns:
    - name: sensor_id
      type: VARCHAR
      path: $.sensor_id
      required: true
      description: Unique sensor identifier
    - name: device_type
      type: VARCHAR
      path: $.device_type
      description: Sensor device type
    - name: lat
      type: DOUBLE
      path: $.location.lat
      description: Latitude coordinate
    - name: lon
      type: DOUBLE
      path: $.location.lon
      description: Longitude coordinate
    - name: building
      type: VARCHAR
      path: $.location.building
      description: Building identifier
    - name: temperature
      type: DOUBLE
      path: $.readings.temperature
      description: Temperature reading
    - name: humidity
      type: DOUBLE
      path: $.readings.humidity
      description: Humidity reading
    - name: pressure
      type: DOUBLE
      path: $.readings.pressure
      description: Pressure reading in hPa
    - name: battery_level
      type: INTEGER
      path: $.battery_level
      description: Battery percentage
    - name: quality
      type: VARCHAR
      path: $.quality
      description: Signal quality (good, degraded, poor)
    - name: is_online
      type: BOOLEAN
      path: $.is_online
      description: Whether the device is currently online
    - name: timestamp
      type: TIMESTAMP
      path: $.timestamp
      required: true
      description: Event timestamp (ISO 8601)
```
Query example:
```sql
SELECT
  DATE_TRUNC('hour', timestamp) AS hour,
  sensor_id,
  AVG(temperature) AS avg_temp,
  MIN(temperature) AS min_temp,
  MAX(temperature) AS max_temp,
  COUNT(*) AS reading_count
FROM read_parquet('s3://bucket/prefix/views/iot_sensor_readings/dt=2025-01-20/seg=*/part-*.parquet')
GROUP BY hour, sensor_id
ORDER BY hour DESC, sensor_id;
```
Example 3: Application Logs
Extract structured log events:
```yaml
name: app_logs
version: 1
format: json
definition:
  columns:
    - name: level
      type: VARCHAR
      path: $.level
      required: true
      description: Log level (ERROR, WARN, INFO, DEBUG)
    - name: message
      type: VARCHAR
      path: $.message
      required: true
      description: Log message
    - name: service
      type: VARCHAR
      path: $.service
      default: "unknown"
      description: Service name
    - name: request_id
      type: VARCHAR
      path: $.requestId
      description: Request correlation ID
    - name: user_id
      type: VARCHAR
      path: $.userId
      description: Associated user ID
    - name: logged_at
      type: TIMESTAMP
      path: $.timestamp
      description: Log timestamp (ISO 8601)
  filters:
    - condition: "level IN ('ERROR', 'WARN')"
      description: Only capture errors and warnings
```
Query example:
```sql
-- Find all errors for a specific user
SELECT logged_at, level, message, service, request_id
FROM read_parquet('s3://bucket/prefix/views/app_logs/dt=*/seg=*/part-*.parquet')
WHERE user_id = 'user_12345'
ORDER BY logged_at DESC
LIMIT 100;
```
FAQ
Can I change a view's schema after deployment?
Yes, create a new version from the Console and update endpoint bindings. Old and new versions can coexist.
Do views query across segments?
No. Each segment is materialized independently. Views cannot reference data from previous segments or perform cross-segment aggregations.
What happens if JSON extraction fails?
TRY_CAST returns NULL for invalid data. Use required: true to reject rows with NULL values (captured in DLQ files for inspection), or default to substitute a fallback value.
What happens to rejected rows?
Rows that fail validation (required field is NULL or filter excludes them) are written to a Dead Letter Queue (DLQ) Parquet file. Each DLQ row includes the original JSON payload and a __reason column explaining why the row was rejected. You can download DLQ files from the Data Quality tab in the Console.
Can I use SQL directly?
No. View definitions compile to SQL internally, but users cannot submit arbitrary SQL for security reasons. Use the declarative YAML format with path expressions.
How do I test a view before production?
- Use the Preview tab in the Console to test against sample JSON payloads
- Create the view as a draft
- Attach with a canary start timestamp to apply only to new data
- Monitor the Data Quality tab for rejection rates and download DLQ files to inspect errors
- Remove canary restriction once validated
What's the difference between required views and optional views?
- Required - View execution failure blocks the commit (data quality enforcement)
- Optional - View execution failure is logged but ingest continues (best-effort)
Configure this when attaching the view to your endpoint.
How many views can I attach per endpoint?
- Pro: 3 views per endpoint
- Enterprise: 10+ views per endpoint
Can I output to formats other than Parquet?
Yes, CSV is also supported. Configure the output format in the Console when attaching the view.
How do I handle high-cardinality partitions?
Avoid partitioning by high-cardinality fields (user_id, order_id, session_id). Instead:
- Partition by date/hour + low-cardinality dimension (region, category)
- Use query predicates to filter by high-cardinality fields
- Consider composite keys in your view (not partitioning)
What query engines can read the output?
Any engine supporting Parquet over S3:
- Snowflake (external tables or COPY INTO)
- Databricks (Auto Loader or direct read)
- AWS Athena (Glue Catalog + partitions)
- Trino/Presto
- Apache Spark
- ClickHouse (S3 table function)
- BigQuery (external tables)
- Any Parquet-compatible analytics tool
Next Steps
- Start your free 14-day trial at console.edgemq.io
- Review Output Formats for format comparison
- See Quickstart Guide for end-to-end setup
- Join the EdgeMQ community for support
Questions? Contact support@edgemq.io or open an issue on GitHub.