Materialized Views
Materialized views enable schema-aware data transformation directly at ingest time, producing typed Parquet outputs optimized for analytics and warehouse consumption.
Materialized views are available on Pro and Enterprise plans. A 14-day free trial is available for all new accounts.
What are Materialized Views?
Materialized views transform raw JSON events into strongly-typed columnar data (Parquet or CSV) as data is ingested. Instead of writing opaque payloads that require downstream parsing, EdgeMQ can extract, cast, filter, and partition your data automatically.
Key benefits:
- Eliminate ETL jobs - Schema extraction happens at ingest time
- Type safety - Automatic casting with error handling
- Partitioning - Organize data by date, region, or other dimensions
- Query-ready - Direct consumption by Snowflake, Databricks, Trino, Athena, etc.
- Declarative - Define views as YAML, not SQL (no user SQL execution)
How It Works
```
JSON Events -> EdgeMQ Ingest -> WAL Segment -> View Materializer -> Typed Parquet -> S3
                                                     |
                                        View Definition (YAML)
                                        - Column extraction
                                        - Type casting
                                        - Filtering
                                        - Partitioning
```
Architecture
- View Definition - You define a declarative YAML mapping (columns, types, filters)
- Validation - EdgeMQ validates the definition and compiles it to safe SQL internally
- Per-segment execution - For each sealed WAL segment, the materializer:
  - Loads frames into a temporary `frames` table
  - Executes the compiled SELECT query
  - Writes typed Parquet/CSV to S3
- Commit markers - Output artifacts are listed in commit manifests for safe consumption
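The per-segment flow above can be sketched in miniature: load JSON frames, apply the compiled column mapping, and collect typed rows. All names here are illustrative, not EdgeMQ internals.

```python
import json

def materialize_segment(frames, columns):
    """Toy sketch of per-segment materialization: for each JSON frame,
    extract the configured top-level fields into a typed row.
    Hypothetical helper; real paths/types are handled by the materializer."""
    rows = []
    for frame in frames:
        event = json.loads(frame["payload"])
        # Only top-level $.field paths handled in this sketch
        row = {col["name"]: event.get(col["path"].removeprefix("$."))
               for col in columns}
        rows.append(row)
    return rows
```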
Security Model
- No user SQL - View definitions compile to safe, validated SQL internally
- Sandboxed execution - Each view runs in an isolated execution context
- Resource limits - Memory and timeout constraints prevent runaway queries
- Type allowlist - Only approved DuckDB types are permitted
Output Formats
Materialized views are powered by the Parquet (Schema-Aware) output format. To use materialized views, you must enable this format on your endpoint.
See Output Formats for comparison with Segments and Raw Parquet formats.
Quick Start
1. Enable Schema-Aware Parquet
In the EdgeMQ Console, navigate to your endpoint's Configuration tab and enable Parquet (Schema-Aware):
```
Output Formats
[x] Segments              (optional)
[x] Parquet (Raw)         (optional)
[x] Parquet (Schema-Aware) (required for views)
```
2. Create a View Definition
In the Console, create a new view using the visual editor or paste YAML directly:
```yaml
name: user_events
version: 1
format: json
definition:
  columns:
    - name: user_id
      type: VARCHAR
      path: $.userId
      required: true
      description: Unique user identifier
    - name: event_type
      type: VARCHAR
      path: $.eventType
      description: Type of event
    - name: amount
      type: DOUBLE
      path: $.amount
      description: Transaction amount
    - name: event_time
      type: TIMESTAMP
      path: $.timestamp
      required: true
      description: Event timestamp (ISO 8601)
```
3. Attach View to Endpoint
In the Console, go to Configuration -> Materialized Views and attach your view. You can specify:
- Required - Ingest fails if view extraction fails (data quality enforcement)
- Optional - Errors are logged but ingest continues (best-effort)
- Canary mode - Apply view only to segments sealed after a specific timestamp (gradual rollout)
4. Query Your Data
Once segments are sealed, query the typed Parquet directly:
```sql
-- Example query
SELECT user_id, event_type, amount, event_time
FROM read_parquet('s3://bucket/prefix/views/user_events/dt=2025-01-20/seg=*/part-*.parquet')
WHERE event_type = 'purchase'
ORDER BY event_time DESC
LIMIT 100;
```
Core Concepts
View Definition Format
Tip: Use View Copilot to generate a view definition from sample JSON instead of writing YAML by hand.
A view definition is a YAML document with these top-level fields:
| Field | Required | Description |
|---|---|---|
| `name` | Yes | Lowercase alphanumeric with underscores (max 64 chars) |
| `version` | No | Version number (auto-incremented by Console) |
| `format` | No | `yaml` or `json` (default: `yaml`) |
| `description` | No | Human-readable description (max 2000 chars) |
| `tags` | No | Array of tags for organization (max 10) |
| `definition` | Yes | Contains `columns` and optional `filters` |
Column Definition
Each column in definition.columns extracts and types a field from the JSON payload:
| Field | Required | Description |
|---|---|---|
| `name` | Yes | Output column name (valid SQL identifier) |
| `type` | Yes | DuckDB column type (see Type System below) |
| `path` | Yes | JSONPath expression to extract the value (e.g., `$.userId`) |
| `required` | No | If true, rows with NULL values for this column are rejected (default: false) |
| `default` | No | Fallback value if extraction returns NULL |
| `description` | No | Human-readable description of the column |
| `cast_failures` | No | How to handle cast failures: `null` (default), `drop_row`, or `dlq` |
The frames Table
Internally, the materializer creates a virtual frames table for each segment:
```sql
CREATE TEMP TABLE frames(
  ingest_ts_ms BIGINT,  -- Unix epoch milliseconds (ingestion time)
  payload      VARCHAR  -- JSON string (your ingested event)
);
```
Each row represents one ingested event. Your path expressions extract fields from the payload column using JSONPath syntax.
JSONPath Expressions
The path field uses JSONPath syntax to navigate your JSON payloads:
| Path | Extracts |
|---|---|
| `$.userId` | Top-level field `userId` |
| `$.data.amount` | Nested field `amount` inside `data` |
| `$.address.city` | Nested field `city` inside `address` |
| `$.items[0].name` | First element's `name` in an array |
| `$.location.lat` | Deeply nested latitude value |
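A minimal interpreter for the dotted-path flavor shown in the table can be written in a few lines. This is only an illustration of how such paths navigate parsed JSON, not EdgeMQ's actual JSONPath engine.

```python
import json
import re

def extract(payload, path):
    """Follow a simple $.a.b[0].c style path through parsed JSON.
    Returns None when any step is missing. Illustrative sketch only."""
    value = json.loads(payload)
    # Split "$.items[0].name" into ["items", "[0]", "name"]
    for part in re.findall(r"[^.\[\]$]+|\[\d+\]", path):
        if part.startswith("["):
            value = value[int(part[1:-1])]   # array index step
        else:
            value = value.get(part) if isinstance(value, dict) else None
        if value is None:
            return None
    return value
```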
Type System
All output columns must declare a supported type:
| Category | Types | Notes |
|---|---|---|
| String | VARCHAR, TEXT, STRING | Aliases for same type |
| Integer | BIGINT, INTEGER, INT, SMALLINT, TINYINT | Use BIGINT for IDs |
| Float | DOUBLE, REAL, FLOAT | Use DOUBLE for precision |
| Boolean | BOOLEAN, BOOL | Aliases for same type |
| Temporal | TIMESTAMP, TIMESTAMPTZ, DATE | For time-series data |
| JSON | JSON, JSONB | Preserve nested structures |
All type casts use TRY_CAST internally for safety - invalid casts return NULL instead of failing.
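The NULL-on-failure behavior of TRY_CAST can be emulated in Python to make the contract concrete. This is a sketch of the documented semantics, not the engine's implementation.

```python
def try_cast_double(raw):
    """Mimic TRY_CAST(... AS DOUBLE): return None instead of raising
    when the input is missing or not a valid number."""
    if raw is None:
        return None
    try:
        return float(raw)
    except (TypeError, ValueError):
        return None
```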
Required Fields and Defaults
Control data quality with required and default:
```yaml
- name: status
  type: VARCHAR
  path: $.status
  required: true
  default: "pending"
```
Semantics:
1. Extract field from JSON using the `path`
2. Cast to target type (using `TRY_CAST`)
3. Apply `default` if still NULL
4. If `required: true`, drop row if still NULL
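These four steps can be expressed as a small pipeline for a single column. The helper below is hypothetical and exists only to pin down the order of operations described above.

```python
def apply_column(extracted, cast, default=None, required=False):
    """Sketch of the column semantics: the value has already been
    extracted via `path`; cast it, apply the default, enforce `required`.
    Returns (value, keep_row)."""
    try:
        value = cast(extracted) if extracted is not None else None
    except (TypeError, ValueError):
        value = None              # TRY_CAST semantics: failure becomes NULL
    if value is None and default is not None:
        value = default           # fallback fills the NULL
    if required and value is None:
        return None, False        # required + still NULL: drop the row
    return value, True
```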
Filtering
Filter rows using definition.filters:
```yaml
definition:
  columns:
    - name: event_type
      type: VARCHAR
      path: $.eventType
    - name: amount
      type: DOUBLE
      path: $.amount
  filters:
    - condition: "event_type = 'purchase'"
      description: Only include purchase events
    - condition: "amount > 0"
      description: Exclude zero-value transactions
```
Filtering happens before required field checks. Only rows matching all filters are considered for output.
Partitioning
Partitioning is configured at the endpoint level in the Console when you attach a view. EdgeMQ supports Hive-style date partitioning, organizing output files for efficient querying.
Results in S3 layout:
```
s3://bucket/prefix/views/user_events/
  dt=2025-01-20/seg=00001/part-0.parquet
  dt=2025-01-21/seg=00002/part-0.parquet
```
Best practices:
- Use low-cardinality dimensions (date, region, status)
- Avoid high-cardinality fields (user_id, session_id, order_id)
- Limit to 2-3 partition columns max
- High cardinality creates millions of small files - slow writes + S3 overhead
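The Hive-style layout shown earlier can be produced with a simple key builder. The layout is inferred from the example in this page; exact EdgeMQ object naming may differ.

```python
def partition_path(base, view, partitions, segment, part=0):
    """Build a Hive-style S3 key: base/views/<view>/<k=v>/seg=NNNNN/part-N.parquet.
    `partitions` is an ordered dict of low-cardinality dimensions."""
    dims = "/".join(f"{k}={v}" for k, v in partitions.items())
    return f"{base}/views/{view}/{dims}/seg={segment:05d}/part-{part}.parquet"
```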
Common Patterns
Pattern 1: Basic JSON Extraction
Extract simple top-level fields from JSON:
```yaml
definition:
  columns:
    - name: user_id
      type: VARCHAR
      path: $.userId
    - name: amount
      type: DOUBLE
      path: $.amount
```
Pattern 2: Nested JSON Navigation
Flatten nested structures into top-level columns:
```yaml
definition:
  columns:
    - name: city
      type: VARCHAR
      path: $.address.city
    - name: country_code
      type: VARCHAR
      path: $.address.country.code
    - name: lat
      type: DOUBLE
      path: $.location.lat
    - name: lon
      type: DOUBLE
      path: $.location.lon
```
Pattern 3: Timestamp Handling
Extract timestamps from different source formats:
```yaml
definition:
  columns:
    # ISO 8601 string - EdgeMQ parses automatically based on the TIMESTAMP type
    - name: created_at
      type: TIMESTAMP
      path: $.createdAt
    # Unix epoch milliseconds stored as a number
    - name: updated_at
      type: BIGINT
      path: $.updatedAtMs
    # Date-only field
    - name: birth_date
      type: DATE
      path: $.birthDate
```
Pattern 4: Default Values
Provide fallback values for missing or NULL fields:
```yaml
definition:
  columns:
    - name: status
      type: VARCHAR
      path: $.status
      default: "pending"
    - name: priority
      type: VARCHAR
      path: $.priority
      default: "normal"
    - name: processed
      type: BOOLEAN
      path: $.processed
      default: "false"
```
Pattern 5: Required Fields
Enforce data quality by rejecting rows with missing required fields:
```yaml
definition:
  columns:
    - name: order_id
      type: VARCHAR
      path: $.orderId
      required: true
    - name: amount
      type: DOUBLE
      path: $.amount
      required: true
    # Optional field - can be NULL in output
    - name: coupon_code
      type: VARCHAR
      path: $.couponCode
```
Rows where order_id or amount is NULL (after casting) are sent to the Dead Letter Queue.
Pattern 6: Filtering by Event Type
Process only specific events:
```yaml
definition:
  columns:
    - name: user_id
      type: VARCHAR
      path: $.userId
      required: true
    - name: amount
      type: DOUBLE
      path: $.amount
      required: true
    - name: event_type
      type: VARCHAR
      path: $.eventType
  filters:
    - condition: "event_type = 'purchase'"
      description: Only include purchase events
    - condition: "amount > 0"
      description: Exclude zero-value transactions
```
Error Handling
Type Casting Errors
All type casts use TRY_CAST automatically. When you define a column:
```yaml
- name: amount
  type: DOUBLE
  path: $.amount
```
EdgeMQ generates safe SQL internally:
```sql
TRY_CAST(json_extract_string(payload, '$.amount') AS DOUBLE) AS amount
```
Behavior:
- Valid number: `"123.45"` -> `123.45`
- Invalid number: `"invalid"` -> `NULL`
- Missing field: `undefined` -> `NULL`
Required Field Enforcement
Use required: true to reject rows with NULL values:
```yaml
- name: user_id
  type: VARCHAR
  path: $.userId
  required: true
```
Rows with NULL user_id (after casting and defaults) are captured in a Dead Letter Queue (DLQ) file for inspection. See Rejected Rows (DLQ) below.
Cast Failure Handling
Control what happens when a type cast fails using cast_failures:
```yaml
- name: amount
  type: DOUBLE
  path: $.amount
  cast_failures: drop_row  # Options: null (default), drop_row, dlq
```
- `null` (default) - Cast failure produces NULL; row is kept
- `drop_row` - Row is silently dropped on cast failure
- `dlq` - Row is sent to the Dead Letter Queue with the failure reason
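The three policies amount to a simple dispatch when a cast fails. The sketch below pins down the documented outcomes; the function and reason string are hypothetical, not EdgeMQ code.

```python
def handle_cast_failure(row, column, policy, dlq):
    """Dispatch on the cast_failures policy for one failed column.
    Returns the (possibly modified) row, or None if the row is dropped."""
    if policy == "null":
        row[column] = None    # keep the row; the value becomes NULL
        return row
    if policy == "drop_row":
        return None           # silently discard the row
    if policy == "dlq":
        # capture payload plus a reason, mirroring the DLQ columns
        dlq.append({"payload": row, "__reason": f"cast_failure_{column}"})
        return None
    raise ValueError(f"unknown policy: {policy}")
```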
Rejected Rows (DLQ)
Tip: Use DLQ Copilot to automatically diagnose DLQ failures and propose fixes — no manual Parquet inspection needed.
When rows fail validation (required field is NULL, cast failure with dlq policy, or a filter excludes them), they are written to a separate DLQ Parquet file alongside the successful output. This enables you to investigate *why* rows were rejected rather than losing them silently.
DLQ files contain three columns:
| Column | Description |
|---|---|
| `ingest_ts_ms` | Original ingest timestamp (for correlating with source data) |
| `payload` | The full original JSON payload |
| `__reason` | Why the row was rejected (e.g., `required_column_user_id_is_null`, `filtered_by_where_clause`) |
What triggers DLQ capture:
- Required column NULL - A `required: true` column is NULL after casting and defaults
- Cast failure with DLQ policy - A column with `cast_failures: dlq` fails to cast
- Filter exclusion - Row doesn't match the filter conditions
Accessing DLQ files:
- Navigate to the view's Data Quality tab in the Console
- Review the REJECTED (DLQ) count alongside processed and accepted metrics
- Download DLQ Parquet files to inspect the rejected payloads and reasons
- Use the `__reason` column to identify which field or filter caused the rejection
Example: Inspecting a DLQ file with DuckDB
```sql
SELECT __reason, COUNT(*) AS count
FROM read_parquet('downloaded_dlq_file.parquet')
GROUP BY __reason
ORDER BY count DESC;
```
This might show:
| __reason | count |
|---|---|
| `required_column_user_id_is_null` | 523 |
| `required_column_amount_is_null` | 89 |
| `filtered_by_where_clause` | 12 |
Note: DLQ files are only generated when a view has required columns, cast failure DLQ policies, or filters. Views with no validation constraints produce no rejected rows.
View Execution Failures
Views can be configured as:
- Required - Ingest fails if view extraction fails (blocks commit)
- Optional - Errors logged but ingest continues (view output may be missing)
Configure this when attaching the view to your endpoint.
Performance Considerations
Memory Usage
The materializer loads entire segments into memory:
```
memory_needed ~ (segment_size_uncompressed x 2-3) + query_overhead
```
Example:
- 100 MB uncompressed segment
- 2x expansion during query -> 200 MB
- Intermediate operations -> +100 MB
- Total: ~300-400 MB
For large segments or complex views, ensure adequate memory allocation.
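The worked example above reduces to simple arithmetic; a back-of-envelope helper makes the formula easy to apply to your own segment sizes. The default overhead figure is the one from the example.

```python
def estimate_memory_mb(segment_mb, expansion=2.0, overhead_mb=100):
    """Back-of-envelope sizing: uncompressed segment size times a
    2-3x query expansion factor, plus fixed intermediate overhead."""
    return segment_mb * expansion + overhead_mb
```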
Query Complexity
- Keep views simple (projection + filter)
- Avoid expensive aggregations across large segments
- Use required fields for filtering instead of complex filter conditions
- Views cannot query across segments (each segment is independent)
Partitioning Strategy
Good partitions:
- Date/hour partitions: `dt=2025-01-20`
- Low-cardinality dimensions: `region=us-east-1`, `status=active`
- Typically creates 10-100 folders per dimension
Bad partitions:
- High-cardinality fields: `user_id`, `session_id`, `order_id`
- Can create millions of small files
- Slow S3 writes + list operations
- High S3 request costs
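The cardinality warning is just multiplication: the object count is roughly the product of each partition dimension's cardinality per day. A quick estimate shows why partitioning by `user_id` explodes the file count.

```python
def object_count(days, cardinalities, parts_per_partition=1):
    """Rough S3 object-count estimate for a partition scheme:
    days x product of each dimension's cardinality x parts per partition."""
    total = days * parts_per_partition
    for c in cardinalities:
        total *= c
    return total
```

For example, 30 days by 5 regions yields 150 objects, while 30 days by one million user IDs yields 30 million small files.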
Operational Model
View Lifecycle
- Draft - Create and edit view definition
- Publish - Validate and compile view
- Attach - Bind view to endpoint(s)
- Materialize - Execute per-segment as data arrives
- Query - Consume typed Parquet from S3
Versioning
Views are versioned independently:
```
view_name@v1 -> Initial version
view_name@v2 -> Updated schema
view_name@v3 -> New columns added
```
Multiple versions can coexist. Update endpoint bindings to migrate to new versions.
Canary Deployments
Test views on new data before full rollout by setting a canary start timestamp when attaching the view. The view only applies to segments sealed after this timestamp. Use this to:
- Test new view versions on recent data
- Gradually roll out schema changes
- Validate output before full migration
Monitoring and Data Quality
Tip: Drift Guard monitors field distributions over time and alerts you when event shapes change — before they cause DLQ spikes.
Track view execution metrics in the EdgeMQ Console:
- Rows processed - Total frames in segment
- Rows accepted - Rows passing all filters and required checks
- Rows rejected (DLQ) - Rows captured in DLQ files (failed required checks or filters)
- Success rate - Percentage of rows accepted vs. processed
- Execution time - Materialization duration per segment
Where to find metrics:
- Endpoint -> Data Quality tab - Per-view breakdown for a specific endpoint, with error rate badges and drill-down links
- View -> Data Quality tab - Cross-endpoint metrics for a specific view, with DLQ file download buttons
- Endpoints list - Health indicators showing data quality issues at a glance
- Views list - Error rate badges on view cards
Metrics update in real-time via live streaming and are aggregated over 24-hour rolling windows.
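The success-rate metric is the simple ratio of the counters listed above; a small helper clarifies the definition (the zero-division guard for empty segments is an assumption on our part).

```python
def success_rate(rows_processed, rows_accepted):
    """Success rate as described above: accepted / processed,
    guarding against empty segments."""
    return 0.0 if rows_processed == 0 else rows_accepted / rows_processed
```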
Limits and Quotas
View capabilities vary by plan:
| Plan | Max Views per Endpoint | Max Columns per View | Max Partition Dimensions |
|---|---|---|---|
| Pro | 3 | 50 | 3 |
| Enterprise | 10+ | 100+ | 5+ |
Other limits:
- Column name: max 255 characters
- JSONPath expression: any valid `$.path`
- Filter condition: max 1000 characters
- Memory per view: Configured at deployment
- Execution timeout: 5 minutes (configurable)
Examples
Example 1: E-commerce Purchases
Transform purchase events into an analytics-ready table:
```yaml
name: purchases
version: 1
format: json
definition:
  columns:
    - name: order_id
      type: VARCHAR
      path: $.orderId
      required: true
      description: Unique order identifier
    - name: user_id
      type: VARCHAR
      path: $.userId
      required: true
      description: Customer identifier
    - name: amount
      type: DOUBLE
      path: $.amount
      required: true
      description: Purchase amount
    - name: currency
      type: VARCHAR
      path: $.currency
      default: "USD"
      description: Currency code
    - name: product_id
      type: VARCHAR
      path: $.productId
      description: Product identifier
    - name: purchased_at
      type: TIMESTAMP
      path: $.timestamp
      required: true
      description: Purchase timestamp (ISO 8601)
  filters:
    - condition: "amount > 0"
      description: Exclude zero-value transactions
```
Query example:
```sql
SELECT
  DATE_TRUNC('day', purchased_at) AS day,
  currency,
  COUNT(*) AS orders,
  SUM(amount) AS revenue
FROM read_parquet('s3://bucket/prefix/views/purchases/dt=*/seg=*/part-*.parquet')
WHERE dt >= '2025-01-01'
GROUP BY day, currency
ORDER BY day DESC, revenue DESC;
```
Example 2: IoT Sensor Readings
Process time-series sensor data with nested structures:
```yaml
name: iot_sensor_readings
version: 1
format: json
definition:
  columns:
    - name: sensor_id
      type: VARCHAR
      path: $.sensor_id
      required: true
      description: Unique sensor identifier
    - name: device_type
      type: VARCHAR
      path: $.device_type
      description: Sensor device type
    - name: lat
      type: DOUBLE
      path: $.location.lat
      description: Latitude coordinate
    - name: lon
      type: DOUBLE
      path: $.location.lon
      description: Longitude coordinate
    - name: building
      type: VARCHAR
      path: $.location.building
      description: Building identifier
    - name: temperature
      type: DOUBLE
      path: $.readings.temperature
      description: Temperature reading
    - name: humidity
      type: DOUBLE
      path: $.readings.humidity
      description: Humidity reading
    - name: pressure
      type: DOUBLE
      path: $.readings.pressure
      description: Pressure reading in hPa
    - name: battery_level
      type: INTEGER
      path: $.battery_level
      description: Battery percentage
    - name: quality
      type: VARCHAR
      path: $.quality
      description: Signal quality (good, degraded, poor)
    - name: is_online
      type: BOOLEAN
      path: $.is_online
      description: Whether the device is currently online
    - name: timestamp
      type: TIMESTAMP
      path: $.timestamp
      required: true
      description: Event timestamp (ISO 8601)
```
Query example:
```sql
SELECT
  DATE_TRUNC('hour', timestamp) AS hour,
  sensor_id,
  AVG(temperature) AS avg_temp,
  MIN(temperature) AS min_temp,
  MAX(temperature) AS max_temp,
  COUNT(*) AS reading_count
FROM read_parquet('s3://bucket/prefix/views/iot_sensor_readings/dt=2025-01-20/seg=*/part-*.parquet')
GROUP BY hour, sensor_id
ORDER BY hour DESC, sensor_id;
```
Example 3: Application Logs
Extract structured log events:
```yaml
name: app_logs
version: 1
format: json
definition:
  columns:
    - name: level
      type: VARCHAR
      path: $.level
      required: true
      description: Log level (ERROR, WARN, INFO, DEBUG)
    - name: message
      type: VARCHAR
      path: $.message
      required: true
      description: Log message
    - name: service
      type: VARCHAR
      path: $.service
      default: "unknown"
      description: Service name
    - name: request_id
      type: VARCHAR
      path: $.requestId
      description: Request correlation ID
    - name: user_id
      type: VARCHAR
      path: $.userId
      description: Associated user ID
    - name: logged_at
      type: TIMESTAMP
      path: $.timestamp
      description: Log timestamp (ISO 8601)
  filters:
    - condition: "level IN ('ERROR', 'WARN')"
      description: Only capture errors and warnings
```
Query example:
```sql
-- Find all errors for a specific user
SELECT logged_at, level, message, service, request_id
FROM read_parquet('s3://bucket/prefix/views/app_logs/dt=*/seg=*/part-*.parquet')
WHERE user_id = 'user_12345'
ORDER BY logged_at DESC
LIMIT 100;
```
FAQ
Can I change a view's schema after deployment?
Yes, create a new version from the Console and update endpoint bindings. Old and new versions can coexist.
Do views query across segments?
No. Each segment is materialized independently. Views cannot reference data from previous segments or perform cross-segment aggregations.
What happens if JSON extraction fails?
TRY_CAST returns NULL for invalid data. Use required: true to reject rows with NULL values (captured in DLQ files for inspection), or default to substitute a fallback value.
What happens to rejected rows?
Rows that fail validation (required field is NULL or filter excludes them) are written to a Dead Letter Queue (DLQ) Parquet file. Each DLQ row includes the original JSON payload and a __reason column explaining why the row was rejected. You can download DLQ files from the Data Quality tab in the Console.
Can I use SQL directly?
No. View definitions compile to SQL internally, but users cannot submit arbitrary SQL for security reasons. Use the declarative YAML format with path expressions.
How do I test a view before production?
- Use the Preview tab in the Console to test against sample JSON payloads
- Create the view as a draft
- Attach with a canary start timestamp to apply only to new data
- Monitor the Data Quality tab for rejection rates and download DLQ files to inspect errors
- Remove canary restriction once validated
What's the difference between required views and optional views?
- Required - View execution failure blocks the commit (data quality enforcement)
- Optional - View execution failure is logged but ingest continues (best-effort)
Configure this when attaching the view to your endpoint.
How many views can I attach per endpoint?
- Pro: 3 views per endpoint
- Enterprise: 10+ views per endpoint
Can I output to formats other than Parquet?
Yes, CSV is also supported. Configure the output format in the Console when attaching the view.
How do I handle high-cardinality partitions?
Avoid partitioning by high-cardinality fields (user_id, order_id, session_id). Instead:
- Partition by date/hour + low-cardinality dimension (region, category)
- Use query predicates to filter by high-cardinality fields
- Consider composite keys in your view (not partitioning)
What query engines can read the output?
Any engine supporting Parquet over S3:
- Snowflake (external tables or COPY INTO)
- Databricks (Auto Loader or direct read)
- AWS Athena (Glue Catalog + partitions)
- Trino/Presto
- Apache Spark
- ClickHouse (S3 table function)
- BigQuery (external tables)
- Any Parquet-compatible analytics tool
Next Steps
- Start your free 14-day trial at console.edgemq.io
- Review Output Formats for format comparison
- See Quickstart Guide for end-to-end setup
- Join the EdgeMQ community for support
Questions? Contact support@edgemq.io or open an issue on GitHub.