Materialized Views

Materialized views enable schema-aware data transformation directly at ingest time, producing typed Parquet outputs optimized for analytics and warehouse consumption.

Availability

Materialized views are available on Pro and Enterprise plans. A 14-day free trial is available for all new accounts.

What are Materialized Views?

Materialized views transform raw JSON events into strongly-typed columnar data (Parquet or CSV) as data is ingested. Instead of writing opaque payloads that require downstream parsing, EdgeMQ can extract, cast, filter, and partition your data automatically.

Key benefits:

  • Eliminate ETL jobs - Schema extraction happens at ingest time
  • Type safety - Automatic casting with error handling
  • Partitioning - Organize data by date, region, or other dimensions
  • Query-ready - Direct consumption by Snowflake, Databricks, Trino, Athena, etc.
  • Declarative - Define views as YAML, not SQL (no user SQL execution)

How It Works

JSON Events -> EdgeMQ Ingest -> WAL Segment -> View Materializer -> Typed Parquet -> S3
                                    |
                            View Definition (YAML)
                            - Column extraction
                            - Type casting
                            - Filtering
                            - Partitioning

Architecture

  1. View Definition - You define a declarative YAML mapping (columns, types, filters)
  2. Validation - EdgeMQ validates the definition and compiles it to safe SQL internally
  3. Per-segment execution - For each sealed WAL segment, the materializer:
     - Loads frames into a temporary frames table
     - Executes the compiled SELECT query
     - Writes typed Parquet/CSV to S3
  4. Commit markers - Output artifacts are listed in commit manifests for safe consumption

Security Model

  • No user SQL - View definitions compile to safe, validated SQL internally
  • Sandboxed execution - Each view runs in an isolated execution context
  • Resource limits - Memory and timeout constraints prevent runaway queries
  • Type allowlist - Only approved DuckDB types are permitted

Output Formats

Materialized views are powered by the Parquet (Schema-Aware) output format. To use materialized views, you must enable this format on your endpoint.

See Output Formats for comparison with Segments and Raw Parquet formats.


Quick Start

1. Enable Schema-Aware Parquet

In the EdgeMQ Console, navigate to your endpoint's Configuration tab and enable Parquet (Schema-Aware):

Output Formats
[x] Segments (optional)
[x] Parquet (Raw) (optional)
[x] Parquet (Schema-Aware) (required for views)

2. Create a View Definition

In the Console, create a new view using the visual editor or paste YAML directly:

name: user_events
version: 1
format: json

definition:
  columns:
    - name: user_id
      type: VARCHAR
      path: $.userId
      required: true
      description: Unique user identifier

    - name: event_type
      type: VARCHAR
      path: $.eventType
      description: Type of event

    - name: amount
      type: DOUBLE
      path: $.amount
      description: Transaction amount

    - name: event_time
      type: TIMESTAMP
      path: $.timestamp
      required: true
      description: Event timestamp (ISO 8601)

3. Attach View to Endpoint

In the Console, go to Configuration -> Materialized Views and attach your view. You can specify:

  • Required - Ingest fails if view extraction fails (data quality enforcement)
  • Optional - Errors are logged but ingest continues (best-effort)
  • Canary mode - Apply view only to segments sealed after a specific timestamp (gradual rollout)

4. Query Your Data

Once segments are sealed, query the typed Parquet directly:

-- Example query
SELECT
  user_id,
  event_type,
  amount,
  event_time
FROM read_parquet('s3://bucket/prefix/views/user_events/dt=2025-01-20/seg=*/part-*.parquet')
WHERE event_type = 'purchase'
ORDER BY event_time DESC
LIMIT 100;

Core Concepts

View Definition Format

Tip: Use View Copilot to generate a view definition from sample JSON instead of writing YAML by hand.

A view definition is a YAML document with these top-level fields:

Field       | Required | Description
name        | Yes      | Lowercase alphanumeric with underscores (max 64 chars)
version     | No       | Version number (auto-incremented by Console)
format      | No       | yaml or json (default: yaml)
description | No       | Human-readable description (max 2000 chars)
tags        | No       | Array of tags for organization (max 10)
definition  | Yes      | Contains columns and optional filters

Column Definition

Each column in definition.columns extracts and types a field from the JSON payload:

Field         | Required | Description
name          | Yes      | Output column name (valid SQL identifier)
type          | Yes      | DuckDB column type (see Type System below)
path          | Yes      | JSONPath expression to extract the value (e.g., $.userId)
required      | No       | If true, rows with NULL values for this column are rejected (default: false)
default       | No       | Fallback value if extraction returns NULL
description   | No       | Human-readable description of the column
cast_failures | No       | How to handle cast failures: null (default), drop_row, or dlq
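
As a sketch, a single column definition that uses most of these fields might look like this (the field values are illustrative):

```yaml
- name: amount
  type: DOUBLE
  path: $.amount          # JSONPath into the ingested payload
  required: true          # reject rows where amount is still NULL after casting and defaults
  description: Transaction amount
  cast_failures: dlq      # send uncastable values to the Dead Letter Queue
```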

The frames Table

Internally, the materializer creates a virtual frames table for each segment:

CREATE TEMP TABLE frames(
    ingest_ts_ms BIGINT,    -- Unix epoch milliseconds (ingestion time)
    payload VARCHAR         -- JSON string (your ingested event)
);

Each row represents one ingested event. Your path expressions extract fields from the payload column using JSONPath syntax.
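
As an illustration, the compiled query for a two-column view might look roughly like this (a sketch of the generated SQL, not the exact internal output):

```sql
SELECT
  TRY_CAST(json_extract_string(payload, '$.userId') AS VARCHAR) AS user_id,
  TRY_CAST(json_extract_string(payload, '$.amount') AS DOUBLE)  AS amount
FROM frames;
```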

JSONPath Expressions

The path field uses JSONPath syntax to navigate your JSON payloads:

Path            | Extracts
$.userId        | Top-level field userId
$.data.amount   | Nested field amount inside data
$.address.city  | Nested field city inside address
$.items[0].name | First element's name in an array
$.location.lat  | Deeply nested latitude value
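
For example, against this illustrative payload, $.userId resolves to "u_42", $.data.amount to 19.99, $.items[0].name to "widget", and $.location.lat to 52.52:

```json
{
  "userId": "u_42",
  "data": { "amount": 19.99 },
  "address": { "city": "Berlin" },
  "items": [ { "name": "widget" } ],
  "location": { "lat": 52.52 }
}
```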

Type System

All output columns must declare a supported type:

Category | Types                                   | Notes
String   | VARCHAR, TEXT, STRING                   | Aliases for the same type
Integer  | BIGINT, INTEGER, INT, SMALLINT, TINYINT | Use BIGINT for IDs
Float    | DOUBLE, REAL, FLOAT                     | Use DOUBLE for precision
Boolean  | BOOLEAN, BOOL                           | Aliases for the same type
Temporal | TIMESTAMP, TIMESTAMPTZ, DATE            | For time-series data
JSON     | JSON, JSONB                             | Preserve nested structures

All type casts use TRY_CAST internally for safety - invalid casts return NULL instead of failing.

Required Fields and Defaults

Control data quality with required and default:

- name: status
  type: VARCHAR
  path: $.status
  required: true
  default: "pending"

Semantics:

  1. Extract field from JSON using the path
  2. Cast to target type (using TRY_CAST)
  3. Apply default if still NULL
  4. If required: true, drop row if still NULL
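
The four steps above can be sketched in plain Python. This is an illustrative model of the semantics, not EdgeMQ's implementation; the helper name and the pre-split path list are assumptions:

```python
import json

def extract_column(payload, keys, cast, default=None, required=False):
    """Illustrative model of the extract -> cast -> default -> required
    semantics. `keys` is a pre-split path, e.g. ["status"] for $.status."""
    # 1. Extract the field by walking the JSON object
    value = json.loads(payload)
    for key in keys:
        value = value.get(key) if isinstance(value, dict) else None
    # 2. Cast to the target type; like TRY_CAST, failures become NULL (None)
    if value is not None:
        try:
            value = cast(value)
        except (TypeError, ValueError):
            value = None
    # 3. Apply the default if the value is still NULL
    if value is None:
        value = default
    # 4. If required and still NULL, the row is dropped (rejected)
    if required and value is None:
        return None, "rejected"
    return value, "accepted"

# A NULL status falls back to the default, so the row is kept
print(extract_column('{"status": null}', ["status"], str,
                     default="pending", required=True))
```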

Filtering

Filter rows using definition.filters:

definition:
  columns:
    - name: event_type
      type: VARCHAR
      path: $.eventType

    - name: amount
      type: DOUBLE
      path: $.amount

  filters:
    - condition: "event_type = 'purchase'"
      description: Only include purchase events

    - condition: "amount > 0"
      description: Exclude zero-value transactions

Filtering happens before required field checks. Only rows matching all filters are considered for output.

Partitioning

Partitioning is configured at the endpoint level in the Console when you attach a view. EdgeMQ supports Hive-style date partitioning, organizing output files for efficient querying.

Results in S3 layout:

s3://bucket/prefix/views/user_events/
  dt=2025-01-20/seg=00001/part-0.parquet
  dt=2025-01-21/seg=00002/part-0.parquet

Best practices:

  • Use low-cardinality dimensions (date, region, status)
  • Avoid high-cardinality fields (user_id, session_id, order_id)
  • Limit to 2-3 partition columns max
  • High cardinality creates millions of small files - slow writes + S3 overhead
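
Engines that understand Hive-style partitioning can expose the partition directories as query columns. For example, in DuckDB (the bucket and prefix are placeholders):

```sql
SELECT dt, COUNT(*) AS events
FROM read_parquet(
  's3://bucket/prefix/views/user_events/*/*/*.parquet',
  hive_partitioning = true
)
GROUP BY dt
ORDER BY dt;
```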

Common Patterns

Pattern 1: Basic JSON Extraction

Extract simple top-level fields from JSON:

definition:
  columns:
    - name: user_id
      type: VARCHAR
      path: $.userId

    - name: amount
      type: DOUBLE
      path: $.amount

Pattern 2: Nested JSON Navigation

Flatten nested structures into top-level columns:

definition:
  columns:
    - name: city
      type: VARCHAR
      path: $.address.city

    - name: country_code
      type: VARCHAR
      path: $.address.country.code

    - name: lat
      type: DOUBLE
      path: $.location.lat

    - name: lon
      type: DOUBLE
      path: $.location.lon

Pattern 3: Timestamp Handling

Extract timestamps from different source formats:

definition:
  columns:
    # ISO 8601 string - EdgeMQ parses automatically based on the TIMESTAMP type
    - name: created_at
      type: TIMESTAMP
      path: $.createdAt

    # Unix epoch milliseconds stored as a number
    - name: updated_at
      type: BIGINT
      path: $.updatedAtMs

    # Date-only field
    - name: birth_date
      type: DATE
      path: $.birthDate
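
Since the epoch-millisecond field above is stored as BIGINT, convert it at query time. One option is DuckDB's epoch_ms function (path shown is a placeholder):

```sql
SELECT epoch_ms(updated_at) AS updated_at_ts
FROM read_parquet('s3://bucket/prefix/views/my_view/dt=*/seg=*/part-*.parquet');
```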

Pattern 4: Default Values

Provide fallback values for missing or NULL fields:

definition:
  columns:
    - name: status
      type: VARCHAR
      path: $.status
      default: "pending"

    - name: priority
      type: VARCHAR
      path: $.priority
      default: "normal"

    - name: processed
      type: BOOLEAN
      path: $.processed
      default: "false"

Pattern 5: Required Fields

Enforce data quality by rejecting rows with missing required fields:

definition:
  columns:
    - name: order_id
      type: VARCHAR
      path: $.orderId
      required: true

    - name: amount
      type: DOUBLE
      path: $.amount
      required: true

    # Optional field - can be NULL in output
    - name: coupon_code
      type: VARCHAR
      path: $.couponCode

Rows where order_id or amount is NULL (after casting) are sent to the Dead Letter Queue.

Pattern 6: Filtering by Event Type

Process only specific events:

definition:
  columns:
    - name: user_id
      type: VARCHAR
      path: $.userId
      required: true

    - name: amount
      type: DOUBLE
      path: $.amount
      required: true

    - name: event_type
      type: VARCHAR
      path: $.eventType

  filters:
    - condition: "event_type = 'purchase'"
      description: Only include purchase events

    - condition: "amount > 0"
      description: Exclude zero-value transactions

Error Handling

Type Casting Errors

All type casts use TRY_CAST automatically. When you define a column:

- name: amount
  type: DOUBLE
  path: $.amount

EdgeMQ generates safe SQL internally:

TRY_CAST(json_extract_string(payload, '$.amount') AS DOUBLE) AS amount

Behavior:

  • Valid number: "123.45" -> 123.45
  • Invalid number: "invalid" -> NULL
  • Missing field: undefined -> NULL
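
You can reproduce this behavior in a local DuckDB shell:

```sql
SELECT TRY_CAST('123.45' AS DOUBLE);  -- 123.45
SELECT TRY_CAST('invalid' AS DOUBLE); -- NULL
```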

Required Field Enforcement

Use required: true to reject rows with NULL values:

- name: user_id
  type: VARCHAR
  path: $.userId
  required: true

Rows with NULL user_id (after casting and defaults) are captured in a Dead Letter Queue (DLQ) file for inspection. See Rejected Rows (DLQ) below.

Cast Failure Handling

Control what happens when a type cast fails using cast_failures:

- name: amount
  type: DOUBLE
  path: $.amount
  cast_failures: drop_row   # Options: null (default), drop_row, dlq

  • null (default) - Cast failure produces NULL; row is kept
  • drop_row - Row is silently dropped on cast failure
  • dlq - Row is sent to the Dead Letter Queue with the failure reason

Rejected Rows (DLQ)

Tip: Use DLQ Copilot to automatically diagnose DLQ failures and propose fixes — no manual Parquet inspection needed.

When rows fail validation (required field is NULL, cast failure with dlq policy, or a filter excludes them), they are written to a separate DLQ Parquet file alongside the successful output. This enables you to investigate why rows were rejected rather than losing them silently.

DLQ files contain three columns:

Column       | Description
ingest_ts_ms | Original ingest timestamp (for correlating with source data)
payload      | The full original JSON payload
__reason     | Why the row was rejected (e.g., required_column_user_id_is_null, filtered_by_where_clause)

What triggers DLQ capture:

  • Required column NULL - A required: true column is NULL after casting and defaults
  • Cast failure with DLQ policy - A column with cast_failures: dlq fails to cast
  • Filter exclusion - Row doesn't match the filter conditions

Accessing DLQ files:

  1. Navigate to the view's Data Quality tab in the Console
  2. Review the REJECTED (DLQ) count alongside processed and accepted metrics
  3. Download DLQ Parquet files to inspect the rejected payloads and reasons
  4. Use the __reason column to identify which field or filter caused the rejection

Example: Inspecting a DLQ file with DuckDB

SELECT
  __reason,
  COUNT(*) as count
FROM read_parquet('downloaded_dlq_file.parquet')
GROUP BY __reason
ORDER BY count DESC;

This might show:

__reason                        | count
required_column_user_id_is_null | 523
required_column_amount_is_null  | 89
filtered_by_where_clause        | 12

Note: DLQ files are only generated when a view has required columns, cast failure DLQ policies, or filters. Views with no validation constraints produce no rejected rows.

View Execution Failures

Views can be configured as:

  • Required - Ingest fails if view extraction fails (blocks commit)
  • Optional - Errors logged but ingest continues (view output may be missing)

Configure this when attaching the view to your endpoint.


Performance Considerations

Memory Usage

The materializer loads entire segments into memory:

memory_needed ~ (segment_size_uncompressed x 2-3) + query_overhead

Example:

  • 100 MB uncompressed segment
  • 2x expansion during query -> 200 MB
  • Intermediate operations -> +100 MB
  • Total: ~300-400 MB

For large segments or complex views, ensure adequate memory allocation.

Query Complexity

  • Keep views simple (projection + filter)
  • Avoid expensive aggregations across large segments
  • Use required fields for filtering instead of complex filter conditions
  • Views cannot query across segments (each segment is independent)

Partitioning Strategy

Good partitions:

  • Date/hour partitions: dt=2025-01-20
  • Low-cardinality dimensions: region=us-east-1, status=active
  • Typically creates 10-100 folders per dimension

Bad partitions:

  • High-cardinality fields: user_id, session_id, order_id
  • Can create millions of small files
  • Slow S3 writes + list operations
  • High S3 request costs

Operational Model

View Lifecycle

  1. Draft - Create and edit view definition
  2. Publish - Validate and compile view
  3. Attach - Bind view to endpoint(s)
  4. Materialize - Execute per-segment as data arrives
  5. Query - Consume typed Parquet from S3

Versioning

Views are versioned independently:

view_name@v1  -> Initial version
view_name@v2  -> Updated schema
view_name@v3  -> New columns added

Multiple versions can coexist. Update endpoint bindings to migrate to new versions.

Canary Deployments

Test views on new data before full rollout by setting a canary start timestamp when attaching the view. The view only applies to segments sealed after this timestamp. Use this to:

  • Test new view versions on recent data
  • Gradually roll out schema changes
  • Validate output before full migration

Monitoring and Data Quality

Tip: Drift Guard monitors field distributions over time and alerts you when event shapes change — before they cause DLQ spikes.

Track view execution metrics in the EdgeMQ Console:

  • Rows processed - Total frames in segment
  • Rows accepted - Rows passing all filters and required checks
  • Rows rejected (DLQ) - Rows captured in DLQ files (failed required checks or filters)
  • Success rate - Percentage of rows accepted vs. processed
  • Execution time - Materialization duration per segment

Where to find metrics:

  • Endpoint -> Data Quality tab - Per-view breakdown for a specific endpoint, with error rate badges and drill-down links
  • View -> Data Quality tab - Cross-endpoint metrics for a specific view, with DLQ file download buttons
  • Endpoints list - Health indicators showing data quality issues at a glance
  • Views list - Error rate badges on view cards

Metrics update in real-time via live streaming and are aggregated over 24-hour rolling windows.


Limits and Quotas

View capabilities vary by plan:

Plan       | Max Views per Endpoint | Max Columns per View | Max Partition Dimensions
Pro        | 3                      | 50                   | 3
Enterprise | 10+                    | 100+                 | 5+

Other limits:

  • Column name: max 255 characters
  • JSONPath expression: any valid $. path
  • Filter condition: max 1000 characters
  • Memory per view: Configured at deployment
  • Execution timeout: 5 minutes (configurable)

Examples

Example 1: E-commerce Purchases

Transform purchase events into an analytics-ready table:

name: purchases
version: 1
format: json

definition:
  columns:
    - name: order_id
      type: VARCHAR
      path: $.orderId
      required: true
      description: Unique order identifier

    - name: user_id
      type: VARCHAR
      path: $.userId
      required: true
      description: Customer identifier

    - name: amount
      type: DOUBLE
      path: $.amount
      required: true
      description: Purchase amount

    - name: currency
      type: VARCHAR
      path: $.currency
      default: "USD"
      description: Currency code

    - name: product_id
      type: VARCHAR
      path: $.productId
      description: Product identifier

    - name: purchased_at
      type: TIMESTAMP
      path: $.timestamp
      required: true
      description: Purchase timestamp (ISO 8601)

  filters:
    - condition: "amount > 0"
      description: Exclude zero-value transactions

Query example:

SELECT
  DATE_TRUNC('day', purchased_at) as day,
  currency,
  COUNT(*) as orders,
  SUM(amount) as revenue
FROM read_parquet('s3://bucket/prefix/views/purchases/dt=*/seg=*/part-*.parquet')
WHERE dt >= '2025-01-01'
GROUP BY day, currency
ORDER BY day DESC, revenue DESC;

Example 2: IoT Sensor Readings

Process time-series sensor data with nested structures:

name: iot_sensor_readings
version: 1
format: json

definition:
  columns:
    - name: sensor_id
      type: VARCHAR
      path: $.sensor_id
      required: true
      description: Unique sensor identifier

    - name: device_type
      type: VARCHAR
      path: $.device_type
      description: Sensor device type

    - name: lat
      type: DOUBLE
      path: $.location.lat
      description: Latitude coordinate

    - name: lon
      type: DOUBLE
      path: $.location.lon
      description: Longitude coordinate

    - name: building
      type: VARCHAR
      path: $.location.building
      description: Building identifier

    - name: temperature
      type: DOUBLE
      path: $.readings.temperature
      description: Temperature reading

    - name: humidity
      type: DOUBLE
      path: $.readings.humidity
      description: Humidity reading

    - name: pressure
      type: DOUBLE
      path: $.readings.pressure
      description: Pressure reading in hPa

    - name: battery_level
      type: INTEGER
      path: $.battery_level
      description: Battery percentage

    - name: quality
      type: VARCHAR
      path: $.quality
      description: Signal quality (good, degraded, poor)

    - name: is_online
      type: BOOLEAN
      path: $.is_online
      description: Whether the device is currently online

    - name: timestamp
      type: TIMESTAMP
      path: $.timestamp
      required: true
      description: Event timestamp (ISO 8601)

Query example:

SELECT
  DATE_TRUNC('hour', timestamp) as hour,
  sensor_id,
  AVG(temperature) as avg_temp,
  MIN(temperature) as min_temp,
  MAX(temperature) as max_temp,
  COUNT(*) as reading_count
FROM read_parquet('s3://bucket/prefix/views/iot_sensor_readings/dt=2025-01-20/seg=*/part-*.parquet')
GROUP BY hour, sensor_id
ORDER BY hour DESC, sensor_id;

Example 3: Application Logs

Extract structured log events:

name: app_logs
version: 1
format: json

definition:
  columns:
    - name: level
      type: VARCHAR
      path: $.level
      required: true
      description: Log level (ERROR, WARN, INFO, DEBUG)

    - name: message
      type: VARCHAR
      path: $.message
      required: true
      description: Log message

    - name: service
      type: VARCHAR
      path: $.service
      default: "unknown"
      description: Service name

    - name: request_id
      type: VARCHAR
      path: $.requestId
      description: Request correlation ID

    - name: user_id
      type: VARCHAR
      path: $.userId
      description: Associated user ID

    - name: logged_at
      type: TIMESTAMP
      path: $.timestamp
      description: Log timestamp (ISO 8601)

  filters:
    - condition: "level IN ('ERROR', 'WARN')"
      description: Only capture errors and warnings

Query example:

-- Find all errors for a specific user
SELECT
  logged_at,
  level,
  message,
  service,
  request_id
FROM read_parquet('s3://bucket/prefix/views/app_logs/dt=*/seg=*/part-*.parquet')
WHERE user_id = 'user_12345'
ORDER BY logged_at DESC
LIMIT 100;

FAQ

Can I change a view's schema after deployment?

Yes, create a new version from the Console and update endpoint bindings. Old and new versions can coexist.

Do views query across segments?

No. Each segment is materialized independently. Views cannot reference data from previous segments or perform cross-segment aggregations.

What happens if JSON extraction fails?

TRY_CAST returns NULL for invalid data. Use required: true to reject rows with NULL values (captured in DLQ files for inspection), or default to substitute a fallback value.

What happens to rejected rows?

Rows that fail validation (required field is NULL or filter excludes them) are written to a Dead Letter Queue (DLQ) Parquet file. Each DLQ row includes the original JSON payload and a __reason column explaining why the row was rejected. You can download DLQ files from the Data Quality tab in the Console.

Can I use SQL directly?

No. View definitions compile to SQL internally, but users cannot submit arbitrary SQL for security reasons. Use the declarative YAML format with path expressions.

How do I test a view before production?

  1. Use the Preview tab in the Console to test against sample JSON payloads
  2. Create the view as a draft
  3. Attach with a canary start timestamp to apply only to new data
  4. Monitor the Data Quality tab for rejection rates and download DLQ files to inspect errors
  5. Remove canary restriction once validated

What's the difference between required views and optional views?

  • Required - View execution failure blocks the commit (data quality enforcement)
  • Optional - View execution failure is logged but ingest continues (best-effort)

Configure this when attaching the view to your endpoint.

How many views can I attach per endpoint?

  • Pro: 3 views per endpoint
  • Enterprise: 10+ views per endpoint

Can I output to formats other than Parquet?

Yes, CSV is also supported. Configure the output format in the Console when attaching the view.

How do I handle high-cardinality partitions?

Avoid partitioning by high-cardinality fields (user_id, order_id, session_id). Instead:

  • Partition by date/hour + low-cardinality dimension (region, category)
  • Use query predicates to filter by high-cardinality fields
  • Consider composite keys in your view (not partitioning)

What query engines can read the output?

Any engine supporting Parquet over S3:

  • Snowflake (external tables or COPY INTO)
  • Databricks (Auto Loader or direct read)
  • AWS Athena (Glue Catalog + partitions)
  • Trino/Presto
  • Apache Spark
  • ClickHouse (S3 table function)
  • BigQuery (external tables)
  • Any Parquet-compatible analytics tool
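
As a sketch, registering the purchases view output as an Athena external table might look like this (the table name, column set, and partition scheme are assumptions; adjust them to your actual layout):

```sql
CREATE EXTERNAL TABLE purchases (
  order_id     string,
  user_id      string,
  amount       double,
  currency     string,
  product_id   string,
  purchased_at timestamp
)
PARTITIONED BY (dt string, seg string)
STORED AS PARQUET
LOCATION 's3://bucket/prefix/views/purchases/';

-- Discover the dt=/seg= partition directories
MSCK REPAIR TABLE purchases;
```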

Next Steps


Questions? Contact support@edgemq.io or open an issue on GitHub.