Product
For ML Teams
Feed your models with live, reliable data
EdgeMQ is the AI data hose into your S3. Ingest events from apps, devices, and services over HTTPS and land them reliably in object storage as segments, raw Parquet, or schema-aware Parquet views, ready for Databricks, Snowflake, ClickHouse, DuckDB, and your feature pipelines.
Under the hood, EdgeMQ is the same lakehouse ingest layer your data team uses to keep the S3 Bronze layer fresh.
Stop babysitting brittle data feeds.
Start assuming S3 is always fresh.
Getting data into the lake
As an ML engineer, MLOps engineer, or AI platform owner, you're held back by one thing over and over:
Data doesn't show up in S3 reliably.
Instead, you deal with:
- Training pipelines that depend on homegrown data collectors that break quietly.
- "Quick scripts" that upload JSON to S3... until someone changes a cron, a path, or a credential.
- Constant questions like: "Is this dataset actually up to date?" "Did we drop any events during that incident?"
- Painful back-and-forth with product / data engineering teams just to get a new event stream wired up.
- Dreams of online-ish and continuous training, blocked by bad ingest.
You want to focus on models, features, and evaluation, not HTTP retries and S3 multipart uploads.
EdgeMQ: the AI data hose into your S3
EdgeMQ is a managed edge ingest layer for modern data and ML stacks. Producers send NDJSON over HTTPS to a single endpoint. EdgeMQ:
- Writes each request to a durable write-ahead log (WAL) on NVMe at the edge, and publishes commit markers so your jobs know what's safe to read.
- Handles bursts and reconnect storms with bounded queues and backpressure.
- Compresses segments and ships them into your S3 bucket under structured prefixes.
- Emits the S3 artifacts your team chooses: segments for raw replay, raw/opaque Parquet for payload-preserving reads, or schema-aware Parquet views for typed, query-ready tables.
From your point of view, S3 just keeps filling with fresh, trustworthy data you can build ML pipelines on.
ML-friendly ingest in one call
Your upstream teams can send training and feature data with a simple call:
curl -X POST "https://<region>.edge.mq/ingest" \
  -H "Authorization: Bearer $EDGEMQ_TOKEN" \
  -H "Content-Type: application/x-ndjson" \
  --data-binary @events.ndjson
EdgeMQ guarantees:
- Events hit disk (WAL) before they're acknowledged.
- If the system is overloaded, producers see 503 + Retry-After, not silent drops (see the producer sketch below).
- Segments and Parquet (raw or views) land under prefixes you control.
- Commit markers tell your ML pipelines exactly which segments are safe to read.
You don't build or own any of this ingest plumbing. You just depend on it.
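To show how little producer-side code that leaves, here's a minimal Python sketch of the retry contract, using the requests library. It assumes the /ingest endpoint and 503 + Retry-After behavior described above; the batch shape and backoff cap are illustrative choices, not EdgeMQ requirements.
import time
import requests  # pip install requests

ENDPOINT = "https://<region>.edge.mq/ingest"   # same endpoint as the curl example
TOKEN = "..."                                  # your EDGEMQ_TOKEN

def send_ndjson(lines, max_attempts=5):
    """POST one batch of NDJSON lines, honoring 503 + Retry-After."""
    body = ("\n".join(lines) + "\n").encode("utf-8")
    for attempt in range(max_attempts):
        resp = requests.post(
            ENDPOINT,
            headers={"Authorization": f"Bearer {TOKEN}",
                     "Content-Type": "application/x-ndjson"},
            data=body,
            timeout=30,
        )
        if resp.status_code == 503:
            # Overload is explicit, never a silent drop: back off and retry.
            delay = float(resp.headers.get("Retry-After", 2 ** attempt))
            time.sleep(min(delay, 60.0))
            continue
        resp.raise_for_status()
        return  # acknowledged: the batch is on the WAL
    raise RuntimeError("gave up after repeated 503s")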
Continuous datasets for training and evaluation
Your best models come from:
- Frequent retrains on fresh data.
- Clean evaluation sets that reflect real-world usage.
- Fast iteration when you want to try a new feature or label.
EdgeMQ makes it realistic to treat S3 as a continuously updated ML data lake:
Training data
Segment your EdgeMQ S3 prefixes by time, cohort, or experiment, and train directly from them with Spark, Databricks, Snowflake, or DuckDB, reading either compressed segments (NDJSON) or Parquet (raw or views), depending on how your endpoint is configured.
Evaluation slices
Pull specific date ranges or cohorts from EdgeMQ-managed prefixes to create consistent validation and test sets.
Experiment logs
Ingest model input/output events via EdgeMQ to analyze drift, failures, or regression behavior later.
Because ingest is handled centrally, you don't have to negotiate a new pipeline every time you want a new signal.
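For example, a dated evaluation slice can be one query away. Here's a sketch using DuckDB's Python API, assuming the same bucket and prefix as the DuckDB examples later on this page; the ts range is illustrative.
import duckdb  # pip install duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")  # assumes S3 credentials are already configured for DuckDB

# Illustrative date range; point this at the prefixes your endpoint writes.
eval_slice = con.execute("""
    SELECT *
    FROM read_json_auto('s3://your-bucket/edge-events/ml/prod/*.json.gz')
    WHERE ts BETWEEN TIMESTAMP '2025-05-01' AND TIMESTAMP '2025-05-08'
""").df()  # .df() needs pandas; use .fetchall() otherwise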
Feature pipelines powered by S3
Most modern feature stores and custom feature pipelines assume object storage is the raw source of truth. EdgeMQ is built to keep that source of truth healthy:
Raw → feature pipeline
- Apps / devices / services send NDJSON to EdgeMQ.
- EdgeMQ lands segments (and, when enabled, Parquet raw or views) in S3 under structured prefixes.
- Your feature jobs (Spark, Flink, dbt, custom Python) transform those segments into feature tables or online stores.
Historical replay
- Rebuild features from historical EdgeMQ segments when you change logic.
- Reproduce past model behavior by training from the exact same raw data.
Multi-consumer
The same EdgeMQ raw data can feed both:
- Offline training / evaluation
- Online feature stores / monitoring pipelines
Once the data is in S3, you're free to wire it into any feature stack you want.
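Commit markers are what make multi-consumer reads safe to automate. As a sketch, one small helper can tell both your offline trainer and your online loader which segments are complete. The _commits/ marker layout below is hypothetical; use whatever marker convention your EdgeMQ endpoint actually writes.
import boto3  # pip install boto3

s3 = boto3.client("s3")
BUCKET = "your-bucket"
MARKER_PREFIX = "edge-events/ml/prod/_commits/"  # hypothetical marker layout

def committed_segments(after_key=""):
    """List commit markers written after `after_key`; each marker names a
    segment that is fully shipped and safe for any consumer to read."""
    keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=MARKER_PREFIX,
                                   StartAfter=after_key):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys

# Offline training and online loading can share this cursor-style helper.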
Query with the tools you already use
EdgeMQ doesn't ask you to switch engines. It just keeps them fed.
Databricks / Spark
Treat EdgeMQ prefixes as your streaming input:
- EdgeMQ continuously drops compressed segments + commit markers, and optionally Parquet (raw or views), into S3.
- Databricks Autoloader or Spark jobs monitor those prefixes and load data into Delta tables.
- You train models on Delta and build feature tables on top.
EdgeMQ can emit Parquet output as raw/opaque Parquet or schema-aware views. Autoloader and Spark can treat those prefixes like any other Parquet dataset in your lake. Table-style layouts remain on the roadmap.
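A minimal Autoloader sketch in Python, assuming the NDJSON prefix used elsewhere on this page; the schema location, checkpoint path, and table name are illustrative.
# Databricks notebook/job; `spark` is the ambient SparkSession.
(spark.readStream
    .format("cloudFiles")                                 # Autoloader
    .option("cloudFiles.format", "json")                  # NDJSON segments
    .option("cloudFiles.schemaLocation",
            "s3://your-bucket/_schemas/edge_events")      # illustrative path
    .load("s3://your-bucket/edge-events/ml/prod/")
    .writeStream
    .option("checkpointLocation",
            "s3://your-bucket/_checkpoints/edge_events")  # illustrative path
    .trigger(availableNow=True)                           # incremental batch run
    .toTable("bronze.edge_events"))                       # illustrative table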
Snowflake
Use EdgeMQ as the raw staging area for training and analytics tables:
- Producers → EdgeMQ /ingest → S3.
- Snowpipe or COPY INTO pulls from those S3 prefixes into Snowflake tables:
- Use the JSON example below when you're reading compressed NDJSON segments.
- Or, when Parquet output is enabled, treat EdgeMQ's Parquet prefixes (raw or views) as external tables or copy from Parquet directly, driven by the same commit markers.
COPY INTO ml_raw.events
FROM 's3://your-bucket/edge-events/ml/prod/'
CREDENTIALS = (AWS_ROLE = 'arn:aws:iam::123:role/edge-snowflake-access')
FILE_FORMAT = (TYPE = JSON)
PATTERN = '.*\.json\.gz';
You keep all the power of Snowflake; you just stop worrying about how data got to S3.
ClickHouse / Postgres
Use EdgeMQ as a buffer in front of online and near-real-time stores:
- High-volume events (clicks, metrics, logs) flow to EdgeMQ, not directly to your database.
- EdgeMQ absorbs the spikes and writes to S3.
- A loader job ingests into ClickHouse or Postgres at the rate your cluster can safely take.
This is perfect for near-real-time feature tables in ClickHouse and monitoring/tracking tables in Postgres with controlled load.
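Here's one hedged sketch of such a loader, using boto3 and the clickhouse-connect client. The bucket, segment key, table, and column layout are placeholders, not EdgeMQ conventions.
import gzip
import json
import boto3                # pip install boto3
import clickhouse_connect   # pip install clickhouse-connect

s3 = boto3.client("s3")
ch = clickhouse_connect.get_client(host="clickhouse.internal")  # placeholder host

def load_segment(bucket, key):
    """Decompress one NDJSON segment and insert it at a pace you control."""
    raw = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    rows = [json.loads(line) for line in gzip.decompress(raw).splitlines() if line]
    ch.insert(
        "ml.events",  # placeholder table: (ts, event_type, payload)
        [[r["ts"], r["event_type"], json.dumps(r)] for r in rows],
        column_names=["ts", "event_type", "payload"],
    )

load_segment("your-bucket", "edge-events/ml/prod/segment-000123.json.gz")  # illustrative key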
DuckDB (and friends)
Give data scientists and ML researchers direct, notebook-friendly access to fresh data:
- EdgeMQ writes NDJSON segments and, when enabled, Parquet (raw or schema-aware views) to S3.
- DuckDB queries those S3 prefixes directly from a laptop or cloud notebook.
-- Raw segments (NDJSON)
SELECT *
FROM read_json_auto('s3://your-bucket/edge-events/ml/prod/*.json.gz')
WHERE event_type = 'prediction'
AND ts >= now() - INTERVAL '7 days';
-- Parquet outputs (raw or views, when enabled)
-- SELECT *
-- FROM read_parquet('s3://your-bucket/edge-events/ml/prod/parquet/tenant=.../dt=.../*.parquet')
-- WHERE event_type = 'prediction'
--   AND ts >= now() - INTERVAL '7 days';
No duplicate pipelines, no extra infrastructure, just query the lake.
Formats for ML and analytics
Today, EdgeMQ is optimized for NDJSON → compressed segments + commit markers and Parquet output in S3. The roadmap extends this into a richer, format-aware ingest layer:
Input formats
- NDJSON and JSON batches.
- Additional line-delimited formats over time.
Output formats on S3
- Compressed NDJSON segments (today).
- Parquet output as raw/opaque Parquet or schema-aware views for direct querying by engines like Databricks, Snowflake, ClickHouse, DuckDB, and others (today).
- CSV where needed (roadmap).
- Table-friendly layouts (e.g. Iceberg-style directories) that slot into Databricks, Spark, Trino/Presto, Snowflake (via external tables), DuckDB, and other engines (roadmap).
For ML teams, that means less custom ingest code, easier direct reads from S3 in the engines you already use, and a clear path from raw events to training and feature datasets, whether you prefer working from segments (NDJSON) or Parquet files.
Example ML patterns with EdgeMQ
Real-time product signals → feature store
- Product backend logs user actions and metadata as NDJSON.
- EdgeMQ ingests those events from multiple regions into S3.
- A feature pipeline in Spark/Databricks/Snowflake converts segments into offline feature tables for training and online features for a feature store or a low-latency DB.
Result: your models see fresh behavioral signals without anyone building a bespoke ingest stack.
Telemetry & sensor data → anomaly detection
- IoT devices POST telemetry to regional EdgeMQ endpoints.
- EdgeMQ handles flaky networks, reconnect storms, and spikes.
- Telemetry accumulates in S3 as a clean, continuous stream of records.
- You train and deploy anomaly detection models (Spark, Snowflake, ClickHouse, or notebooks) using this history.
Result: your detection models ride on top of a robust ingest backbone, not fragile scripts.
Model input/output logging → observability and evaluation
- When your model serves a prediction, your service logs input features, model version, output, and extra context.
- Those logs go to EdgeMQ, not local disk.
- EdgeMQ lands them in S3 under an ml-logs/ prefix.
- You analyze drift, calibration, failures, counterfactuals, and "what if?" scenarios.
Because everything is centralized in S3, you can slice, audit, and replay model behavior over time.
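The event shape is yours to define. A minimal sketch with entirely illustrative field names, which pairs with the producer loop shown earlier:
import json
import time

def prediction_event(features, output, model_version):
    """One NDJSON line per served prediction; field names are illustrative."""
    return json.dumps({
        "event_type": "prediction",
        "model_version": model_version,
        "features": features,
        "output": output,
        "ts": time.time(),
    })

# Batch these lines and POST them with send_ndjson() from the earlier sketch.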
You don't need to own ingest infrastructure
Most ML teams don't want to:
- Run Kafka or Kinesis just for ingest.
- Maintain critical HTTP services that absorb events.
- Debug partial S3 uploads and edge-case retries.
- Explain to security why there are random access keys in source trees.
EdgeMQ takes this off your plate:
Managed edge infrastructure
Per-tenant microVMs, WAL on NVMe, S3 shippers, and health checks are operated for you.
Predictable overload behavior
If things get hot, producers see 503 + Retry-After. You don't get silent gaps in datasets.
Security that fits your platform
S3 writes via short-lived IAM roles and scoped prefixes; data teams and platform teams can govern it using the tools they already know.
You get a dependable data hose; platform/infrastructure stays in control; ML teams move faster.
Collaborate cleanly with data and platform teams
EdgeMQ is a shared primitive you can rally around. It's the common lakehouse ingest layer that data engineers, ML teams, and platform engineers all depend on, with S3 as the shared source of truth.
Platform / infra
- Set up S3 buckets, prefixes, and IAM roles.
- Provision EdgeMQ endpoints as a "paved road" for ingest.
Data engineers
- Define schemas, prefixes, and downstream load jobs.
- Use EdgeMQ as the standard way data enters the lake.
ML teams
- Consume from the same S3 lake for training, evaluation, and features.
- Ask for "one more prefix + schema" instead of "a new ingest system."
Everyone aligns on a single, well-understood ingest layer.
Make S3 the live heart of your ML platform
Your models are only as good as the data they see, and how reliably they see it. EdgeMQ makes getting data into S3 something you can take for granted:
Reliable ingest from anywhere
Apps, devices, services, and partners.
ML-ready lake in S3
Constantly updated with segments and Parquet (raw or views), structured and easy to query.
No custom ingest infra
For you to own or debug.
Related pages
- For Data Engineers - how your S3 Bronze layer is built and maintained on top of EdgeMQ.
- For Platform / Infra - how EdgeMQ is operated as a standardized ingest primitive.
Ready to feed your models with live data instead of brittle pipelines?
Stop babysitting brittle data feeds. Start assuming S3 is always fresh.