For ML Teams
Feed your models with live, reliable data
EdgeMQ is the AI data hose into your S3. Ingest events from apps, devices, and services over HTTPS and land them reliably in object storage—ready for Databricks, Snowflake, ClickHouse, DuckDB, and your feature pipelines.
Under the hood, EdgeMQ is the same lakehouse ingest layer your data team uses to keep the S3 Bronze layer fresh.
Stop babysitting brittle data feeds.
Start assuming S3 is always fresh.
Getting data into the lake
As an ML engineer, MLOps engineer, or AI platform owner, you're held back by one thing over and over:
Data doesn't show up in S3 reliably.
Instead, you deal with:
- Training pipelines that depend on homegrown data collectors that break quietly.
- "Quick scripts" that upload JSON to S3... until someone changes a cron, a path, or a credential.
- Constant questions like: "Is this dataset actually up to date?" "Did we drop any events during that incident?"
- Painful back-and-forth with product / data engineering teams just to get a new event stream wired up.
- Plans for continuous, near-online training that stall on unreliable ingest.
You want to focus on models, features, and evaluation—not HTTP retries and S3 multipart uploads.
EdgeMQ: the AI data hose into your S3
EdgeMQ is a managed ingest layer for modern data and ML stacks. Producers send NDJSON over HTTPS to a single endpoint. EdgeMQ:
- Writes each request to a durable write-ahead log (WAL) on NVMe at the edge.
- Handles bursts and reconnect storms with bounded queues and backpressure.
- Compresses and ships segments into your S3 bucket.
- Writes a commit marker only when the segment is safely stored (see the sketch below).
From your point of view, S3 just keeps filling with fresh, trustworthy data you can build ML pipelines on.
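To make that concrete, here's a minimal sketch of how a downstream job might select only committed segments. It assumes a hypothetical marker convention (a .commit object next to each segment) and placeholder bucket and prefix names; check your EdgeMQ prefix layout for the actual scheme.

import boto3

s3 = boto3.client("s3")
BUCKET = "your-bucket"              # placeholder bucket
PREFIX = "edge-events/ml/prod/"     # placeholder EdgeMQ-managed prefix

def committed_segments():
    # List everything once, then keep only segments whose commit marker exists.
    keys = set()
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            keys.add(obj["Key"])
    for key in sorted(keys):
        # hypothetical convention: "<segment>.commit" marks a fully shipped segment
        if key.endswith(".json.gz") and key + ".commit" in keys:
            yield key

for key in committed_segments():
    print("safe to read:", key)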
ML-friendly ingest in one call
Your upstream teams can send training and feature data with a simple call:
curl -X POST "https://<region>.edge.mq/ingest" \
-H "Authorization: Bearer $EDGEMQ_TOKEN" \
-H "Content-Type: application/x-ndjson" \
--data-binary @events.ndjson
EdgeMQ guarantees:
- Events hit disk (WAL) before acknowledging.
- If the system is overloaded, producers see 503 + Retry-After, not silent drops (see the producer sketch below).
- Compressed segments land under a prefix you control.
- Commit markers tell your ML pipelines exactly which segments are safe to read.
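That 503 + Retry-After contract implies a simple producer loop. A minimal sketch in Python with the requests library, using the placeholder endpoint from the curl example above:

import os
import time
import requests

ENDPOINT = "https://<region>.edge.mq/ingest"   # placeholder region, as in the curl example
TOKEN = os.environ["EDGEMQ_TOKEN"]

def send_ndjson(batch: bytes, max_attempts: int = 5) -> None:
    # POST an NDJSON batch, honoring 503 + Retry-After backpressure.
    for attempt in range(max_attempts):
        resp = requests.post(
            ENDPOINT,
            headers={
                "Authorization": f"Bearer {TOKEN}",
                "Content-Type": "application/x-ndjson",
            },
            data=batch,
        )
        if resp.status_code == 503:
            # EdgeMQ is shedding load: wait as instructed, then retry.
            time.sleep(int(resp.headers.get("Retry-After", "1")))
            continue
        resp.raise_for_status()   # surface auth/validation errors
        return
    raise RuntimeError("giving up after repeated 503s")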
You don't build or own any of this ingest plumbing. You just depend on it.
Continuous datasets for training and evaluation
Your best models come from:
- Frequent retrains on fresh data.
- Clean evaluation sets that reflect real-world usage.
- Fast iteration when you want to try a new feature or label.
EdgeMQ makes it realistic to treat S3 as a continuously updated ML data lake:
Training data
Segment your EdgeMQ S3 prefixes by time, cohort, or experiment, and train directly from them with Spark, Databricks, Snowflake, or DuckDB.
Evaluation slices
Pull specific date ranges or cohorts from EdgeMQ-managed prefixes to create consistent validation and test sets.
Experiment logs
Ingest model input/output events via EdgeMQ to analyze drift, failures, or regression behavior later.
Because ingest is handled centrally, you don't have to negotiate a new pipeline every time you want a new signal.
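For example, pinning an evaluation slice can be a single query. A minimal sketch with DuckDB from Python, assuming the events carry ts and event_type fields (as in the DuckDB example further down) and a placeholder prefix:

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs;")   # enables s3:// paths
con.execute("LOAD httpfs;")      # S3 credentials must be configured separately

# One pinned week of prediction events as a reusable evaluation slice.
eval_df = con.execute("""
    SELECT *
    FROM read_json_auto('s3://your-bucket/edge-events/ml/prod/*.json.gz')
    WHERE event_type = 'prediction'
      AND ts BETWEEN TIMESTAMP '2025-01-01' AND TIMESTAMP '2025-01-08'
""").df()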
Feature pipelines powered by S3
Most modern feature stores and custom feature pipelines assume object storage is the raw source of truth. EdgeMQ is built to keep that source of truth healthy:
Raw → feature pipeline
- Apps / devices / services send NDJSON to EdgeMQ.
- EdgeMQ lands segments in S3 under structured prefixes.
- Your feature jobs (Spark, Flink, dbt, custom Python) transform those segments into feature tables or online stores (see the sketch at the end of this section).
Historical replay
- Rebuild features from historical EdgeMQ segments when you change logic.
- Reproduce past model behavior by training from the exact same raw data.
Multi-consumer
The same EdgeMQ raw data can feed both:
- Offline training / evaluation
- Online feature stores / monitoring pipelines
Once the data is in S3, you're free to wire it into any feature stack you want.
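As a sketch of that raw → feature step, here's a minimal PySpark batch job; the field names (user_id, ts) and paths are illustrative, not a fixed EdgeMQ schema:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("edgemq-feature-job").getOrCreate()

# Read raw EdgeMQ segments (gzipped NDJSON is decompressed automatically).
raw = spark.read.json("s3://your-bucket/edge-events/ml/prod/*.json.gz")

# Turn raw events into a simple per-user feature table.
features = raw.groupBy("user_id").agg(
    F.count("*").alias("event_count"),
    F.max("ts").alias("last_seen"),
)

features.write.mode("overwrite").parquet("s3://your-bucket/features/user_activity/")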
Query with the tools you already use
EdgeMQ doesn't ask you to switch engines. It just keeps them fed.
Databricks / Spark
Treat EdgeMQ prefixes as your streaming input:
- EdgeMQ continuously drops compressed segments + commit markers into S3.
- Databricks Auto Loader or Spark jobs monitor those prefixes and load data into Delta tables (see the sketch below).
- You train models on Delta and build feature tables on top.
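A minimal Auto Loader sketch, assuming a Databricks notebook (where spark is predefined) and placeholder paths and table names:

# Incrementally load EdgeMQ segments from S3 into a Delta table.
stream = (
    spark.readStream.format("cloudFiles")
         .option("cloudFiles.format", "json")
         .load("s3://your-bucket/edge-events/ml/prod/")
)

(
    stream.writeStream
          .option("checkpointLocation", "s3://your-bucket/_checkpoints/edge_events/")
          .trigger(availableNow=True)   # process the backlog, then stop
          .toTable("bronze.edge_events")
)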
Future-facing: as EdgeMQ emits Parquet and table-like layouts, Auto Loader gets even faster and simpler to configure.
Snowflake
Use EdgeMQ as the raw staging area for training and analytics tables:
- Producers → EdgeMQ / ingest → S3.
- Snowpipe or COPY INTO pulls from those S3 prefixes into Snowflake tables.
COPY INTO ml_raw.events
FROM 's3://your-bucket/edge-events/ml/prod/'
CREDENTIALS = (AWS_ROLE = 'arn:aws:iam::123:role/edge-snowflake-access')
FILE_FORMAT = (TYPE = JSON)
PATTERN = '.*\.json\.gz';
You keep all the power of Snowflake; you just stop worrying about how data got to S3.
ClickHouse / Postgres
Use EdgeMQ as a buffer in front of online and near-real-time stores:
- High-volume events (clicks, metrics, logs) flow to EdgeMQ, not directly to your database.
- EdgeMQ absorbs the spikes and writes to S3.
- A loader job ingests into ClickHouse or Postgres at the rate your cluster can safely take (see the loader sketch below).
This is perfect for near-real-time feature tables in ClickHouse and monitoring/tracking tables in Postgres with controlled load.
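A minimal loader sketch in Python using the clickhouse-driver package and ClickHouse's s3() table function; the host, table, and schema alignment are assumptions, and bucket credentials are omitted for brevity:

from clickhouse_driver import Client

client = Client(host="clickhouse.internal")   # placeholder host

# Pull one batch of EdgeMQ segments into ClickHouse at a pace the cluster can absorb.
# Assumes ml.events_raw exists with columns matching the NDJSON fields.
client.execute("""
    INSERT INTO ml.events_raw
    SELECT *
    FROM s3(
        'https://your-bucket.s3.amazonaws.com/edge-events/ml/prod/*.json.gz',
        'JSONEachRow'
    )
""")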
DuckDB (and friends)
Give data scientists and ML researchers direct, notebook-friendly access to fresh data:
- EdgeMQ writes NDJSON (and in future, Parquet) to S3.
- DuckDB queries those S3 prefixes directly from a laptop or cloud notebook.
INSTALL httpfs; LOAD httpfs;   -- one-time setup so DuckDB can read s3:// paths
SELECT *
FROM read_json_auto('s3://your-bucket/edge-events/ml/prod/*.json.gz')
WHERE event_type = 'prediction'
  AND ts >= now() - INTERVAL '7 days';
No duplicate pipelines, no extra infrastructure: just query the lake.
Future-ready formats for ML and analytics
Today, EdgeMQ is optimized for NDJSON → compressed segments + commit markers in S3. The roadmap expands this into a format-aware ingest layer:
Input formats
- NDJSON and JSON batches.
- Additional line-delimited formats over time.
Output formats on S3
- NDJSON segments (today).
- Parquet for efficient columnar training and feature extraction.
- CSV where needed.
- Table-friendly layouts (e.g. Iceberg-style directory structures) that slot into Databricks, Spark, Trino/Presto, Snowflake (via external tables), DuckDB, and other engines.
For ML teams, that means faster training jobs (less parsing, more scanning), easier integration with new engines and tools, and less glue code translating "whatever we got from the app" into "what the engine expects."
Example ML patterns with EdgeMQ
Real-time product signals → feature store
- Product backend logs user actions and metadata as NDJSON.
- EdgeMQ ingests those events from multiple regions into S3.
- A feature pipeline in Spark/Databricks/Snowflake converts segments into offline feature tables for training and online features for a feature store or a low-latency DB.
Result: your models see fresh behavioral signals without anyone building a bespoke ingest stack.
Telemetry & sensor data → anomaly detection
- IoT devices POST telemetry to regional EdgeMQ endpoints.
- EdgeMQ handles flaky networks, reconnect storms, and spikes.
- Telemetry accumulates in S3 as a clean, continuous stream of records.
- You train and deploy anomaly detection models (Spark, Snowflake, ClickHouse, or notebooks) using this history.
Result: your detection models ride on top of a robust ingest backbone, not fragile scripts.
Model input/output logging → observability and evaluation
- When your model serves a prediction, your service logs input features, model version, output, and extra context.
- Those logs go to EdgeMQ, not local disk.
- EdgeMQ lands them in S3 under an ml-logs/ prefix.
- You analyze drift, calibration, failures, counterfactuals, and "what if?" scenarios.
Because everything is centralized in S3, you can slice, audit, and replay model behavior over time.
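A minimal logging sketch, reusing the producer contract from earlier; the event fields and one-event-per-request sending are illustrative (in production you'd batch many NDJSON lines per POST):

import json
import os
import time
import requests

ENDPOINT = "https://<region>.edge.mq/ingest"   # placeholder, as in the curl example
TOKEN = os.environ["EDGEMQ_TOKEN"]

def log_prediction(features: dict, model_version: str, output: float) -> None:
    # Serialize one prediction event as an NDJSON line and ship it to EdgeMQ.
    event = {
        "ts": time.time(),
        "event_type": "prediction",
        "model_version": model_version,
        "features": features,
        "output": output,
    }
    resp = requests.post(
        ENDPOINT,
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Content-Type": "application/x-ndjson",
        },
        data=json.dumps(event) + "\n",
    )
    resp.raise_for_status()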
You don't need to own ingest infrastructure
Most ML teams don't want to:
- Run Kafka or Kinesis just for ingest.
- Maintain critical HTTP services that absorb events.
- Debug partial S3 uploads and edge-case retries.
- Explain to security why there are random access keys in source trees.
EdgeMQ takes this off your plate:
Managed edge infrastructure
Per-tenant microVMs, WAL on NVMe, S3 shippers, and health checks are operated for you.
Predictable overload behavior
If things get hot, producers see 503 + Retry-After. You don't get silent gaps in datasets.
Security that fits your platform
S3 writes via short-lived IAM roles and scoped prefixes; data teams and platform teams can govern it using the tools they already know.
You get a dependable data hose; platform/infrastructure stays in control; ML teams move faster.
Collaborate cleanly with data and platform teams
EdgeMQ is a shared primitive you can rally around. It's the common lakehouse ingest layer that data engineers, ML teams, and platform engineers all depend on, with S3 as the shared source of truth.
Platform / infra
- Set up S3 buckets, prefixes, and IAM roles.
- Provision EdgeMQ endpoints as a "paved road" for ingest.
Data engineers
- Define schemas, prefixes, and downstream load jobs.
- Use EdgeMQ as the standard way data enters the lake.
ML teams
- Consume from the same S3 lake for training, evaluation, and features.
- Ask for "one more prefix + schema" instead of "a new ingest system."
Everyone aligns on a single, well-understood ingest layer.
Make S3 the live heart of your ML platform
Your models are only as good as the data they see—and how reliably they see it. EdgeMQ makes getting data into S3 something you can take for granted:
Reliable ingest from anywhere
Apps, devices, services, and partners.
ML-ready lake in S3
Constantly updated, structured, and easy to query.
No custom ingest infra
For you to own or debug.
Related pages
- For Data Engineers — how your S3 Bronze layer is built and maintained on top of EdgeMQ.
- For Platform / Infra — how EdgeMQ is operated as a standardized ingest primitive.
Ready to feed your models with live data instead of brittle pipelines?
Stop babysitting brittle data feeds. Start assuming S3 is always fresh.