Why we built EdgeMQ

EdgeMQ Team

Every data team we talked to had some version of the same story. They had applications producing events - clickstream, IoT telemetry, trade data, API logs - and they wanted those events in S3 as query-ready files. Columnar. Typed. Partitioned by date. Ready for Snowflake, Databricks, DuckDB, or whatever they happened to query with.

The actual requirement was simple: accept JSON over HTTP, land it in S3 as Parquet.

The infrastructure they ended up operating was not.

The infrastructure tax

We kept hearing the same three paths, each with the same punchline.

Path 1: Kafka. You set up brokers, create topics, configure partitions, write producers, deploy consumer groups, add an S3 sink connector, tune retention, manage offsets, monitor lag, handle rebalances, and eventually - after a few weeks of engineering - your data lands in S3. The system works, but you're now operating a distributed streaming platform for what amounts to a pipe. If you're at LinkedIn scale and need pub/sub fan-out across dozens of consumers, Kafka earns its complexity. If you need "HTTP in, S3 out," you've built a highway to cross the street.

Path 2: Firehose. Amazon Data Firehose is genuinely simpler. But the delivery buffer holds events for 60 seconds at minimum, the pricing requires a spreadsheet to predict (per-GB ingestion + format conversion surcharges + dynamic partitioning fees + VPC delivery costs), and everything routes through a single AWS region. If your data sources span continents, every event pays a latency tax to reach the nearest Firehose endpoint. You also inherit a hard dependency on the AWS ecosystem: IAM roles, Glue schemas, CloudWatch for monitoring. It's simpler than Kafka, but it's still a lot of AWS primitives wired together.

Path 3: DIY. A surprising number of teams just build it themselves. An API endpoint that buffers events, batches them, compresses, and uploads to S3. It works for the first month. Then someone asks what happens during a deploy (you lose the buffer). Or during a traffic spike (you drop events). Or when S3 is temporarily throttled (you lose more events). Every team that goes this route eventually builds a WAL, then a crash-recovery mechanism, then a backpressure system, then a health monitor - and at some point realises they've built a bespoke data ingestion service that nobody else on the team wants to maintain.

The common thread: for the specific workload of "accept HTTP, write to S3," every option is either over-engineered, under-reliable, or both.

What we actually needed

We took a step back and asked: if you were designing this from scratch today, with S3 as the target and HTTP as the interface, what would the simplest correct system look like?

Four properties kept coming up:

Durable before acknowledging. Every event must hit stable storage before the client gets a 202. Not "buffered in memory." Not "queued for processing." Written to a local NVMe-backed write-ahead log with integrity verification. If the process crashes one millisecond after acknowledging, the data must survive.

Fast at the edge. Ingest endpoints should run close to the data sources. If your IoT devices are in Frankfurt and your S3 bucket is in us-east-1, the endpoint should be in Frankfurt. The p95 response time should be under 10ms, not 60 seconds.

S3 as the source of truth. Once a commit marker exists in S3, the data is delivered. No ambiguity, no "check the consumer lag," no secondary confirmation. The commit marker is the contract. Downstream tools can read it, verify it, and build on it.

No moving parts the user has to operate. No brokers. No partitions. No consumer groups. No connectors. You send HTTP, you get S3 files. The system between those two points is our problem, not yours.

That's EdgeMQ.

How it works

The interface is a single HTTP endpoint:

curl -X POST https://your-endpoint.edge.mq/v1/ingest \
  -H "Content-Type: application/x-ndjson" \
  -H "X-API-Key: emq_live_..." \
  -d '{"event":"page_view","user":"u_8f3a","page":"/pricing","ts":"2026-02-20T14:30:00Z"}'

You get back a 202 Accepted. At that point, your event is durably written to a local write-ahead log on NVMe storage - verified with CRC32C checksums, fsynced to disk. The p95 latency for this acknowledgement is under 10ms.
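
The acknowledgement path can be sketched in a few lines. This is a conceptual illustration only: the record layout, the function names, and the use of zlib.crc32 as a stand-in for CRC32C are our assumptions, not EdgeMQ's actual implementation.

```python
import os
import struct
import zlib

def wal_append(path: str, payload: bytes) -> None:
    """Append one length-prefixed, checksummed record and fsync.

    Sketch of the durability contract: the record layout is illustrative,
    and zlib.crc32 stands in for the CRC32C used in production.
    """
    record = struct.pack(">I", len(payload)) + payload
    record += struct.pack(">I", zlib.crc32(record) & 0xFFFFFFFF)
    with open(path, "ab") as f:
        f.write(record)
        f.flush()
        os.fsync(f.fileno())  # durable on disk before the 202 goes out

def wal_read_all(path: str):
    """Yield payloads back, verifying each checksum on the way out."""
    with open(path, "rb") as f:
        data = f.read()
    off = 0
    while off < len(data):
        (n,) = struct.unpack_from(">I", data, off)
        body = data[off:off + 4 + n]            # length prefix + payload
        (crc,) = struct.unpack_from(">I", data, off + 4 + n)
        assert zlib.crc32(body) & 0xFFFFFFFF == crc, "corrupt record"
        yield body[4:]
        off += 8 + n
```

The important line is the fsync before returning: a crash one millisecond after acknowledgement replays cleanly from disk.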

Behind the scenes, the ingest node seals WAL segments as they fill (~128 MB), compresses them with zstd, and uploads them to your S3 bucket via multipart upload. A commit marker JSON file is written atomically after all artifacts for that segment are confirmed in S3. That commit marker is the delivery receipt - if it exists, the data is safe.

If the node crashes mid-upload, it resumes from its local state file on restart. No data is lost, no segments are duplicated. The design is intentionally boring: append-only local writes, crash-safe state, deterministic S3 keys, atomic commit markers.
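
The "boring" upload protocol above can be sketched as two small pieces: a deterministic key function, so a retried upload after a crash targets the same object instead of creating a duplicate, and an atomic commit-marker write. The key scheme and marker fields here are illustrative assumptions, not EdgeMQ's actual layout.

```python
import json
import os
import tempfile

def segment_key(tenant: str, segment_id: int, dt: str) -> str:
    # Deterministic key: re-running an upload after a crash targets the
    # same object, so retries overwrite rather than duplicate.
    return f"{tenant}/dt={dt}/segment-{segment_id:08d}.wal.zst"

def write_commit_marker(dirpath: str, segment_id: int, artifacts: list) -> str:
    # Written only after every artifact is confirmed. Locally this is
    # made atomic via temp-file + rename; on S3 the equivalent is a
    # single PUT of the marker object as the last step.
    marker = {"segment": segment_id, "artifacts": artifacts, "complete": True}
    final = os.path.join(dirpath, f"segment-{segment_id:08d}.commit.json")
    fd, tmp = tempfile.mkstemp(dir=dirpath)
    with os.fdopen(fd, "w") as f:
        json.dump(marker, f)
    os.replace(tmp, final)  # atomic on POSIX: marker exists fully or not at all
    return final
```

Readers never observe a half-written marker: either it exists (data delivered) or it doesn't (segment still in flight).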

Three output formats

Not every workload needs the same file format. EdgeMQ supports three, configurable per endpoint:

Segments are compressed WAL archives (.wal.zst). They're the fastest to produce, the cheapest to store, and they preserve the exact byte sequence of every payload. Use them for replay, compliance archival, or feeding custom downstream processors.

Parquet (Raw) wraps each event payload as an opaque binary column alongside tenant ID and ingest timestamp. This gives you columnar storage with date-range partitioning - efficient for time-bounded scans and bulk exports - without requiring any schema definition. Every payload is preserved as-is.

Parquet (Schema-Aware) is where it gets interesting.

The view engine

Most teams don't want opaque blobs in S3. They want typed columns they can query with SQL. The traditional approach is to build an ETL pipeline downstream: read the raw files, parse the JSON, extract fields, cast types, partition by date, write new Parquet files. That's an entire pipeline to build, test, deploy, monitor, and maintain - for every schema change, for every new event type.

EdgeMQ's answer is view definitions: a declarative YAML spec that tells the system how to extract typed columns from your JSON payloads.

name: page_views
source: payload
columns:
  - name: event
    type: VARCHAR
    path: $.event
  - name: user_id
    type: VARCHAR
    path: $.user
  - name: page
    type: VARCHAR
    path: $.page
  - name: viewed_at
    type: TIMESTAMP
    path: $.ts
partitioning:
  - expression: "date_trunc('day', viewed_at)"
    alias: dt

When a WAL segment is sealed, EdgeMQ loads it into an embedded DuckDB instance and executes the view definition as compiled SQL. The output is a typed Parquet file with proper column statistics, ready for predicate pushdown, partitioned by date, uploaded to S3 alongside the commit marker. In production, this processes around 21,000 rows per second, producing Parquet files at roughly 3.6 bytes per row - excellent compression.
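
Conceptually, a view definition like the one above compiles to a single SQL projection over the segment. The sketch below shows one plausible shape of that compilation, using DuckDB's json_extract_string function; the compile_view name and the exact SQL emitted are our assumptions, not EdgeMQ's actual compiler.

```python
def compile_view(columns, partitioning):
    """Turn a parsed view definition into one SELECT statement.

    Each JSONPath is pulled out of the raw payload and CAST to the
    declared type; partition expressions become aliased columns.
    """
    exprs = [
        f"CAST(json_extract_string(payload, '{c['path']}') AS {c['type']}) AS {c['name']}"
        for c in columns
    ]
    exprs += [f"{p['expression']} AS {p['alias']}" for p in partitioning]
    return "SELECT " + ", ".join(exprs) + " FROM segment"

sql = compile_view(
    [{"name": "event", "type": "VARCHAR", "path": "$.event"},
     {"name": "viewed_at", "type": "TIMESTAMP", "path": "$.ts"}],
    [{"expression": "date_trunc('day', viewed_at)", "alias": "dt"}],
)
```

Because the view is just SQL over the sealed segment, adding or fixing a view never touches the ingest path.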

The key design choice: validation happens at the output layer, not at ingest. Every event is accepted and durably stored in the WAL regardless of whether it matches a view definition. Events that fail type casting or violate required-field constraints during materialisation are captured in a Dead Letter Queue - a separate Parquet artifact with full error context, uploaded alongside the successful output.

This means EdgeMQ never drops your data. If you deploy a view with a typo in a JSONPath, the events aren't lost - they're in the DLQ, and you can fix the view and reprocess. If your schema evolves and a new field appears, existing events are still in the WAL, queryable by a new view definition. You can even run multiple views on the same endpoint simultaneously, producing different typed outputs from the same raw stream.
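
The accept-everything, validate-at-output behaviour can be sketched as a small materialisation loop: rows that extract and cast cleanly go to the typed output, and anything that fails is captured with error context instead of being dropped. Function names, the path-walking logic, and the DLQ record shape are all illustrative assumptions.

```python
import json
from datetime import datetime

def materialise(raw_events, view):
    """Apply typed extraction; route failures to a DLQ, never drop."""
    rows, dlq = [], []
    for raw in raw_events:
        try:
            payload = json.loads(raw)
            row = {}
            for col in view:
                value = payload
                # Walk a simple dotted JSONPath like "$.user.id".
                for part in col["path"].lstrip("$.").split("."):
                    value = value[part]          # missing field -> DLQ
                if col["type"] == "TIMESTAMP":
                    value = datetime.fromisoformat(value.replace("Z", "+00:00"))
                row[col["name"]] = value
            rows.append(row)
        except Exception as e:
            # Full error context travels with the payload, so the event
            # can be reprocessed once the view (or the producer) is fixed.
            dlq.append({"payload": raw, "error": f"{type(e).__name__}: {e}"})
    return rows, dlq
```

A typo'd path or a malformed timestamp shifts events into the DLQ output; nothing is lost from the WAL.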

This is a deliberate trade-off. Systems like Snowplow and Segment validate at collection time - bad events get a synchronous 4xx and the producer can retry immediately. EdgeMQ's feedback loop is asynchronous: you monitor DLQ metrics and error rates in the console. For server-to-server ingestion (which is the majority of our use cases), this is the right trade-off. You get zero data loss and the ability to iterate on schemas without re-sending anything.

What EdgeMQ is not

We think the best way to explain what something is, is to be honest about what it isn't.

EdgeMQ is not a stream processor. There are no joins, no windowed aggregations, no consumer groups. If you need real-time stream processing, Kafka and Flink are the right tools. EdgeMQ is a pipe from HTTP to S3, with optional schema transformation at the output layer.

EdgeMQ is not a customer data platform. There's no identity resolution, no audience building, no journey orchestration. If you need to unify customer profiles across marketing tools, look at Segment or RudderStack. EdgeMQ doesn't know what a "user" is - it knows what an event payload is.

EdgeMQ is not a replacement for your entire data stack. It handles one thing well: getting event data from the internet into your S3 bucket, structured and fast, without you having to operate infrastructure. What you do with that data once it's in S3 is up to you and your existing tools.

EdgeMQ only delivers to S3. Today, that means S3 and S3-compatible storage (MinIO, Cloudflare R2, etc.). We don't deliver directly to Snowflake, BigQuery, or Redshift - yet. Our belief is that S3 is the right intermediate layer: it's the cheapest, most durable, most universally accessible storage available, and every analytics tool knows how to read from it. But we know "data in S3" and "data in my warehouse" aren't always the same thing, and closing that gap is high on our roadmap.

Your data, your infrastructure

A managed service that writes to your S3 bucket needs to earn trust. We designed EdgeMQ's security model around a simple principle: your data should never touch infrastructure you don't control.

Dedicated VMs, not shared containers. Every EdgeMQ account runs on its own dedicated microVM with a private NVMe-backed WAL volume and isolated process boundaries. There is no shared disk between tenants, no shared memory, no noisy-neighbour risk on the storage path. Your ingest node is yours alone.

Your S3 bucket, your IAM role. EdgeMQ never stores your data in its own buckets. You provide an IAM Role ARN with a unique ExternalId (preventing the confused-deputy problem), and we assume that role using short-lived STS credentials to write to your bucket. The permissions are least-privilege by design: s3:PutObject and s3:AbortMultipartUpload, scoped to your specific prefix (multipart uploads are authorised under PutObject). No ListBucket, no broad read access. You can revoke the role at any time and uploads stop immediately.
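
Concretely, a customer-side role of this shape would carry a trust policy and a permissions policy along the following lines. The account ID, ExternalId value, and bucket prefix below are placeholders for illustration, not real EdgeMQ values.

```python
# Hypothetical trust policy on the customer's role; the principal
# account ID and ExternalId are placeholders.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::111111111111:root"},  # placeholder
        "Action": "sts:AssumeRole",
        # The unique ExternalId blocks the confused-deputy attack: only
        # requests presenting this exact value may assume the role.
        "Condition": {"StringEquals": {"sts:ExternalId": "emq-ext-8f3a2c"}},
    }],
}

# Least-privilege permissions policy on the same role: write-only,
# scoped to one prefix. No ListBucket, no GetObject.
permissions_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:PutObject", "s3:AbortMultipartUpload"],
        "Resource": "arn:aws:s3:::your-bucket/edgemq/*",
    }],
}
```

Deleting the role, or the ExternalId condition, cuts off uploads instantly with no EdgeMQ-side action required.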

Credentials never hit disk. STS temporary credentials are delivered to ingest nodes over mTLS, held in memory only, and auto-rotated before expiry. There are no static AWS access keys stored anywhere in the system.

Encryption throughout. TLS 1.2+ for all data in transit. WAL volumes are encrypted at rest. We recommend SSE-S3 or SSE-KMS on your bucket - and if you use KMS, EdgeMQ's identity only has kms:Encrypt and kms:GenerateDataKey, never kms:Decrypt. We can write your data but we can't read it.

This isn't security theatre bolted on after launch. It's the foundation the system was built on, because we knew that teams handling financial data, healthcare telemetry, and PII-bearing event streams would need to trust the plumbing before they'd send a single payload through it.

Getting started

EdgeMQ is in beta and available today.

The Starter plan is free - 10 GiB per month of ingestion included, no credit card required. You get segments and raw Parquet output, one region, and enough throughput for most development and small production workloads.

The Pro plan is $49/month with per-GiB usage pricing. It unlocks schema-aware Parquet views, multi-region endpoints, larger payloads (up to 10 MiB), and configurable flush policies. A 14-day Pro trial is available if you want to test view definitions before committing.

If you're building something that ingests event data over HTTP and you're tired of operating infrastructure that exists only to get that data into S3 - we built EdgeMQ for exactly that problem.

Create an account or read the quickstart to send your first payload in under five minutes.