EdgeMQ S3 Segment File Format and Extraction Guide
This document describes the format of the segment objects that EdgeMQ’s ingest service writes to S3, and explains how to extract the original JSON or NDJSON payloads from them.
What gets uploaded to S3
- Each sealed WAL segment is uploaded as a single zstd-compressed file.
- Object key shape: `S3_PREFIX/REGION/INSTANCE_ID/EPOCH/segments/seg-XXXXXXXX.wal.zst`.
- S3 object metadata includes:
- `sha256`: SHA-256 of the compressed bytes
- `source`: original filename (e.g., `seg-00000009.wal.zst`)
- `content-type`: `application/zstd`
(Multipart upload is used transparently; consumers read the object normally.)
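For example, a consumer can fetch a segment object and check the `sha256` metadata before extracting it. Below is a minimal Python sketch, assuming `boto3` is installed and credentials are configured; the bucket, key, and region are placeholders, not real EdgeMQ values:

```python
import hashlib
import boto3

s3 = boto3.client("s3", region_name="us-east-1")  # placeholder region

bucket = "my-edgemq-bucket"  # placeholder bucket name
key = "S3_PREFIX/REGION/INSTANCE_ID/EPOCH/segments/seg-00000009.wal.zst"  # placeholder key

obj = s3.get_object(Bucket=bucket, Key=key)
compressed = obj["Body"].read()

# User metadata keys come back lower-cased; `sha256` covers the compressed bytes.
expected = obj["Metadata"].get("sha256")
actual = hashlib.sha256(compressed).hexdigest()
if expected and actual != expected:
    raise ValueError(f"sha256 mismatch: expected {expected}, got {actual}")
```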
What is Zstandard?
Zstandard (`.zst`) is a modern compression format created at Facebook, designed for high compression ratios with fast compression and decompression. It is widely used across backends, package managers, and data pipelines. In EdgeMQ, segments are sealed and compressed with Zstandard before upload to S3, keeping storage costs low while extraction stays fast. You can decompress with the `zstd` CLI or common language libraries.
Inside the compressed file
The compressed content is a byte-for-byte zstd stream of the raw WAL segment. The WAL segment is a concatenation of back-to-back frames. There is no extra container, footer, or padding.
Frame layout (big-endian for all multi-byte integers):
[LEN u32 BE][CRC32C u32 BE][FMT u8][TENANT u32 BE][TS u64 BE (ms)][PAYLOAD]
- LEN: 32-bit unsigned, big-endian. Number of bytes in `[FMT][TENANT][TS][PAYLOAD]`.
- Therefore `LEN = payload_length + 1 + 4 + 8 = payload_length + 13`.
- CRC32C: 32-bit unsigned, big-endian. Castagnoli polynomial. Computed over `PAYLOAD` only (header excluded).
- FMT: 8-bit format/version. Currently `0`.
- TENANT: 32-bit unsigned, big-endian. Tenant identifier.
- TS: 64-bit unsigned, big-endian. Milliseconds since Unix epoch when the frame was created.
- PAYLOAD: opaque bytes; for the ingest service, this is the user-provided JSON or NDJSON content.
Header size: 4 + 4 + 1 + 4 + 8 = 21 bytes.
Frames are simply appended one after another. Readers should iterate until end-of-file.
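As an illustration, the 21-byte header maps onto a fixed big-endian layout. A minimal Python sketch of decoding one frame that is already fully in memory (the `read_header` helper is illustrative, not part of EdgeMQ):

```python
import struct

# LEN u32 BE, CRC32C u32 BE, FMT u8, TENANT u32 BE, TS u64 BE -> 21 bytes total.
HEADER = struct.Struct(">IIBIQ")
assert HEADER.size == 21

def read_header(frame: bytes):
    """Decode one complete frame; returns (fmt, tenant, ts_ms, crc, payload)."""
    length, crc, fmt, tenant, ts_ms = HEADER.unpack_from(frame, 0)
    payload_len = length - 13          # LEN covers FMT + TENANT + TS + PAYLOAD
    payload = frame[21 : 21 + payload_len]
    return fmt, tenant, ts_ms, crc, payload
```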
Robust parsing rules
- Decompress the `.zst` stream to obtain raw `.wal` bytes (streaming preferred).
- Maintain a buffer. While buffer length ≥ 21 bytes:
- Read `LEN` and `CRC`.
- If `LEN < 13` (the size of `[FMT][TENANT][TS]` alone), stop (invalid/truncated tail).
- Compute `frameBytes = 8 + LEN` (equivalently `21 + payloadLen`).
- If buffer length < `frameBytes`, wait for more bytes.
- Extract `payloadLen = LEN - 13`; the payload is the `payloadLen` bytes starting at byte offset 21 of the frame.
- If validating, compute CRC32C over `payload` and compare to `CRC`; stop or error on mismatch.
- Emit or process `payload`.
- Advance buffer by `frameBytes` and continue.
- Reaching EOF with an incomplete header or partial frame is not an error; stop.
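The rules above can be implemented as a small incremental parser. A minimal Python sketch, assuming the caller supplies already-decompressed byte chunks and, optionally, a CRC32C callable (the standard library has no CRC32C; the `iter_payloads` name is illustrative):

```python
def iter_payloads(chunks, crc32c_fn=None):
    """Yield payloads from an iterable of decompressed WAL byte chunks.

    crc32c_fn, if given, is a callable returning the CRC32C of a bytes object;
    frames whose payload fails the check raise ValueError.
    """
    HDR = 21                                 # LEN + CRC + FMT + TENANT + TS
    buf = b""
    for chunk in chunks:
        buf += chunk
        while len(buf) >= HDR:
            frame_len = int.from_bytes(buf[0:4], "big")   # LEN field
            crc = int.from_bytes(buf[4:8], "big")         # CRC32C field
            if frame_len < 13:               # smaller than [FMT][TENANT][TS]: invalid tail
                return
            need = 8 + frame_len             # total frame size (= 21 + payload_len)
            if len(buf) < need:
                break                        # wait for more bytes
            payload = buf[HDR:need]
            if crc32c_fn is not None and crc32c_fn(payload) != crc:
                raise ValueError("CRC32C mismatch")
            yield payload
            buf = buf[need:]
    # A partial header or frame left over at EOF is ignored, per the rules above.
```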
Step-by-step extraction
- Download and decompress
- Stream from S3 and pipe through `zstd -d` (CLI) or use a zstd library.
- Parse frames
- For each frame, read the 21-byte header, then the payload of computed length.
- Verify CRC32C over payload if you need integrity validation.
- Emit JSON
- The `PAYLOAD` is the original JSON/NDJSON. If it does not end with `\n`, you may append a newline for NDJSON output convenience.
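Putting the steps together, a minimal Python sketch that streams a local `.zst` segment through the third-party `zstandard` package and writes NDJSON to stdout, reusing the `iter_payloads` helper sketched earlier (file path and chunk size are arbitrary choices):

```python
import sys
import zstandard  # assumption: the third-party "zstandard" package is installed

def extract_ndjson(path: str, out=sys.stdout.buffer):
    dctx = zstandard.ZstdDecompressor()
    with open(path, "rb") as fh, dctx.stream_reader(fh) as reader:
        def chunks():
            # Feed the parser 1 MiB slices of the decompressed WAL stream.
            while True:
                chunk = reader.read(1 << 20)
                if not chunk:
                    return
                yield chunk
        for payload in iter_payloads(chunks()):   # parser sketched above
            out.write(payload)
            if not payload.endswith(b"\n"):       # normalize to NDJSON
                out.write(b"\n")

# Example: extract_ndjson("./seg-00000009.wal.zst") writes NDJSON to stdout.
```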
Reference implementation
Contact us for a working Node.js streaming extractor that:
- Accepts input from S3 (via AWS SDK) or a local file.
- Uses `zstd` CLI for decompression.
- Parses frames incrementally.
- Optionally validates CRC32C (uses `fast-crc32c`).
- Writes NDJSON to stdout.
Usage examples:
node scripts/extract_segment.js --s3 s3://BUCKET/path/to/seg-00000009.wal.zst --region us-east-1 > out.ndjson
node scripts/extract_segment.js --file ./seg-00000009.wal.zst --no-crc > out.ndjson
Pseudocode for parsers in other languages
buf = bytes()
HDR = 21                              # LEN + CRC + FMT + TENANT + TS
while stream has data:
    buf += read()
    while len(buf) >= HDR:
        frame_len = u32be(buf[0:4])   # LEN = bytes in [FMT][TENANT][TS][PAYLOAD]
        crc = u32be(buf[4:8])
        if frame_len < 13: stop       # smaller than [FMT][TENANT][TS]: invalid tail
        need = 8 + frame_len          # total frame size (= 21 + payload_len)
        if len(buf) < need: break     # wait for more bytes
        payload_len = frame_len - 13
        payload = buf[HDR : HDR + payload_len]
        if validate_crc and crc32c(payload) != crc: error/stop
        emit(payload)
        buf = buf[need:]
Notes and constraints
- All integers in the frame header are big-endian.
- CRC32C validation is performed over `PAYLOAD` only.
- Frames are back-to-back; there is no delimiter between frames.
- Compressed object uses a single zstd stream.
- Sealed segments uploaded to S3 are expected to be intact; CRC32C validation is an optional consumer-side safeguard.
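If you validate frames in Python, one option (an assumption, not a project dependency) is the third-party `crc32c` package, which exposes `crc32c.crc32c(data)` for the Castagnoli polynomial:

```python
import crc32c  # assumption: `pip install crc32c` (Castagnoli CRC, not zlib's CRC32)

def payload_crc_ok(payload: bytes, expected: int) -> bool:
    # The frame's CRC field covers PAYLOAD only, per the format above.
    return crc32c.crc32c(payload) == expected
```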