EdgeMQ S3 Segment File Format and Extraction Guide
This document describes the format of the segment objects that EdgeMQ’s ingest service writes to S3, and explains how to extract the original JSON or NDJSON payloads from them.
What gets uploaded to S3
- Each sealed WAL segment is uploaded as a single zstd-compressed file.
- Object key shape: `S3_PREFIX/REGION/INSTANCE_ID/EPOCH/segments/seg-XXXXXXXX.wal.zst`.
- S3 object metadata includes:
- `sha256`: SHA-256 of the compressed bytes
- `source`: original filename (e.g., `seg-00000009.wal.zst`)
- `content-type`: `application/zstd`
(Multipart upload is used transparently; consumers read the object normally.)
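For example, a consumer can fetch a segment object and check the `sha256` metadata before extracting it. Below is a minimal Python sketch, assuming `boto3` is installed and credentials are configured; the bucket, key, and region are placeholders, not real EdgeMQ values:

```python
import hashlib
import boto3

s3 = boto3.client("s3", region_name="us-east-1")  # placeholder region

bucket = "my-edgemq-bucket"  # placeholder bucket name
key = "S3_PREFIX/REGION/INSTANCE_ID/EPOCH/segments/seg-00000009.wal.zst"  # placeholder key

obj = s3.get_object(Bucket=bucket, Key=key)
compressed = obj["Body"].read()

# User metadata keys come back lower-cased; `sha256` covers the compressed bytes.
expected = obj["Metadata"].get("sha256")
actual = hashlib.sha256(compressed).hexdigest()
if expected and actual != expected:
    raise ValueError(f"sha256 mismatch: expected {expected}, got {actual}")
```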
What is Zstandard?
Zstandard (`.zst`) is a modern compression format created at Facebook, designed for high compression ratios with fast compression and decompression. It is widely used across backends, package managers, and data pipelines. In EdgeMQ, segments are sealed and compressed with Zstandard before upload to S3, keeping storage costs low while extraction stays fast. You can decompress with the `zstd` CLI or common language libraries.
Inside the compressed file
The compressed content is a byte-for-byte zstd stream of the raw WAL segment. The WAL segment is a concatenation of back-to-back frames. There is no extra container, footer, or padding.
Frame layout (big-endian for all multi-byte integers):
[LEN u32 BE][CRC32C u32 BE][FMT u8][TENANT u32 BE][TS u64 BE (ms)][PAYLOAD]
- LEN: 32-bit unsigned, big-endian. Number of bytes in `[FMT][TENANT][TS][PAYLOAD]`.
- Therefore `LEN = payload_length + 1 + 4 + 8 = payload_length + 13`.
- CRC32C: 32-bit unsigned, big-endian. Castagnoli polynomial. Computed over `PAYLOAD` only (header excluded).
- FMT: 8-bit format/version. Currently `0`.
- TENANT: 32-bit unsigned, big-endian. Tenant identifier.
- TS: 64-bit unsigned, big-endian. Milliseconds since Unix epoch when the frame was created.
- PAYLOAD: opaque bytes; for the ingest service, this is the user-provided JSON or NDJSON content.
Header size: 4 + 4 + 1 + 4 + 8 = 21 bytes.
Frames are simply appended one after another. Readers should iterate until end-of-file.
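As an illustration, the 21-byte header maps onto a fixed big-endian layout. A minimal Python sketch of decoding one frame that is already fully in memory (the `read_header` helper is illustrative, not part of EdgeMQ):

```python
import struct

# LEN u32 BE, CRC32C u32 BE, FMT u8, TENANT u32 BE, TS u64 BE -> 21 bytes total.
HEADER = struct.Struct(">IIBIQ")
assert HEADER.size == 21

def read_header(frame: bytes):
    """Decode one complete frame; returns (fmt, tenant, ts_ms, crc, payload)."""
    length, crc, fmt, tenant, ts_ms = HEADER.unpack_from(frame, 0)
    payload_len = length - 13          # LEN covers FMT + TENANT + TS + PAYLOAD
    payload = frame[21 : 21 + payload_len]
    return fmt, tenant, ts_ms, crc, payload
```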
Robust parsing rules
- Decompress the `.zst` stream to obtain raw `.wal` bytes (streaming preferred).
- Maintain a buffer. While buffer length ≥ 21 bytes:
- Read `LEN` and `CRC`.
- If `LEN < 13` (the size of `[FMT][TENANT][TS]` alone), stop (invalid/truncated tail).
- Compute `frameBytes = 8 + LEN` (equivalently `21 + payloadLen`).
- If buffer length < `frameBytes`, wait for more bytes.
- Extract `payloadLen = LEN - 13`; the payload is the `payloadLen` bytes starting at byte offset 21 of the frame.
- If validating, compute CRC32C over `payload` and compare to `CRC`; stop or error on mismatch.
- Emit or process `payload`.
- Advance buffer by `frameBytes` and continue.
- Reaching EOF with an incomplete header or partial frame is not an error; stop.
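The rules above can be implemented as a small incremental parser. A minimal Python sketch, assuming the caller supplies already-decompressed byte chunks and, optionally, a CRC32C callable (the standard library has no CRC32C; the `iter_payloads` name is illustrative):

```python
def iter_payloads(chunks, crc32c_fn=None):
    """Yield payloads from an iterable of decompressed WAL byte chunks.

    crc32c_fn, if given, is a callable returning the CRC32C of a bytes object;
    frames whose payload fails the check raise ValueError.
    """
    HDR = 21                                 # LEN + CRC + FMT + TENANT + TS
    buf = b""
    for chunk in chunks:
        buf += chunk
        while len(buf) >= HDR:
            frame_len = int.from_bytes(buf[0:4], "big")   # LEN field
            crc = int.from_bytes(buf[4:8], "big")         # CRC32C field
            if frame_len < 13:               # smaller than [FMT][TENANT][TS]: invalid tail
                return
            need = 8 + frame_len             # total frame size (= 21 + payload_len)
            if len(buf) < need:
                break                        # wait for more bytes
            payload = buf[HDR:need]
            if crc32c_fn is not None and crc32c_fn(payload) != crc:
                raise ValueError("CRC32C mismatch")
            yield payload
            buf = buf[need:]
    # A partial header or frame left over at EOF is ignored, per the rules above.
```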
Step-by-step extraction
- Download and decompress
- Stream from S3 and pipe through `zstd -d` (CLI) or use a zstd library.
- Parse frames
- For each frame, read the 21-byte header, then the payload of computed length.
- Verify CRC32C over payload if you need integrity validation.
- Emit JSON
- The `PAYLOAD` is the original JSON/NDJSON. If it does not end with `\n`, you may append a newline for NDJSON output convenience.
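Putting the steps together, a minimal Python sketch that streams a local `.zst` segment through the third-party `zstandard` package and writes NDJSON to stdout, reusing the `iter_payloads` helper sketched earlier (file path and chunk size are arbitrary choices):

```python
import sys
import zstandard  # assumption: the third-party "zstandard" package is installed

def extract_ndjson(path: str, out=sys.stdout.buffer):
    dctx = zstandard.ZstdDecompressor()
    with open(path, "rb") as fh, dctx.stream_reader(fh) as reader:
        def chunks():
            # Feed the parser 1 MiB slices of the decompressed WAL stream.
            while True:
                chunk = reader.read(1 << 20)
                if not chunk:
                    return
                yield chunk
        for payload in iter_payloads(chunks()):   # parser sketched above
            out.write(payload)
            if not payload.endswith(b"\n"):       # normalize to NDJSON
                out.write(b"\n")

# Example: extract_ndjson("./seg-00000009.wal.zst") writes NDJSON to stdout.
```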
Reference implementation
Contact us for a working Node.js streaming extractor that:
- Accepts input from S3 (via AWS SDK) or a local file.
- Uses `zstd` CLI for decompression.
- Parses frames incrementally.
- Optionally validates CRC32C (uses `fast-crc32c`).
- Writes NDJSON to stdout.
Usage examples:
node scripts/extract_segment.js --s3 s3://BUCKET/path/to/seg-00000009.wal.zst --region us-east-1 > out.ndjson
node scripts/extract_segment.js --file ./seg-00000009.wal.zst --no-crc > out.ndjson
Pseudocode for parsers in other languages
buf = bytes()
HDR = 21                              # LEN + CRC + FMT + TENANT + TS
while stream has data:
    buf += read()
    while len(buf) >= HDR:
        frame_len = u32be(buf[0:4])   # LEN = bytes in [FMT][TENANT][TS][PAYLOAD]
        crc = u32be(buf[4:8])
        if frame_len < 13: stop       # smaller than [FMT][TENANT][TS]: invalid tail
        need = 8 + frame_len          # total frame size (= 21 + payload_len)
        if len(buf) < need: break     # wait for more bytes
        payload_len = frame_len - 13
        payload = buf[HDR : HDR + payload_len]
        if validate_crc and crc32c(payload) != crc: error/stop
        emit(payload)
        buf = buf[need:]
Notes and constraints
- All integers in the frame header are big-endian.
- CRC32C validation is performed over `PAYLOAD` only.
- Frames are back-to-back; there is no delimiter between frames.
- Compressed object uses a single zstd stream.
- Sealed segments uploaded to S3 are expected to be intact; CRC32C validation is an optional consumer-side safeguard.
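If you validate frames in Python, one option (an assumption, not a project dependency) is the third-party `crc32c` package, which exposes `crc32c.crc32c(data)` for the Castagnoli polynomial:

```python
import crc32c  # assumption: `pip install crc32c` (Castagnoli CRC, not zlib's CRC32)

def payload_crc_ok(payload: bytes, expected: int) -> bool:
    # The frame's CRC field covers PAYLOAD only, per the format above.
    return crc32c.crc32c(payload) == expected
```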