S3 Segment File Format & Extraction
EdgeMQ Segment File Format and Extraction Guide
This document describes the format of the segment objects that EdgeMQ’s ingest service writes to S3, and explains how to extract the original JSON or NDJSON payloads from them.
What gets uploaded to S3
- Each sealed WAL segment is uploaded as a single zstd-compressed file.
- Object key shape:
  S3_PREFIX/REGION/INSTANCE_ID/EPOCH/segments/seg-XXXXXXXX.wal.zst
- S3 object metadata includes:
  - sha256: SHA-256 of the compressed bytes
  - source: original filename (e.g., seg-00000009.wal.zst)
  - content-type: application/zstd
- Multipart upload is used transparently; consumers read the object normally.
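For concreteness, the key shape above could be assembled as follows. This is a sketch: the 8-digit zero-padded sequence number is inferred from the seg-00000009.wal.zst example, and all input values are placeholders.

```javascript
// Sketch: build a segment object key matching the shape above.
// The 8-digit zero padding is inferred from the seg-00000009 example;
// all inputs here are hypothetical.
function segmentKey({ prefix, region, instanceId, epoch, seq }) {
  const name = `seg-${String(seq).padStart(8, '0')}.wal.zst`;
  return `${prefix}/${region}/${instanceId}/${epoch}/segments/${name}`;
}

// segmentKey({ prefix: 'edgemq', region: 'us-east-1',
//              instanceId: 'i-abc123', epoch: '42', seq: 9 })
// -> 'edgemq/us-east-1/i-abc123/42/segments/seg-00000009.wal.zst'
```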
What is Zstandard?
Zstandard (.zst) is a modern compression format created at Facebook, designed for high compression ratios with fast compression and decompression. It is widely used across backends, package managers, and data pipelines. In EdgeMQ, segments are sealed and compressed with Zstandard before uploading to S3, keeping storage costs low while extraction stays fast. You can decompress with the zstd CLI or common language libraries.
Inside the compressed file
The compressed content is a byte-for-byte zstd stream of the raw WAL segment. The WAL segment is a concatenation of back-to-back frames. There is no extra container, footer, or padding.
Frame layout (big-endian for all multi-byte integers):
[LEN u32 BE][CRC32C u32 BE][FMT u8][TS u64 BE(ms)][PAYLOAD]
- LEN: 32-bit unsigned, big-endian. Total frame size in bytes (header + payload).
- Therefore LEN = 17 + payload_length.
- CRC32C: 32-bit unsigned, big-endian. Castagnoli polynomial. Computed over PAYLOAD only (header excluded).
- FMT: 8-bit format/version. Currently 0.
- TS: 64-bit unsigned, big-endian. Milliseconds since Unix epoch when the frame was created.
- PAYLOAD: opaque bytes; for the ingest service, this is the user-provided JSON or NDJSON content.
Header size: 4 + 4 + 1 + 8 = 17 bytes.
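A minimal sketch of reading these header fields with Node's Buffer API, assuming the layout above (function name is hypothetical):

```javascript
// Sketch: decode the 17-byte frame header described above.
const HDR = 17;

function decodeHeader(buf, off = 0) {
  const len = buf.readUInt32BE(off); // LEN: total frame size (header + payload)
  return {
    len,
    crc: buf.readUInt32BE(off + 4),     // CRC32C over PAYLOAD only
    fmt: buf.readUInt8(off + 8),        // FMT: currently 0
    tsMs: buf.readBigUInt64BE(off + 9), // TS: ms since Unix epoch
    payloadLen: len - HDR,
  };
}
```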
Note on tenant isolation: Each tenant has their own endpoints and S3 bucket prefixes, so no per-record tenant field is needed.
Frames are simply appended one after another. Readers should iterate until end-of-file.
Robust parsing rules
- Decompress the .zst stream to obtain raw .wal bytes (streaming preferred).
- Maintain a buffer. While buffer length ≥ 17 bytes:
  - Read LEN and CRC.
  - If LEN < 17 (minimum header size), stop (invalid/truncated tail).
  - If buffer length < LEN, wait for more bytes.
  - Extract payloadLen = LEN - 17 and the payload slice (starting at offset 17).
  - If validating, compute CRC32C over the payload and compare to CRC; stop or error on mismatch.
  - Emit or process the payload.
  - Advance the buffer by LEN bytes and continue.
- Reaching EOF with an incomplete header or partial frame is not an error; stop.
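The rules above can be sketched as a buffer-driven parser. Node's standard library has no CRC32C, so a small table-based Castagnoli implementation is included here; a production consumer might instead use a package such as fast-crc32c, as the reference extractor does. Function names are hypothetical.

```javascript
// CRC32C (Castagnoli), reflected table-based implementation.
const CRC32C_TABLE = (() => {
  const t = new Uint32Array(256);
  for (let n = 0; n < 256; n++) {
    let c = n;
    for (let k = 0; k < 8; k++) c = c & 1 ? 0x82f63b78 ^ (c >>> 1) : c >>> 1;
    t[n] = c >>> 0;
  }
  return t;
})();

function crc32c(buf) {
  let c = 0xffffffff;
  for (const b of buf) c = CRC32C_TABLE[(c ^ b) & 0xff] ^ (c >>> 8);
  return (c ^ 0xffffffff) >>> 0;
}

const HDR = 17;

// Parse as many complete frames as `buf` holds; returns the payloads and
// the number of bytes consumed, so a streaming caller can keep the rest.
function parseFrames(buf, validate = true) {
  const payloads = [];
  let off = 0;
  while (buf.length - off >= HDR) {
    const len = buf.readUInt32BE(off);
    if (len < HDR) break;              // invalid/truncated tail: stop
    if (buf.length - off < len) break; // partial frame: wait for more bytes
    const crc = buf.readUInt32BE(off + 4);
    const payload = buf.subarray(off + HDR, off + len);
    if (validate && crc32c(payload) !== crc) {
      throw new Error(`CRC32C mismatch at offset ${off}`);
    }
    payloads.push(payload);
    off += len;
  }
  return { payloads, consumed: off };
}
```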
Step-by-step extraction
- Download and decompress
  - Stream from S3 and pipe through zstd -d (CLI) or use a zstd library.
- Parse frames
- For each frame, read the 17-byte header, then the payload of computed length.
- Verify CRC32C over payload if you need integrity validation.
- Emit JSON
  - The PAYLOAD is the original JSON/NDJSON. If it does not end with \n, you may append a newline for NDJSON output convenience.
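The newline rule can be captured in a small helper (a sketch; the function name is hypothetical):

```javascript
// Sketch: normalize a payload for NDJSON output, appending a trailing
// newline only when the payload does not already end with one.
function toNdjsonLine(payload) {
  const s = payload.toString('utf8');
  return s.endsWith('\n') ? s : s + '\n';
}
```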
Reference implementation
Contact us for a working Node.js streaming extractor that:
- Accepts input from S3 (via AWS SDK) or a local file.
- Uses the zstd CLI for decompression.
- Parses frames incrementally.
- Optionally validates CRC32C (uses fast-crc32c).
- Writes NDJSON to stdout.
Usage examples:
node scripts/extract_segment.js --s3 s3://BUCKET/path/to/seg-00000009.wal.zst --region us-east-1 > out.ndjson
node scripts/extract_segment.js --file ./seg-00000009.wal.zst --no-crc > out.ndjson
Pseudocode for parsers in other languages
buf = bytes()
HDR = 17
while stream has data:
buf += read()
while len(buf) >= HDR:
frameLen = u32be(buf[0:4])
crc = u32be(buf[4:8])
if frameLen < HDR: stop
if len(buf) < frameLen: break
payloadLen = frameLen - HDR
payload = buf[HDR:HDR+payloadLen]
if validate_crc and crc32c(payload) != crc: error/stop
emit(payload)
    buf = buf[frameLen:]
Notes and constraints
- All integers in the frame header are big-endian.
- CRC32C validation is performed over PAYLOAD only.
- Frames are back-to-back; there is no delimiter between frames.
- Compressed object uses a single zstd stream.
- Sealed segments uploaded to S3 should not be corrupted; CRC checks are useful for consumer-side verification.
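The outer streaming loop in the pseudocode above translates almost directly to Node. A minimal sketch that consumes an iterable of Buffer chunks (CRC validation elided here for brevity; see the robust parsing rules above, and note the function name is hypothetical):

```javascript
const HDR = 17;

// Sketch: yield frame payloads from an iterable of Buffer chunks,
// e.g. chunks emitted by a decompression stream. CRC check omitted.
function* frames(chunks) {
  let buf = Buffer.alloc(0);
  for (const chunk of chunks) {
    buf = Buffer.concat([buf, chunk]);
    while (buf.length >= HDR) {
      const len = buf.readUInt32BE(0);
      if (len < HDR) return;        // invalid/truncated tail: stop
      if (buf.length < len) break;  // partial frame: wait for more bytes
      yield buf.subarray(HDR, len); // PAYLOAD
      buf = buf.subarray(len);
    }
  }
}
```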