A reference implementation of the watermark-based change-data capture algorithm, based on the original DBLog paper and Netflix Technology Blog post.
A downstream system (a search index, a cache, a data warehouse) needs to track every change in a database. To feed it, you need two streams:
The full current state of the tables, as a snapshot.
Every subsequent change: INSERT / UPDATE / DELETE, as it happens, read from the database log.
To combine these streams, coordination is needed to ensure that the output converges to the latest state. Typical approaches achieve this by blocking writes at the application layer or with table locks, pausing log-event processing, or letting change events accumulate until the snapshot has completed.
To make things more interesting, a full-state read is usually needed more than once. Beyond the initial load, it may be needed later: to backfill a new consumer, repair corrupted downstream state, reload after a restore, or re-read specific primary keys.
DBLog captures table state through bounded chunk selects and consumes the change stream concurrently, with no source-side locks, while still producing a correct, ordered output.
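As a rough illustration of a bounded chunk select, the sketch below builds a keyset-paginated query: each chunk starts after the last primary key already read and is capped by a row limit. The class name ChunkQuery and its method are hypothetical and not taken from the reference implementation. The sketch assumes a single-column primary key, trusted identifier names (nothing is escaped), and a dialect that accepts LIMIT, as MySQL and PostgreSQL do.

```java
// Hypothetical helper for illustration only; not part of the reference implementation.
final class ChunkQuery {
    private ChunkQuery() {}

    /**
     * Builds the SQL for one bounded chunk.
     * First parameter (?) is the last primary key already read,
     * second parameter (?) is the maximum number of rows in the chunk.
     * Assumes table and pkColumn are trusted identifiers (no escaping is done).
     */
    static String nextChunkSql(String table, String pkColumn) {
        return "SELECT * FROM " + table
                + " WHERE " + pkColumn + " > ?"
                + " ORDER BY " + pkColumn + " ASC"
                + " LIMIT ?";
    }
}
```

With a JDBC PreparedStatement, the two placeholders would be bound to the last seen key and the chunk size. Ordering by primary key is what lets the next chunk resume exactly where the previous one ended.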
The trick is two markers written into the source database and appearing in the change stream afterwards:
▼ LW — the low watermark, written to the source before reading a chunk of rows.
▲ HW — the high watermark, written to the source after reading the chunk.
Between LW and HW, DBLog keeps consuming live log events normally. It also reads a bounded chunk of rows from the source — say primary keys 1–100 — and buffers them. When HW comes back from the log, DBLog has a clean picture of every change that happened to the table while the chunk was being read.
The reconciliation rule is one sentence: if a buffered chunk row shares a primary key with any in-window log event, drop the chunk row. The log version is newer, so the buffered copy is stale. The surviving chunk rows are emitted on HW, and the reader's durable position advances.
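The rule above can be sketched as a simple filter over the buffered chunk. This is a minimal sketch, not the reference implementation's code: the class ChunkReconciler and the method survivors are hypothetical names, and the chunk is assumed to be held as a map from primary key to row.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration only; not part of the reference implementation.
final class ChunkReconciler {
    private ChunkReconciler() {}

    /**
     * Returns the buffered chunk rows whose primary key was NOT touched by
     * any log event seen between the low and the high watermark.
     * Chunk order is preserved.
     */
    static <K, R> Map<K, R> survivors(Map<K, R> chunkRows, List<K> inWindowLogKeys) {
        Set<K> touched = new HashSet<>(inWindowLogKeys);
        Map<K, R> result = new LinkedHashMap<>();
        for (Map.Entry<K, R> entry : chunkRows.entrySet()) {
            // A key seen in the log window means the buffered copy is stale.
            if (!touched.contains(entry.getKey())) {
                result.put(entry.getKey(), entry.getValue());
            }
        }
        return result;
    }
}
```

The type of log event does not matter here. An INSERT, UPDATE, or DELETE on a key inside the window all cause the buffered row for that key to be dropped.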
The cycle for one chunk, start to finish:
UPDATE for pk=42 arrives from the log. The chunk's pk=42 is dropped because the log copy is fresher.
The output: every in-window log event in order, plus the surviving chunk rows (41, 43) emitted on HW.
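The cycle can be simulated in a few lines. The sketch below is a single-threaded toy, assuming the chunk (primary keys 41, 42, 43) has already been selected and buffered, and that the change stream is an in-memory list. The names WatermarkCycleDemo, Event, and run are hypothetical, and the output strings are an illustrative format rather than the reference implementation's.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy simulation; not part of the reference implementation.
final class WatermarkCycleDemo {
    static final class Event {
        final String kind;  // "LW", "HW", "INSERT", "UPDATE", or "DELETE"
        final Integer pk;   // null for watermark events
        Event(String kind, Integer pk) { this.kind = kind; this.pk = pk; }
    }

    /** Replays the log against one buffered chunk and returns the ordered output. */
    static List<String> run(List<Event> log, Map<Integer, String> chunk) {
        List<String> output = new ArrayList<>();
        Map<Integer, String> buffer = new LinkedHashMap<>(chunk);
        boolean inWindow = false;
        for (Event e : log) {
            if (e.kind.equals("LW")) {
                inWindow = true;                       // window opens
            } else if (e.kind.equals("HW")) {
                for (Integer pk : buffer.keySet()) {   // emit surviving chunk rows
                    output.add("CHUNK pk=" + pk);
                }
                buffer.clear();
                inWindow = false;                      // window closes
            } else {
                if (inWindow) {
                    buffer.remove(e.pk);               // log copy wins over chunk copy
                }
                output.add(e.kind + " pk=" + e.pk);    // log events always pass through
            }
        }
        return output;
    }
}
```

Feeding it the stream LW, UPDATE pk=42, HW with the chunk 41, 42, 43 yields the UPDATE for 42 first, followed by the chunk rows for 41 and 43. Log events that arrive after HW are emitted after the chunk rows, so a later change to 41 or 43 still lands on top of the snapshot copy.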
A small Java reference implementation, built from public DBLog material — the original DBLog paper and the Netflix Technology Blog post. Designed for reading and experimentation, not for production.
Source-neutral watermark reconciliation, dump orchestration, targeted repair, and checkpoint safety.
MySQL binlog and PostgreSQL pgoutput, with Docker-backed local demos and tests.
NDJSON, H2 inspector, JDBC apply, no-op sinks, control-plane HTTP, and the Hydroscope tap.