Do you know if this or something else, like Merkle trees, could be used to efficiently diff datasets (e.g. an SQL database)?
My rough idea: Consider a table "items" that you want to back up and make a perfect copy of. You want to keep this backup up to date and also verify its integrity. The data isn't timestamped and can be billions of rows, though we can assume each row has a primary key that doesn't change. To do this, compute a Merkle tree in which each leaf node covers a non-overlapping partition of the items table. To determine which partition a row belongs to, assign each row a bucket with "hash(pk) mod N", which we could even precompute at insert time. To diff the "items" table with a remote mirror, you just compare the Merkle trees recursively and only copy the rows with differences.
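A minimal sketch of that partitioning scheme, assuming N is a power of two and using SHA-256 (both are illustrative choices, not requirements; all names here are hypothetical):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def bucket_of(pk: str, n_buckets: int) -> int:
    # hash(pk) mod N -- stable as long as the primary key never changes,
    # so it could be precomputed at insert time
    return int.from_bytes(h(pk.encode()), "big") % n_buckets

def build_tree(rows: dict[str, bytes], n_buckets: int) -> list[list[bytes]]:
    """Build a Merkle tree whose leaves are non-overlapping partitions of
    the rows. Returns the levels bottom-up; tree[-1][0] is the root hash."""
    buckets: list[list[tuple[str, bytes]]] = [[] for _ in range(n_buckets)]
    for pk, row in rows.items():
        buckets[bucket_of(pk, n_buckets)].append((pk, row))
    # leaf hash: hash of the bucket's rows in sorted-pk order, so the
    # result is independent of insertion order
    level = [h(b"".join(h(pk.encode() + row) for pk, row in sorted(b)))
             for b in buckets]
    tree = [level]
    while len(level) > 1:  # n_buckets assumed to be a power of two
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        tree.append(level)
    return tree
```

Two mirrors holding identical rows produce identical roots, while a single changed row perturbs exactly one leaf plus the O(log N) hashes above it, which is what makes the recursive diff cheap.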
The challenge, as I understand it, is to compute the hashes incrementally in a live system in a way that preserves consistency. The backup copy's tree doesn't change until you do a diff, but as the "items" table undergoes database updates, we need to (1) serialize the updates so the hash updates are applied in a consistent order, (2) update the hashes at each level of the tree, and (3) keep all of this transactionally consistent with the original data.
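Point (2) on its own is the cheap part: a single row change touches one leaf and the path above it. A toy sketch of that, again with a power-of-two bucket count and hypothetical names; it deliberately ignores (1) and (3), i.e. it assumes updates arrive already serialized and says nothing about transactions:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

N_BUCKETS = 4  # assumed power of two

def bucket_of(pk: str) -> int:
    return int.from_bytes(h(pk.encode()), "big") % N_BUCKETS

def leaf_hash(bucket: dict[str, bytes]) -> bytes:
    # hash the bucket's rows in sorted-pk order for determinism
    return h(b"".join(h(pk.encode() + row)
                      for pk, row in sorted(bucket.items())))

def build(rows: dict[str, bytes]):
    buckets = [dict() for _ in range(N_BUCKETS)]
    for pk, row in rows.items():
        buckets[bucket_of(pk)][pk] = row
    level = [leaf_hash(b) for b in buckets]
    tree = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        tree.append(level)
    return tree, buckets

def apply_update(tree, buckets, pk: str, row: bytes) -> None:
    # re-hash only the affected leaf, then its O(log N) ancestors
    i = bucket_of(pk)
    buckets[i][pk] = row
    tree[0][i] = leaf_hash(buckets[i])
    for lvl in range(1, len(tree)):
        i //= 2
        tree[lvl][i] = h(tree[lvl - 1][2 * i] + tree[lvl - 1][2 * i + 1])
```

In a real system, `apply_update` is exactly the piece that would need to run inside the same transaction as the row write and see updates in a single serialized order.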
The above is trivial if the hash tree is computed at diff time, because you could just sort and scan both tables to build the trees in memory. But this would not be efficient for very large datasets that you want to compare fairly frequently. (It also gets much trickier if the input table is updated often enough that the hashes are already out of date by the time a long scan finishes.)
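For contrast, the diff step itself is the easy recursive part: compare roots and descend only into subtrees whose hashes differ, ending with the leaf partitions whose rows need copying. A sketch, assuming each tree is stored bottom-up as a list of levels (leaves first) with a power-of-two leaf count; the names are hypothetical:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def levels_from_leaves(leaves: list[bytes]) -> list[list[bytes]]:
    # hash adjacent pairs upward; leaf count assumed power of two
    tree = [leaves]
    while len(tree[-1]) > 1:
        prev = tree[-1]
        tree.append([h(prev[i] + prev[i + 1])
                     for i in range(0, len(prev), 2)])
    return tree

def diff_buckets(tree_a, tree_b) -> list[int]:
    """Indices of leaf partitions whose contents differ between the two
    mirrors; subtrees with matching hashes are never visited."""
    out = []
    def walk(lvl: int, i: int) -> None:
        if tree_a[lvl][i] == tree_b[lvl][i]:
            return
        if lvl == 0:
            out.append(i)
        else:
            walk(lvl - 1, 2 * i)
            walk(lvl - 1, 2 * i + 1)
    walk(len(tree_a) - 1, 0)
    return out
```

If the mirrors match, this is a single root comparison; otherwise the work is proportional to the number of differing partitions times the tree depth, not the table size.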
To date, I've not seen a Merkle tree implementation like this that is also persistent and usable in a database setting. Prolly trees may work, but they're a fairly low-level data structure that wouldn't map to something you can store in, say, Postgres, I think. Do you know of any prior art, aside from Prolly trees?
The paper describes a Merkle-DAG that can sync two copies together, because it's also a CRDT. I don't think it's exactly what you want, but I think if you're building an implementation, it gives you the core ideas of what you're looking for.
I can't describe the exact use case, but suffice it to say I'd like to incrementally replicate data (in one direction) without tracking what has changed or having to index the data.
I'm attracted to the simplicity of Merkle trees, where you can use hashes as a much smaller proxy for the upstream data, and even use the root-level hash as a "signature" representing the state of the dataset.
A naive, non-Merkle-y solution would be to associate each datum with a modification timestamp and simply grab any data where "updated_at > $last_timestamp"; this would require an index to filter efficiently, however. Another simple solution would be a log that tracked every insert, which a process could then tail; this would require all the bookkeeping that entails.
You describe a common use case. There's some legacy batch-processing-centric workflow, which couldn't possibly be modernized, and the PHBs demand an app that's best implemented with events (messages).
Like when there's some legacy flat file (more or less), hosted on a mainframe, serving as the "single source of truth" for a new customer facing app. Or when the "data feed" from some partner org is basically a file transfer of some report (data dump).
Please post a Show HN if you come up with a solution you like.
The easiest way to do this is to replicate to a database system that uses Prolly trees for its storage, then run your diffs in that other system, i.e. Dolt. It currently supports MySQL binlog replication; Postgres logical replication is coming in about a week.
I'm kind of asking about something that would work with any kind of data representation. The hypothetical "items" table could be an S3 bucket in reality.
So perhaps Dolt could be used purely for storing the metadata (mainly the primary keys), with the original data being elsewhere. The main engineering challenge would be keeping it in perfect sync with the upstream data.