Do you know if this or something else, like Merkle trees, could be used to efficiently diff datasets (e.g. an SQL database)?
My rough idea: Consider a table "items" that you want to back up and make a perfect copy of. You want to keep this backup up to date and also verify its integrity. The data isn't timestamped and can be billions of rows, though we can assume each row has a primary key that doesn't change. To do this, compute a Merkle tree in which each leaf node covers a non-overlapping partition of the items table. To determine which partition a row belongs to, assign each row a bucket with "hash(pk) mod N", which we could even precompute at insert time. To diff the "items" table with a remote mirror, you just compare the Merkle trees recursively and only copy the rows with differences.
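A minimal sketch of that partitioning scheme, assuming N is a power of two and using SHA-256 (both are illustrative choices, not requirements; all names here are hypothetical):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def bucket_of(pk: str, n_buckets: int) -> int:
    # hash(pk) mod N -- stable as long as the primary key never changes,
    # so it could be precomputed at insert time
    return int.from_bytes(h(pk.encode()), "big") % n_buckets

def build_tree(rows: dict[str, bytes], n_buckets: int) -> list[list[bytes]]:
    """Build a Merkle tree whose leaves are non-overlapping partitions of
    the rows. Returns the levels bottom-up; tree[-1][0] is the root hash."""
    buckets: list[list[tuple[str, bytes]]] = [[] for _ in range(n_buckets)]
    for pk, row in rows.items():
        buckets[bucket_of(pk, n_buckets)].append((pk, row))
    # leaf hash: hash of the bucket's rows in sorted-pk order, so the
    # result is independent of insertion order
    level = [h(b"".join(h(pk.encode() + row) for pk, row in sorted(b)))
             for b in buckets]
    tree = [level]
    while len(level) > 1:  # n_buckets assumed to be a power of two
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        tree.append(level)
    return tree
```

Two mirrors holding identical rows produce identical roots, while a single changed row perturbs exactly one leaf plus the O(log N) hashes above it, which is what makes the recursive diff cheap.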
The challenge, as I understand it, is to compute the hashes incrementally in a live system in a way that preserves consistency. The backup copy's tree doesn't change until you do a diff, but as the "items" table undergoes database updates, we need to (1) serialize the updates so the hash updates are applied in a consistent order, (2) update the hashes at each level of the tree, and (3) keep all of this transactionally consistent with the original data.
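Point (2) on its own is the cheap part: a single row change touches one leaf and the path above it. A toy sketch of that, again with a power-of-two bucket count and hypothetical names; it deliberately ignores (1) and (3), i.e. it assumes updates arrive already serialized and says nothing about transactions:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

N_BUCKETS = 4  # assumed power of two

def bucket_of(pk: str) -> int:
    return int.from_bytes(h(pk.encode()), "big") % N_BUCKETS

def leaf_hash(bucket: dict[str, bytes]) -> bytes:
    # hash the bucket's rows in sorted-pk order for determinism
    return h(b"".join(h(pk.encode() + row)
                      for pk, row in sorted(bucket.items())))

def build(rows: dict[str, bytes]):
    buckets = [dict() for _ in range(N_BUCKETS)]
    for pk, row in rows.items():
        buckets[bucket_of(pk)][pk] = row
    level = [leaf_hash(b) for b in buckets]
    tree = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        tree.append(level)
    return tree, buckets

def apply_update(tree, buckets, pk: str, row: bytes) -> None:
    # re-hash only the affected leaf, then its O(log N) ancestors
    i = bucket_of(pk)
    buckets[i][pk] = row
    tree[0][i] = leaf_hash(buckets[i])
    for lvl in range(1, len(tree)):
        i //= 2
        tree[lvl][i] = h(tree[lvl - 1][2 * i] + tree[lvl - 1][2 * i + 1])
```

In a real system, `apply_update` is exactly the piece that would need to run inside the same transaction as the row write and see updates in a single serialized order.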
The above is trivial if the hash tree is computed at diff time, because you could just sort and scan both tables to build the trees in memory. But this would not be efficient for very large datasets that you want to compare fairly frequently. (It also gets much trickier if the input table is updated often enough that the hashes are already out of date by the time a long scan finishes.)
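For contrast, the diff step itself is the easy recursive part: compare roots and descend only into subtrees whose hashes differ, ending with the leaf partitions whose rows need copying. A sketch, assuming each tree is stored bottom-up as a list of levels (leaves first) with a power-of-two leaf count; the names are hypothetical:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def levels_from_leaves(leaves: list[bytes]) -> list[list[bytes]]:
    # hash adjacent pairs upward; leaf count assumed power of two
    tree = [leaves]
    while len(tree[-1]) > 1:
        prev = tree[-1]
        tree.append([h(prev[i] + prev[i + 1])
                     for i in range(0, len(prev), 2)])
    return tree

def diff_buckets(tree_a, tree_b) -> list[int]:
    """Indices of leaf partitions whose contents differ between the two
    mirrors; subtrees with matching hashes are never visited."""
    out = []
    def walk(lvl: int, i: int) -> None:
        if tree_a[lvl][i] == tree_b[lvl][i]:
            return
        if lvl == 0:
            out.append(i)
        else:
            walk(lvl - 1, 2 * i)
            walk(lvl - 1, 2 * i + 1)
    walk(len(tree_a) - 1, 0)
    return out
```

If the mirrors match, this is a single root comparison; otherwise the work is proportional to the number of differing partitions times the tree depth, not the table size.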
To date, I've not seen a Merkle tree implementation like this that is also persistent and usable in a database setting. Prolly trees may work, but they're a fairly low-level data structure that wouldn't map to something you can store in, say, Postgres, I think. Do you know of any prior art, aside from Prolly trees?
The paper describes a Merkle-DAG that can sync two copies together, because it's also a CRDT. I don't think it's exactly what you want, but I think if you're building an implementation, it gives you the core ideas of what you're looking for.
I can't describe the exact use case, but suffice it to say I'd like to incrementally replicate data (in one direction) without tracking what has changed or having to index the data.
I'm attracted to the simplicity of Merkle trees, where you can use hashes as a much smaller proxy for the upstream data, and even use the root-level hash as a "signature" representing the state of the dataset.
A naive, non-Merkle-y solution would be to associate each datum with a modification timestamp and simply grab any data where "updated_at > $last_timestamp"; this would require an index to filter efficiently, however. Another simple solution would be a log that tracked every insert, which a process could then tail; this would require all the bookkeeping that entails.
You describe a common use case. There's some legacy batch-processing-centric workflow, which couldn't possibly be modernized, and the PHBs demand an app that's best implemented with events (messages).
Like when there's some legacy flat file (more or less), hosted on a mainframe, serving as the "single source of truth" for a new customer facing app. Or when the "data feed" from some partner org is basically a file transfer of some report (data dump).
Please post a Show HN if you come up with a solution you like.
The easiest way to do this is to replicate to a database system that uses Prolly trees for its storage, then run your diffs in that other system, i.e. Dolt. It currently supports MySQL binlog replication; Postgres logical replication is coming in about a week.
I'm kind of asking about something that would work with any kind of data representation. The hypothetical "items" table could be an S3 bucket in reality.
So perhaps Dolt could be used purely for storing the metadata (mainly the primary keys), with the original data being elsewhere. The main engineering challenge would be keeping it in perfect sync with the upstream data.