Prolly Trees (dolthub.com)
118 points by ingve on March 8, 2024 | 22 comments


I've been looking at Prolly Trees in depth and they're pretty neat! They have nice properties like:

- self-balancing (probabilistically!)

- unicit (the same data in the tree results in the same tree)

- fast comparisons (they're a Merkle tree, so just comparing root hashes is enough)

The original formulation of Prolly Trees had an expected node size equal to the target node size, but the node sizes followed a geometric distribution. That meant there was a long tail of node sizes far beyond the target.
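
For illustration, here's a minimal toy sketch of that original splitting rule (my own simplification; real implementations hash a rolling window of bytes rather than whole entries). Each entry triggers a split independently with probability 1/target, which is exactly what yields the geometric distribution:

    import hashlib

    TARGET = 4096  # target average chunk size, in entries

    def should_split(entry: bytes) -> bool:
        # Split with probability 1/TARGET at every entry, independent of
        # how large the current chunk already is -> geometric chunk sizes.
        h = int.from_bytes(hashlib.sha256(entry).digest()[:8], "big")
        return h % TARGET == 0

    def chunk(entries):
        current = []
        for e in entries:
            current.append(e)
            if should_split(e):
                yield current
                current = []
        if current:
            yield current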

The Dolt formulation is pretty neat. Chef's kiss. It uses the current chunk size thus far as input into a cumulative distribution function to shape the node sizes into a normal distribution, so the standard deviation is under control. But now the decision to split a node is no longer independent: it depends on the chunk size so far, which means a decision to rebalance nodes after an insert can cascade to the entire right side of the tree with decreasing probability.
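
Here's a rough sketch of the general idea (my own approximation, not Dolt's actual code): treat the entry's hash as a uniform draw in [0, 1) and compare it to the hazard rate of a normal distribution evaluated at the current chunk size, so the probability of splitting grows as the chunk does:

    import hashlib, math

    TARGET_SIZE = 4096   # desired mean chunk size
    STDDEV = 512         # desired spread of chunk sizes

    def normal_cdf(x, mean, std):
        return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

    def split_probability(size_so_far):
        # Hazard rate: probability of splitting *now*, given that we
        # haven't split yet. Grows as size_so_far approaches the target.
        f = normal_cdf(size_so_far, TARGET_SIZE, STDDEV)
        f_next = normal_cdf(size_so_far + 1, TARGET_SIZE, STDDEV)
        remaining = 1.0 - f
        return 1.0 if remaining <= 0.0 else (f_next - f) / remaining

    def should_split(entry: bytes, size_so_far: int) -> bool:
        h = int.from_bytes(hashlib.sha256(entry).digest()[:8], "big")
        return h / 2**64 < split_probability(size_so_far)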

But good news! This can be mitigated by increasing the target node size or by relaxing the target std dev of the node sizes. I detail the investigation here:

https://interjectedfuture.com/lab-notes/lab-note-033-cascadi...


Can you share any insight on latency? I feel that, while a prolly tree has the same big O bounds as a BTree, there are extremely large constant factors involved.

In the benchmark of their new storage engine that they published on their blog, they had read operations take 3 us and write operations take 96 us for in-memory workloads.

That's an order of magnitude slower than what I get from a persistent trie implementation that uses unicity to achieve node hashability (instead of the other way around as in a prolly tree).[1]

I could imagine that this becomes less pronounced in the durable on-disk case, as storage bandwidth and latency become the dominating factors?

1: https://github.com/triblesspace/tribles-rust/blob/master/src...


Can you comment more on how you use persistent tries with unicity? And what do you mean by "the other way around"?

I can't speak to latency fully as of yet. But in my own incremental implementation of a Prolly Tree, it appears to have the same Big-O bounds as a B-tree (as you said). But I can see there are places where it has to do a bit more work:

- Since Merkle tree nodes are identified by their hash, you have to do lookups to find children, rather than just following pointers.

- The tree has to touch a swath of nodes from the bottom up due to potential rebalancing, unlike a B-tree, where even with splits and merges you only change a single seam of nodes from root to leaf.

- Prolly Trees are constructed from the bottom up, and you have to traverse across a layer. To find the next sibling, you sometimes have to travel up to the parents and come back down, so you're subject to the occasional height traversal in the worst case (see the sketch below). This doesn't happen often, especially if the cascading effect on insertion I mentioned doesn't kick in.
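
To illustrate that last point, a toy sketch of the sibling step (invented node structure, not from any real implementation). Most of the time the parent has a right neighbor one level up; the worst case walks all the way to the root and back down:

    class Node:
        def __init__(self, parent=None):
            self.parent = parent   # None at the root
            self.children = []     # non-empty for internal nodes

    def next_sibling(node):
        # Walk up until some ancestor has a right neighbor, then descend
        # back down its leftmost spine to the level we started at.
        depth, cur = 0, node
        while cur.parent is not None:
            siblings = cur.parent.children
            i = siblings.index(cur)
            if i + 1 < len(siblings):
                nxt = siblings[i + 1]
                for _ in range(depth):
                    nxt = nxt.children[0]
                return nxt
            cur = cur.parent
            depth += 1
        return None  # node was on the rightmost spine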

But like you said, maybe there are other factors that dominate the latency equation. It's hard for me to say at this point. Relevant post:

https://motherduck.com/blog/perf-is-not-enough/


Disclaimer up front: I somewhat agree with the performance sentiment towards databases. However, I feel that index performance is a different thing, because it gives you a larger budget for features further up the chain that make a meaningful non-performance difference for users, e.g. dynamic skew estimation and adaptation saving you from manual performance tuning. To make an additional performance point about my use case: the work I linked is not competing with on-disk databases but with in-memory data structures like nested hash maps, and I would argue the points from the article you linked apply in that scenario too, just with the scales adjusted appropriately (think ns and us instead of us and ms).

To answer the question: prolly trees have a deterministic structure based on the chunking properties of their hash function. Their deterministic structure is a result of using a hash to identify nodes, and is only implicitly defined over the contained keys (by rolling the hash over the leaves).

Tries on the other hand, by virtue of being prefix trees, have a structure that is deterministic and directly based on the keys they contain. This also allows you to compute a hash for each node (determined by the key range it represents), but because hashing is now orthogonal to the structure, you can choose tradeoffs more flexibly.

You can for example use an incremental hash function [1], which in my case is SipHash for the leaves and xor as a combiner. (Note the importance of using a keyed hash function like SipHash when xor is used as a combiner; see the XHASH attack in [1].) So whenever a child is updated, the new node hash can be maintained as `node.hash = node.hash xor old_child.hash xor new_child.hash`.
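
A minimal sketch of that scheme (my toy version; Python's keyed BLAKE2b stands in for SipHash):

    import hashlib, os

    KEY = os.urandom(16)  # random key chosen at library initialisation

    def leaf_hash(data: bytes) -> int:
        # Keyed hash of a leaf; the key is what defends against
        # XHASH-style attacks [1].
        d = hashlib.blake2b(data, key=KEY, digest_size=16).digest()
        return int.from_bytes(d, "big")

    def combine(child_hashes) -> int:
        # xor combiner: order-independent and self-inverse.
        h = 0
        for c in child_hashes:
            h ^= c
        return h

    def replace_child(node_hash: int, old: int, new: int) -> int:
        # Incremental update: "remove" the old child hash, "add" the
        # new one, without touching any other children.
        return node_hash ^ old ^ new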

The original Noms author actually considered PATRICIA tries and HAMTs, but dismissed unhashed prefix tries due to bad performance [2]. I haven't run into this yet in my use case, however, and I suspect that the overhead of string interning, indirection vs. direct pointers, and more expensive hash functions will always outweigh it in the in-memory case.

I'm still very interested in using prolly trees in the on-disk case, hence my question about latency :D

1: https://cseweb.ucsd.edu//~Mihir/papers/inchash.pdf

2: https://github.com/attic-labs/noms/blob/master/doc/intro.md#...

BTW, can you point me to where the "unicit" term originates? A quick Google search only ever shows it used in your writings :D


Ah, thanks for the explanation. Some quick follow ups:

Oh, so an incremental hash function is where you don't have to hash the entire thing again, but you can use the combiner (xor) to "remove" the old_child.hash and "add" the new_child.hash? And you use this to decrease the cost of hashing?

Huh, I didn't make up the term "unicit". I read it somewhere, presumably from Aaron Boodman, who invented the Prolly Tree.


Exactly! The xor combiner is somewhat special in that it is self inverse, so a xor b xor a = b. This property doesn't hold for the other combiners, so for any other combiner ⊙ you would need to compute the inverse `old_child_hash^-1` for the removed hash so that `new_hash = old_hash ⊙ old_child_hash^-1 ⊙ new_child_hash`.
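
For a concrete example of a non-self-inverse combiner (my choice, not from the paper): with addition mod 2^64 as ⊙, the inverse is the additive negation:

    M = 2**64

    def replace_child_add(node_hash, old_child, new_child):
        # old_child's inverse under addition mod 2^64 is (M - old_child),
        # so "remove old, add new" becomes subtract-then-add.
        return (node_hash - old_child + new_child) % M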

Although I feel like I should reiterate that the xor combiner (in the paper termed XHASH) does not produce a cryptographic hash in any way; to quote the paper, "Indeed, it is not even one-way". For my use case this means that you can only perform set operations in O(|common-prefix|) time with trusted peers, and I actually don't expose the hash beyond the process itself. You can ofc ingest data from untrusted peers; that's why the randomization function (the function used for the keys/leaves) is keyed, with a key that is chosen randomly at library initialisation.

In my case I'm storing hashes in the branch nodes but recomputing hashes for the leaves, so when performing set operations between two tries you often don't have to compute any hash.

I didn't mean to imply that; I was just hoping to find a new root for research-paper exploration. But even if you did coin it, that would be fine: it's a property that should have its own name. I personally call them "deterministic" trees :D.


Do you know if this or something else, like Merkle trees, could be used to efficiently diff datasets (e.g. an SQL database)?

My rough idea: Consider a table "items" that you want to back up and make a perfect copy of. You want to keep this backup up to date and also verify its integrity. The data isn't timestamped and can be billions of rows, though we can assume each row has a primary key that doesn't change. To do this, compute a Merkle tree in which each leaf covers a non-overlapping partition of the items table. To determine which partition a row belongs to, assign each row to partition "hash(pk) mod N", which we could even precompute at insert time. To diff the "items" table with a remote mirror, you just compare the Merkle tree recursively and only copy the rows with differences.
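
A toy sketch of that scheme (flattened to one level of N partition hashes for brevity; the helpers are hypothetical). xor-combining rows within a partition makes each partition hash independent of scan order:

    import hashlib

    N = 1024  # number of partitions

    def partition_of(pk: bytes) -> int:
        return int.from_bytes(hashlib.sha256(pk).digest()[:8], "big") % N

    def partition_hashes(rows):
        # rows: iterable of (pk, row_bytes) pairs.
        parts = [0] * N
        for pk, data in rows:
            h = int.from_bytes(hashlib.sha256(pk + data).digest()[:16], "big")
            parts[partition_of(pk)] ^= h
        return parts

    def diff(local_rows, remote_parts):
        # Only partitions whose hashes differ need their rows copied.
        local_parts = partition_hashes(local_rows)
        return [i for i in range(N) if local_parts[i] != remote_parts[i]]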

The challenge, as I understand it, is to compute the hashes incrementally in a live system in a way that preserves consistency. The backup copy's tree doesn't change until you do a diff, but as the "items" table undergoes database updates, we need to (1) serialize the updates so the hash updates are applied in a consistent order, (2) update the hashes at each level of the tree, and (3) keep all of this transactionally consistent with the original data.

The above is trivial if the hash tree is computed at diff time, because you could just sort and scan both tables to build the tree in memory. But this would not be efficient for very large datasets that you want to compare fairly frequently. (It's also much trickier if the input table is frequently updated and the scan takes so long that the hashes are out of date by the time you're done.)

To date, I've not seen a Merkle tree implementation like this that is also persistent and usable in a database setting. Prolly trees may work, but they're a kind of low level data structure that wouldn't map to something you can store in, say, Postgres, I think. Do you know of any prior art, aside from Prolly trees?


Curious: What do you need something like that for? It sounds like some kind of synced backup of a dataset represented by files on S3?

With what you describe, it kinda sounds like a Merkle-DAG CRDT. https://arxiv.org/pdf/2004.00107.pdf

The paper describes a Merkle-DAG that can sync two copies together, because it's also a CRDT. I don't think it's exactly what you want, but I think if you're building an implementation, it gives you the core ideas of what you're looking for.

I also detail Merkle-DAGs in my search.

https://interjectedfuture.com/crdts-turned-inside-out/

If you're looking for something off-the-shelf, I haven't looked for implementations in the wild, so I can't make recommendations.


I can't describe the exact use case, but suffice it to say I'd like to incrementally replicate data in one direction, without tracking what has changed or having to index the data.

I'm attracted to the simplicity of Merkle trees, where you can use hashes as a much smaller proxy for the upstream data and even use the root-level hash as a "signature" representing the state of the dataset.

A naive, non-Merkle-y solution would be to associate each datum with a modification timestamp and simply grab any data where "updated_at > $last_timestamp"; this would require an index to filter efficiently, however. Another simple solution would be a log that tracked every insert, which a process could then tail; this would require all the bookkeeping that entails.

Thanks, I will check out those links.


Great question.

You describe a common use case. There's some legacy batch-processing-centric workflow, which couldn't possibly be modernized, and the PHBs demand an app that's best implemented with events (messages).

Like when there's some legacy flat file (more or less), hosted on a mainframe, serving as the "single source of truth" for a new customer facing app. Or when the "data feed" from some partner org is basically a file transfer of some report (data dump).

Please post a Show HN if you come up with a solution you like.


The easiest way to do this is to replicate to a database system that uses prolly trees for its storage, then run your diffs in that other system, i.e. Dolt. It currently supports MySQL binlog replication, with Postgres logical replication coming in about a week.


I'm kind of asking about something that would work with any kind of data representation. The hypothetical "items" table could be an S3 bucket in reality.

So perhaps Dolt could be used purely for storing the metadata (mainly the primary keys), with the original data being elsewhere. The main engineering challenge would be keeping it in perfect sync with the upstream data.


Also the rolling hash is only on map keys, so replacing fixed-size values requires no splits.


I don’t know who came first, but https://github.com/jacobgorm/mindcastle.io also uses the rsync/LBFS rolling hashes trick to split the tree data into chunks. I presented the idea at Usenix Vault 2019 https://m.youtube.com/watch?v=QgOkDiP0C4c&embeds_referring_e...


I think rsync came first?

The original formulation of Prolly Trees does reference rsync.

https://github.com/attic-labs/noms/blob/master/doc/intro.md#...


> If the hash of the root chunk is the same the entire subtree is the same.

Shouldn't there be some discussion on hash collisions? Why could different trees not result in identical hashes?

Are they using cryptographic hash functions with a large enough number of output bits (>=128) that collision odds can be considered negligible?


The Dolt implementation uses a 20-byte address space with a SHA-256 cryptographic hash function. The risk of a collision is trivial.
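
For scale (my arithmetic, not from the thread): by the birthday bound, the probability of any collision among n distinct chunks with 160-bit addresses is roughly n^2 / 2^161:

    def collision_probability(n: int, bits: int = 160) -> float:
        # Birthday-bound approximation: p ~= n^2 / 2^(bits + 1).
        return n * n / 2 ** (bits + 1)

    print(collision_probability(10**12))  # ~3.4e-25 for a trillion chunks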


Negligible*?


Would it make sense to use a Prolly Tree’s root hash as the ETag for a collection resource in an HTTP API?

Given an HTTP API that:

- includes a collection resource to which items may be added and removed

- needs to support an ETag response header that reflects the current set of items

- where it is prohibitively expensive to enumerate all items to recalculate the ETag

It seems an incrementally maintained Prolly Tree would be an efficient mechanism and the root hash could be used as the ETag since the order of operations isn’t important. Do I have that right?
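
For illustration, I'm imagining something like this (hypothetical tree API and handler shape, not any real library):

    def etag_for(tree) -> str:
        # The root hash already reflects every item, regardless of the
        # order in which items were added or removed.
        return '"' + tree.root_hash().hex() + '"'

    def handle_get(if_none_match, tree):
        tag = etag_for(tree)
        if if_none_match == tag:
            return 304, {}, b""                    # Not Modified
        return 200, {"ETag": tag}, b"...items..."  # body elided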


Yes, it sounds like it.


Excellent article! Very well written, good explanations, useful graphics.


nice!



