This is a fantastic explanation of developments in hash tables in general.
If anyone is interested, I recently released a scalable hash table for Node.js (@ronomon/hash-table), designed to store literally billions of elements without stressing the GC.
At first, I tried writing a native addon module for Node.js in C (using SIMD), but the JavaScript-to-C call overhead was just too high and doubled latency. In the end, I was surprised to find that a pure JavaScript implementation was faster than the C module with SIMD, simply because it avoided the bridging cost.
Also, instead of linear or quadratic probing (or double hashing or triangular probing), I used cuckoo hashing, but with a twist. Normally, in a naive cuckoo hash table, if an element is not in its first bucket, then its second bucket must be fetched from memory, causing a cache miss, which for a hash table is embarrassingly slow. This is why SwissTable and F14 tend to avoid naive cuckoo hashing.
However, in our design, the first bucket includes an 8-byte coarse-grained bloom filter (k=1, i.e. a single bit per element). If the element is not in this bloom filter, it is guaranteed not to be in either its first or its second bucket, so a cache miss can be avoided nearly 90% of the time. Used this way, cuckoo hashing starts to produce major advantages over linear probing: no tombstones, a higher load factor (with knock-on benefits for resize latency), and constant-time worst-case lookups.
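To make the filter trick concrete, here is a minimal sketch of how an 8-byte, k=1 bloom filter at the front of the first bucket could gate the second-bucket fetch. All names, offsets, and the bit-selection scheme are my own illustrative assumptions, not the actual @ronomon/hash-table internals:

```javascript
// Hypothetical sketch: each bucket is a Buffer whose first 8 bytes
// (64 bits) form a coarse bloom filter, one bit per element (k=1).
// Layout and names are illustrative, not the library's real format.

const FILTER_OFFSET = 0; // filter occupies bytes 0..7 of each bucket

// Derive a filter bit (0..63) from the element's hash. In practice you
// would use hash bits not already consumed by bucket selection.
function filterBit(hash) {
  return hash & 63; // low 6 bits select one of 64 filter bits
}

function filterAdd(bucket, hash) {
  const bit = filterBit(hash);
  bucket[FILTER_OFFSET + (bit >>> 3)] |= 1 << (bit & 7);
}

function filterMightContain(bucket, hash) {
  const bit = filterBit(hash);
  const byte = bucket[FILTER_OFFSET + (bit >>> 3)];
  return (byte & (1 << (bit & 7))) !== 0;
}

// Lookup gate: if the first bucket's filter misses, the element is in
// neither bucket, so the second bucket is never fetched (no cache miss).
function mightNeedSecondBucket(firstBucket, hash) {
  return filterMightContain(firstBucket, hash);
}
```

Because the filter lives in the same cache line as the first bucket, the negative case costs nothing extra: one memory access answers "definitely absent" for both buckets at once.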
I read TFA's discussion of performance issues for hashmaps, and I see that deletes are a serious hindrance. Since I often use hashmaps for single-pass accumulation or transformation of data, I suspect that a hashmap without a delete operation could be designed to give superior performance in many such cases. Is this so, and are there implementations of hashmaps that are mutable but have no delete method?
I think GP is talking about improving performance simply by omitting deletion support from hash maps. For example, the tombstone marking described in the original article would no longer be necessary. Experience tells me that for a host of data structures (vectors, BSTs, hash maps, and many others), omitting deletion support can also simplify the code a lot.
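As a rough sketch of what that simplification looks like, here is a minimal insert-only linear-probing map. This is my own illustrative code, not from the article or any particular library: because entries are never removed, a lookup can stop at the first empty slot, with no tombstone bookkeeping at all.

```javascript
// Minimal insert-only open-addressing map with linear probing.
// No delete() means no tombstones: an empty slot always terminates a probe.
// String keys only; illustrative, not production code.

class InsertOnlyMap {
  constructor(capacity = 16) {
    this.keys = new Array(capacity).fill(undefined);
    this.values = new Array(capacity);
    this.size = 0;
  }

  _hash(key) {
    // Simple FNV-1a-style string hash for illustration.
    let h = 2166136261;
    for (let i = 0; i < key.length; i++) {
      h ^= key.charCodeAt(i);
      h = Math.imul(h, 16777619);
    }
    return h >>> 0;
  }

  set(key, value) {
    if (this.size * 2 >= this.keys.length) this._grow();
    let i = this._hash(key) % this.keys.length;
    while (this.keys[i] !== undefined && this.keys[i] !== key) {
      i = (i + 1) % this.keys.length; // linear probe to next slot
    }
    if (this.keys[i] === undefined) this.size++;
    this.keys[i] = key;
    this.values[i] = value;
  }

  get(key) {
    let i = this._hash(key) % this.keys.length;
    // Stopping at the first empty slot is valid only because entries
    // are never removed, so no "holes" can appear inside a probe chain.
    while (this.keys[i] !== undefined) {
      if (this.keys[i] === key) return this.values[i];
      i = (i + 1) % this.keys.length;
    }
    return undefined;
  }

  _grow() {
    const oldKeys = this.keys;
    const oldValues = this.values;
    this.keys = new Array(oldKeys.length * 2).fill(undefined);
    this.values = new Array(oldKeys.length * 2);
    this.size = 0;
    for (let i = 0; i < oldKeys.length; i++) {
      if (oldKeys[i] !== undefined) this.set(oldKeys[i], oldValues[i]);
    }
  }
}
```

Supporting deletion in this structure would force either tombstones (which degrade probe lengths over time) or backward-shift deletion (which is fiddly to get right), so dropping delete() removes a real source of both complexity and slowdown.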