SmoothieMap 2: the lowest memory hash table

In this post, I want to announce the revamp of SmoothieMap, an implementation of extendible hashing (a.k.a. dynamic hashing) in Java. This post also features the first implementation of the SwissTable algorithm in Java. Performance comparisons are here, too, of course!

Introduction
Memory usage and performance comparisons
Memory usage
Performance
Hash table benchmarking is thankless work
HashMap’s and SwissTable’s performance variability is low
HashMap is still very strong
What’s new in SmoothieMap 2
Applications
Concurrency and specializations

Introduction

In a recent post, I’ve presented my mental framework of thinking about hash table tradeoffs. In my view, the three main dimensions in play in the hash table algorithm design space are:

  • CPU cost (of hash table operations such as a key search or an insertion) not related to memory access.

  • Memory footprint of the hash table.

  • Variability: how much the memory footprint and the CPU cost change depending on factors such as the map’s current population relative to the rehashing threshold.

In this tradeoff space, SmoothieMap fares extremely well on the memory footprint dimension: it’s probably the leanest among all production hash table implementations ever written. It also excels on the variability factor: there is no “cliff” in memory footprint and performance around the threshold rehashing size (which is typical for most hash table implementations), because SmoothieMap doesn’t have rehashing in the first place: it grows smoothly as entries are inserted. SmoothieMap also doesn’t have a load factor and performs equally well regardless of whether the expected population of the hash table is specified at creation time or not.

The absence of rehashing, and of the associated latency spike, was the original main goal of SmoothieMap, which makes it useful in real-time applications.
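To make the “no rehash” point concrete, here is a minimal, hypothetical sketch of extendible hashing (the class, field, and method names are my own illustration, not SmoothieMap’s actual code): keys live in small fixed-capacity segments referenced from a directory indexed by the low bits of the hash code, and an overflow splits only the single affected segment.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of extendible hashing (illustrative only, not SmoothieMap's code).
// Keys live in small fixed-capacity segments referenced from a directory.
// When a segment overflows, only that segment is split (and, occasionally,
// the directory of references is doubled) -- no whole-table rehash ever runs.
class ToyExtendibleHashSet {
    static final int SEGMENT_CAPACITY = 4;

    static class Segment {
        int localDepth; // number of hash bits this segment "owns"
        List<Integer> keys = new ArrayList<>();
        Segment(int localDepth) { this.localDepth = localDepth; }
    }

    int globalDepth = 0; // directory size is 1 << globalDepth
    List<Segment> directory = new ArrayList<>(List.of(new Segment(0)));

    int dirIndex(int key) {
        return Integer.hashCode(key) & ((1 << globalDepth) - 1);
    }

    boolean contains(int key) {
        return directory.get(dirIndex(key)).keys.contains(key);
    }

    void add(int key) {
        Segment s = directory.get(dirIndex(key));
        if (s.keys.contains(key)) return;
        if (s.keys.size() == SEGMENT_CAPACITY) {
            split(s);
            add(key); // retry: the target segment now has room, or splits again
            return;
        }
        s.keys.add(key);
    }

    void split(Segment s) {
        if (s.localDepth == globalDepth) {
            // Double the directory; the new slots alias the existing segments.
            directory.addAll(new ArrayList<>(directory));
            globalDepth++;
        }
        // Replace s with two segments distinguished by one extra hash bit.
        Segment s0 = new Segment(s.localDepth + 1);
        Segment s1 = new Segment(s.localDepth + 1);
        int bit = 1 << s.localDepth;
        for (int i = 0; i < directory.size(); i++) {
            if (directory.get(i) == s) {
                directory.set(i, (i & bit) == 0 ? s0 : s1);
            }
        }
        // Only the keys of the single overflowed segment are redistributed.
        for (int key : s.keys) {
            directory.get(dirIndex(key)).keys.add(key);
        }
    }
}
```

The key property: when a segment overflows, at most SEGMENT_CAPACITY keys are moved, regardless of the total map size, so insertion cost stays bounded. Real implementations, including SmoothieMap, use much larger segments and a more compact directory, but the growth mechanics are the same.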

On the other hand, SmoothieMap has relatively high non-memory related CPU cost of key search and insertion operations. It may be about 50% slower than java.util.HashMap.

This compromise (low memory usage and the absence of latency spikes, traded for raw throughput) may resemble the behavior of low-latency garbage collection algorithms for the JVM, such as Shenandoah and ZGC, versus the G1 or Parallel garbage collectors:

The tradeoff of GC behavior. From Shenandoah GC talk by Aleksey Shipilëv

Memory usage and performance comparisons

Memory usage

Footprint of Map implementations

The chart above shows the average footprint per entry of several Map implementations depending on the map size. Measurements are done with the -XX:-UseCompressedOops JVM flag, i. e. with 8-byte object references. The picture is not radically different with 4-byte object references.

  • HashMap: 58.8–69.3 bytes. The swing is 18% (or less than that, if we take the footprint of the key and value objects stored in the map into consideration).

You may notice that HashMap, UnifiedMap, and SwissTable all have the footprint cliff at the same points: this is because they all use the 0.75 load factor in this test. It’s also easy to verify SwissTable’s footprint numbers knowing its data structure layout: theoretically, the average footprint per entry should vary from (8 + 8) / 0.75 + 1 / 0.75 = 22.7 bytes to (8 + 8) / 0.375 + 1 / 0.375 = 45.3 bytes. This matches the numbers that we’ve got.
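The bounds above can be checked mechanically. The sketch below assumes the layout described in this post (an 8-byte key reference plus an 8-byte value reference per slot, plus one control byte per slot, with the load factor swinging between 0.75 just before doubling and 0.375 just after); object headers of the backing arrays are ignored, since they amortize to nearly zero per entry at large sizes:

```java
// Per-entry footprint bounds for a SwissTable-like layout, as assumed in this
// test: each slot holds an 8-byte key reference and an 8-byte value reference,
// plus one control byte; the number of slots is the entry count / load factor.
// The load factor swings between 0.75 (fullest) and 0.375 (right after doubling).
class SwissTableFootprint {
    static double perEntryFootprint(double loadFactor) {
        double slotBytes = 8 + 8; // key reference + value reference
        double controlBytes = 1;  // one control byte per slot
        return slotBytes / loadFactor + controlBytes / loadFactor;
    }
}
```

perEntryFootprint(0.75) evaluates to about 22.7 bytes and perEntryFootprint(0.375) to about 45.3 bytes, matching the figures above.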

There is a characteristic tendency that can be seen in the data above: the more maps based on a single array table (HashMap, UnifiedMap, SwissTable) squeeze the average footprint down, the more pronounced the footprint cliff around the rehash size becomes. This is an example of a tradeoff between the memory footprint and the variability factors in play (see my earlier post for other examples of such tradeoffs).

SwissTable is an open addressing hash table with entries stored inline, and it uses power-of-two-sized array tables. This predictably leads to a footprint swing close to 100%. The same 100% swing would be observed for other open addressing hash table implementations with power-of-two-sized tables, such as fastutil’s Object2ObjectOpenHashMap or Koloboke’s HashObjObjMap. The only difference is that their average footprint would be greater than that of SwissTable because they can’t afford a load factor as high as 0.75.

Performance

Performance of successful get() operation on Map<Integer, Integer>

The chart above shows the average throughput of successful key lookups (Map.get() is called in a loop on keys in randomized order) depending on the map size (ranging from 1k to 50 million entries) for three implementations: HashMap, SmoothieMap 2 in the low-garbage mode, and SwissTable. The benchmarks were run on OpenJDK 11.0.3, using G1 GC, the C2 compiler, and the -XX:-UseCompressedOops flag, on an n1-standard-8 instance in GCP (that’s why the results are so noisy). Note that both the X and Y axes are logarithmic and don’t start at 0.

Hash table benchmarking is thankless work

The most striking thing on the chart above is that, of the six possible rankings of the three Maps compared, five different rankings can be observed with confidence at some map sizes. The sixth ranking (1. SmoothieMap, 2. SwissTable, 3. HashMap) can actually be found on the chart, too, but only at map sizes where the throughput of all three implementations is very close, so this is just accidental :)

The explanation is that the dominating factor in hash table’s performance during operations such as key lookup is how well the memory tiers of the data structure fit certain levels of CPU cache and the memory access pattern within these tiers (see my previous post for a more elaborate discussion). Due to their very different memory layouts and footprints, HashMap, SmoothieMap, and SwissTable “exhaust” certain levels of CPU cache and therefore begin performance phase transitions at different map sizes.

To me, this effect resembles F1 races, where tire wear and fuel load affect a car’s speed, so drivers may overtake each other on the track many times during the race because they decide to pit on different laps, rather than because they suddenly start to race better than their competitors.

Returning to maps in Java: to make matters worse, in addition to the map size, the following factors change the performance picture completely:

  • Whether a successful or an unsuccessful key lookup or an insertion operation is tested, or some mix of these operations.

This means that performance comparison of Map implementations outside of the context of a specific production use case is not just problematic, it’s impossible.

HashMap’s and SwissTable’s performance variability is low

Another feature that we can notice on the performance chart is the “saw” pattern in the throughput of HashMap and, perhaps, SwissTable. (Since the “saw” of SwissTable at some map sizes seems to be in antiphase with the “saw” of HashMap, even though they should logically be synchronous because the two Maps use the same 0.75 load factor in this benchmark, the apparent “saw” of SwissTable might be just noise.) The “saw” of HashMap, however, is indisputable. The spikes of the saw correspond to the map sizes at which the map hits its load factor threshold and doubles the underlying table.

This shouldn’t be surprising. The valuable observation that we can make, though, is that the maximum local difference between the highs and the lows of HashMap’s performance is approximately 30%, i. e. not that much, and unlikely to ever be a concern. This is thanks to HashMap’s collision resolution via separate chaining, which is a low-variability approach both in terms of memory footprint and performance. SwissTable (as well as F14) achieves low performance variability, while still having high memory footprint variability, by checking 8–16 slots at a time using SIMD instructions, effectively leveling the collision chain length factor.
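For a flavor of how a whole group of slots can be checked at once: SwissTable matches a hash fragment against 16 control bytes with SSE2 instructions. Plain Java offers no such intrinsics, but the effect can be approximated portably with SWAR (“SIMD within a register”) over 8 control bytes packed into a long, similar to the fallback paths of SwissTable ports. This is an illustrative sketch, not SmoothieMap’s or SwissTable’s actual code:

```java
// SWAR ("SIMD within a register") sketch: find which of the 8 control bytes
// packed into 'group' are equal to the byte 'b'. Returns a long with 0x80 set
// in each byte position of a match. Bytes above a real match may rarely be
// false positives (borrow propagation), which is fine for a hash table:
// candidate slots are verified with a full key comparison anyway.
class GroupMatch {
    static long matchByte(long group, byte b) {
        long pattern = (b & 0xFFL) * 0x0101010101010101L; // broadcast b to 8 bytes
        long x = group ^ pattern; // matching bytes become zero
        // Classic "detect a zero byte" bit trick:
        return (x - 0x0101010101010101L) & ~x & 0x8080808080808080L;
    }
}
```

One such 64-bit comparison filters 8 slots at once, so a probe sequence rarely needs to inspect slots one by one; Long.numberOfTrailingZeros(result) / 8 yields the index of the first matching byte in the group.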

HashMap is still very strong

At least in the specific test configuration used in my benchmark, HashMap is the fastest at many map sizes, and at no point is far behind SmoothieMap or SwissTable. Someone may wonder why HashMap performs so well in Java, while in C++, SwissTable, F14, and many other open addressing hash table variants written by enthusiasts beat std::unordered_map by a huge margin, although both java.util.HashMap and std::unordered_map use the same algorithm: the good old entry-based hash table with collision resolution via chaining and power-of-two-sized arrays.

I suspect this might not be due to an inherent inferiority of collision resolution via separate chaining, but rather has more to do with the particular inefficiency of unordered_map: the extra indirection hop in the data structure and, consequently, during all hash table operations (see the part of Matt Kulukundis’s CppCon 2017 talk for a nice visual representation), imposed by the C++ standard’s requirement that pointers and references to the stored elements remain valid.

Note that in all Martin Ankerl’s benchmarks linked above, “node” versions of all algorithms perform surprisingly close to “flat” versions with entries stored inline, as he notes himself.

There are a few extra factors that might give HashMap a relative edge in Java against SmoothieMap and SwissTable, as well as other open addressing hash tables:

  • The corresponding key and entry objects being placed adjacently in memory (either in a TLAB, or in a memory region, perhaps after a rearrangement performed by the GC). In C++, this could only happen if the key size is equal to the internal unordered_map node size, which puts them into the same size class for the memory allocator.

What’s new in SmoothieMap 2

SmoothieMap 1.0 was introduced in 2015. Here is the summary of the changes in SmoothieMap 2:

Automatic shrinking. When a SmoothieMap reduces in size after a peak, it can release the extra unused heap space. This feature is turned on by default and can be turned off by calling SmoothieMapBuilder.doShrink(false).

Modes of operation. SmoothieMap 2 introduces three modes of operation to navigate the garbage — footprint tradeoff space:

  • Low-garbage: SmoothieMap generates very little garbage. A SmoothieMap instance keeps using almost all heap memory that it ever allocated throughout its lifetime. The low-garbage mode is equivalent to the behavior of SmoothieMap 1.x.

  • Mixed: an intermediate point between the low-garbage and the footprint modes on the garbage — footprint tradeoff.

  • Footprint: the memory footprint of a SmoothieMap is the lowest, at the cost of producing more garbage as the map grows and shrinks.

To understand how these modes are implemented, check out the documentation for allocateIntermediateCapacitySegments() and splitBetweenTwoNewSegments() methods in SmoothieMap’s Javadoc.

Coping well with a poor distribution of key hash codes (or direct collisions). SmoothieMap 1.x didn’t handle well the case when many keys inserted into a map have colliding hash codes (either fully, or in some range of the bits), leading to excessive memory usage at best, or an IllegalStateException at worst. SmoothieMap 2 handles these situations robustly and without performance degradation for the rest of the keys. If there are too many keys with fully colliding hash codes, SmoothieMap 2 organizes them in a balanced tree, like HashMap does.

This addition moves SmoothieMap from the proof-of-concept category to production readiness.

Lower memory footprint and better successful lookup performance. SmoothieMap 2 in the low-garbage mode has slightly better footprint per entry than SmoothieMap 1.x (see the memory usage comparison above in this post).

Worse unsuccessful lookup and insertion performance. The throughput of unsuccessful get(), containsKey(), etc. methods, as well as put(), putIfAbsent(), compute(), etc. which insert a new entry into a map is lower in SmoothieMap 2 than in SmoothieMap 1.x, especially when SmoothieMap 2 is used in the mixed or the footprint mode.

Rejecting null keys and values. SmoothieMap 1.x aimed to mimic java.util.HashMap as much as possible, allowing null keys and values to be stored. SmoothieMap 2 is modeled more after immutable maps (Map.of(), Guava’s ImmutableMap) and ConcurrentHashMap, which don’t allow storing null keys and values and throw a NullPointerException when null is passed to query methods such as get().

Configuration of custom key and value equivalences and hash function using a builder. In SmoothieMap 1.x, it was possible to subclass SmoothieMap to override the default use of Java object equality for comparing keys and values (e. g. to emulate IdentityHashMap’s behavior, or to store objects of classes without properly defined equals(), like byte[], as keys) and to specify a custom hash function instead of calling key.hashCode(). In SmoothieMap 2, the SmoothieMap class is no longer open for inheritance. Custom equivalences and a hash function should now be configured via SmoothieMapBuilder.

Applications

I think that SmoothieMap may be an interesting choice in the following areas:

  • Applications in hard real-time environments (because SmoothieMap doesn’t exhibit a rehash latency spike), and/or applications turning off the garbage collector (e. g. by using Epsilon GC) and therefore needing their libraries to produce close to zero garbage, as SmoothieMap in the low-garbage mode does.

Concurrency and specializations

There are four things that didn’t make the cut into the SmoothieMap library as it is released to Maven Central (io.timeandspace:smoothie-map):

Concurrent SmoothieMap. SmoothieMap is not concurrent, though its internals (striping into small segments) naturally enable an efficient implementation of concurrency.

Primitive specializations (such as long keys), a Set specialization (not storing values), and a KeyedSet abstraction: a set of values that store a key/id in a field and can be queried by this id.

Monitoring of poor hash code distribution and active protection against malicious keys inserted into a SmoothieMap. I’ve experimented a lot with these ideas, and SmoothieMap has all the necessary infrastructure in place for making such detections, but I ultimately decided to leave this functionality out of the publicly released version because of some problems with the statistical model.

Specializations for specific modes of operation, “IdentitySmoothieMap”. The version of SmoothieMap which is released as a library to Maven Central is generic, that is, it enables configuration of the operation mode, custom key/value equivalences, etc. via flags and fields, which adds some inefficiency compared to a version with hard-coded configurations.

If you are interested in one of these things, please let me know at leventov.ru@gmail.com.

Software engineer and designer, author. Working at Northvolt.