Hash table tradeoffs: CPU, memory, and variability
Some time ago, I wrote about the time — memory tradeoff in hash table implementations. That was an oversimplified model which predicts the performance only for some algorithms on some range of sizes of hash tables in some benchmarks.
In this post, I describe a tri-factor model which I find more useful in the analysis of hash table algorithms and discuss several state-of-the-art algorithms in the context of this model.
Factor 1: non-memory-related CPU cost of a key search or an insertion
This is how much CPU a key search (or an insertion) takes on average, excluding the CPU cost of random accesses to the hash table’s memory. This factor is not directly measurable. It can be best estimated when benchmarking hash tables with very few entries so that all random accesses to the hash table’s memory hit the L1 cache.
The cost of computing a hash code from a key should be included in this baseline cost as well because different hash table algorithms have different requirements for the quality of the hash function.
Factor 2: memory
The memory factor consists of several subfactors which, however, are deeply interwoven, so it’s usually not possible to consider these subfactors in separation.
Linear memory footprint by tiers
O(N) memory footprint of a hash table (where N is the number of entries stored in the hash table) is comprised of one or several “tiers” with different constant factors and, sometimes, different variability. For example, open addressing hash tables often have a single memory tier: the table itself which is doubled when the size exceeds a certain limit, determined by the maximum load factor.
Hash tables with collision resolution via separate chaining (including Java’s
HashMap) have two memory tiers: (1) the table which exhibits the same dynamics as in open addressing, and (2) the memory taken by the chaining nodes. The size (footprint) of the latter tier is strictly proportional to N.
The number of locations accessed in each memory tier during a key search or an insertion
Extending the example from the previous section, during a key search in an open addressing hash table we begin with reading from some slot in the table. If the slot appears to be empty it means that the key is absent in the hash table and we finish the search. If the slot is full and the key stored in the slot is equal to the searched key, then the search is successful and we are done, too. Finally, if the slot is full but the key stored in the slot is not equal to the searched key then we try another slot according to the collision resolution scheme, whether it is linear probing, quadratic probing, Cuckoo hashing, etc. The whole process is repeated for the second slot which may result in accessing the third, fourth, etc. slots with diminishing probability.
This example shows that “the number of accessed locations” is most precisely described by some discrete probability distribution. However, for practical purposes, that is, for measuring the throughput of a hash table implementation, it is sufficient to consider only the mean of the distribution: in other words, the average number of random accesses to each memory tier involved in a key search or an insertion.
On the other hand, the discrete distribution itself (and, therefore, its mean) depends on the current hash table’s load factor which may change in the course of the application’s runtime. This is a subject of the variability factor described below in this post.
Going back to the definition of the first factor: “how much CPU a key search (or an insertion) takes on average”, note that the “on average” part encapsulates “the weighted expected value of a discrete distribution of CPU costs”. That distribution is parallel to the distribution of the number of accessed memory locations. They both directly stem from the distribution of the number of collision resolution steps during an operation on a hash table.
The relative locality of random accesses to the memory tiers
When random accesses to hash table’s memory exhibit some locality (meaning, in fact, that some of those accesses are not fully random w.r.t. the other accesses) the CPU cache (both data cache and TLB) might be utilized better and the “secondary” accesses might complete faster.
The important “degrees of locality” are:
- Accesses within a single cache line (64 bytes).
- Accesses within a single aligned block of 128 bytes: taking advantage of the Spatial Prefetcher (see Intel’s optimization manual, § 220.127.116.11).
- Accesses within a single page (e. g. 4KB or 2MB) for better TLB utilization.
The regularity of memory accesses (which may allow taking advantage of CPU’s stride and streaming prefetchers) is much less important than the locality because hash table operations usually don’t involve enough accesses to trigger prefetch and begin taking advantage of it. The exception is operations on open addressing hash tables with linear probing at high load factors.
In general, the memory tiers in a hash table data structure don’t have to form continuous blocks of memory (see, for example, the tier of chaining nodes in Java’s
HashMap described above: it consists of many independent allocations), nor have to be disaggregated: in F14, tag vectors are combined with slots into chunks (see details below in this post). Therefore, accesses not only to the same memory tier but also to different tiers may be relatively close to each other in some hash table algorithms. Sometimes, this is the key to the hash table’s performance.
Other memory-related subfactors
The memory footprint and the number of accessed memory locations subfactors in the hash table’s tradeoff space are similar to space amplification and read amplification, respectively, used in the analysis of durable index structure algorithms.
I think that the hash table’s equivalent for write amplification of index structures: the number of memory write locations during an insertion into a hash table is less important than the subfactors described above. Alternatively, it simply shouldn’t be distinguished within the total number of accessed locations. This is because the write locations are typically a subset of the read locations accessed during the key search in the first phase of the insertion algorithm (this is not necessarily true in index structure algorithms: for example, many of them require writing to WAL which shouldn’t be accessed on the read path), and in the computer’s main memory hierarchy (unlike disks) writes to the cache lines that are already brought into L1 cache impose little additional latency.
There is an important exception: concurrent hash table algorithms. Firstly, it’s very beneficial when a concurrent hash table can avoid writes for locking during read-only hash table operations and thus avoid unnecessary cache coherence traffic in a read-only (or mostly read-only) steady state. This is done by Java’s
ConcurrentHashMap whose operations like
get() don’t involve locking on the hot path.
Secondly, in small hash tables which are heavily updated from multiple concurrent threads, the configuration of memory write locations may be important because it may cause (or, on the contrary, may allow avoiding) false sharing.
However, since in this post I’m focusing on non-concurrent hash table algorithms, I won’t return to the memory write locations subfactor below.
Another memory subfactor which becomes important only for some specific hash table algorithms is the relative alignment of the very often accessed locations (aka “hot fields”). If the hot cache lines are spaced at 4096 bytes from each other (or a multiple of that) they fall into the same cache associativity set which might cause these hot locations to be frequently missing from L1. This becomes a concern only for “segmented” (“striped”) data structures.
Factor 3: variability
Efficiency characteristics of all hash table algorithms in the context of the CPU and the memory factors described above change depending on the specific hash table’s population (the number of stored entries) which translates to some load factor in most hash table algorithms. It may be impossible to clearly define load factor for some complicated, multi-tier hash table algorithms, but their “relative fullness” and, therefore, the efficiency characteristics vary too.
For hash tables with so-called tombstone entries, the level of “contamination” of the table with tombstones is another source of variability in memory efficiency as well as in the performance of some operations.
The variability of hash tables’ efficiency becomes important when the efficiency of the whole application significantly depends on the efficiency of a single or a few large hash tables (rather than many small-to-medium-sized, independent hash tables). Such applications may need to pessimistically assume the worst efficiency within the hash table’s variability bounds for their resource usage planning. The hash table’s performance variability should also be considered for making founded quality of service estimates. The examples of such applications are in-memory key-value stores (like Redis or Memcached) and databases and data processing frameworks in which large hash joins are a major workload component.
There are two main types of variability, corresponding to the first two memory subfactors:
- Variability in memory utilization.
- Variability in the average number of collision resolution steps during hash table operations. Note this is not the variance of the discrete probability distribution characterizing the number of accessed locations described above. Rather, this is the swing of the average itself for a given hash table algorithm/data structure in different states. The variability in collision resolution steps translates into the variability in the average number of accessed memory locations and in the non-memory-related CPU cost.
Another variability aspect is very long occasional insertions which may be caused by a hash table’s rehash. This may be important in latency-sensitive applications. The two techniques for avoiding the rehash latency spikes are incremental/background resizing (implemented, for example, in Redis and in
PauselessHashMap) and extendible hashing.
Finally, it’s worth mentioning two static properties of a hash table’s use case that affect the efficiency. They can be viewed as sources of “static” variability:
- Quality of hash function. Open addressing hash tables with “local” collision resolution: namely, linear and quadratic probing are the most susceptible to the hash function quality.
- Entry size affects the variability of memory utilization in hash tables that store entries in separately allocated nodes (like hash tables with collision resolution via separate chaining, or open addressing hash tables with forwarding pointers): the larger the entry size, the less variable the memory utilization of such data structures becomes (because of the diminishing relative weight of the variably sized memory tier, the table). See also a discussion of the tradeoff between inline entries and separately allocated nodes below in this post.
Putting everything together, the efficiency and the behavioral model of a hash table is influenced by the algorithm factors described above: CPU cost, memory, and variability, as well as the following factors which are independent of the hash table algorithm:
- The types and the relative frequency of operations on the hash table in the workload, such as read-only (successful key searches), read-only (unsuccessful key searches), 50/50 read/write, etc.
- The range of the possible hash table’s sizes.
- The distribution of key access frequencies. Most benchmarks assume either uniform or Zipf distribution.
- The hash function.
- The entry size.
- The fraction of CPU’s caches capacity storing the hash table’s memory lines. This depends on whether the operations on the hash table are on the application’s hot spot.
- The sizes of the CPU caches: L1/L2/L3 and TLB on the machine where the application is deployed.
- The CPU architecture: the availability of some instructions, for example,
_mm_movemask_epi8used in SwissTable and F14.
Factors 1, 2, and 4 affect the actual distributions of the numbers of collision resolution steps which translate into the numbers of accessed memory locations and the CPU costs. The distributions may be different for different operations on the hash table.
Factors 1, 2, 4, and 5 affect the actual memory efficiency (utilization) of the hash table and the variability thereof.
Factors 3, 6, and 7 determine how frequently random accesses to the hash table’s memory will hit the CPU cache. This allows translating the numbers of accessed memory locations into the actual average CPU latency. Factor 8 also affects the non-memory related CPU costs of operations on the hash table.
Examples of tradeoffs between the factors
Low vs. high maximum load factor
Setting the maximum load factor of an open addressing hash table to a relatively low value (e. g. 0.5) reduces the average number of collision resolution steps and its variability. The difference in the average number of collision resolution steps during an operation on hash tables with load factors 0.5 and 0.5/X (where X is the growth factor, typically 2) is smaller than the difference between hash tables with load factors L and L/X where L is any load factor greater than 0.5. This is because the dependency of the number of collision resolution steps during an operation on a hash table from the load factor is a convex function.
On the other hand, lowering the maximum load factor increases the memory footprint of a hash table.
Linear vs. quadratic probing vs. double hashing
In open addressing hash tables, quadratic probing reduces the variability of the average number of collision resolution steps (compared to linear probing) but also increases the non-memory related CPU cost of operations a bit and reduces the relative locality of the memory accesses. Compared to quadratic probing, double hashing (as well as cuckoo hashing) reduces the variability in the number of collision resolution steps even more, but incurs significantly higher non-memory-related CPU costs of operations and don’t impose any relative locality between the random accesses to the table.
Quadratic probing vs. Robin Hood hashing
By construction, the variant of Robin Hood hashing which is based on linear probing aims to reduce the variability of the average number of collision resolution steps on successful key searches at the price of higher non-memory-related CPU cost of insertions. The tradeoff between quadratic probing and Robin Hood hashing is that the non-memory-related CPU cost of successful key searches with Robin Hood hashing is lower than with quadratic probing, and the accessed memory locations (if multiple) are adjacent, as with linear probing.
On the other hand, the average number of collision resolution steps (and its variability) on unsuccessful key searches in a hash table with Robin Hood hashing is as high as with vanilla linear probing and is much higher than with quadratic probing.
Inline entries vs. separately allocated nodes
Open addressing hash tables that store the key structures (or the values, or both, that is, the whole entries) inline have one less memory tier than equivalent hash tables with a level of pointer indirection. Usually, the latter variant is chosen when the pointer stability of the keys (or the values) in the table is required (pointers to keys or values inside an “inline” table become invalid during a rehash). This factor is outside of the tradeoff space discussed in this post. However, “inline” vs. “forwarding” hash tables fare differently on the tradeoff factors as well: “inline” tables have higher variability in memory utilization and accesses with less locality to their sole memory tier (when using a collision resolution scheme that preserves some locality, such as linear or quadratic probing), whereas “forwarding” tables have higher minimum memory footprint and impose more memory accesses because there is an extra memory tier.
Both disadvantages of “inline” tables become relatively worse with the larger key (or value) structures. Therefore, the bigger the entries, the more likely it is that a “forwarding” table is preferable, especially when accesses to the colliding keys for comparisons are avoided most of the time, which is the case in SwissTable and F14.
Open addressing vs. separate chaining
To isolate from the “inline vs. forwarding” tradeoff discussed in the previous section, let’s consider hash tables that resolve collisions via separate chaining and store all their entries in separately allocated nodes (rather than the hybrid with the first level of entries inlined) and open addressing hash tables with forwarding pointers, i. e. with separately allocated nodes, too.
Hash tables with separate chaining benefit much lower variability of the average number of collision resolution steps than open addressing hash tables. Hash tables with separate chaining are operational at load factors of 1.0 and higher, which is just impossible with open addressing.
On the other hand, hash tables with separate chaining have a higher footprint of its “chaining nodes” memory tier than open addressing hash tables because each node includes a pointer to the next node in the bucket, in a singly linked list manner. However, this difference in memory footprint could be offset because a hash table with separate chaining can tolerate a higher maximum load factor than an open addressing hash table.
If we assume that a “forwarding” open addressing hash table doesn’t avoid accessing the memory of nodes on the collision chain, it needs to access a higher average number of random memory locations (which are also potentially less local, depending on the probing scheme) than a hash table with separate chaining. Avoiding the access to the memory of nodes on the collision chain is another tradeoff in itself: it requires to store the hash codes of the keys inline in the table or in a parallel memory area, increasing the hash table’s footprint and the non-memory-related CPU cost of operations.
SwissTable vs. F14
SwissTable and F14 are both open addressing hash tables. They have a memory tier with 8 bits per slot (tags) of which 7 bits are taken from the hash code of the key stored in the corresponding slot. They use fancy vector CPU instructions to check 16 slots at a time (14 or 12 slots in F14). They also both use quadratic probing for collision resolution between the “buckets” of slots checked at once.
The main difference between SwissTable and F14 is that in SwissTable the tags tier and the entries table are two separate memory areas. In F14, the tags and the entries of a bucket are the parts of a bigger structure, a chunk. An array of chunks is the table. Effectively, it means that blocks of tags and entries are interleaved in F14 table’s memory.
Operations on SwissTable access a little fewer memory locations than operations on F14 (explanation is beyond the scope of this post). However, the F14’s accesses during successful key searches are much more local: they are almost always within a single page (thus requiring only one entry to be loaded in the TLB) and, with good probability, even within a single or two adjacent cache lines. On the other hand, F14’s accesses during unsuccessful key searches are less local than SwissTable’s because the latter happen in the dense “tags” memory tier and may all fall within a single cache line, while the accesses to different buckets on the collision resolution chain in F14 are a few chucks apart, while a chunk is one or several cache lines long.