TY - GEN
T1 - Fast hashing with strong concentration bounds
AU - Aamand, Anders
AU - Knudsen, Jakob Bæk Tejs
AU - Knudsen, Mathias Bæk Tejs
AU - Rasmussen, Peter Michael Reichstein
AU - Thorup, Mikkel
PY - 2020
Y1 - 2020
N2 - Previous work on tabulation hashing by PÇtraşcu and Thorup from STOC'11 on simple tabulation and from SODA'13 on twisted tabulation offered Chernoff-style concentration bounds on hash based sums, e.g., the number of balls/keys hashing to a given bin, but under some quite severe restrictions on the expected values of these sums. The basic idea in tabulation hashing is to view a key as consisting of c=O(1) characters, e.g., a 64-bit key as c=8 characters of 8-bits. The character domain ς should be small enough that character tables of size |ς| fit in fast cache. The schemes then use O(1) tables of this size, so the space of tabulation hashing is O(|ς|). However, the concentration bounds by PÇtraşcu and Thorup only apply if the expected sums are g‰ |ς|. To see the problem, consider the very simple case where we use tabulation hashing to throw n balls into m bins and want to analyse the number of balls in a given bin. With their concentration bounds, we are fine if n=m, for then the expected value is 1. However, if m=2, as when tossing n unbiased coins, the expected value n/2 is ≫ |ς| for large data sets, e.g., data sets that do not fit in fast cache. To handle expectations that go beyond the limits of our small space, we need a much more advanced analysis of simple tabulation, plus a new tabulation technique that we call tabulation-permutation hashing which is at most twice as slow as simple tabulation. No other hashing scheme of comparable speed offers similar Chernoff-style concentration bounds.
AB - Previous work on tabulation hashing by PÇtraşcu and Thorup from STOC'11 on simple tabulation and from SODA'13 on twisted tabulation offered Chernoff-style concentration bounds on hash based sums, e.g., the number of balls/keys hashing to a given bin, but under some quite severe restrictions on the expected values of these sums. The basic idea in tabulation hashing is to view a key as consisting of c=O(1) characters, e.g., a 64-bit key as c=8 characters of 8-bits. The character domain ς should be small enough that character tables of size |ς| fit in fast cache. The schemes then use O(1) tables of this size, so the space of tabulation hashing is O(|ς|). However, the concentration bounds by PÇtraşcu and Thorup only apply if the expected sums are g‰ |ς|. To see the problem, consider the very simple case where we use tabulation hashing to throw n balls into m bins and want to analyse the number of balls in a given bin. With their concentration bounds, we are fine if n=m, for then the expected value is 1. However, if m=2, as when tossing n unbiased coins, the expected value n/2 is ≫ |ς| for large data sets, e.g., data sets that do not fit in fast cache. To handle expectations that go beyond the limits of our small space, we need a much more advanced analysis of simple tabulation, plus a new tabulation technique that we call tabulation-permutation hashing which is at most twice as slow as simple tabulation. No other hashing scheme of comparable speed offers similar Chernoff-style concentration bounds.
KW - Chernoff bounds
KW - Concentration bounds
KW - Hashing
KW - Sampling
KW - Streaming algorithms
UR - http://www.scopus.com/inward/record.url?scp=85086766216&partnerID=8YFLogxK
U2 - 10.1145/3357713.3384259
DO - 10.1145/3357713.3384259
M3 - Article in proceedings
AN - SCOPUS:85086766216
T3 - Proceedings of the Annual ACM Symposium on Theory of Computing
SP - 1265
EP - 1278
BT - STOC 2020 - Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing
A2 - Makarychev, Konstantin
A2 - Makarychev, Yury
A2 - Tulsiani, Madhur
A2 - Kamath, Gautam
A2 - Chuzhoy, Julia
PB - Association for Computing Machinery
T2 - 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2020
Y2 - 22 June 2020 through 26 June 2020
ER -