TY - GEN
T1 - Confirmation sampling for exact nearest neighbor search
AU - Christiani, Tobias
AU - Pagh, Rasmus
AU - Thorup, Mikkel
PY - 2020
Y1 - 2020
N2 - Locality-sensitive hashing (LSH), introduced by Indyk and Motwani in STOC ’98, has been an extremely influential framework for nearest neighbor search in high-dimensional data sets. While theoretical work has focused on the approximate nearest neighbor problem, in practice LSH data structures with suitably chosen parameters are used to solve the exact nearest neighbor problem (with some error probability). Sublinear query time is often possible in practice even for exact nearest neighbor search, intuitively because the nearest neighbor tends to be significantly closer than other data points. However, theory offers little advice on how to choose LSH parameters outside of pre-specified worst-case settings. We introduce the technique of confirmation sampling for solving the exact nearest neighbor problem using LSH. First, we give a general reduction that transforms a sequence of data structures that each find the nearest neighbor with a small, unknown probability, into a data structure that returns the nearest neighbor with probability $1-\delta$, using as few queries as possible. Second, we present a new query algorithm for the LSH Forest data structure with L trees that is able to return the exact nearest neighbor of a query point within the same time bound as an LSH Forest of $\Omega(L)$ trees with internal parameters specifically tuned to the query and data.
KW - Locality-sensitive hashing
KW - Nearest neighbor search
U2 - 10.1007/978-3-030-60936-8_8
DO - 10.1007/978-3-030-60936-8_8
M3 - Article in proceedings
AN - SCOPUS:85093858448
SN - 9783030609351
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 97
EP - 110
BT - Similarity Search and Applications - 13th International Conference, SISAP 2020, Proceedings
A2 - Satoh, Shin’ichi
A2 - Vadicamo, Lucia
A2 - Carrara, Fabio
A2 - Zimek, Arthur
A2 - Bartolini, Ilaria
A2 - Aumüller, Martin
A2 - Jónsson, Björn Þór
A2 - Pagh, Rasmus
PB - Springer
T2 - 13th International Conference on Similarity Search and Applications, SISAP 2020
Y2 - 30 September 2020 through 2 October 2020
ER -