Unsupervised multi-index semantic hashing

Christian Hansen, Casper Hansen, Jakob Grue Simonsen, Stephen Alstrup, Christina Lioma

Publication: Contribution to book/anthology/report > Conference contribution in proceedings > Research > peer-reviewed

6 Citations (Scopus)
17 Downloads (Pure)

Abstract

Semantic hashing represents documents as compact binary vectors (hash codes) and allows both efficient and effective similarity search in large-scale information retrieval. The state of the art has primarily focused on learning hash codes that improve similarity search effectiveness, while assuming a brute-force linear scan strategy for searching over all the hash codes, even though much faster alternatives exist. One such alternative is multi-index hashing, an approach that constructs a smaller candidate set to search over, which, depending on the distribution of the hash codes, can lead to sub-linear search time. In this work, we propose Multi-Index Semantic Hashing (MISH), an unsupervised hashing model that learns hash codes that are both effective and highly efficient by being optimized for multi-index hashing. We derive novel training objectives that make it possible to learn hash codes that reduce the candidate sets produced by multi-index hashing, while being end-to-end trainable. In fact, our proposed training objectives are model agnostic, i.e., not tied to how the hash codes are generated specifically in MISH, and are straightforward to include in existing and future semantic hashing models. We experimentally compare MISH to state-of-the-art semantic hashing baselines in the task of document similarity search. We find that even though multi-index hashing also improves the efficiency of the baselines compared to a linear scan, they are still upwards of 33% slower than MISH, while MISH is still able to obtain state-of-the-art effectiveness.
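
The candidate-set idea behind multi-index hashing can be illustrated with a minimal Python sketch. It shows only the generic search technique (split each code into substrings, index each substring in its own table, then verify candidates by full Hamming distance), not the MISH model or its training objectives. The class and function names (MultiIndex, split_code, neighbours) are hypothetical, and hash codes are assumed to be stored as Python integers.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a, b):
    """Number of differing bits between two integer-encoded hash codes."""
    return bin(a ^ b).count("1")


def split_code(code, n_bits, m):
    """Split an n_bits-wide code into m equal-width substrings (low bits first)."""
    width = n_bits // m
    mask = (1 << width) - 1
    return [(code >> (i * width)) & mask for i in range(m)]


def neighbours(sub, width, radius):
    """Yield every width-bit value within Hamming distance `radius` of `sub`."""
    for r in range(radius + 1):
        for positions in combinations(range(width), r):
            flipped = sub
            for p in positions:
                flipped ^= 1 << p
            yield flipped


class MultiIndex:
    """Hypothetical multi-index over integer hash codes (illustration only)."""

    def __init__(self, codes, n_bits, m):
        assert n_bits % m == 0, "sketch assumes equal-width substrings"
        self.codes, self.n_bits, self.m = list(codes), n_bits, m
        # One hash table per substring position: substring value -> doc ids.
        self.tables = [defaultdict(list) for _ in range(m)]
        for doc_id, code in enumerate(self.codes):
            for i, sub in enumerate(split_code(code, n_bits, m)):
                self.tables[i][sub].append(doc_id)

    def candidates(self, query, radius):
        # Pigeonhole argument: if two codes differ in at most `radius` bits,
        # at least one of the m substrings differs in at most radius // m bits.
        sub_radius = radius // self.m
        width = self.n_bits // self.m
        found = set()
        for i, sub in enumerate(split_code(query, self.n_bits, self.m)):
            for nb in neighbours(sub, width, sub_radius):
                found.update(self.tables[i].get(nb, ()))
        return found

    def search(self, query, radius):
        # Verify each candidate against the full-length Hamming distance.
        return sorted(d for d in self.candidates(query, radius)
                      if hamming(self.codes[d], query) <= radius)


index = MultiIndex([0b00000000, 0b00000001, 0b11110000, 0b11111111], n_bits=8, m=2)
print(index.candidates(0b00000000, radius=1))
print(index.search(0b00000000, radius=1))
```

In this toy index the candidate set for the all-zero query omits the all-ones code entirely, so it is never compared against the query. How small the candidate set is in practice depends on how the codes are distributed across the substring tables, which is the property the abstract says MISH is trained to improve.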

Original language: English
Title: The Web Conference 2021 - Proceedings of the World Wide Web Conference, WWW 2021
Publisher: Association for Computing Machinery, Inc
Publication date: 2021
Pages: 2879-2889
ISBN (Electronic): 9781450383127
DOI
Status: Published - 2021
Event: 2021 World Wide Web Conference, WWW 2021 - Ljubljana, Slovenia
Duration: 19 Apr 2021 to 23 Apr 2021

Conference

Conference: 2021 World Wide Web Conference, WWW 2021
Country/Territory: Slovenia
City: Ljubljana
Period: 19/04/2021 to 23/04/2021
Sponsor: Amazon, et al., Facebook, FINVOLUTION, Microsoft Research, Pinterest
Name: The Web Conference 2021 - Proceedings of the World Wide Web Conference, WWW 2021

Bibliographical note

Publisher Copyright:
© 2021 ACM.
