Probabilistic near-duplicate detection using simhash

11 years 9 months ago
Probabilistic near-duplicate detection using simhash
This paper offers a novel look at using a dimensionalityreduction technique called simhash [8] to detect similar document pairs in large-scale collections. We show that this algorithm produces interesting intermediate data, which is normally discarded, that can be used to predict which of the bits in the final hash are more susceptible to being flipped in similar documents. This paves the way for a probabilistic search technique in the Hamming space of simhashes that can be significantly faster and more space-efficient than the existing simhash approaches. We show that with 95% recall compared to deterministic search of prior work [16], our method exhibits 4-14 times faster lookup and requires 2-10 times less RAM on our collection of 70M web pages. Categories and Subject Descriptors H.3.3 [Information Search and Retrieval]: Clustering General Terms Algorithms Keywords Hamming distance, similarity, simhash, clustering
Sadhan Sood, Dmitri Loguinov
Added 13 Dec 2011
Updated 13 Dec 2011
Type Journal
Year 2011
Where CIKM
Authors Sadhan Sood, Dmitri Loguinov
Comments (0)