Locality-sensitive hashing for documents.pptx
- Number of slides: 18
Locality-sensitive hashing for documents. By Bondarevska Natalia. Supervisor: Glybovets A.
Our access to data is growing faster than our ability to process it
Duplicate detection: Given a large collection of web documents, find pairs of documents that are near neighbors, i.e. lexically similar but not necessarily identical copies of one another. Chosen similarity measure: Jaccard similarity of sets.
Jaccard similarity of sets:
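The formula for this slide did not survive extraction. The standard definition is J(A, B) = |A ∩ B| / |A ∪ B|. A minimal Python sketch follows; the function name and the convention for two empty sets are assumptions made for illustration.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets."""
    if not a and not b:
        return 1.0  # convention assumed for this sketch: two empty sets are identical
    return len(a & b) / len(a | b)


s1 = {"a", "b", "c", "d"}
s2 = {"c", "d", "e"}
print(jaccard(s1, s2))  # 2 shared elements out of 5 distinct ones -> 0.4
```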
Problem statement: Given a set of points and any query point q, return the point closest to q. We wish to guarantee with high probability that we return the nearest neighbor for any query point.
Can we represent similarities between objects in a succinct manner? Yes, by obtaining a sketch of each object. We sacrifice exactness for efficiency by using randomization.
Locality-sensitive hashing: Based on the simple idea that if two points are close together, then after a “projection” operation they will remain close together. A projection operation maps a data point from a high-dimensional space to a low-dimensional space. Create projections from a number of different directions and keep track of the points that land near each other.
LSH for documents: Represent each document as a set by building its set of shingles. Minhash the shingles. Create signatures for the documents. Apply LSH to the signatures. Tune the parameters.
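The first step, turning a document into a set of shingles, might be sketched as follows. Character-level shingles, the default k = 5, and the lowercasing and whitespace normalization are all assumptions of this sketch; the slides do not say which variant was used.

```python
def shingles(text: str, k: int = 5) -> set[str]:
    """Return the set of all k-character substrings (shingles) of text."""
    # Lowercase and collapse whitespace runs so formatting differences
    # do not produce different shingles (an assumption of this sketch).
    normalized = " ".join(text.lower().split())
    if len(normalized) < k:
        # Text shorter than k yields a single shingle, or none if empty.
        return {normalized} if normalized else set()
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}


print(sorted(shingles("abcdabd", k=2)))  # ['ab', 'bc', 'bd', 'cd', 'da']
```

Two documents can then be compared through the Jaccard similarity of their shingle sets.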
Hashing shingles:
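The content of this slide was not recoverable. As a rough illustration of the idea, shingles can be mapped to compact integers before further processing. The sketch below assumes CRC32 from Python's zlib module as the hash, which is a choice made here and not necessarily the one used in the talk.

```python
import zlib


def hash_shingles(shingle_set: set[str]) -> set[int]:
    """Map each shingle to a 32-bit integer using CRC32 (stable across runs)."""
    return {zlib.crc32(s.encode("utf-8")) for s in shingle_set}


hashed = hash_shingles({"ab", "bc", "cd"})
print(len(hashed))  # 3 distinct integers for these 3 shingles
```

Unlike Python's built-in hash() for strings, CRC32 gives the same value in every process, so signatures built on top of it stay comparable between runs.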
Minhashing shingles (building signatures), e.g.: Element 2 1 0 0 1 3 0 0 1 0 1 0 1 1 4 0 0 1 0
Minhashing and Jaccard similarity. Theorem: The probability that the minhash function for a random permutation of rows produces the same value for two sets equals the Jaccard similarity of those sets.
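The theorem can be checked empirically with a small sketch that uses explicit random permutations of the rows, exactly as the statement describes. The function names, the number of permutations, and the seed are assumptions for illustration. Real systems typically replace explicit permutations with random hash functions, because permuting a large universe is too expensive.

```python
import random


def minhash_signatures(sets, num_perm=200, seed=0):
    """Build one minhash signature per non-empty input set.

    Each permutation assigns every element of the universe a random rank;
    the minhash of a set is the smallest rank among its elements.
    """
    rng = random.Random(seed)
    universe = sorted(set().union(*sets))
    signatures = [[] for _ in sets]
    for _ in range(num_perm):
        order = universe[:]
        rng.shuffle(order)  # one random permutation of the rows
        rank = {element: pos for pos, element in enumerate(order)}
        for sig, s in zip(signatures, sets):
            sig.append(min(rank[e] for e in s))
    return signatures


def estimate_similarity(sig_a, sig_b):
    """Fraction of signature positions on which the two minhash values agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


a = set(range(0, 80))
b = set(range(40, 120))  # true Jaccard similarity: 40 / 120 = 1/3
sig_a, sig_b = minhash_signatures([a, b], num_perm=400, seed=1)
print(estimate_similarity(sig_a, sig_b))  # an estimate that should be near 1/3
```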
Dividing a signature matrix into bands: band 1: 1 0 0 0 2 2 3 1 0 1 2; band 2: …; band 3: …; band 4: …
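The banding step can be sketched as follows: split each signature into bands of r rows, and treat two documents as a candidate pair if they agree on every row of at least one band. The document ids and signature values below are made up for illustration. Using the band tuple itself as the bucket key, instead of hashing it into a fixed number of buckets, is a simplification of this sketch.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(signatures, bands, rows):
    """Return pairs of document ids whose signatures match in at least one band.

    signatures: dict mapping doc id -> list of ints of length bands * rows.
    """
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            if len(sig) != bands * rows:
                raise ValueError("signature length must equal bands * rows")
            # The slice of the signature belonging to this band is the bucket key.
            key = tuple(sig[band * rows:(band + 1) * rows])
            buckets[key].append(doc_id)
        for docs in buckets.values():
            for pair in combinations(sorted(docs), 2):
                candidates.add(pair)
    return candidates


signatures = {
    "d1": [1, 0, 0, 2, 3, 1],
    "d2": [1, 0, 0, 5, 4, 4],  # matches d1 in band 0
    "d3": [7, 8, 9, 2, 3, 1],  # matches d1 in band 1
    "d4": [6, 6, 6, 6, 6, 6],  # matches nothing
}
print(sorted(candidate_pairs(signatures, bands=2, rows=3)))
# [('d1', 'd2'), ('d1', 'd3')]
```

Only candidate pairs need a full similarity check afterwards, which is where the savings over comparing all pairs come from.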
Analysis:
The S-curve
Values of the S-curve for b = 20 and r = 5:
s      value
0.2    0.006
0.3    0.047
0.4    0.186
0.5    0.470
0.6    0.802
0.7    0.975
0.8    0.9996
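The values above agree with the standard banding formula 1 - (1 - s^r)^b, the probability that two documents with Jaccard similarity s become a candidate pair when the signature is split into b bands of r rows each. The formula itself did not survive in the extracted slide text, so it is an assumption here, although the tabulated numbers match it. A short sketch that reproduces the table:

```python
def candidate_probability(s: float, b: int, r: int) -> float:
    """Probability that a pair with similarity s matches in at least one of b bands."""
    # s ** r: all r rows of one band agree.
    # (1 - s ** r) ** b: no band agrees at all.
    return 1.0 - (1.0 - s ** r) ** b


for s in (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8):
    print(f"{s:.1f}  {candidate_probability(s, b=20, r=5):.4f}")
```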
Tuning parameters:
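The content of this slide was not recoverable. One common rule of thumb, taken from the general LSH literature and not confirmed by the slides, is that the S-curve rises most steeply near s ≈ (1/b)^(1/r). The sketch below uses that approximation to pick b and r for a fixed signature length; both function names are hypothetical.

```python
def approximate_threshold(b: int, r: int) -> float:
    """Approximate similarity at which the S-curve rises most steeply."""
    return (1.0 / b) ** (1.0 / r)


def choose_bands(signature_length: int, target: float):
    """Pick (b, r) with b * r == signature_length whose threshold is closest to target."""
    best = None
    for r in range(1, signature_length + 1):
        if signature_length % r:
            continue  # r must divide the signature length evenly
        b = signature_length // r
        gap = abs(approximate_threshold(b, r) - target)
        if best is None or gap < best[0]:
            best = (gap, b, r)
    return best[1], best[2]


print(round(approximate_threshold(20, 5), 3))  # 0.549
print(choose_bands(100, target=0.55))  # (20, 5)
```

Raising b for a fixed r lowers the threshold and admits more false positives; raising r for a fixed b raises the threshold and risks more false negatives.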
Conclusions: An approximate algorithm is necessary for a big data collection. LSH requires a vector representation of objects. Computational complexity: O(n).