There are numerous pieces of duplicate information served by multiple sources on the web. Many news stories that we receive from the media tend to originate from the same source, such as the Associated Press. When such contents are scraped off the web for archiving, a need may arise to categorize documents by their similarity (not in the sense of meaning of the text but the characterlevel or lexical matching).
Here, we build a prototype for nearduplicate document detection system. This article presents the background material on an algorithm called MinHash and a method for probabilistic dimension reduction through the localitysensitive hashing. A future article presents their implementation on Python and CouchDB.
(Note that all the numbers generated for the tables in this article are totally made up for illustration purposes; they may not be mathematically consistent with any hashing algorithm.)
Jaccard Similarity Index
A similarity is represented by the Jaccard index:
where and are sets representing the two documents in our context.
Shingling
A useful way to construct a set representing a document is by shingling. To construct a set of singles from a text, a sliding window of characters is applied over the text. For example, if the text is “abcdabd,” the resulting set of 2shingles is {ab, bc, cd, da, bd} (note that “ab” appears only once and not repeated in the set).
The value of is arbitrary, but it should be large enough that the probability of any given shingle appearing in any random document is low. That is, if the available number of characters is and the character length of typical documents is , we should at least ensure . Since each character has a different frequency of appearance in a typical text, an suitable value of depends on the nature of the documents and should be tuned accordingly. A good rule of thumb for an order of magnitude estimate is to assume for English texts.
Instead of using individual characters, shingles can also be constructed from words. For example, in a math text book we may often see a sentence beginning with a terse expression “it is trivial to show,” whose 3shingle set is {“it is trivial”, “is trivial to”, “trivial to show”}. This has advantage in that shingles built this way are more sensitive to the styles of writing. The style sensitivity may aid in identifying similarities between domainspecific texts buried in other types of documents.
Hashing Shingles
Typically, shingles are hashed so that those who contain similar information are grouped into a single bucket while being represented as an integer. The latter is a huge advantage in that the data can be compressed more efficiently in terms of bytes. For example, a 4shingle (of characters) typically uses 4 bytes, each byte used for character representation, and this is good for representing 160,000 4shingles (i.e., ). With a 4byte, however, about 4 million () integers and therefore shingles could be represented, which is good enough size for (i.e., ). If a small probability for collision into the same bucket can be tolerated, can be chosen even larger. From here on, we assume a random hash function does not produce any collision between any pair of randomly chosen shingles, i.e., the mapping yields a unique integer.
Characteristic Matrix
Suppose we have a random hash function and all possible singles from for a total of documents. We can summarize this in a characteristic matrix:
where the entry of indicates that the document contains a shingle for which a hash value exists. (The entry of means the shingle itself does not appear in that document.) It is trivial to compute Jaccard indices using any pair of documents from this matrix. In practice, however, the requirement for comparing all the hash values for a large number of documents makes the process prohibitive.
MinHash as a Jaccard Index Estimator
Let us focus on a pair of documents, and , for which the shingles , , , have been hashed by a function . The relevant entries from the characteristic matrix look as follows:




0 
0 

1 
0 

1 
1 

0 
1 

1 
0 

0 
0 

1 
1 
There are three types of rows: (a) both columns have 1, (b) one of the columns have 1, and (c) both columns have 0. We let , , and denote the numbers of rows categorized this way, respectively. For and , is the cardinality of their joint set and is that for their union. Hence the Jaccard index is .
Now, consider an experiment in which the rows in the matrix are randomly permutated. Remove the rows of type (c), since they do not contribute at all to the union of two sets. We look at the first row of the matrix thus constructed and note its type defined above, either (a) or (b). Repeat the process times. What is the chance that the first row found this way to be of type (a) above? The probability is given by , which is similar to the way Jaccard index is computed. This is the property that we use as a Jaccard index estimator.
In practice, randomly permuting a huge number of rows is very inefficient. Instead, we prepare a set of random hash functions (for for a set of measurements) that effectively permute the row order given the same set of shingles and sort rows in ascending order by hash values. (In order for this to be true, the hash functions need to be wellchosen and produce few collisions.) The row of the minimum hash value corresponds to picking the first row in the example above.
What we have shown is that, for estimating Jaccard indices, we only need to keep the minimum hash values generated from different hash functions. Therefore the very sparse characteristic matrix can be condensed into a signature matrix of minimum hash values with entries given by
where
is the set of shingles from the document .








98273 

23 
23 

63483 

2763 

524 
524 

2345 








325 

567849 
567849 

987 

876 

7849 
32 

897347 
For supposedly nearduplicate documents such as and in the table, most MinHash values are similar, and the fraction of values that are similar is an estimate of the Jaccard index. This is the gist of the MinHash algorithm. In other words, the probability that a pair of MinHash values from two documents and match is equivalent to their Jaccard index:
(1)
Localitysensitive Hashing
While the information necessary to compute document similarity have been compressed quite nicely into a signature matrix, examining all documents would take pairs, each involving comparisons from signature entries. The vast majority of documents may not be nearduplicate, however, and in such a case we do not need every pair to be compared. Localitysensitive hashing (LSH) offers a method of reducing the number of dimensions in highdimensional MinHash signatures.
The idea is to partition a MinHash signature matrix into bands, each with rows (such that is the total number of rows), and hashing MinHash integer sequences grouped by band. For example, if we have MinHash values, we could partition them into bands of rows:
Band 







1 

98273 

23 
23 

63483 
1 

2763 

524 
524 

2345 
1 

49368 

7207 
7207 

59542 
1 

9559 

34784 
34784 

6095 
2 

37153 

14927 
14927 

581 
2 

8671 

17492 
17492 

6664 
2 

2763 

43306 
43306 

10916 
2 

2600 

38712 
38712 

45472 
3 

14352 

25862 
25862 

14812 
3 

263 

52 
52 

11973 
3 

325 

567849 
567849 

987 
3 

876 

7849 
32 

897347 
Then we use some good hash function which takes MinHash values and summarizes into another hash, for band 1, for band 2, and so on. This reduces the signature matrix into matrix:








390 

57232 
57232 

33719 

62509 

453 
453 

51954 

453 

13009 
23905 

12174 
Nearduplicate documents will be hashed into the same bucket within each band. In this example, and are in the same bucket for bands 1 and 2. (Note that in band 3 has the same hash value as and in band 2, but they are not considered to be in the same bucket since the bands are different.) The documents that share a bucket within a band is considered candidates for further examination. The advantage is that, since in general, the number of required comparisons is much smaller. The LSH thus provides a way to select out candidates for nearduplicate detection, before full signature comparisons are carried out.
The assumption is that a pair of documents, if nearduplicate, has a total of chances to be hashed into a common bucket in at least one of the available bands. Recall from Eq.~(1) that the probability that a pair of MinHash values from two documents match is equivalent to their Jaccard index. The probability that a pair of documents share a bucket in a band of rows is given by . Its complement, , is the probability that a document pair does not get hashed into the same bucket for a band. Then the probability that two documents become candidates in at least one band is given by . Plotting for varying and , the function forms a series of Scurves:
The figure provides an intuition as to how the value of and should be chosen for a target Jaccard similarity threshold (above which two documents are considered nearduplicate). Let . The value of for the steepest slope is obtained from the second derivative, , which is
for . As a rule of thumb, we want want , but the exact value of can be adjusted based on rejection criteria. Choosing reduces false positives, whereas reduces false negatives at the candidate selection step.
References
 Anand Rajaraman and Jeffrey David Ullman (2011). Mining of Massive Datasets. Cambridge University Press. ISBN 9781107015357