
Near-Duplicate Detection using MinHash: Background

Written by Taro Sato. Tagged: stats, Python, math

There are numerous pieces of duplicate information served by multiple sources on the web. Many news stories that we receive from the media tend to originate from the same source, such as the Associated Press. When such content is scraped off the web for archiving, a need may arise to categorize documents by their similarity (not in the sense of the meaning of the text, but in terms of character-level or lexical matching).

Here, we build a prototype for a near-duplicate document detection system. This article presents the background material on an algorithm called MinHash and a method for probabilistic dimension reduction through locality-sensitive hashing. A future article presents their implementation with Python and CouchDB.

(Note that all the numbers generated for the tables in this article are made up for illustration purposes; they may not be mathematically consistent with any hashing algorithm.)

Jaccard Similarity Index

The similarity between two documents is represented by the Jaccard index:

$$J(D_i, D_j) = \frac{|D_i \cap D_j|}{|D_i \cup D_j|} ,$$

where $D_i$ and $D_j$ are sets representing the two documents in our context.
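
To make this concrete, the index is straightforward to compute with Python set operations; a minimal sketch (the function name is mine):

    def jaccard(a, b):
        """Jaccard index of two sets."""
        return len(a & b) / len(a | b)

    jaccard({"ab", "bc", "cd"}, {"bc", "cd", "de"})  # 2 / 4 = 0.5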

Shingling

A useful way to construct a set representing a document is by shingling. To construct a set of k-shingles from a text, a sliding window of k characters is applied over the text. For example, if the text is “abcdabd,” the resulting set of 2-shingles is {ab, bc, cd, da, bd} (note that “ab” appears twice in the text but only once in the set).
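
A minimal sketch of character shingling (the function name is illustrative):

    def char_shingles(text, k):
        """Set of k-character shingles from a sliding window over text."""
        return {text[i:i + k] for i in range(len(text) - k + 1)}

    char_shingles("abcdabd", 2)  # {'ab', 'bc', 'cd', 'da', 'bd'}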

The value of $k$ is arbitrary, but it should be large enough that the probability of any given shingle appearing in any random document is low. That is, if the available number of characters is $c$ and the character length of typical documents is $l$, we should at least ensure $c^k \gg l - k + 1$. Since each character has a different frequency of appearance in a typical text, a suitable value for $k$ depends on the nature of documents and should be tuned accordingly. A good rule of thumb for an order-of-magnitude estimate is to assume $c = 20$ for English texts.

Instead of using individual characters, shingles can also be constructed from words. For example, in a math textbook we may often see a sentence beginning with a terse expression “it is trivial to show,” whose 3-shingle set is {“it is trivial”, “is trivial to”, “trivial to show”}. This has an advantage in that shingles built this way are more sensitive to the styles of writing. The style sensitivity may aid in identifying similarities between domain-specific texts buried in other types of documents.
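
A word-based variant might look like the following sketch, assuming simple whitespace tokenization:

    def word_shingles(text, k):
        """Set of k-word shingles from whitespace-tokenized text."""
        words = text.split()
        return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

    word_shingles("it is trivial to show", 3)
    # {'it is trivial', 'is trivial to', 'trivial to show'}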

Hashing Shingles

Typically, shingles are all hashed and grouped into buckets, each represented by an integer. The use of an integer is a huge advantage in terms of data compression. For example, a 4-shingle (of characters) typically uses 4 bytes, one byte per character, and this is good for representing 160,000 4-shingles (i.e., $20^4$). With a 4-byte integer, however, about 4 billion ($2^{32}$) shingles can be represented, which is a good enough size for $k = 7$ (i.e., $20^7 \approx 1.3$ billion). If a tiny probability of collision into the same bucket can be tolerated, $k$ can be chosen even larger. From here on, we assume a random hash function does not produce any collision between any pair of randomly chosen shingles, i.e., the mapping $s_i \mapsto h(s_i)$ yields a unique integer.
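
As a sketch of one way to get such integer buckets (MD5 here is my arbitrary choice of a well-mixed hash, not something the above prescribes):

    import hashlib

    def hash_shingle(shingle):
        """Map a shingle to a 32-bit integer bucket; collisions assumed negligible."""
        digest = hashlib.md5(shingle.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big")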

Characteristic Matrix

Suppose we have a random hash function $h(s)$ and all possible shingles $s_1, s_2, \ldots, s_m$ from $D_1, D_2, \ldots, D_l$ for a total of $l$ documents. We can summarize this in a characteristic matrix:

|          | $D_1$    | $D_2$    | $\cdots$ | $D_l$    |
|----------|----------|----------|----------|----------|
| $h(s_1)$ | 1        | 0        | $\cdots$ | 1        |
| $h(s_2)$ | 1        | 1        | $\cdots$ | 0        |
| $\vdots$ | $\vdots$ | $\vdots$ | $\ddots$ | $\vdots$ |
| $h(s_m)$ | 1        | 1        | $\cdots$ | 0        |

where the entry of 1 indicates that the document $D_j$ contains a shingle $s_i$ for which a hash value $h(s_i)$ exists. (The entry of 0 means the shingle does not appear in that document.) It is trivial to compute the Jaccard index for any pair of documents from this matrix. In practice, however, the requirement of comparing all the hash values for a large number of documents makes the process prohibitive.
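
For illustration, the Jaccard index can be read directly off two columns of a 0/1 characteristic matrix; a sketch using NumPy (my choice of tooling):

    import numpy as np

    def jaccard_from_matrix(M, i, j):
        """Jaccard index of documents in columns i and j of a 0/1 matrix M."""
        both = np.sum((M[:, i] == 1) & (M[:, j] == 1))    # rows where both have the shingle
        either = np.sum((M[:, i] == 1) | (M[:, j] == 1))  # rows where at least one has it
        return both / either

    M = np.array([[1, 0],
                  [1, 1],
                  [0, 1],
                  [1, 1]])
    jaccard_from_matrix(M, 0, 1)  # 2 / 4 = 0.5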

MinHash as a Jaccard Index Estimator

Let us focus on a pair of documents, $D_1$ and $D_2$, for which the shingles $s_1, s_2, \ldots, s_7$ have been hashed by a function $h$. The relevant entries from the characteristic matrix look as follows:

|          | $D_1$ | $D_2$ |
|----------|-------|-------|
| $h(s_1)$ | 0     | 0     |
| $h(s_2)$ | 1     | 0     |
| $h(s_3)$ | 1     | 1     |
| $h(s_4)$ | 0     | 1     |
| $h(s_5)$ | 1     | 0     |
| $h(s_6)$ | 0     | 0     |
| $h(s_7)$ | 1     | 1     |

There are three types of rows: (a) both columns have 1, (b) only one of the columns has 1, and (c) both columns have 0. We let $X$, $Y$, and $Z$ denote the numbers of rows of each type, respectively. For $D_1 = \{h(s_2), h(s_3), h(s_5), h(s_7)\}$ and $D_2 = \{h(s_3), h(s_4), h(s_7)\}$, $X$ is the cardinality of their intersection and $X + Y$ is that of their union. Hence the Jaccard index is $X / (X + Y) = 2/5 = 0.4$.

Now, consider an experiment in which the rows of the matrix are randomly permuted. Remove the rows of type (c), since they do not contribute at all to the union of the two sets. We look at the first row of the matrix thus constructed and note its type as defined above, either (a) or (b), and repeat the process $n$ times. What is the chance that the first row found this way is of type (a)? Since each of the $X + Y$ remaining rows is equally likely to come first, the probability is $X / (X + Y)$, which is exactly the Jaccard index. This is the property that we use as a Jaccard index estimator.
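
This property is easy to verify numerically; a quick Monte Carlo sketch using the two-document matrix above:

    import random

    # Rows of the characteristic matrix for D1 and D2 above.
    M = [(0, 0), (1, 0), (1, 1), (0, 1), (1, 0), (0, 0), (1, 1)]

    def estimate_by_permutation(M, trials=100_000):
        """Fraction of random permutations whose first remaining row is type (a)."""
        rows = [r for r in M if r[0] or r[1]]  # remove type (c) rows
        hits = 0
        for _ in range(trials):
            random.shuffle(rows)
            hits += rows[0] == (1, 1)          # type (a): both columns have 1
        return hits / trials

    estimate_by_permutation(M)  # ~0.4, the Jaccard index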

In practice, randomly permuting a huge number of rows is very inefficient. Instead, we prepare a set of $n$ random hash functions $h_i(s)$ (for $i = 1, 2, \ldots, n$, one per measurement) that effectively permute the row order given the same set of shingles, sorting rows in ascending order of hash values. (For this to be true, the hash functions need to be well-chosen and produce few collisions.) Taking the row with the minimum hash value then corresponds to picking the first row in the example above.

What we have shown is that, for estimating Jaccard indices, we only need to keep the minimum hash values generated from $n$ different hash functions. Therefore the very sparse characteristic matrix can be condensed into a signature matrix of minimum hash values, with entries given by

$$h_i^{\min} \equiv \min \{ h_i(s_1), h_i(s_2), \ldots, h_i(s_m) \} ,$$

where

$$D_j = \{ s_1, s_2, \ldots, s_m \}$$

is the set of shingles from the document $D_j$.

|           | $D_1$    | $\cdots$ | $D_j$    | $D_{j+1}$ | $\cdots$ | $D_l$    |
|-----------|----------|----------|----------|-----------|----------|----------|
| $h_1$     | 98,273   | $\cdots$ | 23       | 23        | $\cdots$ | 63,483   |
| $h_2$     | 2,763    | $\cdots$ | 524      | 524       | $\cdots$ | 2,345    |
| $\vdots$  | $\vdots$ |          | $\vdots$ | $\vdots$  |          | $\vdots$ |
| $h_{n-1}$ | 325      | $\cdots$ | 567,849  | 567,849   | $\cdots$ | 987      |
| $h_n$     | 876      | $\cdots$ | 7,849    | 32        | $\cdots$ | 897,347  |

For supposedly near-duplicate documents such as $D_j$ and $D_{j+1}$ in the table, most MinHash values agree, and the fraction of agreeing entries is an estimate of the Jaccard index. This is the gist of the MinHash algorithm. In other words, the probability $p$ that a pair of MinHash values from two documents $D_i$ and $D_j$ match is equal to their Jaccard index:

$$p = J(D_i, D_j) . \tag{1}$$
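
Putting the pieces together, a minimal sketch of signature construction; the salting scheme used to build the $n$ hash functions $h_i$ is an illustrative assumption:

    import hashlib

    def make_hash_funcs(n):
        """Derive n hash functions h_i by salting a base digest with the index i."""
        def make(i):
            def h(shingle):
                data = f"{i}:{shingle}".encode("utf-8")
                return int.from_bytes(hashlib.md5(data).digest()[:4], "big")
            return h
        return [make(i) for i in range(n)]

    def minhash_signature(shingles, hash_funcs):
        """One column of the signature matrix: the minimum of each h_i over shingles."""
        return [min(h(s) for s in shingles) for h in hash_funcs]

    def estimate_jaccard(sig_a, sig_b):
        """Fraction of matching MinHash values, per Eq. (1)."""
        return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)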

Locality-Sensitive Hashing

While the information necessary to compute document similarity has been compressed quite nicely into a signature matrix, examining all documents would take $l(l-1)/2$ pairwise comparisons, each involving $n$ signature entries. The vast majority of pairs may not be near-duplicate, however, in which case not every pair needs to be compared. Locality-sensitive hashing (LSH) offers a method of reducing the number of dimensions in high-dimensional MinHash signatures.

The idea is to partition the MinHash signature matrix into $b$ bands of $r$ rows each (such that $n = br$ is the total number of rows) and to hash the sequence of MinHash integers within each band. For example, if we have $n = 12$ MinHash values, we could partition them into $b = 3$ bands of $r = 4$ rows:

| Band |          | $D_1$  | $\cdots$ | $D_j$   | $D_{j+1}$ | $\cdots$ | $D_l$   |
|------|----------|--------|----------|---------|-----------|----------|---------|
| 1    | $h_1$    | 98,273 | $\cdots$ | 23      | 23        | $\cdots$ | 63,483  |
| 1    | $h_2$    | 2,763  | $\cdots$ | 524     | 524       | $\cdots$ | 2,345   |
| 1    | $h_3$    | 49,368 | $\cdots$ | 7,207   | 7,207     | $\cdots$ | 59,542  |
| 1    | $h_4$    | 9,559  | $\cdots$ | 34,784  | 34,784    | $\cdots$ | 6,095   |
| 2    | $h_5$    | 37,153 | $\cdots$ | 14,927  | 14,927    | $\cdots$ | 581     |
| 2    | $h_6$    | 8,671  | $\cdots$ | 17,492  | 17,492    | $\cdots$ | 6,664   |
| 2    | $h_7$    | 2,763  | $\cdots$ | 43,306  | 43,306    | $\cdots$ | 10,916  |
| 2    | $h_8$    | 2,600  | $\cdots$ | 38,712  | 38,712    | $\cdots$ | 45,472  |
| 3    | $h_9$    | 14,352 | $\cdots$ | 25,862  | 25,862    | $\cdots$ | 14,812  |
| 3    | $h_{10}$ | 263    | $\cdots$ | 52      | 52        | $\cdots$ | 11,973  |
| 3    | $h_{11}$ | 325    | $\cdots$ | 567,849 | 567,849   | $\cdots$ | 987     |
| 3    | $h_{12}$ | 876    | $\cdots$ | 7,849   | 32        | $\cdots$ | 897,347 |

Then we use some good hash function $H$ which takes $r$ MinHash values and summarizes them into an integer: $H(h_1, h_2, h_3, h_4) \equiv H_1$ for band 1, $H(h_5, h_6, h_7, h_8) \equiv H_2$ for band 2, and so on. This reduces the $n \times l$ signature matrix to a $b \times l$ matrix:

|       | $D_1$  | $\cdots$ | $D_j$  | $D_{j+1}$ | $\cdots$ | $D_l$  |
|-------|--------|----------|--------|-----------|----------|--------|
| $H_1$ | 390    | $\cdots$ | 57,232 | 57,232    | $\cdots$ | 33,719 |
| $H_2$ | 62,509 | $\cdots$ | 453    | 453       | $\cdots$ | 51,954 |
| $H_3$ | 453    | $\cdots$ | 13,009 | 23,905    | $\cdots$ | 12,174 |

Near-duplicate documents are likely to be hashed into the same bucket in at least one band. In this example, $D_j$ and $D_{j+1}$ share a bucket in bands 1 and 2. (Note that $D_1$ in band 3 has the same hash value, 453, as $D_j$ and $D_{j+1}$ in band 2, but they are not considered to be in the same bucket since the bands are different.) The documents that share a bucket within a band are considered candidates for further examination. The advantage is that, since $b \ll n$ in general, the number of required comparisons is much smaller. LSH thus provides a way to select candidates for near-duplicate detection before full signature comparisons are carried out.
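
A sketch of the banding step follows; Python's tuple-keyed dictionary stands in for the band hash function $H$:

    from collections import defaultdict
    from itertools import combinations

    def lsh_candidates(signatures, b, r):
        """Candidate pairs: documents agreeing on all r rows of at least one band.

        signatures maps doc_id -> list of n = b * r MinHash values.
        """
        candidates = set()
        for band in range(b):
            buckets = defaultdict(list)
            for doc_id, sig in signatures.items():
                key = tuple(sig[band * r:(band + 1) * r])  # plays the role of H
                buckets[key].append(doc_id)
            for docs in buckets.values():
                candidates.update(combinations(sorted(docs), 2))
        return candidates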

The assumption is that a pair of documents, if near-duplicate, has a total of $b$ chances of being hashed into a common bucket in at least one of the available bands. Recall from Eq. (1) that the probability that a pair of MinHash values from two documents match is equal to their Jaccard index $p$. The probability that a pair of documents share a bucket in a band of $r$ rows is then given by $p^r$. Its complement, $1 - p^r$, is the probability that the pair does not get hashed into the same bucket for a given band, so the probability that the two documents become candidates in at least one band is $1 - (1 - p^r)^b$. Plotting this for varying $b$ and $r$, the function forms a series of S-curves¹:

Figure 1: Probability of Being Chosen as Candidates vs. Jaccard Index

The figure provides an intuition as to how the values of $b$ and $r$ should be chosen for a target Jaccard similarity threshold $t$ (above which two documents are considered near-duplicate). Let $f(p) = 1 - (1 - p^r)^b$. The value of $p$ at the steepest slope is obtained from the second derivative, $\left. d^2 f / dp^2 \right|_{p = p_t} = 0$, which gives

$$p_t = \left( \frac{r - 1}{br - 1} \right)^{1/r} \approx \left( \frac{1}{b} \right)^{1/r}$$

for $b, r \gg 1$. As a rule of thumb, we want $p_t \approx t$, but the exact value of $p_t$ can be adjusted based on rejection criteria: choosing $p_t > t$ reduces false positives, whereas $p_t < t$ reduces false negatives at the candidate selection step.
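
For a quick numerical check of where the steepest slope falls (the function name is mine):

    def lsh_threshold(b, r):
        """Jaccard index at the steepest slope of f(p) = 1 - (1 - p**r)**b."""
        return ((r - 1) / (b * r - 1)) ** (1 / r)

    lsh_threshold(8, 5)  # ~0.63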

Reference

  • Anand Rajaraman and Jeffrey David Ullman (2011). Mining of Massive Datasets. Cambridge University Press. ISBN 978-1-107-01535-7

  1. The figure is generated by the following Python script:

    import matplotlib.pyplot as plt
    import numpy as np

    # Jaccard index values p to sweep over.
    ps = np.arange(0, 1, 0.01)

    # Probability of being chosen as candidates: 1 - (1 - p^r)^b.
    for r in [5, 10]:
        for b in [2, 4, 8]:
            ys = 1 - (1 - ps**r) ** b
            plt.plot(ps, ys, label=f"r={r}, b={b}")

    plt.xlabel("Jaccard index")
    plt.ylabel("Probability of being chosen as candidates")
    plt.legend(loc="upper left")
    plt.show()
    