victormiller
committed on
Commit
e01353c
1
Parent(s):
c2b0703
Update common.py
common.py
CHANGED
@@ -258,21 +258,23 @@ global_div = Div(
 ),
 Section(
 H3("MinHash Generation"),
-P("We use the datasketch library to generate MinHash signatures with the number of permutations to 128. To calculate a signature, represented as a MinHash object for each document, we first clean the text by stripping whitespace, converting to lowercase, and removing punctuation, consecutive spaces, newlines, and tabs. Next, we generate a list of 13-grams to use as features for creating a document signature. These signatures along with globally-unique document ids are then saved to disk. We designed a document id encoding scheme to convert file names and line numbers (there is one document per line) to unique document ids. This also helped a lot in saving disk and memory for this stage.
+P("We use the datasketch library to generate MinHash signatures with the number of permutations set to 128. To calculate a signature, represented as a MinHash object for each document, we first clean the text by stripping whitespace, converting to lowercase, and removing punctuation, consecutive spaces, newlines, and tabs. Next, we generate a list of 13-grams to use as features for creating a document signature. These signatures along with globally-unique document ids are then saved to disk. We designed a document id encoding scheme to convert file names and line numbers (there is one document per line) to unique document ids. This also helped a lot in saving disk and memory for this stage."),
+P(B("This step produced 20 TB of hashes.")),
 ),
 Section(
 H3("Matching Pairs Generation"),
 P("We are using a Jaccard similarity threshold of 0.8 to identify near-duplicate documents. To do this, we divide the MinHashes into 9 bands, each with 13 hashes (also known as the range). To save memory during matching, we first store each band of MinHashes separately on disk. We then process each band individually. Within each band, documents are matched based on their hashes, and the matches are saved as document pairs. A document is considered a match if it matches another document in any of the 9 bands. Since we are looking for near-duplicates, a document may match multiple documents across different bands."),
 P("For partitioning and matching the hashes, we utilize Dask's bag data structure to load the document ids and MinHashes. The matching process is simply a group by operation on this bag data structure. This approach allows us to group matches efficiently and distribute the operation to multiple machines. Also doing a group by produces full components (documents that share the same signature) within a band which simplifies the later stages. The algorithm can be expressed using the Dask expression below:"),
 Pre(Code(dask_algo,)),
-P("This step produced 9.2 TB of matching pairs from all bands."),
+P(B("This step produced 9.2 TB of matching pairs from all bands.")),
 ),
 Section(
 H3("Finding Duplicate Pairs"),
 P("Multiple bands can create the same document pairs, leading to duplicates. The simplest way to eliminate these duplicate pairs is to call distinct() before the compute(). However, we found that Dask is not very efficient when it comes to distributed distinct execution. Additionally, since we process each band separately, this approach wouldn't remove duplicates across different bands."),
 P("To address this, we use a Bloom filter with a capacity of 64 billion and a false positive rate of 0.001 to remove duplicates. One way we parallelize the Bloom filter execution is by partitioning pairs horizontally and running one filter per partition, as shown in the table below. There is a high chance that duplicates from different bands will have the same pairs in the same horizontal partition. This step reduces the number of pairs by nearly ninefold."),
 table_div_bloom_examples,
-P("The resulting unique pairs are then used to identify clusters of near-duplicates by finding connected components in a graph, where the vertices represent documents and the edges represent matches.
+P("The resulting unique pairs are then used to identify clusters of near-duplicates by finding connected components in a graph, where the vertices represent documents and the edges represent matches."),
+P(B("This step produced 1.9 TB of unique pairs.")),
 ),
 Section(
 H3("Finding Connected Components using MapReduce"),
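The "Matching Pairs Generation" paragraph in this hunk quotes 9 bands of 13 hashes and a Jaccard threshold of 0.8. As a rough sanity check on those numbers, the sketch below evaluates the textbook LSH banding formula, 1 - (1 - s^r)^b, for the probability that two documents with Jaccard similarity s agree on at least one whole band. The function names `match_probability` and `approximate_threshold` are illustrative and do not come from common.py. The model assumes each MinHash position agrees independently with probability s, which is the usual idealization and not a claim about the pipeline's measured behavior.

```python
def match_probability(s: float, bands: int = 9, rows: int = 13) -> float:
    """Probability that two documents with Jaccard similarity `s`
    share at least one identical band, under the standard LSH model.

    Assumption: each of the `rows` hashes in a band agrees independently
    with probability `s`, so a whole band agrees with probability s**rows.
    """
    if not 0.0 <= s <= 1.0:
        raise ValueError("s must be in [0, 1]")
    return 1.0 - (1.0 - s ** rows) ** bands


def approximate_threshold(bands: int = 9, rows: int = 13) -> float:
    """Common rule-of-thumb location of the steep part of the S-curve."""
    return (1.0 / bands) ** (1.0 / rows)


if __name__ == "__main__":
    # Print the S-curve at a few similarity levels around the 0.8 threshold.
    for s in (0.5, 0.7, 0.8, 0.9, 0.95):
        print(f"s={s:.2f}  P(match)={match_probability(s):.4f}")
    print(f"rule-of-thumb threshold: {approximate_threshold():.3f}")
```

Note that 9 bands of 13 hashes cover 117 positions, so if the bands are built exactly as described, 11 of the 128 permutations fall outside any band.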