victormiller
commited on
Commit
•
466af30
1
Parent(s):
e01353c
Update common.py
Browse files
common.py
CHANGED
@@ -258,7 +258,7 @@ global_div = Div(
|
|
258 |
),
|
259 |
Section(
|
260 |
H3("MinHash Generation"),
|
261 |
-
P("We use the datasketch library to generate MinHash signatures with the number of permutations to 128. To calculate a signature, represented as a MinHash object for each document, we first clean the text by stripping whitespace, converting to lowercase, and removing punctuation, consecutive spaces, newlines, and tabs. Next, we generate a list of 13-grams to use as features for creating a document signature. These signatures along with globally-unique document ids are then saved to disk. We designed a document id encoding scheme to convert file names and line numbers (there is one document per line) to unique document ids. This also helped a lot in saving disk and memory for this stage.),
|
262 |
P(B("This step produced 20 TB of hashes.")),
|
263 |
),
|
264 |
Section(
|
|
|
258 |
),
|
259 |
Section(
|
260 |
H3("MinHash Generation"),
|
261 |
+
P("We use the datasketch library to generate MinHash signatures with the number of permutations to 128. To calculate a signature, represented as a MinHash object for each document, we first clean the text by stripping whitespace, converting to lowercase, and removing punctuation, consecutive spaces, newlines, and tabs. Next, we generate a list of 13-grams to use as features for creating a document signature. These signatures along with globally-unique document ids are then saved to disk. We designed a document id encoding scheme to convert file names and line numbers (there is one document per line) to unique document ids. This also helped a lot in saving disk and memory for this stage."),
|
262 |
P(B("This step produced 20 TB of hashes.")),
|
263 |
),
|
264 |
Section(
|