victormiller
committed on
Commit
•
0e26631
1
Parent(s):
2191206
Update web.py
web.py
CHANGED
@@ -363,12 +363,41 @@ def web_data():
|
|
363 |
),
|
364 |
style="margin-top: 20px;",
|
365 |
),
|
366 |
-
H2("Web Data Processing
|
367 |
P("The following section provides explicit details covering the reasoning and decisions behind each of the filters we applied. The table below provides a high-level comparison of TxT360's filtering compared to other commonly used pretraining datasets."),
|
368 |
table_div_filter_data,
|
369 |
P("Our filtering rate is illustrated below. Before deduplication, our filtering rate is comparable to RefinedWeb. During global deduplication, we removed approximately 85.89% of the data, significantly higher than previous works, indicating a large number of duplicates across dumps. "),
|
370 |
Img(src="images/filter_rate.jpg", height = "300", width = "600" ),
|
371 |
P("Note: All percentages are based on the number of documents. The gray bars represent the relative percentages of removed documents at each step, while the colorful bars represent the percentages of retained documents relative to the total number of documents in the raw Common Crawl."),
|
372 |
H3("1. Document Preparation"),
|
373 |
|
374 |
H4("1.1 Text Extraction"),
|
|
|
363 |
),
|
364 |
style="margin-top: 20px;",
|
365 |
),
|
366 |
+
H2("Web Data Processing Summary"),
|
367 |
P("The following section provides explicit details covering the reasoning and decisions behind each of the filters we applied. The table below provides a high-level comparison of TxT360's filtering compared to other commonly used pretraining datasets."),
|
368 |
table_div_filter_data,
|
369 |
P("Our filtering rate is illustrated below. Before deduplication, our filtering rate is comparable to RefinedWeb. During global deduplication, we removed approximately 85.89% of the data, significantly higher than previous works, indicating a large number of duplicates across dumps. "),
|
370 |
Img(src="images/filter_rate.jpg", height = "300", width = "600" ),
|
371 |
P("Note: All percentages are based on the number of documents. The gray bars represent the relative percentages of removed documents at each step, while the colorful bars represent the percentages of retained documents relative to the total number of documents in the raw Common Crawl."),
|
372 |
+
P("We also adopt rules from RefinedWeb [1] to remove lines if they satisfy any of the following criteria:"),
|
373 |
+
Ul(
|
374 |
+
Li("the line is only composed of uppercase characters"),
|
375 |
+
Li("the line is only composed of numerical characters"),
|
376 |
+
Li("the line matches the pattern r'^\d+\s+likes$'"),
|
377 |
+
Li("the line only contains one word."),
|
378 |
+
),
|
379 |
+
P("We summarize other statistics-based rules originating from Gopher [7] in this section. The statistics that can be used include:"),
|
380 |
+
Ul(
|
381 |
+
Li("the word count in the document"),
|
382 |
+
Li("the mean word length"),
|
383 |
+
Li("the number of sentences"),
|
384 |
+
Li("the symbol-to-word ratio"),
|
385 |
+
Li("the fraction of alphabetic words"),
|
386 |
+
Li("and the number of stop words"),
|
387 |
+
),
|
388 |
+
P("Specifically, we remove any document which satisfies any of the following criteria:"),
|
389 |
+
Ul(
|
390 |
+
Li("it contains fewer than 50 words or more than 100,000 words"),
|
391 |
+
Li("its mean word length is outside the range of 3 to 10"),
|
392 |
+
Li("it contains fewer than 3 sentences"),
|
393 |
+
Li("its symbol-to-word ratio is greater than 0.1"),
|
394 |
+
Li("fewer than 80% of its words contain at least one alphabetic character"),
|
395 |
+
Li("it contains fewer than two of the stop words (the, be, to, of, and, that, have, with)"),
|
396 |
+
),
|
397 |
+
|
398 |
+
P("Following C4, we remove any page that contains the phrase “lorem ipsum”, since such pages often carry placeholder text."),
|
399 |
+
|
400 |
+
|
401 |
H3("1. Document Preparation"),
|
402 |
|
403 |
H4("1.1 Text Extraction"),
|
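Review note on the added summary text: the line-level and document-level rules it describes can be sketched in a few lines of Python. This is an illustrative sketch only, not the TxT360 pipeline code. The function names `keep_line` and `keep_document` are hypothetical, the sentence splitter is a naive regex, and the choice of `#` and the ellipsis as "symbols" is an assumption, since the hunk does not say which symbols are counted. Stop words are counted as distinct words, which is also an assumption.

```python
import re
import string

# Stop words exactly as listed in the added text.
STOP_WORDS = {"the", "be", "to", "of", "and", "that", "have", "with"}
LIKES_RE = re.compile(r"^\d+\s+likes$")


def keep_line(line: str) -> bool:
    """Return False if the line matches any line-level removal rule."""
    text = line.strip()
    if not text:
        return True  # assumption: blank lines are handled elsewhere
    if text.isupper():  # all cased characters are uppercase
        return False
    if text.isdigit():  # only numerical characters
        return False
    if LIKES_RE.match(text):  # e.g. "3 likes"
        return False
    if len(text.split()) == 1:  # a single word
        return False
    return True


def keep_document(text: str) -> bool:
    """Return False if the document matches any document-level removal rule."""
    words = text.split()
    n = len(words)
    if n < 50 or n > 100_000:
        return False
    mean_len = sum(len(w) for w in words) / n
    if mean_len < 3 or mean_len > 10:
        return False
    # Naive sentence count: split on runs of ., ! or ? (assumption).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if len(sentences) < 3:
        return False
    # Assumption: "symbols" are the hash sign and the ellipsis.
    symbols = text.count("#") + text.count("...") + text.count("\u2026")
    if symbols / n > 0.1:
        return False
    alpha_words = sum(1 for w in words if any(c.isalpha() for c in w))
    if alpha_words / n < 0.8:
        return False
    seen = {w.lower().strip(string.punctuation) for w in words}
    if len(STOP_WORDS & seen) < 2:  # distinct stop words (assumption)
        return False
    if "lorem ipsum" in text.lower():  # C4-style placeholder check
        return False
    return True
```

Applying `keep_line` to each line first and then `keep_document` to the rejoined text mirrors the order in which the summary lists the rules, though the hunk itself does not state an order.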