victormiller committed
Commit 1ac16da • 1 Parent(s): 9919a10
Update web.py
web.py
CHANGED
@@ -285,7 +285,7 @@ def web_data():
 H2("Stage 1: Document Preparation"),
 
 
-P(B("Text Extraction: ")
+P(B("Text Extraction: "), """
 Common Crawl provides webpage texts via two formats: WARC (Web ARChive format) and WET (WARC Encapsulated Text).
 WARC files contain the raw data from the crawl, which store the full HTTP response and request metadata.
 WET files contain plaintexts extracted by Common Crawl. In line with previous works ([1], [2], [3], [4]),
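Review note on the hunk above: the removed line appears to leave the paragraph text outside any string literal, which Python would reject, while the added line passes it to `P(...)` as a second argument inside a triple-quoted string. The sketch below checks this with the standard-library `ast` module. The two snippets are shortened stand-ins for line 288 and what follows it, not the real contents of `web.py`. The closing `""")` is an assumption, because the hunk ends before the call is closed.

```python
import ast

# Hypothetical, abbreviated versions of the call before and after the commit.
# The paragraph body is shortened and the closing delimiters are assumed.
BEFORE = (
    'P(B("Text Extraction: ")\n'
    "Common Crawl provides webpage texts via two formats.\n"
    ")"
)
AFTER = (
    'P(B("Text Extraction: "), """\n'
    "Common Crawl provides webpage texts via two formats.\n"
    '""")'
)


def parses(source: str) -> bool:
    """Return True if `source` is syntactically valid Python.

    ast.parse only parses; it never executes the code, so the undefined
    names P and B do not matter here.
    """
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


before_ok = parses(BEFORE)  # bare words after B(...) are not valid syntax
after_ok = parses(AFTER)    # comma plus triple-quoted string is a valid call
print(before_ok, after_ok)
```

Under these assumptions the comma and the opening `"""` are what turn the paragraph into a normal argument. Whether the original file failed in exactly this way depends on lines outside the hunk.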