320!
docs/index.md +2 -2
@@ -2,9 +2,9 @@
 
 I'm fascinated by this new dataset https://huggingface.co/datasets/PleIAs/French-PD-Newspapers that just dropped on :hugging_face:.
 
-It references about 3 million French newspapers, with full-text. ("only"
+It references about 3 million French newspapers, with full-text. ("only" 320 files of about 700MB each :slightly_smiling_face:).
 
-I wrote a data loader that outputs a single parquet file combining all their metadata (that is, without the text contents). It takes only about 5 minutes to run, thanks to parquet magic (and fiber internet); not sure how much of the ~
+I wrote a data loader that outputs a single parquet file combining all their metadata (that is, without the text contents). It takes only about 5 minutes to run, thanks to parquet magic (and fiber internet); not sure how much of the ~200+GB I downloaded for that. Now I've started exploring the metadata in an observable project.
 
 In one query, I see that A LOT of publications stopped publishing in 1944/1945, and conversely a large number of newspapers started publishing between 1941 and 1946. This probably includes both collaborationist publications and resistance publications—it would be interesting to find a ML way to separate them.
 
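
The start/stop observation in the post could be reproduced with a query along these lines. This is only a minimal sketch: the post does not show its query or the dataset's column names, so the `(title, year)` rows, the `journal_*` identifiers, and the `first_and_last_year` helper are all hypothetical stand-ins for the real metadata.

```python
from collections import Counter

# Hypothetical metadata rows: (title identifier, year of one issue).
# The real dataset's column names are not given in the post.
issues = [
    ("journal_a", 1938), ("journal_a", 1941), ("journal_a", 1944),
    ("journal_b", 1942), ("journal_b", 1943), ("journal_b", 1944),
    ("journal_c", 1944), ("journal_c", 1946), ("journal_c", 1950),
    ("journal_d", 1921), ("journal_d", 1939),
]


def first_and_last_year(rows):
    """Map each title to the (first, last) year in which it has an issue."""
    span = {}
    for title, year in rows:
        first, last = span.get(title, (year, year))
        span[title] = (min(first, year), max(last, year))
    return span


spans = first_and_last_year(issues)

# Number of titles that start, and that stop, in each year.
starts = Counter(first for first, _ in spans.values())
stops = Counter(last for _, last in spans.values())

# Titles whose last issue falls in 1944/1945, and titles whose
# first issue falls between 1941 and 1946.
stopped_1944_45 = sorted(t for t, (_, last) in spans.items() if last in (1944, 1945))
started_1941_46 = sorted(t for t, (first, _) in spans.items() if 1941 <= first <= 1946)

print(stopped_1944_45)
print(started_1941_46)
```

Titles that appear in both lists (first and last issue within the war years) would be natural candidates for a closer look, though the metadata alone says nothing about which side a publication was on.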