Hi everyone,
For my first post, I'm announcing a big release (in multiple ways): probably the largest open corpus in French to date, with 85 billion words in the public domain.
The dataset was prepared in collaboration with Benoît de Courson and Benjamin Azoulay from Gallicagram (https://shiny.ens-paris-saclay.fr/app/gallicagram). Gallicagram is a major cultural analytics project in French: an open and better version of the Ngram Viewer for large-scale search of word and n-gram occurrences.
The corpus is made up of two datasets: monographs (16B words) in PleIAs/French-PD-Books and newspapers/periodicals (69B words) in PleIAs/French-PD-Newspapers. Along with the full text, it also includes core provenance metadata.
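For anyone who wants a quick look before downloading everything, here is a minimal sketch of how the records could be previewed in streaming mode with the Hugging Face `datasets` library. The helper names (`preview_records`, `open_stream`), the `train` split and the `complete_text` column name are assumptions for illustration, not something confirmed in this post; check the dataset card for the actual schema.

```python
from itertools import islice
from typing import Iterable, List, Mapping, Tuple


def preview_records(
    records: Iterable[Mapping], text_field: str, limit: int = 3
) -> List[Tuple[dict, int]]:
    """Return (metadata, word_count) pairs for the first `limit` records."""
    previews = []
    for record in islice(records, limit):
        text = record.get(text_field) or ""
        # Every field other than the full text is treated as provenance metadata.
        metadata = {key: value for key, value in record.items() if key != text_field}
        # Rough word count: whitespace-separated tokens.
        previews.append((metadata, len(text.split())))
    return previews


def open_stream(repo_id: str = "PleIAs/French-PD-Newspapers", split: str = "train"):
    """Open the dataset lazily (needs `pip install datasets` and network access).

    The split name is an assumption; adjust it to what the dataset card lists.
    """
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, streaming=True)


# Example call (the column name "complete_text" is an assumption):
# for metadata, n_words in preview_records(open_stream(), text_field="complete_text"):
#     print(n_words, metadata)
```

Streaming avoids pulling tens of billions of words to disk just to inspect a few documents, and keeping the text field as a parameter means the same helper works whatever the column is actually called.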
Beyond research in digital humanities, the corpus can also be used to train open and reproducible LLMs. Being in the public domain means it can be released anywhere, in any form, without restrictions.
The corpus is not perfect: digitization of cultural heritage is challenging and, especially for newspapers, we have to deal with layout issues and a significant rate of optical character recognition (OCR) errors. Our conviction is that releasing the corpus as a commons is the best way to improve on this. Sharing is caring.