---
title: README
emoji: π
colorFrom: red
colorTo: blue
sdk: static
pinned: false
---
# The OSCAR Project
<img align="left" src="https://oscar-project.org/authors/admin/avatar_hub278c5e06b117b962e66239a5e135575_291648_270x270_fill_lanczos_center_3.png" width="200" height="200" />
The OSCAR project (**O**pen **S**uper-large **C**rawled **A**ggregated co**R**pus) is an Open Source project that aims to provide web-based multilingual resources and datasets for Machine Learning (ML) and Artificial Intelligence (AI) applications. The project focuses specifically on providing large quantities of unannotated raw data of the kind commonly used to pre-train large deep learning models. The OSCAR project has developed [high-performance data pipelines](https://github.com/oscar-corpus/ungoliant) specifically conceived to classify and filter large amounts of [web data](https://commoncrawl.org/). The project also pays special attention to improving the data quality of web-based corpora and to providing data for low-resource languages, so that these new ML/AI technologies are accessible to as many communities as possible.
Data is distributed by language in both original and deduplicated form. There are currently 166 different languages available. If you use OSCAR, please consider giving us some feedback by writing to our [email address](mailto:contact@oscar-project.org). Please also consider citing our [papers](https://oscar-corpus.com/#featured).
If you want to contribute to OSCAR please open a pull request!
Since 2019, the OSCAR Project has been funded by [Inria](https://www.inria.fr/en) (project-team [ALMAnaCH](https://almanach.inria.fr/index-en.html)) and the [PRAIRIE institute](https://prairie-institute.fr/). Starting in 2023, [DFKI](https://www.dfki.de/en/web) and the German Federal Ministry for Economic Affairs and Climate Action ([BMWK](https://www.bmwk.de/Navigation/EN/Home/home.html)), through the [OpenGPT-X](https://opengpt-x.de/en/) project, have joined Inria, ALMAnaCH and the PRAIRIE institute in funding the OSCAR Project. During 2022 and at the beginning of 2023, OSCAR was also briefly funded by [the University of Mannheim](https://www.uni-mannheim.de/en/).
If you are interested in OSCAR and would like to access the corpus, send an email to our [email address](mailto:contact@oscar-project.org) with "OSCAR Access Request" as the subject. Please include your first name, last name, affiliation, contact details, the languages you need, and a brief description of how you intend to use OSCAR.
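As a convenience, the request described above could be prepared with a small script. The following is only a minimal sketch: the function name `build_access_request` and the layout of the message body are illustrative assumptions, not an official template. Only the address, the subject, and the list of requested details come from the text above.

```python
from urllib.parse import quote

# Address and subject are taken from the instructions above.
OSCAR_CONTACT = "contact@oscar-project.org"
OSCAR_SUBJECT = "OSCAR Access Request"


def build_access_request(first_name, last_name, affiliation,
                         contact_details, languages, intended_use):
    """Return a mailto: URL pre-filled with the requested details.

    Hypothetical helper: the body layout below is an assumption,
    not a format prescribed by the OSCAR project.
    """
    body = "\n".join([
        f"Name: {first_name}",
        f"Last name: {last_name}",
        f"Affiliation: {affiliation}",
        f"Contact details: {contact_details}",
        f"Languages needed: {', '.join(languages)}",
        f"Intended use: {intended_use}",
    ])
    # Percent-encode subject and body so the URL is valid.
    return (f"mailto:{OSCAR_CONTACT}"
            f"?subject={quote(OSCAR_SUBJECT)}&body={quote(body)}")


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(build_access_request(
        "Ada", "Lovelace", "Example University",
        "ada@example.org", ["French", "Breton"],
        "Pre-training a small multilingual language model.",
    ))
```

Opening the printed URL in a mail client should produce a draft that you can review and edit before sending.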
Grab the latest OSCAR release [here](https://oscar-project.github.io/documentation/versions/oscar-2301/)!
Join our Discord community [here](https://discord.com/invite/4JNg9FTar4)!