---
license: apache-2.0
datasets:
- PleIAs/common_corpus
language:
- en
- fr
- es
- de
- it
- la
- nl
- pl
---
**Pleias-3b-Preview** is an early preview of a 3-billion-parameter base model trained by Pleias on Common Corpus.

Like all the base and specialized models from Pleias, Pleias-3b-Preview has only been trained on open data that is out of copyright (public domain) or under a permissive license.

## Description
Pleias-3b-Preview is a transformer base model, pretrained entirely from scratch, using an architecture similar to Llama/GPT-NeoX for easier deployment and inference.

It includes the following features, which apply to any responsibly trained variant:
* Trained only on open data under a permissive license and in compliance with the European AI Act. By design, all Pleias models are unable to output copyrighted content.
* Extensive multilingual support for the main European languages.
* A new tokenizer designed for enhanced document processing tasks and better multilingual support.
* Extremely low levels of toxicity and problematic content.

Pleias-3b-Preview has demonstrated unusually strong multilingual generation for its size range. Fully supported languages include English, French, Spanish, German, Italian, Dutch, Latin and Portuguese.

## Recommended use
As a base model, Pleias-3b-Preview can only run continuation prompts.

Text generation currently supports a range of creative writing tasks in multiple European languages. For more consistent results, we recommend using a low or zero temperature with a slight repetition penalty (1.1-1.2), as in the sketch below.
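
For illustration, here is a minimal continuation-prompt sketch using the `transformers` library; the checkpoint identifier and the exact decoding settings are assumptions made for the example, not values stated on this card.

```python
# Minimal continuation sketch; "PleIAs/Pleias-3b-Preview" is an assumed
# repository id; substitute the actual checkpoint path.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "PleIAs/Pleias-3b-Preview"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "La Révolution française a commencé"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding (equivalent to zero temperature) with a slight
# repetition penalty, following the recommendation above.
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=False,
    repetition_penalty=1.15,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```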

Pleias-3b-Preview has been successfully adapted through continuous pretraining and full fine-tuning for document processing tasks such as RAG, translation, or OCR correction. Given the small size of the model, we do not recommend fine-tuning methods based on LoRA.
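
As a rough illustration of the full fine-tuning path (as opposed to LoRA), a minimal sketch with the `transformers` Trainer is given below; the dataset, sequence length, and hyperparameters are placeholder assumptions and do not reflect the authors' setup.

```python
# Illustrative full fine-tuning sketch (no adapters); every name and
# hyperparameter here is an assumption, not a setting used by Pleias.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_id = "PleIAs/Pleias-3b-Preview"  # assumed identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)

# "ocr_corpus.txt" is a hypothetical plain-text training file.
dataset = load_dataset("text", data_files={"train": "ocr_corpus.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="pleias-3b-finetuned",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=2e-5,
        num_train_epochs=1,
        bf16=True,
        logging_steps=50,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```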

## Training
Pleias-3b-Preview was fully pretrained at Jean Zay on 64 H100s for 46 hours with Nanotron, the pretraining library from Hugging Face. We provide the complete settings as a YAML file as part of our release.

The training schedule comprises 518,000 steps (batch size 1,024) on a filtered and enhanced version of Common Corpus (1,086,324,736,000 tokens).
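
These figures are mutually consistent with a 2,048-token sequence length; this is an inference from the numbers above rather than a setting stated on this card.

```python
# Consistency check: steps x batch size x sequence length = total tokens.
# The 2,048-token sequence length is inferred, not stated on this card.
steps, batch_size, seq_len = 518_000, 1_024, 2_048
assert steps * batch_size * seq_len == 1_086_324_736_000
```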

## Update
Pleias-3b-Preview is currently released as an early preview.

The model will undergo several more rounds of post-training to enhance its reasoning capabilities and fine-tunability, as well as in anticipation of a generalist instruct version.