metadata

license: apache-2.0
datasets:
  - PleIAs/common_corpus
language:
  - en
  - fr
  - es
  - de
  - it
  - la
  - nl
  - pl

Pleias-3b-Preview is an early preview of a 3-billion-parameter base model trained by Pleias on Common Corpus.

Like all the base and specialized models from Pleias, Pleias-3b-Preview has only been trained on open data that is either out of copyright (public domain) or under a permissive license.

Description

Pleias-3b-Preview is a transformer base model, entirely pretrained from scratch, using an architecture similar to Llama/GPT-NeoX for easier deployment and inference.

It includes the following features, which would apply to any responsibly trained variant:

Only trained on open data under a permissive license and in compliance with the European AI Act. By design, all Pleias models are unable to output copyrighted content.
Extensive multilingual support for the main European languages.
A new tokenizer designed for enhanced document processing tasks and better multilingual support.
Extremely low level of toxicity and problematic content.

Fully supported languages include English, French, Spanish, German, Italian, Dutch, Latin and Portuguese.

Recommended use

As a base model, Pleias-3b-Preview is only able to run continuation prompts.

Text generation currently supports a range of creative writing tasks in multiple European languages. For more consistent results, we recommend a low or zero temperature with a slight repetition penalty (1.1-1.2).
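The decoding advice above can be sketched with the Hugging Face `transformers` library. This is a minimal illustration rather than an official recipe: the helper names `generation_settings` and `continue_text`, the default of 200 new tokens, and the choice of greedy decoding for a temperature of zero are assumptions made for the example.

```python
def generation_settings(temperature: float = 0.0,
                        repetition_penalty: float = 1.1,
                        max_new_tokens: int = 200) -> dict:
    """Build keyword arguments for `model.generate` (hypothetical helper).

    A temperature of 0 is mapped to greedy decoding, because sampling
    with a temperature of exactly 0 is not meaningful.
    """
    settings = {
        "max_new_tokens": max_new_tokens,          # assumed default, tune as needed
        "repetition_penalty": repetition_penalty,  # 1.1-1.2 is the recommended range
    }
    if temperature > 0:
        settings["do_sample"] = True
        settings["temperature"] = temperature
    else:
        settings["do_sample"] = False  # greedy decoding
    return settings


def continue_text(prompt: str,
                  model_id: str = "PleIAs/Pleias-3b-Preview",
                  **kwargs) -> str:
    """Continue `prompt` with the base model (hypothetical helper).

    The import is local so that the settings helper can be used
    without `transformers` installed. Calling this downloads the weights.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, **generation_settings(**kwargs))
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example call (not executed here):
# print(continue_text("Il était une fois", repetition_penalty=1.2))
```

Because this is a base model, the prompt should read as the beginning of the text to be continued, not as an instruction.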

Pleias-3b-Preview has been successfully adapted through continued pretraining and full fine-tuning to document processing tasks such as RAG, translation, or OCR correction. Given the small size of the model, we do not recommend fine-tuning methods based on LoRA.

Training

Pleias-3b-Preview was fully pretrained at Jean Zay on 64 H100s for 46 hours with Nanotron, the pretraining library from Hugging Face. We provide the complete settings as a YAML file as part of our release.

The training schedule comprises 518,000 steps (batch size 1,024) on a filtered and enhanced version of Common Corpus (1,086,324,736,000 tokens).
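The three figures above can be cross-checked with a few lines of arithmetic. The sketch below assumes that the batch size counts sequences and that every sequence has the same length. The resulting tokens-per-sequence value is derived from the numbers above, not stated in this card.

```python
# Figures quoted in the training schedule
steps = 518_000
batch_size = 1_024                    # assumed to be counted in sequences
total_tokens = 1_086_324_736_000

# Total number of sequences seen during training
sequences_seen = steps * batch_size

# Implied tokens per sequence (derived, under the assumptions above)
tokens_per_sequence, remainder = divmod(total_tokens, sequences_seen)

# Implied tokens consumed per optimizer step
tokens_per_step = batch_size * tokens_per_sequence

print(f"sequences seen:      {sequences_seen:,}")       # 530,432,000
print(f"tokens per sequence: {tokens_per_sequence:,}")  # 2,048
print(f"remainder:           {remainder}")              # 0
print(f"tokens per step:     {tokens_per_step:,}")      # 2,097,152
```

The division is exact, so the quoted token count is consistent with 2,048 tokens per sequence under these assumptions.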

Update

Pleias-3b-Preview is currently released as an early preview.

The model will undergo several more rounds of post-training to enhance its reasoning capacities and fine-tunability, and to prepare for a generalist instruct version.