|
---
license: apache-2.0
datasets:
- PleIAs/common_corpus
language:
- en
- fr
- es
- de
- it
- la
- nl
- pl
---
|
**Pleias-3b-Preview** is an early preview of a 3 billion parameter base model trained by Pleias on Common Corpus.
|
|
|
Like all base and specialized models from Pleias, Pleias-3b-Preview has been trained exclusively on open data, either out of copyright (public domain) or under a permissive license.
|
|
|
## Description |
|
Pleias-3b-Preview is a transformer base model, entirely pretrained from scratch, using an architecture similar to Llama/GPT-NeoX for easier deployment and inference.
|
|
|
It includes the following features, which apply to any responsibly trained variant:
|
* Trained exclusively on open data under a permissive license and in compliance with the European AI Act. By design, all Pleias models are unable to output copyrighted content.
|
* Extensive multilingual support for the main European languages.
|
* A new tokenizer designed for enhanced document processing tasks and better multilingual support. |
|
* Extremely low level of toxicity and problematic content. |
|
|
|
Fully supported languages include English, French, Spanish, German, Italian, Dutch, Latin and Portuguese. |
|
|
|
## Recommended use |
|
As a base model, Pleias-3b-Preview can only continue a prompt; it has not been instruction-tuned.
|
|
|
Text generation currently supports a range of creative writing tasks in multiple European languages. For more consistent results, we recommend using a low or zero temperature together with a slight repetition penalty (1.1-1.2).
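
For illustration, here is a minimal continuation sketch with the `transformers` library. The Hub id `PleIAs/Pleias-3b-Preview` and the prompt are assumptions for the example; zero temperature is expressed as greedy decoding (`do_sample=False`).

```python
# Minimal continuation example with Hugging Face transformers.
# The checkpoint id and prompt below are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "PleIAs/Pleias-3b-Preview"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "La ville de Paris est"
inputs = tokenizer(prompt, return_tensors="pt")

# Zero temperature = greedy decoding; repetition_penalty follows the
# 1.1-1.2 range recommended above.
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=False,
    repetition_penalty=1.1,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```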
|
|
|
Pleias-3b-Preview has been successfully adapted through continued pretraining and full fine-tuning for document processing tasks such as RAG, translation, and OCR correction. Given the small size of the model, we do not recommend fine-tuning methods based on LoRA.
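
As a sketch of what full fine-tuning (all weights updated, no LoRA adapters) could look like with the `transformers` Trainer, under stated assumptions: the dataset, hyperparameters, and sequence length below are placeholders, not the recipe used by Pleias.

```python
# Sketch of full fine-tuning: every weight is updated, no LoRA adapters.
# Assumptions: data file, hyperparameters, and max_length are placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_id = "PleIAs/Pleias-3b-Preview"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # needed for padding
model = AutoModelForCausalLM.from_pretrained(model_id)

# Placeholder corpus; replace with your task data (e.g. OCR pairs).
raw = load_dataset("text", data_files={"train": "train.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

train_ds = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="pleias-3b-finetuned",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=16,
        learning_rate=2e-5,
        num_train_epochs=1,
        bf16=True,
        logging_steps=50,
    ),
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```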
|
|
|
## Training |
|
Pleias-3b-Preview was fully pretrained at Jean Zay on 64 H100 GPUs for 46 hours with Nanotron, the pretraining library from Hugging Face. We provide the complete settings as a YAML file as part of our release.
|
|
|
The training schedule comprises 518,000 steps with a batch size of 1,024 on a filtered and enhanced version of Common Corpus, for a total of 1,086,324,736,000 tokens.
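
The headline token count is consistent with a 2,048-token sequence length; this length is inferred from the arithmetic rather than stated explicitly above:

```python
# Token-budget check: steps x batch size x sequence length.
# The 2,048-token sequence length is an inference, not stated in the card.
steps = 518_000
batch_size = 1_024
seq_len = 2_048
assert steps * batch_size * seq_len == 1_086_324_736_000
```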
|
|
|
## Update |
|
Pleias-3b-Preview is currently released as an early preview. |
|
|
|
The model will undergo several more rounds of post-training to enhance its reasoning capacities and fine-tunability, in anticipation of a generalist instruct version.
|
|