wissamantoun committed on
Commit
571e1ba
verified
1 Parent(s): 9d3ac27

Create README.md

Files changed (1)
  1. README.md +58 -0
README.md ADDED
@@ -0,0 +1,58 @@
+ ---
+ license: mit
+ language: fr
+ library_name: transformers
+ pipeline_tag: feature-extraction
+ datasets:
+ - uonlp/CulturaX
+ - oscar
+ - almanach/HALvest
+ - wikimedia/wikipedia
+ tags:
+ - deberta-v2
+ - deberta-v3
+ - debertav2
+ - debertav3
+ - camembert
+ ---
+ # CamemBERT(a)-v2: A Smarter French Language Model Aged to Perfection
+
+ [CamemBERTv2](https://arxiv.org/abs/2411.08868) is a French language model pretrained on a corpus of 275B tokens of French text. It is the second version of the CamemBERT model, which is based on the RoBERTa architecture. CamemBERTv2 is trained with the Masked Language Modeling (MLM) objective at a 40% mask rate for 3 epochs on 32 H100 GPUs. The training dataset combines French [OSCAR](https://oscar-project.org/) dumps from the [CulturaX Project](https://huggingface.co/datasets/uonlp/CulturaX), French scientific documents from [HALvest](https://huggingface.co/datasets/almanach/HALvest), and the French Wikipedia.
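To make the 40% mask rate concrete, here is a minimal sketch of MLM input corruption in plain Python. It is an illustration under simplifying assumptions, not the training code: every selected position is replaced by the mask token (the actual replacement scheme may differ), and `mask_tokens`, `MASK_ID`, and `IGNORE_LABEL` are hypothetical names chosen for this example.

```python
import random

MASK_ID = 4          # hypothetical id of the mask token
IGNORE_LABEL = -100  # common convention for positions excluded from the loss


def mask_tokens(token_ids, mask_rate=0.4, seed=0):
    """Return (inputs, labels) with about `mask_rate` of the positions masked."""
    rng = random.Random(seed)
    n_masked = round(len(token_ids) * mask_rate)
    masked_positions = set(rng.sample(range(len(token_ids)), n_masked))

    inputs, labels = [], []
    for position, token_id in enumerate(token_ids):
        if position in masked_positions:
            inputs.append(MASK_ID)       # the model sees the mask token here
            labels.append(token_id)      # and must predict the original token
        else:
            inputs.append(token_id)      # unmasked tokens pass through unchanged
            labels.append(IGNORE_LABEL)  # and do not contribute to the loss
    return inputs, labels


tokens = list(range(100, 110))  # ten dummy token ids
inputs, labels = mask_tokens(tokens)
print(inputs.count(MASK_ID))  # 4 of the 10 positions are masked
```

At this rate the model has to reconstruct four tokens out of every ten from the remaining context.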
+
+ The model is a drop-in replacement for the original CamemBERT model. Note that the new tokenizer differs from the original CamemBERT tokenizer, so you need to load it as a fast tokenizer. It works with `CamembertTokenizerFast` from the `transformers` library, even though the original `CamembertTokenizer` was SentencePiece-based.
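As a hedged sketch of what this means in practice, the snippet below loads the tokenizer through `AutoTokenizer` and checks that a fast tokenizer came back. The repository id is a placeholder, since it is not named here, and `require_fast` and `load_tokenizer` are hypothetical helpers; `use_fast` and `is_fast` are existing `transformers` names.

```python
def require_fast(tokenizer):
    """Raise if `tokenizer` is not a fast (tokenizers-backed) tokenizer."""
    if not getattr(tokenizer, "is_fast", False):
        raise ValueError("this model needs a fast tokenizer (tokenizer.json)")
    return tokenizer


def load_tokenizer(repo_id):
    """Load the tokenizer for `repo_id` and make sure it is a fast one."""
    # Imported lazily so require_fast can be used without transformers installed.
    from transformers import AutoTokenizer

    return require_fast(AutoTokenizer.from_pretrained(repo_id, use_fast=True))


# Usage (placeholder id, replace with the actual repository name):
# tokenizer = load_tokenizer("<namespace>/<model-name>")
# print(tokenizer.tokenize("Le camembert est délicieux !"))
```

Failing early with a clear error is a design choice for this sketch: it surfaces a slow-tokenizer fallback at load time rather than later in a pipeline.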
+
+ # Model Checkpoints
+
+ This repository contains all intermediate model checkpoints, each with corresponding TensorFlow (TF) and PyTorch (PT) versions, structured as follows:
+
+ ```
+ ├── checkpoints/
+ │   ├── iter_ckpt_rank_XX/ # Contains all iterator checkpoints from a specific rank
+ │   ├── summaries/ # TensorBoard logs
+ │   ├── ckpt-YYYYY.data-00000-of-00001
+ │   ├── ckpt-YYYYY.index
+ ├── post/
+ │   ├── ckpt-YYYYY/
+ │   │   ├── pt/
+ │   │   │   ├── config.json
+ │   │   │   ├── pytorch_model.bin
+ │   │   │   ├── special_tokens_map.json
+ │   │   │   ├── tokenizer.json
+ │   │   │   ├── tokenizer_config.json
+ │   │   ├── tf/
+ │   │   │   ├── ...
+ ```
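Given this layout, one way to load an intermediate checkpoint is to point `from_pretrained` at the matching `post/ckpt-YYYYY/pt` folder through its `subfolder` argument. The sketch below is an assumption about usage rather than documented instructions: `checkpoint_subfolder` and `load_checkpoint` are hypothetical helpers, the repository id is a placeholder, and the step string must match a checkpoint that actually exists in the repository.

```python
def checkpoint_subfolder(step, framework="pt"):
    """Build the path of a converted checkpoint inside the repository.

    `step` is the YYYYY part of the checkpoint name, passed as a string so
    that its exact formatting is preserved.
    """
    if framework not in ("pt", "tf"):
        raise ValueError("framework must be 'pt' or 'tf'")
    return f"post/ckpt-{step}/{framework}"


def load_checkpoint(repo_id, step):
    """Load the PyTorch weights and tokenizer of one intermediate checkpoint."""
    # Imported lazily so the path helper works without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    subfolder = checkpoint_subfolder(step, "pt")
    tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder)
    model = AutoModel.from_pretrained(repo_id, subfolder=subfolder)
    return tokenizer, model


print(checkpoint_subfolder("YYYYY"))  # post/ckpt-YYYYY/pt

# Usage (placeholder values, replace with a real repository id and step):
# tokenizer, model = load_checkpoint("<namespace>/<model-name>", "YYYYY")
```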
+
+ ## Citation
+
+ ```bibtex
+ @misc{antoun2024camembert20smarterfrench,
+   title={CamemBERT 2.0: A Smarter French Language Model Aged to Perfection},
+   author={Wissam Antoun and Francis Kulumba and Rian Touchent and Éric de la Clergerie and Benoît Sagot and Djamé Seddah},
+   year={2024},
+   eprint={2411.08868},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL},
+   url={https://arxiv.org/abs/2411.08868},
+ }
+ ```