---
license: apache-2.0
language:
- en
- ja
programming_language:
- C
- C++
- C#
- Go
- Java
- JavaScript
- Lua
- PHP
- Python
- Ruby
- Rust
- Scala
- TypeScript
library_name: transformers
tags:
- deberta
- deberta-v3
# - token-classification
datasets:
- wikipedia
- EleutherAI/pile
- bigcode/the-stack
- mc4
metrics:
- accuracy
# mask_token: "[MASK]"
# widget:
# - text: "京都大学で機械言語処理を研究する。"
---
# Model Card for Japanese DeBERTa V3 base
## Model description
This is a Japanese DeBERTa V3 base model pre-trained on the LLM-jp corpus v1.0.
## How to use
You can use this model for masked language modeling as follows:
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained('ku-nlp/deberta-v3-base-japanese')
model = AutoModelForMaskedLM.from_pretrained('ku-nlp/deberta-v3-base-japanese')
sentences = [
"京都大学で自然言語処理を研究する。",
"I research NLP at Kyoto University.",
'int main() { printf("Hello, world!"); return 0; }',
]
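# Pad to a common length so sentences of different lengths can be batched into one tensor.
encodings = tokenizer(sentences, padding=True, return_tensors='pt')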
...
```
You can also fine-tune this model on downstream tasks.
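As a minimal sketch of the step elided in the snippet above, the helpers below mask one span of a sentence and return the model's top candidates for that position. The names `mask_first` and `predict_masked` are hypothetical, and the sketch assumes that the tokenizer defines a mask token and that the checkpoint's masked-LM head loads correctly. Only the first mask position is scored, so a span that needs several tokens is not reconstructed in full.

```python
MODEL_NAME = "ku-nlp/deberta-v3-base-japanese"


def mask_first(text: str, target: str, mask_token: str) -> str:
    """Replace the first occurrence of `target` in `text` with `mask_token`."""
    if target not in text:
        raise ValueError(f"{target!r} does not occur in the input text")
    return text.replace(target, mask_token, 1)


def predict_masked(text: str, target: str, top_k: int = 5,
                   model_name: str = MODEL_NAME) -> list[str]:
    """Mask `target` in `text` and return the top-k candidates for the mask."""
    # Imported lazily so the helpers can be defined without loading the model.
    import torch
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name)
    model.eval()

    masked = mask_first(text, target, tokenizer.mask_token)
    encoding = tokenizer(masked, return_tensors="pt")
    with torch.no_grad():
        logits = model(**encoding).logits  # shape: (1, seq_len, vocab_size)

    # Locate the mask token in the single input sequence.
    is_mask = encoding["input_ids"][0] == tokenizer.mask_token_id
    positions = is_mask.nonzero(as_tuple=True)[0]
    if len(positions) == 0:
        raise ValueError("the mask token was not found after tokenization")

    top_ids = logits[0, positions[0]].topk(top_k).indices.tolist()
    return [tokenizer.decode([token_id]) for token_id in top_ids]
```

Calling `predict_masked("京都大学で自然言語処理を研究する。", "自然言語処理")` downloads the model on first use and returns a list of candidate strings.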
## Tokenization
The tokenizer of this model is based on the Unigram byte-fallback model of [huggingface/tokenizers](https://github.com/huggingface/tokenizers).
The vocabulary entries were converted from [`llm-jp-tokenizer v2.2 (100k)`](https://github.com/llm-jp/llm-jp-tokenizer/releases/tag/v2.2).
Please refer to the [README.md](https://github.com/llm-jp/llm-jp-tokenizer) of `llm-jp/llm-jp-tokenizer` for details on the vocabulary construction procedure.
Note that, unlike [ku-nlp/deberta-v2-base-japanese](https://huggingface.co/ku-nlp/deberta-v2-base-japanese), pre-segmentation by a morphological analyzer (e.g., Juman++) is no longer required for this model.
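To illustrate what byte fallback means, the sketch below shows how a character that is missing from the vocabulary can still be represented as a sequence of UTF-8 byte pieces instead of an unknown token. The `<0xXX>` piece format follows the SentencePiece convention and is an assumption here, not something read from this model's vocabulary file. `byte_fallback_pieces` and `pieces_to_text` are hypothetical helpers, not library functions.

```python
def byte_fallback_pieces(text: str) -> list[str]:
    """Represent `text` as one `<0xXX>` piece per UTF-8 byte."""
    return [f"<0x{byte:02X}>" for byte in text.encode("utf-8")]


def pieces_to_text(pieces: list[str]) -> str:
    """Invert `byte_fallback_pieces` by decoding the bytes back to text."""
    raw = bytes(int(piece[3:-1], 16) for piece in pieces)
    return raw.decode("utf-8")


print(byte_fallback_pieces("京"))  # ['<0xE4>', '<0xBA>', '<0xAC>']
```

In the real tokenizer this path is taken only for text that no vocabulary entry covers, so ordinary Japanese, English, and code are normally split into regular Unigram pieces.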
## Training data
We used the [LLM-jp corpus](https://github.com/llm-jp/llm-jp-corpus) v1.0.1 for pre-training.
The corpus consists of the following corpora:
- Japanese
  - Wikipedia (1B tokens)
  - mC4 (129B tokens)
- English
  - Wikipedia (4B tokens)
  - The Pile (126B tokens)
- Code
  - The Stack (10B tokens)
We shuffled the corpora, which contain 270B tokens in total, and trained the model for 2 epochs.
Thus, the total number of tokens fed to the model was 540B.
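The totals above can be checked with a few lines of arithmetic. The sketch below only re-adds the per-corpus token counts listed in this section (in billions); the variable names are illustrative.

```python
# Token counts in billions, copied from the list in this section.
CORPUS_TOKENS_B = {
    "Japanese": {"Wikipedia": 1, "mC4": 129},
    "English": {"Wikipedia": 4, "The Pile": 126},
    "Code": {"The Stack": 10},
}
EPOCHS = 2

per_language = {
    language: sum(parts.values()) for language, parts in CORPUS_TOKENS_B.items()
}
total_per_epoch = sum(per_language.values())
total_fed = total_per_epoch * EPOCHS

print(per_language)                # {'Japanese': 130, 'English': 130, 'Code': 10}
print(total_per_epoch, total_fed)  # 270 540
```

Japanese and English each contribute 130B tokens, so the two languages are balanced and code makes up the remaining 10B.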
## Training procedure
We slightly modified [the official implementation of DeBERTa V3](https://github.com/microsoft/DeBERTa) and followed the official training procedure.
The modified code is available at [nobu-g/DeBERTa](https://github.com/nobu-g/DeBERTa).
The following hyperparameters were used during pre-training:
- learning_rate: 1e-4
- per_device_train_batch_size: 800
- num_devices: 8
- gradient_accumulation_steps: 3
- total_train_batch_size: 2400
- max_seq_length: 512
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-06
- lr_scheduler_type: linear schedule with warmup
- training_steps: 475,000
- warmup_steps: 10,000
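For readers unfamiliar with the schedule named above, here is a minimal sketch of a linear schedule with warmup using the listed values. It assumes the rate rises linearly from zero to the peak over the warmup steps and then decays linearly to zero at the final step, which is the shape of `get_linear_schedule_with_warmup` in `transformers`. The schedule in the modified DeBERTa code may differ in detail, and `learning_rate` is a hypothetical helper.

```python
PEAK_LR = 1e-4
WARMUP_STEPS = 10_000
TRAINING_STEPS = 475_000


def learning_rate(step: int) -> float:
    """Learning rate at `step` under linear warmup followed by linear decay."""
    if step < WARMUP_STEPS:
        # Warmup: scale linearly from 0 up to the peak rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Decay: scale linearly from the peak rate down to 0 (assumed end value).
    remaining = max(0, TRAINING_STEPS - step)
    return PEAK_LR * remaining / (TRAINING_STEPS - WARMUP_STEPS)


for step in (0, 5_000, 10_000, 242_500, 475_000):
    print(step, learning_rate(step))
```

Under these assumptions the rate peaks at step 10,000 and is half the peak at step 242,500, the midpoint of the decay phase.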
## Fine-tuning on NLU tasks
We fine-tuned the following models and evaluated them on the dev set of JGLUE.
We tuned the learning rate and training epochs for each model and task following [the JGLUE paper](https://www.jstage.jst.go.jp/article/jnlp/30/1/30_63/_pdf/-char/ja).
| Model | MARC-ja/acc | JCoLA/acc | JSTS/pearson | JSTS/spearman | JNLI/acc | JSQuAD/EM | JSQuAD/F1 | JComQA/acc |
|-------------------------------|-------------|-----------|--------------|---------------|----------|-----------|-----------|------------|
| Waseda RoBERTa base | 0.965 | 0.867 | 0.913 | 0.876 | 0.905 | 0.853 | 0.916 | 0.853 |
| Waseda RoBERTa large (seq512) | 0.969 | 0.849 | 0.925 | 0.890 | 0.928 | 0.910 | 0.955 | 0.900 |
| LUKE Japanese base* | 0.965 | - | 0.916 | 0.877 | 0.912 | - | - | 0.842 |
| LUKE Japanese large* | 0.965 | - | 0.932 | 0.902 | 0.927 | - | - | 0.893 |
| DeBERTaV2 base | 0.970 | 0.879 | 0.922 | 0.886 | 0.922 | 0.899 | 0.951 | 0.873 |
| DeBERTaV2 large | 0.968 | 0.882 | 0.925 | 0.892 | 0.924 | 0.912 | 0.959 | 0.890 |
| DeBERTaV3 base | 0.960 | 0.878 | 0.927 | 0.891 | 0.927 | 0.896 | 0.947 | 0.875 |
*The scores of LUKE are from [the official repository](https://github.com/studio-ousia/luke).
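
The table reports three kinds of metrics: accuracy, Pearson correlation, and Spearman correlation. As a rough, dependency-free sketch of how they are defined (not the evaluation script used for the scores above), the helpers below compute each one from lists of predictions and gold labels. All function names are illustrative.

```python
import math


def accuracy(predictions, labels):
    """Fraction of predictions that match the gold labels exactly."""
    correct = sum(p == g for p, g in zip(predictions, labels))
    return correct / len(labels)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def average_ranks(values):
    """1-based ranks of `values`, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

Pearson measures linear agreement between predicted and gold similarity scores, while Spearman only compares their orderings, which is why JSTS reports both.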
## License
[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
## Author
[Nobuhiro Ueda](https://huggingface.co/nobu-g) (ueda **at** nlp.ist.i.kyoto-u.ac.jp)
## Acknowledgments
This work was supported by the Joint Usage/Research Center for Interdisciplinary Large-scale Information Infrastructures (JHPCN) through General Collaboration Project no. jh231006, "Developing a Platform for Constructing and Sharing of Large-Scale Japanese Language Models".
We trained the models on mdx: a platform for the data-driven future.