initial commit

Files changed (6)

.gitattributes +34 -0
README.md +114 -0
config.json +24 -0
pytorch_model.bin +3 -0
tokenizer_config.json +10 -0
vocab.txt +0 -0

.gitattributes ADDED

	@@ -0,0 +1,34 @@

+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text

README.md ADDED

	@@ -0,0 +1,114 @@

+---
+language: ja
+license: cc-by-nc-sa-4.0
+tags:
+- roberta
+- medical
+inference: false
+---
+# alabnii/jmedroberta-base-manbyo-wordpiece-vocab50000
+## Model description
+This is a Japanese RoBERTa base model pre-trained on academic articles in the medical sciences collected by the Japan Science and Technology Agency (JST).
+This model is released under the [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-nc-sa/4.0/deed) (CC BY-NC-SA 4.0).
+## Datasets used for pre-training
+- abstracts (train: 1.6GB (10M sentences), validation: 0.2GB (1.3M sentences))
+- abstracts & body texts (train: 0.2GB (1.4M sentences))
+## How to use
+**Before using the model, make sure that the [Manbyo Dictionary](https://sociocom.naist.jp/manbyou-dic/) has been downloaded to `/usr/local/lib/mecab/dic/userdic`.**
+```bash
+# download Manbyo-Dictionary
+mkdir -p /usr/local/lib/mecab/dic/userdic
+wget https://sociocom.jp/~data/2018-manbyo/data/MANBYO_201907_Dic-utf8.dic && mv MANBYO_201907_Dic-utf8.dic /usr/local/lib/mecab/dic/userdic
+```
+**Input text must be converted to full-width characters (全角) in advance.**
+You can use this model for masked language modeling as follows:
+```python
+from transformers import AutoModelForMaskedLM, AutoTokenizer
+model = AutoModelForMaskedLM.from_pretrained("alabnii/jmedroberta-base-manbyo-wordpiece-vocab50000")
+model.eval()  # inference mode: disables dropout
+tokenizer = AutoTokenizer.from_pretrained("alabnii/jmedroberta-base-manbyo-wordpiece-vocab50000")
+texts = ['この患者は[MASK]と診断された。']
+inputs = tokenizer(texts, return_tensors='pt')  # adds [CLS] and [SEP] around the sentence
+outputs = model(**inputs)
+tokenizer.convert_ids_to_tokens(outputs.logits[0][1:-1].argmax(dim=-1))  # top token per position, excluding [CLS] and [SEP]
+# ['この', '患者', 'は', 'ＳＬＥ', 'と', '診断', 'さ', 'れ', 'た', '。']
+```
+Alternatively, you can use the [fill-mask pipeline](https://huggingface.co/tasks/fill-mask).
+```python
+from transformers import pipeline
+fill = pipeline("fill-mask", model="alabnii/jmedroberta-base-manbyo-wordpiece-vocab50000", top_k=10)
+fill("この患者は[MASK]と診断された。")
+#[{'score': 0.035826072096824646,
+#  'token': 10840,
+#  'token_str': 'ＳＬＥ',
+#  'sequence': 'この 患者 は ＳＬＥ と 診断 さ れ た 。'},
+# {'score': 0.020926717668771744,
+#  'token': 10777,
+#  'token_str': '統合失調症',
+#  'sequence': 'この 患者 は 統合失調症 と 診断 さ れ た 。'},
+# {'score': 0.02092057280242443,
+#  'token': 8338,
+#  'token_str': '糖尿病',
+#  'sequence': 'この 患者 は 糖尿病 と 診断 さ れ た 。'},
+# ...
+```
+You can fine-tune this model on downstream tasks.
+**See also sample Colab notebooks:** https://colab.research.google.com/drive/1p2770dXs0lge1IkuSHYLO-G-KJ4gZtou?usp=sharing
+## Tokenization
+MeCab (with IPAdic and the [Manbyo Dictionary](https://sociocom.naist.jp/manbyou-dic/)) was used to segment text into words for pre-training. Each word is then split into subword tokens by [WordPiece](https://huggingface.co/course/chapter6/6).
+## Vocabulary
+The vocabulary consists of 50,000 tokens, including words (from IPAdic and the [Manbyo Dictionary](https://sociocom.naist.jp/manbyou-dic/)) and subwords induced by [WordPiece](https://huggingface.co/course/chapter6/6).
+## Training procedure
+The following hyperparameters were used during pre-training:
+- learning_rate: 0.0001
+- train_batch_size: 32
+- eval_batch_size: 32
+- seed: 42
+- distributed_type: multi-GPU
+- num_devices: 8
+- total_train_batch_size: 256
+- total_eval_batch_size: 256
+- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+- lr_scheduler_type: linear
+- lr_scheduler_warmup_steps: 20000
+- training_steps: 2000000
+- mixed_precision_training: Native AMP
+## Note: Why do we call our model RoBERTa, not BERT?
+As the config file suggests, our model is based on Hugging Face's `BertForMaskedLM` class. However, we consider our model to be **RoBERTa** for the following reasons:
+- We trained only on sequences of the maximum length (512 tokens).
+- We removed the next sentence prediction (NSP) training objective.
+- We introduced dynamic masking (changing the masking pattern in each training iteration).
+## Acknowledgements
+This work was supported by the Japan Science and Technology Agency (JST) AIP Trilateral AI Research (Grant Number: JPMJCR20G9) and the Joint Usage/Research Center for Interdisciplinary Large-scale Information Infrastructures (JHPCN) (Project ID: jh221004), in Japan.
+In this research work, we used the "[mdx: a platform for the data-driven future](https://mdx.jp/)".
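
The README requires input text to be converted to full-width characters before it reaches the tokenizer. The following is a minimal sketch of that step, not the authors' preprocessing. The helper name `to_fullwidth`, the mapping of the ASCII space to the ideographic space, and the special handling of `[MASK]` are all assumptions. It covers printable ASCII only; half-width katakana is not handled.

```python
# Hypothetical helper (not from the README): half-width ASCII -> full-width.
# Printable ASCII U+0021..U+007E maps to U+FF01..U+FF5E by a fixed offset.
_OFFSET = 0xFEE0
_TABLE = {code: code + _OFFSET for code in range(0x21, 0x7F)}
_TABLE[0x20] = 0x3000  # assumption: ASCII space becomes the ideographic space


def to_fullwidth(text: str, mask_token: str = "[MASK]") -> str:
    """Convert half-width ASCII to full-width, leaving mask_token untouched."""
    # Split on the mask token so its ASCII brackets are not converted.
    parts = text.split(mask_token)
    return mask_token.join(part.translate(_TABLE) for part in parts)


print(to_fullwidth("この患者はSLEと診断された。"))
print(to_fullwidth("この患者は[MASK]と診断された。"))
```

Keeping the mask token out of the conversion matters because a full-width `［ＭＡＳＫ］` would no longer match the tokenizer's special token.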

config.json ADDED

	@@ -0,0 +1,24 @@

+{
+  "architectures": [
+    "BertForMaskedLM"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "classifier_dropout": null,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_type": "bert",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pad_token_id": 0,
+  "position_embedding_type": "absolute",
+  "torch_dtype": "float32",
+  "transformers_version": "4.16.1",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 50000
+}
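
As a sanity check on the configuration above, the parameter count can be estimated from the config values alone. This sketch assumes the standard BERT layout, no pooler, and an MLM decoder whose weight is tied to the word embeddings; none of these were verified against the actual checkpoint, and the function name is hypothetical.

```python
# Rough parameter count from the config.json values shown above.
# Assumptions: standard BERT layout, no pooler, decoder weight tied to the
# word embeddings (so only the decoder bias is counted in the MLM head).
config = {
    "hidden_size": 768,
    "intermediate_size": 3072,
    "max_position_embeddings": 512,
    "num_hidden_layers": 12,
    "type_vocab_size": 2,
    "vocab_size": 50000,
}


def estimate_bert_mlm_params(cfg: dict) -> int:
    h = cfg["hidden_size"]
    ffn = cfg["intermediate_size"]
    layer_norm = 2 * h  # weight + bias
    embeddings = (
        cfg["vocab_size"] + cfg["max_position_embeddings"] + cfg["type_vocab_size"]
    ) * h + layer_norm
    attention = 4 * (h * h + h) + layer_norm  # Q, K, V and output projections
    feed_forward = (h * ffn + ffn) + (ffn * h + h) + layer_norm
    encoder = cfg["num_hidden_layers"] * (attention + feed_forward)
    mlm_head = (h * h + h) + layer_norm + cfg["vocab_size"]  # transform + decoder bias
    return embeddings + encoder + mlm_head


n_params = estimate_bert_mlm_params(config)
print(f"parameters: {n_params:,}")
print(f"float32 bytes: {n_params * 4:,}")
```

At 4 bytes per float32 parameter, the estimate comes out close to the 498061650 bytes recorded for `pytorch_model.bin`; the small remainder is presumably serialization overhead and non-parameter buffers.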

pytorch_model.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f46c45d39e0536ea37d6514f51035d2c05150465c61c5c88fd7348f282ee368c
+size 498061650
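
The three lines above are a Git LFS pointer, not the weights themselves. A downloaded `pytorch_model.bin` can be checked against the pointer's size and SHA-256 digest. The sketch below uses only the standard library; the function names are hypothetical.

```python
import hashlib
import os

POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:f46c45d39e0536ea37d6514f51035d2c05150465c61c5c88fd7348f282ee368c
size 498061650
"""


def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {"algo": algo, "digest": digest, "size": int(fields["size"])}


def matches_pointer(path: str, pointer: dict) -> bool:
    """Return True if the file at path has the pointer's size and digest."""
    if os.path.getsize(path) != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algo"])
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large checkpoints are not loaded at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER)
print(pointer["algo"], pointer["size"])
```

Calling `matches_pointer("pytorch_model.bin", pointer)` on a local copy would then report whether the download is intact.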

tokenizer_config.json ADDED

	@@ -0,0 +1,10 @@

+{
+    "tokenizer_class": "BertJapaneseTokenizer",
+    "word_tokenizer_type": "mecab",
+    "subword_tokenizer_type": "wordpiece",
+    "mecab_kwargs": {
+        "mecab_dic": "ipadic",
+        "mecab_option": "-u /usr/local/lib/mecab/dic/userdic/MANBYO_201907_Dic-utf8.dic",
+        "normalize_text": false
+    }
+}
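
Because `mecab_option` hard-codes the user dictionary path, the tokenizer depends on that file being present on the local machine. A small pre-flight check, sketched here with a hypothetical helper `userdic_path`, extracts the path after `-u` so its presence can be confirmed before the tokenizer is loaded.

```python
import os
import shlex

tokenizer_config = {
    "tokenizer_class": "BertJapaneseTokenizer",
    "word_tokenizer_type": "mecab",
    "subword_tokenizer_type": "wordpiece",
    "mecab_kwargs": {
        "mecab_dic": "ipadic",
        "mecab_option": "-u /usr/local/lib/mecab/dic/userdic/MANBYO_201907_Dic-utf8.dic",
        "normalize_text": False,
    },
}


def userdic_path(config: dict):
    """Return the path that follows -u in mecab_option, or None if absent."""
    option = config.get("mecab_kwargs", {}).get("mecab_option") or ""
    args = shlex.split(option)
    for i, arg in enumerate(args):
        if arg == "-u" and i + 1 < len(args):
            return args[i + 1]
    return None


path = userdic_path(tokenizer_config)
if path is not None and not os.path.exists(path):
    print(f"user dictionary not found: {path}")
```

This only handles the space-separated `-u PATH` form used in this config; other ways of passing a user dictionary to MeCab are not covered.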

vocab.txt ADDED

The diff for this file is too large to render.