metadata
language: ja
license: cc-by-nc-sa-4.0
tags:
  - roberta
  - medical
inference: false

alabnii/jmedroberta-base-manbyo-wordpiece-vocab50000

Model description

This is a Japanese RoBERTa base model pre-trained on academic articles in the medical sciences collected by the Japan Science and Technology Agency (JST).

This model is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).

Datasets used for pre-training

  • abstracts (train: 1.6GB (10M sentences), validation: 0.2GB (1.3M sentences))
  • abstracts & body texts (train: 0.2GB (1.4M sentences))

How to use

Before using the model, make sure that the Manbyo Dictionary has been downloaded to /usr/local/lib/mecab/dic/userdic.

# download Manbyo-Dictionary

mkdir -p /usr/local/lib/mecab/dic/userdic
wget https://sociocom.jp/~data/2018-manbyo/data/MANBYO_201907_Dic-utf8.dic && mv MANBYO_201907_Dic-utf8.dic /usr/local/lib/mecab/dic/userdic
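Since the model expects the dictionary at this location, it can help to check for the file before loading the tokenizer. The following is a minimal sketch using only the standard library; the helper name `manbyo_dic_available` is hypothetical, and the path simply mirrors the download commands above.

```python
from pathlib import Path

# Path taken from the download commands above.
MANBYO_DIC = Path("/usr/local/lib/mecab/dic/userdic/MANBYO_201907_Dic-utf8.dic")

def manbyo_dic_available(path: Path = MANBYO_DIC) -> bool:
    """Return True if the dictionary file exists and is non-empty."""
    return path.is_file() and path.stat().st_size > 0

if not manbyo_dic_available():
    print(f"Manbyo Dictionary not found at {MANBYO_DIC}; run the download commands first.")
```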

Input text must be converted to full-width characters (全角, zenkaku) in advance.
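One way to do this conversion is sketched below with the standard library only. It is an illustration under stated assumptions, not necessarily the preprocessing the authors used: the helper `to_full_width` is hypothetical, mapping the ASCII space to the ideographic space is an assumption, and the mask token is left in half-width because the examples in this card pass `[MASK]` that way. Half-width katakana is not handled here.

```python
# Printable ASCII (U+0021..U+007E) maps to the full-width block (U+FF01..U+FF5E)
# by adding a fixed offset of 0xFEE0.
HALF_TO_FULL = {code: code + 0xFEE0 for code in range(0x21, 0x7F)}
# Assumption: an ASCII space becomes an ideographic space (U+3000).
HALF_TO_FULL[0x20] = 0x3000

def to_full_width(text: str, mask_token: str = "[MASK]") -> str:
    """Convert half-width ASCII to full-width, leaving mask_token untouched."""
    parts = text.split(mask_token)
    return mask_token.join(part.translate(HALF_TO_FULL) for part in parts)

print(to_full_width("この患者はSLEと診断された。"))
print(to_full_width("HbA1cは[MASK]%であった。"))
```

Characters that are already full-width, including kana and kanji, pass through unchanged.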

You can use this model for masked language modeling as follows:

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained("alabnii/jmedroberta-base-manbyo-wordpiece-vocab50000")
model.eval()  # disable dropout for inference
tokenizer = AutoTokenizer.from_pretrained("alabnii/jmedroberta-base-manbyo-wordpiece-vocab50000")

texts = ['この患者は[MASK]と診断された。']
inputs = tokenizer(texts, return_tensors='pt')
with torch.no_grad():  # gradients are not needed for prediction
    outputs = model(**inputs)
# take the top prediction at each position, skipping the first and last (special-token) positions
tokenizer.convert_ids_to_tokens(outputs.logits[0][1:-1].argmax(dim=-1).tolist())
# ['この', '患者', 'は', 'SLE', 'と', '診断', 'さ', 'れ', 'た', '。']

Alternatively, you can use the fill-mask pipeline.

from transformers import pipeline

fill = pipeline("fill-mask", model="alabnii/jmedroberta-base-manbyo-wordpiece-vocab50000", top_k=10)
fill("この患者は[MASK]と診断された。")
#[{'score': 0.035826072096824646,
#  'token': 10840,
#  'token_str': 'SLE',
#  'sequence': 'この 患者 は SLE と 診断 さ れ た 。'},
# {'score': 0.020926717668771744,
#  'token': 10777,
#  'token_str': '統合失調症',
#  'sequence': 'この 患者 は 統合失調症 と 診断 さ れ た 。'},
# {'score': 0.02092057280242443,
#  'token': 8338,
#  'token_str': '糖尿病',
#  'sequence': 'この 患者 は 糖尿病 と 診断 さ れ た 。'},
# ...

You can fine-tune this model on downstream tasks.

See also sample Colab notebooks: https://colab.research.google.com/drive/1p2770dXs0lge1IkuSHYLO-G-KJ4gZtou?usp=sharing

Tokenization

MeCab (with IPAdic and the Manbyo Dictionary) was used for word segmentation during pre-training. Each word is then split into subword tokens by WordPiece.
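To illustrate the second stage, here is a minimal sketch of greedy longest-match-first WordPiece splitting applied to a single, already segmented word. The toy vocabulary and the function name `wordpiece` are hypothetical, and the `##` continuation prefix follows the usual BERT convention, which is assumed rather than confirmed for this model; the real vocabulary ships with the tokenizer.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into subword pieces by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"糖尿", "##病", "SLE"}  # hypothetical vocabulary for illustration
print(wordpiece("糖尿病", toy_vocab))
print(wordpiece("肺炎", toy_vocab))
```

A word that is itself in the vocabulary, such as a dictionary entry, comes back as a single token.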

Vocabulary

The vocabulary consists of 50,000 tokens, including words (from IPAdic and the Manbyo Dictionary) and subwords induced by WordPiece.

Training procedure

The following hyperparameters were used during pre-training:

  • learning_rate: 0.0001
  • train_batch_size: 32
  • eval_batch_size: 32
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 8
  • total_train_batch_size: 256
  • total_eval_batch_size: 256
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 20000
  • training_steps: 2000000
  • mixed_precision_training: Native AMP
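As a rough illustration of how these values fit together, the sketch below computes the learning rate at a given step. It assumes that `lr_scheduler_type: linear` means linear warmup from zero followed by linear decay to zero, as in the usual Transformers linear schedule; the function name `lr_at` is hypothetical.

```python
BASE_LR = 1e-4
WARMUP_STEPS = 20_000
TRAINING_STEPS = 2_000_000

def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step under linear warmup and decay."""
    if step < WARMUP_STEPS:
        # Warmup: rise linearly from 0 to BASE_LR.
        return BASE_LR * step / WARMUP_STEPS
    # Decay: fall linearly from BASE_LR to 0 at the final step.
    remaining = max(0, TRAINING_STEPS - step)
    return BASE_LR * remaining / (TRAINING_STEPS - WARMUP_STEPS)

# The total batch size is the per-device batch size times the device count.
assert 32 * 8 == 256

for step in (0, 10_000, 20_000, 1_010_000, 2_000_000):
    print(step, lr_at(step))
```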

Note: Why do we call our model RoBERTa, not BERT?

As the config file suggests, our model is based on Hugging Face's BertForMaskedLM class. However, we consider our model to be RoBERTa for the following reasons:

  • We trained only on sequences of the maximum length (512 tokens).
  • We removed the next sentence prediction (NSP) training objective.
  • We introduced dynamic masking (changing the masking pattern in each training iteration).
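The difference between static and dynamic masking can be sketched in a few lines. This is a simplified illustration, not the authors' training code: the 15% masking probability and the plain `[MASK]` replacement are assumptions, and the function name `dynamic_mask` is hypothetical.

```python
import random

def dynamic_mask(tokens, rng, mask_prob=0.15, mask_token="[MASK]"):
    """Return a copy of tokens with a freshly sampled masking pattern."""
    return [mask_token if rng.random() < mask_prob else tok for tok in tokens]

tokens = ["この", "患者", "は", "SLE", "と", "診断", "さ", "れ", "た", "。"]
rng = random.Random(42)
for epoch in range(3):
    # The pattern is resampled on every pass instead of being fixed once
    # during preprocessing (static masking).
    print(epoch, dynamic_mask(tokens, rng))
```

With static masking, the same masked copy of each sentence would be reused in every pass.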

Acknowledgements

This work was supported by the Japan Science and Technology Agency (JST) AIP Trilateral AI Research (Grant Number: JPMJCR20G9) and by the Joint Usage/Research Center for Interdisciplinary Large-scale Information Infrastructures (JHPCN) (Project ID: jh221004) in Japan.
This research also used "mdx: a platform for the data-driven future".