---
language: id
license: mit
datasets:
- oscar
- wikipedia
- id_newspapers_2018
widget:
- text: Saya [MASK] makan nasi goreng.
- text: Kucing itu sedang bermain dengan [MASK].
pipeline_tag: fill-mask
---
# Indonesian small BigBird model
## Source Code
Source code to create this model is available at [https://github.com/ilos-vigil/bigbird-small-indonesian](https://github.com/ilos-vigil/bigbird-small-indonesian).
## Downstream Task
* NLI/ZSC: [ilos-vigil/bigbird-small-indonesian-nli](https://huggingface.co/ilos-vigil/bigbird-small-indonesian-nli)
## Model Description
This **cased** model was pretrained with a masked language modeling (MLM) objective. It has ~30M parameters and was pretrained for 8 epochs (51,474 steps), reaching an evaluation loss of 2.078 (perplexity 7.988). The architecture is shown in the configuration snippet below. The tokenizer was trained on the whole dataset with a vocabulary size of 30K.
```py
from transformers import BigBirdConfig

config = BigBirdConfig(
    vocab_size=30_000,
    hidden_size=512,
    num_hidden_layers=4,
    num_attention_heads=8,
    intermediate_size=2048,
    max_position_embeddings=4096,
    is_encoder_decoder=False,
    attention_type='block_sparse'
)
```
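
As a rough sanity check on the ~30M figure, the sketch below estimates the parameter count of a BERT-style encoder from the configuration values above. It is a back-of-the-envelope calculation, not the library's own count: the helper `estimate_encoder_params` is hypothetical, `type_vocab_size=2` is an assumption, and the MLM head and pooler are ignored.

```py
# Back-of-the-envelope parameter estimate for a BERT-style encoder.
# Hypothetical helper; it ignores the MLM head and pooler, and
# type_vocab_size=2 is an assumption rather than a value from the model card.
def estimate_encoder_params(vocab_size, hidden_size, num_hidden_layers,
                            intermediate_size, max_position_embeddings,
                            type_vocab_size=2):
    layer_norm = 2 * hidden_size  # weight + bias
    embeddings = (
        vocab_size * hidden_size                  # word embeddings
        + max_position_embeddings * hidden_size   # position embeddings
        + type_vocab_size * hidden_size           # token type embeddings
        + layer_norm
    )
    # Query, key, value and output projections, each with a bias
    attention = 4 * (hidden_size * hidden_size + hidden_size) + layer_norm
    # Two dense layers of the feed-forward block, each with a bias
    feed_forward = (
        hidden_size * intermediate_size + intermediate_size
        + intermediate_size * hidden_size + hidden_size
        + layer_norm
    )
    return embeddings + num_hidden_layers * (attention + feed_forward)


total = estimate_encoder_params(
    vocab_size=30_000,
    hidden_size=512,
    num_hidden_layers=4,
    intermediate_size=2048,
    max_position_embeddings=4096,
)
print(f'{total / 1e6:.1f}M parameters (estimate)')
```

Roughly half of the estimate comes from the word embedding matrix, which is typical for a small model with a 30K vocabulary.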
## How to use
> Inference with Transformers pipeline (one MASK token)
```py
>>> from transformers import pipeline
>>> pipe = pipeline(task='fill-mask', model='ilos-vigil/bigbird-small-indonesian')
>>> pipe('Saya sedang bermain [MASK] teman saya.')
[{'score': 0.7199566960334778,
'token': 14,
'token_str': 'dengan',
'sequence': 'Saya sedang bermain dengan teman saya.'},
{'score': 0.12370546162128448,
'token': 17,
'token_str': 'untuk',
'sequence': 'Saya sedang bermain untuk teman saya.'},
{'score': 0.0385284349322319,
'token': 331,
'token_str': 'bersama',
'sequence': 'Saya sedang bermain bersama teman saya.'},
{'score': 0.012146958149969578,
'token': 28,
'token_str': 'oleh',
'sequence': 'Saya sedang bermain oleh teman saya.'},
{'score': 0.009499032981693745,
'token': 25,
'token_str': 'sebagai',
'sequence': 'Saya sedang bermain sebagai teman saya.'}]
```
> Inference with PyTorch (one or multiple MASK tokens)
```py
import torch
from transformers import BigBirdTokenizerFast, BigBirdForMaskedLM
from pprint import pprint

tokenizer = BigBirdTokenizerFast.from_pretrained('ilos-vigil/bigbird-small-indonesian')
model = BigBirdForMaskedLM.from_pretrained('ilos-vigil/bigbird-small-indonesian')
topk = 5
text = 'Saya [MASK] bermain [MASK] teman saya.'

tokenized_text = tokenizer(text, return_tensors='pt')
raw_output = model(**tokenized_text)
# Top-k candidate token ids and softmax probabilities for every position
tokenized_output = torch.topk(raw_output.logits, topk, dim=2).indices
score_output = torch.softmax(raw_output.logits, dim=2)

result = []
# Collect predictions only at positions that hold a [MASK] token
for position_idx in range(tokenized_text['input_ids'][0].shape[0]):
    if tokenized_text['input_ids'][0][position_idx] == tokenizer.mask_token_id:
        outputs = []
        for token_idx in tokenized_output[0, position_idx]:
            output = {}
            output['score'] = score_output[0, position_idx, token_idx].item()
            output['token'] = token_idx.item()
            output['token_str'] = tokenizer.decode(output['token'])
            outputs.append(output)
        result.append(outputs)
pprint(result)
```
```py
[[{'score': 0.22353802621364594, 'token': 36, 'token_str': 'dapat'},
{'score': 0.13962049782276154, 'token': 24, 'token_str': 'tidak'},
{'score': 0.13610956072807312, 'token': 32, 'token_str': 'juga'},
{'score': 0.0725034773349762, 'token': 584, 'token_str': 'bermain'},
{'score': 0.033740025013685226, 'token': 38, 'token_str': 'akan'}],
[{'score': 0.7111291885375977, 'token': 14, 'token_str': 'dengan'},
{'score': 0.10754624754190445, 'token': 17, 'token_str': 'untuk'},
{'score': 0.022657711058855057, 'token': 331, 'token_str': 'bersama'},
{'score': 0.020862115547060966, 'token': 25, 'token_str': 'sebagai'},
{'score': 0.013086902908980846, 'token': 11, 'token_str': 'di'}]]
```
## Limitations and bias
Due to its low parameter count and case-sensitive tokenizer/model, this model is expected to perform poorly on certain fine-tuned tasks. Like any language model, it reflects biases in its training data, which comes from various sources. Here is an example of a biased prediction:
```py
>>> pipe('Memasak dirumah adalah kewajiban seorang [MASK].')
[{'score': 0.16381049156188965,
'sequence': 'Memasak dirumah adalah kewajiban seorang budak.',
'token': 4910,
'token_str': 'budak'},
{'score': 0.1334381103515625,
'sequence': 'Memasak dirumah adalah kewajiban seorang wanita.',
'token': 649,
'token_str': 'wanita'},
{'score': 0.11588197946548462,
'sequence': 'Memasak dirumah adalah kewajiban seorang lelaki.',
'token': 6368,
'token_str': 'lelaki'},
{'score': 0.061377108097076416,
'sequence': 'Memasak dirumah adalah kewajiban seorang diri.',
'token': 258,
'token_str': 'diri'},
{'score': 0.04679233580827713,
'sequence': 'Memasak dirumah adalah kewajiban seorang gadis.',
'token': 6845,
'token_str': 'gadis'}]
```
## Training and evaluation data
This model was pretrained on [Indonesian Wikipedia](https://huggingface.co/datasets/wikipedia) (dump file from 2022-10-20), [OSCAR](https://huggingface.co/datasets/oscar) (subset `unshuffled_deduplicated_id`), and [Indonesian Newspaper 2018](https://huggingface.co/datasets/id_newspapers_2018). Preprocessing uses the function from [task guides - language modeling](https://huggingface.co/docs/transformers/tasks/language_modeling#preprocess) with a block size of 4096. Each dataset is split using [`train_test_split`](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.train_test_split), with 5% allocated as evaluation data.
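
The block-grouping step can be illustrated with a minimal pure-Python sketch, modeled on the `group_texts` pattern from the linked task guide. It is an approximation for illustration, not the exact preprocessing code used for this model; the toy batch and the small `block_size` are made up so the result is easy to inspect.

```py
# Minimal sketch of grouping tokenized texts into fixed-size blocks.
# Modeled on the task guide's pattern; not the author's exact code.
def group_texts(examples, block_size=4096):
    # Concatenate every tokenized column into one long list
    concatenated = {key: sum(examples[key], []) for key in examples}
    total_length = len(concatenated[next(iter(examples))])
    # Drop the tail that does not fill a whole block
    total_length = (total_length // block_size) * block_size
    return {
        key: [tokens[i:i + block_size] for i in range(0, total_length, block_size)]
        for key, tokens in concatenated.items()
    }


# Toy batch of three "documents" with made-up token ids
toy_batch = {'input_ids': [[1, 2, 3], [4, 5], [6, 7, 8, 9]]}
blocks = group_texts(toy_batch, block_size=4)
print(blocks)
```

With the `datasets` library, a function like this would typically be applied through `Dataset.map(..., batched=True)`, and a 5% evaluation split corresponds to `train_test_split(test_size=0.05)`.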
## Training Procedure
The model was pretrained on a single RTX 3060 for 8 epochs (51,474 steps) with an accumulated batch size of 128. Sequences were limited to 4096 tokens. The optimizer is AdamW with a learning rate of 1e-4, weight decay of 0.01, learning rate warmup over the first 6% of steps (~3090 steps), and linear decay of the learning rate afterwards. Due to an early configuration mistake, however, the first 2 epochs used a learning rate of 1e-3 instead. Additional information is available in the TensorBoard training logs.
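
The warmup-then-linear-decay schedule described above can be sketched in a few lines of plain Python. This is an idealized illustration under stated assumptions: the function name `linear_warmup_decay_lr` is hypothetical, the warmup length is taken as the approximate 3090 steps quoted above, and the sketch uses the intended peak of 1e-4 throughout, so it does not reproduce the 1e-3 rate that was actually in effect during the first 2 epochs.

```py
# Idealized linear warmup followed by linear decay to zero.
# Hypothetical helper; defaults use the approximate values quoted in the text.
def linear_warmup_decay_lr(step, base_lr=1e-4, warmup_steps=3090, total_steps=51474):
    if step < warmup_steps:
        # Ramp linearly from 0 up to base_lr
        return base_lr * step / warmup_steps
    # Decay linearly from base_lr down to 0 at the final step
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


for step in (0, 1545, 3090, 27282, 51474):
    print(step, linear_warmup_decay_lr(step))
```

The learning rate peaks exactly at the end of warmup and reaches zero at the last training step.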
## Evaluation
The model achieved the following results during training evaluation.
| Epoch | Steps | Eval. loss | Eval. perplexity |
| ----- | ----- | ---------- | ---------------- |
| 1 | 6249 | 2.466 | 11.775 |
| 2 | 12858 | 2.265 | 9.631 |
| 3 | 19329 | 2.127 | 8.390 |
| 4 | 25758 | 2.116 | 8.298 |
| 5 | 32187 | 2.097 | 8.141 |
| 6 | 38616 | 2.087 | 8.061 |
| 7 | 45045 | 2.081 | 8.012 |
| 8 | 51474 | 2.078 | 7.988 |
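
The perplexity column appears to be the exponential of the evaluation loss. The short sketch below recomputes it from the loss values in the table, assuming the loss is a mean cross-entropy in nats; small differences in the last digit are expected because the tabulated losses are rounded.

```py
import math

# Evaluation loss per epoch, copied from the table above
eval_loss = {1: 2.466, 2: 2.265, 3: 2.127, 4: 2.116,
             5: 2.097, 6: 2.087, 7: 2.081, 8: 2.078}

# Perplexity is exp(loss) when the loss is mean cross-entropy in nats
perplexity = {epoch: math.exp(loss) for epoch, loss in eval_loss.items()}

for epoch, ppl in perplexity.items():
    print(f'epoch {epoch}: perplexity ~ {ppl:.2f}')
```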