|
--- |
|
language: id |
|
license: mit |
|
datasets: |
|
- oscar |
|
- wikipedia |
|
- id_newspapers_2018 |
|
widget: |
|
- text: Saya [MASK] makan nasi goreng. |
|
- text: Kucing itu sedang bermain dengan [MASK]. |
|
pipeline_tag: fill-mask |
|
--- |
|
|
|
# Indonesian small BigBird model |
|
|
|
## Source Code |
|
|
|
Source code to create this model is available at [https://github.com/ilos-vigil/bigbird-small-indonesian](https://github.com/ilos-vigil/bigbird-small-indonesian). |
|
|
|
## Downstream Task |
|
|
|
* NLI/ZSC: [ilos-vigil/bigbird-small-indonesian-nli](https://huggingface.co/ilos-vigil/bigbird-small-indonesian-nli) |
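
The NLI checkpoint above can be used for zero-shot classification through the standard `zero-shot-classification` pipeline. The snippet below is only a usage sketch: the Indonesian input text and candidate labels are made-up illustrations, not examples from that model card.

```py
from transformers import pipeline

# Usage sketch: the input text and candidate labels below are illustrative
# placeholders, not examples taken from the downstream model card.
classifier = pipeline(
    task='zero-shot-classification',
    model='ilos-vigil/bigbird-small-indonesian-nli'
)
classifier(
    'Saya suka makan nasi goreng.',
    candidate_labels=['makanan', 'olahraga', 'politik']
)
```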
|
|
|
## Model Description |
|
|
|
This **cased** model has been pretrained with a masked language modeling objective. It has ~30M parameters and was pretrained for 8 epochs (51474 steps), reaching an evaluation loss of 2.078 (7.988 perplexity). The architecture of the model is shown in the configuration snippet below. The tokenizer was trained on the whole dataset with a 30K vocabulary size.
|
|
|
```py |
|
from transformers import BigBirdConfig |
|
|
|
config = BigBirdConfig( |
|
vocab_size = 30_000, |
|
hidden_size = 512, |
|
num_hidden_layers = 4, |
|
num_attention_heads = 8, |
|
intermediate_size = 2048, |
|
max_position_embeddings = 4096, |
|
is_encoder_decoder=False, |
|
attention_type='block_sparse' |
|
) |
|
``` |
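
As a quick sanity check of the parameter count, the pretrained checkpoint can be loaded and its parameters counted directly; this is only a verification snippet, not part of the original training code.

```py
from transformers import BigBirdForMaskedLM

# Load the checkpoint and count its parameters; this should report roughly 30M.
model = BigBirdForMaskedLM.from_pretrained('ilos-vigil/bigbird-small-indonesian')
print(f'{model.num_parameters() / 1e6:.1f}M parameters')
```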
|
|
|
## How to use |
|
|
|
> Inference with Transformers pipeline (one MASK token) |
|
|
|
```py |
|
>>> from transformers import pipeline |
|
>>> pipe = pipeline(task='fill-mask', model='ilos-vigil/bigbird-small-indonesian') |
|
>>> pipe('Saya sedang bermain [MASK] teman saya.') |
|
[{'score': 0.7199566960334778, |
|
'token': 14, |
|
'token_str':'dengan', |
|
'sequence': 'Saya sedang bermain dengan teman saya.'}, |
|
{'score': 0.12370546162128448, |
|
'token': 17, |
|
'token_str': 'untuk', |
|
'sequence': 'Saya sedang bermain untuk teman saya.'}, |
|
{'score': 0.0385284349322319, |
|
'token': 331, |
|
'token_str': 'bersama', |
|
'sequence': 'Saya sedang bermain bersama teman saya.'}, |
|
{'score': 0.012146958149969578, |
|
'token': 28, |
|
'token_str': 'oleh', |
|
'sequence': 'Saya sedang bermain oleh teman saya.'}, |
|
{'score': 0.009499032981693745, |
|
'token': 25, |
|
'token_str': 'sebagai', |
|
'sequence': 'Saya sedang bermain sebagai teman saya.'}] |
|
``` |
|
|
|
> Inference with PyTorch (one or multiple MASK tokens)
|
|
|
```py |
|
import torch |
|
from transformers import BigBirdTokenizerFast, BigBirdForMaskedLM |
|
from pprint import pprint |
|
|
|
tokenizer = BigBirdTokenizerFast.from_pretrained('ilos-vigil/bigbird-small-indonesian') |
|
model = BigBirdForMaskedLM.from_pretrained('ilos-vigil/bigbird-small-indonesian') |
|
topk = 5 |
|
text = 'Saya [MASK] bermain [MASK] teman saya.' |
|
|
|
tokenized_text = tokenizer(text, return_tensors='pt') |
|
raw_output = model(**tokenized_text) |
|
tokenized_output = torch.topk(raw_output.logits, topk, dim=2).indices |
|
score_output = torch.softmax(raw_output.logits, dim=2) |
|
|
|
result = [] |
|
for position_idx in range(tokenized_text['input_ids'][0].shape[0]): |
|
if tokenized_text['input_ids'][0][position_idx] == tokenizer.mask_token_id: |
|
outputs = [] |
|
for token_idx in tokenized_output[0, position_idx]: |
|
output = {} |
|
output['score'] = score_output[0, position_idx, token_idx].item() |
|
output['token'] = token_idx.item() |
|
output['token_str'] = tokenizer.decode(output['token']) |
|
outputs.append(output) |
|
result.append(outputs) |
|
|
|
pprint(result) |
|
``` |
|
|
|
```py |
|
[[{'score': 0.22353802621364594, 'token': 36, 'token_str': 'dapat'}, |
|
{'score': 0.13962049782276154, 'token': 24, 'token_str': 'tidak'}, |
|
{'score': 0.13610956072807312, 'token': 32, 'token_str': 'juga'}, |
|
{'score': 0.0725034773349762, 'token': 584, 'token_str': 'bermain'}, |
|
{'score': 0.033740025013685226, 'token': 38, 'token_str': 'akan'}], |
|
[{'score': 0.7111291885375977, 'token': 14, 'token_str': 'dengan'}, |
|
{'score': 0.10754624754190445, 'token': 17, 'token_str': 'untuk'}, |
|
{'score': 0.022657711058855057, 'token': 331, 'token_str': 'bersama'}, |
|
{'score': 0.020862115547060966, 'token': 25, 'token_str': 'sebagai'}, |
|
{'score': 0.013086902908980846, 'token': 11, 'token_str': 'di'}]] |
|
``` |
|
|
|
## Limitations and bias |
|
|
|
Due to its low parameter count and case-sensitive tokenizer/model, this model is expected to have low performance on certain fine-tuned tasks. Like any language model, it reflects biases from the training data, which comes from various sources. Here's an example of how the model can produce biased predictions:
|
|
|
```py |
|
>>> pipe('Memasak dirumah adalah kewajiban seorang [MASK].') |
|
[{'score': 0.16381049156188965, |
|
'sequence': 'Memasak dirumah adalah kewajiban seorang budak.', |
|
'token': 4910, |
|
'token_str': 'budak'}, |
|
{'score': 0.1334381103515625, |
|
'sequence': 'Memasak dirumah adalah kewajiban seorang wanita.', |
|
'token': 649, |
|
'token_str': 'wanita'}, |
|
{'score': 0.11588197946548462, |
|
'sequence': 'Memasak dirumah adalah kewajiban seorang lelaki.', |
|
'token': 6368, |
|
'token_str': 'lelaki'}, |
|
{'score': 0.061377108097076416, |
|
'sequence': 'Memasak dirumah adalah kewajiban seorang diri.', |
|
'token': 258, |
|
'token_str': 'diri'}, |
|
{'score': 0.04679233580827713, |
|
'sequence': 'Memasak dirumah adalah kewajiban seorang gadis.', |
|
'token': 6845, |
|
'token_str': 'gadis'}] |
|
``` |
|
|
|
## Training and evaluation data |
|
|
|
This model was pretrained on the [Indonesian Wikipedia](https://huggingface.co/datasets/wikipedia) dump from 2022-10-20, the `unshuffled_deduplicated_id` subset of [OSCAR](https://huggingface.co/datasets/oscar) and [Indonesian Newspaper 2018](https://huggingface.co/datasets/id_newspapers_2018). Preprocessing is done using the function from [task guides - language modeling](https://huggingface.co/docs/transformers/tasks/language_modeling#preprocess) with a block size of 4096. Each dataset is split using [`train_test_split`](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.train_test_split) with 5% allocated as evaluation data.
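
Below is a rough sketch of that preprocessing, shown only on the OSCAR subset; the tokenize/group functions follow the linked task guide and are not the exact training script, which lives in the linked repository.

```py
from datasets import load_dataset
from transformers import BigBirdTokenizerFast

# Sketch under assumptions: only the OSCAR subset is shown and the grouping
# code follows the linked task guide; the actual script is in the repo.
block_size = 4096
tokenizer = BigBirdTokenizerFast.from_pretrained('ilos-vigil/bigbird-small-indonesian')

dataset = load_dataset('oscar', 'unshuffled_deduplicated_id', split='train')
dataset = dataset.train_test_split(test_size=0.05)  # 5% as evaluation data

def tokenize(examples):
    return tokenizer(examples['text'])

def group_texts(examples):
    # Concatenate all tokenized texts, then chop them into `block_size` chunks.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = (len(concatenated['input_ids']) // block_size) * block_size
    return {
        k: [v[i:i + block_size] for i in range(0, total_length, block_size)]
        for k, v in concatenated.items()
    }

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset['train'].column_names)
lm_dataset = tokenized.map(group_texts, batched=True)
```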
|
|
|
## Training Procedure |
|
|
|
The model was pretrained on a single RTX 3060 for 8 epochs (51474 steps) with an accumulated batch size of 128. The sequence length was limited to 4096 tokens. The optimizer used was AdamW with LR 1e-4, weight decay 0.01, learning rate warmup for the first 6% of steps (~3090 steps) and linear decay of the learning rate afterwards. However, due to an early configuration mistake, the first 2 epochs used LR 1e-3 instead. Additional information can be found in the TensorBoard training logs.
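
A minimal sketch of these hyperparameters expressed as `TrainingArguments` is shown below; the per-device batch size and gradient accumulation steps are assumptions that only need to multiply to the accumulated batch size of 128, and the actual values are in the linked repository.

```py
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='bigbird-small-indonesian',
    num_train_epochs=8,
    per_device_train_batch_size=2,   # assumption; see the linked repository
    gradient_accumulation_steps=64,  # 2 * 64 = 128 accumulated batch size
    learning_rate=1e-4,
    weight_decay=0.01,
    warmup_ratio=0.06,               # warmup over the first ~6% of steps
    lr_scheduler_type='linear',      # linear decay after warmup
    evaluation_strategy='epoch',
    save_strategy='epoch',
)
```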
|
|
|
## Evaluation |
|
|
|
The model achieved the following results during training evaluation.
|
|
|
| Epoch | Steps | Eval. loss | Eval. perplexity | |
|
| ----- | ----- | ---------- | ---------------- | |
|
| 1 | 6249 | 2.466 | 11.775 | |
|
| 2 | 12858 | 2.265 | 9.631 | |
|
| 3 | 19329 | 2.127 | 8.390 | |
|
| 4 | 25758 | 2.116 | 8.298 | |
|
| 5 | 32187 | 2.097 | 8.141 | |
|
| 6 | 38616 | 2.087 | 8.061 | |
|
| 7 | 45045 | 2.081 | 8.012 | |
|
| 8 | 51474 | 2.078 | 7.988 | |
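
The perplexity column is simply the exponential of the evaluation loss, e.g. for the final checkpoint:

```py
import math

# Perplexity is exp(cross-entropy loss); for the final checkpoint:
print(math.exp(2.078))  # ≈ 7.988
```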