|
--- |
|
language: id |
|
license: mit |
|
datasets: |
|
- oscar |
|
- wikipedia |
|
- id_newspapers_2018 |
|
widget: |
|
- text: Saya [MASK] makan nasi goreng. |
|
- text: Kucing itu sedang bermain dengan [MASK]. |
|
pipeline_tag: fill-mask |
|
--- |
|
|
|
# Indonesian small BigBird model |
|
|
|
## Source Code |
|
|
|
Source code to create this model is available at [https://github.com/ilos-vigil/bigbird-small-indonesian](https://github.com/ilos-vigil/bigbird-small-indonesian). |
|
|
|
## Downstream Task |
|
|
|
* NLI/ZSC: [ilos-vigil/bigbird-small-indonesian-nli](https://huggingface.co/ilos-vigil/bigbird-small-indonesian-nli) |
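
The NLI checkpoint above can be used for zero-shot classification through the standard `zero-shot-classification` pipeline. The snippet below is only a usage sketch: the Indonesian input text and candidate labels are made-up illustrations, not examples from that model card.

```py
from transformers import pipeline

# Usage sketch: the input text and candidate labels below are illustrative
# placeholders, not examples taken from the downstream model card.
classifier = pipeline(
    task='zero-shot-classification',
    model='ilos-vigil/bigbird-small-indonesian-nli'
)
classifier(
    'Saya suka makan nasi goreng.',
    candidate_labels=['makanan', 'olahraga', 'politik']
)
```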
|
|
|
## Model Description |
|
|
|
This **cased** model has been pretrained with a masked language modeling objective. It has ~30M parameters and was pretrained for 8 epochs (51474 steps), reaching an evaluation loss of 2.078 (7.988 perplexity). The architecture of the model is shown in the configuration snippet below. The tokenizer was trained on the whole dataset with a 30K vocabulary size.
|
|
|
```py |
|
from transformers import BigBirdConfig |
|
|
|
config = BigBirdConfig( |
|
vocab_size = 30_000, |
|
hidden_size = 512, |
|
num_hidden_layers = 4, |
|
num_attention_heads = 8, |
|
intermediate_size = 2048, |
|
max_position_embeddings = 4096, |
|
is_encoder_decoder=False, |
|
attention_type='block_sparse' |
|
) |
|
``` |
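
As a quick sanity check of the parameter count, the pretrained checkpoint can be loaded and its parameters counted directly; this is only a verification snippet, not part of the original training code.

```py
from transformers import BigBirdForMaskedLM

# Load the checkpoint and count its parameters; this should report roughly 30M.
model = BigBirdForMaskedLM.from_pretrained('ilos-vigil/bigbird-small-indonesian')
print(f'{model.num_parameters() / 1e6:.1f}M parameters')
```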
|
|
|
## How to use |
|
|
|
> Inference with Transformers pipeline (one MASK token) |
|
|
|
```py |
|
>>> from transformers import pipeline |
|
>>> pipe = pipeline(task='fill-mask', model='ilos-vigil/bigbird-small-indonesian') |
|
>>> pipe('Saya sedang bermain [MASK] teman saya.') |
|
[{'score': 0.7199566960334778, |
|
'token': 14, |
|
'token_str':'dengan', |
|
'sequence': 'Saya sedang bermain dengan teman saya.'}, |
|
{'score': 0.12370546162128448, |
|
'token': 17, |
|
'token_str': 'untuk', |
|
'sequence': 'Saya sedang bermain untuk teman saya.'}, |
|
{'score': 0.0385284349322319, |
|
'token': 331, |
|
'token_str': 'bersama', |
|
'sequence': 'Saya sedang bermain bersama teman saya.'}, |
|
{'score': 0.012146958149969578, |
|
'token': 28, |
|
'token_str': 'oleh', |
|
'sequence': 'Saya sedang bermain oleh teman saya.'}, |
|
{'score': 0.009499032981693745, |
|
'token': 25, |
|
'token_str': 'sebagai', |
|
'sequence': 'Saya sedang bermain sebagai teman saya.'}] |
|
``` |
|
|
|
> Inference with PyTorch (one or multiple MASK tokens)
|
|
|
```py |
|
import torch |
|
from transformers import BigBirdTokenizerFast, BigBirdForMaskedLM |
|
from pprint import pprint |
|
|
|
tokenizer = BigBirdTokenizerFast.from_pretrained('ilos-vigil/bigbird-small-indonesian') |
|
model = BigBirdForMaskedLM.from_pretrained('ilos-vigil/bigbird-small-indonesian') |
|
topk = 5 |
|
text = 'Saya [MASK] bermain [MASK] teman saya.' |
|
|
|
tokenized_text = tokenizer(text, return_tensors='pt') |
|
raw_output = model(**tokenized_text) |
|
tokenized_output = torch.topk(raw_output.logits, topk, dim=2).indices |
|
score_output = torch.softmax(raw_output.logits, dim=2) |
|
|
|
result = [] |
|
for position_idx in range(tokenized_text['input_ids'][0].shape[0]): |
|
if tokenized_text['input_ids'][0][position_idx] == tokenizer.mask_token_id: |
|
outputs = [] |
|
for token_idx in tokenized_output[0, position_idx]: |
|
output = {} |
|
output['score'] = score_output[0, position_idx, token_idx].item() |
|
output['token'] = token_idx.item() |
|
output['token_str'] = tokenizer.decode(output['token']) |
|
outputs.append(output) |
|
result.append(outputs) |
|
|
|
pprint(result) |
|
``` |
|
|
|
```py |
|
[[{'score': 0.22353802621364594, 'token': 36, 'token_str': 'dapat'}, |
|
{'score': 0.13962049782276154, 'token': 24, 'token_str': 'tidak'}, |
|
{'score': 0.13610956072807312, 'token': 32, 'token_str': 'juga'}, |
|
{'score': 0.0725034773349762, 'token': 584, 'token_str': 'bermain'}, |
|
{'score': 0.033740025013685226, 'token': 38, 'token_str': 'akan'}], |
|
[{'score': 0.7111291885375977, 'token': 14, 'token_str': 'dengan'}, |
|
{'score': 0.10754624754190445, 'token': 17, 'token_str': 'untuk'}, |
|
{'score': 0.022657711058855057, 'token': 331, 'token_str': 'bersama'}, |
|
{'score': 0.020862115547060966, 'token': 25, 'token_str': 'sebagai'}, |
|
{'score': 0.013086902908980846, 'token': 11, 'token_str': 'di'}]] |
|
``` |
|
|
|
## Limitations and bias |
|
|
|
Due to its low parameter count and case-sensitive tokenizer/model, this model is expected to have low performance on certain fine-tuned tasks. Like any language model, it reflects biases from the training data, which comes from various sources. Here's an example of how the model can produce biased predictions:
|
|
|
```py |
|
>>> pipe('Memasak dirumah adalah kewajiban seorang [MASK].') |
|
[{'score': 0.16381049156188965, |
|
'sequence': 'Memasak dirumah adalah kewajiban seorang budak.', |
|
'token': 4910, |
|
'token_str': 'budak'}, |
|
{'score': 0.1334381103515625, |
|
'sequence': 'Memasak dirumah adalah kewajiban seorang wanita.', |
|
'token': 649, |
|
'token_str': 'wanita'}, |
|
{'score': 0.11588197946548462, |
|
'sequence': 'Memasak dirumah adalah kewajiban seorang lelaki.', |
|
'token': 6368, |
|
'token_str': 'lelaki'}, |
|
{'score': 0.061377108097076416, |
|
'sequence': 'Memasak dirumah adalah kewajiban seorang diri.', |
|
'token': 258, |
|
'token_str': 'diri'}, |
|
{'score': 0.04679233580827713, |
|
'sequence': 'Memasak dirumah adalah kewajiban seorang gadis.', |
|
'token': 6845, |
|
'token_str': 'gadis'}] |
|
``` |
|
|
|
## Training and evaluation data |
|
|
|
This model was pretrained on the [Indonesian Wikipedia](https://huggingface.co/datasets/wikipedia) dump from 2022-10-20, the `unshuffled_deduplicated_id` subset of [OSCAR](https://huggingface.co/datasets/oscar) and [Indonesian Newspaper 2018](https://huggingface.co/datasets/id_newspapers_2018). Preprocessing is done using the function from [task guides - language modeling](https://huggingface.co/docs/transformers/tasks/language_modeling#preprocess) with a block size of 4096. Each dataset is split using [`train_test_split`](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.train_test_split) with 5% allocated as evaluation data.
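
Below is a rough sketch of that preprocessing, shown only on the OSCAR subset; the tokenize/group functions follow the linked task guide and are not the exact training script, which lives in the linked repository.

```py
from datasets import load_dataset
from transformers import BigBirdTokenizerFast

# Sketch under assumptions: only the OSCAR subset is shown and the grouping
# code follows the linked task guide; the actual script is in the repo.
block_size = 4096
tokenizer = BigBirdTokenizerFast.from_pretrained('ilos-vigil/bigbird-small-indonesian')

dataset = load_dataset('oscar', 'unshuffled_deduplicated_id', split='train')
dataset = dataset.train_test_split(test_size=0.05)  # 5% as evaluation data

def tokenize(examples):
    return tokenizer(examples['text'])

def group_texts(examples):
    # Concatenate all tokenized texts, then chop them into `block_size` chunks.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = (len(concatenated['input_ids']) // block_size) * block_size
    return {
        k: [v[i:i + block_size] for i in range(0, total_length, block_size)]
        for k, v in concatenated.items()
    }

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset['train'].column_names)
lm_dataset = tokenized.map(group_texts, batched=True)
```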
|
|
|
## Training Procedure |
|
|
|
The model was pretrained on a single RTX 3060 for 8 epochs (51474 steps) with an accumulated batch size of 128. The sequence length was limited to 4096 tokens. The optimizer used was AdamW with LR 1e-4, weight decay 0.01, learning rate warmup for the first 6% of steps (~3090 steps) and linear decay of the learning rate afterwards. However, due to an early configuration mistake, the first 2 epochs used LR 1e-3 instead. Additional information can be found in the TensorBoard training logs.
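
A minimal sketch of these hyperparameters expressed as `TrainingArguments` is shown below; the per-device batch size and gradient accumulation steps are assumptions that only need to multiply to the accumulated batch size of 128, and the actual values are in the linked repository.

```py
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='bigbird-small-indonesian',
    num_train_epochs=8,
    per_device_train_batch_size=2,   # assumption; see the linked repository
    gradient_accumulation_steps=64,  # 2 * 64 = 128 accumulated batch size
    learning_rate=1e-4,
    weight_decay=0.01,
    warmup_ratio=0.06,               # warmup over the first ~6% of steps
    lr_scheduler_type='linear',      # linear decay after warmup
    evaluation_strategy='epoch',
    save_strategy='epoch',
)
```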
|
|
|
## Evaluation |
|
|
|
The model achieved the following results during training evaluation.
|
|
|
| Epoch | Steps | Eval. loss | Eval. perplexity | |
|
| ----- | ----- | ---------- | ---------------- | |
|
| 1 | 6249 | 2.466 | 11.775 | |
|
| 2 | 12858 | 2.265 | 9.631 | |
|
| 3 | 19329 | 2.127 | 8.390 | |
|
| 4 | 25758 | 2.116 | 8.298 | |
|
| 5 | 32187 | 2.097 | 8.141 | |
|
| 6 | 38616 | 2.087 | 8.061 | |
|
| 7 | 45045 | 2.081 | 8.012 | |
|
| 8 | 51474 | 2.078 | 7.988 | |
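
The perplexity column is simply the exponential of the evaluation loss, e.g. for the final checkpoint:

```py
import math

# Perplexity is exp(cross-entropy loss); for the final checkpoint:
print(math.exp(2.078))  # ≈ 7.988
```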