---
language: id
license: mit
datasets:
- oscar
- wikipedia
- id_newspapers_2018
widget:
- text: "Saya [MASK] makan nasi goreng."
- text: "Kucing itu sedang bermain dengan [MASK]."
---

# Indonesian small BigBird model

**Disclaimer:** This is a work in progress. The current checkpoint has been trained for ~1.0 epoch (6,450 steps), with a train loss of 2.565 and an eval loss of 2.466. Newer checkpoints and additional information will be added in the future.

## Model Description

This model was pretrained with the Masked LM objective **only**. The architecture of the model is shown in the configuration snippet below. The tokenizer was trained on the whole **cased** dataset with a vocabulary size of **only** 30K.

```py
from transformers import BigBirdConfig

config = BigBirdConfig(
    vocab_size = 30_000,
    hidden_size = 512,
    num_hidden_layers = 4,
    num_attention_heads = 8,
    intermediate_size = 2048,
    max_position_embeddings = 4096,
    is_encoder_decoder = False,
    attention_type = 'block_sparse'
)
```
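
As a point of reference, the masked-LM variant of this architecture can be instantiated directly from the config above. This is a minimal sketch: `BigBirdForMaskedLM` is the standard `transformers` class for the Masked LM objective, and the released checkpoint itself would instead be loaded with `from_pretrained`.

```py
from transformers import BigBirdForMaskedLM

# Randomly initialized model with the architecture defined by `config` above;
# the actual pretrained checkpoint would be loaded with from_pretrained instead.
model = BigBirdForMaskedLM(config)
print(f"{model.num_parameters():,} parameters")
```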

## How to use

> TBD
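
Until this section is filled in, a minimal sketch with the `fill-mask` pipeline, reusing one of the widget examples above (the model id below is a hypothetical placeholder for this repository's path):

```py
from transformers import pipeline

# "<this-repo-id>" is a placeholder; substitute the actual repository path.
fill_mask = pipeline("fill-mask", model="<this-repo-id>")
print(fill_mask("Saya [MASK] makan nasi goreng."))
```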

## Limitations and bias

> TBD

## Training and evaluation data

This model was pretrained on [Indonesian Wikipedia](https://huggingface.co/datasets/wikipedia) (dump file from 2022-10-20), the `unshuffled_deduplicated_id` subset of [OSCAR](https://huggingface.co/datasets/oscar), and [Indonesian Newspaper 2018](https://huggingface.co/datasets/id_newspapers_2018). Preprocessing was done with the function from [task guides - language modeling](https://huggingface.co/docs/transformers/tasks/language_modeling#preprocess) using a block size of 4096. Each dataset was split with [`train_test_split`](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.train_test_split), allocating 5% as evaluation data.
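
Illustratively, that preprocessing might look like the following for the OSCAR subset. This is a sketch only: `tokenizer` is assumed to be this model's trained tokenizer, and `group_texts` follows the linked task guide.

```py
from datasets import load_dataset

block_size = 4096

def group_texts(examples):
    # Concatenate all tokenized texts, then slice them into block_size chunks,
    # dropping the remainder, as in the linked language-modeling task guide.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }

raw = load_dataset("oscar", "unshuffled_deduplicated_id", split="train")
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"]),  # assumes this model's tokenizer
    batched=True,
    remove_columns=raw.column_names,
)
lm_dataset = tokenized.map(group_texts, batched=True)
splits = lm_dataset.train_test_split(test_size=0.05)  # 5% held out for evaluation
```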

## Training Procedure

> TBD

## Evaluation

> TBD