---
language: id
license: mit
datasets:
- oscar
- wikipedia
- id_newspapers_2018
widget:
- text: "Saya [MASK] makan nasi goreng."
- text: "Kucing itu sedang bermain dengan [MASK]."
---

# Indonesian small BigBird model

**Disclaimer:** This is a work in progress. The current checkpoint has been trained for ~1.0 epoch (6,450 steps), with a train loss of 2.565 and an eval loss of 2.466. Newer checkpoints and additional information will be added in the future.

## Model Description

This model was pretrained with the Masked LM objective **only**. The architecture of the model is shown in the configuration snippet below. The tokenizer was trained on the whole **cased** dataset with a vocabulary size of **only** 30K.

```py
from transformers import BigBirdConfig

config = BigBirdConfig(
    vocab_size = 30_000,
    hidden_size = 512,
    num_hidden_layers = 4,
    num_attention_heads = 8,
    intermediate_size = 2048,
    max_position_embeddings = 4096,
    is_encoder_decoder = False,
    attention_type = 'block_sparse'
)
```
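
As a point of reference, the masked-LM variant of this architecture can be instantiated directly from the config above. This is a minimal sketch: `BigBirdForMaskedLM` is the standard `transformers` class for the Masked LM objective, and the released checkpoint itself would instead be loaded with `from_pretrained`.

```py
from transformers import BigBirdForMaskedLM

# Randomly initialized model with the architecture defined by `config` above;
# the actual pretrained checkpoint would be loaded with from_pretrained instead.
model = BigBirdForMaskedLM(config)
print(f"{model.num_parameters():,} parameters")
```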

## How to use

> TBD
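
Until this section is filled in, a minimal sketch with the `fill-mask` pipeline, reusing one of the widget examples above (the model id below is a hypothetical placeholder for this repository's path):

```py
from transformers import pipeline

# "<this-repo-id>" is a placeholder; substitute the actual repository path.
fill_mask = pipeline("fill-mask", model="<this-repo-id>")
print(fill_mask("Saya [MASK] makan nasi goreng."))
```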

## Limitations and bias

> TBD

## Training and evaluation data

This model was pretrained on [Indonesian Wikipedia](https://huggingface.co/datasets/wikipedia) (dump file from 2022-10-20), the `unshuffled_deduplicated_id` subset of [OSCAR](https://huggingface.co/datasets/oscar), and [Indonesian Newspaper 2018](https://huggingface.co/datasets/id_newspapers_2018). Preprocessing was done with the function from [task guides - language modeling](https://huggingface.co/docs/transformers/tasks/language_modeling#preprocess) using a block size of 4096. Each dataset was split with [`train_test_split`](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.train_test_split), allocating 5% as evaluation data.
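
Illustratively, that preprocessing might look like the following for the OSCAR subset. This is a sketch only: `tokenizer` is assumed to be this model's trained tokenizer, and `group_texts` follows the linked task guide.

```py
from datasets import load_dataset

block_size = 4096

def group_texts(examples):
    # Concatenate all tokenized texts, then slice them into block_size chunks,
    # dropping the remainder, as in the linked language-modeling task guide.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }

raw = load_dataset("oscar", "unshuffled_deduplicated_id", split="train")
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"]),  # assumes this model's tokenizer
    batched=True,
    remove_columns=raw.column_names,
)
lm_dataset = tokenized.map(group_texts, batched=True)
splits = lm_dataset.train_test_split(test_size=0.05)  # 5% held out for evaluation
```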

## Training Procedure

> TBD

## Evaluation

> TBD