Update README.md
README.md
CHANGED
@@ -1,69 +1,105 @@
---
language: id
tags:
- indonesian-roberta-base-prdect-id
license: apache-2.0
datasets:
- prdect-id
widget:
- text: "Wah, kualitas produk ini sangat bagus!"
---

## Indonesian RoBERTa Base PRDECT-ID

Indonesian RoBERTa Base PRDECT-ID is an emotion text-classification model based on the [RoBERTa](https://arxiv.org/abs/1907.11692) architecture. It starts from the pre-trained [Indonesian RoBERTa Base](https://hf.co/flax-community/indonesian-roberta-base) model, which was then fine-tuned on the [`PRDECT-ID`](https://doi.org/10.1016/j.dib.2022.108554) dataset of Indonesian product reviews (Sutoyo et al., 2022).

This model was trained using the Hugging Face Transformers library with PyTorch. All training was done on an NVIDIA T4 GPU provided by Google Colaboratory. [Training metrics](https://huggingface.co/w11wo/indonesian-roberta-base-prdect-id/tensorboard) were logged via TensorBoard.

## Model

| Model                               | #params | Arch.        | Training/Validation data (text) |
| ----------------------------------- | ------- | ------------ | ------------------------------- |
| `indonesian-roberta-base-prdect-id` | 124M    | RoBERTa Base | `PRDECT-ID`                     |

## Evaluation Results

The model achieves the following results on the evaluation set:

| Dataset     | Accuracy | F1       | Precision | Recall   |
| ----------- | -------- | -------- | --------- | -------- |
| `PRDECT-ID` | 0.685185 | 0.644750 | 0.646400  | 0.643710 |

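This card does not state how F1, precision, and recall are averaged across the emotion classes. As a rough sketch, assuming macro averaging (an assumption, not confirmed by the source), the four reported metrics could be computed from predictions as follows. The labels are toy values, not PRDECT-ID data, and `macro_metrics` is an illustrative helper rather than part of any library.

```python
from collections import Counter


def macro_metrics(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    precisions, recalls, f1s = [], [], []
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        prec = tp[label] / predicted if predicted else 0.0
        rec = tp[label] / actual if actual else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy labels for illustration only (assumed, not taken from the dataset)
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
metrics = macro_metrics(y_true, y_pred)
print(metrics)
```

With macro averaging every class counts equally, so F1 can sit below accuracy when some classes are harder than others.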
## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

- `learning_rate`: 2e-05
- `train_batch_size`: 32
- `eval_batch_size`: 32
- `seed`: 42
- `optimizer`: Adam with `betas=(0.9,0.999)` and `epsilon=1e-08`
- `lr_scheduler_type`: linear
- `num_epochs`: 5

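To make the `linear` scheduler concrete, here is a minimal sketch of the learning rate at a given optimizer step. It assumes zero warmup steps (warmup is not listed among the hyperparameters, so this is an assumption) and 760 total steps, i.e. 5 epochs of 152 steps each as logged in the training results. `linear_lr` is an illustrative helper, not a Transformers API.

```python
def linear_lr(step, base_lr=2e-5, total_steps=760, warmup_steps=0):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Learning rate at the start of training and at each epoch boundary
for epoch in range(6):
    step = epoch * 152
    print(f"epoch {epoch}: step {step:3d}, lr = {linear_lr(step):.2e}")
```

Under these assumptions the rate falls by the same amount each epoch and reaches zero at the final step.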
### Training results

| Training Loss | Epoch | Step  | Validation Loss | Accuracy |   F1   | Precision | Recall |
| :-----------: | :---: | :---: | :-------------: | :------: | :----: | :-------: | :----: |
|    1.0358     |  1.0  |  152  |     0.8293      |  0.6519  | 0.5814 |  0.6399   | 0.5746 |
|    0.7012     |  2.0  |  304  |     0.7444      |  0.6741  | 0.6269 |  0.6360   | 0.6220 |
|    0.5599     |  3.0  |  456  |     0.7635      |  0.6852  | 0.6440 |  0.6433   | 0.6453 |
|    0.4628     |  4.0  |  608  |     0.8031      |  0.6852  | 0.6421 |  0.6471   | 0.6396 |
|    0.4027     |  5.0  |  760  |     0.8133      |  0.6852  | 0.6447 |  0.6464   | 0.6437 |

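In the logged values, validation loss is lowest after epoch 2 while F1 is highest after epoch 5, so which checkpoint counts as "best" depends on the selection metric. A small sketch that reads this off the log; the numbers are copied from the training results, while the `rows` tuple layout is only an illustrative choice:

```python
rows = [
    # (epoch, step, train_loss, val_loss, accuracy, f1)
    (1, 152, 1.0358, 0.8293, 0.6519, 0.5814),
    (2, 304, 0.7012, 0.7444, 0.6741, 0.6269),
    (3, 456, 0.5599, 0.7635, 0.6852, 0.6440),
    (4, 608, 0.4628, 0.8031, 0.6852, 0.6421),
    (5, 760, 0.4027, 0.8133, 0.6852, 0.6447),
]

# Select the epoch under two different criteria
best_by_loss = min(rows, key=lambda r: r[3])
best_by_f1 = max(rows, key=lambda r: r[5])

print("lowest validation loss at epoch", best_by_loss[0])
print("highest F1 at epoch", best_by_f1[0])
```

Training loss keeps falling while validation loss rises after epoch 2, which is often a sign of overfitting, although accuracy stays flat from epoch 3 onward.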
## How to Use

### As Text Classifier

```python
from transformers import pipeline

pretrained_name = "w11wo/indonesian-roberta-base-prdect-id"

# "sentiment-analysis" is an alias for the text-classification pipeline
nlp = pipeline(
    "sentiment-analysis",
    model=pretrained_name,
    tokenizer=pretrained_name
)

# Input means: "Wow, the quality of this product is very good!"
nlp("Wah, kualitas produk ini sangat bagus!")
```

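The pipeline returns a label together with a score. For a single-label classifier, that score is typically the softmax probability of the predicted class, computed from one logit per class. The following is a minimal sketch of that step using made-up logits and placeholder label names; the model's actual label names are not listed in this card, and the five-class setup follows the dataset description of five emotions.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits and placeholder label names, for illustration only
labels = ["LABEL_0", "LABEL_1", "LABEL_2", "LABEL_3", "LABEL_4"]
logits = [2.1, 0.3, -1.2, 0.8, -0.5]

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
result = {"label": labels[best], "score": probs[best]}
print(result)
```

The resulting dictionary mirrors the shape of a single pipeline prediction, which makes it easier to see what the reported score represents.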
## Disclaimer

Consider the biases from both the pre-trained RoBERTa model and the `PRDECT-ID` dataset, which may carry over into this model's results.

## Author

Indonesian RoBERTa Base PRDECT-ID was trained and evaluated by [Wilson Wongso](https://w11wo.github.io/). All computation and development were done on Google Colaboratory using its free GPU access.

## Framework versions

- Transformers 4.24.0
- PyTorch 1.12.1+cu113
- Datasets 2.7.1
- Tokenizers 0.13.2

## References

```bib
@article{SUTOYO2022108554,
  title = {PRDECT-ID: Indonesian product reviews dataset for emotions classification tasks},
  journal = {Data in Brief},
  volume = {44},
  pages = {108554},
  year = {2022},
  issn = {2352-3409},
  doi = {https://doi.org/10.1016/j.dib.2022.108554},
  url = {https://www.sciencedirect.com/science/article/pii/S2352340922007612},
  author = {Rhio Sutoyo and Said Achmad and Andry Chowanda and Esther Widhi Andangsari and Sani M. Isa},
  keywords = {Natural language processing, Text processing, Text mining, Emotions classification, Sentiment analysis},
  abstract = {Recognizing emotions is vital in communication. Emotions convey additional meanings to the communication process. Nowadays, people can communicate their emotions on many platforms; one is the product review. Product reviews in the online platform are an important element that affects customers’ buying decisions. Hence, it is essential to recognize emotions from the product reviews. Emotions recognition from the product reviews can be done automatically using a machine or deep learning algorithm. Dataset can be considered as the fuel to model the recognizer. However, only a limited dataset exists in recognizing emotions from the product reviews, particularly in a local language. This research contributes to the dataset collection of 5400 product reviews in Indonesian. It was carefully curated from various (29) product categories, annotated with five emotions, and verified by an expert in clinical psychology. The dataset supports an innovative process to build automatic emotion classification on product reviews.}
}
```