Sundanese GPT-2 Base Emotion Classifier
Sundanese GPT-2 Base Emotion Classifier is an emotion text-classification model based on the OpenAI GPT-2 model. It starts from the pre-trained Sundanese GPT-2 Base model, which was then fine-tuned on the Sundanese Twitter dataset, a collection of Sundanese tweets.
10% of the dataset is kept for evaluation purposes. After training, the model achieved an evaluation accuracy of 94.84% and F1-macro of 94.75%.
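To make the two reported metrics concrete, here is a minimal pure-Python sketch of accuracy and macro-averaged F1. The label names and toy predictions are hypothetical and are not taken from the dataset; the actual evaluation code used for this model is not shown in this card.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels, for illustration only.
y_true = ["joy", "joy", "anger", "fear", "sadness", "anger"]
y_pred = ["joy", "anger", "anger", "fear", "sadness", "anger"]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.4f} macro_f1={f1:.4f}")
```

Because macro F1 weights every class equally, it can differ from accuracy when some classes are predicted less reliably than others.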
Hugging Face's Trainer class from the Transformers library was used to train the model. PyTorch was used as the backend framework during training, but the model remains compatible with other frameworks.
Model
Model | #params | Arch. | Training/Validation data (text) |
---|---|---|---|
sundanese-gpt2-base-emotion-classifier | 124M | GPT-2 Base | Sundanese Twitter dataset |
Evaluation Results
The model was trained for 10 epochs and the best model was loaded at the end.
Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
---|---|---|---|---|---|---|
1 | 0.819200 | 0.331463 | 0.880952 | 0.878694 | 0.883126 | 0.879304 |
2 | 0.140300 | 0.309764 | 0.900794 | 0.899025 | 0.906819 | 0.898632 |
3 | 0.018600 | 0.324491 | 0.948413 | 0.947525 | 0.948037 | 0.948153 |
4 | 0.004500 | 0.335100 | 0.932540 | 0.931648 | 0.934629 | 0.931617 |
5 | 0.000200 | 0.392145 | 0.932540 | 0.932281 | 0.935075 | 0.932527 |
6 | 0.000000 | 0.371689 | 0.932540 | 0.931760 | 0.934925 | 0.931840 |
7 | 0.000000 | 0.368086 | 0.944444 | 0.943652 | 0.945875 | 0.943843 |
8 | 0.000000 | 0.367550 | 0.944444 | 0.943652 | 0.945875 | 0.943843 |
9 | 0.000000 | 0.368033 | 0.944444 | 0.943652 | 0.945875 | 0.943843 |
10 | 0.000000 | 0.368391 | 0.944444 | 0.943652 | 0.945875 | 0.943843 |
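The card does not state which metric selected the best checkpoint. The reported 94.84% accuracy matches epoch 3, which has the highest accuracy and F1 but not the lowest validation loss. The sketch below, using values copied from the table above, shows how the choice of metric changes the selected epoch; it is an illustration, not the author's selection code.

```python
# (epoch, validation_loss, accuracy, f1) copied from the results table.
history = [
    (1, 0.331463, 0.880952, 0.878694),
    (2, 0.309764, 0.900794, 0.899025),
    (3, 0.324491, 0.948413, 0.947525),
    (4, 0.335100, 0.932540, 0.931648),
    (5, 0.392145, 0.932540, 0.932281),
    (6, 0.371689, 0.932540, 0.931760),
    (7, 0.368086, 0.944444, 0.943652),
    (8, 0.367550, 0.944444, 0.943652),
    (9, 0.368033, 0.944444, 0.943652),
    (10, 0.368391, 0.944444, 0.943652),
]

# Higher is better for accuracy and F1; lower is better for loss.
best_by_accuracy = max(history, key=lambda row: row[2])
best_by_f1 = max(history, key=lambda row: row[3])
best_by_loss = min(history, key=lambda row: row[1])

print("best by accuracy: epoch", best_by_accuracy[0])
print("best by F1:       epoch", best_by_f1[0])
print("best by loss:     epoch", best_by_loss[0])
```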
How to Use
As Text Classifier
from transformers import pipeline

pretrained_name = "sundanese-gpt2-base-emotion-classifier"

nlp = pipeline(
    "sentiment-analysis",
    model=pretrained_name,
    tokenizer=pretrained_name
)

nlp("Wah, éta gélo, keren pisan!")
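The pipeline call returns a list of dictionaries, each with a `label` and a `score`. For a single-label classifier the score is typically the softmax probability of the top class. The sketch below reproduces that post-processing in plain Python; the label order and logit values are hypothetical and do not come from this model's configuration.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical label mapping and logits, for illustration only.
id2label = {0: "anger", 1: "fear", 2: "joy", 3: "sadness"}
logits = [-1.2, -0.7, 3.1, 0.4]

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)

# Same shape as a text-classification pipeline result.
result = [{"label": id2label[best], "score": probs[best]}]
print(result)
```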
Disclaimer
Do consider the biases from both the pre-trained GPT-2 model and the Sundanese Twitter dataset, which may carry over into the results of this model.
Author
Sundanese GPT-2 Base Emotion Classifier was trained and evaluated by Wilson Wongso. All computation and development were done on Google Colaboratory using its free GPU access.
Citation Information
@article{rs-907893,
author = {Wongso, Wilson
and Lucky, Henry
and Suhartono, Derwin},
journal = {Journal of Big Data},
year = {2022},
month = {Feb},
day = {26},
abstract = {The Sundanese language has over 32 million speakers worldwide, but the language has reaped little to no benefits from the recent advances in natural language understanding. Like other low-resource languages, the only alternative is to fine-tune existing multilingual models. In this paper, we pre-trained three monolingual Transformer-based language models on Sundanese data. When evaluated on a downstream text classification task, we found that most of our monolingual models outperformed larger multilingual models despite the smaller overall pre-training data. In the subsequent analyses, our models benefited strongly from the Sundanese pre-training corpus size and do not exhibit socially biased behavior. We released our models for other researchers and practitioners to use.},
issn = {2693-5015},
doi = {10.21203/rs.3.rs-907893/v1},
url = {https://doi.org/10.21203/rs.3.rs-907893/v1}
}