TajaKuzman committed
Commit ac6c965 (parent: 07f72cb): Update README.md

README.md CHANGED
@@ -114,28 +114,46 @@ widget:

# X-GENRE classifier - multilingual text genre classifier

-Text classification model based on [`xlm-roberta-base`](https://huggingface.co/xlm-roberta-base) and fine-tuned on a combination of three genre datasets: Slovene GINCO
-
-```python
-model_args= {
-    "num_train_epochs": 15,
-    "learning_rate": 1e-5,
-    "max_seq_length": 512,
-}
-```

## Intended use and limitations

An example of preparing data for genre identification and post-processing of the results can be found [here](https://github.com/TajaKuzman/Applying-GENRE-on-MaCoCu-bilingual) where we applied the X-GENRE classifier to the English part of the [MaCoCu](https://macocu.eu/) parallel corpora.
@@ -236,7 +254,18 @@ The classifier was compared with other classifiers on 2 additional genre dataset

In cross-dataset and cross-lingual experiments, it was shown that the X-GENRE classifier, trained on all three datasets, outperforms classifiers that were trained on just one of the datasets.

If you use the model, please cite the paper which describes the creation of the X-GENRE dataset and the genre classifier:

# X-GENRE classifier - multilingual text genre classifier

+Text classification model based on [`xlm-roberta-base`](https://huggingface.co/xlm-roberta-base) and fine-tuned on a combination of three genre datasets: the Slovene [GINCO](http://hdl.handle.net/11356/1467) dataset (Kuzman et al., 2022), the English [CORE](https://github.com/TurkuNLP/CORE-corpus) dataset (Egbert et al., 2015) and the English [FTD](https://github.com/ssharoff/genre-keras) dataset (Sharoff, 2018). The model can be used for automatic genre identification, applied to any text in a language supported by `xlm-roberta-base`. The details on the model development, the datasets, and the model's in-dataset, cross-dataset and multilingual performance are described in the paper [Automatic Genre Identification for Robust Enrichment of Massive Text Collections: Investigation of Classification Methods in the Era of Large Language Models](https://www.mdpi.com/2504-4990/5/3/59) (Kuzman et al., 2023).
+
+If you use the model, please cite the paper which describes the creation of the X-GENRE dataset and the genre classifier:
+
+```
+@article{kuzman2023automatic,
+  title={Automatic Genre Identification for Robust Enrichment of Massive Text Collections: Investigation of Classification Methods in the Era of Large Language Models},
+  author={Kuzman, Taja and Mozeti{\v{c}}, Igor and Ljube{\v{s}}i{\'c}, Nikola},
+  journal={Machine Learning and Knowledge Extraction},
+  volume={5},
+  number={3},
+  pages={1149--1175},
+  year={2023},
+  publisher={MDPI}
+}
+```

+## AGILE - Automatic Genre Identification Benchmark

+We set up a benchmark for evaluating the robustness of automatic genre identification models, to test their usability for the automatic enrichment of large text collections with genre information. You are welcome to request the test dataset and submit your entry at the [benchmark's GitHub repository](https://github.com/TajaKuzman/AGILE-Automatic-Genre-Identification-Benchmark/tree/main).
+
+In an out-of-dataset scenario (evaluating a model on a manually-annotated English dataset on which it was not trained), the model outperforms all other technologies:
+
+| | micro F1 | macro F1 | accuracy |
+|:----------------------------|-----------:|-----------:|-----------:|
+| **XLM-RoBERTa, fine-tuned on the X-GENRE dataset - X-GENRE classifier** (Kuzman et al. 2023) | 0.68 | 0.69 | 0.68 |
+| GPT-4 (7/7/2023) (Kuzman et al. 2023) | 0.65 | 0.55 | 0.65 |
+| GPT-3.5-turbo (Kuzman et al. 2023) | 0.63 | 0.53 | 0.63 |
+| SVM (Kuzman et al. 2023) | 0.49 | 0.51 | 0.49 |
+| Logistic Regression (Kuzman et al. 2023) | 0.49 | 0.47 | 0.49 |
+| FastText (Kuzman et al. 2023) | 0.45 | 0.41 | 0.45 |
+| Naive Bayes (Kuzman et al. 2023) | 0.36 | 0.29 | 0.36 |
+| mt0 | 0.32 | 0.23 | 0.27 |
+| Zero-Shot classification with `MoritzLaurer/mDeBERTa-v3-base-mnli-xnli` @ HuggingFace | 0.2 | 0.15 | 0.2 |
+| Dummy Classifier (stratified) (Kuzman et al. 2023) | 0.14 | 0.1 | 0.14 |
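The micro-F1, macro-F1 and accuracy columns above can be computed from per-text predictions. A minimal pure-Python sketch of the two F1 variants follows; the genre labels used in the example are illustrative, not benchmark data:

```python
def per_class_f1(y_true, y_pred, label):
    """F1 score for one genre label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-label F1 scores."""
    return sum(per_class_f1(y_true, y_pred, l) for l in labels) / len(labels)

def micro_f1(y_true, y_pred):
    """Globally pooled F1; for single-label classification this equals accuracy."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Note that micro F1 and accuracy coincide for single-label classifiers, which is why those two columns match for most rows of the table.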

## Intended use and limitations

+### Usage

An example of preparing data for genre identification and post-processing of the results can be found [here](https://github.com/TajaKuzman/Applying-GENRE-on-MaCoCu-bilingual) where we applied X-GENRE classifier to the English part of [MaCoCu](https://macocu.eu/) parallel corpora.
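A common post-processing step is to keep a genre prediction only when the classifier's confidence is high. The sketch below is purely illustrative: the 0.9 threshold and the fallback to the "Other" label are assumptions, not the linked repository's exact procedure:

```python
def filter_predictions(texts, labels, probs, threshold=0.9):
    """Keep the predicted genre only when the classifier is confident;
    low-confidence predictions fall back to "Other" (illustrative choice)."""
    kept = []
    for text, label, prob in zip(texts, labels, probs):
        kept.append((text, label if prob >= threshold else "Other", prob))
    return kept
```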

@@ -236,7 +254,18 @@

In cross-dataset and cross-lingual experiments, it was shown that the X-GENRE classifier, trained on all three datasets, outperforms classifiers that were trained on just one of the datasets.

+### Fine-tuning hyperparameters
+
+Fine-tuning was performed with `simpletransformers`. Beforehand, a brief hyperparameter optimization was performed, and the presumed optimal hyperparameters are:
+
+```python
+model_args = {
+    "num_train_epochs": 15,
+    "learning_rate": 1e-5,
+    "max_seq_length": 512,
+}
+```
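Such a brief hyperparameter optimization can be sketched as a small grid search. Everything in this snippet is illustrative: the candidate values are assumptions, and the `evaluate` callback stands in for a real fine-tune-and-validate run on held-out data:

```python
from itertools import product

def grid_search(evaluate, grid):
    """Try every combination of candidate values and return the best one.

    `evaluate` maps an args dict to a validation score (higher is better).
    """
    best_args, best_score = None, float("-inf")
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        args = dict(zip(keys, values))
        score = evaluate(args)
        if score > best_score:
            best_args, best_score = args, score
    return best_args, best_score

# Illustrative candidate values, not the actual search space used.
grid = {"num_train_epochs": [5, 10, 15], "learning_rate": [1e-5, 2e-5]}
```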

If you use the model, please cite the paper which describes the creation of the X-GENRE dataset and the genre classifier: