TajaKuzman committed
Commit ac6c965
1 Parent(s): 07f72cb

Update README.md

Files changed (1):
README.md +44 -15
README.md CHANGED
@@ -114,28 +114,46 @@ widget:
 
# X-GENRE classifier - multilingual text genre classifier

- Text classification model based on [`xlm-roberta-base`](https://huggingface.co/xlm-roberta-base) and fine-tuned on a combination of three genre datasets: Slovene GINCO<sup>1</sup> dataset, the English CORE<sup>2</sup> dataset and the English FTD<sup>3</sup> dataset. The model can be used for automatic genre identification, applied to any text in a language, supported by the `xlm-roberta-base`.

- ## Model description

- The model was fine-tuned on the "X-GENRE" dataset which consists of three genre datasets: CORE, FTD and GINCO dataset. Each of the datasets has their own genre schema, so they were combined into a joint schema ("X-GENRE" schema) based on the comparison of labels and cross-dataset experiments (described in details [here](https://github.com/TajaKuzman/Genre-Datasets-Comparison/tree/main/Creation-of-classifiers-and-cross-prediction#joint-schema-x-genre)).

- ### Fine-tuning hyperparameters

- Fine-tuning was performed with `simpletransformers`. Beforehand, a brief hyperparameter optimization was performed and the presumed optimal hyperparameters are:

- ```python
- model_args= {
- "num_train_epochs": 15,
- "learning_rate": 1e-5,
- "max_seq_length": 512,
- }
-
- ```

## Intended use and limitations

- ## Usage

An example of preparing data for genre identification and post-processing of the results can be found [here](https://github.com/TajaKuzman/Applying-GENRE-on-MaCoCu-bilingual) where we applied X-GENRE classifier to the English part of [MaCoCu](https://macocu.eu/) parallel corpora.
 
@@ -236,7 +254,18 @@ The classifier was compared with other classifiers on 2 additional genre dataset
 
At cross-dataset and cross-lingual experiments, it was shown that the X-GENRE classifier, trained on all three datasets, outperforms classifiers that were trained on just one of the datasets.

- ## Citation

If you use the model, please cite the paper which describes creation of the X-GENRE dataset and the genre classifier:

# X-GENRE classifier - multilingual text genre classifier

+ Text classification model based on [`xlm-roberta-base`](https://huggingface.co/xlm-roberta-base) and fine-tuned on a combination of three genre datasets: the Slovene [GINCO](http://hdl.handle.net/11356/1467) dataset (Kuzman et al., 2022), the English [CORE](https://github.com/TurkuNLP/CORE-corpus) dataset (Egbert et al., 2015) and the English [FTD](https://github.com/ssharoff/genre-keras) dataset (Sharoff, 2018). The model can be used for automatic genre identification, applied to any text in a language supported by `xlm-roberta-base`. Details on the model development, the datasets and the model's in-dataset, cross-dataset and multilingual performance are provided in the paper [Automatic Genre Identification for Robust Enrichment of Massive Text Collections: Investigation of Classification Methods in the Era of Large Language Models](https://www.mdpi.com/2504-4990/5/3/59) (Kuzman et al., 2023).

+ If you use the model, please cite the paper which describes creation of the X-GENRE dataset and the genre classifier:

+ ```
+ @article{kuzman2023automatic,
+   title={Automatic Genre Identification for Robust Enrichment of Massive Text Collections: Investigation of Classification Methods in the Era of Large Language Models},
+   author={Kuzman, Taja and Mozeti{\v{c}}, Igor and Ljube{\v{s}}i{\'c}, Nikola},
+   journal={Machine Learning and Knowledge Extraction},
+   volume={5},
+   number={3},
+   pages={1149--1175},
+   year={2023},
+   publisher={MDPI}
+ }
+ ```

+ ## AGILE - Automatic Genre Identification Benchmark

+ We set up a benchmark for evaluating the robustness of automatic genre identification models, to test their usability for the automatic enrichment of large text collections with genre information. You are welcome to request the test dataset and submit your entry at the [benchmark's GitHub repository](https://github.com/TajaKuzman/AGILE-Automatic-Genre-Identification-Benchmark/tree/main).
+
+ In an out-of-dataset scenario (evaluating a model on a manually-annotated English dataset on which it was not trained), the model outperforms all other technologies:
+
+ | Model | micro F1 | macro F1 | accuracy |
+ |:----------------------------|-----------:|-----------:|-----------:|
+ | **XLM-RoBERTa, fine-tuned on the X-GENRE dataset - X-GENRE classifier** (Kuzman et al. 2023) | 0.68 | 0.69 | 0.68 |
+ | GPT-4 (7/7/2023) (Kuzman et al. 2023) | 0.65 | 0.55 | 0.65 |
+ | GPT-3.5-turbo (Kuzman et al. 2023) | 0.63 | 0.53 | 0.63 |
+ | SVM (Kuzman et al. 2023) | 0.49 | 0.51 | 0.49 |
+ | Logistic Regression (Kuzman et al. 2023) | 0.49 | 0.47 | 0.49 |
+ | FastText (Kuzman et al. 2023) | 0.45 | 0.41 | 0.45 |
+ | Naive Bayes (Kuzman et al. 2023) | 0.36 | 0.29 | 0.36 |
+ | mt0 | 0.32 | 0.23 | 0.27 |
+ | Zero-shot classification with `MoritzLaurer/mDeBERTa-v3-base-mnli-xnli` @ HuggingFace | 0.20 | 0.15 | 0.20 |
+ | Dummy Classifier (stratified) (Kuzman et al. 2023) | 0.14 | 0.10 | 0.14 |

## Intended use and limitations

+ ### Usage

An example of preparing data for genre identification and post-processing of the results can be found [here](https://github.com/TajaKuzman/Applying-GENRE-on-MaCoCu-bilingual) where we applied X-GENRE classifier to the English part of [MaCoCu](https://macocu.eu/) parallel corpora.
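For a quick start, the snippet below is a minimal inference sketch using the Hugging Face `transformers` pipeline. It is not taken from the model card or the linked repository; the model ID and the example texts are assumed placeholders.

```python
# Minimal inference sketch (illustrative only); MODEL_ID below is an assumed
# placeholder -- replace it with the actual Hugging Face ID of this classifier.
from transformers import pipeline

MODEL_ID = "classla/xlm-roberta-base-multilingual-text-genre-classifier"  # assumption

classifier = pipeline("text-classification", model=MODEL_ID)

texts = [
    "Mix the flour and water, knead well and let the dough rest for an hour.",
    "The company announced record quarterly profits on Tuesday.",
]

# Truncate inputs to the model's 512-token limit so long documents do not fail.
for text, pred in zip(texts, classifier(texts, truncation=True, max_length=512)):
    # Each prediction holds the predicted genre label and a confidence score.
    print(f"{pred['label']}\t{pred['score']:.2f}\t{text[:50]}")
```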
 
At cross-dataset and cross-lingual experiments, it was shown that the X-GENRE classifier, trained on all three datasets, outperforms classifiers that were trained on just one of the datasets.

+ ### Fine-tuning hyperparameters
+
+ Fine-tuning was performed with `simpletransformers`. Beforehand, a brief hyperparameter optimization was performed and the presumed optimal hyperparameters are:
+
+ ```python
+ model_args = {
+     "num_train_epochs": 15,
+     "learning_rate": 1e-5,
+     "max_seq_length": 512,
+ }
+ ```
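As a rough illustration of how these hyperparameters might be plugged into a `simpletransformers` run: this is an editor's sketch, not the authors' training script, and the training DataFrame, the label count and the `use_cuda` setting are assumptions.

```python
# Hypothetical fine-tuning sketch with simpletransformers; only model_args
# reflects the hyperparameters above, everything else is assumed for illustration.
import pandas as pd
from simpletransformers.classification import ClassificationModel

model_args = {
    "num_train_epochs": 15,
    "learning_rate": 1e-5,
    "max_seq_length": 512,
}

# simpletransformers expects a DataFrame with "text" and "labels" columns,
# where "labels" holds integer class IDs (toy examples shown here).
train_df = pd.DataFrame(
    {
        "text": ["A short news report ...", "A step-by-step recipe ..."],
        "labels": [0, 1],
    }
)

model = ClassificationModel(
    "xlmroberta",          # model type
    "xlm-roberta-base",    # base checkpoint named in the model card
    num_labels=9,          # assumed size of the joint X-GENRE label set
    args=model_args,
    use_cuda=False,        # set to True when a GPU is available
)
model.train_model(train_df)
```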
 
If you use the model, please cite the paper which describes creation of the X-GENRE dataset and the genre classifier: