Commit 6be9a30 by TajaKuzman (parent: 9b8116e): Update README.md
# X-GENRE classifier - multilingual text genre classifier

Text classification model based on [`xlm-roberta-base`](https://huggingface.co/xlm-roberta-base) and fine-tuned on a [multilingual, manually-annotated X-GENRE genre dataset](https://huggingface.co/datasets/TajaKuzman/X-GENRE-multilingual-text-genre-dataset). The model can be used for automatic genre identification of any text in a language supported by `xlm-roberta-base`.

The details on the model development, the datasets, and the model's in-dataset, cross-dataset and multilingual performance are provided in the paper [Automatic Genre Identification for Robust Enrichment of Massive Text Collections: Investigation of Classification Methods in the Era of Large Language Models](https://www.mdpi.com/2504-4990/5/3/59) (Kuzman et al., 2023).
## AGILE - Automatic Genre Identification Benchmark

We set up a benchmark for evaluating the robustness of automatic genre identification models, to test their usability for the automatic enrichment of large text collections with genre information. You are welcome to submit your entry at the [benchmark's GitHub repository](https://github.com/TajaKuzman/AGILE-Automatic-Genre-Identification-Benchmark/tree/main).

In an out-of-dataset scenario (evaluating a model on the [manually-annotated English EN-GINCO dataset](https://huggingface.co/datasets/TajaKuzman/X-GENRE-multilingual-text-genre-dataset), on which it was not trained), the model outperforms all other technologies:

|                             |   micro F1 |   macro F1 |   accuracy |
|:----------------------------|-----------:|-----------:|-----------:|
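To make the micro F1, macro F1 and accuracy columns concrete: for single-label classification, micro F1 equals plain accuracy, while macro F1 averages per-class F1 scores, so rare genres count as much as frequent ones. A minimal self-contained illustration (toy labels, not benchmark data):

```python
def per_class_f1(y_true, y_pred, label):
    """F1 score for a single class, computed from scratch."""
    tp = sum(t == p == label for t, p in zip(y_true, y_pred))
    fp = sum(p == label and t != label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy, imbalanced example -- NOT data from the benchmark tables.
y_true = ['News', 'News', 'News', 'News', 'Legal']
y_pred = ['News', 'News', 'News', 'News', 'News']

labels = sorted(set(y_true))
macro_f1 = sum(per_class_f1(y_true, y_pred, l) for l in labels) / len(labels)
micro_f1 = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)  # == accuracy
print(micro_f1, macro_f1)  # 0.8 vs ~0.444: macro penalizes the missed 'Legal' class
```

This is why macro F1 is the stricter of the two scores on genre datasets, where label distributions are typically skewed.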
## X-GENRE categories

### List of labels

```
labels_list=['Other', 'Information/Explanation', 'News', 'Instruction', 'Opinion/Argumentation', 'Forum', 'Prose/Lyrical', 'Legal', 'Promotion'],
```
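The accompanying label-to-id mapping (`labels_map`) assigns integer ids in the order of `labels_list` (`'Other': 0`, `'Information/Explanation': 1`, `'News': 2`, ...), so it can be rebuilt from the list itself; a minimal sketch, assuming ids follow the list order throughout:

```python
labels_list = ['Other', 'Information/Explanation', 'News', 'Instruction',
               'Opinion/Argumentation', 'Forum', 'Prose/Lyrical', 'Legal', 'Promotion']

# Rebuild the label <-> id mappings, assuming ids follow the list order.
labels_map = {label: i for i, label in enumerate(labels_list)}
id2label = {i: label for i, label in enumerate(labels_list)}

print(labels_map['News'])  # 2
print(id2label[8])         # Promotion
```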
### Description of labels

| Label | Description | Examples |
|-------|-------------|----------|
### Comparison with other models at in-dataset and cross-dataset experiments

The X-GENRE model was compared with `xlm-roberta-base` classifiers fine-tuned on each of the genre datasets separately, using the X-GENRE schema (see the experiments in https://github.com/TajaKuzman/Genre-Datasets-Comparison).

In the in-dataset experiments (trained and tested on splits of the same dataset), it outperforms the dataset-specific classifiers on all datasets except FTD, which has a smaller number of X-GENRE labels.

| Trained on   |   Micro F1 |   Macro F1 |
|:-------------|-----------:|-----------:|
| X-GENRE      | GINCO       |      0.749 |      0.758 |

The classifier was compared with other classifiers on 2 additional genre datasets (to which the X-GENRE schema was mapped):
- [EN-GINCO](https://huggingface.co/datasets/TajaKuzman/X-GENRE-multilingual-text-genre-dataset): a sample of the English enTenTen20 corpus
- [FinCORE](https://github.com/TurkuNLP/FinCORE): the Finnish CORE corpus
| Trained on   | Tested on   |   Micro F1 |   Macro F1 |
|:-------------|:------------|-----------:|-----------:|
| FTD          | EN-GINCO    |      0.574 |      0.475 |
| CORE         | EN-GINCO    |      0.485 |      0.422 |
The cross-dataset and cross-lingual experiments showed that the X-GENRE classifier, trained on all three datasets, outperforms classifiers trained on just one of the datasets.
### Fine-tuning hyperparameters

Fine-tuning was performed with `simpletransformers`. Beforehand, a brief hyperparameter optimization was performed, and the presumed optimal hyperparameters are:

```python
model_args= {
```
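These arguments are passed to `simpletransformers`' `ClassificationModel` when fine-tuning. A rough, non-runnable sketch of that setup under the settings above (the base checkpoint name and the `train_df` variable are illustrative assumptions, not part of this repository):

```python
# Sketch only: requires simpletransformers, a GPU and the training data.
from simpletransformers.classification import ClassificationModel

model = ClassificationModel(
    "xlmroberta",            # model type matching xlm-roberta-base
    "xlm-roberta-base",      # base checkpoint (assumed starting point)
    num_labels=9,            # the nine X-GENRE labels
    args=model_args,         # the hyperparameters listed above
)
model.train_model(train_df)  # train_df: pandas DataFrame with "text" and "labels" columns
```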
## Citation

If you use the model, please cite the paper which describes the creation of the [X-GENRE dataset](https://huggingface.co/datasets/TajaKuzman/X-GENRE-multilingual-text-genre-dataset) and the genre classifier:

```
@article{kuzman2023automatic,
```