---
license: cc-by-sa-4.0
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- 'no'
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
tags:
- text-classification
- genre
- text-genre
widget:
- text: >-
On our site, you can find a great genre identification model which you can
use for thousands of different tasks. For free!
example_title: English
- text: >-
Na naši spletni strani lahko najdete odličen model za prepoznavanje žanrov,
ki ga lahko uporabite pri na tisoče različnih nalogah. In to brezplačno!
example_title: Slovene
- text: >-
Sur notre site, vous trouverez un modèle d'identification de genre très
intéressant que vous pourrez utiliser pour des milliers de tâches
différentes. C'est gratuit !
example_title: French
datasets:
- TajaKuzman/X-GENRE-text-genre-dataset
base_model:
- FacebookAI/xlm-roberta-base
---
# X-GENRE classifier - multilingual text genre classifier
Text classification model based on [`xlm-roberta-base`](https://huggingface.co/xlm-roberta-base)
and fine-tuned on a [multilingual manually-annotated X-GENRE genre dataset](https://huggingface.co/datasets/TajaKuzman/X-GENRE-text-genre-dataset).
The model can be used for automatic genre identification on any text in a language supported by `xlm-roberta-base`.
Details on the model development, the datasets, and the model's in-dataset, cross-dataset and multilingual performance are provided in the paper [Automatic Genre Identification for Robust Enrichment of Massive Text Collections: Investigation of Classification Methods in the Era of Large Language Models](https://www.mdpi.com/2504-4990/5/3/59) (Kuzman et al., 2023).
The model can also be downloaded from the [CLARIN.SI repository](http://hdl.handle.net/11356/1961).
If you use the model, please cite the paper:
```
@article{kuzman2023automatic,
title={Automatic Genre Identification for Robust Enrichment of Massive Text Collections: Investigation of Classification Methods in the Era of Large Language Models},
author={Kuzman, Taja and Mozeti{\v{c}}, Igor and Ljube{\v{s}}i{\'c}, Nikola},
journal={Machine Learning and Knowledge Extraction},
volume={5},
number={3},
pages={1149--1175},
year={2023},
publisher={MDPI}
}
```
## AGILE - Automatic Genre Identification Benchmark
We set up a benchmark for evaluating the robustness of automatic genre identification models, to test their usability
for the automatic enrichment of large text collections with genre information.
You are welcome to submit your entry at the [benchmark's GitHub repository](https://github.com/TajaKuzman/AGILE-Automatic-Genre-Identification-Benchmark/tree/main).
In an out-of-dataset scenario (evaluation on the manually-annotated English EN-GINCO dataset, available upon request, on which the model was not trained),
the model outperforms all other technologies:
| | micro F1 | macro F1 | accuracy |
|:----------------------------|-----------:|-----------:|-----------:|
| **XLM-RoBERTa, fine-tuned on the X-GENRE dataset - X-GENRE classifier** (Kuzman et al. 2023) | 0.68 | 0.69 | 0.68 |
| GPT-4 (7/7/2023) (Kuzman et al. 2023) | 0.65 | 0.55 | 0.65 |
| GPT-3.5-turbo (Kuzman et al. 2023) | 0.63 | 0.53 | 0.63 |
| SVM (Kuzman et al. 2023) | 0.49 | 0.51 | 0.49 |
| Logistic Regression (Kuzman et al. 2023) | 0.49 | 0.47 | 0.49 |
| FastText (Kuzman et al. 2023) | 0.45 | 0.41 | 0.45 |
| Naive Bayes (Kuzman et al. 2023) | 0.36 | 0.29 | 0.36 |
| mt0 | 0.32 | 0.23 | 0.27 |
| Zero-Shot classification with `MoritzLaurer/mDeBERTa-v3-base-mnli-xnli` @ HuggingFace | 0.2 | 0.15 | 0.2 |
| Dummy Classifier (stratified) (Kuzman et al. 2023)| 0.14 | 0.1 | 0.14 |
## Intended use and limitations
### Usage
An example of preparing data for genre identification and post-processing of the results can be found [here](https://github.com/TajaKuzman/Applying-GENRE-on-MaCoCu-bilingual), where we applied the X-GENRE classifier to the English part of the [MaCoCu](https://macocu.eu/) parallel corpora.
For reliable results, the genre classifier should be applied to documents of sufficient length (the rule of thumb is at least 75 words).
It is advised that predictions with a confidence lower than 0.9 are not used. Furthermore, the label "Other" can serve as another indicator of low confidence, as it often means that the text does not have enough features of any genre; these predictions can be discarded as well (see the post-processing sketch after the usage example below).
After the proposed post-processing (removal of low-confidence predictions and of the labels "Other" and, in this specific case, "Forum"), the performance on the MaCoCu data, based on manual inspection, reached macro and micro F1 of 0.92.
### Use examples
```python
from simpletransformers.classification import ClassificationModel
model_args= {
"num_train_epochs": 15,
"learning_rate": 1e-5,
"max_seq_length": 512,
"silent": True
}
model = ClassificationModel(
"xlmroberta", "classla/xlm-roberta-base-multilingual-text-genre-classifier", use_cuda=True,
args=model_args
)
predictions, logit_output = model.predict(["How to create a good text classification model? First step is to prepare good data. Make sure not to skip the exploratory data analysis. Pre-process the text if necessary for the task. The next step is to perform hyperparameter search to find the optimum hyperparameters. After fine-tuning the model, you should look into the predictions and analyze the model's performance. You might want to perform the post-processing of data as well and keep only reliable predictions.",
"On our site, you can find a great genre identification model which you can use for thousands of different tasks. With our model, you can fastly and reliably obtain high-quality genre predictions and explore which genres exist in your corpora. Available for free!"]
)
predictions
# Output: array([3, 8])
[model.config.id2label[i] for i in predictions]
# Output: ['Instruction', 'Promotion']
```
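Building on the example above, the following is a minimal sketch (not part of the original usage example) of the post-processing advice from the previous subsection: it converts the raw logits returned by `model.predict` into softmax confidences and keeps only predictions that are confident enough and not labelled "Other". The 0.9 threshold is the rule of thumb mentioned above.
```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the label dimension.
    exps = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return exps / exps.sum(axis=-1, keepdims=True)

# `predictions` and `logit_output` come from the model.predict call above.
confidences = softmax(np.asarray(logit_output)).max(axis=-1)
labels = [model.config.id2label[i] for i in predictions]

CONFIDENCE_THRESHOLD = 0.9  # rule of thumb from the post-processing advice above
kept = [
    (label, float(conf))
    for label, conf in zip(labels, confidences)
    if conf >= CONFIDENCE_THRESHOLD and label != "Other"
]
print(kept)
```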
A usage example for prediction on a dataset, using batch processing, is available on [Google Colab](https://colab.research.google.com/drive/1yC4L_p2t3oMViC37GqSjJynQH-EWyhLr?usp=sharing).
## X-GENRE categories
### List of labels
```
labels_list=['Other', 'Information/Explanation', 'News', 'Instruction', 'Opinion/Argumentation', 'Forum', 'Prose/Lyrical', 'Legal', 'Promotion'],
labels_map={'Other': 0, 'Information/Explanation': 1, 'News': 2, 'Instruction': 3, 'Opinion/Argumentation': 4, 'Forum': 5, 'Prose/Lyrical': 6, 'Legal': 7, 'Promotion': 8}
```
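If you load the model with the Hugging Face `transformers` library instead of `simpletransformers`, these labels should be returned directly as strings. The snippet below is a sketch under that assumption, not an official example from this card; the expected output is indicative only.
```python
from transformers import pipeline

# Load the classifier through the standard text-classification pipeline.
classifier = pipeline(
    "text-classification",
    model="classla/xlm-roberta-base-multilingual-text-genre-classifier",
)

text = "On our site, you can find a great genre identification model. Available for free!"
# Truncate long documents to the 512-token limit of xlm-roberta-base.
print(classifier(text, truncation=True, max_length=512))
# Expected output (approximately): [{'label': 'Promotion', 'score': ...}]
```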
### Description of labels
| Label | Description | Examples |
|-------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Information/Explanation | An objective text that describes or presents an event, a person, a thing, a concept etc. Its main purpose is to inform the reader about something. Common features: objective/factual, explanation/definition of a concept (x is …), enumeration. | research article, encyclopedia article, informational blog, product specification, course materials, general information, job description, manual, horoscope, travel guide, glossaries, historical article, biographical story/history. |
| Instruction | An objective text which instructs the readers on how to do something. Common features: multiple steps/actions, chronological order, 1st person plural or 2nd person, modality (must, have to, need to, can, etc.), adverbial clauses of manner (in a way that), of condition (if), of time (after …). | how-to texts, recipes, technical support |
| Legal | An objective formal text that contains legal terms and is clearly structured. The name of the text type is often included in the headline (contract, rules, amendment, general terms and conditions, etc.). Common features: objective/factual, legal terms, 3rd person. | small print, software license, proclamation, terms and conditions, contracts, law, copyright notices, university regulation |
| News | An objective or subjective text which reports on an event recent at the time of writing or coming in the near future. Common features: adverbs/adverbial clauses of time and/or place (dates, places), many proper nouns, direct or reported speech, past tense. | news report, sports report, travel blog, reportage, police report, announcement |
| Opinion/Argumentation | A subjective text in which the authors convey their opinion or narrate their experience. It includes promotion of an ideology and other non-commercial causes. This genre includes a subjective narration of a personal experience as well. Common features: adjectives/adverbs that convey opinion, words that convey (un)certainty (certainly, surely), 1st person, exclamation marks. | review, blog (personal blog, travel blog), editorial, advice, letter to editor, persuasive article or essay, formal speech, pamphlet, political propaganda, columns, political manifesto |
| Promotion | A subjective text intended to sell or promote an event, product, or service. It addresses the readers, often trying to convince them to participate in something or buy something. Common features: contains adjectives/adverbs that promote something (high-quality, perfect, amazing), comparative and superlative forms of adjectives and adverbs (the best, the greatest, the cheapest), addressing the reader (usage of 2nd person), exclamation marks. | advertisement, promotion of a product (e-shops), promotion of an accommodation, promotion of company's services, invitation to an event |
| Forum | A text in which people discuss a certain topic in form of comments. Common features: multiple authors, informal language, subjective (the writers express their opinions), written in 1st person. | discussion forum, reader/viewer responses, QA forum |
| Prose/Lyrical | A literary text that consists of paragraphs or verses. A literary text is deemed to have no other practical purpose than to give pleasure to the reader. Often the author pays attention to the aesthetic appearance of the text. It can be considered as art. | lyrics, poem, prayer, joke, novel, short story |
| Other                    | A text which does not fall under any of the other genre categories. | |
## Performance
### Comparison with other models at in-dataset and cross-dataset experiments
The X-GENRE model was compared with `xlm-roberta-base` classifiers fine-tuned on each of the genre datasets separately,
using the X-GENRE schema (see the experiments at https://github.com/TajaKuzman/Genre-Datasets-Comparison).
In the in-dataset experiments (trained and tested on splits of the same dataset),
it outperforms the classifiers trained on all other datasets, except the one trained on the FTD dataset, which covers a smaller number of X-GENRE labels.
| Trained on | Micro F1 | Macro F1 |
|:-------------|-----------:|-----------:|
| FTD | 0.843 | 0.851 |
| X-GENRE | 0.797 | 0.794 |
| CORE | 0.778 | 0.627 |
| GINCO | 0.754 | 0.75 |
When applied to the test splits of each of the datasets, the classifier performs well:
| Trained on | Tested on | Micro F1 | Macro F1 |
|:-------------|:------------|-----------:|-----------:|
| X-GENRE | CORE | 0.837 | 0.859 |
| X-GENRE | FTD | 0.804 | 0.809 |
| X-GENRE | X-GENRE | 0.797 | 0.794 |
| X-GENRE | X-GENRE-dev | 0.784 | 0.784 |
| X-GENRE | GINCO | 0.749 | 0.758 |
The classifier was compared with the other classifiers on two additional genre datasets (to which the X-GENRE schema was mapped):
- EN-GINCO (available upon request): a sample of the English enTenTen20 corpus
- [FinCORE](https://github.com/TurkuNLP/FinCORE): Finnish CORE corpus
| Trained on | Tested on | Micro F1 | Macro F1 |
|:-------------|:------------|-----------:|-----------:|
| X-GENRE | EN-GINCO | 0.688 | 0.691 |
| X-GENRE | FinCORE | 0.674 | 0.581 |
| GINCO | EN-GINCO | 0.632 | 0.502 |
| FTD | EN-GINCO | 0.574 | 0.475 |
| CORE | EN-GINCO | 0.485 | 0.422 |
The cross-dataset and cross-lingual experiments showed that the X-GENRE classifier,
trained on all three datasets, outperforms classifiers trained on just one of the datasets.
### Fine-tuning hyperparameters
Fine-tuning was performed with `simpletransformers`.
Beforehand, a brief hyperparameter optimization was performed, and the presumed optimal hyperparameters are:
```python
model_args= {
"num_train_epochs": 15,
"learning_rate": 1e-5,
"max_seq_length": 512,
}
```
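For reference, here is a minimal sketch of how fine-tuning with `simpletransformers` and these hyperparameters might look. The toy `train_df` below is purely hypothetical; in practice you would use the X-GENRE training split, with labels encoded according to the `labels_map` above.
```python
import pandas as pd
from simpletransformers.classification import ClassificationModel

# Hypothetical toy training data; replace with the X-GENRE training split.
train_df = pd.DataFrame({
    "text": ["First step is to prepare good data.", "Available for free!"],
    "labels": [3, 8],  # 'Instruction', 'Promotion' according to labels_map above
})

model_args = {
    "num_train_epochs": 15,
    "learning_rate": 1e-5,
    "max_seq_length": 512,
}

model = ClassificationModel(
    "xlmroberta",
    "xlm-roberta-base",   # base model that was fine-tuned
    num_labels=9,         # the nine X-GENRE labels
    use_cuda=True,
    args=model_args,
)
model.train_model(train_df)
```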
## Citation
If you use the model, please cite the paper which describes creation of the [X-GENRE dataset](https://huggingface.co/datasets/TajaKuzman/X-GENRE-text-genre-dataset) and the genre classifier:
```
@article{kuzman2023automatic,
title={Automatic Genre Identification for Robust Enrichment of Massive Text Collections: Investigation of Classification Methods in the Era of Large Language Models},
author={Kuzman, Taja and Mozeti{\v{c}}, Igor and Ljube{\v{s}}i{\'c}, Nikola},
journal={Machine Learning and Knowledge Extraction},
volume={5},
number={3},
pages={1149--1175},
year={2023},
publisher={MDPI}
}
``` |