Text Classification
Transformers
PyTorch
Safetensors
xlm-roberta
genre
text-genre
Inference Endpoints
TajaKuzman committed on
Commit
6be9a30
1 Parent(s): 9b8116e

Update README.md

Files changed (1)
  1. README.md +21 -11
README.md CHANGED
@@ -114,7 +114,9 @@ widget:
 
 # X-GENRE classifier - multilingual text genre classifier
 
- Text classification model based on [`xlm-roberta-base`](https://huggingface.co/xlm-roberta-base) and fine-tuned on a combination of three genre datasets: the Slovene [GINCO](http://hdl.handle.net/11356/1467) dataset (Kuzman et al., 2022), the English [CORE](https://github.com/TurkuNLP/CORE-corpus) dataset (Egbert et al., 2015) and the English [FTD](https://github.com/ssharoff/genre-keras) dataset (Sharoff, 2018). The model can be used for automatic genre identification, applied to any text in a language supported by `xlm-roberta-base`.
 
 The details on the model development, the datasets and the model's in-dataset, cross-dataset and multilingual performance are provided in the paper [Automatic Genre Identification for Robust Enrichment of Massive Text Collections: Investigation of Classification Methods in the Era of Large Language Models](https://www.mdpi.com/2504-4990/5/3/59) (Kuzman et al., 2023).
@@ -135,9 +137,12 @@ If you use the model, please cite the paper:
 
 ## AGILE - Automatic Genre Identification Benchmark
 
- We set up a benchmark for evaluating the robustness of automatic genre identification models, to test their usability for the automatic enrichment of large text collections with genre information. You are welcome to request the test dataset and submit your entry at the [benchmark's GitHub repository](https://github.com/TajaKuzman/AGILE-Automatic-Genre-Identification-Benchmark/tree/main).
 
- In an out-of-dataset scenario (evaluating a model on a manually-annotated English dataset on which it was not trained), the model outperforms all other technologies:
 
 |                             | micro F1 | macro F1 | accuracy |
 |:----------------------------|-----------:|-----------:|-----------:|
@@ -194,7 +199,8 @@ Use example for prediction on a dataset, using batch processing, is available vi
 
 ## X-GENRE categories
 
- List of labels:
 
 ```
 labels_list=['Other', 'Information/Explanation', 'News', 'Instruction', 'Opinion/Argumentation', 'Forum', 'Prose/Lyrical', 'Legal', 'Promotion'],
 
@@ -202,7 +208,7 @@ labels_map={'Other': 0, 'Information/Explanation': 1, 'News': 2, 'Instruction':
 
 ```
 
- Description of labels:
 
 | Label | Description | Examples |
 |-------------------------|-------------|----------|
@@ -221,9 +227,11 @@ Description of labels:
 
 ### Comparison with other models at in-dataset and cross-dataset experiments
 
- The X-GENRE model was compared with `xlm-roberta-base` classifiers, fine-tuned on each of the genre datasets separately, using the X-GENRE schema (see experiments in https://github.com/TajaKuzman/Genre-Datasets-Comparison).
 
- In the in-dataset experiments (trained and tested on splits of the same dataset), it outperforms the dataset-specific classifiers on all datasets, except on the FTD dataset, which has a smaller number of X-GENRE labels.
 
 | Trained on | Micro F1 | Macro F1 |
 |:-------------|-----------:|-----------:|
@@ -243,7 +251,7 @@ When applied on test splits of each of the datasets, the classifier performs wel
 | X-GENRE | GINCO | 0.749 | 0.758 |
 
 The classifier was compared with other classifiers on two additional genre datasets (to which the X-GENRE schema was mapped):
- - EN-GINCO: a sample of the English enTenTen20 corpus
 - [FinCORE](https://github.com/TurkuNLP/FinCORE): Finnish CORE corpus
 
 | Trained on | Tested on | Micro F1 | Macro F1 |
@@ -254,11 +262,13 @@ The classifier was compared with other classifiers on 2 additional genre dataset
 | FTD | EN-GINCO | 0.574 | 0.475 |
 | CORE | EN-GINCO | 0.485 | 0.422 |
 
- Cross-dataset and cross-lingual experiments showed that the X-GENRE classifier, trained on all three datasets, outperforms classifiers that were trained on just one of the datasets.
 
 ### Fine-tuning hyperparameters
 
- Fine-tuning was performed with `simpletransformers`. Beforehand, a brief hyperparameter optimization was performed, and the presumed optimal hyperparameters are:
 
 ```python
 model_args= {
@@ -271,7 +281,7 @@ model_args= {
 
 ## Citation
 
- If you use the model, please cite the paper which describes the creation of the X-GENRE dataset and the genre classifier:
 
 ```
 @article{kuzman2023automatic,
 
 # X-GENRE classifier - multilingual text genre classifier
 
+ Text classification model based on [`xlm-roberta-base`](https://huggingface.co/xlm-roberta-base)
+ and fine-tuned on a [multilingual manually-annotated X-GENRE genre dataset](https://huggingface.co/datasets/TajaKuzman/X-GENRE-multilingual-text-genre-dataset).
+ The model can be used for automatic genre identification, applied to any text in a language supported by `xlm-roberta-base`.
 
 The details on the model development, the datasets and the model's in-dataset, cross-dataset and multilingual performance are provided in the paper [Automatic Genre Identification for Robust Enrichment of Massive Text Collections: Investigation of Classification Methods in the Era of Large Language Models](https://www.mdpi.com/2504-4990/5/3/59) (Kuzman et al., 2023).
 
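As a sketch of how such a classifier is typically called: the helper below only assumes a Hugging Face-style text-classification callable (one `{"label": ..., "score": ...}` dict per input). The commented pipeline call and the model-ID placeholder are assumptions, not confirmed by this card.

```python
from typing import Callable, List

def predict_genres(texts: List[str], classifier: Callable) -> List[str]:
    """Return the top genre label for each text.

    `classifier` is expected to behave like a `transformers`
    text-classification pipeline: called on a list of strings, it
    returns one {"label": ..., "score": ...} dict per input text.
    """
    return [prediction["label"] for prediction in classifier(texts)]

# With `transformers` installed, a real classifier could be built roughly
# like this (placeholder model ID -- check this repository's name first):
# from transformers import pipeline
# classifier = pipeline("text-classification", model="<this-model-id>")
# predict_genres(["The court ruled that ..."], classifier)
```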
 
 ## AGILE - Automatic Genre Identification Benchmark
 
+ We set up a benchmark for evaluating the robustness of automatic genre identification models, to test their usability
+ for the automatic enrichment of large text collections with genre information.
+ You are welcome to submit your entry at the [benchmark's GitHub repository](https://github.com/TajaKuzman/AGILE-Automatic-Genre-Identification-Benchmark/tree/main).
 
+ In an out-of-dataset scenario (evaluating a model on the [manually-annotated English EN-GINCO dataset](https://huggingface.co/datasets/TajaKuzman/X-GENRE-multilingual-text-genre-dataset) on which it was not trained),
+ the model outperforms all other technologies:
 
 |                             | micro F1 | macro F1 | accuracy |
 |:----------------------------|-----------:|-----------:|-----------:|
 
 ## X-GENRE categories
 
+ ### List of labels
+
 ```
 labels_list=['Other', 'Information/Explanation', 'News', 'Instruction', 'Opinion/Argumentation', 'Forum', 'Prose/Lyrical', 'Legal', 'Promotion'],
 
 ```
 
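Since the visible entries of `labels_map` follow the order of `labels_list`, both directions of the mapping can be reconstructed with `enumerate`. A minimal sketch, assuming the ids of the labels after `Instruction` continue in list order (the snippet above truncates there):

```python
labels_list = ['Other', 'Information/Explanation', 'News', 'Instruction',
               'Opinion/Argumentation', 'Forum', 'Prose/Lyrical', 'Legal', 'Promotion']

# Forward map (label -> id) and inverse map (id -> label), assuming ids
# follow list order as in the visible part of labels_map ('Other': 0,
# 'Information/Explanation': 1, 'News': 2, ...).
labels_map = {label: i for i, label in enumerate(labels_list)}
id2label = {i: label for i, label in enumerate(labels_list)}
```

`id2label` is handy for turning integer model outputs back into genre names.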
+ ### Description of labels
 
 | Label | Description | Examples |
 |-------------------------|-------------|----------|
 
 
 ### Comparison with other models at in-dataset and cross-dataset experiments
 
+ The X-GENRE model was compared with `xlm-roberta-base` classifiers, fine-tuned on each of the genre datasets separately,
+ using the X-GENRE schema (see experiments in https://github.com/TajaKuzman/Genre-Datasets-Comparison).
 
+ In the in-dataset experiments (trained and tested on splits of the same dataset),
+ it outperforms the dataset-specific classifiers on all datasets, except on the FTD dataset, which has a smaller number of X-GENRE labels.
 
 | Trained on | Micro F1 | Macro F1 |
 |:-------------|-----------:|-----------:|
 
 | X-GENRE | GINCO | 0.749 | 0.758 |
 
 The classifier was compared with other classifiers on two additional genre datasets (to which the X-GENRE schema was mapped):
+ - [EN-GINCO](https://huggingface.co/datasets/TajaKuzman/X-GENRE-multilingual-text-genre-dataset): a sample of the English enTenTen20 corpus
 - [FinCORE](https://github.com/TurkuNLP/FinCORE): Finnish CORE corpus
 
 | Trained on | Tested on | Micro F1 | Macro F1 |
 
 | FTD | EN-GINCO | 0.574 | 0.475 |
 | CORE | EN-GINCO | 0.485 | 0.422 |
 
+ Cross-dataset and cross-lingual experiments showed that the X-GENRE classifier,
+ trained on all three datasets, outperforms classifiers that were trained on just one of the datasets.
 
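The micro F1 and macro F1 columns used throughout these tables weight errors differently: micro F1 aggregates all decisions (for single-label classification it equals plain accuracy, so frequent genres dominate), while macro F1 averages per-class F1 scores, giving every genre equal weight. A small self-contained illustration with made-up labels:

```python
def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multiclass predictions."""
    classes = set(y_true) | set(y_pred)
    per_class = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        per_class.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    # For single-label predictions, micro F1 reduces to accuracy.
    micro = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    macro = sum(per_class) / len(per_class)
    return micro, macro

y_true = ["News", "News", "Legal", "Forum"]
y_pred = ["News", "Forum", "Legal", "Forum"]
micro, macro = f1_scores(y_true, y_pred)  # micro = 0.75, macro = 7/9
```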
 ### Fine-tuning hyperparameters
 
+ Fine-tuning was performed with `simpletransformers`.
+ Beforehand, a brief hyperparameter optimization was performed, and the presumed optimal hyperparameters are:
 
 ```python
 model_args= {
 
 ## Citation
 
+ If you use the model, please cite the paper which describes the creation of the [X-GENRE dataset](https://huggingface.co/datasets/TajaKuzman/X-GENRE-multilingual-text-genre-dataset) and the genre classifier:
 
 ```
 @article{kuzman2023automatic,