Text Classification
Transformers
PyTorch
Safetensors
xlm-roberta
genre
text-genre
Inference Endpoints
TajaKuzman committed on
Commit 5ffe01f
1 Parent(s): 470fa16

Update README.md

Files changed (1)
  1. README.md +4 -3
README.md CHANGED
@@ -148,7 +148,7 @@ We set up a benchmark for evaluating robustness of automatic genre identificatio
 for the automatic enrichment of large text collections with genre information.
 You are welcome to submit your entry at the [benchmark's GitHub repository](https://github.com/TajaKuzman/AGILE-Automatic-Genre-Identification-Benchmark/tree/main).
 
- In an out-of-dataset scenario (evaluating a model on a [manually-annotated English EN-GINCO dataset](https://huggingface.co/datasets/TajaKuzman/X-GENRE-multilingual-text-genre-dataset) on which it was not trained),
+ In an out-of-dataset scenario (evaluating a model on a manually-annotated English EN-GINCO dataset (available upon request) on which it was not trained),
 the model outperforms all other technologies:
 
 | | micro F1 | macro F1 | accuracy |
@@ -171,7 +171,8 @@ In an out-of-dataset scenario (evaluating a model on a [manually-annotated Engli
 
 An example of preparing data for genre identification and post-processing of the results can be found [here](https://github.com/TajaKuzman/Applying-GENRE-on-MaCoCu-bilingual) where we applied X-GENRE classifier to the English part of [MaCoCu](https://macocu.eu/) parallel corpora.
 
- For reliable results, genre classifier should be applied to documents of sufficient length (the rule of thumb is at least 75 words). It is advised that the predictions, predicted with confidence lower than 0.9, are not used. Furthermore, the label "Other" can be used as another indicator of low confidence of the predictions, as it often indicates that the text does not have enough features of any genre, and these predictions can be discarded as well.
+ For reliable results, the genre classifier should be applied to documents of sufficient length (the rule of thumb is at least 75 words).
+ It is advised that only predictions made with confidence higher than 0.9 are used. Furthermore, the label "Other" can be used as another indicator of low confidence, as it often indicates that the text does not have enough features of any genre, and these predictions can be discarded as well.
 
 After proposed post-processing (removal of low-confidence predictions, labels "Other" and in this specific case also label "Forum"), the performance on the MaCoCu data based on manual inspection reached macro and micro F1 of 0.92.
 
@@ -258,7 +259,7 @@ When applied on test splits of each of the datasets, the classifier performs wel
 | X-GENRE | GINCO | 0.749 | 0.758 |
 
 The classifier was compared with other classifiers on 2 additional genre datasets (to which the X-GENRE schema was mapped):
- - [EN-GINCO](https://huggingface.co/datasets/TajaKuzman/X-GENRE-multilingual-text-genre-dataset): a sample of the English enTenTen20 corpus
+ - EN-GINCO (available upon request): a sample of the English enTenTen20 corpus
 - [FinCORE](https://github.com/TurkuNLP/FinCORE): Finnish CORE corpus
 
 | Trained on | Tested on | Micro F1 | Macro F1 |
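
The post-processing guidance in the second hunk above (classify only documents of at least 75 words, keep only predictions with confidence of at least 0.9, and treat the "Other" label as a low-confidence signal) can be expressed as a small filter. The following is a minimal sketch, assuming the classifier is loaded through the Hugging Face `transformers` text-classification pipeline; the model ID is a placeholder (replace it with this repository's ID), and the helper name `classify_documents` is illustrative, not part of the model card.

```python
# Minimal post-processing sketch based on the recommendations in the README diff above.
# Assumptions (not from the diff): the model is loaded via the `transformers`
# text-classification pipeline, and MODEL_ID is a placeholder for this repository's ID.
from transformers import pipeline

MODEL_ID = "<this-repository-id>"  # placeholder

classifier = pipeline("text-classification", model=MODEL_ID)

def classify_documents(documents, min_words=75, min_confidence=0.9):
    """Return one genre label per document, or None when the prediction
    should be discarded according to the README's guidance."""
    labels = []
    for doc in documents:
        # Rule of thumb: only classify documents of sufficient length (>= 75 words).
        if len(doc.split()) < min_words:
            labels.append(None)
            continue
        pred = classifier(doc, truncation=True)[0]  # {"label": ..., "score": ...}
        # Discard low-confidence predictions and the "Other" label,
        # which the README treats as another low-confidence signal.
        if pred["score"] < min_confidence or pred["label"] == "Other":
            labels.append(None)
        else:
            labels.append(pred["label"])
    return labels
```

Documents whose prediction comes back as None would then be left without a genre label (or excluded), mirroring the post-processing step described in the README text.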