Commit 6be9a30 by TajaKuzman (parent: 9b8116e): Update README.md
# X-GENRE classifier - multilingual text genre classifier

Text classification model based on [`xlm-roberta-base`](https://huggingface.co/xlm-roberta-base) and fine-tuned on a [multilingual, manually-annotated X-GENRE genre dataset](https://huggingface.co/datasets/TajaKuzman/X-GENRE-multilingual-text-genre-dataset). The model can be used for automatic genre identification of any text in a language supported by `xlm-roberta-base`.

The details on the model development, the datasets, and the model's in-dataset, cross-dataset and multilingual performance are provided in the paper [Automatic Genre Identification for Robust Enrichment of Massive Text Collections: Investigation of Classification Methods in the Era of Large Language Models](https://www.mdpi.com/2504-4990/5/3/59) (Kuzman et al., 2023).
## AGILE - Automatic Genre Identification Benchmark

We set up a benchmark for evaluating the robustness of automatic genre identification models, to test their usability for the automatic enrichment of large text collections with genre information. You are welcome to submit your entry at the [benchmark's GitHub repository](https://github.com/TajaKuzman/AGILE-Automatic-Genre-Identification-Benchmark/tree/main).

In an out-of-dataset scenario (evaluating a model on the [manually-annotated English EN-GINCO dataset](https://huggingface.co/datasets/TajaKuzman/X-GENRE-multilingual-text-genre-dataset), on which it was not trained), the model outperforms all other technologies:

|                             |   micro F1 |   macro F1 |   accuracy |
|:----------------------------|-----------:|-----------:|-----------:|
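To make the micro F1, macro F1 and accuracy columns concrete: for single-label classification, micro F1 equals plain accuracy, while macro F1 averages per-class F1 scores, so rare genres count as much as frequent ones. A minimal self-contained illustration (toy labels, not benchmark data):

```python
def per_class_f1(y_true, y_pred, label):
    """F1 score for a single class, computed from scratch."""
    tp = sum(t == p == label for t, p in zip(y_true, y_pred))
    fp = sum(p == label and t != label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy, imbalanced example -- NOT data from the benchmark tables.
y_true = ['News', 'News', 'News', 'News', 'Legal']
y_pred = ['News', 'News', 'News', 'News', 'News']

labels = sorted(set(y_true))
macro_f1 = sum(per_class_f1(y_true, y_pred, l) for l in labels) / len(labels)
micro_f1 = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)  # == accuracy
print(micro_f1, macro_f1)  # 0.8 vs ~0.444: macro penalizes the missed 'Legal' class
```

This is why macro F1 is the stricter of the two scores on genre datasets, where label distributions are typically skewed.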
## X-GENRE categories

### List of labels

```
labels_list=['Other', 'Information/Explanation', 'News', 'Instruction', 'Opinion/Argumentation', 'Forum', 'Prose/Lyrical', 'Legal', 'Promotion'],
```
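The accompanying label-to-id mapping (`labels_map`) assigns integer ids in the order of `labels_list` (`'Other': 0`, `'Information/Explanation': 1`, `'News': 2`, ...), so it can be rebuilt from the list itself; a minimal sketch, assuming ids follow the list order throughout:

```python
labels_list = ['Other', 'Information/Explanation', 'News', 'Instruction',
               'Opinion/Argumentation', 'Forum', 'Prose/Lyrical', 'Legal', 'Promotion']

# Rebuild the label <-> id mappings, assuming ids follow the list order.
labels_map = {label: i for i, label in enumerate(labels_list)}
id2label = {i: label for i, label in enumerate(labels_list)}

print(labels_map['News'])  # 2
print(id2label[8])         # Promotion
```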
### Description of labels

| Label | Description | Examples |
|-------|-------------|----------|
### Comparison with other models at in-dataset and cross-dataset experiments

The X-GENRE model was compared with `xlm-roberta-base` classifiers fine-tuned on each of the genre datasets separately, using the X-GENRE schema (see the experiments in https://github.com/TajaKuzman/Genre-Datasets-Comparison).

In the in-dataset experiments (trained and tested on splits of the same dataset), it outperforms the dataset-specific classifiers on all datasets except FTD, which has a smaller number of X-GENRE labels.

| Trained on   |   Micro F1 |   Macro F1 |
|:-------------|-----------:|-----------:|
| X-GENRE      | GINCO       |      0.749 |      0.758 |

The classifier was compared with other classifiers on 2 additional genre datasets (to which the X-GENRE schema was mapped):
- [EN-GINCO](https://huggingface.co/datasets/TajaKuzman/X-GENRE-multilingual-text-genre-dataset): a sample of the English enTenTen20 corpus
- [FinCORE](https://github.com/TurkuNLP/FinCORE): the Finnish CORE corpus
| Trained on   | Tested on   |   Micro F1 |   Macro F1 |
|:-------------|:------------|-----------:|-----------:|
| FTD          | EN-GINCO    |      0.574 |      0.475 |
| CORE         | EN-GINCO    |      0.485 |      0.422 |
The cross-dataset and cross-lingual experiments showed that the X-GENRE classifier, trained on all three datasets, outperforms classifiers trained on just one of the datasets.
### Fine-tuning hyperparameters

Fine-tuning was performed with `simpletransformers`. Beforehand, a brief hyperparameter optimization was performed, and the presumed optimal hyperparameters are:

```python
model_args= {
```
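These arguments are passed to `simpletransformers`' `ClassificationModel` when fine-tuning. A rough, non-runnable sketch of that setup under the settings above (the base checkpoint name and the `train_df` variable are illustrative assumptions, not part of this repository):

```python
# Sketch only: requires simpletransformers, a GPU and the training data.
from simpletransformers.classification import ClassificationModel

model = ClassificationModel(
    "xlmroberta",            # model type matching xlm-roberta-base
    "xlm-roberta-base",      # base checkpoint (assumed starting point)
    num_labels=9,            # the nine X-GENRE labels
    args=model_args,         # the hyperparameters listed above
)
model.train_model(train_df)  # train_df: pandas DataFrame with "text" and "labels" columns
```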
## Citation

If you use the model, please cite the paper which describes the creation of the [X-GENRE dataset](https://huggingface.co/datasets/TajaKuzman/X-GENRE-multilingual-text-genre-dataset) and the genre classifier:

```
@article{kuzman2023automatic,
```