ERCDiDip commited on
Commit
a51e71e
1 Parent(s): 6b1ec95

Update README.md

Browse files

Adding Ancient and Medieval Greek, together with example

Files changed (1) hide show
  1. README.md +6 -2
README.md CHANGED
@@ -7,6 +7,7 @@ widget:
7
  - text: "Rath und Gemeinde der Stadt Wismar beschweren sich über die von den Hauptleuten, Beamten und Vasallen des Grafen Johann von Holstein und Stormarn ihren Bürgern seit Jahren zugefügten Unbilden, indem sie ein Verzeichniss der erlittenen einzelnen Verluste beibringen."
8
  - text: "Diplomă de înnobilare emisă de împăratul romano-german Rudolf al II-lea de Habsburg la în favoarea familiei Szőke de Galgóc. Aussteller: Rudolf al II-lea de Habsburg, împărat romano-german Empfänger: Szőke de Galgóc, familie"
9
  - text: "бѣ жє болѧ єтєръ лазаръ отъ виѳаньѧ градьца марьина и марѳꙑ сєстрꙑ єѧ | бѣ жє марьꙗ помазавъшиꙗ господа мѵромъ и отьръши ноѕѣ єго власꙑ своими єѧжє братъ лазаръ болѣашє"
 
10
  ---
11
 
12
  # XLM-RoBERTa (base) language-detection model (modern and medieval)
@@ -17,11 +18,11 @@ This model is a fine-tuned version of xlm-roberta-base on the [monasterium.net](
17
  On the top of this XLM-RoBERTa transformer model is a classification head. Please refer this model together with to the [XLM-RoBERTa (base-sized model)](https://huggingface.co/xlm-roberta-base) card or the paper [Unsupervised Cross-lingual Representation Learning at Scale by Conneau et al.](https://arxiv.org/abs/1911.02116) for additional information.
18
 
19
  ## Intended uses & limitations
20
- You can directly use this model as a language detector, i.e. for sequence classification tasks. Currently, it supports the following 40 languages, modern and medieval:
21
 
22
  Modern: Bulgarian (bg), Croatian (hr), Czech (cs), Danish (da), Dutch (nl), English (en), Estonian (et), Finnish (fi), French (fr), German (de), Greek (el), Hungarian (hu), Irish (ga), Italian (it), Latvian (lv), Lithuanian (lt), Maltese (mt), Polish (pl), Portuguese (pt), Romanian (ro), Slovak (sk), Slovenian (sl), Spanish (es), Swedish (sv), Russian (ru), Turkish (tr), Basque (eu), Catalan (ca), Albanian (sq), Serbian (se), Ukrainian (uk), Norwegian (no), Arabic (ar), Chinese (zh), Hebrew (he)
23
 
24
- Medieval: Middle High German (mhd), Latin (la), Middle Low German (gml), Old French (fro), Old Church Slavonic (chu), Early New High German (fnhd)
25
 
26
  ## Training and evaluation data
27
  The model was fine-tuned using the Monasterium and Wikipedia datasets, which consist of text sequences in 40 languages. The training set contains 80k samples, while the validation and test sets contain 16k. The average accuracy on the test set is 99.59% (this matches the average macro/weighted F1-score, the test set being perfectly balanced).
@@ -67,6 +68,9 @@ classificator = pipeline("text-classification", model="ERCDiDip/40_langdetect_v0
67
  classificator("clemens etc dilecto filio scolastico ecclesie wetflari ensi treveren dioc salutem etc significarunt nobis dilecti filii commendator et fratres hospitalis beate marie theotonicorum")
68
  ```
69
 
 
 
 
70
  ## Framework versions
71
  - Transformers 4.24.0
72
  - Pytorch 1.13.0
 
7
  - text: "Rath und Gemeinde der Stadt Wismar beschweren sich über die von den Hauptleuten, Beamten und Vasallen des Grafen Johann von Holstein und Stormarn ihren Bürgern seit Jahren zugefügten Unbilden, indem sie ein Verzeichniss der erlittenen einzelnen Verluste beibringen."
8
  - text: "Diplomă de înnobilare emisă de împăratul romano-german Rudolf al II-lea de Habsburg la în favoarea familiei Szőke de Galgóc. Aussteller: Rudolf al II-lea de Habsburg, împărat romano-german Empfänger: Szőke de Galgóc, familie"
9
  - text: "бѣ жє болѧ єтєръ лазаръ отъ виѳаньѧ градьца марьина и марѳꙑ сєстрꙑ єѧ | бѣ жє марьꙗ помазавъшиꙗ господа мѵромъ и отьръши ноѕѣ єго власꙑ своими єѧжє братъ лазаръ болѣашє"
10
+ - text: "μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος οὐλομένην, ἣ μυρί᾽ Ἀχαιοῖς ἄλγε᾽ ἔθηκε, πολλὰς δ᾽ ἰφθίμους ψυχὰς Ἄϊδι προΐαψεν ἡρώων, αὐτοὺς δὲ ἑλώρια τεῦχε κύνεσσιν οἰωνοῖσί"
11
  ---
12
 
13
  # XLM-RoBERTa (base) language-detection model (modern and medieval)
 
18
  On the top of this XLM-RoBERTa transformer model is a classification head. Please refer this model together with to the [XLM-RoBERTa (base-sized model)](https://huggingface.co/xlm-roberta-base) card or the paper [Unsupervised Cross-lingual Representation Learning at Scale by Conneau et al.](https://arxiv.org/abs/1911.02116) for additional information.
19
 
20
  ## Intended uses & limitations
21
+ You can directly use this model as a language detector, i.e. for sequence classification tasks. Currently, it supports the following 41 languages, modern and medieval:
22
 
23
  Modern: Bulgarian (bg), Croatian (hr), Czech (cs), Danish (da), Dutch (nl), English (en), Estonian (et), Finnish (fi), French (fr), German (de), Greek (el), Hungarian (hu), Irish (ga), Italian (it), Latvian (lv), Lithuanian (lt), Maltese (mt), Polish (pl), Portuguese (pt), Romanian (ro), Slovak (sk), Slovenian (sl), Spanish (es), Swedish (sv), Russian (ru), Turkish (tr), Basque (eu), Catalan (ca), Albanian (sq), Serbian (se), Ukrainian (uk), Norwegian (no), Arabic (ar), Chinese (zh), Hebrew (he)
24
 
25
+ Medieval: Middle High German (mhd), Latin (la), Middle Low German (gml), Old French (fro), Old Church Slavonic (chu), Early New High German (fnhd), Ancient and Medieval Greek (grc)
26
 
27
  ## Training and evaluation data
28
  The model was fine-tuned using the Monasterium and Wikipedia datasets, which consist of text sequences in 40 languages. The training set contains 80k samples, while the validation and test sets contain 16k. The average accuracy on the test set is 99.59% (this matches the average macro/weighted F1-score, the test set being perfectly balanced).
 
68
  classificator("clemens etc dilecto filio scolastico ecclesie wetflari ensi treveren dioc salutem etc significarunt nobis dilecti filii commendator et fratres hospitalis beate marie theotonicorum")
69
  ```
70
 
71
+ ## Updates
72
+ - 25th November 2022: Adding Ancient and Medieval Greek (grc)
73
+
74
  ## Framework versions
75
  - Transformers 4.24.0
76
  - Pytorch 1.13.0