---
language: ro
tags:
- bert
- fill-mask
license: mit
---

# romanian-sentence-e5-large

The BERT **base**, **uncased** model for Romanian, finetuned on the RO_MNLI dataset (the entire MNLI dataset translated from English to Romanian). ![v1.0](https://img.shields.io/badge/v1.0-21%20Apr%202020-ff6666)

### How to use

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Initialize the model
model = SentenceTransformer("iliemihai/romanian-sentence-e5-large")

# Define the sentences
sentences = [
    "Un tren își începe călătoria către destinație.",
    "O locomotivă pornește zgomotos spre o stație îndepărtată.",
    "Un muzician cântă la un saxofon impresionant.",
    "Un saxofonist evocă melodii suave sub lumina lunii.",
    "O bucătăreasă presară condimente pe un platou cu legume.",
    "Un chef adaugă un strop de mirodenii peste o salată colorată.",
    "Un jongler aruncă și prinde mingi colorate.",
    "Un artist de circ jonglează cu măiestrie sub reflectoare.",
    "Un artist pictează un peisaj minunat pe o pânză albă.",
    "Un pictor redă frumusețea naturii pe pânza sa strălucitoare."
]

# Compute an embedding for each sentence
embeddings = model.encode(sentences)

# Compute semantic similarity as the cosine similarity between embeddings
norms = np.linalg.norm(embeddings, axis=1)
similarities = np.dot(embeddings, embeddings.T) / (norms[:, np.newaxis] * norms[np.newaxis, :])

# For each sentence, find the most similar other sentence
# (subtracting the identity matrix excludes self-similarity)
most_similar_indices = np.argmax(similarities - np.eye(len(sentences)), axis=1)

most_similar_sentences = [
    (sentences[i], sentences[most_similar_indices[i]], similarities[i, most_similar_indices[i]])
    for i in range(len(sentences))
]

print(most_similar_sentences)
```

Remember to always sanitize your text! Replace the cedilla letters ``ş`` and ``ţ`` with their comma-below counterparts:

```python
text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")
```

because the model was **NOT** trained on cedilla ``s`` and ``t``. If you don't, performance will decrease due to ``<UNK>`` tokens and an increased number of tokens per word.
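
As a convenience, the replacements above can be wrapped in a small helper (the function name is illustrative, not part of the model's API):

```python
def sanitize(text: str) -> str:
    """Normalize cedilla ş/ţ to the comma-below ș/ț the model's vocabulary expects."""
    for old, new in (("ţ", "ț"), ("ş", "ș"), ("Ţ", "Ț"), ("Ş", "Ș")):
        text = text.replace(old, new)
    return text

print(sanitize("paşii şi ţara"))  # → "pașii și țara"
```
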

### Parameters:

| Parameter        | Value |
|------------------|-------|
| Batch size       | 16    |
| Training steps   | 256k  |
| Warmup steps     | 500   |
| Uncased          | True  |
| Max. Seq. Length | 512   |
| Loss function    | Contrastive Loss |

### Evaluation

Evaluation is performed on the Romanian STSb dataset.

| Model | Spearman | Pearson |
|--------------------------------|:-----:|:------:|
| bert-base-romanian-uncased-v1 | 0.8086 | 0.8159 |
| sentence-bert-base-romanian-uncased-v1 | **0.8393** | **0.8387** |
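
For reference, the Spearman score reported above is the Pearson correlation computed on the ranks of the two score lists. A minimal numpy-only sketch; the predicted and gold scores below are made up for illustration, not taken from the STSb evaluation:

```python
import numpy as np

def spearman(x, y):
    # Spearman correlation = Pearson correlation of the ranks
    # (no tie handling in this sketch).
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return np.corrcoef(rx, ry)[0, 1]

pred = [0.9, 0.2, 0.7, 0.4, 0.8]   # model cosine similarities (illustrative)
gold = [4.8, 1.0, 3.9, 2.1, 4.5]   # gold STS scores (illustrative)
print(spearman(pred, gold))  # → 1.0 (the two rankings agree perfectly)
```
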

### Corpus

#### Pretraining

The model is trained on the following corpora (stats in the table below are after cleaning):

| Corpus    | Lines(M) | Words(M) | Chars(B) | Size(GB) |
|-----------|:--------:|:--------:|:--------:|:--------:|
| OPUS      | 55.05    | 635.04   | 4.045    | 3.8      |
| OSCAR     | 33.56    | 1725.82  | 11.411   | 11       |
| Wikipedia | 1.54     | 60.47    | 0.411    | 0.4      |
| **Total** | **90.15** | **2421.33** | **15.867** | **15.2** |

#### Finetuning

The model is finetuned on the RO_MNLI dataset (the entire MNLI dataset translated from English to Romanian, keeping only the contradiction and entailment pairs, ~256k sentence pairs).
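
The pair selection described above can be sketched as follows; the field and label names are the standard MNLI ones and are assumed here, not taken from the model card, and the examples are illustrative:

```python
# Keep only entailment and contradiction pairs from an NLI-style dataset,
# as described for RO_MNLI. Field/label names follow the MNLI convention.
examples = [
    {"premise": "Un tren pleacă din gară.", "hypothesis": "O locomotivă pornește.", "label": "entailment"},
    {"premise": "Un tren pleacă din gară.", "hypothesis": "Trenul stă pe loc.", "label": "contradiction"},
    {"premise": "Un tren pleacă din gară.", "hypothesis": "Afară plouă.", "label": "neutral"},
]

pairs = [(ex["premise"], ex["hypothesis"], ex["label"])
         for ex in examples
         if ex["label"] in {"entailment", "contradiction"}]

print(len(pairs))  # → 2 (the neutral pair is dropped)
```
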

### Citation

Paper coming soon.

#### Acknowledgements

- We'd like to thank [Stefan Dumitrescu](https://github.com/dumitrescustefan) and [Andrei Marius Avram](https://github.com/avramandrei) for pretraining the v1.0 BERT models!