pettertonar committed on
Commit 16ba761 · verified · 1 Parent(s): 5d930e8

Add new SentenceTransformer model with an onnx backend

1_Pooling/config.json ADDED

{
  "word_embedding_dimension": 768,
  "pooling_mode_cls_token": false,
  "pooling_mode_mean_tokens": true,
  "pooling_mode_max_tokens": false,
  "pooling_mode_mean_sqrt_len_tokens": false,
  "pooling_mode_weightedmean_tokens": false,
  "pooling_mode_lasttoken": false,
  "include_prompt": true
}
README.md ADDED

---
pipeline_tag: sentence-similarity
lang:
  - sv
tags:
  - sentence-transformers
  - feature-extraction
  - sentence-similarity
  - transformers
widget:
  - source_sentence: Mannen åt mat.
    sentences:
      - Han förtärde en närande och nyttig måltid.
      - Det var ett sunkigt hak med ganska gott käk.
      - Han inmundigade middagen tillsammans med ett glas rödvin.
      - Potatischips är jättegoda.
      - Tryck på knappen för att få tala med kundsupporten.
    example_title: Mat
  - source_sentence: Kan jag deklarera digitalt från utlandet?
    sentences:
      - Du som befinner dig i utlandet kan deklarera digitalt på flera olika sätt.
      - >-
        Du som har kvarskatt att betala ska göra en inbetalning till ditt
        skattekonto.
      - >-
        Efter att du har deklarerat går vi igenom uppgifterna i din deklaration och
        räknar ut din skatt.
      - >-
        I din deklaration som du får från oss har vi räknat ut vad du ska betala
        eller få tillbaka.
      - Tryck på knappen för att få tala med kundsupporten.
    example_title: Skatteverket FAQ
  - source_sentence: Hon kunde göra bakåtvolter.
    sentences:
      - Hon var atletisk.
      - Hon var bra på gymnastik.
      - Hon var inte atletisk.
      - Hon var oförmögen att flippa baklänges.
    example_title: Gymnastik
license: apache-2.0
language:
  - sv
---

# KBLab/sentence-bert-swedish-cased

This is a [sentence-transformers](https://www.SBERT.net) model: it maps Swedish sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search. It is a bilingual Swedish-English model trained according to the instructions in the paper [Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://arxiv.org/pdf/2004.09813.pdf) and the [documentation](https://www.sbert.net/examples/training/multilingual/README.html) accompanying its companion Python package. We used the strongest available pretrained English bi-encoder ([all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)) as the teacher model, and the pretrained Swedish [KB-BERT](https://huggingface.co/KB/bert-base-swedish-cased) as the student model.

A more detailed description of the model can be found in an article we published on the KBLab blog [here](https://kb-labb.github.io/posts/2021-08-23-a-swedish-sentence-transformer/) and, for the updated model, [here](https://kb-labb.github.io/posts/2023-01-16-sentence-transformer-20/).

**Update**: We have released updated versions of the model since the initial release. The original model described in the blog post is **v1.0**. The newer versions are trained on longer paragraphs and have a longer max sequence length. **v2.0** is trained with a stronger teacher model and is the current default.

| Model version | Teacher Model | Max Sequence Length |
|---------------|---------------|---------------------|
| v1.0 | [paraphrase-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-mpnet-base-v2) | 256 |
| v1.1 | [paraphrase-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-mpnet-base-v2) | 384 |
| v2.0 | [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) | 384 |

## Usage (Sentence-Transformers)

Using this model is straightforward once you have [sentence-transformers](https://www.SBERT.net) installed:

```
pip install -U sentence-transformers
```

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer

sentences = ["Det här är en exempelmening", "Varje exempel blir konverterad"]

model = SentenceTransformer('KBLab/sentence-bert-swedish-cased')
embeddings = model.encode(sentences)
print(embeddings)
```
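
This commit also adds a `config_sentence_transformers.json` that sets `"similarity_fn_name": "cosine"`, so with sentence-transformers v3+ you can score sentence pairs directly with `model.similarity`. A minimal sketch, using sentences taken from the widget examples above:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('KBLab/sentence-bert-swedish-cased')

# Encode a query and two candidate sentences (examples from the widget above).
query = model.encode(["Mannen åt mat."])
candidates = model.encode([
    "Han förtärde en närande och nyttig måltid.",
    "Tryck på knappen för att få tala med kundsupporten.",
])

# model.similarity applies the configured similarity function (cosine here)
# and returns a (1, 2) matrix of scores.
print(model.similarity(query, candidates))
```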

### Loading an older model version (Sentence-Transformers)

Currently, the easiest way to load an older model version is to clone the model repository and load it from disk. For example, to clone the **v1.0** model:

```bash
git clone --depth 1 --branch v1.0 https://huggingface.co/KBLab/sentence-bert-swedish-cased
```

Then you can load the model by pointing to the local folder where you cloned it:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("path_to_model_folder/sentence-bert-swedish-cased")
```

## Usage (Hugging Face Transformers)

Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: first, pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.

```python
import torch
from transformers import AutoTokenizer, AutoModel


# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ['Det här är en exempelmening', 'Varje exempel blir konverterad']

# Load model from the Hugging Face Hub
# To load an older version, e.g. v1.0, add the argument revision="v1.0"
tokenizer = AutoTokenizer.from_pretrained('KBLab/sentence-bert-swedish-cased')
model = AutoModel.from_pretrained('KBLab/sentence-bert-swedish-cased')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
```

### Loading an older model (Hugging Face Transformers)

To load an older model, specify the version tag with the `revision` argument. For example, to load the **v1.0** model, use the following code:

```python
AutoTokenizer.from_pretrained('KBLab/sentence-bert-swedish-cased', revision="v1.0")
AutoModel.from_pretrained('KBLab/sentence-bert-swedish-cased', revision="v1.0")
```

## Evaluation Results

The model was evaluated on [SweParaphrase v1.0](https://spraakbanken.gu.se/en/resources/sweparaphrase) and **SweParaphrase v2.0**. These test sets are part of [SuperLim](https://spraakbanken.gu.se/en/resources/superlim) -- a Swedish evaluation suite for natural language understanding tasks. We calculated the Pearson and Spearman correlation between predicted model similarity scores and the human similarity score labels. Results on **SweParaphrase v1.0** are displayed below.

| Model version | Pearson | Spearman |
|---------------|---------|----------|
| v1.0 | 0.9183 | 0.9114 |
| v1.1 | 0.9183 | 0.9114 |
| v2.0 | **0.9283** | **0.9130** |

The following code snippet can be used to reproduce the above results:

```python
import pandas as pd
from sentence_transformers import SentenceTransformer

df = pd.read_csv(
    "sweparaphrase-dev-165.csv",
    sep="\t",
    header=None,
    names=[
        "original_id",
        "source",
        "type",
        "sentence_swe1",
        "sentence_swe2",
        "score",
        "sentence1",
        "sentence2",
    ],
)

model = SentenceTransformer("KBLab/sentence-bert-swedish-cased")

sentences1 = df["sentence_swe1"].tolist()
sentences2 = df["sentence_swe2"].tolist()

# Compute embeddings for both lists
embeddings1 = model.encode(sentences1, convert_to_tensor=True)
embeddings2 = model.encode(sentences2, convert_to_tensor=True)

# Normalize, then compute cosine similarity
embeddings1 /= embeddings1.norm(dim=-1, keepdim=True)
embeddings2 /= embeddings2.norm(dim=-1, keepdim=True)

cosine_scores = embeddings1 @ embeddings2.t()
sentence_pair_scores = cosine_scores.diag()

df["model_score"] = sentence_pair_scores.cpu().tolist()
print(df[["score", "model_score"]].corr(method="spearman"))
print(df[["score", "model_score"]].corr(method="pearson"))
```

### SweParaphrase v2.0

In general, **v1.1** correlates the most with human assessments of text similarity on SweParaphrase v2.0. Below, we present zero-shot evaluation results on all data splits. They show the models' out-of-the-box performance, without any fine-tuning.

| Model version | Data split | Pearson | Spearman |
|---------------|------------|------------|------------|
| v1.0 | train | 0.8355 | 0.8256 |
| v1.1 | train | **0.8383** | **0.8302** |
| v2.0 | train | 0.8209 | 0.8059 |
| v1.0 | dev | 0.8682 | 0.8774 |
| v1.1 | dev | **0.8739** | **0.8833** |
| v2.0 | dev | 0.8638 | 0.8668 |
| v1.0 | test | 0.8356 | 0.8476 |
| v1.1 | test | **0.8393** | **0.8550** |
| v2.0 | test | 0.8232 | 0.8213 |

### SweFAQ v2.0

On retrieval tasks, **v2.0** performs best by a substantial margin. It is better at matching the correct answer to a question than v1.1 and v1.0 (see the sketch after the table below).

| Model version | Data split | Accuracy |
|---------------|------------|------------|
| v1.0 | train | 0.5262 |
| v1.1 | train | 0.6236 |
| v2.0 | train | **0.7106** |
| v1.0 | dev | 0.4636 |
| v1.1 | dev | 0.5818 |
| v2.0 | dev | **0.6727** |
| v1.0 | test | 0.4495 |
| v1.1 | test | 0.5229 |
| v2.0 | test | **0.5871** |

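As an illustration of this retrieval setup, here is a minimal sketch that matches a question against a handful of candidate answers by cosine similarity. The sentences are illustrative stand-ins borrowed from the widget examples above, not actual SweFAQ data.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("KBLab/sentence-bert-swedish-cased")

# One question and a few candidate answers (illustrative stand-ins for SweFAQ pairs).
question = "Kan jag deklarera digitalt från utlandet?"
answers = [
    "Du som befinner dig i utlandet kan deklarera digitalt på flera olika sätt.",
    "Du som har kvarskatt att betala ska göra en inbetalning till ditt skattekonto.",
    "Tryck på knappen för att få tala med kundsupporten.",
]

question_embedding = model.encode(question, convert_to_tensor=True)
answer_embeddings = model.encode(answers, convert_to_tensor=True)

# Pick the answer with the highest cosine similarity to the question.
scores = util.cos_sim(question_embedding, answer_embeddings)  # shape: (1, len(answers))
best = scores.argmax().item()
print(answers[best], scores[0, best].item())
```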

Example scripts for evaluating the models on some of the test sets in the SuperLim suite can be found at the following links: [evaluate_faq.py](https://github.com/kb-labb/swedish-sbert/blob/main/evaluate_faq.py) (Swedish FAQ), [evaluate_swesat.py](https://github.com/kb-labb/swedish-sbert/blob/main/evaluate_swesat.py) (SweSAT synonyms), [evaluate_supersim.py](https://github.com/kb-labb/swedish-sbert/blob/main/evaluate_supersim.py) (SuperSim).

## Training

An article with more details on the data and v1.0 of the model can be found on the [KBLab blog](https://kb-labb.github.io/posts/2021-08-23-a-swedish-sentence-transformer/).

Around 14.6 million sentences from English-Swedish parallel corpora were used to train the model. Data was sourced from the [Open Parallel Corpus](https://opus.nlpl.eu/) (OPUS) and downloaded via the Python package [opustools](https://pypi.org/project/opustools/). The datasets used were JW300, Europarl, DGT-TM, EMEA, ELITR-ECA, TED2020, Tatoeba and OpenSubtitles.

The model was trained with the following parameters:

**DataLoader**:

`torch.utils.data.dataloader.DataLoader` of length 180513 with parameters:
```
{'batch_size': 64, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
```

**Loss**:

`sentence_transformers.losses.MSELoss.MSELoss`

Parameters of the fit() method:
```
{
    "epochs": 2,
    "evaluation_steps": 1000,
    "evaluator": "sentence_transformers.evaluation.SequentialEvaluator.SequentialEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "eps": 1e-06,
        "lr": 8e-06
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 5000,
    "weight_decay": 0.01
}
```
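
For reference, below is a minimal sketch of the knowledge-distillation setup described above, using the `ParallelSentencesDataset` and `MSELoss` utilities from sentence-transformers (the v2.x-style `fit()` API). The parallel-data file path is a hypothetical placeholder; the hyperparameters follow the values listed above.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses
from sentence_transformers.datasets import ParallelSentencesDataset

# Teacher: strong English bi-encoder; student: Swedish KB-BERT with mean pooling.
teacher = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
word_embedding = models.Transformer("KB/bert-base-swedish-cased", max_seq_length=384)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(), pooling_mode="mean")
student = SentenceTransformer(modules=[word_embedding, pooling])

# Tab-separated English-Swedish sentence pairs (hypothetical file path).
train_data = ParallelSentencesDataset(student_model=student, teacher_model=teacher)
train_data.load_data("parallel-data/en-sv.tsv.gz")
train_dataloader = DataLoader(train_data, shuffle=True, batch_size=64)

# The student is trained to reproduce the teacher's sentence embeddings (MSE loss).
train_loss = losses.MSELoss(model=student)
student.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=2,
    warmup_steps=5000,
    optimizer_params={"lr": 8e-6, "eps": 1e-6},
    weight_decay=0.01,
)
```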

## Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
```

## Citing & Authors

This model was trained by KBLab, a data lab at the National Library of Sweden.

You can cite the article on our blog: https://kb-labb.github.io/posts/2021-08-23-a-swedish-sentence-transformer/.

```
@misc{rekathati2021introducing,
  author = {Rekathati, Faton},
  title = {The KBLab Blog: Introducing a Swedish Sentence Transformer},
  url = {https://kb-labb.github.io/posts/2021-08-23-a-swedish-sentence-transformer/},
  year = {2021}
}
```

## Acknowledgements

We gratefully acknowledge the HPC RIVR consortium ([www.hpc-rivr.si](https://www.hpc-rivr.si/)) and EuroHPC JU ([eurohpc-ju.europa.eu](https://eurohpc-ju.europa.eu/)) for funding this research by providing computing resources of the HPC system Vega at the Institute of Information Science ([www.izum.si](https://www.izum.si/)).
config.json ADDED

{
  "_name_or_path": "KBLab/sentence-bert-swedish-cased",
  "architectures": [
    "BertModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.46.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 50325
}
config_sentence_transformers.json ADDED

{
  "__version__": {
    "sentence_transformers": "3.3.0",
    "transformers": "4.46.2",
    "pytorch": "2.3.0"
  },
  "prompts": {},
  "default_prompt_name": null,
  "similarity_fn_name": "cosine"
}
modules.json ADDED

[
  {
    "idx": 0,
    "name": "0",
    "path": "",
    "type": "sentence_transformers.models.Transformer"
  },
  {
    "idx": 1,
    "name": "1",
    "path": "1_Pooling",
    "type": "sentence_transformers.models.Pooling"
  }
]
onnx/model.onnx ADDED

version https://git-lfs.github.com/spec/v1
oid sha256:af73b6ff7501d183f1ec7d4785595b236de0923faef49caba03ad52c3c007c14
size 496679432
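
Per the commit message, this file is the ONNX export of the model. With sentence-transformers >= 3.2 and the ONNX extra installed (`pip install sentence-transformers[onnx]`), it can be loaded through the ONNX backend; a minimal sketch:

```python
from sentence_transformers import SentenceTransformer

# backend="onnx" tells sentence-transformers to run inference with onnx/model.onnx.
model = SentenceTransformer("KBLab/sentence-bert-swedish-cased", backend="onnx")

embeddings = model.encode(["Det här är en exempelmening", "Varje exempel blir konverterad"])
print(embeddings.shape)  # (2, 768)
```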
sentence_bert_config.json ADDED

{
  "max_seq_length": 384,
  "do_lower_case": false
}
special_tokens_map.json ADDED

{
  "cls_token": {
    "content": "[CLS]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "[MASK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "[PAD]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "content": "[SEP]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "[UNK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED

{
  "added_tokens_decoder": {
    "0": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "[CLS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "[SEP]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "4": {
      "content": "[MASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "clean_up_tokenization_spaces": false,
  "cls_token": "[CLS]",
  "do_basic_tokenize": true,
  "do_lower_case": false,
  "mask_token": "[MASK]",
  "max_length": 384,
  "model_max_length": 384,
  "never_split": null,
  "pad_to_multiple_of": null,
  "pad_token": "[PAD]",
  "pad_token_type_id": 0,
  "padding_side": "right",
  "sep_token": "[SEP]",
  "stride": 0,
  "strip_accents": false,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "truncation_side": "right",
  "truncation_strategy": "longest_first",
  "unk_token": "[UNK]"
}
vocab.txt ADDED
The diff for this file is too large to render. See raw diff