KennethTM committed on
Commit af75341
Parent: 96b4745

initial upload

README.md CHANGED
@@ -4,93 +4,88 @@ tags:
  - sentence-transformers
  - feature-extraction
  - sentence-similarity
- license: mit
- datasets:
- - sentence-transformers/embedding-training-data
- - clips/mfaq
- language:
- - da
- library_name: sentence-transformers
- ---
-
- *Performance is not great - updates coming...*

- # MiniLM-L6-danish-encoder

- This is a lightweight (~22M parameters) [sentence-transformers](https://www.SBERT.net) model for Danish NLP: it maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search.

- The maximum sequence length is 256 tokens.

- The model was not pre-trained from scratch but adapted from the English version using a [tokenizer](https://huggingface.co/KennethTM/bert-base-uncased-danish) trained on Danish text.

- # Usage (Sentence-Transformers)

  Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:

  ```
  pip install -U sentence-transformers
  ```
 
  Then you can use the model like this:

  ```python
  from sentence_transformers import SentenceTransformer
- sentences = ["En mand løber på vejen.", "En panda løber på vejen.", "En mand kører hurtigt forbi på cykel."]

- model = SentenceTransformer('KennethTM/MiniLM-L6-danish-encoder')
  embeddings = model.encode(sentences)
  print(embeddings)
  ```
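
The encoded vectors can be compared directly for semantic similarity. A minimal follow-up sketch (an editorial addition, not part of the original card) using `sentence_transformers.util.cos_sim`:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('KennethTM/MiniLM-L6-danish-encoder')
# Danish examples: "A man runs on the road.", "A panda runs on the road.",
# "A man rides quickly past on a bicycle."
sentences = ["En mand løber på vejen.", "En panda løber på vejen.", "En mand kører hurtigt forbi på cykel."]

# Encode and compute the pairwise cosine-similarity matrix
embeddings = model.encode(sentences, convert_to_tensor=True)
scores = util.cos_sim(embeddings, embeddings)
print(scores)
```
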
- # Usage (HuggingFace Transformers)
- Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: first, pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.

- ```python
- from transformers import AutoTokenizer, AutoModel
- import torch
- import torch.nn.functional as F

- # Mean pooling - take the attention mask into account for correct averaging
- def mean_pooling(model_output, attention_mask):
-     token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
-     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
-     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

- # Sentences we want sentence embeddings for
- sentences = ["En mand løber på vejen.", "En panda løber på vejen.", "En mand kører hurtigt forbi på cykel."]

- # Load model from the Hugging Face Hub
- tokenizer = AutoTokenizer.from_pretrained('KennethTM/MiniLM-L6-danish-encoder')
- model = AutoModel.from_pretrained('KennethTM/MiniLM-L6-danish-encoder')

- # Tokenize sentences
- encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

- # Compute token embeddings
- with torch.no_grad():
-     model_output = model(**encoded_input)

- # Perform pooling
- sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

- # Normalize embeddings
- sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

- print("Sentence embeddings:")
- print(sentence_embeddings)
  ```
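
Because the embeddings are L2-normalized in the last step, cosine similarity reduces to a plain dot product. A short editorial sketch continuing from the variables above:

```python
# Pairwise cosine similarities; rows/columns follow the order of `sentences`
similarity = sentence_embeddings @ sentence_embeddings.T
print(similarity)
```
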

- # Evaluation

- The performance of the pretrained model was evaluated using [ScandEval](https://github.com/ScandEval/ScandEval).

- | Task                     | Dataset      | Score (±SE)                     |
- |:-------------------------|:-------------|:--------------------------------|
- | sentiment-classification | angry-tweets | mcc = 36.14 (±1.07)             |
- |                          |              | macro_f1 = 56.57 (±0.84)        |
- | named-entity-recognition | dane         | micro_f1 = 55.56 (±1.69)        |
- |                          |              | micro_f1_no_misc = 57.44 (±1.9) |
- | linguistic-acceptability | scala-da     | mcc = 12.4 (±3.07)              |
- |                          |              | macro_f1 = 53.54 (±2.15)        |
- | question-answering       | scandiqa-da  | em = 17.87 (±1.15)              |
- |                          |              | f1 = 27.84 (±1.37)              |
- | speed                    | speed        | speed = 18.59 (±0.05)           |
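
For reference, an evaluation like the one above can be run through ScandEval's Python API. A hypothetical sketch (an editorial addition; the `Benchmarker` argument names vary across ScandEval versions, so treat them as assumptions):

```python
from scandeval import Benchmarker

# Benchmark the encoder on the Danish ScandEval tasks; the `language`
# keyword is an assumption and may differ in your installed version
benchmark = Benchmarker(language="da")
benchmark("KennethTM/MiniLM-L6-danish-encoder")
```
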
 
 
 
  - sentence-transformers
  - feature-extraction
  - sentence-similarity

+ ---

+ # {MODEL_NAME}

+ This is a [sentence-transformers](https://www.SBERT.net) model: it maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search.

+ <!--- Describe your model here -->

+ ## Usage (Sentence-Transformers)

  Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:

  ```
  pip install -U sentence-transformers
  ```
+
  Then you can use the model like this:

  ```python
  from sentence_transformers import SentenceTransformer
+ sentences = ["This is an example sentence", "Each sentence is converted"]

+ model = SentenceTransformer('{MODEL_NAME}')
  embeddings = model.encode(sentences)
  print(embeddings)
  ```
 
 
+ ## Evaluation Results

+ <!--- Describe how your model was evaluated -->

+ For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name={MODEL_NAME})
 
+ ## Training
+ The model was trained with the following parameters:

+ **DataLoader**:

+ `torch.utils.data.dataloader.DataLoader` of length 14531 with parameters:
+ ```
+ {'batch_size': 128, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
  ```

+ **Loss**:
+
+ `__main__.MultipleNegativesRankingLoss` with parameters:
+ ```
+ {'scale': 20.0, 'similarity_fct': 'cos_sim'}
+ ```

+ Parameters of the fit()-Method:
+ ```
+ {
+     "epochs": 1,
+     "evaluation_steps": 0,
+     "evaluator": "NoneType",
+     "max_grad_norm": 1,
+     "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
+     "optimizer_params": {
+         "lr": 2e-05
+     },
+     "scheduler": "WarmupLinear",
+     "steps_per_epoch": null,
+     "warmup_steps": 1454,
+     "weight_decay": 0.01
+ }
+ ```
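
These parameters correspond to a standard sentence-transformers fine-tuning loop. A minimal editorial sketch of how such a run might look (the training pairs here are hypothetical, and the library's built-in `losses.MultipleNegativesRankingLoss` stands in for the `__main__` reimplementation logged above):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('{MODEL_NAME}')

# Hypothetical (anchor, positive) training pairs
train_examples = [
    InputExample(texts=["How do I reset my password?",
                        "Click 'Forgot password' to reset it."]),
    # ... more pairs
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=128)
# scale=20.0 and the default cosine similarity match the logged loss parameters
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)

# Mirrors the fit() parameters listed above
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=1454,
    optimizer_params={'lr': 2e-05},
    weight_decay=0.01,
    max_grad_norm=1,
)
```
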
+
+ ## Full Model Architecture
+ ```
+ SentenceTransformer(
+   (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel
+   (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
+   (2): Normalize()
+ )
+ ```
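
The block above is simply the model's printed module stack. A quick editorial check (assuming `{MODEL_NAME}` is replaced with the actual model id):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('{MODEL_NAME}')
print(model)  # Prints the Transformer/Pooling/Normalize stack shown above
print(model.get_sentence_embedding_dimension())  # 384
```
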

+ ## Citing & Authors

+ <!--- Describe where people can find more information -->
pytorch_model.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:3b6a8c8f10ba5858880cfd968cfa14a7bd2e7fdfae7c8d9bca1b3aaba8e75cab
+ oid sha256:034c21d9a02972167974cee34792364fb8a157f31e1c8f052f759fb90f3b6795
  size 90887145
sentence_bert_config.json CHANGED
@@ -1,4 +1,4 @@
  {
-   "max_seq_length": 256,
+   "max_seq_length": 128,
    "do_lower_case": false
  }
tokenizer.json CHANGED
@@ -2,7 +2,7 @@
    "version": "1.0",
    "truncation": {
      "direction": "Right",
-     "max_length": 256,
+     "max_length": 128,
      "strategy": "LongestFirst",
      "stride": 0
    },
tokenizer_config.json CHANGED
@@ -2,7 +2,7 @@
    "clean_up_tokenization_spaces": true,
    "cls_token": "[CLS]",
    "mask_token": "[MASK]",
-   "max_length": 256,
+   "max_length": 128,
    "model_max_length": 512,
    "pad_to_multiple_of": null,
    "pad_token": "[PAD]",