TJKlein committed
Commit d1858c9
1 Parent(s): de95105

Update README.md

Files changed (1)
  1. README.md +32 -4
README.md CHANGED
@@ -10,14 +10,42 @@ Language model of the pre-print arXiv paper titled: "_**miCSE**: Mutual Informat
 The **miCSE** language model is trained for sentence similarity computation. Training the model imposes alignment between the attention pattern of different views (embeddings of augmentations) during contrastive learning. Learning sentence embeddings with **miCSE** entails enforcing the syntactic consistency across augmented views for every single sentence, making contrastive self-supervised learning more sample efficient. Sentence representations correspond to the embedding of the _**[CLS]**_ token.
 
 
-# Usage
-
+# Model Usage
 
 ```shell
-tokenizer = AutoTokenizer.from_pretrained("sap-ai-research/<----Enter Model Name---->")
-
-model = AutoModelWithLMHead.from_pretrained("sap-ai-research/<----Enter Model Name---->")
+from transformers import AutoTokenizer, AutoModel
+
+tokenizer = AutoTokenizer.from_pretrained("sap-ai-research/miCSE")
+
+model = AutoModel.from_pretrained("sap-ai-research/miCSE")
+
+
+# Encoding of sentences in a list with a predefined maximum length of tokens (max_length)
+
+max_length = 32
+
+sentences = [
+    "This is a sentence for testing miCSE.",
+    "This is yet another test sentence for the mutual information Contrastive Sentence Embeddings model."
+]
+
+batch = tokenizer.batch_encode_plus(
+    sentences,
+    return_tensors='pt',
+    padding=True,
+    max_length=max_length,
+    truncation=True
+)
+
+# Compute the embeddings
+
+outputs = model(**batch, output_hidden_states=True, return_dict=True)
+
+embeddings = outputs.last_hidden_state[:,0]
 
 ```
+
+
 # Benchmark
 
 Model results on SentEval Benchmark:
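
Since the model is trained for sentence similarity, a minimal end-to-end sketch of how the [CLS] embeddings produced by the usage snippet above can be turned into a similarity score is given below. It assumes PyTorch is installed; the cosine-similarity step is illustrative and not part of the README change itself.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load tokenizer and model as in the README snippet above
tokenizer = AutoTokenizer.from_pretrained("sap-ai-research/miCSE")
model = AutoModel.from_pretrained("sap-ai-research/miCSE")
model.eval()

sentences = [
    "This is a sentence for testing miCSE.",
    "This is yet another test sentence for the mutual information Contrastive Sentence Embeddings model."
]

# Tokenize with padding/truncation to a fixed maximum length
batch = tokenizer.batch_encode_plus(
    sentences,
    return_tensors='pt',
    padding=True,
    max_length=32,
    truncation=True
)

with torch.no_grad():
    outputs = model(**batch, output_hidden_states=True, return_dict=True)

# Sentence representation = embedding of the [CLS] token (position 0)
embeddings = outputs.last_hidden_state[:, 0]

# Illustrative step: cosine similarity between the two sentence embeddings
similarity = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(f"Cosine similarity: {similarity.item():.4f}")
```

Values closer to 1 indicate higher semantic similarity between the two sentences.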