jon-fernandes committed on
Commit 3c9e8d7 · 1 Parent(s): b742641

Add new SentenceTransformer model.

.gitattributes CHANGED
@@ -25,3 +25,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zstandard filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ pytorch_model.bin filter=lfs diff=lfs merge=lfs -text
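The added pattern is what routes the weight file through Git LFS, which is why `pytorch_model.bin` appears later in this commit as a small pointer file rather than as binary content. A rough sketch of how the slash-free patterns in this hunk select files, assuming they are matched against the file's basename; `is_lfs_tracked` is a hypothetical helper, and real gitattributes matching has more rules than shown here:

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Slash-free patterns visible in this hunk (the new one is last).
LFS_PATTERNS = ["*.zip", "*.zstandard", "*tfevents*", "pytorch_model.bin"]


def is_lfs_tracked(path, patterns=LFS_PATTERNS):
    """Approximate check: match the basename against each glob pattern."""
    name = PurePosixPath(path).name
    return any(fnmatchcase(name, pattern) for pattern in patterns)


for candidate in ["pytorch_model.bin", "config.json", "vocab.txt"]:
    print(candidate, is_lfs_tracked(candidate))
```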
1_Pooling/config.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "word_embedding_dimension": 768,
+   "pooling_mode_cls_token": true,
+   "pooling_mode_mean_tokens": false,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false
+ }
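Only `pooling_mode_cls_token` is enabled in this config, so the sentence embedding is the hidden state of the first token. A minimal sketch of that selection, using plain Python lists in place of tensors; `active_pooling_modes` and `cls_pool` are illustrative names, not part of sentence-transformers, and the toy values are made up:

```python
import json

# Pooling config as added in this commit (1_Pooling/config.json).
POOLING_CONFIG = json.loads("""
{
  "word_embedding_dimension": 768,
  "pooling_mode_cls_token": true,
  "pooling_mode_mean_tokens": false,
  "pooling_mode_max_tokens": false,
  "pooling_mode_mean_sqrt_len_tokens": false
}
""")


def active_pooling_modes(config):
    """Return the names of the pooling modes switched on in the config."""
    return [key for key, value in config.items()
            if key.startswith("pooling_mode_") and value is True]


def cls_pool(token_embeddings):
    """CLS pooling: keep the first token's vector of every sequence.

    token_embeddings is a nested list shaped [batch][tokens][dim].
    """
    return [sequence[0] for sequence in token_embeddings]


# Toy batch (hypothetical values): 2 sequences, 3 tokens each, 2 dimensions.
toy_batch = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[7.0, 8.0], [9.0, 0.0], [1.0, 1.0]],
]

print(active_pooling_modes(POOLING_CONFIG))
print(cls_pool(toy_batch))
```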
README.md ADDED
@@ -0,0 +1,128 @@
+ ---
+ pipeline_tag: sentence-similarity
+ license: apache-2.0
+ tags:
+ - sentence-transformers
+ - feature-extraction
+ - sentence-similarity
+ - transformers
+ ---
+
+ # sentence-transformers/msmarco-distilbert-base-tas-b
+
+ This is a port of the [DistilBert TAS-B Model](https://huggingface.co/sebastian-hofstaetter/distilbert-dot-tas_b-b256-msmarco) to a [sentence-transformers](https://www.SBERT.net) model. It maps sentences and paragraphs to a 768-dimensional dense vector space and is optimized for semantic search.
+
+
+
+ ## Usage (Sentence-Transformers)
+
+ Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
+
+ ```
+ pip install -U sentence-transformers
+ ```
+
+ Then you can use the model like this:
+
+ ```python
+ from sentence_transformers import SentenceTransformer, util
+
+ query = "How many people live in London?"
+ docs = ["Around 9 Million people live in London", "London is known for its financial district"]
+
+ # Load the model
+ model = SentenceTransformer('sentence-transformers/msmarco-distilbert-base-tas-b')
+
+ # Encode query and documents
+ query_emb = model.encode(query)
+ doc_emb = model.encode(docs)
+
+ # Compute dot score between query and all document embeddings
+ scores = util.dot_score(query_emb, doc_emb)[0].cpu().tolist()
+
+ # Combine docs & scores
+ doc_score_pairs = list(zip(docs, scores))
+
+ # Sort by decreasing score
+ doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
+
+ # Output passages & scores
+ for doc, score in doc_score_pairs:
+     print(score, doc)
+ ```
+
+
+
+ ## Usage (HuggingFace Transformers)
+ Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: first, pass your input through the transformer model, then apply the correct pooling operation (CLS pooling for this model) on top of the contextualized word embeddings.
+
+ ```python
+ from transformers import AutoTokenizer, AutoModel
+ import torch
+
+ # CLS Pooling - Take output from first token
+ def cls_pooling(model_output):
+     return model_output.last_hidden_state[:,0]
+
+ # Encode text
+ def encode(texts):
+     # Tokenize sentences
+     encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
+
+     # Compute token embeddings
+     with torch.no_grad():
+         model_output = model(**encoded_input, return_dict=True)
+
+     # Perform pooling
+     embeddings = cls_pooling(model_output)
+
+     return embeddings
+
+
+ # Sentences we want sentence embeddings for
+ query = "How many people live in London?"
+ docs = ["Around 9 Million people live in London", "London is known for its financial district"]
+
+ # Load model from HuggingFace Hub
+ tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/msmarco-distilbert-base-tas-b")
+ model = AutoModel.from_pretrained("sentence-transformers/msmarco-distilbert-base-tas-b")
+
+ # Encode query and docs
+ query_emb = encode(query)
+ doc_emb = encode(docs)
+
+ # Compute dot score between query and all document embeddings
+ scores = torch.mm(query_emb, doc_emb.transpose(0, 1))[0].cpu().tolist()
+
+ # Combine docs & scores
+ doc_score_pairs = list(zip(docs, scores))
+
+ # Sort by decreasing score
+ doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
+
+ # Output passages & scores
+ for doc, score in doc_score_pairs:
+     print(score, doc)
+ ```
+
+
+
+ ## Evaluation Results
+
+
+
+ For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name=sentence-transformers/msmarco-distilbert-base-tas-b)
+
+
+
+ ## Full Model Architecture
+ ```
+ SentenceTransformer(
+   (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: DistilBertModel
+   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
+ )
+ ```
+
+ ## Citing & Authors
+
+ Have a look at: [DistilBert TAS-B Model](https://huggingface.co/sebastian-hofstaetter/distilbert-dot-tas_b-b256-msmarco)
config.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "_name_or_path": "/root/.cache/torch/sentence_transformers/sentence-transformers_msmarco-distilbert-base-tas-b/",
+   "activation": "gelu",
+   "architectures": [
+     "DistilBertModel"
+   ],
+   "attention_dropout": 0.1,
+   "dim": 768,
+   "dropout": 0.1,
+   "hidden_dim": 3072,
+   "initializer_range": 0.02,
+   "max_position_embeddings": 512,
+   "model_type": "distilbert",
+   "n_heads": 12,
+   "n_layers": 6,
+   "pad_token_id": 0,
+   "qa_dropout": 0.1,
+   "seq_classif_dropout": 0.2,
+   "sinusoidal_pos_embds": false,
+   "tie_weights_": true,
+   "torch_dtype": "float32",
+   "transformers_version": "4.18.0",
+   "vocab_size": 30522
+ }
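As a sanity check, the parameter count implied by this config can be worked out by hand. The sketch below assumes the standard `DistilBertModel` layout (word and position embeddings with a LayerNorm, and per layer four biased attention projections, a two-layer feed-forward block and two LayerNorms); that layout is an assumption and is not spelled out in this commit. At 4 bytes per `float32` weight the estimate lands within about 32 KB of the `size 265483513` recorded in the `pytorch_model.bin` pointer in this commit; the remainder is presumably serialization overhead.

```python
# Values copied from the config.json added in this commit.
config = {
    "vocab_size": 30522,
    "max_position_embeddings": 512,
    "dim": 768,
    "hidden_dim": 3072,
    "n_layers": 6,
}


def linear(n_in, n_out):
    """Weights plus bias of a dense layer."""
    return n_in * n_out + n_out


def layer_norm(dim):
    """Scale and shift vectors of a LayerNorm."""
    return 2 * dim


def distilbert_param_count(cfg):
    """Estimate parameters, assuming the standard DistilBertModel layout."""
    d, h = cfg["dim"], cfg["hidden_dim"]
    embeddings = (cfg["vocab_size"] * d
                  + cfg["max_position_embeddings"] * d
                  + layer_norm(d))
    attention = 4 * linear(d, d)           # query, key, value, output
    feed_forward = linear(d, h) + linear(h, d)
    per_layer = attention + layer_norm(d) + feed_forward + layer_norm(d)
    return embeddings + cfg["n_layers"] * per_layer


params = distilbert_param_count(config)
float32_bytes = params * 4

print(params)         # 66362880
print(float32_bytes)  # 265451520
```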
config_sentence_transformers.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "__version__": {
+     "sentence_transformers": "2.0.0",
+     "transformers": "4.7.0",
+     "pytorch": "1.9.0+cu102"
+   }
+ }
modules.json ADDED
@@ -0,0 +1,14 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   }
+ ]
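`modules.json` lists the pipeline stages and the directory each one loads from, which matches the `(0): Transformer` / `(1): Pooling` order printed in the README. A small sketch of reading that order with the standard library only; it assumes `idx` defines the order and an empty `path` means the repository root, and `load_order` is a hypothetical helper rather than how sentence-transformers itself resolves modules:

```python
import json

# modules.json as added in this commit.
MODULES_JSON = """
[
  {"idx": 0, "name": "0", "path": "", "type": "sentence_transformers.models.Transformer"},
  {"idx": 1, "name": "1", "path": "1_Pooling", "type": "sentence_transformers.models.Pooling"}
]
"""


def load_order(modules_json):
    """Return (class name, directory) pairs sorted by pipeline index."""
    modules = sorted(json.loads(modules_json), key=lambda m: m["idx"])
    return [(m["type"].rsplit(".", 1)[-1], m["path"] or ".") for m in modules]


for class_name, directory in load_order(MODULES_JSON):
    print(f"{class_name} <- {directory}")
```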
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:eb8eb394b5471172c994f9ea9967a3cbb4cbed4a1e937fb804593af0ae12d149
+ size 265483513
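The three lines above are a Git LFS pointer: the repository stores only the hash and byte size, and the actual weights live in LFS storage. A minimal sketch of parsing such a pointer and checking downloaded bytes against it; `parse_lfs_pointer` and `matches_pointer` are hypothetical helpers, and for a file of this size one would hash in chunks rather than hold everything in memory:

```python
import hashlib

# Pointer contents as added in this commit.
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:eb8eb394b5471172c994f9ea9967a3cbb4cbed4a1e937fb804593af0ae12d149
size 265483513
"""


def parse_lfs_pointer(text):
    """Split 'key value' lines into a dict with a typed size and split oid."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(data, pointer):
    """True if the bytes have the size and hash recorded in the pointer."""
    if len(data) != pointer["size"]:
        return False
    return hashlib.new(pointer["algo"], data).hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["algo"], pointer["size"])
```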
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 512,
+   "do_lower_case": false
+ }
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"do_lower_case": true, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "do_basic_tokenize": true, "never_split": null, "model_max_length": 512, "name_or_path": "/root/.cache/torch/sentence_transformers/sentence-transformers_msmarco-distilbert-base-tas-b/", "special_tokens_map_file": "/home/ukp-reimers/.cache/huggingface/transformers/ba1a276969ccad7ea2344196e7b8561b36292db74bff940ee316dadc05d005d3.dd8bd9bfd3664b530ea4e645105f557769387b3da9f79bdb55ed556bdd80611d", "tokenizer_class": "DistilBertTokenizer"}
vocab.txt ADDED
The diff for this file is too large to render. See raw diff