Upload folder using huggingface_hub

Files changed (9)

.gitattributes +1 -0
README.md +123 -0
config.json +33 -0
dev_scores.csv +2 -0
pytorch_model.bin +3 -0
sentencepiece.bpe.model +3 -0
special_tokens_map.json +15 -0
tokenizer.json +3 -0
tokenizer_config.json +19 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text

README.md ADDED

	@@ -0,0 +1,123 @@

+---
+pipeline_tag: sentence-similarity
+language: fr
+license: apache-2.0
+datasets:
+- unicamp-dl/mmarco
+metrics:
+- recall
+tags:
+- sentence-similarity
+library_name: sentence-transformers
+---
+# crossencoder-mMiniLMv2-L6-H384-distilled-from-XLMR-Large-mmarcoFR
+This is a [sentence-transformers](https://www.SBERT.net) model trained on the **French** portion of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset.
+It performs cross-attention between a question-passage pair and outputs a relevance score between 0 and 1. The model can be used for tasks like clustering or [semantic search](https://www.sbert.net/examples/applications/retrieve_rerank/README.html): given a query, score it against each candidate passage (e.g., passages retrieved with BM25 or a biencoder), then sort the passages in decreasing order of relevance according to the model's predictions.
+## Usage
+***
+#### Sentence-Transformers
+Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
+```bash
+pip install -U sentence-transformers
+```
+Then you can use the model like this:
+```python
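+from sentence_transformers import CrossEncoder
+
+# Each pair is (query, candidate passage).
+pairs = [('Query', 'Paragraph1'), ('Query', 'Paragraph2'), ('Query', 'Paragraph3')]
+# Full Hub repo id, including the `antoinelouis/` namespace.
+model = CrossEncoder('antoinelouis/crossencoder-mMiniLMv2-L6-H384-distilled-from-XLMR-Large-mmarcoFR')
+scores = model.predict(pairs)  # one relevance score per pair
+print(scores)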
+```
+#### 🤗 Transformers
+Without [sentence-transformers](https://www.SBERT.net), you can use the model as follows:
+```python
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+import torch
+
+# Full Hub repo id, including the `antoinelouis/` namespace.
+model_name = 'antoinelouis/crossencoder-mMiniLMv2-L6-H384-distilled-from-XLMR-Large-mmarcoFR'
+model = AutoModelForSequenceClassification.from_pretrained(model_name)
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+pairs = [('Query', 'Paragraph1'), ('Query', 'Paragraph2'), ('Query', 'Paragraph3')]
+# A list of (query, passage) tuples is encoded as a batch of sentence pairs.
+features = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt')
+model.eval()
+with torch.no_grad():
+    # Raw logits, one per pair; apply torch.sigmoid to map them to the 0-1 range.
+    scores = model(**features).logits
+print(scores)
+```
+## Evaluation
+***
+We evaluated our model on 500 random queries from the mMARCO-fr train set (which were excluded from training). Each of these queries has at least one relevant and up to 200 irrelevant passages.
+|   r-precision |   mrr@10 |   recall@10 |   recall@20 |   recall@50 |   recall@100 |
+|--------------:|---------:|------------:|------------:|------------:|-------------:|
+|         33.92 |    49.33 |          79 |       88.35 |        94.8 |         98.2 |
+Below, we compare its results with other cross-encoder models fine-tuned on the same dataset:
+|    | model                                                                                                                                                                                  |   r-precision |   mrr@10 |   recall@10 (↑) |   recall@20 |   recall@50 |   recall@100 |
+|---:|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------:|---------:|------------:|------------:|------------:|-------------:|
+|  1 | [crossencoder-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-camembert-base-mmarcoFR)                                                                       |         35.65 |    50.44 |       82.95 |       91.5  |       96.8  |        98.8  |
+|  2 | [crossencoder-mMiniLMv2-L12-H384-distilled-from-XLMR-Large-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-mMiniLMv2-L12-H384-distilled-from-XLMR-Large-mmarcoFR)           |         34.37 |    51.01 |       82.23 |       90.6  |       96.45 |        98.4  |
+|  3 | [crossencoder-mmarcoFR-mMiniLMv2-L12-H384-v1-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-mmarcoFR-mMiniLMv2-L12-H384-v1-mmarcoFR)                                       |         34.22 |    49.2  |       81.7  |       90.9  |       97.1  |        98.9  |
+|  4 | [crossencoder-mpnet-base-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-mpnet-base-mmarcoFR)                                                                               |         29.68 |    46.13 |       80.45 |       87.9  |       93.15 |        96.6  |
+|  5 | [crossencoder-distilcamembert-base-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-distilcamembert-base-mmarcoFR)                                                           |         27.28 |    43.71 |       80.3  |       89.1  |       95.55 |        98.6  |
+|  6 | [crossencoder-roberta-base-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-roberta-base-mmarcoFR)                                                                           |         33.33 |    48.87 |       79.33 |       86.75 |       94.15 |        97.6  |
+|  7 | [crossencoder-electra-base-french-europeana-cased-discriminator-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-electra-base-french-europeana-cased-discriminator-mmarcoFR) |         28.32 |    45.28 |       79.22 |       87.15 |       93.15 |        95.75 |
+|  8 | **crossencoder-mMiniLMv2-L6-H384-distilled-from-XLMR-Large-mmarcoFR**                                                                                                                  |         33.92 |    49.33 |       79    |       88.35 |       94.8  |        98.2  |
+|  9 | [crossencoder-msmarco-electra-base-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-msmarco-electra-base-mmarcoFR)                                                           |         25.52 |    42.46 |       78.73 |       88.85 |       96.55 |        98.85 |
+| 10 | [crossencoder-bert-base-uncased-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-bert-base-uncased-mmarcoFR)                                                                 |         30.48 |    45.79 |       78.35 |       89.45 |       94.15 |        97.45 |
+| 11 | [crossencoder-msmarco-MiniLM-L-12-v2-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-msmarco-MiniLM-L-12-v2-mmarcoFR)                                                       |         29.07 |    44.41 |       77.83 |       88.1  |       95.55 |        99    |
+| 12 | [crossencoder-msmarco-MiniLM-L-6-v2-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-msmarco-MiniLM-L-6-v2-mmarcoFR)                                                         |         32.92 |    47.56 |       77.27 |       88.15 |       94.85 |        98.15 |
+| 13 | [crossencoder-msmarco-MiniLM-L-4-v2-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-msmarco-MiniLM-L-4-v2-mmarcoFR)                                                         |         30.98 |    46.22 |       76.35 |       85.8  |       94.35 |        97.55 |
+| 14 | [crossencoder-MiniLM-L6-H384-uncased-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-MiniLM-L6-H384-uncased-mmarcoFR)                                                       |         29.23 |    45.12 |       76.08 |       83.7  |       92.65 |        97.45 |
+| 15 | [crossencoder-electra-base-discriminator-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-electra-base-discriminator-mmarcoFR)                                               |         28.48 |    43.58 |       75.63 |       86.15 |       93.25 |        96.6  |
+| 16 | [crossencoder-electra-small-discriminator-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-electra-small-discriminator-mmarcoFR)                                             |         31.83 |    45.97 |       75.13 |       84.95 |       94.55 |        98.15 |
+| 17 | [crossencoder-distilroberta-base-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-distilroberta-base-mmarcoFR)                                                               |         28.22 |    42.85 |       74.13 |       84.08 |       94.2  |        98.5  |
+| 18 | [crossencoder-msmarco-TinyBERT-L-6-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-msmarco-TinyBERT-L-6-mmarcoFR)                                                           |         28.23 |    42.7  |       73.63 |       85.65 |       92.65 |        98.35 |
+| 19 | [crossencoder-msmarco-TinyBERT-L-4-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-msmarco-TinyBERT-L-4-mmarcoFR)                                                           |         28.6  |    43.19 |       72.17 |       81.95 |       92.8  |        97.4  |
+| 20 | [crossencoder-msmarco-MiniLM-L-2-v2-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-msmarco-MiniLM-L-2-v2-mmarcoFR)                                                         |         30.82 |    44.3  |       72.03 |       82.65 |       93.35 |        98.1  |
+| 21 | [crossencoder-distilbert-base-uncased-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-distilbert-base-uncased-mmarcoFR)                                                     |         25.47 |    40.11 |       71.37 |       85.6  |       93.85 |        97.95 |
+| 22 | [crossencoder-msmarco-TinyBERT-L-2-v2-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-msmarco-TinyBERT-L-2-v2-mmarcoFR)                                                     |         31.08 |    43.88 |       71.3  |       81.43 |       92.6  |        98.1  |
+## Training
+***
+#### Background
+We used the [nreimers/mMiniLMv2-L6-H384-distilled-from-XLMR-Large](https://huggingface.co/nreimers/mMiniLMv2-L6-H384-distilled-from-XLMR-Large) model and fine-tuned it with a binary cross-entropy loss function on 1M question-passage pairs in French with a positive-to-negative ratio of 4 (i.e., 25% of the pairs are relevant and 75% are irrelevant).
+#### Hyperparameters
+We trained the model on a single Tesla V100 GPU with 32GB of memory for 10 epochs (i.e., 312.4k steps) using a batch size of 32. We used the AdamW optimizer with an initial learning rate of 2e-05, weight decay of 0.01, learning rate warmup over the first 500 steps, and linear decay of the learning rate. The sequence length was limited to 512 tokens.
+#### Data
+We used the French version of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset to fine-tune our model. mMARCO is a multi-lingual machine-translated version of the MS MARCO dataset, a popular large-scale IR dataset.
+## Citation
+***
+```bibtex
+@online{louis2023,
+   author    = {Antoine Louis},
+   title     = {crossencoder-mMiniLMv2-L6-H384-distilled-from-XLMR-Large-mmarcoFR: A Cross-Encoder Model Trained on 1M sentence pairs in French},
+   publisher = {Hugging Face},
+   month     = {September},
+   year      = {2023},
+   url       = {https://huggingface.co/antoinelouis/crossencoder-mMiniLMv2-L6-H384-distilled-from-XLMR-Large-mmarcoFR},
+}
+```
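
The README describes a retrieve-then-rerank flow: score each (query, passage) pair, then sort the passages by decreasing score. A minimal sketch of that sorting step follows. The `rerank` helper and the `overlap_score` stand-in scorer are hypothetical names introduced here for illustration and are not part of the commit; with the real model, the scores would come from `CrossEncoder.predict` as in the README.

```python
from typing import Callable, List, Sequence, Tuple


def rerank(
    query: str,
    passages: Sequence[str],
    score_pairs: Callable[[List[Tuple[str, str]]], Sequence[float]],
) -> List[Tuple[str, float]]:
    """Score every (query, passage) pair and sort by decreasing relevance."""
    pairs = [(query, passage) for passage in passages]
    scores = score_pairs(pairs)
    ranked = sorted(zip(passages, scores), key=lambda item: item[1], reverse=True)
    return [(passage, float(score)) for passage, score in ranked]


def overlap_score(pairs: List[Tuple[str, str]]) -> List[float]:
    """Hypothetical stand-in scorer: fraction of query words found in the passage."""
    scores = []
    for query, passage in pairs:
        query_words = set(query.lower().split())
        passage_words = set(passage.lower().split())
        scores.append(len(query_words & passage_words) / max(len(query_words), 1))
    return scores


# Hypothetical candidates, e.g. as returned by a first-stage retriever.
candidates = [
    "Le chat dort sur le canapé",
    "Paris est la capitale de la France",
    "La capitale de l'Italie est Rome",
]
results = rerank("capitale france", candidates, overlap_score)
for passage, score in results:
    print(f"{score:.2f}  {passage}")
```

To rerank with the model itself, `model.predict` from the Sentence-Transformers snippet can be passed as `score_pairs`, since it also takes a list of pairs and returns one score per pair.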

config.json ADDED

	@@ -0,0 +1,33 @@

+{
+  "_name_or_path": "nreimers/mMiniLMv2-L6-H384-distilled-from-XLMR-Large",
+  "architectures": [
+    "XLMRobertaForSequenceClassification"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "bos_token_id": 0,
+  "classifier_dropout": null,
+  "eos_token_id": 2,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 384,
+  "id2label": {
+    "0": "LABEL_0"
+  },
+  "initializer_range": 0.02,
+  "intermediate_size": 1536,
+  "label2id": {
+    "LABEL_0": 0
+  },
+  "layer_norm_eps": 1e-05,
+  "max_position_embeddings": 514,
+  "model_type": "xlm-roberta",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 6,
+  "pad_token_id": 1,
+  "position_embedding_type": "absolute",
+  "torch_dtype": "float32",
+  "transformers_version": "4.28.1",
+  "type_vocab_size": 1,
+  "use_cache": true,
+  "vocab_size": 250002
+}
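
`id2label` holds a single entry, so the classification head emits one logit per pair. Mapping that logit to the 0-to-1 relevance score the README mentions takes a sigmoid. The sketch below shows the mapping on made-up logit values; the helper names are hypothetical, and the idea that sentence-transformers applies this activation by default for single-label cross-encoders is my understanding of the library rather than something this commit states.

```python
import math
from typing import Iterable, List


def sigmoid(logit: float) -> float:
    """Map a raw logit to a value in (0, 1), avoiding overflow for large magnitudes."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


def logits_to_scores(logits: Iterable[float]) -> List[float]:
    """Apply the sigmoid to each single-label logit."""
    return [sigmoid(x) for x in logits]


example_logits = [-2.0, 0.0, 3.5]  # hypothetical values, not model output
scores = logits_to_scores(example_logits)
print([round(s, 3) for s in scores])
```

Because the sigmoid is monotonic, ranking passages by raw logit or by sigmoid score gives the same order; the activation only matters when the score is read as a value between 0 and 1.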

dev_scores.csv ADDED

	@@ -0,0 +1,2 @@


+r-precision,mrr@10,recall@10,recall@20,recall@50,recall@100,model
+33.92,49.33,79.00,88.35,94.80,98.20,crossencoder-mMiniLMv2-L6-H384-distilled-from-XLMR-Large-mmarcoFR
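
The CSV columns are ranking metrics. The evaluation script is not part of this commit, so the sketch below uses the conventional per-query definitions of reciprocal rank at k, recall at k, and R-precision; it is not necessarily the author's code. Function names and the toy ranking are hypothetical.

```python
from typing import Sequence, Set


def reciprocal_rank_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """1 / rank of the first relevant passage within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant passages that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    return hits / len(relevant)


def r_precision(ranked_ids: Sequence[str], relevant: Set[str]) -> float:
    """Precision at rank R, where R is the number of relevant passages."""
    r = len(relevant)
    if r == 0:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:r] if doc_id in relevant)
    return hits / r


ranked = ["p7", "p2", "p9", "p4"]  # hypothetical model ranking, best first
relevant = {"p2", "p4"}            # hypothetical gold labels

print(reciprocal_rank_at_k(ranked, relevant))
print(recall_at_k(ranked, relevant, k=3))
print(r_precision(ranked, relevant))
```

The figures in `dev_scores.csv` would then presumably be these per-query values averaged over the 500 evaluation queries and expressed as percentages.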

pytorch_model.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:583323d2d680875692e528421c6c9fdfeff5b3f29a86be8eede01e98a5076349
+size 428017397
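
The three added lines are a Git LFS pointer rather than the weights themselves: a spec version, a SHA-256 object id, and the size in bytes. The sketch below parses such a pointer and relates the size to values from `config.json`. `parse_lfs_pointer` is a hypothetical helper, and the embedding estimate assumes a single float32 word-embedding matrix of `vocab_size × hidden_size`.

```python
from typing import Dict

POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:583323d2d680875692e528421c6c9fdfeff5b3f29a86be8eede01e98a5076349
size 428017397
"""


def parse_lfs_pointer(text: str) -> Dict[str, str]:
    """Split each 'key value' line of an LFS pointer into a dictionary entry."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        if line.strip():
            key, _, value = line.partition(" ")
            fields[key] = value
    return fields


pointer = parse_lfs_pointer(POINTER_TEXT)
size_bytes = int(pointer["size"])
algo, _, digest = pointer["oid"].partition(":")

# Values taken from config.json in this commit.
vocab_size, hidden_size, bytes_per_float32 = 250002, 384, 4
# Assumption: one float32 word-embedding matrix of vocab_size x hidden_size.
embedding_bytes = vocab_size * hidden_size * bytes_per_float32

print(f"{algo} digest length: {len(digest)} hex chars")
print(f"checkpoint size: {size_bytes / 1e6:.1f} MB")
print(f"estimated embedding share: {embedding_bytes / size_bytes:.0%}")
```

Under that assumption, most of the checkpoint would be the multilingual embedding table rather than the six transformer layers.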

sentencepiece.bpe.model ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865
+size 5069051

special_tokens_map.json ADDED

	@@ -0,0 +1,15 @@

+{
+  "bos_token": "<s>",
+  "cls_token": "<s>",
+  "eos_token": "</s>",
+  "mask_token": {
+    "content": "<mask>",
+    "lstrip": true,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": "<pad>",
+  "sep_token": "</s>",
+  "unk_token": "<unk>"
+}

tokenizer.json ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:46afe88da5fd71bdbab5cfab5e84c1adce59c246ea5f9341bbecef061891d0a7
+size 17082913

tokenizer_config.json ADDED

	@@ -0,0 +1,19 @@

+{
+  "bos_token": "<s>",
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "<s>",
+  "eos_token": "</s>",
+  "mask_token": {
+    "__type": "AddedToken",
+    "content": "<mask>",
+    "lstrip": true,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "model_max_length": 512,
+  "pad_token": "<pad>",
+  "sep_token": "</s>",
+  "tokenizer_class": "XLMRobertaTokenizer",
+  "unk_token": "<unk>"
+}
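
With `cls_token` set to `<s>`, `sep_token` set to `</s>`, and `model_max_length` at 512, an XLM-RoBERTa style tokenizer lays out a pair as `<s> query </s></s> passage </s>`, which leaves 508 positions for content. The sketch below illustrates that layout using whitespace-split words as stand-ins for SentencePiece tokens. `layout_pair` is a hypothetical helper, and it truncates only the passage, which is a simplification: the README's `truncation=True` trims the longer of the two sequences instead.

```python
from typing import List

BOS, EOS = "<s>", "</s>"  # cls/bos and sep/eos tokens from special_tokens_map.json
MAX_LENGTH = 512          # model_max_length from tokenizer_config.json
NUM_SPECIAL = 4           # <s>, </s>, </s>, </s> around a sentence pair


def layout_pair(
    query_tokens: List[str],
    passage_tokens: List[str],
    max_length: int = MAX_LENGTH,
) -> List[str]:
    """Arrange a (query, passage) pair, truncating the passage to fit max_length."""
    budget = max_length - NUM_SPECIAL - len(query_tokens)
    if budget < 0:
        raise ValueError("query alone exceeds the length budget")
    return [BOS, *query_tokens, EOS, EOS, *passage_tokens[:budget], EOS]


# Whitespace splitting stands in for real subword tokenization.
tokens = layout_pair(
    "quelle est la capitale".split(),
    "Paris est la capitale de la France".split(),
)
print(" ".join(tokens))
```

The gap between `model_max_length` (512) and `max_position_embeddings` (514) in `config.json` is expected for RoBERTa-family models, whose position ids start after the padding index.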