Upload folder using huggingface_hub

Files changed (8)

README.md +123 -0
config.json +34 -0
dev_scores.csv +2 -0
pytorch_model.bin +3 -0
sentencepiece.bpe.model +3 -0
special_tokens_map.json +19 -0
tokenizer.json +0 -0
tokenizer_config.json +24 -0

README.md ADDED

	@@ -0,0 +1,123 @@

+---
+pipeline_tag: sentence-similarity
+language: fr
+license: apache-2.0
+datasets:
+- unicamp-dl/mmarco
+metrics:
+- recall
+tags:
+- sentence-similarity
+library_name: sentence-transformers
+---
+# crossencoder-distilcamembert-base-mmarcoFR
+This is a [sentence-transformers](https://www.SBERT.net) model trained on the **French** portion of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset.
+It performs cross-attention between a question-passage pair and outputs a relevance score between 0 and 1. The model can be used for tasks like clustering or [semantic search](https://www.sbert.net/examples/applications/retrieve_rerank/README.html): given a query, score it against each candidate passage (e.g., retrieved with BM25 or a bi-encoder), then sort the passages in decreasing order of relevance according to the model's predictions.
+## Usage
+***
+#### Sentence-Transformers
+Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
+```bash
+pip install -U sentence-transformers
+```
+Then you can use the model like this:
+```python
+from sentence_transformers import CrossEncoder
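+# Each pair is (query, candidate passage).
+pairs = [('Query', 'Paragraph1'), ('Query', 'Paragraph2'), ('Query', 'Paragraph3')]
+# Full Hub repository ID, as given in the citation URL.
+model = CrossEncoder('antoinelouis/crossencoder-distilcamembert-base-mmarcoFR')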
+scores = model.predict(pairs)
+print(scores)
+```
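The reranking step described above (sort the candidate passages by decreasing score) can be sketched as follows. This is a minimal illustration, not part of the model card: the `rerank` helper and the score values are hypothetical, and the scores stand in for what `model.predict(pairs)` would return.

```python
from typing import List, Sequence, Tuple


def rerank(passages: Sequence[str], scores: Sequence[float]) -> List[Tuple[str, float]]:
    """Pair each passage with its score and sort by decreasing relevance."""
    if len(passages) != len(scores):
        raise ValueError("passages and scores must have the same length")
    scored = zip(passages, (float(s) for s in scores))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical scores, standing in for the output of model.predict(pairs).
passages = ["Paragraph1", "Paragraph2", "Paragraph3"]
scores = [0.12, 0.87, 0.45]

ranked = rerank(passages, scores)
for passage, score in ranked:
    print(f"{score:.2f}\t{passage}")
```

Converting each score with `float` keeps the helper usable with either a plain list or a NumPy array of predictions.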
+#### 🤗 Transformers
+Without [sentence-transformers](https://www.SBERT.net), you can use the model as follows:
+```python
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+import torch
+# Full Hub repository ID, as given in the citation URL.
+model = AutoModelForSequenceClassification.from_pretrained('antoinelouis/crossencoder-distilcamembert-base-mmarcoFR')
+tokenizer = AutoTokenizer.from_pretrained('antoinelouis/crossencoder-distilcamembert-base-mmarcoFR')
+pairs = [('Query', 'Paragraph1'), ('Query', 'Paragraph2'), ('Query', 'Paragraph3')]
+features = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt')
+model.eval()
+with torch.no_grad():
+    # Raw logits, one per pair; apply torch.sigmoid to map them to the 0-1 range.
+    scores = model(**features).logits
+print(scores)
+```
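The Transformers snippet above prints raw logits, whereas the model description speaks of scores between 0 and 1. Since `config.json` declares a single output label, a sigmoid is the natural mapping between the two. The sketch below shows that mapping in plain Python on hypothetical logit values; with real tensors, `torch.sigmoid(scores)` would do the same. That the sentence-transformers `CrossEncoder` applies this activation by default for single-output models is an assumption here, not something this repository states.

```python
import math
from typing import List


def sigmoid(logit: float) -> float:
    """Numerically stable logistic function, mapping a real logit into (0, 1)."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    exp_x = math.exp(logit)
    return exp_x / (1.0 + exp_x)


# Hypothetical logits, shaped like model(**features).logits.tolist():
# one row per pair, one value per row.
logits: List[List[float]] = [[-2.0], [0.0], [3.5]]

probabilities = [sigmoid(row[0]) for row in logits]
print(probabilities)
```

Because the sigmoid is monotonic, ranking passages by logit or by probability gives the same order; the conversion only matters when the score itself is thresholded or reported.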
+## Evaluation
+***
+We evaluated our model on 500 random queries from the mMARCO-fr train set (which were excluded from training). Each of these queries has at least one relevant and up to 200 irrelevant passages.
+|   r-precision |   mrr@10 |   recall@10 |   recall@20 |   recall@50 |   recall@100 |
+|--------------:|---------:|------------:|------------:|------------:|-------------:|
+|         27.28 |    43.71 |        80.3 |        89.1 |       95.55 |         98.6 |
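For readers unfamiliar with the metrics in the table above, here is a minimal sketch of their usual per-query definitions. The evaluation script itself is not part of this commit, so the function names, the passage IDs, and the assumption that the reported figures are per-query values averaged over the 500 queries and expressed as percentages are all illustrative.

```python
from typing import List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant passages that appear in the top k results."""
    hits = sum(1 for pid in ranked[:k] if pid in relevant)
    return hits / len(relevant)


def mrr_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Reciprocal rank of the first relevant passage in the top k, else 0."""
    for rank, pid in enumerate(ranked[:k], start=1):
        if pid in relevant:
            return 1.0 / rank
    return 0.0


def r_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Precision at rank R, where R is the number of relevant passages."""
    r = len(relevant)
    return sum(1 for pid in ranked[:r] if pid in relevant) / r


# Hypothetical ranking for one query (best first) and its relevant passage IDs.
ranked_ids = ["p7", "p2", "p9", "p4", "p1"]
relevant_ids = {"p2", "p1"}

print(recall_at_k(ranked_ids, relevant_ids, 3))
print(recall_at_k(ranked_ids, relevant_ids, 5))
print(mrr_at_k(ranked_ids, relevant_ids, 10))
print(r_precision(ranked_ids, relevant_ids))
```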
+Below, we compare its results with other cross-encoder models fine-tuned on the same dataset (rows are sorted by decreasing recall@10):
+|    | model                                                                                                                                                                                  |   r-precision |   mrr@10 |   recall@10 (↑) |   recall@20 |   recall@50 |   recall@100 |
+|---:|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------:|---------:|------------:|------------:|------------:|-------------:|
+|  1 | [crossencoder-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-camembert-base-mmarcoFR)                                                                       |         35.65 |    50.44 |       82.95 |       91.5  |       96.8  |        98.8  |
+|  2 | [crossencoder-mMiniLMv2-L12-H384-distilled-from-XLMR-Large-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-mMiniLMv2-L12-H384-distilled-from-XLMR-Large-mmarcoFR)           |         34.37 |    51.01 |       82.23 |       90.6  |       96.45 |        98.4  |
+|  3 | [crossencoder-mmarcoFR-mMiniLMv2-L12-H384-v1-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-mmarcoFR-mMiniLMv2-L12-H384-v1-mmarcoFR)                                       |         34.22 |    49.2  |       81.7  |       90.9  |       97.1  |        98.9  |
+|  4 | [crossencoder-mpnet-base-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-mpnet-base-mmarcoFR)                                                                               |         29.68 |    46.13 |       80.45 |       87.9  |       93.15 |        96.6  |
+|  5 | **crossencoder-distilcamembert-base-mmarcoFR**                                                                                                                                         |         27.28 |    43.71 |       80.3  |       89.1  |       95.55 |        98.6  |
+|  6 | [crossencoder-roberta-base-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-roberta-base-mmarcoFR)                                                                           |         33.33 |    48.87 |       79.33 |       86.75 |       94.15 |        97.6  |
+|  7 | [crossencoder-electra-base-french-europeana-cased-discriminator-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-electra-base-french-europeana-cased-discriminator-mmarcoFR) |         28.32 |    45.28 |       79.22 |       87.15 |       93.15 |        95.75 |
+|  8 | [crossencoder-mMiniLMv2-L6-H384-distilled-from-XLMR-Large-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-mMiniLMv2-L6-H384-distilled-from-XLMR-Large-mmarcoFR)             |         33.92 |    49.33 |       79    |       88.35 |       94.8  |        98.2  |
+|  9 | [crossencoder-msmarco-electra-base-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-msmarco-electra-base-mmarcoFR)                                                           |         25.52 |    42.46 |       78.73 |       88.85 |       96.55 |        98.85 |
+| 10 | [crossencoder-bert-base-uncased-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-bert-base-uncased-mmarcoFR)                                                                 |         30.48 |    45.79 |       78.35 |       89.45 |       94.15 |        97.45 |
+| 11 | [crossencoder-msmarco-MiniLM-L-12-v2-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-msmarco-MiniLM-L-12-v2-mmarcoFR)                                                       |         29.07 |    44.41 |       77.83 |       88.1  |       95.55 |        99    |
+| 12 | [crossencoder-msmarco-MiniLM-L-6-v2-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-msmarco-MiniLM-L-6-v2-mmarcoFR)                                                         |         32.92 |    47.56 |       77.27 |       88.15 |       94.85 |        98.15 |
+| 13 | [crossencoder-msmarco-MiniLM-L-4-v2-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-msmarco-MiniLM-L-4-v2-mmarcoFR)                                                         |         30.98 |    46.22 |       76.35 |       85.8  |       94.35 |        97.55 |
+| 14 | [crossencoder-MiniLM-L6-H384-uncased-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-MiniLM-L6-H384-uncased-mmarcoFR)                                                       |         29.23 |    45.12 |       76.08 |       83.7  |       92.65 |        97.45 |
+| 15 | [crossencoder-electra-base-discriminator-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-electra-base-discriminator-mmarcoFR)                                               |         28.48 |    43.58 |       75.63 |       86.15 |       93.25 |        96.6  |
+| 16 | [crossencoder-electra-small-discriminator-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-electra-small-discriminator-mmarcoFR)                                             |         31.83 |    45.97 |       75.13 |       84.95 |       94.55 |        98.15 |
+| 17 | [crossencoder-distilroberta-base-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-distilroberta-base-mmarcoFR)                                                               |         28.22 |    42.85 |       74.13 |       84.08 |       94.2  |        98.5  |
+| 18 | [crossencoder-msmarco-TinyBERT-L-6-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-msmarco-TinyBERT-L-6-mmarcoFR)                                                           |         28.23 |    42.7  |       73.63 |       85.65 |       92.65 |        98.35 |
+| 19 | [crossencoder-msmarco-TinyBERT-L-4-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-msmarco-TinyBERT-L-4-mmarcoFR)                                                           |         28.6  |    43.19 |       72.17 |       81.95 |       92.8  |        97.4  |
+| 20 | [crossencoder-msmarco-MiniLM-L-2-v2-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-msmarco-MiniLM-L-2-v2-mmarcoFR)                                                         |         30.82 |    44.3  |       72.03 |       82.65 |       93.35 |        98.1  |
+| 21 | [crossencoder-distilbert-base-uncased-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-distilbert-base-uncased-mmarcoFR)                                                     |         25.47 |    40.11 |       71.37 |       85.6  |       93.85 |        97.95 |
+| 22 | [crossencoder-msmarco-TinyBERT-L-2-v2-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-msmarco-TinyBERT-L-2-v2-mmarcoFR)                                                     |         31.08 |    43.88 |       71.3  |       81.43 |       92.6  |        98.1  |
+## Training
+***
+#### Background
+We used the [cmarkea/distilcamembert-base](https://huggingface.co/cmarkea/distilcamembert-base) model and fine-tuned it with a binary cross-entropy loss function on 1M question-passage pairs in French with a positive-to-negative ratio of 1:3 (i.e., 25% of the pairs are relevant and 75% are irrelevant).
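As a rough illustration of the binary cross-entropy objective mentioned above, the sketch below computes the loss from raw logits in a numerically stable form. It is plain Python rather than the actual training code, which is not included in this commit; the function name and the logit values are hypothetical, and the toy batch mirrors the stated 25%/75% split with one relevant and three irrelevant pairs.

```python
import math
from typing import List


def bce_with_logits(logit: float, label: float) -> float:
    """Binary cross-entropy for one pair, computed stably from the raw logit.

    Equivalent to -[y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))].
    """
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


# Hypothetical batch: one relevant pair (label 1) and three irrelevant pairs (label 0).
logits: List[float] = [2.1, -1.3, 0.4, -3.0]
labels: List[float] = [1.0, 0.0, 0.0, 0.0]

mean_loss = sum(bce_with_logits(x, y) for x, y in zip(logits, labels)) / len(logits)
print(f"mean BCE loss: {mean_loss:.4f}")
```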
+#### Hyperparameters
+We trained the model on a single Tesla V100 GPU with 32 GB of memory for 10 epochs (i.e., 312.4k steps) using a batch size of 32. We used the AdamW optimizer with an initial learning rate of 2e-05, weight decay of 0.01, learning rate warmup over the first 500 steps, and linear decay of the learning rate. The sequence length was limited to 512 tokens.
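The learning-rate schedule described above can be sketched as a small function. This assumes a linear warmup (the text only says "warmup") and takes 312,400 as the total step count, read off the rounded "312.4k" figure; the actual scheduler implementation used for training is not shown in this commit.

```python
def learning_rate(
    step: int,
    base_lr: float = 2e-5,
    warmup_steps: int = 500,
    total_steps: int = 312_400,  # assumption: rounded from the stated "312.4k steps"
) -> float:
    """Linear warmup to base_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


# Print the schedule at a few representative steps.
for step in (0, 250, 500, 156_450, 312_400):
    print(f"step {step:>7}: lr = {learning_rate(step):.3e}")
```

The peak value of 2e-05 is reached at step 500, after which the rate falls in a straight line to zero at the final step.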
+#### Data
+We used the French version of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset to fine-tune our model. mMARCO is a multilingual, machine-translated version of MS MARCO, a popular large-scale information retrieval (IR) dataset.
+## Citation
+***
+```bibtex
+@online{louis2023,
+   author    = {Antoine Louis},
+   title     = {crossencoder-distilcamembert-base-mmarcoFR: A Cross-Encoder Model Trained on 1M sentence pairs in French},
+   publisher = {Hugging Face},
+   month     = {september},
+   year      = {2023},
+   url       = {https://huggingface.co/antoinelouis/crossencoder-distilcamembert-base-mmarcoFR},
+}
+```

config.json ADDED

	@@ -0,0 +1,34 @@

+{
+  "_name_or_path": "cmarkea/distilcamembert-base",
+  "architectures": [
+    "CamembertForSequenceClassification"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "bos_token_id": 0,
+  "classifier_dropout": null,
+  "eos_token_id": 2,
+  "gradient_checkpointing": false,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "id2label": {
+    "0": "LABEL_0"
+  },
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "label2id": {
+    "LABEL_0": 0
+  },
+  "layer_norm_eps": 1e-05,
+  "max_position_embeddings": 514,
+  "model_type": "camembert",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 6,
+  "pad_token_id": 1,
+  "position_embedding_type": "absolute",
+  "torch_dtype": "float32",
+  "transformers_version": "4.28.1",
+  "type_vocab_size": 1,
+  "use_cache": true,
+  "vocab_size": 32005
+}

dev_scores.csv ADDED

	@@ -0,0 +1,2 @@


+r-precision,mrr@10,recall@10,recall@20,recall@50,recall@100,model
+27.28,43.71,80.30,89.10,95.55,98.60,crossencoder-distilcamembert-base-mmarcoFR

pytorch_model.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:65d8ec5eebab10f0612153bb96c72a431789f08a979094582893116dc6107d72
+size 272422133

sentencepiece.bpe.model ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:988bc5a00281c6d210a5d34bd143d0363741a432fefe741bf71e61b1869d4314
+size 810912

special_tokens_map.json ADDED

	@@ -0,0 +1,19 @@

+{
+  "additional_special_tokens": [
+    "<s>NOTUSED",
+    "</s>NOTUSED"
+  ],
+  "bos_token": "<s>",
+  "cls_token": "<s>",
+  "eos_token": "</s>",
+  "mask_token": {
+    "content": "<mask>",
+    "lstrip": true,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": "<pad>",
+  "sep_token": "</s>",
+  "unk_token": "<unk>"
+}

tokenizer.json ADDED

The diff for this file is too large to render.

tokenizer_config.json ADDED

	@@ -0,0 +1,24 @@

+{
+  "additional_special_tokens": [
+    "<s>NOTUSED",
+    "</s>NOTUSED"
+  ],
+  "bos_token": "<s>",
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "<s>",
+  "eos_token": "</s>",
+  "mask_token": {
+    "__type": "AddedToken",
+    "content": "<mask>",
+    "lstrip": true,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "model_max_length": 512,
+  "pad_token": "<pad>",
+  "sep_token": "</s>",
+  "sp_model_kwargs": {},
+  "tokenizer_class": "CamembertTokenizer",
+  "unk_token": "<unk>"
+}