	---
	base_model: FacebookAI/xlm-roberta-large
	library_name: sentence-transformers
	metrics:
	- pearson_cosine
	- spearman_cosine
	- pearson_manhattan
	- spearman_manhattan
	- pearson_euclidean
	- spearman_euclidean
	- pearson_dot
	- spearman_dot
	- pearson_max
	- spearman_max
	pipeline_tag: sentence-similarity
	tags:
	- sentence-transformers
	- sentence-similarity
	- feature-extraction
	- mteb
	- bilingual
	model-index:
	- name: omarelshehy/arabic-english-sts-matryoshka
	  results:
	  - dataset:
	      config: en-en
	      name: MTEB STS17 (en-en)
	      revision: faeb762787bd10488a50c8b5be4a3b82e411949c
	      split: test
	      type: mteb/sts17-crosslingual-sts
	    metrics:
	    - type: cosine_pearson
	      value: 87.17053120821998
	    - type: cosine_spearman
	      value: 87.05959159411456
	    - type: euclidean_pearson
	      value: 87.63706739480517
	    - type: euclidean_spearman
	      value: 87.7675347222274
	    - type: main_score
	      value: 87.05959159411456
	    - type: manhattan_pearson
	      value: 87.7006832512623
	    - type: manhattan_spearman
	      value: 87.80128473941168
	    - type: pearson
	      value: 87.17053012311975
	    - type: spearman
	      value: 87.05959159411456
	    task:
	      type: STS
	  - dataset:
	      config: ar-ar
	      name: MTEB STS17 (ar-ar)
	      revision: faeb762787bd10488a50c8b5be4a3b82e411949c
	      split: test
	      type: mteb/sts17-crosslingual-sts
	    metrics:
	    - type: cosine_pearson
	      value: 82.22889478671283
	    - type: cosine_spearman
	      value: 83.0533648934447
	    - type: euclidean_pearson
	      value: 81.15891941165452
	    - type: euclidean_spearman
	      value: 82.14034597386936
	    - type: main_score
	      value: 83.0533648934447
	    - type: manhattan_pearson
	      value: 81.17463976232014
	    - type: manhattan_spearman
	      value: 82.09804987736345
	    - type: pearson
	      value: 82.22889389569819
	    - type: spearman
	      value: 83.0529662284269
	    task:
	      type: STS
	  - dataset:
	      config: en-ar
	      name: MTEB STS17 (en-ar)
	      revision: faeb762787bd10488a50c8b5be4a3b82e411949c
	      split: test
	      type: mteb/sts17-crosslingual-sts
	    metrics:
	    - type: cosine_pearson
	      value: 79.79480510851795
	    - type: cosine_spearman
	      value: 79.67609346073252
	    - type: euclidean_pearson
	      value: 81.64087935350051
	    - type: euclidean_spearman
	      value: 80.52588414802709
	    - type: main_score
	      value: 79.67609346073252
	    - type: manhattan_pearson
	      value: 81.57042957417305
	    - type: manhattan_spearman
	      value: 80.44331526051143
	    - type: pearson
	      value: 79.79480418294698
	    - type: spearman
	      value: 79.67609346073252
	    task:
	      type: STS
	language:
	- ar
	- en
	license: apache-2.0
	---

	# SentenceTransformer based on FacebookAI/xlm-roberta-large

	This is a bilingual (Arabic-English) [sentence-transformers](https://www.SBERT.net) model fine-tuned from [FacebookAI/xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large). It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

	The model handles Arabic and English individually 🌐 and also across languages (an Arabic sentence can be compared directly with an English one), which gives developers and researchers a flexible base for building on Arabic models! 💡

	📊 Metrics from MTEB are promising, but don't rely on them alone: test the model yourself and see if it fits your needs! ✅

	## Matryoshka Embeddings 🪆

	This model supports Matryoshka embeddings, so you can truncate embeddings to a smaller size to reduce memory usage and speed up similarity computations, depending on your task requirements. Available sizes are 1024, 768, 512, 256, 128, and 64.
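To make the memory trade-off concrete, here is a rough back-of-the-envelope sketch. It assumes embeddings are stored as float32 (4 bytes per value) with no index overhead; the helper name `index_size_mib` and the corpus size of one million vectors are illustrative choices, not part of this model or of any library.

```python
# Assumption: each embedding value is stored as a 4-byte float32.
BYTES_PER_FLOAT32 = 4

# Truncation sizes listed for this model.
SIZES = [1024, 768, 512, 256, 128, 64]


def index_size_mib(num_vectors, dim, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw storage for `num_vectors` embeddings of length `dim`, in MiB."""
    return num_vectors * dim * bytes_per_value / (1024 ** 2)


# Hypothetical corpus of one million sentences.
for dim in SIZES:
    print(f"{dim:>4} dims: {index_size_mib(1_000_000, dim):8.1f} MiB")
```

Under these assumptions, storage scales linearly with the embedding size, so going from 1024 to 64 dimensions cuts raw storage by a factor of 16.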

	Smaller sizes typically trade some accuracy for lower storage and faster search, so pick the size that fits your use case.
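If you already have full-size embeddings stored, truncation can also be done after the fact. The sketch below shows the general idea under stated assumptions: keep the first `dim` components, then rescale to unit length so that dot products behave like cosine similarities again. The function name `truncate_and_normalize` and the toy 8-dimensional vector are hypothetical; with real embeddings you would pass a 1024-dimensional vector (for example a row from `embeddings.tolist()`).

```python
import math


def truncate_and_normalize(vector, dim):
    """Keep the first `dim` components of `vector` and rescale to unit length."""
    if not 0 < dim <= len(vector):
        raise ValueError("dim must be between 1 and len(vector)")
    head = list(vector[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        # An all-zero prefix cannot be normalized; return it unchanged.
        return head
    return [x / norm for x in head]


# Toy 8-dimensional vector standing in for a real 1024-dimensional embedding.
full = [0.5, -0.5, 0.5, -0.5, 0.1, 0.2, -0.1, 0.3]
small = truncate_and_normalize(full, 4)
print(small)  # [0.5, -0.5, 0.5, -0.5]
```

Passing `truncate_dim` when loading the model, as in the usage section, is the simpler route when you encode from scratch; a manual step like this is mainly useful when you want to compare several sizes without re-encoding.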

	## Model Details

	### Model Description
	- Model Type: Sentence Transformer
	- Base model: [FacebookAI/xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large) <!-- at revision c23d21b0620b635a76227c604d44e43a9f0ee389 -->
	- Maximum Sequence Length: 512 tokens
	- Output Dimensionality: 1024 dimensions
	- Similarity Function: Cosine Similarity
	<!-- - Training Dataset: Unknown -->
	<!-- - Language: Unknown -->
	<!-- - License: Unknown -->
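For readers unfamiliar with the similarity function listed above, cosine similarity is the dot product of two vectors divided by the product of their lengths. The following is a minimal dependency-free sketch of that formula, not the library's implementation; the function name and the toy vectors are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 2-dimensional vectors; real embeddings have up to 1024 dimensions.
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal vectors
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Scores range from -1 to 1, and higher values mean the two texts are closer in meaning according to the model.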



	## Usage

	### Direct Usage (Sentence Transformers)

	First install the Sentence Transformers library:

	```bash
	pip install -U sentence-transformers
	```

	Then you can load this model and run inference.
	```python
	from sentence_transformers import SentenceTransformer

	# Pick one of the supported Matryoshka sizes: 1024, 768, 512, 256, 128, or 64
	matryoshka_dim = 768
	# Download from the 🤗 Hub; truncate_dim shortens every embedding to matryoshka_dim
	model = SentenceTransformer("omarelshehy/arabic-english-sts-matryoshka", truncate_dim=matryoshka_dim)
	# Run inference
	sentences = [
	    "She enjoyed reading books by the window as the rain poured outside.",
	    "كانت تستمتع بقراءة الكتب بجانب النافذة بينما كانت الأمطار تتساقط في الخارج.",
	    "Reading by the window was her favorite thing, especially during rainy days.",
	]
	embeddings = model.encode(sentences)
	print(embeddings.shape)
	# [3, 768]

	# Get the similarity scores for the embeddings
	similarities = model.similarity(embeddings, embeddings)
	print(similarities.shape)
	# [3, 3]
	```
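Once you have the similarity matrix, a common next step is to find, for each sentence, its closest match among the others. The sketch below works on a plain list of lists (for instance what `similarities.tolist()` would give you); the helper name `best_matches` is hypothetical, and the scores are made up for illustration and are not output from this model.

```python
def best_matches(similarity_rows):
    """For each row, return the index of the most similar *other* item.

    Expects a square matrix with at least two rows.
    """
    matches = []
    for i, row in enumerate(similarity_rows):
        # Skip the diagonal: every sentence is trivially most similar to itself.
        candidates = [(score, j) for j, score in enumerate(row) if j != i]
        matches.append(max(candidates)[1])
    return matches


# Made-up scores for three sentences; not real model output.
toy_similarities = [
    [1.00, 0.90, 0.70],
    [0.90, 1.00, 0.65],
    [0.70, 0.65, 1.00],
]
print(best_matches(toy_similarities))  # [1, 0, 0]
```

With the three example sentences above, a cross-lingual model would be expected to pair the English sentence with its Arabic translation, but you should check the actual scores on your own data.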


	## Citation

	### BibTeX

	#### Sentence Transformers
	```bibtex
	@inproceedings{reimers-2019-sentence-bert,
	title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
	author = "Reimers, Nils and Gurevych, Iryna",
	booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
	month = "11",
	year = "2019",
	publisher = "Association for Computational Linguistics",
	url = "https://arxiv.org/abs/1908.10084",
	}
	```

	#### MatryoshkaLoss
	```bibtex
	@misc{kusupati2024matryoshka,
	title={Matryoshka Representation Learning},
	author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
	year={2024},
	eprint={2205.13147},
	archivePrefix={arXiv},
	primaryClass={cs.LG}
	}
	```

	#### MultipleNegativesRankingLoss
	```bibtex
	@misc{henderson2017efficient,
	title={Efficient Natural Language Response Suggestion for Smart Reply},
	author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
	year={2017},
	eprint={1705.00652},
	archivePrefix={arXiv},
	primaryClass={cs.CL}
	}
	```