	---
	base_model: FacebookAI/xlm-roberta-large
	library_name: sentence-transformers
	metrics:
	- pearson_cosine
	- spearman_cosine
	- pearson_manhattan
	- spearman_manhattan
	- pearson_euclidean
	- spearman_euclidean
	- pearson_dot
	- spearman_dot
	- pearson_max
	- spearman_max
	pipeline_tag: sentence-similarity
	tags:
	- sentence-transformers
	- sentence-similarity
	- feature-extraction
	- mteb
	- bilingual
	model-index:
	- name: omarelshehy/Arabic-English-Matryoshka-STS
	  results:
	  - dataset:
	      config: en-ar
	      name: MTEB STS17 (en-ar)
	      revision: faeb762787bd10488a50c8b5be4a3b82e411949c
	      split: test
	      type: mteb/sts17-crosslingual-sts
	    metrics:
	    - type: cosine_pearson
	      value: 79.79480510851795
	    - type: cosine_spearman
	      value: 79.67609346073252
	    - type: euclidean_pearson
	      value: 81.64087935350051
	    - type: euclidean_spearman
	      value: 80.52588414802709
	    - type: main_score
	      value: 79.67609346073252
	    - type: manhattan_pearson
	      value: 81.57042957417305
	    - type: manhattan_spearman
	      value: 80.44331526051143
	    - type: pearson
	      value: 79.79480418294698
	    - type: spearman
	      value: 79.67609346073252
	    task:
	      type: STS
	  - dataset:
	      config: ar-ar
	      name: MTEB STS17 (ar-ar)
	      revision: faeb762787bd10488a50c8b5be4a3b82e411949c
	      split: test
	      type: mteb/sts17-crosslingual-sts
	    metrics:
	    - type: cosine_pearson
	      value: 82.22889478671283
	    - type: cosine_spearman
	      value: 83.0533648934447
	    - type: euclidean_pearson
	      value: 81.15891941165452
	    - type: euclidean_spearman
	      value: 82.14034597386936
	    - type: main_score
	      value: 83.0533648934447
	    - type: manhattan_pearson
	      value: 81.17463976232014
	    - type: manhattan_spearman
	      value: 82.09804987736345
	    - type: pearson
	      value: 82.22889389569819
	    - type: spearman
	      value: 83.0529662284269
	    task:
	      type: STS
	  - dataset:
	      config: en-en
	      name: MTEB STS17 (en-en)
	      revision: faeb762787bd10488a50c8b5be4a3b82e411949c
	      split: test
	      type: mteb/sts17-crosslingual-sts
	    metrics:
	    - type: cosine_pearson
	      value: 87.17053120821998
	    - type: cosine_spearman
	      value: 87.05959159411456
	    - type: euclidean_pearson
	      value: 87.63706739480517
	    - type: euclidean_spearman
	      value: 87.7675347222274
	    - type: main_score
	      value: 87.05959159411456
	    - type: manhattan_pearson
	      value: 87.7006832512623
	    - type: manhattan_spearman
	      value: 87.80128473941168
	    - type: pearson
	      value: 87.17053012311975
	    - type: spearman
	      value: 87.05959159411456
	    task:
	      type: STS
	language:
	- ar
	- en
	---

	# SentenceTransformer based on FacebookAI/xlm-roberta-large

	This is a bilingual (Arabic-English) [sentence-transformers](https://www.SBERT.net) model fine-tuned from [FacebookAI/xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large). It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

	The model handles Arabic and English well on their own, and it also works across the two languages, for example when comparing an Arabic sentence with an English one. This opens up many flexible applications and gives researchers who want to develop Arabic models further a starting point :)

	The MTEB metrics are good, but don't rely on them alone: test the model on your own data first and see whether it works for you.
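
One way to run such a check is to score a handful of your own labelled sentence pairs and compare the model's similarities with your labels. The front matter reports `cosine_spearman` as the main score, so the sketch below computes a Spearman rank correlation using only the standard library. The helper names and the example numbers are hypothetical, and the cosine scores are assumed to have been computed already (for instance with `model.similarity`).

```python
import math

def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Hypothetical data: model cosine scores and human labels for four pairs.
model_scores = [0.91, 0.15, 0.62, 0.40]
gold_scores = [4.8, 0.5, 3.0, 3.5]
print(spearman(model_scores, gold_scores))  # 0.8
```

A value close to 1 means the model orders your pairs much as your labels do. With only a few pairs the estimate is noisy, so treat it as a sanity check rather than a benchmark.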

	## Model Details

	### Model Description
	- Model Type: Sentence Transformer
	- Base model: [FacebookAI/xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large) <!-- at revision c23d21b0620b635a76227c604d44e43a9f0ee389 -->
	- Maximum Sequence Length: 512 tokens
	- Output Dimensionality: 1024 dimensions
	- Similarity Function: Cosine Similarity
	<!-- - Training Dataset: Unknown -->
	<!-- - Language: Unknown -->
	<!-- - License: Unknown -->
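
To make the similarity function concrete, here is a minimal sketch of cosine similarity in plain Python. The function name is hypothetical, and the toy 3-dimensional vectors stand in for the model's 1024-dimensional embeddings. In practice `model.similarity` computes this for you.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Toy 3-dimensional vectors standing in for 1024-dimensional embeddings.
print(cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 2.0, 0.0]))  # 0.0
```

The score depends only on direction, not on vector length, which is why the second pair scores 0.0 even though the vectors have different norms.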



	## Usage

	### Direct Usage (Sentence Transformers)

	First install the Sentence Transformers library:

	```bash
	pip install -U sentence-transformers
	```

	Then you can load this model and run inference.
	```python
	from sentence_transformers import SentenceTransformer

	# Download from the 🤗 Hub
	model = SentenceTransformer("omarelshehy/Arabic-English-Matryoshka-STS")
	# Run inference
	sentences = [
	    'حب سعيد الواضح للأدب والموسيقى الغربية يتصادم باستمرار مع غضبه الصالح لما فعله الغرب للبقية.',
	    'Said loves Western literature and music but is angry about what the West has done to the rest.',
	    'سعيد يعتقد أن الغرب لديه أفضل من كل شيء.',
	]
	embeddings = model.encode(sentences)
	print(embeddings.shape)
	# (3, 1024)

	# Get the similarity scores for the embeddings
	similarities = model.similarity(embeddings, embeddings)
	print(similarities.shape)
	# torch.Size([3, 3])
	```
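
The model name and the MatryoshkaLoss citation in this card suggest that the leading components of each embedding should stay useful on their own, so you can trade some accuracy for smaller vectors. This card does not list which sizes were trained, so any target dimension you pick is an assumption to validate on your own data. The sketch below uses a hypothetical helper on plain Python lists (for example `embeddings.tolist()`): it keeps the first `dim` components and rescales the result to unit length.

```python
import math

def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` components and rescale to unit L2 norm."""
    if not 0 < dim <= len(embedding):
        raise ValueError("dim must be between 1 and the embedding length")
    head = list(embedding[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero prefix cannot be rescaled
    return [x / norm for x in head]

# Toy 4-dimensional vector standing in for a 1024-dimensional embedding.
full = [3.0, 4.0, 12.0, 84.0]
print(truncate_and_normalize(full, 2))  # [0.6, 0.8]
```

Recent releases of Sentence Transformers also accept a `truncate_dim` argument when constructing `SentenceTransformer`. The manual version above is shown only to make the operation explicit.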



	## Citation

	### BibTeX

	#### Sentence Transformers
	```bibtex
	@inproceedings{reimers-2019-sentence-bert,
	    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
	    author = "Reimers, Nils and Gurevych, Iryna",
	    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
	    month = "11",
	    year = "2019",
	    publisher = "Association for Computational Linguistics",
	    url = "https://arxiv.org/abs/1908.10084",
	}
	```

	#### MatryoshkaLoss
	```bibtex
	@misc{kusupati2024matryoshka,
	    title={Matryoshka Representation Learning},
	    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
	    year={2024},
	    eprint={2205.13147},
	    archivePrefix={arXiv},
	    primaryClass={cs.LG}
	}
	```

	#### MultipleNegativesRankingLoss
	```bibtex
	@misc{henderson2017efficient,
	    title={Efficient Natural Language Response Suggestion for Smart Reply},
	    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
	    year={2017},
	    eprint={1705.00652},
	    archivePrefix={arXiv},
	    primaryClass={cs.CL}
	}
	```