	---
	language:
	- ja
	- en
	license_name: sarahina-non-commercial-license
	license_link: LICENSE
	tags:
	- transformers
	- sentence-similarity
	- feature-extraction
	- sentence-transformers
	pipeline_tag: sentence-similarity
	inference: false
	datasets:
	- hpprc/emb
	- cl-nagoya/auto-wiki-qa
	- cl-nagoya/ruri-dataset-ft
	- hpprc/mqa-ja
	- izumi-lab/llm-japanese-dataset
	- sentence-transformers/NQ-retrieval
	- sbintuitions/JSQuAD
	- SkelterLabsInc/JaQuAD
	---

	# Sarashina-Embedding-v1-1B

	[日本語のREADME/Japanese README](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/README_JA.md)

	"Sarashina-Embedding-v1-1B" is a Japanese text embedding model based on the 1.2B-parameter Japanese LLM "[Sarashina2.1-1B](https://huggingface.co/sbintuitions/sarashina2.1-1b)".
	We trained this model with multi-stage contrastive learning. It achieved the state-of-the-art average score across the 16 datasets of [JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB) (Japanese Massive Text Embedding Benchmark).

	This model maps sentences & paragraphs to a 1792-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

	## Model Details

	### Model Description

	- Model Type: Sentence Transformer
	- Base model: [Sarashina2.1-1B](https://huggingface.co/sbintuitions/sarashina2.1-1b)
	- Maximum Sequence Length: 8,192 tokens
	- Output Dimensionality: 1,792 dimensions
	- Similarity Function: Cosine Similarity
	- Language: Japanese
	- License: [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/LICENSE)
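
As a rough illustration of the similarity function listed above, cosine similarity compares two embedding vectors by the angle between them and ignores their lengths. The sketch below uses plain Python and toy 3-dimensional vectors in place of real 1,792-dimensional embeddings; the function name `cosine_similarity` and the vector values are assumptions made for this example, not part of the model or any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings (illustrative values only).
v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]   # same direction as v1, twice the length
v3 = [-3.0, 0.0, 1.0]  # orthogonal to v1

same_direction = cosine_similarity(v1, v2)  # close to 1.0
orthogonal = cosine_similarity(v1, v3)      # close to 0.0
```

Because the score depends only on direction, embeddings do not need to be length-normalized before comparison.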

	### Full Model Architecture

	```
	SentenceTransformer(
	(0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: LlamaModel
	(1): Pooling({'word_embedding_dimension': 1792, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': True, 'include_prompt': False})
	)
	```
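
The `'pooling_mode_lasttoken': True` setting above means the sentence embedding is taken from the hidden state of the final non-padding token rather than from a mean over all tokens. A minimal sketch of that selection, using nested Python lists and a toy hidden size of 2; `last_token_pool` is a hypothetical helper written for this example, not the Sentence Transformers implementation:

```python
def last_token_pool(hidden_states, attention_mask):
    """Pick, per sequence, the hidden state of the last attended token.

    hidden_states:  [batch][seq_len][hidden] nested lists of floats
    attention_mask: [batch][seq_len] lists of 0/1 (1 = real token, 0 = padding)
    """
    pooled = []
    for states, mask in zip(hidden_states, attention_mask):
        positions = [i for i, m in enumerate(mask) if m == 1]
        if not positions:
            raise ValueError("sequence has no attended tokens")
        # The last position with mask == 1 is the final real token.
        pooled.append(states[positions[-1]])
    return pooled

hidden = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],  # 3 real tokens
    [[1.0, 1.1], [1.2, 1.3], [0.0, 0.0]],  # 2 real tokens + 1 padding position
]
mask = [
    [1, 1, 1],
    [1, 1, 0],
]
sentence_vectors = last_token_pool(hidden, mask)
```

With a causal decoder such as `LlamaModel`, the last token is the only position that has attended to the whole input, which is the usual motivation for this pooling mode.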

	## Usage

	### Direct Usage (Sentence Transformers)

	First install the Sentence Transformers library:

	```bash
	pip install -U sentence-transformers
	```

	Then you can load this model and run inference.

	```python
	from sentence_transformers import SentenceTransformer

	# Download from the 🤗 Hub
	model = SentenceTransformer("sbintuitions/sarashina-embedding-v1-1b")
	# Run inference
	sentences = [
	    '更級日記は、平安時代中期に菅原孝標女によって書かれた回想録です。',
	    'Sarashinaは、SB Intuitionsが開発した日本語大規模言語モデルです。これまでに7B, 13B, 70B, 8x70Bのモデルが公開されています。',
	    '更科蕎麦とはなんですか?'
	]
	embeddings = model.encode(sentences)
	print(embeddings.shape)
	# (3, 1792)

	# Get the similarity scores for the embeddings
	similarities = model.similarity(embeddings, embeddings)
	print(similarities.shape)
	# torch.Size([3, 3])
	```

	Note

	- You do not need to add prefixes such as "Query: " and "Document: " at the beginning of the input sentence.
	- This model is licensed under the [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/LICENSE), which has restrictions on commercial use. If you are interested in utilizing this model for your business, please feel free to contact us through our [contact page](https://www.sbintuitions.co.jp/#contact).
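
Since no prefixes are needed, queries and documents are encoded the same way, and semantic search reduces to ranking documents by their similarity to the query. The sketch below shows that ranking step on toy vectors; `rank_documents` is a hypothetical helper, and in practice the vectors would come from `model.encode` as in the example above.

```python
import math

def rank_documents(query_vec, doc_vecs):
    """Return (indices sorted by descending cosine similarity, raw scores)."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    scores = [cos(query_vec, d) for d in doc_vecs]
    order = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
    return order, scores

# Toy vectors (illustrative values only). With the real model these would be
# rows of model.encode([...]), with query and documents passed as plain text.
query = [1.0, 0.0, 1.0]
docs = [
    [0.0, 1.0, 0.0],  # unrelated to the query
    [1.0, 0.1, 0.9],  # nearly the same direction as the query
    [0.5, 0.5, 0.0],  # partially related
]
order, scores = rank_documents(query, docs)
```

Here `order[0]` is the index of the best-matching document.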

	## Training

	"Sarashina-Embedding-v1-1B" was trained through the following two-stage process:

	### Stage 1: Weakly-supervised Learning

	To achieve generic text embedding performance across a wide range of domains, we performed contrastive training on weakly-supervised data consisting of our own web-crawled data and open data.
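
This card does not state the exact training objective. A common formulation of contrastive learning for text embeddings is the InfoNCE loss with in-batch negatives, sketched below purely as an illustration. The function name, the temperature value, and the toy vectors are all assumptions and are not details of this model's training.

```python
import math

def info_nce_loss(query_vecs, doc_vecs, temperature=0.05):
    """Mean InfoNCE loss where doc_vecs[i] is the positive for query_vecs[i]
    and every other document in the batch acts as a negative."""
    def normalize(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    q = [normalize(v) for v in query_vecs]
    d = [normalize(v) for v in doc_vecs]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of query i to every document, scaled by temperature.
        logits = [sum(a * b for a, b in zip(qi, dj)) / temperature for dj in d]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching document as the correct class.
        total += log_denominator - logits[i]
    return total / len(q)

queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_docs = [[0.9, 0.1], [0.1, 0.9]]   # each positive is close to its query
shuffled_docs = [[0.1, 0.9], [0.9, 0.1]]  # positives swapped between queries

loss_aligned = info_nce_loss(queries, aligned_docs)
loss_shuffled = info_nce_loss(queries, shuffled_docs)
```

The loss is small when each query is closest to its own paired text and large when another text in the batch is closer, which is what pushes paired texts together in the embedding space.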

	#### Dataset

	| dataset | counts |
	|:-:|:-:|
	| AutoWikiQA | 50,521,135 |
	| web-crawled data | 47,370,649 |
	| MQA | 12,941,472 |
	| llm-japanese-dataset | 9,074,340 |
	| wikipedia | 5,555,212 |
	| Quiz dataset | 988,478 |
	| Natural Questions | 132,796 |
	| JSQuAD | 62,859 |
	| snow | 62,758 |
	| JaQuAD | 31,746 |
	| mkqa | 3,318 |
	| | |
	| total | 126,744,763 |
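
As a quick sanity check on the table, the per-dataset counts sum to the stated total, and the two largest sources (AutoWikiQA and the web-crawled data) together account for roughly 77% of the pairs. The snippet below only redoes that arithmetic from the numbers in the table; the variable names are our own.

```python
# Pair counts copied from the Stage 1 dataset table.
counts = {
    "AutoWikiQA": 50_521_135,
    "web-crawled data": 47_370_649,
    "MQA": 12_941_472,
    "llm-japanese-dataset": 9_074_340,
    "wikipedia": 5_555_212,
    "Quiz dataset": 988_478,
    "Natural Questions": 132_796,
    "JSQuAD": 62_859,
    "snow": 62_758,
    "JaQuAD": 31_746,
    "mkqa": 3_318,
}

total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}
top_two_share = shares["AutoWikiQA"] + shares["web-crawled data"]

for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<22}{share:7.2%}")
```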

	### Stage 2: Supervised Fine-tuning

	To enable the model to learn a more accurate query-document similarity, we performed supervised fine-tuning using the following dataset.

	## Evaluation Results with [JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB)

	| Model | Max Tokens | Avg. | Retrieval | STS | Classification | Reranking | Clustering | PairClassification |
	|:--|:--|:--|:--|:--|:--|:--|:--|:--|
	| OpenAI/text-embedding-3-large | 8191 | 74.05 | 74.48 | 82.52 | 77.58 | 93.58 | 53.32 | 62.35 |
	| [cl-nagoya/ruri-large](https://huggingface.co/cl-nagoya/ruri-large) | 512 | 73.31 | 73.02 | 83.13 | 77.43 | 92.99 | 51.82 | 62.29 |
	| [pkshatech/GLuCoSE-base-ja-v2](https://huggingface.co/pkshatech/GLuCoSE-base-ja-v2) | 512 | 72.23 | 73.36 | 82.96 | 74.21 | 93.01 | 48.65 | 62.37 |
	| [pkshatech/RoSEtta-base-ja](https://huggingface.co/pkshatech/RoSEtta-base-ja) | 1024 | 72.04 | 73.21 | 81.39 | 72.41 | 92.69 | 53.23 | 61.74 |
	| [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 512 | 70.90 | 70.98 | 79.70 | 72.89 | 92.96 | 51.24 | 62.15 |
	| | | | | | | | | |
	| [sarashina-embedding-v1-1b](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b) (This model) | 8192 | 75.50 | 77.61 | 82.71 | 78.37 | 93.74 | 53.86 | 62.00 |

	## License

	This model is licensed under the [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/LICENSE).

	If you are interested in using this model for commercial purposes, please feel free to contact us through our [contact page](https://www.sbintuitions.co.jp/#contact).