---
language: 
- ja
- en
license_name: sarahina-non-commercial-license
license_link: LICENSE
tags:
- transformers
- sentence-similarity
- feature-extraction
- sentence-transformers
pipeline_tag: sentence-similarity
inference: false
datasets:
  - hpprc/emb
  - cl-nagoya/auto-wiki-qa
  - cl-nagoya/ruri-dataset-ft
  - hpprc/mqa-ja
  - izumi-lab/llm-japanese-dataset
  - sentence-transformers/NQ-retrieval
  - sbintuitions/JSQuAD
  - SkelterLabsInc/JaQuAD

---

# Sarashina-embedding-v1-1b

**[日本語のREADME/Japanese README](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/README_JA.md)**


"Sarashina-embedding-v1-1b" is a Japanese text embedding model, based on the 1.2B-parameter Japansese LLM "Sarashina". 
We trained this model with multi-stage contrastive learning. We achieved the state-of-the-art average score in the average of 16 datasets in  [JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB)(Japanese Massive Text Embedding Benchmark).

This model maps sentences & paragraphs to a 1792-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

## Model Details

### Model Description
- **Model Type:** Sentence Transformer
<!-- - **Base model:** [Unknown](https://huggingface.co/unknown) -->
- **Maximum Sequence Length:** 8,192 tokens
- **Output Dimensionality:** 1,792 dimensions
- **Similarity Function:** Cosine Similarity
- **Language:**  Japanese
- **License:** [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/LICENSE)


### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: LlamaModel 
  (1): Pooling({'word_embedding_dimension': 1792, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': True, 'include_prompt': False})
)
```

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("sbintuitions/sarashina-embedding-v1-1b")
# Run inference
sentences = [
    '更級日記は、平安時代中期に菅原孝標女によって書かれた回想録です。',
    'Sarashinaは、SB Intuitionsが開発した日本語大規模言語モデルです。これまでに7B, 13B, 70B, 8x70Bのモデルが公開されています。',
    '更科蕎麦とはなんですか?'
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1792]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```

**Note**

- You do not need to add prefixes such as "Query: " and "Document: " at the beginning of the input sentence.
- This model is licensed under the [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/LICENSE), which has restrictions on commercial use. If you are interested in utilizing this model for your business, please feel free to contact us through our [contact page](https://www.sbintuitions.co.jp/#contact).

## Training

"Sarashina-embedding-v1-1b" is created through the following two-stage learning process:

### Stage 1: Weakly-supervised Learning

To achieve generic text embedding performance across a wide range of domains, we performed contrastive training on weakly-supervised data consisting of our own web-crawled data and open data.

#### Dataset

|dataset|counts|
|:-:|:-:|
|AutoWikiQA|50,521,135|
|web-crawled data|47,370,649|
|MQA|12,941,472|
|llm-japanese-dataset|9,074,340|
|wikipedia|5,555,212|
|Quiz dataset|988,478|
|Natural Questions|132,796|
|JSQuAD|62,859|
|snow|62,758|
|JaQuAD|31,746|
|mkqa|3,318|
|||
|**total**|**126,744,763**|


### Step2: Supervised Fine-tuning

To enable the model to learn a more accurate query-document similarity, we performed supervised fine-tuning using the following dataset.


# Evaluation Results with [JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB)

 Model                                         |Max Tokens|Avg.      | Retrieval   | STS       | Classification   | Reranking   | Clustering   | PairClassification   |
|:----------------------------------------------|:----------|:----------|:------------|:----------|:-----------------|:------------|:-------------|:---------------------|
| OpenAI/text-embedding-3-large                 | 8191 |74.05 | 74.48   | 82.52     | 77.58        | 93.58   | 53.32        | 62.35                |
| [cl-nagoya/ruri-large](https://huggingface.co/intfloat/multilingual-e5-large)    | 512 |73.31     | 73.02       | **83.13**     | 77.43            | 92.99       | 51.82        | 62.29                |
| [pkshatech/GLuCoSE-base-ja-v2](https://huggingface.co/pkshatech/GLuCoSE-base-ja-v2)    | 512 |72.23     | 73.36       | 82.96     | 74.21            | 93.01       | 48.65        | **62.37**                |
| [pkshatech/RoSEtta-base-ja](https://huggingface.co/pkshatech/RoSEtta-base-ja)     |1024 |72.04     | 73.21       | 81.39     | 72.41            | 92.69       | 53.23        | 61.74                |
| [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large)          | 512|70.90     | 70.98       | 79.70     | 72.89            | 92.96       | 51.24        | 62.15                |
|||
|[**sarashina-embedding-v1-1b**](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b)(This model)|**8192**|**75.50**|**77.61**|82.71|**78.37**|**93.74**|**53.86**|62.00|


## License

This model is licensed under  [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/LICENSE).

**If you are interested in using this model for commercial purposes, please feel free to contact us through our [contact page](https://www.sbintuitions.co.jp/#contact).**