File size: 6,550 Bytes
e3cd543
 
 
 
 
 
 
 
 
 
 
 
 
 
 
cfb24e0
 
e3cd543
 
cfb24e0
 
 
e3cd543
 
5072705
e3cd543
 
 
a80319c
edafeb4
e3cd543
 
 
 
 
 
a80319c
e3cd543
a80319c
97f0a0f
 
e3cd543
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
a80319c
e3cd543
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
5072705
e3cd543
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
97f0a0f
e3cd543
edafeb4
 
 
 
 
 
 
 
 
 
 
97f0a0f
e3cd543
 
 
 
 
 
 
 
 
 
 
 
 
 
 
a80319c
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
---
language: 
- ja
- en
license_name: sarahina-non-commercial-license
license_link: LICENSE
tags:
- transformers
- sentence-similarity
- feature-extraction
- sentence-transformers
pipeline_tag: sentence-similarity
inference: false
datasets:
  - hpprc/emb
  - cl-nagoya/auto-wiki-qa
  - cl-nagoya/ruri-dataset-ft
  - hpprc/mqa-ja
  - izumi-lab/llm-japanese-dataset
  - sentence-transformers/NQ-retrieval
  - sbintuitions/JSQuAD
  - SkelterLabsInc/JaQuAD
---

# Sarashina-Embedding-v1-1B

**[日本語のREADME/Japanese README](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/README_JA.md)**

"Sarashina-Embedding-v1-1B" is a Japanese text embedding model, based on the 1.2B-parameter Japansese LLM "[Sarashina2.1-1B](https://huggingface.co/sbintuitions/sarashina2.1-1b)".
We trained this model with multi-stage contrastive learning. We achieved the state-of-the-art average score in the average of 16 datasets in  [JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB) (Japanese Massive Text Embedding Benchmark).

This model maps sentences & paragraphs to a 1792-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

## Model Details

### Model Description

- **Model Type:** Sentence Transformer
- **Base model:** [Sarashina2.1-1B](https://huggingface.co/sbintuitions/sarashina2.1-1b)
- **Maximum Sequence Length:** 8,192 tokens
- **Output Dimensionality:** 1,792 dimensions
- **Similarity Function:** Cosine Similarity
- **Language:**  Japanese
- **License:** [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/LICENSE)

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: LlamaModel 
  (1): Pooling({'word_embedding_dimension': 1792, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': True, 'include_prompt': False})
)
```

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.

```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("sbintuitions/sarashina-embedding-v1-1b")
# Run inference
sentences = [
    '更級日記は、平安時代中期に菅原孝標女によって書かれた回想録です。',
    'Sarashinaは、SB Intuitionsが開発した日本語大規模言語モデルです。これまでに7B, 13B, 70B, 8x70Bのモデルが公開されています。',
    '更科蕎麦とはなんですか?'
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1792]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```

**Note**

- You do not need to add prefixes such as "Query: " and "Document: " at the beginning of the input sentence.
- This model is licensed under the [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/LICENSE), which has restrictions on commercial use. If you are interested in utilizing this model for your business, please feel free to contact us through our [contact page](https://www.sbintuitions.co.jp/#contact).

## Training

"Sarashina-Embedding-v1-1B" is created through the following two-stage learning process:

### Stage 1: Weakly-supervised Learning

To achieve generic text embedding performance across a wide range of domains, we performed contrastive training on weakly-supervised data consisting of our own web-crawled data and open data.

#### Dataset

|dataset|counts|
|:-:|:-:|
|AutoWikiQA|50,521,135|
|web-crawled data|47,370,649|
|MQA|12,941,472|
|llm-japanese-dataset|9,074,340|
|wikipedia|5,555,212|
|Quiz dataset|988,478|
|Natural Questions|132,796|
|JSQuAD|62,859|
|snow|62,758|
|JaQuAD|31,746|
|mkqa|3,318|
|||
|**total**|**126,744,763**|

### Step2: Supervised Fine-tuning

To enable the model to learn a more accurate query-document similarity, we performed supervised fine-tuning using the following dataset.

#### Dataset

|dataset|counts|
|:-:|:-:|
|JSNLI|141,388 |
|NU-MNLI|67,987|
|Mr. TyDi (only Japanese subset)| 3,697 |
|Natural Question (sampled)| 20,000|
|||
|**total**|**233,072**|

# Evaluation Results with [JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB)

 Model                                         |Max Tokens|Avg.      | Retrieval   | STS       | Classification   | Reranking   | Clustering   | PairClassification   |
|:----------------------------------------------|:----------|:----------|:------------|:----------|:-----------------|:------------|:-------------|:---------------------|
| OpenAI/text-embedding-3-large                 | 8191 |74.05 | 74.48   | 82.52     | 77.58        | 93.58   | 53.32        | 62.35                |
| [cl-nagoya/ruri-large](https://huggingface.co/intfloat/multilingual-e5-large)    | 512 |73.31     | 73.02       | **83.13**     | 77.43            | 92.99       | 51.82        | 62.29                |
| [pkshatech/GLuCoSE-base-ja-v2](https://huggingface.co/pkshatech/GLuCoSE-base-ja-v2)    | 512 |72.23     | 73.36       | 82.96     | 74.21            | 93.01       | 48.65        | **62.37**                |
| [pkshatech/RoSEtta-base-ja](https://huggingface.co/pkshatech/RoSEtta-base-ja)     |1024 |72.04     | 73.21       | 81.39     | 72.41            | 92.69       | 53.23        | 61.74                |
| [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large)          | 512|70.90     | 70.98       | 79.70     | 72.89            | 92.96       | 51.24        | 62.15                |
|||
|[**sarashina-embedding-v1-1b**](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b)(This model)|**8192**|**75.50**|**77.61**|82.71|**78.37**|**93.74**|**53.86**|62.00|

## License

This model is licensed under  [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/LICENSE).

**If you are interested in using this model for commercial purposes, please feel free to contact us through our [contact page](https://www.sbintuitions.co.jp/#contact).**