---
language:
- ja
- en
license_name: sarashina-non-commercial-license
license_link: LICENSE
tags:
- transformers
- sentence-similarity
- feature-extraction
- sentence-transformers
pipeline_tag: sentence-similarity
inference: false
datasets:
- hpprc/emb
- cl-nagoya/auto-wiki-qa
- cl-nagoya/ruri-dataset-ft
- hpprc/mqa-ja
- izumi-lab/llm-japanese-dataset
- sentence-transformers/NQ-retrieval
- sbintuitions/JSQuAD
- SkelterLabsInc/JaQuAD
- wikimedia/wikipedia
- cl-nagoya/nu-mnli
- castorini/mr-tydi
---
# Sarashina-Embedding-v1-1B
**[日本語のREADME/Japanese README](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/README_JA.md)**
"Sarashina-Embedding-v1-1B" is a Japanese text embedding model, based on the 1.2B-parameter Japanese LLM "[Sarashina2.1-1B](https://huggingface.co/sbintuitions/sarashina2.1-1b)".
We trained this model with multi-stage contrastive learning, and it achieved a state-of-the-art average score across the 16 datasets in [JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB) (Japanese Massive Text Embedding Benchmark).
This model maps sentences and paragraphs to a 1,792-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and other applications.
## Model Details
### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [Sarashina2.1-1B](https://huggingface.co/sbintuitions/sarashina2.1-1b)
- **Maximum Sequence Length:** 8,192 tokens
- **Output Dimensionality:** 1,792 dimensions
- **Similarity Function:** Cosine Similarity
- **Language:** Japanese
- **License:** [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/LICENSE)
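The similarity function listed above can be illustrated with a minimal sketch in plain Python. The vectors `u`, `v`, and `w` are toy 3-dimensional values chosen for illustration (real embeddings from this model have 1,792 dimensions), and `cosine_similarity` is a hypothetical helper, not part of any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors (assumption: illustrative values, not model outputs).
u = [1.0, 2.0, 2.0]
v = [2.0, 4.0, 4.0]   # same direction as u, twice the length
w = [2.0, -1.0, 0.0]  # orthogonal to u

print(cosine_similarity(u, v))  # same direction: similarity of 1
print(cosine_similarity(u, w))  # orthogonal: similarity of 0
```

Because cosine similarity depends only on direction, scaling an embedding does not change its score against other embeddings.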
### Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: LlamaModel
  (1): Pooling({'word_embedding_dimension': 1792, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': True, 'include_prompt': False})
)
```
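The pooling configuration above enables only `pooling_mode_lasttoken`, meaning the sentence embedding is taken from the hidden state of the final real token. In a causal model such as `LlamaModel`, that is the only position that has attended to the whole input. The following is a minimal plain-Python sketch of the idea, not the library's implementation; `last_token_pool` and the toy 2-dimensional hidden states are assumptions made for illustration.

```python
def last_token_pool(token_embeddings, attention_mask):
    """Return, for each sequence, the vector of its last non-padding token.

    token_embeddings: list of sequences, each a list of per-token vectors.
    attention_mask:   list of sequences of 1 (real token) or 0 (padding).
    """
    pooled = []
    for vectors, mask in zip(token_embeddings, attention_mask):
        real_positions = [i for i, m in enumerate(mask) if m == 1]
        if not real_positions:
            raise ValueError("sequence contains no real tokens")
        # The last position whose mask is 1 is the final real token.
        pooled.append(vectors[real_positions[-1]])
    return pooled


# Toy batch (assumption: hidden size 2 instead of 1,792, two sequences of
# length 3, the second one padded at the end).
batch = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[1.0, 1.1], [1.2, 1.3], [0.0, 0.0]],
]
mask = [
    [1, 1, 1],
    [1, 1, 0],
]

print(last_token_pool(batch, mask))
```

The padded position in the second sequence is skipped, so its embedding comes from the second token rather than the padding slot.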
## Usage
First install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("sbintuitions/sarashina-embedding-v1-1b")

# Run inference
sentences = [
    '更級日記は、平安時代中期に菅原孝標女によって書かれた回想録です。',
    'Sarashinaは、SB Intuitionsが開発した日本語大規模言語モデルです。これまでに7B, 13B, 70B, 8x70Bのモデルが公開されています。',
    'サラシナエンベディングは日本語言語モデルをベースにした日本語埋め込みモデルです。'
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 1792)

# Get the similarity scores between the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
```
**Note**
- You do not need to add prefixes such as "Query: " and "Document: " to the beginning of the input sentence.
- This model is licensed under the [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/LICENSE), which has restrictions on commercial use. If you are interested in utilizing this model for your business, please feel free to contact us through our [contact page](https://www.sbintuitions.co.jp/#contact).
## Training
"Sarashina-Embedding-v1-1B" was created through the following two-stage training process:
### Stage 1: Weakly-supervised Learning
To achieve generic text embedding performance across a wide range of domains, we performed contrastive training on weakly-supervised data consisting of our own web-crawled data and open data.
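This README does not state which contrastive objective was used. As a hedged illustration only, the sketch below shows one common formulation, an InfoNCE-style loss with in-batch negatives, in plain Python. The function name `info_nce_loss`, the `temperature` value, and the toy vectors are all assumptions, not details of the actual training setup.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce_loss(queries, documents, temperature=0.05):
    """Mean cross-entropy where documents[i] is the positive for queries[i].

    Every other document in the batch acts as a negative for queries[i].
    The temperature is an illustrative value, not taken from this model.
    """
    q = [normalize(v) for v in queries]
    d = [normalize(v) for v in documents]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, dj) / temperature for dj in d]
        # Numerically stable log-sum-exp over all documents in the batch.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_sum_exp - logits[i]
    return total / len(q)


# Toy 2-dimensional "embeddings" (assumption: illustrative values only).
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_docs = [[1.0, 0.0], [0.0, 1.0]]   # each positive matches its query
shuffled_docs = [[0.0, 1.0], [1.0, 0.0]]  # positives swapped

print(info_nce_loss(queries, aligned_docs))
print(info_nce_loss(queries, shuffled_docs))
```

The loss is small when each query is closest to its own paired text and large when the pairs are mismatched, which is the signal that pulls matching pairs together in the embedding space.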
#### Datasets
|dataset|counts|
|:-:|:-:|
|[AutoWikiQA](https://huggingface.co/datasets/cl-nagoya/auto-wiki-qa)|50,521,135|
|web-crawled data (ours)|47,370,649|
|[MQA](https://huggingface.co/datasets/hpprc/mqa-ja)|12,941,472|
|[llm-japanese-dataset](https://huggingface.co/datasets/izumi-lab/llm-japanese-dataset)|9,074,340|
|[Wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia)|5,555,212|
|Quiz dataset (ours)|988,478|
|[Natural Questions](https://huggingface.co/datasets/sentence-transformers/NQ-retrieval)|132,796|
|[JSQuAD](https://huggingface.co/datasets/sbintuitions/JSQuAD)|62,859|
|[SNOW(T15+T23)](https://aclanthology.org/L18-1185)|62,758|
|[JaQuAD](https://huggingface.co/datasets/SkelterLabsInc/JaQuAD)|31,746|
|[MKQA](https://aclanthology.org/2021.tacl-1.82)|3,318|
|||
|**total**|**126,744,763**|
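As a quick sanity check on the table above, the following sketch sums the per-dataset counts and prints each dataset's share of the Stage 1 mix. The counts are copied from the table; the dictionary name `stage1_counts` and the shortened dataset labels are illustrative.

```python
# Pair counts copied from the Stage 1 table.
stage1_counts = {
    "AutoWikiQA": 50_521_135,
    "web-crawled data (ours)": 47_370_649,
    "MQA": 12_941_472,
    "llm-japanese-dataset": 9_074_340,
    "Wikipedia": 5_555_212,
    "Quiz dataset (ours)": 988_478,
    "Natural Questions": 132_796,
    "JSQuAD": 62_859,
    "SNOW(T15+T23)": 62_758,
    "JaQuAD": 31_746,
    "MKQA": 3_318,
}

total = sum(stage1_counts.values())

# Print datasets from largest to smallest with their share of the total.
for name, count in sorted(stage1_counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<26}{count:>12,}{count / total:>9.2%}")

print(f"{'total':<26}{total:>12,}")
```

The computed total agrees with the 126,744,763 reported in the table.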
### Stage 2: Supervised Fine-tuning
To enable the model to learn a more accurate query-document similarity, we performed supervised fine-tuning using the following datasets.
#### Datasets
|dataset|counts|
|:-:|:-:|
|[JSNLI](https://nlp.ist.i.kyoto-u.ac.jp/?%E6%97%A5%E6%9C%AC%E8%AA%9ESNLI%28JSNLI%29%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88)|141,388|
|[NU-MNLI](https://huggingface.co/datasets/cl-nagoya/nu-mnli)|67,987|
|[Mr. TyDi](https://huggingface.co/datasets/castorini/mr-tydi) (only Japanese subset)|3,697|
|[Natural Questions](https://huggingface.co/datasets/sentence-transformers/NQ-retrieval) (sampled)|20,000|
|||
|**total**|**233,072**|
## Evaluation Results with [JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB)
| Model | Max Tokens | Avg. | Retrieval | STS | Classification | Reranking | Clustering | PairClassification |
|:----------------------------------------------|:----------|:----------|:------------|:----------|:-----------------|:------------|:-------------|:---------------------|
| [OpenAI/text-embedding-3-large](https://openai.com/index/new-embedding-models-and-api-updates/)[^oai] | 8191 |74.05 | 74.48 | 82.52 | 77.58 | 93.58 | 53.32 | 62.35 |
| [cl-nagoya/ruri-large](https://arxiv.org/abs/2409.07737) | 512 |73.31 | 73.02 | **83.13** | 77.43 | 92.99 | 51.82 | 62.29 |
| [pkshatech/GLuCoSE-base-ja-v2](https://huggingface.co/pkshatech/GLuCoSE-base-ja-v2) | 512 |72.23 | 73.36 | 82.96 | 74.21 | 93.01 | 48.65 | **62.37** |
| [pkshatech/RoSEtta-base-ja](https://huggingface.co/pkshatech/RoSEtta-base-ja) |1024 |72.04 | 73.21 | 81.39 | 72.41 | 92.69 | 53.23 | 61.74 |
| [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 512|70.90 | 70.98 | 79.70 | 72.89 | 92.96 | 51.24 | 62.15 |
|||
|[**Sarashina-Embedding-v1-1B**](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b)(This model)|**8192**|**75.50**|**77.61**|82.71|**78.37**|**93.74**|**53.86**|62.00|
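The Avg. column in the table above is not the plain mean of the six category columns. It appears to be a mean over the 16 individual datasets, so categories with more datasets weigh more. The sketch below reproduces the figure for this model under an assumption that is not stated in this README: that JMTEB's 16 datasets split into 6 Retrieval, 2 STS, 4 Classification, 1 Reranking, 2 Clustering, and 1 PairClassification. Under that assumption the weighted mean comes to 75.50, matching the table.

```python
# Category scores for Sarashina-Embedding-v1-1B, copied from the table.
scores = {
    "Retrieval": 77.61,
    "STS": 82.71,
    "Classification": 78.37,
    "Reranking": 93.74,
    "Clustering": 53.86,
    "PairClassification": 62.00,
}

# ASSUMPTION (not stated in this README): datasets per JMTEB category.
datasets_per_category = {
    "Retrieval": 6,
    "STS": 2,
    "Classification": 4,
    "Reranking": 1,
    "Clustering": 2,
    "PairClassification": 1,
}

n_datasets = sum(datasets_per_category.values())

# Mean over datasets: each category score counts once per dataset in it.
weighted_avg = sum(scores[c] * n for c, n in datasets_per_category.items()) / n_datasets

# Plain mean over the six categories, shown for comparison.
category_avg = sum(scores.values()) / len(scores)

print(f"datasets: {n_datasets}")
print(f"dataset-weighted average: {weighted_avg:.2f}")
print(f"plain category average:   {category_avg:.2f}")
```

The category scores are themselves rounded to two decimals, so a reproduction of this kind can differ from the reported average in the last digit for other rows.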
## License
This model is licensed under the [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/LICENSE).
**If you are interested in using this model for commercial purposes, please feel free to contact us through our [contact page](https://www.sbintuitions.co.jp/#contact).**
[^oai]: Benchmarked on April 23, 2024.