---
language:
- ja
- en
license_name: sarashina-non-commercial-license
license_link: LICENSE
tags:
- transformers
- sentence-similarity
- feature-extraction
- sentence-transformers
pipeline_tag: sentence-similarity
inference: false
datasets:
- hpprc/emb
- cl-nagoya/auto-wiki-qa
- cl-nagoya/ruri-dataset-ft
- hpprc/mqa-ja
- izumi-lab/llm-japanese-dataset
- sentence-transformers/NQ-retrieval
- sbintuitions/JSQuAD
- SkelterLabsInc/JaQuAD
- wikimedia/wikipedia
- cl-nagoya/nu-mnli
- castorini/mr-tydi
---

# Sarashina-Embedding-v1-1B

**[日本語のREADME/Japanese README](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/README_JA.md)**

"Sarashina-Embedding-v1-1B" is a Japanese text embedding model based on the 1.2B-parameter Japanese LLM [Sarashina2.1-1B](https://huggingface.co/sbintuitions/sarashina2.1-1b).
We trained this model with multi-stage contrastive learning, and it achieves the state-of-the-art average score across the 16 datasets of [JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB) (the Japanese Massive Text Embedding Benchmark).

This model maps sentences and paragraphs to a 1792-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and other applications.

## Model Details

### Model Description

- **Model Type:** Sentence Transformer
- **Base model:** [Sarashina2.1-1B](https://huggingface.co/sbintuitions/sarashina2.1-1b)
- **Maximum Sequence Length:** 8,192 tokens
- **Output Dimensionality:** 1,792 dimensions
- **Similarity Function:** Cosine Similarity
- **Language:** Japanese
- **License:** [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/LICENSE)

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: LlamaModel
  (1): Pooling({'word_embedding_dimension': 1792, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': True, 'include_prompt': False})
)
```
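
As the printout above shows, pooling uses the last token (`'pooling_mode_lasttoken': True`) rather than mean or [CLS] pooling, which is the natural choice for a decoder-only Llama backbone. For intuition only, here is a minimal sketch of what last-token pooling computes, assuming right-padded inputs; this is an illustration, not the Sentence Transformers internals (the `Pooling` module handles this for you):

```python
import torch

def last_token_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Illustrative last-token pooling for right-padded batches.

    hidden_states:  (batch, seq_len, hidden_dim) final-layer states
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
    Returns the hidden state of each sequence's last real token.
    """
    last_idx = attention_mask.sum(dim=1) - 1    # position of the last real token
    batch_idx = torch.arange(hidden_states.size(0), device=hidden_states.device)
    return hidden_states[batch_idx, last_idx]   # (batch, hidden_dim)
```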

## Usage

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference:

```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("sbintuitions/sarashina-embedding-v1-1b")

# Run inference
sentences = [
    '更級日記は、平安時代中期に菅原孝標女によって書かれた回想録です。',  # "The Sarashina Nikki is a memoir written in the mid-Heian period by the daughter of Sugawara no Takasue."
    'Sarashinaは、SB Intuitionsが開発した日本語大規模言語モデルです。これまでに7B, 13B, 70B, 8x70Bのモデルが公開されています。',  # "Sarashina is a Japanese LLM developed by SB Intuitions. 7B, 13B, 70B, and 8x70B models have been released so far."
    'サラシナエンベディングは日本語言語モデルをベースにした日本語埋め込みモデルです。',  # "Sarashina-Embedding is a Japanese embedding model based on a Japanese language model."
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 1792)

# Get the similarity scores between the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
```
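
For retrieval-style use, you can embed a query and a set of documents separately and rank the documents by cosine similarity via `model.similarity`. The query and documents below are made-up examples for illustration, and, consistent with the notes that follow, no instruction prefixes are added:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sbintuitions/sarashina-embedding-v1-1b")

# Hypothetical corpus and query (illustrative only)
documents = [
    '更級日記は平安時代中期に書かれた回想録です。',  # "The Sarashina Nikki is a memoir written in the mid-Heian period."
    'サラシナエンベディングは日本語のテキスト埋め込みモデルです。',  # "Sarashina-Embedding is a Japanese text embedding model."
]
query = '日本語の埋め込みモデルについて教えて。'  # "Tell me about Japanese embedding models."

doc_embeddings = model.encode(documents)
query_embedding = model.encode(query)

# Cosine similarity between the query and each document; shape: (1, len(documents))
scores = model.similarity(query_embedding, doc_embeddings)
best = int(scores.argmax())
print(documents[best], float(scores[0, best]))
```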

**Notes**

- You do not need to add prefixes such as "Query: " or "Document: " to the beginning of input sentences.
- This model is licensed under the [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/LICENSE), which restricts commercial use. If you are interested in using this model for your business, please feel free to contact us through our [contact page](https://www.sbintuitions.co.jp/#contact).

## Training

"Sarashina-Embedding-v1-1B" was trained through the following two-stage learning process:

### Stage 1: Weakly-supervised Learning

To achieve generic text embedding performance across a wide range of domains, we performed contrastive training on weakly-supervised data consisting of our own web-crawled data and open data.
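
The card does not publish the exact training objective, so the following is only a generic sketch: contrastive training of this kind is commonly driven by an in-batch-negatives (InfoNCE) loss, where each text is pulled toward its paired positive and pushed away from the other pairs in the batch. The function name and temperature value here are illustrative assumptions, not reported settings:

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(anchor_emb: torch.Tensor,
                              positive_emb: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    """Generic InfoNCE with in-batch negatives (illustrative, not the actual recipe).

    anchor_emb, positive_emb: (batch, dim); row i of positive_emb is the
    positive for row i of anchor_emb, and all other rows act as negatives.
    """
    a = F.normalize(anchor_emb, dim=-1)
    p = F.normalize(positive_emb, dim=-1)
    logits = a @ p.T / temperature                      # scaled cosine similarities
    targets = torch.arange(a.size(0), device=a.device)  # diagonal = matching pairs
    return F.cross_entropy(logits, targets)
```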

#### Datasets

| Dataset | Count |
|:-:|:-:|
|[AutoWikiQA](https://huggingface.co/datasets/cl-nagoya/auto-wiki-qa)|50,521,135|
|web-crawled data (ours)|47,370,649|
|[MQA](https://huggingface.co/datasets/hpprc/mqa-ja)|12,941,472|
|[llm-japanese-dataset](https://huggingface.co/datasets/izumi-lab/llm-japanese-dataset)|9,074,340|
|[Wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia)|5,555,212|
|Quiz dataset (ours)|988,478|
|[Natural Questions](https://huggingface.co/datasets/sentence-transformers/NQ-retrieval)|132,796|
|[JSQuAD](https://huggingface.co/datasets/sbintuitions/JSQuAD)|62,859|
|[SNOW (T15+T23)](https://aclanthology.org/L18-1185)|62,758|
|[JaQuAD](https://huggingface.co/datasets/SkelterLabsInc/JaQuAD)|31,746|
|[MKQA](https://aclanthology.org/2021.tacl-1.82)|3,318|
|||
|**Total**|**126,744,763**|

### Stage 2: Supervised Fine-tuning

To enable the model to learn more accurate query-document similarity, we performed supervised fine-tuning using the following datasets.

#### Datasets

| Dataset | Count |
|:-:|:-:|
|[JSNLI](https://nlp.ist.i.kyoto-u.ac.jp/?%E6%97%A5%E6%9C%AC%E8%AA%9ESNLI%28JSNLI%29%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88)|141,388|
|[NU-MNLI](https://huggingface.co/datasets/cl-nagoya/nu-mnli)|67,987|
|[Natural Questions](https://huggingface.co/datasets/sentence-transformers/NQ-retrieval) (sampled)|20,000|
|[Mr. TyDi](https://huggingface.co/datasets/castorini/mr-tydi) (Japanese subset only)|3,697|
|||
|**Total**|**233,072**|

## Evaluation Results with [JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB)

| Model | Max Tokens | Avg. | Retrieval | STS | Classification | Reranking | Clustering | PairClassification |
|:--|:--|:--|:--|:--|:--|:--|:--|:--|
| [OpenAI/text-embedding-3-large](https://openai.com/index/new-embedding-models-and-api-updates/)[^oai] | 8191 | 74.05 | 74.48 | 82.52 | 77.58 | 93.58 | 53.32 | 62.35 |
| [cl-nagoya/ruri-large](https://arxiv.org/abs/2409.07737) | 512 | 73.31 | 73.02 | **83.13** | 77.43 | 92.99 | 51.82 | 62.29 |
| [pkshatech/GLuCoSE-base-ja-v2](https://huggingface.co/pkshatech/GLuCoSE-base-ja-v2) | 512 | 72.23 | 73.36 | 82.96 | 74.21 | 93.01 | 48.65 | **62.37** |
| [pkshatech/RoSEtta-base-ja](https://huggingface.co/pkshatech/RoSEtta-base-ja) | 1024 | 72.04 | 73.21 | 81.39 | 72.41 | 92.69 | 53.23 | 61.74 |
| [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 512 | 70.90 | 70.98 | 79.70 | 72.89 | 92.96 | 51.24 | 62.15 |
| [**Sarashina-Embedding-v1-1B**](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b) (this model) | **8192** | **75.50** | **77.61** | 82.71 | **78.37** | **93.74** | **53.86** | 62.00 |

## License

This model is licensed under the [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/LICENSE).

**If you are interested in using this model for commercial purposes, please feel free to contact us through our [contact page](https://www.sbintuitions.co.jp/#contact).**

[^oai]: Benchmarked on April 23, 2024.