---
language:
- ja
- en
license_name: sarahina-non-commercial-license
license_link: LICENSE
tags:
- transformers
- sentence-similarity
- feature-extraction
- sentence-transformers
pipeline_tag: sentence-similarity
inference: false
datasets:
- hpprc/emb
- cl-nagoya/auto-wiki-qa
- cl-nagoya/ruri-dataset-ft
- hpprc/mqa-ja
- izumi-lab/llm-japanese-dataset
- sentence-transformers/NQ-retrieval
- sbintuitions/JSQuAD
- SkelterLabsInc/JaQuAD
---

# Sarashina-Embedding-v1-1B

**[日本語のREADME/Japanese README](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/README_JA.md)**

"Sarashina-Embedding-v1-1B" is a Japanese text embedding model based on the 1.2B-parameter Japanese LLM "[Sarashina2.1-1B](https://huggingface.co/sbintuitions/sarashina2.1-1b)".
We trained this model with multi-stage contrastive learning. It achieves the state-of-the-art average score across the 16 datasets in [JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB) (Japanese Massive Text Embedding Benchmark).

This model maps sentences and paragraphs to a 1792-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

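
As a rough illustration of how vectors in this embedding space can drive semantic search, the minimal sketch below ranks documents by cosine similarity to a query. It uses only the Python standard library and random placeholder vectors in place of real model output. The names `cosine_similarity` and `rank_documents` and the toy data are hypothetical and chosen for illustration; they are not part of this model's API.

```python
import math
import random

EMBEDDING_DIM = 1792  # output dimensionality stated in this README


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query, documents, top_k=3):
    """Return the top_k (index, score) pairs, best match first."""
    scored = [(i, cosine_similarity(query, doc)) for i, doc in enumerate(documents)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Placeholder vectors standing in for real embeddings (assumption for this sketch).
rng = random.Random(0)
documents = [[rng.gauss(0.0, 1.0) for _ in range(EMBEDDING_DIM)] for _ in range(5)]

# Build a query that is a slightly perturbed copy of document 2,
# so document 2 is expected to rank first.
query = [x + 0.1 * rng.gauss(0.0, 1.0) for x in documents[2]]

ranking = rank_documents(query, documents, top_k=3)
for index, score in ranking:
    print(f"document {index}: cosine similarity {score:.3f}")
```

With real data, the placeholder vectors would be replaced by the model's embeddings of the query and the candidate documents.
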
## Model Details

### Model Description

- **Model Type:** Sentence Transformer
- **Base model:** [Sarashina2.1-1B](https://huggingface.co/sbintuitions/sarashina2.1-1b)
- **Maximum Sequence Length:** 8,192 tokens
- **Output Dimensionality:** 1,792 dimensions
- **Similarity Function:** Cosine Similarity
- **Language:** Japanese
- **License:** [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/LICENSE)

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: LlamaModel
  (1): Pooling({'word_embedding_dimension': 1792, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': True, 'include_prompt': False})
)
```

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.

```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("sbintuitions/sarashina-embedding-v1-1b")

# Run inference
sentences = [
    # "The Sarashina Nikki is a memoir written by the daughter of Sugawara no Takasue in the mid-Heian period."
    '更級日記は、平安時代中期に菅原孝標女によって書かれた回想録です。',
    # "Sarashina is a Japanese large language model developed by SB Intuitions. So far, 7B, 13B, 70B, and 8x70B models have been released."
    'Sarashinaは、SB Intuitionsが開発した日本語大規模言語モデルです。これまでに7B, 13B, 70B, 8x70Bのモデルが公開されています。',
    # "What is Sarashina soba?"
    '更科蕎麦とはなんですか?'
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 1792)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
```

**Note**

- You do not need to add prefixes such as "Query: " and "Document: " at the beginning of the input sentence.
- This model is licensed under the [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/LICENSE), which has restrictions on commercial use.
  If you are interested in utilizing this model for your business, please feel free to contact us through our [contact page](https://www.sbintuitions.co.jp/#contact).

## Training

"Sarashina-Embedding-v1-1B" was created through the following two-stage training process:

### Stage 1: Weakly-supervised Learning

To achieve generic text embedding performance across a wide range of domains, we performed contrastive training on weakly-supervised data consisting of our own web-crawled data and open data.

#### Dataset

|dataset|counts|
|:-:|:-:|
|AutoWikiQA|50,521,135|
|web-crawled data|47,370,649|
|MQA|12,941,472|
|llm-japanese-dataset|9,074,340|
|wikipedia|5,555,212|
|Quiz dataset|988,478|
|Natural Questions|132,796|
|JSQuAD|62,859|
|snow|62,758|
|JaQuAD|31,746|
|mkqa|3,318|
|||
|**total**|**126,744,763**|

### Stage 2: Supervised Fine-tuning

To enable the model to learn a more accurate query-document similarity, we performed supervised fine-tuning using the following datasets.

#### Dataset

|dataset|counts|
|:-:|:-:|
|JSNLI|141,388|
|NU-MNLI|67,987|
|Mr. TyDi (only Japanese subset)|3,697|
|Natural Questions (sampled)|20,000|
|||
|**total**|**233,072**|

## Evaluation Results with [JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB)

| Model | Max Tokens | Avg. | Retrieval | STS | Classification | Reranking | Clustering | PairClassification |
|:------|:-----------|:-----|:----------|:----|:---------------|:----------|:-----------|:-------------------|
| OpenAI/text-embedding-3-large | 8191 | 74.05 | 74.48 | 82.52 | 77.58 | 93.58 | 53.32 | 62.35 |
| [cl-nagoya/ruri-large](https://huggingface.co/cl-nagoya/ruri-large) | 512 | 73.31 | 73.02 | **83.13** | 77.43 | 92.99 | 51.82 | 62.29 |
| [pkshatech/GLuCoSE-base-ja-v2](https://huggingface.co/pkshatech/GLuCoSE-base-ja-v2) | 512 | 72.23 | 73.36 | 82.96 | 74.21 | 93.01 | 48.65 | **62.37** |
| [pkshatech/RoSEtta-base-ja](https://huggingface.co/pkshatech/RoSEtta-base-ja) | 1024 | 72.04 | 73.21 | 81.39 | 72.41 | 92.69 | 53.23 | 61.74 |
| [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 512 | 70.90 | 70.98 | 79.70 | 72.89 | 92.96 | 51.24 | 62.15 |
|||
| [**sarashina-embedding-v1-1b**](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b) (This model) | **8192** | **75.50** | **77.61** | 82.71 | **78.37** | **93.74** | **53.86** | 62.00 |

## License

This model is licensed under the [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/LICENSE).

**If you are interested in using this model for commercial purposes, please feel free to contact us through our [contact page](https://www.sbintuitions.co.jp/#contact).**