update readme
Files changed: README.md (+5 −12), README_JA.md (+4 −4)
README.md
CHANGED
datasets:
- sentence-transformers/NQ-retrieval
- sbintuitions/JSQuAD
- SkelterLabsInc/JaQuAD
---
# Sarashina-Embedding-v1-1B
**[日本語のREADME/Japanese README](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/README_JA.md)**
"Sarashina-Embedding-v1-1B" is a Japanese text embedding model based on the 1.2B-parameter Japanese LLM "[Sarashina2.1-1B](https://huggingface.co/sbintuitions/sarashina2.1-1b)".
We trained this model with multi-stage contrastive learning, and it achieves the state-of-the-art average score (as of December 1, 2024) across the 16 datasets of [JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB) (Japanese Massive Text Embedding Benchmark).
This model maps sentences & paragraphs to a 1792-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
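All of the listed tasks ultimately reduce to comparing these dense vectors, by default with cosine similarity. As a plain-Python refresher (the 4-dimensional vectors below are illustrative placeholders for real 1,792-dimensional embeddings):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two dense vectors given as Python lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" standing in for real model outputs.
u = [0.5, -1.0, 2.0, 0.25]
v = [0.5, -1.0, 2.0, 0.25]
opposite = [-x for x in v]

print(cosine_similarity(u, v))         # ≈ 1.0 (identical direction)
print(cosine_similarity(u, opposite))  # ≈ -1.0 (opposite direction)
```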
## Model Details
### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [Sarashina2.1-1B](https://huggingface.co/sbintuitions/sarashina2.1-1b)
- **Maximum Sequence Length:** 8,192 tokens
- **Output Dimensionality:** 1,792 dimensions
- **Similarity Function:** Cosine Similarity
- **Language:** Japanese
- **License:** [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/LICENSE)
### Full Model Architecture
```
…
```

```
pip install -U sentence-transformers
```
Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer

# Load the model from the Hugging Face Hub.
model = SentenceTransformer("sbintuitions/sarashina-embedding-v1-1b")

# Encode sentences into 1,792-dimensional embeddings
# (the sentences below are illustrative placeholders).
sentences = [
    "今日はとても良い天気です。",
    "本日は晴天に恵まれました。",
    "昨日は一日中雨が降っていました。",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 1792)

# Cosine similarity scores between all pairs of sentences.
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
```

To achieve generic text embedding performance across a wide range of domains, we …
|||
|**total**|**126,744,763**|
### Step2: Supervised Fine-tuning
To enable the model to learn a more accurate query-document similarity, we performed supervised fine-tuning using the following dataset.
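The exact loss used in this fine-tuning stage is not spelled out here, but contrastive fine-tuning on (query, document) pairs is commonly driven by an InfoNCE-style objective with in-batch negatives. The following NumPy sketch is illustrative only (the function name, temperature, and toy batch are assumptions, not the authors' recipe):

```python
import numpy as np

def info_nce_loss(queries: np.ndarray, docs: np.ndarray, temperature: float = 0.05) -> float:
    """InfoNCE with in-batch negatives: docs[i] is the positive for queries[i];
    every other document in the batch serves as a negative."""
    logits = (queries @ docs.T) / temperature           # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)         # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))          # cross-entropy, target i -> i

# Toy batch of 4 L2-normalized 8-dimensional embeddings.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
q /= np.linalg.norm(q, axis=1, keepdims=True)

loss_aligned = info_nce_loss(q, q)                       # positives match -> small loss
loss_shuffled = info_nce_loss(q, np.roll(q, 1, axis=0))  # positives misaligned -> larger loss
```

Minimizing this objective pulls each query toward its paired document while pushing it away from the other documents in the batch, which is what teaches the model query-document similarity.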
# Evaluation Results with [JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB)
|Model|Max Tokens|Avg.|Retrieval|STS|Classification|Reranking|Clustering|PairClassification|
|:--|--:|--:|--:|--:|--:|--:|--:|--:|
|||
|[**sarashina-embedding-v1-1b**](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b)(This model)|**8192**|**75.50**|**77.61**|82.71|**78.37**|**93.74**|**53.86**|62.00|
## License
This model is licensed under [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/LICENSE).
**If you are interested in using this model for commercial purposes, please feel free to contact us through our [contact page](https://www.sbintuitions.co.jp/#contact).**
README_JA.md
CHANGED
# Sarashina-Embedding-v1-1B

"Sarashina-embedding-v1-1b" is a Japanese text embedding model based on the 1.2B-parameter Japanese LLM "[Sarashina2.1-1B](https://huggingface.co/sbintuitions/sarashina2.1-1b)".

We trained this model with multi-stage contrastive learning, and it achieved the state-of-the-art average score (as of December 1, 2024) across the 16 datasets of [JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB) (Japanese Massive Text Embedding Benchmark).

### Model Description

- **Model Type:** Sentence Transformer
- **Base model:** [Sarashina2.1-1B](https://huggingface.co/sbintuitions/sarashina2.1-1b)
- **Maximum Sequence Length:** 8,192 tokens
- **Output Dimensionality:** 1,792 dimensions
- **Similarity Function:** Cosine Similarity
- **Language:** Japanese
- **License:** [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/LICENSE)

### Full Model Architecture

```
…
```

## License

This model is released under the [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/LICENSE).

**If you are interested in using this model for commercial purposes, please feel free to contact us through our [contact page](https://www.sbintuitions.co.jp/#contact).**