akiFQCint committed on
Commit
edafeb4
1 Parent(s): a80319c

fix readme

Files changed (2)
  1. README.md +12 -1
  2. README_JA.md +1 -1
README.md CHANGED
@@ -27,7 +27,7 @@ datasets:
 **[日本語のREADME/Japanese README](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/README_JA.md)**

 "Sarashina-Embedding-v1-1B" is a Japanese text embedding model based on the 1.2B-parameter Japanese LLM "[Sarashina2.1-1B](https://huggingface.co/sbintuitions/sarashina2.1-1b)".
- We trained this model with multi-stage contrastive learning. We achieved the state-of-the-art average score in the average of 16 datasets in [JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB)(Japanese Massive Text Embedding Benchmark).
+ We trained this model with multi-stage contrastive learning, and it achieved the state-of-the-art average score across the 16 datasets of [JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB) (Japanese Massive Text Embedding Benchmark).

 This model maps sentences & paragraphs to a 1792-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

@@ -120,6 +120,16 @@ To achieve generic text embedding performance across a wide range of domains, we

 To enable the model to learn a more accurate query-document similarity, we performed supervised fine-tuning using the following dataset.

+ #### Dataset
+
+ | Dataset                         | Counts      |
+ |:-------------------------------:|:-----------:|
+ | JSNLI                           | 141,388     |
+ | NU-MNLI                         | 67,987      |
+ | Mr. TyDi (only Japanese subset) | 3,697       |
+ | Natural Questions (sampled)     | 20,000      |
+ | **Total**                       | **233,072** |
+
 # Evaluation Results with [JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB)

 Model |Max Tokens|Avg. | Retrieval | STS | Classification | Reranking | Clustering | PairClassification |
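The README text above says the model maps sentences and paragraphs to a 1792-dimensional dense vector space for semantic textual similarity and semantic search. As a minimal, self-contained sketch of how such embeddings are typically compared (random stand-in vectors are used here instead of real model outputs, and the ranking step is an illustration, not this model's API):

```python
import numpy as np

# Illustration only: compare 1792-dimensional embedding vectors with
# cosine similarity, as done in semantic search. The vectors below are
# random placeholders for real sentence embeddings.
rng = np.random.default_rng(0)
dim = 1792  # embedding dimensionality stated in the README

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = rng.standard_normal(dim)      # stand-in for an embedded query
docs = rng.standard_normal((3, dim))  # stand-ins for embedded documents

# Rank documents by similarity to the query; the top-scoring index
# would be the best semantic match.
scores = [cosine_similarity(query, d) for d in docs]
best = int(np.argmax(scores))
```

With real embeddings the ranking reflects semantic closeness; with these random vectors it is of course arbitrary.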
README_JA.md CHANGED
@@ -24,7 +24,7 @@ datasets:

 「Sarashina-embedding-v1-1b」は、1.2Bパラメータの日本語LLM「[Sarashina2.1-1B](https://huggingface.co/sbintuitions/sarashina2.1-1b)」をベースにした日本語テキスト埋め込みモデルです。

- このモデルは、マルチステージの対照学習で訓練し、 [JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB)(Japanese Massive Text Embedding Benchmark)の16個のデータセットの平均で、(2024/12/1時点で)最高水準の平均スコアを達成しました。
+ このモデルは、マルチステージの対照学習で訓練し、 [JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB) (Japanese Massive Text Embedding Benchmark)の16個のデータセットの平均で、(2024/12/1時点で)最高水準の平均スコアを達成しました。

 このモデルは、文や段落を1792次元の高密度ベクトル空間にマッピングし、意味的テキスト類似度、意味的検索、paraphrase mining、テキスト分類、クラスタリングなどに使用できます。
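Both READMEs in this commit attribute the model's performance to multi-stage contrastive learning over query-document pairs. A minimal sketch of the kind of contrastive objective commonly used for text embedding models (InfoNCE with in-batch negatives) is below; this illustrates the general technique, not the authors' actual training code, and the temperature value is an assumption:

```python
import numpy as np

# Sketch of InfoNCE with in-batch negatives: each query's positive is the
# document at the same batch index; every other document in the batch
# serves as a negative. Not the authors' implementation.
def info_nce_loss(q: np.ndarray, d: np.ndarray, temperature: float = 0.05) -> float:
    """q, d: (batch, dim) L2-normalized query/document embeddings."""
    logits = (q @ d.T) / temperature              # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))    # cross-entropy on matched pairs

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.standard_normal((4, 8))
d /= np.linalg.norm(d, axis=1, keepdims=True)

loss_random = info_nce_loss(q, d)   # unrelated pairs: high loss
loss_aligned = info_nce_loss(q, q)  # perfectly matched pairs: much lower loss
```

Training drives matched query-document pairs toward high similarity relative to in-batch negatives, which is why the aligned case yields a much lower loss than the random one.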