fix typo

- README.md: +7 -8
- README_JA.md: +7 -9
README.md
CHANGED
@@ -23,12 +23,12 @@ datasets:
 
 ---
 
-#
+# Sarashina-embedding-v1-1b
 
 **[日本語のREADME/Japanese README](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/README_JA.md)**
 
 
-"
+"Sarashina-embedding-v1-1b" is a Japanese text embedding model based on the 1.2B-parameter Japanese LLM "Sarashina".
 We trained this model with multi-stage contrastive learning and achieved the state-of-the-art average score across the 16 datasets in [JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB) (Japanese Massive Text Embedding Benchmark).
 
 This model maps sentences & paragraphs to a 1792-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
@@ -38,8 +38,8 @@ This model maps sentences & paragraphs to a 1792-dimensional dense vector space
 ### Model Description
 - **Model Type:** Sentence Transformer
 <!-- - **Base model:** [Unknown](https://huggingface.co/unknown) -->
-- **Maximum Sequence Length:**
-- **Output Dimensionality:**
+- **Maximum Sequence Length:** 8,192 tokens
+- **Output Dimensionality:** 1,792 dimensions
 - **Similarity Function:** Cosine Similarity
 - **Language:** Japanese
 - **License:** [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/LICENSE)
@@ -95,7 +95,7 @@ print(similarities.shape)
 
 ## Training
 
-
+"Sarashina-embedding-v1-1b" is created through the following two-stage learning process:
 
 ### Stage 1: Weakly-supervised Learning
 
@@ -123,11 +123,10 @@ To achieve generic text embedding performance across a wide range of domains, we
 
 ### Stage 2: Supervised Fine-tuning
 
-To enable the model to learn a more accurate query-document similarity,
+To enable the model to learn a more accurate query-document similarity, we performed supervised fine-tuning using the following dataset.
 
 
-#
-### [JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB)
+## Evaluation Results with [JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB)
 
 Model |Max Tokens|Avg. | Retrieval | STS | Classification | Reranking | Clustering | PairClassification |
 |:----------------------------------------------|:----------|:----------|:------------|:----------|:-----------------|:------------|:-------------|:---------------------|
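The Usage section of the model card is not shown in this diff; only its trailing `print(similarities.shape)` appears as hunk context. As a hedged sketch of what the description above implies, assuming the standard `sentence-transformers` API (the sample sentences are our own, not from the model card):

```python
from sentence_transformers import SentenceTransformer

# Load the model; assumes sentence-transformers >= 3.0, where
# model.similarity() defaults to cosine similarity (the card's stated metric).
model = SentenceTransformer("sbintuitions/sarashina-embedding-v1-1b")

# Sample inputs (our own examples).
sentences = [
    "更級日記は平安時代中期に書かれた回想録です。",
    "Sarashinaは日本語の大規模言語モデルです。",
]

# Each text is mapped to a 1792-dimensional dense vector.
embeddings = model.encode(sentences)
print(embeddings.shape)  # (2, 1792)

# Pairwise cosine similarities between the embeddings.
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)  # (2, 2)
```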
README_JA.md
CHANGED
@@ -20,9 +20,9 @@ datasets:
 - SkelterLabsInc/JaQuAD
 ---
 
-#
+# Sarashina-embedding-v1-1b
 
-「
+"Sarashina-embedding-v1-1b" is a Japanese text embedding model based on the 1.2B-parameter Japanese LLM "Sarashina".
 
 We trained this model with multi-stage contrastive learning and achieved the state-of-the-art average score (as of 2024/12/1) across the 16 datasets in [JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB) (Japanese Massive Text Embedding Benchmark).
 
@@ -33,8 +33,8 @@ datasets:
 ### Model Description
 
 - **Model Type:** Sentence Transformer
-- **Maximum Sequence Length:**
-- **Output Dimensionality:**
+- **Maximum Sequence Length:** 8,192 tokens
+- **Output Dimensionality:** 1,792 dimensions
 - **Similarity Function:** Cosine Similarity
 - **Language:** Japanese
 - **License:** [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/LICENSE)
@@ -89,11 +89,11 @@ print(similarities.shape)
 
 ## Training
 
-
+"Sarashina-embedding-v1-1b" is trained through the following two learning stages.
 
 ### Stage 1: Weakly-supervised Learning
 
-To achieve generic text embedding performance across a wide range of domains, we performed contrastive learning on weakly-supervised data consisting of our own web
+To achieve generic text embedding performance across a wide range of domains, we performed contrastive learning on weakly-supervised data consisting of our own web-crawled data and open data.
 
 #### Dataset
 
@@ -128,9 +128,7 @@ sarashina-embedding-v1-1b is trained through the following two learning stages
 |||
 |**total**|**233,072**|
 
-##
-
-### [JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB)
+## Evaluation Results with [JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB)
 
 Model |Max Tokens|Avg. | Retrieval | STS | Classification | Reranking | Clustering | PairClassification |
 |:----------------------------------------------|:----------|:----------|:------------|:----------|:-----------------|:------------|:-------------|:---------------------|
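Neither README shows training code for the two stages described above. For illustration only, a contrastive step with in-batch negatives, the generic technique behind weakly-supervised text-embedding training, might look like the sketch below using sentence-transformers' MultipleNegativesRankingLoss; the pairs, loss choice, and hyperparameters are placeholders, not the authors' actual recipe:

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("sbintuitions/sarashina-embedding-v1-1b")

# Placeholder (query, positive document) pairs; with this loss, every other
# document in the batch acts as an in-batch negative for each query.
train_examples = [
    InputExample(texts=["日本で一番高い山は？", "富士山は日本最高峰の山である。"]),
    InputExample(texts=["JMTEBとは何ですか？", "JMTEBは日本語テキスト埋め込みのベンチマークである。"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)

# Single illustrative pass; a real run would use large-scale data and tuning.
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=0)
```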