update readme
Files changed: README.md (+5 −12), README_JA.md (+4 −4)
README.md
CHANGED
datasets:
- sentence-transformers/NQ-retrieval
- sbintuitions/JSQuAD
- SkelterLabsInc/JaQuAD
---
# Sarashina-Embedding-v1-1B
**[日本語のREADME/Japanese README](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/README_JA.md)**
"Sarashina-Embedding-v1-1B" is a Japanese text embedding model based on the 1.2B-parameter Japanese LLM "[Sarashina2.1-1B](https://huggingface.co/sbintuitions/sarashina2.1-1b)".
We trained this model with multi-stage contrastive learning, and it achieves the state-of-the-art average score (as of December 1, 2024) across the 16 datasets of [JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB) (Japanese Massive Text Embedding Benchmark).
This model maps sentences & paragraphs to a 1792-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
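All of the listed tasks ultimately reduce to comparing these dense vectors, by default with cosine similarity. As a plain-Python refresher (the 4-dimensional vectors below are illustrative placeholders for real 1,792-dimensional embeddings):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two dense vectors given as Python lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" standing in for real model outputs.
u = [0.5, -1.0, 2.0, 0.25]
v = [0.5, -1.0, 2.0, 0.25]
opposite = [-x for x in v]

print(cosine_similarity(u, v))         # ≈ 1.0 (identical direction)
print(cosine_similarity(u, opposite))  # ≈ -1.0 (opposite direction)
```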
## Model Details
### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [Sarashina2.1-1B](https://huggingface.co/sbintuitions/sarashina2.1-1b)
- **Maximum Sequence Length:** 8,192 tokens
- **Output Dimensionality:** 1,792 dimensions
- **Similarity Function:** Cosine Similarity
- **Language:** Japanese
- **License:** [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/LICENSE)
### Full Model Architecture
```
…
```

```
pip install -U sentence-transformers
```
Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer

# Load the model from the Hugging Face Hub.
model = SentenceTransformer("sbintuitions/sarashina-embedding-v1-1b")

# Encode sentences into 1,792-dimensional embeddings
# (the sentences below are illustrative placeholders).
sentences = [
    "今日はとても良い天気です。",
    "本日は晴天に恵まれました。",
    "昨日は一日中雨が降っていました。",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 1792)

# Cosine similarity scores between all pairs of sentences.
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
```

To achieve generic text embedding performance across a wide range of domains, we …
|||
|**total**|**126,744,763**|
### Step2: Supervised Fine-tuning
To enable the model to learn a more accurate query-document similarity, we performed supervised fine-tuning using the following dataset.
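The exact loss used in this fine-tuning stage is not spelled out here, but contrastive fine-tuning on (query, document) pairs is commonly driven by an InfoNCE-style objective with in-batch negatives. The following NumPy sketch is illustrative only (the function name, temperature, and toy batch are assumptions, not the authors' recipe):

```python
import numpy as np

def info_nce_loss(queries: np.ndarray, docs: np.ndarray, temperature: float = 0.05) -> float:
    """InfoNCE with in-batch negatives: docs[i] is the positive for queries[i];
    every other document in the batch serves as a negative."""
    logits = (queries @ docs.T) / temperature           # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)         # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))          # cross-entropy, target i -> i

# Toy batch of 4 L2-normalized 8-dimensional embeddings.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
q /= np.linalg.norm(q, axis=1, keepdims=True)

loss_aligned = info_nce_loss(q, q)                       # positives match -> small loss
loss_shuffled = info_nce_loss(q, np.roll(q, 1, axis=0))  # positives misaligned -> larger loss
```

Minimizing this objective pulls each query toward its paired document while pushing it away from the other documents in the batch, which is what teaches the model query-document similarity.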
# Evaluation Results with [JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB)
|Model|Max Tokens|Avg.|Retrieval|STS|Classification|Reranking|Clustering|PairClassification|
|:--|--:|--:|--:|--:|--:|--:|--:|--:|
|||
|[**sarashina-embedding-v1-1b**](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b)(This model)|**8192**|**75.50**|**77.61**|82.71|**78.37**|**93.74**|**53.86**|62.00|
## License
This model is licensed under [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/LICENSE).
**If you are interested in using this model for commercial purposes, please feel free to contact us through our [contact page](https://www.sbintuitions.co.jp/#contact).**
README_JA.md
CHANGED
# Sarashina-Embedding-v1-1B

"Sarashina-embedding-v1-1b" is a Japanese text embedding model based on the 1.2B-parameter Japanese LLM "[Sarashina2.1-1B](https://huggingface.co/sbintuitions/sarashina2.1-1b)".

We trained this model with multi-stage contrastive learning, and it achieved the state-of-the-art average score (as of December 1, 2024) across the 16 datasets of [JMTEB](https://huggingface.co/datasets/sbintuitions/JMTEB) (Japanese Massive Text Embedding Benchmark).

### Model Description

- **Model Type:** Sentence Transformer
- **Base model:** [Sarashina2.1-1B](https://huggingface.co/sbintuitions/sarashina2.1-1b)
- **Maximum Sequence Length:** 8,192 tokens
- **Output Dimensionality:** 1,792 dimensions
- **Similarity Function:** Cosine Similarity
- **Language:** Japanese
- **License:** [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/LICENSE)

### Full Model Architecture

```
…
```

## License

This model is released under the [Sarashina Model NonCommercial License Agreement](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b/blob/main/LICENSE).

**If you are interested in using this model for commercial purposes, please feel free to contact us through our [contact page](https://www.sbintuitions.co.jp/#contact).**