update readme
- README.md +19 -16
- README_JA.md +19 -16
README.md
CHANGED
@@ -20,6 +20,9 @@ datasets:
 - sentence-transformers/NQ-retrieval
 - sbintuitions/JSQuAD
 - SkelterLabsInc/JaQuAD
+- wikimedia/wikipedia
+- cl-nagoya/nu-mnli
+- castorini/mr-tydi
 ---
 
 # Sarashina-Embedding-v1-1B
@@ -102,17 +105,17 @@ To achieve generic text embedding performance across a wide range of domains, we
 
 |dataset|counts|
 |:-:|:-:|
-|AutoWikiQA|50,521,135|
-|web-crawled data|47,370,649|
-|MQA|12,941,472|
-|llm-japanese-dataset|9,074,340|
-|wikipedia|5,555,212|
-|Quiz dataset|988,478|
-|Natural Questions|132,796|
-|JSQuAD|62,859|
-|snow|62,758|
-|JaQuAD|31,746|
-|mkqa|3,318|
+|[AutoWikiQA](https://huggingface.co/datasets/cl-nagoya/auto-wiki-qa)|50,521,135|
+|web-crawled data (ours)|47,370,649|
+|[MQA](https://huggingface.co/datasets/hpprc/mqa-ja)|12,941,472|
+|[llm-japanese-dataset](https://huggingface.co/datasets/izumi-lab/llm-japanese-dataset)|9,074,340|
+|[wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia)|5,555,212|
+|Quiz dataset (ours)|988,478|
+|[Natural Questions](https://huggingface.co/datasets/sentence-transformers/NQ-retrieval)|132,796|
+|[JSQuAD](https://huggingface.co/datasets/sbintuitions/JSQuAD)|62,859|
+|[snow](https://huggingface.co/datasets/cl-nagoya/ruri-dataset-ft)|62,758|
+|[JaQuAD](https://huggingface.co/datasets/SkelterLabsInc/JaQuAD)|31,746|
+|[mkqa](https://huggingface.co/datasets/cl-nagoya/ruri-dataset-ft)|3,318|
 |||
 |**total**|**126,744,763**|
 
@@ -124,10 +127,10 @@ To enable the model to learn a more accurate query-document similarity, we perfo
 
 |dataset|counts|
 |:-:|:-:|
-|JSNLI|141,388 |
-|NU-MNLI|67,987|
-|Mr. TyDi (only Japanese subset)| 3,697 |
-|Natural Question (sampled)| 20,000|
+|[JSNLI](https://nlp.ist.i.kyoto-u.ac.jp/?%E6%97%A5%E6%9C%AC%E8%AA%9ESNLI%28JSNLI%29%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88)|141,388 |
+|[NU-MNLI](https://huggingface.co/datasets/cl-nagoya/nu-mnli)|67,987|
+|[Mr. TyDi](https://huggingface.co/datasets/castorini/mr-tydi/tree/main) (only Japanese subset)| 3,697 |
+|[Natural Question](https://huggingface.co/datasets/sentence-transformers/NQ-retrieval) (sampled)| 20,000|
 |||
 |**total**|**233,072**|
 
@@ -135,7 +138,7 @@ To enable the model to learn a more accurate query-document similarity, we perfo
 
 Model |Max Tokens|Avg. | Retrieval | STS | Classification | Reranking | Clustering | PairClassification |
 |:----------------------------------------------|:----------|:----------|:------------|:----------|:-----------------|:------------|:-------------|:---------------------|
-| OpenAI/text-embedding-3-large
+| [OpenAI/text-embedding-3-large](https://openai.com/index/new-embedding-models-and-api-updates/)^1 | 8191 |74.05 | 74.48 | 82.52 | 77.58 | 93.58 | 53.32 | 62.35 |
 | [cl-nagoya/ruri-large](https://huggingface.co/intfloat/multilingual-e5-large) | 512 |73.31 | 73.02 | **83.13** | 77.43 | 92.99 | 51.82 | 62.29 |
 | [pkshatech/GLuCoSE-base-ja-v2](https://huggingface.co/pkshatech/GLuCoSE-base-ja-v2) | 512 |72.23 | 73.36 | 82.96 | 74.21 | 93.01 | 48.65 | **62.37** |
 | [pkshatech/RoSEtta-base-ja](https://huggingface.co/pkshatech/RoSEtta-base-ja) |1024 |72.04 | 73.21 | 81.39 | 72.41 | 92.69 | 53.23 | 61.74 |
README_JA.md
CHANGED
@@ -18,6 +18,9 @@ datasets:
 - sentence-transformers/NQ-retrieval
 - sbintuitions/JSQuAD
 - SkelterLabsInc/JaQuAD
+- wikimedia/wikipedia
+- cl-nagoya/nu-mnli
+- castorini/mr-tydi
 ---
 
 # Sarashina-Embedding-v1-1B
@@ -99,17 +102,17 @@ print(similarities.shape)
 
 |dataset|counts|
 |:-:|:-:|
-|AutoWikiQA|50,521,135|
-|web-crawled data|47,370,649|
-|MQA|12,941,472|
-|llm-japanese-dataset|9,074,340|
-|wikipedia|5,555,212|
-|Quiz dataset|988,478|
-|Natural Questions|132,796|
-|JSQuAD|62,859|
-|snow|62,758|
-|JaQuAD|31,746|
-|mkqa|3,318|
+|[AutoWikiQA](https://huggingface.co/datasets/cl-nagoya/auto-wiki-qa)|50,521,135|
+|web-crawled data (ours)|47,370,649|
+|[MQA](https://huggingface.co/datasets/hpprc/mqa-ja)|12,941,472|
+|[llm-japanese-dataset](https://huggingface.co/datasets/izumi-lab/llm-japanese-dataset)|9,074,340|
+|[wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia)|5,555,212|
+|Quiz dataset (ours)|988,478|
+|[Natural Questions](https://huggingface.co/datasets/sentence-transformers/NQ-retrieval)|132,796|
+|[JSQuAD](https://huggingface.co/datasets/sbintuitions/JSQuAD)|62,859|
+|[snow](https://huggingface.co/datasets/cl-nagoya/ruri-dataset-ft)|62,758|
+|[JaQuAD](https://huggingface.co/datasets/SkelterLabsInc/JaQuAD)|31,746|
+|[mkqa](https://huggingface.co/datasets/cl-nagoya/ruri-dataset-ft)|3,318|
 |||
 |**total**|**126,744,763**|
 
@@ -121,10 +124,10 @@ print(similarities.shape)
 
 |dataset|counts|
 |:-:|:-:|
-|JSNLI|141,388 |
-|NU-MNLI|67,987|
-|Mr. TyDi (only Japanese subset)| 3,697 |
-|Natural Question (sampled)| 20,000|
+|[JSNLI](https://nlp.ist.i.kyoto-u.ac.jp/?%E6%97%A5%E6%9C%AC%E8%AA%9ESNLI%28JSNLI%29%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88)|141,388 |
+|[NU-MNLI](https://huggingface.co/datasets/cl-nagoya/nu-mnli)|67,987|
+|[Mr. TyDi](https://huggingface.co/datasets/castorini/mr-tydi/tree/main) (only Japanese subset)| 3,697 |
+|[Natural Question](https://huggingface.co/datasets/sentence-transformers/NQ-retrieval) (sampled)| 20,000|
 |||
 |**total**|**233,072**|
 
@@ -132,7 +135,7 @@ print(similarities.shape)
 
 Model |Max Tokens|Avg. | Retrieval | STS | Classification | Reranking | Clustering | PairClassification |
 |:----------------------------------------------|:----------|:----------|:------------|:----------|:-----------------|:------------|:-------------|:---------------------|
-| OpenAI/text-embedding-3-large
+| [OpenAI/text-embedding-3-large](https://openai.com/index/new-embedding-models-and-api-updates/)^1 | 8191 |74.05 | 74.48 | 82.52 | 77.58 | 93.58 | 53.32 | 62.35 |
 | [cl-nagoya/ruri-large](https://huggingface.co/intfloat/multilingual-e5-large) | 512 |73.31 | 73.02 | **83.13** | 77.43 | 92.99 | 51.82 | 62.29 |
 | [pkshatech/GLuCoSE-base-ja-v2](https://huggingface.co/pkshatech/GLuCoSE-base-ja-v2) | 512 |72.23 | 73.36 | 82.96 | 74.21 | 93.01 | 48.65 | **62.37** |
 | [pkshatech/RoSEtta-base-ja](https://huggingface.co/pkshatech/RoSEtta-base-ja) |1024 |72.04 | 73.21 | 81.39 | 72.41 | 92.69 | 53.23 | 61.74 |